SPO600 SIMD Lab
Resources
Auto-vectorization
- Auto-Vectorization in GCC - Main project page for the GCC auto-vectorizer.
- Auto-vectorization with gcc 4.7 - An excellent discussion of the capabilities and limitations of the GCC auto-vectorizer, intrinsics for providing hints to GCC, and other code pattern changes that can improve results. Note that there has been some improvement in the auto-vectorizer since this article was written. This article is strongly recommended.
- Intel (Auto)Vectorization Tutorial - This deals with the Intel compiler (ICC), but the general technical discussion is valid for other compilers such as GCC and LLVM.
Inline Assembly Language
- Inline Assembly Language
- ARM Developer Information Centre
- The short guide to the ARMv8 instruction set: ARMv8 Instruction Set Overview ("ARM ISA Overview")
- The long guide to the ARMv8 instruction set: ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile ("ARM ARM")
C Intrinsics - AArch64 SIMD
SQDMULH Instruction
Many of the AArch64 "Advanced SIMD" instructions are designed for use with multimedia data. In this lab, we will be using the SQDMULH instruction, which is a "Signed Saturating Doubling Multiply returning High Half". Breaking this down:
- As a vector (SIMD) instruction, this operation works on multiple values in parallel. It can operate on 16- or 32-bit values; since we're dealing with 16-bit signed sound samples, we will use 16-bit values.
- "Saturating" means that if the result overflows (or underflows) the maximum (or minimum) values, the result will be the maximum (or minimum) value. This is useful for graphics, where brightening a pixel that is at 90% brightness by an additional 50% should produce a pixel that is at maximum brightness, even though that's not mathematically correct. Likewise, a sound sample that is increased in volume should not increase past the maximum signal limit.
- We're going to use this instruction to multiply sound samples by a volume scaling factor (V). This instruction doubles the result, so that the V factor will effectively be converted from a 16-bit value to a 17-bit value. We can treat this as a fixed-point number.
- The result of multiplying two 16-bit numbers together is a 32-bit number. In our fixed-point representation, the 32-bit result has sixteen bits to the right of the radix point. Since this instruction takes the "high half" of the result, the lowest 16 bits are discarded, keeping only the integer portion of the result -- which is exactly what we need.
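The arithmetic described above can be sketched in plain C for a single 16-bit lane. This is an illustrative model only, not code from the lab files: the names sqdmulh_lane and volume_to_factor are hypothetical, and the right shift assumes the arithmetic (sign-preserving) behaviour that GCC provides for negative values.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of SQDMULH (hypothetical helper, for
   illustration): returns the high 16 bits of 2*a*b, saturating on overflow. */
static int16_t sqdmulh_lane(int16_t a, int16_t b)
{
    /* The product of two 16-bit values always fits in 32 bits. */
    int32_t product = (int32_t)a * (int32_t)b;

    /* Doubling overflows 32 bits in exactly one case: (-32768) * (-32768). */
    if (product == INT32_C(0x40000000))
        return INT16_MAX;                 /* saturate to the maximum value */

    /* Double, then keep the high half. Assumes an arithmetic right shift
       for negative values, which GCC provides. */
    return (int16_t)((product * 2) >> 16);
}

/* Hypothetical helper: convert a volume in the range [0.0, 1.0) to the
   16-bit fixed-point factor V. */
static int16_t volume_to_factor(double volume)
{
    return (int16_t)(volume * 32768.0);
}
```

With these definitions, a volume of 0.5 becomes V = 16384, and scaling a sample of 1000 gives 500. The only input pair that saturates is -32768 multiplied by -32768.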
Setup
Get the files for this lab on one of the SPO600 Servers and perform the lab on an AArch64 system.
- Unpack the archive /public/spo600-simd-lab.tgz
Part 1: Auto-Vectorization
- The vol1.c file is the same as the one in the SPO600 Algorithm Selection Lab. Modify the Makefile so that this file is compiled with the option -fopt-info-vec-all, which will display information about the decisions that the compiler is making about the vectorization of each loop.
- Compile vol1.c and review the compiler output.
  - The output will contain sections that start with "Analyzing loop at vol1.c:###" where ### is a line number.
  - If the section ends with "note: not vectorized: reason" then the loop is not vectorized, and the reason will explain why.
  - If the section ends with "note: LOOP VECTORIZED", then the loop was vectorized (compiled to use SIMD instructions).
- Examine the output to see which loop(s) are vectorized.
- Modify the code so that one more loop is vectorized.
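As a rough guide to what the vectorizer tends to accept, here is a minimal sketch of a loop written in a vectorization-friendly style. It is not taken from vol1.c: the function name scale_samples and its signature are hypothetical. The loop has a simple counted form, makes no function calls in its body, and uses restrict to promise that the two arrays do not overlap. Compiling a file like this with something like gcc -O3 -fopt-info-vec-all -c shows whether the compiler vectorizes it on your system.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example of a vectorization-friendly loop (not from vol1.c).
   'factor' is a fixed-point volume with 15 fractional bits, so 16384
   represents 0.5. */
void scale_samples(int16_t *restrict out, const int16_t *restrict in,
                   size_t count, int16_t factor)
{
    /* Simple counted loop, no calls, no data-dependent exits: each
       iteration is independent of the others. */
    for (size_t i = 0; i < count; i++) {
        /* Widen to 32 bits for the multiply, then drop the 15 fractional
           bits. Assumes an arithmetic right shift for negative values. */
        out[i] = (int16_t)(((int32_t)in[i] * factor) >> 15);
    }
}
```

Loops that call library functions such as rand() or printf() inside the body are commonly reported as not vectorized, so moving such calls into a separate loop is one change worth trying.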
Part 2: Inline Assembler
- Look at add.c. Make sure that you understand how the inline assembler code works and why.
- Modify the code to calculate b mod a using inline assembler, and print the result. This will help you to understand the inline assembler syntax. (Remember that b mod a is the remainder of b/a.)
- The file vol_inline.c contains a version of the volume scaling problem which uses inline assembler and the SQDMULH instruction. Copy, build, and verify the operation of this program on an AArch64 system.
- Test the performance of this solution and compare it to your previous solution(s). Adjust the number of samples (in vol.h) to produce a measurable runtime, and adjust your code for comparable test conditions (number of samples, 1 array vs. 2 arrays, and so forth).
- Consider the questions marked with a Q: in this file, and incorporate those answers into your blog post.
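One possible way to time only the scaling work, separately from sample generation, is sketched below. Everything here is an assumption for illustration: scale_in_place stands in for whichever implementation is being measured, time_scaling is a hypothetical harness, and clock() reports CPU time rather than wall-clock time. The checksum is returned so that the compiler cannot discard the scaled results.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical stand-in for the routine being measured. */
static void scale_in_place(int16_t *samples, size_t count, int16_t factor)
{
    for (size_t i = 0; i < count; i++)
        samples[i] = (int16_t)(((int32_t)samples[i] * factor) >> 15);
}

/* Returns the CPU seconds spent scaling 'count' samples, or -1.0 if the
   buffer cannot be allocated. The sum of the scaled samples is stored in
   *checksum so the work cannot be optimized away. */
double time_scaling(size_t count, int16_t factor, int64_t *checksum)
{
    int16_t *samples = malloc(count * sizeof *samples);
    if (samples == NULL)
        return -1.0;

    /* Setup happens outside the timed region: deterministic test data
       in the range -1000..1000. */
    for (size_t i = 0; i < count; i++)
        samples[i] = (int16_t)((int)(i % 2001) - 1000);

    clock_t start = clock();
    scale_in_place(samples, count, factor);
    clock_t end = clock();

    /* Use the results after timing so they count as observable output. */
    int64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += samples[i];
    *checksum = sum;

    free(samples);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Using the same harness, sample count, and array layout for every implementation keeps the comparison fair. If the reported time is close to zero, increase the sample count until the runtime is large enough to measure reliably.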
Part 3: C Intrinsics
- The file vol_intrinsics.c contains a version of the volume scaling problem which uses C Intrinsics to access AArch64 SIMD instructions. Copy, build, and verify the operation of this program on an AArch64 system.
- Test the performance of this solution and compare it to your previous solution(s). Adjust the number of samples (in vol.h) to produce a measurable runtime, and adjust your code for comparable test conditions (number of samples, 1 array vs. 2 arrays, and so forth).
- Consider the questions marked with a Q: in this file, and incorporate those answers into your blog post.
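For orientation, the general shape of an intrinsics-based scaling loop might look like the sketch below. This is not the content of vol_intrinsics.c: the function names are hypothetical, and the layout (eight 16-bit samples per 128-bit vector, with a scalar loop for leftovers) is one reasonable choice among several. The intrinsics vdupq_n_s16, vld1q_s16, vqdmulhq_s16, and vst1q_s16 come from arm_neon.h; on other platforms the scalar path handles every sample.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __ARM_ARCH_ISA_A64
#include <arm_neon.h>
#endif

/* Scalar equivalent of one SQDMULH lane; used for leftover samples and on
   platforms without the AArch64 intrinsics. Assumes an arithmetic right
   shift for negative values. */
static int16_t scale_one(int16_t sample, int16_t factor)
{
    int32_t product = (int32_t)sample * factor;
    if (product == INT32_C(0x40000000))   /* only (-32768) * (-32768) */
        return INT16_MAX;
    return (int16_t)((product * 2) >> 16);
}

/* Hypothetical example: scale 'count' samples in place by 'factor'. */
void scale_samples_simd(int16_t *samples, size_t count, int16_t factor)
{
    size_t i = 0;

#ifdef __ARM_ARCH_ISA_A64
    /* Copy the factor into all eight lanes of a vector register. */
    int16x8_t vfactor = vdupq_n_s16(factor);

    /* Process eight samples per iteration. */
    for (; i + 8 <= count; i += 8) {
        int16x8_t v = vld1q_s16(&samples[i]);   /* load 8 samples   */
        v = vqdmulhq_s16(v, vfactor);           /* SQDMULH per lane */
        vst1q_s16(&samples[i], v);              /* store 8 results  */
    }
#endif

    /* Remaining samples (all of them on non-AArch64 systems). */
    for (; i < count; i++)
        samples[i] = scale_one(samples[i], factor);
}
```

Compared with inline assembler, the compiler remains responsible for register allocation and instruction scheduling here, which is one of the trade-offs worth discussing in the write-up.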
Optional
Extend the inline and C intrinsics versions of the code to operate correctly on both AArch64 and x86_64 systems. Use preprocessor directives to select the correct assembler/intrinsic code for each platform (for example, #ifdef __ARM_ARCH_ISA_A64 to select the AArch64 code) -- use cpp -dM /dev/null to see all of the pre-defined macros for a given platform. Include equivalent C code so that if the program is compiled on a system which is not AArch64 and not x86_64, it will still work.
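The selection pattern described here can be sketched with a deliberately small operation, a 32-bit addition, rather than the volume scaling code itself. The function name add_ints is hypothetical, and the example assumes GCC-style extended inline assembler with the default AT&T syntax on x86_64.

```c
#include <stdint.h>

/* Hypothetical example: add two integers using platform-specific inline
   assembler where available, and plain C everywhere else. */
int32_t add_ints(int32_t a, int32_t b)
{
    int32_t sum;

#if defined(__ARM_ARCH_ISA_A64)
    /* AArch64: three-operand add; the %w prefix selects 32-bit registers. */
    __asm__ ("add %w0, %w1, %w2"
             : "=r"(sum)
             : "r"(a), "r"(b));
#elif defined(__x86_64__)
    /* x86_64: two-operand add, so the destination starts out holding 'a'.
       The instruction modifies the flags, hence the "cc" clobber. */
    sum = a;
    __asm__ ("addl %1, %0"
             : "+r"(sum)
             : "r"(b)
             : "cc");
#else
    /* Any other platform: equivalent C code. */
    sum = a + b;
#endif

    return sum;
}
```

Keeping the plain C branch last means the program still builds and produces the same results on a platform that matches neither macro.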
Deliverables
Write up the lab on your blog. Make sure that the post makes sense to someone who does not know the course context -- write with enough background and detail that it makes sense to someone stumbling onto your blog post. Include answers to all of the questions marked with Q: in the source files, and include a detailed and accurate performance summary of the five different volume scaling approaches taken so far (Algorithms 0, 1, and 2 from the Algorithm Selection lab, and the two implementations from this lab: inline assembler and intrinsics). Reflect on the advantages and disadvantages of the various approaches.