SPO600 SIMD Lab

Purpose of this Lab
In this lab, you will investigate the use of SIMD instructions in software, using auto-vectorization, inline assembler, and C intrinsics.

This lab is not used in the current semester.
Please refer to the other labs in the SPO600 Labs category.

Resources

Auto-vectorization

Auto-Vectorization in GCC - Main project page for the GCC auto-vectorizer.
Auto-vectorization with gcc 4.7 - An excellent discussion of the capabilities and limitations of the GCC auto-vectorizer, intrinsics for providing hints to GCC, and other code pattern changes that can improve results. Note that there has been some improvement in the auto-vectorizer since this article was written. This article is strongly recommended.
Intel (Auto)Vectorization Tutorial - This deals with the Intel compiler (ICC), but the general technical discussion is valid for other compilers such as GCC and LLVM.

Inline Assembly Language

Inline Assembly Language
ARM Developer Information Centre
- ARM Cortex-A Series Programmer’s Guide for ARMv8-A
The short guide to the ARMv8 instruction set: ARMv8 Instruction Set Overview ("ARM ISA Overview")
The long guide to the ARMv8 instruction set: ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile ("ARM ARM")

C Intrinsics - AArch64 SIMD

SQDMULH Instruction

Many of the AArch64 "Advanced SIMD" instructions are designed for use with multimedia data. In this lab, we will be using the SQDMULH instruction, which is a "Signed Saturating Doubling Multiply returning High Half". Breaking this down:

As a vector (SIMD) instruction, this operation works on multiple values in parallel. It can operate on 16- or 32-bit values; since we're dealing with 16-bit signed sound samples, we will use 16-bit values.
"Saturating" means that if the result overflows (or underflows) the maximum (or minimum) values, the result will be the maximum (or minimum) value. This is useful for graphics, where brightening a pixel that is at 90% brightness by an additional 50% should produce a pixel that is at maximum brightness, even though that's not mathematically correct. Likewise, a sound sample that is increased in volume should not increase past the maximum signal limit.
We're going to use this instruction to multiply sound samples by a volume scaling factor (V). This instruction doubles the result, so that the V factor will effectively be converted from a 16-bit value to a 17-bit value. We can treat this as a fixed-point number.
The result of multiplying two 16-bit numbers together is a 32-bit number. In our fixed-point representation, the 32-bit result has sixteen bits to the right of the radix point. Since this instruction takes the "high half" of the result, the lowest 16 bits are discarded, keeping only the integer portion of the result, which is exactly what we need.
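To make the arithmetic above concrete, here is a minimal scalar sketch in C of what SQDMULH does to a single 16-bit lane. The function name sqdmulh_s16 is hypothetical (it is not part of the lab files), and the right shift assumes the arithmetic-shift behaviour that GCC and Clang provide for negative signed values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one 16-bit lane of SQDMULH:
 * multiply, double, keep the high half, saturate. */
int16_t sqdmulh_s16(int16_t sample, int16_t factor)
{
    /* Widen first so the doubled product cannot overflow. */
    int64_t doubled = 2 * (int64_t)sample * (int64_t)factor;

    /* Keep the high half: drop the lowest 16 bits.
     * Assumes >> on a negative value is an arithmetic shift (GCC, Clang). */
    int64_t high = doubled >> 16;

    /* Only -32768 * -32768 exceeds the int16_t range, so one clamp suffices. */
    if (high > INT16_MAX) {
        high = INT16_MAX;
    }
    assert(high >= INT16_MIN);
    return (int16_t)high;
}
```

With a factor of 16384 (0.5 in this fixed-point format), a sample of 1000 becomes 500. The only input pair that saturates is -32768 times -32768, which yields 32767.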

Setup

Get the files for this lab on one of the SPO600 Servers and perform the lab on an AArch64 system.

Unpack the archive /public/spo600-simd-lab.tgz

Part 1: Auto-Vectorization

The vol1.c file is the same as the one in the SPO600 Algorithm Selection Lab. Modify the Makefile so that this file is compiled with the option -fopt-info-vec-all, which will display information about the decisions that the compiler is making about the vectorization of each loop.
Compile vol1.c and review the compiler output.
- The output will contain sections that start with "Analyzing loop at vol1.c:###" where ### is a line number.
  - If the section ends with "note: not vectorized: reason" then the loop is not vectorized, and the reason will explain why.
  - If the section ends with "note: LOOP VECTORIZED", then the loop was vectorized (compiled to use SIMD instructions).
Examine the output to see which loop(s) are vectorized.
Modify the code so that one more loop is vectorized.
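As a point of comparison for the compiler output, the following is a sketch of a loop shape that auto-vectorizers can usually handle: a simple counted loop over contiguous arrays with no aliasing between input and output. The function scale_samples and its parameters are hypothetical and are not taken from vol1.c. Whether the loop is actually vectorized depends on the compiler version and options, so check the report, for example with gcc -O3 -fopt-info-vec-all -c on a file containing this code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from vol1.c.
 * 'restrict' promises the compiler that out and in do not overlap,
 * which removes one common reason for a "not vectorized" report. */
void scale_samples(int16_t *restrict out, const int16_t *restrict in,
                   size_t n, int16_t factor_q15)
{
    assert(out != NULL && in != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Fixed-point multiply: factor_q15 / 32768 is the volume.
         * Assumes >> on a negative value is an arithmetic shift (GCC, Clang). */
        int32_t product = (int32_t)in[i] * (int32_t)factor_q15;
        out[i] = (int16_t)(product >> 15);
    }
}
```

Loops that call functions, have data-dependent exits, or carry a dependency from one iteration to the next are the ones most likely to be reported as not vectorized.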

Part 2: Inline Assembler

Look at add.c. Make sure that you understand how the inline assembler code works and why.
Modify the code to calculate b mod a using inline assembler, and print the result. This will help you to understand the inline assembler syntax. (Remember that b mod a is the remainder of b/a).
The file vol_inline.c contains a version of the volume scaling problem which uses inline assembler and the SQDMULH instruction. Copy, build, and verify the operation of this program on an AArch64 system.
Test the performance of this solution and compare it to your previous solution(s). Adjust the number of samples (in vol.h) to produce a measurable runtime, and adjust your code for comparable test conditions (number of samples, 1 array vs. 2 arrays, and so forth).
Consider the questions marked with a Q: in this file, and incorporate those answers into your blog post.
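For the b mod a step, one possible approach is sketched below using GCC extended inline assembler. AArch64 has no remainder instruction, so this sketch divides with UDIV and then uses MSUB to compute b - (b / a) * a. The function name mod_u64 and the use of unsigned 64-bit operands are assumptions for illustration (add.c may use different types), and a plain C fallback is included so the code also builds on other architectures.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns b mod a for a != 0. */
uint64_t mod_u64(uint64_t b, uint64_t a)
{
    assert(a != 0);
#if defined(__aarch64__)
    uint64_t quotient, remainder;
    __asm__("udiv %0, %2, %3\n\t"      /* quotient  = b / a            */
            "msub %1, %0, %3, %2"      /* remainder = b - quotient * a */
            : "=&r"(quotient),         /* early clobber: written before */
              "=r"(remainder)          /* the inputs are last read      */
            : "r"(b), "r"(a));
    return remainder;
#else
    /* Fallback for systems that are not AArch64. */
    return b % a;
#endif
}
```

The "=&r" constraint on the quotient matters: without the early-clobber marker, the compiler is free to place it in the same register as one of the inputs, and the second instruction would then read a value that has already been overwritten.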

Part 3: C Intrinsics

The file vol_intrinsics.c contains a version of the volume scaling problem which uses C Intrinsics to access AArch64 SIMD instructions. Copy, build, and verify the operation of this program on an AArch64 system.
Test the performance of this solution and compare it to your previous solution(s). Adjust the number of samples (in vol.h) to produce a measurable runtime, and adjust your code for comparable test conditions (number of samples, 1 array vs. 2 arrays, and so forth).
Consider the questions marked with a Q: in this file, and incorporate those answers into your blog post.
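The general shape of an intrinsics-based loop can be sketched as follows. This is not the code from vol_intrinsics.c; the function names are hypothetical. It uses the arm_neon.h intrinsics vld1q_s16, vdupq_n_s16, vqdmulhq_s16, and vst1q_s16 to process eight 16-bit samples per iteration, then finishes any leftover samples with scalar code. The scalar path assumes an arithmetic right shift for negative values, as GCC and Clang provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar equivalent of SQDMULH on one 16-bit sample (hypothetical helper). */
static int16_t scale_one(int16_t sample, int16_t factor)
{
    int64_t high = (2 * (int64_t)sample * (int64_t)factor) >> 16;
    return (int16_t)(high > INT16_MAX ? INT16_MAX : high);
}

/* Hypothetical helper: scale n samples by a fixed-point volume factor. */
void scale_buffer(int16_t *out, const int16_t *in, size_t n, int16_t factor)
{
    size_t i = 0;
    assert(out != NULL && in != NULL);

#if defined(__aarch64__)
    /* Broadcast the factor into all eight 16-bit lanes. */
    int16x8_t vfactor = vdupq_n_s16(factor);

    /* Main loop: eight samples per SQDMULH. */
    for (; i + 8 <= n; i += 8) {
        int16x8_t samples = vld1q_s16(in + i);
        vst1q_s16(out + i, vqdmulhq_s16(samples, vfactor));
    }
#endif

    /* Tail (or the whole buffer on systems that are not AArch64). */
    for (; i < n; i++) {
        out[i] = scale_one(in[i], factor);
    }
}
```

Handling the tail separately means the sample count does not have to be a multiple of eight, which helps when adjusting the number of samples for benchmarking.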

Optional

Extend the inline and C intrinsics versions of the code to operate correctly on both AArch64 and x86_64 systems. Use preprocessor directives to select the correct assembler/intrinsic code for each platform (for example, #ifdef __ARM_ARCH_ISA_A64 to select the AArch64 code); use cpp -dM /dev/null to see all of the pre-defined macros for a given platform. Include equivalent C code so that the program still works when compiled on a system that is neither AArch64 nor x86_64.
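One way the three-way selection might look is sketched below for a single group of eight samples. The function names scale8 and scale8_generic are hypothetical. x86_64 has no direct SSE2 equivalent of SQDMULH, so the x86_64 branch here is one possible emulation (an assumption of this sketch, not a prescribed solution): it rebuilds the product shifted right by 15 bits from the high and low halves, then corrects the single overflow case. The generic branch assumes an arithmetic right shift for negative values, as GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_ARCH_ISA_A64)
#include <arm_neon.h>
#elif defined(__x86_64__)
#include <emmintrin.h>   /* SSE2, always available on x86_64 */
#endif

/* Plain C version: always compiled, used as the fallback. */
void scale8_generic(int16_t out[8], const int16_t in[8], int16_t factor)
{
    for (int i = 0; i < 8; i++) {
        int64_t high = (2 * (int64_t)in[i] * (int64_t)factor) >> 16;
        out[i] = (int16_t)(high > INT16_MAX ? INT16_MAX : high);
    }
}

/* Scale eight samples using the best code path for the target. */
void scale8(int16_t out[8], const int16_t in[8], int16_t factor)
{
    assert(out != 0 && in != 0);
#if defined(__ARM_ARCH_ISA_A64)
    vst1q_s16(out, vqdmulhq_s16(vld1q_s16(in), vdupq_n_s16(factor)));
#elif defined(__x86_64__)
    __m128i s  = _mm_loadu_si128((const __m128i *)in);
    __m128i f  = _mm_set1_epi16(factor);
    __m128i hi = _mm_mulhi_epi16(s, f);   /* bits 31..16 of each product */
    __m128i lo = _mm_mullo_epi16(s, f);   /* bits 15..0 of each product  */
    /* (product >> 15) = (hi << 1) | (top bit of lo) */
    __m128i r  = _mm_or_si128(_mm_slli_epi16(hi, 1), _mm_srli_epi16(lo, 15));
    /* Only -32768 * -32768 wraps to 0x8000; flip it to 0x7FFF to saturate. */
    __m128i wrapped = _mm_cmpeq_epi16(r, _mm_set1_epi16(INT16_MIN));
    r = _mm_xor_si128(r, wrapped);
    _mm_storeu_si128((__m128i *)out, r);
#else
    scale8_generic(out, in, factor);
#endif
}
```

Keeping the plain C version compiled on every platform also provides a reference to check the platform-specific paths against.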

Deliverables

Benchmarking is Hard
Accurate benchmarking (performance testing) requires careful, methodical, accurate work. Take the time to make sure that your test results are repeatable, accurately reflect only the time taken for volume scaling, and are of sufficient magnitude to enable comparisons. Do not compare results between machines, but do compare different algorithms on the same machine. In order to have this lab accepted as complete, you will need to post reasonable test results.

Write up the lab on your blog. Make sure that the post makes sense to someone who does not know the course context: write with enough background and detail that it makes sense to someone stumbling onto your blog post. Include answers to all of the questions marked with Q: in the source files, and include a detailed and accurate performance summary of the five different volume scaling approaches taken so far (Algorithms 0, 1, and 2 from the Algorithm Selection lab, and the two implementations from this lab: inline assembler and intrinsics). Reflect on the advantages and disadvantages of the various approaches.
