GPU621/Intel Advisor

No change in size, 15:04, 23 November 2018
remove borders on images
== Vectorization Examples ==
=== Serial Version ===
[[File:Vectorization-example-serial.png|border]]
=== SIMD Version ===
[[File:Vectorization-example-simd.png|border]]
<source lang="cpp">
The following image illustrates the loop-carried dependency when two pointers overlap.
[[File:Pointer-alias.png|border]]
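The overlap described above can be reproduced with a small sketch. The names <code>add_one</code> and <code>fill_with_overlap</code> are hypothetical and do not come from the original example; they only show how two pointers into the same buffer make each iteration depend on the one before it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: writes src[i] + 1 into dst[i].
// Nothing here promises that dst and src are distinct, so the compiler
// must assume the two ranges may overlap.
void add_one(float* dst, const float* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] + 1.0f;
    }
}

// Calls add_one with overlapping ranges: dst starts one element after src,
// so iteration i reads the value that iteration i - 1 just wrote.
void fill_with_overlap(float* buf, std::size_t n) {
    assert(buf != nullptr && n >= 1);
    add_one(buf + 1, buf, n - 1);
}
```

Calling <code>fill_with_overlap</code> on a zero-initialized buffer of 5 floats leaves the values 0, 1, 2, 3, 4 in it, because every element is built from its left neighbour. A vectorized loop that loaded several <code>src</code> elements before storing any <code>dst</code> element would read stale values, which is why the compiler has to be conservative when it cannot rule out aliasing.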
=== Magnitude of a Vector ===
As you can see, there is a loop-carried dependency on the variable <code>sum</code>. The diagram below illustrates why the loop cannot be vectorized (nor can it be threaded). The dashed rectangle represents a single iteration of the loop, and the arrows represent dependencies between nodes. If an arrow crosses the boundary of the iteration rectangle, then one iteration depends on another and those iterations cannot be executed in parallel.
[[File:Magnitude-node-dependency-graph.png|border]]
To resolve the loop-carried dependency, use <code>simd</code> and the <code>reduction</code> clause to tell the compiler to vectorize the loop and to reduce the array of elements to a single value. Each SIMD lane computes its own partial sum, and the partial sums are combined into a single sum at the end.
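As a minimal sketch of that idea, the function below assumes the OpenMP form of the directive (<code>#pragma omp simd</code>, OpenMP 4.0 or later). The function name <code>magnitude</code> and its signature are chosen for illustration and may differ from the page's own listing.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: Euclidean magnitude of the n-element vector v.
float magnitude(const float* v, std::size_t n) {
    assert(v != nullptr || n == 0);
    float sum = 0.0f;
    // Assumption: OpenMP 4.0+ is enabled. The reduction clause gives each
    // SIMD lane a private partial sum and adds the partial sums together
    // after the loop. Without OpenMP the pragma is ignored and the loop
    // simply runs serially.
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += v[i] * v[i];
    }
    return std::sqrt(sum);
}
```

For the vector (3, 4) the function returns 5. Because the partial sums are added in a different order than in the serial loop, results for long vectors may differ from the serial version in the last few bits of floating-point precision.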
<source lang="cpp">
However, if the data is not aligned, the vectorizer may have to use a '''peeled''' loop to address the misalignment. Instead of vectorizing the entire loop, it inserts an extra loop that handles the elements at the front of the array that are not aligned in memory, so that the main vectorized loop can start at an aligned address.
[[File:Memory-alignment-peeled.png|border]]
A remainder loop is the result of having a number of elements in the array that is not evenly divisible by the vector length (the total number of elements of a certain data type that can be loaded into a vector register).
[[File:Memory-alignment-remainder.png|border]]
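The arithmetic behind peeled and remainder loops can be sketched as follows. <code>LoopSplit</code> and <code>split_loop</code> are hypothetical names used only for illustration; a real vectorizer makes this decision internally and may choose differently.

```cpp
#include <cassert>
#include <cstddef>

// Number of scalar iterations before (peel) and after (remainder)
// the vectorized body, plus the element count covered by the body.
struct LoopSplit {
    std::size_t peel;
    std::size_t body;
    std::size_t remainder;
};

// offset: how many elements past an aligned boundary the array starts.
// n:      total number of elements in the array.
// lanes:  vector length, i.e. elements per vector register.
LoopSplit split_loop(std::size_t offset, std::size_t n, std::size_t lanes) {
    assert(lanes > 0);
    // Elements needed to reach the next aligned boundary.
    std::size_t peel = (lanes - offset % lanes) % lanes;
    if (peel > n) {
        peel = n;
    }
    std::size_t rest = n - peel;
    std::size_t remainder = rest % lanes;
    return LoopSplit{peel, rest - remainder, remainder};
}
```

With 19 floats and a vector length of 4, an aligned array (<code>offset</code> of 0) needs no peeling: 16 elements go through the vectorized body and 3 are left for the remainder loop. If the same array starts 1 element past an aligned boundary, 3 elements are peeled, 16 are vectorized, and no remainder is left.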
=== Padding ===
For example, if you have a <code>4 x 19</code> array of floats and your system has access to 128-bit vector registers, then you should add 1 column to make the array <code>4 x 20</code>. The number of columns is then evenly divisible by the number of floats that can be loaded into a 128-bit vector register, which is 4 floats.
[[File:Memory-alignment-padding.png|border]]
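The padded column count is the original count rounded up to the next multiple of the vector length. The helper below is a sketch of that calculation; <code>padded_columns</code> is a hypothetical name, and the register width and element size are passed in as assumptions rather than detected from the hardware.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: rounds cols up to a multiple of the vector length.
// register_bits: width of the vector register, e.g. 128.
// element_bytes: size of one element, e.g. 4 for a 32-bit float.
std::size_t padded_columns(std::size_t cols,
                           std::size_t register_bits,
                           std::size_t element_bytes) {
    assert(element_bytes > 0);
    std::size_t lanes = register_bits / (8 * element_bytes);
    assert(lanes > 0);
    return (cols + lanes - 1) / lanes * lanes;
}
```

For 19 columns of 4-byte floats and 128-bit registers, the helper returns 20, matching the example above. A count that is already a multiple of the vector length, such as 20, is returned unchanged.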
=== Aligned vs Unaligned Instructions ===