GPU621/Intel Advisor

4,805 bytes added, 10:00, 23 November 2018
== Introduction ==
 
[https://software.intel.com/en-us/advisor Intel Advisor] is a software tool bundled with [https://software.intel.com/en-us/parallel-studio-xe Intel Parallel Studio] that is used to analyze a program to...
= Vectorization =
Vectorization is the process of using vector registers to perform a single instruction on multiple values at the same time.

A CPU register is a very small block of memory located inside the CPU itself. On a 64-bit CPU, a general-purpose register holds 8 bytes of data.

A vector register is an expanded version of a CPU register. A 128-bit vector register can store 16 bytes of data. A 256-bit vector register can store 32 bytes of data.

The vector register can be divided into lanes, where each lane stores a single value of a certain data type.

A 128-bit vector register can be divided in the following ways:

* 16 lanes: 16x characters (1 byte each)
* 8 lanes: 8x shorts (2 bytes each)
* 4 lanes: 4x integers / floats (4 bytes each)
* 2 lanes: 2x doubles (8 bytes each)

<source lang="cpp">
a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
10 | 20 | 30 | 40
1.5 | 2.5 | 3.5 | 4.5
3.14159 | 3.14159
</source>
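The lane counts listed above follow from dividing the register width in bytes by the size of the element type. The sketch below checks that arithmetic at compile time. The `lanes` helper is a hypothetical name introduced here for illustration, and the fixed-width integer types stand in for `char`, `short`, and `int` on typical platforms.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of lanes of type T that fit in a vector
// register that is `register_bits` wide.
template <typename T>
constexpr std::size_t lanes(std::size_t register_bits) {
    return (register_bits / 8) / sizeof(T);
}

// 128-bit register, mirroring the list above.
static_assert(lanes<std::int8_t>(128) == 16, "16 one-byte lanes");
static_assert(lanes<std::int16_t>(128) == 8, "8 two-byte lanes");
static_assert(lanes<std::int32_t>(128) == 4, "4 four-byte lanes");
static_assert(lanes<float>(128) == 4, "4 four-byte lanes");
static_assert(lanes<double>(128) == 2, "2 eight-byte lanes");
```

The same helper gives `lanes<float>(256) == 8` for a 256-bit register.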
== Instruction Set Architecture ==
 
* SSE
* SSE2
* SSE3
* SSSE3
* SSE4.1
* SSE4.2
* AVX
* AVX2
 
[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2 Intel Intrinsics SSE,SSE2,SSE3,SSSE3,SSE4.1,SSE4.2]
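One way to see which of these instruction sets a build is targeting is to inspect the compiler's predefined macros. The sketch below is an illustration only: `compiled_simd_level` is a hypothetical helper, and it reports what the compiler was told to target, not what the CPU running the program supports.

```cpp
#include <string>

// Hypothetical helper: returns the highest SIMD extension this translation
// unit was compiled for, judged only by predefined compiler macros.
std::string compiled_simd_level() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__SSSE3__)
    return "SSSE3";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__) || defined(_M_X64)
    return "SSE2";
#else
    return "none detected";
#endif
}
```

The result depends on the compiler and its flags, so the same source can report different levels on different builds.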
== Examples ==
 
[INSERT IMAGE HERE]
[[File:CPUCacheline.png|center|frame]]
 
<source lang="cpp">
int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

for (int i = 0; i < 8; i++) {
    a[i] *= 10;
}
</source>
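To make the idea concrete, the loop above can be written by hand with SSE2 intrinsics so that four integers are processed per instruction. This is only a sketch of what a vectorizer might produce, not the code the compiler actually emits. The function name `times_ten` is hypothetical. SSE2 has no packed 32-bit multiply that keeps the low half (that arrives with SSE4.1), so the sketch computes `10 * x` as `8 * x + 2 * x` using shifts and an add. A scalar fallback is used when SSE2 is not available.

```cpp
#include <cassert>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Multiplies every element of a[0..n) by 10.
// Assumption for this sketch: n is a multiple of 4, so no remainder loop is needed.
void times_ten(int* a, int n) {
    assert(n % 4 == 0);
#if HAVE_SSE2
    for (int i = 0; i < n; i += 4) {
        // Load 4 ints (128 bits) into one vector register; unaligned load is allowed.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        // 10 * x == (x << 3) + (x << 1), applied to all 4 lanes at once.
        __m128i r = _mm_add_epi32(_mm_slli_epi32(v, 3), _mm_slli_epi32(v, 1));
        // Store the 4 results back to memory.
        _mm_storeu_si128(reinterpret_cast<__m128i*>(a + i), r);
    }
#else
    // Scalar fallback for targets without SSE2.
    for (int i = 0; i < n; i++) {
        a[i] *= 10;
    }
#endif
}
```

For the 8-element array above, the SSE2 path runs 2 iterations instead of 8.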
 
= Intel Advisor Tutorial Example =
 
You can find the sample code in the directory of your Intel Parallel Studio installation. Just unzip the file and you can build the code on the command line or in Visual Studio.
 
Typically: C:\Program Files (x86)\IntelSWTools\Advisor 2019\samples\en\C++\vec_samples.zip
 
[https://software.intel.com/en-us/advisor-tutorial-vectorization-windows-cplusplus Intel® Advisor Tutorial: Add Efficient SIMD Parallelism to C++ Code Using the Vectorization Advisor]
 
== Loop Unrolling ==
 
The compiler can "unroll" a loop so that the body of the loop is duplicated a number of times, which reduces the number of conditional checks and counter increments the loop performs.
 
Warning: Do not write your code like this. The compiler will do it for you, unless you tell it not to.
<source lang="cpp">
#pragma nounroll
</source>
<source lang="cpp">
// Original loop: 50 condition checks and 50 increments
for (int i = 0; i < 50; i++) {
    foo(i);
}

// Unrolled by a factor of 5: 10 condition checks and 10 increments
for (int i = 0; i < 50; i+=5) {
    foo(i);
    foo(i+1);
    foo(i+2);
    foo(i+3);
    foo(i+4);
}

// foo(0)
// foo(1)
// foo(2)
// foo(3)
// foo(4)
// ...
// foo(45)
// foo(46)
// foo(47)
// foo(48)
// foo(49)
</source>

== Dependencies ==

=== Pointer Alias ===

A pointer alias means that two pointers point to the same location in memory, or that the memory regions they point to overlap.

If you compile the vec_samples project with the `NOALIAS` macro, the `matvec` function declaration will include the `restrict` keyword. The `restrict` keyword tells the compiler that the memory accessed through `b` is not accessed through any other pointer, such as `a`, so the compiler is free to optimize the code blocks that use these pointers.

[INSERT IMAGE HERE]
[[File:CPUCacheline.png|center|frame]]

multiply.c
<source lang="cpp">
#ifdef NOALIAS
void matvec(int size1, int size2, FTYPE a[][size2], FTYPE b[restrict], FTYPE x[], FTYPE wr[])
#else
void matvec(int size1, int size2, FTYPE a[][size2], FTYPE b[], FTYPE x[], FTYPE wr[])
#endif
</source>

To learn more about the `restrict` keyword and how the compiler can optimize code if it knows that two pointers do not overlap, you can visit this StackOverflow thread: [https://stackoverflow.com/a/30827880 What does the restrict keyword mean in C++?]
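`restrict` is a C99 keyword and is not part of standard C++. As a rough illustration of the same promise in C++, the sketch below uses the `__restrict` extension that GCC, Clang, and MSVC accept. The `NO_ALIAS` macro and the `scale` function are hypothetical names that do not come from the vec_samples project.

```cpp
#include <cstddef>

// `__restrict` is a compiler extension, so fall back to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#define NO_ALIAS __restrict
#else
#define NO_ALIAS
#endif

// The qualifiers promise the compiler that dst and src never overlap, so a
// store to dst[i] cannot change any src[j]. Calling this with overlapping
// arrays breaks that promise and is undefined behavior.
void scale(float* NO_ALIAS dst, const float* NO_ALIAS src, std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * k;
    }
}
```

Without the qualifiers the compiler has to prove the arrays are disjoint, or emit a runtime overlap check, before it can vectorize the loop.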
=== Loop-Carried Dependency ===

Pointers that overlap one another may introduce a loop-carried dependency when those pointers point to an array of data. Unless it can prove otherwise, the vectorizer assumes that the pointers may overlap and, as a result, will not auto-vectorize the code.

In the code example below, `a` is a function of `b`. If pointers `a` and `b` overlap, then writing to `a` may also modify `b`, so a value written in one iteration may be read in a later one. That is a loop-carried dependency, and it means the loop cannot be safely vectorized.

<source lang="cpp">
void func(int* a, int* b) {
    ...
    for (i = 0; i < size1; i++) {
        for (j = 0; j < size2; j++) {
            a[i] = foo(b[j]);
        }
    }
}
</source>

The following image illustrates the loop-carried dependency when two pointers overlap.

[INSERT IMAGE HERE]
[[File:CPUCacheline.png|center|frame]]

== Memory Alignment ==

Intel Advisor can detect if there are any memory alignment issues that may produce inefficient vectorization code.

A loop can be vectorized if there are no data dependencies across loop iterations. However, if the data is not aligned, the vectorizer may have to use a "peeled" loop to address the misalignment. So instead of vectorizing the entire loop, an extra loop needs to be inserted to perform operations on the front-end of the array that is not aligned with memory.
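The shape of a peeled loop can be sketched in plain C++. The code below is an illustration under stated assumptions, not what the Intel compiler generates: it assumes 4-byte floats and a 16-byte (128-bit) boundary, and the inner 4-iteration loop only stands in for a single aligned vector instruction. The names `LoopSplit` and `add_one` are hypothetical. The counts are returned only to make the split visible.

```cpp
#include <cstddef>
#include <cstdint>

// How many elements each phase of the loop handled.
struct LoopSplit {
    std::size_t peeled;
    std::size_t body;
    std::size_t remainder;
};

// Adds 1.0f to every element of data[0..n), split into three phases.
LoopSplit add_one(float* data, std::size_t n) {
    const std::size_t kLanes = 4;      // floats per 128-bit register
    const std::size_t kBoundary = 16;  // alignment boundary in bytes
    LoopSplit split = { 0, 0, 0 };
    std::size_t i = 0;

    // Peel loop: scalar iterations until data + i sits on a 16-byte boundary.
    while (i < n && reinterpret_cast<std::uintptr_t>(data + i) % kBoundary != 0) {
        data[i] += 1.0f;
        ++i;
        ++split.peeled;
    }

    // Main body: one aligned group of 4 floats per iteration.
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            data[i + lane] += 1.0f;
        }
        split.body += kLanes;
    }

    // Leftover elements that do not fill a whole group.
    for (; i < n; ++i) {
        data[i] += 1.0f;
        ++split.remainder;
    }
    return split;
}
```

If the array starts on a 16-byte boundary, the peel loop does not execute at all, which is the case alignment is meant to produce.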
[INSERT IMAGE HERE]

=== Alignment ===

To align data elements on an `x`-byte boundary in memory, use the `align` attribute, spelled `_declspec(align(...))` on Windows and `__attribute__((align(...)))` elsewhere, as the snippet below shows.

The following code snippet is used to align the data elements in the 'vec_samples' project.
<source lang="cpp">
// Tell the compiler to align the a, b, and x arrays on ALIGN_BOUNDARY
// boundaries. This allows the vectorizer to use aligned instructions
// and produce faster code.
#ifdef _WIN32
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE a[ROW][COLWIDTH];
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE b[ROW];
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE x[COLWIDTH];
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE wr[COLWIDTH];
#else
FTYPE a[ROW][COLWIDTH] __attribute__((align(ALIGN_BOUNDARY, OFFSET)));
FTYPE b[ROW] __attribute__((align(ALIGN_BOUNDARY, OFFSET)));
FTYPE x[COLWIDTH] __attribute__((align(ALIGN_BOUNDARY, OFFSET)));
FTYPE wr[COLWIDTH] __attribute__((align(ALIGN_BOUNDARY, OFFSET)));
#endif // _WIN32
</source>
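Outside the Intel compiler, a portable way to request a similar alignment is the standard C++11 `alignas` specifier. The sketch below is an assumption-laden illustration and not part of vec_samples: `kAlignBoundary`, `b_aligned`, and `is_aligned` are hypothetical names, and `alignas` takes only a boundary, so the `OFFSET` argument used in the snippet above has no counterpart here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical boundary: 16 bytes matches a 128-bit vector register.
constexpr std::size_t kAlignBoundary = 16;

// Ask the compiler to place the array on a 16-byte boundary.
alignas(kAlignBoundary) float b_aligned[64];

// Returns true if p sits on a multiple of `boundary` bytes.
bool is_aligned(const void* p, std::size_t boundary) {
    return reinterpret_cast<std::uintptr_t>(p) % boundary == 0;
}
```

A check such as `is_aligned(b_aligned, kAlignBoundary)` can be used in a debug build to confirm that the array landed where the vectorizer expects it.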
=== Padding ===

Even if the array elements are aligned with memory, say at 16-byte boundaries, you might still encounter a "remainder" loop that deals with the back-end of the array that cannot be included in the vectorized code. The vectorizer will have to insert an extra loop at the end of the vectorized loop to perform operations on the back-end of the array.

To address this issue, add some padding.

For example, if you have a `4 x 19` array of floats, and your system has access to 128-bit vector registers, then you should add 1 column to make the array `4 x 20` so that the number of columns is evenly divisible by the number of floats that can be loaded into a 128-bit vector register, which is 4 floats.

[INSERT IMAGE HERE]
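The padded width in the example is the column count rounded up to the next multiple of the lane count. A minimal sketch of that calculation follows. `padded_width`, `grid`, and the constants are hypothetical names chosen for this illustration.

```cpp
#include <cstddef>

// Rounds `cols` up to the next multiple of `lanes`.
constexpr std::size_t padded_width(std::size_t cols, std::size_t lanes) {
    return (cols + lanes - 1) / lanes * lanes;
}

// 19 floats per row, 4 floats per 128-bit register: pad each row to 20.
static_assert(padded_width(19, 4) == 20, "19 columns pad to 20");

constexpr std::size_t kRows = 4;
constexpr std::size_t kCols = 19;   // columns that hold real data
constexpr std::size_t kLanes = 4;   // floats per 128-bit register

// Each row holds 20 floats; the last column is padding and is never read as data.
float grid[kRows][padded_width(kCols, kLanes)];
```

Loops over a row can then run to the padded width, so every iteration fills a whole register and no remainder loop is needed, at the cost of 4 unused floats in this case.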
= Summary =