Changes

GPU621/Intel Advisor

7,012 bytes added, 10:57, 28 November 2018
Vectorization Advisor
# [mailto:jespiritu@myseneca.ca?subject=GPU621 Jeffrey Espiritu]
# [mailto:tahmed36@myseneca.ca?subject=GPU621 Thaharim Ahmed]
# [mailto:tnolte@myseneca.ca?subject=GPU621 Thomas Nolte]
# [mailto:jespiritu@myseneca.ca;tahmed36@myseneca.ca?subject=GPU621/DPS921 eMail All]
== Introduction ==
[https://software.intel.com/en-us/advisor Intel Advisor] is a software tool that helps you add multithreading to your application, or to parts of it, without disrupting your normal software development. Beyond adding multithreading, it can help you determine whether the performance gains are worth the costs that come with multithreading, such as reduced maintainability, harder debugging, and the effort of refactoring or reorganizing your code to resolve data dependencies.
It is also a tool that can help you add vectorization to your program or to improve the efficiency of code that is already vectorized.
Intel Advisor is bundled with [https://software.intel.com/en-us/parallel-studio-xe Intel Parallel Studio].
 
Intel Advisor is separated into two workflows: Vectorization Advisor and Threading Advisor.
 
= Vectorization Advisor =
 
The Vectorization Advisor is a tool for optimizing your code through vectorization. It helps identify loops that are high-impact and under-optimized. It also reports what is blocking loops from being vectorized and where it is safe to ignore the compiler's warnings and force vectorization. Finally, it offers in-line, code-specific recommendations on how to fix these issues.
 
== Roofline Analysis ==
 
Roofline charts provide a visual analysis of the performance ceiling that your computer's hardware imposes on your program. They give you an entry point for optimization by highlighting the loops that have the most impact on performance and the loops with the most room for improvement.

The key use of roofline analysis is to profile an application and show whether it is optimized for the hardware it is running on.

Roofline analysis helps answer two key questions:

* What are the bottlenecks limiting performance?
* Which loops are inhibiting performance the most?
 
 
[[File:Roofline-Chart-Example.png]]
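
The ceiling drawn on a roofline chart can be approximated with the basic roofline model: a loop cannot run faster than the machine's peak compute rate, nor faster than its arithmetic intensity (FLOP per byte moved) multiplied by the memory bandwidth. The sketch below is only an illustration of that formula. The function name and its parameters are hypothetical and are not part of Intel Advisor.

```cpp
#include <algorithm>
#include <cassert>

// Attainable performance in GFLOP/s under the basic roofline model.
//   peak_gflops   : peak compute rate of the machine (assumed input)
//   bandwidth_gbs : peak memory bandwidth in GB/s (assumed input)
//   intensity     : arithmetic intensity of the loop, in FLOP per byte
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double intensity) {
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && intensity >= 0.0);
    // The lower of the two ceilings is the one the loop actually hits.
    return std::min(peak_gflops, bandwidth_gbs * intensity);
}
```

A loop whose result comes from the bandwidth term is memory-bound, and one whose result equals the peak is compute-bound.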
 
== Survey Report ==
 
The Survey Report provides code-specific recommendations for fixing vectorization issues. It gives the programmer three key pieces of information:

* Where in the code vectorization would be the most impactful.
* How you can further improve vectorized loops.
* Which loops are not vectorized, and how they can be.
 
 
[[File:Survey-Report-Example.png]]
 
 
=== Trip Count and FLOPS Analysis ===
 
Complementing the Survey Report, Trip Count and FLOPS analysis provides in-line messages that help you make better decisions on how to improve individual loops. These messages include:

* The number of times the loop iterates.
* Data about FLOPS (floating-point operations per second).
 
 
[[File:In-Line-Analysis-Example.png]]
 
 
After identifying which loops benefit the most from vectorization, you can simply select them individually to run a more detailed report on them.
 
 
== Data Dependencies Report ==
 
Compilers may fail to vectorize loops due to potential data dependencies. This feature collects the compiler's messages about them and creates a report for the programmer. The report allows the programmer to judge whether these data dependencies actually exist and whether to force the compiler to ignore the error and vectorize the loop anyway. If a data dependency really does exist, the report provides information on the type of dependency and how to resolve it.
 
 
[[File:Data-Dependency-Example.png]]
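
To make the distinction between a real and an assumed dependency concrete, here is a minimal sketch with two hypothetical functions. The first has a true loop-carried dependency. The second only looks risky to the compiler because the two pointers might overlap. It uses the OpenMP <code>simd</code> pragma, which takes effect only when the compiler's OpenMP SIMD support is enabled and is otherwise ignored.

```cpp
#include <cstddef>

// Real loop-carried dependency: iteration i reads the value that
// iteration i - 1 just wrote, so the iterations cannot run in SIMD lanes.
void prefix_sum(float* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}

// Assumed dependency only: the compiler cannot prove that dst and src
// never overlap. If the caller guarantees they do not, the pragma tells
// the compiler to vectorize the loop anyway.
void scale(float* dst, const float* src, float k, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}
```

Forcing vectorization on a loop like <code>prefix_sum</code> would change its results, which is why the report's verdict matters before you override the compiler.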
 
= Threading Advisor =
 
The Threading Advisor is used to model, tune, and test the performance of various multithreading designs such as OpenMP, Threading Building Blocks (TBB), and Microsoft Task Parallel Library (TPL) without hindering the development of the project. It accomplishes this by helping you prototype threading options, test the scalability of the project on larger systems, and optimize faster. It also helps you identify issues before implementing parallelization, for example by eliminating data-sharing issues during design. The tool is primarily used for adding threading to C, C++, C#, and Fortran code.
 
== Annotations ==
 
Annotations can be inserted into your code to model a potential parallel design for analysis. Designing multithreading this way prevents early errors in the design from building up and causing slower performance than expected. Annotations do not impact the design of your current code, as the compiler ignores them (they are only there to help model your design). This lets you keep your code serial and avoid the bugs that can come from multithreading while you are still in the design phase.
 
 
[[File:Annotation-Example.jpg]]
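
As a rough sketch of what an annotated loop can look like, the example below marks a parallel site and treats each loop iteration as a task. The macro names follow those I understand Advisor's <code>advisor-annotate.h</code> header to provide, so check them against your Advisor version. To keep the sketch compilable without Advisor installed, it defines no-op stand-ins for the macros, which is an assumption of this example and not how a real project would be set up. The function, site, and task names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Stand-in no-op macros so this sketch compiles without Intel Advisor.
// In a real project you would include Advisor's "advisor-annotate.h"
// instead of defining these yourself.
#ifndef ANNOTATE_SITE_BEGIN
#define ANNOTATE_SITE_BEGIN(site)
#define ANNOTATE_SITE_END()
#define ANNOTATE_ITERATION_TASK(task)
#endif

// Serial code with annotations describing a proposed parallel design.
double sum_of_squares(const std::vector<double>& v) {
    double sum = 0.0;
    ANNOTATE_SITE_BEGIN(sum_site);          // region proposed for parallel execution
    for (std::size_t i = 0; i < v.size(); ++i) {
        ANNOTATE_ITERATION_TASK(sum_task);  // each iteration is modeled as one task
        sum += v[i] * v[i];
    }
    ANNOTATE_SITE_END();                    // end of the proposed parallel region
    return sum;
}
```

The program still runs serially and returns the same result. The annotations only describe the design you want analyzed.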
 
== Scalability Analysis ==
 
Scalability analysis lets you evaluate the performance and scalability of the various threading designs. Plotting the number of CPUs against the grain size gives easy-to-follow results on the impact of the common bottlenecks found in multithreaded code when you scale up a project, without the need to test it on multiple high-end machines yourself.
 
 
[[File:Scalabilty-Analysis-Example.png]]
 
== Dependencies Report ==
 
The Threading Advisor's dependencies report works similarly to the Vectorization Advisor's. It provides information on the data dependency errors a programmer encounters when parallelizing code, including data-sharing issues, deadlocks, and races. The report also displays the code snippets it finds related to each dependency error. You can follow these snippets to their exact location and begin handling the errors on a case-by-case basis.
 
= Work Flow =
 
With these two tools we can start to come up with a workflow for optimizing our code.
 
 
[[File:Work-Flow-Example.png]]
= Vectorization =
__m128i prod = _mm_unpacklo_epi64(prod01, prod23); // (ab3,ab2,ab1,ab0)
</source>
 
Code sample was taken from this StackOverflow thread: [https://stackoverflow.com/questions/17264399/fastest-way-to-multiply-two-vectors-of-32bit-integers-in-c-with-sse Fastest way to multiply two vectors of 32bit integers in C++, with SSE]
Here is a link to an interactive guide to Intel Intrinsics: [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2 Intel Intrinsics SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2]
== Vectorization Examples ==
[[File:Vectorization-example-serial.png|center|frame]]
=== Serial Version ===
=== SIMD Version ===
 
[[File:Vectorization-example-simd.png]]
<source lang="cpp">
If you compile the vec_samples project with the macro, the <code>matvec</code> function declaration will include the <code>restrict</code> keyword. The <code>restrict</code> keyword tells the compiler that pointers <code>a</code> and <code>b</code> do not overlap and that the compiler is free to optimize the code blocks that use the pointers.
 
[[File:CPUCacheline.png|center|frame]]
==== multiply.c ====
The following image illustrates the loop-carried dependency when two pointers overlap.
[[File:Pointer-alias.png]]

=== Magnitude of a Vector ===

To demonstrate a more familiar example of a loop-carried dependency that would block the auto-vectorization of a loop, I'm going to include a code snippet that calculates the magnitude of a vector.

To calculate the magnitude of a vector: <code>length = sqrt(x^2 + y^2 + z^2)</code>

<source lang="cpp">
for (int i = 0; i < n; i++)
    sum += x[i] * x[i];

length = sqrt(sum);
</source>

As you can see, there is a loop-carried dependency on the variable <code>sum</code>. The diagram below illustrates why the loop cannot be vectorized (nor can it be threaded). The dashed rectangle represents a single iteration of the loop, and the arrows represent dependencies between nodes. If an arrow crosses the iteration rectangle, then those iterations cannot be executed in parallel.

[[File:Magnitude-node-dependency-graph.png|center|frame]]

To resolve the loop-carried dependency, use <code>simd</code> and the <code>reduction</code> clause to tell the compiler to auto-vectorize the loop and to reduce the array of elements to a single value. Each SIMD lane will compute its own sum, and the per-lane sums are then combined into a single sum at the end.

<source lang="cpp">
#pragma omp simd reduction(+:sum)
for (int i = 0; i < n; i++)
    sum += x[i] * x[i];

length = sqrt(sum);
</source>
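
What the reduction does conceptually can be written out by hand. The sketch below is not what the compiler literally generates. It is a scalar illustration that assumes four lanes (four floats in a 128-bit register), with a hypothetical function name. Each lane keeps its own partial sum, the partial sums are combined once after the main loop, and any leftover elements are handled one at a time.

```cpp
#include <cmath>

// Magnitude of an n-element vector, with the reduction written out as
// per-lane partial sums. kLanes = 4 is an assumption for illustration.
float magnitude_by_lanes(const float* x, int n) {
    constexpr int kLanes = 4;
    float partial[kLanes] = {0.0f, 0.0f, 0.0f, 0.0f};

    int i = 0;
    // Main loop: each "lane" accumulates independently, so there is no
    // dependency between lanes inside one step.
    for (; i + kLanes <= n; i += kLanes)
        for (int lane = 0; lane < kLanes; ++lane)
            partial[lane] += x[i + lane] * x[i + lane];

    // Combine the per-lane sums into a single value.
    float sum = partial[0] + partial[1] + partial[2] + partial[3];

    // Leftover elements that do not fill a whole group of lanes.
    for (; i < n; ++i)
        sum += x[i] * x[i];

    return std::sqrt(sum);
}
```

Because floating-point addition is not associative, summing in this order can differ slightly from the serial loop for general inputs.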
== Memory Alignment ==
Intel Advisor can detect if there are any memory alignment issues that may produce inefficient vectorization code.
A loop can be vectorized if there are no data dependencies across loop iterations.
=== Peeled and Remainder Loops ===

If the data is not aligned, the vectorizer may have to use a '''peeled''' loop to address the misalignment. Instead of vectorizing the entire loop, it inserts an extra loop that performs operations on the front end of the array that is not aligned with memory.

[[File:Memory-alignment-peeled.png]]

A '''remainder''' loop is the result of having a number of elements in the array that is not evenly divisible by the vector length (the total number of elements of a certain data type that can be loaded into a vector register).

[[File:Memory-alignment-remainder.png]]

=== Padding ===

Even if the array elements are aligned with memory, say at 16-byte boundaries, you might still encounter a remainder loop that deals with the back end of the array that cannot be included in the vectorized code. The vectorizer will have to insert an extra loop at the end of the vectorized loop to perform operations on the back end of the array.

To address this issue, add some padding.

For example, if you have a <code>4 x 19</code> array of floats and your system has access to 128-bit vector registers, then you should add 1 column to make the array <code>4 x 20</code>, so that the number of columns is evenly divisible by the number of floats that can be loaded into a 128-bit vector register, which is 4 floats.

[[File:Memory-alignment-padding.png]]
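
The arithmetic behind peeled loops, remainder loops, and padding can be sketched as follows. The struct and function names are hypothetical, and the sketch assumes the array's start address is a multiple of the element size. A real compiler may also choose a different strategy, such as unaligned loads, instead of peeling.

```cpp
#include <cstddef>
#include <cstdint>

struct LoopSplit {
    std::size_t peel;       // scalar iterations before the first aligned element
    std::size_t body;       // iterations covered by full vector operations
    std::size_t remainder;  // scalar iterations left over at the end
};

// addr       : byte address of element 0 (assumed multiple of elem_bytes)
// n          : number of elements
// elem_bytes : size of one element, e.g. 4 for float
// vec_bytes  : vector register width in bytes, e.g. 16 for 128-bit
LoopSplit split_loop(std::uintptr_t addr, std::size_t n,
                     std::size_t elem_bytes, std::size_t vec_bytes) {
    const std::size_t lanes = vec_bytes / elem_bytes;
    const std::size_t misalign = addr % vec_bytes;
    // Elements to skip until the next vec_bytes boundary.
    std::size_t peel = (misalign == 0) ? 0 : (vec_bytes - misalign) / elem_bytes;
    if (peel > n) peel = n;
    const std::size_t body = ((n - peel) / lanes) * lanes;
    return {peel, body, n - peel - body};
}

// Smallest column count >= cols that is a multiple of the lane count.
std::size_t padded_columns(std::size_t cols, std::size_t lanes) {
    return ((cols + lanes - 1) / lanes) * lanes;
}
```

For 19 floats starting 8 bytes past a 16-byte boundary, this yields 2 peeled iterations, 16 vectorized iterations, and 1 remainder iteration. With aligned data and the row padded from 19 to 20 columns, both the peel and the remainder drop to zero.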
=== Aligned vs Unaligned Instructions ===
|}
The functions are taken from Intel's interactive guide to Intel Intrinsics: [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2 Intel Intrinsics SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2]

=== Aligning Data ===
To align data elements to an <code>x</code>-byte boundary in memory, use the <code>align</code> macro.
#endif // _WIN32
</source>
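
As a portable alternative to a compiler-specific macro, standard C++11 offers the <code>alignas</code> specifier. The sketch below uses <code>alignas</code> in place of the <code>align</code> macro described above. The type and helper function are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// A 4-float vector type whose objects always start on a 16-byte boundary,
// matching the width of a 128-bit vector register.
struct alignas(16) Vec4 {
    float v[4];
};

// Returns true if pointer p sits on a multiple of `boundary` bytes.
bool is_aligned(const void* p, std::size_t boundary) {
    return reinterpret_cast<std::uintptr_t>(p) % boundary == 0;
}
```

<code>alignas</code> can also be applied directly to a variable, for example <code>alignas(16) float data[8];</code>, and <code>is_aligned</code> gives a quick runtime check that the data landed where you expected.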
 
= Summary =