GPU621/Intel Advisor

== Introduction ==
[https://software.intel.com/en-us/advisor Intel Advisor] is a software tool that helps you add multithreading to your application, or to parts of it, without disrupting your normal software development. Beyond adding multithreading, it can help you decide whether the performance improvements are worth the costs that come with multithreading, such as reduced maintainability, harder debugging, and the effort of refactoring or reorganizing your code to resolve data dependencies. It is also a tool that can help you add vectorization to your program, or improve the efficiency of code that is already vectorized. Intel Advisor is bundled with [https://software.intel.com/en-us/parallel-studio-xe Intel Parallel Studio], which is used to analyze a program to...
= Vectorization =
A 128-bit vector register can be divided in the following ways:
* 16 lanes: 16x 8-bit characters (1 byte each)
* 8 lanes: 8x 16-bit short integers (2 bytes each)
* 4 lanes: 4x 32-bit integers / floats (4 bytes each)
* 2 lanes: 2x 64-bit integers (8 bytes each)
* 2 lanes: 2x 64-bit doubles (8 bytes each)
<source lang="cpp"pre>
a | b | c | d | e | f | g | h
1.5 | 2.5 | 3.5 | 4.5
1000 | 2000
3.14159 | 3.14159
</sourcepre>
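Each of these layouts corresponds to one of the 128-bit intrinsic types: __m128i for integer lanes, __m128 for four floats, and __m128d for two doubles. Below is a minimal sketch that builds the registers pictured above with the standard SSE/SSE2 _mm_set_* intrinsics (variable names are illustrative):
<source lang="cpp">
#include <emmintrin.h> // SSE2: __m128i, __m128, __m128d and the _mm_set_* intrinsics

int main() {
    // 8 lanes: eight 16-bit shorts (note: _mm_set_* lists lanes from highest to lowest)
    __m128i shorts  = _mm_set_epi16('h', 'g', 'f', 'e', 'd', 'c', 'b', 'a');

    // 4 lanes: four 32-bit floats
    __m128  floats  = _mm_set_ps(4.5f, 3.5f, 2.5f, 1.5f);

    // 2 lanes: two 64-bit integers
    __m128i longs   = _mm_set_epi64x(2000, 1000);

    // 2 lanes: two 64-bit doubles
    __m128d doubles = _mm_set_pd(3.14159, 3.14159);

    (void)shorts; (void)floats; (void)longs; (void)doubles;
    return 0;
}
</source>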
== Instruction Set Architecture ==
[https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions SSE] stands for Streaming SIMD Extensions, which refers to the addition of a set of SIMD instructions as well as new 128-bit XMM registers.
 
List of SIMD extensions:
* SSE
* SSE2
* AVX
* AVX2
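When the compiler is allowed to target one of these extensions, it typically defines a matching preprocessor macro, for example __SSE2__, __AVX__, and __AVX2__ on GCC/Clang (MSVC defines __AVX__ and __AVX2__ via /arch). A minimal compile-time check, as a sketch:
<source lang="cpp">
#include <iostream>

int main() {
// These macros describe the compile target chosen with flags such as
// -msse2 / -mavx2 (GCC/Clang) or /arch:AVX2 (MSVC), not the machine running the binary.
#ifdef __SSE2__
    std::cout << "Compiled with SSE2 support\n";
#endif
#ifdef __AVX__
    std::cout << "Compiled with AVX support\n";
#endif
#ifdef __AVX2__
    std::cout << "Compiled with AVX2 support\n";
#endif
    return 0;
}
</source>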
 
(For Unix/Linux) To display which instruction sets your processor supports, you can use the following commands:
 
<pre>
$ uname -a
$ lscpu
$ cat /proc/cpuinfo
</pre>
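The same question can also be answered at run time from C++. As an illustration, GCC and Clang provide the compiler-specific builtin __builtin_cpu_supports (not standard C++ and not available in MSVC):
<source lang="cpp">
#include <iostream>

int main() {
    // __builtin_cpu_supports queries the CPU the program is running on (GCC/Clang only).
    if (__builtin_cpu_supports("sse2"))
        std::cout << "CPU supports SSE2\n";
    if (__builtin_cpu_supports("avx"))
        std::cout << "CPU supports AVX\n";
    if (__builtin_cpu_supports("avx2"))
        std::cout << "CPU supports AVX2\n";
    return 0;
}
</source>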
=== Example ===
Here is a link to an interactive guide to Intel Intrinsics: [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SSE,SSE2,SSE3,SSSE3,SSE4_1,SSE4_2 Intel Intrinsics SSE,SSE2,SSE3,SSSE3,SSE4.1,SSE4.2]
== Examples ==
[[File:CPUCacheline.png|center|frame]]
 
=== Serial Version ===
<source lang="cpp">
for (int i = 0; i < 8; i++) {
a[i] *= 10;
}
</source>
 
=== SIMD Version ===
 
<source lang="cpp">
int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
int ten[4] = { 10, 10, 10, 10 }
__m128i va, v10;
 
v10 = _mm_loadu_si128((__m128i*)&ten);
 
for (int i = 0; i < 8; i+=4) {
va = _mm_loadu_si128((__m128i*)&a[i]); // 1 2 3 4
va = _mm_mullo_epi32(va, v10); // 10 20 30 40
_mm_storeu_si128((__m128i*)&a[i], va); // [10, 20, 30, 40]
}
</source>
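If the processor supports AVX2 (one of the extensions listed above), the same work can be done eight 32-bit integers at a time with the 256-bit __m256i type. Here is a sketch along the same lines, using the AVX/AVX2 intrinsics _mm256_loadu_si256, _mm256_set1_epi32, _mm256_mullo_epi32, and _mm256_storeu_si256 (the function name is illustrative):
<source lang="cpp">
#include <immintrin.h> // AVX/AVX2 intrinsics

void scale_by_ten_avx2(int a[8]) {
    __m256i v10 = _mm256_set1_epi32(10);                // 10 in every lane
    __m256i va  = _mm256_loadu_si256((__m256i*)&a[0]);  // 1 2 3 4 5 6 7 8
    va = _mm256_mullo_epi32(va, v10);                   // 10 20 30 40 50 60 70 80
    _mm256_storeu_si256((__m256i*)&a[0], va);           // store the result back
}
</source>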
= Intel Advisor Tutorial Example =
You can find the sample code in the directory of your Intel Parallel Studio installation. Just unzip the file; you can then open the solution in Visual Studio or build the code from the command line.
Typically: C:\Program Files (x86)\IntelSWTools\Advisor 2019\samples\en\C++\vec_samples.zip
Here is a great tutorial on how to use Intel Advisor to vectorize your code. [https://software.intel.com/en-us/advisor-tutorial-vectorization-windows-cplusplus Intel® Advisor Tutorial: Add Efficient SIMD Parallelism to C++ Code Using the Vectorization Advisor]
== Loop Unrolling ==
[INSERT IMAGE HERE]
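Loop unrolling replicates the loop body so that each iteration handles several array elements, which reduces loop overhead and gives the compiler more independent operations to schedule or vectorize. A minimal sketch, unrolling the serial scaling loop above by a factor of four (the function name is illustrative):
<source lang="cpp">
void scale_by_ten_unrolled(int a[8]) {
    // Unrolled by 4: one iteration of the loop now handles four elements.
    for (int i = 0; i < 8; i += 4) {
        a[i]     *= 10;
        a[i + 1] *= 10;
        a[i + 2] *= 10;
        a[i + 3] *= 10;
    }
}
</source>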
 
=== Aligned vs Unaligned Instructions ===
 
There are two versions of the SIMD instructions for loading data into and storing data from vector registers: aligned and unaligned. The aligned versions require the memory address to be aligned on a 16-byte boundary (for the 128-bit registers), while the unaligned versions accept any address.
 
The following table contains a list of Intel Intrinsics functions for both aligned and unaligned load and store instructions.
 
{| class="wikitable"
! Aligned
! Unaligned
! Description
|-
| __m128d _mm_load_pd (double const* mem_addr)
| __m128d _mm_loadu_pd (double const* mem_addr)
| Load 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from memory into dst.
|-
| __m128 _mm_load_ps (float const* mem_addr)
| __m128 _mm_loadu_ps (float const* mem_addr)
| Load 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from memory into dst.
|-
| __m128i _mm_load_si128 (__m128i const* mem_addr)
| __m128i _mm_loadu_si128 (__m128i const* mem_addr)
| Load 128-bits of integer data from memory into dst.
|-
| void _mm_store_pd (double* mem_addr, __m128d a)
| void _mm_storeu_pd (double* mem_addr, __m128d a)
| Store 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a into memory.
|-
| void _mm_store_ps (float* mem_addr, __m128 a)
| void _mm_storeu_ps (float* mem_addr, __m128 a)
| Store 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory.
|-
| void _mm_store_si128 (__m128i* mem_addr, __m128i a)
| void _mm_storeu_si128 (__m128i* mem_addr, __m128i a)
| Store 128-bits of integer data from a into memory.
|}
=== Alignment ===
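A minimal sketch of the difference: a buffer declared with alignas(16) (standard C++11) can be used with the aligned _mm_load_ps and _mm_store_ps, while a pointer with no 16-byte alignment guarantee should use _mm_loadu_ps / _mm_storeu_ps (variable and function names are illustrative):
<source lang="cpp">
#include <xmmintrin.h> // SSE: __m128, _mm_load_ps, _mm_loadu_ps, _mm_store_ps, _mm_add_ps

void alignment_demo() {
    // 16-byte aligned storage: safe to use the aligned load/store intrinsics.
    alignas(16) float aligned_data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    __m128 va = _mm_load_ps(aligned_data);     // aligned load

    // Address with no 16-byte alignment guarantee (offset by one float):
    // the unaligned intrinsics work for any address.
    float plain_data[5] = { 0.0f, 1.0f, 2.0f, 3.0f, 4.0f };
    __m128 vu = _mm_loadu_ps(&plain_data[1]);  // unaligned load

    _mm_store_ps(aligned_data, _mm_add_ps(va, vu)); // aligned store of the sum
}
</source>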