Difference between revisions of "GPU621/Intel Advisor"

From CDOT Wiki
Jump to: navigation, search
(Roofline Analysis)
Line 25: Line 25:
 
Roofline charts provide a visual analysis of the performance ceiling imposed on your program given the hard-ware of your computer. This provides an entry point for optimization highlighting loops with the most room for improvement.  
 
Roofline charts provide a visual analysis of the performance ceiling imposed on your program given the hard-ware of your computer. This provides an entry point for optimization highlighting loops with the most room for improvement.  
  
Roofline analysis allows us to tackle a few key points:
+
The key use of roofline analysis is to profile an application and display if it is optimized for the hard-ware it's running on.
  
- Is the current program optimized for the hard-ware it's running on?
+
Roofline analysis allows us to tackle a 2 key points:
- What are the bottlenecks limiting performance?
+
 
- what loops are inhibiting performance the most?
+
* What are the bottlenecks limiting performance?
 +
* what loops are inhibiting performance the most?
  
 
[[File:Roofline-Chart-Example.png]]
 
[[File:Roofline-Chart-Example.png]]

Revision as of 10:51, 26 November 2018

Intel Parallel Studio Advisor

Group Members

  1. Jeffrey Espiritu
  2. Thaharim Ahmed
  3. Thomas Nolte
  4. eMail All

Introduction

Intel Advisor is software tool that you can use to help you add multithreading to your applicatin or parts of your application without disrupting your normal software development. Not only can you use it add multithreading to your application, it can be used to determine whether the performance improvements that come with multithreading are worth adding when you consider the costs associated with multithreading such as maintainability, more difficult to debug, and the effort with refactoring or reorganizing your code to resolve data dependencies.

It is also a tool that can help you add vectorization to your program or to improve the efficiency of code that is already vectorized.

Intel Advisor is bundled with Intel Parallel Studio.

Intel Advisor is separated into two workflows Vectorization Advisor and Threading Advisor.

Vectorization Advisor

The Vectorization Advisor is a tool that provides analysis on whether your loops are utilizing SIMD instructions or not, what is preventing your code from being vectorized, reports on efficiency and how to increase it.

Roofline Analysis

Roofline charts provide a visual analysis of the performance ceiling imposed on your program given the hard-ware of your computer. This provides an entry point for optimization highlighting loops with the most room for improvement.

The key use of roofline analysis is to profile an application and display if it is optimized for the hard-ware it's running on.

Roofline analysis allows us to tackle a 2 key points:

  • What are the bottlenecks limiting performance?
  • what loops are inhibiting performance the most?

Roofline-Chart-Example.png

Vectorization

Vectorization is the process of utilizing vector registers to perform a single instruction on multiple values all at the same time.

A CPU register is a very, very tiny block of memory that sits right on top of the CPU. A 64-bit CPU can store 8 bytes of data in a single register.

A vector register is an expanded version of a CPU register. A 128-bit vector register can store 16 bytes of data. A 256-bit vector register can store 32 bytes of data.

The vector register can then be divided into lanes, where each lane stores a single value of a certain data type.

A 128-bit vector register can be divided into the following ways:

  • 16 lanes: 16x 8-bit characters
  • 8 lanes: 8x 16-bit integers
  • 4 lanes: 4x 32-bit integers / floats
  • 2 lanes: 2x 64-bit integers
  • 2 lanes: 2x 64-bit doubles
a | b | c | d | e | f | g | h

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8

10  | 20  | 30  | 40

1.5 | 2.5 | 3.5 | 4.5

1000    | 2000

3.14159 | 3.14159

SIMD Extensions

SSE stands for Streaming SIMD Extensions which refers to the addition of a set of SIMD instructions as well as new XMM registers. SIMD stands for Single Instruction Multiple Data and refers to instructions that can perform a single operation on multiple elements in parallel.

List of SIMD extensions:

  • SSE
  • SSE2
  • SSE3
  • SSSE3
  • SSE4.1
  • SSE4.2
  • AVX
  • AVX2

(For Unix/Linux) To display what instruction set and extensions set your processor supports, you can use the following commands:

$ uname -a
$ lscpu
$ cat /proc/cpuinfo

SSE Examples

For SSE >= SSE4.1, to multiply two 128-bit vector of signed 32-bit integers, you would use the following Intel intrisic function:

__m128i _mm_mullo_epi32(__m128i a, __m128i b)

Prior to SSE4.1, the same thing can be done with the following sequence of function calls.

// Vec4i operator * (Vec4i const & a, Vec4i const & b) {
// #ifdef
__m128i a13 = _mm_shuffle_epi32(a, 0xF5);              // (-,a3,-,a1)
__m128i b13 = _mm_shuffle_epi32(b, 0xF5);              // (-,b3,-,b1)
__m128i prod02 = _mm_mul_epu32(a, b);                  // (-,a2*b2,-,a0*b0)
__m128i prod13 = _mm_mul_epu32(a13, b13);              // (-,a3*b3,-,a1*b1)
__m128i prod01 = _mm_unpacklo_epi32(prod02, prod13);   // (-,-,a1*b1,a0*b0)
__m128i prod23 = _mm_unpackhi_epi32(prod02, prod13);   // (-,-,a3*b3,a2*b2)
__m128i prod = _mm_unpacklo_epi64(prod01, prod23);     // (ab3,ab2,ab1,ab0)

Code sample was taken from this StackOverflow thread: Fastest way to multiply two vectors of 32bit integers in C++, with SSE

Here is a link to an interactive guide to Intel Intrinsics: Intel Intrinsics SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2

Vectorization Examples

Vectorization-example-serial.png

Serial Version

int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };

for (int i = 0; i < 8; i++) {
    a[i] *= 10;
}

SIMD Version

Vectorization-example-simd.png

int a[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
int ten[4] = { 10, 10, 10, 10 }
__m128i va, v10;

v10 = _mm_loadu_si128((__m128i*)&ten);

for (int i = 0; i < 8; i+=4) {
    va = _mm_loadu_si128((__m128i*)&a[i]);         // 1  2  3  4
    va = _mm_mullo_epi32(va, v10);                 // 10 20 30 40
    _mm_storeu_si128((__m128i*)&a[i], va);         // [10, 20, 30, 40]
}

Intel Advisor Tutorial Example

Here is a great tutorial on how to use Intel Advisor to vectorize your code. Intel® Advisor Tutorial: Add Efficient SIMD Parallelism to C++ Code Using the Vectorization Advisor

You can find the sample code in the directory of your Intel Parallel Studio installation. Just unzip the file and you can open the solution in Visual Studio or build on the command line.

For default installations, the sample code would be located here: C:\Program Files (x86)\IntelSWTools\Advisor 2019\samples\en\C++\vec_samples.zip

Loop Unrolling

The compiler can "unroll" a loop so that the body of the loop is duplicated a number of times, and as a result, reduce the number of conditional checks and counter increments per loop.

Warning: Do not write your code like this. The compiler will do it for you, unless you tell it not to.

#pragma nounroll
for (int i = 0; i < 50; i++) {
    foo(i);
}

for (int i = 0; i < 50; i+=5) {
    foo(i);
    foo(i+1);
    foo(i+2);
    foo(i+3);
    foo(i+4);
}

// foo(0)
// foo(1)
// foo(2)
// foo(3)
// foo(4)
// ...
// foo(45)
// foo(46)
// foo(47)
// foo(48)
// foo(49)

Data Dependencies

Pointer Alias

A pointer alias means that two pointers point to the same location in memory or the two pointers overlap in memory.

If you compile the vec_samples project with the macro, the matvec function declaration will include the restrict keyword. The restrict keyword will tell the compiler that pointers a and b do not overlap and that the compiler is free optimize the code blocks that uses the pointers.

multiply.c

#ifdef NOALIAS
void matvec(int size1, int size2, FTYPE a[][size2], FTYPE b[restrict], FTYPE x[], FTYPE wr[])
#else
void matvec(int size1, int size2, FTYPE a[][size2], FTYPE b[], FTYPE x[], FTYPE wr[])
#endif

To learn more about the restrict keyword and how the compiler can optimize code if it knows that two pointers do not overlap, you can visit this StackOverflow thread: What does the restrict keyword mean in C++?

Loop-Carried Dependency

Pointers that overlap one another may introduce a loop-carried dependency when those pointers point to an array of data. The vectorizer will make this assumption and, as a result, will not auto-vectorize the code.

In the code example below, a is a function of b. If pointers a and b overlap, then there exists the possibility that if a is modified then b will also be modified, and therefore may create the possibility of a loop-carried dependency. This means the loop cannot be vectorized.

void func(int* a, int* b) {
    ...
    for (i = 0; i < size1; i++) {
        for (j = 0; j < size2; j++) {
            a[i] = foo(b[j]);
        }
    }
}

The following image illustrates the loop-carried dependency when two pointers overlap.

Pointer-alias.png

Magnitude of a Vector

To demonstrate a more familiar example of a loop-carried dependency that would block the auto-vectorization of a loop, I'm going to include a code snippet that calculates the magnitude of a vector.

To calculate the magnitude of a vector: length = sqrt(x^2 + y^2 + z^)

for (int i = 0; i < n; i++)
    sum += x[i] * x[i];

length = sqrt(sum);

As you can see, there is a loop-carried dependency with the variable sum. The diagram below illustrates why the loop cannot be vectorized (nor can it be threaded). The dashed rectangle represents a single iteration in the loop, and the arrows represents dependencies between nodes. If an arrow crosses the iteration rectangle, then those iterations cannot be executed in parallel.

Magnitude-node-dependency-graph.png

To resolve the loop-carried dependency, use simd and the reduction clause to tell the compiler to auto-vectorize the loop and to reduce the array of elements to a single value. Each SIMD lane will compute its own sum and then combine the results into a single sum at the end.

#pragma omp simd reduction(+:sum)
for (int i = 0; i < n; i++)
    sum += x[i] * x[i];

length = sqrt(sum);

Memory Alignment

Intel Advisor can detect if there are any memory alignment issues that may produce inefficient vectorization code.

A loop can be vectorized if there are no data dependencies across loop iterations.

Peeled and Remainder Loops

However, if the data is not aligned, the vectorizer may have to use a peeled loop to address the misalignment. So instead of vectorizing the entire loop, an extra loop needs to be inserted to perform operations on the front-end of the array that not aligned with memory.

Memory-alignment-peeled.png

A remainder loop is the result of having a number of elements in the array that is not evenly divisible by the vector length (the total number of elements of a certain data type that can be loaded into a vector register).

Memory-alignment-remainder.png

Padding

Even if the array elements are aligned with memory, say at 16 byte boundaries, you might still encounter a "remainder" loop that deals with back-end of the array that cannot be included in the vectorized code. The vectorizer will have to insert an extra loop at the end of the vectorized loop to perform operations on the back-end of the array.

To address this issue, add some padding.

For example, if you have a 4 x 19 array of floats, and your system has access to 128-bit vector registers, then you should add 1 column to make the array 4 x 20 so that the number of columns is evenly divisible by the number of floats that can be loaded into a 128-bit vector register, which is 4 floats.

Memory-alignment-padding.png

Aligned vs Unaligned Instructions

There are two versions of SIMD instructions for loading into and storing from vector registers: aligned and unaligned.

The following table contains a list of Intel Intrinsics functions for both aligned and unaligned load and store instructions.

Aligned Unaligned Description
__m128d _mm_load_pd (double const* mem_addr) __m128d _mm_loadu_pd (double const* mem_addr) Load 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from memory into dst.
__m128 _mm_load_ps (float const* mem_addr) __m128 _mm_loadu_ps (float const* mem_addr) Load 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from memory into dst.
__m128i _mm_load_si128 (__m128i const* mem_addr) __m128i _mm_loadu_si128 (__m128i const* mem_addr) Load 128-bits of integer data from memory into dst.
void _mm_store_pd (double* mem_addr, __m128d a) void _mm_storeu_pd (double* mem_addr, __m128 a) Store 128-bits (composed of 2 packed double-precision (64-bit) floating-point elements) from a into memory.
void _mm_store_ps (float* mem_addr, __m128 a) void _mm_storeu_ps (float* mem_addr, __m128 a) Store 128-bits (composed of 4 packed single-precision (32-bit) floating-point elements) from a into memory.
void _mm_store_si128 (__m128i* mem_addr, __m128i a) void _mm_storeu_si128 (__m128i* mem_addr, __m128i a) Store 128-bits of integer data from a into memory.

The functions are taken from Intel's interactive guide to Intel Intrinsics: Intel Intrinsics SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2

Aligning Data

To align data elements to an x amount of bytes in memory, use the align macro.

Code snippet that is used to align the data elements in the vec_samples project.

// Tell the compiler to align the a, b, and x arrays 
// boundaries.  This allows the vectorizer to use aligned instructions
// and produce faster code.
#ifdef _WIN32
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE a[ROW][COLWIDTH];
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE b[ROW];
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE x[COLWIDTH];
_declspec(align(ALIGN_BOUNDARY, OFFSET)) FTYPE wr[COLWIDTH];
#else
FTYPE a[ROW][COLWIDTH]	__attribute__((align(ALIGN_BOUNDARY, OFFSET)));
FTYPE b[ROW]			__attribute__((align(ALIGN_BOUNDARY, OFFSET)));
FTYPE x[COLWIDTH]		__attribute__((align(ALIGN_BOUNDARY, OFFSET)));
FTYPE wr[COLWIDTH]		__attribute__((align(ALIGN_BOUNDARY, OFFSET)));
#endif // _WIN32