Difference between revisions of "DPS921/Intel Advisor"

Latest revision as of 17:57, 7 December 2020

= Intel Advisor =

Intel Advisor is a design and analysis tool for optimizing application performance. It provides a set of analyses that help you improve threading, vectorization, and memory use within an application. Advisor supports C, C++, Fortran, Python, and OpenMP. The tools highlighted on this wiki page are Survey analysis, Dependencies analysis, Roof-line analysis, and Memory Access Pattern analysis. Survey analysis identifies points in your code where vectorization or parallelization is possible and would make execution faster. Dependencies analysis identifies data dependencies in your code. Roof-line analysis measures performance headroom against hardware limitations and gives insights for an effective optimization roadmap. Memory Access Pattern analysis checks for memory issues such as non-contiguous memory accesses and inefficient strides.

== Group Members ==

saketepe

Anojan Sothilingam

= Survey Analysis =

A Survey analysis creates a Survey Report that:

• Identifies where vectorization or parallelization will be most effective

• Describes whether vectorized loops are beneficial or not

• Lists un-vectorized loops and explains why they were not vectorized

• Points out general performance issues

== How to set up a Survey Analysis ==

Microsoft Visual Studio Integration

1. From the Tools menu, choose Intel Advisor > Start Survey Analysis

[[File:Intel_advisor.png|1000px]]

Intel Advisor GUI

1. Select New Project

[[File:Intel-Advisor-GUI-1.png|1000px]]

2. Enter the project name and select Create Project

[[File:Intel-Advisor-GUI-2.png|1000px]]

3. Find and select the executable file of your program and press OK

[[File:Intel-Advisor-GUI-3.png|1000px]]

4. Select the Collect button below Survey Target

[[File:Intel-Advisor-GUI-4.png|1000px]]

[[File:Intel-Advisor-GUI-5.png|1000px]]

== Code Example ==

<source>
//==============================================================
//
// SAMPLE SOURCE CODE - SUBJECT TO THE TERMS OF SAMPLE CODE LICENSE AGREEMENT,
// http://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/
//
// Copyright 2017 Intel Corporation
//
// THIS FILE IS PROVIDED "AS IS" WITH NO WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT
// NOT LIMITED TO ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
// PURPOSE, NON-INFRINGEMENT OF INTELLECTUAL PROPERTY RIGHTS.
//
// =============================================================

#include <iostream>
#include <random>
#include <vector>
using namespace std;

#define REPEAT 3000
#define SIZE 70000

// Despite the name, the vector data type has nothing to do with SIMD vectors.
// To avoid confusion, I'm renaming the data type to "dynamic_array".
typedef vector<double> dynamic_array;

int dummy = 0;

double doublesArray[SIZE * 2];
float floatsArray[SIZE];
dynamic_array firstDynamic(SIZE);
dynamic_array secondDynamic(SIZE);

int main()
{
	for (int i = 0; i < SIZE; i++)
	{
		firstDynamic[i] = doublesArray[i] = (rand() % 5000) + 1.0;
		secondDynamic[i] = doublesArray[i + SIZE] = (rand() % 5000) + 1.0;
		floatsArray[i] = (rand() % 5000) + 1.0f;
	}
	/****************** Normal Unit Stride Loop ******************/
	cout << "Normal Unit Stride Loop: ";
	for (int r = 0; r < REPEAT; r++)
	{
		for (int i = 0; i < SIZE; i++)
		{
			doublesArray[i] = i * 42.0 + 7;
		}
	}
	cout << "DONE\n";
	/****************** Vector Length Demo Loop ******************/
	cout << "Vector Length Demo Loop: ";
	for (int r = 0; r < REPEAT; r++)
	{
		for (int i = 0; i < SIZE; i++)
		{
			floatsArray[i] = i * 42.0f + 7;
		}
	}
	cout << "DONE\n";
	/******************** Constant Stride Loop *******************/
	cout << "Constant Stride Loop:    ";
	for (int r = 0; r < REPEAT; r++)
	{
		dummy++; // Prevents interchange. Too fast to contribute to loop time.
		for (int i = 0; i < SIZE; i++)
		{
			doublesArray[i * 2] = i * 42.0 + 7;
		}
	}
	cout << "DONE\n";
	/******************** Variable Stride Loop *******************/
	cout << "Variable Stride Loop:    ";
	for (int r = 0; r < REPEAT; r++)
	{
		for (int i = 0; i < SIZE; i++)
		{
			doublesArray[i + (i / 2)] = i * 42.0 + 7;
		}
	}
	cout << "DONE\n";
	/******************** True Dependency Loop *******************/
	cout << "True Dependency Loop:    ";
	for (int r = 0; r < REPEAT; r++)
	{
		for (int i = 1; i < SIZE; i++)
		{
			doublesArray[i] = i * 42.0 + doublesArray[i - 1];
		}
	}
	cout << "DONE\n";
	/****************** Assumed Dependency Loop ******************/
	cout << "Assumed Dependency Loop: ";
	for (int r = 0; r < REPEAT; r++)
	{
		for (int i = 0; i < SIZE; i++)
		{
			firstDynamic[i] = i * 42.0 + secondDynamic[i];
		}
	}
	cout << "DONE\n";
	return EXIT_SUCCESS;
}
</source>

= Dependencies Analysis =

The compiler cannot vectorize a loop if it may contain data dependencies. A Dependencies analysis creates a Dependencies Report that shows where possible data dependencies exist, along with details about the type of each dependency and suggestions for resolving it.
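As an illustration (this sketch is not from the Advisor documentation), here is a minimal C++ example of the kind of loop a Dependencies Report would flag. The first loop carries a read-after-write dependency between iterations; the second has no loop-carried dependency and is safe to vectorize:

<source>
#include <iostream>
#include <vector>

int main() {
    const int N = 1000;
    std::vector<double> a(N, 1.0);
    std::vector<double> b(N);

    // True (read-after-write) dependency: iteration i reads the value
    // written by iteration i - 1, so the compiler cannot vectorize this
    // loop; Advisor would report a RAW dependency here.
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + 1.0;

    // No loop-carried dependency: each iteration touches only its own
    // elements, so this loop vectorizes safely.
    for (int i = 0; i < N; i++)
        b[i] = a[i] * 2.0;

    std::cout << a[N - 1] << " " << b[N - 1] << "\n";  // prints "1000 2000"
    return 0;
}
</source>

Rewriting the first loop so each iteration is independent (here, a[i] = 1.0 + i) is the kind of fix the report suggests.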

== How to set up a Dependencies Analysis ==

1.

[[File:DA_SS1.png|1000px]]

[[File:DA_SS2.png|1000px]]

= Roof-line Analysis =

The roof-line tool builds a roofline model that represents an application's performance in relation to hardware limitations, including memory bandwidth and computational peaks. The chart has two axes, both in log scale: GFLOPS (giga floating-point operations per second) on the y-axis and AI (arithmetic intensity, in FLOPs/byte) on the x-axis. For any given machine, the CPU can only perform so many FLOPs per second, so we plot a horizontal line for the CPU cap. Likewise, the memory system can only supply so many gigabytes per second, which we represent with a diagonal line (N GB/s * X FLOPs/byte = Y GFLOPS). Together these roofs represent the machine's hardware limits and its best attainable performance at a given AI.

Every function or loop has a specific AI, and when it runs we can record its GFLOPS. Because its AI will not change, any optimization we make only moves its performance up or down at that AI, which makes the chart useful for measuring the effect of a given change or optimization.
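To make the roof-line arithmetic concrete, the attainable performance at a given AI is simply the lower of the two roofs: min(compute peak, bandwidth × AI). A small C++ sketch, using hypothetical machine numbers (100 GFLOPS peak, 20 GB/s bandwidth, not measured values):

<source>
#include <algorithm>
#include <iostream>

// Attainable GFLOPS at arithmetic intensity `ai` (FLOPs/byte):
// whichever roof is lower, the compute cap or the memory roof bw * ai.
double roofline(double peak_gflops, double bw_gbs, double ai) {
    return std::min(peak_gflops, bw_gbs * ai);
}

int main() {
    const double peak = 100.0;  // hypothetical compute peak, GFLOPS
    const double bw = 20.0;     // hypothetical memory bandwidth, GB/s

    // Below the ridge point (AI = peak / bw = 5 FLOPs/byte) a kernel is
    // memory-bound; above it, compute-bound.
    std::cout << roofline(peak, bw, 1.0) << "\n";   // prints "20"
    std::cout << roofline(peak, bw, 10.0) << "\n";  // prints "100"
    return 0;
}
</source>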


[[File:Flops.PNG]]

= How to set up a Roof-line Analysis =

Microsoft Visual Studio Integration

1. Select your project

[[File:Step_1.1.PNG]]

2. Go to Intel Advisor and select the roof-line tool

[[File:Step_2.png]]

3. Let the roof-line tool analyze the data

[[File:Step_3.PNG]]

4. Review the data

[[File:Step_4.PNG]]

= Memory Access Pattern Analysis =

We can use the MAP analysis tool to check for memory issues such as non-contiguous memory accesses and inefficient strides. It also reports the types of memory access in selected loops/functions, how the code traverses its data, and how that affects vector efficiency and cache bandwidth usage.

= How to set up Memory Access Pattern Analysis =

1. Run the roof-line tool

[[File:Step_4.PNG]]

2. Run the MAP tool

[[File:Step_5.PNG]]

3. Review the data

[[File:Step_6.PNG]]

== Code Example ==

<source>
/* Copyright (C) 2010-2017 Intel Corporation. All Rights Reserved.
 *
 * The source code, information and material ("Material")
 * contained herein is owned by Intel Corporation or its
 * suppliers or licensors, and title to such Material remains
 * with Intel Corporation or its suppliers or licensors.
 * The Material contains proprietary information of Intel or
 * its suppliers and licensors. The Material is protected by
 * worldwide copyright laws and treaty provisions.
 * No part of the Material may be used, copied, reproduced,
 * modified, published, uploaded, posted, transmitted, distributed
 * or disclosed in any way without Intel's prior express written
 * permission. No license under any patent, copyright or other
 * intellectual property rights in the Material is granted to or
 * conferred upon you, either expressly, by implication, inducement,
 * estoppel or otherwise. Any license under such intellectual
 * property rights must be express and approved by Intel in writing.
 * Third Party trademarks are the property of their respective owners.
 * Unless otherwise agreed by Intel in writing, you may not remove
 * or alter this notice or any other notice embedded in Materials
 * by Intel or Intel's suppliers or licensors in any way.
 *
 * This file is intended for use with the "Memory Access 101" tutorial.
 */

#include <iostream>
#include <time.h>
using namespace std;

const int LOOPS = 1500000;
const int SIZE = 14992;
const int STEPS = SIZE / 2;

float floatArray[SIZE];
double doubleArray[SIZE];

time_t start;
time_t finish;

int main()
{
	// Contiguous data access, same number of iterations as the noncontiguous.
	start = time(NULL);
	#pragma nounroll
	for (float i = 0; i < LOOPS; i++)
	{
		#pragma nounroll
		for (int j = 0; j < STEPS; j += 1)
		{
			floatArray[j] = i;
		}
	}
	finish = time(NULL);
	cout << "Contiguous Float:    " << finish - start << "\n";

	// Contiguous data access on doubles, so that it should require roughly
	// the same number of cache line loads as the 2-stride float loop.
	start = time(NULL);
	#pragma nounroll
	for (double i = 0; i < LOOPS; i++)
	{
		#pragma nounroll
		for (int j = 0; j < STEPS; j += 1)
		{
			doubleArray[j] = i;
		}
	}
	finish = time(NULL);
	cout << "Contiguous Double:   " << finish - start << "\n";

	// Stride-2 float. Same number of iterations as the contiguous version,
	// same number of cache line loads as the double loop. Slower than both.
	start = time(NULL);
	#pragma nounroll
	for (float i = 0; i < LOOPS; i++)
	{
		#pragma nounroll
		for (int j = 0; j < STEPS * 2; j += 2)
		{
			floatArray[j] = i;
		}
	}
	finish = time(NULL);
	cout << "Noncontiguous Float: " << finish - start << "\n";

	return EXIT_SUCCESS;
}
</source>

= Sources =

https://www.youtube.com/watch?v=h2QEM1HpFgg - Roofline Analysis in Intel Advisor tutorial

https://software.intel.com/content/www/us/en/develop/articles/intel-advisor-roofline.html - Intel Advisor tutorial:roofline

https://software.intel.com/content/www/us/en/develop/documentation/advisor-user-guide/top/survey-trip-counts-flops-and-roofline-analyses/survey-analysis.html

https://techdecoded.intel.io/quickhits/advantages-of-vectorization-and-the-effects-of-data-size/#gs.mgu8q9