Team Lion F2017

From CDOT Wiki
Revision as of 11:50, 5 January 2018 by Jsbhamber2 (Locks & Waits)

Group Members

Intel Parallel Studio VTune Amplifier

  1. Jagmeet Bhamber
  2. Shivam Gupta
  3. Yong Kuk Kim

What is VTune Amplifier?

  • A tool created by Intel for analyzing the performance of software.
  • Offers both a GUI and a command-line version on Windows and Linux.
  • On OS X, only the GUI is available.
  • Basic features are available on both Intel and AMD processors; advanced features are available only on Intel processors.

How to use it?

  • Available as a standalone unit or part of the following packages:
    • Intel Parallel Studio XE Cluster Edition and Professional Edition
    • Intel Media Server Studio Professional Edition
    • Intel System Studio

  • Can be run on a local machine.


Hotspots

Basic hotspot analysis

We used our Workshop 6 code as an example to demonstrate this aspect of Intel VTune Amplifier.

Summary.PNG


Function timmings.PNG

The image above shows the timings for each function:

matmul_0 - serial version

matmul_1 - serial version with the j and k loops reversed

matmul_2 - uses cilk_for

matmul_3 - uses cilk_for and a reducer hyperobject

matmul_4 - uses cilk_for, a reducer, and vectorization

Advanced hotspot analysis

Parallelism

Concurrency

  • Best for visualizing thread parallelism on the available cores, finding areas with high or low concurrency, and identifying serial bottlenecks in your code.
  • Provides information on how many threads were running at each moment during application execution.
  • Counts threads that are either running or ready to run, that is, threads that are not waiting at a defined waiting or blocking API.
  • Also shows the CPU time spent while each hotspot was executing and estimates its effectiveness either by CPU usage or by Threads Concurrency.
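To make "how many threads were running at each moment" concrete, here is a minimal sketch of a program with one serial phase followed by one parallel phase. It uses std::thread instead of Cilk Plus so that it builds with any C++11 compiler. The names busy_work and run_phases are hypothetical and are not part of Workshop 6. In a Concurrency timeline, one would expect a single running thread during the first phase and several during the second.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical CPU-bound kernel: sums the integers 0..iterations-1.
std::uint64_t busy_work(std::uint64_t iterations) {
    volatile std::uint64_t acc = 0;  // volatile keeps the loop from being optimized away
    for (std::uint64_t i = 0; i < iterations; ++i)
        acc = acc + i;
    return acc;
}

// Runs a serial phase, then the same amount of work on each of num_threads workers.
// Returns the sum of all results so that the work is observable.
std::uint64_t run_phases(unsigned num_threads, std::uint64_t iterations) {
    // Phase 1 (serial): only the calling thread is running.
    std::uint64_t total = busy_work(iterations);

    // Phase 2 (parallel): num_threads workers are runnable at the same time.
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&partial, t, iterations] {
            partial[t] = busy_work(iterations);
        });
    for (auto& w : workers)
        w.join();

    for (std::uint64_t p : partial)
        total += p;
    return total;
}
```

Calling run_phases(4, some_large_count) from main and profiling the result is one way to see a serial bottleneck next to a well-parallelized region in the same run.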

Results of Concurrency tests on Workshop 6

I ran the Concurrency analysis on each of the functions in Workshop 6. I isolated each function by commenting out all the others and ran them one at a time to see how each performs on its own. Finally, I ran them all together to see how the program performs overall.
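As an alternative to commenting functions in and out, the function to profile could be chosen at run time. The following is a minimal sketch of that idea, not the harness used for the results on this page. Kernel, RunResult, kKernels, run_kernel and matmul_serial are hypothetical names; matmul_serial simply repeats the serial algorithm so that the example is self-contained.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Signature shared by the matmul_* functions on this page.
using Kernel = double (*)(const double*, const double*, double*, int);

// Stand-in serial kernel (same algorithm as the serial version).
double matmul_serial(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}

struct RunResult {
    double diag;          // value returned by the kernel
    double milliseconds;  // wall-clock time of the kernel call
};

// Table of selectable kernels; the other variants would be appended here.
const Kernel kKernels[] = { matmul_serial };
const int kKernelCount = sizeof(kKernels) / sizeof(kKernels[0]);

// Runs kernel number 'id' on n x n inputs filled with 1.0 and 2.0.
// Returns false if 'id' does not name a kernel in the table.
bool run_kernel(int id, int n, RunResult& out) {
    if (id < 0 || id >= kKernelCount)
        return false;
    const std::size_t size = static_cast<std::size_t>(n) * n;
    std::vector<double> a(size, 1.0), b(size, 2.0), c(size, 0.0);
    const auto start = std::chrono::steady_clock::now();
    out.diag = kKernels[id](a.data(), b.data(), c.data(), n);
    const auto stop = std::chrono::steady_clock::now();
    out.milliseconds =
        std::chrono::duration<double, std::milli>(stop - start).count();
    return true;
}
```

A main function could then parse the kernel number from its first command-line argument and pass it to run_kernel, so that each profiling run exercises exactly one function without rebuilding.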

matmul_0 (Serial)

double matmul_0(const double* a, const double* b, double* c, int n) {
	for (int i = 0; i < n; i++) {
		for (int j = 0; j < n; j++) {
			double sum = 0.0;
			for (int k = 0; k < n; k++)
				sum += a[i * n + k] * b[k * n + j];
			c[i * n + j] = sum;
		}
	}
	double diag = 0.0;
	for (int i = 0; i < n; i++)
		diag += c[i * n + i];
	return diag;
}

Conc-01.png Conc-02.png

matmul_1 (Serial with j-k loops reversed)

double matmul_1(const double* a, const double* b, double* c, int n) {
	// i-k-j loop order: the innermost loop walks b and c along a row
	for (int i = 0; i < n; i++) {
		for (int j = 0; j < n; j++)
			c[i * n + j] = 0.0;
		for (int k = 0; k < n; k++) {
			const double aik = a[i * n + k];
			for (int j = 0; j < n; j++)
				c[i * n + j] += aik * b[k * n + j];
		}
	}
	double diag = 0.0;
	for (int i = 0; i < n; i++)
		diag += c[i * n + i];
	return diag;
}

Conc-11.png Conc-12.png
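Swapping the j and k loops is only valid if each product a[i][k] * b[k][j] is still accumulated into c[i][j]. The following minimal sketch checks that an i-k-j ordering produces the same matrix as the i-j-k ordering. The function names multiply_ijk, multiply_ikj and same_product are hypothetical, and the inputs are small integers so that the floating-point results are exact and can be compared with ==.

```cpp
#include <cstddef>
#include <vector>

// i-j-k order: one dot product per output element.
void multiply_ijk(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

// i-k-j order: row i of c is built up from scaled rows of b.
void multiply_ikj(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            c[i * n + j] = 0.0;
        for (int k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}

// Returns true if both orderings give identical n x n products.
bool same_product(int n) {
    const std::size_t size = static_cast<std::size_t>(n) * n;
    std::vector<double> a(size), b(size), c1(size), c2(size);
    for (std::size_t idx = 0; idx < size; ++idx) {
        a[idx] = static_cast<double>(idx % 7);  // small integers keep sums exact
        b[idx] = static_cast<double>(idx % 5);
    }
    multiply_ijk(a.data(), b.data(), c1.data(), n);
    multiply_ikj(a.data(), b.data(), c2.data(), n);
    return c1 == c2;
}
```

With general floating-point inputs the two orderings add terms in the same k order for each element, but a tolerance-based comparison would still be the safer choice in a real test.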

matmul_2 (Cilk Plus with cilk_for)

double matmul_2(const double* a, const double* b, double* c, int n) {
	
	cilk_for (int i = 0; i < n; i++) {
		cilk_for (int j = 0; j < n; j++) {
			double sum = 0.0;
			for(int k = 0; k < n; k++) {
				sum += a[i * n + k] * b[k * n + j];
			}
			c[i * n + j] = sum;
		}
	}

	double diag = 0.0;
	for (int i = 0; i < n; i++)
		diag += c[i * n + i];
	return diag;
}

Conc-21.png Conc-22.png

matmul_3 (+reducer)

double matmul_3(const double* a, const double* b, double* c, int n) {
	
	cilk_for(int i = 0; i < n; i++) {
		cilk_for(int j = 0; j < n; j++) {
			double sum = 0.0;
			for (int k = 0; k < n; k++) {
				sum += a[i * n + k] * b[k * n + j];
			}
			c[i * n + j] = sum;
		}
	}

	cilk::reducer_opadd <double> diag(0.0);
	cilk_for(int i = 0; i < n; i++) {
		diag += c[i * n + i];
	}
	return diag.get_value();
}

Conc-31.png Conc-32.png
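The reducer lets parallel iterations add to diag without a data race. As a rough analogy only (this is not how the Cilk Plus runtime is implemented), the same effect can be sketched in standard C++ by giving each worker a private partial sum and combining the partial sums after all workers finish. The function name diagonal_sum and the strided split of the diagonal are assumptions made for this illustration.

```cpp
#include <thread>
#include <vector>

// Sums the diagonal of an n x n row-major matrix using num_threads workers.
// Assumes num_threads >= 1.
double diagonal_sum(const double* c, int n, unsigned num_threads) {
    std::vector<double> partial(num_threads, 0.0);  // one private slot per worker
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&partial, c, n, t, num_threads] {
            double local = 0.0;  // no other thread touches this variable
            for (int i = static_cast<int>(t); i < n;
                 i += static_cast<int>(num_threads))
                local += c[i * n + i];
            partial[t] = local;  // each worker writes only its own slot
        });
    for (auto& w : workers)
        w.join();

    // Combine step: performed serially after every worker has finished.
    double total = 0.0;
    for (double p : partial)
        total += p;
    return total;
}
```

Because no lock is taken inside the loop, a pattern like this would not be expected to add wait time in a Locks & Waits analysis, which is the main reason to prefer it over a shared sum guarded by a mutex.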

matmul_4 (+vectorization)

double matmul_4(const double* a, const double* b, double* c, int n) {
	
	cilk_for(int i = 0; i < n; i++) {
		cilk_for(int j = 0; j < n; j++) {
			double sum = 0.0;
#pragma simd reduction(+:sum)
			for (int k = 0; k < n; k++) {
				sum += a[i * n + k] * b[k * n + j];
			}
			c[i * n + j] = sum;
		}
	}

	cilk::reducer_opadd <double> diag(0.0);
	cilk_for(int i = 0; i < n; i++) {
		diag += c[i * n + i];
	}
	return diag.get_value();
}

Conc-41.png Conc-42.png

Final test with all functions

Conc-51.png Conc-52.png

Conc-53.png

Locks & Waits

  • Best for locating causes of low concurrency, such as heavily used locks and large critical sections.
  • A common problem is threads waiting too long on synchronization objects (locks).
  • Uses user-mode sampling and tracing collection to identify the causes of ineffective processor utilization.
  • Shows the time threads spent waiting on synchronization objects.
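The kind of problem this analysis is meant to expose can be sketched with two versions of the same counter. In the first, every increment takes one shared mutex, so threads spend most of their time waiting on it. In the second, each thread counts privately and takes the lock once. This is a minimal illustration in standard C++ rather than code from Workshop 6; the function names are hypothetical.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Heavily used lock: the mutex is acquired once per increment.
long count_with_shared_lock(unsigned num_threads, long per_thread) {
    long total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&] {
            for (long i = 0; i < per_thread; ++i) {
                std::lock_guard<std::mutex> guard(m);  // contended on every pass
                ++total;
            }
        });
    for (auto& w : workers)
        w.join();
    return total;
}

// Lightly used lock: each thread accumulates locally and locks once at the end.
long count_with_local_totals(unsigned num_threads, long per_thread) {
    long total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t)
        workers.emplace_back([&] {
            long local = 0;
            for (long i = 0; i < per_thread; ++i)
                ++local;  // private to this thread, no synchronization needed
            std::lock_guard<std::mutex> guard(m);  // one short critical section
            total += local;
        });
    for (auto& w : workers)
        w.join();
    return total;
}
```

Both functions return the same count; the difference that a Locks & Waits run would be expected to highlight is the wait time attributed to the mutex in the first version.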


Lock1.png

Lock2.png

Lock3.png

HPC Performance Characterization

Microarchitecture

General Exploration

Memory Access

References

https://en.wikipedia.org/wiki/VTune

https://software.intel.com/en-us/get-started-with-vtune

https://software.intel.com/en-us/vtune-amplifier-help-analysis-types

https://software.intel.com/en-us/vtune-amplifier-help-basic-hotspots-analysis

https://software.intel.com/en-us/vtune-amplifier-help-advanced-hotspots-analysis

https://software.intel.com/en-us/vtune-amplifier-help-concurrency-analysis

https://software.intel.com/en-us/vtune-amplifier-help-locks-and-waits-analysis

https://software.intel.com/en-us/vtune-amplifier-help-hpc-performance-characterization-analysis

https://software.intel.com/en-us/vtune-amplifier-help-general-exploration-analysis

https://software.intel.com/en-us/vtune-amplifier-help-memory-access-analysis