Team Lion F2017

2,335 bytes added, 12:53, 5 January 2018
[[File:Summary.PNG]]
===Advanced hotspot analysis===
[[File:Function_timmings.PNG]]
 
The image above shows the timings for each function:
* matmul_0: serial version
* matmul_1: serial version with the j and k loops interchanged
* matmul_2: uses cilk_for
* matmul_3: uses cilk_for and a reducer hyperobject
* matmul_4: uses cilk_for, a reducer hyperobject, and vectorization
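The timings in the screenshot come from the profiler. As a rough cross-check, a kernel can also be timed by hand. The following is a minimal sketch in standard C++, assuming a hypothetical helper <code>time_kernel</code> and a serial kernel named <code>matmul_serial</code> (both names are illustrative and not part of the original project); it is not a substitute for the profiler's measurements.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Serial reference kernel with the same i-j-k loop structure as the serial version.
// Computes c = a * b for n x n row-major matrices and returns the trace of c.
double matmul_serial(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}

// Hypothetical result type: the kernel's return value plus elapsed time.
struct Timing {
    double result;  // value returned by the kernel (the trace of c)
    double millis;  // elapsed wall-clock time in milliseconds
};

// Hypothetical helper: fills a with 1.0 and b with 2.0, runs the kernel once,
// and measures the call with a monotonic clock.
template <typename Kernel>
Timing time_kernel(Kernel kernel, int n) {
    assert(n > 0);
    const std::size_t count = static_cast<std::size_t>(n) * n;
    std::vector<double> a(count, 1.0), b(count, 2.0), c(count, 0.0);

    const auto start = std::chrono::steady_clock::now();
    const double result = kernel(a.data(), b.data(), c.data(), n);
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return {result, elapsed.count()};
}
```

Calling <code>time_kernel(matmul_serial, n)</code> for each kernel with the same <code>n</code> gives comparable numbers. A single run is noisy, so repeating the call and taking the minimum or median is usually more reliable.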
====matmul_0 (Serial)====
<pre>
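// Serial baseline: c = a * b for n x n row-major matrices; returns the trace of c.
double matmul_0(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            // dot product of row i of a and column j of b
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
    // sum the diagonal of the result
    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}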
</pre>
[[File:Conc-01.png]]
====matmul_1 (Serial with j-k loops reversed)====
<pre>
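// Serial version with the j and k loops interchanged (i-k-j order).
// The inner loop now walks b and c along a row, so memory access is unit stride.
double matmul_1(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++) {
        // clear row i of c before accumulating into it
        for (int j = 0; j < n; j++)
            c[i * n + j] = 0.0;
        for (int k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
    // sum the diagonal of the result
    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}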
</pre>
[[File:Conc-11.png]]
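Interchanging the j and k loops changes the memory access pattern, but it should not change the product. The following is a minimal sketch in standard C++ for checking that, assuming hypothetical function names (<code>multiply_ijk</code>, <code>multiply_ikj</code>, <code>compare_loop_orders</code>) that are not part of the original project. In the i-k-j order the result row must be cleared first, because the inner loop accumulates into <code>c</code> instead of into a local sum.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// i-j-k order: the inner loop walks b down a column (stride n).
void multiply_ijk(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// i-k-j order: the inner loop walks b and c along a row (stride 1).
void multiply_ikj(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            c[i * n + j] = 0.0;  // clear row i before accumulating
        for (int k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}

// Largest absolute element-wise difference between two equally sized matrices.
double max_abs_diff(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < x.size(); i++)
        worst = std::fmax(worst, std::fabs(x[i] - y[i]));
    return worst;
}

// Hypothetical check: multiplies the same small-integer test matrices with both
// loop orders and returns the largest difference between the two results.
double compare_loop_orders(int n) {
    assert(n > 0);
    const std::size_t count = static_cast<std::size_t>(n) * n;
    std::vector<double> a(count), b(count), c1(count), c2(count);
    for (std::size_t i = 0; i < count; i++) {
        a[i] = static_cast<double>(i % 7) + 1.0;
        b[i] = static_cast<double>(i % 5) - 2.0;
    }
    multiply_ijk(a.data(), b.data(), c1.data(), n);
    multiply_ikj(a.data(), b.data(), c2.data(), n);
    return max_abs_diff(c1, c2);
}
```

Both orders add the products for each element in the same k order, so the two results are expected to match exactly here.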
====matmul_2 (Cilk Plus with cilk_for)====
<pre>
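#include <cilk/cilk.h>

// Cilk Plus version: rows and columns of c are computed in parallel.
double matmul_2(const double* a, const double* b, double* c, int n) {
    cilk_for (int i = 0; i < n; i++) {
        cilk_for (int j = 0; j < n; j++) {
            // each (i, j) iteration owns its own sum, so no synchronization is needed
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }

    // the diagonal is still summed serially
    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}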
</pre>
[[File:Conc-21.png]]
====matmul_3 (+reducer hyperobject)====
<pre>
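#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

// Cilk Plus version with the diagonal summed in parallel through a reducer.
double matmul_3(const double* a, const double* b, double* c, int n) {
    cilk_for (int i = 0; i < n; i++) {
        cilk_for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }

    // the reducer gives each worker a private view, so += needs no lock
    cilk::reducer_opadd<double> diag(0.0);
    cilk_for (int i = 0; i < n; i++) {
        diag += c[i * n + i];
    }
    return diag.get_value();
}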
</pre>
[[File:Conc-31.png]]
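The idea behind an addition reducer can be sketched without Cilk Plus: each worker accumulates into its own private view, and the views are combined once at the end, so no lock is needed inside the loop. The following standard C++ sketch assumes a hypothetical function name <code>trace_with_views</code> and a simple strided split of the diagonal. It illustrates the concept only and is not how the Cilk Plus runtime implements <code>cilk::reducer_opadd</code>.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Sums the diagonal of an n x n row-major matrix with num_workers threads.
// Each worker writes only to its own slot in views, so there is no data race.
double trace_with_views(const double* c, int n, int num_workers) {
    assert(n > 0 && num_workers > 0);
    std::vector<double> views(num_workers, 0.0);  // one private accumulator per worker
    std::vector<std::thread> workers;

    for (int w = 0; w < num_workers; w++) {
        workers.emplace_back([&views, c, n, num_workers, w] {
            double local = 0.0;
            // worker w handles diagonal entries w, w + num_workers, ...
            for (int i = w; i < n; i += num_workers)
                local += c[i * n + i];
            views[w] = local;
        });
    }
    for (std::thread& t : workers)
        t.join();

    // combine the private views into the final result
    double total = 0.0;
    for (double v : views)
        total += v;
    return total;
}
```

Because this sketch adds the diagonal entries in a different order than a serial loop, floating-point results can differ in the last bits for general inputs.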
====matmul_4 (+vectorization)====
<pre>
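#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

// Cilk Plus version with a reducer and a vectorized inner loop.
double matmul_4(const double* a, const double* b, double* c, int n) {
    cilk_for (int i = 0; i < n; i++) {
        cilk_for (int j = 0; j < n; j++) {
            double sum = 0.0;
            // sum is a reduction variable, so it must be declared as one
            #pragma simd reduction(+:sum)
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }

    // the reducer gives each worker a private view, so += needs no lock
    cilk::reducer_opadd<double> diag(0.0);
    cilk_for (int i = 0; i < n; i++) {
        diag += c[i * n + i];
    }
    return diag.get_value();
}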
</pre>
[[File:Conc-41.png]]
===HPC Performance Characterization===
===Locks & Waits===
* Best for locating causes of low concurrency, such as heavily used locks and large critical sections.
* Lock contention occurs when threads wait too long on synchronization objects.
* Uses user-mode sampling and tracing collection to identify the processes and threads that are waiting.
* This analysis shows the time spent waiting on synchronization objects.
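The kind of behaviour the Locks and Waits analysis reports can be reproduced in a small experiment: several threads update a shared counter under one mutex, and each thread records how long it was blocked while acquiring the lock. The following is a minimal sketch in standard C++, assuming hypothetical names (<code>ContentionReport</code>, <code>run_contended</code>). The wait time is measured by hand with a clock, which is only an illustration and not the mechanism VTune uses.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical report type for the experiment.
struct ContentionReport {
    long counter;        // final value of the shared counter
    double wait_millis;  // total time all threads spent acquiring the lock
};

// Starts num_threads threads that each increment a shared counter
// `iterations` times under a single mutex.
ContentionReport run_contended(int num_threads, int iterations) {
    assert(num_threads > 0 && iterations >= 0);
    std::mutex m;
    long counter = 0;
    std::vector<double> waits(num_threads, 0.0);  // one slot per thread
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([&m, &counter, &waits, iterations, t] {
            for (int i = 0; i < iterations; i++) {
                const auto before = std::chrono::steady_clock::now();
                std::lock_guard<std::mutex> guard(m);  // blocks while another thread holds m
                const auto after = std::chrono::steady_clock::now();
                waits[t] += std::chrono::duration<double, std::milli>(after - before).count();
                ++counter;  // critical section
            }
        });
    }
    for (std::thread& w : workers)
        w.join();

    double total_wait = 0.0;
    for (double w : waits)
        total_wait += w;
    return {counter, total_wait};
}
```

Making the critical section longer, or adding threads, should increase the reported wait time. That is the pattern the bullets above describe as a heavily used lock or a large critical section.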
==Microarchitecture==
===General Exploration===
[[File:Lock1.png]]
[[File:Lock2.png]]
===Memory Access===
[[File:Lock3.png]]
==References==
* https://software.intel.com/en-us/vtune-amplifier-help-locks-and-waits-analysis
* https://software.intel.com/en-us/vtune-amplifier-help-hpc-performance-characterization-analysis
* https://software.intel.com/en-us/vtune-amplifier-help-general-exploration-analysis
* vtuneampxe_hotspots_win_c
* https://software.intel.com/en-us/vtune-amplifier-help-memory-access-analysis
* vtuneampxe_locks_win_c