===Basic hotspot analysis===
[[File:Summary.PNG]]
[[File:Function_timmings.PNG]]
The image above shows the timings for each function:
* matmul_0 - serial version
* matmul_1 - serial version with the j and k loops reversed
* matmul_2 - uses cilk_for
* matmul_3 - uses cilk_for and a reducer hyperobject
* matmul_4 - uses cilk_for, a reducer and vectorization
The hotspot analysis also shows the CPU time spent while each hotspot was executing and rates how effectively it ran, either by CPU usage or by thread concurrency.
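The per-function times reported by the profiler can be roughly cross-checked inside the program itself. The following is a minimal sketch, assuming plain wall-clock timing with <code>std::chrono</code> is good enough for a sanity check; it measures elapsed time, not the CPU time that the hotspot view reports. The names <code>matmul_serial</code>, <code>time_call</code>, <code>Timed</code> and <code>time_matmul</code> are hypothetical helpers and are not part of the workshop code.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Serial kernel with the same loop order as matmul_0 (i, j, k).
double matmul_serial(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}

// Hypothetical helper: runs f once and returns the elapsed wall-clock seconds.
template <typename F>
double time_call(F f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Result of one timed run: the trace returned by the kernel and the time taken.
struct Timed {
    double trace;
    double seconds;
};

// Hypothetical driver: multiplies two n x n matrices filled with 1.0 and 2.0.
Timed time_matmul(int n) {
    const std::size_t size = static_cast<std::size_t>(n) * n;
    std::vector<double> a(size, 1.0), b(size, 2.0), c(size, 0.0);
    double trace = 0.0;
    const double seconds = time_call([&] {
        trace = matmul_serial(a.data(), b.data(), c.data(), n);
    });
    return {trace, seconds};
}
```

Calling <code>time_matmul</code> for each variant with the same <code>n</code> gives numbers that can be compared against the profiler's ranking. A single run is noisy, so repeated runs are advisable before drawing conclusions.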
===Results of concurrency tests on Workshop 6===
====matmul_0 (Serial)====
<pre>
double matmul_0(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}
</pre>
[[File:Conc-01.png]]
[[File:Conc-02.png]]
====matmul_1 (Serial with j-k loops reversed)====
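<pre>
double matmul_1(const double* a, const double* b, double* c, int n) {
    // With k outside j, each (i, k) pair contributes to a whole row of c,
    // so c is zeroed first and accumulated into instead of using a scalar sum.
    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}
</pre>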
[[File:Conc-11.png]]
[[File:Conc-12.png]]
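Reversing the j and k loops is only a valid optimization if both loop orders produce the same matrix. With k as the middle loop, each (i, k) pair contributes to an entire row of c, so the result has to be accumulated into c rather than into a scalar. The following is a minimal sketch of such a check in standard C++; <code>matmul_ijk</code>, <code>matmul_ikj</code>, <code>max_abs_diff</code> and <code>interchange_error</code> are hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference order (i, j, k): one scalar sum per element of c.
void matmul_ijk(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Interchanged order (i, k, j): the inner loop walks b and c with unit stride.
void matmul_ikj(const double* a, const double* b, double* c, int n) {
    std::fill(c, c + static_cast<std::size_t>(n) * n, 0.0);
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}

// Largest element-wise absolute difference between two equally sized vectors.
double max_abs_diff(const std::vector<double>& x, const std::vector<double>& y) {
    double worst = 0.0;
    for (std::size_t i = 0; i < x.size(); i++)
        worst = std::max(worst, std::fabs(x[i] - y[i]));
    return worst;
}

// Runs both loop orders on the same small-integer inputs and compares them.
double interchange_error(int n) {
    const std::size_t size = static_cast<std::size_t>(n) * n;
    std::vector<double> a(size), b(size), c1(size), c2(size);
    for (std::size_t idx = 0; idx < size; idx++) {
        a[idx] = static_cast<double>(idx % 7) + 1.0;
        b[idx] = static_cast<double>(idx % 5) - 2.0;
    }
    matmul_ijk(a.data(), b.data(), c1.data(), n);
    matmul_ikj(a.data(), b.data(), c2.data(), n);
    return max_abs_diff(c1, c2);
}
```

The inputs are small integers, so every partial sum is exact in double precision and the two versions should agree exactly. With arbitrary floating-point data a small tolerance would be a safer comparison.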
====matmul_2 (cilk_for)====
<pre>
#include <cilk/cilk.h>

double matmul_2(const double* a, const double* b, double* c, int n) {
    cilk_for (int i = 0; i < n; i++) {
        cilk_for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    double diag = 0.0;
    for (int i = 0; i < n; i++)
        diag += c[i * n + i];
    return diag;
}
</pre>
[[File:Conc-21.png]]
[[File:Conc-22.png]]
====matmul_3 (cilk_for and reducer hyperobject)====
<pre>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

double matmul_3(const double* a, const double* b, double* c, int n) {
    cilk_for (int i = 0; i < n; i++) {
        cilk_for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    // The reducer lets the parallel loop add to diag without a data race.
    cilk::reducer_opadd<double> diag(0.0);
    cilk_for (int i = 0; i < n; i++) {
        diag += c[i * n + i];
    }
    return diag.get_value();
}
</pre>
[[File:Conc-31.png]]
[[File:Conc-32.png]]
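A reducer gives each worker its own private view of the accumulator and merges the views when the parallel region ends, which is why the parallel <code>diag +=</code> needs no lock. The following is a minimal sketch of the same idea in standard C++ with <code>std::thread</code>. It is an analogy only and not how Cilk Plus implements reducers; <code>parallel_trace</code> and <code>num_threads</code> are hypothetical names.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums the diagonal of an n x n row-major matrix using private partial sums.
// Assumes n >= 0. Each thread writes only its own slot of `partial`.
double parallel_trace(const double* c, int n, unsigned num_threads) {
    if (num_threads == 0)
        num_threads = 1;
    const std::size_t dim = static_cast<std::size_t>(n);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; t++) {
        workers.emplace_back([&, t] {
            double local = 0.0;  // private view of the accumulator
            for (std::size_t i = t; i < dim; i += num_threads)
                local += c[i * dim + i];
            partial[t] = local;
        });
    }
    for (auto& w : workers)
        w.join();
    // Merge step: combine the private views into the final value.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

The additions happen in a different order than in the serial loop, so for general floating-point data the result can differ from the serial sum in the last few bits.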
====matmul_4 (cilk_for, reducer and vectorization)====
<pre>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

double matmul_4(const double* a, const double* b, double* c, int n) {
    cilk_for (int i = 0; i < n; i++) {
        cilk_for (int j = 0; j < n; j++) {
            double sum = 0.0;
            // Ask the compiler to vectorize the inner loop.
            #pragma simd
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
    cilk::reducer_opadd<double> diag(0.0);
    cilk_for (int i = 0; i < n; i++) {
        diag += c[i * n + i];
    }
    return diag.get_value();
}
</pre>
[[File:Conc-41.png]]
[[File:Conc-42.png]]
====Final test with all functions running====
[[File:Conc-51.png]]
[[File:Conc-52.png]]
===HPC Performance Characterization===
===Locks & Waits===
* Best for locating causes of low concurrency, such as heavily used locks and large critical sections.
* A lock problem occurs when threads wait too long on synchronization objects.
* Uses user-mode sampling and tracing collection to identify processes.
* The analysis reports the time threads spend waiting on synchronization objects.
[[File:Lock2.png]]
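The kind of pattern this analysis is meant to expose can be illustrated with a heavily used lock. The following is a minimal sketch in standard C++, assuming a shared counter protected by a <code>std::mutex</code>. The first function takes the lock on every increment, while the second accumulates locally and takes the lock once per thread. <code>count_contended</code> and <code>count_local</code> are hypothetical names.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Heavily used lock: every increment acquires the same mutex,
// so threads spend much of their time waiting on each other.
long count_contended(unsigned num_threads, long per_thread) {
    long total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; t++) {
        workers.emplace_back([&] {
            for (long i = 0; i < per_thread; i++) {
                std::lock_guard<std::mutex> lock(m);
                total++;
            }
        });
    }
    for (auto& w : workers)
        w.join();
    return total;
}

// Reduced contention: each thread counts privately and locks once.
long count_local(unsigned num_threads, long per_thread) {
    long total = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; t++) {
        workers.emplace_back([&] {
            long local = 0;
            for (long i = 0; i < per_thread; i++)
                local++;
            std::lock_guard<std::mutex> lock(m);  // one short critical section
            total += local;
        });
    }
    for (auto& w : workers)
        w.join();
    return total;
}
```

Both functions return the same total. Only the number of lock acquisitions differs, so a profile of the first version would be expected to show far more wait time on the mutex.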
==References==
* https://software.intel.com/en-us/vtune-amplifier-help-locks-and-waits-analysis
* https://software.intel.com/en-us/vtune-amplifier-help-hpc-performance-characterization-analysis
* https://software.intel.com/en-us/vtune-amplifier-help-general-exploration-analysis
* vtuneampxe_hotspots_win_c
* https://software.intel.com/en-us/vtune-amplifier-help-memory-access-analysis
* vtuneampxe_locks_win_c