==vTune Amplifier with OpenMP (Alex)==
'''vTune Amplifier Overview:'''
We used vTune Amplifier to profile our code and check its initial performance.
 
vTune Amplifier is a code profiler that can identify:
* Where time is being spent during the code’s execution
* Amount of concurrency in the code
* Bottlenecks created by synchronization primitives
 
vTune Amplifier can perform multiple types of analysis, but we analyzed the ASCII art program using the '''Hotspot Analysis''' and the '''HPC Performance Characterization''' analysis.
 
To preface the following analysis: all of the measurements below were taken using a 30-second video at 1080p.
 
[[File: Vt_types.png]]
 
The initial hotspot analysis of our serial code took '''44.904 seconds''' to complete, with the '''imageToTextScaledNative''' function as the top hotspot.
[[File:Ascii_art_serial_hotspot.png]]
 
Looking at the thread visualization pane we can see that only one thread is used by the serial implementation of the function.
[[File:Ascii_art_serial_threads.png]]
 
To fix this, we added the directive:

 #pragma omp parallel for

to our code. Referring back to our pseudocode, the new code looked something like this (the index calculations are left as placeholder comments, just as in the pseudocode, and the loop bounds stand in for the real image and block dimensions):
 
 #pragma omp parallel for
 for (int j = 0; j < outRows; j++) {        // outRows/outCols: size of the output character grid
     for (int k = 0; k < outCols; k++) {
         int sum = 0;
         // sum the pixel values of the block of the frame that maps to this character
         for (int y = 0; y < blockH; y++) {
             for (int x = 0; x < blockW; x++) {
                 int index;                 // computed from j, k, x, y
                 sum += input[index];
             }
         }
         int ave = sum / (blockH * blockW); // block average, used to pick the ASCII character
         // write the pixels of the chosen ASCII character into the output frame
         for (int y = 0; y < blockH; y++) {
             for (int x = 0; x < blockW; x++) {
                 int index;                 // computed from j, k, x, y
                 int charIndex;             // computed from x, y
                 output[index] = ascii[charIndex];
             }
         }
     }
 }
 
This improved our overall runtime and reduced it to '''36.095 seconds'''.
 
[[File:Ascii_art_parallel_hotspot.png]]
 
Note that our overall CPU time has increased even though the elapsed time went down. This is a good sign: CPU time is summed across every thread, so a higher total means we are keeping more of the CPU busy at once.
 
Looking at our thread usage, we see that we are now using 8 threads instead of 1, and that vTune marks the parallel region with '''omp$parallel''' attached to our function name.
 
[[File:Ascii_art_parallel_threads.png]]
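
As a side note, the 8 threads are simply OpenMP's default: unless told otherwise, the runtime typically creates one thread per available hardware thread on the machine. Below is a minimal stand-alone sketch (not taken from our project code) showing the standard OpenMP runtime calls for checking and overriding that count; the same thing can also be done without recompiling via the OMP_NUM_THREADS environment variable.

 #include <omp.h>
 #include <cstdio>
 
 int main() {
     // Default team size: usually one thread per hardware thread on the machine.
     printf("max threads: %d\n", omp_get_max_threads());
 
     omp_set_num_threads(4); // request 4 threads for subsequent parallel regions
 
     #pragma omp parallel
     {
         #pragma omp single
         printf("threads in this region: %d\n", omp_get_num_threads());
     }
     return 0;
 }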
 
Then we ran our code through the HPC Performance Characterization analysis to verify the efficiency of our OpenMP implementation.
 
[[File:Ascii_art_hpc_og.png]]
 
This analysis showed that we could potentially gain '''1.311''' seconds of runtime by optimizing our OpenMP usage. The opportunity comes from an imbalance in the work distribution across our threads: one or more threads finish their share of the iterations before the others, then sit idle at the barrier at the end of the parallel region waiting for the remaining threads to finish, which ultimately leaves processing power unused.
 
To fix this, we added dynamic scheduling to our code. With a dynamic schedule the iterations are handed out at run time, so when a thread finishes its current work and there is more work left, it simply picks up the next piece instead of waiting.
 
So our pragma statement was changed to:
 
 #pragma omp parallel for schedule(dynamic)
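
To illustrate the effect, here is a minimal stand-alone sketch (not our project code; the per-iteration work is artificial) where later iterations are deliberately more expensive than earlier ones. With a static split some threads would finish early and idle, while with schedule(dynamic) each thread grabs the next unclaimed iteration as soon as it finishes its current one.

 #include <omp.h>
 #include <cstdio>
 
 int main() {
     const int n = 16;
     double results[16] = {0.0};
 
     // Iterations do unequal amounts of work, so a fixed up-front split would be
     // unbalanced; dynamic scheduling hands out iterations one at a time at run time.
     #pragma omp parallel for schedule(dynamic)
     for (int i = 0; i < n; i++) {
         double x = 0.0;
         for (long w = 0; w < (long)(i + 1) * 1000000L; w++)
             x += 1.0 / (double)(w + 1);
         results[i] = x;
     }
 
     printf("results[0] = %f, results[%d] = %f\n", results[0], n - 1, results[n - 1]);
     return 0;
 }

One thing to keep in mind with dynamic scheduling is that handing out iterations at run time has its own overhead, so it is often combined with a chunk size (for example, schedule(dynamic, 4)) when individual iterations are very cheap.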
 
Running our code again through an HPC Performance Characterization analysis yielded these results:
 
[[File:Ascii_art_hpc_cur.png]]
==Intel Adviser (Dmytro)==