Changes

Jump to: navigation, search

GPU621/Intel Parallel Studio VTune Amplifier

8,253 bytes added, 10:11, 9 December 2021
Conclusion
For more information on System Overview click [https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/platform-analysis-group/system-overview-analysis.html here]
==='''Versions of the software:'''===
*Standalone VTune Profiler Graphical Interface
*Web Server Interface
== Demonstration ==
===Baseline===
Prefix sums are trivial to compute in sequential models of computation, by using the formula {{math|1=''y<sub>i</sub>'' = ''y''<sub>''i'' &minus; 1</sub> + ''x<sub>i</sub>''}} to compute each output value in sequence order.
The Prefix Scan algorithm implemented in Serial, OpenMP and TBB version will be run with an array of size '''2^30'''. The metrics will be collected through Vtune.
 
===Single-thread Prefix Scan===
Starting with the Serial approach to the Prefix Scan Problem.
</source>
====Performance====
As expected, the serail version CPU utilization is considered poor due to the fact that only one thread is used for data utilization and Prefix Scan Algorithm. As can be seen from the Hotspot report the main function took '''2.297 ''' seconds under the Intel compiler with no optimization. Interestingly, the deallocation is also taken a lot of CPU time with '''0.6833 ''' seconds.
[[File:Serial_Scan.png]]
 
===OpenMP Prefix Scan===
Next goes the OpenMP implementation using Worksharing and barrier, single directives and 8 threads
====Code====
<source>
// Iurii Kondrakov
// OMP_Main.cpp
// 2021.12.07
 
#include <iostream>
#include <chrono>
#include <omp.h>
 
const int MAX_TILES = 8;
 
template <typename T, typename C>
T reduce(
const T* in, // points to the data set
int n, // number of elements in the data set
C combine, // combine operation
T initial // initial value
) {
 
for (int i = 0; i < n; i++)
initial = combine(initial, in[i]);
return initial;
}
 
template <typename T, typename C>
void excl_scan(
const T* in, // source data
T* out, // output data
int size, // size of data sets
C combine, // combine operation
T initial // initial value
) {
 
if (size > 0) {
for (int i = 0; i < size - 1; i++) {
out[i] = initial;
initial = combine(initial, in[i]);
}
out[size - 1] = initial;
}
}
 
template <typename T, typename C>
void scan(
const T* in, // source data
T* out, // output data
int size, // size of source, output data sets
C combine, // combine expression
T initial // initial value
)
{
if (size > 0) {
// allocate memory for maximum number of tiles
T stage1Results[MAX_TILES];
T stage2Results[MAX_TILES];
#pragma omp parallel num_threads(MAX_TILES)
{
int itile = omp_get_thread_num();
int ntiles = omp_get_num_threads();
int tile_size = (size - 1) / ntiles + 1;
int last_tile = ntiles - 1;
int last_tile_size = size - last_tile * tile_size;
 
// step 1 - reduce each tile separately
stage1Results[itile] = reduce(in + itile * tile_size,
itile == last_tile ? last_tile_size : tile_size,
combine, T(0));
#pragma omp barrier
 
// step 2 - perform exclusive scan on stage1Results
// store exclusive scan results in stage2Results[]
#pragma omp single
excl_scan(stage1Results, stage2Results, ntiles,
combine, T(0));
 
// step 3 - scan each tile separately using stage2Results[]
excl_scan(in + itile * tile_size, out + itile * tile_size,
itile == last_tile ? last_tile_size : tile_size,
combine, stage2Results[itile]);
}
}
}
 
// report system time
void reportTime(const char* msg, std::chrono::steady_clock::duration span) {
auto ms = std::chrono::duration_cast<std::chrono::microseconds>(span);
std::cout << msg << " - took - " <<
ms.count() << " microseconds" << std::endl;
}
 
int main(int argc, char** argv) {
if (argc > 2) {
std::cerr << argv[0] << ": invalid number of arguments\n";
std::cerr << "Usage: " << argv[0] << "\n";
std::cerr << "Usage: " << argv[0] << " power_of_2\n";
return 1;
}
std::cout << "OMP Prefix Scan" << std::endl;
 
// initial values for testing
const int N = 9;
const int in_[N]{ 3, 1, 7, 0, 1, 4, 5, 9, 2 };
 
// command line arguments - none for testing, 1 for large arrays
int n, nt{ 1 };
if (argc == 1) {
n = N;
}
else {
n = 1 << std::atoi(argv[1]);
if (n < N) n = N;
}
int* in = new int[n];
int* out = new int[n];
 
// initialize
for (int i = 0; i < N; i++)
in[i] = in_[i];
for (int i = N; i < n; i++)
in[i] = 1;
auto add = [](int a, int b) { return a + b; };
 
std::chrono::steady_clock::time_point ts, te;
 
// Exclusive Prefix Scan
ts = std::chrono::steady_clock::now();
scan<int, decltype(add)>(in, out, n, add, (int)0);
te = std::chrono::steady_clock::now();
 
std::cout << MAX_TILES << " thread" << (nt > 1 ? "s" : "") << std::endl;
for (int i = 0; i < N; i++)
std::cout << out[i] << ' ';
std::cout << out[n - 1] << std::endl;
reportTime("Exclusive Scan", te - ts);
 
delete[] in;
delete[] out;
}
</source>
 
====Performance====
As can be seen from the screenshot below, in the OpenMP solution, the work is spread unevenly between 8 threads. It can be described by the fact that the first node is responsible for initializing the arrays and single construct. Also, there is a lot of idle time due to the barrier construct. But the Prefix can itself seems to fall into average optimal CPU utilization.
 
[[File:OMP_Scan.png]]
 
===TBB Prefix Scan===
Finally, the TBB solution that uses tbb:parallel scan and Body functor, as well as auto partitioning
====Code====
<source>
// Iurii Kondrakov
// TBB_Main.cpp
// 2021.12.07
 
#include <iostream>
#include <chrono>
#include <tbb/tbb.h>
#include <tbb/parallel_reduce.h>
 
template<typename T, typename C>
class Body {
T accumul_;
const T* const in_;
T* const out_;
const T identity_;
const C combine_;
 
public:
Body(T* out, const T* in, T i, C c) :
accumul_(i), identity_(i),
in_(in), out_(out),
combine_(c) {}
 
Body(Body& src, tbb::split) :
accumul_(src.identity_), identity_(src.identity_),
in_(src.in_), out_(src.out_),
combine_(src.combine_) {}
 
template<typename Tag>
void operator() (const tbb::blocked_range<T>& r, Tag) {
T temp = accumul_;
for (auto i = r.begin(); i < r.end(); i++) {
if (Tag::is_final_scan())
out_[i] = temp;
temp = combine_(temp, in_[i]);
}
accumul_ = temp;
}
 
T get_accumul() {
return accumul_;
}
 
void reverse_join(Body& src) {
accumul_ = combine_(accumul_, src.accumul_);
}
 
void assign(Body& src) {
accumul_ = src.accumul_;
}
};
 
 
// report system time
void reportTime(const char* msg, std::chrono::steady_clock::duration span) {
auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(span);
std::cout << msg << " - took - " <<
ms.count() << " milliseconds" << std::endl;
}
 
int main(int argc, char** argv) {
if (argc > 3) {
std::cerr << argv[0] << ": invalid number of arguments\n";
std::cerr << "Usage: " << argv[0] << "\n";
std::cerr << "Usage: " << argv[0] << " power_of_2\n";
std::cerr << "Usage: " << argv[0] << " power_of_2 grainsize\n";
return 1;
}
 
unsigned grainsize{ 0 };
 
if (argc == 3) {
grainsize = (unsigned)atoi(argv[2]);
std::cout << "TBB Prefix Scan - grainsize = "
<< grainsize << std::endl;
} else {
std::cout << "TBB Prefix Scan - auto partitioning" << std::endl;
}
 
// initial values for testing
const int N = 9;
const int in_[N]{ 3, 1, 7, 0, 1, 4, 5, 9, 2 };
 
// command line arguments - none for testing, 1 for large arrays
int n;
if (argc == 1) {
n = N;
}
else {
n = 1 << std::atoi(argv[1]);
if (n < N) n = N;
}
int* in = new int[n];
int* out = new int[n];
 
// initialize
for (int i = 0; i < N; i++)
in[i] = in_[i];
for (int i = N; i < n; i++)
in[i] = 1;
auto add = [](int a, int b) { return a + b; };
 
// Inclusive Prefix Scan
std::chrono::steady_clock::time_point ts, te;
ts = std::chrono::steady_clock::now();
Body<int, decltype(add)> body(out, in, 0, add);
if (argc == 3)
tbb::parallel_scan(tbb::blocked_range<int>(0, n, grainsize), body);
else
tbb::parallel_scan(tbb::blocked_range<int>(0, n), body);
te = std::chrono::steady_clock::now();
for (int i = 0; i < N; i++)
std::cout << out[i] << ' ';
std::cout << out[n - 1] << std::endl;
reportTime("Exclusive Scan", te - ts);
 
delete[] in;
delete[] out;
}
</source>
 
====Performance====
As can be seen from the screenshot below, there is a lot of overhead due to tbb::parallel_scan scheduling. Additionally, it seems that most work is done by thread 1, which can be explained by the fact that the array is still initialized serially. The solution can be optimized by choosing the proper grain size which is the first suggestion Vtune gave.
 
[[File:TBB_Scan.png]]
== Sources ==
70
edits

Navigation menu