DPS921/OpenACC vs OpenMP Comparison
Project Overview
The idea of this project is to introduce OpenACC as a parallel programming standard, compare how parallelization is done in the two libraries, and identify the benefits of each. According to the descriptions of both libraries, OpenACC performs parallelization more automatically, whereas OpenMP lets developers manually mark the regions to parallelize and assign them to threads. The deliverables of this project are an introduction to OpenACC, a performance comparison between OpenMP and OpenACC, and a discussion of when to use each one.
Group Members
1. Ruiqi Yu
2. Hanlin Li
3. Le Minh Pham
Progress
OpenACC
What is OpenACC
OpenACC (Open Accelerators) is a programming standard for parallel computing on accelerators such as GPUs, mainly targeting Nvidia GPUs. OpenACC is designed to simplify GPU programming. Unlike CUDA and OpenCL, where programs must be rewritten in a different style to achieve GPU acceleration, OpenACC takes an approach similar to OpenMP: directives are inserted into the code to offload computation onto GPUs and parallelize it at the CUDA-core level. Programmers can create efficient parallel OpenACC code with only minor changes to serial CPU code.
Example
#pragma acc kernels
{
    for (int i = 0; i < N; i++) {
        y[i] = a * x[i] + y[i];
    }
}
GPU offloading
[Figure: GPU offloading diagram]
Installation
OpenACC compilation was originally supported only by the PGI compiler, which required a subscription; new options have become available in recent years.
Nvidia HPC SDK
Evolved from the PGI compiler community edition
GCC
The latest GCC release, GCC 10, supports OpenACC 2.6
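As a rough sketch of how either toolchain is invoked (the file name `saxpy.c` is a placeholder, and exact binary names depend on the installed versions):

```shell
# NVIDIA HPC SDK (successor to PGI): -acc enables OpenACC,
# -Minfo=accel reports what the compiler offloaded.
nvc -acc -Minfo=accel saxpy.c -o saxpy

# GCC 10 or later: -fopenacc enables OpenACC directive processing.
gcc-10 -fopenacc saxpy.c -o saxpy
```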
OpenMP vs OpenACC
OpenMP GPU offloading
We compare against OpenMP because OpenMP added support for offloading to accelerators in OpenMP 4.0, via the `target` constructs. OpenACC uses directives to tell the compiler where to parallelize loops and how to manage data between host and accelerator memories. OpenMP takes a more generic approach: it allows programmers to explicitly spread the execution of loops, code regions, and tasks across teams of threads.
OpenMP's directives tell the compiler to generate parallel code in that specific way, leaving little room to the discretion of the compiler and the optimizer.
Code comparison
Explicit conversions
OpenACC:

#pragma acc kernels
{
    #pragma acc loop worker
    for (int i = 0; i < N; i++) {
        tmp = …;
        array[i] = tmp * …;
    }
    #pragma acc loop vector
    for (int i = 0; i < N; i++)
        array2[i] = …;
}

OpenMP:

#pragma omp target
{
    #pragma omp parallel for private(tmp)
    for (int i = 0; i < N; i++) {
        tmp = …;
        array[i] = tmp * …;
    }
    #pragma omp simd
    for (int i = 0; i < N; i++)
        array2[i] = …;
}
ACC parallel
OpenACC:

#pragma acc parallel
{
    #pragma acc loop
    for (int i = 0; i < N; i++) {
        tmp = …;
        array[i] = tmp * …;
    }
    #pragma acc loop
    for (int i = 0; i < N; i++)
        array2[i] = …;
}

OpenMP:

#pragma omp target
#pragma omp parallel
{
    #pragma omp for private(tmp) nowait
    for (int i = 0; i < N; i++) {
        tmp = …;
        array[i] = tmp * …;
    }
    #pragma omp for simd
    for (int i = 0; i < N; i++)
        array2[i] = …;
}
ACC Kernels
OpenACC:

#pragma acc kernels
{
    for (int i = 0; i < N; i++) {
        tmp = …;
        array[i] = tmp * …;
    }
    for (int i = 0; i < N; i++)
        array2[i] = …;
}

OpenMP:

#pragma omp target
#pragma omp parallel
{
    #pragma omp for private(tmp)
    for (int i = 0; i < N; i++) {
        tmp = …;
        array[i] = tmp * …;
    }
    #pragma omp for simd
    for (int i = 0; i < N; i++)
        array2[i] = …;
}
Copy vs. PCopy
With OpenACC, `copy` always transfers the array at the region boundaries, while `pcopy` (present_or_copy) transfers it only if the data is not already present on the device; OpenMP expresses the same data movement with `map` clauses and `target update`.

OpenACC:

int x[10], y[10];
#pragma acc data copy(x) pcopy(y)
{
    ...
    #pragma acc kernels copy(x) pcopy(y)
    {
        // Accelerator Code
        ...
    }
    ...
}

OpenMP:

int x[10], y[10];
#pragma omp target data map(x,y)
{
    ...
    #pragma omp target update to(x)
    #pragma omp target map(y)
    {
        // Accelerator Code
        ...
    }
}