CDOT Wiki

DPS921/OpenACC vs OpenMP Comparison

2,674 bytes added, 21:23, 3 December 2020
- Nov 9, 2020: Added project description
- Nov 13, 2020: Determined content sections to be discussed
- Nov 15, 2020: Investigated OpenACC learning material and tools
- Nov 18, 2020: Successfully installed the required compiler and compiled OpenACC code
- Nov 19, 2020: Added MPI to the discussion
=== OpenMP CPU Implementation ===
An OpenMP implementation would look like the following, with shared data and a max reduction on the error value:
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
#pragma omp parallel for shared(nx, Anew, A) reduction(max:err)
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
#pragma omp parallel for shared(nx, Anew, A)
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
=== OpenACC Basic Implementation ===
A basic OpenACC implementation looks like this, marking each loop with <code>parallel loop</code>:
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
#pragma acc parallel loop reduction(max:err)
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
#pragma acc parallel loop
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
Or you can let the compiler handle it by using <code>kernels</code> instead of <code>parallel loop</code>; during compilation the compiler reports how it decided the code should be parallelized.
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
#pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
#pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
 
=== OpenMP GPU Basic Implementation ===
Here's the basic OpenMP GPU implementation. Almost everything is the same; we just need to enclose the loops in an <code>omp target</code> region:
<source>
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
#pragma omp target
    {
#pragma omp parallel for shared(nx, Anew, A) reduction(max:err)
        for (int i = 1; i < nx-1; i++) {
            Anew[i] = 0.5f * (A[i+1] + A[i-1]);
            err = fmax(err, fabs(Anew[i] - A[i]));
        }
#pragma omp parallel for shared(nx, Anew, A)
        for (int i = 1; i < nx-1; i++) {
            A[i] = Anew[i];
        }
        iter++;
    }
}
</source>
 
=== OpenACC Proper Implementation ===
The above implementation is actually slower than the serial version, because data is transferred between the host and the accelerator at the end of every iteration. To prevent that, we copy the data into the accelerator's memory up front and copy it out only when done.
<source>
#pragma acc data copyin(A[0:nx]) copyout(Anew[0:nx])
while ( err > tol && iter < iter_max ) {
    err = 0.0f;
#pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }
#pragma acc kernels
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
In the above code, we added a <code>copyin(list)</code> clause for the original matrix of values and a <code>copyout(list)</code> clause for the computed matrix of results. There are other related clauses: <code>copy(list)</code>, which combines <code>copyin(list)</code> and <code>copyout(list)</code>; <code>create(list)</code>, which allocates a memory region on the accelerator without copying any data into it; and <code>present(list)</code>, which indicates that the data is already on the accelerator and is often used together with <code>create(list)</code>.
 
=== OpenMP GPU Proper Implementation ===
Similar to OpenACC, the basic version is slow because of data transfers. To optimize the OpenMP version, you also need to explicitly tell the threads how to form teams and how to distribute the work.
<source>
#pragma omp target data map(alloc:Anew[0:nx]) map(A[0:nx])
while ( err > tol && iter < iter_max ) {
    err = 0.0f;

#pragma omp target teams distribute parallel for reduction(max:err) schedule(static,1)
    for (int i = 1; i < nx-1; i++) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }

#pragma omp target teams distribute parallel for schedule(static,1)
    for (int i = 1; i < nx-1; i++) {
        A[i] = Anew[i];
    }
    iter++;
}
</source>
 
=== Execution time ===
Here we can insert additional instructions into the inner loop specifying how many gangs and vectors to use. <code>Gang</code> and <code>vector</code> are OpenACC terminology: a <code>vector</code> is one thread performing a single operation on multiple data (SIMD), a <code>worker</code> computes one <code>vector</code>, and a <code>gang</code> comprises multiple workers that share resources.
 
[[File: Gangworkervector.png]]
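As a sketch of how those terms appear in code: <code>num_gangs</code> and <code>vector_length</code> are standard OpenACC clauses, but the particular counts below are arbitrary examples, and good values depend on the accelerator. A non-OpenACC compiler ignores the pragma and runs the loop serially.

```c
/* Sketch: explicitly requesting 32 gangs of 128-wide vectors for the
 * copy-back loop. These counts are illustrative only; the best values
 * are hardware-dependent. */
void copy_back(float *A, const float *Anew, int nx) {
    #pragma acc parallel loop num_gangs(32) vector_length(128)
    for (int i = 1; i < nx - 1; i++) {
        A[i] = Anew[i];
    }
}
```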
== OpenACC with MPI ==
When OpenMP and OpenACC work together, it is usually one CPU driving several accelerators, as that is how OpenMP works. When there are multiple CPUs, each with access to multiple accelerators, OpenMP alone is not enough, and we can introduce MPI.
As we have learned, MPI enables communication and data transfer between processes during parallel execution. With multiple accelerators, one way to use the two together is to use MPI to communicate between the different accelerators.
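A common pattern is for each MPI rank to bind itself to one accelerator. The helper below is a hypothetical sketch (the name <code>device_for_rank</code> is ours, not from any library); in real code the rank comes from <code>MPI_Comm_rank()</code> and the binding is done with <code>acc_set_device_num()</code> from <code><openacc.h></code>.

```c
/* Hypothetical helper: map an MPI rank to a local accelerator using
 * round-robin assignment. In a real program you would call, e.g.:
 *
 *     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *     int ndev = acc_get_num_devices(acc_device_nvidia);
 *     acc_set_device_num(device_for_rank(rank, ndev), acc_device_nvidia);
 */
int device_for_rank(int rank, int num_devices) {
    if (num_devices <= 0)
        return 0;                  /* no accelerator: fall back to device 0 */
    return rank % num_devices;     /* round-robin across local devices */
}
```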