DPS921/OpenACC vs OpenMP Comparison

== OpenACC with OpenMP ==
OpenMP and OpenACC can be used together, although PGI has stated that there are still some issues when interoperating between OpenMP and OpenACC, since their runtime libraries are not completely thread-safe [https://pgroup.com/resources/openacc_faq.htm]; they are looking to improve the interaction between the two libraries in future releases. Using the example above, we can easily come up with something like

<source>
#pragma acc data copyin(A[0:nx]) copyout(Anew[0:nx])
while ( err > tol && iter < iter_max ) {
    err = 0.0f;

    #pragma omp parallel for shared(nx, Anew, A)
    #pragma acc kernels
    for( int i = 1; i < nx-1; i++ ) {
        Anew[i] = 0.5f * (A[i+1] + A[i-1]);
        err = fmax(err, fabs(Anew[i] - A[i]));
    }

    #pragma omp parallel for shared(nx, Anew, A)
    #pragma acc kernels
    for( int i = 1; i < nx-1; i++ ) {
        A[i] = Anew[i];
    }

    iter++;
}
</source>

This way, for each thread created by OpenMP, the computation will be offloaded to an accelerator, with the results joined back together. Combining OpenACC and OpenMP may be overkill for the 1D example; a 2D example is a better fit:

<source>
#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max )
{
    error = 0.0f;

    #pragma omp parallel for shared(m, n, Anew, A)
    #pragma acc kernels loop gang(32), vector(16)
    for( int j = 1; j < n-1; j++ ) {
        #pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i] );
            error = fmaxf( error, fabsf(Anew[j][i] - A[j][i]) );
        }
    }

    #pragma omp parallel for shared(m, n, Anew, A)
    #pragma acc kernels loop
    for( int j = 1; j < n-1; j++ ) {
        #pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}
</source>

Here we can insert additional instructions into the loops on how many gangs and vectors to use. <code>Gang</code> and <code>vector</code> are OpenACC terminology: a <code>vector</code> is one thread performing a single operation on multiple data (SIMD), a <code>worker</code> computes one <code>vector</code>, and a <code>gang</code> comprises multiple workers that share resources.
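A hybrid program like the ones above must be built with both models enabled; the exact flags are compiler-specific. As a minimal sketch with the PGI compiler (assuming the code is saved in a hypothetical file <code>jacobi.c</code>):

<source>
pgcc -acc -mp -Minfo=accel jacobi.c -o jacobi
</source>

Here <code>-acc</code> enables the OpenACC directives, <code>-mp</code> enables the OpenMP directives, and <code>-Minfo=accel</code> makes the compiler report how each loop was mapped onto the accelerator.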
== OpenACC with MPI ==
When OpenMP and OpenACC work together, it is usually one CPU driving several accelerators, because OpenMP is limited to a single shared-memory system. When there are multiple CPUs, each with access to its own accelerators, OpenMP alone is not enough, and we can introduce MPI.
As we have learned, MPI allows communication and data transfer between processes during parallel execution. In the case of multiple accelerators, one way to combine the two is to give each MPI process its own accelerator and use MPI to exchange data between them.
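A minimal sketch of this pattern (not from the course material, and assuming NVIDIA GPUs with an OpenACC-capable MPI toolchain): each rank binds itself to a device with the OpenACC runtime routines <code>acc_get_num_devices</code> and <code>acc_set_device_num</code>, offloads its own slice of the work, and lets MPI join the partial results.

<source>
#include <mpi.h>
#include <openacc.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Bind each rank to a different accelerator on its node
       (assumes NVIDIA devices; real code would also handle ngpus == 0). */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(rank % ngpus, acc_device_nvidia);

    /* Each rank computes a partial sum over its own slice of the range. */
    const int n = 1 << 20;
    int chunk = n / size;
    float local = 0.0f;

    #pragma acc parallel loop reduction(+:local)
    for (int i = 0; i < chunk; i++) {
        float x = (float)(rank * chunk + i);
        local += x * 0.5f;
    }

    /* MPI joins the per-accelerator results back together. */
    float total = 0.0f;
    MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
</source>

In a real stencil code like the Jacobi examples above, each rank would own a block of the grid and use MPI to exchange the boundary (halo) rows with its neighbours between iterations.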