Team DAG
Team Members
- Chris Schreiber, Team Lead
Progress
Assignment 1
Project selection discussed with Chris Szalwinski. Configuring local working environment and hardware for working with the CERN project source code.
Profiling the Drive_God_lin program using only 1 core/thread on the CPU (forcing serialized execution of all OpenMP pragmas in the C and Fortran code) showed 4 primary targets to rewrite as CUDA kernels.
All 4 of these procedure calls are part of the Fortran library included for executing the analysis.
(Each sample counts as 0.01 seconds.)
Methods most likely to offer parallel improvements via CUDA kernels (top 5 from the flat profile):

 % Time | Cumulative sec | Self sec | Calls  | Self ms/call | Total ms/call | Name
 -------|----------------|----------|--------|--------------|---------------|-----------
 47.66  | 9.99           | 9.99     | 314400 | 0.03         | 0.03          | zfunr_
 30.45  | 16.73          | 6.38     | 524    | 12.18        | 12.18         | ordres_
 10.84  | 18.64          | 2.27     | 314400 | 0.01         | 0.01          | cfft_
 3.53   | 19.38          | 0.74     | 314400 | 0.00         | 0.04          | tunelasr_
 3.34   | 20.08          | 0.70     | 1048   | 0.67         | 13.42         | spectrum_
Drive_God_lin.c already contains an OpenMP pragma for parallelization. This should be converted to a CUDA kernel so the work runs across many GPU cores instead of the available CPU threads (or, as in the test run, a single thread with OMP_NUM_THREADS = 1). Because the test run was limited to 1 thread, the program executed serially, producing a profile in which the largest percentages of time are spent in 5 of the Fortran methods.
The first priority is to examine how the parallel pragma in Drive_God_lin.c divides the task up into CPU threads (forking), and whether that process, or smaller steps of it, can be rewritten to be executed by CUDA threads.
A parallel region is a block of code that will be executed by multiple threads. This is the fundamental OpenMP parallel construct.
From the Drive_God_lin.c code:
 while (readDrivingTerms(drivingTermsFile, &turns, dataFilePath, sizeof(dataFilePath))) {
     ... /* loop containing code to parse datafile terms from the DrivingTermsFilePath. */
     /* includes File IO */
     #pragma omp parallel for private(i, horizontalBpmCounter, verticalBpmCounter, kk, maxamp, calculatednattunex, calculatednattuney)
     for (i = pickstart; i < maxcounthv; ++i) {
         ...
         // call to sussix4noise Fortran program code.
     }
OpenMP provides three directives that are merely conveniences:
PARALLEL DO / parallel for
PARALLEL SECTIONS
PARALLEL WORKSHARE (Fortran only)
For the most part, these directives behave identically to an individual PARALLEL directive being immediately followed by a separate work-sharing directive. Most of the rules, clauses and restrictions that apply to both directives are in effect. See the OpenMP API for details. An example using the PARALLEL DO / parallel for combined directive is shown below.
 #pragma omp parallel for shared(a,b,c,chunk) private(i) schedule(static,chunk)
 for (i = 0; i < n; i++)
     c[i] = a[i] + b[i];
In the Drive_God_lin.c pragma the variable list is private with no shared clause, which means each thread created for the parallel region gets its own copy of each listed variable.
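For comparison, below is a minimal sketch of the same element-wise addition written as a CUDA kernel rather than an OpenMP loop: each GPU thread handles one index i instead of the loop iterations being divided among CPU threads by the OpenMP runtime. The kernel name, launch configuration and device pointers (d_a, d_b, d_c) are illustrative assumptions, not code from the project.

 // Sketch only: OpenMP "parallel for" vector add expressed as a CUDA kernel.
 __global__ void vecAdd(const float* a, const float* b, float* c, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
     if (i < n)
         c[i] = a[i] + b[i];
 }

 // Host-side launch sketch (d_a, d_b, d_c already copied to device memory):
 //   int threads = 256, blocks = (n + threads - 1) / threads;
 //   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);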
The important loop prior to the Fortran call is shown below:
 for (kk = 0; kk < MAXTURNS; ++kk) {
     doubleToSend[kk] = matrix[horizontalBpmCounter][kk];
     doubleToSend[kk + MAXTURNS] = matrix[verticalBpmCounter][kk];
     doubleToSend[kk + 2 * MAXTURNS] = 0.0;
     doubleToSend[kk + 3 * MAXTURNS] = 0.0;
 }
 /* This calls the external Fortran code (tbach) */
 sussix4drivenoise_(&doubleToSend[0], &tune[0], &amplitude[0], &phase[0], &allfreqsx[0], &allampsx[0], &allfreqsy[0], &allampsy[0], sussixInputFilePath);
This fills the 'doubleToSend[]' array from the data buffers previously read from the data files.
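As an illustration of how a loop like this might map onto CUDA threads, below is a minimal sketch with one thread per turn index kk. It assumes 'matrix' has been flattened to a row-major device array and that doubleToSend already resides in device memory; the kernel name and parameters are hypothetical, and the heavy work (the Fortran sussix4drivenoise_ call) would still run on the CPU, so any real speed-up has to come from moving that analysis to the GPU as well.

 // Sketch only: the packing loop above expressed as a CUDA kernel.
 // 'matrix' is assumed flattened row-major: element (row, kk) = matrix[row * maxTurns + kk].
 __global__ void packDoubleToSend(double* doubleToSend, const double* matrix,
                                  int horizontalBpm, int verticalBpm, int maxTurns)
 {
     int kk = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per turn
     if (kk >= maxTurns) return;
     doubleToSend[kk]                = matrix[horizontalBpm * maxTurns + kk];
     doubleToSend[kk + maxTurns]     = matrix[verticalBpm * maxTurns + kk];
     doubleToSend[kk + 2 * maxTurns] = 0.0;
     doubleToSend[kk + 3 * maxTurns] = 0.0;
 }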
A further restriction on the efficiency of the existing code, in addition to the file IO, is the inclusion of a CRITICAL pragma, which restricts execution of that block of code to 1 thread at a time. This may be difficult to convert or optimize for CUDA, since these sections of code appear to make frequent use of file IO, and parallel IO may not be available in the operating environment.
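For reference, a minimal illustrative example of an OpenMP critical section (not taken from the project code; the file name and work done per iteration are made up) showing how it serializes access to a shared resource such as an output file:

 /* Illustrative only: a critical section lets the loop body run in parallel
    while forcing the shared file write to happen one thread at a time. */
 #include <stdio.h>
 #include <omp.h>

 int main(void)
 {
     FILE *out = fopen("results.txt", "w");      /* shared output file */
     #pragma omp parallel for
     for (int i = 0; i < 100; ++i) {
         int value = i * i;                      /* per-iteration work runs in parallel */
         #pragma omp critical
         {
             /* only one thread at a time may execute this block */
             fprintf(out, "%d\n", value);
         }
     }
     fclose(out);
     return 0;
 }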
Due to difficulties in code conversion, limitations of direct file IO in the methods, and library linking, the attempt to optimize methods from the Drive_God project has been abandoned for a more feasible topic.
New Project Topic: Optimizing Summed-Area-Tables with GPGPU algorithms.
Work is based on initial examples from another member of the class, examining an improved serial algorithm and using it as a step towards a parallel kernel.
The improved serial algorithm uses a 2-pass prefix-scan sum to generate the Summed-Area-Table. The kernel code will start by implementing the 2-pass prefix-scan sum on the GPU, and may then look at an improved algorithm published by Nehab, Maximo, Lima and Hoppe in 2011 that reduces memory traffic by overlapping computation of block partitions and buffering intermediate partition sums across multiple stage passes.
Initial naive serial algorithm (iterative summation): roughly 1/4 * n^4 operations, i.e. (1/2 * n^2) * (1/2 * n^2), based on the number of additions iterated from 0 to n across both dimensions.
 /* Creates the summed-area table: naive iterative summation.
    b[i][j] = sum of a[k][m] for k = i..size-1 and m = 0..j, so every
    output cell re-sums its whole region (roughly n^4/4 additions). */
 void summarizedAreaTable(float** a, float** b, int size){
     float sum = 0.0;
     for(int i = size-1; i >= 0; i--){
         for(int j = 0; j < size; j++){
             for(int k = i; k < size; k++){
                 for(int m = 0; m <= j; m++){
                     sum += a[k][m];
                 }
             }
             b[i][j] = sum;
             sum = 0.0;
         }
     }
 }
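A minimal usage sketch for the naive version above; the allocation pattern and all-ones test values are assumptions made for illustration, not the project's actual driver code:

 /* Usage sketch: assumes the summarizedAreaTable() definition above is linked in. */
 #include <stdio.h>
 #include <stdlib.h>

 void summarizedAreaTable(float** a, float** b, int size);

 int main(void)
 {
     const int size = 4;
     float **a = malloc(size * sizeof(float*));
     float **b = malloc(size * sizeof(float*));
     for (int i = 0; i < size; i++) {
         a[i] = malloc(size * sizeof(float));
         b[i] = malloc(size * sizeof(float));
         for (int j = 0; j < size; j++)
             a[i][j] = 1.0f;                      /* all-ones test pattern */
     }

     summarizedAreaTable(a, b, size);

     /* with all-ones input, b[0][size-1] covers the whole matrix: size * size */
     printf("b[0][%d] = %.1f (expected %d)\n", size - 1, b[0][size - 1], size * size);

     for (int i = 0; i < size; i++) { free(a[i]); free(b[i]); }
     free(a);
     free(b);
     return 0;
 }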
Improved serial SAT algorithm: 2-pass serial prefix sum, roughly 2 * n^2 operations. (Requires 1.5x the memory of the naive version above, to store interim stage values.)
 void twopassscan_SAT(float** a, float** b, float** c, int size)
 {
     // perform serial prefix sum on the horizontal dimension - store in b
     for(int j = 0; j < size; j++)
     {
         b[0][j] = a[0][j];   // no recursive math for the first column -- else out of bounds
         for(int i = 1; i < size; i++)
         {
             // result element = source element + prefix sum of the previous item in the row
             b[i][j] = a[i][j] + b[i-1][j];
         }
     }

     // now perform the prefix sum along the vertical dimension on intermediate result b
     for(int x = 0; x < size; x++)
     {
         c[x][0] = b[x][0];   // no recursive sum for the first element down each column
         for(int y = 1; y < size; y++)
         {
             c[x][y] = b[x][y] + c[x][y-1];
         }
     }
 }
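As a first step towards the GPU version, below is a minimal sketch of the 2-pass prefix sum expressed as CUDA kernels, with one thread scanning each row (then each column) of a row-major flattened array. The flat layout, kernel names and launch parameters are assumptions; this per-thread serial scan is only a starting point and is not yet a work-efficient parallel scan (e.g. the Nehab et al. 2011 approach mentioned above).

 // Sketch only: pass 1 - each thread computes the inclusive prefix sum of one row.
 __global__ void prefixSumRows(const float* a, float* b, int size)
 {
     int row = blockIdx.x * blockDim.x + threadIdx.x;
     if (row >= size) return;
     float running = 0.0f;
     for (int col = 0; col < size; ++col) {
         running += a[row * size + col];
         b[row * size + col] = running;    // prefix sum along the row
     }
 }

 // Sketch only: pass 2 - each thread computes the inclusive prefix sum of one column.
 __global__ void prefixSumCols(const float* b, float* c, int size)
 {
     int col = blockIdx.x * blockDim.x + threadIdx.x;
     if (col >= size) return;
     float running = 0.0f;
     for (int row = 0; row < size; ++row) {
         running += b[row * size + col];
         c[row * size + col] = running;    // prefix sum down the column
     }
 }

 // Host-side launch sketch (d_a, d_b, d_c already allocated/copied on the device):
 //   int threads = 256, blocks = (size + threads - 1) / threads;
 //   prefixSumRows<<<blocks, threads>>>(d_a, d_b, size);
 //   prefixSumCols<<<blocks, threads>>>(d_b, d_c, size);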