== Progress ==
=== Assignment 1 ===
Profiling the Drive_God_lin program utilizing only 1 core/thread on the CPU (forcing serialized execution of all OpenMP pragmas in the C and Fortran code) showed 4 primary targets to rewrite as CUDA kernels. The profile was collected with gprof, where each sample counts as 0.01 seconds.
Drive_God_lin.c already contains an OpenMP pragma for parallelization. This should be converted to a CUDA kernel so that the work is spread across many GPU cores instead of the available CPU threads (or, as in the test run, OMP_NUM_THREADS = 1). Because the test run used a single thread, the program executes serially, which produces a profile in which the largest percentages of time are consumed by a handful of the Fortran methods.
The first priority is to examine how the parallel pragma in the Drive_God_lin.c program divides the task into CPU threads (forking), and whether that process, or smaller steps of it, can be rewritten to be executed by CUDA threads (see the sketch after the code excerpt below).
 while (readDrivingTerms(drivingTermsFile, &turns, dataFilePath, sizeof(dataFilePath))) {
     ... /* code to parse data-file terms from the DrivingTermsFilePath; includes file IO */
     #pragma omp parallel for private(i, horizontalBpmCounter, verticalBpmCounter, kk, maxamp, calculatednattunex, calculatednattuney)
     for (i = pickstart; i < maxcounthv; ++i) {
         ...
         /* call to the sussix4drivenoise Fortran routine */
     }
 }
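As a first sketch of that mapping, the loop could be expressed as a CUDA kernel in which each thread handles one value of i. Everything below is hypothetical (the kernel name, the buffer names, and the flattened matrix layout), and the loop body is reduced to a placeholder: the real body performs file IO and calls into Fortran, neither of which can execute on the device.

 #include <cuda_runtime.h>

 /* Hypothetical sketch: one CUDA thread per iteration of the OpenMP loop.
    matrix is assumed flattened to matrix[bpm * maxTurns + turn], since a
    kernel cannot follow the host's pointer-to-pointer representation. */
 __global__ void bpmIterationKernel(const double *matrix, double *results,
                                    int pickstart, int maxcounthv, int maxTurns)
 {
     int i = pickstart + blockIdx.x * blockDim.x + threadIdx.x;
     if (i >= maxcounthv)
         return;  /* guard threads in the partial last block */
     /* Placeholder body: the real per-BPM work (data preparation plus the
        sussix4drivenoise_ analysis) would have to be ported to device code. */
     for (int kk = 0; kk < maxTurns; ++kk)
         results[(size_t)i * maxTurns + kk] = matrix[(size_t)i * maxTurns + kk];
 }

A launch covering i = pickstart .. maxcounthv-1 would look like bpmIterationKernel<<<(maxcounthv - pickstart + 255) / 256, 256>>>(d_matrix, d_results, pickstart, maxcounthv, MAXTURNS); where d_matrix and d_results are device buffers.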
OpenMP provides three directives that are merely conveniences:
* PARALLEL DO / parallel for
* PARALLEL SECTIONS
* PARALLEL WORKSHARE (Fortran only)
An example using the PARALLEL DO / parallel for combined directive is shown below.
 #pragma omp parallel for shared(a,b,c,chunk) private(i) schedule(static,chunk)
 for (i = 0; i < n; i++)
     c[i] = a[i] + b[i];
The private clause identifies variables for which each thread created for this parallel execution gets its own copy (here the loop index i), while the variables in the shared list (a, b, c, and chunk) are visible to all threads.
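For comparison, a minimal CUDA version of the same vector addition is sketched below (a hypothetical standalone file, compiled with nvcc; cudaMallocManaged requires CUDA 6 or later). Each CUDA thread takes the place of one iteration of the OpenMP loop; there is no direct equivalent of schedule(static,chunk), because the hardware scheduler assigns thread blocks to multiprocessors.

 #include <stdio.h>
 #include <cuda_runtime.h>

 /* Each thread computes one element, replacing one OpenMP loop iteration. */
 __global__ void vecAdd(const float *a, const float *b, float *c, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         c[i] = a[i] + b[i];
 }

 int main(void)
 {
     const int n = 1 << 20;
     float *a, *b, *c;
     /* Unified memory keeps the sketch short; explicit cudaMemcpy also works. */
     cudaMallocManaged(&a, n * sizeof(float));
     cudaMallocManaged(&b, n * sizeof(float));
     cudaMallocManaged(&c, n * sizeof(float));
     for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

     vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
     cudaDeviceSynchronize();

     printf("c[0] = %f\n", c[0]);  /* expect 3.0 */
     cudaFree(a); cudaFree(b); cudaFree(c);
     return 0;
 }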
The important loop prior to the Fortran call is below:

 for (kk = 0; kk < MAXTURNS; ++kk) {
     doubleToSend[kk] = matrix[horizontalBpmCounter][kk];
     doubleToSend[kk + MAXTURNS] = matrix[verticalBpmCounter][kk];
     doubleToSend[kk + 2 * MAXTURNS] = 0.0;
     doubleToSend[kk + 3 * MAXTURNS] = 0.0;
 }
 /* This calls the external Fortran code (tbach) */
 sussix4drivenoise_(&doubleToSend[0], &tune[0], &amplitude[0], &phase[0],
                    &allfreqsx[0], &allampsx[0], &allfreqsy[0], &allampsy[0],
                    sussixInputFilePath);
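This packing loop itself is trivially parallel over kk, so it could run as a device kernel; a sketch under the same hypothetical flattened-matrix assumption as above is given below. Note that this only pays off if the matrix data already resides on the GPU, since sussix4drivenoise_ itself still runs on the host.

 /* Hypothetical kernel: one thread per turn fills all four segments of
    doubleToSend for the given horizontal/vertical BPM pair. */
 __global__ void packTurns(const double *matrix, double *doubleToSend,
                           int horizontalBpm, int verticalBpm, int maxTurns)
 {
     int kk = blockIdx.x * blockDim.x + threadIdx.x;
     if (kk >= maxTurns)
         return;
     doubleToSend[kk]                = matrix[horizontalBpm * maxTurns + kk];
     doubleToSend[kk + maxTurns]     = matrix[verticalBpm * maxTurns + kk];
     doubleToSend[kk + 2 * maxTurns] = 0.0;
     doubleToSend[kk + 3 * maxTurns] = 0.0;
 }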
A further restriction on the efficiency of the existing code, in addition to the file IO around the Fortran call, is the inclusion of a CRITICAL pragma, which restricts execution of that block of code to one thread at a time. This may be difficult to convert or optimize for CUDA, since these sections of code appear to make frequent use of file IO, and parallel IO may not be available in the operating environment.
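When the code a critical section protects is a simple in-memory update (rather than file IO), the usual CUDA substitute is an atomic operation. The example below is hypothetical and unrelated to Drive_God_lin; it only illustrates that narrow case using the built-in atomicAdd. A critical section around file IO has no device-side equivalent.

 /* Hypothetical example: threads increment shared histogram bins. The
    atomicAdd serializes only colliding updates, not the whole block of
    code the way an OpenMP critical section would. */
 __global__ void histogramKernel(const unsigned int *data, int n,
                                 unsigned int *bins, int nbins)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         atomicAdd(&bins[data[i] % nbins], 1u);
 }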
----
Due to difficulties in code conversion, limitations of direct file IO in the methods, and library-linking problems, the attempt to optimize methods from the Drive_God project has been abandoned in favor of a more feasible topic.
The replacement topic is a summed-area table computation; the naive serial implementation is below:

 void summarizedAreaTable(float** a, float** b, int size) {
     float sum = 0.0;
     for (int i = size - 1; i >= 0; i--) {
         for (int j = 0; j < size; j++) {
             for (int k = i; k < size; k++) {
                 for (int m = 0; m <= j; m++) {
                     sum += a[k][m];
                 }
             }
             b[i][j] = sum;
             sum = 0.0;
         }
     }
 }
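Each output element b[i][j] depends only on the input a, so the two outer loops parallelize naturally: one CUDA thread per output element, with each thread performing the two inner summation loops. A sketch under the usual flattened-array assumption:

 /* One thread per output element: thread (i, j) computes the same value as
    b[i][j] in the serial version. Arrays are flattened to row pitch `size`
    because the kernel cannot use the host's float** layout. */
 __global__ void summedAreaKernel(const float *a, float *b, int size)
 {
     int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column */
     int i = blockIdx.y * blockDim.y + threadIdx.y;  /* row    */
     if (i >= size || j >= size)
         return;
     float sum = 0.0f;
     for (int k = i; k < size; ++k)
         for (int m = 0; m <= j; ++m)
             sum += a[k * size + m];
     b[i * size + j] = sum;
 }

A 2-D launch such as dim3 block(16, 16); dim3 grid((size + 15) / 16, (size + 15) / 16); summedAreaKernel<<<grid, block>>>(d_a, d_b, size); covers the whole output. Each thread still does O(size^2) work, so a more serious version would use a prefix-sum (scan) formulation instead; the sketch above is only the direct translation of the serial loops.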
=== Assignment 2 ===
=== Assignment 3 ===