BetaT
Coupled with Maxwell's equations, they can be used to model and study magnetohydrodynamics. Courtesy of Wikipedia ("https://en.wikipedia.org/wiki/Navier%E2%80%93Stokes_equations")
=== Problem: Application Code to be Parallelized ===
The problem with this application comes in the main function, where it calculates the finite-difference approximation of the wave across the grid.
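The loop itself is not quoted at this point of the page. As a rough sketch only, reconstructed from the loop fragments quoted further down (the names u, un, c, dt, dx and nx are taken from those fragments; this is not the author's exact code), the serial finite-difference calculation has this shape:

 // Sketch of the serial finite-difference update (reconstruction, not the original source).
 // u and un are nx-by-nx grids; c, dt and dx come from the problem setup.
 #include <vector>
 
 void calculateWave(std::vector<std::vector<double>>& u,
                    std::vector<std::vector<double>>& un,
                    int nx, double c, double dt, double dx)
 {
     for (int it = 1; it <= nx - 1; it++)          // march through the time steps
     {
         for (int k = 0; k <= nx - 1; k++)         // copy the previous time step
             un[k][it - 1] = u[k][it - 1];
 
         for (int i = 1; i <= nx - 1; i++)         // update every interior grid point
             u[i][it] = un[i][it - 1]
                      - c * dt / dx * (un[i][it - 1] - un[i - 1][it - 1]);
     }
 }

Every time step depends on the previous one, so the outer loop is inherently serial; the work inside each time step is what the GPU kernels later parallelize.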
=== Initial Speed Tests run with no optimization on Linux ===
By using the command cat /proc/cpuinfo on the command line we can list the system's CPU specifications.
{| class="wikitable sortable" border="1" cellpadding="5"
|+ Time Comparison
! n !! Time (ms)
|-
||12500 x 12500 || 220198||
|}
 
=== gprof ===
System Specifications
 
== Application 2: Calculating Pi ==
This application is straightforward: it calculates Pi to the decimal place given by the user. An input of 10 versus 100,000 will calculate Pi to either the 10th or the 100,000th decimal place.
=== Problem: Application Code to be Parallelized ===
Inside the function calculate we have two for loops. I believe these two for loops are what cause the delay in the program's execution time.
=== Initial Speed Tests run with no optimization on Linux ===
For this test the Linux VM has the following specifications:
{| class="wikitable sortable" border="1" cellpadding="5"
|+ Time Comparison
! n !! Time (ms)
|-
||500000 ||671163||
|}
 
=== gprof ===
'''
 for (int i = 0; i <= nx - 1; i++)
 {
     if (i*dx >= 0.5 && i*dx <= 1)
         u[i][it] = un[i][it-1] - c*dt/dx * (un[i][it-1] - un[i-1][it-1]);
 }
 }'''
 u[k * nt + 0] = 1;
 }
 for (int it = 1; it <= nx - 1; it++)
 {
     u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx * (un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);
 }
 }'''
After these implementations, testing the code produced the same results as the original program, so it is a positive confirmation that we can proceed to optimizing the code using the GPU.
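Since the GPU version flattens the original 2D arrays into 1D buffers, the equivalence the test above relies on is simply the row-major index mapping u[i][it] -> u[i * nx + it]. A tiny sketch of that mapping (the names and sizes here are chosen only for illustration):

 #include <cassert>
 
 // Row-major flattening: element (row i, column it) of an nx-wide grid
 // lives at offset i * nx + it in the 1D buffer.
 inline int flatten(int i, int it, int nx)
 {
     return i * nx + it;
 }
 
 int main()
 {
     const int nx = 4;                      // illustrative width only
     assert(flatten(0, 0, nx) == 0);        // first element of the first row
     assert(flatten(2, 3, nx) == 11);       // row 2, column 3 of a 4-wide grid
     return 0;
 }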
== Optimizing Problems: Parallelizing with 2 Kernels ==

The kernels have been initialized as a 2D grid: '''dim3 dGrid(nbx, nbx);''' and '''dim3 dBlock(ntpb, ntpb);'''

In the first kernel I have replaced the for loop statement. The goal of this first statement was to set the first value in each column to either 1 or 2, based off the condition in the if statement. The for loop is not needed.

=== INITIALIZE KERNEL ===

 __global__ void Initalize(float* u, float* un, int nx, int nt, float dx)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     if (i < nx && j < nx)
     {
         if (i*dx >= 0.5 && i*dx <= 1)
         {
             u[i * nx] = 2;
         }
         else
         {
             u[i * nx] = 1;
         }
     }
 }

=== CALCULATE WAVE KERNEL ===

This was the tricky part in converting the original code into the kernel. I have removed the 2 inner for loops but kept the outer loop. The program takes 2 arrays; let us say the X's represent the arrays below.

 __global__ void Calculate(float* u, float* un, int nx, int c, float dx, float dt)
 {
     int j = blockIdx.x * blockDim.x + threadIdx.x;
     int i = blockIdx.y * blockDim.y + threadIdx.y;
     // hoisted out of the loop because there is no need to compute this NX amount of times
     float total = c*dt / dx;
     if (i < nx && j < nx)
     {
         for (int it = 1; it <= nx - 1; it++)
         {
             if (i != 0 || i < nx)
             {
                 un[i * nx + it - 1] = u[i * nx + it - 1];
                 __syncthreads();
                 u[it] = un[1 * nx + it - 1];
                 __syncthreads();
                 u[i * nx + it] = un[i * nx + it - 1] - c*dt / dx * (un[i * nx + it - 1] - un[(i - 1) * nx + it - 1]);
                 __syncthreads();
             }
         }
     }
 }

==== HOW THE ALGORITHM WORKS ====

This is focusing on the algorithm inside the CALCULATE kernel only.

1. We begin with 2 arrays.

[[File:2Arrazs.png]]

2. The first column of the first array is initialized by the INITIALIZE kernel.

[[File:Initialize.png]]

3. The second array copies the values from the first column of the first array.

[[File:Copy1stColumn.png]]

4. The first array copies a single value from the second array.

[[File:2ndCall.png]]

5. The remaining values of the 2nd column of the first array are calculated through the second array as follows.

[[File:3rdCall.png]]

6. The 2nd column of the first array is now copied into the 2nd column of the second array, and the cycle is repeated until finished.

[[File:LAstReset.png]]

== CPU VS GPU Loop Comparisons Only ==

Executing the program again with a problem size of 2000 x 2000, or 4,000,000 elements, yields the following results. Keep in mind these times are only for the kernel launches and not the program as a whole.
 PARALLELIZED GPU CODE
 First for loop - took - 0 millisecs
 2nd for Loop - took - 0 millisecs
 Press any key to continue . . .

 ORIGINAL CPU CODE
 Initialize arrays loop - took - 17 millisecs
 First for loop - took - 1 millisecs
 2nd for Loop - took - 15373 millisecs
 Press any key to continue . . .

At the beginning of the program, a handle was created for grabbing the computer's CUDA device; through this handle we can extract the CUDA device properties to examine our parameters.

 //list the properties
 std::cout << "Name: " << prop.name << std::endl;
 std::cout << "Compute Capability: " << prop.major << '.' << prop.minor << std::endl;
 std::cout << "Total Global Memory: " << prop.totalGlobalMem << std::endl;
 std::cout << "Max Threads per block: " << prop.maxThreadsPerBlock << std::endl;
 std::cout << "Clock Rate in khz: " << prop.clockRate << "\n\n";

We grab the device's properties so that we do not exceed its resources.
= OPTIMIZATION =
== OVERALL EXECUTION OF PROGRAM FOR CPU, PARALLELIZED GPU AND OPTIMIZED CODE ==

TIMES ARE IN MILLISECONDS
{| class="wikitable sortable" border="1" cellpadding="5"
|+ Time Comparison
! N !! Linux !! Visual No Parallel !! Parallelized
|-
||2000 ^ 2 ||1160 ||20520 ||6749
|-
||5000 ^ 2 ||28787 ||127373 ||n/a
|-
||10000 ^ 2 ||124179 ||522576 ||n/a
|}
=== Windows Display Driver Crash for problem size > 2000 x 2000 ===

When I try to give the program an argument of over 2000 x 2000, it informs me that the Windows display driver has crashed and rebooted. After some research I discovered that this is an issue caused by the kernel taking too long to execute. Windows has a default time limit after which it will reset the CUDA GPU if it thinks the device is frozen because of the amount of time it is taking to perform its calculations. This is called the Timeout Detection & Recovery (TDR) mechanism. A potential solution I found on the CUDA programming forum on NVIDIA's website suggested I try the following in the registry.

To change the graphics device timeout, use the following steps:

1. Exit all apps and programs.
2. Press the WinKey+R keys to display the Run dialog.
3. Type regedit.exe and click OK to open the registry editor.
4. Navigate to the following registry key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers
5. With the GraphicsDrivers key selected, on the Edit menu, click New, and then select the registry value from the drop-down menu specific to your version of Windows (32 bit or 64 bit). (NOTE: The TdrDelay name is case sensitive.)

For 64 bit Windows:

a. Select QWORD (64-bit) value.
b. Type TdrDelay as the Name and click Enter.
c. Double-click TdrDelay, add 8 for the Value data and click OK.

The above potential solution did not solve my problem. The second solution I found was to change one of the properties of the GPU device, named kernelExecTimeoutEnabled. This property supposedly controls whether or not the device can be timed out: a value of 1 means it can be timed out, while a value of 0 means the timeout is disabled.
The above also did not solve my issue with the display driver crashing.
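For reference, the flag can at least be inspected through the same device-property structure used earlier; a minimal sketch (device 0 assumed) is shown below. The CUDA runtime only reports the value, while the timeout itself is governed by the operating system's TDR settings, which is consistent with the change not being possible from inside the program.

 #include <iostream>
 #include <cuda_runtime.h>
 
 int main()
 {
     cudaDeviceProp prop;
     cudaGetDeviceProperties(&prop, 0);            // device 0 assumed
     std::cout << "kernelExecTimeoutEnabled: "
               << prop.kernelExecTimeoutEnabled    // 1 = kernels can be timed out
               << std::endl;
     return 0;
 }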
==== Solution to Windows Display Driver Crashing ====
The best way to prevent this error from happening is to make sure the kernel does not take too long to execute... So I altered my code and switched the Kernel Launch statement from a 2D grid to a 1D grid.
This reduced the number of threads firing in the Calculate kernel: in the old version, all of the threads from the y dimension were sitting idle, doing nothing except slowing down the execution.
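As a rough sketch of that change (nbx and ntpb are the grid and block constants already used on this page; the device pointer names d_u and d_un and the exact launch lines are assumptions, not the author's code):

 // Sketch only: d_u and d_un are the device copies of u and un (names assumed).
 // Before: a 2D grid of 2D blocks - every thread in the y dimension ends up idle.
 dim3 dGrid(nbx, nbx);
 dim3 dBlock(ntpb, ntpb);
 Calculate<<<dGrid, dBlock>>>(d_u, d_un, nx, c, dx, dt);
 
 // After: a 1D grid of 1D blocks - one thread per row, enough blocks to cover nx rows.
 int nblocks = (nx + ntpb - 1) / ntpb;
 Calculate<<<nblocks, ntpb>>>(d_u, d_un, nx, c, dx, dt);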
==== PARALLELIZED CALCULATE WAVE KERNEL ====

 __global__ void Calculate(float* u, float* un, int nx, int c, float dx, float dt)
 {
     int j = blockIdx.x * blockDim.x + threadIdx.x;
     int i = blockIdx.y * blockDim.y + threadIdx.y;
     if (i < nx && j < nx)
     {
         for (int it = 1; it <= nx - 1; it++)
         {
             if (i != 0 || i < nx)
             {
                 un[i * nx + it - 1] = u[i * nx + it - 1];
                 __syncthreads();
                 u[it] = un[1 * nx + it - 1];
                 __syncthreads();
                 u[i * nx + it] = un[i * nx + it - 1] - c*dt / dx * (un[i * nx + it - 1] - un[(i - 1) * nx + it - 1]);
                 __syncthreads();
             }
         }
     }
 }
===== OPTIMIZED CALCULATE WAVE KERNEL CHANGES =====

The code below has been altered to remove the (j) variable and combine the two (if) statements into one, so that we can reduce thread divergence, and to move the recurring (c*dt/dx) instruction into a variable called total, so that each thread is not repeatedly performing the same operation, which would cause a decrease in performance.
 // kernel
 __global__ void Calculate(float* u, float* un, int nx, int c, float dx, float dt)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     float total = c*dt / dx;
     if (i < nx && i != 0)
     {
         for (int it = 1; it <= nx - 1; it++)
         {
             un[i * nx + it - 1] = u[i * nx + it - 1];
             __syncthreads();
             u[0 * nx + it] = un[1 * nx + it - 1];
             __syncthreads();
             u[i * nx + it] = un[i * nx + it - 1] - total * (un[i * nx + it - 1] - un[(i - 1) * nx + it - 1]);
             __syncthreads();
         }
     }
 }

With this optimized code it is now possible to execute with a problem size greater than 2000 x 2000.

==== ORIGINAL INITIALIZATION KERNEL ====

The Initialize kernel has also been redesigned. Below is the original:

 __global__ void Initalize(float* u, float* un, int nx, int nt, float dx)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     if (i < nx && j < nx)
     {
         if (i*dx >= 0.5 && i*dx <= 1)
         {
             u[i * nx] = 2;
             __syncthreads();
         }
         else
         {
             u[i * nx] = 1;
             __syncthreads();
         }
     }
 }

===== OPTIMIZED INITIALIZATION KERNEL CHANGES =====

I removed the variable (j) and the __syncthreads() calls, which were not needed. I also removed the function running on the CPU that initializes all indexes in the arrays to 0, and moved that work into the GPU kernel below.

 __global__ void Initalize(float* u, float* un, int nx, int nt, float dx)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < nx)
     {
         for (int it = 0; it < nx; it++)
         {
             u[i * nx + it] = 0;
             un[i * nx + it] = 0;
         }
         if (i*dx >= 0.5 && i*dx <= 1)
             u[i * nx] = 2;
         else
             u[i * nx] = 1;
     }
 }

== POST OPTIMIZATION - Execution Comparison Times ==

If you have not, please take a look at section 3.1.1.1 (just above), as it shows how the first iteration of optimization was delivered. Below is a comparison of times from the original CPU version to the newly optimized kernel execution. These comparison times are for the WHOLE execution of the program, not just parts: they include memory transfers, allocation, de-allocation and calculations.

TIMES ARE IN MILLISECONDS

{| class="wikitable sortable" border="1" cellpadding="5"
|+ Time Comparison
! N !! Linux !! Visual No Parallel !! Parallelized !! Optimized_A
|-
||2000 ^ 2 ||1160 ||20520 ||6749 ||971
|-
||5000 ^ 2 ||28787 ||127373 ||n/a ||1417
|-
||10000 ^ 2 ||124179 ||522576 ||n/a ||3054
|}

[[File:ParallelizedVSOptimized.png]]

== SECOND OPTIMIZATION ==

=== Shared Memory ===

In order to speed up the execution time I will incorporate shared memory into the Calculate kernel. The problem I am facing is determining in what way to use shared memory.

As I outlined above in section 2.2.2 regarding how the calculation on each array is performed, the program calculates column by column, not row by row; however, it also moves between rows after calculating each column. I can only allocate a static array, not a dynamic one, so my shared memory array will be the same size as my predefined ntpb variable, which represents the number of threads I use per block. As of writing this, my ntpb variable is 32, so each shared array will be 128 bytes. I cannot copy the whole array into shared memory, and I cannot copy the array row by row, so we will need to copy the array column by column into shared memory.

As for the second array, it has become clear that it is no longer needed: we can simply use the shared memory array to perform the calculations for each column, save the results into the next column of the original array, then copy that column into the shared array and repeat the calculations.

=== SHARED MEMORY KERNEL ===

 // kernel
 __global__ void Calculate(float* u, float* un, int nx, int c, float dx, float dt)
 {
     __shared__ float s[ntpb];
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int t = threadIdx.x;
     float total = c*dt / dx;
     if (i < nx && i != 0 && t != 0)
     {
         for (int it = 1; it <= nx - 1; it++)
         {
             s[t - 1] = u[(i - 1) * nx + it - 1];
             u[it] = s[1];
             __syncthreads();
             u[i * nx + it] = s[t] - total * (s[t] - s[t - 1]);
             __syncthreads();
         }
     }
 }

=== EXECUTION COMPARISON BETWEEN OPTIMIZED AND SHARED KERNELS ===

Below, in milliseconds, are the execution times for the former kernel and the new shared memory kernel.

{| class="wikitable sortable" border="1" cellpadding="5"
|+ Time Comparison
! n !! Optimized !! Shared
|-
||2000 x 2000 ||971|| 661
|-
||5000 x 5000 ||1417|| 936
|-
||10000 x 10000 ||3054|| 2329
|}

== THIRD OPTIMIZATION ==

=== SAVING TRAVEL COSTS BY REMOVING THE UNNECESSARY ARRAY ===

As we discovered above, the second array is not necessary once all of the calculations are performed in shared memory, as can be seen in section 3.3.2.
This provides us with the ability to further optimize our kernel by reducing the amount of time we spend transferring data across the PCI bus. Below is an image of the data transfer times for the CALCULATE kernel. Since both of the original arrays are not needed in the final kernel solution, we can save 50% of our transfer time across the PCI bus by removing one of the arrays.

[[File:MEmCpy10000.png]]
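A minimal sketch of what that saving looks like on the host side (the names h_u and d_u are assumptions, the kernel call is omitted, and this is not the author's exact allocation code):

 #include <vector>
 #include <cuda_runtime.h>
 
 int main()
 {
     const int nx = 10000;                                  // matches the 10000 x 10000 case above
     const size_t bytes = (size_t)nx * nx * sizeof(float);
     std::vector<float> h_u((size_t)nx * nx);
 
     float* d_u = nullptr;
     cudaMalloc(&d_u, bytes);
 
     // With un gone, only one array crosses the PCI bus in each direction.
     cudaMemcpy(d_u, h_u.data(), bytes, cudaMemcpyHostToDevice);   // one upload instead of two
     // ... the Calculate kernel runs here, working out of d_u and shared memory ...
     cudaMemcpy(h_u.data(), d_u, bytes, cudaMemcpyDeviceToHost);   // one download instead of two
 
     cudaFree(d_u);
     return 0;
 }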
=== GETTING 100% OCCUPANCY PER MULTIPROCESSOR ===

'''Occupancy Calculator: The CUDA Toolkit includes a spreadsheet that accepts as parameters the compute capability, the number of threads per block, the number of registers per thread and the shared memory per block. It evaluates these parameters against the resource limitations of the specified compute capability. The spreadsheet is named CUDA_Occupancy_Calculator.xls and is stored under the ../tools/ sub-directory of the installed Toolkit.'''

Source: https://scs.senecac.on.ca/~gpu610/pages/content/resou.html
With the existing CALCULATE Kernel the CUDA Occupancy Calculator is providing the following statistics as shown below...
[[File:OriginalCalculator.png]]
The current CALCULATE kernel is only utilizing 50% of the multiprocessor, as shown above. If the threads per block are switched from 32 to 512 we will achieve 100% occupancy, as shown below.

[[File:100Calculator.png]]

=== CALCULATE KERNEL ===

Here is the final CALCULATE kernel for the application. The changes include the removal of the second array.

 // kernel
 __global__ void Calculate(float* u, int nx, int c, float dx, float dt)
 {
     __shared__ float s[ntpb];
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int t = threadIdx.x;
     float total = c*dt / dx;
     if (i < nx && i != 0 && t != 0)
     {
         for (int it = 1; it <= nx - 1; it++)
         {
             s[t - 1] = u[(i - 1) * nx + it - 1];
             u[it] = s[1];
             __syncthreads();
             u[i * nx + it] = s[t] - total * (s[t] - s[t - 1]);
             __syncthreads();
         }
     }
 }

=== OPTIMIZATION TIME COMPARISONS ===
Below is a graph comparing times between optimizations, illustrating the amount of execution time saved in each iteration. Times are listed in milliseconds.
[[File:OPTIMIZATIONCOMPARISON.png]]

= CONCLUSIONS =

== OVERALL TIME COMPARISONS ==

Below are the final comparisons of all execution times between the CPU and GPU. All times are in milliseconds.

[[File:finalCompare.png]]

== APPLICATION OUTPUT ==

Upon completion of the application, it will create a file based on the output of the algorithm. The image below displays that output, comparing the original program to the parallelized program.

[[File:outputs.png]]

== FINAL THOUGHTS ==

Upon completion of this project I have learned a few things:
First, I learned that not all programs can be parallelized, even if they seem to be a good candidate to begin with.
Secondly, understanding the algorithm of the application is a key factor in being able to optimize the solution, because sometimes you will need to rearrange the code in order to obtain better performance from the GPU, and understanding the algorithm will help ensure that the output at the end of the program remains the same.
Thirdly, the management of resources and constraints: registers, shared memory, constant memory, latency, threads and multiprocessors are all factors which need to be considered when using the GPU. Understanding how these resources can impact and influence your program helps in deciding which ones to use in specific situations.