212
edits
Changes
BetaT
,no edit summary
equations they can be used to model and study magnetohydrodynamics. courtesy of wikipedia ("https://en.wikipedia.org/wiki/Navier%E2%80%93Stokes_equations")
=== problem Application Code to be parallelized===
The problem with this application comes in the main function trying to calculate the finite-difference
}
=== Initial Speed Tests ran with no optimization on linux ===
By using the command line argument cat /proc/cpuinfo
||12500 x 12500 || 220198||
|}
=== gprof ===
System Specifications
== Application 2 Calculating Pi==
This application is pretty straightforward, it calculates Pi to the decimal point which is given by the user. So an input of 10 vs 100,000 will calculate Pi to either the 10th or 100 thousandth decimal.
=== problem Application code to be parallelized ===
Inside the function calculate we have:
I Believe the 2 for loops will cause a delay in the program execution time.
=== Initial Speed Tests ran with no optimization on linux ===
for this test the linux VM has:
||500000 ||671163||
|}
=== gprof ===
The next problem encountered is like a vector we cannot transfer a 2D array to a *cudaMalloc* or *cudaMemCPy* function call. So the 2D array needs to be converted into a single array. And that brings another problem regarding the original algorithm which is shown below.
'''
for (int i=0; i <= nx-1; i++)
{
if (i*dx >= 0.5 && i*dx <= 1)
u[i][it] = un[i][it-1] - c*dt/dx*(un[i][it-1]-un[i-1][it-1]);
}
'''
for (int k = 0; k <= nx - 1; k++) if (k*dx >= 0.5 && k*dx <= 1) { u[k * nt + 0] = 2; } else { u[k * nt + 0] = 1; } for (int it = 1; it <= nx - 1; it++) { for (int k = 0; k <= nx - 1; k++) { un[k * nx + it - 1] = u[k * nx + it - 1]; } for (int m = 1; m <= nx - 1; m++) { u[0 * nx + it] = un[1 * nx + it - 1]; u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]); } }''' It is possible to now iterate through a single dimensional array using the original algorithm. One more small problem came up, not every element in the array is initialized so there are parts which are pointing to a nullptr garbage in memory. Beause originally the program used a vector, by default all elememnts are initialized to 0, now to solve this problem a separate function is implemented to fill each array index with he value of 0 upon initialization. After these implementations, testing the code produced the same results as the original program, so it is a positive confirmation that we can proceed to optimizing the cod using the GPU == Parallelizing with 2 Kernels ==The kernels have been initialized as a 2D Grid '''dim3 dGrid(nbx, nbx); AND dim3 dBlock(ntpb, ntpb);''' In the first kernel I have Replaced the for loop statement.The goal of this first statement was to set the first value in each column to either 1 or 2 based off the condition in the if statement.The for loop is not needed. === INITIALIZE KERNEL === __global__ void Initalize(float* u, float* un, int nx, int nt, float dx) { int i = blockIdx.x * blockDim.x + threadIdx.x; int j = blockIdx.y * blockDim.y + threadIdx.y; if (i < nx && j < nx) { if (i*dx >= 0.5 && i*dx <= 1) { u[i * nx] = 2; } else { u[i * nx] = 1; } } } === CALCULATE WAVE KERNEL === This was the tricky part in converting the original code into the kernel.I have removed the 2 inner for loops but kept the outer loop.The program takes 2 arrays. Let us say the X's represent the arrays below __global__ void Calculate (float* u, float* un,int nx, int c, float dx, float dt) { int j = blockIdx.x * blockDim.x + threadIdx.x; int i = blockIdx.y * blockDim.y + threadIdx.y; // removes from instructions because no need to do this NX amount of times float total = c*dt / dx; if (i < nx && j < nx) { for (int it = 1; it <= nx- 1; it++) { if (i != 0 || i < nx ) { un[i * nx + it-1] = u[i * nx + it-1]; __syncthreads(); u[it] = un[1 * nx + it - 1]; __syncthreads(); u[i * nx + it ] = un[i * nx + it- 1] - c*dt / dx* (un[i * nx + it - 1] - un[(i - 1) * nx + it - 1]); __syncthreads(); } } } ==== HOW THE ALGORITHM WORKS ==== This is focusing on the algorithm inside the CALCULATE Kernel only. 1. We begin with 2 Arrays [[File:2Arrazs.png]] 2. The first column of the First array is initialized by the INITIALIZE Kernel. [[File:Initialize.png]] 3. The second array copies the values from the first column of the First array [[File:Copy1stColumn.png]] 4. The First array copies a single value from the Second array [[File:2ndCall.png]] 5. The remaining values for the 2nd column of the First array are calculated through the Second array as follows. [[File:3rdCall.png]] 6. The 2nd column of the First array is now copied into the 2nd column of the Second array and the cycle is repeated until finished. [[File:LAstReset.png]] == CPU VS GPU Loop Comparisons Only== Executing the program again with a problem size of 2000 2000 or 4,000,000 we yield the following results. Keep in mind these times are only for the kernel launches and not the program as a whole. PARALLIZED GPU CODE Fist for loop - took - 0 millisecs 2nd for Loop - took - 0 millisecs Press any key to continue . . . ORIGINAL CPU CODE Initialize arrays loop - took - 17 milliseconds Fist for loop - took - 1 millisecs 2nd for Loop - took - 15373 millisecs Press any key to continue . . . = OPTIMIZATION = == OVERALL EXECUTION OF PROGRAM FOR CPU, PARALLELIZED GPU AND OPTIMIZED CODE == TIMES ARE IN MILLISECONDS N Linux Visual No Parallel Parallized 2000 ^ 2 1160 20520 6749 5000 ^ 2 28787 127373 n/a 10000 ^ 2 124179 522576 n/a === Windows Display Driver Crash for problem size > 2000 & 2000 === When I try to give the program an argument of over 2000 & 2000 it will inform me that the windows dispay driver has crashed and rebooted. After some research I discovered that this is an issue caused by the kernel taking too long to execute. Windows has a default time limit where it will reset the CUDA GPU if it thinks it is frozen due to the amount of time it is taking to perform its calculations. This is called the Timeout detection & recovery method (TDR). A potential solution I found on the CUDA programming forum on NVidea's website suggested I try the following in the registry: To Change the Graphic device timeout, use the following steps. Exit all Apps and Programs. Press the WinKey+R keys to display the Run dialog. Type regedit.exe and click OK to open the registry editor. Navigate to the following registry key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers With the GraphicsDrivers key selected, on the Edit menu, click New, and then select the following registry value from the drop-down menu specific to your version of Windows (32 bit, or 64 bit): (NOTE: The TdrDelay name is Case Sensitive) For 64 bit Windows a. Select QWORD (64-bit) value. b. Type TdrDelay as the Name and click Enter. c. Double-click TdrDelay and add 8 for the Value data and clickOK. The above potential solution did not solve my problem.... The second solution I found was to change one of the properties on the GPU device named: kernelExecTimeoutEnabled;This property supposedly controls whether or not the device can be timed out. A value of (1) means it can be timed out, while a value of (0) means it is disabled. The above also did not solve my issue with the display driver crashing. ==== Solution to Windows Display Driver Crashing ==== The best way to prevent this error from happening is to make sure the kernel does not take too long to execute... So I altered my code and switched the Kernel Launch statement from a 2D grid to a 1D grid. This reduced the number of threads firing in the kernel. In the Calculate Kernel which is below you can see the old one had all the threads from the ( y dimension) sitting idle doing nothing except slowing down the execution. ==== PARALLELIZED CALCULATE WAVE KERNEL ==== __global__ void Calculate (float* u, float* un,int nx, int c, float dx, float dt) { int j = blockIdx.x * blockDim.x + threadIdx.x; int i = blockIdx.y * blockDim.y + threadIdx.y; if (i < nx && j < nx) { for (int it = 1; it <= nx- 1; it++) { if (i != 0|| i < nx ) { un[i * nx + it-1] = u[i * nx + it-1]; __syncthreads(); u[it] = un[1 * nx + it - 1]; __syncthreads(); u[i * nx + it ] = un[i * nx + it- 1] - c * dt/dx * (un[i * nx + it - 1] - un[(i - 1) * nx + it - 1]); __syncthreads(); } } } ===== OPTIMIZED CALCULATE WAVE KERNEL CHANGES=====The code below has been altered to remove the (j) variable and combined the two (if) statements into one, so that we can reduce (Thread Divergence), as well as move the (- c*dt/dx* ) recurring instruction set, and place it into a variable called total, so that each thread is NOT performing the same operation which causes a decrease in performance. // kernerl __global__ void Calculate(float* u, float* un, int nx, int c, float dx, float dt) { int i = blockIdx.x * blockDim.x + threadIdx.x; float total = c*dt / dx; if (i <nx && i != nt 0) { for (int it = 1; it <= nx - 1; it++) { un[i * nx + it - 1] = u[i * nx + it - 1]; __syncthreads(); u[it] = un[1 * nx + it - 1]; __syncthreads(); u[i * nx + it] = un[i * nx + it - 1] - total * (un[i * nx + it - 1] - un[(i - 1) * nx + it - 1]); __syncthreads(); } } } With this optimized code it is now possible to execute with a problem size > 2000 & 2000. ==== ORIGINAL INITIALIZATION KERNEL ====The Initialize kernel has also been redesigned. Below is the original: __global__ void Initalize(float* u, float* un, int nx, int nt, float dx) { int i= blockIdx.x * blockDim.x +threadIdx.x; int j = blockIdx.y * blockDim.y +threadIdx.y; if (i < nx && j < nx) { if (i*dx >= 0.5 && i*dx <= 1) { u[i * nx] = 2; __syncthreads(); } else { u[i * nx] = 1; __syncthreads(); } } } ===== OPTIMIZED INITIALIZATION KERNEL CHANGES ===== I removed the variable (j), removed the syncthreads() which were not needed, I also removed the function running on the CPU that initializes all indexes int he arrays to 0, and moved it into the GPU below. __global__ void Initalize(float* u, float* un, int nx, int nt, float dx) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < nx) { for (int it = 0; it < nx; it++) { u[i * nx + it] = 0; un[i * nx + it] = 0; } if (i*dx >= 0.5 && i*dx <= 1) u[i * nx] = 2; else u[i * nx] = 1; } } == POST OPTIMIZATION - Execution Comparison Times== If you have not, please take a look at section 3.1.1.1(just above), as it shows how the first iteration of optimization has been delivered. Below is a comparison of times from the original CPU to the newly optimized kernel execution. These comaprison times are for the WHOLE execution of the program, not just parts. These include memory transfers, allocation, de-allocation and calculations. TIMES ARE IN MILLISECONDS N Linux Visual No Parallel Parallized Optimized_A (2000 ^ 2) 1160 | 20520 | 6749 | 971 (5000 ^ 2) 28787 | 127373 | n/a | 1417 (10000 ^ 2) 124179 | 522576 | n/a | 3054 [[File:ParallelizedVSOptimized.png]] == SECOND OPTIMIZATION == === Shared Memory === In order to speed up the execution time I will incorporate shared data into the Calculate Kernel. The problem I am facing is determining in what way to use shared memory.
Here is the final CALCULATE Kernel for the application.
The changes include removal of the second array.
Below are the final comparisons of all execution times between the CPU and GPU.