Coupled with Maxwell's equations, they can be used to model and study magnetohydrodynamics (courtesy of Wikipedia: https://en.wikipedia.org/wiki/Navier%E2%80%93Stokes_equations).
=== Problem: Application Code to be Parallelized ===
The bottleneck in this application is in the main function, where the finite-difference calculation is performed.
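The update applied there (it can be read directly off the code fragments shown further below) is the standard first-order upwind finite-difference scheme for the 1D linear convection equation, where ''i'' indexes the spatial grid points and ''n'' the time steps:

<math>\frac{\partial u}{\partial t} + c\,\frac{\partial u}{\partial x} = 0,
\qquad
u_i^{\,n+1} = u_i^{\,n} - c\,\frac{\Delta t}{\Delta x}\left(u_i^{\,n} - u_{i-1}^{\,n}\right)</math>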
=== Initial Speed Tests Run with No Optimization on Linux ===
The system specifications were obtained by running the command '''cat /proc/cpuinfo''' on the Linux VM.
{| class="wikitable sortable" border="1" cellpadding="5"
|+ Initial CPU Execution Time
! n !! Time (ms)
|-
| 12500 x 12500 || 220198
|}
=== gprof ===
System Specifications
== Application 2: Calculating Pi ==
This application is straightforward: it calculates Pi to the number of decimal places given by the user. For example, an input of 10 computes Pi to the 10th decimal place, while an input of 100,000 computes it to the 100,000th.
=== Problem: Application Code to be Parallelized ===
Inside the function '''calculate''' there are two nested for loops, which I believe account for most of the program's execution time.
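The project's '''calculate''' function is not reproduced here. As a rough illustration of the kind of digit-by-digit, nested-loop structure such a program typically has, below is a minimal spigot-style sketch (a Dik Winter / Rabinowitz–Wagon variant); the names and the exact algorithm are assumptions, not the project's actual code.

#include <stdio.h>
#include <stdlib.h>

// Hypothetical sketch: prints Pi digit groups using a spigot algorithm.
// Digits come out four at a time, without a decimal point.
int main(int argc, char* argv[])
{
    int digits = (argc > 1) ? atoi(argv[1]) : 800;   // decimal places requested
    int c = ((digits + 3) / 4) * 14;                 // working length grows with the digit count
    int a = 10000, b, d, e = 0, g;
    int* f = (int*)calloc(c + 1, sizeof(int));       // f[c] stays 0

    for (b = 0; b < c; b++)                          // first loop: seed the remainder array
        f[b] = a / 5;

    while (c > 0)                                    // second loop: one pass per four digits
    {
        d = 0;
        g = c * 2;
        for (b = c; b > 0; )                         // inner loop: carry propagation
        {
            d += f[b] * a;
            g--;
            f[b] = d % g;
            d /= g;
            g--;
            b--;
            if (b > 0)
                d *= b;
        }
        c -= 14;
        printf("%.4d", e + d / a);                   // emit the next four digits
        e = d % a;
    }
    printf("\n");
    free(f);
    return 0;
}

The two outer loops plus the inner carry-propagation loop are exactly the kind of O(n^2) work that dominates the run time as the requested number of digits grows.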
=== Initial Speed Tests Run with No Optimization on Linux ===
For this test the Linux VM has:
{| class="wikitable sortable" border="1" cellpadding="5"
|+ Initial CPU Execution Time
! Decimal places !! Time (ms)
|-
| 500000 || 671163
|}
=== gprof ===
Returning to the first application, the original finite-difference code operates on 2D arrays:
'''
// fragment of the serial update; the enclosing time-step loop over `it` is not shown
for (int i = 0; i <= nx - 1; i++)
{
    if (i*dx >= 0.5 && i*dx <= 1)
        u[i][it] = un[i][it-1] - c*dt/dx * (un[i][it-1] - un[i-1][it-1]);
}
'''
The same code converted to flattened 1D arrays (row index times stride plus column index) in preparation for the GPU:
'''
// initialization fragment: the enclosing loop over k is not shown
u[k * nt + 0] = 1;

// update fragment: the enclosing loop over the row index m is not shown
for (int it = 1; it <= nx - 1; it++)
{
    u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx * (un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);
}
'''
After these changes, testing the code produced the same results as the original program, which confirms that we can proceed to optimizing the code on the GPU.
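One simple way to confirm that the results match is an element-by-element comparison of the serial and parallel outputs within a small tolerance; a hypothetical helper (names assumed, not the project's code) might look like this:

#include <math.h>
#include <stdio.h>

// Returns 1 if every element of the two result arrays agrees within eps.
int same_results(const float* u_cpu, const float* u_gpu, int n, float eps)
{
    for (int i = 0; i < n; i++)
    {
        if (fabsf(u_cpu[i] - u_gpu[i]) > eps)
        {
            printf("mismatch at %d: %f vs %f\n", i, u_cpu[i], u_gpu[i]);
            return 0;
        }
    }
    return 1;
}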
== Optimization: Parallelizing with 2 Kernels ==
The kernels are launched with a 2D grid and 2D blocks: '''dim3 dGrid(nbx, nbx);''' and '''dim3 dBlock(ntpb, ntpb);'''
The serial for loop over the grid points is no longer needed, since each thread now computes its own element.
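As a rough host-side sketch (hypothetical code: '''d_u''' and '''d_un''' stand for the device copies of the two arrays, and nbx/ntpb for the number of blocks and threads per block in each dimension):

// hypothetical launch configuration and kernel calls
int nbx = (nx + ntpb - 1) / ntpb;        // enough blocks to cover nx in each dimension
dim3 dGrid(nbx, nbx);
dim3 dBlock(ntpb, ntpb);
Initalize<<<dGrid, dBlock>>>(d_u, d_un, nx, nt, dx);
Calculate<<<dGrid, dBlock>>>(d_u, d_un, nx, c, dx, dt);
cudaDeviceSynchronize();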
=== INITIALIZE KERNEL ===
__global__ void Initalize(float* u, float* un, int nx, int nt, float dx)
{
}
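A minimal sketch of what the body of this kernel might look like, assuming the step-function initial condition implied by the serial fragments above (the value 2 inside 0.5 &le; x &le; 1 is an assumption):

__global__ void Initalize(float* u, float* un, int nx, int nt, float dx)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nx)
    {
        // step-function initial condition in the first column (it == 0)
        if (k * dx >= 0.5f && k * dx <= 1.0f)
            u[k * nt + 0] = 2.0f;        // assumed step value
        else
            u[k * nt + 0] = 1.0f;
        un[k * nt + 0] = u[k * nt + 0];  // keep the work array consistent (assumption)
    }
}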
=== CALCULATE WAVE KERNEL ===
This was the trickiest part of converting the original code into a kernel.
The program works with two arrays. The kernel below computes the next column of the calculation from the previous column; the first array is then copied into that column of the second array, and the cycle repeats until all columns are finished.
When the program is given an argument larger than 2000 x 2000, Windows reports that the display driver has crashed and recovered, most likely because the long-running kernel hits the display-driver timeout.
==== PARALLELIZED CALCULATE WAVE KERNEL ====
__global__ void Calculate(float* u, float* un, int nx, int c, float dx, float dt)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < nx && j < nx)
    {
        for (int it = 1; it <= nx - 1; it++)
        {
            if (i != 0 && i < nx)
            {
                un[i * nx + it - 1] = u[i * nx + it - 1];   // copy the previous column
                __syncthreads();
                u[it] = un[1 * nx + it - 1];                // boundary condition for the first row
                __syncthreads();
                u[i * nx + it] = un[i * nx + it - 1] - c * dt/dx * (un[i * nx + it - 1] - un[(i - 1) * nx + it - 1]);
                __syncthreads();
            }
        }
    }
}
===== OPTIMIZED CALCULATE WAVE KERNEL CHANGES=====
The code below has been altered to remove the (j) variable and to combine the two (if) statements into one, which reduces thread divergence. The recurring expression (c*dt/dx) has also been hoisted into a variable called total, so that each thread is not repeatedly performing the same operation, which would decrease performance.
// kernel
__global__ void Calculate(float* u, float* un, int nx, int c, float dx, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float total = c*dt / dx;                            // recurring factor, computed once
    if (i < nx && i != 0)
    {
        for (int it = 1; it <= nx - 1; it++)
        {
            un[i * nx + it - 1] = u[i * nx + it - 1];   // copy the previous column
            __syncthreads();
            u[it] = un[1 * nx + it - 1];                // boundary condition for the first row
            __syncthreads();
            u[i * nx + it] = un[i * nx + it - 1] - total * (un[i * nx + it - 1] - un[(i - 1) * nx + it - 1]);
            __syncthreads();
        }
    }
}
With this optimized code it is now possible to execute with a problem size larger than 2000 x 2000.
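With the (j) index gone, the kernel only needs a one-dimensional launch; a hypothetical revised launch (same nbx/ntpb names as before) would be:

// hypothetical 1D launch after removing the j index
int nbx = (nx + ntpb - 1) / ntpb;
Calculate<<<nbx, ntpb>>>(d_u, d_un, nx, c, dx, dt);
cudaDeviceSynchronize();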
==== ORIGINAL INITIALIZATION KERNEL ====
The Initialize kernel has also been redesigned. Below is the original:
__global__ void Initalize(float* u, float* un, int nx, int nt, float dx)
{
}
=== EXECUTION COMPARISON BETWEEN OPTIMIZED AND SHARED KERNELS ===
Below are the execution times, in milliseconds, for the former kernel and the new shared-memory kernel.
{| class="wikitable sortable" border="1" cellpadding="5"
|+ Time Comparison
! n !! Optimized !! Shared
|-
| 2000 x 2000 || 971 || 661
|-
| 5000 x 5000 || 1417 || 936
|-
| 10000 x 10000 || 3054 || 2329
|}
== THIRD OPTIMIZATION ==
=== SAVING TRAVEL COSTS BY REMOVING THE UNNECESSARY ARRAY ===
As we discovered above, the second array is not necessary once all of the calculations are performed in shared memory, as can be seen in section 3.3.2. This allows us to further optimize the kernel by reducing the amount of time spent transferring data across the PCI bus. Below is an image of the data-transfer times for the CALCULATE kernel.
Since only one of the two original arrays is needed in the final kernel, we can save 50% of the transfer time across the PCI bus by removing the other.
[[File:MEmCpy10000.png]]
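As a sketch of the host side after this change (hypothetical code; the array size nx * nt is assumed from the indexing used in the kernels), only a single buffer is allocated, copied in each direction and freed:

// hypothetical host-side transfers with only one device array
float* d_u = nullptr;
cudaMalloc((void**)&d_u, nx * nt * sizeof(float));
cudaMemcpy(d_u, u, nx * nt * sizeof(float), cudaMemcpyHostToDevice);
Calculate<<<nbx, ntpb>>>(d_u, nx, c, dx, dt);
cudaMemcpy(u, d_u, nx * nt * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_u);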
=== GETTING 100% OCCUPANCY PER MULTIPROCESSOR===
'''Occupancy Calculator
The CUDA Toolkit includes a spreadsheet that accepts as parameters the compute capability, the number of threads per block, the number of registers per thread and the shared memory per block. This spreadsheet evaluates these parameters against the resource limitations of the specified compute capability. This spreadsheet is named CUDA_Occupancy_Calculator.xls and stored under the ../tools/ sub-directory of the installed Toolkit.'''
Source: https://scs.senecac.on.ca/~gpu610/pages/content/resou.html
With the existing CALCULATE kernel, the CUDA Occupancy Calculator reports the statistics shown below.
[[File:OriginalCalculator.png]]
The current CALCULATE kernel utilizes only 50% of each multiprocessor, as shown above. If the threads per block are increased from 32 to 512, we achieve 100% occupancy, as shown below.
[[File:100Calculator.png]]
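In code this amounts to changing the ntpb constant used by the 1D launch (a sketch; the exact definition in the project may differ):

// hypothetical: raise threads per block from 32 to 512 for full occupancy
const int ntpb = 512;                 // also sizes the __shared__ array in the kernel
int nbx = (nx + ntpb - 1) / ntpb;     // recompute the number of blocks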
=== CALCULATE KERNEL ===
Here is the final CALCULATE Kernel for the application.
The changes include removal of the second array.
// kernel
__global__ void Calculate(float* u, int nx, int c, float dx, float dt)
{
    __shared__ float s[ntpb];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    float total = c*dt / dx;
    if (i < nx && i != 0 && t != 0)
    {
        for (int it = 1; it <= nx - 1; it++)
        {
            // stage the neighbouring value from the previous column in shared memory
            s[t - 1] = u[(i - 1) * nx + it - 1];
            __syncthreads();
            u[it] = s[1];                            // boundary condition for the first row
            // the update now reads only from shared memory
            u[i * nx + it] = s[t] - total * (s[t] - s[t - 1]);
            __syncthreads();
        }
    }
}
=== OPTIMIZATION TIME COMPARISONS ===
Below is a graph comparing execution times across the optimizations, illustrating the amount of execution time saved at each step.
The times are listed in milliseconds.
[[File:OPTIMIZATIONCOMPARISON.png]]
= CONCLUSIONS =
== OVERALL TIME COMPARISONS ==
Below are the final comparisons of all execution times between the CPU and GPU.
All times are in milliseconds.
[[File:finalCompare.png]]
== APPLICATION OUTPUT ==
Upon completion, the application creates a file containing the output of the algorithm. The image below compares the output of the original program with that of the parallelized program.
[[File:outputs.png]]
== FINAL THOUGHTS ==
Upon completion of this project I have learned a few things:
First, not all programs can be parallelized, even if they seem to be good candidates to begin with.
Second, understanding the algorithm of the application is a key factor in optimizing the solution. Sometimes the code must be rearranged to obtain better performance from the GPU, and understanding the algorithm helps ensure that the program's output remains the same.
Third, managing resources and constraints matters: registers, shared memory, constant memory, latency, threads, and multiprocessors are all factors that need to be considered when using the GPU. Understanding how these resources impact and influence a program helps in deciding which ones to use in a specific situation.