I will attempt to optimize the Navier-Stokes flow velocity program, as that application is more interesting to me.
 
= Parallelize =
 
== Initial Problems when Converting the Program to use the GPU ==
 
In order to use the GPU with this program, some changes need to be made.
 
The original program was using a two-dimensional Vector. Unfortunately, a vector cannot be copied with *cudaMemcpy* or allocated with *cudaMalloc*, so a change is needed: the vector will be converted into a 2D array (float** u).
 
Now that the program iterates through an array rather than a Vector, it executes at a much faster speed, but it can still go faster, so I will continue optimizing with the GPU.
 
The next problem encountered: just like a vector, a 2D array cannot be passed to a *cudaMalloc* or *cudaMemcpy* call, so the 2D array needs to be converted into a single, flattened array. That brings another problem regarding the original algorithm, which is shown below.
'''// Set the initial condition in the first column of u:
for (int i = 0; i <= nx - 1; i++)
{
    if (i*dx >= 0.5 && i*dx <= 1)
    {
        u[i][0] = 2;
    }
    else
    {
        u[i][0] = 1;
    }
}
// Finite-difference loop:
for (int it = 1; it <= nt - 1; it++)
{
    for (int k = 0; k <= nx - 1; k++)
    {
        un[k][it - 1] = u[k][it - 1];
    }
    for (int i = 1; i <= nx - 1; i++)
    {
        u[0][it] = un[1][it - 1];
        u[i][it] = un[i][it - 1] - c*dt/dx*(un[i][it - 1] - un[i - 1][it - 1]);
    }
}'''
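The conversion relies on the usual row-major mapping from a 2D index to an offset in a flat array. A minimal sketch of the idea (the helper name and the stride parameter are hypothetical, not part of the original program):

'''// Hypothetical helper: map the 2D index (i, j) to an offset in a flat array
// stored row by row, with 'stride' elements per row.
inline int flatIndex(int i, int j, int stride)
{
    return i * stride + j;
}
// u[i][j] in the 2D version becomes u[flatIndex(i, j, stride)] in the 1D version.'''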
 
In order to get this algorithm to work with a 1D array, some changes were made; see below.
 
'''// Set the initial condition using the flattened index:
for (int i = 0; i <= nt - 1; i++)
{
    for (int k = 0; k <= nx - 1; k++)
        if (i*dx >= 0.5 && i*dx <= 1)
        {
            u[i * nt + 0] = 2;
        }
        else
        {
            u[i * nt + 0] = 1;
        }
}

// Finite-difference loop, now using flattened indexing:
for (int it = 1; it <= nx - 1; it++)
{
    for (int k = 0; k <= nx - 1; k++)
    {
        un[k * nx + it - 1] = u[k * nx + it - 1];
    }
    for (int m = 1; m <= nx - 1; m++)
    {
        u[0 * nx + it] = un[1 * nx + it - 1];
        u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);
    }
}'''
 
 
It is now possible to iterate through a single-dimensional array using the original algorithm.
 
One more small problem came up: not every element in the array is initialized, so some elements contain garbage values from memory. Because the program originally used a vector, all elements were initialized to 0 by default; to solve this, a separate function is implemented to fill each array index with the value 0 upon initialization.
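A minimal sketch of such a fill function (the function and parameter names are assumptions, since the original helper is not shown here):

'''// Hypothetical helper: zero out every element of a flattened nx-by-nt array,
// mimicking the default zero-initialization the std::vector used to provide.
void fillWithZeros(double* u, int nx, int nt)
{
    for (int i = 0; i < nx * nt; i++)
    {
        u[i] = 0.0;
    }
}'''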
 
After these implementations, testing the code produced the same results as the original program, which is a positive confirmation that we can proceed to optimizing the code using the GPU.
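Since the data is now stored in contiguous 1D arrays, it can be allocated on the device and copied back and forth with *cudaMalloc* and *cudaMemcpy*. A minimal sketch, assuming host arrays u and un of size nx * nt (the d_u and d_un names are hypothetical):

'''// Sketch: allocate device copies of the flattened arrays, copy the host data
// over, and copy the results back after the kernels run. Error checking omitted.
double* d_u = nullptr;
double* d_un = nullptr;
size_t bytes = nx * nt * sizeof(double);

cudaMalloc((void**)&d_u, bytes);
cudaMalloc((void**)&d_un, bytes);

cudaMemcpy(d_u, u, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_un, un, bytes, cudaMemcpyHostToDevice);

// ... launch the kernels here ...

cudaMemcpy(u, d_u, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_u);
cudaFree(d_un);'''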
 
== Optimizing Problems ==
 
At the beginning of the program, a handle was created for grabbing the computer's CUDA device; through this handle we can extract the CUDA device properties to examine our parameters.
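Presumably the handle comes from the standard CUDA device-query calls; a minimal sketch, assuming the default device 0 is used:

'''// Sketch: query the current device and fill prop with its properties.
int device = 0;
cudaDeviceProp prop;
cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);'''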
 
//list the properties
std::cout << "Name: " << prop.name << std::endl;
std::cout << "Compute Capability: " << prop.major << '.' << prop.minor << std::endl;
std::cout << "Total Global Memory: " << prop.totalGlobalMem << std::endl;
std::cout << "Max Threads per block: " << prop.maxThreadsPerBlock << std::endl;
std::cout << "Clock Rate in khz: " << prop.clockRate << "\n\n";
 
 
We grab the device's properties so that we do not exceed its resources.
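For example, a kernel launched with a two-dimensional block of ntpb x ntpb threads must not exceed prop.maxThreadsPerBlock. A hypothetical sanity check (not taken from the original program; ntpb is an assumed name):

'''// Sketch: reject a block configuration that exceeds the device limit.
int ntpb = 32; // 32 x 32 = 1024 threads per block
if (ntpb * ntpb > prop.maxThreadsPerBlock)
{
    std::cout << "Requested block size exceeds the device limit." << std::endl;
}'''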
 
The original algorithm was split into 2 kernels. The first kernel, which causes no problems, is as follows:
 
'''__global__ void Initalize(double* u, double* un, int nx, int nt, double dx)
{
    // Thread indices (computed but not used below; every thread runs the full loop).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    for (int k = 0; k <= nx - 1; k++)
        if (k*dx >= 0.5 && k*dx <= 1)
        {
            u[k * nt] = 2;
            __syncthreads();
        }
        else
        {
            u[k * nt] = 1;
            __syncthreads();
        }
}'''
 
The second kernel works perfectly fine for arguments less than 1024 1024 (the user inputs 2 values); anything higher, for example an argument of 2000 2000, will crash the driver, and the results are left as they were before the kernel launch. The kernel code is below:
 
'''__global__ void Calculate (double* u, double* un, int nx, int c, double dx, double dt)
{
    for (int it = 1; it <= nx - 1; it++)
    {
        for (int k = 0; k <= nx - 1; k++)
        {
            un[k * nx + it - 1] = u[k * nx + it - 1];
        }
        for (int m = 1; m <= nx - 1; m++)
        {
            u[0 * nx + it] = un[1 * nx + it - 1];
            u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);
        }
    }
}'''
 
== Solution to first Kernel problem ==
 
The problem was caused by this calculation in the Calculate kernel: '''u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);''' Perhaps using the c, dt and dx values directly incurred too many trips to global memory, which caused the operation to hang, and CUDA automatically crashed the driver. To solve this, a scalar (local) variable was created to store the value in register memory for each thread to access: '''double total = c*dt / dx;'''

Now the program executes with an argument of 2000 2000 and yields similar results to the original program. Unfortunately a new problem has arisen: when the argument is raised above 2000 the program once again crashes, and I am currently stuck with no solution...
 
'''__global__ void Calculate (double* u, double* un, int nx, int c, double dx, double dt)
{
    double total = c*dt / dx;
    for (int it = 1; it <= nx - 1; it++)
    {
        for (int k = 0; k <= nx - 1; k++)
        {
            un[k * nx + it - 1] = u[k * nx + it - 1];
        }
        for (int m = 1; m <= nx - 1; m++)
        {
            u[0 * nx + it] = un[1 * nx + it - 1];
            u[m * nx + it] = un[m * nx + it - 1] - total * (un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);
        }
    }
}'''