I will attempt to optimize the Navier-Stokes flow velocity program, as that application is more interesting to me.
 
= Parallelize =
 
== Initial Problems when Converting the Program to use the GPU ==
 
In order to use the GPU with this program, some changes need to be made.
 
The original program was using a two-dimensional Vector. Unfortunately, a vector cannot be copied with *cudaMemcpy* or allocated with *cudaMalloc*, so a change is needed: the vector will be converted into a 2D array (float** u).
 
Now that the program iterates through an array rather than a Vector, it executes at a much faster speed, but it can still go faster, so I will continue optimizing with the GPU.
 
The next problem encountered: just like a vector, a 2D array cannot be passed to a *cudaMalloc* or *cudaMemcpy* call, so the 2D array needs to be converted into a single, flattened array. That brings another problem regarding the original algorithm, which is shown below.
'''// Set the initial condition in the first column of u:
for (int i = 0; i <= nx - 1; i++)
{
    if (i*dx >= 0.5 && i*dx <= 1)
    {
        u[i][0] = 2;
    }
    else
    {
        u[i][0] = 1;
    }
}
// Finite-difference loop:
for (int it = 1; it <= nt - 1; it++)
{
    for (int k = 0; k <= nx - 1; k++)
    {
        un[k][it - 1] = u[k][it - 1];
    }
    for (int i = 1; i <= nx - 1; i++)
    {
        u[0][it] = un[1][it - 1];
        u[i][it] = un[i][it - 1] - c*dt/dx*(un[i][it - 1] - un[i - 1][it - 1]);
    }
}'''
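The conversion relies on the usual row-major mapping from a 2D index to an offset in a flat array. A minimal sketch of the idea (the helper name and the stride parameter are hypothetical, not part of the original program):

'''// Hypothetical helper: map the 2D index (i, j) to an offset in a flat array
// stored row by row, with 'stride' elements per row.
inline int flatIndex(int i, int j, int stride)
{
    return i * stride + j;
}
// u[i][j] in the 2D version becomes u[flatIndex(i, j, stride)] in the 1D version.'''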
 
In order to get this algorithm to work with a 1D array, some changes were made; see below.
 
'''// Set the initial condition using the flattened index:
for (int i = 0; i <= nt - 1; i++)
{
    for (int k = 0; k <= nx - 1; k++)
        if (i*dx >= 0.5 && i*dx <= 1)
        {
            u[i * nt + 0] = 2;
        }
        else
        {
            u[i * nt + 0] = 1;
        }
}

// Finite-difference loop, now using flattened indexing:
for (int it = 1; it <= nx - 1; it++)
{
    for (int k = 0; k <= nx - 1; k++)
    {
        un[k * nx + it - 1] = u[k * nx + it - 1];
    }
    for (int m = 1; m <= nx - 1; m++)
    {
        u[0 * nx + it] = un[1 * nx + it - 1];
        u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);
    }
}'''
 
 
It is now possible to iterate through a single-dimensional array using the original algorithm.
 
One more small problem came up: not every element in the array is initialized, so some elements contain garbage values from memory. Because the program originally used a vector, all elements were initialized to 0 by default; to solve this, a separate function is implemented to fill each array index with the value 0 upon initialization.
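A minimal sketch of such a fill function (the function and parameter names are assumptions, since the original helper is not shown here):

'''// Hypothetical helper: zero out every element of a flattened nx-by-nt array,
// mimicking the default zero-initialization the std::vector used to provide.
void fillWithZeros(double* u, int nx, int nt)
{
    for (int i = 0; i < nx * nt; i++)
    {
        u[i] = 0.0;
    }
}'''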
 
After these implementations, testing the code produced the same results as the original program, which is a positive confirmation that we can proceed to optimizing the code using the GPU.
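Since the data is now stored in contiguous 1D arrays, it can be allocated on the device and copied back and forth with *cudaMalloc* and *cudaMemcpy*. A minimal sketch, assuming host arrays u and un of size nx * nt (the d_u and d_un names are hypothetical):

'''// Sketch: allocate device copies of the flattened arrays, copy the host data
// over, and copy the results back after the kernels run. Error checking omitted.
double* d_u = nullptr;
double* d_un = nullptr;
size_t bytes = nx * nt * sizeof(double);

cudaMalloc((void**)&d_u, bytes);
cudaMalloc((void**)&d_un, bytes);

cudaMemcpy(d_u, u, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_un, un, bytes, cudaMemcpyHostToDevice);

// ... launch the kernels here ...

cudaMemcpy(u, d_u, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_u);
cudaFree(d_un);'''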
 
== Optimizing Problems ==
 
At the beginning of the program, a handle was created for grabbing the computer's CUDA device; through this handle we can extract the CUDA device properties to examine our parameters.
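Presumably the handle comes from the standard CUDA device-query calls; a minimal sketch, assuming the default device 0 is used:

'''// Sketch: query the current device and fill prop with its properties.
int device = 0;
cudaDeviceProp prop;
cudaGetDevice(&device);
cudaGetDeviceProperties(&prop, device);'''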
 
//list the properties
std::cout << "Name: " << prop.name << std::endl;
std::cout << "Compute Capability: " << prop.major << '.' << prop.minor << std::endl;
std::cout << "Total Global Memory: " << prop.totalGlobalMem << std::endl;
std::cout << "Max Threads per block: " << prop.maxThreadsPerBlock << std::endl;
std::cout << "Clock Rate in khz: " << prop.clockRate << "\n\n";
 
 
We grab the device's properties so that we do not exceed its resources.
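For example, a kernel launched with a two-dimensional block of ntpb x ntpb threads must not exceed prop.maxThreadsPerBlock. A hypothetical sanity check (not taken from the original program; ntpb is an assumed name):

'''// Sketch: reject a block configuration that exceeds the device limit.
int ntpb = 32; // 32 x 32 = 1024 threads per block
if (ntpb * ntpb > prop.maxThreadsPerBlock)
{
    std::cout << "Requested block size exceeds the device limit." << std::endl;
}'''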
 
The original algorithm was split into 2 kernels. The first kernel, which causes no problems, is as follows:
 
'''__global__ void Initalize(double* u, double* un, int nx, int nt, double dx)
{
    // Thread indices (computed but not used below; every thread runs the full loop).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    for (int k = 0; k <= nx - 1; k++)
        if (k*dx >= 0.5 && k*dx <= 1)
        {
            u[k * nt] = 2;
            __syncthreads();
        }
        else
        {
            u[k * nt] = 1;
            __syncthreads();
        }
}'''
 
The second kernel works perfectly fine for arguments less than 1024 1024 (the user inputs 2 values); anything higher, for example an argument of 2000 2000, will crash the driver, and the results are left as they were before the kernel launch. The kernel code is below:
 
'''__global__ void Calculate (double* u, double* un, int nx, int c, double dx, double dt)
{
    for (int it = 1; it <= nx - 1; it++)
    {
        for (int k = 0; k <= nx - 1; k++)
        {
            un[k * nx + it - 1] = u[k * nx + it - 1];
        }
        for (int m = 1; m <= nx - 1; m++)
        {
            u[0 * nx + it] = un[1 * nx + it - 1];
            u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);
        }
    }
}'''
 
== Solution to first Kernel problem ==
 
The problem was caused by this calculation in the Calculate kernel: '''u[m * nx + it] = un[m * nx + it - 1] - c*dt / dx*(un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);''' Perhaps using the c, dt and dx values directly incurred too many trips to global memory, which caused the operation to hang, and CUDA automatically crashed the driver. To solve this, a scalar (local) variable was created to store the value in register memory for each thread to access: '''double total = c*dt / dx;'''

Now the program executes with an argument of 2000 2000 and yields similar results to the original program. Unfortunately a new problem has arisen: when the argument is raised above 2000 the program once again crashes, and I am currently stuck with no solution...
 
'''__global__ void Calculate (double* u, double* un, int nx, int c, double dx, double dt)
{
    double total = c*dt / dx;
    for (int it = 1; it <= nx - 1; it++)
    {
        for (int k = 0; k <= nx - 1; k++)
        {
            un[k * nx + it - 1] = u[k * nx + it - 1];
        }
        for (int m = 1; m <= nx - 1; m++)
        {
            u[0 * nx + it] = un[1 * nx + it - 1];
            u[m * nx + it] = un[m * nx + it - 1] - total * (un[m * nx + it - 1] - un[(m - 1) * nx + it - 1]);
        }
    }
}'''