Hu3Team

20:46, 11 November 2015
= Hu3Team's GPU610 Assignment =
== Team Members ==
# [mailto:bdigiuseppecardosode@myseneca.ca?subject=gpu610 Bruno Di Giuseppe]
This way you get a nice heat-dispersion calculation.
I was worried about data dependency since, as I said, each element depends on its neighbours to be calculated. But this solution uses two matrices, one old and one new: the new matrix receives the average values of the old matrix and, if the difference is still bigger than epsilon, the old matrix receives the values of the new matrix and the whole iteration happens again, with the new matrix receiving the average values of the old matrix, which now holds the most recent values.
This makes it a good candidate for parallelization: each iteration of the average calculation can be sent to a different GPU thread, and since it is a simple average calculation, the GPU handles it well.
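The two-matrix scheme described above can be sketched serially as follows (the grid size, the hot-edge boundary values, and the <code>jacobi</code> helper name are illustrative assumptions, not the actual assignment 1 code):
<pre>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Returns the number of iterations needed for the largest per-element
// change to drop below eps (the convergence tolerance, "Epsilon").
int jacobi(int n, double eps) {
    std::vector<double> oldM(n * n, 0.0), newM(n * n, 0.0);
    // Fixed boundary condition: top row held hot (illustrative value).
    for (int j = 0; j < n; ++j) oldM[j] = newM[j] = 100.0;

    double diff;
    int iterations = 0;
    do {
        diff = 0.0;
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < n - 1; ++j) {
                // New matrix receives the average of the old matrix's neighbours.
                newM[i * n + j] = (oldM[(i - 1) * n + j] + oldM[(i + 1) * n + j] +
                                   oldM[i * n + j - 1] + oldM[i * n + j + 1]) / 4.0;
                diff = std::max(diff, std::fabs(newM[i * n + j] - oldM[i * n + j]));
            }
        oldM = newM; // old receives new, and the whole iteration repeats
        ++iterations;
    } while (diff > eps);
    return iterations;
}

int main() {
    std::printf("converged after %d iterations\n", jacobi(32, 1e-4));
    return 0;
}
</pre>
Each inner-loop average is independent of the others within one sweep, which is exactly why each one can be handed to its own GPU thread.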
[[File:execution-NO_CUDA.JPG]]
 
Number of iterations vs. value reached (the first value smaller than Epsilon).
====Carlos's Findings====
=== Assignment 2 ===
====CUDA Coding====
Based on assignment 1, we added source code to make it possible to run the program on a CUDA device, as follows.
<pre>
__global__ void copyMat(const double *w, double *u){
    // Each thread copies one element of the old matrix into the new one
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < M && j < N)
        u[j * M + i] = w[j * M + i];
}
</pre>
 
Moreover, we made the error tolerance (Epsilon) configurable in the code. After many difficulties during coding, we finally got good results compared with the assignment 1 code. The runtime decreased, which showed us the power CUDA can provide to speed up processing.
=== Assignment 3 ===
 
====CUDA Coding====
Based on assignment 2, we made optimizations to speed up the execution, as follows.
 
<pre>
__global__ void copyMat(const float *w, float *u){
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global column
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // global row
    if (i < M && j < N) {
        u[j * M + i] = w[j * M + i];  // coalesced copy, same layout as calcHeat
    }
}

__global__ void calcHeat(float *w, float *u, float *d, int m, int n, float* d_array){
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global column
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // global row
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Tile staged in shared memory to cut global-memory traffic
    __shared__ float s_u[ntpb][ntpb];
    __shared__ float s_w[ntpb][ntpb];
    __shared__ float s_dif[ntpb][ntpb];

    s_dif[ty][tx] = 0.0f;  // so boundary threads write a valid difference
    if (i < M && j < N) {
        s_w[ty][tx] = w[j * M + i];
        s_u[ty][tx] = w[j * M + i];
    }
    __syncthreads();

    // Interior threads only: all four neighbours must be inside the tile
    // and inside the grid, to avoid illegal memory access
    if (tx > 0 && ty > 0 && tx < ntpb - 1 && ty < ntpb - 1 && i < M && j < N) {
        s_w[ty][tx] = (s_u[ty - 1][tx] + s_u[ty + 1][tx] +
                       s_u[ty][tx - 1] + s_u[ty][tx + 1]) / 4.0f;
        s_dif[ty][tx] = fabsf(s_w[ty][tx] - s_u[ty][tx]);
    }
    __syncthreads();

    if (i < M && j < N) {
        w[j * M + i] = s_w[ty][tx];
        d_array[j * M + i] = s_dif[ty][tx];
    }
}

__global__ void bigDiff(float* d_array, float* d, int m, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Doubling-stride reduction: compare pairs ever further apart,
    // keeping the largest difference seen in *d
    for (int x = 1; i + x < m * n; x *= 2) {
        if (d_array[i] > *d || d_array[i + x] > *d){
            if (d_array[i] > d_array[i + x])
                *d = d_array[i];
            else
                *d = d_array[i + x];
        }
        __syncthreads();
    }
}
</pre>
We made use of shared memory to speed up memory access in the kernel, along with coalesced memory access. We were already doing a simple reduction to get the biggest difference, but with these two optimizations alone we achieved a speedup of almost 50% over the first, unoptimized CUDA solution.
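The doubling-stride pattern that bigDiff uses for the reduction can be illustrated with a serial CPU analogue (the <code>maxDiff</code> helper and the sample values are assumptions for illustration, not part of the kernel):
<pre>
#include <cmath>
#include <cstdio>
#include <vector>

// Pairwise max-reduction with a doubling stride: the same access pattern
// bigDiff spreads across threads, performed serially here for clarity.
float maxDiff(std::vector<float> d) {
    for (size_t stride = 1; stride < d.size(); stride *= 2)
        for (size_t i = 0; i + stride < d.size(); i += 2 * stride)
            d[i] = std::fmax(d[i], d[i + stride]);
    return d.empty() ? 0.0f : d[0];  // d[0] now holds the largest difference
}

int main() {
    std::vector<float> diffs = {0.2f, 1.5f, 0.7f, 3.1f, 0.0f, 2.4f};
    std::printf("largest difference: %f\n", maxDiff(diffs));  // 3.100000
    return 0;
}
</pre>
After log2(n) passes the maximum ends up at index 0, which is why the strided scheme maps well onto parallel threads.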
Because the code fetches neighbouring elements of the matrix, we had to add stricter bounds checks in the heat-calculation part to avoid illegal memory access.
 
====Comparing the results====
As a result of these optimizations, it was possible to further reduce the processing time.
[[File:runtime-CUDA.JPG]]