Team CNT

=== Assignment 2 ===
I am not sure whether we are doing team work or individual work, but since I did not hear anything from my teammate, I decided to continue with my first assignment. I met many difficulties adapting the existing C++ code to move some computations to the GPU. In my first assignment I perform different manipulations on the image, and Image is a class; this is one of the reasons for the delay with my second assignment. Apparently, I can't pass a class to the kernel: the kernel accepts only plain built-in types. So I kept trying different approaches to parallelize my code. I even bought a new computer with a CUDA-compatible GPU card to be able to spend more time on the tasks. When I finally found that the method image.negate() (not image rotation) would be easy to compute on the kernel, I met some other difficulties. Once my code was done, I decided to run 1000 negates of the image so that parallelizing the code would make a measurable difference. I rewrote some code in my first assignment and remade its profile (1000 negate operations): it took around 13 seconds on Linux. Then I profiled my kernel version with the CUDA profiler, and it took around 8 seconds. So here is my new profile for assignment 1:

[[File:profile_new.png]]

The code for the negate method is here:
<pre>
void Image::negateImage(Image& oldImage) /* negates the image */
{
    int rows = N, cols = M, gray = Q;
    Image tempImage(N, M, Q);
    for (int k = 0; k < 1000; k++) {            // repeat 1000 times for profiling
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++)
                tempImage.pixelVal[i][j] = -(pixelVal[i][j]) + 255;
        }
    }
    oldImage = tempImage;
}
</pre>

And this is a screenshot of my CUDA profile:

[[File:negate.png]]

The code for the kernel is:
<pre>
__global__ void cudaNegateImage2D(int *result, const int *work, int ni, int nj)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < ni && j < nj) {
        result[i * nj + j] = -(work[i * nj + j]) + 255;  // index work by i * nj + j, not just i
    }
}
</pre>
=== Assignment 3 ===
Now I realized I did not use shared memory in assignment 2, so I changed that: in assignment 3 I am using shared memory. As in assignment 2, I negate the image 1000 times. The last lectures of this course helped a lot in learning to optimize, especially coalesced access. It is amazing! In assignment 2 my negate computation became almost twice as fast just from using the kernel, and I thought nothing more could be done. Now, thanks to the optimization, it is more than twice as fast as in assignment 2. This is my profile for assignment 3:
[[Image:Profiler_scr.png]]
 
The code for Kernel is here:
<pre>
__global__ void cudaNegateImage2D_Coalescence(int *result, const int *work, int ni, int nj)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    __shared__ int cache_a[NTPB][NTPB];
    __shared__ int cache_b[NTPB][NTPB];

    if (i < ni && j < nj) {
        // consecutive threadIdx.x values read consecutive addresses: coalesced
        cache_a[threadIdx.y][threadIdx.x] = work[j * ni + i];
        cache_b[threadIdx.y][threadIdx.x] = -cache_a[threadIdx.y][threadIdx.x] + 255;
    }
    __syncthreads();

    if (i < ni && j < nj) {
        result[j * ni + i] = cache_b[threadIdx.y][threadIdx.x];
    }
}
</pre>
It is an amazing course with very educational assignments.
