
DPS915 Toad

5,098 bytes added, 21:24, 13 November 2015
=== Assignment 2 ===
 
 
We decided to parallelize the image processor written by Christopher Ginac. It is a serial C++ program that supports many operations, such as negation, rotation, translation, flipping, and more. An image processor is a perfect candidate for a parallel solution, as it naturally operates on a matrix of pixel data. So let's get started!
 
Road Block Number 1
 
The first problem we had was with the C++ program itself. Paying respect to the mighty C++, the author of the original code used a class-based implementation to store the data members and methods of his image processor. Normally, this would be great! With our CUDA implementation, however, it caused a rather difficult problem: we could not access the private data members within the kernel. After thinking on this problem for a little while, we came up with a rather simple solution! We would copy the data into a temporary variable, which we would then pass to the device. This would have worked wonderfully, if we hadn't hit road block number 2.
 
Road Block Number 2
 
The author stored the pixels in a 2-dimensional array (int **pixelVal, accessed as pixelVal[row][col]). In terms of a serial program, and in regards to how an image is represented, this makes perfect sense. However, copying that array to the device was a difficult problem that we did not know how to solve. Luckily, we realized that we could serialize this 2D array into a 1D array in row-major format. Then our kernel could receive it as it does any other array, and all our problems would be solved! Success!
 
Once we got past Road Block Number 2, we realized that we shouldn't start with functions that require row or column manipulation. Therefore, we decided to parallelize 'negation' - that is, inverting each pixel's colour. Although serializing the data before and after the kernel call adds two more nested loops to the code, we are only interested in the negation portion. The rest we can optimize in Assignment 3. So, we took a timestamp before the kernel call and subtracted it from the time after the call finished to get the 'parallel' time. This is exactly analogous to the serial program, as it only had one loop. Let's get started!
 
The code:
 
 
<pre>
__global__ void cuda_negate_image(int * new_array, int * old_array, int rows, int cols){

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int n = rows * cols;

    if (i < n)
        new_array[i] = -(old_array[i]) + 255;
}

void Image::negateImage(Image& oldImage)
/*negates image*/
{
    int rows, cols, gray;
    rows = N;
    cols = M;
    gray = Q;

    //convert the 2D array to a 1D array
    int * h_old_array = new int[N*M];
    int * d_old_array;
    int * d_new_array;
    int k = 0;

    //copy the 2D array into the 1D array (row major)
    for(int i = 0; i < N; i++){
        for(int j = 0; j < M; j++){
            h_old_array[k] = oldImage.pixelVal[i][j];
            k++;
        }
    }

    int nblks = (N*M + ntpb - 1) / ntpb;

    cudaMalloc((void**)&d_old_array, rows * cols * sizeof(int));
    cudaMalloc((void**)&d_new_array, rows * cols * sizeof(int));

    cudaMemcpy(d_old_array, h_old_array, rows * cols * sizeof(int), cudaMemcpyHostToDevice);

    int start_s = clock();
    for(int i = 0; i < 1000; i++){
        cuda_negate_image<<<nblks, ntpb>>>(d_new_array, d_old_array, rows, cols);
    }
    //kernel launches are asynchronous - wait for them to finish before stopping the clock
    cudaDeviceSynchronize();
    int stop_s = clock();
    cout << "time: " << (stop_s - start_s) / double(CLOCKS_PER_SEC) * 1000 << endl;

    int * result = new int[N*M];
    cudaMemcpy(result, d_new_array, rows * cols * sizeof(int), cudaMemcpyDeviceToHost);

    k = 0;
    //copy the 1D result back into the 2D array
    for(int i = 0; i < N; i++){
        for(int j = 0; j < M; j++){
            pixelVal[i][j] = result[k];
            k++;
        }
    }

    //free the host and device memory
    delete [] h_old_array;
    delete [] result;
    cudaFree(d_old_array);
    cudaFree(d_new_array);
}</pre>
 
The original image:
 
http://i.imgur.com/VJm1IIS.png
 
After the negation:
 
http://i.imgur.com/llnfQJT.png
 
The results:
 
As you can see from Assignment 1, it took the serial program 230 milliseconds to negate an image of this size. Now, seeing as this is the mighty GTX 960, we thought we would run the process not just once, but 1000 times! That means the serial implementation would take approximately 230 seconds to complete. So how fast did the GPU do it?
 
http://i.imgur.com/9AtFQ48.png
 
We saw the GPU smash through 1000 iterations in 22 milliseconds. That's over 10,000 times faster! Clearly, image processing begs to be worked on by parallel processors. The massive throughput of the 1024 CUDA cores, which can operate on thousands of pixels at the same time, beat the CPU without breaking a sweat. Here is the NSIGHT performance analysis:
 
http://i.imgur.com/H9P0pWX.png
 
http://i.imgur.com/2bdRt8T.png
 
http://i.imgur.com/zdFf4DD.png
 
In closing, we're interested in optimizing our image processor to handle 2D arrays without serialization. That way, we can extend our GPU power to the rest of the methods without worrying about tracking rows and columns within a 1D array. All in all, we learned a lot and look forward to applying our knowledge further in the coming weeks!
 
Source of Serial Program:
 
http://www.dreamincode.net/forums/topic/174119-image-processing-tutorial/
 
=== Assignment 3 ===
