DPS915 Toad
The original image:
[[Image:Org_image_before_negation.png]]
After the negation:
[[Image:Original_image_toad.png]]
The results:
As you can see from Assignment 1, it took the serial program 230 milliseconds to negate an image of this size. Since we are working with a mighty GTX 960, we decided not to run the process just once, but 1000 times. At that rate, the serial implementation would take approximately 230 seconds to complete. So how fast did the GPU do it?
[[Image:Gpu_speed_img_1.png]]
We saw the GPU smash through 1000 iterations in 22 milliseconds. That's over 10,000 times faster! Clearly, image processing begs to be worked on by parallel processors. The massive throughput of the 1024 CUDA cores, which can operate on thousands of pixels at the same time, beat the CPU without breaking much of a sweat. Here is the NSIGHT performance analysis:
[[Image:Nsight_performance_analysis_1.png]]
[[Image:Nsight_performance_analysis_2.png]]
[[Image:Nsight_performance_analysis_3.png]]
In closing, we're interested in optimizing our image processor to handle 2D arrays directly, without first flattening them. That way, we can extend our GPU power to the rest of the methods without worrying about mapping columns and rows into a 1D array. All in all, we learned a lot and look forward to applying our knowledge further in the coming weeks!
=== Assignment 3 ===
 
The first optimization we made was to precompute the product of rows * cols on the host, outside of the kernel itself. That makes sense for our code, because we're effectively running through 1000 image files of 18 MB each, so every bit of per-thread work we remove takes some strain off the GPU.
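To make the change concrete, here is a minimal before/after sketch. The cuda_negate_image_slow variant is a hypothetical reconstruction of the unoptimized form, shown only to contrast a kernel that multiplies rows * cols on every thread with the version that receives the precomputed total n from the host:

<pre>
// Hypothetical "before": every thread recomputes rows * cols for its bounds check.
__global__ void cuda_negate_image_slow(int * new_array, int * old_array, int rows, int cols){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < rows * cols)                     // per-thread multiplication
        new_array[i] = -(old_array[i]) + 255;
}

// "After": the host computes n = rows * cols once and passes it in.
__global__ void cuda_negate_image(int * new_array, int * old_array, int n){
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)                               // simple bounds check, no multiplication
        new_array[i] = -(old_array[i]) + 255;
}
</pre>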
 
When we ran that, the performance improved from 22 milliseconds to 21 milliseconds. It sounds small, but spread over the course of many more images being processed, it's quite the gain.
 
[[Image:Result1.png]]
 
 
 
[[Image:improvement1.png]]
 
We tried to use shared memory as well; however, our array size was simply too large.
 
In the end, after countless and tiring changes to the code, we settled on this simple configuration as the fastest for our purpose of negating 1000 images.
 
Kernel:
 
<pre>
// Negate one pixel per thread: each thread maps its global index to one
// element of the flattened image and inverts the grayscale value.
__global__ void cuda_negate_image(int * new_array, int * old_array, int n){

    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < n)
        new_array[i] = -(old_array[i]) + 255;
}
</pre>
 
Method:
 
 
<pre>
void Image::negateImage(Image& oldImage)
/*negates image*/
{
    int rows, cols, gray;
    rows = N;
    cols = M;
    gray = Q;

    // host copy of the image, flattened to 1D, plus the device buffers
    int * h_old_array = new int[N*M];
    int * d_old_array;
    int * d_new_array;
    int k = 0;

    cout << rows * cols << endl;

    // copy the 2D pixel array into the 1D array (row major)
    for(int i = 0; i < N; i++){
        for(int j = 0; j < M; j++){
            h_old_array[k] = oldImage.pixelVal[i][j];
            k++;
        }
    }

    int nblks = (N*M + ntpb - 1) / ntpb;

    cout << rows * cols;

    cudaMalloc((void**)&d_old_array, rows * cols * sizeof(int));
    cudaMalloc((void**)&d_new_array, rows * cols * sizeof(int));

    // copy the flattened image to the device
    cudaMemcpy(d_old_array, h_old_array, rows * cols * sizeof(int), cudaMemcpyHostToDevice);

    int n = rows * cols;
    int start_s = clock();

    // launch the negation kernel 1000 times
    for(int i = 0; i < 1000; i++){
        cuda_negate_image<<<nblks, ntpb>>>(d_new_array, d_old_array, n);
    }

    int stop_s = clock();
    cout << "time: " << (stop_s - start_s) / double(CLOCKS_PER_SEC) * 1000 << endl;

    // copy the negated image back to the host
    int * result = new int[N*M];
    cudaMemcpy(result, d_new_array, rows * cols * sizeof(int), cudaMemcpyDeviceToHost);

    // unpack the 1D result back into the 2D pixel array
    k = 0;
    for(int i = 0; i < N; i++){
        for(int j = 0; j < M; j++){
            pixelVal[i][j] = result[k];
            k++;
        }
    }

    delete [] h_old_array;
    delete [] result;

    cudaFree(d_old_array);
    cudaFree(d_new_array);
}
 
</pre>
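For anyone reproducing our numbers, here is a minimal, hypothetical sketch of timing the same 1000-launch loop with CUDA events instead of clock(); events record timestamps on the device, and cudaEventSynchronize waits for the last kernel to finish before the elapsed time is read. This is not code from our build, and it assumes d_new_array, d_old_array, n, nblks and ntpb are set up exactly as in the method above:

<pre>
// Sketch: CUDA event timing around the 1000 kernel launches.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int i = 0; i < 1000; i++){
    cuda_negate_image<<<nblks, ntpb>>>(d_new_array, d_old_array, n);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);          // block until the last kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
cout << "kernel time: " << ms << " ms" << endl;

cudaEventDestroy(start);
cudaEventDestroy(stop);
</pre>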
 
As you can see, the clearest remaining optimization paths in the method are the two double for loops that convert the 2D array into a 1D array (and back again). Unfortunately, we couldn't get the CUDA kernel to handle very large 2D arrays the way we wanted. We tried very hard to do this, but simply could not :( One possible direction is sketched below.
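For future work, here is a minimal sketch of one direction we could take: cudaMallocPitch plus a 2D grid of threads, so the kernel indexes pixels by row and column instead of a flat offset. Everything here is hypothetical (the kernel name cuda_negate_image_2d, the helper negate_2d, and the assumption that the host image sits in one contiguous rows x cols buffer); it is a sketch of the technique, not working code from our project:

<pre>
// Hypothetical 2D version: one thread per (row, col) pixel; rows are padded
// to a device-friendly pitch (in bytes) by cudaMallocPitch.
__global__ void cuda_negate_image_2d(int * d_img, size_t pitch, int rows, int cols){
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols){
        // pitch is in bytes, so step to the row through a char* first
        int * row_ptr = (int*)((char*)d_img + row * pitch);
        row_ptr[col] = -(row_ptr[col]) + 255;   // negate in place
    }
}

// Host-side sketch: copy a contiguous rows x cols image to pitched device
// memory, negate it with a grid of 16x16 thread blocks, and copy it back.
void negate_2d(int * h_img, int rows, int cols){
    int * d_img;
    size_t pitch;
    cudaMallocPitch((void**)&d_img, &pitch, cols * sizeof(int), rows);
    cudaMemcpy2D(d_img, pitch, h_img, cols * sizeof(int),
                 cols * sizeof(int), rows, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    cuda_negate_image_2d<<<grid, block>>>(d_img, pitch, rows, cols);

    cudaMemcpy2D(h_img, cols * sizeof(int), d_img, pitch,
                 cols * sizeof(int), rows, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
}
</pre>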
