
GroupNumberUndefined

2,557 bytes added, 19:31, 11 April 2017


[[File:EnlargeImage.PNG]]

New implementation:

[[File:EnlargeImageOld.PNG]]

Kernel implementation:

[[File:Snippet.PNG]]

Before and after parallelization analysis of the Enlarge function:

[[File:EnlargeFunctionAnalysis.jpg]]

----

Another function that we chose to parallelize is the rotate function, which takes the angle of rotation, in degrees, as user input and rotates the picture by that amount.

The new implementation and the kernel are:

[[File:newRotate.jpg]]

Kernel:

[[File:rotateKernel.JPG]]

Before and after parallelization analysis of the Rotate function:

[[File:RotateFunctionAnalysis.jpg]]

=== Assignment 3 ===

Optimizing the rotate function and its kernel took some thought. At first I followed the suggestion of moving the bounds check and the filling of empty pixels into the kernel, but I realized that this is not possible without passing the image itself to the kernel, and removing the call to the "inBounds()" function caused the program to crash before it could even run. I therefore decided against placing the image in shared memory for the kernel calculation, because of the additional cudaMemcpy it would require.

We had already concluded that the majority of the run time is spent on memory transfers between the host and the device. Since the rotate kernel only needs to populate two arrays, copying the image into device memory would only add to the run time, especially for higher-resolution images.

On further inspection of the function and kernel, I realized that the array of pixels taken from oldImage was never used inside the kernel, so I removed it entirely. This includes removing its memory allocation and the host-to-device copy of the array, which further reduces the run time of the function.

[[File:hArrayRemove.jpg]]

Furthermore, I had previously put the "check for bounds" calculation and the "fill in empty pixels" calculation inside two separate nested for-loops. I have combined them into one, which removes one nested for-loop, and with it one full pass over the image, and should improve performance noticeably.
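To illustrate the idea of the combined loop, here is a minimal host-side C++ sketch. It is not the code from the screenshot below. It assumes that the two arrays populated by the kernel hold, for every destination pixel, the row and column of the source pixel, and that empty pixels are filled with a fixed background value. The names `gatherPixels`, `srcRow`, `srcCol` and `background`, and this standalone version of `inBounds`, are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for inBounds(): true when (r, c) lies inside
// a rows x cols image.
inline bool inBounds(int r, int c, int rows, int cols) {
    return r >= 0 && r < rows && c >= 0 && c < cols;
}

// One nested loop does both jobs: the bounds check and the fill.
// srcRow/srcCol are assumed to hold the source coordinates computed
// for each destination pixel (row-major order).
std::vector<int> gatherPixels(const std::vector<int>& src,
                              const std::vector<int>& srcRow,
                              const std::vector<int>& srcCol,
                              int rows, int cols, int background) {
    const std::size_t n = static_cast<std::size_t>(rows) * cols;
    assert(src.size() == n && srcRow.size() == n && srcCol.size() == n);

    std::vector<int> dst(n);
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            const std::size_t i = static_cast<std::size_t>(r) * cols + c;
            if (inBounds(srcRow[i], srcCol[i], rows, cols)) {
                // Source pixel exists: copy it.
                dst[i] = src[static_cast<std::size_t>(srcRow[i]) * cols + srcCol[i]];
            } else {
                // Source pixel falls outside the image: fill it.
                dst[i] = background;
            }
        }
    }
    return dst;
}
```

With two separate nested loops, every pixel is visited twice. Here each pixel is visited once and is either copied or filled in the same iteration.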

[[File:NestedCombined.jpg]]

Overall, this is what the optimized rotateImage() function and the rotate() kernel look like:

[[File:OptimizedFunction.jpg]]

Some calculations previously done inside the kernel (finding the center of the image and converting the angle to radians) were moved outside the kernel, and their values are now passed in as arguments.

Kernel:
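The hoisting described above can be sketched in plain C++ as follows. This is an assumption-laden illustration, not the kernel from the screenshot: `RotateParams`, `makeRotateParams` and `rotatePixel` are hypothetical names, and `rotatePixel` stands in for the work a single thread would do. The center and the angle in radians are computed once on the host and passed in, instead of being recomputed for every pixel.

```cpp
#include <cassert>
#include <cmath>

// Values that are the same for every pixel, computed once on the host.
struct RotateParams {
    int r0;      // center row
    int c0;      // center column
    float rads;  // rotation angle in radians
};

// Hypothetical host-side setup: runs once per call, not once per pixel.
RotateParams makeRotateParams(int rows, int cols, int degrees) {
    const float pi = 3.14159265f;
    return RotateParams{rows / 2, cols / 2, degrees * pi / 180.0f};
}

// Stand-in for the per-thread work: rotates (r, c) about the center
// using only values that were passed in.
void rotatePixel(int r, int c, const RotateParams& p, int& r1, int& c1) {
    const float s = std::sin(p.rads);
    const float co = std::cos(p.rads);
    r1 = static_cast<int>(p.r0 + (r - p.r0) * co - (c - p.c0) * s);
    c1 = static_cast<int>(p.c0 + (r - p.r0) * s + (c - p.c0) * co);
}
```

In a real kernel, `r` and `c` would be derived from the thread and block indices, and the three values in `RotateParams` would be passed as kernel arguments.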

[[File:OptimizedKernel.jpg]]

Profiling with the same images gives the following results:

[[File:OptimizedChart.jpg]]

The enlarge function offers few opportunities for optimization. The only change I made was to store some of the calculations in registers; the image below shows the final version of the enlarge function. The performance improvement was not significant enough to be worth documenting.
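As a rough illustration of storing a calculation in a register, here is a host-side C++ sketch of a nearest-neighbour enlarge. It is a guess at the general shape, not the function in the screenshot below. The name `enlarge`, the integer `factor` parameter and the row-major layout are all assumptions. Index calculations that do not change inside the inner loop are computed once and kept in local variables, which a compiler would typically hold in registers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical nearest-neighbour enlarge of a rows x cols image by an
// integer factor. Repeated index arithmetic is cached in locals.
std::vector<int> enlarge(const std::vector<int>& src,
                         int rows, int cols, int factor) {
    assert(factor > 0);
    assert(src.size() == static_cast<std::size_t>(rows) * cols);

    const int outCols = cols * factor;
    std::vector<int> dst(static_cast<std::size_t>(rows) * factor * outCols);

    for (int r = 0; r < rows * factor; ++r) {
        // Computed once per output row instead of once per pixel.
        const std::size_t srcRowStart = static_cast<std::size_t>(r / factor) * cols;
        const std::size_t dstRowStart = static_cast<std::size_t>(r) * outCols;
        for (int c = 0; c < outCols; ++c) {
            // The source pixel is read once into a local before the store.
            const int pixel = src[srcRowStart + c / factor];
            dst[dstRowStart + c] = pixel;
        }
    }
    return dst;
}
```

Caching values this way saves only a few arithmetic operations per pixel, which is consistent with the lack of a measurable speed-up reported above.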

[[File:OptimzedFunc.PNG]]

Andreybykin


CDOT Wiki β
