=== Assignment 3 ===
Optimizing the rotate function and its kernel took some thought. At first I followed the suggestion of moving the bounds check and the filling of empty pixels into the kernel, but I realized that this is not possible without passing the image itself to the kernel, and removing the call to the "inBounds()" function caused the program to crash before execution even began. I therefore decided against placing the image in shared memory for the kernel calculation, because of the cudaMemcpy this would require.
It had already been concluded that most of the run time is spent on memory transfers between the host and the device. Since the rotate kernel only needs to populate two arrays, copying the image into device memory would only add to the run time, especially for higher-resolution images.
Upon further inspection of the function and kernel, I realized that the array of pixels taken from the oldImage was never used inside the kernel, so I removed it entirely. This includes removing its memory allocation and the copy of the array from host to device, which further reduces the run time of the function.
Furthermore, I had previously put the "check for bounds" calculation and the "fill in empty pixels" calculation in two separate nested for-loops. I have combined them into a single nested loop, which removes one full pass over the image and should improve performance noticeably.
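The loop merge described above can be sketched on the host side as follows. This is only an illustration under stated assumptions, not the project's actual code: the function name rotateSinglePass, the single-channel row-major image layout, the fill parameter, and the body of inBounds() are all hypothetical stand-ins. It shows a single nested loop that, for each destination pixel, performs the bounds check and either copies the source pixel or writes the fill value, so no second pass is needed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the inBounds() function mentioned in the text.
static bool inBounds(int x, int y, int width, int height) {
    return x >= 0 && x < width && y >= 0 && y < height;
}

// Rotates a row-major, single-channel image by `angle` radians about its
// centre. The bounds check and the fill of empty pixels happen in the same
// nested loop, so the image is traversed only once.
std::vector<std::uint8_t> rotateSinglePass(const std::vector<std::uint8_t>& src,
                                           int width, int height,
                                           double angle, std::uint8_t fill) {
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint8_t> dst(src.size());
    const double c = std::cos(angle), s = std::sin(angle);
    const double cx = (width - 1) / 2.0, cy = (height - 1) / 2.0;

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Inverse mapping: find the source pixel that lands on (x, y).
            const double dx = x - cx, dy = y - cy;
            const int sx = static_cast<int>(std::lround(c * dx + s * dy + cx));
            const int sy = static_cast<int>(std::lround(-s * dx + c * dy + cy));

            // Bounds check and fill combined in one step.
            dst[y * width + x] = inBounds(sx, sy, width, height)
                                     ? src[sy * width + sx]
                                     : fill;
        }
    }
    return dst;
}
```

Calling rotateSinglePass(pixels, width, height, angle, 0) returns a new buffer of the same size. Whether the merge gives a measurable speed-up in the real program depends on how much of the run time the two loops accounted for relative to the host-to-device transfers.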