64
edits
Changes
→Assignment 3
=== Assignment 3 ===
<h3h4>Shared memory</h3h4>
We attempted two potential applications of shared memory. Our first try was to store the entire image in shared memory. We realized that there was no need because this source image was only accessed once in the algorithm. Therefore transferring the image from global to shared, and then back to global, would be very inefficient. To make shared memory efficient, we would need to access it multiple times for any actual speedup.
<h3h4>Constant memory</h3h4>
We declared our image as constant memory. Constant memory was implemented and that increased our speed. However because the size of the image is dynamic, this can't be considered real constant memory.
<h3h4>Instruction mixing</h3h4>
For instruction mixing, we unrolled the inner loop with a constant brush size. We removed the option for the user to change the brush size and unrolled the loop with a constant brush size. We have so many checks and other functions inside of our inner loop that unrolling was causing it to slow down slightly. The result was a slight speed down and therefore we did not use instruction mixing. However we kept some functions that slightly sped the processing up.
<h3h4>Reduce CUDA API calls</h3h4>
We attempted to use only 1 array to hold both the source and destination. Because the destination image calculations depend on data from the source, we cannot change the source as we go.
<h3h4>Coalesced Access</h3h4>
Our last attempt at optimization was to try using coalesced access. We switched the ''x'' and ''y'' grid values so that all the values are adjacent to each other in memory. This resulted in a 1.3x speedup when combined with constant memory and occupancy optimization.
<h3h4>Optimization results</h3h4>