BLAStoise
=== Assignment 2 ===
<h4>Parallelized Oil Painting Program</h4>
Despite initially choosing to parallelize the Sudoku solver in Assignment 1, we concluded that the oil painting program was better suited for us: we were able to grasp the logic behind the oil painting program, whereas we had a lot of trouble working out the logic behind solving Sudoku puzzles.

<h4>The Code</h4>
[[Media: A2-Blastoise.zip]] This zip file contains our full code and an executable version. To create your own Visual Studio project with this code, you will need to download OpenCV and apply the following property settings to your project:

* add the OpenCV include path to VC++ Directories -> Include Directories;
* add opencv_world320d.lib to Linker -> Input;
* add a post-build command that copies the OpenCV dll files into your project directory (alternatively, copy the OpenCV bin files into the debug directory of your project).

In the serial program, the oil painting worked by processing one pixel at a time in a double for loop over the height and width of the image:
<pre>
for (int i = 0; i < height; i++)                // rows
{
    for (int j = 0; j < width; j++)             // columns
    {
        result[i * width + j] = ProcessPixel(j, i); // per-pixel processing
    }
}
</pre>
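For reference, the following is a simplified sketch of what ProcessPixel does (the exact code is in the zip; Pixel, src, width, height, brushSize, and intensityLevels are illustrative names here): it builds a histogram of quantized intensities over the brush neighbourhood, then outputs the average colour of the most common intensity level.
<pre>
struct Pixel { unsigned char r, g, b; };

// Sketch only: assumes src, width, height, brushSize, and
// intensityLevels are accessible (e.g. as globals or captures).
Pixel ProcessPixel(int x, int y)
{
    int count[256] = {0};                        // votes per intensity level
    int sumR[256] = {0}, sumG[256] = {0}, sumB[256] = {0};

    for (int dy = -brushSize; dy <= brushSize; dy++) {
        for (int dx = -brushSize; dx <= brushSize; dx++) {
            int nx = x + dx, ny = y + dy;
            if (nx < 0 || nx >= width || ny < 0 || ny >= height)
                continue;                        // skip neighbours outside the image
            Pixel p = src[ny * width + nx];
            // quantize the neighbour's brightness into one of intensityLevels buckets
            int level = ((p.r + p.g + p.b) / 3) * intensityLevels / 256;
            count[level]++;
            sumR[level] += p.r; sumG[level] += p.g; sumB[level] += p.b;
        }
    }

    // find the most common intensity level in the brush neighbourhood...
    int best = 0;
    for (int l = 1; l < intensityLevels; l++)
        if (count[l] > count[best]) best = l;

    // ...and output the average colour of the neighbours at that level
    Pixel out;
    out.r = sumR[best] / count[best];
    out.g = sumG[best] / count[best];
    out.b = sumB[best] / count[best];
    return out;
}
</pre>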
We removed the need for this double for loop by using a kernel. In the main function, we created the following block and grid and launched our oil painting kernel. The ceil function was used to calculate the grid dimensions so that, for the given block size, the grid covers the entire image:
<pre>
const dim3 block(ntpb, ntpb, 1); // ntpb was calculated from the device property maxThreadsDim
const dim3 grid(ceil((float)width / block.x), ceil((float)height / block.y), 1);
oilPaint << <grid, block >> >(gpu_src, gpu_dst, width, height);
</pre>
In our kernel, the built-in thread-index variables let us determine the exact position of each thread's pixel in the 2D array. Instead of iterating through every pixel in the image, each thread now adjusts its own pixel's intensity. The overall logic of the code stayed the same: we moved the double for loop into the kernel and used i and j to locate the pixel.
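A minimal sketch of that index calculation (the complete kernel, with the oil painting logic inlined, is in the zip above):
<pre>
__global__ void oilPaint(const unsigned char* src, unsigned char* dst,
                         int width, int height)
{
    // built-in variables give each thread its own pixel coordinates
    int j = blockIdx.x * blockDim.x + threadIdx.x; // column (was the inner loop)
    int i = blockIdx.y * blockDim.y + threadIdx.y; // row    (was the outer loop)

    // ceil() in the grid dimensions can create threads past the image edge
    if (i >= height || j >= width)
        return;

    // the per-pixel work from ProcessPixel(j, i) runs here;
    // a simple pass-through is shown for brevity
    dst[i * width + j] = src[i * width + j];
}
</pre>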
<h4>The Results</h4>
The first graph compares the execution time of the original and the parallelized versions. There was a considerable speedup: the original appears to grow roughly exponentially with problem size, while the parallelized version grows almost logarithmically. The second graph shows the time spent in the kernel in milliseconds, along with the percentage of time spent on the device versus the host. The time spent in the kernel increases with the problem size, as expected. The time spent on the device also grows roughly logarithmically, which means that for extremely small problem sizes the parallel version might not show much speedup (this is caused by the overhead of the CUDA API calls).
[[File:A2-Result.PNG]]
=== Assignment 3 ===
 
[[Media: OilPaint.zip]] This is our complete optimized solution, including the executable, the code, and an image. The original parallelized solution took three arguments: the brush size, the intensity level, and the file name. Our default brush size was 5 and our intensity level was 20; these values work well for testing. Our optimized solution takes only one argument, the file name. To run the parallel and optimized examples, the OpenCV files, specifically opencv_world320d.dll, need to be in the same directory as the executable. We tried multiple optimization methods; some of them worked while others made our times worse. Below are all our attempts at optimization, both what worked and what made things worse.
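For example, assuming executables named OilPaintParallel.exe and OilPaint.exe (the actual names in the zip may differ):
<pre>
rem parallelized version: brush size, intensity level, file name
OilPaintParallel.exe 5 20 image.jpg

rem optimized version: file name only
OilPaint.exe image.jpg
</pre>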
 
<h4>Shared memory</h4>
 
We attempted two potential applications of shared memory. Our first try was to store the entire image in shared memory. We realized that there was no need, because the source image was only accessed once in the algorithm; transferring the image from global to shared memory, and then back to global, would be very inefficient. To make shared memory worthwhile, the data it holds must be accessed multiple times for any actual speedup.
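As a rough illustration (a generic tile-load pattern assuming a 16x16 block, not our actual code), each thread would copy its pixel into a shared tile and synchronize, only for the tile to be read once afterwards:
<pre>
__shared__ unsigned char tile[16][16];         // one block's worth of pixels

int j = blockIdx.x * blockDim.x + threadIdx.x;
int i = blockIdx.y * blockDim.y + threadIdx.y;

if (i < height && j < width)
    tile[threadIdx.y][threadIdx.x] = src[i * width + j]; // global -> shared
__syncthreads();

// the tile is read only once below, so the extra copy and the
// synchronization cost more than they save
</pre>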
 
 
Our second attempt was to store the arrays needed to calculate intensity in shared memory. We realized that this would not be possible, since each thread requires its own colour value arrays for calculations done at different times.
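In other words, the histograms are private per-thread state (an illustrative fragment; INTENSITY_LEVELS stands in for our intensity-level parameter):
<pre>
#define INTENSITY_LEVELS 20

__global__ void oilPaint(/* ... */)
{
    // every thread needs a private copy of these arrays, so they live in
    // local memory/registers; shared memory cannot hold one copy per thread
    int count[INTENSITY_LEVELS];
    int sumR[INTENSITY_LEVELS], sumG[INTENSITY_LEVELS], sumB[INTENSITY_LEVELS];
    // ...
}
</pre>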
 
 
<h4>Constant memory</h4>
We declared our image as constant, and that increased our speed. However, because the size of the image is only known at runtime, this cannot be considered true __constant__ memory.
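The distinction, roughly (illustrative declarations, not our exact code):
<pre>
// true __constant__ memory needs a size fixed at compile time and is
// capped at 64 KB, which rules out a dynamically sized image:
__constant__ unsigned char c_image[64 * 1024];

// for runtime-sized data, the closest substitute is marking the pointer
// read-only so loads can go through the read-only cache:
__global__ void oilPaint(const unsigned char* __restrict__ src,
                         unsigned char* dst, int width, int height);
</pre>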
 
 
<h4>Instruction mixing</h4>
For instruction mixing, we removed the option for the user to change the brush size and unrolled the inner loop with a constant brush size. However, our inner loop contains so many checks and other function calls that unrolling caused a slight slowdown, so we did not keep it. We did keep a few smaller changes that slightly sped up the processing.
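The unrolled form looked roughly like this (BRUSH stands in for our fixed brush size of 5):
<pre>
#define BRUSH 5  // brush size fixed at compile time, no longer a user argument

#pragma unroll   // the compiler can only fully unroll a constant trip count
for (int dy = -BRUSH; dy <= BRUSH; dy++) {
    #pragma unroll
    for (int dx = -BRUSH; dx <= BRUSH; dx++) {
        // bounds checks and histogram updates for each neighbour; these
        // branches are what kept the unrolled version from getting faster
    }
}
</pre>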
 
 
<h4>Reduce CUDA API calls</h4>
We attempted to use only one array to hold both the source and the destination. This did not work: because each destination pixel is calculated from surrounding source pixels, we cannot overwrite the source as we go.
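As a hypothetical illustration of why the single-buffer version breaks (ProcessPixelFrom is an invented name for the per-pixel routine reading from a given buffer):
<pre>
// single buffer: one cudaMalloc/cudaMemcpy pair instead of two, but...
img[i * width + j] = ProcessPixelFrom(img, j, i);
// ...another thread may have already overwritten pixels inside this
// thread's brush neighbourhood, so the result is nondeterministic
</pre>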
 
 
<h4>Coalesced Access</h4>
Our last attempt at optimization was coalesced access. We swapped the ''x'' and ''y'' grid values so that adjacent threads access values that are adjacent to each other in memory. This resulted in a 1.3x speedup when combined with constant memory and occupancy optimization.
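The difference in index mapping (illustrative):
<pre>
// coalesced: consecutive threads (threadIdx.x) read consecutive addresses
int j = blockIdx.x * blockDim.x + threadIdx.x; // column from x
int i = blockIdx.y * blockDim.y + threadIdx.y; // row from y
unsigned char p = src[i * width + j];

// uncoalesced: consecutive threads stride through memory `width` bytes apart
int i2 = blockIdx.x * blockDim.x + threadIdx.x; // row from x
int j2 = blockIdx.y * blockDim.y + threadIdx.y; // column from y
unsigned char q = src[i2 * width + j2];
</pre>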
 
 
 
<h4>Optimization results</h4>
 
[[File:Capture.PNG]]
 
These are our results after the different optimization steps. The combination of our optimizations resulted in a 1.3x speedup compared to the unoptimized version.