In the original program a BMP is loaded into a vector of uint8_t. This layout is not ideal for CUDA, so an array of pinned memory was allocated instead. The array holds the same number of pixels, but each one is stored as a "BGRPixel" structure, which is three contiguous floats. The contents of the vector are then copied into the pinned array.
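The conversion described above might look like the following minimal sketch. It is not the project's actual code: the function name <code>toBGRPixels</code> is hypothetical, the values are assumed to stay in the 0 to 255 range, and the input is assumed to be tightly packed BGR bytes with no BMP row padding. In the CUDA program the destination would be pinned memory (for example from <code>cudaMallocHost</code>); that allocation is left out so the sketch builds with a plain C++ compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Three contiguous floats per pixel, in BMP channel order (blue, green, red).
struct BGRPixel {
    float b;
    float g;
    float r;
};

static_assert(sizeof(BGRPixel) == 3 * sizeof(float),
              "BGRPixel is expected to be three tightly packed floats");

// Convert packed 8-bit BGR bytes into BGRPixel values (hypothetical helper).
// dst must have room for src.size() / 3 pixels. In the CUDA program dst
// would point at pinned host memory; here any writable buffer will do.
void toBGRPixels(const std::vector<std::uint8_t>& src, BGRPixel* dst) {
    assert(src.size() % 3 == 0);  // assumption: no row padding in src
    const std::size_t pixelCount = src.size() / 3;
    for (std::size_t i = 0; i < pixelCount; ++i) {
        dst[i].b = static_cast<float>(src[3 * i + 0]);
        dst[i].g = static_cast<float>(src[3 * i + 1]);
        dst[i].r = static_cast<float>(src[3 * i + 2]);
    }
}
```

Doing the conversion once on the host means the kernels can work on floats directly and avoid per-sample integer-to-float conversions.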
=== Device Memory Management ===
To blur a pixel, the surrounding pixels must be sampled, and near the edges this means sampling pixels outside the bounds of the image. The original program used a simple if check to determine whether a sample fell outside the bounds of the image and, if so, returned a black pixel instead. In a kernel this if statement would most likely have caused significant thread divergence, so the images created in device memory include additional padding of black pixels that makes the check unnecessary. Two such images were created, one for the horizontal blur and one for the vertical blur. Several small device arrays were also needed to store the Gaussian integrals used to produce the blurring effect.<br>
{| class="wikitable mw-collapsible mw-collapsed"
! Padding example
|-
| <div style="display:inline;">
[[File:shrunk.png]]
</div>
<div style="display:inline;">
[[File:pad.png]]
</div>
<br>
This is how the image would be padded for a blur with sigmas [x = 3, y = 3].
The original image is 2560x1600, which takes 11.7MB at three uint8_t values per pixel.
With blur sigmas [x = 3, y = 3] and conversion to float, each padded image is 2600x1640, which takes 48.8MB.
This is a 4.1% increase in pixel count and, with the conversion from uint8_t to float, roughly a 316% increase in GPU memory requirements.
Since two padded images are needed, at least 97.6MB will be on the GPU.
|}
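The padding scheme and the memory figures in the example can be checked with a short host-side sketch. All names here (<code>paddedSize</code>, <code>paddedBytes</code>, <code>padImage</code>, <code>sampleRowBlue</code>) are hypothetical, and the padding of 20 pixels per side is an assumption inferred from the 2560 to 2600 width change in the example, not a value taken from the project's code. The last function illustrates why the bounds check can be dropped: as long as the sampling radius does not exceed the padding, every read stays inside the padded buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct BGRPixel { float b, g, r; };

// Dimensions of an image surrounded by `pad` black pixels on every side.
struct PaddedSize {
    std::size_t width;
    std::size_t height;
};

PaddedSize paddedSize(std::size_t width, std::size_t height, std::size_t pad) {
    return {width + 2 * pad, height + 2 * pad};
}

// Bytes needed for one padded image of BGRPixel values.
std::size_t paddedBytes(PaddedSize s) {
    return s.width * s.height * sizeof(BGRPixel);
}

// Copy the image into the centre of a zero-filled (black) padded buffer.
std::vector<BGRPixel> padImage(const std::vector<BGRPixel>& src,
                               std::size_t width, std::size_t height,
                               std::size_t pad) {
    assert(src.size() == width * height);
    const PaddedSize p = paddedSize(width, height, pad);
    std::vector<BGRPixel> dst(p.width * p.height, BGRPixel{0.0f, 0.0f, 0.0f});
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            dst[(y + pad) * p.width + (x + pad)] = src[y * width + x];
        }
    }
    return dst;
}

// Weighted horizontal sample of the blue channel around original pixel (x, y).
// weights is assumed to have an odd length. Because radius <= pad, every
// read lands inside the padded buffer, so no per-sample bounds check is needed.
float sampleRowBlue(const std::vector<BGRPixel>& padded,
                    std::size_t paddedWidth, std::size_t pad,
                    std::size_t x, std::size_t y,
                    const std::vector<float>& weights) {
    const std::size_t radius = weights.size() / 2;
    assert(radius <= pad);
    const std::size_t centre = (y + pad) * paddedWidth + (x + pad);
    float sum = 0.0f;
    for (std::size_t k = 0; k < weights.size(); ++k) {
        sum += weights[k] * padded[centre - radius + k].b;
    }
    return sum;
}
```

With these definitions, <code>paddedSize(2560, 1600, 20)</code> gives 2600x1640, and <code>paddedBytes</code> of that size is 51,168,000 bytes, which matches the 48.8MB per padded image quoted in the example.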
=== Host to Device ===