<b><font style="font-size:140%"> Changes </font></b>
Optimization presented some interesting challenges. The sections below describe the factors that were important in the attempt to optimize each kernel.
The greatest gains were made in the convolution and Canny kernels, with minimal gains in grey world and auto contrast, and no gains in resize. No changes were made to either Gaussian, as they were implemented as convolution kernels.
[[File:A3.png]]
One change that could be applied everywhere was switching all variables from double to float, which improved performance across the board.
<b><font style="font-size:140%"> Shared Memory</font></b>
Due to the nature of the algorithms, shared memory was not an option for any kernel. Had it been used, bank conflicts would have needed to be addressed, either by padding or by using different types in the data structures. Consider the following struct:
<pre>
struct RGBApixel {
    unsigned char Blue;
    unsigned char Green;
    unsigned char Red;
    unsigned char Alpha;
};
</pre>
This struct is stored in one bank in shared memory, since each char takes up 8 bits. If a thread needs access to the individual elements of this struct, it causes a 4-way bank conflict, since the thread has to read from the same bank 4 times. To prevent this, the struct can be widened:
<pre>
struct RGBApixel {
    unsigned int Blue;
    unsigned int Green;
    unsigned int Red;
    unsigned int Alpha;
};
</pre>
This struct is now spread across 4 banks, since each member is now 32 bits, or one full bank each. With this, bank conflicts are eliminated.
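The padding alternative mentioned above is not used in this project, but a minimal sketch of the idea is shown below. The transpose kernel and TILE_DIM are hypothetical, and the image dimensions are assumed to be multiples of the tile size; the point is that one extra column in a shared-memory tile shifts each row into a different bank, so threads reading down a column no longer collide.
<pre>
// Minimal sketch of the padding alternative (hypothetical example, not code
// from this project). The "+ 1" column shifts each row of the tile by one
// bank, so a warp reading down a column of the tile hits 32 different banks.
#define TILE_DIM 32

__global__ void transposeTile(const float* in, float* out, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];  // padded to avoid bank conflicts

    // Assumes width and height are multiples of TILE_DIM.
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Write the transposed tile; the column read of the tile below would be a
    // 32-way bank conflict without the padding column.
    int xOut = blockIdx.y * TILE_DIM + threadIdx.x;
    int yOut = blockIdx.x * TILE_DIM + threadIdx.y;
    out[yOut * height + xOut] = tile[threadIdx.x][threadIdx.y];
}
</pre>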
<b><font style="font-size:140%"> Convolution Kernel </font></b>
The biggest gains in optimization came from the convolution kernels.
old:
<pre>
cudaMalloc((void**)&_gpuKernel, _kernelXSize * _kernelYSize * sizeof(float));
cudaMemcpy(_gpuKernel, _kernel, _kernelXSize * _kernelYSize * sizeof(float),
cudaMemcpyHostToDevice);
</pre>
new:
<pre>
const int MAX_KERNEL_RADIUS = 7;
__constant__ float convolutionKernel[(MAX_KERNEL_RADIUS * 2 + 1) * (MAX_KERNEL_RADIUS * 2 + 1)];
</pre>
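With the filter weights in __constant__ memory, the host-side setup reduces to a single cudaMemcpyToSymbol call. The snippet below is a sketch of that copy, assuming the same host-side fields (_kernel, _kernelXSize, _kernelYSize) as the old code:
<pre>
// Sketch of the host-side setup for the constant-memory filter; replaces the
// old cudaMalloc/cudaMemcpy pair. _kernel, _kernelXSize and _kernelYSize are
// the same host-side fields used in the old version above.
cudaMemcpyToSymbol(convolutionKernel, _kernel,
                   _kernelXSize * _kernelYSize * sizeof(float));
</pre>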
In some of the kernels, thread divergence caused by per-pixel edge checks was a major slowdown. This was eliminated by padding the outside of the image.
<pre>
__global__ void padImageKernel(const GPU_RGBApixel* img, GPU_RGBApixel* output,
                               int width, int height, int wPad, int hPad)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    int paddedWidth  = width  + wPad * 2;
    int paddedHeight = height + hPad * 2;

    if (x >= paddedWidth || y >= paddedHeight)
    {
        return;
    }

    // Map the padded coordinate back into the source image; border pixels are
    // filled from the nearest edge pixel (clamp).
    int xCorr = min(max(x - wPad, 0), width  - 1);
    int yCorr = min(max(y - hPad, 0), height - 1);

    output[imageIndex(x, y, paddedHeight)] = img[imageIndex(xCorr, yCorr, height)];
}
</pre>
This removed the edge checks from the convolution kernel, allowing for a dramatic speedup in the Gaussian kernels.
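For illustration, a convolution pass over the padded image can dispense with edge checks entirely, as in the sketch below. The loop structure, the single-channel processing, and the clamping are assumptions rather than the exact kernel used here; GPU_RGBApixel and imageIndex are the same helpers as in the padding kernel above.
<pre>
// Sketch only - not the exact convolution kernel used in this project.
// Because the input is padded by the kernel radius on every side, every tap
// falls inside the padded image and no per-pixel edge checks are needed.
__global__ void convolveRed(const GPU_RGBApixel* padded, GPU_RGBApixel* output,
                            int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    int paddedHeight = height + radius * 2;
    int kernelWidth  = radius * 2 + 1;
    float sum = 0.0f;

    for (int ky = -radius; ky <= radius; ++ky)
    {
        for (int kx = -radius; kx <= radius; ++kx)
        {
            float w = convolutionKernel[(ky + radius) * kernelWidth + (kx + radius)];
            // (x + radius, y + radius) is this pixel's position inside the padded image.
            sum += w * padded[imageIndex(x + radius + kx, y + radius + ky, paddedHeight)].Red;
        }
    }

    output[imageIndex(x, y, height)].Red = (unsigned char)fminf(fmaxf(sum, 0.0f), 255.0f);
}
</pre>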
<b><font style="font-size:140%"> Canny </font></b>
Canny was optimized not through the kernel itself, but through some of the functions Canny used. These functions were rewritten with a little rearrangement of code and slight changes to their logic in order to gain performance.
<b><font style="font-size:140%"> Grey World </font></b>
The optimization for grey world was minimal, and in turn provided only a minimal improvement in execution time.
<pre>
newGrey[imageIndex(x, y, height)] = grey;
</pre>
<b><font style="font-size:140%"> Autocontrast</font></b>
The only optimization made was reducing global memory accesses per pixel from two to one.
<pre>
RGBApixel& resultPixel = result[index];
</pre>
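As an illustration of that change, the sketch below reuses a single reference to the result pixel for every channel update instead of indexing global memory separately for each access; scale and offset are hypothetical names, and the exact autocontrast arithmetic is not shown here.
<pre>
// Sketch only - scale and offset are hypothetical stand-ins for the
// autocontrast arithmetic. The pixel is located in global memory once and the
// reference is reused for every channel, instead of evaluating result[index]
// separately for each access.
RGBApixel& resultPixel = result[index];
resultPixel.Red   = (unsigned char)(resultPixel.Red   * scale + offset);
resultPixel.Green = (unsigned char)(resultPixel.Green * scale + offset);
resultPixel.Blue  = (unsigned char)(resultPixel.Blue  * scale + offset);
</pre>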
<b><font style="font-size:140%"> Resize </font></b>
When optimizing resize, there appeared to be a performance increase on the development machine where the code was written, but no improvement showed up when it was run on the benchmark system.