GPU610/TeamEh

293 bytes added, 14:41, 5 December 2014
Assignment 3
Optimization presented some interesting challenges. The following were important factors in the attempt to optimize:
* Each filter is independent of the other filters
* Filtering is 2D in nature
* Filters can be, and regularly are, convolved to form new filters
* Filters require an apron of neighboring pixels to calculate each output pixel
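The point about convolving filters to form new ones, and the apron each filter needs, can be illustrated with a small host-side sketch. It is not taken from the project: the names `convolveKernels` and `apronRadius` are hypothetical, and 1D kernels are used for brevity even though the filters here are 2D.

```cpp
#include <cstddef>
#include <vector>

// Full discrete convolution of two 1D filter kernels (hypothetical helper).
// The combined kernel has a.size() + b.size() - 1 taps.
std::vector<float> convolveKernels(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    if (a.empty() || b.empty()) return {};
    std::vector<float> out(a.size() + b.size() - 1, 0.0f);
    for (std::size_t i = 0; i < a.size(); ++i) {
        for (std::size_t j = 0; j < b.size(); ++j) {
            out[i + j] += a[i] * b[j];
        }
    }
    return out;
}

// Number of apron pixels needed on each side for a kernel of odd width.
int apronRadius(int kernelWidth) { return kernelWidth / 2; }
```

Convolving a 3-tap kernel with itself gives a 5-tap kernel, so the combined filter needs an apron of 2 pixels instead of 1.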
The greatest gains were made in the convolution and Canny kernels, with minimal gains in grey world and auto contrast and no gains in resize. No changes were made to either Gaussian filter, as both were implemented as convolution kernels.
[[File:A3.png]]
One change that improved performance wherever it could be applied was changing variables from double to float.
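A related detail when switching from double to float is that an unsuffixed literal such as `0.5` is still a double and promotes the whole expression. Whether the project code had such literals is not stated; the sketch below only shows the language rule, and `scaleFloat` is a hypothetical name.

```cpp
#include <type_traits>

// Hypothetical helper: the 'f' suffix keeps the arithmetic in single precision.
inline float scaleFloat(float v) { return v * 0.5f; }

// float * float stays float...
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "single-precision expression");
// ...but an unsuffixed literal is a double and promotes the expression.
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "expression promoted to double");
```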
<b><font style="font-size:140%"> Shared Memory</font></b>
Due to the nature of the algorithms, shared memory was not an option for any kernel.
Where shared memory is used, one way to avoid bank conflicts is to pad the data structures or use different types in them.
<pre>
struct RGBApixel {
  unsigned char Blue;
  unsigned char Green;
  unsigned char Red;
  unsigned char Alpha;
};
</pre>
This struct fits in a single 32-bit shared memory bank, since each unsigned char takes up 8 bits. Accessing the individual members of the struct requires four separate reads of the same bank, which serializes the accesses much like a 4-way bank conflict. To prevent this, the members can be widened:
<pre>
struct RGBApixel {
  // char32_t is already unsigned and at least 32 bits wide
  char32_t Blue;
  char32_t Green;
  char32_t Red;
  char32_t Alpha;
};
</pre>
This struct is now spread across 4 banks, since each member is now 32 bits wide and fills one bank. With this layout, the bank conflicts are eliminated.
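The bank arithmetic behind this can be checked on the host. The sketch below assumes 32 banks that are each 4 bytes wide; the struct names `PackedPixel` and `WidePixel` and the helper `bankOf` are hypothetical stand-ins for the two layouts above.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed shared memory layout: 32 banks, each 4 bytes wide.
constexpr std::size_t kBankWidthBytes = 4;
constexpr std::size_t kNumBanks = 32;

struct PackedPixel { unsigned char Blue, Green, Red, Alpha; };  // 4 bytes
struct WidePixel   { std::uint32_t Blue, Green, Red, Alpha; };  // 16 bytes

static_assert(sizeof(PackedPixel) == 4, "all four members share one bank");
static_assert(sizeof(WidePixel) == 16, "one bank per member");

// Bank that holds the byte at the given offset from the start of shared memory.
constexpr std::size_t bankOf(std::size_t byteOffset) {
    return (byteOffset / kBankWidthBytes) % kNumBanks;
}
```

With these assumptions, every member of `PackedPixel` maps to bank 0, while the members of `WidePixel` map to banks 0 through 3. The trade-off is that the wide layout uses four times as much shared memory per pixel.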
<b><font style="font-size:140%"> Convolution Kernel </font></b>
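The biggest gains in optimization came from the convolution kernels. The filter coefficients, which were previously copied to global memory, were moved into constant memory.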
Old:
<pre>
// in the operation constructor
cudaMalloc((void**)&_gpuKernel, _kernelXSize * _kernelYSize * sizeof(float));
cudaMemcpy(_gpuKernel, _kernel, _kernelXSize * _kernelYSize * sizeof(float),
           cudaMemcpyHostToDevice);
</pre>
New:
<pre>
const int MAX_KERNEL_RADIUS = 7;

// sized for the largest supported kernel: (2 * radius + 1) taps per side
__constant__ float convolutionKernel[(MAX_KERNEL_RADIUS * 2 + 1) * (MAX_KERNEL_RADIUS * 2 + 1)];

// before running the filter
checkError(cudaMemcpyToSymbol(convolutionKernel, _kernel,
                              _kernelXSize * _kernelYSize * sizeof(float)),
           "initializing kernel");
</pre>
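Because the `__constant__` array has a fixed size, a kernel larger than the array would overflow it during the copy. A minimal host-side guard could look like the sketch below. It assumes a capacity of (2 * radius + 1) taps per side, and the names `kMaxKernelRadius`, `kConstantCapacity` and `kernelFitsConstantBuffer` are hypothetical rather than part of the project.

```cpp
#include <cstddef>

// Assumed limits for a fixed-size constant-memory coefficient array.
constexpr int kMaxKernelRadius = 7;
constexpr int kMaxKernelWidth = 2 * kMaxKernelRadius + 1;  // 15 taps per side
constexpr std::size_t kConstantCapacity =
    static_cast<std::size_t>(kMaxKernelWidth) * kMaxKernelWidth;  // 225 floats

// Returns true when a kernelXSize x kernelYSize filter fits in the array.
bool kernelFitsConstantBuffer(int kernelXSize, int kernelYSize) {
    if (kernelXSize <= 0 || kernelYSize <= 0) return false;
    return static_cast<std::size_t>(kernelXSize) *
               static_cast<std::size_t>(kernelYSize) <= kConstantCapacity;
}
```

Such a check would run before the copy to constant memory, with a fallback or an error for filters that do not fit.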
In some of the kernels, the edge checks caused thread divergence, which was a major slowdown. This was eliminated by padding the outside of the image.
<pre>
__global__ void padImageKernel(const GPU_RGBApixel* img, GPU_RGBApixel* output,
                               int width, int height, int wPad, int hPad) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int paddedWidth = width + wPad * 2;
    int paddedHeight = height + hPad * 2;

    // threads outside the padded image have nothing to do
    if (x >= paddedWidth || y >= paddedHeight) {
        return;
    }

    // shift apron coordinates by one image width/height
    int xCorr = (x < wPad) ? width + x : x;
    xCorr = (xCorr >= width + wPad) ? xCorr - width : xCorr;
    int yCorr = (y < hPad) ? height + y : y;
    yCorr = (yCorr >= height + hPad) ? yCorr - height : yCorr;

    output[imageIndex(x, y, paddedHeight)] = img[imageIndex(xCorr, yCorr, height)];
}
</pre>
This removed the edge checks from the convolution kernel, allowing for a dramatic increase in Gaussian performance.
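A CPU reference is a convenient way to check a padding kernel's output. The sketch below is not the project's code: it assumes wrap-around (periodic) padding, a row-major layout, and plain `int` pixels, and the names `refIndex` and `padWrapReference` are hypothetical. The project's `imageIndex` helper may use a different layout.

```cpp
#include <cstddef>
#include <vector>

// Row-major index (assumption; not necessarily the layout imageIndex uses).
inline int refIndex(int x, int y, int width) { return y * width + x; }

// Builds a padded copy of img in which the apron wraps around to the
// opposite side of the source image.
std::vector<int> padWrapReference(const std::vector<int>& img,
                                  int width, int height, int wPad, int hPad) {
    const int paddedWidth = width + 2 * wPad;
    const int paddedHeight = height + 2 * hPad;
    std::vector<int> out(static_cast<std::size_t>(paddedWidth) * paddedHeight);
    for (int y = 0; y < paddedHeight; ++y) {
        // map the padded row back into [0, height)
        const int srcY = ((y - hPad) % height + height) % height;
        for (int x = 0; x < paddedWidth; ++x) {
            // map the padded column back into [0, width)
            const int srcX = ((x - wPad) % width + width) % width;
            out[refIndex(x, y, paddedWidth)] = img[refIndex(srcX, srcY, width)];
        }
    }
    return out;
}
```

With the apron filled in ahead of time, a convolution over the padded image can read every neighbor unconditionally, which is what lets the edge checks be dropped.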
<b><font style="font-size:140%"> Canny </font></b>
Canny was optimized not through the kernel itself, but through some of the functions it uses. These functions were rewritten, with some rearrangement of code and slight changes to their logic, to gain performance.
<b><font style="font-size:140%"> Grey world </font></b>
The optimization for grey world was minimal and, in turn, provided only a minimal improvement in execution time.
newGrey[imageIndex(x, y, height)] = grey;
<b><font style="font-size:140%"> Autocontrast</font></b>
The only optimization made was reducing the number of global memory accesses from 2 to 1.
RGBApixel& resultPixel = result[index];
<b><font style="font-size:140%"> Resize </font></b>
When optimizing resize, there appeared to be a performance increase on the development machine where the code was written, but no increase was observed when the code was run on the benchmark system.