1
edit
Changes
→Assignment 3
=== Assignment 3 ===
Optimization presented some interesting challenges. The following were important factors in the attempt to optimize:
· Filtering problems were independent of other filters
· Filtering is 2D in nature
· Filters can and are regularly convolved to form new filters
· Filters require an apron of neighboring pixels to calculate the filter
The greatest gains were made in the convolution, and canny kernels with minimal gains in grey world, auto contrast and no gains in resize. No changes were made to either Gaussian as they were implemented as convolution kernels.
[[Image:https://imgur.com/aWCDGRB|thumb|1265px| ]]
One change that improved performance everywhere where able to be implemented was changing all variables from double to float. This improved performance across the board.
Shared Memory
Due to the nature of the algorithms, shared memory was not an option for any kernel.
In all of the kernels, the calculations performed for a resultant pixel were independent from all other resultant pixels. This meant that one pixels result was completely independent of all other pixels. An attempt was made in the resize kernel but after careful analysis of shared memory access and further understanding of the algorithm, it was determined that shared memory would only result in performance loss.
While researching shared memory, we did come across the problem of bank conflicts which was not covered in class.
Shared memory is divided into banks, each bank is made up of 4bytes (32 bits).
Bank 1 2 3 4
Byte | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 | 1 2 3 4 |
If multiple threads try to access the same bank, a conflict occurs, and access has to be serialized. This can be prevented in multiple ways.
First, is to pad, or use different types in your data structures.
struct RGBApixel {
unsigned char Blue;
unsigned char Green;
unsigned char Red;
unsigned char Alpha;
}
This struct is stored in one bank in shared memory since each char takes up 8 bits. If you need access to the individual elements of this struct, it will cause a 4 way bank conflict since the tread has to read from the bank 4 times. To prevent this, we could do this
struct RGBApixel {
unsigned char32_t Blue;
unsigned char32_t Green;
unsigned char32_t Red;
unsigned char32_t Alpha;
}
This struct is now spread across 4 banks since char now 32 bits, or one full bank each. With this, bank conflicts are eliminated.
Convolution Kernel
The biggest gains in optimization came from the convolution kernels.
Constant memory use improved the access times in the convolution kernel.
old:
//in operation constructor
cudaMalloc((void**)&_gpuKernel, _kernelXSize * _kernelYSize * sizeof(float));
cudaMemcpy(_gpuKernel, _kernel, _kernelXSize * _kernelYSize * sizeof(float),
cudaMemcpyHostToDevice);
new
const int MAX_KERNEL_RADIUS = 7;
__constant__ float convolutionKernel[(MAX_KERNEL_RADIUS * 2) * (MAX_KERNEL_RADIUS * 2) + 1];
//before running the filter
checkError(cudaMemcpyToSymbol(convolutionKernel, _kernel,
_kernelXSize * _kernelYSize * sizeof(float)), "initializing kernel");
Due to the nature of some of the kernels, thread divergence was a major slowdown due to having to do edge checks. This was eliminated by padding the outside of the image.
__global__ void padImageKernel(
const GPU_RGBApixel* img, GPU_RGBApixel* output, int width, int height,
int wPad, int hPad)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int paddedWidth = width + wPad * 2;
int paddedHeight = height + hPad * 2;
if (x >= paddedWidth || y >= paddedHeight)
{
return;
}
int xCorr = (x < wPad) ? width + x : x;
xCorr = (xCorr >= width + wPad) ? xCorr - width : xCorr;
int yCorr = (y < hPad) ? height + y : y;
yCorr = (yCorr >= height + hPad) ? yCorr - height : yCorr;
output[imageIndex(x, y, paddedHeight)] = img[imageIndex(xCorr, yCorr, height)];
}
This removed the edge checks from the convolution kernel allowing for a dramatic increase in Gaussian.
//old
for (int i = -kXRad; i <= kXRad; i++)
{
for (int j = -kYRad; j <= kYRad; j++)
{
//wrap image
int xCorr = (x + i < 0) ? width + i : x + i;
xCorr = (xCorr >= width) ? xCorr - width : xCorr;
int yCorr = (y + j < 0) ? height + j : y + j;
yCorr = (yCorr >= height) ? yCorr - height : yCorr;
const float kernelElement = kernel[kernelIndex(i, j, kernelXSize, kernelYSize)];
const GPU_RGBApixel& imageElement = img[imageIndex(xCorr, yCorr, height)];
outPixel.Red += kernelElement * imageElement.Red;
outPixel.Green += kernelElement * imageElement.Green;
outPixel.Blue += kernelElement * imageElement.Blue;
}
}
//new
for (int i = -kXRad; i <= kXRad; i++)
{
for (int j = -kYRad; j <= kYRad; j++)
{
const float kernelElement = convolutionKernel[kernelIndex(i, j, kernelXSize, kernelYSize)];
const GPU_RGBApixel imageElement = img[imageIndex(xCorr + i, yCorr + j, height + kernelYSize - 1)];
outPixel.Red += kernelElement * imageElement.Red;
outPixel.Green += kernelElement * imageElement.Green;
outPixel.Blue += kernelElement * imageElement.Blue;
}
}
Canny
Canny was optimized not through the kernel itself, but through some of the functions canny used. The functions were rewritten with a little rearrangement of code and slight manipulation of their logic in order for performance gains.
notMaxSuppression used to have a large if block in the middle of the kernel that could potentially split into 4 threads. It was rewritten using a pre-computed table of outcomes and a little “mathemagic” to eliminate the if statements. It has reduced it reduces it from four paths to two.
//changed the angle finding logic
//old
int index = imageIndex(x, y, height);
const float& gM = gradMagnitude[index];
result[index] = gM;
//don't touch the edges
if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
{
return;
}
const float& gA = gradAngle[index];
int i;
int j;
result[index] = gradMagnitude[index];
if (((gA >= -M_PId8) && (gA <= M_PId8)) || (gA >= 7 * M_PId8) || (gA <= -7 * M_PId8))
{
i = 0;
j = -1;
}
else if (((gA <= 3 * M_PId8) && (gA > M_PId8)) || ((gA <= -5 * M_PId8) && (gA > -7 * M_PId8)))
{
i = -1;
j = 1;
}
else if (((gA >= 5 * M_PId8) && (gA < 7 * M_PId8)) || ((gA < -M_PId8) && (gA >= -3 * M_PId8)))
{
i = -1;
j = -1;
}
else
{
i = -1;
j = 0;
}
if (!((gM <= gradMagnitude[imageIndex(x + i, y + j, height)])
|| (gM <= gradMagnitude[imageIndex(x - 1, y - j, height)])))
{
result[index] = gM;
}
//new
int xCorr = x + 1;
int yCorr = y + 1;
int heightCorr = height + 2;
int index = imageIndex(xCorr, yCorr, heightCorr);
const float gM = gradMagnitude[index];
const int sectorConvertX[9] = { 0, -1, -1, -1, -1, -1, -1, 0 , 0 };
const int sectorConvertY[9] = { -1, 1, 1, -1, -1, 0, 0, -1 , -1 };
const float gA = fabsf(gradAngle[imageIndex(x, y, height)]);
const int sector = int(gA / M_PI) * 8;
int i = sectorConvertX[sector];
int j = sectorConvertY[sector];
if (!((gM <= gradMagnitude[imageIndex(xCorr + i, yCorr + j, heightCorr)])
|| (gM <= gradMagnitude[imageIndex(xCorr - 1, yCorr - j, heightCorr)])))
{
result[index] = gM;
}
else
{
result[index] = 0.0;
}
hysterisisSecondPass was padded and had its inner double loop unrolled and the inner loop was replaced with divisions to eliminate branching.
//old
if (pixel >= lowerThreshold && pixel < upperThreshold)
{
int iLower = (x == 0) ? 0 : -1;
int iUpper = (x == width - 1) ? 0 : 1;
int jLower = (height == 0) ? 0 : -1;
int jUpper = (x == width - 1) ? 0 : 1;
ret = 0.0;
for (int i = iLower; i <= iUpper; i++)
{
for (int j = jLower; j <= jUpper; j++)
{
if (image[imageIndex(x + i, y + j, height)])
{
ret = upperThreshold + 1;
}
}
}
}
//new
if (level >= lowerThreshold && level < upperThreshold)
{
level = 0.0;
level += image[imageIndex(xCorr + 1, yCorr + 1, heightCorr)] / upperThreshold;
level += image[imageIndex(xCorr + 0, yCorr + 1, heightCorr)] / upperThreshold;
level += image[imageIndex(xCorr + -1, yCorr + 1, heightCorr)] / upperThreshold;
level += image[imageIndex(xCorr + 1, yCorr + 0, heightCorr)] / upperThreshold;
level += image[imageIndex(xCorr + -1, yCorr + 0, heightCorr)] / upperThreshold;
level += image[imageIndex(xCorr + 1, yCorr + -1, heightCorr)] / upperThreshold;
level += image[imageIndex(xCorr + 0, yCorr + -1, heightCorr)] / upperThreshold;
level += image[imageIndex(xCorr + -1, yCorr + -1, heightCorr)] / upperThreshold;
level = level == 0 ? 0 : 255;
//level = (level) * (upperThreshold + 1);
}
Greyworld
The optimization for grey world was minimal, and in turn provided a minimal increase in execution.
Global memory calls were reduced from twelve to one, as well as reducing branching if statements.
if(averageR > 0.05){
newGrey[ imageIndex(x, y, height) ].Red = (255 < Original[ imageIndex(x, y, height) ].Red * averageAll / averageR)?
255 : Original[ imageIndex(x, y, height) ].Red * averageAll / averageR;
}
else
{
newGrey[ imageIndex(x, y, height) ].Red = (255 < Original[ imageIndex(x, y, height) ].Red + averageAll)?
255 : Original[ imageIndex(x, y, height) ].Red + averageAll;
}
if(averageB > 0.05){
newGrey[ imageIndex(x, y, height) ].Blue = (255 < Original[ imageIndex(x, y, height) ].Blue * averageAll / averageB)?
255 : Original[ imageIndex(x, y, height) ].Blue * averageAll / averageB;
}
else
{
newGrey[ imageIndex(x, y, height) ].Blue = (255 < Original[ imageIndex(x, y, height) ].Blue + averageAll)?
255 : Original[ imageIndex(x, y, height) ].Blue + averageAll;
}
if(averageG > 0.05){
newGrey[ imageIndex(x, y, height) ].Green = (255 < Original[ imageIndex(x, y, height) ].Green * averageAll / averageG)?
255 : Original[ imageIndex(x, y, height) ].Green * averageAll / averageG;
}
else
{
newGrey[ imageIndex(x, y, height) ].Green = (255 < Original[ imageIndex(x, y, height) ].Green + averageAll)?
255 : Original[ imageIndex(x, y, height) ].Green + averageAll;
}
GPU_RGBApixel orig = Original[imageIndex(x, y, height)];
GPU_RGBApixel grey = { 0, 0, 0 };
if(averageR > 0.05)
{
grey.Red = orig.Red * averageAll / averageR;
}
else
{
grey.Red = orig.Red + averageAll;
}
if(averageB > 0.05)
{
grey.Blue = orig.Blue * averageAll / averageB;
}
else
{
grey.Blue = orig.Blue + averageAll;
}
if(averageG > 0.05)
{
grey.Green = orig.Green * averageAll / averageG;
}
else
{
grey.Green = orig.Green + averageAll;
}
if(grey.Red > 255) grey.Red = 255;
if(grey.Blue > 255) grey.Blue = 255;
if(grey.Green > 255) grey.Green = 255;
newGrey[imageIndex(x, y, height)] = grey;
Autocontrast
The only optimization made was reducing global memory access from 2 to 1.
From:
const RGBApixel& imgPixel = img[imageIndex(x, y, height)];
RGBApixel& resultPixel = result[imageIndex(x, y, height)];
To:
Int index = imageIndex(x, y, height);
const RGBApixel& imgPixel = img[index];
RGBApixel& resultPixel = result[index];
Rezise
When optimizing resize, on the development machine where the code was being written, there seemed to be an increase of performance but when ran on the benchmark system there was no performance increase shown.
Logic was tinkered with, and some redundant caculations were eliminated storing the reults in local registers. As well, global memory access was reduced from 12 times a thread to 4. Finally, the result from the pixel calculation was stored in a temp local register and then that was writted to global memory, reducing the writes to global memory.
newImage[i * NewWidth + j].Red =
(ebmpBYTE)((1.0 - ThetaI - ThetaJ + ThetaI*ThetaJ)*(OldImage[I * OldWidth + J].Red)
+ (ThetaI - ThetaI*ThetaJ)*(OldImage[(I + 1) * OldWidth + J].Red)
+ (ThetaJ - ThetaI*ThetaJ)*(OldImage[I * OldWidth + J + 1].Red)
+ (ThetaI*ThetaJ)*(OldImage[(I + 1) * OldWidth + J + 1].Red));
newImage[i * NewWidth + j].Green =
(ebmpBYTE)((1.0 - ThetaI - ThetaJ + ThetaI*ThetaJ)*(OldImage[I * OldWidth + J].Green)
+ (ThetaI - ThetaI*ThetaJ)*(OldImage[(I + 1) * OldWidth + J].Green)
+ (ThetaJ - ThetaI*ThetaJ)*(OldImage[I * OldWidth + J + 1].Green)
+ (ThetaI*ThetaJ)*(OldImage[(I + 1) * OldWidth + J + 1].Green));
newImage[i * NewWidth + j].Blue =
(ebmpBYTE)((1.0 - ThetaI - ThetaJ + ThetaI*ThetaJ)*(OldImage[I * OldWidth + J].Blue)
+ (ThetaI - ThetaI*ThetaJ)*(OldImage[(I + 1) * OldWidth + J].Blue)
+ (ThetaJ - ThetaI*ThetaJ)*(OldImage[I * OldWidth + J + 1].Blue)
+ (ThetaI*ThetaJ)*(OldImage[(I + 1) * OldWidth + J + 1].Blue));
float t4 = ThetaI*ThetaJ;
float t1 = 1.0 - ThetaI - ThetaJ + t4;
float t2 = ThetaI - t4;
float t3 = ThetaJ - t4;
int p1 = I * OldWidth + J;
int p2 = (I + 1) * OldWidth + J;
int p3 = I * OldWidth + J + 1;
int p4 = (I + 1) * OldWidth + J + 1;
RGBApixel temp;
RGBApixel temp1 = OldImage[p1];
RGBApixel temp3 = OldImage[p3];
RGBApixel temp2 = OldImage[p2];
RGBApixel temp4 = OldImage[p4];
temp.Red =(ebmpBYTE)((t1)*(temp1.Red)
+ (t2)*(temp2.Red)
+ (t3)*(temp3.Red)
+ (t4)*(temp4.Red));
temp.Green =
(ebmpBYTE)((t1)*(temp1.Green)
+ (t2)*(temp2.Green)
+ (t3)*(temp3.Green)
+ (t4)*(temp4.Green));
temp.Blue =
(ebmpBYTE)((t1)*(temp1.Blue)
+ (t2)*(temp2.Blue)
+ (t3)*(temp3.Blue)
+ (t4)*(temp4.Blue));
During the optimization of resize, an attempt was made to make use of share memory. While eventually it was made to work, there was a 250% performance decrease. Here was the last working attempt at shared memory use. This method introduced branching in order to check for edge detection of the blocks. This method caused severe bank conflicts, and after further analysis revealed that it would not work due to each resultant pixels calculations were independent of every other pixel thus shared memory use was not possible.
__global__ void c_newPixel(RGBApixel * OldImage, RGBApixel *newImage,
int OldWidth, int OldHeight, int NewWidth, int NewHeight)
{
int j = blockIdx.x * blockDim.x + threadIdx.x;
int i = blockIdx.y * blockDim.y + threadIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;
if (i >= NewHeight - 1 || j >= NewWidth - 1)
{
return;
}
int I, J;
float ThetaI, ThetaJ;
ThetaJ = (float)(j*(OldHeight - 1.0)) / (float)(NewHeight - 1.0);
J = (int)floor(ThetaJ);
ThetaJ -= J;
ThetaI = (float)(i*(OldWidth - 1.0)) / (float)(NewWidth - 1.0);
I = (int)floor(ThetaI);
ThetaI -= I;
int blkArraySize = (blockDim.x + 1) * (blockDim.x + 1);
int blkArrayWidth = (blockDim.x + 1) * ty;
extern __shared__ RGBApixel pixel[];
pixel[tx + blkArrayWidth] = OldImage[I * OldWidth + J];
pixel[tx + blkArrayWidth + blkArraySize] = OldImage[(I + 1) * OldWidth + J];
/*
if (ty == blockDim.y - 1){
int I1;
float ThetaI1;
ThetaI1 = (float)((i + 1)*(OldWidth - 1.0)) / (float)(NewWidth - 1.0);
I1 = (int)floor(ThetaI1);
int J1;
float ThetaJ1;
ThetaJ1 = (float)((j)*(OldHeight - 1.0)) / (float)(NewHeight - 1.0);
J1 = (int)floor(ThetaJ1);
pixel[tx + blockDim.x + blkArrayWidth] = OldImage[I1 * OldWidth + J1];
pixel[tx + blockDim.x + blkArrayWidth + blkArraySize] = OldImage[(I1 + 1) * OldWidth + J1];
}
if (tx == blockDim.x - 1){
int J1;
float ThetaJ1;
ThetaJ1 = (float)((j + 1)*(OldHeight - 1.0)) / (float)(NewHeight - 1.0);
J1 = (int)floor(ThetaJ1);
int I1;
float ThetaI1;
ThetaI1 = (float)((i)*(OldWidth - 1.0)) / (float)(NewWidth - 1.0);
I1 = (int)floor(ThetaI1);
pixel[tx + blkArrayWidth + 1] = OldImage[I1 * OldWidth + J1];
pixel[tx + blkArrayWidth + 1 + blkArraySize] = OldImage[(I1 + 1) * OldWidth + J1];
}*/
__syncthreads();
RGBApixel pixelResult;
/*
RGBApixel p1 = pixel[tx + blkArrayWidth];
RGBApixel p2 = pixel[tx + blkArraySize + blkArrayWidth];
RGBApixel p3 = pixel[tx + blkArrayWidth + 1];
RGBApixel p4 = pixel[tx + blkArraySize + blkArrayWidth + 1];
*/
float t4 = ThetaI*ThetaJ;
float t1 = 1.0 - ThetaI - ThetaJ + t4;
float t2 = ThetaI - t4;
float t3 = ThetaJ - t4;
int ps1 = tx + blkArrayWidth;
int ps2 = tx + blkArraySize + blkArrayWidth;
int ps3 = ps1 + 1;
int ps4 = ps2 + 1;
pixelResult.Red =
(ebmpBYTE)((t1)*(pixel[ps1].Red)
+ (t2)*(pixel[ps2].Red)
+ (t3)*(pixel[ps3].Red)
+ (t4)*(pixel[ps4].Red));
pixelResult.Green =
(ebmpBYTE)((t1)*(pixel[ps1].Green)
+ (t2)*(pixel[ps2].Green)
+ (t3)*(pixel[ps3].Green)
+ (t4)*(pixel[ps4].Green));
pixelResult.Blue =
(ebmpBYTE)((t1)*(pixel[ps1].Blue)
+ (t2)*(pixel[ps2].Blue)
+ (t3)*(pixel[ps3].Blue)
+ (t4)*(pixel[ps4].Blue));
newImage[i * NewWidth + j] = pixelResult;
}