Changes

TudyBert

7,752 bytes added, 13:29, 19 April 2013

→‎Assignment 3

Here's the flat profile for 50 runs of enlarging a 512px x 512px image 4 times:

 % time   cumulative s   self s     calls   self/call   total/call   name
  19.64           2.75     0.55                                      Image::enlargeImage(int, Image&)
   1.07           2.78     0.03   4194304        0.00         0.00   Image::getPixelVal(int, int)

The code for enlargeImage():

int rows, cols, gray;

}

oldImage = tempImage;
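The listing above shows only fragments of enlargeImage(). As a minimal sketch of the kind of four-loop, nearest-neighbour enlargement being described, and not the Image class's actual code, the following works on a flat row-major buffer. The name enlargeNearest and the std::vector representation are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: enlarge a rows x cols image stored row-major in `src`
// by an integer `factor`, replicating each source pixel into a
// factor x factor square of the result.
std::vector<int> enlargeNearest(const std::vector<int>& src,
                                int rows, int cols, int factor)
{
    assert(rows > 0 && cols > 0 && factor > 0);
    assert(src.size() == static_cast<std::size_t>(rows) * cols);

    const int outCols = cols * factor;   // row stride of the enlarged image
    std::vector<int> dst(static_cast<std::size_t>(rows) * factor * outCols);

    for (int i = 0; i < rows; i++)                 // source row
    {
        for (int j = 0; j < cols; j++)             // source column
        {
            const int pixel = src[static_cast<std::size_t>(i) * cols + j];
            for (int c = i * factor; c < (i + 1) * factor; c++)       // output rows
            {
                for (int d = j * factor; d < (j + 1) * factor; d++)   // output columns
                {
                    dst[static_cast<std::size_t>(c) * outCols + d] = pixel;
                }
            }
        }
    }
    return dst;
}
```

Each iteration of the two outer loops writes a disjoint square of the output, which is what makes them candidates for replacement by thread indices.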

The four for loops look like they could be parallelized, since they only serve as counters. From the flat profile, it seems that the majority of the time is spent in the overloaded operator=() method. The code for this is:

<div>

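// Copy the image header: N rows, M columns, Q gray levels.
N = oldImage.N;
M = oldImage.M;
Q = oldImage.Q;

// Release the old flat buffer before reallocating it.
// Note: the previous pixelVal rows are not freed here.
if(dim1 != NULL)
{
    delete[] dim1;
}

pixelVal = new int* [N];
dim1 = new int[N*M];

for(int i = 0; i < N; i++)
{
    pixelVal[i] = new int [M];
    for(int j = 0; j < M; j++)
    {
        pixelVal[i][j] = oldImage.pixelVal[i][j];
        // Row-major index: the row stride is M (columns), not N.
        dim1[i*M + j] = oldImage.dim1[i*M + j];
    }
}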

</div>

~~<nowiki>~~

The bulk of the processing time is spent copying the two arrays over from one image to another. If I have time I might look into parallelizing this as well. It would be interesting to see if the speed of the GPU can overcome the overhead of copying to and from the device.
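To illustrate the copying cost: the assignment operator above allocates one buffer per image row and copies every pixel twice, once into pixelVal and once into dim1. A hedged sketch of a single contiguous copy is shown below. FlatImage, at and copyFrom are hypothetical names, not part of the Image class, and whether pixelVal can really be dropped depends on how the rest of the class uses it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat image: N rows, M columns, Q gray levels, one buffer.
struct FlatImage
{
    int N = 0;
    int M = 0;
    int Q = 0;
    std::vector<int> dim1;   // row-major, N * M elements

    // Pixel at row i, column j; the row stride is M.
    int& at(int i, int j)
    {
        assert(i >= 0 && i < N && j >= 0 && j < M);
        return dim1[static_cast<std::size_t>(i) * M + j];
    }

    // Copy the header and all pixels from `other`.
    void copyFrom(const FlatImage& other)
    {
        N = other.N;
        M = other.M;
        Q = other.Q;
        dim1 = other.dim1;   // one contiguous copy; old storage is managed by the vector
    }
};
```

A single flat buffer is also the layout that can be moved to and from the device with one memory copy per direction, so it would be a natural starting point if the copy itself were moved to the GPU.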

=== Assignment 2 ===

For Assignment 2 I simply put the four for loops into a kernel and replaced the outermost loop with thread indices. I made a helper method that set up memory on the device and launched the kernel with a one-dimensional array of blocks, each containing 1 thread. I launched as many blocks as there were rows in the image file. I figured this was the quickest way to get this method parallelized.

Unfortunately I hit a wall with my data sizes. The CPU version of the enlarge image method fails when run for more than 50 loops. The error thrown is a Visual Studio debugging error, so I think VS isn't too happy with having the CPU hogged for so long. As a result I've had to extrapolate times for larger loops by assuming a linear increase in time taken.
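The linear extrapolation mentioned above amounts to scaling a measured time by the ratio of loop counts. A minimal sketch follows; the function name is hypothetical and the numbers in the note after it are placeholders, not measured results.

```cpp
#include <cassert>

// Hypothetical helper: estimate the time for `targetRuns` loops from a timing
// measured over `measuredRuns` loops, assuming the time grows linearly with
// the number of loops (the assumption stated in the text).
double extrapolateSeconds(double measuredSeconds, int measuredRuns, int targetRuns)
{
    assert(measuredRuns > 0 && targetRuns >= 0 && measuredSeconds >= 0.0);
    const double secondsPerRun = measuredSeconds / measuredRuns;
    return secondsPerRun * targetRuns;
}
```

For instance, a placeholder timing of 8 seconds over 4 loops would extrapolate to 20 seconds for 10 loops. The model passes through zero, so it ignores fixed costs such as allocation and one-off setup and should be read as a first approximation only.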

Here's the code for the newly parallelized method:

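// One thread per source row: idx is the row this thread enlarges.
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int enlargeRow, enlargeCol;
__shared__ int pixel;   // safe only because each block holds a single thread

for(int j = 0; j < nj; j++)
{
    pixel = work[idx * nj + j];
    enlargeRow = idx * factor;
    enlargeCol = j * factor;

    // Copy the source pixel into a factor x factor square of the result.
    for(int c = enlargeRow; c < (enlargeRow + factor); c++)
    {
        for(int d = enlargeCol; d < (enlargeCol + factor); d++)
        {
            // The row stride is written in terms of the thread count, so it
            // assumes a square image (number of rows == nj).
            result[d + c * blockDim.x * gridDim.x * factor] = pixel;
        }
    }
}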

While I did see a decrease in the time taken to run 50 loops, the decrease wasn't as significant as I had hoped. Obviously this kernel isn't optimized, so I'm looking forward to some more impressive results as I update the code.
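One way to sanity-check a one-thread-per-row kernel like the one above without a GPU is to run its body on the CPU, with the thread index supplied by an ordinary loop. The sketch below does that. enlargeRowBody and emulateRowKernel are hypothetical names, and the sketch uses nj * factor as the output row stride, which matches blockDim.x * gridDim.x * factor only when the image is square.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors the kernel body for one "thread": idx is the source row it owns.
void enlargeRowBody(int idx, const int* work, int* result, int nj, int factor)
{
    const int outCols = nj * factor;   // row stride of the enlarged image
    for (int j = 0; j < nj; j++)
    {
        const int pixel = work[idx * nj + j];
        const int enlargeRow = idx * factor;
        const int enlargeCol = j * factor;
        for (int c = enlargeRow; c < enlargeRow + factor; c++)
        {
            for (int d = enlargeCol; d < enlargeCol + factor; d++)
            {
                result[d + c * outCols] = pixel;
            }
        }
    }
}

// Stand-in for the launch: one call per row, as with ni blocks of 1 thread.
std::vector<int> emulateRowKernel(const std::vector<int>& work,
                                  int ni, int nj, int factor)
{
    assert(ni > 0 && nj > 0 && factor > 0);
    assert(work.size() == static_cast<std::size_t>(ni) * nj);
    // -1 marks cells that no emulated thread has written yet.
    std::vector<int> result(static_cast<std::size_t>(ni) * factor * nj * factor, -1);
    for (int idx = 0; idx < ni; idx++)
    {
        enlargeRowBody(idx, work.data(), result.data(), nj, factor);
    }
    return result;
}
```

Because every emulated thread still runs three nested loops serially, this arrangement offers only as much parallelism as there are rows in the image.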

=== Assignment 3 ===

After making sure memory access is coalesced and replacing the second counter loop with thread indices, I've achieved significant speed-ups in the program. All it took was launching the kernel with a 2D array of blocks, each containing a 2D array of threads. For Assignment 2 I had a grid with 1 thread for each row in the image, which meant each thread was running 3 nested for loops to do the necessary calculations for enlarging.

Figuring out the math for calculating the correct index in the arrays proved to be tricky. Although I knew exactly what to do in concept, the two extra nested for loops threw me off. For a long time the image was being enlarged correctly but the physical dimensions of the image weren't increasing. Once I had that figured out, the image was enlarging but not to the new dimensions. After some tracing and trial and error I managed to find the right formula to calculate the indices. Here's the final, optimized enlarge method:

// One thread per source pixel, taken from a 2D grid of 2D blocks.
int jdx = blockIdx.x * blockDim.x + threadIdx.x;
int idx = blockIdx.y * blockDim.y + threadIdx.y;
int k = idx + jdx * blockDim.x * gridDim.x;
int enlargeRow, enlargeCol;

// Each thread keeps its own pixel in a register; a single __shared__ int
// would be overwritten by every thread in the block.
int pixel = work[k];

enlargeRow = idx * factor;
enlargeCol = jdx * factor;

for(int c = enlargeRow; c < (enlargeRow + factor); c++)
{
    for(int d = enlargeCol; d < (enlargeCol + factor); d++)
    {
        result[c + d * blockDim.x * gridDim.x * factor] = pixel;
    }
}

I enjoyed parallelizing this program and really wish I could have figured out the CERN project. To make myself feel better I also parallelized the rotate image method.
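The index arithmetic for the two-dimensional enlarge launch can be checked on the CPU without a GPU. The sketch below is an illustration under stated assumptions rather than the kernel above: it uses the conventional row-major mapping (column from the x index, row from the y index), adds a bounds guard for image sizes that are not multiples of the block size, and rounds the grid size up. All names in it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed to cover n elements with blockSide threads each.
int blocksFor(int n, int blockSide)
{
    assert(n > 0 && blockSide > 0);
    return (n + blockSide - 1) / blockSide;
}

// Work done by one emulated thread at global position (col, row).
void enlargePixelBody(int col, int row, const int* work, int* result,
                      int ni, int nj, int factor)
{
    if (row >= ni || col >= nj) return;       // guard: thread lies outside the image
    const int pixel = work[row * nj + col];   // per-thread value, no shared memory
    const int outCols = nj * factor;          // row stride of the enlarged image
    for (int c = row * factor; c < (row + 1) * factor; c++)
    {
        for (int d = col * factor; d < (col + 1) * factor; d++)
        {
            result[c * outCols + d] = pixel;
        }
    }
}

// Emulates a 2D grid of 2D blocks by looping over block and thread indices.
std::vector<int> emulateGridKernel(const std::vector<int>& work,
                                   int ni, int nj, int factor, int blockSide)
{
    assert(factor > 0);
    assert(work.size() == static_cast<std::size_t>(ni) * nj);
    // -1 marks cells that no emulated thread has written yet.
    std::vector<int> result(static_cast<std::size_t>(ni) * factor * nj * factor, -1);
    const int gridX = blocksFor(nj, blockSide);
    const int gridY = blocksFor(ni, blockSide);
    for (int by = 0; by < gridY; by++)
        for (int bx = 0; bx < gridX; bx++)
            for (int ty = 0; ty < blockSide; ty++)
                for (int tx = 0; tx < blockSide; tx++)
                    enlargePixelBody(bx * blockSide + tx, by * blockSide + ty,
                                     work.data(), result.data(), ni, nj, factor);
    return result;
}
```

In a real kernel, col and row would come from blockIdx, blockDim and threadIdx. With this mapping, consecutive threadIdx.x values read consecutive elements of work, which is the access pattern usually meant by coalescing.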

~~</nowiki>~~

I was going to paste the rotate code snippet here, but I'm getting frustrated with the formatting. Why is it so difficult to nicely format code on a wiki? [http://pastebin.com/ZZV9KRJN Here] it is.

Rwstanica
