}
As seen above, every function we converted to parallel does a lot of copying to and from the device. The use of a temp image makes this even more expensive, because we are basically copying the image to and from the device twice. To remove these cudaMemcpy() calls from each function, we decided to create the image we are going to process directly on the device, so that our parallel functions can operate on its pixels without copying both the image and a temporary image to and from the device. With this solution we reduced the cudaMemcpy() calls per function from 3 in the parallel code to 0 in the optimized code.

To implement this solution, we moved the expensive cudaMemcpy() calls so that they are only made when the image is created, as seen below.

 Image::Image(int numRows, int numCols, int grayLevels)
 /* Creates an Image of numRows x numCols and creates the arrays for it */
 {
     N = numRows;
     M = numCols;
     Q = grayLevels;
     cudaMalloc((void**)&pixelVal, N * M * sizeof(int));
     cudaMemset(pixelVal, 0, N * M * sizeof(int));
 }

 Image::Image(const Image& oldImage)
 /* copies oldImage into new Image object */
 {
     N = oldImage.N;
     M = oldImage.M;
     Q = oldImage.Q;
     int sz = M * N * sizeof(int);
     cudaMalloc((void**)&pixelVal, sz);
     cudaMemcpy(pixelVal, oldImage.pixelVal, sz, cudaMemcpyDeviceToDevice);
 }

 void Image::operator=(const Image& oldImage)
 /* copies oldImage into whatever you = it to */
 {
     N = oldImage.N;
     M = oldImage.M;
     Q = oldImage.Q;
     if (pixelVal) {
         cudaFree(pixelVal);
     }
     int sz = M * N * sizeof(int);
     cudaMalloc((void**)&pixelVal, sz);
     cudaMemcpy(pixelVal, oldImage.pixelVal, sz, cudaMemcpyDeviceToDevice);
 }

Since we decided to store the image directly on the device, we also had to add some cudaMemcpy() calls to the functions that transfer the image to and from the host.
This process was originally handled in the getPixelVal() and setPixelVal() functions below.

 int Image::getPixelVal(int row, int col)
 /* returns the gray value of a specific pixel */
 {
     return pixelVal[row * M + col];
 }

 void Image::setPixelVal(int row, int col, int value)
 /* sets the gray value of a specific pixel */
 {
     pixelVal[row * M + col] = value;
 }

But now we need to transfer the data to and from the device, which is where the image is stored. We created two new functions to handle this, called getPixels() and setPixels().

 void Image::getPixels(int row, int col, int sz, int* out) {
     cudaMemcpy(out, pixelVal + (row * M + col), sz * sizeof(int), cudaMemcpyDeviceToHost);
 }

 void Image::setPixels(int row, int col, int sz, int* in) {
     cudaMemcpy(pixelVal + (row * M + col), in, sz * sizeof(int), cudaMemcpyHostToDevice);
 }

After making these changes we were able to greatly simplify our parallel functions, as shown below.

 void Image::negateImage(Image& oldImage)
 /* negates image */
 {
     Image tempImage(N, M, Q);
     int* d_temp = tempImage.pixelVal;
     int* d_img = pixelVal;
     int size = N * M;
     int nblocks = size / ntpb;
     negate<<<nblocks, ntpb>>>(d_temp, d_img, size);
     cudaError_t err = cudaGetLastError();
     if (err != cudaSuccess)
         printf("Error: %s\n", cudaGetErrorString(err));
     cudaDeviceSynchronize();
     oldImage = tempImage;
 }

 void Image::reflectImage(bool flag, Image& oldImage)
 /* Reflects the Image based on users input */
 {
     int rows = oldImage.N;
     int cols = oldImage.M;
     Image tempImage(oldImage);
     int* d_temp = tempImage.pixelVal;
     int* d_img = oldImage.pixelVal;
     int size = rows * cols;
     int nblocks = size / ntpb;
     if (flag) {
         horizontalReflect<<<nblocks, ntpb>>>(d_temp, d_img, size, rows, cols);
     } else {
         verticalReflect<<<nblocks, ntpb>>>(d_temp, d_img, size, rows, cols);
     }
     cudaDeviceSynchronize();
     oldImage = tempImage;
 }

 void Image::enlargeImage(int value, Image& oldImage)
 /* enlarges Image and stores it in tempImage, resizes oldImage and stores the larger image in oldImage */
 {
     int rows, cols, gray;
     int pixel;
     int enlargeRow, enlargeCol;
     rows = oldImage.N * value;
     cols = oldImage.M * value;
     gray = oldImage.Q;
     Image tempImage(rows, cols, gray);
     int r = oldImage.N;
     int c = oldImage.M;
     int* d_temp = tempImage.pixelVal;
     int* d_img = oldImage.pixelVal;
     int size = rows * cols;
     int nblocks = size / ntpb;
     enlarge<<<nblocks, ntpb>>>(d_temp, d_img, size, value, c, cols);
     cudaDeviceSynchronize();
     oldImage = tempImage;
 }

Finally, we profiled everything again to confirm that our optimizations produced a performance increase. We saw a large reduction in execution time after the optimizations and were satisfied with the outcome.

[[File:gpu optimized.png]]
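
The launch configurations above compute the block count as size / ntpb, which truncates. If the pixel count is not a multiple of ntpb, the last few pixels are not covered by any thread. The following is a minimal host-side sketch of the usual round-up calculation. The helper names blocksFor() and uncoveredPixels() are hypothetical and not part of our Image class, and the value of ntpb is assumed because it is not stated above. Rounding up is only safe if each kernel also checks its index against size before writing, which we are assuming here since the kernel bodies are not shown in this section.

```cpp
// Hypothetical helpers, not part of the original Image class.
// Plain C++ that is also valid as CUDA host code.

// Threads per block. Assumed value: the write-up does not state what ntpb is.
constexpr int ntpb = 1024;

// Number of blocks needed so that blocks * threadsPerBlock >= size.
// The final block may be only partially used, so the kernel is assumed
// to guard its work with a check such as: if (i < size) { ... }
inline int blocksFor(int size, int threadsPerBlock) {
    return (size + threadsPerBlock - 1) / threadsPerBlock;
}

// Number of pixels that a truncating division (size / threadsPerBlock)
// would leave without a thread.
inline int uncoveredPixels(int size, int threadsPerBlock) {
    return size - (size / threadsPerBlock) * threadsPerBlock;
}
```

With these helpers, a launch such as negate<<<blocksFor(size, ntpb), ntpb>>>(d_temp, d_img, size) would cover every pixel, provided the kernel performs the bounds check described above. For images whose pixel count is an exact multiple of ntpb, the result is the same as size / ntpb.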