Team Z
= Assignment 2 =
By the end of the semester, I realized that picking this source code was more or less a mistake. The program's greatest shortcoming is that it only accepts PPM or PGM images, which are not freely available, and about 90% of the images I did find didn't work with the program at larger sizes: the program refused to accept them as PGM or PPM. On the other hand, I have finally come to understand how concurrent programming works. As I noted in assignment 1, the current code has quite a few openings where we can make use of concurrent programming on the GPU.
The image processor is currently built around an Image object. The data structure is as follows:
class Image {
public:
int N; // number of rows
int M; // number of columns
int Q; // number of gray levels
int **pixelVal;
};
extern Image img;

The pixelVal member, which holds the value of each pixel, is defined as int**, and that became a roadblock in my path: for the most part I was unable to make it compatible with CUDA code. I started by transforming the constructor. The constructor or copy constructor is called in almost every method of the class to perform any action. The code of the constructor is as follows:

 inline Image::Image(int numRows, int numCols, int grayLevels)
 /* Creates an Image of numRows x numCols and creates the arrays for it */
 {
     N = numRows;
     M = numCols;
     Q = grayLevels;
     pixelVal = new int*[N];
     for (int i = 0; i < N; i++)
     {
         pixelVal[i] = new int[M];
         for (int j = 0; j < M; j++)
             pixelVal[i][j] = 0;
     }
 }

I wrote CUDA code for the constructor that assigns 0 to all the pixels concurrently on the GPU, with the aim of reducing the execution time. The CUDA kernel code is as follows:

 __global__ void init(int* a, int v, int n)
 {
     int idx = blockIdx.x * blockDim.x + threadIdx.x;
     if (idx < n)
         a[idx] = v;
 }

This is a very simple kernel that assigns the value v to each element of the array a. This is how I called the kernel from the constructor:

 inline Image::Image(int numRows, int numCols, int grayLevels)
 /* Creates an Image of numRows x numCols and creates the arrays for it */
 {
     N = numRows;
     M = numCols;
     Q = grayLevels;
     int n = N * M;
     pixelVal = new int*[n];
     // Allocate device memory
     int* dpixels;
     cudaMalloc((void**)&dpixels, n * sizeof(int));
     int nblks = (n + NTPB - 1) / NTPB;
     init<<<nblks, NTPB>>>(dpixels, 0, n);
     // Copies n ints over the array of row pointers; the rows themselves are never allocated
     cudaMemcpy(pixelVal, dpixels, n * sizeof(int), cudaMemcpyDeviceToHost);
     cudaFree(dpixels);
 }

The code ran, but I stumbled across two problems.

1) The kernel execution time for a 257 KB PGM file was 0.157 sec, while the execution time on the CPU was near 0 sec.
2) The constructor executed just fine, but the program crashed afterwards, indicating that the copy from int* to int** did not work as expected.
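One common way around the int** versus int* mismatch is to keep every pixel in a single contiguous buffer and let the row pointers point into it. The following is only a host-side sketch under that assumption: `FlatImage`, `data()` and `size()` are hypothetical names that do not appear in the original program, and copying is disabled to keep the example short.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical replacement for Image's storage (FlatImage is not a name from
// the original code). All pixels live in one contiguous buffer, and
// pixelVal[i] points at the start of row i inside that buffer.
struct FlatImage {
    int N;           // number of rows
    int M;           // number of columns
    int Q;           // number of gray levels
    int** pixelVal;  // row pointers into one contiguous block

    FlatImage(int numRows, int numCols, int grayLevels)
        : N(numRows), M(numCols), Q(grayLevels), pixelVal(nullptr) {
        assert(N > 0 && M > 0);
        pixelVal = new int*[N];
        // One allocation for all N * M pixels, zero-initialized by the ().
        pixelVal[0] = new int[static_cast<std::size_t>(N) * M]();
        for (int i = 1; i < N; i++)
            pixelVal[i] = pixelVal[0] + static_cast<std::size_t>(i) * M;
    }

    ~FlatImage() {
        delete[] pixelVal[0];  // the pixel block
        delete[] pixelVal;     // the row-pointer array
    }

    // Copying is disabled to keep the sketch short.
    FlatImage(const FlatImage&) = delete;
    FlatImage& operator=(const FlatImage&) = delete;

    int* data() { return pixelVal[0]; }  // contiguous block of N * M ints
    std::size_t size() const { return static_cast<std::size_t>(N) * M; }
};
```

With this layout, existing code that indexes `pixelVal[i][j]` keeps working, while a device-to-host transfer would target `data()` instead of `pixelVal`, for example `cudaMemcpy(img.data(), dpixels, img.size() * sizeof(int), cudaMemcpyDeviceToHost)`.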
After testing the code with a 769 KB PGM file, the CPU time was recorded as 0.1 sec while the kernel recorded 0.16 sec. For the kernel, that is a rise from only 0.157 sec to 0.16 sec. Somehow the program wouldn't run big files on either Windows or Linux, with either the CPU or the GPU code. I was unable to find a large file that this program would run, which blocked my further research. The results so far indicate that the GPU code is extremely inefficient with small files, but as the file size grows, the CUDA kernel performs proportionally better.

Afterwards I tried to port the enlargeImage functionality to CUDA, but due to the nature of the code, I failed to convert it. I then began working on reflectImage. After a few failures and some difficulties with transforming int** to int*, I was able to successfully transform the code from CPU to CUDA. The original code is as follows:

 int rows = oldImage.N;
 int cols = oldImage.M;
 Image tempImage(oldImage);
 if (flag == true) // horizontal reflection
 {
     for (int i = 0; i < rows; i++)
     {
         for (int j = 0; j < cols; j++)
             tempImage.pixelVal[rows - (i + 1)][j] = oldImage.pixelVal[i][j];
     }
 }

I transformed it into CUDA as follows:

 __global__ void fliph(int* a, int* b, int r, int c)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x; // row
     int j = blockIdx.y * blockDim.y + threadIdx.y; // column
     if (i >= r || j >= c)
         return;
     // Row i of b becomes row r - (i + 1) of a
     a[(r - (i + 1)) * c + j] = b[i * c + j];
 }

The results were the same as for the constructor. The CUDA memory allocation, cudaMalloc, is the most time-consuming operation in the CUDA version. The CPU execution time for reflecting the 257 KB image was 0.01 sec, while the CUDA kernel took about 0.13 sec. However, when the code was tested with the 769 KB image, the CPU execution time went up to 0.04 sec while the CUDA kernel took 0.16 sec. The CPU execution time went up about 4 times, but the CUDA kernel time didn't change much. This suggests that while the CUDA code is highly inefficient at small picture sizes due to the cudaMalloc time, its performance relative to the CPU improves drastically as the size of the input increases.
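The CPU loop writes `pixelVal[rows - (i + 1)][j]`; in a row-major flat array with c columns, that element sits at index `(r - (i + 1)) * c + j`. A small host-side reference with the same index arithmetic can be used to check what a flip kernel should produce. This is only a sketch: `fliphReference` is a hypothetical helper, not part of the original program, and it uses `std::vector` purely for convenience.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not from the original code). It performs
// the per-thread index arithmetic of a horizontal flip for every (i, j):
// row i of b is written to row r - (i + 1) of a.
void fliphReference(std::vector<int>& a, const std::vector<int>& b, int r, int c) {
    assert(a.size() == b.size());
    assert(b.size() == static_cast<std::size_t>(r) * c);
    for (int i = 0; i < r; i++)
        for (int j = 0; j < c; j++)
            a[(r - (i + 1)) * c + j] = b[i * c + j];
}
```

For a 3 x 2 image stored as `{1, 2, 3, 4, 5, 6}`, the rows are written out in reverse order, so the result is `{5, 6, 3, 4, 1, 2}`. Comparing a kernel's output against this reference on a tiny image is a cheap way to catch indexing mistakes before timing anything.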
According to the NVIDIA profiler, the code was not using the GPU at full performance in terms of compute utilization and memory copy efficiency. The CUDA kernel can be further enhanced with shared memory and coalesced memory access.
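To make the coalescing point concrete: threads in a warp are consecutive in threadIdx.x, and global memory accesses coalesce best when neighbouring threads touch neighbouring addresses. The sketch below is plain host-side arithmetic, not a measurement, and the helper names are hypothetical. It compares the distance between the elements touched by threads x and x + 1 when the x index selects the row (as in fliph above) with the distance when it selects the column.

```cpp
#include <cassert>
#include <cstdlib>

// Row-major flat index of pixel (row i, column j) in an image with c columns.
inline int flatIndex(int i, int j, int c) { return i * c + j; }

// Distance, in elements, between the addresses touched by threads x and x + 1
// when the x thread index selects the ROW (the mapping used by fliph).
inline int strideRowFromX(int x, int y, int c) {
    return std::abs(flatIndex(x + 1, y, c) - flatIndex(x, y, c));
}

// Same distance when the x thread index selects the COLUMN instead.
inline int strideColFromX(int x, int y, int c) {
    return std::abs(flatIndex(y, x + 1, c) - flatIndex(y, x, c));
}
```

For an image with 512 columns, the first mapping gives a stride of 512 elements between neighbouring threads and the second a stride of 1, which suggests that swapping the roles of the x and y indices in the kernel is a reasonable first thing to try.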
= Assignment 3 =