==== Running Display Test Single Rotation ====
We can un-comment the "test" section in Rotate.cpp to read a .jpg, verify the stored colour channel values are correct, and make sure the rotation works as expected. Here is Tiny-Shay.jpg, a 30 x 21 px top-down image of my cat lying on the floor. Mousing over a pixel displays the X and Y coordinates along with the corresponding red, green, and blue values.
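
The test amounts to something like the following sketch (the exact code in Rotate.cpp may differ), reading the .jpg with CImg and printing the channel values at an arbitrary coordinate:

<nowiki>// Sketch: read a .jpg with CImg and print the RGB values at pixel (x, y)
cimg_library::CImg<unsigned char> img("Tiny-Shay.jpg");
int x = 10, y = 10;
std::cout << "(" << x << ", " << y << ") R:" << (int)img(x, y, 0, 0)
	<< " G:" << (int)img(x, y, 0, 1)
	<< " B:" << (int)img(x, y, 0, 2) << std::endl;</nowiki>
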
==== Initial CUDA Code ====
This code reads the .jpg named in the command line argument into a CImg object, copies the float array to the device, uses the device to rotate the image 90 degrees clockwise once, then copies the result back to the host. It is just to verify everything works as expected; we will then change the code to rotate the same images the same number of times as the CPU version.
;image.h
<nowiki>#include <iostream>
#include <iomanip>
#include "CImg.h"

// Pixel storage type; the initial CUDA version stores channel values as float
typedef float PX_TYPE;

// Map (x, y) coordinates and colour channel z to an index in the flat array
inline int idx(int x, int y, int w, int h, int z) { return x + y * w + w * h * z; }

// Print one colour channel of the image to the console for verification
void display(PX_TYPE* img, int w, int h, int i) {
	for (int j = 0; j < h; j++) {
		for (int k = 0; k < w; k++)
			std::cout << std::setw(4) << (int)img[idx(k, j, w, h, i)];
		std::cout << std::endl;
	}
}

// Read the .jpg into a CImg object, then copy it to a flat PX_TYPE array
PX_TYPE* getImage(char* filename, int &w, int &h) {
	std::cout << "Trying to read " << filename << std::endl;
	cimg_library::CImg<unsigned char> cimg(filename);
	w = cimg.width();
	h = cimg.height();
	int size = w * h * cimg.spectrum();
	PX_TYPE* img = new PX_TYPE[size];
	for (int i = 0; i < size; i++)
		img[i] = cimg[i];
	return img;
}</nowiki>

;rotate90.cu
<nowiki>// Rotate one colour channel (z) of src 90 degrees clockwise into dst
__global__ void rot90(PX_TYPE* src, PX_TYPE* dst, int src_w, int src_h, int z) {
	int k = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	if (k < src_w && j < src_h)
		dst[(src_h - 1 - j) + k * src_h + src_w * src_h * z] = src[k + j * src_w + src_w * src_h * z];
}

// main() body:
int w, h;
PX_TYPE* h_src = getImage(argv[1], w, h);
PX_TYPE* h_dst = new PX_TYPE[w * h * 3];
PX_TYPE* d_src, * d_dst;

// One thread per pixel: 32 x 32 blocks, enough blocks to cover the image
dim3 dBlock(32, 32);
dim3 dGrid((w + 31) / 32, (h + 31) / 32);

// Allocate device memory for src and dst
std::cout << "Allocating device memory ..." << std::endl;
cudaMalloc((void**)&d_src, w * h * sizeof(PX_TYPE) * 3);
cudaMalloc((void**)&d_dst, w * h * sizeof(PX_TYPE) * 3);

// Copy h_src to d_src
std::cout << "Copying source image to device ..." << std::endl;
cudaMemcpy(d_src, h_src, w * h * sizeof(PX_TYPE) * 3, cudaMemcpyHostToDevice);

// Launch grid 3 times (one grid per colour channel)
for (int i = 0; i < 3; i++)
	rot90<<<dGrid, dBlock>>>(d_src, d_dst, w, h, i);

// Ensure operations completed
cudaDeviceSynchronize();

// Copy d_dst to h_dst
std::cout << "Copying rotated image to host ..." << std::endl;
cudaMemcpy(h_dst, d_dst, w * h * sizeof(PX_TYPE) * 3, cudaMemcpyDeviceToHost);

// Deallocate memory
cudaFree(d_src);
cudaFree(d_dst);
delete[] h_src;
delete[] h_dst;

// Display 40x40px sample of h_dst and print pixel values to console to verify rotation worked</nowiki>
 
==== Single Rotation ====
Here we can verify the parallel solution reads the initial pixel values and applies the rotation correctly:
 
[[File:Verify-3.png|800px]]
 
After rotation:
 
[[File:Verify-4.png|800px]]
 
==== The Rotation Operation ====
 
Grid dimensions and total number of threads are displayed before launching.
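
A sketch of how that output can be produced, assuming the dGrid and dBlock names from the code above (the exact print statement is not shown in the original):

<nowiki>// Print grid dimensions and total thread count per colour channel
std::cout << "Grid: " << dGrid.x << " x " << dGrid.y << " blocks of "
	<< dBlock.x << " x " << dBlock.y << " threads = "
	<< dGrid.x * dGrid.y * dBlock.x * dBlock.y << " threads per channel" << std::endl;</nowiki>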
 
A single colour channel of Tiny-Shay.jpg only requires about half of one 32 x 32 block:
 
[[File:Tiny-Shay-cuda.png]]
 
Large-Shay.jpg required a grid of 102 x 77 blocks, each block containing 32 x 32 threads, allowing for 8,042,496 threads per colour channel:
 
[[File:Large-Shay-cuda.png]]
 
It was my design choice, mainly so I could wrap my head around the logic, to launch three two-dimensional grids per image, one per colour channel. My initial thought was to launch a single grid and use the z dimension to cover all three channels, and I should still try that to compare the results. Instead, we pass the current iteration (z) to the kernel, which uses it to calculate the correct location within the one-dimensional representation of the image:
 
<nowiki>// Launch grid 3 times (one grid per colour channel)
std::cout << "Performing rotation ..." << std::endl;
for (int i = 0; i < 3; i++) {
	rot90<<<dGrid, dBlock>>>(d_src, d_dst, w, h, i);
}</nowiki>
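
For reference, the index arithmetic inside rot90 maps source pixel (x, y) of a w x h channel to (h - 1 - y, x) in the rotated h x w channel. A serial sketch of the same mapping for a single channel (not part of the original code):

<nowiki>// Hypothetical host-side reference for one colour channel:
// source pixel (x, y) in a w x h image lands at (h - 1 - y, x) in the h x w result
for (int y = 0; y < h; y++)
	for (int x = 0; x < w; x++)
		dst[(h - 1 - y) + x * h] = src[x + y * w];</nowiki>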
 
==== Profiling With Nsight ====
I edit rotate90.cu, removing the display function calls and looping to rotate the given image 12 times, as done in the CPU version. I copy the result back to the host after each rotation completes. I re-use the memory allocated on the device, allocating the source and destination arrays once and freeing them only after all 12 rotations are complete:
 
<nowiki>// Allocate device memory for src and dst
std::cout << "Allocating device memory ..." << std::endl;
cudaMalloc((void**)&d_src, w * h * sizeof(PX_TYPE) * 3);
cudaMalloc((void**)&d_dst, w * h * sizeof(PX_TYPE) * 3);

// Copy h_src to d_src
std::cout << "Copying source image to device ..." << std::endl;
cudaMemcpy(d_src, h_src, w * h * sizeof(PX_TYPE) * 3, cudaMemcpyHostToDevice);

// Rotate image 6 x 2 times, copying result back to host each time
for (int r = 0; r < 6; r++) {
	std::cout << "Rotating 2x ..." << std::endl;

	// Launch grid 3 times (one grid per colour channel)
	for (int i = 0; i < 3; i++) {
		rot90<<<dGrid, dBlock>>>(d_src, d_dst, w, h, i);
	}

	// Ensure operations completed
	cudaDeviceSynchronize();

	// Copy d_dst to h_dst
	std::cout << "Copying result to host ..." << std::endl;
	cudaMemcpy(h_dst, d_dst, w * h * sizeof(PX_TYPE) * 3, cudaMemcpyDeviceToHost);

	// Rotate again, swapping buffers; the image is now h x w, so the
	// dimensions are passed swapped (dGrid must cover this orientation too)
	for (int i = 0; i < 3; i++) {
		rot90<<<dGrid, dBlock>>>(d_dst, d_src, h, w, i);
	}

	// Ensure operations completed
	cudaDeviceSynchronize();

	// Copy d_src to h_src
	std::cout << "Copying result to host ..." << std::endl;
	cudaMemcpy(h_src, d_src, w * h * sizeof(PX_TYPE) * 3, cudaMemcpyDeviceToHost);
}

// Deallocate memory
std::cout << "Deallocating memory ..." << std::endl;
cudaFree(d_src);
cudaFree(d_dst);
delete[] h_src;
delete[] h_dst;</nowiki>
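
None of the fragments above check whether a launch actually succeeded. A minimal check that could follow each cudaDeviceSynchronize() (my addition, not part of the original code):

<nowiki>// Hypothetical addition: report any launch or execution error after synchronizing
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
	std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;</nowiki>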
 
Here is the output from one run:
 
[[File:Cuda-profilerun.png]]
 
;Device Usage %
Tiny-Shay.jpg: 0.01%
 
Medium-Shay.jpg: 0.39%
 
Large-Shay.jpg: 0.93%
 
Huge-Shay.jpg: 1.26%
 
(12 rotations x 3 colour channels = 36 kernel launches per run)
 
;Timeline Results
For each run, I list the 4 operations that took the most time. For the tiny image, allocating the source and destination arrays on the device took the longest, though still well under half a second; that allocation cost was the same in every case. Initializing the CImg variable from the .jpg file quickly became the biggest issue. This operation is CPU-bound and depends on the logic of ImageMagick. Copying the rotated image back to the host (cudaMemcpy) starts to become a hot spot as well, with a noticeable increase between the large and huge images.
 
[[File:Summary-2.png]]
 
Comparing total run times of the CPU and CUDA versions shows a clear winner as the .jpg files grow in size. Rotating Large-Shay.jpg (3264 x 2448) was '''3x''' faster, and Huge-Shay.jpg was '''4.95x''' faster. Tiny- and Medium-Shay.jpg actually took longer with the CUDA version, but both finished in under half a second.
 
[[File:Summary-3.png]]
 
;Conclusion So Far
The initial CUDA code gave decent results. The overall device utilization percentage seems fairly low; this may simply be because the device can handle far more threads than even Huge-Shay.jpg requires, or we may be able to optimize the code to utilize more of the device. To improve the time spent initializing from the .jpg file, I would need to investigate efficiencies in ImageMagick or CImg, or explore other methods of reading the image file. Wait time for the grid to execute is very low in all cases (0.007 - 0.15 seconds). I should investigate the effects of different grid designs, and explore shared memory and other methods of optimization.
=== Assignment 3 ===
==== "Register" Index ====
For my first attempt at optimization, I thought maybe, just maybe, the index calculations written directly inside the global memory subscripts were costing extra time:
 
<nowiki>__global__ void rot90(PX_TYPE* src, PX_TYPE* dst, int src_w, int src_h, int z) {
	int k = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	if (k < src_w && j < src_h)
		dst[(src_h - 1 - j) + k * src_h + src_w * src_h * z] = src[k + j * src_w + src_w * src_h * z];
}</nowiki>
 
So I declared two register variables and computed the indexes beforehand:
 
<nowiki>__global__ void rot90(PX_TYPE* src, PX_TYPE* dst, int src_w, int src_h, int z) {
	int k = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	int d = (src_h - 1 - j) + k * src_h + src_w * src_h * z;
	int s = k + j * src_w + src_w * src_h * z;
	if (k < src_w && j < src_h)
		dst[d] = src[s];
}</nowiki>
 
This actually had a slightly negative effect, although such a small difference may just have been the luck of the run; most likely the compiler already computes these index expressions in registers either way:
 
[[File:Summary-4.png]]
 
==== Unsigned Char vs. Float ====
The first real improvement came from changing PX_TYPE from float back to unsigned char, as used in the serial version. Unsigned char is sufficient for .jpg colour values (0 - 255). GPUs are designed to operate on floating point numbers, but we perform no calculations other than indexing, and kernel performance was the same for float and unsigned char. Since we copy the source image to the device once and the result back to the host 12 times, size is what matters.
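
Assuming PX_TYPE is the typedef in image.h shown earlier, the change is a single line:

<nowiki>// image.h: store channel values as bytes (0-255) instead of 4-byte floats
typedef unsigned char PX_TYPE;</nowiki>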
 
{| class="wikitable"
|+Size Comparison
|-
|File
|Unsigned Char
|Float
|-
|Tiny-Shay.jpg
|1.93 KB
|7.73 KB
|-
|Medium-Shay.jpg
|5.71 MB
|22.8 MB
|-
|Large-Shay.jpg
|22.8 MB
|91.4 MB
|-
|Huge-Shay.jpg
|91.4 MB
|365 MB
|}
 
This saves almost a full second of latency for the largest file, bringing cudaMemcpy down to about the same time as the kernel execution:
 
[[File:Summary-5.png]]
 
==== Shared Memory ====
I could not think of a way to utilize shared memory for this application. No calculations are performed on the pixel values, and staging data in shared memory would be an additional operation, since one write to global memory is required either way. If we copied a small tile of the source image into shared memory to improve read time, the current indexing logic would no longer work.
 
==== Constant Memory ====
Utilizing constant memory for the source image was something I wanted to try. The largest unsigned char file, at 91.4 MB, seemed affordable, and we never write to the source.

Since the size of a __constant__ array must be a compile-time constant, I needed to define the size of the largest file and use that for all files:
 
<nowiki>#define SRC_MAX_SIZE 95883264
 
__constant__ PX_TYPE d_src[SRC_MAX_SIZE];</nowiki>
 
Then copy over only the actual number of elements:
 
<nowiki>// Copy h_src to d_src
std::cout << "Copying source image to device ..." << std::endl;
cudaMemcpyToSymbol(d_src, h_src, w * h * sizeof(PX_TYPE) * 3);</nowiki>
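
With d_src now a __constant__ array at file scope, the kernel would read it directly instead of taking a source pointer parameter. A sketch of that adaptation (my reconstruction; the original is not shown):

<nowiki>// d_src lives in constant memory, so it is no longer a kernel parameter
__global__ void rot90(PX_TYPE* dst, int src_w, int src_h, int z) {
	int k = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	if (k < src_w && j < src_h)
		dst[(src_h - 1 - j) + k * src_h + src_w * src_h * z] = d_src[k + j * src_w + src_w * src_h * z];
}</nowiki>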
 
Compiling gave an error saying 91 MB is far too much constant memory to use:
 
<nowiki>CUDACOMPILE : ptxas error : File uses too much global constant data (0x5b71000 bytes, 0x10000 max)</nowiki>
 
Constant memory tops out at 64 KB (the 0x10000 bytes in the error message), so the only example file that fit and compiled was Tiny-Shay.jpg, which there is no point in improving.