BarraCUDA Boiz

== Progress ==
=== Assignment 2 ===
==== Problem ====
After surveying the original code, we found three major hot-spots of heavy CPU usage.

This block of code reshapes the input pixels into a set of samples for classification.
 const int N = width * height;
 const int dim = img.channels();
 cv::Mat samples = cv::Mat(N, dim, CV_32FC1);
 for (int x = 0; x < width; x++) {
     for (int y = 0; y < height; y++) {
         for (int d = 0; d < dim; d++) {
             int index = y * width + x;
             samples.at<float>(index, d) = (float)img.at<uchar>(y, x*dim + d);
         }
     }
 }
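Note that index = y * width + x flattens each pixel's (x, y) coordinates into a row number, so each row of the samples matrix holds the channel values of one pixel.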
This block of code computes the distances between the sampled centers and the other input samples.

[[File:CalculateDistanceSerial.png|550px]]

This block of code generates the output image.

[[File:GenerateImageSerial.png|550px]]
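The screenshots above show the serial versions. To illustrate why the distance step dominates the run-time, it is roughly a triple loop of the following shape, doing O(N · K · dim) work on the CPU; this is a sketch, not a transcription of the screenshot, and the names centers and distances are illustrative:

 // Sketch of the serial hot-spot: for every sample, accumulate the
 // squared Euclidean distance to each current cluster center.
 for (int i = 0; i < samples.rows; i++) {
     for (int c = 0; c < centers.rows; c++) {
         float dist = 0.0f;
         for (int d = 0; d < samples.cols; d++) {
             float diff = samples.at<float>(i, d) - centers.at<float>(c, d);
             dist += diff * diff;
         }
         distances.at<float>(i, c) = dist;
     }
 }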
==== Analysis ====
After analyzing these blocks of code, we decided to parallelize them.
You can find the new parallelized KmeansPlusPlus code on GitHub [https://github.com/MajinBui/kmeansplusplusCUDA].

Here are the kernels that we programmed.

Set Samples kernel

[[File:SetSamplesKernel.png|550px]]

Calculate Distance kernel

[[File:CalculateDistanceKernel.png|550px]]

Generate Image kernel

[[File:GenerateImageKernel.png|550px]]
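As a rough illustration of what the Set Samples kernel does, a kernel of this shape assigns one thread per pixel and performs the same reshape as the serial loop in the Problem section; this is a sketch, not a transcription of the screenshot above:

 // Sketch of a set-samples kernel: one thread per (x, y) pixel copies that
 // pixel's channel values into one row of the flattened samples matrix.
 __global__ void setSamples(float* d_samples, const unsigned char* d_img,
                            int width, int height, int dim) {
     int x = blockIdx.x * blockDim.x + threadIdx.x;
     int y = blockIdx.y * blockDim.y + threadIdx.y;
     if (x < width && y < height) {
         int index = y * width + x;    // row in the samples matrix
         for (int d = 0; d < dim; d++) // one entry per channel
             d_samples[index * dim + d] = (float)d_img[y * width * dim + x * dim + d];
     }
 }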
We also wrote a setCenter kernel, which copies a randomly chosen sample into the set of cluster centers:

 __global__ void setCenter(float* d_center, float* d_sample, int n, int dim, int randi) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     if (i < n && j < n)
         d_center[j * n + i] = d_sample[j * randi + i];
 }
Launching the kernel:
 int nb = (n + ntpb - 1) / ntpb;     // enough blocks to cover all n elements
 dim3 dGrid(nb, nb, 1);
 dim3 dBlock(ntpb, ntpb, 1);
 float* d_center = nullptr;
 cudaMalloc((void**)&d_center, centers.rows * centers.cols * sizeof(float));
 cudaMemcpy(d_center, (float*)centers.data, centers.rows * centers.cols * sizeof(float), cudaMemcpyHostToDevice);
 check(cudaGetLastError());
 float* d_sample = nullptr;
 cudaMalloc((void**)&d_sample, samples.rows * samples.cols * sizeof(float));
 cudaMemcpy(d_sample, (float*)samples.data, samples.rows * samples.cols * sizeof(float), cudaMemcpyHostToDevice);
 int rand = genrand_int31() % n;     // random sample index for the new center
 setCenter<<<dGrid, dBlock>>>(d_center, d_sample, N, dim, rand);
 cudaDeviceSynchronize();
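The expression (n + ntpb - 1) / ntpb is integer ceiling division, so the grid always covers all n elements: for example, with n = 1000 and ntpb = 32, nb = 1031 / 32 = 32 blocks per dimension (32 × 32 = 1024 threads ≥ 1000), and the bounds check in the kernel discards the excess threads.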
After programming this kernel, we noticed an improvement in performance.

Here is a graph comparing the run-times of the serial program vs the parallelized version.

[[File:Assignment2Graph.png|550px]]

==== Conclusion ====
By comparing the run-times of the serial KmeansPlusPlus and the parallelized version, we can see that the performance of the program has improved.

[[File:GraphAssignment2.png|900px]]

The performance improvement is not significant for smaller cluster and iteration counts, but it is clear for the larger test cases.

=== Assignment 3 ===
For assignment 3, we optimized the kernels by allocating the correct grid and block dimensions for each kernel. Previously, we launched 32 blocks of 32 threads for every kernel call, even when a call did not require that many. After the adjustments, we found significant improvements for many of the kernels.

==== Runtime of program ====
Here we see that the program was improved by tuning the number of threads per block.

For larger images, we found that the improvement grew as the number of clusters and iterations increased.

[[File:Big Image.png]]

For medium images, we found more inconsistent results.

[[File:Med Image.png]]

For small images, we found the most inconsistent results after the optimizations.

[[File:Small Image.png]]

In general, the larger the image, the more efficient the kernels become.

==== Runtime of each kernel ====
Each kernel individually showed significant or marginal improvements after adjusting its thread/block sizes.

Set samples found small improvements on average.

[[File:Set Samples.png]]

Here we moved the calculation of y_index out of the inner loop.

[[File:SetSamplesKernelOptimized.png|550px]]

Calculate distance found significant improvements.

[[File:Calculate Distance Kernel.png]]

The biggest change was the thread/block size.

[[File:CalculateDistanceKernelOptimized.png|550px]]

Generate image found improvements as well. Since image sizes varied, setting the thread/block size to match the number of pixels enabled better use of memory.

[[File:Generate Image Kernel.png]]

Again, the biggest change was the thread/block size.

[[File:GenerateImageKernelOptimized.png|550px]]
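As an illustration of the kind of adjustment described above, a 2D launch can be sized to the image dimensions instead of a fixed 32 × 32 configuration; this is a generic sketch, and the kernel name generateImage and its parameters are placeholders, not the exact code from the screenshots:

 // Size the grid to the image so no more blocks are scheduled than needed.
 const int ntpb = 16;                  // 16 x 16 = 256 threads per block
 dim3 block(ntpb, ntpb, 1);
 dim3 grid((width + ntpb - 1) / ntpb, // ceiling division over image width
           (height + ntpb - 1) / ntpb, // ceiling division over image height
           1);
 generateImage<<<grid, block>>>(d_image, d_center, width, height, dim);
 cudaDeviceSynchronize();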