BarraCUDA Boiz

== Progress ==
=== Assignment 2 ===
==== Problem ====
After surveying the original code, we found three major hot-spots of heavy CPU usage.

This block of code reshapes the input pixels into a set of samples for classification.
 const int N = width * height;
 const int dim = img.channels();
 cv::Mat samples = cv::Mat(N, dim, CV_32FC1);
 for (int x = 0; x < width; x++) {
     for (int y = 0; y < height; y++) {
         for (int d = 0; d < dim; d++) {
             int index = y * width + x;
             samples.at<float>(index, d) = (float)img.at<uchar>(y, x*dim + d);
         }
     }
 }
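Note that index = y * width + x flattens each pixel's (x, y) coordinates into a row number, so each row of the samples matrix holds the channel values of one pixel.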
This block of code computes the distances between the sampled centers and the other input samples.

[[File:CalculateDistanceSerial.png|550px]]

This block of code generates the output image.

[[File:GenerateImageSerial.png|550px]]
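The screenshots above show the serial versions. To illustrate why the distance step dominates the run-time, it is roughly a triple loop of the following shape, doing O(N · K · dim) work on the CPU; this is a sketch, not a transcription of the screenshot, and the names centers and distances are illustrative:

 // Sketch of the serial hot-spot: for every sample, accumulate the
 // squared Euclidean distance to each current cluster center.
 for (int i = 0; i < samples.rows; i++) {
     for (int c = 0; c < centers.rows; c++) {
         float dist = 0.0f;
         for (int d = 0; d < samples.cols; d++) {
             float diff = samples.at<float>(i, d) - centers.at<float>(c, d);
             dist += diff * diff;
         }
         distances.at<float>(i, c) = dist;
     }
 }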
==== Analysis ====
After analyzing these blocks of code, we decided to parallelize them.
You can find the new parallelized KmeansPlusPlus code on GitHub [https://github.com/MajinBui/kmeansplusplusCUDA].

Here are the kernels that we programmed.

Set Samples kernel

[[File:SetSamplesKernel.png|550px]]

Calculate Distance kernel

[[File:CalculateDistanceKernel.png|550px]]

Generate Image kernel

[[File:GenerateImageKernel.png|550px]]
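As a rough illustration of what the Set Samples kernel does, a kernel of this shape assigns one thread per pixel and performs the same reshape as the serial loop in the Problem section; this is a sketch, not a transcription of the screenshot above:

 // Sketch of a set-samples kernel: one thread per (x, y) pixel copies that
 // pixel's channel values into one row of the flattened samples matrix.
 __global__ void setSamples(float* d_samples, const unsigned char* d_img,
                            int width, int height, int dim) {
     int x = blockIdx.x * blockDim.x + threadIdx.x;
     int y = blockIdx.y * blockDim.y + threadIdx.y;
     if (x < width && y < height) {
         int index = y * width + x;    // row in the samples matrix
         for (int d = 0; d < dim; d++) // one entry per channel
             d_samples[index * dim + d] = (float)d_img[y * width * dim + x * dim + d];
     }
 }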
We also wrote a setCenter kernel, which copies a randomly chosen sample into the set of cluster centers:

 __global__ void setCenter(float* d_center, float* d_sample, int n, int dim, int randi) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     if (i < n && j < n)
         d_center[j * n + i] = d_sample[j * randi + i];
 }
Launching the kernel:
 int nb = (n + ntpb - 1) / ntpb;     // enough blocks to cover all n elements
 dim3 dGrid(nb, nb, 1);
 dim3 dBlock(ntpb, ntpb, 1);
 float* d_center = nullptr;
 cudaMalloc((void**)&d_center, centers.rows * centers.cols * sizeof(float));
 cudaMemcpy(d_center, (float*)centers.data, centers.rows * centers.cols * sizeof(float), cudaMemcpyHostToDevice);
 check(cudaGetLastError());
 float* d_sample = nullptr;
 cudaMalloc((void**)&d_sample, samples.rows * samples.cols * sizeof(float));
 cudaMemcpy(d_sample, (float*)samples.data, samples.rows * samples.cols * sizeof(float), cudaMemcpyHostToDevice);
 int rand = genrand_int31() % n;     // random sample index for the new center
 setCenter<<<dGrid, dBlock>>>(d_center, d_sample, N, dim, rand);
 cudaDeviceSynchronize();
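The expression (n + ntpb - 1) / ntpb is integer ceiling division, so the grid always covers all n elements: for example, with n = 1000 and ntpb = 32, nb = 1031 / 32 = 32 blocks per dimension (32 × 32 = 1024 threads ≥ 1000), and the bounds check in the kernel discards the excess threads.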
After programming this kernel, we noticed an improvement in performance.

Here is a graph comparing the run-times of the serial program vs the parallelized version.

[[File:Assignment2Graph.png|550px]]

==== Conclusion ====
By comparing the run-times of the serial KmeansPlusPlus and the parallelized version, we can see that the performance of the program has improved.

[[File:GraphAssignment2.png|900px]]

The performance improvement is not significant for smaller cluster and iteration counts, but it is clear for the larger test cases.

=== Assignment 3 ===
For assignment 3, we optimized the kernels by allocating the correct grid and block dimensions for each kernel. Previously, we launched 32 blocks of 32 threads for every kernel call, even when a call did not require that many. After the adjustments, we found significant improvements for many of the kernels.

==== Runtime of program ====
Here we see that the program was improved by tuning the number of threads per block.

For larger images, we found that the improvement grew as the number of clusters and iterations increased.

[[File:Big Image.png]]

For medium images, we found more inconsistent results.

[[File:Med Image.png]]

For small images, we found the most inconsistent results after the optimizations.

[[File:Small Image.png]]

In general, the larger the image, the more efficient the kernels become.

==== Runtime of each kernel ====
Each kernel individually showed significant or marginal improvements after adjusting its thread/block sizes.

Set samples found small improvements on average.

[[File:Set Samples.png]]

Here we moved the calculation of y_index out of the inner loop.

[[File:SetSamplesKernelOptimized.png|550px]]

Calculate distance found significant improvements.

[[File:Calculate Distance Kernel.png]]

The biggest change was the thread/block size.

[[File:CalculateDistanceKernelOptimized.png|550px]]

Generate image found improvements as well. Since image sizes varied, setting the thread/block size to match the number of pixels enabled better use of memory.

[[File:Generate Image Kernel.png]]

Again, the biggest change was the thread/block size.

[[File:GenerateImageKernelOptimized.png|550px]]
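As an illustration of the kind of adjustment described above, a 2D launch can be sized to the image dimensions instead of a fixed 32 × 32 configuration; this is a generic sketch, and the kernel name generateImage and its parameters are placeholders, not the exact code from the screenshots:

 // Size the grid to the image so no more blocks are scheduled than needed.
 const int ntpb = 16;                  // 16 x 16 = 256 threads per block
 dim3 block(ntpb, ntpb, 1);
 dim3 grid((width + ntpb - 1) / ntpb, // ceiling division over image width
           (height + ntpb - 1) / ntpb, // ceiling division over image height
           1);
 generateImage<<<grid, block>>>(d_image, d_center, width, height, dim);
 cudaDeviceSynchronize();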