Changes

BarraCUDA Boiz

2,872 bytes removed, 00:53, 14 April 2017

→ Progress

This block of code computes the distances between sampled centers and other input samples.

[[File:CalculateDistanceSerial.png|550px]]
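As a minimal sketch of the distance step described above, the function below computes the squared distance from every sample to the most recently chosen center and keeps the smaller of the old and new value per sample. It is plain serial C++, not the code in the screenshot; the name `updateNearestDistances` and the flat row-major layout of `samples` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the project source).
// samples: N * dim floats, row-major; center: dim floats.
// minDist: one entry per sample, updated in place to the nearest
// squared distance seen so far. Returns the sum of the updated entries.
double updateNearestDistances(const std::vector<float>& samples,
                              const std::vector<float>& center,
                              std::size_t dim,
                              std::vector<double>& minDist) {
    double total = 0.0;
    for (std::size_t i = 0; i < minDist.size(); ++i) {
        double dist = 0.0;
        for (std::size_t d = 0; d < dim; ++d) {
            double diff = static_cast<double>(center[d]) - samples[i * dim + d];
            dist += diff * diff;
        }
        // Keep the nearest distance only if the new center is closer.
        if (dist < minDist[i]) minDist[i] = dist;
        total += minDist[i];
    }
    return total;
}
```

In a GPU version, each thread would typically handle one sample `i`, and the running total would need an atomic add or a reduction, since many threads would otherwise write to the same value at once.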

This block of code generates the image that has to be outputted.

[[File:GenerateImageSerial.png|550px]]
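The image generation step can be sketched in the same way: each pixel is replaced by the values of the center it was assigned to. This is a serial C++ illustration only; `generateImage` here is a hypothetical free function over flat vectors, and the assumption is that `labels` holds one cluster index per pixel and `centers` holds `dim` values per cluster.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the project source).
// labels: one cluster index per pixel.
// centers: k * dim floats, row-major (one row per cluster).
// Returns the output image as pixels * dim bytes, channels interleaved.
std::vector<std::uint8_t> generateImage(const std::vector<int>& labels,
                                        const std::vector<float>& centers,
                                        std::size_t dim) {
    std::vector<std::uint8_t> out(labels.size() * dim);
    for (std::size_t p = 0; p < labels.size(); ++p) {
        const std::size_t ci = static_cast<std::size_t>(labels[p]);
        for (std::size_t d = 0; d < dim; ++d) {
            // Copy channel d of the assigned center into the pixel.
            out[p * dim + d] = static_cast<std::uint8_t>(centers[ci * dim + d]);
        }
    }
    return out;
}
```

Because every pixel is written independently, this loop maps naturally onto one GPU thread per pixel.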

==== Analysis ====

You can find the new parallelized KmeansPlusPlus code

~~[https://github.com/agamdogra/KmeansPlusPlusCuda here]~~ [https://github.com/MajinBui/kmeansplusplusCUDA].

Here are the ~~kernel~~ kernels that we programmed.

~~__global__ void setCenter(float* d_center, float* d_sample, int n, int dim, int randi) {~~

~~int i = blockIdx.x * blockDim.x + threadIdx.x;~~

~~int j = blockIdx.y * blockDim.y + threadIdx.y;~~

~~if (i < n && j < n)~~

~~d_center[j * n + i] = d_sample[j * randi + i];~~

~~}~~

~~Launching the~~ Set Samples kernel

~~int nb = (n + ntpb - 1) / ntpb;~~

~~dim3 dGrid(nb, nb, 1);~~

~~dim3 dBlock(ntpb, ntpb, 1);~~

~~float* d_center = nullptr;~~

~~cudaMalloc((void**)&d_center, centers.rows * centers.cols * sizeof(float));~~

~~cudaMemcpy(d_center, (float*)centers.data, centers.rows * centers.cols * sizeof(float), cudaMemcpyHostToDevice);~~

~~check(cudaGetLastError());~~

~~float* d_sample = nullptr;~~

~~cudaMalloc((void**)&d_sample, samples.rows * samples.cols * sizeof(float));~~

~~cudaMemcpy(d_sample, (float*)samples.data, centers.rows * centers.cols * sizeof(float), cudaMemcpyHostToDevice);~~

~~int rand = genrand_int31() % n;~~

~~setCenter << <dGrid, dBlock >> >(d_center, d_sample, N, dim, rand);~~

~~cudaDeviceSynchronize();~~

[[File:SetSamplesKernel.png|550px]]

~~The kernels:~~ Calculate Distance kernel

~~setSamples - goes through the entire image and collects samples from the image (the current pixel and the next x number of pixels).~~

~~__global__ void setSamples(cv::cuda::PtrStepSz<float> samples, cv::cuda::PtrStepSz<uchar> img, int dimC) {~~

~~int i = blockIdx.y*blockDim.y + threadIdx.y;~~

~~int j = blockIdx.x*blockDim.x + threadIdx.x;~~

~~if (i >= img.rows || j >= img.cols)~~

~~return;~~

~~int index = i * img.cols + j;~~

~~for (int d = 0; d<dimC; d++) {~~

~~samples(index, d) = (float)img(i, j * dimC + d);~~

~~}~~

~~}~~

[[File:CalculateDistanceKernel.png|550px]]

~~calculateDistance - goes through the image and computes the difference between the samples and the centers from the input image.~~ Generate Image kernel

~~__global__ void calculateDistance(cv::cuda::PtrStepSz<float> centers, cv::cuda::PtrStepSz<float> samples, int k, int N, int dim, double* minval, float* D) {~~

~~// Compute distances between already sampled centers and other input samples.~~

~~// Update nearest distance if it is smaller than previous ones.~~

~~int col = blockIdx.x * blockDim.x + threadIdx.x;~~

~~int row = blockIdx.y * blockDim.y + threadIdx.y;~~

~~int i = col + row * N;~~

~~//int i = blockIdx.y*blockDim.y + threadIdx.y;~~

~~if (i >= N)~~

~~return;~~

~~double dist = 0.0;~~

~~for (int d = 0; d<dim; d++) {~~

~~double diff = centers(k - 1, d) - samples(i, d);~~

~~dist += diff * diff;~~

~~}~~

~~if (dist < minval[i]) {~~

~~minval[i] = dist;~~

~~}~~

~~*D += minval[i];~~

~~}~~

[[File:GenerateImageKernel.png|550px]]

~~generateImage - takes the modified image and then writes it to the file using the function "out()".~~

~~__global__ void generateImage(cv::cuda::PtrStepSz<uchar> out, cv::cuda::PtrStepSz<int> indices, cv::cuda::PtrStepSz<float> centers, int dim) {~~

~~// Generate output image~~

~~int i = blockIdx.y*blockDim.y + threadIdx.y;~~

~~int j = blockIdx.x*blockDim.x + threadIdx.x;~~

~~if (i >= out.rows || j >= out.cols)~~

~~return;~~

~~int index = i * out.cols + j;~~

~~int ci = indices(index, 0);~~

~~for (int d = 0; d<dim; d++) {~~

~~out(i, j*dim + d) = (uchar)centers(ci, d);~~

~~}~~

~~After programming these kernel. we noticed an improvement in performace.~~

==== Conclusion ====

By comparing the run-times of the serial KmeansPlusPlus and the parallelized version, we can see that the performance of the program has improved ~~slightly~~.

~~This program can further be improved by off-loading some more operations from the CPU to the GPU. But this will require more time and research~~

[[File:GraphAssignment2.png|900px]]

The performance improvement is not significant for small numbers of clusters and iterations, but it becomes visible in the larger test cases.

=== Assignment 3 ===

For assignment 3, we optimized the kernels by allocating the correct number of grids and blocks for each kernel. Previously, we allocated 32 threads by 32 blocks for every kernel call, even when the kernel did not require it. After the adjustments, we found significant improvements for many of the kernels.
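To illustrate what sizing a launch to the data can look like, the sketch below derives grid dimensions from the image dimensions with a ceiling division, so that the grid covers every pixel without a fixed size for all kernels. It is host-side C++ only; `LaunchDims`, `ceilDiv`, and `launchDimsFor` are hypothetical names, and the tile size passed in is an assumption, not the value the project settled on.

```cpp
// Hypothetical container for a 2D launch configuration.
struct LaunchDims {
    unsigned gridX, gridY;    // number of blocks in x and y
    unsigned blockX, blockY;  // threads per block in x and y
};

// Ceiling division: smallest number of d-sized chunks that covers n.
constexpr unsigned ceilDiv(unsigned n, unsigned d) {
    return (n + d - 1) / d;
}

// Compute a grid that covers a cols x rows image with tile x tile blocks.
LaunchDims launchDimsFor(unsigned cols, unsigned rows, unsigned tile) {
    return {ceilDiv(cols, tile), ceilDiv(rows, tile), tile, tile};
}
```

In CUDA code these values would typically be passed to `dim3 grid(gridX, gridY)` and `dim3 block(blockX, blockY)`. The kernel still needs its bounds check, because the last row and column of blocks can extend past the image edge.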

====Runtime of program====

Here we see that tuning the threads per block improved the runtime of the program.

Runtime of program:

For larger images, we found that the improvement grew as the number of clusters and iterations increased.

[[File:Big Image.png]]

For medium images, we found more inconsistent results.

[[File:Med Image.png]]

For small images, the results after the optimizations were the most inconsistent.

[[File:Small Image.png]]

The larger the image size, the more efficient the kernel.

====Runtime of each kernel====

Each kernel individually showed a significant or marginal improvement after the thread/block size was adjusted.

Runtime of kernels:

Set samples showed small improvements on average.

[[File:Set Samples.png]]

~~__global__ void setSamples(cv::cuda::PtrStepSz<float> samples, cv::cuda::PtrStepSz<uchar> img, int dimC) {~~

~~int i = blockIdx.y*blockDim.y + threadIdx.y;~~

~~int j = blockIdx.x*blockDim.x + threadIdx.x;~~

~~if (i >= img.rows || j >= img.cols)~~

~~return;~~

~~int index = i * img.cols + j;~~

~~int y_index = j * dimC;~~

~~for (int d = 0; d<dimC; d++) {~~

~~samples(index, d) = (float)img(i, y_index + d);~~

~~}~~

~~}~~

Here we moved the calculation of y_index outside the inner loop.

[[File:SetSamplesKernelOptimized.png|550px]]

Calculate distance showed a significant improvement.

[[File:Calculate Distance Kernel.png]]

~~__global__ void calculateDistance(cv::cuda::PtrStepSz<float> centers, cv::cuda::PtrStepSz<float> samples, int k, int N, int dim, double* minval, float* D) {~~

~~// Compute distances between already sampled centers and other input samples.~~

~~// Update nearest distance if it is smaller than previous ones.~~

~~int col = blockIdx.x * blockDim.x + threadIdx.x;~~

~~int row = blockIdx.y * blockDim.y + threadIdx.y;~~

~~int i = col + row * N;~~

~~//int i = blockIdx.y*blockDim.y + threadIdx.y;~~

~~if (i >= N)~~

~~return;~~

~~double dist = 0.0;~~

~~int k_diff = k - 1;~~

~~for (int d = 0; d<dim; d++) {~~

~~double diff = centers(k_diff, d) - samples(i, d);~~

~~dist += diff * diff;~~

~~}~~

~~if (dist < minval[i]) {~~

~~minval[i] = dist;~~

~~}~~

~~*D += minval[i];~~

~~}~~

The biggest change was the thread/block size.

[[File:CalculateDistanceKernelOptimized.png|550px]]

Generate image showed improvements as well, since image sizes varied. Changing the thread/block size to match the number of pixels enabled better usage of memory.

[[File:Generate Image Kernel.png]]

~~__global__ void generateImage(cv::cuda::PtrStepSz<uchar> out, cv::cuda::PtrStepSz<int> indices, cv::cuda::PtrStepSz<float> centers, int dim) {~~

~~// Generate output image~~

~~int i = blockIdx.y*blockDim.y + threadIdx.y;~~

~~int j = blockIdx.x*blockDim.x + threadIdx.x;~~

~~if (i >= out.rows || j >= out.cols)~~

~~return;~~

~~int index = i * out.cols + j;~~

~~int ci = indices(index, 0);~~

~~int y_index = j*dim;~~

~~for (int d = 0; d<dim; d++) {~~

~~out(i, y_index + d) = (uchar)centers(ci, d);~~

~~}~~

~~}~~

The biggest change was the thread/block size.

[[File:GenerateImageKernelOptimized.png|550px]]

Addogra

52

edits
