=== Assignment 2 ===
==== Problem ====
After surveying the original code, we found three major hot-spots for heavy CPU usage.

This block of code reshapes the input pixels into a set of samples for classification.
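
As a rough illustration of this reshaping step, the sketch below flattens an interleaved 8-bit image into a row-major sample matrix with one row per pixel and one column per channel. The function name <code>set_samples</code>, the interleaved layout, and the <code>float</code> sample type are assumptions made for this example and are not taken from the project source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the project): converts an interleaved image of
// rows x cols pixels with `channels` values per pixel into a flat sample
// matrix, one sample (row) per pixel and one feature (column) per channel.
std::vector<float> set_samples(const std::vector<std::uint8_t>& pixels,
                               std::size_t rows, std::size_t cols,
                               std::size_t channels) {
    assert(pixels.size() == rows * cols * channels);
    std::vector<float> samples(rows * cols * channels);
    for (std::size_t y = 0; y < rows; ++y) {
        for (std::size_t x = 0; x < cols; ++x) {
            const std::size_t sample = y * cols + x;  // one sample per pixel
            for (std::size_t c = 0; c < channels; ++c) {
                samples[sample * channels + c] =
                    static_cast<float>(pixels[sample * channels + c]);
            }
        }
    }
    return samples;
}
```

Every pixel is written to its own slot and no iteration reads another iteration's result, which is what makes a loop of this shape a candidate for a one-thread-per-pixel kernel.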

This block of code computes the distances between the sampled centers and the other input samples.

[[File:CalculateDistanceSerial.png|550px]]

This block of code generates the output image.

[[File:GenerateImageSerial.png|550px]]

==== Analysis ====
After analyzing these blocks of code, we decided to parallelize them.
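
To show why the distance hot-spot lends itself to parallelization, here is a minimal serial sketch. It assumes squared Euclidean distance (the usual choice for k-means++ seeding) and a flat row-major sample layout. The function name <code>distances_to_center</code> is hypothetical, and the project's own code may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the project): squared Euclidean distance from
// every sample to a single center. `samples` holds n rows of `dims` features.
std::vector<float> distances_to_center(const std::vector<float>& samples,
                                       const std::vector<float>& center,
                                       std::size_t dims) {
    assert(dims > 0 && center.size() == dims && samples.size() % dims == 0);
    const std::size_t n = samples.size() / dims;
    std::vector<float> dist(n);
    // Each iteration of the outer loop reads only its own sample and writes
    // only dist[i], so the iterations could each be given to one GPU thread.
    for (std::size_t i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (std::size_t d = 0; d < dims; ++d) {
            const float diff = samples[i * dims + d] - center[d];
            sum += diff * diff;
        }
        dist[i] = sum;
    }
    return dist;
}
```

For example, with the two-dimensional samples (0,0), (3,4) and (1,1) and the center (0,0), the function returns 0, 25 and 2.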
You can find the new parallelized KmeansPlusPlus code [https://github.com/MajinBui/kmeansplusplusCUDA on GitHub]. Here are the kernels that we programmed.

Set Samples kernel

[[File:SetSamplesKernel.png|550px]]

Calculate Distance kernel

[[File:CalculateDistanceKernel.png|550px]]

Generate Image kernel

[[File:GenerateImageKernel.png|550px]]

==== Conclusion ====
By comparing the run-times of the serial KmeansPlusPlus and the parallelized version, we can see that the performance of the program has improved.

[[File:GraphAssignment2.png|900px]]

The performance improvement is not significant for smaller numbers of clusters and iterations, but it becomes visible in the larger test cases.

=== Assignment 3 ===
For assignment 3, we optimized the kernels by allocating the correct number of blocks and threads per block for each kernel. Previously, we allocated 32 threads by 32 blocks for every kernel call, even when the kernel did not require it. After the adjustments, we found significant improvements for many of the kernels.

====Runtime of program====
Here, we see that the program was improved by optimizing the number of threads per block.

Runtime of program:

For larger images, the improvement grew as the number of clusters and iterations increased.

[[File:Big Image.png]]

For medium images, we found more inconsistent results.

[[File:Med Image.png]]

For small images, we found the most inconsistent results after the optimizations.

[[File:Small Image.png]]

As the image size increases, the kernels become more efficient.

====Runtime of each kernel====
Each kernel individually showed significant or marginal improvements after adjusting the thread/block size.

Runtime of kernels:

Set samples showed small improvements on average.

[[File:Set Samples.png]]

Here we moved the calculation of y_index outside of the inner loop.

[[File:SetSamplesKernelOptimized.png|550px]]

Calculate distance showed a significant improvement.

[[File:Calculate Distance Kernel.png]]
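
The y_index change made to the Set samples kernel is an instance of hoisting a loop-invariant computation. The sketch below shows the general idea on a plain row-major copy. It assumes that y_index is a row offset of the form <code>y * cols</code>, which is a guess about the kernel rather than something confirmed by the source, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: y_index is recomputed for every pixel although it depends only on y.
void copy_recomputed(const std::vector<float>& src, std::vector<float>& dst,
                     std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols && dst.size() == rows * cols);
    for (std::size_t y = 0; y < rows; ++y) {
        for (std::size_t x = 0; x < cols; ++x) {
            const std::size_t y_index = y * cols;  // invariant in x
            dst[y_index + x] = src[y_index + x];
        }
    }
}

// After: y_index is computed once per row, outside the inner loop.
void copy_hoisted(const std::vector<float>& src, std::vector<float>& dst,
                  std::size_t rows, std::size_t cols) {
    assert(src.size() == rows * cols && dst.size() == rows * cols);
    for (std::size_t y = 0; y < rows; ++y) {
        const std::size_t y_index = y * cols;  // hoisted out of the x loop
        for (std::size_t x = 0; x < cols; ++x) {
            dst[y_index + x] = src[y_index + x];
        }
    }
}
```

Both versions produce the same output. An optimizing compiler may already perform this hoisting on its own, which would be consistent with the small average improvement reported for this kernel.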

Generate image improved as well, since image sizes varied. Sizing the thread/block configuration to the actual number of pixels enabled better usage of memory.

[[File:Assignment2GraphGenerateImageKernelOptimized.png|550px]]
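
The thread/block sizing described in this section comes down to a ceiling division on the host side. The sketch below is a minimal illustration. The helper names and the value of 256 threads per block used in the example are assumptions and are not taken from the project.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: smallest number of blocks such that
// blocks * threads_per_block >= work_items (ceiling division).
inline std::size_t blocks_for(std::size_t work_items,
                              std::size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (work_items + threads_per_block - 1) / threads_per_block;
}

// Hypothetical helper: threads launched beyond the amount of work. These
// threads must be masked out by a bounds check inside the kernel.
inline std::size_t idle_threads(std::size_t work_items,
                                std::size_t threads_per_block) {
    return blocks_for(work_items, threads_per_block) * threads_per_block
           - work_items;
}
```

For 1000 pixels and 256 threads per block this gives 4 blocks with 24 idle threads, whereas a fixed launch of 32 blocks of 32 threads always starts 1024 threads regardless of the image size. In CUDA, the result would typically be passed as the first launch parameter, for example <code>kernel<<<blocks_for(n, 256), 256>>>(...)</code>, with the kernel returning early when its global index is not less than <code>n</code>.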