46
edits
Changes
N/A
,→Assignment 3
# [mailto:wwpark@myseneca.ca?subject=DPS915 Woosle Park], Data Compression
# [mailto:jpitters@myseneca.ca?subject=DPS915 Jordan Pitters], Image Processing
# [mailto:apatel271@myseneca.ca?subject=DPS915 Akshat Akshatkumar Patel], Sorting Algorithms
[mailto:jpitters@myseneca.ca;wwpark@myseneca.ca?subject=DPS915;apatel271@myseneca.ca?subject=DPS915 Email All];
'''Conclusion:'''
Based on the results from profiling 3 different sizes of arrays, we can assume that majority of the time taken to sort the arrays is taken by the O(n^2) algorithms (Bubble, Insertion, and Selection). However, the other sorting algorithms (Heap, Merge, Quick) that are O(n log(n)) are extremely fast even when the size of the arrays are large. As observed from the profile, the elapsed time increased as the size of the array increased. I went from approximately 8 seconds to execute the entire program to 13 minutes to execute.
'''Final Selection: Sorting Algorithms'''
Based on the profiled applications above, we think that the sorting algorithms would benefit a lot from offloading to the GPU. Sorting Algorithms are commonly used in programming and can have a strong impact on the programs speed and efficiency. Since they are so commonly used, we think it would be quite interesting to see if we can speed up the O(n^2) sorting algorithms to potentially match the sorting speed of the O(n log n) algorithms since there won’t much change for the parallelized version of them, as they are already fast.Bubble Sort is an elementary sort, it is the most basic sorting method out there however it is the worst sorting algorithm. By Speeding it up using the GPU, we plan to see how much we can improve this sorting algorithm. Also, we would like to look at Insertion Sort, mainly because we know that it is not possible to do on the GPU. We want to take this opportunity to be innovative and attempt or to the very least find a way to solve this issue. '''Summary of selection:''' Bubble & Insertion Sort
=== Assignment 2 ===
'''Akshatkumar Patel, Jordan Pitters, Woosle Park'''----Sorting Algorithm algorithm parallelization completion: We found a pseudo code for this algorithm which we got from the following link. The link describes bubble sort & how to parallelize it. '''Serial Bubble Sort Pseudo''' for i ← 1 to length [A] do for j ← length [A] downto i +1 do If A[A] < A[j-1] then Exchange A[j] ↔ A[j-1] '''Parallel Bubble Sort Pseudo''' For k = 0 to n-2 If k is even then for i = 0 to (n/2)-1 do in parallel If A[2i] > A[2i+1] then Exchange A[2i] ↔ A[2i+1] Else for i = 0 to (n/2)-2 do in parallel If A[2i+1] > A[2i+2] then Exchange A[2i+1] ↔ A[2i+2] Next k After doing some further research, we found out this presumed bubble sort is a variation of a bubble sort. It is known as an odd-even bubble sort. A bubble sort works by repeatedly iterating through the list and comparing adjacent pairs and swapping them if they are in the wrong order. This is an elementary sorting algorithm; it is the basic learning block to all sorting algorithms. It is quite easy to implement but it is extremely slow in comparison to its counterparts (quick, merge, heap, & insertion). '''Time Complexity''' Bubble sort has a worse case and average complexity of O(n^2), where n is the number of items being sorted. Surprisingly even all the other O(n^2) algorithms such as insertion and selection sort run faster than bubble (As shown in the profile in A1). We know from experience that this algorithm is inefficient and should be avoided in case of large collections (Collection size depends on system specs, but we will just assume 20K is the average). '''Time Complexity of Parallel Bubble Sort''' Based on the given pseudo code above, we iterate through the list (n) number of times and call the odd or even kernel based on the index (i). We then iterate through the list n/2 number of times making it O(n^2)/n. '''Our Approach''' We iterate through the list (n-1) number of times calling the kernel. We make use of the power of the GPU which is doing the same thing concurrently. In this case, we will be launching 512 threads of n/1024 blocks (We found this to be the most efficient block & thread counts). When we launch a kernel, the kernel will concurrently compare 2 values. The only difference to our approach and the pseudo approach is that we will need to pass in a flag (index % 2) to let our kernel know whether it is even or odd. Since we sped up our iteration to be iterating only once (n number of times), we can assume that the parallel time complexity is O(n) and needs O(n) processors. __global__ void oddeven_bubble(int *a, int flag, int size) { int index = blockIdx.x * blockDim.x + threadIdx.x; int temp; if ((index >= size / 2 - 1) && flag % 2 != 0) return; if (flag % 2 == 0) { if (a[index * 2] > a[index * 2 + 1]) { temp = a[index * 2]; a[index * 2] = a[index * 2 + 1]; a[index * 2 + 1] = temp; } } else { if (a[index * 2 + 1] > a[index * 2 + 2]) { temp = a[index * 2 + 1]; a[index * 2 + 1] = a[index * 2 + 2]; a[index * 2 + 2] = temp; } } } This Algorithm were created & profiled on Tesla K80 GPU (Cloud – Google Collab). In order to compile them the follow was typed: !nvcc bubble.cu -o bubble '''Parallel Odd-Even Sort Comparison to Bubble & Overview''' Our Odd-even sort is uses N independent threads to do sort the number of elements in the list. The number of blocks used is (size+1024)/1024 * 512 threads to sort the elements in the list. Hence, we can assume N = half of size. The odd/even sorts by selecting every even array index & comparing with the adjacent and odd array index as observed an be observed in the image below. If the numbers are in the wrong order, a swap takes place. This process is repeated size-1 number of times. '''Observation''' Offloading the computing intensive sorting algorithm to the GPU resulted in DRASTIC increase in speed. We were able to take an algorithm that took about 13 minutes to sort a list of 500k randomly generated values and solve it within 15 seconds. '''Profile for our bubble sort''' Command used to generate profile: !nvprof ./bubble 500000 '''Profile generated for N = 500000''' ==437== NVPROF is profiling process 437, command: ./bubble 500000 ==437== Profiling application: ./bubble 500000 ==437== Profiling result: Type Time(%) Time Calls Avg Min Max Name GPU activities: 100.00% 15.6692s 499999 31.338us 23.360us 47.807us oddeven_bubble(int*, int, int) 0.00% 324.09us 1 324.09us 324.09us 324.09us [CUDA memcpy HtoD] 0.00% 258.27us 1 258.27us 258.27us 258.27us [CUDA memcpy DtoH] API calls: 98.63% 17.9870s 499999 35.974us 5.6240us 6.0456ms cudaLaunchKernel 0.85% 155.57ms 1 155.57ms 155.57ms 155.57ms cudaMalloc 0.33% 60.878ms 1 60.878ms 60.878ms 60.878ms cudaDeviceReset 0.17% 31.076ms 1 31.076ms 31.076ms 31.076ms cudaDeviceSynchronize 0.00% 835.79us 2 417.90us 407.55us 428.25us cudaMemcpy 0.00% 328.55us 96 3.4220us 156ns 146.17us cuDeviceGetAttribute 0.00% 264.00us 1 264.00us 264.00us 264.00us cudaFree 0.00% 187.13us 1 187.13us 187.13us 187.13us cuDeviceTotalMem 0.00% 26.678us 1 26.678us 26.678us 26.678us cuDeviceGetName 0.00% 2.9030us 1 2.9030us 2.9030us 2.9030us cuDeviceGetPCIBusId 0.00% 2.1570us 3 719ns 210ns 1.2000us cuDeviceGetCount 0.00% 1.3290us 2 664ns 305ns 1.0240us cuDeviceGet 0.00% 304ns 1 304ns 304ns 304ns cuDeviceGetUuid
----
In order to do so, we created a hash table sort of array structure. Since the insertionArray (array that held random values) had only values 0 < n < argument (Ex.500000), we simply iterated through the list at the start before passing the array on to the kernel. In this process, we iterated through the randomly numbers and added incremented the count array for each index we found (countArray[insertionArray[File:a2_bubble.cu.txti]]++).
Then we simply performed the same process on the “sorted” array to find out that 50% of the values would be missing. We tried multiple methods for the insertion sort, even syncing the threads after every step taking. Unfortunately, this increased the time drastically and still did not work. The logic of insertion sort cannot be applied to the kernel because the threads will be competing against one another instead of working together. Ex.All the threads would try to replace arr[0] all at once. This is done multiple times throughout the array and this is the reason why we lost 50% of the data. The data lost in a way correlates to the number of threads launched ((size + 512)/512) * 512), when we launch 1 block of 1 thread, it works but that is because we do not utilize the power of the GPU to compute multiple things at once. CUDA is designed to allow blocks to run in any order. There is also no cross-block synchronization within a single run making insertion sort impossible to work along with a lot of other sorting algorithms. In fact, this is the problem with many of the sorting algorithms we have looked at. '''Woosle Park – Insertion Sort Final files & Heap SortComparitive Graph''' '''Graph Comparison: ----Please give it a few seconds to load, we noticed it takes longer than expected to load the graph''' https://docs.google.com/spreadsheets/d/176TTtES25qwWpm18aDRkPYDCF2EMKmsIGz3k0iuVafk/edit?usp=sharing
'''ObservationBubble.cu:'''For heapsort the recursive algorithm creates a cuda warning of potential stack overflow. For the heapify kernel identifying the left and right element of the heap worked better using bit manipulation to locate them. Same issue occurred here as well in the labs where my gpu is too fast so the results of each kernel flat lined despite the element increased. That being said you do notice a slight increase in speed comparing the gpu results the higher the number of elements[[File:Bubble. Insertion sort currently running in 1-dimensional gird for testing will be changed in A3cu. txt]]
'''Insertion.cu (With method to test):''' [[File:Insertion.cu.txt]]
----
----
'''Final thoughts:'''
'''Some of the things we learned'''
----Parallel Bubble Sort & its implementation
- Blocks will run in random orders (One of the reasons why we encountered an error with the single kernel with a for loop inside odd-even approach). We were able make small numbers work & it was much faster but for big numbers approximately 1 percent of the array size would be incorrectly sorted.