=== Assignment 2 ===
'''Akshatkumar Patel, Jordan Pitters, Woosle Park'''
----
Sorting algorithm parallelization: We found pseudo code for this algorithm in an online reference that describes bubble sort and how to parallelize it.

'''Serial Bubble Sort Pseudo'''
 for i ← 1 to length[A] do
     for j ← length[A] downto i + 1 do
         if A[j] < A[j-1] then
             exchange A[j] ↔ A[j-1]

'''Parallel Bubble Sort Pseudo'''
 for k = 0 to n-2
     if k is even then
         for i = 0 to (n/2)-1 do in parallel
             if A[2i] > A[2i+1] then
                 exchange A[2i] ↔ A[2i+1]
     else
         for i = 0 to (n/2)-2 do in parallel
             if A[2i+1] > A[2i+2] then
                 exchange A[2i+1] ↔ A[2i+2]
 next k

After doing some further research, we found that this presumed bubble sort is a variation of bubble sort known as odd-even bubble sort. A bubble sort works by repeatedly iterating through the list, comparing adjacent pairs and swapping them if they are in the wrong order. It is an elementary sorting algorithm and a basic building block for learning other sorting algorithms. It is quite easy to implement but extremely slow compared to its counterparts (quick, merge, heap, and insertion sort).

'''Time Complexity''' Bubble sort has a worst-case and average complexity of O(n^2), where n is the number of items being sorted. Even the other O(n^2) algorithms, such as insertion and selection sort, run faster than bubble sort in practice (as shown in the profile in A1). We know from experience that this algorithm is inefficient and should be avoided for large collections (what counts as large depends on system specs, but we will assume 20K on average).

'''Time Complexity of Parallel Bubble Sort''' Based on the pseudo code above, we iterate through the list n times, calling the odd or even kernel depending on the phase index. Each call performs its n/2 comparisons in parallel, so the O(n^2) total work is divided across n/2 concurrent workers, giving O(n) parallel time.
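To make the pseudo code concrete, here is a minimal serial sketch of the odd-even variant in C++ (our own illustration, not the assignment's submitted code):

 #include <utility>  // std::swap
 
 // Serial odd-even (brick) sort: n-1 phases, alternating between
 // even pairs (0,1),(2,3),... and odd pairs (1,2),(3,4),...
 void oddeven_sort(int *a, int n) {
     for (int k = 0; k < n - 1; ++k) {
         int start = (k % 2 == 0) ? 0 : 1;       // even phase starts at 0, odd at 1
         for (int i = start; i + 1 < n; i += 2)  // pairs do not overlap within a phase
             if (a[i] > a[i + 1])
                 std::swap(a[i], a[i + 1]);
     }
 }

Because the pairs within a phase do not overlap, every inner-loop iteration can run in parallel, which is exactly what the kernel below exploits.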
'''Akshatkumar Patel – Bubble Sort & Merge Sort'''
----
'''Our Approach'''
We iterate through the list n-1 times, calling the kernel on each pass, and use the GPU's ability to do the same operation concurrently. We launch 512 threads per block with n/1024 blocks (we found these to be the most efficient block and thread counts). Each kernel launch concurrently compares pairs of values. The only difference between our approach and the pseudo code is that we pass in a flag (the pass index % 2) to let the kernel know whether the phase is even or odd. Since each phase's comparisons now happen concurrently and only the n phases run sequentially, the parallel time complexity is O(n) using O(n) processors.

 __global__ void oddeven_bubble(int *a, int flag, int size) {
     int index = blockIdx.x * blockDim.x + threadIdx.x;
     int temp;
 
     if (index >= size / 2) return;                       // one thread per pair
     if (flag % 2 != 0 && index >= size / 2 - 1) return;  // odd phase has one fewer pair
 
     if (flag % 2 == 0) {                                 // even phase: compare (2i, 2i+1)
         if (a[index * 2] > a[index * 2 + 1]) {
             temp = a[index * 2];
             a[index * 2] = a[index * 2 + 1];
             a[index * 2 + 1] = temp;
         }
     } else {                                             // odd phase: compare (2i+1, 2i+2)
         if (a[index * 2 + 1] > a[index * 2 + 2]) {
             temp = a[index * 2 + 1];
             a[index * 2 + 1] = a[index * 2 + 2];
             a[index * 2 + 2] = temp;
         }
     }
 }

This algorithm was created and profiled on a Tesla K80 GPU (cloud – Google Colab).
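Reconstructing from the description above, the host-side launch loop presumably looked something like this sketch (variable names such as d_a are our own assumptions):

 const int threads = 512;
 const int blocks = (n / 2 + threads - 1) / threads;  // ~n/1024 blocks, one thread per pair
 
 for (int k = 0; k < n - 1; ++k)                      // n-1 phases; k doubles as the even/odd flag
     oddeven_bubble<<<blocks, threads>>>(d_a, k, n);
 cudaDeviceSynchronize();                             // wait for the final phase to finish

The exact compile line is not recorded here; for a single-file program named bubble.cu it would presumably be along the lines of:

 nvcc -o bubble bubble.cu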
'''Observation: Parallel Odd-Even Sort Compared to Bubble Sort'''
We thought that offloading the entire sorting algorithms (since they are compute-intensive) would result in an increase in speed. However, we observed that this was not the case for quite a few of the sorting algorithms, since they are not optimized for the GPU. One thing that stood out was that the approach to bubble sort had to change for it to work at all: we had to use the odd/even variant. With plain bubble sort, as N gets bigger the GPU's time increases faster than the CPU's, and trying to use threads to speed it up actually resulted in slower runs.
'''Woosle Park – Insertion Sort & Heap Sort'''

'''Profile for our bubble sort'''
----
'''Profile generated for N = 500000'''
 ==437== NVPROF is profiling process 437, command: ./bubble 500000
 ==437== Profiling application: ./bubble 500000
 ==437== Profiling result:
             Type  Time(%)      Time     Calls       Avg       Min       Max  Name
  GPU activities:  100.00%  15.6692s    499999  31.338us  23.360us  47.807us  oddeven_bubble(int*, int, int)
                     0.00%  324.09us         1  324.09us  324.09us  324.09us  [CUDA memcpy HtoD]
                     0.00%  258.27us         1  258.27us  258.27us  258.27us  [CUDA memcpy DtoH]
       API calls:   98.63%  17.9870s    499999  35.974us  5.6240us  6.0456ms  cudaLaunchKernel
                     0.85%  155.57ms         1  155.57ms  155.57ms  155.57ms  cudaMalloc
                     0.33%  60.878ms         1  60.878ms  60.878ms  60.878ms  cudaDeviceReset
                     0.17%  31.076ms         1  31.076ms  31.076ms  31.076ms  cudaDeviceSynchronize
                     0.00%  835.79us         2  417.90us  407.55us  428.25us  cudaMemcpy
                     0.00%  328.55us        96  3.4220us     156ns  146.17us  cuDeviceGetAttribute
                     0.00%  264.00us         1  264.00us  264.00us  264.00us  cudaFree
                     0.00%  187.13us         1  187.13us  187.13us  187.13us  cuDeviceTotalMem
                     0.00%  26.678us         1  26.678us  26.678us  26.678us  cuDeviceGetName
                     0.00%  2.9030us         1  2.9030us  2.9030us  2.9030us  cuDeviceGetPCIBusId
                     0.00%  2.1570us         3     719ns     210ns  1.2000us  cuDeviceGetCount
                     0.00%  1.3290us         2     664ns     305ns  1.0240us  cuDeviceGet
                     0.00%     304ns         1     304ns     304ns     304ns  cuDeviceGetUuid
----
'''Observation'''
We considered the methods we would need to take to complete the task of parallelizing the sorting algorithms, and had assumed that offloading the entire algorithm off the CPU onto the GPU could yield a speed increase. However, after testing and observation, we found that the sorting algorithms only yield positive results if we manage a fine balance between CPU and GPU code calls, and only if the GPU calls are optimized. This was especially so for Selection Sort: we tried completely offloading the algorithm, and it did not seem to yield any positive results. We were not sure which methods were necessary to optimize the algorithm GPU-wise, and had only rough ideas about designing a grid structure, so it remained something we would need to investigate. Quick Sort, on the other hand, was difficult to design because the algorithm uses several recursive calls to complete its sorting, which would mean recursive kernel calls if we were to offload it as planned. It eventually came to the point where we were not sure what to do about the algorithm and had to research the capabilities of CUDA kernels, as well as other theorized ways of parallelizing Quick Sort. We found many comments and pseudo code suggesting the same recursive conclusion we had reached. We also found that kernels can indeed be called recursively, and that they are to be tied to streams to designate correlated kernels. As such, streams and several thread synchronizations, among other things, were used to complete the task of parallelizing Quick Sort. It was not simple, and the cloud compiler did not allow recursive kernel calls by default, so some research was done that yielded the command line switch "-arch=sm_35 -rdc=true", which allowed recursion so that testing could be done. In the end the code was not optimized, but we were able to prove the capability of parallelization for the Selection and Quick sorting algorithms.

'''Our Scenario'''
We actually thought we had the sort working after multiple iterations and after adding __syncthreads() everywhere. Suddenly, for array size 100000 it worked. We tested it for 200000 up to 500000, and they all worked extremely fast (about 2x faster than the CPU). We were quite surprised with our findings, but something felt wrong. We then tested it with sizes 512 - 25000 and they were all incorrect. We were curious to know what was happening, why there were inconsistencies, and why our tester function was returning 0 errors for big arrays yet errors for small arrays. This required us to come up with a solution to test our tester: comparing the number of occurrences of each generated value against the number of occurrences of each value in the so-called sorted array.
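The check described above amounts to verifying both ordering and that the output is a permutation of the input (every generated value appears the same number of times in the result). A minimal sketch of such a helper (our own illustration, assuming the data has been copied back into ordinary host arrays):

 #include <map>
 
 // Returns true only if 'out' is non-decreasing AND contains exactly
 // the same multiset of values as the original input 'in'.
 bool verify_sort(const int *in, const int *out, int n) {
     std::map<int, int> counts;
     for (int i = 0; i < n; ++i) ++counts[in[i]];         // tally the generated values
     for (int i = 0; i < n; ++i) {
         if (i > 0 && out[i - 1] > out[i]) return false;  // order violated
         if (--counts[out[i]] < 0) return false;          // value appears too often
     }
     return true;  // counts balance out because both arrays hold n values
 }

Checking the value counts as well as the order guards against a tester that accepts output which merely looks sorted while values have been dropped or duplicated.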
=== Assignment 3 ===