Changes

N/A

6,514 bytes added, 16:02, 23 March 2019

→‎Assignment 2

=== Assignment 2 ===

Sorting Algorithm parallelization completion:

'''Akshatkumar Patel – Bubble Sort & Merge Sort'''

----

Both (Bubble & Merge) sorting algorithms were created and profiled on Tesla K80 GPU (Cloud – Google Colab). In order to compile them the following was typed:

!nvcc bubble.cu -o bubble

Same can be applied to merge sort, but if it is not a computer_35 architecture the following compilation command must be used:

!nvcc -arch=sm_35 -rdc=true merge.cu -o merge -lcudadevrt

'''Observation:'''

We thought offloading the given entire sorting algorithms (since they are computing intensive in a way) would result in increase in speed. However, we observed that this was not the case for quite a few of the sorting algorithms since they are not optimized. One of the things that stood out to me was that I had to change my approach to bubble sort in order to make it work, I had to use the Odd/Even bubble sort. For bubble sort, when N gets bigger time increases more than that of the CPU’s. I tried using threads to make it faster but that resulted in a slower speed.

As for merge, I had difficulty doing recursion on the kernel but was solved using “-arch=sm_35 -rdc=true” command line switch. Merge can be optimized and improved greatly unlike the odd-even bubble sort since I found out there to be multiple solutions to creating it. I also happened to find a merge sort implementation in CUDA 6.0 Samples which were quite complex to understand but overall much faster than my implementation.

BUBBLE LINK

MERGE LINK

'''Woosle Park – Insertion Sort & Heap Sort'''

----

Both the insertion and heap sorting algorithms where created and profiled on a gtx1080 the algorithms were compiled using visual studio 2017.

'''Observation:'''

For heapsort the recursive algorithm creates a cuda warning of potential stack overflow. For the heapify kernel identifying the left and right element of the heap worked better using bit manipulation to locate them. Same issue occurred here as well in the labs where my gpu is too fast so the results of each kernel flat lined despite the element increased. That being said you do notice a slight increase in speed comparing the gpu results the higher the number of elements. Insertion sort currently running in 1-dimensional gird for testing will be changed in A3.

INSERTION LINK

HEAP LINK

'''Jordan Pitters – Selection & Quick Sort'''

----

Both the insertion and heap sorting algorithms where created and profiled on a gtx1080 the algorithms were compiled using visual studio 2017.

'''Observation:'''

Both the Selection & Quick sorting algorithms were created and profiled on Tesla K80 GPU (Cloud). Specifically, the compiler provided by Google Collaboratory at https://colab.research.google.com. In order to compile them the following steps were followed:

# open a cell for code and name the code file in the first line with: “%%file name.cu”. For example, I named the file using: “%%file TSort.cu” (meaning “Test Sort”).

# Put the code into the space, and when done click the play button to save the code to the cloud.

# Next, open another code cell and enter: “!nvcc name.cu -o name” (Do not run it yet)

# Specifically, I used “!nvcc -arch=sm_35 -rdc=true TSort.cu -o TSort -lcudadevrt” to run my code as it increased the compute capability to allow the required recursive calls of kernels. It is recommended to use this to run your code if it requires a computer_35 architecture or greater

# Finally, on a new line from the code in the previous step, or in a new cell, enter: “!./name 50” and click the play button to compile and run the code (the 50 is a command line argument).

# To time the code I used: “!nvprof ./TSort 50000” (as an example).

'''Observation:'''

We considered the methods we would need to take to completing the task of parallelizing the sorting algorithms and had assumed that offloading the entire algorithm (off the CPU on to the GPU) could yield results in speed. However, after testing and observations, we found that the sorting algorithms would only yield positive results if we managed a fine balance between CPU and GPU code calls, whether the GPU calls are optimized. This was especially so for Selection Sort, as we tried completely offloading the algorithm and it didn’t seem to yield any positive results. We were not sure of the methods that were necessary to optimize the algorithm GPU-wise, and had meager ideas about designing a grid structure, so it remained something we would need to investigate. The Quick Sort function on the other hand was difficult to design because the algorithm utilized several recursive calls to do complete its sorting, which would mean recursive kernel calls, if we were to offload as planned. It eventually came to the point where we were not sure what to do about the algorithm and had to research the capabilities of CUDA kernels, as well as potential other theorized ways of parallelizing Quick Sort. We found a great many number of comments on the algorithm and pseudo code suggesting the same recursive conclusion we reached. We also found that kernels can indeed be recursively called and are required to be tied to streams to designate correlated kernels. As such, the streams and several thread synchs, among other things were used to complete the task of parallelizing Quick Sort. It was not simple, and the cloud compiler did not allow recursive calls of kernels as default, so some research was done that yielded the command line switch “-arch=sm_35 -rdc=true” which allowed recursion so that testing could be done. In the end, the code was not optimized, but we were able to prove the capability of parallelization for the Selection and Quick sorting algorithms.

Quick LINK

Selection LINK

----

'''Final thoughts'''

Our parallel sorting algorithms were slower on the GPU. However, after reprofiling on an average computer with the following specs:

'''OS:''' Windows 10

'''Processor:''' Intel (R) Core (TM) i5-5200U CPU @2.2GHz

'''Ram:''' 8GB

The parallelized sorting algorithms were much faster than on the CPU listed above. Matrix performance is much faster than of an average computer hence the completion time in profiles are lower. Another thing to note is that, choosing the right number of threads & blocks is important or the sorting can be potentially incorrect. Many of our attempts at sorting worked when n was smaller than or equal to 50000 but as the we tested 500 thousand the sort was incorrect.

=== Assignment 3 ===

Akshat

45

edits

Changes

N/A

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

get involved with CDOT

courses

course projects

links

Tools