Changes

Jump to: navigation, search

N/A

3,391 bytes added, 01:33, 7 April 2019
Assignment 3
=== Assignment 3 ===
 
Bubble sort - Optimization
----
 
'''Thoughts about using Shared Memory:'''
 
After doing some research, we believe that we cannot use __shared__ memory for this. The reason for this is because the array size can be anything (500 thousand, 1 million, 2 million, etc.). There is very limited amount of __shared__ memory depending on the GPU. Even if we used __shared__ memory, there is always the issue that only the values within each block would get sorted. We would also need to deal with bank conflicts (Thread Id would access memory of id id & id+1 and write to memory at id).
 
 
'''Our attempting at reducing the kernel counts:'''
 
For this we removed the for loop that called the kernel n-1 times, and added a for loop inside the kernel surrounding the condition statement (if-else). This was quite interesting in the sense that it was much faster (about 30 – 50%) than the method we had derived from the pseudo code. Unfortunately, about 1% of size (argument entered) would be incorrectly sorted. When we ran this code on different types of GPU with a higher & lower compute capability, we had fewer and more errors respectively however the sorting was still incorrect. So overall, we decided to stick with our multiple kernel call approach as we didn’t want to have errors with the benefit of faster completion time.
 
----
'''Final thoughts:'''
 
Bubble sort was not able to be sped up as fast as quick/merge/heap sort O(n * logn) but compared to serial bubble sort we ran on matrix & average pc, there was very noticeable difference. Our overall approach felt a lot like the reduction lab where we called the kernel multiple times to get the answer. We also realized that the optimal solution will be different on different hardware platforms.While doing research, we found out there are even better parallel sorting algorithms (radix, rank sort, parallel merge sort, hyper quick sort). We actually found the CUDA sample code for parallel merge sort (It was for older version of CUDA). When we profiled it and ran it, it turned out to be much slower (1 – 3 seconds) than serial merge sort for N = 500,000. However, the CUDA quick sort given was just as fast as the serial quick sort (potentially faster by few milliseconds).
 
'''Some of the things we learned'''
- Odd-even sort
 
- Parallel Bubble Sort & its implementation
 
- Sequencing issues will be caused for coalesced hardware when acceding array elements 0 & 1 in thread 0, accessing 2 & 3 in thread 1 (Issue with shared memory & why we avoided it).
 
- Shared memory is an optimal choice where we call the kernel once and the kernel has a for loop inside of it. However, causes bank conflict (can be solved using strides?) but unfortunately gave us an incorrect result so we avoided that method.
 
- As the size of array goes up, the number of threads used goes up (size/2 thread are used). This will eventually result in the maximum number of threads being used.
 
- Blocks will run in random orders (One of the reasons why we encountered an error with the single kernel with a for loop inside odd-even approach). We were able make small numbers work & it was much faster but for big numbers approximately 1 percent of the array size would be incorrectly sorted.
 
- We came across some interesting sorting algorithms (Bucket, radix, rank, hyper quick and merge sort) which are better options than the parallel bubble sort.
45
edits

Navigation menu