== Team Members ==
# [mailto:wwpark@myseneca.ca?subject=DPS915 Woosle Park], Data Compression
# [mailto:jpitters@myseneca.ca?subject=DPS915 Jordan Pitters], Image Processing
# [mailto:apatel271@myseneca.ca?subject=DPS915 Akshatkumar Patel], Sorting Algorithms
[mailto:jpitters@myseneca.ca,wwpark@myseneca.ca,apatel271@myseneca.ca?subject=DPS915 Email All]
Most of the run time is spent in the compress function, and the hashtable accounts for most of that time because it is constantly being read from and written to. If the hashtable and the compress function were parallelized, about 90% of the run time would be affected. The application should run in O(n) time, so time increases linearly with file size; however, the hashtable's share of the time grew faster than the compress function's as the files got larger. This application is not a good candidate for parallelization because of the dictionary hashtable: it must be globally accessible and constantly modified and read, which poses problems for multiple threads, especially since reads and updates of the table must happen in sequence for efficient compression. Threading the compress step could therefore corrupt the compressed output, making this difficult to parallelize.
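To illustrate the dependency, here is a minimal LZW-style compress loop (our own sketch, not the assignment's source code): every iteration both reads the dictionary (longest-match lookup) and writes a new entry into it, so iteration i depends on all earlier insertions.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Minimal LZW-style compressor (illustrative sketch only). Each iteration
// both reads the dictionary (longest-match lookup) and writes it (new
// entry), so every step depends on all previous steps -- the sequential
// dependency that makes this loop hard to split across threads.
std::vector<int> compress(const std::string& input) {
    std::unordered_map<std::string, int> dict;
    for (int c = 0; c < 256; ++c)
        dict[std::string(1, static_cast<char>(c))] = c;  // single-byte codes
    int nextCode = 256;

    std::vector<int> out;
    std::string w;                      // current longest match
    for (char c : input) {
        std::string wc = w + c;
        if (dict.count(wc)) {
            w = wc;                     // keep extending the match
        } else {
            out.push_back(dict[w]);     // emit code for the match so far
            dict[wc] = nextCode++;      // dictionary grows as we compress
            w = std::string(1, c);
        }
    }
    if (!w.empty()) out.push_back(dict[w]);
    return out;
}
```

Splitting the input across threads would give each thread a different, incomplete dictionary, which is exactly the correctness problem described above.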
----
----
'''Application 3 - Sorting Algorithms'''
'''Description:'''
I selected an option from the suggested projects: Sorting Algorithms. The sorting algorithms I included were Bubble, Insertion, Selection, Heap, Merge, and Quicksort. Rather than creating individual modules and profiling each one separately, I created a single application that calls every algorithm's function in turn, since roughly 99% of the total running time is taken up by the sorting functions themselves; this makes it easier to compare how much time each sorting algorithm uses. I created a single module with functions that perform the sorting algorithms, allocated memory for all of the arrays, and populated a single array of size ''n'' with randomly generated data. I then copied the memory from the populated array into all of the other arrays to ensure consistent data throughout all of the sorting algorithms. Each array is passed to its respective sorting function, which returns the sorted array using pass-by-reference. One thing to keep in mind is that as ''n'' (the size of the array being sorted) increases, the time increases. I have included 3 different profiles where n equals 50,000, 100,000, and lastly 500,000.
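A sketch of that harness (fillArray matches the fillArray(int*, int) visible in the profiles below; the other names are our own assumptions): populate one master array with random data, then copy it into a fresh buffer before each sort so every algorithm receives identical input.

```cpp
#include <cstring>
#include <random>
#include <vector>

// Fill an array with reproducible random data (fixed seed), matching the
// fillArray(int*, int) signature seen in the gprof output.
void fillArray(int* arr, int size) {
    std::mt19937 gen(42);                          // fixed seed for repeatability
    std::uniform_int_distribution<int> dist(0, size - 1);
    for (int i = 0; i < size; ++i)
        arr[i] = dist(gen);
}

// Copy the master data into a scratch buffer and run one sort on it,
// so each algorithm is timed on identical input.
void runOne(const std::vector<int>& master, void (*sortFn)(int*, int)) {
    std::vector<int> work(master.size());
    std::memcpy(work.data(), master.data(), master.size() * sizeof(int));
    sortFn(work.data(), static_cast<int>(work.size()));  // time this call
}
```

One memcpy per algorithm keeps the comparison fair: no algorithm benefits from receiving already-sorted data left over by a previous run.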
'''Source Code:'''
'''N = 50,000'''
Total Time (seconds) = 7.98 seconds
Flat profile:
 Each sample counts as 0.01 seconds.
   %   cumulative   self               self     total
  time   seconds   seconds    calls   ms/call  ms/call  name
 63.21      5.04      5.04                              bubbleSort(int*, int)
 29.91      7.42      2.38                              selectionSort(int*, int)
  6.79      7.96      0.54                              insertionSort(int*, int)
  0.13      7.97      0.01    49999      0.00     0.00  merge(int*, std::vector<int, std::allocator<int> >&, int, int, int)
  0.13      7.98      0.01                              quickSort(int*, int, int)
  0.00      7.98      0.00        8      0.00     1.25  mergeIndexSort(int*, std::vector<int, std::allocator<int> >&, int, int)
  0.00      7.98      0.00                              _GLOBAL__sub_I__Z9fillArrayPii

 // Performs Bubble Sort on the array passed through parameter
 void bubbleSort(int *arr, int size) {
     for (int i = 0; i < size - 1; i++) {
         for (int j = 0; j < size - i - 1; j++) {
             if (arr[j] > arr[j + 1]) {
                 int tmp = arr[j];
                 arr[j] = arr[j + 1];
                 arr[j + 1] = tmp;
             }
         }
     }
 }
'''N = 100,000'''
Total Time (seconds) = 31.42 seconds
Flat profile:
 Each sample counts as 0.01 seconds.
   %   cumulative   self               self     total
  time   seconds   seconds    calls   ms/call  ms/call  name
 63.91     20.05     20.05                              bubbleSort(int*, int)
 29.15     29.19      9.14                              selectionSort(int*, int)
  6.96     31.38      2.18                              insertionSort(int*, int)
  0.06     31.40      0.02                              heapSort(int*, int)
  0.03     31.41      0.01    99999      0.00     0.00  merge(int*, std::vector<int, std::allocator<int> >&, int, int, int)
  0.03     31.42      0.01                              quickSort(int*, int, int)
  0.00     31.42      0.00        8      0.00     1.25  mergeIndexSort(int*, std::vector<int, std::allocator<int> >&, int, int)
  0.00     31.42      0.00        1      0.00     0.00  _GLOBAL__sub_I__Z9fillArrayPii

 // Performs Selection Sort on the array passed through parameter
 void selectionSort(int *arr, int size) {
     for (int i = 0; i < size; i++) {
         int minIdx = i;
         for (int j = i; j < size; j++) {
             if (arr[j] < arr[minIdx]) {
                 minIdx = j;
             }
         }
         // swap
         int tmp = arr[i];
         arr[i] = arr[minIdx];
         arr[minIdx] = tmp;
     }
 }
'''N = 500,000'''
Total Time (minutes) = 13.47 minutes
Flat profile:
 Each sample counts as 0.01 seconds.
   %   cumulative   self               self     total
  time   seconds   seconds    calls   ms/call  ms/call  name
 62.36    503.32    503.32                              bubbleSort(int*, int)
 30.67    750.86    247.54                              selectionSort(int*, int)
  7.08    807.99     57.14                              insertionSort(int*, int)
  0.02    808.12      0.13                              heapSort(int*, int)
  0.01    808.20      0.08   499999      0.00     0.00  merge(int*, std::vector<int, std::allocator<int> >&, int, int, int)
  0.01    808.26      0.06                              quickSort(int*, int, int)
  0.00    808.26      0.00        8      0.00    10.01  mergeIndexSort(int*, std::vector<int, std::allocator<int> >&, int, int)
  0.00    808.26      0.00                              _GLOBAL__sub_I__Z9fillArrayPii

 // Performs Insertion Sort on the array passed through parameter
 void insertionSort(int *arr, int size) {
     for (int i = 0; i < size; i++) {
         int curr = arr[i];
         int j = i;
         for (j = i; j > 0 && arr[j - 1] > curr; j--) {
             arr[j] = arr[j - 1];
         }
         arr[j] = curr;
     }
 }
'''Conclusion:'''
Based on the results from profiling 3 different array sizes, we can see that the majority of the time taken to sort the arrays is spent in the O(n^2) algorithms (Bubble, Insertion, and Selection). The other sorting algorithms (Heap, Merge, Quick), which are O(n log(n)), remain extremely fast even when the arrays are large. As the profiles show, the elapsed time increased as the size of the array increased: executing the entire program went from approximately 8 seconds to 13 minutes.
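As a quick sanity check of the O(n^2) claim (our own arithmetic, not part of the profiles): scaling n by 10x should scale an O(n^2) sort's time by roughly 100x, and 7.98 s × 100 ≈ 798 s ≈ 13.3 min, close to the measured 13.47 minutes.

```cpp
// Rough O(n^2) scaling predictor: scaling n by a factor k multiplies the
// expected running time by about k^2.
double predictQuadratic(double tSeconds, double k) {
    return tSeconds * k * k;  // e.g. k = 10 for 50,000 -> 500,000
}
```

Plugging in the N = 50,000 measurement, predictQuadratic(7.98, 10.0) gives about 798 seconds, which matches the observed 13.47 minutes within a few percent.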
=== Assignment 2 ===
'''Serial Bubble Sort Pseudo'''
for i ← 1 to length [A] do
for j ← length [A] downto i +1 do
If A[j] < A[j-1] then
Exchange A[j] ↔ A[j-1]
Bubble sort has a worst-case and average complexity of O(n^2), where n is the number of items being sorted. Even the other O(n^2) algorithms, such as insertion and selection sort, run faster than bubble sort (as shown in the profile in A1). We know from experience that this algorithm is inefficient and should be avoided for large collections (what counts as large depends on system specs, but we will assume 20K on average).
'''Time Complexity of Parallel Bubble Sort'''
Based on the given pseudo code above, we iterate through the list n times and call the odd or even kernel based on the index i. Each pass performs n/2 comparisons in parallel, making the parallel time O(n^2)/n = O(n).
'''Our Approach'''
We iterate through the list (n-1) times, calling the kernel on each pass, and let the GPU do the comparisons concurrently. In this case, we launch n/1024 blocks of 512 threads each (we found these to be the most efficient block and thread counts). On each launch, every thread concurrently compares one pair of values. The only difference between our approach and the pseudo code is that we pass in a flag (index % 2) to tell the kernel whether the pass is even or odd. Since each pass now runs in parallel and we only iterate n times, we can say the parallel time complexity is O(n) using O(n) processors.
// One pass of odd-even (brick) bubble sort: each thread compares one pair.
// flag tells the kernel whether this pass is even (pairs 0-1, 2-3, ...)
// or odd (pairs 1-2, 3-4, ...).
__global__ void oddeven_bubble(int *a, int flag, int size)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int temp;
    // On odd passes the last pair would read past the end of the array.
    if ((index >= size / 2 - 1) && flag % 2 != 0)
        return;
    if (flag % 2 == 0)
    {
        // Even pass: compare elements 2*index and 2*index + 1.
        if (a[index * 2] > a[index * 2 + 1])
        {
            temp = a[index * 2];
            a[index * 2] = a[index * 2 + 1];
            a[index * 2 + 1] = temp;
        }
    }
    else
    {
        // Odd pass: compare elements 2*index + 1 and 2*index + 2.
        if (a[index * 2 + 1] > a[index * 2 + 2])
        {
            temp = a[index * 2 + 1];
            a[index * 2 + 1] = a[index * 2 + 2];
            a[index * 2 + 2] = temp;
        }
    }
}
!nvcc bubble.cu -o bubble
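The alternating even/odd pairing the kernel implements can be sanity-checked on the CPU with a plain C++ simulation (our own sketch, not the assignment code; one inner-loop iteration corresponds to one GPU thread in a pass):

```cpp
#include <algorithm>
#include <vector>

// CPU model of odd-even transposition sort: n passes, alternating even
// pairs (0,1)(2,3)... and odd pairs (1,2)(3,4)... All swaps within one
// pass touch disjoint pairs, which is why the GPU can do them concurrently.
void oddEvenSort(std::vector<int>& a) {
    int n = static_cast<int>(a.size());
    for (int pass = 0; pass < n; ++pass) {
        int start = pass % 2;                 // 0 = even phase, 1 = odd phase
        for (int i = start; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1])
                std::swap(a[i], a[i + 1]);
        }
    }
}
```

After n passes the array is guaranteed sorted regardless of input order, which matches the repeated kernel launches described above.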
'''Parallel Odd-Even Sort Comparison to Bubble & Overview'''
'''Profile generated for N = 500000'''
'''Insertion Sort'''
We stated in A1 that we wanted to attempt Insertion sort in order to learn and be innovative.
'''Kernel Code:'''
// Attempted parallel insertion sort: each thread starts at its own index
// but reads and writes the whole array, so threads race on arr[].
__global__ void Sort(int* arr, int size) {
for (int i = threadIdx.x; i < size; i++) {
int curr = arr[i];
int j = i;
for (j = i; j > 0 && arr[j - 1] > curr; j--) {
arr[j] = arr[j - 1];
}
arr[j] = curr;
}
}
'''Our Scenario'''
We actually thought we had it working after multiple iterations and adding __syncthreads() everywhere. Suddenly, for array size 100,000, it worked. We tested it for 200,000 to 500,000, and they all worked extremely fast (about 2x faster than the CPU). We were quite surprised with our findings, but something felt wrong. We tested the kernel with sizes 512 - 25,000 and they were all incorrect. We were curious to know what was happening, why there were inconsistencies, and how our tester function could return 0 errors for big arrays yet report errors for small arrays. This forced us to come up with a way to test our tester: comparing the count of each value generated against the count of each value in the so-called sorted array.
'''Here is how we found out that the values were getting overwritten:'''
We created two separate arrays of size n (the value passed through the argument); the first held the random values and the second was initialized to 0.
Our goal was to find out why the kernel worked for some sizes and not for others. Generally, the small sizes ( < 50,000) did not work and the big sizes did, according to our original tester. This made us curious to know why our kernel worked for big values but not small ones.
To do this, we created a hashtable-like count array. Since the insertionArray (the array that held the random values) contained only values between 0 and the argument (e.g. 500,000), we simply iterated through the random numbers before passing the array on to the kernel and incremented the count array at each value we found (countArray[insertionArray[i]]++).
Then we performed the same process on the “sorted” array, only to find that 50% of the values were missing. We tried multiple fixes for the insertion sort, even syncing the threads after every step. Unfortunately, this increased the time drastically and still did not work.
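The counting check can be sketched as follows (our reconstruction of the idea; since every generated value v satisfies 0 <= v < n, the value itself indexes the count array):

```cpp
#include <vector>

// Histogram the input, then subtract the histogram of the "sorted" output;
// any nonzero slot means the kernel lost or duplicated that value.
int countMismatches(const std::vector<int>& insertionArray,
                    const std::vector<int>& sortedArray, int n) {
    std::vector<int> countArray(n, 0);
    for (int v : insertionArray)
        countArray[v]++;                 // countArray[insertionArray[i]]++
    for (int v : sortedArray)
        countArray[v]--;
    int bad = 0;
    for (int c : countArray)
        if (c != 0) bad++;
    return bad;                          // 0 means output is a permutation of input
}
```

A return value of 0 means the output contains exactly the same multiset of values as the input; any positive count exposes overwritten values even when the output happens to be in ascending order.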
'''Final files & Comparative Graph'''
'''Graph Comparison''' - Please give it a few seconds to load; we noticed it takes longer than expected to load the graph.
'''N = 500,000 Flat profile:'''
'''Final files:'''
'''Bubble.cu:''' [[File:Bubble.cu.txt]]
'''Insertion.cu (With method to test):''' [[File:Insertion.cu.txt]]
=== Assignment 3 ===
Bubble sort - Optimization
----
'''Thoughts about using Shared Memory:'''
After doing some research, we believe that we cannot use __shared__ memory for this. The array size can be anything (500 thousand, 1 million, 2 million, etc.), while the amount of __shared__ memory is very limited and depends on the GPU. Even if we used __shared__ memory, only the values within each block would get sorted. We would also need to deal with bank conflicts (each thread reads elements id and id + 1 and writes back to element id).
'''Our attempt at reducing the kernel count:'''
For this, we removed the for loop that called the kernel n-1 times and added a for loop inside the kernel, surrounding the conditional (if-else). This was quite interesting in the sense that it was much faster (about 30 - 50%) than the method we had derived from the pseudo code. Unfortunately, about 1% of the values (of the size entered) would be incorrectly sorted. When we ran this code on GPUs with higher and lower compute capabilities, we got fewer and more errors respectively, but the sort was still incorrect. Overall, we decided to stick with our multiple-kernel-call approach, as we didn't want to trade correctness for a faster completion time.
----
'''Final thoughts:'''
Bubble sort could not be sped up to match quick/merge/heap sort (O(n log n)), but compared to the serial bubble sort we ran on matrix and an average PC, there was a very noticeable difference. Our overall approach felt a lot like the reduction lab, where we called the kernel multiple times to get the answer. We also realized that the optimal solution will be different on different hardware platforms. While doing research, we found there are even better parallel sorting algorithms (radix, rank sort, parallel merge sort, hyper quicksort). We actually found CUDA sample code for parallel merge sort (for an older version of CUDA). When we profiled and ran it, it turned out to be much slower (1 - 3 seconds) than serial merge sort for N = 500,000. However, the CUDA quicksort provided was just as fast as the serial quicksort (potentially faster by a few milliseconds).
'''Some of the things we learned'''
- Odd-even sort
- Parallel Bubble Sort & its implementation
- Race Conditions
- We CANNOT do Insertion sort in parallel (Confirmed!)
- Coalesced hardware runs into sequencing issues when thread 0 accesses array elements 0 & 1 while thread 1 accesses elements 2 & 3 (an issue with shared memory and one reason we avoided it).
- Shared memory would be the natural choice for the variant where we call the kernel once and loop inside it. However, it causes bank conflicts (possibly solvable using strides?) and unfortunately gave us an incorrect result, so we avoided that method.
- As the size of the array goes up, the number of threads used goes up (size/2 threads are used). This will eventually result in the maximum number of threads being used.
- Blocks run in random order (one of the reasons we encountered errors with the single-kernel for-loop odd-even approach). We were able to make small sizes work, and it was much faster, but for big sizes approximately 1 percent of the array would be incorrectly sorted.
- Algorithms with a lot of branching that are VERY sequential are not suitable for the GPU.
- We came across some interesting sorting algorithms (Bucket, radix, rank, hyper quick and merge sort) which are better options than the parallel bubble sort.