Changes

← Older edit

N/A

6,545 bytes added, 02:57, 7 April 2019

→‎Assignment 3

# [mailto:wwpark@myseneca.ca?subject=DPS915 Woosle Park], Data Compression

# [mailto:jpitters@myseneca.ca?subject=DPS915 Jordan Pitters], Image Processing

# [mailto:apatel271@myseneca.ca?subject=DPS915 ~~Akshat~~ Akshatkumar Patel], Sorting Algorithms

[mailto:jpitters@myseneca.ca;wwpark@myseneca.ca?subject=DPS915;apatel271@myseneca.ca?subject=DPS915 Email All];

=== Assignment 2 ===

'''Akshatkumar Patel, Jordan Pitters, Woosle Park'''----Sorting ~~Algorithm~~ algorithm parallelization completion: We found a pseudo code for this algorithm which we got from the following link. The link describes bubble sort & how to parallelize it. '''Serial Bubble Sort Pseudo''' for i ← 1 to length [A] do for j ← length [A] downto i +1 do If A[A] < A[j-1] then Exchange A[j] ↔ A[j-1] '''Parallel Bubble Sort Pseudo''' For k = 0 to n-2 If k is even then for i = 0 to (n/2)-1 do in parallel If A[2i] > A[2i+1] then Exchange A[2i] ↔ A[2i+1] Else for i = 0 to (n/2)-2 do in parallel If A[2i+1] > A[2i+2] then Exchange A[2i+1] ↔ A[2i+2] Next k After doing some further research, we found out this presumed bubble sort is a variation of a bubble sort. It is known as an odd-even bubble sort. A bubble sort works by repeatedly iterating through the list and comparing adjacent pairs and swapping them if they are in the wrong order. This is an elementary sorting algorithm; it is the basic learning block to all sorting algorithms. It is quite easy to implement but it is extremely slow in comparison to its counterparts (quick, merge, heap, & insertion). '''Time Complexity''' Bubble sort has a worse case and average complexity of O(n^2), where n is the number of items being sorted. Surprisingly even all the other O(n^2) algorithms such as insertion and selection sort run faster than bubble (As shown in the profile in A1). We know from experience that this algorithm is inefficient and should be avoided in case of large collections (Collection size depends on system specs, but we will just assume 20K is the average). '''Time Complexity of Parallel Bubble Sort''' Based on the given pseudo code above, we iterate through the list (n) number of times and call the odd or even kernel based on the index (i). We then iterate through the list n/2 number of times making it O(n^2)/n. '''Our Approach''' We iterate through the list (n-1) number of times calling the kernel. We make use of the power of the GPU which is doing the same thing concurrently. In this case, we will be launching 512 threads of n/1024 blocks (We found this to be the most efficient block & thread counts). When we launch a kernel, the kernel will concurrently compare 2 values. The only difference to our approach and the pseudo approach is that we will need to pass in a flag (index % 2) to let our kernel know whether it is even or odd. Since we sped up our iteration to be iterating only once (n number of times), we can assume that the parallel time complexity is O(n) and needs O(n) processors. __global__ void oddeven_bubble(int *a, int flag, int size) { int index = blockIdx.x * blockDim.x + threadIdx.x; int temp; if ((index >= size / 2 - 1) && flag % 2 != 0) return; if (flag % 2 == 0) { if (a[index * 2] > a[index * 2 + 1]) { temp = a[index * 2]; a[index * 2] = a[index * 2 + 1]; a[index * 2 + 1] = temp; } } else { if (a[index * 2 + 1] > a[index * 2 + 2]) { temp = a[index * 2 + 1]; a[index * 2 + 1] = a[index * 2 + 2]; a[index * 2 + 2] = temp; } } } This Algorithm were created & profiled on Tesla K80 GPU (Cloud – Google Collab). In order to compile them the follow was typed: !nvcc bubble.cu -o bubble '''Parallel Odd-Even Sort Comparison to Bubble & Overview''' Our Odd-even sort is uses N independent threads to do sort the number of elements in the list. The number of blocks used is (size+1024)/1024 * 512 threads to sort the elements in the list. Hence, we can assume N = half of size. The odd/even sorts by selecting every even array index & comparing with the adjacent and odd array index as observed an be observed in the image below. If the numbers are in the wrong order, a swap takes place. This process is repeated size-1 number of times. '''Observation''' Offloading the computing intensive sorting algorithm to the GPU resulted in DRASTIC increase in speed. We were able to take an algorithm that took about 13 minutes to sort a list of 500k randomly generated values and solve it within 15 seconds. '''Profile for our bubble sort''' Command used to generate profile: !nvprof ./bubble 500000 '''Profile generated for N = 500000''' ==437== NVPROF is profiling process 437, command: ./bubble 500000 ==437== Profiling application: ./bubble 500000 ==437== Profiling result: Type Time(%) Time Calls Avg Min Max Name GPU activities: 100.00% 15.6692s 499999 31.338us 23.360us 47.807us oddeven_bubble(int*, int, int) 0.00% 324.09us 1 324.09us 324.09us 324.09us [CUDA memcpy HtoD] 0.00% 258.27us 1 258.27us 258.27us 258.27us [CUDA memcpy DtoH] API calls: 98.63% 17.9870s 499999 35.974us 5.6240us 6.0456ms cudaLaunchKernel 0.85% 155.57ms 1 155.57ms 155.57ms 155.57ms cudaMalloc 0.33% 60.878ms 1 60.878ms 60.878ms 60.878ms cudaDeviceReset 0.17% 31.076ms 1 31.076ms 31.076ms 31.076ms cudaDeviceSynchronize 0.00% 835.79us 2 417.90us 407.55us 428.25us cudaMemcpy 0.00% 328.55us 96 3.4220us 156ns 146.17us cuDeviceGetAttribute 0.00% 264.00us 1 264.00us 264.00us 264.00us cudaFree 0.00% 187.13us 1 187.13us 187.13us 187.13us cuDeviceTotalMem 0.00% 26.678us 1 26.678us 26.678us 26.678us cuDeviceGetName 0.00% 2.9030us 1 2.9030us 2.9030us 2.9030us cuDeviceGetPCIBusId 0.00% 2.1570us 3 719ns 210ns 1.2000us cuDeviceGetCount 0.00% 1.3290us 2 664ns 305ns 1.0240us cuDeviceGet 0.00% 304ns 1 304ns 304ns 304ns cuDeviceGetUuid

~~'''Akshatkumar Patel – Bubble Sort & Merge Sort'''~~

----

~~Both~~ '''Insertion Sort''' We stated in A1 that we wanted to attempt Insertion sort in order to learn & be innovate. '''Kernel Code:''' __global__ void Sort(int* arr, int size) { for (~~Bubble~~ int i = threadIdx.x; i < size; i++) { int curr = arr[i]; int j = i; for (j = i; j > 0 & ~~Merge~~& arr[j - 1] > curr; j--) ~~sorting algorithms were created~~ { arr[j] = arr[j - 1]; } arr[j] = curr; } } '''Our Scenario'''We actually thought we had it working after multiple iterations and ~~profiled on Tesla K80 GPU~~ adding __syncthreads(~~Cloud – Google Colab~~)everywhere. Suddenly for array size 100000 it worked. ~~In order~~ We tested it for 200000 to ~~compile them~~ 500000, they all worked extremely fast (about 2x faster than CPU). We were quite surprised with our findings, but something felt wrong. We tested the ~~following~~ kernel with size 512 - 25000 and they were all incorrect. We were a bit curious to know what was ~~typed~~happening, why there were inconsistencies and lastly how our tester function was returning 0 errors for big arrays and errors for small arrays. This required us to come up with a solution to test our tester & the # of same values generated to the # of values in the so called sorted array. '''Here is how we found out that the values were getting overridden:''' ~~!nvcc bubble~~We created two separate arrays of size n (value passed through the argument), first array held the random values and the second was initialized to 0.~~cu -o bubbleSame can be applied~~ Our goal was to ~~merge sort~~find out why it worked for some values and it didn’t work for some. Generally, the lower values did not work ( < 50000) and the big values worked according to our original tester. This made us curious to know why our kernel worked for big values but ~~if it is~~ not small values. In order to do so, we created a ~~computer_35 architecture~~ hash table sort of array structure. Since the ~~following compilation command must~~ insertionArray (array that held random values) had only values 0 < n < argument (Ex.500000), we simply iterated through the list at the start before passing the array on to the kernel. In this process, we iterated through the randomly numbers and added incremented the count array for each index we found (countArray[insertionArray[i]]++). Then we simply performed the same process on the “sorted” array to find out that 50% of the values would be ~~used:~~ ~~!nvcc -arch=sm_35 -rdc=true merge~~missing. We tried multiple methods for the insertion sort, even syncing the threads after every step taking. Unfortunately, this increased the time drastically and still did not work.~~cu -o merge -lcudadevrt~~

~~'''Observation:'''~~

We thought offloading the given entire sorting algorithms (since they are computing intensive in a way) would result in increase in speed. However, we observed that this was not the case for quite a few of the sorting algorithms since they are not optimized. One of the things that stood out to me was that I had to change my approach to bubble sort in order to make it work, I had to use the Odd/Even bubble sort. For bubble sort, when N gets bigger time increases more than that of the CPU’s. I tried using threads to make it faster but that resulted in a slower speed.

~~As for merge, I had difficulty doing recursion on~~ The logic of insertion sort cannot be applied to the kernel ~~but was solved using “-arch=sm_35 -rdc=true” command line switch~~because the threads will be competing against one another instead of working together. Ex. ~~Merge can be optimized and improved greatly unlike~~ All the ~~odd-even bubble sort since I found out there~~ threads would try to be replace arr[0] all at once. This is done multiple ~~solutions to creating it~~times throughout the array and this is the reason why we lost 50% of the data. ~~I also happened to find~~ The data lost in a ~~merge sort implementation in CUDA 6.0 Samples which were quite complex~~ way correlates to ~~understand~~ the number of threads launched ((size + 512)/512) * 512), when we launch 1 block of 1 thread, it works but ~~overall much faster than my implementation~~that is because we do not utilize the power of the GPU to compute multiple things at once.

~~[[File:a2_bubble.cu.txt]]~~

~~[[File:A2_merge~~CUDA is designed to allow blocks to run in any order.cuThere is also no cross-block synchronization within a single run making insertion sort impossible to work along with a lot of other sorting algorithms. In fact, this is the problem with many of the sorting algorithms we have looked at.~~txt]]~~

~~'''Woosle Park – Insertion Sort & Heap Sort'''~~

~~----~~

~~Both the insertion and heap sorting algorithms where created and profiled on~~ '''Final files & Comparitive Graph''' '''Graph Comparison: - Please give it a ~~gtx1080~~ few seconds to load, we noticed it takes longer than expected to load the ~~algorithms were compiled using visual studio 2017~~graph''' https://docs.google.com/spreadsheets/d/176TTtES25qwWpm18aDRkPYDCF2EMKmsIGz3k0iuVafk/edit?usp=sharing

'''~~Observation~~Final files:'''For heapsort the recursive algorithm creates a cuda warning of potential stack overflow. For the heapify kernel identifying the left and right element of the heap worked better using bit manipulation to locate them. Same issue occurred here as well in the labs where my gpu is too fast so the results of each kernel flat lined despite the element increased. That being said you do notice a slight increase in speed comparing the gpu results the higher the number of elements. Insertion sort currently running in 1-dimensional gird for testing will be changed in A3.

'''Bubble.cu:''' [[File:Bubble.cu.txt]]

'''Insertion.cu (With method to test):''' [[File:~~A2_insertion~~Insertion.cu.txt]]

~~[[File:A2_heap.cu.txt]]~~=== Assignment 3 ===

~~'''Jordan Pitters – Selection & Quick Sort'''~~Bubble sort - Optimization

----

~~Both the insertion and heap sorting algorithms where created and profiled on a gtx1080 the algorithms were compiled~~ '''Thoughts about using ~~visual studio 2017.~~Shared Memory:'''

~~'''Observation:'''Both the Selection & Quick sorting algorithms were created and profiled on Tesla K80 GPU (Cloud). Specifically~~After doing some research, ~~the compiler provided by Google Collaboratory at https://colab.research.google~~we believe that we cannot use __shared__ memory for this.~~com. In order to compile them the following steps were followed:~~ ~~# open a cell~~ The reason for ~~code and name~~ this is because the ~~code file in the first line with: “%%file name.cu”. For example~~array size can be anything (500 thousand, ~~I named the file using: “%%file TSort.cu” (meaning “Test Sort”).~~ ~~# Put the code into the space~~1 million, ~~and when done click the play button to save the code to the cloud.# Next~~2 million, ~~open another code cell and enter: “!nvcc name~~etc.~~cu -o name” (Do not run it yet~~)~~# Specifically, I used “!nvcc -arch=sm_35 -rdc=true TSort~~.~~cu -o TSort -lcudadevrt” to run my code as it increased~~ There is very limited amount of __shared__ memory depending on the ~~compute capability to allow the required recursive calls of kernels~~GPU. ~~It is recommended to use this to run your code~~ Even if ~~it requires a computer_35 architecture or greater# Finally~~we used __shared__ memory, ~~on a new line from~~ there is always the ~~code in~~ issue that only the ~~previous step, or in a new cell, enter: “!~~values within each block would get sorted.~~/name 50”~~ We would also need to deal with bank conflicts (Thread Id would access memory of id id & id+1 and ~~click the play button~~ write to ~~compile and run the code (the 50 is a command line argument).# To time the code I used: “!nvprof ./TSort 50000” (as an example~~memory at id).

'''~~Observation~~Our attempting at reducing the kernel counts:'''We considered the methods we would need to take to completing the task of parallelizing the sorting algorithms and had assumed that offloading the entire algorithm (off the CPU on to the GPU) could yield results in speed. However, after testing and observations, we found that the sorting algorithms would only yield positive results if we managed a fine balance between CPU and GPU code calls, whether the GPU calls are optimized. This was especially so for Selection Sort, as we tried completely offloading the algorithm and it didn’t seem to yield any positive results. We were not sure of the methods that were necessary to optimize the algorithm GPU-wise, and had meager ideas about designing a grid structure, so it remained something we would need to investigate. The Quick Sort function on the other hand was difficult to design because the algorithm utilized several recursive calls to do complete its sorting, which would mean recursive kernel calls, if we were to offload as planned. It eventually came to the point where we were not sure what to do about the algorithm and had to research the capabilities of CUDA kernels, as well as potential other theorized ways of parallelizing Quick Sort. We found a great many number of comments on the algorithm and pseudo code suggesting the same recursive conclusion we reached. We also found that kernels can indeed be recursively called and are required to be tied to streams to designate correlated kernels. As such, the streams and several thread synchs, among other things were used to complete the task of parallelizing Quick Sort. It was not simple, and the cloud compiler did not allow recursive calls of kernels as default, so some research was done that yielded the command line switch “-arch=sm_35 -rdc=true” which allowed recursion so that testing could be done. In the end, the code was not optimized, but we were able to prove the capability of parallelization for the Selection and Quick sorting algorithms.

For this we removed the for loop that called the kernel n-1 times, and added a for loop inside the kernel surrounding the condition statement (if-else). This was quite interesting in the sense that it was much faster (about 30 – 50%) than the method we had derived from the pseudo code. Unfortunately, about 1% of size (argument entered) would be incorrectly sorted. When we ran this code on different types of GPU with a higher & lower compute capability, we had fewer and more errors respectively however the sorting was still incorrect. So overall, we decided to stick with our multiple kernel call approach as we didn’t want to have errors with the benefit of faster completion time.

~~[[File~~----'''Final thoughts:~~A2_quick.cu.txt]]~~'''

~~[[File:A2_selection~~Bubble sort was not able to be speed up as fast as quick/merge/heap sort O(n * logn) but compared to serial bubble sort we ran on matrix & average pc, there was very noticeable difference.cuOur overall approach felt a lot like the reduction lab where we called the kernel multiple times to get the answer. We also realized that the optimal solution will be different on different hardware platforms.While doing research, we found out there are even better parallel sorting algorithms (radix, rank sort, parallel merge sort, hyper quick sort). We actually found the CUDA sample code for parallel merge sort (It was for older version of CUDA). When we profiled it and ran it, it turned out to be much slower (1 – 3 seconds) than serial merge sort for N = 500,000. However, the CUDA quick sort given was just as fast as the serial quick sort (potentially faster by few milliseconds).~~txt]]~~

'''Some of the things we learned''' -Odd---even sort

~~'''Final thoughts'''~~- Parallel Bubble Sort & its implementation

[https://docs.google.com/spreadsheets/d/1nDCtm8ar2AmhZujV4QeHUHoLfq80L1buAJqD0BALR0A/edit?usp=sharing https://docs.google.com/spreadsheets/d/1nDCtm8ar2AmhZujV4QeHUHoLfq80L1buAJqD0BALR0A/edit?usp=sharing GPU_CPU_Comparisons]- Race Conditions

~~Our~~ - We CANNOT do Insertion sort in parallel ~~sorting algorithms were slower on the GPU. However, after reprofiling on an average computer with the following specs:~~(Confirmed!)

~~'''OS:''' Windows 10~~- Sequencing issues will be caused for coalesced hardware when acceding array elements 0 & 1 in thread 0, accessing 2 & 3 in thread 1 (Issue with shared memory & why we avoided it).

~~'''Processor:''' Intel~~ - Shared memory is an optimal choice where we call the kernel once and the kernel has a for loop inside of it. However, causes bank conflict (Rcan be solved using strides?) ~~Core (TM) i5-5200U CPU @2~~but unfortunately gave us an incorrect result so we avoided that method.~~2GHz~~

~~'''Ram:''' 8GB~~- As the size of array goes up, the number of threads used goes up (size/2 thread are used). This will eventually result in the maximum number of threads being used.

- Blocks will run in random orders (One of the reasons why we encountered an error with the single kernel with a for loop inside odd-even approach). We were able make small numbers work & it was much faster but for big numbers approximately 1 percent of the array size would be incorrectly sorted.

~~The parallelized sorting algorithms were much faster than on the CPU listed above. Matrix performance is much faster than~~ - Algorithms with alot of ~~an average computer hence the completion time in profiles~~ branching and are VERY sequential are lower. Another thing to note is that, choosing the right number of threads & blocks is important or the sorting can be potentially incorrect. Many of our attempts at sorting worked when n was smaller than or equal to 50000 but as the we tested 500 thousand not suitable for the ~~sort was incorrect~~GPU.

~~=== Assignment 3 ===~~- We came across some interesting sorting algorithms (Bucket, radix, rank, hyper quick and merge sort) which are better options than the parallel bubble sort.

Wwpark

46

edits

Changes

N/A

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

get involved with CDOT

courses

course projects

links

Tools