Difference between revisions of "N/A"

From CDOT Wiki
Jump to: navigation, search
(Assignment 1)
(Assignment 3)
 
(16 intermediate revisions by one other user not shown)
Line 3: Line 3:
 
== Team Members ==  
 
== Team Members ==  
 
# [mailto:wwpark@myseneca.ca?subject=DPS915 Woosle Park], Data Compression
 
# [mailto:wwpark@myseneca.ca?subject=DPS915 Woosle Park], Data Compression
# [mailto:apatel271@myseneca.ca?subject=DPS915 Akshat Patel], Sorting Algorithms
 
 
# [mailto:jpitters@myseneca.ca?subject=DPS915 Jordan Pitters], Image Processing
 
# [mailto:jpitters@myseneca.ca?subject=DPS915 Jordan Pitters], Image Processing
 +
# [mailto:apatel271@myseneca.ca?subject=DPS915 Akshatkumar Patel], Sorting Algorithms
 
[mailto:jpitters@myseneca.ca;wwpark@myseneca.ca?subject=DPS915;apatel271@myseneca.ca?subject=DPS915 Email All];
 
[mailto:jpitters@myseneca.ca;wwpark@myseneca.ca?subject=DPS915;apatel271@myseneca.ca?subject=DPS915 Email All];
  
Line 313: Line 313:
 
'''Description:'''
 
'''Description:'''
  
I decided to select an option from the suggested projects – Sorting Algorithms. The sorting algorithms I included were: Bubble, Insertion, Selection, Merge, Heap, and Quicksort. In order to profile the sorting algorithms all at once and have a better comparison of what sorting algorithm uses the most time, I created a single module which functions that perform the sorting algorithms. I allocated memory for all of the arrays and populated a single array of size ''n'' with randomly generated data. I then copied the memory from the populated array to all of the other arrays to ensure consistency amongst the comparison of the sorting algorithms. I then passed each array to its respective sorting function which returned the sorted array using pass-by-reference. One of the things to keep in mind is that when ''n'' increases (the size of the array being sorted), the time increases. I have included 3 different profiles where ''n'' (the size of the array) equals 50000, 100000 and lastly 50000.
+
I decided to select an option from the suggested projects – Sorting Algorithms. The sorting algorithms I included were: Bubble, Insertion, Selection, Heap, Merge, and Quicksort. I decided to create an application that uses all of the sorting algorithms and calls their respective functions instead of creating individual modules and profiling them, simply because the 99%  percent of the total running time would be taken up by the sorting algorithm function. So for a better understanding and comparison of the time taken by each sorting algorithm, I decided to create a single module with functions that perform the sorting algorithms. I allocated memory for all of the arrays and populated a single array of size ''n'' with randomly generated data. I then copied the memory from the populated array to all of the other arrays to ensure consistent data throughout all of the sorting algorithms. I then passed each array to its respective sorting function which returned the sorted array using pass-by-reference. One of the things to keep in mind is that when ''n'' increases (the size of the array being sorted), the time increases. I have included 3 different profiles where n (the size of the array) equals 50000, 100000 and lastly 50000.
  
  
 
'''Source Code:'''
 
'''Source Code:'''
 
   [[File:SortingAlgorithms.cpp.txt]]
 
   [[File:SortingAlgorithms.cpp.txt]]
   The link to the source code AND the make file can also be found on GitHub at: [http://www.example.com https://github.com/Akshat55/Sorting-Algorithms-Comparison]
+
   The link to the source code AND the make file can also be found on GitHub at: [https://github.com/Akshat55/Sorting-Algorithms-Comparison https://github.com/Akshat55/Sorting-Algorithms-Comparison]
  
  
 
'''N = 50,000'''
 
'''N = 50,000'''
 +
  Total Time (seconds) = 7.98 seconds
 +
 
Flat profile:
 
Flat profile:
  
Line 337: Line 339:
  
 
'''N = 100,000'''
 
'''N = 100,000'''
 +
  Total Time (seconds) = 31.42 seconds
 
Flat profile:
 
Flat profile:
  
Line 353: Line 356:
  
 
'''N = 500,000'''
 
'''N = 500,000'''
 +
  Total Time (minutes) = 13.47 minutes
 
Flat profile:
 
Flat profile:
  
Line 368: Line 372:
  
 
'''Conclusion:'''
 
'''Conclusion:'''
 +
 
Based on the results from profiling 3 different sizes of arrays, we can assume that majority of the time taken to sort the arrays is taken by the O(n^2) algorithms (Bubble, Insertion, and Selection). However, the other sorting algorithms (Heap, Merge, Quick) that are O(n log(n)) are extremely fast even when the size of the arrays are large. As observed from the profile, the elapsed time increased as the size of the array increased. I went from approximately 8 seconds to execute the entire program to 13 minutes to execute.
 
Based on the results from profiling 3 different sizes of arrays, we can assume that majority of the time taken to sort the arrays is taken by the O(n^2) algorithms (Bubble, Insertion, and Selection). However, the other sorting algorithms (Heap, Merge, Quick) that are O(n log(n)) are extremely fast even when the size of the arrays are large. As observed from the profile, the elapsed time increased as the size of the array increased. I went from approximately 8 seconds to execute the entire program to 13 minutes to execute.
  
Line 373: Line 378:
  
 
'''Final Selection: Sorting Algorithms'''
 
'''Final Selection: Sorting Algorithms'''
Based on the profiled applications above, we think that the sorting algorithms would benefit a lot from offloading to the GPU. Sorting Algorithms are commonly used in programming. The sorting algorithm used can have a strong impact on the programs speed and efficiency. Since they are so widely used, we think it would be quite interesting to see if we can speed up the O(n^2) sorting algorithms to potentially match the sorting speed of the O(n log n) algorithms since there won’t much change for them as they are already fast.
+
 
 +
Based on the profiled applications above, we think that the sorting algorithms would benefit a lot from offloading to the GPU. Sorting Algorithms are commonly used in programming and can have a strong impact on the programs speed and efficiency. Since they are so commonly used, we think it would be quite interesting to see if we can speed up the O(n^2) sorting algorithms to potentially match the sorting speed of the O(n log n) algorithms since there won’t much change for the parallelized version of them, as they are already fast. Bubble Sort is an elementary sort, it is the most basic sorting method out there however it is the worst sorting algorithm. By Speeding it up using the GPU, we plan to see how much we can improve this sorting algorithm. Also, we would like to look at Insertion Sort, mainly because we know that it is not possible to do on the GPU. We want to take this opportunity to be innovative and attempt or to the very least find a way to solve this issue.
 +
 
 +
'''Summary of selection:'''
 +
 
 +
Bubble & Insertion Sort
  
 
=== Assignment 2 ===
 
=== Assignment 2 ===
 +
 +
 +
'''Akshatkumar Patel, Jordan Pitters, Woosle Park'''
 +
----
 +
Sorting algorithm parallelization completion:
 +
 +
 +
We found a pseudo code for this algorithm which we got from the following link. The link describes bubble sort & how to parallelize it.
 +
 +
'''Serial Bubble Sort Pseudo'''
 +
 +
    for i ← 1 to length [A] do
 +
        for j ← length [A] downto i +1 do
 +
            If A[A] < A[j-1] then
 +
                Exchange A[j] ↔ A[j-1]
 +
 +
 +
'''Parallel Bubble Sort Pseudo'''
 +
 +
    For k = 0 to n-2
 +
    If k is even then
 +
        for i = 0 to (n/2)-1 do in parallel
 +
            If A[2i] > A[2i+1] then
 +
                Exchange A[2i] ↔ A[2i+1]
 +
    Else
 +
        for i = 0 to (n/2)-2 do in parallel
 +
            If A[2i+1] > A[2i+2] then
 +
                Exchange A[2i+1] ↔ A[2i+2]
 +
    Next k
 +
 +
 +
After doing some further research, we found out this presumed bubble sort is a variation of a bubble sort. It is known as an odd-even bubble sort. A bubble sort works by repeatedly iterating through the list and comparing adjacent pairs and swapping them if they are in the wrong order. This is an elementary sorting algorithm; it is the basic learning block to all sorting algorithms.  It is quite easy to implement but it is extremely slow in comparison to its counterparts (quick, merge, heap, & insertion).
 +
 +
 +
'''Time Complexity'''
 +
 +
Bubble sort has a worse case and average complexity of O(n^2), where n is the number of items being sorted. Surprisingly even all the other O(n^2) algorithms such as insertion and selection sort run faster than bubble (As shown in the profile in A1). We know from experience that this algorithm is inefficient and should be avoided in case of large collections (Collection size depends on system specs, but we will just assume 20K is the average).
 +
 +
 +
'''Time Complexity of Parallel Bubble Sort'''
 +
 +
Based on the given pseudo code above, we iterate through the list (n) number of times and call the odd or even kernel based on the index (i). We then iterate through the list n/2 number of times making it O(n^2)/n.
 +
 +
 +
'''Our Approach'''
 +
 +
We iterate through the list (n-1) number of times calling the kernel. We make use of the power of the GPU which is doing the same thing concurrently. In this case, we will be launching 512 threads of n/1024 blocks (We found this to be the most efficient block & thread counts). When we launch a kernel, the kernel will concurrently compare 2 values. The only difference to our approach and the pseudo approach is that we will need to pass in a flag (index % 2) to let our kernel know whether it is even or odd. Since we sped up our iteration to be iterating only once (n number of times), we can assume that the parallel time complexity is O(n) and needs O(n) processors.
 +
 +
 +
    __global__ void oddeven_bubble(int *a, int flag, int size)
 +
    {
 +
   
 +
      int index = blockIdx.x * blockDim.x + threadIdx.x;
 +
      int temp;
 +
   
 +
      if ((index >= size / 2 - 1) && flag % 2 != 0)
 +
        return;
 +
   
 +
      if (flag % 2 == 0)
 +
      {
 +
        if (a[index * 2] > a[index * 2 + 1])
 +
        {
 +
          temp = a[index * 2];
 +
          a[index * 2] = a[index * 2 + 1];
 +
          a[index * 2 + 1] = temp;
 +
        }
 +
      }
 +
      else
 +
      {
 +
        if (a[index * 2 + 1] > a[index * 2 + 2])
 +
        {
 +
          temp = a[index * 2 + 1];
 +
          a[index * 2 + 1] = a[index * 2 + 2];
 +
          a[index * 2 + 2] = temp;
 +
        }
 +
      }
 +
    }
 +
 +
 +
This Algorithm were created & profiled on Tesla K80 GPU (Cloud – Google Collab). In order to compile them the follow was typed:
 +
 +
    !nvcc bubble.cu -o bubble
 +
 +
'''Parallel Odd-Even Sort Comparison to Bubble & Overview'''
 +
 +
Our Odd-even sort is uses N independent threads to do sort the number of elements in the list. The number of blocks used is (size+1024)/1024 * 512 threads to sort the elements in the list. Hence, we can assume N = half of size. The odd/even sorts by selecting every even array index & comparing with the adjacent and odd array index as observed an be observed in the image below. If the numbers are in the wrong order, a swap takes place. This process is repeated size-1 number of times.
 +
 +
'''Observation'''
 +
 +
Offloading the computing intensive sorting algorithm to the GPU resulted in DRASTIC increase in speed. We were able to take an algorithm that took about 13 minutes to sort a list of 500k randomly generated values and solve it within 15 seconds.
 +
 +
'''Profile for our bubble sort'''
 +
 +
Command used to generate profile:
 +
 +
!nvprof ./bubble 500000
 +
 +
'''Profile generated for N = 500000'''
 +
 +
==437== NVPROF is profiling process 437, command: ./bubble 500000
 +
==437== Profiling application: ./bubble 500000
 +
==437== Profiling result:
 +
            Type  Time(%)      Time    Calls      Avg      Min      Max  Name
 +
GPU activities:  100.00%  15.6692s    499999  31.338us  23.360us  47.807us  oddeven_bubble(int*, int, int)
 +
                    0.00%  324.09us        1  324.09us  324.09us  324.09us  [CUDA memcpy HtoD]
 +
                    0.00%  258.27us        1  258.27us  258.27us  258.27us  [CUDA memcpy DtoH]
 +
      API calls:  98.63%  17.9870s    499999  35.974us  5.6240us  6.0456ms  cudaLaunchKernel
 +
                    0.85%  155.57ms        1  155.57ms  155.57ms  155.57ms  cudaMalloc
 +
                    0.33%  60.878ms        1  60.878ms  60.878ms  60.878ms  cudaDeviceReset
 +
                    0.17%  31.076ms        1  31.076ms  31.076ms  31.076ms  cudaDeviceSynchronize
 +
                    0.00%  835.79us        2  417.90us  407.55us  428.25us  cudaMemcpy
 +
                    0.00%  328.55us        96  3.4220us    156ns  146.17us  cuDeviceGetAttribute
 +
                    0.00%  264.00us        1  264.00us  264.00us  264.00us  cudaFree
 +
                    0.00%  187.13us        1  187.13us  187.13us  187.13us  cuDeviceTotalMem
 +
                    0.00%  26.678us        1  26.678us  26.678us  26.678us  cuDeviceGetName
 +
                    0.00%  2.9030us        1  2.9030us  2.9030us  2.9030us  cuDeviceGetPCIBusId
 +
                    0.00%  2.1570us        3    719ns    210ns  1.2000us  cuDeviceGetCount
 +
                    0.00%  1.3290us        2    664ns    305ns  1.0240us  cuDeviceGet
 +
                    0.00%    304ns        1    304ns    304ns    304ns  cuDeviceGetUuid
 +
 +
 +
----
 +
 +
'''Insertion Sort'''
 +
 +
We stated in A1 that we wanted to attempt Insertion sort in order to learn & be innovate.
 +
 +
'''Kernel Code:'''
 +
 +
__global__ void Sort(int* arr, int size) {
 +
for (int i = threadIdx.x; i < size; i++) {
 +
int curr = arr[i];
 +
int j = i;
 +
for (j = i; j > 0 && arr[j - 1] > curr; j--) {
 +
arr[j] = arr[j - 1];
 +
}
 +
arr[j] = curr;
 +
}
 +
}
 +
 +
 +
'''Our Scenario'''
 +
We actually thought we had it working after multiple iterations and adding __syncthreads() everywhere. Suddenly for array size 100000 it worked. We tested it for 200000 to 500000, they all worked extremely fast (about 2x faster than CPU). We were quite surprised with our findings, but something felt wrong. We tested the kernel with size 512 - 25000 and they were all incorrect. We were a bit curious to know what was happening, why there were inconsistencies and lastly how our tester function was returning 0 errors for big arrays and errors for small arrays. This required us to come up with a solution to test our tester & the # of same values generated to the # of values in the so called sorted array.
 +
 +
 +
'''Here is how we found out that the values were getting overridden:'''
 +
 +
We created two separate arrays of size n (value passed through the argument), first array held the random values and the second was initialized to 0.
 +
 +
 +
Our goal was to find out why it worked for some values and it didn’t work for some. Generally, the lower values did not work ( < 50000) and the big values worked according to our original tester. This made us curious to know why our kernel worked for big values but not small values.
 +
 +
 +
In order to do so, we created a hash table sort of array structure. Since the insertionArray (array that held random values) had only values 0 < n < argument (Ex.500000), we simply iterated through the list at the start before passing the array on to the kernel.  In this process, we iterated through the randomly numbers and added incremented the count array for each index we found (countArray[insertionArray[i]]++).
 +
 +
 +
Then we simply performed the same process on the “sorted” array to find out that 50% of the values would be missing. We tried multiple methods for the insertion sort, even syncing the threads after every step taking. Unfortunately, this increased the time drastically and still did not work.
 +
 +
 +
The logic of insertion sort cannot be applied to the kernel because the threads will be competing against one another instead of working together.
 +
Ex.
 +
All the threads would try to replace arr[0] all at once. This is done multiple times throughout the array and this is the reason why we lost 50% of the data. The data lost in a way correlates to the number of threads launched ((size + 512)/512) * 512), when we launch 1 block of 1 thread, it works but that is because we do not utilize the power of the GPU to compute multiple things at once.
 +
 +
 +
CUDA is designed to allow blocks to run in any order. There is also no cross-block synchronization within a single run making insertion sort impossible to work along with a lot of other sorting algorithms. In fact, this is the problem with many of the sorting algorithms we have looked at.
 +
 +
 +
'''Final files & Comparitive Graph'''
 +
 +
'''Graph Comparison: - Please give it a few seconds to load, we noticed it takes longer than expected to load the graph'''
 +
 +
https://docs.google.com/spreadsheets/d/176TTtES25qwWpm18aDRkPYDCF2EMKmsIGz3k0iuVafk/edit?usp=sharing
 +
 +
'''Final files:'''
 +
 +
'''Bubble.cu:''' [[File:Bubble.cu.txt]]
 +
 +
'''Insertion.cu (With method to test):''' [[File:Insertion.cu.txt]]
 +
 
=== Assignment 3 ===
 
=== Assignment 3 ===
 +
 +
Bubble sort - Optimization
 +
----
 +
 +
'''Thoughts about using Shared Memory:'''
 +
 +
After doing some research, we believe that we cannot use __shared__ memory for this. The reason for this is because the array size can be anything (500 thousand, 1 million, 2 million, etc.). There is very limited amount of __shared__ memory depending on the GPU. Even if we used __shared__ memory, there is always the issue that only the values within each block would get sorted. We would also need to deal with bank conflicts (Thread Id would access memory of id id & id+1 and write to memory at id).
 +
 +
 +
'''Our attempting at reducing the kernel counts:'''
 +
 +
For this we removed the for loop that called the kernel n-1 times, and added a for loop inside the kernel surrounding the condition statement (if-else). This was quite interesting in the sense that it was much faster (about 30 – 50%) than the method we had derived from the pseudo code. Unfortunately, about 1% of size (argument entered) would be incorrectly sorted. When we ran this code on different types of GPU with a higher & lower compute capability, we had fewer and more errors respectively however the sorting was still incorrect. So overall, we decided to stick with our multiple kernel call approach as we didn’t want to have errors with the benefit of faster completion time.
 +
 +
----
 +
'''Final thoughts:'''
 +
 +
Bubble sort was not able to be speed up as fast as quick/merge/heap sort O(n * logn) but compared to serial bubble sort we ran on matrix & average pc, there was very noticeable difference. Our overall approach felt a lot like the reduction lab where we called the kernel multiple times to get the answer. We also realized that the optimal solution will be different on different hardware platforms.While doing research, we found out there are even better parallel sorting algorithms (radix, rank sort, parallel merge sort, hyper quick sort). We actually found the CUDA sample code for parallel merge sort (It was for older version of CUDA). When we profiled it and ran it, it turned out to be much slower (1 – 3 seconds) than serial merge sort for N = 500,000. However, the CUDA quick sort given was just as fast as the serial quick sort (potentially faster by few milliseconds).
 +
 +
'''Some of the things we learned'''
 +
 +
- Odd-even sort
 +
 +
- Parallel Bubble Sort & its implementation
 +
 +
- Race Conditions
 +
 +
- We CANNOT do Insertion sort in parallel (Confirmed!)
 +
 +
- Sequencing issues will be caused for coalesced hardware when acceding array elements 0 & 1 in thread 0, accessing 2 & 3 in thread 1 (Issue with shared memory & why we avoided it).
 +
 +
- Shared memory is an optimal choice where we call the kernel once and the kernel has a for loop inside of it. However, causes bank conflict (can be solved using strides?) but unfortunately gave us an incorrect result so we avoided that method.
 +
 +
- As the size of array goes up, the number of threads used goes up (size/2 thread are used). This will eventually result in the maximum number of threads being used.
 +
 +
- Blocks will run in random orders (One of the reasons why we encountered an error with the single kernel with a for loop inside odd-even approach). We were able make small numbers work & it was much faster but for big numbers approximately 1 percent of the array size would be incorrectly sorted.
 +
 +
- Algorithms with alot of branching and are VERY sequential are not suitable for the GPU.
 +
 +
- We came across some interesting sorting algorithms (Bucket, radix, rank, hyper quick and merge sort) which are better options than the parallel bubble sort.

Latest revision as of 01:57, 7 April 2019


GPU610/DPS915 | Student List | Group and Project Index | Student Resources | Glossary

N/A

Team Members

  1. Woosle Park, Data Compression
  2. Jordan Pitters, Image Processing
  3. Akshatkumar Patel, Sorting Algorithms

Email All;

Progress

Assignment 1

Application 1 - Data Compression


Description: https://www.geeksforgeeks.org/lzw-lempel-ziv-welch-compression-technique/

The algorithm used for data compression here is the Lempel–Ziv–Welch (LZW) algorithm. It is a lossless algorithm meaning no data is lost during compression for a file. This algorithm is generally used for gif or pdf files but for this example, I used a .txt file because it was easier to manipulate and scale in size. The file used for compression is a .txt version of the Holy Bible(https://raw.githubusercontent.com/mxw/grmr/master/src/finaltests/bible.txt) because the contents are large enough to see the compression time and percentage. The algorithm should read a files sequence of symbols and grouping them into strings and then converting it into bit 12 code that is then stored into a table. That table is then referred to when decompressing a file doing a reverse sequence of steps from compression.

Source Code:

Code by RobTwentyFour

//  Compile with gcc 4.7.2 or later, using the following command line:
//
//    g++ -std=c++0x lzw.c -o lzw
//
//LZW algorithm implemented using fixed 12 bit codes.

#include <iostream>
#include <sstream>
#include <fstream>

#include <bitset>
#include <string>
#include <unordered_map>

#define MAX_DEF 4096

using namespace std;

string convert_int_to_bin(int number)
{
   string result = bitset<12>(number).to_string();
   return result;
}

void compress(string input, int size, string filename) {
   unordered_map<string, int> compress_dictionary(MAX_DEF);
   //Dictionary initializing with ASCII
   for ( int unsigned i = 0 ; i < 256 ; i++ ){
   compress_dictionary[string(1,i)] = i;
   }
   string current_string;
   unsigned int code;
   unsigned int next_code = 256;
   //Output file for compressed data
   ofstream outputFile;
   outputFile.open(filename + ".lzw");

   for(char& c: input){
   current_string = current_string + c;
   if ( compress_dictionary.find(current_string) ==compress_dictionary.end() ){
           if (next_code <= MAX_DEF)
               compress_dictionary.insert(make_pair(current_string, next_code++));
           current_string.erase(current_string.size()-1);
           outputFile << convert_int_to_bin(compress_dictionary[current_string]);
           current_string = c;
       }   
   }   
   if (current_string.size())
           outputFile << convert_int_to_bin(compress_dictionary[current_string]);
   outputFile.close();
}



void decompress(string input, int size, string filename) {
   unordered_map<unsigned int, string> dictionary(MAX_DEF);
   //Dictionary initializing with ASCII
   for ( int unsigned i = 0 ; i < 256 ; i++ ){
   dictionary[i] = string(1,i);
   }
   string previous_string;
   unsigned int code;
   unsigned int next_code = 256;
   //Output file for decompressed data
   ofstream outputFile;
   outputFile.open(filename + "_uncompressed.txt");

   int i =0;
   while (i<size){
       //Extracting 12 bits and converting binary to decimal
       string subinput = input.substr(i,12);
       bitset<12> binary(subinput);
       code = binary.to_ullong();
       i+=12;

       if ( dictionary.find(code) ==dictionary.end() ) 
           dictionary.insert(make_pair(code,(previous_string + previous_string.substr(0,1))));
       outputFile<<dictionary[code];
       if ( previous_string.size())
           dictionary.insert(make_pair(next_code++,previous_string + dictionary[code][0])); 
       previous_string = dictionary[code];
       }
   outputFile.close();
}

string convert_char_to_string(const char *pCh, int arraySize){
   string str;
   if (pCh[arraySize-1] == '\0') str.append(pCh);
   else for(int i=0; i<arraySize; i++) str.append(1,pCh[i]);
   return str;
}

static void show_usage()
{
       cerr << "Usage: \n"
             << "Specify the file that needs to be compressed or decompressed\n"
             <<"lzw -c input    #compress file input\n"
             <<"lzw -d input    #decompress file input\n"
             <<"Compressed data will be found in a file with the same name but with a .lzw extension\n"
             <<"Decompressed data can be found in a file with the same name and a _uncompressed.txt extension\n"
             << endl;
}


int main (int argc, char* argv[]) {
   streampos size;
   char * memblock;

   if (argc <2)
   {
       show_usage();   
       return(1);
   }
   ifstream file (argv[2], ios::in|ios::binary|ios::ate);
   if (file.is_open())
   {
       size = file.tellg();
       memblock = new char[size];
       file.seekg (0, ios::beg);
       file.read (memblock, size);
       file.close();
       string input = convert_char_to_string(memblock,size);
       if (string( "-c" ) == argv[1] )
           compress(input,size, argv[2]);
       else if (string( "-d" ) == argv[1] )
           decompress(input,size, argv[2]);
       else
           show_usage();
   }
   else {
   cout << "Unable to open file."<<endl;
   show_usage();
   }
   return 0;
}

Flatline Profiles:

bible.txt - 4,351,186 bytes

Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ns/call  ns/call  name    
50.04      0.18     0.18  5758089    31.29    31.29  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
50.04      0.36     0.18                             compress(std::string, int, std::string)
 0.00      0.36     0.00  1402806     0.00    31.29  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
 0.00      0.36     0.00     4098     0.00     0.00  show_usage()
 0.00      0.36     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
 0.00      0.36     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
 0.00      0.36     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()


bible2.txt - 8,702,373 bytes

Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ns/call  ns/call  name    
48.39      0.44     0.44 11511109    38.26    38.26  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
46.19      0.86     0.42                             compress(std::string, int, std::string)
 5.50      0.91     0.05  2804639    17.84    56.10  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
 0.00      0.91     0.00     4098     0.00     0.00  show_usage()
 0.00      0.91     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
 0.00      0.91     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
 0.00      0.91     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()


bible3.txt - 13,053,560 bytes

Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ns/call  ns/call  name    
47.58      0.58     0.58 17264129    33.63    33.63  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
42.66      1.10     0.52                             compress(std::string, int, std::string)
 7.38      1.19     0.09  4206472    21.41    55.04  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
 1.64      1.21     0.02                             convert_char_to_string(char const*, int)
 0.82      1.22     0.01                             std::pair<std::__detail::_Node_iterator<std::pair<unsigned int const, std::string>, false, false>, bool> std::_Hashtable<unsigned int, std::pair<unsigned int const, std::string>, std::allocator<std::pair<unsigned int const, std::string> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_emplace<std::pair<unsigned int, std::string> >(std::integral_constant<bool, true>, std::pair<unsigned int, std::string>&&)
 0.00      1.22     0.00     4098     0.00     0.00  show_usage()
 0.00      1.22     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
 0.00      1.22     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
 0.00      1.22     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()


bible4.txt - 17,039,360 bytes

Each sample counts as 0.01 seconds.
 %   cumulative   self              self     total           
time   seconds   seconds    calls  ns/call  ns/call  name    
60.43      0.96     0.96 22530032    42.65    42.65  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node(unsigned long, std::string const&, unsigned long) const
32.73      1.48     0.52                             compress(std::string, int, std::string)
 6.29      1.58     0.10  5486575    18.24    60.89  std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::string const&)
 0.63      1.59     0.01                             convert_char_to_string(char const*, int)
 0.00      1.59     0.00     4098     0.00     0.00  show_usage()
 0.00      1.59     0.00     4097     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<std::string const, int>, true>*)
 0.00      1.59     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z18convert_int_to_bini
 0.00      1.59     0.00        1     0.00     0.00  std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear()


Conclusion:

This of time is spent in the compress function and the hashtable takes up most of the time because it is constantly being manipulated and read from. It looks like if the hashtable and the compress function were to be parallelized about 90% of the run time would be affected. The big-O for the application should be O(n) time so there is a linear increase in time based on file size, however the hashtable grew more in time compared to the compress function the larger the file. This application is not good for parallelization because of the dictionary hashtable. Due to the hastable or the compress dictionary needing to be accessible globally and be constantly modifiable and read this could pose issues if multiple threads were running especially since modifying and reading the table needs to be done sequentially for efficient compression. Threading the compress could lead to errors in compressions making this difficult to parallelize.




Application 2 - Image Processing


Description: The code in focus was borrowed from user "cwginac" at DreamInCode: cwginac - Image Processing Tutorial

I stumbled across this code while searching for a program written in C++ that holds the purpose of processing images, and can be deployed on Linux without too many issues with libraries. I began my research looking for open source image processing programs, which lead me to a number of libraries and sources, (including CLMG). However, those sources were mainly in JAVA. Concerned that I didn't truly understand the process, I then redefined my focus to understanding image processing. I thus searched for "image processing tutorials in C++". The DreamInCode website was one that was listed on the front page. The code uses standard libraries to handle images in the PGM format. It is a fairly straight forward program that intakes a number of images as command-line arguments, with the first being the image to edit, and provides the user some options to process the image, outputting the result to one of the provided image paths. The one thing I noticed the program lacks is a method for converting images to PGM, considering the program requires it. Therefore, for my testing of the program, I took a JPEG image and converted it to PGM using an online method found here: Dan's Tools - Convert Files. Providing just the one image as a command-line argument only yielded 4 options that include getting/setting the values of pixels and getting other information. Looking through the code I knew there was more to it, but the method to process an image using the program required 2 arguments: first is the original image, second is the output image. With these provided, the program allows the user to rotate, invert/ reflect, enlarge, shrink, crop, translate, and negate.


The code is found on the site, near the end of the article. To run it, I made a Makefile. The code downloaded and borrowed from the site are stored as text files, so I renamed them as .cpp and .h files within the Linux environment. Here are the files for ease of access: File:Main.cpp.txt | File:Image.h.txt | File:Image.cpp.txt Also, here is the Makefile:

#Makefile for A1 - Image Processing
#
GCC_VERSION = 8.2.0
PREFIX = /usr/local/gcc/${GCC_VERSION}/bin/
CC = ${PREFIX}gcc
CPP = ${PREFIX}g++

main: main.o
	$(CPP) -pg -omain main.o
main.o: main.cpp
	$(CPP) -c -O2 -g -pg -std=c++17 main.cpp image.cpp
clean:
	rm *.o

From my test, I used a 768 KB image borrowed from the web, enlarged it a couple times, shrank it, rotated it, and negated it. The result was an 18.7 MB image. The time it took to run the program was:

real    1m33.427s
user    0m0.431s
sys     0m0.493s

The generated FLAT profile of the program revealed:

 %   cumulative   self              self     total
time   seconds   seconds    calls  ms/call  ms/call  name
34.46      0.21     0.21        7    30.03    30.03  Image::operator=(Image const&)
26.26      0.37     0.16        6    26.69    26.69  Image::Image(int, int, int)
13.13      0.45     0.08                             Image::rotateImage(int, Image&)
11.49      0.52     0.07                             writeImage(char*, Image&)
 9.85      0.58     0.06                             Image::enlargeImage(int, Image&)
 1.64      0.59     0.01                             readImage(char*, Image&)
 1.64      0.60     0.01                             Image::negateImage(Image&)
 1.64      0.61     0.01                             Image::shrinkImage(int, Image&)
 0.00      0.61     0.00        7     0.00     0.00  Image::~Image()
 0.00      0.61     0.00        1     0.00     0.00  _GLOBAL__sub_I__ZN5ImageC2Ev
 0.00      0.61     0.00        1     0.00     0.00  Image::Image(Image const&)


A second run of the program, with me enlarging the image, rotating, translating, and negating the image resulted in a time of:

real    1m0.968s
user    0m0.295s
sys     0m0.297s

And a Flat profile of:

 %   cumulative   self              self     total
time   seconds   seconds    calls  ms/call  ms/call  name
33.37      0.14     0.14        6    23.36    23.36  Image::operator=(Image const&)
26.22      0.25     0.11        5    22.02    22.02  Image::Image(int, int, int)
14.30      0.31     0.06                             writeImage(char*, Image&)
14.30      0.37     0.06                             Image::rotateImage(int, Image&)
 7.15      0.40     0.03                             Image::enlargeImage(int, Image&)
 2.38      0.41     0.01                             Image::reflectImage(bool, Image&)
 2.38      0.42     0.01                             Image::translateImage(int, Image&)
 0.00      0.42     0.00        6     0.00     0.00  Image::~Image()
 0.00      0.42     0.00        1     0.00     0.00  _GLOBAL__sub_I__ZN5ImageC2Ev
 0.00      0.42     0.00        1     0.00     0.00  Image::Image(Image const&)


Conclusion:

Ignoring the real time spent, the majority of time is spent in the equals operator function and the class constructor, most likely because the image is constantly being manipulated, read from, and being copied to and from temporary storage for ease of use and object safety. Other than the basic functions (like read/write), it looks like the rotate and enlarge functions take a larger amount of time, which could mean that, if they were to be parallelized, it could positively affect the run time. My discernment of the big-O notation for the rotate function is O(n^2) which shows a quadratic growth rate, whereas the enlarge function had a notation of O(n^3) or greater. The reason for the rotate function having a longer run-time could be due to the fact that I enlarged the image before rotating it, but the notations don't lie. Personally, I'd say that this application is not the best for parallelization because of its simplicity in handling the images, but I can definitely see how one or more of the functions in the program can be parallelized. Some of the issues posed in making the program parallel is centered upon the image needing to be accessible to every other function, and, considering that the image is being processed, it would be constantly modified and read from. I simple terms, I think that, if multiple threads were running to quicken the program, the computation of the image could lead to errors in processing resulting in a corrupted image, distortions, and things of the sort. I may be wrong in this thought, but, to my knowledge, not being to avoid such issues makes this program somewhat difficult to safely parallelize.



Application 3 - Sorting Algorithms

Description:

I decided to select an option from the suggested projects – Sorting Algorithms. The sorting algorithms I included were: Bubble, Insertion, Selection, Heap, Merge, and Quicksort. I decided to create an application that uses all of the sorting algorithms and calls their respective functions instead of creating individual modules and profiling them, simply because the 99% percent of the total running time would be taken up by the sorting algorithm function. So for a better understanding and comparison of the time taken by each sorting algorithm, I decided to create a single module with functions that perform the sorting algorithms. I allocated memory for all of the arrays and populated a single array of size n with randomly generated data. I then copied the memory from the populated array to all of the other arrays to ensure consistent data throughout all of the sorting algorithms. I then passed each array to its respective sorting function which returned the sorted array using pass-by-reference. One of the things to keep in mind is that when n increases (the size of the array being sorted), the time increases. I have included 3 different profiles where n (the size of the array) equals 50000, 100000 and lastly 50000.


Source Code:

 File:SortingAlgorithms.cpp.txt
 The link to the source code AND the make file can also be found on GitHub at: https://github.com/Akshat55/Sorting-Algorithms-Comparison


N = 50,000

 Total Time (seconds) = 7.98 seconds

Flat profile:

Each sample counts as 0.01 seconds.

 %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
63.21      5.04     5.04                             bubbleSort(int*, int)
29.91      7.42     2.38                             selectionSort(int*, int)
 6.79      7.96     0.54                             insertionSort(int*, int)
 0.13      7.97     0.01    49999     0.00     0.00  merge(int*, std::vector<int, std::allocator<int> >&, int, int, int)
 0.13      7.98     0.01                             quickSort(int*, int, int)
 0.00      7.98     0.00        8     0.00     1.25  mergeIndexSort(int*, std::vector<int, std::allocator<int> >&, int, int)
 0.00      7.98     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z9fillArrayPii


N = 100,000

 Total Time (seconds) = 31.42 seconds 

Flat profile:

Each sample counts as 0.01 seconds.

 %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
63.91     20.05    20.05                             bubbleSort(int*, int)
29.15     29.19     9.14                             selectionSort(int*, int)
 6.96     31.38     2.18                             insertionSort(int*, int)
 0.06     31.40     0.02                             heapSort(int*, int)
 0.03     31.41     0.01    99999     0.00     0.00  merge(int*, std::vector<int, std::allocator<int> >&, int, int, int)
 0.03     31.42     0.01                             quickSort(int*, int, int)
 0.00     31.42     0.00        8     0.00     1.25  mergeIndexSort(int*, std::vector<int, std::allocator<int> >&, int, int)
 0.00     31.42     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z9fillArrayPii


N = 500,000

 Total Time (minutes) = 13.47 minutes 

Flat profile:

Each sample counts as 0.01 seconds.

 %   cumulative   self              self     total           
time   seconds   seconds    calls  ms/call  ms/call  name    
62.36    503.32   503.32                             bubbleSort(int*, int)
30.67    750.86   247.54                             selectionSort(int*, int)
 7.08    807.99    57.14                             insertionSort(int*, int)
 0.02    808.12     0.13                             heapSort(int*, int)
 0.01    808.20     0.08   499999     0.00     0.00  merge(int*, std::vector<int, std::allocator<int> >&, int, int, int)
 0.01    808.26     0.06                             quickSort(int*, int, int)
 0.00    808.26     0.00        8     0.00    10.01  mergeIndexSort(int*, std::vector<int, std::allocator<int> >&, int, int)
 0.00    808.26     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z9fillArrayPii

Conclusion:

Based on the results from profiling 3 different sizes of arrays, we can assume that majority of the time taken to sort the arrays is taken by the O(n^2) algorithms (Bubble, Insertion, and Selection). However, the other sorting algorithms (Heap, Merge, Quick) that are O(n log(n)) are extremely fast even when the size of the arrays are large. As observed from the profile, the elapsed time increased as the size of the array increased. I went from approximately 8 seconds to execute the entire program to 13 minutes to execute.


Final Selection: Sorting Algorithms

Based on the profiled applications above, we think that the sorting algorithms would benefit a lot from offloading to the GPU. Sorting Algorithms are commonly used in programming and can have a strong impact on the programs speed and efficiency. Since they are so commonly used, we think it would be quite interesting to see if we can speed up the O(n^2) sorting algorithms to potentially match the sorting speed of the O(n log n) algorithms since there won’t much change for the parallelized version of them, as they are already fast. Bubble Sort is an elementary sort, it is the most basic sorting method out there however it is the worst sorting algorithm. By Speeding it up using the GPU, we plan to see how much we can improve this sorting algorithm. Also, we would like to look at Insertion Sort, mainly because we know that it is not possible to do on the GPU. We want to take this opportunity to be innovative and attempt or to the very least find a way to solve this issue.

Summary of selection:

Bubble & Insertion Sort

Assignment 2

Akshatkumar Patel, Jordan Pitters, Woosle Park


Sorting algorithm parallelization completion:


We found a pseudo code for this algorithm which we got from the following link. The link describes bubble sort & how to parallelize it.

Serial Bubble Sort Pseudo

   for i ← 1 to length [A] do
       for j ← length [A] downto i +1 do
           If A[A] < A[j-1] then
               Exchange A[j] ↔ A[j-1]


Parallel Bubble Sort Pseudo

   For k = 0 to n-2
   If k is even then
       for i = 0 to (n/2)-1 do in parallel
           If A[2i] > A[2i+1] then
               Exchange A[2i] ↔ A[2i+1]
   Else
       for i = 0 to (n/2)-2 do in parallel
           If A[2i+1] > A[2i+2] then
               Exchange A[2i+1] ↔ A[2i+2]
   Next k


After doing some further research, we found out this presumed bubble sort is a variation of a bubble sort. It is known as an odd-even bubble sort. A bubble sort works by repeatedly iterating through the list and comparing adjacent pairs and swapping them if they are in the wrong order. This is an elementary sorting algorithm; it is the basic learning block to all sorting algorithms. It is quite easy to implement but it is extremely slow in comparison to its counterparts (quick, merge, heap, & insertion).


Time Complexity

Bubble sort has a worse case and average complexity of O(n^2), where n is the number of items being sorted. Surprisingly even all the other O(n^2) algorithms such as insertion and selection sort run faster than bubble (As shown in the profile in A1). We know from experience that this algorithm is inefficient and should be avoided in case of large collections (Collection size depends on system specs, but we will just assume 20K is the average).


Time Complexity of Parallel Bubble Sort

Based on the given pseudo code above, we iterate through the list (n) number of times and call the odd or even kernel based on the index (i). We then iterate through the list n/2 number of times making it O(n^2)/n.


Our Approach

We iterate through the list (n-1) number of times calling the kernel. We make use of the power of the GPU which is doing the same thing concurrently. In this case, we will be launching 512 threads of n/1024 blocks (We found this to be the most efficient block & thread counts). When we launch a kernel, the kernel will concurrently compare 2 values. The only difference to our approach and the pseudo approach is that we will need to pass in a flag (index % 2) to let our kernel know whether it is even or odd. Since we sped up our iteration to be iterating only once (n number of times), we can assume that the parallel time complexity is O(n) and needs O(n) processors.


   __global__ void oddeven_bubble(int *a, int flag, int size)
   {
   
     int index = blockIdx.x * blockDim.x + threadIdx.x;
     int temp;
   
     if ((index >= size / 2 - 1) && flag % 2 != 0)
       return;
   
     if (flag % 2 == 0)
     {
       if (a[index * 2] > a[index * 2 + 1])
       {
         temp = a[index * 2];
         a[index * 2] = a[index * 2 + 1];
         a[index * 2 + 1] = temp;
       }
     }
     else
     {
       if (a[index * 2 + 1] > a[index * 2 + 2])
       {
         temp = a[index * 2 + 1];
         a[index * 2 + 1] = a[index * 2 + 2];
         a[index * 2 + 2] = temp;
       }
     }
   }


This Algorithm were created & profiled on Tesla K80 GPU (Cloud – Google Collab). In order to compile them the follow was typed:

   !nvcc bubble.cu -o bubble

Parallel Odd-Even Sort Comparison to Bubble & Overview

Our Odd-even sort is uses N independent threads to do sort the number of elements in the list. The number of blocks used is (size+1024)/1024 * 512 threads to sort the elements in the list. Hence, we can assume N = half of size. The odd/even sorts by selecting every even array index & comparing with the adjacent and odd array index as observed an be observed in the image below. If the numbers are in the wrong order, a swap takes place. This process is repeated size-1 number of times.

Observation

Offloading the computing intensive sorting algorithm to the GPU resulted in DRASTIC increase in speed. We were able to take an algorithm that took about 13 minutes to sort a list of 500k randomly generated values and solve it within 15 seconds.

Profile for our bubble sort

Command used to generate profile:

!nvprof ./bubble 500000

Profile generated for N = 500000

==437== NVPROF is profiling process 437, command: ./bubble 500000 
==437== Profiling application: ./bubble 500000
==437== Profiling result:
           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
GPU activities:  100.00%  15.6692s    499999  31.338us  23.360us  47.807us  oddeven_bubble(int*, int, int)
                   0.00%  324.09us         1  324.09us  324.09us  324.09us  [CUDA memcpy HtoD]
                   0.00%  258.27us         1  258.27us  258.27us  258.27us  [CUDA memcpy DtoH]
     API calls:   98.63%  17.9870s    499999  35.974us  5.6240us  6.0456ms  cudaLaunchKernel
                   0.85%  155.57ms         1  155.57ms  155.57ms  155.57ms  cudaMalloc
                   0.33%  60.878ms         1  60.878ms  60.878ms  60.878ms  cudaDeviceReset
                   0.17%  31.076ms         1  31.076ms  31.076ms  31.076ms  cudaDeviceSynchronize
                   0.00%  835.79us         2  417.90us  407.55us  428.25us  cudaMemcpy
                   0.00%  328.55us        96  3.4220us     156ns  146.17us  cuDeviceGetAttribute
                   0.00%  264.00us         1  264.00us  264.00us  264.00us  cudaFree
                   0.00%  187.13us         1  187.13us  187.13us  187.13us  cuDeviceTotalMem
                   0.00%  26.678us         1  26.678us  26.678us  26.678us  cuDeviceGetName
                   0.00%  2.9030us         1  2.9030us  2.9030us  2.9030us  cuDeviceGetPCIBusId
                   0.00%  2.1570us         3     719ns     210ns  1.2000us  cuDeviceGetCount
                   0.00%  1.3290us         2     664ns     305ns  1.0240us  cuDeviceGet
                   0.00%     304ns         1     304ns     304ns     304ns  cuDeviceGetUuid



Insertion Sort

We stated in A1 that we wanted to attempt Insertion sort in order to learn & be innovate.

Kernel Code:

__global__ void Sort(int* arr, int size) {
	for (int i = threadIdx.x; i < size; i++) {
		int curr = arr[i];
		int j = i;
		for (j = i; j > 0 && arr[j - 1] > curr; j--) {
			arr[j] = arr[j - 1];
		}
		arr[j] = curr;
	}
}


Our Scenario We actually thought we had it working after multiple iterations and adding __syncthreads() everywhere. Suddenly for array size 100000 it worked. We tested it for 200000 to 500000, they all worked extremely fast (about 2x faster than CPU). We were quite surprised with our findings, but something felt wrong. We tested the kernel with size 512 - 25000 and they were all incorrect. We were a bit curious to know what was happening, why there were inconsistencies and lastly how our tester function was returning 0 errors for big arrays and errors for small arrays. This required us to come up with a solution to test our tester & the # of same values generated to the # of values in the so called sorted array.


Here is how we found out that the values were getting overridden:

We created two separate arrays of size n (value passed through the argument), first array held the random values and the second was initialized to 0.


Our goal was to find out why it worked for some values and it didn’t work for some. Generally, the lower values did not work ( < 50000) and the big values worked according to our original tester. This made us curious to know why our kernel worked for big values but not small values.


In order to do so, we created a hash table sort of array structure. Since the insertionArray (array that held random values) had only values 0 < n < argument (Ex.500000), we simply iterated through the list at the start before passing the array on to the kernel. In this process, we iterated through the randomly numbers and added incremented the count array for each index we found (countArray[insertionArray[i]]++).


Then we simply performed the same process on the “sorted” array to find out that 50% of the values would be missing. We tried multiple methods for the insertion sort, even syncing the threads after every step taking. Unfortunately, this increased the time drastically and still did not work.


The logic of insertion sort cannot be applied to the kernel because the threads will be competing against one another instead of working together. Ex. All the threads would try to replace arr[0] all at once. This is done multiple times throughout the array and this is the reason why we lost 50% of the data. The data lost in a way correlates to the number of threads launched ((size + 512)/512) * 512), when we launch 1 block of 1 thread, it works but that is because we do not utilize the power of the GPU to compute multiple things at once.


CUDA is designed to allow blocks to run in any order. There is also no cross-block synchronization within a single run making insertion sort impossible to work along with a lot of other sorting algorithms. In fact, this is the problem with many of the sorting algorithms we have looked at.


Final files & Comparitive Graph

Graph Comparison: - Please give it a few seconds to load, we noticed it takes longer than expected to load the graph

https://docs.google.com/spreadsheets/d/176TTtES25qwWpm18aDRkPYDCF2EMKmsIGz3k0iuVafk/edit?usp=sharing

Final files:

Bubble.cu: File:Bubble.cu.txt

Insertion.cu (With method to test): File:Insertion.cu.txt

Assignment 3

Bubble sort - Optimization


Thoughts about using Shared Memory:

After doing some research, we believe that we cannot use __shared__ memory for this. The reason for this is because the array size can be anything (500 thousand, 1 million, 2 million, etc.). There is very limited amount of __shared__ memory depending on the GPU. Even if we used __shared__ memory, there is always the issue that only the values within each block would get sorted. We would also need to deal with bank conflicts (Thread Id would access memory of id id & id+1 and write to memory at id).


Our attempting at reducing the kernel counts:

For this we removed the for loop that called the kernel n-1 times, and added a for loop inside the kernel surrounding the condition statement (if-else). This was quite interesting in the sense that it was much faster (about 30 – 50%) than the method we had derived from the pseudo code. Unfortunately, about 1% of size (argument entered) would be incorrectly sorted. When we ran this code on different types of GPU with a higher & lower compute capability, we had fewer and more errors respectively however the sorting was still incorrect. So overall, we decided to stick with our multiple kernel call approach as we didn’t want to have errors with the benefit of faster completion time.


Final thoughts:

Bubble sort was not able to be speed up as fast as quick/merge/heap sort O(n * logn) but compared to serial bubble sort we ran on matrix & average pc, there was very noticeable difference. Our overall approach felt a lot like the reduction lab where we called the kernel multiple times to get the answer. We also realized that the optimal solution will be different on different hardware platforms.While doing research, we found out there are even better parallel sorting algorithms (radix, rank sort, parallel merge sort, hyper quick sort). We actually found the CUDA sample code for parallel merge sort (It was for older version of CUDA). When we profiled it and ran it, it turned out to be much slower (1 – 3 seconds) than serial merge sort for N = 500,000. However, the CUDA quick sort given was just as fast as the serial quick sort (potentially faster by few milliseconds).

Some of the things we learned

- Odd-even sort

- Parallel Bubble Sort & its implementation

- Race Conditions

- We CANNOT do Insertion sort in parallel (Confirmed!)

- Sequencing issues will be caused for coalesced hardware when acceding array elements 0 & 1 in thread 0, accessing 2 & 3 in thread 1 (Issue with shared memory & why we avoided it).

- Shared memory is an optimal choice where we call the kernel once and the kernel has a for loop inside of it. However, causes bank conflict (can be solved using strides?) but unfortunately gave us an incorrect result so we avoided that method.

- As the size of array goes up, the number of threads used goes up (size/2 thread are used). This will eventually result in the maximum number of threads being used.

- Blocks will run in random orders (One of the reasons why we encountered an error with the single kernel with a for loop inside odd-even approach). We were able make small numbers work & it was much faster but for big numbers approximately 1 percent of the array size would be incorrectly sorted.

- Algorithms with alot of branching and are VERY sequential are not suitable for the GPU.

- We came across some interesting sorting algorithms (Bucket, radix, rank, hyper quick and merge sort) which are better options than the parallel bubble sort.