Happy Valley

Assignment 3
=== Parallelized ===
 
The image below demonstrates how bitonic sort computes the sorted values. The algorithm works as follows: take two sorted sequences,
reverse one and append it to the other to create a bitonic sequence, then split the sequence into a smaller half and a larger half and sort each half (repeating the process).
 
The image below shows that each time two values are on either side of a shared vertical line (red), they are swapped if they are in the wrong order. These swap operations are '''independent''' of one another and are therefore well suited for parallel execution. The two outer loops cannot be parallelized, since each pass of the innermost loop depends on the values swapped in previous passes. However, we can parallelize the innermost loop, which runs from 0 to N, where N is the total number of elements in the input array.
 
<pre>
for (k = 2; k <= N; k = 2 * k)           // Cannot be parallel!
{
    for (j = k >> 1; j > 0; j = j >> 1)  // Cannot be parallel!
    {
        for (i = 0; i < N; i++) {}       // Can be parallel!
    }
}
</pre>
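To make the loop structure concrete, the skeleton above can be filled in with the standard compare-and-swap body to obtain a complete serial bitonic sort. This is a minimal sketch (the function name <code>bitonicSort</code> is ours, not taken from the assignment's source code); it assumes the input size is a power of two, as bitonic sort requires.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Serial bitonic sort (ascending) for arrays whose size N is a power of two.
// The loop structure matches the skeleton above; the innermost loop body is
// the compare-and-swap performed at each red line in the image.
void bitonicSort(std::vector<int>& a) {
    const std::size_t N = a.size();
    for (std::size_t k = 2; k <= N; k *= 2) {          // size of the sequences being merged
        for (std::size_t j = k / 2; j > 0; j /= 2) {   // distance between compared elements
            for (std::size_t i = 0; i < N; ++i) {      // the loop mapped to CUDA threads
                std::size_t ixj = i ^ j;               // partner index for element i
                if (ixj > i) {
                    // (i & k) selects the sorting direction of the current subsequence
                    bool ascending = (i & k) == 0;
                    if (ascending ? (a[i] > a[ixj]) : (a[i] < a[ixj]))
                        std::swap(a[i], a[ixj]);
                }
            }
        }
    }
}
```

Only the body of the innermost loop moves to the GPU; the two outer loops must remain sequential, as discussed above.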
 
 
[[File:Bitonic1.png|800px]]
==== Kernel ====
 
We can take the code executed in the innermost loop and move it into a CUDA kernel. The kernel is launched with 'n' threads, where 'n' is the total number of elements to be sorted. We pass the data allocated in device memory as well as the 'j' and 'k' indices, which indicate the current position in the sorting network.
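A kernel along these lines would implement the step just described. This is a sketch with our own names (<code>bitonicStep</code>, <code>dev_values</code>), not the assignment's actual source code, and it assumes the grid is launched with exactly N threads, one per array element, once for every (j, k) pair produced by the two outer loops on the host.

<pre>
// One (j, k) pass of the sorting network; the host launches this kernel
// once per pass, with one thread per array element.
__global__ void bitonicStep(int* dev_values, int j, int k) {
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int ixj = i ^ j;              // partner index in this pass
    if (ixj > i) {
        bool ascending = (i & k) == 0;     // direction of this subsequence
        if (ascending ? (dev_values[i] > dev_values[ixj])
                      : (dev_values[i] < dev_values[ixj])) {
            int tmp = dev_values[i];
            dev_values[i] = dev_values[ixj];
            dev_values[ixj] = tmp;
        }
    }
}
</pre>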
'''Source Code'''
=== Comparison ===
The table and graph below compare the parallel and serial versions: they include CPU execution time, GPU execution time with all memory operations, and kernel execution time (in seconds). Note that n is the exponent, since bitonic sort requires the input array size to be a power of two (2^n). The CPU runs faster when the size is below 2^18; at 2^16 the CPU runs about 72% faster than the GPU. Beyond that point, the CUDA code starts to outperform the serial code, and the difference in execution time becomes dramatic once 'n' (the exponent) reaches 26.
 
 
'''Data Table'''
==== Switching to shared memory ====
The NVIDIA Visual Profiler suggested a few ideas for optimization:

* Concurrent Kernel Execution
* Low Memcpy/Compute Overlap

Concurrent kernel execution lets CUDA programmers launch several kernels asynchronously by using streams. Unfortunately, it is not applicable to the bitonic sort algorithm, for the same reason we cannot parallelize the two outer loops: if we launch kernels in parallel, they will start competing for the data values, and we will end up with race conditions. Low memcpy/compute overlap is related to concurrent kernel execution: in theory, you can pass chunks of the input array asynchronously to each kernel. However, it seems hard to partition the input data in any meaningful way.

==== Switching to CudaMallocHost ====

There is a slight performance increase when switching to cudaMallocHost.

''' Source Code '''

''' The data table '''

[[File:HVMallocHosttable.png|800px]]

''' The diagram '''

[[File:HVMallocHost.png|800px]]

==== Switching from x86 to x64 ====

The performance increased dramatically when switching the compiler mode from x86 to x64.

''' The data table '''

[[File:HVx86vsx64table.png|800px]]

''' The diagram '''

[[File:HVx86vsx64.png|800px]]

Although the result of the program did not change, the console showed this warning message:

<pre>
==4500== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
</pre>

=== Reference ===

''' Links '''

<pre>http://parallelcomp.uw.hu/ch09lev1sec2.html</pre>
<pre>https://www.geeksforgeeks.org/bitonic-sort/</pre>
<pre>https://en.wikipedia.org/wiki/Selection_sort</pre>