1
edit
Changes
→Assignment 1
My application that I selected is a PI approximation function. PI can be approximated in a number of ways however I chose to use the dartboard algorithm. Although not the fastest, the algorithm is very feasible to parallelize. The main idea behind it can be compared to a dartboard – you throw a random number (n) of darts at the board and note down the darts that have landed within it and those that have not. The image below demonstrates this concept:
[[File:dart.gif]]
During the execution, the program takes the number of iterations through the command line. It generates two random decimal numbers between 0 and 1 and determines if the randomly generated coordinates are inside in the circle. Then it calculates the size of PI. As we increase the number of iteration we are getting a more realistic value of PI.
[[File:output.jpg]]
I created a screenshot from the first execution, the rest of the execution is summarized on the table and the chart below.
[[File:table.jpg]]
[[File:Pichart.jpg]]
As we can see the hotspot of the program is the pi () function, which takes 100% of the execution time. This function is containing a single for loop, which is calculating the size of the PI. This can be executed independently, because it has no data dependency, therefor it is possible to parallelize with CUDA to speed up the processing time.
'''Problem Description'''
In the second assignment our team had two options to select from. These options were either a prime number calculator or a calculator of Pi. After some discussion, we selected to work with the PI calculation problem. We took Norbert’s original CPU program and we ported it to the GPU which sped up the processing rate, resulting in a faster processing time overall. In this assignment we experimented more with our CUDA solution and utilizing two different forms of optimization, we managed to get an even faster computation speed of the overall program.
'''Optimization 1'''
One of the main problems of our previous solution was that it was taking a much longer execution time than we were expecting. During the analysis of the program, we realized that we were wasting precious computation time in generating random numbers on the host and copying them over to the device. For this reason we decided to generate the random numbers directly on the device using cuRAND. This allowed us to reduce the run time to a quarter of what it took to run before.
The following code sample demonstrates this optimization:
[[File:A3p1.PNG]]
'''Optimization 2'''
The second optimization we made was that we implemented a reduction algorithm to help with a couple of things. Firstly it lessened the amount of data being copied from the GPU from a potential 60 million integers to just 65 thousand integers which ends up being ~923 times less items to copy. Secondly and most importantly, it did most of the additions and made it much faster (less iterations) to add up the remaining partial sums. We also implemented the thread divergence method in the reduction algorithm for even better results. This method halved the time required after our initial optimization, effectively reducing the total run time by an eighth of the original time.
The following code sample demonstrate this optimization:
[[File:A3P2.PNG]]
'''Program execution '''
[[File:A3P3.PNG ]]
[[File:A3P4.PNG |1100px]]
'''Conclusion'''