=== Assignment 2 ===
'''Problem Description'''
In Assignment 1, Wes worked on a Fibonacci number calculator and Norbert worked on a dartboard algorithm for approximating PI. For this assignment our team selected the PI calculation problem and converted our basic CPU program into a parallel program that speeds up the algorithm using the GPU. As Norbert concluded in the previous assignment, PI can be approximated in a number of ways. The dartboard algorithm is not the fastest of them, but it is very feasible to parallelize. The main idea can be compared to a dartboard: you throw a number (n) of darts at random positions on the board and note down which darts landed within the circle and which did not. The hot spot of this algorithm is a single for loop that accumulates the PI estimate. Its iterations can be executed independently because they have no data dependencies, so the loop can be parallelized with CUDA to speed up the processing time. Our strategy is therefore to break the “for loop” up into multiple portions that can be executed as separate tasks.
The image below demonstrates the concept of using random points within a square to calculate PI:
[[Image:filename|thumb|widthpx| ]]
The number of tasks equals the number of darts thrown. Each task executes on the GPU and performs the calculation that verifies whether its point lies inside the circle. The tasks can do this work independently because they require no information from one another. Finally, the host gathers the synchronized data from the device and calculates the PI estimate.
'''Code analysis'''
This loop is the hot spot of the previous program.
[[Image:filename|thumb|widthpx| ]]
During this assignment we restructured the program to make it more feasible to parallelize: we replaced the “for loop” from the previous program with a kernel that executes the tasks on the device.
[[Image:filename|thumb|widthpx| ]]
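In case the screenshot above does not render, a kernel of this kind could look like the sketch below. The names (<code>countHits</code> and its parameters) are our own illustration, assuming the random points are generated on the host and copied to the device as two arrays:

```cuda
// Sketch only: each thread tests one pre-generated point and records a
// hit flag; the host then sums the flags and computes pi = 4 * hits / n.
__global__ void countHits(const float* x, const float* y, int* hit, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per dart
    if (i < n)
        hit[i] = (x[i] * x[i] + y[i] * y[i] <= 1.0f) ? 1 : 0;
}
```

A launch such as <code>countHits<<<(n + 255) / 256, 256>>>(dx, dy, dhit, n);</code> would cover all n darts with 256-thread blocks; the bounds check guards the last partial block.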
'''Program execution'''
The following table and chart compare the CPU runtime vs the GPU runtime.
[[Image:filename|thumb|widthpx| ]]
[[Image:filename|thumb|widthpx| ]]
'''Conclusion'''
The run time of the GPU is much faster than that of the CPU. The cost, however, comes from transferring and initializing the data. The times shown above reflect the program as a whole; the actual calculation times were a fraction of this. An interesting thing to note is that at 1 million items, the CPU code ran faster than the GPU code. This is due to the time required to initialize the GPU’s variables and copy the generated points from the host to the GPU. Beyond that size, the GPU became more efficient because it could calculate the results much faster. The next step for this program will be to allocate the blocks more efficiently and to generate the random numbers on the GPU, reducing the data transferred to only one array in each direction.
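The on-device random number generation mentioned as a next step could be sketched with cuRAND's device API. This is a hypothetical sketch, not the team's implementation; the kernel name and seeding scheme are our assumptions:

```cuda
#include <curand_kernel.h>

// Sketch: each thread generates its own dart with cuRAND, so the host
// never copies point arrays in -- only the hit flags travel back out.
__global__ void countHitsOnDevice(int* hit, int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState state;
    curand_init(seed, i, 0, &state);        // independent sequence per thread
    float x = curand_uniform(&state);
    float y = curand_uniform(&state);
    hit[i] = (x * x + y * y <= 1.0f) ? 1 : 0;
}
```

With this approach the host-to-device traffic drops to nothing but the kernel launch, leaving a single result array to copy back, as the conclusion above anticipates.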
'''References:'''
http://http.developer.nvidia.com/GPUGems3/gpugems3_ch37.html
https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI
=== Assignment 3 ===