GPU610/gpuchill
=== Assignment 1 ===
As you can see, 80% of the total time was spent in the mergeSort and sort functions. <br />
Applying Amdahl's law with P = 0.80 on n = 8 processors, Sn = 1 / ( (1 - 0.80) + 0.80/8 ) gives a maximum expected speedup of about 3.3x.
 
==== Calculation of Pi ====
The program I decided to assess and profile calculates the value of π using the Monte Carlo approximation method. The method inscribes a circle of area πr² inside a square of area 4r², with r = 1.0, and generates random points inside the square (both x and y between -1 and 1), keeping track of how many points land inside the circle. Since the ratio of points inside the circle to total points approaches π/4, multiplying that ratio by 4 approximates π, and the more points generated, the more accurate the final calculation will be. The number of points needed for, say, billionth-place precision can easily reach the hundreds of billions, each point requiring the same mathematical computation, which makes the program a fantastic candidate to parallelize.
 
 
[[File:Pi_calc.png]]
<br/>
Figure 1: Graphical representation of the Monte Carlo method of approximating PI
 
----
 
{| class="wikitable mw-collapsible mw-collapsed"
! pi.cpp
|-
|
<source>
/*
Author: Daniel Serpa
Pseudo code: Blaise Barney (https://computing.llnl.gov/tutorials/parallel_comp/#ExamplesPI)
*/
 
#include <iostream>
#include <cstdlib>
#include <math.h>
#include <iomanip>
 
double calcPI(double);
 
int main(int argc, char ** argv) {
    if (argc != 2) {
        std::cout << "Invalid number of arguments" << std::endl;
        return 1;
    }
    std::srand(852);
    double npoints = atof(argv[1]);
    std::cout << "Number of points: " << npoints << std::endl;
    double PI = calcPI(npoints);
    std::cout << std::setprecision(10) << PI << std::endl;
    return 0;
}
 
double calcPI(double npoints) {
    double circle_count = 0.0;
    // long long avoids overflowing the counter for point counts beyond INT_MAX
    for (long long j = 0; j < (long long)npoints; j++) {
        // random point in the square [-1, 1] x [-1, 1]
        double x_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;
        double y_coor = 2.0 * ((double)std::rand() / RAND_MAX) - 1.0;
        // count the point if it falls inside the unit circle
        if (sqrt(pow(x_coor, 2) + pow(y_coor, 2)) < 1.0) circle_count += 1.0;
    }
    return 4.0 * circle_count / npoints;
}
</source>
|}
Figure 2: Serial C++ program used for profiling of the Monte Carlo method of approximating PI
 
===== Compilation =====
The program is compiled using the command: <source>g++ -O2 -g -pg -o app pi.cpp</source>
 
===== Running =====
We will profile the program using 2 billion points:
<source>
> time app 2000000000
Number of points: 2e+09
3.14157
 
real 1m0.072s
user 0m59.268s
sys 0m0.018s
</source>
 
===== Profiling =====
Flat:
<source>
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
100.80 34.61 34.61 calcPI(double)
0.00 34.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main
</source>
Call:
<source>
granularity: each sample hit covers 2 byte(s) for 0.03% of 34.61 seconds
 
index % time self children called name
<spontaneous>
[1] 100.0 34.61 0.00 calcPI(double) [1]
-----------------------------------------------
0.00 0.00 1/1 __libc_csu_init [16]
[9] 0.0 0.00 0.00 1 _GLOBAL__sub_I_main [9]
-----------------------------------------------
 
Index by function name
 
[9] _GLOBAL__sub_I_main (pi.cpp) [1] calcPI(double)
</source>
 
===== Results =====
Accurate results require many billions, perhaps even trillions, of points, yet using just 2 billion points already causes the program to take over 30 seconds to run. The most intensive part of the program is the loop, which iterates 2 billion times in the profiled run and can be parallelized in its entirety. The profiling shows that 100% of the execution time is spent in the loop; since a perfectly parallel program is not possible, we will use 99.999%. Using a GTX 1080 as an example GPU, which has 20 SMs with 2048 resident threads each (40,960 threads total), Amdahl's law predicts a maximum speedup of about 29,058x.
=== Assignment 2 ===
=== Assignment 3 ===