BetaT
Assignment 1
Profile Assessment
Application 1: Navier–Stokes flow velocity
This application calculates the Navier–Stokes flow velocity.
The Navier–Stokes equations describe flow velocity.
Navier–Stokes equations are useful because they describe the physics of many phenomena of scientific and engineering interest. They may be used to model the weather, ocean currents, water flow in a pipe and air flow around a wing. The Navier–Stokes equations in their full and simplified forms help with the design of aircraft and cars, the study of blood flow, the design of power stations, the analysis of pollution, and many other things. Coupled with Maxwell's equations they can be used to model and study magnetohydrodynamics. (Courtesy of Wikipedia: https://en.wikipedia.org/wiki/Navier%E2%80%93Stokes_equations)
problem
The problem with this application lies in the main function, which calculates the finite difference.
The user inputs two values, which are used as the bounds of the loops.

    // Finite-difference loop:
    for (int it = 1; it <= nt - 1; it++) {
        for (int k = 0; k <= nx - 1; k++) {
            un[k][it-1] = u[k][it-1];
        }
        for (int i = 1; i <= nx - 1; i++) {
            u[0][it] = un[1][it-1];
            u[i][it] = un[i][it-1] - c*dt/dx*(un[i][it-1]-un[i-1][it-1]);
        }
    }

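To make the finite-difference loop above easier to experiment with, here is a minimal self-contained sketch of it. The function name solve_convection, the Grid alias, and the parameters initial, c, dx and dt are assumptions for illustration; the original program's setup code and values are not shown on this page. The boundary assignment is hoisted out of the inner loop, which gives the same result as long as nx is at least 2.

```cpp
#include <cassert>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// Hypothetical wrapper around the finite-difference loop shown above.
// `initial`, c, dx and dt are assumed inputs, not values from the original program.
Grid solve_convection(const std::vector<double>& initial, int nt,
                      double c, double dx, double dt) {
    const int nx = static_cast<int>(initial.size());
    assert(nx >= 2 && nt >= 1);

    Grid u(nx, std::vector<double>(nt, 0.0));   // u[space][time]
    Grid un(nx, std::vector<double>(nt, 0.0));  // copy of the previous time step
    for (int i = 0; i < nx; i++) u[i][0] = initial[i];

    for (int it = 1; it <= nt - 1; it++) {
        // Copy the previous time step.
        for (int k = 0; k <= nx - 1; k++) un[k][it - 1] = u[k][it - 1];

        // Boundary value, hoisted out of the inner loop (same result for nx >= 2).
        u[0][it] = un[1][it - 1];

        // Update every interior point from its left neighbour.
        for (int i = 1; i <= nx - 1; i++) {
            u[i][it] = un[i][it - 1]
                     - c * dt / dx * (un[i][it - 1] - un[i - 1][it - 1]);
        }
    }
    return u;
}
```

The work done is proportional to nx * nt, which is why the run time grows so quickly with the problem size.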
Tests run with no optimization on Linux
Running cat /proc/cpuinfo on the command line shows the CPU specs of the VM that hosts Linux. For this test we have a Dual-Core AMD Opteron running at 2792 MHz.
n | Time in milliseconds
---|---
100 x 100 | 24
500 x 500 | 352
1000 x 1000 | 1090
2000 x 2000 | 3936
5000 x 5000 | 37799
5000 x 10000 | 65955
10000 x 10000 | 118682
12500 x 12500 | 220198
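
The page does not show how the timings above were collected. One possible way to get millisecond timings in C++ is sketched below using std::chrono::steady_clock. The helper name elapsed_ms is hypothetical and is not taken from the original program.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs work() once and returns the wall-clock time in milliseconds.
template <typename Work>
long long elapsed_ms(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // steady_clock never goes backwards
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

The finite-difference loop would be wrapped in a lambda and passed to elapsed_ms, so that only the loop is timed and not the input handling.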
gprof
The output gets a bit messy below, but basically 89.19% of the run time is spent in main(), calculating the for loops shown above. Most of the remaining time is spent in std::vector's operator[] (element access); the memory allocated for these vectors might also cause some slowdown when it is transferred across the bus to the GPU in the future.
The main thing to take away here is that main() accounts for 89.19% of the run time and takes 97 seconds.
Each sample counts as 0.01 seconds.

      %   cumulative   self                 self     total
     time   seconds   seconds      calls   s/call   s/call  name
    89.19     97.08    97.08                                main
     4.73    102.22     5.14  1406087506     0.00     0.00  std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >::operator[](unsigned int)
     4.49    107.11     4.88  1406087506     0.00     0.00  std::vector<double, std::allocator<double> >::operator[](unsigned int)

Potential Speed Increase with Amdahl's Law
Amdahl's Law: Sn = 1 / ( 1 - P + P/n )
With it we can estimate how much faster our program can become.
P is the fraction of the program we want to optimize, which from above is 89.19%. n is the number of processors we will use. One GPU card has 384 processors (CUDA cores) and the other GPU we will use has 1024 processors (CUDA cores).
Applying the formula gives us:
Amdahl's Law for the GPU with 384 cores: Sn = 1 / ( 1 - 0.8919 + 0.8919/384 )
Sn = 9.0561125222
Amdahl's Law for the GPU with 1024 cores: Sn = 1 / ( 1 - 0.8919 + 0.8919/1024 )
Sn = 9.176753777
Therefore, according to Amdahl's law, we can expect roughly a 9x increase in speed.
97 seconds to execute main / 9 (Amdahl's law) = 10.78 seconds to execute after using the GPU
Interestingly, according to the law, the difference in GPU cores does not significantly increase speed. Future tests will confirm or deny these results.
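The Amdahl's law arithmetic above can be reproduced with a few lines of C++. This is only a sketch: the function names amdahl_speedup and amdahl_limit are made up for illustration. The second function returns the value the speedup approaches as n grows without limit, 1 / (1 - P), which helps explain why adding cores changes the result so little.

```cpp
#include <cassert>

// Amdahl's law: predicted speedup for parallel fraction p on n processors.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p < 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}

// Upper bound on the speedup as n grows without limit: 1 / (1 - p).
double amdahl_limit(double p) {
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

With P = 0.8919 the limit is about 9.25, so both 384 and 1024 cores are already close to the ceiling set by the serial part of the program.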
Potential Speed Increase with Gustafson's Law
Gustafson's Law: S(n) = n - ( 1 - P ) * ( n - 1 )
(Quadro K2000 GPU) S = 384 - ( 1 - 0.8919 ) * ( 384 - 1 ) = 342.5977
(GeForce GTX960 GPU) S = 1024 - ( 1 - 0.8919 ) * ( 1024 - 1 ) = 913.4137
Using Gustafson's law we see a drastic change in the predicted speedup, and this time the additional cores make a big difference. Gustafson's law assumes the problem size grows with the number of processors, so these figures are best read as an optimistic upper bound. Applying these speedups we get:
(Quadro K2000 GPU) 97 seconds to execute / 342.5977 = 0.28 seconds
(GeForce GTX960 GPU) 97 seconds to execute / 913.4137 = 0.11 seconds
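The Gustafson's law formula above can be checked the same way. The sketch below uses a hypothetical function name, gustafson_speedup, and simply evaluates S(n) = n - ( 1 - P ) * ( n - 1 ).

```cpp
#include <cassert>

// Gustafson's law: scaled speedup for parallel fraction p on n processors.
double gustafson_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return n - (1.0 - p) * (n - 1.0);
}
```

Unlike Amdahl's law, the result grows almost linearly with n. For example, gustafson_speedup(0.8919, 1024) comes out at roughly 913.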
Tests run with no optimization on Windows (Nsight)
System Specifications
Application 2: Calculating Pi
This application is pretty straightforward: it calculates Pi to the number of decimal places given by the user. So an input of 10 or 100,000 will calculate Pi to the 10th or the 100,000th decimal place.
problem
Inside the function calculate we have:

    void calculate(std::vector<int>& r, int n) {
        int i, k;
        int b, d;
        int c = 0;

        for (i = 0; i < n; i++) {
            r[i] = 2000;
        }

        for (k = n; k > 0; k -= 14) {
            d = 0;
            i = k;
            for (;;) {
                d += r[i] * 10000;
                b = 2 * i - 1;
                r[i] = d % b;
                d /= b;
                i--;
                if (i == 0) break;
                d *= i;
            }
            //printf("%.4d", c + d / 10000);
            c = d % 10000;
        }
    }

I believe the two nested for loops cause most of the delay in the program's execution time.
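To get a feel for how much work the nested loops do, the sketch below counts how many times the body of the inner for (;;) loop runs for a given n. The helper inner_iterations is hypothetical and not part of the original program. It relies only on the loop structure of calculate(): the outer loop steps k down by 14, and for each k the inner loop walks i from k down to 1.

```cpp
#include <cassert>

// Hypothetical helper: counts the inner-loop iterations of calculate() for a given n.
long long inner_iterations(long long n) {
    assert(n >= 0);
    long long total = 0;
    for (long long k = n; k > 0; k -= 14) {
        total += k;  // for this k, the inner loop runs with i = k, k-1, ..., 1
    }
    return total;
}
```

For n = 1000 this gives 36216 iterations, close to n * n / 28. The count grows quadratically with n, which fits the roughly 100x jump in measured run time when n goes from 10000 to 100000.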
Tests run with no optimization on Linux
For this test the Linux VM has a Dual-Core AMD Opteron running at 2792 MHz.
n | Time in milliseconds
---|---
1000 | 2
10000 | 266
100000 | 26616
200000 | 106607
500000 | 671163
gprof
As with the other application that was profiled, the gprof results can be a bit hard to read. Basically the program spends 87% of its time in the calculate() function, and with a problem size of 500,000 that amounts to 354 seconds. Hopefully we can get this number down.
The main thing to take away here is that calculate() accounts for 87.39% of the run time and takes 354 seconds. Flat profile:
Each sample counts as 0.01 seconds.

      %   cumulative   self                 self     total
     time   seconds   seconds      calls   s/call   s/call  name
    87.39    354.08   354.08          1   354.08   395.84  calculate(std::vector<int, std::allocator<int> >&, int)
    10.31    395.84    41.76  678273676     0.00     0.00  std::vector<int, std::allocator<int> >::operator[](unsigned int)

Potential Speed Increase with Amdahl's Law
Amdahl's Law: Sn = 1 / ( 1 - P + P/n )
With it we can estimate how much faster our program can become.
P is the fraction of the program we want to optimize, which from above is 87.39%. n is the number of processors we will use. One GPU card has 384 processors (CUDA cores) and the other GPU we will use has 1024 processors (CUDA cores).
Applying the formula gives us:
Amdahl's Law for the GPU with 384 cores: Sn = 1 / ( 1 - 0.8739 + 0.8739/384 )
Sn = 7.789631
Amdahl's Law for the GPU with 1024 cores: Sn = 1 / ( 1 - 0.8739 + 0.8739/1024 )
Sn = 7.876904
Therefore, according to Amdahl's law, we can expect roughly a 7.8x to 7.9x increase in speed.
354.08 seconds to execute calculate() / 7.8 (Amdahl's law) = 45.39 seconds to execute after using the GPU
Interestingly, the last application had P = 89% (9x speedup) and this application has P = 87% (7.8x speedup), so a difference of about 2 percentage points in P made quite a difference.
Potential Speed Increase with Gustafson's Law
Gustafson's Law: S(n) = n - ( 1 - P ) * ( n - 1 )
(Quadro K2000 GPU) S = 384 - ( 1 - 0.8739 ) * ( 384 - 1 ) = 335.7037
(GeForce GTX960 GPU) S = 1024 - ( 1 - 0.8739 ) * ( 1024 - 1 ) = 894.9997
Using Gustafson's law we again see a drastic change in the predicted speedup, and the additional cores make a big difference. Applying these speedups we get:
(Quadro K2000 GPU) 354 seconds to execute / 335.7037 = 1.0545 seconds
(GeForce GTX960 GPU) 354 seconds to execute / 894.9997 = 0.3955 seconds
Conclusions from the Profile Assessment
Both applications are dominated by a quadratic problem (a nested for loop). The time spent processing that main problem was 89.19% and 87.39% of the run time, or 97 and 354 seconds respectively. Based on this, I believe it is feasible to optimize one of these applications with CUDA to improve performance.
I will attempt to optimize the Navier–Stokes flow velocity program, as that application is more interesting to me.