=== Tests ran with no optimization on Linux ===
By running the command cat /proc/cpuinfo we can find the CPU specifications of the VM the Linux tests are run on.
For this test we have:

{| class="wikitable sortable" border="1" cellpadding="5"
|+ System Specifications
! CPU !! Clock Speed
|-
| Dual-Core AMD Opteron || 2792 MHz
|}
== Application 2: Calculating Pi ==

This application is pretty straightforward: it calculates Pi to the number of decimal places given by the user. For example, an input of 10 will calculate Pi to the 10th decimal place, while an input of 100,000 will calculate it to the 100,000th.
 
=== Problem ===
 
Inside the calculate() function we have:
 
 void calculate(std::vector<int>& r, int n)
 {
     int i, k;
     int b, d;
     int c = 0;
 
     // fill the working array
     for (i = 0; i < n; i++) {
         r[i] = 2000;
     }
 
     // spigot loop: each pass of the outer loop produces the next four digits of Pi
     // (see the commented-out printf)
     for (k = n; k > 0; k -= 14) {
         d = 0;
 
         i = k;
         for (;;) {
             d += r[i] * 10000;
             b = 2 * i - 1;
 
             r[i] = d % b;
             d /= b;
             i--;
             if (i == 0) break;
             d *= i;
         }
         //printf("%.4d", c + d / 10000);
         c = d % 10000;
     }
 }
 
I believe the two nested for loops are responsible for most of the program's execution time.
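
For context on how the timings below might be gathered, here is a minimal driver sketch of my own (not taken from the original program) that calls and times calculate(); the problem size and the vector sizing are assumptions:

 #include <chrono>
 #include <iostream>
 #include <vector>
 
 void calculate(std::vector<int>& r, int n);   // the function shown above
 
 int main()
 {
     int n = 1000;                  // assumed problem size (user input in the real program)
     std::vector<int> r(n + 1);     // sized n + 1 because the inner loop starts at index k = n
 
     auto start = std::chrono::steady_clock::now();
     calculate(r, n);
     auto end = std::chrono::steady_clock::now();
 
     std::cout << "n = " << n << " took "
               << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
               << " ms" << std::endl;
     return 0;
 }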
 
=== Tests ran with no optimization on Linux ===

For this test the Linux VM has a Dual-Core AMD Opteron at 2792 MHz.
 
 
{| class="wikitable sortable" border="1" cellpadding="5"
|+ Naiver Equation
! n !! Time in Milliseconds
|-
||1000 ||2||
|-
||10000 ||266||
|-
||100000 ||26616||
|-
||200000 ||106607||
|-
||500000 ||671163||
|}
 
 
=== gprof ===
 
As with the other application that was profiled, the gprof results can be a bit hard to read. Basically, the program spends 87.39% of its time in the calculate() method, and with a problem size of 500,000 it spent a cumulative 354 seconds there. Hopefully we can get this number down.

(For comparison, in the previous application main() accounted for 89.19% of the run time and took 97 seconds.)
 Flat profile:
 
 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds     calls  s/call  s/call  name
  87.39    354.08   354.08         1  354.08  395.84  calculate(std::vector<int, std::allocator<int> >&, int)
  10.31    395.84    41.76 678273676    0.00    0.00  std::vector<int, std::allocator<int> >::operator[](unsigned int)
 
=== Potential Speed Increase with Amdahl's Law ===
 
Using Amdahl's Law ----> Sn = 1 / ( 1 - P + P/n )

we can estimate how much faster our program is capable of running.

P = the portion of the program we want to optimize, which from above is 87.39%
n = the number of processors we will use. One GPU card (Quadro K2000) has 384 processors or CUDA cores, and the other GPU (GeForce GTX 960) has 1024 processors or CUDA cores.

Applying the formula gives us:
 
Amdahl's Law for the GPU with 384 cores ----> Sn = 1 / ( 1 - 0.8739 + 0.8739/384 )
Sn = 7.789631

Amdahl's Law for the GPU with 1024 cores ----> Sn = 1 / ( 1 - 0.8739 + 0.8739/1024 )
Sn = 7.876904
 
'''Therefore, according to Amdahl's Law we can expect a 7.7x to 7.9x increase in speed.'''

354 seconds to execute calculate() / 7.8 (Amdahl's Law speedup) ≈ 45.4 seconds to execute after using the GPU.

Interestingly, the last application had P = 89% (a 9x speedup) while this application has P = 87% (a 7.8x speedup); that 2% made quite a difference.
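
As a quick sanity check on the arithmetic, here is a small sketch (mine, not part of the original write-up) that evaluates Amdahl's Law for both core counts:

 #include <cstdio>
 
 // Amdahl's Law: Sn = 1 / ((1 - P) + P / n)
 double amdahl(double p, double n)
 {
     return 1.0 / ((1.0 - p) + p / n);
 }
 
 int main()
 {
     double p = 0.8739;  // fraction of run time spent in calculate(), from gprof
     std::printf("384 cores:  %f\n", amdahl(p, 384));    // ~7.79
     std::printf("1024 cores: %f\n", amdahl(p, 1024));   // ~7.88
     return 0;
 }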
 
----
 
=== Potential Speed Increase with Gustafson's Law ===

Gustafson's Law ----> S(n) = n - ( 1 - P ) ∙ ( n - 1 )

(Quadro K2000 GPU) S = 384 - ( 1 - 0.8739 ) * ( 384 - 1 ) = 335.7037

(GeForce GTX960 GPU) S = 1024 - ( 1 - 0.8739 ) * ( 1024 - 1 ) = 894.9997


Using Gustafson's Law we see a drastic change in the predicted speed increase; this time the additional cores make a big difference. Applying these speedups we get:

(Quadro K2000 GPU) 354 seconds to execute / 335.7037 = 1.0545 seconds

(GeForce GTX960 GPU) 354 seconds to execute / 894.9997 = 0.3955 seconds
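
Similarly, a small sketch (again mine, not from the original page) that evaluates Gustafson's Law for both GPUs:

 #include <cstdio>
 
 // Gustafson's Law: S(n) = n - (1 - P) * (n - 1)
 double gustafson(double p, double n)
 {
     return n - (1.0 - p) * (n - 1.0);
 }
 
 int main()
 {
     double p = 0.8739;  // parallel fraction from gprof
     std::printf("Quadro K2000 (384 cores):     %f\n", gustafson(p, 384));    // ~335.70
     std::printf("GeForce GTX 960 (1024 cores): %f\n", gustafson(p, 1024));   // ~895.00
     return 0;
 }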
== Conclusions with Profile Assessment ==
The problem in both applications is quadratic (a nested for loop). The time spent processing the main problem was 89.19% and 87.39% respectively, and the programs spent 97 and 354 seconds on that problem. Based on this, I believe it is feasible to optimize one of these applications with CUDA to improve performance. I will attempt to optimize the Navier-Stokes flow velocity program, as that application is more interesting to me.