= Kernal Blas =

== Progress ==
==== Calculation of Pi ====
For this assessment, we used code found at [https://helloacm.com/cc-coding-exercise-finding-approximation-of-pi-using-monto-carlo-algorithm/ helloacm.com]
<syntaxhighlight lang="cpp">
#include <iostream>
#include <chrono>
#include <cstdlib>
#include <ctime>
using namespace std;

// reconstructed helper, assumed to match the course's reportTime: prints the elapsed time in milliseconds
void reportTime(const char* msg, std::chrono::steady_clock::duration span) {
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(span);
    cout << msg << " - took - " << ms.count() << " milliseconds" << endl;
}

int main() {
    srand(time(NULL));
    cout.precision(10);
    std::chrono::steady_clock::time_point ts, te;
    const double N[] = {1e1, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8};
    for (int j = 0; j < (int)(sizeof(N) / sizeof(N[0])); j++) {
        ts = std::chrono::steady_clock::now();
        int circle = 0;
        // sample N[j] random points in the unit square; count those inside the quarter circle
        for (int i = 0; i < N[j]; i++) {
            double x = static_cast<double>(rand()) / static_cast<double>(RAND_MAX);
            double y = static_cast<double>(rand()) / static_cast<double>(RAND_MAX);
            if (x * x + y * y <= 1.0) circle++;
        }
        te = std::chrono::steady_clock::now();
        // circle / N[j] approximates pi/4, so multiply by 4 for the estimate of pi
        cout << N[j] << '\t' << '\t' << (double)circle / N[j] * 4;
        reportTime("", te - ts);
    }
    return 0;
}
</syntaxhighlight>
In this version, the value of PI is calculated using the Monte Carlo method.
This method states: if points are sampled uniformly at random from the unit square, the fraction that falls inside the quarter circle of radius 1 approaches the ratio of the two areas, which is PI/4.
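With <code>circle</code> counting the samples that land inside the quarter circle out of <code>N</code> total samples, the estimate computed in the code above is:
<math>\frac{circle}{N} \approx \frac{\pi}{4} \quad\Longrightarrow\quad \pi \approx 4 \cdot \frac{circle}{N}</math>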
'''Parallelizing''' <br>
One of the suggested improvements in the comments of the linked algorithm post is changing <code>char& c</code> to <code>const char c</code> in the for loop <syntaxhighlight lang="cpp">for (char& c : input) {</syntaxhighlight> since <code>c</code> is not being modified. Otherwise, we did not see any other way to parallelize the algorithm.
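For reference, the suggested read-only form would look like this (a small sketch; <code>input</code> is assumed to be the string being processed):
<syntaxhighlight lang="cpp">
// const copy instead of a mutable reference, since c is never written to
for (const char c : input) {
    // ... per-character work ...
}
</syntaxhighlight>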
=== Assignment 2 ===
[[File:Prof.PNG]] <br>
Profiling the code shows that '''cudaMalloc''' takes up most of the time spent. Even when <br>
there are only 10 iterations, the time remains at 300 milliseconds. <br>
As the iteration count passes 25 million, we run into a memory leak, which results in inaccurate results. <br><br>
In order to optimize the code, we must find a way to reduce the time cudaMalloc takes.<br>
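As a rough illustration of how the cost of a single cudaMalloc can be isolated (a hypothetical sketch using CUDA events, not our profiling setup; <code>n</code> is a placeholder sample count):
<syntaxhighlight lang="cpp">
float* d_x = nullptr;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMalloc((void**)&d_x, n * sizeof(float)); // the call that dominates the profile
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(d_x);
</syntaxhighlight>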
=== Assignment 3 ===
----
After realizing that cudaMalloc takes quite a bit of time, we focused our efforts on optimizing it. It was difficult to find a solution because the initial allocation always takes a bit of time to set up.<br>
We tried using cudaMallocHost to allocate the host memory instead of using malloc. <br>
cudaMallocHost allocates pinned memory, which is stored in RAM and can be accessed directly by the GPU's DMA engine.
We changed one part of our code so that the host buffer is allocated with cudaMallocHost rather than malloc.
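Roughly, the change looks like the following (a sketch rather than our exact code; <code>h_x</code> and <code>n</code> are placeholder names):
<syntaxhighlight lang="cpp">
float* h_x = nullptr;

// before: pageable host memory
// h_x = (float*)malloc(n * sizeof(float));

// after: pinned (page-locked) host memory, reachable by the GPU's DMA engine
cudaMallocHost((void**)&h_x, n * sizeof(float));

// ... launch kernels / cudaMemcpy to and from the device ...

cudaFreeHost(h_x); // pinned memory must be freed with cudaFreeHost, not free()
</syntaxhighlight>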
The error in the PI estimation is how far the computed value is from the known value PI = 3.1415926535, i.e. error = |estimate - PI|.
 
'''Test runs:''' <br/>
'''Run 1:'''
<br/>
n = 10 [[File:10-Kernal-Blas.png]]<br/><br/>
n = 1000 [[File:1000-Kernal-Blas.png]]<br/><br/>
n = 10000 [[File:10000-Kernal-Blas.png]]<br/><br/>
n = 100000 [[File:100000-Kernal-Blas.png]]<br/><br/>
n = 1000000 [[File:1000000-Kernal-Blas.png]]<br/>
Here we can see where an error occurs; from this point onward, we suspect that a memory leak causes the problem, resulting in an error in the pi calculation.
'''Optimized time run results'''
<br>
[[File:Chart3.PNG]]<br>[[File:Chartp3.PNG]]<br> The final results show that although cudaMallocHost should improve the speed of memory transfer, it didn't make much of a difference here. In conclusion, we can see that the GPU's performance is significantly faster than the CPU's performance.<br>