Changes


Kernal Blas

301 bytes added, 10:14, 4 April 2018

→‎Progress

==== Calculation of Pi ====

For this assessment, we used code found at [https://helloacm.com/cc-coding-exercise-finding-approximation-of-pi-using-monto-carlo-algorithm/ helloacm.com]:

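<syntaxhighlight lang="cpp">
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <iostream>

using std::cout;

// Timing helper; not part of the excerpt, reconstructed here from the
// output format shown below ("- took - N millisecs").
void reportTime(const char* msg, std::chrono::steady_clock::duration span) {
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(span);
    cout << msg << " - took - " << ms.count() << " millisecs" << std::endl;
}

int main() {
    srand(time(NULL));  // seed from the clock, so every run differs
    cout.precision(10);
    std::chrono::steady_clock::time_point ts, te;
    // Number of random points to generate in each run
    const double N[] = {1e1, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8};
    for (std::size_t j = 0; j < sizeof(N) / sizeof(N[0]); j++) {
        ts = std::chrono::steady_clock::now();
        int circle = 0;  // points that fall inside the quarter circle
        for (int i = 0; i < N[j]; i++) {
            double x = static_cast<double>(rand()) / static_cast<double>(RAND_MAX);
            double y = static_cast<double>(rand()) / static_cast<double>(RAND_MAX);
            if (x * x + y * y <= 1.0) circle++;
        }
        te = std::chrono::steady_clock::now();
        // hits / points approximates pi / 4
        cout << N[j] << "\t\t" << (double)circle / N[j] * 4;
        reportTime("", te - ts);
    }
    return 0;
}
</syntaxhighlight>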

In this version, the value of PI is calculated using the Monte Carlo method.

This method states:

100000000 3.1419176 - took - 10035 millisecs

''The first column represents the number of points we are generating.''

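With this method, the computed value is slightly off from the true value of pi. This is because the points are generated at random within the unit square, so each estimate carries some sampling error. Since the generator is seeded from the current time, running the program again will give us slightly different results.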

'''Parallelizing'''

We did not see any other way to parallelize the algorithm.

=== Assignment 2 ===

However, the parallelized results seem to stay accurate throughout the iterations.

It seems as though the calculation time doesn't change much and stays consistent.

[[File:Cudamalloc.PNG|800px]] [[File:Prof.PNG]]

Profiling the code shows that '''cudaMalloc''' takes up most of the time spent. Even when there are only 10 iterations, the time remains at 300 milliseconds.

Once the iteration count passes 25 million, the results become inaccurate, which we suspect is caused by a memory leak.

To optimize the code, we must find a way to reduce the time cudaMalloc takes.

=== Assignment 3 ===

----

After realizing that cudaMemcpy and cudaMalloc take quite a bit of time, we focused our efforts on optimizing them.

It was difficult to find a solution because the initial copy takes a bit of time to set up.

We tried allocating the host memory with cudaMallocHost instead of malloc.

cudaMallocHost allocates pinned (page-locked) memory, which stays in RAM and can be accessed directly by the GPU's DMA engine.

We changed one part of our code:

<syntaxhighlight lang="cpp">
cudaMallocHost((void **)&host, size);
</syntaxhighlight>

<br/>
Here we can see where an error occurs. We suspect that a memory leak causes the problem, resulting in an error in the pi calculation.

'''Optimized time run results'''

[[File:Chart3.PNG]] [[File:Chartp3.PNG]]

The final results show that although cudaMallocHost should improve the speed of memory transfer, it didn't make much of a difference here. In conclusion, we can see that the GPU's performance is significantly faster than the CPU's.

Jpham14 (96 edits)
