Changes

Kernal Blas

189 bytes removed, 08:37, 4 April 2018

→‎Assignment 3

----

After realizing that cudaMemcpy took quite a bit of time, we focused our efforts on optimizing it.

It was difficult to find a solution because the initial copy always takes a bit of time.<br>

We tried using cudaMallocHost instead of malloc to allocate the host memory. <br>

cudaMallocHost allocates pinned (page-locked) memory, which stays in RAM and can be accessed directly by the GPU's DMA engine.
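A minimal sketch of how the two allocation styles could be compared, not the code from our assignment. The buffer size `n`, the helper `time_h2d_copy`, and the names `pageable` and `device` are hypothetical; only `host` and `size` come from the change described here. It times one host-to-device cudaMemcpy from a malloc buffer and one from a cudaMallocHost buffer, after a warm-up copy so that first-use overhead is not counted.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: time one host-to-device cudaMemcpy, in milliseconds.
static float time_h2d_copy(void* dst_device, const void* src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst_device, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // wait until the copy has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t n = 1 << 24;            // assumed element count, for illustration only
    const size_t size = n * sizeof(float);

    float* pageable = (float*)malloc(size);  // ordinary pageable host memory
    float* host = nullptr;                   // pinned host memory
    float* device = nullptr;                 // device buffer
    if (!pageable) return 1;
    if (cudaMallocHost((void**)&host, size) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&device, size) != cudaSuccess) return 1;

    for (size_t i = 0; i < n; ++i) { pageable[i] = 1.0f; host[i] = 1.0f; }

    time_h2d_copy(device, pageable, size);   // warm-up copy, result discarded
    printf("pageable: %.3f ms\n", time_h2d_copy(device, pageable, size));
    printf("pinned:   %.3f ms\n", time_h2d_copy(device, host, size));

    cudaFree(device);
    cudaFreeHost(host);                  // pinned memory is released with cudaFreeHost, not free
    free(pageable);
    return 0;
}
```

The actual timings depend on the GPU and the system, so none are quoted here.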

We changed one part of our code:

cudaMallocHost((void **)&host, size);

~~The kernel code we used to optimize our code~~

~~__global__ void gpu_monte_carlo(float *estimate, curandState *states, float n) {~~

~~unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;~~

~~float points_in_circle = 0;~~

~~float x, y;~~

~~curand_init(1234, tid, 0, &states[tid]); // Initialize CURAND~~

~~for (int i = 0; i < n; i++) {~~

~~x = curand_uniform(&states[tid]);~~

~~y = curand_uniform(&states[tid]);~~

~~points_in_circle += (x*x + y*y <= 1.0f); // count if x & y is in the circle.~~

}

~~estimate[tid] = 4.0f * points_in_circle / n; // return estimate of pi~~

}

</syntaxhighlight>

~~How we optimize and improved the code from assignment 2 is instead of using a randomized number we ask the user for input on pi calculation.~~ As expected <br/>

The error in the pi estimate is how far the estimate is from the known value of pi, 3.1415926535.
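The error described above can be written as a small host-side helper. This is only a sketch: the names `KNOWN_PI` and `pi_abs_error` are hypothetical and do not appear in our code, and it assumes the error is taken as an absolute difference. It is plain C++, so it also compiles as host code in a .cu file.

```cpp
#include <cmath>

// Reference value of pi as quoted in the text.
const double KNOWN_PI = 3.1415926535;

// Absolute error: how far an estimate lies from the known value of pi.
// (Assumption: the error is measured as |estimate - pi|.)
double pi_abs_error(double estimate) {
    return std::fabs(estimate - KNOWN_PI);
}
```

For example, `pi_abs_error(estimate[tid])` would give the error of a single thread's estimate once the results have been copied back to the host.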

<br>

[[File:kernal-blas-optimized.png]]

[[File:Chartp3.PNG]]

Jpham14
