Kernal Blas

== Progress ==

=== Assignment 1 ===
==== Calculation of Pi ====
For this assessment, we used code found at [https://helloacm.com/cc-coding-exercise-finding-approximation-of-pi-using-monto-carlo-algorithm/ helloacm.com]
<syntaxhighlight lang="cpp">
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <chrono>
using namespace std;

// reportTime() is a helper (defined elsewhere in the project) that prints an elapsed steady_clock interval
int main() {
    srand(time(NULL));
    cout.precision(10);
    std::chrono::steady_clock::time_point ts, te;
    const double N[] = { 1e1, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8 };
    for (int j = 0; j < (sizeof(N) / sizeof(N[0])); j++) {
        ts = std::chrono::steady_clock::now();
        int circle = 0;
        for (int i = 0; i < N[j]; i++) {
            // random point (x, y) in the unit square
            double x = static_cast<double>(rand()) / static_cast<double>(RAND_MAX);
            double y = static_cast<double>(rand()) / static_cast<double>(RAND_MAX);
            if (x * x + y * y <= 1.0) circle++;   // count points that fall inside the quarter circle
        }
        te = std::chrono::steady_clock::now();
        cout << N[j] << (char)9 << (char)9 << (double)circle / N[j] * 4;
        reportTime("", te - ts);
    }
    return 0;
}
</syntaxhighlight>
In this version, the value of pi is calculated using the Monte Carlo method. The method generates random points (x, y) in the unit square and counts how many fall inside the quarter circle (x*x + y*y <= 1.0); since that ratio approaches pi/4, multiplying the fraction of points inside by 4 approximates pi.

Sample output for the largest run:
 100000000	3.1419176 - took - 10035 millisecs
''The first column is the "stride", or the number of points we are generating; the second is the resulting approximation of pi.''

With this method, we can see that the accuracy of the calculation is slightly off. This is due to the randomization of points within the circle; running the program again will give slightly different results.
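As a quick sanity check on that sample output (our own arithmetic, not a figure reported by the run), the printed value corresponds to roughly 78.5 million of the 100 million points landing inside the quarter circle:

<math>\pi \approx 4\cdot\frac{\text{points inside}}{\text{points total}} = 4\cdot\frac{78\,547\,940}{100\,000\,000} = 3.1419176</math>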
'''Parallelizing'''

From one of the suggested improvements in the linked post, a potential improvement is changing char& c to a const char in the for loop
<syntaxhighlight lang="cpp">
for (char& c : input) {
</syntaxhighlight>
since c is never modified inside the loop. Otherwise, we did not see any other way to parallelize the compression algorithm.
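For illustration only, the loop with that change applied might look like the sketch below (here <code>input</code> is just a stand-in for the buffer being processed):
<syntaxhighlight lang="cpp">
#include <iostream>
#include <string>

int main() {
    std::string input = "example data";   // stand-in for the buffer being compressed
    for (const char c : input) {          // read-only copy: the loop body never modifies c
        std::cout << c;
    }
    std::cout << '\n';
    return 0;
}
</syntaxhighlight>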
 
=== Assignment 2 ===
----
In order to parallelize the code from above, we decided to use a kernel to handle the calculations.
The logic largely remains the same, but we offload the CPU calculations to the GPU. <br>This code generates random points within the kernel, and the calculations are also done there.<br>
<br>
Offloading to the GPU reduces the time needed to calculate pi.
<br>
[[File:Pi_calculation.png]]
<br>
'''Kernel code used'''
<syntaxhighlight lang="cpp">
#include <curand_kernel.h>

__global__ void calculate(float *d_pi, curandState *states, float n) {
    unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;   // sequential thread index across the blocks

    curand_init(1000, tid, 0, &states[tid]);   // initialize CURAND for this thread

    float points = 0;
    float x, y;
    for (int i = 0; i < n; i++) {
        x = curand_uniform(&states[tid]);      // random float from 0.0 to 1.0
        y = curand_uniform(&states[tid]);
        points += (x * x + y * y <= 1.0f);     // count the point if (x, y) falls inside the circle
    }

    d_pi[tid] = 4.0f * points / n;             // this thread's estimate of pi
}
</syntaxhighlight>
<br>
[http://docs.nvidia.com/cuda/curand/index.html cuRAND] documentation.
<br>
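For context, a minimal host-side launcher for this kernel could look like the sketch below. This is our illustration rather than the exact main() used for the measurements; the block/thread counts, the points-per-thread value, and the final averaging step are assumptions.
<syntaxhighlight lang="cpp">
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void calculate(float *d_pi, curandState *states, float n);   // kernel shown above

int main() {
    const int nblocks = 30, ntpb = 256;                    // assumed launch configuration
    const int nthreads = nblocks * ntpb;
    const float n = 1.0e5f;                                // points generated per thread (assumed)

    float *d_pi = nullptr;
    curandState *d_states = nullptr;
    cudaMalloc((void **)&d_pi, nthreads * sizeof(float));
    cudaMalloc((void **)&d_states, nthreads * sizeof(curandState));

    calculate<<<nblocks, ntpb>>>(d_pi, d_states, n);       // each thread writes its own estimate
    cudaDeviceSynchronize();

    std::vector<float> h_pi(nthreads);
    cudaMemcpy(h_pi.data(), d_pi, nthreads * sizeof(float), cudaMemcpyDeviceToHost);

    double pi = 0.0;
    for (int i = 0; i < nthreads; i++) pi += h_pi[i];      // average the per-thread estimates
    pi /= nthreads;

    printf("PI = %f\n", pi);
    cudaFree(d_pi);
    cudaFree(d_states);
    return 0;
}
</syntaxhighlight>
Because every thread already returns a finished estimate, the host only has to copy back and average nblocks * ntpb floats.
<br>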
 
'''Results: CPU vs GPU'''
<br>
[[File:Cpuvsgpusheet.PNG|600px]]
<br>
<br>
[[File:Cpuvsgpu.png|600px]]
<br>
As we can see above, the more iterations we run, the more accurate the calculation of PI becomes. <br>
The CPU's results change drastically as we increase the iterations by 10x. <br>
However, the parallelized results stay accurate throughout the iterations. <br>
The GPU calculation time also doesn't change much and stays consistent. <br>
[[File:Cudamalloc.PNG|800px]] <br>
[[File:Prof.PNG]] <br>
Profiling the code shows that '''cudaMalloc''' takes up most of the time spent. Even when <br>there are only 10 iterations, the time remains at about 300 milliseconds. <br>
As the iteration count passes 25 million, we get a bit of a memory leak, which results in inaccurate results. <br><br>
In order to optimize the code, we must find a way to reduce the time cudaMalloc takes. <br>
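To see the allocation cost in isolation, a simple timing harness like the one below can be used (our own sketch, not the profiler run shown above; the buffer size is an assumption). The first CUDA runtime call also pays a one-time context-initialization cost, which is likely why the roughly 300-millisecond figure barely changes with the iteration count.
<syntaxhighlight lang="cpp">
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t size = 30 * 256 * sizeof(float);    // assumed array size
    float *d_buf = nullptr;

    auto ts = std::chrono::steady_clock::now();
    cudaMalloc((void **)&d_buf, size);               // first CUDA call: also pays the context-setup cost
    auto te = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(te - ts).count();
    printf("first cudaMalloc took %lld ms\n", static_cast<long long>(ms));

    cudaFree(d_buf);
    return 0;
}
</syntaxhighlight>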
=== Assignment 3 ===
----
After realizing that cudaMemcpy and cudaMalloc take quite a bit of time, we focused our efforts on optimizing them. It was difficult to find a solution because the initial copy takes a bit of time to set up.<br>
We tried using cudaMallocHost to allocate the host memory instead of using malloc.<br>
cudaMallocHost allocates pinned memory, which is stored in RAM and can be accessed by the GPU's DMA directly. We changed one part of our code:
<syntaxhighlight lang="cpp">
cudaMallocHost((void **)&host, size);
</syntaxhighlight>
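A minimal before/after sketch of that change (the variable names and buffer size here are assumptions; pinned memory is released with cudaFreeHost rather than free):
<syntaxhighlight lang="cpp">
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t size = 30 * 256 * sizeof(float);   // assumed host buffer size
    float *host = nullptr;

    // before: pageable host memory
    // host = (float *)malloc(size);  ...  free(host);

    // after: pinned (page-locked) memory, reachable by the GPU's DMA engine
    cudaMallocHost((void **)&host, size);
    // ... cudaMemcpy to/from the device now uses the faster pinned path ...
    cudaFreeHost(host);
    return 0;
}
</syntaxhighlight>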
'''Test runs:'''<br/>
<br/>
Here we can see where an error occurs; we suspect that a memory leak causes the problem, resulting in an error in the pi calculation.
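One generic way to surface such failures (a standard CUDA error check, not code from our runs) is to query the error state after the kernel launch and the copy back:
<syntaxhighlight lang="cpp">
#include <cstdio>
#include <cuda_runtime.h>

// Report any pending CUDA error; call after a kernel launch or cudaMemcpy.
void checkCuda(const char *label) {
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "%s: %s\n", label, cudaGetErrorString(err));
}
</syntaxhighlight>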
 
'''Optimized time run results'''
<br>
[[File:Chart3.PNG]]<br>
[[File:Chartp3.PNG]]<br>
The final results show that although cudaMallocHost should improve the speed of memory transfer, it didn't make much <br>of a difference here. In conclusion, we can see that the GPU's performance is significantly faster than the CPU's.<br>