== '''Announcements''' ==
N/A
== '''Team Members''' ==
== '''Progress''' ==
=== '''Assignment 1''' ===
==== '''Introduction''' ====
For the initial profiling, I investigated the Monte Carlo method of approximating the value of pi. A brief explanation of the Monte Carlo pi calculation can be found here: https://www.youtube.com/watch?v=VJTFfIqO4TU
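For readers who skip the video, the estimate rests on an area ratio. The sketch below assumes the points are drawn uniformly from the unit square; the symbols <code>N</code> (total points) and <code>N_in</code> (points with x² + y² ≤ 1) are introduced here for illustration and do not appear in the source file.

```latex
\frac{\text{area of quarter circle}}{\text{area of unit square}}
  = \frac{\pi \cdot 1^{2} / 4}{1^{2}}
  = \frac{\pi}{4}
\qquad\Longrightarrow\qquad
\pi \approx 4 \cdot \frac{N_{\text{in}}}{N}
```

The fraction of random points that land inside the quarter circle approaches π/4 as N grows, which is why the code multiplies the hit ratio by 4.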
==== '''Program Source File(s)''' ====
Link: https://drive.google.com/file/d/0B8GUuIUqdEJES3VEOGRnYmRNaEk
==== '''Code Snippet''' ====
Serial Pi Calculation Algorithm
 // loop over the user-specified number of sample points
 for (i = 0; i < points; i++)
 {
     x = randNum();
     y = randNum();
     // check if the point resides within the circle
     if (((x*x) + (y*y)) <= 1.0)
     {
         score++;
     }
 }
 // calculate pi from the ratio of points inside the circle
 pi = 4.0 * (float)score/(float)points;
==== '''Software and Hardware''' ====
[[File:Pi_software_and_hardware_list.jpg|border]]
==== '''Program Execution Plan''' ====
Serial tests of ''pi_serial'' are run with sample counts up to 1 billion. Between the 100 million and 1 billion runs, an additional run uses 134217728 (2^27) samples, because that is the largest sample count the Nvidia GeForce GTX 460 accepts without generating memory allocation errors.
==== '''Compilation and Running''' ====
[[File:Pi_serial_execution_example.jpg|border]]
==== '''Serial Results''' ====
[[File:Pi_serial_results.jpg|border]]
==== '''Conclusion''' ====
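As the sample count increases, the execution time of the program increases with it. The Big-O classification for ''pi_serial'' is O(n), where n is the sample count, because the loop performs a constant amount of work for each sample.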
=== '''Assignment 2''' ===
==== '''Program Introduction''' ====
In Phase 2, I've parallelized the serial program to run on a custom kernel on a CUDA-enabled device.
==== '''Source File(s)''' ====
Link: https://drive.google.com/file/d/0B8GUuIUqdEJEbDBRNkhWYnpGSnM
==== '''Code Snippet''' ====
Working Kernel Parallel CUDA Pi Calculation
 __global__ void findPi(float *estimatedPi, curandState *states, unsigned int taskElements, float seed)
 {
     unsigned int task_id = blockDim.x * blockIdx.x + threadIdx.x; // linear thread index along the x-axis
     int score = 0;
     float xCoord;
     float yCoord;
     // initialize this thread's curand state; task_id selects the sequence,
     // and seed is converted to curand_init's unsigned long long seed parameter
     curand_init(seed, task_id, 0, &states[task_id]);
     // sample taskElements points and tally those inside the circle
     for (unsigned int i = 0; i < taskElements; i++)
     {
         // assign each point random coordinate values
         xCoord = curand_uniform(&states[task_id]);
         yCoord = curand_uniform(&states[task_id]);
         // determine if the coordinate is within the circle
         if ((xCoord*xCoord + yCoord*yCoord) <= 1.0f)
         {
             score++;
         }
     }
     // estimated value of pi for this particular task
     estimatedPi[task_id] = (4.0f * score) / (float)taskElements;
 }
==== '''Software and Hardware''' ====
[[File:Pi_software_and_hardware_list.jpg|border]]
==== '''Program Execution Plan''' ====
CUDA tests of ''pi_cuda'' start at a sample count of 100 thousand and multiply the count by 10 on each run, up to the maximum supported sample count of 134217728 (a memory constraint on the Nvidia GeForce GTX 460). All tests use 128 blocks of 128 threads each.
==== '''Compilation and Running''' ====
[[File:Pi_cuda_execution_example.jpg|border]]
==== '''Parallel (CUDA) Results''' ====
[[File:Pi_cuda_results.jpg|border]]
==== '''Serial VS CUDA''' ====
[[File:Pi_serial_vs_cuda_results.jpg|border]]
==== '''Conclusion''' ====
Parallelizing the original serial code with CUDA yields an enormous increase in performance (lower execution time), as high as '''1372%'''. The next (final) phase will investigate whether shared memory, optimal memory allocation, reduced memory access time, and other optimizations can provide a further increase in performance (lower execution time) for ''pi_cuda''.
=== '''Assignment 3''' ===