
GPU610/Cosmosis

44,403 bytes added, 15:10, 13 April 2013


[mailto:nbguzman@myseneca.ca,jsantos13@myseneca.ca,acraig1@myseneca.ca,cfbale@myseneca.ca?subject=dps915-gpu610 Email All]

== Links ==

* Repo - [https://code.google.com/p/gpu-nbody/ code.google.com/p/gpu-nbody]
* SFML - [http://www.sfml-dev.org/ www.sfml-dev.org/]
* SSE - [https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions en.wikipedia.org/wiki/Streaming_SIMD_Extensions]
* N-Body - [http://www.cs.princeton.edu/courses/archive/fall07/cos126/assignments/nbody.html cs.princeton.edu/courses/archive/fall07/cos126/assignments/nbody.html]
* Grids - [http://www.resultsovercoffee.com/2011/02/cuda-blocks-and-grids.html resultsovercoffee.com/2011/02/cuda-blocks-and-grids.html]

== Progress ==

=== Assignment 1 ===

For assignment 1, we are looking for an N-body simulator. Each of us in the group has agreed to find one, and after profiling them we will choose the most inefficient one to parallelize.

====Alex's Profiling Findings====

=== Assignment 2 ===

We decided to use the N-Body simulator that Clinton profiled for assignment 2.

==== Baseline ====

The profiles below were made with the following compiler command and hardware:

<pre>nvcc main.cpp timer.cpp sim\simbody.cu sim\simulation.cu -DWIN32 -O3</pre>

* i5 2500K @ 4.5 GHz

* Nvidia GTX 560Ti

* Both the CPU and GPU versions drew no graphics to the screen, for unbiased results.

* Random position, velocity and mass for each body.

* Brute force algorithm for calculating the forces (O(n^2)).

==== Profiles ====

===== Profile #1 =====

[[File:gpu670_cosmo_a2_prof1.png|border]]

For our initial profile we measured the time difference between the CPU and GPU implementations of the N-Body brute-force algorithm. The chart above shows the number of seconds it took to compute 512 brute-force samples of the N-Body simulation; lower is better. The results were impressive: a 3097% speedup on 5000 bodies when simulating on the GPU. According to Amdahl's Law, a function taking up 88.46% of an application's execution time would yield a speedup of about 8.5x on a graphics card with 384 CUDA cores.
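The Amdahl's Law figure quoted above can be reproduced with a few lines of host code. This is only a sketch of the formula S = 1 / ((1 - p) + p / n); the function name <code>amdahlSpeedup</code> is made up for illustration and is not part of the project.

```cpp
#include <cassert>

// Amdahl's Law: S = 1 / ((1 - p) + p / n)
//   p: fraction of the run time that can be parallelized (0..1)
//   n: number of processors (here, CUDA cores)
double amdahlSpeedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}
```

Calling <code>amdahlSpeedup(0.8846, 384.0)</code> gives a value just under 8.5, which matches the estimate in the text.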

===== Profile #2 =====

Our second profile consists of running simulations for 240 seconds to determine how many samples we achieve per second, and how many total samples we end up with after four minutes.

[[File:gpu670_cosmo_a2_prof2.png|border]]

This profile shows how much speedup we can achieve with simple parallelization. Using the CPU on Windows with SSE, we achieved an average of about 51 samples per second, for a total of 12359 samples over a period of four minutes. With the GPU parallelization we achieved an average of about 370 samples per second, for a total of 88707 samples over the same four minutes. This gives us an average speed increase of about 7.25x (370 / 51 samples per second).

==== Difficulties ====

We faced several difficulties in moving the code from the host to the GPU. One challenge was that the image's functions live inside a library, so we could not make them callable on the device, and therefore could not store the bodies in a thrust device vector while they still held an image. Because of this we had to take the image management out of the ''Body'' class. This was an annoying hurdle, as we had to restructure one of our classes (''Body'') and a portion of the code within other classes (''BodyManager'' & ''Game'').

Another challenge was getting nvcc to compile and link in 64-bit while using our static 32-bit SFML libraries. We ended up reverting to a dynamically linked version of SFML and a 32-bit version of our executable. This change is only temporary, until we can reliably compile SFML and all of its dependencies for a 64-bit architecture.

==== Code ====

===== Old CPU Code =====

<pre>
void Simulation::Tick(double dt) {
    size_t i = 0, j = 0;

    // Accumulate the force on each body from every other body: O(n^2).
    for (i = 0; i < bodies_.size(); ++i) {
        bodies_[i].ResetForce();
        for (j = 0; j < bodies_.size(); ++j) {
            if (i != j) {
                bodies_[i].AddForce(bodies_[j]);
            }
        }
    }

    // Advance every body by one time step.
    for (i = 0; i < bodies_.size(); ++i) {
        bodies_[i].Tick(dt);
    }
}
</pre>

===== Updated Kernel Code =====

<pre>
// One thread per body: accumulate the force exerted by every other body.
void __global__ SimCalc(BodyArray a) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < a.size) {
        a.array[idx].ResetForce();
        for (size_t j = 0; j < a.size; ++j) {
            if (idx != j) {
                a.array[idx].AddForce(a.array[j]);
            }
        }
    }
}

// One thread per body: advance the body by one time step.
void __global__ SimTick(BodyArray a, float dt) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < a.size) {
        a.array[idx].Tick(dt);
    }
}
</pre>

=== Assignment 3 ===

==== Problem Overview ====

An N-body simulation is a simulation of a dynamical system of particles, usually under the influence of physical forces, such as gravity. In cosmology, they are used to study processes of non-linear structure formation such as the process of forming galaxy filaments and galaxy halos from dark matter in physical cosmology. Direct N-body simulations are used to study the dynamical evolution of star clusters.

For an N-body simulation to work, the program must go through each body and add up the forces exerted on it by every other body in order to update its position in the simulation. As a result, the algorithm used for these calculations has a time complexity of O(n^2). Because the running time grows quadratically as n increases linearly, the simulation is rather slow, which makes it a good candidate for parallelization.
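To make the O(n^2) cost concrete, here is a minimal host-side sketch of the brute-force force accumulation. The <code>Body2D</code> struct and <code>accumulateForces</code> function are hypothetical stand-ins, not the project's ''Body'' class, and the gravitational constant is passed in as a parameter. The function returns the number of pair interactions it evaluated, which is n * (n - 1).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical body layout for illustration only.
struct Body2D { float x, y, mass, fx, fy; };

// Recomputes the net gravitational force on every body and returns the
// number of pair interactions evaluated. Assumes no two bodies share
// the same position (r would be zero).
std::size_t accumulateForces(std::vector<Body2D>& bodies, float G) {
    std::size_t interactions = 0;
    for (std::size_t i = 0; i < bodies.size(); ++i) {
        bodies[i].fx = 0.0f;
        bodies[i].fy = 0.0f;
        for (std::size_t j = 0; j < bodies.size(); ++j) {
            if (i == j) continue;
            const float dx = bodies[j].x - bodies[i].x;
            const float dy = bodies[j].y - bodies[i].y;
            const float r2 = dx * dx + dy * dy;
            const float r = std::sqrt(r2);
            const float F = G * bodies[i].mass * bodies[j].mass / r2;
            bodies[i].fx += F * (dx / r);  // project the force onto x
            bodies[i].fy += F * (dy / r);  // project the force onto y
            ++interactions;
        }
    }
    return interactions;
}
```

For 5000 bodies this inner loop body runs 5000 * 4999 = 24,995,000 times per sample, which is why each body's row of the computation is a natural unit of work for one GPU thread.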

==== Baseline ====

The profiles below were made with the following compiler command and hardware:

<pre>nvcc main.cpp timer.cpp sim\simbody.cu sim\simulation.cu -DWIN32 -O3</pre>

* i5 2500K @ 4.5 GHz

* Nvidia GTX 560Ti

* Raw computations, no graphics drawn to the screen for unbiased results.

* Random position, velocity and mass for each body.

* Brute force algorithm for calculating the forces (O(n^2)).

==== Initial Profiling ====

Initially, the serial version of this program took about 13 minutes to calculate 512 samples in a 5000-body simulation. Even with Streaming SIMD Extensions (SSE), the program took about 7 minutes to run the same test.

==== Parallelization ====

===== Basic Parallelization =====

* Turned the old serial code, where the program bottlenecked, into two separate kernels

===== Optimized Parallelization =====

* Changed the launch configuration for the kernels so that no threads are wasted (based on the device's compute capability)

* Prefetched values that don’t change throughout the loops

* Did the computations directly in the kernel to reduce function-call overhead

* Used constant memory for the gravitation constant

==== Profiles ====

===== Profile #1 =====

[[File:cosmosis_assn3_p1_1.png|border]]

[[File:cosmosis_assn3_p1_2.png|border]]

So that the difference between the pre-optimization and post-optimization code is visible, this graph does not include the serial CPU timings.

===== Profile #2 =====

Our second profile again consists of running simulations for 240 seconds to determine how many samples we achieve per second, and how many total samples we end up with after four minutes.

[[File:cosmosis_assn3_p2_1.png|border]]

Optimized GPU after four minutes.

[[File:cosmosis_assn3_p2_2.png|border]]

Naive GPU Samples after four minutes.

Compared with our previous GPU implementation, we managed to achieve a total of 188072 samples instead of 88707, roughly a 112.015% increase in the number of samples completed in four minutes. Compared with our CPU code (12359 samples), the optimized GPU code is 1421.741% faster.

==== Test Suite ====

[[File:cosmosis_assn3_test.png|border]]

During the initial stages of our optimizations, we noticed that incorrect data started showing up after some changes. To ensure that the data stayed correct after each optimization, we developed a comprehensive test suite. The test suite runs multiple tests and compares the host values (assumed to be 100% correct) against the device values. The values compared are the final positions of the bodies after a number of samples. The test suite allows a difference of up to 1.0 in each value to compensate for floating-point error.
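A comparison of this kind could look roughly like the sketch below. It is not the project's test suite: <code>Vec2</code> and <code>positionsMatch</code> are hypothetical names, and the only detail taken from the text is the 1.0 tolerance on final positions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the project's 2D vector type.
struct Vec2 { float x, y; };

// Returns true when every device position is within `tolerance` of the
// corresponding host position on both axes. The negated <= comparison
// also rejects NaN results, which would slip through a plain > check.
bool positionsMatch(const std::vector<Vec2>& host,
                    const std::vector<Vec2>& device,
                    float tolerance = 1.0f) {
    if (host.size() != device.size()) return false;
    for (std::size_t i = 0; i < host.size(); ++i) {
        if (!(std::fabs(host[i].x - device[i].x) <= tolerance) ||
            !(std::fabs(host[i].y - device[i].y) <= tolerance)) {
            return false;
        }
    }
    return true;
}
```

An absolute tolerance is simple, but it is loose for bodies near the origin and strict for bodies far away; a relative tolerance would be an alternative design.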

==== Conclusions ====

Through the use of CUDA, we managed to achieve a total speedup of 4229.33% from the serial CPU code to the final optimized GPU code. We used many basic techniques to achieve a speedup of 35.4% from the pre-optimized code to the post-optimized code. There were several parallelization techniques that could have sped the program up even further but that we did not manage to get working. One such technique was shared memory.

Our kernels access the same bodies for their calculations, so we tried to use shared memory so that the threads in a block could access them faster. This worked when the number of bodies was less than 1755 on graphics cards with a compute capability of 2.x. A body takes up 28 bytes in memory, so 1755 bodies take up 49,140 bytes, which is essentially all of the 48 KB (49,152 bytes) of shared memory that a block can use on a 2.x graphics card. There was a roundabout way of feeding the kernel chunks of bodies at a time, but it only worked on some occasions, so we ended up scrapping it.
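The shared memory budget can be checked with simple integer arithmetic. The sketch below assumes that 48K means 48 * 1024 = 49,152 bytes and that nothing else in the block uses shared memory; the helper name <code>maxBodiesInShared</code> is made up for illustration.

```cpp
#include <cstddef>

// Largest number of bodies whose storage fits in `sharedBytes` of
// shared memory when each body occupies `bodyBytes`.
// Integer division rounds down.
constexpr std::size_t maxBodiesInShared(std::size_t sharedBytes,
                                        std::size_t bodyBytes) {
    return sharedBytes / bodyBytes;
}
```

Under those assumptions <code>maxBodiesInShared(48 * 1024, 28)</code> is 1755, which lines up with the threshold reported in the text: 1755 bodies need 49,140 bytes and 1756 bodies would need 49,168 bytes.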

We initially intended to use the fast-math library provided by CUDA. At first the results were marginally faster than our previous code. After some optimizations, however, we found that our code performed better without fast-math: processing 1000 bodies for 512 samples took 0.451889 seconds with fast-math and 0.409581 seconds without it, roughly 9% less time.

==== Optimized Code ====

<pre>
void __global__ SimCalc(BodyArray a)
{
    int_fast32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < a.size) {
        const _T G = 6.67384f * pow(10.0f, -11.0f);

        //precompute positions at index
        const _T px = a.array[idx].Position.x;
        const _T py = a.array[idx].Position.y;

        //mass at the index
        const _T M_idx = G*a.array[idx].Mass;

        a.array[idx].Force = vec2_t();
        for (int_fast32_t j(0); j != a.size; ++j) {
            if (idx != j) {
                _T dx = a.array[j].Position.x - px;
                _T dy = a.array[j].Position.y - py;
                _T r = sqrt(dx*dx + dy*dy);
                _T F = (M_idx*a.array[j].Mass)/(r*r);
                a.array[idx].Force.x += F * (dx / r);
                a.array[idx].Force.y += F * (dy / r);
            }
        }
    }
}

void __global__ SimTick(BodyArray a, _T dt)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < a.size)
    {
        _T mass = a.array[idx].Mass;
        a.array[idx].Velocity.x += dt * (a.array[idx].Force.x / mass);
        a.array[idx].Velocity.y += dt * (a.array[idx].Force.y / mass);
        a.array[idx].Position.x += dt * a.array[idx].Velocity.x;
        a.array[idx].Position.y += dt * a.array[idx].Velocity.y;
    }
}
</pre>

==== Launch Control ====

We used the following calculations to determine the number of threads and blocks to launch with:

<pre>
numThreads_ = prop.maxThreadsPerMultiProcessor / maxBlocks;
numBlocks_ = (bodies_.size() + numThreads_ - 1) / numThreads_;
numThreads_ = (numThreads_ + 1) & ~1;
</pre>
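The same calculation can be written as a plain host function to see what it produces for a given device. This is a sketch, not the project's code: <code>LaunchConfig</code> and <code>computeLaunch</code> are hypothetical names, and the two device limits are passed in as parameters because the source does not show where <code>maxBlocks</code> comes from. It assumes both limits are positive.

```cpp
#include <cstddef>

struct LaunchConfig {
    int threads;  // threads per block
    int blocks;   // blocks in the grid
};

// Mirrors the three steps of the snippet above, in the same order.
LaunchConfig computeLaunch(std::size_t bodyCount,
                           int maxThreadsPerSM,
                           int maxBlocksPerSM) {
    // Split the resident threads of one multiprocessor evenly over its blocks.
    int threads = maxThreadsPerSM / maxBlocksPerSM;
    // Ceiling division: enough blocks to give every body its own thread.
    const std::size_t t = static_cast<std::size_t>(threads);
    int blocks = static_cast<int>((bodyCount + t - 1) / t);
    // Round the thread count up to the next even number.
    threads = (threads + 1) & ~1;
    return {threads, blocks};
}
```

For a compute capability 2.x device (1536 resident threads and 8 resident blocks per multiprocessor) and 5000 bodies, this yields 192 threads per block and 27 blocks, so 5184 threads are launched and only the last 184 fail the <code>idx < a.size</code> check.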

==== Dynamic Shared Memory (not implemented) ====

This is the roundabout approach we came up with: send the bodies to the kernel in chunks, so that the shared memory the kernel needs never exceeds the maximum shared memory size of the GPU:

<pre>
CHUNKSIZE = 512;
shared_ = CHUNKSIZE * sizeof(SimBody);   // dynamic shared memory per block, in bytes

while (chunks > 0)
{
    // View of CHUNKSIZE bodies starting at the current offset
    BodyArray ba = { &arr.array[index], CHUNKSIZE };
    SimCalc <<< numBlocks_, numThreads_, shared_ >>>(ba);
    cudaThreadSynchronize();
    SimTick <<< numBlocks_, numThreads_, shared_ >>>(ba, timeStep);
    cudaThreadSynchronize();
    index += CHUNKSIZE;
    --chunks;
}

// Reset the counters for the next pass
chunks = arr.size / CHUNKSIZE + 1;
index = 0;
</pre>

It handles the calculations in chunks so that the kernel can work on more than 1175 bodies on devices with compute capability 3.x.

Here is what the shared memory kernels would look like (not implemented, because they do not produce correct results):
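One detail of the chunking loop is the size of the last chunk. The expression <code>arr.size / CHUNKSIZE + 1</code> produces one extra chunk when the body count is an exact multiple of the chunk size, and every chunk is passed with the full <code>CHUNKSIZE</code> even if fewer bodies remain. The sketch below shows one way to compute the chunk boundaries on the host so that the last chunk is clamped. <code>makeChunks</code> is a hypothetical helper, not part of the project, and it does not address the larger problem that each chunk only interacts with the bodies inside it.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Splits [0, total) into consecutive (offset, count) ranges of at most
// chunkSize elements. The last range is shortened to what remains.
std::vector<std::pair<std::size_t, std::size_t>>
makeChunks(std::size_t total, std::size_t chunkSize) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (chunkSize == 0) return chunks;  // avoid an endless loop
    for (std::size_t offset = 0; offset < total; offset += chunkSize) {
        chunks.emplace_back(offset, std::min(chunkSize, total - offset));
    }
    return chunks;
}
```

For 1200 bodies and a chunk size of 512 this gives three ranges of 512, 512 and 176 bodies; for 1024 bodies it gives exactly two.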

<pre>
void __global__ SimCalc(BodyArray a)
{
    int_fast32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;
    extern __shared__ SimBody sa[];

    if (idx >= a.size)
        return;

    sa[tid] = a.array[idx];
    __syncthreads();

    const _T G = 6.67384f * pow(10.0f, -11.0f);

    //precompute positions at index
    const _T px = sa[tid].Position.x;
    const _T py = sa[tid].Position.y;

    //mass at the index
    const _T M_idx = G*sa[tid].Mass;

    sa[tid].Force = vec2_t();
    for (int_fast32_t j(0); j != a.size; ++j) {
        if (idx != j) {
            _T dx = a.array[j].Position.x - px;
            _T dy = a.array[j].Position.y - py;
            _T r = sqrt(dx*dx + dy*dy);
            _T F = (M_idx*a.array[j].Mass)/(r*r);
            sa[tid].Force.x += F * (dx / r);
            sa[tid].Force.y += F * (dy / r);
        }
        __syncthreads();
    }
    a.array[idx] = sa[tid];
}

void __global__ SimTick(BodyArray a, _T dt)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;
    extern __shared__ SimBody sa[];

    if (idx >= a.size)
        return;

    sa[tid] = a.array[idx];
    __syncthreads();

    _T mass = sa[tid].Mass;
    sa[tid].Velocity.x += dt * (sa[tid].Force.x / mass);
    sa[tid].Velocity.y += dt * (sa[tid].Force.y / mass);
    sa[tid].Position.x += dt * sa[tid].Velocity.x;
    sa[tid].Position.y += dt * sa[tid].Velocity.y;
    __syncthreads();

    a.array[idx] = sa[tid];
}
</pre>

Nbguzman
