You can see that the majority of the processing time is spent on the SQ (square) and MAX (larger of two values) calculations. Each body's calculation can be done independently and can therefore be parallelized with CUDA. The program could be sped up even further by using the Barnes-Hut algorithm, which approximates the N-Body calculation using quad trees and spatial partitioning.
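To illustrate that independence, below is a minimal sketch of how the per-body calculation could map onto a CUDA kernel. The data layout (a float4 holding x, y, z and mass), the softening constant EPS, and the kernel name computeAccelerations are assumptions made for this sketch, not details taken from our code.

<pre>
// Sketch of a brute-force N-Body acceleration kernel (assumed layout, not our actual code).
// bodies[i] packs {x, y, z, mass}; EPS is a small softening term to avoid division by zero.
#define EPS 1e-9f

__global__ void computeAccelerations(const float4* bodies, float3* accel, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 bi = bodies[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);

    // Each thread handles one body and loops over all others: O(n^2) total work,
    // but every body is independent, so the outer loop parallelizes cleanly.
    for (int j = 0; j < n; ++j) {
        float4 bj = bodies[j];
        float dx = bj.x - bi.x;
        float dy = bj.y - bi.y;
        float dz = bj.z - bi.z;
        float distSq = dx * dx + dy * dy + dz * dz + EPS;  // the "SQ" hot spot
        float invDist = rsqrtf(distSq);
        float s = bj.w * invDist * invDist * invDist;      // gravitational constant folded into mass units
        a.x += dx * s;
        a.y += dy * s;
        a.z += dz * s;
    }
    accel[i] = a;
}
</pre>

A launch such as <code>computeAccelerations<<<(n + 255) / 256, 256>>>(d_bodies, d_accel, n)</code> gives each body its own thread, while the O(n^2) inner loop remains serial within each thread.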
=== Assignment 2 - N-Body Simulation ===

==== Baseline ====
The following profiles were made under the following compiler and computer settings:

<pre>nvcc main.cpp timer.cpp sim\simbody.cu sim\simulation.cu -DWIN32 -O3</pre>

* i5 2500K @ 4.5GHz
* Nvidia GTX 560 Ti
* Neither version draws graphics to the screen, for unbiased results.
* Random position, velocity and mass for each body.
* Brute-force algorithm for calculating the forces (O(n^2)).

==== Profiles ====

===== Profile #1 =====
[[File:gpu670_cosmo_a2_prof1.png|border]]

For our initial profile we measured the time difference between the CPU and GPU implementations of the N-Body brute-force algorithm. The chart above shows the number of seconds it took to compute 512 brute-force samples of the N-Body simulation; lower is better. Our findings were quite impressive, with a 3097% speedup on 5000 bodies when simulating on the GPU. According to Amdahl's Law, a function taking up 88.46% of an application's execution time would see an ~8.5x speedup on a graphics card with 384 CUDA cores (a worked version of this estimate appears at the end of this section).

===== Profile #2 =====
Our second profile consists of running simulations for 240 seconds to determine how many samples we achieve per second, and how many total samples we end up with after four minutes.

[[File:gpu670_cosmo_a2_prof2.png|border]]

This profile shows the substantial speedup that simple parallelization can achieve. Using the CPU on Windows with SSE, we averaged about 51 samples per second, for a total of 12359 samples over the four minutes. With the GPU parallelization we averaged about 370 samples per second, for a total of 88707 samples over the same period, an average speedup of about 7.25x per sample.

==== Difficulties ====
We faced several difficulties in moving the code from being executed on the host to being executed on the GPU. One challenge came from the fact that the image-handling functions live in an external library. Because those functions could not be made callable on the device, we could not store the bodies in a thrust device vector, and so we had to take the image management out of the ''Body'' class. This was an annoying hurdle, as we had to restructure one of our classes (''Body'') and a portion of the code within other classes (''BodyManager'' and ''Game'').

Another challenge was getting nvcc to compile and link in 64-bit while using our static 32-bit SFML libraries. We ended up reverting to a dynamically linked version of SFML and a 32-bit build of our executable. This change is only temporary until we can safely and more stably compile SFML and all of its dependencies for a 64-bit architecture.

==== Useful Links ====

* SFML - [http://www.sfml-dev.org/ www.sfml-dev.org/]
* SSE - [https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions en.wikipedia.org/wiki/Streaming_SIMD_Extensions]
* N-Body - [http://www.cs.princeton.edu/courses/archive/fall07/cos126/assignments/nbody.html cs.princeton.edu/courses/archive/fall07/cos126/assignments/nbody.html]
* Grids - [http://www.resultsovercoffee.com/2011/02/cuda-blocks-and-grids.html resultsovercoffee.com/2011/02/cuda-blocks-and-grids.html]
* Repo - [https://code.google.com/p/gpu-nbody/ code.google.com/p/gpu-nbody]
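As a sanity check on the Amdahl's Law estimate quoted in Profile #1: with a parallel fraction P = 0.8846 and N = 384 processors (treating each CUDA core as an independent processor, which is only a rough model), the predicted speedup S works out as follows.

<pre>
S = 1 / ((1 - P) + P / N)
  = 1 / ((1 - 0.8846) + 0.8846 / 384)
  = 1 / (0.1154 + 0.0023)
  ≈ 8.5
</pre>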
=== Assignment 3 ===