GPU610/Cosmosis
If someone knows how to add spoiler tags, I would appreciate it if you could add them to my two groups of pictures.
 
====Clinton's Profiling Findings====
 
I decided to code my own N-Body simulator using the instructions and data found at [http://www.cs.princeton.edu/courses/archive/fall07/cos126/assignments/nbody.html cs.princeton.edu]. I have created both a Windows and a Linux version of the simulation; the Windows version supports drawing graphics to the screen while the Linux version does not. The implementation is coded in C++ and uses the brute-force algorithm to calculate the forces between every pair of bodies. While this approach is exact, its run-time is O(n^2). I have also tried to implement a basic form of SSE ([https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions Streaming SIMD Extensions]), which should increase the speed of the simulation. I will provide profiling for both the SSE and non-SSE versions.
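
To give an idea of the math involved, here is a minimal sketch of the pairwise force update described in the Princeton assignment (the struct layout and names are illustrative and may not match the actual code exactly):

<pre>
#include <cmath>

// Illustrative sketch of the per-pair force update, following the Princeton assignment.
struct Body {
    double rx, ry;   // position
    double vx, vy;   // velocity
    double fx, fy;   // accumulated force
    double mass;

    void ResetForce() { fx = 0.0; fy = 0.0; }

    // Add the gravitational force exerted on this body by body b.
    void AddForce(const Body& b) {
        const double G   = 6.67e-11;  // gravitational constant
        const double EPS = 3e4;       // softening factor to avoid division by zero
        double dx   = b.rx - rx;
        double dy   = b.ry - ry;
        double dist = std::sqrt(dx * dx + dy * dy);                  // hotspot: square root
        double F    = (G * mass * b.mass) / (dist * dist + EPS * EPS); // hotspot: gravity term
        fx += F * dx / dist;
        fy += F * dy / dist;
    }
};
</pre>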
 
The profiling seen has been run with the following properties:
*Windows: Intel i5 2500K @ 4.5 GHz
*Linux: Seneca Matrix
*No graphics drawn to the screen on either platform, for unbiased results.
*Random position, velocity, and mass for each body.
*Brute-force algorithm for calculating the forces (O(n^2)).
 
=====Profile #1: 1000 Bodies CPU Usage=====
 
During this profile, the simulations are run for 240 seconds to determine which functions have the most CPU load. Inline functions are also disabled for this test so that we can properly determine where the bottlenecks of the code are.
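
For reference, inlining can be disabled at compile time with the standard compiler flags shown below (the source file name nbody.cpp is just a placeholder):

<pre>
# g++ (Linux): optimize but keep every function out-of-line
g++ -O2 -fno-inline -o nbody nbody.cpp

# Visual C++ (Windows): /Ob0 disables inline expansion even under /O2
cl /O2 /Ob0 nbody.cpp
</pre>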
 
'''Windows:'''
 
[[File:gpu670_cfbale_prof1_1.png|border]]
 
As expected, the main source of CPU load is the function that computes the forces between the planets. This function accounts for 88.46% of the application's CPU time over the four-minute run. The hotspots within the AddForces function can be seen here:
 
[[File:gpu670_cfbale_prof1_2.png|border]]
 
This picture shows that the majority of the computation comes from the square-root call and from applying the gravitational constant, both of which require heavy floating-point math.
 
'''Linux:'''
 
Linux shows very similar results for 1000 elements: most, if not all, of the CPU usage goes to the AddForces function. With just over 3.7 billion calls to AddForces, the slowdown of the O(n^2) run-time is immediately visible.
 
[[File:gpu670_cfbale_prof1_3.png|border]]
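
For anyone reproducing the Linux numbers, call counts like these typically come from gprof; assuming gprof is the profiler (an assumption) and nbody.cpp a placeholder file name, the workflow looks roughly like this:

<pre>
# compile with profiling instrumentation and inlining disabled, run, then read the flat profile
g++ -O2 -fno-inline -pg -o nbody nbody.cpp
./nbody
gprof ./nbody gmon.out | less
</pre>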
 
=====Profile #2: 1000 Bodies Timing (SSE and non-SSE)=====
 
For this test, the simulations are run for 240 seconds to determine the amount of time it takes to calculate one "sample", which is one whole brute-force pass over all the bodies.
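
A sketch of the kind of timing loop meant here, reusing the Body sketch from above (std::chrono, names illustrative, not the exact measurement code):

<pre>
#include <chrono>
#include <cstdio>

// Run brute-force samples for a fixed number of seconds and report each sample's time.
void RunTimed(Body* bodies, int n, double seconds)
{
    using Clock = std::chrono::steady_clock;
    auto start = Clock::now();
    long samples = 0;

    while (std::chrono::duration<double>(Clock::now() - start).count() < seconds) {
        auto t0 = Clock::now();
        for (int i = 0; i < n; ++i) {          // one "sample" = full O(n^2) pass
            bodies[i].ResetForce();
            for (int j = 0; j < n; ++j)
                if (i != j) bodies[i].AddForce(bodies[j]);
        }
        auto t1 = Clock::now();
        ++samples;
        std::printf("sample %ld took %.2f ms\n", samples,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
}
</pre>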
 
'''Windows (non-SSE):'''
 
[[File:gpu670_cfbale_prof2_1.png|border]]
 
This screenshot shows the simulation running in the console; I added some output to show exactly how much work is being processed per second. Above you can see that on average it takes about 19.89 ms to process one sample (one full brute-force calculation), which works out to roughly 50 samples per second. Over the entire four-minute test, my Windows machine executed 12067 samples.
 
'''Linux (non-SSE):'''
 
[[File:gpu670_cfbale_prof2_2.png|border]]
 
On Seneca's Matrix server, the results are surprisingly slow: about half the speed of my Windows machine, even with full optimizations enabled. You can see that in the time it took my machine to run 12067 samples, Matrix completed only 6359.
 
'''Windows (SSE):'''
 
I rewrote some of the functions for calculating the forces on the bodies to use SSE code. This is the first time I have ever written SSE code, so it may not be properly optimized for what it's doing. The performance increase of my implementation is negligible, but I'm sure that if I had more knowledge of the SSE architecture the difference would be much more noticeable.
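
The rewrite itself is not shown here, but the general idea is sketched below, assuming the body data is rearranged into single-precision structure-of-arrays form (px, py, m, fx, fy); this is only an illustration of the technique, not the exact code:

<pre>
#include <xmmintrin.h>  // SSE intrinsics

// Accumulate the total force on body i from all n bodies, four at a time.
// Assumes n is a multiple of 4.
void AddForcesSSE(int i, int n, const float* px, const float* py,
                  const float* m, float* fx, float* fy)
{
    const __m128 G    = _mm_set1_ps(6.67e-11f);
    const __m128 EPS2 = _mm_set1_ps(3e4f * 3e4f);   // softening factor, squared
    const __m128 xi   = _mm_set1_ps(px[i]);
    const __m128 yi   = _mm_set1_ps(py[i]);
    const __m128 mi   = _mm_set1_ps(m[i]);
    const __m128 one  = _mm_set1_ps(1.0f);

    __m128 fxAcc = _mm_setzero_ps();
    __m128 fyAcc = _mm_setzero_ps();

    for (int j = 0; j < n; j += 4) {
        __m128 dx = _mm_sub_ps(_mm_loadu_ps(&px[j]), xi);
        __m128 dy = _mm_sub_ps(_mm_loadu_ps(&py[j]), yi);
        __m128 d2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy)), EPS2);

        // s = G * m[i] * m[j] / d2^(3/2); the softening keeps d2 > 0, and the
        // j == i lane contributes nothing because dx and dy are both 0 there.
        __m128 inv = _mm_div_ps(one, _mm_mul_ps(d2, _mm_sqrt_ps(d2)));
        __m128 s   = _mm_mul_ps(_mm_mul_ps(G, _mm_mul_ps(mi, _mm_loadu_ps(&m[j]))), inv);

        fxAcc = _mm_add_ps(fxAcc, _mm_mul_ps(s, dx));
        fyAcc = _mm_add_ps(fyAcc, _mm_mul_ps(s, dy));
    }

    // Horizontal sum of the four partial accumulators.
    float tx[4], ty[4];
    _mm_storeu_ps(tx, fxAcc);
    _mm_storeu_ps(ty, fyAcc);
    fx[i] = tx[0] + tx[1] + tx[2] + tx[3];
    fy[i] = ty[0] + ty[1] + ty[2] + ty[3];
}
</pre>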
 
[[File:gpu670_cfbale_prof2_3.png|border]]
 
After the rewrite of my calculation functions, I only gained about a 2.5% increase in speed, which is definitely not worth it.
 
'''Linux (SSE):'''
 
To enable SSE on Linux, you can use g++'s built-in compiler flags to generate SSE instructions automatically:
 
<pre>-march=native -mfpmath=sse</pre>
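
For example, a full compile line (with nbody.cpp as a placeholder file name) would be:

<pre>
g++ -O2 -march=native -mfpmath=sse -o nbody nbody.cpp
</pre>

Note that on 64-bit builds -mfpmath=sse is already the default, so most of any gain likely comes from -march=native.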
 
[[File:gpu670_cfbale_prof2_4.png|border]]
 
Enabling this gave me a small performance boost of about 5 samples per second, increasing my total sample count over four minutes from 6359 to 7468. That is about a 15% increase in speed from just adding two compiler flags, not bad.
 
=====Profile #3: 512 Samples 1000-5000 Bodies Comparison=====
 
For this final profile, I compared the timing of Linux and Windows. I include the Linux SSE and non-SSE versions, but only the standard Windows implementation, since the speed increase from my SSE version is next to nothing. The test measured how many seconds it took to compute 512 brute-force samples of the N-Body simulation; lower is better.
 
[[File:gpu670_cfbale_prof3_1.png|border]]
 
=====Parallelizable?=====
 
For my 2D N-Body simulation, it is easy to spot the section of code where parallelization would give massive speedups. Since the processor executes these calculations serially, the following double for loop is the cause of most of the delay in the application:
 
[[File:gpu670_cfbale_last.png|border]]
 
If I were to parallelize this code using CUDA, I would put the ResetForce and AddForce function calls into their own threads on the GPU, so that instead of computing all the forces sequentially, they would all be computed at once.
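
One common mapping, shown here only as a sketch (the kernel name, array layout, and launch parameters are all assumptions, not the actual plan), is one GPU thread per body:

<pre>
// One thread per body: each thread resets its body's force and then
// accumulates the contribution of every other body.
__global__ void AddForcesKernel(int n, const float* px, const float* py,
                                const float* m, float* fx, float* fy)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float G    = 6.67e-11f;
    const float EPS2 = 3e4f * 3e4f;   // softening factor, squared

    float fxi = 0.0f, fyi = 0.0f;     // ResetForce happens implicitly here
    for (int j = 0; j < n; ++j) {
        float dx = px[j] - px[i];
        float dy = py[j] - py[i];
        float d2 = dx * dx + dy * dy + EPS2;
        float s  = G * m[i] * m[j] / (d2 * sqrtf(d2));  // softening makes j == i contribute 0
        fxi += s * dx;
        fyi += s * dy;
    }
    fx[i] = fxi;
    fy[i] = fyi;
}

// Example launch: AddForcesKernel<<<(n + 255) / 256, 256>>>(n, px, py, m, fx, fy);
</pre>

Each thread still loops over all n bodies, but the O(n^2) work is spread across thousands of GPU threads instead of a single CPU core.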
=== Assignment 2 ===
=== Assignment 3 ===