Initially, the serial version of this program took about 13 minutes to calculate 512 samples in a 5000-body simulation. Even with Streaming SIMD Extensions (SSE), the program took about 7 minutes to run the same test.
==== Parallelization ====
===== Basic Parallelization =====
* Turned the old serial code where the program bottlenecked into two separate kernels
===== Optimized Parallelization =====
* Changed the launch configuration for the kernels so that there were no wasted threads (based on the device's compute capability)
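The idea behind the launch-configuration change can be sketched on the host side. The following is a minimal illustration, not the project's actual code: the <code>LaunchConfig</code> type and <code>choose_launch_config</code> function are hypothetical names, and it assumes one thread per body. In a real CUDA program the two limits would normally come from <code>cudaGetDeviceProperties</code> (the <code>maxThreadsPerBlock</code> and <code>warpSize</code> fields) rather than being passed in by hand.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type; not taken from the project source.
struct LaunchConfig {
    std::size_t blocks;            // number of thread blocks in the grid
    std::size_t threads_per_block; // threads in each block
    std::size_t idle_threads;      // threads launched beyond the body count
};

// Assumes one thread per body. Tries every block size that is a multiple of
// warp_size up to max_threads_per_block and keeps the one that leaves the
// fewest idle threads. Ties go to the larger block size.
inline LaunchConfig choose_launch_config(std::size_t n,
                                         std::size_t max_threads_per_block,
                                         std::size_t warp_size = 32) {
    assert(n > 0 && warp_size > 0 && max_threads_per_block >= warp_size);
    LaunchConfig best{0, 0, 0};
    for (std::size_t t = warp_size; t <= max_threads_per_block; t += warp_size) {
        std::size_t blocks = (n + t - 1) / t; // ceiling division
        std::size_t idle = blocks * t - n;    // threads with no body to process
        if (best.blocks == 0 || idle <= best.idle_threads) {
            best = {blocks, t, idle};
        }
    }
    return best;
}
```

For the 5000-body case with a 1024-thread limit, this sketch picks 157 blocks of 32 threads, leaving 24 idle threads in the last block. Minimizing idle threads is only one criterion; small blocks can reduce occupancy, so the result should be treated as a starting point to profile.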
===== Profile #1 =====
[[File:cosmosis_assn3_p1_1.png|border]]
[[File:cosmosis_assn3_p1_2.png|border]]
To make the difference between the pre- and post-optimization code visible, this graph does not include the serial CPU timings.
Our second profile again consists of running simulations for 240 seconds to determine how many samples we achieve per second and how many total samples we end up with after four minutes.
[[File:cosmosis_assn3_p2_1.png|border]]
Optimized GPU samples after four minutes.
[[File:cosmosis_assn3_p2_2.png|border]]
Naive GPU samples after four minutes.
Compared with the previous (naive) GPU implementation, the optimized version completed a total of 188072 samples versus 88707, roughly a 112.015% increase in the number of samples completed in four minutes. Compared with our CPU code, the optimized GPU code is 1421.741% faster.
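The GPU-to-GPU percentage can be reproduced from the two sample totals. The sketch below shows the arithmetic, with <code>percent_increase</code> and <code>samples_per_second</code> as illustrative helper names that are not part of the project.

```cpp
#include <cassert>

// Relative increase of `after` over `before`, expressed in percent.
inline double percent_increase(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

// Average throughput over a fixed wall-clock run.
inline double samples_per_second(double total_samples, double run_seconds) {
    assert(run_seconds > 0.0);
    return total_samples / run_seconds;
}
```

Calling <code>percent_increase(88707, 188072)</code> gives approximately 112.015, matching the figure above. Over the 240-second run, the totals correspond to about 784 samples per second for the optimized kernels and about 370 for the naive ones.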
==== Test Suite ====
[[File:cosmosis_assn3_test.png|border]]
During the initial stages of our optimizations, we noticed that incorrect data started showing up after some changes. To ensure that the data remained correct after each optimization, we developed a comprehensive test suite. The suite runs multiple tests and compares the host values (assumed to be 100% correct) with the device values, using each body's final position after a number of samples. It allows a difference of up to 1.0 between host and device values to compensate for floating-point error.
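A comparison of this kind might look like the following sketch. It is an assumption about the shape of the check, not the suite's actual code: the <code>Vec3</code> type and the <code>first_mismatch</code> function are hypothetical, and the comparison is done per component, which the description above does not specify.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical position type; the project's own body layout may differ.
struct Vec3 { float x, y, z; };

// True when a and b differ by more than tolerance. Written as !(... <= ...)
// so that a NaN on either side also counts as a mismatch.
inline bool differs(float a, float b, float tolerance) {
    return !(std::fabs(a - b) <= tolerance);
}

// Returns the index of the first body whose host and device positions differ
// by more than `tolerance` in any component, or -1 if every body agrees.
inline long first_mismatch(const std::vector<Vec3>& host,
                           const std::vector<Vec3>& device,
                           float tolerance = 1.0f) {
    assert(host.size() == device.size());
    for (std::size_t i = 0; i < host.size(); ++i) {
        if (differs(host[i].x, device[i].x, tolerance) ||
            differs(host[i].y, device[i].y, tolerance) ||
            differs(host[i].z, device[i].z, tolerance)) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Returning the index of the first failing body, rather than a plain pass or fail, makes it easier to see which body diverged after an optimization. An absolute tolerance of 1.0 is loose for small coordinates and tight for very large ones, so a relative tolerance may be worth considering if positions span many orders of magnitude.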