GPU610/SSD
Drive_God_Lin
We are going to be working on a CERN project code named Drive_God_Lin.
Summery:
SUSSIX - Frequency Analysis of Non-Linear Betatron Motion
The SUSSIX program has been developed to postprocess tracking or experimental data via frequency analysis.
This code performs harmonic analysis of turn-by-turn data from a circular acceleration (measured data or from numerical simulation).
It is heavily used in the optics re-construction for the LHC( Large Hadron Collider) and the same tools are being exported to the lower-energy machines at CERN.
SUSSIX is a FORTRAN program for the post processing of turn-by-turn BPM (beat per minute) data, which computes the frequency, amplitude, and phase of tunes and resonant lines to a high degree of precision through the use of an interpolated FFT (fast Fourier transform).
For analysis of LHC BPM data a specific version sussix4drive (the FORTRAN file code), run through the C steering code Drive God lin (the C file code), has been implemented in the CCC by the beta-beating team
To see the whole Document click here
Team Members
Progress
Assignment 1
Sezar:
I decided to profile CERN project - Drive_God_Lin. After gprof the project with the test data provided I learned the following:
System specifications:
OS: Xubuntu (Ubuntu 12.10 (quantal)) 36-bit CUP: Intel(R) Pentium(R) 4 CPU 3.00GHz RAM: 1GB DDR2 GPU: nVIDIA GEFORCE GT 620
(summery of gprof)
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 39.98 26.68 26.68 524 50.92 50.92 ordres_ 37.42 51.65 24.97 314400 0.08 0.08 zfunr_ 12.69 60.12 8.47 314400 0.03 0.03 cfft_ ...
As you can see, most the program's time is spent in the ordres() subroutine and zfunr() subroutine; both are localed in the fortran portion of the program (there is a c portion as well).
Furthermore, this program is already parallelized using OpenMP; which means it may be farther parallelized using cuda technology.
In these two subroutines (specially in orders()) there are a lot of nested loops, if statements, goto's, and even some sore of a search algorithm (in orders() only) that I'm certain it could be notably improved using cuda technology.
The project is on github here
Dylan:
I profiled CERN's Drive_God_Lin application.
The resulting profile is generated using gprof.
% cumulative self self total time seconds seconds calls ms/call ms/call name 49.47 10.34 10.34 314400 0.03 0.03 zfunr_ 28.76 16.35 6.01 524 11.47 11.47 ordres_ 10.81 18.61 2.26 314400 0.01 0.01 cfft_ 3.30 19.30 0.69 314400 0.00 0.04 tunelasr_ 3.06 19.94 0.64 1048 0.61 13.63 spectrum_ 1.67 20.29 0.35 314400 0.00 0.00 calcr_
All other functions took less than 1% of the application's run time,
so I am not considering them when assessing which parts of the application need to be optimized,
as the speed increase would be negligible.
In total, the application ran for 20.90 seconds using 1 core.
The specifications of the system being used to generate this profile was:
Ubuntu 12.04 LTS 32-bit Intel® Core™2 Quad CPU Q9550 @ 2.83GHz × 4 3.9 GB Memory NVIDIA GTX 480
The 3 most problematic functions are zfunr_ , ordres_ , and cfft.
1. zfunr_ takes up almost 49.5% of the application processing time, and it is also called 314,400 times.
It is possible that this time can be significantly increased using a many-core GPU to run hundreds of threads simultaneously.
2. cfft_ is called just as much as the zfunr_ function, but it takes significantly less time in the application. It is likely
that this function can be optimized to use the GPU, however the speed increase may be minimal compared to the zfunr_ function.
3. The ordres_ function takes 28.76% of the application time, but is only called 524 times throughout the application.
This subroutine is used to order harmonics, and is incredibly long. Depending on the ordering algorithm used in this subroutine,
it may be possible to optimize the serial performance of the subroutine before further optimizing it to use the GPU.
4. Many other functions are called thousands of times throughout the application. It may be possible to optimize some of these
to attain a minuscule speed increase, assuming the memory read/write rate between the GPU and CPU doesn't increase run-time instead.
However it is clear that these functions are not priority.
It is important to note that the application currently uses OpenMP,
so multi-core processing can be used. When set to utilize 12 threads,
the program finishes in around 5 seconds.
I have forked a copy of the code into my own GitHub repository here
Stephanie:
For assignment one I profiled a closest pair algorithm. The source code can be found here:
http://rosettacode.org/wiki/Closest-pair_problem/C
I was able to run the code successfully on Matrix and I believe it is a good candidate to parallelize. The closest() function generally eats up the most time and I believe that most of its remedial processes could be done by the GPU.
A useful reference site to help explain closest pair algorithms:
http://www.algorithmist.com/index.php/Closest_Pair
Assignment 2
Progress so far:
- Fortran code converted to C with a simple conversion program found Here
- The Project is on Github(with the modified make file and c converted code)
The profile with the C converted code
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 38.75 19.47 19.47 314400 0.06 0.06 zfunr_ 30.04 34.56 15.09 524 28.80 28.80 ordres_ 9.83 39.50 4.94 f__cabs
As you can see there aren't any real difference/improvements just yet...
System specifications:
OS: Xubuntu (Ubuntu 12.10 (quantal)) 36-bit CUP: Intel(R) Pentium(R) 4 CPU 3.00GHz RAM: 1GB DDR2 GPU: nVIDIA GEFORCE GT 620
Closest Pair
Unfortunately we had to abandon the project since it was not really compatible with CUDA(there is no time to look more into it as the semester is almost over...). Team TudyBert, who are also working on the same project, have found that CUDA uses dynamic libraries where Dirve_God_Lin uses static libraries and there isn’t much to be done about that. As a result, we will be using Stephanie’s closet pair program.
The closest pair problem can be explained simply by imagining a random set of points spread in a metric space. Now the process of finding two points with the smallest distance between them is called the closest pair problem.
One of the more common algorithms used to find the closest pair is the Brute-force algorithm; which is calculating the distances of all the points ( O(n^2) notation):
n=total number of points n(n-1)/2
then simply look for the pair of points that has the smallest distance between each other. However, this algorithm was evident to be slow.
This is the function used in the closestPair.c program that Stephanie used for her assignment 1:
double brute_force(point* pts, int max_n, point *a, point *b) { int i, j; double d, min_d = MAXDOUBLE; for (i = 0; i < max_n; i++) { for (j = i + 1; j < max_n; j++) { d = dist(pts[i], pts[j]); if (d >= min_d ) continue; *a = pts[i]; *b = pts[j]; min_d = d; } } return min_d; }
This is the profile:
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 99.78 41.58 41.58 16384 2.54 2.54 brute_force 0.14 41.64 0.06 closest 0.05 41.66 0.02 cmp_x 0.02 41.67 0.01 cmp_y
System specifications:
OS: Ubuntu 12.04 32-bit CPU: Intel(R) Core 2 Duo(R) CPU E4600 @ 2.40GHz x 2 RAM: 2GB DDR2 GPU: nVIDIA GEFORCE GT 620
As a consequence, we presume that by making the brute force function parallel using CUDA technology will speed up the process significantly.
Github link here