# $ LD_PROFILE=libLolaBunny.so ./RabbitCTRunner [LOCATION OF MODULE] [LOCATION OF DATA-SET] [REPORT FILE] [VOLUME SIZE] <br />'''Example:''' LD_PROFILE=libLolaBunny.so ./RabbitCTRunner ../modules/LolaBunny/libLolaBunny.so ~/datasets/rabbitct_512-v2.rctd ./resultFile 128
# $ sprof -p [LOCATION OF MODULE] [LOCATION OF PROFILE FILE] > log <br />'''Example:''' sprof -p ../modules/LolaBunny/libLolaBunny.so /var/tmp/libLolaBunny.so.profile > log <br />'''Note:''' /var/tmp/ was the default location for profiles on my system. See <code>man ld.so</code> for LD_PROFILE_OUTPUT, which sets a different directory.
You should now have a log file in your current directory, which contains a flat profile of the module:
<pre>
[prasanth@localhost RabbitCTRunner]$ cat log
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
100.00    112.58    112.58        0     0.00           RCTAlgorithmBackprojection
</pre>
This is the output from RabbitCTRunner:
<pre>
[prasanth@localhost RabbitCTRunner]$ LD_PROFILE=libLolaBunny.so ./RabbitCTRunner ../modules/LolaBunny/libLolaBunny.so ~/datasets/rabbitct_512-v2.rctd ./resultFile 128
RabbitCT runner http://www.rabbitct.com/
Info: using 4 buffer subsets with 240 projections each.
Running ... this may take some time.
(\_/)
(='.'=)
(")_(")
--------------------------------------------------------------
Quality of reconstructed volume:
Root Mean Squared Error: 38914.3 HU
Mean Squared Error: 1.51433e+09 HU^2
Max. Absolute Error: 65535 HU
PSNR: -19.5571 dB
--------------------------------------------------------------
Runtime statistics:
Total: 112.627 s
Average: 117.32 ms
</pre>
======Summary======
The flat profile shows that 100% of the sampled time is spent in the 'RCTAlgorithmBackprojection' function. Digging into the source code, this is the actual code of that function:
''LolaBunny.cpp''
<pre>
FNCSIGN bool RCTAlgorithmBackprojection(RabbitCtGlobalData* r)
{
    unsigned int L   = r->L;
    float        O_L = r->O_L;
    float        R_L = r->R_L;
    double*      A_n = r->A_n;
    float*       I_n = r->I_n;
    float*       f_L = r->f_L;

    s_rcgd = r;

    for (unsigned int k=0; k<L; k++)
    {
        double z = O_L + (double)k * R_L;
        for (unsigned int j=0; j<L; j++)
        {
            double y = O_L + (double)j * R_L;
            for (unsigned int i=0; i<L; i++)
            {
                double x = O_L + (double)i * R_L;

                double w_n =  A_n[2] * x + A_n[5] * y + A_n[8] * z + A_n[11];
                double u_n = (A_n[0] * x + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n;
                double v_n = (A_n[1] * x + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n;

                f_L[k * L * L + j * L + i] += (float)(1.0 / (w_n * w_n) * p_hat_n(u_n, v_n));
            }
        }
    }

    return true;
}
</pre>
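We can see that the function consists of three nested for loops, each running L times. In Big-O notation the order of growth for this function is O(L<sup>3</sup>) in the volume size L. Each iteration multiplies one voxel position by the projection matrix A_n in double precision and, as far as I can tell, updates only its own element of f_L, so the iterations do not depend on one another. Therefore I think this code can be optimized using CUDA.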
These results were measured on:
<pre>
Lenovo T400 laptop
Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
4GB 1066 MHz Memory
Fedora Release 18 (Spherical Cow) 64-bit
Kernel: 3.6.10-4.fc18.x86_64
</pre>
Running this on our GTX 480 should yield better results (hopefully).
====[[User:Leo_Turalba | Leo]]====