= Team Sonic =
[[Image:sonicteam.png|right|frameless|727px| ]]
== Members ==
# [mailto:pvaaheeswaran@senecacollege.ca?subject=gpu610 Prasanth Vaaheeswaran]
# [mailto:dlev1@myseneca.ca?subject=GPU610 Daniel Lev]
<!-- # [mailto:vturalba@myseneca.ca?subject=GPU610 Leo Turalba] -->
:[mailto:pvaaheeswaran@senecacollege.ca,dlev1@myseneca.ca,vturalba@myseneca.ca?subject=gpu610 Email All]
== About ==
== Required Files ==
#[http://www5.cs.fau.de/research/projects/rabbitct/download/ RabbitCT]
#[http://www5.cs.fau.de/fileadmin/Forschung/Software/RabbitCT/download/rabbitct_512-v2.rctd Dataset (included problem sizes: 128, 256, 512)]
== Progress ==
=== Assignment 1 ===
The back-projection routine contains a nested loop: three <code>for</code> loops, one per volume axis. In Big-O notation its order of growth is O(N&sup3;). It also performs double-precision arithmetic and matrix-style multiplications independently per voxel, so we believe this code can be optimized using CUDA.
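The cost of that loop nest can be sketched in plain C++ (a simplified stand-in, not the project's actual source): every voxel of an L &times; L &times; L volume is visited once per projection, which is where the O(N&sup3;) growth comes from.

```cpp
#include <cstddef>

// Simplified stand-in for the RCTAlgorithmBackprojection loop nest:
// one pass touches every voxel of an L x L x L volume exactly once.
std::size_t count_voxel_updates(std::size_t L) {
    std::size_t updates = 0;
    for (std::size_t k = 0; k < L; ++k)          // depth
        for (std::size_t j = 0; j < L; ++j)      // row
            for (std::size_t i = 0; i < L; ++i)  // column
                ++updates;                       // one back-projection per voxel
    return updates;
}
```

Doubling L multiplies the work by eight, which is why CPU runtimes grow so quickly with problem size.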
These results here were calculated on a:
<pre>
Lenovo T400 laptop
Intel® Core™2 Duo CPU P8600 @ 2.40GHz × 2
4GB 1066 MHz Memory
Fedora Release 18 (Spherical Cow) 64-bit
Kernel: 3.6.10-4.fc18.x86_64
</pre>
We expect that running this on our GTX 480 will yield significantly better results.
<!--
====[[User:Leo_Turalba | Leo]]====
=====Topic=====
Im looking at fractal maps right now. Julia and Mandelbrot set. Im trying to figure out how to do the Julia set because last semester, one of the assignment is for Mandelbrot Set.
=====Updates=====
-->
=== Assignment 2 ===
=====Background=====
In Assignment 1 we identified that the 'RCTAlgorithmBackprojection' function, located in the DLL/shared library, was taking up virtually 100% of the execution time. Our goal for this assignment was to off-load that logic to the GPU using the CUDA API, leaving everything else intact.
=====Process=====
The work we did can be separated into three basic steps:
#Compile base software from source (RabbitCTRunner.exe and dll)
#Write cuda kernel to replace 'RCTAlgorithmBackprojection' in dll
#Integrate cuda kernel in to RabbitCTRunner
Using preprocessor directives, RabbitCTRunner was fitted with a token named GPU. If GPU is defined, our kernel and all the supporting code are compiled into RabbitCTRunner; otherwise the original code path is built. This kept the footprint of our changes small and lets us rebuild the original RabbitCTRunner simply by removing the flag.
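The toggle can be illustrated with a minimal sketch (the function and return values here are ours, for illustration only, not the project's actual code):

```cpp
#include <string>

#define GPU  // comment this line out to rebuild the original, CPU-only runner

// Hypothetical illustration of the compile-time switch described above.
std::string active_path() {
#ifdef GPU
    return "cuda";  // our kernel and supporting code are compiled in
#else
    return "cpu";   // the untouched RCTAlgorithmBackprojection path
#endif
}
```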
======Kernel======
<pre>
__global__
void Backprojection(float* gpu_matrix_result, RabbitCtGlobalData* r) // problem_size is r->L
{
    unsigned int L   = r->L;
    float        O_L = r->O_L;
    float        R_L = r->R_L;
    double*      A_n = r->A_n;
    //float* I_n = r->I_n;
    //float* f_L = r->f_L;
    // for optimization, put these ^ in shared memory.

    unsigned int col   = blockDim.x * blockIdx.x + threadIdx.x; // the original "i" iterator
    unsigned int row   = blockDim.y * blockIdx.y + threadIdx.y; // the original "j" iterator
    unsigned int depth = blockDim.z * blockIdx.z + threadIdx.z; // the original "k" iterator

    if (row < L && col < L && depth < L)
    {
        unsigned int final_index = L * L * depth + L * row + col;

        double z = O_L + (double)depth * R_L;
        double y = O_L + (double)row   * R_L;
        double x = O_L + (double)col   * R_L;

        double w_n =  A_n[2] * x + A_n[5] * y + A_n[8] * z + A_n[11];
        double u_n = (A_n[0] * x + A_n[3] * y + A_n[6] * z + A_n[9] ) / w_n;
        double v_n = (A_n[1] * x + A_n[4] * y + A_n[7] * z + A_n[10]) / w_n;

        // p_hat_n (bilinear interpolation) inlined:
        int i = (int)u_n;
        int j = (int)v_n;
        double alpha = u_n - i;
        double beta  = v_n - j;
        double p_hat_n_result =
              (1.0 - alpha) * (1.0 - beta) * p_n(i    , j    )
            +        alpha  * (1.0 - beta) * p_n(i + 1, j    )
            + (1.0 - alpha) *        beta  * p_n(i    , j + 1)
            +        alpha  *        beta  * p_n(i + 1, j + 1);

        gpu_matrix_result[final_index] += (float)(1.0 / (w_n * w_n) * p_hat_n_result);
        // this statement computes final_index's address twice because of +=
    }
}
</pre>
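The inlined p_hat_n computation is ordinary bilinear interpolation. A host-side C++ rendering (with a hypothetical row-major image array standing in for the p_n lookup) makes the four weights explicit:

```cpp
// Bilinear interpolation as inlined in the kernel above. `p` is a
// row-major image of width W, standing in for the p_n(i, j) lookup.
double bilinear(const double* p, int W, double u, double v) {
    int i = (int)u, j = (int)v;  // integer pixel coordinates
    double alpha = u - i;        // horizontal fractional part
    double beta  = v - j;        // vertical fractional part
    return (1.0 - alpha) * (1.0 - beta) * p[ j      * W + i    ]
         +        alpha  * (1.0 - beta) * p[ j      * W + i + 1]
         + (1.0 - alpha) *        beta  * p[(j + 1) * W + i    ]
         +        alpha  *        beta  * p[(j + 1) * W + i + 1];
}
```

The four weights always sum to 1, so the result is a convex blend of the four neighbouring pixels.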
======Memory Allocation======
<pre>
#ifdef GPU
    float* gpu_matrix;
    cudaMalloc((void**)&gpu_matrix, numel_vol * sizeof(float));
    cudaMemcpy(gpu_matrix, rctgdata.f_L, numel_vol * sizeof(float), cudaMemcpyHostToDevice);

    const int threads_per_block = 16;
    // round up so the grid always covers the whole volume
    const int num_of_blocks = (problem_size + threads_per_block - 1) / threads_per_block;

    // NOTE: 16 x 16 x 16 = 4096 threads per block exceeds the 1024-thread
    // per-block limit of the GTX 480; 8 per axis (512 threads) is a safe choice.
    dim3 block(threads_per_block, threads_per_block, threads_per_block);
    dim3 grid(num_of_blocks, num_of_blocks, num_of_blocks);
#endif
</pre>
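A worked example of the launch geometry: with problem size 128 and 16 threads per block along each axis, the rounded-up division gives 8 blocks per axis, i.e. an 8 &times; 8 &times; 8 grid of blocks covering the volume.

```cpp
// Ceiling division used above to size the grid so that it covers the
// whole volume even when the volume size is not a multiple of the
// block width.
int blocks_per_axis(int problem_size, int threads_per_block) {
    return (problem_size + threads_per_block - 1) / threads_per_block;
}
// e.g. blocks_per_axis(128, 16) == 8, blocks_per_axis(130, 16) == 9
```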
======Execution Config======
<pre>
#ifdef GPU
    // NOTE: rctgdata lives in host memory; the struct (and its A_n array)
    // must be copied to device-accessible memory before the kernel can
    // safely dereference it.
    Backprojection<<<grid, block>>>(gpu_matrix, &rctgdata);
    // Kernel launches are asynchronous; timing the kernel itself requires
    // a cudaDeviceSynchronize() after the launch.
#else
    // ... original CPU implementation ...
</pre>
=====Metrics=====
Original RabbitCTRunner with volume size 128:
<pre>
C:\Users\pvaaheeswaran\Desktop\cuda rabbit>Rabbit.exe LolaBunny.dll rabbitct_512-v2.rctd original 128
RabbitCT runner http://www.rabbitct.com/
Info: using 4 buffer subsets with 240 projections each.
Running ... this may take some time.
(\_/)
(='.'=)
(")_(")
--------------------------------------------------------------
Quality of reconstructed volume:
Root Mean Squared Error: 38914.3 HU
Mean Squared Error: 1.51433e+009 HU^2
Max. Absolute Error: 65535 HU
PSNR: -19.5571 dB
--------------------------------------------------------------
Runtime statistics:
Total: 339.759 s
Average: 353.915 ms
FULL RUNTIME: 340.507 secs
C:\Users\pvaaheeswaran\Desktop\cuda rabbit>
</pre>
CudaRabbit, volume size 128.
<pre>
C:\Users\pvaaheeswaran\Desktop\cuda rabbit>CudaRabbit.exe LolaBunny.dll rabbitct_512-v2.rctd result 128/
RabbitCT runner http://www.rabbitct.com/
Info: using 4 buffer subsets with 240 projections each.
Running ... this may take some time.
(\_/)
(='.'=)
(")_(")
--------------------------------------------------------------
Quality of reconstructed volume:
Root Mean Squared Error: 38914.3 HU
Mean Squared Error: 1.51433e+009 HU^2
Max. Absolute Error: 65535 HU
PSNR: -19.5571 dB
--------------------------------------------------------------
Runtime statistics:
Total: 0.004881 s
Average: 0.00508437 ms
FULL RUNTIME: 0.512 secs
C:\Users\pvaaheeswaran\Desktop\cuda rabbit>
</pre>
=====Graph=====
This graph shows results for volume size 128 only.
[[Image:sonicgraph1.png|left|frame]]
=====Summary=====
As you can see, using CUDA improved performance by roughly 99.85%: the full runtime dropped from over 5 minutes to under a second. However, this code still needs to be optimized further and run on larger volume sizes, which will be our next goal.
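For reference, the ~99.85% figure comes directly from the two FULL RUNTIME values logged above (340.507 s original, 0.512 s CUDA):

```cpp
// Percentage reduction in runtime, using the FULL RUNTIME figures
// from the two logs above: 340.507 s (original) vs 0.512 s (CUDA).
double reduction_percent(double before, double after) {
    return (before - after) / before * 100.0;
}
// reduction_percent(340.507, 0.512) is roughly 99.85, i.e. a ~665x speedup
```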
=== Assignment 3 ===