Skynet/GPU610

2,794 bytes added, 15:30, 3 December 2014
====Optimizations Used====
'''__device__ __host__''' : we used this so that parts of the Vec3 and Sphere classes could be accessed by both the host and the device where necessary.
<syntaxhighlight lang="cpp">
class Vec3
{
public:
float x, y, z;
__device__ __host__ Vec3() : x(float(0)), y(float(0)), z(float(0)) {}
__device__ __host__ Vec3(float xx) : x(xx), y(xx), z(xx) {}
__device__ __host__ Vec3(float xx, float yy, float zz) : x(xx), y(yy), z(zz) {}
__device__ __host__ Vec3& normalize()
{
...
</syntaxhighlight>
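As a minimal sketch of why the dual qualifier helps (not the project's actual class), the version below wraps the qualifiers in a macro so the same class compiles under nvcc and under a plain C++ compiler. The <code>HD</code> macro, the <code>dot</code> and <code>length</code> members, and the body of <code>normalize</code> are assumptions added for illustration.
<syntaxhighlight lang="cpp">
#include <math.h>

// HD is a hypothetical convenience macro, not part of the original project.
// Under nvcc it marks a member as callable from host and device code;
// under a plain C++ compiler it expands to nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

class Vec3
{
public:
    float x, y, z;
    HD Vec3() : x(0.0f), y(0.0f), z(0.0f) {}
    HD Vec3(float xx, float yy, float zz) : x(xx), y(yy), z(zz) {}
    HD float dot(const Vec3 &v) const { return x * v.x + y * v.y + z * v.z; }
    HD float length() const { return sqrtf(dot(*this)); }
    // Assumed body: scale the vector to unit length, leaving zero vectors untouched.
    HD Vec3& normalize()
    {
        float len = length();
        if (len > 0.0f)
        {
            float inv = 1.0f / len;
            x *= inv; y *= inv; z *= inv;
        }
        return *this;
    }
};
</syntaxhighlight>
With this layout the host can build the scene with the same Vec3 type that the kernel later uses, so no separate host-only copy of the class is needed.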
'''__device__ __forceinline__''' : because the program relies heavily on loops and recursion, we asked the compiler to inline the trace and mix functions, as well as some methods of the Vec3 and Sphere classes, in order to speed them up.
<syntaxhighlight lang="cpp">
__device__ __forceinline__ Vec3 trace(const Vec3 &rayorig, const Vec3 &raydir, const Sphere* spheres, const int depth, int nsphere)
{
...
</syntaxhighlight>
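The sketch below shows the qualifier on a small helper, where inlining is most likely to pay off. It is an illustration only: the <code>FORCE_INLINE_HD</code> macro is hypothetical, the body of <code>mix</code> is an assumed linear blend rather than the project's code, and <code>__host__</code> is added so the helper can also be called from host code.
<syntaxhighlight lang="cpp">
// FORCE_INLINE_HD is a hypothetical macro used only in this sketch.
// Under nvcc it requests inlining for host and device code;
// under a plain C++ compiler it falls back to the standard inline keyword.
#ifdef __CUDACC__
#define FORCE_INLINE_HD __host__ __device__ __forceinline__
#else
#define FORCE_INLINE_HD inline
#endif

// Assumed body: linear blend between a and b, weighted by t in [0, 1].
FORCE_INLINE_HD float mix(float a, float b, float t)
{
    return b * t + a * (1.0f - t);
}
</syntaxhighlight>
A call such as <code>mix(0.0f, 10.0f, 0.5f)</code> evaluates to 5. Note that a function which calls itself, as trace does, cannot be inlined all the way down, so the qualifier is expected to help mostly with leaf helpers like this one.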
'''sqrtf, tanf, fmaxf''' : wherever std:: math functions were used, we replaced them with the equivalents from CUDA's math library, although the gains from this were marginal.
<syntaxhighlight lang="cpp">
}
surfaceColor += sphere->surfaceColor * transmission * fmaxf(float(0), nhit.dot(lightDirection)) * spheres[i].emissionColor;
}
</syntaxhighlight>
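A short sketch of the three functions in the kind of expressions a ray tracer typically contains. The helper names are hypothetical and not taken from the project, and the CUDA qualifiers are left out so that the snippet also compiles as plain C++; in device code the same calls resolve to the CUDA math library.
<syntaxhighlight lang="cpp">
#include <math.h>

// Hypothetical helpers for this sketch; the names are not from the original project.

// Half-width scale of the image plane for a field of view given in degrees.
inline float fovScale(float fovDegrees)
{
    const float pi = 3.14159265358979f;
    return tanf(0.5f * fovDegrees * pi / 180.0f);
}

// Diffuse weight: a negative cosine (surface facing away from the light) is clamped to zero.
inline float diffuseWeight(float cosTheta)
{
    return fmaxf(0.0f, cosTheta);
}

// Euclidean length of a 3-component vector.
inline float length3(float x, float y, float z)
{
    return sqrtf(x * x + y * y + z * z);
}
</syntaxhighlight>
All three calls take and return float, so no intermediate value is promoted to double.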
'''shared memory''' : we implemented shared memory but quickly realized that it was actually slower than sticking to global memory. We believe this has to do with the number of times the array has to be copied into shared memory.
<syntaxhighlight lang="cpp">
//extern __shared__ char test[];
/*Sphere* sp = (Sphere*)&test[0];
for(int i = 0; i < nsphere; i++)
{
    sp[i] = spheres[i];
}*/
</syntaxhighlight>
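For comparison, here is a minimal sketch of a cooperative copy, in which the threads of a block share the work of filling shared memory instead of each thread copying every element. It is not the project's code: <code>SphereData</code>, <code>sumRadii</code> and <code>runSumRadii</code> are hypothetical names, the kernel only sums radii to keep the example short, and whether this pattern would have beaten global memory in the ray tracer is untested.
<syntaxhighlight lang="cpp">
#include <cuda_runtime.h>

// Hypothetical, trimmed-down stand-in for the project's Sphere class.
struct SphereData
{
    float cx, cy, cz;
    float radius;
};

// Each block stages the spheres in shared memory once. The threads of the block
// split the copy between them, then synchronize before any thread reads the array.
__global__ void sumRadii(const SphereData *spheres, int nsphere, float *out)
{
    extern __shared__ SphereData sp[];

    for (int i = threadIdx.x; i < nsphere; i += blockDim.x)
    {
        sp[i] = spheres[i];
    }
    __syncthreads();

    float total = 0.0f;
    for (int i = 0; i < nsphere; i++)
    {
        total += sp[i].radius;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = total;
}

// Host wrapper: returns the sum seen by thread 0, or -1.0f if a CUDA call fails.
float runSumRadii(const SphereData *hostSpheres, int nsphere)
{
    const int threads = 64, blocks = 2;
    SphereData *dSpheres = 0;
    float *dOut = 0;
    float result = -1.0f;
    size_t bytes = nsphere * sizeof(SphereData);

    if (cudaMalloc((void **)&dSpheres, bytes) == cudaSuccess &&
        cudaMalloc((void **)&dOut, threads * blocks * sizeof(float)) == cudaSuccess &&
        cudaMemcpy(dSpheres, hostSpheres, bytes, cudaMemcpyHostToDevice) == cudaSuccess)
    {
        // Third launch parameter: bytes of dynamic shared memory per block.
        sumRadii<<<blocks, threads, bytes>>>(dSpheres, nsphere, dOut);
        if (cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess)
        {
            result = -1.0f;
        }
    }
    cudaFree(dSpheres);
    cudaFree(dOut);
    return result;
}
</syntaxhighlight>
With 64 threads per block and a handful of spheres, each sphere is copied once per block rather than once per thread, and <code>__syncthreads()</code> guarantees the copy has finished before the array is read.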
**We also needed to rework a few parts of the code so that they could be parallelized.
 
====Difficulties====
 
'''RECURSION''': this was the original difficulty that forced us, in assignment 2, to use a ray tracer that did not account for transparency, reflection, or depth. For assignment 3 we wanted a relatively full-featured ray tracer and decided to tackle recursion. Recursion itself is supported on GPUs with compute capability 2.0 or higher, but we ran into stack memory issues: because of the recursion, the compiler could not determine the stack size our kernel required and, in effect, allocated less memory than we needed. Eventually we realized we could resize the stack manually with '''cudaThreadSetLimit(cudaLimitStackSize,..);'''. After testing various combinations, we arrived at a sizing scheme that made everything work.
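The stack adjustment can be sketched as follows. This is an illustration under stated assumptions, not our exact code: it uses <code>cudaDeviceSetLimit</code>, the current counterpart of the deprecated <code>cudaThreadSetLimit</code>, and <code>requestStackSize</code> is a hypothetical helper name.
<syntaxhighlight lang="cpp">
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical helper: asks for a per-thread stack of at least `bytes` bytes.
// Returns the limit reported by the runtime afterwards, or 0 if a CUDA call fails.
size_t requestStackSize(size_t bytes)
{
    size_t current = 0;
    if (cudaDeviceGetLimit(&current, cudaLimitStackSize) != cudaSuccess)
    {
        return 0;
    }
    // Only raise the limit; leave it alone if it is already large enough.
    if (current < bytes && cudaDeviceSetLimit(cudaLimitStackSize, bytes) != cudaSuccess)
    {
        return 0;
    }
    if (cudaDeviceGetLimit(&current, cudaLimitStackSize) != cudaSuccess)
    {
        return 0;
    }
    return current;
}
</syntaxhighlight>
The helper would be called once on the host before the kernel launch, for example <code>requestStackSize(8192)</code>. The value 8192 is an arbitrary placeholder, not the size we settled on; a suitable value depends on the recursion depth and on how much local data each call of the recursive function keeps on the stack.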
 
====What We Learnt====
 
First off, taking someone else's code on a subject we knew nothing about was a definite learning experience in itself. By breaking down code segments and working through the mathematics involved, we now have a decent grasp of how ray tracing works as a technique. Once we got past this hurdle, we quickly figured out that most of what we learnt in class revolved around matrix optimizations, which were not present in our problem, so we had to do a lot of side reading. After all was said and done, we tried to implement constant and shared memory, but with either no success or poor results. We switched our focus instead to the repetitive nature of our program, changing our main function to an inline function and moving all the required code strictly onto the device. At the end of the day we have a much better, though far from in-depth, understanding of some of CUDA's features and capabilities.
 
====What Would We Do Differently?====
 
We would look for a program that uses matrices and related math, so that we could apply our in-class lectures more directly to our assignment.