Studyapplocator

5,813 bytes added, 14:33, 22 April 2018
7 3 8 9 4 1 6 2 5
9 2 5 6 8 7 3 4 1
</nowiki>
</code>
 
=====Flat Profile - Easy=====
<code>
<nowiki>
Flat profile:
0.00 0.00 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)
0.00 0.00 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)
</nowiki>
</code>
<code>
<nowiki>
[2]  solveSudoku()
[3]  placeNum(int, int)
[4]  checkRow(int, int)
[5]  checkColumn(int, int)
[6]  checkSquare(int, int, int)
[7]  goBack(int&, int&)
[11] print(int (*) [9])
[12] _GLOBAL__sub_I_sudoku
[13] _GLOBAL__sub_I_temp
[14] storePositions()
[15] __static_initialization_and_destruction_0(int, int)
[16] __static_initialization_and_destruction_0(int, int)
</nowiki>
</code>
==Ray Tracing==
Ray tracing is a rendering technique for generating an image by tracing the path of light as pixels in an image plane and simulating the effects of its encounters with virtual objects.
The technique is capable of producing a very high degree of visual realism, usually higher than that of typical scanline rendering methods, but at a greater computational cost (Wikipedia [https://en.wikipedia.org/wiki/Ray_tracing_(graphics)]).
 
==Source Code==
Source code taken from [https://www.scratchapixel.com/code.php?id=3&origin=/lessons/3d-basic-rendering/introduction-to-ray-tracing this location].
 
----
Compile using the following command:
<code><nowiki>g++ -O2 -std=c++0x -pg raytracer.cpp -o raytracer</nowiki></code>

Profile using the command:
<code><nowiki>gprof -p -b ./raytracer gmon.out > raytracer.flt</nowiki></code>
----
==Profile generated==
 
<code>
<nowiki>
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
81.82      0.36     0.36   307200     1.17     1.17  trace(Vec3<float> const&, Vec3<float> const&, std::vector<Sphere, std::allocator<Sphere> > const&, int const&)
18.18      0.44     0.08                             render(std::vector<Sphere, std::allocator<Sphere> > const&)
 0.00      0.44     0.00        4     0.00     0.00  void std::vector<Sphere, std::allocator<Sphere> >::_M_insert_aux<Sphere>(__gnu_cxx::__normal_iterator<Sphere*, std::vector<Sphere, std::allocator<Sphere> > >, Sphere&&)
 0.00      0.44     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z3mixRKfS0_S0_
</nowiki>
</code>
 
----
 
Where to parallelize the program?
 
From the above profile we can see that the trace function takes most of the computation time.

Finding the intersection of the ray with each sphere in the scene is where the algorithm spends the longest, so this loop is a good candidate for parallelization:
<code>
<nowiki>
for (unsigned i = 0; i < spheres.size(); ++i) {
    float t0 = INFINITY, t1 = INFINITY;
    if (spheres[i].intersect(rayorig, raydir, t0, t1)) {
        if (t0 < 0) t0 = t1;
        if (t0 < tnear) {
            tnear = t0;
            sphere = &spheres[i];
        }
    }
}
</nowiki>
</code>
 
Another area that would speed up the program is the render function:
 
<code>
<nowiki>
for (unsigned y = 0; y < IMG_RES; ++y) {
    for (unsigned x = 0; x < IMG_RES; ++x) {
        int k = x + y * IMG_RES;
        float xxPoints = (2 * ((x + 0.5) * iwidth) - 1) * viewangle * aspectratio;
        float yyPoints = (1 - 2 * ((y + 0.5) * iheight)) * viewangle;
        Vec3f rayDirection, rayOrigin;
        rayDirection.init(xxPoints, yyPoints, -1);
        rayDirection.normalize();
        rayOrigin.init(0);

        // Begin tracing //
        trace(rayOrigin, rayDirection, 0, pixel, sphere, k);
    }
}
</nowiki>
</code>
 
This function generates a primary ray for each pixel of the image, traces it through the scene, and returns a colour for that pixel.
= Assignment 2 =
Under progress.

For our parallelization example we decided on ray tracing, as it is the best candidate for optimization based on the code.

==Parallelization==
After converting a portion of the code to run in parallel, we tested the run times of the program at various resolutions. The program renders a picture at a given quality, and the run time increases with each resolution.

===Changing the Render function===
Instead of using regular C++ indexing in the render() function, we parallelized it using block and thread indexing. In the C++ version of the code, a nested for loop iterates over the x and y axes of the image, depending on the resolution set.

[[File:RenderCPP.jpg]]

This was changed to thread-based indexing when we converted the render function to a kernel.

[[File:Render.png]]

===Declaring a Device pointer===
We declared a device pointer to the sphere object, allocated memory for the device object, and finally copied the data from the host object to the device object.

[[File:Htod.png]]

===Setting up the Grid===
We allocated the grid of threads based on the image resolution the code renders, divided by the number of threads per block.

[[File:Grid.png]]

===Launching the Kernel===
Instead of calling the render function in main, we changed the function render() to a __global__ void render() kernel.

[[File:RCpp.png]]

Finally, we launch the kernel to render the image and copy the rendered data from device memory back to host memory.

[[File:block.jpg]]

===Image resolution 512===
[[File:512.jpg]]

===Image resolution 1024===
[[File:1024.jpg]]

===Image resolution 2048===
[[File:2048.jpg]]

===Image resolution 4096===
[[File:4096.jpg]]

===Analysis===
[[File:excelgraph2.jpg]]

From this chart we can see the significant drop in run time when we switch from serial to parallel processing in ray tracing using CUDA, as we double the resolution from 512. There is still room for improvement, which will be implemented and analyzed in Assignment 3.
= Assignment 3 =
Under progress.

In this assignment we decided to enhance memory access to a vital data point, which decreased the run time of the render kernel by almost half. This effect is shown in the graph below:

[[File:optimizedExcel.jpg]]

We can see the difference in kernel run times from the Nvidia Visual Profiler.

===Optimized Image Resolution Results at 512===
[[File:512Optimized.jpg]]

===Optimized Image Resolution Results at 1024===
[[File:1024Optimized.jpg]]

===Optimized Image Resolution Results at 2048===
[[File:2048Optimized.jpg]]

===Optimized Image Resolution Results at 4096===
[[File:4096Optimized.jpg]]

There are more ways to optimize the code by making better use of available GPU resources, such as using more of the available bandwidth, using more cores depending on compute capability, and improving memcpy efficiency. For simplicity we decided to reduce memory access times, as that was where the kernel spent most of its time according to the nvvp profiles we collected.

= Presentation =
[[File:Presentation.pdf]]