
Studyapplocator

2,676 bytes added, 14:33, 22 April 2018


Another area where the program could be sped up is the render function.

<code> <nowiki>
for (unsigned y = 0; y < IMG_RES; ++y) {
    for (unsigned x = 0; x < IMG_RES; ++x) {
        int k = x + y * IMG_RES; // flat index of pixel (x, y)
        // Map the pixel centre to a point on the image plane
        float xxPoints = (2 * ((x + 0.5) * iwidth) - 1) * viewangle * aspectratio;
        float yyPoints = (1 - 2 * ((y + 0.5) * iheight)) * viewangle;
        Vec3f rayDirection, rayOrigin;
        rayDirection.init(xxPoints, yyPoints, -1);
        rayDirection.normalize();
        rayOrigin.init(0);
        // Begin tracing //
        trace(rayOrigin, rayDirection, 0, pixel, sphere, k);
    }
}
</nowiki></code>

This function generates a ray for each pixel of the image, traces it, and produces a color for that pixel.
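The pixel-to-ray arithmetic in the render loop can be tried in isolation. The following is a minimal sketch and not the project's code: <code>Dir3</code> and <code>pixelToRay</code> are hypothetical names standing in for <code>Vec3f</code> and the loop body, and the definitions of <code>iwidth</code>, <code>iheight</code>, <code>aspectratio</code> and <code>viewangle</code> are assumptions, since the page does not show how they are computed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the project's Vec3f type.
struct Dir3 { float x, y, z; };

// Maps pixel (x, y) of a square res x res image to a unit-length
// primary-ray direction, using the same formulas as the render loop.
Dir3 pixelToRay(unsigned x, unsigned y, unsigned res, float fovDegrees) {
    assert(res > 0);
    const float kPi = 3.14159265358979f;
    // Assumed definitions; the page does not show them.
    float iwidth = 1.0f / res;
    float iheight = 1.0f / res;
    float aspectratio = 1.0f; // square image
    float viewangle = std::tan(kPi * 0.5f * fovDegrees / 180.0f);

    // Pixel centre mapped to the image plane at z = -1.
    float xxPoints = (2 * ((x + 0.5f) * iwidth) - 1) * viewangle * aspectratio;
    float yyPoints = (1 - 2 * ((y + 0.5f) * iheight)) * viewangle;

    // Normalize (xxPoints, yyPoints, -1) to unit length.
    float len = std::sqrt(xxPoints * xxPoints + yyPoints * yyPoints + 1.0f);
    return { xxPoints / len, yyPoints / len, -1.0f / len };
}
```

With these assumptions, the top-left pixel yields a direction pointing left, up, and into the scene, and the bottom-right pixel yields its mirror image in x and y.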

= Assignment 2 =

==Parallelization==

After parallelizing a portion of the code, we tested the run time of the program at various resolutions. Each run rendered the picture at a given resolution, and the run time increased with each increase in resolution.

===Changing the Render function===

Instead of the regular C++ loop indexing in the render() function, we parallelized it using block and thread indexing. The C++ version of the code had a nested for loop that iterates over the x and y axes of the image, with bounds set by the image resolution.

[[File:RenderCPP.jpg]]

This was changed to thread-based indexing when we converted the render function into a kernel.

[[File:Render.png]]

===Declaring Device pointer===

We declared a device pointer to the sphere object, allocated device memory for it, and then copied the data from the host object to the device object.

[[File:Htod.png]]

===Setting up the Grid===

We sized the grid of threads from the image resolution we set the code to render, divided by the number of threads per block.

[[File:Grid.png]]

===Launching the Kernel===

Instead of calling the render function in main, we changed the function render() to a __global__ void render() kernel.

[[File:RCpp.png]]

Finally, we launch the kernel to render the image and copy the rendered data from device memory to host memory.
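The grid sizing and thread-based indexing described above can be checked on the CPU. The sketch below is plain C++ that emulates the index arithmetic of a 2D kernel launch, so it runs without a GPU. It is not the project's kernel. <code>NTPB</code>, <code>blocksPerAxis</code> and <code>emulateLaunch</code> are hypothetical names. The block width of 16 and the rounding up of the block count are assumptions, since the page only says that the resolution is divided by the number of threads per block.

```cpp
#include <cassert>
#include <vector>

// Hypothetical block width; the page does not state the value used.
const unsigned NTPB = 16;

// Blocks needed along one axis. Rounding up (an assumption) lets a
// partial block cover the last pixels when imgRes is not a multiple of ntpb.
unsigned blocksPerAxis(unsigned imgRes, unsigned ntpb) {
    assert(ntpb > 0);
    return (imgRes + ntpb - 1) / ntpb;
}

// CPU emulation of a 2D launch: every (block, thread) pair computes the
// pixel it would own in a kernel. Returns the visit count for each pixel.
std::vector<unsigned> emulateLaunch(unsigned imgRes, unsigned ntpb) {
    std::vector<unsigned> visits(imgRes * imgRes, 0);
    unsigned nblks = blocksPerAxis(imgRes, ntpb);
    for (unsigned by = 0; by < nblks; ++by)
        for (unsigned bx = 0; bx < nblks; ++bx)
            for (unsigned ty = 0; ty < ntpb; ++ty)
                for (unsigned tx = 0; tx < ntpb; ++tx) {
                    // In CUDA: blockIdx.x * blockDim.x + threadIdx.x
                    unsigned x = bx * ntpb + tx;
                    unsigned y = by * ntpb + ty;
                    // Guard: threads past the image edge do nothing.
                    if (x < imgRes && y < imgRes) {
                        unsigned k = x + y * imgRes; // same flat index as the serial loop
                        ++visits[k];
                    }
                }
    return visits;
}
```

If every entry of the returned vector is 1, each pixel is owned by exactly one thread, which is what the nested for loop guaranteed in the serial version.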

[[File:block.jpg]]

===Image resolution 512===

[[File:512.jpg]]

===Image resolution 1024===

[[File:1024.jpg]]

===Image resolution 2048===

[[File:2048.jpg]]

===Image resolution 4096===

[[File:4096.jpg]]

===Analysis===

[[File:excelgraph2.jpg]]

From this chart we can see a significant drop in run time when we switch the ray tracer from serial processing to parallel processing with CUDA, at each resolution as we double it from 512 upward. There is still room for improvement, which will be implemented and analyzed in Assignment 3.

= Assignment 3 =

In this assignment we improved memory access to a vital data point, which decreased the run time of the render kernel by almost half. This effect is shown in the graph below:

[[File:optimizedExcel.jpg]]

We can see the difference in the kernel's run times in the Nvidia Visual Profiler:

===Optimized Image Resolution Results at 512===

[[File:512Optimized.jpg]]

===Optimized Image Resolution Results at 1024===

[[File:1024Optimized.jpg]]

===Optimized Image Resolution Results at 2048===

[[File:2048Optimized.jpg]]

===Optimized Image Resolution Results at 4096===

[[File:4096Optimized.jpg]]

There are more ways to optimize the code by making better use of available GPU resources, such as using more of the available bandwidth, using more cores depending on compute capability, and improving memcpy efficiency. For simplicity, we decided to reduce memory access times, because the nvvp profiles we collected indicated that this was where the kernel spent most of its time.

= Presentation =

[[File:Presentation.pdf]]

Fmalik17


CDOT Wiki β
