Sirius

976 bytes added, 09:35, 9 April 2018
Assignment 3
There must be a way to optimize this application, but as of today (March 4, 2018) I am not sure which path to take.<br>
For me the most important thing is to solve the problem regardless of the tools used and I think that reimplementing everything from scratch using OpenCV and CUDA is a viable solution.
 
Source Code for Vehicle Detection
=== Box Blur on an image using opencv C++ Library (Max Fainshtein) ===
=== Assignment 3 ===
Our kernel had already made some massive improvements compared to the serial version, but upon using Nvidia's Visual Profiler to analyze the Assignment 2 version, it was evident that we could try to improve it even further.
<br><br>
Problem:
----
The kernels had been executing concurrently, but Nvidia's Visual Profiler showed that we were not using all of the Streaming Multiprocessors (SMs) to their maximum capability.
<br><br>
Solution:
----
One way to address low compute utilization is to attempt to increase the occupancy of each SM. The machine we were using for testing had a compute capability of 6.1. According to CUDA's occupancy calculator, this means that each SM supports up to 32 resident blocks and 2048 resident threads. To achieve maximum occupancy you would have 2048/32 = 64 threads/block. To determine an appropriate grid size we would divide the total number of pixels by the 64 threads/block. This allows us to use dynamic grid sizing depending on the size of the image passed in.
<br><br>
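The arithmetic above can be sketched on the host side as follows. This is only an illustrative sketch: the helper names <code>threadsPerBlock</code> and <code>blocksForImage</code> are hypothetical and are not part of our source code, and the grid size is rounded up here on the assumption that the pixel count may not be an exact multiple of the block size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: threads per block chosen so that the maximum number
// of resident blocks exactly fills the SM's resident-thread limit.
inline unsigned threadsPerBlock(unsigned residentThreads, unsigned residentBlocks) {
    assert(residentBlocks > 0);
    return residentThreads / residentBlocks;
}

// Hypothetical helper: number of blocks needed to cover every pixel.
// Rounds up so that a partially filled final block is still counted.
inline std::size_t blocksForImage(std::size_t pixels, unsigned blockSize) {
    assert(blockSize > 0);
    return (pixels + blockSize - 1) / blockSize;
}
```

For compute capability 6.1, <code>threadsPerBlock(2048, 32)</code> gives 64, and a 1920 x 1080 image (2073600 pixels) then needs 32400 blocks. When the pixel count is not a multiple of the block size, the last block has spare threads, so the kernel has to skip any index at or beyond the pixel count.
<br><br>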
The number of threads per block is calculated from the resident threads and resident blocks of the device:
<syntaxhighlight lang="cpp">
int iDevice;
cudaDeviceProp prop;
cudaGetDevice(&iDevice);
cudaGetDeviceProperties(&prop, iDevice);
// Maximum resident threads per SM, as reported by the device
int resident_threads = prop.maxThreadsPerMultiProcessor;
// Maximum resident blocks per SM, chosen from the compute capability
int resident_blocks = 8;
if (prop.major >= 3 && prop.major < 5) {
    resident_blocks = 16;
}
else if (prop.major >= 5 && prop.major <= 6) {
    resident_blocks = 32;
}
dim3 blockDims(resident_threads / resident_blocks, 1, 1);
// Calculate grid size to cover the whole image; round up so a partial last block
// is still launched (the kernel must then skip indices >= pixels)
dim3 gridDims((pixels + blockDims.x - 1) / blockDims.x);
</syntaxhighlight>
This increased compute utilization from 33% to close to 43%, but unfortunately it did not yield much improvement.
<br><br>
The number of blocks in the grid was recalculated to incorporate the complexity of the image and the new number of threads per block.