For me the most important thing is to solve the problem regardless of the tools used, and I think that reimplementing everything from scratch using OpenCV and CUDA is a viable solution.
=== Source Code for Vehicle Detection ===
<syntaxhighlight lang="cpp">
void detect_vehicles()
{
    for (unsigned int i = 0; i < files.size(); i++)
    {
        // Load one image at a time and display it
        load_image(img, files[i]);
        win.set_image(img);

        // Run the detector on the image and show the output
        for (auto&& d : net(img))
        {
            auto fd = sp(img, d);
            rectangle rect;
            for (unsigned long j = 0; j < fd.num_parts(); ++j)
                rect += fd.part(j);

            if (d.label == "rear")
                win.add_overlay(rect, rgb_pixel(255, 0, 0), d.label);
            else
                win.add_overlay(rect, rgb_pixel(255, 255, 0), d.label);
        }

        // Clear the overlay after a short pause
        dlib::sleep(1000);
        win.clear_overlay();
    }
}
</syntaxhighlight>
=== Box Blur on an Image Using the OpenCV C++ Library (Max Fainshtein) ===
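As a rough illustration, a box blur with OpenCV's C++ API can be as short as the sketch below using <code>cv::blur</code>. The file names and the 5x5 kernel size are placeholder choices for the example, not values taken from this project.
<syntaxhighlight lang="cpp">
#include <opencv2/opencv.hpp>

int main()
{
    // Load the source image (the path is a placeholder)
    cv::Mat src = cv::imread("input.jpg");
    if (src.empty())
        return -1;

    // Apply a 5x5 box blur: each output pixel is the average of the
    // 25 pixels in the window centred on it
    cv::Mat dst;
    cv::blur(src, dst, cv::Size(5, 5));

    cv::imwrite("output.jpg", dst);
    return 0;
}
</syntaxhighlight>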
=== Assignment 3 ===
Upon using Nvidia's Visual Profiler it was evident that our kernel implementation made massive improvements compared to the serial version, but after analyzing the Assignment 2 version we noticed that we could still improve our kernel even further.
<br><br>
Problem:
----
The kernels had been executing concurrently, but Nvidia's Visual Profiler showed that we were not using the Streaming Multiprocessors to their maximum capability.
<br><br>
Solution:
----
One way to address low compute utilization is to increase the occupancy of each SM. According to CUDA's occupancy calculator, the machine we were using for testing had a compute capability of 6.1, which means each SM supports 32 resident blocks and 2048 resident threads. To achieve maximum occupancy you would therefore use 2048 / 32 = 64 threads per block. To determine an appropriate grid size we divide the total number of pixels by the 64 threads per block. This allows us to use dynamic grid sizing that depends on the compute capability of the CUDA device and the size of the image passed in.
<br><br>
The number of threads per block was calculated from the device's resident threads and resident blocks:
<syntaxhighlight lang="cpp">
int iDevice;
cudaDeviceProp prop;
cudaGetDevice(&iDevice);
cudaGetDeviceProperties(&prop, iDevice);

int resident_threads = prop.maxThreadsPerMultiProcessor;
int resident_blocks = 8;
if (prop.major >= 3 && prop.major < 5)
{
    resident_blocks = 16;
}
else if (prop.major >= 5 && prop.major <= 6)
{
    resident_blocks = 32;
}

// Threads per block, based on resident threads and resident blocks
dim3 blockDims(resident_threads / resident_blocks, 1, 1);
// Grid size chosen to cover the whole image
dim3 gridDims(pixels / blockDims.x);
</syntaxhighlight>
This resulted in a compute utilization increase from 33% to close to 43%, but unfortunately it did not yield much of an improvement.
<br><br>
The number of blocks in the grid was then recalculated based on the total pixel count of the image and the new threads-per-block value.
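For illustration, a kernel launch using these dynamically calculated dimensions could look like the sketch below. The kernel name <code>boxBlurKernel</code>, the device pointers, and the image parameters are assumptions for the example, not the actual project code, and the grid size is rounded up here so that any remainder pixels are still covered (a small deviation from the plain division shown above).
<syntaxhighlight lang="cpp">
// Hypothetical kernel: one thread per pixel (names here are assumptions)
__global__ void boxBlurKernel(const unsigned char* src, unsigned char* dst,
                              int width, int height)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= width * height) return;
    dst[idx] = src[idx];  // placeholder for the actual box blur computation
}

// Host-side launch helper using the dynamically sized dimensions
void launchBlur(const unsigned char* d_src, unsigned char* d_dst,
                int width, int height, int resident_threads, int resident_blocks)
{
    int pixels = width * height;
    dim3 blockDims(resident_threads / resident_blocks, 1, 1);
    // Round up so a partially filled block still covers the remaining pixels
    dim3 gridDims((pixels + blockDims.x - 1) / blockDims.x);
    boxBlurKernel<<<gridDims, blockDims>>>(d_src, d_dst, width, height);
    cudaDeviceSynchronize();
}
</syntaxhighlight>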