Changes


Sirius

17,587 bytes added, 11:01, 9 April 2018


For me, the most important thing is to solve the problem regardless of the tools used, and I think that reimplementing everything from scratch using OpenCV and CUDA is a viable solution.

=== Source Code for Vehicle Detection ===
<syntaxhighlight lang="cpp">
void detect_vehicles() {
    for (unsigned int i = 0; i < files.size(); i++) {
        // Load one image at a time and display it
        load_image(img, files[i]);
        win.set_image(img);

        // Run the detector on the image and show the output
        for (auto&& d : net(img)) {
            auto fd = sp(img, d);
            rectangle rect;
            for (unsigned long j = 0; j < fd.num_parts(); ++j)
                rect += fd.part(j);

            if (d.label == "rear")
                win.add_overlay(rect, rgb_pixel(255, 0, 0), d.label);
            else
                win.add_overlay(rect, rgb_pixel(255, 255, 0), d.label);
        }

        // Pause for one second, then clear the overlay
        dlib::sleep(1000);
        win.clear_overlay();
    }
}
</syntaxhighlight>

=== Box Blur on an image using opencv C++ Library (Max Fainshtein) ===

My suggested topic for the parallel programming term project is an application that performs a box blur on an image using OpenCV. The OpenCV library can be installed from https://sourceforge.net/projects/opencvlibrary/files/opencv-win/3.3.0/opencv-3.3.0-vc14.exe/download or through Visual Studio's NuGet packages by installing opencv.win.native by Harry Y. OpenCV was used to read images and to access and modify the color of each pixel so that it becomes the average of the user-defined box kernel. The application runs in O(n^2) time, where n is the number of pixel rows and columns in the image.
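The averaging step described above can be sketched without OpenCV on a plain single-channel buffer. This is only an illustration: the function name <code>boxBlur</code>, the row-major <code>std::vector<float></code> layout, and the choice to divide by the number of in-bounds neighbours are assumptions made for this sketch (the project's own source divides by the full kernel area).

<syntaxhighlight lang="cpp">
#include <cstddef>
#include <vector>

// Hypothetical helper: box blur of a single-channel, row-major image.
// `kernel` is the side length of the box (for example 3 for a 3x3 box).
std::vector<float> boxBlur(const std::vector<float>& src, int rows, int cols, int kernel) {
    std::vector<float> dst(src.size(), 0.0f);
    const int half = kernel / 2;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            float sum = 0.0f;
            int count = 0;
            // Visit every neighbour inside the box, skipping out-of-bounds ones
            for (int x = i - half; x <= i + half; ++x) {
                for (int y = j - half; y <= j + half; ++y) {
                    if (x >= 0 && y >= 0 && x < rows && y < cols) {
                        sum += src[static_cast<std::size_t>(x) * cols + y];
                        ++count;
                    }
                }
            }
            // Assumption: average over the in-bounds neighbours only
            dst[static_cast<std::size_t>(i) * cols + j] = sum / count;
        }
    }
    return dst;
}
</syntaxhighlight>

Every output pixel depends only on the input image, which is why the per-pixel loop body is a natural candidate for one GPU thread per pixel.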

==== Results ====

Running this program for images of various sizes resulted in the following:

[[File:dps915_boxfilter_result.png]]

==== Graph ====

This is the data displayed as a bar graph.

[[File:dps915_boxfilter_graph.png | 750px]]

The application stands to gain a large performance boost from parallel programming. Most of the computation time is spent calculating the average for each pixel, and these averages can be calculated concurrently, requiring only a single synchronization at the end before the image is displayed.

=== Source Code for Box Blur ===

<syntaxhighlight lang="cpp">
int findingNeighbors(Mat img, int i, int j, int neighbour, float * b, float * g, float * r) {
    int row_limit = img.rows;
    int column_limit = img.cols;
    Scalar temp;
    double blue = 0, red = 0, green = 0;
    const int half = neighbour / 2; // integer division already rounds down

    // Sum the channel values of every in-bounds pixel in the neighbourhood
    for (int x = i - half; x <= i + half; x++) {
        for (int y = j - half; y <= j + half; y++) {
            if (x >= 0 && y >= 0 && x < row_limit && y < column_limit) {
                temp = img.at<Vec3b>(x, y);
                blue += temp.val[0];
                green += temp.val[1];
                red += temp.val[2];
            }
        }
    }

    // Average over the full kernel area; out-of-bounds pixels contribute zero
    *b = blue / pow(neighbour, 2);
    *g = green / pow(neighbour, 2);
    *r = red / pow(neighbour, 2);
    return 1;
}
</syntaxhighlight>

=== Algorithms (Joseph Pildush)===

My topic is about algorithms and the stress that running them on large arrays places on the CPU and RAM. With small arrays, most algorithms tend to finish in under a second. With larger arrays, my results show that the stress keeps increasing as the array size grows, which in turn increases the execution time of the algorithms. When these algorithms are called multiple times on large arrays, the execution time increases immensely, which may make the program slow overall or cause it to hang.

Algorithms Used: std::sort, saxpy, prefix-sum

0.00 17.52 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z10setRandArrRSt6vectorIiSaIiEEi

0.00 17.52 0.00 1 0.00 0.00 void std::__insertion_sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter>(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_less_iter)

</source>

==== Source Code ====

<source lang="cpp">
//std::sort Algorithm
void stdSort(vector<int>& array, int arrSize, steady_clock::time_point ts, steady_clock::time_point te) {
    cout << "--==Execution Time of std::sort Algorithm==--" << endl;

    //Time the fill of 1 vector
    ts = steady_clock::now();
    //Fill array with random numbers
    setRandArr(array, arrSize);
    te = steady_clock::now();
    printTiming("std::sort Vector (1) Initialize", te - ts);

    //Start timing of std::sort
    ts = steady_clock::now();
    //Use std::sort to sort the vector
    sort(array.begin(), array.end());
    //End timing std::sort
    te = steady_clock::now();

    //Print Results
    printTiming("std::sort algorithm", te - ts);
}

//saxpy Algorithm
void saxpyAlg(int arrSize, steady_clock::time_point ts, steady_clock::time_point te) {
    cout << endl << "--==Execution Time of saxpy Algorithm==--" << endl;

    vector<int> saxpyX, saxpyY;
    int saxpyA = 15;

    //Time the fill of 2 vectors
    ts = steady_clock::now();
    setRandArr(saxpyX, arrSize);
    setRandArr(saxpyY, arrSize);
    te = steady_clock::now();
    printTiming("saxpy Vectors (2) Initialize", te - ts);

    //Start timing of saxpy
    ts = steady_clock::now();
    for (int i = 0; i < arrSize; ++i)
        saxpyY[i] = saxpyA * saxpyX[i] + saxpyY[i];
    //End timing of saxpy
    te = steady_clock::now();
    printTiming("saxpy Algorithm", te - ts);
}

//Prefix Sum Algorithm
void prefixSum(vector<int>& array, int arrSize, steady_clock::time_point ts, steady_clock::time_point te) {
    cout << endl << "--==Execution Time of Prefix-Sum Algorithm==--" << endl;

    vector<int> psSum;
    array.clear();

    //Time the fill of 1 vector
    ts = steady_clock::now();
    //Fill array with random numbers
    setRandArr(array, arrSize);
    te = steady_clock::now();
    printTiming("Prefix-Sum Vector (1) Initialize", te - ts);

    //Start timing of Prefix-Sum
    ts = steady_clock::now();
    psSum.push_back(array[0]);
    for (int i = 1; i < arrSize; ++i)
        psSum.push_back(psSum[i - 1] + array[i]);
    //End timing of Prefix-Sum
    te = steady_clock::now();
    printTiming("Prefix-Sum Algorithm", te - ts);
}
</source>
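The functions above call two helpers, <code>setRandArr</code> and <code>printTiming</code>, whose definitions are not shown in this section. A minimal sketch of what they might look like follows. The <code>setRandArr(std::vector<int>&, int)</code> signature matches the mangled name in the profile output, but the value range, the fixed seed, the clearing of the vector, and the output format are all assumptions made for illustration, not the author's implementation.

<syntaxhighlight lang="cpp">
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// Assumed helper: replace the contents of `array` with `size` pseudo-random integers.
void setRandArr(std::vector<int>& array, int size) {
    static std::mt19937 gen(12345);                   // assumption: fixed seed for repeatable runs
    std::uniform_int_distribution<int> dist(0, 999);  // assumption: values in [0, 999]
    array.clear();
    array.reserve(size);
    for (int i = 0; i < size; ++i)
        array.push_back(dist(gen));
}

// Assumed helper: print a label followed by the elapsed time in milliseconds.
void printTiming(const char* msg, std::chrono::steady_clock::duration span) {
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(span);
    std::cout << msg << " - took - " << ms.count() << " millisecs" << std::endl;
}
</syntaxhighlight>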

For this reason, when developing advanced applications that need to make multiple algorithm calls on very large arrays, it seems quite beneficial to use CUDA to run them in parallel on the GPU, in order to decrease the stress on other hardware and minimize the execution time of the algorithms.

=== LZW Data Compression and Decompression (Mithilan Sivanesan) ===

Data compression is the process of reducing the number of bits required to store data. Compression can be lossless, which means no data is lost, or lossy, where some data may be lost. Files compressed using lossless compression can be decompressed to produce the original file in its entirety.

LZW is a dynamic dictionary method.

Dictionary methods substitute codes for common strings from a table or dictionary. A dictionary may be fixed, static or dynamic. In the fixed case, the dictionary is specified as part of the algorithm. In the static case, the compressor analyzes the input, constructs a dictionary, and transmits it to the decompressor. In the dynamic case, both the compressor and decompressor construct identical dictionaries from past data using identical algorithms.

LZW starts with a dictionary of 256 1-byte symbols. It parses the input into the longest possible strings that match a dictionary entry, then replaces the string with its index. After each encoding, that string plus the byte that follows it is added to the dictionary. For example, if the input is ABCABCABCABC then the encoding is as follows:

65 = A (add AB to dictionary as code 256)

66 = B (add BC as 257)

67 = C (add CA as 258)

256 = AB (add ABC as 259)

258 = CA (add CAB as 260)

257 = BC (add BCA as 261)

259 = ABC (end of input)
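The encoding walk-through above can be reproduced with a short sketch that returns the code sequence as integers instead of writing 12-bit strings to a file. The names <code>lzwEncode</code> and <code>lzwDecode</code> are hypothetical, and unlike the project's source this sketch places no limit on the dictionary size.

<syntaxhighlight lang="cpp">
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: LZW-encode `input` into a list of dictionary codes.
std::vector<int> lzwEncode(const std::string& input) {
    std::unordered_map<std::string, int> dict;
    for (int i = 0; i < 256; ++i)
        dict[std::string(1, static_cast<char>(i))] = i;  // 256 one-byte symbols
    int nextCode = 256;
    std::vector<int> codes;
    std::string current;
    for (char c : input) {
        std::string candidate = current + c;
        if (dict.count(candidate)) {
            current = candidate;               // keep extending the match
        } else {
            codes.push_back(dict[current]);    // emit the longest match
            dict[candidate] = nextCode++;      // add match + next byte
            current = std::string(1, c);
        }
    }
    if (!current.empty())
        codes.push_back(dict[current]);
    return codes;
}

// Hypothetical helper: rebuild the original string from the codes.
std::string lzwDecode(const std::vector<int>& codes) {
    std::unordered_map<int, std::string> dict;
    for (int i = 0; i < 256; ++i)
        dict[i] = std::string(1, static_cast<char>(i));
    int nextCode = 256;
    std::string output, previous;
    for (int code : codes) {
        std::string entry;
        auto it = dict.find(code);
        if (it != dict.end())
            entry = it->second;
        else
            entry = previous + previous[0];    // code not yet defined: previous + its first byte
        output += entry;
        if (!previous.empty())
            dict[nextCode++] = previous + entry[0];
        previous = entry;
    }
    return output;
}
</syntaxhighlight>

Calling <code>lzwEncode("ABCABCABCABC")</code> yields the codes 65, 66, 67, 256, 258, 257, 259, matching the table above, and <code>lzwDecode</code> recovers the input because it rebuilds the same dictionary from the codes it has already seen.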

===Flat Profile===

====Flat Profile: Compression====

Each sample counts as 0.01 seconds.

% cumulative self self total

time seconds seconds calls ns/call ns/call name

47.64 0.91 0.91 compress(std::string, int, std::string)

39.27 1.66 0.75 28552683 26.27 26.27 show_usage()

6.81 1.79 0.13 2431472 53.47 53.47 convert_int_to_bin(int)

3.66 1.86 0.07 2431472 28.79 55.06 std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::_Select1st<std::pair<std::string const, int> >, true, std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::_Select1st<std::pair<std::string const, int> >, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, false, false, true> >::operator[](std::string const&)

2.62 1.91 0.05 convert_char_to_string(char const*, int)

0.00 1.91 0.00 3841 0.00 0.00 std::__detail::_Hashtable_iterator<std::pair<std::string const, int>, false, false> std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::_Select1st<std::pair<std::string const, int> >, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, false, false, true>::_M_insert_bucket<std::pair<std::string, unsigned int> >(std::pair<std::string, unsigned int>&&, unsigned int, unsigned int)

0.00 1.91 0.00 256 0.00 0.00 std::__detail::_Hashtable_iterator<std::pair<std::string const, int>, false, false> std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::_Select1st<std::pair<std::string const, int> >, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, false, false, true>::_M_insert_bucket<std::pair<std::string, int> >(std::pair<std::string, int>&&, unsigned int, unsigned int)

0.00 1.91 0.00 256 0.00 26.27 std::__detail::_Map_base<std::string, std::pair<std::string const, int>, std::_Select1st<std::pair<std::string const, int> >, true, std::_Hashtable<std::string, std::pair<std::string const, int>, std::allocator<std::pair<std::string const, int> >, std::_Select1st<std::pair<std::string const, int> >, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, false, false, true> >::operator[](std::string&&)

0.00 1.91 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z18convert_int_to_bini

====Flat Profile: Decompression====

Each sample counts as 0.01 seconds.

% cumulative self self total

time seconds seconds calls ms/call ms/call name

41.33 0.31 0.31 decompress(std::string, int, std::string)

20.00 0.46 0.15 11 13.64 13.64 show_usage()

16.00 0.58 0.12 6529533 0.00 0.00 std::__detail::_Map_base<unsigned int, std::pair<unsigned int const, std::string>, std::_Select1st<std::pair<unsigned int const, std::string> >, true, std::_Hashtable<unsigned int, std::pair<unsigned int const, std::string>, std::allocator<std::pair<unsigned int const, std::string> >, std::_Select1st<std::pair<unsigned int const, std::string> >, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, false, false, true> >::operator[](unsigned int const&)

10.67 0.66 0.08 9 8.89 22.53 std::_Hashtable<unsigned int, std::pair<unsigned int const, std::string>, std::allocator<std::pair<unsigned int const, std::string> >, std::_Select1st<std::pair<unsigned int const, std::string> >, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, false, false, true>::_M_rehash(unsigned int)

8.00 0.72 0.06 convert_char_to_string(char const*, int)

4.00 0.75 0.03 2176681 0.00 0.00 std::__detail::_Hashtable_iterator<std::pair<unsigned int const, std::string>, false, false> std::_Hashtable<unsigned int, std::pair<unsigned int const, std::string>, std::allocator<std::pair<unsigned int const, std::string> >, std::_Select1st<std::pair<unsigned int const, std::string> >, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, false, false, true>::_M_insert_bucket<std::pair<unsigned int, std::string> >(std::pair<unsigned int, std::string>&&, unsigned int, unsigned int)

0.00 0.75 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z18convert_int_to_bini

====Source Code====

<source lang="cpp">
void compress(string input, int size, string filename) {
    unordered_map<string, int> compress_dictionary(MAX_DEF);

    //Dictionary initializing with ASCII
    for (unsigned int i = 0; i < 256; i++) {
        compress_dictionary[string(1, static_cast<char>(i))] = i;
    }

    string current_string;
    unsigned int next_code = 256;

    //Output file for compressed data
    ofstream outputFile;
    outputFile.open(filename + ".lzw");

    for (char& c : input) {
        current_string = current_string + c;
        if (compress_dictionary.find(current_string) == compress_dictionary.end()) {
            //Add the new string while the dictionary still has room
            if (next_code <= MAX_DEF)
                compress_dictionary.insert(make_pair(current_string, next_code++));
            //Emit the code for the longest known prefix, then restart from c
            current_string.erase(current_string.size() - 1);
            outputFile << convert_int_to_bin(compress_dictionary[current_string]);
            current_string = c;
        }
    }

    //Emit the code for whatever is left over
    if (current_string.size())
        outputFile << convert_int_to_bin(compress_dictionary[current_string]);
    outputFile.close();
}

void decompress(string input, int size, string filename) {
    unordered_map<unsigned int, string> dictionary(MAX_DEF);

    //Dictionary initializing with ASCII
    for (unsigned int i = 0; i < 256; i++) {
        dictionary[i] = string(1, static_cast<char>(i));
    }

    string previous_string;
    unsigned int code;
    unsigned int next_code = 256;

    //Output file for decompressed data
    ofstream outputFile;
    outputFile.open(filename + "_uncompressed.txt");

    int i = 0;
    while (i < size) {
        //Extracting 12 bits and converting binary to decimal
        string subinput = input.substr(i, 12);
        bitset<12> binary(subinput);
        code = binary.to_ullong();
        i += 12;

        //A code that is not yet known must be previous_string plus its own first character
        if (dictionary.find(code) == dictionary.end())
            dictionary.insert(make_pair(code, (previous_string + previous_string.substr(0, 1))));
        outputFile << dictionary[code];
        if (previous_string.size())
            dictionary.insert(make_pair(next_code++, previous_string + dictionary[code][0]));
        previous_string = dictionary[code];
    }
    outputFile.close();
}
</source>

=== Assignment 2 ===

For Assignment 2 we decided to parallelize the Box Blur algorithm, because we realized that it was the only one of our algorithms worth parallelizing using CUDA. We chose to continue with Box Blur, as opposed to the Vehicle Detection program, because it was the only problem for which we could create a kernel and gain a lot of execution speed. The only way to optimize the Vehicle Detection program was to enable CUDA for the dlib library, which is not really in the scope of this assignment.

The kernel was designed to run with 512 threads per block to ensure that this type of program would be able to run on CUDA-supported hardware of lower compute capability.


The grid was designed to accommodate a 4K image that would be processed with 3 colour channels.

Each thread of the grid represents a single pixel of the image being processed.

By implementing this kernel, the processing time of the blur effect improved significantly compared to its serial counterpart, as illustrated in the graph below.

=== Kernel Code ===
<syntaxhighlight lang="cpp">
__global__ void blur(unsigned char* input_image, unsigned char* output_image, int width, int height, int neighbour) {
    const unsigned int offset = blockIdx.x*blockDim.x + threadIdx.x;
    int x = offset % width;
    int y = (offset - x) / width;

    if (offset < width*height) {
        float output_red = 0;
        float output_green = 0;
        float output_blue = 0;
        int hits = 0;

        //Accumulate every in-bounds pixel within `neighbour` pixels of (x, y)
        for (int ox = -neighbour; ox < neighbour + 1; ++ox) {
            for (int oy = -neighbour; oy < neighbour + 1; ++oy) {
                if ((x + ox) > -1 && (x + ox) < width && (y + oy) > -1 && (y + oy) < height) {
                    const int currentoffset = (offset + ox + oy*width) * 3;
                    output_red += input_image[currentoffset];
                    output_green += input_image[currentoffset + 1];
                    output_blue += input_image[currentoffset + 2];
                    hits++;
                }
            }
        }

        //Average over the pixels that were actually visited
        output_image[offset * 3] = static_cast<unsigned char>(output_red / hits);
        output_image[offset * 3 + 1] = static_cast<unsigned char>(output_green / hits);
        output_image[offset * 3 + 2] = static_cast<unsigned char>(output_blue / hits);
    }
}
</syntaxhighlight>

=== Launching the Kernel ===
<syntaxhighlight lang="cpp">
void filter(const Mat& input, Mat& output, int width, int height, int neighbour)
{
    //Calculate total number of bytes of input and output image
    const int colorBytes = input.step * input.rows;
    const int grayBytes = output.step * output.rows;

    unsigned char *d_input, *d_output;

    //Allocate device memory
    cudaMalloc((void**)&d_input, width*height * 3 * sizeof(unsigned char));
    cudaMalloc((void**)&d_output, width*height * 3 * sizeof(unsigned char));

    //Copy data from OpenCV input image to device memory
    cudaMemcpy(d_input, input.ptr(), width*height * 3 * sizeof(unsigned char), cudaMemcpyHostToDevice);
    //cudaMemcpy(d_input, input.ptr(), colorBytes, cudaMemcpyHostToDevice);

    dim3 blockDims(512, 1, 1);

    //Calculate grid size to cover the whole image
    //(divide in floating point so that ceil can round up)
    dim3 gridDims((unsigned int)ceil((double)(width*height * 3) / blockDims.x), 1, 1);

    //Launch the blur kernel
    blur<<<gridDims, blockDims>>>(d_input, d_output, input.cols, input.rows, neighbour);

    //Synchronize to check for any kernel launch errors
    cudaDeviceSynchronize();

    //Copy back data from destination device memory to OpenCV output image
    cudaMemcpy(output.ptr(), d_output, width*height * 3 * sizeof(unsigned char), cudaMemcpyDeviceToHost);

    //Free the device memory
    cudaFree(d_input);
    cudaFree(d_output);
}
</syntaxhighlight>

==== Graph ====
[[File:boxFilterFirst.png | 750px]]
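The grid-size calculation in the launch code is a rounded-up division, and it can be checked on the host without a GPU. The helper name <code>blocksFor</code> is hypothetical. The sketch uses the common integer form of the round-up; it assumes one thread per pixel rather than the three-per-pixel over-provisioning in the launch code above.

<syntaxhighlight lang="cpp">
#include <cstddef>

// Hypothetical helper: number of blocks needed so that every one of
// `items` work items gets its own thread (rounded-up integer division).
constexpr std::size_t blocksFor(std::size_t items, std::size_t threadsPerBlock) {
    return (items + threadsPerBlock - 1) / threadsPerBlock;
}
</syntaxhighlight>

For a 3840 x 2160 image and 512 threads per block this gives 16200 blocks. When the pixel count is not a multiple of the block size, the last block is only partly used, which is why the kernel guards its work with a bounds check on the thread's offset.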

----

=== Assignment 3 ===

After profiling with Nvidia's Visual Profiler, it was evident that we could still make some improvements to our kernel.

Problem:

----

Nvidia's Visual Profiler showed that we were not using all of the streaming multiprocessors (SMs) to their maximum capability.

Solution:

----

One way to address low compute utilization is to try to increase the occupancy of each SM. According to CUDA's occupancy calculator, the machine we were using for testing had a compute capability of 6.1. This means that each SM supports up to 32 resident blocks and 2048 resident threads. To achieve maximum occupancy you would use 2048/32 = 64 threads per block. To determine an appropriate grid size, we divide the total number of pixels by the 64 threads per block. This gives us dynamic grid sizing that depends on the size of the image passed in.

<syntaxhighlight lang="cpp">
int iDevice;
cudaDeviceProp prop;
cudaGetDevice(&iDevice);
cudaGetDeviceProperties(&prop, iDevice);

int resident_threads = prop.maxThreadsPerMultiProcessor;
int resident_blocks = 8;
if (prop.major >= 3 && prop.major < 5) {
    resident_blocks = 16;
}
else if (prop.major >= 5 && prop.major <= 6) {
    resident_blocks = 32;
}

//determine threads/block
dim3 blockDims(resident_threads / resident_blocks, 1, 1);

//Calculate grid size to cover the whole image (rounded up so no pixel is left out)
dim3 gridDims((pixels + blockDims.x - 1) / blockDims.x);
</syntaxhighlight>

This raised compute utilization from 33% to close to 43%, but unfortunately it did not yield much improvement.

The number of blocks in the grid was recalculated from the size of the image and the new number of threads per block.

Problem:

----

We considered shared memory when optimizing our kernel. When attempting to implement shared memory we realized that it would be a difficult task to complete because every pixel in a block needs access to a different range of pixels for averaging. One major problem was that neighborhood pixels may fall out of range of the block. We also attempted to store the entire image in shared memory but this solution is not scalable to larger image sizes as shared memory is a limited resource.
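The two shared-memory difficulties described above can be made concrete with some arithmetic. The sketch below assumes a 2D tile of pixels per block plus a halo of <code>radius</code> pixels on each side, which is a common tiling scheme and not what our kernel currently does. The name <code>tileBytes</code> is hypothetical, and the 48 KB figure is the usual per-block shared memory limit for compute capability 6.1.

<syntaxhighlight lang="cpp">
#include <cstddef>

// Assumption: default shared memory available to one block (48 KB).
constexpr std::size_t kSharedBytesPerBlock = 48 * 1024;

// Hypothetical helper: bytes of shared memory needed to hold a tile of
// tileW x tileH pixels plus a halo of `radius` pixels on every side.
constexpr std::size_t tileBytes(std::size_t tileW, std::size_t tileH,
                                std::size_t radius, std::size_t channels) {
    return (tileW + 2 * radius) * (tileH + 2 * radius) * channels;
}
</syntaxhighlight>

A 16 x 16 tile with a radius of 2 and 3 channels needs 20 * 20 * 3 = 1200 bytes, which fits easily, but each block has to load the halo pixels that belong to neighbouring tiles. Holding an entire 3840 x 2160 image would take 24,883,200 bytes, far beyond the per-block limit.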

As the graph below shows, our optimizations sometimes produced slight improvements but were not effective overall. We are still looking for a way to implement shared memory, which we expect to improve efficiency and execution time.

==== Graph ====

[[File:boxFilterOptimize.png | 750px]]

=== Conclusion ===

Implementing the CUDA library into the Box Filter assignment proved to be a great success. We were able to implement a blur effect on a 4K image with a 99.2% improvement over the serial version.

With further optimization, we managed to slightly improve the execution time of the blur effect.

Below are the final results of all the test runs as well as the corresponding graph.

==== Results ====

[[File:boxFilterFinalTable.png | 500px]]

==== Graph ====

[[File: BoxFilterResults.png | 750px]]

Jpildush
