# [mailto:jjsunga@myseneca.ca?subject=DPS915 Jadrian Sunga]
# [mailto:mradmanovic@myseneca.ca?subject=DPS915 Marko Radmanovic]
[mailto:azoeb@myseneca.ca,;jjsunga@myseneca.ca,;mradmanovic@myseneca.ca?subject=DPS915 Email All]
== Progress ==
==== '''Profile''' ====
'''Flat profile (excerpt):'''
<pre>
Each sample counts as 0.01 seconds.
  0.00      1.01    0.00        1    0.00    0.00  void std::_Destroy<int*, int>(int*, int*, std::allocator<int>&)
  0.00     14.90    0.00        1    0.00    0.00  std::__lg(int)
  0.00     14.90    0.00        1    0.00   10.00  void std::sort<int*>(int*, int*)
</pre>
Since the profile data is too large, a download of it in PDF format will also be included.
----
Full profiling can be found here - http://pastebin.com/vByZLWtd
'''Index by function name (excerpt):'''
<pre>
 [5] void std::__introsort_loop<int*, int>(int*, int*, int)    [4] int* std::__unguarded_partition_pivot<int*>(int*, int*)
 [8] void std::__move_median_first<int*>(int*, int*, int*)    [19] std::__lg(int)
</pre>
* The program takes roughly 14 seconds to run, and the profile shows that <code>main</code> accounts for nearly all of the processing time (99.93%).
=== Final Statement ===
We plan to parallelize the "ColorTransfer" program, as it offers a better theoretical speedup for the whole task than Sudoku or Sorted Array Processing.
== Assignment 2 ==
Source : [https://www.dropbox.com/s/5tf3jial3rg9d4k/a2.7z?dl=0 A2]
=== Description ===
'''Removing CPU Bottleneck'''
The CPU bottleneck in ColorTransfer/main.cpp is the following matrix-vector multiplication:
<pre>
// Multiplication of matrix and vector
Color3d operator *(const cv::Mat& M, Color3d& v) {
    Color3d u = Color3d();
    for (int i = 0; i < 3; i++) {
        u(i) = 0.0;
        for (int j = 0; j < 3; j++) {
            u(i) += M.at<double>(i, j) * v(j);
        }
    }
    return u;
}
</pre>
''' Added functions and changes'''
To fix this issue we wrote a device function that handles a matrix-by-vector multiplication, and a kernel with the same logic as the host version so we could offload all the needed data to the device without multiple transfers back and forth. We also wrote a helper function that converts all the OpenCV matrices into a form suitable for the kernel.
Matrix by vector
<pre>
// Each of threads 0-2 computes one component of the result:
// the dot product of d_A with column tid of the row-major matrix d_B.
__device__ void matvec(float* d_A, float* d_B, float* d_C)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float sum = 0;
    if (tid < 3)
    {
        for (int i = 0; i < 3; ++i)
        {
            sum += d_A[i] * d_B[(i * 3) + tid];
        }
        d_C[tid] = sum;
    }
}
</pre>
Kernel
<pre>
__global__ void matvec_kernel(float* d_A, float* d_RGB2, float* d_LMS2, float* d_C,
    const int n, int targetrows, int targetcols, float* d_Tar)
{
    const double eps = 1.0e-4;
    for (int y = 0; y < targetrows; ++y) {
        for (int x = 0; x < targetcols; ++x) {
            // copy the pixel's three channels into the work vector;
            // memcpy takes the buffer itself (d_A), not the pointer's
            // address, and each pixel occupies three consecutive floats
            memcpy(d_A, &d_Tar[(y * targetcols + x) * 3], 3 * sizeof(float));
            matvec(d_A, d_RGB2, d_C);
            memcpy(d_A, d_C, 3 * sizeof(float));
            for (int c = 0; c < 3; c++)
                d_A[c] = d_A[c] > -5.0 ? pow((double)10.0, (double)d_A[c]) : eps;
            matvec(d_A, d_LMS2, d_C);
            memcpy(&d_Tar[(y * targetcols + x) * 3], d_C, 3 * sizeof(float));
        }
    }
}
</pre>
Helper
<pre>
inline void vecTransfer(float* h, Color3d* v)
{
    for (int j = 0; j < 3; ++j)
        h[j] = v->v[j];
}

// KERNEL helper function: does setup and launch
void matvec_L(cv::Mat* mRGB2LMS, cv::Mat* mLMS2lab, float* h_C, int tarrow, int tarcol, float* h_Tar)
{
    float *h_A, *h_RGB2, *h_LMS2, *d_Tar;
    float *d_A, *d_RGB2, *d_LMS2, *d_C;
    int N = 3;
    // sizeof(h_C)/sizeof(h_Tar) would only give the pointer size,
    // so compute the buffer sizes explicitly
    size_t tarBytes = (size_t)tarrow * tarcol * 3 * sizeof(float);
    h_A = (float*)malloc(sizeof(float) * N);
    h_RGB2 = new float[mRGB2LMS->total()];
    h_LMS2 = new float[mLMS2lab->total()];
    cudaMalloc((void**)&d_A, sizeof(float) * N);
    cudaMalloc((void**)&d_RGB2, sizeof(float) * N * N);
    cudaMalloc((void**)&d_LMS2, sizeof(float) * N * N);
    cudaMalloc((void**)&d_C, sizeof(float) * N);
    cudaMalloc((void**)&d_Tar, tarBytes);
    Color3d vec;
    // copy vec and matrices to host pointers (element-wise, since the
    // cv::Mat data is double while the kernel works on float)
    vecTransfer(h_A, &vec);
    for (size_t i = 0; i < mRGB2LMS->total(); ++i)
        h_RGB2[i] = (float)((double*)mRGB2LMS->data)[i];
    for (size_t i = 0; i < mLMS2lab->total(); ++i)
        h_LMS2[i] = (float)((double*)mLMS2lab->data)[i];
    cudaMemcpy(d_A, h_A, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_RGB2, h_RGB2, sizeof(float) * N * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_LMS2, h_LMS2, sizeof(float) * N * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Tar, h_Tar, tarBytes, cudaMemcpyHostToDevice);
    matvec_kernel<<<N / BLOCK_SIZE + 1, BLOCK_SIZE>>>(d_A, d_RGB2, d_LMS2, d_C, N, tarrow, tarcol, d_Tar);
    //printf("error code: %s\n", cudaGetErrorString(cudaGetLastError()));
    // copy the work vector and the transformed image back to the host
    cudaMemcpy(h_C, d_C, sizeof(float) * N, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_Tar, d_Tar, tarBytes, cudaMemcpyDeviceToHost);
    free(h_A);
    delete[] h_RGB2;   // allocated with new[], so delete[] rather than free
    delete[] h_LMS2;
    cudaFree(d_A);
    cudaFree(d_RGB2);
    cudaFree(d_LMS2);
    cudaFree(d_C);
    cudaFree(d_Tar);
}
</pre>
Changes to main loop
Old
<pre>
for (int y = 0; y < target.rows; y++) {
    for (int x = 0; x < target.cols; x++) {
        v = target.at<Color3d>(y, x);
        matvec_L(&v, &mlab2LMS, h_C);
        memcpy(&v, h_C, N * sizeof(float));
        for (int c = 0; c < 3; c++)
            v(c) = v(c) > -5.0 ? pow(10.0, v(c)) : eps;
        matvec_L(&v, &mLMS2RGB, h_C);
        memcpy(&target.at<Color3d>(y, x), h_C, N * sizeof(float));
    }
}
</pre>
New
<pre>
// sizeof(target.data) would only give the pointer size, so compute the
// buffer size from the image dimensions (note: if target stores double
// components, an element-wise conversion is needed instead of memcpy)
size_t tarBytes = (size_t)rows * cols * 3 * sizeof(float);
float* h_TARGET = (float*)malloc(tarBytes);
memcpy(h_TARGET, target.data, tarBytes);
matvec_L(&mlab2LMS, &mLMS2RGB, h_C, rows, cols, h_TARGET);
</pre>
=== Results ===
For our profiling we had three test cases.
<pre>
(Size is in pixels)
Small : 400x400 by 400x400
Medium : 1400x989 by 1215x717
Large : 7362x4858 by 6000x4032
</pre>
[[File:a2timings.PNG]]
== Assignment 3 ==
=== Link to the Original Unchanged Project and the Project Used for A3 ===
[https://www.dropbox.com/s/32q70e9iovpm80l/a2.7z?dl=0 Download]
The file contains the Visual Studio solutions used for Assignment 3 as well as a Visual Studio solution version of the original code.
To build and run, the location of the OpenCV directory must be set in the following project properties:
1. Under c/c++ -> general -> additional include directories -> the path to opencv\..\..\include
2. Under linker -> general -> additional library directories -> the path to opencv\..\..\lib
After doing the above, build the solution and run the appropriate Release or Debug exe with the target and reference image as arguments.
A link to the corresponding .xlsx file of the run times can be found [https://www.dropbox.com/s/ecu7eycyv41krwq/a3.xlsx?dl=0 here]
=== What was done ===
At first the power function used in the kernel was switched from pow to __powf, as the double-precision pow is the heavier function, but the effect was very small, a difference of around 10-30 milliseconds. Afterwards the kernel was upgraded to use a grid-stride loop. Instead of doing all the calculations on one thread, each thread became responsible for the colour shift of all the pixels in a separate row.
Other changes were attempted to transfer all the data necessary for the calculations at once at the beginning and then perform all of tatsy's calculations on the device side, but due to time constraints and the complexity of the project we were unable to fully implement them. The code is, however, left (commented out) in the included project download.
=== Optimized Kernel ===
<pre>
__global__ void matvec_kernel(float* d_A, float* d_RGB2, float* d_LMS2, float* d_C,
    const int n, int targetrows, int targetcols, float* d_Tar)
{
    const double eps = 1.0e-4;
    // grid-stride loop: each thread starts at its global id and advances by
    // the total thread count, so the threads cover all rows between them
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < targetrows;
         tid += blockDim.x * gridDim.x)
    {
        for (int x = 0; x < targetcols; ++x) {
            // memcpy takes the buffer itself (d_A), not the pointer's
            // address, and each pixel occupies three consecutive floats
            memcpy(d_A, &d_Tar[(tid * targetcols + x) * 3], 3 * sizeof(float));
            matvec(d_A, d_RGB2, d_C);
            memcpy(d_A, d_C, 3 * sizeof(float));
            for (int c = 0; c < 3; c++)
                d_A[c] = d_A[c] > -5.0f ? __powf(10.0f, d_A[c]) : eps;
            matvec(d_A, d_LMS2, d_C);
            memcpy(&d_Tar[(tid * targetcols + x) * 3], d_C, 3 * sizeof(float));
        }
    }
}
</pre>
=== Results ===
[[File:a3timings.PNG]]
When running in Release and comparing the results to the original unchanged project straight from tatsy, we noticed that for very small images the CUDA version is slightly slower. This is probably because the actual colour conversion of a smaller image takes much less time, while transferring the data to device memory adds a fixed overhead. Overall, comparing the optimized and unoptimized versions of the kernel, there was a visible speedup.