# [mailto:jjsunga@myseneca.ca?subject=DPS915 Jadrian Sunga]
# [mailto:mradmanovic@myseneca.ca?subject=DPS915 Marko Radmanovic]
[mailto:azoeb@myseneca.ca,;jjsunga@myseneca.ca,;mradmanovic@myseneca.ca?subject=DPS915 Email All]
== Progress ==
==== '''Profile''' ====
'''Flat profile (excerpt):'''
<pre>
Each sample counts as 0.01 seconds.
  0.00      1.01    0.00        1    0.00    0.00  void std::_Destroy<int*, int>(int*, int*, std::allocator<int>&)
  0.00     14.90    0.00        1    0.00    0.00  std::__lg(int)
  0.00     14.90    0.00        1    0.00   10.00  void std::sort<int*>(int*, int*)
</pre>
Since the profile data is too large, a download of it in PDF format will also be included.
----
Full profiling can be found here - http://pastebin.com/vByZLWtd
'''Index by function name (excerpt):'''
<pre>
 [5] void std::__introsort_loop<int*, int>(int*, int*, int)    [4] int* std::__unguarded_partition_pivot<int*>(int*, int*)
 [8] void std::__move_median_first<int*>(int*, int*, int*)    [19] std::__lg(int)
</pre>
* The program takes roughly 14 seconds to run, and the profile shows that <code>main</code> accounts for nearly all of the processing time (99.93%).
=== Final Statement ===
We plan to parallelize the "ColorTransfer" program, as it offers a better theoretical speedup for the whole task than Sudoku or Sorted Array Processing.
== Assignment 2 ==
Source : [https://www.dropbox.com/s/5tf3jial3rg9d4k/a2.7z?dl=0 A2]
=== Description ===
'''Removing CPU Bottleneck'''
The CPU bottleneck in ColorTransfer/main.cpp is the following matrix-vector multiplication:
<pre>
// Multiplication of matrix and vector
Color3d operator *(const cv::Mat& M, Color3d& v) {
    Color3d u = Color3d();
    for (int i = 0; i < 3; i++) {
        u(i) = 0.0;
        for (int j = 0; j < 3; j++) {
            u(i) += M.at<double>(i, j) * v(j);
        }
    }
    return u;
}
</pre>
''' Added functions and changes'''
To fix this issue we wrote a device function that handles a matrix-by-vector multiplication, and a kernel with the same logic as the host version so we could offload all the needed data to the device without multiple transfers back and forth. We also wrote a helper function that converts all the OpenCV matrices into a form suitable for the kernel.
Matrix by vector
<pre>
// Each of threads 0-2 computes one component of the result:
// the dot product of d_A with column tid of the row-major matrix d_B.
__device__ void matvec(float* d_A, float* d_B, float* d_C)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float sum = 0;
    if (tid < 3)
    {
        for (int i = 0; i < 3; ++i)
        {
            sum += d_A[i] * d_B[(i * 3) + tid];
        }
        d_C[tid] = sum;
    }
}
</pre>
Kernel
<pre>
__global__ void matvec_kernel(float* d_A, float* d_RGB2, float* d_LMS2, float* d_C,
    const int n, int targetrows, int targetcols, float* d_Tar)
{
    const double eps = 1.0e-4;
    for (int y = 0; y < targetrows; ++y) {
        for (int x = 0; x < targetcols; ++x) {
            // copy the pixel's three channels into the work vector;
            // memcpy takes the buffer itself (d_A), not the pointer's
            // address, and each pixel occupies three consecutive floats
            memcpy(d_A, &d_Tar[(y * targetcols + x) * 3], 3 * sizeof(float));
            matvec(d_A, d_RGB2, d_C);
            memcpy(d_A, d_C, 3 * sizeof(float));
            for (int c = 0; c < 3; c++)
                d_A[c] = d_A[c] > -5.0 ? pow((double)10.0, (double)d_A[c]) : eps;
            matvec(d_A, d_LMS2, d_C);
            memcpy(&d_Tar[(y * targetcols + x) * 3], d_C, 3 * sizeof(float));
        }
    }
}
</pre>
Helper
<pre>
inline void vecTransfer(float* h, Color3d* v)
{
    for (int j = 0; j < 3; ++j)
        h[j] = v->v[j];
}

// KERNEL helper function: does setup and launch
void matvec_L(cv::Mat* mRGB2LMS, cv::Mat* mLMS2lab, float* h_C, int tarrow, int tarcol, float* h_Tar)
{
    float *h_A, *h_RGB2, *h_LMS2, *d_Tar;
    float *d_A, *d_RGB2, *d_LMS2, *d_C;
    int N = 3;
    // sizeof(h_C)/sizeof(h_Tar) would only give the pointer size,
    // so compute the buffer sizes explicitly
    size_t tarBytes = (size_t)tarrow * tarcol * 3 * sizeof(float);
    h_A = (float*)malloc(sizeof(float) * N);
    h_RGB2 = new float[mRGB2LMS->total()];
    h_LMS2 = new float[mLMS2lab->total()];
    cudaMalloc((void**)&d_A, sizeof(float) * N);
    cudaMalloc((void**)&d_RGB2, sizeof(float) * N * N);
    cudaMalloc((void**)&d_LMS2, sizeof(float) * N * N);
    cudaMalloc((void**)&d_C, sizeof(float) * N);
    cudaMalloc((void**)&d_Tar, tarBytes);
    Color3d vec;
    // copy vec and matrices to host pointers (element-wise, since the
    // cv::Mat data is double while the kernel works on float)
    vecTransfer(h_A, &vec);
    for (size_t i = 0; i < mRGB2LMS->total(); ++i)
        h_RGB2[i] = (float)((double*)mRGB2LMS->data)[i];
    for (size_t i = 0; i < mLMS2lab->total(); ++i)
        h_LMS2[i] = (float)((double*)mLMS2lab->data)[i];
    cudaMemcpy(d_A, h_A, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_RGB2, h_RGB2, sizeof(float) * N * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_LMS2, h_LMS2, sizeof(float) * N * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Tar, h_Tar, tarBytes, cudaMemcpyHostToDevice);
    matvec_kernel<<<N / BLOCK_SIZE + 1, BLOCK_SIZE>>>(d_A, d_RGB2, d_LMS2, d_C, N, tarrow, tarcol, d_Tar);
    //printf("error code: %s\n", cudaGetErrorString(cudaGetLastError()));
    // copy the work vector and the transformed image back to the host
    cudaMemcpy(h_C, d_C, sizeof(float) * N, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_Tar, d_Tar, tarBytes, cudaMemcpyDeviceToHost);
    free(h_A);
    delete[] h_RGB2;   // allocated with new[], so delete[] rather than free
    delete[] h_LMS2;
    cudaFree(d_A);
    cudaFree(d_RGB2);
    cudaFree(d_LMS2);
    cudaFree(d_C);
    cudaFree(d_Tar);
}
</pre>
Changes to main loop
Old
<pre>
for (int y = 0; y < target.rows; y++) {
    for (int x = 0; x < target.cols; x++) {
        v = target.at<Color3d>(y, x);
        matvec_L(&v, &mlab2LMS, h_C);
        memcpy(&v, h_C, N * sizeof(float));
        for (int c = 0; c < 3; c++)
            v(c) = v(c) > -5.0 ? pow(10.0, v(c)) : eps;
        matvec_L(&v, &mLMS2RGB, h_C);
        memcpy(&target.at<Color3d>(y, x), h_C, N * sizeof(float));
    }
}
</pre>
New
<pre>
// sizeof(target.data) would only give the pointer size, so compute the
// buffer size from the image dimensions (note: if target stores double
// components, an element-wise conversion is needed instead of memcpy)
size_t tarBytes = (size_t)rows * cols * 3 * sizeof(float);
float* h_TARGET = (float*)malloc(tarBytes);
memcpy(h_TARGET, target.data, tarBytes);
matvec_L(&mlab2LMS, &mLMS2RGB, h_C, rows, cols, h_TARGET);
</pre>
=== Results ===
For our profiling we had three test cases.
<pre>
(Size is in pixels)
Small : 400x400 by 400x400
Medium : 1400x989 by 1215x717
Large : 7362x4858 by 6000x4032
</pre>
[[File:a2timings.PNG]]
== Assignment 3 ==
=== Link to the Original Unchanged Project and the Project Used for A3 ===
[https://www.dropbox.com/s/32q70e9iovpm80l/a2.7z?dl=0 Download]
The file contains the Visual Studio solutions used for Assignment 3 as well as a Visual Studio solution version of the original code.
To build and run, the location of the OpenCV directory must be set in the following project properties:
1. Under c/c++ -> general -> additional include directories -> the path to opencv\..\..\include
2. Under linker -> general -> additional library directories -> the path to opencv\..\..\lib
After doing the above, build the solution and run the appropriate Release or Debug exe with the target and reference image as arguments.
A link to the corresponding .xlsx file of the run times can be found [https://www.dropbox.com/s/ecu7eycyv41krwq/a3.xlsx?dl=0 here]
=== What was done ===
At first the power function used in the kernel was switched from pow to __powf, as the double-precision pow is the heavier function, but the effect was very small, a difference of around 10-30 milliseconds. Afterwards the kernel was upgraded to use a grid-stride loop. Instead of doing all the calculations on one thread, each thread became responsible for the colour shift of all the pixels in a separate row.
Other changes were attempted to transfer all the data necessary for the calculations at once at the beginning and then perform all of tatsy's calculations on the device side, but due to time constraints and the complexity of the project we were unable to fully implement them. The code is, however, left (commented out) in the included project download.
=== Optimized Kernel ===
<pre>
__global__ void matvec_kernel(float* d_A, float* d_RGB2, float* d_LMS2, float* d_C,
    const int n, int targetrows, int targetcols, float* d_Tar)
{
    const double eps = 1.0e-4;
    // grid-stride loop: each thread starts at its global id and advances by
    // the total thread count, so the threads cover all rows between them
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < targetrows;
         tid += blockDim.x * gridDim.x)
    {
        for (int x = 0; x < targetcols; ++x) {
            // memcpy takes the buffer itself (d_A), not the pointer's
            // address, and each pixel occupies three consecutive floats
            memcpy(d_A, &d_Tar[(tid * targetcols + x) * 3], 3 * sizeof(float));
            matvec(d_A, d_RGB2, d_C);
            memcpy(d_A, d_C, 3 * sizeof(float));
            for (int c = 0; c < 3; c++)
                d_A[c] = d_A[c] > -5.0f ? __powf(10.0f, d_A[c]) : eps;
            matvec(d_A, d_LMS2, d_C);
            memcpy(&d_Tar[(tid * targetcols + x) * 3], d_C, 3 * sizeof(float));
        }
    }
}
</pre>
=== Results ===
[[File:a3timings.PNG]]
When running in Release and comparing the results to the original unchanged project straight from tatsy, we noticed that for very small images the CUDA version is slightly slower. This is probably because the actual colour conversion of a smaller image takes much less time, while transferring the data to device memory adds a fixed overhead. Overall, comparing the optimized and unoptimized versions of the kernel, there was a visible speedup.