= The parallelizing Express =
== Assignment 3 ==
 
=== Link to the Original Unchanged Project and the Project Used for A3 ===
 
[https://www.dropbox.com/s/32q70e9iovpm80l/a2.7z?dl=0 Download]
 
The file contains the Visual Studio solution used for Assignment 3 as well as a Visual Studio solution version of the original code.
 
To run, the location of the OpenCV directory must be set in the following project properties:
 
1. Under C/C++ -> General -> Additional Include Directories -> the path to opencv\..\..\include
2. Under Linker -> General -> Additional Library Directories -> the path to opencv\..\..\lib
 
After doing the above, build the solution and run the appropriate Release or Debug exe with the target and reference image as arguments.
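
For example, a run from the command line might look like the following (the executable name here is a placeholder; use whatever name the built solution actually produces, with the target image first and the reference image second as described above):

<pre>
Release\ColorTransfer.exe target.jpg reference.jpg
</pre>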
 
A link to the corresponding xlsx file of the run times can be found [https://www.dropbox.com/s/ecu7eycyv41krwq/a3.xlsx?dl=0 here]
 
=== What was done ===
 
At first, the power function used in the kernel was switched out for the __powf intrinsic, since the standard pow function is a heavier call. However, the gains were very small, amounting to a difference of only around 10-30 milliseconds. Afterwards, the kernel was upgraded to use a grid-stride loop. Instead of doing all the calculations on one thread, each row could be processed by a separate thread, so that one thread is responsible for the colour shift of every pixel within its row.
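 
As a minimal sketch of the intrinsic swap (this is not the project kernel, just an illustration), replacing pow with the single-precision __powf intrinsic inside a kernel looks like this:

<pre>
// sketch only: pow operates in double precision, while __powf is a
// faster single-precision hardware intrinsic with lower accuracy
__global__ void scale_kernel(float* data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        // before: data[i] = pow(10.0, (double)data[i]);
        data[i] = __powf(10.0f, data[i]);   // after: fast intrinsic
    }
}
</pre>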
 
Other changes were attempted to transfer all the data needed for the calculations to the device at once, up front, and then perform all of the calculations done by tatsy on the device side. Due to time constraints and the complexity of the project, however, we were unable to fully implement these changes. The code is nevertheless left (commented out) in the included project download. A rough sketch of the intended up-front transfer is shown below.
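 
The sketch below illustrates the idea of a single up-front transfer (the names h_img and d_img and the buffer size are placeholders, not the actual project variables):

<pre>
// sketch only: copy the whole target image to the device once,
// run every stage of the conversion there, then copy back once
size_t bytes = targetrows * targetcols * 3 * sizeof(float);
float* d_img = nullptr;
cudaMalloc(&d_img, bytes);
cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice);
// ... launch all conversion kernels against d_img here ...
cudaMemcpy(h_img, d_img, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_img);
</pre>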
 
=== Optimized Kernel ===
<pre>
__global__ void matvec_kernel(float* d_A, float* d_RGB2, float* d_LMS2, float* d_C,
                              const int n, int targetrows, int targetcols, float* d_Tar)
{
    const float eps = 1.0e-4f;
    // grid-stride loop: each thread takes one row at a time and the
    // whole grid strides forward over the remaining rows
    for (int tid = threadIdx.x + blockIdx.x * blockDim.x;
         tid < targetrows;
         tid += blockDim.x * gridDim.x)
    {
        for (int x = 0; x < targetcols; ++x) {
            // copy three channel values from the target buffer into the scratch vector d_A
            memcpy(d_A, &d_Tar[tid * 3 + x], 3 * sizeof(float));
 
            // first colour-space transform
            matvec(d_A, d_RGB2, d_C);
            memcpy(d_A, d_C, 3 * sizeof(float));
 
            // undo the log with the fast single-precision intrinsic
            for (int c = 0; c < 3; c++)
                d_A[c] = d_A[c] > -5.0f ? __powf(10.0f, d_A[c]) : eps;
 
            // second transform, then write the converted values back to the target buffer
            matvec(d_A, d_LMS2, d_C);
            memcpy(&d_Tar[tid * 3 + x], d_C, 3 * sizeof(float));
        }
    }
}
</pre>
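 
The host-side launch code is not shown on this page. With a grid-stride loop the grid only needs enough threads to cover the rows; a sketch of how the kernel could be launched (the block and thread counts are illustrative, not the values used in the project, and the d_* pointers are assumed to already be allocated on the device) is:

<pre>
// illustrative launch configuration for the grid-stride kernel;
// any grid size works, since extra rows are picked up by the stride loop
int threads = 256;
int blocks  = (targetrows + threads - 1) / threads;
matvec_kernel<<<blocks, threads>>>(d_A, d_RGB2, d_LMS2, d_C,
                                   n, targetrows, targetcols, d_Tar);
cudaDeviceSynchronize();
</pre>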
 
=== Results ===
 
[[File:a3timings.PNG]]
 
 
When running in Release and comparing the results to the original unchanged project straight from tatsy, we noticed that for very small images the CUDA version is slightly slower. This is probably because the actual colour conversion of a smaller image takes much less time, so the cost of transferring the data to device memory adds noticeable overhead. Overall, when comparing the optimized and unoptimized versions of the kernel, there was a visible improvement.
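 
One way to confirm that transfer overhead dominates on small images is to time the copy and the kernel separately with CUDA events. The sketch below is not the measurement code used for the table above; h_Tar, bytes, blocks and threads are placeholders:

<pre>
// sketch: time the host-to-device copy and the kernel separately
cudaEvent_t start, mid, stop;
cudaEventCreate(&start); cudaEventCreate(&mid); cudaEventCreate(&stop);
 
cudaEventRecord(start);
cudaMemcpy(d_Tar, h_Tar, bytes, cudaMemcpyHostToDevice);
cudaEventRecord(mid);
matvec_kernel<<<blocks, threads>>>(d_A, d_RGB2, d_LMS2, d_C,
                                   n, targetrows, targetcols, d_Tar);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
 
float copyMs = 0.0f, kernelMs = 0.0f;
cudaEventElapsedTime(&copyMs, start, mid);
cudaEventElapsedTime(&kernelMs, mid, stop);
</pre>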