93
edits
Changes
Savy Cat
,→Assignment 3
=== Assignment 3 ===
==== "Register" Index ====
For my first attempt at optimization, I thought maybe, just maybe, the index calculations were being performed from within global memory:
<nowiki>
__global__ void rot90(PX_TYPE* src, PX_TYPE* dst, int src_w, int src_h, int z) {
int k = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (k < src_w && j < src_h)
dst[(src_h - 1 - j) + k * src_h + src_w * src_h * z] = src[threadIdx.x + threadIdx.y * src_w + src_w * src_h * z];
}</nowiki>
So I declared two register variables and determined the indexes prior:
<nowiki>
__global__ void rot90(PX_TYPE* src, PX_TYPE* dst, int src_w, int src_h, int z) {
int k = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int d = (src_h - 1 - j) + k * src_h + src_w * src_h * z;
int s = threadIdx.x + threadIdx.y * src_w + src_w * src_h * z;
if (k < src_w && j < src_h)
dst[d] = src[s];
}</nowiki>
This only had a slightly negative effect. Although, such a small difference may have been due to the luck of the run:
[[File:Summary-4.png]]
==== Unsigned Char vs. Float ====
The first real improvement came from changing PX_TYPE from float back to unsigned char, as used in the serial version. Unsigned char is good enough for all .jpg colour values (255). GPU are designed to perform operations on floating point numbers, however, we are not performing any calculations outside of the indexing. The performance of the kernel was the same for float or unsigned char. We copy the source image to device once, and back to the host 12 times, making size relevant.
{| class="wikitable"
|+Size Comparison
|-
|
|Unsigned Char
|Float
|-
|Bread
|Pie
|
|-
|Butter
|Ice cream
|
|-
|Butter
|Ice cream
|
|-
|Butter
|Ice cream
|
|}