== Assignment 3 ==
'''Block Size''' (Timothy Moy)
The code modified was line 22:
 const int ntpb = 16; // number of threads per block
The first quick way to try to improve performance was to change the block size. Varying the block size changed the kernel run times, but it wasn't apparent exactly what caused the differences. Most likely the 16*16 block configuration is small enough not to exhaust the resources of the SM, but still large enough to keep the device busy, which gives us a boost in execution times. https://devtalk.nvidia.com/default/topic/1026825/how-to-choose-how-many-threads-blocks-to-have-/
[[Media:assign3-ntpb.png]]
I then tried merging the sinf() and cosf() function calls into one via sincosf() so that the kernel made fewer function calls. That trimmed the run times a bit, but then I noticed that the sine and cosine never change, since the angle is constant for the whole image. This led to computing sin and cos on the Host and passing them in as parameters to the kernel. The result was a much more significant run-time improvement, since the kernel is no longer calculating the same two values in every thread.
Kernel Signature Changes:
 __global__ void rotateKernel(int* oldImage, int* newImage, int rows, int cols, float rads) {
vs
 __global__ void rotateKernel(int* oldImage, int* newImage, int rows, int cols, /*float rads*/ float sinRads, float cosRads) {
Kernel Code Changes:
 float sinRads = sinf(rads);
 float cosRads = cosf(rads);
 //float sinRads, cosRads;
 //__sincosf(rads, &sinRads, &cosRads);
vs
 //float sinRads = sinf(rads);
 //float cosRads = cosf(rads);
 float sinRads, cosRads;
 __sincosf(rads, &sinRads, &cosRads);
vs
 //float sinRads = sinf(rads);
 //float cosRads = cosf(rads);
 //float sinRads, cosRads;
 //__sincosf(rads, &sinRads, &cosRads);
and
Host Function Additions:
 float cos1 = cos(rads);
 float sin1 = sin(rads);
Kernel Launch Changes:
 rotateKernel<<<dGrid, dBlock>>>(d_a, d_b, rows, cols, rads);
vs
 rotateKernel<<<dGrid, dBlock>>>(d_a, d_b, rows, cols, sin1, cos1);
[[Media:assign3-sincos.png]]