Changes


A-Team

1,476 bytes added, 00:50, 8 April 2019

→‎Final Profile

}

===Dynamic Parallelism===

Dynamic Parallelism in CUDA lets a kernel launch and synchronize new nested (child) kernels directly from the device. For our use case, it also lets us stay on the device longer and process information quickly, without constant cudaMemcpy() or cudaMalloc() calls from the host.

{| class="wikitable mw-collapsible mw-collapsed"

! Parent call Child kernel( ... )

|-

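|

<syntaxhighlight lang="cpp">
// Child kernel, declared up front so the parent kernel can launch it.
__global__ void kernel_dot(float* d_a, float* d_b, int ni, int nj, int nk, float* d_p);

// Parent kernel: runs the backpropagation step and launches kernel_dot from the device.
__global__ void train(float* d_W1, float* d_W2, float* d_W3, float* d_b_X, float* d_b_Y, float* d_a2, float* d_a1, float* d_yhat, float* d_dyhat, float* d_dW3, float* d_dW2, float* d_dW1, float* d_dz2, float* d_dz1, float* d_t) {
    const int BATCH_SIZE = 256;
    float lr = 0.01f / BATCH_SIZE;

    // backpropagation
    d_dyhat = k_difference(d_yhat, d_b_Y, 10 * 10);

    // kernel_dot takes its row index from x and its column index from y,
    // so the child grid has to be 2D to cover all BATCH_SIZE x 10 outputs.
    dim3 block(16, 16);
    dim3 grid((BATCH_SIZE + block.x - 1) / block.x, (10 + block.y - 1) / block.y);
    kernel_dot<<<grid, block>>>(d_dyhat, k_transpose(d_W3, 64, 10), BATCH_SIZE, 10, 64, d_dz2);

    // Wait for the child grid to finish before the parent continues.
    cudaDeviceSynchronize();
}

// Child kernel: d_p (ni x nj) = d_a (ni x nk) * d_b (nk x nj), one thread per output element.
__global__ void kernel_dot(float* d_a, float* d_b, int ni, int nj, int nk, float* d_p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // row of d_p
    int j = blockIdx.y * blockDim.y + threadIdx.y; // column of d_p

    // matrix multiplication
    if (i < ni && j < nj) {
        float sum = 0.0f;
        for (int k = 0; k < nk; k++)
            sum += d_a[i * nk + k] * d_b[k * nj + j];
        d_p[i * nj + j] = sum;
    }
}
</syntaxhighlight>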

|}
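As a rough illustration of the parent/child pattern described above, the following is a minimal, self-contained sketch and not our project code. The names <code>child_dot</code>, <code>parent</code> and <code>parent_child_max_error</code> are hypothetical, as are the test values. Instead of a device-side cudaDeviceSynchronize(), which recent CUDA toolkits have deprecated in device code, the sketch relies on the rule that a parent grid is not considered complete until its child grids have finished. It assumes a GPU of compute capability 3.5 or higher and compilation with relocatable device code, e.g. <code>nvcc -rdc=true demo.cu -lcudadevrt</code>.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Child kernel (hypothetical name): p (ni x nj) = a (ni x nk) * b (nk x nj).
__global__ void child_dot(const float* a, const float* b, int ni, int nj, int nk, float* p) {
    int i = blockIdx.y * blockDim.y + threadIdx.y; // row
    int j = blockIdx.x * blockDim.x + threadIdx.x; // column
    if (i < ni && j < nj) {
        float sum = 0.0f;
        for (int k = 0; k < nk; k++)
            sum += a[i * nk + k] * b[k * nj + j];
        p[i * nj + j] = sum;
    }
}

// Parent kernel (hypothetical name): one thread launches the child grid from the device.
__global__ void parent(const float* a, const float* b, int ni, int nj, int nk, float* p) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        dim3 block(16, 16);
        dim3 grid((nj + block.x - 1) / block.x, (ni + block.y - 1) / block.y);
        child_dot<<<grid, block>>>(a, b, ni, nj, nk, p);
        // No explicit device-side sync: the parent grid does not complete
        // until the child grid it launched has completed.
    }
}

// Host helper (hypothetical): runs the parent kernel on generated data and returns
// the largest absolute difference from a CPU reference, or -1.0f on a CUDA error.
float parent_child_max_error(int ni, int nj, int nk) {
    std::vector<float> a(ni * nk), b(nk * nj), p(ni * nj, 0.0f);
    for (size_t x = 0; x < a.size(); x++) a[x] = 0.01f * (x % 7);
    for (size_t x = 0; x < b.size(); x++) b[x] = 0.02f * (x % 5);

    float *d_a = nullptr, *d_b = nullptr, *d_p = nullptr;
    cudaMalloc(&d_a, a.size() * sizeof(float));
    cudaMalloc(&d_b, b.size() * sizeof(float));
    cudaMalloc(&d_p, p.size() * sizeof(float));
    cudaMemcpy(d_a, a.data(), a.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), b.size() * sizeof(float), cudaMemcpyHostToDevice);

    // A single host-side launch; the matrix product itself is started on the device.
    parent<<<1, 1>>>(d_a, d_b, ni, nj, nk, d_p);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(p.data(), d_p, p.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_p);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return -1.0f;
    }

    // Compare against a straightforward CPU matrix product.
    float max_err = 0.0f;
    for (int i = 0; i < ni; i++)
        for (int j = 0; j < nj; j++) {
            float ref = 0.0f;
            for (int k = 0; k < nk; k++)
                ref += a[i * nk + k] * b[k * nj + j];
            max_err = fmaxf(max_err, fabsf(ref - p[i * nj + j]));
        }
    return max_err;
}
```

Calling <code>parent_child_max_error(256, 10, 64)</code> from <code>main()</code> uses the same matrix shapes as the snippet above (a 256 x 64 matrix times a 64 x 10 matrix). A negative return value means a CUDA call failed, which is one way to spot a launch problem early.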

===Final Iteration===

{| class="wikitable mw-collapsible mw-collapsed"

! GPU code

|-

|

<syntaxhighlight lang="cpp">
</syntaxhighlight>

|}

===Final Profile===

This final profile covers only 20 iterations because errors occurred beyond that point, likely due to naive code and poor coding practices.

[[File:nnfinalprofile.jpg]]

===Compiling===

Spdjurovic


CDOT Wiki β
