CDOT Wiki β

A-Team

1,476 bytes added, 00:50, 8 April 2019
}
===Dynamic Parallelism===
Dynamic parallelism in CUDA lets a kernel launch new nested (child) kernels directly from the device and synchronize with them. For our use case, it also lets us keep more of the work on the device, processing data quickly without returning to the host for repeated cudaMemcpy() or cudaMalloc() calls.
 
{| class="wikitable mw-collapsible mw-collapsed"
! Parent kernel calling a child kernel ( ... )
|-
|
<syntaxhighlight lang="cpp">
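// Forward declaration so the parent kernel can launch the child defined below.
__global__ void kernel_dot(float* d_a, float* d_b, int ni, int nj, int nk, float* d_p);

// Parent kernel: performs part of backpropagation and launches kernel_dot as a child grid.
// k_difference and k_transpose are device helpers that are not shown in this excerpt.
__global__ void train(float* d_W1, float* d_W2, float* d_W3, float* d_b_X, float* d_b_Y, float* d_a2, float* d_a1, float* d_yhat, float* d_dyhat, float* d_dW3, float* d_dW2, float* d_dW1, float* d_dz2, float* d_dz1, float* d_t) {
    const int BATCH_SIZE = 256;
    float lr = 0.01f / BATCH_SIZE;  // learning rate scaled by batch size (unused in this excerpt)
    // backpropagation
    d_dyhat = k_difference(d_yhat, d_b_Y, 10 * 10);
    // d_dz2 (BATCH_SIZE x 64) = d_dyhat (BATCH_SIZE x 10) * W3^T (10 x 64), assuming W3 is 64 x 10.
    // kernel_dot indexes threads in 2D, so it needs a 2D launch; with a 1D launch j stays 0
    // and only one output column is written. Its size arguments are (rows, columns, inner dimension).
    dim3 block(16, 16);
    dim3 grid((BATCH_SIZE + block.x - 1) / block.x, (64 + block.y - 1) / block.y);
    kernel_dot<<<grid, block>>>(d_dyhat, k_transpose(d_W3, 64, 10), BATCH_SIZE, 64, 10, d_dz2);
    // Wait for the child grid to finish (device-side cudaDeviceSynchronize() is deprecated since CUDA 11.6).
    cudaDeviceSynchronize();
}

// Child kernel: d_p (ni x nj) = d_a (ni x nk) * d_b (nk x nj), all stored row-major.
__global__ void kernel_dot(float* d_a, float* d_b, int ni, int nj, int nk, float* d_p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // output row
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // output column
    // matrix multiplication: one thread per output element
    if (i < ni && j < nj) {
        float sum = 0.0f;
        for (int k = 0; k < nk; k++)
            sum += d_a[i * nk + k] * d_b[k * nj + j];
        d_p[i * nj + j] = sum;
    }
}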
</syntaxhighlight>
|}
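To check what kernel_dot should produce, a host-side reference can be useful. The following is a minimal sketch, not part of our project code: <code>dot_reference</code> and <code>blocks_for</code> are hypothetical names introduced here for illustration. It assumes the same row-major layout and the same (ni, nj, nk) argument meaning as kernel_dot above, and it also shows the ceiling-division arithmetic commonly used to size a launch grid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for kernel_dot (hypothetical helper):
// p (ni x nj) = a (ni x nk) * b (nk x nj), all row-major in flat arrays.
std::vector<float> dot_reference(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 int ni, int nj, int nk) {
    // The inner dimension nk must match both inputs.
    assert(a.size() == static_cast<std::size_t>(ni) * nk);
    assert(b.size() == static_cast<std::size_t>(nk) * nj);
    std::vector<float> p(static_cast<std::size_t>(ni) * nj, 0.0f);
    for (int i = 0; i < ni; ++i) {
        for (int j = 0; j < nj; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < nk; ++k)
                sum += a[i * nk + k] * b[k * nj + j];
            p[i * nj + j] = sum;
        }
    }
    return p;
}

// Ceiling division (hypothetical helper): number of blocks of size
// `block` needed to cover n elements along one grid dimension.
int blocks_for(int n, int block) { return (n + block - 1) / block; }
```

One way to use this is to copy the device result back to the host once and compare it element by element with <code>dot_reference</code>, allowing a small floating-point tolerance.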
===Final Iteration===
{| class="wikitable mw-collapsible mw-collapsed"
! Train_kernelGPU code
|-
|
</syntaxhighlight>
|}
===Final Profile===
This final profile covers only 20 iterations, because runs beyond 20 iterations produced errors, likely due to naive code and poor coding practices on our part.
[[File:nnfinalprofile.jpg]]
===Compiling===