===Dynamic Parallelism===
Dynamic parallelism in CUDA lets a kernel launch, and synchronize with, new nested (child) kernels directly from the device. For our use case, it also lets us stay on the device for longer and process data quickly, without constant cudaMemcpy() or cudaMalloc() calls.
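As a minimal sketch of the idea (not our training code), a parent kernel can launch a child matrix-multiply kernel from device code. The names <code>parent</code>, <code>child_dot</code>, and <code>run_nested_dot</code> are hypothetical. The 2D launch configuration is an assumption, chosen so that both the row index i and the column index j are covered. The sketch also assumes a GPU of compute capability 3.5 or higher and a build with relocatable device code, for example <code>nvcc -rdc=true -lcudadevrt</code>.

<syntaxhighlight lang="cpp">
#include <vector>
#include <cuda_runtime.h>

// Child kernel: p (ni x nj) = a (ni x nk) * b (nk x nj), all row-major.
__global__ void child_dot(const float* a, const float* b,
                          int ni, int nj, int nk, float* p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column
    if (i < ni && j < nj) {
        float sum = 0.0f;
        for (int k = 0; k < nk; k++)
            sum += a[i * nk + k] * b[k * nj + j];
        p[i * nj + j] = sum;
    }
}

// Parent kernel: one thread launches the child grid from the device.
__global__ void parent(const float* a, const float* b,
                       int ni, int nj, int nk, float* p) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        dim3 block(16, 16);
        dim3 grid((ni + block.x - 1) / block.x, (nj + block.y - 1) / block.y);
        child_dot<<<grid, block>>>(a, b, ni, nj, nk, p);
    }
}

// Hypothetical host helper: copies inputs once, runs the nested launch,
// and copies the product back.
std::vector<float> run_nested_dot(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  int ni, int nj, int nk) {
    float *d_a = nullptr, *d_b = nullptr, *d_p = nullptr;
    size_t bytes_p = sizeof(float) * ni * nj;
    cudaMalloc(&d_a, sizeof(float) * a.size());
    cudaMalloc(&d_b, sizeof(float) * b.size());
    cudaMalloc(&d_p, bytes_p);
    cudaMemcpy(d_a, a.data(), sizeof(float) * a.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), sizeof(float) * b.size(), cudaMemcpyHostToDevice);

    parent<<<1, 1>>>(d_a, d_b, ni, nj, nk, d_p);
    // Host-side wait: the parent grid only completes after its child grid has.
    cudaDeviceSynchronize();

    std::vector<float> p(ni * nj);
    cudaMemcpy(p.data(), d_p, bytes_p, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_p);
    return p;
}
</syntaxhighlight>

For a 2x2 check, <code>run_nested_dot({1, 2, 3, 4}, {5, 6, 7, 8}, 2, 2, 2)</code> should give the row-major product 19, 22, 43, 50. Error checking on the CUDA calls is left out to keep the sketch short.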
! Parent call Child kernel( ... )
<syntaxhighlight lang="cpp">
// Parent kernel: runs on the device and launches child kernels itself.
__global__ void train(float* d_W1, float* d_W2, float* d_W3, float* d_b_X, float* d_b_Y,
                      float* d_a2, float* d_a1, float* d_yhat, float* d_dyhat, float* d_dW3,
                      float* d_dW2, float* d_dW1, float* d_dz2, float* d_dz1, float* d_t) {
    int BATCH_SIZE = 256;
    float lr = 0.01 / BATCH_SIZE;
    d_dyhat = k_difference(d_yhat, d_b_Y, 10 * 10);
    // Child kernel launched from device code.
    kernel_dot <<<(2560 + 128)/64, 64>>> (d_dyhat, k_transpose(d_W3, 64, 10), BATCH_SIZE, 10, 64, d_dz2);
    // ... (rest of the training step is not shown in this excerpt)
}

// Child kernel: d_p (ni x nj) = d_a (ni x nk) * d_b (nk x nj), row-major.
__global__ void kernel_dot(float* d_a, float* d_b, int ni, int nj, int nk, float* d_p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // matrix multiplication
    if (i < ni && j < nj) {
        float sum = 0.0f;
        for (int k = 0; k < nk; k++)
            sum += d_a[i * nk + k] * d_b[k * nj + j];
        d_p[i * nj + j] = sum;
    }
}
</syntaxhighlight>
===Final Iteration===
! Train_kernelGPU code
===Final Profile===
This final profile covers only 20 iterations, because runs longer than 20 iterations produced errors, likely due to our naive implementation and poor coding practices.

