Open main menu

CDOT Wiki β

Changes

A-Team

8,688 bytes added, 00:50, 8 April 2019
Final Profile
=== Assignment 1 ===
Our group decided to profile a couple of different solutions, the first being a simple neural network and ray tracing solution, in order to determine the best project to generate a solution for.
=====Neural Network=====
======Sebastian's findings======
I found a simple [https://gist.github.com/sbugrov/7f373f0e4788f8e076b8efa2abfd227a neural network] that takes a MNIST data set and preforms training on batches of the data. For a quick illustration MNIST is a numerical data set that contains many written numbers --in a gray scale format at 28 x 28 pixels in size. As well as the corresponding numerical values; between 0 and 9. The reason for this data set is to train networks such that they will be able to recognize written numbers when they confront them.
Our Hypothesis for this solution is a acceleration of roughly 10x; when dot() is parallelized. This means that our code should take somewhere in the ball park of 102 seconds to train the network.
=====Ray Tracing=====
======Henry's findings======
======Initial Profile======
 
{| class="wikitable mw-collapsible mw-collapsed"
! Initial Profile (Warning: long)
|-
| Initial Profile
Flat profile:
0.00 19.10 0.00 1 0.00 0.00 Imager::Spheroid::~Spheroid()
0.00 19.10 0.00 1 0.00 0.00 Algebra::UnitTest()
|}
 
----
From looking at the flat profile, 43.88% of time is in SolveLinearEquations. Most of the other time is used for calculating the shapes, while 1.02% is in the TraceRay function.
======Call Graph======
{| class="wikitable mw-collapsible mw-collapsed"
! Call Graph
|-
| Call graph (explanation follows)
Call graph
Most of the time (99.3%) is spent executing the SaveImage function (Imager::Scene::SaveImage(char const*, unsigned long, unsigned long, double, unsigned long) const). In the additional lodepng code that runs alongside the ray tracer, 94.4% of time is spent in the CalculateLighting function (Imager::Scene::CalculateLighting(Imager::Intersection const&, Imager::Vector const&, double, Imager::Color, int) const).
|}
 
----
=== Assignment 2 ===
During assignment 2, we tried a simple kernel that took the shape of a dot product, what this achieved was nothing special, actually as predicted at the end of assignment 1, continuously calling cudaMalloc and cudaMemCpy had severe consequences on time.
====Initial implementation====
vector <float> //version 1 dot product ddot__global__ void kdot(const vector <float>& m1* d_a, const vector <float>& m2* d_b, float* d_p, const int m1_rowsni, const int m1_columnsnj, const int m2_columnsnk) { int i = blockIdx.x * blockDim.x + threadIdx.x; int j = blockIdx.y * blockDim.y + threadIdx.y; //matrix multiplication if (i < ni && j < nj) { cudaError_t Error float sum = cudaSuccess0.0f; vectorfor (int k = 0; k <nk; k++) sum += d_a[i * nk + k] * d_b[k * nj + j]; d_p[i * nj + j] = sum; } } ====Naive====Naturally this is a naive implementation as we are calling cudaMalloc for each iteration of the training for loop. cout << "Training the model ...\n"; for (unsigned i = 0; i < 10000; ++i) { This actually costs us an additional 20 minutes when profiling could be done. ====The next steps====Well firstly we had to engage in research as to understand how the actual neural network was learning; for example why they used relu() function, how back-propagation worked and so much more. Some additional sites will be included.  =====After that and many coffees!===== __global__ void train(float* d_W1, float* d_W2, float* d_W3, float* d_b_X, float* d_b_Y, float* d_a2, float* d_a1, float* d_dyhat, float* d_dW3, float* d_dW2, float* d_dW1, float* d_dz2, float* d_dz1) { int BATCH_SIZE = 256; float lr = .01 / BATCH_SIZE; kdot<<< 50,51>>>(ktranspose(d_a2, BATCH_SIZE, 64), d_dyhat, 64, BATCH_SIZE, 10, d_dW3); kdot << <80,32>> >(d_dyhat, ktranspose(d_W3, 64, 10), BATCH_SIZE, 10, 64, d_dz2); kreluPrime(d_a2, 128 * 64); for (int i = 0; i < BATCH_SIZE * 10; i++) { d_dz2[i] = d_dz2[i] * d_a2[i]; } kdot << <1024, 32>> >(ktranspose(d_a1, BATCH_SIZE, 128), d_dz2, 128, BATCH_SIZE, 64, d_dW2); kdot << <512,32>> > product(d_dz2, ktranspose(d_W2, 128, 64), BATCH_SIZE, 64, 128, d_dz1); kreluPrime(d_a1, BATCH_SIZE * 784); float for (int i = 0; i < 256 * h_p 64; i++) { d_dz1[i] = new floatd_dz1[i] * d_a1[m1.sizei]; } kdot <<<512,512,32 >>>(ktranspose(d_b_X, BATCH_SIZE, 784), d_dz1, 784, BATCH_SIZE, 128, d_dW1); // Updating the parameters //W3 = W3 - lr * dW3; for (int i = 0; i < (64*10); i++) { d_W3[i]= d_W3[i] - lr * d_dW3[i]; } //W2 = W2 - lr * dW2; float for (int i = 0; i < (128* h_m1 64); i++) { d_W2[i] = new floatd_W2[m1_rows i] - lr * m1_columnsd_dW2[i]; } //W1 = W1 - lr * dW1; for (int i = 0; i < m1(784*128); i++) { d_W1[i] = d_W1[i] - lr * d_dW1[i]; }} ===Dynamic Parallelism=== Dynamic Parallelism in CUDA allows for the support of kernels to create and synchronize new nested kernels.sizeAdditionally, for our use case it also allows us to spend more time on the device to process information quickly without constant cudaMemcpy() or cudaMalloc() calls. {| class="wikitable mw-collapsible mw-collapsed"! Parent call Child kernel( ... )|-|<syntaxhighlight lang="cpp">__global__ void train(float* d_W1, float* d_W2, float* d_W3, float* d_b_X, float* d_b_Y, float* d_a2, float* d_a1, float* d_yhat, float* d_dyhat, float* d_dW3, float* d_dW2, float* d_dW1, float* d_dz2, float* d_dz1, float* d_t) { int BATCH_SIZE = 256; float lr = 0.01 / BATCH_SIZE; //backpropagation d_dyhat = k_difference(d_yhat, d_b_Y, 10 * 10); kernel_dot <<<(2560 + 128)/64, 64>>> (d_dyhat, k_transpose(d_W3, 64, 10), BATCH_SIZE, 10, 64, d_dz2); cudaDeviceSynchronize(); } __global__ void kernel_dot(float* d_a, float* d_b, int ni, int nj, int nk, float* d_p) { int i= blockIdx.x * blockDim.x +threadIdx.x; int j = blockIdx.y * blockDim.y +threadIdx.y; //matrix multiplication if (i < ni && j < nj) { h_m1float sum = 0.0f; for (int k = 0; k < nk; k++) sum += d_a[i* nk + k] * d_b[k * nj + j] = m1; d_p[i* nj + j]= sum;
}
}</syntaxhighlight>|} ===Final Iteration==={| class="wikitable mw-collapsible mw-collapsed"! GPU code|-|<syntaxhighlight lang="cpp">__device__ float* k_difference(const float* m1, const float* m2, const int size) { /* Returns the difference between the two vectors. */ float* h_m2 difference = new float[m1_rows * m2_columnssize]; for (int i = 0; i != m2.< size; i++) { h_m2difference[i] = m1[i] - m2[i];
}
//declare device variables return difference;} __device__ float* d_m1; k_MFV(const float f, const float* d_m2;m, const int size) { float* d_p; Error mult = cudaMalloc((void**)&d_m1, m1_rows * m1_columns * sizeof(new float))[size]; if for (Error !int i = cudaSuccess0; i < size; i++) { cerr << "Failed @ d_m1 " << cudaGetErrorName(Error) << "!"; exit(EXIT_FAILURE)mult[i] = f * m[i];
}
Error return mult;}__device__ float* k_MM(float* m1, float* m2, const int m2_size) { float* product = cudaMallocnew float[m2_size];  for (int i = 0; i != m2_size; ++i) { product[i] = m1[i] * m2[i]; };  return product;}__device__ float* k_transpose(voidfloat *m, const int C, const int R) {  /*)&d_m2 Returns a transpose matrix of input matrix. Inputs: m: vector, input matrix C: int, number of columns in the input matrix R: int, number of rows in the input matrix Output: vector, m1_rows transpose matrix mT of input matrix m * m2_columns /  float* sizeof(mT = new float))[C * R]; if for (Error unsigned n = 0; n != cudaSuccessC * R; n++) { cerr << "Failed @ d_m2 " << cudaGetErrorName(Error) << "!"unsigned i = n / C; unsigned j = n % C; exit(EXIT_FAILURE)mT[n] = m[R*j + i];
}
  Error return mT;  //for (int i = cudaMalloc0; i<R; ++i) // for (int j = 0; j<C; ++j) // { // mT[j * C + i] = m[i * R + j]; // }  //return mT;}__device__ void dkernel_dot(voidfloat* d_a, float* d_b, int ni, int nj, int nk, float*d_p) { for (int row = 0; row != ni; ++row) { for (int col = 0; col != nk; ++col) { d_p[row *nk + col] = 0.f; for (int k = 0; k != nj; ++k)&{ d_p[row * nk + col] += d_a[row * nj + k] * d_b[k * nk + col]; } } }}//version 1 dot product__global__ void kernel_dot(float* d_a, float* d_b, m1_rows int ni, int nj, int nk, float* m1_columns d_p) { int i = blockIdx.x * sizeofblockDim.x + threadIdx.x; int j = blockIdx.y * blockDim.y + threadIdx.y; //matrix multiplication if (i < ni && j < nj) { floatsum = 0.0f; for (int k = 0; k < nk; k++) sum += d_a[i * nk + k] * d_b[k * nj + j]; d_p[i * nj + j] = sum; }}void cudaCheck(cudaError_t Error);{
if (Error != cudaSuccess) {
cerr << "Failed @ d_p " << cudaGetErrorName(Error) << "!";
exit(EXIT_FAILURE);
}
}   __device__ float* k_relu(float* a, int n) { Error for (int i = cudaMemcpy0; i < n; ++i) { if (d_m1, h_m1, m1_rows a[i] < 0) { a[i] = 0.01f; } else a[i] = a[i]; } return a;}__device__ float* m1_columns k_reluPrime(float* sizeofa, int n) { for (int i = 0; i < n; ++i) { if (a[i] > 0) { a[i] = 1.0f; } else a[i] = 0.0; } return a;}///activation functions __global__ __global__ void kernel_relu(float)* a, cudaMemcpyHostToDeviceint n){ int i = blockIdx.x * blockDim.x + threadIdx.x; if (Error != cudaSuccessi < n) { cerr << "Failed @ Memcpy d_m1 " <if (a[i] < cudaGetErrorName(Error0) << "!"{ a[i] = 0.01f; exit(EXIT_FAILURE)} else a[i] = a[i];
}
Error = cudaMemcpy}__global__ void kernel_reluPrime(d_m2, h_m2, m1_rows * m2_columns float* sizeof(float)a, cudaMemcpyHostToDeviceint n){ int i = blockIdx.x * blockDim.x + threadIdx.x; if (Error != cudaSuccessi < n) { cerr << "Failed @ Memcpy d_m2 " << cudaGetErrorNameif (Errora[i] > 0) << "!"{ a[i] = 1.0f; exit(EXIT_FAILURE)} else a[i] = 0.0;
}
}
 
//set blocks and call kernel
int width = m1_rows;
int height = m1_columns;
dim3 dBlock(32, 32);
dim3 dGrid((width + dBlock.x - 1) / dBlock.x, (height + dBlock.y - 1) / dBlock.y);
kdot << < dGrid__device__ void ksoftmax(float *input, dBlock >> > int input_len) { //assert(d_m1, d_m2, d_p, m1_rows, m1_columns, m2_columnsinput != NULL); if //assert(Error input_len != cudaSuccess0); int i; float m; /* Find maximum value from input array */ m = input[0]; for (i = 1; i < input_len; i++) { if (input[i] > m) { m = input[i]; } }  float sum = 0; cerr for (i = 0; i << "Failed @ kdot function call " << cudaGetErrorNameinput_len; i++) { sum += expf(Errorinput[i] - m) ; }  for (i = 0; i << "!"input_len;i++) { exit input[i] = expf(input[i] - m - log(EXIT_FAILUREsum));  }
}
 __device__ void k_sigmoid(float* m1, int size) {  /* Returns the value of the sigmoid function f(x) = 1/copy device matrix to host matrix(1 + e^-x). Error = cudaMemcpy(h_pInput: m1, d_p, m1_rows * m1_columns * sizeofa vector. Output: 1/(float1 + e^-x), cudaMemcpyDeviceToHost);for every element of the input matrix m1. */ if for (Error unsigned i = 0; i != cudaSuccesssize; ++i) { cerr << "Failed @ cudaMemcpy from d_p to h_p " << cudaGetErrorNamem1[i] = 1 / (Error1 + exp(-m1[i]) << "!"; exit(EXIT_FAILURE);
}
}__global__ void feed_forward(float* d_b_X, float* d_W1, float* d_W2, float* d_W3, float* d_b_Y, float* d_a1, float* d_a2, float* d_yhat, float* d_dyhat) { int BATCH_SIZE = 256; float lr = 0.01 //freeCuda & deleteBATCH_SIZE; float* tempY = new float[256 * 64]; //displayfeed forward kernel_dot <<<256, 256>>> ("C = A B :"d_b_X, d_W1, BATCH_SIZE, h_p784, m1_rows128, m1_columnsd_a1); Error = cudaFreecudaDeviceSynchronize(d_m1); cudaCheckk_relu(Errord_a1, BATCH_SIZE * 784); Error = cudaFreekernel_dot <<<256, 128>>> (d_m2d_a1, d_W2, BATCH_SIZE, 128, 64, d_a2); cudaCheckcudaDeviceSynchronize(Error); Error = cudaFreek_relu(d_pd_a2, BATCH_SIZE * 128); cudaCheckkernel_dot <<<256, 64>>> (Errord_a2, d_W3, BATCH_SIZE, 64, 10, d_yhat); delete[] h_m1cudaDeviceSynchronize(); delete[] h_m2; cudaDeviceResetksoftmax(tempY, 10 * 10); //h_p to vector for (int i = 0; i < (m1_rows * m1_columns)100; i++) { product.push_back(h_pd_yhat[i] = tempY[i]);
}
delete[] h_ptempY;}  __global__ void train(float* d_W1, float* d_W2, float* d_W3, float* d_b_X, float* d_b_Y, float* d_a2, float* d_a1, float* d_yhat, float* d_dyhat, float* d_dW3, float* d_dW2, float* d_dW1, float* d_dz2, float* d_dz1, float* d_t) { cudaError_t Error; int BATCH_SIZE = 256; return productfloat lr = 0.01 / BATCH_SIZE; //backpropagation d_dyhat = k_difference(d_yhat, d_b_Y, 10 * 10); kernel_dot <<<(2560 + 128)/64, 64>>> (d_dyhat, k_transpose(d_W3, 64, 10), BATCH_SIZE, 10, 64, d_dz2); cudaDeviceSynchronize(); float* mT = new float[256 * 64 - 1]; for (int i = 0; i < 256; ++i) for (int j = 0; j < 64; ++j) { mT[j * 64 + i] = d_a2[i * 256 + j]; } kernel_dot <<<(16384 + 256)/64, 64>>> (mT, d_dyhat, 64, BATCH_SIZE, 10, d_dW3); cudaDeviceSynchronize(); k_reluPrime(d_a2, 256 * 64); for (int i = 0; i < BATCH_SIZE * 10; i++) { d_dz2[i] = d_dz2[i] * d_a2[i]; } mT = new float[256 * 128]; for (int i = 0; i < 256; ++i) for (int j = 0; j < 128; ++j) { mT[j * 128 + i] = d_a1[i * 256 + j]; } kernel_dot <<<64, 512>>> (mT, d_dz2, 128, BATCH_SIZE, 64, d_dW2); cudaDeviceSynchronize(); kernel_dot <<<80, 32>>> (d_dz2, k_transpose(d_W2, 128, 64), BATCH_SIZE, 64, 128, d_dz1); cudaDeviceSynchronize(); k_reluPrime(d_a1, BATCH_SIZE * 784); for (int i = 0; i < 256 * 64; i++) { d_dz1[i] = d_dz1[i] * d_a1[i]; } kernel_dot <<<784, 256>>> (d_t, d_dz1, 784, BATCH_SIZE, 128, d_dW1); cudaDeviceSynchronize(); //// Updating the parameters ////W3 = W3 - lr * dW3; d_W3 = k_difference(d_W3, k_MFV(lr, d_dW3, 64 * 10), 64 * 10); //W2 = W2 - lr * dW2; d_W2 = k_difference(d_W2, k_MFV(lr, d_dW2, 128 * 64), 128 * 64); ////W1 = W1 - lr * dW1; d_W1 = k_difference(d_W1, k_MFV(lr, d_dW1, 784 * 128), 784 * 128); for (int i = 0; i < (784 * 128); i++) { d_W1[i] = d_W1[i] - lr * d_dW1[i]; } //for (int i = 0; i != 10; ++i) { // for (int j = 0; j != 10; ++j) { // printf("%f ", d_W3[i * 10 + j]); // } // printf("\n"); //} //printf("\n"); //for (int i = 0; i != 10; ++i) { // for (int j = 0; j != 10; ++j) { // printf("%f ", d_yhat[i * 10 + j]); // } // printf("\n"); //} //printf("\n"); float* dif; dif = k_difference(d_b_Y, d_yhat, 10 * 10); float loss = 0.0; for (unsigned k = 0; k < BATCH_SIZE * 10; ++k) { loss += dif[k] * dif[k]; } printf("%f \n", loss / BATCH_SIZE); Error = cudaGetLastError(); if (Error != cudaSuccess) { printf("\n %s \n", Error); }};</syntaxhighlight>|}===Final Profile===This final profile is only of 20 iterations as we had errors occur beyond 20 iterations, likely due to naive coding and bad coding practice. [[File:nnfinalprofile.jpg]] ===Compiling===follow the article to set up visual studios for dynamic parallelism and recommended readings:  http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf   http://ramblingsofagamedevstudent.blogspot.com/2014/03/set-up-visual-studio-2012-for-cuda.html
=== Assignment 3 ===
====What we would do differently:====
There are many things, one of the major ones is to take on a more manageable task, one with proper documentation and reasoning behind chosen values.
113
edits