Changes

← Older edit

A-Team

8,688 bytes added, 00:50, 8 April 2019

→‎Final Profile

=== Assignment 1 ===

Our group decided to profile a couple of different solutions, the first being a simple neural network and ray tracing solution, in order to determine the best project to generate a solution for.

=====Neural Network=====

======Sebastian's findings======

I found a simple [https://gist.github.com/sbugrov/7f373f0e4788f8e076b8efa2abfd227a neural network] that takes a MNIST data set and preforms training on batches of the data. For a quick illustration MNIST is a numerical data set that contains many written numbers --in a gray scale format at 28 x 28 pixels in size. As well as the corresponding numerical values; between 0 and 9. The reason for this data set is to train networks such that they will be able to recognize written numbers when they confront them.

Our Hypothesis for this solution is a acceleration of roughly 10x; when dot() is parallelized. This means that our code should take somewhere in the ball park of 102 seconds to train the network.

=====Ray Tracing=====

======Henry's findings======

======Initial Profile======

{| class="wikitable mw-collapsible mw-collapsed"

! Initial Profile (Warning: long)

|-

| Initial Profile

Flat profile:

0.00 19.10 0.00 1 0.00 0.00 Imager::Spheroid::~Spheroid()

0.00 19.10 0.00 1 0.00 0.00 Algebra::UnitTest()

|}

----

From looking at the flat profile, 43.88% of time is in SolveLinearEquations. Most of the other time is used for calculating the shapes, while 1.02% is in the TraceRay function.

======Call Graph======

{| class="wikitable mw-collapsible mw-collapsed"

! Call Graph

|-

| Call graph (explanation follows)

Call graph

Most of the time (99.3%) is spent executing the SaveImage function (Imager::Scene::SaveImage(char const*, unsigned long, unsigned long, double, unsigned long) const). In the additional lodepng code that runs alongside the ray tracer, 94.4% of time is spent in the CalculateLighting function (Imager::Scene::CalculateLighting(Imager::Intersection const&, Imager::Vector const&, double, Imager::Color, int) const).

|}

----

=== Assignment 2 ===

During assignment 2, we tried a simple kernel that took the shape of a dot product, what this achieved was nothing special, actually as predicted at the end of assignment 1, continuously calling cudaMalloc and cudaMemCpy had severe consequences on time.

====Initial implementation====

~~vector <float>~~ //version 1 dot product ~~ddot~~__global__ void kdot(const ~~vector <~~float~~>& m1~~* d_a, const ~~vector <~~float~~>& m2~~* d_b, float* d_p, ~~const~~ int ~~m1_rows~~ni, ~~const~~ int ~~m1_columns~~nj, ~~const~~ int ~~m2_columns~~nk) { int i = blockIdx.x * blockDim.x + threadIdx.x; int j = blockIdx.y * blockDim.y + threadIdx.y; //matrix multiplication if (i < ni && j < nj) { ~~cudaError_t Error~~ float sum = ~~cudaSuccess~~0.0f; ~~vector~~for (int k = 0; k <nk; k++) sum += d_a[i * nk + k] * d_b[k * nj + j]; d_p[i * nj + j] = sum; } } ====Naive====Naturally this is a naive implementation as we are calling cudaMalloc for each iteration of the training for loop. cout << "Training the model ...\n"; for (unsigned i = 0; i < 10000; ++i) { This actually costs us an additional 20 minutes when profiling could be done. ====The next steps====Well firstly we had to engage in research as to understand how the actual neural network was learning; for example why they used relu() function, how back-propagation worked and so much more. Some additional sites will be included. =====After that and many coffees!===== __global__ void train(float* d_W1, float* d_W2, float* d_W3, float* d_b_X, float* d_b_Y, float* d_a2, float* d_a1, float* d_dyhat, float* d_dW3, float* d_dW2, float* d_dW1, float* d_dz2, float* d_dz1) { int BATCH_SIZE = 256; float lr = .01 / BATCH_SIZE; kdot<<< 50,51>>>(ktranspose(d_a2, BATCH_SIZE, 64), d_dyhat, 64, BATCH_SIZE, 10, d_dW3); kdot << <80,32>> >(d_dyhat, ktranspose(d_W3, 64, 10), BATCH_SIZE, 10, 64, d_dz2); kreluPrime(d_a2, 128 * 64); for (int i = 0; i < BATCH_SIZE * 10; i++) { d_dz2[i] = d_dz2[i] * d_a2[i]; } kdot << <1024, 32>> >(ktranspose(d_a1, BATCH_SIZE, 128), d_dz2, 128, BATCH_SIZE, 64, d_dW2); kdot << <512,32>> > ~~product~~(d_dz2, ktranspose(d_W2, 128, 64), BATCH_SIZE, 64, 128, d_dz1); kreluPrime(d_a1, BATCH_SIZE * 784); ~~float~~ for (int i = 0; i < 256 * ~~h_p~~ 64; i++) { d_dz1[i] = ~~new float~~d_dz1[i] * d_a1[~~m1.size~~i]; } kdot <<<512,512,32 >>>(ktranspose(d_b_X, BATCH_SIZE, 784), d_dz1, 784, BATCH_SIZE, 128, d_dW1); // Updating the parameters //W3 = W3 - lr * dW3; for (int i = 0; i < (64*10); i++) { d_W3[i]= d_W3[i] - lr * d_dW3[i]; } //W2 = W2 - lr * dW2; ~~float~~ for (int i = 0; i < (128* ~~h_m1~~ 64); i++) { d_W2[i] = ~~new float~~d_W2[~~m1_rows~~ i] - lr * ~~m1_columns~~d_dW2[i]; } //W1 = W1 - lr * dW1; for (int i = 0; i < m1(784*128); i++) { d_W1[i] = d_W1[i] - lr * d_dW1[i]; }} ===Dynamic Parallelism=== Dynamic Parallelism in CUDA allows for the support of kernels to create and synchronize new nested kernels.~~size~~Additionally, for our use case it also allows us to spend more time on the device to process information quickly without constant cudaMemcpy() or cudaMalloc() calls. {| class="wikitable mw-collapsible mw-collapsed"! Parent call Child kernel( ... )|-|<syntaxhighlight lang="cpp">__global__ void train(float* d_W1, float* d_W2, float* d_W3, float* d_b_X, float* d_b_Y, float* d_a2, float* d_a1, float* d_yhat, float* d_dyhat, float* d_dW3, float* d_dW2, float* d_dW1, float* d_dz2, float* d_dz1, float* d_t) { int BATCH_SIZE = 256; float lr = 0.01 / BATCH_SIZE; //backpropagation d_dyhat = k_difference(d_yhat, d_b_Y, 10 * 10); kernel_dot <<<(2560 + 128)/64, 64>>> (d_dyhat, k_transpose(d_W3, 64, 10), BATCH_SIZE, 10, 64, d_dz2); cudaDeviceSynchronize(); } __global__ void kernel_dot(float* d_a, float* d_b, int ni, int nj, int nk, float* d_p) { int i= blockIdx.x * blockDim.x +threadIdx.x; int j = blockIdx.y * blockDim.y +threadIdx.y; //matrix multiplication if (i < ni && j < nj) { ~~h_m1~~float sum = 0.0f; for (int k = 0; k < nk; k++) sum += d_a[i* nk + k] * d_b[k * nj + j] ~~= m1~~; d_p[i* nj + j]= sum;

}

}</syntaxhighlight>|} ===Final Iteration==={| class="wikitable mw-collapsible mw-collapsed"! GPU code|-|<syntaxhighlight lang="cpp">__device__ float* k_difference(const float* m1, const float* m2, const int size) { /* Returns the difference between the two vectors. */ float* ~~h_m2~~ difference = new float[~~m1_rows * m2_columns~~size]; for (int i = 0; i ~~!= m2.~~< size; i++) { ~~h_m2~~difference[i] = m1[i] - m2[i];

}

~~//declare device variables~~ return difference;} __device__ float* ~~d_m1;~~ k_MFV(const float f, const float* ~~d_m2;~~m, const int size) { float* ~~d_p;~~ ~~Error~~ mult = ~~cudaMalloc((void**)&d_m1, m1_rows * m1_columns * sizeof(~~new float))[size]; if for (~~Error !~~int i = ~~cudaSuccess~~0; i < size; i++) { ~~cerr << "Failed @ d_m1 " << cudaGetErrorName(Error) << "!";~~ ~~exit(EXIT_FAILURE)~~mult[i] = f * m[i];

}

~~Error~~ return mult;}__device__ float* k_MM(float* m1, float* m2, const int m2_size) { float* product = ~~cudaMalloc~~new float[m2_size]; for (int i = 0; i != m2_size; ++i) { product[i] = m1[i] * m2[i]; }; return product;}__device__ float* k_transpose(~~void~~float *m, const int C, const int R) { /*~~)&d_m2~~ Returns a transpose matrix of input matrix. Inputs: m: vector, input matrix C: int, number of columns in the input matrix R: int, number of rows in the input matrix Output: vector, ~~m1_rows~~ transpose matrix mT of input matrix m * ~~m2_columns~~ / float* ~~sizeof(~~mT = new float))[C * R]; if for (~~Error~~ unsigned n = 0; n != ~~cudaSuccess~~C * R; n++) { ~~cerr << "Failed @ d_m2 " << cudaGetErrorName(Error) << "!"~~unsigned i = n / C; unsigned j = n % C; ~~exit(EXIT_FAILURE)~~mT[n] = m[R*j + i];

}

~~Error~~ return mT; //for (int i = ~~cudaMalloc~~0; i<R; ++i) // for (int j = 0; j<C; ++j) // { // mT[j * C + i] = m[i * R + j]; // } //return mT;}__device__ void dkernel_dot(~~void~~float* d_a, float* d_b, int ni, int nj, int nk, float*d_p) { for (int row = 0; row != ni; ++row) { for (int col = 0; col != nk; ++col) { d_p[row *nk + col] = 0.f; for (int k = 0; k != nj; ++k)&{ d_p[row * nk + col] += d_a[row * nj + k] * d_b[k * nk + col]; } } }}//version 1 dot product__global__ void kernel_dot(float* d_a, float* d_b, ~~m1_rows~~ int ni, int nj, int nk, float* ~~m1_columns~~ d_p) { int i = blockIdx.x * ~~sizeof~~blockDim.x + threadIdx.x; int j = blockIdx.y * blockDim.y + threadIdx.y; //matrix multiplication if (i < ni && j < nj) { floatsum = 0.0f; for (int k = 0; k < nk; k++) sum += d_a[i * nk + k] * d_b[k * nj + j]; d_p[i * nj + j] = sum; }}void cudaCheck(cudaError_t Error);{

if (Error != cudaSuccess) {

cerr ~~<< "Failed @ d_p "~~ << cudaGetErrorName(Error) << "!";

exit(EXIT_FAILURE);

}

} __device__ float* k_relu(float* a, int n) { ~~Error~~ for (int i = ~~cudaMemcpy~~0; i < n; ++i) { if (~~d_m1, h_m1, m1_rows~~ a[i] < 0) { a[i] = 0.01f; } else a[i] = a[i]; } return a;}__device__ float* ~~m1_columns~~ k_reluPrime(float* ~~sizeof~~a, int n) { for (int i = 0; i < n; ++i) { if (a[i] > 0) { a[i] = 1.0f; } else a[i] = 0.0; } return a;}///activation functions __global__ __global__ void kernel_relu(float)* a, ~~cudaMemcpyHostToDevice~~int n){ int i = blockIdx.x * blockDim.x + threadIdx.x; if (~~Error != cudaSuccess~~i < n) { ~~cerr << "Failed @ Memcpy d_m1 " <~~if (a[i] < ~~cudaGetErrorName(Error~~0) ~~<< "!"~~{ a[i] = 0.01f; ~~exit(EXIT_FAILURE)~~} else a[i] = a[i];

}

~~Error = cudaMemcpy~~}__global__ void kernel_reluPrime(~~d_m2, h_m2, m1_rows * m2_columns~~ float* ~~sizeof(float)~~a, ~~cudaMemcpyHostToDevice~~int n){ int i = blockIdx.x * blockDim.x + threadIdx.x; if (~~Error != cudaSuccess~~i < n) { ~~cerr << "Failed @ Memcpy d_m2 " << cudaGetErrorName~~if (~~Error~~a[i] > 0) ~~<< "!"~~{ a[i] = 1.0f; ~~exit(EXIT_FAILURE)~~} else a[i] = 0.0;

}

~~//set blocks and call kernel~~

~~int width = m1_rows;~~

~~int height = m1_columns;~~

~~dim3 dBlock(32, 32);~~

~~dim3 dGrid((width + dBlock.x - 1) / dBlock.x, (height + dBlock.y - 1) / dBlock.y);~~

~~kdot << < dGrid~~__device__ void ksoftmax(float *input, ~~dBlock >> >~~ int input_len) { //assert(~~d_m1, d_m2, d_p, m1_rows, m1_columns, m2_columns~~input != NULL); if //assert(~~Error~~ input_len != ~~cudaSuccess~~0); int i; float m; /* Find maximum value from input array */ m = input[0]; for (i = 1; i < input_len; i++) { if (input[i] > m) { m = input[i]; } } float sum = 0; ~~cerr~~ for (i = 0; i <~~< "Failed @ kdot function call " << cudaGetErrorName~~input_len; i++) { sum += expf(~~Error~~input[i] - m) ; } for (i = 0; i <~~< "!"~~input_len;i++) { ~~exit~~ input[i] = expf(input[i] - m - log(~~EXIT_FAILURE~~sum)); }

}

__device__ void k_sigmoid(float* m1, int size) { /* Returns the value of the sigmoid function f(x) = 1/~~copy device matrix to host matrix~~(1 + e^-x). ~~Error = cudaMemcpy(h_p~~Input: m1, ~~d_p, m1_rows * m1_columns * sizeof~~a vector. Output: 1/(~~float~~1 + e^-x)~~, cudaMemcpyDeviceToHost);~~for every element of the input matrix m1. */ if for (~~Error~~ unsigned i = 0; i != ~~cudaSuccess~~size; ++i) { ~~cerr << "Failed @ cudaMemcpy from d_p to h_p " << cudaGetErrorName~~m1[i] = 1 / (~~Error~~1 + exp(-m1[i]) ~~<< "!";~~ ~~exit(EXIT_FAILURE~~);

}

}__global__ void feed_forward(float* d_b_X, float* d_W1, float* d_W2, float* d_W3, float* d_b_Y, float* d_a1, float* d_a2, float* d_yhat, float* d_dyhat) { int BATCH_SIZE = 256; float lr = 0.01 /~~/freeCuda & delete~~BATCH_SIZE; float* tempY = new float[256 * 64]; //~~display~~feed forward kernel_dot <<<256, 256>>> (~~"C = A B :"~~d_b_X, d_W1, BATCH_SIZE, ~~h_p~~784, ~~m1_rows~~128, ~~m1_columns~~d_a1); ~~Error = cudaFree~~cudaDeviceSynchronize(~~d_m1~~); ~~cudaCheck~~k_relu(~~Error~~d_a1, BATCH_SIZE * 784); ~~Error = cudaFree~~kernel_dot <<<256, 128>>> (~~d_m2~~d_a1, d_W2, BATCH_SIZE, 128, 64, d_a2); ~~cudaCheck~~cudaDeviceSynchronize(~~Error~~); ~~Error = cudaFree~~k_relu(~~d_p~~d_a2, BATCH_SIZE * 128); ~~cudaCheck~~kernel_dot <<<256, 64>>> (~~Error~~d_a2, d_W3, BATCH_SIZE, 64, 10, d_yhat); ~~delete[] h_m1~~cudaDeviceSynchronize(); ~~delete[] h_m2;~~ ~~cudaDeviceReset~~ksoftmax(tempY, 10 * 10); ~~//h_p to vector~~ for (int i = 0; i < ~~(m1_rows * m1_columns)~~100; i++) { ~~product.push_back(h_p~~d_yhat[i] = tempY[i]);

}

delete[] ~~h_p~~tempY;} __global__ void train(float* d_W1, float* d_W2, float* d_W3, float* d_b_X, float* d_b_Y, float* d_a2, float* d_a1, float* d_yhat, float* d_dyhat, float* d_dW3, float* d_dW2, float* d_dW1, float* d_dz2, float* d_dz1, float* d_t) { cudaError_t Error; int BATCH_SIZE = 256; ~~return product~~float lr = 0.01 / BATCH_SIZE; //backpropagation d_dyhat = k_difference(d_yhat, d_b_Y, 10 * 10); kernel_dot <<<(2560 + 128)/64, 64>>> (d_dyhat, k_transpose(d_W3, 64, 10), BATCH_SIZE, 10, 64, d_dz2); cudaDeviceSynchronize(); float* mT = new float[256 * 64 - 1]; for (int i = 0; i < 256; ++i) for (int j = 0; j < 64; ++j) { mT[j * 64 + i] = d_a2[i * 256 + j]; } kernel_dot <<<(16384 + 256)/64, 64>>> (mT, d_dyhat, 64, BATCH_SIZE, 10, d_dW3); cudaDeviceSynchronize(); k_reluPrime(d_a2, 256 * 64); for (int i = 0; i < BATCH_SIZE * 10; i++) { d_dz2[i] = d_dz2[i] * d_a2[i]; } mT = new float[256 * 128]; for (int i = 0; i < 256; ++i) for (int j = 0; j < 128; ++j) { mT[j * 128 + i] = d_a1[i * 256 + j]; } kernel_dot <<<64, 512>>> (mT, d_dz2, 128, BATCH_SIZE, 64, d_dW2); cudaDeviceSynchronize(); kernel_dot <<<80, 32>>> (d_dz2, k_transpose(d_W2, 128, 64), BATCH_SIZE, 64, 128, d_dz1); cudaDeviceSynchronize(); k_reluPrime(d_a1, BATCH_SIZE * 784); for (int i = 0; i < 256 * 64; i++) { d_dz1[i] = d_dz1[i] * d_a1[i]; } kernel_dot <<<784, 256>>> (d_t, d_dz1, 784, BATCH_SIZE, 128, d_dW1); cudaDeviceSynchronize(); //// Updating the parameters ////W3 = W3 - lr * dW3; d_W3 = k_difference(d_W3, k_MFV(lr, d_dW3, 64 * 10), 64 * 10); //W2 = W2 - lr * dW2; d_W2 = k_difference(d_W2, k_MFV(lr, d_dW2, 128 * 64), 128 * 64); ////W1 = W1 - lr * dW1; d_W1 = k_difference(d_W1, k_MFV(lr, d_dW1, 784 * 128), 784 * 128); for (int i = 0; i < (784 * 128); i++) { d_W1[i] = d_W1[i] - lr * d_dW1[i]; } //for (int i = 0; i != 10; ++i) { // for (int j = 0; j != 10; ++j) { // printf("%f ", d_W3[i * 10 + j]); // } // printf("\n"); //} //printf("\n"); //for (int i = 0; i != 10; ++i) { // for (int j = 0; j != 10; ++j) { // printf("%f ", d_yhat[i * 10 + j]); // } // printf("\n"); //} //printf("\n"); float* dif; dif = k_difference(d_b_Y, d_yhat, 10 * 10); float loss = 0.0; for (unsigned k = 0; k < BATCH_SIZE * 10; ++k) { loss += dif[k] * dif[k]; } printf("%f \n", loss / BATCH_SIZE); Error = cudaGetLastError(); if (Error != cudaSuccess) { printf("\n %s \n", Error); }};</syntaxhighlight>|}===Final Profile===This final profile is only of 20 iterations as we had errors occur beyond 20 iterations, likely due to naive coding and bad coding practice. [[File:nnfinalprofile.jpg]] ===Compiling===follow the article to set up visual studios for dynamic parallelism and recommended readings: http://developer.download.nvidia.com/assets/cuda/files/CUDADownloads/TechBrief_Dynamic_Parallelism_in_CUDA.pdf http://ramblingsofagamedevstudent.blogspot.com/2014/03/set-up-visual-studio-2012-for-cuda.html

=== Assignment 3 ===

====What we would do differently:====

There are many things, one of the major ones is to take on a more manageable task, one with proper documentation and reasoning behind chosen values.

Spdjurovic

113

edits

CDOT Wiki β

Changes

A-Team

CDOT Wiki ^β