Revision as of 23:23, 1 March 2019
Algo holics
Team Members
- Sukhbeer Dhillon, Simple Backpropagation Neural Network
- Gurpreet Singh, Sudoku Puzzle Solver
- Edgar Giang, Some other responsibility
Progress
Assignment 1
Sudoku Puzzle Solver by Gurpreet Singh
This is a program that solves 9x9 Sudoku puzzles using a brute-force algorithm. The user can either pass a Sudoku file as input or enter the values manually. In either case, the input must have exactly 9 rows and 9 columns, the cells must be separated by spaces, and the cells that need to be solved must contain 0 as their value.
The original source code can be found at Link
Logic
In this program the brute-force algorithm first puts a 1 in the first empty cell. Then it moves to the next empty cell, puts a 1 there, and checks whether it satisfies all the rules and conditions. If it doesn't, the algorithm increments the value to 2 and checks again, trying each value from 1 to 9 to find one that fits the cell. If no value in that range satisfies the cell, the program backtracks, increments the value of the previous cell, and tries the whole process again. In this way it solves the puzzle.
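The original program implements this with solveSudoku(), placeNum(), goBack() and the three check functions (the names that appear in the call graphs below). As a rough illustration only, here is a self-contained, simplified sketch of the same brute-force backtracking idea; the function and variable names here are mine, not the original author's:

```cpp
// Simplified sketch of the brute-force idea described above. The
// original program uses separate checkRow/checkColumn/checkSquare
// functions; here they are folded into one validity test.
bool isValid(int g[9][9], int r, int c, int v) {
    for (int i = 0; i < 9; ++i) {
        if (g[r][i] == v || g[i][c] == v) return false;            // row and column
        if (g[r / 3 * 3 + i / 3][c / 3 * 3 + i % 3] == v)          // 3x3 square
            return false;
    }
    return true;
}

// Try 1..9 in each empty (0) cell; backtrack when no value fits.
bool solve(int g[9][9], int pos = 0) {
    if (pos == 81) return true;                  // every cell filled
    int r = pos / 9, c = pos % 9;
    if (g[r][c] != 0) return solve(g, pos + 1);  // cell was given, skip it
    for (int v = 1; v <= 9; ++v) {
        if (isValid(g, r, c, v)) {
            g[r][c] = v;
            if (solve(g, pos + 1)) return true;
            g[r][c] = 0;                         // undo and try the next value
        }
    }
    return false;                                // no value fits: backtrack
}
```

The key difference from the prose description is that the "go back and increment" step falls out of the recursion for free: when the loop exhausts 1–9, the function returns false and the previous call resumes its own loop.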
Compiling the program
Enter the following commands:
g++ -std=c++0x -pg solver.cpp checks.cpp checksolution.cpp -o a
a fileName
-pg directs the compiler to include the executable code required for profiling.
-o directs the compiler to name the executable a.
If we run the sample-puzzle-1 (level: easy) file, which has the following text inside it:
0 6 0 0 0 0 9 7 2
0 5 0 0 0 2 0 0 3
0 7 0 3 9 0 5 0 0
2 0 0 0 0 5 4 0 8
0 0 0 0 0 0 0 0 0
3 0 1 8 0 0 0 0 6
0 0 4 0 2 3 0 8 0
7 0 0 9 0 0 0 2 0
9 2 5 0 0 0 0 4 0
The output will be:
1 6 3 4 5 8 9 7 2
4 5 9 7 1 2 8 6 3
8 7 2 3 9 6 5 1 4
2 9 7 1 6 5 4 3 8
5 8 6 2 3 4 1 9 7
3 4 1 8 7 9 2 5 6
6 1 4 5 2 3 7 8 9
7 3 8 9 4 1 6 2 5
9 2 5 6 8 7 3 4 1
Analysis
To analyze the call graph, enter the following command:
gprof -q -b a > a.clg
-q directs the profiler (gprof) to output a call graph.
-b directs the profiler to omit detailed explanations of the column headings from the output.
The call graph for the above execution looks like:
Call graph

granularity: each sample hit covers 2 byte(s) no time propagated

index % time    self  children    called     name
                0.00    0.00    4539/4539        placeNum(int, int) [10]
[8]      0.0    0.00    0.00    4539         checkRow(int, int) [8]
-----------------------------------------------
                0.00    0.00    1620/1620        placeNum(int, int) [10]
[9]      0.0    0.00    0.00    1620         checkColumn(int, int) [9]
-----------------------------------------------
                0.00    0.00    1120/1120        solveSudoku() [16]
[10]     0.0    0.00    0.00    1120         placeNum(int, int) [10]
                0.00    0.00    4539/4539        checkRow(int, int) [8]
                0.00    0.00    1620/1620        checkColumn(int, int) [9]
                0.00    0.00     698/698         checkSquare(int, int, int) [11]
-----------------------------------------------
                0.00    0.00     698/698         placeNum(int, int) [10]
[11]     0.0    0.00    0.00     698         checkSquare(int, int, int) [11]
-----------------------------------------------
                0.00    0.00     476/476         solveSudoku() [16]
[12]     0.0    0.00    0.00     476         goBack(int&, int&) [12]
-----------------------------------------------
                0.00    0.00       2/2           main [6]
[13]     0.0    0.00    0.00       2         print(int (*) [9]) [13]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [30]
[14]     0.0    0.00    0.00       1         _GLOBAL__sub_I_sudoku [14]
                0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [18]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [30]
[15]     0.0    0.00    0.00       1         _GLOBAL__sub_I_temp [15]
                0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
                0.00    0.00       1/1           main [6]
[16]     0.0    0.00    0.00       1         solveSudoku() [16]
                0.00    0.00    1120/1120        placeNum(int, int) [10]
                0.00    0.00     476/476         goBack(int&, int&) [12]
-----------------------------------------------
                0.00    0.00       1/1           main [6]
[17]     0.0    0.00    0.00       1         storePositions() [17]
-----------------------------------------------
                0.00    0.00       1/1           _GLOBAL__sub_I_sudoku [14]
[18]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [18]
-----------------------------------------------
                0.00    0.00       1/1           _GLOBAL__sub_I_temp [15]
[19]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------

Index by function name

 [14] _GLOBAL__sub_I_sudoku    [16] solveSudoku()       [13] print(int (*) [9])
 [15] _GLOBAL__sub_I_temp      [17] storePositions()    [12] goBack(int&, int&)
  [9] checkColumn(int, int)    [18] __static_initialization_and_destruction_0(int, int)
  [8] checkRow(int, int)       [11] checkSquare(int, int, int)
 [19] __static_initialization_and_destruction_0(int, int)  [10] placeNum(int, int)
From the above call graph we can see that the program took almost no time to find the solution, and that the most calls were made to the checkRow, checkColumn and checkSquare functions. However, to get a better understanding of the program, let's try a harder Sudoku puzzle.
If we run the sample-puzzle-2-hard (level: hard) file, which has the following text inside it:
0 0 0 0 0 0 0 0 0
0 0 0 0 0 3 0 8 5
0 0 1 0 2 0 0 0 0
0 0 0 5 0 7 0 0 0
0 0 4 0 0 0 1 0 0
0 9 0 0 0 0 0 0 0
5 0 0 0 0 0 0 7 3
0 0 2 0 1 0 0 0 0
0 0 0 0 4 0 0 0 9
The output will be:
9 8 7 6 5 4 3 2 1
2 4 6 1 7 3 9 8 5
3 5 1 9 2 8 7 4 6
1 2 8 5 3 7 6 9 4
6 3 4 8 9 2 1 5 7
7 9 5 4 6 1 8 3 2
5 1 9 2 8 6 4 7 3
4 7 2 3 1 9 5 6 8
8 6 3 7 4 5 2 1 9
The Call graph for the following looks like:
Call graph

granularity: each sample hit covers 2 byte(s) for 0.04% of 26.79 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    0.00   26.78                 main [1]
                0.68   26.09       1/1           solveSudoku() [2]
                0.01    0.00       1/1           storePositions() [9]
                0.00    0.00       2/2           print(int (*) [9]) [17]
-----------------------------------------------
                0.68   26.09       1/1           main [1]
[2]     99.9    0.68   26.09       1         solveSudoku() [2]
                3.64   21.56 157353814/157353814     placeNum(int, int) [3]
                0.89    0.00 69175252/69175252     goBack(int&, int&) [7]
-----------------------------------------------
                3.64   21.56 157353814/157353814     solveSudoku() [2]
[3]     94.1    3.64   21.56 157353814         placeNum(int, int) [3]
               13.31    0.00 622577597/622577597     checkRow(int, int) [4]
                5.04    0.00 223365661/223365661     checkColumn(int, int) [5]
                3.21    0.00 100608583/100608583     checkSquare(int, int, int) [6]
-----------------------------------------------
               13.31    0.00 622577597/622577597     placeNum(int, int) [3]
[4]     49.7   13.31    0.00 622577597         checkRow(int, int) [4]
-----------------------------------------------
                5.04    0.00 223365661/223365661     placeNum(int, int) [3]
[5]     18.8    5.04    0.00 223365661         checkColumn(int, int) [5]
-----------------------------------------------
                3.21    0.00 100608583/100608583     placeNum(int, int) [3]
[6]     12.0    3.21    0.00 100608583         checkSquare(int, int, int) [6]
-----------------------------------------------
                0.89    0.00 69175252/69175252     solveSudoku() [2]
[7]      3.3    0.89    0.00 69175252         goBack(int&, int&) [7]
-----------------------------------------------
                0.01    0.00       1/1           __libc_csu_init [10]
[8]      0.0    0.01    0.00       1         _GLOBAL__sub_I_sudoku [8]
                0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
                0.01    0.00       1/1           main [1]
[9]      0.0    0.01    0.00       1         storePositions() [9]
-----------------------------------------------
                                                 <spontaneous>
[10]     0.0    0.00    0.01                 __libc_csu_init [10]
                0.01    0.00       1/1           _GLOBAL__sub_I_sudoku [8]
                0.00    0.00       1/1           _GLOBAL__sub_I_temp [18]
-----------------------------------------------
                0.00    0.00       2/2           main [1]
[17]     0.0    0.00    0.00       2         print(int (*) [9]) [17]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [10]
[18]     0.0    0.00    0.00       1         _GLOBAL__sub_I_temp [18]
                0.00    0.00       1/1           __static_initialization_and_destruction_0(int, int) [20]
-----------------------------------------------
                0.00    0.00       1/1           _GLOBAL__sub_I_sudoku [8]
[19]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [19]
-----------------------------------------------
                0.00    0.00       1/1           _GLOBAL__sub_I_temp [18]
[20]     0.0    0.00    0.00       1         __static_initialization_and_destruction_0(int, int) [20]
-----------------------------------------------

Index by function name

  [8] _GLOBAL__sub_I_sudoku    [2] solveSudoku()       [17] print(int (*) [9])
 [18] _GLOBAL__sub_I_temp      [9] storePositions()     [7] goBack(int&, int&)
  [5] checkColumn(int, int)   [19] __static_initialization_and_destruction_0(int, int)
  [4] checkRow(int, int)       [6] checkSquare(int, int, int)
 [20] __static_initialization_and_destruction_0(int, int)   [3] placeNum(int, int)
From the above call graph we can see that for a harder Sudoku puzzle the time increased significantly. It can also be seen that almost 50% of the time is consumed by the checkRow function, 18.8% by checkColumn and 12% by checkSquare. Hundreds of millions of calls were made to these three functions; if we parallelize them, the efficiency of the program can be increased significantly.
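One way to expose that parallelism is to fuse the three scans so that every constraint cell is examined independently. The sketch below is my own simplification, not the original code: one pass over the 27 constraint cells builds a bitmask of values already used, so all nine candidate values are tested at once. On a GPU, each of the 27 reads is independent, so a thread per constraint cell (or per candidate value) is a natural mapping.

```cpp
// Compute a bitmask of the values still legal for cell (r, c):
// bit v is set iff v can be placed there. One loop covers the row,
// the column and the 3x3 square, replacing per-candidate calls to
// checkRow/checkColumn/checkSquare.
int legalValues(const int g[9][9], int r, int c) {
    int used = 0;                               // bit v set => value v occupied
    for (int i = 0; i < 9; ++i) {
        used |= 1 << g[r][i];                   // row scan
        used |= 1 << g[i][c];                   // column scan
        used |= 1 << g[r/3*3 + i/3][c/3*3 + i%3]; // 3x3 square scan
    }
    return ~used & 0x3FE;                       // keep bits 1..9 that are free
}
```

Empty cells contribute bit 0, which the final mask discards, so zeros never block a value.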
Simple Artificial Neural Network by Sukhbeer
Introduction
I am very interested in neural networks and started learning about them recently. This is a good opportunity to build on my knowledge of neural networks while also parallelising one. For that purpose, I have selected a very basic neural network that feeds forward with ReLU and softmax and back-propagates on a sample batch from the MNIST handwritten-digits dataset. In each iteration, the weights are adjusted to train the network for better predictions. The code performs a matrix multiplication (dot product) each time the activation vector and the delta vector are calculated, for the next layer and the previous layer respectively.
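For readers unfamiliar with the two activations, here are simplified sketches of relu() and softmax() as they typically look on std::vector<float> (the real code also operates on vector<float>, per the profile below, but these bodies are my own, not copied from the original source):

```cpp
#include <vector>
#include <cmath>
#include <algorithm>

// ReLU: clamp every negative activation to zero, pass positives through.
std::vector<float> relu(const std::vector<float>& z) {
    std::vector<float> out;
    out.reserve(z.size());
    for (float v : z) out.push_back(std::max(0.0f, v));
    return out;
}

// Softmax over one row of logits: exponentiate (shifted by the max for
// numerical stability) and normalise so the outputs sum to 1 -- which is
// why each prediction row printed below reads as a probability over 0..9.
std::vector<float> softmax(const std::vector<float>& z) {
    float m = *std::max_element(z.begin(), z.end());
    std::vector<float> out;
    out.reserve(z.size());
    float sum = 0.0f;
    for (float v : z) { out.push_back(std::exp(v - m)); sum += out.back(); }
    for (float& v : out) v /= sum;
    return out;
}
```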
Source Code
Here is the source code used. The result given below compares the predictions made by the trained network after 10000 iterations with the ground truth. The ground truth is the one-hot encoding of the labels between 0 and 9 (1 for the corresponding digit in the dataset).
-----------------------------------------------Epoch 10000--------------------------------------------------
Predictions:
0.000848207 9.07445e-06 0.000145165 0.797735 4.94866e-06 0.19374 1.55013e-06 0.000244941 0.00657041 0.000700498
1.36476e-05 1.07548e-07 8.3835e-05 0.000744837 0.299883 9.37717e-05 3.53349e-05 0.00822595 0.00210021 0.688819
5.11556e-06 0.000616957 0.000233088 0.87458 2.20579e-05 0.0140489 5.03569e-08 0.000518445 0.0826038 0.0273714
0.0178851 3.64621e-08 0.0174107 0.000322792 0.716312 0.00120967 0.189534 0.00303238 0.00613965 0.0481543
7.40077e-07 0.96872 0.014224 0.00555447 2.56397e-05 0.000115577 0.000157107 0.00366156 0.00669771 0.000842866
7.37584e-05 0.00306397 0.0184482 0.056542 0.000217984 0.0807415 0.000430994 1.09367e-05 0.838792 0.00167921
1.23026e-05 1.10682e-09 6.47478e-07 0.000129503 1.28475e-05 1.20242e-05 1.18166e-09 0.953265 2.63176e-05 0.046541
0.974183 3.50241e-18 1.99895e-07 3.4534e-07 2.3755e-11 0.0257772 1.96811e-09 6.99407e-09 3.92052e-05 2.28711e-08
2.21581e-05 9.26954e-09 0.000182046 0.00336899 3.40876e-05 0.0800376 8.35955e-07 1.2496e-07 0.914781 0.00157335
8.59312e-07 4.1739e-05 0.000106891 0.000122639 0.00018295 4.02451e-05 7.21105e-07 0.898311 0.00405182 0.0971408
Ground truth:
0 0 0 1 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1 0 0
Loss 0.184251
--------------------------------------------End of Epoch :(------------------------------------------------
Profiling
Here are the results of profiling the program:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 97.98   1061.73  1061.73                             dot(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, int, int, int)
  1.41   1076.95    15.23                             transpose(float*, int, int)
  0.16   1078.65     1.70                             operator-(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
  0.14   1080.13     1.48                             operator*(float, std::vector<float, std::allocator<float> > const&)
  0.12   1081.47     1.33                             relu(std::vector<float, std::allocator<float> > const&)
  0.08   1082.34     0.87 519195026     1.68     1.68 void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&)
  0.07   1083.07     0.73                             operator*(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
  0.05   1083.63     0.56                             reluPrime(std::vector<float, std::allocator<float> > const&)
  0.03   1083.93     0.30                             softmax(std::vector<float, std::allocator<float> > const&, int)
  0.02   1084.14     0.21    442679   474.87   474.87 void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&)
  0.02   1084.31     0.17  13107321    12.98    12.98 void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&)
  0.01   1084.45     0.14                             operator/(std::vector<float, std::allocator<float> > const&, float)
  0.01   1084.58     0.13    462000   281.67   281.67 void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&)
  0.01   1084.68     0.10                             split(std::string const&, char)
  0.00   1084.68     0.00         3     0.00     0.00 std::vector<float, std::allocator<float> >::vector(unsigned long, std::allocator<float> const&)
  0.00   1084.68     0.00         1     0.00     0.00 _GLOBAL__sub_I__Z5printRKSt6vectorIfSaIfEEii
Call graph

granularity: each sample hit covers 2 byte(s) for 0.00% of 1084.68 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     97.9 1061.73    0.00                 dot(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, int, int, int) [1]
-----------------------------------------------
                                                 <spontaneous>
[2]      1.4   15.23    0.00                 transpose(float*, int, int) [2]
-----------------------------------------------
                                                 <spontaneous>
[3]      0.2    1.70    0.00                 operator-(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) [3]
-----------------------------------------------
                                                 <spontaneous>
[4]      0.1    0.56    0.97                 reluPrime(std::vector<float, std::allocator<float> > const&) [4]
                0.82    0.00 491520000/519195026     void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]
                0.15    0.00  310000/442679      void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]
-----------------------------------------------
                                                 <spontaneous>
[5]      0.1    1.48    0.00                 operator*(float, std::vector<float, std::allocator<float> > const&) [5]
-----------------------------------------------
                                                 <spontaneous>
[6]      0.1    1.33    0.01                 relu(std::vector<float, std::allocator<float> > const&) [6]
                0.00    0.00  307321/13107321     void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&) [12]
                0.00    0.00 2075026/519195026     void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]
                0.00    0.00    2679/442679      void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]
-----------------------------------------------
                0.00    0.00 2075026/519195026     relu(std::vector<float, std::allocator<float> > const&) [6]
                0.04    0.00 25600000/519195026     softmax(std::vector<float, std::allocator<float> > const&, int) [9]
                0.82    0.00 491520000/519195026     reluPrime(std::vector<float, std::allocator<float> > const&) [4]
[7]      0.1    0.87    0.00 519195026         void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]
-----------------------------------------------
                                                 <spontaneous>
[8]      0.1    0.73    0.00                 operator*(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) [8]
-----------------------------------------------
                                                 <spontaneous>
[9]      0.1    0.30    0.27                 softmax(std::vector<float, std::allocator<float> > const&, int) [9]
                0.17    0.00 12800000/13107321     void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&) [12]
                0.06    0.00  130000/442679      void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]
                0.04    0.00 25600000/519195026     void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&) [7]
-----------------------------------------------
                                                 <spontaneous>
[10]     0.0    0.10    0.13                 split(std::string const&, char) [10]
                0.13    0.00  462000/462000      void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&) [14]
-----------------------------------------------
                0.00    0.00    2679/442679      relu(std::vector<float, std::allocator<float> > const&) [6]
                0.06    0.00  130000/442679      softmax(std::vector<float, std::allocator<float> > const&, int) [9]
                0.15    0.00  310000/442679      reluPrime(std::vector<float, std::allocator<float> > const&) [4]
[11]     0.0    0.21    0.00  442679         void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&) [11]
-----------------------------------------------
                0.00    0.00  307321/13107321     relu(std::vector<float, std::allocator<float> > const&) [6]
                0.17    0.00 12800000/13107321     softmax(std::vector<float, std::allocator<float> > const&, int) [9]
[12]     0.0    0.17    0.00 13107321         void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&) [12]
-----------------------------------------------
                                                 <spontaneous>
[13]     0.0    0.14    0.00                 operator/(std::vector<float, std::allocator<float> > const&, float) [13]
-----------------------------------------------
                0.13    0.00  462000/462000      split(std::string const&, char) [10]
[14]     0.0    0.13    0.00  462000         void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&) [14]
-----------------------------------------------
                0.00    0.00       3/3           random_vector(int) [28]
[22]     0.0    0.00    0.00       3         std::vector<float, std::allocator<float> >::vector(unsigned long, std::allocator<float> const&) [22]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [38]
[23]     0.0    0.00    0.00       1         _GLOBAL__sub_I__Z5printRKSt6vectorIfSaIfEEii [23]
-----------------------------------------------

Index by function name

 [23] _GLOBAL__sub_I__Z5printRKSt6vectorIfSaIfEEii (nn.cpp)
  [1] dot(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&, int, int, int)
  [6] relu(std::vector<float, std::allocator<float> > const&)
  [4] reluPrime(std::vector<float, std::allocator<float> > const&)
  [9] softmax(std::vector<float, std::allocator<float> > const&, int)
 [10] split(std::string const&, char)
  [2] transpose(float*, int, int)
  [3] operator-(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
  [8] operator*(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&)
  [5] operator*(float, std::vector<float, std::allocator<float> > const&)
 [13] operator/(std::vector<float, std::allocator<float> > const&, float)
  [7] void std::vector<float, std::allocator<float> >::emplace_back<float>(float&&)
 [11] void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float>(float&&)
 [12] void std::vector<float, std::allocator<float> >::_M_emplace_back_aux<float const&>(float const&)
 [14] void std::vector<std::string, std::allocator<std::string> >::_M_emplace_back_aux<std::string const&>(std::string const&)
 [22] std::vector<float, std::allocator<float> >::vector(unsigned long, std::allocator<float> const&)
Analysis
The total execution time of the program is around 10 minutes. As is evident from the profiling results, most of the execution time is taken up by the dot() function, which performs the matrix-matrix multiplication. This is the hotspot of the program, and it can be made more efficient by doing this computation and the other vector operations on the GPU.
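For reference, here is a minimal reconstruction of what such a dot() routine typically looks like. The signature mirrors the entry in the profile above; the body and the row-major flat-vector layout are my assumptions, not the original author's code.

```cpp
#include <vector>

// Naive row-major matrix product C = A * B, with A of size m x n and
// B of size n x k, both stored flat in std::vector<float>. Each C[i][j]
// is independent of all the others, which is exactly why this hotspot
// maps naturally onto a GPU thread per output element.
std::vector<float> dot(const std::vector<float>& A,
                       const std::vector<float>& B,
                       int m, int n, int k) {
    std::vector<float> C(m * k, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < k; ++j)
            for (int p = 0; p < n; ++p)
                C[i * k + j] += A[i * n + p] * B[p * k + j];
    return C;
}
```

With this layout, the triple loop does m*n*k multiply-adds per call, which is consistent with dot() dominating the flat profile when the network trains for 10000 iterations.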