Difference between revisions of "GPU610/gpuchill"

== Progress ==
=== Assignment 1 ===
=== Sudoku Brute Force Solver ===

I decided to profile a simple brute force Sudoku solver, found here (https://github.com/regehr/sudoku). The solver uses a simple backtracking algorithm: it inserts candidate values into empty cells, backtracks whenever a value breaks a rule, and iterates through the puzzle thousands of times until it produces an answer that does not violate any of the rules of Sudoku. As a result, the solver runs at the same speed regardless of the human difficulty rating, solving 'easy' and 'insane' puzzles equally quickly. It also works independently of the ratio between clues and blank cells, producing quick results even for the most sparsely populated puzzles. The following run therefore uses a puzzle specifically constructed to work against the backtracking algorithm, maximizing the solver's running time.

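The repository linked above contains the actual solver; as a rough sketch of the backtracking approach described here (the function and variable names below are mine, not the original code's), the core loop looks something like this:

```cpp
#include <cassert>

// Minimal backtracking Sudoku sketch (illustrative, not the original code).
// g is a 9x9 grid, 0 = empty cell.
bool violates(const int g[9][9], int r, int c, int v) {
    for (int i = 0; i < 9; ++i) {
        if (g[r][i] == v) return true;           // row check
        if (g[i][c] == v) return true;           // column check
    }
    int br = (r / 3) * 3, bc = (c / 3) * 3;      // top-left of 3x3 region
    for (int i = br; i < br + 3; ++i)
        for (int j = bc; j < bc + 3; ++j)
            if (g[i][j] == v) return true;       // region check
    return false;
}

bool solve(int g[9][9], int cell) {
    if (cell == 81) return true;                 // all 81 cells are filled
    int r = cell / 9, c = cell % 9;
    if (g[r][c] != 0) return solve(g, cell + 1); // clue cell: skip it
    for (int v = 1; v <= 9; ++v) {
        if (violates(g, r, c, v)) continue;
        g[r][c] = v;                             // try a candidate value
        if (solve(g, cell + 1)) return true;
        g[r][c] = 0;                             // dead end: backtrack
    }
    return false;
}
```

On a sparsely clued puzzle the rule checks run hundreds of millions of times, which is exactly what the profile further down shows.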
Test run with puzzle:
<pre>
Original configuration:
-------------
|   |   |   |
|   |  3| 85|
|  1| 2 |   |
-------------
|   |5 7|   |
|  4|   |1  |
| 9 |   |   |
-------------
|5  |   | 73|
|  2| 1 |   |
|   | 4 |  9|
-------------
17 entries filled
solution:
-------------
|987|654|321|
|246|173|985|
|351|928|746|
-------------
|128|537|694|
|634|892|157|
|795|461|832|
-------------
|519|286|473|
|472|319|568|
|863|745|219|
-------------
found 1 solutions

real    0m33.652s
user    0m33.098s
sys     0m0.015s
</pre>

Flat profile:
<pre>
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls   s/call   s/call  name
46.42     10.04    10.04 622865043     0.00     0.00   check_row
23.52     15.13     5.09         1     5.09    21.32   solve
18.26     19.08     3.95 223473489     0.00     0.00   check_col
10.02     21.25     2.17 100654218     0.00     0.00   check_region
 0.72     21.40     0.16         2     0.08     0.08   print
 0.39     21.49     0.09                               frame_dummy
</pre>

I believe that if a GPU were used to accelerate this program, one would see a large increase in speed. All three check functions do essentially the same thing: they iterate over the cells of a row, column, or region, looking for any inserted value that violates the rules. If all of these independent iterations could be offloaded to the GPU, there should be a corresponding increase in speed.

=== Christopher Ginac Image Processing Library ===

I decided to profile an image processing library written by a single developer, Christopher Ginac; you can follow his post about the library [https://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ here]. His library enables the user to manipulate images in the .PGM format. Given the right parameters, users have the following options:

<pre>
What would you like to do:
[1] Get a Sub Image
[2] Enlarge Image
[3] Shrink Image
[4] Reflect Image
[5] Translate Image
[6] Rotate Image
[7] Negate Image
</pre>

I went with the Enlarge option to see how long it would take. To test the limits of the program I wanted a fairly large input image, but the program also writes the enlarged result out as a second image, and my Seneca account initially ran out of space for that new file, so the program could not write out the newly enlarged image. I therefore settled on an input image of at most 16.3 MB, so that the program could write the new one, totalling 32.6 MB of space.

<pre>
real    0m10.595s
user    0m5.325s
sys     0m1.446s
</pre>
That isn't really bad, but when we look deeper we can see where most of our time is being spent:

<pre>
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls   s/call   s/call  name
21.74      1.06     1.06         1     1.06     1.06   Image::operator=(Image const&)
21.33      2.10     1.04         2     0.52     0.52   Image::Image(int, int, int)
18.66      3.01     0.91 154056114     0.00     0.00   Image::getPixelVal(int, int)
15.59      3.77     0.76         1     0.76     2.34   Image::enlargeImage(int, Image&)
14.97      4.50     0.73         1     0.73     1.67   writeImage(char*, Image&)
 3.69      4.68     0.18         2     0.09     0.09   Image::Image(Image const&)
 2.67      4.81     0.13  17117346     0.00     0.00   Image::setPixelVal(int, int, int)
 0.82      4.85     0.04         1     0.04     0.17   readImage(char*, Image&)
 0.62      4.88     0.03         1     0.03     0.03   Image::getImageInfo(int&, int&, int&)
 0.00      4.88     0.00         4     0.00     0.00   Image::~Image()
 0.00      4.88     0.00         3     0.00     0.00   std::operator|(std::_Ios_Openmode, std::_Ios_Openmode)
 0.00      4.88     0.00         1     0.00     0.00   _GLOBAL__sub_I__ZN5ImageC2Ev
 0.00      4.88     0.00         1     0.00     0.00   readImageHeader(char*, int&, int&, int&, bool&)
 0.00      4.88     0.00         1     0.00     0.00   __static_initialization_and_destruction_0(int, int)
</pre>

It seems most of the time in this part of the code is spent assigning the enlarged image to the new one, and creating the image objects in the first place. I think that if we were to use a GPU for this process, we would see a decrease in run time for this part of the library. There also seems to be room for improvement in the 'Image::enlargeImage' function itself: by offloading that work onto the GPU, we could reduce its 0.76s to something even lower.
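Ginac's actual enlarge code is not reproduced here, but a minimal nearest-neighbour enlargement sketch (with names of my own choosing) shows why the function is such a good GPU candidate: every output pixel is computed independently of every other.

```cpp
#include <cassert>
#include <vector>

// Nearest-neighbour image enlargement sketch (illustrative names; not
// the library's actual implementation). Pixels are stored row-major,
// one int per pixel as in a greyscale .PGM. Each output pixel depends
// on exactly one input pixel, so every iteration could be a GPU thread.
std::vector<int> enlarge(const std::vector<int>& src,
                         int rows, int cols, int factor) {
    std::vector<int> dst(static_cast<std::size_t>(rows) * factor * cols * factor);
    for (int r = 0; r < rows * factor; ++r)
        for (int c = 0; c < cols * factor; ++c)
            // map the output pixel back onto its source pixel
            dst[static_cast<std::size_t>(r) * cols * factor + c] =
                src[static_cast<std::size_t>(r / factor) * cols + c / factor];
    return dst;
}
```

Because there are no dependencies between iterations, the doubly nested loop maps directly onto a 2D grid of GPU threads, one per output pixel.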

=== Merge Sort Algorithm ===

I decided to profile a vector merge sort algorithm. Merge sort is based on the divide and conquer technique, which recursively breaks a problem down into two or more sub-problems of the same or related type; when these become simple enough to be solved directly, the sub-problems are combined to give a solution to the original problem. The algorithm first divides the array into equal halves and then merges them back together in sorted order. Because the sort is broken into independent parts, I thought it would be a perfect candidate for GPU acceleration: with the work split into multiple chunks and sent to the GPU, it should be able to accomplish its task more efficiently. I was able to find the source code [https://codereview.stackexchange.com/questions/167680/merge-sort-implementation-with-vectors/ here].
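The linked implementation is not copied here; as a simplified sketch of the divide-and-conquer structure described above (my own version, not the profiled code):

```cpp
#include <cassert>
#include <vector>

// Simplified recursive merge sort over std::vector<int> (a sketch of
// the technique, not the profiled implementation from the link).
std::vector<int> mergeSort(const std::vector<int>& v) {
    if (v.size() <= 1) return v;                     // base case
    std::size_t mid = v.size() / 2;
    // divide: the two halves are independent, so they could be
    // sorted in parallel (e.g. by separate GPU work groups)
    std::vector<int> left(v.begin(), v.begin() + mid);
    std::vector<int> right(v.begin() + mid, v.end());
    left = mergeSort(left);
    right = mergeSort(right);
    // conquer: merge the two sorted halves back together
    std::vector<int> out;
    out.reserve(v.size());
    std::size_t i = 0, j = 0;
    while (i < left.size() && j < right.size())
        out.push_back(left[i] <= right[j] ? left[i++] : right[j++]);
    while (i < left.size())  out.push_back(left[i++]);
    while (j < right.size()) out.push_back(right[j++]);
    return out;
}
```

The recursion produces many independent sub-sorts of equal size, which is what makes the divide step attractive for offloading.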

Profile for 10 million elements between 1 and 10000, compiled with -O2 optimization.
<pre>
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  ns/call  ns/call  name
48.35      1.16     1.16   9999999   115.56   115.56   mergeSort(std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&)
32.80      1.94     0.78                               sort(std::vector<int, std::allocator<int> >&)
19.34      2.40     0.46  43708492    10.58    10.58   std::vector<int, std::allocator<int> >::_M_insert_aux(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, int const&)
 0.00      2.40     0.00         1     0.00     0.00   _GLOBAL__sub_I_main
</pre>
As you can see, about 80% of the total time was spent in the mergeSort and sort functions.<br />
Applying Amdahl's law with a parallel fraction of 0.80 on, say, 8 processors, Sn = 1 / ( (1 - 0.80) + 0.80/8 ) ≈ 3.3, so we can expect a maximum speedup of about 3.3x.

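That estimate can be checked with a small helper implementing Amdahl's law (the 8-processor count is an assumption about the target hardware, not a measured figure):

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: maximum speedup for a parallel fraction p on n processors.
double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.80 the speedup is capped at 1 / (1 - 0.80) = 5x no matter how many processors are used, so the serial 20% quickly becomes the bottleneck.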
=== Assignment 2 ===
=== Assignment 3 ===
Revision as of 10:15, 4 March 2019
== GPU n' Chill ==

=== Team Members ===
* Daniel Serpa, Some responsibility
* Abdul Kabia, Some responsibility
* Josh Tardif, Some responsibility
* Andrew Faux, Some responsibility
* ...