Installation Wizards
GPU610/DPS915 | Student List | Group and Project Index | Student Resources | Glossary
Contents
Installation Wizards
Team Members
Progress
Assignment 1
Image Processing
Much like previous students who have taken this course, we decided to look over image processing (leveraging Christopher Ginac's PGM parser http://www.dreamincode.net/forums/topic/76816-image-processing-tutorial/ ).
For our profiling we scale an image up 10 times to better profile larger data sets, as our source image was only ~800kb in size. We then negate the image, and then reflect it.
The time taken to perform all 3 operations:
real 0m22.057s user 0m4.584s sys 0m14.221s
The results of our tests:
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 42.92 6.70 6.70 3 2.23 2.23 Image::operator=(Image const&) 29.15 11.25 4.55 2 2.27 2.27 Image::Image(int, int, int) 15.31 13.64 2.39 1 2.39 2.39 Image::Image(Image const&) 4.04 14.27 0.63 writeImage(char*, Image&) 2.56 14.67 0.40 Image::negateImage(Image&) 2.37 15.04 0.37 86832000 0.00 0.00 Image::getPixelVal(int, int) 2.18 15.38 0.34 Image::reflectImage(bool, Image&) 1.47 15.61 0.23 Image::enlargeImage(int, Image&) 0.00 15.61 0.00 868320 0.00 0.00 Image::setPixelVal(int, int, int) 0.00 15.61 0.00 3 0.00 0.00 Image::~Image() 0.00 15.61 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN5ImageC2Ev 0.00 15.61 0.00 1 0.00 0.00 _GLOBAL__sub_I_main 0.00 15.61 0.00 1 0.00 0.00 Image::getImageInfo(int&, int&, int&)
As we can see, negating the image as well as reflecting and enlarging it are all costly operations, and we believe that we can optimize it to take far less time.
Sudoku Brute Force Solver
For this first assignment I decided to look into a brute force sudoku solver to see if it could benefit from parallel programming. The sudoku solver that was investigated was provided by Bryan Smith (https://github.com/bryanesmith/Sudoku-solver).
This solver goes through the given puzzle guessing numbers for any space that is not already filled and then checking to verify it doesn't break any of the sudoku rules such as horizontal, vertical and in box collisions.
As is, this solver is quick to solve a 9x9 sudoku puzzle. The time taken to solve a sudoku puzzle is under a second as seen in our time results:
real 0m0.097s user 0m0.072s sys 0m0.016s
However if this we were to scale this solver up to solve nxn sized puzzles we would be able to see a use for optimizing this code.
Results from profiling:
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 60.00 0.03 0.03 468316 0.00 0.00 SudokuPuzzle::verifyValue(int, int) 20.00 0.04 0.01 468316 0.00 0.00 SudokuPuzzle::printTracerTryingValue(int, int) 20.00 0.05 0.01 1 10.00 50.00 SudokuPuzzle::solve(int, int) 0.00 0.05 0.00 445799 0.00 0.00 SudokuPuzzle::setBoardValue(int, int, int) 0.00 0.05 0.00 2 0.00 0.00 SudokuPuzzle::print() 0.00 0.05 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN12SudokuPuzzleC2Ev 0.00 0.05 0.00 1 0.00 0.00 _GLOBAL__sub_I_main 0.00 0.05 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int) 0.00 0.05 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int) 0.00 0.05 0.00 1 0.00 50.00 SudokuPuzzle::solve() 0.00 0.05 0.00 1 0.00 0.00 SudokuPuzzle::SudokuPuzzle()
The profiling of this solver shows that the verifyValue function is taking up most of the execution time at 60% which is expected as this function checks all for all of the collisions for the guessed values. This would be the function we would want to optimize however this may not be the easiest of tasks as it seems it has some data dependencies.
After modifying the code to handle nxn size sudoku puzzles we can get a better look at the execution time of the algorithm.
These are the time results for solving a 16x16 puzzle:
real 0m7.631s user 0m7.592s sys 0m0.012s
And Profiling results:
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 90.35 5.38 5.38 35639645 0.00 0.00 SudokuPuzzle::verifyValue(int, int) 5.79 5.73 0.34 1 0.34 5.95 SudokuPuzzle::solve(int, int) 2.35 5.87 0.14 34877040 0.00 0.00 SudokuPuzzle::setBoardValue(int, int, int) 1.34 5.95 0.08 35639645 0.00 0.00 SudokuPuzzle::printTracerTryingValue(int, int) 0.17 5.96 0.01 1 0.01 5.96 SudokuPuzzle::solve() 0.00 5.96 0.00 2 0.00 0.00 SudokuPuzzle::print() 0.00 5.96 0.00 1 0.00 0.00 _GLOBAL__sub_I__ZN12SudokuPuzzleC2Ev 0.00 5.96 0.00 1 0.00 0.00 _GLOBAL__sub_I_main 0.00 5.96 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int) 0.00 5.96 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int) 0.00 5.96 0.00 1 0.00 0.00 SudokuPuzzle::CreateBoard() 0.00 5.96 0.00 1 0.00 0.00 SudokuPuzzle::SudokuPuzzle(int) 0.00 5.96 0.00 1 0.00 0.00 SudokuPuzzle::~SudokuPuzzle()
It is clear that the verifyValue function takes up majority of the execution time being called 356,39,645 times and taking 5.38 seconds of the total execution time.
Project Selection
We have decided to pursue the image processing project as it has more potential for a significant speedup after using the parallel processing. We will focus our time on speeding up the functions used to negate, reflect and enlarge the image.
Assignment 2
Parallel Image Processor
For the second part of this assignment we proceeded to convert the C implementation of the image processor to a CUDA implementation that would use parallel programming for a performance increase. The first hurdle we encountered when beginning this process was that the image used in our image processor was stored as a 2D array which would create problems when we try to use it in a CUDA implementation. Matt took the responsibility of re writing the code to use a 1D array that stored the pixels in column major order.
Once we had the 1D array to store the image, we were able to start writing kernels for the compute-intensive functions of our processor. We created kernels for reflectImage(), enlargeImage() and negateImage() and began to test for any sort of performance increase.
Results:
Test CUDA C++ Negating and reflecting image 2.988s 2.752s Enlarge by scale 8, negate and reflect 5.728s 6.027s
As seen from the results above the c++ implementation of the processor seems to run at around the same time as the CUDA implementation when the image is only negated and reflected. However, once the image is scaled by any factor, there is a definite increase in performance from the CUDA implementation. It i also worth noting that the profiled times form the CUDA implementation seemed to vary a lot more than the c++ implementation which we think is from the variation in times for the cudaMemcpy() to run.
To continue our optimizations, we think that we could get more of a performance increase by minimizing the amount of data copied to and from the GPU. We are going to look into storing the image on the GPU until all transformations have been done and then copying the data back from the GPU.