AAA Adrina Arsa Andriy

6,890 bytes added, 14:26, 4 December 2014
Assignment 3
= To Be Determined AAA =
== Team Members ==
# [mailto:aasauvageot@myseneca.ca?subject=gpu610 Adrian Sauvageot], Developer
According to the profiling data, methods such as “jpge::DCT2D(int*)” and “jpge::RGB_to_YCC” can be parallelized to improve application performance, which will be particularly useful for compressing large ‘jpg’ files at a higher quality factor.
 
====Arsalan Khalid Findings====
=====Program To Parallelize=====
 
I discovered a C++ hangman application. It needs quite a bit of work, but I believe it is much more interesting than parallelizing the standard image processor.
Most of the processing time is spent in main, so the program can be better modularized into functions and given a more complex and efficient algorithm that handles
a larger number of words. The program's output is as follows:
 
Welcome to hangman...Guess a country Name Each letter is represented by a star.
 
You have to type only one letter in one try You have 5 tries to try and guess the word.
 
**** Guess a letter: o Whoops! That letter isn't in there!You have 4 guesses left.
 
**** Guess a letter: p Whoops! That letter isn't in there!You have 3 guesses left.
 
**** Guess a letter: i You found a letter! Isn't that exciting!You have 3 guesses left.
 
i*** Guess a letter: n You found a letter! Isn't that exciting!You have 3 guesses left.
 
i**n Guess a letter: r You found a letter! Isn't that exciting!You have 3 guesses left.
 
ir*n Guess a letter: a You found a letter! Isn't that exciting!You have 3 guesses left.
 
iran
 
Yeah! You got it!
 
The profiling was as follows:
 
granularity: each sample hit covers 2 byte(s) no time propagated
 
 index  % time    self  children    called     name
                  0.00      0.00       1/1         __libc_csu_init [15]
 [8]       0.0    0.00      0.00       1       _GLOBAL__sub_I_main [8]
 
Index by function name
 
[8] _GLOBAL__sub_I_main (hangman2.cpp)
 
Of course, the profile for this program shows very little, because there is no complex algorithm behind the main program,
and so this program needs a lot of work. But I believe that's the beauty of it: instead of statically importing words,
we can import words from a file, and even add features such as multi-player. All of these features can be parallelized
and thus be very efficient at run time.
 
All in all, I believe that working on a project you know everything about is essential. Working from an image processor
or large pieces of existing code can take time, because a developer first has to learn and understand the code structure of the program.
 
Really looking forward to working on this if we end up doing so!
=== Assignment 2 ===
For our assignment 2 we chose to parallelize the search for a letter inside the hangman game.
 
Because hangman usually uses a small word, parallelizing a search through a single word will not show a major speedup. So we created a new game called letter search. In letter search, a "word" of length x is created at the beginning of a game. The word will contain randomly chosen letters from the alphabet, minus 5 letters. One letter of the alphabet will be worth approximately double the other letters.
 
The user tries to guess all the letters in the random character array for points.
 
The word can be anywhere from 1 letter to 1000000 letters (the largest size we have tested).
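The board generation and the sequential search described above can be sketched roughly as follows. This is a minimal sketch, not the code from our project: the names `makeBoard` and `countLetter` are hypothetical, the 5 excluded letters are assumed to be picked at random, and the "double value" letter is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Hypothetical helper: build a random "word" of `length` characters drawn
// from the alphabet with 5 letters removed (26 - 5 = 21 usable letters).
// Assumption: the 5 excluded letters are chosen at random.
std::string makeBoard(std::size_t length, unsigned seed) {
    assert(length > 0);
    std::mt19937 rng(seed);
    std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
    std::shuffle(alphabet.begin(), alphabet.end(), rng);
    alphabet.resize(alphabet.size() - 5);  // drop 5 randomly chosen letters
    std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);
    std::string board(length, ' ');
    for (char& c : board) {
        c = alphabet[pick(rng)];
    }
    return board;
}

// Sequential baseline: count how many times `guess` occurs in the board.
std::size_t countLetter(const std::string& board, char guess) {
    return static_cast<std::size_t>(
        std::count(board.begin(), board.end(), guess));
}
```

For example, `countLetter(makeBoard(1000000, 1), 'a')` searches a board of the largest size we tested. This serial loop is the part that a kernel would replace.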
 
With the parallelization, we noticed some increase in speed. Upon further experimentation, we realized that our memory copies were what took up most of the time. When looking at the time taken for the search itself inside the kernel, we noticed an almost 100% increase in speed.
 
If we are able to speed up the memory copies, we will be able to further increase the speed of the program.
 
The search times are below (including the memory copies).
 
 
'''Size 100000:'''
 
<u>Parallel</u>
 
Guess a letter: a
Search Time: - took - <u>0.001000 secs</u>
 
You found 3786 letters! Isn't that exciting!
You have 5 guesses left.
 
 
<u>Regular</u>
 
Guess a letter: a
Search Time: - took - <u>0.002000 secs</u>
 
You found a letter! Isn't that exciting!
You have 5 guesses left.
 
 
'''Size 500000:'''
 
<u>Parallel</u>
 
Guess a letter: a
Search Time: - took - <u>0.003000 secs</u>
 
You found 19198 letters! Isn't that exciting!
You have 4 guesses left.
 
<u>Regular</u>
 
Guess a letter: a
Search Time: - took - <u>0.006000 secs</u>
 
You found a letter! Isn't that exciting!
You have 5 guesses left.
 
 
'''Size 1000000:'''
 
<u>Parallel</u>
 
Guess a letter: a
Search Time: - took - <u>0.005000 secs</u>
 
You found 38256 letters! Isn't that exciting!
You have 5 guesses left.
 
 
<u>Regular</u>
 
Guess a letter: a
Search Time: - took - <u>0.012000 secs</u>
 
You found a letter! Isn't that exciting!
You have 5 guesses left.
 
 
The timing of the kernel call was 0.000000 secs in every case. This shows that the kernel search takes almost no time, and that the time is taken up by the memory allocations and the memory copies.
 
[[Image:Hotpath Kernel vs reg.jpg|thumb|800px|center]]
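Every search time reported above is a whole multiple of 0.001 secs, which suggests a timer with roughly millisecond resolution. A reading of 0.000000 secs may then only mean "under a millisecond". A minimal host-side sketch of timing the sequential search with a finer clock, `std::chrono::steady_clock`, is below. The names `TimedCount` and `timedSearch` are hypothetical and not taken from the original program.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>

// Result of one timed search: the match count and the elapsed seconds.
struct TimedCount {
    std::size_t matches;
    double seconds;
};

// Hypothetical helper: time a sequential letter search with steady_clock,
// which is monotonic and usually much finer than one millisecond.
TimedCount timedSearch(const std::string& board, char guess) {
    assert(!board.empty());
    const auto start = std::chrono::steady_clock::now();
    std::size_t matches = 0;
    for (char c : board) {
        matches += static_cast<std::size_t>(c == guess);
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return {matches, elapsed.count()};
}
```

When the same pattern is wrapped around a CUDA kernel launch, keep in mind that launches return to the host before the kernel finishes, so the host needs a synchronization call such as `cudaDeviceSynchronize()` before reading the stop time. Whether that affected the measurements above cannot be told from the output alone.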
 
=== Assignment 3 ===
For our assignment 3 we did a few things to speed up the program, and we were able to observe a speedup of around 50%.
 
[[Image:Hang man graph.png|thumb|800px|center]]
 
To achieve this speedup we removed thread divergence from the kernels, and we removed some unnecessary memory copies.
 
By removing thread divergence, we initially saw a speedup of around 10%. We expected this speedup to be small because, when running our code, the majority of the time was spent in the memory copy phase. To speed the process up slightly more, we also used shared memory within the kernel. After all the updates had been applied to the kernels, we saw a speedup of slightly over 10%.
 
When we took a look at the memory copies we used in the previous version, we realized that there were several copies that did not need to be done. We also realized that some of the copies could be moved around for more efficiency. After moving these copies, we saw a speedup of just over 40%.
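One common way to remove a divergent branch is to replace a per-element if/else with the value of the comparison itself. The sketch below shows the idea on the host in plain C++. It assumes the divergence came from a per-character if/else, which may not match the actual kernels, and the function names are hypothetical. In a kernel, the loop body would be what each thread executes for its own index.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Branching form: each element takes one of two code paths.
std::vector<int> markBranching(const std::string& word, char guess) {
    std::vector<int> found(word.size());
    for (std::size_t i = 0; i < word.size(); ++i) {
        if (word[i] == guess) {
            found[i] = 1;
        } else {
            found[i] = 0;
        }
    }
    return found;
}

// Branch-free form: the comparison result (0 or 1) is stored directly,
// so every element follows the same instruction sequence.
std::vector<int> markBranchFree(const std::string& word, char guess) {
    std::vector<int> found(word.size());
    for (std::size_t i = 0; i < word.size(); ++i) {
        found[i] = static_cast<int>(word[i] == guess);
    }
    return found;
}

// Sanity check: both forms must agree on every input.
void checkAgree(const std::string& word, char guess) {
    assert(markBranching(word, guess) == markBranchFree(word, guess));
}
```

Summing the 0/1 flags afterwards gives the number of letters found. A compiler may already produce branch-free code for a case this simple, so any gain should be measured rather than assumed.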
 
Our last version of the program now runs approximately 50% faster than the previous version.
 
 
'''To calculate the number of threads per block we used the CUDA Occupancy Calculator.'''
[[Image:Nvidia Occupancy Calculator on Code.jpg|thumb|800px|center]]
In the previous version we found the number of threads per block dynamically. We could not use that information dynamically in this version because shared memory was used. On the school lab computers the number of threads per block was 1024.
 
'''Real World Application'''
To make the application more "real world" friendly, we made the test data load from a large dictionary file. This makes it possible to search through real words instead of gibberish.
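Loading the test data from a dictionary could look roughly like the sketch below. It assumes a plain text file with whitespace-separated words. The helper names `loadWords` and `flatten` are hypothetical, and joining the words with a single space is just one possible layout for the search buffer.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read whitespace-separated words from any input stream
// (a std::ifstream for a real file, a std::istringstream for quick tests).
std::vector<std::string> loadWords(std::istream& in) {
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}

// Join the words into one flat character buffer, each followed by a space,
// so the whole dictionary can be searched in a single pass.
std::string flatten(const std::vector<std::string>& words) {
    std::string flat;
    for (const std::string& w : words) {
        assert(w.find(' ') == std::string::npos);  // space is the separator
        flat += w;
        flat += ' ';
    }
    return flat;
}
```

With a real file, the call would be along the lines of `std::ifstream file("words.txt"); auto words = loadWords(file);`, where `words.txt` is a placeholder name and not the file we used.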
 
'''What Would We Do Differently?'''
We would have spent more time on our A1s. When we picked our A1 programs we tried to find programs that were cool and had unique uses. We profiled the programs without taking an in-depth look at the code base, and when it came to picking a topic for A2, we were stuck with only one program, since the other two were much too complex.