Difference between revisions of "DPS915 C U D A B O Y S"
(→Profiling) |
(→Assignment 2) |
||
Line 481: | Line 481: | ||
[[File:cpuvscuda.png]] | [[File:cpuvscuda.png]] | ||
+ | |||
+ | |||
+ | ==== Conclusion ==== | ||
+ | |||
+ | We are seeing about 540% (~5.4x) performance increase while using CUDA instead of the CPU in all 3 of the test cases. We think this is an amazing result and a significant improvement in performance over the CPU version of the code. | ||
== Assignment 3 == | == Assignment 3 == |
Revision as of 16:12, 3 November 2015
Contents
C U D A B O Y S
Team Members
- Manjot Sandhu, Some responsibility
- Johnathan Ragimov, Some other responsibility
- Oleg Eustace, Some other responsibility
Progress
Assignment 1
✓ Profile 0: File Encryption
Description
This piece of software takes a file and ecrypts it one of 4 ways:
- Byte Inversion
- Byte Cycle
- Xor Cipher
- RC4 Cipher
Inside the byteCipher method, exists a for loop that could use optimization:
for (int i = 0; i < bufferSize; i++){ // going over every byte in the file switch (mode) { case 0: // inversion buffer[i] = ~buffer[i]; break; case 1: // cycle buffer [i] = cycle (buffer [i]); break; case 2: // RC4 buffer [i] = buffer [i] ^ rc4_output(); break; } }
This for loop then can call one of two functions: cycle
or rc4_output
.
char cycle (char value) { int leftMask = 170; int rightMask = 85; int iLeft = value & leftMask; int iRight = value & rightMask; iLeft = iLeft >> 1; iRight = iRight << 1; return iLeft | iRight; }
unsigned char rc4_output() { unsigned char temp; i = (i + 1) & 0xFF; j = (j + S[i]) & 0xFF; temp = S[i]; S[i] = S[j]; S[j] = temp; return S[(S[i] + S[j]) & 0xFF]; }
We need to change these two functions so they are added to the device as "device functions". We also need to convert this for loop into a kernel.
Profiling on Linux
The following test runs were performed on the following Virtual Machine:
- CentOS 7
- i7-3820 @ 3.6 GHz
- 2GB DDR3
- gcc version 4.8.3
Using compiler settings:
g++ -c -O2 -g -pg -std=c++11 encFile.cpp
RC4 Cipher - 283 MB mp3 File
[root@jr-net-cent7 aes]# time ./encFile 4 /home/johny/aes/music.mp3 /home/johny/aes/music.mp3 * * * File Protector * * * Mode 4: RC4 cipher Please enter the RC4 key (8 chars min) testing123 The password is: testing123 Beginning encryption Completed: 100% Cipher completed. Program terminated.
real 0m6.758s user 0m3.551s sys 0m0.068s
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 84.05 1.70 1.70 296271519 0.00 0.00 rc4_output() 13.39 1.97 0.27 byteCipher(int, std::string) 2.73 2.02 0.06 1 55.09 55.09 rc4_init(unsigned char*, unsigned int) 0.00 2.02 0.00 1 0.00 0.00 _GLOBAL__sub_I_S
As we can see the rc4_output
and byteCipher
functions take up most of the processing time.
RC4 Cipher - 636 MB iso File
[root@jr-net-cent7 aes]# time ./encFile 4 /home/johny/aes/cent.iso /home/johny/aes/cent.iso * * * File Protector * * * Mode 4: RC4 cipher Please enter the RC4 key (8 chars min) testing123 The password is: testing123 Beginning encryption Completed: 100% Cipher completed. Program terminated.
real 0m10.293s user 0m8.235s sys 0m0.312s
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 74.86 3.59 3.59 666894336 0.00 0.00 rc4_output() 23.21 4.70 1.11 byteCipher(int, std::string) 2.09 4.80 0.10 1 100.16 100.16 rc4_init(unsigned char*, unsigned int) 0.00 4.80 0.00 1 0.00 0.00 _GLOBAL__sub_I_S
RC4 Cipher - 789 MB iso File
[root@jr-net-cent7 aes]# time ./encFile 4 /home/johny/aes/xu.iso /home/johny/aes/xu.iso * * * File Protector * * * Mode 4: RC4 cipher Please enter the RC4 key (8 chars min) testing123 The password is: testing123 Beginning encryption Completed: 100% Cipher completed. Program terminated.
real 0m12.566s user 0m10.170s sys 0m0.228s
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 75.51 4.40 4.40 827326464 0.00 0.00 rc4_output() 23.02 5.74 1.34 byteCipher(int, std::string) 1.63 5.84 0.10 1 95.15 95.15 rc4_init(unsigned char*, unsigned int) 0.00 5.84 0.00 1 0.00 0.00 _GLOBAL__sub_I_S
Profiling on Windows
The following test runs were performed on the following Machine:
- Windows 10
- i7-4790k @ 4GHz
- 16GB DDR3
- Visual Studio 2013
RC4 Cipher - 283 MB mp3 File
RC4 Cipher - 636 MB iso File
RC4 Cipher - 789 MB iso File
✗ Profile 1: PI Approximation
- Sample run:
[root@jr-net-cent a1]# time ./pi 3.024 operation - took - 0.0001040000 secs 3.1676000000 operation - took - 0.0002280000 secs 3.1422800000 operation - took - 0.0022700000 secs 3.1418720000 operation - took - 0.0222910000 secs 3.1412748000 operation - took - 0.2185140000 secs 3.1417290800 operation - took - 2.2039310000 secs 3.1415420600 operation - took - 21.9592080000 secs 3.1415625844 operation - took - 47.1807910000 secs 3.1415537704
real 3m33.129s user 3m32.925s sys 0m0.016s
- gprof:
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 100.52 106.93 106.93 11 9.72 9.72 calcpi(int, int*) 0.00 106.93 0.00 11 0.00 0.00 reportTime(char const*, std::chrono::duration<long, std::ratio<1l, 1000000l> >) 0.00 106.93 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z10reportTimePKcNSt6chrono8durationIlSt5ratioILl1ELl1000000EEEE
✗ Profile 2: Wave Form Generator
This is the program we selected to optimize. It's a great candidate because it has 2 primary functions that have a few for loops in them. One of the functions reads an Mp3 file and writes wave data to a file -- this function takes quite a bit of time to execute. The other function actually takes this data and converts it to a view-able sound wave image. Both functions would benefit greatly from the extra processing power that a GPU provides: mp3 read/decode time would be greatly reduced.
'This piece of code is too complex and requires a linux environment to run. Please see Profile 0 for the one we are currently using.'
- Sample Run
[root@jr-net-cent7 ~]# time audiowaveform -i Steph\ DJ\ -\ Noise\ Control\ Episode\ 025\ Feat\ Jack\ Diamond\ 13th\ January\ 2014.mp3 -o test.dat -z 256 -b 8 Input file: Steph DJ - Noise Control Episode 025 Feat Jack Diamond 13th January 2014.mp3 Format: Audio MPEG layer III stream Bit rate: 320000 kbit/s CRC: no Mode: normal LR stereo Emphasis: no Sample rate: 44100 Hz Generating waveform data... Samples per pixel: 256 Input channels: 2 Done: 100% Recoverable frame level error: lost synchronization Frames decoded: 283540 (123:26.759) Generated 1275930 points Writing output file: test.dat Resolution: 8 bits
real 0m32.486s user 0m32.409s sys 0m0.056s
[root@jr-net-cent7 ~]# which audiowaveform /usr/local/bin/audiowaveform [root@jr-net-cent7 ~]# gprof -p -b /usr/local/bin/audiowaveform > final.dat
- gprof
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 58.71 4.28 4.28 79746 0.00 0.00 WaveformGenerator::process(short const*, int) 31.00 6.54 2.26 1 2.26 7.27 Mp3AudioFileReader::run(AudioProcessor&) 9.33 7.22 0.68 653276160 0.00 0.00 MadFixedToSshort(int) 0.41 7.25 0.03 1 0.03 0.03 dumpInfo(std::ostream&, mad_header const&) 0.14 7.26 0.01 7655603 0.00 0.00 short const& std::forward<short const&>(std::remove_reference<short const&>::type&) 0.14 7.27 0.01 2551860 0.00 0.00 writeInt8(std::ostream&, signed char) 0.14 7.28 0.01 2551860 0.00 0.00 std::vector<short, std::allocator<short> >::push_back(short const&) 0.14 7.29 0.01 1275930 0.00 0.00 WaveformBuffer::getMinSample(int) const 0.00 7.29 0.00 2551938 0.00 0.00 operator new(unsigned long, void*) 0.00 7.29 0.00 2551860 0.00 0.00 void __gnu_cxx::new_allocator<short>::construct<short, short const&>(short*, short const&) 0.00 7.29 0.00 2551860 0.00 0.00 std::vector<short, std::allocator<short> >::operator[](unsigned long) const 0.00 7.29 0.00 2551860 0.00 0.00 std::enable_if<std::allocator_traits<std::allocator<short> >::__construct_helper<short, short const&>::value, void>::type std::allocator_traits<std::allocator<short> >::_S_construct<short, short const&>(std::allocator<short>&, short*, short const&) 0.00 7.29 0.00 2551860 0.00 0.00 decltype (_S_construct({parm#1}, {parm#2}, (forward<short const&>)({parm#3}))) std::allocator_traits<std::allocator<short> >::construct<short, short const&>(std::allocator<short>&, short*, short const&) 0.00 7.29 0.00 1275931 0.00 0.00 WaveformGenerator::reset() 0.00 7.29 0.00 1275930 0.00 0.00 WaveformBuffer::appendSamples(short, short) 0.00 7.29 0.00 1275930 0.00 0.00 WaveformBuffer::getMaxSample(int) const 0.00 7.29 0.00 79747 0.00 0.00 AudioFileReader::showProgress(long long, long long) 0.00 7.29 0.00 7272 0.00 0.00 BstdRead 0.00 7.29 0.00 7271 0.00 0.00 BstdFileEofP .....
✗Profile 3: String Processor
compiled by Oleg
- Sample run:
ext_string_example es + 123 = ext_string123 456 + es = 456ext_string es * 3 = ext_stringext_stringext_string 3 * es = ext_stringext_stringext_string original: abc1234?abc1234?abc1234 es - abc = 1234?1234?1234 es - 123 = abc?abc?abc es - ? = abc1234abc1234abc1234 ext_string == eXt_StRiNg original: eXt_StRiNg lowercase: ext_string uppercase: EXT_STRING original: [ ext_string ] remove leading space: [ext_string ] remove trailing space: [ext_string] es: abc, ijk, pqr, xyz ---> split: (abc) (ijk) (pqr) (xyz) es: abc, ijk, pqr, xyz ---> split_n(3): (abc) (ijk) (pqr) es: 1, -23, 456, -7890 ---> parse: (1) (-23) (456) (-7890) es: 1.1, -23.32, 456.654, -7890.0987 ---> parsed: (1.1000000000000001) (-23.32) (456.654) (-7890.0986999999996) non_repeated_char_example No non-repeated chars in string. First non repeated char: a First non repeated char: b No non-repeated chars in string. First non repeated char: c First non repeated char: 1 translation_table_example Before: Such is this simple string sample....Wowzers! After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs! Before: Such is this simple string sample....Wowzers! After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs! Before: Such is this simple string sample....Wowzers! After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs! Before: Such is this simple string sample....Wowzers! After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs! Before: Such is this simple string sample....Wowzers! After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs! find_n_consecutive_example Result-01: [1] Location: [0]] Length: [1] Result-02: [22] Location: [2]] Length: [2] Result-03: [333] Location: [5]] Length: [3] Result-04: [4444] Location: [9]] Length: [4] Result-05: [55555] Location: [14]] Length: [5] Result-06: [666666] Location: [20]] Length: [6] Result-07: [7777777] Location: [27]] Length: [7] Result-08: [88888888] Location: [35]] Length: [8] Result-09: [999999999] Location: [44]] Length: [9] Result-01: [a] Location: [0]] Length: [1] Result-02: [bB] Location: [2]] Length: [2] Result-03: [cCc] Location: [5]] Length: [3] Result-04: [dDdD] Location: [9]] Length: [4] Result-05: [EeEeE] Location: [14]] Length: [5] Result-06: [fFfFfF] Location: [20]] Length: [6] Result-07: [gGgGgGg] Location: [27]] Length: [7] Result-08: [HhHhHhHh] Location: [35]] Length: [8] Result-09: [IiIiIiIiI] Location: [44]] Length: [9] split_on_consecutive_example 1 Consecutive digits: 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 7 8 9 9 0 0 0 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 2 Consecutive digits: 22 33 44 44 55 55 66 66 66 77 77 77 99 00 11 11 22 22 33 33 33 44 44 44 3 Consecutive digits: 333 444 555 666 666 777 777 000 111 222 333 333 444 444 4 Consecutive digits: 4444 5555 6666 7777 1111 2222 3333 4444 5 Consecutive digits: 55555 66666 77777 22222 33333 44444 6 Consecutive digits: 666666 777777 333333 444444 1 Consecutive letters: A B B C C C D D D D E E E E E F F F F F F G G G G G G G H I I J J J K K K K L L L L L M M M M M M N N N N N N N 2 Consecutive letters: BB CC DD DD EE EE FF FF FF GG GG GG II JJ KK KK LL LL MM MM MM NN NN NN 3 Consecutive letters: CCC DDD EEE FFF FFF GGG GGG JJJ KKK LLL MMM MMM NNN NNN 4 Consecutive letters: DDDD EEEE FFFF GGGG KKKK LLLL MMMM NNNN 5 Consecutive letters: EEEEE FFFFF GGGGG LLLLL MMMMM NNNNN 6 Consecutive letters: FFFFFF GGGGGG MMMMMM NNNNNN index_of_example Index of pattern[0123456789ABC]: 0 Index of pattern[123456789ABC]: 1 Index of pattern[23456789ABC]: 2 Index of pattern[3456789ABC]: 3 Index of pattern[456789ABC]: 4 Index of pattern[56789ABC]: 5 Index of pattern[6789ABC]: 6 Index of pattern[789ABC]: 7 Index of pattern[89ABC]: 8 Index of pattern[9ABC]: 9 Index of pattern[xyz]: 4294967295 truncatedint_example i = -1234 i = -1234 u = 1234 i = -1234 u = 1234
real 0m0.248s user 0m0.080s sys 0m0.024s
- Profile:
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 83.33 0.05 0.05 1000008 0.00 0.00 unsigned int boost::uniform_int<unsigned int>::generate<boost::random::detail::pass_through_engine<boost::random::detail::pass_through_engine<boost::random::mersenne_twister<unsigned int, 32, 624, 397, 31, 2567483615u, 11, 7, 2636928640u, 15, 4022730752u, 18, 3346425566u>&> > >(boost::random::detail::pass_through_engine<boost::random::detail::pass_through_engine<boost::random::mersenne_twister<unsigned int, 32, 624, 397, 31, 2567483615u, 11, 7, 2636928640u, 15, 4022730752u, 18, 3346425566u>&> >&, unsigned int, unsigned int, unsigned int) 16.67 0.06 0.01 1 10.00 60.00 strtk::generate_random_data(unsigned char*, unsigned int, unsigned int, unsigned int) 0.00 0.06 0.00 1642 0.00 0.00 boost::random::mersenne_twister<unsigned int, 32, 624, 397, 31, 2567483615u, 11, 7, 2636928640u, 15, 4022730752u, 18, 3346425566u>::twist(int) 0.00 0.06 0.00 979 0.00 0.00 strtk::text::is_digit(char) 0.00 0.06 0.00 978 0.00 0.00 strtk::text::is_letter(char)
Assignment 2
Description
Removing the old CPU bottleneck:
for (int i = 0; i < bufferSize; i++){ // going over every byte in the file switch (mode) { case 0: // inversion buffer[i] = ~buffer[i]; break; case 1: // cycle buffer [i] = cycle (buffer [i]); break; case 2: // RC4 buffer [i] = buffer [i] ^ rc4_output(); break; } }
And replacing it with
... if (mode == 0) getInversionBuffer << < dGrid, dBlock >> >(d_a, bufferSize, d_output); if (mode == 1) getCycleBuffer << < dGrid, dBlock >> >(d_a, bufferSize, d_output); if (mode == 2) getRC4Buffer << < dGrid, dBlock >> >(d_a, bufferSize, d_output); ...
Converting cycle
and rc4_output
functions to device functions:
/** * Description: Device function cycle **/ __device__ char cycle(char value) { int leftMask = 170; int rightMask = 85; int iLeft = value & leftMask; int iRight = value & rightMask; iLeft = iLeft >> 1; iRight = iRight << 1; return iLeft | iRight; }
/** * Description: Device function RC4 **/ __device__ unsigned char rc4_output() { unsigned char temp; unsigned char S[0x100]; // dec 256 unsigned int i, j; i = (i + 1) & 0xFF; j = (j + S[i]) & 0xFF; temp = S[i]; S[i] = S[j]; S[j] = temp; return S[(S[i] + S[j]) & 0xFF]; }
Profiling
The following test runs were performed on the following Machine:
- Windows 10
- i7-4790k @ 4.0GHz
- 16GB DDR3
- Nvidia GTX 430
RC4 Cipher - 283 MB mp3 File
Total runtime: 1.358 seconds
RC4 Cipher - 636 MB iso File
Total runtime: 3.87 seconds
RC4 Cipher - 789 MB iso File
Total runtime: 5.072 seconds
RC4 Run time comparisons: CPU vs. CUDA
Comparing Windows vs. Windows for most accurate results.
Conclusion
We are seeing about 540% (~5.4x) performance increase while using CUDA instead of the CPU in all 3 of the test cases. We think this is an amazing result and a significant improvement in performance over the CPU version of the code.