DPS915 C U D A B O Y S
Contents
C U D A B O Y S
Team Members
- Manjot Sandhu, Some responsibility
- Johnathan Ragimov, Some other responsibility
- Oleg Eustace, Some other responsibility
Progress
Assignment 1
✓ Profile 0: File Encryption
Description
This piece of software takes a file and ecrypts it one of 4 ways:
- Byte Inversion
- Byte Cycle
- Xor Cipher
- RC4 Cipher
Inside the byteCipher method, exists a for loop that could use optimization. Within this loop specifically, the lines that call the cycle
and rc4_output
functions are the ones that are taking the longest time to execute:
for (int i = 0; i < bufferSize; i++){ // going over every byte in the file switch (mode) { case 0: // inversion buffer[i] = ~buffer[i]; break; case 1: // cycle buffer [i] = cycle (buffer [i]); break; case 2: // RC4 buffer [i] = buffer [i] ^ rc4_output(); break; } }
Here is what these functions cycle
and rc4_output
functions look like:
char cycle (char value) { int leftMask = 170; int rightMask = 85; int iLeft = value & leftMask; int iRight = value & rightMask; iLeft = iLeft >> 1; iRight = iRight << 1; return iLeft | iRight; }
unsigned char rc4_output() { unsigned char temp; i = (i + 1) & 0xFF; j = (j + S[i]) & 0xFF; temp = S[i]; S[i] = S[j]; S[j] = temp; return S[(S[i] + S[j]) & 0xFF]; }
We need to change these two functions so they are added to the CUDA device as "device functions".
Profiling on Linux
The following test runs were performed on the following Virtual Machine:
- CentOS 7
- i7-3820 @ 3.6 GHz
- 2GB DDR3
- gcc version 4.8.3
Using compiler settings:
g++ -c -O2 -g -pg -std=c++11 encFile.cpp
RC4 Cipher - 283 MB mp3 File
[root@jr-net-cent7 aes]# time ./encFile 4 /home/johny/aes/music.mp3 /home/johny/aes/music.mp3 * * * File Protector * * * Mode 4: RC4 cipher Please enter the RC4 key (8 chars min) testing123 The password is: testing123 Beginning encryption Completed: 100% Cipher completed. Program terminated.
real 0m6.758s user 0m3.551s sys 0m0.068s
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 84.05 1.70 1.70 296271519 0.00 0.00 rc4_output() 13.39 1.97 0.27 byteCipher(int, std::string) 2.73 2.02 0.06 1 55.09 55.09 rc4_init(unsigned char*, unsigned int) 0.00 2.02 0.00 1 0.00 0.00 _GLOBAL__sub_I_S
As we can see the rc4_output
and byteCipher
functions take up most of the processing time.
RC4 Cipher - 636 MB iso File
[root@jr-net-cent7 aes]# time ./encFile 4 /home/johny/aes/cent.iso /home/johny/aes/cent.iso * * * File Protector * * * Mode 4: RC4 cipher Please enter the RC4 key (8 chars min) testing123 The password is: testing123 Beginning encryption Completed: 100% Cipher completed. Program terminated.
real 0m10.293s user 0m8.235s sys 0m0.312s
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 74.86 3.59 3.59 666894336 0.00 0.00 rc4_output() 23.21 4.70 1.11 byteCipher(int, std::string) 2.09 4.80 0.10 1 100.16 100.16 rc4_init(unsigned char*, unsigned int) 0.00 4.80 0.00 1 0.00 0.00 _GLOBAL__sub_I_S
RC4 Cipher - 789 MB iso File
[root@jr-net-cent7 aes]# time ./encFile 4 /home/johny/aes/xu.iso /home/johny/aes/xu.iso * * * File Protector * * * Mode 4: RC4 cipher Please enter the RC4 key (8 chars min) testing123 The password is: testing123 Beginning encryption Completed: 100% Cipher completed. Program terminated.
real 0m12.566s user 0m10.170s sys 0m0.228s
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 75.51 4.40 4.40 827326464 0.00 0.00 rc4_output() 23.02 5.74 1.34 byteCipher(int, std::string) 1.63 5.84 0.10 1 95.15 95.15 rc4_init(unsigned char*, unsigned int) 0.00 5.84 0.00 1 0.00 0.00 _GLOBAL__sub_I_S
Profiling on Windows
The following test runs were performed on the following Machine:
- Windows 10
- i7-4790k @ 4GHz
- 16GB DDR3
- Visual Studio 2013
RC4 Cipher - 283 MB mp3 File
RC4 Cipher - 636 MB iso File
RC4 Cipher - 789 MB iso File
Byte Cycle - 283 MB mp3 File
Byte Cycle - 636 MB iso File
Byte Cycle - 789 MB iso File
✗ Profile 1: PI Approximation
- Sample run:
[root@jr-net-cent a1]# time ./pi 3.024 operation - took - 0.0001040000 secs 3.1676000000 operation - took - 0.0002280000 secs 3.1422800000 operation - took - 0.0022700000 secs 3.1418720000 operation - took - 0.0222910000 secs 3.1412748000 operation - took - 0.2185140000 secs 3.1417290800 operation - took - 2.2039310000 secs 3.1415420600 operation - took - 21.9592080000 secs 3.1415625844 operation - took - 47.1807910000 secs 3.1415537704
real 3m33.129s user 3m32.925s sys 0m0.016s
- gprof:
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 100.52 106.93 106.93 11 9.72 9.72 calcpi(int, int*) 0.00 106.93 0.00 11 0.00 0.00 reportTime(char const*, std::chrono::duration<long, std::ratio<1l, 1000000l> >) 0.00 106.93 0.00 1 0.00 0.00 _GLOBAL__sub_I__Z10reportTimePKcNSt6chrono8durationIlSt5ratioILl1ELl1000000EEEE
✗ Profile 2: Wave Form Generator
This is the program we selected to optimize. It's a great candidate because it has 2 primary functions that have a few for loops in them. One of the functions reads an Mp3 file and writes wave data to a file -- this function takes quite a bit of time to execute. The other function actually takes this data and converts it to a view-able sound wave image. Both functions would benefit greatly from the extra processing power that a GPU provides: mp3 read/decode time would be greatly reduced.
'This piece of code is too complex and requires a linux environment to run. Please see Profile 0 for the one we are currently using.'
- Sample Run
[root@jr-net-cent7 ~]# time audiowaveform -i Steph\ DJ\ -\ Noise\ Control\ Episode\ 025\ Feat\ Jack\ Diamond\ 13th\ January\ 2014.mp3 -o test.dat -z 256 -b 8 Input file: Steph DJ - Noise Control Episode 025 Feat Jack Diamond 13th January 2014.mp3 Format: Audio MPEG layer III stream Bit rate: 320000 kbit/s CRC: no Mode: normal LR stereo Emphasis: no Sample rate: 44100 Hz Generating waveform data... Samples per pixel: 256 Input channels: 2 Done: 100% Recoverable frame level error: lost synchronization Frames decoded: 283540 (123:26.759) Generated 1275930 points Writing output file: test.dat Resolution: 8 bits
real 0m32.486s user 0m32.409s sys 0m0.056s
[root@jr-net-cent7 ~]# which audiowaveform /usr/local/bin/audiowaveform [root@jr-net-cent7 ~]# gprof -p -b /usr/local/bin/audiowaveform > final.dat
- gprof
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 58.71 4.28 4.28 79746 0.00 0.00 WaveformGenerator::process(short const*, int) 31.00 6.54 2.26 1 2.26 7.27 Mp3AudioFileReader::run(AudioProcessor&) 9.33 7.22 0.68 653276160 0.00 0.00 MadFixedToSshort(int) 0.41 7.25 0.03 1 0.03 0.03 dumpInfo(std::ostream&, mad_header const&) 0.14 7.26 0.01 7655603 0.00 0.00 short const& std::forward<short const&>(std::remove_reference<short const&>::type&) 0.14 7.27 0.01 2551860 0.00 0.00 writeInt8(std::ostream&, signed char) 0.14 7.28 0.01 2551860 0.00 0.00 std::vector<short, std::allocator<short> >::push_back(short const&) 0.14 7.29 0.01 1275930 0.00 0.00 WaveformBuffer::getMinSample(int) const 0.00 7.29 0.00 2551938 0.00 0.00 operator new(unsigned long, void*) 0.00 7.29 0.00 2551860 0.00 0.00 void __gnu_cxx::new_allocator<short>::construct<short, short const&>(short*, short const&) 0.00 7.29 0.00 2551860 0.00 0.00 std::vector<short, std::allocator<short> >::operator[](unsigned long) const 0.00 7.29 0.00 2551860 0.00 0.00 std::enable_if<std::allocator_traits<std::allocator<short> >::__construct_helper<short, short const&>::value, void>::type std::allocator_traits<std::allocator<short> >::_S_construct<short, short const&>(std::allocator<short>&, short*, short const&) 0.00 7.29 0.00 2551860 0.00 0.00 decltype (_S_construct({parm#1}, {parm#2}, (forward<short const&>)({parm#3}))) std::allocator_traits<std::allocator<short> >::construct<short, short const&>(std::allocator<short>&, short*, short const&) 0.00 7.29 0.00 1275931 0.00 0.00 WaveformGenerator::reset() 0.00 7.29 0.00 1275930 0.00 0.00 WaveformBuffer::appendSamples(short, short) 0.00 7.29 0.00 1275930 0.00 0.00 WaveformBuffer::getMaxSample(int) const 0.00 7.29 0.00 79747 0.00 0.00 AudioFileReader::showProgress(long long, long long) 0.00 7.29 0.00 7272 0.00 0.00 BstdRead 0.00 7.29 0.00 7271 0.00 0.00 BstdFileEofP .....
✗Profile 3: String Processor
- Sample run:
ext_string_example es + 123 = ext_string123 456 + es = 456ext_string es * 3 = ext_stringext_stringext_string 3 * es = ext_stringext_stringext_string original: abc1234?abc1234?abc1234 es - abc = 1234?1234?1234 es - 123 = abc?abc?abc es - ? = abc1234abc1234abc1234 ext_string == eXt_StRiNg original: eXt_StRiNg lowercase: ext_string uppercase: EXT_STRING original: [ ext_string ] remove leading space: [ext_string ] remove trailing space: [ext_string] es: abc, ijk, pqr, xyz ---> split: (abc) (ijk) (pqr) (xyz) es: abc, ijk, pqr, xyz ---> split_n(3): (abc) (ijk) (pqr) es: 1, -23, 456, -7890 ---> parse: (1) (-23) (456) (-7890) es: 1.1, -23.32, 456.654, -7890.0987 ---> parsed: (1.1000000000000001) (-23.32) (456.654) (-7890.0986999999996) non_repeated_char_example No non-repeated chars in string. First non repeated char: a First non repeated char: b No non-repeated chars in string. First non repeated char: c First non repeated char: 1 translation_table_example Before: Such is this simple string sample....Wowzers! After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs! Before: Such is this simple string sample....Wowzers! After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs! Before: Such is this simple string sample....Wowzers! After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs! Before: Such is this simple string sample....Wowzers! After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs! Before: Such is this simple string sample....Wowzers! After: S5ch 3s th3s s3mpl2 str3ng s1mpl2....W4wz2rs! find_n_consecutive_example Result-01: [1] Location: [0]] Length: [1] Result-02: [22] Location: [2]] Length: [2] Result-03: [333] Location: [5]] Length: [3] Result-04: [4444] Location: [9]] Length: [4] Result-05: [55555] Location: [14]] Length: [5] Result-06: [666666] Location: [20]] Length: [6] Result-07: [7777777] Location: [27]] Length: [7] Result-08: [88888888] Location: [35]] Length: [8] Result-09: [999999999] Location: [44]] Length: [9] Result-01: [a] Location: [0]] Length: [1] Result-02: [bB] Location: [2]] Length: [2] Result-03: [cCc] Location: [5]] Length: [3] Result-04: [dDdD] Location: [9]] Length: [4] Result-05: [EeEeE] Location: [14]] Length: [5] Result-06: [fFfFfF] Location: [20]] Length: [6] Result-07: [gGgGgGg] Location: [27]] Length: [7] Result-08: [HhHhHhHh] Location: [35]] Length: [8] Result-09: [IiIiIiIiI] Location: [44]] Length: [9] split_on_consecutive_example 1 Consecutive digits: 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 7 8 9 9 0 0 0 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 4 2 Consecutive digits: 22 33 44 44 55 55 66 66 66 77 77 77 99 00 11 11 22 22 33 33 33 44 44 44 3 Consecutive digits: 333 444 555 666 666 777 777 000 111 222 333 333 444 444 4 Consecutive digits: 4444 5555 6666 7777 1111 2222 3333 4444 5 Consecutive digits: 55555 66666 77777 22222 33333 44444 6 Consecutive digits: 666666 777777 333333 444444 1 Consecutive letters: A B B C C C D D D D E E E E E F F F F F F G G G G G G G H I I J J J K K K K L L L L L M M M M M M N N N N N N N 2 Consecutive letters: BB CC DD DD EE EE FF FF FF GG GG GG II JJ KK KK LL LL MM MM MM NN NN NN 3 Consecutive letters: CCC DDD EEE FFF FFF GGG GGG JJJ KKK LLL MMM MMM NNN NNN 4 Consecutive letters: DDDD EEEE FFFF GGGG KKKK LLLL MMMM NNNN 5 Consecutive letters: EEEEE FFFFF GGGGG LLLLL MMMMM NNNNN 6 Consecutive letters: FFFFFF GGGGGG MMMMMM NNNNNN index_of_example Index of pattern[0123456789ABC]: 0 Index of pattern[123456789ABC]: 1 Index of pattern[23456789ABC]: 2 Index of pattern[3456789ABC]: 3 Index of pattern[456789ABC]: 4 Index of pattern[56789ABC]: 5 Index of pattern[6789ABC]: 6 Index of pattern[789ABC]: 7 Index of pattern[89ABC]: 8 Index of pattern[9ABC]: 9 Index of pattern[xyz]: 4294967295 truncatedint_example i = -1234 i = -1234 u = 1234 i = -1234 u = 1234
real 0m0.248s user 0m0.080s sys 0m0.024s
- Profile:
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 83.33 0.05 0.05 1000008 0.00 0.00 unsigned int boost::uniform_int<unsigned int>::generate<boost::random::detail::pass_through_engine<boost::random::detail::pass_through_engine<boost::random::mersenne_twister<unsigned int, 32, 624, 397, 31, 2567483615u, 11, 7, 2636928640u, 15, 4022730752u, 18, 3346425566u>&> > >(boost::random::detail::pass_through_engine<boost::random::detail::pass_through_engine<boost::random::mersenne_twister<unsigned int, 32, 624, 397, 31, 2567483615u, 11, 7, 2636928640u, 15, 4022730752u, 18, 3346425566u>&> >&, unsigned int, unsigned int, unsigned int) 16.67 0.06 0.01 1 10.00 60.00 strtk::generate_random_data(unsigned char*, unsigned int, unsigned int, unsigned int) 0.00 0.06 0.00 1642 0.00 0.00 boost::random::mersenne_twister<unsigned int, 32, 624, 397, 31, 2567483615u, 11, 7, 2636928640u, 15, 4022730752u, 18, 3346425566u>::twist(int) 0.00 0.06 0.00 979 0.00 0.00 strtk::text::is_digit(char) 0.00 0.06 0.00 978 0.00 0.00 strtk::text::is_letter(char)
Assignment 2
Description
Removing CPU Bottleneck
Removing the old CPU bottleneck in the byteCipher
function:
for (int i = 0; i < bufferSize; i++){ // going over every byte in the file switch (mode) { case 0: // inversion buffer[i] = ~buffer[i]; break; case 1: // cycle buffer [i] = cycle (buffer [i]); break; case 2: // RC4 buffer [i] = buffer [i] ^ rc4_output(); break; } }
And replacing it with
... if (mode == 1) getCycleBuffer << < dGrid, dBlock >> >(d_a, bufferSize, d_output); if (mode == 2) getRC4Buffer << < dGrid, dBlock >> >(d_a, bufferSize, d_output); ...
Device Functions
Converting cycle
and rc4_output
functions to device functions:
/** * Description: Device function cycle **/ __device__ char cycle(char value) { int leftMask = 170; int rightMask = 85; int iLeft = value & leftMask; int iRight = value & rightMask; iLeft = iLeft >> 1; iRight = iRight << 1; return iLeft | iRight; }
/** * Description: Device function RC4 **/ __device__ unsigned char rc4_output() { unsigned char temp; unsigned char S[0x100]; // dec 256 unsigned int i, j; i = (i + 1) & 0xFF; j = (j + S[i]) & 0xFF; temp = S[i]; S[i] = S[j]; S[j] = temp; return S[(S[i] + S[j]) & 0xFF]; }
Creating Kernels
We created kernels for each of the 2 different methods of Cipher that the program handles (RC4 and Cycle, but not the others -- read on):
/** * Description: RC4 Cuda Kernel **/ __global__ void getRC4Buffer(char * buffer, int bufferSize) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < bufferSize) buffer[idx] = buffer[idx] ^ rc4_output(); }
/** * Description: Cycle Cuda Kernel **/ __global__ void getCycleBuffer(char * buffer, int bufferSize) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < bufferSize) buffer[idx] = cycle(buffer[idx]); }
You may be asking what about the two other methods of cipher: byte inversion and xor cipher? Well, as it turns out these methods run perfectly fine on the CPU and usually are faster on the CPU than the GPU. We initially had converted these functions over to CUDA, but we soon discovered that these functions did not need to be converted as they ran faster on the CPU than they did on the GPU.
Here's an example of run time of Xor Cipher on both CPU and GPU with the 789MB file:
GPU: http://i.imgur.com/0PsLxzQ.png -- 6.263 seconds
CPU: http://i.imgur.com/ktn14q3.png -- 3.722 seconds
As we can see, the CPU runs way faster than the GPU: no parallelization needed here!
Profiling
The following test runs were performed on the following Machine:
- Windows 10
- i7-4790k @ 4.0GHz
- 16GB DDR3
- Nvidia GTX 430
RC4 Profiling
RC4 Cipher - 283 MB mp3 File
Total runtime: 1.358 seconds
RC4 Cipher - 636 MB iso File
Total runtime: 3.87 seconds
RC4 Cipher - 789 MB iso File
Total runtime: 5.072 seconds
RC4 Run time comparisons: CPU vs. CUDA
Comparing Windows vs. Windows for most accurate results.
Byte Cycle Profiling
Byte Cycle - 283 MB mp3 File
Total runtime: 3.467 seconds
Byte Cycle - 636 MB iso File
Total runtime: 8.088 seconds
Byte Cycle - 789 MB iso File
Total runtime: 9.472 seconds
Byte Cycle time comparisons: CPU vs. CUDA
Comparing Windows vs. Windows for most accurate results.
Conclusion
RC4 Findings
We are seeing about 540% (~5.4x) performance increase while using CUDA instead of the CPU in all 3 of the test cases.
Byte Cycle Findings
We are seeing about 320% (~3.2x) performance increase while using CUDA instead of the CPU in all 3 of the test cases.
Overall, we think that these are amazing results and a significant improvement in performance over the CPU version of the code. Both of these functions have greatly improved in run time and efficiency
Assignment 3
Due to the nature of the way this program was structured by the original developer, optimization was not really needed. The benefits were very small, but here are the optimized kernels for safe measure:
RC4 OPTIMIZED Cuda Kernel
/** * Description: RC4 Cuda Kernel **/ __global__ void getRC4Buffer(char * buffer, int bufferSize, int ntpb) { int idx = blockIdx.x * blockDim.x + threadIdx.x; int tid = threadIdx.x; __shared__ float sharedMem[_NTPB]; sharedMem[tid] = buffer[idx]; __syncthreads(); if (idx < bufferSize) sharedMem[tid] = cycle(sharedMem[tid]); __syncthreads(); buffer[idx] = sharedMem[tid]; }
Cycle OPTIMIZED Cuda Kernel
/** * Description: Cycle Cuda Kernel **/ __global__ void getCycleBuffer(char * buffer, int bufferSize) { int idx = blockIdx.x * blockDim.x + threadIdx.x; int tid = threadIdx.x; __shared__ float sharedMem[_NTPB]; sharedMem[tid] = buffer[idx]; __syncthreads(); if (idx < bufferSize) sharedMem[tid] = cycle(sharedMem[tid]); __syncthreads(); buffer[idx] = sharedMem[tid]; }
The device functions were not modified.