
DPS915 C U D A B O Y S

2,253 bytes added, 12:57, 8 December 2015
Assignment 3
Inside the <code>byteCipher</code> method there is a for loop that could use optimization. Within this loop, the calls to the <code>cycle</code> and <code>rc4_output</code> functions take the longest to execute:
<pre>
for (int i = 0; i < bufferSize; i++){
    // going over every byte in the file
}
</pre>
Each iteration of this loop calls one of two functions: <code>cycle</code> or <code>rc4_output</code>. Here is what these functions look like:
<pre>
char cycle (char value) {
    int leftMask = 170;
</pre>
We need to mark these two functions with the <code>__device__</code> qualifier so they run on the CUDA device as "device functions". We also need to convert this for loop into a kernel.
==== Profiling on Linux ====
[[File:winxu.png]]
 
 
 
'''Byte Cycle - 283 MB mp3 File'''
 
[[File:winmp32.png]]
 
 
'''Byte Cycle - 636 MB iso File'''
 
[[File:wincent2.png]]
 
 
'''Byte Cycle - 789 MB iso File'''
 
[[File:winxu2.png]]
=== <span style="color: red">&#x2717; Profile 1: PI Approximation</span> ===
* Nvidia GT 430
 
===== RC4 Profiling =====
'''RC4 Cipher - 283 MB mp3 File'''
[[File:cpuvscuda.png]]
 
===== Byte Cycle Profiling =====
 
'''Byte Cycle - 283 MB mp3 File'''
 
Total runtime: 3.467 seconds
 
[[File:music2.png]]
 
 
'''Byte Cycle - 636 MB iso File'''
 
Total runtime: 8.088 seconds
 
[[File:cent2.png]]
 
 
'''Byte Cycle - 789 MB iso File'''
 
Total runtime: 9.472 seconds
 
[[File:xu2.png]]
 
 
''' Byte Cycle time comparisons: CPU vs. CUDA '''
 
Both the CPU and CUDA runs were measured on Windows for the most accurate comparison.
 
[[File:cpuvscuda2.png]]
==== Conclusion ====
''' RC4 Findings '''

We see roughly a <span style="color: green; font-size:14px">540% (~5.4x)</span> performance increase using CUDA instead of the CPU in all 3 test cases.

''' Byte Cycle Findings '''

We see roughly a <span style="color: green; font-size:14px">320% (~3.2x)</span> performance increase using CUDA instead of the CPU in all 3 test cases.

Overall, these are excellent results and a significant improvement over the CPU version of the code: both functions improved greatly in run time and efficiency.
== Assignment 3 ==
 
Due to the way the original developer structured this program, further optimization was not really needed. The benefits were very small, but here are the optimized kernels for good measure:
 
'''RC4 OPTIMIZED Cuda Kernel'''
<pre>
/**
* Description: RC4 CUDA Kernel
**/
__global__ void getRC4Buffer(char * buffer, int bufferSize, int ntpb) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // one byte per thread; char, not float, since the buffer holds bytes
    __shared__ char sharedMem[_NTPB];

    // guard every global access so the last, partial block stays in bounds
    if (idx < bufferSize)
        sharedMem[tid] = buffer[idx];
    __syncthreads();

    if (idx < bufferSize)
        sharedMem[tid] = cycle(sharedMem[tid]);
    __syncthreads();

    if (idx < bufferSize)
        buffer[idx] = sharedMem[tid];
}
</pre>
 
'''Cycle OPTIMIZED Cuda Kernel'''
<pre>
/**
* Description: Cycle CUDA Kernel
**/
__global__ void getCycleBuffer(char * buffer, int bufferSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // one byte per thread; char, not float, since the buffer holds bytes
    __shared__ char sharedMem[_NTPB];

    // guard every global access so the last, partial block stays in bounds
    if (idx < bufferSize)
        sharedMem[tid] = buffer[idx];
    __syncthreads();

    if (idx < bufferSize)
        sharedMem[tid] = cycle(sharedMem[tid]);
    __syncthreads();

    if (idx < bufferSize)
        buffer[idx] = sharedMem[tid];
}
</pre>
 
The device functions were not modified.
 
[[File:a3graph.png]]
