And replacing it with
...
if (mode == 0)
    getInversionBuffer<<<dGrid, dBlock>>>(d_a, bufferSize, d_output);
if (mode == 1)
    getCycleBuffer<<<dGrid, dBlock>>>(d_a, bufferSize, d_output);
if (mode == 2)
    getRC4Buffer<<<dGrid, dBlock>>>(d_a, bufferSize, d_output);
...
Removing the CPU bottleneck inside the <code>xorCipher</code> method:
for (int i = 0; i < bufferSize; i++){
    // XORing every byte in the buffer with the corresponding key byte
    buffer[i] = buffer[i] ^ keyBuffer[i];
}
And replacing it with
...
getXorBuffer<<<(n + ntpb - 1) / ntpb, ntpb>>>(d_a, d_b, bufferSize);
...
'''Creating Kernels'''
We created kernels for each of the 4 different Cipher methods that the program handles (though, as explained below, only RC4 and Cycle ended up staying on the GPU):
/**
* Description: RC4 Cuda Kernel
**/
...
/**
* Description: Inversion Cuda Kernel
**/
__global__ void getInversionBuffer(char * buffer, int bufferSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < bufferSize)
        buffer[idx] = ~buffer[idx];
}
/**
* Description: XOR Cuda Kernel
**/
__global__ void getXorBuffer(char * buffer, char * keyBuffer, int bufferSize) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < bufferSize)
        buffer[idx] = buffer[idx] ^ keyBuffer[idx];
}
You may be asking: what about the two other methods of cipher, '''byte inversion''' and '''xor cipher'''? Well, as it turns out, these methods run perfectly fine on the CPU and are usually faster there than on the GPU. We had initially converted these functions over to CUDA as well, but soon discovered that the conversion was unnecessary. Here's an example of the run time of the XOR cipher on both the CPU and GPU with the 789MB file:

GPU: http://i.imgur.com/0PsLxzQ.png -- 6.263 seconds

CPU: http://i.imgur.com/ktn14q3.png -- 3.722 seconds

As we can see, the CPU runs considerably faster than the GPU: no parallelization needed here!
==== Profiling ====