GPU610/DPS915 | Student List | Group and Project Index | Student Resources | Glossary

Team Kappa

Team Members

Ryan Mullings, Team Member
Andy Cooc, Team Member
Matt Jang, Team Member

Email All

Assignment 1

Program 1: RSA Encryption

Code From Github: link

Profiling

gprof: Encryption 10MB

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 60.48      1.50     1.50  3495254     0.00     0.00  fastModularExp(unsigned long long, unsigned long long, unsigned long long)
 32.66      2.31     0.81  3495254     0.00     0.00  findHighestSetBit(unsigned long long)
  4.44      2.42     0.11        1     0.11     2.48  encryption(std::string, std::string, std::string)
  2.42      2.48     0.06  3495254     0.00     0.00  intToByteArray(int, char*)
  0.00      2.48     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z10encryptionSsSsSs
  0.00      2.48     0.00        1     0.00     0.00  _GLOBAL__sub_I_main
  0.00      2.48     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
  0.00      2.48     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)

gprof: Encryption 20MB

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 63.65      3.27     3.27  6990507     0.00     0.00  fastModularExp(unsigned long long, unsigned long long, unsigned long long)
 29.04      4.75     1.49  6990507     0.00     0.00  findHighestSetBit(unsigned long long)
  3.90      4.96     0.20        1     0.20     5.13  encryption(std::string, std::string, std::string)
  3.41      5.13     0.17  6990507     0.00     0.00  intToByteArray(int, char*)
  0.00      5.13     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z10encryptionSsSsSs
  0.00      5.13     0.00        1     0.00     0.00  _GLOBAL__sub_I_main
  0.00      5.13     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
  0.00      5.13     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)

gprof: Encryption 30MB

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 68.04      5.11     5.11 10485760     0.00     0.00  fastModularExp(unsigned long long, unsigned long long, unsigned long long)
 25.23      7.00     1.90 10485760     0.00     0.00  findHighestSetBit(unsigned long long)
  3.99      7.30     0.30        1     0.30     7.51  encryption(std::string, std::string, std::string)
  2.73      7.51     0.20 10485760     0.00     0.00  intToByteArray(int, char*)
  0.00      7.51     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z10encryptionSsSsSs
  0.00      7.51     0.00        1     0.00     0.00  _GLOBAL__sub_I_main
  0.00      7.51     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
  0.00      7.51     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)

gprof: Decryption 10MB

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 69.80      1.71     1.71  3495254     0.00     0.00  fastModularExp(unsigned long long, unsigned long long, unsigned long long)
 21.63      2.24     0.53  3495254     0.00     0.00  findHighestSetBit(unsigned long long)
  6.12      2.39     0.15        1     0.15     2.45  decryption(std::string, std::string, std::string)
  2.45      2.45     0.06  3495254     0.00     0.00  intToByteArray(int, char*)
  0.00      2.45     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z10encryptionSsSsSs
  0.00      2.45     0.00        1     0.00     0.00  _GLOBAL__sub_I_main
  0.00      2.45     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
  0.00      2.45     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)

gprof: Decryption 20MB

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 64.11      3.38     3.38  6990507     0.00     0.00  fastModularExp(unsigned long long, unsigned long long, unsigned long long)
 27.18      4.82     1.44  6990507     0.00     0.00  findHighestSetBit(unsigned long long)
  4.73      5.07     0.25        1     0.25     5.28  decryption(std::string, std::string, std::string)
  3.98      5.28     0.21  6990507     0.00     0.00  intToByteArray(int, char*)
  0.00      5.28     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z10encryptionSsSsSs
  0.00      5.28     0.00        1     0.00     0.00  _GLOBAL__sub_I_main
  0.00      5.28     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
  0.00      5.28     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)

gprof: Decryption 30MB

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 66.88      5.25     5.25 10485760     0.00     0.00  fastModularExp(unsigned long long, unsigned long long, unsigned long long)
 24.59      7.18     1.93 10485760     0.00     0.00  findHighestSetBit(unsigned long long)
  5.73      7.63     0.45        1     0.45     7.85  decryption(std::string, std::string, std::string)
  2.80      7.85     0.22 10485760     0.00     0.00  intToByteArray(int, char*)
  0.00      7.85     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z10encryptionSsSsSs
  0.00      7.85     0.00        1     0.00     0.00  _GLOBAL__sub_I_main
  0.00      7.85     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)
  0.00      7.85     0.00        1     0.00     0.00  __static_initialization_and_destruction_0(int, int)

Bottleneck Code

The two slowest parts of this encryption method are fastModularExp and findHighestSetBit. A third function, intToByteArray, takes up a relatively small amount of time but may still be able to be optimized.

unsigned int fastModularExp(ULong a, ULong b, ULong c) {
    ULong result = 1;
    ULong leadingbit = findHighestSetBit(b); // Heighest set bit
    while(leadingbit > 0){ //while there are bits left
        result = ((result*result) % c); //case 1: bit is a 0
        if((b & leadingbit) > 0){
            result = ((result * a) % c); //case 2: if bit is a 1
        }
        leadingbit = leadingbit >> 1;
    }
    return (unsigned int)result;
}

ULong findHighestSetBit(ULong num){
    ULong result = 0;
    for(int i = 63; i >= 0; i--){
        if(num & (1ULL << i)){
            result = 1ULL << i;
            return result;
        }
    }
    return result;
}

byte* intToByteArray(int num, byte *result){
    for(int i = 0; i < 4; i++){
        result[i] = (num & (0xFF << (8 *(3-i)))) >> (8 *(3-i));
    }
    return result;
}

At first glance, the fastest function is also the function that appears like it would be the easiest to run on a GPU. The other code does not look like it can be optimized with a GPU easily as it does not use large arrays of N size. However, after some short internet research, several documents turned up on the topic of using a GPU to make RSA faster (example: here and here). This leaves me hopeful that in the further stages of this assignment that this could be an interesting program to work with.

Ryan Mullings: Program 2-Image Manipulation Processor

I decided to profile an Image manipulation program since it had many different functions that I could play around with. Also depending on the number of files you enter as arguments the processor will display different options Source Code:link to dream in code

Sample Run

Code for some of the manipulations

Profiles

Profile for shrinking image by 2

  %   cumulative                   self     total           
 time   seconds    calls   s/call   s/call  name    
 40.22      5.75     2     0.00     0.00  Image(Image const&)
 27.53      3.32     1     0.00     0.00  readImage(char*, Image&)
 13.23      2.21     1     0.00     2.48  shrinkImage(int, Image&)
 12.67      2.11     1     0.00     0.00  Image(int, int, int)
  6.32      2.48     1     0.00     0.00  writeImage(char*, Image&)

Profile for Expanding image by 3 and Rotating by 20 degrees

  %   cumulative                   self     total           
 time   seconds    calls   s/call   s/call  name    
 64.28      11.23    1     0.00     0.00  rotateImage(Image const&)
 16.89      4.51     1     0.00     0.00  enlargeImage(char*, Image&)
 13.23      2.21     1     0.00     0.00  writeImage(int, Image&)

Essentially, all of the functions are the main hotspots from the test runs.

Program 3: Serial pi Calculation - C version

I've decided to profile the calculate (estimate) of pi using a "dartboard" algorithm.

work in progress...

Source Code:link title

Assignment 2

For assignment 2, we have elected to parallelize the image manipulation program that Ryan Mullings looked at for the first assignment. This program preforms algorithms on black and white *.pgm image files.

Image Manipulation: Rotation

For all benchmarks, the image was rotated 45 degrees counter clockwise. The following is the example input and output from running one of the tests:

C:\Users\Matt\Documents\Visual Studio 2013\Projects\A2\rotate>GPU 1600_900.pgm 1600_900_gpu.pgm
What would you like to do:
[1] Get a Sub Image
[2] Enlarge Image
[3] Shrink Image
[4] Reflect Image
[5] Translate Image
[6] Rotate Image
[7] Negate Image

[0] Quit
6
Enter to what degree you want to rotate the image:
45
method: rotate - cuda (122)
method: rotate - full (168)
What would you like to do:
[1] Get a Sub Image
[2] Enlarge Image
[3] Shrink Image
[4] Reflect Image
[5] Translate Image
[6] Rotate Image
[7] Negate Image

[0] Quit
0
You have chosen to close the progam.
Press any key to continue . . .

All benchmarks were run with the following system specs:

AMD FX-8150 Eight-Core 3.6 GHz
Cuda 7.5
NVIDIA GeForce GTX 670

Benchmark

image size	algorithm time - gpu	full time - gpu	algorithm time - scalar	full time - scalar
500 x 600	109ms	117ms	119ms	127ms
800 x 800	113ms	127ms	254ms	256ms
1600 x 900	122ms	168ms	567ms	592ms
1920 x 1080	130ms	203ms	930ms	972ms
2747 x 1545	153ms	293ms	1704ms	1783ms

Scalar Code

void Image::rotateImage(int theta, Image& source) {

    steady_clock::time_point first_start;
    first_start = steady_clock::now();

    int rows = source.N;
    int cols = source.M;

    Image temp(rows, cols, source.Q);

    steady_clock::time_point second_start;
    second_start = steady_clock::now();

    float rads = (theta * 3.14159265) / 180.0;

    for (int r = 0; r < rows; r++) {

        for (int c = 0; c < cols; c++) {

            int new_row = (int)(rows / 2 + ((r - rows / 2) * cos(rads)) - ((c - cols / 2) * sin(rads)));
            int new_col = (int)(cols / 2 + ((r - rows / 2) * sin(rads)) + ((c - cols / 2) * cos(rads)));

            if (inBounds(new_row, new_col)) {

                temp.pixelVal[new_row][new_col] = source.pixelVal[r][c];

            }

        }

    }

    profile("rotate - cuda", steady_clock::now() - second_start);

    for (int r = 0; r < rows; r++) {

        for (int c = 0; c < cols; c++) {

            if (temp.pixelVal[r][c] == 0) {

                temp.pixelVal[r][c] = temp.pixelVal[r][c + 1];

            }

        }

    }

    source = temp;

    profile("rotate - full", steady_clock::now() - first_start);

}

GPU Code

const unsigned ntpb = 1024;

__global__ void kernel_rotate(int * old_image, int * temp_image, float rads, int rows, int cols) {

    int index = blockIdx.x * blockDim.x + threadIdx.x;

    if (index > rows * cols) {

        return;

    }

    int row = index % rows;
    int col = index / rows;

    int new_row = (int)(rows / 2 + ((row - rows / 2) * cos(rads)) - ((col - cols / 2) * sin(rads)));
    int new_col = (int)(cols / 2 + ((row - rows / 2) * sin(rads)) + ((col - cols / 2) * cos(rads)));

    if (!(new_row >= rows || new_row < 0 || new_col >= cols || new_col < 0)) {

        temp_image[rows * new_col + new_row] = old_image[index];

    }

}

void Image::rotateImage(int theta, Image & source) {

    steady_clock::time_point first_start;
    first_start = steady_clock::now();

    int rows = source.N;
    int cols = source.M;

    int nb = (rows * cols + ntpb - 1) / ntpb;

    int * d_temp_image;
    int * d_old_image;

    int * h_temp_image = new int[rows * cols];
    int * h_old_image = new int[rows * cols];

    for (int r = 0; r < rows; r++) {

        for (int c = 0; c < cols; c++) {

            h_old_image[rows * c + r] = source.pixelVal[r][c];
        }

    }

    steady_clock::time_point second_start;
    second_start = steady_clock::now();

    cudaMalloc((void**)&d_old_image, rows * cols * sizeof(int));

    if (!d_old_image) {

        cout << "CUDA: out of memory (d_old_image)" << endl;
        return;

    }

    cudaMalloc((void**)&d_temp_image, rows * cols * sizeof(int));

    if (!d_temp_image) {

        cout << "CUDA: out of memory (d_temp_image)" << endl;
        return;

    }

    cudaMemcpy(d_old_image, h_old_image, rows * cols * sizeof(int), cudaMemcpyHostToDevice);

    dim3 dGrid(nb);
    dim3 dBlock(ntpb);

    kernel_rotate <<<dGrid, dBlock>>>(d_old_image, d_temp_image, (theta * 3.14159265) / 180.0, rows, cols);

    cudaDeviceSynchronize();

    cudaMemcpy(h_temp_image, d_temp_image, rows * cols * sizeof(int), cudaMemcpyDeviceToHost);

    profile("rotate - cuda", steady_clock::now() - second_start);

    for (int r = 0; r < rows; r++) {

        for (int c = 0; c < cols; c++) {

            if (h_temp_image[rows * c + r] == 0 && c + 1 < cols)
                source.pixelVal[r][c] = h_temp_image[rows * (c + 1) + r];
            else
                source.pixelVal[r][c] = h_temp_image[rows * c + r];

        }

    }

    profile("rotate - full", steady_clock::now() - first_start);

}

Summary

For the rotation code, at around 500x600 dimensions or 300,000 pixels, the speed is about the same. However, as the image size increases, the scalar code will become much slower in comparison to the GPU code. To parallelize the code, I just used the straight forward tactic of unrolling the two for loops and assigning one thread for what would be each iteration. Since each index of the array was looked at individually, there is no problem doing that. As you can see in the benchmarks, through this process the speed was reduced greatly at higher resolutions.

CDOT Wiki ^β

GPU610/TeamKappa

Team Kappa

Team Members

Assignment 1

Program 1: RSA Encryption

Profiling

Bottleneck Code

Ryan Mullings: Program 2-Image Manipulation Processor

Program 3: Serial pi Calculation - C version

Assignment 2

Image Manipulation: Rotation

Benchmark

Scalar Code

GPU Code

Summary

Assignment 3

CDOT Wiki β

GPU610/TeamKappa

Team Kappa

Team Members

Assignment 1

Program 1: RSA Encryption

Profiling

Bottleneck Code

Ryan Mullings: Program 2-Image Manipulation Processor

Program 3: Serial pi Calculation - C version

Assignment 2

Image Manipulation: Rotation

Benchmark

Scalar Code

GPU Code

Summary

Assignment 3

CDOT Wiki ^β