GPU610/TeamKappa

13,862 bytes added, 09:09, 16 December 2015
Code Snippet of CUDA kernel
==== Plans ====
Using the serial Pi calculator/estimator, benchmark a large number of samples, up to the limit of what my computer can handle. One option is to reduce the number of rounds to one and run the largest sample size and accuracy that my computer can handle.
==== Problem ====
When I tried to compile the code with Visual Studio, I got errors. After some time searching online, it appears that the functions void srandom(unsigned seed) and random() are not part of standard C (they are POSIX functions), so they are not available through #include <time.h> or the other standard headers in Visual Studio. I tried srand(time(NULL)) and rand() instead, but the program still failed, returning results of 0.0000. I then modified the random number generation, added time reporting, and otherwise kept the same idea in a simpler program.
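One common cause of all-zero results with rand() is integer division: the expression rand() / RAND_MAX is evaluated with integers and is 0 for almost every draw. Whether that was the cause here is an assumption. The sketch below only shows a portable way to get a double in [0, 1) from rand(); the helper names uniform01 and seed_from_clock are hypothetical and are not taken from the original program.

```cpp
#include <cassert>
#include <cstdlib>
#include <ctime>

// Hypothetical helper: seed the C generator once from the wall clock.
void seed_from_clock() {
    std::srand(static_cast<unsigned>(std::time(nullptr)));
}

// Hypothetical helper: returns a pseudo-random double in [0, 1).
// Dividing by (RAND_MAX + 1.0) forces floating-point division;
// the integer expression rand() / RAND_MAX would almost always be 0.
double uniform01() {
    return std::rand() / (RAND_MAX + 1.0);
}
```

Call seed_from_clock() once at program start, then use uniform01() for each x and y coordinate.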
==== Code ====
}
// report system time
/* void reportTime(const char* msg, steady_clock::duration span) {
auto ms = duration_cast<milliseconds>(span);
std::cout << msg << " - took - " <<
ms.count() << " millisecs" << std::endl;
}
*/
int main(int argc, char* argv[])
ts = steady_clock::now();
//main loop
for (i = 0; i<dart; ++i) {
//get random points
x = (double)generateNumber();
y = (double)generateNumber();
++score;
}
pi = ((double)score / (double)dart)*4.0; // p = 4(m/n)
std::cout << "After " << i << " throws, average pi is " << pi << std::endl;
te = steady_clock::now();
[[Image:Serialpiresults.png|481px| ]]
 
==== Summary ====
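The Big-O classification for the serial pi calculator (estimator) appears to be O(n) in the number of throws: the main loop does a constant amount of work per throw, so the run time grows linearly with the sample size.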
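To see how the run time depends on the number of throws, here is a minimal sketch of a serial estimator. It assumes the C++11 <random> header instead of the generateNumber() function used above, and the name estimate_pi and its seed parameter are hypothetical. The loop body does a fixed amount of work, so the total work is proportional to the number of darts.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Hypothetical serial estimator: throws `darts` random points into the
// unit square and counts those inside the quarter circle x*x + y*y <= 1.
double estimate_pi(std::uint64_t darts, unsigned seed) {
    std::mt19937_64 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::uint64_t score = 0;
    for (std::uint64_t i = 0; i < darts; ++i) {  // one iteration per throw
        double x = dist(gen);
        double y = dist(gen);
        if (x * x + y * y <= 1.0) ++score;
    }
    // pi ~= 4 * (points inside the circle / total points)
    return 4.0 * static_cast<double>(score) / static_cast<double>(darts);
}
```

Doubling the darts argument should roughly double the measured run time, which is one way to check the classification empirically.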
 
=== Part 2 of Pi ===
Interestingly, I stumbled upon a Monte Carlo pi calculation written with "Parallel.For" (Concurrency::parallel_for) that speeds up the sampling by using multiple threads on the CPU. I figured I would compare its results with the serial program I was working on. Since this is part of the parallelization of the program, I will post it in part 2 of the assignment.
[[Image:Parallelformontecarlopi.png]]
 
[https://helloacm.com/c-coding-exercise-parallel-for-monte-carlo-pi-calculation/ Article link]
 
==== Source File ====
The source code can be found on [https://github.com/DoctorLai/coding_exercise/blob/master/parallel_monte_carlo_pi.cpp GitHub].
 
==== Code Snippet ====
<code><pre>int main()
{
srand(time(NULL));
const int N1 = 1000;
const int N2 = 100000;
int n = 0;
int c = 0;
Concurrency::critical_section cs;
// it is better that N1 >> N2 for better performance
Concurrency::parallel_for(0, N1, [&](int i)
{
int t = monte_carlo_count_pi(N2);
cs.lock(); // race condition
n += N2; // total sampling points
c += t; // points fall in the circle
cs.unlock();
});
cout << "pi ~= " << setprecision(9) << (double)c / n * 4.0 << endl;
return 0;
 
}</pre></code>
 
 
[[Image:Serialandconcurrencypi.png|481px| ]]
 
When comparing the CPU usage of the serial version and the parallel_for version of the pi calculation, you can see that the concurrent CPU version demonstrates the benefit of multi-core processors.
 
 
'''CUDA kernel version and comparisons: work in progress...'''
 
==== Assignment 2 - Code Snippet of CUDA kernel====
<code><pre>__global__ void montecarlo(const double* d_x, const double* d_y, int* d_score) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if ((d_x[idx]) * (d_x[idx]) +
(d_y[idx]) * (d_y[idx]) <= 1){
d_score[idx] = 1;
}
else
d_score[idx] = 0;
}
</pre></code>
 
 
There is not a huge performance difference over the CPU code. The issue is with the data being initialized and transferred.
The kernel's calculation time appears to be only a fraction of the total run time. The bottleneck is generating the random
points on the host (CPU) and copying them over to the device. The GPU itself computes the results much more quickly than the CPU. The next approach is to find a way to generate the random numbers concurrently on the GPU, reducing the amount of serial work on the CPU.
 
 
 
==== Assignment 3 - Code Snippet of CUDA kernel Optimized version====
<code><pre>__global__ void montecarlo(const double* d_x, const double* d_y, int* d_score) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if ((d_x[idx]) * (d_x[idx]) +
(d_y[idx]) * (d_y[idx]) <= 1){
d_score[idx] = 1;
}
else
d_score[idx] = 0;
}
 
 
__global__ void reduce(int* c, int* d, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int t = threadIdx.x;
__shared__ int s_c[1024]; // int to match c and d; assumes blockDim.x <= 1024
if (i < n)
s_c[t] = c[i];
else
s_c[t] = 0;
__syncthreads();
 
for (int stride = 1; stride < blockDim.x; stride *= 2) {
if (t % (2 * stride) == 0)
s_c[t] += s_c[t + stride];
__syncthreads();
}
 
if (t == 0)
d[blockIdx.x] = s_c[0];
}
</pre></code>
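The stride loop in the reduce kernel can be checked on the host. The sketch below is a hypothetical CPU model of a single block, not part of the original code: the vector stands in for the shared array s_c, and the inner loop visits exactly the thread indices t for which t % (2 * stride) == 0. It assumes the block size is a power of two, as with 1024 threads per block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of one block of the reduce kernel: `block` plays
// the role of the shared array s_c, and each pass of the inner loop
// stands in for the threads that are active at that stride.
int reduce_block(std::vector<int> block) {
    if (block.empty()) return 0;
    const std::size_t n = block.size();  // assumed to be a power of two
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        // Active "threads" are the multiples of 2 * stride.
        for (std::size_t t = 0; t < n; t += 2 * stride) {
            block[t] += block[t + stride];
        }
    }
    return block[0];  // the value thread 0 would write to d[blockIdx.x]
}
```

Each pass halves the number of active elements, so a block of 1024 values is summed in 10 passes. The per-block results still have to be added together afterwards, either on the host or with a second kernel launch.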
 
Currently having problems generating random numbers using cuRAND.
== Assignment 2 ==
For the rotation code, at around 500x600 dimensions or 300,000 pixels, the speed is about the same. However, as the image size increases, the scalar (CPU) code becomes much slower in comparison to the GPU code. To parallelize the code, I used the straightforward tactic of unrolling the two for loops and assigning one thread to what would be each iteration. Since each index of the array is processed independently, there is no problem doing that. As you can see in the benchmarks, this process reduced the run time greatly at higher resolutions.
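The one-thread-per-iteration idea can be modelled on the host. The sketch below is a hypothetical helper (the name rotated_index is not from the original code) that does the work of a single thread: it takes the flat index that would come from the two unrolled loops, recovers the row and column assuming the column-major layout index = rows * col + row, and returns the rotated destination index.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical host-side model of the work one thread does: map a flat
// index (column-major: index = rows * col + row) to its rotated position.
// Returns the destination index, or -1 if the pixel rotates out of bounds.
int rotated_index(int index, int rows, int cols, float rads) {
    int row = index % rows;
    int col = index / rows;
    int half_rows = rows / 2;
    int half_cols = cols / 2;
    float c = std::cos(rads);
    float s = std::sin(rads);
    // Rotate about the image centre (half_rows, half_cols).
    int new_row = (int)(half_rows + (row - half_rows) * c - (col - half_cols) * s);
    int new_col = (int)(half_cols + (row - half_rows) * s + (col - half_cols) * c);
    if (new_row < 0 || new_row >= rows || new_col < 0 || new_col >= cols) {
        return -1;
    }
    return rows * new_col + new_row;
}
```

A serial version calls this helper in a loop over every index from 0 to rows * cols - 1. On the GPU the loop disappears, and each thread computes its own index from blockIdx.x, blockDim.x and threadIdx.x.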
== Assignment 3 (Matt Jang) ==
=== Image Rotation ===
This is the kernel that I had originally made for Assignment 2. To optimize it, I wanted to make use of all the techniques that we learned in class, such as shared memory. Although I wasn't able to take advantage of everything I wanted to, I was still able to speed up this kernel to perform between 3 and 4 times faster. The following are the steps that I used to optimize.
; Changed the number of threads per block to 128 from 1024.
: This step was to improve the occupancy. However, the occupancy was already high on my card so this didn't make much of a difference. On the smallest image, the time went from '''293μs to 265μs''' and on the largest image, the time went from '''3995μs to 3617μs'''.
; Optimized out rows/2 and cols/2 into parameters.
: Since this operation was performed in every single thread, I figured it would be quicker to pass that value in through the parameters. On the smallest image, the time went from '''265μs to 193μs''' and on the largest image, the time went from '''3617μs to 2621μs'''.
; Used device functions __cosf and __sinf.
: In class we had learned that there were special device functions for trigonometry. I figured I would use them, but the speed up was much more than I had expected. On the smallest image, the time went from '''193μs to 93μs''' and on the largest image, the time went from '''2621μs to 1232μs'''.
; Optimized uses of __cosf and __sinf.
: Since using the device functions made such a big change and since there were two identical calls to each, I stored the value of their result in a register and used the register twice. This didn't make such a big change, but I am sure that if I had done it before using __cosf and __sinf, the difference would have been much bigger. On the smallest image, the time went from '''93μs to 91μs''' and on the largest image, the time went from '''1232μs to 1196μs'''.
; Experimental division optimization.
: This last optimization made almost no difference, but I wanted to try something new. I changed "'''int col = index / rows;'''" to "'''int col = (index - row) / rows;'''". The idea is that if the computer gets a whole number from division, it would be a bit faster. I don't fully understand why it worked, but there was a consistent speed up of around 0.5% to 1%. On the smallest image, the time went from '''91μs to 90μs''' and on the largest image, the time went from '''1196μs to 1191μs'''.
==== Kernel Speeds ====
[[Image:Gpu610_matt_a3_1.png|chart|481px|chart]]
{| class="wikitable" border="1"
! Image Size !! Before !! After
|-
| 500 x 600 || 293μs || 90μs
|-
| 800 x 800 || 613μs || 188μs
|-
| 1600 x 900 || 1359μs || 408μs
|-
| 1920 x 1080 || 2175μs || 654μs
|-
| 2747 x 1545 || 3995μs || 1191μs
|}
==== Unoptimized ====
 const unsigned ntpb = 1024;
 __global__ void kernel_rotate(int * old_image, int * temp_image, float rads, int rows, int cols) {
     int index = blockIdx.x * blockDim.x + threadIdx.x;
     // >= so that index == rows * cols is rejected as well
     if (index >= rows * cols) {
         return;
     }
     // column-major layout: index = rows * col + row
     int row = index % rows;
     int col = index / rows;
     int new_row = (int)(rows / 2 + ((row - rows / 2) * cos(rads)) - ((col - cols / 2) * sin(rads)));
     int new_col = (int)(cols / 2 + ((row - rows / 2) * sin(rads)) + ((col - cols / 2) * cos(rads)));
     // skip pixels that rotate outside the image
     if (!(new_row >= rows || new_row < 0 || new_col >= cols || new_col < 0)) {
         temp_image[rows * new_col + new_row] = old_image[index];
     }
 }
==== Optimized ====
 const unsigned ntpb = 128;
 __global__ void kernel_rotate(int * old_image, int * temp_image, float rads, int rows, int cols, int half_rows, int half_cols, int rows_x_cols) {
     int index = blockIdx.x * blockDim.x + threadIdx.x;
     // >= so that index == rows_x_cols is rejected as well
     if (index >= rows_x_cols) {
         return;
     }
     int row = index % rows;
     int col = (index - row) / rows;
     // evaluate each intrinsic once and reuse the result
     float cosf_rads = __cosf(rads);
     float sinf_rads = __sinf(rads);
     int new_row = (int)(half_rows + ((row - half_rows) * cosf_rads) - ((col - half_cols) * sinf_rads));
     int new_col = (int)(half_cols + ((row - half_rows) * sinf_rads) + ((col - half_cols) * cosf_rads));
     if (!(new_row >= rows || new_row < 0 || new_col >= cols || new_col < 0)) {
         temp_image[rows * new_col + new_row] = old_image[index];
     }
 }
=== Image Reflection ===
Since I wasn't able to apply many big optimization techniques to the image rotation, I decided to do an additional image manipulation function. Although I still wasn't able to use shared memory to make anything faster, I was able to try one or two new things.
; Split the reflection into two kernels.
: This is the most obvious of the optimizations. Since there are two distinct operations (horizontal flip, vertical flip), it only makes sense to have one kernel for each, with each kernel optimized for its own operation. On the smallest image, the time went from '''59μs to 47μs''' and on the largest image, the time went from '''711μs to 629μs'''.
; Only process half the image.
: This is the other obvious optimization. Instead of going through each pixel and flipping each one into a temporary array, I only iterate through half of them and swap each pixel with the one on the other side. There was one catch to this optimization. For the two different kernels, I had to populate my one dimensional array in either row major or column major order. This was so that the first half of the indices were either the top side or the left side. That way, in each of the horizontal and vertical kernels, I just had to subtract from either rows or cols. The memory access is also sequential. On the smallest image, the time went from '''47μs to 31μs''' and on the largest image, the time went from '''629μs to 416μs'''.
==== Kernel Speeds ====
[[Image:Gpu610_matt_a3_2.PNG|chart|481px|chart]]
{| class="wikitable" border="1"
! Image Size !! Before !! After
|-
| 500 x 600 || 59μs || 31μs
|-
| 800 x 800 || 110μs || 63μs
|-
| 1600 x 900 || 244μs || 141μs
|-
| 1920 x 1080 || 388μs || 219μs
|-
| 2747 x 1545 || 711μs || 416μs
|}
==== Unoptimized ====
 __global__ void kernel_reflect(int * old_image, int * temp_image, bool flag, int rows, int cols) {
     int index = blockIdx.x * blockDim.x + threadIdx.x;
     if (index >= rows * cols) {
         return;
     }
     int row = index % rows;
     int col = index / rows;
     int new_row = 0;
     int new_col = 0;
     // the "- 1" keeps the mirrored coordinate inside the image
     if (flag) {
         new_row = row;
         new_col = cols - 1 - col;   // horizontal flip
     }
     else {
         new_row = rows - 1 - row;   // vertical flip
         new_col = col;
     }
     temp_image[rows * new_col + new_row] = old_image[index];
 }
==== Optimized ====
 const int reflect_ntpb = 128;
 // expects column-major data: index = rows * col + row
 __global__ void kernel_reflect_horizontal(int * old_image, int rows, int cols, int half_cols) {
     int index = blockIdx.x * blockDim.x + threadIdx.x;
     if (index >= rows * half_cols) {
         return;
     }
     // same row, mirrored column; the "- 1" keeps the index in range
     int other_index = rows * (cols - 1 - index / rows) + index % rows;
     int temp = old_image[other_index];
     old_image[other_index] = old_image[index];
     old_image[index] = temp;
 }
 // expects row-major data: index = cols * row + col
 __global__ void kernel_reflect_vertical(int * old_image, int rows, int half_rows, int cols) {
     int index = blockIdx.x * blockDim.x + threadIdx.x;
     if (index >= half_rows * cols) {
         return;
     }
     // same column, mirrored row; the "- 1" keeps the index in range
     int other_index = ((rows - 1 - index / cols) * cols) + index % cols;
     int temp = old_image[other_index];
     old_image[other_index] = old_image[index];
     old_image[index] = temp;
 }
 long long Image::reflectImage(bool flag, Image & source) {
     int rows = source.N;
     int cols = source.M;
     int half_cols = cols / 2;
     int half_rows = rows / 2;
     // one thread per pixel in half of the image
     int nb = 0;
     if (flag)
         nb = (rows * half_cols + reflect_ntpb - 1) / reflect_ntpb;
     else
         nb = (half_rows * cols + reflect_ntpb - 1) / reflect_ntpb;
     int * d_old_image = nullptr;
     int * h_old_image = new int[rows * cols];
     // column-major for the horizontal flip, row-major for the vertical flip
     if (flag) {
         for (int r = 0; r < rows; r++)
             for (int c = 0; c < cols; c++)
                 h_old_image[rows * c + r] = source.pixelVal[r][c];
     }
     else {
         for (int r = 0; r < rows; r++)
             for (int c = 0; c < cols; c++)
                 h_old_image[cols * r + c] = source.pixelVal[r][c];
     }
     cudaMalloc((void**)&d_old_image, rows * cols * sizeof(int));
     if (!d_old_image) {
         delete[] h_old_image;
         cudaDeviceReset();
         cout << "CUDA: out of memory (d_old_image)" << endl;
         return -1;
     }
     high_resolution_clock::time_point first_start;
     first_start = high_resolution_clock::now();
     cudaMemcpy(d_old_image, h_old_image, rows * cols * sizeof(int), cudaMemcpyHostToDevice);
     dim3 dGrid(nb);
     dim3 dBlock(reflect_ntpb);
     if (flag)
         kernel_reflect_horizontal<<<dGrid, dBlock>>>(d_old_image, rows, cols, half_cols);
     else
         kernel_reflect_vertical<<<dGrid, dBlock>>>(d_old_image, rows, half_rows, cols);
     cudaDeviceSynchronize();
     cudaMemcpy(h_old_image, d_old_image, rows * cols * sizeof(int), cudaMemcpyDeviceToHost);
     cudaDeviceSynchronize();
     // copy back using the same layout that was used to fill the array
     if (flag) {
         for (int r = 0; r < rows; r++)
             for (int c = 0; c < cols; c++)
                 source.pixelVal[r][c] = h_old_image[rows * c + r];
     }
     else {
         for (int r = 0; r < rows; r++)
             for (int c = 0; c < cols; c++)
                 source.pixelVal[r][c] = h_old_image[cols * r + c];
     }
     auto duration = duration_cast<milliseconds>(high_resolution_clock::now() - first_start);
     delete[] h_old_image;
     cudaFree(d_old_image);
     cudaDeviceReset();
     return duration.count();
 }
=== Conclusions ===
With this project, I had originally expected to get a bigger speed difference when optimizing one way or another, but it turns out that it isn't so easy to do. I was never able to get any meaningful results using shared memory in these kernels because every pixel is only looked at once. My optimization benchmarks came from Nsight, so they didn't include the code that created and read the 1D arrays. If I wanted to make a very fast image library, I would read and store data in the same layout that the kernels expect, to avoid that overhead.
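The half-image swap can be checked with a small CPU reference. The sketch below is hypothetical and is not the author's kernel: it assumes the column-major layout index = rows * col + row described above, so the first rows * (cols / 2) indices are exactly the left half of the image. It computes the mirrored column as cols - 1 - col so that the partner index always stays inside the array.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical CPU reference for the half-image swap. The image is stored
// column-major (index = rows * col + row), so the first rows * (cols / 2)
// indices are exactly the left half of the image.
void reflect_horizontal(std::vector<int>& image, int rows, int cols) {
    int half_cols = cols / 2;
    for (int index = 0; index < rows * half_cols; ++index) {
        int row = index % rows;
        int col = index / rows;
        int other = rows * (cols - 1 - col) + row;  // same row, mirrored column
        std::swap(image[index], image[other]);
    }
}
```

Each loop iteration corresponds to one GPU thread. Because every pixel belongs to exactly one swapped pair, no two iterations touch the same element, which is why the swap can be done in place without a temporary image. With an odd number of columns the middle column is left untouched.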