BETTERRED

5,756 bytes added, 12:43, 12 April 2017
= Assignment 2 - Parallelize =
 
== Gaussian Blur ==
{| class="wikitable mw-collapsible mw-collapsed"
== Output Images ==
[http://imgur.com/a/CtMOc Image Gallery]
 
 
== Mandelbrot ==
{| class="wikitable mw-collapsible mw-collapsed"
! Unoptimized - Mandelbrot( ... )
|-
|
<syntaxhighlight lang="cpp">
//C++ Includes
#include <iostream>
#include <complex>
#include <vector>
#include <chrono>
#include <functional>
#include <cuda_runtime.h>

//CUDA Complex Numbers
#include <cuComplex.h>

//Helper Includes
#include "window.h"
#include "save_image.h"
#include "utils.h"

const int ntpb = 32; //Threads per block in each dimension

//Compute Color for each pixel
//Fractal parameters are floats; passing them as int would truncate values such as -2.2
__global__ void computeMandelbrot(int iter_max, int* d_colors,
                                  float fract_width, float fract_height,
                                  int scr_width, int scr_height,
                                  float fract_xmin, float fract_ymin){
    int row = blockIdx.y * blockDim.y + threadIdx.y; //Row
    int col = blockIdx.x * blockDim.x + threadIdx.x; //Col

    //Threads that fall outside the image do nothing
    if(col < scr_width && row < scr_height){
        int idx = row * scr_width + col; //Pixel Index

        //Use Floating Complex Numbers to calculate color for each pixel
        int result = 0;
        cuFloatComplex c = make_cuFloatComplex((float)col, (float)row);
        //Scale the pixel position into fractal coordinates
        cuFloatComplex d = make_cuFloatComplex(
            cuCrealf(c) / (float)scr_width * fract_width + fract_xmin,
            cuCimagf(c) / (float)scr_height * fract_height + fract_ymin);
        cuFloatComplex z = make_cuFloatComplex(0.0f, 0.0f);

        //Iterate z = z*z + d until |z| reaches 2 or iter_max is hit
        while((cuCabsf(z) < 2.0f) && (result < iter_max)){
            z = cuCaddf(cuCmulf(z, z), d);
            result++;
        }
        d_colors[idx] = result; //Output
    }
}

void mandelbrot(){
    window<int> scr(0, 1000, 0, 1000); //Image Size
    window<float> fract(-2.2, 1.2, -1.7, 1.7); //Fractal Size
    int iter_max = 500; //Iterations
    const char* fname = "mandlebrot_gpu.png"; //Output File Name
    bool smooth_color = true; //Color Smoothing

    //Blocks needed to cover the image in each dimension (rounded up)
    int nblks_x = (scr.width() + ntpb - 1) / ntpb;
    int nblks_y = (scr.height() + ntpb - 1) / ntpb;
    std::vector<int> colors(scr.size()); //Output Vector

    //Allocate Device Memory
    int* d_colors;
    cudaMalloc((void**)&d_colors, scr.size() * sizeof(int));

    //Grid Layout
    dim3 dGrid(nblks_x, nblks_y);
    dim3 dBlock(ntpb, ntpb);

    //Execute Kernel
    auto start = std::chrono::steady_clock::now();
    computeMandelbrot<<<dGrid, dBlock>>>(iter_max, d_colors, fract.width(), fract.height(), scr.width(), scr.height(), fract.x_min(), fract.y_min());
    cudaDeviceSynchronize(); //Wait for the kernel so the timing covers its execution
    auto end = std::chrono::steady_clock::now();

    //Report any launch or execution error
    cudaError_t err = cudaGetLastError();
    if(err != cudaSuccess){
        std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
    }

    //Output Time
    std::cout << "Time to generate " << fname << " = " << std::chrono::duration<float, std::milli>(end - start).count() << " [ms]" << std::endl;

    //Copy Data back to Host
    cudaMemcpy(colors.data(), d_colors, scr.size() * sizeof(int), cudaMemcpyDeviceToHost);

    //Plot Data and Free Memory
    plot(scr, colors, iter_max, fname, smooth_color);
    cudaFree(d_colors);
}

int main(){
    mandelbrot();
    return 0;
}
</syntaxhighlight>
|}
 
== Objectives ==
The main objective was to refactor the get_number_iterations() function, together with the functions it calls, which formed the nested loops of the original program. The objective was met: all of these functions were refactored into a single kernel that performs the calculation for one pixel of the image. As the original program used doubles, all of the doubles were changed to floats.
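
The per-pixel calculation described above can be sketched on the host as follows. This is a minimal illustration using std::complex<float>, not the original CPU code; the function name escape_iterations is an assumption made for this example.

```cpp
#include <complex>

// Hypothetical host-side helper (name is illustrative, not from the original source).
// Counts how many iterations of z = z*z + c run before |z| reaches 2,
// capped at iter_max. Uses float, as in the GPU version.
int escape_iterations(std::complex<float> c, int iter_max) {
    std::complex<float> z(0.0f, 0.0f);
    int result = 0;
    while (std::abs(z) < 2.0f && result < iter_max) {
        z = z * z + c;
        ++result;
    }
    return result;
}
```

A point that never escapes, such as c = 0, returns iter_max, while a point far outside the set returns after very few iterations.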
== Steps ==
 
=== Host Memory Management ===
No changes were needed to host memory management, as no data is copied from the host to the device. The host vector that holds the results is unchanged; the data computed on the device is copied into this vector and then written to the output plot file.
 
=== Device Memory Management ===
Only a single array, holding one value per pixel, is allocated on the device. Its size is image width * image height. Each thread derives the row and column of its pixel from its block and thread indices, and these are used in the complex number calculations together with the values that specify the parameters of the fractal.
 
=== Kernels ===
The three functions from the original code (get_number_iterations(), escape() and scale()) were refactored into a single computeMandelbrot() kernel. The kernel calculates the row and column of its pixel, then uses those values along with the image dimensions and the fractal parameters to calculate the pixel's value. Complex floating-point numbers are provided by the cuComplex.h header file, which also supplies the operations on them. As threads do not rely on each other for any data, no use of __syncthreads() is required. Each thread writes its value to the d_colors array as soon as it has finished computing it.
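
Because the grid is rounded up to whole blocks, some threads fall outside the image, which is why the kernel checks row and column before writing. The host-side emulation below is a sketch of that idea only; blocks_needed and emulate_launch are hypothetical names, and the loops stand in for the block and thread indices that CUDA would supply.

```cpp
#include <vector>

// Ceiling division: number of blocks needed to cover n elements.
int blocks_needed(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Hypothetical host emulation of the 2D launch. Visits every
// (block, thread) pair, applies the same bounds check as the kernel,
// and records how many times each pixel is written.
std::vector<int> emulate_launch(int scr_width, int scr_height, int ntpb) {
    std::vector<int> writes(scr_width * scr_height, 0);
    int grid_x = blocks_needed(scr_width, ntpb);
    int grid_y = blocks_needed(scr_height, ntpb);
    for (int by = 0; by < grid_y; ++by)
        for (int bx = 0; bx < grid_x; ++bx)
            for (int ty = 0; ty < ntpb; ++ty)
                for (int tx = 0; tx < ntpb; ++tx) {
                    int row = by * ntpb + ty;
                    int col = bx * ntpb + tx;
                    if (col < scr_width && row < scr_height)
                        ++writes[row * scr_width + col];
                }
    return writes;
}
```

For a 1000-pixel dimension with 32 threads per block, blocks_needed gives 32 blocks, covering 1024 positions; the guard discards the extra 24. Every pixel is written exactly once, which is consistent with no synchronization being needed between threads.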
 
=== Device to Host ===
Once the kernel has finished, the d_colors array is copied back to the host with a single cudaMemcpy() call.
 
== Results ==
The program was compiled using clang++, icpc (the Intel Parallel Studio compiler) and NVCC for the GPU version. The standard clang++ build became extremely slow as the size of the resulting image increased. Compiling with icpc drastically reduced the CPU-only runtimes without any changes to the code. The CUDA version improved the runtime well beyond both the clang++ and icpc builds, as many more pixel values can be calculated in parallel.
 
== Output Images ==
[http://imgur.com/a/R3ZAH Image Output]
 
== Future Optimizations ==
As there are no data-intensive tasks in this program, further optimization would involve creating streams of kernels and having them execute concurrently in order to improve the runtime of the current solution.
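
One possible way to prepare for this, shown here only as a sketch, is to divide the image into horizontal bands so that each band could be given to a kernel launch in its own stream. The RowBand type and split_rows function are assumptions made for this example, it assumes n_streams is greater than zero, and whether concurrent kernels actually improve the runtime would need to be measured.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of the rows handled by one stream.
struct RowBand { int first_row; int row_count; };

// Splits scr_height rows into at most n_streams contiguous bands.
// Assumes n_streams > 0. The last band may be shorter than the others.
std::vector<RowBand> split_rows(int scr_height, int n_streams) {
    std::vector<RowBand> bands;
    int rows_per_band = (scr_height + n_streams - 1) / n_streams;
    for (int first = 0; first < scr_height; first += rows_per_band) {
        bands.push_back({first, std::min(rows_per_band, scr_height - first)});
    }
    return bands;
}
```

Each band's first_row would then serve as a row offset inside the kernel, so that every launch writes to its own part of the d_colors array.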
= Assignment 3 - Optimize =