Changes

BETTERRED

5,756 bytes added, 13:43, 12 April 2017


= Assignment 2 - Parallelize =

== Gaussian Blur ==

{| class="wikitable mw-collapsible mw-collapsed"

== Output Images ==

[http://imgur.com/a/CtMOc Image Gallery]

== Mandelbrot ==

{| class="wikitable mw-collapsible mw-collapsed"

! Unoptimized - Mandelbrot( ... )

|-

|

<syntaxhighlight lang="cpp">
//C++ Includes
#include <iostream>
#include <complex>
#include <vector>
#include <chrono>
#include <functional>
#include <cuda_runtime.h>

//CUDA Complex Numbers
#include <cuComplex.h>

//Helper Includes
#include "window.h"
#include "save_image.h"
#include "utils.h"

const int ntpb = 32; //Threads per block in each dimension (32 x 32 = 1024 threads)

//Compute Color for each pixel
//The fractal parameters are floats: passing them as ints would truncate
//values such as -2.2 and 3.4 and distort the image.
__global__ void computeMandelbrot(int iter_max, int* d_colors,
                                  float fract_width, float fract_height,
                                  int scr_width, int scr_height,
                                  float fract_xmin, float fract_ymin){
    int row = blockIdx.y * blockDim.y + threadIdx.y; //Row
    int col = blockIdx.x * blockDim.x + threadIdx.x; //Col

    //Threads that fall outside the image do nothing
    if(col < scr_width && row < scr_height){
        int idx = row * scr_width + col; //Pixel Index

        //Use Floating Complex Numbers to calculate color for each pixel
        int result = 0;
        cuFloatComplex c = make_cuFloatComplex((float)col, (float)row);
        //Scale the pixel coordinates to the fractal window
        cuFloatComplex d = make_cuFloatComplex(cuCrealf(c) / (float)scr_width * fract_width + fract_xmin,
                                               cuCimagf(c) / (float)scr_height * fract_height + fract_ymin);
        cuFloatComplex z = make_cuFloatComplex(0.0f, 0.0f);

        //Iterate z = z*z + d until z escapes or iter_max is reached
        while((cuCabsf(z) < 2.0f) && (result < iter_max)){
            z = (cuCaddf(cuCmulf(z,z),d));
            result++;
        }
        d_colors[idx] = result; //Output
    }
}

void mandelbrot(){
    window<int> scr(0, 1000, 0, 1000); //Image Size
    window<float> fract(-2.2,1.2,-1.7,1.7); //Fractal Size
    int iter_max = 500; //Iterations
    const char* fname = "mandlebrot_gpu.png"; //Output File Name
    bool smooth_color = true; //Color Smoothing
    int nblks = (scr.width() + ntpb - 1)/ ntpb; //Blocks per dimension (the image is square)
    std::vector<int> colors(scr.size()); //Output Vector

    //Allocate Device Memory
    int* d_colors;
    cudaMalloc((void**)&d_colors, scr.size() * sizeof(int));

    //Grid Layout
    dim3 dGrid(nblks, nblks);
    dim3 dBlock(ntpb, ntpb);

    //Execute Kernel
    auto start = std::chrono::steady_clock::now();
    computeMandelbrot<<<dGrid, dBlock>>>(iter_max, d_colors, fract.width(), fract.height(), scr.width(), scr.height(), fract.x_min(), fract.y_min());
    cudaDeviceSynchronize(); //Wait for the kernel so the timing covers the full computation
    auto end = std::chrono::steady_clock::now();

    //Output Time
    std::cout << "Time to generate " << fname << " = " << std::chrono::duration <float, std::milli> (end - start).count() << " [ms]" << std::endl;

    //Copy Data back to Host
    cudaMemcpy(colors.data(), d_colors, scr.size() * sizeof(int), cudaMemcpyDeviceToHost);

    //Plot Data and Free Memory
    plot(scr, colors, iter_max, fname, smooth_color);
    cudaFree(d_colors);
}

int main(){
    mandelbrot();
    return 0;
}
</syntaxhighlight>

|}

== Objectives ==

The main objective was to refactor the get_number_iterations() function, together with the functions it called, which formed the nested loops over the image. This objective was met: all of those functions were refactored into a single kernel that performs the calculation for one pixel of the image. The original program used doubles throughout; these were all changed to floats.

== Steps ==

=== Host Memory Management ===

No changes were needed to host memory management, as no data is copied from the host to the device. The host vector that holds the results was left unchanged; data from the device is copied into this vector, which is then used to write the plot file.

=== Device Memory Management ===

Only a single array, holding one value per pixel, is allocated on the device. Its size is image width * image height. Each thread computes the row and column of its pixel from its block and thread indices; these are used, together with the values that specify the fractal parameters, in the complex number calculations, and the result is stored at index row * image width + column.
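The index arithmetic described above can be checked on the host. The following is a minimal sketch in plain C++, not part of the original program; the helper names blocks_for() and pixel_index() are hypothetical and only mirror the expressions used in the kernel and in the grid set-up.

```cpp
#include <cassert>
#include <cstddef>

// Number of blocks needed to cover `extent` pixels along one axis when each
// block holds `threads` threads (integer ceiling division).
// Hypothetical helper: mirrors (scr.width() + ntpb - 1) / ntpb in the host code.
constexpr int blocks_for(int extent, int threads) {
    return (extent + threads - 1) / threads;
}

// Row-major offset of pixel (row, col) in a flat array that is `width` wide.
// Hypothetical helper: mirrors idx = row * scr_width + col in the kernel.
inline std::size_t pixel_index(int row, int col, int width) {
    assert(row >= 0 && col >= 0 && col < width);
    return static_cast<std::size_t>(row) * static_cast<std::size_t>(width)
         + static_cast<std::size_t>(col);
}
```

With the 1000 x 1000 image and ntpb = 32 used above, blocks_for(1000, 32) gives 32 blocks per axis, so 1024 threads cover each 1000-pixel row and the bounds check in the kernel discards the remaining 24.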

=== Kernels ===

The three functions from the original code ( get_number_iterations() , escape() and scale() ) were refactored into a single computeMandelbrot() kernel. The kernel calculates the row and column of its pixel and then uses these values, along with the image dimensions and the fractal parameters, to calculate the pixel's value. Single-precision complex numbers are provided by the cuComplex.h header file, which also supplies the arithmetic operations on them. As threads do not rely on each other for any data, no use of __syncthreads() is required. As each thread finishes computing its value, it writes the value to the d_colors array.
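For checking the GPU output, the per-pixel calculation can be reproduced on the CPU. The following is a minimal sketch using std::complex<float>; it is not the original get_number_iterations() / escape() / scale() code, and the names scale_to_window(), escape_iterations() and reference_pixel() are hypothetical. It assumes the same scaling and escape test as the kernel above.

```cpp
#include <cassert>
#include <complex>

// Map a pixel coordinate along one axis onto the fractal window.
// Hypothetical helper: same arithmetic as the kernel's scaling step.
float scale_to_window(int pixel, int scr_extent, float fract_extent, float fract_min) {
    assert(scr_extent > 0);
    return static_cast<float>(pixel) / static_cast<float>(scr_extent) * fract_extent + fract_min;
}

// Count iterations of z = z*z + c until |z| >= 2 or iter_max is reached.
int escape_iterations(std::complex<float> c, int iter_max) {
    std::complex<float> z(0.0f, 0.0f);
    int result = 0;
    while (std::abs(z) < 2.0f && result < iter_max) {
        z = z * z + c;
        result++;
    }
    return result;
}

// CPU counterpart of what one GPU thread computes for pixel (row, col).
int reference_pixel(int row, int col, int scr_width, int scr_height,
                    float fract_width, float fract_height,
                    float fract_xmin, float fract_ymin, int iter_max) {
    std::complex<float> c(scale_to_window(col, scr_width, fract_width, fract_xmin),
                          scale_to_window(row, scr_height, fract_height, fract_ymin));
    return escape_iterations(c, iter_max);
}
```

Comparing reference_pixel() against a few entries of the vector copied back from the device is one way to confirm that the kernel and the fractal parameters agree; small differences near the boundary of the set are possible because of floating-point rounding.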

=== Device to Host ===

Once the kernel has finished, the results are copied back to the host with a single cudaMemcpy() call.

== Results ==

The program was compiled using clang++ , icpc (Intel Parallel Studio Compiler) and NVCC for the GPU. Runtimes for the standard clang++ version grew very large as the size of the resultant image increased. Compiling the same, unmodified code with icpc reduced the CPU-only runtimes drastically. The parallel CUDA version improved on the runtime of both the clang++ and the icpc versions, as many pixel values are calculated in parallel.

== Output Images ==

[http://imgur.com/a/R3ZAH Image Output]

== Future Optimizations ==

As there aren't any data-intensive tasks in this program, further optimizations would include splitting the work across multiple streams and having the kernels execute concurrently in order to improve the runtime of the current solution.
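One possible way to prepare for streams is to split the image into bands of rows, one band per stream. The following is a minimal host-side sketch of that partitioning only; RowBand and split_rows() are hypothetical names, and the sketch assumes the kernel would be given an extra row offset parameter, which the current kernel does not have.

```cpp
#include <cassert>
#include <vector>

// A contiguous band of image rows to be handled by one stream (hypothetical type).
struct RowBand {
    int first_row;  // index of the first row in the band
    int row_count;  // number of rows in the band
};

// Split `height` rows into at most `bands` contiguous bands whose sizes differ
// by no more than one row. Empty bands are dropped.
std::vector<RowBand> split_rows(int height, int bands) {
    assert(height >= 0 && bands > 0);
    std::vector<RowBand> out;
    const int base = height / bands;   // rows every band receives
    const int extra = height % bands;  // leftover rows, one each to the first bands
    int row = 0;
    for (int i = 0; i < bands; ++i) {
        const int count = base + (i < extra ? 1 : 0);
        if (count == 0) continue;
        out.push_back({row, count});
        row += count;
    }
    return out;
}
```

Each band could then be launched on its own cudaStream_t, passing the stream as the fourth launch parameter, and copied back with cudaMemcpyAsync() so that the copy of one band overlaps the computation of the next. That overlap requires the host buffer to be page-locked, and whether it actually reduces the runtime would have to be measured.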

= Assignment 3 - Optimize =

Knagarajan1
