Assignment 3
<nowiki>****************</nowiki>
CODE FOR 1D BLOCK OF 1D THREADS WITH CONSTANT MEMORY
<nowiki>****************</nowiki>

Here we tried replacing the global memory values that were used as calculation constants with constant memory, since these values never change while a kernel is running, while still allowing the results to be modified by kernel logic. The constant-memory declarations are shown in blue:

<pre>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// Load standard libraries
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <cmath>
#include <chrono>
using namespace std;
</pre>
<pre style="color: blue">
// Calculation constants kept in constant memory instead of being passed to the kernel
__device__ __constant__ int d_m;
__device__ __constant__ int d_n;
__device__ __constant__ float d_xmin;
__device__ __constant__ float d_xmax;
__device__ __constant__ float d_ymin;
__device__ __constant__ float d_ymax;
__device__ __constant__ float d_pi;
__device__ __constant__ float d_omega;
__device__ __constant__ float d_dx;
__device__ __constant__ float d_dy;
__device__ __constant__ float d_dxxinv;
__device__ __constant__ float d_dyyinv;
__device__ __constant__ float d_dcent;
</pre>
<pre>
// Constants that were formerly kernel parameters now come from constant memory
__global__ void matrixKernel(float* a, float* b) {
    // block and thread indexes replace some of the iteration logic
    int j = threadIdx.x + 1;
    if (j < d_n - 1) {
        for (int i = 1; i < d_m - 1; i++) {
            int ij = i + d_m * j;
            float x = d_xmin + i * d_dx, y = d_ymin + j * d_dy;
            float input = abs(x) > 0.5 || abs(y) > 0.5 ? 0 : 1;
            a[ij] = (input + d_dxxinv * (b[ij - 1] + b[ij + 1])
                           + d_dyyinv * (b[ij - d_m] + b[ij + d_m])) * d_dcent;
        }
    }
}

// Set grid size and number of iterations
const int save_iters = 20;
const int total_iters = 5000;
const int error_every = 2;
const int m = 32, n = 1024;
const float xmin = -1, xmax = 1;
const float ymin = -1, ymax = 1;

// Compute useful constants
const float pi = 3.1415926535897932384626433832795;
const float omega = 2 / (1 + sin(2 * pi / n));
const float dx = (xmax - xmin) / (m - 1);
const float dy = (ymax - ymin) / (n - 1);
const float dxxinv = 1 / (dx * dx);
const float dyyinv = 1 / (dy * dy);
const float dcent = 1 / (2 * (dxxinv + dyyinv));

// Input function
inline float f(int i, int j) {
    float x = xmin + i * dx, y = ymin + j * dy;
    return abs(x) > 0.5 || abs(y) > 0.5 ? 0 : 1;
}

// Common output and error routine
void outputImage(const char* filename, float* a) {
    // Computes the error if sn%error_every==0
    // Saves the matrix if sn<=save_iters
    int i, j, ij = 0, ds = sizeof(float);
    float x, y, data_float;
    const char* pfloat = (const char*)&data_float;

    ofstream outfile;
    static char fname[256];
    sprintf(fname, "%s.%d", filename, 101);
    outfile.open(fname, fstream::out | fstream::trunc | fstream::binary);

    data_float = m;
    outfile.write(pfloat, ds);

    for (i = 0; i < m; i++) {
        x = xmin + i * dx;
        data_float = x;
        outfile.write(pfloat, ds);
    }

    for (j = 0; j < n; j++) {
        y = ymin + j * dy;
        data_float = y;
        outfile.write(pfloat, ds);
        for (i = 0; i < m; i++) {
            data_float = a[ij++];
            outfile.write(pfloat, ds);
        }
    }
    outfile.close();
}

void dojacobi() {
    int ij, k;
    float* a;
    float* b;
    float* u = new float[m * n];
    float* v = new float[m * n];

    // Set initial guess to be identically zero
    for (ij = 0; ij < m * n; ij++) u[ij] = v[ij] = 0;
    a = v;
    b = u;

    float* d_a;
    float* d_b;
    cudaMalloc((void**)&d_a, m * n * sizeof(float));
    cudaMalloc((void**)&d_b, m * n * sizeof(float));
    cudaMemcpy(d_a, a, n * m * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, n * m * sizeof(float), cudaMemcpyHostToDevice);

    // 1D grid of 1D blocks: one thread per row of the grid
    int nblocks = n / 1024;
    dim3 dGrid(nblocks);
    dim3 dBlock(1024);

    // Carry out Jacobi iterations, ping-ponging between the two device buffers
    for (k = 1; k <= total_iters; k++) {
        if (k % 2 == 0)
            matrixKernel<<<dGrid, dBlock>>>(d_a, d_b);
        else
            matrixKernel<<<dGrid, dBlock>>>(d_b, d_a);
        cudaDeviceSynchronize();
        if (cudaGetLastError()) { std::cout << "error"; }
    }

    cudaMemcpy(a, d_a, n * m * sizeof(float), cudaMemcpyDeviceToHost);
    outputImage("jacobi out", a);
    cudaFree(d_a);
    cudaFree(d_b);
    delete[] u;
    delete[] v;
}

int main() {
    // Copy every constant the kernel reads into constant memory
    cudaMemcpyToSymbol(d_m, &m, sizeof(int));
    cudaMemcpyToSymbol(d_n, &n, sizeof(int));
    cudaMemcpyToSymbol(d_xmin, &xmin, sizeof(float));
    cudaMemcpyToSymbol(d_xmax, &xmax, sizeof(float));
    cudaMemcpyToSymbol(d_ymin, &ymin, sizeof(float));
    cudaMemcpyToSymbol(d_ymax, &ymax, sizeof(float));
    cudaMemcpyToSymbol(d_pi, &pi, sizeof(float));
    cudaMemcpyToSymbol(d_omega, &omega, sizeof(float));
    cudaMemcpyToSymbol(d_dx, &dx, sizeof(float));
    cudaMemcpyToSymbol(d_dy, &dy, sizeof(float));
    cudaMemcpyToSymbol(d_dxxinv, &dxxinv, sizeof(float));
    cudaMemcpyToSymbol(d_dyyinv, &dyyinv, sizeof(float));
    cudaMemcpyToSymbol(d_dcent, &dcent, sizeof(float));

    std::chrono::steady_clock::time_point ts, te;
    ts = std::chrono::steady_clock::now();
    dojacobi();
    te = std::chrono::steady_clock::now();
    std::chrono::steady_clock::duration duration = te - ts;
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(duration);
    std::cout << "Parallel Code Time: " << ms.count() << " ms" << std::endl;
    cudaDeviceReset();
    return 0;
}
</pre>

<nowiki>****************</nowiki>
A NOTE ON SCALABILITY
<nowiki>****************</nowiki>

In our attempts to make the kernel scalable with ghost cells, we scaled along only one dimension, and we were inconsistent in our scaling: the 1D kernel scaled along the n (y) dimension, while the 2D kernels scaled along the m (x) dimension. Scaling along the x dimension, while still testable against the serial and 2D parallelized versions of the code, produced distributions that were strangely banded and skewed. In other words, we made the code render weird things faster:
[[File:MDimensionScale.png]]
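For reference, the difference between the two thread arrangements can be sketched as follows. These are simplified illustrations of the index mappings only (assumed names, source term omitted), not our exact A2/A3 kernels: the 1D kernel takes its y index from the launch configuration and walks x in a serial loop, so the grid can only grow along n, while the 2D kernel maps both directions onto threads.

<pre>
__global__ void jacobi1D(float* a, const float* b, int m, int n,
                         float dxxinv, float dyyinv, float dcent) {
    int j = blockDim.x * blockIdx.x + threadIdx.x + 1;   // y comes from the launch size
    if (j < n - 1)
        for (int i = 1; i < m - 1; i++) {                // x is walked serially
            int ij = i + m * j;
            a[ij] = (dxxinv * (b[ij - 1] + b[ij + 1])
                   + dyyinv * (b[ij - m] + b[ij + m])) * dcent;
        }
}

__global__ void jacobi2D(float* a, const float* b, int m, int n,
                         float dxxinv, float dyyinv, float dcent) {
    int i = blockDim.x * blockIdx.x + threadIdx.x + 1;   // both directions come from
    int j = blockDim.y * blockIdx.y + threadIdx.y + 1;   // the launch size
    if (i < m - 1 && j < n - 1) {
        int ij = i + m * j;
        a[ij] = (dxxinv * (b[ij - 1] + b[ij + 1])
               + dyyinv * (b[ij - m] + b[ij + m])) * dcent;
    }
}
</pre>

Keeping the scaling dimension consistent between the two designs, and matching it to the dimension the serial code varies, would have made the outputs directly comparable.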
<nowiki>****************</nowiki>
FINAL TIMINGS
<pre style="color: red"> THE GRAPH IMMEDIATELY BELOW IS INCORRECT: there was an error recording the 1D runtimes for assignment 2</pre>
<nowiki>****************</nowiki>

[[File:dps915_gpusquad_a3chart.png]]

<nowiki>****************</nowiki>
PROPER TIMINGS:

[[File:Code_timings2.png]]

Blue = Serial (A1), Orange = Parallel (A2), Grey = Optimized Global (A3), Yellow = Optimized Shared (A3)

The above graph includes the total run times for the serial code, the 1D kernel from assignment 2, a kernel with global and constant memory with a 2D thread arrangement, and the same 2D arrangement but with shared memory utilizing ghost cells.
We found that the most efficient version of the code was the 2D version that used constant memory and did not use shared memory. Because the shared memory version of the kernel required synchronization of threads to allocate shared memory every time a kernel was run, and a kernel was run 5000 times for each version of our code, this added overhead for memory setup actually made execution slower than the version with global memory. In addition, the if statements required to set up the ghost cells for shared memory may have created a certain amount of warp divergence, further slowing down each individual kernel run.
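What the ghost-cell setup being described looks like can be sketched roughly as follows. This is an illustration only (assumed names such as jacobiSharedSketch and BLOCK_W, source term omitted), not our exact A3 kernel; as in our implementation, ghost cells are staged along one dimension only:

<pre>
#define BLOCK_W 32

__global__ void jacobiSharedSketch(float* a, const float* b, int m, int n,
                                   float dxxinv, float dyyinv, float dcent)
{
    __shared__ float row[BLOCK_W + 2];                   // one row segment plus 2 ghost cells

    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;   // global x index (skip fixed boundary)
    int j = blockIdx.y + 1;                              // one interior row per block in y
    int t = threadIdx.x + 1;                             // position inside the shared tile
    int ij = i + m * j;
    bool interior = (i < m - 1) && (j < n - 1);

    // Stage row j of b into shared memory; the extra if statements fill the ghost cells,
    // so only some threads in each warp take these branches.
    if (interior)           row[t] = b[ij];
    if (interior && t == 1) row[0] = b[ij - 1];                       // left ghost cell
    if (interior && (threadIdx.x == blockDim.x - 1 || i == m - 2))
                            row[t + 1] = b[ij + 1];                   // right ghost cell
    __syncthreads();        // this barrier is paid on every launch

    // Jacobi update: x neighbours come from the tile, y neighbours from global memory
    if (interior)
        a[ij] = (dxxinv * (row[t - 1] + row[t + 1])
               + dyyinv * (b[ij - m]  + b[ij + m])) * dcent;
}
</pre>

With m = 32 and n = 1024 this would be launched with a grid of dim3((m - 2 + BLOCK_W - 1) / BLOCK_W, n - 2) blocks of BLOCK_W threads, so the staging branches and the barrier are executed on every one of the 5000 iterations.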
Below are two images that show 4 consecutive kernel runs for both the global and shared versions of the code. It is apparent that the shared kernel runs actually take more time than the global memory versions.
TIMES FOR THE GLOBAL KERNEL

[[File:kernelGlobalTimes.png]]
TIMES FOR THE SHARED KERNEL

[[File:sharedKernelTimes.png]]

Note how the run times for each kernel with shared memory are significantly longer than those with global memory.
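For reference, per-launch times like the ones in these figures can also be measured with CUDA events; the snippet below is a generic sketch of that technique and is not necessarily how these figures were produced:

<pre>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}          // stand-in for matrixKernel

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int k = 0; k < 4; k++) {         // four consecutive runs, as in the images above
        cudaEventRecord(start);
        emptyKernel<<<1, 1024>>>();
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("run %d: %.3f ms\n", k, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
</pre>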
To try to determine whether this issue was one of warp divergence, we also timed a kernel that initialized shared memory (with the same ghost-cell if statements) but referenced only global memory when carrying out the actual calculations:

[[File:GlobalInitSharedKernelTimes.png]]
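A sketch of what such a test kernel could look like, under the same assumptions as the ghost-cell sketch above (assumed names, not our exact test code): the staging if statements and the synchronization stay, but the update reads only global memory, so any extra time comes from the staging itself.

<pre>
#define BLOCK_W 32

__global__ void jacobiGlobalInitShared(float* a, const float* b, int m, int n,
                                       float dxxinv, float dyyinv, float dcent)
{
    __shared__ float row[BLOCK_W + 2];

    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y + 1;
    int t = threadIdx.x + 1;
    int ij = i + m * j;
    bool interior = (i < m - 1) && (j < n - 1);

    // Same shared-memory staging (ghost-cell if statements and barrier) as before...
    if (interior)           row[t] = b[ij];
    if (interior && t == 1) row[0] = b[ij - 1];
    if (interior && (threadIdx.x == blockDim.x - 1 || i == m - 2))
                            row[t + 1] = b[ij + 1];
    __syncthreads();

    // ...but the update ignores the tile and reads global memory directly
    if (interior)
        a[ij] = (dxxinv * (b[ij - 1] + b[ij + 1])
               + dyyinv * (b[ij - m] + b[ij + m])) * dcent;
}
</pre>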
The figure above shows runs of a kernel that allocated shared memory using a series of if statements but executed its instructions using global memory. While slightly longer than the runs with plain global memory, where shared memory is not initialized for ghost cells, these runs still take less time than the version that actually uses shared memory for its calculations. It is likely that our group's attempts to employ shared memory failed because we did not adequately schedule or partition the shared memory, and the kernel was slowed as a result. The shared memory allocated per block was 34x32 (the dimensions of the shared memory matrix) x 4 bytes (the size of a float), which equals 4,352 bytes per block, well under the roughly 48 KB (49,152 bytes) per block available on a device of compute capability 5.0, on which this series of individual kernel tests was performed. With this in mind, it is still unclear why the shared memory kernel performed more poorly than the global memory implementation.
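That arithmetic can be checked against the device at run time. The snippet below is an illustrative check (not part of the assignment code) that prints the tile footprint next to the device's per-block shared memory limit:

<pre>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    size_t tileBytes = 34 * 32 * sizeof(float);   // the 34x32 ghost-cell tile discussed above
    std::printf("tile: %zu bytes, limit: %zu bytes per block (sm_%d%d)\n",
                tileBytes, prop.sharedMemPerBlock, prop.major, prop.minor);
    // On a compute capability 5.0 device this should report 4352 bytes against a
    // 49152-byte limit, so shared memory capacity was not the limiting factor.
    return 0;
}
</pre>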
Unfortunately, our group's inability to use profiling tools effectively has left this discrepancy a mystery.
In conclusion, while it may well be possible to parallelize the algorithm we chose effectively, doing so would involve ensuring that shared memory is properly synchronized across two block dimensions (two dimensions of ghost cells rather than the one we implemented), and ensuring that shared memory is allocated so that maximum occupancy is achieved on the GPU. Unfortunately, our attempts fell short: implementing constant memory did seem to speed up the kernel somewhat, but our solution was not fully scalable in both dimensions, and shared memory was not implemented in a way that improved kernel efficiency.
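For completeness, the fully scalable layout the conclusion describes, with ghost cells in both dimensions, would look roughly like the following sketch (assumed names and tile sizes; our group did not implement this):

<pre>
#define TILE_X 32
#define TILE_Y 8

__global__ void jacobi2DHaloSketch(float* a, const float* b, int m, int n,
                                   float dxxinv, float dyyinv, float dcent)
{
    __shared__ float tile[TILE_Y + 2][TILE_X + 2];    // interior tile plus a halo on every side

    int i = blockIdx.x * TILE_X + threadIdx.x + 1;    // global x index
    int j = blockIdx.y * TILE_Y + threadIdx.y + 1;    // global y index
    int tx = threadIdx.x + 1;
    int ty = threadIdx.y + 1;
    int ij = i + m * j;
    bool interior = (i < m - 1) && (j < n - 1);

    if (interior) {
        tile[ty][tx] = b[ij];
        if (tx == 1)                      tile[ty][0]      = b[ij - 1];   // west halo
        if (tx == TILE_X || i == m - 2)   tile[ty][tx + 1] = b[ij + 1];   // east halo
        if (ty == 1)                      tile[0][tx]      = b[ij - m];   // south halo
        if (ty == TILE_Y || j == n - 2)   tile[ty + 1][tx] = b[ij + m];   // north halo
    }
    __syncthreads();

    // Both neighbour directions now come from shared memory
    if (interior)
        a[ij] = (dxxinv * (tile[ty][tx - 1] + tile[ty][tx + 1])
               + dyyinv * (tile[ty - 1][tx] + tile[ty + 1][tx])) * dcent;
}
</pre>

Each block would use (TILE_Y + 2) x (TILE_X + 2) x 4 = 1,360 bytes of shared memory, leaving plenty of room to tune the tile shape for occupancy.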