We found that the most efficient version of the code was the 2D version that used constant memory and did not use shared memory. Because the shared-memory version of the kernel required thread synchronization to fill shared memory every time the kernel was launched, and the kernel was launched 5000 times for each version of our code, this added memory-setup overhead actually made execution slower than the global-memory version.
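To make the overhead concrete, here is a minimal sketch of what a shared-memory Jacobi step of this kind looks like. This is not our exact kernel: the names, tile sizes, and 5-point update are assumptions, and nx/ny are assumed to be multiples of the tile dimensions so the grid exactly covers the domain.

```cuda
// Hypothetical shared-memory Jacobi step; tile sizes are assumptions.
#define TILE_X 32
#define TILE_Y 16

__global__ void jacobiShared(const float *in, float *out, int nx, int ny) {
    // Tile plus a one-cell ghost border on each side.
    __shared__ float tile[TILE_Y + 2][TILE_X + 2];

    int gx = blockIdx.x * TILE_X + threadIdx.x;
    int gy = blockIdx.y * TILE_Y + threadIdx.y;
    int lx = threadIdx.x + 1, ly = threadIdx.y + 1;

    // Every thread loads its own cell into the tile.
    tile[ly][lx] = in[gy * nx + gx];

    // Ghost cells: only the edge threads of each block take these branches,
    // so some threads of a warp run them while others idle.
    if (threadIdx.x == 0 && gx > 0)
        tile[ly][0] = in[gy * nx + gx - 1];
    if (threadIdx.x == TILE_X - 1 && gx < nx - 1)
        tile[ly][TILE_X + 1] = in[gy * nx + gx + 1];
    if (threadIdx.y == 0 && gy > 0)
        tile[0][lx] = in[(gy - 1) * nx + gx];
    if (threadIdx.y == TILE_Y - 1 && gy < ny - 1)
        tile[TILE_Y + 1][lx] = in[(gy + 1) * nx + gx];

    // Every launch pays for this barrier before any arithmetic happens.
    __syncthreads();

    // 5-point Jacobi update on interior cells only.
    if (gx > 0 && gx < nx - 1 && gy > 0 && gy < ny - 1)
        out[gy * nx + gx] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                                     tile[ly - 1][lx] + tile[ly + 1][lx]);
}
```

The global-memory version skips the tile load and the barrier entirely, which is why, at 5000 launches, the setup cost dominates.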
One of the issues encountered when trying to profile the code was the fact that different group members were trying to profile with different hardware. The hardware changed based on the rooms we ended up profiling in (open lab vs. lab computers vs. laptops with different video cards).

Beyond the synchronization overhead, the if statements required to set up the ghost cells for shared memory may have created a certain amount of warp divergence, further slowing down each individual kernel run. Below are two images that show 4 consecutive kernel runs for both the global and shared versions of the code. It is apparent that the shared kernel runs actually take more time than the global-memory versions.

TIMES FOR THE GLOBAL KERNEL [[File:kernelGlobalTimes.png]]

TIMES FOR THE SHARED KERNEL [[File:sharedKernelTimes.png]]

Note how the run times for each kernel with shared memory are significantly longer than those with global memory. To test whether this is an issue of warp divergence, here is another diagram with timings where the kernel still sets up shared memory, using if statements to determine when to initialize the ghost cells, but runs the Jacobi calculations using global memory:

[[File:GlobalInitSharedKernelTimes.png]]

It turns out that this version does not run as slowly either -- the issue is probably with resource allocation (trying to allocate more shared memory than a block can handle). The above graph was produced on an open lab computer with a Quadro K620 card. Try reducing the size of the shared memory tile to 32x16?
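The resource-allocation hypothesis and the 32x16 suggestion can be sanity-checked with host-side arithmetic. The sketch below assumes float data, a one-cell ghost border, and the limits of a Maxwell-class card such as the Quadro K620 (48 KB of shared memory per block, 64 KB per SM); these numbers are assumptions to illustrate the check, not measured values.

```cuda
#include <cstdio>

// Assumed limits for a Maxwell-class card (e.g. Quadro K620).
const int SHARED_PER_BLOCK = 48 * 1024;  // bytes available to one block
const int SHARED_PER_SM    = 64 * 1024;  // bytes available on one SM

// Shared memory needed by a tx-by-ty tile with a one-cell ghost border.
int tileBytes(int tx, int ty) {
    return (tx + 2) * (ty + 2) * (int)sizeof(float);
}

int main() {
    int big   = tileBytes(32, 32);  // 34 * 34 * 4 = 4624 bytes
    int small = tileBytes(32, 16);  // 34 * 18 * 4 = 2448 bytes

    printf("32x32 tile: %d bytes/block (limit %d), %d blocks/SM by shared memory\n",
           big, SHARED_PER_BLOCK, SHARED_PER_SM / big);
    printf("32x16 tile: %d bytes/block (limit %d), %d blocks/SM by shared memory\n",
           small, SHARED_PER_BLOCK, SHARED_PER_SM / small);
    return 0;
}
```

Note that neither tile comes close to the per-block limit, so if resource allocation is the culprit it is more likely an occupancy effect (a 32x32 block is also 1024 threads, the per-block maximum); halving the tile to 32x16 reduces both shared memory per block and threads per block, which is worth measuring.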
[TODO: INCLUDE PROFILING BREAKDOWNS OF INDIVIDUAL (NOT 5000) KERNEL RUNS TO SEE SPECIFIC TIMELINE FEATURES. EXPLAIN THE DIFFERENCES IN RUN TIMES]