Note that the run times for each kernel using shared memory are significantly longer than those for the kernels using global memory.
To try to determine whether this was an issue of warp divergence, we timed a kernel that also initializes shared memory, using if statements to decide when to initialize ghost cells, but that references only global memory when carrying out the actual Jacobi calculations:
[[File:GlobalInitSharedKernelTimes.png]]
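As a rough illustration of the kind of hybrid kernel being timed here, the sketch below fills a shared memory tile (ghost cells included) behind if statements and then computes the update from global memory only. It is not our actual kernel: the name <code>jacobiGlobalWithSharedInit</code>, the 32x32 thread block, the three-point stencil along x, and the zero-valued ghost cells at the domain edge are all assumptions made for the example. The tile is declared <code>volatile</code> so that the compiler has no licence to discard stores to a tile that is never read.

```cuda
#include <cuda_runtime.h>

constexpr int TILE_X = 32;          // threads per block in x (assumed)
constexpr int TILE_Y = 32;          // threads per block in y (assumed)
constexpr int SMEM_X = TILE_X + 2;  // 34: one ghost cell on each side in x

// Hypothetical hybrid kernel: stages a tile in shared memory, choosing the
// ghost cells with if statements, but computes the update from global memory.
__global__ void jacobiGlobalWithSharedInit(const float* in, float* out,
                                           int nx, int ny)
{
    // 32 x 34 floats = 4,352 bytes of static shared memory per block.
    __shared__ volatile float tile[TILE_Y][SMEM_X];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int i = blockIdx.x * TILE_X + tx;  // global column
    const int j = blockIdx.y * TILE_Y + ty;  // global row
    const bool inside = (i < nx) && (j < ny);

    if (inside) {
        const int s = tx + 1;                // skip the left ghost column
        tile[ty][s] = in[j * nx + i];

        // Only the edge threads of each row take these branches, so the
        // threads of a warp follow different paths here.
        if (tx == 0)
            tile[ty][0] = (i > 0) ? in[j * nx + i - 1] : 0.0f;
        if (tx == TILE_X - 1 || i == nx - 1)
            tile[ty][s + 1] = (i + 1 < nx) ? in[j * nx + i + 1] : 0.0f;
    }
    __syncthreads();  // reached by every thread in the block

    // The update deliberately ignores the tile and reads global memory.
    // Boundary columns are left untouched (assumed fixed boundary values).
    if (inside && i > 0 && i < nx - 1)
        out[j * nx + i] = 0.5f * (in[j * nx + i - 1] + in[j * nx + i + 1]);
}
```

Under these assumptions the kernel would be launched with <code>dim3 block(32, 32)</code> and a grid of <code>(nx + 31) / 32</code> by <code>(ny + 31) / 32</code> blocks. If the branching were the main cost, a kernel of this shape should run about as slowly as the full shared memory version.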
The figure above shows the run of a kernel that allocates shared memory and fills it using a series of if statements, but executes the Jacobi calculations using global memory. While this run takes slightly longer than the global memory run in which shared memory is not initialized for ghost cells, it still takes less time than the shared memory version. The if statements, and hence warp divergence, therefore do not appear to account for the slowdown. The issue is more likely one of resource allocation: our group's attempts to employ shared memory probably failed because we did not adequately schedule or partition the shared memory, and the kernel was slowed as a result. One possibility we considered was that each block was trying to allocate more shared memory than it can handle, which could be tested by reducing the shared memory tile to 32x16. However, the shared memory footprint of a block was 34 x 32 (the dimensions of the shared memory matrix) x 4 bytes (the size of a float) = 4,352 bytes, which is well below the maximum of about 49 KB per block stated for a device of compute capability 5.0 (the device on which this series of individual kernel timings was performed). With this in mind, it is still unclear why the shared memory implementation performed more poorly than the global memory implementation.
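The footprint arithmetic above can be checked with a few lines of host code. This is only a sketch: the helper names <code>tileBytes</code>, <code>fitsInBlock</code> and <code>checkedTileBytes</code> are made up for the example, and the limit is hard-coded to the 48 KB (49,152 bytes) per-block figure for compute capability 5.0 rather than queried from the device. It is plain C++ and should also compile as host code in a <code>.cu</code> file.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of shared memory needed by a rows x cols tile of floats.
constexpr std::size_t tileBytes(std::size_t rows, std::size_t cols) {
    return rows * cols * sizeof(float);
}

// Per-block shared memory limit for compute capability 5.0: 48 KB.
// Hard-coded assumption for this sketch, not read from the device.
constexpr std::size_t kSharedLimitBytes = 48 * 1024;  // 49,152 bytes

// True if a rows x cols float tile fits within the per-block limit.
constexpr bool fitsInBlock(std::size_t rows, std::size_t cols) {
    return tileBytes(rows, cols) <= kSharedLimitBytes;
}

// Runtime variant for tile sizes that are only known at run time.
inline std::size_t checkedTileBytes(std::size_t rows, std::size_t cols) {
    const std::size_t bytes = tileBytes(rows, cols);
    assert(bytes <= kSharedLimitBytes && "tile exceeds per-block shared memory");
    return bytes;
}

// Compile-time check on platforms where float is 4 bytes.
static_assert(sizeof(float) != 4 || tileBytes(34, 32) == 4352,
              "34 x 32 floats should occupy 4,352 bytes");
static_assert(fitsInBlock(34, 32), "the 34 x 32 tile fits in 48 KB");
```

With 4-byte floats, the 34 x 32 tile uses under a tenth of the per-block limit, so the tile size alone does not exceed what a single block may allocate.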
We had intended to include profiling breakdowns of individual (rather than 5000) kernel runs, in order to examine specific timeline features and explain the differences in run times. Unfortunately, our group was unable to use the profiling tools effectively, so this discrepancy remains a mystery.