GPUSquad

== Team Members ==
# [mailto:tsarkar3@myseneca.ca?subject=dps915 Tanvir Sarkar]
# [mailto:moverall@myseneca.ca?subject=dps915 Michael Overall]
# [mailto:ikrasnyanskiy@myseneca.ca?subject=gpu610 Igor Krasnyanskiy]
# [mailto:tsarkar3@myseneca.ca;moverall@myseneca.ca;ikrasnyanskiy@myseneca.ca?subject=dps915gpu610 Email All]
== Progress ==
<nowiki>****************</nowiki>
 
A NOTE ON SCALABILITY:
 
In our attempts to make the kernel scalable with ghost cells, we scaled along one dimension, but we were inconsistent about which one: the 1D kernel scaled along the n (y) dimension, while the 2D kernels scaled along the m (x) dimension. Scaling along the x dimension, while allowing results to be compared between the serial and 2D parallelized versions of the code, produced distributions that were strangely banded and skewed. In other words, we made the code render weird things faster (a sketch of the one-dimensional ghost-cell approach follows the figure below):
[[File:MDimensionScale.png]]
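To make the one-dimensional approach concrete, here is a minimal sketch of the idea (our own illustration, not the project's actual kernel; the name jacobiGhost1D and the 32x32 block size are assumptions): a 5-point Jacobi step whose shared memory tile carries ghost cells along only one dimension, so neighbours in the other dimension must still be read from global memory.
<source lang="cpp">
// Sketch only: a Jacobi step whose shared tile has ghost rows along one
// dimension (y). The tile is (BLOCK_Y + 2) x BLOCK_X = 34x32 floats,
// matching the shared memory dimensions discussed further down this page.
// Launched as jacobiGhost1D<<<grid, dim3(BLOCK_X, BLOCK_Y)>>>(in, out, m, n);
#define BLOCK_X 32
#define BLOCK_Y 32

__global__ void jacobiGhost1D(const float* d_in, float* d_out, int m, int n) {
    __shared__ float s[BLOCK_Y + 2][BLOCK_X];         // two extra ghost rows

    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x index, 0..m-1
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y index, 0..n-1
    int ty  = threadIdx.y + 1;                        // tile row 0 is a ghost row
    bool inside = (col < m && row < n);

    // Every in-range thread copies its own element into the tile.
    if (inside)
        s[ty][threadIdx.x] = d_in[row * m + col];

    // Edge threads also fetch the ghost rows; these if statements are one
    // suspected source of warp divergence.
    if (inside && threadIdx.y == 0 && row > 0)
        s[0][threadIdx.x] = d_in[(row - 1) * m + col];
    if (inside && threadIdx.y == blockDim.y - 1 && row < n - 1)
        s[BLOCK_Y + 1][threadIdx.x] = d_in[(row + 1) * m + col];

    __syncthreads();

    if (inside && col > 0 && col < m - 1 && row > 0 && row < n - 1) {
        // Up/down neighbours come from shared memory (the ghosted dimension);
        // left/right neighbours still hit global memory, because the tile
        // has no ghost columns.
        d_out[row * m + col] = 0.25f *
            (s[ty - 1][threadIdx.x] + s[ty + 1][threadIdx.x] +
             d_in[row * m + col - 1] + d_in[row * m + col + 1]);
    }
}
</source>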
FINAL TIMINGS <pre style="color: red"> THE GRAPH IMMEDIATELY BELOW IS INCORRECT: there was an error recording the 1D runtimes for assignment 2</pre>
Note how the run times for each kernel with shared memory are significantly longer than those with global memory.
To determine whether the problem was warp divergence, we also timed a kernel that performs the full shared memory setup, using if statements (and reads from global memory) to initialize the ghost cells, but carries out the actual Jacobi calculations using global memory:
[[File:GlobalInitSharedKernelTimes.png]]
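A sketch of such a diagnostic kernel (again our own illustration, reusing BLOCK_X, BLOCK_Y, and the indexing from the sketch above) would look like this: it pays the full cost of the shared memory setup, divergent if statements included, but performs the update from global memory, so its run time isolates the setup cost.
<source lang="cpp">
__global__ void jacobiSetupSharedComputeGlobal(const float* d_in, float* d_out,
                                               int m, int n) {
    __shared__ float s[BLOCK_Y + 2][BLOCK_X];
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int ty  = threadIdx.y + 1;
    bool inside = (col < m && row < n);

    // Identical shared memory setup (and identical divergent branches)
    // to the shared memory kernel...
    if (inside)
        s[ty][threadIdx.x] = d_in[row * m + col];
    if (inside && threadIdx.y == 0 && row > 0)
        s[0][threadIdx.x] = d_in[(row - 1) * m + col];
    if (inside && threadIdx.y == blockDim.y - 1 && row < n - 1)
        s[BLOCK_Y + 1][threadIdx.x] = d_in[(row + 1) * m + col];
    __syncthreads();

    // ...but the Jacobi update itself reads only global memory.
    if (inside && col > 0 && col < m - 1 && row > 0 && row < n - 1)
        d_out[row * m + col] = 0.25f *
            (d_in[(row - 1) * m + col] + d_in[(row + 1) * m + col] +
             d_in[row * m + col - 1] + d_in[row * m + col + 1]);
}
</source>
One caveat with this style of experiment: because the tile is never read, an optimizing compiler may drop the shared memory stores entirely, which would understate the setup cost; folding a token read of the tile into the output guards against that.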
This diagnostic kernel, which allocated shared memory using a series of if statements but executed its instructions using global memory, ran slightly longer than the plain global memory kernel (which initializes no ghost cells), yet nowhere near as slowly as the version that actually computes with shared memory. It is therefore likely that our group's attempt to employ shared memory failed because we did not adequately schedule or partition the shared memory, and the kernel was slowed as a result. The shared memory used by each block was 34x32 (the dimensions of the shared memory matrix) x 4 (the size of a float), which equals 4,352 bytes per block: well under the roughly 48 KB per-block maximum for a device of compute capability 5.0, on which this series of individual kernel timing tests was performed. With this in mind, it is still unclear why the shared memory version performed more poorly than the global memory implementation.
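As a sanity check on the arithmetic above, a small host-side calculation (plain C++, our own sketch; 48 KB is the per-block shared memory limit for compute capability 5.0) confirms that the tile was nowhere near the limit:
<source lang="cpp">
#include <cstdio>

int main() {
    const size_t perBlock = 34 * 32 * sizeof(float);  // 34x32 tile of floats
    const size_t limit    = 48 * 1024;                // 49,152 bytes on cc 5.0
    // Prints: shared memory per block: 4352 bytes (8.9% of the 49152-byte limit)
    printf("shared memory per block: %zu bytes (%.1f%% of the %zu-byte limit)\n",
           perBlock, 100.0 * perBlock / limit, limit);
    return 0;
}
</source>
At under 9% of the limit, shared memory capacity by itself should not have capped occupancy, which is consistent with the discrepancy remaining unexplained.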
Unfortunately, our group's inability to use the profiling tools effectively has left this discrepancy a mystery.
In conclusion, while it may be possible to parallelize the algorithm we chose effectively, doing so would require synchronizing shared memory across two block dimensions (two dimensions of ghost cells rather than the one we implemented), and allocating shared memory so that maximum occupancy is achieved on the GPU. Unfortunately, our attempts fell short: while implementing constant memory seemed to speed up the kernel somewhat, our solution was not fully scalable in both dimensions, and shared memory was not implemented in a way that improved kernel efficiency.