Solo Act Assignment 3
The graph above shows the performance comparison. There is no significant improvement from using shared memory.
One other likely cause of such similar results is the effect of coalescing on memory access. The leaf nodes of a round are stored contiguously in global memory. This means that the threads access consecutive addresses, and the hardware is likely merging these global read requests, which reduces the penalty of global access. The extra instruction reduces the benefit gained from shared memory, and the coalesced access speeds up global memory. Despite both of these factors, the results are nearly the same. This demonstrates just how much faster shared memory is, but it also shows that shared memory is not very effective in this situation.

I only tested this up to x 32 elements to be sure I wouldn't run out of shared memory. For larger data sets, the work would have to be partitioned across multiple blocks according to the memory per block and the threads per block. These values constrain the maximum number of leaves.
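To make the comparison concrete, here is a minimal sketch of the two access patterns and of the block sizing arithmetic. It is not the code that was measured: `processLeaf`, `roundGlobal`, `roundShared`, `leavesPerBlock`, `blocksNeeded` and `launchRoundShared` are hypothetical names, the leaves are assumed to be `float` values, and each leaf is assumed to be read once per round.

```cuda
#include <algorithm>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical per-leaf operation standing in for the real work of a round.
__device__ float processLeaf(float x) { return 2.0f * x; }

// Variant A: thread i reads leaves[i] straight from global memory.
// Neighbouring threads touch neighbouring addresses, so a warp's reads
// can be coalesced into a few wide transactions.
__global__ void roundGlobal(const float* leaves, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = processLeaf(leaves[i]);
}

// Variant B: stage the block's leaves in shared memory first.
// The copy and the barrier are extra work, and with a single read per
// leaf there is little reuse to pay for them.
__global__ void roundShared(const float* leaves, float* out, int n) {
    extern __shared__ float tile[];  // sized at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = leaves[i];
    __syncthreads();  // reached by every thread in the block
    if (i < n) out[i] = processLeaf(tile[threadIdx.x]);
}

// Leaves one block can hold: limited by shared memory and by thread count.
inline int leavesPerBlock(size_t sharedBytes, int maxThreads, size_t leafBytes) {
    size_t bySharedMem = sharedBytes / leafBytes;
    return static_cast<int>(std::min(bySharedMem, static_cast<size_t>(maxThreads)));
}

// Blocks needed to cover n leaves (ceiling division).
inline int blocksNeeded(int n, int perBlock) {
    return (n + perBlock - 1) / perBlock;
}

// Launch variant B over n leaves already resident on the device.
// Device 0 is an assumption of this sketch.
void launchRoundShared(const float* dLeaves, float* dOut, int n) {
    if (n <= 0) return;
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return;
    int threads = leavesPerBlock(prop.sharedMemPerBlock,
                                 prop.maxThreadsPerBlock, sizeof(float));
    int blocks = blocksNeeded(n, threads);
    roundShared<<<blocks, threads, threads * sizeof(float)>>>(dLeaves, dOut, n);
}
```

Under these assumptions each block handles as many leaves as the smaller of its two limits allows, and the number of blocks grows with the leaf count, which is the partitioning I would need for larger data sets.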