Solo Act
Assignment 3
((ti + q) + ((ti + 1) % 2)) - (((((ti + 1) / 2) - 1) + ((ti + 1) % 2)) + (t / 2) + 1);
The improved kernel can be seen above.
I then made a version of the kernel that uses shared memory, seen above.
The performance comparison can be seen in the graph above.
I only tested this up to x elements to be sure I wouldn't run out of shared memory. For larger data sets, the work would have to be partitioned across multiple blocks, with each partition sized to fit within the shared memory per block and the threads per block. These two limits constrain the maximum number of leaves a single block can handle.
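The partitioning constraint described above can be sketched on the host side. This is a minimal illustration, not the code used in the assignment: the `BlockLimits` struct and the functions `max_elements_per_block` and `blocks_needed` are hypothetical names, and the sketch assumes each element occupies a fixed number of bytes in shared memory and each thread handles a fixed number of elements. In a CUDA program the two limits would typically be read from the `sharedMemPerBlock` and `maxThreadsPerBlock` fields of `cudaDeviceProp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical container for the two per-block limits discussed in the text.
// In CUDA these would usually come from cudaGetDeviceProperties().
struct BlockLimits {
    std::size_t shared_mem_bytes;  // shared memory available to one block
    std::size_t max_threads;       // maximum threads in one block
};

// Largest number of elements (leaves) one block can handle, assuming each
// element takes bytes_per_element of shared memory and each thread
// processes elements_per_thread elements.
std::size_t max_elements_per_block(const BlockLimits& lim,
                                   std::size_t bytes_per_element,
                                   std::size_t elements_per_thread) {
    assert(bytes_per_element > 0 && elements_per_thread > 0);
    const std::size_t by_memory = lim.shared_mem_bytes / bytes_per_element;
    const std::size_t by_threads = lim.max_threads * elements_per_thread;
    return std::min(by_memory, by_threads);  // the tighter limit wins
}

// Number of blocks needed to cover n elements (ceiling division).
std::size_t blocks_needed(std::size_t n, std::size_t per_block) {
    assert(per_block > 0);
    return (n + per_block - 1) / per_block;
}
```

As an illustration with assumed limits of 48 KiB of shared memory and 1024 threads per block, 4-byte elements handled one per thread are capped by the thread count at 1024 elements per block, while 16 elements per thread would instead be capped by shared memory at 12288 elements per block.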