1
edit
Changes
Hu3Team
,→CUDA Coding
}
</pre>
We made use of shared memory to speed up the memory access for the kernel, along with coalesced memory access. We were already doing a simple reduction for getting the biggest difference, but with these tow optimizations alone we were able to get a speed up of almost 50% from the first not-optimized CUDA solution.
Because the code works getting neighboring elements from the matrix, we had to make a bigger check on the heat calculation part, to avoid illegal memory access.
====Comparing the results====