Low memcpy/compute overlap is related to concurrent kernel execution. In theory, you can copy chunks of the input array to the device asynchronously and launch a kernel on each chunk as soon as it arrives, so that transfers and computation overlap. However, it seems to be hard to partition the input data in any meaningful way.
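For illustration, here is a minimal sketch of what chunked asynchronous transfers could look like with CUDA streams. It is not the project's code: the <code>scale</code> kernel, the <code>run_chunked</code> helper, and the chunk count are hypothetical. It assumes every element can be processed independently, which is exactly the condition that is hard to meet when the input does not partition cleanly.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-element kernel; stands in for the real computation.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Splits n elements into num_chunks pieces, one stream per piece.
// Returns true if every element came back with the expected value.
bool run_chunked(int n, int num_chunks) {
    float *h_data = nullptr, *d_data = nullptr;
    // Pinned host memory is needed for cudaMemcpyAsync to overlap with kernels.
    if (cudaMallocHost(&h_data, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h_data);
        return false;
    }
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    std::vector<cudaStream_t> streams(num_chunks);
    for (int c = 0; c < num_chunks; ++c) cudaStreamCreate(&streams[c]);

    const int chunk = (n + num_chunks - 1) / num_chunks;  // ceiling division
    const int threads = 256;
    for (int c = 0; c < num_chunks; ++c) {
        const int offset = c * chunk;
        const int count = (n - offset < chunk) ? (n - offset) : chunk;
        if (count <= 0) break;
        const size_t bytes = count * sizeof(float);
        // Copy in, compute, copy out: all queued on the same stream, so they
        // run in order for this chunk but may overlap with other chunks.
        cudaMemcpyAsync(d_data + offset, h_data + offset, bytes,
                        cudaMemcpyHostToDevice, streams[c]);
        scale<<<(count + threads - 1) / threads, threads, 0, streams[c]>>>(
            d_data + offset, count, 2.0f);
        cudaMemcpyAsync(h_data + offset, d_data + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[c]);
    }
    // Wait for all streams and pick up any launch or copy error.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) ok = (h_data[i] == 2.0f);

    for (int c = 0; c < num_chunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return ok;
}

int main() {
    const bool ok = run_chunked(1 << 20, 4);
    std::printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Whether this actually improves the overlap depends on the device having a free copy engine and on each chunk being large enough to hide the launch overhead. If the kernel needs data from neighbouring chunks, this simple split no longer applies.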
==== Switching to x86 from x64 ====