The low memcpy/compute overlap is related to concurrent kernel execution. In theory, chunks of the input array can be copied to the device asynchronously and handed to separate kernel launches, so that the transfer of one chunk overlaps with the computation on another. In practice, however, it seems hard to partition the input data in any meaningful way.
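For reference, the chunked approach described above can be sketched as follows. This is only a minimal illustration, assuming the work on each element is independent, which is exactly the condition that appears not to hold for this project's data. The kernel <code>scale</code>, the function <code>runChunked</code>, and the chunk and stream counts are hypothetical and not taken from the project code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical element-wise kernel standing in for the real workload.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Splits n floats into numStreams equal chunks and issues copy-in, kernel,
// and copy-out for each chunk on its own stream. Returns true if every
// element was doubled. Assumes n is divisible by numStreams.
bool runChunked(int n, int numStreams) {
    const int chunk = n / numStreams;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_data = nullptr, *d_data = nullptr;
    // Async copies only overlap with kernels when host memory is page-locked.
    if (cudaMallocHost((void **)&h_data, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h_data);
        return false;
    }
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t *streams = new cudaStream_t[numStreams];
    for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < numStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk, 2.0f);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) ok = (h_data[i] == 2.0f);

    for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
    delete[] streams;
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return ok;
}

int main() {
    printf("chunked run %s\n", runChunked(1 << 20, 4) ? "succeeded" : "failed");
    return 0;
}
```

Whether the copies and kernels actually overlap depends on the device and should be checked in the profiler timeline rather than assumed.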
==== Switching to cudaMallocHost ====
Switching the host allocations to cudaMallocHost (page-locked memory) gives a slight performance increase.
''' The data table '''
[[File:HVMallocHosttable.png|800px]]
''' The diagram '''
[[File:HVMallocHost.png|800px]]
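The kind of comparison behind these measurements can be sketched as below: the same host-to-device copy is timed once from a pageable <code>malloc</code> buffer and once from a <code>cudaMallocHost</code> buffer. This is a minimal sketch, not the project's benchmark; the helper names <code>timeCopy</code> and <code>compareCopy</code> and the buffer size are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Times one host-to-device copy of `bytes` bytes, in milliseconds.
static float timeCopy(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Measures the copy from pageable and from pinned host memory.
// Returns false if any allocation fails.
bool compareCopy(size_t bytes, float *pageableMs, float *pinnedMs) {
    void *pageable = malloc(bytes);
    void *pinned = nullptr;
    void *device = nullptr;
    bool ok = pageable != nullptr
           && cudaMallocHost(&pinned, bytes) == cudaSuccess
           && cudaMalloc(&device, bytes) == cudaSuccess;
    if (ok) {
        memset(pageable, 1, bytes);
        memset(pinned, 1, bytes);
        *pageableMs = timeCopy(device, pageable, bytes);
        *pinnedMs = timeCopy(device, pinned, bytes);
    }
    if (device) cudaFree(device);
    if (pinned) cudaFreeHost(pinned);
    free(pageable);
    return ok;
}

int main() {
    float pageableMs = 0.0f, pinnedMs = 0.0f;
    if (compareCopy(64u << 20, &pageableMs, &pinnedMs)) {
        printf("pageable: %.3f ms, pinned: %.3f ms\n", pageableMs, pinnedMs);
    }
    return 0;
}
```

Memory obtained from cudaMallocHost must be released with cudaFreeHost rather than free. A single timed copy is noisy, so a real measurement would repeat the copy several times and average.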
==== Switching to x86 from x64 ====