Changes


Happy Valley

435 bytes added, 11:07, 11 April 2018
Assignment 3
Low memcpy/compute overlap is related to concurrent kernel execution. In theory, chunks of the input array can be passed asynchronously to each kernel. In practice, however, it seems hard to partition the input data in any meaningful way.
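The chunked approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the assignment's code: <code>processChunk</code> and <code>processInChunks</code> are hypothetical names, and the kernel is a simple element-wise placeholder chosen because its chunks are independent of each other. The host buffer is assumed to be pinned, since copies from pageable memory generally do not overlap with kernel execution.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical placeholder kernel: doubles every element of its chunk.
__global__ void processChunk(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Splits h_data into nStreams chunks and issues copy-in, kernel, and
// copy-out for each chunk on its own stream, so that the copy of one
// chunk may overlap the kernel of another.
// Assumption: h_data points to pinned host memory.
// Returns cudaSuccess, or the first error that was detected.
cudaError_t processInChunks(float* h_data, int n, int nStreams) {
    if (n <= 0 || nStreams <= 0) return cudaErrorInvalidValue;

    float* d_data = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_data, n * sizeof(float));
    if (err != cudaSuccess) return err;

    std::vector<cudaStream_t> streams(nStreams);
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    const int chunk = (n + nStreams - 1) / nStreams;  // ceiling division
    const int threads = 256;
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        if (offset >= n) break;
        int count = (offset + chunk <= n) ? chunk : n - offset;
        size_t bytes = static_cast<size_t>(count) * sizeof(float);

        cudaMemcpyAsync(d_data + offset, h_data + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        processChunk<<<(count + threads - 1) / threads, threads, 0,
                       streams[s]>>>(d_data + offset, count);
        cudaMemcpyAsync(h_data + offset, d_data + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Waits for all streams and reports any asynchronous error.
    err = cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    return err;
}
```

A caller would allocate the host buffer with <code>cudaMallocHost</code>, fill it, and call <code>processInChunks(buffer, n, 4)</code>. The pattern only helps when each chunk can be processed on its own, which is the partitioning difficulty noted above.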
 
==== Switching to cudaMallocHost ====
 
There is a slight performance increase when switching to cudaMallocHost.
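For readers who want to reproduce this kind of comparison, a minimal sketch is shown below. It is not the code used for the measurements on this page: <code>timeHostToDevice</code> and <code>compareCopyTimes</code> are hypothetical helper names, and a single timed copy is only a rough indicator. It times one host-to-device copy from pageable memory (<code>malloc</code>) and one from pinned memory (<code>cudaMallocHost</code>) using CUDA events.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

// Times a single host-to-device copy of `bytes` bytes, in milliseconds.
// Returns a negative value if a CUDA call fails.
float timeHostToDevice(const void* h_src, size_t bytes) {
    void* d_dst = nullptr;
    if (cudaMalloc(&d_dst, bytes) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaError_t err = cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_dst);
    return (err == cudaSuccess) ? ms : -1.0f;
}

// Measures the same copy from pageable and from pinned host memory.
// Returns true if both measurements succeeded.
bool compareCopyTimes(size_t bytes, float* pageableMs, float* pinnedMs) {
    void* pageable = std::malloc(bytes);      // ordinary pageable memory
    if (pageable == nullptr) return false;

    void* pinned = nullptr;                   // page-locked memory
    if (cudaMallocHost(&pinned, bytes) != cudaSuccess) {
        std::free(pageable);
        return false;
    }

    std::memset(pageable, 1, bytes);
    std::memset(pinned, 1, bytes);

    *pageableMs = timeHostToDevice(pageable, bytes);
    *pinnedMs = timeHostToDevice(pinned, bytes);

    cudaFreeHost(pinned);                     // pinned memory needs cudaFreeHost
    std::free(pageable);
    return *pageableMs >= 0.0f && *pinnedMs >= 0.0f;
}
```

Calling <code>compareCopyTimes</code> with a buffer size close to the real input gives two timings to compare. Note that memory from <code>cudaMallocHost</code> must be released with <code>cudaFreeHost</code>, not <code>free</code>.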
 
''' The data table '''
 
[[File:HVMallocHosttable.png|800px]]
 
''' The diagram '''
 
[[File:HVMallocHost.png|800px]]
==== Switching to x86 from x64 ====
Although the program's results did not change, the console showed this warning message:
<pre>
==4500== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
</pre>
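The requirements linked in the warning include a 64-bit host application, which would be consistent with the message appearing only in the x86 build. One way to check what the current build and device report is sketched below. <code>managedMemorySupported</code> and <code>hostPointerBits</code> are hypothetical helper names, and the sketch assumes that the <code>managedMemory</code> field of <code>cudaDeviceProp</code> reflects the situation the profiler complains about.

```cuda
#include <cuda_runtime.h>

// Returns 1 if the device reports support for managed (unified) memory
// on this system, 0 if it does not, and -1 if the query itself fails.
int managedMemorySupported(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.managedMemory ? 1 : 0;
}

// Pointer width of the host build in bits: 32 for x86, 64 for x64.
int hostPointerBits() {
    return static_cast<int>(sizeof(void*) * 8);
}
```

Printing both values from the x86 and the x64 builds would show whether the two configurations differ in the way the warning suggests.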
 
=== Reference ===
 
''' Links '''
 
<pre>
http://parallelcomp.uw.hu/ch09lev1sec2.html
</pre>
 
<pre>
https://www.geeksforgeeks.org/bitonic-sort/
</pre>
 
<pre>
https://en.wikipedia.org/wiki/Selection_sort
</pre>