Changes

AAA Adrina Arsa Andriy

1,092 bytes added, 16:21, 2 December 2014

→‎Assignment 3

For assignment 3 we made several changes to speed up the program, and we observed an overall speed-up of approximately 50%.

To achieve this speed-up we removed thread divergence from the kernels and eliminated some unnecessary memory copies. Removing thread divergence initially gave a speed-up of around 10%. We expected this gain to be small because, when we ran our code, the majority of the time was spent in the memory copy phase. To speed up the kernels slightly more, we also used shared memory within the kernel. With all the kernel updates applied, the speed-up was slightly over 10%. When we looked at the memory copies used in the previous version, we found that several of them did not need to be done at all, and that others could be moved to more efficient places. After moving these copies, we saw a speed-up of just over 40%. The final version of the program now runs approximately 50% faster than the previous version. '''To calculate the number of threads per block we used the CUDA Occupancy Calculator.'''

[[Image:Nvidia Occupancy Calculator on Code.jpg|thumb|800px|center]]
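The two kinds of change described above can be illustrated with a minimal sketch. It is not the assignment's code: the kernels <code>scaleOrZero</code> and <code>addOne</code>, the data, and the block size of 256 are all hypothetical, and error checking is left out for brevity. The first kernel replaces a data-dependent <code>if/else</code> with arithmetic so that all threads in a warp follow the same path, and the host code copies the data to the device once, runs both kernels, and copies it back once instead of round-tripping through the host between launches.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel. Branch-free form of:
//   if (x[i] < 0) x[i] = 0; else x[i] = 2 * x[i];
// so threads in a warp do not take different paths on the data.
__global__ void scaleOrZero(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                       // bounds check: can only diverge in the last warp
        float v = x[i];
        x[i] = 2.0f * fmaxf(v, 0.0f);  // negative values become 0, others are doubled
    }
}

// Hypothetical second stage that consumes the first kernel's output.
__global__ void addOne(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const int ntpb = 256;                        // assumed block size for this sketch
    const int nblocks = (n + ntpb - 1) / ntpb;   // enough blocks to cover n elements

    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = (i % 2) ? 1.0f : -1.0f;

    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    // One copy to the device before the first kernel ...
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scaleOrZero<<<nblocks, ntpb>>>(d, n);
    // ... no copy back to the host between the kernels ...
    addOne<<<nblocks, ntpb>>>(d, n);
    // ... and one copy back to the host after the last kernel.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("%f %f\n", h[0], h[1]);

    cudaFree(d);
    delete[] h;
    return 0;
}
```

For very short branches the compiler may already generate predicated code, so the gain from a rewrite like this depends on how much work each side of the original branch did.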

In the previous version we determined the number of threads per block dynamically. We could not use the dynamically determined value in this version because shared memory was used. On the school lab computers the number of threads per block (NTPB) was 1024.
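One likely reason for this restriction is that a statically declared <code>__shared__</code> array needs a size known at compile time. The sketch below assumes that situation. The kernel <code>blockSum</code>, the constant <code>NTPB</code>, and the test data are hypothetical and are not taken from the assignment. The host code still queries the device, but only to check that the fixed block size is supported.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int NTPB = 1024;  // assumed fixed threads per block (the lab machines' value)

// Hypothetical kernel: each block sums NTPB elements through shared memory.
// It must be launched with exactly NTPB threads per block.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float s[NTPB];            // static shared array: size fixed at compile time
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    s[t] = (i < n) ? in[i] : 0.0f;       // pad out-of-range elements with zero
    __syncthreads();
    for (int stride = NTPB / 2; stride > 0; stride >>= 1) {
        if (t < stride) s[t] += s[t + stride];
        __syncthreads();                 // all threads reach this barrier on every pass
    }
    if (t == 0) out[blockIdx.x] = s[0];  // one partial sum per block
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (prop.maxThreadsPerBlock < NTPB) {
        printf("device supports only %d threads per block\n", prop.maxThreadsPerBlock);
        return 1;
    }

    const int n = 4 * NTPB;
    const int nblocks = (n + NTPB - 1) / NTPB;

    float* h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float* sums = new float[nblocks];

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, nblocks * sizeof(float));
    cudaMemcpy(dIn, h, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<nblocks, NTPB>>>(dIn, dOut, n);
    cudaMemcpy(sums, dOut, nblocks * sizeof(float), cudaMemcpyDeviceToHost);

    for (int b = 0; b < nblocks; ++b) printf("block %d sum = %f\n", b, sums[b]);

    cudaFree(dIn);
    cudaFree(dOut);
    delete[] h;
    delete[] sums;
    return 0;
}
```

An alternative not used in this sketch is dynamically sized shared memory: the array is declared as <code>extern __shared__ float s[];</code> and its size in bytes is passed as the third launch parameter, which allows the block size to be chosen at run time.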

Adrian A Sauvageot
