Changes

Algo holics

822 bytes added, 02:44, 8 April 2019


=== Assignment 3 ===

To optimize the code further, we removed the iterative loop from the kernel and used threadIdx.y to select which element's cosine term each thread computes for a given position in the conceptual matrix. The problem with this approach was a race condition: every thread working on a row wrote to the same memory location in order to accumulate the sum of that row's cosine terms. We solved this by using the atomic function atomicAdd. Its prototype is as follows.

double atomicAdd(double* address, double value)
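The approach described above might be sketched as follows. This is a minimal illustration, not the kernel from the assignment: the kernel name `dctAtomic`, the host helper `dctAtomicHost`, the CPU check `dctReference`, the 16x16 block size, and the unnormalized DCT-II formula are all assumptions made for the example. The double-precision overload of atomicAdd requires a device of compute capability 6.0 or higher, so the sketch assumes compilation with something like `nvcc -arch=sm_60`.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical kernel: one thread per (row k, column j) term of an
// unnormalized DCT-II (the exact transform is an assumption).
// out[] must be zeroed before launch.
__global__ void dctAtomic(const double* in, double* out, int n) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // output element (row)
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // input element (column)
    if (k < n && j < n) {
        const double pi = 3.14159265358979323846;
        double term = in[j] * cos(pi / n * (j + 0.5) * k);
        // All threads with the same k add into the same location, so the
        // update must be atomic to avoid the race condition.
        atomicAdd(&out[k], term);
    }
}

// Hypothetical host helper: copies data, launches the kernel, copies back.
bool dctAtomicHost(const std::vector<double>& in, std::vector<double>& out) {
    int n = static_cast<int>(in.size());
    out.assign(n, 0.0);
    size_t bytes = n * sizeof(double);
    double *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, bytes);  // all-zero bytes represent 0.0
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    dctAtomic<<<grid, block>>>(dIn, dOut, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}

// CPU version with the explicit inner loop, used only to check the result.
std::vector<double> dctReference(const std::vector<double>& in) {
    const double pi = 3.14159265358979323846;
    int n = static_cast<int>(in.size());
    std::vector<double> out(n, 0.0);
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            out[k] += in[j] * std::cos(pi / n * (j + 0.5) * k);
    return out;
}
```

Calling `dctAtomicHost` on a small input and comparing against `dctReference` should agree up to floating-point rounding. The results need not match bit for bit, because the order in which the atomic additions are applied is not deterministic.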

Here is a comparison between the naive and optimized kernels:

[[File:Example.jpg]]

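Evidently, the new version shows some performance boost. However, atomicAdd calls that target the same address are serialized: each call must finish reading the old value, adding the passed value, and writing the result back before the next call on that address can proceed. Because all threads working on a row update the same element, this contention keeps the speedup below what might otherwise be expected.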
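One common way to reduce this kind of contention, which is not something the assignment code is stated to do, is to sum the terms of each block in shared memory first and issue a single atomicAdd per block. The sketch below illustrates that idea under the same assumptions as before (an unnormalized DCT-II, a device of compute capability 6.0 or higher); the names `dctBlockReduce`, `dctBlockReduceHost`, `dctCheck`, and the tile size `TILE` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

#define TILE 256  // threads per block; must be a power of two for the reduction

// Hypothetical kernel: each block handles TILE input elements of one row k,
// reduces their terms in shared memory, then does ONE atomicAdd per block.
__global__ void dctBlockReduce(const double* in, double* out, int n) {
    __shared__ double partial[TILE];
    int k = blockIdx.x;                       // output element (row)
    int j = blockIdx.y * TILE + threadIdx.x;  // input element (column)
    const double pi = 3.14159265358979323846;
    partial[threadIdx.x] =
        (j < n) ? in[j] * cos(pi / n * (j + 0.5) * k) : 0.0;
    __syncthreads();
    // Tree reduction: halve the number of active threads each step.
    for (int stride = TILE / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(&out[k], partial[0]);
}

// Hypothetical host helper: copies data, launches the kernel, copies back.
bool dctBlockReduceHost(const std::vector<double>& in, std::vector<double>& out) {
    int n = static_cast<int>(in.size());
    out.assign(n, 0.0);
    if (n == 0) return true;
    size_t bytes = n * sizeof(double);
    double *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, bytes);  // accumulators start at 0.0
    dim3 grid(n, (n + TILE - 1) / TILE);
    dctBlockReduce<<<grid, TILE>>>(dIn, dOut, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}

// CPU loop version, used only to check the GPU result.
std::vector<double> dctCheck(const std::vector<double>& in) {
    const double pi = 3.14159265358979323846;
    int n = static_cast<int>(in.size());
    std::vector<double> out(n, 0.0);
    for (int k = 0; k < n; ++k)
        for (int j = 0; j < n; ++j)
            out[k] += in[j] * std::cos(pi / n * (j + 0.5) * k);
    return out;
}
```

With this layout the number of atomic updates per output element drops from n to roughly n / TILE. Whether that translates into a measurable speedup depends on the input size and the device, so it would need to be profiled rather than assumed.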

Ssdhillon20

