Open main menu

CDOT Wiki β

Changes

GPU621/False Sharing

53 bytes added, 19:37, 4 December 2021
Analyzing False Sharing
=== Results ===
[[File:naive_implementation.png|600px500px|thumb|Execution time of naive implementation without any optimization levels (Od).]]
The algorithm calculates the correct answer, but the performance is absolutely terrible. The reason is that an int array is a contiguous block of memory with each integer taking up 4 bytes. Assuming a 64 byte cache line, our entire array only takes up half of the cache opening up the possibility for multiple threads to share the same cache line resulting in false sharing. Although there were cases where higher thread count produced better results, there were many cases that performed worse than a single thread. This is due to the scheduling of thread execution that is out of the programmer's hands. It is possible that the selected schedule managed to minimize the frequency of false sharing giving better performance. However, this is extremely unreliable, so we need a better solution to false sharing.
<br><br><br><br><br><br><br><br><br><br><br><br>
== Solutions to False Sharing ==
=== Padding ===
Assuming a 64 byte cache line, what if each element in the array was the only piece of data in the cache line for each thread? [[File:naive_vs_padded_arr.png|700px600px|thumb|<br>Array comparison between naïve and padded implementation. Colored blocks represent index handled by individual threads. Grey blocks represent empty space.]]Assuming a 64 byte cache line, what if each element in the array was the only piece of data in the cache line for each thread?
This observation was the basis for our first solution to false sharing. The idea was to pad out each element in the array with enough space to act as a boundary or separator for the cache line. If we can force each thread to only bring in their array index to their cache then each individual thread will have a distinct cache line, thus eliminating false sharing.
<br><br><br><br><br><br><br><br><br><br><br>
In the workshop, we accomplished this by making several modifications to the original code:
=== Drawbacks ===
While our the idea works, there were several issues that prevented it from being an ideal solution: [[File:padded_implementation.png|600px500px|thumb|Execution time of padded implementation without optimization (Od).]]
83
edits