==== Results ====
[[File:naive_implementation.png|600px|thumb|Execution time of naive implementation without any optimization levels (Od)]]
The algorithm calculates the correct answer, but the performance is terrible. The reason is that the sum array is a contiguous block of memory in which each element takes up only a few bytes. Assuming a 64 byte cache line, the entire array can fit within a single cache line, so multiple threads end up writing to the same line, resulting in false sharing. Although there were cases where a higher thread count produced better results, many cases performed worse than a single thread. This comes down to the scheduling of thread execution, which is out of the programmer's hands: a favorable schedule may happen to minimize the frequency of false sharing and give better performance. Because this is unreliable, we need a deliberate solution to false sharing.
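For reference, the naive version being measured looks roughly like the following. This is a reconstruction, not the exact workshop code: it uses POSIX threads rather than the workshop's setup, and <code>NUM_THREADS</code>, <code>n</code>, and the variable names are assumptions. The key detail is that all threads repeatedly write into adjacent elements of <code>sum</code>, which land on the same cache line.
<pre>
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static const long n = 1000000;   /* number of integration steps */
static double step;
static double sum[NUM_THREADS];  /* adjacent doubles: all share one cache line */

static void *worker(void *arg) {
    int id = *(int *)arg;
    /* midpoint rule for the integral of 4/(1+x^2) on [0,1], which equals pi */
    for (long i = id; i < n; i += NUM_THREADS) {
        double x = ((double)i + 0.5) * step;
        sum[id] += 1.0 / (1.0 + x * x);  /* every write invalidates the shared line */
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    step = 1.0 / (double)n;
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    double pi = 0.0;
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        pi += 4.0 * sum[i] * step;
    }
    printf("pi ~= %.6f\n", pi);  /* prints pi ~= 3.141593 */
    return 0;
}
</pre>
The answer is correct, but because the four accumulators sit side by side in memory, every update by one thread invalidates the cache line held by the others.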
== Solutions to False Sharing ==
=== Padding ===
Assuming a 64 byte cache line, what if each element in the array were the only piece of data in its cache line? [[File:naive_vs_padded_arr.png|700px|thumb|<br>Array comparison between naïve and padded implementation. Colored blocks represent the array index handled by each thread. Grey blocks represent empty space.]] This observation was the basis of our first solution to false sharing. The idea was to pad out each element in the array with enough empty space to fill the rest of its cache line. If each thread's array element sits in its own cache line, then no two threads write to the same line, eliminating false sharing. In the workshop, we accomplished this by making several modifications to the original code:

1. Define a padding size so that each row fills a 64 byte cache line (7 padding doubles plus 1 data double = 8 × 8 bytes = 64 bytes).
<pre>#define PAD 7</pre>
2. Change sum to a 2D array.
<pre>double sum[NUM_THREADS][PAD + 1] = { 0.0 };</pre>
3. Modify the loops to store data only in the first element of each row of the 2D array.
<pre>
for (int i = id; i < n; i = i + num_threads) {
    x = ((double)i + 0.5) * step;
    sum[id][0] += 1.0 / (1.0 + x * x);
}

for (int i = 0; i < actual_thread_count; i++) {
    pi += 4 * sum[i][0] * step;
}
</pre>

=== Drawbacks ===
While our idea works, there are several issues that prevent it from being an ideal solution.

'''- Tons of wasted memory.''' For every 8 bytes storing a thread's result, 56 bytes are empty padding. This wasted space grows with the number of threads.

'''- Each cache line contains only one piece of data, with the rest completely empty.''' This is what we wanted, but it is inefficient. The whole point of the cache is to improve performance by keeping relevant data close at hand to minimize cache misses. In practice, a processor will likely be juggling other threads with their own data that may or may not be related to this program. By padding out the cache line, we are forcing additional cache misses to occur.
'''- Must know the cache line size.''' The cache line size can vary between machines, making the code less portable.
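Putting the three modifications together, a complete padded version can be sketched as follows. As before, this is a reconstruction using POSIX threads rather than the exact workshop code, with <code>NUM_THREADS</code> and a 64 byte cache line as assumptions.
<pre>
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define PAD 7   /* 7 padding doubles + 1 data double = 64 bytes, one cache line */

static const long n = 1000000;
static double step;
static double sum[NUM_THREADS][PAD + 1];  /* each row occupies its own cache line */

static void *worker(void *arg) {
    int id = *(int *)arg;
    for (long i = id; i < n; i += NUM_THREADS) {
        double x = ((double)i + 0.5) * step;
        sum[id][0] += 1.0 / (1.0 + x * x);  /* only column 0 of each row holds data */
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    step = 1.0 / (double)n;
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    double pi = 0.0;
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        pi += 4.0 * sum[i][0] * step;
    }
    printf("pi ~= %.6f\n", pi);
    return 0;
}
</pre>
Note that the 64 byte figure is baked in at compile time. On Linux with glibc, the actual line size can be queried at run time with <code>sysconf(_SC_LEVEL1_DCACHE_LINESIZE)</code>, which partially mitigates the portability drawback, though the array would then need to be sized dynamically.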
=== Synchronization ===