Team False Sharing

== Introduction ==
Multicore processors are more prevalent now than ever, and multicore programming is essential to benefit from the power of the hardware, as it allows our code to run on different CPU cores. But it is very important to know and understand the underlying hardware to fully utilize it. One of the most important system resources is the cache, and most architectures have shared cache lines. This is why false sharing is a well-known problem in multicore/multithreaded processes.
'''What is False Sharing (aka cache line ping-ponging)?''' <br>
False Sharing is one of the sharing patterns that affect performance when multiple threads share data. It arises when at least two threads modify or use data that happen to be close enough in memory to end up in the same cache line. False sharing occurs when the threads constantly update their respective data in a way that makes the cache line migrate back and forth between the two threads' caches.
= Identifying False Sharing =
The example below demonstrates the problem: each thread counts odd numbers into its own element of a contiguous array, so several threads' accumulators end up on the same cache line.
<source lang="cpp">
#include <iostream>
#include <omp.h>
#define NUM_THREADS 4
#define DIM 10000

int main() {
    // One accumulator per thread, in a contiguous array: adjacent
    // elements share a cache line, so the concurrent updates below
    // make that line ping-pong between the threads' caches.
    double odds_local[NUM_THREADS];

    omp_set_num_threads(NUM_THREADS);
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        odds_local[tid] = 0.0;
        #pragma omp for
        for (int i = 0; i < DIM; ++i) {
            if (i % 2 != 0)
                odds_local[tid]++;   // repeated writes to neighbouring elements
        }
    }
    double time = omp_get_wtime() - start_time;
    std::cout << "Execution time: " << time << "s" << std::endl;
    return 0;
}
</source>
[[File:SpeedupFs.png|center|500px]]
According to Amdahl's law, the potential speedup of any application is given by S(n) = 1 / ((1 - P) + P/n). Assuming 95% of our application is parallelizable, Amdahl's law tells us there is a maximum potential speedup of 3.478 times on 4 threads. This is not the case according to our results: we reach a speedup of only 2.275 times the original speed. As you can tell from the graph, our code is not scalable, and these results are very underwhelming.
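For reference, the predicted figure comes from substituting P = 0.95 and n = 4 into the formula:

<math>S_4 = \frac{1}{(1 - 0.95) + 0.95/4} = \frac{1}{0.2875} \approx 3.478</math>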
= Eliminating False Sharing =
===Padding===
[[File:Speedup.png|850px|center]]
<source lang ="cpp">
#define CACHE_LINE_SIZE 64
</source>
[[File:Numpad0.png]][[File:Numpad7.png]][[File:Numpad15.png]]
 
Padding your data is one way to prevent false sharing. By adding padding to the data elements sitting in a contiguous array, you separate the elements from each other in memory, so fewer of them sit on the same cache line. The goal is to place each array element on its own cache line: then, when one thread modifies its element, cache coherence does not invalidate a line holding the other threads' data, and modifying that data is no longer a bottleneck.
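As a sketch of this idea (the struct name here is illustrative, and a 64-byte cache line is assumed, matching the CACHE_LINE_SIZE constant above), each per-thread accumulator from the earlier example can be forced onto its own cache line:

<source lang="cpp">
#include <iostream>
#include <omp.h>
#define CACHE_LINE_SIZE 64
#define NUM_THREADS 4
#define DIM 10000

// alignas rounds each element up to a full 64-byte cache line, so
// threads updating neighbouring elements no longer share a line.
struct alignas(CACHE_LINE_SIZE) PaddedCounter {
    double value;
};

int main() {
    PaddedCounter odds_local[NUM_THREADS];
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        odds_local[tid].value = 0.0;
        #pragma omp for
        for (int i = 0; i < DIM; ++i) {
            if (i % 2 != 0)
                odds_local[tid].value++;   // write stays on this thread's line
        }
    }
    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; ++t)
        total += odds_local[t].value;
    std::cout << "Odds found: " << total << std::endl;
    return 0;
}
</source>

An explicit <code>char padding[CACHE_LINE_SIZE - sizeof(double)]</code> member gives the same layout on compilers without C++11 <code>alignas</code>.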
===Thread Local Variables===
Wasting memory to put your data on different cache lines is not an ideal solution to the False Sharing problem, even though it works. There are two problems with it: first, you are wasting memory, and second, it is not portable, because you will not always know the L1 cache line size. Using variables local to each thread, instead of contiguous array locations, reduces the number of times a thread writes to a cache line that shares data with other threads. The benefit of this approach is that you no longer have multiple threads writing to the same cache line, invalidating the data and bottlenecking the process.
<source lang="cpp">
#include <iostream>
#include <omp.h>
#define NUM_THREADS 4
#define DIM 10000

int main() {
    int odds = 0;
    omp_set_num_threads(NUM_THREADS);
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        // Accumulate into a variable on each thread's own stack:
        // no shared cache line is written inside the hot loop.
        int local_odds = 0;
        #pragma omp for
        for (int i = 0; i < DIM; ++i) {
            if (i % 2 != 0)
                ++local_odds;
        }
        // Combine the per-thread results once, at the end.
        #pragma omp atomic
        odds += local_odds;
    }
    double time = omp_get_wtime() - start_time;
    std::cout << "Odds: " << odds << " in " << time << "s" << std::endl;
    return 0;
}
</source>
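OpenMP can also manage the per-thread copies for you: a standard reduction clause such as <code>#pragma omp parallel for reduction(+:odds)</code> gives each thread its own private copy of <code>odds</code> and combines the copies after the loop, avoiding false sharing in the same way without the explicit local variable.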
[[File:SpeedupTl.png|800px|center|frame]]
Here we see that the speedup increases linearly with the number of threads used. The speedup using 4 threads is 3.49 times according to our tests, which is much closer to the speedup predicted by Amdahl's law (3.478 times).
= Intel VTune Amplifier =