96
edits
Changes
→Identifying False Sharing
== Introduction ==
Multicore processors are more prevalent now more than ever, and Multicore programming is essential to benefit from the power of the hardware as it allows to run our code different CPU cores. But it is very important to know and understand the underlying hardware to fully utilize it. One of the most important system resources is the cache. And most architectures have shared cache lines. And this is why false sharing is a well know know problem in multicore/multithreaded processes.
'''What is False Sharing (aka cache line ping-ponging)?''' <br>
False Sharing is one of the sharing pattern that affect performance when multiple threads share data. It arises when at least two threads modify or use data that happens to be close enough in memory that they end up in the same cache line. False sharing occurs when they constantly update their respective data in a way that the cache line migrates back and forth between two threads' caches.
In this article, we will look at some examples that demonstrate false sharing, tools to analyze false sharing, and the two coding techniques we can implement to eliminate false sharing.
=Cache Coherence=
In Symmetric Multiprocessor (SMP)systems , each processor has a local cache. The local cache is a smaller, faster memory which stores copies of data from frequently used main memory locations. Cache lines are closer to the CPU than the main memory and are intended to make memory access more efficient. In a shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of shared data: one copy in the main memory and one in the local cache of each processor that requested it. When one of the copies of data is changed, the other copies must reflect that change. Cache coherence is the discipline which ensures that the changes in the values of shared operands(data) are propagated throughout the system in a timely fashion.
To ensure data consistency across multiple caches, multiprocessor-capable Intel® processors follow the MESI (Modified/Exclusive/Shared/Invalid) protocol. On first load of a cache line, the processor will mark the cache line as ‘Exclusive’ access. As long as the cache line is marked exclusive, subsequent loads are free to use the existing data in cache. If the processor sees the same cache line loaded by another processor on the bus, it marks the cache line with ‘Shared’ access. If the processor stores a cache line marked as ‘S’, the cache line is marked as ‘Modified’ and all other processors are sent an ‘Invalid’ cache line message. If the processor sees the same cache line which is now marked ‘M’ being accessed by another processor, the processor stores the cache line back to memory and marks its cache line as ‘Shared’. The other processor that is accessing the same cache line incurs a cache miss.
[[File:Coherent.gif|500px|leftcenter]]
<br style="clear:both" />
False sharing is a well-know performance issue on SMP systems, where each processor has a local cache. it occurs when treads on different processors modify varibles that reside on th the same cache line like so.
<br style="clear:both" />
[[File:CPUCacheline.png|center|frame]]
<br style="clear:both" />
The frequent coordination required between processors when cache lines are marked ‘Invalid’ requires cache lines to be written to memory and subsequently loaded. False sharing increases this coordination and can significantly degrade application performance.
<source lang="cpp">
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <chrono>
#include <algorithm>
#include <omp.h>
#include "timer.h"define NUM_THREADS 4#define NUM_THREADS 8 DIM 10000using namespace std::chrono; int main(int argc, const char ** argv) { int* matrix = new int[DIM*DIM]; int odds = 0; // Initialize matrix to random Values srand(200) {; struct sfor (int i = 0; i < DIM; i++) { float value for(int j = 0; j < DIM;++j){ }Array matrix[4i*DIM + j]= rand()%50; } } int numThreadsUsed* odds_local = new int[NUM_THREADS];//odd numbers in matrix local to thread const for(int SomeBigNumber i = 1000000000; i < NUM_THREADS;i++){ odds_local[i]=0; } int threads_used;
omp_set_num_threads(NUM_THREADS);
}
}
#pragma omp critical
odds += odds_local[tid];
}
double time = omp_get_wtime() - start_time;
std::cout<<"Execution Time: "<<time<<std::endl; std::cout<<"Threads Used: "<<numThreadsUsedthreads_used<<std::endl; std::cout<<"Odds: "<<odds<<std::endl;
return 0;
}
</source>
[[File:ExecVsThreadsFalseSpeedupFs.png|center|500px]]
=Eliminating False Sharing=
===Padding===
<source lang ="cpp">
#define CACHE_LINE_SIZE 64
template<typename T>
struct cache_line_storage {
[[ align(CACHE_LINE_SIZE) ]] T data;
char pad[ CACHE_LINE_SIZE > sizeof(T)
? CACHE_LINE_SIZE - sizeof(T)
: 1 ];
};
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <chrono>
#include <algorithm>
#include <omp.h>
#include "timerPadding.h"#define NUMPAD 7NUM_THREADS 9#define NUM_THREADS 8 DIM 1000using namespace std::chrono; int main(int argc, const char ** argv) { int odds = 0; int* matrix = new int[DIM*DIM]; // Initialize matrix to random Values srand(200) {; struct sfor (int i = 0; i < DIM; i++) { float value for(int j = 0; j < DIM;++j){ int pad matrix[NUMPADi*DIM + j]= rand()%50; } }Array cache_line_storage<int> odds_local[4NUM_THREADS]; Timer stopwatchfor(int i = 0;i<NUM_THREADS;i++){//initilize local odds_local[i].data=0; int numThreadsUsed;} const int SomeBigNumber = 100000000threads_used;
omp_set_num_threads(NUM_THREADS);
double start_time = omp_get_wtime();
#pragma omp parallel { int tid = omp_get_thread_num();#pragma omp for for(int i = 0; i < 4DIM;++i){ for(int j = 0; j < DIM; ++j){ if(i ==0 && j==0){numThreadsUsed threads_used = omp_get_num_threads();} for if(int matrix[i*DIM + j ] % 2 != 0;j < SomeBigNumber;j) ++){ Array[i].value = Arrayodds_local[itid].value + (float)rand()data;
}
}
#pragma omp critical
odds += odds_local[tid].data;
}
double time = omp_get_wtime() - start_time;
std::cout<<"Execution Time: "<<time<<std::endl; std::cout<<"Threads Used: "<<numThreadsUsedthreads_used<<std::endl; std::cout<<"Odds: "<<odds<<std::endl;
return 0;
}
</source>
[[File:Numpad0.png]][[File:Numpad7.png]][[File:Numpad15.png]]
Padding your data is one way to prevent false sharing. What this does is by adding padding to the data elements sitting in a contiguous array you separate each element from each other in memory. The goal of this method is to have less data elements sitting the same cache line so when you write to memory the invalidation of a cache line doesn't prevent you from modifying data sitting on the same cache line because of cache coherence. You're goal here is to put each array element on its own cache line so if one element is modified, cache coherence will not bottleneck modifying data because each element in the array is on its own cache line.
===Thread Local Variables===
Wasting memory to put your data on different cache lines is not ideal solution to the False Sharing problem even though it works. There are 2 problems with this solution: 1 you're wasting memory of course and 2 this solution isn't scalable because you aren't always going to know the L1 cache line size. Using variables local to each thread, instead of contiguous array locations reduces the number of times that a thread will write to a cache line that shares data with threads. The benefit to this approach is that you do not have multiple threads writing to the same cache line, invalidating the data and bottlenecking the processes.<source lang ="cpp">
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <chrono>
#include <algorithm>
#include <omp.h>
#define NUM_THREADS 8
#define DIM 1000using namespace std::chrono; int main(int argc, const char ** argv) { int* matrix = new int[DIM*DIM]; int odds = 0; // Initialize matrix to random Values srand(200) {; struct sfor (int i = 0; i < DIM; i++) { float value for(int j = 0; j < DIM;++j){ }Array matrix[4i*DIM + j]= rand()%50; } Timer stopwatch;} int numThreadsUsed; const int SomeBigNumber = 100000000threads_used;
omp_set_num_threads(NUM_THREADS);
}
}
#pragma omp critical
double time = omp_get_wtime() - start_time;
std::cout<<"Execution Time: "<<time<<std::endl; std::cout<<"Threads Used: "<<numThreadsUsedthreads_used<<std::endl; std::cout<<"Odds: "<<odds<<std::endl;
return 0;
}
</source>