Changes

← Older edit

GPU621/Analyzing False Sharing

1,144 bytes added, 15:14, 4 December 2022

→‎Use local variables for each thread

= '''Preface''' =

In multicore concurrent programming, if the mutual exclusion is a "performance killer", then false sharing is a worthy "performance assassin". The difference between "killer" and "assassin" is that a "killer" can be easily detected, and we always have many ways to deal with a "killer" when we fight it, such as fighting, running away, detouring, and begging for mercy, but an "assassin" is completely different. Assassins are good at hiding in the shadows and can ~~strike~~ give a fatal blow at any time, ~~which makes~~ making people ~~defenceless~~defensive.

In our concurrent development, when we encounter situations that affect concurrency performance, we always have many ways to find and improve the concurrency performance of the program. But false sharing leaves no trace in our code that it is in the "dark" and is seriously slowing down concurrency performance, making it hard to find the root cause of the problem, let alone improve concurrency performance.

== Cache Coherence ==

[[File:cacheCoherence.jpg|center|800px]]

When the CPU accesses the data at that time data goes into the cache and this memory block is known as the cache line. But it is not possible that modify the original data when changes in the cache line. Here cache coherence helps that stored in multiple local caches. Cache coherence connects all cache. == MESI ==First, we know that the cache line has four different states. Like, exclusive, shared, modified, and invalid. So MESI is an acronym for all four states. To understand it well we throw light on this example. [[File:MESI1.jpg|400px]] This picture shows that the core A read value a from the main memory. So this core also fetches some nearby values from the main memory and stores these values in the cache line and makes it exclusive because only core A operates this cache line. [[File:MESI2.jpg|400px]] Later, core B also reads the value of b from memory. Now, a and b both read the same cache line and close so, both cores are shared. [[File:MESI3.jpg|400px]] Moreover, core A change the value of a so, this change is stored in the buffer and cache line tagged as modified. In the end, it communicates with core B and is known as an invalid cache line.

= '''What is False Sharing?''' =

The foremost reason why false sharing happens can be found in how operating systems read and write data. When a program reads or writes data from the hard drive or other sources at that time this data loads into a temporary cache. Because it is a very fast way to access it. This knows as a cache line.

= '''~~Solutions Of~~ False ~~Sharing~~sharing example''' =~~== Using local variable =~~=

#include <thread>

#include <vector>

for (int i = 0; i < numbers.size(); i++) {

if (i % sizeOfSum == id) {

sum[id] += Ii;

}

In this program code, we can see that at the beginning of the main function we declare a vector numbers and initialize it from 0 to ~~100~~10000. We can also see in the main function that the code is divided into two main blocks. The first block is the Thread block, which is executed concurrently using multiple threads. While the second block is the Serial block, it is executed concurrently using normal serial logic.

The main purpose of the sumUp function is to calculate the sum of odd elements or even elements based on the data in the first vector argument in the argument list. Also, the sum will be recorded in the corresponding position of the second vector argument using the int argument as the index.

To our surprise, the serial block took much less time, no matter how many times I ran it. This turned our existing knowledge upside down, but don't worry, it's because you don't understand False Sharing yet.

= '''Solutions Of False Sharing''' === Use local variables for each thread == As we said above, the smallest unit of CPU operation on the cache is the size of a cache line, which is 64 bytes. As you can see in our program code, the sum is a vector that stores two data with long ~~data types~~type. The two long data are in fact located on the same cache line. When two threads read and write sum[0] and sum[1] separately, it looks like they are not using the same variable, but in fact, they are affecting each other.For example, if a thread updates sum[0] in CPU0, this causes the cache line of sum[1] in CPU1 to be invalidated. This causes the program to take longer to run, as it needs to re-read the data. We try to change the sizeOfSum to 8, which means that changing the size of the vector sum to 8 and the if condition in the sumUp function to i%8==id will give the program more data to process.

~~We try to change the sizeOfSum to 8, which means that changing the size of the vector sum to 8 and the if condition in the sumUp function to i%8==id will give the program more data to process~~[[File:sizeOfSum8. jpg|400px]] Nevertheless, the result still does not bring the time between the Serial block and Thread block closer and the Serial block still takes less time.

What would it take to show the advantages of multicore? As you can see in our program, multiple threads are not operating on the same variable, but only a single thread is reading and writing a lot of variables in the same cache line, so you can introduce a local variable.

Also, we add the following code to our main function:

cout << endl << "----~~new~~ Local variable block---" << endl; // ~~new~~ Local variable block

{

vector<long> sum(sizeOfSum, 0);

}

auto end = chrono::steady_clock::now();

cout << "~~New~~ Local variable block consumption: " << chrono::duration_cast<chrono::microseconds>(end - start).count() << "ms" << endl;

}

When we ran the program again, we came to this conclusion:

We can see that the ~~new~~ Local variable block is already much faster than the Thread block, and even comparable to the Serial block is almost the same. But overall the Serial block is still the least time-consuming because the ~~new~~ Local variable block still needs to incur extra overhead for thread creation and scheduling.

We could try increasing sizeOfNumbers to 1000000 as well, which would allow the program to process more data, thus compensating for the extra overhead of thread creation and scheduling.

~~[[File:newBlockOutput1000000.jpg|400px]] ~~ [[File:newBlockOutput1000000(1new).jpg|400px]]

Now we can already see the advantage of multi-threading. Even when the vector numbers reach the size of 1000000, the Thread block even runs faster than the Serial block.

So the problem of false sharing sometimes cannot be generalized, and the occurrence of false sharing does not necessarily make performance lower. But no matter how the ~~new~~ Local variable block is the optimal solution.

== Byte alignment ==

}

In this program code, we can easily see that the Number struct has two member variables (num1 and num2). They are for loop in two different threads to execute the increment operator 1000000 times. This time we use Linux with multiple cores to compile (note that here we are not using any optimization).

g++ -std=c++11 -pthread false_sharing.cpp

Have you learned? The so-called false sharing can make performance worse in fact this cannot be generalized. We can see that with O2 on, the CPU and compiler are able to optimize the serial version of the code very well, while the multi-threaded code is not well optimized, which leads to higher performance of the serial version of the code. Even using a byte-aligned cache line size does not help. In the first example, we increased the number of vector numbers, which caused the serial code to slow down significantly. The differences in the two examples are related to the specific execution logic.

So don't treat ~~pseudo-~~false sharing as a beast, and don't treat 64-byte alignment as a panacea. Everything has to be actually tested to know. Don't throw out the conclusion under O0 and then take it as a guideline, that makes no sense ......

= '''Conclusion''' =

In conclusion, the term false sharing refers to the use of shared memory by multiple threads at the same time. A multiprocessor environment decreases performance. Shared data is modified by multiple processors simultaneously. This is a very common occurrence in the loop. In many cases, false sharing can be overlooked, resulting in a program's inability to scale. In parallel programming, where performance is fundamental, keeping an eye out for problems and recognizing them quickly is essential.

Finally, we discussed some solutions to minimize false sharing by making as much private data as possible to reduce how often shared data needs to be updated. Making use of the compiler's optimization features to reduce memory loads and stores. ~~By increasing padding~~Also, we ~~will be able to~~ can improve performanceby byte alignment.

= References =

https://wiki.cdot.senecacollege.ca/wiki/GPU621/False_Sharing

https://www.baeldung.com/java-false-sharing-contended#:~:text=In%20the%20MESI%20protocol%2C%20each,the%20acronym%20of%20these%20states.

https://learn.microsoft.com/en-us/archive/msdn-magazine/2008/october/net-matters-false-sharing

http://www.nic.uoregon.edu/~khuck/ts/acumem-report/manual_html/multithreading_problems.html

Ryan Leong

118

edits

CDOT Wiki β

Changes

GPU621/Analyzing False Sharing

CDOT Wiki ^β