Revision as of 13:24, 26 November 2021

Analyzing False Sharing

Group Members

- Kevin Chou

Introduction

False sharing is a well-known problem that can occur when taking advantage of parallelization in modern multi-processor computers with shared memory. Although the results will be correct, frequent false sharing results in significant performance loss and reduction in scalability, effectively nullifying the benefits of parallel computing.

The Cache

What is a Cache?

Before we can talk about false sharing, we need a brief overview of a cache. Like other storage mediums, a cache is a place used to store data and instructions that the CPU may use.

So why do we have so many types?

In an ideal world, the CPU would have lightning-fast access to all data on your machine at any given time to maximize the power and speed of modern-day CPUs. In reality, there are limitations and trade-offs when balancing speed, storage size, location, and cost. To compromise, a computer utilizes many types of memory. To name the important ones in the hierarchy, we have secondary memory like hard drives and SSDs which can permanently hold vast quantities of data but are slow. Next is Dynamic Random Access Memory(DRAM) which is much smaller, faster, and volatile. Last, we have the cache or Static Random Access Memory(SRAM), which is extremely fast, but even smaller and more expensive.

If the CPU solely relied on getting data from secondary memory, the slow access speed of the storage device would become a massive bottleneck in computation time. There would be huge gaps of time where the CPU sits around doing nothing waiting for data. To the end user, your machine would appear to be extremely sluggish and non-responsive.

At the same time, only small amounts piece of data is needed at a given moment. Even if you brought everything in from memory, most of it will be unused. Utilizing the hierarchy of memory storage, the most relevant data can be siphoned from secondary memory ahead of time and stored in the cache and RAM. When searching for data, the CPU can grab it from the cache, a cache hit. If it is not there, a cache miss, then it looks moves down the hierarchy until it finds it. Minimizing the number of cache misses ensures the CPU has a steady flow of data it can quickly retrieve and compute.

Cache Coherence and Cache Line

Each processor has their own local cache. When data is needed, a fixed block of memory is transferred to the cache; this block is known as a cache line.

However, if they each modify their own copy, how do we know which one correct? We can’t store all the different versions or pick a random one to be the correct version, this would create havoc in our computer systems.

The answer is we need cache coherence. Cache coherence is defined as the uniformity of shared resource data that ends up stored in multiple local caches. In other words, we must keep all the local caches synchronized. The challenge of doing so is known as the Cache Coherence Problem. To solve this problem, multi-processor systems rely on cache coherence protocols to manage and maintain the cache.

False Sharing

What is False Sharing?

Now that we understand the basics of a cache, how does this relate to the concept of false sharing?

False sharing occurs when multiple processors modify data that resides on the same cache line. When this data is written back to memory, the shared cache lines are marked as invalid or dirty. Processors must fetch an updated copy introducing significant delays due to memory access speeds. As you increase the number of processors, this quickly spirals out of control as there is an increasing chance the cache line is invalid.

The key thing to note is that you do not need to modify the same piece of data. If the modified data happens to belong to the same cache line, the cache will be invalidated, forcing a memory update to maintain cache coherency.

Example

In workshop 2, we briefly encountered false sharing. However, we did not get a formal explanation in lecture, so this section will serve to provide more context to the problem.

We were asked to multi-thread a serial version of a simple algorithm that calculated PI by integrating 1/(1 + x^2). The serial version utilized a scalar sum variable to accumulate the calculations. For our naïve attempt, we identified a potential issue with using a scalar sum in a multi-threaded program. We have no direct control on the order a thread finishes their assigned work and leaving it up to chance resulted in varying resulted in a race condition. This is not good. To get around this we came up with the idea to change sum to an array. Now, each thread can index this array by their own thread id and store their calculations.

#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <chrono>
#include <omp.h>

#define NUM_THREADS 8 // number of threads to request

using namespace std::chrono;

// report system time
void reportTime(const char* msg, steady_clock::duration span)
{
    auto ms = duration_cast<milliseconds>(span);
    std::cout << msg << " - took - " <<
        ms.count() << " milliseconds" << std::endl;
}

int main(int argc, char** argv)
{
    if (argc != 2)
    {
        std::cerr << argv[0] << ": invalid number of arguments\n";
        std::cerr << "Usage: " << argv[0] << "  no_of_slices\n";
        return 1;
    }

    int n = std::atoi(argv[1]);
    steady_clock::time_point ts, te;

    // calculate pi by integrating the area under 1/(1 + x^2) in n steps
    ts = steady_clock::now();

    int actual_thread_count;
    double pi = 0.0f;
    double sum[NUM_THREADS] = { 0.0f };
    double step = 1.0 / (double)n;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int id, num_threads;
        double x;

        id = omp_get_thread_num();
        num_threads = omp_get_num_threads();

        // get master thread to return how many threads were actually created
        if (id == 0)
        {
            actual_thread_count = num_threads;
        }

        // each thread is responsible for calculating the area of a specific set of sections underneath the curve
        for (int i = id; i < n; i = i + num_threads)
        {
            x = ((double)i + 0.5f) * step;
            sum[id] += 1.0f / (1.0f + x * x);
        }
    }

    // sum up each calculation to get approximation of pi
    for (int i = 0; i < actual_thread_count; i++)
    {
        pi += 4 * sum[i] * step;
    }

    te = steady_clock::now();

    std::cout << "n = " << n <<
        std::fixed << std::setprecision(15) <<
        "\n pi(exact) = " << 3.141592653589793 <<
        "\n pi(calcd) = " << pi << std::endl;
    reportTime("Integration", te - ts);
}

Results

Execution time of naive implementation without any optimization levels (Od)

The algorithm calculates the correct answer, but the performance is absolutely terrible. The reason is that an array is a contiguous block of memory with each integer taking up 4 bytes. Assuming a 64 byte cache line, our entire array only takes up half of the cache opening up the possibility for multiple threads to share the same cache line resulting in false sharing. Although there were cases where higher thread count produced better results, there were many cases that performed worse than a single thread. This is due to the scheduling of thread execution that is out of the programmer's hands. It is possible that the selected schedule managed to minimize the frequency of false sharing giving better performance. However, this is extremely unreliable, so we need a better solution to false sharing.

@@ Line 7: / Line 7: @@
 False sharing is a well-known problem that can occur when taking advantage of parallelization in modern multi-processor computers with shared memory. Although the results will be correct, frequent false sharing results in significant performance loss and reduction in scalability, effectively nullifying the benefits of parallel computing.
 == The Cache ==
@@ Line 23: / Line 24: @@
 At the same time, only small amounts piece of data is needed at a given moment. Even if you brought everything in from memory, most of it will be unused. Utilizing the hierarchy of memory storage, the most relevant data can be siphoned from secondary memory ahead of time and stored in the cache and RAM. When searching for data, the CPU can grab it from the cache, a cache hit. If it is not there, a cache miss, then it looks moves down the hierarchy until it finds it. Minimizing the number of cache misses ensures the CPU has a steady flow of data it can quickly retrieve and compute.
 === Cache Coherence and Cache Line ===
-[[File:cache_coherence.png]]
+[[File:cache_coherence.png|right]]
 Each processor has their own local cache. When data is needed, a fixed block of memory is transferred to the cache; this block is known as a cache line.
@@ Line 33: / Line 35: @@
 The answer is we need cache coherence. '''Cache coherence''' is defined as the '''uniformity of shared resource data that ends up stored in multiple local caches'''. In other words, we must keep all the local caches synchronized. The challenge of doing so is known as the Cache Coherence Problem. To solve this problem, multi-processor systems rely on cache coherence protocols to manage and maintain the cache.
 == False Sharing ==
@@ Line 43: / Line 54: @@
 The key thing to note is that you do not need to modify the same piece of data. If the modified data happens to belong to the same cache line, the cache will be invalidated, forcing a memory update to maintain cache coherency.
 === Example ===
@@ Line 127: / Line 139: @@
 }
 </pre>
 ==== Results ====
 [[File:naive_implementation.png|600px|thumb|Execution time of naive implementation without any optimization levels (Od)]]
-The algorithm calculates the correct answer, but the performance is absolutely terrible due to false sharing. Although there were cases where higher thread count produced better results, there were many cases that performed worse than a single thread. This is due to the scheduling of thread execution that is out of the programmer's hands. It is possible that the selected schedule managed to minimize the frequency of false sharing giving better performance. However, this is extremely unreliable, so we need a better solution to false sharing.
+The algorithm calculates the correct answer, but the performance is absolutely terrible. The reason is that an array is a contiguous block of memory with each integer taking up 4 bytes. Assuming a 64 byte cache line, our entire array only takes up half of the cache opening up the possibility for multiple threads to share the same cache line resulting in false sharing. Although there were cases where higher thread count produced better results, there were many cases that performed worse than a single thread. This is due to the scheduling of thread execution that is out of the programmer's hands. It is possible that the selected schedule managed to minimize the frequency of false sharing giving better performance. However, this is extremely unreliable, so we need a better solution to false sharing.
 == Solutions to False Sharing ==
 === Padding ===
+Since the data is sharing
 === Synchronization ===

Difference between revisions of "GPU621/False Sharing"

Revision as of 13:24, 26 November 2021

Contents

Analyzing False Sharing

Group Members

Introduction

The Cache

What is a Cache?

Cache Coherence and Cache Line

False Sharing

What is False Sharing?

Example

Results

Solutions to False Sharing

Padding

Synchronization

Other Alternatives

Conclusion

References

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

get involved with CDOT

courses

course projects

links

Tools