C++11 STL Comparison to TBB - Case Studies
C++ has a standard template library (STL) that has over 100,000 algorithms for different jobs. Within these algorithms there are ones that provide the capability of parallelising work. TBB is the Threading Building Block template library that also allows for parallel programming. TBB breaks down work into tasks that can run in parallel. For our project, we will look into the completion of different use cases using both the TBB library as well as the STL library, analyzing differences in the code, as well as run times for these algorithms.
Group Members
What we aim to display
Throughout this wiki, differences between C++11’s and C++17’s standard template library and Intel’s Thread Building Blocks template library look to be addressed. This will include C++11 multitasking capabilities vs Threaded Building Block parallelism and how they are setup within the IDE and ran. We will also investigate the more recent C++17 and its support for true parallelism, and how that compares to Intel’s Threaded Building Block library.
C++ 11 Standard Template Library
C++11 was released in September of 2011, approved by the International Organization for Standardization[1]. C++ has a long history dating back to 1997, and for that reason some concepts may be outdated which is why we will be talking about C++11 and C++17. STL is the standard template library that allows for users to leverage data structures without the hassle of implementation. A simple function call or object instantiation is all that is needed. Although STL simplifies a lot, it still requires a working knowledge of Template Classes to work with. STL provides three main components: Algorithms, Iterators, and containers.
Containers
STL includes containers such as unordered_map, and forward_list, a singly-linked list, vectors, and queues. The list of containers introduced is expansive and was done for performance reasons.
Iterators
A way to access data stored within the containers above. It is almost like a pointer to a large container of items and helps iterate through the container. Iterators also allow for dereferencing the iterator to access the data within the pointer.
Algorithms
Operations to be performed on a range of elements using Containers and Iterators. The algorithms act directly on the iterator and has nothing to do with the structure of the container. This means you will only be working with the elements and cannot change the size or structure of the container. Algorithms such as remove_if, reverse, and fill are common algorithms.
Multitasking in C++11
C++ 11 introduced the ability to code using multitasking, a form of parallelism on shared memory instead of serial. This is known as “virtual parallelism” as it is not true parallelism (see figure 1). Every task created through “virtual parallelism” or multitasking in the Standard Template Library runs on one available core. If multiple tasks are created, they are split into minor tasks, and then split into an order of when they should be running. This means that you can have two tasks in progress at the same time, but they are never working at the same time.
What is provided?
Threads
Each thread represents a single thread of execution. If you plan on running two tasks through multi tasking, you will need two threads created. Once a new thread object is created the thread will execute. Note that a thread object being created with no passed arguments in the constructor will lead to a thread object being created, but no thread.
Threads will accept Function Pointers, and if the function requires parameters, then the parameters required, function objects, and even lambda functions.
To identify each thread object there is ``` std::thread::get_id() ``` which will provide the ide of the thread object, and ```std::this_thread::get_id()``` to get the id of the current thread in use.
All Threads created must be joined using ```std::thread::join(). If this is not in place, a run time error will occur as the main thread will complete its executing leaving the other threads running.
//Source: https://en.cppreference.com/w/cpp/thread/thread/thread #include <thread> #include <iostream> #include <utility> #include <thread> #include <chrono> void f1(int n) { for (int i = 0; i < 5; ++i) { std::cout << "Thread 1 executing\n"; ++n; } } void f2(int& n) { for (int i = 0; i < 5; ++i) { std::cout << "Thread 2 executing\n"; ++n; } } class foo { public: void bar() { for (int i = 0; i < 5; ++i) { std::cout << "Thread 3 executing\n"; ++n; } } int n = 0; }; class baz { public: void operator()() { for (int i = 0; i < 5; ++i) { std::cout << "Thread 4 executing\n"; ++n; } } int n = 0; }; int main() { int n = 0; foo f; baz b; std::thread t1; // t1 is not a thread std::thread t2(f1, n + 1); // pass by value std::thread t3(f2, std::ref(n)); // pass by reference. Must use std::ref() std::thread t4(std::move(t3)); // t4 is now running f2(). t3 is no longer a thread std::thread t5(&foo::bar, &f); // t5 runs foo::bar() on object f std::thread t6(b); // t6 runs baz::operator() on object b. Function Object t2.join(); t4.join(); t5.join(); t6.join(); std::cout << "Final value of n is " << n << '\n'; std::cout << "Final value of foo::n is " << f.n << '\n'; }
Result:
Thread 1 executing Thread 1 executing Thread 2 executing Thread 3 executing Thread 3 executing Thread 3 executing Thread 3 executing Thread 3 executing Thread 1 executing Thread 1 executing Thread 1 executing Thread 4 executing Thread 4 executing Thread 4 executing Thread 4 executing Thread 4 executing Thread 2 executing Thread 2 executing Thread 2 executing Thread 2 executing Final value of n is 5 Final value of foo::n is 5
Race Conditions and Mutual Exclusion
Race conditions occur when two or more threads perform operations on the same memory location and modify the data at said memory location, causing unexpected results. To fix this issue, lock mechanisms must be introduced.
Mutex’s are the lock mechanism mentioned above. The purpose of a mutex is to make an area of memory accessible only to the thread that has accessed it. Once a thread is done with its changes to the memory, the thread will unlock it, move on, and allow the next thread to make its changes.
Mutex’s in C++ are introduced with the <Mutex> header file from the std::mutex class. A mutex has two main functions, std::mutex::lock() and std::mutex::unlock().
Result:
Money in Wallet is: 5000
If the locks are removed the output will change every run, but will look similar to:
results
Atomicity
The Atomic operations library throws out the concept of using locks and allows for lockless concurrent programming. The atomic class handles everything for you when it comes to making sure the memory space does not encounter bugs when running between multiple threads.
Intel Threading Building Blocks
Overview of TBB
Intel TBB is a library which provides features for parallel programming.It is used for task-based parallelism and requires no special compiler support. TBB has features such as Parallel Algorithms, memory allocation, threads and synchronization. Along side these features are similar features that were talked about in C++11. Naturally, TBB supports features such as containers and iterators, but with additional features to support the parallel nature of the library. Finally, TBB allows for generic programming which removes type requirements.
Containers
Containers within TBB allow for multiple threads to alter the same container at once. These containers are known as concurrent containers. Examples of these containers are:
- concurrent_unordered_map
- concurrent_hash_map
- concurrent_queue or concurrent_vector
Iterators
Parallel algorithms within Intel TBB are generic, This allows for the use of iterators and STL algorithms within parallel algorithms.
Algorithms
Algorithms within the TBB library are created for a similar reason that the algorithms in the Standard Template Library are, to allow for easier use and improve on performance. Algorithms within the library include:
- parallel_for
- parallel_reduce
- parallel_scan
Case Studies
Comparison of TBB parallel sort, STL sort
#include <stddef.h> #include <stdio.h> #include <iostream> #include <algorithm> #include <chrono> #include <random> #include <ratio> #include <vector> #include <execution> #include <tbb/tbb.h> using std::chrono::duration; using std::chrono::duration_cast; using std::chrono::high_resolution_clock; using std::milli; using std::random_device; using std::sort; using std::transform; using std::vector; const size_t testSize = 10'000'000; const int iterationCount = 5; void print_results(const char* const tag, const vector<double>& sorted, high_resolution_clock::time_point startTime, high_resolution_clock::time_point endTime) { printf("%s: Lowest: %g Highest: %g Time: %fms\n", tag, sorted.front(), sorted.back(), duration_cast<duration<double, milli>>(endTime - startTime).count()); } int main() { random_device rd; // generate some random doubles: printf("Testing with %zu doubles...\n", testSize); vector<double> doubles(testSize); for (auto& d : doubles) { d = static_cast<double>(rd()); } //serial stl sort for (int i = 0; i < iterationCount; ++i) { vector<double> sorted(doubles); const auto startTime = high_resolution_clock::now(); sort(sorted.begin(), sorted.end()); const auto endTime = high_resolution_clock::now(); print_results("Serial STL", sorted, startTime, endTime); } //parallel stl sort for (int i = 0; i < iterationCount; ++i) { vector<double> sorted(doubles); const auto startTime = high_resolution_clock::now(); // same sort call as above, but with par_unseq: sort(std::execution::par_unseq, sorted.begin(), sorted.end()); const auto endTime = high_resolution_clock::now(); // in our output, note that these are the parallel results: print_results("Parallel STL", sorted, startTime, endTime); } //parallel TBB sort for (int i = 0; i < iterationCount; ++i) { vector<double> sorted(doubles); const auto startTime = high_resolution_clock::now(); // same sort call as above, but with par_unseq: sort(std::execution::par_unseq, sorted.begin(), sorted.end()); tbb::parallel_sort(sorted); const auto endTime = high_resolution_clock::now(); // in our output, note that these are the parallel results: print_results("Parallel TBB", sorted, startTime, endTime); } }
Results on an intel i7-3770k, average of 5 runs: