Parallelization of the C++ Standard Library: Intel Parallel STL, MCSTL, and C++17
Introduction: The Standard Template Library
The Standard Template Library (STL) is a C++ library that provides a set of generic classes and functions, such as vectors and iterators. The STL is divided into four parts: algorithms, containers, functors, and iterators. For this project, we will focus on the algorithms library.
The algorithms library provides many functions that operate on ranges of elements, using iterators or pointers to navigate through them. Some of these algorithms are:
- all_of - Tests if a condition is true for all the elements in the range
- for_each - Applies a function to all elements in the range
- find - Finds a value in the range
- sort - Sorts the elements in the range
These algorithms provide a functional way to work on collections and are present in several programming languages. They replace what would usually be a for/while loop, shifting the attention from "how to loop" to "what to do".
```cpp
#include <iostream>
#include <algorithm>
#include <vector>

using namespace std;

int main() {
    vector<int> myVector = { 10, 6, 7, 8, 5, 4, 1, 2, 3, 9 };

    // Sorting the elements in descending order
    sort(myVector.begin(), myVector.end(), [](int a, int b) { return b < a; });

    cout << "== FOR EACH ==" << endl;
    for_each(myVector.begin(), myVector.end(), [](int i) {
        cout << "Element: " << i << endl;
    });
    cout << endl;

    cout << "== COUNTING NUMBERS LARGER THAN 6 ==" << endl;
    long nGt6 = count_if(myVector.begin(), myVector.end(), [](int i) { return i > 6; });
    cout << "There are " << nGt6 << " numbers larger than 6 in the vector" << endl << endl;

    cout << "== FINDING AN ELEMENT ==" << endl;
    vector<int>::iterator elFound = find_if(myVector.begin(), myVector.end(), [](int i) {
        return i > 4 && i % 6 == 0;
    });
    cout << "The element " << *elFound << " is the first that satisfies i > 4 && i % 6 == 0" << endl << endl;

    return 0;
}
```
Output:
```
== FOR EACH ==
Element: 10
Element: 9
Element: 8
Element: 7
Element: 6
Element: 5
Element: 4
Element: 3
Element: 2
Element: 1

== COUNTING NUMBERS LARGER THAN 6 ==
There are 4 numbers larger than 6 in the vector

== FINDING AN ELEMENT ==
The element 6 is the first that satisfies i > 4 && i % 6 == 0
```
These algorithms not only make the code easier to write and read, but the code also tends to be faster: the implementations are heavily optimized and often outperform hand-written for/while loops.
The execution of these algorithms, however, is not parallel by default: it is sequential. It is possible, though, to parallelize many of these algorithms, making their execution much faster. This project discusses the following methods for parallelizing the STL: policy-based execution for C++17 and the Intel Parallel STL.
Policy-based execution for C++17 and Intel Parallel STL
C++17 introduces a feature called execution policies, which can be used to specify what kind of execution is desired for the algorithm. There are three standard execution policies:
- std::execution::seq - Sequential execution (no parallelism)
- std::execution::par - Parallel execution (multiple threads)
- std::execution::par_unseq - Parallel execution + vectorization
The Intel C++ compiler also supports a fourth policy, which is not part of the C++17 standard:
- std::execution::unseq - Vectorization
These policies are passed as the first argument to the algorithm:
```cpp
std::copy( std::execution::par, a.begin(), a.end(), b.begin() );
```
As of this writing, most compilers do not yet support this feature. The Intel C++ Compiler is probably the only one that can already make use of these policies, and it was used to perform the tests below.
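To make the usage concrete, here is a minimal, self-contained sketch of what policy-based calls look like (my own illustration, not part of the benchmarked code). With toolchains that implement the parallel algorithms on top of TBB, you may also need to link against it (e.g. -ltbb):

```cpp
#include <algorithm>
#include <execution>
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v(1'000'000);
    std::iota(v.begin(), v.end(), 0);   // fill with 0, 1, 2, ...

    // Same call as the serial version, plus the policy as the first argument
    std::sort(std::execution::par, v.begin(), v.end(), std::greater<int>());

    // Parallel reduction over the whole vector
    long long sum = std::reduce(std::execution::par, v.begin(), v.end(), 0LL);

    std::cout << "largest: " << v.front() << ", sum: " << sum << std::endl;
    return 0;
}
```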
Here are the tests performed: we compared the speed of the four execution policies above for six algorithms: **sort**, **count_if**, **for_each**, **reduce**, **transform**, and **copy**, using three different data structures: **array**, **vector**, and **list**. We used the Intel C++ Compiler on Windows, on an x86_64 machine with 8 cores. The programs were compiled with **O3** optimization and linked against the Intel TBB (Threading Building Blocks) library and OpenMP. The purpose of these tests was to see how the elapsed time changes with different execution policies and data structures; a simplified sketch of the timing loop appears after the observations below.
Some observations to point out:
- There is not much documentation available on how to properly use execution policies (at least I did not find anything in depth).
- The programmer who is using execution policies is still responsible for avoiding deadlocks and race conditions.
- The purpose of these tests is not to compare one compiler versus another or one algorithm versus another.
- The code for all the tests is available here.
- **std::vector** is a dynamic array: the data is kept sequentially in memory, while **std::list** is a linked list (the data is scattered in memory).
- The tests were run three times, and the average elapsed time was used as the result, to smooth out "abnormal" runs.
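As a rough illustration of the methodology (a simplified sketch under my own assumptions, not the exact benchmark code linked above), each measurement boiled down to timing one algorithm call under a given policy:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <execution>
#include <iostream>
#include <vector>

// Time one std::sort call under the given execution policy.
// The data is taken by value so every run sorts a fresh, unsorted copy.
template <typename Policy>
double time_sort_ms(Policy policy, std::vector<int> data) {
    auto start = std::chrono::steady_clock::now();
    std::sort(policy, data.begin(), data.end());
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    std::vector<int> data(10'000'000);
    for (auto& x : data) x = std::rand();

    std::cout << "seq: " << time_sort_ms(std::execution::seq, data) << " ms\n";
    std::cout << "par: " << time_sort_ms(std::execution::par, data) << " ms\n";
    return 0;
}
```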
std::copy
Snippet:
Copying values from array a to array b.
```cpp
std::copy( std::execution::par, a, a + ARRSZ, b );
```
Results:
First, for the array and the vector, these results look rather strange: we were expecting a downward trend in execution time across the policies for both, since the memory is sequential and should be easy to vectorize.
There are some important guesses/conclusions I want to mention here; they will help explain these results and the results of the following tests:
- An explanation for why the vectorized policy did not yield a good result could be that, since I was compiling with **O3**, the serial code was already being vectorized in the background; asking for it to be vectorized again would only add overhead.
- I am not entirely sure why the parallel policy did not always have a good result, but if I had to guess, I would say a plain copy is memory-bound, so the extra threads end up competing for memory bandwidth instead of adding throughput.
Now, about the *list*: the parallel versions were better than the serial and vectorized ones. Why did the vectorized version not yield a good result? This can be explained by the fact that vectorization never happened: because of the nature of a linked list, the data is too scattered in memory to be packed into vector registers and processed with **SIMD**, so the policy could not improve the timing at all. One thing to point out: **since a list is not sequential in memory, it is costly to "slice" the collection into equal parts for parallelization (we basically have to iterate through the whole list), so the gains from parallel execution are limited. If, however, the operation performed on each element is slow, this slow slicing can still be worth it**.
std::count_if
Snippet:
```cpp
// Counting all multiples of 3
auto condition = [](mytype& i) { return i % 3 == 0; };
size_t sum = std::count_if( std::execution::par, a, a + ARRSZ, condition );
```
Results:
These results look a lot more like what we were expecting:
- The vectorized version is the slowest, because an operation like this cannot be vectorized, and we pay the extra overhead of making that decision
- The parallel version performed better than any other policy; the parallel + vectorized version carried the additional overhead of deciding whether vectorization should happen
- We had a similar result for the list, but the gains were dramatically smaller. As noted before, a list is not sequential in memory, so it is costly to "slice" the collection into equal parts for parallelization. We did see a very small gain, but the cost of iterating through the list was far too high.
std::for_each
Snippet:
```cpp
size_t sum = 0;
auto action = [&](mytype& i) { sum += i; };
std::for_each( std::execution::par, a, a + ARRSZ, action );
```
Notice how this **for_each** behaves like a **reduce**: I am modifying the variable "sum", which lives outside the lambda. Under a parallel policy, this is very likely to cause a **race condition**.
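For a sum like this, a safer way to express the same computation (my suggestion, not the code that was benchmarked) is **std::reduce**, which is explicitly allowed to compute and combine partial sums across threads:

```cpp
// Race-free equivalent: the library manages the partial sums internally.
size_t sum = std::reduce( std::execution::par, a, a + ARRSZ, (size_t)0 );
```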
Results:
Even though we had (what looked like) a race condition, we still got good results with the parallel execution policies for array and vector. Again, this operation could not be vectorized, which is why the vectorization policy did not do well. For the **list**, we see the same pattern as before: since slicing the collection is costly, either the compiler did not parallelize it, or the parallel version was just as slow as the serial version.
std::reduce
Snippet:
```cpp
size_t sum = std::reduce( std::execution::par, a.begin(), a.end() );
```
Results:
The results here are very similar to the **for_each** algorithm; it seems like the race condition I created with "*sum*" in the previous test was not really a problem for that algorithm.
std::sort
Snippet:
This algorithm is a bit different: we cannot use std::list with it. Since the algorithm requires random access, it is only possible to use arrays or vectors.
```cpp
auto sorter = [](mytype a, mytype b) { return a < b; };
std::sort( std::execution::par, a, a + ARRSZ, sorter );
```
Results:
These are probably the most dramatic results so far. Sorting is probably not something that can be vectorized, so it makes sense that the vectorized policies did not yield good results. The parallel versions, however, showed a very dramatic improvement. It seems that this **std::sort** implementation is really optimised for parallel execution!
std::transform
Snippet:
```cpp
auto action = [](mytype i) { return i + 5; };
std::transform( std::execution::par, a, a + ARRSZ, a, action );
```
Results:
For the list, what happened here seems similar to what happened before: it is too costly to parallelize or vectorize the operations, so the execution policies did not produce good results. For the array, we also had a result similar to the earlier tests, with the parallel version working better than the others.
For the **vector**, however, we finally had a different result: this time, the vectorized version had a good effect. Apparently the serial version was not vectorized by default, but with the **unseq** execution policy the operation could be vectorized. Why the parallel + vectorized version did not beat the plain parallel one is not obvious to me; if I had to guess, I would say the compiler already vectorized the parallel version on its own, without us needing to specify it. One thing that is very clear is that lists do not go well with parallelization: since they lack random access, we have to walk the whole list just to slice it.
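For reference, the vectorized variant discussed above differs from the snippet only in the policy argument (**unseq** being the Intel extension mentioned earlier):

```cpp
auto action = [](mytype i) { return i + 5; };
// Same transform, but requesting SIMD vectorization only, no extra threads:
std::transform( std::execution::unseq, a, a + ARRSZ, a, action );
```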
In conclusion, it is refreshing to see a tool like this: it makes parallelization so easy that we almost don't need to worry about anything else. It would be nice to have clearer documentation for these methods, so the results would be a little more predictable; some of these results are not exactly intuitive. A suggestion for a next project is to analyse the generated assembly to see exactly why we got some of these results.
Sources
- STL: https://en.wikipedia.org/wiki/Standard_Template_Library
- STL: http://www.cplusplus.com/reference/algorithm/
- Policy-Based execution for C++17: https://scs.senecac.on.ca/~oop345/pages/content/multi.html#alg
- Intel Parallel STL: https://software.intel.com/en-us/get-started-with-pstl