These algorithms not only make the code easier to write and read, but they also tend to be faster: their implementations are heavily optimized, often outperforming hand-written for/while loops.
By default, however, these algorithms do not execute in parallel: they are sequential. It is possible to parallelize many of them, making their execution much faster. This project will discuss three methods/implementations for parallelizing the STL: MCSTL, Policy-based execution for C++17, and Intel Parallel STL.
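As a quick illustration of the claim above, here is a minimal sketch (not taken from the project's code; the names are made up for this example) comparing a hand-written loop with the equivalent STL algorithm:

<pre><code>#include <algorithm>
#include <cstddef>
#include <vector>

int main() {
    std::vector<int> v {1, 2, 3, 4, 5, 6};

    // Hand-written loop: counting the multiples of 3
    int count1 = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        if (v[i] % 3 == 0)
            ++count1;

    // Equivalent STL algorithm: shorter, easier to read, and the
    // implementation is free to use an optimized strategy internally
    auto count2 = std::count_if(v.begin(), v.end(),
                                [](int i) { return i % 3 == 0; });

    return count1 == count2 ? 0 : 1;
}</code></pre>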
== Policy-based execution for C++17 and Intel Parallel STL ==
C++17 introduces a feature called Execution Policies, which can be used to specify what kind of execution is desired for an algorithm. There are three possible execution policies:

* '''std::execution::seq''' - sequential execution (no parallelism)
* '''std::execution::par''' - parallel execution (multiple threads)
* '''std::execution::par_unseq''' - parallel execution + vectorization

The Intel C++ compiler also supports another policy, which is not specified in the [http://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t C++ documentation]:

* '''std::execution::unseq''' - vectorization

The execution policy is passed as the first argument to the algorithm:

<pre><code>std::copy(
  std::execution::par,
  a.begin(),
  a.end(),
  b.begin()
);</code></pre>

Most compilers do not yet support this feature. Probably the only compiler that can already make use of these policies is the Intel C++ Compiler, which was used to perform the tests below.

Here are the tests performed: we compared the speed of the four execution policies above for six algorithms: '''sort''', '''count_if''', '''for_each''', '''reduce''', '''transform''', and '''copy''', using three different data structures: '''array''', '''vector''', and '''list'''. We used the Intel C++ Compiler on Windows, on an x86-64 machine with 8 cores. The programs were compiled with '''O3''' optimization, linking the Intel TBB (Threading Building Blocks) library and OpenMP. The purpose of these tests was to see how the elapsed time changes with different execution policies and different data structures.

Some observations to point out:
* There is not much documentation available on how to properly use execution policies (at least I did not find anything in depth).
* The programmer using execution policies is still responsible for avoiding deadlocks and race conditions.
* The purpose of these tests is not to compare one compiler versus another or one algorithm versus another.
* The code for all the tests is available [https://github.com/hscasn/testpstl here].
* '''std::vector''' is a dynamic array: the data is kept sequentially in memory, while '''std::list''' is a linked list (the data is scattered in memory).
* The tests were run three times, and the average elapsed time was used as the result, to avoid "abnormal" results.

----
'''std::copy'''

''Code snippet:'' copying values from array ''a'' to array ''b''.

<pre><code>std::copy(
  std::execution::par,
  a,
  a + ARRSZ,
  b
);</code></pre>

''Results:''

[[File:http://public.hcoelho.com/images/blog/pstl_copy.png]]

First, for the array and the vector, these results look rather strange: we were expecting a downward trend in elapsed time across the execution policies, since the memory is sequential and could easily be vectorized. There are some important guesses/conclusions here that I want to mention.
They will be important to explain these results and the results of the following tests:
* An explanation for why the vectorized policy did not yield a good result could be that, because the code was compiled with '''O3''', the serial code was already being vectorized in the background; asking for it to be vectorized again would only add overhead.
* I am not entirely sure why the parallel policy did not always produce a good result, but if I had to guess, I would say the copies were creating a '''race condition''', which was slowing the parallel execution down.

Now, about the ''list'': the parallel versions were better than the serial and vectorized ones. Why did the vectorized version not yield a good result? This could be explained by the fact that vectorization did not happen: because of the nature of a linked list, the memory is too scattered to be packed into a vector register and processed with '''SIMD''', so it would not improve the timing at all. One thing to point out: '''since a list is not sequential in memory, it is costly to "slice" the collection into equal parts and parallelize them (we basically have to iterate through the whole list), so the gains from parallel execution are not great. If, however, the operation to be performed on every element is slow, this slow slicing can still be worth it.'''

----
'''std::count_if'''

''Snippet:''

<pre><code>// Counting all multiples of 3
auto condition = [](mytype& i) { return i % 3 == 0; };

size_t sum = std::count_if(
  std::execution::par,
  a,
  a + ARRSZ,
  condition
);</code></pre>

''Results:''

[[File:http://public.hcoelho.com/images/blog/pstl_count_if.png]]

These results look a lot more like what we were expecting:
* The vectorized version is the slowest, because vectorizing an operation like this is not possible, and we pay the extra overhead of making that decision.
* The parallel version performed better than any other policy, because the parallel + vectorized version carried the extra overhead of deciding whether vectorization should happen.
* We had a similar result for the list, but the gains were dramatically smaller. This is because, as mentioned before, a list is not sequential in memory, so it is costly to "slice" the collection into equal parts and parallelize them. We did get a very small gain, but the cost of iterating through the list was far too big.

----
'''std::for_each'''

''Snippet:''

<pre><code>size_t sum;
auto action = [&](mytype& i) { return sum += i; };

std::for_each(
  std::execution::par,
  a,
  a + ARRSZ,
  action
);</code></pre>

Notice how this '''for_each''' behaves like a '''reduce''': the variable "sum" is modified from outside the lambda. This is very likely to cause a '''race condition'''.

''Results:''

[[File:http://public.hcoelho.com/images/blog/pstl_for_each.png]]

Even though we had (what looked like) a race condition, we still got good results with the parallel execution policies for the array and the vector. Again, this process could not be vectorized, which is why the vectorization policy did not do well. For the '''list''', we see the same pattern as before: since slicing the collection is costly, it seems that either the compiler did not parallelize it, or the parallel version was just as slow as the serial version.
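Since the snippet above really expresses a reduction while racing on ''sum'', a safer way to write it could look like the following. This is a minimal sketch, not the code used in the tests: it assumes a simple vector of longs and shows two alternatives, an atomic accumulator and '''std::reduce''':

<pre><code>#include <algorithm>
#include <atomic>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<long> v(1000, 1);

    // Option 1: an atomic accumulator makes the for_each data-race free
    // (although contention on the atomic can hurt parallel performance)
    std::atomic<long> sum{0};
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [&sum](long i) { sum += i; });

    // Option 2: express the reduction directly; std::reduce is free to
    // split the work across threads without a shared accumulator
    long sum2 = std::reduce(std::execution::par, v.begin(), v.end(), 0L);

    return sum.load() == sum2 ? 0 : 1;
}</code></pre>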
----
'''std::reduce'''

''Snippet:''

<pre><code>size_t sum = std::reduce(
  std::execution::par,
  a.begin(),
  a.end()
);</code></pre>

''Results:''

[[File:http://public.hcoelho.com/images/blog/pstl_reduce.png]]

The results here are very similar to the '''for_each''' algorithm: it seems that the race condition created with "''sum''" in the previous test was not really a problem for the algorithm.

----
'''std::sort'''

''Snippet:'' this algorithm is a bit different: we cannot use ''std::list'' with it. Since the algorithm requires random access, it is only possible to use arrays or vectors.

<pre><code>auto sorter = [](mytype a, mytype b) { return a < b; };

std::sort(
  std::execution::par,
  a,
  a + ARRSZ,
  sorter
);</code></pre>

''Results:''

[[File:http://public.hcoelho.com/images/blog/pstl_sort.png]]

These are probably the most dramatic results so far. Vectorization for sorting is probably not something that can be done, so it makes sense that the vectorized policies did not yield a good result. The parallel versions, however, showed a very dramatic improvement. It seems that this '''std::sort''' implementation is really optimized for parallel execution!
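If a list does need to be sorted, one possible workaround (a minimal sketch, not part of the original tests) is to copy the list into a random-access container, sort that container in parallel, and copy the result back:

<pre><code>#include <algorithm>
#include <execution>
#include <list>
#include <vector>

int main() {
    std::list<int> l {5, 3, 1, 4, 2};

    // Copy the list into a random-access container...
    std::vector<int> tmp(l.begin(), l.end());

    // ...sort it in parallel...
    std::sort(std::execution::par, tmp.begin(), tmp.end());

    // ...and copy the sorted values back into the list
    l.assign(tmp.begin(), tmp.end());

    return 0;
}</code></pre>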
----
'''std::transform'''

''Snippet:''

<pre><code>auto action = [](mytype i) { return i += 5; };

std::transform(
  std::execution::par,
  a,
  a + ARRSZ,
  a,
  action
);</code></pre>

''Results:''

[[File:http://public.hcoelho.com/images/blog/pstl_transform.png]]

For the list, what happened here seems to be similar to what happened before: it is too costly to parallelize or vectorize the operations, so the execution policies did not produce a good result. For the array, we also had a similar result as before, with the parallel version working better than the other ones.

For the '''vector''' data structure, however, we finally got a different result: this time, the vectorized policy had a good effect. Apparently the serial version could not be vectorized by default, but with the '''unseq''' execution policy, the operation could be vectorized. Why the parallel + vectorized version did not beat the plain parallel one, however, is not obvious to me. If I had to guess, I would say the compiler also vectorized the parallel version on its own, without us needing to specify it.
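The tests above report elapsed time; the actual benchmark code is in the repository linked earlier, but a minimal timing harness along these lines (the sizes and the timed operation here are assumptions for illustration, not the original code) could look like this:

<pre><code>#include <algorithm>
#include <chrono>
#include <execution>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v(10'000'000, 1);

    auto start = std::chrono::steady_clock::now();

    // The operation being timed: a parallel transform over the vector
    std::transform(std::execution::par, v.begin(), v.end(), v.begin(),
                   [](int i) { return i + 5; });

    auto end = std::chrono::steady_clock::now();

    // Report the elapsed time in milliseconds; to smooth out noise,
    // run several times and average the results
    std::cout << "elapsed: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms\n";

    return 0;
}</code></pre>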
== References ==
* Policy-based execution for C++17: https://scs.senecac.on.ca/~oop345/pages/content/multi.html#alg
* Intel Parallel STL: https://software.intel.com/en-us/get-started-with-pstl