Team Darth Vector

8,352 bytes added, 17:09, 17 December 2017
Business Point of View Comparison for STL and TBB
'''TEAM, use this for formatting. [https://en.wikipedia.org/wiki/Help:Cheatsheet Wiki Editing Cheat Sheet]
'''GPU621 Darth Vector: C++11 STL vs TBB Case Studies'''
''Join me, and together we can fork the problem as master and thread''
==Generic Programming==
Generic programming is an approach to writing code that makes algorithms reusable with as little type-specific code as possible. Intel describes generic programming as "''writing the best possible algorithms with the least constraints''". An example of generic code is the STL's function templates, which can be used with many different types without type-specific coding (a single addition template can serve int, double, float, short, etc. without being rewritten). A non-generic library requires types to be specified, meaning more type-specific code has to be written. [[File:gputemplates.PNG |thumb|center|600px| An example of generic coding]]
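As a rough illustration of the point above, the sketch below contrasts one function template with the per-type functions a non-generic library would need. The names add, addInt and addDouble are hypothetical and exist only for this example.

```cpp
#include <string>

// Hypothetical generic function: one definition serves every type
// that supports operator+ (int, double, std::string, ...).
template <typename T>
T add(const T& a, const T& b) {
    return a + b;
}

// Non-generic equivalents: one hand-written function per type.
int addInt(int a, int b) { return a + b; }
double addDouble(double a, double b) { return a + b; }
```

Calling add(2, 3), add(1.5, 2.5) or add(std::string("a"), std::string("b")) all use the same template, while the non-generic version needs a new function for each type.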
==TBB Background==
Threading Building Blocks (TBB) is an attempt by Intel to push the development of multi-threaded programs. The library implements containers and algorithms that make it easier for the programmer to create multi-threaded applications, including parallel versions of the for, reduce and scan patterns.
TBB was originally developed by Intel, but the concepts it uses are derived from a diverse range of sources. It was created in 2004 and open sourced in 2007; the latest release at the time of writing was in 2017.
==STL Background==
The STL was created as a general purpose computation library with a focus on generic programming. The STL uses templates extensively to achieve compile-time polymorphism. In general the library provides four components: algorithms, containers, functions and iterators.
 
The library was, mostly, created by Alexander Stepanov due to his ideas about generic programming and its potential to revolutionize software development. Because of the ability of C++ to provide access to storage using pointers, C++ was used by Stepanov, even though the language was still relatively young at the time.
After a long period of engineering and development of the library, it obtained final approval in July 1994 to become part of the language standard.
==A Comparison from STL==
In general, most of the STL is intended for use within a serial environment. This changes, however, with C++17's introduction of parallel algorithms.
<u>'''Algorithms'''</u>
The STL supports various algorithms such as sorting, searching and accumulation. Most can be found within the header "'''<algorithm>'''" (accumulation, via std::accumulate, is in "'''<numeric>'''"). Examples include the sort() and reverse() functions.
<u>'''STL iterators'''</u>
Iterators are supported for serial traversal. Should you use an iterator in parallel, you must be careful not to change the data while a thread is traversing it with the iterator.
They are defined within the header "'''<iterator>'''", which is included as:
<pre>
#include<iterator>
</pre>
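The following is a minimal sketch of how the algorithms and iterators described above are typically combined in serial code. The function names sortedDescending and sumWithIterators are hypothetical and chosen only for this example.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Sorts a copy of the input, then reverses it to get descending order.
std::vector<int> sortedDescending(std::vector<int> values) {
    std::sort(std::begin(values), std::end(values));
    std::reverse(std::begin(values), std::end(values));
    return values;
}

// Walks the container with an explicit iterator. This traversal is only
// safe while no other thread is modifying the vector.
int sumWithIterators(const std::vector<int>& values) {
    int total = 0;
    for (std::vector<int>::const_iterator it = values.begin();
         it != values.end(); ++it) {
        total += *it;
    }
    return total;
}
```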
<u>'''Containers'''</u>
The STL supports a variety of containers for data storage. Generally these containers support concurrent reads, but they do not safely support writing to the container, with or without reading at the same time. There are several header files for them, such as "'''<vector>'''", "'''<queue>'''", and "'''<deque>'''". '''Most STL containers do not support concurrent operations upon them.''' They are coded as:
<pre>
#include <vector>
#include <deque>
#include <queue>

int main() {
    // "type" is a placeholder for the element type
    std::vector<type> myVector;
    std::deque<type> cards;
    std::queue<type> SenecaYorkTimHortons;
}
</pre>
<u>'''Memory Allocator'''</u>
An allocator gives more precise control over how dynamic memory is obtained and used. std::allocator is the default allocator for all STL containers: a container obtains and releases its storage through the allocator rather than through direct use of 'new' and 'delete', which lets its memory grow as needed. It is defined within the header file "'''<memory>'''" and is coded as:
<pre>
#include <memory>

void foo() {
    // "type" is a placeholder for the element type
    std::allocator<type> name;
}
</pre>
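To make the allocator description more concrete, here is a small sketch of the allocate, construct, destroy, deallocate cycle that a container performs through its allocator. The function name allocatorRoundTrip is hypothetical; the calls themselves are the standard std::allocator_traits interface.

```cpp
#include <memory>
#include <type_traits>
#include <vector>

// std::vector<int> already uses std::allocator<int> as its second
// template argument, so these two spellings name the same type.
static_assert(std::is_same<std::vector<int>,
                           std::vector<int, std::allocator<int> > >::value,
              "std::allocator is the default allocator");

// Hypothetical helper: stores one int through the allocator and reads it back.
int allocatorRoundTrip(int value) {
    typedef std::allocator_traits<std::allocator<int> > Traits;
    std::allocator<int> alloc;
    int* p = Traits::allocate(alloc, 1);   // raw storage for one int
    Traits::construct(alloc, p, value);    // construct the object in place
    int copy = *p;
    Traits::destroy(alloc, p);             // end the object's lifetime
    Traits::deallocate(alloc, p, 1);       // return the storage
    return copy;
}
```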
==A Comparison from TBB==
===Containers===
'''<u>concurrent_queue</u>''' : This is the concurrent version of the STL container queue. It supports first-in-first-out data storage like its STL counterpart. Multiple threads may simultaneously push and pop elements from the queue. The queue does NOT support front() or back() in concurrent operations (the front could change while being accessed). It also supports iteration, but this is slow and is only intended for debugging a program. It is defined within the header "'''tbb/concurrent_queue.h'''" and is coded as: <pre>
#include <tbb/concurrent_queue.h>
//....//
tbb::concurrent_queue<type> name; </pre>
'''<u>concurrent_vector</u>''' : This is a container class for vectors with concurrent (parallel) support. These vectors do not support insertion or erase operations but do support operations done by multiple threads, such as push_back(). Note that once elements are inserted, they cannot be removed without calling the clear() member function, which removes every element in the array. The container does not guarantee that elements will be stored at consecutive addresses in memory. It is defined within the header "'''tbb/concurrent_vector.h'''" and is coded as: <pre>
#include <tbb/concurrent_vector.h>
//...//
tbb::concurrent_vector<type> name; </pre>
'''<u>concurrent_hash_map</u>''' : A container class that supports hashing in parallel. The keys are not ordered, and each key in the table maps to a single element. It is defined within "'''tbb/concurrent_hash_map.h'''".
===Algorithms===
<u>'''parallel_for:'''</u> Provides concurrent support for for loops. This allows data to be divided up into chunks that each thread can work on. It is defined in "'''tbb/parallel_for.h'''" and takes the form: <pre>
tbb::parallel_for(firstPos, lastPos, increment, [&](int i) { boo(i); });
</pre>
<u>'''parallel_scan:'''</u> Provides concurrent support for a parallel scan. Intel notes that it may invoke the function up to twice as many times as the serial algorithm does. It is defined in "'''tbb/parallel_scan.h'''" and, according to Intel, takes the form: <pre>void parallel_scan( const Range& range, Body& body [, partitioner] );
</pre>
<u>'''parallel_invoke:'''</u> Provides support for calling the functions given as arguments in parallel. It is defined within the header "'''tbb/parallel_invoke.h'''" and is coded as: <pre>tbb::parallel_invoke(myFuncA, myFuncB, myFuncC);</pre>
===Allocators===
Handle memory allocation for concurrent containers, and in particular help resolve issues that affect parallel programming. They are called '''scalable_allocator<type>''' and '''cache_aligned_allocator<type>''', and are defined in "'''tbb/scalable_allocator.h'''" and "'''tbb/cache_aligned_allocator.h'''" respectively.
==TBB Memory Allocation & Fixing Issues from Parallel Programming==
TBB provides memory allocation just like the STL does via the '''std::allocator''' template class. Where TBB's allocators improve on it is in their expanded support for common issues experienced in parallel programming. These allocators are called '''scalable_allocator<type>''' and '''cache_aligned_allocator<type>''', and they reduce performance problems such as poor '''Scalability''' and '''False Sharing'''.
===False Sharing===
As you may have seen from the workshop "False Sharing", a major performance hit can occur in parallel code when data that sits on the same cache line in memory is used by two threads. When threads attempt operations on the same cache line, they compete for access and the cache line is moved back and forth between them. Moving the line takes a significant number of clock cycles, which causes the performance problem. Through TBB, Intel created an allocator known as '''cache_aligned_allocator<type>'''. Objects allocated through it never share a cache line with one another, so they do not encounter false sharing among themselves. Note that if only 1 object is allocated by this allocator, false sharing may still occur with memory that was allocated some other way. For compatibility's sake (so that programmers can simply use "find and replace"), cache_aligned_allocator takes the same arguments as the STL allocator. If you wish to use the allocator with STL containers, you only need to set the 2nd template argument to the cache_aligned_allocator type.
The following is an example provided by Intel to demonstrate this:
<pre>
std::vector<int,cache_aligned_allocator<int> >;
</pre>
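The idea behind cache-line separation can also be sketched with the standard library alone. The example below does not use TBB: it pads each counter to its own line with alignas, assuming a 64-byte cache line (the real size is hardware dependent). The names kCacheLine, PackedCounters, PaddedCounter, PaddedCounters and sameCacheLine are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. The actual size depends on the CPU.
constexpr std::size_t kCacheLine = 64;

// Two counters packed side by side: they can end up on one cache line,
// so two threads updating a and b separately may suffer false sharing.
struct PackedCounters {
    long a;
    long b;
};

// Each counter is aligned (and therefore padded) to a full cache line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

struct PaddedCounters {
    PaddedCounter a;
    PaddedCounter b;
};

// Returns true when both addresses fall on the same assumed cache line.
bool sameCacheLine(const void* p, const void* q) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(q) / kCacheLine;
}
```

With this layout, one thread can update PaddedCounters::a while another updates PaddedCounters::b without the two writes landing on the same line, which is the same effect cache_aligned_allocator aims for with dynamically allocated objects.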
===Scaling Issue===
When working in parallel, several threads may need to allocate from shared memory at the same time. With an ordinary allocator this causes a slowdown, because only a single thread can allocate memory at a time while the other threads are required to wait. Intel describes this issue in parallel programming as '''Scalability''' and answers it with '''scalable_allocator<type>''', which permits concurrent memory allocation and is considered ideal for "''programs that rapidly allocate and free memory''".
==Lock Convoying Problem==
===What is a Lock?===
A lock (also called a "mutex") is a way for programmers to protect code that, when executed in parallel, could cause multiple threads to fight over a resource/container for some operation. When threads work in parallel to complete a task with containers, there is no indication of when a thread will reach the container and need to perform an operation on it. This causes problems when multiple threads access the same place. When doing an insertion on a container with threads, we must ensure only 1 thread is capable of pushing to it, or else threads may fight for control. By "locking" the container, we ensure only 1 thread accesses it at any given time.
To use a lock, your program must be working in parallel (e.g. #include <thread>) and should be completing something in parallel. You can find C++11 locks with #include <mutex>.
<pre>
#include <iostream>
#include <thread>
#include <mutex>

//Some threads are spawned which call this function
//Declared the following within the class: std::mutex NightsWatch;
void GameOfThronesClass::GuardTheWall(){
    //Protect until unlock() is called. Only 1 thread may do this below at a time. It is "locked"
    NightsWatch.lock();
    //Increment
    DaysWithoutWhiteWalkerAttack++;
    std::cout << "It has been " << DaysWithoutWhiteWalkerAttack << " since the last attack at Castle Black!\n";
    //Allow next thread to execute the above iteration
    NightsWatch.unlock();
}
</pre>
[[File:Gpulockwhat.PNG |thumb|center|700px| Mutex Example]]
Note that there can be problems with locks. If a thread locks a mutex but never unlocks it, any other threads will be forced to wait, which may cause performance issues. Another problem is called "deadlock", where each thread is waiting for another to unlock (and vice versa), so the program is forced to wait indefinitely.
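One common way to reduce the risk of a lock that is never released is to let a scope-bound guard do the unlocking. The sketch below uses std::lock_guard from the C++11 standard library; the Counter class and countInParallel function are hypothetical names used only for illustration.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared counter protected by a mutex.
class Counter {
public:
    void increment() {
        // The guard locks here and unlocks automatically when it goes
        // out of scope, even if the code in between throws.
        std::lock_guard<std::mutex> guard(mutex_);
        ++value_;
    }

    int value() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    int value_ = 0;
};

// Spawns threadCount threads that each increment the counter perThread times.
int countInParallel(int threadCount, int perThread) {
    Counter counter;
    std::vector<std::thread> pool;
    for (int t = 0; t < threadCount; ++t) {
        pool.emplace_back([&counter, perThread] {
            for (int i = 0; i < perThread; ++i) {
                counter.increment();
            }
        });
    }
    for (std::thread& worker : pool) {
        worker.join();
    }
    return counter.value();
}
```

Because every increment happens under the lock, the final count equals threadCount multiplied by perThread. A guard like this does not prevent deadlock between several mutexes; it only removes the forgotten-unlock case.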
If we attempt to find the data in parallel with other operations ongoing, 1 thread could search for the data while another updates the vector's size during that time. This causes problems with thread 1's search, as the memory location may change should the vector need to grow (growing performs a deep copy to new memory).
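The relocation described above can be observed in serial code. The following sketch, with the hypothetical function name growthMovesStorage, fills a std::vector to its capacity and then checks whether one more push_back moved the underlying storage.

```cpp
#include <cstdint>
#include <vector>

// Returns true if pushing past the current capacity moved the
// vector's elements to a different block of memory.
bool growthMovesStorage() {
    std::vector<int> v;
    v.reserve(4);
    // Fill the vector exactly to its current capacity.
    while (v.size() < v.capacity()) {
        v.push_back(0);
    }
    // Record the address as an integer so the old pointer is never reused.
    const std::uintptr_t before = reinterpret_cast<std::uintptr_t>(v.data());
    v.push_back(1); // exceeds capacity, forcing a reallocation and copy
    const std::uintptr_t after = reinterpret_cast<std::uintptr_t>(v.data());
    return before != after;
}
```

Any pointer or iterator that another thread obtained before the push_back would now refer to the old block, which is why unsynchronized searching and growing cannot safely be mixed on a std::vector.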
Locks can solve the above issue but cause significant performance issues, as the threads are forced to wait for each other before continuing. This performance hit is known as '''Lock Convoying'''.
[[File:DarthVector ThreadLock.PNG |thumb|center|600px| Performance issues inside STL]]
===Lock Convoying in TBB===
TBB attempts to mitigate the performance issue from parallel code when accessing or completing an operation on a container through its own containers such as concurrent_vector.
Through '''concurrent_vector''', every time an element is accessed/changed, its index location is returned. TBB promises that once an element is pushed, it will always be in the same location, no matter how the size of the vector changes in memory. With a standard vector, when the size of the vector changes, the data may be copied over to new memory. If any threads are currently traversing this vector when the size changes, their iterators may no longer be valid. This support also goes further for containers, so that multiple threads can iterate through the container while another thread may be growing the container. An interesting catch, though, is that anything iterating may iterate over objects that are still being constructed, so construction and access must be kept synchronized. [[File:Gpuconcur.PNG |thumb|center|600px| concurrent_vector use with multiple threads]]
TBB also provides its own versions of the mutex such as ''spin_mutex'' for when mutual exclusion is still required.
 
You can find more information on convoying and containers here: https://software.intel.com/en-us/blogs/2008/10/20/tbb-containers-vs-stl-performance-in-the-multi-core-age
 
==A Comparison between Serial Vector and TBB concurrent_vector==
''If only you knew the power of the Building Blocks''
 
Using the code below, we will test the speed of some vector operations using the STL's <u>'''vector'''</u> and TBB's <u>'''concurrent_vector'''</u>.
The code below performs a "push back" operation both serially and concurrently, then measures the time taken to complete '''n''' push back operations.
<nowiki>
#include <iostream>
#include <tbb/tbb.h>
#include <tbb/concurrent_vector.h>
#include <vector>
#include <fstream>
#include <cstdlib> // std::atoi
#include <chrono>
#include <string>
 
using namespace std::chrono;
 
// define a stl and tbb vector
tbb::concurrent_vector<std::string> con_vector_string;
std::vector<std::string> s_vector_string;
 
tbb::concurrent_vector<int> con_vector_int;
std::vector<int> s_vector_int;
 
 
 
void reportTime(const char* msg, steady_clock::duration span) {
auto ms = duration_cast<milliseconds>(span);
std::cout << msg << " - took - " <<
ms.count() << " milliseconds" << std::endl;
}
 
int main(int argc, char** argv){
if(argc != 2) { return 1; }
int size = std::atoi(argv[1]);
 
steady_clock::time_point ts, te;
/*
TEST WITH STRING OBJECT
*/
ts = steady_clock::now();
// serial for loop
for(int i = 0; i < size; ++i)
s_vector_string.push_back(std::string());
te = steady_clock::now();
reportTime("Serial vector speed - STRING: ", te-ts);
ts = steady_clock::now();
// concurrent for loop
tbb::parallel_for(0, size, 1, [&](int i){
con_vector_string.push_back(std::string());
});
te = steady_clock::now();
reportTime("Concurrent vector speed - STRING: ", te-ts);
/*
TEST WITH INT DATA TYPE
*/
std::cout<< "\n\n";
ts = steady_clock::now();
// serial for loop
for(int i = 0; i < size; ++i)
s_vector_int.push_back(i);
te = steady_clock::now();
reportTime("Serial vector speed - INT: ", te-ts);
ts = steady_clock::now();
// concurrent for loop
tbb::parallel_for(0, size, 1, [&](int i){
con_vector_int.push_back(i);
});
te = steady_clock::now();
reportTime("Concurrent vector speed - INT: ", te-ts);
}
</nowiki>
[[File:Gputable.PNG |thumb|center|1200px| A speed comparison between concurrent and serial vectors]]
[[File:Gputablecmp.PNG |thumb|center|800px| The results in a table format; notice the int serial speed is faster than concurrent]]
===The Speed Improvement===
As the table suggests, completing push back operations using TBB's concurrent vector allows for increased performance against a serial vector. TBB additionally provides the benefit that existing elements never need to be moved as push back operations are completed. In the STL, the vector is dynamically allocated, which requires it to reallocate and copy memory over as it grows, which may further slow down the push back operation.
===Why was TBB slower for Int?===
When dealing with primitive types like int, the push back operation itself is not very complex, meaning that a serial process can complete it quite quickly. The vector can also likely hold a lot more data before needing to reallocate its memory. TBB's practicality comes when performing a more complex action on a primitive type, or a simple action on a more complex type. Parallel overhead (the resources required to support TBB in parallel) may also increase the time required, which could be the cause of the slowdown in the above comparison. TBB also provides automatic chunking, and Intel describes the point at which parallel_for becomes beneficial as loops that take about 1 million clock cycles.
==Business Point of View Comparison for STL and TBB==
{| class="wikitable collapsible collapsed" style="text-align: left;margin:0px;"
===Which library is better depending on the use case?===
The real question is when you should parallelize your code and when you should just keep it serial. TBB is for multi-threaded workloads and the STL is for single-threaded workloads. The fastest known serial algorithm may be difficult or impossible to parallelize. Some aspects to look out for when parallelizing your code are:
*Overhead, whether it be in communication, idling, load imbalance, synchronization, or excess computation
*Efficiency, which is the measure of processor utilization in a parallel program
*Scalability: the efficiency can be kept constant as the number of processing elements is increased, provided that the problem size is increased
*Correct problem size: testing may show poor efficiency if the problem size is too small. So you would want to use serial instead if the problem size is always small. If you have a large problem size and great efficiency, then parallel is the way to go

Resource: http://ppomorsk.sharcnet.ca/Lecture_2_d_performance.pdf

===Identifying the worries and responsibilities===
The increasing complexity of your code is a natural problem when working in parallel. Knowing your responsibilities, as in what you must worry about as a developer, is key when trying to quickly implement parallel regions in your code or deciding to just keep your code serial.
====STL and the Threading Libraries====
If you are going to parallelize your code using the STL coupled with the threading libraries, this is what you must worry about:
*Thread creation, termination, synchronization, partitioning, and management must be handled by you. This increases the work load and the complexity. Thread creation and overall resource management for the STL are handled by a combination of libraries.
*Dividing a collection of data is more of a problem when using the STL containers.
*C++11 does not have any parallel algorithms, so any common parallel pattern such as map, scan, or reduce must be implemented by yourself or by another library. C++17, though, will have some parallel algorithms like scan, map, and reduce.
'''What you don’t need to worry about'''
*Sorting and searching algorithms
*Partitioning data
*Array algorithms, like copying, assigning, and checking data
*Types of data storage
*Value pairing
Note that all algorithms are done in serial, and may not be thread safe.
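To give a feel for the responsibilities listed for the STL with the threading library, here is a minimal sketch of a hand-written reduce pattern. The function name parallelSum and its chunking scheme are assumptions made for this example, not a prescribed method: the programmer partitions the data, creates the threads, joins them, and combines the partial results.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical hand-rolled parallel reduce using only C++11 facilities.
long long parallelSum(const std::vector<int>& data, unsigned threadCount) {
    if (threadCount == 0) {
        threadCount = 1;
    }
    std::vector<long long> partial(threadCount, 0);
    std::vector<std::thread> workers;

    // Manual partitioning: split the range into roughly equal chunks.
    const std::size_t chunk = (data.size() + threadCount - 1) / threadCount;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t first = std::min(data.size(), t * chunk);
        const std::size_t last = std::min(data.size(), first + chunk);
        // Manual thread creation: each thread reduces its own chunk and
        // writes to its own slot, so no lock is needed.
        workers.emplace_back([&data, &partial, t, first, last] {
            partial[t] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(first),
                data.begin() + static_cast<std::ptrdiff_t>(last), 0LL);
        });
    }

    // Manual synchronization: wait for every thread before combining.
    for (std::thread& worker : workers) {
        worker.join();
    }
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Everything marked "manual" in the comments is work that a library such as TBB would otherwise take over, which is the trade-off this section describes.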
====TBB Worries and Responsibilities====
*Thread creation, termination, synchronization, partitioning, and management are managed by TBB. This means you need not worry about the heavy constructs of threads that are present in the lower levels of programming.
*TBB has its own parallel algorithms: for a simple map, scan, pipeline, or reduce, TBB has you covered
*Dividing a collection of data: the block range coupled with its algorithms makes it simpler to divide the data
'''Benefit'''
The benefit of TBB is that it is made in such a way that you as a programmer only need to worry about one thing: how to parallelize your serial code. You do not need to worry about resource management. The TBB model allows you to make quick parallel implementations with less effort.
'''Downside'''
The downside of TBB is that, since much of the close-to-hardware management is done behind the scenes, you as a developer have less control over fine-tuning your program, unlike what the STL with the threading library allows you to do.
===Licensing===
TBB is dual-licensed as of September 2016:
*Commercial (COM) license as part of suite products. Offers one year of technical support and product updates
*Apache v2.0 license for open source code. Allows the user of the software the freedom to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software, under the terms of the license, without concern for royalties.
===Companies and Products that use TBB===
*DreamWorks (DreamWorks Fur Shader)
*Blue Sky Studios (animation and simulation software)
*Pacific Northwest National Laboratory (Ultrasound products)
*More: https://software.intel.com/en-us/intel-tbb/reviews