Intel Threading Building Blocks (TBB) supports C++11 lambda expressions.
== Loop Parallelization ==
The simplest form of scalable parallelism is a loop of iterations that can each run
simultaneously without interfering with each other. TBB supports this pattern with templates such as parallel_for, parallel_reduce, and parallel_scan.
'''parallel_for and parallel_reduce'''
Load-balanced, parallel execution of a fixed number of independent loop iterations
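For example, a minimal parallel_for sketch (the helper function scale_all and the scaling factor are illustrative, not from the original text):

<source lang="cpp">
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

// Double every element of a vector in parallel. Each iteration is
// independent, so TBB is free to split the range across worker threads.
void scale_all(std::vector<float>& a) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, a.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                a[i] *= 2.0f;
        });
}
</source>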
'''parallel_scan'''
A template function that computes a parallel prefix (y[i] = y[i-1] op x[i])
'''parallel_reduce'''
Applying a function such as sum, max, min, or logical AND across all the members of a
group is called a reduction operation. Doing a reduction in parallel can yield a different
answer from a serial reduction because of rounding. For instance, A+B+C+D+E+F
may be evaluated in serial as (((((A+B)+C)+D)+E)+F), whereas the parallel version
may compute ((A+B)+((C+D)+(E+F))). Ideally, the results would be the same, but if
rounding can occur, the answers will differ. Traditional C++ programs perform
reductions in loops.
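In TBB this becomes a parallel_reduce call; a minimal sketch of a parallel sum (the function name parallel_sum is illustrative):

<source lang="cpp">
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <functional>
#include <vector>

// Each subrange accumulates a partial sum; std::plus combines the
// partial sums as split subranges join back together.
float parallel_sum(const std::vector<float>& v) {
    return tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, v.size()),
        0.0f,                                    // identity value for +
        [&](const tbb::blocked_range<size_t>& r, float partial) -> float {
            for (size_t i = r.begin(); i != r.end(); ++i)
                partial += v[i];
            return partial;
        },
        std::plus<float>());
}
</source>

The order in which partial sums are combined is not fixed, which is exactly the rounding caveat described above.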
'''parallel_scan'''
A parallel_scan computes a parallel prefix, also known as a parallel scan. This computation
is an advanced concept in parallel computing that is sometimes useful in
scenarios that appear to have inherently serial dependencies.
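A sketch of a running-sum prefix scan using the Body form of parallel_scan (the class name SumScan is illustrative):

<source lang="cpp">
#include <tbb/parallel_scan.h>
#include <tbb/blocked_range.h>

// Body for y[i] = y[i-1] + x[i]. TBB may run a pre-scan pass over a
// subrange to compute partial sums before the final pass writes output;
// Tag::is_final_scan() distinguishes the two passes.
class SumScan {
    const int* x;
    int* y;
public:
    int sum;
    SumScan(const int* x_, int* y_) : x(x_), y(y_), sum(0) {}
    SumScan(SumScan& b, tbb::split) : x(b.x), y(b.y), sum(0) {}
    template <typename Tag>
    void operator()(const tbb::blocked_range<size_t>& r, Tag) {
        int tmp = sum;
        for (size_t i = r.begin(); i != r.end(); ++i) {
            tmp += x[i];
            if (Tag::is_final_scan())   // only the final pass writes output
                y[i] = tmp;
        }
        sum = tmp;
    }
    void reverse_join(SumScan& a) { sum = a.sum + sum; }
    void assign(SumScan& b) { sum = b.sum; }
};

void prefix_sum(const int* x, int* y, size_t n) {
    SumScan body(x, y);
    tbb::parallel_scan(tbb::blocked_range<size_t>(0, n), body);
}
</source>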
'''Partitioner Concept'''
Requirements for a type that decides whether a range should be operated on by a
task body or further split. The partitioner implements rules for deciding when a given range should no longer be
subdivided, but should be operated over as a whole by a task’s body.
The default behavior of the algorithms parallel_for, parallel_reduce, and parallel_scan
is to recursively split a range until no subrange remains that is divisible, as decided by the
function is_divisible of the Range Concept. The Partitioner Concept models rules for the
early termination of the recursive splitting of a range, providing the ability to change the
default behavior. A Partitioner object’s decision making is implemented using two functions:
a splitting constructor and the function should_execute_range.
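For example, a partitioner can be passed as an extra argument to the loop templates. The sketch below uses simple_partitioner, which keeps splitting until subranges reach the grain size (the grain size of 1000 is an arbitrary illustration):

<source lang="cpp">
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>

void process(float* a, size_t n) {
    // simple_partitioner splits recursively until subranges are no
    // longer divisible, i.e. at most the grain size (1000 here).
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, n, 1000),
        [=](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                a[i] += 1.0f;
        },
        tbb::simple_partitioner());
}
</source>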
== Advanced Algorithms ==
'''parallel_while'''
A parallel_while<Body> performs parallel iteration over items. The processing to be
performed on each item is defined by a function object of type Body. The items are specified
in two ways:
• An initial stream of items
• Additional items that are added while the stream is being processed
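parallel_while was later superseded by parallel_do, which keeps the same two-way model of an initial item stream plus items added in flight. A minimal sketch using parallel_do (the halving rule is purely illustrative):

<source lang="cpp">
#include <tbb/parallel_do.h>
#include <vector>

// Each item is processed by the body; the feeder lets the body add
// new items to the stream while processing is underway.
void process_items(std::vector<int>& items) {
    tbb::parallel_do(items.begin(), items.end(),
        [](int item, tbb::parallel_do_feeder<int>& feeder) {
            if (item > 1)
                feeder.add(item / 2);  // an "additional item"
            // ... process item here ...
        });
}
</source>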
'''Working on the Assembly Line: Pipeline'''
Pipelining is a common parallel pattern that mimics a traditional manufacturing
assembly line. Data flows through a series of pipeline stages and each stage processes
the data in some way. Given an incoming stream of data, some of these stages
can operate in parallel, and others cannot.
Pipelined processing is common in multimedia and signal processing applications. A
classical thread-per-stage implementation suffers two problems:
• The speedup is limited to the number of stages.
• When a thread finishes a stage, it must pass its data to another thread.
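A sketch of a three-stage pipeline using the lambda-based parallel_pipeline form available in later TBB releases (the stages and the token limit of 4 are illustrative):

<source lang="cpp">
#include <tbb/pipeline.h>
#include <iostream>

void squares_pipeline() {
    int n = 10;
    tbb::parallel_pipeline(
        4,  // at most 4 items in flight at once
        // Serial input stage: emit the integers 10 down to 1.
        tbb::make_filter<void, int>(tbb::filter::serial_in_order,
            [&n](tbb::flow_control& fc) -> int {
                if (n == 0) { fc.stop(); return 0; }
                return n--;
            }) &
        // Parallel middle stage: items can be squared concurrently.
        tbb::make_filter<int, int>(tbb::filter::parallel,
            [](int x) { return x * x; }) &
        // Serial output stage: print results in input order.
        tbb::make_filter<int, void>(tbb::filter::serial_in_order,
            [](int x) { std::cout << x << '\n'; }));
}
</source>

Because the middle stage is declared parallel, multiple items can be processed at once, while the serial_in_order stages preserve input order; this is how a pipeline can exceed the thread-per-stage speedup limit noted above.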
'''parallel_sort'''
parallel_sort is a comparison sort with an average time complexity O(n log n).
When worker threads are available, parallel_sort creates subtasks that may be executed
concurrently. This sort provides an unstable sort of the sequence [begin1,
end1). Being an unstable sort means that it might not preserve the relative ordering of
elements with equal keys. The sort is deterministic; sorting the same sequence will produce the same result
each time. The requirements on the iterator and sequence are the same as for
std::sort.
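Usage mirrors std::sort; a short sketch:

<source lang="cpp">
#include <tbb/parallel_sort.h>
#include <functional>
#include <vector>

void sort_demo(std::vector<int>& v) {
    tbb::parallel_sort(v.begin(), v.end());                       // ascending
    tbb::parallel_sort(v.begin(), v.end(), std::greater<int>()); // descending
}
</source>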
== Containers ==
Intel Threading Building Blocks provides highly concurrent containers that permit
multiple threads to invoke a method simultaneously on the same container. At this
time, a concurrent queue, vector, and hash map are provided. All of these highly
concurrent containers can be used with this library, OpenMP, or raw threads.
 
'''concurrent_queue'''
 
The template class concurrent_queue<T> implements a concurrent queue with values
of type T. Multiple threads may simultaneously push and pop elements from the
queue. In a single-threaded program, a queue is a first-in first-out structure. But if multiple
threads are pushing and popping concurrently, the definition of first is uncertain.
The only guarantee of ordering offered by concurrent_queue is that if a thread pushes
multiple values, and another thread pops those same values, they will be popped in
the same order that they were pushed.
 
Pushing is provided by the push method. There are blocking and nonblocking flavors
of pop:
 
'''pop_if_present'''
 
This method is nonblocking: it attempts to pop a value, and if it cannot because
the queue is empty, it returns anyway.
 
'''pop'''
 
This method blocks until it pops a value. If a thread must wait for an item to
become available and it has nothing else to do, it should use pop(item) and not
while(!pop_if_present(item)) continue; because pop uses processor resources
more efficiently than the loop.
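A small sketch; note that in current TBB releases the nonblocking pop is spelled try_pop rather than pop_if_present:

<source lang="cpp">
#include <tbb/concurrent_queue.h>
#include <iostream>

void queue_demo() {
    tbb::concurrent_queue<int> q;
    for (int i = 0; i < 5; ++i)
        q.push(i);              // safe to call from many threads at once

    int value;
    // Nonblocking pop (pop_if_present in the original interface):
    while (q.try_pop(value))
        std::cout << value << '\n';
}
</source>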
 
 
'''concurrent_vector'''
A concurrent_vector<T> is a dynamically growable array of items of type T for which
it is safe to simultaneously access elements in the vector while growing it. However,
be careful not to let another task access an element that is under construction or is
otherwise being modified. For safe concurrent growing, concurrent_vector has two
methods for resizing that support common uses of dynamic arrays: grow_by and
grow_to_at_least. The index of the first element is 0. The method grow_by(n) enables
you to safely append n consecutive elements to a vector, and returns the index of the
first appended element. Each element is initialized with T().
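A sketch of safe concurrent growth (early TBB versions returned the index of the first new element from grow_by; later versions return an iterator to it, as assumed here):

<source lang="cpp">
#include <tbb/concurrent_vector.h>

void append_three(tbb::concurrent_vector<int>& v, int base) {
    // Atomically reserves three consecutive slots; other threads may be
    // growing or reading the vector at the same time.
    auto it = v.grow_by(3);
    for (int i = 0; i < 3; ++i)
        *it++ = base + i;   // each slot was default-initialized with T()
}
</source>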
== Memory Allocators ==
Threading Building Blocks offers two choices, both similar to the STL template class,
std::allocator:
'''scalable_allocator'''
 
This template offers scalability but does not completely protect against
false sharing. However, memory is returned to each thread from a separate pool,
which helps protect against false sharing if the memory is not shared with other threads.
 
'''cache_aligned_allocator'''
 
This template offers both scalability and protection against false sharing. It
addresses false sharing by making sure each allocation is done on a separate cache line.
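Both templates are drop-in replacements for std::allocator; for example:

<source lang="cpp">
#include <tbb/scalable_allocator.h>
#include <tbb/cache_aligned_allocator.h>
#include <vector>

// Per-thread memory pools, minimal space overhead:
std::vector<float, tbb::scalable_allocator<float>> a;

// Cache-line-aligned allocations, avoiding false sharing:
std::vector<float, tbb::cache_aligned_allocator<float>> b;
</source>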
 
== Mutual Exclusion ==
 
One of the key advantages of threading is the capability of tasks to share data and
other resources, such as open files, network connections, and the user interface. But
concurrency creates some uncertainty over when each task reads or writes resources,
leading to the need for mutual exclusion.
 
Threading Building Blocks offers two kinds of mutual exclusion:
 
'''Mutexes'''
 
These will be familiar to anyone who has used locks in other environments, and
they include common variants such as reader-writer locks.
 
'''Atomic operations'''
 
These are based on atomic operations offered by hardware processors, and they
provide a solution that is simpler and faster than mutexes in a limited set of situations.
These should be preferred when the circumstances make their use possible.
 
'''Mutexes'''
 
A mutex is a global variable that multiple tasks can access. Before entering code that
you do not want to execute concurrently, a task should request a lock on the mutex.
If the mutex is already locked, the task is stalled waiting. Once the lock is granted,
the task can proceed. The task should proceed quickly and release the lock on the
mutex promptly to avoid unnecessary stalls in other tasks.
If a mutex creates a lot of lock requests, it is considered highly contested. Highly
contested locks will result in many tasks being stalled and a sharp reduction in program
scalability. Avoiding highly contested locks through careful program design is
important.
A mutex, used properly, ensures that no task reads or writes a variable or other
resource when another task is writing it. Intel Threading Building Blocks mutexes
work with generic programming in C++, even in the presence of exceptions. Meeting
all these requirements is no small feat and takes some consideration to ensure
their proper usage.
In Threading Building Blocks, mutual exclusion is implemented by classes known as
mutexes and locks. A mutex is an object on which a task can acquire a lock. Only one
task at a time can have a lock on a mutex; other tasks have to wait their turn.
Mutual exclusion controls how many tasks can simultaneously run a region of code.
In general, you protect a region of code from concurrency when that code reads or
writes a small amount of memory that is interrelated by a particular operation you
want to perform atomically.
 
 
The following is a summary of mutex behaviors:
 
• A spin_mutex is nonscalable, unfair, nonreentrant, and spins in user space. It
would seem to be the worst of all possible worlds, except that it is very fast in
lightly contended situations. If you can design your program so that contention is
somehow spread out among many spin mutexes, you can improve performance over other kinds of mutexes. If a mutex is heavily contended, your algorithm will not scale anyway. Consider redesigning the algorithm instead of looking for a
more efficient lock.
 
• A queuing_mutex is scalable, fair, nonreentrant, and spins in user space. Use it
when scalability and fairness are important.
 
• A spin_rw_mutex and a queuing_rw_mutex are similar to spin_mutex and queuing_mutex, but they additionally support reader locks.

spin_mutex is very fast in lightly contended situations; use it if you need to protect a small section of code.
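A minimal sketch of a spin_mutex guarding a shared counter (the counter itself is illustrative):

<source lang="cpp">
#include <tbb/spin_mutex.h>

long counter = 0;
tbb::spin_mutex counter_mutex;

void safe_increment() {
    // scoped_lock acquires the mutex here and releases it when the
    // lock object is destroyed, even if an exception is thrown.
    tbb::spin_mutex::scoped_lock lock(counter_mutex);
    ++counter;
}
</source>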
'''Overview'''

• A mutex is a wrapper around the system's native mutual exclusion mechanism. On Windows systems, it is implemented on top of a CRITICAL_SECTION. On Linux systems, it is implemented on top of a pthread mutex. The advantages of using the wrapper are that it adds an exception-safe interface and provides an interface identical to the other mutexes in Threading Building Blocks, which makes it easy to swap in a different kind of mutex later if warranted by performance measurements. Future versions of Threading Building Blocks may be able to put tasks to sleep more often, which can be very desirable.

== Atomic Operations ==

Atomic operations are a fast and relatively easy alternative to mutexes. They do not suffer from the deadlock and convoying problems described earlier. Their main limitation is that current computer systems restrict them to fairly small data sizes: the largest is usually the size of the largest scalar, often a double-precision floating-point number. Atomic operations are also limited to a small set of operations supported by the underlying hardware processor. But sophisticated algorithms can accomplish a lot with these operations, so you should not pass up an opportunity to use an atomic operation in place of mutual exclusion. The class atomic<T> implements atomic operations in C++ style.

A classic use of atomic operations is thread-safe reference counting. Suppose x is a reference count of type int and the program needs to take some action when the count becomes zero. In single-threaded code, you could use a plain int for x and write --x; if(x==0) action(). But this method might fail for multithreaded code, because two tasks might interleave their operations.
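With atomic<T> the decrement and the test cannot interleave between tasks; a sketch, where action() is the hypothetical routine to run at zero:

<source lang="cpp">
#include <tbb/atomic.h>

void action();               // hypothetical: runs when the count hits zero

tbb::atomic<int> refcount;   // zero-initialized at namespace scope

void remove_reference() {
    // operator-- is atomic and returns the new value, so no two tasks
    // can both observe 0 for the same decrement.
    if (--refcount == 0)
        action();
}
</source>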
== Timing ==

Intel Threading Building Blocks provides a thread-safe and portable method to compute elapsed time. Many operating systems provide a fine-grained method to track the accounting of central processing unit (CPU) time, and most provide a way to track elapsed/wall-clock time, but the method of extracting an elapsed measurement with subsecond accuracy varies a great deal. The class tick_count in Threading Building Blocks provides a simple interface for measuring wall-clock time.
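A typical use, as a sketch:

<source lang="cpp">
#include <tbb/tick_count.h>
#include <iostream>

void timed_section() {
    tbb::tick_count t0 = tbb::tick_count::now();
    // ... the work being measured goes here ...
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "elapsed: " << (t1 - t0).seconds() << " seconds\n";
}
</source>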
== Task Scheduler ==

The library provides a task scheduler, which is the engine that drives the algorithm templates. You may also call it directly. This is worth considering if your application meets the criteria, described earlier, that make the default task scheduler inefficient.

Tasks are logical units of computation. The scheduler maps these onto physical threads. The mapping is non-preemptive. Each task has an execute() method. Once a thread starts running execute(), the task is bound to that thread until execute() returns. During that time, the thread services other tasks only when it waits on child tasks, at which time it may run the child tasks or, if there are no pending child tasks, service tasks created by other threads.

The task scheduler is intended for parallelizing computationally intensive work. Since task objects are not scheduled preemptively, they should not make calls that might block for long periods because, meanwhile, the blocked thread (and its associated processor) is precluded from servicing other tasks.

Potential parallelism is typically generated by a split/join pattern. Two basic patterns of split/join are supported. The most efficient is continuation passing, in which the programmer constructs an explicit "continuation" task rather than leaving it to the default scheduler to determine the next task.

'''Making Best Use of the Scheduler'''

This section explains useful programming techniques for scheduling tasks.

'''Recursive Chain Reaction'''

The scheduler works best with tree-structured task graphs, because that is where the strategy of breadth-first theft and depth-first work applies very well. Also, tree-structured task graphs allow fast creation of many tasks. For example, if a master task tries to create n children directly, it will take O(n) steps. But with tree-structured forking, it takes only O(log n) steps because some of the tasks created can go on to create subtasks.

Often, domains are not obviously tree-structured, but your program can easily map them to trees. For example, parallel_for works over an iteration space such as a sequence of integers. The template function parallel_for uses that definition to recursively map the iteration space onto a binary tree.

'''Continuation Passing'''

The spawn_and_wait_for_all method is a convenient way to wait for child tasks, but it incurs some inefficiency if a thread becomes idle. The idle thread attempts to keep busy by stealing tasks from other threads. The scheduler limits possible victim tasks to those deeper than the waiting task. This limit modifies the policy that the shallowest task should be chosen, and it restrains memory demands in worst-case scenarios. A way around the constraint is for the parent not to wait, but simply to spawn both children and return. The children are allocated not as children of the parent, but as children of the parent's continuation task, which is a task that runs when both children complete.

'''task_scheduler_init'''

A task_scheduler_init is either active or inactive. Each thread that uses a task should have one active task_scheduler_init object that stays active over the duration that the thread uses task objects. A thread may have more than one active task_scheduler_init at any given moment.

'''Task Scheduler Summary'''

The task scheduler works most efficiently for fork-join parallelism with lots of forks, so that task stealing can cause sufficient breadth-first behavior to occupy threads, which then conduct themselves in a depth-first manner until they need to steal more work.

The task scheduler is not the simplest possible scheduler because it is designed for speed. If you need to use it directly, it may be best to hide it behind a higher-level interface, such as the templates parallel_for, parallel_reduce, and so on. Some of the details to remember are:

• Always use new(allocation_method) T to allocate a task, where allocation_method is one of the allocation methods of the class task. Do not create local or file-scope instances of a task.

• Allocate all siblings before any of them start to run, unless you are using allocate_additional_child_of.

• Exploit continuation passing, scheduler bypass, and task recycling to squeeze out maximum performance.

• If a task completes and was not marked for reexecution, it is automatically destroyed. Also, its dependent's reference count is decremented, and if it hits 0, the dependent is automatically spawned.
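A minimal sketch of these rules: the task is allocated with the special new(allocation_method) form and destroyed automatically after execute() returns (class and function names are illustrative):

<source lang="cpp">
#include <tbb/task.h>
#include <tbb/task_scheduler_init.h>
#include <iostream>

// A trivial task. execute() is the method the scheduler runs; the task
// is destroyed automatically after it returns (see the rules above).
class HelloTask : public tbb::task {
    tbb::task* execute() {
        std::cout << "hello from a task\n";
        return nullptr;   // no explicit next task; let the scheduler choose
    }
};

void run_hello() {
    tbb::task_scheduler_init init;  // activate the scheduler for this thread
    // allocate_root creates a task with no parent; spawn_root_and_wait
    // runs it and blocks until it and its children complete.
    tbb::task& t = *new(tbb::task::allocate_root()) HelloTask;
    tbb::task::spawn_root_and_wait(t);
}
</source>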
'''Benefits of TBB'''

- Intel Threading Building Blocks enables you to specify tasks instead of threads; let the runtime library worry about how many threads to use.
- Intel Threading Building Blocks targets threading for performance.
- Intel Threading Building Blocks is compatible with other threading packages.
- Intel Threading Building Blocks emphasizes scalable, data-parallel programming.
- Intel Threading Building Blocks relies on generic programming.
- Intel Threading Building Blocks is committed to compiler independence, processor independence, and OS independence.

'''TBB is a collection of components for parallel programming:'''

- ''Basic algorithms'': parallel_for, parallel_reduce, parallel_scan
- ''Advanced algorithms'': parallel_while, parallel_do, parallel_pipeline, parallel_sort
- ''Containers'': concurrent_queue, concurrent_priority_queue, concurrent_vector, concurrent_hash_map
- ''Memory allocation'': scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator
- ''Mutual exclusion'': mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive_mutex
- ''Atomic operations'': fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store