10
edits
Changes
J&J
,→Supports C++11 Lambda
== Loop Parallelization ==
The simplest form of scalable parallelism is a loop of iterations that can each run
simultaneously without interfering with each other. Such as parallel_for, parallel_reduce, parallel_scan.
'''parallel_for and parallel_reduce'''
Load-balanced, parallel execution of a fixed number of independent loop iterations
'''parallel_scan'''
A template function that computes a parallel prefix (y[i] = y[i-1] op x[i])
'''parallel_reduce'''
Applying a function such as sum, max, min, or logical AND across all the members of a
group is called a reduction operation. Doing a reduction in parallel can yield a different
answer from a serial reduction because of rounding. For instance, A+B+C+D+E+F
may be evaluated in serial as (((((A+B)+C)+D)+E)+F), whereas the parallel version
may compute ((A+B)+((C+D)+(E+F))). Ideally, the results would be the same, but if
rounding can occur, the answers will differ. Traditional C++ programs perform
reductions in loops
'''parallel_scan'''
A parallel_scan computes a parallel prefix, also known as a parallel scan. This computation
is an advanced concept in parallel computing that is sometimes useful in
scenarios that appear to have inherently serial dependencies.
'''Partitioner Concept'''
Requirements for a type that decides whether a range should be operated on by a
task body or further split.The partitioner implements rules for deciding when a given range should no longer be
subdivided, but should be operated over as a whole by a task’s body.
The default behavior of the algorithms parallel_for, parallel_reduce, and parallel_scan
is to recursively split a range until no subrange remains that is divisible, as decided by the
function is_divisible of the Range Concept. The Partitioner Concept models rules for the
early termination of the recursive splitting of a range, providing the ability to change the
default behavior. A Partitioner object’s decision making is implemented using two functions:
a splitting constructor and the function should_execute_range.
== Advanced Algorithms ==
'''parallel_while'''
A parallel_while<Body> performs parallel iteration over items. The processing to be
performed on each item is defined by a function object of type Body. The items are specified
in two ways:
• An initial stream of items
• Additional items that are added while the stream is being processed
'''Working on the Assembly Line: Pipeline'''
Pipelining is a common parallel pattern that mimics a traditional manufacturing
assembly line. Data flows through a series of pipeline stages and each stage processes
the data in some way. Given an incoming stream of data, some of these stages
can operate in parallel, and others cannot.
Pipelined processing is common in multimedia and signal processing applications. A
classical thread-per-stage implementation suffers two problems:
• The speedup is limited to the number of stages.
• When a thread finishes a stage, it must pass its data to another thread.
'''parallel_sort'''
parallel_sort is a comparison sort with an average time complexity O(n log n).
When worker threads are available, parallel_sort creates subtasks that may be executed
concurrently. This sort provides an unstable sort of the sequence [begin1,
end1). Being an unstable sort means that it might not preserve the relative ordering of
elements with equal keys.The sort is deterministic; sorting the same sequence will produce the same result
each time. The requirements on the iterator and sequence are the same as for
std::sort.
== Containers ==
Intel Threading Building Blocks provides highly concurrent containers that permit
multiple threads to invoke a method simultaneously on the same container. At this
time, a concurrent queue, vector, and hash map are provided. All of these highly
concurrent containers can be used with this library, OpenMP, or raw threads.
'''concurrent_queue'''
The template class concurrent_queue<T> implements a concurrent queue with values
of type T. Multiple threads may simultaneously push and pop elements from the
queue. In a single-threaded program, a queue is a first-in first-out structure. But if multiple
threads are pushing and popping concurrently, the definition of first is uncertain.
The only guarantee of ordering offered by concurrent_queue is that if a thread pushes
multiple values, and another thread pops those same values, they will be popped in
the same order that they were pushed.
Pushing is provided by the push method. There are blocking and nonblocking flavors
of pop:
'''pop_if_present'''
This method is nonblocking: it attempts to pop a value, and if it cannot because
the queue is empty, it returns anyway.
'''pop'''
This method blocks until it pops a value. If a thread must wait for an item to
become available and it has nothing else to do, it should use pop(item) and not
while(!pop_if_present(item)) continue; because pop uses processor resources
more efficiently than the loop.
'''concurrent_vector'''
A concurrent_vector<T> is a dynamically growable array of items of type T for which
it is safe to simultaneously access elements in the vector while growing it. However,
be careful not to let another task access an element that is under construction or is
otherwise being modified. For safe concurrent growing, concurrent_vector has two
methods for resizing that support common uses of dynamic arrays: grow_by and
grow_to_at_least. The index of the first element is 0. The method grow_by(n) enables
you to safely append n consecutive elements to a vector, and returns the index of the
first appended element. Each element is initialized with T( ).
== Memory Allocators ==
Threading Building Blocks offers two choices, both similar to the STL template class,
std::allocator:
scalable_allocator
This template offers just scalability, but it does not completely protect against
false sharing. Memory is returned to each thread from a separate pool, which
helps protect against false sharing if the memory is not shared with other threads.
cache_aligned_allocator
This template offers both scalability and protection against false sharing. It
addresses false sharing by making sure each allocation is done on a cache line.
== Mutual Exclusion ==
One of the key advantages of threading is the capability of tasks to share data and
other resources, such as open files, network connections, and the user interface. But
concurrency creates some uncertainty over when each task reads or writes resources,
leading to the need for mutual exclusion.
Threading Building Blocks offers two kinds of mutual exclusion:
'''Mutexes'''
These will be familiar to anyone who has used locks in other environments, and
they include common variants such as reader-writer locks.
'''Atomic operations'''
These are based on atomic operations offered by hardware processors, and they
provide a solution that is simpler and faster than mutexes in a limited set of situations.
These should be preferred when the circumstances make their use possible.
'''Mutexes'''
A mutex is a global variable that multiple tasks can access. Before entering code that
you do not want to execute concurrently, a task should request a lock on the mutex.
If the mutex is already locked, the task is stalled waiting. Once the lock is granted,
the task can proceed. The task should proceed quickly and release the lock on the
mutex promptly to avoid unnecessary stalls in other tasks.
If a mutex creates a lot of lock requests, it is considered highly contested. Highly
contested locks will result in many tasks being stalled and a sharp reduction in program
scalability. Avoiding highly contested locks through careful program design is
important.
A mutex, used properly, ensures that no task reads or writes a variable or other
resource when another task is writing it. Intel Threading Building Blocks mutexes
work with generic programming in C++, even in the presence of exceptions. Meeting
all these requirements is no small feat and takes some consideration to ensure
their proper usage.
In Threading Building Blocks, mutual exclusion is implemented by classes known as
mutexes and locks. A mutex is an object on which a task can acquire a lock. Only one
task at a time can have a lock on a mutex; other tasks have to wait their turn.
Mutual exclusion controls how many tasks can simultaneously run a region of code.
In general, you protect a region of code from concurrency when that code reads or
writes a small amount of memory, which is generally interrelated by a particular
operation you want to effect.
The following is a summary of mutex behaviors:
• A spin_mutex is nonscalable, unfair, nonreentrant, and spins in user space. It
would seem to be the worst of all possible worlds, except that it is very fast in
lightly contended situations. If you can design your program so that contention is
somehow spread out among many spin mutexes, you can improve performance over other kinds of mutexes. If a mutex is heavily contended, your algorithm will not scale anyway. Consider redesigning the algorithm instead of looking for a
more efficient lock.
• A queuing_mutex is scalable, fair, nonreentrant, and spins in user space. Use it
when scalability and fairness are important.
• A spin_rw_mutex and a queuing_rw_mutex are similar to spin_mutex and queuing_
mutex , but they additionally support reader locks.
spin_mutex is very fast in lightly contended situations; use it if you
need to protect a small section of code.