Intel Threading Building Blocks (TBB) supports C++11 lambda expressions.
== Loop Parallelization ==
The simplest form of scalable parallelism is a loop of iterations that can each run
simultaneously without interfering with each other. TBB supports this pattern with templates such as parallel_for, parallel_reduce, and parallel_scan.
'''parallel_for and parallel_reduce'''
Load-balanced, parallel execution of a fixed number of independent loop iterations
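For example, a minimal parallel_for sketch (the helper function scale_all and the scaling factor are illustrative, not from the original text):

<source lang="cpp">
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

// Double every element of a vector in parallel. Each iteration is
// independent, so TBB is free to split the range across worker threads.
void scale_all(std::vector<float>& a) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, a.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                a[i] *= 2.0f;
        });
}
</source>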
'''parallel_scan'''
A template function that computes a parallel prefix (y[i] = y[i-1] op x[i])
'''parallel_reduce'''
Applying a function such as sum, max, min, or logical AND across all the members of a
group is called a reduction operation. Doing a reduction in parallel can yield a different
answer from a serial reduction because of rounding. For instance, A+B+C+D+E+F
may be evaluated in serial as (((((A+B)+C)+D)+E)+F), whereas the parallel version
may compute ((A+B)+((C+D)+(E+F))). Ideally, the results would be the same, but if
rounding can occur, the answers will differ. Traditional C++ programs perform
reductions in loops.
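In TBB this becomes a parallel_reduce call; a minimal sketch of a parallel sum (the function name parallel_sum is illustrative):

<source lang="cpp">
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <functional>
#include <vector>

// Each subrange accumulates a partial sum; std::plus combines the
// partial sums as split subranges join back together.
float parallel_sum(const std::vector<float>& v) {
    return tbb::parallel_reduce(
        tbb::blocked_range<size_t>(0, v.size()),
        0.0f,                                    // identity value for +
        [&](const tbb::blocked_range<size_t>& r, float partial) -> float {
            for (size_t i = r.begin(); i != r.end(); ++i)
                partial += v[i];
            return partial;
        },
        std::plus<float>());
}
</source>

The order in which partial sums are combined is not fixed, which is exactly the rounding caveat described above.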
'''parallel_scan'''
A parallel_scan computes a parallel prefix, also known as a parallel scan. This computation
is an advanced concept in parallel computing that is sometimes useful in
scenarios that appear to have inherently serial dependencies.
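A sketch of a running-sum prefix scan using the Body form of parallel_scan (the class name SumScan is illustrative):

<source lang="cpp">
#include <tbb/parallel_scan.h>
#include <tbb/blocked_range.h>

// Body for y[i] = y[i-1] + x[i]. TBB may run a pre-scan pass over a
// subrange to compute partial sums before the final pass writes output;
// Tag::is_final_scan() distinguishes the two passes.
class SumScan {
    const int* x;
    int* y;
public:
    int sum;
    SumScan(const int* x_, int* y_) : x(x_), y(y_), sum(0) {}
    SumScan(SumScan& b, tbb::split) : x(b.x), y(b.y), sum(0) {}
    template <typename Tag>
    void operator()(const tbb::blocked_range<size_t>& r, Tag) {
        int tmp = sum;
        for (size_t i = r.begin(); i != r.end(); ++i) {
            tmp += x[i];
            if (Tag::is_final_scan())   // only the final pass writes output
                y[i] = tmp;
        }
        sum = tmp;
    }
    void reverse_join(SumScan& a) { sum = a.sum + sum; }
    void assign(SumScan& b) { sum = b.sum; }
};

void prefix_sum(const int* x, int* y, size_t n) {
    SumScan body(x, y);
    tbb::parallel_scan(tbb::blocked_range<size_t>(0, n), body);
}
</source>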
'''Partitioner Concept'''
Requirements for a type that decides whether a range should be operated on by a
task body or further split. The partitioner implements rules for deciding when a given range should no longer be
subdivided, but should be operated over as a whole by a task’s body.
The default behavior of the algorithms parallel_for, parallel_reduce, and parallel_scan
is to recursively split a range until no subrange remains that is divisible, as decided by the
function is_divisible of the Range Concept. The Partitioner Concept models rules for the
early termination of the recursive splitting of a range, providing the ability to change the
default behavior. A Partitioner object’s decision making is implemented using two functions:
a splitting constructor and the function should_execute_range.
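For example, a partitioner can be passed as an extra argument to the loop templates. The sketch below uses simple_partitioner, which keeps splitting until subranges reach the grain size (the grain size of 1000 is an arbitrary illustration):

<source lang="cpp">
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <tbb/partitioner.h>

void process(float* a, size_t n) {
    // simple_partitioner splits recursively until subranges are no
    // longer divisible, i.e. at most the grain size (1000 here).
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, n, 1000),
        [=](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                a[i] += 1.0f;
        },
        tbb::simple_partitioner());
}
</source>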
== Advanced Algorithms ==
'''parallel_while'''
A parallel_while<Body> performs parallel iteration over items. The processing to be
performed on each item is defined by a function object of type Body. The items are specified
in two ways:
• An initial stream of items
• Additional items that are added while the stream is being processed
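parallel_while was later superseded by parallel_do, which keeps the same two-way model of an initial item stream plus items added in flight. A minimal sketch using parallel_do (the halving rule is purely illustrative):

<source lang="cpp">
#include <tbb/parallel_do.h>
#include <vector>

// Each item is processed by the body; the feeder lets the body add
// new items to the stream while processing is underway.
void process_items(std::vector<int>& items) {
    tbb::parallel_do(items.begin(), items.end(),
        [](int item, tbb::parallel_do_feeder<int>& feeder) {
            if (item > 1)
                feeder.add(item / 2);  // an "additional item"
            // ... process item here ...
        });
}
</source>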
'''Working on the Assembly Line: Pipeline'''
Pipelining is a common parallel pattern that mimics a traditional manufacturing
assembly line. Data flows through a series of pipeline stages and each stage processes
the data in some way. Given an incoming stream of data, some of these stages
can operate in parallel, and others cannot.
Pipelined processing is common in multimedia and signal processing applications. A
classical thread-per-stage implementation suffers two problems:
• The speedup is limited to the number of stages.
• When a thread finishes a stage, it must pass its data to another thread.
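A sketch of a three-stage pipeline using the lambda-based parallel_pipeline form available in later TBB releases (the stages and the token limit of 4 are illustrative):

<source lang="cpp">
#include <tbb/pipeline.h>
#include <iostream>

void squares_pipeline() {
    int n = 10;
    tbb::parallel_pipeline(
        4,  // at most 4 items in flight at once
        // Serial input stage: emit the integers 10 down to 1.
        tbb::make_filter<void, int>(tbb::filter::serial_in_order,
            [&n](tbb::flow_control& fc) -> int {
                if (n == 0) { fc.stop(); return 0; }
                return n--;
            }) &
        // Parallel middle stage: items can be squared concurrently.
        tbb::make_filter<int, int>(tbb::filter::parallel,
            [](int x) { return x * x; }) &
        // Serial output stage: print results in input order.
        tbb::make_filter<int, void>(tbb::filter::serial_in_order,
            [](int x) { std::cout << x << '\n'; }));
}
</source>

Because the middle stage is declared parallel, multiple items can be processed at once, while the serial_in_order stages preserve input order; this is how a pipeline can exceed the thread-per-stage speedup limit noted above.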
'''parallel_sort'''
parallel_sort is a comparison sort with an average time complexity O(n log n).
When worker threads are available, parallel_sort creates subtasks that may be executed
concurrently. This sort provides an unstable sort of the sequence [begin1,
end1). Being an unstable sort means that it might not preserve the relative ordering of
elements with equal keys. The sort is deterministic; sorting the same sequence will produce the same result
each time. The requirements on the iterator and sequence are the same as for
std::sort.
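Usage mirrors std::sort; a short sketch:

<source lang="cpp">
#include <tbb/parallel_sort.h>
#include <functional>
#include <vector>

void sort_demo(std::vector<int>& v) {
    tbb::parallel_sort(v.begin(), v.end());                       // ascending
    tbb::parallel_sort(v.begin(), v.end(), std::greater<int>()); // descending
}
</source>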
== Containers ==
Intel Threading Building Blocks provides highly concurrent containers that permit
multiple threads to invoke a method simultaneously on the same container. At this
time, a concurrent queue, vector, and hash map are provided. All of these highly
concurrent containers can be used with this library, OpenMP, or raw threads.
 
'''concurrent_queue'''
 
The template class concurrent_queue<T> implements a concurrent queue with values
of type T. Multiple threads may simultaneously push and pop elements from the
queue. In a single-threaded program, a queue is a first-in first-out structure. But if multiple
threads are pushing and popping concurrently, the definition of first is uncertain.
The only guarantee of ordering offered by concurrent_queue is that if a thread pushes
multiple values, and another thread pops those same values, they will be popped in
the same order that they were pushed.
 
Pushing is provided by the push method. There are blocking and nonblocking flavors
of pop:
 
'''pop_if_present'''
 
This method is nonblocking: it attempts to pop a value, and if it cannot because
the queue is empty, it returns anyway.
 
'''pop'''
 
This method blocks until it pops a value. If a thread must wait for an item to
become available and it has nothing else to do, it should use pop(item) and not
while(!pop_if_present(item)) continue; because pop uses processor resources
more efficiently than the loop.
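A small sketch; note that in current TBB releases the nonblocking pop is spelled try_pop rather than pop_if_present:

<source lang="cpp">
#include <tbb/concurrent_queue.h>
#include <iostream>

void queue_demo() {
    tbb::concurrent_queue<int> q;
    for (int i = 0; i < 5; ++i)
        q.push(i);              // safe to call from many threads at once

    int value;
    // Nonblocking pop (pop_if_present in the original interface):
    while (q.try_pop(value))
        std::cout << value << '\n';
}
</source>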
 
 
'''concurrent_vector'''
A concurrent_vector<T> is a dynamically growable array of items of type T for which
it is safe to simultaneously access elements in the vector while growing it. However,
be careful not to let another task access an element that is under construction or is
otherwise being modified. For safe concurrent growing, concurrent_vector has two
methods for resizing that support common uses of dynamic arrays: grow_by and
grow_to_at_least. The index of the first element is 0. The method grow_by(n) enables
you to safely append n consecutive elements to a vector, and returns the index of the
first appended element. Each element is initialized with T().
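A sketch of safe concurrent growth (early TBB versions returned the index of the first new element from grow_by; later versions return an iterator to it, as assumed here):

<source lang="cpp">
#include <tbb/concurrent_vector.h>

void append_three(tbb::concurrent_vector<int>& v, int base) {
    // Atomically reserves three consecutive slots; other threads may be
    // growing or reading the vector at the same time.
    auto it = v.grow_by(3);
    for (int i = 0; i < 3; ++i)
        *it++ = base + i;   // each slot was default-initialized with T()
}
</source>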
== Memory Allocators ==
Threading Building Blocks offers two choices, both similar to the STL template class,
std::allocator:
'''scalable_allocator'''
 
This template offers scalability but does not completely protect against
false sharing. However, memory is returned to each thread from a separate pool,
which helps protect against false sharing if the memory is not shared with other threads.
 
'''cache_aligned_allocator'''
 
This template offers both scalability and protection against false sharing. It
addresses false sharing by making sure each allocation is done on a separate cache line.
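Both templates are drop-in replacements for std::allocator; for example:

<source lang="cpp">
#include <tbb/scalable_allocator.h>
#include <tbb/cache_aligned_allocator.h>
#include <vector>

// Per-thread memory pools, minimal space overhead:
std::vector<float, tbb::scalable_allocator<float>> a;

// Cache-line-aligned allocations, avoiding false sharing:
std::vector<float, tbb::cache_aligned_allocator<float>> b;
</source>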
 
== Mutual Exclusion ==
 
One of the key advantages of threading is the capability of tasks to share data and
other resources, such as open files, network connections, and the user interface. But
concurrency creates some uncertainty over when each task reads or writes resources,
leading to the need for mutual exclusion.
 
Threading Building Blocks offers two kinds of mutual exclusion:
 
'''Mutexes'''
 
These will be familiar to anyone who has used locks in other environments, and
they include common variants such as reader-writer locks.
 
'''Atomic operations'''
 
These are based on atomic operations offered by hardware processors, and they
provide a solution that is simpler and faster than mutexes in a limited set of situations.
These should be preferred when the circumstances make their use possible.
 
'''Mutexes'''
 
A mutex is a global variable that multiple tasks can access. Before entering code that
you do not want to execute concurrently, a task should request a lock on the mutex.
If the mutex is already locked, the task is stalled waiting. Once the lock is granted,
the task can proceed. The task should proceed quickly and release the lock on the
mutex promptly to avoid unnecessary stalls in other tasks.
If a mutex creates a lot of lock requests, it is considered highly contested. Highly
contested locks will result in many tasks being stalled and a sharp reduction in program
scalability. Avoiding highly contested locks through careful program design is
important.
A mutex, used properly, ensures that no task reads or writes a variable or other
resource when another task is writing it. Intel Threading Building Blocks mutexes
work with generic programming in C++, even in the presence of exceptions. Meeting
all these requirements is no small feat and takes some consideration to ensure
their proper usage.
In Threading Building Blocks, mutual exclusion is implemented by classes known as
mutexes and locks. A mutex is an object on which a task can acquire a lock. Only one
task at a time can have a lock on a mutex; other tasks have to wait their turn.
Mutual exclusion controls how many tasks can simultaneously run a region of code.
In general, you protect a region of code from concurrency when that code reads or
writes a small amount of memory that is interrelated by a particular operation you
want to perform atomically.
 
 
The following is a summary of mutex behaviors:
 
• A spin_mutex is nonscalable, unfair, nonreentrant, and spins in user space. It
would seem to be the worst of all possible worlds, except that it is very fast in
lightly contended situations. If you can design your program so that contention is
somehow spread out among many spin mutexes, you can improve performance over other kinds of mutexes. If a mutex is heavily contended, your algorithm will not scale anyway. Consider redesigning the algorithm instead of looking for a
more efficient lock.
 
• A queuing_mutex is scalable, fair, nonreentrant, and spins in user space. Use it
when scalability and fairness are important.
 
• A spin_rw_mutex and a queuing_rw_mutex are similar to spin_mutex and queuing_mutex, but they additionally support reader locks.

spin_mutex is very fast in lightly contended situations; use it if you need to protect a small section of code.
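A minimal sketch of a spin_mutex guarding a shared counter (the counter itself is illustrative):

<source lang="cpp">
#include <tbb/spin_mutex.h>

long counter = 0;
tbb::spin_mutex counter_mutex;

void safe_increment() {
    // scoped_lock acquires the mutex here and releases it when the
    // lock object is destroyed, even if an exception is thrown.
    tbb::spin_mutex::scoped_lock lock(counter_mutex);
    ++counter;
}
</source>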
'''Overview'''

• A mutex is a wrapper around the system's native mutual exclusion mechanism. On Windows systems, it is implemented on top of a CRITICAL_SECTION. On Linux systems, it is implemented on top of a pthread mutex. The advantages of using the wrapper are that it adds an exception-safe interface and provides an interface identical to the other mutexes in Threading Building Blocks, which makes it easy to swap in a different kind of mutex later if warranted by performance measurements. Future versions of Threading Building Blocks may be able to put tasks to sleep more often, which can be very desirable.

== Atomic Operations ==

Atomic operations are a fast and relatively easy alternative to mutexes. They do not suffer from the deadlock and convoying problems described earlier. Their main limitation is that current computer systems restrict them to fairly small data sizes: the largest is usually the size of the largest scalar, often a double-precision floating-point number. Atomic operations are also limited to a small set of operations supported by the underlying hardware processor. But sophisticated algorithms can accomplish a lot with these operations, so you should not pass up an opportunity to use an atomic operation in place of mutual exclusion. The class atomic<T> implements atomic operations in C++ style.

A classic use of atomic operations is thread-safe reference counting. Suppose x is a reference count of type int and the program needs to take some action when the count becomes zero. In single-threaded code, you could use a plain int for x and write --x; if(x==0) action(). But this method might fail for multithreaded code, because two tasks might interleave their operations.
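With atomic<T> the decrement and the test cannot interleave between tasks; a sketch, where action() is the hypothetical routine to run at zero:

<source lang="cpp">
#include <tbb/atomic.h>

void action();               // hypothetical: runs when the count hits zero

tbb::atomic<int> refcount;   // zero-initialized at namespace scope

void remove_reference() {
    // operator-- is atomic and returns the new value, so no two tasks
    // can both observe 0 for the same decrement.
    if (--refcount == 0)
        action();
}
</source>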
== Timing ==

Intel Threading Building Blocks provides a thread-safe and portable method to compute elapsed time. Many operating systems provide a fine-grained method to track the accounting of central processing unit (CPU) time, and most provide a way to track elapsed/wall-clock time, but the method of extracting an elapsed measurement with subsecond accuracy varies a great deal. The class tick_count in Threading Building Blocks provides a simple interface for measuring wall-clock time.
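A typical use, as a sketch:

<source lang="cpp">
#include <tbb/tick_count.h>
#include <iostream>

void timed_section() {
    tbb::tick_count t0 = tbb::tick_count::now();
    // ... the work being measured goes here ...
    tbb::tick_count t1 = tbb::tick_count::now();
    std::cout << "elapsed: " << (t1 - t0).seconds() << " seconds\n";
}
</source>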
== Task Scheduler ==

The library provides a task scheduler, which is the engine that drives the algorithm templates. You may also call it directly. This is worth considering if your application meets the criteria, described earlier, that make the default task scheduler inefficient.

Tasks are logical units of computation. The scheduler maps these onto physical threads. The mapping is non-preemptive. Each task has an execute() method. Once a thread starts running execute(), the task is bound to that thread until execute() returns. During that time, the thread services other tasks only when it waits on child tasks, at which time it may run the child tasks or, if there are no pending child tasks, service tasks created by other threads.

The task scheduler is intended for parallelizing computationally intensive work. Since task objects are not scheduled preemptively, they should not make calls that might block for long periods because, meanwhile, the blocked thread (and its associated processor) is precluded from servicing other tasks.

Potential parallelism is typically generated by a split/join pattern. Two basic patterns of split/join are supported. The most efficient is continuation passing, in which the programmer constructs an explicit "continuation" task rather than leaving it to the default scheduler to determine the next task.

'''Making Best Use of the Scheduler'''

This section explains useful programming techniques for scheduling tasks.

'''Recursive Chain Reaction'''

The scheduler works best with tree-structured task graphs, because that is where the strategy of breadth-first theft and depth-first work applies very well. Also, tree-structured task graphs allow fast creation of many tasks. For example, if a master task tries to create n children directly, it will take O(n) steps. But with tree-structured forking, it takes only O(log n) steps because some of the tasks created can go on to create subtasks.

Often, domains are not obviously tree-structured, but your program can easily map them to trees. For example, parallel_for works over an iteration space such as a sequence of integers. The template function parallel_for uses that definition to recursively map the iteration space onto a binary tree.

'''Continuation Passing'''

The spawn_and_wait_for_all method is a convenient way to wait for child tasks, but it incurs some inefficiency if a thread becomes idle. The idle thread attempts to keep busy by stealing tasks from other threads. The scheduler limits possible victim tasks to those deeper than the waiting task. This limit modifies the policy that the shallowest task should be chosen, and it restrains memory demands in worst-case scenarios. A way around the constraint is for the parent not to wait, but simply to spawn both children and return. The children are allocated not as children of the parent, but as children of the parent's continuation task, which is a task that runs when both children complete.

'''task_scheduler_init'''

A task_scheduler_init is either active or inactive. Each thread that uses a task should have one active task_scheduler_init object that stays active over the duration that the thread uses task objects. A thread may have more than one active task_scheduler_init at any given moment.

'''Task Scheduler Summary'''

The task scheduler works most efficiently for fork-join parallelism with lots of forks, so that task stealing can cause sufficient breadth-first behavior to occupy threads, which then conduct themselves in a depth-first manner until they need to steal more work.

The task scheduler is not the simplest possible scheduler because it is designed for speed. If you need to use it directly, it may be best to hide it behind a higher-level interface, such as the templates parallel_for, parallel_reduce, and so on. Some of the details to remember are:

• Always use new(allocation_method) T to allocate a task, where allocation_method is one of the allocation methods of the class task. Do not create local or file-scope instances of a task.

• Allocate all siblings before any of them start to run, unless you are using allocate_additional_child_of.

• Exploit continuation passing, scheduler bypass, and task recycling to squeeze out maximum performance.

• If a task completes and was not marked for reexecution, it is automatically destroyed. Also, its dependent's reference count is decremented, and if it hits 0, the dependent is automatically spawned.
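A minimal sketch of these rules: the task is allocated with the special new(allocation_method) form and destroyed automatically after execute() returns (class and function names are illustrative):

<source lang="cpp">
#include <tbb/task.h>
#include <tbb/task_scheduler_init.h>
#include <iostream>

// A trivial task. execute() is the method the scheduler runs; the task
// is destroyed automatically after it returns (see the rules above).
class HelloTask : public tbb::task {
    tbb::task* execute() {
        std::cout << "hello from a task\n";
        return nullptr;   // no explicit next task; let the scheduler choose
    }
};

void run_hello() {
    tbb::task_scheduler_init init;  // activate the scheduler for this thread
    // allocate_root creates a task with no parent; spawn_root_and_wait
    // runs it and blocks until it and its children complete.
    tbb::task& t = *new(tbb::task::allocate_root()) HelloTask;
    tbb::task::spawn_root_and_wait(t);
}
</source>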
'''Benefits of TBB'''

- Intel Threading Building Blocks enables you to specify tasks instead of threads; let the runtime library worry about how many threads to use.
- Intel Threading Building Blocks targets threading for performance.
- Intel Threading Building Blocks is compatible with other threading packages.
- Intel Threading Building Blocks emphasizes scalable, data-parallel programming.
- Intel Threading Building Blocks relies on generic programming.
- Intel Threading Building Blocks is committed to compiler independence, processor independence, and OS independence.

'''TBB is a collection of components for parallel programming:'''

- ''Basic algorithms'': parallel_for, parallel_reduce, parallel_scan
- ''Advanced algorithms'': parallel_while, parallel_do, parallel_pipeline, parallel_sort
- ''Containers'': concurrent_queue, concurrent_priority_queue, concurrent_vector, concurrent_hash_map
- ''Memory allocation'': scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator
- ''Mutual exclusion'': mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive_mutex
- ''Atomic operations'': fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store