Introduction to Intel Threading Building Blocks
Intel Threading Building Blocks offers a rich and complete approach to expressing parallelism in a C++ program. It is a library that helps you leverage multi-core processor performance without having to be a threading expert. Threading Building Blocks is not just a threads-replacement library; it represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for performance and scalability.
Why Use It: Intel® Threading Building Blocks (Intel® TBB) lets you easily write parallel C++ programs that take full advantage of multicore performance, that are portable and composable, and that have future-proof scalability.
What Is It: A widely used C++ template library for task parallelism.
Primary Features: Parallel algorithms and data structures, scalable memory allocation, and task scheduling.
Reasons to Use: A rich feature set for general-purpose parallelism; C++ on Windows*, Linux*, OS X*, and other OSes.
Key Benefits of Using Intel TBB
Intel TBB differs from typical threading packages in the following ways:
Enables you to specify logical parallelism instead of threads.
Intel TBB has a runtime library that automatically maps logical parallelism onto threads in a way that makes efficient use of processor resources, thereby making parallel programming less tedious and more efficient.
Targets threading for performance.
Intel TBB focuses on the particular goal of parallelizing computationally intensive work, delivering higher-level, simpler solutions.
Compatible with other threading packages.
Intel TBB can coexist seamlessly with other threading packages, giving you the flexibility to not touch your legacy code but still use Intel TBB for new implementations.
Emphasizes scalable, data parallel programming.
Intel TBB emphasizes data-parallel programming, enabling multiple threads to work on different parts of a collection. Data-parallel programming scales well to larger numbers of processors by dividing the collection into smaller pieces. With data-parallel programming, program performance increases as you add processors.
Relies on generic programming.
Intel TBB uses generic programming. The essence of generic programming is writing the best possible algorithms with the fewest constraints. The C++ Standard Template Library (STL) is a good example of generic programming in which the interfaces are specified by requirements on types.
Threading Building Blocks enables you to specify tasks instead of threads
Most threading packages require you to create, join, and manage threads. Programming directly in terms of threads can be tedious and can lead to inefficient programs because threads are low-level, heavy constructs that are close to the hardware. Direct programming with threads forces you to do the work to efficiently map logical tasks onto threads. In contrast, the Threading Building Blocks runtime library automatically schedules tasks onto threads in a way that makes efficient use of processor resources. The runtime is very effective at load-balancing the many tasks you will be specifying. By avoiding programming in a raw native thread model, you can expect better portability, easier programming, more understandable source code, and better performance and scalability in general. Indeed, the alternative of using raw threads directly would amount to programming in the assembly language of parallel programming. It may give you maximum flexibility, but with many costs.
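As a minimal sketch of this task-based style (the per-element function Foo and the wrapper name are placeholders for real work), the loop body is handed to tbb::parallel_for, and the runtime decides how to split the index range into tasks and map them onto threads:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    void Foo(float& x) { x *= 2.0f; }  // placeholder for real per-element work

    void ParallelApplyFoo(std::vector<float>& a) {
        // Express the loop as tasks over chunks of the index space;
        // the TBB runtime maps chunks onto worker threads for you.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, a.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    Foo(a[i]);
            });
    }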
Threading Building Blocks targets threading for performance
Most general-purpose threading packages support many different kinds of threading, such as threading for asynchronous events in graphical user interfaces. As a result, general-purpose packages tend to be low-level tools that provide a foundation, not a solution. Instead, Threading Building Blocks focuses on the particular goal of parallelizing computationally intensive work, delivering higher-level, simpler solutions.
Threading Building Blocks is compatible with other threading packages
Threading Building Blocks can coexist seamlessly with other threading packages. This is very important because it does not force you to pick among Threading Building Blocks, OpenMP, or raw threads for your entire program. You are free to add Threading Building Blocks to programs that have threading in them already. You can also add an OpenMP directive, for instance, somewhere else in your program that uses Threading Building Blocks. For a particular part of your program, you will use one method, but in a large program, it is reasonable to anticipate the convenience of mixing various techniques. It is fortunate that Threading Building Blocks supports this. Using or creating libraries is a key reason for this flexibility, particularly because libraries are often supplied by others. For instance, Intel’s Math Kernel Library (MKL) and Integrated Performance Primitives (IPP) library are implemented internally using OpenMP. You can freely link a program using Threading Building Blocks with the Intel MKL or Intel IPP library.
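As a hedged sketch of what such mixing might look like (the two function names are hypothetical), one routine can use an OpenMP directive while another uses Threading Building Blocks, and both can be linked into the same program:

    #include <tbb/parallel_for.h>

    // One routine parallelized with OpenMP...
    void ScaleOpenMP(float* a, int n) {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0f;
    }

    // ...and another with Intel TBB; the two runtimes coexist in one program.
    void ScaleTBB(float* a, int n) {
        tbb::parallel_for(0, n, [=](int i) { a[i] *= 2.0f; });
    }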
Threading Building Blocks emphasizes scalable, data-parallel programming
Breaking a program into separate functional blocks and assigning a separate thread to each block is a solution that usually does not scale well because, typically, the number of functional blocks is fixed. In contrast, Threading Building Blocks emphasizes data-parallel programming, enabling multiple threads to work most efficiently together. Data-parallel programming scales well to larger numbers of processors by dividing a data set into smaller pieces. With data-parallel programming, program performance increases (scales) as you add processors. Threading Building Blocks also avoids classic bottlenecks, such as a global task queue that each processor must wait for and lock in order to get a new task.
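A minimal sketch of this data-parallel style, assuming a simple summation over a std::vector<float> (the function name is illustrative): tbb::parallel_reduce divides the range into pieces, sums each piece in parallel, and combines the partial results.

    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>
    #include <functional>
    #include <vector>

    float ParallelSum(const std::vector<float>& a) {
        return tbb::parallel_reduce(
            tbb::blocked_range<size_t>(0, a.size()),
            0.0f,                                   // identity value
            [&](const tbb::blocked_range<size_t>& r, float partial) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    partial += a[i];                // sum this task's piece
                return partial;
            },
            std::plus<float>());                    // combine partial sums
    }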
Threading Building Blocks relies on generic programming
Traditional libraries specify interfaces in terms of specific types or base classes. Instead, Threading Building Blocks uses generic programming, which is defined in Chapter 12. The essence of generic programming is to write the best possible algorithms with the fewest constraints. The C++ Standard Template Library (STL) is a good example of generic programming in which the interfaces are specified by requirements on types. For example, C++ STL has a template function that sorts a sequence abstractly, defined in terms of iterators on the sequence. Generic programming enables Threading Building Blocks to be flexible yet efficient. The generic interfaces enable you to customize components to your specific needs.
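By way of illustration, tbb::parallel_sort mirrors the STL's generic, iterator-based interface: it sorts any random-access sequence whose element type supports the given comparison, with no required base class (the wrapper function here is illustrative).

    #include <tbb/parallel_sort.h>
    #include <functional>
    #include <vector>

    void SortExamples(std::vector<int>& v) {
        tbb::parallel_sort(v.begin(), v.end());                       // ascending
        tbb::parallel_sort(v.begin(), v.end(), std::greater<int>());  // descending
    }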
Intel® Threading Building Blocks (Intel® TBB) makes parallel performance and scalability easily accessible to software developers who are writing loop- and task-based applications. Developers can build robust applications that abstract platform details and threading mechanisms while achieving performance that scales with increasing core count.
Rich Feature Set for Parallelism
Intel TBB includes a rich set of components for threading performance and productivity.
Parallel algorithms and data structures
Generic Parallel Algorithms
An efficient, scalable way to exploit the power of multi-core without having to start from scratch.
Flow Graph
A set of classes to express parallelism as a graph of compute dependencies and/or data flow.
Concurrent Containers
Concurrent access, and a scalable alternative to containers that are externally locked for thread safety.
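For instance, here is a sketch of a concurrent count table using tbb::concurrent_hash_map (the table and function names are illustrative): the accessor object locks just the one element while it is updated, rather than the whole container.

    #include <tbb/concurrent_hash_map.h>
    #include <string>

    typedef tbb::concurrent_hash_map<std::string, int> CountTable;

    void Increment(CountTable& table, const std::string& key) {
        CountTable::accessor a;   // holds a per-element lock while in scope
        table.insert(a, key);     // inserts {key, 0} if the key is absent
        a->second += 1;           // safe update under the accessor's lock
    }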
Memory allocation and task scheduling
Task Scheduler
Sophisticated work scheduling engine that empowers parallel algorithms and the flow graph.
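A small sketch of handing independent work to the scheduler with tbb::task_group (the two work functions are placeholders):

    #include <tbb/task_group.h>

    void WorkA();  // placeholder tasks
    void WorkB();

    void RunBoth() {
        tbb::task_group g;
        g.run([] { WorkA(); });   // spawn as a task; may run on another thread
        g.run([] { WorkB(); });
        g.wait();                 // block until both tasks finish
    }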
Memory Allocation
Scalable memory manager and false-sharing-free allocators.
Threads and synchronization
Synchronization Primitives
Atomic operations, a variety of mutexes with different properties, and condition variables.
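As a brief sketch (the function and variable names are hypothetical), an atomic counter handles a lock-free update while a spin_mutex guards a short critical section:

    #include <tbb/atomic.h>
    #include <tbb/spin_mutex.h>
    #include <vector>

    tbb::atomic<long> hits;        // zero-initialized at namespace scope
    tbb::spin_mutex log_mutex;     // cheap lock for short critical sections

    void Record(long value, std::vector<long>& log) {
        hits.fetch_and_add(1);     // lock-free increment
        tbb::spin_mutex::scoped_lock lock(log_mutex);  // released at scope exit
        log.push_back(value);
    }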
Timers and Exceptions
Thread-safe timers and exception classes
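For example, tbb::tick_count provides a thread-safe wall-clock timer (the surrounding function is illustrative):

    #include <tbb/tick_count.h>

    void TimedSection() {
        tbb::tick_count t0 = tbb::tick_count::now();
        // ... work to be measured ...
        tbb::tick_count t1 = tbb::tick_count::now();
        double seconds = (t1 - t0).seconds();  // elapsed wall-clock time
        (void)seconds;
    }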
Threads
OS API wrappers
Thread Local Storage
Efficient implementation for an unlimited number of thread-local variables.
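A sketch using tbb::enumerable_thread_specific, assuming we count even elements of a vector (the function name is illustrative): each thread accumulates into its own lazily created copy, and the copies are merged at the end.

    #include <tbb/enumerable_thread_specific.h>
    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <functional>
    #include <vector>

    long CountEvens(const std::vector<int>& v) {
        tbb::enumerable_thread_specific<long> counter(0);
        tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                long& local = counter.local();   // this thread's private copy
                for (size_t i = r.begin(); i != r.end(); ++i)
                    if (v[i] % 2 == 0) ++local;
            });
        return counter.combine(std::plus<long>());  // merge per-thread counts
    }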
Conditional Numerical Reproducibility
Ensure deterministic associativity for floating-point arithmetic results with the new Intel TBB template function ‘parallel_deterministic_reduce’.
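A minimal sketch (the function name is illustrative): the call mirrors tbb::parallel_reduce, but the partitioning, and therefore the order in which floating-point partial sums are combined, is reproducible from run to run for the same input.

    #include <tbb/parallel_reduce.h>   // also declares parallel_deterministic_reduce
    #include <tbb/blocked_range.h>
    #include <functional>
    #include <vector>

    float DeterministicSum(const std::vector<float>& a) {
        return tbb::parallel_deterministic_reduce(
            tbb::blocked_range<size_t>(0, a.size()),
            0.0f,
            [&](const tbb::blocked_range<size_t>& r, float partial) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    partial += a[i];
                return partial;
            },
            std::plus<float>());
    }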
Supports C++11 Lambda
Intel TBB can be used with C++11 compilers and supports lambda expressions. For developers using parallel algorithms, lambda expressions reduce the time and code needed by removing the requirement for separate objects or classes.
Flow Graph Designer
Computing systems are becoming increasingly heterogeneous, and developing for heterogeneous systems can be challenging because of divergent programming models and tools. Intel TBB flow graph provides a single interface that enables intra-node distributed memory programming, thereby simplifying communication and load balancing across heterogeneous devices.
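A minimal flow graph sketch: two function nodes connected by an edge, where messages pushed into the first node are squared and then printed. The node bodies and names here are illustrative.

    #include <tbb/flow_graph.h>
    #include <iostream>

    int main() {
        tbb::flow::graph g;
        // Node that squares each incoming integer; may run instances in parallel.
        tbb::flow::function_node<int, int> squarer(g, tbb::flow::unlimited,
            [](int v) { return v * v; });
        // Serial node that prints each result it receives.
        tbb::flow::function_node<int, tbb::flow::continue_msg> printer(g, tbb::flow::serial,
            [](int v) { std::cout << v << "\n"; return tbb::flow::continue_msg(); });
        tbb::flow::make_edge(squarer, printer);  // squarer's output feeds printer
        for (int i = 0; i < 10; ++i)
            squarer.try_put(i);                  // inject messages into the graph
        g.wait_for_all();                        // wait until all messages drain
        return 0;
    }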
Flow Graph Designer supports this work in two ways:
1. As an analyzer, it provides capabilities to collect and visualize execution traces from Intel TBB flow graph applications. From Flow Graph Designer, users can explore the topology of their graphs, interact with a timeline of node executions, and project important statistics onto the nodes of their graphs.
2. As a designer, it provides the ability to visually create Intel TBB flow graph diagrams and then generate C++ stubs as a starting point for further development.
Overview
- Intel Threading Building Blocks (Intel TBB) is a C++ library that simplifies threading for performance
- Move the level at which you program from threads to tasks
- Let the run-time library worry about how many threads to use, scheduling, caches, etc.
- Committed to: compiler independence, processor independence, OS independence
Benefits of TBB
- Intel Threading Building Blocks enables you to specify tasks instead of threads
- Intel Threading Building Blocks targets threading for performance
- Intel Threading Building Blocks is compatible with other threading packages
- Intel Threading Building Blocks emphasizes scalable, data-parallel programming
- Intel Threading Building Blocks relies on generic programming
TBB is a collection of components for parallel programming:
- Basic algorithms: parallel_for, parallel_reduce, parallel_scan
- Advanced algorithms: parallel_while, parallel_do, parallel_pipeline, parallel_sort
- Containers: concurrent_queue, concurrent_priority_queue, concurrent_vector, concurrent_hash_map
- Memory allocation: scalable_malloc, scalable_free, scalable_realloc, scalable_calloc, scalable_allocator, cache_aligned_allocator
- Mutual exclusion: mutex, spin_mutex, queuing_mutex, spin_rw_mutex, queuing_rw_mutex, recursive_mutex
- Atomic operations: fetch_and_add, fetch_and_increment, fetch_and_decrement, compare_and_swap, fetch_and_store
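To tie a few of these components together, here is a hedged sketch that combines a basic algorithm, a concurrent container, and the scalable allocator (linking against the tbbmalloc library is assumed for the allocator):

    #include <tbb/parallel_for.h>
    #include <tbb/concurrent_vector.h>
    #include <tbb/scalable_allocator.h>
    #include <vector>

    int main() {
        // Containers: a vector that tolerates concurrent push_back.
        tbb::concurrent_vector<int> squares;
        // Basic algorithms: compact parallel_for over an integer range.
        tbb::parallel_for(0, 1000, [&](int i) {
            squares.push_back(i * i);            // thread-safe growth
        });
        // Memory allocation: an STL container backed by the scalable allocator.
        std::vector<int, tbb::scalable_allocator<int>> copy(squares.begin(), squares.end());
        return copy.size() == 1000 ? 0 : 1;
    }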