===Creating and executing Threads===
====OpenMP====
Inside a declared OpenMP parallel region, if the number of threads is not specified via the OMP_NUM_THREADS environment variable or the omp_set_num_threads() library routine, OpenMP automatically decides how many threads are needed to execute the parallel code.
An issue with this approach is that OpenMP is unaware of how many threads a CPU can actually support. As a result, OpenMP may create 4 threads for a single-core processor, which can degrade performance.
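For instance, a minimal sketch (assuming a compiler with OpenMP support and <omp.h>) that reports the team size the runtime chose:

 #include <iostream>
 #include <omp.h>
 
 int main() {
     // without num_threads() or OMP_NUM_THREADS, the runtime picks the team size
     #pragma omp parallel
     {
         #pragma omp single
         std::cout << "OpenMP chose " << omp_get_num_threads() << " threads\n";
     }
     return 0;
 }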
====C++ 11====
C++ 11 threads, on the contrary, always require the programmer to specify the number of threads for a parallel region. If it is not specified by user input or hard-coded, the number of threads supported by a CPU can be accurately determined via the std::thread::hardware_concurrency() function.
OpenMP automatically decides how work is divided among threads. C++ 11 threads require the developer to distribute the work manually, typically within a for loop block. Threads are created by initializing the std::thread class and specifying a function or any other callable object within the constructor.
 
After the initial creation and execution of a thread, the main thread must either detach or join the thread.
The C++ 11 standard library offers these two member functions for joining or detaching threads.
* std::thread::join - waits for the thread to finish execution. Once a thread is created, another thread can wait for it to finish.
* std::thread::detach - allows the thread to execute in the background independently from the main thread. The thread continues execution without blocking or synchronizing in any way and terminates without relying on the main thread.
 
Example of native thread creation and synchronization using C++ 11

 int numThreads = std::thread::hardware_concurrency();
 std::vector<std::thread> threads(numThreads);
 for (int ID = 0; ID < numThreads; ID++)
     threads[ID] = std::thread(function);
 for (auto& t : threads)
     t.join();
 
Each created thread can then be synchronized with the main thread.
Unlike OpenMP, C++ 11 threads and language-native threads unfortunately lack automatic work sharing. In order to parallelize a loop using std threads, it is the programmer's responsibility to calculate the range of iterations each thread will process. This is usually done using SPMD techniques, as the sketch below illustrates.
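A minimal sketch of this technique, using an illustrative block partitioning of the iteration space (the work function here is a hypothetical placeholder):

 #include <thread>
 #include <vector>
 
 void work(int start, int end) { /* process iterations [start, end) */ }
 
 int main() {
     const int n = 1000;
     int numThreads = std::thread::hardware_concurrency();
     if (numThreads == 0) numThreads = 4;  // fallback if the count cannot be detected
     std::vector<std::thread> threads;
     for (int ID = 0; ID < numThreads; ID++) {
         // each thread receives its own contiguous block of the iteration space
         int start = ID * n / numThreads;
         int end = (ID + 1) * n / numThreads;
         threads.push_back(std::thread(work, start, end));
     }
     for (auto& t : threads)
         t.join();
     return 0;
 }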
====Synchronization====
C++ 11 and OpenMP are designed to avoid race conditions and to share data between threads in various ways.
====Shared Memory====
=====OpenMP=====
In OpenMP, variables declared outside a parallel region are shared between all threads by default; the shared and private clauses control which variables are visible to all threads and which ones each thread receives a private copy of.
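A minimal illustrative sketch of the two clauses (the values are arbitrary):

 #include <iostream>
 
 int main() {
     int sum = 0;
     int tmp;
     #pragma omp parallel for shared(sum) private(tmp)
     for (int i = 0; i < 100; i++) {
         tmp = i * i;        // each thread works on its own private copy of tmp
         #pragma omp atomic  // sum is shared, so updates must be synchronized
         sum += tmp;
     }
     std::cout << sum << std::endl;  // always 328350
     return 0;
 }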
=====C++ 11=====
C++ 11 provides the std::atomic class template for race-free access to a shared variable:

 std::atomic<type> var_name;
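A minimal sketch of several threads incrementing a shared atomic counter:

 #include <atomic>
 #include <iostream>
 #include <thread>
 #include <vector>
 
 std::atomic<int> counter(0);
 
 int main() {
     std::vector<std::thread> threads;
     for (int i = 0; i < 4; i++)
         threads.push_back(std::thread([] {
             for (int j = 0; j < 1000; j++)
                 counter++;  // atomic increment - no data race
         }));
     for (auto& t : threads)
         t.join();
     std::cout << counter << std::endl;  // always 4000
     return 0;
 }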
====Mutual Exclusion====
=====OpenMP=====
OpenMP offers multiple solutions for handling mutual exclusion. Scoped locking may be implemented using the omp_set_lock and omp_unset_lock runtime routines to block threads.

Example of scoped locking

 omp_lock_t lock;
 omp_init_lock(&lock);
 int i = 0;
 #pragma omp parallel num_threads(8)
 {
     omp_set_lock(&lock);
     i++;
     omp_unset_lock(&lock);
 }
 omp_destroy_lock(&lock);

A lock is somewhat similar to a critical section, as it guarantees that some instructions can only be performed by one thread at a time: with a lock you make sure that some data elements can only be touched by one thread at a time. OpenMP offers an easier solution for mutual exclusion and preventing race conditions within its synchronization constructs, as the programmer does not have to worry about initializing and destroying locks:
* critical - a region to be executed by only one thread at a time
* atomic - a memory location to be updated by only one thread at a time
A critical section works by acquiring a lock, which carries a substantial overhead. Furthermore, while one thread is in a critical section, all other threads attempting to enter it are blocked. A critical region can be implemented as follows

 #pragma omp critical
 {
     i++;
 }

An atomic region is implemented just like a critical region, only the critical construct is replaced by an atomic construct. An atomic section has much lower overhead than a critical section, as it does not require locking and unlocking operations; it takes advantage of hardware-provided atomic increment operations.
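For instance, the same increment written with the atomic construct:

 #pragma omp atomic
 i++;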
=====C++ 11=====
The C++ 11 thread library provides the mutex and atomic classes to support mutual exclusion and synchronization. <br>
The mutex class is a synchronization primitive that can be used to protect shared data from being accessed by multiple threads.
std::mutex is usually not accessed directly, instead std::unique_lock and std::lock_guard are used to manage locking.
<br>
Mutex offers these member functions for controlling locking
* lock - locks the mutex, blocks if the mutex is not available
* unlock - unlocks the mutex
* try_lock - tries to lock the mutex, returns immediately if the mutex is not available
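A minimal illustrative sketch of try_lock:

 #include <iostream>
 #include <mutex>
 
 std::mutex m;
 
 int main() {
     if (m.try_lock()) {  // returns true and locks only if the mutex was free
         std::cout << "acquired the lock\n";
         m.unlock();
     } else {
         std::cout << "mutex busy, doing something else\n";
     }
     return 0;
 }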
Example of thread locking/blocking (shared_output here is sketched as a minimal mutex-guarded print function)

 #include <iostream>
 #include <mutex>
 #include <thread>
 
 std::mutex m;
 
 void shared_output(const char* name, int i) {
     std::lock_guard<std::mutex> lock(m);  // blocks other callers until released
     std::cout << name << ": " << i << std::endl;
 }
 
 int main() {
     std::thread t([] {
         for (int i = 1000; i > 0; i--)
             shared_output("worker thread", i);
     });
     for (int i = 1000; i > 0; i--)
         shared_output("main thread", i);
     t.join();
     return 0;
 }
===Asynchronous Multi-Threading===
A future is an object that can retrieve a value from some provider object (also known as a promise) or function. Simply put, in the case of multithreading, a future object waits until its associated thread has completed and then stores its return value.
To retrieve or construct a future object, these functions may be used.
* std::async
* std::promise::get_future
* std::packaged_task::get_future
However, a future object can only be used if it is in a valid state. Default-constructed future objects are not valid; a future only acquires a valid state when it is obtained from one of the providers above, such as the std::async template function.
A std::future references a shared state that cannot be shared with other asynchronous return objects. If multiple threads need to wait for the same shared state, the std::shared_future class template should be used.
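A minimal sketch of retrieving a result through a future returned by std::async:

 #include <future>
 #include <iostream>
 
 int compute() { return 42; }
 
 int main() {
     // std::async may run compute() on a separate thread
     std::future<int> result = std::async(std::launch::async, compute);
     // get() blocks until the asynchronous result is ready
     std::cout << "result = " << result.get() << std::endl;
     return 0;
 }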
OpenMP unfortunately does not support asynchronous multi-threading, as it is designed for parallelism, not concurrency.
===Programming Models===
====SPMD====
An example of the SPMD programming model in std threads, using an atomic variable to accumulate the result

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <chrono>
 #include <vector>
 #include <thread>
 #include <atomic>
 using namespace std::chrono;
 
 std::atomic<double> pi(0.0);
 
 void reportTime(const char* msg, steady_clock::duration span) {
     auto ms = duration_cast<milliseconds>(span);
     std::cout << msg << " - took - " << ms.count() << " milliseconds" << std::endl;
 }
 
 // SPMD: thread ID processes iterations ID, ID + nthrds, ID + 2 * nthrds, ...
 void run(int ID, double stepSize, int nthrds, int n) {
     double x;
     double sum = 0.0;
     for (int i = ID; i < n; i = i + nthrds) {
         x = (i + 0.5) * stepSize;
         sum += 4.0 / (1.0 + x * x);
     }
     sum = sum * stepSize;
     // C++ 11 has no atomic fetch-add for double, so use a compare-exchange loop
     double expected = pi.load();
     while (!pi.compare_exchange_weak(expected, expected + sum));
 }
 
 int main(int argc, char** argv) {
     if (argc != 3) {
         std::cerr << argv[0] << ": invalid number of arguments\n";
         return 1;
     }
     int n = atoi(argv[1]);
     int numThreads = atoi(argv[2]);
     steady_clock::time_point ts, te;
 
     // calculate pi by integrating the area under 1/(1 + x^2) in n steps
     ts = steady_clock::now();
     std::vector<std::thread> threads(numThreads);
     double stepSize = 1.0 / (double)n;
     for (int ID = 0; ID < numThreads; ID++)
         threads[ID] = std::thread(run, ID, stepSize, numThreads, n);
     for (int i = 0; i < numThreads; i++)
         threads[i].join();
     te = steady_clock::now();
 
     std::cout << "n = " << n << std::fixed << std::setprecision(15) <<
         "\n pi(exact) = " << 3.141592653589793 <<
         "\n pi(calcd) = " << pi << std::endl;
     reportTime("Integration", te - ts);
 
     // terminate
     char c;
     std::cout << "Press Enter key to exit ... ";
     std::cin.get(c);
 }

===Question & Answer: C++ 11 Threads and OpenMP compatibility===
Can one safely use C++11 multi-threading as well as OpenMP in one and the same program but without
interleaving them (i.e. no OpenMP statement in any code passed to C++11 concurrent features and no
C++11 concurrency in threads spawned by OpenMP)?
 
On some platforms an efficient implementation can only be achieved if the OpenMP run-time is the only threading run-time in control of the process's threads. Whether mixing threading models is safe is therefore implementation-specific; x86 implementations tend to support it, as x86 is usually considered an "experimental" platform (other vendors are usually much more conservative).
===Conclusion===
In conclusion, while OpenMP is and continues to be a viable option for multi-threading, it lacks some of the outlined features as well as low-level control. While C++ 11 standard library multi-threading can be more difficult to learn, it is supported by virtually all compilers and offers low-level interaction between hardware threads.
====OpenMP code====

 //Workshop 3 using scan and reduce with OpenMP
 template <typename T, typename R, typename C, typename S>
 int scan(
     const T* in,  // source data
     T* out,       // output data
     int size,     // size of source, output data sets
     R reduce,     // reduction expression
     C combine,    // combine expression
     S scan_fn,    // scan function (exclusive or inclusive)
     T initial     // initial value
     )
 {
     int nthreads = 1;
     if (size > 0) {
         // requested number of tiles
         int max_threads = omp_get_max_threads();
         T* reduced = new T[max_threads];
         T* scanRes = new T[max_threads];
         #pragma omp parallel
         {
             int ntiles = omp_get_num_threads(); // number of tiles
             int itile = omp_get_thread_num();
             int tile_size = (size - 1) / ntiles + 1;
             int last_tile = ntiles - 1;
             int last_tile_size = size - last_tile * tile_size;
             if (itile == 0)
                 nthreads = ntiles;
             // step 1 - each thread reduces its own tile separately
             reduced[itile] = reduce(in + itile * tile_size,
                 itile == last_tile ? last_tile_size : tile_size, combine, T(0));
             #pragma omp barrier
             // step 2 - perform exclusive scan on the tile reductions,
             //          storing the results in scanRes[] (excl_scan is defined elsewhere in the workshop)
             #pragma omp single
             excl_scan(reduced, scanRes, ntiles, combine, T(0));
             // step 3 - each thread scans its own tile separately using scanRes[]
             scan_fn(in + itile * tile_size, out + itile * tile_size,
                 itile == last_tile ? last_tile_size : tile_size, combine, scanRes[itile]);
         }
         delete[] reduced;
         delete[] scanRes;
     }
     return nthreads;
 }

====C++ 11 code====

 #include <iostream>
 #include <omp.h>
 #include <chrono>
 #include <vector>
 #include <thread>
 using namespace std;
 
 void doNothing() {}
 
 double run(int algorithmToRun) {
     auto startTime = std::chrono::system_clock::now();
     for (int j = 1; j < 100000; ++j) {
         if (algorithmToRun == 1) {
             // C++ 11: create and join 16 threads per iteration
             vector<thread> threads;
             for (int i = 0; i < 16; i++)
                 threads.push_back(thread(doNothing));
             for (auto& t : threads)
                 t.join();
         }
         else if (algorithmToRun == 2) {
             // OpenMP: distribute the same work over 16 threads
             #pragma omp parallel for num_threads(16)
             for (int i = 0; i < 16; i++)
                 doNothing();
         }
     }
     auto endTime = std::chrono::system_clock::now();
     std::chrono::duration<double> elapsed_seconds = endTime - startTime;
     return elapsed_seconds.count();
 }
 
 int main() {
     double cppt = run(1);
     double ompt = run(2);
     cout << cppt << endl;
     cout << ompt << endl;
     return 0;
 }
