==Asynchronous Multi-Threading==
===Creating and executing Threads===
====OpenMP====
Inside a declared OpenMP parallel region, if the number of threads is not specified via the environment variable OMP_NUM_THREADS or the library routine omp_set_num_threads(), OpenMP automatically decides how many threads to use to execute the parallel code.
An issue with this approach is that the number of threads OpenMP chooses may not match what the CPU can actually support. For example, OpenMP may create 4 threads on a single-core processor, which can degrade performance.
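For illustration, a minimal sketch of an OpenMP parallel region (not taken from the original workshop code): if the team size is not requested explicitly, the run-time picks it on its own.

 #include <iostream>
 #include <omp.h>
 int main() {
     // omp_set_num_threads(4);   // optional: request a specific team size
     #pragma omp parallel
     {
         // each thread reports its id and the team size chosen by the run-time
         // (output from different threads may interleave)
         std::cout << "thread " << omp_get_thread_num()
                   << " of " << omp_get_num_threads() << std::endl;
     }
     return 0;
 }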
====C++ 11====
C++11 threads, on the contrary, always require the program to specify the number of threads to use for a parallel region. If this is not supplied by user input or hard-coded, the number of threads supported by the CPU can be determined accurately via the std::thread::hardware_concurrency() function.
OpenMP automatically manages the creation and scheduling of its threads. With C++11 threads, the developer launches each thread explicitly, typically inside a for loop. Threads are created by constructing a std::thread object and passing a function or any other callable object to its constructor.
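A minimal sketch (the worker function is illustrative) of launching one std::thread per hardware thread reported by std::thread::hardware_concurrency() and joining them:

 #include <iostream>
 #include <thread>
 #include <vector>
 void work(int id) {
     std::cout << "hello from thread " << id << std::endl; // output may interleave
 }
 int main() {
     unsigned numThreads = std::thread::hardware_concurrency(); // threads supported by the CPU
     if (numThreads == 0) numThreads = 2;                       // the call may return 0 if unknown
     std::vector<std::thread> threads;
     for (unsigned id = 0; id < numThreads; id++)
         threads.push_back(std::thread(work, id)); // construct std::thread with a callable and its arguments
     for (auto& t : threads)
         t.join();                                 // wait for every thread to finish
     return 0;
 }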
===Mutual Exclusion===
====OpenMP====
OpenMP offers an easier solution for mutual exclusion and preventing race conditions within its constructs, as the programmer does not have to worry about initializing and destroying locks. Two directives are commonly used (see the sketch below):
* critical - a region to be executed by only one thread at a time
* atomic - a memory location to be updated by only one thread at a time
A critical section works by acquiring a lock, which carries substantial overhead. Furthermore, while one thread is inside a critical section, all other threads are blocked from entering it.
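A minimal sketch contrasting the two directives (the shared counters are purely illustrative):

 #include <iostream>
 #include <omp.h>
 int main() {
     int countCritical = 0;
     int countAtomic = 0;
     #pragma omp parallel
     {
         #pragma omp critical
         {
             countCritical++;   // the whole block is executed by one thread at a time
         }
         #pragma omp atomic
         countAtomic++;         // only this single memory update is protected
     }
     std::cout << countCritical << " " << countAtomic << std::endl; // both equal the team size
     return 0;
 }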
====C++ 11====
The C++11 thread library provides the mutex class to support mutual exclusion and synchronization. <br>
The mutex class is a synchronization primitive that can be used to protect shared data from being accessed by multiple threads at the same time.
std::mutex is usually not accessed directly; instead, std::unique_lock and std::lock_guard are used to manage locking.
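A minimal sketch (the counter and thread count are illustrative) of protecting shared data with std::mutex through std::lock_guard:

 #include <iostream>
 #include <mutex>
 #include <thread>
 #include <vector>
 std::mutex m;
 int counter = 0;
 void increment(int times) {
     for (int i = 0; i < times; i++) {
         std::lock_guard<std::mutex> lock(m); // locks m; unlocks automatically when lock leaves scope
         counter++;
     }
 }
 int main() {
     std::vector<std::thread> threads;
     for (int i = 0; i < 4; i++)
         threads.push_back(std::thread(increment, 10000));
     for (auto& t : threads)
         t.join();
     std::cout << "counter = " << counter << std::endl; // always 40000 because of the mutex
     return 0;
 }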
===Futures===
A future is an object that can retrieve a value from some provider object (also known as a promise) or function. Simply put, in the case of multithreading, a future object waits until its associated thread has completed and then stores its return value.
To retrieve or construct a future object, one of these functions may be used:
* std::async
* std::promise::get_future
* std::packaged_task::get_future
However, a future object can only be used if it is in a valid state. Default-constructed future objects are not valid; a future only becomes valid when it is obtained from std::async, std::promise::get_future, or std::packaged_task::get_future.
A std::future references a shared state that cannot be shared with other asynchronous return objects. If multiple threads need to wait for the same shared state, the std::shared_future class template should be used instead.
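A minimal sketch (the task function is illustrative) of obtaining a valid future from std::async, retrieving its value, and converting another into a std::shared_future:

 #include <future>
 #include <iostream>
 int task(int n) {
     return n * n; // work performed asynchronously
 }
 int main() {
     std::future<int> f = std::async(std::launch::async, task, 6); // valid future tied to the task
     // other work can happen here while the task runs
     std::cout << "result = " << f.get() << std::endl;             // waits for the task, prints 36
     std::shared_future<int> sf = std::async(std::launch::async, task, 7).share();
     // a shared_future may be copied and waited on from multiple threads
     std::cout << "shared result = " << sf.get() << std::endl;     // prints 49
     return 0;
 }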
OpenMP, unfortunately, does not support asynchronous multi-threading, as it is designed for parallelism, not concurrency.
===Programming Models===
====SPMD====
An example of the SPMD programming model implemented with std::thread, using a std::atomic variable to accumulate the result:

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <chrono>
 #include <vector>
 #include <thread>
 #include <atomic>
 using namespace std::chrono;
 
 std::atomic<double> pi(0.0);
 
 void reportTime(const char* msg, steady_clock::duration span) {
     auto ms = duration_cast<milliseconds>(span);
     std::cout << msg << " - took - " << ms.count() << " milliseconds" << std::endl;
 }
 
 void run(int ID, double stepSize, int nthrds, int n) {
     double x;
     double sum = 0.0;
     for (int i = ID; i < n; i += nthrds) { // cyclic distribution of iterations across threads
         x = (i + 0.5) * stepSize;
         sum += 4.0 / (1.0 + x * x);
     }
     sum = sum * stepSize;
     // atomically add this thread's partial sum to the shared total
     double expected = pi.load();
     while (!pi.compare_exchange_weak(expected, expected + sum))
         ;
 }
 
 int main(int argc, char** argv) {
     if (argc != 3) {
         std::cerr << argv[0] << ": invalid number of arguments\n";
         return 1;
     }
     int n = std::atoi(argv[1]);
     int numThreads = std::atoi(argv[2]);
     if (numThreads <= 0)
         numThreads = std::thread::hardware_concurrency(); // fall back to the hardware thread count
 
     // calculate pi by integrating the area under 4/(1 + x^2) in n steps
     steady_clock::time_point ts = steady_clock::now();
     double stepSize = 1.0 / (double)n;
     std::vector<std::thread> threads;
     for (int ID = 0; ID < numThreads; ID++)
         threads.push_back(std::thread(run, ID, stepSize, numThreads, n));
     for (auto& t : threads)
         t.join();
     steady_clock::time_point te = steady_clock::now();
 
     std::cout << "n = " << n << std::fixed << std::setprecision(15)
         << "\n pi(exact) = " << 3.141592653589793
         << "\n pi(calcd) = " << pi << std::endl;
     reportTime("Integration", te - ts);
 
     // terminate
     char c;
     std::cout << "Press Enter key to exit ... ";
     std::cin.get(c);
 }

===Question & Answer: C++11 Threads and OpenMP Compatibility===
'''Question:''' Can one safely use C++11 multi-threading as well as OpenMP in one and the same program, without interleaving them (i.e. no OpenMP statements in any code passed to C++11 concurrent features and no C++11 concurrency in threads spawned by OpenMP)?
 
'''Answer:''' On some platforms, an efficient implementation may only be achievable if the OpenMP run-time is the only thread run-time in control of the process threads. In this respect, x86 is usually considered an "experimental" platform (other vendors are usually much more conservative).
===Conclusion===
In conclusion, while OpenMP is and continues to be a viable option for multi-threading, it lacks some of the features outlined above and offers less low-level control. While the C++11 standard library multi-threading can be more difficult to learn, it is supported by virtually all compilers and offers low-level interaction with hardware threads.
====OpenMP code====
 // Workshop 3 - scan and reduce with OpenMP
 template <typename T, typename R, typename C, typename S>
 int scan(
     const T* in,   // source data
     T* out,        // output data
     int size,      // size of source, output data sets
     R reduce,      // reduction expression
     C combine,     // combine expression
     S scan_fn,     // scan function (exclusive or inclusive)
     T initial      // initial value
 ) {
     int nthreads = 1;
     if (size > 0) {
         // requested number of tiles
         int max_threads = omp_get_max_threads();
         T* reduced = new T[max_threads];
         T* scanRes = new T[max_threads];
         #pragma omp parallel
         {
             int ntiles = omp_get_num_threads();  // number of tiles == number of threads
             int itile = omp_get_thread_num();
             int tile_size = (size - 1) / ntiles + 1;
             int last_tile = ntiles - 1;
             int last_tile_size = size - last_tile * tile_size;
             if (itile == 0)
                 nthreads = ntiles;
             // step 1 - each thread reduces its own tile
             reduced[itile] = reduce(in + itile * tile_size,
                 itile == last_tile ? last_tile_size : tile_size, combine, T(0));
             #pragma omp barrier
             // step 2 - one thread performs an exclusive scan on the per-tile reductions
             //          and stores the results in scanRes[]
             #pragma omp single
             excl_scan(reduced, scanRes, ntiles, combine, T(0));
             // step 3 - each thread scans its own tile using scanRes[] as its starting value
             scan_fn(in + itile * tile_size, out + itile * tile_size,
                 itile == last_tile ? last_tile_size : tile_size, combine, scanRes[itile]);
         }
         delete[] reduced;
         delete[] scanRes;
     }
     return nthreads;
 }
====C++11 code====
 #include <iostream>
 #include <omp.h>
 #include <chrono>
 #include <vector>
 #include <thread>
 using namespace std;
 
 void doNothing() {}
 
 double run(int algorithmToRun) {
     auto startTime = std::chrono::system_clock::now();
     for (int j = 1; j < 100000; ++j) {
         if (algorithmToRun == 1) {
             // create and join 16 std::thread objects on every iteration
             vector<thread> threads;
             for (int i = 0; i < 16; i++)
                 threads.push_back(thread(doNothing));
             for (auto& t : threads)
                 t.join();
         }
         else if (algorithmToRun == 2) {
             // let OpenMP run the same work on a team of 16 threads
             #pragma omp parallel for num_threads(16)
             for (int i = 0; i < 16; i++) {
                 doNothing();
             }
         }
     }
     auto endTime = std::chrono::system_clock::now();
     std::chrono::duration<double> elapsed_seconds = endTime - startTime;
     return elapsed_seconds.count();
 }
 
 int main() {
     double cppt = run(1);
     double ompt = run(2);
     cout << cppt << endl;
     cout << ompt << endl;
     return 0;
 }
