==Asynchronous Multi-Threading==
===Creating and executing Threads===
====OpenMP====
Inside a declared OpenMP parallel region, if the number of threads is not specified via the environment variable OMP_NUM_THREADS or the library routine omp_set_num_threads(), OpenMP automatically decides how many threads to use to execute the parallel code.
An issue with this approach is that the number of threads OpenMP chooses may not match what the CPU can actually support. For example, OpenMP may create 4 threads on a single-core processor, which can degrade performance.
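For illustration, a minimal sketch of an OpenMP parallel region (not taken from the original workshop code): if the team size is not requested explicitly, the run-time picks it on its own.

 #include <iostream>
 #include <omp.h>
 int main() {
     // omp_set_num_threads(4);   // optional: request a specific team size
     #pragma omp parallel
     {
         // each thread reports its id and the team size chosen by the run-time
         // (output from different threads may interleave)
         std::cout << "thread " << omp_get_thread_num()
                   << " of " << omp_get_num_threads() << std::endl;
     }
     return 0;
 }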
====C++ 11====
C++11 threads, on the contrary, always require the program to specify the number of threads to use for a parallel region. If this is not supplied by user input or hard-coded, the number of threads supported by the CPU can be determined accurately via the std::thread::hardware_concurrency() function.
OpenMP automatically manages the creation and scheduling of its threads. With C++11 threads, the developer launches each thread explicitly, typically inside a for loop. Threads are created by constructing a std::thread object and passing a function or any other callable object to its constructor.
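A minimal sketch (the worker function is illustrative) of launching one std::thread per hardware thread reported by std::thread::hardware_concurrency() and joining them:

 #include <iostream>
 #include <thread>
 #include <vector>
 void work(int id) {
     std::cout << "hello from thread " << id << std::endl; // output may interleave
 }
 int main() {
     unsigned numThreads = std::thread::hardware_concurrency(); // threads supported by the CPU
     if (numThreads == 0) numThreads = 2;                       // the call may return 0 if unknown
     std::vector<std::thread> threads;
     for (unsigned id = 0; id < numThreads; id++)
         threads.push_back(std::thread(work, id)); // construct std::thread with a callable and its arguments
     for (auto& t : threads)
         t.join();                                 // wait for every thread to finish
     return 0;
 }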
===Mutual Exclusion===
====OpenMP====
OpenMP offers an easier solution for mutual exclusion and preventing race conditions within its constructs, as the programmer does not have to worry about initializing and destroying locks. Two directives are commonly used (see the sketch below):
* critical - a region to be executed by only one thread at a time
* atomic - a memory location to be updated by only one thread at a time
A critical section works by acquiring a lock, which carries substantial overhead. Furthermore, while one thread is inside a critical section, all other threads are blocked from entering it.
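A minimal sketch contrasting the two directives (the shared counters are purely illustrative):

 #include <iostream>
 #include <omp.h>
 int main() {
     int countCritical = 0;
     int countAtomic = 0;
     #pragma omp parallel
     {
         #pragma omp critical
         {
             countCritical++;   // the whole block is executed by one thread at a time
         }
         #pragma omp atomic
         countAtomic++;         // only this single memory update is protected
     }
     std::cout << countCritical << " " << countAtomic << std::endl; // both equal the team size
     return 0;
 }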
====C++ 11====
The C++11 thread library provides the mutex class to support mutual exclusion and synchronization. <br>
The mutex class is a synchronization primitive that can be used to protect shared data from being accessed by multiple threads at the same time.
std::mutex is usually not accessed directly; instead, std::unique_lock and std::lock_guard are used to manage locking.
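A minimal sketch (the counter and thread count are illustrative) of protecting shared data with std::mutex through std::lock_guard:

 #include <iostream>
 #include <mutex>
 #include <thread>
 #include <vector>
 std::mutex m;
 int counter = 0;
 void increment(int times) {
     for (int i = 0; i < times; i++) {
         std::lock_guard<std::mutex> lock(m); // locks m; unlocks automatically when lock leaves scope
         counter++;
     }
 }
 int main() {
     std::vector<std::thread> threads;
     for (int i = 0; i < 4; i++)
         threads.push_back(std::thread(increment, 10000));
     for (auto& t : threads)
         t.join();
     std::cout << "counter = " << counter << std::endl; // always 40000 because of the mutex
     return 0;
 }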
===Futures===
A future is an object that can retrieve a value from some provider object (also known as a promise) or function. Simply put, in the case of multithreading, a future object waits until its associated thread has completed and then stores its return value.
To retrieve or construct a future object, one of these functions may be used:
* std::async
* std::promise::get_future
* std::packaged_task::get_future
However, a future object can only be used if it is in a valid state. Default-constructed future objects are not valid; a future only becomes valid when it is obtained from std::async, std::promise::get_future, or std::packaged_task::get_future.
A std::future references a shared state that cannot be shared with other asynchronous return objects. If multiple threads need to wait for the same shared state, the std::shared_future class template should be used instead.
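A minimal sketch (the task function is illustrative) of obtaining a valid future from std::async, retrieving its value, and converting another into a std::shared_future:

 #include <future>
 #include <iostream>
 int task(int n) {
     return n * n; // work performed asynchronously
 }
 int main() {
     std::future<int> f = std::async(std::launch::async, task, 6); // valid future tied to the task
     // other work can happen here while the task runs
     std::cout << "result = " << f.get() << std::endl;             // waits for the task, prints 36
     std::shared_future<int> sf = std::async(std::launch::async, task, 7).share();
     // a shared_future may be copied and waited on from multiple threads
     std::cout << "shared result = " << sf.get() << std::endl;     // prints 49
     return 0;
 }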
OpenMP, unfortunately, does not support asynchronous multi-threading, as it is designed for parallelism, not concurrency.
===Programming Models===
====SPMD====
An example of the SPMD programming model implemented with std::thread, using a std::atomic variable to accumulate the result:

 #include <iostream>
 #include <iomanip>
 #include <cstdlib>
 #include <chrono>
 #include <vector>
 #include <thread>
 #include <atomic>
 using namespace std::chrono;
 
 std::atomic<double> pi(0.0);
 
 void reportTime(const char* msg, steady_clock::duration span) {
     auto ms = duration_cast<milliseconds>(span);
     std::cout << msg << " - took - " << ms.count() << " milliseconds" << std::endl;
 }
 
 void run(int ID, double stepSize, int nthrds, int n) {
     double x;
     double sum = 0.0;
     for (int i = ID; i < n; i += nthrds) { // cyclic distribution of iterations across threads
         x = (i + 0.5) * stepSize;
         sum += 4.0 / (1.0 + x * x);
     }
     sum = sum * stepSize;
     // atomically add this thread's partial sum to the shared total
     double expected = pi.load();
     while (!pi.compare_exchange_weak(expected, expected + sum))
         ;
 }
 
 int main(int argc, char** argv) {
     if (argc != 3) {
         std::cerr << argv[0] << ": invalid number of arguments\n";
         return 1;
     }
     int n = std::atoi(argv[1]);
     int numThreads = std::atoi(argv[2]);
     if (numThreads <= 0)
         numThreads = std::thread::hardware_concurrency(); // fall back to the hardware thread count
 
     // calculate pi by integrating the area under 4/(1 + x^2) in n steps
     steady_clock::time_point ts = steady_clock::now();
     double stepSize = 1.0 / (double)n;
     std::vector<std::thread> threads;
     for (int ID = 0; ID < numThreads; ID++)
         threads.push_back(std::thread(run, ID, stepSize, numThreads, n));
     for (auto& t : threads)
         t.join();
     steady_clock::time_point te = steady_clock::now();
 
     std::cout << "n = " << n << std::fixed << std::setprecision(15)
         << "\n pi(exact) = " << 3.141592653589793
         << "\n pi(calcd) = " << pi << std::endl;
     reportTime("Integration", te - ts);
 
     // terminate
     char c;
     std::cout << "Press Enter key to exit ... ";
     std::cin.get(c);
 }

===Question & Answer: C++11 Threads and OpenMP Compatibility===
'''Question:''' Can one safely use C++11 multi-threading as well as OpenMP in one and the same program, without interleaving them (i.e. no OpenMP statements in any code passed to C++11 concurrent features and no C++11 concurrency in threads spawned by OpenMP)?
 
'''Answer:''' On some platforms, an efficient implementation may only be achievable if the OpenMP run-time is the only thread run-time in control of the process threads. In this respect, x86 is usually considered an "experimental" platform (other vendors are usually much more conservative).
===Conclusion===
In conclusion, while OpenMP is and continues to be a viable option for multi-threading, it lacks some of the features outlined above and offers less low-level control. While the C++11 standard library multi-threading can be more difficult to learn, it is supported by virtually all compilers and offers low-level interaction with hardware threads.
====OpenMP code====
 // Workshop 3 - scan and reduce with OpenMP
 template <typename T, typename R, typename C, typename S>
 int scan(
     const T* in,   // source data
     T* out,        // output data
     int size,      // size of source, output data sets
     R reduce,      // reduction expression
     C combine,     // combine expression
     S scan_fn,     // scan function (exclusive or inclusive)
     T initial      // initial value
 ) {
     int nthreads = 1;
     if (size > 0) {
         // requested number of tiles
         int max_threads = omp_get_max_threads();
         T* reduced = new T[max_threads];
         T* scanRes = new T[max_threads];
         #pragma omp parallel
         {
             int ntiles = omp_get_num_threads();  // number of tiles == number of threads
             int itile = omp_get_thread_num();
             int tile_size = (size - 1) / ntiles + 1;
             int last_tile = ntiles - 1;
             int last_tile_size = size - last_tile * tile_size;
             if (itile == 0)
                 nthreads = ntiles;
             // step 1 - each thread reduces its own tile
             reduced[itile] = reduce(in + itile * tile_size,
                 itile == last_tile ? last_tile_size : tile_size, combine, T(0));
             #pragma omp barrier
             // step 2 - one thread performs an exclusive scan on the per-tile reductions
             //          and stores the results in scanRes[]
             #pragma omp single
             excl_scan(reduced, scanRes, ntiles, combine, T(0));
             // step 3 - each thread scans its own tile using scanRes[] as its starting value
             scan_fn(in + itile * tile_size, out + itile * tile_size,
                 itile == last_tile ? last_tile_size : tile_size, combine, scanRes[itile]);
         }
         delete[] reduced;
         delete[] scanRes;
     }
     return nthreads;
 }
====C++11 code====
 #include <iostream>
 #include <omp.h>
 #include <chrono>
 #include <vector>
 #include <thread>
 using namespace std;
 
 void doNothing() {}
 
 double run(int algorithmToRun) {
     auto startTime = std::chrono::system_clock::now();
     for (int j = 1; j < 100000; ++j) {
         if (algorithmToRun == 1) {
             // create and join 16 std::thread objects on every iteration
             vector<thread> threads;
             for (int i = 0; i < 16; i++)
                 threads.push_back(thread(doNothing));
             for (auto& t : threads)
                 t.join();
         }
         else if (algorithmToRun == 2) {
             // let OpenMP run the same work on a team of 16 threads
             #pragma omp parallel for num_threads(16)
             for (int i = 0; i < 16; i++) {
                 doNothing();
             }
         }
     }
     auto endTime = std::chrono::system_clock::now();
     std::chrono::duration<double> elapsed_seconds = endTime - startTime;
     return elapsed_seconds.count();
 }
 
 int main() {
     double cppt = run(1);
     double ompt = run(2);
     cout << cppt << endl;
     cout << ompt << endl;
     return 0;
 }
