CDOT Wiki β

GPU621/NoName

4,419 bytes added, 01:07, 28 November 2016
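C++11 threads, on the contrary, always require the programmer to specify the number of threads for a parallel region. If the count is not supplied by user input or hard-coded, the number of concurrent threads supported by the hardware can be queried with std::thread::hardware_concurrency(). The returned value is only a hint and may be 0 when it cannot be determined.
OpenMP creates and schedules the threads of a parallel region automatically. With C++11 threads the developer must launch and join each thread explicitly, which also fixes the order in which the threads are started; the order in which they actually run is still decided by the operating system scheduler. Launching is typically done within a for loop.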
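The two points above can be sketched roughly as follows. This is a minimal illustration, not code from the original page: the helper names choose_thread_count and launch_and_join are hypothetical, and falling back to a single thread when hardware_concurrency() reports 0 is an assumption made here.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: use the requested count if given, otherwise ask the
// hardware, and fall back to 1 when the hint is unavailable (reported as 0).
std::size_t choose_thread_count(std::size_t requested) {
    if (requested > 0) return requested;
    const unsigned hw = std::thread::hardware_concurrency();
    return hw > 0 ? hw : 1;
}

// Hypothetical helper: launches nthreads threads in a for loop, lets each one
// record its own ID in its own slot, then joins them all.
std::vector<int> launch_and_join(std::size_t nthreads) {
    assert(nthreads > 0);
    std::vector<int> ids(nthreads, -1);
    std::vector<std::thread> threads;
    threads.reserve(nthreads);
    for (std::size_t t = 0; t < nthreads; ++t) {
        // Each thread writes only to ids[t], so no locking is needed.
        threads.emplace_back([&ids, t] { ids[t] = static_cast<int>(t); });
    }
    for (auto& th : threads) th.join();  // wait for every thread to finish
    return ids;
}
```

Calling launch_and_join(choose_thread_count(0)) would start one thread per hardware thread reported by the system, or a single thread if no hint is available.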
 
===Parallelizing for Loops===
 
In OpenMP, parallelizing for loops can be accomplished using SPMD or work-sharing. When using work-sharing, the omp for construct makes parallelizing for loops a straightforward process.
When the appropriate #pragma omp construct is placed immediately before the loop to be parallelized, OpenMP automatically calculates the range of iterations assigned to each thread. All that is required to use the omp for construct is to remove any data dependencies between loop iterations. <br>
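As a rough sketch of the work-sharing approach, the loop below has no dependencies between iterations, so a single pragma is enough. The function name vector_add is hypothetical. The code assumes a compiler with OpenMP enabled (for example via -fopenmp on GCC or Clang); without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: element-wise addition of two vectors.
// Every iteration touches a different element, so iterations are independent.
std::vector<double> vector_add(const std::vector<double>& a,
                               const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> c(a.size());
    const long long n = static_cast<long long>(a.size());
    // OpenMP splits the iteration range across the threads of the team.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```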
C++11 threads and other language-native threads unfortunately lack this luxury. To parallelize a loop using std threads, it is the programmer's responsibility to calculate the range of iterations that each thread executes within the loop to be parallelized. This is usually done using SPMD techniques.
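A minimal sketch of such a manual range calculation is shown below. The helper names thread_range and parallel_sum are hypothetical, and storing one partial result per thread (instead of sharing one variable) is a design choice made for this illustration so that no locking is required.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: half-open iteration range [begin, end) for thread t.
// The last thread's range always ends at n, so no iteration is lost.
std::pair<std::size_t, std::size_t> thread_range(std::size_t n,
                                                 std::size_t nthreads,
                                                 std::size_t t) {
    assert(nthreads > 0 && t < nthreads);
    return { t * n / nthreads, (t + 1) * n / nthreads };
}

// Sums 0 + 1 + ... + (n - 1) using nthreads threads in SPMD style.
long long parallel_sum(std::size_t n, std::size_t nthreads) {
    assert(nthreads > 0);
    std::vector<long long> partial(nthreads, 0);  // one slot per thread
    std::vector<std::thread> threads;
    threads.reserve(nthreads);
    for (std::size_t t = 0; t < nthreads; ++t) {
        threads.emplace_back([&partial, n, nthreads, t] {
            const auto r = thread_range(n, nthreads, t);
            long long local = 0;
            for (std::size_t i = r.first; i < r.second; ++i) {
                local += static_cast<long long>(i);
            }
            partial[t] = local;  // each thread writes only its own slot
        });
    }
    for (auto& th : threads) th.join();
    long long total = 0;
    for (long long p : partial) total += p;  // combine after all joins
    return total;
}
```

For example, with n = 10 and 3 threads the ranges come out as [0, 3), [3, 6) and [6, 10).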
 
====Mutual Exclusion====
 
=====OpenMP=====
 
=====C++ 11=====
 
The C++11 thread library provides the mutex and atomic classes, which support mutual exclusion. <br>
The mutex class is a synchronization primitive that can be used to protect shared data from being accessed simultaneously by multiple threads.
std::mutex is usually not locked directly; instead, std::unique_lock and std::lock_guard are used to manage locking.
<br>
Mutex offers these member functions for controlling locking:
• lock - locks the mutex, blocks if the mutex is not available
• unlock - unlocks the mutex
• try_lock - tries to lock the mutex, returns immediately (with false) if the mutex is not available
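The member functions listed above could be exercised along these lines. This is an illustrative sketch only: locked_counter and try_lock_fails_when_held are hypothetical names, and the counter is just a stand-in for any shared data.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example: several threads increment one shared counter.
// std::lock_guard calls lock() on construction and unlock() on destruction.
long long locked_counter(int nthreads, int increments) {
    assert(nthreads > 0 && increments >= 0);
    std::mutex m;
    long long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                std::lock_guard<std::mutex> lock(m);  // held for one increment
                ++counter;
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}

// Hypothetical example: try_lock() returns false instead of blocking
// while another thread owns the mutex.
bool try_lock_fails_when_held() {
    std::mutex m;
    m.lock();                      // the calling thread owns the mutex
    bool acquired = true;
    std::thread other([&] {
        acquired = m.try_lock();   // does not block; the mutex is held
        if (acquired) m.unlock();
    });
    other.join();
    m.unlock();
    return !acquired;
}
```

Keeping the lock scope as small as the protected update allows the rest of each thread's work to run concurrently.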
 
The atomic class provides an atomic object type which can eliminate the possibility of data races by providing synchronization between threads. Accesses to atomic objects may establish inter-thread synchronization and order non-atomic memory accesses.
<br>
Atomic types are defined as
std::atomic<type> var_name;
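As a small sketch of how an atomic object might replace a mutex for a simple shared counter, consider the following. The function name atomic_counter is hypothetical, and the default (sequentially consistent) memory ordering is assumed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical example: several threads increment one atomic counter.
// Each ++ is an indivisible read-modify-write, so no mutex is required.
long long atomic_counter(int nthreads, int increments) {
    assert(nthreads > 0 && increments >= 0);
    std::atomic<long long> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                ++counter;  // atomic increment
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter.load();
}
```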
 
====Implementations====
 
Serial Implementation
#include <iostream>
#include <chrono>
#include <cstddef>
using namespace std::chrono;
int main(int argc, char *argv[])
{
    steady_clock::time_point ts, te;
    const size_t n = 100000000;
    // 64-bit accumulator: the sum 0 + 1 + ... + (n - 1) does not fit in an int
    long long j = 0;
    ts = steady_clock::now();
    for (size_t i = 0; i < n; i++)
    {
        j += i;
    }
    te = steady_clock::now();
    std::cout << j << std::endl;
    auto ms = duration_cast<milliseconds>(te - ts);
    std::cout << std::endl << "Took - " <<
        ms.count() << " milliseconds" << std::endl;
}
 
Finished at 180 milliseconds
 
OpenMP with work-sharing implementation
#include <iostream>
#include <chrono>
#include <cstddef>
#include <omp.h>
using namespace std::chrono;
int main(int argc, char *argv[])
{
    const size_t n = 100000000;
    steady_clock::time_point ts, te;
    // 64-bit accumulator: the sum 0 + 1 + ... + (n - 1) does not fit in an int
    long long j = 0;
    ts = steady_clock::now();
    #pragma omp parallel num_threads(8)
    {
        // each thread sums its share into a private copy of j;
        // the copies are added together when the loop ends
        #pragma omp for reduction(+:j)
        for (int i = 0; i < static_cast<int>(n); i++) {
            j += i;
        }
    }
    te = steady_clock::now();
    std::cout << j << std::endl;
    auto ms = duration_cast<milliseconds>(te - ts);
    std::cout << std::endl << "Took - " <<
        ms.count() << " milliseconds" << std::endl;
}
 
Finished at 63 milliseconds
 
Native Implementation using mutex locking barrier
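#include <iostream>
#include <chrono>
#include <cstddef>
#include <vector>
#include <thread>
#include <mutex>
#include <functional>
#include <algorithm>
using namespace std::chrono;
int main(int argc, char *argv[]){
    const size_t n = 100000000;
    steady_clock::time_point ts, te;
    // hardware_concurrency() may return 0 when the count is unknown
    const size_t nthreads = std::max<size_t>(1, std::thread::hardware_concurrency());
    std::vector<std::thread> threads(nthreads);
    std::mutex critical;
    // 64-bit accumulator: the sum 0 + 1 + ... + (n - 1) does not fit in an int
    long long j = 0;
    ts = steady_clock::now();
    for (size_t t = 0; t < nthreads; t++)
    {
        threads[t] = std::thread(std::bind([&](const size_t bi, const size_t ei)
        {
            // the lock is held for the whole sub-range, so the threads
            // update j one after another rather than concurrently
            std::lock_guard<std::mutex> lock(critical);
            for (size_t i = bi; i < ei; i++)
            {
                j += i;
            }
        }, t*n / nthreads, (t + 1) == nthreads ? n : (t + 1)*n / nthreads));
    }
    // wait for every thread to finish before stopping the clock
    std::for_each(threads.begin(), threads.end(), [](std::thread& x){x.join(); });
    te = steady_clock::now();
    std::cout << j << std::endl;
    auto ms = duration_cast<milliseconds>(te - ts);
    std::cout << std::endl << "Took - " <<
        ms.count() << " milliseconds" << std::endl;
}

Finished at 6 milliseconds when the end time was recorded before the threads were joined; that figure covers thread launch only and is not comparable with the two results above.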
====OpenMP====
====STD Native Threads====
int numThreads = std::thread::hardware_concurrency();
threads[ID] = std::thread(function);
}
 
===Programming Models===