Difference between revisions of "GPU621/NoName"

Basic example of asynchronous multi-threading using std::async to create the threads and std::future to store the results returned by their associated threads.

#include <vector>
#include <iostream>
#include <chrono>
#include <future>

int twice(int m){
     return 2 * m;
}

int main(int argc, char *argv[])
{
     std::vector<std::future<int>> futures;
     for (int i = 0; i < 10; ++i) {
          futures.push_back(std::async(twice, i));
     }
     for (size_t i = 0; i < futures.size(); i++){
          std::cout << futures.at(i).get() << std::endl;
     }
     return 0;
}

OpenMP unfortunately does not support this style of asynchronous multi-threading, as it is designed for parallelism, not concurrency.
  
===C++ 11 Threads and OpenMP compatibility===
Can one safely use C++11 multi-threading as well as OpenMP in one and the same program but without interleaving them (i.e. no OpenMP statement in any code passed to C++11 concurrent features and no C++11 concurrency in threads spawned by OpenMP)?

On some platforms efficient implementation could only be achieved if the OpenMP run-time is the [...] and x86 is usually considered an "experimental" platform (other vendors are usually much more conservative).
  
 +
===Conclusion===
In conclusion, while OpenMP continues to be a viable option for multi-threading, it lacks some of the features outlined above and offers less low-level control. C++11 standard library multi-threading can be more difficult to learn, but it is supported by virtually all C++11 compilers and offers lower-level interaction with hardware threads.

====OpenMP code====
//Workshop 3 using the scan and reduce with openMp
// Needs <omp.h>; excl_scan() is assumed to be defined elsewhere.
template <typename T, typename R, typename C, typename S>
int scan(
     const T* in,   // source data
     T* out,        // output data
     int size,      // size of source, output data sets
     R reduce,      // reduction expression
     C combine,     // combine expression
     S scan_fn,     // scan function (exclusive or inclusive)
     T initial      // initial value
)
{
     int nthreads = 1;
     if (size > 0) {
          // requested number of tiles
          int max_threads = omp_get_max_threads();
          T* reduced = new T[max_threads];
          T* scanRes = new T[max_threads];

          #pragma omp parallel
          {
               int ntiles = omp_get_num_threads(); // Number of tiles
               int itile = omp_get_thread_num();   // this thread's tile
               int tile_size = (size - 1) / ntiles + 1;
               int last_tile = ntiles - 1;
               int last_tile_size = size - last_tile * tile_size;
               if (itile == 0)
                    nthreads = ntiles;

               // step 1 - each thread reduces its own tile
               reduced[itile] = reduce(in + itile * tile_size,
                    itile == last_tile ? last_tile_size : tile_size, combine, T(0));
               #pragma omp barrier

               // step 2 - one thread performs an exclusive scan on the
               // reduction outputs and stores the results in scanRes[]
               #pragma omp single
               excl_scan(reduced, scanRes, ntiles, combine, T(0));

               // step 3 - each thread scans its own tile using scanRes[]
               scan_fn(in + itile * tile_size, out + itile * tile_size,
                    itile == last_tile ? last_tile_size : tile_size, combine,
                    scanRes[itile]);
          }
          delete[] reduced;
          delete[] scanRes;
     }
     return nthreads;
}
 
 
 
====C++11 code====

#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>

using namespace std;

void doNothing() {}

int run(int algorithmToRun)
{
    auto startTime = std::chrono::system_clock::now();

    for(int j=1; j<100000; ++j)
    {
        if(algorithmToRun == 1)
        {
            vector<thread> threads;
            for(int i=0; i<16; i++)
            {
                threads.push_back(thread(doNothing));
            }
            for(auto& thread : threads) thread.join();
        }
        else if(algorithmToRun == 2)
        {
            #pragma omp parallel for num_threads(16)
            for(unsigned i=0; i<16; i++)
            {
                doNothing();
            }
        }
    }

    auto endTime = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = endTime - startTime;

    return elapsed_seconds.count();
}

int main()
{
    int cppt = run(1);
    int ompt = run(2);

    cout<<cppt<<endl;
    cout<<ompt<<endl;

    return 0;
}
 

Latest revision as of 18:17, 3 December 2016

NoName

Our project: C++11 Threads Library Comparison to OpenMP

Group Members

  1. Saad Toor [1] Research etc.
  2. Danylo Medinski [2] Research etc.
  3. Ahmed Khan [3] Research etc.

Progress

Oct 17th:

  1. Picked topic
  2. Picked presentation date.
  3. Gathering information

Oct 20th:

  1. Created Wiki page

OpenMP vs C++11 Threads

What are C++11 Threads

With the introduction of C++11, major changes and additions were made to the C++ Standard Library. One of the most significant was the inclusion of multi-threading support. Before C++11, implementing multi-threading required external libraries or language extensions such as OpenMP. The standard library now not only supports multi-threading, it also provides synchronization primitives and thread-safety facilities.

The C++11 thread support library includes these four headers to enable multi-threading:

  • <thread> - the std::thread class and the std::this_thread namespace for working with threads
  • <mutex> - provides support for mutual exclusion
  • <condition_variable> - a synchronization primitive that blocks one or more threads until another thread both modifies a shared variable (the condition) and notifies the condition_variable
  • <future> - components that let one thread retrieve the result (a value or an exception) of a function that has run in the same thread or in another thread

Two options are available for multi-threading: synchronous threading via std::thread, and asynchronous threading via std::async and std::future.
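
As a minimal sketch of these two options, the same computation can be run once through std::thread and once through std::async. The names square, square_with_thread, and square_with_async are illustrative assumptions and do not come from the project code.

```cpp
#include <future>
#include <thread>

// Hypothetical work function, used only for illustration.
int square(int x) { return x * x; }

// Option 1: std::thread. A thread has no return channel, so the result
// is written through a captured reference and read after join().
int square_with_thread(int x) {
    int result = 0;
    std::thread t([&result, x] { result = square(x); });
    t.join();  // wait for the thread before reading result
    return result;
}

// Option 2: std::async. The returned std::future carries the result.
int square_with_async(int x) {
    std::future<int> f = std::async(std::launch::async, square, x);
    return f.get();  // blocks until the value is ready
}
```

Both functions return the same value for the same input; the difference is only in how the result travels back to the caller.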

Creating and executing Threads

OpenMP

Inside a declared OpenMP parallel region, if the thread count is not specified via the environment variable OMP_NUM_THREADS, the library routine omp_set_num_threads(), or a num_threads clause, the OpenMP runtime decides how many threads are used to execute the parallel code. The default is implementation-defined, and a requested thread count is not checked against what the CPU can support. As a result, a program that requests 4 threads on a single-core processor oversubscribes the CPU, which may degrade performance.

Automatic thread creation

#pragma omp parallel
     {
          int tid = omp_get_thread_num(); 
          std::cout << "Hi from thread "
          << tid << '\n';
     }

Programmer Specified thread creation

int numThreads = 4;
omp_set_num_threads(numThreads);
#pragma omp parallel
     {
          int tid = omp_get_thread_num(); 
          std::cout << "Hi from thread "
          << tid << '\n';
     }

C++11

C++11 threads, in contrast, always require the programmer to specify the number of threads for a parallel region. If it is not set by user input or hard-coding, the number of threads supported by the CPU can be queried with the std::thread::hardware_concurrency() function; the value is a hint and may be 0 if it cannot be determined. OpenMP automatically decides in what order threads are started. C++11 threads require the developer to launch each thread explicitly, which is typically done within a for loop. Threads are created by constructing a std::thread object and passing a function or any other callable object to the constructor.

Example of native thread creation using C++11

int numThreads = std::thread::hardware_concurrency();
std::vector<std::thread> threads(numThreads);
for (int ID = 0; ID < numThreads; ID++) {
     threads[ID] = std::thread(function);
}

After a thread has been created and started, the main thread must either join or detach it before the std::thread object is destroyed; otherwise std::terminate is called. The C++11 standard library offers these two member functions for joining or detaching threads.

  • std::thread::join - waits for the thread to finish execution. The calling thread blocks until the joined thread has completed.
  • std::thread::detach - allows the thread to execute in the background independently from the main thread. The thread continues execution without blocking or synchronizing in any way and terminates without relying on the main thread.

Each created thread can then be synchronized with the main thread

for (int i = 0; i < threads.size(); i++){
     threads.at(i).join();
}
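
The creation and joining steps above can be combined into one small sketch. The helper names worker_count and run_workers are assumptions made for illustration; the fallback to one thread covers the case where hardware_concurrency() returns 0.

```cpp
#include <thread>
#include <vector>

// hardware_concurrency() is only a hint and may return 0,
// so fall back to a single thread in that case.
unsigned worker_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Launch n threads; the thread with a given id writes that id into its
// own slot of the result. Each thread touches a distinct element, so no
// locking is needed.
std::vector<int> run_workers(unsigned n) {
    std::vector<int> ids(n, -1);
    std::vector<std::thread> threads;
    threads.reserve(n);
    for (unsigned id = 0; id < n; ++id) {
        threads.emplace_back([&ids, id] { ids[id] = static_cast<int>(id); });
    }
    // Every thread must be joined (or detached) before its std::thread
    // object is destroyed.
    for (std::thread& t : threads) {
        if (t.joinable()) t.join();
    }
    return ids;
}
```

A call such as run_workers(worker_count()) starts one thread per reported hardware thread and returns only after all of them have finished.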

Parallelizing for Loops

In OpenMP, parallelizing for loops can be accomplished using SPMD or work-sharing. With work-sharing, the omp for construct makes parallelizing a for loop straightforward. By placing the appropriate #pragma omp for construct over the loop to be parallelized, the iteration range assigned to each thread is calculated automatically by OpenMP. All that is required to use the omp for construct is to remove any data dependencies between iterations within the parallel region.
C++11 threads and other native threading libraries unfortunately lack this luxury. To parallelize a loop using std::thread, it is the programmer's responsibility to calculate the range of iterations that each thread executes. This is usually done using SPMD techniques.
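
The manual range calculation could look like the following sketch. The function names chunk and parallel_sum are hypothetical; each thread sums its own sub-range into a private slot, so no locking is required.

```cpp
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Half-open range [begin, end) of iterations assigned to thread t of
// nthreads. Integer division keeps the chunk sizes within one iteration
// of each other.
std::pair<std::size_t, std::size_t> chunk(std::size_t n, std::size_t nthreads,
                                          std::size_t t) {
    std::size_t begin = t * n / nthreads;
    std::size_t end = (t + 1) * n / nthreads;
    return {begin, end};
}

// Sum 0 + 1 + ... + (n - 1), with each thread accumulating into its own slot.
long long parallel_sum(std::size_t n, std::size_t nthreads) {
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < nthreads; ++t) {
        threads.emplace_back([&partial, n, nthreads, t] {
            std::pair<std::size_t, std::size_t> range = chunk(n, nthreads, t);
            long long local = 0;
            for (std::size_t i = range.first; i < range.second; ++i)
                local += static_cast<long long>(i);
            partial[t] = local;  // one write per thread, to a distinct element
        });
    }
    for (std::thread& th : threads) th.join();
    long long total = 0;
    for (long long p : partial) total += p;  // combine after all joins
    return total;
}
```

This is the work that the omp for construct performs automatically: splitting the iteration space, running each piece on its own thread, and combining the partial results.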

Synchronization

C++11 and OpenMP both provide ways to avoid race conditions and to share data between threads.

Shared Memory

OpenMP

OpenMP uses the shared clause to define which variables are shared among all threads. All threads within a team access the same storage area for shared variables.

The shared clause is placed within a pragma statement. The clause is written as follows.

shared(var)

C++11

The std::atomic class template provides atomic object types, which eliminate data races on the object itself by synchronizing access between threads. Accesses to atomic objects may establish inter-thread synchronization and order non-atomic memory accesses.
Atomic types are defined as

std::atomic<type> var_name;
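
A short sketch of an atomic counter shared by several threads follows. The function name count_with_atomic is an assumption for illustration; with a plain long in place of std::atomic<long>, the increments would race.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each of nthreads threads increments a shared counter `increments` times.
long count_with_atomic(int nthreads, int increments) {
    std::atomic<long> counter(0);
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([&counter, increments] {
            for (int i = 0; i < increments; ++i)
                counter.fetch_add(1);  // atomic read-modify-write
        });
    }
    for (std::thread& th : threads) th.join();
    return counter.load();
}
```

Because every increment is a single atomic operation, the final value equals nthreads times increments regardless of how the threads are scheduled.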

Mutual Exclusion

OpenMP

OpenMP offers multiple solutions for handling mutual exclusion. Explicit locking can be implemented using the omp_set_lock and omp_unset_lock library routines, which block a thread until the lock becomes available.

Example of explicit locking

omp_lock_t lock;
omp_init_lock(&lock);
int i = 0;	
#pragma omp parallel num_threads(8)
{          
     omp_set_lock(&lock);
     i++;
     omp_unset_lock(&lock);
}
omp_destroy_lock(&lock);

A lock is similar to a critical section in that it guarantees that some instructions can be executed by only one thread at a time. With a lock you make sure that some data elements can be touched by only one thread at a time.

OpenMP offers an easier solution for mutual exclusion and for preventing race conditions through its critical and atomic constructs, as the programmer does not have to initialize and destroy locks.

  • critical - a region to be executed by only one thread at a time
  • atomic - a memory location to be updated by only one thread at a time

A critical section works by acquiring a lock, which carries substantial overhead. Furthermore, while one thread is inside a critical section, all other threads that reach it are blocked.

A critical region can be implemented as follows

#pragma omp critical
{
     i++;
}

An atomic region is implemented just like a critical region, with the critical construct replaced by an atomic construct. An atomic region has much lower overhead than a critical section because it does not require locking and unlocking operations; it takes advantage of hardware that provides atomic update instructions.

C++ 11

The C++11 thread library provides the mutex class to support mutual exclusion and synchronization.
The mutex class is a synchronization primitive that can be used to protect shared data from being accessed by multiple threads at the same time. std::mutex is usually not locked directly; instead, std::unique_lock and std::lock_guard are used to manage locking.
std::mutex offers these member functions for controlling locking

  • lock - locks the mutex, blocks if the mutex is not available
  • unlock - unlocks the mutex
  • try_lock - tries to lock the mutex, returns false immediately if the mutex is not available

Example of thread locking/blocking

#include <iostream>
#include <thread>
#include <string>
#include <mutex>

std::mutex mu;

void shared_output(std::string msg, int id)
{
     mu.lock();
     std::cout << msg << ":" << id << std::endl;
     mu.unlock();
}
void thread_function()
{
     for (int i = -1000; i < 0; i++)
          shared_output("thread ", i);
}
int main()
{
     std::thread t(&thread_function);
     for (int i = 1000; i > 0; i--)
          shared_output("main thread", i);
     t.join();
 return 0;
}
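
The example above calls lock() and unlock() directly. As a hedged sketch of the managed alternative, the same protection can be written with std::lock_guard, and try_lock() can be used when blocking is not wanted. The names log_mutex, log_lines, append_line, and try_append_line are hypothetical.

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex log_mutex;                // hypothetical shared state for illustration
std::vector<std::string> log_lines;

// lock_guard locks in its constructor and unlocks in its destructor,
// so the mutex is released even if push_back throws.
void append_line(const std::string& line) {
    std::lock_guard<std::mutex> guard(log_mutex);
    log_lines.push_back(line);
}

// try_lock() returns immediately: true if the lock was acquired,
// false if it could not be acquired.
bool try_append_line(const std::string& line) {
    if (!log_mutex.try_lock()) return false;
    // adopt_lock tells the guard that the mutex is already locked.
    std::lock_guard<std::mutex> guard(log_mutex, std::adopt_lock);
    log_lines.push_back(line);
    return true;
}
```

The guard-based version has no explicit unlock() call to forget, which is the main reason the managed wrappers are preferred over calling the mutex directly.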

Implementations

Serial Implementation

#include <iostream>
#include <chrono>
using namespace std::chrono;

int main(int argc, char *argv[])
{
     steady_clock::time_point ts, te;
     const size_t n = 100000000;
     long long j = 0;   // the sum is about 5e15, which overflows int
     ts = steady_clock::now();
     for (size_t i = 0; i < n; i++)
     {
          j += i;
     }
     te = steady_clock::now();
     std::cout << j << std::endl;
     auto ms = duration_cast<milliseconds>(te - ts);
     std::cout << std::endl << "Took - " <<
     ms.count() << " milliseconds" << std::endl;
}

The example finished in 180 milliseconds.

OpenMP implementation with work-sharing. Since the program accumulates a sum, a reduction can be used with OpenMP's work-sharing constructs.

#include <iostream>
#include <chrono>
#include <omp.h>
using namespace std::chrono;
int main(int argc, char *argv[])
{
     const size_t n = 100000000;

     steady_clock::time_point ts, te;

     long long j = 0;   // the sum is about 5e15, which overflows int
     ts = steady_clock::now();
     #pragma omp parallel num_threads(8)
     {
          #pragma omp for reduction(+:j)
          for (long long i = 0; i < (long long)n; i++){
               j += i;
          }
     }
     te = steady_clock::now();
     std::cout << j << std::endl;

     auto ms = duration_cast<milliseconds>(te - ts);

     std::cout << std::endl << "Took - " <<
     ms.count() << " milliseconds" << std::endl;
}

The example finished in 63 milliseconds.

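Native SPMD implementation using a mutex to guard the shared sum. std::bind() binds the iteration range of each thread to the thread function.

#include <iostream>
#include <chrono>
#include <vector>
#include <thread>
#include <mutex>
#include <algorithm>
#include <functional>
using namespace std::chrono;
int main(int argc, char *argv[]){
     const size_t n = 100000000;
     steady_clock::time_point ts, te;
     // hardware_concurrency() may return 0; use at least one thread
     const size_t nthreads = std::max<size_t>(1, std::thread::hardware_concurrency());
     std::vector<std::thread> threads(nthreads);
     std::mutex critical;
     long long j = 0;   // the sum is about 5e15, which overflows int

     ts = steady_clock::now();
     for (size_t t = 0; t < nthreads; t++)
     {
          threads[t] = std::thread(std::bind([&](const size_t bi, const size_t ei)
          {
               // accumulate privately so the threads run concurrently
               long long local = 0;
               for (size_t i = bi; i < ei; i++)
               {
                    local += i;
               }
               // lock only for the update of the shared sum
               std::lock_guard<std::mutex> lock(critical);
               j += local;
          }, t*n / nthreads, (t + 1) == nthreads ? n : (t + 1)*n / nthreads));
     }
     std::for_each(threads.begin(), threads.end(), [](std::thread& x){x.join(); });
     // stop the clock only after every thread has finished
     te = steady_clock::now();
     std::cout << j << std::endl;
     auto ms = duration_cast<milliseconds>(te - ts);
     std::cout << std::endl << "Took - " <<
     ms.count() << " milliseconds" << std::endl;
}

The original version of this example reported 6 milliseconds. In that version the clock was stopped before the threads were joined and the mutex was held for the whole loop, so the figure mostly reflects thread start-up time rather than the time of the summation.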

Asynchronous Multi-Threading

C++ 11 allows the creation of asynchronous threads using the std::async function template, which is part of the <future> header. The function returns a std::future that will hold the return value of the function passed to std::async. A future is an object that can retrieve a value from some provider object (such as a promise) or function. Simply put, in the case of multi-threading, a future lets the caller wait until the associated task has completed and then retrieve its return value. A future object can be obtained from any of these functions:

  • std::async
  • promise::get_future
  • packaged_task::get_future
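The three sources listed above can be compared side by side. The following is a minimal sketch rather than a definitive implementation; the helper names value_from_async, value_from_promise and value_from_packaged_task are hypothetical and chosen only for illustration. Each one doubles its argument on another thread and returns the result through a future.

```cpp
#include <future>
#include <thread>
#include <utility>

// Hypothetical helper: std::async creates the task and hands back its future.
int value_from_async(int m) {
    std::future<int> f = std::async(std::launch::async, [m] { return 2 * m; });
    return f.get();                          // blocks until the task has finished
}

// Hypothetical helper: a promise is the provider, the future is the receiver.
int value_from_promise(int m) {
    std::promise<int> p;
    std::future<int> f = p.get_future();     // promise::get_future
    // the promise is moved into the worker, which fulfils it with set_value()
    std::thread worker([m](std::promise<int> pr) { pr.set_value(2 * m); },
                       std::move(p));
    int result = f.get();                    // blocks until set_value() is called
    worker.join();
    return result;
}

// Hypothetical helper: a packaged_task wraps a callable and exposes its result.
int value_from_packaged_task(int m) {
    std::packaged_task<int(int)> task([](int x) { return 2 * x; });
    std::future<int> f = task.get_future();  // packaged_task::get_future
    std::thread worker(std::move(task), m);  // running the task sets the result
    int result = f.get();
    worker.join();
    return result;
}
```

Calling any of the three helpers with 21 should return 42. std::async is the shortest form, while a promise or a packaged_task leaves the choice of thread to the caller.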

However, a future object can only be used if it is in a valid state. Default-constructed future objects are not valid. A future becomes valid when it is assigned from a valid one, such as the result of std::async, and it stops being valid once get() has been called on it. The shared state that a std::future references cannot be shared with other asynchronous return objects. If multiple threads need to wait for the same shared state, the std::shared_future class template should be used.

A basic example of asynchronous multi-threading, using std::async to create the threads and std::future to store the return values of their associated threads:
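The validity rules and the role of std::shared_future can be illustrated with a short sketch. The names ValidityReport, check_validity and sum_from_shared are hypothetical and exist only for this illustration.

```cpp
#include <future>
#include <vector>

// Hypothetical record of what valid() reports at each stage.
struct ValidityReport {
    bool default_constructed;
    bool after_async;
    bool after_get;
};

ValidityReport check_validity() {
    ValidityReport r;
    std::future<int> f;                     // default-constructed: no shared state
    r.default_constructed = f.valid();
    f = std::async(std::launch::async, [] { return 7; });
    r.after_async = f.valid();              // now refers to a shared state
    f.get();                                // get() releases the shared state
    r.after_get = f.valid();
    return r;
}

// Several tasks wait on the same result through copies of one shared_future.
int sum_from_shared(int value, int waiters) {
    std::promise<int> p;
    std::shared_future<int> sf = p.get_future().share();
    std::vector<std::future<int>> results;
    for (int i = 0; i < waiters; ++i) {
        // each task holds its own copy of sf and blocks in get()
        results.push_back(std::async(std::launch::async, [sf] { return sf.get(); }));
    }
    p.set_value(value);                     // releases every waiting task
    int total = 0;
    for (auto& r : results) total += r.get();
    return total;
}
```

Unlike std::future, a std::shared_future can be copied and its get() can be called more than once, which is why every waiting task above can read the same value.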

#include <vector>
#include <iostream>
#include <chrono>
#include <future>

int twice(int m){
     return 2 * m;
}

int main(int argc, char *argv[])
{
     std::vector<std::future<int>> futures;
     for (int i = 0; i < 10; ++i) {
          futures.push_back(std::async(twice, i));
     }

     for (size_t i = 0; i < futures.size(); i++){
          std::cout << futures.at(i).get() << std::endl;
     }

     return 0;
}
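The same pattern can be applied to the summation used in the Implementations section. The following is a sketch under stated assumptions, not a measured implementation: the function names sum_range and async_sum are hypothetical, and the std::launch::async policy is requested explicitly so that each task runs on its own thread instead of being deferred.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper: sums the half-open range [begin, end).
long long sum_range(std::size_t begin, std::size_t end) {
    long long s = 0;
    for (std::size_t i = begin; i < end; ++i) s += static_cast<long long>(i);
    return s;
}

// Hypothetical helper: splits [0, n) into ntasks ranges and sums them asynchronously.
long long async_sum(std::size_t n, std::size_t ntasks) {
    if (ntasks == 0) ntasks = 1;            // guard against an empty task count
    std::vector<std::future<long long>> parts;
    for (std::size_t t = 0; t < ntasks; ++t) {
        const std::size_t b = t * n / ntasks;
        const std::size_t e = (t + 1) * n / ntasks;
        parts.push_back(std::async(std::launch::async, sum_range, b, e));
    }
    long long total = 0;
    for (auto& p : parts) total += p.get(); // each get() waits for one partial sum
    return total;
}
```

Because every task returns its partial sum through its own future, no mutex is needed. The futures take the place of the shared variable used in the SPMD version.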

Unfortunately, OpenMP offers no direct equivalent of std::async and std::future, as it is designed for parallelism, not concurrency.

C++ 11 Threads and OpenMP compatibility

Can one safely use C++11 multi-threading as well as OpenMP in one and the same program but without interleaving them (i.e. no OpenMP statement in any code passed to C++11 concurrent features and no C++11 concurrency in threads spawned by OpenMP)?

On some platforms, an efficient implementation can only be achieved if the OpenMP run-time is the only one in control of the process threads. There are also certain aspects of OpenMP that might not play well with other threading constructs, for example the limit on the number of threads set by OMP_THREAD_LIMIT when forking two or more concurrent parallel regions.

Since the OpenMP standard neither strictly forbids using other threading paradigms nor standardises interoperability with them, supporting such functionality is up to the implementers. This means that some implementations might provide safe concurrent execution of top-level OpenMP regions and some might not. The x86 implementers pledge to support it, maybe because most of them are also proponents of other execution models (e.g. Intel with Cilk and TBB, GCC with C++11, etc.) and x86 is usually considered an "experimental" platform (other vendors are usually much more conservative).
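The non-interleaved usage described in the question can be sketched as two separate phases. This only illustrates the pattern; whether it is safe on a given platform still depends on the OpenMP implementation, as discussed above. The function names omp_phase, thread_phase and combined are hypothetical. The pragma is guarded by _OPENMP so that the code still compiles, and runs serially, when OpenMP is not enabled.

```cpp
#include <thread>
#include <vector>

// Phase 1: only OpenMP constructs, no C++11 threads inside the region.
long long omp_phase(int n) {
    long long sum = 0;
#ifdef _OPENMP
    #pragma omp parallel for reduction(+:sum)
#endif
    for (int i = 0; i < n; ++i) sum += i;
    return sum;
}

// Phase 2: only C++11 threads, no OpenMP directives inside the workers.
long long thread_phase(int n, unsigned nthreads) {
    if (nthreads == 0) nthreads = 1;
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&partial, t, n, nthreads] {
            // each worker sums its own block and writes to its own slot
            const long long b = static_cast<long long>(t) * n / nthreads;
            const long long e = static_cast<long long>(t + 1) * n / nthreads;
            long long s = 0;
            for (long long i = b; i < e; ++i) s += i;
            partial[t] = s;
        });
    }
    for (auto& th : pool) th.join();
    long long total = 0;
    for (long long s : partial) total += s;
    return total;
}

// The phases run one after the other, so the two runtimes never overlap.
long long combined(int n) {
    return omp_phase(n) + thread_phase(n, 4);
}
```

Keeping the phases sequential avoids having OpenMP worker threads and std::thread workers active at the same time, which is the case the standard leaves to the implementers.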

Conclusion

In conclusion, while OpenMP continues to be a viable option for multi-threading, it lacks some of the features outlined above and offers less low-level control. C++ 11 standard library multi-threading can be more difficult to learn, but it is supported by virtually all C++ 11 compilers and offers low-level interaction with hardware threads.