C++11 Threads Library Comparison to OpenMP

Group Members

Daniel Bogomazov
Nick Krillis


How Threads Work


OpenMP Threads

OpenMP (Open Multi-Processing) is an API specification for compilers that implements parallel programming on shared-memory architectures. Threading in OpenMP works through the use of compiler directives: constructs are used to create a parallel region in which threading can be performed.

For example:

#pragma omp construct [clause, ...] newline (\n)
 structured block

Through the use of different constructs we define the parallel operation to be performed. A construct is mandatory in order for OpenMP to execute the command.

For example:

#pragma omp parallel

The code above uses the parallel construct. This construct identifies a block of code to be executed by multiple threads (a parallel region).
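A clause can optionally follow the construct to modify its behaviour. As a minimal sketch (the num_threads clause is standard OpenMP; the thread count of 4 is an arbitrary choice for illustration):

#pragma omp parallel num_threads(4)
{
  // parallel region executed by a team of 4 threads (subject to runtime limits)
}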

Implicit Barrier

In OpenMP, there is by default an implicit barrier at the end of every parallel region. At the implicit barrier each thread waits until all threads in the team have finished; the team then joins back into a single thread, the master thread, which continues on alone.

// OpenMP - Parallel Construct
// omp_parallel.cpp

#include <iostream>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    std::cout << "Hello\n";
  }
  std::cout << "Fin\n";
  return 0;
}

Output:

Hello
Hello
Hello
Hello
Hello
Hello
Fin


C++11 Threads

C++11 introduced threading through the <thread> library.

Unlike OpenMP, C++11 does not use parallel regions as barriers for its threading. When a thread is run using the C++11 thread library, we must consider the scope of the parent thread. If the parent thread's scope ends before the child thread has been joined or detached, the program will crash.
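As a minimal sketch of this hazard (the lambda body and sleep duration are arbitrary choices for illustration):

// cpp11.scope.cpp (hypothetical example)
#include <iostream>
#include <thread>
#include <chrono>

int main() {
  std::thread t([]() {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    std::cout << "child finished\n";
  });
  // If main returned here while t was still joinable, the std::thread
  // destructor would call std::terminate() and crash the program.
  // Joining (or detaching) before the scope ends avoids this:
  t.join();
  return 0;
}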

Join and Detach

When using the join function on the child thread, the parent thread will be blocked until the child thread returns.

                     t2
            ____________________
           /                    \
__________/\___________________|/\__________
    t1          t1   t2.join() |      t1

When using the detach function on the child thread, the two threads will split and run independently. Even if the parent thread exits before the child thread is able to finish, the child thread will still be able to continue. The child thread is responsible for deallocation of memory upon completion.

OpenMP does not have this functionality: its threads cannot continue executing instructions outside of the parallel region the way a detached C++11 thread can.

                           t2
            ________________________________
           /                    
__________/\_______________________
    t1          t1   t2.detach()      
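As a brief usage sketch of both calls (the worker function, message strings, and sleep durations are hypothetical choices for illustration):

// cpp11.joindetach.cpp (hypothetical example)
#include <iostream>
#include <thread>
#include <chrono>

void worker(const char* name) {
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::cout << name << " done\n";
}

int main() {
  std::thread t2(worker, "joined thread");
  t2.join();    // the parent blocks here until t2 returns

  std::thread t3(worker, "detached thread");
  t3.detach();  // t3 now runs independently of the parent's scope

  // Give the detached thread time to finish before the process exits.
  std::this_thread::sleep_for(std::chrono::milliseconds(200));
  return 0;
}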

Creating a Thread

The following is the template for the overloaded thread constructor. The thread begins to run on initialization.
f is the function, functor, or lambda expression to be executed in the thread. args are the arguments to pass to f.

template<class Function, class... Args>
explicit thread(Function&& f, Args&&... args);
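For illustration, a minimal sketch constructing a thread from a lambda expression with an argument (the variable names and values are arbitrary):

#include <iostream>
#include <thread>

int main() {
  int base = 10;
  // The lambda is the Function argument; 5 is forwarded through Args.
  std::thread t([base](int offset) {
    std::cout << "result = " << base + offset << '\n';
  }, 5);
  t.join();
  return 0;
}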


How Multithreading Works


Multithreading With OpenMP

#include <iostream>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    std::cout << "Hi from thread "<< tid << '\n';
  }
  return 0;
}

Output:

Hi from thread Hi from Thread 2
0
Hi from thread 1
Hi from thread 3

Essentially, the threads are intermingling, which creates the jumbled output. All of the threads are trying to write to the cout stream at the same time, so the output of one thread can be interrupted by the output of another.


Threading with C++11

Unlike OpenMP, C++11 threads are created by the programmer instead of the compiler.

std::this_thread::get_id() is similar to OpenMP's omp_get_thread_num(), but instead of an int it returns a std::thread::id object.

// cpp11.multithreading.cpp

#include <iostream>
#include <vector>
#include <thread>

void func1(int index) {
  std::cout << "Index: " << index << " - ID: " << std::this_thread::get_id() << std::endl;
}

int main() {
  int numThreads = 10;

  std::vector<std::thread> threads;

  std::cout << "Creating threads...\n";

  for (int i = 0; i < numThreads; i++)
    threads.push_back(std::thread(func1, i));

  std::cout << "All threads have launched!\n";
  std::cout << "Syncronizing...\n";

  for (auto& thread : threads)
    thread.join();

  std::cout << "All threads have syncronized!\n";

  return 0;
}

Since all threads are using the std::cout stream, the output can appear jumbled and out of order. The solution to this problem will be presented in the next section.

Creating threads...
Index: 0 - ID: Index: 1 - ID: Index: 2 - ID: 0x70000b57e000
0x70000b4fb000
0x70000b601000Index: 3 - ID: 0x70000b684000
Index: 
4 - ID: 0x70000b707000
Index: 5 - ID: 0x70000b78a000
Index: 6 - ID: 0x70000b80d000
Index: 7 - ID: 0x70000b890000
Index: All threads have launched!
8 - ID: 0x70000b913000
Index: Syncronizing...
9 - ID: 0x70000b996000
All threads have syncronized!


How Synchronization Works


Synchronization With OpenMP

critical

#include <iostream>
#include <omp.h>

int main() 
{
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    #pragma omp critical
    std::cout << "Hi from thread "<< tid << '\n';
  }
  return 0;
}

Using the critical construct, we are able to limit access to the stream to one thread at a time. critical defines a region that only one thread is allowed to execute at a time; in this case, it is the write to the cout stream that we are limiting to one thread. The revised code now has output like this:

Hi from thread 0
Hi from thread 1
Hi from thread 2
Hi from thread 3


parallel for

In OpenMP, a for loop can be parallelized using the parallel for construct. This directive automatically distributes the loop iterations among the threads.

Example:

void simple(int n, float *a, float *b) {
  int i;
  #pragma omp parallel for
  for (i = 1; i < n; i++)
    b[i] = (a[i] + a[i-1]) / 2.0;
}


Synchronization with C++11

mutex

To allow for thread synchronization, we can use the <mutex> library to lock specific sections of code so that they cannot be executed by multiple threads at once.

// cpp11.mutex.cpp

#include <iostream>
#include <vector>
#include <thread>
#include <mutex>

std::mutex mu;

void func1(int index) {
  std::lock_guard<std::mutex> lock(mu);
  // mu.lock();
  std::cout << "Index: " << index << " - ID: " << std::this_thread::get_id() << std::endl;
  // mu.unlock();
}

int main() {
  int numThreads = 10;

  std::vector<std::thread> threads;

  std::cout << "Creating threads...\n";

  for (int i = 0; i < numThreads; i++)
    threads.push_back(std::thread(func1, i));

  std::cout << "All threads have launched!\n";
  std::cout << "Syncronizing...\n";

  for (auto& thread : threads)
    thread.join();

  std::cout << "All threads have syncronized!\n";

  return 0;
}

Using a mutex, we're able to place a lock around the code that uses the shared data, giving us mutual exclusion. This is similar to OpenMP's critical in that it only allows one thread to execute the locked block of code at a time.

Creating threads...
Index: 0 - ID: 0x70000aa29000
Index: 4 - ID: 0x70000ac35000
Index: 5 - ID: 0x70000acb8000
Index: 1 - ID: 0x70000aaac000
Index: 6 - ID: 0x70000ad3b000
Index: 7 - ID: 0x70000adbe000
Index: 8 - ID: 0x70000ae41000
Index: 3 - ID: 0x70000abb2000
All threads have launched!
Syncronizing...
Index: 9 - ID: 0x70000aec4000
Index: 2 - ID: 0x70000ab2f000
All threads have syncronized!


How Data Sharing Works


Data Sharing With OpenMP

In OpenMP, by default all data declared outside the parallel region is shared: every thread accesses the same variable rather than its own copy. Therefore, we must be careful how the data is handled within the parallel region when it is accessed by multiple threads at once.

For example:

#include <iostream>
#include <omp.h>

int main() {
  int i = 12;
  #pragma omp parallel
  {
    #pragma omp critical
    std::cout << "\ni = " << ++i;
  }
  std::cout << "\ni = " << i << std::endl;
  return 0;
}

Output:

i = 13
i = 14
i = 15
i = 16
i = 16

The output from the code above shows that, even after the parallel region is closed, our variable i holds a different value than it did originally. This is because the variable is shared inside and outside the parallel region. In order to pass this variable by value to each thread, we must make it non-shared. This is done with the firstprivate() clause, which comes after a construct: firstprivate(i) gives each thread its own private copy of i, initialized to i's original value.

For example:

#include <iostream>
#include <omp.h>

int main() {
  int i = 12;
  #pragma omp parallel firstprivate(i)
  {
  #pragma omp critical
    std::cout << "\ni = " << ++i;
  }
  std::cout << "\ni = " << i << std::endl;
}

New Output:

i = 13
i = 13
i = 13
i = 13
i = 12

Here we can see that within each individual thread the value of i starts at 12 and is then incremented by that thread to 13. The last line of the output shows i = 12, confirming that the parallel region did not change the value of i outside of the region.


Data Sharing with C++11

The C++11 thread library requires the programmer to explicitly pass a reference (with std::ref) to any data that should be shared by the threads; by default, thread arguments are copied by value.

// cpp11.datasharing.cpp

#include <iostream>
#include <vector>
#include <thread>
#include <mutex>

std::mutex mu;

void func1(int value) {
  std::lock_guard<std::mutex> lock(mu);
  std::cout << "func1 start - value = " << value << std::endl;
  value = 0;
  std::cout << "func1 end - value = " << value << std::endl;
}

void func2(int& value) {
  std::lock_guard<std::mutex> lock(mu);
  std::cout << "func2 start - value = " << value << std::endl;
  value *= 2;
  std::cout << "func2 end - value = " << value << std::endl;
}

int main() {
  int numThreads = 5;
  int value = 1;

  std::vector<std::thread> threads;

  for (int i = 0; i < numThreads; i++) {
  	if (i == 2) threads.push_back(std::thread(func1, value));
  	else threads.push_back(std::thread(func2, std::ref(value)));
  }

  for (auto& thread : threads)
    thread.join();

  return 0;
}
Output:

func2 start - value = 1
func2 end - value = 2
func2 start - value = 2
func2 end - value = 4
func1 start - value = 1
func1 end - value = 0
func2 start - value = 4
func2 end - value = 8
func2 start - value = 8
func2 end - value = 16


How Synchronization Works Continued


Synchronization Continued With OpenMP

atomic

The atomic construct is OpenMP's way of serializing a single operation. The advantage of using the atomic construct in the example below is that it performs the increment with less overhead than critical. atomic ensures that the protected operation is performed by only one thread at a time.

#include <iostream>
#include <omp.h>

int main() {
  int i = 0;
  #pragma omp parallel num_threads(10)
  {
  	#pragma omp atomic
  	i++;
  }
  std::cout << i << std::endl;
  return 0;
}
Output:

10


Synchronization Continued with C++11

atomic

Another way to ensure synchronization of data between threads is to use the <atomic> library.

// cpp11.atomic.cpp

#include <iostream>
#include <vector>
#include <thread>
#include <atomic>

std::atomic<int> value(1);

void add() {
  ++value;
}

void sub() {
  --value;
}

int main() {
  int numThreads = 5;

  std::vector<std::thread> threads;

  for (int i = 0; i < numThreads; i++) {
  	if (i == 2) threads.push_back(std::thread(sub));
  	else threads.push_back(std::thread(add));
  }

  for (auto& thread : threads)
    thread.join();

  std::cout << value << std::endl;

  return 0;
}

The atomic value can only be modified by one thread at a time. This works like the mutex lock, except that the locking is handled by the atomic wrapper instead of by the programmer.

Output:

4
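To illustrate why the atomic wrapper matters, here is a minimal counter-example sketch (the thread count of 5 and loop count of 100000 are arbitrary choices): with a plain int instead of std::atomic<int>, concurrent increments form a data race and some updates are typically lost.

// cpp11.race.cpp (hypothetical example)
#include <iostream>
#include <vector>
#include <thread>

int value = 0;   // plain int, NOT std::atomic

void add() {
  for (int i = 0; i < 100000; i++)
    ++value;   // unsynchronized read-modify-write: a data race
}

int main() {
  int numThreads = 5;

  std::vector<std::thread> threads;

  for (int i = 0; i < numThreads; i++)
    threads.push_back(std::thread(add));

  for (auto& thread : threads)
    thread.join();

  // Expected 500000, but the printed value is usually smaller
  // because increments from different threads overwrite each other.
  std::cout << value << std::endl;

  return 0;
}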


Thread Creation Test

The following test measures thread creation overhead by repeatedly launching and joining ten C++11 threads, and by repeatedly running a ten-iteration OpenMP parallel for region, over an empty function.

#include <iostream>
#include <string>
#include <chrono>
#include <vector>
#include <thread>
#include <omp.h>

using namespace std::chrono;

void reportTime(const char* msg, int size, steady_clock::duration span) {
  auto ms = duration_cast<milliseconds>(span);
  std::cout << msg << "- size : " << std::to_string(size) << " - took - " << ms.count() << " milliseconds" << std::endl;
}

void empty() {}

void cpp(int size) {
  steady_clock::time_point ts, te;
  ts = steady_clock::now();
  for (int i = 0; i < size; i++) {
    std::vector<std::thread> threads;
    for (int j = 0; j < 10; j++) threads.push_back(std::thread(empty));
    for (auto& thread : threads) thread.join();
  }
  te = steady_clock::now();
  reportTime("C++11 Threads", size, te - ts);
}

void omp(int size) {
  steady_clock::time_point ts, te;
  ts = steady_clock::now();
  for (int i = 0; i < size; i++) {
    #pragma omp parallel for num_threads(10)
      for (int i = 0; i < 10; i++) empty();
  }
  te = steady_clock::now();
  reportTime("OpenMP", size, te - ts);
}

int main() {

  // Test C++11 Threads
  cpp(1);
  cpp(10);
  cpp(100);
  cpp(1000);
  cpp(10000);
  cpp(100000);

  std::cout << std::endl;

  // Test OpenMP
  omp(1);
  omp(10);
  omp(100);
  omp(1000);
  omp(10000);
  omp(100000);

  return 0;
}
Output:

C++11 Threads- size : 1 - took - 1 milliseconds
C++11 Threads- size : 10 - took - 10 milliseconds
C++11 Threads- size : 100 - took - 125 milliseconds
C++11 Threads- size : 1000 - took - 1703 milliseconds
C++11 Threads- size : 10000 - took - 20760 milliseconds
C++11 Threads- size : 100000 - took - 168628 milliseconds

OpenMP- size : 1 - took - 0 milliseconds
OpenMP- size : 10 - took - 0 milliseconds
OpenMP- size : 100 - took - 0 milliseconds
OpenMP- size : 1000 - took - 6 milliseconds
OpenMP- size : 10000 - took - 62 milliseconds
OpenMP- size : 100000 - took - 616 milliseconds

[Figures: Cpp11threadgraph.png and Openmpthreadgraph.png — graphs of the C++11 and OpenMP thread creation timings above]