DPS921/ND-R&D

7,316 bytes added, 20:36, 4 December 2018
{{GPU621/DPS921 Index | 20187}}
<h1>C++11 Threads Library Comparison to OpenMP</h1><h3>Group Members</h3>Daniel Bogomazov<br>Nick Krillis<br><br>
<!-- How Threads Works -->
<h2>How Threads Work</h2>
<!-- OpenMP Threads -->
<h3>OpenMP Threads</h3>
<p>OpenMP (Open Multi-Processing) is an API specification for compilers that implement an explicit SPMD programming model on shared-memory architectures. OpenMP implements threading through the main thread, which forks a specific number of child threads and divides the task amongst them. The runtime environment then allocates the threads onto multiple processors.</p>
<p>The OpenMP standard consists of three main components: compiler directives, runtime library routines, and environment variables. Threading in OpenMP works through compiler directives with constructs that create a parallel region in which threading can be performed.</p><p>For example:</p><pre class="code">#pragma omp construct [clause, ...] newline (\n)
structured block</pre> <p>The construct selects the parallel programming command to be used. A construct is mandatory for OpenMP to execute the command.</p> <p>For example:</p> <pre class="code">#pragma omp parallel</pre>
<p>The code above shows the parallel construct, which identifies a block of code to be executed by multiple threads.</p>
* '''Compiler directives'''
** Compiler directives control the parallelism of code regions.
** The directive keyword placed after #pragma omp tells the compiler what action to take on that region of code. OpenMP also allows clauses after the directive to request additional behaviour on that parallel region.
** Examples of directives include:
*** Parallel (#pragma omp parallel)
**** Defines a parallel region in which the compiler knows to form threads for parallel execution.
*** Task (#pragma omp task)
**** Defines an explicit task. The data environment of the task is created according to the data-sharing attribute clauses on the task construct and any defaults that apply.
*** Simd (#pragma omp simd)
**** Applied to a loop to indicate that the loop can be transformed into a SIMD loop.
*** Atomic (#pragma omp atomic)
**** Ensures that a specific memory location is accessed atomically. It helps avoid race conditions through direct control of concurrent threads and is used for writing more efficient algorithms.
* '''Runtime library routines'''
** These include routines that set and get the total number of threads, the current thread, and so on. For example:
*** omp_set_num_threads(int) sets the number of threads to request for the next parallel region, while omp_get_num_threads() returns how many threads are in the current team.
* '''Environment variables'''
** These are used to guide OpenMP. A widely used example is OMP_NUM_THREADS, which defines the maximum number of threads for OpenMP to attempt to use.
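<p>As a rough illustration of a directive combined with a clause, the sketch below sums squares with <code>parallel for</code> (the directive) and <code>reduction(+:sum)</code> (the clause). The function name <code>sum_of_squares</code> is hypothetical and not part of the original page. The code uses only pragmas, so it should also compile and give the same result when OpenMP is disabled.</p>

```cpp
#include <cassert>

// Hypothetical helper: returns 1^2 + 2^2 + ... + n^2.
// "parallel for" is the directive; "reduction(+:sum)" is the clause.
long long sum_of_squares(int n) {
    assert(n >= 0);
    long long sum = 0;
    // Each thread accumulates a private partial sum; the partial sums
    // are combined into the shared variable at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += static_cast<long long>(i) * i;
    return sum;
}
```

<p>Calling <code>sum_of_squares(10)</code> returns 385 regardless of how many threads the runtime chooses, because the reduction clause removes the race on <code>sum</code>.</p>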
<h4>Implicit Barrier</h4>
<p>In OpenMP, every parallel region ends with an implicit barrier by default. At the implicit barrier all threads wait for one another and are joined back into a single thread, the master thread, which then continues alone.</p>
<p>Using the components above, the programmer can set up a parallel region to run tasks in parallel. The following is an example of thread creation controlled by OpenMP:</p>
<pre class="code">// OpenMP - Parallel Construct
// omp_parallel.cpp

#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main() {
    #pragma omp parallel
    {
        std::cout &lt;&lt; "Hello\n";
    }
    // Implicit barrier: only the master thread continues past this point
    std::cout &lt;&lt; "Fin\n";
    return 0;
}</pre>
<p>Output:</p>
<pre class="code">Hello
Hello
Hello
Hello
Hello
Hello
Fin
</pre>
=== Multithreading ===
'''Control Structures'''
* OpenMP has a deliberately small set of control structures; most parallel applications need only a few of them.
* These control structures execute using the fork-join method, where the control structure marks the point at which each new thread starts.
* OpenMP includes control structures only where a compiler can provide both functionality and performance beyond what a user could reasonably program.
'''Data Environment'''
* Each process in OpenMP has associated clauses that define the data environment.
* Each new data environment is constructed only for new processes at the time of execution.
* The following clauses change storage attributes for the construct they are applied to, rather than for the entire parallel region:
** SHARED
** PRIVATE
** FIRSTPRIVATE
** LASTPRIVATE
** DEFAULT
* By default almost all variables are shared, including global variables. Not everything is shared, however: stack variables that are part of subprograms or functions called in parallel regions are PRIVATE.
<!-- C++11 Threads -->
<h3>C++11 Threads</h3>
<p>C++11 introduced threading through the thread library.</p>
<p>Unlike OpenMP, C++11 does <i>not</i> use parallel regions as barriers for its threading. When a thread is run using the C++11 thread library, we must consider the scope of the parent thread. If the std::thread object goes out of scope while the child thread is still joinable (neither joined nor detached), std::terminate is called and the program aborts.</p>
<h4>Join and Detach</h4>
<p>When using the join function on the child thread, the parent thread will be blocked until the child thread returns.</p>
<pre class="code">                  t2
            ____________________
           /                    \
__________/\___________________|/\__________
    t1        t1     t2.join() |      t1
</pre>
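<p>The blocking behaviour of join can be sketched as follows. This is a minimal illustration, not code from the original page; the helper name <code>compute_in_child</code> is hypothetical.</p>

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: the child thread computes x * x while the
// parent thread waits for it in join().
int compute_in_child(int x) {
    int result = 0;
    std::thread child([&result, x] { result = x * x; });
    assert(child.joinable());  // running and not yet joined or detached
    child.join();              // parent blocks here until the child returns
    // After join() returns, the child's write to result is visible here.
    return result;
}
```

<p>Because the parent reads <code>result</code> only after <code>join()</code> has returned, no further synchronization is needed in this sketch.</p>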
<p>When the detach function is called on the child thread, the two threads split and run independently. Even if the parent's scope exits before the child thread has finished, the child thread can continue to run, and it is responsible for deallocating its memory upon completion. A detached thread does not outlive the process, however: once main returns, any detached threads still running are terminated.</p>
<pre class="code">                  t2
            ________________________________
           /
__________/\_______________________
    t1        t1     t2.detach()
</pre>
<p>OpenMP does not have this functionality. Unlike the C++11 thread library, OpenMP cannot execute instructions outside of its parallel region.</p>
<p>Synchronization in OpenMP happens inside the parallel region instead:</p>
* Synchronization is a way of telling the threads of a parallel region to complete their work in a specific order.
* The most common form of synchronization is the barrier: threads wait at the barrier until every thread in the parallel region has reached the same point.
* Some constructs help implement synchronization. The master construct defines a block that is executed only by the master thread, so the other threads skip it. The ordered construct makes a region inside a parallel loop execute in sequential order.
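<p>The ordered construct mentioned above can be sketched roughly as follows. This is an illustrative example only; the function name <code>squares_in_order</code> is hypothetical. It uses only pragmas, so it should behave the same (sequentially) if OpenMP is disabled.</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: computes i * i for i in [0, n) in parallel, but
// appends the results to the vector in sequential loop order.
std::vector<int> squares_in_order(int n) {
    assert(n >= 0);
    std::vector<int> out;
    // The "ordered" clause allows an ordered region inside the loop.
    #pragma omp parallel for ordered
    for (int i = 0; i < n; i++) {
        int sq = i * i;  // this part may run concurrently
        // Executed by one thread at a time, in iteration order
        #pragma omp ordered
        out.push_back(sq);
    }
    return out;
}
```

<p>Only the statement under <code>#pragma omp ordered</code> is serialized; the rest of each iteration is still free to run in parallel.</p>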
<h4>Creating a Thread</h4><p>The following is the template used for the overloaded thread constructor. The thread begins to run on initialization.<br>f is the function, functor, or lambda expression to be executed in the thread. args are the arguments to pass to f.</p><pre class="code">template&lt;class Function, class... Args&gt;
explicit thread(Function&& f, Args&&... args);</pre>
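<p>A small sketch of the three kinds of callable the constructor accepts is given below. The names <code>add_to</code>, <code>Doubler</code>, and <code>run_three_threads</code> are hypothetical and chosen only for illustration. Each thread is joined before the next one starts, so the result is deterministic.</p>

```cpp
#include <cassert>
#include <functional>
#include <thread>

// Plain function taking one argument by reference and one by value
void add_to(int& total, int amount) { total += amount; }

// Functor: an object with operator()
struct Doubler {
    void operator()(int& total) const { total *= 2; }
};

// Hypothetical helper that runs one thread of each kind in sequence.
int run_three_threads() {
    int total = 1;

    std::thread t1(add_to, std::ref(total), 4);  // function: 1 + 4 = 5
    t1.join();

    std::thread t2(Doubler{}, std::ref(total));  // functor: 5 * 2 = 10
    t2.join();

    std::thread t3([&total] { total -= 3; });    // lambda: 10 - 3 = 7
    t3.join();

    return total;
}
```

<p>Arguments that the callable takes by reference are wrapped in <code>std::ref()</code>, because the thread constructor copies its arguments by default.</p>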
<!-- How Multithreading Works -->
<h2>How Multithreading Works</h2>
<!-- Multithreading With OpenMP -->
<h3>Multithreading With OpenMP</h3>
<pre class="code">#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        std::cout &lt;&lt; "Hi from thread " &lt;&lt; tid &lt;&lt; '\n';
    }
    return 0;
}
</pre>
<p>Output:</p><pre class="code">Hi from thread Hi from thread 2
0
Hi from thread 1
Hi from thread 3</pre>
<p>In the code above, the output of the threads is intermingled, which produces jumbled text. All threads try to write to the cout stream at the same time, so one thread can interrupt another partway through a line.</p>
<!-- Threading with C++11 -->
<h3>Threading with C++11</h3>
<p>Threading in C++11 is available through the &lt;thread&gt; library. C++11 relies mostly on joining or detaching forked subthreads.</p>
<p>Unlike OpenMP, C++11 threads are created by the programmer instead of the compiler.</p>
<p>std::this_thread::get_id() is similar to OpenMP's omp_get_thread_num(), but instead of an int it returns a std::thread::id object.</p>
<p>A thread begins running on initialization. While the threads run in parallel, the parent's scope could exit before the child thread has finished, which results in an error. The two main ways of dealing with this problem are joining the child thread to the parent thread or detaching it. With join, the parent thread (t1) calls t2.join() on the child thread (t2) and is blocked until t2 returns. With detach, the two threads are separated entirely, so t2 can keep running without errors even if t1 exits first, and t2 deallocates its memory after it finishes. The following example joins every thread it creates:</p>
<pre class="code">// cpp11.multithreading.cpp

#include &lt;iostream&gt;
#include &lt;vector&gt;
#include &lt;thread&gt;

void func1(int index) {
    std::cout &lt;&lt; "Index: " &lt;&lt; index &lt;&lt; " - ID: " &lt;&lt; std::this_thread::get_id() &lt;&lt; std::endl;
}

int main() {
    int numThreads = 10;

    std::vector&lt;std::thread&gt; threads;

    std::cout &lt;&lt; "Creating threads...\n";

    // Each thread starts running as soon as it is constructed
    for (int i = 0; i &lt; numThreads; i++)
        threads.push_back(std::thread(func1, i));

    std::cout &lt;&lt; "All threads have launched!\n";
    std::cout &lt;&lt; "Syncronizing...\n";

    // Block until every child thread has returned
    for (auto& thread : threads)
        thread.join();

    std::cout &lt;&lt; "All threads have syncronized!\n";

    return 0;
}</pre>
<p>Since all threads are using the std::cout stream, the output can appear jumbled and out of order. The solution to this problem will be presented in the next section.</p>
<pre class="code">Creating threads...
Index: 0 - ID: Index: 1 - ID: Index: 2 - ID: 0x70000b57e000
0x70000b4fb000
0x70000b601000Index: 3 - ID: 0x70000b684000
Index:
4 - ID: 0x70000b707000
Index: 5 - ID: 0x70000b78a000
Index: 6 - ID: 0x70000b80d000
Index: 7 - ID: 0x70000b890000
Index: All threads have launched!
8 - ID: 0x70000b913000
Index: Syncronizing...
9 - ID: 0x70000b996000
All threads have syncronized!
</pre>
<p>The thread constructor takes a function, functor, or lambda expression as its first argument, followed by zero or more arguments to be passed to it. By default the constructor treats all arguments as if they were passed by value, even if the function takes a parameter by reference. To pass an argument by reference, the programmer must wrap it in std::ref().</p>
<!-- How Synchronization Works -->
<h2>How Synchronization Works</h2>
<!-- Synchronization With OpenMP -->
<h3>Synchronization With OpenMP</h3>
<!-- critical -->
<h4>critical</h4>
<pre class="code">#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp critical
        std::cout &lt;&lt; "Hi from thread " &lt;&lt; tid &lt;&lt; '\n';
    }
    return 0;
}</pre>
<p>With the critical construct we can limit access to the stream to one thread at a time. critical defines a region that only one thread is allowed to execute at a time; in this case the region is the statement that writes to the cout stream. The revised code produces output like this:</p>
<pre class="code">Hi from thread 0
Hi from thread 1
Hi from thread 2
Hi from thread 3
</pre>
<!-- parallel for -->
<h4>parallel for</h4>
<p>Multithreading with the C++11 thread library requires manual creation of every new thread. To decide how many threads to create, the programmer can set the number manually or call the hardware_concurrency function, which returns the number of threads the hardware can run concurrently, in a similar way to OpenMP's omp_get_max_threads().</p><p>OpenMP, in contrast, can parallelize a for loop with the parallel for construct, which automatically distributes the loop iterations between threads.</p>
<p>Example:</p>
<pre class="code">void simple(int n, float *a, float *b) {
    int i;
    #pragma omp parallel for
    for (i = 1; i &lt; n; i++)
        b[i] = (a[i] + a[i-1]) / 2.0;
}
</pre>
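<p>For comparison, the same loop can be distributed by hand with the C++11 thread library. The sketch below is one possible way to do it, not the approach of the original page: the name <code>simple_threads</code>, the chunking scheme, and the fallback of 2 threads when <code>hardware_concurrency()</code> returns 0 are all assumptions made for illustration.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical C++11 counterpart of simple():
// b[i] = (a[i] + a[i-1]) / 2 for every i in [1, n).
void simple_threads(int n, const float* a, float* b) {
    int iterations = n - 1;
    if (iterations <= 0) return;

    // hardware_concurrency() may return 0 when the value is unknown
    unsigned hw = std::thread::hardware_concurrency();
    int numThreads = (hw == 0) ? 2 : static_cast<int>(hw);
    numThreads = std::min(numThreads, iterations);

    // Ceiling division so that every iteration lands in some chunk
    int chunk = (iterations + numThreads - 1) / numThreads;

    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; t++) {
        int begin = 1 + t * chunk;
        int end = std::min(n, begin + chunk);
        if (begin >= end) break;
        // Each thread writes a disjoint range of b, so no lock is needed
        threads.push_back(std::thread([=] {
            for (int i = begin; i < end; i++)
                b[i] = (a[i] + a[i - 1]) / 2.0f;
        }));
    }

    for (auto& thread : threads)
        thread.join();
}
```

<p>The single <code>#pragma omp parallel for</code> line in the OpenMP version replaces all of the thread-count, chunking, and joining logic shown here.</p>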
<!-- Synchronization with C++11 -->
<h3>Synchronization with C++11</h3>
<h4>mutex</h4><p>To allow for thread synchronization, we can use the mutex library to lock specific sections of code so that they cannot be executed by multiple threads at once.</p>
<p>The &lt;mutex&gt; library prevents unwanted race conditions. A mutex creates a region of exclusivity through a lock system: once locked, it protects shared data from being accessed by multiple threads at the same time. To keep a mutex from staying locked forever (for example, when an exception is thrown before the unlock function runs), it is advised to use std::lock_guard&lt;std::mutex&gt;, which manages locking in a more exception-safe manner.</p>
<pre class="code">// cpp11.mutex.cpp

#include &lt;iostream&gt;
#include &lt;vector&gt;
#include &lt;thread&gt;
#include &lt;mutex&gt;

std::mutex mu;

void func1(int index) {
    // lock_guard locks mu here and unlocks it when lock goes out of scope
    std::lock_guard&lt;std::mutex&gt; lock(mu);
    // mu.lock();    // manual alternative to lock_guard
    std::cout &lt;&lt; "Index: " &lt;&lt; index &lt;&lt; " - ID: " &lt;&lt; std::this_thread::get_id() &lt;&lt; std::endl;
    // mu.unlock();  // manual alternative to lock_guard
}

int main() {
    int numThreads = 10;

    std::vector&lt;std::thread&gt; threads;

    std::cout &lt;&lt; "Creating threads...\n";

    for (int i = 0; i &lt; numThreads; i++)
        threads.push_back(std::thread(func1, i));

    std::cout &lt;&lt; "All threads have launched!\n";
    std::cout &lt;&lt; "Syncronizing...\n";

    for (auto& thread : threads)
        thread.join();

    std::cout &lt;&lt; "All threads have syncronized!\n";

    return 0;
}
</pre>
<p>Using mutex, we're able to place a lock on the data used by the threads to allow for mutual exclusion. This is similar to OpenMP's critical in that it only allows one thread to execute a block of code at a time.</p>
<pre class="code">Creating threads...
Index: 0 - ID: 0x70000aa29000
Index: 4 - ID: 0x70000ac35000
Index: 5 - ID: 0x70000acb8000
Index: 1 - ID: 0x70000aaac000
Index: 6 - ID: 0x70000ad3b000
Index: 7 - ID: 0x70000adbe000
Index: 8 - ID: 0x70000ae41000
Index: 3 - ID: 0x70000abb2000
All threads have launched!
Syncronizing...
Index: 9 - ID: 0x70000aec4000
Index: 2 - ID: 0x70000ab2f000
All threads have syncronized!</pre>
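<p>The exception safety of std::lock_guard can be sketched as follows. This is an illustrative example under assumed names (<code>demo_mu</code>, <code>increment_or_throw</code>, and <code>count_after_failure</code> are hypothetical), and it runs on a single thread to keep the behaviour easy to follow.</p>

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>

std::mutex demo_mu;  // hypothetical mutex used only in this sketch

// Locks demo_mu, then either throws or increments the counter.
void increment_or_throw(int& counter, bool fail) {
    std::lock_guard<std::mutex> lock(demo_mu);
    if (fail)
        throw std::runtime_error("failure while holding the lock");
    ++counter;
}   // lock is destroyed here, or during unwinding, which unlocks demo_mu

// Hypothetical helper: one failing call followed by one successful call.
int count_after_failure() {
    int counter = 0;
    try {
        increment_or_throw(counter, true);
    } catch (const std::runtime_error&) {
        // The lock_guard has already released demo_mu at this point
    }
    // This call could not acquire demo_mu if the failed call had left it locked
    increment_or_throw(counter, false);
    return counter;
}
```

<p>With manual <code>mu.lock()</code> and <code>mu.unlock()</code> calls, the exception would skip the unlock and the mutex would stay locked.</p>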
<!-- How Data Sharing Works -->
<h2>How Data Sharing Works</h2>
<!-- Data Sharing With OpenMP -->
<h3>Data Sharing With OpenMP</h3>
<p>In OpenMP, by default all data is shared and passed by reference. Therefore, we must be careful how the data is handled within the parallel region if it is accessed by multiple threads at once.</p>
<p>For example:</p>
<pre class="code">#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main() {
    int i = 12;
    #pragma omp parallel
    {
        #pragma omp critical
        std::cout &lt;&lt; "\ni = " &lt;&lt; ++i;
    }
    std::cout &lt;&lt; "\ni = " &lt;&lt; i &lt;&lt; std::endl;
    return 0;
}</pre>
<p>Output:</p>
<pre class="code">i = 13
i = 14
i = 15
i = 16
i = 16</pre>
<p>The output shows that after the parallel region closes, the variable i holds a different value than it did originally, because the variable is shared inside and outside the parallel region. To pass this variable by value to each thread, we must make it non-shared. This is done with firstprivate(), which is a clause and therefore comes after a construct. firstprivate(i) gives each thread its own private copy of i, initialized with the value i had before the region.</p>
<p>For example:</p>
<pre class="code">#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main() {
    int i = 12;
    #pragma omp parallel firstprivate(i)
    {
        #pragma omp critical
        std::cout &lt;&lt; "\ni = " &lt;&lt; ++i;
    }
    std::cout &lt;&lt; "\ni = " &lt;&lt; i &lt;&lt; std::endl;
}</pre>
<p>New Output:</p>
<pre class="code">i = 13
i = 13
i = 13
i = 13
i = 12</pre>
<p>In each individual thread the value of i starts at 12 and is incremented by that thread to 13. The last line of the output shows i = 12, so the parallel region did not change the value of i outside the parallel region.</p>

<!-- Data Sharing with C++11 -->
<h3>Data Sharing with C++11</h3>
<p>The C++11 thread library requires the programmer to pass in the address of the data that should be shared by the threads, for example by wrapping the argument in std::ref().</p>
<pre class="code">// cpp11.datasharing.cpp

#include &lt;iostream&gt;
#include &lt;vector&gt;
#include &lt;thread&gt;
#include &lt;mutex&gt;

std::mutex mu;

// Takes value by copy: changes are local to the thread
void func1(int value) {
    std::lock_guard&lt;std::mutex&gt; lock(mu);
    std::cout &lt;&lt; "func1 start - value = " &lt;&lt; value &lt;&lt; std::endl;
    value = 0;
    std::cout &lt;&lt; "func1 end - value = " &lt;&lt; value &lt;&lt; std::endl;
}

// Takes value by reference: changes are visible to all threads
void func2(int& value) {
    std::lock_guard&lt;std::mutex&gt; lock(mu);
    std::cout &lt;&lt; "func2 start - value = " &lt;&lt; value &lt;&lt; std::endl;
    value *= 2;
    std::cout &lt;&lt; "func2 end - value = " &lt;&lt; value &lt;&lt; std::endl;
}

int main() {
    int numThreads = 5;
    int value = 1;

    std::vector&lt;std::thread&gt; threads;

    for (int i = 0; i &lt; numThreads; i++) {
        if (i == 2)
            threads.push_back(std::thread(func1, value));
        else
            threads.push_back(std::thread(func2, std::ref(value)));
    }

    for (auto& thread : threads)
        thread.join();

    return 0;
}</pre>
<pre class="code">func2 start - value = 1
func2 end - value = 2
func2 start - value = 2
func2 end - value = 4
func1 start - value = 1
func1 end - value = 0
func2 start - value = 4
func2 end - value = 8
func2 start - value = 8
func2 end - value = 16</pre>

<!-- How Synchronization Works Continued -->
<h2>How Synchronization Works Continued</h2>

<!-- Synchronization Continued With OpenMP -->
<h3>Synchronization Continued With OpenMP</h3>
<h4>atomic</h4>
<p>The atomic construct is OpenMP's way of serializing a specific operation. Its advantage in the example below is that it performs the increment with less overhead than critical. atomic ensures that only the protected operation is performed by one thread at a time.</p>
<pre class="code">#include &lt;iostream&gt;

int main() {
    int i = 0;
    #pragma omp parallel num_threads(10)
    {
        #pragma omp atomic
        i++;
    }
    std::cout &lt;&lt; i &lt;&lt; std::endl;
    return 0;
}</pre>
<pre class="code">10</pre>

<!-- Synchronization Continued with C++11 -->
<h3>Synchronization Continued with C++11</h3>
<h4>atomic</h4>
<p>Another way to ensure synchronization of data between threads is to use the atomic library. Atomic in the C++11 thread library works very similarly to atomic in OpenMP: std::atomic is a wrapper for a variable type that gives the variable atomic properties, that is, it can only be written by one thread at a time.</p>
<pre class="code">// cpp11.atomic.cpp

#include &lt;iostream&gt;
#include &lt;vector&gt;
#include &lt;thread&gt;
#include &lt;atomic&gt;

std::atomic&lt;int&gt; value(1);

void add() {
    ++value;
}

void sub() {
    --value;
}

int main() {
    int numThreads = 5;

    std::vector&lt;std::thread&gt; threads;

    for (int i = 0; i &lt; numThreads; i++) {
        if (i == 2)
            threads.push_back(std::thread(sub));
        else
            threads.push_back(std::thread(add));
    }

    for (auto& thread : threads)
        thread.join();

    std::cout &lt;&lt; value &lt;&lt; std::endl;

    return 0;
}</pre>
<p>The atomic value can only be accessed by one thread at a time. The effect is similar to locking with a mutex, except that the exclusion is provided by the atomic wrapper instead of the programmer.</p>
<pre class="code">4</pre>

<!-- Thread Creation Test -->
<h2>Thread Creation Test</h2>
<p>The following test times how long each library takes to launch and join 10 threads, repeated size times.</p>
<pre class="code">#include &lt;iostream&gt;
#include &lt;string&gt;
#include &lt;chrono&gt;
#include &lt;vector&gt;
#include &lt;thread&gt;
#include &lt;omp.h&gt;

using namespace std::chrono;

void reportTime(const char* msg, int size, steady_clock::duration span) {
    auto ms = duration_cast&lt;milliseconds&gt;(span);
    std::cout &lt;&lt; msg &lt;&lt; "- size : " &lt;&lt; std::to_string(size) &lt;&lt; " - took - " &lt;&lt; ms.count() &lt;&lt; " milliseconds" &lt;&lt; std::endl;
}

void empty() {}

void cpp(int size) {
    steady_clock::time_point ts, te;
    ts = steady_clock::now();
    for (int i = 0; i &lt; size; i++) {
        std::vector&lt;std::thread&gt; threads;
        for (int j = 0; j &lt; 10; j++)
            threads.push_back(std::thread(empty));
        for (auto& thread : threads)
            thread.join();
    }
    te = steady_clock::now();
    reportTime("C++11 Threads", size, te - ts);
}

void omp(int size) {
    steady_clock::time_point ts, te;
    ts = steady_clock::now();
    for (int i = 0; i &lt; size; i++) {
        #pragma omp parallel for num_threads(10)
        for (int j = 0; j &lt; 10; j++)
            empty();
    }
    te = steady_clock::now();
    reportTime("OpenMP", size, te - ts);
}

int main() {

    // Test C++11 Threads
    cpp(1);
    cpp(10);
    cpp(100);
    cpp(1000);
    cpp(10000);
    cpp(100000);

    std::cout &lt;&lt; std::endl;

    // Test OpenMP
    omp(1);
    omp(10);
    omp(100);
    omp(1000);
    omp(10000);
    omp(100000);

    return 0;
}</pre>
<pre class="code">C++11 Threads- size : 1 - took - 1 milliseconds
C++11 Threads- size : 10 - took - 10 milliseconds
C++11 Threads- size : 100 - took - 125 milliseconds
C++11 Threads- size : 1000 - took - 1703 milliseconds
C++11 Threads- size : 10000 - took - 20760 milliseconds
C++11 Threads- size : 100000 - took - 168628 milliseconds

OpenMP- size : 1 - took - 0 milliseconds
OpenMP- size : 10 - took - 0 milliseconds
OpenMP- size : 100 - took - 0 milliseconds
OpenMP- size : 1000 - took - 6 milliseconds
OpenMP- size : 10000 - took - 62 milliseconds
OpenMP- size : 100000 - took - 616 milliseconds</pre>
[[File:Cpp11threadgraph.png | 700px]][[File:Openmpthreadgraph.png | 700px]]