DPS921/ND-R&D

7,326 bytes added, 20:36, 4 December 2018
{{GPU621/DPS921 Index | 20187}}
<h1>C++11 Threads Library Comparison to OpenMP</h1><h3>Group Members</h3>Daniel Bogomazov<br>Nick Krillis<br><br>
<!-- How Threads Work -->
<h2>How Threads Work</h2>
<!-- OpenMP Threads -->
<h3>OpenMP Threads</h3>
<p>OpenMP (Open Multi-Processing) is an API specification for compilers that implement an explicit SPMD programming model on shared memory architectures. The main thread forks a specific number of child threads and divides the task amongst them. The runtime environment then allocates the threads onto multiple processors.</p>
<p>Threading in OpenMP works through compiler directives whose constructs create a parallel region in which threading can be performed. A directive has the following form:</p>
<pre class="code">#pragma omp construct [clause, ...] newline (\n)
structured block</pre>
<p>The construct placed after #pragma omp tells the compiler which parallel programming command to apply to the structured block. A construct is mandatory; optional clauses after it request additional behaviour for that region.</p>
<p>For example:</p>
<pre class="code">#pragma omp parallel</pre>
<p>The code above uses the parallel construct, which identifies a block of code to be executed by multiple threads (a parallel region).</p>
<p>The OpenMP standard consists of three main components:</p>
* '''Compiler directives''' control the parallelism of code regions. Examples include:
** Parallel (#pragma omp parallel): defines a parallel region in which the compiler knows to form threads for parallel execution.
** Task (#pragma omp task): defines an explicit task. The data environment of the task is created according to the data-sharing attribute clauses on the task construct and any defaults that apply.
** Simd (#pragma omp simd): applied to a loop to indicate that the loop can be transformed into a SIMD loop.
** Atomic (#pragma omp atomic): ensures that a specific memory location is accessed atomically, which avoids race conditions on that location.
* '''Runtime library routines''' set and get the number of threads, the current thread, and so on. For example, omp_set_num_threads(int) sets the number of threads for the next parallel region, while omp_get_num_threads() returns how many threads OpenMP actually created.
* '''Environment variables''' guide OpenMP. A widely used example is OMP_NUM_THREADS, which defines the maximum number of threads for OpenMP to attempt to use.
<h4>Implicit Barrier</h4>
<p>By default, OpenMP places an implicit barrier at the end of every parallel region. At the implicit barrier all threads of the region wait for one another and then join back into a single thread, the master thread, which continues alone.</p>
<pre class="code">// OpenMP - Parallel Construct
// omp_parallel.cpp

#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main() {
    #pragma omp parallel
    {
        std::cout &lt;&lt; "Hello\n";
    }
    // Implicit barrier: only the master thread continues past this point
    std::cout &lt;&lt; "Fin\n";
    return 0;
}</pre>
<p>Output:</p>
<pre class="code">Hello
Hello
Hello
Hello
Hello
Hello
Fin
</pre>
<!-- C++11 Threads -->
<h3>C++11 Threads</h3>
<p>C++11 introduced threading through the thread library.</p>
<p>Unlike OpenMP, C++11 does <i>not</i> use parallel regions as barriers for its threading. When a thread is run using the C++11 thread library, we must consider the scope of the parent thread. If the parent thread's scope exits while the child thread is still joinable, that is, neither joined nor detached, the std::thread destructor calls std::terminate and the program aborts.</p>
<h4>Join and Detach</h4>
<p>When the parent thread calls join on the child thread, the parent is blocked until the child thread returns.</p>
<pre class="code">                 t2
            ____________________
           /                    \
__________/\___________________|/\__________
    t1         t1    t2.join() |     t1
</pre>
<p>When the parent thread calls detach on the child thread, the two threads split and run independently. Even if the parent thread's scope exits before the child thread is able to finish, the child thread can continue to run. The child thread is responsible for deallocating its memory upon completion.</p>
<p>OpenMP does not have this functionality. OpenMP cannot execute instructions outside of its parallel region like the C++11 thread library can.</p>
<pre class="code">                      t2
            ________________________________
           /
__________/\_______________________
    t1         t1    t2.detach()
</pre>
<h4>Creating a Thread</h4>
<p>The following is the template used for the overloaded thread constructor. The thread begins to run on initialization.<br>f is the function, functor, or lambda expression to be executed in the thread. args are the arguments to pass to f. By default the constructor treats every argument as passed by value, even if f takes the parameter by reference; to pass a variable by reference, wrap it in std::ref().</p>
<pre class="code">template&lt;class Function, class... Args&gt;
explicit thread(Function&& f, Args&&... args);</pre>
<!-- How Multithreading Works -->
<h2>How Multithreading Works</h2>
<!-- Multithreading With OpenMP -->
<h3>Multithreading With OpenMP</h3>
<pre class="code">#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        std::cout &lt;&lt; "Hi from thread " &lt;&lt; tid &lt;&lt; '\n';
    }
    return 0;
}
</pre>
<p>Output:</p>
<pre class="code">Hi from thread Hi from thread 20
Hi from thread 1
Hi from thread3
</pre>
<p>The threads in the code above all write to the std::cout stream at the same time. Each chained &lt;&lt; is a separate write, so one thread's output can land in the middle of another's, which produces the jumbled output.</p>
<!-- Threading with C++11 -->
<h3>Threading with C++11</h3>
<p>Unlike OpenMP, C++11 threads are created by the programmer instead of the compiler. Every thread must be created manually, so the programmer chooses the number of threads, either as a fixed value or by calling std::thread::hardware_concurrency(), which returns a hint of how many concurrent threads are available to the program. This works in a similar way to OpenMP's omp_get_max_threads().</p>
<p>std::this_thread::get_id() is similar to OpenMP's omp_get_thread_num() but instead of an int, it returns a std::thread::id object.</p>
<pre class="code">// cpp11.multithreading.cpp

#include &lt;iostream&gt;
#include &lt;vector&gt;
#include &lt;thread&gt;

void func1(int index) {
    std::cout &lt;&lt; "Index: " &lt;&lt; index &lt;&lt; " - ID: " &lt;&lt; std::this_thread::get_id() &lt;&lt; std::endl;
}

int main() {
    int numThreads = 10;

    std::vector&lt;std::thread&gt; threads;

    std::cout &lt;&lt; "Creating threads...\n";

    for (int i = 0; i &lt; numThreads; i++)
        threads.push_back(std::thread(func1, i));

    std::cout &lt;&lt; "All threads have launched!\n";
    std::cout &lt;&lt; "Synchronizing...\n";

    for (auto& thread : threads)
        thread.join();

    std::cout &lt;&lt; "All threads have synchronized!\n";

    return 0;
}
</pre>
<p>Since all threads are using the std::cout stream, the output can appear jumbled and out of order. The solution to this problem will be presented in the next section.</p>
<pre class="code">Creating threads...
Index: 0 - ID: Index: 1 - ID: Index: 2 - ID: 0x70000b57e000
0x70000b4fb000
0x70000b601000
Index: 3 - ID: 0x70000b684000
Index: 4 - ID: 0x70000b707000
Index: 5 - ID: 0x70000b78a000
Index: 6 - ID: 0x70000b80d000
Index: 7 - ID: 0x70000b890000
Index: All threads have launched!
8 - ID: 0x70000b913000
Index: Synchronizing...
9 - ID: 0x70000b996000
All threads have synchronized!</pre>
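<!-- How Synchronization Works -->
<h2>How Synchronization Works</h2>
<!-- Synchronization With OpenMP -->
<h3>Synchronization With OpenMP</h3>
<p>Synchronization controls the order in which the threads of a parallel region do their work. The most common form is the barrier: threads wait at the barrier until every thread in the parallel region has reached the same point. Other constructs also help. The master construct defines a block that only the master thread executes while the other threads skip it, and the ordered construct makes a region execute in sequential order.</p>
<!-- critical --><h4>critical</h4>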
<pre class="code">#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main()
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp critical
        std::cout &lt;&lt; "Hi from thread " &lt;&lt; tid &lt;&lt; '\n';
    }
    return 0;
}
</pre>
<p>Using the critical construct, we limit access to the stream to one thread at a time. critical defines a region that only one thread is allowed to execute at a time; in this case the region is the write to the cout stream. The revised code now produces output like this:</p>
<pre class="code">Hi from thread 0
Hi from thread 1
Hi from thread 2
Hi from thread 3
</pre>
<!-- parallel for -->
<h4>parallel for</h4>
<p>OpenMP can parallelize a for loop with the parallel for construct, which automatically distributes the loop iterations between threads.</p>
<p>Example:</p>
<pre class="code">void simple(int n, float *a, float *b) {
    int i;
    #pragma omp parallel for
    for (i = 1; i &lt; n; i++)
        b[i] = (a[i] + a[i-1]) / 2.0;
}</pre>
<!-- Synchronization with C++11 -->
<h3>Synchronization with C++11</h3>
<h4>mutex</h4>
<p>To allow for thread synchronization, we can use the mutex library to lock specific sections of code from being used by multiple threads at once. A manually locked mutex never unlocks if, for example, an exception is thrown before the unlock function runs, so it is advised to use std::lock_guard&lt;std::mutex&gt; to manage locking in a more exception-safe manner.</p>
<pre class="code">// cpp11.mutex.cpp

#include &lt;iostream&gt;
#include &lt;vector&gt;
#include &lt;thread&gt;
#include &lt;mutex&gt;

std::mutex mu;

void func1(int index) {
    std::lock_guard&lt;std::mutex&gt; lock(mu);
    // mu.lock();
    std::cout &lt;&lt; "Index: " &lt;&lt; index &lt;&lt; " - ID: " &lt;&lt; std::this_thread::get_id() &lt;&lt; std::endl;
    // mu.unlock();
}

int main() {
    int numThreads = 10;

    std::vector&lt;std::thread&gt; threads;

    std::cout &lt;&lt; "Creating threads...\n";

    for (int i = 0; i &lt; numThreads; i++)
        threads.push_back(std::thread(func1, i));

    std::cout &lt;&lt; "All threads have launched!\n";
    std::cout &lt;&lt; "Synchronizing...\n";

    for (auto& thread : threads)
        thread.join();

    std::cout &lt;&lt; "All threads have synchronized!\n";

    return 0;
}
</pre>
<p>Using mutex, we're able to place a lock on the data used by the threads to allow for mutual exclusion. This is similar to OpenMP's critical in that it only allows one thread to execute a block of code at a time.</p>
<pre class="code">Creating threads...
Index: 0 - ID: 0x70000aa29000
Index: 4 - ID: 0x70000ac35000
Index: 5 - ID: 0x70000acb8000
Index: 1 - ID: 0x70000aaac000
Index: 6 - ID: 0x70000ad3b000
Index: 7 - ID: 0x70000adbe000
Index: 8 - ID: 0x70000ae41000
Index: 3 - ID: 0x70000abb2000
All threads have launched!
Synchronizing...
Index: 9 - ID: 0x70000aec4000
Index: 2 - ID: 0x70000ab2f000
All threads have synchronized!</pre>
<!-- How Data Sharing Works -->
<h2>How Data Sharing Works</h2>
<!-- Data Sharing With OpenMP -->
<h3>Data Sharing With OpenMP</h3>
<p>In OpenMP, by default, all data is shared and passed by reference. Therefore, we must be careful how the data is handled within the parallel region if it is accessed by multiple threads at once.</p>
<p>For example:</p>
<pre class="code">#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main() {
    int i = 12;
    #pragma omp parallel
    {
        #pragma omp critical
        std::cout &lt;&lt; "\ni = " &lt;&lt; ++i;
    }
    std::cout &lt;&lt; "\ni = " &lt;&lt; i &lt;&lt; std::endl;
    return 0;
}</pre>
<p>Output:</p>
<pre class="code">i = 13
i = 14
i = 15
i = 16
i = 16</pre>
<p>The output shows that after the parallel region closes, i holds a different value than it did originally. This is because the variable is shared inside and outside the parallel region. To pass the variable by value to each thread we must make it non-shared. This is done with firstprivate(), a clause, which comes after a construct. firstprivate(i) makes i private to each thread, initialized with its original value. The other clauses that change storage attributes are shared, private, lastprivate, and default.</p>
<p>For example:</p>
<pre class="code">#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main() {
    int i = 12;
    #pragma omp parallel firstprivate(i)
    {
        #pragma omp critical
        std::cout &lt;&lt; "\ni = " &lt;&lt; ++i;
    }
    std::cout &lt;&lt; "\ni = " &lt;&lt; i &lt;&lt; std::endl;
}</pre>
<p>New Output:</p>
<pre class="code">i = 13
i = 13
i = 13
i = 13
i = 12</pre>
<p>Here each individual thread starts with its own copy of i at 12 and increments it to 13. The last line of the output shows i = 12, so the parallel region did not change the value of i outside the parallel region.</p>

<!-- Data Sharing with C++11 -->
<h3>Data Sharing with C++11</h3>
<p>The C++11 thread library requires the programmer to pass in the address of the data that should be shared by the threads. In the code below, func1 receives a copy of value, while func2 receives a reference through std::ref.</p>
<pre class="code">// cpp11.datasharing.cpp

#include &lt;iostream&gt;
#include &lt;vector&gt;
#include &lt;thread&gt;
#include &lt;mutex&gt;

std::mutex mu;

void func1(int value) {
    std::lock_guard&lt;std::mutex&gt; lock(mu);
    std::cout &lt;&lt; "func1 start - value = " &lt;&lt; value &lt;&lt; std::endl;
    value = 0;
    std::cout &lt;&lt; "func1 end - value = " &lt;&lt; value &lt;&lt; std::endl;
}

void func2(int& value) {
    std::lock_guard&lt;std::mutex&gt; lock(mu);
    std::cout &lt;&lt; "func2 start - value = " &lt;&lt; value &lt;&lt; std::endl;
    value *= 2;
    std::cout &lt;&lt; "func2 end - value = " &lt;&lt; value &lt;&lt; std::endl;
}

int main() {
    int numThreads = 5;
    int value = 1;

    std::vector&lt;std::thread&gt; threads;

    for (int i = 0; i &lt; numThreads; i++) {
        if (i == 2)
            threads.push_back(std::thread(func1, value));
        else
            threads.push_back(std::thread(func2, std::ref(value)));
    }

    for (auto& thread : threads)
        thread.join();

    return 0;
}</pre>
<pre class="code">func2 start - value = 1
func2 end - value = 2
func2 start - value = 2
func2 end - value = 4
func1 start - value = 1
func1 end - value = 0
func2 start - value = 4
func2 end - value = 8
func2 start - value = 8
func2 end - value = 16</pre>

<!-- How Synchronization Works Continued -->
<h2>How Synchronization Works Continued</h2>

<!-- Synchronization Continued With OpenMP -->
<h3>Synchronization Continued With OpenMP</h3>
<h4>atomic</h4>
<p>The atomic construct is OpenMP's way of serializing a specific operation. Its advantage in the example below is that it performs the increment with less overhead than critical. Atomic ensures that the operation is performed by only one thread at a time.</p>
<pre class="code">#include &lt;iostream&gt;
#include &lt;omp.h&gt;

int main() {
    int i = 0;
    #pragma omp parallel num_threads(10)
    {
        #pragma omp atomic
        i++;
    }
    std::cout &lt;&lt; i &lt;&lt; std::endl;
    return 0;
}</pre>
<pre class="code">10</pre>

<!-- Synchronization Continued with C++11 -->
<h3>Synchronization Continued with C++11</h3>
<h4>atomic</h4>
<p>Another way to ensure synchronization of data between threads is to use the atomic library.</p>
<pre class="code">// cpp11.atomic.cpp

#include &lt;iostream&gt;
#include &lt;vector&gt;
#include &lt;thread&gt;
#include &lt;atomic&gt;

std::atomic&lt;int&gt; value(1);

void add() {
    ++value;
}

void sub() {
    --value;
}

int main() {
    int numThreads = 5;

    std::vector&lt;std::thread&gt; threads;

    for (int i = 0; i &lt; numThreads; i++) {
        if (i == 2)
            threads.push_back(std::thread(sub));
        else
            threads.push_back(std::thread(add));
    }

    for (auto& thread : threads)
        thread.join();

    std::cout &lt;&lt; value &lt;&lt; std::endl;

    return 0;
}</pre>
<p>The atomic value can only be modified by one thread at a time. The effect is similar to guarding the variable with a mutex, except that the protection is provided by the atomic wrapper instead of the programmer.</p>
<pre class="code">4</pre>

<!-- Thread Creation Test -->
<h2>Thread Creation Test</h2>
<p>The following test times how long each library takes to repeatedly create and join 10 threads that run an empty function.</p>
<pre class="code">#include &lt;iostream&gt;
#include &lt;string&gt;
#include &lt;chrono&gt;
#include &lt;vector&gt;
#include &lt;thread&gt;
#include &lt;omp.h&gt;

using namespace std::chrono;

void reportTime(const char* msg, int size, steady_clock::duration span) {
    auto ms = duration_cast&lt;milliseconds&gt;(span);
    std::cout &lt;&lt; msg &lt;&lt; "- size : " &lt;&lt; std::to_string(size) &lt;&lt; " - took - " &lt;&lt; ms.count() &lt;&lt; " milliseconds" &lt;&lt; std::endl;
}

void empty() {}

void cpp(int size) {
    steady_clock::time_point ts, te;
    ts = steady_clock::now();
    for (int i = 0; i &lt; size; i++) {
        std::vector&lt;std::thread&gt; threads;
        for (int j = 0; j &lt; 10; j++)
            threads.push_back(std::thread(empty));
        for (auto& thread : threads)
            thread.join();
    }
    te = steady_clock::now();
    reportTime("C++11 Threads", size, te - ts);
}

void omp(int size) {
    steady_clock::time_point ts, te;
    ts = steady_clock::now();
    for (int i = 0; i &lt; size; i++) {
        // Inner index renamed to j so it no longer shadows the outer i
        #pragma omp parallel for num_threads(10)
        for (int j = 0; j &lt; 10; j++)
            empty();
    }
    te = steady_clock::now();
    reportTime("OpenMP", size, te - ts);
}

int main() {

    // Test C++11 Threads
    cpp(1);
    cpp(10);
    cpp(100);
    cpp(1000);
    cpp(10000);
    cpp(100000);

    std::cout &lt;&lt; std::endl;

    // Test OpenMP
    omp(1);
    omp(10);
    omp(100);
    omp(1000);
    omp(10000);
    omp(100000);

    return 0;
}</pre>
<pre class="code">C++11 Threads- size : 1 - took - 1 milliseconds
C++11 Threads- size : 10 - took - 10 milliseconds
C++11 Threads- size : 100 - took - 125 milliseconds
C++11 Threads- size : 1000 - took - 1703 milliseconds
C++11 Threads- size : 10000 - took - 20760 milliseconds
C++11 Threads- size : 100000 - took - 168628 milliseconds

OpenMP- size : 1 - took - 0 milliseconds
OpenMP- size : 10 - took - 0 milliseconds
OpenMP- size : 100 - took - 0 milliseconds
OpenMP- size : 1000 - took - 6 milliseconds
OpenMP- size : 10000 - took - 62 milliseconds
OpenMP- size : 100000 - took - 616 milliseconds</pre>
[[File:Cpp11threadgraph.png | 700px]][[File:Openmpthreadgraph.png | 700px]]