
DPS921/ND-R&D

7,768 bytes added, 19:36, 4 December 2018


<h1>C++11 Threads Library Comparison to OpenMP</h1><h3>Group Members</h3>Daniel Bogomazov<br>Nick Krillis<br><br>

<h2>How Threads Work</h2>

<h3>OpenMP Threads</h3>

<p>OpenMP (Open Multi-Processing) is an API specification for compilers that implements an explicit SPMD programming model on shared-memory architectures. Threading starts from the main thread, which forks a specified number of child threads and divides the task among them. The runtime environment then allocates the threads to the available processors.</p>

<p>Threading in OpenMP works through compiler directives with constructs, which create a parallel region in which threading can be performed.</p><p>For example:</p><pre class="code">#pragma omp construct [clause, ...] newline (\n)
structured block</pre> <p>The construct selects the parallel programming command to be used. A construct is mandatory: without one, OpenMP has no command to execute.</p> <p>For example:</p> <pre class="code">#pragma omp parallel</pre>

<p>The above code shows the use of the parallel construct. This construct identifies a block of code to be executed by multiple threads (a parallel region).</p>

<p>OpenMP exposes its functionality through compiler directives, runtime library routines, and environment variables:</p>

* '''Compiler directives''' control the parallelism of code regions. The directive keyword placed after #pragma omp tells the compiler what action to take on that region of code, and optional clauses after the directive request additional behaviour on that region. Examples of directives and constructs include:
** Parallel (#pragma omp parallel): defines a parallel region in which the compiler knows to form threads for parallel execution.
** Task (#pragma omp task): defines an explicit task. The data environment of the task is created according to the data-sharing attribute clauses on the task construct and any defaults that apply.
** Simd (#pragma omp simd): applied to a loop to indicate that the loop can be transformed into a SIMD loop.
** Atomic (#pragma omp atomic): makes an update to a specific memory location atomic. It helps avoid race conditions by directly controlling concurrent access, and can be used to write more efficient algorithms.
* '''Runtime library routines''' set and query values such as the total number of threads and the current thread. For example, omp_set_num_threads(int) sets the number of threads for the next parallel region, while omp_get_num_threads() returns how many threads OpenMP actually created.
* '''Environment variables''' guide OpenMP at run time. A widely used example is OMP_NUM_THREADS, which defines the maximum number of threads for OpenMP to attempt to use.

<h4>Implicit Barrier</h4>

<p>By default, OpenMP places an implicit barrier at the end of every parallel region. At an implicit barrier each thread waits until all the other threads have finished the region; the threads then collapse back into one, the master thread, which continues alone.</p>

<pre class="code">// OpenMP - Parallel Construct
// omp_parallel.cpp

#include <iostream>
#include <omp.h>

int main() {
    // Every thread in the team executes this block once
    #pragma omp parallel
    {
        std::cout << "Hello\n";
    }
    // Implicit barrier: only the master thread continues from here
    std::cout << "Fin\n";
    return 0;
}
</pre>

<p>Output:</p><pre class="code">Hello
Hello
Hello
Hello
Hello
Hello
Fin</pre>

<h3>C++11 Threads</h3>

<p>C++11 introduced threading through the thread library.</p>

<p>Unlike OpenMP, C++11 does <i>not</i> use parallel regions as barriers for its threading. A thread begins running as soon as it is initialized, so when a thread is run using the C++11 thread library we must consider the scope of the parent thread. If the parent thread's scope exits before the child thread has returned, the program can crash if the situation is not handled correctly. The two main ways of dealing with this problem are joining the child thread to the parent thread or detaching it from the parent thread.</p>

<h4>Join and Detach</h4>

<p>When using the join function on the child thread, the parent thread will be blocked until the child thread returns. In the diagram below, thread t1 forks when it creates a new child thread t2, and both threads run in parallel. To prevent t2 from going out of scope in case t1 finishes first, t1 calls t2.join(). This blocks t1 from executing code until t2 returns. Once t2 joins back, t1 continues to execute.</p>

<pre class="code">                      t2
            ____________________
           /                    \
__________/\___________________|/\__________
    t1        t1     t2.join() |     t1
</pre>

<p>When using the detach function on the child thread, the two threads split and run independently. Even if the parent thread's scope exits before the child thread has finished, the child thread is able to continue (as long as the process itself is still running). The child thread's resources are deallocated when it completes.</p><p>OpenMP does not have this functionality. OpenMP cannot execute instructions outside of its parallel region like the C++11 thread library can.</p><pre class="code">                      t2
            ________________________________
           /
__________/\_______________________
    t1        t1      t2.detach()
</pre>

<h4>Creating a Thread</h4><p>The following is the template used for the overloaded thread constructor. The thread begins to run on initialization.<br>f is the function, functor, or lambda expression to be executed in the thread. args are the arguments to pass to f.</p><pre class="code">template<class Function, class... Args>
explicit thread(Function&& f, Args&&... args);</pre>

<h2>How Multithreading Works</h2>

<h3>Multithreading With OpenMP</h3>

<pre class="code">#include <iostream>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        std::cout << "Hi from thread " << tid << '\n';
    }
    return 0;
}
</pre>

<p>Output:</p>
<pre class="code">Hi from thread Hi from thread 2
0
Hi from thread 1
Hi from thread 3
</pre>

<p>The output is jumbled because the threads interleave. Every thread tries to write to the cout stream at the same time, so while one thread is partway through its message another thread can write into the middle of it.</p>

'''Control Structures'''
* OpenMP deliberately provides a very small set of control structures, since most parallel applications need only a few of them.
* These control structures are executed with the fork-join method, and the control structure defines where each new thread starts.
* OpenMP includes a control structure only where a compiler can provide both functionality and performance beyond what a user could reasonably program.

'''Data Environment'''
* Each process in OpenMP has associated clauses that define its data environment.
* A new data environment is constructed only for new processes, at the time of execution.
* The following clauses change the storage attributes of variables for the construct they are applied to, not for the entire parallel region:
** SHARED
** PRIVATE
** FIRSTPRIVATE
** LASTPRIVATE
** DEFAULT
* By default almost all variables are shared, including global variables. Not everything is shared, however: stack variables that are part of subprograms or functions called in parallel regions are PRIVATE.

<h3>Threading with C++11</h3><p>Unlike OpenMP, C++11 threads are created by the programmer instead of the compiler.</p><p>std::this_thread::get_id() is similar to OpenMP's omp_get_thread_num(), but instead of an int it returns an opaque thread identifier object.</p>

<p>By default, the thread constructor treats all arguments as if they were passed by value, even if the function takes a parameter by reference. To avoid errors, the programmer must mark any argument that should be treated as a reference by wrapping it in std::ref().</p>

<pre class="code">// cpp11.multithreading.cpp

#include <iostream>
#include <vector>
#include <thread>

void func1(int index) {
    std::cout << "Index: " << index << " - ID: " << std::this_thread::get_id() << std::endl;
}

int main() {
    int numThreads = 10;

    std::vector<std::thread> threads;

    std::cout << "Creating threads...\n";

    // Each thread starts running as soon as it is constructed
    for (int i = 0; i < numThreads; i++)
        threads.push_back(std::thread(func1, i));

    std::cout << "All threads have launched!\n";
    std::cout << "Syncronizing...\n";

    // Block until every child thread has returned
    for (auto& thread : threads)
        thread.join();

    std::cout << "All threads have syncronized!\n";

    return 0;
}</pre>

<p>Since all threads are using the std::cout stream, the output can appear jumbled and out of order. The solution to this problem will be presented in the next section.</p>

<pre class="code">Creating threads...
Index: 0 - ID: Index: 1 - ID: Index: 2 - ID: 0x70000b57e000
0x70000b4fb000
0x70000b601000Index: 3 - ID: 0x70000b684000
Index:
4 - ID: 0x70000b707000
Index: 5 - ID: 0x70000b78a000
Index: 6 - ID: 0x70000b80d000
Index: 7 - ID: 0x70000b890000
Index: All threads have launched!
8 - ID: 0x70000b913000
Index: Syncronizing...
9 - ID: 0x70000b996000
All threads have syncronized!
</pre>

<p>Multithreading with the C++11 thread library requires the programmer to create every new thread manually. The number of threads can either be set by hand or be taken from std::thread::hardware_concurrency(), which returns a hint for the number of threads the hardware can run concurrently (it may return 0 if that number cannot be determined). This plays a role similar to OpenMP's omp_get_max_threads().</p>

<h2>How Synchronization Works</h2>

<h3>Synchronization With OpenMP</h3>

<h4>critical</h4>

<pre class="code">#include <iostream>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        // Only one thread at a time may execute the next statement
        #pragma omp critical
        std::cout << "Hi from thread " << tid << '\n';
    }
    return 0;
}</pre>

<p>Using the critical construct, we can limit access to the stream to one thread at a time. critical defines a region that only one thread is allowed to execute at a time; in this case it is the write to the cout stream that is limited to one thread. The revised code produces output like this (the order of the lines may vary between runs):</p>

<pre class="code">Hi from thread 0
Hi from thread 1
Hi from thread 2
Hi from thread 3</pre>

<h4>parallel for</h4>

<p>OpenMP can parallelize a for loop with the parallel for construct, which automatically distributes the loop's iterations among the threads.</p>

<p>Example:</p><pre class="code">void simple(int n, float *a, float *b) {
    int i;
    #pragma omp parallel for
    for (i = 1; i < n; i++)
        b[i] = (a[i] + a[i-1]) / 2.0;
}</pre>

<p>More generally, in OpenMP:</p>

* Synchronization is a way of constraining the order in which the threads of a parallel region do their work.
* The most common form of synchronization is the barrier. The threads wait at a barrier until every thread in the scope of the parallel region has reached the same point.
* Some constructs help implement synchronization. The master construct defines a block that is executed only by the master thread, so the other threads skip it. The ordered construct is another example: it makes a region execute in sequential order.

<h3>Synchronization with C++11</h3>

<h4>mutex</h4>

<p>To allow for thread synchronization, we can use the mutex library to lock specific sections of code so that they cannot be used by multiple threads at once.</p>

<pre class="code">// cpp11.mutex.cpp

#include <iostream>
#include <vector>
#include <thread>
#include <mutex>

std::mutex mu;

void func1(int index) {
    // lock_guard locks mu here and unlocks it at the end of the function;
    // it replaces a manual mu.lock() / mu.unlock() pair
    std::lock_guard<std::mutex> lock(mu);
    std::cout << "Index: " << index << " - ID: " << std::this_thread::get_id() << std::endl;
}

int main() {
    int numThreads = 10;

    std::vector<std::thread> threads;

    std::cout << "Creating threads...\n";

    for (int i = 0; i < numThreads; i++)
        threads.push_back(std::thread(func1, i));

    std::cout << "All threads have launched!\n";
    std::cout << "Syncronizing...\n";

    for (auto& thread : threads)
        thread.join();

    std::cout << "All threads have syncronized!\n";

    return 0;
}</pre>

<p>Using mutex, we're able to place a lock on the data used by the threads to allow for mutual exclusion. This is similar to OpenMP's critical in that it only allows one thread to execute a block of code at a time. std::lock_guard releases the mutex automatically when it goes out of scope, even if an exception is thrown.</p>

<pre class="code">Creating threads...
Index: 0 - ID: 0x70000aa29000
Index: 4 - ID: 0x70000ac35000
Index: 5 - ID: 0x70000acb8000
Index: 1 - ID: 0x70000aaac000
Index: 6 - ID: 0x70000ad3b000
Index: 7 - ID: 0x70000adbe000
Index: 8 - ID: 0x70000ae41000
Index: 3 - ID: 0x70000abb2000
All threads have launched!
Syncronizing...
Index: 9 - ID: 0x70000aec4000
Index: 2 - ID: 0x70000ab2f000
All threads have syncronized!
</pre>

<h2>How Data Sharing Works</h2>

<h3>Data Sharing With OpenMP</h3>

<p>In OpenMP, variables declared outside a parallel region are shared by default, so every thread works on the same object, as if it had been passed by reference. We must therefore be careful with how such data is handled inside the parallel region when multiple threads access it at once.</p>

<p>For example:</p>
<pre class="code">#include <iostream>
#include <omp.h>

int main() {
    int i = 12;
    #pragma omp parallel
    {
        #pragma omp critical
        std::cout << "\ni = " << ++i;
    }
    std::cout << "\ni = " << i << std::endl;
    return 0;
}</pre>

<p>Output:</p>
<pre class="code">
i = 13
i = 14
i = 15
i = 16
i = 16</pre>

<p>The output shows that even after the parallel region has closed, the variable i holds a different value than it did originally. This is because the variable is shared inside and outside the parallel region. To pass the variable by value to each thread, we must make it non-shared. This is done with firstprivate(), a clause, which comes after a construct. firstprivate(i) gives each thread its own private copy of i, initialized with the original value.</p>

<p>For example:</p>
<pre class="code">#include <iostream>
#include <omp.h>

int main() {
    int i = 12;
    #pragma omp parallel firstprivate(i)
    {
        #pragma omp critical
        std::cout << "\ni = " << ++i;
    }
    std::cout << "\ni = " << i << std::endl;
}</pre>

<p>New Output:</p>
<pre class="code">
i = 13
i = 13
i = 13
i = 13
i = 12</pre>

<p>In each individual thread, the value of i starts at 12 and is then incremented by that thread to 13. On the last line of the output we can see that i = 12, showing that the parallel region did not change the value of i outside the parallel region.</p>

<h3>Data Sharing with C++11</h3>

<p>The C++11 thread library requires the programmer to explicitly pass a reference to any data that should be shared by the threads.</p>

<pre class="code">// cpp11.datasharing.cpp

#include <iostream>
#include <vector>
#include <thread>
#include <mutex>

std::mutex mu;

// Receives a copy: changes are local to this thread
void func1(int value) {
    std::lock_guard<std::mutex> lock(mu);
    std::cout << "func1 start - value = " << value << std::endl;
    value = 0;
    std::cout << "func1 end - value = " << value << std::endl;
}

// Receives a reference: changes are visible to every thread
void func2(int& value) {
    std::lock_guard<std::mutex> lock(mu);
    std::cout << "func2 start - value = " << value << std::endl;
    value *= 2;
    std::cout << "func2 end - value = " << value << std::endl;
}

int main() {
    int numThreads = 5;
    int value = 1;

    std::vector<std::thread> threads;

    for (int i = 0; i < numThreads; i++) {
        if (i == 2)
            threads.push_back(std::thread(func1, value));
        else
            threads.push_back(std::thread(func2, std::ref(value)));
    }

    for (auto& thread : threads)
        thread.join();

    return 0;
}</pre>

<pre class="code">func2 start - value = 1
func2 end - value = 2
func2 start - value = 2
func2 end - value = 4
func1 start - value = 1
func1 end - value = 0
func2 start - value = 4
func2 end - value = 8
func2 start - value = 8
func2 end - value = 16</pre>

<h2>How Synchronization Works Continued</h2>

<h3>Synchronization Continued With OpenMP</h3>

<h4>atomic</h4>

<p>The atomic construct is OpenMP's way of serializing a single operation. In the example below, its advantage is that it performs the increment with less overhead than critical.
Atomic ensures that only that one operation is performed by one thread at a time.</p>

<pre class="code">#include <iostream>

int main() {
    int i = 0;
    #pragma omp parallel num_threads(10)
    {
        #pragma omp atomic
        i++;
    }
    std::cout << i << std::endl;
    return 0;
}</pre>

<pre class="code">10</pre>

<h3>Synchronization Continued with C++11</h3>

<h4>atomic</h4>

<p>Another way to ensure synchronization of data between threads is to use the atomic library.</p>

<pre class="code">// cpp11.atomic.cpp

#include <iostream>
#include <vector>
#include <thread>
#include <atomic>

std::atomic<int> value(1);

void add() {
    ++value;
}

void sub() {
    --value;
}

int main() {
    int numThreads = 5;

    std::vector<std::thread> threads;

    for (int i = 0; i < numThreads; i++) {
        if (i == 2)
            threads.push_back(std::thread(sub));
        else
            threads.push_back(std::thread(add));
    }

    for (auto& thread : threads)
        thread.join();

    std::cout << value << std::endl;

    return 0;
}</pre>

<p>Each operation on the atomic value is indivisible, so no thread can observe or interrupt a half-completed update. The effect is similar to guarding the value with a mutex, except that the synchronization is provided by the atomic wrapper (often without taking any lock at all) instead of by the programmer.</p>

<pre class="code">4</pre>

<h2>Thread Creation Test</h2>

<pre class="code">#include <iostream>
#include <string>
#include <chrono>
#include <vector>
#include <thread>
#include <omp.h>

using namespace std::chrono;

void reportTime(const char* msg, int size, steady_clock::duration span) {
    auto ms = duration_cast<milliseconds>(span);
    std::cout << msg << "- size : " << std::to_string(size) << " - took - " << ms.count() << " milliseconds" << std::endl;
}

void empty() {}

// Create and join 10 C++11 threads, repeated size times
void cpp(int size) {
    steady_clock::time_point ts, te;
    ts = steady_clock::now();
    for (int i = 0; i < size; i++) {
        std::vector<std::thread> threads;
        for (int j = 0; j < 10; j++)
            threads.push_back(std::thread(empty));
        for (auto& thread : threads)
            thread.join();
    }
    te = steady_clock::now();
    reportTime("C++11 Threads", size, te - ts);
}

// Open a 10-thread OpenMP parallel region, repeated size times
void omp(int size) {
    steady_clock::time_point ts, te;
    ts = steady_clock::now();
    for (int i = 0; i < size; i++) {
        #pragma omp parallel for num_threads(10)
        for (int j = 0; j < 10; j++)
            empty();
    }
    te = steady_clock::now();
    reportTime("OpenMP", size, te - ts);
}

int main() {
    // Test C++11 Threads
    cpp(1);
    cpp(10);
    cpp(100);
    cpp(1000);
    cpp(10000);
    cpp(100000);

    std::cout << std::endl;

    // Test OpenMP
    omp(1);
    omp(10);
    omp(100);
    omp(1000);
    omp(10000);
    omp(100000);

    return 0;
}</pre>

<pre class="code">C++11 Threads- size : 1 - took - 1 milliseconds
C++11 Threads- size : 10 - took - 10 milliseconds
C++11 Threads- size : 100 - took - 125 milliseconds
C++11 Threads- size : 1000 - took - 1703 milliseconds
C++11 Threads- size : 10000 - took - 20760 milliseconds
C++11 Threads- size : 100000 - took - 168628 milliseconds

OpenMP- size : 1 - took - 0 milliseconds
OpenMP- size : 10 - took - 0 milliseconds
OpenMP- size : 100 - took - 0 milliseconds
OpenMP- size : 1000 - took - 6 milliseconds
OpenMP- size : 10000 - took - 62 milliseconds
OpenMP- size : 100000 - took - 616 milliseconds</pre>

[[File:Cpp11threadgraph.png | 700px]][[File:Openmpthreadgraph.png | 700px]]

Dbogomazov

44

edits
