{{GPU621/DPS921 Index | 20187}}
<!-- How Threads Work -->
<h4>Implicit Barrier</h4>
<p>In OpenMP, the end of a parallel region acts as an implicit barrier: every thread must finish its work in the region before execution continues past it on the master thread.</p>
<pre class="code">// OpenMP - Parallel Construct
// omp_parallel.cpp
#include <iostream>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        std::cout << "Hello\n";
    }   // implicit barrier: all threads complete before "Fin" is printed
    std::cout << "Fin\n";
    return 0;
}
</pre>
<p>Output:</p>
<pre class="code">Hello
Hello
Hello
Hello
Hello
Hello
Fin
</pre>
<!-- C++11 Threads -->
<p>Unlike OpenMP, C++11 does <i>not</i> use parallel regions as barriers for its threading. When a thread is launched with the C++11 thread library, we must consider the scope of the parent thread: if the parent exits while a child thread is still joinable, std::terminate is called and the program aborts.</p>
<p>Calling the join function on the child thread blocks the parent thread until the child thread returns.</p>
<pre class="code"> t2
____________________
/ \
__________/\___________________|/\__________
t1 t1 t2.join() | t1
</pre>
<h4>Creating a Thread</h4><p>The following is the template for the overloaded thread constructor. The thread begins to run on initialization.<br>f is the function, functor, or lambda expression to be executed in the thread. args are the arguments to pass to f.</p><pre class="code">template<class Function, class... Args>
explicit thread(Function&& f, Args&&... args);</pre>
<!-- How Multithreading Works -->
<pre class="code">#include <iostream>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        std::cout << "Hi from thread " << tid << '\n';
    }
    return 0;
}</pre>
<p>Output:</p>
<pre class="code">Hi from thread Hi from thread 2
0
Hi from thread 1
Hi from thread 3
</pre>
<!-- Threading with C++11 -->
<pre class="code">// cpp11.multithreading.cpp
#include <iostream>
#include <thread>
#include <vector>

void func1(int index) {
    std::cout << "Hi from thread " << index << '\n';
}

int main() {
    int numThreads = 10;
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; i++)
        threads.push_back(std::thread(func1, i));
    for (auto& thread : threads)
        thread.join();
    return 0;
}
</pre>
<p>Since all threads are using the std::cout stream, the output can appear jumbled and out of order. The solution to this problem will be presented in the next section.</p>
<!-- Synchronization With OpenMP -->
<pre class="code">#include <iostream>
#include <omp.h>

int main()
{
#pragma omp parallel
{
int tid = omp_get_thread_num();
#pragma omp critical
std::cout << "Hi from thread "<< tid << '\n';
}
return 0;
}
</pre>
<p>Using the critical construct, we can limit access to the stream to one thread at a time. critical defines a region in which only one thread is allowed to execute at a time; in this case, it is the write to the cout stream that we restrict to one thread. The revised code now produces output like this:</p>
<pre class="code">Hi from thread 0
Hi from thread 1
Hi from thread 2
Hi from thread 3
</pre>
<!-- parallel for -->
<p>In OpenMP, a for loop can be parallelized with the parallel for construct. This statement automatically distributes the iterations among the threads.</p>
<p>Example:</p><pre class="code">void simple(int n, float *a, float *b)
{
    int i;
    #pragma omp parallel for
    for (i = 1; i < n; i++)
        b[i] = (a[i] + a[i-1]) / 2.0;
}</pre>
[[File:Cppmultithreading.png|500px]]
<h4>mutex</h4>
<p>To allow for thread synchronization, we can use the mutex library to lock specific sections of code from being used by multiple threads at once.</p>
<pre class="code">// cpp11.mutex.cpp
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mu;

void func1(int index) {
    mu.lock();   // only one thread at a time may print
    std::cout << "Index: " << index << " - ID: " << std::this_thread::get_id() << '\n';
    mu.unlock();
}

int main() {
    int numThreads = 10;
    std::vector<std::thread> threads;
    std::cout << "Creating threads...\n";
    for (int i = 0; i < numThreads; i++)
        threads.push_back(std::thread(func1, i));
    std::cout << "All threads have launched!\n";
    std::cout << "Synchronizing...\n";
    for (auto& thread : threads)
        thread.join();
    std::cout << "All threads have synchronized!\n";
    return 0;
}
</pre>
<p>Output:</p>
<pre class="code">Creating threads...
Index: 0 - ID: 0x70000aa29000
Index: 4 - ID: 0x70000ac35000
Index: 5 - ID: 0x70000acb8000
Index: 1 - ID: 0x70000aaac000
Index: 6 - ID: 0x70000ad3b000
Index: 7 - ID: 0x70000adbe000
Index: 8 - ID: 0x70000ae41000
Index: 3 - ID: 0x70000abb2000
All threads have launched!
Synchronizing...
Index: 9 - ID: 0x70000aec4000
Index: 2 - ID: 0x70000ab2f000
All threads have synchronized!
</pre>