GPU621/CamelCaseTeam

13,112 bytes added, 11:17, 4 August 2021
no edit summary
Our team researched a comparison between the C++11 threads library and OpenMP. We code the same problems with both libraries to demonstrate the coding differences and analyze the performance of the two approaches, highlighting techniques and best practices for achieving the best performance and avoiding common pitfalls.
 
== Getting Started with C++ Threading ==
 
C++ has supported threading natively since the C++11 standard; before that, multithreading required platform APIs such as pthreads or the Windows API, or third-party libraries such as Boost.
 
To start with threading in C++, we need to include <thread>, which provides the thread class and related functions. For protecting shared data, we also need headers such as <mutex> and <condition_variable>.
 
std::thread works with any callable type: a function pointer, a lambda, or a functor (function object). Here are some examples:
 
1. function pointer
<syntaxhighlight lang="cpp">
void hello() {
    std::cout << "hello world" << std::endl;
}

std::thread a(hello);
a.join();
</syntaxhighlight>
 
2. lambda
<syntaxhighlight lang="cpp">
int num1 = 0;
int num2 = 1;
int num3 = 0;

// capture by reference so the thread can read and write the locals;
// an empty capture list [] would not compile here
std::thread b([&num1, &num2, &num3] {
    num3 = num1 + num2;
});
b.join(); // join before the captured locals go out of scope
 
</syntaxhighlight>
 
3. functor
<syntaxhighlight lang="cpp">
class Func {
public: // operator() must be accessible to std::thread
    void operator()() {
        std::cout << "print in class Func operator" << std::endl;
    }
}; // class definitions need a trailing semicolon

Func f; // note: "Func f();" would declare a function instead (most vexing parse)
std::thread c(f);
c.join();
</syntaxhighlight>
Before a thread object is destroyed, we must decide whether to wait for it to finish (join) or let it run in the background (detach); otherwise its destructor calls std::terminate and the program aborts at runtime.
 
t.join() -------- blocks the parent thread until the child thread finishes.

t.detach() ------- lets the thread run independently in the background.
== C++11 Threading ==
The examples above show how threads are created and then joined back to the parent thread.
 
=== Usage of C++11 Threading ===
 
==== Serial Version ====
<syntaxhighlight lang="cpp">
#include <iostream>
#include <chrono>

int main() {
    const int iteration = 100000000;
    int count = 0;
    std::chrono::steady_clock::time_point ts = std::chrono::steady_clock::now();
    for (count = 0; count < iteration; count++) {}
    std::chrono::steady_clock::time_point te = std::chrono::steady_clock::now();
    auto t = std::chrono::duration_cast<std::chrono::milliseconds>(te - ts);
    std::cout << "Serial - Counting to: " << count << " - elapsed time: " << t.count() << " ms" << std::endl;
}
</syntaxhighlight>
 
<syntaxhighlight lang="cpp">
Serial - Counting to: 100000000 - elapsed time: 122 ms
</syntaxhighlight>
 
==== Parallel Version ====
 
<syntaxhighlight lang="cpp">
#include <thread>
#include <iostream>
#include <chrono>
#include <vector>
#include <mutex>
 
#define MAX_THREAD 16
std::mutex mtx; // creates the critical section

int main() {
    const int iteration = 100000000;
    const int runs = 5;
    for (int x = 0; x < MAX_THREAD; x++) { // iterate over the number of threads
        auto t = 0;
        int count = 0;
        for (int y = 0; y < runs; y++) {
            count = 0;
            std::vector<std::thread> threads;

            // count the remainder outside the threads; negligible for large
            // iteration counts (at most number-of-threads - 1 increments)
            for (int i = 0; i < iteration % (x + 1); i++) {
                count++;
            }

            std::chrono::steady_clock::time_point ts = std::chrono::steady_clock::now();
            for (int j = 0; j < x + 1; j++) {
                threads.push_back(
                    std::thread([x, iteration, &count]() {
                        // the lock is held for the whole loop, so the
                        // threads effectively run one after another
                        mtx.lock();
                        for (int k = 0; k < iteration / (x + 1); k++) {
                            count++;
                        }
                        mtx.unlock();
                    }));
            }

            for (auto& thrd : threads) {
                thrd.join();
            }

            std::chrono::steady_clock::time_point te = std::chrono::steady_clock::now();

            auto temp = std::chrono::duration_cast<std::chrono::milliseconds>(te - ts);
            t += temp.count();
        }
        std::cout << "C11 Thread - Counting to: " << count << " - Number of threads created: " << x + 1 << " | elapsed time: " << t / runs << " ms" << std::endl;
    }
}
</syntaxhighlight>
 
<syntaxhighlight lang="cpp">
C11 Thread - Counting to: 100000000 - Number of threads created: 1 | elapsed time: 243 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 2 | elapsed time: 238 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 3 | elapsed time: 241 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 4 | elapsed time: 224 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 5 | elapsed time: 225 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 6 | elapsed time: 224 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 7 | elapsed time: 228 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 8 | elapsed time: 225 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 9 | elapsed time: 223 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 10 | elapsed time: 231 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 11 | elapsed time: 226 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 12 | elapsed time: 225 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 13 | elapsed time: 222 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 14 | elapsed time: 225 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 15 | elapsed time: 226 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 16 | elapsed time: 223 ms
</syntaxhighlight>
 
The code above gives a simple demonstration of the performance of the C++11 thread library on a large counting loop. The thread library needs an additional header, <mutex>, to create a critical section: the mutex marks the region of code that only one thread may execute at a time, so each thread must finish its counting before another can proceed. This is similar to OpenMP's #pragma omp critical directive, except that OpenMP requires no additional library for it. Taking the average of 5 runs for each thread count, elapsed time improves only slightly as the number of threads increases. Compared with the serial version, the threaded version is slower, largely because of the cost of creating threads and because the lock is held for the entire counting loop, serializing the work.
== OpenMP Threading ==
The code above shows the creation of three threads in a parallel region whose critical sections execute in sequence. This highlights a difference from the C++11 thread library: no explicit joining is required after the threads complete their work, because the parallel region ends with an implicit barrier.

=== Equivalent Counting Program ===
<syntaxhighlight lang="cpp">
#include <iostream>
#include <omp.h>
#include <chrono>

int main() {
    const int iteration = 100000000;
    const int runs = 5;
    const int mthreads = omp_get_max_threads();
    for (int x = 0; x < mthreads; x++) {
        omp_set_num_threads(x + 1);
        double t = 0;
        int count = 0;
        for (int y = 0; y < runs; y++) {
            count = 0;
            double ts = omp_get_wtime();
#pragma omp parallel
            {
                int tid = omp_get_thread_num();
                int nt = omp_get_num_threads();
                int temp = 0;
                for (int j = tid; j < iteration; j += nt) {
                    temp++;
                }
#pragma omp critical
                count += temp;
            }
            double te = omp_get_wtime();
            double temp = (te - ts) * 1000;
            t += temp;
        }
        std::cout << "OpenMP Thread - Counting to: " << count << " - Number of threads created: " << omp_get_max_threads() << " | elapsed time: " << t / runs << " ms" << std::endl;
    }
}
</syntaxhighlight>

<syntaxhighlight lang="cpp">
OpenMP Thread - Counting to: 100000000 - Number of threads created: 1 | elapsed time: 109.845 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 2 | elapsed time: 56.1196 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 3 | elapsed time: 39.7748 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 4 | elapsed time: 33.0161 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 5 | elapsed time: 29.7252 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 6 | elapsed time: 23.1936 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 7 | elapsed time: 20.8697 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 8 | elapsed time: 18.5374 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 9 | elapsed time: 17.1628 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 10 | elapsed time: 16.4023 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 11 | elapsed time: 14.3493 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 12 | elapsed time: 14.0788 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 13 | elapsed time: 12.9446 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 14 | elapsed time: 13.284 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 15 | elapsed time: 13.0647 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 16 | elapsed time: 17.9958 ms
</syntaxhighlight>

Compared to the serial and C++11 thread library versions of the same program, OpenMP shows a clear improvement. In addition, performance scales significantly as the number of threads increases, unlike with the C++11 thread method.

== Data Sharing ==

=== C++ Threading ===

==== mutex and std::lock_guard ====

There are several ways a programmer can ensure that shared data is used by only one thread at a time; one of them was already presented above. The mutex library gives the programmer a tool to lock a region of code so that no other thread can access it at the same time. The .lock() and .unlock() mutex methods can be used as in [[#Parallel_Version|this example]]; however, there is a safer, exception-proof method using the std::lock_guard mutex wrapper.
Here is an example of a simple reduction program using C++ threads:

<syntaxhighlight lang="cpp">
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include "timer.h"

std::mutex guard;

void threadFunc(int ithread, double& accum) {
    double buffer = 1;
    for (int i = ithread * 10000000; i < ithread * 10000000 + 10000000; i++) {
        buffer += 1.0 / (i + 1);
    }
    std::lock_guard<std::mutex> lock(guard);
    accum = accum + buffer;
}

int main(int argc, char* argv[]) {
    Timer t;
    std::vector<std::thread> threads;
    double accum = 0;
    t.reset();
    t.start();
    for (int i = 0; i < 8; i++) {
        threads.push_back(std::thread(threadFunc, i, std::ref(accum)));
    }
    for (auto& thread : threads)
        thread.join();
    t.stop();
    std::cout << "std::lock_guard version - " << accum << " - " << t.currtime() << std::endl;
}
</syntaxhighlight>

The std::lock_guard wrapper automatically unlocks the mutex when it goes out of scope at the end of the function, returning access to the shared data to the other threads.

==== std::atomic ====

Another important header for the C++ threading library is atomic[https://en.cppreference.com/w/cpp/atomic/atomic]. std::atomic is a wrapper type that is free from data races, which allows different threads to access it simultaneously without creating undefined behaviour.
An example of the reduction code above with the std::atomic wrapper:

<syntaxhighlight lang="cpp">
#include <iostream>
#include <thread>
#include <vector>
#include <atomic>
#include "timer.h"

void threadFuncAtomic(int ithread, std::atomic<double>& accum) {
    double buffer = 1;
    for (int i = ithread * 10000000; i < ithread * 10000000 + 10000000; i++) {
        buffer += 1.0 / (i + 1);
    }
    // "accum = accum + buffer" would be a separate load and store, not an
    // atomic update; use a compare-exchange loop for the read-modify-write
    double expected = accum.load();
    while (!accum.compare_exchange_weak(expected, expected + buffer))
        ;
}

int main(int argc, char* argv[]) {
    Timer t;
    t.reset();
    std::vector<std::thread> threadsAtomic;
    std::atomic<double> accumAtomic{0};
    t.start();
    for (int i = 0; i < 8; i++) {
        threadsAtomic.push_back(std::thread(threadFuncAtomic, i, std::ref(accumAtomic)));
    }
    for (auto& thread : threadsAtomic)
        thread.join();
    t.stop();
    std::cout << "std::atomic version - " << accumAtomic << " - " << t.currtime() << std::endl;
}
</syntaxhighlight>

On average the two solutions show no significant difference in performance. Because locks are OS-dependent while atomics depend on processor support, the relative performance of the two approaches depends mostly on the hardware.

=== OpenMP ===

OpenMP provides an easy-to-use solution for sharing data among threads. Both #pragma omp critical and #pragma omp atomic work by allowing only one thread at a time into the critical region; however, atomic has much lower overhead and, where available, takes advantage of hardware support for atomic operations.
Here is an OpenMP atomic alternative to the reduction code example:

<syntaxhighlight lang="cpp">
#include <iostream>
#include <omp.h>
#include "timer.h"

int main(int argc, char* argv[]) {
    Timer t;
    double accum = 0;
    t.reset();
    omp_set_num_threads(8);
    t.start();
#pragma omp parallel
    {
        int ithread = omp_get_thread_num();
        double buffer = 1;
        for (int i = ithread * 10000000; i < ithread * 10000000 + 10000000; i++) {
            buffer += 1.0 / (i + 1);
        }
#pragma omp atomic
        accum += buffer;
    }
    t.stop();
    std::cout << accum << " - " << t.currtime() << std::endl;
}
</syntaxhighlight>

=== Results ===

Performance tests on our machine showed the following results:

C++ threads: std::lock_guard and std::atomic solutions - 65 ms on average.

OpenMP: atomic solution - 31 ms on average.

OpenMP delivers better performance here, and its tools are more comfortable and straightforward to use.
== References ==
Thread Support Library https://en.cppreference.com/w/cpp/thread
 
Mutex Library https://www.cplusplus.com/reference/mutex/mutex/
 
C++ Concurrency in Action, Second Edition, February 2019, ISBN 9781617294693