Open main menu

CDOT Wiki β

GPU621/CamelCaseTeam

Revision as of 20:43, 28 July 2021 by Htruong (talk | contribs) (OpenMP Threading - adding explanation to OpenMP example)

Project Name

C++11 Threads Library Comparison to OpenMP

Group Members

Andrei Fedchenko Hung Truong Ren Ren

Overview

Our team will do a research on comparison between c++ 11 threads library and openmp. We will code problems with both libraries to demonstrate the coding differences and do the performance analysis on those two approaches. We will highlight the techniques and best practices to achieve the best performance and avoid the pitfalls.

Get start with c++ threading

C++ Threading is supported after c++11 standard, previously, it used pthread and windows api or boost library to do multithreading.

In order to start with threading in c++, we need to include <thread>, this includes thread functions and classes. For protecting shared data, we will need to include other headers like <mutex> <condition_variable>, etc.

std::thread work with all callable type, lambda, functor(function object), function pointer. Here are some examples:

1. function pointer

void hello() {
	std::cout << "hello wolrd" << std::endl;
}
std::thread a(hello);

2. lambda

int num1 = 0;
int num2 = 1;
int num3 = 0;

std::thread b([]{
	num3 = num1 + num2;
});

3. functor

class Func {
	void operator()() {
		std::cout << "print in class f operator" << std::endl;
	}
}
Func f();
std::thread c(f);

We need to decide If we want to wait for it to finish (join) or If we want it to run in the background (detach) before the thread object being deconstructed, because it will call std::terminate in the deconstructor and throw a error in runtime.

t.join() -------- synchronized function called in parent thread to wait for the child thread to finish.

t.detach() ------- let threads to run in the background.

C++11 Threading

C++ threading consists of the creation of a thread that works on a function. By having multiple threads, multiple functions are able to be performed at the same time which enables parallel processing.

template<class Function, class... args> 
explicit thread( Function&& f, Args&&... args);

As seen above, a single thread takes a function through overloading and performs the task by passing the arguments to the function parameter which would execute separate to other threads. Once these threads are finished performing their function, the child thread must be joined using the .join() function in the thread library to the parent thread.

#include <thread>
#include <iostream>
#include <chrono>

void thread1() {
	std::this_thread::sleep_for(std::chrono::seconds(2));
	std::cout << "Hello from thread 1" << std::endl;
}

void thread2() {
	std::this_thread::sleep_for(std::chrono::seconds(3));
	std::cout << "Hello from thread 2" << std::endl;
}

void thread3() {
	std::this_thread::sleep_for(std::chrono::seconds(1));
	std::cout << "Hello from thread 3" << std::endl;
}

int main() {
	std::cout << "Creating thread 1: " << std::endl;
	std::thread t1(thread1); //appears 2nd

	std::cout << "Creating thread 2: " << std::endl;
	std::thread t2(thread2); //appears 3rd

	std::cout << "Creating thread 3: " << std::endl;
	std::thread t3(thread3); //appears 1st

	t1.join();
	t2.join();
	t3.join();
}

//Creating thread 1:
//Creating thread 2:
//Creating thread 3:
//Hello from thread 3
//Hello from thread 1
//Hello from thread 2

The code above gives a simple example on how threads are created and joined back to the parent thread in the end.

Usage of C++11 Threading

Serial Version

#include <iostream>
#include <chrono>

int main() {
	const int iteration = 100000000;
	int count = 0;
	std::chrono::steady_clock::time_point ts = std::chrono::steady_clock::now();
	for (count = 0; count < iteration; count++) {}
	std::chrono::steady_clock::time_point te = std::chrono::steady_clock::now();
	auto t = std::chrono::duration_cast<std::chrono::milliseconds>(te - ts);
	std::cout << "Serial - Counting to: " << count << " - elapsed time: " << t.count() << " ms" << std::endl;
}
Serial - Counting to: 100000000 - elapsed time: 122 ms

Parallel Version

#include <thread>
#include <iostream>
#include <chrono>
#include <vector>
#include <mutex>

#define MAX_THREAD 16
std::mutex mtx; //creates critical section

int main() {
	const int iteration = 100000000;
	const int runs = 5;
	for (int x = 0; x < MAX_THREAD; x++) { //iteration over the max amount of cores.
		auto t = 0;
		int count = 0;
		for (int y = 0; y < runs; y++) {
			count = 0;
			std::vector<std::thread> threads;


			//setting the remainder count outside of count, neglectible at larger iteration, max value of number of thread - 1
			for (int i = 0; i < iteration % (x + 1); i++) {
				count++;
			}

			std::chrono::steady_clock::time_point ts = std::chrono::steady_clock::now();
			for (int j = 0; j < x + 1; j++) {
				threads.push_back(
					std::thread([x, iteration, &count]() {
						mtx.lock();
						for (int k = 0; k < iteration / (x + 1); k++) {
							count++;
						}
						mtx.unlock();
						}));
			}

			for (auto& thrd : threads) {
				thrd.join();
			}

			std::chrono::steady_clock::time_point te = std::chrono::steady_clock::now();

			auto temp = std::chrono::duration_cast<std::chrono::milliseconds>(te - ts);
			t += temp.count();
		}
		std::cout <<"C11 Thread - Counting to: " << count << " - Number of threads created: " << x+1 << " | elapsed time: " << t/runs << " ms" << std::endl;
	}
}
C11 Thread - Counting to: 100000000 - Number of threads created: 1 | elapsed time: 243 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 2 | elapsed time: 238 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 3 | elapsed time: 241 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 4 | elapsed time: 224 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 5 | elapsed time: 225 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 6 | elapsed time: 224 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 7 | elapsed time: 228 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 8 | elapsed time: 225 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 9 | elapsed time: 223 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 10 | elapsed time: 231 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 11 | elapsed time: 226 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 12 | elapsed time: 225 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 13 | elapsed time: 222 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 14 | elapsed time: 225 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 15 | elapsed time: 226 ms
C11 Thread - Counting to: 100000000 - Number of threads created: 16 | elapsed time: 223 ms

The above code shows a simple demonstration on the performance of using C11 Thread library where it counts the number of iteration in a large amount. The thread library requires another library to produce a critical section in the form of mutex which allows for signals the section of the code that is under critical so that the thread must run and finish its task before rejoining with the other threads. This is similar to the OpenMP's #pragma omp critical directive but OpenMP does not require an additional library to perform this function. Taking the average of 10 runs for each number of threads, there is a slight increase in performance as the number of threads increases as well. When comparing to the serial version, the threading suffers in performance which is due to calling many headers and creation of threads.

OpenMP Threading

OpenMP threads consists of a parallel region that forks a master thread into multiple child threads that each performs a function in parallel amongst each other and joins back to its parent thread after reaching the end of the parallel region.

#pragma omp construct [clause, ...]
structured block

The construct identifies what block of code is being executed in parallel and the clause qualifies the parallel construct. In comparison to the C++ 11 thread library, the threads are performed in a region while the C++ 11 thread library have to be specifically created and joined back to the parent thread.

#include <iostream>
#include <omp.h>

void main() {
	omp_set_num_threads(3);
	#pragma omp parallel
	{
		int tid = omp_get_thread_num();
		#pragma omp critical
		std::cout << "Thread " << tid + 1 << std::endl;
	}
}

//Thread 1
//Thread 2
//Thread 3

The code above shows the creation of 3 threads in a parallel region which are performed in sequence using critical. This highlights the difference between the c++11 thread library as no joining is required after the threads concludes their function.=

Equivalent Counting Program

#include <iostream>
#include <omp.h>
#include <chrono>

int main() {
	const int iteration = 100000000;
	const int runs = 5;
	const int mthreads = omp_get_max_threads();
	for (int x = 0; x < mthreads; x++) {
		omp_set_num_threads(x + 1);
		double t = 0;
		int count = 0;
		for (int y = 0; y < runs; y++) {
			count = 0;
			double ts = omp_get_wtime();
#pragma omp parallel
			{
				int tid = omp_get_thread_num();
				int nt = omp_get_num_threads();
				int temp = 0;
				for (int j = tid; j < iteration; j += nt) {
					temp++;
				}
#pragma omp critical
				count += temp;
			}
			double te = omp_get_wtime();
			double temp = (te - ts)*1000;
			t += temp;
		}
		std::cout << "OpenMP Thread - Counting to: " << count << " - Number of threads created: " << omp_get_max_threads() << " | elapsed time: " << t / runs << " ms" << std::endl;
	}
}
OpenMP Thread - Counting to: 100000000 - Number of threads created: 1 | elapsed time: 109.845 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 2 | elapsed time: 56.1196 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 3 | elapsed time: 39.7748 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 4 | elapsed time: 33.0161 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 5 | elapsed time: 29.7252 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 6 | elapsed time: 23.1936 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 7 | elapsed time: 20.8697 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 8 | elapsed time: 18.5374 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 9 | elapsed time: 17.1628 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 10 | elapsed time: 16.4023 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 11 | elapsed time: 14.3493 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 12 | elapsed time: 14.0788 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 13 | elapsed time: 12.9446 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 14 | elapsed time: 13.284 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 15 | elapsed time: 13.0647 ms
OpenMP Thread - Counting to: 100000000 - Number of threads created: 16 | elapsed time: 17.9958 ms

Comparison to the serial and C11 thread library equivalent of the code above, OpenMP shows a greater improvement over the other techniques. In addition, the performance increases as the number of thread increases by a significant amount in comparison to the C11 thread library method.

References

Thread Support Library https://en.cppreference.com/w/cpp/thread

Mutex Library https://www.cplusplus.com/reference/mutex/mutex/