Difference between revisions of "Team Hortons"

From CDOT Wiki
Jump to: navigation, search
(Sources)
 
(28 intermediate revisions by 2 users not shown)
Line 2: Line 2:
  
  
= Parallelization of the C++ Standard Library: Intel Parallel STL, MCSTL, and C++17 =
+
= Parallelism with Rust =
  
== Introduction: The Standard Template Library ==
+
== Introduction: The Rust Programming language ==
The Standard Template Library (STL) for C++ is a library that provides a set of classes and functions for C++, like iterators and vectors. The STL is divided in four parts: algorithms, containers, functors, and iterators. For this project, we will focus on the ''algorithms'' library.
 
  
The algorithms library provides many functions to be used for ranges of elements, and use iterators or pointers to navigate through them. Some of these algorithms are:
+
Rust is a system programming language (like C and C++, offering direct access to physical memory) sponsored by Mozilla, created in 2010. It is compiled, imperative, functional and strongly typed; it aims to provide better memory safety and still maintain a performance similar to C++, while providing garbage collection through reference counting. Rust won the first place as the "most loved programming language" in 2016 and 2017, in a survey from Stack Overflow [https://insights.stackoverflow.com/survey/2016#technology-most-loved-dreaded-and-wanted 1]
 +
[https://insights.stackoverflow.com/survey/2017#most-loved-dreaded-and-wanted 2].
  
* '''all_of''' - Tests if a condition is true for all the elements in the range
+
Rust has a very strong emphasis in security, more specifically, memory safety. Because of this, the same rules that the compiler requires from a program end up solving many problems that are common in concurrent executions, such as race conditions. This project will explain what these security rules are, how they can prevent problems in parallel programming, as well as demonstrating the correct way to tackle concurrency with Rust.
* '''for_each''' - Applies a function to all elements in the range
 
* '''find''' - Finds a value in the range
 
* '''sort''' - Sorts the elements in the range
 
  
These algorithms provide a ''functional'' way to work on collections, and are present in several programming languages. These algorithms replace what usually would be a for/while loop, shifting the attention from "how to loop" to "what to do".
+
== Memory Ownership and Borrowing ==
  
<code><pre>
+
Rust has a very peculiar way to deal with memory deallocation, security, and ownership. Understanding how memory management works is essential in order to understand how to use parallel programming in Rust. Consider the following example, where we create a vector, push some elements into it, and then print the elements:
#include <iostream>
 
#include <algorithm>
 
#include <vector>
 
  
using namespace std;
+
<source lang="rust">
 +
fn main() {
 +
  let mut vec = Vec::new();
 +
  vec.push(0);
 +
  vec.push(1);
 +
  vec.push(2);
  
int main()
+
  for i in vec.iter() {
{
+
    println!("{}", i);
   vector<int> myVector = { 10, 6, 7, 8, 5, 4, 1, 2, 3, 9 };
+
   }
 +
}
 +
</source>
 +
 
 +
Output:
 +
 
 +
<source>
 +
0
 +
1
 +
2
 +
</source>
 +
 
 +
Notice how variables in Rust are immutable (constant) by default - we need to manually specify that the vector can be changed. Failing to do so will result in a compilation error.
 +
 
 +
In the next example, we abstracted the "push" method to a function called "add_to_vec", which takes a vector and a number, and pushes the number into the vector:
 +
 
 +
<source lang="rust">
 +
fn add_to_vec(mut v: Vec<i32>, n: i32) {
 +
  v.push(n);
 +
}
 +
 
 +
fn main() {
 +
  let mut vec = Vec::new();
 +
  vec.push(0);
 +
  vec.push(1);
 +
  vec.push(2);
 +
 
 +
  add_to_vec(vec, 3);
 +
 
 +
  for i in vec.iter() {
 +
    println!("{}", i);
 +
  }
 +
}
 +
</source>
 +
 
 +
In a language like C++, a vector passed like this would be copied to the function and modified, leaving the original vector untouched; in a language like Java, a similar logic would work, since objects are passed by reference by default. Rust, on the other hand, has a different approach: sending the vector to a function like this means transferring the ownership of the object completely to that function. This means that after the vector is sent to a function, it is not available to the main function anymore - once the '''add_to_vec''' function ends, the vector will then be deallocated, and the next *for* loop will fail.
 +
 
 +
If we try compiling this code, the compiler will produce an error:
 +
 
 +
<source>
 +
error[E0382]: use of moved value: `vec`
 +
  --> src/main.rs:13:12
 +
  |
 +
11 |  add_to_vec(vec, 3);
 +
  |              --- value moved here
 +
12 |
 +
13 |  for i in vec.iter() {
 +
  |            ^^^ value used here after move
 +
  |
 +
  = note: move occurs because `vec` has type `std::vec::Vec<i32>`, which does not implement the `Copy` trait
 +
</source>
 +
 
 +
To fix this, we will have to transfer the ownership of the vector back to the main function. We do this by returning the vector back and reassigning it to the variable '''vec''':
 +
 
 +
<source lang="rust">
 +
fn add_to_vec(mut v: Vec<i32>, n: i32) -> Vec<i32> {
 +
  v.push(n);
 +
  v // shorthand for returning. notice the lack of ;
 +
}
 +
 
 +
fn main() {
 +
  let mut vec = Vec::new();
 +
  vec.push(0);
 +
  vec.push(1);
 +
  vec.push(2);
  
   // Sorting the elements
+
   vec = add_to_vec(vec, 3); // receiving the ownership back
  sort(myVector.begin(), myVector.end(), [](int a, int b) { return b < a; });
 
  
   cout << "== FOR EACH ==" << endl;
+
   for i in vec.iter() {
  for_each(myVector.begin(), myVector.end(), [](int i) { cout << "Element: " << i << endl; });
+
    println!("{}", i);
   cout << endl;
+
   }
 +
}
 +
</source>
  
 +
Now the output is the expected:
  
  cout << "== COUNTING NUMBERS LARGER THAN 6 ==" << endl;
+
<source>
  long nLt6 = count_if(myVector.begin(), myVector.end(), [](int i) { return i > 6; });
+
0
  cout << "There are " << nLt6 << " numbers larger than 6 in the vector" << endl << endl;
+
1
 +
2
 +
3
 +
</source>
  
 +
An easier way to deal with memory ownership is by lending/borrowing memory instead of completely transferring control over it. This is called memory '''borrowing'''. To borrow a value, we use the '''&''' operator:
  
  cout << "== FINDING AN ELEMENT ==" << endl;
+
<source lang="rust">
  vector&lt;int>::iterator elFound = find_if(myVector.begin(), myVector.end(), [](int i) { return i > 4 && i % 6 == 0; });
+
fn add_to_vec(v: &mut Vec<i32>, n: i32) {
  cout << "The element " << *elFound << " is the first that satisfies i > 4 && i % 6 == 0" << endl << endl;
+
  v.push(n);
 +
}
  
   return 0;
+
fn main() {
 +
   let mut vec = Vec::new();
 +
  vec.push(0);
 +
  vec.push(1);
 +
  vec.push(2);
 +
 
 +
  add_to_vec(&mut vec, 3);
 +
 
 +
  for i in vec.iter() {
 +
    println!("{}", i);
 +
  }
 
}
 
}
</pre></code>
+
</source>
  
Output:
+
This will also produce the correct output. In this case, we don't need to send the ownership of the vector back, because it was never transferred, only borrowed. There is one important detail: '''mutable references (&mut T) are only allowed to exist if no other mutable references to that value are active. In other words: only one mutable reference to a value may exist at a time'''.
  
<pre>
+
From what was presented above, some problems can be identified when applying parallelism:
== FOR EACH ==
+
* If memory ownership is transferred in function calls, we can not transfer the ownership of one something to several threads at the same time - only one thread can have the ownership
Element: 10
+
* Mutable references to something can only exist one at a time, this means that we can not spawn several threads with mutable references to a value
Element: 9
 
Element: 8
 
Element: 7
 
Element: 6
 
Element: 5
 
Element: 4
 
Element: 3
 
Element: 2
 
Element: 1
 
  
== COUNTING NUMBERS LARGER THAN 6 ==
+
These two factors limit the capability of conventional shared memory. For example, the following code would not compile:
There are 4 numbers larger than 6 in the vector
 
  
== FINDING AN ELEMENT ==
+
<source lang="rust">
The element 6 is the first that satisfies i > 4 && i % 6 == 0
+
use std::thread;
</pre>
 
  
These algorithms not only make the code easier to write and read, but it also tends to be faster: these algorithms are heavily optimized, making them much faster than for/while loops.
+
fn add_to_vec(v: &mut Vec<i32>, n: i32) {
 +
  v.push(n);
 +
}
  
The execution for these algorithms, however, are not parallel by default: they are sequential. However, it is possible to parallelize many of these algorithms, making their execution much faster. This project will discuss the following methods for parallelizing the STL: Policy-based execution for C++17 + Intel Parallel STL.
+
fn main() {
 +
  let mut vec = Vec::new();
 +
  vec.push(0);
 +
  vec.push(1);
 +
  vec.push(2);
  
== Policy-based execution for C++17 and Intel Parallel STL ==
+
  let t1 = thread::spawn(move || {
C++17, introduces a feature called Execution Policies, which can be used to specify what kind of execution is is desired for the algorithm. There are three possible execution policies:
+
    add_to_vec(&mut vec, 4);
 +
  });
  
* '''std::execution::seq''' - Sequential execution (no parallelism)
+
  add_to_vec(&mut vec, 3);
* '''std::execution::par''' - Parallel execution (multiple threads)
 
* '''std::execution::par_unseq''' - Parallel execution + vectorization
 
  
The Intel C++ compiler also supports another policy, which was not specified in the [http://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t C++ Documentation]:
+
  for i in vec.iter() {
 +
    println!("{}", i);
 +
  }
  
* '''std::execution::unseq''' - Vectorization
+
  t1.join();
 +
}
 +
</source>
  
These executions are passed as the first parameter for the algorithm:
+
<source>
 +
error[E0382]: use of moved value: `vec`
 +
  --> src/main.rs:20:19
 +
  |
 +
13 |  let t1 = thread::spawn(move || {
 +
  |                          ------- value moved (into closure) here
 +
...
 +
20 |  add_to_vec(&mut vec, 3);
 +
  |                  ^^^ value used here after move
 +
  |
 +
</source>
  
<code><pre>
+
Although this seems unnecessarily strict, it has a great advantage: race conditions are impossible - they are checked at compile time, which means that the systems will be safer and less likely to present concurrency problems. To correctly use parallel programming in Rust, we can make use of tools such as Channels, Locks (Mutex), and Atomic pointers.
std::copy(
 
std::execution::par,
 
a.start(),
 
a.end(),
 
b.start()
 
);
 
</pre></code>
 
  
Most compilers nowadays do not yet support this feature. Probably the only compiler that can already make use of these policies is the Intel C++ Compiler, which was used to perform the tests below.
+
== Channels ==
  
Here are the tests performed: we compared the speed of the four execution policies above for six algorithms: **sort**, **count\_if**, **for_each**, **reduce**, **transform**, and **copy**, using three different data structures: **array**, **vector**, and **list**. We used the Intel C++ Compiler on Windows, on an x86 64 machine with 8 cores. The programs were compiled with **O3** optimization, including the Intel TBB (Thread Building Blocks) library and OpenMP. The purpose of these tests was to see how the elapsed time would change with different execution policies and different data structures.
+
Instead of sharing memory, Rust encourages communication between thread in order to combine information. Channels, which are available in Rust's standard library, can be pictured as rivers: an object put into this river will be carried downstream. Likewise, a Channel has two parts: a sender and a receiver. The sender is responsible for putting information in the channel, which can be retrieved by the receiver. When we ask for a receiver to get a message from the channel, the execution may wait until the message is received or not: the method '''recv''' will block the execution, while '''try_recv''' will not.
  
 +
The following example will spawn a thread, which will send a few strings down a channel which will be received by the main thread.
  
Some observations to point out:
+
<source lang="rust">
 +
use std::thread;
 +
use std::sync::mpsc;
 +
use std::sync::mpsc::{Sender, Receiver};
  
* There is not much documentation available on how to properly use execution policies (at least I did not find anything in depth).
+
fn main() {
* The programmer who is using execution policies is still responsible for avoiding deadlocks and race conditions.
+
  let (sender, receiver): (Sender<String>, Receiver<String>) =
* The purpose of these tests is not to compare one compiler versus another or one algorithm versus another.
+
    mpsc::channel();
* The code for all the tests is available [https://github.com/hscasn/testpstl here].
 
* **std::vector** is a dynamic array: the data is kept sequentially in memory, while **std::list** is a linked list (the data is scattered in memory).
 
* The tests were ran three times, and the average of the elapsed time was used as the results, avoiding "abnormal" results.
 
  
----
+
  thread::spawn(move || {
**std::copy**
+
    let messages = vec![
 +
      String::from("blue"),
 +
      String::from("yellow"),
 +
      String::from("green"),
 +
      String::from("red"),
 +
    ];
  
''Code snippet''
+
    for colour in messages {
 +
      sender.send(colour).unwrap();
 +
    }
 +
  });
  
Copying values from array ''a'' to array ''b''.
+
  for message in receiver {
 +
    println!("Got {}", message);
 +
  }
  
<code><pre>
+
}
std::copy(
+
</source>
std::execution::par,
 
a,
 
a + ARRSZ,
 
b
 
);
 
</pre></code>
 
  
''Results''
+
This produces the following output:
  
[[File:http://public.hcoelho.com/images/blog/pstl_copy.png]]
+
<source>
 +
Got blue
 +
Got yellow
 +
Got green
 +
Got red
 +
</source>
  
First, for the array and the vector, these results look rather strange - we were expecting that both for the array and the vector, we would notice a downward trend for the time for the execution policies: since the memory is sequential, they could easily be vectorized.
+
Similarly to the previous examples, the '''sender''' will be moved to the closure executed by the thread, which means that it will go out of scope when the thread finishes. To distribute the sender through several threads, they can be cloned: objects sent through this sender will still be received by the same receiver, but the two senders are different objects, which can have their ownership moved:
  
There are some important guesses/conclusions here that I want to mention. They will be important to explain these results and the results for the following tests:
+
<source lang="rust">
- An explanation for why the vectorized method did not yield a good result could be that because I was compiling with **O3**, the serial code is already being vectorized in the background; the fact that I would be asking for it to be vectorized again would add some overhead
+
use std::thread;
- I am not entirely sure why the parallel policy did not always have a good result, but if I had to guess, I would say it was because the copies were creating a **race condition**, which was slowing the parallel execution down
+
use std::sync::mpsc;
 +
use std::sync::mpsc::{Sender, Receiver};
  
Now, about the *list*: the parallel versions were better than the serial and vectorized ones. Why did the vectorize version did not yield a good result? This could be explained by the fact that vectorization did not happen: because of the nature of a linked list, the memory is too scattered to be put together in a vector register and processed with **SIMD**, and this would not improve the timing at all. One thing to point out: **since a list is not sequential in memory, it is costly to "slice" the collection in equal parts and parallelize them (we basically have to iterate through the whole list), so the gains for parallel execution are not great. If, however, the operation to be performed in every element is slow, this slow slicing can still be worth it**.
+
static NUM_THREADS: i32 = 8;
  
----
+
fn main() {
**std::count_if**
+
  let mut vec = Vec::new();
  
''Snippet:''
+
  let (sender, receiver): (Sender<i32>, Receiver<i32>) =
 +
    mpsc::channel();
  
<code><pre>
+
  for i in 0..NUM_THREADS {
// Counting all multiples of 3
+
    // cloning the sender so it can be exclusive to
auto condition = [](mytype& i) { return i % 3 == 0; };
+
    // the new thread
 +
    let thread_sender = sender.clone();
  
size_t sum = std::count_if(
+
    thread::spawn(move || {
std::execution::par,
+
      thread_sender.send(i).unwrap();
a,
+
    });
a + ARRSZ,
+
  }
condition
 
);
 
</pre></code>
 
  
''Results:''
+
  for _ in 0..NUM_THREADS {
 +
    vec.push(receiver.recv().unwrap());
 +
  }
  
[[File:http://public.hcoelho.com/images/blog/pstl_count_if.png]]
+
  for i in vec.iter() {
 +
    println!("{}", i);
 +
  }
  
These results look a lot more like what we were expecting:
+
}
* The vectorized version is the slowest, because vectorizing an operation like this is not possible, and we have the extra overhead of making this decision
+
</source>
* The parallel version performed better than any other policy, because the parallel + vectorized version had the overhead for deciding if the vectorization should happen or not
 
* We had a similar result for the list, but the gains were dramatically smaller. This is because, as I cited before, a list is not sequential in memory, so it is costly to "slice" the collection in equal parts and parallelize them. We did have a very small gain, but the cost for iterating through the list was way too big.
 
  
----
+
Output:
'''std::for_each'''
 
  
''Snippet:''
+
<source>
 +
1
 +
3
 +
0
 +
2
 +
5
 +
6
 +
7
 +
4
 +
</source>
  
<code><pre>
+
== Locks (Mutex) ==
size_t sum;
 
auto action = [&](mytype& i) { return sum += i; };
 
std::for_each(
 
std::execution::par,
 
a,
 
a + ARRSZ,
 
action
 
);
 
</pre></code>
 
  
Notice how this **for_each** behaves like a **reduce**: I am modifying the variable "sum" outside of this function. This is very likely to cause a **race condition**.
+
Rust language also supports shared-state concurrency which allows two or more processes have some shared state (data) between them they can write to and read from. Sharing data between multiple threads can get pretty complicated since it introduces the strong hazard of race conditions. Imagine a situation when one thread grabs some data and attempts to change it, while another thread is starting to read the same value, there’s no way to predict if latter one retrieves updated data or if it gets hold of the old value. Thus shared-state concurrency has to deliver some implementation of guard and synchronized access mechanism.
 +
[[File:RustMutex.png|thumb|left|alt=Mutex|Acquire/Release Lock Process ]]
  
''Results:''
+
The access to shared state (critical section) and synchronization methods are typically implemented through ‘mutex’. Mutual exclusion principles (mutex) provide ways to prevent race conditions by only allowing one thread to reach out some data at any given time. If one thread wants to read/write some piece of data, it must first give a signal and then, once permitted, acquire the mutex’s lock. The lock is a special data structure that can keep track of who currently has exclusive access to the data. You can think of locks as dedicated mechanism that guards critical section.
  
[[File:http://public.hcoelho.com/images/blog/pstl_for_each.png]]
+
Since Rust Lang supports ownership principles, the threads are inaccessible to each other automatically and only one thread can access data at any given time time.
  
Even though we had (what it looked like) a race condition,we still got good results with the parallel excution policies for array and vector. Again, this process could not be vectorized, and this is why the vectorization policy did not do well. For the **list**, we see the same pattern as before: since the slicing for the collection is costly, it seems like the either the compiler did not parallelize it, or the parallel version was just as slow as the serial version.
+
The snippet below demonstrates how you can create and access mutex in rustlang:  
  
----
+
<source lang="rust">
'''std::reduce'''
+
fn use_lock(mutex: &Mutex<Vec<i32>>) {
 +
    let mut guard = lock(mutex); // take ownership
 +
    let numbers = access(&mut guard);  // borrow access
 +
    numbers.push(42);  // change the data
 +
}
 +
</source>
  
''Snippet:''
+
Mutex construct is generic and it accepts the piece of shared data protected by the lock. It is important to remember that the ownership of that data is transferred into the mutex structure at its creation.
  
<code><pre>
+
The lock function will attempt to acquire the lock by blocking the local thread until it is available to obtain the mutex. The data is automatically unlocked once the return value of lock() function gets released (at the end of the function scope) so there is no need to release the mutex lock manually.
size_t sum = std::reduce(
 
std::execution::par,
 
a.begin(),
 
a.end()
 
);
 
</pre></code>
 
  
''Results:''
+
== RC and Atomic ==
  
[[File:http://public.hcoelho.com/images/blog/pstl_reduce.png]]
+
In Rust, some data types are defined as “thread safe” while others are not. For example, Rc<T> type, which is Rust’s own implementation of “smart pointer”, is considered to be unsafe to share across threads. This type keeps track of number of references and increments/decrements count each time a new reference is created or old one gets destroyed. Rc<T> does not enforce any mechanisms that makes sure that changes to the reference counter value can’t be interrupted by another thread. The snippet below demonstrates the issue an results in compiler error:
  
The results here are very similar to the **for_each** algorithm - it seems like the race condition I made with the "*sum*" for the previous test was not really a problem for the algorithm.
+
<source lang="rust">
 +
let mutex = Rc::new(Mutex::new(2));
 +
let mut handles = vec![];
  
 +
for _ in 0..4 {
 +
    let mutex = mutex.clone();
 +
    let handle = thread::spawn(move || {
 +
        let mut num = mutex.lock().unwrap();
  
----
+
        *num *= 2;
 +
        println!("Intermediate Result : {}", *num);
 +
    });
 +
    handles.push(handle);
 +
}
  
''' std::sort'''
+
for handle in handles {
 +
    handle.join().unwrap();
 +
}
  
''Snippet:''
+
println!("Final Result: {}", *mutex.lock().unwrap());
 +
</source>
  
This algorithm is a bit different: we cannot use ''std::list'' with it. Since the algorithm requires random access, it is only possible to use arrays or vectors.
+
This will produce an error :
  
<code><pre>
+
<source lang="rust">
auto sorter = [](mytype a, mytype b) { return a < b; };
+
error[E0277]: the trait bound `std::rc::Rc<std::sync::Mutex<i32>>: std::marker::Send` is not satisfied in `[closure@src/main.rs:11:32: 16:6 mutex:std::rc::Rc<std::sync::Mutex<i32>>]`
 +
  --> src/main.rs:11:18
 +
  |
 +
11 |    let handle = thread::spawn(move || {
 +
  |                  ^^^^^^^^^^^^^ `std::rc::Rc<std::sync::Mutex<i32>>` cannot be sent between threads safely
 +
  |
 +
  = help: within `[closure@src/main.rs:11:32: 16:6 mutex:std::rc::Rc<std::sync::Mutex<i32>>]`, the trait `std::marker::Send` is not implemented for `std::rc::Rc<std::sync::Mutex<i32>>`
 +
  = note: required because it appears within the type `[closure@src/main.rs:11:32: 16:6 mutex:std::rc::Rc<std::sync::Mutex<i32>>]`
 +
  = note: required by `std::thread::spawn`
  
std::sort(
+
</source>
std::execution::par,
 
a,
 
a + ARRSZ,
 
sorter
 
);
 
</pre></code>
 
  
''Results:''
+
Luckily for us, Rust ships another thread-safe implementation of ‘smart pointer’ called Arc<T> which implements reference counting via atomic operations. It is important to note that in serial code, it makes much more sense to use standard Rc<T> over Arc<T> type since the latter structure is more expensive performance-wise.
  
[[File:http://public.hcoelho.com/images/blog/pstl_sort.png]]
+
<source lang="rust">
 +
use std::sync::{Arc, Mutex};
 +
use std::thread;
  
Probably the most dramatic results so far. Vectorization for sorting is probably not something that can be done, so it makes sense that the vectorized policies did not yield a good result. The parallel versions, however, had a very dramatic improvement. It seems that this **std::sort** implementation is really optimised for parallel execution!
+
fn main() {
 +
let mutex = Arc::new(Mutex::new(2));
 +
let mut handles = vec![];
  
 +
for _ in 0..4 {
 +
    let mutex = mutex.clone();
 +
    let handle = thread::spawn(move || {
 +
        let mut num = mutex.lock().unwrap();
  
----
+
        *num *= 2;
 +
        println!("Intermediate Result : {}", *num);
 +
    });
 +
    handles.push(handle);
 +
}
  
'''std::transform'''
+
for handle in handles {
 +
    handle.join().unwrap();
 +
}
  
''Snippet:''
+
println!("Final Result: {}", *mutex.lock().unwrap());
  
<code><pre>
+
}
auto action = [](mytype i) { return i += 5; };
 
  
std::transform(
+
</source>
std::execution::par,
 
a,
 
a + ARRSZ,
 
a,
 
action
 
);
 
</pre></code>
 
  
''Results:''
+
This will produce the expected result :
  
[[File:http://public.hcoelho.com/images/blog/pstl_transform.png]]
+
<source>
 +
Intermediate Result : 4
 +
Intermediate Result : 8
 +
Intermediate Result : 16
 +
Intermediate Result : 32
 +
Final Result: 32
 +
</source>
  
For the list, what happened here seems to be similar to what happened before: it is too costly to parallelize or vectorize the operations, so the execution policies did not have a good result. For the array, we also had a similar result as before, with the parallel version working better than the other ones.
+
This time the code worked. This example was very simple and not too impressive. Much more complicated algorithms can be implemented with the Mutex<T>.
  
For the **vector** data structure, however, we finally had a different result: for some reason, the vectorized version, this time, had a good effect. Apparently the serial version could not be vectorized by default, but when we used the **unseq** execution policy, it was now able to vectorize the operation. An explanation for the fact that the parallel + vectorization version did not have a better result than the parallel one, however, is not very obvious to me. If I had to guess, I would say it is because the parallel version had the vectorization done by the compiler as well, without us needing to specify it.
+
== Rayon Library ==
 +
=== Introduction ===
  
In conclusion, it is refreshing to see a tool like this. It makes parallelization so easy that we almost don't need to worry about anything else. It would be nice to have more clear documentation for these methods, so the results could be a little more predictable - Some of these results are not exactly intuitive. A suggestion for the next project is to analyse the assembly code that was generated to see exactly why we got some of these results.
+
Rust lang users and community are heavily involved in development of language extensions. Although core features offer quite a bit of programming tools on their own, there is still a demand for more complex functionality with a higher-level API. Rust has its own package registry that stores various libraries that can satisfy programmers needs. These packages vary from networking tools, web services to image processing and multi-threading libraries.
 +
One package can be of particular interest to parallel programming enthusiasts – it is a data-parallelism library called Rayon. The package is praised to be extremely lightweight and it makes it easy to convert a sequential computation into a parallel one. It also abstracts some Rust-lang threading boilerplate code and makes coding parallel tasks a bit more fun.
 +
Just like TBB, it can be used with iterators, where iterator chains are executed in parallel. It also supports join model and provides ways to convert recursive, divide-and-conquer style problems into parallel implementations.  
  
 +
=== Parallel Iterators ===
 +
Rayon offers experimental API called “parallel iterators”. Example below demonstrates parallel sum of squares function:
 +
 +
<source lang="rust">
 +
use rayon::prelude::*;
 +
fn sum_of_squares(input: &[i32]) -> i32 {
 +
    input.par_iter()
 +
        .map(|&i| i * i)
 +
        .sum()
 +
}
 +
</source>
 +
 +
=== Using Join for Divide-and-Conquer Problems ===
 +
Parallel iterators are implemented with a more primitive method called join which takes 2 closures and runs them in parallel. The code snippet below  demonstrates increment algorithm:
 +
 +
<source lang="rust">
 +
 +
fn increment_all(slice: &mut [i32]) {
 +
    if slice.len() < 1000 {
 +
        for p in slice { *p += 1; }
 +
    } else {
 +
        let mid_point = slice.len() / 2;
 +
        let (left, right) = slice.split_at_mut(mid_point);
 +
        rayon::join(|| increment_all(left), || increment_all(right));
 +
    }
 +
}
 +
</source>
  
 +
=== How it works ===
 +
Rayon borrowed work stealing concepts from Cilk project. The library attempts to dynamically ascertain how much multi-threading resources are available. The idea is very simple: there is  always a pool of worker threads available, waiting for some work to do. When you call join the first time, we shift over into that pool of threads.
  
 +
[[File:Rust_Join1.png]]
  
== ??? ==
+
But if you call join(a, b) from a worker thread W, then W will place b into its work queue, advertising that this is work that other worker threads might help out with. W will then start executing a. While W is busy with a, other threads might come along and take b from its queue. That is called stealing b.
<pre style="color: red">
 
???
 
</pre>
 
  
 +
[[File:Rust_Join_2.png|600px]]
  
== Sources ==
+
Once a is done, W checks whether b was stolen by another thread and, if not, executes b itself. If W runs out of jobs in its own queue, it will look through the other threads' queues and try to steal work from them.
* STL: https://en.wikipedia.org/wiki/Standard_Template_Library
 
* STL: http://www.cplusplus.com/reference/algorithm/
 
* Policy-Based execution for C++17: https://scs.senecac.on.ca/~oop345/pages/content/multi.html#alg
 
* Intel Parallel STL: https://software.intel.com/en-us/get-started-with-pstl*
 
  
 +
[[File:Rust_Join_3.png|600px]]
  
 
== Group Members ==  
 
== Group Members ==  
Line 282: Line 429:
 
# [mailto:obelavina@myseneca.ca Olga Belavina]
 
# [mailto:obelavina@myseneca.ca Olga Belavina]
  
 +
== Sources ==
 +
* [https://doc.rust-lang.org/book/second-edition/ch03-01-variables-and-mutability.html Variables and Mutability]
 +
* [https://doc.rust-lang.org/book/second-edition/ch04-01-what-is-ownership.html Ownership]
 +
* [https://doc.rust-lang.org/book/second-edition/ch04-02-references-and-borrowing.html References and Borrowing]
 +
* [https://doc.rust-lang.org/std/sync/atomic/struct.AtomicPtr.html Atomic Pointers]
 +
* [https://doc.rust-lang.org/std/sync/struct.Mutex.html Mutex]
 +
* [https://docs.rs/rayon/0.9.0/rayon/ Rayon Library]
 +
* [https://static.rust-lang.org/doc/master/std/sync/mpsc/index.html Channels]
 +
* [https://blog.rust-lang.org/2015/04/10/Fearless-Concurrency.html Fearless Concurrency with Rust]
  
 
== Progress ==
 
== Progress ==
  
 
-5%
 
-5%

Latest revision as of 13:10, 15 December 2017

Once upon a time there was a team. In their programs, they never used floats. They only used double doubles.


Parallelism with Rust

Introduction: The Rust Programming language

Rust is a system programming language (like C and C++, offering direct access to physical memory) sponsored by Mozilla, created in 2010. It is compiled, imperative, functional and strongly typed; it aims to provide better memory safety and still maintain a performance similar to C++, while providing garbage collection through reference counting. Rust won the first place as the "most loved programming language" in 2016 and 2017, in a survey from Stack Overflow 1 2.

Rust has a very strong emphasis in security, more specifically, memory safety. Because of this, the same rules that the compiler requires from a program end up solving many problems that are common in concurrent executions, such as race conditions. This project will explain what these security rules are, how they can prevent problems in parallel programming, as well as demonstrating the correct way to tackle concurrency with Rust.

Memory Ownership and Borrowing

Rust has a very peculiar way to deal with memory deallocation, security, and ownership. Understanding how memory management works is essential in order to understand how to use parallel programming in Rust. Consider the following example, where we create a vector, push some elements into it, and then print the elements:

fn main() {
  let mut vec = Vec::new();
  vec.push(0);
  vec.push(1);
  vec.push(2);

  for i in vec.iter() {
    println!("{}", i);
  }
}

Output:

0
1
2

Notice how variables in Rust are immutable (constant) by default - we need to manually specify that the vector can be changed. Failing to do so will result in a compilation error.

In the next example, we abstracted the "push" method to a function called "add_to_vec", which takes a vector and a number, and pushes the number into the vector:

fn add_to_vec(mut v: Vec<i32>, n: i32) {
  v.push(n);
}

fn main() {
  let mut vec = Vec::new();
  vec.push(0);
  vec.push(1);
  vec.push(2);

  add_to_vec(vec, 3);

  for i in vec.iter() {
    println!("{}", i);
  }
}

In a language like C++, a vector passed like this would be copied to the function and modified, leaving the original vector untouched; in a language like Java, a similar logic would work, since objects are passed by reference by default. Rust, on the other hand, has a different approach: sending the vector to a function like this means transferring the ownership of the object completely to that function. This means that after the vector is sent to a function, it is not available to the main function anymore - once the add_to_vec function ends, the vector will then be deallocated, and the next *for* loop will fail.

If we try compiling this code, the compiler will produce an error:

error[E0382]: use of moved value: `vec`
  --> src/main.rs:13:12
   |
11 |   add_to_vec(vec, 3);
   |              --- value moved here
12 | 
13 |   for i in vec.iter() {
   |            ^^^ value used here after move
   |
   = note: move occurs because `vec` has type `std::vec::Vec<i32>`, which does not implement the `Copy` trait

To fix this, we will have to transfer the ownership of the vector back to the main function. We do this by returning the vector back and reassigning it to the variable vec:

fn add_to_vec(mut v: Vec<i32>, n: i32) -> Vec<i32> {
  v.push(n);
  v // shorthand for returning. notice the lack of ;
}

fn main() {
  let mut vec = Vec::new();
  vec.push(0);
  vec.push(1);
  vec.push(2);

  vec = add_to_vec(vec, 3); // receiving the ownership back

  for i in vec.iter() {
    println!("{}", i);
  }
}

Now the output is the expected:

0
1
2
3

An easier way to deal with memory ownership is by lending/borrowing memory instead of completely transferring control over it. This is called memory borrowing. To borrow a value, we use the & operator:

fn add_to_vec(v: &mut Vec<i32>, n: i32) {
  v.push(n);
}

fn main() {
  let mut vec = Vec::new();
  vec.push(0);
  vec.push(1);
  vec.push(2);

  add_to_vec(&mut vec, 3);

  for i in vec.iter() {
    println!("{}", i);
  }
}

This will also produce the correct output. In this case, we don't need to send the ownership of the vector back, because it was never transferred, only borrowed. There is one important detail: mutable references (&mut T) are only allowed to exist if no other mutable references to that value are active. In other words: only one mutable reference to a value may exist at a time.

From what was presented above, some problems can be identified when applying parallelism:

  • If memory ownership is transferred in function calls, we can not transfer the ownership of one something to several threads at the same time - only one thread can have the ownership
  • Mutable references to something can only exist one at a time, this means that we can not spawn several threads with mutable references to a value

These two factors limit the capability of conventional shared memory. For example, the following code would not compile:

use std::thread;

fn add_to_vec(v: &mut Vec<i32>, n: i32) {
  v.push(n);
}

fn main() {
  let mut vec = Vec::new();
  vec.push(0);
  vec.push(1);
  vec.push(2);

  let t1 = thread::spawn(move || {
    add_to_vec(&mut vec, 4);
  });

  add_to_vec(&mut vec, 3);

  for i in vec.iter() {
    println!("{}", i);
  }

  t1.join();
}
error[E0382]: use of moved value: `vec`
  --> src/main.rs:20:19
   |
13 |   let t1 = thread::spawn(move || {
   |                          ------- value moved (into closure) here
...
20 |   add_to_vec(&mut vec, 3);
   |                   ^^^ value used here after move
   |

Although this seems unnecessarily strict, it has a great advantage: race conditions are impossible - they are checked at compile time, which means that the systems will be safer and less likely to present concurrency problems. To correctly use parallel programming in Rust, we can make use of tools such as Channels, Locks (Mutex), and Atomic pointers.

Channels

Instead of sharing memory, Rust encourages communication between thread in order to combine information. Channels, which are available in Rust's standard library, can be pictured as rivers: an object put into this river will be carried downstream. Likewise, a Channel has two parts: a sender and a receiver. The sender is responsible for putting information in the channel, which can be retrieved by the receiver. When we ask for a receiver to get a message from the channel, the execution may wait until the message is received or not: the method recv will block the execution, while try_recv will not.

The following example will spawn a thread, which will send a few strings down a channel which will be received by the main thread.

use std::thread;
use std::sync::mpsc;
use std::sync::mpsc::{Sender, Receiver};

fn main() {
  let (sender, receiver): (Sender<String>, Receiver<String>) = 
    mpsc::channel();

  thread::spawn(move || {
    let messages = vec![
      String::from("blue"),
      String::from("yellow"),
      String::from("green"),
      String::from("red"),
    ];

    for colour in messages {
      sender.send(colour).unwrap();
    }
  });

  for message in receiver {
    println!("Got {}", message);
  }

}

This produces the following output:

Got blue
Got yellow
Got green
Got red

Similarly to the previous examples, the sender will be moved to the closure executed by the thread, which means that it will go out of scope when the thread finishes. To distribute the sender through several threads, they can be cloned: objects sent through this sender will still be received by the same receiver, but the two senders are different objects, which can have their ownership moved:

use std::thread;
use std::sync::mpsc;
use std::sync::mpsc::{Sender, Receiver};

static NUM_THREADS: i32 = 8;

fn main() {
  let mut vec = Vec::new();

  let (sender, receiver): (Sender<i32>, Receiver<i32>) = 
    mpsc::channel();

  for i in 0..NUM_THREADS {
    // cloning the sender so it can be exclusive to
    // the new thread
    let thread_sender = sender.clone();

    thread::spawn(move || {
      thread_sender.send(i).unwrap();
    });
  }

  for _ in 0..NUM_THREADS {
    vec.push(receiver.recv().unwrap());
  }

  for i in vec.iter() {
    println!("{}", i);
  }

}

Output:

1
3
0
2
5
6
7
4

Locks (Mutex)

Rust language also supports shared-state concurrency which allows two or more processes have some shared state (data) between them they can write to and read from. Sharing data between multiple threads can get pretty complicated since it introduces the strong hazard of race conditions. Imagine a situation when one thread grabs some data and attempts to change it, while another thread is starting to read the same value, there’s no way to predict if latter one retrieves updated data or if it gets hold of the old value. Thus shared-state concurrency has to deliver some implementation of guard and synchronized access mechanism.

Mutex
Acquire/Release Lock Process

The access to shared state (critical section) and synchronization methods are typically implemented through ‘mutex’. Mutual exclusion principles (mutex) provide ways to prevent race conditions by only allowing one thread to reach out some data at any given time. If one thread wants to read/write some piece of data, it must first give a signal and then, once permitted, acquire the mutex’s lock. The lock is a special data structure that can keep track of who currently has exclusive access to the data. You can think of locks as dedicated mechanism that guards critical section.

Since Rust Lang supports ownership principles, the threads are inaccessible to each other automatically and only one thread can access data at any given time time.

The snippet below demonstrates how you can create and access mutex in rustlang:

fn use_lock(mutex: &Mutex<Vec<i32>>) {
    let mut guard = lock(mutex); // take ownership
    let numbers = access(&mut guard);   // borrow access 
    numbers.push(42);  // change the data
}

Mutex construct is generic and it accepts the piece of shared data protected by the lock. It is important to remember that the ownership of that data is transferred into the mutex structure at its creation.

The lock function will attempt to acquire the lock by blocking the local thread until it is available to obtain the mutex. The data is automatically unlocked once the return value of lock() function gets released (at the end of the function scope) so there is no need to release the mutex lock manually.

RC and Atomic

In Rust, some data types are defined as “thread safe” while others are not. For example, Rc<T> type, which is Rust’s own implementation of “smart pointer”, is considered to be unsafe to share across threads. This type keeps track of number of references and increments/decrements count each time a new reference is created or old one gets destroyed. Rc<T> does not enforce any mechanisms that makes sure that changes to the reference counter value can’t be interrupted by another thread. The snippet below demonstrates the issue an results in compiler error:

let mutex = Rc::new(Mutex::new(2));
let mut handles = vec![];

for _ in 0..4 {
    let mutex = mutex.clone();
    let handle = thread::spawn(move || {
        let mut num = mutex.lock().unwrap();

        *num *= 2;
        println!("Intermediate Result : {}", *num); 
    });
    handles.push(handle);
}

for handle in handles {
    handle.join().unwrap();
}

println!("Final Result: {}", *mutex.lock().unwrap());

This will produce an error :

error[E0277]: the trait bound `std::rc::Rc<std::sync::Mutex<i32>>: std::marker::Send` is not satisfied in `[closure@src/main.rs:11:32: 16:6 mutex:std::rc::Rc<std::sync::Mutex<i32>>]`
  --> src/main.rs:11:18
   |
11 |     let handle = thread::spawn(move || {
   |                  ^^^^^^^^^^^^^ `std::rc::Rc<std::sync::Mutex<i32>>` cannot be sent between threads safely
   |
   = help: within `[closure@src/main.rs:11:32: 16:6 mutex:std::rc::Rc<std::sync::Mutex<i32>>]`, the trait `std::marker::Send` is not implemented for `std::rc::Rc<std::sync::Mutex<i32>>`
   = note: required because it appears within the type `[closure@src/main.rs:11:32: 16:6 mutex:std::rc::Rc<std::sync::Mutex<i32>>]`
   = note: required by `std::thread::spawn`

Luckily for us, Rust ships another thread-safe implementation of ‘smart pointer’ called Arc<T> which implements reference counting via atomic operations. It is important to note that in serial code, it makes much more sense to use standard Rc<T> over Arc<T> type since the latter structure is more expensive performance-wise.

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
let mutex = Arc::new(Mutex::new(2));
let mut handles = vec![];

for _ in 0..4 {
    let mutex = mutex.clone();
    let handle = thread::spawn(move || {
        let mut num = mutex.lock().unwrap();

        *num *= 2;
        println!("Intermediate Result : {}", *num); 
    });
    handles.push(handle);
}

for handle in handles {
    handle.join().unwrap();
 }

println!("Final Result: {}", *mutex.lock().unwrap());

}

This will produce the expected result :

Intermediate Result : 4
Intermediate Result : 8
Intermediate Result : 16
Intermediate Result : 32
Final Result: 32

This time the code worked. This example was very simple and not too impressive. Much more complicated algorithms can be implemented with the Mutex<T>.

Rayon Library

Introduction

Rust lang users and community are heavily involved in development of language extensions. Although core features offer quite a bit of programming tools on their own, there is still a demand for more complex functionality with a higher-level API. Rust has its own package registry that stores various libraries that can satisfy programmers needs. These packages vary from networking tools, web services to image processing and multi-threading libraries. One package can be of particular interest to parallel programming enthusiasts – it is a data-parallelism library called Rayon. The package is praised to be extremely lightweight and it makes it easy to convert a sequential computation into a parallel one. It also abstracts some Rust-lang threading boilerplate code and makes coding parallel tasks a bit more fun. Just like TBB, it can be used with iterators, where iterator chains are executed in parallel. It also supports join model and provides ways to convert recursive, divide-and-conquer style problems into parallel implementations.

Parallel Iterators

Rayon offers experimental API called “parallel iterators”. Example below demonstrates parallel sum of squares function:

use rayon::prelude::*;
fn sum_of_squares(input: &[i32]) -> i32 {
    input.par_iter()
         .map(|&i| i * i)
         .sum()
}

Using Join for Divide-and-Conquer Problems

Parallel iterators are implemented with a more primitive method called join which takes 2 closures and runs them in parallel. The code snippet below demonstrates increment algorithm:

fn increment_all(slice: &mut [i32]) {
    if slice.len() < 1000 {
        for p in slice { *p += 1; }
    } else {
        let mid_point = slice.len() / 2;
        let (left, right) = slice.split_at_mut(mid_point);
        rayon::join(|| increment_all(left), || increment_all(right));
    }
}

How it works

Rayon borrowed work stealing concepts from Cilk project. The library attempts to dynamically ascertain how much multi-threading resources are available. The idea is very simple: there is always a pool of worker threads available, waiting for some work to do. When you call join the first time, we shift over into that pool of threads.

Rust Join1.png

But if you call join(a, b) from a worker thread W, then W will place b into its work queue, advertising that this is work that other worker threads might help out with. W will then start executing a. While W is busy with a, other threads might come along and take b from its queue. That is called stealing b.

Rust Join 2.png

Once a is done, W checks whether b was stolen by another thread and, if not, executes b itself. If W runs out of jobs in its own queue, it will look through the other threads' queues and try to steal work from them.

Rust Join 3.png

Group Members

  1. Henrique Salvadori Coelho
  2. Olga Belavina

Sources

Progress

-5%