Changes

← Older edit

GPU621/Analyzing False Sharing

171 bytes removed, 15:14, 4 December 2022

→‎Use local variables for each thread

for (int i = 0; i < numbers.size(); i++) {

if (i % sizeOfSum == id) {

sum[id] += Ii;

}

In this program code, we can see that at the beginning of the main function we declare a vector numbers and initialize it from 0 to ~~100~~10000. We can also see in the main function that the code is divided into two main blocks. The first block is the Thread block, which is executed concurrently using multiple threads. While the second block is the Serial block, it is executed concurrently using normal serial logic.

The main purpose of the sumUp function is to calculate the sum of odd elements or even elements based on the data in the first vector argument in the argument list. Also, the sum will be recorded in the corresponding position of the second vector argument using the int argument as the index.

As we said above, the smallest unit of CPU operation on the cache is the size of a cache line, which is 64 bytes. As you can see in our program code, the sum is a vector that stores two data with long type. The two long data are in fact located on the same cache line. When two threads read and write sum[0] and sum[1] separately, it looks like they are not using the same variable, but in fact, they are affecting each other.For example, if a thread updates sum[0] in CPU0, this causes the cache line of sum[1] in CPU1 to be invalidated. This causes the program to take longer to run, as it needs to re-read the data.

We try to change the sizeOfSum to 8, which means that changing the size of the vector sum to 8 and the if condition in the sumUp function to i%8==id will give the program more data to process. [[File:sizeOfSum8.jpg|400px]] Nevertheless, the result still does not bring the time between the Serial block and Thread block closer and the Serial block still takes less time.

What would it take to show the advantages of multicore? As you can see in our program, multiple threads are not operating on the same variable, but only a single thread is reading and writing a lot of variables in the same cache line, so you can introduce a local variable.

When we ran the program again, we came to this conclusion:

We can see that the Local variable block is already much faster than the Thread block, and even comparable to the Serial block is almost the same. But overall the Serial block is still the least time-consuming because the Local variable block still needs to incur extra overhead for thread creation and scheduling.

We could try increasing sizeOfNumbers to 1000000 as well, which would allow the program to process more data, thus compensating for the extra overhead of thread creation and scheduling.

~~[[File:newBlockOutput1000000.jpg|400px]] ~~ [[File:newBlockOutput1000000(1new).jpg|400px]]

Now we can already see the advantage of multi-threading. Even when the vector numbers reach the size of 1000000, the Thread block even runs faster than the Serial block.

Ryan Leong

118

edits

Changes

GPU621/Analyzing False Sharing

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

get involved with CDOT

courses

course projects

links

Tools