Analysis of C++11 atomics and memory order

1. The problem of shared variables under multi-threading

In multi-threaded programming, it is often necessary to share variables between threads. However, operating on shared variables is a frequent source of inexplicable errors: unless access is diligently protected with locks, (seemingly) bizarre conditions tend to appear. For example, the following are two classic situations.

(a) The i++ problem

In multithreaded programming, the most frequently cited problem is the well-known i++ problem: multiple threads perform i++ on the same shared variable i. The reason this is a problem is that the i++ operation can be divided into three steps:
|step|operation|
|:--:|---------|
|1|i -> reg: read the value of i into a register|
|2|inc reg: increment the value of i in the register|
|3|reg -> i: write the register back to i in memory|

These three steps can be separated in time; i++ is not an atomic operation. That is to say, when multiple threads execute simultaneously, their steps may interleave, such as in the following situation:
|step| thread A | thread B |
|:--:|:--------:|:--------:|
|1|i -> reg||
|2|inc reg||
|3||i -> reg|
|4||inc reg|
|5|reg -> i||
|6||reg -> i|

Assuming i is 0 at the beginning: after step 4 has executed, both threads believe the value in their register is 1, and they write it back in steps 5 and 6 respectively. So after both threads finish, the value of i is 1, even though we executed i++ in two threads and expected i to be 2. The i++ case stands for the general class of problems in multithreaded programming where operations are not atomic, but for now we focus on operations on a single variable.

(b) Instruction rearrangement problem

Sometimes we use a variable as a flag and do something when the flag reaches a certain value. But there may still be unexpected pitfalls. For example, suppose two threads execute in the following order:

|step| thread A | thread B |
|:--:|:--------:|:--------:|
|1|a = 1||
|2|flag = true||
|3||if flag == true|
|4||assert(a == 1)|

When B sees that flag is true, it asserts that a is 1. That seems obviously correct, but must it be so? Not necessarily, because both the compiler and the CPU may reorder instructions (the compiler at various optimization levels, the CPU through out-of-order execution). The actual execution order may become this:
|step| thread A | thread B |
|:--:|:--------:|:--------:|
|1|flag = true||
|2||if flag == true|
|3||assert(a == 1)|
|4|a = 1||

Instructions that have no dependence on each other may have their execution order exchanged for higher execution efficiency. In the example above, within thread A, flag and a appear to have no dependence on each other, so their execution order seems not to matter. The problem is that thread B uses flag as the basis for deciding whether to read a, and A's instruction reordering may cause the assertion at step 3 to fail.

Solution

A relatively safe way is to lock accesses to the shared variable; a lock guarantees mutually exclusive access to the critical section. In the first scenario, if i++ is executed between locking and unlocking, only one thread at a time performs the i++ operation. In addition, the memory semantics of a lock ensure that the writes a thread makes before releasing the lock are visible to the thread that acquires the lock afterwards (that is, they have happens-before semantics), which avoids reading a stale value in the second scenario.
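For instance, a minimal sketch of the locked counter from the first scenario, using std::mutex and std::lock_guard from the standard library:

#include <mutex>

int i = 0;
std::mutex m;

void add()
{
    for (int j = 0; j < 100000; ++j) {
        std::lock_guard<std::mutex> guard(m); // locks here, unlocks at end of scope
        ++i;                                  // only one thread at a time runs this
    }
}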

So what if you feel that locking is too cumbersome and you don't want to use it? C++11 provides support for atomic variables and atomic operations.

2. The atomics of C++11

The C++11 standard library provides the template atomic<> in the <atomic> header to define atomic variables:

template< class T >
struct atomic;

It provides a series of member functions to perform atomic operations on the variable, such as the read operation load, the write operation store, and the CAS operations compare_exchange_weak/compare_exchange_strong. For most built-in types, C++11 provides specializations:

std::atomic_bool    std::atomic<bool>
std::atomic_char    std::atomic<char>
std::atomic_schar   std::atomic<signed char>
std::atomic_uchar   std::atomic<unsigned char>
std::atomic_short   std::atomic<short>
std::atomic_ushort  std::atomic<unsigned short>
std::atomic_int     std::atomic<int>
std::atomic_uint    std::atomic<unsigned int>
std::atomic_long    std::atomic<long>
······
// For more types, see: http://en.cppreference.com/w/cpp/atomic/atomic

In fact, these specializations are just aliases; essentially they are the same definition. The specializations for integral types have some extra member functions, such as atomic addition fetch_add, atomic subtraction fetch_sub, atomic AND fetch_and, atomic OR fetch_or, and so on. The common operators ++, --, +=, &= and so on also have corresponding overloaded versions.
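For example, a small sketch of these integer-specific operations (single-threaded here, just to show the API):

#include <atomic>
#include <iostream>

int main()
{
    std::atomic<int> x(0);
    int old = x.fetch_add(5); // atomically x += 5; returns the previous value (0)
    x.fetch_sub(2);           // atomically x -= 2; x is now 3
    x.fetch_or(4);            // atomically x |= 4; x is now 7
    ++x;                      // overloaded operator, equivalent to x.fetch_add(1) + 1
    std::cout << old << " " << x.load() << std::endl; // prints: 0 8
    return 0;
}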

Next, take the int type as an example to solve the i++ problem described earlier. First define an atomic variable of type int:

std::atomic<int> i;

Since the int specialization overloads the ++ operator, i++ on it is an indivisible atomic operation. We verify this by performing i++ from multiple threads. The test code is as follows:

#include <iostream>
#include <atomic>
#include <vector>
#include <functional>
#include <thread>

std::atomic<int> i;
const int count = 100000;
const int n = 10;

void add()
{
    for (int j = 0; j < count; ++j)
        i++;
}

int main()
{
    i.store(0);
    std::vector<std::thread> workers;
    std::cout << "start " << n << " workers, "
              << "every woker inc " << count  << " times" << std::endl;

    for (int j = 0; j < n; ++j)
        workers.emplace_back(add);

    for (auto & w : workers)
        w.join();

    std::cout << "workers end "
              << "finally i is " << i << std::endl;

    if (i == n * count)
        std::cout << "i++ test passed!" << std::endl;
    else
        std::cout << "i++ test failed!" << std::endl;

    return 0;
}

In the test we define an atomic variable i, initialize it to 0 at the beginning of main, and then start 10 threads, each of which performs the i++ operation 100,000 times; finally we check whether the value of i is correct. The result of the run is as follows:

start 10 workers, every worker inc 100000 times
workers end finally i is 1000000
i++ test passed!

As we can see above, with 10 threads performing a large number of increments concurrently, the value of i is still correct. If we change i to an ordinary int variable and run the program again, we may get a result like this:

start 10 workers, every worker inc 100000 times
workers end finally i is 445227
i++ test failed!

Obviously, due to the interleaved execution of the steps of the increment operation, we end up with an incorrect result.

Atomics can solve the i++ problem, but can they also solve the instruction reordering problem? They can, and it depends on the memory order chosen for the atomic operations. We will focus on that in the next section.

We have seen above that atomic is a template, which means we can turn custom types into atomic variables. But can any type be made atomic? Of course not: as described on cppreference, the type must be TriviallyCopyable. This link gives the detailed definition of TriviallyCopyable:

http://en.cppreference.com/w/cpp/concept/TriviallyCopyable

A relatively simple criterion is that objects of the type can be copied bitwise, e.g. with std::memcpy. Take the following class:

class A {
    int x;
    int y;
};

This class is TriviallyCopyable. However, if you add a virtual function to it:

class B {
    int x;
    int y;
    virtual int add()
    {
        return x + y;
    }
};

then the class can no longer be copied bitwise; it does not meet the condition and cannot be made atomic.
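A quick way to check is the std::is_trivially_copyable trait from <type_traits> (a small sketch; A and B are the two classes above, named here for illustration):

#include <type_traits>

class A { int x; int y; };
class B { int x; int y; virtual int add() { return x + y; } };

static_assert(std::is_trivially_copyable<A>::value, "A can be made atomic");
static_assert(!std::is_trivially_copyable<B>::value, "B cannot");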

If a type meets the requirements of the atomic template and can be made atomic, does that mean no lock is needed and it is therefore faster? Not necessarily. atomic has a member function is_lock_free, which tells us whether atomic variables of this type are implemented with lock-free atomic CPU instructions, or whether a lock is still used to achieve atomicity. Either way, atomic is used in the same way and presents the same semantics. The C++ standard places no restriction on which implementation is used (except that the std::atomic_flag specialization must be lock-free); it depends on the platform. For example, consider the following code in my Cygwin64, GCC 7.3 environment:

#include <iostream>
#include <atomic>

#define N 8

struct A {
    char a[N];
};

int main()
{
    std::atomic<A> a;
    std::cout << sizeof(A) << std::endl;
    std::cout << a.is_lock_free() << std::endl;
    return 0;
}

The result is:

8
1

This shows that the atomic variable of type A defined above is lock-free. I ran an experiment on this platform, modifying the value of N, with the following results:

| N | sizeof(A) | is_lock_free() |
|:--:|:--:|:--:|
|1|1|1|
|2|2|1|
|3|3|0|
|4|4|1|
|5|5|0|
|6|6|0|
|7|7|0|
|8|8|1|
|>8|/|0|
Changing A to built-in types, the experimental results are as follows:

| type | sizeof() | is_lock_free() |
|:--:|:--:|:--:|
|char|1|1|
|short|2|1|
|int|4|1|
|long long|8|1|
|float|4|1|
|double|8|1|

It can be seen that on my platform the commonly used built-in types are all lock-free, while for custom types it depends on their size.

It can also be seen from the statistics above that is_lock_free() is true only when the size of the custom type is 1, 2, 4, or 8 bytes. I speculate that the lock-free implementation of atomic here relies on the compiler's built-in atomic operations, and that lock-freedom can only be achieved when the data length is one the built-ins can handle. Check the prototypes of the built-in atomic operations in the GCC reference manual ( https://gcc.gnu.org/onlinedocs/gcc-7.3.0/gcc/_005f_005fatomic-Builtins.html#g_t_005f_005fatomic-Builtins ), taking the CAS operation as an example:

bool __atomic_compare_exchange_n (type *ptr, type *expected, type desired, bool weak, int success_memorder, int failure_memorder)

bool __atomic_compare_exchange (type *ptr, type *expected, type *desired, bool weak, int success_memorder, int failure_memorder)

The parameters ptr and expected are pointers of type type *; a description of what type may be can be found on the same page:

The ‘__atomic’ builtins can be used with any integral scalar or pointer type that is 1, 2, 4, or 8 bytes in length. 16-byte integral types are also allowed if ‘__int128’ (see __int128) is supported by the architecture.

The length of type must be 1, 2, 4, or 8 bytes; on the few platforms that support __int128 it can also be 16 bytes. So only data whose length is one of these values can be lock-free. This is just my speculation; I do not know whether it is actually the case.
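As a side note, a minimal sketch of calling the first builtin quoted above directly (GCC/Clang-specific; __ATOMIC_SEQ_CST is GCC's strongest memory-order constant):

#include <cstdio>

int main()
{
    int x = 0;
    int expected = 0;
    // Try to change x from 0 to 1: strong CAS (weak = false),
    // sequentially consistent on both success and failure.
    bool ok = __atomic_compare_exchange_n(&x, &expected, 1, false,
                                          __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    std::printf("%d %d\n", ok, x); // prints: 1 1
    return 0;
}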

3. The six memory orders of C++11

When we solved the i++ problem earlier, we used the atomic write operation store to assign a value to the atomic variable. In fact, this member function has another parameter:

void store( T desired, std::memory_order order = std::memory_order_seq_cst )

This parameter specifies the memory order used by the operation, which controls the cross-thread visibility order of variables. Not only store: the other member functions take this parameter too. C++11 provides six memory orders to choose from:

typedef enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
} memory_order;

Earlier, in scenario 2, an unexpected error was caused by instruction reordering. By using atomic variables and choosing an appropriate memory order, that problem can be solved. Let's look at these memory orders one by one.

memory_order_release/memory_order_acquire

These memory order options are passed as parameters to atomic member functions: memory_order_release is used for store operations, and memory_order_acquire is used for load operations. Here we call a call that uses memory_order_release a release operation, and a call that uses memory_order_acquire an acquire operation. Logically it can be understood like this: a release operation prevents the reads and writes before it from being reordered after it, and an acquire operation prevents the reads and writes after it from being reordered before it. That sounds a bit confusing, so let's explain it with an example. Suppose flag is an atomic<bool> and a is an ordinary int:

|step| thread A | thread B |
|:--:|:--------:|:--------:|
|1|a = 1||
|2|flag.store(true, memory_order_release)||
|3||if (true == flag.load(memory_order_acquire))|
|4||assert(a == 1)|

In this situation:

  • For the same atomic variable, the writes before the release operation must be visible to the reads after the subsequent acquire operation, so the assertion at step 4 will succeed.

These two memory orders need to be used in pairs, which is why they are introduced together. Another point to note is that this guarantee only holds for operations on the same atomic variable. For example, if step 3 read another atomic variable flag2 instead, there would be no guarantee that the value of a is 1.
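A minimal runnable sketch of this pattern (my own rendering of the table above):

#include <atomic>
#include <cassert>
#include <thread>

int a = 0;
std::atomic<bool> flag(false);

void writer()
{
    a = 1;                                        // step 1
    flag.store(true, std::memory_order_release);  // step 2
}

void reader()
{
    while (!flag.load(std::memory_order_acquire)) // step 3: wait until the release is seen
        ;
    assert(a == 1);                               // step 4: guaranteed to succeed
}

int main()
{
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
    return 0;
}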

memory_order_release/memory_order_consume

memory_order_release can also be used in conjunction with memory_order_consume. The role of the release operation is unchanged, while memory_order_consume is used for the load operation, which we call a consume operation for short. A consume operation prevents subsequent operations that depend on the atomic variable from being reordered before it. In this situation:

  • For the same atomic variable, the writes before the release operation must be visible to those operations after the subsequent consume operation that depend on the value of the atomic variable.

This combination is more relaxed than the previous one: consume only prevents operations that depend on the atomic variable from being reordered before it, rather than preventing all of them as acquire does. The previous example has to be modified to show this memory order, because the asserted data must now carry a dependency on the loaded value.
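Since a data dependency on the loaded value is required, the usual illustration uses an atomic pointer rather than a bool flag (a sketch of that common pattern, with ptr as an atomic<int*>; this is my substitute for the article's lost example):

#include <atomic>
#include <cassert>
#include <thread>

int a = 0;
std::atomic<int*> ptr(nullptr);

void writer()
{
    a = 1;
    ptr.store(&a, std::memory_order_release);     // publish the pointer to a
}

void reader()
{
    int * p;
    while (!(p = ptr.load(std::memory_order_consume)))
        ;
    assert(*p == 1); // *p depends on the loaded pointer: guaranteed to succeed
    // assert(a == 1) would NOT be guaranteed here, since a does not depend on p.
}

int main()
{
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
    return 0;
}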

memory_order_acq_rel

This option looks like a combination of release and acquire, and indeed it has the characteristics of both. It is used for "read-modify-write" operations that both read and modify the value, such as CAS operations. Its effect on memory order can be thought of as bundling a release operation and an acquire operation together, so that no read or write can be reordered across the call. Again an example: flag is an atomic<bool>, and a, b, c are ordinary variables:

|step| thread A | thread B |
|:--:|:---------|:---------|
|1|a = 1||
|2|flag.store(true, memory_order_release)||
|3||b = true|
|4||c = 2|
|5||while (!flag.compare_exchange_weak(b, false, memory_order_acq_rel)) { b = true; }|
|6||assert(a == 1)|
|7|if (true == flag.load(memory_order_acquire))||
|8|assert(c == 2)||

Since memory_order_acq_rel has the effect of both memory_order_release and memory_order_acquire at the same time, step 2 can pair with step 5 to form the release/acquire combination described above, so the assertion at step 6 will succeed; likewise, step 5 can pair with step 7 to form another release/acquire combination, so the assertion at step 8 will also succeed.

memory_order_seq_cst

This memory order is the default for every member function: if no memory order is specified, memory_order_seq_cst is used. It is a "beautiful" option: if all operations on the atomic variables use the memory_order_seq_cst memory order, the multi-threaded behavior is equivalent to all of those operations being performed in some single total order, and every thread observes the operations on the atomics in that same order. In addition, any write operation using this option is equivalent to a release operation, any read operation is equivalent to an acquire operation, and any "read-modify-write" operation is equivalent to an operation using memory_order_acq_rel.
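A classic illustration of this total order (adapted from the well-known example on the cppreference memory_order page; all operations below use the default memory_order_seq_cst):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x(false), y(false);
std::atomic<int> z(0);

void write_x() { x.store(true); }
void write_y() { y.store(true); }

void read_x_then_y()
{
    while (!x.load())
        ;
    if (y.load()) ++z;
}

void read_y_then_x()
{
    while (!y.load())
        ;
    if (x.load()) ++z;
}

int main()
{
    std::thread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    assert(z.load() != 0); // never fires: all threads agree on one order of the stores
    return 0;
}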

memory_order_relaxed

This option is as loose as its name suggests: it only guarantees that the operation itself is atomic and indivisible, and makes no ordering guarantees at all.
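A typical legitimate use is a simple event counter, where only atomicity matters and ordering does not (a minimal sketch):

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> hits(0);

void work()
{
    for (int j = 0; j < 100000; ++j)
        hits.fetch_add(1, std::memory_order_relaxed); // atomic, but imposes no ordering
}

int main()
{
    std::vector<std::thread> workers;
    for (int j = 0; j < 10; ++j)
        workers.emplace_back(work);
    for (auto & w : workers)
        w.join();
    std::cout << hits << std::endl; // always 1000000
    return 0;
}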

Cost

In general, the stricter the memory order, the greater the performance overhead. For the x86 processors we commonly use, release/acquire semantics are already guaranteed at the processor level, so release and acquire/consume mainly restrict compiler optimization, while memory_order_seq_cst additionally restricts instruction reordering by the processor.

Thoughts

In the process of reading up on atomics and memory order, I came to deeply appreciate how esoteric multithreading and concurrency are. Although I try to understand the various memory orders out of curiosity, in practice I should choose the safer way to write code: use a lock wherever a lock will do, and otherwise use atomics with the default memory_order_seq_cst option. After all, ordinary programmers rarely encounter scenarios where performance must be squeezed that hard; when we think we do, it is mostly because our design is unscientific. If you really run into such a scenario, do the research carefully before acting, choose a simple way to use these tools, and know exactly what you are doing.
