An Introduction to Lock-Free Programming

I. Introduction

Modern computers, even small smartphones and tablets, are multi-core (multi-CPU) devices. Making full use of multi-core CPU resources to maximize the performance of a single machine has become a pain point and a difficulty in software development. On a multi-core server, using multiple processes or threads to handle tasks in parallel has become the standard approach to performance tuning. Multi-process (multi-thread) parallel programming, however, must face the problem of accessing shared data, and accessing shared data concurrently, efficiently and safely has become an important and difficult part of parallel programming.


Traditional approaches use synchronization primitives (critical sections, locks, condition variables, etc.) to access shared data safely. Synchronization, however, is the opposite of parallelism and easily becomes a bottleneck in parallel programs. On the one hand, some synchronization primitives are operating system kernel objects: calling them incurs the expensive cost of a context switch (user mode to kernel mode), and kernel objects are a relatively limited resource. On the other hand, synchronization eliminates parallelism: while one thread accesses the shared data, all other threads must wait in a queue. Synchronization also scales poorly; as the number of parallel threads grows it easily becomes one of the program's bottlenecks, and in some cases service throughput not only fails to grow linearly with the number of CPU cores or concurrent threads but actually declines.

When programming with multiple threads, we should follow these conventions as much as possible:

 

1. For CPU-intensive scenarios, use multiple cores whenever possible. (It is common to assume that the computation required for some feature is small and the CPU is fast enough, lazily write it single-threaded, and end up with very low efficiency.)

2. When using multiple threads, lock by default. Once locking guarantees business correctness, consider optimizing away the cost of mutual exclusion: mutex < read-write lock < spin lock < lock-free (atomic operations).

3. Reduce coupling between threads: shared variables between threads < thread-local variables < functional programming (no variables).

4. Minimize lock granularity: a. shrink the locked code section (hold the lock for less time); b. split one lock into several to reduce contention (use fine-grained locks, as InnoDB's row locks versus MyISAM's table locks illustrate); see the sketch below.
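As an illustration of point 4b, here is a minimal sketch (class and member names are hypothetical) of splitting one global lock into per-shard locks, so that threads touching different keys rarely contend:

#include <array>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch: a counter map sharded into 16 buckets, each with its
// own mutex, instead of one global mutex protecting the whole map.
class ShardedCounter {
public:
    void increment(const std::string& key) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> guard(s.mtx); // lock held only for the short update
        ++s.counts[key];
    }

private:
    struct Shard {
        std::mutex mtx;
        std::unordered_map<std::string, long> counts;
    };

    Shard& shard_for(const std::string& key) {
        return shards_[std::hash<std::string>{}(key) % shards_.size()];
    }

    std::array<Shard, 16> shards_;  // 16 locks instead of 1 reduces contention
};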

 

Lock-based thread synchronization commonly brings the following programming pain points:

1. Race conditions caused by forgetting to take a lock
2. Deadlocks caused by taking locks in the wrong order
3. Program crashes (corruption) caused by uncaught exceptions
4. Lost wakeups caused by wrongly ignoring a notification, so a thread never wakes up


As a result, people began to study data structures and algorithms for concurrent access to shared data, usually along the following lines:

1. Transactional memory
2. Fine-grained (locking) algorithms
3. Lock-free data structures

 

(1) Transactional memory. TM is a software technique that simplifies writing concurrent programs, borrowing a concept first established and developed in the database community. The basic idea is to declare a region of code as a transaction: the transaction executes and atomically commits all of its results to memory if it succeeds, or aborts and discards all of its results if it fails. However, built-in (hardware) transactional memory is not generally available, TM is difficult to integrate with existing code and requires fairly large changes to software, and software TM carries a very high performance overhead: a 2-10x slowdown is common. All of this has limited the widespread use of software TM.

(2) Fine-grained (locking) algorithms. These are based on alternative synchronization methods, usually on "lightweight" atomic primitives (such as spin locks) rather than on the expensive synchronization primitives provided by the operating system. Fine-grained (locking) algorithms are suitable whenever the lock hold time is shorter than the time needed to block and wake up a thread. Because the lock granularity is extremely small, data structures built on such primitives can be read in parallel, and even written concurrently. The Linux kernel before 4.4 used the fine-grained lock _spin_lock to access the shared listen socket safely. Under a relatively light load of concurrent connections its performance is comparable to a lock-free approach, but under a large number of concurrent connections the fine-grained (locking) algorithm becomes the bottleneck of the concurrent program.

(3) Lock-free data structures. To remove the performance bottleneck that fine-grained locks cannot avoid under high concurrency, shared data is placed in a lock-free data structure and accessed through atomic modifications.
At present, common lock-free data structures mainly include lock-free queues and lock-free containers (b+tree, list, hashmap, etc.).

 

II. Lock-Free Programming

Lock-free programming mainly involves the following aspects:

1. Atomicity, atomic primitives, atomic operations

2. CAS solution and ABA problem

3. seqlock (sequential lock)

4. Memory Barriers

 

2.1 Atomicity, atomic primitives, atomic operations

We know that wherever something is shared, synchronization is unavoidable; that is what concurrency is about. Safe access to shared resources without locks or synchronization primitives can only rely on atomic operations supported by the hardware. Without the guarantee of atomic operations, lock-free programming would be impossible.

Atomic operations can be simply divided into two parts:

1. Atomic read and write: atomic load (read), atomic store (write).
2. Atomic Exchange (Atomic Read-Modify-Write - RMW).

 

What is an atomic operation?

An atomic operation guarantees that an instruction executes atomically: its execution cannot be interrupted partway. Atomic operations are the basic premise of most lock-free programming.

An atomic operation must either take full effect or have no effect at all. Consequently, atomic operations are not cheap, low-cost instructions; on the contrary, they are relatively expensive, so in lock-free programming we should avoid overusing them. Under what circumstances must a shared variable be accessed with atomic operations? Is an ordinary read or assignment of a variable atomic?

The usual rule for when atomic operations are mandatory on a shared variable is: at any time, if two or more threads operate on the same shared variable concurrently and at least one of those operations is a write, then all threads must use atomic operations.

 

Basic principles of atomic operations

On the x86 platform the CPU provides a means of locking the bus while an instruction executes: the CPU chip has a #HLOCK pin. If the "LOCK" prefix is added before an instruction in an assembly program, the assembled machine code makes the CPU pull the #HLOCK pin low while that instruction executes and release it when the instruction finishes, thereby locking the bus. Other CPUs on the same bus temporarily cannot access memory through the bus, which guarantees the atomicity of that instruction in a multiprocessor environment.

LOCK is an instruction prefix; it means that the memory bus is locked while the following instruction executes. The bus lock prevents the other cores from accessing memory for a number of clock cycles. Although a bus lock affects the performance of the other cores, it is much lighter than an operating-system-level lock.

#LOCK locks the FSB (front-side bus), the bus between the processor and RAM; locking it prevents other processors or cores from fetching data from RAM.
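As a rough illustration (not how production code should be written; the builtins and atomic libraries described below are preferred), a fetch-and-add can be expressed on x86 with GCC inline assembly using the LOCK prefix:

/* Illustrative sketch only: an atomic fetch-and-add on x86 via the LOCK
 * prefix. The LOCK prefix makes the XADD instruction atomic with respect to
 * other processors; XADD leaves the old value of *addr in val. */
static inline int fetch_and_add_x86(volatile int *addr, int val)
{
    __asm__ __volatile__(
        "lock; xaddl %0, %1"
        : "+r"(val), "+m"(*addr)
        :
        : "memory");
    return val;   /* the value *addr held before the addition */
}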

 

Kernel level: the atomic_* family of operations provided by the operating system kernel

Declaration and definition:

void atomic_set(atomic_t *v, int i);
atomic_t v = ATOMIC_INIT(0);

Read and write operations:

int atomic_read(atomic_t *v);
void atomic_add(int i, atomic_t *v);
void atomic_sub(int i, atomic_t *v);

Plus one minus one:

void atomic_inc(atomic_t *v);
void atomic_dec(atomic_t *v);

Operate and test or return the result: the *_and_test variants return true (nonzero) if the value is 0 after the operation and 0 otherwise; atomic_add_negative returns true if the result is negative; the *_return variants return the new value after the operation:

int atomic_inc_and_test(atomic_t *v);
int atomic_dec_and_test(atomic_t *v);
int atomic_sub_and_test(int i, atomic_t *v);
int atomic_add_negative(int i, atomic_t *v);
int atomic_add_return(int i, atomic_t *v);
int atomic_sub_return(int i, atomic_t *v);
int atomic_inc_return(atomic_t *v);
int atomic_dec_return(atomic_t *v);
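As a sketch of typical usage (kernel-context code with hypothetical names, assuming <linux/atomic.h> and <linux/slab.h>), a simple reference count can be built on atomic_t:

#include <linux/atomic.h>
#include <linux/slab.h>

/* Hypothetical example: a reference count built on atomic_t.
 * atomic_dec_and_test() returns true only when the counter reaches 0,
 * so exactly one caller performs the final cleanup. */
struct my_obj {
    atomic_t refcnt;
    /* ... payload ... */
};

static void my_obj_get(struct my_obj *obj)
{
    atomic_inc(&obj->refcnt);
}

static void my_obj_put(struct my_obj *obj)
{
    if (atomic_dec_and_test(&obj->refcnt))
        kfree(obj);   /* last reference dropped */
}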

Compiler level: gcc's __sync_* built-in functions

gcc's __sync_* built-ins provide atomic arithmetic and logical operations. The __sync_fetch_and_* family contains twelve functions covering add/subtract/and/or/xor/nand. __sync_fetch_and_add, as the name implies, fetches first and then adds, returning the value before the addition. Taking count = 4 as an example, calling __sync_fetch_and_add(&count, 1) returns 4, and count then becomes 5. Alongside __sync_fetch_and_add there is naturally __sync_add_and_fetch, which adds first and then returns the new value; the relationship between the two is the same as that between i++ and ++i (see the short example after the list of prototypes below).

type can be an integer type of 1, 2, 4 or 8 bytes, i.e.:

int8_t / uint8_t
int16_t / uint16_t
int32_t / uint32_t
int64_t / uint64_t

type __sync_fetch_and_add (type *ptr, type value);
type __sync_fetch_and_sub (type *ptr, type value);
type __sync_fetch_and_or (type *ptr, type value);
type __sync_fetch_and_and (type *ptr, type value);
type __sync_fetch_and_xor (type *ptr, type value);
type __sync_fetch_and_nand(type *ptr, type value);
type __sync_add_and_fetch (type *ptr, type value);
type __sync_sub_and_fetch (type *ptr, type value);
type __sync_or_and_fetch (type *ptr, type value);
type __sync_and_and_fetch (type *ptr, type value);
type __sync_xor_and_fetch (type *ptr, type value);
type __sync_nand_and_fetch (type *ptr, type value);
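A quick check of the count = 4 example above (a small test program, compiled with gcc):

#include <stdio.h>

int main(void)
{
    int count = 4;
    int old = __sync_fetch_and_add(&count, 1);   /* returns 4, count becomes 5 */
    printf("fetch_and_add returned %d, count is now %d\n", old, count);

    count = 4;
    int now = __sync_add_and_fetch(&count, 1);   /* count becomes 5, returns 5 */
    printf("add_and_fetch returned %d, count is now %d\n", now, count);
    return 0;
}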

The following compares the performance of __sync_fetch_and_add against the pthread mutex synchronization interface on ordinary Linux.

Code example 1: using __sync_fetch_and_add to manipulate a global variable

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <stdint.h>

int count = 0;

void *test_func(void *arg)
{
	int i=0;
	for(i=0;i<2000000;++i)
	{
		__sync_fetch_and_add(&count,1);
	}
	return NULL;
}

int main(int argc, const char *argv[])
{
	pthread_t id[20];
	int i = 0;

	uint64_t usetime;
	struct timeval start;
	struct timeval end;
	
	gettimeofday(&start,NULL);
	
	for(i=0;i<20;++i)
	{
		pthread_create(&id[i],NULL,test_func,NULL);
	}

	for(i=0;i<20;++i)
	{
		pthread_join(id[i],NULL);
	}
	
	gettimeofday(&end,NULL);

	usetime = (end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec);
	printf("count = %d, usetime = %lu usecs\n", count, usetime);
	return 0;
}

Code example 2: using a mutex to manipulate a global variable

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include <stdint.h>

int count = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void *test_func(void *arg)
{
	int i=0;
	for(i=0;i<2000000;++i)
	{
		pthread_mutex_lock(&mutex);
		++count;
		pthread_mutex_unlock(&mutex);
	}
	return NULL;
}

int main(int argc, const char *argv[])
{
	pthread_t id[20];
	int i = 0;

	uint64_t usetime;
	struct timeval start;
	struct timeval end;
	
	gettimeofday(&start,NULL);
	
	for(i=0;i<20;++i)
	{
		pthread_create(&id[i],NULL,test_func,NULL);
	}

	for(i=0;i<20;++i)
	{
		pthread_join(id[i],NULL);
	}
	
	gettimeofday(&end,NULL);

	usetime = (end.tv_sec-start.tv_sec)*1000000+(end.tv_usec-start.tv_usec);
	printf("count = %d, usetime = %lu usecs\n", count, usetime);
	return 0;
}

Results:

[root@blake lock-free]#./atom_add_gcc_buildin

count = 40000000, usetime = 756694 usecs

[root@blake lock-free]# ./atom_add_mutex

count = 40000000, usetime = 3247131 usecs

From these numbers, the atomic-operation version is roughly 4-5 times faster than the mutex version, and as contention increases the gap widens further. Alexander Sandler measured atomic operations at roughly 6-7 times the performance of mutexes.

Interested readers can refer to: http://www.alexonlinux.com/multithreaded-simple-data-type-access-and-atomic-variables

 

At the language level: the C++ atomic library and volatile

In C/C++ every memory operation is assumed to be non-atomic, even an ordinary 32-bit integer assignment, unless the compiler or hardware vendor explicitly states otherwise. On all modern x86, x64, Itanium, SPARC, ARM and PowerPC processors, an ordinary 32-bit integer assignment is atomic as long as the memory address is aligned; this guarantee is made by the compiler and the processor on that specific platform. Since the C/C++ language standard itself does not guarantee that integer assignment is atomic, truly portable C and C++ code can only rely on the C++11 atomic library to guarantee that loads (reads) and stores (writes) of a variable are atomic.
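A minimal sketch of the portable alternative: with std::atomic, the load and store below are guaranteed atomic by the C++11 standard on every conforming platform, rather than by the alignment rules of one particular compiler and CPU.

#include <atomic>
#include <cstdint>

std::atomic<int32_t> shared_value{0};

void writer() {
    shared_value.store(42);        // atomic write (sequentially consistent by default)
}

int32_t reader() {
    return shared_value.load();    // atomic read
}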

 

A keyword that must be mentioned: volatile

From the above we know that on modern processors, reads and writes of an aligned integer type (integer or pointer) are atomic, and that for modern compilers a basic type qualified with volatile is correctly aligned and the compiler is restricted from optimizing accesses to it away. So by adding volatile to an int variable we can read and write that variable atomically.

volatile int i = 10;  // qualify the variable i with volatile
......                // something happened
int b = i;            // atomic read


Because volatile restricts compiler optimization to some extent, and because for the same variable we often need atomic reads and writes in some places but not in others, we would like the compiler to keep optimizing where it can. Without volatile the first requirement cannot be met; with volatile on the declaration the second is lost. What can we do? A small trick achieves both goals:

int i = 2;  // i itself is not declared volatile
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define READ_ONCE(x) ACCESS_ONCE(x)
#define WRITE_ONCE(x, val) ({ ACCESS_ONCE(x) = (val); })
int a = READ_ONCE(i);   // volatile (atomic) read of i
WRITE_ONCE(i, 2);       // volatile (atomic) write of i


From the above we know that on modern processors an int qualified with volatile can be read and written atomically, and compiler optimization of it is restricted so that the latest value is read from memory every time. Many people mistakenly believe that volatile guarantees atomicity or acts as a memory barrier. In fact volatile neither guarantees atomicity nor provides any memory barrier. In the example above, volatile only ensures that the int's address is aligned, and aligned integers can be read and written atomically on modern processors. In C++, volatile has the following characteristics:

1. "Volatility": at the assembly level, the statement following a volatile access does not reuse the register holding the volatile variable's value from the previous statement; it re-reads the value from memory.
2. "No optimization": volatile tells the compiler not to apply aggressive optimizations to the variable, or even eliminate it, guaranteeing that the accesses the programmer wrote in the code are actually performed.
3. "Ordering": the order between accesses to volatile variables is guaranteed; the compiler will not reorder them. Ordering with respect to non-volatile variables is not guaranteed and may be reordered by the compiler.

 

2.2 CAS solution and ABA problem

2.2.1 CAS solution

Lock-free algorithms are generally implemented with atomic read-modify-write primitives, and compare-and-swap (CAS) is considered the most basic atomic RMW operation. Its pseudocode is as follows:

bool CAS( int * pAddr, int nExpected, int nNew )
atomically {
    if ( *pAddr == nExpected ) {
         *pAddr = nNew ;
         return true ;
    }
    else
        return false ;
}


The CAS above returns a bool indicating whether the atomic exchange succeeded. In some scenarios, however, we would like a failed CAS to return the current value in the memory cell, so there is a variant called valued CAS. Its pseudocode is as follows:

int CAS( int * pAddr, int nExpected, int nNew )
atomically {
      if ( *pAddr == nExpected ) {
           *pAddr = nNew ;
           return nExpected ;
       }
       else
            return *pAddr;
}


CAS is the most basic RMW operation; all other RMW operations can be implemented on top of CAS, for example fetch-and-add (FAA). The pseudocode is as follows:

int FAA( int * pAddr, int nIncr )
{
     int ncur = *pAddr;
     do {} while ( !compare_exchange( pAddr, ncur, ncur + nIncr )); // on failure, compare_exchange writes the current value back into ncur
     return ncur ;
}


In the C++11 atomic lib, there are mainly the following RMW operations:

std::atomic<>::fetch_add()
std::atomic<>::fetch_sub()
std::atomic<>::fetch_and()
std::atomic<>::fetch_or()
std::atomic<>::fetch_xor()
std::atomic<>::exchange()
std::atomic<>::compare_exchange_strong()
std::atomic<>::compare_exchange_weak()


Among them, compare_exchange_weak() is the most basic CAS; using it we can implement every other RMW operation. The atomic RMW operations in the C++11 atomic library are somewhat limited and may not meet our actual needs, but we can build our own atomic RMW operations.

For example, suppose we need to atomically multiply a value in memory, i.e. an atomic fetch_multiply. The code is as follows:

uint32_t fetch_multiply(std::atomic<uint32_t>& shared, uint32_t multiplier)
{
    uint32_t oldValue = shared.load();
    while (!shared.compare_exchange_weak(oldValue, oldValue * multiplier))
    {
    }
    return oldValue;
}


The atomic RMW operations above can only atomically modify a single integer variable. What if we want to operate on two integer variables atomically at the same time? Since C++11's std::atomic<> is a template, we can wrap the two integers in a struct and modify the struct atomically, as sketched below.

The C++11 std::atomic<> template can wrap almost any type (built-in types such as int and bool, or user-defined types), but not every instantiation is lock-free. The C++11 standard library provides specializations of std::atomic for integral and pointer types, where "integral" means char, signed char, unsigned char, short, unsigned short, int, unsigned int, long, unsigned long, long long, unsigned long long, char16_t, char32_t and wchar_t. All of these specializations include an is_lock_free() member with which we can check whether the atomic type is lock-free.
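A sketch of the idea from the two paragraphs above: pack two 32-bit integers into one trivially copyable struct, update both with a single CAS, and check is_lock_free() (an 8-byte struct is usually, but not necessarily, lock-free on x86-64):

#include <atomic>
#include <cstdint>
#include <cstdio>

struct Pair {
    uint32_t a;
    uint32_t b;
};

std::atomic<Pair> shared_pair{Pair{0, 0}};

// Atomically add da to a and db to b in one CAS loop.
void add_to_both(uint32_t da, uint32_t db) {
    Pair oldv = shared_pair.load();
    Pair newv;
    do {
        newv = Pair{oldv.a + da, oldv.b + db};
    } while (!shared_pair.compare_exchange_weak(oldv, newv));
}

int main() {
    printf("lock-free: %d\n", (int)shared_pair.is_lock_free());
    add_to_both(1, 2);
    Pair p = shared_pair.load();
    printf("a=%u b=%u\n", p.a, p.b);
    return 0;
}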

Modern processor architectures implement CAS in one of two camps: (1) architectures that implement an atomic CAS primitive directly, such as x86, Intel Itanium and SPARC; this approach first appeared in the IBM System/370. (2) Architectures that implement an LL/SC pair (load-linked/store-conditional), such as PowerPC, MIPS, Alpha and ARM; this approach first appeared at DEC. An atomic CAS can be built from an LL/SC pair, although in some circumstances the pair itself is not atomic. Why use an LL/SC pair instead of implementing a CAS primitive directly? To explain the existence of LL/SC we have to talk about a thorny issue in lock-free programming: the ABA problem.

2.2.2 ABA problem

When deciding whether to modify a variable, CAS checks whether the current value equals the expected old value; if they are equal, it assumes the variable has not been modified by another thread and performs the update. However, "equal" does not really mean "unmodified": another thread may have changed the value from A to B and then back from B to A. This is the ABA problem. In many cases the ABA problem does not affect your business logic and can be ignored, but sometimes it cannot. The usual solution is to associate the variable with a version number that can only increase: the comparison then checks both the value and the version number. Java's AtomicStampedReference class does exactly this.
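A minimal C++ sketch of the version-counter idea (all names hypothetical): a 32-bit value and a 32-bit version are packed into one 64-bit word so that a single CAS covers both, and an A->B->A change of the value alone is still detected through the version.

#include <atomic>
#include <cstdint>

struct VersionedValue {
    uint32_t value;
    uint32_t version;   // incremented on every successful update
};

std::atomic<uint64_t> g_slot{0};

static uint64_t pack(VersionedValue v) {
    return (static_cast<uint64_t>(v.version) << 32) | v.value;
}
static VersionedValue unpack(uint64_t raw) {
    return VersionedValue{ static_cast<uint32_t>(raw), static_cast<uint32_t>(raw >> 32) };
}

// Succeeds only if both the value and the version are unchanged since we read them.
bool update_if_unchanged(uint32_t expectedValue, uint32_t newValue) {
    uint64_t raw = g_slot.load();
    for (;;) {
        VersionedValue cur = unpack(raw);
        if (cur.value != expectedValue)
            return false;                         // someone changed the value
        uint64_t desired = pack(VersionedValue{ newValue, cur.version + 1 });
        if (g_slot.compare_exchange_weak(raw, desired))
            return true;
        // CAS failed: raw now holds the latest packed word, loop and re-check
    }
}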

 

2.3 seqlock (sequential lock)

A seqlock is used where reads and writes can be distinguished, reads greatly outnumber writes, and write operations have higher priority than read operations.

The idea of the seqlock is to use an increasing integer as a sequence counter. When a write operation enters the critical section, sequence++ is performed; when it leaves the critical section, sequence++ is performed again. A write operation also needs to take a lock (such as a mutex), used only for write-write mutual exclusion, to ensure that at most one write operation is in progress at any time. When the sequence is odd, a write operation is in progress, and a read operation must wait until the sequence becomes even before entering the critical section. When a read operation enters the critical section, it records the current sequence value; when it leaves, it compares the recorded value with the current sequence. If they differ, a write occurred while the read was inside the critical section, the data read is invalid, and the read must go back and retry.

Writes on a seqlock must be mutually exclusive, but the intended use case is many reads and few writes, so the probability of write-write conflicts is very low and the write-write mutual exclusion costs essentially nothing. Read and write operations do not need to exclude each other. The seqlock's scenario is one where writes take priority over reads: a write is almost never blocked (except in the unlikely event of a write-write conflict) and only needs the extra sequence++, while a read is never blocked but must retry when a read-write conflict is detected.

A typical application of the seqlock is updating the system clock. A clock interrupt fires every millisecond, and the corresponding interrupt handler updates the clock (the write operation), while user programs call system calls such as gettimeofday to obtain the current time (the read operation). In this situation the seqlock prevents a flood of gettimeofday calls from blocking the interrupt handler (as a read-write lock would); the interrupt handler always takes precedence, and if a gettimeofday call conflicts with it, the user program simply retries.

The implementation of a seqlock is very simple. When a write operation enters the critical section:

 void write_seqlock(seqlock_t *sl)
 {
     spin_lock(&sl->lock);   // take the write-write mutual exclusion lock
     ++sl->sequence;         // sequence++
 }

When a write operation leaves the critical section:

 void write_sequnlock(seqlock_t *sl)
 {
     sl->sequence++;         // sequence++ again
     spin_unlock(&sl->lock); // release the write-write mutual exclusion lock
 }

When a read operation enters the critical section:

 unsigned read_seqbegin(const seqlock_t *sl)
 {
     unsigned ret;
     repeat:
         ret = sl->sequence;      // read the sequence value
         if (unlikely(ret & 1)) { // if the sequence is odd, spin and wait
             goto repeat;
         }
     return ret;
 }

When a read operation attempts to exit the critical section:

 int read_seqretry(const seqlock_t *sl, unsigned start)
 {
     return (sl->sequence != start); // has the sequence changed since we entered the critical section?
 }

The read operation will generally proceed like this:

 do {
     seq = read_seqbegin(&seq_lock);        // enter the critical section
     do_something();
 } while (read_seqretry(&seq_lock, seq));   // try to leave the critical section; retry on conflict
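For completeness, a portable userspace sketch of the same pattern (assuming a single writer; for simplicity every access uses std::atomic with the default sequentially consistent ordering, so it is correct but far less optimized than the kernel version above):

#include <atomic>

std::atomic<unsigned> sequence{0};
std::atomic<long> data_sec{0};     // the protected data: two fields of a time sample
std::atomic<long> data_usec{0};

void write_sample(long sec, long usec) {   // called only by the single writer
    sequence.fetch_add(1);                 // sequence becomes odd: write in progress
    data_sec.store(sec);
    data_usec.store(usec);
    sequence.fetch_add(1);                 // sequence becomes even: write finished
}

void read_sample(long& sec, long& usec) {
    unsigned before, after;
    do {
        do {
            before = sequence.load();
        } while (before & 1);              // a writer is active, spin until even
        sec = data_sec.load();
        usec = data_usec.load();
        after = sequence.load();
    } while (before != after);             // raced with a writer, retry
}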

2.4 Memory barriers

2.4.1 What are memory barriers?


A memory barrier, also called a memory fence or barrier instruction, is a class of synchronization instruction. It is a synchronization point in the CPU's or compiler's accesses to memory: all read and write operations before this point must complete before any operation after this point can begin. Most modern computers execute instructions out of order to improve performance, which makes memory barriers necessary. Semantically, all write operations before the barrier must be written to memory, and read operations after the barrier can observe the results of writes before the barrier. Therefore, for sensitive blocks of code, a memory barrier can be inserted after writes and before reads.


Normally we hope the code we write is "what you see is what you get": the program's behavior follows program order. Unfortunately, between our program logic ("what you see") and the final execution result ("what you get") stand:

The compiler and the CPU's execution of instructions

1. The compiler translates logic written for human thinking (program code) into assembly instructions that suit the CPU's execution rules. The compiler understands how the underlying CPU works, so it optimizes while translating (for example by reordering memory access instructions) to make the generated assembly run faster. The result of such optimization, however, may not match the programmer's original logic. A programmer must therefore be able to understand the compiler's behavior and guide its optimization with memory barriers embedded in the code (these are also called optimization barriers), so that the compiler produces code that is both efficient and logically correct.


2. The CPU's job is to fetch and execute instructions. For an in-order single-core CPU without caches, assembly instructions are fetched and executed strictly in order; what you see is what you get, and the logic of the assembly is executed faithfully by the CPU. However, as computer systems have grown more complex (multi-core, caches, superscalar, out-of-order execution), even assembly, the language closest to the processor, can no longer guarantee that the CPU's results match the written order, so the programmer must tell the CPU how to keep the logic correct.

In summary, a memory barrier is a way of ensuring the order of memory accesses, so that the hardware blocks in the system (each CPU, DMA controller, device, etc.) share a consistent view of memory.

 

From the introduction above, we know that the code we write may be reordered, within certain rules, as it interacts with memory. The memory access order is changed both by the compiler (at compile time) and by the CPU (at run time), and the purpose of both is to make the code run faster. Even reordering for performance has its limits (you cannot move the initialization of a pointer after the code that uses the pointer; nobody would dare write code otherwise). Compiler writers and CPU vendors abide by a basic principle of memory reordering, which can be summarized as: the observable behavior of a single-threaded program must not change - a single-threaded program always appears to satisfy program order (what you see is what you get).

Under this principle, programmers writing single-threaded code never need to care about memory reordering. In multi-threaded programming, because mutexes, semaphores and events are all designed to prevent memory reordering around their call points (various memory barriers are implicitly included), memory reordering also does not need to be considered. Only when lock-free techniques are used - memory shared between threads without any mutual exclusion - does the effect of memory reordering become visible, and then we have to consider adding the appropriate memory barriers in the appropriate places.


From the above we know that there are two kinds of memory barrier: the compiler's memory barrier and the processor's memory barrier. The compiler's memory barrier is easy to understand: it simply prevents the compiler from reordering code for optimization. But how does the processor's memory barrier prevent the CPU from reordering, and where does the CPU's memory reordering come from?
The problem with reordering is essentially that stale data is read, or that part of the new data is read together with part of the old data. Where does this inconsistency come from? By now one word should be on everyone's mind: cache, that is, the hierarchical cache architecture of modern processors.

 

To make up for the low speed of main memory, modern processors introduce caches to speed up the processor's access to code and data. The cache acts as a bridge between the core and memory and greatly improves program performance. Why does adding a small, fast cache to the processor help? Because programs exhibit two kinds of locality: temporal locality and spatial locality.

[1] Temporal locality: if a piece of data has been accessed, it is likely to be accessed again soon. A typical example is a loop: the loop body is executed repeatedly by the processor; once it is placed in the cache, only the first access has to go all the way to memory, and afterwards the core can fetch the code quickly from the cache.
[2] Spatial locality: if a piece of data has been accessed, data adjacent to it is likely to be accessed soon. A typical example is an array, whose elements are usually accessed by the program in the order they are laid out.

Modern processors generally have multiple cores. Each core executes different code and accesses different data concurrently. To isolate them from each other, each core has its own private caches (L1 and L2), with a trade-off between capacity and speed (the larger the capacity, the slower the access; speed: L1 > L2 > L3, capacity: L3 > L2 > L1), hence the hierarchical arrangement. A hierarchical cache inevitably raises the cache coherence problem, which is solved by the cache coherence protocol MESI. It is not described further here; readers can study it on their own.

2.4.2 How do memory barriers work?

The semantics of memory barriers differ across CPUs, so writing portable memory barrier code requires summarizing the memory barriers of all kinds of CPUs. Fortunately, whatever the CPU, the following rules are obeyed:

[1] From the point of view of a single CPU, its own memory accesses appear to follow program order.
[2] From the point of view of the shareability domain containing all CPUs, all CPUs' accesses to a shared variable follow a global store order.
[3] Memory barriers need to be used in pairs.
[4] Memory barrier operations are the cornerstone of building mutual exclusion primitives such as locks.

2.4.3 Types of memory barrier

Explicit memory barriers
The memory barrier smp_mb() is a general (full) memory barrier, but a full-featured memory barrier has a large performance impact; in some cases we can use weaker memory barriers instead.

Implicit memory barriers

Some operations imply memory barrier semantics. There are mainly two kinds: lock (acquire) operations and unlock (release) operations.

[1] LOCK operations - acquiring a lock
[2] UNLOCK operations - releasing a lock

(1) A lock operation is considered a half memory barrier: memory accesses before the lock operation may freely move past it and complete after the lock is taken, but the other direction is absolutely forbidden - memory accesses after the lock operation must complete after the lock operation.
(2) Likewise, an unlock operation is a half memory barrier: it guarantees that memory operations before the unlock complete before the unlock, i.e. operations before the unlock may never cross the unlock fence and execute after it. The other direction is allowed: memory operations after the unlock may complete before the unlock operation.

 2.4.4 C++11 memory order


To write correct lock-free multi-threaded programs we need to insert the appropriate memory barriers in the right places. However, barrier instructions differ greatly across CPU architectures, so writing portable C++ requires a language-level memory order specification, allowing the compiler to insert the right barrier instructions for each CPU architecture, or none where they are unnecessary.


With this memory order specification we can control the ordering of multi-threaded interactions through shared memory at the high-level-language level, on any multiprocessor, without worrying about the effect of the compiler or the CPU architecture on multi-threaded code.

C++11 provides 6 memory orders that can be applied to atomic variables:

[1] memory_order_relaxed
[2] memory_order_consume
[3] memory_order_acquire
[4] memory_order_release
[5] memory_order_acq_rel
[6] memory_order_seq_cst

The six memory orders above describe three memory models:

[1] sequential consistent(memory_order_seq_cst)
[2] relaxed(momory_order_relaxed)
[3] acquire release(memory_order_consume, memory_order_acquire, memory_order_release, memory_order_acq_rel)
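As a small illustration of the acquire-release model (names are hypothetical), the consumer that observes ready == true is guaranteed to also observe payload == 42, without writing any explicit barrier instructions:

#include <atomic>
#include <cassert>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                     // plain write
    ready.store(true, std::memory_order_release);     // publish: no earlier write may move below this
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {  // acquire pairs with the release store
        /* spin */
    }
    assert(payload == 42);                            // guaranteed by the acquire/release pairing
}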

 

Memory barriers are very important; choosing which barrier to use and when to use it is the crux. Memory barriers are not easy to understand, and using them correctly and efficiently is even harder. Normally it is not necessary to use memory barrier primitives directly; it is better to use mutual exclusion primitives such as locks (mutexes), which already imply the needed barriers. Locks have long been misunderstood as slow, and it is indeed common for locks to introduce huge performance bottlenecks. But that does not mean every lock is slow: with lightweight locks and controlled contention, locks still perform very well. Locks are not slow; lock contention is slow.

 

 2.5  References

http://www.wowotech.net/kernel_synchronization/Why-Memory-Barriers.html

http://www.wowotech.net/kernel_synchronization/why-memory-barrier-2.html

https://cloud.tencent.com/developer/article/1021128

https://kukuruku.co/post/lock-free-data-structures-basics-atomicity-and-atomic-primitives/

https://kukuruku.co/post/lock-free-data-structures-the-inside-memory-management-schemes/

http://chonghw.github.io/blog/2016/08/11/memoryreorder/

 
