Linux kernel synchronization

Part One: Introduction to synchronization

1. Critical regions and race conditions
A critical region is a code segment that accesses and manipulates shared data. To avoid concurrent access within a critical region, the programmer must ensure that the code executes atomically, that is, it cannot be interrupted before it completes, as if the entire critical region were a single indivisible instruction. If two threads of execution could run in the same critical region at the same time, the program contains a bug; when this happens we call it a race condition. Avoiding concurrency and preventing race conditions is called synchronization.

In Linux, race conditions mainly arise in the following situations:

(1) Multiple CPUs in a symmetric multiprocessing (SMP) system

SMP is characterized by multiple CPUs sharing a common system bus, through which they access common peripherals and memory.

(2) A process on a single CPU and the process that preempts it

(3) Interrupts (hard interrupts, softirqs, tasklets, bottom halves) and processes

As long as multiple concurrent execution units have access to shared resources, race conditions may occur.

If an interrupt handler accesses a resource that a process is also accessing, a race condition can likewise occur.

Interrupts themselves may also run concurrently and cause race conditions (an interrupt can be interrupted by a higher-priority interrupt).
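
As an illustration, here is a minimal userspace sketch of a race condition (POSIX threads rather than kernel code; the counter variable and worker() function are made up for the example). The increment counter++ compiles to a load, an add, and a store, so two threads can interleave and lose updates:

/* Compile with: gcc race.c -pthread */
#include <pthread.h>
#include <stdio.h>

static long counter; /* shared data: the critical region touches this */

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++)
        counter++; /* NOT atomic: load, add, store can interleave */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected 2000000, but usually prints less due to lost updates. */
    printf("counter = %ld\n", counter);
    return 0;
}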

2. Deadlock
Deadlock requires certain conditions: there are one or more threads of execution and one or more resources, and each thread is waiting for one of the resources, but all the resources are already held. All the threads wait for each other, and none will ever release the resources it already holds, so no thread can make progress. That is a deadlock.

The simplest deadlock example is self-deadlock:

  • Acquire a lock
  • Try to acquire the same lock again
  • Wait forever for the lock to become available

In this situation, a thread holding a lock ends up waiting for itself, typically because one function calls another that acquires the same lock; broadly speaking, it is a form of nested locking. My earlier write-up "Summary of experience stepping on pits (4): Deadlock" describes exactly this case.

The most common example of deadlock is ABBA lock:

  • Thread 1
  • Acquires lock A
  • Tries to acquire lock B
  • Waits for lock B
  • Thread 2
  • Acquires lock B
  • Tries to acquire lock A
  • Waits for lock A

This kind of problem is indeed very common.
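
As a sketch, the ABBA pattern looks like this in userspace C (POSIX threads; the lock_a/lock_b names and the sleep() calls that widen the race window are made up for the demonstration):

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread1(void *arg)
{
    pthread_mutex_lock(&lock_a);  /* acquire A */
    sleep(1);                     /* widen the window so the deadlock is reliable */
    pthread_mutex_lock(&lock_b);  /* blocks: thread 2 holds B */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

static void *thread2(void *arg)
{
    pthread_mutex_lock(&lock_b);  /* acquire B */
    sleep(1);
    pthread_mutex_lock(&lock_a);  /* blocks: thread 1 holds A -> deadlock */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL); /* never returns: both threads are deadlocked */
    pthread_join(t2, NULL);
    return 0;
}

Acquiring the locks in one agreed order, always A before B, in both threads eliminates the deadlock; that is exactly rule (1) below.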

3. Locking rules
Preventing deadlock is very important, so what should we pay attention to?

(1) Lock in order. Nested locks must be acquired in the same order every time; this prevents the deadly-embrace type of deadlock, namely the ABBA lock. It is best to document the lock order and always follow it.

(2) Prevent starvation. Especially in large loops, try to move the lock inside the loop; otherwise threads outside wait too long. If an endless loop occurs while a lock is held, starvation follows.

(3) Do not acquire the same lock twice. This guards against the self-deadlock case. When it does happen it is usually not obvious: the nesting is indirect, with the lock re-acquired only after several levels of calls, what might be called roundabout nesting.

(4) Keep the design simple. The more complex the locking scheme, the more likely a deadlock becomes.

Every item here matters and applies to application code as well. Let me focus on design.

Locking should be considered at the very beginning of the design; the later you think about it, the higher the cost and the less satisfactory the result. So at the design stage, decide why you are locking and which data needs protection. I see this as a positioning problem: just as a product is positioned in the requirements phase, the data is positioned in the design phase, and this determines the scheme adopted, the algorithm chosen, the structures used, and the whole series of decisions that follow. That is experience talking :).

So how should you lock in the end? Remember: lock data, not code. I consider this a golden rule, and I emphasized it in the deadlock write-up mentioned above.

4. Contention and scalability
Lock contention, contention for short, means that other threads try to acquire a lock while it is held.

To say that a lock is highly contended means that multiple threads are waiting to acquire it.

Since the purpose of a lock is to make the program access resources serially, using locks inevitably reduces system performance. A highly contended lock (held frequently, held for a long time, or, worse, both) becomes a system bottleneck and severely degrades performance.

Scalability is a measure of how well a system can be expanded.

For operating systems, when we talk about scalability we associate it with large numbers of processes, large numbers of processors, or large amounts of memory. In fact, the scalability of any measurable computer component can be discussed. Ideally, doubling the number of processors would double the system's processing power; in reality, this is never achieved.

Since multiprocessing support was introduced in the 2.0 kernel, Linux's scalability on multiprocessor machines has improved greatly. When Linux first gained multiprocessor support, only one task could execute in the kernel at a time; in 2.2, as the locking mechanism evolved toward fine-grained locking, this restriction was removed. In 2.4 and later versions, kernel locking became finer and finer grained. Now, in the 2.6 kernel, kernel locks are very fine-grained and scalability is very good.

Lock granularity describes the amount of data a lock protects.

An overly coarse lock protects a large amount of data, for example all the data structures used by a subsystem; conversely, an overly fine lock protects a very small amount of data, for example a single element of a large data structure. In practice, the scope of most locks falls between these two extremes, protecting neither a whole subsystem nor a single element, but perhaps one data structure. Many locking schemes start out coarse and evolve toward finer granularity once lock contention becomes serious.

The run queue discussed earlier is an example of a lock evolving from coarse to fine.

In kernels 2.4 and earlier, the scheduler had a single global run queue (recall that the run queue is the linked list of runnable processes). In the early versions of the 2.6 series, the O(1) scheduler gave each processor its own run queue, each with its own lock, so the locking was refined from one global lock to one lock per processor. This was an important optimization, because on large machines the run-queue lock was heavily contended; essentially, the entire scheduling pass could run on only one processor at a time. Later in the 2.6 series, the CFS scheduler further improved lock scalability.

Generally speaking, improving scalability is a good thing, because it can improve the performance of Linux on larger, more powerful systems.

However, blindly "improving" scalability can reduce Linux's performance on small SMP and UP machines, because small machines may not need such fine locks; there, overly fine locking only adds complexity and overhead.

Consider a linked list. An initial locking scheme might protect the whole list with one lock. Later, on machines with many processors, when every processor needs to access the list frequently, that single lock turns out to be a scalability bottleneck. To remove the bottleneck, we change from locking the entire list to giving each node in the list its own lock, so that to read or write a node you must first acquire that node's lock. With the lock granularity narrowed, contention now occurs only when multiple processors access the same node. But contention is still not completely avoided; so could we give each element within each node its own lock? (The answer is: no.) Strictly speaking, even if such fine locks performed well on a large SMP machine, what about on a two-processor machine? If lock contention is not noticeable there, the extra locks only add overhead and cause a lot of waste.

In any case, scalability is very important and requires careful consideration. The key is to take scalability into account when first designing the locking, because locking an important resource too coarsely easily becomes a system bottleneck even on small machines. The line between too coarse and too fine is often thin: when lock contention is serious, locks that are too coarse hurt scalability; when contention is not noticeable, locks that are too fine add overhead and waste. Both degrade system performance. But remember: start with a simple locking design, and refine it only when needed. The essence is to strive for simplicity.
The paragraph above comes from the book, and the analysis is very good: it explains the harm of lock granularity that is too coarse or too fine, and traces how kernel locking has changed and evolved. In short, it is a useful reference for our own software designs, and it is also the reason the kernel offers the multiple synchronization methods discussed next.

Part Two: Synchronization methods

1. Atomic operations
Atomic operations are the cornerstone of the other synchronization methods. An atomic operation is guaranteed to execute atomically: its execution cannot be interrupted partway through. This is the same basic requirement as atomicity in database transactions.

The Linux kernel provides two sets of atomic operation interfaces: one operates on integers, the other on individual bits.

Atomic integer operations

Atomic integer operations work only on data of type atomic_t. A special type is introduced here, instead of C's int, for two main reasons:

First, having the atomic functions accept only atomic_t operands ensures that the atomic operations are used only with this special type, and equally that data of this type is never passed to any non-atomic function.

Second, the atomic_t type ensures that the compiler does not optimize access to the value, so the atomic operation receives the correct memory address rather than an alias. Finally, atomic_t hides the differences between architecture-specific implementations.

The atomic_t type is defined in the file <linux/types.h>:

typedef struct {
    volatile int counter;
} atomic_t;

The most common use of atomic integer operations is to implement counters.
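
For example, here is a minimal sketch of a reference counter built on the atomic integer interfaces (ATOMIC_INIT(), atomic_inc() and atomic_dec_and_test() are the standard kernel operations; the my_get()/my_put() wrappers are hypothetical):

#include <asm/atomic.h> /* atomic_t operations (newer kernels: <linux/atomic.h>) */

static atomic_t refcount = ATOMIC_INIT(1); /* object starts with one reference */

/* Take a reference: no lock needed, the increment itself is atomic. */
static void my_get(void)
{
    atomic_inc(&refcount);
}

/* Drop a reference; returns nonzero when the last reference is gone.
 * atomic_dec_and_test() decrements and tests for zero in one atomic step,
 * so two CPUs can never both observe "now zero" for the same object. */
static int my_put(void)
{
    return atomic_dec_and_test(&refcount);
}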

One more point needs explaining: atomic operations only guarantee that each operation is atomic, i.e., it either completes entirely or not at all; it can never stop halfway. They do not guarantee ordering, that is, they cannot guarantee that two operations complete in a particular order. If you need to enforce an order between atomic operations, use memory barrier instructions.

Compared with more complex synchronization methods, atomic operations impose less overhead on the system and have less impact on cache lines.

Atomic bit operations

The functions that operate on individual bits work on ordinary memory addresses; their parameters are a pointer and a bit number.
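
A small sketch using the standard bit operations (set_bit(), clear_bit(), change_bit() and test_and_set_bit() are the kernel interfaces; the flags_word variable is made up):

#include <asm/bitops.h> /* atomic bit operations */

static unsigned long flags_word; /* an ordinary word, no special type required */

static void bit_demo(void)
{
    set_bit(0, &flags_word);    /* atomically set bit 0 */
    set_bit(6, &flags_word);    /* atomically set bit 6 */
    clear_bit(6, &flags_word);  /* atomically clear bit 6 */
    change_bit(0, &flags_word); /* atomically flip bit 0 */

    /* Atomically set bit 1 and return its previous value: */
    if (test_and_set_bit(1, &flags_word)) {
        /* bit 1 was already set before the call */
    }
}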

2. Spin locks
The most common lock in the Linux kernel is the spin lock. A spin lock can be held by at most one thread of execution. If a thread tries to acquire a contended (already held) spin lock, it busy-loops, spinning, waiting for the lock to become available again. If the lock is not contended, the requesting thread acquires it immediately and continues. At any moment, a spin lock prevents more than one thread of execution from entering the critical region. The same lock can be used in multiple places, for example to protect and synchronize all access to a given piece of data.

A contended spin lock causes the requesting thread to spin while waiting for the lock to become available again, which wastes processor time; this busy waiting is the essential point of spin locks. So a spin lock should not be held for long. Indeed, that is the intent behind spin locks: a lightweight lock held for a short time.

The implementation of spin locks is closely tied to the architecture, and the code is often written in assembly. The interfaces actually used are defined in the file <linux/spinlock.h>. The basic usage of a spin lock is as follows:

DEFINE_SPINLOCK(mr_lock);

spin_lock(&mr_lock);
/* critical region ... */
spin_unlock(&mr_lock);

Spin locks can be used in interrupt handlers (semaphores cannot be used there because they sleep). When using a spin lock in an interrupt handler, you must disable local interrupts (interrupt requests on the current processor) before acquiring the lock; otherwise an interrupt could preempt kernel code that already holds the lock and then spin on the same lock forever. Note that only interrupts on the current processor need to be disabled: if an interrupt occurs on a different processor, even if its handler spins on the same lock, it does not prevent the lock holder (on another processor) from eventually releasing it.
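
A minimal sketch of the interrupt-safe variant (spin_lock_irqsave() and spin_unlock_irqrestore() are the standard interfaces; mr_lock and touch_shared_data() are sample names):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(mr_lock);

static void touch_shared_data(void)
{
    unsigned long flags;

    /* Saves the current interrupt state and disables local interrupts
     * before taking the lock; correct whether or not interrupts were on. */
    spin_lock_irqsave(&mr_lock, flags);
    /* ... critical region, safe against local interrupt handlers ... */
    spin_unlock_irqrestore(&mr_lock, flags);
}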

3. Reader-writer spin locks
Sometimes lock usage divides clearly into two kinds of access, reading and writing, which can then be treated separately: data can be shared while reading but must be exclusive while writing. For this, the Linux kernel provides a special reader-writer spin lock.

This reader-writer spin lock provides separate locking for reads and writes, and therefore has the following properties:

Read locks are shared: once a thread holds the read lock, other threads can also take the lock for reading.
Write locks are exclusive: once a thread holds the write lock, no other thread can take the lock for reading or writing.
Read and write locks exclude each other: once a thread holds the read lock, no other thread can take the write lock; the writer must wait for all readers to release the lock.
The usage of a reader-writer spin lock is similar to that of an ordinary spin lock:

DEFINE_RWLOCK(mr_rwlock);

read_lock(&mr_rwlock);
/* critical region (readers only) */
read_unlock(&mr_rwlock);

write_lock(&mr_rwlock);
/* critical region (writer only) */
write_unlock(&mr_rwlock);

Note: if reading and writing cannot be clearly separated, an ordinary spin lock is sufficient; a reader-writer spin lock is not needed.

4. Semaphores
A semaphore is also a kind of lock. Unlike with a spin lock, a thread that cannot obtain the semaphore does not loop trying to acquire it; instead, it goes to sleep. When the semaphore is released, a sleeping thread is woken up and enters the critical region.

Because threads sleep while waiting for a semaphore, the wait consumes no CPU time. Semaphores are therefore suitable for critical regions where waits can be long.

Where a semaphore does cost CPU time is in putting the thread to sleep and waking it up: two noticeable context switches.

If the CPU time of (sleeping the thread + waking the thread) > the CPU time the thread would spend spinning, consider using a spin lock instead.

There are two types of semaphores: binary semaphores and counting semaphores; the binary semaphore is the more commonly used.

A binary semaphore has only two values, 0 and 1. When the semaphore is 1 the critical region is available; when it is 0 the critical region is not accessible. It is therefore also called a mutual-exclusion semaphore.

A counting semaphore has a count value; a count of 5, for example, means that 5 threads can be in the critical region at the same time. A binary semaphore is simply a counting semaphore whose count equals 1.
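
A minimal sketch of semaphore usage (struct semaphore, sema_init(), down_interruptible() and up() are the standard kernel interfaces from <linux/semaphore.h>; mr_sem, my_setup() and my_work() are sample names):

#include <linux/semaphore.h>
#include <linux/errno.h>

static struct semaphore mr_sem;

static void my_setup(void)
{
    sema_init(&mr_sem, 1); /* count of 1: behaves as a binary semaphore */
}

static int my_work(void)
{
    /* Sleeps until the semaphore is available; a signal interrupts the
     * wait, in which case -EINTR is returned without acquiring it. */
    if (down_interruptible(&mr_sem))
        return -EINTR;

    /* ... critical region (sleeping is allowed here) ... */

    up(&mr_sem); /* release the semaphore, waking one sleeper if any */
    return 0;
}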

5. Reader-writer semaphores
Reader-writer semaphores relate to semaphores the way reader-writer spin locks relate to spin locks.

All reader-writer semaphores are binary semaphores, that is, the maximum count value is 1. When readers take the lock the counter is unchanged; when a writer takes it, the counter decreases by one.

In other words, a critical region protected by a reader-writer semaphore can have at most one writer, but any number of readers.
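
A minimal sketch (DECLARE_RWSEM(), down_read()/up_read() and down_write()/up_write() are the standard interfaces from <linux/rwsem.h>; the reader()/writer() names are made up):

#include <linux/rwsem.h>

static DECLARE_RWSEM(mr_rwsem);

static void reader(void)
{
    down_read(&mr_rwsem);  /* may sleep; shared with other readers */
    /* ... read-only critical region ... */
    up_read(&mr_rwsem);
}

static void writer(void)
{
    down_write(&mr_rwsem); /* may sleep; exclusive access */
    /* ... read-write critical region ... */
    up_write(&mr_rwsem);
}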

6. Mutexes
The mutex is also a sleeping lock, equivalent to a binary semaphore, but it provides a simpler API and has stricter usage rules, as follows:

The count of a mutex can only be 1, meaning at most one thread may hold it at a time.

Whoever locks it must unlock it, in the same context.

Recursive locking and unlocking are not allowed.

A process cannot exit while holding a mutex.

A mutex cannot be used in interrupt context or in the bottom half; that is, a mutex can only be used in process context.

A mutex may only be managed through the official API; you cannot write your own code to operate on it.

When choosing between a mutex and a semaphore: whenever the usage fits the constraints of a mutex, prefer the mutex.
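
A minimal sketch (DEFINE_MUTEX(), mutex_lock() and mutex_unlock() are the standard interfaces from <linux/mutex.h>; mr_mutex and my_update() are sample names):

#include <linux/mutex.h>

static DEFINE_MUTEX(mr_mutex);

static void my_update(void)
{
    mutex_lock(&mr_mutex);   /* sleeps if contended; process context only */
    /* ... critical region ... */
    mutex_unlock(&mr_mutex); /* must be called by the task that locked it */
}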

When choosing between a mutex and a spin lock, see the following table:

  • Low overhead locking: a spin lock is preferred
  • Short lock hold time: a spin lock is preferred
  • Long lock hold time: a mutex is preferred
  • Need to lock from interrupt context: a spin lock is required
  • Need to sleep while holding the lock: a mutex is required

7. Completion variables
The completion variable mechanism is similar to a semaphore. For example, after thread A enters the critical region, another thread B waits on the completion variable; when thread A finishes its task and leaves the critical region, it uses the completion variable to wake up thread B.

Generally, when two tasks need simple synchronization, consider using a completion variable.
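
A minimal sketch (DECLARE_COMPLETION(), wait_for_completion() and complete() are the standard interfaces from <linux/completion.h>; the waiter()/worker() names are made up):

#include <linux/completion.h>

static DECLARE_COMPLETION(mr_comp);

/* Thread B: sleeps until the work is signalled as complete. */
static void waiter(void)
{
    wait_for_completion(&mr_comp);
    /* ... continue, now that A has finished ... */
}

/* Thread A: finishes its task, then wakes up the waiters. */
static void worker(void)
{
    /* ... do the task ... */
    complete(&mr_comp);
}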

8. The Big Kernel Lock
The Big Kernel Lock is no longer in use and survives only in some legacy code.

9. Sequential locks
The sequential lock (seqlock) provides a simple mechanism for reading and writing shared data. With the reader-writer spin locks and reader-writer semaphores above, once the read lock is taken the write lock cannot be; the writer must wait for all read locks to be released before operating on the critical region.

Sequential locks are different: the write lock can still be acquired while readers are in the critical region. A read under a sequential lock checks the lock's sequence value before and after the read; if the two values do not match, a write occurred during the read, and the read is retried until the values before and after the read are the same.

The sequential lock thus favors the writer, guaranteeing the write lock's availability, so it suits scenarios with many readers and few writers, where writing takes precedence over reading.
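
A minimal sketch (DEFINE_SEQLOCK(), write_seqlock()/write_sequnlock() and read_seqbegin()/read_seqretry() are the standard interfaces from <linux/seqlock.h>; shared_value is a made-up variable). This is the same pattern the kernel itself uses for, e.g., the 64-bit jiffies count:

#include <linux/seqlock.h>

static DEFINE_SEQLOCK(mr_seq_lock);
static unsigned long shared_value;

/* Writer: takes the lock and bumps the sequence count on entry and exit. */
static void write_value(unsigned long v)
{
    write_seqlock(&mr_seq_lock);
    shared_value = v;
    write_sequnlock(&mr_seq_lock);
}

/* Reader: never blocks the writer; retries if a write overlapped the read. */
static unsigned long read_value(void)
{
    unsigned long v;
    unsigned int seq;

    do {
        seq = read_seqbegin(&mr_seq_lock);
        v = shared_value;
    } while (read_seqretry(&mr_seq_lock, seq));

    return v;
}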

10. Disabling preemption
In fact, using a spin lock already disables kernel preemption, but sometimes all you need is to disable kernel preemption, without taking an actual lock (let alone masking interrupts) the way the spin lock variants do.

In that case, use the kernel's preemption-disabling interfaces:

  • preempt_disable(): disables kernel preemption by incrementing the preemption counter
  • preempt_enable(): decrements the preemption counter, and checks and services any pending reschedules if the count is now zero
  • preempt_enable_no_resched(): enables kernel preemption but does not check for pending reschedules
  • preempt_count(): returns the current preemption count

preempt_disable() and preempt_enable() can be nested; in the end the number of disable calls must equal the number of enable calls.
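
A minimal sketch of the typical use, protecting per-processor data (preempt_disable()/preempt_enable() are the standard interfaces; touch_per_cpu_data() is a sample name):

#include <linux/preempt.h>

static void touch_per_cpu_data(void)
{
    preempt_disable(); /* increments the preemption count; may nest */

    /* ... safe to manipulate this processor's per-CPU data here: the
     * task cannot be preempted and migrated to another processor ... */

    preempt_enable();  /* decrements the count; a pending reschedule
                        * is serviced once the count reaches zero */
}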

11. Ordering and barriers
For a given piece of code, the compiler and the processor may reorder operations during compilation and execution as an optimization, so that the actual execution order differs somewhat from the order written in the source.

Under normal circumstances this is not a problem, but under concurrency the values observed may not match expectations.

In some concurrent situations, to guarantee the execution order of code, a family of barrier methods is provided to suppress these compiler and processor optimizations:

  • rmb(): prevents loads from being reordered across the barrier
  • wmb(): prevents stores from being reordered across the barrier
  • mb(): prevents loads and stores from being reordered across the barrier
  • read_barrier_depends(): prevents data-dependent loads from being reordered across the barrier
  • barrier(): prevents the compiler (but not the processor) from reordering loads or stores across the barrier
  • smp_rmb(), smp_wmb(), smp_mb(), smp_read_barrier_depends(): provide the corresponding barrier on SMP, and only a compiler barrier() on uniprocessor kernels

Examples are as follows:

int a, b; /* shared with other threads */

void thread_worker(void)
{
    a = 3;
    mb(); /* full barrier: the store to a completes before the store to b */
    b = 4;
}

This usage guarantees that the assignment to a always takes effect before the assignment to b, and is never reversed by compiler or processor reordering. In some situations, such a reversal could have incalculable consequences.
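
Note that barriers are generally used in pairs: a reader on another processor needs a matching read barrier to observe the two writes in order. A minimal sketch under that assumption (thread_reader() is a hypothetical counterpart to the example above):

void thread_reader(void)
{
    int x, y;

    y = b;
    rmb(); /* read barrier: the load of b completes before the load of a */
    x = a;

    /* With the paired mb()/rmb(), observing y == 4 guarantees x == 3. */
}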

12. Summary
This part has covered 11 kernel synchronization methods. Except for the Big Kernel Lock, which is no longer recommended, every lock has scenarios it suits.

Only by knowing the applicable scenarios of each synchronization method can we use them correctly, so that our code achieves the best performance while remaining safe.

The purpose of synchronization is to keep data safe, in effect the resources shared among threads. The choice among the 10 usable synchronization methods, according to the nature of the shared resource, is discussed below.

(Figure from the original post: a selection chart in which the 10 synchronization methods are marked with blue boxes.)

Finally, a summary based on that chart.

Of the 10 locks above, the most common in the kernel are spin locks, semaphores, and mutexes. The choice among these three was laid out in the table in Section 6 of Part Two; that is the focus of the whole article!

Studying the kernel's lock implementations helps us decide how to use locks in our own program designs: what kind of lock to use, and how to design the locking.
