A brief analysis of various locks in the Linux kernel: semaphore / mutex / read-write lock / atomic variables / spinlock / memory barrier, etc.

First of all, note that different locks act on different objects.
Below is a summary of the various locks, grouped by what they act on: the critical section, the CPU, memory, and the cache.

1. Atomic variables / spinlock - CPU

Since these lock at the CPU level, they are all aimed at multi-core (or multi-CPU) machines. On a single core, only an interrupt can cause the running task to be preempted, so disabling interrupts before entering the critical section is enough. On a multi-core CPU, disabling interrupts is not enough: it only prevents tasks on the current CPU from preempting the code in the critical section, while other CPUs may still execute code that enters the same critical section.

Atomic variables: in an x86 multi-core environment, when multiple cores compete for the data bus, the CPU provides the LOCK prefix to lock the bus, guaranteeing the atomicity of a "read-modify-write" operation at the chip level. In code this is simple: declare a variable accessed by multiple threads as an atomic type, e.g. atomic_int x; (C11) or std::atomic<int> x; (C++11).
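As a quick userspace illustration (a sketch using C++11 std::atomic, not the kernel's atomic_t API), two threads can increment a shared counter without any lock:

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter{0};

void worker() {
    for (int i = 0; i < 100000; ++i)
        counter.fetch_add(1, std::memory_order_relaxed); // atomic read-modify-write
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    std::cout << counter << "\n"; // always 200000; a plain int could lose updates
}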

Spin lock:
When a thread tries to acquire the lock and the lock is already held by another thread, the thread spins in a loop, repeatedly checking whether the lock has become available. Usage example:

#include <linux/spinlock.h>

// define and statically initialize a spinlock
static DEFINE_SPINLOCK(my_lock);

void my_function(void)
{
    spin_lock(&my_lock);
    // operate on the shared resource
    spin_unlock(&my_lock);
}

With a mutex, a thread that fails to get the lock gives up the CPU; with a spinlock, a thread that fails to get the lock busy-waits on the CPU until the lock becomes available, which gives faster response times. But if too many threads do this, multiple CPU cores end up busy-waiting, which degrades system performance.
Therefore a spinlock must not spin for long: if you use a spinlock to protect a critical section in user-mode code, keep the critical section as small as possible and the lock granularity as fine as possible.

Why are spinlocks more responsive than mutexes?

As mentioned in the Xiaolin coding articles, a spinlock is built on the CAS (Compare-And-Swap) operation provided by the CPU. Lock and unlock complete entirely in user mode and do not trigger a thread context switch, so a spinlock is faster and has less overhead than a mutex.
A mutex is different. As mentioned earlier, if locking fails, the thread relinquishes the CPU, and this switch is performed by the kernel. So when locking fails: 1) the thread first switches from user mode to kernel mode, and the kernel changes its state from "running" to "sleeping" and switches the CPU to another thread; 2) when the mutex becomes available, the kernel moves the previously "sleeping" thread to the "ready" state (into the ready queue), and at a suitable time switches the CPU back to it, which then returns to user mode.
This process therefore incurs not only the overhead of switching from user mode to kernel mode, but also the overhead of two thread context switches.
A thread context switch mainly involves the thread's stack, registers, thread-local variables, and so on.
A spinlock, by contrast, does not switch threads when the lock is unavailable; the thread loops until it acquires the lock. So a spinlock involves no switch to kernel mode and no thread-switch overhead.
So if the lock is held only briefly, i.e. every thread gets in and out of the critical section quickly, a spinlock has the lowest overhead.
The downside of spinlocks, as noted above: if threads spin for a long time or too many threads spin, CPU utilization drops, because every spinning thread is busy-waiting, occupying a CPU while doing nothing useful.
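To make the CAS connection concrete, here is a minimal spinlock sketch built on an atomic test-and-set in C++ (illustration only, not the kernel's spinlock_t implementation):

#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // busy-wait: keep retrying until the flag was previously clear
        while (flag.test_and_set(std::memory_order_acquire)) {
            // spin on the CPU instead of sleeping
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};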

2. Semaphore / mutex - critical section

Semaphore:
A semaphore is essentially a counter: it counts how many of the critical section's resources are available.
A semaphore of 3 means 3 resources are available. If the initial value is 3 and at some moment the value is 1, then 1 resource is available, meaning 2 processes/threads are currently using resources, i.e. 2 resources are consumed (what the resource is depends on the situation). Processes operate on a semaphore via P and V operations: P decrements it by 1 before entering the shared resource area, V increments it by 1 after leaving (so at any time the semaphore says how many more processes may enter the critical section).
When semaphores are used for multi-thread communication, the semaphore is often initialized to 0, and two functions are then used to synchronize threads:
sem_wait(): wait on the semaphore. If its value is greater than 0, decrement it by 1 and return immediately; if its value is 0, block the thread.
sem_post(): release a resource, incrementing the semaphore by 1. This is the counterpart of unlock: it lets a thread blocked in sem_wait() proceed.
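A minimal sketch of this pattern with POSIX semaphores (Linux, compile with -pthread; sem_init/sem_wait/sem_post are the real API, the surrounding code is illustrative):

#include <semaphore.h>
#include <iostream>
#include <thread>

sem_t sem;

void waiter() {
    sem_wait(&sem);               // blocks while the count is 0, then decrements it
    std::cout << "resource ready\n";
}

int main() {
    sem_init(&sem, 0, 0);         // initial value 0: pure synchronization
    std::thread t(waiter);
    // ... produce the resource ...
    sem_post(&sem);               // count +1: wakes the thread blocked in sem_wait()
    t.join();
    sem_destroy(&sem);
}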

Note: the semaphore itself is a shared resource, so its increment (release a resource) and decrement (acquire a resource) also need protection. In the kernel this is done with a spinlock; if an interrupt arrives, the interrupt state is saved via the EFLAGS register, and after the semaphore operation completes the state is restored and the interrupt is executed.

struct semaphore {
    spinlock_t lock;             // spinlock protecting count and wait_list
    unsigned int count;          // number of available resources
    struct list_head wait_list;  // processes sleeping on this semaphore
};

Mutex:
A semaphore counts available resources and may admit several processes/threads into the critical section at once. A mutex does not: its purpose is to let only one thread into the critical section, and threads that fail to take the lock can only block and wait. Threads enter the critical section mutually exclusively, which is where the name comes from.
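A minimal C++ sketch of mutual exclusion with std::mutex (lock_guard releases the lock automatically when the scope ends):

#include <mutex>

std::mutex m;
int shared_data = 0;

void update() {
    std::lock_guard<std::mutex> guard(m); // blocks until the lock is acquired
    ++shared_data;                        // at most one thread is ever in here
}                                         // unlocked automatically on scope exit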
There is also std::timed_mutex, a kind of timed ("sleeping") lock. The difference from a plain mutex: with a mutex, a thread that fails to take the lock blocks indefinitely, while a timed lock waits with a timeout, e.g. 2 s. If the lock has not been acquired when the timeout expires, the thread gives up (and can report the failure); if it is acquired, the thread continues its work. This uses the member function try_lock_for(), as below:

#include <chrono>
#include <iostream>
#include <mutex>

std::timed_mutex g_mutex;

void worker()
{
    // try to take the lock, waiting at most 2 seconds
    if (g_mutex.try_lock_for(std::chrono::seconds(2))) {
        // do something
        g_mutex.unlock();
    }
    else {
        // did not get the lock within 2 seconds
        std::cout << "failed to acquire the lock";
    }
}

3. Read-write lock / preemption - critical section

Read-write lock:
Used in scenarios where reads are much more frequent than writes; it locks reads and writes separately, which reduces lock granularity and improves program performance.
It allows multiple threads to read the shared resource at the same time, but only one thread to write it. Since reads usually far outnumber writes, this improves concurrency. Read-write locks are higher-level locks and can be implemented on top of spinlocks.
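As a userspace analogue, C++17's std::shared_mutex provides exactly this split (a sketch; the kernel's rwlock_t API differs):

#include <shared_mutex>

std::shared_mutex rw_lock;
int data = 0;

int reader() {
    std::shared_lock<std::shared_mutex> lk(rw_lock); // shared: many readers at once
    return data;
}

void writer(int v) {
    std::unique_lock<std::shared_mutex> lk(rw_lock); // exclusive: blocks readers and writers
    data = v;
}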

Preemption:
Preemption involves a process context switch, while an interrupt involves an interrupt context switch.
The kernel has supported kernel preemption since 2.6. Earlier kernels were not preemptible: as long as a process held the CPU and its time slice was not used up, it kept running.
When preemption happens:
For example, a high-priority task gives up the CPU because it has to wait for a resource (or is interrupted by an interrupt), and a lower-priority task takes the CPU first. When the resource arrives, the kernel lets the higher-priority task preempt the task currently running on the CPU. In other words, the low-priority process is running, its time slice is not used up, and no interrupt has occurred, yet it is kicked off the CPU.
To support kernel preemption, the kernel introduced a preempt_count field. Its initial value is 0; it is incremented whenever a lock is taken and decremented whenever a lock is released. When preempt_count is 0 the kernel can be preempted safely; when it is greater than 0, kernel preemption is forbidden. A toy model of this rule is sketched below.
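A minimal sketch of that counting rule in plain C++ (a hypothetical model, not kernel code; the names on_lock/on_unlock/kernel_preemptible are illustrative):

#include <atomic>

std::atomic<int> preempt_count{0};  // hypothetical model of the kernel counter

void on_lock()   { preempt_count.fetch_add(1); } // +1 whenever a lock is taken
void on_unlock() { preempt_count.fetch_sub(1); } // -1 whenever a lock is released

bool kernel_preemptible() {
    return preempt_count.load() == 0;            // 0 means it is safe to preempt
}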

Per-CPU - acts on the cache
Per-CPU variables give each CPU its own copy of a variable, avoiding the consistency problem between each CPU's cache and memory (and the cache-line contention that sharing a single copy would cause).

4. RCU mechanism / memory barrier - memory

The RCU mechanism is Read-Copy-Update.
Like a read-write lock, RCU allows multiple readers to read at the same time; but to update the data, the writer first makes a copy, completes the modification on the copy, and then swaps out the old data in one step.
For example, to modify a node of a linked list: copy the node, modify the value in the copy, then point the predecessor's pointer at the new node; wait until no reader can still be reading the old node, and only then reclaim its memory.
So the RCU mechanism has two cores: 1) update on a copy; 2) deferred reclamation of the memory.
With RCU, reads and writes need no mutual synchronization and do not race: readers read the original data while the writer modifies the copy, so reads and writes proceed in parallel.
Both sides access the data through a pointer. When the writer finishes, it assigns the new data's pointer over the old one; since pointer assignment is atomic, readers arriving after the swap see the new data.
The old memory is reclaimed by a single thread.
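A minimal userspace sketch of the idea with std::atomic<Node*> (not the kernel's rcu_read_lock()/synchronize_rcu() API; the grace-period wait and reclamation are elided):

#include <atomic>

struct Node { int value; };

std::atomic<Node*> current{new Node{1}};

int read() {
    // readers take no lock: just follow the current pointer
    Node* p = current.load(std::memory_order_acquire);
    return p->value;
}

void update(int v) {
    Node* old  = current.load(std::memory_order_relaxed);
    Node* copy = new Node(*old);                    // 1) copy
    copy->value = v;                                // 2) modify the copy
    current.store(copy, std::memory_order_release); // 3) publish with one atomic store
    // `old` may only be freed after all pre-existing readers finish
    // (the grace period); this sketch deliberately leaks it.
}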

Memory barrier:
A memory barrier controls the order of memory accesses so that instructions take effect in the expected order.
Code often does not execute in the order we wrote it; the reordering happens at two levels:
1) The compiler level: the compiler's optimizer frequently rearranges the generated assembly instructions.
2) The CPU level: memory accesses can occur out of order across multiple CPUs.
A memory barrier forces the compiler or the CPU to access memory in order.

Out-of-order access at compile time:

int x, y, r;
void f()
{
    x = r;
    y = 1;
}

With optimization enabled, the generated assembly may execute y = 1 first and x = r afterwards. You can inspect the assembly produced under -O2 with g++ -O2 -S test.cpp.
We can use the kernel's barrier() macro to prevent this compiler reordering:

#define barrier() __asm__ __volatile__("" ::: "memory")

int x, y, r;
void f()
{
    x = r;
    barrier();  // compiler barrier: the compiler may not reorder memory accesses across it
    y = 1;
}

Or qualify the variables x and y with the volatile keyword:

volatile int x, y;

Note: the volatile keyword in C++ only prevents instruction reordering at compile time; it does nothing about reordering across multiple CPUs, so at run time the accesses may still be out of order. Java's volatile keyword, by contrast, appears to act as a memory barrier at both the compiler and CPU levels.

Multi-CPU out-of-order access to memory:
On a single CPU, setting aside compiler-induced reordering, multi-threaded execution does not suffer out-of-order memory access: a single CPU fetches instructions in order (through a FIFO queue) and retires their results to registers in order (also through a queue).
On a multi-CPU machine, however, each CPU has its own cache. When data x is first fetched by a CPU, x is not yet in that CPU's cache (a cache miss), so the CPU fetches it from memory and loads it into its cache, after which it can be read directly from the cache.
When a CPU performs a write, it must make sure the other CPUs have evicted x from their caches (to keep them consistent); only after that eviction completes can it safely modify the data.
Clearly, with multiple caches, a cache-coherence protocol is needed to avoid inconsistent data, and the communication this requires can make memory accesses appear out of order.
There are three types of CPU-level memory barriers:

  1. General barrier, ensures both reads and writes are ordered: mb() and smp_mb()   // mb = memory barrier
  2. Write barrier, only guarantees writes are ordered: wmb() and smp_wmb()
  3. Read barrier, only guarantees reads are ordered: rmb() and smp_rmb()

These functions are likewise defined as macros. For example mb() inserts an mfence instruction; applied to the compile-time example above:

#define mb() __asm__ __volatile__("mfence" ::: "memory")

void f()
{
    x = 1;
    mb();   // full CPU barrier: neither loads nor stores may cross it
    r1 = y;
}

Note: all CPU-level memory barriers (except data-dependency barriers) imply a compiler barrier.

Moreover, many thread-synchronization mechanisms are supported by memory barriers underneath. For example, atomic operations and spinlocks rely on the CAS operation provided by the CPU. CAS is Compare-And-Swap, and its basic idea is:
in a multi-threaded environment, to modify a shared variable, first read its old value and compute the new value; then atomically compare the variable's current value with the old value: if they match, write the new value and the modification succeeds; if not, the modification fails and the operation must be retried.
Implementations of CAS need memory barriers to guarantee the ordering and consistency of the operation. In Java, for instance, the compareAndSet method of the Atomic classes automatically inserts memory barriers to ensure correctness.
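A sketch of that retry loop in C++ (compare_exchange_weak is the standard CAS; on failure it reloads the current value into old_val):

#include <atomic>

std::atomic<int> value{0};

void add_ten() {
    int old_val = value.load();
    // retry until no other thread changed `value` between our read and our swap
    while (!value.compare_exchange_weak(old_val, old_val + 10)) {
        // on failure, old_val now holds the current value; loop and try again
    }
}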

For application-level programming, C++11 introduced a memory model that guarantees synchronization and consistency in multithreaded programs. Memory barriers (at the CPU level) are part of that model and enforce a specific order of memory operations. On x86-64 only one kind of hardware reordering occurs: Store-Load, i.e. a later read may be reordered before an earlier write.
There are two kinds of barriers here, store and load. Usage examples:

// store (release) barrier
std::atomic<int> x;
x.store(1, std::memory_order_release); // earlier reads/writes cannot be reordered after this store

// load (acquire) barrier
std::atomic<int> y;
int val = y.load(std::memory_order_acquire); // later reads/writes cannot be reordered before this load

Besides ordering instructions, CPU-level memory barriers must also guarantee visibility of the data: if a write is not visible to other cores, the data becomes inconsistent.
That is why the code above uses acquire and release semantics to set barriers for reads and writes respectively:

acquire: reads and writes after the acquire cannot be moved before it
release: reads and writes before the release cannot be moved after it

In addition to the atomic load and store above, C++11 also provides a standalone memory-barrier function, std::atomic_thread_fence, used similarly:

#include <atomic>
std::atomic_thread_fence(std::memory_order_acquire);
std::atomic_thread_fence(std::memory_order_release);
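A classic use is message passing through fences; in this sketch the relaxed flag plus the two fences guarantee the consumer sees the payload (the names payload/ready are illustrative):

#include <atomic>

int payload = 0;                      // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                        // write the data
    std::atomic_thread_fence(std::memory_order_release); // writes above may not sink below
    ready.store(true, std::memory_order_relaxed);
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { }   // spin until published
    std::atomic_thread_fence(std::memory_order_acquire); // reads below may not rise above
    // payload is guaranteed to be 42 here
}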

5. Examples of using these locks in the kernel

Process scheduling: kernel locks protect the scheduler's data structures, so that multiple CPUs modifying them at the same time cannot cause errors.

// spinlock on the runqueue
spin_lock(&rq->lock); 
... 
spin_unlock(&rq->lock);

File system: kernel locks protect file-system metadata, such as inode and dentry structures, so that concurrent access by multiple processes cannot cause errors.

spin_lock(&inode->i_lock); 
... 
spin_unlock(&inode->i_lock);

Network protocol stack: kernel locks protect the protocol stack's data structures, such as sockets and routing tables, against errors from concurrent access by multiple processes.

read_lock(&rt_hash_lock); 
...
read_unlock(&rt_hash_lock);

Memory management: kernel locks protect memory-management data structures, such as page tables and memory maps, against errors from concurrent access by multiple processes.

spin_lock(&mm->page_table_lock);
... 
spin_unlock(&mm->page_table_lock);
