In-depth understanding of the core principles of lock-free RCU

Hi everyone. Today I will share one of the most important locks in parallel programming: the RCU lock. The essence of RCU is trading space for time. It is an optimization and enhancement of the read-write lock, but it is more than just that: its garbage-collection-like idea of deferred reclamation is also well worth studying and borrowing. RCU implementations exist for many languages, including C, C++, Java, and Go, and the exquisite implementation in the Linux kernel is excellent material for learning code design. This deep dive into RCU comes in two parts: the first part covers the core principles and builds a macro-level understanding of RCU's design ideas; the follow-up second part will analyze the source-code implementation. I hope you enjoy it.

Evolution of Parallel Programming

How to protect shared data correctly and efficiently is a hard problem that every parallel program must face, and the usual answer is synchronization. Synchronization can be divided into blocking synchronization (Blocking Synchronization) and non-blocking synchronization (Non-blocking Synchronization).

Blocking synchronization means that when a thread reaches the critical section while another thread already holds the lock protecting the shared data, it cannot acquire the lock and blocks (sleeps) until the other thread releases it. Common synchronization primitives include the mutex and the semaphore. If a blocking scheme is used improperly, it can cause deadlock, livelock, priority inversion, and poor efficiency.

To reduce these risks and improve runtime efficiency, the industry proposed synchronization schemes that do not use locks. Algorithms designed along these lines are called non-blocking synchronization. Their essence is that halting the execution of one thread does not prevent the other execution entities in the system from making progress.

Blocking synchronization

A mutual exclusion lock (mutex) is a mechanism used in multithreaded programming to prevent two threads from reading and writing the same shared resource at the same time. It achieves this by dividing code into critical sections, where a critical section is a piece of code that accesses a shared resource.

A semaphore is a facility used in multithreaded environments to ensure that two or more critical code segments are not executed concurrently. A mutex can be viewed as a binary (0-1) semaphore.

A read-write lock is a synchronization mechanism for concurrency control. It divides accessors of a shared resource into readers and writers: readers only read the shared resource, while writers modify it. Read operations can proceed concurrently, while write operations are mutually exclusive.

Non-blocking synchronization

There are three popular non-blocking synchronization implementations today:

  1. Wait-free
    Wait-free means that every operation of every thread completes within a bounded number of steps, regardless of the execution speed of other threads. Wait-freedom is a per-thread guarantee and can be regarded as starvation-free. Unfortunately it is rarely achieved in practice: wait-free algorithms are hard to construct, their memory consumption can grow linearly with the number of threads, and only a handful of non-blocking algorithms reach this level.
    Simple understanding: every thread is making progress at all times;
  2. Lock-free
    Lock-free means that among all the threads executing an algorithm, at least one can always make progress. Individual threads are not starvation-free (a given thread may be delayed arbitrarily), but at every step at least one thread progresses, so the system as a whole keeps running; it is a system-wide guarantee. All wait-free algorithms are lock-free.
    Simple understanding: at least one thread is making progress at any time;
  3. Obstruction-free
    Obstruction-free means that at any point in time, a thread running in isolation can complete any operation within a bounded number of steps. A thread keeps making progress as long as there is no contention; once the shared data is modified concurrently, obstruction-freedom allows partially completed operations to be aborted and rolled back. All lock-free algorithms are obstruction-free.
    Simple understanding: if the data is modified underneath you, re-read it, roll back the work already done, and retry;

To sum up, obstruction-free offers the weakest guarantee among the non-blocking levels, while wait-free offers the strongest but is also the hardest to implement. Lock-free algorithms therefore strike the practical balance and have come to be valued and widely used in all kinds of programming; this article focuses on the lock-free approach.

Lock-free algorithms often provide better performance and scalability guarantees, but their advantages go beyond that. These concepts were first applied in operating systems, because an algorithm that does not depend on locks keeps working in scenarios where locks break down, such as deadlock, interrupt context, or even CPU failure.

Mainstream lock-free technology

An atomic operation reads and modifies data in a single, indivisible step. It relies on processor instructions that support atomicity (a CAS-based sketch follows the list below):

● test-and-set (TAS)

● compare-and-swap (CAS)

● load-link/store-conditional (LL/SC)
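As an illustration, here is a minimal sketch of a lock-free increment built on CAS, written with C11 `<stdatomic.h>`; the function name is invented for this example:

```c
#include <stdatomic.h>

/* Minimal sketch: a lock-free increment built on compare-and-swap (CAS).
 * The loop retries until our update wins the race. */
void lockfree_increment(atomic_int *counter)
{
    int old = atomic_load(counter);
    /* If *counter still equals old, set it to old + 1; on failure, old is
     * refreshed with the current value and we simply try again. */
    while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
        /* another thread got there first; retry with the new value */
    }
}
```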

Spin lock: a lightweight, non-blocking synchronization method. When a thread fails to acquire the lock, it does not hang itself on a wait queue; instead the CPU spins in a busy loop until another thread releases the lock.
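A minimal test-and-set spin lock can be sketched in a few lines of C11; the type and function names here are illustrative, not any particular library's API:

```c
#include <stdatomic.h>

/* Sketch of a test-and-set spin lock. Initialize with:
 *   struct my_spinlock lock = { ATOMIC_FLAG_INIT };          */
struct my_spinlock { atomic_flag locked; };

void my_spin_lock(struct my_spinlock *l)
{
    /* test-and-set returns the previous value: keep spinning until
     * we are the thread that flipped it from clear to set. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ; /* busy-wait: burn CPU instead of sleeping */
}

void my_spin_unlock(struct my_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```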

Seqlock (sequence lock) is a lock introduced in the Linux 2.6 kernel. It resembles a spin-lock-based read-write lock, except that it gives the writer higher priority: a writer is allowed to proceed even while readers are reading. A reader checks a sequence counter to detect whether the data was updated during its read and retries if so. Because the seqlock favors the writer, taking the write lock always succeeds as long as no other writer holds it.
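The idea fits in a short sketch. This is illustrative only: the names are made up, the plain access to shared_data is a simplification, and a real implementation such as the kernel's adds proper memory barriers and writer-side locking:

```c
#include <stdatomic.h>

atomic_uint seq;     /* even: data stable; odd: a write is in progress */
int shared_data;     /* the protected data (illustrative) */

void seq_write(int v)             /* writers still exclude one another */
{
    atomic_fetch_add(&seq, 1);    /* counter becomes odd: write begins */
    shared_data = v;
    atomic_fetch_add(&seq, 1);    /* counter becomes even: write done  */
}

int seq_read(void)
{
    unsigned s;
    int v;
    do {
        s = atomic_load(&seq);        /* snapshot the counter ...          */
        v = shared_data;              /* ... read the data ...             */
    } while ((s & 1) ||               /* retry if a write was in progress  */
             s != atomic_load(&seq)); /* ... or one happened while reading */
    return v;
}
```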

RCU (Read-Copy-Update) is named after its working principle: read, copy, update. For a shared data structure protected by RCU, readers can access it without taking any lock. A writer first makes a copy, modifies the copy, and then replaces the pointer to the old data with a pointer to the new version; a callback mechanism reclaims the old data at the appropriate time, namely once every CPU that was referencing the old data has finished accessing it.

This article mainly explains the core principles of RCU.

Historical background

In high-performance parallel programs, consistent access to data is critically important, and lock mechanisms (semaphore, spinlock, rwlock, and so on) are generally used to protect shared data. The fundamental idea is that before touching a critical resource, a thread first accesses a global variable (the lock), and the state of that variable controls the threads' access to the resource. This idea needs hardware support: the hardware must provide an atomic read-modify-write of the lock variable, and modern CPUs offer such atomic instructions.

Using the lock mechanism to achieve data access consistency has the following two problems:

  • Efficiency. Implementing a lock requires atomic accesses to memory; these accesses disrupt the CPU pipeline and cost performance. Moreover, with a read-write lock the write lock is exclusive, so writes cannot run concurrently with reads, which degrades performance in some applications.
  • Scalability. As the number of CPUs in the system grows, the efficiency of lock-based synchronized data access drops further and further, showing that data-consistency schemes built on locks scale poorly.

Original RCU idea

In multithreaded scenarios we often need to access a data structure concurrently. To ensure thread safety we reach for mutual-exclusion facilities, and when the structure is read-mostly we optimize with a read-write lock. But the read-write lock is not the only option: copy-on-write (COW) lets readers go entirely without locks. Reads proceed as usual, while a writer first takes a lock (to exclude other writers), copies the data, modifies the copy, and finally publishes it with an atomic pointer update. COW thus avoids the cost of constantly taking and releasing a read-write lock.
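A sketch of the COW pattern in C11 (names such as g_conf and writer_set_a are invented for this example, and error handling is omitted); note that the final free is exactly the hard part that RCU exists to solve:

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

struct config { int a, b; };

_Atomic(struct config *) g_conf;   /* assume initialized at startup */
pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

struct config *reader_get(void)    /* read side: no lock taken */
{
    return atomic_load_explicit(&g_conf, memory_order_acquire);
}

void writer_set_a(int v)
{
    pthread_mutex_lock(&write_lock);        /* writers exclude each other */
    struct config *oldp = atomic_load(&g_conf);
    struct config *newp = malloc(sizeof(*newp));
    memcpy(newp, oldp, sizeof(*newp));      /* copy ...                   */
    newp->a = v;                            /* ... modify the copy ...    */
    atomic_store_explicit(&g_conf, newp,    /* ... publish atomically     */
                          memory_order_release);
    pthread_mutex_unlock(&write_lock);
    /* free(oldp) is NOT safe here: a reader may still hold the old
     * pointer. Deferring this free until no reader can possibly
     * reference it is precisely what RCU provides. */
}
```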

Advantages and disadvantages

RCU is designed to minimize read-side overhead, so it should only be used where read operations dominate the synchronized workload. If updates exceed roughly 10% of operations, performance actually degrades, and another synchronization mechanism should be chosen instead of RCU.

  • Benefits
    • Almost zero read-side overhead: zero waiting, zero cost
    • No deadlock problems
    • No priority-inversion problems (priority inversion and priority inheritance)
    • No problems with unbounded latency
    • No risk of memory leaks
  • Drawbacks
    • Somewhat complicated to use
    • Write operations are slightly slower than with other synchronization techniques
  • Applicable scenarios: read-mostly shared data, where readers far outnumber writers


Core principles

Theoretical basis: the QSBR algorithm (Quiescent-State-Based Reclamation)

The core idea of the algorithm is to identify when a thread is inactive (quiescent). So when is a thread considered quiescent? The state is defined relative to the critical section: a thread that has left the critical section is quiescent. Once a thread detects its own quiescent state, it must announce it so that other threads know. The whole process can be described by the following diagram:

There are four threads in the diagram above. Thread 1 registers a callback to free memory after executing its update. At this point threads 2, 3, and 4 are all still reading the old content; as each one finishes, it calls onQuiescentState to announce that it is now quiescent, and when the last thread calls onQuiescentState, the registered callback is invoked. The keys to implementing this process are choosing a suitable place to call onQuiescentState, and knowing which thread is the last one to call it.

Batch reclamation: if updates are frequent but only one callback is invoked per pass, memory reclamation cannot keep up with the rate of updates. Batching solves this: each update registers a new callback; when all threads have entered the quiescent state for the first time, the currently registered callbacks are saved, and when all threads have entered the quiescent state the next time, all of the saved callbacks are invoked.
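A heavily simplified sketch of the single-callback case follows; the counting scheme, the names, and the fixed thread count are all assumptions made for illustration, not a real QSBR library API:

```c
#include <stdatomic.h>

#define N_THREADS 4               /* the four threads from the diagram */

typedef void (*rcu_cb_t)(void *);
static rcu_cb_t    pending_cb;    /* callback registered by the updater */
static void       *pending_arg;
static atomic_int  qs_count;      /* threads that announced quiescence  */

/* Updater: after installing the new data, register a callback that will
 * free the old version once every thread has gone quiescent. */
void register_callback(rcu_cb_t cb, void *arg)
{
    pending_cb  = cb;
    pending_arg = arg;
    atomic_store(&qs_count, 0);
}

/* Each thread calls this once it has left its critical section. */
void on_quiescent_state(void)
{
    /* The last thread to announce quiescence runs the callback: by then
     * no thread can still be reading the old data. */
    if (atomic_fetch_add(&qs_count, 1) + 1 == N_THREADS)
        pending_cb(pending_arg);
}
```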

Basic structure

The Linux kernel's RCU designs a lock-free synchronization mechanism with reference to the QSBR algorithm:

  • Multiple readers can access the shared data concurrently without taking locks;
  • When a writer updates the shared data, it first makes a copy and modifies the copy; readers meanwhile still access the original data, so their accesses remain safe. Multiple writers must use a lock (for example, a spinlock) to exclude one another;
  • Once the modification is complete, the writer publishes the new data so that subsequent readers access the latest version;
  • After all readers of the old data have finished their accesses, the old data can be reclaimed (the kernel-API sketch below shows this whole cycle);
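The canonical kernel usage pattern looks like this. struct foo, gp, and the update logic are illustrative; the RCU primitives themselves (rcu_read_lock, rcu_dereference, rcu_assign_pointer, synchronize_rcu) are the real Linux kernel API, and error handling is omitted:

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo { int a; };
struct foo __rcu *gp;            /* RCU-protected global pointer */
DEFINE_SPINLOCK(gp_lock);        /* serializes the writers       */

int reader(void)                 /* read side: no lock taken */
{
    int val;

    rcu_read_lock();                      /* mark critical-section entry */
    val = rcu_dereference(gp)->a;         /* subscribe to the pointer    */
    rcu_read_unlock();                    /* mark critical-section exit  */
    return val;
}

void updater(int v)
{
    struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
    struct foo *old_fp;

    spin_lock(&gp_lock);
    old_fp = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
    *new_fp = *old_fp;                    /* copy ...                  */
    new_fp->a = v;                        /* ... update the copy ...   */
    rcu_assign_pointer(gp, new_fp);       /* ... publish the new data  */
    spin_unlock(&gp_lock);

    synchronize_rcu();                    /* wait out the grace period */
    kfree(old_fp);                        /* reclaim the old version   */
}
```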

RCU model

  • Removal: in the write-side critical section, read the data (Read), make a copy (Copy), update the copy (Update), and replace the old pointer so that new readers can no longer reach the old data;
  • Grace Period: a waiting period that ensures all readers that may still reference the removed data have completed their accesses;
  • Reclamation: reclaim the old data;

Three important concepts

Quiescent State (QS): when a CPU undergoes a context switch, it is said to have experienced a quiescent state;

Grace Period (GP): the grace period is the time required for every CPU to experience a quiescent state, that is, for all readers in the system to complete their accesses to the shared critical section;

(Figure: GP principle)

Read-Side Critical Section (RCS): a region of code within which the protected data may not be modified by other CPUs, while multiple CPUs are allowed to read it simultaneously;

Three main roles

Reader:

  • Accesses critical-section resources safely;
  • Marks its entry into and exit from the critical section;

Updater (writer):

  • Copies the data, then updates the copy;
  • Replaces the old data with the new data, then enters the grace period;

Reclaimer:

  • Waits for the readers that entered before the grace period to exit the critical section;
  • Reclaims the old resources once the grace period ends (see the call_rcu() sketch after this list);
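The reclaimer does not have to block the way synchronize_rcu() does: with call_rcu() the old version is handed to the RCU subsystem, which invokes the callback after the grace period ends. struct foo here is the same illustrative structure as above, now with an embedded rcu_head; call_rcu() and container_of() are the real kernel API:

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
    int a;
    struct rcu_head rcu;         /* hook for deferred reclamation */
};

static void free_foo_cb(struct rcu_head *head)
{
    /* Runs only after every pre-existing reader has left its critical
     * section, so freeing is safe here. */
    struct foo *old_fp = container_of(head, struct foo, rcu);
    kfree(old_fp);
}

/* Updater side: after publishing the new version with
 * rcu_assign_pointer(), retire the old one asynchronously. */
void retire_old_version(struct foo *old_fp)
{
    call_rcu(&old_fp->rcu, free_foo_cb);
}
```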

Three important mechanisms

Publish/subscribe mechanism:

  • Used mainly when updating data: readers can safely traverse the data structure even while it is being modified concurrently. RCU gains this ability to handle concurrent insertions through the publish-subscribe mechanism;

Deferred reclamation mechanism:

  • Checks that all RCU readers have completed their accesses to the old data, so that the old data can be deleted safely;

Multi-version mechanism:

  • Maintains multiple versions of recently updated objects, so that readers can tolerate the concurrent insertion and removal of new versions of objects (see the list sketch below);
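These three mechanisms come together in RCU-protected linked lists, the most common use of RCU in the kernel. A sketch under the usual assumptions (struct item and the two functions are invented; the list primitives are the real kernel rculist API):

```c
#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct item {
    int key;
    struct list_head link;
    struct rcu_head  rcu;
};

static LIST_HEAD(items);
static DEFINE_SPINLOCK(items_lock);

bool lookup(int key)                 /* read side: traverses lock-free */
{
    struct item *it;
    bool found = false;

    rcu_read_lock();
    list_for_each_entry_rcu(it, &items, link) {  /* subscribe per node */
        if (it->key == key) {
            found = true;
            break;
        }
    }
    rcu_read_unlock();
    return found;
}

void remove_item(struct item *it)    /* write side */
{
    spin_lock(&items_lock);
    list_del_rcu(&it->link);         /* unpublish: new readers skip it */
    spin_unlock(&items_lock);
    kfree_rcu(it, rcu);              /* reclaim after the grace period */
}
```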

Final summary

Finally, let's summarize the core idea of the RCU lock:

  • Readers access the data without locks, marking their entry into and exit from the critical section;
  • Writers read, copy, then update;
  • Old data is reclaimed with a delay;

The core idea of RCU is just these three sentences, simple enough that even a product manager would call it easy. The Linux kernel's implementation, however, is anything but simple: beyond the basic functionality, it must handle many complex situations:

The kernel's RCU subsystem can fairly be called one of the most complex subsystems in the kernel. For high performance and multi-core scalability, it designs a very delicate set of data structures:

At the same time, many core processes are cleverly implemented:

  • Checking whether the current CPU has passed through a QS;
  • QS reporting (reporting that the grace period has elapsed);
  • Starting and completing a grace period;
  • Processing of RCU callbacks;

Many of these implementations are exquisite, combining preprocessing, batching, deferred (asynchronous) processing, multi-core concurrency, atomic operations, exception handling, and fine-grained optimization of many scenarios. The result performs well, scales well, and is highly stable, and it has real learning and reference value: even if your job is not kernel programming, it embodies many programming and code-design ideas worth studying.

To find out what happens next, stay tuned for the next installment, in which I will show you the beauty of the RCU source-code implementation.

 

Origin: blog.csdn.net/youzhangjing_/article/details/131646945