RCU prequel: from synchronization to the introduction of RCU

1. Start with synchronization

1.1 Where synchronization problems come from

When reading or writing kernel code, you always have to start from one basic assumption: any execution flow may be interrupted after any instruction, and it will only resume at that point after an indeterminate amount of time.

A question therefore comes up constantly: will the environment the instruction depends on have changed between the moment the flow was interrupted and the moment it resumes at the breakpoint? Put differently, is that environment exclusive or shared? If it is exclusive, the code is safe; if it is shared, it may be modified unexpectedly, and a synchronization problem appears. Synchronization problems are usually handled through synchronization mechanisms such as atomic variables and locks.

Most engineers decide whether a synchronization mechanism is needed with a simple rule of thumb: operations on global variables need locking, operations on local variables do not.

In the vast majority of cases this rule works. The reason I replaced "global" and "local" with "shared" and "exclusive" above is that, in certain circumstances, local variables are not exclusive resources, and the same caveat applies to global variables. Whether a synchronization mechanism is needed is not a fixed property. Consider the following situations:

(1) We usually treat returning a pointer to data on the stack as an obvious bug, without thinking about why it is a bug: the stack data will be overwritten after the function returns. But what if the function does not return? You can find code like this in the kernel: initialize some resources on the stack, link them onto a global list, then go to sleep; the lifetime of that stack data then lasts until the task is woken up. Since the stack data has been exported to other places, it has changed from exclusive to shared, and synchronization has to be considered.

(2) When we focus all our attention on the data, or take it for granted that the stack is exclusive as in the first point, we ignore the form in which everything actually exists: instructions and data, stack and heap, all live in memory, and memory as hardware is readable and writable. The code segment being read-only, or each process having its own stack, are merely attributes the operating system assigns to them. As mere users of the operating system we can take these rules for granted, but as developers, might we ever need to modify the code segment or some other non-data region? And can such modifications cause synchronization problems? So synchronization issues are not limited to data.

(3) Some global variables are defined only to widen their scope of access, or, although they are shared resources, they are never actually accessed concurrently in a given scenario (per-task variables and percpu variables, for example). So for a synchronization problem, sharing is only one necessary condition; the other is simultaneous access, that is, multiple execution flows operating on the same shared resource at the same time.

How should we understand "simultaneous"? The same instant on a timeline? If that were the definition, true simultaneity would never occur. The simultaneity we care about is this: if A has not yet finished a piece of work when B also joins in, we regard A and B as doing that work at the same time.

For example, in a single-core environment there is no parallel execution and all code runs serially, yet synchronization problems still occur, the classic one being i++. The reason is that the smallest unit of C code is not the smallest unit of instruction execution: an i++ is really a load/modify/store sequence. If the flow is interrupted right after the load completes and another execution flow operates on i, then from the C language's point of view the i++ has "not happened yet", which is exactly the kind of simultaneity defined above. The same reasoning extends to compound structures.
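To make this concrete, here is a minimal sketch (plain C, not taken from any real kernel path; the variable and function names are invented) of what an unprotected i++ effectively expands to:

```c
int i = 0;

void unsafe_increment(void)
{
    int tmp;

    tmp = i;        /* load: read the current value into a register copy       */
    tmp = tmp + 1;  /* modify: work on the private copy                         */
    i = tmp;        /* store: if another flow updated i after our load,         */
                    /* that update is silently overwritten and lost             */
}
```

If another execution flow runs between the load and the store, one increment disappears, even on a single core.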

(4) Multiple execution flows operate on shared data at the same time. When this scenario appears, many engineers add a synchronization mechanism without hesitation. But have you ever wondered whether skipping the lock necessarily produces a bug? To answer that, we need to know a little about how the CPU behaves during execution.

First, reads and writes must be distinguished. The problem we usually talk about is really this: a read operation observes an unexpected value, and the logic or write operation that follows it then goes wrong; and that unexpected value is caused by concurrent writes to the shared resource.

On the other hand, a program that runs without synchronization is exposed to several kinds of reordering:

  • The compiler optimizes the program on the assumption that it runs in a single execution flow, and on that basis it may reorder code, keep results in registers instead of writing them back to memory, and so on. If you do not want the compiler to do this, you have to suppress such aggressive optimizations, for example with volatile; the kernel usually uses the WRITE_ONCE/READ_ONCE interfaces, or a barrier, to prevent reordering across a specific piece of code (see the sketch after this list).

  • To further improve parallelism, the CPU also executes instructions out of order. The CPU usually only guarantees ordering between instructions that have a dependency, for example when one instruction consumes the result of the previous one; instructions without such dependencies may be reordered, and no assumption should be made about how. In a multi-threaded, multi-core environment this reordering causes problems, and CPU barriers can be used to forbid it.

  • The weakly ordered memory model of modern CPUs depends on the processor architecture. Under a weaker memory model, writes are not committed to memory in program order: after one CPU completes a write, another CPU is not guaranteed to see the new value immediately, nor to see writes in the same order. The way to forbid this is a data or memory barrier of the appropriate kind.
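As a hedged illustration of the interfaces mentioned above (the variable names and use() are invented for the example; READ_ONCE/WRITE_ONCE and smp_wmb/smp_rmb are the kernel's real macros), the classic "publish data, then set a flag" pattern looks roughly like this:

```c
#include <linux/compiler.h>   /* READ_ONCE / WRITE_ONCE */
#include <asm/barrier.h>      /* smp_wmb / smp_rmb      */

extern void use(int v);       /* hypothetical consumer of the published value */

static int data;
static int ready;

/* writer side: make the data visible before the flag */
void publish(void)
{
    data = 42;
    smp_wmb();                /* order the data store before the flag store   */
    WRITE_ONCE(ready, 1);     /* also stops the compiler from caching/reordering */
}

/* reader side: only touch data after seeing the flag */
void consume(void)
{
    if (READ_ONCE(ready)) {
        smp_rmb();            /* pairs with the writer's smp_wmb()            */
        use(data);
    }
}
```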

Therefore, once we understand what actually goes wrong without a synchronization mechanism, we can analyze each case on its own merits. Even for simultaneous access to the same shared data there are situations, such as the following, where leaving out the lock causes no problem:

  • The shared data is read-only.

  • Even with simultaneous reading and writing, a bug does not necessarily follow. The most common example is a tunable exposed under /proc/: the node corresponds to a global variable, and the kernel code only ever reads that variable. Such a variable often gets no synchronization measures at all (or only a compiler barrier); the worst that happens is that a reader sees the old value for a very short time, which usually causes no logic problem.

  • Even simultaneous writes may go unlocked in specific application scenarios. Refer to the figure below:

[Figure: threads A and B both execute cnt++ concurrently; B's store overwrites A's, so only one increment takes effect]

  • Thread A and thread B operate on the variable cnt at the same time. Both execute cnt++, but B's store overwrites A's, so only one of the two increments takes effect. That sounds like it must be a problem; for network packet statistics, however, the probability of this happening is very low, and compared with the huge performance cost of locking every path, a slight error in a statistic is acceptable. In that case a single WRITE_ONCE to restrict compiler optimization is enough (a sketch follows this list). So once we understand the performance cost of each synchronization measure and what it actually solves, the decision becomes a trade-off between performance and accuracy (or throughput, power consumption, and similar metrics), rather than reaching for a lock reflexively.
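A sketch of that statistics-counter case (cnt and the function names are invented for the example):

```c
static unsigned long cnt;

/* hot path: every packet increments the counter; a rare lost increment is
 * acceptable here, so no lock is taken */
void count_packet(void)
{
    WRITE_ONCE(cnt, READ_ONCE(cnt) + 1);  /* only restrain the compiler */
}

/* /proc read side: may briefly report a slightly stale value, which is fine */
unsigned long read_packet_count(void)
{
    return READ_ONCE(cnt);
}
```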

1.2 Synchronization mechanisms

Once we talk about synchronization problems, we cannot avoid their solution: synchronization mechanisms, and the one discussed most is the lock. Since the necessary conditions for a synchronization problem are "sharing" and "simultaneity", breaking either condition solves the problem.

The simplest and most classic solutions are the spinlock and the mutex; anyone who has touched Linux is familiar with both. Each builds a critical section around the operation on a shared resource: only one visitor is allowed inside at a time, and the next one enters only when the current one leaves. This destroys the "simultaneity" condition and thereby solves the synchronization problem.

A spinlock spins and waits when it cannot acquire the lock, while a mutex sleeps; they target different scenarios, and an operating system needs both. But when comparing them, many people only notice the difference in behavior when lock acquisition fails, and overlook the differences after the lock is held:

(1) A spinlock disables preemption, but not necessarily interrupts. That is, inside a spinlock critical section no scheduling occurs, but the execution environment can still switch, to an interrupt or softirq for instance, and those scenarios must be considered.

(2) A mutex cannot be used in interrupt context, so that kind of environment switch does not need to be considered; but a mutex does not disable preemption, so it also brings the thorny problem of nested locking.

These two issues are unavoidable if you want to dig into the implementations or even optimize them. If you only use spinlock and mutex as synchronization measures protecting specific global data, they rarely need much thought, and absent other requirements such as performance, knowing the basic spinlock and mutex interfaces is enough to get the job done; when data is also touched from interrupt context, the interrupt-disabling spinlock variants are what you reach for, as in the sketch below.
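A sketch of point (1) in practice (struct item, the field names, and foo_add are invented; the locking calls are the kernel's real API):

```c
#include <linux/spinlock.h>
#include <linux/list.h>

struct item {
    struct list_head node;
};

struct foo {
    spinlock_t lock;
    struct list_head items;
};

/* process context: the same list is also modified from an interrupt handler,
 * so a plain spin_lock() is not enough; local interrupts must be disabled too */
void foo_add(struct foo *f, struct item *it)
{
    unsigned long flags;

    spin_lock_irqsave(&f->lock, flags);
    list_add(&it->node, &f->items);
    spin_unlock_irqrestore(&f->lock, flags);
}
```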

Of course, if you like to chase things back to their origins, you can think about how spinlock (or mutex) is itself implemented, and you will find a contradiction: the job of a spinlock is to make operations on global data mutually exclusive, letting one visitor in while the others wait; but letting one visitor in while the others wait itself requires communication between threads.

In other words, inside the spinlock implementation the waiting threads need to know that the lock is already taken, and the thread that acquires it needs to know that it now owns it. They must all access some global resource to obtain that information. Doesn't that itself create simultaneous access to a shared resource? Then what protects the spinlock implementation?

Software alone cannot solve this, so hardware has to help. Every hardware architecture therefore needs to implement at least atomic operations on single-word variables: on a 64-bit platform, for example, the hardware must provide an instruction, or a set of instructions, that guarantees atomicity for operations such as ++ on a long variable.

With the help of this atomic hardware operation, a spinlock can be implemented like this: the first requester acquires ownership of a global variable through an atomic operation, and later requesters must wait until the previous owner releases that ownership, that is, unlocks. That global variable is the lock variable.
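A deliberately minimal sketch of that idea, written with GCC's atomic builtins rather than the kernel's real (and far more elaborate) qspinlock; all names here are invented:

```c
typedef struct {
    volatile int val;     /* 0 = free, 1 = held: the single-word lock variable */
} toy_spinlock_t;

static void toy_spin_lock(toy_spinlock_t *lock)
{
    /* atomically write 1 and fetch the old value; if the old value was 1,
     * somebody else already owns the lock, so keep trying */
    while (__sync_lock_test_and_set(&lock->val, 1)) {
        while (lock->val)
            ;             /* spin on a plain read until the lock looks free */
    }
}

static void toy_spin_unlock(toy_spinlock_t *lock)
{
    __sync_lock_release(&lock->val);   /* release: store 0 */
}
```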

So the operation on shared global data becomes a competition for the lock, and competing for the lock is really competing for the lock variable. In essence, protecting compound data has been narrowed down to protecting a single-word variable, which the hardware's atomic instructions can then handle.

For example, if multiple execution flows read and write an instance of struct foo, then to avoid synchronization problems they stop contending for struct foo and start contending for foo->spinlock, and the spinlock itself is built on the lock variable lock->val; in the end, every execution flow is contending for that lock variable. From this it is easy to see that a lock does not eliminate contention, it merely narrows it down to a single variable.

In later development, to balance latency, throughput, power consumption and other factors, more and more logic was added to locks. The mutex, for example, first implemented a queueing mechanism, then introduced optimistic spinning to reduce context switches, and then introduced the handoff mechanism to fix the fairness problem that optimistic spinning created. Its cousin rwsem further distinguishes readers from writers on top of mutex, and its implementation is more complex still.

As the saying goes, every medicine is part poison. The lock is a powerful medicine for synchronization problems, but the problems it brings along should not be underestimated:

  • Deadlock and starvation are common, and can bring the system down directly.

  • Locks only narrow contention, they do not eliminate it, so contention persists, and under heavy contention the overhead is still considerable.

  • Lock implementations keep getting more complex, consuming instruction cycles and cache space, and that complexity makes quantitative analysis harder and harder.

  • Each kind of lock has its own problems. The spinlock makes CPUs spin idly: in a highly contended scenario where 8 CPUs fight over the same spinlock at the same time, preemption (and possibly interrupt handling) is disabled while they spin, so those CPUs can do nothing else and simply burn cycles waiting. The mutex adds the cost of futile wakeups on top of its process switches: when several CPUs contend for one mutex, poor lock usage or unlucky timing leads to many wakeups in which the woken process still cannot take the lock and must go back to sleep, and under heavy load that wasted work adds up.

These are problems of the lock implementation itself, born of the irreconcilable tension between performance, fairness, and throughput, and the time spent contending for locks produces no useful work at all.

When the solution itself becomes the biggest problem, when the dragon-slaying boy is about to become the new dragon, we have to look for a more suitable synchronization scheme.

1.3 Other synchronization solutions

When the general-purpose lock cannot be pushed any further, another direction is to split the usage scenarios and find targeted solutions for each specific scenario.

One idea is to keep the form of a lock but distinguish reads from writes, because the two are fundamentally different in nature.

In the discussion above, everyone operating on a shared resource was lumped together as a "visitor", but at the hardware level reads and writes behave very differently on shared data, and writes are usually the culprit behind synchronization problems. For read-mostly scenarios, rwlock and rwsem were therefore derived from spinlock and mutex (rwsem does not really have semaphore semantics; it is better thought of as a sleeping version of rwlock).

The other direction is lock-free design, which itself comes in many flavors. The first is to not use a synchronization mechanism at all, or to minimize its use, because in some scenarios slightly stale global data is acceptable, as demonstrated above.

Another common lock-free approach designs more carefully around the application scenario and uses only the atomic operations provided by the hardware, without introducing heavyweight lock logic such as spinlock, thereby avoiding burning too much CPU on locks.

A second-confirmation mechanism is also common: some shared-data operations could cause synchronization problems, but if the probability of an actual conflict is low enough, we can skip the lock, perform the operation lock-free, then check the result; if the result is not what we expected, we simply retry until the data is updated correctly.
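A sketch of that check-and-retry pattern using the kernel's atomic_cmpxchg (the counter and function names are invented):

```c
#include <linux/atomic.h>

static atomic_t total = ATOMIC_INIT(0);

/* optimistic update: compute the new value from a snapshot, then commit it only
 * if nobody changed the counter in between; otherwise take a new snapshot and retry */
void add_to_total(int delta)
{
    int old, new;

    do {
        old = atomic_read(&total);
        new = old + delta;
    } while (atomic_cmpxchg(&total, old, new) != old);
}
```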

Many lock-free solutions instead attack the "sharing" condition, for example by spending extra memory to avoid contention. The most widely used of these in the kernel is the percpu mechanism. Most engineers do not think of it as a synchronization solution, but it really does address the "sharing" condition behind synchronization problems: global data shared between CPUs is split into one copy per CPU. Synchronization between process context and interrupt context on the same CPU still has to be handled, but synchronization between CPUs is greatly reduced, and the problem shrinks from a multi-core one to a single-core one.
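A sketch of the percpu idea (pkt_count and the helper names are invented; DEFINE_PER_CPU, this_cpu_inc, per_cpu and for_each_possible_cpu are the kernel's real percpu API):

```c
#include <linux/percpu.h>

DEFINE_PER_CPU(unsigned long, pkt_count);

/* hot path: each CPU touches only its own copy, so CPUs never contend;
 * this_cpu_inc() keeps the single local update safe against preemption */
void count_packet_percpu(void)
{
    this_cpu_inc(pkt_count);
}

/* cold path: summing all copies gives a total that may be slightly stale */
unsigned long total_packets(void)
{
    unsigned long sum = 0;
    int cpu;

    for_each_possible_cpu(cpu)
        sum += per_cpu(pkt_count, cpu);
    return sum;
}
```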

Some lock-free solutions target even narrower scenarios. The most common pattern has a hot path and a cold path: the hot path is made lock-free at the cost of extra overhead on the cold path. What counts as hot or cold depends entirely on the application, so these optimizations cannot serve as general-purpose solutions; they are trade-offs for special cases.

And RCU, the subject of this article, is exactly such a lock-free solution for a specific scenario.

2. What is RCU?

2.1 The basic concept of RCU

RCU stands for read-copy-update. The basic idea is: when we need to modify a piece of shared data A, we first copy it into a new buffer B, apply the update to B, and then replace A with B. This is also the most typical usage pattern of RCU.

Clearly this lock-free scheme attacks the "sharing" condition: the update never operates on the live data directly.

From this concept alone we can already sense RCU's first characteristic very intuitively: RCU targets read-mostly scenarios, since this style of update obviously makes writes more expensive.
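To make the pattern concrete before we look at the implementation, here is a sketch of typical usage (struct foo, gp, gp_lock and the function names are invented and gp is assumed to be initialized; the RCU calls themselves are the kernel's real API):

```c
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
    int a;
    int b;
};

static struct foo __rcu *gp;            /* assumed already pointing at valid data */
static DEFINE_SPINLOCK(gp_lock);        /* serializes concurrent updaters only    */

/* reader: cheap, never blocks the updater */
int read_a(void)
{
    struct foo *p;
    int val = 0;

    rcu_read_lock();
    p = rcu_dereference(gp);            /* may see the old or the new version */
    if (p)
        val = p->a;
    rcu_read_unlock();
    return val;
}

/* updater: read-copy-update, then wait for old readers before freeing */
void update_a(int new_a)
{
    struct foo *newp, *oldp;

    newp = kmalloc(sizeof(*newp), GFP_KERNEL);
    if (!newp)
        return;

    spin_lock(&gp_lock);
    oldp = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));
    *newp = *oldp;                      /* copy    */
    newp->a = new_a;                    /* update  */
    rcu_assign_pointer(gp, newp);       /* replace: publish the new version */
    spin_unlock(&gp_lock);

    synchronize_rcu();                  /* wait for all pre-existing readers */
    kfree(oldp);                        /* no reader can still see the old copy */
}
```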

RCU's design concept is so simple that anyone grasps it on first hearing. But when we try to read its implementation the way we would a spinlock, through its lock/unlock interface, we are surprised to find that the lock/unlock implementation is nothing more than disabling and enabling preemption. After double-checking the kernel configuration several times and confirming that this really is the case, a rather absurd feeling sets in: how can read-copy-update be implemented merely by toggling preemption?

2.2 What is the core problem the RCU implementation must solve?

Many people interested in RCU have read plenty of RCU-related articles online and know its operational form, read-copy-update. It naturally feels like a good idea, and it does not seem to need a lock at all: the update operates on different data from what the existing readers are reading, so the sharing condition is not met, and then the replacement is performed. Moreover, all three steps can be carried out entirely by the user; it is hard to see where the operating system would need to step in at all.

When a problem will not untangle itself, it helps to walk through a concrete scenario. Suppose 3 readers and 1 updater access a piece of shared data; the readers keep issuing read operations, while the updater, when it needs to update, copies the data, modifies the copy, and then swaps it in for the old data. The update is then complete, and readers can read the new data.

It sounds reasonable and efficient, but the biggest problem is that we have silently assumed a precondition that does not exist: that the replacement takes effect for all readers immediately, that the old data can be deleted immediately after the swap, and that readers will immediately read the new data.

[Figure: three readers whose read windows overlap the writer copying D1 to D2, modifying it, and publishing D2 as the new shared data]

The figure above illustrates the process: there are three readers, each reader's two arrows mark the start and end of its read operation, and the row in the middle represents the shared data.

As the figure shows, the writer first copies D1 into D2, then modifies D2, and finally makes D2 the new shared data. This is exactly a Read-Copy-Update operation.

Throughout the process, reader1 only ever reads D1 and reader3 only ever reads the new D2. reader2 is the troublesome one: its read operation spans the update from D1 to D2. Does it read D1, D2, or half of D1 and half of D2?

Under a traditional lock-based scheme, the writer would wait for all existing readers to exit, and readers would wait for the writer to finish updating before entering the critical section again. In the figure, the writer would have to wait for reader2 to finish reading before updating. Notice what this is: writers waiting for readers to leave and readers waiting for writers to update is exactly how rwlock behaves. Is RCU then just built on rwlock? If so, why not use rwlock directly? Clearly, an "RCU" implemented by making the replacement take effect immediately would merely be a new synchronization mechanism that nobody would ever use.

So, to beat rwlock's performance, readers and writers must not synchronize with each other through a lock; only then can RCU have a performance advantage in its target scenario. But if they do not synchronize, readers cannot know when the writer updates, and the writer cannot know whether readers exist while it updates. The only remaining option is: even after the writer has updated, reader2 keeps reading the old D1 (it has no way of knowing the data changed), while readers that start after the update naturally read the new data.

This in turn means RCU cannot protect an instance of a compound structure in place the way an ordinary lock does; it can only protect a pointer to dynamically allocated data. Think one step further: if D1 and D2 were the same structure instance, writing D2 would overwrite D1, and reader2 could end up reading half of D1 and half of D2, which is an error. If reader2 must still be able to read D1 after the update, then D1 and D2 have to be two independent pieces of memory.

The earlier question was how to deal with readers that straddle the update point. Having decided that such readers continue reading the old data, the remaining question becomes: how do we know when those readers have finished with the old data, so that its memory can be reclaimed?

This is the problem an RCU implementation in the kernel actually has to solve: how to wait, cheaply, for readers still accessing the old data to exit. The read-copy-update steps themselves can be left entirely to the user. So the kernel's RCU implementation is not really a Read-Copy-Update operation; it is a mechanism for waiting for readers to exit their critical sections.

And since what usually follows the wait for all old readers to exit is releasing the old data, the implementation also closely resembles a garbage collection mechanism.

Combining the two points above, several more characteristics of RCU follow:

  1. Readers are allowed to keep reading old data even after the writer has updated it; the kernel's RCU implementation must ensure that all readers that can still see the old data have exited before the old data is deleted.

  2. The object protected by the RCU synchronization mechanism cannot be a compound structure directly; only a pointer to dynamically allocated data can be protected.

  3. Pursuing the ultimate read-side performance is the foundation of RCU in the kernel.

2.3 Discussing the RCU implementation

If the previous section expressed the logic of the kernel's RCU clearly and you have followed it, the next thing to discuss is how to wait, at low cost, for old readers to exit their critical sections. From here on we are truly inside the RCU implementation.

To wait for an event to end, the most obvious solution is to record its beginning: enter with a ticket and hand it back on the way out, so that whether entries and exits balance tells you whether anyone is still inside. This idea, of course, has already been shown above to be too expensive: recording the start of a read means a global write, and once the read-side critical section performs a global write, multi-core concurrency brings back the very synchronization problems we are trying to avoid, at no small cost. The global write could be replaced by a percpu counter to recover some of the loss, but that is still only a palliative; the ideal is zero synchronization overhead on the read side, that is, not recording entry into the read critical section at all.

Another idea is to borrow some other event to complete the wait: can we conclude, from events that already happen anyway, that the condition we are waiting for has been met, without recording the event directly?

The kernel's RCU uses exactly such an ingenious trick: a read-side critical section is implemented simply by disabling and re-enabling preemption. A reader disables preemption on entering the critical section and re-enables it on leaving, and scheduling can only happen while preemption is enabled. So for the writer to wait for all earlier readers to exit, it only needs to wait for every CPU to go through scheduling once.
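A deliberately oversimplified model of that idea, in the spirit of the classic non-preemptible implementation (run_on() is a hypothetical helper that forces the caller to run, and therefore to schedule, on the given CPU; the real kernel tracks quiescent states far more cleverly):

```c
static inline void toy_rcu_read_lock(void)
{
    preempt_disable();        /* the reader cannot be scheduled away...        */
}

static inline void toy_rcu_read_unlock(void)
{
    preempt_enable();         /* ...until it leaves the critical section       */
}

void toy_synchronize_rcu(void)
{
    int cpu;

    /* once every CPU has performed a context switch, no CPU can still be inside
     * a read-side critical section that began before this call: the grace
     * period for all pre-existing readers is over */
    for_each_online_cpu(cpu)
        run_on(cpu);          /* hypothetical: migrate to 'cpu', forcing it to schedule */
}
```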

One thing needs emphasizing here. The crucial words in the previous paragraph are: all earlier readers.

[Figure: reader1 and reader2 begin before the writer's update and still reference D1; reader3 begins after the update and reads D2]

Referring to the figure above, before the writer updates, reader1 and reader2 still reference D1, while reader3 has already read the new data; so only reader1 and reader2 need to be waited for before D1 can be released.

Throughout reader1's read, it runs with preemption disabled. If reader1 runs on cpu0, then after the writer's update it is enough to observe that scheduling has occurred once on cpu0 to conclude that reader1 has exited the critical section: scheduling requires preemption to be enabled on cpu0, which means reader1 has finished reading.

The period after the updater finishes updating, during which it waits for all earlier readers to exit their critical sections, is called the grace period. Once the grace period has passed, the update is complete and all earlier readers are gone, and the old data can be released. If the update was a simple addition, there is of course no old data to delete; it is enough to know the update has taken effect.

Of course, waiting for all earlier readers to exit their critical sections can take a relatively long time, even tens of milliseconds, and this must be taken into account before deciding to use RCU for synchronization.

This leads to two more characteristics of RCU:

  1. The RCU read-side critical section under Linux is implemented by disabling and re-enabling preemption; its performance and multi-core scalability are excellent, but obviously the read-side critical section then cannot be preempted or sleep.

  2. The write side has a certain latency: for a window of time, readers may observe either the new or the old data.

[Figure: CPU1 sets gptr to NULL and blocks in synchronize_rcu; CPU2 schedules after its read-side critical section; CPU3 runs several read-side critical sections but only schedules much later, finally ending the grace period so CPU1 can free the old memory]

The figure above is a simple example. The updater on CPU1 sets gptr to NULL and then calls synchronize_rcu to block until all earlier readers have exited their critical sections. On CPU2, a schedule occurs after the read-side critical section around bar, so CPU2 has passed through the critical section. CPU3 actually enters and exits read-side critical sections three times, but because no process switch is triggered, the RCU core cannot tell that CPU3 has passed a critical section; only when CPU3 finally schedules has the whole system gone through a complete grace period, at which point the task blocked on CPU1 can continue to run and the corresponding memory is freed.
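In code, CPU1's side of the figure looks roughly like this (a sketch reusing the invented struct foo and an invented global gptr; a single updater and prior initialization are assumed, and error handling is omitted):

```c
static struct foo __rcu *gptr;

void unpublish_and_free(void)
{
    struct foo *old;

    old = rcu_dereference_protected(gptr, 1);  /* single updater assumed      */
    rcu_assign_pointer(gptr, NULL);            /* step 1: unpublish the data  */
    synchronize_rcu();                         /* step 2: wait out the grace period */
    kfree(old);                                /* step 3: no reader can reach it now */
}
```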

Finally, let's summarize the characteristics of RCU as a whole:

  1. RCU targets read-mostly usage scenarios.

  2. The write side has a certain latency: for a window of time, readers may observe either the new or the old data.

  3. Readers are allowed to keep reading old data even after the writer has updated it; the kernel's RCU implementation must ensure that all readers that can still see the old data have exited before the old data is deleted.

  4. The object protected by the RCU synchronization mechanism cannot be a compound structure directly, only a pointer.

  5. RCU pursues the ultimate read-side performance, which is its foundation in the kernel.

  6. The classic RCU read-side critical section under Linux is implemented by disabling and re-enabling preemption; performance and multi-core scalability are excellent, but the read-side critical section cannot be preempted or sleep.

3. Conclusion

There is in fact much more to say about RCU: its usage and implementation, its variants, its evolution, and source-code analysis. RCU as a whole is a very large system.

In my experience, a long text like this without anything light in it should not run too long. If you are really interested in RCU, next time we will walk through the usage and implementation of RCU together.

