A bug hidden on the critical path for 11 months: troubleshooting DragonOS process switching errors

Before the foreword

DragonOS is an independently developed, server-oriented open source operating system. Its kernel and user-space environment are written from scratch, and it provides Linux compatibility.

Official website: https://DragonOS.org

Code repository: fslongjin/DragonOS on GitHub (a 64-bit x86_64 operating system)

Foreword

While writing DragonOS, I kept running into baffling bugs, including but not limited to:

  • Adding a line of printk(""); makes the code run normally.
  • Reading or writing a few unrelated variables makes the code run.
  • Wrapping a function in one more layer of function calls makes the code run.
  • In October, my classmates and I were debugging the IDR source code, and one unit test case kept failing. The location of the error was different every time, and after reducing the data size of the test case, no error was reported.
  • When the XHCI driver initializes, it randomly reports an error, and it sometimes initializes normally after the system is restarted.

Every time I hit one of these bugs, I could not figure it out. They felt like metaphysical problems: I did not know how to solve them and had no idea where to start. Only recently, while refactoring the CFS scheduler in Rust, did I realize that all of the above symptoms came from errors in the process switching code.

Let me state the conclusion first. The bug had two causes:

  • Inline assembly code with undefined behavior.
  • Before a process switch, one call path did not fully save the execution context. (That is, the context was saved on some paths and not on others.)

How did I discover this bug?

First, I refactored the CFS scheduler in Rust. The logic is not complicated, and it was implemented quickly.

The original C version of the code performs a process switch by invoking two macros, switch_mm() (named process_switch_mm in the source) and switch_proc(), which switch the page table and the process context respectively.

See lines 84 and 86 of cfs.c in DragonOS-0.1.2:

http://opengrok.ringotek.cn/xref/DragonOS-0.1.2/kernel/src/sched/cfs.c#84

These two macros consist mainly of assembly code, and look like this:

http://opengrok.ringotek.cn/xref/DragonOS-0.1.2/kernel/src/process/process.h?fi=process_switch_mm#process_switch_mm

http://opengrok.ringotek.cn/xref/DragonOS-0.1.2/kernel/src/process/process.h?fi=switch_proc#switch_proc

A brief introduction to what these two macros do:

  • The process_switch_mm macro loads the page table base address of the next process into CR3, the page table base register.
  • The switch_proc macro first saves the rbp register (the base address of the current stack frame) and the rsp register (the current stack pointer) in the thread structure of the current process. It then switches to the kernel stack of the next process, obtains a return address for the current process (the address of the switch_proc_ret_addr label), and stores it in the rip member of the current process's thread structure. Next, it pushes the return address of the next process (next->thread->rip) onto the kernel stack of the next process and jumps to the __switch_to function, which performs the remaining work. (Note that this is a jmp, not a call, so no return address is pushed here.) When __switch_to returns, the processor pops the "RIP of the next process" pushed on line 63, which completes the process switch.
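The steps performed by the two macros can be modelled in plain C. The sketch below is not the kernel code: it replaces the inline assembly with assignments on a made-up cpu_model struct, and the names cpu_model, task, pgd_phys, model_switch_mm and model_switch_proc are assumptions introduced here for illustration. It only mirrors the order of operations described above.

```c
#include <assert.h> /* used by the usage example */
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's thread structure. */
struct thread_struct {
    uint64_t rbp; /* saved frame pointer */
    uint64_t rsp; /* saved kernel stack pointer */
    uint64_t rip; /* address at which the task resumes */
};

/* Hypothetical task: the thread state plus its page table base address. */
struct task {
    struct thread_struct thread;
    uint64_t pgd_phys;
};

/* Hypothetical model of the CPU registers touched by the switch. */
struct cpu_model {
    uint64_t rbp, rsp, rip, cr3;
};

/* Models process_switch_mm: load the next task's page table base into CR3. */
void model_switch_mm(struct cpu_model *cpu, const struct task *next)
{
    cpu->cr3 = next->pgd_phys;
}

/* Models switch_proc; resume_addr plays the role of switch_proc_ret_addr. */
void model_switch_proc(struct cpu_model *cpu, struct task *prev,
                       const struct task *next, uint64_t resume_addr)
{
    /* 1. Save the current frame pointer and stack pointer. */
    prev->thread.rbp = cpu->rbp;
    prev->thread.rsp = cpu->rsp;
    /* 2. Switch to the kernel stack of the next task. */
    cpu->rsp = next->thread.rsp;
    /* 3. Record where the current task will resume later. */
    prev->thread.rip = resume_addr;
    /* 4. "push next->thread.rip; jmp __switch_to": the ret at the end of
     *    __switch_to pops the pushed value into rip, so rsp is unchanged
     *    overall and execution continues in the next task. */
    cpu->rip = next->thread.rip;
}
```

For example, starting from a CPU state of {rbp = 0x1000, rsp = 0x0ff0} and a next task whose thread holds {rsp = 0x1ff0, rip = 0x4242}, calling model_switch_mm and then model_switch_proc leaves 0x1000 and 0x0ff0 stored in prev->thread and the model CPU running at rip 0x4242 on stack 0x1ff0. Note that nothing besides rbp, rsp and rip is recorded, which matters for the rest of the story.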

Later experiments proved that there were two errors, one of which occurred in the inline assembly code of the switch_proc macro.

Back to refactoring CFS: I wanted to perform the process switch from Rust code. Because writing inline assembly is somewhat cumbersome, the simplest and most direct approach was to add a C function that wraps the two macros switch_proc and switch_mm, and then call that C function directly from Rust.

So I wrapped the two macros like this:

http://opengrok.ringotek.cn/xref/DragonOS/kernel/src/sched/core.c?r=d4f3de93#9

Note: to avoid ambiguity, I renamed the original switch_proc() macro to switch_to() here. From this point on, switch_to refers to the former switch_proc macro.

Then I called this function from the Rust code. At first I thought everything would be fine, but at run time the processor raised a General Protection exception during process scheduling, and the fault occurred at the ret instruction of the __switch_to function. (The fs and gs registers are switched in the __switch_to function.)

Many conditions can raise this exception. I consulted Section 6.15 of Volume 3A of the Intel Software Developer's Manual for its description of the causes of the General Protection exception.

Much of that description concerns the segment selector registers, and the fs and gs registers are switched in the __switch_to function. So I carefully checked the values of the cs, ds, es, fs, gs, and ss segment selector registers before and after the process switch, as well as the values about to be swapped in. Their values were correct, and so were their permissions.

Debugging had reached a dead end.

Solving the bug

I kept asking myself: why do these two macros work when used directly, but not when wrapped in a separate function? Was it a compiler instruction reordering problem, or a processor out-of-order execution problem? I added a memory barrier, but that did not fix it.

The first cause of the bug: the execution context is not fully saved

At this point I checked the code and found that when sched() is called at the end of an interrupt, the context has already been saved on interrupt entry. In every other case, where sched() is called directly, the current execution context of the process was not saved! That reminded me of the strange bugs mentioned at the beginning of the article. Putting them together, I suddenly realized that those metaphysical bugs arose precisely because a process was scheduled out without its execution context being saved. When the process was scheduled again, the missing context data caused an error. The randomness of the errors came from the nondeterministic timing of scheduling!

My solution is therefore: the scheduler must run in interrupt context, which guarantees that the execution context is fully saved. To support scenarios that require immediate scheduling (as opposed to scheduling triggered by the clock interrupt), I added a new system call to DragonOS: sys_sched(). The original sched() function now simply issues a SYS_SCHED system call. Because the interrupt handling mechanism saves the execution context before the system call is entered, this solves the problem of the process's execution context not being saved.
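Why a partially saved context produces seemingly random failures can be illustrated with a toy model. The following is a user-space sketch, not DragonOS code: the regs struct, its five fields, and all function names are hypothetical, and the assumption that the direct path preserved only rbp and rsp is a simplification based on the description of the switch macro earlier in the article.

```c
#include <assert.h> /* used by the usage example */
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced register file. */
struct regs {
    uint64_t rax, rbx, rcx, rbp, rsp;
};

/* Interrupt or system call entry: every register is saved and restored. */
void save_full(const struct regs *cpu, struct regs *slot) { *slot = *cpu; }
void restore_full(struct regs *cpu, const struct regs *slot) { *cpu = *slot; }

/* Assumed behaviour of the direct path: only rbp and rsp are preserved. */
void save_partial(const struct regs *cpu, struct regs *slot)
{
    slot->rbp = cpu->rbp;
    slot->rsp = cpu->rsp;
}

void restore_partial(struct regs *cpu, const struct regs *slot)
{
    cpu->rbp = slot->rbp;
    cpu->rsp = slot->rsp;
}

/* Another task runs in between and overwrites every register. */
void run_other_task(struct regs *cpu)
{
    memset(cpu, 0xAA, sizeof *cpu);
}

/* Returns 1 if the task sees exactly its old registers after being
 * switched out and back in, 0 otherwise. */
int survives_switch(int via_interrupt)
{
    const struct regs before = {1, 2, 3, 0x7000, 0x6ff0};
    struct regs cpu = before;
    struct regs slot = {0};

    if (via_interrupt)
        save_full(&cpu, &slot);
    else
        save_partial(&cpu, &slot);

    run_other_task(&cpu);

    if (via_interrupt)
        restore_full(&cpu, &slot);
    else
        restore_partial(&cpu, &slot);

    return memcmp(&cpu, &before, sizeof cpu) == 0;
}
```

In this model survives_switch(1) returns 1 and survives_switch(0) returns 0. Whether the lost values matter in a real kernel depends on what the interrupted code was keeping in those registers at that moment, which is one way to read the randomness described above. Routing every switch through the interrupt path, as sys_sched() does, removes the partial case altogether.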

The actual code is at the following links:

http://opengrok.ringotek.cn/xref/DragonOS/kernel/src/arch/x86_64/sched.rs?r=d4f3de93#6

http://opengrok.ringotek.cn/xref/DragonOS/kernel/src/sched/core.rs?r=d4f3de93#78

With sys_sched() in place, every path that can run the scheduler and switch processes saves the process context. I thought the problem should be solved. But when I ran it, it still failed, with the same familiar General Protection exception.

So I re-examined the code above. After an hour of thinking, I was convinced that what I had found was indeed a bug, and that the remaining error must come from another bug I had not yet discovered.

The second cause of the bug: the inline assembly of the switch_to macro has undefined behavior

I thought it over for a long time and was convinced that the problem had to be in one of two places: switch_to or __switch_to. Yet I found nothing abnormal in the register values before and after these two places, or in the values about to be swapped in. I stared at the code of the switch_to() macro for a long time, and there it was: something was wrong with it!

http://opengrok.ringotek.cn/xref/DragonOS-0.1.2/kernel/src/process/process.h?fi=switch_proc#switch_proc

In this assembly, I modify the value of the rax register, but rax appears in neither the input nor the output operands of the inline assembly, and it is not declared in the clobber list either. GCC has no way of knowing that this assembly changes rax! The code therefore has undefined behavior: the compiler may be using rax to hold temporary data, which this assembly then destroys. The fix is simply to add "rax" to the clobber list (line 70 of the code linked below). After that, the bug is gone!

http://opengrok.ringotek.cn/xref/DragonOS/kernel/src/process/process.h?r=d4f3de93#54
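The role of the clobber list can be shown in isolation. The function below is a minimal sketch unrelated to the DragonOS source (the name add_with_scratch is made up): it uses rax as a scratch register inside GCC-style extended inline assembly and declares it as clobbered, so the compiler will not keep a live value in rax across the statement. It assumes GCC or Clang on x86-64; on other targets it falls back to plain C.

```c
#include <assert.h> /* used by the usage example */
#include <stdint.h>

/* Returns a + b, computing the sum in rax on x86-64. */
uint64_t add_with_scratch(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t out;
    __asm__ volatile(
        "movq %1, %%rax\n\t" /* rax = a                          */
        "addq %2, %%rax\n\t" /* rax += b (also changes flags)    */
        "movq %%rax, %0\n\t" /* out = rax                        */
        : "=r"(out)          /* output operand                   */
        : "r"(a), "r"(b)     /* input operands                   */
        : "rax", "cc");      /* clobbers: rax and the flags      */
    return out;
#else
    /* Fallback for targets where the assembly above does not apply. */
    return a + b;
#endif
}
```

A call such as add_with_scratch(2, 3) yields 5. Removing "rax" from the clobber list leaves code that may still appear to work, because the outcome depends on whether the compiler happened to keep something in rax at that point. That is consistent with the symptoms listed in the foreword, where an unrelated printk or an extra function call changed the behavior.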

Follow-up test

To verify my suspicion that the failure of the large-data IDR test cases and the random assertion failures were caused by the process switching bug, I re-ran all of the IDR test cases. They all passed.

Summary

It took me 5 days to debug this bug. Counting the time previously spent on "metaphysical problems" while debugging the real-time scheduler, IDR, XHCI, and other modules, the total may be close to a month. Code with undefined behavior and the bug of not saving the context really wasted a lot of time for me and my friends.

CodeQL, cppcheck, ControlFlag, and Tencent Cloud's code inspection service all failed to detect this bug; it was hidden deep indeed. Perhaps that is because those tools were developed for checking application software.

Hunting bugs is not easy. If you are interested, you are welcome to follow my official account "Denglong"~

When reprinting, please cite the original article: longjin666.cn/?p=1667

Origin blog.csdn.net/qq_34026204/article/details/128546479