1- Kernel Error Handling

When a fatal error occurs in the kernel, as long as the CPU can still operate normally, the most important thing is to output detailed error information to the user and to save the scene of the error. Such fatal errors fall into the following two types:

(1) Errors that can be detected by the hardware, such as illegal memory accesses and illegal instructions. In this case the CPU triggers an exception and enters the exception handling process, during which oops or panic is triggered.
(2) The kernel code enters an abnormal branch that it cannot handle. If execution continued, the consequences would be unpredictable, so the relevant code actively enters oops or panic.

Among them, panic means what its name suggests: the kernel cannot continue to run. Depending on the configuration, it decides whether to dump crash memory, sends a notifier notification to the modules that care about the panic event, prints the system information related to the panic, and finally hangs or reboots the system.

The severity of oops is lower than that of panic, so in general it just outputs the relevant error messages and kills the current process without hanging the kernel. But if the oops occurs in interrupt context, or the panic_on_oops option is set, it also enters panic.

2- arm64 exception information registers

For the arm64 architecture, if the CPU enters an exception because of a memory access error or a similar problem, the cause of the exception can be obtained from the esr register, and the address of the faulting memory access can be obtained from the far register. The esr register is defined as follows:

[Figure: esr register layout]

EC in the above figure indicates the exception type, and some typical values are as follows:

(1) b100000: instruction abort from a lower exception level, such as a faulting instruction fetch in user mode
(2) b100001: instruction abort taken at the current exception level
(3) b100010: pc alignment fault
(4) b100100: data abort from a lower exception level, such as a faulting memory access in user mode
(5) b100101: data abort taken at the current exception level
(6) b100110: stack pointer (sp) alignment fault
(7) b101111: SError interrupt, an asynchronous exception that generally comes from an external abort, such as an abort reported by the bus during a memory access

IL indicates the length of the instruction that triggered the exception, and its values are as follows:

(1) 0: a 16-bit instruction (Thumb)
(2) 1: a 32-bit instruction
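
The field split described so far can be sketched in a few lines of user-space C. This is only an illustration, not kernel code: the `demo_` names are made up for this example, and the bit positions (EC in bits [31:26], IL in bit 25, ISS in the remaining bits [24:0]) are taken from the Arm architecture's ESR_ELx layout rather than from the missing figure.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions in the 32-bit syndrome value (assumed from the ESR_ELx layout). */
#define DEMO_ESR_EC_SHIFT	26		/* EC  = bits [31:26] */
#define DEMO_ESR_EC_MASK	0x3fu
#define DEMO_ESR_IL		(1u << 25)	/* IL  = bit 25       */
#define DEMO_ESR_ISS_MASK	0x01ffffffu	/* ISS = bits [24:0]  */

struct demo_esr {
	unsigned int ec;	/* exception class */
	unsigned int il;	/* 0: 16-bit instruction, 1: 32-bit instruction */
	uint32_t iss;		/* syndrome bits whose meaning depends on ec */
};

/* Split a raw esr value into its three fields. */
struct demo_esr demo_esr_decode(uint32_t esr)
{
	struct demo_esr d;

	d.ec  = (esr >> DEMO_ESR_EC_SHIFT) & DEMO_ESR_EC_MASK;
	d.il  = (esr & DEMO_ESR_IL) ? 1u : 0u;
	d.iss = esr & DEMO_ESR_ISS_MASK;
	return d;
}

/* Name the EC values listed above; anything else is reported as "other". */
const char *demo_ec_name(unsigned int ec)
{
	assert(ec <= DEMO_ESR_EC_MASK);	/* EC is a 6-bit field */

	switch (ec) {
	case 0x20: return "instruction abort (lower EL)";
	case 0x21: return "instruction abort (current EL)";
	case 0x22: return "pc alignment fault";
	case 0x24: return "data abort (lower EL)";
	case 0x25: return "data abort (current EL)";
	case 0x26: return "sp alignment fault";
	case 0x2f: return "SError interrupt";
	default:   return "other";
	}
}
```

For example, demo_esr_decode(0x96000045) yields EC 0x25 (a data abort at the current exception level), IL 1, and ISS 0x45.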

ISS gives the specific cause for each exception type, and its layout varies with the EC. For example, if EC is a data abort, the corresponding ISS is defined as follows (for the specific meaning, refer to the Armv8 architecture reference manual):

[Figure: ISS encoding for data abort exceptions]

Among them, DFSC (data fault status code) gives the information related to the data abort, and the following is part of its definition:

[Figure: partial DFSC definition]

In addition, for data abort type exceptions, the abort address is very important for analyzing the cause of the exception, so the armv8 architecture provides this address (a virtual address) through the far register, which is defined as follows:

[Figure: far register layout]
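
As a rough illustration of how DFSC can be read out of the ISS of a data abort, the following user-space sketch decodes the fault type, the page table level, and the read/write direction. The `demo_` names are hypothetical, and the encodings used (DFSC in ISS bits [5:0], WnR in ISS bit 6, 0b00TTLL for the level-based faults, 0b100001 for alignment faults) are assumptions taken from the architecture manual, since the original figures are not available here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ISS_DFSC_MASK	0x3fu		/* DFSC = ISS bits [5:0] */
#define DEMO_ISS_WNR		(1u << 6)	/* WnR  = ISS bit 6: write, not read */

enum demo_fault_type {
	DEMO_FAULT_ADDRESS_SIZE,
	DEMO_FAULT_TRANSLATION,
	DEMO_FAULT_ACCESS_FLAG,
	DEMO_FAULT_PERMISSION,
	DEMO_FAULT_ALIGNMENT,
	DEMO_FAULT_OTHER,
};

struct demo_data_abort {
	enum demo_fault_type type;
	unsigned int level;	/* page table level, valid for the first four types only */
	bool is_write;		/* true if the faulting access was a write */
	uint64_t far;		/* faulting virtual address, as read from far */
};

/* Decode the ISS of a data abort together with the address from far. */
struct demo_data_abort demo_decode_data_abort(uint32_t iss, uint64_t far)
{
	static const enum demo_fault_type types[4] = {
		DEMO_FAULT_ADDRESS_SIZE, DEMO_FAULT_TRANSLATION,
		DEMO_FAULT_ACCESS_FLAG, DEMO_FAULT_PERMISSION,
	};
	unsigned int dfsc = iss & DEMO_ISS_DFSC_MASK;
	struct demo_data_abort a = { DEMO_FAULT_OTHER, 0, false, far };

	a.is_write = (iss & DEMO_ISS_WNR) != 0;

	if (dfsc < 0x10) {
		/* 0b00TTLL: TT selects the fault type, LL the table level. */
		assert((dfsc >> 2) < 4);
		a.type = types[dfsc >> 2];
		a.level = dfsc & 0x3;
	} else if (dfsc == 0x21) {
		a.type = DEMO_FAULT_ALIGNMENT;
	}
	return a;
}
```

With ISS 0x45, for instance, the sketch reports a level 1 translation fault caused by a write.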

3- Exception handling process

After a synchronous exception occurs in the kernel, the CPU jumps to the corresponding exception handling entry.

After performing basic tasks such as context saving and stack pointer switching, the entry code jumps to a handler for the specific exception class. If the cpu was executing in AArch64 state at EL1 when the exception occurred, and the stack pointer in use was sp_el1, it jumps to el1h_64_sync_handler.

This function obtains the exception type from the EC field of the esr_el1 register and then calls the processing function for that type. That function in turn generally obtains the specific cause from the ISS field of esr_el1 and handles it accordingly.

If the exception really was caused by an illegal operation, the processing flow calls oops or panic to report the error to the user and then kills the current process or hangs the system. (An exception is not necessarily an error: page faults, breakpoints, single-stepping, and other debug exceptions are part of normal processing.)

Since the kernel has many types of exceptions and their processing is similar, the following takes an illegal address access by the kernel in AArch64 state as an example. The corresponding processing flow is as follows:

[Figure: processing flow for an illegal kernel address access]

3.1 data abort processing flow

el1h_64_sync_handler first reads the esr_el1 register, parses the EC field, and calls the processing function that corresponds to the EC value. For example, el1_abort is called for a data abort. The code is as follows:

asmlinkage void noinstr el1h_64_sync_handler(struct pt_regs *regs)
{
	unsigned long esr = read_sysreg(esr_el1);	/* syndrome of the exception just taken */

	switch (ESR_ELx_EC(esr)) {			/* dispatch on the exception class (EC) */
	case ESR_ELx_EC_DABT_CUR:
	case ESR_ELx_EC_IABT_CUR:
		el1_abort(regs, esr);
		break;
	case ESR_ELx_EC_PC_ALIGN:
		el1_pc(regs, esr);
		break;
	…
	default:
		__panic_unhandled(regs, "64-bit el1h sync", esr);
	}
}

el1_abort reads the fault address from far_el1 and calls do_mem_abort, which selects a processing function according to the DFSC value in the esr_el1 register. These functions are defined in the fault_info array shown below:

static const struct fault_info fault_info[] = {
	…
	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 0 translation fault"		},
	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 1 translation fault"		},
	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 2 translation fault"		},
	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 3 translation fault"		},
	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 8"			},
	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 1 access flag fault"	},
	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 2 access flag fault"	},
	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 3 access flag fault"	},
	…
};
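
The table-driven dispatch can be imitated in user space as follows. This is a minimal sketch under stated assumptions: the `demo_` names and handlers are invented for illustration, only a handful of slots are filled in, and the lookup is assumed to index the table with the low 6 bits of esr, which the article does not show explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FSC_MASK	0x3fu	/* fault status code = esr bits [5:0] */
#define DEMO_FSC_COUNT	64

/* A handler returns 0 when it has dealt with the fault, nonzero otherwise. */
typedef int (*demo_fault_fn)(uint64_t far, uint32_t esr);

struct demo_fault_info {
	demo_fault_fn fn;
	const char *name;
};

int demo_translation_fault(uint64_t far, uint32_t esr)
{
	(void)far;
	(void)esr;
	return 0;	/* pretend the fault was resolved */
}

int demo_bad(uint64_t far, uint32_t esr)
{
	(void)far;
	(void)esr;
	return 1;	/* nobody can handle this code */
}

/* Only a few slots are filled in; the rest stay zero-initialized. */
const struct demo_fault_info demo_fault_table[DEMO_FSC_COUNT] = {
	[0x04] = { demo_translation_fault, "level 0 translation fault" },
	[0x05] = { demo_translation_fault, "level 1 translation fault" },
	[0x06] = { demo_translation_fault, "level 2 translation fault" },
	[0x07] = { demo_translation_fault, "level 3 translation fault" },
	[0x08] = { demo_bad,               "unknown 8" },
};

/* Fallback for the slots this sketch leaves empty. */
const struct demo_fault_info demo_unlisted = { demo_bad, "not listed in this sketch" };

/* Pick the table entry selected by the fault status code. */
const struct demo_fault_info *demo_esr_to_fault_info(uint32_t esr)
{
	const struct demo_fault_info *inf = &demo_fault_table[esr & DEMO_FSC_MASK];

	return inf->fn ? inf : &demo_unlisted;
}

/* Look up the handler and run it; the result says whether the fault was handled. */
int demo_mem_abort(uint64_t far, uint32_t esr)
{
	const struct demo_fault_info *inf = demo_esr_to_fault_info(esr);

	assert(inf != NULL && inf->fn != NULL);
	return inf->fn(far, esr);
}
```

Keeping the handlers in a table indexed by the status code means that supporting a new code is a one-line change, and every code gets a printable name for the error report.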

The following is the code flow of do_mem_abort:

void do_mem_abort(unsigned long far, unsigned int esr, struct pt_regs *regs)
{
	const struct fault_info *inf = esr_to_fault_info(esr);	/* (1) */
	unsigned long addr = untagged_addr(far);		/* (2) */

	if (!inf->fn(far, esr, regs))				/* (3) */
		return;

	if (!user_mode(regs)) {					/* (4) */
		pr_alert("Unhandled fault at 0x%016lx\n", addr);
		mem_abort_decode(esr);
		show_pte(addr);
	}

	arm64_notify_die(inf->name, regs, inf->sig, inf->code, addr, esr);
}

(1) Select the entry in the fault_info array, and with it the processing function pointer, according to the DFSC value
(2) The arm64 architecture can use the otherwise unused high-order bits of a virtual address to store tag information, which features such as MTE rely on. The tag therefore has to be removed first to obtain the actual virtual address
(3) Call the callback function obtained from fault_info. For illegal address access errors, the corresponding callback is do_translation_fault. A return value of 0 means the fault has been handled
(4) If the callback could not handle the fault, report the error through the following process: for a fault taken in kernel mode, print the fault address, the decoded esr, and the page table entries, then call arm64_notify_die
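
To make note (2) more concrete, the sketch below strips the tag byte from a 64-bit virtual address by sign-extending from bit 55. It is a user-space illustration with made-up `demo_` names, not the kernel's untagged_addr. It assumes that the tag lives in bits [63:56] and that bit 55 tells the user half of the address space from the kernel half; how the kernel itself treats kernel-half addresses can differ between versions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_VA_SELECT_BIT	55			/* 0: user half, 1: kernel half */
#define DEMO_TAG_MASK		0xff00000000000000ull	/* tag = bits [63:56] */

/* Return the tag stored in the top byte of the address. */
unsigned int demo_addr_tag(uint64_t addr)
{
	return (unsigned int)(addr >> 56);
}

/* Replace the tag byte with copies of bit 55 (sign extension from bit 55). */
uint64_t demo_untag_addr(uint64_t addr)
{
	uint64_t out;

	if (addr & (1ull << DEMO_VA_SELECT_BIT))
		out = addr | DEMO_TAG_MASK;	/* kernel half: top byte becomes 0xff */
	else
		out = addr & ~DEMO_TAG_MASK;	/* user half: top byte becomes 0x00 */

	/* Sanity check: no tag information is left in the result. */
	assert((out >> 56) == 0x00 || (out >> 56) == 0xff);
	return out;
}
```

For a tagged user address such as 0x5a00007fff001000, the tag is 0x5a and the untagged address is 0x0000007fff001000.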

do_translation_fault chooses its processing function according to whether the faulting address belongs to the user (ttbr0) address range or to the kernel address range. The code is as follows:

static int __kprobes do_translation_fault(unsigned long far,
					  unsigned int esr,
					  struct pt_regs *regs)
{
	…
	if (is_ttbr0_addr(addr))
		return do_page_fault(far, esr, regs);	/* (1) */

	do_bad_area(far, esr, regs);			/* (2) */
	return 0;
}

(1) The address is in the user address range: handle it as an ordinary page fault
(2) The address is in the kernel address range: handle it as a bad access
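
The routing decision can be sketched in user space as shown below. Everything here is a simplification with hypothetical `demo_` names: the range check just tests bit 55 of the address instead of using the kernel's is_ttbr0_addr, and the split of the bad-area case into "signal to the user process" and "kernel fault" is an assumption about do_bad_area, whose code the article does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum demo_outcome {
	DEMO_PAGE_FAULT,	/* user-range address: ordinary page fault path */
	DEMO_USER_SIGNAL,	/* kernel-range address touched from user mode */
	DEMO_KERNEL_FAULT,	/* kernel-range address touched from kernel mode */
	DEMO_OUTCOME_COUNT,
};

/* Assumption of this sketch: bit 55 clear means the address is translated via ttbr0. */
bool demo_is_ttbr0_addr(uint64_t addr)
{
	return (addr & (1ull << 55)) == 0;
}

/* Decide which path a translation fault on addr would take. */
enum demo_outcome demo_route_translation_fault(uint64_t addr, bool from_user_mode)
{
	if (demo_is_ttbr0_addr(addr))
		return DEMO_PAGE_FAULT;

	return from_user_mode ? DEMO_USER_SIGNAL : DEMO_KERNEL_FAULT;
}

/* Printable name for an outcome. */
const char *demo_outcome_name(enum demo_outcome o)
{
	static const char *const names[DEMO_OUTCOME_COUNT] = {
		"page fault", "signal to user process", "kernel fault",
	};

	assert((unsigned int)o < DEMO_OUTCOME_COUNT);
	return names[o];
}
```

Note that a NULL pointer dereference inside the kernel faults on a user-range address, so it takes the page fault path first and is recognized as a kernel error there.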

For an exception taken in kernel mode that cannot be fixed up, do_bad_area eventually calls die_kernel_fault to perform the actual error handling. Its code is as follows:

static void die_kernel_fault(const char *msg, unsigned long addr,
			     unsigned int esr, struct pt_regs *regs)
{
	…
	mem_abort_decode(esr);		/* (1) */

	show_pte(addr);			/* (2) */
	die("Oops", regs, esr);		/* (3) */
	bust_spinlocks(0);
	do_exit(SIGKILL);		/* (4) */
}

(1) It will parse the value of the esr_el1 register and print its related contents, such as EC, IL, DFSC, etc.
(2) This function will print the page table information corresponding to the abnormal address, including pgd, p4d, pud, pmd, and pte, etc.
(3) Perform the actual die operation, which will be highlighted in the next section
(4) Kill the current process

3.2 die processing flow

The die function mainly executes the oops-related process. If the exception was triggered in interrupt context or the panic_on_oops option is set, it goes on to hang the system through panic. Its main process is as follows:

void die(const char *str, struct pt_regs *regs, int err)
{
	…
	ret = __die(str, err, regs);			/* (1) */

	if (regs && kexec_should_crash(current))
		crash_kexec(regs);			/* (2) */
	…
	if (in_interrupt())
		panic("%s: Fatal exception in interrupt", str);
	if (panic_on_oops)				/* (3) */
		panic("%s: Fatal exception", str);
	…
}

(1) Print the oops-related information and call the notifiers registered on the die notification chain so that they can perform their die-related operations
(2) If a crash dump is required, start the crash kernel through this function. The new kernel then dumps the system memory for post-mortem analysis. The crash kernel can be set up through a mechanism such as kdump or ramdump
(3) If the exception occurred in an interrupt, or panic_on_oops is set, call panic to hang the system
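
The escalation rule at the end of die can be written down as a small pure function. This is a user-space sketch with hypothetical `demo_` names. It models only the two checks shown above, in the same order, and leaves out the crash kernel path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_oops_end {
	DEMO_KILL_TASK,			/* plain oops: only the current task is killed */
	DEMO_PANIC_IN_INTERRUPT,	/* oops in interrupt context escalates to panic */
	DEMO_PANIC_ON_OOPS,		/* panic_on_oops is set: every oops escalates */
};

struct demo_oops_ctx {
	bool in_interrupt;	/* did the oops happen in interrupt context? */
	bool panic_on_oops;	/* is the panic_on_oops option set? */
};

/* Decide how an oops ends, checking the conditions in the same order as die(). */
enum demo_oops_end demo_after_oops(const struct demo_oops_ctx *ctx)
{
	assert(ctx != NULL);

	if (ctx->in_interrupt)
		return DEMO_PANIC_IN_INTERRUPT;
	if (ctx->panic_on_oops)
		return DEMO_PANIC_ON_OOPS;
	return DEMO_KILL_TASK;
}
```

Because the interrupt check comes first, an oops in interrupt context reports "Fatal exception in interrupt" even when panic_on_oops is also set.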

3.3 panic processing flow

When the kernel enters panic, it can no longer continue to run, so some preparation is needed before the system hangs. It mainly includes the following parts:

(1) In an smp system, while one cpu is processing a panic, another cpu may also trigger a panic. The panic process mainly collects error information, dumps memory, and so on, and it neither requires nor supports concurrent execution. Therefore the cpus that trigger a panic later do not execute it
(2) If kgdb is being used to debug the kernel, the debugger should obviously be able to continue its work. The system is therefore not hung at this point, and control is transferred to the debugger instead
(3) If the kernel is configured with a memory dump function such as kdump, the dump-related process is started when the panic occurs
(4) Before an smp system hangs, all other cpus must be stopped so that the system really comes to a halt
(5) Finally, after printing the relevant system information, restart the system or enter an infinite loop

The corresponding code implementation is as follows:

void panic(const char *fmt, ...)
{
	…
	this_cpu = raw_smp_processor_id();
	old_cpu  = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);

	if (old_cpu != PANIC_CPU_INVALID && old_cpu != this_cpu)	/* (1) */
		panic_smp_self_stop();
	…
	pr_emerg("Kernel panic - not syncing: %s\n", buf);
	…
	kgdb_panic(buf);						/* (2) */

	if (!_crash_kexec_post_notifiers) {
		printk_safe_flush_on_panic();
		__crash_kexec(NULL);					/* (3) */

		smp_send_stop();					/* (4) */
	} else {
		crash_smp_send_stop();					/* (5) */
	}

	atomic_notifier_call_chain(&panic_notifier_list, 0, buf);	/* (6) */

	printk_safe_flush_on_panic();
	kmsg_dump(KMSG_DUMP_PANIC);					/* (7) */

	if (_crash_kexec_post_notifiers)
		__crash_kexec(NULL);					/* (8) */

	…
	panic_print_sys_info();						/* (9) */

	if (!panic_blink)
		panic_blink = no_blink;

	if (panic_timeout > 0) {
		pr_emerg("Rebooting in %d seconds..\n", panic_timeout);

		for (i = 0; i < panic_timeout * 1000; i += PANIC_TIMER_STEP) {
			touch_nmi_watchdog();
			if (i >= i_next) {
				i += panic_blink(state ^= 1);
				i_next = i + 3600 / PANIC_BLINK_SPD;
			}
			mdelay(PANIC_TIMER_STEP);			/* (10) */
		}
	}
	if (panic_timeout != 0) {
		if (panic_reboot_mode != REBOOT_UNDEFINED)
			reboot_mode = panic_reboot_mode;
		emergency_restart();					/* (11) */
	}
	…
	pr_emerg("---[ end Kernel panic - not syncing: %s ]---\n", buf);

	suppress_printk = 1;
	local_irq_enable();
	for (i = 0; ; i += PANIC_TIMER_STEP) {
		touch_softlockup_watchdog();
		if (i >= i_next) {
			i += panic_blink(state ^= 1);
			i_next = i + 3600 / PANIC_BLINK_SPD;
		}
		mdelay(PANIC_TIMER_STEP);				/* (12) */
	}
}

(1) If another cpu is already processing a panic, this cpu does not repeat the process and simply stops itself
(2) Print the panic reason (the pr_emerg call just above), then hand control to kgdb if a debugger is attached
(3) If the panic process performs a memory dump, all system-related information is saved in the dump file, so the notification chain below does not need to be called and the dump operation can be invoked directly. But the dump operation is not 100% safe, so if you do not fully trust it, you can set _crash_kexec_post_notifiers. In that case the notification chain and the log dump are executed first, and the memory dump is invoked afterwards. The __crash_kexec function checks whether a crash kernel has been loaded to decide whether to actually perform the dump. If the dump is performed, the system switches to the new kdump kernel through kexec and does not return. If it is not performed, the subsequent process continues
(4 - 5) Stop all cpus other than the current one
(6) Call the notifiers registered by the modules that care about panic events
(7) Dump the log information in the kernel log buffer
(8) If _crash_kexec_post_notifiers is set, decide here whether to perform the memory dump, depending on whether a crash kernel has been loaded
(9) If no memory dump was performed, print system-related information
(10) If panic_timeout is greater than 0, wait for that many seconds
(11) If panic_timeout is nonzero, restart the system once any wait has completed (a negative value skips the wait)
(12) If panic_timeout is 0, enter an infinite loop so that the system hangs
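
Two pieces of this flow are easy to try out in user space: the compare-and-exchange gate from note (1) and the timeout policy from notes (10) to (12). The sketch below uses C11 atomics in place of the kernel's atomic_cmpxchg, and all `demo_` names are invented for this illustration. It is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_CPU_INVALID	(-1)	/* no cpu owns the panic yet */

/*
 * Try to become the cpu that runs the panic process.
 * Returns true if this_cpu may continue, false if it should stop itself.
 */
bool demo_panic_try_enter(atomic_int *owner, int this_cpu)
{
	int expected = DEMO_CPU_INVALID;

	assert(this_cpu >= 0);

	/* The first cpu to arrive swaps its id in and wins. */
	if (atomic_compare_exchange_strong(owner, &expected, this_cpu))
		return true;

	/* On failure, expected holds the current owner; a nested panic on it continues. */
	return expected == this_cpu;
}

enum demo_panic_end {
	DEMO_HANG,			/* timeout == 0: loop forever */
	DEMO_REBOOT_AFTER_WAIT,		/* timeout  > 0: wait, then restart */
	DEMO_REBOOT_NOW,		/* timeout  < 0: restart without waiting */
};

/* Map a panic_timeout value to what happens at the end of panic(). */
enum demo_panic_end demo_panic_end(int panic_timeout)
{
	if (panic_timeout > 0)
		return DEMO_REBOOT_AFTER_WAIT;
	if (panic_timeout < 0)
		return DEMO_REBOOT_NOW;
	return DEMO_HANG;
}
```

Doing the check with a single atomic compare-and-exchange means that exactly one cpu can win the race, with no lock that a crashed cpu might still be holding.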

4- How to trigger oops and panic manually

During coding there may be some unexpected code branches. When the system enters one of them, some problem or serious error has occurred. Depending on the severity of the problem, we may want the program to print a warning message, enter oops, or even panic.

To this end, the kernel provides macros and functions that support these needs. The following are some commonly used ones:

(1) WARN_ON(): prints warning information and the call stack when its condition is true, but does not enter oops or panic
(2) BUG_ON(): prints bug-related information and enters the oops process when its condition is true
(3) panic(): directly starts the panic process and puts the system into the hang state
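
In kernel code, WARN_ON(cond) also evaluates to the truth value of its condition, so it is often written as `if (WARN_ON(...)) return -EINVAL;` to report the problem and bail out in one step. The user-space mock below only imitates that shape: DEMO_WARN_ON and the other `demo_` names are made up for this example, and the mock prints one line to stderr instead of a call stack.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

unsigned int demo_warn_count;	/* how many warnings have fired so far */

/* Report the failed expression once and hand the condition back to the caller. */
int demo_warn_on(int cond, const char *expr, const char *file, int line)
{
	if (cond) {
		demo_warn_count++;
		fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
	}
	return cond;
}

/* Mock of the WARN_ON shape: warns if cond is true and evaluates to cond. */
#define DEMO_WARN_ON(cond) demo_warn_on(!!(cond), #cond, __FILE__, __LINE__)

/* Example caller: reject an unexpected length, but keep running. */
int demo_copy_len(size_t len, size_t max, size_t *out)
{
	assert(out != NULL);

	if (DEMO_WARN_ON(len > max))
		return -EINVAL;	/* unexpected branch: warn and fail this call only */

	*out = len;
	return 0;
}
```

The same check written with BUG_ON would not return to the caller at all, which is why the warning form is the better fit for errors the code can recover from.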

Besides doing so in code, users can also trigger the panic process through the sysrq magic key. The following command triggers the sysrq-related panic process through the proc interface:

　　echo c > /proc/sysrq-trigger

Original link: https://zhuanlan.zhihu.com/p/545307249

Linux kernel debugging (2): kernel error handling process