The Linux Kernel Module Programming Guide

Peter Jay Salzman, Michael Burian, Ori Pomerantz, Bob Mottram, Jim Huang
Translated by WaterCutter
Source: LKMPG

10 system calls

So far, the only thing we've done is use well-defined kernel mechanisms to register /proc files and device handlers. This is fine if you want to do something the kernel programmers thought you'd want to do, such as write a device driver. But what if you want to do something unusual, to change the behavior of the system in some way? Then you are mostly on your own.

This is where kernel programming becomes dangerous if you don't use virtual machines sensibly. While writing the example below, I killed the open() system call. This meant I could not open any files, run any programs, or shut down the system. I had to restart the virtual machine. No important files were lost, but that could have been the outcome on a mission-critical system. To make sure you don't lose any files, even in a test environment, run sync right before you do the insmod and the rmmod.

Forget /proc files, forget device files. They are just minor details in the vast expanse of the universe. The real process-to-kernel communication mechanism, the one used by all processes, is the system call. This mechanism is used whenever a process requests a service from the kernel (such as opening a file, forking a new process, or requesting more memory). If you want to change the behavior of the kernel in interesting ways, this is the place to do it. By the way, if you want to see which system calls a program uses, run it under strace.

In general, a process should not have access to the kernel. It cannot access kernel memory, nor can it call kernel functions. The CPU's hardware enforces this (that's why it's called "protected mode" or "page protection").

System calls are the exception to this general rule. To make a system call, the process fills the registers with the appropriate values and then executes a special instruction that jumps to a previously defined location in the kernel (that location is, of course, readable by user processes but not writable by them). On Intel CPUs, this was traditionally done with interrupt 0x80; 64-bit x86 code normally uses the syscall instruction instead. The hardware knows that once you jump to this location, you are no longer running in restricted user mode but as the operating system kernel, and are therefore allowed to do whatever you want.
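To see the mechanism from the user side, here is a minimal userspace sketch. It assumes a Linux system with glibc, whose syscall() wrapper loads the registers and executes the trap instruction for us. The helper name raw_getpid is made up for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h> /* SYS_getpid */
#include <unistd.h>      /* syscall(), getpid() */

/* Request the getpid service by number through the generic syscall()
 * wrapper, instead of calling the dedicated libc function. The wrapper
 * places the number and arguments in registers and traps into the kernel.
 */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Calling raw_getpid() and getpid() from the same process should yield the same number, because both end up in the same kernel service.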

The location in the kernel that a process can jump to is called system_call. The procedure at that location checks the system call number, which tells the kernel what service the process requested. It then looks in the system call table (sys_call_table) to find the address of the kernel function to call. It calls that function and, after it returns, does a few system checks and then returns to the process (or to a different process, if the process's time has run out). If you want to read this code, it can be found after the ENTRY(system_call) line in the source file arch/$(architecture)/kernel/entry.S.
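The lookup step can be pictured with a toy userspace model. This is only a sketch of the idea, not the kernel's code: toy_call_table, toy_system_call and the TOY_NR_* numbers are hypothetical names invented for the illustration.

```c
#include <errno.h>  /* ENOSYS */
#include <stddef.h> /* NULL */

/* Hypothetical service numbers; the real ones are the __NR_* constants. */
enum { TOY_NR_ADD, TOY_NR_NEGATE, TOY_NR_MAX };

typedef long (*toy_call_t)(long a, long b);

static long toy_add(long a, long b) { return a + b; }
static long toy_negate(long a, long b) { (void)b; return -a; }

/* Toy stand-in for sys_call_table: service number -> handler function. */
static toy_call_t toy_call_table[TOY_NR_MAX] = {
    [TOY_NR_ADD] = toy_add,
    [TOY_NR_NEGATE] = toy_negate,
};

/* Toy stand-in for the entry point: validate the number, look up the
 * handler in the table, call it, and hand back its return value.
 */
static long toy_system_call(unsigned int nr, long a, long b)
{
    if (nr >= TOY_NR_MAX || toy_call_table[nr] == NULL)
        return -ENOSYS;
    return toy_call_table[nr](a, b);
}
```

For instance, toy_system_call(TOY_NR_ADD, 2, 3) dispatches to toy_add, while an out-of-range number is rejected with -ENOSYS before any handler runs.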

So, if we want to change the way a certain system call works, we need to write our own function to implement it (usually by adding a bit of our own code and then calling the original function), and then change the pointer in sys_call_table to point to our function. Since our module may be removed later and we don't want to leave the system in an unstable state, it is important that cleanup_module restore the table to its original state.
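The save, replace and restore pattern can be sketched in plain userspace C with a one-entry function-pointer table. All names here (demo_table, demo_hook, demo_install, demo_remove) are hypothetical and only model the idea; nothing in this sketch touches the real sys_call_table.

```c
typedef long (*demo_call_t)(long arg);

/* The "original" service in this toy model. */
static long demo_original(long arg) { return arg * 2; }

/* One-entry toy table standing in for sys_call_table. */
static demo_call_t demo_table[1] = { demo_original };

static demo_call_t saved_call; /* what the entry held before we hooked it */
static long hook_hits;         /* how many calls went through our hook */

/* Replacement: do our own bit of work, then call whatever was there. */
static long demo_hook(long arg)
{
    hook_hits++;
    return saved_call(arg);
}

/* Plays the role of init_module: remember the old pointer, install ours. */
static void demo_install(void)
{
    saved_call = demo_table[0];
    demo_table[0] = demo_hook;
}

/* Plays the role of cleanup_module: put the remembered pointer back. */
static void demo_remove(void)
{
    demo_table[0] = saved_call;
}
```

Callers that go through demo_table[0] get the same result before, during and after the hook is installed; only the side effect (the counter) shows that the hook was in the path.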

To modify the contents of sys_call_table, we need to consider the control registers. A control register is a processor register that changes or controls the general behavior of the CPU. On the x86 architecture, the cr0 register holds various control flags that modify the basic operation of the processor. The WP flag in cr0 stands for write protection: while it is set, even kernel-mode code cannot write to read-only pages, and sys_call_table lives in such a page. Therefore, we must clear the WP flag before modifying sys_call_table. However, since Linux v5.3, the write_cr0 function cannot be used for this, because the sensitive cr0 bits are pinned for security reasons: an attacker could otherwise write to the CPU control registers to disable CPU protections such as write protection. As a result, we have to provide a custom assembly routine to bypass it.
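The bit arithmetic involved is small enough to show on an ordinary integer. The sketch below works on a copy of a cr0 value in userspace; it does not read or write the real register, and the helper names are invented for the illustration. WP is bit 16, which matches the bit number used by the module in this chapter.

```c
#include <stdbool.h>

#define CR0_WP_BIT 16                   /* position of the WP flag in cr0 */
#define CR0_WP_MASK (1UL << CR0_WP_BIT) /* 0x10000 */

/* Return the value with WP cleared (read-only pages become writable). */
static unsigned long cr0_clear_wp(unsigned long cr0)
{
    return cr0 & ~CR0_WP_MASK;
}

/* Return the value with WP set again (write protection restored). */
static unsigned long cr0_set_wp(unsigned long cr0)
{
    return cr0 | CR0_WP_MASK;
}

/* Report whether WP is set in the given value. */
static bool cr0_wp_enabled(unsigned long cr0)
{
    return (cr0 & CR0_WP_MASK) != 0;
}
```

Taking 0x80050033 as a sample cr0 value, clearing WP gives 0x80040033 and setting it again restores 0x80050033; all other bits are left alone.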

However, the sys_call_table symbol is unexported to prevent misuse. There are still a few ways to obtain it: manual symbol lookup and kallsyms_lookup_name. Here we use both, depending on the kernel version.

Control-flow integrity is a technique that prevents an attacker from redirecting code execution by ensuring that indirect calls go to the intended addresses and that return addresses are not altered. Since Linux v5.7, the kernel has merged a series of patches for Control-flow Enforcement Technology (CET) on x86. Some GCC configurations, such as GCC 9 and 10 in Ubuntu, enable CET by default (the -fcf-protection option). Compiling the kernel with such a GCC while retpoline is turned off may result in CET being enabled in the kernel. You can check whether the -fcf-protection option is enabled by default with the following command:

$ gcc -v -Q -O2 --help=target | grep protection
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
...
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' '-march=x86-64'
 /usr/lib/gcc/x86_64-linux-gnu/9/cc1 -v ... -fcf-protection ...
 GNU C17 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
...

But CET should not be enabled in the kernel, as it may break Kprobes and bpf. Therefore, CET has been disabled since v5.11. To guarantee that manual symbol lookup works, we only use it up to v5.4.

Unfortunately, since Linux v5.7, kallsyms_lookup_name is also unexported, so it takes a trick to get its address. If CONFIG_KPROBES is enabled, we can retrieve the function address by means of Kprobes, which can dynamically break into a specific kernel routine. Kprobes inserts a breakpoint at the entry of a function by replacing the first bytes of the probed instruction. When the CPU hits the breakpoint, the registers are saved and control passes to Kprobes. It passes the addresses of the saved registers and the kprobe struct to the handler you defined, then executes it. A kprobe can be registered by symbol name or by address. When a symbol name is given, the kernel resolves the address.

Otherwise, pass the address of sys_call_table, taken from /proc/kallsyms or /boot/System.map, in the sym parameter. The following is an example using /proc/kallsyms:

$ sudo grep sys_call_table /proc/kallsyms
ffffffff82000280 R x32_sys_call_table
ffffffff820013a0 R sys_call_table
ffffffff820023e0 R ia32_sys_call_table
$ sudo insmod syscall.ko sym=0xffffffff820013a0
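Note that grep matched three symbols above, and only the exact name sys_call_table is the one to pass as sym. As a sketch of how that selection could be automated, the helper below parses one "address type name" line and accepts it only on an exact name match. The function name kallsyms_line_addr is hypothetical, and the line format is assumed to be the three-column form shown in the transcript.

```c
#include <stdio.h>  /* sscanf */
#include <string.h> /* strcmp */

/* Parse one line in "address type name" form. Return the address if the
 * symbol name matches `wanted` exactly, or 0 if the line is malformed or
 * names a different symbol.
 */
static unsigned long long kallsyms_line_addr(const char *line,
                                             const char *wanted)
{
    unsigned long long addr;
    char type;
    char name[128];

    if (sscanf(line, "%llx %c %127s", &addr, &type, name) != 3)
        return 0;
    return strcmp(name, wanted) == 0 ? addr : 0;
}
```

Fed the three lines from the transcript, only the middle one yields a nonzero address for "sys_call_table"; the x32_ and ia32_ variants are rejected.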

Be aware of KASLR (Kernel Address Space Layout Randomization) when using an address from /boot/System.map. KASLR may randomize the addresses of kernel code and data at every boot, so the static addresses listed in /boot/System.map are offset by some entropy. The purpose of KASLR is to protect the kernel space from attackers. Without KASLR, an attacker can easily find the target address among the fixed addresses, and can then use return-oriented programming to execute malicious code or to read target data through a tampered pointer. KASLR mitigates this kind of attack because the attacker cannot immediately know the target address, although a brute-force attack can still work. If the address of a symbol in /proc/kallsyms differs from its address in /boot/System.map, KASLR is enabled in the kernel running on the system.

$ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
$ sudo grep sys_call_table /boot/System.map-$(uname -r)
ffffffff82000300 R sys_call_table
$ sudo grep sys_call_table /proc/kallsyms
ffffffff820013a0 R sys_call_table
# Reboot
$ sudo grep sys_call_table /boot/System.map-$(uname -r)
ffffffff82000300 R sys_call_table
$ sudo grep sys_call_table /proc/kallsyms
ffffffff86400300 R sys_call_table
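The amount by which the kernel was shifted can be read off by subtracting the two addresses. A small sketch of that arithmetic follows; kaslr_slide is a made-up helper name, and the inputs are the values printed after the reboot above.

```c
#include <stdint.h>

/* Difference between the address a symbol has in the running kernel
 * (/proc/kallsyms) and its static address (/boot/System.map). A result
 * of 0 means the symbol was not moved.
 */
static uint64_t kaslr_slide(uint64_t runtime_addr, uint64_t static_addr)
{
    return runtime_addr - static_addr;
}
```

With the post-reboot values, kaslr_slide(0xffffffff86400300, 0xffffffff82000300) evaluates to 0x4400000, so the static address is off by that much for this boot.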

If KASLR is enabled, we have to look up the address in /proc/kallsyms again each time we reboot the machine. To use the address from /boot/System.map, make sure that KASLR is disabled. You can add nokaslr to the kernel command line to disable KASLR on the next boot:

$ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
$ sudo perl -i -pe 'm/quiet/ and s//quiet nokaslr/' /etc/default/grub
$ grep quiet /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet nokaslr splash"
$ sudo update-grub

For more information, please review the following:

Unexporting the system call table
Cook: Security things in Linux v5.3
Control-flow integrity for the kernel
Unexporting kallsyms_lookup_name()
Kernel Probes (Kprobes)
Kernel address space layout randomization

The source code here is an example of such a kernel module. We want to "monitor" a certain user and send a pr_info() message when that user opens a file. To do this, we replace the system call to open the file with our own function our_sys_openat. This function checks the uid (user id) of the current process, and if it is equal to the uid we monitor, it calls pr_info() to display the name of the file to open. Then, either way, it calls the original openat() function, with the same parameters, to actually open the file.

The init_module function replaces the corresponding location in sys_call_table and retains the original pointer in a variable. The cleanup_module function uses this variable to restore everything to normal. This approach is dangerous because it is possible for two kernel modules to alter the same system call. Suppose we have two kernel modules, A and B, A's openat system call is A_openat, and B's is B_openat. Now, when A is inserted into the kernel, the system call is replaced by A_openat, which will call the original sys_openat. Next, B is inserted into the kernel, replacing the system call with B_openat, and when done it will call what it thinks is the original system call, A_openat.

Now, if B is removed first, everything will be fine: it will simply restore the system call to A_openat, which calls the original. However, if A is removed first and then B is removed, the system will crash. Removing A restores the system call to the original sys_openat, cutting B out of the loop. Then, when B is removed, it restores the system call to what it thinks is the original, A_openat, which is no longer in memory. At first glance, it seems we could solve this particular problem by checking whether the system call is equal to our open function and, if not, leaving it alone (so that B won't change the system call when it is removed), but this leads to an even worse problem. When A is removed, it sees that the system call was changed to B_openat, so that it no longer points to A_openat, and therefore it doesn't restore it to sys_openat before it is removed from memory. Unfortunately, B_openat will still try to call A_openat, which is no longer there, so the system will crash even without removing B.
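The ordering problem can be replayed safely in userspace by tracking only the pointers. In this sketch the table is a single variable and the "modules" are structs; toy_insmod, toy_rmmod and struct toy_module are hypothetical names, and the stub functions are never meant to open anything.

```c
#include <stddef.h>

typedef long (*openat_fn)(void);

/* Stand-ins for the three functions named in the text. */
static long sys_openat_stub(void) { return 0; }
static long A_openat(void) { return 1; }
static long B_openat(void) { return 2; }

/* The single table slot that both modules fight over. */
static openat_fn table_entry = sys_openat_stub;

struct toy_module {
    openat_fn mine;  /* replacement this module installs */
    openat_fn saved; /* pointer it found in the table at load time */
};

/* Load: remember what was in the table, then install our function. */
static void toy_insmod(struct toy_module *m)
{
    m->saved = table_entry;
    table_entry = m->mine;
}

/* Unload: blindly restore whatever we remembered at load time. */
static void toy_rmmod(struct toy_module *m)
{
    table_entry = m->saved;
}
```

Loading A then B and unloading B then A leaves table_entry at sys_openat_stub. Unloading A first and then B leaves it pointing at A_openat, which in a real kernel would be code that has already been freed.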

Note that all these problems make syscall stealing unfeasible for production use. To keep people from doing potentially harmful things, sys_call_table is no longer exported. This means that if you want to do something more than this example, you will have to patch your current kernel so that sys_call_table is exported.

/* 
 * syscall.c 
 * 
 * System call "stealing" sample. 
 * 
 * Disables page protection at a processor level by changing the 16th bit 
 * in the cr0 register (could be Intel specific). 
 * 
 * Based on example by Peter Jay Salzman and 
 * https://bbs.archlinux.org/viewtopic.php?id=139406 
 */ 
 
#include <linux/delay.h> 
#include <linux/kernel.h> 
#include <linux/module.h> 
#include <linux/moduleparam.h> /* which will have params */ 
#include <linux/unistd.h> /* The list of system calls */ 
#include <linux/cred.h> /* For current_uid() */ 
#include <linux/uidgid.h> /* For __kuid_val() */ 
#include <linux/version.h> 
 
/* For the current (process) structure, we need this to know who the 
 * current user is. 
 */ 
#include <linux/sched.h> 
#include <linux/uaccess.h> 
 
/* The way we access "sys_call_table" varies as kernel internal changes. 
 * - Prior to v5.4 : manual symbol lookup 
 * - v5.5 to v5.6  : use kallsyms_lookup_name() 
 * - v5.7+         : Kprobes or specific kernel module parameter 
 */ 
 
/* The in-kernel calls to the ksys_close() syscall were removed in Linux v5.11+. 
 */ 
#if (LINUX_VERSION_CODE < KERNEL_VERSION(5, 7, 0)) 
 
#if LINUX_VERSION_CODE <= KERNEL_VERSION(5, 4, 0) 
#define HAVE_KSYS_CLOSE 1 
#include <linux/syscalls.h> /* For ksys_close() */ 
#else 
#include <linux/kallsyms.h> /* For kallsyms_lookup_name */ 
#endif 
 
#else 
 
#if defined(CONFIG_KPROBES) 
#define HAVE_KPROBES 1 
#include <linux/kprobes.h> 
#else 
#define HAVE_PARAM 1 
#include <linux/kallsyms.h> /* For sprint_symbol */ 
/* The address of the sys_call_table, which can be obtained with looking up 
 * "/boot/System.map" or "/proc/kallsyms". When the kernel version is v5.7+, 
 * without CONFIG_KPROBES, you can input the parameter or the module will look 
 * up all the memory. 
 */ 
static unsigned long sym = 0; 
module_param(sym, ulong, 0644); 
#endif /* CONFIG_KPROBES */ 
 
#endif /* Version < v5.7 */ 
 
static unsigned long **sys_call_table; 
 
/* UID we want to spy on - will be filled from the command line. */ 
static uid_t uid = -1; 
module_param(uid, int, 0644); 
 
/* A pointer to the original system call. The reason we keep this, rather 
 * than call the original function (sys_openat), is because somebody else 
 * might have replaced the system call before us. Note that this is not 
 * 100% safe, because if another module replaced sys_openat before us, 
 * then when we are inserted, we will call the function in that module - 
 * and it might be removed before we are. 
 * 
 * Another reason for this is that we can not get sys_openat. 
 * It is a static variable, so it is not exported. 
 */ 
#ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER 
static asmlinkage long (*original_call)(const struct pt_regs *); 
#else 
static asmlinkage long (*original_call)(int, const char __user *, int, umode_t); 
#endif 
 
/* The function we will replace sys_openat (the function called when you 
 * call the open system call) with. To find the exact prototype, with 
 * the number and type of arguments, we find the original function first 
 * (it is at fs/open.c). 
 * 
 * In theory, this means that we are tied to the current version of the 
 * kernel. In practice, the system calls almost never change (it would 
 * wreak havoc and require programs to be recompiled, since the system
 * calls are the interface between the kernel and the processes). 
 */ 
#ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER 
static asmlinkage long our_sys_openat(const struct pt_regs *regs) 
#else 
static asmlinkage long our_sys_openat(int dfd, const char __user *filename, 
                                      int flags, umode_t mode) 
#endif 
{
    int i = 0; 
    char ch; 
 
    if (__kuid_val(current_uid()) != uid) 
        goto orig_call; 
 
    /* Report the file, if relevant */ 
    pr_info("Opened file by %d: ", uid); 
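    do {
#ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER
        if (get_user(ch, (char __user *)regs->si + i))
#else
        if (get_user(ch, (char __user *)filename + i))
#endif
            break; /* Stop if the user pointer faults */
        i++;
        /* pr_cont() continues the current log line instead of starting
         * a new one for every character.
         */
        pr_cont("%c", ch);
    } while (ch != 0);
    pr_cont("\n");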
 
orig_call: 
    /* Call the original sys_openat - otherwise, we lose the ability to 
     * open files. 
     */ 
#ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER 
    return original_call(regs); 
#else 
    return original_call(dfd, filename, flags, mode); 
#endif 
} 
 
static unsigned long **acquire_sys_call_table(void) 
{
#ifdef HAVE_KSYS_CLOSE 
    unsigned long int offset = PAGE_OFFSET; 
    unsigned long **sct; 
 
    while (offset < ULLONG_MAX) {
        sct = (unsigned long **)offset; 
 
        if (sct[__NR_close] == (unsigned long *)ksys_close) 
            return sct; 
 
        offset += sizeof(void *); 
    } 
 
    return NULL; 
#endif 
 
#ifdef HAVE_PARAM 
    const char sct_name[15] = "sys_call_table"; 
    char symbol[40] = { 0 };
 
    if (sym == 0) {
        pr_alert("For Linux v5.7+, Kprobes is the preferable way to get " 
                 "symbol.\n"); 
        pr_info("If Kprobes is absent, you have to specify the address of " 
                "sys_call_table symbol\n"); 
        pr_info("by /boot/System.map or /proc/kallsyms, which contains all the " 
                "symbol addresses, into sym parameter.\n"); 
        return NULL; 
    } 
    sprint_symbol(symbol, sym); 
    if (!strncmp(sct_name, symbol, sizeof(sct_name) - 1)) 
        return (unsigned long **)sym; 
 
    return NULL; 
#endif 
 
#ifdef HAVE_KPROBES 
    unsigned long (*kallsyms_lookup_name)(const char *name); 
    struct kprobe kp = {
        .symbol_name = "kallsyms_lookup_name", 
    }; 
 
    if (register_kprobe(&kp) < 0) 
        return NULL; 
    kallsyms_lookup_name = (unsigned long (*)(const char *name))kp.addr; 
    unregister_kprobe(&kp); 
#endif 
 
    return (unsigned long **)kallsyms_lookup_name("sys_call_table"); 
} 
 
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 3, 0) 
static inline void __write_cr0(unsigned long cr0) 
{
    asm volatile("mov %0,%%cr0" : "+r"(cr0) : : "memory"); 
} 
#else 
#define __write_cr0 write_cr0 
#endif 
 
static void enable_write_protection(void) 
{
    unsigned long cr0 = read_cr0(); 
    set_bit(16, &cr0); 
    __write_cr0(cr0); 
} 
 
static void disable_write_protection(void) 
{
    unsigned long cr0 = read_cr0(); 
    clear_bit(16, &cr0); 
    __write_cr0(cr0); 
} 
 
static int __init syscall_start(void) 
{
    if (!(sys_call_table = acquire_sys_call_table())) 
        return -1; 
 
    disable_write_protection(); 
 
    /* keep track of the original open function */ 
    original_call = (void *)sys_call_table[__NR_openat]; 
 
    /* use our openat function instead */ 
    sys_call_table[__NR_openat] = (unsigned long *)our_sys_openat; 
 
    enable_write_protection(); 
 
    pr_info("Spying on UID:%d\n", uid); 
 
    return 0; 
} 
 
static void __exit syscall_end(void) 
{
    if (!sys_call_table) 
        return; 
 
    /* Return the system call back to normal */ 
    if (sys_call_table[__NR_openat] != (unsigned long *)our_sys_openat) {
        /* One pr_alert() per complete message, so each is a single log line */
        pr_alert("Somebody else also played with the "
                 "open system call\n");
        pr_alert("The system may be left in "
                 "an unstable state.\n");
    }
 
    disable_write_protection(); 
    sys_call_table[__NR_openat] = (unsigned long *)original_call; 
    enable_write_protection(); 
 
    msleep(2000); 
} 
 
module_init(syscall_start); 
module_exit(syscall_end); 
 
MODULE_LICENSE("GPL");
