1. What is a system call
A system call is a function that the kernel provides to application programs. Applications generally run in user mode, where many operations are restricted (for example, they cannot perform I/O directly), so some work must be carried out by the kernel. The kernel exposes system calls to the application layer to do the work that cannot be done in user mode.
Put simply, a system call is just a function call, except that the function being called runs in kernel mode. Unlike an ordinary function call, however, a system call cannot be made with the call instruction; it has to be made with a software interrupt instruction. On Linux, system calls are generally invoked with the int 0x80 instruction (32-bit x86) or the syscall instruction (x86-64). Strictly speaking, syscall is a dedicated fast-entry instruction rather than a software interrupt, but it serves the same purpose.
Let's use the int 0x80 instruction (x86) as an example to illustrate how system calls work.
2. System call principle
The Linux kernel stores all system calls in an array named sys_call_table. Each element of the array is the entry point of one system call. The array is defined as follows:
typedef void (*sys_call_ptr_t)(void);
const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
...
};
When an application needs to make a system call, it first places the number of that system call (that is, its index in the sys_call_table array) into the eax register, and then executes the int 0x80 instruction to trigger software interrupt 0x80.
The handler for software interrupt 0x80 then invokes the system call with code like the following:
...
call *sys_call_table(,%eax,8)
...
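The code above uses the value in the eax register as an index into sys_call_table and calls the corresponding system call. The scale factor 8 is the size of one table entry on a 64-bit kernel; a 32-bit kernel uses 4. The process is shown in the following figure: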
3. System call interception
Once you understand how system calls work, intercepting them is straightforward. So how is it done?
All we need to do is replace a system call's entry in the sys_call_table array with the entry point of a function we wrote ourselves. For example, to intercept the write() system call on x86-64, we replace the element at index 1 of sys_call_table with our own function (because the index of the write() system call in sys_call_table is 1).
To modify an element of the sys_call_table array, proceed as follows:
1. Get the address of the sys_call_table array
Modifying an element of sys_call_table generally has to be done from a kernel module, because memory protection prevents user-mode programs from writing to kernel memory. A kernel module runs in kernel mode, so it is not subject to this restriction.
Before we can modify an element of sys_call_table, we need the virtual memory address of the array (sys_call_table is not an exported symbol, so a kernel module cannot reference it directly).
There are two ways to get the virtual memory address of sys_call_table:
First method: read it from the System.map file
System.map is the kernel symbol table. It contains the names and addresses of the kernel's variables and functions, and it is generated automatically each time the kernel is compiled. To obtain the virtual address of sys_call_table, use the following command:
sudo cat /boot/System.map-`uname -r` | grep sys_call_table
The result is shown in the figure below:
As the figure above shows, the virtual address of the sys_call_table array is ffffffff818001c0.
Second method: get it with the kallsyms_lookup_name() function
Reading the address from the System.map file is not very elegant, so the kernel provides a function called kallsyms_lookup_name().
kallsyms_lookup_name() is simple to use: pass in the symbol name and it returns the virtual memory address, as shown in the following code. Note that kernels from version 5.7 onward no longer export kallsyms_lookup_name() to modules, so this method works as shown only on older kernels.
#include <linux/kallsyms.h>
void func(void) {
    ...
    unsigned long *sys_call_table;
    // Get the virtual memory address of sys_call_table (0 if the symbol is not found)
    sys_call_table = (unsigned long *)kallsyms_lookup_name("sys_call_table");
    ...
}
2. Set the sys_call_table array to be writable
Once we have the virtual address of the sys_call_table array, can we modify its elements right away? It is not that simple.
Because the sys_call_table array is located in a write-protected region, its contents cannot be modified directly. There are, however, two ways to temporarily disable the write protection:
First method: clear bit 16 of the cr0 register
Bit 16 of the cr0 control register is the write-protect (WP) bit. When it is cleared, code running at supervisor privilege is allowed to write to read-only pages. So we can clear bit 16 of cr0 before modifying sys_call_table, which makes the array writable, and restore the bit once the modification is complete.
The code is as follows:
/*
 * Clear bit 16 (WP) of the cr0 register and return the original value
 */
unsigned long clear_and_return_cr0(void)
{
    unsigned long cr0 = 0;
    unsigned long ret;

    /* Read cr0 into rax and output it to the cr0 variable */
    asm volatile ("movq %%cr0, %%rax" : "=a"(cr0));
    ret = cr0;
    cr0 &= ~0x10000UL; /* Clear bit 16 in the local copy */
    /* Load the modified value into rax, then write rax to the cr0 register */
    asm volatile ("movq %%rax, %%cr0" :: "a"(cr0) : "memory");
    return ret;
}

/*
 * Restore the cr0 register to val
 */
void setback_cr0(unsigned long val)
{
    asm volatile ("movq %%rax, %%cr0" :: "a"(val) : "memory");
}
Second method: make the page table entry that maps the virtual address writable
Because memory protection on x86 CPUs is implemented through the virtual memory page tables (see the article "Talk about memory mapping"), we only need to remove the write protection from the page table entry that maps the sys_call_table array, that is, set its read/write flag. The code is as follows:
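/*
 * Make the page containing a virtual address writable
 */
int make_rw(unsigned long address)
{
    unsigned int level;
    /* Look up the page table entry that maps the virtual address */
    pte_t *pte = lookup_address(address, &level);

    if (!pte)
        return -1;
    if (!(pte->pte & _PAGE_RW)) /* Set the read/write flag if it is not set yet */
        pte->pte |= _PAGE_RW;
    return 0;
}

/*
 * Make the page containing a virtual address read-only
 */
int make_ro(unsigned long address)
{
    unsigned int level;
    pte_t *pte = lookup_address(address, &level);

    if (!pte)
        return -1;
    pte->pte &= ~_PAGE_RW; /* Clear the read/write flag */
    return 0;
}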
3. Modify the contents of the sys_call_table array
With all the preparation done, the only thing left is to replace the system call entry in sys_call_table with the entry point of our own function.
We can modify the sys_call_table entry in the kernel module's initialization function and restore the original value in the module's exit function. The complete code is as follows:
/*
 * File: syscall.c
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/unistd.h>
#include <linux/time.h>
#include <asm/uaccess.h>
#include <linux/sched.h>
#include <linux/kallsyms.h>

unsigned long clear_and_return_cr0(void);
void setback_cr0(unsigned long val);
static int sys_hackcall(void);

unsigned long *sys_call_table = 0;

/* Function pointer used to save the original system call */
static int (*orig_syscall_saved)(void);

/*
 * Clear bit 16 (WP) of the cr0 register and return the original value
 */
unsigned long clear_and_return_cr0(void)
{
    unsigned long cr0 = 0;
    unsigned long ret;

    /* Read cr0 into rax and output it to the cr0 variable */
    asm volatile ("movq %%cr0, %%rax" : "=a"(cr0));
    ret = cr0;
    cr0 &= ~0x10000UL; /* Clear bit 16 in the local copy */
    /* Load the modified value into rax, then write rax to the cr0 register */
    asm volatile ("movq %%rax, %%cr0" :: "a"(cr0) : "memory");
    return ret;
}

/*
 * Restore the cr0 register to val
 */
void setback_cr0(unsigned long val)
{
    asm volatile ("movq %%rax, %%cr0" :: "a"(val) : "memory");
}

/*
 * Our replacement system call function
 */
static int sys_hackcall(void)
{
    printk("Hack syscall is successful!!!\n");
    return 0;
}

/*
 * Module initialization function: the module's entry point, called when the module is loaded
 */
static int __init init_hack_module(void)
{
    unsigned long orig_cr0;

    printk("Hack syscall is starting...\n");
    /* Get the virtual memory address of sys_call_table */
    sys_call_table = (unsigned long *)kallsyms_lookup_name("sys_call_table");
    if (!sys_call_table)
        return -ENOENT; /* Symbol not found: leave everything untouched */
    /* Save the original system call */
    orig_syscall_saved = (int(*)(void))(sys_call_table[__NR_perf_event_open]);
    orig_cr0 = clear_and_return_cr0(); /* Clear bit 16 of the cr0 register */
    sys_call_table[__NR_perf_event_open] = (unsigned long)&sys_hackcall; /* Replace the entry with our function */
    setback_cr0(orig_cr0); /* Restore the cr0 register */
    return 0;
}

/*
 * Module exit function, called when the module is unloaded
 */
static void __exit exit_hack_module(void)
{
    unsigned long orig_cr0;

    orig_cr0 = clear_and_return_cr0();
    sys_call_table[__NR_perf_event_open] = (unsigned long)orig_syscall_saved; /* Put the original system call back */
    setback_cr0(orig_cr0);
    printk("Hack syscall is exited....\n");
}

module_init(init_hack_module);
module_exit(exit_hack_module);
MODULE_LICENSE("GPL");
In the code above, we replaced the perf_event_open() system call with a function of our own.
Note: When testing, it is best to pick a rarely used system call; otherwise the system may crash.
4. Write the Makefile
For convenience, we write a Makefile to build the module:
obj-m := syscall.o
PWD := $(shell pwd)
KERNELDIR := /lib/modules/$(shell uname -r)/build
EXTRA_CFLAGS = -O0

all:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) clean
Note the EXTRA_CFLAGS = -O0 option, which turns off gcc optimization to avoid errors when inserting the module.
5. Test program
Now we write a test program to check whether the system call interception works. The code is as follows:
#include <sys/syscall.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Invoke perf_event_open directly by number; the arguments are dummies */
    long ret = syscall(__NR_perf_event_open, NULL, 0, 0, 0, 0);

    printf("%ld\n", ret);
    return 0;
}
6. Running results
Step 1: Install the interception kernel module
Install the kernel module with the following command:
root# insmod syscall.ko
Then check the system log with the dmesg command. You should see the following output:
...
[ 133.564652] Hack syscall is starting...
This shows that our kernel module was installed successfully.
Step 2: Run the test program
Next, we run the test program we just wrote and check the system log again. The output is as follows:
...
[ 532.243714] Hack syscall is successful!!!
This shows that the system call was intercepted successfully.