Practical Skills for Solving Linux Kernel Problems: New Ways to Play with /dev/mem

Original: dog250, Linux reading field, 2019-11-20


Continuing from the previous article "Practical Skills for Solving Linux Kernel Problems-Crash Tool Combining /dev/mem to Modify Memory Arbitrarily", in this article, let's take a look at several ways to play with /dev/mem.

What's in /dev/mem

Simply put, /dev/mem is the image file of the system's physical memory, and the "physical memory" here needs further explanation.

Does physical memory mean the memory sticks plugged into the memory slots? Yes, but not only those.

Strictly speaking, physical memory here means the physical address space. The memory sticks are mapped into only part of this address space; the rest holds the memory-mapped regions of PCI devices, ROMs, ACPI tables, and so on. We can see this mapping in /proc/iomem:

[root@localhost mem]# cat /proc/iomem
00000000-00000fff : reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000c0000-000c7fff : Video ROM
000e2000-000ef3ff : Adapter ROM
000f0000-000fffff : reserved
  000f0000-000fffff : System ROM
00100000-31ffffff : System RAM
 01000000-01649aba : Kernel code
 01649abb-01a74b7f : Kernel data
  01c13000-01f30fff : Kernel bss
32000000-33ffffff : RAM buffer
3fff0000-3fffffff : ACPI Tables
e0000000-e0ffffff : 0000:00:02.0
  e0000000-e0ffffff : vesafb
f0000000-f001ffff : 0000:00:03.0
  f0000000-f001ffff : e1000
...
...

Among these, only the System RAM entries correspond to the memory sticks. For details of the physical address space layout, refer to material on the E820 memory map.

Now that we understand what physical memory consists of, let's look at what is in /dev/mem. It is in fact a real-time image of a living Linux system. Every process's task_struct, every sock structure, every sk_buff, the process data, and so on all sit at some position inside it:

[Figure: kernel data structures located inside the /dev/mem image]

If we can locate them in /dev/mem, we can read the live values of these data structures, and debugging tools do nothing more than that. In fact, the vmcore file we analyze when debugging a kernel dump is also a physical memory image. Unlike /dev/mem, it is a corpse.

Living body or corpse, all the internal organs are there, and the means of analyzing them are the same. But unlike the static vmcore, /dev/mem is a dynamic memory image, and sometimes you can do some serious things with it.

Here are a few small examples that demonstrate some ways to play with /dev/mem.

Mapping system reserved memory

The memory management subsystem of the Linux kernel is very powerful and complex. While we are blessed by it, we occasionally suffer from it.

The OOM killer takes out key processes at every turn, and frequent flushing of dirty pages sends the CPU soaring...

To keep any one process from using memory arbitrarily, we introduce resource isolation mechanisms such as cgroup, but that makes things even more complicated.

Can part of the memory be set aside, outside the control of the kernel's memory management? Many databases access raw disks directly, bypassing the file system; is there a similar mechanism that lets us use memory directly, bypassing the memory management system?

Of course there is! The mem boot parameter does it. Here is the simplest configuration for reserving memory. Set the mem boot parameter as follows:

mem=800M

Assuming that our system has a total of 1G of memory (referring to the total capacity of the memory module), the above startup parameters will reserve 1G-800M of memory not to be managed by the system memory management system. So my reserved memory is 200M:

[root@localhost mem]# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-327.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet LANG=en_US.UTF-8 vga=793 mem=800M
[root@localhost mem]# cat /proc/iomem |grep RAM
00001000-0009fbff : System RAM
00100000-31ffffff : System RAM
32000000-33ffffff : RAM buffer
[root@localhost mem]#

Let's focus on the physical address 0x34000000 (0x33ffffff+1), where the reserved memory starts. At this point, the free command or /proc/meminfo will show about 200M less physical memory, and the 200M we reserved does not appear in any kernel statistic.

In other words, the kernel no longer cares about this 200M. Your program can scribble on it, leak it, overflow it, and overwrite it at will without any impact on the system.

"Reserved by the system" means that the kernel does not create one-to-one mapping page table entries for this segment of memory (an x86_64 system can direct-map 64T of physical memory).

The crash tool we often use reads memory through this one-to-one mapping.

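On the x86_64 platform, every non-reserved physical page may have several mappings, while a reserved physical page lacks the first of the following:

1.         A one-to-one mapping at virtual addresses starting from 0xffff880000000000. [Missing for reserved pages]
2.         A mapping into the address space of the user-mode process that uses the page.
3.         A temporary mapping into kernel space for a temporary touch.
4.         .....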

We try to use the crash tool to read the reserved memory:


    crash> rd -p 0x34000000
    rd: read error: physical address: 34000000  type: "64-bit PHYSADDR"
    crash>

Obviously, the kernel has not established a one-to-one mapping page table entry for the reserved page, so the read fails.

We know that /dev/mem is an image of the entire physical address space, so a user-mode process can use the mmap system call to build page table entries for any part of it in its own address space, as follows:

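#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    unsigned long *addr;

    fd = open("/dev/mem", O_RDWR);
    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }

    // 0x34000000 is the offset into /dev/mem, i.e. the offset of the reserved memory in the physical address space; in my example it is 0x34000000
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x34000000);
    if (addr == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }
    // ... use the reserved memory freely
    munmap(addr, 4096);
    close(fd);

    return 0;
}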

Isn't that simple?

At this point, the process that called mmap can access the reserved memory, and crash confirms the mapping:

crash> vtop 0x7f3751c3a000
VIRTUAL     PHYSICAL
7f3751c3a000  34000000

   PML: 6e477f0 => 2dbf7067
   PUD: 2dbf76e8 => c524067
   PMD: c524470 => 2c313067
   PTE: 2c3131d0 => 8000000034000277
   PAGE: 34000000

      PTE         PHYSICAL  FLAGS
8000000034000277  34000000  (PRESENT|RW|USER|PCD|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff88000b7e7af8 7f3751c3a000 7f3751c3b000 50444fb /dev/mem

In this example, we show how /dev/mem can be used to access reserved memory. Next, we continue to use simple small examples to demonstrate other ways of playing /dev/mem.

Swap pages between processes

Consider this requirement:

  • We don't want process A and process B to share any pages, which means they can never operate on the same piece of data at the same time.
  • Occasionally we want process A and process B to exchange data, but we don't want to use the inefficient traditional inter-process communication mechanisms.

Sounds like a dilemma? In fact, we can swap pages between the two processes to achieve the goal. To keep the swap of page table entries as simple as possible, we again use reserved memory, which frees the operation from the constraints of kernel memory management.

The sample program code is given below, first look at process A, master.c:

// gcc master.c -o master
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>
int main(int argc, char **argv)
{
    int fd;
    unsigned long *addr;

    fd = open("/dev/mem", O_RDWR);

    // map page P1 of the reserved memory
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x34000000);
    // write the content of page P1
    *addr = 0x1122334455667788;

    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);
    // wait for the swap
    getchar();
    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);

    munmap(addr, 4096);
    close(fd);

    return 0;
}

Next comes process B, slave.c, which wants to swap a page with process A:

// gcc slave.c -o slave
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    unsigned long *addr;

    fd = open("/dev/mem", O_RDWR);
    // map page P2 of the reserved memory
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x34004000);
    // write the content of page P2
    *addr = 0x8877665544332211;

    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);
    // wait for the swap
    getchar();
    printf("address at: %p   content is: 0x%lx\n", addr, addr[0]);

    munmap(addr, 4096);
    close(fd);

    return 0;
}

The principle of the page swap is simple: swap the page table entries behind the two virtual addresses of the two processes. Doing this properly requires a kernel module, but since this is only a demonstration, the crash tool gets us there easily.

Tip: if you want the crash tool to be able to write /dev/mem, see the previous article and use systemtap to hook devmem_is_allowed so that it always returns 1.

The operation demonstration process is as follows:

[Screenshot: demonstration of the page swap using the crash tool]

This approach is well suited to designing the inter-process communication mechanism of a microkernel. Combined with the cache coherence protocol, it can be very efficient.

Safely tampering with program memory

"Safely" tampering with program memory means modifying it by a reliable method rather than hacking the page tables by hand. For simplicity, we again use the crash tool to help.

First we look at a program:

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

int main(int argc, char **argv)
{
    char *addr;

    // anonymously map a page of memory
    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0);
    // copy data into it
    strcpy(addr, "浙江温州皮鞋湿");

    // demo only, so the address is printed directly; in a real scenario you would have to dig it out yourself
    printf("address at: %p   content is: %s\n", addr, addr);
    getchar();
    printf("address at: %p   content is: %s\n", addr, addr);

    munmap(addr, 4096);

    return 0;
}

Run it:

[root@localhost mem]# ./a.out
address at: 0x7fa5990f2000   content is: 浙江温州皮鞋湿

We want to change the memory content from "浙江温州皮鞋湿" ("the leather shoes in Wenzhou, Zhejiang are wet") to "下雨进水不会胖" ("they won't get fat when the rain gets in"). How?

There are many ways; here we use crash together with /dev/mem. First, find the physical page behind addr:

crash> set 1819
    PID: 1819
    ...
crash> vtop 0x7f360c756000
VIRTUAL     PHYSICAL
7f360c756000  6d3d000

   PML: 2ddec7f0 => 150b6067
   PUD: 150b66c0 => 1506b067
   PMD: 1506b318 => 2c591067
   PTE: 2c591ab0 => 8000000006d3d067
  PAGE: 6d3d000

      PTE         PHYSICAL  FLAGS
8000000006d3d067   6d3d000  (PRESENT|RW|USER|ACCESSED|DIRTY|NX)

      VMA           START       END     FLAGS FILE
ffff880015070000 7f360c756000 7f360c757000     fb dev/zero

      PAGE       PHYSICAL      MAPPING       INDEX CNT FLAGS
ffffea00001b4f40  6d3d000 ffff88002fe24710        0  2 1fffff00080038 uptodate,dirty,lru,swapbacked

The physical address behind addr is 0x6d3d000.

Now let's write another program that maps /dev/mem and modifies the memory at offset 0x6d3d000:

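// gcc hacker.c -o hacker
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    char *addr;
    unsigned long off;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <hex physical address>\n", argv[0]);
        return 1;
    }
    // physical address of the target page; it must be page-aligned
    off = strtoul(argv[1], NULL, 16);

    fd = open("/dev/mem", O_RDWR);
    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }

    addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, off);
    if (addr == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    strcpy(addr, "下雨进水不会胖");

    munmap(addr, 4096);
    close(fd);

    return 0;
}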

Direct execution:

[root@localhost mem]# ./hacker 0x6d3d000

Then hit enter in the running terminal of a.out:

[root@localhost mem]# ./a.out
address at: 0x7fa5990f2000   content is: 浙江温州皮鞋湿

address at: 0x7fa5990f2000   content is: 下雨进水不会胖
[root@localhost mem]#

That example was simple and a little dull. The next one is more interesting.

Modify the name of any process by writing /dev/mem

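In this example we give up the crash tool and rely only on hacking /dev/mem to change the name of a process.

This matters for operating some Internet products.
Especially on hosted machines, debugging with tools such as crash and gdb is generally not allowed in order to prevent data leaks. The systemtap interface is restricted and therefore considered safe, while kernel modules are usually forbidden as well. But systemtap plus /dev/mem is enough!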

Let's do such an experiment:

  • Modify the name of the process that is already running.

See how it is done. First look at a very simple program:

// gcc pixie.c -o pixie
#include <stdio.h>

int main(int argc, char **argv)
{
    getchar();
}

Run it:

[root@localhost mem]# gcc pixie.c -o pixie
[root@localhost mem]# ./pixie

Now we will try to change the name of the process from pixie to skinshoe.

There is no crash tool and no gdb, only a readable and writable /dev/mem (assuming we have already hooked the devmem_is_allowed function). How to do it?

We know that all kernel data structures can be found in /dev/mem, so we need to find the location of the task_struct of the pixie process and then change its comm field. The problem is that /dev/mem is a physical address space, while everything the operating system touches is addressed by virtual address. Establishing the association between the two is the key.

We note three facts:

  • x86_64 can direct-map 64T of physical memory, which is enough to map all the physical memory of any common machine one-to-one.
  • The Linux kernel sets up a one-to-one mapping for all physical memory, with a fixed offset between physical address and virtual address.
  • The kernel's data structures form a web of mutually linked objects, so from one of them you can follow the links to the others.

This means that given the virtual address of any one kernel data structure, we can find its physical address, and then follow the vine to the task_struct of our pixie process.

In the Linux system, the address of the kernel data structure can be found in many places:

  • /proc/kallsyms file.
  • /boot/System.map file.
  • the result of lsof

The easiest way is to find the address of init_task in /proc/kallsyms or System.map. For example, in my environment:

ffffffff81951440 D init_task

Then find the rule that maps init_task to physical memory in arch/x86/kernel/vmlinux.lds.S, traverse the system task list starting from init_task, find our target pixie process, and change it.

But this method doesn't let you enjoy following the vine through /dev/mem, so we leave it for the end. For now, we use a slightly more troublesome method to change the name of a specific process.

My method is to start a tcpdump process. It does not need to capture any packets; it only serves as a clue, and we start from it:

[root@localhost ~]# tcpdump -i lo -n
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes

The tcpdump process is created because tcpdump will create a packet socket, and the virtual address of the socket will be displayed in procfs:

[root@localhost mem]# cat /proc/net/packet
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
ffff88002c201000 3      3    0003   1     1 0      0      22050
[root@localhost mem]#

OK, 0xffff88002c201000 is our way in. We start from it!

We know that 0xffff88002c201000 is the memory address of a struct sock object. We are familiar with the details of the sock structure and know that its sk_wq field at offset 224 bytes leads to a wait_queue_head_t object.

This requires you to be very familiar with the Linux kernel structures. If you are not, find the corresponding source code and work out the offsets by hand. [Alternatively, borrow the crash tool's struct X.y -o to compute the offsets.]

A wait_queue_t object is linked into this wait_queue_head_t. That wait_queue_t is managed by the poll_wqueues structure initiated by tcpdump, and the polling_task field of poll_wqueues points to tcpdump itself. The task_struct of tcpdump is exactly what we need, because all tasks in the system are linked into one list.

The overall association diagram is as follows:

[Figure: association diagram from the packet socket to the task_struct of tcpdump]

With this structure, we can write code.

Since x86_64 can direct-map 64T of memory one-to-one and I only have 1G, it is guaranteed that a virtual address minus the base of the one-to-one mapping (0xffff880000000000 on my system) is the physical address.

Given that the address of the packet socket is 0xffff88002c201000, its physical address must be 0x2c201000. So we write code that maps /dev/mem and starts from 0x2c201000:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    int fd;
    unsigned long off;
    unsigned long *pltmp;
    unsigned char *addr, *base;

    fd = open("/dev/mem", O_RDWR);
    off = strtol(argv[1], NULL, 16);

    addr = mmap(NULL, 0xffffffff, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    base = addr;
    addr += off;    // locate the sock
    addr += 224;
    pltmp = addr;

    addr = *pltmp;
    addr -= 0xffff880000000000;
    addr += (unsigned long)base;
    addr += 8;
    pltmp  = addr;

    addr = *pltmp;
    addr -= 0xffff880000000000;
    addr += (unsigned long)base;
    addr -= 24;
    addr += 8;
    pltmp  = addr;

    addr = *pltmp;
    addr -= 0xffff880000000000;
    addr += (unsigned long)base;
    addr += 24;
    pltmp  = addr;

    // found the task_struct of tcpdump
    addr = *pltmp;
    addr -= 0xffff880000000000;
    addr += (unsigned long)base;
    addr += 1288;
    addr += 8;
    pltmp = addr;

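    // This should really be a loop that walks the whole tasks list. In this example we can guarantee that the pixie process sits just before tcpdump, so the logic is simplified to taking the task in front of it.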
    addr = *pltmp;
    addr -= 0xffff880000000000;
    addr += (unsigned long)base;
    addr += 8;
    pltmp = addr;

    addr = *pltmp;
    addr -= 0xffff880000000000;
    addr += (unsigned long)base;
    addr -= 1288;
    pltmp = addr;

    addr += 1872;
    pltmp = addr;
    printf("program name is: %s\n", addr);

    strcpy(addr, "skinshoe");

    close(fd);
    munmap(base, 0xffffffff);    // unmap from the start of the mapping; addr has moved

    return 0;
}

We run the program with the physical address of 0xffff88002c201000 as the offset:

[root@localhost mem]# ./modname 2c201000
program name is: pixie

Then, you will find that the name of the pixie program has been changed:

[root@localhost mem]# cat /proc/3442/comm
skinshoe

As you can see, pixie has become skinshoe.

Changing a process name is just an example. Now that we have our hands on the task_struct, we can borrow ideas from the crash tool and do debugger-like things. Let's continue.

Parse /dev/mem to traverse all processes in the system

When it comes to traversing processes, two approaches usually come to mind:

  1. Walk the per-pid directories of the procfs file system and parse their contents.
  2. Write a module and call the for_each_process macro to traverse the processes.

There is also a third way.

When we know that /dev/mem is the memory image of the entire system, we know that all data structures of the entire system can be found in it, including the process linked list, of course. Our task now is obviously to find it in /dev/mem.

In the previous section we already managed to find our process named pixie through the packet socket created by tcpdump and rename it to skinshoe. Let's review how we got from the packet socket to the pixie process:

  1. From the sock structure, find the wait_queue_head_t reached through the sk_wq field at a fixed offset.
  2. From the wait_queue_head_t, find its list_head field.
  3. From the list_head object, find the wait_queue_t that contains it.
  4. From the wait_queue_t object, find the poll_wqueues that its private field points to.
  5. From the poll_wqueues object, find the task_struct that its polling_task field points to.

At the fixed offset of 1288 bytes inside the task_struct that polling_task points to, we find a list. Traversing that list is traversing the process list of the whole system!

Since we can find a specific process named pixie, we can naturally traverse the entire linked list.

We continue from the previous section, except that "locating a specific process" becomes "traversing the entire linked list". The latter is simpler.

Throughout, we only need to determine two things:

  1. What is the starting address of the direct memory mapping? On my 3.10.0-327.el7.x86_64 test system the value is 0xffff880000000000:
     // arch/x86/include/asm/page_64_types.h:32:
     #define __PAGE_OFFSET _AC(0xffff880000000000, UL)
  2. Which virtual address is init_task mapped to when the system first starts? On my 3.10.0-327.el7.x86_64 test system the value is 0xffffffff81951440:
     ffffffff81951440 D init_task

Let's look at the overall look of the x86_64 memory map:

[Figure: overview of the x86_64 memory map]

That's it for the basics. How do we traverse every process in the system using nothing but the /dev/mem file? Talk is cheap, here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#include <string.h>
#include <fcntl.h>

#define    DIRECT_MAPPING_START    0xffff880000000000

// from ./kernel/vmlinux.lds.S
// .data : AT(ADDR(.data) - LOAD_OFFSET)
// #define LOAD_OFFSET __START_KERNEL_map
// #define __START_KERNEL_map 0xffffffff80000000
#define    START_MAPPING_START     0xffffffff80000000

int main(int argc, char **argv)
{
    int i = 0;
    unsigned int pid0 = 0xffffffff, *pitmp;
    int fd;
    unsigned long off, map_off = DIRECT_MAPPING_START;
    unsigned long *pltmp, *pltmp2;
    unsigned char *addr, *taddr, *base, *pctmp;

    fd = open("/dev/mem", O_RDWR);
    off = strtoll(argv[1], NULL, 16) - DIRECT_MAPPING_START & 0x0000ffffffffffff;
    addr = mmap(NULL, 0xffffffff, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    base = addr;
    addr += off;    // locate the sock
#define    SK_WQ_OFFSET    224 // sock.sk_wq
    addr += SK_WQ_OFFSET;
    pltmp = (unsigned long *)addr;

    addr = (unsigned char *)*pltmp;
    addr -= DIRECT_MAPPING_START;
    addr += (unsigned long)base;
#define LIST_PREV_OFFSET    8   // list_head.prev
    addr += LIST_PREV_OFFSET;
    pltmp  = (unsigned long *)addr;

    addr = (unsigned char *)*pltmp;
    addr -= DIRECT_MAPPING_START;
    addr += (unsigned long)base;
#define    WAITQ_TASK_OFFSET   24  // __wait_queue.task_list
    addr -= WAITQ_TASK_OFFSET;
#define    PRIVATE_OFFSET  8   // __wait_queue.private
    addr += PRIVATE_OFFSET;
    pltmp  = (unsigned long *)addr;

    addr = (unsigned char *)*pltmp;
    addr -= DIRECT_MAPPING_START;
    addr += (unsigned long)base;
#define POLLING_TASK_OFFSET    24  // poll_wqueues.polling_task
    addr += POLLING_TASK_OFFSET;
    pltmp  = (unsigned long *)addr;

    addr = (unsigned char *)*pltmp;
    addr -= DIRECT_MAPPING_START;
    addr += (unsigned long)base;
#define PID_OFFSET    1404    // task_struct.pid
#define COMM_OFFSET    1872    // task_struct.comm
    pitmp = (unsigned int *)(addr + PID_OFFSET);
    pid0 = *pitmp;
    printf("pid0 is:%d\n", *pitmp);
#define    TASKS_OFFSET    1288    // task_struct.tasks
   addr += TASKS_OFFSET;
    addr += LIST_PREV_OFFSET;
    pltmp = (unsigned long *)addr;

    while (1) {

        addr = (unsigned char *)*pltmp; // list
        addr -= map_off;
        addr += (unsigned long)base;
        addr += LIST_PREV_OFFSET ;// prev
        pltmp = (unsigned long *)addr;

        taddr = (unsigned char *)*pltmp; // list_entry
        if (*pitmp == 1) {
            taddr -= START_MAPPING_START;
        } else {
            taddr -= DIRECT_MAPPING_START;
        }
        taddr += (unsigned long)base;
        taddr -= TASKS_OFFSET;

        pitmp = (unsigned int *)(taddr + PID_OFFSET);
        if (*pitmp == pid0) {
            break;
        }
        if (*pitmp == 0) {
            map_off = START_MAPPING_START;
        } else {
             map_off = DIRECT_MAPPING_START;
        }
        printf("%d\t", *pitmp);

        pctmp = (taddr + COMM_OFFSET);
        printf("[%s] \n", pctmp);
    }

    close(fd);
    munmap(base, 0xffffffff);    // unmap from the start of the mapping; addr has moved

    return 0;
}

The code above is not hard to follow; the only thing that needs attention is locating init_task.

init_task is process 0 of the Linux system. It is special: it is not created by fork but comes into being with the initial boot of the system. From the moment the power is turned on and the x86 CPU enters protected mode, execution is already in the context of process 0, yet at that point the memory mapping rules have not been set up. So where is init_task?

init_task is placed at a "hand-written" location, namely the one specified in the arch/x86/kernel/vmlinux.lds.S file. Once you have found the virtual address of init_task by some means, subtract LOAD_OFFSET to get its physical address:

#define LOAD_OFFSET __START_KERNEL_map
#define __START_KERNEL_map 0xffffffff80000000
...
    /* Data */
    .data : AT(ADDR(.data) - LOAD_OFFSET) {
        /* Start of data section */
       _sdata = .;

        /* init_task */
        INIT_TASK_DATA(THREAD_SIZE)

So we defined separately:

#define    START_MAPPING_START     0xffffffff80000000

This is the basis for positioning init_task.

Back to the point: compile the C code above and run it to demonstrate the traversal:

[root@localhost mem]# cat /proc/net/packet
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
ffff88002dbb8000 3      3    0003   2     1 0      0      42249
[root@localhost mem]# ./listtasks 0x88002dbb8000
pid0 is:6020
5784    [kworker/1:1]
5762    [ssh]
5760    [kworker/3:2H]
5754    [ssh]
... // omitted for brevity
1    [systemd]
0    [swapper/0]
6159    [listtasks]
... // omitted for brevity
[root@localhost mem]#

OK, all processes were traversed successfully! The fly in the ointment is that the traversal takes no locks, so there may be synchronization problems, but the worst that can happen is that the listtasks process itself gets a SEGV.

Now let's port the code above to a system running the 3.10.0-862.11.6.el7.x86_64 kernel. Run unchanged, it hits a SEGV.

In fact, the mapping addresses and structure offsets of each running kernel may differ, which is both a cause and a consequence of the ABI incompatibility between Linux kernel versions. Still, it is not hard to make the code written for 3.10.0-327.el7.x86_64 run on 3.10.0-862.11.6.el7.x86_64; just change the macros to the following definitions:

#define    DIRECT_MAPPING_START    0xffff8b9d40000000

// from ./kernel/vmlinux.lds.S
// .data : AT(ADDR(.data) - LOAD_OFFSET)
// #define LOAD_OFFSET __START_KERNEL_map
// #define __START_KERNEL_map 0xffffffff80000000
#define    START_MAPPING_START     0xffffffffa6a00000
#define    SK_WQ_OFFSET    224 // sock.sk_wq
#define LIST_PREV_OFFSET    8   // list_head.prev
#define    WAITQ_TASK_OFFSET   24  // __wait_queue.task_list
#define    PRIVATE_OFFSET  8   // __wait_queue.private
#define POLLING_TASK_OFFSET    24  // poll_wqueues.polling_task
#define PID_OFFSET    1188    // task_struct.pid
#define COMM_OFFSET    1656    // task_struct.comm
#define    TASKS_OFFSET    1072    // task_struct.tasks

The following is a demonstration on the 3.10.0-862.11.6.el7.x86_64 kernel:

[root@localhost ~]# cat /proc/net/packet
sk       RefCnt Type Proto  Iface R Rmem   User   Inode
ffff8b9d7bf89000 3      3    0003   3     1 0      0      16883
[root@localhost ~]# ./a.out 8b9d7bf89000
pid0 is:959
719    [NetworkManager]
716    [firewalld]
708    [login]
703    [crond]
698    [systemd-logind]
696    [polkitd]
...
...
3    [ksoftirqd/0]
2    [kthreadd]
1    [systemd]
0    [swapper/0]
2685    [a.out]
2171    [pickup]
2166    [stapio]
1416    [qmgr]
1414    [master]
1136    [xinetd]
1131    [tuned]
1129    [rsyslogd]
1126    [sshd]
[root@localhost ~]#

Once we have mastered the technique of following clues from one known memory address to a specific structure in /dev/mem, the simpler method of traversing all processes from init_task is easy to understand.

Here is the code that traverses all processes starting from init_task, offered as a starting point:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <fcntl.h>

#define    DIRECT_MAPPING_START    0xffff880000000000
#define    DIRECT_MAPPING_END      0xffffc7ffffffffff

// from ./kernel/vmlinux.lds.S
// .data : AT(ADDR(.data) - LOAD_OFFSET)
// #define LOAD_OFFSET __START_KERNEL_map
// #define __START_KERNEL_map 0xffffffff80000000
#define    START_MAPPING_START     0xffffffff80000000

int main(int argc, char **argv)
{
    int i = 0 ;
    unsigned int pid0 = 0xffffffff, *pitmp;
    int fd;
    unsigned long off, map_off = DIRECT_MAPPING_START;
    unsigned long *pltmp, *pltmp2;
    unsigned char *addr, *taddr, *base, *pctmp;

    fd = open("/dev/mem", O_RDWR);
    // the argument is the address of init_task that we found in /proc/kallsyms
    off = strtoll(argv[1], NULL, 16) - START_MAPPING_START & 0x0000ffffffffffff;

    addr = mmap(NULL, 0xffffffff, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    base = addr;
    addr += off;
#define TASKS_OFFSET    1288
    addr += TASKS_OFFSET;
    pltmp  = (unsigned long *)addr;

    while (1) {
        taddr = (unsigned char *)*pltmp; // list_entry
        taddr -= map_off;
        taddr += (unsigned long)base;
        taddr -= TASKS_OFFSET;

#define PID_OFFSET    1404    // task_struct.pid
#define COMM_OFFSET    1872    // task_struct.comm
        pitmp = (unsigned int *)(taddr + PID_OFFSET);
        printf("%d\t", *pitmp);

        pctmp = (taddr + COMM_OFFSET);
        printf("[%s] \n", pctmp);
        addr = (unsigned char *)*pltmp; // list
        addr -= map_off;
        addr += (unsigned long)base;
        pltmp = (unsigned long *)addr;
        if (*pltmp > DIRECT_MAPPING_END) {
                    break;
        }
    }

    close(fd);
    munmap(base, 0xffffffff);    // unmap from the start of the mapping; addr has moved

    return 0;
}

The code looks a lot shorter. The demonstration is as follows:

[root@10 mem]# ./a.out ffff81951440
1    [systemd]
2    [kthreadd]
3    [ksoftirqd/0]
7    [migration/0]
8    [rcu_bh]
9    [rcuob/0]
10    [rcuob/1]
11    [rcuob/2]
12    [rcuob/3]
13    [rcu_sched]
14    [rcuos/0]
15    [rcuos/1]
16    [rcuos/2]
...

Manually hollow out a process by writing /dev/mem

There are two ways a process can be killed unconditionally:

  1. It is killed with SIGKILL.
  2. It self-destructs because of a serious fault.

Whether attacked from outside or rotting from inside, the demise of a process needs a reason. But what about a perfectly good process that is simply hollowed out? Can that be done? It would be like a kind and healthy person suddenly struck by a serious accident.

You can easily hollow out a process by writing /dev/mem. When the process is next about to run, it finds it has nothing left.

We can locate the process in /dev/mem, remove its vmas, clear its stack... That is a bit cruel, so I won't give an example.

Modify function instructions by writing /dev/mem

As the last example, it is time to come full circle. The first example in "Practical Skills for Solving Linux Kernel Problems-Crash Tool Combining /dev/mem to Modify Memory Arbitrarily" was the precondition for freely and happily playing with all the examples here: it modified the instructions of the devmem_is_allowed function so that it always returns 1. Now we restore the function by writing /dev/mem, and so end this article.

We find the address of devmem_is_allowed in /proc/kallsyms:

ffffffff8105e630 T devmem_is_allowed

This is what it looks like:

[Screenshot: disassembly of devmem_is_allowed]

We changed the ja instruction into two nops, and now we want to restore the two nops to ja:

[Screenshot: restoring the ja instruction by writing /dev/mem]

It is worth noting that the devmem_is_allowed function only restricts the open and mmap calls on /dev/mem. Once mmap has succeeded, accessing /dev/mem is ordinary memory access and is no longer subject to the checks on file reads and writes, so we can write /dev/mem in place and safely, without needing the write to be atomic the way it had to be when we hooked the function.

OK, from modifying devmem_is_allowed at the beginning to restoring devmem_is_allowed at the end, we have had our fun.

To make the landing softer, here is one more example.

Legal access to NULL address

Can NULL addresses be accessed? Who on earth is preventing access to NULL addresses?

Let me start with the conclusion:

  • The NULL address is completely accessible, as long as there is a page table entry that maps it to a physical page.

Linux has a parameter that controls whether the NULL address can be mapped with mmap:

[root@10 ~]# cat /proc/sys/vm/mmap_min_addr
4096
[root@10 ~]# echo  0 >/proc/sys/vm/mmap_min_addr
[root@10 ~]# cat /proc/sys/vm/mmap_min_addr
0

Let's do an experiment to see what happens, first look at the code:

// gcc access0.c -o access0
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
    int i;
    unsigned char *niladdr = NULL;
    unsigned char str[] = "Zhejiang Wenzhou pixie shi,xiayu jinshui buhui pang!";

    mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_ANONYMOUS|MAP_SHARED, -1, 0);
    perror("a");

    for (i = 0 ; i < sizeof(str); i++) {
        niladdr[i] = str[i];
    }
    printf("using assignment at NULL: %s\n", niladdr);
    for (i = 0 ; i < sizeof(str); i++) {
        printf ("%c", niladdr[i]);
    }
    printf ("\n");

    getchar();

    munmap(0, 4096);

    return 0;
}

Run it:

[root@10 mem]# ./access0
a: Success
using assignment at NULL: (null)
Zhejiang Wenzhou pixie shi,xiayu jinshui buhui pang!

At this point, we use the crash tool to look at the NULL mapping:

[Screenshot: crash view of the mapping at address 0]

It looks no different from any other mapping.

The reason NULL is normally inaccessible is to make it easy to tell valid addresses from invalid ones, so NULL was designated by convention as a special address. At the MMU level, NULL is no different from any other address.

Conclusion

Well, that wraps up the topic of the crash tool and /dev/mem. Together with the previous article, I recommend doing these experiments yourself to gain a deeper understanding.

More importantly, when you do these experiments yourself you will run into all sorts of new problems. Analyzing them and finally hacking your way through is where the feeling of happiness comes from. Sharing that happiness is a happiness in itself, and it is what motivated me to write these two articles.


Rather than stand by the water longing for fish, go home and weave a net.

The leather shoes in Wenzhou, Zhejiang are wet, so they won’t get fat in the rain.

(Finish)

Origin blog.51cto.com/15015138/2555570