Linux mem 1.1 User-Space Process Address Layout --- mmap() in Detail

1. Overview

To enforce privilege separation, Linux splits the address space into a user half and a kernel half. On 64-bit architectures the two halves are the same size; on 32-bit x86 the default split is actually 3G/1G, although a 2G/2G split (the case shown in the table below) can also be configured.

For example:

Address mode     Each half   User address space                          Kernel address space
32-bit           2G          0x00000000 - 0x7FFFFFFF                     0x80000000 - 0xFFFFFFFF
64-bit (48-bit)  128T        0x00000000 00000000 - 0x00007FFF FFFFFFFF   0xFFFF8000 00000000 - 0xFFFFFFFF FFFFFFFF
64-bit (57-bit)  64P         0x00000000 00000000 - 0x00FFFFFF FFFFFFFF   0xFF000000 00000000 - 0xFFFFFFFF FFFFFFFF

In this article we focus on the user address space.

1.1 Mapping

An address space is essentially a set of mappings between virtual and physical addresses. In user space there is one more important participant in the mapping: the file. User-space mappings therefore fall into two classes: file-backed mappings (virtual address, physical address, file) and anonymous mappings (virtual address, physical address).

  • 1. File-backed mapping

[figure]
The detailed relationships between virtual addresses, physical addresses and the file are shown in the figure below:

[figure]
Why drag files into address-space mapping at all? Because RAM is usually much smaller than file storage, so the file is used as a slow backing store for memory:
1. Loading is "lazy": when the virtual address range is allocated, no physical memory is allocated and no MMU mapping is built. Only when the virtual address is actually used does a page fault fire; the fault handler then allocates physical memory, reads in the data and builds the MMU mapping.
2. When memory is tight and needs to be reclaimed, the physical pages of a file mapping can simply be released: the data still has a backup in the file, and a later access will fault it back in.

Why doesn't kernel space use this strategy?
1. The kernel maps very few files (essentially just vmlinux), so it would not save much memory.
2. Kernel code generally needs deterministic latency; taking page faults would make its timing unpredictable.
3. Kernel code may run in all sorts of lock contexts, and handling a page fault there could trigger further exceptions.

  • 2. Anonymous mapping

[figure]

Memory with no slow backing store can only be mapped anonymously. Such regions typically hold data, e.g. global variables. They are handled as follows:
1. Allocation can be "lazy": the first real access triggers a page fault, and the handler allocates a zero-filled physical page and builds the MMU mapping.
2. On memory reclaim the data cannot simply be dropped; it must first be swapped out to a swap file/zram before the physical pages can be freed.
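
To make file-backed vs. anonymous mappings concrete, here is a minimal user-space sketch (not from the kernel source; the file path is made up for illustration and error handling is omitted):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* file-backed mapping: virtual address <-> physical page <-> file */
	int fd = open("/tmp/mmap-demo.txt", O_RDWR | O_CREAT, 0644);
	ftruncate(fd, 4096);
	char *fmap = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);

	/* anonymous mapping: virtual address <-> physical page, no file behind it */
	char *amap = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	strcpy(fmap, "written back to the file by the kernel");
	strcpy(amap, "never written back; can only be swapped out");

	printf("file-backed at %p, anonymous at %p\n", fmap, amap);
	munmap(fmap, 4096);
	munmap(amap, 4096);
	close(fd);
	return 0;
}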

1.2 VMA management

The user address space is bound to a process; every process has its own, independent user address space, which it manages through the mm structure (kernel threads have no user address space, so their mm is NULL).

active_mm
mm points to the memory descriptor the process owns, while active_mm points to the memory descriptor the process is currently running on. For a normal process the two are identical. A kernel thread has no memory descriptor of its own, so its mm is always NULL; when a kernel thread gets to run, its active_mm is set to the active_mm of the previously running task (see the schedule() path).
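The borrowing happens in the scheduler's context_switch(); a simplified sketch of the relevant branch is shown below (the exact helper names vary between kernel versions, so treat this as illustrative):

/* inside context_switch(), simplified */
if (!next->mm) {				/* kernel thread: it owns no mm */
	next->active_mm = prev->active_mm;	/* borrow the previous task's mm */
	mmgrab(prev->active_mm);		/* take a reference so it cannot go away */
	enter_lazy_tlb(prev->active_mm, next);	/* no page-table switch is needed */
} else {
	switch_mm_irqs_off(prev->active_mm, next->mm, next);
}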

A process manages each range of virtual addresses with a vma. The key members of the vma structure are:

struct vm_area_struct {
	unsigned long vm_start;		// start virtual address
	unsigned long vm_end;		// end virtual address (first byte after the range)

	unsigned long vm_pgoff;		// file offset corresponding to vm_start, in pages
	struct file * vm_file;		// file backing this mapping, if any

};

The mm structure keeps all vmas in a red-black tree and, in parallel, links them together on a sorted list:

[figure]

The usual operations on vmas are lookup (find_vma()) and insertion (vma_link()).

find_vma(mm, addr) deserves special attention.
Its name suggests that it returns a vma satisfying vma->vm_start <= addr < vma->vm_end, but what it actually returns is the first vma satisfying addr < vma->vm_end. So even though find_vma() returned a vma, addr is not necessarily inside it. This is a very easy mistake to make.
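In kernel code the safe idiom is therefore to re-check vm_start after the lookup. A minimal sketch:

struct vm_area_struct *vma;

vma = find_vma(mm, addr);
if (!vma) {
	/* addr lies above every existing vma */
} else if (addr >= vma->vm_start) {
	/* addr really falls inside [vma->vm_start, vma->vm_end) */
} else {
	/* addr sits in the hole below vma->vm_start: no vma contains it */
}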

1.3 mmap

The sections above covered the background; the actual construction of user-space mappings starts with mmap(), which is used heavily while a process runs.

[figure]

Normally mmap() only carves out a vma and records the vma-to-file relationship; it does not establish the real mapping right away. The follow-up work of allocating physical pages, copying in file contents and creating MMU mappings happens later, in the page-fault handler.
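
This lazy behaviour is easy to observe from user space: right after mmap() the region only consumes virtual address space, and RSS grows only as pages are touched. A small sketch (reading VmRSS from /proc/self/status is just one convenient way to watch it):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static void show_rss(const char *tag)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "VmRSS:", 6))
			printf("%s: %s", tag, line);
	fclose(f);
}

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MB */
	char *p;

	show_rss("before mmap");
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	show_rss("after mmap");		/* barely changes: only a vma was created */

	for (size_t i = 0; i < len; i += 4096)
		p[i] = 1;		/* each first touch triggers a page fault */
	show_rss("after touch");	/* RSS has now grown by roughly 64 MB */

	munmap(p, len);
	return 0;
}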

1.4 Page faults

When someone accesses an address mapped by the previous section's mmap(), there is no MMU mapping for it yet, so a page fault is raised.

[figure]

The fault handler then allocates the physical page, copies in the file contents and creates the MMU mapping.

1.5 layout

Apart from the text and data segments, which are mapped during execve(), everything loaded later (shared objects and so on) is mapped with mmap(). The start address of these mmap mappings is mm->mmap_base.

Depending on where mmap_base is placed and how the mmap area grows, the user address space has two layout modes.

  • 1. Legacy layout (mmap grows from low to high addresses)

[figure]

In the legacy layout, mmap_base sits at roughly 1/3 of the user address space. The stack has an essentially fixed maximum size; the regions that can grow flexibly are the heap and the mmap area, and both grow upward.

What is wrong with this layout? Its biggest problem is that the heap and the mmap area cannot share space: the heap is limited to roughly 1/3 of task_size and the mmap area to roughly 2/3 of task_size.

  • 2. Modern layout (mmap grows from high to low addresses)

[figure]

The modern layout is what is generally used today. The heap grows upward and the mmap area grows downward toward it, so the two share the space in between and either one can grow larger.

For example:

$ cat /proc/3389/maps 
00400000-00401000 r-xp 00000000 fd:00 104079935                          /home/ipu/hook/build/output/bin/sample-target
00600000-00601000 r--p 00000000 fd:00 104079935                          /home/ipu/hook/build/output/bin/sample-target
00601000-00602000 rw-p 00001000 fd:00 104079935                          /home/ipu/hook/build/output/bin/sample-target
01b9f000-01bc0000 rw-p 00000000 00:00 0                                  [heap]
7f5c77ad9000-7f5c77c9c000 r-xp 00000000 fd:00 191477                     /usr/lib64/libc-2.17.so
7f5c77c9c000-7f5c77e9c000 ---p 001c3000 fd:00 191477                     /usr/lib64/libc-2.17.so
7f5c77e9c000-7f5c77ea0000 r--p 001c3000 fd:00 191477                     /usr/lib64/libc-2.17.so
7f5c77ea0000-7f5c77ea2000 rw-p 001c7000 fd:00 191477                     /usr/lib64/libc-2.17.so
7f5c77ea2000-7f5c77ea7000 rw-p 00000000 00:00 0 
7f5c77ea7000-7f5c77ec9000 r-xp 00000000 fd:00 191470                     /usr/lib64/ld-2.17.so
7f5c7809d000-7f5c780a0000 rw-p 00000000 00:00 0 
7f5c780c6000-7f5c780c8000 rw-p 00000000 00:00 0 
7f5c780c8000-7f5c780c9000 r--p 00021000 fd:00 191470                     /usr/lib64/ld-2.17.so
7f5c780c9000-7f5c780ca000 rw-p 00022000 fd:00 191470                     /usr/lib64/ld-2.17.so
7f5c780ca000-7f5c780cb000 rw-p 00000000 00:00 0 
7ffe4a3ef000-7ffe4a410000 rw-p 00000000 00:00 0                          [stack]
7ffe4a418000-7ffe4a41a000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

1.6 brk

As mentioned above, the heap is a special anonymous mmap region, and a dedicated system call, brk(), is responsible for adjusting its size.

brk() works on the same principle as mmap(); it is effectively a special case of mmap() that moves the end-of-heap pointer mm->brk.

[figure]
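
A quick way to watch mm->brk move is the sbrk() wrapper, which returns the current program break and bumps it by the requested amount (glibc typically implements small malloc() allocations on top of brk()):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	void *start = sbrk(0);		/* current program break, i.e. mm->brk */

	sbrk(4096);			/* grow the heap by one page */
	void *grown = sbrk(0);
	sbrk(-4096);			/* shrink it back */

	printf("brk: %p -> %p -> %p\n", start, grown, sbrk(0));
	return 0;
}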

2. Code walkthrough

2.1 Key data structures

mm:

/**
 * 内存描述符。task_struct的mm字段指向它。
 * 它包含了进程地址空间有关的全部信息。
 */
struct mm_struct {
	/**
	 * 指向线性区对象的链表头。
	 */
	struct vm_area_struct * mmap;		/* list of VMAs */
	/**
	 * 指向线性区对象的红-黑树的根
	 */
	struct rb_root mm_rb;
	/**
	 * 指向最后一个引用的线性区对象。
	 */
	struct vm_area_struct * mmap_cache;	/* last find_vma result */
	/**
	 * 在进程地址空间中搜索有效线性地址区的方法。
	 */
	unsigned long (*get_unmapped_area) (struct file *filp,
				unsigned long addr, unsigned long len,
				unsigned long pgoff, unsigned long flags);
	/**
	 * 释放线性地址区间时调用的方法。
	 */
	void (*unmap_area) (struct vm_area_struct *area);
	/**
	 * 标识第一个分配的匿名线性区或文件内存映射的线性地址。
	 */
	unsigned long mmap_base;		/* base of mmap area */
	/**
	 * 内核从这个地址开始搜索进程地址空间中线性地址的空间区间。
	 */
	unsigned long free_area_cache;		/* first hole */
	/**
	 * 指向页全局目录。
	 */
	pgd_t * pgd;
	/**
	 * 次使用计数器。存放共享mm_struct数据结构的轻量级进程的个数。
	 */
	atomic_t mm_users;			/* How many users with user space? */
	/**
	 * 主使用计数器。每当mm_count递减时,内核都要检查它是否变为0,如果是,就要解除这个内存描述符。
	 */
	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
	/**
	 * 线性区的个数。
	 */
	int map_count;				/* number of VMAs */
	/**
	 * 内存描述符的读写信号量。
	 * 由于描述符可能在几个轻量级进程间共享,通过这个信号量可以避免竞争条件。
	 */
	struct rw_semaphore mmap_sem;
	/**
	 * 线性区和页表的自旋锁。
	 */
	spinlock_t page_table_lock;		/* Protects page tables, mm->rss, mm->anon_rss */

	/**
	 * 指向内存描述符链表中的相邻元素。
	 */
	struct list_head mmlist;		/* List of maybe swapped mm's.  These are globally strung
						 * together off init_mm.mmlist, and are protected
						 * by mmlist_lock
						 */

	/**
	 * start_code-可执行代码的起始地址。
	 * end_code-可执行代码的最后地址。
	 * start_data-已初始化数据的起始地址。
	 * end_data--已初始化数据的结束地址。
	 */
	unsigned long start_code, end_code, start_data, end_data;
	/**
	 * start_brk-堆的超始地址。
	 * brk-堆的当前最后地址。
	 * start_stack-用户态堆栈的起始地址。
	 */
	unsigned long start_brk, brk, start_stack;
	/**
	 * arg_start-命令行参数的起始地址。
	 * arg_end-命令行参数的结束地址。
	 * env_start-环境变量的起始地址。
	 * env_end-环境变量的结束地址。
	 */
	unsigned long arg_start, arg_end, env_start, env_end;
	/**
	 * rss-分配给进程的页框总数
	 * anon_rss-分配给匿名内存映射的页框数。s
	 * total_vm-进程地址空间的大小(页框数)
	 * locked_vm-锁住而不能换出的页的个数。
	 * shared_vm-共享文件内存映射中的页数。
	 */
	unsigned long rss, anon_rss, total_vm, locked_vm, shared_vm;
	/**
	 * exec_vm-可执行内存映射的页数。
	 * stack_vm-用户态堆栈中的页数。
	 * reserved_vm-在保留区中的页数或在特殊线性区中的页数。
	 * def_flags-线性区默认的访问标志。
	 * nr_ptes-this进程的页表数。
	 */
	unsigned long exec_vm, stack_vm, reserved_vm, def_flags, nr_ptes;

	/**
	 * 开始执行elf程序时使用。
	 */
	unsigned long saved_auxv[42]; /* for /proc/PID/auxv */

	/**
	 * 表示是否可以产生内存信息转储的标志。
	 */
	unsigned dumpable:1;
	/**
	 * 懒惰TLB交换的位掩码。
	 */
	cpumask_t cpu_vm_mask;

	/* Architecture-specific MM context */
	/**
	 * 特殊体系结构信息的表。
	 * 如80X86平台上的LDT地址。
	 */
	mm_context_t context;

	/* Token based thrashing protection. */
	/**
	 * 进程有资格获得交换标记的时间。
	 */
	unsigned long swap_token_time;
	/**
	 * 如果最近发生了主缺页。则设置该标志。
	 */
	char recent_pagein;

	/* coredumping support */
	/**
	 * 正在把进程地址空间的内容卸载到转储文件中的轻量级进程的数量。
	 */
	int core_waiters;
	/**
	 * core_startup_done-指向创建内存转储文件时的补充原语。
	 * core_done-创建内存转储文件时使用的补充原语。
	 */
	struct completion *core_startup_done, core_done;

	/* aio bits */
	/**
	 * 用于保护异步IO上下文链表的锁。
	 */
	rwlock_t		ioctx_list_lock;
	/**
	 * 异步IO上下文链表。
	 * 一个应用可以创建多个AIO环境,一个给定进程的所有的kioctx描述符存放在一个单向链表中,该链表位于ioctx_list字段
	 */
	struct kioctx		*ioctx_list;

	/**
	 * 默认的异步IO上下文。
	 */
	struct kioctx		default_kioctx;

	/**
	 * 进程所拥有的最大页框数。
	 */
	unsigned long hiwater_rss;	/* High-water RSS usage */
	/**
	 * 进程线性区中的最大页数。
	 */
	unsigned long hiwater_vm;	/* High-water virtual memory usage */
};

vma:

/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task.  A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
/**
 * 线性区描述符。
 */
struct vm_area_struct {
	/**
	 * 指向线性区所在的内存描述符。
	 */
	struct mm_struct * vm_mm;	/* The address space we belong to. */
	/**
	 * 线性区内的第一个线性地址。
	 */
	unsigned long vm_start;		/* Our start address within vm_mm. */
	/**
	 * 线性区之后的第一个线性地址。
	 */
	unsigned long vm_end;		/* The first byte after our end address
					   within vm_mm. */

	/* linked list of VM areas per task, sorted by address */
	/**
	 * 进程链表中的下一个线性区。
	 */
	struct vm_area_struct *vm_next;

	/**
	 * 线性区中页框的访问许可权。
	 */
	pgprot_t vm_page_prot;		/* Access permissions of this VMA. */
	/**
	 * 线性区的标志。
	 */
	unsigned long vm_flags;		/* Flags, listed below. */

	/**
	 * 用于红黑树的数据。
	 */
	struct rb_node vm_rb;

	/*
	 * For areas with an address space and backing store,
	 * linkage into the address_space->i_mmap prio tree, or
	 * linkage to the list of like vmas hanging off its node, or
	 * linkage of vma in the address_space->i_mmap_nonlinear list.
	 */
	/**
	 * 链接到反映射所使用的数据结构。
	 */
	union {
		/**
		 * 如果在优先搜索树中,存在两个节点的基索引、堆索引、大小索引完全相同,那么这些相同的节点会被链接到一个链表,而vm_set就是这个链表的元素。
		 */
		struct {
			struct list_head list;
			void *parent;	/* aligns with prio_tree_node parent */
			struct vm_area_struct *head;
		} vm_set;

		/**
		 * 如果是文件映射,那么prio_tree_node用于将线性区插入到优先搜索树中。作为搜索树的一个节点。
		 */
		struct raw_prio_tree_node prio_tree_node;
	} shared;

	/*
	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
	 * list, after a COW of one of the file pages.  A MAP_SHARED vma
	 * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
	 * or brk vma (with NULL file) can only be in an anon_vma list.
	 */
	/**
	 * 指向匿名线性区链表的指针(参见"映射页的反映射")。
	 * 页框结构有一个anon_vma指针,指向该页的第一个线性区,随后的线性区通过此字段链接起来。
	 * 通过此字段,可以将线性区链接到此链表中。
	 */
	struct list_head anon_vma_node;	/* Serialized by anon_vma->lock */
	/**
	 * 指向anon_vma数据结构的指针(参见"映射页的反映射")。此指针也存放在页结构的mapping字段中。
	 */
	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */

	/* Function pointers to deal with this struct. */
	/**
	 * 指向线性区的方法。
	 */
	struct vm_operations_struct * vm_ops;

	/* Information about our backing store: */
	/**
	 * 在映射文件中的偏移量(以页为单位)。对匿名页,它等于0或vm_start/PAGE_SIZE
	 */
	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
					   units, *not* PAGE_CACHE_SIZE */
	/**
	 * 指向映射文件的文件对象(如果有的话)
	 */
	struct file * vm_file;		/* File we map to (can be NULL). */
	/**
	 * 指向内存区的私有数据。
	 */
	void * vm_private_data;		/* was vm_pte (shared mem) */
	/**
	 * 释放非线性文件内存映射中的一个线性地址区间时使用。
	 */
	unsigned long vm_truncate_count;/* truncate_count or restart_addr */

#ifndef CONFIG_MMU
	atomic_t vm_usage;		/* refcount (VMAs shared if !MMU) */
#endif
#ifdef CONFIG_NUMA
	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
#endif
};

2.2 mmap()

arch/x86/kernel/sys_x86_64.c:

SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
		unsigned long, prot, unsigned long, flags,
		unsigned long, fd, unsigned long, off)
{
	long error;
	error = -EINVAL;
	if (off & ~PAGE_MASK)
		goto out;

	error = sys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
out:
	return error;
}

↓

mm/mmap.c:

SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
		unsigned long, prot, unsigned long, flags,
		unsigned long, fd, unsigned long, pgoff)
{
	struct file *file = NULL;
	unsigned long retval;

    /* (1) 文件内存映射(非匿名内存映射) */
	if (!(flags & MAP_ANONYMOUS)) {
		audit_mmap_fd(fd, flags);
        /* (1.1) 根据fd获取到file */
		file = fget(fd);
		if (!file)
			return -EBADF;
        /* (1.2) 如果文件是hugepage,根据hugepage计算长度 */
		if (is_file_hugepages(file))
			len = ALIGN(len, huge_page_size(hstate_file(file)));
		retval = -EINVAL;
        /* (1.3) 如果指定了huge tlb映射,但是文件不是huge page,出错返回 */
		if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
			goto out_fput;
    /* (2) 匿名内存映射 且huge tlb */
	} else if (flags & MAP_HUGETLB) {
		struct user_struct *user = NULL;
		struct hstate *hs;

		hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
		if (!hs)
			return -EINVAL;

		len = ALIGN(len, huge_page_size(hs));
		/*
		 * VM_NORESERVE is used because the reservations will be
		 * taken when vm_ops->mmap() is called
		 * A dummy user value is used because we are not locking
		 * memory so no accounting is necessary
		 */
		file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
				VM_NORESERVE,
				&user, HUGETLB_ANONHUGE_INODE,
				(flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
		if (IS_ERR(file))
			return PTR_ERR(file);
	}

    /* (3) 清除用户设置的以下两个标志 */
	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);

    /* (4) 进一步调用 */
	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
	if (file)
		fput(file);
	return retval;
}

↓

unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
	unsigned long len, unsigned long prot,
	unsigned long flag, unsigned long pgoff)
{
	unsigned long ret;
	struct mm_struct *mm = current->mm;
	unsigned long populate;
	LIST_HEAD(uf);

	ret = security_mmap_file(file, prot, flag);
	if (!ret) {
		if (down_write_killable(&mm->mmap_sem))
			return -EINTR;
        /* 进一步调用 */
		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
				    &populate, &uf);
		up_write(&mm->mmap_sem);
		userfaultfd_unmap_complete(mm, &uf);
        /* 如果需要,立即填充vma对应的内存 */
		if (populate)
			mm_populate(ret, populate);
	}
	return ret;
}

↓
do_mmap_pgoff()
↓

unsigned long do_mmap(struct file *file, unsigned long addr,
			unsigned long len, unsigned long prot,
			unsigned long flags, vm_flags_t vm_flags,
			unsigned long pgoff, unsigned long *populate,
			struct list_head *uf)
{
	struct mm_struct *mm = current->mm;
	int pkey = 0;

	*populate = 0;

	if (!len)
		return -EINVAL;

	/*
	 * Does the application expect PROT_READ to imply PROT_EXEC?
     * 应用程序是否期望PROT_READ暗示PROT_EXEC?
	 *
	 * (the exception is when the underlying filesystem is noexec
	 *  mounted, in which case we dont add PROT_EXEC.)
     * (例外是在底层文件系统未执行noexec挂载时,在这种情况下,我们不添加PROT_EXEC。)
	 */
    /* (4.1.1) PROT_READ暗示PROT_EXEC */
	if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
		if (!(file && path_noexec(&file->f_path)))
			prot |= PROT_EXEC;

    /* (4.1.2) 如果没有设置固定地址的标志,给地址按page取整,且使其不小于mmap_min_addr */
	if (!(flags & MAP_FIXED))
		addr = round_hint_to_min(addr);

	/* Careful about overflows.. */
    /* (4.1.3) 给长度按page取整 */
	len = PAGE_ALIGN(len);
	if (!len)
		return -ENOMEM;

	/* offset overflow? */
    /* (4.1.4) 判断page offset + 长度,是否已经溢出 */
	if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
		return -EOVERFLOW;

	/* Too many mappings? */
    /* (4.1.5) 判断本进程mmap的区段个数已经超标 */
	if (mm->map_count > sysctl_max_map_count)
		return -ENOMEM;

	/* Obtain the address to map to. we verify (or select) it and ensure
	 * that it represents a valid section of the address space.
	 */
    /* (4.2) 从本进程的线性地址红黑树中分配一块空白地址 */
	addr = get_unmapped_area(file, addr, len, pgoff, flags);
	if (offset_in_page(addr))
		return addr;

    /* (4.3.1) 如果prot只指定了exec */
	if (prot == PROT_EXEC) {
		pkey = execute_only_pkey(mm);
		if (pkey < 0)
			pkey = 0;
	}

	/* Do simple checking here so the lower-level routines won't have
	 * to. we assume access permissions have been handled by the open
	 * of the memory object, so we don't do any here.
     * 在这里进行简单的检查,因此不必执行较低级别的例程。 我们假定访问权限已由内存对象的打开处理,因此在此不做任何操作。
	 */
    /* (4.3.2) 计算初始vm flags:
            根据prot计算相关的vm flag
            根据flags计算相关的vm flag
            再综合mm的默认flags等
     */
	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

    /* (4.3.3) 如果指定了内存lock标志,但是当前进程不能lock,出错返回 */
	if (flags & MAP_LOCKED)
		if (!can_do_mlock())
			return -EPERM;

    /* (4.3.4) 如果指定了内存lock标志,但是lock的长度超标,出错返回 */
	if (mlock_future_check(mm, vm_flags, len))
		return -EAGAIN;

    /* (4.4) 文件内存映射的一系列判断和处理 */
	if (file) {
		struct inode *inode = file_inode(file);
		unsigned long flags_mask;

        /* (4.4.1) 指定的page offset和len,需要在文件的合法长度内 */
		if (!file_mmap_ok(file, inode, pgoff, len))
			return -EOVERFLOW;

        /* (4.4.2) 本文件支持的mask,和mmap()传递下来的flags进行判断 */
		flags_mask = LEGACY_MAP_MASK | file->f_op->mmap_supported_flags;

		switch (flags & MAP_TYPE) {
        /* (4.4.3) 共享映射 */
		case MAP_SHARED:
			/*
			 * Force use of MAP_SHARED_VALIDATE with non-legacy
			 * flags. E.g. MAP_SYNC is dangerous to use with
			 * MAP_SHARED as you don't know which consistency model
			 * you will get. We silently ignore unsupported flags
			 * with MAP_SHARED to preserve backward compatibility.
			 */
			flags &= LEGACY_MAP_MASK;
			/* fall through */
        /* (4.4.4) 共享&校验映射 */
		case MAP_SHARED_VALIDATE:
			if (flags & ~flags_mask)
				return -EOPNOTSUPP;
			if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
				return -EACCES;

			/*
			 * Make sure we don't allow writing to an append-only
			 * file..
			 */
			if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
				return -EACCES;

			/*
			 * Make sure there are no mandatory locks on the file.
			 */
			if (locks_verify_locked(file))
				return -EAGAIN;

			vm_flags |= VM_SHARED | VM_MAYSHARE;
			if (!(file->f_mode & FMODE_WRITE))
				vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

			/* fall through */
        /* (4.4.3) 私有映射 */
		case MAP_PRIVATE:
			if (!(file->f_mode & FMODE_READ))
				return -EACCES;
			if (path_noexec(&file->f_path)) {
				if (vm_flags & VM_EXEC)
					return -EPERM;
				vm_flags &= ~VM_MAYEXEC;
			}

			if (!file->f_op->mmap)
				return -ENODEV;
			if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
				return -EINVAL;
			break;

		default:
			return -EINVAL;
		}
    /* (4.5) 匿名内存映射的一系列判断和处理 */
	} else {
		switch (flags & MAP_TYPE) {
		case MAP_SHARED:
			if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
				return -EINVAL;
			/*
			 * Ignore pgoff.
			 */
			pgoff = 0;
			vm_flags |= VM_SHARED | VM_MAYSHARE;
			break;
		case MAP_PRIVATE:
			/*
			 * Set pgoff according to addr for anon_vma.
			 */
			pgoff = addr >> PAGE_SHIFT;
			break;
		default:
			return -EINVAL;
		}
	}

	/*
	 * Set 'VM_NORESERVE' if we should not account for the
	 * memory use of this mapping.
     * 如果我们不应该考虑此映射的内存使用,则设置“ VM_NORESERVE”。
	 */
	if (flags & MAP_NORESERVE) {
		/* We honor MAP_NORESERVE if allowed to overcommit */
		if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
			vm_flags |= VM_NORESERVE;

		/* hugetlb applies strict overcommit unless MAP_NORESERVE */
		if (file && is_file_hugepages(file))
			vm_flags |= VM_NORESERVE;
	}

    /* (4.6) 根据查找到的地址、flags,正式在线性地址红黑树中插入一个新的VMA */
	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
    /* (4.7) 默认只是分配vma,不进行实际的内存分配和mmu映射,延迟到page_fault时才处理
        如果设置了立即填充的标志,在分配vma时就分配好内存
     */
	if (!IS_ERR_VALUE(addr) &&
	    ((vm_flags & VM_LOCKED) ||
	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
		*populate = len;
	return addr;
}

2.2.1 mm->mmap_base (the mmap base address)

current->mm->get_unmapped_area defaults to arch_get_unmapped_area()/arch_get_unmapped_area_topdown(); it is assigned in arch_pick_mmap_layout() when the new program image is set up during execve(). The same function does one more important thing: it assigns mm->mmap_base:

sys_execve() ... → load_elf_binary() → setup_new_exec() → arch_pick_mmap_layout()

void arch_pick_mmap_layout(struct mm_struct *mm)
{
    /* (1) 给get_unmapped_area成员赋值 */
	if (mmap_is_legacy())
		mm->get_unmapped_area = arch_get_unmapped_area;
	else
		mm->get_unmapped_area = arch_get_unmapped_area_topdown;

    /* (2) 计算64bit模式下,mmap的基地址
             传统layout模式:用户空间的1/3处,再加上随机偏移
             现代layout模式:用户空间顶端减去堆栈和堆栈随机偏移,再减去随机偏移
     */
	arch_pick_mmap_base(&mm->mmap_base, &mm->mmap_legacy_base,
			arch_rnd(mmap64_rnd_bits), task_size_64bit(0));

#ifdef CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
	/*
	 * The mmap syscall mapping base decision depends solely on the
	 * syscall type (64-bit or compat). This applies for 64bit
	 * applications and 32bit applications. The 64bit syscall uses
	 * mmap_base, the compat syscall uses mmap_compat_base.
	 */
    /* (3) compute the mmap base for 32-bit compat mode */
	arch_pick_mmap_base(&mm->mmap_compat_base, &mm->mmap_compat_legacy_base,
			arch_rnd(mmap32_rnd_bits), task_size_32bit());
#endif
}

|→

// 用户地址空间的最大值:2^47-0x1000 = 0x7FFFFFFFF000       // 约128T
#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
/* This decides where the kernel will search for a free chunk of vm
 * space during mmap's.
 */
#define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
					0xc0000000 : 0xFFFFe000)
#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
#define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
#define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
#define STACK_TOP		TASK_SIZE_LOW
#define STACK_TOP_MAX		TASK_SIZE_MAX

|→

// 根据定义的随机bit,来计算随机偏移多少个page
static unsigned long arch_rnd(unsigned int rndbits)
{
	if (!(current->flags & PF_RANDOMIZE))
		return 0;
	return (get_random_long() & ((1UL << rndbits) - 1)) << PAGE_SHIFT;
}

unsigned long task_size_64bit(int full_addr_space)
{
	return full_addr_space ? TASK_SIZE_MAX : DEFAULT_MAP_WINDOW;
}

|→

static void arch_pick_mmap_base(unsigned long *base, unsigned long *legacy_base,
		unsigned long random_factor, unsigned long task_size)
{
    /* (2.1) 传统layout模式下,mmap的基址:
        PAGE_ALIGN(task_size / 3) + rnd         // 用户空间的1/3处,再加上随机偏移
     */
	*legacy_base = mmap_legacy_base(random_factor, task_size);
	if (mmap_is_legacy())
		*base = *legacy_base;
	else
        /* (2.2) 现代layout模式下,mmap的基址:
            PAGE_ALIGN(task_size - stask_gap - rnd)     // 用户空间顶端减去堆栈和堆栈随机偏移,再减去随机偏移
         */
		*base = mmap_base(random_factor, task_size);
}

||→

static unsigned long mmap_legacy_base(unsigned long rnd,
				      unsigned long task_size)
{
	return __TASK_UNMAPPED_BASE(task_size) + rnd;
}

#define __TASK_UNMAPPED_BASE(task_size)	(PAGE_ALIGN(task_size / 3))

||→

static unsigned long mmap_base(unsigned long rnd, unsigned long task_size)
{
    // 堆栈的最大值
	unsigned long gap = rlimit(RLIMIT_STACK);   
    // 堆栈的最大随机偏移 + 1M
	unsigned long pad = stack_maxrandom_size(task_size) + stack_guard_gap;
	unsigned long gap_min, gap_max;

	/* Values close to RLIM_INFINITY can overflow. */
	if (gap + pad > gap)
		gap += pad;

	/*
	 * Top of mmap area (just below the process stack).
	 * Leave an at least ~128 MB hole with possible stack randomization.
	 */
    // 最小不小于128M,最大不大于用户空间的5/6
	gap_min = SIZE_128M;
	gap_max = (task_size / 6) * 5;

	if (gap < gap_min)
		gap = gap_min;
	else if (gap > gap_max)
		gap = gap_max;

	return PAGE_ALIGN(task_size - gap - rnd);
}

static unsigned long stack_maxrandom_size(unsigned long task_size)
{
	unsigned long max = 0;
	if (current->flags & PF_RANDOMIZE) {
		max = (-1UL) & __STACK_RND_MASK(task_size == task_size_32bit());
		max <<= PAGE_SHIFT;
	}

	return max;
}

/* 1GB for 64bit, 8MB for 32bit */
// 堆栈的最大随机偏移:64bit下16G,32bit下8M
#define __STACK_RND_MASK(is32bit) ((is32bit) ? 0x7ff : 0x3fffff)

/* enforced gap between the expanding stack and other mappings. */
// 1M 的stack gap空间
unsigned long stack_guard_gap = 256UL<<PAGE_SHIFT;
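
Plugging typical x86_64 numbers into the code above makes the two base addresses concrete (this assumes 4 KiB pages, a 47-bit user space and an 8 MiB RLIMIT_STACK; the random terms are left symbolic):

task_size        = 0x7FFFFFFFF000                    // 2^47 - 4096, ~128 TiB

legacy layout:
mmap_legacy_base = PAGE_ALIGN(task_size / 3) + rnd
                 ≈ 0x2AAAAAAAB000 + rnd              // ~42.7 TiB, mmap grows upward from here

modern layout:
gap              = RLIMIT_STACK + stack_maxrandom_size() + stack_guard_gap
                 ≈ 8 MiB + (up to ~16 GiB of stack ASLR) + 1 MiB
mmap_base        = PAGE_ALIGN(task_size - gap - rnd) // just below the stack, mmap grows downward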

2.2.2 get_unmapped_area()

get_unmapped_area() finds, in the current process's user address space, a free range that satisfies the request, to be used for the new vma.

unsigned long
get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
		unsigned long pgoff, unsigned long flags)
{
	unsigned long (*get_area)(struct file *, unsigned long,
				  unsigned long, unsigned long, unsigned long);

	unsigned long error = arch_mmap_check(addr, len, flags);
	if (error)
		return error;

	/* Careful about overflows.. */
	if (len > TASK_SIZE)
		return -ENOMEM;

    /* (1.1) 默认,使用本进程的mm->get_unmapped_area函数 */
	get_area = current->mm->get_unmapped_area;
	if (file) {
        /* (1.2) 文件内存映射,且文件有自己的get_unmapped_area,则使用file->f_op->get_unmapped_area */
		if (file->f_op->get_unmapped_area)
			get_area = file->f_op->get_unmapped_area;
	} else if (flags & MAP_SHARED) {
		/*
		 * mmap_region() will call shmem_zero_setup() to create a file,
		 * so use shmem's get_unmapped_area in case it can be huge.
		 * do_mmap_pgoff() will clear pgoff, so match alignment.
		 */
		pgoff = 0;
        /* (1.3) 匿名共享内存映射,使用shmem_get_unmapped_area函数 */
		get_area = shmem_get_unmapped_area;
	}

    /* (2) 实际的获取线性区域 */
	addr = get_area(file, addr, len, pgoff, flags);
	if (IS_ERR_VALUE(addr))
		return addr;

	if (addr > TASK_SIZE - len)
		return -ENOMEM;
	if (offset_in_page(addr))
		return -EINVAL;

	error = security_mmap_addr(addr);
	return error ? error : addr;
}

2.2.2.1 arch_get_unmapped_area()

Process address space, legacy/classic layout:

[figure]

The drawback of the classic layout: on x86_32 the user portion of the virtual address space runs from 0 to 0xC0000000, giving each user process 3 GiB. TASK_UNMAPPED_BASE usually starts at 0x40000000 (i.e. 1 GiB), which means the heap has less than 1 GiB to grow into before it runs into the mmap area, since the mmap area grows bottom-up from there.

In the legacy layout, mmap allocations go from low to high, from mm->mmap_base up to task_size. By default this goes through current->mm->get_unmapped_area -> arch_get_unmapped_area():

arch/x86/kernel/sys_x86_64.c:

unsigned long
arch_get_unmapped_area(struct file *filp, unsigned long addr,
		unsigned long len, unsigned long pgoff, unsigned long flags)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	struct vm_unmapped_area_info info;
	unsigned long begin, end;

    /* (2.1) 地址和长度不能大于 47 bits */
	addr = mpx_unmapped_area_check(addr, len, flags);
	if (IS_ERR_VALUE(addr))
		return addr;

    /* (2.2) 如果是固定映射,则返回原地址 */
	if (flags & MAP_FIXED)
		return addr;

    /* (2.3) 获取mmap区域的开始、结束地址:
        begin   :mm->mmap_base
        end     :task_size
     */
	find_start_end(addr, flags, &begin, &end);

    /* (2.4) 超长出错返回 */
	if (len > end)
		return -ENOMEM;

    /* (2.5) 按照用户给出的原始addr和len查看这里是否有空洞 */
	if (addr) {
		addr = PAGE_ALIGN(addr);
		vma = find_vma(mm, addr);
        /* 这里容易有个误解:
            开始以为find_vma()的作用是找到一个vma满足:vma->vm_start <= addr < vma->vm_end
            实际find_vma()的作用是找到一个地址最小的vma满足: addr < vma->vm_end
         */
        /* 在vma红黑树之外找到空间:
            1、如果addr的值在vma红黑树之上:!vma ,且有足够的空间:end - len >= addr
            2、如果addr的值在vma红黑树之下,且有足够的空间:addr + len <= vm_start_gap(vma)
         */
		if (end - len >= addr &&
		    (!vma || addr + len <= vm_start_gap(vma)))
			return addr;
	}

	info.flags = 0;
	info.length = len;
	info.low_limit = begin;
	info.high_limit = end;
	info.align_mask = 0;
	info.align_offset = pgoff << PAGE_SHIFT;
	if (filp) {
		info.align_mask = get_align_mask();
		info.align_offset += get_align_bits();
	}
    /* (2.6) 否则只能废弃掉用户指定的地址,根据长度重新给他找一个合适的地址
            优先在vma红黑树的空洞中找,其次在空白位置找
     */
	return vm_unmapped_area(&info);
}

|→

/* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
/* 查找最小的VMA,满足addr < vma->vm_end */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
	struct rb_node *rb_node;
	struct vm_area_struct *vma;

	/* Check the cache first. */
    /* (2.5.1) 查找vma cache,是否有vma的区域能包含addr地址 */
	vma = vmacache_find(mm, addr);
	if (likely(vma))
		return vma;

    /* (2.5.2) 查找vma红黑树,是否有vma的区域能包含addr地址 */
	rb_node = mm->mm_rb.rb_node;
	while (rb_node) {
		struct vm_area_struct *tmp;

		tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);

		if (tmp->vm_end > addr) {
			vma = tmp;
			if (tmp->vm_start <= addr)
				break;
			rb_node = rb_node->rb_left;
		} else
			rb_node = rb_node->rb_right;
	}

    /* (2.5.3) 利用查找到的vma来更新vma cache */
	if (vma)
		vmacache_update(addr, vma);
	return vma;
}

|→

/*
 * Search for an unmapped address range.
 *
 * We are looking for a range that:
 * - does not intersect with any VMA; // 不与任何VMA相交;
 * - is contained within the [low_limit, high_limit) interval; // 包含在[low_limit,high_limit)间隔内;
 * - is at least the desired size.  // 至少是所需的大小。
 * - satisfies (begin_addr & align_mask) == (align_offset & align_mask) // 满足如下条件
 */
static inline unsigned long
vm_unmapped_area(struct vm_unmapped_area_info *info)
{
    /* (2.6.1) 从高往低查找 */
	if (info->flags & VM_UNMAPPED_AREA_TOPDOWN)
		return unmapped_area_topdown(info);
	/* (2.6.2) 默认从低往高查找 */
    else
		return unmapped_area(info);
}

||→

unsigned long unmapped_area(struct vm_unmapped_area_info *info)
{
	/*
	 * We implement the search by looking for an rbtree node that
	 * immediately follows a suitable gap. That is,
     * 我们查找红黑树,找到一个合适的洞。需要满足以下条件:
	 * - gap_start = vma->vm_prev->vm_end <= info->high_limit - length;
	 * - gap_end   = vma->vm_start        >= info->low_limit  + length;
	 * - gap_end - gap_start >= length
	 */

	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	unsigned long length, low_limit, high_limit, gap_start, gap_end;

	/* Adjust search length to account for worst case alignment overhead */
    /* (2.6.2.1) 长度加上mask开销 */
	length = info->length + info->align_mask;
	if (length < info->length)
		return -ENOMEM;

	/* Adjust search limits by the desired length */
    /* (2.6.2.2) 计算high_limit,gap_start<=high_limit */
	if (info->high_limit < length)
		return -ENOMEM;
	high_limit = info->high_limit - length;

    /* (2.6.2.3) 计算low_limit,gap_end>=low_limit */
	if (info->low_limit > high_limit)
		return -ENOMEM;
	low_limit = info->low_limit + length;

	/* Check if rbtree root looks promising */
	if (RB_EMPTY_ROOT(&mm->mm_rb))
		goto check_highest;
	vma = rb_entry(mm->mm_rb.rb_node, struct vm_area_struct, vm_rb);
	/*
     * rb_subtree_gap的定义:
	 * Largest free memory gap in bytes to the left of this VMA. 
     * 此VMA左侧的最大可用内存空白(以字节为单位)。
	 * Either between this VMA and vma->vm_prev, or between one of the
	 * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
	 * get_unmapped_area find a free area of the right size.
     * 在此VMA和vma-> vm_prev之间,或在VMA rbtree中我们下面的VMA之一与其-> vm_prev之间。 这有助于get_unmapped_area找到合适大小的空闲区域。
	 */    
	if (vma->rb_subtree_gap < length)
		goto check_highest;

    /* (2.6.2.4) 查找红黑树根节点的左子树中是否有符合要求的空洞。
            有个疑问:
                根节点的右子树不需要搜索了吗?还是根节点没有右子树?
     */
	while (true) {
		/* Visit left subtree if it looks promising */
        /* (2.6.2.4.1) 一直往左找,找到最左边有合适大小的节点
                因为最左边的地址最小
         */
		gap_end = vm_start_gap(vma);
		if (gap_end >= low_limit && vma->vm_rb.rb_left) {
			struct vm_area_struct *left =
				rb_entry(vma->vm_rb.rb_left,
					 struct vm_area_struct, vm_rb);
			if (left->rb_subtree_gap >= length) {
				vma = left;
				continue;
			}
		}

		gap_start = vma->vm_prev ? vm_end_gap(vma->vm_prev) : 0;
check_current:
		/* Check if current node has a suitable gap */
		if (gap_start > high_limit)
			return -ENOMEM;
        /* (2.6.2.4.2) 如果已找到合适的洞,则跳出循环 */
		if (gap_end >= low_limit &&
		    gap_end > gap_start && gap_end - gap_start >= length)
			goto found;

		/* Visit right subtree if it looks promising */
        /* (2.6.2.4.3) 如果左子树查找失败,从当前vm的右子树查找 */
		if (vma->vm_rb.rb_right) {
			struct vm_area_struct *right =
				rb_entry(vma->vm_rb.rb_right,
					 struct vm_area_struct, vm_rb);
			if (right->rb_subtree_gap >= length) {
				vma = right;
				continue;
			}
		}

		/* Go back up the rbtree to find next candidate node */
        /* (2.6.2.4.4) 如果左右子树都搜寻失败,向回搜寻父节点 */
		while (true) {
			struct rb_node *prev = &vma->vm_rb;
			if (!rb_parent(prev))
				goto check_highest;
			vma = rb_entry(rb_parent(prev),
				       struct vm_area_struct, vm_rb);
			if (prev == vma->vm_rb.rb_left) {
				gap_start = vm_end_gap(vma->vm_prev);
				gap_end = vm_start_gap(vma);
				goto check_current;
			}
		}
	}

    /* (2.6.2.5) 如果红黑树中没有合适的空洞,从highest空间查找是否有合适的
            highest空间是还没有vma分配的空白空间
            但是优先查找已分配vma之间的空洞
     */
check_highest:
	/* Check highest gap, which does not precede any rbtree node */
	gap_start = mm->highest_vm_end;
	gap_end = ULONG_MAX;  /* Only for VM_BUG_ON below */
	if (gap_start > high_limit)
		return -ENOMEM;

    /* (2.6.2.6) 搜索到了合适的空间,返回开始地址 */
found:
	/* We found a suitable gap. Clip it with the original low_limit. */
	if (gap_start < info->low_limit)
		gap_start = info->low_limit;

	/* Adjust gap address to the desired alignment */
	gap_start += (info->align_offset - gap_start) & info->align_mask;

	VM_BUG_ON(gap_start + info->length > info->high_limit);
	VM_BUG_ON(gap_start + info->length > gap_end);
	return gap_start;
}

2.2.2.2 arch_get_unmapped_area_topdown()

The new virtual address space layout:

[figure]

The difference from the classic layout is that the stack's maximum size is bounded by a fixed value. Because the stack is bounded, the mmap area can start immediately below the end of the stack and grow top-down, while the heap still lives in the lower part of the address space and grows upward. The two can therefore grow toward each other until the space remaining between them is exhausted.

In the modern layout, mmap allocations go from high to low, via current->mm->get_unmapped_area -> arch_get_unmapped_area_topdown():

unsigned long
arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
			  const unsigned long len, const unsigned long pgoff,
			  const unsigned long flags)
{
	struct vm_area_struct *vma;
	struct mm_struct *mm = current->mm;
	unsigned long addr = addr0;
	struct vm_unmapped_area_info info;

	addr = mpx_unmapped_area_check(addr, len, flags);
	if (IS_ERR_VALUE(addr))
		return addr;

	/* requested length too big for entire address space */
	if (len > TASK_SIZE)
		return -ENOMEM;

	/* No address checking. See comment at mmap_address_hint_valid() */
	if (flags & MAP_FIXED)
		return addr;

	/* for MAP_32BIT mappings we force the legacy mmap base */
	if (!in_compat_syscall() && (flags & MAP_32BIT))
		goto bottomup;

	/* requesting a specific address */
	if (addr) {
		addr &= PAGE_MASK;
		if (!mmap_address_hint_valid(addr, len))
			goto get_unmapped_area;

		vma = find_vma(mm, addr);
		if (!vma || addr + len <= vm_start_gap(vma))
			return addr;
	}
get_unmapped_area:

	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
	info.length = len;
	info.low_limit = PAGE_SIZE;
	info.high_limit = get_mmap_base(0);

	/*
	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
	 * in the full address space.
	 *
	 * !in_compat_syscall() check to avoid high addresses for x32.
	 */
	if (addr > DEFAULT_MAP_WINDOW && !in_compat_syscall())
		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;

	info.align_mask = 0;
	info.align_offset = pgoff << PAGE_SHIFT;
	if (filp) {
		info.align_mask = get_align_mask();
		info.align_offset += get_align_bits();
	}
	addr = vm_unmapped_area(&info);
	if (!(addr & ~PAGE_MASK))
		return addr;
	VM_BUG_ON(addr != -ENOMEM);

bottomup:
	/*
	 * A failed mmap() very likely causes application failure,
	 * so fall back to the bottom-up function here. This scenario
	 * can happen with large stack limits and large mmap()
	 * allocations.
	 */
	return arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
}

↓

unmapped_area_topdown()

unmapped_area_topdown() follows the same logic as unmapped_area(), except that it prefers high addresses, so it searches the right subtree first.

2.2.2.3 file->f_op->get_unmapped_area

If the filesystem provides its own get_unmapped_area, that one is used instead. Taking ext4 as an example, its file_operations do supply one, thp_get_unmapped_area:

const struct file_operations ext4_file_operations = {
	.llseek		= ext4_llseek,
	.read_iter	= ext4_file_read_iter,
	.write_iter	= ext4_file_write_iter,
	.unlocked_ioctl = ext4_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl	= ext4_compat_ioctl,
#endif
	.mmap		= ext4_file_mmap,
	.mmap_supported_flags = MAP_SYNC,
	.open		= ext4_file_open,
	.release	= ext4_release_file,
	.fsync		= ext4_sync_file,
	.get_unmapped_area = thp_get_unmapped_area,
	.splice_read	= generic_file_splice_read,
	.splice_write	= iter_file_splice_write,
	.fallocate	= ext4_fallocate,
};

2.2.3 mmap_region()

unsigned long mmap_region(struct file *file, unsigned long addr,
		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
		struct list_head *uf)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma, *prev;
	int error;
	struct rb_node **rb_link, *rb_parent;
	unsigned long charged = 0;

	/* Check against address space limit. */
    /* (1) 判断地址空间大小是否已经超标
        总的空间:mm->total_vm + npages > rlimit(RLIMIT_AS) >> PAGE_SHIFT
        数据空间:mm->data_vm + npages > rlimit(RLIMIT_DATA) >> PAGE_SHIFT
     */
	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
		unsigned long nr_pages;

		/*
		 * MAP_FIXED may remove pages of mappings that intersects with
		 * requested mapping. Account for the pages it would unmap.
		 */
        /* (1.1) 固定映射指定地址的情况下,地址空间可能和已有的VMA重叠,其他情况下不会重叠
                需要先unmap移除掉和新地址交错的vma地址
                所以可以先减去这部分空间,再判断大小是否超标
         */
		nr_pages = count_vma_pages_range(mm, addr, addr + len);

		if (!may_expand_vm(mm, vm_flags,
					(len >> PAGE_SHIFT) - nr_pages))
			return -ENOMEM;
	}

	/* Clear old maps */
    /* (2) 如果新地址和旧的vma有覆盖的情况:
            把覆盖的地址范围的vma分割出来,先释放掉
     */
	while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
			      &rb_parent)) {
		if (do_munmap(mm, addr, len, uf))
			return -ENOMEM;
	}

	/*
	 * Private writable mapping: check memory availability
     * 私有可写映射:检查内存可用性
	 */
	if (accountable_mapping(file, vm_flags)) {
		charged = len >> PAGE_SHIFT;
		if (security_vm_enough_memory_mm(mm, charged))
			return -ENOMEM;
		vm_flags |= VM_ACCOUNT;
	}

	/*
	 * Can we just expand an old mapping?
	 */
    /* (3) 尝试和临近的vma进行merge */
	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
	if (vma)
		goto out;

	/*
	 * Determine the object being mapped and call the appropriate
	 * specific mapper. the address has already been validated, but
	 * not unmapped, but the maps are removed from the list.
	 */
    /* (4.1) 分配新的vma结构体 */
	vma = vm_area_alloc(mm);
	if (!vma) {
		error = -ENOMEM;
		goto unacct_error;
	}

    /* (4.2) 结构体相关成员 */
	vma->vm_start = addr;
	vma->vm_end = addr + len;
	vma->vm_flags = vm_flags;
	vma->vm_page_prot = vm_get_page_prot(vm_flags);
	vma->vm_pgoff = pgoff;

    /* (4.3) 文件内存映射 */
	if (file) {
		if (vm_flags & VM_DENYWRITE) {
			error = deny_write_access(file);
			if (error)
				goto free_vma;
		}
		if (vm_flags & VM_SHARED) {
			error = mapping_map_writable(file->f_mapping);
			if (error)
				goto allow_write_and_free_vma;
		}

		/* ->mmap() can change vma->vm_file, but must guarantee that
		 * vma_link() below can deny write-access if VM_DENYWRITE is set
		 * and map writably if VM_SHARED is set. This usually means the
		 * new file must not have been exposed to user-space, yet.
		 */
        /* (4.3.1) 给vma->vm_file赋值 */
		vma->vm_file = get_file(file);
        /* (4.3.2) 调用file->f_op->mmap,给vma->vm_ops赋值
            例如ext4:vma->vm_ops = &ext4_file_vm_ops;
         */
		error = call_mmap(file, vma);
		if (error)
			goto unmap_and_free_vma;

		/* Can addr have changed??
		 *
		 * Answer: Yes, several device drivers can do it in their
		 *         f_op->mmap method. -DaveM
		 * Bug: If addr is changed, prev, rb_link, rb_parent should
		 *      be updated for vma_link()
		 */
		WARN_ON_ONCE(addr != vma->vm_start);

		addr = vma->vm_start;
		vm_flags = vma->vm_flags;
    /* (4.4) 匿名共享内存映射 */
	} else if (vm_flags & VM_SHARED) {
        /* (4.4.1) 赋值:
            vma->vm_file = shmem_kernel_file_setup("dev/zero", size, vma->vm_flags);
            vma->vm_ops = &shmem_vm_ops;
         */
		error = shmem_zero_setup(vma);
		if (error)
			goto free_vma;
	}
    /* (4.4) 匿名私有内存映射呢?
            vma->vm_file = NULL?
            vma->vm_ops = NULL?
     */

    /* (4.6) 将新的vma插入 */
	vma_link(mm, vma, prev, rb_link, rb_parent);
	/* Once vma denies write, undo our temporary denial count */
	if (file) {
		if (vm_flags & VM_SHARED)
			mapping_unmap_writable(file->f_mapping);
		if (vm_flags & VM_DENYWRITE)
			allow_write_access(file);
	}
	file = vma->vm_file;
out:
	perf_event_mmap(vma);

	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
	if (vm_flags & VM_LOCKED) {
		if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
					vma == get_gate_vma(current->mm)))
			mm->locked_vm += (len >> PAGE_SHIFT);
		else
			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
	}

	if (file)
		uprobe_mmap(vma);

	/*
	 * New (or expanded) vma always get soft dirty status.
	 * Otherwise user-space soft-dirty page tracker won't
	 * be able to distinguish situation when vma area unmapped,
	 * then new mapped in-place (which must be aimed as
	 * a completely new data area).
	 */
	vma->vm_flags |= VM_SOFTDIRTY;

	vma_set_page_prot(vma);

	return addr;

unmap_and_free_vma:
	vma_fput(vma);
	vma->vm_file = NULL;

	/* Undo any partial mapping done by a device driver. */
	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
	charged = 0;
	if (vm_flags & VM_SHARED)
		mapping_unmap_writable(file->f_mapping);
allow_write_and_free_vma:
	if (vm_flags & VM_DENYWRITE)
		allow_write_access(file);
free_vma:
	vm_area_free(vma);
unacct_error:
	if (charged)
		vm_unacct_memory(charged);
	return error;
}

↓

vma_link()

2.2.3.1 vma_link()

static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
			struct vm_area_struct *prev, struct rb_node **rb_link,
			struct rb_node *rb_parent)
{
	struct address_space *mapping = NULL;

	if (vma->vm_file) {
		mapping = vma->vm_file->f_mapping;
		i_mmap_lock_write(mapping);
	}

    /* (4.6.1) 将新的vma插入到vma红黑树和vma链表中,并且更新树上的各种参数 */
	__vma_link(mm, vma, prev, rb_link, rb_parent);
    /* (4.6.2) 将vma插入到文件的file->f_mapping->i_mmap缓存树中 */
	__vma_link_file(vma);

	if (mapping)
		i_mmap_unlock_write(mapping);

	mm->map_count++;
	validate_mm(mm);
}

After mmap(), the relationship between user virtual addresses and the file's page cache looks like this:

[figure]

f_op->mmap() does not immediately establish the mapping between the vma's user addresses and the file's page cache; it only fills in vma->vm_ops. When the user later touches an address inside the vma, a page fault occurs and vma->vm_ops->fault() (->nopage() in older kernels) is called to establish the mapping.
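
For reference, ext4's vm_operations_struct in kernels of roughly this vintage looks like the sketch below (quoted from memory, so treat it as illustrative): call_mmap() in mmap_region() only installs this table, and the .fault callback is what the page-fault path invokes later.

static const struct vm_operations_struct ext4_file_vm_ops = {
	.fault		= ext4_filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite	= ext4_page_mkwrite,
};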

2.2.4 do_munmap()

/* Munmap is split into 2 main parts -- this part which finds
 * what needs doing, and the areas themselves, which do the
 * work.  This now handles partial unmappings.
 * Jeremy Fitzhardinge <[email protected]>
 */
int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
	      struct list_head *uf)
{
	unsigned long end;
	struct vm_area_struct *vma, *prev, *last;

	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
		return -EINVAL;

	len = PAGE_ALIGN(len);
	if (len == 0)
		return -EINVAL;

	/* Find the first overlapping VMA */
    /* (1) 找到第一个可能重叠的VMA */
	vma = find_vma(mm, start);
	if (!vma)
		return 0;
	prev = vma->vm_prev;
	/* we have  start < vma->vm_end  */

	/* if it doesn't overlap, we have nothing.. */
    /* (2) 如果地址没有重叠,直接返回 */
	end = start + len;
	if (vma->vm_start >= end)
		return 0;

    /* (3) 如果有unmap区域和vma有重叠,先尝试把unmap区域切分成独立的小块vma,再unmap掉 */
	/*
	 * If we need to split any vma, do it now to save pain later.
	 *
	 * Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
	 * unmapped vm_area_struct will remain in use: so lower split_vma
	 * places tmp vma above, and higher split_vma places tmp vma below.
	 */
    /* (3.1) 如果start和vma重叠,切一刀 */
	if (start > vma->vm_start) {
		int error;

		/*
		 * Make sure that map_count on return from munmap() will
		 * not exceed its limit; but let map_count go just above
		 * its limit temporarily, to help free resources as expected.
		 */
		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
			return -ENOMEM;

		error = __split_vma(mm, vma, start, 0);
		if (error)
			return error;
		prev = vma;
	}

	/* Does it split the last one? */
    /* (3.2) 如果end和vma冲切,切一刀 */
	last = find_vma(mm, end);
	if (last && end > last->vm_start) {
		int error = __split_vma(mm, last, end, 1);
		if (error)
			return error;
	}
	vma = prev ? prev->vm_next : mm->mmap;

	if (unlikely(uf)) {
		/*
		 * If userfaultfd_unmap_prep returns an error the vmas
		 * will remain splitted, but userland will get a
		 * highly unexpected error anyway. This is no
		 * different than the case where the first of the two
		 * __split_vma fails, but we don't undo the first
		 * split, despite we could. This is unlikely enough
		 * failure that it's not worth optimizing it for.
		 */
		int error = userfaultfd_unmap_prep(vma, start, end, uf);
		if (error)
			return error;
	}

	/*
	 * unlock any mlock()ed ranges before detaching vmas
	 */
    /* (4) 移除目标vma上的相关lock */
	if (mm->locked_vm) {
		struct vm_area_struct *tmp = vma;
		while (tmp && tmp->vm_start < end) {
			if (tmp->vm_flags & VM_LOCKED) {
				mm->locked_vm -= vma_pages(tmp);
				munlock_vma_pages_all(tmp);
			}
			tmp = tmp->vm_next;
		}
	}

    /* (5) 移除目标vma */
	/*
	 * Remove the vma's, and unmap the actual pages
	 */
    /* (5.1) 从vma红黑树中移除vma */
	detach_vmas_to_be_unmapped(mm, vma, prev, end);
    /* (5.2) 释放掉vma空间对应的mmu映射表以及内存 */
	unmap_region(mm, vma, prev, start, end);
    /* (5.3) arch相关的vma释放 */
	arch_unmap(mm, vma, start, end);
    /* (5.4) 移除掉vma的其他信息,最后释放掉vma结构体 */
	/* Fix up all other VM information */
	remove_vma_list(mm, vma);

	return 0;
}

|→

static void unmap_region(struct mm_struct *mm,
		struct vm_area_struct *vma, struct vm_area_struct *prev,
		unsigned long start, unsigned long end)
{
	struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
	struct mmu_gather tlb;

	lru_add_drain();
	tlb_gather_mmu(&tlb, mm, start, end);
	update_hiwater_rss(mm);
	unmap_vmas(&tlb, vma, start, end);
	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
				 next ? next->vm_start : USER_PGTABLES_CEILING);
	tlb_finish_mmu(&tlb, start, end);
}

||→

free_pgtables()
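
The vma-splitting behaviour of do_munmap() can be seen from user space: map three pages as one vma, unmap the middle page, and the remaining two pages show up as two separate lines in /proc/self/maps (look for the address printed by the program; an adjacent anonymous vma may have been merged into it). A minimal sketch:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	char *p = mmap(NULL, 3 * 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	printf("mapping at %p-%p\n", p, p + 3 * 4096);
	system("cat /proc/self/maps");	/* one vma covers all three pages */

	munmap(p + 4096, 4096);		/* punch a hole in the middle page */
	system("cat /proc/self/maps");	/* do_munmap() split the vma in two */

	munmap(p, 4096);
	munmap(p + 2 * 4096, 4096);
	return 0;
}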

2.2.5 mm_populate()

static inline void mm_populate(unsigned long addr, unsigned long len)
{
	/* Ignore errors */
	(void) __mm_populate(addr, len, 1);
}

↓

int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
{
	struct mm_struct *mm = current->mm;
	unsigned long end, nstart, nend;
	struct vm_area_struct *vma = NULL;
	int locked = 0;
	long ret = 0;

	end = start + len;

	for (nstart = start; nstart < end; nstart = nend) {
		/*
		 * We want to fault in pages for [nstart; end) address range.
		 * Find first corresponding VMA.
		 */
		/* (1) 根据目标地址,查找vma */
		if (!locked) {
			locked = 1;
			down_read(&mm->mmap_sem);
			vma = find_vma(mm, nstart);
		} else if (nstart >= vma->vm_end)
			vma = vma->vm_next;
		if (!vma || vma->vm_start >= end)
			break;
		/*
		 * Set [nstart; nend) to intersection of desired address
		 * range with the first VMA. Also, skip undesirable VMA types.
		 */
		/* (2) 计算目标地址和vma的重叠部分 */
		nend = min(end, vma->vm_end);
		if (vma->vm_flags & (VM_IO | VM_PFNMAP))
			continue;
		if (nstart < vma->vm_start)
			nstart = vma->vm_start;
		/*
		 * Now fault in a range of pages. populate_vma_page_range()
		 * double checks the vma flags, so that it won't mlock pages
		 * if the vma was already munlocked.
		 */
		/* (3) 对重叠部分进行page内存填充 */
		ret = populate_vma_page_range(vma, nstart, nend, &locked);
		if (ret < 0) {
			if (ignore_errors) {
				ret = 0;
				continue;	/* continue at next VMA */
			}
			break;
		}
		nend = nstart + ret * PAGE_SIZE;
		ret = 0;
	}
	if (locked)
		up_read(&mm->mmap_sem);
	return ret;	/* 0 or negative error code */
}

↓

long populate_vma_page_range(struct vm_area_struct *vma,
		unsigned long start, unsigned long end, int *nonblocking)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long nr_pages = (end - start) / PAGE_SIZE;
	int gup_flags;

	VM_BUG_ON(start & ~PAGE_MASK);
	VM_BUG_ON(end   & ~PAGE_MASK);
	VM_BUG_ON_VMA(start < vma->vm_start, vma);
	VM_BUG_ON_VMA(end   > vma->vm_end, vma);
	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);

	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
	if (vma->vm_flags & VM_LOCKONFAULT)
		gup_flags &= ~FOLL_POPULATE;
	/*
	 * We want to touch writable mappings with a write fault in order
	 * to break COW, except for shared mappings because these don't COW
	 * and we would not want to dirty them for nothing.
	 */
	if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
		gup_flags |= FOLL_WRITE;

	/*
	 * We want mlock to succeed for regions that have any permissions
	 * other than PROT_NONE.
	 */
	if (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC))
		gup_flags |= FOLL_FORCE;

	/*
	 * We made sure addr is within a VMA, so the following will
	 * not result in a stack expansion that recurses back here.
	 */
	return __get_user_pages(current, mm, start, nr_pages, gup_flags,
				NULL, NULL, nonblocking);
}

↓

/**
 * __get_user_pages() - pin user pages in memory // 将用户页面固定在内存中
 * @tsk:	task_struct of target task
 * @mm:		mm_struct of target mm
 * @start:	starting user address
 * @nr_pages:	number of pages from start to pin
 * @gup_flags:	flags modifying pin behaviour
 * @pages:	array that receives pointers to the pages pinned.
 *		Should be at least nr_pages long. Or NULL, if caller
 *		only intends to ensure the pages are faulted in.
 * @vmas:	array of pointers to vmas corresponding to each page.
 *		Or NULL if the caller does not require them.
 * @nonblocking: whether waiting for disk IO or mmap_sem contention
 *
 * Returns number of pages pinned. This may be fewer than the number
 * requested. If nr_pages is 0 or negative, returns 0. If no pages
 * were pinned, returns -errno. Each page returned must be released
 * with a put_page() call when it is finished with. vmas will only
 * remain valid while mmap_sem is held.
 * 返回固定的页数。这可能少于请求的数量。如果nr_pages为0或负数,则返回0。如果没有固定页面,则返回-errno。完成后,返回的每个页面都必须使用put_page()调用释放。只有保留mmap_sem时,vmas才保持有效。
 *
 * Must be called with mmap_sem held.  It may be released.  See below.
 * 必须在保持mmap_sem的情况下调用。它可能会被释放。见下文。
 *
 * __get_user_pages walks a process's page tables and takes a reference to
 * each struct page that each user address corresponds to at a given
 * instant. That is, it takes the page that would be accessed if a user
 * thread accesses the given user virtual address at that instant.
 * __get_user_pages遍历进程的页表,并引用给定瞬间每个用户地址所对应的每个struct页。也就是说,如果用户线程在该时刻访问给定的用户虚拟地址,则它将占用要访问的页面。
 *
 * This does not guarantee that the page exists in the user mappings when
 * __get_user_pages returns, and there may even be a completely different
 * page there in some cases (eg. if mmapped pagecache has been invalidated
 * and subsequently re faulted). However it does guarantee that the page
 * won't be freed completely. And mostly callers simply care that the page
 * contains data that was valid *at some point in time*. Typically, an IO
 * or similar operation cannot guarantee anything stronger anyway because
 * locks can't be held over the syscall boundary.
 * 这不能保证在__get_user_pages返回时该页面存在于用户映射中,并且在某些情况下甚至可能存在一个完全不同的页面(例如,如果映射的页面缓存已失效并随后发生故障)。但是,它可以确保不会完全释放该页面。而且大多数调用者只是在乎页面是否包含在某个时间点有效的数据。通常,由于无法在系统调用边界上保持锁,因此IO或类似操作无论如何都无法保证更强大。
 *
 * If @gup_flags & FOLL_WRITE == 0, the page must not be written to. If
 * the page is written to, set_page_dirty (or set_page_dirty_lock, as
 * appropriate) must be called after the page is finished with, and
 * before put_page is called.
 * 如果@gup_flags和FOLL_WRITE == 0,则不得写入该页面。如果要写入页面,则必须在页面完成之后且在调用put_page之前调用set_page_dirty(或适当的set_page_dirty_lock)。
 *
 * If @nonblocking != NULL, __get_user_pages will not wait for disk IO
 * or mmap_sem contention, and if waiting is needed to pin all pages,
 * *@nonblocking will be set to 0.  Further, if @gup_flags does not
 * include FOLL_NOWAIT, the mmap_sem will be released via up_read() in
 * this case.
 * 如果@nonblocking!= NULL,则__get_user_pages将不等待磁盘IO或mmap_sem争用,并且如果需要等待固定所有页面,则* @ nonblocking将设置为0。此外,如果@gup_flags不包含FOLL_NOWAIT,则mmap_sem在这种情况下将通过up_read()释放。
 *
 * A caller using such a combination of @nonblocking and @gup_flags
 * must therefore hold the mmap_sem for reading only, and recognize
 * when it's been released.  Otherwise, it must be held for either
 * reading or writing and will not be released.
 * 因此,使用@nonblocking和@gup_flags的组合的调用者必须将mmap_sem保留为只读,并识别它何时被释放。否则,必须保留它以进行读取或书写,并且不会被释放。
 *
 * In most cases, get_user_pages or get_user_pages_fast should be used
 * instead of __get_user_pages. __get_user_pages should be used only if
 * you need some special @gup_flags.
 * 在大多数情况下,应使用get_user_pages或get_user_pages_fast而不是__get_user_pages。 __get_user_pages仅在需要一些特殊的@gup_flags时使用。
 */
static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
		unsigned long start, unsigned long nr_pages,
		unsigned int gup_flags, struct page **pages,
		struct vm_area_struct **vmas, int *nonblocking)
{
	long i = 0;
	unsigned int page_mask;
	struct vm_area_struct *vma = NULL;

	if (!nr_pages)
		return 0;

	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

	/*
	 * If FOLL_FORCE is set then do not force a full fault as the hinting
	 * fault information is unrelated to the reference behaviour of a task
	 * using the address space
	 */
	if (!(gup_flags & FOLL_FORCE))
		gup_flags |= FOLL_NUMA;

	do {
		struct page *page;
		unsigned int foll_flags = gup_flags;
		unsigned int page_increm;

		/* first iteration or cross vma bound */
		/* (3.1) 地址跨区域的处理 */
		if (!vma || start >= vma->vm_end) {
			vma = find_extend_vma(mm, start);
			if (!vma && in_gate_area(mm, start)) {
				int ret;
				ret = get_gate_page(mm, start & PAGE_MASK,
						gup_flags, &vma,
						pages ? &pages[i] : NULL);
				if (ret)
					return i ? : ret;
				page_mask = 0;
				goto next_page;
			}

			if (!vma || check_vma_flags(vma, gup_flags))
				return i ? : -EFAULT;
			if (is_vm_hugetlb_page(vma)) {
				i = follow_hugetlb_page(mm, vma, pages, vmas,
						&start, &nr_pages, i,
						gup_flags, nonblocking);
				continue;
			}
		}
retry:
		/*
		 * If we have a pending SIGKILL, don't keep faulting pages and
		 * potentially allocating memory.
		 */
		if (unlikely(fatal_signal_pending(current)))
			return i ? i : -ERESTARTSYS;
		cond_resched();
		/* (3.2) 逐个查询vma中地址对应的page是否已经分配 */
		page = follow_page_mask(vma, start, foll_flags, &page_mask);
		if (!page) {
			int ret;
			/* (3.3) 如果page没有分配,则使用缺页处理来分配page */
			ret = faultin_page(tsk, vma, start, &foll_flags,
					nonblocking);
			switch (ret) {
			case 0:
				goto retry;
			case -EFAULT:
			case -ENOMEM:
			case -EHWPOISON:
				return i ? i : ret;
			case -EBUSY:
				return i;
			case -ENOENT:
				goto next_page;
			}
			BUG();
		} else if (PTR_ERR(page) == -EEXIST) {
			/*
			 * Proper page table entry exists, but no corresponding
			 * struct page.
			 */
			goto next_page;
		} else if (IS_ERR(page)) {
			return i ? i : PTR_ERR(page);
		}
		if (pages) {
			pages[i] = page;
			flush_anon_page(vma, page, start);
			flush_dcache_page(page);
			page_mask = 0;
		}
next_page:
		if (vmas) {
			vmas[i] = vma;
			page_mask = 0;
		}
		page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
		if (page_increm > nr_pages)
			page_increm = nr_pages;
		i += page_increm;
		start += page_increm * PAGE_SIZE;
		nr_pages -= page_increm;
	} while (nr_pages);
	return i;
}

↓

2.3 Page fault handling

2.3.1 do_page_fault()

The page-fault handling path is do_page_fault() -> __do_page_fault() -> do_user_addr_fault() -> handle_mm_fault().

The core function is handle_mm_fault(), the same one used by mm_populate(); only the trigger differs.

mm_populate() runs right after mmap() has set up the vma, when the caller asked for VM_LOCKED (or MAP_POPULATE): physical memory is allocated and the contents are read in immediately.

do_page_fault(), in contrast, allocates physical memory and reads in the contents only after an actual memory access has happened.
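
The difference is easy to demonstrate with mincore(), which reports whether each page of a range is resident: with MAP_POPULATE (or MAP_LOCKED) the pages are already there when mmap() returns, otherwise they only appear after the first access. A rough sketch, assuming 4 KiB pages:

#include <stdio.h>
#include <sys/mman.h>

static int resident_pages(void *addr, size_t len)
{
	unsigned char vec[16];
	int n = 0;

	mincore(addr, len, vec);
	for (size_t i = 0; i < len / 4096; i++)
		n += vec[i] & 1;
	return n;
}

int main(void)
{
	size_t len = 16 * 4096;

	char *lazy = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *eager = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	printf("lazy: %d/16 resident, eager: %d/16 resident\n",
	       resident_pages(lazy, len), resident_pages(eager, len));

	lazy[0] = 1;	/* first touch faults in one page of the lazy mapping */
	printf("after touch, lazy: %d/16 resident\n", resident_pages(lazy, len));
	return 0;
}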

2.3.2 faultin_page()

There are several causes of a fault: a not-yet-present page (first touch), copy-on-write (COW), and a swapped-out page.

static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
		unsigned long address, unsigned int *flags, int *nonblocking)
{
	unsigned int fault_flags = 0;
	int ret;

	/* mlock all present pages, but do not fault in new pages */
	if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
		return -ENOENT;
	if (*flags & FOLL_WRITE)
		fault_flags |= FAULT_FLAG_WRITE;
	if (*flags & FOLL_REMOTE)
		fault_flags |= FAULT_FLAG_REMOTE;
	if (nonblocking)
		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
	if (*flags & FOLL_NOWAIT)
		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
	if (*flags & FOLL_TRIED) {
		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
		fault_flags |= FAULT_FLAG_TRIED;
	}

	ret = handle_mm_fault(vma, address, fault_flags);
	if (ret & VM_FAULT_ERROR) {
		int err = vm_fault_to_errno(ret, *flags);

		if (err)
			return err;
		BUG();
	}

	if (tsk) {
		if (ret & VM_FAULT_MAJOR)
			tsk->maj_flt++;
		else
			tsk->min_flt++;
	}

	if (ret & VM_FAULT_RETRY) {
		if (nonblocking)
			*nonblocking = 0;
		return -EBUSY;
	}

	/*
	 * The VM_FAULT_WRITE bit tells us that do_wp_page has broken COW when
	 * necessary, even if maybe_mkwrite decided not to set pte_write. We
	 * can thus safely do subsequent page lookups as if they were reads.
	 * But only do so when looping for pte_write is futile: in some cases
	 * userspace may also be wanting to write to the gotten user page,
	 * which a read fault here might prevent (a readonly page might get
	 * reCOWed by userspace write).
	 */
	if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
	        *flags |= FOLL_COW;
	return 0;
}

↓

int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
		unsigned int flags)
{
	int ret;

	__set_current_state(TASK_RUNNING);

	count_vm_event(PGFAULT);
	count_memcg_event_mm(vma->vm_mm, PGFAULT);

	/* do counter updates before entering really critical section. */
	check_sync_rss_stat(current);

	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
					    flags & FAULT_FLAG_INSTRUCTION,
					    flags & FAULT_FLAG_REMOTE))
		return VM_FAULT_SIGSEGV;

	/*
	 * Enable the memcg OOM handling for faults triggered in user
	 * space.  Kernel faults are handled more gracefully.
	 */
	if (flags & FAULT_FLAG_USER)
		mem_cgroup_oom_enable();

	if (unlikely(is_vm_hugetlb_page(vma)))
		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
	else
		ret = __handle_mm_fault(vma, address, flags);

	if (flags & FAULT_FLAG_USER) {
		mem_cgroup_oom_disable();
		/*
		 * The task may have entered a memcg OOM situation but
		 * if the allocation error was handled gracefully (no
		 * VM_FAULT_OOM), there is no need to kill anything.
		 * Just clean up the OOM state peacefully.
		 */
		if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
			mem_cgroup_oom_synchronize(false);
	}

	return ret;
}

↓

static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
		unsigned int flags)
{
	struct vm_fault vmf = {
		.vma = vma,
		.address = address & PAGE_MASK,
		.flags = flags,
		.pgoff = linear_page_index(vma, address),
		.gfp_mask = __get_fault_gfp_mask(vma),
	};
	unsigned int dirty = flags & FAULT_FLAG_WRITE;
	struct mm_struct *mm = vma->vm_mm;
	pgd_t *pgd;
	p4d_t *p4d;
	int ret;

	/* (3.3.1) Look up the pgd and p4d for addr */
	pgd = pgd_offset(mm, address);
	p4d = p4d_alloc(mm, pgd, address);
	if (!p4d)
		return VM_FAULT_OOM;

	/* (3.3.2) Look up the pud for addr, allocating it if not yet present */
	vmf.pud = pud_alloc(mm, p4d, address);
	if (!vmf.pud)
		return VM_FAULT_OOM;
	if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) {
		ret = create_huge_pud(&vmf);
		if (!(ret & VM_FAULT_FALLBACK))
			return ret;
	} else {
		pud_t orig_pud = *vmf.pud;

		barrier();
		if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {

			/* NUMA case for anonymous PUDs would go here */

			if (dirty && !pud_write(orig_pud)) {
				ret = wp_huge_pud(&vmf, orig_pud);
				if (!(ret & VM_FAULT_FALLBACK))
					return ret;
			} else {
				huge_pud_set_accessed(&vmf, orig_pud);
				return 0;
			}
		}
	}

	/* (3.3.3) Look up the pmd for addr, allocating it if not yet present */
	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
	if (!vmf.pmd)
		return VM_FAULT_OOM;
	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
		ret = create_huge_pmd(&vmf);
		if (!(ret & VM_FAULT_FALLBACK))
			return ret;
	} else {
		pmd_t orig_pmd = *vmf.pmd;

		barrier();
		if (unlikely(is_swap_pmd(orig_pmd))) {
			VM_BUG_ON(thp_migration_supported() &&
					  !is_pmd_migration_entry(orig_pmd));
			if (is_pmd_migration_entry(orig_pmd))
				pmd_migration_entry_wait(mm, vmf.pmd);
			return 0;
		}
		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
				return do_huge_pmd_numa_page(&vmf, orig_pmd);

			if (dirty && !pmd_write(orig_pmd)) {
				ret = wp_huge_pmd(&vmf, orig_pmd);
				if (!(ret & VM_FAULT_FALLBACK))
					return ret;
			} else {
				huge_pmd_set_accessed(&vmf, orig_pmd);
				return 0;
			}
		}
	}

	/* (3.3.4) Handle the fault at PTE level */
	return handle_pte_fault(&vmf);
}

↓

static int handle_pte_fault(struct vm_fault *vmf)
{
	pte_t entry;

	/* (3.3.4.1) The pmd is empty, so the pte has definitely not been allocated yet */
	if (unlikely(pmd_none(*vmf->pmd))) {
		/*
		 * Leave __pte_alloc() until later: because vm_ops->fault may
		 * want to allocate huge page, and if we expose page table
		 * for an instant, it will be difficult to retract from
		 * concurrent faults and from rmap lookups.
		 */
		vmf->pte = NULL;
	/* (3.3.4.2) The pmd is not empty, which means either:
			1. the pte may already be allocated, or
			2. it may not be: allocating a neighbouring PTE created the pmd as a side effect
	 */
	} else {
		/* See comment in pte_alloc_one_map() */
		if (pmd_devmap_trans_unstable(vmf->pmd))
			return 0;
		/*
		 * A regular pmd is established and it can't morph into a huge
		 * pmd from under us anymore at this point because we hold the
		 * mmap_sem read mode and khugepaged takes it in write mode.
		 * So now it's safe to run pte_offset_map().
		 */
		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
		vmf->orig_pte = *vmf->pte;

		/*
		 * some architectures can have larger ptes than wordsize,
		 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
		 * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
		 * accesses.  The code below just needs a consistent view
		 * for the ifs and we later double check anyway with the
		 * ptl lock held. So here a barrier will do.
		 */
		barrier();

		/* The pte has not been allocated yet */
		if (pte_none(vmf->orig_pte)) {
			pte_unmap(vmf->pte);
			vmf->pte = NULL;
		}
	}

	/* (3.3.4.3) If no PTE has ever been created, this is a fresh page fault */
	if (!vmf->pte) {
		if (vma_is_anonymous(vmf->vma))
			/* (3.3.4.3.1) Anonymous mapping */
			return do_anonymous_page(vmf);
		else
			/* (3.3.4.3.2) File mapping */
			return do_fault(vmf);
	}

	/* (3.3.4.4) The PTE exists but the page is not present: it has been swapped out */
	if (!pte_present(vmf->orig_pte))
		return do_swap_page(vmf);

	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
		return do_numa_page(vmf);

	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
	spin_lock(vmf->ptl);
	entry = vmf->orig_pte;
	if (unlikely(!pte_same(*vmf->pte, entry)))
		goto unlock;
	if (vmf->flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(vmf);
		entry = pte_mkdirty(entry);
	}
	entry = pte_mkyoung(entry);
	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
				vmf->flags & FAULT_FLAG_WRITE)) {
		update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
	} else {
		/*
		 * This is needed only for protection faults but the arch code
		 * is not yet telling us if this is a protection fault or not.
		 * This still avoids useless tlb flushes for .text page faults
		 * with threads.
		 */
		if (vmf->flags & FAULT_FLAG_WRITE)
			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
	}
unlock:
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return 0;
}

|→

2.3.2.1 do_anonymous_page()

static int do_anonymous_page(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct mem_cgroup *memcg;
	struct page *page;
	int ret = 0;
	pte_t entry;

	/* File mapping without ->vm_ops ? */
	if (vma->vm_flags & VM_SHARED)
		return VM_FAULT_SIGBUS;

	/*
	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
	 * pte_offset_map() on pmds where a huge pmd might be created
	 * from a different thread.
	 *
	 * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
	 * parallel threads are excluded by other means.
	 *
	 * Here we only have down_read(mmap_sem).
	 */
	/* (3.3.4.3.1.1) Allocate the pte */
	if (pte_alloc(vma->vm_mm, vmf->pmd, vmf->address))
		return VM_FAULT_OOM;

	/* See the comment in pte_alloc_one_map() */
	if (unlikely(pmd_trans_unstable(vmf->pmd)))
		return 0;

	/* Use the zero-page for reads */
	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
			!mm_forbids_zeropage(vma->vm_mm)) {
		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
						vma->vm_page_prot));
		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
				vmf->address, &vmf->ptl);
		if (!pte_none(*vmf->pte))
			goto unlock;
		ret = check_stable_address_space(vma->vm_mm);
		if (ret)
			goto unlock;
		/* Deliver the page fault to userland, check inside PT lock */
		if (userfaultfd_missing(vma)) {
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			return handle_userfault(vmf, VM_UFFD_MISSING);
		}
		goto setpte;
	}

	/* Allocate our own private page. */
	if (unlikely(anon_vma_prepare(vma)))
		goto oom;
	/* (3.3.4.3.1.2) Allocate the page */
	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
	if (!page)
		goto oom;

	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
		goto oom_free_page;

	/*
	 * The memory barrier inside __SetPageUptodate makes sure that
	 * preceeding stores to the page contents become visible before
	 * the set_pte_at() write.
	 */
	__SetPageUptodate(page);

	/* (3.3.4.3.1.3) Build the pte value */
	entry = mk_pte(page, vma->vm_page_prot);
	if (vma->vm_flags & VM_WRITE)
		entry = pte_mkwrite(pte_mkdirty(entry));

	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
			&vmf->ptl);
	if (!pte_none(*vmf->pte))
		goto release;

	ret = check_stable_address_space(vma->vm_mm);
	if (ret)
		goto release;

	/* Deliver the page fault to userland, check inside PT lock */
	if (userfaultfd_missing(vma)) {
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		mem_cgroup_cancel_charge(page, memcg, false);
		put_page(page);
		return handle_userfault(vmf, VM_UFFD_MISSING);
	}

	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
	page_add_new_anon_rmap(page, vma, vmf->address, false);
	mem_cgroup_commit_charge(page, memcg, false, false);
	lru_cache_add_active_or_unevictable(page, vma);
	/* (3.3.4.3.1.4) Install the pte */
setpte:
	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

	/* No need to invalidate - it was non-present before */
	update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return ret;
release:
	mem_cgroup_cancel_charge(page, memcg, false);
	put_page(page);
	goto unlock;
oom_free_page:
	put_page(page);
oom:
	return VM_FAULT_OOM;
}
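
The two branches of do_anonymous_page() can be observed from userspace: a first read maps the shared zero page, while a first write allocates a zeroed private page right away. Below is a small sketch of my own (not part of the original article); exact fault counts depend on the kernel and on transparent-hugepage settings.

/* anon_fault.c - first touch of anonymous memory: read -> zero page,
 * write -> freshly allocated zeroed page (do_anonymous_page()).
 * Illustrative only; counts vary with kernel configuration.
 */
#include <stdio.h>
#include <assert.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}

int main(void)
{
	const size_t len = 64 * 4096;
	size_t off;
	long before;

	/* Region A: first access is a read, so the fault maps the shared
	 * zero page read-only. */
	char *a = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* Region B: first access is a write, so the fault allocates a
	 * private zeroed page immediately. */
	char *b = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(a != MAP_FAILED && b != MAP_FAILED);

	before = minor_faults();
	for (off = 0; off < len; off += 4096)
		assert(a[off] == 0);		/* read faults, zero-page backed */
	printf("read-first faults:  %ld\n", minor_faults() - before);

	before = minor_faults();
	for (off = 0; off < len; off += 4096)
		b[off] = 1;			/* write faults, real pages allocated */
	printf("write-first faults: %ld\n", minor_faults() - before);

	munmap(a, len);
	munmap(b, len);
	return 0;
}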

2.3.2.2 do_fault()

static int do_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct mm_struct *vm_mm = vma->vm_mm;
	int ret;

	/*
	 * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
	 */
	if (!vma->vm_ops->fault) {
		/*
		 * If we find a migration pmd entry or a none pmd entry, which
		 * should never happen, return SIGBUS
		 */
		if (unlikely(!pmd_present(*vmf->pmd)))
			ret = VM_FAULT_SIGBUS;
		else {
			vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
						       vmf->pmd,
						       vmf->address,
						       &vmf->ptl);
			/*
			 * Make sure this is not a temporary clearing of pte
			 * by holding ptl and checking again. A R/M/W update
			 * of pte involves: take ptl, clearing the pte so that
			 * we don't have concurrent modification by hardware
			 * followed by an update.
			 */
			if (unlikely(pte_none(*vmf->pte)))
				ret = VM_FAULT_SIGBUS;
			else
				ret = VM_FAULT_NOPAGE;

			pte_unmap_unlock(vmf->pte, vmf->ptl);
		}
	} else if (!(vmf->flags & FAULT_FLAG_WRITE))
		/* (1) Plain read fault on a file mapping */
		ret = do_read_fault(vmf);
	else if (!(vma->vm_flags & VM_SHARED))
		/* (2) Copy-on-write handling */
		ret = do_cow_fault(vmf);
	else
		/* (3) Shared file mapping handling */
		ret = do_shared_fault(vmf);

	/* preallocated pagetable is unused: free it */
	if (vmf->prealloc_pte) {
		pte_free(vm_mm, vmf->prealloc_pte);
		vmf->prealloc_pte = NULL;
	}
	return ret;
}

2.3.2.3 do_read_fault()

static int do_read_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	int ret = 0;

	/*
	 * Let's call ->map_pages() first and use ->fault() as fallback
	 * if page by the offset is not ready to be mapped (cold cache or
	 * something).
	 */
	/* (1.1) First try to map several pages at once via vm_ops->map_pages() */
	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
		ret = do_fault_around(vmf);
		if (ret)
			return ret;
	}

	/* (1.2) If that fails, fall back to one page at a time via vm_ops->fault() */
	ret = __do_fault(vmf);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		return ret;

	ret |= finish_fault(vmf);
	unlock_page(vmf->page);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		put_page(vmf->page);
	return ret;
}

↓

static int __do_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	int ret;

	/*
	 * Preallocate pte before we take page_lock because this might lead to
	 * deadlocks for memcg reclaim which waits for pages under writeback:
	 *				lock_page(A)
	 *				SetPageWriteback(A)
	 *				unlock_page(A)
	 * lock_page(B)
	 *				lock_page(B)
	 * pte_alloc_pne
	 *   shrink_page_list
	 *     wait_on_page_writeback(A)
	 *				SetPageWriteback(B)
	 *				unlock_page(B)
	 *				# flush A, B to clear the writeback
	 */
	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm,
						  vmf->address);
		if (!vmf->prealloc_pte)
			return VM_FAULT_OOM;
		smp_wmb(); /* See comment in __pte_alloc() */
	}

	/* (1.2.1) Call the vma's concrete vm_ops->fault() */
	ret = vma->vm_ops->fault(vmf);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
			    VM_FAULT_DONE_COW)))
		return ret;

	if (unlikely(PageHWPoison(vmf->page))) {
		if (ret & VM_FAULT_LOCKED)
			unlock_page(vmf->page);
		put_page(vmf->page);
		vmf->page = NULL;
		return VM_FAULT_HWPOISON;
	}

	if (unlikely(!(ret & VM_FAULT_LOCKED)))
		lock_page(vmf->page);
	else
		VM_BUG_ON_PAGE(!PageLocked(vmf->page), vmf->page);

	return ret;
}

Taking ext4 as an example, its vm_ops is implemented as follows:

static const struct vm_operations_struct ext4_file_vm_ops = {
	.fault		= ext4_filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite   = ext4_page_mkwrite,
};

Let's briefly walk through its single-page fault handler, ext4_filemap_fault():

int ext4_filemap_fault(struct vm_fault *vmf)
{
	struct inode *inode = file_inode(vmf->vma->vm_file);
	int err;

	down_read(&EXT4_I(inode)->i_mmap_sem);
	err = filemap_fault(vmf);
	up_read(&EXT4_I(inode)->i_mmap_sem);

	return err;
}

↓

int filemap_fault(struct vm_fault *vmf)
{
	int error;
	struct file *file = vmf->vma->vm_file;
	struct address_space *mapping = file->f_mapping;
	struct file_ra_state *ra = &file->f_ra;
	struct inode *inode = mapping->host;
	pgoff_t offset = vmf->pgoff;
	pgoff_t max_off;
	struct page *page;
	int ret = 0;

	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
	if (unlikely(offset >= max_off))
		return VM_FAULT_SIGBUS;

	/*
	 * Do we have something in the page cache already?
	 */
	/* (1.2.1.1) Try to find the page for this offset in the file's page cache */
	page = find_get_page(mapping, offset);
	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
		/*
		 * We found the page, so try async readahead before
		 * waiting for the lock.
		 */
		do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
	} else if (!page) {
		/* No page in the page cache at all */
		/* No page in the page cache: read the file into the cache first */
		do_sync_mmap_readahead(vmf->vma, ra, file, offset);
		count_vm_event(PGMAJFAULT);
		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
		ret = VM_FAULT_MAJOR;
retry_find:
		/* Then look up the page again */
		page = find_get_page(mapping, offset);
		if (!page)
			goto no_cached_page;
	}

	/* (1.2.1.2) Lock the page */
	if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
		put_page(page);
		return ret | VM_FAULT_RETRY;
	}

	/* Did it get truncated? */
	if (unlikely(page->mapping != mapping)) {
		unlock_page(page);
		put_page(page);
		goto retry_find;
	}
	VM_BUG_ON_PAGE(page->index != offset, page);

	/*
	 * We have a locked page in the page cache, now we need to check
	 * that it's up-to-date. If not, it is going to be due to an error.
	 */
	if (unlikely(!PageUptodate(page)))
		goto page_not_uptodate;

	/*
	 * Found the page and have a reference on it.
	 * We must recheck i_size under page lock.
	 */
	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
	if (unlikely(offset >= max_off)) {
		unlock_page(page);
		put_page(page);
		return VM_FAULT_SIGBUS;
	}

	/* (1.2.1.3) Return the page, now correctly filled with the file contents at this offset */
	vmf->page = page;
	return ret | VM_FAULT_LOCKED;

no_cached_page:
	/*
	 * We're only likely to ever get here if MADV_RANDOM is in
	 * effect.
	 */
	error = page_cache_read(file, offset, vmf->gfp_mask);

	/*
	 * The page we want has now been added to the page cache.
	 * In the unlikely event that someone removed it in the
	 * meantime, we'll just come back here and read it again.
	 */
	if (error >= 0)
		goto retry_find;

	/*
	 * An error return from page_cache_read can result if the
	 * system is low on memory, or a problem occurs while trying
	 * to schedule I/O.
	 */
	if (error == -ENOMEM)
		return VM_FAULT_OOM;
	return VM_FAULT_SIGBUS;

page_not_uptodate:
	/*
	 * Umm, take care of errors if the page isn't up-to-date.
	 * Try to re-read it _once_. We do this synchronously,
	 * because there really aren't any performance issues here
	 * and we need to check for errors.
	 */
	ClearPageError(page);
	error = mapping->a_ops->readpage(file, page);
	if (!error) {
		wait_on_page_locked(page);
		if (!PageUptodate(page))
			error = -EIO;
	}
	put_page(page);

	if (!error || error == AOP_TRUNCATED_PAGE)
		goto retry_find;

	/* Things didn't work out. Return zero to tell the mm layer so. */
	shrink_readahead_size_eio(file, ra);
	return VM_FAULT_SIGBUS;
}
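
Whether filemap_fault() ends up as a minor or a major fault is decided by the page-cache lookup (find_get_page()). From userspace the cache state can be probed with mincore() before touching the mapping; the sketch below is my own, and "./testfile" is again an assumed scratch file.

/* cache_probe.c - check page-cache residency of a mapped file with
 * mincore() before and after touching it (illustrative sketch).
 * Assumes "./testfile" exists and is at least one page long.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("./testfile", O_RDONLY);
	unsigned char vec;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	long psz = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, psz, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	mincore(p, psz, &vec);
	printf("before access: %s the page cache\n",
	       (vec & 1) ? "already in" : "not in");

	volatile char c = p[0];		/* triggers filemap_fault() if not yet mapped */
	(void)c;

	mincore(p, psz, &vec);
	printf("after access:  %s the page cache\n",
	       (vec & 1) ? "in" : "still not in");

	munmap(p, psz);
	close(fd);
	return 0;
}

Running it twice in a row (without dropping caches) should show the page already resident the second time, i.e. the fault taken then would be a minor one rather than a major one.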

2.3.2.4 do_cow_fault()

static int do_cow_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	int ret;

	if (unlikely(anon_vma_prepare(vma)))
		return VM_FAULT_OOM;

	/* (2.1) Allocate a new page: cow_page */
	vmf->cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
	if (!vmf->cow_page)
		return VM_FAULT_OOM;

	if (mem_cgroup_try_charge(vmf->cow_page, vma->vm_mm, GFP_KERNEL,
				&vmf->memcg, false)) {
		put_page(vmf->cow_page);
		return VM_FAULT_OOM;
	}

	/* (2.2) Bring the original page up to date */
	ret = __do_fault(vmf);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		goto uncharge_out;
	if (ret & VM_FAULT_DONE_COW)
		return ret;

	/* (2.3) Copy the original page's contents into cow_page */
	copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
	__SetPageUptodate(vmf->cow_page);

	/* (2.4) Commit the PTE */
	ret |= finish_fault(vmf);
	unlock_page(vmf->page);
	put_page(vmf->page);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		goto uncharge_out;
	return ret;
uncharge_out:
	mem_cgroup_cancel_charge(vmf->cow_page, vmf->memcg, false);
	put_page(vmf->cow_page);
	return ret;
}
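
What do_cow_fault() achieves can be seen directly from userspace: a write through a MAP_PRIVATE file mapping lands in a private copy, and the file itself stays untouched. A small sketch of my own, assuming a writable scratch file "./testfile" of at least one page:

/* cow_demo.c - writing through a MAP_PRIVATE mapping goes through
 * do_cow_fault(): the process sees its modification, the file does not.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("./testfile", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE, fd, 0);
	if (priv == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	char orig;
	pread(fd, &orig, 1, 0);
	priv[0] = orig + 1;		/* write fault -> private COW copy */

	char on_disk;
	pread(fd, &on_disk, 1, 0);
	printf("mapping sees 0x%02x, file still has 0x%02x\n",
	       (unsigned char)priv[0], (unsigned char)on_disk);

	munmap(priv, 4096);
	close(fd);
	return 0;
}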

2.3.2.5 do_shared_fault()

static int do_shared_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	int ret, tmp;

	/* (3.1) Bring the shared page's contents up to date */
	ret = __do_fault(vmf);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
		return ret;

	/*
	 * Check if the backing address space wants to know that the page is
	 * about to become writable
	 */
	if (vma->vm_ops->page_mkwrite) {
		unlock_page(vmf->page);
		tmp = do_page_mkwrite(vmf);
		if (unlikely(!tmp ||
				(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
			put_page(vmf->page);
			return tmp;
		}
	}

	ret |= finish_fault(vmf);
	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
					VM_FAULT_RETRY))) {
		unlock_page(vmf->page);
		put_page(vmf->page);
		return ret;
	}

	fault_dirty_shared_page(vma, vmf->page);
	return ret;
}
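
do_shared_fault() is the counterpart for MAP_SHARED: the write goes into the shared page-cache page (after page_mkwrite()), so the file does change. A companion sketch to the COW one above, using the same assumed scratch file:

/* shared_demo.c - a write through a MAP_SHARED mapping dirties the shared
 * page-cache page (do_shared_fault()), so the file itself changes.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	int fd = open("./testfile", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);
	if (shared == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	shared[0] = 'S';			/* write fault -> do_shared_fault() */
	msync(shared, 4096, MS_SYNC);		/* flush the dirty page to the file */

	char on_disk;
	pread(fd, &on_disk, 1, 0);
	printf("file now starts with '%c'\n", on_disk);

	munmap(shared, 4096);
	close(fd);
	return 0;
}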

2.3.2.6 do_swap_page()

int do_swap_page(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct page *page = NULL, *swapcache = NULL;
	struct mem_cgroup *memcg;
	struct vma_swap_readahead swap_ra;
	swp_entry_t entry;
	pte_t pte;
	int locked;
	int exclusive = 0;
	int ret = 0;
	bool vma_readahead = swap_use_vma_readahead();

	if (vma_readahead) {
		page = swap_readahead_detect(vmf, &swap_ra);
		swapcache = page;
	}

	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) {
		if (page)
			put_page(page);
		goto out;
	}

	/* (3.3.4.4.1) Read the swap entry from the PTE */
	entry = pte_to_swp_entry(vmf->orig_pte);
	if (unlikely(non_swap_entry(entry))) {
		if (is_migration_entry(entry)) {
			migration_entry_wait(vma->vm_mm, vmf->pmd,
					     vmf->address);
		} else if (is_device_private_entry(entry)) {
			/*
			 * For un-addressable device memory we call the pgmap
			 * fault handler callback. The callback must migrate
			 * the page back to some CPU accessible page.
			 */
			ret = device_private_entry_fault(vma, vmf->address, entry,
						 vmf->flags, vmf->pmd);
		} else if (is_hwpoison_entry(entry)) {
			ret = VM_FAULT_HWPOISON;
		} else {
			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
			ret = VM_FAULT_SIGBUS;
		}
		goto out;
	}


	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
	if (!page) {
		page = lookup_swap_cache(entry, vma_readahead ? vma : NULL,
					 vmf->address);
		swapcache = page;
	}

	/* (3.3.4.4.2) Allocate a new page and read the swapped-out contents into it */
	if (!page) {
		struct swap_info_struct *si = swp_swap_info(entry);

		if (si->flags & SWP_SYNCHRONOUS_IO &&
				__swap_count(si, entry) == 1) {
			/* skip swapcache */
			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
			if (page) {
				__SetPageLocked(page);
				__SetPageSwapBacked(page);
				set_page_private(page, entry.val);
				lru_cache_add_anon(page);
				swap_readpage(page, true);
			}
		} else {
			if (vma_readahead)
				page = do_swap_page_readahead(entry,
					GFP_HIGHUSER_MOVABLE, vmf, &swap_ra);
			else
				page = swapin_readahead(entry,
				       GFP_HIGHUSER_MOVABLE, vma, vmf->address);
			swapcache = page;
		}

		if (!page) {
			/*
			 * Back out if somebody else faulted in this pte
			 * while we released the pte lock.
			 */
			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
					vmf->address, &vmf->ptl);
			if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
				ret = VM_FAULT_OOM;
			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
			goto unlock;
		}

		/* Had to read the page from swap area: Major fault */
		ret = VM_FAULT_MAJOR;
		count_vm_event(PGMAJFAULT);
		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
	} else if (PageHWPoison(page)) {
		/*
		 * hwpoisoned dirty swapcache pages are kept for killing
		 * owner processes (which may be unknown at hwpoison time)
		 */
		ret = VM_FAULT_HWPOISON;
		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
		swapcache = page;
		goto out_release;
	}

	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);

	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
	if (!locked) {
		ret |= VM_FAULT_RETRY;
		goto out_release;
	}

	/*
	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
	 * release the swapcache from under us.  The page pin, and pte_same
	 * test below, are not enough to exclude that.  Even if it is still
	 * swapcache, we need to check that the page's swap has not changed.
	 */
	if (unlikely((!PageSwapCache(page) ||
			page_private(page) != entry.val)) && swapcache)
		goto out_page;

	page = ksm_might_need_to_copy(page, vma, vmf->address);
	if (unlikely(!page)) {
		ret = VM_FAULT_OOM;
		page = swapcache;
		goto out_page;
	}

	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
				&memcg, false)) {
		ret = VM_FAULT_OOM;
		goto out_page;
	}

	/*
	 * Back out if somebody else already faulted in this pte.
	 */
	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
			&vmf->ptl);
	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
		goto out_nomap;

	if (unlikely(!PageUptodate(page))) {
		ret = VM_FAULT_SIGBUS;
		goto out_nomap;
	}

	/*
	 * The page isn't present yet, go ahead with the fault.
	 *
	 * Be careful about the sequence of operations here.
	 * To get its accounting right, reuse_swap_page() must be called
	 * while the page is counted on swap but not yet in mapcount i.e.
	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
	 * must be called after the swap_free(), or it will never succeed.
	 */

	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
	pte = mk_pte(page, vma->vm_page_prot);
	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
		vmf->flags &= ~FAULT_FLAG_WRITE;
		ret |= VM_FAULT_WRITE;
		exclusive = RMAP_EXCLUSIVE;
	}
	flush_icache_page(vma, page);
	if (pte_swp_soft_dirty(vmf->orig_pte))
		pte = pte_mksoft_dirty(pte);
	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
	vmf->orig_pte = pte;

	/* ksm created a completely new copy */
	if (unlikely(page != swapcache && swapcache)) {
		page_add_new_anon_rmap(page, vma, vmf->address, false);
		mem_cgroup_commit_charge(page, memcg, false, false);
		lru_cache_add_active_or_unevictable(page, vma);
	} else {
		do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
		mem_cgroup_commit_charge(page, memcg, true, false);
		activate_page(page);
	}

	swap_free(entry);
	if (mem_cgroup_swap_full(page) ||
	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
		try_to_free_swap(page);
	unlock_page(page);
	if (page != swapcache && swapcache) {
		/*
		 * Hold the lock to avoid the swap entry to be reused
		 * until we take the PT lock for the pte_same() check
		 * (to avoid false positives from pte_same). For
		 * further safety release the lock after the swap_free
		 * so that the swap count won't change under a
		 * parallel locked swapcache.
		 */
		unlock_page(swapcache);
		put_page(swapcache);
	}

	if (vmf->flags & FAULT_FLAG_WRITE) {
		ret |= do_wp_page(vmf);
		if (ret & VM_FAULT_ERROR)
			ret &= VM_FAULT_ERROR;
		goto out;
	}

	/* No need to invalidate - it was non-present before */
	update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
	pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
	return ret;
out_nomap:
	mem_cgroup_cancel_charge(page, memcg, false);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
out_page:
	unlock_page(page);
out_release:
	put_page(page);
	if (page != swapcache && swapcache) {
		unlock_page(swapcache);
		put_page(swapcache);
	}
	return ret;
}
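
do_swap_page() is hard to trigger deterministically from userspace. On kernels from 5.4 onwards with swap enabled, madvise(MADV_PAGEOUT) asks the kernel to reclaim the given anonymous pages, so re-reading them should come back through the swap-in path and show up as faults. The sketch below is my own; the request is only advisory, and the outcome depends entirely on the swap configuration.

/* swapin_demo.c - best-effort demonstration of the swap-in path.
 * Needs Linux >= 5.4 (MADV_PAGEOUT) and active swap; purely illustrative.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
	const size_t len = 16 << 20;		/* 16 MiB of anonymous memory */
	size_t off;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0xab, len);			/* fault everything in */

#ifdef MADV_PAGEOUT
	if (madvise(p, len, MADV_PAGEOUT))	/* ask the kernel to reclaim/swap out */
		perror("madvise(MADV_PAGEOUT)");
#else
	fprintf(stderr, "MADV_PAGEOUT not available on this libc/kernel\n");
#endif

	struct rusage before, after;
	volatile long sum = 0;
	getrusage(RUSAGE_SELF, &before);
	for (off = 0; off < len; off += 4096)
		sum += p[off];			/* swapped-out pages fault back in */
	getrusage(RUSAGE_SELF, &after);
	printf("faults while re-reading: minor=%ld major=%ld (sum=%ld)\n",
	       after.ru_minflt - before.ru_minflt,
	       after.ru_majflt - before.ru_majflt, (long)sum);

	munmap(p, len);
	return 0;
}

If swap is disabled or MADV_PAGEOUT is unavailable, the re-read simply reports no major faults; the code is only meant to make the do_swap_page() path tangible.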

Reprinted from blog.csdn.net/pwl999/article/details/109188517