Overview of Linux Swap Mechanism

Identifying pages to swap

1. The page frame reclaiming algorithm (PFRA) divides pages into four types: unreclaimable pages, swappable pages, syncable pages, and discardable pages. The swappable pages include:

I. Pages belonging to an anonymous memory region of a process (such as the User Mode heap and stack);

II. Dirty pages belonging to a private memory mapping of a process;

III. Pages belonging to an IPC shared memory region;

For each page in memory, the kernel uses the Present flag in the page table entry to determine whether the page is swapped out.

2. A mapped page is a page that maps a portion of a file. All pages in the User Mode address space that belong to a file memory mapping, as well as all pages in the page cache, are mapped pages. An anonymous page is one that belongs to an anonymous memory region of a process; for example, all pages in a process's User Mode heap and stack are anonymous pages. The mapping field of the page descriptor (struct page) tells whether a page is mapped or anonymous:

I. If the mapping field is NULL, the page belongs to the swap cache;

II. If the mapping field is not NULL and its lowest bit is 1, the page is an anonymous page, and the mapping field stores a pointer to the anon_vma descriptor;

III. If the mapping field is not NULL and its lowest bit is 0, the page is a mapped page, and the mapping field points to the address_space object;

The PageAnon() function takes the address of a page descriptor as its parameter. It returns 1 if the lowest bit of the mapping field is set, and 0 otherwise.
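The three cases above can be sketched in a few lines of C. This is only an illustration under stated assumptions: struct toy_page, TOY_MAPPING_ANON and the toy_ functions are made-up stand-ins, not the kernel's own definitions, and the sketch relies on the pointed-to objects being at least 2-byte aligned so that the lowest pointer bit is free to carry the flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the page descriptor; only the field discussed here. */
struct toy_page {
    void *mapping;  /* NULL, (anon_vma pointer | 1), or address_space pointer */
};

/* Assumed flag value: the lowest bit of mapping marks an anonymous page. */
#define TOY_MAPPING_ANON ((uintptr_t)1)

enum toy_page_kind { TOY_SWAP_CACHE, TOY_ANONYMOUS, TOY_MAPPED };

/* Does what the text says PageAnon() does: test the lowest bit of mapping. */
int toy_page_anon(const struct toy_page *page)
{
    assert(page != NULL);
    return ((uintptr_t)page->mapping & TOY_MAPPING_ANON) != 0;
}

/* Apply rules I to III from the list above, in that order. */
enum toy_page_kind toy_classify(const struct toy_page *page)
{
    assert(page != NULL);
    if (page->mapping == NULL)
        return TOY_SWAP_CACHE;
    return toy_page_anon(page) ? TOY_ANONYMOUS : TOY_MAPPED;
}
```

Storing the flag in the pointer itself means no extra field is needed in the descriptor; the price is that the flag bit has to be masked off before the pointer is dereferenced.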

3. For shared pages, the _mapcount field of the page descriptor stores the number of page table entries that refer to the page frame, and the kernel can find all page table entries that refer to the same page frame through reverse mapping.

Swap area

1. Each swap area consists of a sequence of page slots, that is, 4096-byte blocks, each of which holds one swapped-out page. A swap area is made up of one or more swap extents; each extent is represented by a swap_extent descriptor, and the page slots of one extent are physically adjacent on disk. A swap area stored in a disk partition has a single extent, while a swap area stored in a regular file may have several, because the file is not necessarily stored in one run of contiguous disk blocks.
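To make the role of extents concrete, here is a minimal sketch of how a page slot index could be translated into a disk block by walking a list of extents. The struct and function names are hypothetical, the block unit is assumed to be one page, and a linear search is used purely for clarity.

```c
#include <assert.h>
#include <stddef.h>

/* Toy extent: a run of page slots that is contiguous on disk. */
struct toy_extent {
    unsigned long start_page;   /* first page slot covered by the extent */
    unsigned long nr_pages;     /* number of page slots in the extent */
    unsigned long start_block;  /* page-sized disk block of the first slot */
};

/* Return the disk block backing a page slot, or -1 if no extent covers it. */
long toy_slot_to_block(const struct toy_extent *ext, size_t n,
                       unsigned long slot)
{
    assert(ext != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (slot >= ext[i].start_page &&
            slot - ext[i].start_page < ext[i].nr_pages)
            return (long)(ext[i].start_block + (slot - ext[i].start_page));
    }
    return -1;
}
```

A partition-backed area would need a single entry in such a list, whereas a fragmented swap file would need one entry per contiguous run.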

2. Swap area descriptor. Each active swap area has its own swap_info_struct descriptor in memory. Its main fields are:

Field          Description
flags          Swap area flags
swap_map       Pointer to an array of counters, one for each page slot of the swap area
lowest_bit     First page slot to scan when searching for a free page slot
highest_bit    Last page slot to scan when searching for a free page slot
cluster_nr     Number of free page slots that can still be allocated before the search restarts from the beginning
cluster_next   Next page slot to scan when searching for a free page slot
prio           Swap area priority
pages          Number of usable page slots
max            Size of the swap area in pages
inuse_pages    Number of used page slots in the swap area
next           Link to the next swap area descriptor

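The SWP_USED and SWP_WRITEOK flags of the flags field indicate, respectively, whether the swap area is active and whether it is writable;

The swap_map field points to an array of counters, one element for each page slot of the swap area. If a counter equals 0, the page slot is free; if it is positive, its value is the number of processes sharing the swapped-out page;

The cluster_nr and cluster_next fields work together in the search for a free page slot: if cluster_nr is positive, the next search starts from cluster_next; if cluster_nr is 0, the search starts from lowest_bit;

The lowest_bit and highest_bit fields denote the first and the last page slot that may be free; in other words, all page slots below lowest_bit and above highest_bit are considered already allocated.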

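3. The swap_info array contains MAX_SWAPFILES swap area descriptors; only those with the SWP_USED flag set are in use. Descriptors of active swap areas are also inserted into the swap_list list, which is sorted by swap area priority. The list is implemented through the next field of the swap area descriptor, which stores the index of the next descriptor in the swap_info array. The swapon() and swapoff() system calls activate and deactivate a swap area.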
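The index-linked, priority-sorted list can be sketched as follows. This is a simplified model, not the kernel's code: the toy_ names are hypothetical, TOY_MAX_SWAPFILES is a small illustrative value rather than the kernel's constant, and everything swapon() does besides linking the descriptor is left out.

```c
#include <assert.h>

#define TOY_MAX_SWAPFILES 4    /* small illustrative value */
#define TOY_SWP_USED      0x1  /* assumed flag bit: descriptor is active */

struct toy_swap_info {
    unsigned int flags;
    int prio;
    int next;   /* index of the next descriptor in the list, -1 at the end */
};

struct toy_swap_list {
    struct toy_swap_info info[TOY_MAX_SWAPFILES];
    int head;   /* index of the highest-priority active area, -1 if none */
};

/* Activate descriptor `type` and link it in, highest priority first. */
void toy_swapon(struct toy_swap_list *l, int type, int prio)
{
    int prev = -1;
    int i;

    assert(type >= 0 && type < TOY_MAX_SWAPFILES);
    assert(!(l->info[type].flags & TOY_SWP_USED));

    l->info[type].flags |= TOY_SWP_USED;
    l->info[type].prio = prio;

    /* Find the first entry whose priority is not higher than the new one. */
    for (i = l->head; i >= 0; i = l->info[i].next) {
        if (prio >= l->info[i].prio)
            break;
        prev = i;
    }
    l->info[type].next = i;
    if (prev < 0)
        l->head = type;
    else
        l->info[prev].next = type;
}
```

Because the links are array indexes rather than pointers, a walk that starts at head visits the active areas from highest to lowest priority without touching unused descriptors.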

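Swapped-out page identifier

A swapped-out page identifier consists of an area number and a page slot index: the index of the swap area in the swap_info array, and the index of the page slot within that swap area. Together they uniquely identify a page slot. The swp_entry(type, offset) macro builds a swapped-out page identifier from the swap area index type and the page slot index offset; the swp_type and swp_offset macros do the opposite and extract the two indexes from an identifier. When a page is swapped out, its identifier is stored in the page table as the page's entry, so that the page can be found again when it is needed. A page table entry can therefore be in one of three states (the lowest bit of the entry is the Present flag):

I. Null entry: the page does not belong to the process address space, or the corresponding page frame has not yet been assigned to the process;

II. The upper 31 bits are not all 0 and the lowest bit is 0: the page has been swapped out;

III. The lowest bit is 1: the page is in RAM;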
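The encoding and the three page table entry states can be illustrated with a small sketch. The exact bit layout is an assumption made for this example (a 32-bit entry with the Present flag in bit 0, the area number in bits 1 to 7 and the page slot index in bits 8 to 31); the toy_ names are hypothetical and only mirror the roles of swp_entry, swp_type and swp_offset.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PRESENT      0x1u   /* lowest bit: the page is in RAM */
#define TOY_TYPE_SHIFT   1      /* assumed: area number in bits 1..7 */
#define TOY_TYPE_MASK    0x7fu
#define TOY_OFFSET_SHIFT 8      /* assumed: page slot index in bits 8..31 */

/* Build an identifier; the Present bit is left cleared. */
uint32_t toy_swp_entry(uint32_t type, uint32_t offset)
{
    assert(type <= TOY_TYPE_MASK && offset < (1u << 24));
    return (offset << TOY_OFFSET_SHIFT) | (type << TOY_TYPE_SHIFT);
}

uint32_t toy_swp_type(uint32_t entry)
{
    return (entry >> TOY_TYPE_SHIFT) & TOY_TYPE_MASK;
}

uint32_t toy_swp_offset(uint32_t entry)
{
    return entry >> TOY_OFFSET_SHIFT;
}

enum toy_pte_state { TOY_PTE_NULL, TOY_PTE_SWAPPED, TOY_PTE_PRESENT };

/* Classify a page table entry according to cases I to III above. */
enum toy_pte_state toy_classify_pte(uint32_t pte)
{
    if (pte == 0)
        return TOY_PTE_NULL;
    if (pte & TOY_PRESENT)
        return TOY_PTE_PRESENT;
    return TOY_PTE_SWAPPED;
}
```

Note that in this layout the identifier for area 0, slot 0 is indistinguishable from a null entry, so the sketch only works if slot 0 of an area is never handed out for page data.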

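Allocating and releasing page slots

1. Finding a free page slot. The scan_swap_map() function performs the search. Linux uses a hybrid strategy: in most cases (cluster_nr is positive) the search starts from cluster_next, but when the end of the swap area has been reached, or when allocation has to restart (SWAPFILE_CLUSTER = 256 free page slots have been allocated since the last restart from the beginning), the search starts from lowest_bit. If no free page slot is found, lowest_bit is set to the maximum index and highest_bit to 0 to indicate that the swap area is full. All searches operate on the swap_map counter array.

(If the swap area is full and memory is also exhausted, the page frame reclaiming algorithm ends up killing a process to reclaim memory. Swapping only stores unused pages temporarily in order to free memory; it is one part of the page frame reclaiming algorithm.)

2. The get_swap_page() function looks for a free page slot by searching all active swap areas (the swap_list list); it relies mainly on scan_swap_map() to obtain the slot. The swap_free() function decrements the swap_map counter of a slot; when the counter reaches 0 the page slot becomes free, and the corresponding fields of the swap area descriptor have to be updated.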
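The search and release logic described above can be modelled roughly as follows. This is a deliberately simplified sketch, not the kernel's scan_swap_map(): the toy_ names are hypothetical, locking is ignored, slot 0 is assumed to be reserved, and the restart path simply rescans from lowest_bit.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CLUSTER 256   /* plays the role of SWAPFILE_CLUSTER */

struct toy_swap_area {
    unsigned short *swap_map;   /* one usage counter per page slot */
    unsigned long max;          /* number of slots, slot 0 included */
    unsigned long lowest_bit;   /* first slot that may be free */
    unsigned long highest_bit;  /* last slot that may be free */
    unsigned long cluster_next; /* where the next fast-path scan starts */
    unsigned int cluster_nr;    /* allocations left before a restart */
    unsigned long inuse_pages;
};

/* Mark a slot as used by one process and update the bookkeeping fields. */
static unsigned long toy_take(struct toy_swap_area *si, unsigned long offset)
{
    si->swap_map[offset] = 1;
    si->inuse_pages++;
    si->cluster_next = offset + 1;
    if (offset == si->lowest_bit)
        si->lowest_bit++;
    if (offset == si->highest_bit)
        si->highest_bit--;
    return offset;
}

/* Return the index of a newly allocated slot, or 0 if the area is full. */
unsigned long toy_scan_swap_map(struct toy_swap_area *si)
{
    unsigned long offset;

    assert(si != NULL && si->swap_map != NULL);

    /* Usual case: continue right after the previously allocated slot. */
    if (si->cluster_nr > 0) {
        for (offset = si->cluster_next; offset <= si->highest_bit; offset++) {
            if (si->swap_map[offset] == 0) {
                si->cluster_nr--;
                return toy_take(si, offset);
            }
        }
    }

    /* End of area reached or cluster used up: restart from lowest_bit. */
    si->cluster_nr = TOY_CLUSTER;
    for (offset = si->lowest_bit; offset <= si->highest_bit; offset++) {
        if (si->swap_map[offset] == 0) {
            si->cluster_nr--;
            return toy_take(si, offset);
        }
    }

    /* Nothing free: record that the swap area is full. */
    si->lowest_bit = si->max;
    si->highest_bit = 0;
    return 0;
}

/* Drop one reference; when the counter hits 0 the slot is free again. */
void toy_swap_free(struct toy_swap_area *si, unsigned long offset)
{
    assert(offset > 0 && offset < si->max && si->swap_map[offset] > 0);
    if (--si->swap_map[offset] == 0) {
        if (offset < si->lowest_bit)
            si->lowest_bit = offset;
        if (offset > si->highest_bit)
            si->highest_bit = offset;
        si->inuse_pages--;
    }
}
```

Allocating consecutive slots where possible keeps pages that are swapped out together close to each other on disk, while the periodic restart from lowest_bit lets slots freed near the start of the area be reused.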

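Swap cache

1. The swap cache was introduced to solve synchronization problems: two processes may try to swap in the same shared anonymous page at the same time, or a process may try to swap in a page that is concurrently being swapped out. In the first case, a page being swapped in must first be placed in the swap cache, so the race can be avoided by means of the PG_locked flag of the page descriptor. For the second case, suppose processes A and B share a page P. When P has to be swapped out, it is first moved into the swap cache and the page table entries of A and B are updated; from that point on, the page frame of P is referenced only by the swap cache. Two things can happen next:

(1) A swap-in is needed (for example, process B accesses the page). Process B takes a page fault, the page fault handler finds the page frame of P in the swap cache, and it simply writes the physical address of that page frame into B's page table entry. There is no need to allocate a new page and read the data back from the page slot.

(2) No swap-in is needed. The swap cache drops its reference to the page frame of P and releases it to the buddy system.

The swap cache provides a set of helper functions for use by the upper layers.

2. The swap cache is implemented on top of the page cache, and its core data structure is the radix tree. Pages in the swap cache are stored like any other page in the page cache, with the following special characteristics:

I. The mapping field of the page descriptor is NULL;

II. The PG_swapcache flag of the page descriptor is set;

III. The private field stores the swapped-out page identifier associated with the page;

3. All pages in the swap cache use a single address space, swapper_space, so a single radix tree addresses all pages in the swap cache. The nrpages field stores the number of pages in the swap cache.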

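Swapping pages out and in

1. The first step in swapping out a page is to prepare the swap cache. If the page frame reclaiming algorithm (the shrink_list() function) determines that a page is anonymous and not yet in the swap cache, the kernel calls add_to_swap(), which allocates a new page slot in a swap area and inserts the page frame into the swap cache; all page table entries that refer to the anonymous page are then updated. The next step writes the page's data to the swap area. This is the actual I/O transfer: pageout() invokes the writepage() method of the page's address_space object, and once the transfer has finished the kernel still has to wake up the waiting processes, clear flags, and so on. Finally, the kernel tries to release the page frame from the swap cache to the buddy system.

2. Swap thrashing means that pages are repeatedly written to disk and read back again, so that most of the time is spent accessing the disk. The kernel addresses this by granting a swap token to a single process in the system; the token exempts that process from page frame reclaiming. While a process holds the swap token, swap_token_mm is set to the address of the process's memory descriptor. When a process needs to read a page from a swap area, the grab_swap_token() function decides whether the token should be given to the current process.

The handle_pte_fault() function calls do_swap_page() to swap a page in. do_swap_page() first checks whether the page is already in the swap cache; if it is not, it calls swap_readpage() to read the page data from the swap area into the swap cache, and then completes the remaining work.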
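The "check the swap cache first, read from the swap area only on a miss" part of the swap-in path can be sketched like this. It is a toy model under clear assumptions: a small array with linear search stands in for the radix tree, toy_read_slot() fakes the disk read, and none of the toy_ names exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_SLOTS 8

struct toy_cached_page {
    uint32_t entry;   /* swapped-out page identifier (role of private) */
    int in_use;
    int data;         /* stands in for the page contents */
};

struct toy_swap_cache {
    struct toy_cached_page pages[TOY_CACHE_SLOTS];
    unsigned long disk_reads;   /* how many times the "disk" was read */
};

/* Linear search standing in for the radix tree lookup. */
struct toy_cached_page *toy_lookup_swap_cache(struct toy_swap_cache *c,
                                              uint32_t entry)
{
    for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
        if (c->pages[i].in_use && c->pages[i].entry == entry)
            return &c->pages[i];
    return NULL;
}

/* Hypothetical disk read: the contents are derived from the identifier. */
static int toy_read_slot(uint32_t entry)
{
    return (int)(entry * 2u);
}

/* Return the cached page for `entry`, reading it in only on a miss. */
struct toy_cached_page *toy_swap_in(struct toy_swap_cache *c, uint32_t entry)
{
    struct toy_cached_page *page;

    assert(c != NULL);
    page = toy_lookup_swap_cache(c, entry);
    if (page != NULL)
        return page;            /* already in the swap cache: no I/O */

    for (size_t i = 0; i < TOY_CACHE_SLOTS; i++) {
        if (!c->pages[i].in_use) {
            page = &c->pages[i];
            page->entry = entry;
            page->in_use = 1;
            page->data = toy_read_slot(entry);
            c->disk_reads++;
            return page;
        }
    }
    return NULL;                /* toy cache is full */
}
```

If two sharers fault on the same identifier one after the other, the second call finds the page in the cache, so both end up with the same page and the data is read only once.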
