Understanding the Linux Kernel: Accessing Files

File access modes

There are many modes of accessing files. We consider the following situations in this chapter:

Canonical mode
	The file is opened with the O_SYNC and O_DIRECT flags cleared, and its content is accessed through the read() and write() system calls.
	The read() system call blocks the calling process until the data is copied into the user-mode address space (the kernel is allowed to return fewer bytes than requested).
	The write() system call is different: it terminates as soon as the data is copied into the page cache (deferred write). This is detailed in the "Read and write files" section.
Synchronous mode
	The file is opened with the O_SYNC flag set, or the flag is set later by the fcntl() system call.
	This flag affects only write operations (read operations always block): a write blocks the calling process until the data is effectively written to disk. This, too, is detailed in the "Read and write files" section.
Memory mapping mode
	After opening the file, the application issues the mmap() system call to map the file into memory.
	As a result, the file becomes an array of bytes in RAM, and the application can directly access the array elements instead of using read(), write(), or lseek().
	This is detailed in the "Memory mapping" section.
Direct I/O mode
	The file is opened with the O_DIRECT flag set.
	Every read or write operation transfers data directly between the user-mode address space and the disk, bypassing the page cache. This is detailed in the "Direct I/O transfer" section.
	(The values of the O_SYNC and O_DIRECT flags can be combined in four ways.)
Asynchronous mode
	The file can be accessed in two ways, namely through a group of POSIX APIs or through Linux-specific system calls. In asynchronous mode, data transfer requests never block the calling process;
	rather, they are carried out in the background while the application continues its normal execution. This is detailed in the "Asynchronous I/O" section.
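As a concrete illustration, the user-space sketch below shows how a program selects these modes through the open() flags and mmap(). It is a minimal example, not kernel code; the file name is illustrative and error checking is omitted.

#define _GNU_SOURCE        /* for O_DIRECT */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Canonical mode: neither O_SYNC nor O_DIRECT is set. */
	int fd = open("data.bin", O_RDWR);

	/* Synchronous mode: write() blocks until data reaches the disk. */
	int fd_sync = open("data.bin", O_RDWR | O_SYNC);

	/* Direct I/O mode: transfers bypass the page cache (the user
	 * buffer must usually be block-aligned, e.g. via posix_memalign()). */
	int fd_direct = open("data.bin", O_RDWR | O_DIRECT);

	/* Memory mapping mode: the file becomes an array of bytes in RAM. */
	char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	                 MAP_SHARED, fd, 0);
	map[0] = 'x';          /* no read()/write()/lseek() needed */

	munmap(map, 4096);
	close(fd); close(fd_sync); close(fd_direct);
	return 0;
}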

Read and write files

How the read() and write() system calls are implemented was explained in the "read() and write() system calls" section of Chapter 12. The corresponding service routines eventually invoke the read and write methods of the file object, which may be file-system dependent. For disk file systems, these methods locate the physical blocks holding the data being accessed and activate the block device driver to start the data transfer. Reading a file is page-based: the kernel always transfers whole pages of data at once. If a process issues a read() system call to get a few bytes that are not already in RAM, the kernel allocates a new page frame, fills the page with the suitable portion of the file, adds the page to the page cache, and finally copies the requested bytes into the process address space.

For most file systems, reading a page of data from a file amounts to finding the blocks on disk that contain the requested data. Once this is done, the kernel fills the pages by submitting suitable I/O operations to the generic block layer. In fact, the read method of most disk file systems is implemented by a common function named generic_file_read().

For disk-based files, the handling of write operations is quite complicated because the file size can change, so the kernel may allocate some physical blocks on the disk. Of course, how this process is implemented depends on the type of file system. However, many disk file systems implement their write method through the generic function generic_file_write(). Such file systems include Ext2, System V/Coherent/Xenix and Minix. On the other hand, there are several file systems (such as journal file systems and network file systems) that implement their write methods through custom functions.

Read data from file

Let us discuss the generic_file_read() function, which implements the read method of almost all ordinary files in the disk file system and any block device file. This function operates on the following parameters:

filp
	Address of the file object
buf
	Linear address of the user-mode memory area where the data read from the file must be stored
count
	Number of characters to read
ppos
	Pointer to a variable storing the file offset at which the read operation starts (usually the f_pos field of the filp file object)

In the first step, the function initializes two descriptors. The first, stored in the local variable local_iov of type iovec, contains the address (buf) and length (count) of the user-mode buffer where the data read from the file is to be stored. The second, stored in the local variable kiocb of type kiocb, is used to track the completion status of the ongoing synchronous or asynchronous I/O operation. The main fields of the kiocb descriptor are shown in Table 16-1.
[Table 16-1: the main fields of the kiocb descriptor]
The function generic_file_read() initializes the kiocb descriptor by executing the init_sync_kiocb macro, which sets the fields of the object for a synchronous operation. Specifically, the macro sets the ki_key field to KIOCB_SYNC_KEY, the ki_filp field to filp, and the ki_obj field to current. Then generic_file_read() invokes __generic_file_aio_read(), passing it the addresses of the just-filled iovec and kiocb descriptors. The latter function returns a value, usually the number of bytes effectively read from the file; generic_file_read() terminates by returning this value.
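The corresponding code is quite short; the following is essentially generic_file_read() as it appears in mm/filemap.c of 2.6 kernels, lightly simplified:

ssize_t generic_file_read(struct file *filp, char __user *buf,
                          size_t count, loff_t *ppos)
{
	struct iovec local_iov = { .iov_base = buf, .iov_len = count };
	struct kiocb kiocb;
	ssize_t ret;

	/* Fill the kiocb for a synchronous operation on filp. */
	init_sync_kiocb(&kiocb, filp);
	ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos);
	if (ret == -EIOCBQUEUED)	/* does not happen for synchronous I/O */
		ret = wait_on_sync_kiocb(&kiocb);
	return ret;
}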

The function __generic_file_aio_read() is a common routine used by all file systems to implement both synchronous and asynchronous read operations. It receives four parameters: the address iocb of a kiocb descriptor, the address iov of an array of iovec descriptors, the length of the array, and the address ppos of a variable storing the file's current pointer. When invoked by generic_file_read(), the array of iovec descriptors has just one element, describing the user-mode buffer that will receive the data (Note 1).

We now explain the operation of __generic_file_aio_read(). For the sake of simplicity, we focus on the most common case: a synchronous operation raised by a read() system call on a file backed by the page cache. Later in this chapter we describe the other cases in which this function is executed. As usual, we do not discuss how errors and exceptional conditions are handled. The function performs the following steps:

  1. Calls access_ok() to check whether the user-mode buffer described by the iovec descriptor is valid. Because the starting address and length have been received from the sys_read() service routine, they must be checked before being used (see the "Verification Parameters" section in Chapter 10). If the parameters are invalid, the error code -EFAULT is returned.
  2. Create a read operation descriptor, which is a data structure of type read_descriptor_t. This structure stores the current status of file read operations associated with a single user-mode buffer. The fields of this descriptor are shown in Table 16-2.
  3. Call the function do_generic_file_read(), passing it the file object pointer filp, the file offset pointer ppos, the address of the read operation descriptor just allocated and the address of the function file_read_actor() (will be explained later).
  4. Returns the number of bytes copied to the user mode buffer, which is the value of the written field in the read_descriptor_t data structure.
    [Table 16-2: the fields of the read_descriptor_t descriptor]
    The function do_generic_file_read() reads the requested pages from disk and copies them into the user-mode buffer. In particular, it performs the following steps:
  1. Obtains the address_space object corresponding to the file being read; its address is stored in filp->f_mapping.
  2. Gets the owner of the address_space object, that is, the inode object that will own the pages filled with the file's data; its address is stored in the host field of the address_space object. If the file being read is a block device file, the owner is not the inode object pointed to by filp->f_dentry->d_inode, but rather an inode in the bdev special file system.
  3. Considers the file as subdivided in pages of data (4096 bytes per page) and derives from the file pointer *ppos the logical number of the page containing the first requested byte, that is, the index of the page in the address space; stores this value in the index local variable. Also stores in the offset local variable the offset within the page of the first requested byte.
  4. Starts a loop to read all pages that include the requested bytes; the number of bytes to be read is stored in the count field of the read_descriptor_t descriptor. During a single iteration, the function transfers one page of data by performing the following substeps:
    a. If index*4096+offset exceeds the file size stored in the i_size field of the index node object, exit from the loop and jump to Step 5.
    b. Call cond_resched() to check the flag TIF_NEED_RESCHED of the current process. If this flag is set, the function schedule() is called.
    c. If there are pre-read pages, call page_cache_readahead() to read these pages. We discuss pre-reading in the section "Pre-reading of files" later.
    d. Call find_get_page(), passing in a pointer to the address_space object and the index value as parameters; it will search the page cache to find the page descriptor containing the requested data (if any).
    e. If find_get_page() returned a NULL pointer, the requested page is not in the page cache. In that case, the function performs the following substeps:
    (1). Call handle_ra_miss() to adjust the parameters of the pre-reading system.
    (2). Allocate a new page.
    (3). Call add_to_page_cache() to insert the new page descriptor into the page cache. Remember that this function sets the PG_locked flag of the new page.
    (4). Call lru_cache_add() to insert a new page descriptor into the LRU linked list (see Chapter 17).
    (5). Jump to step 4j and start reading file data.
    f. If the function has run this far, the page is already in the page cache. Check the flag PG_uptodate; if set, the data stored in the page is the latest, so there is no need to read data from disk. Jump to step 4m.
    g. The data in the page is not valid and must be read from disk. The function gains exclusive access to the page by invoking lock_page(). As described in the "Page Cache Handling Functions" section of Chapter 15, if PG_locked is already set, lock_page() blocks the current process until the flag is cleared.
    h. The page is now locked by the current process. However, another process might have removed the page from the page cache during the previous step; therefore, the function checks whether the mapping field of the page descriptor is NULL. In that case, it invokes unlock_page() to unlock the page, decrements its reference count (which find_get_page() had incremented), and jumps back to step 4a to start over with the same page.
    i. If the function has reached this point, the page is locked and still in the page cache. It checks the PG_uptodate flag again, because another kernel control path could have completed the necessary read between steps 4f and 4g. If the flag is set, it invokes unlock_page() and jumps to step 4m, skipping the read operation.
    j. Now the real I/O operation can begin, calling the readpage method of the file's address_space object. The corresponding function is responsible for activating the I/O data transfer between disk and page. We will discuss what this function does for ordinary files and block device files later.
    k. If the PG_uptodate flag has still not been set, invokes lock_page() to wait until the page is effectively read. The page was locked in step 4g and will be unlocked as soon as the read operation completes; therefore, the current process sleeps until the I/O data transfer terminates.
    l. If the index exceeds the number of pages contained in the file (the number is obtained by dividing the value of the i_size field of the inode object by 4096), then it will decrement the reference counter of the page and jump out of the loop to step 5. This happens when the file being read by this process is being deleted by another process.
    m. Stores in the local variable nr the number of bytes in the page that should be copied into the user-mode buffer. This value is equal to the page size (4096 bytes), unless offset is not zero (this can happen only for the first or last page of the request) or unless the requested data does not entirely lie in the file.
    n. Call mark_page_accessed() to set the flag PG_referenced or PG_active, indicating that the page is being accessed and should not be swapped out (see Chapter 17). If the same file (or part of it) is read several times in subsequent executions of do_generic_file_read(), then this step is only performed on the first read.
    o. Now it is time to copy the page's data into the user-mode buffer. To do this, do_generic_file_read() invokes the file_read_actor() function, whose address was passed as a parameter (see step 3 above). file_read_actor() performs the following steps:
    (1). Call kmap(), which establishes a permanent kernel mapping for the page in high-end memory (see the section "Kernel Mapping of High-end Memory Page Frame" in Chapter 8).
    (2). Call __copy_to_user(), which copies the data in the page to the user state address space (see the section "Accessing the Process Address Space" in Chapter 10). Note that this operation will block the process if there is a page fault exception when accessing the user-mode address space.
    (3). Call kunmap() to release any permanent kernel mapping of the page.
    (4). Update the count, written and buf fields of the read_descriptor_t descriptor.
    p. Updates the local variables index and offset according to the number of bytes effectively transferred into the user-mode buffer. Usually, if the last byte of the page has been copied into the buffer, index is incremented by one and offset is cleared to zero; otherwise, index stays unchanged and offset is increased by the number of bytes that were copied into the buffer.
    q. Decrements the reference counter of the page descriptor.
    r. If the count field of the read_descriptor_t descriptor is not zero, there is more data to read from the file: jumps back to step 4a to continue the loop with the next page of the file.
  5. All requested, or available, data has been read. The function updates the read-ahead data structure filp->f_ra to record that the data is being read sequentially from the file (see the next section, "Pre-reading of files").
  6. Assigns the value index*4096+offset to *ppos, thus saving the position where a future sequential read() or write() call will start.
  7. Invokes update_atime() to store the current time in the i_atime field of the file's inode object, marks the inode as dirty, and returns.
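For reference, the read_descriptor_t structure described in Table 16-2 and the file_read_actor() function invoked in substep 4o look roughly as follows in 2.6 sources; this is a simplified rendition (the kmap_atomic() fast path and some error handling are omitted):

typedef struct {
	size_t written;     /* bytes already copied into the user buffer */
	size_t count;       /* bytes still to be transferred */
	char __user *buf;   /* current position in the user buffer */
	int error;          /* error code of the read operation, or 0 */
} read_descriptor_t;

int file_read_actor(read_descriptor_t *desc, struct page *page,
                    unsigned long offset, unsigned long size)
{
	char *kaddr;
	unsigned long left, count = desc->count;

	if (size > count)
		size = count;
	kaddr = kmap(page);             /* map the page (may be in high memory) */
	left = __copy_to_user(desc->buf, kaddr + offset, size);
	kunmap(page);                   /* release the kernel mapping */
	if (left) {
		size -= left;           /* partial copy: a page fault failed */
		desc->error = -EFAULT;
	}
	desc->count = count - size;     /* update the read operation descriptor */
	desc->written += size;
	desc->buf += size;
	return size;
}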

Readpage method of ordinary files

We saw from the previous section that do_generic_file_read() repeatedly uses the readpage method to read pages from disk into memory. The readpage method of the address_space object stores the address of a function that effectively activates the transfer of I/O data from the physical disk to the page cache. For ordinary files, this field usually points to the wrapper function that calls the mpage_readpage() function. For example, the readpage method of the Ext3 file system is implemented by the following function:

int ext3_readpage(struct file *file, struct page *page)
{
	return mpage_readpage(page, ext3_get_block);
}

The wrapper is needed because mpage_readpage() receives two parameters: the page descriptor page of the page to be filled and the address get_block of a function that helps mpage_readpage() find the right blocks. The wrapper is file-system dependent and therefore supplies the proper get_block function, which translates a block number relative to the beginning of the file into a logical block number relative to the position of the block in the disk partition (see Chapter 18 for an example). Of course, this parameter depends on the type of file system the regular file belongs to; in the example above, it is the address of ext3_get_block(). The get_block function passed in always uses a buffer head to store significant information such as the block device (b_bdev field), the position of the requested data on the device (b_blocknr field), and the block status (b_state field). When reading a page from disk, mpage_readpage() chooses between two different strategies: if the blocks containing the requested data are contiguous on disk, the function issues the read I/O operation to the generic block layer with a single bio descriptor; if they are not, it reads each block in the page with a separate bio descriptor.
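All these get_block functions share a common prototype, defined by the get_block_t type in the 2.6 kernel headers (<linux/fs.h>):

/* iblock is the block number relative to the beginning of the file;
 * on success, the function fills bh_result with the block device and
 * the logical block number. 'create' tells whether a missing block
 * may be allocated (writes) or not (reads). */
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
                          struct buffer_head *bh_result, int create);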

The get_block function depends on the file system. One of its important functions is to determine whether the next block in the file is also the next block on the disk. Specifically, the mpage_readpage() function performs the following steps:

  1. Checks the PG_private field of the page descriptor: if set, the page is a buffer page, that is, a page associated with a list of buffer heads describing the blocks that compose it (see the "Storing Blocks in the Page Cache" section in Chapter 15). This means that the page has already been read from disk in the past and that its blocks are not adjacent on disk: jumps to step 11 to read the page one block at a time.
  2. Gets the block size (stored in the page->mapping->host->i_blkbits inode field) and computes the two values needed to access all the blocks of the page: the number of blocks in the page and the file block number of the first block in the page, that is, the index of the first block of the page relative to the beginning of the file.
  3. For each block in the page, call the file system-dependent get_block function, passing it as a parameter to get the logical block number, which is the block index relative to the start of the disk or partition. The logical block numbers of all blocks in the page are stored in a local array.
  4. While performing the previous step, checks for exceptional conditions: some blocks are not adjacent on disk, some block falls inside a "file hole" (see the "File Hole" section in Chapter 18), or some block buffer has already been filled by the get_block function. In any of these cases, jumps to step 11 to read the page one block at a time.
  5. If the function reaches this point, all blocks in the page are contiguous on disk. However, it may be the last page in the file, so some blocks in the page may not have an image on disk. If so, it fills the corresponding block buffer in the page with 0s; if not, it sets the page descriptor flag PG_mappedtodisk.
  6. Call bio_alloc() to allocate a new bio descriptor containing a single segment, and initialize the bi_bdev field and bi_sector field with the block device descriptor address and the logical block number of the first block in the page, respectively. These two pieces of information were obtained in step 3 above.
  7. Set the bio_vec descriptor of the bio segment with the starting address of the page, the offset of the first byte of the data read (0), and the total number of bytes read.
  8. Assign the address of the mpage_end_io_read() function to the bio->bi_end_io field (see below).
  9. Calls submit_bio(), which sets the bi_rw flag according to the direction of the data transfer, updates the per-CPU variable page_states to keep track of the number of sectors read, and invokes generic_make_request() on the bio descriptor (see the "Issuing a Request to the I/O Scheduler" section in Chapter 14).
  10. Returns 0 (success).
  11. If the function jumps here, the page contains blocks that are not adjacent on disk. If the page is up to date (PG_uptodate is set), the function invokes unlock_page() to unlock the page; otherwise, it invokes block_read_full_page() to read the page one block at a time (see below).
  12. Returns 0 (success).

The function mpage_end_io_read() is the completion method of the bio; it is executed as soon as the I/O data transfer terminates. Assuming there was no I/O error, the function sets the PG_uptodate flag of the page descriptor, invokes unlock_page() to unlock the page and wake up any process sleeping on the event, and finally invokes bio_put() to release the bio descriptor.
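A condensed version of mpage_end_io_read(), close to fs/mpage.c of 2.6 kernels (the prefetch hint and partial-completion details are trimmed), conveys the idea:

static int mpage_end_io_read(struct bio *bio, unsigned int bytes_done, int err)
{
	const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
	struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;

	if (bio->bi_size)		/* transfer not finished yet */
		return 1;

	do {				/* one iteration per page in the bio */
		struct page *page = bvec->bv_page;

		if (uptodate)
			SetPageUptodate(page);
		else {
			ClearPageUptodate(page);
			SetPageError(page);
		}
		unlock_page(page);	/* wakes up processes sleeping on the page */
	} while (--bvec >= bio->bi_io_vec);

	bio_put(bio);			/* release the bio descriptor */
	return 0;
}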

Readpage method of block device file

In the "VFS Handling of Device Files" section of Chapter 13 and the "Opening Block Device Files" section of Chapter 14, we discussed how the kernel handles requests to open block device files. We also saw how the init_special_inode() function establishes the device's inode and how blkdev_open() completes its open phase. In the bdev special file system, the block device uses the address_space object, which is stored in the i_data field of the corresponding block device index node. Unlike ordinary files (whose readpage method in the address_space object depends on the type of file system the file belongs to), the readpage method of a block device file is always the same. It is implemented by the blkdev_readpage() function, which calls block_read_full_page():

int blkdev_readpage(struct file *file, struct page *page)
{
	return block_read_full_page(page, blkdev_get_block);
}

As you can see, this function is again a wrapper function, here it is a wrapper function for the block_read_full_page() function. This time, the second parameter also points to a function that converts the file block number relative to the beginning of the file to a logical block number relative to the beginning of the block device. However, for block device files, these two numbers are the same; therefore, the blkdev_get_block() function performs the following steps:

  1. Check whether the block number of the first block in the page exceeds the index value of the last block of the block device (the block device size stored in bdev->bd_inode->i_size is divided by the block size stored in bdev->bd_block_size to get the index value; bdev points to the block device descriptor). If exceeded, then it returns -EIO for write operations and 0 for read operations.
    (Reading beyond the block device is also not allowed, but no error code is returned. The kernel can try to issue a read request for the last data of the block device, and the resulting buffer page is only partially mapped).
  2. Sets the b_bdev field of the buffer head to bdev.
  3. Sets the b_blocknr field of the buffer head to the file block number, which was passed as a parameter of the function.
  4. Sets the BH_Mapped flag of the buffer head, to state that the b_bdev and b_blocknr fields of the buffer head are significant.

The function block_read_full_page() reads a page of data one block at a time. As we have seen, it is used both when reading block device files and when reading pages of regular files whose blocks are not adjacent on disk. It performs the following steps:
  1. Checks the PG_private flag of the page descriptor: if set, the page is already associated with a list of buffer heads describing the blocks that compose it (see the "Storing Blocks in the Page Cache" section of Chapter 15); otherwise, invokes create_empty_buffers() to allocate buffer heads for all the block buffers in the page. The address of the buffer head of the first buffer in the page is stored in the page->private field; the b_this_page field of each buffer head points to the buffer head of the next buffer in the page.
  2. Derives from the offset of the page within the file (the page->index field) the file block number of the first block in the page.
  3. For the buffer head of each buffer in the page, performs the following substeps:
    a. If the BH_Uptodate flag is set, skips the buffer and continues with the next buffer in the page.
    b. If the BH_Mapped flag is not set and the block does not lie beyond the end of the file, invokes the file-system-dependent get_block function, whose address was received as a parameter. For a regular file, the function looks in the on-disk data structures of the file system and finds the logical block number of the buffer relative to the beginning of the disk or partition. For a block device file, by contrast, the function regards the file block number as a logical block number. In both cases, the function stores the logical block number in the b_blocknr field of the corresponding buffer head and sets the BH_Mapped flag (Note 2).
    c. Checks the BH_Uptodate flag again, because the file-system-dependent get_block function could have triggered a block I/O operation that updated the buffer. If BH_Uptodate is set, continues with the next buffer in the page.
    d. Stores the address of the buffer head in the local array arr, and continues with the next buffer in the page.
  4. If no "file hole" was encountered in the previous step, sets the PG_mappedtodisk flag of the page.
  5. Now the local array arr stores the addresses of the buffer heads whose corresponding buffers do not hold up-to-date contents. If the array is empty, all buffers in the page are valid, so the function sets the PG_uptodate flag of the page descriptor, invokes unlock_page() to unlock the page, and returns.
  6. The local array arr is not empty. For each buffer head in the array, block_read_full_page() performs the following substeps:
    a. Sets the BH_Lock flag; if the flag was already set, the function waits until the buffer is released.
    b. Sets the b_end_io field of the buffer head to the address of the end_buffer_async_read() function (see below), and sets the BH_Async_Read flag of the buffer head.
  7. Invokes submit_bh() on each buffer head in the local array arr, specifying READ as the type of operation. As we saw earlier, this function triggers the I/O data transfer of the corresponding block.
  8. Returns 0.
The function end_buffer_async_read() is the completion method of the buffer heads; it is executed as soon as the I/O data transfer on a block buffer terminates. Assuming there was no I/O error, the function sets the BH_Uptodate flag of the buffer head and clears the BH_Async_Read flag. Then it gets the descriptor of the buffer page containing the block buffer (its address is stored in the b_page field of the buffer head) and checks whether all blocks in the page are up to date; if so, the function sets the PG_uptodate flag of the page and invokes unlock_page().
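The steps of blkdev_get_block() listed above translate into just a few lines of code; the following is a sketch close to fs/block_dev.c in 2.6 kernels, where max_block() is an internal helper that computes the size of the device in blocks:

static int blkdev_get_block(struct inode *inode, sector_t iblock,
                            struct buffer_head *bh, int create)
{
	/* Step 1: reject blocks beyond the end of the block device;
	 * for reads, return 0 so the caller zero-fills the buffer. */
	if (iblock >= max_block(I_BDEV(inode)))
		return create ? -EIO : 0;

	/* Steps 2-4: the file block number and the logical block
	 * number coincide for block device files. */
	bh->b_bdev = I_BDEV(inode);
	bh->b_blocknr = iblock;
	set_buffer_mapped(bh);
	return 0;
}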

Pre-reading of files

Many disk accesses are sequential. As we will see in Chapter 18, ordinary files are stored on the disk in groups of adjacent sectors, so the files can be retrieved quickly with little movement of the disk head. When a program reads or copies a file, it usually accesses the file sequentially from the first byte to the last byte. Therefore, when a series of read requests from a process to the same file are processed, many adjacent sectors on the disk can be read.

Read-ahead is a technique that consists in reading several adjacent data pages of a regular file or block device file before the actual request. In most cases, read-ahead can greatly improve disk performance because read-ahead causes the disk controller to process fewer commands, each of which involves a large set of contiguous sectors. In addition, pre-reading can improve the responsiveness of the system. A process that reads a file sequentially usually does not need to wait for the requested data because the requested data is already in RAM. However, read-ahead is useless for randomly accessed files; in this case, read-ahead is actually harmful because it wastes page cache space with useless information. Therefore, the kernel reduces or stops read-ahead when it determines that the most recent I/O access is not sequential with the previous I/O access.

File read-ahead requires a rather sophisticated algorithm, for the following reasons:
· Because data is read page by page, the read-ahead algorithm does not care about the offset within a page, only about the position of the accessed pages within the file.
· Read-ahead is gradually increased as long as the process keeps accessing the file sequentially.
· Read-ahead is scaled down or even disabled when the current access is not sequential with respect to the previous one (random access).
· Read-ahead must be stopped when a process keeps accessing the same pages over and over (only a small portion of the file is in use), or when almost all pages of the file are already in the page cache.
· The low-level I/O device driver should be activated at the proper time, so that pages will already have been transferred when a process needs them.

An access to a given file is deemed sequential with respect to the previous one if the first page requested is the page following the last page requested in the previous access. While accessing a given file, the read-ahead algorithm makes use of two sets of pages, each corresponding to a contiguous portion of the file. These two sets are called the current window and the ahead window.

The current window consists of pages requested by the process or read in advance by the kernel, and located in the page cache (the pages in the current window are not necessarily up to date, because I/O data transfers might still be in progress). The current window contains the last pages sequentially accessed by the process, plus possibly some pages that have been read ahead by the kernel but not yet requested by the process.

The ahead window consists of the pages, immediately following those in the current window, that the kernel is currently reading ahead. No page in the ahead window has yet been requested by the process, but the kernel assumes that, sooner or later, the process will request them.

When the kernel recognizes a sequential access and the first requested page belongs to the current window, it checks whether an ahead window has already been set up. If not, it creates a new ahead window and triggers the read operations for its pages. In the ideal case, the process keeps requesting pages from the current window while the pages of the ahead window are being transferred. When a page requested by the process belongs to the ahead window, the ahead window becomes the new current window.

The main data structure used by the read-ahead algorithm is the file_ra_state descriptor, whose fields are shown in Table 16-3. Each file object includes such a descriptor in its f_ra field.
[Table 16-3: the fields of the file_ra_state descriptor]
When a file is opened, all fields of its file_ra_state descriptor are set to zero, except the prev_page and ra_pages fields. The prev_page field stores the index of the last page requested by the process in the previous read operation; its initial value is -1. The ra_pages field stores the maximum size in pages of the current window, that is, the maximum amount of read-ahead allowed for the file; its initial (default) value is stored in the backing_dev_info descriptor of the block device holding the file (see the "Request Queue Descriptor" section in Chapter 14). An application can tune the read-ahead algorithm by modifying the ra_pages field of an open file; this is done by invoking the posix_fadvise() system call, passing it one of the commands POSIX_FADV_NORMAL (sets the maximum read-ahead amount to the default, usually 32 pages), POSIX_FADV_SEQUENTIAL (sets the maximum read-ahead amount to twice the default), or POSIX_FADV_RANDOM (sets the maximum read-ahead amount to zero, thus permanently disabling read-ahead).
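From user space, this tuning looks as follows (a minimal sketch; the fd parameter is assumed to be an open file descriptor, and passing 0, 0 for the offset and length means "the whole file"):

#include <fcntl.h>

void tune_readahead(int fd)
{
	/* Double the maximum read-ahead for a sequential scan... */
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	/* ...or disable read-ahead for a random access pattern: */
	/* posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM); */
}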

The flags field contains two important flags, RA_FLAG_MISS and RA_FLAG_INCACHE. The first is set when a page that has been read in advance is not found in the page cache (likely because the kernel reclaimed it to free memory; see Chapter 17): in this case, the size of the next ahead window to be created is somewhat reduced.

The second flag is set when the kernel determines that the last 256 pages requested by the process were all found in the page cache (the number of consecutive cache hits is stored in the ra->cache_hit field); in this case, the kernel assumes that all the pages are already in the page cache and turns read-ahead off.

When is the read-ahead algorithm executed? There are several situations:

  1. When the kernel handles a user-mode request to read pages of file data; this event triggers the invocation of the page_cache_readahead() function (see step 4c in the description of do_generic_file_read() in the "Read data from file" section earlier in this chapter).
  2. When the kernel allocates a page for a file memory mapping (see the filemap_nopage() function in the "Request Paging of Memory Maps" section later in this chapter, which again invokes page_cache_readahead()).
  3. When a user-mode application executes the readahead() system call, it explicitly triggers a read-ahead activity for a certain file descriptor.
  4. When a user-mode application executes the posix_fadvise() system call using the POSIX_FADV_NOREUSE or POSIX_FADV_WILLNEED command, it notifies the kernel that a certain range of file pages will be accessed soon.
  5. When a user-mode application executes the madvise() system call using the MADV_WILLNEED command, it notifies the kernel that a given range of file pages in a file memory-mapped area will soon be accessed.
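Cases 3 to 5 can be triggered explicitly from user space, as in the following illustrative sketch (error checking omitted; fd, map, and len are assumed to be a valid descriptor, a mapped region, and its length):

#define _GNU_SOURCE		/* for readahead() */
#include <fcntl.h>
#include <sys/mman.h>

void prefetch_file(int fd, void *map, size_t len)
{
	/* Case 3: Linux-specific readahead() system call. */
	readahead(fd, 0, len);

	/* Case 4: POSIX advice that a range of the file will be needed. */
	posix_fadvise(fd, 0, len, POSIX_FADV_WILLNEED);

	/* Case 5: the same advice for a memory-mapped region. */
	madvise(map, len, MADV_WILLNEED);
}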

page_cache_readahead() function

The page_cache_readahead() function handles all read-ahead operations that are not explicitly triggered by a special system call. It fills in the current window and the read-ahead window, and updates the size of the current window and the read-ahead window based on the number of read-ahead hits, that is, based on the success of the read-ahead strategy for file access in the past. When the kernel must satisfy a read request for one or more pages of a file, the function is called. The function has the following five parameters:

mapping
	Pointer to the address_space object that describes the owner of the page
ra
	Pointer to the file_ra_state descriptor of the file containing the page
filp
	Address of the file object
offset
	Offset of the page within the file
req_size
	Number of pages still to be read to complete the current read operation (Note 3)

Figure 16-1 shows the flow diagram of page_cache_readahead(). The function essentially acts on the fields of the file_ra_state descriptor, so even though the flow diagram is somewhat informal, you can easily determine the actual steps performed by the function. For instance, in order to check whether the requested page is the one following the last page read, the function compares the value of the ra->prev_page field with the value of the offset parameter (see Table 16-3 above). When the process accesses the file for the first time and the first requested page is the page at offset 0 in the file, the function assumes that the process will perform sequential accesses. Then the function creates a new current window starting from the first page. The initial length of the current window (always a power of two) is related to the number of pages requested by the process in the first read operation.

The higher the number of requested pages, the larger the current window, up to the maximum value stored in the ra->ra_pages field. Conversely, when the process accesses the file for the first time but the first requested page is not at offset 0, the function assumes that the process will not perform sequential accesses and temporarily disables read-ahead (the ra->size field is set to -1). However, a new current window is created when the function recognizes a sequential access while read-ahead is temporarily disabled. If the ahead window does not yet exist, it is created as soon as the function recognizes that the process has performed sequential accesses within the current window. The ahead window always starts from the page following the last page of the current window, but its length is related to the length of the current window: if the RA_FLAG_MISS flag is set, the length of the ahead window is the length of the current window minus 2, or four pages if that value is smaller; otherwise, it is either four times or twice the length of the current window. Thus, read-ahead is aggressively increased while the process keeps reading the file sequentially.
![Figure 16-1: the flow diagram of page_cache_readahead()](https://img-blog.csdnimg.cn/8731f51614264264b567c333becb9d1b.png)
Once the function recognizes that access to the file is not sequential relative to the last time, the current window and the read-ahead window are cleared, and read-ahead is temporarily disabled. Read-ahead will restart when the process's read operations are sequential relative to the last file access. Each time page_cache_readahead() creates a new window, it begins reading the contained pages. To read a large set of pages, the function page_cache_readahead() calls blockable_page_cache_readahead().
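The window-sizing rule just described can be paraphrased in code as follows. This is an illustrative sketch of the policy, not the kernel's exact implementation; in particular, the rule that picks fourfold growth for small windows and twofold growth for larger ones is an assumption of this sketch.

/* Compute the length in pages of the next ahead window from the
 * current window length, per the policy described above. */
static unsigned long next_ahead_size(unsigned long cur_size,
                                     unsigned long max, int ra_flag_miss)
{
	unsigned long size;

	if (ra_flag_miss) {
		/* Previously read-ahead pages were evicted: shrink. */
		size = cur_size >= 2 ? cur_size - 2 : 0;
		if (size < 4)
			size = 4;		/* never below four pages */
	} else {
		/* Sequential streak: grow aggressively (4x while small,
		 * 2x once the window gets large -- an assumption here). */
		size = cur_size < max / 4 ? cur_size * 4 : cur_size * 2;
	}
	return size < max ? size : max;		/* bounded by ra->ra_pages */
}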

In order to reduce kernel overhead, the latter function adopts the following flexible method:

  1. If the request queue serving the block device is read-congested, no read operations are performed.
  2. Compare the page to be read with the page cache. If the page is already in the page cache, just skip it.
  3. All page frames needed for the read request are allocated at once, before reading from disk. If not all page frames can be obtained, the read-ahead is performed only on the available pages; there is little point in deferring read-ahead until all page frames become available.
  4. Whenever possible, issue reads to the general block layer by using multiple segment bio descriptors (see the "Segments" section of Chapter 14). This is achieved through the readpages method dedicated to the address_space object (if it is defined); if it is not defined, it is achieved by repeatedly calling the readpage method. The readpage method is described in detail for the single-segment case in the previous section "Reading data from a file", but with slight modifications you can easily use it for the multi-segment case.

handle_ra_miss() function

In some cases, the read-ahead policy does not seem to be very effective, and the kernel must modify the read-ahead parameters. Let's consider the do_generic_file_read() function described in the "Reading Data from a File" section earlier in this chapter. The function page_cache_readahead() is called in step 4c.

As shown in Figure 16-1, there are two cases: either the requested page was already read in advance (it lies in the current window or in the ahead window), or it was not, and blockable_page_cache_readahead() was invoked to read it. In both cases, do_generic_file_read() should then find the page in the page cache in step 4d. If it does not, the page frame was reclaimed from the cache by the memory reclaiming algorithm. In this case, do_generic_file_read() invokes the handle_ra_miss() function, which tunes the read-ahead algorithm by setting the RA_FLAG_MISS flag and clearing the RA_FLAG_INCACHE flag.
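In 2.6 kernels, the function amounts to little more than those two flag updates; a sketch based on mm/readahead.c:

void handle_ra_miss(struct address_space *mapping,
                    struct file_ra_state *ra, pgoff_t offset)
{
	ra->flags |= RA_FLAG_MISS;	/* shrink the next ahead window */
	ra->flags &= ~RA_FLAG_INCACHE;	/* pages are no longer all cached */
}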

Writing to files

Recall that the write() system call involves moving data from the user-mode address space of the calling process into kernel data structures, and then to disk. The write method of the file object permits every file type to define a specialized write operation. In Linux 2.6, the write method of each disk file system is a procedure that basically identifies the disk blocks involved in the write operation, copies the data from the user-mode address space into some pages belonging to the page cache, and marks the buffers in those pages as dirty. Many file systems (including Ext2 and JFS) implement the write method of the file object through the generic_file_write() function, which has the following parameters:

file
	Pointer to the file object
buf
	Address in the user-mode address space from which the characters to be written into the file must be fetched
count
	Number of characters to write
ppos
	Address of a variable storing the file offset at which writing must start

This function does the following:

  1. Initialize a local variable of type iovec, which contains the address and length of the user-mode buffer (see the description of the generic_file_read() function in the "Reading Data from Files" section earlier in this chapter).
  2. Determines the address inode of the inode object of the file being written (file->f_mapping->host) and acquires the inode->i_sem semaphore. Thanks to this semaphore, only one process at a time can issue a write() system call on the file.
  3. Call the macro init_sync_kiocb to initialize local variables of type kiocb. As described in the "Reading Data from Files" section earlier in this chapter, this macro sets the ki_key field to KIOCB_SYNC_KEY (synchronous I/O operations), the ki_filp field to filp, and the ki_obj field to current.
  4. Call the __generic_file_aio_write_nolock() function (see below) to mark the involved pages as dirty, and pass the corresponding parameters: the addresses of local variables of type iovec and kiocb, the number of segments of the user-mode buffer (there is only one here) and ppos.
  5. Release the inode->i_sem semaphore.
  6. Checks the O_SYNC flag of the file, the S_SYNC flag of the inode, and the MS_SYNCHRONOUS flag of the superblock. If at least one of them is set, invokes sync_page_range() to force the kernel to flush to disk all pages of the page cache touched in step 4, blocking the current process until the I/O data transfers terminate.
    In turn, sync_page_range() first executes the writepages method of the address_space object, if defined, or the mpage_writepages() function, to start the I/O transfer of the dirty pages (see the "Write dirty pages to disk" section later in this chapter); it then invokes generic_osync_inode() to flush the inode and its associated buffers to disk, and finally invokes wait_on_page_bit() to suspend the current process until the PG_writeback flags of all the flushed pages are cleared.
  7. Return the return value of the __generic_file_aio_write_nolock() function, usually the number of valid bytes written.
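Condensed into code, the sequence looks roughly like this. It is a simplified sketch of the 2.6 implementation; in the real source, these steps are split between generic_file_write() and a __generic_file_write_nolock() helper, and error handling is more elaborate.

ssize_t generic_file_write(struct file *file, const char __user *buf,
                           size_t count, loff_t *ppos)
{
	struct address_space *mapping = file->f_mapping;
	struct inode *inode = mapping->host;
	struct iovec local_iov = { .iov_base = (void __user *)buf,
	                           .iov_len = count };
	struct kiocb kiocb;
	ssize_t ret;

	down(&inode->i_sem);			/* one writer at a time */
	init_sync_kiocb(&kiocb, file);
	ret = __generic_file_aio_write_nolock(&kiocb, &local_iov, 1, ppos);
	up(&inode->i_sem);

	/* O_SYNC file or synchronous inode: flush the dirty pages
	 * touched by the write before returning. */
	if (ret > 0 && ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
		ssize_t err = sync_page_range(inode, mapping,
		                              *ppos - ret, ret);
		if (err < 0)
			ret = err;
	}
	return ret;
}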

The function __generic_file_aio_write_nolock() receives four parameters:
the address iocb of the kiocb descriptor, the address iov of the iovec descriptor array, the length of the array, and the address ppos of the variable that stores the current pointer to the file.
When invoked by generic_file_write(), the array of iovec descriptors has just one element, describing the user-mode buffer holding the data to be written (Note 4). We now explain the behavior of __generic_file_aio_write_nolock(). For the sake of simplicity, we discuss only the most common case: a write() system call on a file backed by the page cache. Later in this chapter we describe how this function behaves in other cases. As usual, we do not discuss how errors and exceptional conditions are handled. The function performs the following steps:

  1. Invokes access_ok() to verify that the user-mode buffer described by the iovec descriptor is valid (the starting address and length have been received from the sys_write() service routine, so they must be checked before being used; see the "Verification Parameters" section in Chapter 10). If the parameters are invalid, the error code -EFAULT is returned.
  2. Determine the address inode of the index node object of the file to be written (file->f_mapping->host). Remember: if the file is a block device file, this is the inode of the bdev special file system (see Chapter 14).
  3. Set the address of the backing_dev_info descriptor of the file (file->f_mapping->backing_dev_info) to current->backing_dev_info. In fact, this setting will allow the current process to write back dirty pages owned by file->f_mapping (see Chapter 17) even if the corresponding request queue is congested.
  4. If the O_APPEND flag of file->flags is set and the file is a normal file (not a block device file), it sets *ppos to the end of the file, so that new data will be appended to the end of the file.
  5. Performs several checks on the size of the file. For instance, the write operation must not enlarge a regular file beyond the per-user limit, stored in current->signal->rlim[RLIMIT_FSIZE], or beyond the file-system limit, stored in inode->i_sb->s_maxbytes. Moreover, if the file is not a "large file" (that is, the O_LARGEFILE flag in file->f_flags is cleared), its size cannot exceed 2GB. If any of these limits would be exceeded, the function reduces the number of bytes to be written.
  6. If set, the suid flag of the file is cleared to 0, and if it is an executable file, the sgid flag is also cleared to 0 (see the "Access Permissions and File Mode" section in Chapter 1). We don't want users to be able to modify setuid files.
  7. Store the current time in the inode->mtime field (the latest time of the file write operation) and the inode->ctime field (the latest time of modifying the index node), and mark the index node object as dirty.
  8. Starts a loop to update all file pages involved in the write operation. During each loop, the following substeps are performed:
    a. Call find_lock_page() to search for the page in the page cache (see the "Page Cache Handling Functions" section in Chapter 15). If the function finds the page, it increments the reference count and sets the PG_locked flag.
    b. If the page is not in the page cache, allocate a new page frame and call add_to_page_cache() to insert the page in the page cache. As described in Chapter 15, "Page Cache Handling Functions," this function also increments the reference count and sets the PG_locked flag. In addition, the function also inserts a page into the inactive linked list in the memory management area (see Chapter 17).
    c. Invokes the prepare_write method of the address_space object of the inode (file->f_mapping). The corresponding function allocates and initializes buffer heads for the page. We discuss what this function does for regular files and block device files in the next two sections.
    d. If the page is in high memory, establishes a kernel mapping of it (see the "Kernel Mapping of High-end Memory Page Frame" section in Chapter 8), then invokes __copy_from_user() to copy the characters from the user-mode buffer into the page, and releases the kernel mapping.
    e. Invokes the commit_write method of the address_space object of the inode (file->f_mapping). The corresponding function marks the underlying buffers as dirty, so that they will eventually be written to disk. We discuss what this function does for regular files and block device files in the next two sections.
    f. Call unlock_page() to clear the PG_locked flag and wake up any processes waiting for the page.
    g. Call mark_page_accessed() to update the page status for the memory reclamation algorithm [see the "Least Recently Used (LRU) List" section in Chapter 17].
    h. Decrement the page reference count to undo the increase in step 8a or 8b.
    i. At this point, one more page has been dirtied: checks whether the ratio of dirty pages in the page cache has risen above a fixed threshold (usually 40% of the pages in the system); if so, invokes writeback_inodes() to flush a few dozen pages to disk (see the "Searching for Dirty Pages to Flush" section in Chapter 15).
    j. Call cond_resched() to check the TIF_NEED_RESCHED flag of the current process. If this flag is set, the schedule() function is called.
  9. Now, all pages of the file involved in the write operation have been processed. Update the value of *ppos so that it points to the position just after the last character written.
  10. Set current->backing_dev_info to NULL (see step 3).
  11. Ends after returning the number of valid characters written to the file.

Prepare_write and commit_write methods of ordinary files

The prepare_write and commit_write methods of the address_space object are specialized for the generic write operation implemented by generic_file_write(), and they apply to regular files and block device files. They are invoked once for every page of the file affected by the write operation. Each disk file system defines its own prepare_write method. As with read operations, this method is usually a thin wrapper around a common function. For instance, the Ext2 file system implements the prepare_write method through the following function:

int ext2_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
{
	return block_prepare_write(page, from, to, ext2_get_block);
}

The ext2_get_block() function was already mentioned in the "Read data from file" section earlier; it translates a block number relative to the file into a logical block number, representing the position of the data on the physical block device. The block_prepare_write() function prepares the buffers and buffer heads of the file's page by performing the following steps:

  1. Checks whether the page is a buffer page (in which case the PG_private flag is set); if the flag is cleared, invokes create_empty_buffers() to allocate buffer heads for all the buffers in the page (see the "Buffer Pages" section in Chapter 15).
  2. For each buffer head corresponding to a buffer in the page affected by the write operation, performs the following substeps:
    a. If the BH_New flag is set, clears it (see below).
    b. If the BH_Mapped flag of the buffer head is not set, the function performs the following substeps:
    (1). Invokes the file-system-dependent function whose address get_block was passed as a parameter. This function looks in the on-disk data structures of the file system and finds the logical block number of the buffer (relative to the beginning of the disk partition rather than the beginning of the regular file).
    The file-system-dependent function stores this number in the b_blocknr field of the corresponding buffer head and sets its BH_Mapped flag. It may also allocate a new physical block for the file (for instance, if the accessed block falls inside a "hole" of the regular file; see the "File Hole" section in Chapter 18). In this case, it sets the BH_New flag.
    (2). Checks the value of the BH_New flag; if it is set, invokes unmap_underlying_metadata() to check whether some block device buffer page in the page cache includes a buffer referring to the same block on disk (Note 5). This function essentially invokes __find_get_block() to look up the old block in the page cache (see the "Searching for Blocks in the Page Cache" section in Chapter 15). If such a block is found, the function clears its BH_Dirty flag and waits until any I/O data transfer on that buffer completes. Moreover, if the write operation does not rewrite the whole buffer, the unwritten portion is filled with zeros. Then it considers the next buffer in the page.
    c. If the write operation does not rewrite the whole buffer and its BH_Delay and BH_Uptodate flags are not set (that is, the block has been allocated in the on-disk file system data structures, but the buffer in RAM does not hold a valid data image), invokes ll_rw_block() on the block to read its content from disk (see the "Submitting the Buffer Header to the General Block Layer" section in Chapter 15).
  3. Block the current process until all read operations triggered in step 2c are completed.
  4. Return 0.

Once the prepare_write method returns, the generic_file_write() function updates the page with the data stored in the user-space address space. Next, call the commit_write method of the address_space object. This method is implemented by the generic_commit_write() function and is applicable to almost all non-journaling disk file systems. The generic_commit_write() function performs the following steps:

  1. Call the __block_commit_write() function, and then perform the following steps in sequence:
    a. Consider all buffers in the page affected by the write operation; for each buffer, set the BH_Uptodate and BH_Dirty flags in the corresponding buffer header.
    b. Marks the corresponding inode as dirty, as described in the "Searching for Dirty Pages to Flush" section of Chapter 15. This involves adding the inode to the superblock's list of dirty inodes.
    c. If all buffers in the buffer page are up to date, sets the PG_uptodate flag.
    d. Sets the PG_dirty flag of the page and tags the page as dirty in the radix tree (see the "Radix Tree" section in Chapter 15).
  2. Checks whether the write operation enlarged the file; if so, updates the i_size field of the file's inode object.
  3. Return 0.
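These steps correspond to a handful of lines in fs/buffer.c; the following is a sketch close to the 2.6 source:

int generic_commit_write(struct file *file, struct page *page,
                         unsigned from, unsigned to)
{
	struct inode *inode = page->mapping->host;
	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;

	/* Step 1: mark the affected buffers (and the page) dirty. */
	__block_commit_write(inode, page, from, to);

	/* Step 2: if the write enlarged the file, update its size
	 * and mark the inode dirty. */
	if (pos > inode->i_size) {
		i_size_write(inode, pos);
		mark_inode_dirty(inode);
	}
	return 0;
}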

Prepare_write and commit_write methods of block device files

Writing to a block device file is very similar to the corresponding operation for an ordinary file. In fact, the prepare_write method of the address_space object of the block device file is usually implemented by the following function:

int blkdev_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to)
{
	return block_prepare_write(page, from, to, blkdev_get_block);
}

As you can see, this function is just a wrapper function around the block_prepare_write() function discussed in the previous section. Of course, the only difference is the second parameter, which is a pointer to a function that must convert the file block number relative to the start of the file into a logical block number relative to the start of the block device. Recall that for block device files, these two numbers are identical (see the discussion of the blkdev_get_block() function in the "Reading Data from Files" section above). The commit_write method for block device files is implemented by the following simple wrapper function:

int blkdev_commit_write(struct file *file, struct page *page, unsigned from, unsigned to)
{
	return block_commit_write(page, from, to);
}

As you can see, the commit_write method for block devices essentially does the same thing as the commit_write method for regular files (we described the block_commit_write() function in the previous section). The only difference is that this method doesn't check whether the write operation expanded the file; you simply can't append characters to the end of a block device file to expand it.

Write dirty pages to disk

The write() system call modifies the contents of some pages in the page cache; if the pages are not in the page cache, they are allocated and added to it. In some cases (for instance, when a file is opened with the O_SYNC flag), the I/O data transfer starts immediately (see step 6 of the generic_file_write() function in the "Writing to files" section earlier in this chapter). Usually, however, the I/O data transfer is deferred, as described in the "Writing Dirty Pages to Disk" section of Chapter 15. When the kernel wants to effectively start the I/O data transfer, it invokes the writepages method of the file's address_space object, which searches the radix tree for dirty pages and flushes them to disk. For instance, the Ext2 file system implements the writepages method through the following function:

int ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
	return mpage_writepages(mapping, wbc, ext2_get_block);
}

As you can see, this function is a simple wrapper around the generic mpage_writepages() function; indeed, if a file system does not define the writepages method, the kernel invokes mpage_writepages() directly, passing NULL as the third parameter. The ext2_get_block() function was already mentioned in the "Read data from file" section earlier; it is the file-system-dependent function that translates file block numbers into logical block numbers. The writeback_control data structure is a descriptor that controls how the writeback operation is carried out; we described it in the "Searching for Dirty Pages to Flush" section of Chapter 15.

The mpage_writepages() function performs the following steps:

  1. If the request queue is write-congested, but the process does not wish to block, it returns without writing any pages to disk.
  2. Determines the first page to consider. If the writeback_control descriptor specifies an initial position within the file, the function translates it into a page index. Otherwise, if the writeback_control descriptor specifies that the process need not wait for the I/O data transfers to terminate, it sets the initial page index to the value stored in mapping->writeback_index (that is, scanning resumes from the page where the previous writeback operation left off). Finally, if the process must wait until the I/O data transfers terminate, scanning starts from the first page of the file.
  3. Invokes find_get_pages_tag() to look up the descriptors of the dirty pages in the page cache (see the "Radix Tree Tags" section in Chapter 15).
  4. For each page descriptor obtained in the previous step, perform the following steps:
    a. Call lock_page() to lock the page.
    b. Verify that the page is valid and in the page cache (because another kernel control path may have acted on the page between steps 3 and 4a).
    c. Checks the PG_writeback flag of the page; if set, the page is already being flushed to disk. If the process must wait until the I/O data transfer terminates, it invokes wait_on_page_bit() to block the current process until PG_writeback is cleared; when this function returns, any previously ongoing writeback operation has terminated. Otherwise, if the process need not wait, it checks the PG_dirty flag: if that flag is now clear, the ongoing writeback will take care of the page, so it unlocks the page and jumps back to step 4a to continue with the next page.
    d. If the parameter of get_block is NULL (the writepages method is not defined), it will call the mapping->writepage method of the file address_space object to refresh the page to the disk. Otherwise, if the parameter of get_block is not NULL, it calls the mpage_writepage() function. See step 8 for details.
  5. Call cond_resched() to check the TIF_NEED_RESCHED flag of the current process. If the flag is set, call the schedule() function.
  6. If the function has not scanned all the pages in the given range, or if the number of pages effectively written to disk is smaller than the value originally specified in the writeback_control descriptor, it jumps back to step 3.
  7. If the writeback_control descriptor is not given an initial position within the file, it assigns the index value of the last scanned page to the mapping->writeback_index field.
  8. If the mpage_writepage() function was called in step 4d and returned the address of a bio descriptor, calls mpage_bio_submit() (see below).

A typical file system such as Ext2 implements the writepage method as a wrapper for the general block_write_full_page() function, passing to it the address of the file system-dependent get_block function as a parameter. Like the block_read_full_page() function described in the "Read data from file" section earlier in this chapter, block_write_full_page() proceeds block by block: it allocates the buffer heads for the page (if they are not already in the buffer page) and calls submit_bh() on each of them, specifying a write operation.
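In the 2.6 sources, the Ext2 wrapper just described reads essentially as follows:

static int ext2_writepage(struct page *page, struct writeback_control *wbc)
{
	return block_write_full_page(page, ext2_get_block, wbc);
}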

As far as block device files are concerned, the writepage method is implemented by blkdev_writepage(), a wrapper for block_write_full_page(). Many non-journaling file systems rely on the mpage_writepage() function rather than on a custom writepage method. This improves performance, because mpage_writepage() gathers as many pages as possible into the same bio descriptor when performing I/O transfers; this, in turn, allows the block device drivers to exploit the DMA scatter-gather capabilities of modern hard disk controllers.

To make a long story short, the mpage_writepage() function checks the following: whether the page to be written contains blocks that are not adjacent on disk, whether the page contains a file hole, and whether some block in the page is not dirty or not up to date. If at least one of these conditions holds, the function falls back to the file system-dependent writepage method, as above. Otherwise, it adds the page as a segment of a bio descriptor. The address of the bio descriptor is passed to the function as a parameter; if it is NULL, mpage_writepage() initializes a new bio descriptor and returns its address to the calling function, which in turn passes that address back in future invocations of mpage_writepage(). In this way, several pages can be loaded into the same bio. If a page is not adjacent to the last page added to the bio, mpage_writepage() calls mpage_bio_submit() to start the I/O data transfer for that bio and allocates a new bio for the page.

The mpage_bio_submit() function sets the bi_end_io method of bio to the address of mpage_end_io_write(), and then calls submit_bio() to start the transmission (see the section "Submitting the Buffer Header to the General Block Layer" in Chapter 15). Once the data transfer ends successfully, the completion function mpage_end_io_write() wakes up the processes waiting for the page transfer to complete and clears the bio descriptor.

Memory mapping

As already introduced in the "Linear Areas" section of Chapter 9, a linear area can be associated with some portion of either an ordinary file in a disk file system or a block device file. This means that the kernel translates an access to a byte within a page of the linear area into an operation on the corresponding byte of the file. This technique is called memory mapping. There are two kinds of memory mapping:

Shared
	Any write operation on the pages of the linear area modifies the file on disk;
	moreover, if a process writes to a page of a shared memory mapping, the change is visible to all other processes that map the same file.
Private
	This kind of mapping is used when the process creates the mapping only to read the file, not to write it.
	For this purpose, private mappings are more efficient than shared ones.
	However, any write to a page of a private mapping causes the kernel to stop mapping that page to the file. Thus, a write neither changes the file on disk nor is it visible to other processes accessing the same file.
	Pages of a private memory mapping that have not yet been modified by the process are, however, updated when other processes update the file.

A process can issue a mmap() system call to create a new memory map (see the "Creating Memory Maps" section later in this chapter). The programmer must specify a MAP_SHARED flag or a MAP_PRIVATE flag as a parameter to this system call. As you can guess, in the former case, the mapping is shared, while in the latter case, the mapping is private. Once this mapping is created, the process can read data from the memory unit in this new linear area, which is equivalent to reading the data stored in the file. If this memory map is shared, then the process can modify the corresponding file by writing to the same memory unit. To undo or shrink a memory map, a process can use the munmap() system call (see the "Undoing Memory Maps" section below).
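To make the distinction concrete, here is a minimal user-space sketch; the file name is a placeholder, and the file is assumed to be writable and at least one page long:

/* Shared vs. private file mappings (user-space sketch). */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("example.txt", O_RDWR);
	if (fd < 0)
		return 1;

	/* MAP_SHARED: stores through this pointer eventually reach the
	   file and are visible to every process mapping it. */
	char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);

	/* MAP_PRIVATE: the first store triggers copy-on-write; neither
	   the file nor other processes ever see the modification. */
	char *priv = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE, fd, 0);

	if (shared == MAP_FAILED || priv == MAP_FAILED)
		return 1;

	shared[0] = 'S';	/* visible to the file and other mappers */
	priv[0] = 'P';		/* visible only to this process */

	munmap(shared, 4096);
	munmap(priv, 4096);
	close(fd);
	return 0;
}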

As a general rule, if a memory map is shared, the corresponding linear region has the VM_SHARED flag set; if a memory map is private, the corresponding linear region has the VM_SHARED flag cleared. As we will see later, there is a special case that does not comply with this rule for read-only shared memory mappings.

Memory mapping data structures

Memory mapping can be represented by a combination of the following data structures:

  1. The inode object associated with the mapped file
  2. The address_space object of the mapped file
  3. A file object for each different mapping of the file performed by different processes
  4. A vm_area_struct descriptor for each different mapping of the file
  5. A page descriptor for each page frame assigned to a linear area that maps the file

Figure 16-2 illustrates how these data structures are linked together. The left side of the figure shows the inode that identifies the file. The i_mapping field of each inode object points to the address_space object of the file. The page_tree field of each address_space object points to the base tree of the pages belonging to the address space (see the "Base Tree" section in Chapter 15), while the i_mmap field points to a second tree, the radix priority search tree (PST), made up of the linear areas of the address space. The main purpose of the PST is to perform "reverse mapping", that is, to quickly identify all the processes that share a given page. We will discuss PSTs in detail in the next chapter, because they are used for page frame reclaiming. The link between a file object and the inode of the same file is established through the f_mapping field.

[Figure 16-2: Data structures for file memory mapping]

Each linear area descriptor has a vm_file field that links it to the file object of the mapped file (if this field is NULL, the linear area is not used for a memory mapping). The position of the first mapped unit is stored in the vm_pgoff field of the linear area descriptor; it represents an offset in page-size units. The length of the mapped file portion is simply the size of the linear area, which can be computed from the vm_start and vm_end fields. Pages of shared memory mappings are always included in the page cache; pages of private memory mappings are included in the page cache as long as they have not been modified. When a process tries to modify a page of a private memory mapping, the kernel duplicates the page frame and replaces the original page frame with the duplicate in the process page table; this is one of the applications of the copy-on-write mechanism introduced in Chapter 8.

Although the original page frame is still in the page cache, it no longer belongs to the memory mapping, because the duplicate page frame has taken its place. In turn, the duplicate page frame is not inserted into the page cache, because the data it contains is no longer valid data of the file on disk. Figure 16-2 also shows the page descriptors of a few pages, included in the page cache, that refer to the memory-mapped file. Note that the first linear area in the figure is three pages long, yet only two page frames are allocated to it; presumably, the process owning the linear area has never accessed the third page.

The kernel offers several hooks that each file system can use to customize its memory mapping mechanism. The core of the memory mapping implementation is delegated to the mmap method of the file object. For most disk file systems and block device files, this method is implemented by a general function called generic_file_mmap(), described in the next section.

File memory mapping depends on the demand paging mechanism described in the "Demand Paging" section of Chapter 9. In fact, a newly created memory mapping is a linear area that contains no pages. When the process references an address within the linear area, a page fault occurs, and the page fault handler checks whether the nopage method of the linear area is defined. If nopage is not defined, the linear area does not map a file on disk; otherwise it does, and the method takes care of reading the page from the block device. Almost all disk file systems and block device files implement the nopage method through the filemap_nopage() function.

Creating memory maps

To create a new memory mapping, a process issues an mmap() system call, passing the following parameters to it:

  1. A file descriptor identifying the file to be mapped.
  2. An offset within the file specifying the first character of the file portion to be mapped.
  3. The length of the file portion to be mapped.
  4. A set of flags. The process must explicitly set either the MAP_SHARED flag or the MAP_PRIVATE flag to specify the kind of memory mapping requested (note 6).
  5. A set of permissions specifying one or more types of access to the linear area: read access (PROT_READ), write access (PROT_WRITE), or execute access (PROT_EXEC).
  6. An optional linear address, which the kernel takes as a hint of where the new linear area should start. If the MAP_FIXED flag is specified and the kernel cannot allocate the new linear area starting at the specified linear address, the system call fails.

The mmap() system call returns the linear address of the first location of the new linear area. For compatibility reasons, on the 80x86 architecture the kernel reserves two entries in the system call table for mmap(): one at index 90 and one at index 192. The former entry corresponds to the old_mmap() service routine (used by older C libraries), while the latter corresponds to the sys_mmap2() service routine (used by more recent C libraries). The two service routines differ only in how the sixth parameter of the system call is passed. Both end up invoking the do_mmap_pgoff() function (see the "Allocating Linear Address Range" section in Chapter 9). We now detail the steps performed when creating a linear area that maps a file, that is, the case in which the file parameter (the file object pointer) of do_mmap_pgoff() is non-NULL. For clarity, we refer to the enumeration of the steps of do_mmap_pgoff() and point out only the additional steps performed under the new condition.

  1. Check whether the mmap file operation is defined for the file to be mapped. If not, an error code is returned. A NULL mmap value in the file operation table indicates that the corresponding file cannot be mapped (for example, because it is a directory).
  2. If it is defined, the get_unmapped_area() function invokes the get_unmapped_area method of the file object to allocate an interval of linear addresses suitable for the memory mapping of the file. Disk file systems do not define this method; in that case, as described in the "Linear Area Processing" section of Chapter 9, get_unmapped_area() ends up invoking the get_unmapped_area method of the memory descriptor.
  3. In addition to the usual consistency checks, compares the kind of memory mapping requested (stored in the flags parameter of the mmap() system call) with the flags specified when the file was opened (stored in the file->f_mode field). In particular:
    a. If a shared writable memory mapping is requested, checks that the file was opened for writing and that it was not opened in append mode (O_APPEND flag of the open() system call).
    b. If a shared memory mapping is requested, checks that there is no mandatory lock on the file (see the "File Locking" section in Chapter 12).
    c. For any kind of memory mapping, checks that the file was opened for reading.
    If any of these conditions is not fulfilled, an error code is returned. In addition, when initializing the vm_flags field of the new linear area descriptor, the function sets the VM_READ, VM_WRITE, VM_EXEC, VM_SHARED, VM_MAYREAD, VM_MAYWRITE, VM_MAYEXEC, and VM_MAYSHARE flags according to the file access permissions and the kind of memory mapping requested (see the "Linear Area Access Rights" section in Chapter 9).

As an optimization, for non-writable shared memory mappings, the VM_SHARED and VM_MAYWRITE flags are cleared to 0. This is done because processes are not allowed to write into the pages of such a linear area, so the mapping can be treated exactly like a private one; the kernel, however, still allows other processes sharing the file to read the pages in this linear area.
  4. Initializes the vm_file field of the linear area descriptor with the address of the file object and increases the file's reference counter. Then invokes the mmap method on the file being mapped, passing the file object address and the linear area descriptor address as parameters. For most file systems, this method is implemented by the generic_file_mmap() function, which performs the steps below (a sketch of the function follows this list):
    a. Stores the current time in the i_atime field of the file's inode object and marks the inode as dirty.
    b. Initializes the vm_ops field of the linear area descriptor with the address of the generic_file_vm_ops table. All the methods in this table are null, except the nopage and populate methods; the nopage method is implemented by filemap_nopage(), while the populate method is implemented by filemap_populate() (see the "Non-linear memory mapping" section later in this chapter).
  5. Increases the i_writecount field of the file's inode object, that is, the reference counter of writing processes.
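The method itself is tiny; the following sketch is modeled on the 2.6-era mm/filemap.c, and details may differ between releases:

struct vm_operations_struct generic_file_vm_ops = {
	.nopage		= filemap_nopage,
	.populate	= filemap_populate,
};

int generic_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct address_space *mapping = file->f_mapping;

	/* A file with no readpage method cannot be mapped. */
	if (!mapping->a_ops->readpage)
		return -ENOEXEC;
	file_accessed(file);	/* updates i_atime, marks the inode dirty */
	vma->vm_ops = &generic_file_vm_ops;
	return 0;
}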

Undo memory mapping

When a process is ready to undo a memory mapping, it invokes munmap(); this system call can also be used to reduce the size of a memory mapping. It takes two parameters: the address of the first unit of the linear address range to be removed, and the length of that range.

The sys_munmap() service routine of this system call essentially invokes the do_munmap() function, already described in the "Release Linear Address Range" section of Chapter 9. Note that there is no need to flush to disk the pages of a writable shared memory mapping that is being undone: because these pages remain in the page cache, they continue to act as a disk cache.

Demand paging for memory mapping

For efficiency reasons, page frames are not assigned to a memory mapping right after it is created, but at the last possible moment, that is, when the process attempts to address one of its pages, thus causing a page fault exception. We saw in the "Page Fault Exception Handler" section of Chapter 9 how the kernel verifies whether the faulting address is contained in some linear area of the process; if so, the kernel checks the page table entry corresponding to the address and, if the entry is null, invokes the do_no_page() function (see the "Demand Paging" section in Chapter 9). The do_no_page() function performs the operations common to all types of demand paging, such as allocating a page frame and updating the page tables. It also checks whether a nopage method is defined for the linear area involved.

In the "Demand Paging" section of Chapter 9 we covered the case in which this method is undefined (anonymous linear area); now we discuss the main operations performed by do_no_page() when the nopage method is defined:

  1. Call the nopage method, which returns the address of the page frame containing the requested page.
  2. If the process is attempting to write to the page and the memory mapping is private, avoids a future "copy-on-write" fault by making a copy of the page just read and inserting it into the inactive list of pages (see Chapter 17). If the private memory mapping does not yet have a slave anonymous linear area containing the new page, the function must either add a new slave anonymous linear area or extend an existing one (see the "Linear Areas" section in Chapter 9). In the following steps, the function uses the new page instead of the page returned by the nopage method, so the latter cannot be modified by the user-mode process.
  3. If some other process deletes or invalidates the page (the truncate_count field of the address_space descriptor is used for this check), the function will jump back to step 1 and try to obtain the page again.
  4. Increases the rss field of the process memory descriptor to indicate that a new page frame has been assigned to the process.
  5. Use the address of the new page frame and the page access right contained in the vm_page_prot field of the linear area to set the page table entry corresponding to the address where the missing page is located.
  6. If the process attempts to write to the page, the Read/Write and Dirty bits of the page table entry are forced to 1. In this case, either the page frame is assigned exclusively to the process, or the page is shared; in both cases, writing to it should be allowed.

The core of the demand paging algorithm is the nopage method of the linear area. Generally speaking, this method must return the address of the page frame containing the page accessed by the process. Its implementation depends on the kind of linear area the page belongs to. When handling a linear area that maps a file on disk, the nopage method must first search for the requested page in the page cache; if the page is not found, the method must read it from disk. Most file systems implement the nopage method through the filemap_nopage() function, which receives three parameters:

area
	The descriptor address of the linear area containing the requested page
address
	The linear address of the requested page
type
	A pointer to a variable in which the function stores the type of page fault detected (VM_FAULT_MAJOR or VM_FAULT_MINOR)

The filemap_nopage() function performs the following steps:

  1. Get the file object address file from the area->vm_file field; get the address_space object address from file->f_mapping; get the index node object address from the host field of the address_space object.
  2. Uses the vm_start and vm_pgoff fields of area to determine the offset within the file of the data corresponding to the page starting at address.
  3. Check if the file offset is greater than the file size. If so, NULL is returned, which means that allocating the new page failed, unless the page fault was caused by the debugger tracing another process through the ptrace() system call, a special case we are not going to discuss.
  4. If the VM_RAND_READ flag of the linear area is set (see below), it assumes that the process reads the pages of the memory mapping in a random manner; it therefore disregards read-ahead and jumps to step 10.
  5. If the VM_SEQ_READ flag of the linear area is set (see below), it assumes that the process reads the pages of the memory mapping strictly sequentially; it therefore calls page_cache_readahead() to perform read-ahead starting from the faulting page (see the "File read-ahead" section earlier in this chapter).
  6. Calls find_get_page() to look in the page cache for the page identified by the address_space object and the file offset. If the page is found, it jumps to step 11.
  7. If the function has reached this point, the page has not been found in the page cache. It checks the VM_SEQ_READ flag of the linear area:
    a. If the flag is set, the read-ahead algorithm failed to load the requested page, so the function calls handle_ra_miss() to adjust the read-ahead parameters (see the "File read-ahead" section earlier in this chapter), then jumps to step 10.
    b. Otherwise, if the flag is not set, it increases the mmap_miss counter in the file_ra_state descriptor of the file by 1. If the number of misses is much larger than the number of hits (stored in the mmap_hit counter), it disregards read-ahead and jumps to step 10.
  8. If readahead is not permanently disabled (the ra_pages field of the file_ra_state descriptor is greater than 0), it will call do_page_cache_readahead() and read in a set of pages surrounding the requested page.
  9. Call find_get_page() to check whether the requested page is in the page cache. If so, jump to step 11.
  10. Calls page_cache_read(). This function checks whether the requested page is in the page cache and, if not, allocates a new page frame, adds it to the page cache, and executes the mapping->a_ops->readpage method to schedule an I/O operation that reads the page's contents from disk.
  11. Calls the grab_swap_token() function to assign, if possible, the swap token to the current process (see the "The Swap Token" section in Chapter 17).
  12. The requested page is now in the page cache; the function increases the mmap_hit counter of the file's file_ra_state descriptor by 1.
  13. If the page is not up to date (the PG_uptodate flag is not set), it calls lock_page() to lock the page, executes the mapping->a_ops->readpage method to trigger the I/O data transfer, and calls wait_on_page_bit() to sleep until the page is unlocked, that is, until the data transfer completes.
  14. Calls mark_page_accessed() to mark the requested page as accessed (see the next chapter).
  15. If an up-to-date version of the page was found in the page cache, sets *type to VM_FAULT_MINOR; otherwise, sets it to VM_FAULT_MAJOR.
  16. Returns the address of the requested page.

A user-mode process can tune the read-ahead behavior of the filemap_nopage() function through the madvise() system call. The MADV_RANDOM command sets the VM_RAND_READ flag of the linear area, specifying that the pages of the linear area will be accessed randomly; the MADV_SEQUENTIAL command sets the VM_SEQ_READ flag, specifying that the pages will be accessed strictly sequentially; finally, the MADV_NORMAL command clears both flags, specifying that the pages will be accessed in an unspecified order.
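For instance, a user-mode program scanning a large memory-mapped file could hint at its access pattern as in the following sketch (file name and sizes are placeholders):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	long sum = 0;
	int fd = open("data.bin", O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Sets VM_SEQ_READ: filemap_nopage() will read ahead aggressively. */
	madvise(p, st.st_size, MADV_SEQUENTIAL);
	for (off_t i = 0; i < st.st_size; i++)
		sum += p[i];

	/* Sets VM_RAND_READ: read-ahead is skipped for scattered accesses. */
	madvise(p, st.st_size, MADV_RANDOM);

	printf("%ld\n", sum);
	munmap(p, st.st_size);
	close(fd);
	return 0;
}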

Flush memory-mapped dirty pages to disk

A process can use the msync() system call to flush dirty pages belonging to the shared memory map to disk. The parameters received by this system call are: the starting address of a linear address range, the length of the range, and a set of flags with the following meanings.

MS_SYNC
	Asks the system call to suspend the process until the I/O operation completes. In this way, the calling process can assume that when the system call returns, all pages of the memory mapping have been flushed to disk.
MS_ASYNC (complement of MS_SYNC)
	Asks the system call to return immediately, without suspending the calling process.
MS_INVALIDATE
	Asks the system call to invalidate other memory mappings of the same file (not really implemented, because useless in Linux).
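A minimal user-space sketch of MS_SYNC at work (the file name is a placeholder, and the file is assumed to be at least one page long):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("journal.dat", O_RDWR);
	if (fd < 0)
		return 1;

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	memcpy(p, "record", 6);		/* dirties the page in the page cache */
	msync(p, 4096, MS_SYNC);	/* sleeps until the page reaches disk */

	munmap(p, 4096);
	close(fd);
	return 0;
}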

For each linear region contained in the linear address range, the sys_msync() service routine calls msync_interval(). The latter in turn does the following:

  1. If the vm_file field of the linear area descriptor is NULL, or if the VM_SHARED flag is cleared to 0, it returns 0 (this linear area is not a writable shared memory mapping of a file).
  2. Calls the filemap_sync() function, which scans the page table entries corresponding to the linear address range included in the linear area. For each page found, it clears the Dirty flag of the corresponding page table entry and calls flush_tlb_page() to flush the corresponding translation lookaside buffer (TLB) entries; it then sets the PG_dirty flag in the page descriptor to mark the page as dirty.
  3. If the MS_ASYNC flag is set, it returns. Hence, the real effect of the MS_ASYNC flag is to set the PG_dirty flags of the pages in the linear area; the system call does not actually start any I/O data transfer.
  4. If the function has reached this point, the MS_SYNC flag is set, so it must flush the pages of the memory mapping to disk, and the current process must sleep until all I/O data transfers complete. To this end, the function acquires the i_sem semaphore of the file's inode.
  5. Calls the filemap_fdatawrite() function, which receives the address of the file's address_space object. This function builds a writeback_control descriptor with the WB_SYNC_ALL synchronization mode and checks whether a writepages method is defined for the address space. If so, it invokes that method and returns; if not, it executes the mpage_writepages() function (see the "Writing Dirty Pages to Disk" section earlier in this chapter).
  6. Checks whether the file object's fsync method is defined, and if so, executes it. For ordinary files, this method limits itself to flushing the file's inode object to disk. However, for block device files, this method calls sync_blockdev(), which activates I/O data transfer for all dirty buffers of the device.
  7. Executes the filemap_fdatawait() function. As mentioned in the "Base Tree Tags" section of Chapter 15, the base tree in the page cache identifies all pages currently being written to disk by means of the PAGECACHE_TAG_WRITEBACK tag. The function quickly scans the portion of the base tree covering the given linear address range, looking for pages whose PG_writeback flag is set; for each such page it calls wait_on_page_bit() to sleep until the PG_writeback flag is cleared to 0, that is, until the ongoing I/O data transfer on the page terminates.
  8. Release the file's semaphore i_sem and return.

Non-linear memory mapping

For ordinary files, the Linux 2.6 kernel offers yet another access method: non-linear memory mapping. A non-linear memory mapping is basically the file memory mapping described above, except that its memory pages are not mapped to sequential pages of the file; instead, each memory page maps an arbitrary page of the file's data.
Of course, a user-mode application that repeatedly calls the mmap() system call on different 4096-byte portions of the same file can get the same result. However, this approach is very inefficient for nonlinear mapping of large files because each mapping requires a separate linear region.
To implement nonlinear mapping, the kernel uses additional data structures.
First, the VM_NONLINEAR flag of the linear area descriptor indicates that the linear area contains a non-linear mapping. All descriptors of non-linear mapping linear areas of a given file are collected in a doubly linked circular list rooted at the i_mmap_nonlinear field of the address_space object. To create a non-linear memory mapping, the user-mode process first creates a normal shared memory mapping with the mmap() system call; the application then remaps some of the pages in the memory mapping by calling remap_file_pages(). The sys_remap_file_pages() service routine of this system call takes the following parameters:

start
	A linear address within the shared file memory mapping of the calling process
size
	The size, in bytes, of the remapped portion of the file
prot
	Unused (must be 0)
pgoff
	The page index of the first page of the file portion to be mapped
flags
	Flags controlling the non-linear mapping

This service routine remaps the portion of the file data determined by linear address start, page index pgoff, and mapping size size. If the linear area is not shared or cannot accommodate all pages to be mapped, the system call fails and returns an error code. In fact, the service routine inserts the linear region into the file's i_mmap_nonlinear linked list and calls the populate method of the linear region. For all ordinary files, the populate method is implemented by the filemap_populate() function. It performs the following steps:

  1. Check whether the MAP_NONBLOCK flag in the flags parameter of the remap_file_pages() system call is cleared to 0. If so, call do_page_cache_readahead() to pre-read the pages of the file to be mapped.
  2. For each remapped page, perform the following steps:
    a. Check whether the page descriptor is already in the page cache. If not and MAP_NONBLOCK is not set, read the page from disk.
    b. If the page descriptor is in the page cache, it updates the page table entry corresponding to the linear address to point to the page frame, and updates the page reference counter of the linear region descriptor.
    c. Otherwise, if the page descriptor is not found in the page cache, it stores the offset of the file page in the high-order bits of the page table entry corresponding to the linear address, then clears the Present bit of the page table entry and sets the Dirty bit.

As mentioned in the "Demand Paging" section of Chapter 9, when handling a demand-paging fault, the handle_pte_fault() function checks the Present and Dirty bits of the page table entry. If their values correspond to a non-linear memory mapping, handle_pte_fault() calls the do_file_page() function, which retrieves the index of the requested file page from the high-order bits of the page table entry; do_file_page() then calls the populate method of the linear area to read the page from disk and update the page table entry itself. Because the pages of a non-linear memory mapping are kept in the page cache according to their index relative to the beginning of the file, rather than their index within the linear area, non-linear memory mappings are flushed to disk in the same way as linear memory mappings (see the "Flush memory-mapped dirty pages to disk" section earlier in this chapter).
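The two-step procedure looks like this from user space (a sketch; the file name is a placeholder, and the chosen page index 42 is arbitrary, assuming the file is large enough):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("big.db", O_RDWR);
	long pg = sysconf(_SC_PAGESIZE);

	if (fd < 0)
		return 1;

	/* Step 1: an ordinary shared memory mapping of four file pages. */
	char *p = mmap(NULL, 4 * pg, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Step 2: make the first window of the mapping show file page 42
	   instead of file page 0; the linear area itself does not move. */
	if (remap_file_pages(p, pg, 0 /* prot: must be 0 */, 42, 0) < 0)
		return 1;

	munmap(p, 4 * pg);
	close(fd);
	return 0;
}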

Direct I/O transfer

We have seen that in Linux 2.6 there is no essential difference between accessing an ordinary file through the file system, referencing blocks of the underlying block device file, or even establishing a file memory mapping. However, some very sophisticated programs ("self-caching" applications) prefer to have full control of the whole I/O data transfer mechanism. Consider, for example, high-performance database servers: most of them implement their own caching mechanisms that exploit the peculiar query mechanism of the database. For these kinds of programs, the kernel page cache is of no help; on the contrary, it can be harmful for the following reasons:

  1. Many page frames are wasted copying disk data that is already in RAM (in the user-level disk cache).
  2. The redundant instructions that handle the page cache and read-ahead reduce the execution efficiency of the read() and write() system calls, as well as of the paging operations related to file memory mappings.
  3. The read() and write() system calls do not transfer data directly between the disk and user memory, but in two steps: between the disk and a kernel buffer, and between the kernel buffer and user memory. Because block hardware devices must be handled through interrupts and direct memory access (DMA), and this can be done only in kernel mode, some kind of kernel support is ultimately required to implement self-caching applications.

Linux provides a simple way to bypass the page cache: direct I/O transfer.
On each direct I/O transfer, the kernel programs the disk controller to transfer data directly between the disk and pages belonging to the user-mode address space of the self-caching application. As we know, any data transfer proceeds asynchronously: while it is in progress, the kernel may switch to another process, the CPU may return to user mode, the pages of the process that started the transfer might be swapped out, and so on. This is harmless for ordinary I/O data transfers, because they involve pages of the disk cache: the disk cache is owned by the kernel, cannot be swapped out, and is visible in kernel mode to all processes. Direct I/O transfers, on the other hand, move data within pages belonging to the user-mode address space of a given process; the kernel must therefore ensure that these pages remain accessible in kernel mode by any process and that they are not swapped out while the data transfer is in progress. Let us see how this is achieved.
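From the point of view of a self-caching application, a direct read looks like the following sketch (the file name is a placeholder; O_DIRECT imposes alignment constraints on the user buffer, typically the block size, satisfied here with posix_memalign()):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int fd = open("raw.dat", O_RDONLY | O_DIRECT);

	if (fd < 0)
		return 1;

	/* The user-mode buffer must respect the device alignment rules. */
	if (posix_memalign(&buf, 4096, 65536))
		return 1;

	/* The disk controller DMAs straight into buf; the page cache is
	   bypassed entirely. */
	ssize_t n = read(fd, buf, 65536);

	free(buf);
	close(fd);
	return n < 0 ? 1 : 0;
}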

When a self-caching application wishes to access a file directly, it opens the file with the O_DIRECT flag set (see the "open() System Call" section in Chapter 12). While servicing the open() system call, the dentry_open() function checks whether a direct_IO method is implemented for the address_space object of the file being opened, and returns an error code if not. O_DIRECT can also be set for an already opened file by means of the F_SETFL command of the fcntl() system call. Let us consider first the case of a self-caching application issuing a read() system call on a file opened with the O_DIRECT flag set. As mentioned in the "Read data from file" section earlier in this chapter, the read method of the file is usually implemented by the generic_file_read() function, which initializes the iovec and kiocb descriptors and invokes __generic_file_aio_read(). The latter function verifies that the user-mode buffer described by the iovec descriptor is valid, then checks whether the O_DIRECT flag of the file is set. When invoked by read(), the relevant code fragment is essentially equivalent to the following:

if (filp->f_flags & O_DIRECT) {
	if (count == 0 || *ppos > filp->f_mapping->host->i_size)
		return 0;
	retval = generic_file_direct_IO(READ, iocb, iov, *ppos, 1);
	if (retval > 0)
		*ppos += retval;
	file_accessed(filp);
	return retval;
}

The function checks the current value of the file pointer, the file size, and the number of requested characters, then invokes generic_file_direct_IO(), passing to it the READ operation type, the iocb descriptor, the iovec descriptor array, the current value of the file pointer, and the number of user-mode buffers in the iov array (1). When generic_file_direct_IO() terminates, __generic_file_aio_read() updates the file pointer, sets the access timestamp in the file's inode, and returns. Things are similar when a write() system call is issued on a file opened with the O_DIRECT flag set. As mentioned in the "Writing to Files" section earlier in this chapter, the write method of such a file ends up invoking generic_file_aio_write_nolock(); this function checks whether the O_DIRECT flag is set and, if so, invokes generic_file_direct_IO(), this time specifying the WRITE operation type. The generic_file_direct_IO() function takes the following parameters:

rw
	Type of operation: READ or WRITE
iocb
	Pointer to a kiocb descriptor (see Table 16-1)
iov
	Pointer to an array of iovec descriptors (see the "Read data from file" section earlier in this chapter)
offset
	File offset
nr_segs
	Number of iovec descriptors in the iov array

The execution steps of the generic_file_direct_IO() function are as follows:

  1. Gets the file object address file from the ki_filp field of the kiocb descriptor, and the address_space object address mapping from the file->f_mapping field.
  2. If the operation type is WRITE and one or more processes have created memory mappings associated with some portion of the file, calls unmap_mapping_range() to unmap all pages of the file. If any unmapped page has a corresponding page table entry with the Dirty bit set, this function also ensures that the corresponding page in the page cache is marked dirty.
  3. If the base tree rooted at mapping is not empty (mapping->nrpages is greater than 0), calls the filemap_fdatawrite() and filemap_fdatawait() functions to flush all dirty pages to disk and wait for the I/O operations to complete (see the "Flush memory-mapped dirty pages to disk" section earlier in this chapter). (Even if the self-caching application accesses the file directly, other applications in the system may access it through the page cache; to avoid data loss, the disk image must be synchronized with the page cache before starting a direct I/O transfer.)
  4. Calls the direct_IO method of the mapping address space (see below).
  5. If the operation type is WRITE, calls invalidate_inode_pages2() to scan all the pages in the mapping base tree and release them. This function also clears any user-mode page table entries pointing to these pages.

In most cases, the direct_IO method is a wrapper for the __blockdev_direct_IO() function. The function is quite complex and makes use of a large number of auxiliary data structures and functions; however, it essentially performs the same kind of operations described elsewhere in this chapter: it splits the data to be read or written into suitable blocks, determines where the data lies on disk, and fills one or more bio descriptors describing the I/O operations to be performed. Of course, the data is read from or written to the user-mode buffers identified by the iovec descriptors in the iov array. The bio descriptors are submitted to the generic block layer through the submit_bio() function (see the "Submitting the Buffer Header to the General Block Layer" section in Chapter 15). Usually, the __blockdev_direct_IO() function does not return until all direct I/O transfers have completed; thus, once the read() or write() system call returns, the self-caching application can safely access the buffers containing the file data.

Asynchronous I/O

The POSIX 1003.1 standard defines a set of library functions for asynchronous file access, listed in Table 16-4. "Asynchronous" essentially means that when a user-mode process invokes one of these functions to read or write a file, the function terminates as soon as the read or write operation has been enqueued, possibly even before the actual I/O data transfer begins. The calling process can thus continue its execution while the data is being transferred.
[Table 16-4: The POSIX library functions for asynchronous file access]
Using asynchronous I/O is straightforward. The application opens the file through the usual open() system call, then fills a control block of type struct aiocb with the information describing the requested operation. The most commonly used fields of the struct aiocb control block are:

aio_fildes
	The file descriptor of the file (returned by the open() system call)
aio_buf
	The user-mode buffer for the file data
aio_nbytes
	The number of bytes to transfer
aio_offset
	The position in the file where the read or write operation starts (independent of the "synchronous" file pointer)

Finally, the application passes the address of the control block to either aio_read() or aio_write(); both functions terminate as soon as the requested I/O data transfer has been queued by the system library or by the kernel. The application can later invoke aio_error() to check the status of the outstanding I/O operation: it returns EINPROGRESS if the data transfer is still in progress, 0 if it completed successfully, or an error code if it failed. The aio_return() function returns the number of bytes effectively read or written by a completed asynchronous I/O operation, or -1 on failure.
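Putting the pieces together, a typical POSIX asynchronous read looks like the following sketch (the file name is a placeholder):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	struct aiocb cb;
	int fd = open("log.txt", O_RDONLY);

	if (fd < 0)
		return 1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;		/* descriptor returned by open() */
	cb.aio_buf = buf;		/* user-mode buffer */
	cb.aio_nbytes = sizeof(buf);	/* bytes to transfer */
	cb.aio_offset = 0;		/* independent of the file pointer */

	if (aio_read(&cb) < 0)		/* returns once the request queues */
		return 1;

	while (aio_error(&cb) == EINPROGRESS)
		usleep(1000);		/* free to do other work meanwhile */

	printf("read %zd bytes\n", aio_return(&cb));
	close(fd);
	return 0;
}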

Asynchronous I/O in Linux 2.6

Asynchronous I/O can be implemented by a system library without any kernel support at all. In fact, the aio_read() or aio_write() library function can clone the current process, have the child issue a synchronous read() or write() system call, and then terminate; the parent process returns from aio_read() or aio_write() and continues executing the program, so it does not wait for the synchronous operation started by the child to complete. However, this "poor man's version" of the POSIX functions is significantly slower than asynchronous I/O implemented at the kernel level.

The Linux 2.6 kernel implements asynchronous I/O by means of a set of system calls. However, as of Linux 2.6.11 this feature is a work in progress, and asynchronous I/O works properly only for files opened with the O_DIRECT flag set (see the previous section). Table 16-5 lists the asynchronous I/O system calls.
[Table 16-5: The Linux system calls for asynchronous I/O]

Asynchronous I/O environment

Before a user-mode process can issue the io_submit() system call to start an asynchronous I/O operation, it must create an asynchronous I/O environment. Basically, an asynchronous I/O environment (AIO environment for short) is a set of data structures used to track the progress of the asynchronous I/O operations requested by the process. Each AIO environment is associated with a kioctx object, which stores all the information relevant to the environment. An application may create several AIO environments; all kioctx descriptors of a given process are stored in a singly linked list rooted at the ioctx_list field of the memory descriptor (see Table 9-2 in Chapter 9).

We will not discuss the kioctx object in detail; however, one important data structure referenced by the kioctx object deserves attention: the AIO ring. The AIO ring is a memory buffer in the user-mode address space of the process that is also accessible by all processes in kernel mode. The ring_info.mmap_base and ring_info.mmap_size fields of the kioctx object store the user-mode starting address and the length of the AIO ring, respectively, while the ring_info.ring_pages field stores a pointer to an array of descriptors of the page frames that contain the AIO ring.

The AIO ring is actually a circular buffer in which the kernel writes the completion reports of outstanding asynchronous I/O operations. The first bytes of the AIO ring contain a header (a struct aio_ring data structure); all the following bytes hold io_event data structures, each representing a completed asynchronous I/O operation. Because the pages of the AIO ring are mapped into the user-mode address space of the process, the application can directly inspect the status of outstanding asynchronous I/O operations, avoiding the use of relatively slow system calls.

The io_setup() system call creates a new AIO environment for the calling process. It takes two parameters: the maximum number of outstanding asynchronous I/O operations (which determines the size of the AIO ring) and a pointer to a variable that will hold the environment's handle; this handle is also the base address of the AIO ring. The sys_io_setup() service routine essentially invokes do_mmap() to allocate a new anonymous linear area for the process that will hold the AIO ring (see the "Allocating Linear Address Range" section in Chapter 9), then creates and initializes the kioctx object describing the AIO environment. Conversely, the io_destroy() system call removes an AIO environment, together with the anonymous linear area containing the corresponding AIO ring; this system call blocks the current process until all outstanding asynchronous I/O operations have completed.
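Because the C library provides no wrappers for these system calls, applications usually invoke them through syscall(); a minimal sketch (the queue depth of 128 is arbitrary):

#include <linux/aio_abi.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	aio_context_t ctx = 0;	/* will receive the AIO environment handle */

	/* Up to 128 concurrent operations; this sizes the AIO ring. */
	if (syscall(SYS_io_setup, 128, &ctx) < 0) {
		perror("io_setup");
		return 1;
	}

	/* ... submit and reap asynchronous operations here ... */

	/* Blocks until every outstanding operation has completed. */
	syscall(SYS_io_destroy, ctx);
	return 0;
}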

Submitting an asynchronous I/O operation

To begin an asynchronous I/O operation, the application calls the io_submit() system call. This system call has three parameters:

ctx_id
	The handle returned by io_setup(), identifying the AIO environment
iocbpp
	The address of an array of pointers to descriptors of type iocb, each describing one asynchronous I/O operation
nr
	The length of the array pointed to by iocbpp

The iocb data structure has the same fields aio_fildes, aio_buf, aio_nbytes, aio_offset as the POSIX aiocb descriptor, and there is also an aio_lio_opcode field to store the type of requested operation (typically: read, write or sync). The sys_io_submit() service routine performs the following steps:

  1. Verify the validity of the iocb descriptor array.
  2. Search the kioctx object corresponding to the ctx_id handle in the linked list corresponding to the ioctx_list field of the memory descriptor.
  3. For each iocb descriptor in the array, perform the following sub-steps:
    a. Obtain the file object address corresponding to the file descriptor in the aio_fildes field.
    b. Allocate and initialize a new kiocb descriptor for this I/O operation.
    c. Check whether there is a free location in the AIO ring to store the completion of the operation.
    d. Set the ki_retry method of the kiocb descriptor according to the operation type (see below).
    e. Executes the aio_run_iocb() function, which in turn invokes the ki_retry method to start the data transfer for the corresponding asynchronous I/O operation. If the ki_retry method returns -EIOCBRETRY, the asynchronous I/O operation has been submitted but is not yet complete; aio_run_iocb() will be invoked again on this kiocb at a later time (see below). Otherwise, it calls aio_complete() to append a completion event for the asynchronous I/O operation to the ring of the AIO environment.

If the asynchronous I/O operation is a read request, the ki_retry method of the kiocb descriptor is implemented by aio_pread(). This function essentially executes the aio_read method of the file object, then updates the ki_buf and ki_left fields of the kiocb descriptor according to the value returned by aio_read (see Table 16-1 earlier in this chapter). Finally, aio_pread() returns the number of bytes effectively read from the file, or -EIOCBRETRY if the function determines that not all the requested bytes have been transferred. For most file systems, the aio_read method of the file object simply invokes the __generic_file_aio_read() function. Assuming the O_DIRECT flag of the file is set, this function ends up calling the generic_file_direct_IO() function described in the previous section. In this case, however, the __blockdev_direct_IO() function does not block the current process waiting for the I/O data transfer to complete; instead, it returns immediately. Because the asynchronous I/O operation is still outstanding, aio_run_iocb() will be invoked again, this time by the aio kernel thread of the aio_wq work queue. The kiocb descriptor keeps track of the progress of the I/O data transfer; eventually, when all the data has been transferred, the completion result is appended to the AIO ring.

Similarly, if the asynchronous I/O operation is a write request, the ki_retry method of the kiocb descriptor is implemented by aio_pwrite(). This function essentially executes the aio_write method of the file object, then updates the ki_buf and ki_left fields of the kiocb descriptor according to the value returned by aio_write (see Table 16-1 earlier in this chapter). Finally, aio_pwrite() returns the number of bytes effectively written to the file, or -EIOCBRETRY if the function determines that not all the requested bytes have been transferred. For most file systems, the aio_write method of the file object simply invokes the generic_file_aio_write_nolock() function; if the O_DIRECT flag of the file is set, this function calls generic_file_direct_IO(), as above.
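An end-to-end sketch of the kernel AIO interface, issued through raw system calls (the file name is a placeholder; as noted above, on Linux 2.6.11 the file must be opened with O_DIRECT):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	int fd = open("raw.dat", O_RDONLY | O_DIRECT);
	aio_context_t ctx = 0;
	struct iocb cb;
	struct iocb *list[1] = { &cb };
	struct io_event ev;
	void *buf;

	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;
	if (syscall(SYS_io_setup, 8, &ctx) < 0)
		return 1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_lio_opcode = IOCB_CMD_PREAD;	/* an asynchronous read */
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = 4096;
	cb.aio_offset = 0;

	/* Queue the operation and return immediately. */
	if (syscall(SYS_io_submit, ctx, 1, list) < 0)
		return 1;

	/* Later: reap one completion event from the AIO ring. */
	syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL);

	syscall(SYS_io_destroy, ctx);
	free(buf);
	close(fd);
	return 0;
}

The io_getevents() call harvests the completion report that, as described above, aio_complete() appends to the AIO ring.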
