[Repost] The file read and write process as seen from the kernel file system

1. System call

The main job of the operating system is to manage hardware resources and provide a good environment for application developers. Because the hardware resources of a computer system are limited, and to ensure that every process can execute safely, the processor has two modes: "user mode" and "kernel mode". Operations that are prone to security problems, such as I/O operations or modifying the contents of base address registers, can only be performed in kernel mode.

The interface connecting user mode and kernel mode is called a system call.

Application code runs in user mode. When an application needs an operation that is only permitted in kernel mode, it issues a system call request to the operating system. The processor then switches to kernel mode and the kernel carries out the requested system call on the application's behalf.

Once the system call has been handled, the operating system returns the processor to user mode and execution of the user code continues.

The virtual address space of a process can be divided into two parts: kernel space and user space.

Kernel code and data are stored in kernel space, and the code and data of the user program are stored in the user space of the process.

Whether kernel space or user space, both are part of the virtual address space and are mapped to physical addresses.

Operating on files from an application is a typical example of a system call.
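As a small illustration of file operations as system calls, here is a minimal sketch in Python, whose os module functions are thin wrappers around the system calls of the same names. The scratch file name is hypothetical and only used for this example.

```python
import os
import tempfile

# Each os.* call below wraps the system call of the same name, so every one
# of them crosses from user mode into kernel mode and back again.
path = os.path.join(tempfile.mkdtemp(), "hello.txt")  # hypothetical scratch file

fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)  # open(2): returns a file descriptor
written = os.write(fd, b"hello, kernel\n")          # write(2): returns the bytes written
os.lseek(fd, 0, os.SEEK_SET)                        # lseek(2): rewind the file offset
data = os.read(fd, 100)                             # read(2): returns at most 100 bytes
os.close(fd)                                        # close(2): release the descriptor
os.remove(path)

print(written, data)
```

The user program never touches the disk itself; it only hands requests to the kernel through these calls.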

2. Virtual file system

An operating system can support a variety of underlying file systems (such as NTFS, FAT, ext3 and ext4). To give the kernel and user processes a unified view of the file system, Linux adds an abstraction layer between user processes and the underlying file systems: the virtual file system (Virtual File System, VFS). All file operations of a process go through the VFS, and the VFS adapts to the various underlying file systems to carry out the actual file operations.

Put simply, VFS defines a common interface layer and an adaptation layer for file systems.

  • On the one hand, it provides a unified set of methods for user processes to access files, directories and other objects.
  • On the other hand, it adapts to the different underlying file systems, as the figure shows:
    [Figure omitted]
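The "unified set of methods" can be observed from user space. The sketch below assumes a Linux-like system, where a pipe is also reached through the VFS even though no on-disk file system backs it. The helper read_all is a name made up for this example.

```python
import os
import tempfile

def read_all(fd: int) -> bytes:
    """Drain a descriptor with read(2) until end-of-file."""
    chunks = []
    while True:
        chunk = os.read(fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)

# A regular file on whatever file system backs the temp directory.
file_fd, path = tempfile.mkstemp()
os.write(file_fd, b"same interface")
os.lseek(file_fd, 0, os.SEEK_SET)

# A pipe, which has no on-disk file system behind it at all.
pipe_r, pipe_w = os.pipe()
os.write(pipe_w, b"same interface")
os.close(pipe_w)  # closing the write end lets the reader see end-of-file

from_file = read_all(file_fd)
from_pipe = read_all(pipe_r)

os.close(file_fd)
os.close(pipe_r)
os.remove(path)
```

The caller uses exactly the same read() call in both cases; which implementation actually runs is decided inside the kernel.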

1- Main modules of the virtual file system

  • 1. The super block (super_block) stores the metadata of a whole file system. It acts as the file system's information store and supplies information to the other modules, so one super block represents one file system. Changes to file-system-level metadata are recorded in the super block. Super block objects stay resident in memory and are cached.

  • 2. The directory entry module manages the directory entries of a path. For example, for the path /home/foo/hello.txt the directory entries are home, foo and hello.txt. The block of a directory entry stores the inode numbers, file names and similar information of all files in that directory. Internally the entries form a tree. To retrieve a file, the operating system starts at the root directory and resolves each directory in the path level by level until the file is located.

  • 3. The inode module manages one specific file and is the unique identifier of that file: one file corresponds to one inode. Through the inode, the location of the file in the disk sectors can be found easily. The inode is also linked to the address_space module, which makes it easy to find out whether the file's data has already been cached.

  • 4. The open file list module contains all the files that the kernel has opened. An open file object is created in the kernel by the open system call and is also called a file handle. The module holds a list in which every item is a struct file; the fields of this structure represent the state of one opened file.

  • 5. The file_operations module. This module maintains a data structure that is a collection of function pointers, one for each operation a file supports, such as open, read, write and mmap. Every open file (an entry in the open file list) is connected to a file_operations structure, so any opened file can be operated on by calling these functions.

  • 6. The address_space module represents the physical pages of a file that have been cached in the page cache. It is the bridge between the page cache and the file system on the external device: if the file system is viewed as a data source, address_space is what links the memory system to that file system. The discussion continues later in the article.
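The level-by-level path resolution described for the directory entry module can be sketched as a toy model. This is not kernel code: the Dentry and Inode classes, the lookup function and the inode numbers are all assumptions made up for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Inode:
    ino: int  # inode number (hypothetical values below)

@dataclass
class Dentry:
    name: str
    inode: Inode
    children: Dict[str, "Dentry"] = field(default_factory=dict)

    def add(self, name: str, ino: int) -> "Dentry":
        child = Dentry(name, Inode(ino))
        self.children[name] = child
        return child

def lookup(root: Dentry, path: str) -> Optional[Inode]:
    """Resolve an absolute path one component at a time, starting at the root."""
    current = root
    for component in [part for part in path.split("/") if part]:
        nxt = current.children.get(component)
        if nxt is None:
            return None  # a component is missing, so the path does not exist
        current = nxt
    return current.inode

root = Dentry("/", Inode(2))
home = root.add("home", 11)
foo = home.add("foo", 12)
foo.add("hello.txt", 13)

found = lookup(root, "/home/foo/hello.txt")
missing = lookup(root, "/home/bar/hello.txt")
```

Each loop iteration corresponds to analysing one directory in the path; the walk ends at the inode of the last component.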

The interactions and logical relationships between the modules are shown in the figure below:

[Figure omitted]
It can be seen from the figure:

  • 1. Each module maintains an X_op pointer to its corresponding operations object X_operations.

  • 2. The super block maintains an s_files pointer to the "open file list module", that is, the linked list of all files opened in the kernel. This list is shared by all processes.

  • 3. Both the directory entry module and the inode module maintain an X_sb pointer to the super block, through which the metadata of the entire file system can be obtained.

  • 4. Directory entry objects and inode objects maintain pointers to each other, so each can find the other's data.

  • 5. Each file structure instance on the open file list maintains an f_dentry pointer to its directory entry, through which the corresponding inode can be found.

  • 6. Each file structure instance on the open file list maintains an f_op pointer to file_operations, the collection of all functions that can operate on this file.

  • 7. Besides pointers to the other modules, the inode importantly also points to the address_space module, from which it obtains the cache state of its own file in memory.

  • 8. address_space internally maintains a tree structure that points to all of its physical page structures (struct page), and maintains a host pointer back to the inode to obtain the file's metadata.
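The pointer relationships listed above can be mirrored in a small object graph. This is a toy model under stated assumptions: f_dentry, f_op and host are taken from the text, while d_inode, d_sb, i_sb, i_mapping and f_pos are field names assumed here to follow the same X_ naming pattern. A plain dict stands in for both the page tree and the function table.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# eq=False keeps identity semantics, which suits objects that point at each other.
@dataclass(eq=False)
class SuperBlock:
    fs_type: str

@dataclass(eq=False)
class AddressSpace:
    host: Optional["Inode"] = None                         # back-pointer to the owning inode
    pages: Dict[int, bytes] = field(default_factory=dict)  # page index -> cached page data

@dataclass(eq=False)
class Inode:
    ino: int
    i_sb: SuperBlock
    i_mapping: AddressSpace

@dataclass(eq=False)
class Dentry:
    name: str
    d_sb: SuperBlock
    d_inode: Inode

@dataclass(eq=False)
class File:
    f_dentry: Dentry
    f_op: Dict[str, object]  # stand-in for the file_operations function table
    f_pos: int = 0

sb = SuperBlock("ext4")
mapping = AddressSpace()
inode = Inode(ino=13, i_sb=sb, i_mapping=mapping)
mapping.host = inode
dentry = Dentry("hello.txt", d_sb=sb, d_inode=inode)
f = File(f_dentry=dentry, f_op={"read": None, "write": None})

# Follow the chain: file -> directory entry -> inode -> address_space -> inode.
reached_inode = f.f_dentry.d_inode
reached_mapping = reached_inode.i_mapping
```

Starting from an open file, every other object is reachable by following pointers, which is what lets a read or write find the cached pages of the file.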

2- Process and virtual file system interaction

  • 1. The kernel uses task_struct as the descriptor of a single process; it contains all the information needed to maintain that process. task_struct holds a pointer named files (not to be confused with the entries on the "open file list") that points to a files_struct structure, which contains the file descriptor table and information about the process's open file objects.

  • 2. The file descriptor table in files_struct is actually an array of pointers to struct file (the same type as the entries on the "open file list"). It can grow dynamically, and each pointer refers to one of the opened files in the open file list module of the virtual file system.

  • 3. On the one hand, the file structure links through f_dentry to the directory entry module and the inode module, giving access to all information about the file. On the other hand, it links to the file_operations submodule, which contains all the functions that can be called on the file, so that the file operation can finally be carried out. In this way, a process goes from its file descriptor table to the corresponding file structure on the open file list, and from there invokes the available functions to perform operations on the file.

[Figure omitted]
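The idea of a per-process descriptor table that grows on demand can be sketched as follows. This is a toy model, not the kernel's files_struct: the class names, the doubling policy and the initial size are assumptions chosen for the example. Reusing the lowest free slot matches what POSIX requires of open().

```python
from typing import List, Optional

class OpenFile:
    """Stand-in for struct file: one entry on the open file list."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.pos = 0

class FilesStruct:
    """Toy per-process descriptor table: the list index is the file descriptor."""
    def __init__(self, initial_size: int = 4) -> None:
        self.fd_array: List[Optional[OpenFile]] = [None] * initial_size

    def install(self, file: OpenFile) -> int:
        # Reuse the lowest free slot first.
        for fd, slot in enumerate(self.fd_array):
            if slot is None:
                self.fd_array[fd] = file
                return fd
        # Table is full: double it (assumed policy), then use the first new slot.
        fd = len(self.fd_array)
        self.fd_array.extend([None] * max(1, len(self.fd_array)))
        self.fd_array[fd] = file
        return fd

    def lookup(self, fd: int) -> OpenFile:
        file = self.fd_array[fd] if 0 <= fd < len(self.fd_array) else None
        if file is None:
            raise OSError("bad file descriptor")
        return file

    def close(self, fd: int) -> None:
        self.lookup(fd)  # raises if fd is not open
        self.fd_array[fd] = None

table = FilesStruct(initial_size=2)
fds = [table.install(OpenFile(f"file{i}")) for i in range(3)]  # third install forces growth
table.close(fds[0])
reused = table.install(OpenFile("again"))
```

A file descriptor is therefore only an index; the state of the open file lives in the object the slot points to.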

3- Process vs File List vs Inode

  • 1. Multiple processes can point to the same open file object (file list entry) at the same time, for example when a parent process and a child process share a file object;

  • 2. A process can open the same file several times and get a different file descriptor each time, each pointing to a different file list entry. Because it is the same file, however, the inode is unique, so all these file list entries point to the same inode. This is how file sharing (sharing the same disk file) is achieved.
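Both cases can be observed from user space. The sketch below uses only standard os calls on a temporary file. dup() is used here as a stand-in for the sharing that happens between parent and child, since both give two descriptors for one open file object.

```python
import os
import tempfile

tmp_fd, path = tempfile.mkstemp()
os.write(tmp_fd, b"0123456789")
os.close(tmp_fd)

# Two independent open() calls: two open file objects, one inode.
fd_a = os.open(path, os.O_RDONLY)
fd_b = os.open(path, os.O_RDONLY)
same_inode = os.fstat(fd_a).st_ino == os.fstat(fd_b).st_ino

first_a = os.read(fd_a, 4)  # advances only fd_a's offset
first_b = os.read(fd_b, 4)  # fd_b still starts at offset 0

# dup() creates a second descriptor for the SAME open file object,
# so the file offset is shared between fd_a and fd_c.
fd_c = os.dup(fd_a)
next_c = os.read(fd_c, 4)   # continues where fd_a stopped
next_a = os.read(fd_a, 2)   # fd_a sees the offset that fd_c moved

for fd in (fd_a, fd_b, fd_c):
    os.close(fd)
os.remove(path)
```

Separate opens have separate offsets because the offset is stored in the file list entry, while the inode, and with it the file content, is shared.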

3. I/O buffer

1- Concept

The idea is similar to that of a cache. During I/O, reading from disk is much slower than reading from memory, so to speed up data processing the data that has been read is kept in memory. The data cached in memory this way is the buffer cache, referred to below simply as the "buffer".

More specifically, a buffer is an area that stores data being transferred between devices that run at different speeds or have different priorities. On the one hand, a buffer reduces the time processes spend waiting for each other, so that reading data from a slow device does not interrupt the work of a fast device. On the other hand, it can protect the hard disk or reduce the number of network transfers.

2- Buffer and Cache

Buffer and cache are two different concepts. A cache is a high-speed cache that sits between the CPU and memory; a buffer is an I/O buffer that sits between memory and the hard disk. Put simply, a cache accelerates "reading" and a buffer absorbs "writing": the former keeps data that has been read from disk, while the latter holds data that is about to be written to disk.

3- Buffer Cache and Page Cache

Both the buffer cache and the page cache exist to speed up access when devices and memory exchange data. The buffer cache can be called a block buffer and the page cache a page buffer. Before Linux supported virtual memory there was no concept of pages, so buffers were allocated per device in units of blocks. Once Linux managed memory through virtual memory, the page became the smallest unit of memory management and the page buffer mechanism was used for caching. From Linux 2.6 onward the kernel integrates the two caches, and pages and blocks can be mapped to each other. The page cache is oriented toward virtual memory, while the block I/O buffer cache is oriented toward block devices. It should be emphasized that, to a process, the page cache and the block cache together look like one storage system; the process does not need to care about reads and writes on the underlying devices.

The biggest difference between the buffer cache and the page cache is the granularity of caching. The buffer cache works on file system blocks, while the kernel's memory management component uses a higher-level abstraction than the block, the page, which is more efficient to handle. Cache components that interact with memory management therefore use the page cache.

4. Page Cache

The page cache is oriented toward both files and memory. Put simply, it is a buffer that sits between memory and files: file I/O operations actually interact only with the page cache, not directly with the underlying storage. The page cache can be used wherever data is handled in units of files, for example in network file systems. It maps a file down to the page level through a series of data structures such as inode, address_space and struct page:

  • 1. A struct page describes one physical memory page, and this page frame can be mapped to a specific position in a file through page + offset. struct page also has the following important fields:

    • (1) flags records whether the page is dirty, whether it is being written back, and so on;

    • (2) mapping points to the address space address_space, indicating that this page belongs to the page cache and to the address space of a particular file;

    • (3) index records the page offset of this page within the file;

  • 2. The file system inode maintains the block numbers of all blocks of the file. Dividing the file offset by the block size (the remainder gives the position inside the block) quickly yields the file system block number, and from it the disk sector number, where the offset lies. In the same way, dividing the file offset by the page size gives the page offset where the offset lies.

  • 3. The page cache abstracts the address space address_space as the bridge between the file system and the page cache. Through its pointers, address_space can easily reach both the file's inode and its struct page entries, so the position of a file offset can be located in every component, following the chain: file byte offset --> page offset --> file system block number --> disk sector number

  • 4. The page cache uses a radix tree to organize the contents of a file and stores them in physical memory as struct page entries. One file inode corresponds to one address space address_space, and one address_space corresponds to one page cache radix tree. The relationship between them is as follows:
    [Figure omitted]
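The chain from file byte offset to page offset, block number and sector number is plain integer arithmetic. The following worked example assumes a 4096-byte page, a 1024-byte file system block and a 512-byte sector. The block map stands in for the block numbers kept by the inode, and its values are invented for the illustration.

```python
PAGE_SIZE = 4096    # assumed page size in bytes
BLOCK_SIZE = 1024   # assumed file system block size in bytes
SECTOR_SIZE = 512   # assumed disk sector size in bytes

def locate(offset: int) -> dict:
    """Map a byte offset in a file to page and block coordinates."""
    page_index, offset_in_page = divmod(offset, PAGE_SIZE)
    block_index, offset_in_block = divmod(offset, BLOCK_SIZE)
    return {
        "page_index": page_index,
        "offset_in_page": offset_in_page,
        "file_block": block_index,  # logical block number within the file
        "offset_in_block": offset_in_block,
    }

def sectors_of(disk_block: int) -> range:
    """Sectors covered by one on-disk block number."""
    per_block = BLOCK_SIZE // SECTOR_SIZE
    return range(disk_block * per_block, (disk_block + 1) * per_block)

# Hypothetical inode block map: logical file block -> on-disk block number.
block_map = {0: 880, 1: 881, 2: 1203, 3: 1204, 4: 97, 5: 98}

where = locate(5000)
disk_block = block_map[where["file_block"]]
sectors = list(sectors_of(disk_block))
```

With these sizes, byte 5000 falls in page 1 at offset 904 and in logical block 4, which the made-up map places in disk block 97, covering sectors 194 and 195.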

5. Address Space

Here we summarize the functions of address_space discussed so far. address_space is a key abstraction in the Linux kernel. It serves as the intermediate adapter between the file system and the page cache and tracks the physical pages of a file that have been cached in the page cache. It is therefore the bridge between the page cache and the file system on the external device: if the file system is viewed as a data source, address_space is what links the memory system to that file system.

As the figure shows, address_space is linked to both the page cache radix tree and the inode, so it can easily reach the file's inode and pages through pointers. How, then, does the page cache act as a buffer through address_space? Let's look at the complete file read and write process.

6. Basic process of reading and writing files

1- Read a file

  • 1. The process calls a library function to issue a file read request to the kernel;

  • 2. The kernel uses the process's file descriptor to locate the entry on the open file list of the virtual file system;

  • 3. The read() function available for the file is called. Through the file list entry, read() reaches the directory entry module and from the directory entry finds the file's inode;

  • 4. In the inode, the page to read is calculated from the file offset;

  • 5. The address_space of the file is found through the inode;

  • 6. The page cache tree of the file is searched in address_space for the corresponding page cache node:

    • (1) If the page cache hits, the file content is returned directly;

    • (2) If the page cache misses, a new page cache page is created (with memory-mapped access the miss surfaces as a page fault), the disk address of that page of the file is found through the inode, and the page is read from disk to fill the cache page. Step 6 is then repeated and finds the page in the cache;

  • 7. The file content has been read successfully.
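The hit and miss handling in the read steps above can be modelled in a few lines. This is a toy sketch, not the kernel's implementation: a dict stands in for the page cache tree, a bytes object stands in for the disk, and the page size is shrunk to 4 bytes so the effect is easy to follow. All names are made up for the example.

```python
PAGE_SIZE = 4  # tiny page size so the example stays readable

class AddressSpace:
    """Toy address_space: a dict stands in for the per-file page tree."""
    def __init__(self, disk: bytes) -> None:
        self.disk = disk      # stand-in for the file's blocks on disk
        self.pages = {}       # page index -> bytearray
        self.disk_reads = 0   # how many times the "disk" had to be read

    def get_page(self, index: int) -> bytearray:
        page = self.pages.get(index)
        if page is None:      # miss: fill a new cache page from disk
            start = index * PAGE_SIZE
            page = bytearray(self.disk[start:start + PAGE_SIZE])
            self.pages[index] = page
            self.disk_reads += 1
        return page           # hit, or the freshly filled page

def read(mapping: AddressSpace, pos: int, count: int) -> bytes:
    """Copy count bytes starting at pos, one cached page at a time."""
    end = min(pos + count, len(mapping.disk))
    out = bytearray()
    while pos < end:
        index, offset = divmod(pos, PAGE_SIZE)
        page = mapping.get_page(index)
        take = min(PAGE_SIZE - offset, end - pos)
        out += page[offset:offset + take]
        pos += take
    return bytes(out)

mapping = AddressSpace(b"abcdefghijkl")
first = read(mapping, 2, 6)    # touches pages 0 and 1: two misses
reads_after_first = mapping.disk_reads
second = read(mapping, 4, 4)   # page 1 only: served from the cache
reads_after_second = mapping.disk_reads
```

The second read returns data without another trip to the simulated disk, which is the whole point of keeping the pages in the cache.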

2- Write a file

The first 5 steps are the same as for reading a file. Then address_space is queried to see whether the page cache holds the corresponding page:

  • 6. If the page cache hits, the file content is modified directly in the cached page and the write is finished. At this point the modification exists only in the page cache and has not yet been written back to the file on disk.

  • 7. If the page cache misses, a page cache page is created, the disk address of that page of the file is found through the inode, and the page is read from disk to fill the cache page. The cache now hits, and processing continues with step 6.

  • 8. A page in the page cache that has been modified is marked as a dirty page. Dirty pages must be written back to the file blocks on disk. There are two ways this happens:

    • (1) Calling the sync() or fsync() system call explicitly to write back dirty pages;

    • (2) The pdflush kernel thread periodically writes dirty pages back to disk (newer kernels use per-device flusher threads for this job).

Note also that dirty pages cannot be evicted from memory. While a dirty page is being written back, its write-back flag is set and the page is locked; other write requests block until the lock is released.
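The write path and the dirty page bookkeeping can be sketched in the same toy style. Again this is only a model under stated assumptions: a bytearray plays the disk, a dict plays the page cache, the page size is 4 bytes, and the sync method here is a made-up stand-in for what sync()/fsync() or the periodic flusher would trigger. Locking during write-back is left out.

```python
PAGE_SIZE = 4  # tiny page size so the example stays readable

class CachedFile:
    """Toy write-back cache: writes touch cached pages, sync() flushes them."""
    def __init__(self, disk: bytearray) -> None:
        self.disk = disk      # stand-in for the on-disk file contents
        self.pages = {}       # page index -> bytearray
        self.dirty = set()    # indices of pages modified since the last flush

    def _get_page(self, index: int) -> bytearray:
        if index not in self.pages:  # miss: read the page in first
            start = index * PAGE_SIZE
            self.pages[index] = bytearray(self.disk[start:start + PAGE_SIZE])
        return self.pages[index]

    def write(self, pos: int, data: bytes) -> None:
        # Assumes the write stays within the existing file size.
        for i, byte in enumerate(data):
            index, offset = divmod(pos + i, PAGE_SIZE)
            self._get_page(index)[offset] = byte
            self.dirty.add(index)    # the page now differs from disk

    def sync(self) -> int:
        """Write every dirty page back and return how many were flushed."""
        flushed = 0
        for index in sorted(self.dirty):
            start = index * PAGE_SIZE
            self.disk[start:start + PAGE_SIZE] = self.pages[index]
            flushed += 1
        self.dirty.clear()
        return flushed

disk = bytearray(b"aaaabbbbcccc")
f = CachedFile(disk)
f.write(3, b"XY")            # spans pages 0 and 1
before_sync = bytes(disk)    # disk is unchanged until write-back
flushed = f.sync()
after_sync = bytes(disk)
```

In a real program the corresponding explicit step is a call such as os.fsync(fd) after writing; until then the new data may exist only in the page cache.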


Origin blog.csdn.net/weixin_45264425/article/details/130331069