Reading the Linux kernel source: sys_write()

This article records my notes from reading the source code of sys_write(), organized around my own line of thought.

sys_write() is one of the core functions of the Linux file system. Its job is to write the contents of a user buffer to the corresponding position in a file on disk.

1. File page cache

To understand how Linux reads and writes files, you first need to understand how Linux is designed to handle file I/O, specifically how file data is organized in memory. Let's start with a figure (from Linux Kernel Scenario Analysis).


We can first consider what conditions have to be met in order to write file content. This is a good habit: when reading a function in the source code, don't rush to read it line by line. First ask yourself what you would have to do if you were designing the function yourself. The code is much easier to understand that way.

2. Disk data blocks in memory

First of all, we know that a disk is a block device, and that reads and writes to the disk are performed in blocks. Put simply, a block is the smallest unit in which file content is stored. Taking the Ext2 file system as an example, a block can be 1K bytes (the size assumed throughout this article; Ext2 also supports 2K and 4K blocks). At the device level, we call this the record block of the file on the disk (record block for short). It is then natural to expect that file data is also organized by block size in memory, because such a design makes it convenient for the underlying driver to read and write the block device.
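To make the block arithmetic concrete, here is a small user-space sketch (not kernel code). It assumes the 1K block size used above; `blocks_touched` and `struct block_range` are names I made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BLOCK_SIZE 1024UL  /* assumed 1K record block, as in the text */

/* Hypothetical type: the range of logical file blocks covered by one write. */
struct block_range {
    unsigned long first;  /* block holding the first byte written */
    unsigned long last;   /* block holding the last byte written */
};

/* Which blocks does a write of `count` bytes at byte offset `pos` touch? */
static struct block_range blocks_touched(unsigned long pos, size_t count)
{
    struct block_range r;

    assert(count > 0);  /* an empty write touches no block */
    r.first = pos / DEMO_BLOCK_SIZE;
    r.last = (pos + count - 1) / DEMO_BLOCK_SIZE;
    return r;
}
```

For example, a 100-byte write at offset 1000 crosses the 1024-byte boundary, so it touches blocks 0 and 1, and both blocks need a buffer in memory.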

Therefore, however file data blocks are stored in memory, there must be a data structure that corresponds to one block of file data on disk. Its fields must locate the in-memory copy of the record block (b_data points to the data, b_page to the page that holds it), and another field, b_blocknr, must record the corresponding disk block number. This is the most basic information.

That is indeed the case. In Linux this data structure is called buffer_head; it describes the in-memory buffer for a record block of a file on disk.

struct buffer_head {
	/* First cache line: */
	struct buffer_head *b_next;	/* Hash queue list */
	unsigned long b_blocknr;	/* block number */
	unsigned short b_size;		/* block size */
	unsigned short b_list;		/* List that this buffer appears */
	kdev_t b_dev;			/* device (B_FREE = free) */

	atomic_t b_count;		/* users using this block */
	kdev_t b_rdev;			/* Real device */
	unsigned long b_state;		/* buffer state bitmap (see above) */
	unsigned long b_flushtime;	/* Time when (dirty) buffer should be written */

	struct buffer_head *b_next_free;/* lru/free list linkage */
	struct buffer_head *b_prev_free;/* doubly linked list of buffers */
	struct buffer_head *b_this_page;/* circular list of buffers in one page */
	struct buffer_head *b_reqnext;	/* request queue */

	struct buffer_head **b_pprev;	/* doubly linked list of hash-queue */
	char * b_data;			/* pointer to data block (512 byte) */
	struct page *b_page;		/* the page this bh is mapped to */
	void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
 	void *b_private;		/* reserved for b_end_io */

	unsigned long b_rsector;	/* Real buffer location on disk */
	wait_queue_head_t b_wait;

	struct inode *	     b_inode;
	struct list_head     b_inode_buffers;	/* doubly linked list of inode dirty buffers */
};

The buffer_head data structure is fairly complicated. To stay focused on the main line of sys_write(), it is easier to ignore some of its details.

The designers of the file system thought further than we did, though: a buffer_head alone is not enough, because read and write efficiency also has to be considered.

3. File page buffering

We know that Linux manages memory in units of pages, each described by a struct page. A page is typically 4K bytes, the size of 4 record blocks. File contents are in fact buffered in memory in units of pages. But the record block size is 1K bytes, so wouldn't it be simpler to manage the buffers in units of record blocks? The reason is that the file cache is designed together with file memory mapping. The memory mapping mechanism maps file content directly into user space, so that the file can then be accessed the same way as memory. If file content is buffered in pages, then once the corresponding page table entries are set up, the buffered pages can be mapped straight into user space. In this way the buffering mechanism of the file system and the file memory mapping mechanism are neatly combined.
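From user space, the effect of this design can be observed with mmap(). The following is only a sketch under my own assumptions: the helper name `poke_first_byte_via_mmap` and the scratch file under /tmp are invented for the demo. It changes a file through a shared mapping with a plain memory store, then reads the byte back through the ordinary file interface.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo: create a scratch file holding `text`, overwrite its
 * first byte through a shared memory mapping, then read that byte back
 * with pread(). Returns the byte read, or -1 on any error.
 */
static int poke_first_byte_via_mmap(const char *text, size_t len, char value)
{
    char path[] = "/tmp/mmap_demo_XXXXXX";
    char back = 0;
    int result = -1;
    char *map;
    int fd;

    assert(len > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                 /* the file goes away once fd is closed */

    if (write(fd, text, len) != (ssize_t)len)
        goto out;

    /* MAP_SHARED: stores through the mapping modify the file itself. */
    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    map[0] = value;               /* a plain memory store, no write() call */
    msync(map, len, MS_SYNC);     /* flush the change before reading back */
    munmap(map, len);

    if (pread(fd, &back, 1, 0) == 1)
        result = (unsigned char)back;
out:
    close(fd);
    return result;
}
```

Calling `poke_first_byte_via_mmap("hello", 5, 'j')` should hand back 'j': the store made through the mapping is visible to a normal read of the same file.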

As shown in the figure above, file content is cached in units of pages, so each buffered page needs 4 buffer_heads, each pointing to a fixed 1K interval within the page.

--Excerpted from Linux Kernel Scenario Analysis
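The relationship between a file offset, its cached page, and the buffer inside that page is simple arithmetic. Here is a user-space sketch, assuming the 4K page and 1K block sizes used above; `struct cache_pos` and `locate` are hypothetical names, not kernel identifiers.

```c
#include <assert.h>

#define DEMO_PAGE_SIZE   4096UL  /* assumed 4K page, as in the text */
#define DEMO_BLOCK_SIZE  1024UL  /* assumed 1K record block, as in the text */
#define BUFFERS_PER_PAGE (DEMO_PAGE_SIZE / DEMO_BLOCK_SIZE)

/* Hypothetical type: where a byte of the file lives in the page cache. */
struct cache_pos {
    unsigned long page_index;    /* which cached page of the file */
    unsigned long page_offset;   /* byte offset inside that page */
    unsigned long buffer_index;  /* which of the page's 4 buffers (0..3) */
    unsigned long file_block;    /* logical block number within the file */
};

static struct cache_pos locate(unsigned long pos)
{
    struct cache_pos p;

    p.page_index = pos / DEMO_PAGE_SIZE;
    p.page_offset = pos % DEMO_PAGE_SIZE;
    p.buffer_index = p.page_offset / DEMO_BLOCK_SIZE;
    p.file_block = p.page_index * BUFFERS_PER_PAGE + p.buffer_index;

    /* Both routes to the logical block number must agree. */
    assert(p.file_block == pos / DEMO_BLOCK_SIZE);
    return p;
}
```

For byte offset 7000, this gives page 1, offset 2904 inside the page, buffer 2 of that page, and logical file block 6.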

4. How files are organized on disk

Having gained a general understanding of how the file buffer is structured in memory, we also need to understand how a file is organized on disk. Only the key points related to file reading and writing are described here, as shown in the following figure.


The file's metadata (its inode) records the various attributes of the file. Among them, the i_data[15] array records the numbers of the disk blocks that store the file. With only 15 block numbers, however, a file could hold at most 15K bytes, which is obviously not enough. So the first 12 entries of the i_data array are direct indexes, while the 13th, 14th, and 15th entries are used for single, double, and triple indirect addressing respectively, as shown in the figure above.

So what does this have to do with writing a file? Suppose the data to be written falls in a block that can only be reached through triple indirect addressing. Then the three levels of index blocks must first be read into memory. If the computed slot in an index block turns out to be empty (the red mark in the figure), that part of the triple indirect chain has not been set up yet. In that case a record block must be allocated not only for the file content but also for the missing index block. Note that the system allocates record blocks for these intermediate addresses only when the file is large enough to need them.
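The indexing scheme above can be sketched in a few lines of user-space C. This is my own illustration, not the kernel's code: it assumes 1K blocks and 4-byte block numbers (so 256 block numbers per index block), and the name `block_to_path` and its return convention are chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

#define DIRECT_BLOCKS  12UL   /* i_data[0..11] are direct indexes */
#define PTRS_PER_BLOCK 256UL  /* assumed: 1K block / 4-byte block number */

/*
 * Translate a logical file block number into the chain of indexes that
 * leads to it. path[0] is the slot in i_data; the remaining entries are
 * slots inside successive index blocks. Returns the chain length:
 * 1 = direct, 2 = single, 3 = double, 4 = triple indirect,
 * 0 = beyond the largest file this scheme can address.
 */
static int block_to_path(unsigned long blk, unsigned long path[4])
{
    const unsigned long single = PTRS_PER_BLOCK;
    const unsigned long dbl = single * PTRS_PER_BLOCK;
    const unsigned long triple = dbl * PTRS_PER_BLOCK;

    assert(path != NULL);
    if (blk < DIRECT_BLOCKS) {
        path[0] = blk;
        return 1;
    }
    blk -= DIRECT_BLOCKS;
    if (blk < single) {
        path[0] = 12;
        path[1] = blk;
        return 2;
    }
    blk -= single;
    if (blk < dbl) {
        path[0] = 13;
        path[1] = blk / single;
        path[2] = blk % single;
        return 3;
    }
    blk -= dbl;
    if (blk < triple) {
        path[0] = 14;
        path[1] = blk / dbl;
        path[2] = (blk / single) % single;
        path[3] = blk % single;
        return 4;
    }
    return 0;
}
```

Under these assumptions, logical block 268 is the first one that needs double indirect addressing (12 direct plus 256 single indirect blocks come before it), and the scheme can address 12 + 256 + 256^2 + 256^3 = 16843020 blocks in total, about 16 GB. A write to a block with chain length 4 may therefore have to allocate up to three index blocks in addition to the data block itself.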

5. File write operation

With a general understanding of how files are organized in memory and on disk, we can now think about what exactly has to be done to write a file, and which parameters are needed.

To write a file, we must know:

1. Which file to write: the struct file *file structure

2. What to write: char *buf

3. How many bytes to write: size_t count

4. Where in the file to start writing: the file offset f_pos, i.e. file->f_pos
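These four pieces of information are also visible from user space. In the sketch below, the file descriptor stands for the open file, buf and count are passed to write() directly, and the position is the file offset that lseek() sets and write() advances. The helpers `open_scratch_file` and `write_at` are hypothetical names used only for this demo.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open an anonymous scratch file under /tmp. */
static int open_scratch_file(void)
{
    char path[] = "/tmp/write_demo_XXXXXX";
    int fd = mkstemp(path);

    if (fd >= 0)
        unlink(path);  /* the file goes away once fd is closed */
    return fd;
}

/*
 * Hypothetical helper: write `count` bytes from `buf` starting at file
 * offset `pos`. Returns the file offset after the write, or -1 on error.
 */
static off_t write_at(int fd, off_t pos, const char *buf, size_t count)
{
    assert(buf != NULL);
    if (lseek(fd, pos, SEEK_SET) < 0)           /* set the file offset */
        return -1;
    if (write(fd, buf, count) != (ssize_t)count)
        return -1;
    return lseek(fd, 0, SEEK_CUR);              /* offset moved by count */
}
```

Writing "hello" at offset 0 and then "p!" at offset 3 leaves "help!" in the file, and after each call the offset sits just past the last byte written.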

As a guess, the most essential work that needs to be done is probably:

1. To be continued...
