How much disk IO actually occurs when you read one byte of a file?


Author | Zhang Yanfei (allen)

Source | Develop Inner Strength Practice

Some issues that look mundane in daily development are, I think, not really understood by most people, or at least not understood thoroughly. If you don't believe me, take a look at the following simple file-reading code:

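A minimal sketch of the code in question (the file name is a placeholder and error handling is trimmed):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char c;
    int fd = open("test.txt", O_RDONLY);  /* placeholder file name */
    if (fd < 0)
        return 1;
    read(fd, &c, 1);                      /* ask the kernel for exactly one byte */
    close(fd);
    return 0;
}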

This code just reads a single byte from a file. With it in mind, consider two questions:

  • Will reading 1 byte of a file cause disk IO?

  • If disk IO does occur, how much IO is that?

The languages most of us use day to day, such as C++, PHP, Java, and Go, are wrapped at a fairly high level, and many details are completely hidden. To answer the questions above, we need to cut Linux open and look at its IO stack.

1. A big-picture look at the Linux IO stack

Without further ado, here is a simplified Linux IO stack I drew.

[Figure: a simplified Linux IO stack]

The diagram shows that while the application layer only needs a simple read, the kernel must coordinate many components (the IO engine, VFS, the Page Cache, the generic block layer, and the IO scheduling layer) in a complex dance to complete it.

What does each of these components do? Let's walk through them one by one. If you'd rather skip the tour, jump straight to the read flow in section 2.

1.1 I/O engine

When a developer wants to read or write a file, there are several families of functions to choose from at the library layer, such as read & write or pread & pwrite. Choosing among them is really choosing one of the IO engines Linux provides.

Common types of IO engines are as follows:

[Figure: table of common IO engine types]

The read function in our opening snippet belongs to the sync engine. The IO engine still sits at the top of the stack; to do its work it relies on lower-level support: the kernel's system calls, VFS, the generic block layer, and so on.
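For a feel of the sync family, here is a small sketch contrasting read and pread (the path is a placeholder):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[16];
    int fd = open("test.txt", O_RDONLY);            /* placeholder path */
    if (fd < 0)
        return 1;

    /* read() consumes the shared file position; pread() takes an explicit offset */
    ssize_t n1 = read(fd, buf, sizeof(buf));
    ssize_t n2 = pread(fd, buf, sizeof(buf), 0);

    close(fd);
    return (n1 < 0 || n2 < 0) ? 1 : 0;
}

Because pread does not move the file position, it is convenient when several threads read the same descriptor concurrently.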

Let's keep descending into the kernel and introduce each of these components.

1.2 System calls

Entering a system call means entering the kernel.

System calls wrap the functionality of other kernel components and expose it to user processes through a stable interface.

[Figure: system calls exposing kernel components to user processes]

For our file-reading needs, the system call layer in turn relies on the VFS component.
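As an illustration that the library's read() is only a thin shim over the kernel's entry point, the same request can be issued through the raw system call interface; a sketch (placeholder path, error handling trimmed):

#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    char c;
    int fd = open("test.txt", O_RDONLY);   /* placeholder path */
    /* bypass the libc wrapper and invoke the kernel's read entry directly */
    long n = syscall(SYS_read, fd, &c, 1);
    close(fd);
    return n == 1 ? 0 : 1;
}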

1.3 VFS virtual file system

The idea behind VFS is to abstract a common file system model for Linux and give developers and users one uniform set of interfaces, so nobody has to care about how any specific file system is implemented. VFS provides four core data structures, defined in include/linux/fs.h and include/linux/dcache.h in the kernel source:

  • superblock: records information about a specific mounted file system.

  • inode: every file and directory in Linux has one; it records permissions, modification times, and other metadata.

  • dentry: a directory entry, one component of a path; chained together, dentry objects form the Linux directory tree.

  • file: a file object, representing an open file as seen by the process that opened it.

Around these four core data structures, VFS also defines sets of operations. For example, inode operations are declared in inode_operations, which includes the familiar mkdir and rename. For the file object, the operations are declared in file_operations, as follows:

// include/linux/fs.h
struct file {
    ......
    const struct file_operations    *f_op;   /* operation table for this open file */
};

struct file_operations {
    ......
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    ......
    int (*mmap) (struct file *, struct vm_area_struct *);
    int (*open) (struct inode *, struct file *);
    int (*flush) (struct file *, fl_owner_t id);
};

Note that VFS is an abstraction: the read and write declared in its file_operations are just function pointers. A concrete file system, such as ext4, has to supply the actual implementations.
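To make the function-pointer point concrete, here is a minimal sketch of how a kernel module might fill in a file_operations table of its own. All names here are hypothetical, and the read implementation is a toy that reports end-of-file immediately:

#include <linux/fs.h>
#include <linux/init.h>
#include <linux/module.h>

/* toy read: report end-of-file right away (hypothetical example) */
static ssize_t demo_read(struct file *filp, char __user *buf,
                         size_t len, loff_t *ppos)
{
    return 0;
}

/* VFS dispatches read() on our device through this table */
static const struct file_operations demo_fops = {
    .owner = THIS_MODULE,
    .read  = demo_read,
};

static int major;

static int __init demo_init(void)
{
    major = register_chrdev(0, "demo", &demo_fops);  /* 0 = pick a free major */
    return major < 0 ? major : 0;
}

static void __exit demo_exit(void)
{
    unregister_chrdev(major, "demo");
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");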

1.4 Page Cache

Next comes the Page Cache. It is the main disk cache used by the Linux kernel and works purely in memory. The kernel uses a radix tree to manage its large number of cached pages efficiently.

With it, Linux can keep part of the on-disk file data in memory, speeding up access to the comparatively slow disk.

When a user accesses a file, if the file block in question is already in the Page Cache, the data is simply copied from kernel memory into the user process's memory. If it is not there, the kernel allocates a new page, fills it with the block read from disk, and keeps it in the cache so the next access can be served directly.

[Figure: Page Cache hit vs. miss paths]

At this point you can already half-answer the opening question: if the file was accessed recently, Linux will very likely serve it straight from the Page Cache in memory, and no actual disk IO occurs.

There is one case where the Page Cache does not apply: when you open the file with the O_DIRECT flag.
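A sketch of what using that flag looks like. O_DIRECT requires the buffer, offset, and length to be aligned (typically to 512 bytes or the device's logical block size; the exact rule depends on kernel and file system), so the buffer comes from posix_memalign:

#define _GNU_SOURCE          /* needed for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    /* O_DIRECT needs an aligned buffer; 4096 covers common block sizes */
    if (posix_memalign(&buf, 4096, 4096) != 0)
        return 1;

    int fd = open("test.txt", O_RDONLY | O_DIRECT);  /* placeholder path */
    if (fd < 0)
        return 1;

    /* this read goes to the device, bypassing the Page Cache */
    ssize_t n = read(fd, buf, 4096);

    close(fd);
    free(buf);
    return n < 0 ? 1 : 0;
}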

1.5 File system

Linux supports many file systems; ext2/3/4, XFS, and ZFS are among the common ones.

Which file system a partition uses is chosen when it is formatted. Since each partition is formatted independently, one Linux machine can run several different file systems at once.

The concrete implementations behind VFS live in the file systems. Besides its data structures, each file system defines its own operation functions. ext4, for instance, defines ext4_file_operations, which supplies concrete implementations for the read and write functions declared by VFS: do_sync_read and do_sync_write.

// fs/ext4/file.c
const struct file_operations ext4_file_operations = {
    .llseek         = ext4_llseek,
    .read           = do_sync_read,     /* concrete read behind VFS's pointer */
    .write          = do_sync_write,
    .aio_read       = generic_file_aio_read,
    .aio_write      = ext4_file_write,
    ......
};

Unlike in VFS, the functions here are real implementations.

1.6 Generic block layer

File systems in turn rely on the generic block layer below them.

Upward, the generic block layer gives file system implementers one unified interface that hides the differences between device drivers, so an implemented file system works on any block device. Once the device is abstracted, whether it is an SSD or a mechanical hard disk, the file system reads and writes logical blocks through the same interface.

Downward, it adds I/O requests to the device's request queue. It defines a data structure called bio to represent one IO request (see include/linux/bio.h).
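For orientation, an abridged view of that structure, based on the 3.10-era layout (fields trimmed, and the exact set varies across kernel versions):

/* abridged; many fields omitted */
struct bio {
    sector_t             bi_sector;   /* device address, in 512-byte sectors */
    struct bio          *bi_next;     /* request queue link */
    struct block_device *bi_bdev;     /* target block device */
    unsigned long        bi_rw;       /* read/write flags */
    unsigned short       bi_vcnt;     /* number of bio_vec segments */
    struct bio_vec      *bi_io_vec;   /* segment list (page, offset, length) */
    bio_end_io_t        *bi_end_io;   /* completion callback */
    ......
};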

1.7 IO scheduling layer

When the generic block layer actually issues an IO request, it may not be executed immediately, because the scheduling layer looks at the whole picture and tries to maximize overall disk IO performance.

For mechanical hard disks, the scheduler tries to make the head move like an elevator: sweep in one direction first, then come back, which is more efficient overall. Concrete algorithms include deadline and cfq; we won't expand on their details here, but interested readers can look them up.

For SSDs, random IO is largely a solved problem, so the simplest scheduler, noop, can be used directly.

On your own machine, you can check which scheduling algorithms your Linux supports with dmesg | grep -i scheduler.
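The scheduler currently attached to a particular disk is also visible through sysfs; a small sketch that prints it (the device name sda is an assumption):

#include <stdio.h>

int main(void)
{
    char line[128];
    /* the bracketed entry in this file is the scheduler currently in use */
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");  /* assumed device */
    if (!f)
        return 1;
    if (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}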

Together, the generic block layer and the IO scheduling layer shield the file systems above from the device differences among the hard disks, U disks, and other devices below.

2. The process of reading a file

We have briefly introduced each kernel component in the Linux IO stack. Now let's walk through the file-read process from start to finish (the source code in the figure is based on Linux 3.10).

[Figure: the complete Linux file-read call path, based on Linux 3.10 source]

This long figure strings together the entire flow of a file read in Linux.

3. Back to the opening questions

Back to the first question: will reading 1 byte of a file cause disk IO?

From the flow above we can see that if the Page Cache hits, no disk IO happens at all.

So don't assume that a few file reads and writes in your code will make it slow. The operating system has already done a lot of optimization for you: memory access latency is on the order of nanoseconds, several orders of magnitude faster than mechanical disk IO. If your memory is large enough, or the file is accessed often enough, most read operations never trigger real disk IO.

If the Page Cache misses, must the spindle and head actually move to serve the disk IO?

Not necessarily. Modern disks carry their own onboard cache, and servers often use disk arrays whose core hardware, the RAID card, integrates RAM as yet another cache. Only when every one of these caches misses do the spindle and head really have to work.

Now the second question: if disk IO does occur, how much IO is it?

Suppose no cache can serve the read request; how much will Linux actually read? Exactly the one byte we asked for?

Several kernel components take part in the IO process, and each manages disk data in chunks of a different size:

  • The Page Cache works in units of pages; a Linux page is normally 4 KB.

  • The file system manages data in units of blocks; you can inspect them with dumpe2fs, and a block defaults to 4 KB.

  • The generic block layer handles disk IO in units of segments; a segment is a page or a part of a page.

  • The IO scheduling layer transfers N sectors to memory via DMA; a sector is generally 512 bytes.

  • The hard disk itself also manages and transfers data in sectors.

So although from the user's point of view we read only 1 byte (our opening code even provides only a 1-byte buffer for this IO), the smallest unit the kernel works with is a disk sector, 512 bytes, far larger than 1 byte.

Moreover, higher-level components such as file system blocks and the Page Cache work in even larger units. A Page Cache page is 4 KB of memory, so a disk read generally pulls in multiple 512-byte sectors together. Assuming the generic block layer's IO segment is one memory page, a single disk IO reads 4 KB, that is, eight 512-byte sectors, at once.
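These unit sizes can be checked on a live system; a sketch that prints the page size, a file's preferred IO size, and a device's logical sector size (paths are placeholders, and opening the block device usually needs root):

#include <fcntl.h>
#include <linux/fs.h>      /* BLKSSZGET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* memory page size used by the Page Cache */
    printf("page size: %ld\n", sysconf(_SC_PAGESIZE));

    /* preferred IO size the file system reports for a file */
    struct stat st;
    if (stat("test.txt", &st) == 0)                 /* placeholder path */
        printf("st_blksize: %ld\n", (long)st.st_blksize);

    /* logical sector size of the underlying block device */
    int fd = open("/dev/sda", O_RDONLY);            /* assumed device */
    if (fd >= 0) {
        int ssz = 0;
        if (ioctl(fd, BLKSSZGET, &ssz) == 0)
            printf("logical sector size: %d\n", ssz);
        close(fd);
    }
    return 0;
}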

On top of that, there is a whole readahead machinery we haven't discussed, so in practice more than 8 sectors may well be brought into memory together.
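If readahead hurts a genuinely random access pattern, it can be hinted away with posix_fadvise; a sketch (placeholder path, and note this is only advice to the kernel, which will still read in page-sized units):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char c;
    int fd = open("test.txt", O_RDONLY);   /* placeholder path */
    if (fd < 0)
        return 1;

    /* hint that access is random, which suppresses most readahead */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    read(fd, &c, 1);
    close(fd);
    return 0;
}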

Finally, a few words

The operating system is designed to be simple and reliable for you, to let you treat it as a black box as far as possible. You ask for one byte and it hands you one byte, while silently doing a great deal of work underneath.

Most of us don't write low-level code, but if you care about your application's performance, you should understand when and how the operating system quietly improves it for you. Then, some day when a production server is about to hit trouble, you'll be able to track down the problem quickly.

