Linux I/O performance optimization in practice: study notes

The following content comes from the geek course. If it is helpful to you, please refer to the full course for details.

IO performance

IO performance related knowledge

1. File system

In order to facilitate management, the Linux file system allocates two data structures for each file, an index node and a directory entry. They are mainly used to record the meta-information and directory structure of files.

The index node, abbreviated as inode, records the metadata of a file, such as the inode number, file size, access permissions, modification date, and data location. Inodes and files correspond one to one. Like the file content, an inode is persisted to disk, so remember that inodes also take up disk space.
The directory entry, dentry for short, records the file name, a pointer to the index node, and the relationships with other directory entries. Multiple associated directory entries make up the directory structure of the file system. Unlike the index node, however, the directory entry is an in-memory data structure maintained by the kernel, so it is usually called the directory entry cache.
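To make the split between names and inodes concrete, here is a minimal Python sketch using only the standard library. It creates a file and a hard link to it: the two names are separate entries in the directory, yet both resolve to the same inode number. The file names and the helpers `inode_of` and `hard_link_demo` are hypothetical, chosen only for illustration.

```python
import os
import tempfile

def inode_of(path):
    """Return the inode number recorded for path."""
    return os.stat(path).st_ino

def hard_link_demo():
    """Create a file plus a hard link; return (inode_a, inode_b, link_count)."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "data.txt")  # hypothetical file name
        alias = os.path.join(tmp, "alias.txt")    # second name for the same file
        with open(original, "w") as f:
            f.write("hello")
        os.link(original, alias)  # adds a name; no new inode is allocated
        return inode_of(original), inode_of(alias), os.stat(original).st_nlink

ino_a, ino_b, nlink = hard_link_demo()
print(ino_a == ino_b, nlink)
```

On a file system that supports hard links, both names should report the same inode number, and the link count stored in the inode should reflect the two names.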

2. Slab

cache = Cached + SReclaimable, that is, the cache value reported by free is the sum of the Cached and SReclaimable fields in /proc/meminfo.
The kernel uses the Slab mechanism to manage the caches of directory entries and index nodes. /proc/meminfo only gives the overall size of the slab. To see the size of each type of slab cache, you also need to check /proc/slabinfo.
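As a rough illustration of that formula, the sketch below parses text in the /proc/meminfo format and adds the two fields. The sample numbers are made up, and `parse_meminfo` and `cache_kb` are hypothetical helper names; on a real system you would pass in the contents of /proc/meminfo instead.

```python
# Made-up sample in /proc/meminfo format, for illustration only.
SAMPLE_MEMINFO = """\
MemTotal:        8169348 kB
MemFree:         6271296 kB
Cached:           743808 kB
Slab:             160164 kB
SReclaimable:      76128 kB
SUnreclaim:        84036 kB
"""

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields

def cache_kb(fields):
    """Page cache plus reclaimable slab, in kB."""
    return fields["Cached"] + fields["SReclaimable"]

info = parse_meminfo(SAMPLE_MEMINFO)
print(cache_kb(info))
```

The Slab field is itself the sum of SReclaimable and SUnreclaim, which is why only the reclaimable part is counted as cache.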

3. IO working principle

Whether on a mechanical disk or a solid-state disk, random I/O on the same disk is much slower than sequential I/O. The reasons are straightforward.

For mechanical disks, random I/O requires more head seeks and platter rotation, so it is naturally slower than sequential I/O.
For solid-state disks, random performance is much better than on mechanical disks, but they are still subject to the "erase before write" limitation. Random reads and writes trigger a lot of garbage collection, so random I/O is still considerably slower than sequential I/O.
In addition, sequential I/O can reduce the number of I/O requests through read-ahead, which is another reason for its good performance. Many optimization schemes start from this point to improve I/O performance.

We can divide the I/O stack of the Linux storage system into three levels from top to bottom, namely the file system layer, the general block layer, and the device layer. The relationship between these three I/O layers is shown in the figure below, which is actually a panorama of the I/O stack of the Linux storage system.
(Figure: panorama of the Linux storage system I/O stack)
According to this panoramic picture of the I/O stack, we can more clearly understand the working principle of the storage system I/O.

The file system layer includes the virtual file system and the concrete implementations of the various file systems. It provides a standard file access interface to applications above it, and stores and manages disk data through the general block layer below it.
The general block layer includes the block device I/O queues and the I/O scheduler. It queues the I/O requests coming from the file system, reorders and merges them, and then sends them to the device layer.
The device layer includes the storage devices and their drivers, and is responsible for the final I/O operations on the physical devices.

IO scheduling algorithm

The process of sorting I/O requests is what we usually call I/O scheduling. The Linux kernel supports four I/O scheduling algorithms: NONE, NOOP, CFQ and DeadLine. Each is introduced below.

The first, NONE, is strictly speaking not an I/O scheduling algorithm at all. It uses no I/O scheduler and does no processing on file system and application I/O. It is commonly used in virtual machines, where disk I/O scheduling is left entirely to the physical host.
The second, NOOP, is the simplest I/O scheduling algorithm. It is a first-in, first-out queue that only does some basic request merging, and it is often used for SSDs.
The third, CFQ (Completely Fair Queueing), also known as the completely fair scheduler, is the default I/O scheduler of many distributions. It maintains an I/O scheduling queue for each process and distributes the I/O requests of each process evenly by time slice.
The last, DeadLine, creates separate I/O queues for read and write requests, which can increase the throughput of mechanical disks and ensures that requests that reach their deadline are processed first. DeadLine is mostly used in scenarios with heavy I/O pressure, such as databases.
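The scheduler in use for a block device can be read from /sys/block/<device>/queue/scheduler, where the active scheduler is shown in square brackets. The following sketch parses such a line; the sample strings are illustrative and `parse_scheduler` is a hypothetical helper name.

```python
import re

def parse_scheduler(text):
    """Return (active, available) from a queue/scheduler line.

    Example input: 'noop deadline [cfq]'. If no name is bracketed,
    active is returned as None.
    """
    available = [name.strip("[]") for name in text.split()]
    match = re.search(r"\[([^\]]+)\]", text)
    active = match.group(1) if match else None
    return active, available

active, available = parse_scheduler("noop deadline [cfq]")
print(active, available)
```

On a real machine the input would come from reading the sysfs file for the device of interest, and the set of names listed depends on the kernel version and configuration.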

4. IO troubleshooting ideas

Use iostat to check the disk's await, %util, IOPS, and bandwidth
Use smartctl to check the health status of the disk
Use iotop/pidstat to find the processes that keep reading and writing, so they can be optimized
Use strace -f -p [pid] to find the file descriptors being read or written, and lsof -p [pid] to find the specific file behind the problem
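As a sketch of the last step, the snippet below totals the bytes moved per file descriptor from strace output, so the busiest descriptor can then be looked up with lsof. The trace lines are made up for illustration, and `bytes_per_fd` is a hypothetical helper; real strace output has more variety than this regular expression handles.

```python
import re
from collections import defaultdict

# Made-up trace lines in the style of `strace -f -p PID` output.
SAMPLE_TRACE = '''\
[pid 1234] write(3, "log line"..., 4096) = 4096
[pid 1234] write(3, "log line"..., 4096) = 4096
[pid 1235] read(5, "data"..., 8192) = 1024
[pid 1234] stat("/tmp/x.txt", {st_mode=S_IFREG|0644, ...}) = 0
[pid 1235] write(7, "x", 1) = -1 EAGAIN (Resource temporarily unavailable)
'''

# Matches read(fd, ...) = N and write(fd, ...) = N; other calls are ignored.
CALL_RE = re.compile(r"\b(read|write)\((\d+),.*\)\s*=\s*(-?\d+)")

def bytes_per_fd(trace):
    """Sum successful read/write return values per (syscall, fd)."""
    totals = defaultdict(int)
    for line in trace.splitlines():
        m = CALL_RE.search(line)
        if m and int(m.group(3)) > 0:  # skip failed calls (negative return)
            totals[(m.group(1), int(m.group(2)))] += int(m.group(3))
    return dict(totals)

totals = bytes_per_fd(SAMPLE_TRACE)
print(totals)
```

The descriptor number with the largest total is the one to match against the FD column of lsof -p [pid].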


5. Common ideas for IO optimization

Application optimization

The application sits at the top of the I/O stack. Through system calls it can adjust the I/O mode (sequential or random, synchronous or asynchronous), and it is also the ultimate source of the I/O data. In my opinion, there are several ways to optimize the I/O performance of an application.

First, you can use append writes instead of random writes to reduce addressing overhead and speed up I/O writes.
Second, you can use cached I/O to take full advantage of the system cache and reduce the number of actual I/O operations.
Third, you can build your own cache inside the application, or use an external cache system such as Redis. On the one hand, this lets the application control the cached data and its life cycle; on the other hand, it reduces the impact of other applications' cache usage on your own. As another example, library functions such as fopen and fread in the C standard library use the library's own buffering to reduce disk operations. When you call system calls such as open and read directly, you can only use the page cache and buffers provided by the operating system, with no library-level cache.
Fourth, when you need to read and write the same region of disk frequently, you can use mmap instead of read/write to reduce the number of memory copies.
Fifth, in scenarios that require synchronous writes, try to merge write requests instead of writing each request to disk synchronously; that is, use fsync() instead of O_SYNC.
Sixth, when multiple applications share the same disk, it is recommended to use the I/O subsystem of cgroups to limit the IOPS and throughput of a process or process group, so that no single application monopolizes the I/O.
Finally, when using the CFQ scheduler, you can use ionice to adjust the I/O scheduling priority of a process, in particular to raise the I/O priority of core applications. ionice supports three priority classes: Idle, Best-effort and Realtime. Best-effort and Realtime each support levels 0 to 7; the smaller the value, the higher the priority.
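The first and fifth points can be combined in a small sketch: records are appended to a file and flushed with a single fsync per batch, rather than opening the file with O_SYNC and paying a synchronous write for every record. This is only an illustration under simple assumptions; the file name and the `append_batch` helper are hypothetical.

```python
import os
import tempfile

def append_batch(path, records):
    """Append all records, then flush them to disk with one fsync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for record in records:
            os.write(fd, record)  # buffered by the kernel, not yet forced to disk
        os.fsync(fd)              # one synchronous flush for the whole batch
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")  # hypothetical file name
    append_batch(log_path, [b"a\n", b"b\n"])
    append_batch(log_path, [b"c\n"])
    with open(log_path, "rb") as f:
        content = f.read()
print(content)
```

The trade-off is durability granularity: records written since the last fsync may be lost on a crash, so the batch size should match what the application can afford to lose.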

File system optimization

Disk optimization

First, the simplest and most effective optimization method is to switch to a disk with better performance, such as using SSD instead of HDD.
Second, we can use RAID to combine multiple disks into one logical disk, forming a redundant array of independent disks. This improves both the reliability of the data and its access performance.
Third, in view of the characteristics of disk and application I/O patterns, we can choose the most suitable I/O scheduling algorithm. For example, SSDs and disks in virtual machines usually use the noop scheduling algorithm. For database applications, I prefer to use the deadline algorithm.
Fourth, we can isolate the application data at the disk level. For example, we can configure separate disks for applications with heavy I/O pressure such as logs and databases.
Fifth, in scenarios dominated by sequential reads, we can increase the amount of data the disk reads ahead. For example, the read-ahead size of /dev/sdb can be adjusted in two ways: by writing the kernel option /sys/block/sdb/queue/read_ahead_kb, or with the blockdev tool (blockdev --setra).
Sixth, we can optimize the I/O options of the kernel block device. For example, you can adjust the length of the disk queue /sys/block/sdb/queue/nr_requests, and increase the queue length appropriately to increase the throughput of the disk (of course, it will also increase the I/O delay).
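One detail worth keeping in mind when tuning read-ahead is that the two common interfaces use different units: the sysfs file /sys/block/sdb/queue/read_ahead_kb is expressed in KB, while blockdev --setra takes a count of 512-byte sectors. The conversion helpers below are hypothetical names for a simple sketch of that arithmetic.

```python
SECTOR_BYTES = 512  # unit used by blockdev --setra

def kb_to_sectors(read_ahead_kb):
    """Convert a read_ahead_kb value to the equivalent --setra sector count."""
    return read_ahead_kb * 1024 // SECTOR_BYTES

def sectors_to_kb(sectors):
    """Convert a --setra sector count to the equivalent read_ahead_kb value."""
    return sectors * SECTOR_BYTES // 1024

print(kb_to_sectors(128), sectors_to_kb(8192))
```

In other words, the sector count is always twice the KB value, so a setting applied through one interface should show up as half or double through the other.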

View IO performance related commands

1. Use df to view disk space and the number of inodes

$ df -i /dev/sda1
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/sda1      3870720 157460 3713260    5% /
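The same counters can be read programmatically. The sketch below uses os.statvfs from the Python standard library; `inode_usage` and `iuse_percent` are hypothetical helper names, and the percentage is rounded up, which is consistent with the sample output above (157460 of 3870720 is about 4.07%, shown as 5%).

```python
import os

def iuse_percent(used, total):
    """Inode usage as a whole percentage, rounded up; 0 if total is 0."""
    if total == 0:
        return 0
    return -(-100 * used // total)  # integer ceiling division

def inode_usage(path="/"):
    """Return inode totals for the file system containing path."""
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    used = total - free
    return {
        "inodes": total,
        "iused": used,
        "ifree": free,
        "iuse_percent": iuse_percent(used, total),
    }

usage = inode_usage("/")
print(usage)
```

Some file systems report a total of zero because they allocate inodes dynamically, which is why the helper guards against division by zero.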

2. Use slabtop to find the type of cache that takes up the most memory

3. Use iostat to view the usage, IOPS, and throughput of each disk

# -d -x shows the I/O metrics of all disks
$ iostat -d -x 1
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
loop0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
loop1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Among these indicators, you should pay attention to:

%util is the disk I/O utilization;
r/s + w/s is the IOPS;
rkB/s + wkB/s is the throughput;
r_await + w_await is the response time.
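These sums are easy to compute from a single iostat row. The sketch below does so for one device; the header matches the output above, but the row values are made up for illustration, and `parse_row` and `summarize` are hypothetical helper names.

```python
HEADER = ("Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm "
          "r_await w_await aqu-sz rareq-sz wareq-sz svctm %util")
# Made-up row for a busy disk, for illustration only.
SAMPLE_ROW = ("sdb 10.00 64.00 40.00 32768.00 0.00 0.00 0.00 0.00 "
              "0.50 7.90 1.20 4.00 512.00 1.30 99.00")

def parse_row(header, row):
    """Return (device_name, {column_name: float_value}) for one iostat row."""
    names = header.split()
    values = row.split()
    metrics = {name: float(v) for name, v in zip(names[1:], values[1:])}
    return values[0], metrics

def summarize(metrics):
    """Derive the headline indicators from the parsed columns."""
    return {
        "iops": metrics["r/s"] + metrics["w/s"],
        "throughput_kb_s": metrics["rkB/s"] + metrics["wkB/s"],
        "util_percent": metrics["%util"],
        "r_await_ms": metrics["r_await"],
        "w_await_ms": metrics["w_await"],
    }

device, metrics = parse_row(HEADER, SAMPLE_ROW)
summary = summarize(metrics)
print(device, summary)
```

The read and write await values are kept separate here because they often differ widely, and looking at each one on its own usually says more than a combined figure.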

4. Use strace to trace the system calls of processes and threads

Adding -f to strace -p PID also traces all of the target's child processes and threads.

5. File system and disk IO related commands

(Figure: file system and disk I/O related commands)