Research on the phenomenon of Direct IO write amplification

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This work was created by Li Zhaolong; please credit the author and note the copyright when reproducing it.

Introduction

This question came up in an interview with Alibaba. To be honest, the moment it was asked I knew the person on the phone was an expert, and unsurprisingly I could not answer it. With a hint I might have had a chance, but it is genuinely hard to guess the meaning of a term you have never seen before.

After some research, the "write amplification" the interviewer referred to means that the number of IOs actually performed by the OS is greater than the number of IOs issued from user mode. This is caused by the way the file system organizes data on disk. You can run df -T to view the file system format on the current system; on Linux it is usually ext4.
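Besides df -T, the same check can be made from code. A minimal sketch (statfs and the 0xEF53 superblock magic are real Linux interfaces, but note that the magic is shared by ext2/ext3/ext4, so this identifies the ext family rather than ext4 specifically):

```cpp
#include <sys/vfs.h>

// Returns true if the file system containing `path` belongs to the
// ext2/ext3/ext4 family. 0xEF53 is the ext superblock magic
// (EXT4_SUPER_MAGIC in <linux/magic.h>).
bool isExtFamily(const char* path) {
    struct statfs s;
    if (statfs(path, &s) != 0) return false;  // path missing or statfs failed
    return s.f_type == 0xEF53;
}
```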

ext4 file system

(Figure: ext4 file system architecture)

(Figure: ext4 file system architecture)

The figures above show the basic architecture of the ext4 file system.

You can run df to list the file systems on the machine, then dumpe2fs -h /dev/sda10 | grep node to view inode-related data, and tune2fs -l /dev/sda10 | grep "Block size" to view the block size; replace the device name as appropriate for your machine.

For more details see [4], but these two figures are enough to convey the general architecture of ext4.

At this point we should notice something: to find the data of a particular inode, especially data at a large offset, a single disk access is not enough; several trips to the disk may be required. One IO operation in user mode can therefore turn into multiple IO operations actually performed by the operating system.
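To make this concrete, here is a small sketch of my own (not kernel code) of the classic ext2/ext3 indexing scheme referenced in [3] and [9]: 12 direct pointers, then single, double, and triple indirect blocks, with 4-byte pointers, so a 4 KiB index block holds 1024 of them. (Modern ext4 uses extent trees by default; the indirect scheme is the ext2/ext3 layout.) The function estimates how many extra index-block reads are needed before the data block at a given offset can be read:

```cpp
#include <cstdint>

// Estimate how many extra index-block reads an ext2/ext3-style indirect
// scheme needs to reach the data block at `offset`.
// Assumes the classic layout: 12 direct pointers, then single, double,
// and triple indirect blocks, with 4-byte block pointers.
int indexReadsForOffset(uint64_t offset, uint64_t blockSize = 4096) {
    uint64_t ptrsPerBlock = blockSize / 4;  // 1024 pointers in a 4 KiB block
    uint64_t block = offset / blockSize;    // logical block number
    if (block < 12) return 0;               // direct pointer: no extra read
    block -= 12;
    if (block < ptrsPerBlock) return 1;     // single indirect: 1 extra read
    block -= ptrsPerBlock;
    if (block < ptrsPerBlock * ptrsPerBlock) return 2;  // double indirect
    return 3;                               // triple indirect
}
```

With a 4 KiB block size, an offset of 512 MB falls in the double-indirect range, so an uncached read there costs two extra index IOs on top of the inode lookup and the data-block read itself.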

As for the conversion between physical and logical disk blocks, see [9]; for the ext4 documentation, see [10].

Reproducing write amplification

Let's try to reproduce this process: first generate several large files, then read at an offset each time, flushing the cache between reads. The simple code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#include <string>
#include <vector>
#include <iostream>
using std::string;
using std::vector;

int main(){
    vector<int> arr;
    arr.reserve(500);

    constexpr int len = 1024*1024*1024;

    // The disk ran out of space, so only 10 files are generated here; for
    // more accurate results, generate 500 or more if the machine allows it.
    for (size_t i = 0; i < 10; i++){
        int fd = open(std::to_string(i).c_str(), O_RDWR | O_CREAT | O_DIRECT, 0755);
        if(fd < 0){
            perror("open");
            continue;
        }
        arr.push_back(fd);
        char *buf = nullptr;
        // O_DIRECT requires an aligned buffer; page alignment is sufficient.
        if(posix_memalign((void **)&buf, getpagesize(), len) != 0){
            std::cerr << "posix_memalign failed!\n";
            break;
        }
        int ret = write(fd, buf, len);
        if(ret != len){
            std::cerr << "Partially written!\n";
        }
        free(buf);
    }

    for(auto x : arr){
        close(x);
    }
    return 0;
}

Next, read these ten files in a loop, calling sync each time. Cycling across several files simulates access to multiple large files (my machine could not cope with more than this).

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#include <string>
#include <vector>
#include <iostream>
using std::string;
using std::vector;

int main(){
    vector<int> arr;
    arr.reserve(500);

    constexpr int len = 1024;

    for (size_t i = 0; i < 10; i++){
        int fd = open(std::to_string(i).c_str(), O_RDONLY | O_DIRECT, 0755);
        std::cout << fd << std::endl;
        arr.push_back(fd);
    }

    for (size_t i = 0; i < 1000; i++){
        int index = i % 10;
        lseek(arr[index], 1024*1024*512, SEEK_SET); // move the offset deep into the file
        char* buf = nullptr;
        // O_DIRECT requires an aligned buffer.
        if(posix_memalign((void**)&buf, getpagesize(), len) != 0){
            std::cerr << "posix_memalign failed!\n";
            break;
        }
        int ret = read(arr[index], buf, len);
        if(ret != len){
            std::cerr << "Partial read!\n";
        }
        free(buf);
        sync();
    }

    for(auto x : arr){
        close(x);
    }
    return 0;
}

First of all, you can see from the code that we perform 1000 IO operations in user mode, reading a total of 1000 KB of data.

While the code runs, you can call iostat -d 1, which reports device IO statistics every second; append a count if you want to limit the number of reports, otherwise it keeps running indefinitely.
(Figure: iostat output during the Direct IO run)
We can see that the operating system performed about 2500 IOs, with a total data volume of about 4000 KB.

Now change Direct IO into an ordinary buffered IO flow, i.e. remove the O_DIRECT flag from open and switch the allocation to malloc, and continue monitoring IO:
(Figure: iostat output during the buffered IO run)
The number of IOs is about 1200, but the amount of data read is clearly much smaller, about 40 KB: ten files, each read once through the page cache. Now consider the factor of four above; there is obviously a relationship. I read it this way: we have already seen the organization of ext4 above. With 4096-byte blocks, a 1 GB file needs a double-indirect index to address, which means that each read costs two additional IOs compared with a direct lookup; in addition, the inode must first be found on disk at the start, and the data must be read from the data block at the end. That gives roughly four IOs per user-mode read.

As for why tps is still about 1000: because I did not remove sync, which is also why the number of bytes written is so large. If sync is removed, the file data stays in the page cache, so we can expect the amount of data read to be much smaller, and the tps to drop as well.

(Figure: iostat output with sync removed)

In line with expectations.

Summary

After the exploration above and a reading of [3], we can basically attribute the write amplification seen with Direct IO to the implementation of the file system. Because each Direct IO here targets a different file, several extra IO operations are needed to locate the data's actual position on disk. The index data is cached, but it is not reused by the next read of a different file, so the number of extra IOs rises sharply: for 1 GB files on ext4 it already reaches a frightening factor of four.

Of course, a more detailed summary is given by the last three points of [3]; to guard against the original article being lost, I record them here:

  1. The file-indexing scheme used by ext3 generates extra IO when the read offset is large, and the larger the offset, the more extra IOs are needed;
  2. Linux caches ext3 index blocks in the buffer cache. To find a data block's location, the index block is first looked up in the cache, and is read from disk only on a cache miss;
  3. The number of IO requests the Linux kernel issues during Direct IO depends on the following factors: a) the continuity of the logical blocks on the physical disk; b) the alignment granularity of the application buffer — try to align with PAGE_SIZE when programming.

In fact, we can see that Direct IO is appropriate in two situations, one on the write side and one on the read side.

On the write side: when data must reach the disk immediately, to prevent the loss of in-memory data on power failure. We have located the cause of write amplification in the cache and the file system implementation, so a scenario like a WAL need not worry much: the on-disk index is cached on first access, and each subsequent operation is just a single IO (though still less efficient than batched writes, of course).
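As a sketch of that WAL-style pattern (my own illustration; padding each record up to a page is one simple way to satisfy O_DIRECT's alignment rules, not the only one):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical WAL-style append: one file opened once (with O_DIRECT in the
// real scenario), appended with page-aligned, page-sized records. The index
// blocks are read once on the first write and then stay in the buffer cache,
// so the amplification seen with many different files does not occur here.
int appendRecord(int fd, const char* data, size_t n) {
    size_t page = getpagesize();
    size_t padded = ((n + page - 1) / page) * page;  // O_DIRECT needs an aligned length
    char* buf = nullptr;
    if (posix_memalign((void**)&buf, page, padded) != 0) return -1;
    memset(buf, 0, padded);   // zero the padding
    memcpy(buf, data, n);
    ssize_t ret = write(fd, buf, padded);
    free(buf);
    return ret == (ssize_t)padded ? 0 : -1;
}
```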

On the read side: when you have determined that data access has no locality; of course, fadvise can also accomplish this.

These conclusions may not be entirely correct; please point out any errors you find.

References:

  1. Understanding the directory structure of the ext4 file system
  2. Another look at Linux IO
  3. Research on IO amplification in Direct IO
  4. Introduction to the ext4 file system management mechanism
  5. iostat in detail
  6. Example of reading files in O_DIRECT mode
  7. Direct IO (O_DIRECT) in detail
  8. Usage of /proc/sys/vm/drop_caches
  9. ext2_get_branch: the process of mapping logical blocks to physical disk blocks
  10. Ext4 Disk Layout
  11. fadvise man page

Origin: blog.csdn.net/weixin_43705457/article/details/115118406