Discussion on atomicity of file writes

Are writes to a file atomic? Will the data be garbled when multiple threads write to the same file? What about multiple processes? These questions have probably troubled many of us. Even when we know the answer, it is often hard to explain the underlying principle clearly to whoever asks. As Hou Jie once said, **in front of the source code, there are no secrets**, so this article analyzes these questions from the perspective of the Linux source code. Before we start, we need some background on how Linux represents files, so that the source code is easier to follow.

Readers who have studied Linux will know that a file consists of two parts: the file data itself, and the file's metadata, such as the inode number, permissions, extended attributes, mtime, ctime, atime, and so on. The inode is central to a file because it uniquely identifies it. (Strictly speaking, it is the inode number together with the device number that uniquely identifies a file, since inode numbers are only unique within one file system and may repeat across different file systems. That is not the point here, so let us loosely say that an inode uniquely identifies a file.) In the kernel, the inode number and the file metadata are represented by a struct inode object, whose structure is as follows:

struct inode {
    umode_t i_mode;
    uid_t i_uid;
    gid_t i_gid;
    unsigned long i_ino;
    atomic_t i_count;
    dev_t i_rdev;
    loff_t i_size;
    struct timespec i_atime;
    struct timespec i_mtime;
    ..... // omit
};

A process first opens a file, and then performs read and write operations on it. The Linux kernel also has a struct file object to represent the open file. Its structure is as follows:

struct file {
    .....
    const struct file_operations *f_op;
    loff_t f_pos;
    struct address_space *f_mapping;
    ..... // omit
};

Two members are key here. One is f_op, a set of methods for file operations. Code that operates on a file does not need to care what the underlying file system is; it simply finds the corresponding method through the f_op member. The other is f_pos, the offset at which the file is currently being read or written. When a process opens a file, a struct file object is created in the kernel. Reading the file then consists of the following steps:

1. Find the corresponding struct file object through the file descriptor.
2. Obtain the current offset from the struct file object, that is, read the f_pos member.
3. Find the corresponding operation method through f_op, and pass in the offset to read the data.
4. After the read completes, store the new offset.

That is how a single read of a file works. The corresponding code is also very clear, as follows:

// vfs_read -> do_sync_read
ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct kiocb kiocb;
    ssize_t ret;
    // set the length to read and the start offset
    init_sync_kiocb(&kiocb, filp);
    kiocb.ki_pos = *ppos;
    kiocb.ki_left = len;
    kiocb.ki_nbytes = len;

    for (;;) {
        // actually start the read operation
        ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
        if (ret != -EIOCBRETRY)
            break;
        wait_on_retry_sync_kiocb(&kiocb);
    }

    if (-EIOCBQUEUED == ret)
        ret = wait_on_sync_kiocb(&kiocb);
    // update the last offset after reading
    *ppos = kiocb.ki_pos;
    return ret;
}
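
The read steps can also be mirrored in a small user-space model. The sketch below is not kernel code: toy_file, toy_file_operations, mem_read, toy_open and toy_read are hypothetical names chosen for illustration, and the "file" is just a string in memory. It only shows how the offset is taken from the file object, handed to the method found through f_op, and stored back afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Hypothetical method table, standing in for struct file_operations. */
struct toy_file_operations {
    /* Reads up to len bytes starting at *ppos and advances *ppos. */
    long (*read)(struct toy_file *filp, char *buf, size_t len, long *ppos);
};

/* Hypothetical stand-in for struct file. */
struct toy_file {
    const struct toy_file_operations *f_op; /* plays the role of f_op */
    long f_pos;                              /* plays the role of f_pos */
    const char *data;                        /* in-memory "file" contents */
    size_t size;
};

/* One concrete "file system": reads from the in-memory string. */
static long mem_read(struct toy_file *filp, char *buf, size_t len, long *ppos)
{
    assert(*ppos >= 0);
    size_t pos = (size_t)*ppos;
    if (pos >= filp->size)
        return 0;                       /* end of file */
    if (len > filp->size - pos)
        len = filp->size - pos;         /* clamp to the remaining bytes */
    memcpy(buf, filp->data + pos, len);
    *ppos += (long)len;
    return (long)len;
}

static const struct toy_file_operations mem_fops = { .read = mem_read };

/* Creates a file object positioned at offset 0. */
struct toy_file toy_open(const char *data)
{
    struct toy_file f = { .f_op = &mem_fops, .f_pos = 0,
                          .data = data, .size = strlen(data) };
    return f;
}

/* Take the offset, read through f_op, then store the new offset. */
long toy_read(struct toy_file *filp, char *buf, size_t len)
{
    long pos = filp->f_pos;                             /* take the offset */
    long ret = filp->f_op->read(filp, buf, len, &pos);  /* actual read */
    filp->f_pos = pos;                                  /* store it back */
    return ret;
}
```

Calling toy_read twice with a 5-byte request on toy_open("hello world") yields "hello" and then " worl", because the second call starts where the first one stopped. Nothing in toy_read ties the three steps together, which is the same property the kernel path has.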
Writing a file works the same way: get the offset, call the actual write method, and finally update the offset. We now understand the general flow of reading and writing a file, and clearly this flow is not atomic. Whether reading or writing, there are at least two separate steps: taking the offset, and then doing the actual read or write. No lock covers the whole sequence, which answers the first question.

For the second question, consider two threads. The first thread gets offset 1 and starts writing. While it is still writing, the second thread also fetches the offset. Because the threads share the same struct file for this file, the second thread also gets offset 1. Thread 1 then finishes writing and updates the offset, after which thread 2 starts writing at offset 1. The result is obvious: thread 2 overwrites thread 1's data. So multi-threaded file writing is not atomic, and data can be overwritten. But can the data also be interleaved, with the two writes mixed together? In fact this does not happen. To see why, look at the following code:

ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
        unsigned long nr_segs, loff_t pos)
{
    struct file *file = iocb->ki_filp;
    struct inode *inode = file->f_mapping->host;
    struct blk_plug plug;
    ssize_t ret;

    BUG_ON(iocb->ki_pos != pos);
    // the file write itself is performed under a lock
    mutex_lock(&inode->i_mutex);
    blk_start_plug(&plug);
    ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
    mutex_unlock(&inode->i_mutex);

    if (ret > 0 || ret == -EIOCBQUEUED) {
        ssize_t err;

        err = generic_write_sync(file, pos, ret);
        if (err < 0 && ret > 0)
            ret = err;
    }
    blk_finish_plug(&plug);
    return ret;
}
EXPORT_SYMBOL(generic_file_aio_write);
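
The two-thread scenario can be replayed deterministically in a user-space model. This is only a sketch under stated assumptions: model_file and the model_* helpers are hypothetical names, the "file" is a small in-memory buffer, and the interleaving is hard-coded in replay_lost_update rather than produced by real threads. Each whole-buffer copy stands for the part of the write that the kernel runs under inode->i_mutex.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one shared struct file plus its file contents. */
struct model_file {
    long f_pos;      /* shared offset */
    char data[32];   /* in-memory "file" */
    size_t size;     /* current file length */
};

/* Step 1 of write(): snapshot the shared offset. */
long model_pos_read(const struct model_file *f)
{
    return f->f_pos;
}

/* Step 2: copy the whole buffer in one go. In the kernel this part is
 * serialized by the inode lock, so two copies never interleave. */
long model_do_write(struct model_file *f, const char *buf, size_t len, long *pos)
{
    assert(*pos >= 0 && (size_t)*pos + len <= sizeof(f->data));
    memcpy(f->data + *pos, buf, len);
    *pos += (long)len;
    if ((size_t)*pos > f->size)
        f->size = (size_t)*pos;
    return (long)len;
}

/* Step 3: publish the new offset. */
void model_pos_write(struct model_file *f, long pos)
{
    f->f_pos = pos;
}

/* Replays the interleaving: both writers snapshot the offset before
 * either of them writes. Returns the final file length. */
size_t replay_lost_update(struct model_file *f)
{
    long pos1 = model_pos_read(f);         /* thread 1 takes the offset */
    long pos2 = model_pos_read(f);         /* thread 2 sees the same one */

    model_do_write(f, "AAAA", 4, &pos1);   /* thread 1 writes */
    model_pos_write(f, pos1);              /* thread 1 updates the offset */

    model_do_write(f, "BBBB", 4, &pos2);   /* thread 2 overwrites thread 1 */
    model_pos_write(f, pos2);
    return f->size;
}
```

Starting from an empty, zero-initialized model_file, the file ends up holding only "BBBB" with a length of 4 instead of 8: thread 1's bytes are lost, yet the surviving data is one writer's complete buffer rather than a mix of the two.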

Because the write itself runs under inode->i_mutex, the data is never interleaved; the only problem is data being overwritten. Does that mean we must take a lock ourselves whenever we read and write files? Locking does solve the problem, but it feels like using a sledgehammer to crack a nut. Fortunately, the OS gives us ways to write atomically. The first is to add the **O_APPEND** flag when opening the file. With **O_APPEND**, fetching the offset and writing the file are protected by the same lock, so the two steps become atomic. For the specific code, see the __generic_file_aio_write function called in the code above.


ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
                 unsigned long nr_segs, loff_t *ppos)
{
    struct file *file = iocb->ki_filp;
    struct address_space * mapping = file->f_mapping;
    size_t ocount; /* original count */
    size_t count; /* after file limit checks */
    struct inode *inode = mapping->host;
    loff_t pos;
    ssize_t written;
    ssize_t err;

    ocount = 0;
    err = generic_segment_checks(iov, &nr_segs, &ocount, VERIFY_READ);
    if (err)
        return err;

    count = ocount;
    pos = *ppos;

    vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

    /* We can write back this queue in page reclaim */
    current->backing_dev_info = mapping->backing_dev_info;
    written = 0;
    // this function is the key point
    err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
    if (err)
        goto out;
    ...... // omit
}

inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, int isblk)
{
    struct inode *inode = file->f_mapping->host;
    unsigned long limit = rlimit(RLIMIT_FSIZE);

    if (unlikely(*pos < 0))
        return -EINVAL;

    if (!isblk) {
        /* FIXME: this is for backwards compatibility with 2.4 */
        // With O_APPEND set, the file size is read and used as the new offset
        if (file->f_flags & O_APPEND)
            *pos = i_size_read(inode);

        if (limit != RLIM_INFINITY) {
            if (*pos >= limit) {
                send_sig(SIGXFSZ, current, 0);
                return -EFBIG;
            }
            if (*count > limit - (typeof(limit))*pos) {
                *count = limit - (typeof(limit))*pos;
            }
        }
    }
    ......// omit
}
From the generic_write_checks code above, we can see that before the file is actually written, generic_write_checks is called to perform some checks. If the **O_APPEND** flag is set, the check sets the offset to the current size of the file. This whole process runs while the lock is held, so with the **O_APPEND** flag, a file write is atomic, and multiple threads writing to the file will not clobber each other's data.

The other way is the **pwrite** system call. **pwrite** lets the caller specify the offset to write at, so the whole write is naturally atomic. As described above, an ordinary write is not atomic because fetching the offset and writing the file are two independent steps with no lock covering both. pwrite skips the step of fetching the offset, so the whole operation comes down to a single write performed under the lock. The code of pwrite is as follows:

SYSCALL_DEFINE(pwrite64)(unsigned int fd, const char __user *buf,
             size_t count, loff_t pos)
{
    struct file *file;
    ssize_t ret = -EBADF;
    int fput_needed;

    if (pos < 0)
        return -EINVAL;

    file = fget_light(fd, &fput_needed);
    if (file) {
        ret = -ESPIPE;
        if (file->f_mode & FMODE_PWRITE)
            // pos is passed in directly, whereas an ordinary write must
            // first fetch the offset from struct file and then pass it in
            ret = vfs_write(file, buf, count, &pos);
        fput_light(file, fput_needed);
    }

    return ret;
}

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
        size_t, count)
{
    struct file *file;
    ssize_t ret = -EBADF;
    int fput_needed;

    file = fget_light(fd, &fput_needed);
    if (file) {
        // The first step takes the offset
        loff_t pos = file_pos_read(file);
        // The second step actually writes
        ret = vfs_write(file, buf, count, &pos);
        // The third step writes back the offset
        file_pos_write(file, pos);
        fput_light(file, fput_needed);
    }

    return ret;
}
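
From user space, the difference between the two system calls shows up in the descriptor's shared offset. The following is a minimal sketch, assuming a POSIX system with a writable /tmp. RECORD_SIZE, write_record and pwrite_demo are hypothetical names, and fixed-size record slots are just one possible way for callers to pick non-overlapping offsets.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define RECORD_SIZE 4

/* Writes one fixed-size record into its own slot. The offset comes from
 * the caller, so the descriptor's shared offset is never consulted. */
ssize_t write_record(int fd, int slot, const char *record)
{
    assert(slot >= 0);
    off_t offset = (off_t)slot * RECORD_SIZE;
    return pwrite(fd, record, RECORD_SIZE, offset);
}

/* Fills three slots of a scratch file out of order and reads the result
 * back into out (at least 13 bytes). Returns the descriptor's shared
 * offset after all the I/O, or -1 on failure. */
off_t pwrite_demo(char *out)
{
    char path[] = "/tmp/pwrite_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);   /* the file disappears once fd is closed */

    off_t result = -1;
    if (write_record(fd, 2, "CCCC") == RECORD_SIZE &&
        write_record(fd, 0, "AAAA") == RECORD_SIZE &&
        write_record(fd, 1, "BBBB") == RECORD_SIZE &&
        pread(fd, out, 3 * RECORD_SIZE, 0) == 3 * RECORD_SIZE) {
        out[3 * RECORD_SIZE] = '\0';
        result = lseek(fd, 0, SEEK_CUR);   /* query the shared offset */
    }
    close(fd);
    return result;
}
```

After pwrite_demo the buffer holds "AAAABBBBCCCC" and the returned offset is still 0, since pwrite and pread neither read nor update the shared offset. Threads that each own distinct slots can therefore write through the same descriptor without overwriting one another.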
The last question is whether multiple processes writing to the same file will garble its contents. Intuitively, writes from multiple processes are not atomic, because each process has its own independent struct file object with its own file offset, so the overwriting described above can clearly happen. But can the data be interleaved? The answer is no. Although the **struct file** objects are independent, the **struct inode** is shared: no matter how many times the same file is opened, there is only one **struct inode** object. A write first goes into the page cache, the page cache and the **struct inode** correspond one to one, and a lock is taken before the data is actually written. That lock belongs to the **struct inode** object (see mutex_lock(&inode->i_mutex) above). No matter how many processes or threads there are, as long as they write to the same file, they take the same lock, so the writes are serialized and the data will not be interleaved.
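
The effect of independent struct file objects can be observed without forking. This is a minimal sketch, assuming a POSIX system with a writable /tmp, and two_writers is a hypothetical helper name. It opens the same path twice within one process, which yields two independent file offsets much like two unrelated processes would have, and then writes through each descriptor in turn.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Opens the same scratch file twice with extra_flags, writes "AAAA"
 * through the first descriptor and "BBBB" through the second, then copies
 * the file contents into out (at least 9 bytes). Returns the file length,
 * or -1 on failure. */
ssize_t two_writers(int extra_flags, char *out)
{
    assert(out != NULL);
    char path[] = "/tmp/two_writers_XXXXXX";
    int tmp = mkstemp(path);
    if (tmp < 0)
        return -1;
    close(tmp);

    /* Two open() calls: two struct file objects, one shared inode. */
    int fd1 = open(path, O_WRONLY | extra_flags);
    int fd2 = open(path, O_WRONLY | extra_flags);
    ssize_t len = -1;

    if (fd1 >= 0 && fd2 >= 0 &&
        write(fd1, "AAAA", 4) == 4 &&
        write(fd2, "BBBB", 4) == 4) {
        int rfd = open(path, O_RDONLY);
        if (rfd >= 0) {
            len = read(rfd, out, 8);
            if (len >= 0)
                out[len] = '\0';
            close(rfd);
        }
    }
    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    unlink(path);
    return len;
}
```

With extra_flags set to 0, both descriptors start at offset 0 and the file ends up as "BBBB": the second writer has overwritten the first. With O_APPEND, each write lands at the current end of the file and the result is "AAAABBBB".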
