Summary of the characteristics of Linux disk file systems

Background

When installing an operating system, we are often asked to choose a file system. Which one should we choose? CentOS/Red Hat 7 uses xfs by default. Take CentOS 7.5 as an example: the default partition file system is xfs, but out of habit many people change it to ext4. Can xfs be used as is? How does it differ from ext4, and what is the impact on practical applications? This article discusses these questions.

First of all, what is a file system?

A file system mainly controls how programs in the operating system store data when they are not using it, how that data is accessed, and what other information (metadata) is associated with the data itself. A journaling file system only protects data in the event of a power failure or an accidental disconnection from the storage device; it cannot protect data against bad blocks or logic errors in the file system. For those cases, we can use a redundant array of inexpensive disks (RAID).

Introduction

1. Check the file systems supported by the Linux system

Linux offers many file systems to choose from, and the widely used ext4 is the default on many distributions. The following command shows which file systems the current system supports:

ll /lib/modules/3.10.0-229.el7.x86_64/kernel/fs/
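
As a complementary check, the kernel also lists the file systems it currently has registered in /proc/filesystems (this can be fewer than the modules available on disk, because unloaded modules do not appear). The following is a minimal Python sketch that parses that format; the function name and the sample text are illustrative assumptions, not part of the original article.

```python
def parse_proc_filesystems(text):
    """Split /proc/filesystems content into (disk_backed, virtual) name lists."""
    disk_backed, virtual = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "nodev" and len(fields) == 2:
            virtual.append(fields[1])       # needs no block device (e.g. tmpfs)
        else:
            disk_backed.append(fields[-1])  # needs a block device (e.g. ext4, xfs)
    return disk_backed, virtual


# Hypothetical sample; on a real system use open("/proc/filesystems").read()
SAMPLE = "nodev\tsysfs\nnodev\ttmpfs\n\text4\n\txfs\n"
disk_backed, virtual = parse_proc_filesystems(SAMPLE)
print("disk-backed:", disk_backed)
print("virtual:", virtual)
```

On a live system, replace SAMPLE with the real file content to see what the running kernel has loaded.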

2. The main differences between the file systems

The disk structure of the Ext2 file system is as follows. The first block is the boot block, which holds partition information and the kernel boot loader and is not managed by the file system. Ext2 divides the remaining disk blocks into block groups. The data blocks and index nodes (inodes) of each group are stored on adjacent tracks, and all block groups have the same size and are laid out in order on the disk, so the kernel can quickly compute where a block group is located from its integer index. Because the kernel tries to keep the data blocks of a file in the same block group, block groups reduce file fragmentation and therefore the average seek time when accessing a file.

1>Super block

The super block describes the file system as a whole, and each block group holds a copy of it. Its data structure is ext2_super_block. It mainly records the total number of index nodes, the total number of blocks, the number of free blocks, the number of free index nodes, the number of blocks and index nodes in each block group, the file system state, the mount count, and the last mount time, which are used for the file system consistency check.

2>Group Descriptor

The group descriptor records information about the current block group. Its data structure is ext2_group_desc, which mainly contains the block numbers of the data block bitmap and of the index node bitmap, the block number of the first block of the index node table, and the number of free blocks and free index nodes in the group. With the group descriptor, the data block bitmap, index node bitmap, and index node table can be located quickly.

3>Data block bitmap and index node bitmap

A bitmap is a sequence of bits, where 0 indicates that the corresponding data block or index node is free and 1 indicates that it is in use. Each bitmap is stored in a single block. For example, with a block size of 1024 bytes, one bitmap describes the state of 1024*8=8192 blocks. A free data block or index node can be located quickly from the bitmap.
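
The bitmap arithmetic above can be sketched in a few lines of Python. This is only an illustration, not kernel code: the helper names are made up, and the bit order (least-significant bit first within each byte) is an assumption of the sketch.

```python
BLOCK_SIZE = 1024  # bytes, as in the example above


def blocks_tracked(block_size):
    """One bitmap block holds block_size * 8 bits, one bit per tracked block."""
    return block_size * 8


def mark_used(bitmap, n):
    """Set bit n to 1 (in use)."""
    bitmap[n // 8] |= 1 << (n % 8)


def first_free(bitmap):
    """Return the index of the first 0 bit, or None if everything is in use."""
    for byte_index, byte in enumerate(bitmap):
        if byte != 0xFF:                      # at least one free bit in this byte
            for bit in range(8):
                if not byte & (1 << bit):
                    return byte_index * 8 + bit
    return None


bitmap = bytearray(BLOCK_SIZE)               # all zero: every block is free
for n in range(10):
    mark_used(bitmap, n)
print(blocks_tracked(BLOCK_SIZE), first_free(bitmap))
```

Skipping whole bytes equal to 0xFF is what makes the search fast: full regions are passed over eight blocks at a time.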

4>Inode table

The index node table is a sequence of consecutive index nodes stored in consecutive disk blocks; the block number of its first disk block is stored in the bg_inode_table field of the group descriptor. All index nodes have the same size, 128 bytes, so a 4096-byte block holds 32 index nodes. The data structure of an Ext2 index node is ext2_inode, which mainly holds file attributes such as the file type, file access control list, directory access control list, file size, number of data blocks, last access time, last modification time, and the pointers to the data blocks.
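
Because all index nodes have the same size, finding one inside the table is pure arithmetic. The sketch below assumes inode numbers start at 1 and uses a hypothetical value of 8192 inodes per block group; the function name is illustrative.

```python
def locate_inode(ino, inodes_per_group, inode_size=128, block_size=4096):
    """Return (block group, block within the inode table, byte offset in that block)."""
    index = ino - 1                                   # inode numbers start at 1
    group = index // inodes_per_group
    index_in_group = index % inodes_per_group
    inodes_per_block = block_size // inode_size       # 4096 // 128 = 32
    block_in_table = index_in_group // inodes_per_block
    offset_in_block = (index_in_group % inodes_per_block) * inode_size
    return group, block_in_table, offset_in_block


# Hypothetical layout: 8192 inodes per block group
print(locate_inode(34, inodes_per_group=8192))
```

The absolute disk block is then bg_inode_table of that group plus the returned block index.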

The size of an index node is limited. Additional file attributes have to be stored through the extended attribute mechanism: they are kept in a separate data block that the i_file_acl field of the index node points to. Linux provides the setxattr(), getxattr(), listxattr() and related system calls for working with extended attributes. The mechanism was introduced mainly to implement access control lists (ACLs), which restrict the users or groups allowed to access a file and their permissions.
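
Python exposes these system calls on Linux as os.setxattr(), os.getxattr(), and os.listxattr(). The following sketch sets and reads back one attribute; the attribute name user.demo and the helper name are assumptions for illustration, and the function returns None where extended attributes are unavailable (non-Linux platforms, or file systems mounted without user attribute support).

```python
import os
import tempfile


def xattr_roundtrip(path, name="user.demo", value=b"hello"):
    """Set, list, and read back one extended attribute; None if unsupported."""
    if not hasattr(os, "setxattr"):          # the os.*xattr functions are Linux-only
        return None
    try:
        os.setxattr(path, name, value)
        if name not in os.listxattr(path):
            return None
        return os.getxattr(path, name)
    except OSError:                          # e.g. file system without user xattrs
        return None


with tempfile.NamedTemporaryFile() as tmp:
    result = xattr_roundtrip(tmp.name)
print(result)
```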

5>Data block

Different file types use data blocks in different ways, mainly in the following situations:

Regular files are empty when created and need no data blocks. Data blocks are allocated only when data is written, and they can be released again with the truncate() system call.
The data blocks of a directory hold a special data structure, ext2_dir_entry_2, with five fields: the index node number, the length of the directory entry, the length of the file name, the file type, and the file name. The index node number locates the corresponding index node quickly, and the length of the directory entry gives the starting address of the next entry.
The storage of a symbolic link depends on the length of its path name. A path of up to 60 characters is stored in the i_block field of the index node; a longer path requires a separate data block.
Device files, pipes, and sockets need no data blocks; their information is stored in the index node.
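
To make the directory entry layout concrete, here is a small Python sketch that builds and walks a directory block using the five fields listed above. It assumes the little-endian layout of ext2_dir_entry_2 (32-bit inode number, 16-bit entry length, 8-bit name length, 8-bit file type, then the name padded to a multiple of 4 bytes); the helper names and sample entries are illustrative.

```python
import struct

HEADER = struct.Struct("<IHBB")   # inode, rec_len, name_len, file_type (8 bytes)
FT_REG_FILE, FT_DIR = 1, 2        # file type codes for regular files and directories


def pack_entry(inode, name, file_type):
    """Build one directory entry, padded so rec_len is a multiple of 4."""
    raw = name.encode()
    rec_len = (HEADER.size + len(raw) + 3) & ~3
    padded_name = raw.ljust(rec_len - HEADER.size, b"\0")
    return HEADER.pack(inode, rec_len, len(raw), file_type) + padded_name


def walk_entries(block):
    """Follow rec_len from entry to entry, as described in the text."""
    entries, offset = [], 0
    while offset + HEADER.size <= len(block):
        inode, rec_len, name_len, file_type = HEADER.unpack_from(block, offset)
        if rec_len < HEADER.size:             # corrupt entry: stop instead of looping
            break
        if inode != 0:                        # inode 0 marks an unused entry
            start = offset + HEADER.size
            entries.append((inode, block[start:start + name_len].decode(), file_type))
        offset += rec_len                     # start address of the next entry
    return entries


block = (pack_entry(2, ".", FT_DIR) + pack_entry(2, "..", FT_DIR)
         + pack_entry(12, "notes.txt", FT_REG_FILE))
for entry in walk_entries(block):
    print(entry)
```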

Ext2 file system:

A disk is first formatted with the superformat or fdformat program, and the Ext2 file system is then created with the mke2fs program. At creation time you specify the block size, the number of inodes to allocate, and the percentage of reserved blocks (5% by default). mke2fs then initializes all block group descriptors, index node tables, bitmaps, and so on. Defective blocks are recorded in a list, and the lost+found directory is created for e2fsck to use.

Ext2 stores the mapping from file block numbers to disk block numbers in the i_block field of the index node. This field is a fixed-length array whose length is given by EXT2_N_BLOCKS, 15 by default.
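
The 15 entries are conventionally split into 12 direct pointers plus one single, one double, and one triple indirect pointer. The sketch below shows which kind of pointer covers a given file block, assuming 4-byte block numbers and a 4096-byte block size; the function name is made up for illustration.

```python
def classify_block(logical_block, block_size=4096):
    """Tell which part of the 15-entry i_block array maps a file block."""
    pointers = block_size // 4        # 32-bit block numbers per indirect block
    direct = 12                       # i_block[0..11] point straight at data
    if logical_block < direct:
        return "direct"
    logical_block -= direct
    if logical_block < pointers:
        return "single indirect"      # i_block[12]
    logical_block -= pointers
    if logical_block < pointers ** 2:
        return "double indirect"      # i_block[13]
    logical_block -= pointers ** 2
    if logical_block < pointers ** 3:
        return "triple indirect"      # i_block[14]
    raise ValueError("block number beyond what the array can map")


for n in (0, 12, 1036, 1049612):
    print(n, classify_block(n))
```

Each extra level of indirection costs one more disk read before the data block is reached, which is the overhead extents in ext4 aim to avoid.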

Ext3 file system:

During system startup, the e2fsck program checks the s_mount_state field of the file system superblock. If it is not equal to EXT2_VALID_FS, the file system was not unmounted cleanly because of a power failure or system crash, and data cached in memory was not flushed to disk in time. Because the on-disk data structures may then be inconsistent, e2fsck checks all of the file system's on-disk data structures and corrects them where it can. The time this consistency check takes depends on the number of files and directories to check, and it keeps growing as disks get larger and hold more files and directories.

Journaling file systems were introduced to avoid this time-consuming check. A dedicated disk area records disk write operations; this record is called the journal (log). When the file system is inconsistent, the journal is used to repair the affected data structures, which is fast because the journal identifies the on-disk structures that were being modified before the failure.

Journaling means that every disk write first writes a copy of the block into the journal. Only after the I/O to the journal has completed does the I/O between the block and its disk logical block start, that is, the normal disk write. When that write completes, the copy in the journal is discarded as no longer needed. When recovering from a system failure, e2fsck writes into the file system the copies of blocks that were committed to the journal before the failure but not yet written to disk, and ignores blocks that had not been committed to the journal, which ensures a certain degree of data consistency.
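
The write sequence described above (journal first, then the real location, then discard) can be modelled with a toy in-memory class. This is a deliberately simplified sketch, not how Ext3 is implemented: the class name, the crash flags, and the dictionary standing in for the disk are all illustrative assumptions.

```python
class ToyJournaledDisk:
    """In-memory model of write-ahead journaling with crash injection."""

    def __init__(self):
        self.blocks = {}     # the "disk": block number -> data
        self.journal = []    # committed transactions not yet written in place

    def write(self, updates, crash_before_commit=False, crash_after_commit=False):
        transaction = dict(updates)        # 1. copy the blocks to be written
        if crash_before_commit:
            return                         # never committed: ignored on recovery
        self.journal.append(transaction)   # 2. transaction is committed to the journal
        if crash_after_commit:
            return                         # in-place write never happened
        self.blocks.update(transaction)    # 3. normal write to the final location
        self.journal.remove(transaction)   # 4. journal copy is no longer needed

    def recover(self):
        """Replay committed transactions, as the recovery step does after a crash."""
        for transaction in self.journal:
            self.blocks.update(transaction)
        self.journal.clear()


disk = ToyJournaledDisk()
disk.write({1: "a"})
disk.write({1: "b", 2: "c"}, crash_after_commit=True)
print("before recovery:", disk.blocks)
disk.recover()
print("after recovery:", disk.blocks)
```

The point of the model is that a transaction is either replayed completely or ignored completely, so the disk never ends up with half of an update.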

Ext3 is an enhanced version of Ext2 and is compatible with it. Its on-disk data structures are basically the same as those of Ext2, with journaling added on top. A file system usually has two types of blocks: blocks containing metadata and blocks containing ordinary file data. Ext3 can write both metadata blocks and ordinary file data blocks to the journal, and it provides 3 journal modes:

1>Journal. All changes to file data and metadata are written to the journal. This mode minimizes the chance of losing file modifications but adds extra disk overhead. It is the safest and slowest mode.
2>Ordered. Only changes to file system metadata are written to the journal, but the related file data blocks are guaranteed to be written to disk before the metadata blocks; a copy of the metadata blocks is kept in the journal. This is the default mode of Ext3. It reduces the loss of ordinary file modifications, and because it has to track the relationship between metadata blocks and file data blocks and enforce their write order, it is slightly slower than writeback mode.
3>Writeback. Only changes to file system metadata are written to the journal. The order in which metadata blocks and ordinary file data blocks reach the disk is not constrained; it is determined by the dirty page flushing mechanism of the page cache. This is the mode used by other journaling file systems and is the fastest mode, but files may be damaged if the system fails.

The journal mode can be specified with the mount command when the device is mounted, for example: mount -t ext3 -o data=writeback /dev/sda2 /vpm

The ext4 file system is an improvement on ext3, which in turn is an improvement on ext2. Ext4 is backward compatible with Ext3, and its on-disk data structures are basically the same. Here we mainly look at ext4 as the representative of the ext series.

Ext4 file system:

ext4 also has some obvious limitations. The maximum file size is 16 TiB (approximately 17.6 TB), and the largest volume/partition that can be created with ext4 is 1 EiB (approximately 1,152,921.5 TB). ext4 is much faster than ext3. It is a journaling file system, which means it records the location of files on the disk and any other changes to the disk. In addition, it does not support transparent compression or deduplication and has no native snapshot support; file-level encryption was added only in kernel 4.1.
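
These limits follow from the widths of the block numbers. The sketch below assumes a 4 KiB block size, 32-bit logical block numbers within a file, and 48-bit physical block numbers (the 48-bit addressing is mentioned in the application overview of this article); the variable names are illustrative.

```python
KIB, TIB, EIB = 2**10, 2**40, 2**60
TB = 10**12                                   # decimal terabyte

block_size = 4 * KIB
max_file_bytes = 2**32 * block_size           # 32-bit logical block numbers per file
max_volume_bytes = 2**48 * block_size         # 48-bit physical block numbers

print(max_file_bytes // TIB, "TiB =", round(max_file_bytes / TB, 1), "TB")
print(max_volume_bytes // EIB, "EiB =", round(max_volume_bytes / TB, 1), "TB")
```

The same arithmetic shows why the binary and decimal figures differ: 1 TiB is about 1.1 TB, and the gap grows with each prefix.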

In this file system, each file corresponds to a series of disk blocks. The mapping <file logical block number, disk block number> is kept by storing the disk block numbers in order in the inode. The logical block numbers of a file are always contiguous, but the disk block numbers need not be. A block is usually 4KB, so a large file needs to store a lot of block numbers.

For very large files, one solution is to store block numbers indirectly: some block numbers in the inode point to blocks that hold further block numbers rather than file data, a form of indirect addressing. There are usually three kinds of pointers: some point directly to data blocks of the file, some point to a block of block numbers, and some point to a block holding the block numbers of blocks of block numbers. ext4 instead uses extents to store the <file logical block number, disk block number> mapping. An extent describes a run of consecutive blocks, so its basic fields are the file logical block number, the starting disk block number, and the number of blocks. An inode can store 4 extents directly. For very large files, ext4 uses an extent tree, which is again essentially indirect addressing.
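
An extent lookup can be sketched with the three fields named above. This is a simplified model that keeps the extents in a flat sorted list rather than a real extent tree; the Extent tuple, the lookup function, and the sample values are assumptions for illustration.

```python
from bisect import bisect_right
from collections import namedtuple

# The three basic fields of an extent, as listed in the text
Extent = namedtuple("Extent", "logical_start physical_start length")


def lookup(extents, logical_block):
    """Map a file logical block to a disk block; None if it falls in a hole."""
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1   # last extent starting at or before it
    if i < 0:
        return None
    extent = extents[i]
    if logical_block < extent.logical_start + extent.length:
        return extent.physical_start + (logical_block - extent.logical_start)
    return None


# Hypothetical file: three extents, with a hole between logical blocks 12 and 19
extents = [Extent(0, 1000, 8), Extent(8, 5000, 4), Extent(20, 7000, 2)]
for block in (3, 9, 15, 21):
    print(block, "->", lookup(extents, block))
```

Three records describe 14 blocks here; per-block mapping would need 14 separate block numbers.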

Terms involved:

Disk block: the block device's logical abstraction of the disk. To the file system, the disk is a contiguous sequence of blocks, each usually 4KB in size. The blocks are numbered in order, so each block has a disk block number.

File logical block number: logically, a file is a series of contiguous data blocks, each the same size as a disk block. Physically, a file corresponds to several disk blocks that need not be contiguous. The relationship between logical block numbers and disk block numbers resembles that between virtual and physical addresses: the file appears to be contiguous data (contiguous logical blocks), while the disk blocks behind it (the physical blocks) may be scattered.

inode: stores the metadata that describes a file. The two most basic and important pieces of information in an inode are the inode number, which identifies the inode, and the information about which disk blocks the file occupies, that is, the mapping between logical block numbers and disk block numbers. The inode stores file metadata, so where are file names stored? Consider the files under / (the root directory): the mapping between their file names and inodes is stored in the data of the root directory. In other words, file names are directory data.

Extent: a run of contiguous physical blocks (up to 128 MiB with a 4 KiB block size) that can be reserved and addressed at once. Extents reduce the amount of block-mapping metadata a file needs, significantly reduce fragmentation, and improve performance when writing large files.
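
The 128 MiB figure can be checked with a short calculation. The sketch assumes an extent can cover at most 2^15 = 32768 blocks and that the file is laid out perfectly contiguously; the function name is illustrative.

```python
import math

BLOCK_SIZE = 4096                 # 4 KiB
MAX_EXTENT_BLOCKS = 2**15         # assumed per-extent limit: 32768 blocks
MIB, GIB = 2**20, 2**30


def mapping_entries(file_bytes):
    """Return (blocks, extents) needed for a perfectly contiguous file."""
    blocks = math.ceil(file_bytes / BLOCK_SIZE)
    extents = math.ceil(blocks / MAX_EXTENT_BLOCKS)
    return blocks, extents


print(MAX_EXTENT_BLOCKS * BLOCK_SIZE // MIB, "MiB per extent")
print(mapping_entries(1 * GIB))
```

For a contiguous 1 GiB file, a handful of extents replaces hundreds of thousands of individual block numbers.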

ext4 is a robust and stable file system. Most people nowadays should be using it as the root file system, but it can't handle all needs. Its limitations are as follows:

1>Although ext4 can address up to 1 EiB (roughly 1,000,000 TiB), you should not really try to. In practice, ext4 is not (and probably never will be) used for much more than 50-100 TiB of data. Red Hat Enterprise Linux contractually supports ext4 file systems only up to 50 TiB and recommends that ext4 volumes not exceed 100 TiB.

2>ext4 is not enough to guarantee the integrity of data. Journaling does not cover many common causes of data corruption: data can still be corrupted physically on disk, and ext4 can neither detect nor repair such damage. ext4 is also purely a file system, not a storage volume manager. Even a safe file system is frightening to have as the root file system when something goes wrong during a kernel upgrade. Unless you have a good reason and are prepared to boot from alternative media and patiently work through kernel modules, GRUB configuration, and DKMS from a chroot, do not move the root file system of an important system away from ext4.

XFS file system:
XFS is about as mainline as a non-ext file system gets under Linux. It is a 64-bit, high-performance journaling file system that can support up to 16 EB (approximately 16 million terabytes) and has been built into the Linux kernel since 2001. It provides high performance for large file systems and high concurrency (that is, a large number of processes all writing to the file system at once). XFS is particularly good at handling large files while providing smooth data transfer.

It supports files of up to 8 EB (approximately 8 million TB) and directory structures with millions of entries. XFS journals metadata, which speeds up crash recovery. An XFS file system can also be defragmented and resized while mounted and active. It is the file system selected and recommended by default in RHEL 7, where the maximum supported XFS partition size is 500 TB.

Starting with RHEL 7, XFS is the default file system of Red Hat Enterprise Linux. For home or small business users it still has some shortcomings, most notably that resizing an existing XFS file system is very painful. Although XFS is stable and fast, there are not enough concrete end-use differences between it and ext4 to recommend it anywhere it is not the default. XFS is in no way a "next generation" file system like ZFS, Btrfs, or even WAFL (a proprietary SAN file system). Like ext4, it should be seen as a stopgap on the way to something better.

Under high concurrency, xfs performs about 5-10% better than ext4. Its I/O utilization is significantly lower than that of ext4, but its CPU usage is higher. Below 5000 QPS/TPS there is no significant difference between ext4 and xfs. Stress testing shows thread_running jitter for xfs under high concurrency (72 concurrent threads), while ext4 stays relatively stable.

XFS file system features:

Data integrity: Because XFS has journaling enabled, files on disk are no longer damaged by unexpected downtime. No matter how many files and how much data the file system holds, it can restore the disk contents in a short time from the recorded journal.

Transmission characteristics: XFS uses optimized algorithms, so journaling has very little impact on overall file operations. XFS looks up and allocates storage space very quickly and delivers consistently fast response times. In tests of the XFS, JFS, Ext3, and ReiserFS file systems, XFS performed outstandingly.

Scalability: XFS is a full 64-bit file system that can support millions of terabytes of storage. It handles both very large and small files well and supports directories with very large numbers of entries. The maximum supported file size is 2^63 bytes ≈ 9 x 10^18 bytes (9 exabytes), and the maximum file system size is 18 exabytes. XFS uses efficient table structures (B+ trees) so that the file system can search and allocate space quickly. It keeps delivering high-speed operation, and its performance is not limited by the number of directories or the number of files in a directory.

Transmission bandwidth: XFS can store data at a speed close to that of raw device I/O. In tests of a single file system, throughput reached up to 7GB per second, and for reads and writes of a single file it reached 4GB per second.

Capacity: XFS is a 64-bit file system that supports a single file system of up to 8 exbibytes minus 1 byte. The size achievable in practice depends on the maximum block limit of the host operating system; on a 32-bit Linux system, file and file system sizes are limited to 16 tebibytes.

When using xfs, you may see an out-of-space error even though plenty of disk space remains. By default, xfs stores inodes in the first 1T of the disk, and once that region is full, the file system reports that the disk is out of space. The solution is to specify the inode64 option when mounting:

mount -o remount -o noatime,nodiratime,inode64,nobarrier /dev/sdb1 /backup

This was resolved in kernel 3.7 and later, where inode64 is part of the default mount options, as follows:

(rw,noatime,attr2,inode64,sunit=128,swidth=512,noquota)
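
To check whether a mounted file system actually has inode64 in effect, the options column of /proc/mounts can be inspected. The sketch below parses one line in that format; the function name is illustrative, and the sample line is a hypothetical entry assembled from the device, mount point, and options shown above.

```python
def mount_options(proc_mounts_text, mount_point):
    """Return (fs_type, [options]) for a mount point, or None if not mounted."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            return fields[2], fields[3].split(",")
    return None


# Hypothetical sample; on a real system use open("/proc/mounts").read()
SAMPLE = ("/dev/sdb1 /backup xfs "
          "rw,noatime,attr2,inode64,sunit=128,swidth=512,noquota 0 0\n")
fs_type, options = mount_options(SAMPLE, "/backup")
print(fs_type, "inode64" in options)
```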

Note: "By default, an XFS file system is mounted with write barriers enabled. This feature flushes the write-back cache of the underlying storage device at appropriate times, in particular when XFS writes to its journal. It is meant to ensure file system consistency, but its implementation depends on the device, and not all underlying hardware supports cache flush requests. When XFS is deployed on a logical device provided by a hardware RAID controller with a battery-backed cache, this feature can degrade performance significantly, because the file system code cannot know that the cache is non-volatile. If the controller honors the flush requests, data is written to the physical disks more often than necessary. To avoid this, for devices that protect cached data across a power failure or other host failure, the XFS file system should be mounted with the nobarrier option."

ZFS file system:

ZFS was developed by Sun Microsystems and named after the zettabyte (equivalent to 1 trillion GB), because it can theoretically address very large storage systems.

As a true next-generation file system, ZFS provides volume management (the ability to handle multiple separate storage devices in a single file system), block-level cryptographic checksums (which detect data corruption with extremely high accuracy), automatic repair of corruption (where redundant or parity storage is available), fast asynchronous incremental replication, inline compression, and more.
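
The combination of per-block checksums and redundancy is what makes automatic repair possible. The following toy sketch shows the idea on a two-way mirror; it is not ZFS code, SHA-256 is used only as a stand-in checksum, and the function names and sample data are assumptions for illustration.

```python
import hashlib


def checksum(data):
    """Stand-in block checksum (SHA-256 chosen for the sketch)."""
    return hashlib.sha256(data).hexdigest()


def read_with_repair(copies, expected):
    """Return an intact copy and overwrite any corrupt copies with it."""
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("no intact copy left")      # detected, but not repairable
    for i, copy in enumerate(copies):
        if checksum(copy) != expected:
            copies[i] = good                      # heal the damaged copy
    return good


original = b"payroll records"
expected = checksum(original)                     # stored apart from the data block
mirror = [b"payroll recorXs", original]           # first copy silently corrupted
data = read_with_repair(mirror, expected)
print(data == original, mirror[0] == original)
```

Without the checksum, the reader could not tell which of the two mirror copies is the damaged one, which is the gap left by a plain journaling file system.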

From the perspective of Linux users, the biggest problem with ZFS is licensing. ZFS is licensed under the CDDL, a semi-permissive license that conflicts with the GPL. There is a lot of controversy about what using ZFS with the Linux kernel implies, with opinions ranging from "it is a GPL violation" to "it is a CDDL violation" to "it is completely fine, it just has not been tested in court." Most notably, Canonical has included the ZFS code in its default kernels since 2016, and there has been no legal challenge so far.

Currently, ZFS is not recommended as the root file system on Linux. If you want to take advantage of ZFS on Linux, set up a small ext4 root file system and use ZFS for the rest of your storage, data, and applications, but keep the root partition on ext4 until your distribution explicitly supports a ZFS root.

Btrfs file system:

Btrfs is short for B-Tree Filesystem and is usually pronounced "butter". It was announced by Chris Mason in 2007 during his tenure at Oracle. Btrfs shares most of its goals with ZFS, providing multiple-device management, per-block checksumming, asynchronous replication, inline compression, and more. btrfs can span multiple hard drives.

As of 2018, Btrfs is fairly stable and usable as a standard single-disk file system, but it should probably not be relied on as a volume manager. Compared with ext4, XFS, or ZFS it has serious performance problems in many common use cases, and its next-generation features (replication, multi-disk topologies, and snapshot management) can be quite buggy, with results ranging from catastrophically reduced performance to actual data loss.

The maintenance status of Btrfs is controversial: SUSE Enterprise Linux adopted it as the default file system in 2015, while Red Hat announced in 2017 that it would no longer support Btrfs starting with RHEL 7.4. It is worth noting that supported production deployments use Btrfs as a single-disk file system rather than as a multi-disk volume manager like ZFS.

3. Application overview

ext2: Uses 32-bit internal addressing and offers GB-level maximum file sizes and TB-level file system sizes; long file names of up to 255 characters are supported. If the system crashes or loses power while data is being written to disk, catastrophic data corruption is likely, and performance suffers badly from fragmentation (a single file stored in multiple places). ext2 is still used in some special cases today, most commonly as the file system of portable USB drives.

ext3: Uses 32-bit addressing, which limits it to a maximum file size of 2 TiB and a maximum file system size of 16 TiB. The Ext3 block allocator can only allocate one 4KB block at a time when data is written. ext3 supports at most 32,000 subdirectories and provides timestamps with one-second granularity. ext3 does not checksum its journal, which is a problem with disks or controllers that have their own caches outside the kernel's direct control: if such a controller or disk writes out of order, it can break the ordering of ext3 journal transactions and corrupt files written during (or some time before) a crash. Running e2defrag on an ext3 file system may cause catastrophic corruption and data loss. The main advantage of ext3 is journaling: a journaling file system shortens recovery after a crash because fsck no longer has to check the consistency of all file system metadata after every crash.

ext4: Suited to disks with fewer small files. It uses 48-bit addressing; in theory it can allocate files of up to 16 TiB, and the file system can be as large as 1 EiB (roughly 1,000,000 TiB). The number of subdirectories is not limited. Ext4's multi-block allocator can allocate several data blocks at once. It introduces extents, a data structure that stores the mapping between file logical block numbers and disk block numbers and improves on the block mapping in the Ext3 inode, saving the repeated disk accesses needed to read the block map of a large file. It adds journal checksums, which verify the correctness and integrity of the journal, and it supports nanosecond timestamps. ext2 and ext3 do not directly support online defragmentation (defragmenting the file system while it is mounted), but ext4 does through e4defrag, an online, kernel-mode, file-system-aware, block- and extent-level defragmentation utility.

ext4 is much faster than ext3. Under ext3, fsck checks the entire file system, including deleted and empty files. ext4 instead marks unallocated blocks and sections of the inode table as such, so fsck can skip them entirely. This greatly reduces the time fsck takes on most file systems and was implemented in kernel 2.6.24. ext4 has redundant superblocks, and checksumming the metadata in them lets the file system determine by itself whether the primary superblock is damaged and a backup has to be used. Recovering from a damaged superblock is possible without checksums, but the user first has to notice the damage and then try to mount the file system manually from an alternate superblock. Next-generation file systems such as Btrfs or ZFS provide far stronger per-block checksumming.

As a traditional file system, Ext4 is very mature and stable, but it is increasingly ill-suited to growing storage requirements. For example, although the Ext4 directory index uses a hash index tree, its height is still limited to 2. In tests, performance degraded severely once a single Ext4 directory held more than 2 million files. Because of its historical on-disk structure, Ext4's inode numbers are 32 bits wide, so it can hold at most about 4 billion files. A single Ext4 file can be at most 16T (with a 4K block size).

xfs: Suited to medium-scale disks with many small files. XFS manages metadata with B+ trees. From the recorded journal it can restore disk contents in a short time, and thanks to optimized algorithms journaling has very little impact on overall file operations. It is a full 64-bit file system that can support millions of terabytes of storage and can store data at a speed close to that of raw device I/O. Ext4 is constrained by its disk structure and by compatibility, so it is less extensible and scalable than XFS. With many files, large file systems, and high space utilization, xfs still has the advantage over ext4.

vfat : FAT32, using a 32-bit file allocation table, supporting a maximum partition of 128GB and a maximum file of 4GB

tmpfs is a memory-based file system, so tmpfs data will not be retained after reboot.

NTFS: Supports a maximum partition of 2TB and a maximum file of 2TB. Security and stability are very good, and files rarely become fragmented.

4. Performance comparison of the ext3, ext4, xfs and btrfs file systems

1> Single-byte write performance comparison
2> Block write performance comparison (the hard disk is a block device, so this test is more meaningful)
Performance is similar across the file systems, but in efficiency (CPU utilization) xfs is best, followed by EXT4, EXT3, and BTRFS.

3> Direct block sequential read and write (with all system and file caches turned off)
EXT3/4 is the best choice, followed by BTRFS, and finally XFS.

4> Random seek
BTRFS is the worst, at less than 20 seeks/sec; EXT3 performs best, so it is the better choice for software that does a lot of random seeking.

5>Create and delete a large number of files (a fixed number of files)
BTRFS has the worst performance; EXT4 is the most efficient and best-performing, followed by XFS and EXT3.

6>Sequential read and write throughput [without fsync(), 100 writes per fsync(), and 1 write per fsync()]
With 100 writes per fsync(), performance is similar. With 1 write per fsync(), EXT3 performs best, followed by XFS, EXT4, and BTRFS. Write + fsync() also affects read performance under BTRFS.
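
The three write patterns compared in this test can be reproduced with a small Python sketch. It only illustrates the access pattern and counts the fsync() calls; it is not the benchmark used for the figures above, and the record size, record count, and function name are arbitrary assumptions.

```python
import os
import tempfile


def write_records(path, records, writes_per_fsync):
    """Write records, calling fsync() after every writes_per_fsync writes (0 = never)."""
    fsync_calls = 0
    with open(path, "wb") as f:
        for count, record in enumerate(records, start=1):
            f.write(record)
            if writes_per_fsync and count % writes_per_fsync == 0:
                f.flush()                 # push Python's buffer to the kernel
                os.fsync(f.fileno())      # ask the kernel to push it to the device
                fsync_calls += 1
    return fsync_calls


records = [b"x" * 512] * 200
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    results = {n: write_records(path, records, n) for n in (0, 100, 1)}
    final_size = os.path.getsize(path)
print(results, final_size)
```

Timing each variant on different file systems would show how expensive a forced flush is for each of them.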

7>Random read and write throughput
EXT3 has the best random write performance, which suits databases, high-volume logging programs, and virtual machine systems; for databases, EXT3 is the best choice.

8>Sequentially create 128 files of 16 MB each (2 GB in total) and compare the fragmentation produced by each file system
EXT4 and XFS, which use delayed allocation, produce less fragmentation than EXT3 (even with one write per fsync()); fragmentation is a serious problem for BTRFS.


References: https://www.linuxidc.com/Linux/2018-09/154065.htm, https://cloud.tencent.com/developer/article/1460643

To learn more about Linux, you can refer to the book "Linux Just Learn", which is detailed and easy to understand. After that, you can also read "Kubernetes Authoritative Guide: Full Contact from Docker to Kubernetes Practice", which provides good guidance for related work.


Origin blog.csdn.net/ximenjianxue/article/details/115190999