XSKY Interpretation of the FAST'20 Paper "How to Copy Files": On Efficiently Cloning File Directory Trees

 

01 Introduction

This paper, published at the FAST '20 conference, is about how to efficiently clone file directory trees. It was written by Yang Zhan and Yizheng Jiao of the University of North Carolina at Chapel Hill, together with researchers from Rutgers University, Pace University, Stony Brook University, and VMware Research.

 

This article is not a complete translation of the paper; it is an interpretation of some of its key content, and the interpretation may not be accurate in places. Interested readers are encouraged to read it alongside the original paper to broaden their thinking.

 

The figures and tables used in the article, unless otherwise noted, are from the original paper.

 

02 Overview

The author first introduces the background and importance of fast cloning. With the widespread adoption of virtualization, many applications have a strong demand for fast clones. Typically, creating a virtual machine quickly requires quickly cloning the VM's "root filesystem", which is an image disk file. In container scenarios, for example, Docker frequently needs to copy a specific file system directory tree, which calls for an efficient cloning mechanism.

For these scenarios, existing file systems already implement some "logical copy" functionality, as opposed to "physical copy". In general, a logical copy only performs a metadata-level copy at clone time; file data is copied and modified on demand only when it is later modified. This mechanism is called COW (copy-on-write). It greatly improves both the timeliness of copying and space utilization, and has become the basic technology behind cloning. In practice there are block-granularity COW and file- or directory-level COW, such as the cp --reflink support in Btrfs and XFS.

Here the author points out that classic COW techniques mainly suffer from a "copy granularity" problem. With file-level COW, a small modification can lead to an expensive copy: modifying a single byte of a 1 GB file still triggers a copy of the whole 1 GB file, which greatly increases the latency of the first COW write and inevitably wastes space. Conversely, if the copy granularity is too small, the first COW write is cheap, but fragmentation becomes likely (imagine a large file with a 4 KB copy granularity: random writes produce many scattered 4 KB COW blocks), and subsequent sequential reads become very slow because those 4 KB blocks are not contiguous on disk. The author therefore proposes his own criteria for efficient cloning, which he calls "nimble clones"; a qualifying implementation must satisfy the following four properties:

  • Cloning must complete quickly.
  • Reads must have good locality, so that logically related sets of files are read quickly, and read performance must stay consistent after COW.
  • Writes must be fast, both to the original file set and to the cloned file set.
  • Space utilization must be good, and write amplification should be kept as low as possible.

In Figure 1, the author measures several existing COW file systems: a directory is cloned multiple times, a small part of its content is modified after each clone, and grep is then used to scan the file contents to observe read degradation:

 

 

The figure shows that read performance degrades after every clone-and-modify round. For XFS and ZFS, performance degrades by roughly 3x to 5x after 16 rounds. Btrfs performs better, degrading by only about 50%. Overall, all of these file systems exhibit monotonically declining performance.

The author therefore argues that the key is to decouple the copy size from the write size in COW, that is, they should not be forced to be the same. If a large region of a file is modified, it is reasonable to copy a large block and then overwrite it. But for a small modification, the change should be staged temporarily and aggregated with other small changes; processing them in batches once enough have accumulated is clearly more reasonable.

Based on the existing BetrFS, this paper implements an efficient cloning mechanism that satisfies the aforementioned nimble-clone properties. The author introduces a technique called CAW (Copy-on-Abundant-Write) to stage small modifications and defer the actual copy until an opportune time. In addition, the paper improves the original Bε-tree data structure in three ways: 1. the Bε-tree is transformed into a Bε-DAG (directed acyclic graph), so that nodes can be shared while the whole structure remains traversable; 2. a GOTO message is introduced, which lets a clone operation be persisted quickly; 3. a "translation prefix" is introduced so that queries work correctly on lazily copied, partially shared data. With these optimizations, the experimental results show performance improvements ranging from 33% to 6.8x, depending on the workload.

The paper's contributions boil down to:

  • Designed and implemented the Bε-DAG data structure as the basis of nimble clones. It extends the Bε-tree and is used to gather small changes and write them out in batches at an opportune time.
  • A write-optimized clone implementation: a clone operation simply writes a GOTO message to the root node of the DAG.
  • A quantitative asymptotic analysis showing that adding clones does not affect the cost of other operations, and that the cloning algorithm's overhead is logarithmic.
  • Comprehensive tests showing that the optimized BetrFS does not regress against the original BetrFS baseline. For the clone feature, performance is 3-4x better than traditional file systems that support cloning, and two orders of magnitude better than file systems that do not.

 

03 BetrFS Background Knowledge

To understand this paper in depth, you need some background on BetrFS. BetrFS is a research file system developed by the same team; from 2015 to now they have published many papers based on it. For details, see the official BetrFS website.

BetrFS is an in-kernel local file system built on key-value storage. Traditional file systems such as XFS and Ext4 manage and organize metadata and data with inodes and B-trees (or B-tree variants); BetrFS instead is based on key-value pairs and has two KV stores. The metadata KV store maps the full path of a file (fullpath) to its file system metadata (struct stat). The data KV store maps {fullpath + block number} to a 4 KB block.

 

 

Picture source: Appendix 1
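
To make the two key schemas concrete, here is a minimal Python sketch; the paths, encoding, and field names are hypothetical and not BetrFS's real on-disk format:

```python
# Minimal sketch of BetrFS's two key-value schemas (illustrative only;
# not the real on-disk encoding).

BLOCK_SIZE = 4096

def meta_key(fullpath: str) -> bytes:
    """Metadata KV key: the file's full path."""
    return fullpath.encode()

def data_key(fullpath: str, block_no: int) -> bytes:
    """Data KV key: full path plus a 4 KB block number."""
    # A fixed-width block number keeps all blocks of one file adjacent in key order.
    return fullpath.encode() + b"\x00" + block_no.to_bytes(8, "big")

# Writing a small file touches one metadata entry and one data entry.
meta_store = {meta_key("/home/u/doc/a.txt"): {"size": 5, "mode": 0o644}}
data_store = {data_key("/home/u/doc/a.txt", 0): b"hello".ljust(BLOCK_SIZE, b"\x00")}
```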

The KV store used by the BetrFS backend is a modified, in-kernel version of TokuDB. Although it is a KV store, its Bε-tree implementation is different from the LSM-tree-based LevelDB or RocksDB we commonly see; internally it is actually closer to a B-tree, and the data structure can fairly be called a "better" B-tree.

In Appendix 1, the author's team details the advantages of this tree and the write optimizations behind BetrFS. One of the most important points is that the tree's interior nodes reserve buffer space. As shown in the figure below, the pivots play the same role as the keys in a traditional B-tree, storing the separator keys, while the buffer temporarily stages small writes to files, which then gradually "flow" from the root node down to the leaf nodes. Nodes are typically sized at 2 MB to 4 MB, and when a node's buffer is about to fill up, its contents are flushed to the next level. This optimization turns small IO writes into large batched writes, greatly improving small-write performance.

 

 

Picture source: Appendix 2
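
The buffering idea can be sketched roughly as follows (a toy two-level structure, with a tiny buffer limit standing in for the 2-4 MB node size; not the real Bε-tree code):

```python
# Toy sketch of Bε-tree-style write buffering: small writes collect in a
# node's buffer and are flushed to children in batches (illustrative only).
import bisect

class Node:
    def __init__(self, pivots=None, children=None, buffer_limit=4):
        self.pivots = pivots or []        # separator keys, as in a B-tree
        self.children = children or []    # len(children) == len(pivots) + 1
        self.buffer = {}                  # staged messages: key -> value
        self.buffer_limit = buffer_limit  # stand-in for "buffer nearly full"
        self.items = {}                   # leaf contents when there are no children

    def child_for(self, key):
        return self.children[bisect.bisect_right(self.pivots, key)]

    def upsert(self, key, value):
        if not self.children:             # leaf: apply the write directly
            self.items[key] = value
            return
        self.buffer[key] = value          # interior node: only stage the message
        if len(self.buffer) >= self.buffer_limit:
            self.flush()                  # push a whole batch one level down

    def flush(self):
        for key, value in sorted(self.buffer.items()):
            self.child_for(key).upsert(key, value)
        self.buffer.clear()

root = Node(pivots=["m"], children=[Node(), Node()])
for k in ["a", "b", "n", "z"]:
    root.upsert(k, k.upper())             # the 4th upsert triggers a batched flush
```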

Range queries

Because the KV store is keyed by fullpath, file names like the ones shown below share a common prefix when stored, so in practice they mostly land on the same nodes. When doing a range query, for example traversing all subdirectories or files under a directory, the IO issued to disk is also largely sequential, so traversal is very fast. Deletion is likewise easier to optimize: for example, a delete message can mark a key as deleted, with the actual removal performed later.
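
A small sketch of why sorted fullpath keys turn directory traversal into a contiguous range scan (hypothetical paths, with an in-memory list standing in for the on-disk key order):

```python
# Sketch: with fullpath keys kept in sorted order, listing a directory is a
# contiguous range scan over the key space (illustrative only).
import bisect

keys = sorted([
    "/home/u/doc/a.txt",
    "/home/u/doc/b.txt",
    "/home/u/doc/sub/c.txt",
    "/home/u/music/d.mp3",
    "/var/log/syslog",
])

def scan_prefix(sorted_keys, prefix):
    """Return all keys under `prefix` as one contiguous slice."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_right(sorted_keys, prefix + "\xff")  # end of the prefix range
    return sorted_keys[lo:hi]

print(scan_prefix(keys, "/home/u/doc/"))
# ['/home/u/doc/a.txt', '/home/u/doc/b.txt', '/home/u/doc/sub/c.txt']
```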

 

 

Rename operation based on fullpath

Rename operations are more challenging with fullpath keys. Recall that in the Bε-tree all keys are full paths, so when a folder is renamed, say folder A is renamed to B, the keys of all subfolders and files under A must be modified accordingly. The solution the team gave in the paper in Appendix 3 is a range rename operation. The key idea is that an internal pointer refers to the subtree at the rename's source and destination, and the range rename is reduced to flipping this pointer, the so-called "pointer swing"; the affected subtrees then heal themselves lazily to complete the rename. For more details, see the paper in Appendix 3; it is not expanded on here.

Crash consistency after power failure

Crash consistency after power failure has always been a key research issue for file systems. Typical examples such as ext4 and XFS use a journal and transactional semantics to ensure that the file system can still be restored to a consistent state after an unexpected power failure. BetrFS's Bε-tree uses a similar mechanism: any modification to the Bε-tree must be logged first, and the Bε-tree checkpoints periodically, for example every 60 s; after a checkpoint completes, the redo log is trimmed. After an unexpected power failure, replaying the redo log restores a consistent state.
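
As a rough illustration of this log-then-checkpoint discipline (a generic toy, not BetrFS's actual redo-log format or recovery code):

```python
# Toy write-ahead log with periodic checkpoints (illustrative only).
# Every update is logged before it is applied; a checkpoint persists the
# current state and trims the log; recovery replays the surviving log.

class SimpleStore:
    def __init__(self):
        self.state = {}        # in-memory tree/table contents
        self.checkpoint = {}   # last durably written snapshot
        self.redo_log = []     # entries written since the checkpoint

    def put(self, key, value):
        self.redo_log.append(("put", key, value))  # log first ...
        self.state[key] = value                    # ... then apply

    def take_checkpoint(self):
        self.checkpoint = dict(self.state)         # persist a snapshot
        self.redo_log.clear()                      # trim the redo log

    def recover_after_crash(self):
        self.state = dict(self.checkpoint)         # start from the snapshot
        for op, key, value in self.redo_log:       # replay surviving log entries
            if op == "put":
                self.state[key] = value

store = SimpleStore()
store.put("a", 1)
store.take_checkpoint()    # e.g. every 60 s in BetrFS
store.put("b", 2)          # logged but not yet checkpointed
store.recover_after_crash()
assert store.state == {"a": 1, "b": 2}
```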

 

04 Cloning in BetrFS 0.5

In this chapter the author describes how the cloning feature is implemented in BetrFS, starting from the basic semantics of cloning.

Cloning operation semantics

 

 

The clone operation is atomic: it either succeeds or fails as a whole, and no intermediate state is visible.

From the perspective of the KV store's key space, clone(s, d) copies every key prefixed with s to a new key prefixed with d, and at the same time deletes any pre-existing keys that start with d.
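
Ignoring efficiency for a moment, the logical effect of clone(s, d) on a flat key space can be sketched as follows (illustrative Python, not the BetrFS interface):

```python
# Logical semantics of clone(s, d) on a flat key space (illustrative only):
# every key prefixed with s is copied to the same key with prefix d, and any
# pre-existing keys prefixed with d are removed.

def clone(kv: dict, s: str, d: str) -> dict:
    out = {k: v for k, v in kv.items() if not k.startswith(d)}  # drop old d keys
    for k, v in kv.items():
        if k.startswith(s):
            out[d + k[len(s):]] = v                             # copy s -> d
    return out

kv = {"/src/a": 1, "/src/b": 2, "/dst/old": 9}
print(clone(kv, "/src", "/dst"))
# {'/src/a': 1, '/src/b': 2, '/dst/a': 1, '/dst/b': 2}
```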

Lifted Bε-DAGs

The essence of cloning is to add a new edge to the Bε-tree so that, when the cloned folder is accessed, the pre-clone data can still be reached (immediately after a clone, all data is fully shared until it is modified). A tree-shaped data structure like the Bε-tree cannot express this kind of shared access, so it has to be generalized. A DAG (directed acyclic graph) can do this: as shown in the figure below, the bottom node in the DAG can be reached via the two paths above it.

 

 

Picture source: Appendix 4

When extending the Bε-tree, the author emphasizes that three problems need to be solved:

1. Because a node may be reached from multiple paths, a reference count must be maintained for each node. The reference count is not stored in the node's own data structure but in a separate node translation table, so updating a reference count does not require modifying or locking the node itself, which helps performance.

2. In the Bε-tree, nodes are usually set fairly large, for example 2 MB to 4 MB, which means a single node holds many keys. Sharing a target node therefore also shares keys that are irrelevant to the clone, as shown in Figure 2 below:

 

 

The shared node below contains the whole key space prefixed with s, but it also contains keys starting with q and v. Those keys are redundant for the clone and must not be visible through it, so the author filters the keys in the pivots region to screen out the redundant ones.

3. The translation prefix technique is used. As shown in Figure 2 above, after the directory tree prefixed with s is cloned to the prefix p, a pointer is inserted in the upper node that points to the shared node below. When a later query looks up pw, the pivots route it to that shared node, but because the node is reached through the clone, a translation prefix is needed: the p prefix is stripped first, and the lower nodes are queried using sw instead. This translation is only temporary; when the clone is eventually converted, the node becomes a normal node again, as described later. A small sketch of this lookup rewrite follows.
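
A minimal sketch of the translation-prefix rewrite during a lookup (hypothetical structures; in the real Bε-DAG the prefixes live on the edges and in the pivots):

```python
# Sketch of a translation prefix on a cloned edge (illustrative only).
# The shared subtree is still keyed under the source prefix "s"; queries that
# arrive via the clone under prefix "p" are rewritten before descending.

shared_subtree = {"s/w": "data-w", "s/x": "data-x"}   # node shared with the clone

class Edge:
    def __init__(self, strip, translation):
        self.strip = strip              # prefix lifted out at this edge ("p")
        self.translation = translation  # translation prefix to prepend ("s")

    def lookup(self, key):
        assert key.startswith(self.strip)
        rewritten = self.translation + key[len(self.strip):]
        return shared_subtree.get(rewritten)

clone_edge = Edge(strip="p", translation="s")
print(clone_edge.lookup("p/w"))   # -> "data-w", served from the shared subtree
```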

Creating clones with GOTO messages

The paper next describes how a clone is created. First, as shown in Figure 3 below, to clone all keys starting with s to keys starting with p, we must find the node that covers all keys starting with s, called the LCA (lowest common ancestor):

 

 

The next step is a flush: all the buffered messages for keys starting with s that sit between the root node and the LCA are pushed down into the LCA, ensuring that the LCA covers the complete key space starting with s. Once the flush completes, a GOTO message is inserted at the root node and the clone is done.

So what does this GOTO message do? It is the heart of the clone operation itself. Its contents are roughly:

(a, b) - height - dst_node

(a, b) is the key range covered, height is the height of the target node, and dst_node identifies the destination node.

For example, if the key being queried is x, we first check whether x falls within the key range (a, b). If it does, the search jumps directly to dst_node and continues from there. In other words, a GOTO message redirects the search.
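
A toy sketch of how a GOTO message redirects lookups, and of how a clone amounts to installing one such entry at the root (illustrative only; the real message also carries the height and interacts with key lifting):

```python
# Toy sketch of GOTO-message routing (illustrative only). A clone(s, d) just
# records one GOTO entry at the root; queries in [lo, hi) are redirected to
# dst_node, rewriting the d prefix back to s on the way down.

class Root:
    def __init__(self, pivots):
        self.pivots = pivots      # ordinary pivot routing: (separator, child)
        self.gotos = []           # list of (lo, hi, dst_node, strip, translation)

    def clone(self, s, d, lca_node):
        # (d, d + "\xff") covers every key prefixed with d; height is omitted here.
        self.gotos.append((d, d + "\xff", lca_node, d, s))

    def route(self, key):
        for lo, hi, dst, strip, translation in reversed(self.gotos):
            if lo <= key < hi:                       # newest matching GOTO wins
                return dst, translation + key[len(strip):]
        for pivot, child in self.pivots:             # otherwise, normal pivots
            if key < pivot:
                return child, key
        return self.pivots[-1][1], key

lca = "node-covering-prefix-s"
root = Root(pivots=[("m", "left-child"), ("\xff", "right-child")])
root.clone("s/", "p/", lca)
print(root.route("p/w"))   # -> ('node-covering-prefix-s', 's/w')
print(root.route("a"))     # -> ('left-child', 'a')
```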

Flushing GOTO messages

A GOTO message can slide from the root node toward lower nodes via flushing. This message mechanism encodes the routing information needed to traverse the DAG, which is what makes the clone operation so efficient. The power of the GOTO message is that all subsequent queries follow the new path it establishes; in other words, a GOTO message also implicitly deletes all the old keys under the destination prefix.

Converting a GOTO

Converting a GOTO message is similar to a flush. As the message slides down, overlapping key ranges must be merged and handled, as shown in the following figure:

 

 

On the left, the GOTO interval (pab, t) contains the pivot interval (pz, r), so when the GOTO message is converted into ordinary pivots, the original (pz, r) key space is deleted and the (pab, t) space is added.

Partial overlap is more complicated. For example, the intervals (pa, pz) and (r, w) on the left of the figure above each partially overlap (pab, t).

 

 

After processing, note that keys in the range (pa, pab) must be given the prefix as2: in the old range (pa, pz), downward queries used the prefix s2, but after the conversion the pa prefix is lifted out, so a must be added back for the keys to stay consistent with the originals. This is the so-called "lifted Bε-tree" technique, detailed in the team's paper in Appendix 3.

Flushes, splits, and merges

The paper next discusses node flushes, splits, and merges. At a high level, two steps are required:

1. First convert the relevant children into simple children.

2. Then perform the standard Bε-tree flush, split, or merge.

A simple child is defined as a child node whose reference count is 1 and whose incoming edge carries no translation prefix.

Once the children are simplified, that part of the Bε-DAG locally degenerates back into a Bε-tree, so node flushes, splits, and merges can proceed in the usual Bε-tree manner.

There are at least two scenarios in which a node needs to be simplified. The first is when a parent node has accumulated enough changes in its buffer to perform a CAW flush. The second is when the background healer thread adjusts a node's fanout, which requires splits and merges. Note that simplifying a node is carried out as part of a flush, split, or merge.

To simplify a node, the first step is to make a private copy of it, as shown in Figure 5:

 

 

With this private copy in hand, the node can be tidied up, as shown in the figure below. The leftmost image is the original private copy; some of its key ranges are redundant and are deleted. The middle image shows the node after the redundant keys have been removed. Finally, the translation prefix is removed as well.

 

 

It is worth noting that these elaborate steps are only triggered after a clone, when data is written and certain conditions are met, or when the background process rebalances the Bε-DAG. Unmodified shared nodes never need to go through them.
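
The simplification steps can be sketched roughly as follows (hypothetical flat-dict node; the real code operates on pivots, buffers, and children):

```python
# Sketch of simplifying a shared child before a flush/split/merge
# (illustrative only): private copy -> drop out-of-range keys -> fold the
# translation prefix into the keys so the edge no longer needs one.
import copy

def simplify_child(shared_node, src_prefix, dst_prefix):
    """shared_node holds keys under various prefixes; only keys under
    src_prefix are reachable through the cloned edge."""
    # 1. Private copy: the original stays shared, the copy has refcount 1.
    node = copy.deepcopy(shared_node)
    # 2. Drop the redundant key ranges that this edge never exposes.
    node = {k: v for k, v in node.items() if k.startswith(src_prefix)}
    # 3. Rewrite the keys so no translation prefix is needed on the edge.
    return {dst_prefix + k[len(src_prefix):]: v for k, v in node.items()}

shared = {"s/w": 1, "s/x": 2, "q/a": 3, "v/b": 4}
print(simplify_child(shared, "s/", "p/"))
# {'p/w': 1, 'p/x': 2}
```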

 

05 Asymptotic Analysis of the Algorithms

The asymptotic analysis of the Bε-DAG shows that the complexity of insert, query, and clone depends only on the height of the Bε-DAG, and that its asymptotic complexity matches that of the lifted Bε-tree. One way to see this: if we flushed all GOTO messages and converted them into ordinary pivots, then removed the GOTO messages and broke the shared-node links, we would obtain a Bε-tree whose height is at most the height of the Bε-DAG.

The heights of the Bε-tree and the Bε-DAG are both O(log_B N); for the former, N is the number of keys in the tree, while for the latter, N is the number of keys before the clone plus the number of keys added by clones.

Query: because the height of the Bε-DAG is O(log_B N), the IO cost of a query is also O(log_B N).

Insertion: the complexity of insertion is the same as for the Bε-tree, namely the standard amortized Bε-tree insert cost of O(log_B N / (ε·B^(1-ε))) IOs.

Cloning: the clone operation has two phases; the first is online and the second runs in the background. The online phase consists only of flushing the messages for keys prefixed with s from above the LCA down into the LCA and inserting a GOTO message at the root node; this costs O(log_B N) IOs. The background phase consists of the background process flushing the GOTO message downward and finally converting it into ordinary pivots; its cost is also O(log_B N).
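
For reference, these bounds can be summarized as follows (with N the number of keys, B the node size, and ε the Bε-tree tuning parameter; this summary follows the standard Bε-tree analysis rather than quoting the paper's table directly):

```latex
% Summary of the asymptotic IO costs discussed above
% (N = number of keys, B = node size, \varepsilon = B^\varepsilon-tree parameter).
\begin{align*}
\text{query}                    &: O(\log_B N) \\
\text{insert (amortized)}       &: O\!\left(\frac{\log_B N}{\varepsilon\, B^{1-\varepsilon}}\right) \\
\text{clone (online phase)}     &: O(\log_B N) \\
\text{clone (background phase)} &: O(\log_B N)
\end{align*}
```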

 

06 Evaluation and Analysis

The evaluation chapter of the paper presents performance results and mainly answers the following questions:

  • Does the cloning feature implemented in BetrFS 0.5 meet the four design goals: can clones be created quickly, do reads preserve locality, are writes fast, and how much space is wasted?
  • Does introducing the cloning feature hurt the performance of the previous version?
  • Can the cloning feature improve the performance of real applications?

The author compares the following file systems: the baseline BetrFS (BetrFS 0.4, without the cloning feature), ext4, Btrfs, XFS, ZFS, and NILFS2.

Clone performance

In the clone performance test, the author creates 8 subdirectories in a single directory, each containing a 4 MB file, and then clones this directory repeatedly; after each clone, 16 bytes (4 KB aligned) are written to each file in the clone to simulate small-block modifications.

Figure 7-a shows the clone latency, Figure 7-b the write latency of the 16-byte modifications, and Figure 7-c the read latency of grep-ing the cloned files:

 

 

Among the compared file systems, Btrfs is tested in two modes, reflink-based and volume-snapshot-based; XFS uses reflink mode and ZFS uses volume snapshots. For BetrFS, both a "no cleaning" mode that disables background processing and a normal mode that allows it are tested. The results show the two BetrFS modes perform essentially the same, although space utilization fluctuates somewhat.

Looking at clone-creation latency, BetrFS needs only about 60 ms. This is 33% faster than XFS, 58% faster than svol-based Btrfs, and an order of magnitude better than ZFS. Moreover, BetrFS's latency barely changes after repeated clones, whereas the latency of Btrfs and XFS grows with the number of clones; after about 8 rounds, their latency roughly doubles.

In terms of write latency, BetrFS is 8-10x faster than the other file systems, mainly thanks to its CAW feature, while the others are essentially COW-based. In addition, none of these file systems shows an obvious write-performance loss as the number of clones increases.

In terms of read latency, since grep is essentially a scan workload, BetrFS is very stable and does not degrade as the number of clones grows, while XFS and ZFS show a clear downward trend. Btrfs also degrades, but less severely: after 8 clones, Btrfs-svol degrades by about 10% and file-level Btrfs by about 20%. The author also points out that after 17 clones, Btrfs degrades by about 50% (not shown in the figure).

Table 1 below shows the space overhead. Each clone-and-modify round costs BetrFS only about 16 KB, much better than the other file systems. Because of the background cleaning in BetrFS, space utilization fluctuates; in no-clean mode, the overhead stabilizes at 32 KB. In short, the results show that, in terms of space overhead, BetrFS achieves its goal and does not waste space.

 

 

So how does BetrFS perform under other file-operation workloads? The author also selected some use cases for evaluation.

Sequential IO

For sequential read and write IO, BetrFS achieves acceptable performance: reads are about 19% slower than ext4, the fastest file system here, and sequential writes are only about 6% slower than Btrfs, the fastest.

 

 

Random IO

In the random read/write test, the author performs 256K random 4-byte reads and writes against a 10 GB file, with the fsync flag set on writes to ensure persistence to disk.

The results show that for random writes, BetrFS 0.5 is roughly 39 to 67 times faster than the traditional file systems, and only about 8.5% slower than BetrFS 0.4 without clones. For random reads, BetrFS 0.5 is only 12% slower than Btrfs, the fastest.

 

 

TokuBench measurement

The author uses TokuBench to run a mass file-creation test: about 3 million 200-byte small files are created across many subdirectories, with each subdirectory kept balanced, i.e. holding no more than 128 subdirectories or files. The results show that BetrFS 0.5 and BetrFS 0.4 perform essentially the same for mass small-file creation, about 95x faster than ext4.

 

 

Directory operation performance

The directory-operation test mainly runs recursive grep, find, and delete. As the table below shows, BetrFS 0.5 and BetrFS 0.4 perform essentially the same, so introducing clones causes no performance loss, and both are orders of magnitude faster than the traditional file systems.

 

 

Performance evaluation of specific applications

For application-level evaluation, the author selected git clone, tar, rsync, and an IMAP server. The results are as follows:

 

 

In the figure, BetrFS 0.5 runs in a directory that has not been cloned, while BetrFS 0.5-clone runs in a cloned directory.

The results show that BetrFS achieves the best performance in most scenarios. In a few scenarios it suffers a slight performance loss, but still within an acceptable range.

Container scenario

The author tests the real-world performance of clone in a Linux Containers (LXC) scenario. The container backend uses the Dir backend by default, which internally copies directories with rsync. A custom backend can also be plugged in, and in that custom backend BetrFS's own clone mechanism can be used efficiently.

 

 

As shown in the table, the container backend built on BetrFS cloning is several times faster than those of ZFS and Btrfs, and several orders of magnitude faster than the traditional Dir implementation.

 

07 Conclusion

By decoupling copying from writing in the CoW mechanism, the author's team built, on top of the existing BetrFS, a file system with efficient clone, read, and write as well as efficient space usage. Judging from the performance results and real-application tests, these seemingly contradictory goals have been achieved. Moreover, several of the data structures and techniques used, such as small-IO aggregation, batched writes, and the Bε-DAG, are fairly general and can be applied to any software built on a key-value store, not just file systems.

 

Appendix 1: FAST '15, BetrFS: A Right-Optimized Write-Optimized File System (http://supertech.csail.mit.edu/papers/JannenYuZh15a.pdf)

Appendix 2: An Introduction to Bε-trees and Write-Optimization (https://www.usenix.org/system/files/login/articles/login_oct15_05_bender.pdf)

Appendix 3: The Full Path to Full-Path Indexing (https://www.usenix.org/system/files/conference/fast18/fast18-zhan.pdf)

Appendix 4: How to Copy Files, FAST '20 slides (https://www.usenix.org/sites/default/files/conference/protected-files/fast20_slides_conway.pdf)

Appendix 5: The original paper, How to Copy Files (https://www.usenix.org/system/files/fast20-zhan.pdf)

Published on 2020-06-03

https://zhuanlan.zhihu.com/p/145642958
