A comparison of common open-source distributed file system architectures

What is a file system?

The file system is an essential component of a computer, providing a consistent way to access and manage storage devices. File systems differ between operating systems, but they share some characteristics that have not changed in decades:

  1. Data is stored as files, accessed through APIs such as Open, Read, Write, Seek, and Close;

  2. Files are organized in a tree of directories, with an atomic Rename operation to move a file or directory to a new location (both are exercised in the sketch after this list).
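
As a concrete illustration, the following sketch exercises exactly these primitives through the POSIX-style calls in Python's os module (the file names are just placeholders):

```python
import os

# Open (creating the file if needed), Write, Seek, Read, Close
fd = os.open("demo.txt", os.O_RDWR | os.O_CREAT, 0o644)
os.write(fd, b"hello, file system")
os.lseek(fd, 0, os.SEEK_SET)      # Seek back to the beginning
data = os.read(fd, 5)             # Read the first 5 bytes -> b"hello"
os.close(fd)

# Atomic Rename: move the file into another directory in one step
os.makedirs("archive", exist_ok=True)
os.rename("demo.txt", os.path.join("archive", "demo.txt"))
```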

The access and management model provided by the file system underpins most computer applications, and Unix's notion that "everything is a file" highlights its central role. However, the complexity of file systems has kept their scalability from keeping pace with the rapid growth of the Internet, and greatly simplified object storage stepped in to fill the gap and grew quickly. Because object storage lacks a tree-structured namespace and does not support atomic rename, it differs fundamentally from file systems and is not discussed in this article.

Challenges of stand-alone file systems

Most file systems are stand-alone, providing access to and management of one or more storage devices within a single operating system. With the rapid development of the Internet, stand-alone file systems face many challenges:

  • Sharing: A stand-alone file system cannot be accessed simultaneously by applications spread across multiple machines; hence protocols such as NFS, which export a stand-alone file system to multiple machines over the network.

  • Capacity: A single machine cannot provide enough space, so data ends up scattered across multiple isolated stand-alone file systems.

  • Performance: A single machine cannot meet the very high read/write throughput some applications require, so the application has to shard its data and read and write multiple file systems at once.

  • Reliability: Bounded by the reliability of a single machine; a machine failure can mean data loss.

  • Availability: Bounded by the availability of a single operating system; failures or maintenance operations such as restarts cause downtime.

As the Internet grew, these problems became increasingly prominent, and a number of distributed file systems emerged to address them.

The following sections introduce the basic architectures of several distributed file systems I am familiar with and compare the strengths and limitations of each.

GlusterFS

GlusterFS is a POSIX-compliant distributed file system developed by Gluster Inc. in the United States (open source under the GPL). Its first public release appeared in 2007, and the company was acquired by Red Hat in 2011.

Its basic idea is to aggregate multiple stand-alone file systems into a unified namespace exposed to users through a layer of stateless middleware. This middleware is built as a stack of translators, each solving one problem, such as data distribution, replication, striping, caching, or locking, and users can combine them flexibly for their specific scenario. For example, a typical distributed volume is shown in the following figure:

Server1 and Server2 form the two-replica Volume0, Server3 and Server4 form Volume1, and the two are combined into a distributed volume with larger capacity.
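
The sketch below is a deliberately simplified illustration of the idea behind such a layout: a distribute layer hashes each file name to pick one replicated subvolume, and that subvolume writes the file to all of its bricks. Real GlusterFS uses elastic hashing with per-directory layout ranges stored in extended attributes, so treat this as a conceptual model rather than its actual algorithm (the server and brick names are placeholders):

```python
import hashlib

# Hypothetical layout mirroring the figure: two replicated subvolumes,
# each backed by two bricks (servers).
subvolumes = [
    ["Server1:/brick", "Server2:/brick"],  # Volume0, 2 replicas
    ["Server3:/brick", "Server4:/brick"],  # Volume1, 2 replicas
]

def place(filename: str) -> list[str]:
    """Hash the file name to pick a replicated subvolume, then return
    every brick the file would be written to."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return subvolumes[h % len(subvolumes)]

print(place("photo.jpg"))   # e.g. ['Server3:/brick', 'Server4:/brick']
print(place("report.pdf"))  # e.g. ['Server1:/brick', 'Server2:/brick']
```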

Advantages: Data files are ultimately stored on ordinary stand-alone file systems with the same directory structure, so there is no need to worry about losing data just because GlusterFS itself becomes unavailable.

There is no obvious single point of failure, and capacity scales linearly.

Support for large numbers of small files should be reasonably good.

Challenges: This structure is fairly static and hard to adjust, and it requires every storage node to have the same configuration. When data or access patterns become unbalanced, capacity and load cannot be rebalanced. Failure recovery is also relatively weak: for example, if Server1 fails, there is no way to re-replicate the files on Server2 to the healthy Server3 or Server4 to maintain data reliability.

Because there is no independent metadata service, every storage node must hold the complete directory structure; traversing a directory or changing the directory layout requires visiting all nodes to get a correct result. This limits the scalability of the whole system: a few dozen nodes is fine, but managing hundreds of nodes efficiently is hard.

CephFS

CephFS began as Sage Weil's doctoral dissertation research, with the goal of distributed metadata management to support exabyte-scale data. In 2012, Sage Weil founded Inktank to continue developing CephFS, and the company was acquired by Red Hat in 2014. It was not until 2016 that CephFS shipped a production-ready stable release (with the metadata portion still running on a single node). Even now, CephFS's distributed metadata remains immature.

Ceph has a layered architecture: the bottom layer is a distributed object store placed by CRUSH (a hash-based placement algorithm), and on top of it sit three APIs: object storage (RADOSGW), block storage (RBD), and a file system (CephFS), as shown in the following figure:

Using one storage system to cover several very different scenarios (virtual machine images, massive numbers of small files, and general file storage) is very attractive, but the system's complexity demands strong operations capability. At present only the block storage layer is relatively mature and widely used; object storage and the file system are less satisfactory, and in several use cases I have heard of, teams gave up on them after a while.

The architecture of CephFS is shown in the following figure:

CephFS is served by the MDS (Metadata Server), one or more stateless metadata services that load the file system's metadata from the underlying OSDs and cache it in memory to speed up access. Because the MDS is stateless, it is relatively easy to configure standby nodes for HA; however, a standby starts with a cold cache and has to warm up again, so recovering from a failure can take a long time.

Because loading data from or writing it back to the storage layer is slow, the MDS has to be multi-threaded to achieve good throughput. Handling many concurrent file system operations greatly increases complexity, making the MDS prone to deadlocks or to performance drops caused by slow I/O. To perform well, the MDS usually needs enough memory to cache most of the metadata, which in turn limits how much it can actually support.

With multiple active MDSs, a part of the directory tree (a subtree) can be dynamically assigned to one MDS and handled entirely by it, enabling horizontal scaling. Multiple active MDSs inevitably require locking mechanisms to negotiate subtree ownership, plus distributed transactions to implement atomic rename across subtrees, all of which is very complex to build. At the time of writing, the official documentation still discourages using multiple active MDSs (keeping extra MDSs as standbys is fine).

GFS

Google's GFS is a pioneer and a representative example of distributed file systems, evolved from the earlier BigFiles. The paper published in 2003 laid out its design philosophy and details and has had a great influence on the industry; many later distributed file systems borrow from its design.

As the name suggests, BigFiles/GFS is optimized for large files and is not suitable for workloads where the average file size is below 1MB. The architecture of GFS is shown in the following figure:

GFS has a Master node that manages metadata (all of it held in memory, with snapshots and update logs written to disk), while files are split into 64MB chunks stored on a number of ChunkServers (which use a stand-alone file system directly). Files are append-only, so there is no need to worry about chunk versions and consistency (the length can serve as the version). Handling metadata and data with completely different techniques greatly simplifies the system and gives it ample scalability (if the average file size is larger than 256MB, the Master can support roughly 1PB of data per GB of memory). Giving up some POSIX features (random writes, extended attributes, hard links, and so on) further simplifies the system in exchange for better performance, robustness, and scalability.
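
To make the bookkeeping concrete, here is a rough sketch under two stated assumptions: 64MB chunks, and roughly 64 bytes of per-chunk metadata on the Master (the order of magnitude suggested by the GFS paper), ignoring per-file namespace metadata:

```python
CHUNK_SIZE = 64 * 2**20    # 64MB chunks
CHUNK_META_BYTES = 64      # assumed per-chunk metadata footprint on the Master

def chunk_index(offset: int) -> int:
    """Map a byte offset within a file to the chunk that holds it."""
    return offset // CHUNK_SIZE

def pb_per_gb_of_master_memory(avg_file_size: int) -> float:
    """Estimate how many PB of file data 1GB of Master memory can describe."""
    chunks_per_file = max(1, avg_file_size // CHUNK_SIZE)
    meta_per_file = chunks_per_file * CHUNK_META_BYTES
    files = (1 * 2**30) // meta_per_file       # files describable with 1GB
    return files * avg_file_size / 2**50       # total data, in PB

print(chunk_index(200 * 2**20))                # offset 200MB falls in chunk 3
print(pb_per_gb_of_master_memory(256 * 2**20)) # ~1.0 PB, matching the estimate above
```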

Thanks to GFS's maturity and stability, Google could more easily build upper-layer applications (MapReduce, BigTable, and so on). Later, Google developed Colossus, a more scalable next-generation storage system that completely separates metadata from data storage, makes metadata distributed (automatically sharded), and uses Reed-Solomon coding to reduce storage footprint and cost.

HDFS

Hadoop, which came out of Yahoo, is an open-source Java implementation of Google's GFS, MapReduce, and related systems. HDFS largely follows the design of GFS, so I will not repeat it here. The following figure shows the architecture of HDFS:

HDFS's reliability and scalability are quite good: there are many deployments with thousands of nodes and 100PB of data, and it performs well for big data applications. Reports of data loss are rare, mostly cases of accidental deletion where no trash (recycle bin) was configured.

HDFS's HA solution was bolted on later and is rather complicated; Facebook, which built the first such HA solution, relied on manual failover for a long time (at least three years) because it did not trust automatic failover.

Because the NameNode is implemented in Java and constrained by its pre-allocated heap size, an undersized heap easily triggers full GC pauses that drag down the performance of the whole system. Some teams have tried rewriting it in C++, but no mature open-source alternative has emerged yet.

HDFS also lacks mature non-Java clients, which makes it inconvenient to use outside the big data ecosystem (Hadoop and similar tools), for example in deep learning workloads.

MooseFS

MooseFS is an open-source distributed POSIX file system from Poland. It also borrows from the GFS architecture, implements most of the POSIX semantics and APIs, and, once mounted through its very mature FUSE client, can be accessed like a local file system. The architecture of MooseFS is shown in the following figure:

MooseFS supports snapshots, which makes it convenient for data backup and recovery.

MooseFS is implemented in C. Its Master is a single-threaded, asynchronous, event-driven server, similar to Redis. However, the networking code uses poll rather than the more efficient epoll, so CPU usage becomes very heavy once concurrency reaches around 1,000 connections.

The open-source community edition has no HA and uses a metalogger for asynchronous cold backup of metadata; the closed-source paid edition provides HA.

To support random writes, chunks in MooseFS are mutable, and a version management mechanism is used to keep replicas consistent. This mechanism is complicated and prone to odd problems; for example, after a cluster restart, a few chunks may end up with fewer replicas than expected.
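
The following is a purely conceptual sketch (not MooseFS's actual implementation) of how version numbers can validate replicas: each successful mutation bumps the chunk version, and any replica reporting an older version after a restart is treated as stale and discarded, which can temporarily leave a chunk with fewer valid copies than expected:

```python
EXPECTED_COPIES = 2

def check_chunk(expected_version: int, replica_versions: dict[str, int]):
    """Classify the replicas reported by chunkservers for one chunk."""
    valid = [s for s, v in replica_versions.items() if v == expected_version]
    stale = [s for s, v in replica_versions.items() if v < expected_version]
    missing = max(0, EXPECTED_COPIES - len(valid))
    return valid, stale, missing

# Chunkserver B missed the last write, so its replica is stale and dropped,
# leaving the chunk under-replicated until a new copy is made.
valid, stale, missing = check_chunk(
    expected_version=7,
    replica_versions={"chunkserver-A": 7, "chunkserver-B": 6},
)
print(valid, stale, missing)   # ['chunkserver-A'] ['chunkserver-B'] 1
```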

JuiceFS

GFS, HDFS, and MooseFS were all designed for the hardware and software environment of a self-built data center, where data reliability and node availability are handled together by keeping multiple replicas across multiple machines. On virtual machines in a public or private cloud, however, the block device is already a virtual block device designed for three-replica reliability; if the file system adds its own multi-machine replication on top, the cost of storing data stays very high (in effect, nine copies).
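
The arithmetic behind the "nine copies" figure, assuming three replicas at each layer:

```python
fs_replicas = 3             # replicas kept by the distributed file system
block_device_replicas = 3   # replicas behind each cloud virtual block device

effective_copies = fs_replicas * block_device_replicas
print(effective_copies)                           # 9 physical copies of every byte

# Raw capacity consumed relative to relying on the cloud's 3-replica durability alone
print(effective_copies / block_device_replicas)   # 3x
```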

So we designed JuiceFS, adapting the architecture of HDFS and MooseFS to the public cloud. Its architecture is shown in the following figure:

JuiceFS replaces the DataNode and ChunkServer with the object storage already available in the public cloud, resulting in a fully elastic, serverless storage system. Public cloud object storage already solves the safe and efficient storage of large-scale data, so JuiceFS only needs to focus on metadata management, which also greatly reduces the complexity of the metadata service (the masters of GFS and MooseFS have to handle both metadata storage and block health management). We have also made many improvements to the metadata layer, including Raft-based high availability from the start. Providing a truly highly available, high-performance metadata service remains very challenging to manage and operate, which is why metadata is offered to users as a service. Because the POSIX file system API is the most widely used API, we implemented a highly POSIX-compatible client on top of FUSE; with a command-line tool, users can mount JuiceFS on Linux or macOS and access it like a local file system.
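
To make the division of responsibilities concrete, here is a deliberately simplified, hypothetical sketch of a write path in this kind of design: data is cut into fixed-size blocks that go to object storage under opaque keys, while only the mapping from the file to those keys goes to the metadata service. It illustrates the architecture described above, not JuiceFS's actual data format or API:

```python
import uuid

BLOCK_SIZE = 4 * 2**20   # hypothetical block size

object_store = {}        # stands in for a cloud object store (key -> bytes)
metadata_service = {}    # stands in for the metadata service (path -> record)

def write_file(path: str, data: bytes) -> None:
    """Split data into blocks, upload each block to object storage,
    and record only the list of object keys in the metadata service."""
    keys = []
    for i in range(0, len(data), BLOCK_SIZE):
        key = f"chunks/{uuid.uuid4()}"          # opaque object key
        object_store[key] = data[i:i + BLOCK_SIZE]
        keys.append(key)
    metadata_service[path] = {"size": len(data), "blocks": keys}

def read_file(path: str) -> bytes:
    """Look up the block list in the metadata service and reassemble the data."""
    meta = metadata_service[path]
    return b"".join(object_store[k] for k in meta["blocks"])

write_file("/docs/note.txt", b"metadata goes one way, data goes the other")
print(read_file("/docs/note.txt"))
```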

The dotted-line portion on the right of the figure above is responsible for data storage and access, which concerns the user's data privacy: it runs entirely within the customer's own account and network environment and never touches the metadata service. We (Juicedata) have no way to access customers' file contents, only the metadata, so please do not put sensitive information in file names.

Summary

Above I briefly introduced the architectures of the distributed file systems I know, arranged in the figure below in order of appearance (an arrow indicates a system's predecessor or its next-generation version):

The systems shown in blue in the upper part of the figure are mainly used for big data scenarios and implement only a subset of POSIX, while the ones in green below are POSIX-compatible file systems.

Among them, the design of separating metadata from data, represented by GFS, strikes an effective balance in system complexity, solves the storage of large-scale data (usually large files) well, and scales better. On top of this architecture, Colossus and WarmStorage, which store metadata in a distributed fashion, can scale essentially without limit.

As a latecomer, JuiceFS learned from MooseFS's way of implementing a distributed POSIX file system and from the complete separation of metadata and data seen in systems such as Facebook's WarmStorage, aiming to offer the best distributed storage experience for public and private cloud scenarios. By storing data in object storage, JuiceFS avoids the high cost of double redundancy (block storage replication plus the distributed file system's multi-machine replication) that arises when the file systems above are run in the cloud. JuiceFS also supports all major public clouds, so there is no lock-in to a single cloud service, and data can be migrated smoothly between clouds or regions.

Finally, if you have a public cloud account, sign up for JuiceFS, and within five minutes you can mount a PB-scale file system on your virtual machines or on your own Mac.

Recommended reading:

How to speed up AI model training 7x with JuiceFS

If this was helpful, please follow us at Juicedata/JuiceFS! (0ᴗ0✿)
