Revealing the large-scale metadata management mechanism of distributed file systems - taking the Alluxio file system as an example

We now live in a data age. With the rapid development of information technologies such as the Internet, IoT, 5G, big data, artificial intelligence, autonomous driving, and the metaverse, the amount of data that people generate, collect, store, manage, and analyze is growing rapidly. Industry data that is diverse in form, complex in format, huge in scale, and rapidly generated has driven rapid change in the underlying computing technologies that support it. Guided by the work of pioneers in industry and academia over the past 10 years, the technology ecosystem of distributed parallel computing and distributed data storage has continued to evolve and flourish. Within this massive data processing stack, distributed data storage management plays a fundamental role and is the cornerstone of big data analysis applications in many industries.

Distributed file systems are a mainstream class of distributed data storage management systems and have been widely used in high-performance computing and big data computing. In recent years, with the continuous development of cloud computing, distributed object storage, key-value storage and similar technologies have also become popular. Against this background, many distributed file systems have started to pursue unified and efficient management of data storage. A well-known and widely used example is Alluxio, which was born in the AMPLab at the University of California, Berkeley. It can be regarded as a unified virtual file system for big data: different types of distributed storage systems (file systems, object storage systems) can be mounted into the Alluxio directory tree, which then provides an efficient and unified access mode and interface. Metadata is the most important key information about the data in a storage system, and also the most frequently accessed. To manage large numbers of files and objects from different underlying distributed storage systems effectively, Alluxio needs an efficient and scalable large-scale metadata management mechanism.

This article takes the open source version of Alluxio 2.8 as an example to explain the large-scale metadata management mechanisms common in distributed file systems. Alluxio users interact with the Alluxio file system interface through file metadata, and read, write, and cache data through data block metadata. Both file and data block metadata are stored and managed centrally by the Alluxio Master.

01. Common types of distributed file system metadata

Among the metadata managed by Alluxio Master, the most important categories are file metadata, data block metadata, mount point metadata and Alluxio Worker metadata.

File (inode) metadata

Each file or folder in the Alluxio file system is represented by an inode. The inode stores all attributes and metadata of the file, including basic file attributes, permission information, management attributes, timestamps, the data blocks the file contains, and the metadata of each data block. The concept of the "inode" comes from Unix-type file systems and is widely used in Linux file systems and in HDFS. An inode represents a node in the file system directory tree. Because Alluxio manages multiple under stores, the number of potential files in the Alluxio namespace is the sum of the files in all of them. As the most important service in the Alluxio cluster, the metadata service directly determines the scale, performance and stability of the system. It is worth mentioning that an inode in the Alluxio file system does not necessarily exist in the underlying storage. For example, if a path is written to Alluxio in MUST_CACHE mode, Alluxio will not create the file in the underlying storage. In addition, if the underlying storage is an object store, which has no concept of folders, the folders in Alluxio do not correspond to actual objects in the underlying storage.

Generally speaking, Alluxio Master's management of inodes can be summarized as the following responsibilities:

  • Maintaining the tree structure of the file system. Alluxio Master uses an InodeTree to store all inode information and the tree structure between inodes (the parent-child relationships between folders and files).

  • Implementing the file system operation interface. Alluxio Master exposes a series of file system operation interfaces, supports all operations on files, and provides concurrency safety and persistence guarantees for each operation. In this way, it presents a distributed file system to upper-layer applications.

  • Maintaining persistent state through the Journal. Alluxio Master records inode information and every operation in the Journal, which ensures the persistence and atomicity of each inode operation so that inode information and changes are not lost.

  • Supporting concurrent access at inode granularity. The InodeTree refines the lock granularity down to the individual inode. Locking each inode separately ensures thread safety when inodes are read and written concurrently.
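The first and last responsibilities above can be illustrated with a minimal sketch of a directory tree that keeps one lock per inode. The class names `Inode` and `SimpleInodeTree` and all of their methods are hypothetical and chosen only for illustration. Alluxio's real InodeTree and its locking scheme are considerably more elaborate, and this sketch omits the Journal entirely.

```python
import threading
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Inode:
    """One node of the directory tree (hypothetical, heavily simplified)."""
    inode_id: int
    name: str
    is_directory: bool
    children: Dict[str, "Inode"] = field(default_factory=dict)
    # One lock per inode, so operations on unrelated paths do not contend.
    lock: threading.RLock = field(
        default_factory=threading.RLock, repr=False, compare=False)


class SimpleInodeTree:
    def __init__(self) -> None:
        self._next_id = 0
        self._id_lock = threading.Lock()
        self.root = Inode(self._new_id(), "", True)

    def _new_id(self) -> int:
        with self._id_lock:
            inode_id = self._next_id
            self._next_id += 1
            return inode_id

    def lookup(self, path: str) -> Optional[Inode]:
        """Walk down from the root, locking only the inode being read."""
        node = self.root
        for part in [p for p in path.split("/") if p]:
            with node.lock:
                child = node.children.get(part)
            if child is None:
                return None
            node = child
        return node

    def create(self, path: str, is_directory: bool = False) -> Inode:
        """Create a file or directory; the parent must already exist."""
        parent_path, _, name = path.rstrip("/").rpartition("/")
        parent = self.lookup(parent_path)
        if parent is None or not parent.is_directory:
            raise FileNotFoundError(f"parent directory missing: {parent_path!r}")
        with parent.lock:  # only the parent is locked while its child map changes
            if name in parent.children:
                raise FileExistsError(path)
            child = Inode(self._new_id(), name, is_directory)
            parent.children[name] = child
            return child
```

Calling `create("/data", is_directory=True)` followed by `create("/data/a.txt")` builds a two-level tree, and `lookup("/data/a.txt")` then returns the same inode object that `create` returned.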

Data block metadata

If an inode corresponds to a file, it has zero (for an empty file) or more data blocks. For a new file, the size of every data block except the last is set by alluxio.user.block.size.bytes.default; the last block may be smaller. In a file with only one data block, that block counts as the last block. Managing data block metadata is simpler than managing inodes, because there is no tree structure or parent-child relationship between data blocks.
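The block layout rule can be written down in a few lines. The function name `block_sizes` is hypothetical, and the block size is passed in as a plain argument rather than read from alluxio.user.block.size.bytes.default, so this is only a sketch of the arithmetic.

```python
from typing import List


def block_sizes(file_length: int, block_size: int) -> List[int]:
    """Return the size in bytes of every data block of a file."""
    if file_length < 0 or block_size <= 0:
        raise ValueError("file_length must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_length, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # only the last block may be shorter
    return sizes
```

With a block size of 64, a file of length 150 is split into blocks of 64, 64 and 22 bytes, a file of length 10 has a single (last) block of 10 bytes, and an empty file has no blocks.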

Alluxio Master stores the metadata of each data block and the current locations where the block is cached, and provides an interface for reading and writing this information. The block metadata managed by the Alluxio Master can be viewed simply as two key-value stores:

(1) <BlockID, BlockMetadata>

(2) <BlockID, List<BlockLocation>>

BlockMetadata records the length of the data block. BlockLocation records the address of the Alluxio Worker node that holds a cached copy of the data block, and the specific storage location of the block on that node.

These two kinds of information are stored separately mainly because their life cycles differ. BlockMetadata is immutable: Alluxio does not support random modification of, or appends to, data blocks that have already been written. If a file is rewritten, it gets a new FileID (i.e., InodeID) and new BlockIDs, and the old data blocks are discarded. The BlockLocation list, by contrast, changes continually. For example, it is updated whenever the data block is loaded into a new Alluxio Worker or evicted from one.
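The split into an immutable store and a mutable store can be sketched as follows. The class `BlockStore`, its method names, and the `tier` field of `BlockLocation` are assumptions made for illustration; they are not Alluxio's actual classes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BlockMetadata:
    length: int  # fixed once the block has been written


@dataclass(frozen=True)
class BlockLocation:
    worker_address: str  # which worker holds the cached copy
    tier: str            # where on that worker (hypothetical field)


class BlockStore:
    def __init__(self) -> None:
        # Store (1): written once per block, never updated afterwards.
        self.metadata: Dict[int, BlockMetadata] = {}
        # Store (2): changes whenever a block is loaded or evicted.
        self.locations: Dict[int, List[BlockLocation]] = {}

    def commit_block(self, block_id: int, length: int) -> None:
        if block_id in self.metadata:
            raise ValueError(f"block {block_id} is already committed")
        self.metadata[block_id] = BlockMetadata(length)
        self.locations[block_id] = []

    def add_location(self, block_id: int, location: BlockLocation) -> None:
        if location not in self.locations[block_id]:
            self.locations[block_id].append(location)

    def remove_location(self, block_id: int, location: BlockLocation) -> None:
        if location in self.locations[block_id]:
            self.locations[block_id].remove(location)
```

Because `BlockMetadata` is a frozen dataclass, an attempt to change the length of a committed block raises an error, while the location list of the same block can be updated freely.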

MountTable

MountTable manages all mount points in the Alluxio file system and provides operations such as creating and changing mount points. MountTable is also used to translate between Alluxio file paths and the corresponding file paths in the underlying storage.
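Path translation through a mount table can be sketched as a longest-prefix match. `SimpleMountTable`, its methods, and the example URIs are hypothetical. The sketch only resolves Alluxio paths to under-storage paths and does not model mount options or the reverse lookup.

```python
from typing import Dict, Optional


class SimpleMountTable:
    def __init__(self) -> None:
        # Maps an Alluxio mount point to the under-storage URI mounted there.
        self._mounts: Dict[str, str] = {}

    @staticmethod
    def _normalize(path: str) -> str:
        return "/" + path.strip("/")

    def mount(self, alluxio_path: str, ufs_uri: str) -> None:
        key = self._normalize(alluxio_path)
        if key in self._mounts:
            raise ValueError(f"mount point already exists: {key}")
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, alluxio_path: str) -> str:
        """Translate an Alluxio path using the longest matching mount point."""
        path = self._normalize(alluxio_path)
        best: Optional[str] = None
        for mount_point in self._mounts:
            prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        relative = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{relative}" if relative else base
```

If `/` is mounted to a hypothetical `hdfs://namenode:8020/alluxio` and `/s3` to `s3://bucket/data`, then `/s3/x/y` resolves to `s3://bucket/data/x/y`, while `/s3x/file` falls through to the root mount because `/s3` is not a path-component prefix of it.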

Worker metadata

Alluxio Master's management of Alluxio Worker metadata includes tracking which Alluxio Workers are currently active and continually updating the list of blocks cached on each Alluxio Worker. The information recorded by Alluxio Master mainly includes:

(1) The Alluxio Worker’s address, startup time and other information that does not change.

(2) The space usage of the Alluxio Worker, including the usage of each tier of the multi-tier cache, which is updated with each heartbeat.

(3) All BlockIDs cached on the Alluxio Worker and all BlockIDs to be removed from it. This information changes with every heartbeat and block operation (load, eviction, etc.).
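The three kinds of worker information and their update on each heartbeat can be sketched as below. `WorkerInfo`, its field names, and the shape of the `heartbeat` arguments are assumptions for illustration. The returned set merely stands in for whatever instruction the master sends back to the worker.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class WorkerInfo:
    # (1) Information that does not change after registration.
    address: str
    start_time_ms: int
    # (2) Space usage per cache tier, replaced on every heartbeat.
    used_bytes_on_tiers: Dict[str, int] = field(default_factory=dict)
    # (3) Blocks cached on the worker and blocks it should still remove.
    blocks: Set[int] = field(default_factory=set)
    to_remove_blocks: Set[int] = field(default_factory=set)

    def heartbeat(self,
                  used_bytes_on_tiers: Dict[str, int],
                  added_blocks: Iterable[int],
                  removed_blocks: Iterable[int]) -> Set[int]:
        """Apply one heartbeat report; return blocks still pending removal."""
        self.used_bytes_on_tiers = dict(used_bytes_on_tiers)
        self.blocks.update(added_blocks)
        removed = set(removed_blocks)
        self.blocks -= removed
        self.to_remove_blocks -= removed
        return set(self.to_remove_blocks)
```

A worker that reports blocks 1, 2 and 3 and is later asked to drop block 2 keeps receiving `{2}` from `heartbeat` until it reports that block as removed.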

02. Storage model of distributed file system metadata

Metadata storage in a distributed file system usually comes in two forms: on-heap storage and off-heap storage. On-heap storage is fast to access but limited in space. Off-heap storage offers far more space, but costs performance if it is not designed properly.

2.1 Metadata is stored on the heap (HEAP mode)

Taking Alluxio as an example, in HEAP mode all metadata is stored on the JVM heap as Java objects. Each file occupies approximately 2KB~4KB of heap memory. When the Alluxio file system holds a large number of files, the metadata therefore puts heavy memory pressure on the JVM. A quick calculation shows that with 100 million files in the system, storing the file metadata alone takes 200GB~400GB of heap. Add the memory overhead of the large number of RPC operations that the Master JVM must handle, and the memory requirement becomes difficult for an ordinary server to meet.
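The estimate above is a single multiplication, shown here as a small helper. The function name `heap_estimate_gb` is hypothetical, and decimal units (1KB = 1000 bytes) are assumed so that the result matches the round numbers in the text.

```python
from typing import Tuple

KB = 1000        # decimal units, an assumption that matches the round figures
GB = 1000 ** 3


def heap_estimate_gb(num_files: int,
                     bytes_per_file_low: int = 2 * KB,
                     bytes_per_file_high: int = 4 * KB) -> Tuple[float, float]:
    """Return the (low, high) heap estimate in GB for file metadata alone."""
    return (num_files * bytes_per_file_low / GB,
            num_files * bytes_per_file_high / GB)
```

For 100 million files, `heap_estimate_gb(100_000_000)` gives a range of 200GB to 400GB, before any allowance for RPC handling or other Master activity.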

In addition, GC at this data size is very difficult to manage on most JVM versions. The metadata objects in the Alluxio Master JVM are long-lived, which weighs especially heavily on the GC efficiency of the old generation. Although some commercial JVMs can avoid some or most of these performance and management problems, an oversized JVM heap is still a serious pain point for most users, especially because the memory needed by the Alluxio Master JVM may grow with future business expansion until it exceeds the physical memory of the machine.

2.2 Metadata is stored outside the heap (ROCKS mode)

To address the difficulty of scaling HEAP mode, Alluxio changed its design. Version 2.0 introduced the ROCKS mode, which moves metadata storage out of the JVM heap. In ROCKS mode, Alluxio Master embeds a RocksDB instance and moves the metadata of files (and data blocks) from the JVM heap into RocksDB. RocksDB stores its data on disk rather than in memory. To use RocksDB to store metadata, you only need to set the metadata storage mode and specify the RocksDB storage path:

alluxio.master.metastore=ROCKS

alluxio.master.metastore.dir=${alluxio.work.dir}/metastore

The RocksDB instance embedded in Alluxio uses the path configured by alluxio.master.metastore.dir as its metadata storage directory. The following example lists the RocksDB storage of a running Alluxio cluster. Alluxio keeps separate storage directories for inode and block metadata, each containing data files managed by RocksDB. The directory structure of RocksDB storage is not described in detail here; readers can consult the official RocksDB documentation.

$ ls -al -R metastore/
metastore/:
total 8
drwxrwxr-x. 2 alluxio-user alluxio-group 4096 May 21 03:20 blocks
drwxrwxr-x. 2 alluxio-user alluxio-group 4096 May 21 03:33 inodes

metastore/blocks:
total 4264
-rw-r--r--. 1 alluxio-user alluxio-group     0 May 21 03:20 000005.log
-rw-r--r--. 1 alluxio-user alluxio-group    16 May 21 03:20 CURRENT
-rw-r--r--. 1 alluxio-user alluxio-group    36 May 21 03:20 IDENTITY
-rw-r--r--. 1 alluxio-user alluxio-group     0 May 21 03:20 LOCK
-rw-r--r--. 1 alluxio-user alluxio-group 52837 May 21 03:30 LOG
-rw-r--r--. 1 alluxio-user alluxio-group   176 May 21 03:20 MANIFEST-000004
-rw-r--r--. 1 alluxio-user alluxio-group 13467 May 21 03:20 OPTIONS-000009
-rw-r--r--. 1 alluxio-user alluxio-group 13467 May 21 03:20 OPTIONS-000011

metastore/inodes:
total 4268
-rw-r--r--. 1 alluxio-user alluxio-group     0 May 21 03:20 000005.log
-rw-r--r--. 1 alluxio-user alluxio-group  1211 May 21 03:33 000012.sst
-rw-r--r--. 1 alluxio-user alluxio-group    16 May 21 03:20 CURRENT
-rw-r--r--. 1 alluxio-user alluxio-group    36 May 21 03:20 IDENTITY
-rw-r--r--. 1 alluxio-user alluxio-group     0 May 21 03:20 LOCK
-rw-r--r--. 1 alluxio-user alluxio-group 58083 May 21 03:33 LOG
-rw-r--r--. 1 alluxio-user alluxio-group   247 May 21 03:33 MANIFEST-000004
-rw-r--r--. 1 alluxio-user alluxio-group 13679 May 21 03:20 OPTIONS-000009
-rw-r--r--. 1 alluxio-user alluxio-group 13679 May 21 03:20 OPTIONS-000011

2.3 Memory and disk usage of off-heap storage

In ROCKS mode, metadata is stored in RocksDB outside the heap, which greatly reduces the memory pressure that metadata storage puts on the Alluxio Master process. Compared with HEAP mode, however, every metadata read and write would slow from memory speed to disk speed, which would severely hurt the performance and throughput of Alluxio Master. Alluxio Master therefore adds an in-memory cache in front of RocksDB to speed up access. In other words, in ROCKS mode the memory used for metadata storage is the memory used by this cache. As in the HEAP mode estimate, the metadata of each file in the cache occupies the same 2KB~4KB.

The size of the cache is controlled by alluxio.master.metastore.inode.cache.max.size. The default value of this property may vary between Alluxio versions. Alluxio Master writes to the cache first and only starts writing to RocksDB (disk) once cache usage reaches a certain level. As for RocksDB disk usage, the metadata of about 1 million files takes up about 4GB of disk space. Note that as long as the number of files in the Alluxio namespace is too small to trigger eviction under alluxio.master.metastore.inode.cache.max.size, all file metadata stays in the in-memory cache and none is written to RocksDB. In that case, the disk usage of the file metadata is close to 0.

2.4 Cache acceleration and tuning of off-heap storage

When memory is plentiful, increasing alluxio.master.metastore.inode.cache.max.size appropriately keeps more file metadata in memory and improves performance. Keep in mind, however, that RPC operations on Alluxio Master also consume memory. Even when no RPC is in progress, internal management logic on the Alluxio Master, such as periodic file scans, still consumes memory. When sizing the memory of the Alluxio Master process, reserve enough for these operations and do not let metadata storage take all of it. The reasoning is the same as for not allocating 100% of a server's memory to applications: some must be left for the operating system.

The metadata cache is managed with a watermark mechanism. The user configures a high watermark parameter and a low watermark parameter. For example, the following is the default configuration:

alluxio.master.metastore.inode.cache.high.water.mark.ratio=0.85

alluxio.master.metastore.inode.cache.low.water.mark.ratio=0.8

When cache usage reaches 0.85 * alluxio.master.metastore.inode.cache.max.size, eviction begins and the evicted cache entries are written to RocksDB storage. Eviction stops when cache usage drops to 0.8 * alluxio.master.metastore.inode.cache.max.size.
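The watermark mechanism can be sketched with a small write-back cache. `WatermarkCache` and its methods are hypothetical, a plain dictionary stands in for RocksDB, and the least-recently-used eviction order is an assumption made for illustration rather than a statement about Alluxio's eviction policy.

```python
from collections import OrderedDict
from typing import Any, Dict, Optional


class WatermarkCache:
    def __init__(self, max_size: int, high_ratio: float = 0.85,
                 low_ratio: float = 0.8,
                 backing_store: Optional[Dict[Any, Any]] = None) -> None:
        if not 0 < low_ratio <= high_ratio <= 1:
            raise ValueError("expected 0 < low_ratio <= high_ratio <= 1")
        self.high_mark = round(max_size * high_ratio)  # eviction starts here
        self.low_mark = round(max_size * low_ratio)    # eviction stops here
        self._cache: "OrderedDict[Any, Any]" = OrderedDict()
        # Stand-in for RocksDB: entries land here only when they are evicted.
        self.backing_store = backing_store if backing_store is not None else {}

    def __len__(self) -> int:
        return len(self._cache)

    def put(self, key: Any, value: Any) -> None:
        self._cache[key] = value
        self._cache.move_to_end(key)  # mark as most recently used
        if len(self._cache) >= self.high_mark:
            self._evict()

    def _evict(self) -> None:
        # Write the least recently used entries out until the low mark is reached.
        while len(self._cache) > self.low_mark:
            key, value = self._cache.popitem(last=False)
            self.backing_store[key] = value

    def get(self, key: Any) -> Optional[Any]:
        if key in self._cache:
            self._cache.move_to_end(key)
            return self._cache[key]
        return self.backing_store.get(key)  # slower path: read from the store
```

With `max_size=100` and the default ratios, the first 84 entries stay purely in memory and the backing store remains empty. The 85th `put` triggers eviction, which writes the five oldest entries to the backing store and leaves 80 in the cache.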

2.5 Switch between HEAP and ROCKS modes

The format of the Journal log is different when using HEAP mode and ROCKS mode, so switching from one mode to the other cannot be done by simply changing the configuration and restarting the Alluxio Master process. The switching of the metadata storage mode can be done by starting the cluster from the backup, see section 4.5.

This article has used Alluxio as an example to briefly introduce the basic types of metadata in distributed file systems and how they are managed and optimized. For more details on data access optimization, refer to the Alluxio open source community code. You are also welcome to read the technical book recently published by Machinery Industry Press, "Distributed Unified Big Data Virtual File System - Alluxio Principles, Technology and Practice".

This book is based on the widely used open source version Alluxio 2.8.0. It gives an in-depth introduction to the technical principles and practical cases of Alluxio, a distributed unified big data file system. Its main content covers getting started with and using the system and the design and implementation principles of the kernel components. It also describes large-scale enterprise application cases and practices in detail, and includes a developer guide for Alluxio's open source community. The book offers a relatively complete technical guide and practical tutorial for Alluxio open source community users, teachers and students of university big data system courses, and potential enterprise users. It can serve as a professional textbook in the field of big data, as well as an important professional reference for big data practitioners and researchers.

Origin blog.csdn.net/qq_41640218/article/details/132764281