Distributed Cluster Framework - Google File System GFS

Google File System (GFS)

The Google File System (GFS) is a large-scale distributed file system. It provides massive storage for Google's cloud computing, is closely integrated with technologies such as Chubby, MapReduce, and Bigtable, and sits at the bottom of this stack of core technologies. Since GFS is not an open-source system, we can only gain a limited understanding of it from the technical documents Google has published and cannot study it in depth. Document [1] is the most detailed technical document on GFS that Google has released; it describes in detail the background, characteristics, system architecture, and performance testing of GFS.

The mainstream distributed file systems today include Red Hat's GFS [3] (Global File System), IBM's GPFS [4], and Sun's Lustre [5]. These systems are usually deployed in high-performance computing environments or large data centers and place high demands on the hardware. Taking the Lustre file system as an example, it provides a fault-tolerance solution only for the metadata manager (MDS); the data storage nodes (OSTs) are left to handle fault tolerance on their own. Lustre, for instance, recommends that OST nodes use RAID or a SAN (storage area network) for fault tolerance. Because Lustre itself provides no fault tolerance for the stored data, data on a failed OST cannot be recovered, which places very high demands on OST reliability. As a result, storage costs rise sharply, and they grow linearly as the system scales.

As Kai-Fu Lee has said, innovation is important, but useful innovation is more important. The value of an innovation depends on how it performs in three respects: novelty, usefulness, and feasibility. The novelty of Google's GFS does not lie in any astonishing technology it uses, but in the fact that it builds a distributed file system out of cheap commodity machines, closely matches the design of GFS to the characteristics of Google's applications, and simplifies the implementation to make it feasible, finally achieving a combination of novelty, usefulness, and feasibility. GFS builds a distributed file system from cheap commodity machines, assigns the fault-tolerance work to the file system itself, and uses software to solve system reliability problems, which greatly reduces storage costs. Because GFS runs on a very large number of servers, server crashes occur frequently and should not even be regarded as abnormal events. How to keep data safe and provide uninterrupted storage service in the presence of frequent failures is the core problem GFS has to solve. The elegance of GFS lies in the way it applies a variety of fault-tolerance measures, from multiple angles, to ensure the reliability of the entire system.

 2.1.1  System Architecture

The system architecture of GFS is shown in Figure 2-1 [1]. GFS divides the nodes of the system into three roles: the Client, the Master, and the Chunk Server. The Client is the access interface GFS provides to applications; it is a set of dedicated interfaces that do not follow the POSIX specification and are provided as a library, so applications call these library functions directly and are linked against the library. The Master is the management node of GFS; logically there is only one. It holds the system's metadata, is responsible for managing the entire file system, and acts as the "brain" of GFS. Chunk Servers do the actual storage work, and data is stored on Chunk Servers in the form of files. There can be many Chunk Servers, and their number directly determines the scale of GFS. GFS divides files into fixed-size blocks, 64 MB by default; each block is called a Chunk and has a corresponding index number (Index).


Figure 2-1 GFS architecture
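
To make the fixed-size chunking concrete, the sketch below (illustrative Go, not code from GFS) shows how a byte offset within a file maps to a chunk index; according to the GFS paper, the client performs this translation and then asks the Master for that chunk's handle and replica locations.

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB, the default GFS chunk size

// chunkIndex maps a byte offset within a file to the index of the chunk
// that holds it; the client sends (file name, chunk index) to the Master
// to look up the chunk handle and replica locations.
func chunkIndex(offset int64) int64 {
	return offset / chunkSize
}

func main() {
	// A read starting at byte 200,000,000 falls into chunk 2
	// (chunks 0 and 1 cover the first 128 MB).
	fmt.Println(chunkIndex(200_000_000)) // prints 2
}
```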

When a client accesses GFS, it first contacts the Master node to obtain information about the Chunk Servers it needs to interact with, and then accesses those Chunk Servers directly to complete the data access. This design separates the control flow from the data flow: between Client and Master there is only control flow and no data flow, which greatly reduces the load on the Master and keeps it from becoming a performance bottleneck. Data flows directly between the Client and the Chunk Servers, and because a file is split into multiple Chunks stored across servers, a Client can access several Chunk Servers at the same time, making the I/O of the whole system highly parallel and improving overall performance.
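
The separation of control flow and data flow can be pictured as two narrow interfaces. The Go sketch below is hypothetical (the type and method names are not from the paper): the Master answers only a small metadata lookup, while the bulk transfer goes straight to a Chunk Server.

```go
package gfsclient

const chunkSize = 64 << 20 // 64 MB default chunk size

// ChunkLocation is what the Master returns for a (file, chunk index) query.
type ChunkLocation struct {
	Handle   uint64   // globally unique chunk handle
	Replicas []string // addresses of Chunk Servers holding a copy
}

// Master carries only control flow: metadata lookups, no file data.
type Master interface {
	Lookup(file string, chunkIndex int64) (ChunkLocation, error)
}

// ChunkServer carries the data flow: it serves chunk contents directly.
type ChunkServer interface {
	ReadChunk(handle uint64, offset, length int64) ([]byte, error)
}

// Read shows the two-step access: one small RPC to the Master for the
// location, then the bulk read straight from a Chunk Server replica.
func Read(m Master, dial func(addr string) ChunkServer,
	file string, offset, length int64) ([]byte, error) {

	loc, err := m.Lookup(file, offset/chunkSize)
	if err != nil {
		return nil, err
	}
	// Pick any replica; a real client would prefer a nearby one.
	cs := dial(loc.Replicas[0])
	return cs.ReadChunk(loc.Handle, offset%chunkSize, length)
}
```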

Compared with traditional distributed file systems, GFS is simplified in many respects according to the characteristics of Google's applications, so as to achieve the best balance of cost, reliability, and performance at its target scale. Specifically, it has the following characteristics.

1. Using the central-server model

GFS adopts the central-server model to manage the entire file system, which greatly simplifies the design and reduces the difficulty of implementation. The Master manages all metadata in the distributed file system. Files are divided into Chunks for storage, and to the Master each Chunk Server is just storage space. All operations initiated by a Client first go through the Master. This has many advantages. Adding a new Chunk Server is very easy: it only needs to register with the Master, and Chunk Servers need not know anything about one another. If a completely peer-to-peer, decentralized model were adopted instead, propagating membership updates to every Chunk Server would become a difficult design problem and would also limit the scalability of the system to some extent. Because the Master maintains a unified namespace and knows the status of every Chunk Server in the system, it can balance the storage load across the whole system. And since there is only one central server, the metadata consistency problem is solved naturally. Of course, the central-server model also brings some inherent disadvantages, such as the risk of becoming the bottleneck of the entire system. GFS uses a variety of mechanisms to prevent the Master from becoming a bottleneck for performance or reliability, such as keeping the metadata as small as possible, backing up the Master remotely, and separating control information from data.
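
The following Go sketch (hypothetical names, assuming the Master keeps a simple in-memory registry) illustrates why adding a Chunk Server is cheap in the central-server model: the new server only has to register itself with the Master, and no Chunk Server needs to know about any other.

```go
package master

import "sync"

// ServerInfo is what a Chunk Server reports about itself; the fields
// shown here are illustrative, not the actual GFS protocol.
type ServerInfo struct {
	Addr      string
	FreeSpace int64 // used by the Master for placement and load balancing
}

// Registry is the Master's view of all Chunk Servers.
type Registry struct {
	mu      sync.Mutex
	servers map[string]ServerInfo
}

func NewRegistry() *Registry {
	return &Registry{servers: make(map[string]ServerInfo)}
}

// Register is all a freshly started Chunk Server has to call; no other
// Chunk Server needs to be notified.
func (r *Registry) Register(info ServerInfo) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.servers[info.Addr] = info
}
```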

2. Not caching data

A cache mechanism is an important means of improving file system performance, and general-purpose file systems usually implement a fairly complex cache for this reason. Based on the characteristics of its applications, GFS does not cache file data, a decision made from the standpoints of both necessity and feasibility. In terms of necessity, most clients read and write sequentially in a streaming fashion and rarely re-read the same data, so caching such data would do little to improve overall system performance; and on the Chunk Server side, since GFS stores its data as ordinary files, the local file system naturally caches data that is read frequently. In terms of feasibility, keeping a cache consistent with the actual data is an extremely complicated problem: the stability of individual Chunk Servers cannot be guaranteed, and with the network and other uncertainties added, the consistency problem becomes especially complex. In addition, the amount of data read is huge and cannot be fully cached with current memory capacities. For the metadata stored on the Master, however, GFS does adopt a caching strategy: every operation initiated by a Client first goes through the Master, so the Master operates on its metadata frequently. To make these operations efficient, the Master keeps all metadata directly in memory, and a compression scheme is used to reduce the space the metadata occupies and improve memory utilization.
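
As a rough illustration of "all metadata directly in memory", the structures below sketch what the Master might keep in RAM. The field names are invented for this example; chunk locations are shown as data that is rebuilt from Chunk Server reports rather than persisted by the Master.

```go
package master

// chunkInfo is the Master's in-memory record for one Chunk.
type chunkInfo struct {
	Handle   uint64
	Replicas []string // Chunk Server addresses, rebuilt from Chunk Server
	                  // reports; the Master does not persist them
}

// fileInfo records the file-to-Chunk mapping for one file.
type fileInfo struct {
	Chunks []uint64 // chunk handles in file order
}

// masterState holds all metadata in memory so that namespace and
// mapping lookups never touch disk.
type masterState struct {
	namespace map[string]*fileInfo  // full path -> file metadata
	chunks    map[uint64]*chunkInfo // chunk handle -> replica locations
}
```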

3. Implemented in user mode

As an important part of the operating system, a file system is usually implemented at the lowest layers of the operating system. Taking Linux as an example, local file systems such as Ext3 and distributed file systems such as Lustre are all implemented in kernel mode. Implementing a file system in kernel mode allows better integration with the operating system itself and makes it possible to provide upward-compatible POSIX interfaces. GFS, however, chooses to implement its file system in user mode, mainly for the following reasons.

1) Implemented in user mode, the data can be accessed directly through the POSIX programming interface provided by the operating system, without knowledge of the operating system's internal mechanisms and interfaces, which reduces the implementation difficulty and improves generality.

2) The POSIX interface provides richer functionality, so more features can be used during implementation; kernel programming is far more restricted.

3) Many debugging tools are available in user mode, whereas debugging in kernel mode is relatively difficult.

4) In user mode, both the Master and the Chunk Server run as ordinary processes, and a single process cannot bring down the entire operating system, so they can be optimized aggressively. In kernel mode, if the kernel's characteristics are not well understood, efficiency may suffer and the stability of the whole system may even be affected.

5) In user mode, GFS and the operating system run in separate address spaces, which reduces the coupling between the two and makes it convenient to upgrade GFS and the kernel independently.

4. Providing only a dedicated interface

Ordinary distributed file systems generally provide a set of interfaces compatible with the POSIX specification, whose advantage is that applications can access the file system transparently through the operating system's unified interface without being recompiled. GFS, by contrast, was designed entirely around Google's applications and uses a dedicated file-system access interface. The interface is provided in the form of a library; application programs are compiled together with this library, and Google's applications access the GFS file system by calling the library's API from their code. Using a dedicated interface has the following benefits.

1) It reduces the implementation difficulty. POSIX-compatible interfaces usually have to be implemented at the operating-system kernel level, whereas GFS is implemented at the application layer.

2) A dedicated interface can provide special support tailored to the characteristics of the applications, such as support for concurrent record appends to a file (a hypothetical API along these lines is sketched after this list).

3) The dedicated interface interacts directly with the Client, Master, and Chunk Server, which reduces context switching through the operating system, lowers complexity, and improves efficiency.
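A dedicated, library-level API along these lines might look like the hypothetical Go interface below. The signatures are illustrative only, not the actual Google library, but they show how a record append returns an offset chosen by the system, which is what allows many clients to append to the same file concurrently without extra locking.

```go
package gfsclient

// File is a hypothetical handle returned by the client library.
type File interface {
	Read(p []byte, offset int64) (n int, err error)
	Write(p []byte, offset int64) (n int, err error)

	// RecordAppend atomically appends data at an offset chosen by the
	// system and returns that offset to the caller.
	RecordAppend(p []byte) (offset int64, err error)

	Snapshot(dst string) error
	Close() error
}

// Client is the entry point an application links against.
type Client interface {
	Create(name string) (File, error)
	Open(name string) (File, error)
	Delete(name string) error
}
```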

2.1.2  Fault Tolerance Mechanism

1. Master Fault Tolerance

The Master stores three kinds of metadata for the GFS file system.

1) The namespace (Name Space), that is, the directory structure of the entire file system.

2) The mapping table between file names and Chunks.

3) The location information of each Chunk's replicas; each Chunk has three replicas by default.

For a single Master, GFS provides fault tolerance for the first two kinds of metadata through an operation log. The third kind of metadata is held on the Chunk Servers themselves and is generated automatically when the Master starts up or when a Chunk Server registers with the Master. Therefore, if the Master fails but its disk data remains intact, the metadata above can be restored quickly. To guard against the Master machine being lost entirely, GFS also maintains a remote, real-time backup of the Master, so that when the current GFS Master fails and can no longer work, another GFS Master can quickly take over.
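
The operation-log idea for the first two kinds of metadata can be sketched as follows (hypothetical Go, with an invented record format): mutations are appended to a log before they take effect, and a restarted Master replays the log to rebuild the namespace and the file-to-Chunk mapping, while Chunk locations are simply re-learned from Chunk Server registrations and are never logged.

```go
package master

import (
	"bufio"
	"encoding/json"
	"io"
)

// LogRecord is an illustrative operation-log entry; the real GFS log
// format is not public.
type LogRecord struct {
	Op   string `json:"op"`   // e.g. "create", "delete", "add_chunk"
	Path string `json:"path"` // file or directory affected
	Arg  uint64 `json:"arg"`  // e.g. a chunk handle
}

// replay reads log records in order and hands each one to apply, which
// rebuilds the namespace and the file-to-Chunk mapping in memory.
func replay(r io.Reader, apply func(LogRecord) error) error {
	dec := json.NewDecoder(bufio.NewReader(r))
	for {
		var rec LogRecord
		if err := dec.Decode(&rec); err == io.EOF {
			return nil
		} else if err != nil {
			return err
		}
		if err := apply(rec); err != nil {
			return err
		}
	}
}
```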

2. Chunk Server Fault Tolerance

GFS uses replication to provide Chunk Server fault tolerance. Each Chunk has multiple replicas (three by default), distributed across different Chunk Servers. The replica placement strategy has to consider several factors, such as network topology, rack distribution, and disk utilization. A write to a Chunk is considered successful only when all of its replicas have been written successfully. Later, if a replica is lost or becomes unrecoverable, the Master automatically copies the Chunk to other Chunk Servers so that the number of replicas stays constant. Although storing each piece of data three times may look like poor use of disk space, when all factors are weighed, and given that disk prices keep falling, replication is undoubtedly the simplest, most reliable, and most effective approach, and also the easiest to implement.
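
A minimal sketch of the Master-side re-replication check, assuming a target of three replicas and an externally supplied placement policy (both the function and type names here are invented):

```go
package master

const replicaTarget = 3 // default number of replicas per Chunk

// CopyTask asks one Chunk Server to copy a Chunk to another.
type CopyTask struct {
	Handle uint64
	From   string // a surviving replica to copy from
	To     string // a Chunk Server chosen by the placement policy
}

// reReplicate returns a copy task when a Chunk has fewer live replicas
// than the target, so the replica count is brought back to normal.
func reReplicate(handle uint64, live []string,
	pickTarget func(exclude []string) string) *CopyTask {

	if len(live) >= replicaTarget || len(live) == 0 {
		return nil // nothing to do, or the chunk is already lost
	}
	return &CopyTask{
		Handle: handle,
		From:   live[0],
		To:     pickTarget(live), // avoid servers that already hold a copy
	}
}
```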

Each file in GFS is divided into multiple Chunks, 64 MB each by default. The files handled by Google's applications are relatively large, so dividing them in 64 MB units is a reasonable choice. A Chunk Server stores Chunk replicas as local files. Each Chunk is further divided into Blocks of 64 KB, and each Block has a corresponding 32-bit checksum. When reading a Chunk replica, the Chunk Server compares the data it reads with the stored checksums; if they do not match, it returns an error so that the Client can choose a replica on another Chunk Server instead.
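
The per-Block check could look roughly like the sketch below, which uses a CRC-32 over 64 KB blocks. The source only specifies a 32-bit checksum per Block, so the concrete hash function and the code shown here are assumptions for illustration.

```go
package chunkserver

import (
	"errors"
	"hash/crc32"
)

const blockSize = 64 << 10 // 64 KB blocks, each with a 32-bit checksum

// verify checks chunk data against per-block checksums before it is
// returned to a client; sums[i] must hold the stored checksum of the
// i-th 64 KB block. On a mismatch the read fails and the client falls
// back to a replica on another Chunk Server.
func verify(data []byte, sums []uint32) error {
	for i := 0; i*blockSize < len(data); i++ {
		end := (i + 1) * blockSize
		if end > len(data) {
			end = len(data) // last block may be shorter than 64 KB
		}
		if crc32.ChecksumIEEE(data[i*blockSize:end]) != sums[i] {
			return errors.New("checksum mismatch: read another replica")
		}
	}
	return nil
}
```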

 2.1.3  System Management Technology

Strictly speaking, GFS is a distributed file system that comes with a complete set of solutions spanning hardware and software. Besides the key GFS technologies described above, there are corresponding system management technologies that support operating GFS as a whole, and these technologies are not necessarily unique to GFS.

1. Large-Scale Cluster Installation Technology

A cluster running GFS usually contains a large number of nodes: the largest cluster described in [1] has more than 1,000 nodes, and a Google data center now runs more than 10,000 machines. Quickly installing and deploying a GFS system, and quickly upgrading the software on its nodes, therefore require corresponding technical support.

2. Fault Detection Technology

GFS is a file system built on unreliable, inexpensive computers. Because of the large number of nodes, failures occur very frequently, and finding and identifying a failed Chunk Server in the shortest possible time requires appropriate cluster-monitoring technology.
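
Such monitoring is typically heartbeat-based: the GFS paper describes regular HeartBeat messages between the Master and the Chunk Servers. The sketch below (invented names and timeout value) shows the basic idea: any server that has not checked in within the timeout is treated as failed, and its Chunks become candidates for re-replication.

```go
package master

import "time"

// heartbeatTimeout is illustrative; the real value would be tuned.
const heartbeatTimeout = 30 * time.Second

// heartbeats tracks the last time each Chunk Server checked in.
type heartbeats struct {
	lastSeen map[string]time.Time // Chunk Server address -> last heartbeat
}

// failed returns the Chunk Servers that have been silent longer than
// the timeout and should be treated as down.
func (h *heartbeats) failed(now time.Time) []string {
	var down []string
	for addr, t := range h.lastSeen {
		if now.Sub(t) > heartbeatTimeout {
			down = append(down, addr)
		}
	}
	return down
}
```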

3. Dynamic Node Joining Technology

If a new Chunk Server had to be fully installed and configured in advance before being added, expanding the system would be a very cumbersome task. If a bare machine can simply be plugged in, obtain the system automatically, and install and run it, the workload of maintaining GFS is greatly reduced.

4. Energy-Saving Technology

Available data show that the power cost of running a server exceeds its original purchase cost, so Google has adopted a variety of mechanisms to reduce server energy consumption, such as modifying server motherboards and using batteries instead of expensive UPS (uninterruptible power supply) units to raise energy efficiency. Rich Miller noted in a blog post about data centers that this design allows Google's UPS efficiency to reach 99.9%, while ordinary data centers can only reach 92% to 95%.
