The Google File System (GFS) Study Notes

Primer

These are my study notes for the Geek Time course "Interpretation of Classic Papers on Big Data" taught by Xu Wenhao. Much of the text and many of the figures come from that column; if there is any infringement, they will be removed.

Although the paper, published in 2003, is relatively long, it remains a classic in the field of big data. Paper: The Google File System.

At its core, the paper addresses the problem of how to efficiently store massive amounts of data in a distributed environment.

Architecture of GFS

GFS is a single-master architecture, which keeps the design very simple and avoids having to manage complex consistency issues. However, it also brings limitations: for example, once the master fails, the whole cluster can no longer write data.

In the entire GFS there are two kinds of servers: the Master, the single control node in the whole system, and the chunkservers, the nodes that actually store the data.

Since GFS is a distributed file system, a file is not necessarily stored on a single server.

Therefore, in GFS each file is split into 64 MB chunks. Each chunk has a unique handle in GFS, which is simply a number that uniquely identifies that particular chunk. Each chunk is then stored on a chunkserver as an ordinary file.
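Because chunk sizes are fixed, mapping a byte range of a file to chunk indexes is plain integer arithmetic. A minimal sketch (the helper names are mine, not part of the GFS API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # the fixed 64 MB GFS chunk size


def chunk_index(file_offset: int) -> int:
    """Index of the chunk that a given byte offset of the file falls into."""
    return file_offset // CHUNK_SIZE


def chunks_touched(offset: int, length: int) -> range:
    """All chunk indexes covered by a read/write of `length` bytes at `offset`."""
    return range(chunk_index(offset), chunk_index(offset + length - 1) + 1)


# Example: reading 10 MB starting at the 60 MB mark touches chunks 0 and 1.
print(list(chunks_touched(60 * 1024 * 1024, 10 * 1024 * 1024)))  # [0, 1]
```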

A chunkserver is just an ordinary Linux server running a user-space GFS chunkserver program. This program handles RPC communication with the master and with GFS clients and performs the actual data reads and writes. To make sure data is not lost when a chunkserver fails, each chunk is stored as three full copies (replicas).

One of them is the primary and the other two are secondaries; if the three copies ever disagree, the primary is taken as authoritative. With three replicas, the system not only guards against data loss but can also spread the load when there are many concurrent reads.


First of all, the master stores three main kinds of metadata:

1. The file and chunk namespace information, i.e. paths and file names such as /data/geektime/bigdata/gfs01;

2. How each file is split into chunks, i.e. the mapping from a full path name to its list of chunk handles;

3. Where those chunks are actually stored, i.e. the mapping from chunk handle to chunkservers.

Regarding the metadata, the original text of the paper reads as follows (a code sketch of these structures follows the quote):

The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
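Translated into code, the three kinds of metadata might look roughly like the sketch below (field names are hypothetical, not the paper's actual layout). Note that only the first two are persisted through the operation log, while chunk locations are rebuilt by polling chunkservers:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MasterMetadata:
    # 1. Namespace: full path -> file attributes (a bare existence flag here).
    namespace: Dict[str, bool] = field(default_factory=dict)
    # 2. File -> ordered list of chunk handles (persisted via the operation log).
    file_to_chunks: Dict[str, List[int]] = field(default_factory=dict)
    # 3. Chunk handle -> chunkservers holding replicas (not persisted; rebuilt by
    #    asking chunkservers at master startup or when a chunkserver joins).
    chunk_locations: Dict[int, List[str]] = field(default_factory=dict)


meta = MasterMetadata()
meta.namespace["/data/geektime/bigdata/gfs01"] = True
meta.file_to_chunks["/data/geektime/bigdata/gfs01"] = [1001, 1002]
meta.chunk_locations[1001] = ["chunkserver-a", "chunkserver-b", "chunkserver-c"]
```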


Reading data

When a client needs to read data, the steps are as follows (a sketch of the flow appears after the list):

1. The client first determines which chunk contains the data to be read. Since each chunk has a fixed size of 64 MB, the client can compute the chunk index from the offset and length it wants to read. It then sends the file name and chunk index to the master.

2. After receiving this information, the master tells the client which chunkservers hold all the replicas of that chunk.

3. With the replica locations in hand, the client can read the data it needs from any one of those chunkservers.
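Putting the three steps together, a minimal client-side sketch might look like this; `master.find_chunk` and `server.read_chunk` are hypothetical RPC stubs, not GFS's real interface:

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024


def gfs_read(master, file_name: str, offset: int, length: int) -> bytes:
    """Client-side read: ask the master for replica locations, then read from any replica."""
    data = b""
    first, last = offset // CHUNK_SIZE, (offset + length - 1) // CHUNK_SIZE
    for index in range(first, last + 1):
        # Steps 1-2: send (file name, chunk index) to the master; it replies with the
        # chunk handle and the chunkservers that hold replicas of that chunk.
        handle, replicas = master.find_chunk(file_name, index)
        # Step 3: read the needed byte range from any one replica.
        server = random.choice(replicas)
        start = max(offset, index * CHUNK_SIZE)
        end = min(offset + length, (index + 1) * CHUNK_SIZE)
        data += server.read_chunk(handle, start - index * CHUNK_SIZE, end - start)
    return data
```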

The corresponding figure in the paper (Figure 1, "GFS Architecture") illustrates this read flow.

A single master node can easily become a performance bottleneck, and once the master goes down the service stops. To improve performance, the master keeps all of its data in memory, which creates another problem: if the master crashes, that in-memory state is gone. The master therefore persists its state by recording an operation log and periodically writing checkpoints to disk.

This ensures that the master's data is not lost to a machine failure. When the master restarts, it first loads the latest checkpoint and then replays the operation log recorded after that checkpoint, restoring the master to its most recent state. This is one of the most common recovery mechanisms used by storage systems.
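A rough sketch of this checkpoint-plus-replay recovery, with the persistence layer abstracted behind three hypothetical callables:

```python
def recover_master_state(load_latest_checkpoint, read_log_after, apply_mutation):
    """Restore the master's in-memory state after a restart.

    load_latest_checkpoint(): returns (state, last_seq) for the newest on-disk checkpoint
    read_log_after(seq):      yields operation-log records recorded after `seq`
    apply_mutation(state, r): applies one logged mutation to the in-memory state
    """
    state, last_seq = load_latest_checkpoint()   # start from the newest checkpoint
    for record in read_log_after(last_seq):      # then replay the log, in order
        apply_mutation(state, record)
    return state
```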

But that raises the next question: what if the master node is lost entirely, say a hardware failure so severe that the machine will not even boot?

GFS keeps spares for this case, called Backup Masters. When a write operation is performed on the master, the corresponding operation record is saved to disk, and the whole operation is considered successful only after the Backup Masters have also saved it. Because the process blocks until the backups have the data, it is called "synchronous replication". When the master dies, one of the Backup Masters is chosen to become the new master.

However, switching to a new master can take a relatively long time, since the new master has to load the checkpoint and replay the operation log, which can disrupt clients. To avoid the service appearing down during that window, the designers added Shadow Masters. A shadow master continuously synchronizes data from the master, and when the master has a problem, clients can still look up the information they need from the shadow masters. Shadow masters are read-only and replicated asynchronously, so they add little performance overhead.
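A sketch of the synchronous-replication rule for master mutations (the log and backup objects are hypothetical); the essential point is that the call does not report success until every Backup Master has durably saved the record:

```python
def log_mutation(local_log, backup_masters, record) -> bool:
    """A mutation on the master succeeds only after it is durable everywhere."""
    local_log.append_and_flush(record)          # write to the master's own disk first
    acks = [backup.append_and_flush(record)     # each call blocks until that backup
            for backup in backup_masters]       # has the record on disk
    return all(acks)                            # synchronous: success needs every ack
```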

Writing data

The main steps for a client to write data are as follows (a condensed sketch appears after the list):

In the first step, the client asks the master which chunkservers the data should be written to.

In the second step, just as with reads, the master tells the client the chunkservers holding the secondary replicas. It also tells the client which replica is the primary replica, the one that coordinates this write.

In the third step, knowing which chunkservers to write to, the client pushes the data to all replicas. At this point the chunkservers do not actually write the data they receive; they only keep it in an internal LRU buffer.

In the fourth step, once all replicas have received the data, the client sends a write request to the primary replica. GFS serves hundreds of concurrent clients, so the primary may receive write requests from many of them; it puts these requests into a single serial order so that all mutations are applied in the same fixed order. The primary then writes the buffered data to the actual chunk in that order.

In the fifth step, the primary forwards the write request to all secondary replicas, and each secondary writes the data to disk in the same order as the primary.

In the sixth step, once a secondary has finished writing, it replies to the primary that it has completed the write.

In the seventh step, the primary tells the client that the data has been written successfully. If an error occurred while writing to any replica, the error is reported to the client instead, and the write is considered to have failed.
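A condensed, client-side sketch of the seven steps; `find_lease_holder`, `push_data`, and `commit_write` are hypothetical RPC stubs standing in for the real protocol:

```python
def gfs_write(master, file_name: str, chunk_index: int, data: bytes) -> bool:
    # Steps 1-2: the master returns the primary replica and the secondary replicas.
    primary, secondaries = master.find_lease_holder(file_name, chunk_index)

    # Step 3: push the data to every replica; each chunkserver only buffers it
    # in an LRU cache for now, nothing is applied to the chunk yet.
    for replica in (primary, *secondaries):
        replica.push_data(data)

    # Step 4: ask the primary to commit. The primary assigns the request a position
    # in a single serial order and applies the buffered data to its chunk.
    # Steps 5-6: the primary forwards the request to the secondaries and waits for
    # each of them to confirm that it applied the write in the same order.
    # Step 7: the primary reports success, or an error if any replica failed.
    return primary.commit_write(secondaries)
```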


Just as with reads, the GFS client only obtains from the master the metadata describing which chunkservers hold the chunk; the actual reading and writing of data does not go through the master. Moreover, not only does the data transfer bypass the master, the coordination of the subsequent writes across multiple chunkservers also happens without the master.

Pipelined network data transmission

The client does not push the data to each chunkserver in turn. Instead, it sends all the data to the chunkserver closest to itself, and that chunkserver then forwards the data on to the other replicas.


The advantage is that the client's own network link is no longer the bottleneck, and data transfer between servers inside the data center is much faster than transfer from the client.

First, the client transmits the data to the "nearest" server, secondary replica A, which sits in the same rack;

Then, secondary replica A forwards the data to its "nearest" next hop, the primary replica, which is in a different rack but under the same aggregation-layer switch;

Finally, the primary replica forwards the data to secondary replica B, which sits under another aggregation-layer switch (see the sketch below).
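A sketch of the "forward to the nearest remaining replica" idea; `distance` is a hypothetical network-distance metric (same rack < same aggregation switch < across switches), and in real GFS each hop streams the data onward as it arrives rather than waiting for the full payload:

```python
def pipeline_push(source, replicas, data, distance):
    """Push data hop by hop: each node forwards to the closest replica not yet reached."""
    current, remaining, order = source, list(replicas), []
    while remaining:
        nxt = min(remaining, key=lambda r: distance(current, r))  # nearest next hop
        current.send(nxt, data)   # hypothetical transfer from the current node to the next
        order.append(nxt)
        current = nxt
        remaining = [r for r in remaining if r is not nxt]
    return order  # e.g. [secondary A (same rack), primary, secondary B]
```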

When writing data, the client only obtains the metadata of the chunk location from the master, but in the actual data transmission process, the master does not need to participate, thus preventing the master from becoming a bottleneck.

When the client writes data to multiple chunkservers, the "nearest" pipeline transmission scheme is adopted to prevent the client's network from becoming a bottleneck.

Record append

The write path above has a problem: when multiple clients write to the same chunk at the same time, they can overwrite each other's data. To address this, GFS recommends record append as the primary way to write data. A record append is atomic, and its behaviour under concurrent writes is largely predictable.

The specific process is as follows (a sketch appears after the list):

1. The primary checks whether the record to be appended still fits in the current chunk. If it fits, the record is appended there, and the same write is sent to the secondary replicas so that they append it as well.

2. If the record no longer fits, the primary pads the rest of the current chunk with blank data and has the secondaries do the same. It then tells the client to retry on the next chunk, and the client goes to the chunkservers holding the new chunk to append the record.

3. Because the chunkserver holding the primary replica controls the order of write operations, and data is only ever appended at the end, concurrent append requests are simply queued at that primary. No two requests write to the same region, so data that has already been appended is never overwritten.

4. To guarantee that an appended record can always fit inside a chunk, GFS limits a single record append to 16 MB, while a chunk is 64 MB. Therefore, when an append has to pad the remainder of a chunk with blank data, at most 16 MB is padded, i.e. at most 1/4 of a chunk's storage is wasted.
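A sketch of the primary's decision for one record append, under the 64 MB chunk and 16 MB record limits described above (`primary_chunk` and the secondaries are hypothetical objects):

```python
CHUNK_SIZE = 64 * 1024 * 1024
MAX_RECORD = 16 * 1024 * 1024   # a single record append is capped at 16 MB


def record_append(primary_chunk, secondaries, record: bytes):
    """Return the offset the record was appended at, or None if the client must retry
    on the next chunk."""
    assert len(record) <= MAX_RECORD
    if primary_chunk.used + len(record) <= CHUNK_SIZE:
        offset = primary_chunk.used
        primary_chunk.append(record)        # step 1: append on the primary...
        for s in secondaries:
            s.append_at(offset, record)     # ...and at the same offset on each secondary
        return offset
    # Step 2: the record does not fit, so pad the rest of this chunk everywhere and
    # tell the client to retry on the next chunk. The padding is always under 16 MB.
    padding = CHUNK_SIZE - primary_chunk.used
    primary_chunk.append(b"\x00" * padding)
    for s in secondaries:
        s.append_at(CHUNK_SIZE - padding, b"\x00" * padding)
    return None
```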

Each concurrent append is given its own independent region within the chunk, so clients writing concurrently do not conflict with one another. If a write fails, the record is simply retried in a new region; the worst that happens is that records end up in a different order, or appear more than once, but the data itself is guaranteed to be complete.

For example, if record A written by client X fails on chunkserver R, it is appended again at the end of the chunk; although this leaves some duplicate data, the data is guaranteed to be complete in the end.
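Applications that cannot tolerate such duplicates usually handle them on the read side. A common pattern (not specific to GFS) is to embed a unique ID in every record and have readers skip IDs they have already seen:

```python
def deduplicate(records):
    """Yield each record's payload once, skipping re-appended duplicates by record ID."""
    seen = set()
    for record_id, payload in records:   # records carry (unique_id, data) pairs
        if record_id in seen:
            continue                      # this record was re-appended after a retry; skip it
        seen.add(record_id)
        yield payload
```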


Summary

Finally, let's step back and look at the overall design of GFS.

To sum up, GFS's design principles are simplicity and designing around the performance characteristics of the hardware, with the consistency requirements relaxed under those two premises.

There is no theoretical innovation in GFS; its innovation lies in shaping the design around the constraints of the hardware. The solution that fits your needs is the best one; there is no need to chase the most advanced technology. A good solution is one that meets the requirements at low cost. Innovation does not necessarily mean creating something new; combining existing pieces in the right way is also innovation.

Origin: blog.csdn.net/zhanyd/article/details/120705352