Distributed File System (GFS and HDFS) Overview

Foreword

Background and significance

Distributed storage concepts

        Categories of distributed storage systems

        CAP theorem

        Replicas

        Consistency

        GFS architecture

        Leases and mutation order

        Fault Tolerance


Foreword

I am taking a distributed systems class, and the teacher asked us to pick a topic for a report; the options included GFS, Hadoop, Bigtable, MapReduce, Chubby, OpenStack, and so on. You could cover any of them, either focusing on one important point or giving a general review. Since I had just read some books touching on distributed file systems, I chose GFS (because it looked familiar) and decided to do a review. Although the books I had read mentioned GFS, they did not explain its principles clearly, so I downloaded the GFS paper from the Internet. The paper describes the background and design ideas of GFS and how it solves the problems it encountered, from the characteristics of the stored data through to the design of the system, and it helped me understand how a distributed file system works underneath. This post introduces GFS and HDFS mainly through the paper and a set of HDFS comics, so that we can understand distributed file systems.

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has a lot in common with existing distributed file systems, but the differences are also obvious. HDFS is highly fault tolerant and intended to be deployed on low-cost machines, and it provides high-throughput data access, which makes it well suited to applications with large data sets. HDFS can be seen as an open-source implementation of GFS and borrows many of GFS's design and implementation ideas, so reading the GFS paper is an essential part of understanding how HDFS works. If you have already read books on distributed systems, or have spent a lot of time learning HDFS, the GFS paper will not feel difficult. This post collects the HDFS comics; learning from cartoons is quite fun. To understand the roles in the comics: the NameNode corresponds to the GFS master, that is, the metadata server, and the DataNode corresponds to the GFS chunkserver, that is, the block server.
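Before diving into GFS, here is a minimal sketch of how an application talks to HDFS through its standard Java client API (org.apache.hadoop.fs.FileSystem). The NameNode address and file path are made up for illustration; the point is simply that, like GFS, HDFS favors appending over overwriting.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path logFile = new Path("/logs/events.log");   // illustrative path

            // Create the file once if it does not exist yet.
            if (!fs.exists(logFile)) {
                fs.create(logFile).close();
            }

            // Like GFS, HDFS is built around appends rather than in-place overwrites.
            try (FSDataOutputStream out = fs.append(logFile)) {
                out.write("one more record\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```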

 

Background and significance

To meet the rapidly growing demands of Google's data processing, Google's developers designed and implemented the Google File System (GFS). GFS is a scalable distributed file system for large, data-intensive applications. Its design shares many goals with traditional distributed file systems, such as performance, scalability, reliability, and availability, but it is also shaped by the workloads and technical environment of Google's applications. GFS is a distributed storage system: a large number of ordinary PC servers connected by a network, providing storage services to the outside as a whole.

 

In designing GFS, the developers chose to re-examine traditional file system assumptions, which led to quite different design choices:

  1. Component failures are considered the norm rather than the exception (the system runs on ordinary PCs). Causes include application bugs, operating system bugs, human error, and failures of disks, memory, connectors, networks, and power supplies.
  2. Files are huge, so I/O operation sizes and the block size need to be reconsidered (the block size affects the size of the metadata table).
  3. Most files are modified by appending new data at the end rather than overwriting existing data (the reason lies in the type of data being stored, discussed below).
  4. An atomic record append operation is introduced, so that multiple clients can append to the same file concurrently without extra synchronization to keep the data consistent.

 

Distributed storage concepts

Categories of distributed storage systems

(1) Structured data: typically stored in relational databases and representable as two-dimensional tables. The schema and the data content are separated, and the schema must be defined in advance. A domestic open-source example is Alibaba's distributed database OceanBase.

(2) Unstructured data: office documents of all formats, text, images, audio, and video. Such data is organized as objects with no relationships between objects, and is generally called Blob (Binary Large Object) data. Distributed file systems are used to store Blob data; typical systems are Facebook Haystack, GFS, and HDFS.

(3) Semi-structured data: in between unstructured and structured data; HTML documents are an example. Its structure and content are mixed together with no clear separation, and the schema does not need to be defined in advance. A typical system is Google Bigtable.

Unstructured data, the documents, text, images, audio, and video stored as files, is typically read often and written rarely. That is why, as mentioned earlier, most files are modified by appending new data at the end rather than overwriting existing data. Think about it: when you use Baidu cloud disk, you can upload, download, and delete data, but you cannot modify data after uploading it. One reason is that such systems are read-heavy and write-light; another is that supporting in-place modification makes the consistency model much more complex.

The Bigtable paper is titled "Bigtable: A Distributed Storage System for Structured Data". Its introduction, however, states that Bigtable does not support a full relational data model; instead, it provides clients with a simple data model. For this reason some books classify the data it stores as semi-structured.

 

CAP theorem

The CAP theorem summarizes three characteristics of a distributed system:

Availability: read and write operations can still complete normally when a single machine fails (for example, because of a program bug), without waiting for the failed machine to restart; in other words, the service remains available.

Partition tolerance: the distributed system can still provide complete service in the face of node failures or network partitions.

Consistency: after an update operation succeeds and returns to the client, all nodes see exactly the same data at the same time; this is distributed consistency.

Partition tolerance and availability feel very similar; the difference lies in whether a node has actually gone down. A distributed storage system is required to keep providing complete storage service to users even when a node fails, so partition tolerance must always be satisfied; as a result, consistency and availability cannot both be fully guaranteed.

 

Replicas

To guarantee high reliability and availability, a distributed storage system usually keeps multiple copies (replicas) of the data. When the storage node holding one replica fails, the system can automatically switch the service to another replica, achieving automatic fault tolerance. Having multiple replicas, however, introduces data consistency problems.

 

In the read operation shown in the figure above, the user can fetch the data from the primary replica or from either of the two backup replicas. If 3,000 people read the same data at the same time, then under ideal conditions the system can spread the read requests across the three servers holding the replicas, which is clearly faster than a single server handling all 3,000 requests. Also, if one backup replica is damaged for some reason, the client can still fetch the data from the primary replica or from the other backup.

Replicas make the system more reliable. However, because multiple copies exist, when a client writes or appends data it must keep all three replicas consistent. In the write operation described above, since there are two backup replicas, the data written to the primary must also be synchronized to the backups. During this process the system not only needs a synchronization mechanism to guarantee the data reaches all three replicas; while the synchronization is in progress, reads of that data cannot be served from the replicas. This is why consistency and availability cannot both be achieved. The conflict does not exist all the time: only while data is being appended does the read service become unavailable, in order to guarantee consistency.
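A minimal sketch of this trade-off, not GFS's actual protocol: a write returns only after every replica has taken the value, and until then readers either wait or risk seeing stale data. All class and method names here are invented for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy replicated store: a write is visible only after all replicas hold it. */
class ToyReplicatedStore {
    private final List<Map<String, String>> replicas = List.of(
            new ConcurrentHashMap<String, String>(),
            new ConcurrentHashMap<String, String>(),
            new ConcurrentHashMap<String, String>());

    /** Synchronous write: completes only when the primary and both backups have the value. */
    synchronized void write(String key, String value) {
        for (Map<String, String> replica : replicas) {
            replica.put(key, value);   // in a real system this is a network round trip
        }
        // Only now is the write acknowledged to the client; during the loop,
        // readers either wait (consistency) or risk seeing stale data (availability).
    }

    /** Read from any replica; the index mimics spreading load across the three servers. */
    String read(String key, int replicaIndex) {
        return replicas.get(replicaIndex % replicas.size()).get(key);
    }
}
```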

 

Consistency

From the client's point of view, consistency comes in the following three forms:

(1) Strong consistency: if client A writes a value into the storage system, the system guarantees that subsequent reads by A, B, and C will all return the latest value.

(2) Weak consistency: if client A writes a value into the storage system, the system cannot guarantee that subsequent reads by A, B, and C will see the latest value.

(3) Eventual consistency: if client A writes a value into the storage system and no subsequent write updates the same value, the system guarantees that A, B, and C will "eventually" read the value A wrote. "Eventually" implies an "inconsistency window": the period between A's write and the moment when A, B, and C can all read the latest value. The size of the inconsistency window depends on several factors: interaction latency, system load, the number of replicas, and the replication protocol used for synchronization (a small sketch of where the window comes from follows below).
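A rough sketch, with invented names, of where the inconsistency window comes from when replication is asynchronous: the write is acknowledged before the backup has been updated, so a reader that hits the backup inside the window still sees the old value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Toy eventually consistent store: the backup replica is updated asynchronously. */
class ToyEventuallyConsistentStore {
    private final Map<String, String> primary = new ConcurrentHashMap<>();
    private final Map<String, String> backup  = new ConcurrentHashMap<>();
    private final ExecutorService replicator  = Executors.newSingleThreadExecutor();

    void write(String key, String value) {
        primary.put(key, value);                          // acknowledged immediately
        replicator.submit(() -> backup.put(key, value));  // replication happens later
        // The gap between these two lines is the "inconsistency window".
    }

    String readFromBackup(String key) {
        return backup.get(key);   // may return a stale value inside the window
    }
}
```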

 

GFS architecture

 

Roles and their tasks: the Master (a single node) stores metadata, that is, the data that describes the data. Chunkservers (many of them) store the data users put into the system. The Client is the party that requests the service.

The Master node performs all namespace operations. In addition, it manages all chunk replicas in the whole system: it decides where chunks are stored, creates new chunks and their replicas, coordinates system-wide activities to make sure every chunk is fully replicated, balances load across all chunkservers, and reclaims storage that is no longer in use.

The Master keeps three kinds of metadata (a rough sketch follows the list):

(1) the namespace, that is, the directory structure of the whole file system and basic information about each chunk;

(2) the mapping from files to chunks;

(3) the locations of each chunk's replicas, usually three replicas per chunk.
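This is not GFS's actual code, only a toy model under assumed names of the three kinds of metadata; note that only the first two are kept persistently, while replica locations are rebuilt by polling the chunkservers, as explained further below.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of the three kinds of metadata a GFS-style master keeps. */
class ToyMasterMetadata {
    /** (1) Namespace: full path -> basic file attributes (kept persistently). */
    final Map<String, FileInfo> namespace = new ConcurrentHashMap<>();

    /** (2) File -> ordered list of chunk handles (kept persistently). */
    final Map<String, List<Long>> fileToChunks = new ConcurrentHashMap<>();

    /** (3) Chunk handle -> chunkserver addresses holding a replica
     *  (not persisted; rebuilt by polling chunkservers at startup). */
    final Map<Long, List<String>> chunkLocations = new ConcurrentHashMap<>();

    record FileInfo(long length, long creationTimeMillis) {}
}
```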

The single-Master design greatly simplifies the system. A single Master can use its global knowledge to pinpoint chunk locations and make replication decisions. At the same time, reads and writes through the Master must be minimized so that it does not become the bottleneck of the system. Clients never read or write file data through the Master; instead, a client asks the Master which chunkservers it should contact, caches this metadata for a limited time, and performs subsequent data reads and writes directly against the chunkservers.

The read path can be followed on the architecture diagram. First, using the fixed chunk size, the client converts the file name and byte offset specified by the application into a chunk index within the file. It then sends the file name and chunk index to the Master. The Master replies with the corresponding chunk handle and the locations of its replicas, and the client caches this information with the file name and chunk index as the key.

The client then sends a request to one of the replicas, usually the closest one. The request carries the chunk handle and the byte range within that chunk. Further reads of the same chunk require no more client-Master communication until the cached metadata expires or the file is reopened. In fact, the client usually asks about several chunks in one request, and the Master may also include information for the chunks immediately following the requested ones. In practice this extra information costs almost nothing and avoids several future round trips between the client and the Master.
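A minimal sketch of this client-side read path, assuming the 64 MB chunk size used by GFS; the Master and ChunkServer interfaces here are invented placeholders, not the real GFS RPCs.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ToyGfsClient {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB, as in GFS

    /** Placeholder for the master RPC: (file, chunk index) -> (handle, replica locations). */
    interface Master {
        ChunkInfo lookup(String file, long chunkIndex);
    }
    /** Placeholder for a chunkserver RPC: read a byte range of a chunk. */
    interface ChunkServer {
        byte[] read(long chunkHandle, long offsetInChunk, int length);
    }
    record ChunkInfo(long handle, List<ChunkServer> replicas) {}

    private final Master master;
    private final Map<String, ChunkInfo> cache = new ConcurrentHashMap<>();

    ToyGfsClient(Master master) { this.master = master; }

    byte[] read(String file, long fileOffset, int length) {
        long chunkIndex    = fileOffset / CHUNK_SIZE;   // which chunk of the file
        long offsetInChunk = fileOffset % CHUNK_SIZE;   // where inside that chunk

        // Ask the master only on a cache miss; otherwise go straight to a chunkserver.
        ChunkInfo info = cache.computeIfAbsent(file + "#" + chunkIndex,
                key -> master.lookup(file, chunkIndex));

        ChunkServer closest = info.replicas().get(0);   // "closest" replica, simplified
        return closest.read(info.handle(), offsetInChunk, length);
    }
}
```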

The Master does not persistently record which chunkservers hold a replica of a given chunk; it simply polls the chunkservers for this information when it starts up. After that the Master can keep the information up to date, because it controls all chunk placement and monitors chunkserver state through regular heartbeat messages.

In the initial design the developers tried to keep chunk location information persistently on the Master, but they found it much simpler to poll the chunkservers at startup and then refresh the information with periodic polling. This design eliminates the problem of keeping the Master and the chunkservers in sync when chunkservers join or leave the cluster, are renamed, fail, or restart; in a cluster with hundreds of servers such events happen all the time.

The read data flow is shown in the following comic:

The write data flow is shown in the following comic:

 

 

Leases and mutation order

An important principle in the design of this system is to minimize the interaction between every operation and the Master node. With that in mind, this section describes how the client, the Master, and the chunkservers interact to implement data mutations, the atomic record append operation, and snapshots.

A mutation is an operation that changes the contents or the metadata of a chunk, such as a write or a record append. A mutation is performed on all of the chunk's replicas. The designers use a lease mechanism to keep the mutation order consistent across replicas. The Master grants a lease for a chunk to one of its replicas, which is then called the primary. The primary picks a serial order for all mutations to the chunk, and all replicas apply the mutations in that order. The global mutation order is therefore defined first by the order in which the Master grants leases, and within a lease by the serial numbers assigned by the primary.

The lease mechanism is designed to minimize management overhead at the Master. A lease initially times out after 60 seconds, but as long as the chunk keeps being mutated the primary can request extensions, which the Master normally grants. These extension requests and grants are piggybacked on the heartbeat messages exchanged between the Master and the chunkservers. The Master sometimes tries to revoke a lease early (for example, when it wants to disable mutations on a file that is being renamed). Even if the Master loses contact with a primary, it can still safely grant a new lease to another replica after the old lease expires.
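A rough sketch of that bookkeeping, using the 60-second timeout from the paper; the class and method names are invented, and in real GFS the extension would ride on heartbeat messages rather than a dedicated call.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy lease table kept by a GFS-style master. */
class ToyLeaseManager {
    static final Duration LEASE_TIMEOUT = Duration.ofSeconds(60);

    record Lease(String primaryChunkserver, Instant expiresAt) {}

    private final Map<Long, Lease> leases = new ConcurrentHashMap<>();  // chunk handle -> lease

    /** Grant (or re-grant) the lease for a chunk to one replica, making it the primary. */
    Lease grant(long chunkHandle, String chunkserver) {
        Lease lease = new Lease(chunkserver, Instant.now().plus(LEASE_TIMEOUT));
        leases.put(chunkHandle, lease);
        return lease;
    }

    /** Extend the lease, typically piggybacked on a heartbeat from the primary. */
    void extend(long chunkHandle) {
        leases.computeIfPresent(chunkHandle,
                (handle, old) -> new Lease(old.primaryChunkserver(), Instant.now().plus(LEASE_TIMEOUT)));
    }

    /** A new primary may only be chosen after the old lease has expired. */
    boolean canRegrant(long chunkHandle) {
        Lease lease = leases.get(chunkHandle);
        return lease == null || Instant.now().isAfter(lease.expiresAt());
    }
}
```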

GFS provides an atomic append operation called record append. In a traditional write, the client specifies the offset at which the data is to be written; concurrent writes to the same region are not serializable, so the region may end up containing fragments from several different clients. With record append, the client specifies only the data. GFS appends the data to the file at least once atomically (that is, as one continuous sequence of bytes), at an offset of GFS's choosing, and returns that offset to the client. This is similar to opening a file in O_APPEND mode in a Unix environment, where multiple concurrent writers do not race with one another.

Record append is used very heavily in Google's distributed applications, where many clients append data to the same file in parallel. With traditional writes, the clients would need complicated and expensive synchronization, such as a distributed lock manager. In Google's workloads such files often serve as multiple-producer/single-consumer queues, or hold merged results from many clients.

Record append is a kind of mutation and follows the normal mutation control flow, with a little extra logic at the primary. The client pushes the data to all replicas of the last chunk of the file and then sends its request to the primary. The primary checks whether appending the record would make the chunk exceed the maximum chunk size (64 MB). If so, it first pads the current chunk up to the maximum size, tells both secondaries to do the same, and then replies to the client telling it to retry the append on the next chunk. (The size of an appended record is strictly limited to 1/4 of the chunk size, so that even in the worst case the amount of padding stays within an acceptable range.) In the common case the record does not exceed the maximum chunk size: the primary appends the data to its own replica, tells the secondaries to write the data at exactly the same offset, and finally reports success to the client.
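A simplified sketch of that primary-side decision, using the 64 MB chunk size and the 1/4-of-chunk record limit from the paper; the class, fields, and return codes are invented for illustration.

```java
/** Toy primary-side logic for a GFS-style record append. */
class ToyRecordAppendPrimary {
    static final long CHUNK_SIZE      = 64L * 1024 * 1024;  // 64 MB
    static final long MAX_RECORD_SIZE = CHUNK_SIZE / 4;     // records are capped at 1/4 chunk

    long chunkUsedBytes;   // how much of the current chunk is already filled

    enum Result { APPENDED, RETRY_ON_NEXT_CHUNK, RECORD_TOO_LARGE }

    Result append(byte[] record) {
        if (record.length > MAX_RECORD_SIZE) {
            return Result.RECORD_TOO_LARGE;
        }
        if (chunkUsedBytes + record.length > CHUNK_SIZE) {
            // Pad this chunk to its maximum size on the primary and both secondaries,
            // then ask the client to retry the append on the next chunk.
            chunkUsedBytes = CHUNK_SIZE;
            return Result.RETRY_ON_NEXT_CHUNK;
        }
        // Normal case: append locally, then tell the secondaries to write the
        // record at exactly the same offset before replying success to the client.
        chunkUsedBytes += record.length;
        return Result.APPENDED;
    }
}
```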

If a record append fails on any replica, the client retries the operation. As a consequence, replicas of the same chunk may contain different data, possibly including whole or partial duplicates of the same record. GFS does not guarantee that all replicas of a chunk are byte-for-byte identical; it only guarantees that the data is written at least once as an atomic unit. This property follows from a simple observation: if the operation reports success, the data must have been written at the same offset on all replicas. After that, all replicas are at least as long as the end of that record, so any later record will be appended at a greater offset or in a different chunk, even if the Master later chooses a different replica as the primary. In terms of the consistency model, the regions in which record append successfully wrote its data are defined (and therefore consistent), while the intervening regions are inconsistent (and therefore undefined).

The flow of a data mutation (write or record append) is as follows (a client-side sketch follows the list):

  1. The client asks the Master which chunkserver holds the current lease for the chunk and where the other replicas are. If no one holds a lease, the Master grants one to a replica of its choice (this step is not shown in the figure).
  2. The Master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations; it needs to contact the Master again only when the primary becomes unreachable or replies that it no longer holds the lease.
  3. The client pushes the data to all replicas, in any order. Each chunkserver stores the data in an internal LRU buffer cache until the data is used or ages out. Because the data flow is decoupled from the control flow, the data can be pushed along a path chosen from the network topology regardless of which chunkserver is the primary, which improves performance. Section 3.2 of the paper discusses this further.
  4. Once all replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order (a "mutation" here means a state-changing operation).
  5. The primary forwards the write request to all secondary replicas. Each secondary applies the mutations in the same serial number order assigned by the primary.
  6. The secondaries all reply to the primary, indicating that they have completed the operation.
  7. The primary (the chunkserver holding the primary chunk) replies to the client. Any errors encountered at any of the replicas are reported to the client. If errors occurred, the write may have succeeded at the primary and an arbitrary subset of the secondaries. (If it had failed at the primary, it would not have been assigned a serial number and would not have been forwarded.) The client request is considered to have failed, and the modified region is left in an inconsistent state. The client code handles such errors by retrying the failed mutation: it makes a few attempts at steps (3) through (7) before retrying from the very beginning.
  8. If a single application write is large or straddles a chunk boundary, the GFS client code breaks it into multiple write operations. They all follow the control flow described above, but they may be interleaved with, and overwritten by, concurrent operations from other clients. The shared file region may therefore end up containing fragments from different clients; however, because the individual operations are completed in the same order on all replicas, all replicas of the chunk remain identical.
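A condensed sketch of the client-side control flow above, reusing invented Master and ChunkServer placeholders rather than the real GFS RPCs; the pipelined data push along a chain of chunkservers (next section) is omitted here.

```java
import java.util.List;

/** Toy client-side control flow for a GFS-style write (steps 1-7 above, simplified). */
class ToyGfsWriteClient {
    interface Master {
        WriteTargets findLeaseHolder(long chunkHandle);   // steps 1-2
    }
    interface ChunkServer {
        void pushData(long chunkHandle, byte[] data);     // step 3
        boolean commitWrite(long chunkHandle);            // steps 4-7 (sent to the primary)
    }
    record WriteTargets(ChunkServer primary, List<ChunkServer> secondaries) {}

    private final Master master;

    ToyGfsWriteClient(Master master) { this.master = master; }

    boolean write(long chunkHandle, byte[] data) {
        WriteTargets targets = master.findLeaseHolder(chunkHandle);   // cacheable

        // Step 3: push the data to every replica (order does not matter).
        targets.primary().pushData(chunkHandle, data);
        for (ChunkServer secondary : targets.secondaries()) {
            secondary.pushData(chunkHandle, data);
        }

        // Steps 4-7: ask the primary to commit; it orders the mutation, forwards it
        // to the secondaries, collects their replies, and reports back to us.
        for (int attempt = 0; attempt < 3; attempt++) {   // retry a few times on failure
            if (targets.primary().commitWrite(chunkHandle)) {
                return true;
            }
        }
        return false;   // region may be left inconsistent; caller retries from scratch
    }
}
```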

To use the network efficiently, the data flow is decoupled from the control flow. The control flow goes from the client to the primary and then to all secondaries, while the data is pushed linearly, in a pipelined fashion, along a carefully chosen chain of chunkservers. The goals are to fully utilize each machine's network bandwidth, to avoid network bottlenecks and high-latency links, and to minimize the latency of pushing all the data.

To fully utilize each machine's bandwidth, the data is pushed linearly along a chain of chunkservers rather than distributed in some other topology (for example, a tree). In the linear push, each machine's entire outbound bandwidth is used to transfer the data as fast as possible, rather than being divided among multiple recipients.
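A tiny sketch of that linear push under invented names: the client hands the data only to the first chunkserver in the chosen chain, and each server forwards it to the next. (In real GFS a server starts forwarding as soon as it starts receiving, which this sketch does not model.)

```java
import java.util.List;

/** Toy linear data push: each chunkserver forwards to the next server in the chain. */
class ToyPipelinePush {
    interface ChunkServerLink {
        void receive(byte[] data, List<ChunkServerLink> remainingChain);
    }

    /** The client sends the data only to the first server in the chosen chain. */
    static void push(byte[] data, List<ChunkServerLink> chain) {
        if (!chain.isEmpty()) {
            chain.get(0).receive(data, chain.subList(1, chain.size()));
        }
    }

    /** A server stores the data locally, then forwards it down the rest of the chain. */
    static class ToyChunkServer implements ChunkServerLink {
        @Override
        public void receive(byte[] data, List<ChunkServerLink> remainingChain) {
            // store data in the local LRU buffer cache (omitted), then forward
            push(data, remainingChain);
        }
    }
}
```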

 

Fault Tolerance

 


Origin: blog.csdn.net/qq_38289815/article/details/103228536