The Hadoop trilogy begins~


Getting started with big data usually begins with Hadoop. From this article you can learn the following:

  1. Basic features of Hadoop

  2. HDFS read process

  3. HDFS write process

  4. HDFS append process

  5. Consistency Guarantee of HDFS Data Blocks

1. Basic features of Hadoop

Hadoop is a distributed system infrastructure maintained by the Apache Foundation. The core of the Hadoop framework is MapReduce and HDFS. HDFS (Hadoop Distributed File System) provides storage for massive amounts of data: it is highly fault-tolerant, offers high throughput, and can store huge volumes of data on inexpensive hardware. HDFS relaxes some POSIX requirements so that data can be accessed as a stream. MapReduce, in turn, provides computation over that massive data. Hadoop has the following advantages:

1. High reliability: Hadoop's ability to store and process data bit by bit can be trusted.

2. High scalability: Hadoop distributes massive data across the available machines and can easily scale to thousands of nodes to grow both storage capacity and computing power.

3. High efficiency: Hadoop can move data dynamically between nodes and keep the load balanced across them, so processing is very fast.

4. High fault tolerance: Hadoop automatically keeps multiple copies of data and automatically reassigns failed tasks.

5. Low cost: Compared with all-in-one machines, commercial data warehouses, and data marts such as QlikView and Yonghong Z-Suite, Hadoop is open source, so the software cost of a project is greatly reduced.

This article will introduce the three core processes around HDFS components.


2. HDFS read process

(Figure: overall HDFS read process)

The overall read process is shown in the figure above; the details are as follows:

  1. The client issues a read command or calls the read API. Under the hood it communicates with the NameNode through the ClientProtocol and obtains the block information corresponding to the starting position of the file, according to the read request.

  2. The block information the client receives also includes the real storage locations of the blocks, i.e. on which machine in which rack each replica lives. The client then picks the DataNode closest to itself and establishes a connection. Why the closest one? Because reading involves network I/O, serialization, deserialization, and other time-consuming remote operations. Once the connection is established, the block read begins.

  3. Under the hood, the client obtains data from the DataNode in the form of packets by calling the DFSInputStream.read() method.

  4. When the client receives a data block, it must first verify the checksum; otherwise a corrupted block would slip through unnoticed. If the checksum does not match, or the DataNode fails during the read, the bad block is reported to the NameNode through the ClientProtocol.reportBadBlocks method. On receiving this report the NameNode schedules re-replication of the block, because the default replication factor is 3 and one replica is now missing or corrupt, so it has to be copied again.

  5. The client then contacts another DataNode that holds a replica of the block and pulls the data from there, again performing checksum verification; no part of this process is skipped.

  6. After a block has been read, the client requests the NameNode again for the DataNode information of the next block.

  7. Steps 1 to 6 are repeated in this way until all blocks have been read, and finally the input stream is closed.

To summarize, there are roughly these steps:

  1. The client initiates a file open request

  2. It obtains DataNode location information

  3. It communicates with DataNodes to continuously fetch data blocks

  4. It verifies each block and handles exceptions

  5. It closes the input stream. Done!
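To make the client side concrete, here is a minimal sketch of the read path using the public org.apache.hadoop.fs.FileSystem API, which is backed by DFSInputStream under the hood. The cluster address and file path are illustrative assumptions, not values from this article.

```java
// Minimal HDFS read sketch (assumed address and path, for illustration only).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // fs.defaultFS normally comes from core-site.xml; set here as an assumption.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             // open() asks the NameNode for block locations and returns an
             // FSDataInputStream wrapping DFSInputStream.
             FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
            byte[] buffer = new byte[4096];
            int bytesRead;
            // Each read() pulls packets from the nearest DataNode holding the
            // current block and verifies checksums before returning the bytes.
            while ((bytesRead = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, bytesRead);
            }
            System.out.flush();
        }
    }
}
```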

3. HDFS write process

(Figure: overall HDFS write process)

The specific process of writing to HDFS is as follows:

  1. When the client initiates a write request, the NameNode first creates an empty file and records this operation in the editLog. A DFSOutputStream output stream object is then returned to the client so that the actual writing can begin.

  2. With the output stream object in hand, the client applies to the NameNode for a data block; otherwise, how would it know where to write? Once the client gets the corresponding DataNode location information, it contacts those DataNodes and establishes a pipeline that will be used to transmit the data.

  3. The client then starts sending data. It cannot send everything at once, so the data is split into packets and sent to the DataNodes; each DataNode confirms the packets it receives and returns an ack to the output stream if there is no problem.

  4. As mentioned above, HDFS is highly fault-tolerant. If the data were written to only one node and that node went down, the data would be lost. So when data is written to a node, that node forwards it to the other nodes along the pipeline, so that multiple replicas are kept. How many replicas to keep depends on your actual storage resources and the importance of the data.

  5. When all DataNodes have confirmed that a packet was written successfully, the output stream removes it from the cache queue (this queue holds the packets that have not yet been acknowledged). When a DataNode has successfully received a whole block, it reports to the NameNode, and the NameNode updates the metadata in memory (i.e. which nodes store the block and what data range the block covers).

  6. When the client fills up a block, it applies to the NameNode for a new data block and repeats steps 1~5 in a loop. (The block size used to be 64 MB; it is now 128 MB by default, and in production it may be set to 256 MB.)

  7. When the client has written all the data of the file, it closes the output stream and notifies the NameNode to commit all of the file's data blocks.

To summarize:

  1. Create an empty file

  2. Build the stream pipeline

  3. Write the data, replicating it across DataNodes

  4. Close the pipeline and commit the file
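As with reading, the write path can be sketched with the public FileSystem API (DFSOutputStream sits underneath). The address, path, and replication factor below are assumptions for illustration.

```java
// Minimal HDFS write sketch (assumed address, path and replication factor).
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed address

        try (FileSystem fs = FileSystem.get(conf)) {
            // create() asks the NameNode to create an empty file entry and
            // returns an FSDataOutputStream backed by DFSOutputStream.
            short replication = 3; // keep three replicas of each block
            try (FSDataOutputStream out =
                         fs.create(new Path("/data/output.txt"), replication)) {
                // Data written here is split into packets, pushed down the
                // DataNode pipeline, and acknowledged back to the stream.
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }
            // Closing the stream flushes the remaining packets and asks the
            // NameNode to commit (complete) the file's blocks.
        }
    }
}
```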

4. HDFS append process

(Figure: HDFS append process)

Besides the normal write process, there are also scenarios where data is appended to an existing file. The internal process is as follows:

  1. The client initiates a request to append to an existing file. The NameNode returns the location information of the last block of the file, and a DFSOutputStream output stream is instantiated. The client also obtains a lease (leases will not be discussed further here; think of a lease as a ticket with an expiry time).

  2. If the last block is already full, Null is returned and a new data block is requested from the NameNode; otherwise, a pipeline is established on the last block and data is written to it.

  3. The client starts sending data. The data is split into packets and sent to the DataNodes, replicated between the DataNodes along the pipeline, and each DataNode confirms by returning an ack to the output stream.

  4. When all DataNodes confirm that a packet was written successfully, the output stream removes it from the cache queue, just as in the normal write process.

  5. When the client fills up a data block, it applies for a new data block and repeats steps 1~5 in a loop.

  6. When the client has written all the data of the file, it closes the output stream and notifies the NameNode to commit all of the file's data blocks.
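For completeness, appending can be sketched with FileSystem.append(), which obtains the lease and the last block's location from the NameNode and reopens the pipeline on that block. It only works on clusters where append is supported; the address and path are, again, assumptions.

```java
// Minimal HDFS append sketch (assumed address and path; requires append support).
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed address

        try (FileSystem fs = FileSystem.get(conf);
             // append() acquires a lease and the last block's location, then
             // re-establishes the pipeline on that block.
             FSDataOutputStream out = fs.append(new Path("/data/output.txt"))) {
            out.write("another line\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```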

The above is the core trilogy of HDFS. Now think about the write process again: if an exception occurs while writing, will data be lost? And how does Hadoop ensure that the data blocks on all nodes in the cluster are consistent and correct?

5. Data fault tolerance and consistency guarantee

Let's first look at how data is kept from being lost during the write process.

First, recall the design of the output stream: as mentioned above, it keeps a cache queue of packets that have not yet been acknowledged. If a node fails or network communication breaks during the write, the unacknowledged packets are put back into the sending queue, so no part of a block is lost, because every packet must eventually be acknowledged.
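On top of that guarantee, applications that cannot wait until the file is closed can force buffered packets down the pipeline with the Syncable methods on the output stream. A small sketch, with an assumed path:

```java
// Sketch: forcing buffered data down the DataNode pipeline before close().
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFlushExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/log.txt"))) {
            out.write("record 1\n".getBytes(StandardCharsets.UTF_8));
            // hflush(): push the data to every DataNode in the pipeline and
            // make it visible to new readers (not necessarily on disk yet).
            out.hflush();
            out.write("record 2\n".getBytes(StandardCharsets.UTF_8));
            // hsync(): like hflush(), but also asks the DataNodes to persist
            // the data to their local disks.
            out.hsync();
        }
    }
}
```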

The other question is: how does the cluster ensure that data blocks are consistent and correct on all nodes?

The answer is a timestamp (generation stamp) mechanism: a timestamp is obtained when a data block is written, and the pipeline is established with this timestamp. Two situations can then arise:

The first is a DataNode that only appears to be dead: while data is being written to a DataNode, the node may, for example, come under such heavy load that it stops sending heartbeats to the NameNode and is mistakenly considered dead. The client reports this to the NameNode, applies for a new timestamp, and establishes a new pipeline. When the node recovers, the timestamp it recorded before the failure no longer matches the one saved by the NameNode, so the DataNode deletes the stale block, and all replicas of the block remain consistent.

The second is a DataNode that is genuinely down: in this case the client communicates with a new DataNode, applies for a new timestamp, and establishes a new pipeline. Since the new node does not yet hold the data already written for this block, that data is copied to it from another DataNode, ensuring that the block is not lost.
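When reasoning about replica consistency, it can help to see where a file's block replicas actually live. The FileSystem API can list them; the address and path below are assumptions.

```java
// Sketch: listing the DataNodes that hold each block of a file.
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocationsExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/example.txt");
            FileStatus status = fs.getFileStatus(path);
            // One BlockLocation per block, each listing the hosts that store
            // a replica of that block.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        Arrays.toString(block.getHosts()));
            }
        }
    }
}
```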


Origin blog.csdn.net/qq_28680977/article/details/122149413