Reviewing the HDFS data read and write process

HDFS data reading process

1. The client opens the desired file by calling open() on the FileSystem object.
2. The client issues an RPC request to the NameNode to determine the locations of the requested file's blocks;
3. The NameNode returns a partial or complete list of the file's blocks, and for each block it returns the addresses of the
DataNodes holding a replica. These DataNode addresses are sorted according to two rules: DataNodes close to the client in the
cluster's network topology are ranked first, and DataNodes whose heartbeat reports have timed out (marked STALE) are ranked
last;
4. The client selects the top-ranked DataNode to read the block from. If the client is itself a DataNode holding a replica, the
data is obtained directly from the local disk (the short-circuit read feature);
5. Under the hood, a socket stream (FSDataInputStream) is established, and the read() method inherited from the parent class
DataInputStream is called repeatedly until reading of this block is complete;
6. Blocks are read in parallel, and a failed read is retried from another replica;
7. After the blocks in the current list have been read, if the end of the file has not been reached, the client requests the
next batch of block locations from the NameNode;
8. The NameNode returns the list of the follow-up blocks;
9. Finally, the client closes the read stream, and all of the blocks that were read are merged into a complete file (a minimal
client-side sketch follows this list).
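All of this read-side machinery is hidden behind the FileSystem API: open() triggers the block-location RPC, and the returned stream handles DataNode selection, retries, and merging. Below is a minimal Java sketch of the client side; the NameNode URI and the file path are illustrative assumptions, not values from the original post.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; substitute your cluster's URI.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // open() performs the RPC to the NameNode for block locations (steps 1-3).
        try (InputStream in = fs.open(new Path("/user/demo/input.txt"))) {
            // Repeated read() calls inside copyBytes pull the data block by
            // block from the selected DataNodes (steps 4-9).
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}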

HDFS data writing process


1. The client initiates a file upload request and communicates with the NameNode through RPC. The NameNode checks whether the
target file already exists and whether its parent directory exists, and returns whether the file can be uploaded;
2. The client asks the NameNode which DataNode servers the first block should be transferred to;
3. The NameNode allocates DataNodes according to the replication factor specified in the configuration file and the
rack-awareness policy, and returns the addresses of the available DataNodes, for example: A, B, and C;
4. The client requests one of the three DataNodes, A, to upload the data (essentially an RPC call that establishes a pipeline).
On receiving the request, A calls B, then B calls C; once the entire pipeline is established, acknowledgments are returned to
the client step by step;
5. The client begins to upload the first block to A (first reading the data from disk into a local memory cache), in units of
packets (64 KB by default). A receives a packet and forwards it to B, and B forwards it to C; each packet A sends is placed in
a response queue to wait for an acknowledgment;
6. The data is divided into packets that are transmitted in sequence along the pipeline. In the reverse direction of the
pipeline, acks (acknowledgments of correct receipt) are sent back one by one; finally, the first DataNode in the pipeline, A,
sends a pipeline ack to the client;
7. The client closes the write stream;
8. When the transmission of one block is complete, the client again requests the NameNode to allocate DataNodes for the second
block (a minimal client-side sketch follows this list).
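As on the read side, the pipeline mechanics are handled inside the client library: create() performs the NameNode checks, and the returned stream splits the data into packets and pushes them down the DataNode pipeline. The sketch below mirrors the read example; the NameNode URI, output path, and sample data are illustrative assumptions.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replication factor of 3, matching DataNodes A, B, and C above.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // create() asks the NameNode whether the file can be written (step 1).
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
            // The client library buffers this data, splits it into packets,
            // and streams them down the A -> B -> C pipeline (steps 4-6).
            out.writeBytes("hello hdfs\n");
        } // Closing the stream flushes any remaining packets and completes the file (step 7).
        fs.close();
    }
}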
 

Origin blog.csdn.net/bbvjx1314/article/details/105444124