A must-ask question in big data interviews: the HDFS read and write process

HDFS read and write process

This question is indispensable in big data interviews, yet many candidates cannot answer it completely, so it is worth memorizing. Many follow-up questions are also derived from the HDFS read and write process.

1. HDFS read process

  1. The client sends an RPC request to the NameNode asking for the locations of the file's blocks;
  2. After receiving the request, the NameNode checks the user's permissions and whether the file exists. If both checks pass, it returns part or all of the block list as appropriate. For each block, the NameNode returns the addresses of the DataNodes that hold a replica of it; these addresses are sorted according to the cluster topology by two rules: in the network topology, the DataNode closest to the client is ranked first, and a DataNode whose heartbeat has timed out (state STALE) is ranked last;
  3. The client selects the top-ranked DataNode to read the block. If the client itself is a DataNode, it reads the data directly from the local disk (the short-circuit read feature);
  4. At the bottom layer this is essentially a socket stream (FSDataInputStream); the read method of its parent class DataInputStream is called repeatedly until all the data in the block has been read;
  5. After the blocks in the list have been read, if the end of the file has not yet been reached, the client requests the next batch of block locations from the NameNode;
  6. After each block is read, a checksum verification is performed. If an error occurs while reading from a DataNode, the client notifies the NameNode and then continues reading from the next DataNode that holds a replica of that block;
  7. The read method reads block information in parallel, not block by block; the NameNode only returns the DataNode addresses for the blocks the client requested, not the block data itself;
  8. Finally, all the blocks that have been read are merged into the complete file (a minimal client-side read sketch follows this list).
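
To make the client side of this flow concrete, here is a minimal read sketch using the Hadoop FileSystem API. The NameNode URI and file path are placeholders (assumptions, not from the original article); the block-location lookups, DataNode selection, and checksum verification described above all happen inside FSDataInputStream.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             // open() triggers the RPC to the NameNode that returns block locations
             FSDataInputStream in = fs.open(new Path("/data/example.txt"))) {
            byte[] buffer = new byte[4096];
            int bytesRead;
            // Each read() pulls data from a DataNode holding the current block;
            // moving across block boundaries and checksum verification are
            // handled inside the stream, invisible to the caller.
            while ((bytesRead = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, bytesRead);
            }
            System.out.flush();
        }
    }
}
```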

2. HDFS write process

  1. The client sends an upload request and establishes communication with the NameNode via RPC. The NameNode checks whether the user has upload permission and whether a file with the same name already exists in the target HDFS directory. If either check fails, an error is returned directly; if both pass, the NameNode tells the client that it may upload;
  2. The client splits the file into blocks (128 MB each by default). After splitting, it asks the NameNode which servers the first block should be uploaded to;
  3. After receiving the request, the NameNode allocates DataNodes according to the network topology, rack awareness, and the replication policy, and returns the addresses of the available DataNodes;
  4. After receiving the addresses, the client communicates with one node in the list, say A; this is essentially an RPC call to establish a pipeline. A then calls B, and B calls C, completing the pipeline, and the acknowledgement is returned to the client step by step;
  5. The client starts sending the first block to A (data is read from disk and placed into a local memory cache first) in units of packets (64 KB each). When A receives a packet it forwards it to B, and B forwards it to C; each packet, once transmitted, is placed into a reply queue to wait for its acknowledgement;
  6. The data is split into individual packets that are transmitted sequentially along the pipeline. In the reverse direction of the pipeline, acks (acknowledgement responses) are sent back one by one, and finally the first DataNode in the pipeline, A, sends the pipeline ack to the client;
  7. After one block has been transmitted, the client again asks the NameNode where to upload the second block, and the NameNode again selects three DataNodes for the client (a minimal client-side write sketch follows this list).
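
Correspondingly, a minimal write sketch with the same API; the NameNode URI, output path, and sample data are assumptions for illustration. The packet pipelining and ack handling described above are hidden inside FSDataOutputStream.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Assumed NameNode address; 128 MB blocks and 3 replicas are the usual
        // defaults (dfs.blocksize, dfs.replication) unless overridden.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             // create() performs the RPC that checks permissions and name
             // collisions on the NameNode before any data is written
             FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            // Writes are buffered into packets and pushed down the DataNode
            // pipeline; the client does not contact every replica directly.
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            // hflush() forces buffered packets through the pipeline so that
            // new readers can see the data.
            out.hflush();
        }
    }
}
```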


Source: blog.csdn.net/m0_58353740/article/details/131353868