Hadoop Study Notes (3): Distributed File System Write and Read Processes

Write process: how a file is split into blocks and uploaded to the servers

Read process: how blocks of data are read back from the various servers

 

Writing Process

Figure I

Figure II

 

The write process: the NameNode allocates block storage locations (this block metadata is kept in the NameNode's memory). Every file to be stored first has a path created for it on the NameNode; afterwards, whenever the HDFS client reads or writes this data, it first goes to the NameNode with that path to learn where to download the file from or where to upload it to. How the NameNode actually allocates storage locations for a block follows the principles shown in the figures above.

An example to aid understanding of the file write process:

   Suppose we have a file test.txt that we want to put into Hadoop; we execute one of the following commands:


        # hadoop fs -put /usr/bigdata/dataset/input/20130706/test.txt /opt/bigdata/hadoop/dataset/input/20130706
        # or, equivalently:
        # hadoop fs -copyFromLocal /usr/bigdata/dataset/input/20130706/test.txt /opt/bigdata/hadoop/dataset/input/20130706
       
The entire write process is as follows:

Step 1: The client calls the create() method of DistributedFileSystem to start creating the new file. DistributedFileSystem creates a DFSOutputStream and issues an RPC call asking the NameNode to create a new file in the filesystem's namespace.

Step 2: After receiving the client's write RPC request, the NameNode first performs various checks, such as whether the client has the required permissions and whether the file already exists. If the checks pass, the new file is created and the operation is recorded in the edit log, and DistributedFileSystem wraps the DFSOutputStream in an FSDataOutputStream instance, which is returned to the client; otherwise, file creation fails and an IOException is thrown to the client.

Step 3: The client begins writing the file. DFSOutputStream splits the data into packets and places them on an internal queue called the data queue. It then asks the NameNode for a list of DataNodes suitable for storing the data replicas, and these DataNodes form a pipeline for the data stream. Assuming the replication factor is currently set to 3, the pipeline contains three DataNodes.

Step 4: DFSOutputStream writes each packet to the first DataNode in the pipeline; the first DataNode stores the packet and forwards it to the second DataNode in the pipeline; likewise, the second node stores the received data and forwards it to the third DataNode in the pipeline.

Step 5: DFSOutputStream also maintains a second internal queue of packets awaiting write confirmation, the ack queue. When a packet has been stored successfully on the third DataNode in the pipeline, that node sends an acknowledgment back to the second DataNode; once the second DataNode has received the acknowledgment and its own write has succeeded, it sends an acknowledgment to the first DataNode in the pipeline; when the first node has received the acknowledgment and its own write has also succeeded, the packet is removed from the ack queue.
While data is being written, what happens if a DataNode in the pipeline fails or a write error occurs? In that case, the following actions are taken:
First, the pipeline is closed, and the packets in the ack queue are moved back to the front of the data queue so that no packets are lost. The current block on the healthy DataNodes is given a new identity, which is reported to the NameNode, so that the partial block stored on the failed DataNode can be deleted once that node recovers.
Next, the block version on the healthy DataNodes is upgraded, so that the stale block data on the failed DataNode will be deleted when the node comes back online, and the failed node is removed from the pipeline.
Finally, the remaining data is written to the two healthy DataNodes left in the pipeline.
If multiple DataNodes in the pipeline fail while writing, the write is still considered successful as long as the number of successfully written replicas reaches dfs.replication.min (default 1); the NameNode then asynchronously replicates the block to other nodes until the replica count reaches the configured dfs.replication value. (A short sketch of reading these two settings follows.)
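
As a minimal sketch (assuming a Hadoop client classpath with hdfs-site.xml on it), the two replication settings mentioned above can be read through Hadoop's Configuration API. Note that newer Hadoop releases name the minimum-replication key dfs.namenode.replication.min; dfs.replication.min is the older name:

    import org.apache.hadoop.conf.Configuration;

    public class ReplicationSettings {
        public static void main(String[] args) {
            // Loads core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            // Target number of replicas per block (default 3).
            int replication = conf.getInt("dfs.replication", 3);
            // Minimum replicas for a write to count as successful (default 1).
            // Older releases used the key "dfs.replication.min" instead.
            int minReplication = conf.getInt("dfs.namenode.replication.min", 1);
            System.out.println("dfs.replication = " + replication);
            System.out.println("min replication = " + minReplication);
        }
    }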
Step 6: After the write is complete, the client calls close() to close the stream and flush the data.

Step 7: Once the stream is closed and the data flushed, the NameNode is notified that the file is complete. The entire write operation is now finished. (A client-side sketch of the whole flow follows.)
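
To tie the steps together, here is a minimal client-side sketch of the write flow (assuming a Hadoop client classpath and that fs.defaultFS in the configuration points at the cluster; the destination path is the one from the example above). The packet queues and the DataNode pipeline are all handled inside the stream:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.charset.StandardCharsets;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path dst = new Path("/opt/bigdata/hadoop/dataset/input/20130706/test.txt");
            // Steps 1-2: create() triggers the RPC to the NameNode, which runs its
            // checks and hands back an FSDataOutputStream wrapping a DFSOutputStream.
            try (FSDataOutputStream out = fs.create(dst)) {
                // Steps 3-5: write() feeds packets into the data queue; the DataNode
                // pipeline and the ack queue are managed internally by the stream.
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            } // Steps 6-7: close() flushes remaining packets and completes the file on the NameNode.
        }
    }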

 

Reading Process

An example to aid understanding of the file read process:

The client calls the FileSystem.open() method:

  1. FileSystem communicates with the NameNode (NN) via RPC, and the NN returns part or all of the file's block list (each block entry carries the addresses of the DataNodes (DN) that hold its replicas).

  2. The client selects the nearest DN, establishes a connection, reads the block, and an FSDataInputStream is returned. (A sketch of inspecting block locations from the client side follows this list.)
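
As a side note, the block list that the NN returns can also be inspected from client code via FileSystem.getFileBlockLocations(); a minimal sketch (same assumptions as the write example above, path reused from it):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/opt/bigdata/hadoop/dataset/input/20130706/test.txt");
            FileStatus status = fs.getFileStatus(p);
            // Ask the NN which DNs hold each block of the file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
        }
    }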

The client calls the input stream's read() method:

  1. When the end of a block is reached, FSDataInputStream closes the connection to the current DN and finds the nearest DN holding the next block to be read.

  2. Blocks are checksum-verified as they are read; if an error occurs while reading from a DN, the client notifies the NN and then continues reading from another DN that holds a replica of the block.

  3. If the file has not been fully read when the current block list is exhausted, FileSystem fetches the next batch of block locations from the NN.

Finally, the client closes the FSDataInputStream. (A minimal read sketch follows.)
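
A minimal client-side sketch of the read flow (same assumptions as the write example); the nearest-DN selection, checksum verification, and failover all happen inside the stream:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path src = new Path("/opt/bigdata/hadoop/dataset/input/20130706/test.txt");
            // open() makes the RPC to the NN and returns an FSDataInputStream that
            // already knows the locations of the file's first blocks.
            try (FSDataInputStream in = fs.open(src)) {
                // Reading connects to the nearest DN for each block in turn; checksum
                // verification and failover to replica DNs happen transparently.
                IOUtils.copyBytes(in, System.out, 4096, false);
            } // Closing the stream closes the current DN connection.
        }
    }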


 

 


Reference links: https://blog.csdn.net/zhang123456456/article/details/77882866 ; https://www.cnblogs.com/laowangc/p/8949850.html

 
