Big Data in Practice: A Two-Step Tutorial on HDFS Reads and Writes

One, prerequisites for HDFS reads and writes

1. NameNode (metadata node): stores the metadata (namespace, replica count, permissions, block list, cluster configuration information) but none of the actual data. The NameNode keeps the file system metadata in memory.

2. DataNode (data node): where the data is actually stored, in units of blocks. The default block size is 128 MB. Each DataNode periodically reports all of its block information to the NameNode. The client communicates with the NameNode and then reads data from or writes data to the DataNodes.

3. SecondaryNameNode (secondary metadata node): not a standby for the NameNode; it cooperates with the NameNode but does a different job. The SecondaryNameNode periodically merges the NameNode's namespace edit log with the fsimage, helping the NameNode persist its in-memory metadata to disk.

4. Client: the application and interface through which the HDFS file system is accessed; it is what triggers HDFS read/write operations.
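
The client role described above corresponds to Hadoop's FileSystem API in Java. Below is a minimal sketch, assuming the Hadoop client libraries are on the classpath and using a placeholder NameNode address, of how an application obtains the HDFS handle that drives the read and write flows covered later:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        // The NameNode address below is a placeholder; use your cluster's fs.defaultFS.
        Configuration conf = new Configuration();

        // FileSystem is the client-side interface to HDFS; every read and write
        // described in this article goes through it.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf)) {
            boolean exists = fs.exists(new Path("/user/demo")); // illustrative path
            System.out.println("/user/demo exists: " + exists);
        }
    }
}
```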

It is worth noting:

1. The client actually uploads the data to only one DataNode; the other two replicas are arranged by the NameNode, and the DataNodes copy the data among themselves. Once replication is complete, the result is reported back step by step to the NameNode. If the second or third DataNode fails to replicate, the NameNode assigns a new DataNode address. By default the client only needs to upload to one DataNode; the DataNodes make the remaining copies themselves (the replication factor, like the block size, is also visible to the client, as sketched below).

2. Splitting the file into blocks is done by the client. Uploading to the first DataNode and replicating to the second and third DataNodes happen asynchronously.
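
Both the replication factor mentioned above and the 128 MB default block size can be set per file from the client side. A minimal sketch, assuming the standard Hadoop Java API and illustrative paths and values, of creating a file with an explicit replication factor and block size:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationAndBlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/demo/replicated.txt"); // illustrative path

            // create(path, overwrite, bufferSize, replication, blockSize):
            // ask explicitly for 3 replicas and 128 MB blocks.
            long blockSize = 128L * 1024 * 1024;
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, (short) 3, blockSize)) {
                out.writeUTF("hello hdfs");
            }

            // The replication factor of an existing file can be changed later;
            // the NameNode then schedules the extra copies or deletions.
            fs.setReplication(file, (short) 2);
        }
    }
}
```

Note that only the client asks for these values; enforcing the replica placement and re-replication on failure remains the NameNode's job, as described in the notes above.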

Two, the HDFS write process:

1. The client asks the NameNode to upload a file; the NameNode checks whether the target file already exists and whether the parent directory exists.

2. The NameNode responds whether the file can be uploaded.

3. The client asks which DataNode servers the first block should be uploaded to.

4. The NameNode returns three DataNode servers: A, B, and C.

5. The client requests that A upload the data (essentially an RPC call that establishes a pipeline). A receives the request and in turn calls B, and B calls C; once the pipeline of the three DataNodes is established, the acknowledgement is returned step by step back to the client.

6. The client starts uploading the first block to A (reading the data from disk into a local memory cache) in units of packets. A passes each packet it receives on to B, and B passes it on to C; every packet A sends is placed in an acknowledgement queue to wait for a reply.

7. When one block has been transferred, the client again asks the NameNode for the DataNode servers for the second block.
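
Application code never sees this pipeline directly; the Hadoop client library drives it behind an output stream. A minimal write sketch, assuming the standard Hadoop Java API and illustrative paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy a local file into HDFS: the client library splits it into
            // blocks, asks the NameNode for DataNodes, and drives the packet
            // pipeline described in the steps above.
            fs.copyFromLocalFile(new Path("/tmp/local.log"),   // illustrative
                                 new Path("/logs/local.log"));

            // Or write through a stream: packets are sent to the DataNode
            // pipeline as the stream is filled and flushed.
            try (FSDataOutputStream out = fs.create(new Path("/logs/events.txt"))) {
                out.writeBytes("first block of data\n");
            }
        }
    }
}
```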

Three, the HDFS read process:

1. The client communicates with the NameNode to query the metadata and find the DataNode servers where the file's blocks are located.

2. It picks a DataNode server (nearest first, then at random) and requests to establish a socket stream.

3. The DataNode starts sending the data (reading it from disk into the stream and checksumming it in units of packets).

4. The client receives the packets, caches them locally first, and then writes them to the target file.
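
On the read side, the client library likewise hides DataNode selection and checksum verification behind an input stream. A minimal read sketch, assuming the standard Hadoop Java API and illustrative paths:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        try (FileSystem fs = FileSystem.get(conf)) {
            // open() asks the NameNode for block locations, then streams packets
            // from a nearby DataNode, verifying checksums along the way.
            try (FSDataInputStream in = fs.open(new Path("/logs/events.txt"));
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }

            // Or copy the whole file back to the local file system.
            fs.copyToLocalFile(new Path("/logs/events.txt"),
                               new Path("/tmp/events.txt")); // illustrative
        }
    }
}
```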


Source: https://blog.csdn.net/chengxvsyu/article/details/91492206