Review of the HDFS replica storage mechanism

HDFS replica storage mechanism


The first replica comes from the client: the client writes the first copy of the data directly to a DataNode.


The second replica is stored on the same rack as the first, but on a different node. The node is chosen according to certain rules (CPU, memory, and I/O utilization, plus remaining disk capacity).


The third replica is stored on a rack different from the one holding replicas 1 and 2, logically the closest such rack. Within that rack, a node is chosen by the same rules (CPU, memory, and I/O utilization, plus remaining disk capacity).
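The three placement rules above can be sketched as follows. The cluster layout, the node names, and the single "load" score (standing in for the CPU/memory/IO-utilization and remaining-disk rules) are all invented for illustration; real HDFS implements this inside the NameNode's block-placement policy.

```python
# rack name -> list of (node, load score); lower load is better.
CLUSTER = {
    "rack1": [("dn1", 0.3), ("dn2", 0.1), ("dn3", 0.7)],
    "rack2": [("dn4", 0.2), ("dn5", 0.5)],
}

def place_replicas(client_node):
    # Replica 1: the node the client writes to.
    first_rack = next(rack for rack, nodes in CLUSTER.items()
                      if any(name == client_node for name, _ in nodes))
    # Replica 2: a different node on the same rack, lowest load wins.
    same_rack = [(n, load) for n, load in CLUSTER[first_rack]
                 if n != client_node]
    second = min(same_rack, key=lambda x: x[1])[0]
    # Replica 3: the lowest-load node on a different rack.
    other_racks = [(n, load) for rack, nodes in CLUSTER.items()
                   if rack != first_rack for n, load in nodes]
    third = min(other_racks, key=lambda x: x[1])[0]
    return [client_node, second, third]

print(place_replicas("dn1"))  # ['dn1', 'dn2', 'dn4']
```

With "dn1" as the writer, replica 2 goes to dn2 (the least-loaded other node on rack1) and replica 3 to dn4 (the least-loaded node on rack2).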
 



The DataNode's role


1. Serve data reads and writes (responding to client requests)


2. Periodically report block information and checksums to the NameNode


The heartbeat interval is 3 seconds; if the NameNode receives no report from a DataNode for 10 minutes, the node is considered lost (down).
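The 10-minute figure follows from how the NameNode derives its dead-node timeout from the 3-second heartbeat interval; with stock HDFS defaults the exact value is 10 minutes 30 seconds:

```python
# NameNode dead-node timeout, using stock HDFS defaults:
#   dfs.namenode.heartbeat.recheck-interval = 300 s (5 min)
#   dfs.heartbeat.interval                  = 3 s
RECHECK_INTERVAL_S = 300
HEARTBEAT_INTERVAL_S = 3

# timeout = 2 * recheck interval + 10 * heartbeat interval
timeout_s = 2 * RECHECK_INTERVAL_S + 10 * HEARTBEAT_INTERVAL_S
print(timeout_s)  # 630 seconds, i.e. 10 min 30 s
```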


3. Perform pipeline replication (blocks are copied point to point, node to node, along a pipeline)
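A minimal sketch of the pipeline idea: the client hands each packet only to the first DataNode, each node stores a copy and forwards the packet to its successor, and acknowledgements travel back in reverse order. The node names are invented.

```python
def pipeline_replicate(packet, pipeline):
    """Simulate pipelined (point-to-point) replication of one packet."""
    stores = {}
    for node in pipeline:            # data flows dn1 -> dn2 -> dn3
        stores[node] = packet        # each node persists its copy ...
        # ... then forwards the packet to the next node in the pipeline
    acks = list(reversed(pipeline))  # acks flow dn3 -> dn2 -> dn1 -> client
    return stores, acks

stores, acks = pipeline_replicate(b"packet-0", ["dn1", "dn2", "dn3"])
print(acks)  # ['dn3', 'dn2', 'dn1']
```

The point of the pipeline is that the client uploads each packet once; the DataNodes themselves fan the data out, spreading the network cost across the cluster.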
 



Rack awareness


In practice, engineers need to create a script (Python or shell) that records the mapping between host IPs and the switches/racks they are attached to. It is registered in core-site.xml by adding the following property:

<property>
  <name>topology.script.file.name</name>
  <value>/home/bigdata/apps/hadoop/etc/hadoop/RackAware.py</value>
</property>
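A topology script receives one or more IPs or hostnames as arguments and must print one rack path per line on stdout. A minimal sketch of such a RackAware.py follows; the IP-to-rack table and the rack names are invented, and a real deployment would generate the table from the switch/wiring inventory.

```python
#!/usr/bin/env python3
import sys

# Assumed host-to-rack mapping for illustration only.
RACK_MAP = {
    "192.168.1.101": "/dc1/rack1",
    "192.168.1.102": "/dc1/rack1",
    "192.168.1.201": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"

def rack_of(host):
    # Unknown hosts fall back to the default rack.
    return RACK_MAP.get(host, DEFAULT_RACK)

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_of(host))
```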
 

RPC (remote procedure call) is the mechanism used for data communication between the components and modules of the cluster, such as the DataNode heartbeats described above.
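Hadoop ships its own RPC framework (built on Protocol Buffers in current versions), but the general shape of a remote procedure call can be shown with a stand-in from Python's stdlib xmlrpc; the heartbeat method and DataNode id here are invented for illustration.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# "NameNode" side: expose a procedure that remote callers can invoke.
def heartbeat(datanode_id):
    return "ack:" + datanode_id

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(heartbeat)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# "DataNode" side: call the remote procedure as if it were local.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
reply = proxy.heartbeat("dn1")
server.shutdown()
print(reply)  # ack:dn1
```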



Origin blog.csdn.net/bbvjx1314/article/details/105444079