HDFS replica storage mechanism
The first replica is stored on the node where the client resides.
The second replica is stored on the same rack as the first replica, but not on the same node. A node is chosen according to certain rules (CPU, memory, and I/O utilization, and the remaining capacity of the hard disk).
The third replica is stored on a rack different from the first and second replicas, on the rack that is logically closest to the one holding replicas 1 and 2. A node is again chosen according to the same rules (CPU, memory, and I/O utilization, and the remaining capacity of the hard disk).
Illustration:
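The placement policy above can be sketched in a few lines. This is a minimal illustration assuming hypothetical `Node` records with a `rack` label; real HDFS also weighs node load and remaining disk space when choosing among candidates.

```python
import random
from collections import namedtuple

# Hypothetical node record: name plus the rack it sits on (assumption).
Node = namedtuple("Node", ["name", "rack"])

def place_replicas(nodes, client_node):
    """Pick three replica targets following the policy described above."""
    # Replica 1: the node where the client resides.
    first = client_node
    # Replica 2: a different node on the same rack as replica 1.
    same_rack = [n for n in nodes if n.rack == first.rack and n != first]
    second = random.choice(same_rack)
    # Replica 3: a node on a rack different from replicas 1 and 2.
    other_rack = [n for n in nodes if n.rack != first.rack]
    third = random.choice(other_rack)
    return first, second, third
```

In a real cluster the candidate filtering would also score each node by CPU, memory, I/O utilization, and free disk, rather than choosing at random.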
DataNode role
1. Performs data reads and writes (responds to client requests)
2. Periodically reports to the NameNode (data block information, checksums).
The heartbeat cycle is 3 seconds; if the NameNode receives no report from a
DataNode for about 10 minutes, that DataNode is considered dead (down)
3. Executes pipeline replication (data is copied node by node along the pipeline)
Illustration:
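The 3-second and 10-minute figures above come from HDFS defaults: DataNodes heartbeat every `dfs.heartbeat.interval` seconds, and the NameNode declares a DataNode dead after 2 × `dfs.namenode.heartbeat.recheck-interval` + 10 × `dfs.heartbeat.interval`. A quick calculation with the default values:

```python
# HDFS default heartbeat parameters (values from hdfs-default.xml).
heartbeat_interval_s = 3          # dfs.heartbeat.interval (seconds)
recheck_interval_ms = 300_000     # dfs.namenode.heartbeat.recheck-interval (ms)

# A DataNode is marked dead after 2 * recheck + 10 * heartbeat.
timeout_s = 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s
print(timeout_s)  # 630.0 seconds, i.e. 10 minutes 30 seconds
```

So the actual dead-node timeout is 10 minutes 30 seconds, commonly rounded to "10 minutes" in informal notes.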
Rack awareness
In practice, engineers need to write a script (Python or shell) that records the mapping between host IPs and the switches (racks) they are connected to.
The script is configured in core-site.xml by adding the following property:
topology.script.file.name
/home/bigdata/apps/hadoop/etc/hadoop/RackAware.py
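A topology script of this kind receives one or more IPs or hostnames as command-line arguments and prints one rack path per line. Below is a minimal sketch of such a script; the IP-to-rack table is a hypothetical example and must be adapted to the real cluster.

```python
#!/usr/bin/env python
import sys

# Hypothetical mapping from host IP to switch/rack path (assumption).
RACK_MAP = {
    "192.168.1.101": "/switch1/rack1",
    "192.168.1.102": "/switch1/rack1",
    "192.168.2.101": "/switch2/rack2",
}

# Hosts not in the table fall back to Hadoop's conventional default rack.
DEFAULT_RACK = "/default-rack"

def resolve(hosts):
    """Return one rack path per host, in argument order."""
    return [RACK_MAP.get(h, DEFAULT_RACK) for h in hosts]

if __name__ == "__main__":
    # Hadoop invokes the script with IPs/hostnames as arguments
    # and reads one rack path per line from stdout.
    print("\n".join(resolve(sys.argv[1:])))
```

The script must be executable by the NameNode process, since the NameNode invokes it to resolve each DataNode's rack when making placement decisions.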
RPC stands for remote procedure call. It is the method of data communication between the components and modules of the cluster (for example, between DataNodes and the NameNode).
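The RPC idea can be illustrated with Python's standard `xmlrpc` modules: one component exposes a procedure over the network, and another calls it as if it were a local function. This is only a conceptual sketch; Hadoop uses its own RPC framework, not XML-RPC.

```python
import threading
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCServer

# "Server" component: exposes a procedure on an OS-assigned port.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
port = server.server_address[1]

# Serve exactly one request in the background.
threading.Thread(target=server.handle_request, daemon=True).start()

# "Client" component: calls the remote procedure like a local function.
proxy = xmlrpc.client.ServerProxy(f"http://localhost:{port}")
result = proxy.add(1, 2)
print(result)  # 3
```

The key property shown here is transparency: the client code looks like an ordinary function call, while the arguments and return value actually travel over the network.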