[Big Data] HDFS DataNode working mechanism

DataNode working mechanism

  1. A data block is stored on the DataNode's disk as a pair of files: one holds the data itself, and the other holds the metadata, including the length of the block, the checksum of the block data, and the timestamp.
  2. After a DataNode starts, it registers with the NameNode. Once registration succeeds, it periodically (every hour) reports all of its block information to the NameNode.
  3. The DataNode sends a heartbeat to the NameNode every 3 seconds. The heartbeat response carries any commands the NameNode has for that DataNode. If no heartbeat is received from a DataNode for more than 10 minutes, the node is considered unavailable.
  4. Machines can be safely added to or removed from the cluster while it is running.
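The report and heartbeat intervals above are configurable. On a running cluster you can inspect the effective values with `hdfs getconf` (a sketch; the property names are the standard Hadoop 2.x ones, and the output depends on your configuration):

```shell
# Print the effective values of the intervals mentioned above.
# dfs.blockreport.intervalMsec  - block report interval (milliseconds)
# dfs.heartbeat.interval        - heartbeat interval (seconds)
hdfs getconf -confKey dfs.blockreport.intervalMsec
hdfs getconf -confKey dfs.heartbeat.interval
```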

Data integrity

Thinking: suppose the data stored on a computer disk controls a high-speed rail signal light, where 1 means red and 0 means green. If the disk holding that data breaks and the light is stuck on green, is that dangerous?

In the same way, if the data on a DataNode is corrupted but the corruption is never detected, is that dangerous? And how can it be solved?

  • Methods to ensure data integrity
  1. When the DataNode reads a block, it computes the block's CheckSum (checksum)
  2. If the computed CheckSum differs from the value recorded when the block was created, the block has been damaged
  3. The client then reads the block from a replica on another DataNode
  4. The DataNode also periodically verifies the CheckSum of each block after the block file is created
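As an illustration of the idea (not the DataNode's actual code path, which stores CRC checksums in the block's metadata file), the check-on-read can be sketched with the POSIX `cksum` utility; the block file name is made up:

```shell
# Write a "block" and record its checksum at creation time.
printf 'block-data' > /tmp/blk_1073741825
CREATED_SUM=$(cksum /tmp/blk_1073741825 | awk '{print $1}')

# On read: recompute the checksum and compare with the stored one.
READ_SUM=$(cksum /tmp/blk_1073741825 | awk '{print $1}')
if [ "$READ_SUM" = "$CREATED_SUM" ]; then
    echo "block intact"
else
    echo "block damaged: read a replica from another DataNode"
fi
# prints "block intact"
```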

Timeout parameter when a node goes offline

When the DataNode process dies, or a network failure prevents the DataNode from communicating with the NameNode, a TimeOut parameter governs when the node is declared dead

  1. The NameNode does not immediately judge the node to be dead; it waits for a period of time, called the timeout period
  2. The default HDFS timeout period is 10 minutes + 30 seconds
  3. The timeout period is calculated as:
# dfs.namenode.heartbeat.recheck-interval defaults to 300000 ms; dfs.heartbeat.interval defaults to 3 s (note the mixed units)
TimeOut = 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
  • In actual development you can tune these values according to your servers: if server performance is relatively low, increase the timeout appropriately; if performance is good, it can be shortened appropriately.
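With the defaults above, the formula works out as follows (plain shell arithmetic; the values are the stock defaults, not read from a cluster):

```shell
# Defaults: recheck interval in milliseconds, heartbeat interval in seconds.
RECHECK_MS=300000      # dfs.namenode.heartbeat.recheck-interval
HEARTBEAT_S=3          # dfs.heartbeat.interval

TIMEOUT_MS=$(( 2 * RECHECK_MS + 10 * HEARTBEAT_S * 1000 ))
echo "TimeOut = ${TIMEOUT_MS} ms"    # 630000 ms
echo "       = $(( TIMEOUT_MS / 60000 )) min $(( TIMEOUT_MS % 60000 / 1000 )) s"    # 10 min 30 s
```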

Commissioning new data nodes

Requirement: as the company's business grows or during major events (such as Double 11), the volume of data keeps increasing, and the capacity of the existing data nodes can no longer meet storage needs, so new data nodes must be added dynamically to the running cluster.

  • Steps:
  1. Clone a virtual machine
  2. Modify its IP address and host name
  3. Delete the data and logs directories left over from the original HDFS file system on the clone
  4. Start the node's daemons directly; it registers with the cluster on its own
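On the cloned machine, steps 3 and 4 might look like this (a sketch assuming the same Hadoop 2.7.2 install path used elsewhere in this post; the daemon scripts are the Hadoop 2.x ones):

```shell
# Remove state cloned from the template machine, so the new DataNode
# registers with a fresh storage ID instead of a duplicated one.
rm -rf /opt/module/hadoop-2.7.2/data /opt/module/hadoop-2.7.2/logs

# Start the DataNode and NodeManager directly; they register themselves
# with the NameNode and ResourceManager.
sbin/hadoop-daemon.sh start datanode
sbin/yarn-daemon.sh start nodemanager
```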

Decommission old data nodes

There are two ways to decommission old data nodes: maintaining a whitelist, or decommissioning via a blacklist

Adding a whitelist

  • Steps:
  1. Create a dfs.hosts file in the Hadoop installation's etc/hadoop directory on the NameNode host
  2. Add the host names of the whitelisted nodes
  3. Add the dfs.hosts property to the NameNode's hdfs-site.xml configuration file
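The dfs.hosts file is just a plain list of host names, one per line; for example (hadoop102 is the host used in this post's prompts, the others are hypothetical cluster members):

```
hadoop102
hadoop103
hadoop104
```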
<property>
    <name>dfs.hosts</name>
    <!-- path to the dfs.hosts file -->
    <value>/opt/module/hadoop-2.7.2/etc/hadoop/dfs.hosts</value>
</property>
  4. Synchronize the configuration file to the other nodes in the cluster
  5. Refresh the NameNode:
[kocdaniel@hadoop102 hadoop-2.7.2]$ hdfs dfsadmin -refreshNodes
Refresh nodes successful
  6. Update the ResourceManager's node list:
[kocdaniel@hadoop102 hadoop-2.7.2]$ yarn rmadmin -refreshNodes
  7. If the data is unbalanced, use the balancer to rebalance the cluster:
[kocdaniel@hadoop102 sbin]$ ./start-balancer.sh

Blacklist decommissioning

  • Steps:
  1. Create a dfs.hosts.exclude file in the Hadoop installation's etc/hadoop directory on the NameNode host
  2. Add the host names of the nodes to be decommissioned
  3. Add the dfs.hosts.exclude property to the NameNode's hdfs-site.xml configuration file
<property>
    <name>dfs.hosts.exclude</name>
    <value>/opt/module/hadoop-2.7.2/etc/hadoop/dfs.hosts.exclude</value>
</property>
  4. Synchronize the configuration file to the other nodes in the cluster
  5. Refresh the NameNode and the ResourceManager:
[kocdaniel@hadoop102 hadoop-2.7.2]$ hdfs dfsadmin -refreshNodes
Refresh nodes successful
[kocdaniel@hadoop102 hadoop-2.7.2]$ yarn rmadmin -refreshNodes
  6. Check the web UI: the status of the decommissioning node will be Decommission In Progress, meaning the data node is copying its blocks to other nodes
  7. Wait until the node's status becomes Decommissioned (all blocks have been copied), then stop the DataNode and NodeManager on that node.
  • Note: if the replication factor is 3 and the number of in-service nodes is less than or equal to 3, decommissioning cannot complete; you need to reduce the replication factor before decommissioning.
  • Note: the same host name must not appear in both the whitelist and the blacklist.
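For step 7, stopping the daemons on the retired machine can be sketched as follows (Hadoop 2.x daemon scripts, run from the Hadoop install directory on the node being retired):

```shell
# Stop the DataNode and NodeManager on the decommissioned host.
sbin/hadoop-daemon.sh stop datanode
sbin/yarn-daemon.sh stop nodemanager
```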

The difference between the two

  • Whitelist removal is abrupt: a node dropped from the whitelist has its service shut down directly, without its data being copied elsewhere
  • Blacklist decommissioning first copies the data on the node to be decommissioned to other nodes before shutting its service down, which is slower but safer

DataNode multi-directory configuration

  • A DataNode can also be configured with multiple directories, and each directory stores different data. That is, the directories are not replicas of each other, which differs from the NameNode's multi-directory configuration
  • Purpose: ensure that all disks are utilized and the load is balanced across them, similar to disk partitions in Windows
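A minimal hdfs-site.xml fragment for this (the install path matches the one used above; the data1/data2 directory names are made up):

```xml
<property>
    <name>dfs.datanode.data.dir</name>
    <!-- comma-separated list: each directory, typically on its own disk, stores different blocks -->
    <value>file:///opt/module/hadoop-2.7.2/data1,file:///opt/module/hadoop-2.7.2/data2</value>
</property>
```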

Reprinted from: https://www.cnblogs.com/kocdaniel/p/11604887.html

Origin blog.csdn.net/debimeng/article/details/104148052