HDFS append-data error solutions

Two main errors came up in turn tonight:

The first error

2022-10-25 21:37:11,901 WARN hdfs.DataStreamer: DataStreamer Exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[192.168.88.151:9866,DS-4b3969ed-e679-4990-8d2e-374f24c1955d,DISK]], original=[DatanodeInfoWithStorage[192.168.88.151:9866,DS-4b3969ed-e679-4990-8d2e-374f24c1955d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
        at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
        at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
        at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:719)
appendToFile: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[192.168.88.151:9866,DS-4b3969ed-e679-4990-8d2e-374f24c1955d,DISK]], original=[DatanodeInfoWithStorage[192.168.88.151:9866,DS-4b3969ed-e679-4990-8d2e-374f24c1955d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

After reading up on this, I can understand it: the write simply cannot proceed. That is, if the environment has 3 datanodes and the replication factor is set to 3, a write goes to all 3 machines through a pipeline. The replace-datanode-on-failure policy defaults to DEFAULT: when the cluster has 3 or more datanodes and one in the pipeline fails, the client tries to find another datanode to take its place. With only 3 machines in total there is no spare to swap in, so as soon as one datanode has a problem, the write fails.
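The rule above can be sketched in a few lines of Python. This is a minimal sketch of the DEFAULT policy as described in the hdfs-default.xml documentation, not the real client code; the function name and parameters are my own:

```python
def should_replace_datanode(replication: int, live_in_pipeline: int,
                            is_append_or_hflushed: bool) -> bool:
    """Sketch of the DEFAULT replace-datanode-on-failure policy per
    hdfs-default.xml: with r = replication factor and n = datanodes
    still in the pipeline, add a replacement only if r >= 3 and
    (floor(r/2) >= n, or r > n and the block is appended/hflushed)."""
    r, n = replication, live_in_pipeline
    if r < 3:
        return False
    return (r // 2 >= n) or (r > n and is_append_or_hflushed)

# The situation in this post: replication 3, an appendToFile, and only
# one datanode left in the pipeline. The client decides it must replace
# the failed node, finds no spare in a 3-node cluster, and the write fails.
print(should_replace_datanode(3, 1, True))   # -> True: tries to replace
```

This shows why the error only appears with replication 3 or more: with replication 2 the function always returns False and the client just keeps writing to the surviving datanodes.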

At the time I had only started one machine in the cluster, while the client expected the full cluster's worth of datanodes; with only one running, the error was thrown. After I got Hadoop started on all three machines, it worked.

The common fix found online is to modify the hdfs-site.xml file, as follows:

<property>
        <name>dfs.support.append</name>
        <value>true</value>
</property>

<property>
        <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
        <value>NEVER</value>
</property>
<property>
        <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
        <value>true</value>
</property>

The explanations online describe these settings roughly as follows; pay particular attention to the second and third properties. With dfs.client.block.write.replace-datanode-on-failure.policy at its DEFAULT, when the replication factor is 3 or more the client will try to replace a failed node before continuing the write; with a replication factor of 2 it does not replace the datanode and just keeps writing. For a cluster of only 3 datanodes, the write fails as soon as one node stops responding, so this replacement behavior can be turned off by setting the policy to NEVER.
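How the two client-side properties interact can be sketched as follows. This is a hedged model based on the hdfs-default.xml descriptions of the `enable` and `policy` properties, not the actual DFSClient implementation:

```python
def pipeline_recovery_action(enable: bool, policy: str,
                             replication: int, live: int,
                             appending: bool) -> str:
    """What the client does when a datanode in the write pipeline fails,
    per the documented semantics of
    dfs.client.block.write.replace-datanode-on-failure.{enable,policy}."""
    if not enable or policy == "NEVER":
        return "continue with remaining datanodes"   # never try to replace
    if policy == "ALWAYS":
        return "replace failed datanode"
    # DEFAULT: replace only if r >= 3 and (r//2 >= n or (r > n and appending))
    r, n = replication, live
    if r >= 3 and (r // 2 >= n or (r > n and appending)):
        return "replace failed datanode"
    return "continue with remaining datanodes"

# With the NEVER policy from the config above, the append in this post
# keeps writing to the one surviving datanode instead of failing:
print(pipeline_recovery_action(True, "NEVER", 3, 1, True))
```

The trade-off is that with NEVER the appended block may end up under-replicated until the other datanodes come back, which is acceptable on a small test cluster but risky in production.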

The second error

appendToFile: Failed to APPEND_FILE /2.txt for DFSClient_NONMAPREDUCE_505101511_1 on 192.168.88.151 because this file lease is currently owned by DFSClient_NONMAPREDUCE_-474039103_1 on 192.168.88.151
appendToFile: Failed to APPEND_FILE /2.txt for DFSClient_NONMAPREDUCE_814684116_1 on 192.168.88.151 because lease recovery is in progress. Try again later.

For the second error, these two messages appeared alternately. Both mention a lease after "because", so I figured they were related: HDFS grants a write lease on a file to a single client at a time, and here the lease on /2.txt is still owned by another DFSClient (likely an earlier append attempt that didn't close cleanly), while the second message means the NameNode is in the middle of recovering that lease, so retrying later can succeed. The first message says "owned by DFSClient..." and the IP after it is the first machine; I suspect this is because I had only started that one machine. If the error still appears with all machines running, it should again be the node-response problem, which brings us back to the fix above, so let's just modify the config file properly.
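The single-writer lease rule behind these messages can be illustrated with a toy model. This is not the real NameNode code, just a sketch of the invariant it enforces:

```python
class LeaseTable:
    """Toy model of HDFS's single-writer lease rule: each file under
    construction is held by exactly one client; a second client trying
    to append to the same path is rejected until the lease is released
    (or recovered by the NameNode)."""
    def __init__(self):
        self.holders = {}   # path -> client id currently holding the lease

    def try_append(self, path: str, client: str) -> str:
        holder = self.holders.get(path)
        if holder is not None and holder != client:
            return f"Failed to APPEND_FILE {path}: lease owned by {holder}"
        self.holders[path] = client     # acquire (or keep) the lease
        return "append allowed"

    def release(self, path: str):
        self.holders.pop(path, None)    # lease released / recovered

leases = LeaseTable()
print(leases.try_append("/2.txt", "DFSClient_A"))  # append allowed
print(leases.try_append("/2.txt", "DFSClient_B"))  # rejected: A holds the lease
leases.release("/2.txt")
print(leases.try_append("/2.txt", "DFSClient_B"))  # append allowed
```

In the real system the stale lease expires or is recovered automatically, which is why the second message says "Try again later."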

Origin blog.csdn.net/weixin_47367099/article/details/127524320