Repairing damaged HDFS blocks

Background:

After a power failure and restart in the server room, the HDFS service was found to be in an abnormal state.

Diagnosis steps:

  1. Check the health of the HDFS file system from the command line or the NameNode web UI

    hdfs fsck /
    
  2. List which blocks are damaged, together with the affected file paths (a quick filtering sketch follows these steps)

    hdfs fsck / -list-corruptfileblocks
    
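A quick way to get a health summary and save the list of corrupt files is to filter the fsck output on the command line. A minimal sketch (the grep patterns are illustrative and should be checked against the actual output of your Hadoop version):

    # Overall health summary: look for "Status: HEALTHY" vs "Status: CORRUPT"
    hdfs fsck / | grep -E 'Status:|Corrupt blocks|Missing replicas'

    # Save the corrupt file/block listing so it can be processed later
    hdfs fsck / -list-corruptfileblocks > corrupt_blocks.txt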

Data pipeline: MySQL -----> Hadoop. Since the data originates in MySQL, the affected tables only need to be re-synchronized into HDFS.

Digging deeper:

  1. How do we obtain detailed information about a file's blocks?
  2. A file maps to multiple blocks, and each block is replicated on different machines; how do we find out where each block lives?

View block information with:

hdfs fsck xxx -files -blocks -locations -racks

Meaning of the parameters:

  • -files displays information about the files being checked
  • -blocks used together with -files, displays each file's block information
  • -locations used together with -blocks, displays the IP addresses of the DataNodes holding each block replica
  • -racks used together with -files, displays the rack location of each replica

For a file whose blocks are damaged, fsck cannot show the detailed block distribution; for healthy files it can. Once the block distribution is known, a damaged block can be removed directly on the Linux machine that stores it, or the command Hadoop provides can be used: hdfs fsck / -delete. Note that this command deletes the whole file that the damaged block belongs to, so the file's other, healthy blocks are deleted along with it.
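If a damaged replica really does need to be removed by hand on a DataNode, the block file can be located on that node's local disk using the block ID reported by fsck. A rough sketch, assuming a hypothetical data directory /data/hadoop/hdfs/data (the real path is whatever dfs.datanode.data.dir points to) and a hypothetical block ID blk_1073741825:

    # Run on the DataNode host shown by fsck -locations;
    # this finds both the block file and its .meta checksum file
    find /data/hadoop/hdfs/data -name 'blk_1073741825*'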


Summary of solutions:

Method 1: brute-force deletion

Delete the files that contain damaged blocks, then re-load the corresponding data from the upstream business system.
Command: hdfs fsck / -delete
Note: this command deletes the entire file that a damaged block belongs to, not just the damaged block itself
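Before running the delete it is worth recording exactly which files are about to be removed, so the re-load from the upstream system can be scoped. A minimal sketch of that workflow:

    # 1. Record the corrupt file list for later re-loading
    hdfs fsck / -list-corruptfileblocks > corrupt_files_$(date +%F).txt

    # 2. Delete the corrupted files (their healthy blocks are removed with them)
    hdfs fsck / -delete

    # 3. Confirm the file system is healthy again
    hdfs fsck / | grep 'Status:'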

Points to consider:

  • If we simply delete the files containing damaged blocks, we do not know exactly how much data was lost. How do we make sure the data is fully restored? This needs to be thought through.
  • If it is log data, losing a little is usually acceptable.
  • If it is business data, such as order data, it must not be lost; the data has to be re-loaded and reconciled.

Method 2: graceful repair with the debug command

Manually trigger block recovery (retrying at most 10 times):

hdfs debug recoverLease -path /blockrecover/bigdata.md -retries 10

The reasoning behind this: a block normally has three replicas. If one replica is damaged, the other two are still intact, so the damaged replica can be rebuilt from them; the debug command triggers that repair.
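When many files are affected, recoverLease can be driven from the corrupt-file listing produced by fsck. A sketch, assuming the HDFS path is the last whitespace-separated field on each listing line (adjust the parsing to match the exact fsck output on your cluster):

    # Collect the paths of corrupt files
    hdfs fsck / -list-corruptfileblocks | grep 'blk_' | awk '{print $NF}' | sort -u > corrupt_paths.txt

    # Try to recover each file, retrying at most 10 times
    while read -r f; do
        hdfs debug recoverLease -path "$f" -retries 10
    done < corrupt_paths.txt

    # Re-check health afterwards
    hdfs fsck / | grep 'Status:'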


Method 3: automatic repair via configuration parameters

HDFS can also be configured to repair blocks automatically. When a block is damaged, the DataNode does not notice it until its next directory scan; by default the scan interval is 6 hours:

dfs.datanode.directoryscan.interval : 21600 (seconds)

The damaged block is not repaired until the DataNode sends its next block report to the NameNode; by default the block report interval is also 6 hours:

dfs.blockreport.intervalMsec : 21600000 (milliseconds)
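Both values live in hdfs-site.xml. Lowering them shortens the window before a damaged block is detected and re-replicated, at the cost of more frequent scans and block reports. The effective values can be checked from the command line, for example:

    # Read the effective values from the local client configuration
    hdfs getconf -confKey dfs.datanode.directoryscan.interval   # seconds, default 21600 (6 hours)
    hdfs getconf -confKey dfs.blockreport.intervalMsec          # milliseconds, default 21600000 (6 hours)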


Origin: blog.csdn.net/qq_43081842/article/details/114437141