How to locate corrupted HDFS blocks and repair them

First, look at the help output of the hdfs fsck command:

[hadoop@hadoop-01 ~]$ hdfs fsck

Usage: hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks | -replicaDetails | -upgradedomains]]]] [-includeSnapshots] [-storagepolicies] [-blockId <blk_Id>]
	<path>	start checking from this path
	-move	move corrupted files to /lost+found
	-delete	delete corrupted files
	-files	print out files being checked
	-openforwrite	print out files opened for write
	-includeSnapshots	include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
	-list-corruptfileblocks	print out list of missing blocks and files they belong to
	-files -blocks	print out block report
	-files -blocks -locations	print out locations for every block
	-files -blocks -racks	print out network topology for data-node locations
	-files -blocks -replicaDetails	print out each replica details 
	-files -blocks -upgradedomains	print out upgrade domains for every block
	-storagepolicies	print out storage policy summary for the blocks
	-blockId	print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)

Please Note:
	1. By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually  tagged CORRUPT or HEALTHY depending on their block allocation status
	2. Option -includeSnapshots should not be used for comparing stats, should be used only for HEALTH check, as this may contain duplicates if the same file present in both original fs tree and inside snapshots.

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
command [genericOptions] [commandOptions]

Typical operations:

# List the corrupted blocks under a path (-list-corruptfileblocks)
[hadoop@hadoop-01 ~]$ hdfs fsck /home/hadoop/clear/day=20180717/ -list-corruptfileblocks

# Move corrupted files to /lost+found (-move)
[hadoop@hadoop-01 ~]$ hdfs fsck /home/hadoop/clear/day=20180717/part-r-00000 -move

# Delete corrupted files (-delete)
[hadoop@hadoop-01 ~]$ hdfs fsck /home/hadoop/clear/day=20180717/part-r-00000 -delete

# Check and list the status of every file being checked (-files)
[hadoop@hadoop-01 ~]$ hdfs fsck /home/hadoop/clear/day=20180717/ -files

# Also report files currently open for write (-openforwrite)
[hadoop@hadoop-01 ~]$ hdfs fsck /home/hadoop/clear/day=20180717/ -openforwrite

# Print the block report for a file (-blocks); must be combined with -files.
[hadoop@hadoop-01 ~]$ hdfs fsck /home/hadoop/clear/day=20180717/part-r-00000 -files -blocks
Connecting to namenode via http://hadoop:50070/fsck?ugi=hadoop&files=1&blocks=1&path=%2Fhome%2Fhadoop%2Fclear%2Fday%3D20180717%2Fpart-r-00000
FSCK started by hadoop (auth:SIMPLE) from /192.168.232.8 for path /home/hadoop/clear/day=20180717/part-r-00000 at Mon Apr 01 14:48:16 CST 2019
/home/hadoop/clear/day=20180717/part-r-00000 72432 bytes, 1 block(s):  OK
0. BP-2127332931-192.168.232.8-1545632462593:blk_1073741866_1042 len=72432 Live_repl=1

# In the output above, "/home/hadoop/clear/day=20180717/part-r-00000 72432 bytes, 1 block(s)" shows the file's total size and its number of blocks;
0. BP-2127332931-192.168.232.8-1545632462593:blk_1073741866_1042 len=72432 Live_repl=1

# The leading 0 is the block's index within the file (a file with 56 blocks would be indexed 0-55);

# BP-2127332931-192.168.232.8-1545632462593:blk_1073741866_1042 is the block pool and block ID;

# len=72432 is the size of this block in bytes;

# Live_repl=1 is the number of live replicas of this block.

# Print the location of every block (-locations); must be combined with -files -blocks.
[hadoop@hadoop-01~]$ hdfs fsck /home/hadoop/clear/test.log -files -blocks -locations

# Print the rack information for each block location (-racks)
[hadoop@hadoop-01~]$ hdfs fsck /home/hadoop/clear/test.log -files -blocks -locations -racks

1. Symptom:
A power outage left the HDFS service unhealthy and reporting corrupted blocks.

2. Check the health of the HDFS filesystem:
hdfs fsck /

3. List the corrupted files and their blocks:
hdfs fsck / -list-corruptfileblocks

Connecting to namenode via http://hadoop36:50070/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The list of corrupt files under path '/' are:
blk_1075229920  /hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd
blk_1075229921  /hbase/data/JYDW/WMS_PO_ITEMS/c96cb6bfef12795181c966a8fc4ef91d/0/cf44ae0411824708bf6a894554e19780
The filesystem under path '/' has 2 CORRUPT files

4. Analysis
The data flows from MySQL into the big data platform, so the affected tables only need to be re-imported from MySQL to refresh a clean copy on HDFS (a sketch of such a re-import follows below).

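A minimal sketch of that re-import, assuming the table is pulled over with Sqoop; the JDBC URL, credentials, database name, and target directory below are hypothetical and must be adapted to the actual environment:

# Hypothetical Sqoop re-import of one MySQL table into HDFS;
# replace the connection details, table name, and target directory with your own.
sqoop import \
  --connect jdbc:mysql://mysql-host:3306/jydw \
  --username etl -P \
  --table WMS_PO_ITEMS \
  --target-dir /tmp/reload/WMS_PO_ITEMS \
  --delete-target-dir \
  -m 1
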
5. Which machines hold the blocks of the corrupted file? The plan was to manually delete the corresponding block files on the local Linux filesystem under /dfs/dn/…
hadoop36:hdfs:/var/lib/hadoop-hdfs:>

-files      prints per-file block information,
-blocks     shows block details only when combined with -files,
-locations  shows the concrete DataNode IP of each block only when combined with -blocks,
-racks      shows the rack location when combined with -files

For the corrupted file, however, fsck shows no block locations at all, so there is no block file to find and delete by hand:

hdfs fsck /hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd -files  -locations -blocks  -racks
Connecting to namenode via http://hadoop36:50070/fsck?ugi=hdfs&locations=1&blocks=1&files=1&path=%2Fhbase%2Fdata%2FJYDW%2FWMS_PO_ITEMS%2Fc71f5f49535e0728ca72fd1ad0166597%2F0%2Ff4d3d97bb3f64820b24cd9b4a1af5cdd
FSCK started by hdfs (auth:SIMPLE) from /192.168.1.100 for path /hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd at Sat Jan 20 15:46:55 CST 2018
/hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd 2934 bytes, 1 block(s): 
/hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd: CORRUPT blockpool BP-1437036909-192.168.1.100-1509097205664 block blk_1075229920
 MISSING 1 blocks of total size 2934 B

1. BP-1437036909-192.168.1.100-1509097205664:blk_1075229920_1492007 len=2934 MISSING!

Status: CORRUPT
 Total size:    2934 B
 Total dirs:    0
 Total files:   1
 Total symlinks:                0
 Total blocks (validated):      1 (avg. block size 2934 B)

------

  UNDER MIN REPL'D BLOCKS:      1 (100.0 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:        1
  MISSING BLOCKS:       1
  MISSING SIZE:         2934 B
  CORRUPT BLOCKS:       1

------

 Minimally replicated blocks:   0 (0.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     0.0
 Corrupt blocks:                1
 Missing replicas:              0
 Number of data-nodes:          12
 Number of racks:               1
FSCK ended at Sat Jan 20 15:46:55 CST 2018 in 0 milliseconds

The filesystem under path '/hbase/data/JYDW/WMS_PO_ITEMS/c71f5f49535e0728ca72fd1ad0166597/0/f4d3d97bb3f64820b24cd9b4a1af5cdd' is CORRUPT
hadoop36:hdfs:/var/lib/hadoop-hdfs:>

For a healthy file, by contrast, the block distribution is shown:

hadoop36:hdfs:/var/lib/hadoop-hdfs:>hdfs fsck /hbase/data/JYDW/WMS_TO/011dea9ae46dae6c1f1f3a24a75af100/0/1d60f56773984e4cac614a8b5f7e93a6 -files  -locations -blocks  -racks
Connecting to namenode via http://hadoop36:50070/fsck?ugi=hdfs&files=1&locations=1&blocks=1&racks=1&path=%2Fhbase%2Fdata%2FJYDW%2FWMS_TO%2F011dea9ae46dae6c1f1f3a24a75af100%2F0%2F1d60f56773984e4cac614a8b5f7e93a6
FSCK started by hdfs (auth:SIMPLE) from /192.168.1.100 for path /hbase/data/JYDW/WMS_TO/011dea9ae46dae6c1f1f3a24a75af100/0/1d60f56773984e4cac614a8b5f7e93a6 at Sat Jan 20 15:58:25 CST 2018
/hbase/data/JYDW/WMS_TO/011dea9ae46dae6c1f1f3a24a75af100/0/1d60f56773984e4cac614a8b5f7e93a6 1697 bytes, 1 block(s):  OK

1. BP-1437036909-192.168.1.100-1509097205664:blk_1075227504_1489591 len=1697 Live_repl=3 [/default/192.168.1.150:50010, /default/192.168.1.153:50010, /default/192.168.1.145:50010]

The key fields: blk_1075227504_1489591 is the block, len=1697 its size in bytes, Live_repl=3 the number of live replicas, and [/default/192.168.1.150:50010, /default/192.168.1.153:50010, /default/192.168.1.145:50010] the rack/DataNode locations that hold them.

6. In the end the blunt option was chosen: delete the corrupted files outright and have the business systems re-push the data:
hadoop36:hdfs:/var/lib/hadoop-hdfs:>hdfs fsck / -delete


7. Suppose the data exists only on HDFS (the file has no other source, and some of its replicas are intact while others are corrupted):
7.1 hdfs dfs -ls /xxxx
    hdfs dfs -get /xxxx ./    # download an intact copy to the local Linux filesystem
    hdfs dfs -rm /xxx         # delete the existing HDFS file, corrupted replicas included
    hdfs dfs -put xxx /       # re-upload the intact copy; HDFS then rebuilds the 3 replicas (concrete example below)

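A concrete version of that get / rm / put cycle, using a made-up path /data/app.log in place of the real file:

# Hypothetical example of the download / delete / re-upload cycle.
hdfs dfs -get /data/app.log ./app.log       # pull down an intact copy
hdfs dfs -rm -skipTrash /data/app.log       # remove the HDFS file, corrupted replicas included
hdfs dfs -put ./app.log /data/app.log       # re-upload; the NameNode re-replicates it to 3 copies

-skipTrash keeps the deleted (corrupted) copy out of .Trash; drop it if you prefer that safety net.
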
Note:

Losing a little bit of log data is usually acceptable.
If the lost file is business data, such as order data, the loss must be reported.

Manual repair of corrupted blocks (hdfs debug)
The hdfs command help does not list debug, but the combined command hdfs debug does exist; keep that in mind.

hdfs debug recoverLease -path <path-to-hdfs-file> -retries 10
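For example, against a hypothetical file path:

# Hypothetical example: ask the NameNode to recover the lease on one file, retrying up to 10 times.
hdfs debug recoverLease -path /data/app.log -retries 10
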
Automatic repair

After a block is corrupted, the DataNode does not notice the damage until it runs its next directory scan;
the directory scan runs every 6 hours by default:
dfs.datanode.directoryscan.interval : 21600
The data block is not recovered before the DataNode sends its next block report to the NameNode;
the block report is also sent every 6 hours by default:
dfs.blockreport.intervalMsec : 21600000
Only after the NameNode receives the block report does it start the recovery.
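To check what these two intervals are actually set to on a given cluster, you can query the effective configuration (a minimal sketch; both property names are the standard HDFS keys quoted above):

# Print the effective directory scan interval (seconds) and block report interval (milliseconds).
hdfs getconf -confKey dfs.datanode.directoryscan.interval
hdfs getconf -confKey dfs.blockreport.intervalMsec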

Note: the manual repair approach requires that the corrupted block be deleted by hand first.
Remember: what gets deleted is the corrupted block file and its meta file on the DataNode, not the HDFS file itself.
Alternatively, you can first get the file down, delete it from HDFS, and then re-upload it.
Never use hdfs fsck / -delete as the deletion step: that removes the corrupted files themselves, and the data is simply gone, unless losing it does not matter or you are confident you can backfill the data into HDFS from another source!
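A minimal sketch of locating the corrupted block's files on a DataNode disk before removing them, reusing the block ID blk_1075229920 from the example above; the data directory /dfs/dn follows step 5 and must match your dfs.datanode.data.dir setting:

# Run on the DataNode that holds the corrupted replica:
# locate both the block file and its .meta file under the data directory.
find /dfs/dn -name 'blk_1075229920*' 2>/dev/null
# After double-checking the paths it prints, remove both files, e.g.:
#   rm .../blk_1075229920 .../blk_1075229920_1492007.meta
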

Reposted from blog.csdn.net/weixin_43212365/article/details/89413990