I. Causes
JournalNode file corruption is usually caused by one of two things:
1. The disk filled up and the edit log could no longer be written; by the time disk space was freed, edit-log data had already been lost.
2. A physical power failure on the server corrupted the edit-log files.
Consequence: once more than half of the JournalNodes are corrupted, HDFS becomes unavailable.
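The "more than half" threshold comes from QJM's quorum rule: every batch of NameNode edits must be acknowledged by a majority of JournalNodes. A quick sketch of the arithmetic for the 3-node ensemble used later in this note:

```shell
# QJM quorum rule: edits must land on a majority of JournalNodes.
jn_total=3
quorum=$(( jn_total / 2 + 1 ))         # majority needed: 2 of 3
max_failures=$(( jn_total - quorum ))  # failures tolerated: 1
echo "quorum=$quorum max_failures=$max_failures"
```

With 2 of 3 JournalNodes corrupted, the quorum of 2 can no longer be reached, so the NameNodes cannot write edits and HDFS stalls.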
II. Symptoms
1. Checking the NameNode state shows that both NameNodes are in standby:
[root@data01 conf]# sudo cat /etc/hadoop/conf/hdfs-site.xml |grep dfs.ha.namenodes.nameservice01 -A2 -B1
<property>
<name>dfs.ha.namenodes.nameservice01</name>
<value>namenode44,namenode8</value>
</property>
sudo -u hdfs hdfs haadmin -getServiceState namenode44
standby
sudo -u hdfs hdfs haadmin -getServiceState namenode8
standby
2. Forcing a NameNode to active makes it automatically fall back to standby.
3. JournalNode error log:
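For reference, the forced transition is done with the `--forcemanual` flag of `hdfs haadmin` (the NameNode ID `namenode44` comes from the `dfs.ha.namenodes.nameservice01` property shown above). This sketch only builds and prints the command rather than running it against a live cluster:

```shell
# --forcemanual bypasses the usual health checks, so use it only when
# automatic failover cannot elect an active NameNode.
nn_id=namenode44   # NameNode ID from dfs.ha.namenodes.nameservice01
cmd="sudo -u hdfs hdfs haadmin -transitionToActive --forcemanual $nn_id"
echo "$cmd"
```

In this incident the forced transition does not stick, because the JournalNode quorum itself is broken; the recovery steps below fix the root cause.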
2018-03-19 20:48:04,817 WARN namenode.FSImage (EditLogFileInputStream.java:scanEditLog(359)) - Caught exception after scanning through 0 ops from /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current/edits_inprogress_0000000000001990667 while determining its valid length. Position was 1011712
java.io.IOException: Can't scan a pre-transactional edit log.
at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4974)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:192)
at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:152)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:99)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:189)
at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)
at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject
III. Recovery Steps
1. Stop all data-ingestion jobs.
2. This walkthrough assumes 2 of the 3 JournalNodes are corrupted.
3. Log in to CDH and stop HDFS, YARN, Hive, and Impala.
4. On the 2 nodes with corrupted JournalNode data, move the data files to a backup directory (which also removes them from the original path), then copy the data files from the healthy JournalNode onto these 2 nodes. Adjust the directories to match your environment.
mv /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current.bakyyyymmdd
scp -r data01:/data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current
sudo chown -R hdfs:hdfs /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current
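Before restarting anything, it is worth confirming the copied directory matches the healthy source byte-for-byte. A minimal local sketch of the copy-and-verify pattern (the paths and edits filename here are stand-ins, not the real JournalNode directories):

```shell
# Hedged sketch: copy a journal directory and confirm the replica is identical.
src=$(mktemp -d)/current   # stand-in for the healthy JN's current/ directory
dst=$(mktemp -d)/current   # stand-in for the repaired node's target directory
mkdir -p "$src"
echo "edit ops" > "$src/edits_0000000000001990001"   # fake edits segment
cp -a "$src" "$dst"        # on a real cluster this is the scp -r step
diff -r "$src" "$dst" && echo "journal copies identical"
```

On the real cluster, running `diff -r` over the two `current/` directories (or comparing `md5sum` output per file) gives the same assurance after the `scp`.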
5. Start the JournalNodes and verify they are running correctly.
6. In CDH, start HDFS, YARN, Hive, and Impala.