How to Repair Corrupted JournalNode Files

I. Causes
Most JournalNode file corruption comes from one of two causes:
1. The disk fills up and the edit log can no longer be written; by the time space is freed, data in the edit log file has already been lost.
2. A physical power failure on the server corrupts the edit log file.
Consequence: once more than half of the JournalNodes are damaged, the quorum is lost and HDFS becomes unavailable.
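
Where a cluster keeps its JournalNode data and which nodes form the quorum can be read out of the standard HDFS configuration. A minimal check using the stock hdfs getconf tool (the property names are standard HDFS ones; the values are cluster-specific):

sudo -u hdfs hdfs getconf -confKey dfs.journalnode.edits.dir      # local directory holding edit logs on each JN
sudo -u hdfs hdfs getconf -confKey dfs.namenode.shared.edits.dir  # qjournal:// URI naming the JN quorum
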
II. Symptoms
1. Checking the NameNode states shows that both NameNodes are standby:

[root@data01 conf]# sudo cat /etc/hadoop/conf/hdfs-site.xml |grep dfs.ha.namenodes.nameservice01 -A2 -B1
  <property>
    <name>dfs.ha.namenodes.nameservice01</name>
    <value>namenode44,namenode8</value>
  </property>
sudo -u hdfs hdfs haadmin -getServiceState namenode44
standby
sudo -u hdfs hdfs haadmin -getServiceState namenode8
standby

2. Forcibly transitioning a NameNode to active does not stick; it automatically reverts to standby (see the example command below).
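
For reference, the manual transition can be issued with the stock hdfs haadmin tool. A sketch, assuming automatic failover is enabled (which is why --forcemanual is required) and using namenode44, the service ID from the hdfs-site.xml excerpt above:

sudo -u hdfs hdfs haadmin -transitionToActive --forcemanual namenode44
sudo -u hdfs hdfs haadmin -getServiceState namenode44   # reverts to standby shortly afterwards
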
3. The JournalNode logs contain errors like the following:

2018-03-19 20:48:04,817 WARN  namenode.FSImage (EditLogFileInputStream.java:scanEditLog(359)) - Caught exception after scanning through 0 ops from /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current/edits_inprogress_0000000000001990667 while determining its valid length. Position was 1011712
 
java.io.IOException: Can't scan a pre-transactional edit log.
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4974)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:192)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:152)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:99)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:189)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:224)
        at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject
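
The "Can't scan a pre-transactional edit log" message usually indicates that the header of the edits_inprogress file was destroyed: a transactional edit log begins with a 4-byte layout version, and if that header has been zeroed out (as can happen after a full disk or power loss), the reader misclassifies the file as an ancient pre-transactional log. The header can be inspected directly on the affected node; a sketch using the standard hexdump tool and the file path from the log above:

sudo hexdump -C /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current/edits_inprogress_0000000000001990667 | head -n 4
# A healthy segment starts with a negative layout version (e.g. ff ff ff c0 for -64);
# a run of 00 bytes here means the header was wiped and the segment is damaged.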

III. Recovery Steps
1. Stop all data ingestion into the cluster.
2. This walkthrough takes the case of 2 corrupted JNs out of a 3-node quorum as its example.
3. Log in to CDH (Cloudera Manager) and stop HDFS, YARN, Hive, and Impala.
4. On each of the 2 nodes with a broken JournalNode, move the JournalNode data directory aside as a backup (removing it from its original location), then copy the data directory from the healthy JournalNode onto these 2 nodes. Adjust the directories to match your environment.

# Run on each of the two damaged nodes; data01 is the healthy JournalNode.
mv /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current.bakyyyymmdd
scp -r data01:/data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current
sudo chown -R hdfs:hdfs /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current
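
Before restarting anything, it is worth confirming the copy is complete. A minimal sanity check run on each repaired node, assuming the paths above and data01 as the healthy node (the /tmp file names are arbitrary):

ssh data01 'ls /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current' > /tmp/jn_healthy.txt
ls /data/sa_cluster/hadoop_ecosystem/dfs/jn/nameservice01/current > /tmp/jn_local.txt
diff /tmp/jn_healthy.txt /tmp/jn_local.txt && echo "segment lists match"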

5. Start the JournalNodes and check that they are running cleanly (one way to check is sketched below).
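
One way to check on each repaired node, assuming CDH-style packaging where the JournalNode runs as the hadoop-hdfs-journalnode service and logs under /var/log/hadoop-hdfs (both the service name and the log path are assumptions; on a Cloudera Manager cluster the JNs would normally be started from the CM UI instead):

sudo service hadoop-hdfs-journalnode start
sudo tail -n 50 /var/log/hadoop-hdfs/*journalnode*.log   # should scan storage without the IOException shown above
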
6. In CDH, start HDFS, YARN, Hive, and Impala, then verify the cluster came back (see the checks below).
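
To confirm recovery, the haadmin checks from the Symptoms section can be repeated; exactly one NameNode should now report active. The safemode query is a standard extra check (all commands are stock HDFS):

sudo -u hdfs hdfs haadmin -getServiceState namenode44
sudo -u hdfs hdfs haadmin -getServiceState namenode8   # expect one active, one standby
sudo -u hdfs hdfs dfsadmin -safemode get               # should report OFF once startup completes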

Reposted from blog.csdn.net/Abson_Lu/article/details/104521611