etcd: tocommit is out of range [lastIndex]. Was the raft log corrupted, truncated, or lost?


Shortcut:

If you would rather skip my long-winded write-up, go straight to:

https://github.com/etcd-io/etcd/issues/13509#issuecomment-980506247

The long-winded version follows:


1. Environment and versions

Operating system:

[test1280@node1 ~]$ cat /etc/redhat-release 
CentOS release 6.8 (Final)
[test1280@node1 ~]$ uname -a
Linux node1 2.6.32-642.el6.x86_64 #1 SMP Tue May 10 17:27:01 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Software version:

[test1280@node1 ~]$ etcd --version
etcd Version: 3.4.18
Git SHA: 72d3e382e
Go Version: go1.12.17
Go OS/Arch: linux/amd64

Hosts:

Host IP
node1 192.168.75.128
node2 192.168.75.129
node3 192.168.75.130

2. Reproducing the error

2.1. Installation and deployment
2.1.1. Adding hostnames

On node1 through node3, edit /etc/hosts and add the following entries:

192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280
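If you repeat this step by script on several hosts, it helps to make the edit idempotent so that rerunning it does not duplicate lines. The following is only a minimal sketch; the function name add_host_entry is my own and does not come from etcd or from the original setup.

```shell
# Hypothetical helper (not part of the original setup): append an "IP name"
# line to a hosts-style file only if that hostname is not already listed.
add_host_entry() {
    hosts_file=$1
    ip=$2
    name=$3
    # Search every non-comment line for the hostname as a whole field.
    if awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts_file"; then
        return 0    # already present, nothing to do
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Example (as root):
#   add_host_entry /etc/hosts 192.168.75.128 etcd1.test1280
#   add_host_entry /etc/hosts 192.168.75.129 etcd2.test1280
#   add_host_entry /etc/hosts 192.168.75.130 etcd3.test1280
```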

Check /etc/hosts on node1 through node3; the final contents should be:

node1

[root@node1 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280

node2

[root@node2 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280

node3

[root@node3 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280
2.1.2. Download and install

Upload the etcd release tarball to node1 through node3 and extract it.

etcd release downloads: https://github.com/etcd-io/etcd/releases

I used: etcd-v3.4.18-linux-amd64.tar.gz

Extract the tarball:

[test1280@node1 ~]$ tar zxf etcd-v3.4.18-linux-amd64.tar.gz
[test1280@node2 ~]$ tar zxf etcd-v3.4.18-linux-amd64.tar.gz
[test1280@node3 ~]$ tar zxf etcd-v3.4.18-linux-amd64.tar.gz
2.1.3. Writing the startup scripts

Script for node1: start-etcd1.sh

#!/bin/bash

nohup etcd \
 --name etcd1 \
 --initial-advertise-peer-urls http://etcd1.test1280:2380 \
 --listen-peer-urls http://0.0.0.0:2380 \
 --listen-client-urls http://0.0.0.0:2379 \
 --advertise-client-urls http://etcd1.test1280:2379 \
 --auto-compaction-retention '1' \
 --initial-cluster-token test1280 \
 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
 --initial-cluster-state new >etcd.log 2>&1 &

Script for node2: start-etcd2.sh

#!/bin/bash

nohup etcd \
 --name etcd2 \
 --initial-advertise-peer-urls http://etcd2.test1280:2380 \
 --listen-peer-urls http://0.0.0.0:2380 \
 --listen-client-urls http://0.0.0.0:2379 \
 --advertise-client-urls http://etcd2.test1280:2379 \
 --auto-compaction-retention '1' \
 --initial-cluster-token test1280 \
 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
 --initial-cluster-state new >etcd.log 2>&1 &

Script for node3: start-etcd3.sh

#!/bin/bash

nohup etcd \
 --name etcd3 \
 --initial-advertise-peer-urls http://etcd3.test1280:2380 \
 --listen-peer-urls http://0.0.0.0:2380 \
 --listen-client-urls http://0.0.0.0:2379 \
 --advertise-client-urls http://etcd3.test1280:2379 \
 --auto-compaction-retention '1' \
 --initial-cluster-token test1280 \
 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
 --initial-cluster-state new >etcd.log 2>&1 &
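The three scripts differ only in the member name, and the long --initial-cluster value is easy to get wrong when copied by hand. As a rough sketch, that value could be generated instead. The helper build_initial_cluster is hypothetical, and it assumes every member's peer URL follows the http://<name>.<domain>:2380 pattern used in this article.

```shell
# Hypothetical helper: build the --initial-cluster value from member names.
# Assumption: each peer URL is http://<name>.<domain>:2380, as in this article.
build_initial_cluster() {
    domain=$1
    shift
    result=""
    for name in "$@"; do
        entry="${name}=http://${name}.${domain}:2380"
        if [ -z "$result" ]; then
            result=$entry
        else
            result="${result},${entry}"
        fi
    done
    printf '%s\n' "$result"
}

# Example:
#   etcd --name etcd1 \
#        --initial-cluster "$(build_initial_cluster test1280 etcd1 etcd2 etcd3)" ...
```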
2.1.4. Configuring PATH

On node1 through node3, set the PATH variable in $HOME/.bash_profile:

export PATH=$HOME/etcd-v3.4.18-linux-amd64:$PATH
2.1.5. Starting etcd

node1:

[test1280@node1 ~]$ ./start-etcd1.sh

node2:

[test1280@node2 ~]$ ./start-etcd2.sh

node3:

[test1280@node3 ~]$ ./start-etcd3.sh
2.1.6. Checking cluster status

Run the following on any of node1 through node3:

etcdctl --endpoints=http://etcd1.test1280:2379,http://etcd2.test1280:2379,http://etcd3.test1280:2379 endpoint status -w table

The output looks like this:

[test1280@node3 ~]$ etcdctl --endpoints=http://etcd1.test1280:2379,http://etcd2.test1280:2379,http://etcd3.test1280:2379 endpoint status -w table
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://etcd1.test1280:2379 | 8d8f805c54155c1f |  3.4.18 |   20 kB |      true |      false |         4 |          9 |                  9 |        |
| http://etcd2.test1280:2379 | b2a96233e99da684 |  3.4.18 |   20 kB |     false |      false |         4 |          9 |                  9 |        |
| http://etcd3.test1280:2379 | 427c1e146435064e |  3.4.18 |   20 kB |     false |      false |         4 |          9 |                  9 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

The etcd cluster is healthy at this point.
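When scripting around this check, it can be handy to pull the leader's endpoint out of the table. The sketch below parses the table layout exactly as printed above (endpoint in the first column, IS LEADER in the fifth); the function name leader_endpoint is my own, and the parsing would need adjusting if the column order differs in another etcdctl version.

```shell
# Hypothetical helper: read `etcdctl endpoint status -w table` output on stdin
# and print the endpoint whose IS LEADER column is "true".
# Assumption: the column order matches the table shown in this article.
leader_endpoint() {
    awk -F'|' 'NF > 6 {
        endpoint = $2
        is_leader = $6
        gsub(/[[:space:]]/, "", endpoint)
        gsub(/[[:space:]]/, "", is_leader)
        if (is_leader == "true") print endpoint
    }'
}

# Example:
#   etcdctl --endpoints=... endpoint status -w table | leader_endpoint
```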

For more on installation, see: Linux: Installing an etcd cluster

2.2. Killing etcd

Use kill -9 to terminate the etcd process on any node; node1 is used as the example here:

[test1280@node1 ~]$ ps -ef | grep etcd
test1280   6017      1  1 10:36 pts/0    00:00:03 etcd --name etcd1 --initial-advertise-peer-urls http://etcd1.test1280:2380 --listen-peer-urls http://0.0.0.0:2380 --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://etcd1.test1280:2379 --auto-compaction-retention 1 --initial-cluster-token test1280 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 --initial-cluster-state new
test1280   6031   5952  0 10:41 pts/0    00:00:00 grep --color=always etcd

The etcd process on node1 has PID 6017; kill it with kill -9:

[test1280@node1 ~]$ kill -9 6017
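Note that ps -ef | grep etcd also lists the grep process itself, so the PID has to be picked out by eye. A small filter can do that instead. This is only a sketch: etcd_pids is a hypothetical name, and it assumes the usual ps -ef layout where the command starts in the eighth column.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the PID of each
# process whose command name (field 8, directory stripped) is exactly "etcd".
etcd_pids() {
    awk '{
        cmd = $8
        sub(/.*\//, "", cmd)    # drop a leading directory, if any
        if (cmd == "etcd") print $2
    }'
}

# Example:
#   ps -ef | etcd_pids
#   kill -9 "$(ps -ef | etcd_pids)"    # only sensible when exactly one etcd runs
```

Where procps is installed, pgrep -x etcd should give a similar result.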
2.3. Deleting the etcd data directory

Delete the etcd data directory on node1. In this example it is etcd1.etcd:

[test1280@node1 ~]$ ll
total 17044
drwx------. 3 test1280 test1280     4096 Mar 18 10:36 etcd1.etcd    <-- this one
-rw-rw-r--. 1 test1280 test1280     9540 Mar 18 10:36 etcd.log
drwxr-xr-x. 3 test1280 test1280     4096 Oct 15 06:53 etcd-v3.4.18-linux-amd64
-rw-r--r--. 1 test1280 test1280 17414708 Mar 18 01:38 etcd-v3.4.18-linux-amd64.tar.gz
-rwxr-xr-x. 1 test1280 test1280      478 Mar 18 10:24 start-etcd1.sh
-rwxr-xr-x. 1 test1280 test1280      478 Mar 18 10:26 start-etcd2.sh
-rwxr-xr-x. 1 test1280 test1280      478 Mar 18 10:26 start-etcd3.sh
[test1280@node1 ~]$ rm -rf etcd1.etcd/
2.4. Modifying the etcd startup script

Modify the etcd startup script on node1:

Before:

 --initial-cluster-state new >etcd.log 2>&1 &

After:

 --initial-cluster-state existing >etcd.log 2>&1 &

That is, the only change is setting --initial-cluster-state to existing.
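The same one-word edit can be scripted. The following is a minimal sketch using sed; the function name set_initial_cluster_state is hypothetical, and it assumes the flag and its value sit on the same line, as in the scripts in this article.

```shell
# Hypothetical helper: rewrite the value after --initial-cluster-state in a
# startup script. Assumption: flag and value are on one line, space-separated.
set_initial_cluster_state() {
    script=$1
    state=$2
    tmp=$(mktemp) || return 1
    if ! sed "s/--initial-cluster-state  *[a-z][a-z]*/--initial-cluster-state ${state}/" \
            "$script" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # Write back through the existing file so its permissions are kept.
    cat "$tmp" > "$script"
    rm -f "$tmp"
}

# Example:
#   set_initial_cluster_state start-etcd1.sh existing
```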

2.5. Restarting etcd

Restart etcd on node1 by running its script, start-etcd1.sh:

[test1280@node1 ~]$ ./start-etcd1.sh

etcd on node1 now fails to start; see etcd.log:

raft2022/03/18 11:02:20 INFO: 8d8f805c54155c1f [term: 0] received a MsgHeartbeat message with higher term from 427c1e146435064e [term: 5]
raft2022/03/18 11:02:20 INFO: 8d8f805c54155c1f became follower at term 5
raft2022/03/18 11:02:20 tocommit(11) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(11) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

goroutine 155 [running]:
log.(*Logger).Panicf(0xc0001426e0, 0x10e77d8, 0x5d, 0xc0003960a0, 0x2, 0x2)
    /home/remote/sbatsche/.gvm/gos/go1.12.17/src/log/log.go:219 +0xc1
...

The error is reproduced.
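The two numbers in the panic line are what matter: tocommit is the commit index received from the leader, and lastIndex is the last entry in the local raft log. To spot this pattern in a log file, a sketch like the following could be used. The function name parse_tocommit_panic is my own, and it assumes the panic line has exactly the wording shown above.

```shell
# Hypothetical helper: read an etcd log on stdin and, for each matching
# "panic: tocommit(...)" line, print the two indexes.
# Assumption: the panic message is worded exactly as in the log above.
parse_tocommit_panic() {
    sed -n 's/^panic: tocommit(\([0-9][0-9]*\)) is out of range \[lastIndex(\([0-9][0-9]*\))].*/tocommit=\1 lastIndex=\2/p'
}

# Example:
#   parse_tocommit_panic < etcd.log
```

A lastIndex of 0 together with a non-zero tocommit suggests the member started with an empty log while the rest of the cluster had already committed entries.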


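3. Solution

Someone knowledgeable has already posted the fix in the GitHub issue:

The problem arises when the etcd process crashes and its data directory is lost at the same time; restarting the member in that state fails. The cluster still lists the member under its old ID, so the leader sends it a commit index (11 in the log above) that the now-empty local raft log (lastIndex 0) cannot contain.

It is reasonable that such an extreme case requires manual intervention to recover.

step 1: Remove node1's etcd member from the cluster

The member ID of etcd on node1 is: 8d8f805c54155c1f

Remove node1's etcd member through the etcd instances on node2 and node3: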

[test1280@node2 ~]$ etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 member remove 8d8f805c54155c1f
Member 8d8f805c54155c1f removed from cluster 6ee07b66b4556e33
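If you do not have the member ID at hand, it can be looked up by name from etcdctl member list before removing the member. The sketch below assumes the default comma-separated output of etcdctl v3.4 (ID, status, name, peer URLs, client URLs, learner flag); member_id_by_name is a hypothetical helper name.

```shell
# Hypothetical helper: read `etcdctl member list` output on stdin and print
# the ID of the member with the given name.
# Assumption: default v3.4 output, fields separated by ", " with the name third.
member_id_by_name() {
    awk -F', ' -v n="$1" '$3 == n { print $1 }'
}

# Example:
#   id=$(etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 \
#            member list | member_id_by_name etcd1)
#   etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 \
#            member remove "$id"
```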
step 2: Add node1's etcd member back to the cluster

Add node1's etcd member back through the etcd instances on node2 and node3:

[test1280@node2 ~]$ etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 member add etcd1 --peer-urls=http://etcd1.test1280:2380
Member 3662f03eb0a523d9 added to cluster 6ee07b66b4556e33

ETCD_NAME="etcd1"
ETCD_INITIAL_CLUSTER="etcd1=http://etcd1.test1280:2380,etcd3=http://etcd3.test1280:2380,etcd2=http://etcd2.test1280:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://etcd1.test1280:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

Note that you must specify the name and peer-urls of node1's etcd member.
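The ETCD_* lines printed by member add are environment variable settings that etcd understands as equivalents of the corresponding flags. As a sketch, they can be captured into a file for later use. The function name extract_etcd_env and the file name etcd1.env are my own choices.

```shell
# Hypothetical helper: read `etcdctl member add` output on stdin and keep only
# the ETCD_*="..." lines.
extract_etcd_env() {
    grep '^ETCD_[A-Z_]*='
}

# Example (etcd1.env is an arbitrary file name):
#   etcdctl --endpoints=... member add etcd1 \
#       --peer-urls=http://etcd1.test1280:2380 | extract_etcd_env > etcd1.env
#   set -a; . ./etcd1.env; set +a    # export the variables into the shell
```

If you start etcd this way, leave the matching flags out of the command line, since a flag that is also given explicitly overrides its environment variable.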

step 3: Modify node1's etcd startup script

At last, start the member again, note you need to set the --initial-cluster-state as “existing” in this case.

Note that the final --initial-cluster-state is existing, not new:

[test1280@node1 ~]$ cat start-etcd1.sh 
#!/bin/bash

nohup etcd \
 --name etcd1 \
 --initial-advertise-peer-urls http://etcd1.test1280:2380 \
 --listen-peer-urls http://0.0.0.0:2380 \
 --listen-client-urls http://0.0.0.0:2379 \
 --advertise-client-urls http://etcd1.test1280:2379 \
 --auto-compaction-retention '1' \
 --initial-cluster-token test1280 \
 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
 --initial-cluster-state existing>etcd.log 2>&1 &
step 4: Clean up node1's old etcd data (if any)

Delete etcd1.etcd (if present):

[test1280@node1 ~]$ ll
total 17040
drwx------. 3 test1280 test1280     4096 Mar 18 11:11 etcd1.etcd
-rw-rw-r--. 1 test1280 test1280     4565 Mar 18 11:11 etcd.log
drwxr-xr-x. 3 test1280 test1280     4096 Oct 15 06:53 etcd-v3.4.18-linux-amd64
-rw-r--r--. 1 test1280 test1280 17414708 Mar 18 01:38 etcd-v3.4.18-linux-amd64.tar.gz
-rwxr-xr-x. 1 test1280 test1280      482 Mar 18 11:00 start-etcd1.sh
-rwxr-xr-x. 1 test1280 test1280      478 Mar 18 10:26 start-etcd2.sh
-rwxr-xr-x. 1 test1280 test1280      478 Mar 18 10:26 start-etcd3.sh
[test1280@node1 ~]$ rm -rf etcd1.etcd/
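Since rm -rf is unforgiving, a small guard can check that the target really looks like an etcd data directory before deleting it. This sketch assumes the usual layout in which the data directory contains a member/ subdirectory; remove_etcd_data_dir is a hypothetical helper name.

```shell
# Hypothetical helper: delete an etcd data directory only if it looks like one.
# Assumption: an etcd data directory contains a member/ subdirectory.
remove_etcd_data_dir() {
    dir=$1
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "nothing to remove: ${dir:-<empty>}" >&2
        return 0
    fi
    if [ ! -d "$dir/member" ]; then
        echo "refusing to remove $dir: no member/ subdirectory" >&2
        return 1
    fi
    rm -rf -- "$dir"
}

# Example:
#   remove_etcd_data_dir "$HOME/etcd1.etcd"
```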
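step 5: Restart and verify

Run start-etcd1.sh on node1 again and check the cluster with the endpoint status command from section 2.1.6. node1's etcd member has now rejoined the cluster normally.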


4. References

https://github.com/etcd-io/etcd/issues/13509

Reposted from blog.csdn.net/test1280/article/details/123579775