etcd: tocommit is out of range [lastIndex]. Was the raft log corrupted, truncated, or lost?

Fast track:

Skip my rambling and go straight to:

https://github.com/etcd-io/etcd/issues/13509#issuecomment-980506247

The long version follows:

1. Environment and versions

Operating system:

[test1280@node1 ~]$ cat /etc/redhat-release 
CentOS release 6.8 (Final)
[test1280@node1 ~]$ uname -a
Linux node1 2.6.32-642.el6.x86_64 #1 SMP Tue May 10 17:27:01 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Software version:

[test1280@node1 ~]$ etcd --version
etcd Version: 3.4.18
Git SHA: 72d3e382e
Go Version: go1.12.17
Go OS/Arch: linux/amd64

Host information:

Host	IP
node1	192.168.75.128
node2	192.168.75.129
node3	192.168.75.130

2. Reproducing the error

2.1. Installation and deployment

2.1.1. Adding hostnames

On node1 through node3, edit /etc/hosts and add the following:

192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280
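
Editing the same file by hand on three hosts invites typos and duplicate entries. Here is a minimal sketch of an idempotent way to add the entries, assuming it is run as root. `add_host_entry` is a hypothetical helper name of mine, not part of etcd or any standard tool.

```shell
#!/bin/bash
# Sketch only: add_host_entry is a hypothetical helper, not a standard tool.

# add_host_entry FILE IP NAME
# Appends "IP NAME" to FILE unless FILE already maps IP to NAME.
add_host_entry() {
    local file="$1" ip="$2" name="$3"
    # awk exits 0 only if some line starts with IP and lists NAME after it.
    if ! awk -v ip="$ip" -v name="$name" \
        '$1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
         END { exit !found }' "$file"; then
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
}
```

Usage would look like `add_host_entry /etc/hosts 192.168.75.128 etcd1.test1280`, repeated for each of the three entries; running it a second time leaves the file unchanged.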

Check /etc/hosts on node1 through node3; it should end up as follows:

node1

[root@node1 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280

node2

[root@node2 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280

node3

[root@node3 ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280

2.1.2. Download and install

Upload the etcd package to node1 through node3 and extract it.

etcd release downloads: https://github.com/etcd-io/etcd/releases

I used: etcd-v3.4.18-linux-amd64.tar.gz

Extract the package:

[test1280@node1 ~]$ tar zxf etcd-v3.4.18-linux-amd64.tar.gz

[test1280@node2 ~]$ tar zxf etcd-v3.4.18-linux-amd64.tar.gz

[test1280@node3 ~]$ tar zxf etcd-v3.4.18-linux-amd64.tar.gz

2.1.3. Writing the start scripts

node1 script: start-etcd1.sh

#!/bin/bash

nohup etcd \
 --name etcd1 \
 --initial-advertise-peer-urls http://etcd1.test1280:2380 \
 --listen-peer-urls http://0.0.0.0:2380 \
 --listen-client-urls http://0.0.0.0:2379 \
 --advertise-client-urls http://etcd1.test1280:2379 \
 --auto-compaction-retention '1' \
 --initial-cluster-token test1280 \
 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
 --initial-cluster-state new >etcd.log 2>&1 &

node2 script: start-etcd2.sh

#!/bin/bash

nohup etcd \
 --name etcd2 \
 --initial-advertise-peer-urls http://etcd2.test1280:2380 \
 --listen-peer-urls http://0.0.0.0:2380 \
 --listen-client-urls http://0.0.0.0:2379 \
 --advertise-client-urls http://etcd2.test1280:2379 \
 --auto-compaction-retention '1' \
 --initial-cluster-token test1280 \
 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
 --initial-cluster-state new >etcd.log 2>&1 &

node3 script: start-etcd3.sh

#!/bin/bash

nohup etcd \
 --name etcd3 \
 --initial-advertise-peer-urls http://etcd3.test1280:2380 \
 --listen-peer-urls http://0.0.0.0:2380 \
 --listen-client-urls http://0.0.0.0:2379 \
 --advertise-client-urls http://etcd3.test1280:2379 \
 --auto-compaction-retention '1' \
 --initial-cluster-token test1280 \
 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
 --initial-cluster-state new >etcd.log 2>&1 &
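
The three scripts are meant to differ only in the member name, so copying and editing them by hand is error-prone. As a rough sketch, the shared parts can be generated from the name instead. `initial_cluster` and `start_etcd` are hypothetical helper names I made up; the flags are the ones used in the scripts above, and the domain `test1280` is the one from this post.

```shell
#!/bin/bash
# Sketch only: initial_cluster and start_etcd are hypothetical helper names.

# initial_cluster DOMAIN NAME...
# Prints NAME=http://NAME.DOMAIN:2380 pairs joined by commas.
initial_cluster() {
    local domain="$1" out="" name
    shift
    for name in "$@"; do
        out="${out:+$out,}${name}=http://${name}.${domain}:2380"
    done
    printf '%s\n' "$out"
}

# start_etcd NAME STATE
# Starts etcd in the background with the same flags as the scripts above.
# STATE is "new" for the first bootstrap and "existing" when rejoining.
start_etcd() {
    local name="$1" state="$2" domain="test1280"
    nohup etcd \
        --name "$name" \
        --initial-advertise-peer-urls "http://${name}.${domain}:2380" \
        --listen-peer-urls http://0.0.0.0:2380 \
        --listen-client-urls http://0.0.0.0:2379 \
        --advertise-client-urls "http://${name}.${domain}:2379" \
        --auto-compaction-retention '1' \
        --initial-cluster-token test1280 \
        --initial-cluster "$(initial_cluster "$domain" etcd1 etcd2 etcd3)" \
        --initial-cluster-state "$state" >etcd.log 2>&1 &
}
```

With this, node3 would run `start_etcd etcd3 new`, and the name and both advertised URLs cannot drift apart.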

2.1.4. Configuring PATH

On node1 through node3, edit the PATH variable in $HOME/.bash_profile:

export PATH=$HOME/etcd-v3.4.18-linux-amd64:$PATH

2.1.5. Starting etcd

node1:

[test1280@node1 ~]$ ./start-etcd1.sh

node2:

[test1280@node2 ~]$ ./start-etcd2.sh

node3:

[test1280@node3 ~]$ ./start-etcd3.sh

2.1.6. Checking status

Run the following on any of node1 through node3:

etcdctl --endpoints=http://etcd1.test1280:2379,http://etcd2.test1280:2379,http://etcd3.test1280:2379 endpoint status -w table

You should see output like this:

[test1280@node3 ~]$ etcdctl --endpoints=http://etcd1.test1280:2379,http://etcd2.test1280:2379,http://etcd3.test1280:2379 endpoint status -w table
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://etcd1.test1280:2379 | 8d8f805c54155c1f |  3.4.18 |   20 kB |      true |      false |         4 |          9 |                  9 |        |
| http://etcd2.test1280:2379 | b2a96233e99da684 |  3.4.18 |   20 kB |     false |      false |         4 |          9 |                  9 |        |
| http://etcd3.test1280:2379 | 427c1e146435064e |  3.4.18 |   20 kB |     false |      false |         4 |          9 |                  9 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

The etcd cluster is now healthy.
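
"Healthy" here means exactly one member reports IS LEADER = true and all members agree on the raft term and index. A small sketch for checking the leader count from the table output; `count_leaders` is a hypothetical helper of mine, and it assumes the column order shown in the table above (etcdctl 3.4.18).

```shell
#!/bin/bash
# Sketch only: count_leaders is a hypothetical helper.

# count_leaders
# Reads `etcdctl endpoint status -w table` output on stdin and prints how many
# rows have "true" in the IS LEADER column (the 6th field when split on "|",
# because the first field is the empty text before the leading "|").
count_leaders() {
    awk -F'|' 'NF > 6 && $6 ~ /true/ { n++ } END { print n + 0 }'
}
```

It would be used as `etcdctl --endpoints=... endpoint status -w table | count_leaders`; anything other than 1 deserves a closer look. `etcdctl endpoint health` with the same `--endpoints` is another quick check.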

For more on installation, see: Linux: Installing an etcd cluster.

2.2. Killing etcd

Use kill -9 to kill the etcd process on any node; here we use node1 as the example:

[test1280@node1 ~]$ ps -ef | grep etcd
test1280   6017      1  1 10:36 pts/0    00:00:03 etcd --name etcd1 --initial-advertise-peer-urls http://etcd1.test1280:2380 --listen-peer-urls http://0.0.0.0:2380 --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://etcd1.test1280:2379 --auto-compaction-retention 1 --initial-cluster-token test1280 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 --initial-cluster-state new
test1280   6031   5952  0 10:41 pts/0    00:00:00 grep --color=always etcd

The etcd process on node1 has PID 6017; kill it with kill -9:

[test1280@node1 ~]$ kill -9 6017

2.3. Deleting the etcd data files

Delete the etcd data directory on node1; in this example it is etcd1.etcd:

[test1280@node1 ~]$ ll
total 17044
drwx------. 3 test1280 test1280     4096 Mar 18 10:36 etcd1.etcd    <-- this one
-rw-rw-r--. 1 test1280 test1280     9540 Mar 18 10:36 etcd.log
drwxr-xr-x. 3 test1280 test1280     4096 Oct 15 06:53 etcd-v3.4.18-linux-amd64
-rw-r--r--. 1 test1280 test1280 17414708 Mar 18 01:38 etcd-v3.4.18-linux-amd64.tar.gz
-rwxr-xr-x. 1 test1280 test1280      478 Mar 18 10:24 start-etcd1.sh
-rwxr-xr-x. 1 test1280 test1280      478 Mar 18 10:26 start-etcd2.sh
-rwxr-xr-x. 1 test1280 test1280      478 Mar 18 10:26 start-etcd3.sh
[test1280@node1 ~]$ rm -rf etcd1.etcd/

2.4. Modifying the etcd start script

Modify the etcd start script on node1:

Before:

 --initial-cluster-state new >etcd.log 2>&1 &

After:

 --initial-cluster-state existing >etcd.log 2>&1 &

That is, the only change is setting --initial-cluster-state to existing.

2.5. Restarting etcd

Restart etcd on node1 by running its script start-etcd1.sh:

[test1280@node1 ~]$ ./start-etcd1.sh

At this point etcd on node1 fails to start; see etcd.log:

raft2022/03/18 11:02:20 INFO: 8d8f805c54155c1f [term: 0] received a MsgHeartbeat message with higher term from 427c1e146435064e [term: 5]
raft2022/03/18 11:02:20 INFO: 8d8f805c54155c1f became follower at term 5
raft2022/03/18 11:02:20 tocommit(11) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(11) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?

goroutine 155 [running]:
log.(*Logger).Panicf(0xc0001426e0, 0x10e77d8, 0x5d, 0xc0003960a0, 0x2, 0x2)
    /home/remote/sbatsche/.gvm/gos/go1.12.17/src/log/log.go:219 +0xc1
...

The error is reproduced.
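
The two numbers in the panic tell the story. The heartbeat from the leader asks this member to commit up to index 11, but its own log ends at index 0, because the data directory was wiped while the rest of the cluster still treats it as the old member 8d8f805c54155c1f. A small sketch for pulling those numbers out of etcd.log; `parse_tocommit_panic` is a hypothetical helper of mine, not an etcd tool.

```shell
#!/bin/bash
# Sketch only: parse_tocommit_panic is a hypothetical helper.

# parse_tocommit_panic
# Reads an etcd log on stdin and prints "TOCOMMIT LASTINDEX" taken from the
# first "tocommit(N) is out of range [lastIndex(M)]" line it finds.
parse_tocommit_panic() {
    sed -n 's/.*tocommit(\([0-9][0-9]*\)) is out of range \[lastIndex(\([0-9][0-9]*\))].*/\1 \2/p' \
        | head -n 1
}
```

Run as `parse_tocommit_panic < etcd.log`. A lastIndex of 0 alongside a tocommit greater than 0 points at an empty or missing data directory rather than a partially damaged one.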

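3. Solution

Someone has already posted the fix in the GitHub issue:

The problem arises when the etcd process crashes and its data files are lost at the same time; restarting it in that state fails.

In such an extreme case, it is reasonable that recovery requires manual intervention.

step 1: Remove node1's etcd member from the cluster

The ID of node1's etcd member is: 8d8f805c54155c1f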

Using the etcd instances on node2 and node3 as endpoints, remove node1's etcd member:

[test1280@node2 ~]$ etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 member remove 8d8f805c54155c1f
Member 8d8f805c54155c1f removed from cluster 6ee07b66b4556e33
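
If the member ID is not already at hand, it can be looked up by name with `etcdctl member list` before removing it. A minimal sketch, assuming the default comma-separated output of etcdctl 3.4 (ID, status, name, peer URLs, client URLs, learner flag); `member_id_by_name` is a hypothetical helper of mine.

```shell
#!/bin/bash
# Sketch only: member_id_by_name is a hypothetical helper.

# member_id_by_name NAME
# Reads `etcdctl member list` output on stdin (fields separated by ", ")
# and prints the ID of the member whose name field equals NAME.
member_id_by_name() {
    local name="$1"
    awk -F', ' -v name="$name" '$3 == name { print $1 }'
}
```

For example, `etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 member list | member_id_by_name etcd1` should print the ID to pass to `member remove`.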

step 2: Add node1's etcd member back to the cluster

Using the etcd instances on node2 and node3 as endpoints, add node1's etcd member again:

[test1280@node2 ~]$ etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 member add etcd1 --peer-urls=http://etcd1.test1280:2380
Member 3662f03eb0a523d9 added to cluster 6ee07b66b4556e33

ETCD_NAME="etcd1"
ETCD_INITIAL_CLUSTER="etcd1=http://etcd1.test1280:2380,etcd3=http://etcd3.test1280:2380,etcd2=http://etcd2.test1280:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://etcd1.test1280:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

Note that the name and peer-urls of node1's etcd member must be specified.
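
The four ETCD_* lines printed by member add are the settings the new member should start with. etcd reads ETCD_* environment variables for any flag not given on the command line, so one option is to save those lines to a file and export them before starting etcd. A rough sketch; `load_member_env` and the file name `etcd1.env` are hypothetical names I chose.

```shell
#!/bin/bash
# Sketch only: load_member_env is a hypothetical helper.

# load_member_env FILE
# Sources FILE (lines of the form ETCD_X="value") and exports every variable
# it sets, so a later `etcd` process started from this shell inherits them.
load_member_env() {
    local file="$1"
    set -a          # export all variables assigned from here on
    . "$file"
    set +a
}
```

After `load_member_env ./etcd1.env`, etcd could be started with only the listen and advertise-client flags. Flags passed on the command line take precedence over the environment, so the name, initial-cluster, and initial-cluster-state flags would have to be left off for the variables to apply.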

step 3: Modify the etcd start script on node1

At last, start the member again, note you need to set the --initial-cluster-state as “existing” in this case.

Note that --initial-cluster-state at the end is existing, not new:

[test1280@node1 ~]$ cat start-etcd1.sh 
#/bin/bash

nohup etcd \
 --name etcd1 \
 --initial-advertise-peer-urls http://etcd1.test1280:2380 \
 --listen-peer-urls http://0.0.0.0:2380 \
 --listen-client-urls http://0.0.0.0:2379 \
 --advertise-client-urls http://etcd1.test1280:2379 \
 --auto-compaction-retention '1' \
 --initial-cluster-token test1280 \
 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
 --initial-cluster-state existing>etcd.log 2>&1 &

step 4: Clean up the old etcd data on node1 (if any)

Delete etcd1.etcd (if it exists):

[test1280@node1 ~]$ ll
total 17040
drwx------. 3 test1280 test1280     4096 Mar 18 11:11 etcd1.etcd
-rw-rw-r--. 1 test1280 test1280     4565 Mar 18 11:11 etcd.log
drwxr-xr-x. 3 test1280 test1280     4096 Oct 15 06:53 etcd-v3.4.18-linux-amd64
-rw-r--r--. 1 test1280 test1280 17414708 Mar 18 01:38 etcd-v3.4.18-linux-amd64.tar.gz
-rwxr-xr-x. 1 test1280 test1280      482 Mar 18 11:00 start-etcd1.sh
-rwxr-xr-x. 1 test1280 test1280      478 Mar 18 10:26 start-etcd2.sh
-rwxr-xr-x. 1 test1280 test1280      478 Mar 18 10:26 start-etcd3.sh
[test1280@node1 ~]$ rm -rf etcd1.etcd/

step 5: Restart and check

You can see that etcd on node1 has now rejoined the cluster normally.
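
Put together, the recovery can be sketched as one function. This is only an outline under my own assumptions: it is run on node1 itself, etcdctl points at the two healthy members, and the start script already uses --initial-cluster-state existing. `run` and `recover_member` are hypothetical helper names; setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/bin/bash
# Sketch only: run and recover_member are hypothetical helpers.

# run CMD...
# Prints the command when DRY_RUN=1, otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# recover_member OLD_ID NAME PEER_URL ENDPOINTS DATA_DIR START_SCRIPT
# Removes the stale member, re-adds it, wipes leftover data, then restarts.
recover_member() {
    local old_id="$1" name="$2" peer_url="$3" endpoints="$4"
    local data_dir="$5" start_script="$6"
    # Refuse to continue with an empty data dir argument (rm -rf safety).
    [ -n "$data_dir" ] || return 1
    run etcdctl --endpoints="$endpoints" member remove "$old_id" || return 1
    run etcdctl --endpoints="$endpoints" member add "$name" --peer-urls="$peer_url" || return 1
    run rm -rf -- "$data_dir" || return 1
    run "$start_script"
}
```

A dry run with the values from this post would be `DRY_RUN=1 recover_member 8d8f805c54155c1f etcd1 http://etcd1.test1280:2380 http://etcd2.test1280:2379,http://etcd3.test1280:2379 etcd1.etcd ./start-etcd1.sh`. It is worth reading the printed commands before running them for real, since member remove and rm -rf are not reversible.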

4. References

https://github.com/etcd-io/etcd/issues/13509