etcd: tocommit is out of range [lastIndex]. Was the raft log corrupted, truncated, or lost?
Quick path:
No need to read my rambling, go straight to:
https://github.com/etcd-io/etcd/issues/13509#issuecomment-980506247
The long version follows:
1. Environment and versions
Operating system:
[test1280@node1 ~]$ cat /etc/redhat-release
CentOS release 6.8 (Final)
[test1280@node1 ~]$ uname -a
Linux node1 2.6.32-642.el6.x86_64 #1 SMP Tue May 10 17:27:01 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Software version:
[test1280@node1 ~]$ etcd --version
etcd Version: 3.4.18
Git SHA: 72d3e382e
Go Version: go1.12.17
Go OS/Arch: linux/amd64
Host information:
Host | IP |
---|---|
node1 | 192.168.75.128 |
node2 | 192.168.75.129 |
node3 | 192.168.75.130 |
2. Reproducing the error
2.1. Installation and deployment
2.1.1. Adding hostnames
Edit /etc/hosts on node1 through node3 and add the following:
192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280
Check /etc/hosts on node1 through node3; the final contents are:
node1
[root@node1 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280
node2
[root@node2 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280
node3
[root@node3 ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.75.128 etcd1.test1280
192.168.75.129 etcd2.test1280
192.168.75.130 etcd3.test1280
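Editing /etc/hosts by hand on three machines is easy to get wrong, so here is a minimal sketch of an idempotent alternative. add_host_entry is a hypothetical helper of my own (not part of etcd): it appends an "IP hostname" line only if that hostname is not already mapped in the file.

```shell
#!/bin/bash
# Hypothetical helper: append "IP NAME" to a hosts file unless NAME is
# already present as a hostname field on a non-comment line.
add_host_entry() {
    local hosts_file="$1" ip="$2" name="$3"
    if awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }
    ' "$hosts_file"; then
        return 0    # already mapped, nothing to do
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Example (run as root on each node, not executed here):
#   add_host_entry /etc/hosts 192.168.75.128 etcd1.test1280
#   add_host_entry /etc/hosts 192.168.75.129 etcd2.test1280
#   add_host_entry /etc/hosts 192.168.75.130 etcd3.test1280
```

Running it twice leaves a single entry, so the same commands can be replayed on every node.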
2.1.2. Download and install
Upload the etcd release package to node1 through node3 and extract it.
etcd release download page: https://github.com/etcd-io/etcd/releases
I used: etcd-v3.4.18-linux-amd64.tar.gz
Extract the package:
[test1280@node1 ~]$ tar zxf etcd-v3.4.18-linux-amd64.tar.gz
[test1280@node2 ~]$ tar zxf etcd-v3.4.18-linux-amd64.tar.gz
[test1280@node3 ~]$ tar zxf etcd-v3.4.18-linux-amd64.tar.gz
2.1.3. Writing the startup scripts
node1 script: start-etcd1.sh
#!/bin/bash
nohup etcd \
--name etcd1 \
--initial-advertise-peer-urls http://etcd1.test1280:2380 \
--listen-peer-urls http://0.0.0.0:2380 \
--listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://etcd1.test1280:2379 \
--auto-compaction-retention '1' \
--initial-cluster-token test1280 \
--initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
--initial-cluster-state new >etcd.log 2>&1 &
node2 script: start-etcd2.sh
#!/bin/bash
nohup etcd \
--name etcd2 \
--initial-advertise-peer-urls http://etcd2.test1280:2380 \
--listen-peer-urls http://0.0.0.0:2380 \
--listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://etcd2.test1280:2379 \
--auto-compaction-retention '1' \
--initial-cluster-token test1280 \
--initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
--initial-cluster-state new >etcd.log 2>&1 &
node3 script: start-etcd3.sh
#!/bin/bash
nohup etcd \
--name etcd3 \
--initial-advertise-peer-urls http://etcd3.test1280:2380 \
--listen-peer-urls http://0.0.0.0:2380 \
--listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://etcd3.test1280:2379 \
--auto-compaction-retention '1' \
--initial-cluster-token test1280 \
--initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
--initial-cluster-state new >etcd.log 2>&1 &
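The three scripts differ only in the node number, which makes copy-paste slips likely. As a sketch, the flags could be generated from the node number instead. etcd_flags is a hypothetical helper of my own; the flag names and values are the ones used in the scripts above.

```shell
#!/bin/bash
# Hypothetical helper: print one etcd flag or value per line for node N.
# Usage: etcd_flags <node-number> [new|existing]
etcd_flags() {
    local n="$1" state="${2:-new}"
    local cluster="etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380"
    printf '%s\n' \
        --name "etcd${n}" \
        --initial-advertise-peer-urls "http://etcd${n}.test1280:2380" \
        --listen-peer-urls "http://0.0.0.0:2380" \
        --listen-client-urls "http://0.0.0.0:2379" \
        --advertise-client-urls "http://etcd${n}.test1280:2379" \
        --auto-compaction-retention 1 \
        --initial-cluster-token test1280 \
        --initial-cluster "$cluster" \
        --initial-cluster-state "$state"
}

# Example (not executed here): start node3 with the generated flags.
#   mapfile -t flags < <(etcd_flags 3)
#   nohup etcd "${flags[@]}" >etcd.log 2>&1 &
```

The optional second argument also covers the later switch from new to existing without editing the script.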
2.1.4. Configuring PATH
On node1 through node3, edit the PATH variable in $HOME/.bash_profile:
export PATH=$HOME/etcd-v3.4.18-linux-amd64:$PATH
2.1.5. Starting etcd
node1:
[test1280@node1 ~]$ ./start-etcd1.sh
node2:
[test1280@node2 ~]$ ./start-etcd2.sh
node3:
[test1280@node3 ~]$ ./start-etcd3.sh
2.1.6. Checking status
Run the following on any of node1 through node3:
etcdctl --endpoints=http://etcd1.test1280:2379,http://etcd2.test1280:2379,http://etcd3.test1280:2379 endpoint status -w table
The output looks like this:
[test1280@node3 ~]$ etcdctl --endpoints=http://etcd1.test1280:2379,http://etcd2.test1280:2379,http://etcd3.test1280:2379 endpoint status -w table
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://etcd1.test1280:2379 | 8d8f805c54155c1f | 3.4.18 | 20 kB | true | false | 4 | 9 | 9 | |
| http://etcd2.test1280:2379 | b2a96233e99da684 | 3.4.18 | 20 kB | false | false | 4 | 9 | 9 | |
| http://etcd3.test1280:2379 | 427c1e146435064e | 3.4.18 | 20 kB | false | false | 4 | 9 | 9 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
At this point the etcd cluster is healthy.
For more on installation, see: Linux: Installing an etcd cluster.
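Reading the table by eye works for three nodes, but the same check can be scripted. The sketch below counts the rows whose IS LEADER column is true. count_leaders is a hypothetical helper, and it assumes the column order shown in the 3.4.18 table above, so verify it against your own output.

```shell
#!/bin/bash
# Hypothetical helper: read `endpoint status -w table` output on stdin and
# print how many rows have "true" in the IS LEADER column (5th data column).
count_leaders() {
    awk -F'|' '
        { v = $6; gsub(/[[:space:]]/, "", v); if (v == "true") n++ }
        END { print n + 0 }
    '
}

# Example (not executed here):
#   etcdctl --endpoints=http://etcd1.test1280:2379,http://etcd2.test1280:2379,http://etcd3.test1280:2379 \
#       endpoint status -w table | count_leaders
```

A healthy cluster should report exactly one leader.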
2.2. Killing etcd
Use kill -9 to kill the etcd process on any node. I use node1 as the example:
[test1280@node1 ~]$ ps -ef | grep etcd
test1280 6017 1 1 10:36 pts/0 00:00:03 etcd --name etcd1 --initial-advertise-peer-urls http://etcd1.test1280:2380 --listen-peer-urls http://0.0.0.0:2380 --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://etcd1.test1280:2379 --auto-compaction-retention 1 --initial-cluster-token test1280 --initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 --initial-cluster-state new
test1280 6031 5952 0 10:41 pts/0 00:00:00 grep --color=always etcd
The etcd process on node1 has PID 6017; kill it with kill -9:
[test1280@node1 ~]$ kill -9 6017
2.3. Deleting the etcd data directory
Delete etcd's data directory on node1. In this example it is etcd1.etcd:
[test1280@node1 ~]$ ll
total 17044
drwx------. 3 test1280 test1280 4096 Mar 18 10:36 etcd1.etcd  <-- this one
-rw-rw-r--. 1 test1280 test1280 9540 Mar 18 10:36 etcd.log
drwxr-xr-x. 3 test1280 test1280 4096 Oct 15 06:53 etcd-v3.4.18-linux-amd64
-rw-r--r--. 1 test1280 test1280 17414708 Mar 18 01:38 etcd-v3.4.18-linux-amd64.tar.gz
-rwxr-xr-x. 1 test1280 test1280 478 Mar 18 10:24 start-etcd1.sh
-rwxr-xr-x. 1 test1280 test1280 478 Mar 18 10:26 start-etcd2.sh
-rwxr-xr-x. 1 test1280 test1280 478 Mar 18 10:26 start-etcd3.sh
[test1280@node1 ~]$ rm -rf etcd1.etcd/
2.4. Modifying the etcd startup script
Modify the etcd startup script on node1.
Before:
--initial-cluster-state new >etcd.log 2>&1 &
After:
--initial-cluster-state existing >etcd.log 2>&1 &
That is, the only change is setting --initial-cluster-state to existing.
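If you prefer not to edit the script by hand, the change can be scripted. This is only a sketch: set_cluster_state_existing is a hypothetical helper that rewrites the file in place and writes through the existing file so its execute permission is kept.

```shell
#!/bin/bash
# Hypothetical helper: switch "--initial-cluster-state new" to "existing"
# in a start script, keeping the file's permissions.
set_cluster_state_existing() {
    local script="$1" tmp
    tmp=$(mktemp) || return 1
    sed 's/--initial-cluster-state  *new/--initial-cluster-state existing/' "$script" > "$tmp" \
        && cat "$tmp" > "$script"
    local rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Example (not executed here):
#   set_cluster_state_existing start-etcd1.sh
```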
2.5. Restarting etcd
Restart etcd on node1 by running its script, start-etcd1.sh:
[test1280@node1 ~]$ ./start-etcd1.sh
At this point etcd on node1 fails to start. etcd.log shows:
raft2022/03/18 11:02:20 INFO: 8d8f805c54155c1f [term: 0] received a MsgHeartbeat message with higher term from 427c1e146435064e [term: 5]
raft2022/03/18 11:02:20 INFO: 8d8f805c54155c1f became follower at term 5
raft2022/03/18 11:02:20 tocommit(11) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(11) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
goroutine 155 [running]:
log.(*Logger).Panicf(0xc0001426e0, 0x10e77d8, 0x5d, 0xc0003960a0, 0x2, 0x2)
/home/remote/sbatsche/.gvm/gos/go1.12.17/src/log/log.go:219 +0xc1
...
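The error is reproduced. The log shows what went wrong: node1 came back with an empty raft log (lastIndex 0), while the heartbeat it received from the leader reported a commit index of 11.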
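To spot this failure without scrolling through the log, a simple pattern match is enough. The sketch below looks for the panic message shown above; has_raft_log_panic is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given log file contains the
# "tocommit(N) is out of range [lastIndex(M)]" message.
has_raft_log_panic() {
    grep -q 'tocommit([0-9]*) is out of range \[lastIndex([0-9]*)]' "$1"
}

# Example (not executed here):
#   if has_raft_log_panic etcd.log; then
#       echo "member lost its raft log; remove and re-add it"
#   fi
```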
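3. Solution
Someone has already given the fix in a GitHub issue:
The problem occurs when the etcd process crashes and its data files are lost at the same time; restarting in that state fails.
In such an extreme case it is reasonable that recovery requires manual intervention.
Step 1: Remove node1's etcd member from the cluster
The ID of node1's etcd member is: 8d8f805c54155c1f
Remove node1's member through the etcd instances on node2 and node3: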
[test1280@node2 ~]$ etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 member remove 8d8f805c54155c1f
Member 8d8f805c54155c1f removed from cluster 6ee07b66b4556e33
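The member ID above was taken from the earlier status table. It can also be looked up by name from etcdctl member list. The sketch below assumes the default comma-separated output of etcdctl 3.4 (ID, status, name, peer URLs, client URLs, learner flag), so check it against your own output. member_id_by_name is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: read `etcdctl member list` (default simple format)
# on stdin and print the ID of the member with the given name.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Example (not executed here):
#   etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 \
#       member list | member_id_by_name etcd1
```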
Step 2: Add node1's etcd member back to the cluster
Add node1's member again through the etcd instances on node2 and node3:
[test1280@node2 ~]$ etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 member add etcd1 --peer-urls=http://etcd1.test1280:2380
Member 3662f03eb0a523d9 added to cluster 6ee07b66b4556e33
ETCD_NAME="etcd1"
ETCD_INITIAL_CLUSTER="etcd1=http://etcd1.test1280:2380,etcd3=http://etcd3.test1280:2380,etcd2=http://etcd2.test1280:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://etcd1.test1280:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Note that you must specify the name and peer-urls of node1's etcd member.
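The ETCD_* lines printed by member add are environment variables that etcd reads in place of the matching flags. As a sketch, they can be captured into a file and exported before starting etcd. extract_etcd_env and the file name etcd1.env are hypothetical. As far as I know, a flag given on the command line shadows the matching variable, so drop those flags if you go this route.

```shell
#!/bin/bash
# Hypothetical helper: keep only the ETCD_*= lines from `member add` output.
extract_etcd_env() {
    grep '^ETCD_[A-Z_]*='
}

# Example (not executed here):
#   etcdctl --endpoints=http://etcd2.test1280:2379,http://etcd3.test1280:2379 \
#       member add etcd1 --peer-urls=http://etcd1.test1280:2380 | extract_etcd_env > etcd1.env
#   set -a; . ./etcd1.env; set +a    # export every variable in the file
#   nohup etcd --listen-peer-urls http://0.0.0.0:2380 \
#       --listen-client-urls http://0.0.0.0:2379 \
#       --advertise-client-urls http://etcd1.test1280:2379 >etcd.log 2>&1 &
```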
Step 3: Modify node1's etcd startup script
At last, start the member again, note you need to set the --initial-cluster-state as “existing” in this case.
Note that the final --initial-cluster-state is existing, not new:
[test1280@node1 ~]$ cat start-etcd1.sh
#!/bin/bash
nohup etcd \
--name etcd1 \
--initial-advertise-peer-urls http://etcd1.test1280:2380 \
--listen-peer-urls http://0.0.0.0:2380 \
--listen-client-urls http://0.0.0.0:2379 \
--advertise-client-urls http://etcd1.test1280:2379 \
--auto-compaction-retention '1' \
--initial-cluster-token test1280 \
--initial-cluster etcd1=http://etcd1.test1280:2380,etcd2=http://etcd2.test1280:2380,etcd3=http://etcd3.test1280:2380 \
--initial-cluster-state existing>etcd.log 2>&1 &
Step 4: Clean up node1's old etcd data (if any)
Delete etcd1.etcd (if it exists):
[test1280@node1 ~]$ ll
total 17040
drwx------. 3 test1280 test1280 4096 Mar 18 11:11 etcd1.etcd
-rw-rw-r--. 1 test1280 test1280 4565 Mar 18 11:11 etcd.log
drwxr-xr-x. 3 test1280 test1280 4096 Oct 15 06:53 etcd-v3.4.18-linux-amd64
-rw-r--r--. 1 test1280 test1280 17414708 Mar 18 01:38 etcd-v3.4.18-linux-amd64.tar.gz
-rwxr-xr-x. 1 test1280 test1280 482 Mar 18 11:00 start-etcd1.sh
-rwxr-xr-x. 1 test1280 test1280 478 Mar 18 10:26 start-etcd2.sh
-rwxr-xr-x. 1 test1280 test1280 478 Mar 18 10:26 start-etcd3.sh
[test1280@node1 ~]$ rm -rf etcd1.etcd/
Step 5: Restart and check
Run start-etcd1.sh again and repeat the status check from 2.1.6. node1's etcd has now rejoined the cluster normally.
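For reference, the whole recovery can be written down as a dry-run plan that only prints the commands, so they can be reviewed before anything is executed. This is a sketch under my own assumptions: recovery_plan is a hypothetical helper, the start script is assumed to be named start-NAME.sh as in this article, and the final member list call is just one way to confirm the result.

```shell
#!/bin/bash
# Hypothetical helper: print (do not run) the recovery commands for a member
# that lost its data directory.
# Usage: recovery_plan NAME OLD_ID PEER_URL HEALTHY_ENDPOINTS DATA_DIR
recovery_plan() {
    local name="$1" old_id="$2" peer_url="$3" endpoints="$4" data_dir="$5"
    cat <<EOF
etcdctl --endpoints=${endpoints} member remove ${old_id}
etcdctl --endpoints=${endpoints} member add ${name} --peer-urls=${peer_url}
rm -rf ${data_dir}
./start-${name}.sh
etcdctl --endpoints=${endpoints} member list
EOF
}

# Example:
#   recovery_plan etcd1 8d8f805c54155c1f http://etcd1.test1280:2380 \
#       http://etcd2.test1280:2379,http://etcd3.test1280:2379 etcd1.etcd
```

The start script must already use --initial-cluster-state existing (step 3) before the fourth command is run.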