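PostgreSQL + Corosync + Pacemaker cluster: fixing a VIP that will not start
Environment
Operating system: CentOS Linux release 7.3.1611 (Core)
Database version: PostgreSQL 10.6
This environment is planned as a streaming replication cluster with one master and two standbys, with Corosync + Pacemaker managing high availability. One standby uses synchronous replication and takes part of the read load; the other uses asynchronous replication and serves as a real-time backup.
Reproducing the problem
Today I tried to build the PG cluster by hand. Once the setup was finished, the cluster turned out to be in an abnormal state: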
[root@centos02-5 /home/m]# crm status
Stack: corosync
Current DC: centos02-1 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
Last updated: Wed Oct 23 10:40:13 2019
Last change: Wed Oct 23 10:39:47 2019 by root via crm_attribute on centos02-5
2 nodes configured
6 resources configured
Online: [ centos02-1 centos02-5 ]
Full list of resources:
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Stopped
 Master/Slave Set: msPostgresql [pgsql]
     Slaves: [ centos02-1 centos02-5 ]
 Clone Set: clnPingCheck [pingCheck]
     Started: [ centos02-1 centos02-5 ]
 Resource Group: slave-group
     vip-slave (ocf::heartbeat:IPaddr2): Stopped
Both nodes had become standbys (slaves), and neither virtual IP could start.
Analysis
Checking the log turned up these messages:
[root@centos02-5 ~]# tail /var/log/cluster/corosync.log
Oct 23 10:41:54 pgsql(pgsql)[27468]: INFO: Master does not exist.
Oct 23 10:41:54 pgsql(pgsql)[27468]: INFO: My data status=LATEST.
Oct 23 10:41:54 pgsql(pgsql)[27468]: WARNING: Can't get 172.31.106.42 xlog location.
Oct 23 10:41:54 pgsql(pgsql)[27468]: WARNING: Can't get 172.31.106.25 xlog location.
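This shows that Pacemaker could not read the xlog location of either node, so no node could be promoted to master and the cluster failed to come up.
The relevant code in the pgsql resource agent script: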
[root@centos02-5 ~]# vim /usr/lib/ocf/resource.d/heartbeat/pgsql
    # get xlog locations of all nodes
    for node in ${NODE_LIST}; do
        output=`$CRM_ATTR_REBOOT -N "$node" -n \
                "$PGSQL_XLOG_LOC_NAME" -G -q 2>/dev/null`
        if [ $? -ne 0 ]; then
            ocf_log warn "Can't get $node xlog location."
            continue
        else
            ocf_log info "$node xlog location : $output"
            echo "$node $output" >> ${XLOG_NOTE_FILE}.${new}
            if [ "$node" = "$NODENAME" ]; then
                mylocation=$output
            fi
        fi
    done
The script shows that the warning is logged whenever the command $CRM_ATTR_REBOOT -N "$node" -n "$PGSQL_XLOG_LOC_NAME" -G -q 2>/dev/null exits with a non-zero status.
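The same query can be run by hand to see which node names it accepts. The following is only a minimal sketch under assumptions: I assume $CRM_ATTR_REBOOT expands to crm_attribute -l reboot and that, with a resource named pgsql, the attribute is called pgsql-xlog-loc; verify both against your own copy of the resource agent. check_xlog_attr, CRM_ATTRIBUTE and ATTR_NAME are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: repeat the resource agent's attribute query for each
# node name passed as an argument and report whether it succeeded.
# Assumption: the resource is named "pgsql", so the attribute is "pgsql-xlog-loc".
CRM_ATTRIBUTE="${CRM_ATTRIBUTE:-crm_attribute}"
ATTR_NAME="${ATTR_NAME:-pgsql-xlog-loc}"

check_xlog_attr() {
    local node out
    for node in "$@"; do
        # -l reboot: transient attribute, -N: node, -n: attribute name,
        # -G: query the value, -q: print only the value
        if out=$("$CRM_ATTRIBUTE" -l reboot -N "$node" -n "$ATTR_NAME" -G -q 2>/dev/null); then
            echo "$node xlog location: $out"
        else
            echo "$node: cannot get xlog location"
        fi
    done
}
```

Calling it once with the names shown in crm status (check_xlog_attr centos02-1 centos02-5) and once with the addresses from the log (check_xlog_attr 172.31.106.42 172.31.106.25) makes it easy to compare how the two forms behave.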
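Further testing showed that when I configured the pgsql resource I had set node_list to IP addresses, which makes this command fail. After changing the entries to the machines' host names, the problem was resolved.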
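Because this mistake is easy to repeat, a small pre-check can flag IP addresses in a node_list value before the resource is configured. This is a rough sketch rather than anything taken from the resource agent: find_ip_entries is a hypothetical helper, and it only recognizes IPv4-style entries.

```shell
# Hypothetical helper: print every entry of a space-separated node_list value
# that looks like an IPv4 address instead of a node name.
# Returns 1 if at least one such entry is found, 0 otherwise.
find_ip_entries() {
    local entry bad=0
    for entry in $1; do
        case "$entry" in
            # contains a character other than digits and dots: treat as a name
            *[!0-9.]*) ;;
            # only digits and dots, with at least three dots: looks like IPv4
            *.*.*.*) echo "$entry"; bad=1 ;;
        esac
    done
    return $bad
}
```

With the values from this post, find_ip_entries "172.31.106.42 172.31.106.25" prints both addresses and returns 1, while find_ip_entries "centos02-1 centos02-5" prints nothing and returns 0. The working setting here is node_list="centos02-1 centos02-5", which uses the same names that crm status lists under Online.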