Troubleshooting a GP Database Initialization Failure

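1. Background

A GP database cluster consisting of a master, a standby, and segments was being installed across two hosts, and the initialization stage failed.

The GP database start/stop logs under /home/gpadmin/gpAdminLogs showed the following errors: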

20180317:08:31:39:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Postmaster /data2/primary/gpseg3 is running (pid 87089)
20180317:08:31:39:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Transitioning segments, mirroringMode is quiescent...
20180317:08:52:04:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Marking failed /data2/primary/gpseg3, Start failed; check segment logfile.  "peer shut down connection before response was fully received  Retrying no 1  peer shut down connection before response was fully received  Retrying no 2  peer shut down connection before response was fully received  Retrying no 3  peer shut down connection before response was fully received  Retrying no 4  peer shut down connection before response was fully received  Retrying no 5  peer shut down connection before response was fully received  Retrying no 6  peer shut down connection before response was fully received  Retrying no 7  peer shut down connection before response was fully received  Retrying no 8  peer shut down connection before response was fully received  Retrying no 9  peer shut down connection before response was fully received  Retrying no 10  peer shut down connection before response was fully received  Retrying no 11  peer shut down connection before response was fully received  Retrying no 12  peer shut down connection before response was fully received  Retrying no 13  peer shut down connection before response was fully received  Retrying no 14  peer shut down connection before response was fully received  Retrying no 15  peer shut down connection before response was fully received  Retrying no 16  peer shut down connection before response was fully received  Retrying no 17  peer shut down connection before response was fully received  Retrying no 18  peer shut down connection before response was fully received  Retrying no 19  peer shut down connection before response was fully received", 1000
20180317:08:52:04:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Marking failed /data1/primary/gpseg2, Start failed; check segment logfile.  "peer shut down connection before response was fully received  Retrying no 1  peer shut down connection before response was fully received  Retrying no 2  peer shut down connection before response was fully received  Retrying no 3  peer shut down connection before response was fully received  Retrying no 4  peer shut down connection before response was fully received  Retrying no 5  peer shut down connection before response was fully received  Retrying no 6  peer shut down connection before response was fully received  Retrying no 7  peer shut down connection before response was fully received  Retrying no 8  peer shut down connection before response was fully received  Retrying no 9  peer shut down connection before response was fully received  Retrying no 10  peer shut down connection before response was fully received  Retrying no 11  peer shut down connection before response was fully received  Retrying no 12  peer shut down connection before response was fully received  Retrying no 13  peer shut down connection before response was fully received  Retrying no 14  peer shut down connection before response was fully received  Retrying no 15  peer shut down connection before response was fully received  Retrying no 16  peer shut down connection before response was fully received  Retrying no 17  peer shut down connection before response was fully received  Retrying no 18  peer shut down connection before response was fully received  Retrying no 19  peer shut down connection before response was fully received", 1000
20180317:08:52:04:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Stopping segment /data2/primary/gpseg3, 40001 because of failure sending transition
20180317:08:52:05:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Stop of segment succeeded
20180317:08:52:05:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Stopping segment /data1/primary/gpseg2, 40000 because of failure sending transition
20180317:08:52:06:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Stop of segment succeeded
20180317:08:52:06:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Checking segment postmasters... (must_be_running True)
20180317:08:52:06:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Postmaster /data1/mirror/gpseg0 is running (pid 87084)
20180317:08:52:06:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Postmaster /data2/mirror/gpseg1 is running (pid 87085)
20180317:08:52:06:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Validating segment locales...
20180317:08:52:06:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Checking segment postmasters... (must_be_running True)
20180317:08:52:06:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Postmaster /data1/mirror/gpseg0 is running (pid 87084)
20180317:08:52:06:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-Postmaster /data2/mirror/gpseg1 is running (pid 87085)
20180317:08:52:06:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-

COMMAND RESULTS
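The useful detail in this log is easy to miss: each "Stopping segment" line names the data directory and the port of a segment that failed to start. As a minimal sketch (the helper name `failed_segments` and the regular expression are illustrative assumptions based only on the log lines shown above, not part of any GP tool), those pairs can be pulled out like this:

```python
import re

# Matches lines of the form seen in the log above, e.g.
#   ...-[INFO]:-Stopping segment /data2/primary/gpseg3, 40001 because of failure sending transition
# The pattern is an assumption derived from this one log excerpt.
STOP_RE = re.compile(
    r"Stopping segment (?P<datadir>\S+), (?P<port>\d+) because of failure"
)


def failed_segments(log_text):
    """Return (data_directory, port) pairs for segments stopped after a failed start."""
    return [
        (m.group("datadir"), int(m.group("port")))
        for m in STOP_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "20180317:08:52:04:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-"
        "Stopping segment /data2/primary/gpseg3, 40001 because of failure sending transition\n"
        "20180317:08:52:05:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-"
        "Stop of segment succeeded\n"
        "20180317:08:52:05:087050 gpsegstart.py_host77:gpadmin:host77:gpadmin-[INFO]:-"
        "Stopping segment /data1/primary/gpseg2, 40000 because of failure sending transition\n"
    )
    for datadir, port in failed_segments(sample):
        print(datadir, port)
```

Knowing the affected ports (40000 and 40001 here) narrows the later checks to a specific resource on a specific host.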

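2. Troubleshooting

Several reinstalls all got stuck at this point and initialization kept failing. Only after learning from a colleague that the machines were not fresh installs did I change approach and start checking the environment.

1) Checked disk space (df) and memory (free -g), and found that free memory on one machine was close to 0.

2) After freeing it up, reinstalling still failed. Only then did I look at the GP database's own internal logs, which reported a port failure.

Use netstat -anp | grep <port> to check whether the port is already in use. If it is, use ps -aux | grep <PID> to identify the owning process (plain ps aux avoids the syntax warning shown below), then kill that process.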

[root@host77 gpAdminLogs]# netstat -anp|grep  40000
tcp        0      0 0.0.0.0:40000               0.0.0.0:*                   LISTEN      8330/emsent         
[root@host77 gpAdminLogs]# netstat -anp|grep  40001
tcp        0      0 0.0.0.0:40001               0.0.0.0:*                   LISTEN      8330/emsent         
[root@host77 gpAdminLogs]# ps -aux|grep 8330
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
root      8330  0.0  0.5 567784 137140 ?       Ssl  Feb05  54:45 /ubas/ZXUN-UBAS/server/emsent/bin/emsent
root     53626  0.0  0.0   6392   724 pts/5    S+   09:16   0:00 grep 8330

[root@host77 gpAdminLogs]# kill -9 8330

3) Reinstalled and retested; the installation succeeded.
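The port check in step 2 can also be scripted before running the initialization. The following is a minimal sketch, not part of the GP tooling: `port_is_free` is a hypothetical helper that simply tries to bind a TCP socket to the port, and the ports 40000 and 40001 are the ones taken from the log above.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now.

    SO_REUSEADDR is deliberately not set, so the answer is conservative:
    a port with lingering connections may also be reported as busy.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # Segment ports that appeared in the failing log; adjust for your cluster.
    for port in (40000, 40001):
        state = "free" if port_is_free(port) else "IN USE"
        print(f"port {port}: {state}")
```

This only reports whether the port is taken. To find out which process holds it, netstat -anp (run as root) is still needed, as in the transcript above.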

3. Lessons learned

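When a problem appears and it is not obviously a logic error in your own script, rule out environmental factors first, such as disk space, memory usage, and port conflicts. Only once the environment is known to be clean should you analyze the surrounding logic. Also, for a mature product, the system's own logs are an important diagnostic tool and should be used fully during troubleshooting.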
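As one way to make the "environment first" habit routine, the disk and memory checks can be gathered into a small script to run on every host before initialization. This is a sketch under stated assumptions: the function names are hypothetical, /data1 and /data2 are the data directories seen in the log, and memory figures are read from the Linux /proc/meminfo file.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_memory_kb(meminfo):
    """Estimate memory available to new processes, in kB."""
    # MemAvailable is missing on older kernels; fall back to a rough estimate.
    if "MemAvailable" in meminfo:
        return meminfo["MemAvailable"]
    return (
        meminfo.get("MemFree", 0)
        + meminfo.get("Buffers", 0)
        + meminfo.get("Cached", 0)
    )


def free_disk_fraction(path):
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total if usage.total else 0.0


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    print("available memory (kB):", available_memory_kb(mem))

    # Data directories taken from the log above; adjust for your layout.
    for path in ("/data1", "/data2"):
        try:
            print(f"{path}: {free_disk_fraction(path):.1%} free")
        except OSError as exc:
            print(f"{path}: cannot check ({exc})")
```

What counts as "too low" depends on the cluster size and configuration, so the script only prints the numbers and leaves the judgment to the operator.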


Reposted from blog.csdn.net/lanyue1/article/details/79591989