Trafodion Troubleshooting: Server process tdm_arkesp could not be created on \NSK cpu 0

Copyright notice: This is an original article by the blog author. If you republish it, please credit the source. https://blog.csdn.net/Post_Yuan/article/details/82149679

Symptom

When SQL runs a parallel scan, it cannot start the ESP processes and fails with the following errors:

2018-08-11 16:39:20,486, ERROR, SQL.EXE, Node Number: 0, CPU: 0, PIN: 9467, Process Name: $Z0007QH, SQLCODE: 2012, QID: MXID11000009467212400779772986156000000000306U3333308T150000000_11_STMT1, *** ERROR[2012] Server process tdm_arkesp could not be created on \NSK cpu 0 - Operating system error 4022, TPCError = 31, error detail = 0.  (See variants of Seabed procedure msg_mon_start_process for details).
2018-08-11 16:39:20,487, ERROR, SQL.EXE, Node Number: 0, CPU: 0, PIN: 9467, Process Name: $Z0007QH, SQLCODE: 2013, QID: MXID11000009467212400779772986156000000000306U3333308T150000000_11_STMT1, *** ERROR[2013] Server process tdm_arkesp could not be created on \NSK cpu 0 - Operating system error 4022.

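Analysis

The error message points to cpu 0, that is, the first node, as the place where ESP processes fail to start. Checking ESP process status with sqps shows more than 2000 ESP processes on the system, and almost all of them list "NONE" as their parent process, as the output below shows.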

[trafodion@datanode-1 logs]$ sqps | grep esp | wc -l
2021

...
[$Z000G8K] 000,00014979 001 GEN  ES--U-- $Z000C7Z    NONE        tdm_arkesp     
[$Z000G8K] 000,00015025 001 GEN  ES--U-- $Z000C9A    NONE        tdm_arkesp     
[$Z000G8K] 000,00015467 001 GEN  ES--U-- $Z000CLX    NONE        tdm_arkesp     
[$Z000G8K] 000,00015905 001 GEN  ES--U-- $Z000CZF    NONE        tdm_arkesp     
[$Z000G8K] 000,00016080 001 GEN  ES--U-- $Z000D4F    NONE        tdm_arkesp     
[$Z000G8K] 000,00016241 001 GEN  ES--U-- $Z000D91    NONE        tdm_arkesp     
[$Z000G8K] 000,00017532 001 GEN  ES--U-- $Z000EAX    NONE        tdm_arkesp     
[$Z000G8K] 000,00017601 001 GEN  ES--U-- $Z000ECW    NONE        tdm_arkesp     
[$Z000G8K] 000,00018405 001 GEN  ES--U-- $Z000F0V    NONE        tdm_arkesp
...
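To move from the sqps listing to OS-level process IDs, the PID can be pulled out of each line. The following is a minimal sketch, not part of the original procedure. It assumes the column layout seen in the sample lines above: the second field is "node,pid", the last field is the program name, and the field before it is the parent name. The function name orphan_esp_pids is hypothetical.

```shell
# Print the OS PID of every tdm_arkesp process whose parent is NONE.
# Reads sqps output on stdin. Column positions are an assumption based
# on the sample lines in this article, not on sqps documentation.
orphan_esp_pids() {
  awk '$NF == "tdm_arkesp" && $(NF-1) == "NONE" {
         split($2, a, ",")   # "000,00018405" -> a[1]="000", a[2]="00018405"
         print a[2] + 0      # numeric conversion strips the leading zeros
       }'
}
```

It could be used as `sqps | orphan_esp_pids | head`, and any PID it prints can then be passed to ps.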

Pick one specific ESP process ID and look it up with ps. These ESP processes turn out to be in the defunct state, and they all share the same parent process ID, 30037.

[trafodion@datanode-1 logs]$ ps -ef | grep 18405
trafodi+ 18405 30037  0 03:50 ?        00:00:27 [tdm_arkesp] <defunct>
trafodi+ 22279 19210  0 17:32 pts/0    00:00:00 grep --color=auto 18405
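Checking processes one at a time does not scale to 2000 entries. As a rough sketch, the defunct ESP processes can be grouped by parent PID in one pass. The function name count_zombie_esps_by_parent is hypothetical; it expects lines of "PID PPID STAT COMM" on stdin and relies on ps reporting zombies with a state starting with Z.

```shell
# Count defunct (state Z) tdm_arkesp processes per parent PID.
# Input: lines of "PID PPID STAT COMM", for example from
#   ps -e -o pid= -o ppid= -o stat= -o comm=
# Output: "PPID COUNT", largest count first.
count_zombie_esps_by_parent() {
  awk '$3 ~ /^Z/ && $4 == "tdm_arkesp" { n[$2]++ }
       END { for (p in n) print p, n[p] }' | sort -k2,2nr
}
```

Run as `ps -e -o pid= -o ppid= -o stat= -o comm= | count_zombie_esps_by_parent`. If a single parent PID accounts for nearly all of the defunct processes, that parent is the one to investigate.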

Resolution

Manually kill the parent process 30037, then check the number of ESP processes again. The count drops to 0.

kill -9 30037

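Process 30037 is the monitor process. A possible cause of this problem is that the monitor restarted abnormally and the original monitor process did not exit cleanly, so the ESP processes associated with it became zombies.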
