Troubleshooting a Service Unavailable incident

Foreword

On Thursday the server suddenly hung. SSH couldn't connect, and the mini program (roughly 3K daily visits) went down along with its backend. So the troubleshooting began.

Investigation

Before anything else, the first priority was getting the service back up. I restarted the Alibaba Cloud server, connected over SSH, and brought up nginx, redis, mysql, and the Java services. After that series of operations, the service was running again.
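For reference, the recovery sequence looked roughly like this; it is only a sketch, assuming systemd-managed services, and the unit names and jar paths are placeholders rather than the actual ones on the box:

```bash
# Reboot the ECS instance from the Alibaba Cloud console, then reconnect
ssh root@<server-ip>

# Bring the core services back up (assumes systemd units with these names)
systemctl start nginx
systemctl start redis
systemctl start mysqld

# Restart the Java services; the jar paths/names are placeholders
nohup java -jar /srv/osc.jar  > /var/log/osc.log  2>&1 &
nohup java -jar /srv/sign.jar > /var/log/sign.log 2>&1 &
```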

The server has CloudMonitor (Alibaba Cloud's monitoring agent) installed. I highly recommend it: being able to look back at CPU and memory usage is very helpful when troubleshooting.

Here are the CPU, memory, and network charts:

[CloudMonitor charts: CPU usage, memory usage, and network inflow/outflow rate around the time of the incident]

Start with the last chart: you can clearly see that around 9:30 the network inflow and outflow rates shot up. So the initial conclusion was that the CPU and memory spikes were probably related to the network.
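When the monitoring console isn't handy, the same metrics can be spot-checked from a shell; a minimal sketch (the `sar` command needs the sysstat package installed):

```bash
# Point-in-time CPU load and top processes
top -bn1 | head -5

# Memory usage in MB
free -m

# Network inflow/outflow per interface, sampled 3 times at 1-second intervals
sar -n DEV 1 3
```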

Around 9:30, the server was running a few network-related applications:

  • mysql
  • redis
  • nginx
  • Java services (osc, sign, etc.)
  • docker

The first three are everyday applications that basically never cause problems, so they were ruled out first. That leaves the Java services and Docker.

My first thought was that the Java services had seen a sudden traffic spike, exhausted the server's resources, and brought everything down. But the Java service logs showed nothing unusual around 9:30; traffic was about the same as usual. So that was ruled out too.
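The check itself was just eyeballing request volume in the logs around the incident window; something along these lines, where the log path and timestamp layout are assumptions about the actual Java access logs:

```bash
# Requests logged in the 09:20-09:39 window vs. the same window an hour earlier
grep -Ec ' 09:(2|3)[0-9]:' /var/log/java/access.log
grep -Ec ' 08:(2|3)[0-9]:' /var/log/java/access.log
```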

That left the Docker service.

For Docker, I have a daily scheduled task that spins up a container to automatically work through the daily quiz questions. It normally starts a bit after nine in the morning and is stopped around 11:00. So the next step was to check the Docker logs:

[Docker container log: the job starts at 9:23 and begins answering questions]

You can clearly see that at 9:23 the Docker container started and began working through the questions. So the conclusion was that Docker was to blame. I killed the container and disabled the scheduled task, and there have been no problems since.
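The scheduled task and the mitigation were roughly the following; the container name, image, and exact times here are illustrative, not the real configuration:

```bash
# crontab -l (illustrative): start the quiz container after 9, remove it at 11
5 9 * * *  docker run -d --name xuexi <quiz-image>
0 11 * * * docker rm -f xuexi

# Mitigation on the day of the incident: kill the running container,
# then comment out the two cron lines above (crontab -e)
docker kill xuexi
```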

 

Reproducing the issue

To confirm that Docker really was the problem, I re-enabled the scheduled Docker task a few days later. The server resources looked like this:

[CloudMonitor memory chart: usage climbs after the Docker task starts, eventually approaching 95%]
The problem reproduced, so it is clear this was Docker's fault. As for why starting this Docker service makes memory and CPU soar until the server goes down outright, that is a question for the author of the image.

My initial guess is a memory leak.

Look closely at the memory chart. Around 9:30 the Docker service starts and memory rises to about 70%, which is perfectly reasonable.

At around 10:00 the task finished (confirmed by checking the logs), but the service did not stop.

From 10:00 onward, memory climbed steadily, all the way to 95%, until I finally killed the Docker container and memory returned to normal.

This looks like a clear memory leak, but the image is someone else's and isn't open source, so there is no way to dig into the code; not much more can be done on my side.
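With no access to the image's code, one defensive option (my own suggestion, not part of the original setup) is to cap the container's memory so a leak takes down the container instead of the whole host:

```bash
# Hard memory cap: once the leak reaches 512 MB the container is OOM-killed
# and the rest of the server keeps running (image name is a placeholder)
docker run -d --name xuexi --memory=512m --memory-swap=512m <quiz-image>

# Watch per-container CPU/memory while the job runs
docker stats --no-stream
```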

I have, however, filed an issue against the repository: https://github.com/fuck-xuexiqiangguo/docker/issues/20

 

Summary

A few takeaways from this outage:

  • Logs matter. Whatever the service, always put logging first.
  • Servers must have monitoring, and alerting on top of it: when a metric crosses a threshold, send an SMS or make a phone call (a minimal alerting sketch follows this list).
  • Troubleshooting should be methodical and evidence-based, one step at a time, not blind guessing.
  • When a service goes down, restore it first, for example by restarting things.
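For the monitoring point above, CloudMonitor's built-in alarm rules are the proper way to get SMS or phone alerts, but even a tiny cron script covers the basics; a sketch in which the threshold and webhook URL are placeholders:

```bash
#!/usr/bin/env bash
# mem-alert.sh - run from cron (e.g. "* * * * * /usr/local/bin/mem-alert.sh")
# and fire a notification when memory usage crosses the threshold.
THRESHOLD=90
WEBHOOK="https://example.com/notify"   # placeholder: SMS gateway / chat webhook

used=$(free | awk '/Mem:/ {printf "%d", $3 / $2 * 100}')
if [ "$used" -ge "$THRESHOLD" ]; then
  curl -s -X POST --data "Memory at ${used}% on $(hostname)" "$WEBHOOK"
fi
```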

This outage had no real impact, since there aren't many users yet, but I feel my sense of how to track down problems has deepened. Business drives technology; there is no doubt about that.

And once a product is live, problems will keep surfacing over time; solving them one by one, you learn a lot.

 

 
