The day Java threads blew out the server's memory: a simple JVM troubleshooting walkthrough

I'm Kite, author of the public account "Kite in the Ancient Times", a technical account that is about more than just technology. I'm a slash developer who has been around the programming world for many years, working mainly in Java and also in Python and React. My Spring Cloud article series is complete, and the full series is available on my GitHub. You can also reply "pdf" in the public account to get the carefully prepared PDF version of the complete tutorial.

While we were chatting over lunch that day, a colleague said: "The people on that project team are driving me crazy. Their program has a problem. I @-mentioned them in the group chat this morning and they didn't reply until noon, and then they even claimed their program was fine and that we were just calling it too often. I had to laugh."

Generally speaking, when an integration goes wrong and the error isn't obvious, I first suspect my own side, to avoid embarrassment later. So I said: let's go back after lunch and I'll help you figure out where the problem is.

Background note

Our system integrates with many third-party systems, and the problem was in one of them. The setup is simple: their system generates personal to-do tasks, and the number of pending tasks has to be pushed to our app, where it is shown as a badge on the icon.

The user data side has already been taken care of. The requirement really is simple, and the badge doesn't need to be real-time; refreshing it every 10 minutes is enough. This is a textbook case for a message queue: they push the data onto the queue, we consume it from the queue. Perfect.

That's not how it went, though. They said their system is a packaged product and doesn't support message queues; all they could do was expose a to-do task interface. Fine (smiley face), you're the product, so you're right. Not many users have pending tasks, a little over 300, so that means 300-odd requests every 10 minutes. No need for multithreading: just loop through the 300-odd requests one by one, and each run takes about 1 minute.

Fine, so be it.

By the way, that service runs on JDK 1.6. Reportedly, for historical reasons, nobody dares to upgrade it. On top of that, the service has to be deployed on Windows. (Magical, isn't it?)

Blooming

So that's what we did: a scheduled task that makes 300 requests every 10 minutes. Simple and worry-free.
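The calling side is roughly the following shape. This is only a sketch: the TodoClient interface, fetchTodoCount and the user IDs are hypothetical stand-ins, since the real to-do interface is not shown in this post.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class TodoBadgePoller {

    /** Hypothetical stand-in for the third party's to-do task interface. */
    interface TodoClient {
        int fetchTodoCount(String userId);
    }

    /** One pass: request each user's pending-task count in turn, on the calling thread. */
    static Map<String, Integer> pollOnce(List<String> userIds, TodoClient client) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String userId : userIds) {
            counts.put(userId, client.fetchTodoCount(userId));
        }
        return counts;
    }

    /**
     * Repeats the pass every 10 minutes on a single scheduler thread.
     * A real task should catch its own exceptions, because an uncaught one
     * cancels all later runs of a scheduled task.
     */
    static ScheduledExecutorService schedule(List<String> userIds, TodoClient client) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> pollOnce(userIds, client), 0, 10, TimeUnit.MINUTES);
        return scheduler;
    }

    public static void main(String[] args) {
        List<String> userIds = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            userIds.add("user-" + i);
        }
        // Fake client so the sketch runs without a network; it always reports 3 tasks.
        Map<String, Integer> counts = pollOnce(userIds, userId -> 3);
        System.out.println("polled " + counts.size() + " users");
    }
}
```

The point of the sketch is that the caller uses one thread and makes the requests strictly one after another, so the load on the other side is about as gentle as polling can get.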

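But the good times didn't last. Heaven doesn't bend to people's wishes, and servers don't bend to programmers'.

What follows is my colleague's experience, which I'll retell here.

On the second night after the scheduled task went live, on what should have been an ordinary night, an alert email disturbed his sleep. The logs showed that memory usage had climbed too high and run out, and the machine had rebooted itself. That's the one nice thing about Windows: it restarts on its own (awkward face). He logged in, started the service manually, and that was that.

A day later, again at night, the alert fired again. The server had rebooted itself again, again because memory usage was too high. He started the service by hand once more.

So he reported it to the developers of that service, and the reply was: "Our service is fine. It must be a problem with your calls. Stop your scheduled task and it will definitely go away, so it's your problem."

So he came to me, explained the situation, and asked what the problem might be.

Me: Are you sure the scheduled task runs once every 10 minutes and isn't stuck in an infinite loop?

Colleague: I'm sure.

Me: Does their service use an external cache like redis?

Colleague: No idea.

Me: ... If you're sure your calls are fine, then it has to be their program that is eating all the memory. What is there to doubt? Tell them to fix it.

Colleague: Right now they insist there's nothing wrong on their side.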

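Finding the real culprit

Fine. Since they said nothing was wrong, I'd help him dig the problem out. I logged into that Windows server remotely.

By then the scheduled task had been running for two days, and more than 15 GB of the 16 GB of memory was in use; the machine looked ready to crash at any moment. Stopping the scheduled task did not bring memory usage back down.

I first suspected an external cache such as redis, but a look around the server showed that neither redis nor memcached nor anything like them was even installed, so external caching was ruled out. (Checking with the JVM tools afterwards confirmed this.)

If it wasn't an external cache, the problem had to be in the JVM: either an in-JVM cache or a memory leak of some kind. I wanted to check the JVM startup parameters with jinfo -flags, but it turned out that JDK 6 did not support -flags yet.

I no longer remember whether I actually tried jmap -heap or just glanced at jmap -help and assumed jmap -heap wasn't supported; either way, I ended up observing the JVM with jconsole. The JVM parameters were clearly the defaults with nothing specially configured, and, oddly, the heap was using only a bit over 700 MB in total. 700 MB versus 15 GB: where did the rest go? It made no sense. The problem was not in the heap.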
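jconsole gets its figures from the JVM's management beans, so the same heap-versus-threads comparison can be sketched from inside a Java process with java.lang.management. This is only an illustration that inspects the JVM it runs in, not the tooling used on that server, and the class and method names are made up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.List;

final class JvmSnapshot {

    /** Arguments the JVM was started with, e.g. -Xmx or -Xss if any were set. */
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** Bytes of heap currently in use. */
    static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** Threads alive right now, daemon and non-daemon. */
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** Threads started since the JVM came up, including ones that have exited. */
    static long totalStartedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getTotalStartedThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("JVM arguments : " + jvmArguments());
        System.out.println("heap used (MB): " + heapUsedBytes() / (1024 * 1024));
        System.out.println("live threads  : " + liveThreads());
        System.out.println("started total : " + totalStartedThreads());
    }
}
```

A modest heap figure next to a live thread count that keeps climbing is the pattern to look for: it points away from the heap and towards threads that are never finishing.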

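Next I triggered a GC, but it made no difference at all. At this point I strongly suspected a memory leak.

So I ran jmap -dump to dump the heap and thread information, then pulled the dump to my local machine for analysis. I was in for a shock: the number of threads was suffocating.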
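When thread names end in a counter, grouping them by prefix makes a runaway family of threads stand out immediately. The sketch below does that for the threads of the JVM it runs in, using Thread.getAllStackTraces(). It is an in-process alternative shown for illustration, not the offline dump analysis used here, and the class and method names are made up.

```java
import java.util.Map;
import java.util.TreeMap;

final class ThreadNameHistogram {

    /** Strips a trailing counter such as "-12" or "#7" from a thread name. */
    static String prefixOf(String threadName) {
        return threadName.replaceFirst("[-_#]?\\d+$", "");
    }

    /** Counts threads per name prefix, sorted by prefix. */
    static Map<String, Integer> countByPrefix(Iterable<Thread> threads) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread thread : threads) {
            counts.merge(prefixOf(thread.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The key set of getAllStackTraces() is every live thread in this JVM.
        Map<String, Integer> counts = countByPrefix(Thread.getAllStackTraces().keySet());
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```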

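I have to say, they did one thing very well: they had thoughtfully numbered their threads. And yes, there really were that many, more than 100,000. So we did the math. At 300 requests every 10 minutes, that's 300 threads per run; per hour it's 300 x 6 = 1800; per 24-hour day it's 1800 x 24 = 43200. After a bit more than two days, 100,000-plus threads matches exactly. Impressive, in its own way.

A thread reserves 1 MB of stack by default. The full megabyte is not necessarily committed for every thread (100,000 full stacks would be close to 100 GB), but at even a little over 100 KB actually in use per thread, 100,000-plus threads come to more than 10 GB. Add the heap and the memory used by other services on the machine, and usage soaring to 15 GB adds up.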
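These back-of-the-envelope numbers can be checked in a few lines. The 128 KB per-thread figure below is an illustrative assumption, not a measurement from that server; a full 1 MB for each of 100,000 threads would come to roughly 98 GiB, far more than the machine has.

```java
final class ThreadLeakEstimate {

    /** Threads left behind if every request leaks exactly one thread. */
    static long leakedThreads(int requestsPerRun, int minutesBetweenRuns, int hours) {
        long runsPerHour = 60 / minutesBetweenRuns;
        return (long) requestsPerRun * runsPerHour * hours;
    }

    /** Total memory in GiB for the given thread count and per-thread footprint. */
    static double toGiB(long threads, long bytesPerThread) {
        return threads * (double) bytesPerThread / (1L << 30);
    }

    public static void main(String[] args) {
        System.out.println("per hour     : " + leakedThreads(300, 10, 1));   // 1800
        System.out.println("per day      : " + leakedThreads(300, 10, 24));  // 43200
        System.out.println("after 56 h   : " + leakedThreads(300, 10, 56));  // 100800

        long threads = 100_000L;
        // Upper bound: every thread holding its full 1 MiB stack.
        System.out.println("at 1 MiB each: " + toGiB(threads, 1L << 20) + " GiB");
        // Assumed footprint of 128 KiB per thread, for illustration only.
        System.out.println("at 128 KiB   : " + toGiB(threads, 128L * 1024) + " GiB");
    }
}
```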

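Whoever owns the problem fixes it

Is it really that hard to look for a problem when there is one? What goes through someone's mind when they refuse to admit their program has a bug?

Fine. You won't investigate, so I found the cause for you. Happy now?

So my colleague, fully in the right, sent them the screenshot of the thread list without adding a single word.

That afternoon a message from the other side arrived in the WeChat group: the problem has been fixed, please try again.

Many days have passed since then, and the problem has not come back.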

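Avoiding the problem

Some readers have asked whether a system can really create more than 100,000 threads. It can. This article by the expert 「你假笨」 is a source-level analysis of how many threads can be created on Linux: club.perfma.com/article/244…. Take a look if you're interested.

The cause here was that threads were created but never destroyed; presumably there was something wrong with the cleanup logic.

Logic errors aside, the right way to use threads is through a thread pool. That avoids unnecessary performance overhead and the kind of endless thread creation that comes from uncontrolled creation and late cleanup.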
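As a minimal sketch of that advice, the following runs 300 small tasks on a fixed pool of 4 threads and records which threads did the work. The pool size, task count and class name are arbitrary choices for illustration; how the third-party service actually handled its requests is not known.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class BoundedWorkers {

    /**
     * Runs taskCount tasks on a pool of poolSize threads and returns the
     * names of the worker threads that executed them.
     */
    static Set<String> runOnPool(int poolSize, int taskCount) {
        Set<String> workerNames = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < taskCount; i++) {
            // Each task is queued and picked up by one of the existing workers.
            pool.execute(() -> workerNames.add(Thread.currentThread().getName()));
        }
        // Stop accepting work and let the workers exit once the queue drains.
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("pool did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pool", e);
        }
        return workerNames;
    }

    public static void main(String[] args) {
        Set<String> workers = runOnPool(4, 300);
        System.out.println("300 tasks ran on " + workers.size() + " thread(s): " + workers);
    }
}
```

However many tasks are submitted, the number of threads never exceeds the pool size, and shutdown() gives them a defined point at which to go away. With a new thread per request and no cleanup, neither of those holds.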

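Writing takes effort. A small like brings a lot of warmth, so come warm me up. Don't be shy, give me a like!

I'm Kite, of the public account 「古时的风筝」 (Kite in the Ancient Times), a slash developer who has been around the programming world for many years, with Java as my main job and Python and React as well. You can add me as a friend through the public account and join the group to exchange ideas with other members; plenty of people from big tech companies are in there too.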

Origin juejin.im/post/5ea0f2a2f265da47aa3f7b0f