记一次Java Rest Service Hang住的经历

背景:
在前后端分开开发的过程中,前段的页面要调用后台Java Rest 服务。一开始,我们的服务是正常的,可是跑了一段时间之后,发现,后台的service没有接受到来自前端的请求,而且当我们直接去请求Rest的时候,发现服务端根本就不能处理这个请求,这个请求就一直hang在哪里,转圈圈。

分析
因为之前也没有分析定位过这中问题。所以有点不知所措。上网查,都说要dump内存,看服务器cpu,内存消耗情况。具体怎么操作也没说清楚。于是联系了一下前同事,给出了一些可以直接操作的命令。
命令一: jstack(查看线程的信息)。
于是用 jstack pid看了一下线程使用情况:
...
"Finalizer" daemon prio=10 tid=0x00007fdb28072800 nid=0x2ca4 in Object.wait() [0x00007fdb2454f000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000746496810> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
        - locked <0x0000000746496810> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" daemon prio=10 tid=0x00007fdb28070800 nid=0x2ca3 in Object.wait() [0x00007fdb0e04e000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x0000000746496468> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:503)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
        - locked <0x0000000746496468> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007fdb2800d000 nid=0x2c9d waiting on condition [0x00007fdb30474000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at com.statestreetgx.fcm.core.app.FcmCoreApp.main(FcmCoreApp.java:97)

"VM Thread" prio=10 tid=0x00007fdb2806c800 nid=0x2ca2 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007fdb28022800 nid=0x2c9e runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007fdb28024800 nid=0x2c9f runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007fdb28026000 nid=0x2ca0 runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007fdb28028000 nid=0x2ca1 runnable

"VM Periodic Task Thread" prio=10 tid=0x00007fdb285b6800 nid=0x2cae waiting on condition

JNI global references: 412

看到这里的第一刹那,还以为是抛异常了,这还了得。于是去看了tomcat进程的堆栈信息。也是一样的,如下:
...
"Service Thread" daemon prio=10 tid=0x00007f97a4098000 nid=0x1dd8 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f97a4095800 nid=0x1dd7 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f97a4092800 nid=0x1dd6 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f97a4090800 nid=0x1dd5 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f97a406f800 nid=0x1dd4 in Object.wait() [0x00007f978f46f000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
        - locked <0x00000007008f4fe0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" daemon prio=10 tid=0x00007f97a406d800 nid=0x1dd3 in Object.wait() [0x00007f978f570000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:503)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
        - locked <0x00000007008f4a30> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00007f97a4009800 nid=0x1dcd runnable [0x00007f97a9fac000]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
        at java.net.ServerSocket.implAccept(ServerSocket.java:530)
        at java.net.ServerSocket.accept(ServerSocket.java:498)
        at org.apache.catalina.core.StandardServer.await(StandardServer.java:451)
        at org.apache.catalina.startup.Catalina.await(Catalina.java:777)
        at org.apache.catalina.startup.Catalina.start(Catalina.java:723)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:321)
        at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:455)

"VM Thread" prio=10 tid=0x00007f97a4069000 nid=0x1dd2 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f97a401f800 nid=0x1dce runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f97a4021000 nid=0x1dcf runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f97a4023000 nid=0x1dd0 runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f97a4025000 nid=0x1dd1 runnable

"VM Periodic Task Thread" prio=10 tid=0x00007f97a40aa800 nid=0x1dd9 waiting on condition

JNI global references: 411


因为tomcat是好的,所以也就不在怀疑是这里进程里有异常了。

有一种检查该进程是不是有问题,可以这样做。
1.先ps找到你的进程的PID。
2.然后把PID转换成16进制。Linux: $printf "%x\n" PID; 就能专转成16进制了。
3.到刚才的查出来的堆栈信息里面grep 第二步转出来的16进制,如果能找到,并且在此处blocked住,那么就说明java应用是有问题的。没有就应该不用到这里找了,尝试去别的地方找找看原因吧。

命令二:jmap(查看内存信息)
jstack并没有能帮助到我,我还是不知道怎么办,这个时候我去找一个同事聊了一下,同事也对java内存略懂,于是一直说jmap可以查看内存信息,先看看进程的内存使用情况再说。于是就用上了jmap。jmap不查不知道,一查吓一跳啊。
-bash-4.1$ /server/share/java/bin/jmap -heap 11417
Attaching to process ID 11417, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.79-b02

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 0
   MaxHeapFreeRatio = 100
   MaxHeapSize      = 3124756480 (2980.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 85983232 (82.0MB)
   G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 777519104 (741.5MB)
   used     = 228470440 (217.88639068603516MB)
   free     = 549048664 (523.6136093139648MB)
   29.38454358543967% used
From Space:
   capacity = 42991616 (41.0MB)
   used     = 42979216 (40.98817443847656MB)
   free     = 12400 (0.0118255615234375MB)
   99.971157167016% used
To Space:
   capacity = 54001664 (51.5MB)
   used     = 0 (0.0MB)
   free     = 54001664 (51.5MB)
   0.0% used
PS Old Generation
   capacity = 130023424 (124.0MB)
   used     = 63940096 (60.97802734375MB)
   free     = 66083328 (63.02197265625MB)
   49.17582850302419% used
PS Perm Generation
   capacity = 71303168 (68.0MB)
   used     = 71057992 (67.76618194580078MB)
   free     = 245176 (0.23381805419921875MB)
   99.65614992029526% used

32711 interned Strings occupying 3615504 bytes.



这回我终于相信自己是找到问题的根本原因了。那就是持久代PG不足,导致的。当时根本就没有去想为什么。脑子里就知道内存不足肯定会出问题的,都99%啦。

我把内存使用情况截给我同事看,我同事呵呵一笑,说这个不能说明问题,要看FGC。 这个时候,我才想起来,java只有在FGC的时候,才对所有的请求都是没有反应的。于是赶紧去找了一下命令。
命令三:jstat
(gc信息查看)

赶紧jstat了一把。
-bash-4.1$ jstat -gc 10626 1000


当前很显然一直在FGC, 能怪一直没有反应。
但是原因,导致FGC的原因是什么呢,还是没有找到。
不管了先把
-XX:PermSize=50m -XX:MaxPermSize=100m
设大一点试试再说吧。
kill 掉进程,再重启。再观察一段时间吧。

这里还有一个不错的文章:
http://ifeve.com/find-bug-online/

--EOF--

猜你喜欢

转载自xfxlch.iteye.com/blog/2267278