[Thread dumps in detail: jstack and Full GC]

I. The jstack command

Thread states in a jstack dump file
In a dump file, the thread states worth noting are:
  1. Deadlock (focus)
  2. Runnable: executing
  3. Waiting on condition: waiting for a resource (focus)
  4. Waiting on monitor entry: waiting to acquire a monitor (focus)
  5. Suspended: paused
  6. Object.wait() or TIMED_WAITING: waiting on an object
  7. Blocked (focus)
  8. Parked: stopped

We start by analyzing a first example, then list the meaning of each thread state and the points to note, and finally add two more examples.

Example 1: waiting to lock and BLOCKED
The example is as follows:
"RMI TCP Connection(267865)-172.16.5.25" daemon prio=10 tid=0x00007fd508371000 nid=0x55ae waiting for monitor entry [0x00007fd4f8684000]
   java.lang.Thread.State:  BLOCKED (on object monitor)
at org.apache.log4j.Category.callAppenders(Category.java:201)
- waiting to lock <0x00000000acf4d0c0> (a org.apache.log4j.Logger)
at org.apache.log4j.Category.forcedLog(Category.java:388)
at org.apache.log4j.Category.log(Category.java:853)
at org.apache.commons.logging.impl.Log4JLogger.warn(Log4JLogger.java:234)
at com.tuan.core.common.lang.cache.remote.SpyMemcachedClient.get(SpyMemcachedClient.java:110)
……
1) The thread state is BLOCKED: the thread is blocked waiting to acquire an object monitor that another thread currently holds.
2) "waiting to lock <0x00000000acf4d0c0>" means that the thread is trying to obtain the lock on the object at address 0x00000000acf4d0c0.
3) Searching the dump for the string 0x00000000acf4d0c0 shows that a large number of threads are waiting for the lock at this address. If you can find the thread that holds the lock (look for "locked <0x00000000acf4d0c0>") in the log, you can follow up from there.
4) "waiting for monitor entry" means that the thread has asked to enter a critical section through synchronized (obj) {......} and has been placed in the "Entry Set" queue shown in Figure 1, but the monitor of obj is owned by another thread, so this thread waits in the Entry Set.
5) In the first line, "RMI TCP Connection(267865)-172.16.5.25" is the thread name, tid is the Java thread id, nid is the id of the native thread, and prio is the thread priority. [0x00007fd4f8684000] is the starting address of the thread stack.
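The header fields described in item 5 can be pulled apart mechanically. The following is a minimal sketch, assuming header lines shaped like the ones in this article; the class ThreadHeader and its regular expression are hypothetical helpers for illustration, not part of the JDK or of jstack. Newer JDKs add fields such as #29 and os_prio, which the pattern tolerates but does not capture.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits one jstack thread header line into its fields. */
final class ThreadHeader {
    // "name", optional #n, optional daemon, optional prio, optional os_prio, tid, nid, state, optional [address]
    private static final Pattern HEADER = Pattern.compile(
        "^\"(.+)\"(?: #\\d+)?( daemon)?(?: prio=(\\d+))?(?: os_prio=\\d+)?"
        + " tid=(0x[0-9a-f]+) nid=(0x[0-9a-f]+) (.*?)(?:\\s*\\[(0x[0-9a-f]+)\\])?$");

    final String name;
    final boolean daemon;
    final long nativeId;   // nid converted from hex to decimal, the form shown by top -Hp
    final String state;    // for example "waiting for monitor entry"

    private ThreadHeader(Matcher m) {
        name = m.group(1);
        daemon = m.group(2) != null;
        nativeId = Long.parseLong(m.group(5).substring(2), 16);
        state = m.group(6);
    }

    /** Returns the parsed header, or null if the line is not a thread header. */
    static ThreadHeader parse(String line) {
        Matcher m = HEADER.matcher(line);
        return m.matches() ? new ThreadHeader(m) : null;
    }

    public static void main(String[] args) {
        ThreadHeader h = parse("\"RMI TCP Connection(267865)-172.16.5.25\" daemon prio=10 "
            + "tid=0x00007fd508371000 nid=0x55ae waiting for monitor entry [0x00007fd4f8684000]");
        System.out.println(h.name + " | nid=" + h.nativeId + " | " + h.state);
    }
}
```

Converting nid to decimal is a deliberate choice: it is the form in which top -Hp reports thread ids, so the two outputs can be matched directly.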
 
Thread states in the dump file: meanings and points to note

The meanings are as follows:

  • Deadlock: deadlocked threads. Generally, several threads each hold resources the others need and wait for one another, so none of the resources can ever be released.
  • Runnable: generally means the thread is executing. The thread holds resources and is processing some request: it may be passing SQL to the database, operating on a file, or converting data types.
  • Waiting on condition: waiting for a resource, or waiting for some condition to occur. The specific cause has to be analyzed together with the stack trace.
    • If the stack clearly shows application code, the thread is waiting for a resource. Typically a large amount of some resource is being read and the resource is protected by a lock, so the thread enters the waiting state until it can read the resource.
    • Or the thread is waiting for other threads to run.
    • If a large number of threads are in "waiting on condition" and their stacks show that they are waiting on network reads or writes, this may be a sign of a network bottleneck: the threads cannot proceed because the network is blocked.
      • In one case the network is very busy and almost all bandwidth is consumed, yet a large amount of data is still waiting to be read or written.
      • In the other case the network is idle, but because of routing or similar problems packets do not arrive normally.
    • Another common cause of "waiting on condition" is that the thread is sleeping; it is woken when the sleep time elapses.
  • Blocked: the thread is blocked. During execution the thread has waited a long time for a resource it needs without obtaining it. In JVM terms, BLOCKED means the thread is waiting to acquire an object monitor that another thread holds.
  • Waiting for monitor entry and in Object.wait(): the monitor is Java's primary mechanism for mutual exclusion and cooperation between threads; it can be seen as the lock of an object or class. Every object has exactly one monitor. As Figure 1 below shows, at any moment a monitor can be owned by only one thread, the "Active Thread"; the other threads are "Waiting Threads" and wait in one of two queues, the "Entry Set" and the "Wait Set". A thread waiting in the "Entry Set" shows the state "waiting for monitor entry", and a thread waiting in the "Wait Set" shows "in Object.wait()".

http://images.cnblogs.com/cnblogs_com/zhengyun_ustc/255879/o_clipboard%20-%20%E5%89%AF%E6%9C%AC039.png

Figure 1 A Java Monitor
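The Entry Set side of Figure 1 can be reproduced in a few lines. The following is a minimal sketch, not code from the original example: the class EntrySetDemo and the thread names are made up for illustration. One thread keeps a monitor while a second thread tries to enter the same synchronized block, and the second thread's state is read with Thread.getState().

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical demo: a thread queued in a monitor's Entry Set reports BLOCKED. */
final class EntrySetDemo {
    static Thread.State observeContender() {
        final Object monitor = new Object();
        final CountDownLatch held = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        Thread owner = new Thread(() -> {
            synchronized (monitor) {              // owner becomes the Active Thread
                held.countDown();
                try {
                    release.await();              // keep the monitor until told to let go
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "owner");
        Thread contender = new Thread(() -> {
            synchronized (monitor) {              // queues in the Entry Set while owner holds the monitor
                // empty critical section, entering it is all this demo needs
            }
        }, "contender");

        try {
            owner.start();
            held.await();                         // from here on, owner holds the monitor
            contender.start();
            Thread.State observed = contender.getState();
            while (observed != Thread.State.BLOCKED) {
                Thread.sleep(10);                 // poll until contender reaches the Entry Set
                observed = contender.getState();
            }
            release.countDown();                  // owner exits, contender may now enter
            owner.join();
            contender.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(observeContender());   // state seen while the monitor was held
    }
}
```

Running jstack against the process while release has not yet been counted down would show the contender thread as "waiting for monitor entry" with a "waiting to lock" line, as in Example 1.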

 

Example 2: waiting on condition and TIMED_WAITING
The example is as follows:
"RMI TCP Connection(idle)" daemon prio=10 tid=0x00007fd50834e800 nid=0x56b2 waiting on condition [0x00007fd4f1a59000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00000000acd84de8> (a java.util.concurrent.SynchronousQueue$TransferStack)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:424)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:323)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:874)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:945)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
at java.lang.Thread.run(Thread.java:662)
1) "TIMED_WAITING (parking)": TIMED_WAITING means the thread is waiting with a time limit and leaves the waiting state automatically once that time elapses; parking means the thread is suspended.

2) "waiting on condition" has to be read together with the stack line "parking to wait for <0x00000000acd84de8> (a java.util.concurrent.SynchronousQueue$TransferStack)". First, this thread is certainly waiting for some condition to occur that will wake it up. Second, SynchronousQueue is not a queue in the usual sense but a hand-off mechanism between threads: an element can only be put into a SynchronousQueue when another thread is waiting to receive it, and that hand-off is the condition this thread is waiting for.
3) Nothing else can be read from this dump.
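The idle pool thread in Example 2 can be imitated directly. The following is a minimal sketch under the assumption that a timed poll on a SynchronousQueue is what parks the worker, as the stack above suggests; the class IdleWorkerDemo and the 30 second timeout are illustrative choices, not taken from ThreadPoolExecutor.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: a worker polling a SynchronousQueue with a timeout is TIMED_WAITING. */
final class IdleWorkerDemo {
    static Thread.State observeIdleWorker() {
        final SynchronousQueue<Runnable> handoff = new SynchronousQueue<>();
        Thread worker = new Thread(() -> {
            try {
                Runnable task = handoff.poll(30, TimeUnit.SECONDS); // parks until a task is handed over
                if (task != null) {
                    task.run();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "idle-worker");

        try {
            worker.start();
            Thread.State observed = worker.getState();
            while (observed != Thread.State.TIMED_WAITING) {
                Thread.sleep(10);                 // poll until the worker has parked
                observed = worker.getState();
            }
            handoff.put(() -> { });               // hand over a no-op task, which wakes the worker
            worker.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(observeIdleWorker());
    }
}
```

The put call only returns once the worker has taken the task, which illustrates the hand-off described in item 2: no element is ever stored in the queue itself.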

Example 3: in Object.wait() and TIMED_WAITING

The example is as follows:
"RMI RenewClean-[172.16.5.19:28475]" daemon prio=10 tid=0x0000000041428800 nid=0xb09 in Object.wait() [0x00007f34f4bd0000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000000aa672478> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <0x00000000aa672478> (a java.lang.ref.ReferenceQueue$Lock)
at sun.rmi.transport.DGCClient$EndpointEntry$RenewCleanThread.run(DGCClient.java:516)
at java.lang.Thread.run(Thread.java:662)
1) "TIMED_WAITING (on object monitor)": in this example, the thread called java.lang.Object.wait(long timeout) and entered the waiting state.

2) A thread waiting in the "Wait Set" shows "in Object.wait()". After a thread has acquired the monitor and entered the critical section, if it finds that the condition it needs in order to continue is not met, it calls wait() on the object (usually the object being synchronized on), gives up the monitor and enters the "Wait Set" queue. Only when another thread calls notify() or notifyAll() on that object do the threads in the "Wait Set" get a chance to compete, and only one of them acquires the object's monitor and returns to the running state.

3) RMI RenewClean is part of DGCClient. DGC stands for Distributed GC, that is, distributed garbage collection.

4) Note that, reading the stack from the bottom up, "locked <0x00000000aa672478>" comes first and "waiting on <0x00000000aa672478>" comes afterwards. Why does the thread lock an object and then wait on the same object? Consider the following code:
static private class Lock { };
private Lock lock = new Lock();
public Reference<? extends T> remove(long timeout)
{
    synchronized (lock) {                 // acquires the monitor: "locked <...>"
        Reference<? extends T> r = reallyPoll();
        if (r != null) return r;
        for (;;) {
            lock.wait(timeout);           // releases the monitor: "waiting on <...>"
            r = reallyPoll();
            ……
        }
    }
}
That is, during execution the thread first uses synchronized to obtain the monitor of this object (corresponding to "locked <0x00000000aa672478>"); when it reaches lock.wait(timeout), the thread gives up ownership of the monitor and enters the "Wait Set" queue (corresponding to "waiting on <0x00000000aa672478>").
5) The stack shows that the thread is cleaning up references to remote objects: once a reference's lease expires, distributed garbage collection clears it in turn.
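The lock-then-wait pattern from item 4 can be exercised end to end. The following is a minimal sketch with a made-up class WaitSetDemo and a made-up ready flag; it is not the ReferenceQueue implementation, only the same pattern of wait(timeout) inside a synchronized block, with a second thread that enters the same monitor to call notifyAll().

```java
/** Hypothetical demo of the Wait Set: wait(timeout) inside synchronized, woken by notifyAll(). */
final class WaitSetDemo {
    private final Object lock = new Object();
    private boolean ready;                        // the condition, guarded by lock

    void awaitReady() throws InterruptedException {
        synchronized (lock) {                     // the dump shows "locked <...>"
            while (!ready) {                      // re-check the condition after every wake-up
                lock.wait(1000);                  // the dump shows "waiting on <...>"; monitor released here
            }
        }
    }

    void markReady() {
        synchronized (lock) {                     // possible only because the waiter released the monitor
            ready = true;
            lock.notifyAll();                     // waiters leave the Wait Set and compete for the monitor
        }
    }

    static Thread.State observeWaiter() {
        WaitSetDemo demo = new WaitSetDemo();
        Thread waiter = new Thread(() -> {
            try {
                demo.awaitReady();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "waiter");
        try {
            waiter.start();
            Thread.State observed = waiter.getState();
            while (observed != Thread.State.TIMED_WAITING) {
                Thread.sleep(10);                 // poll until the waiter is inside wait(timeout)
                observed = waiter.getState();
            }
            demo.markReady();
            waiter.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(observeWaiter());
    }
}
```

That markReady() can enter synchronized (lock) at all while the waiter is still inside its own synchronized block is the point of the example: wait() gives the monitor up.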
 
II. The system is running slowly
1. Too many Full GCs
When Full GC happens too often, there are usually two symptoms:
  • Several threads in production show CPU usage above 100%, and the jstack command shows that these are mainly garbage collection threads.

  • Monitoring GC with the jstat command shows that the number of Full GCs is very large and keeps growing.

Locating the problem:

1. Use the top command to find the process (pid) that is using the most CPU.
2. Use top -Hp pid to find the ids of the threads in that process that consume the most CPU.
3. Convert the thread id to hexadecimal (printf "%x\n" thread-id).
4. Print the thread stacks with jstack -l pid and search the output for the hexadecimal thread id.
5. Suppose the matching entry is VM Thread, and the end of its line shows nid=0xa, where nid is the operating system thread id.
VM Thread is the thread that performs garbage collection. At this point we can basically conclude that the system is slow mainly because garbage collection is too frequent, which leads to long GC pauses.
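Steps 3 and 4 amount to a hex conversion followed by a text search. The following is a minimal sketch of that lookup; the class NidLookup and its method names are hypothetical, and the dump is assumed to be available as a single string.

```java
/** Hypothetical helper for steps 3 and 4: decimal thread id to nid, then search the dump. */
final class NidLookup {
    /** Formats a decimal OS thread id (as printed by top -Hp) the way jstack prints it. */
    static String toNid(long osThreadId) {
        return "nid=0x" + Long.toHexString(osThreadId);
    }

    /** Returns the header line of the thread with that id, or null if the dump has none. */
    static String findHeader(String dump, long osThreadId) {
        String needle = toNid(osThreadId) + " ";   // trailing space keeps 0xa from matching 0xab
        for (String line : dump.split("\\R")) {
            if (line.startsWith("\"") && (line + " ").contains(needle)) {
                return line;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(toNid(10));             // thread id 10 from top -Hp
    }
}
```

For the thread id 10 the helper produces nid=0xa, the value quoted in step 5.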
The GC situation can be checked with the following command (jstat -gcutil 9 1000 10):

root@8d36124607a0:/# jstat -gcutil 9 1000 10
  S0     S1     E      O      M     CCS    YGC   YGCT    FGC   FGCT     GCT
  0.00   0.00   0.00  75.07  59.09  59.60  3259  0.919  6517  7.715   8.635
  0.00   0.00   0.00   0.08  59.09  59.60  3306  0.930  6611  7.822   8.752
  0.00   0.00   0.00   0.08  59.09  59.60  3351  0.943  6701  7.924   8.867
  0.00   0.00   0.00   0.08  59.09  59.60  3397  0.955  6793  8.029   8.984

Here FGC is the number of Full GCs, which has reached 6793 and is still growing. This further suggests that the system is slow because memory is being exhausted. To see which objects are responsible, dump the heap and open it with the Eclipse MAT tool, which shows the objects as a tree structure:
Dump command: jmap -dump:format=b,file=/app/eip/jamp1562.hprof pid
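The growth of FGC across the jstat samples above can also be computed rather than read by eye. The following is a minimal sketch, assuming the jstat -gcutil output is available as text; the class JstatFgc is hypothetical, and the column is located by its header name so that the sketch does not depend on a fixed column position.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: extracts the FGC column from jstat -gcutil output. */
final class JstatFgc {
    /** Returns one FGC value per data row; rows before the header row are ignored. */
    static List<Long> fgcSamples(String jstatOutput) {
        List<Long> samples = new ArrayList<>();
        int fgcColumn = -1;
        for (String line : jstatOutput.split("\\R")) {
            String[] cells = line.trim().split("\\s+");
            if (fgcColumn < 0) {
                for (int i = 0; i < cells.length; i++) {
                    if (cells[i].equals("FGC")) {
                        fgcColumn = i;            // header row found
                    }
                }
            } else if (cells.length > fgcColumn) {
                samples.add(Long.parseLong(cells[fgcColumn]));
            }
        }
        return samples;
    }

    /** Number of Full GCs between the first and the last sample. */
    static long growth(List<Long> samples) {
        return samples.isEmpty() ? 0 : samples.get(samples.size() - 1) - samples.get(0);
    }
}
```

For the four samples shown above, taken one second apart, the growth is 6793 - 6517 = 276 Full GCs in about three seconds.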

 

 After analysis with the MAT tool, we can basically determine which objects consume the most memory, then find where they are created and deal with them. Here PrintStream is the largest consumer, but its share of memory is only 12.2%. In other words, it is not enough to cause so many Full GCs, so we need to consider another possibility: our own code or a third-party dependency calls System.gc() explicitly. This can be judged from the GC log, because it prints the cause of each GC:

[Full GC (System.gc()) [Tenured: 262546K->262546K(349568K), 0.0014879 secs] 262546K->262546K(506816K), [Metaspace: 3109K->3109K(1056768K)], 0.0015151 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]

[GC (Allocation Failure) [DefNew: 2795K->0K(157248K), 0.0001504 secs][Tenured: 262546K->402K(349568K), 0.0012949 secs] 265342K->402K(506816K), [Metaspace: 3109K->3109K(1056768K)], 0.0014699 secs] [Times: user=0.00

 

Here, for example, the first GC was caused by an explicit System.gc() call, while the second was initiated by the JVM itself. In summary, there are two main causes of too many Full GCs:

  • The code obtains a large number of objects at once and exhausts memory; in this case the Eclipse MAT tool shows which objects occupy the most memory.

  • Memory usage is not high but the number of Full GCs is still large; in this case explicit System.gc() calls may be causing the excess GCs, and -XX:+DisableExplicitGC can be added to make the JVM ignore explicit GC requests.
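Telling the two causes apart comes down to reading the cause printed in parentheses on each GC log line. The following is a minimal sketch that tallies those causes, assuming log lines in the format shown above; the class GcCauses and its regular expression are hypothetical, and other collectors or JDK versions print different formats that this pattern will not match.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts GC log lines by kind and cause. */
final class GcCauses {
    // Matches "[Full GC (cause) " or "[GC (cause) "; the cause itself may contain parentheses
    private static final Pattern EVENT = Pattern.compile("\\[(Full GC|GC) \\((.+?)\\) ");

    static Map<String, Integer> countByCause(String gcLog) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : gcLog.split("\\R")) {
            Matcher m = EVENT.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1) + " / " + m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A large count under "Full GC / System.gc()" points to the second cause in the list above, the one that -XX:+DisableExplicitGC addresses.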

2. CPU usage is too high
The approach is the same as for the first case.

3. An interface occasionally takes a long time to respond

A typical example of this case is an interface that often needs 2~3s to return. This is a more troublesome situation because, in general, it does not consume much CPU and does not occupy much memory, so the two methods above cannot be used to investigate it. And because the slowness only appears from time to time, even after obtaining the thread stacks with the jstack command we cannot determine which thread is the one executing the time-consuming operation.

For an interface whose slowness appears only from time to time, the approach to locating the problem is basically as follows: first find the interface, then use a load-testing tool to keep increasing the request rate. If some point in the interface is time-consuming, then because the request rate is very high, most threads will end up blocked at that point. Since many threads then show the same stack, we can basically locate the time-consuming code in the interface. The following thread stacks were obtained with a load-testing tool from code that contains a time-consuming blocking operation:

"http-nio-8080-exec-2" #29 daemon prio=5 os_prio=31 tid=0x00007fd08cb26000 nid=0x9603 waiting on condition [0x00007000031d5000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at java.lang.Thread.sleep(Thread.java:340)
at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
at com.aibaobei.user.controller.UserController.detail(UserController.java:18)

"http-nio-8080-exec-3" #30 daemon prio=5 os_prio=31 tid=0x00007fd08cb27000 nid=0x6203 waiting on condition [0x00007000032d8000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at java.lang.Thread.sleep(Thread.java:340)
at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
at com.aibaobei.user.controller.UserController.detail(UserController.java:18)

"http-nio-8080-exec-4" #31 daemon prio=5 os_prio=31 tid=0x00007fd08d0fa000 nid=0x6403 waiting on condition [0x00007000033db000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at java.lang.Thread.sleep(Thread.java:340)
at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
at com.aibaobei.user.controller.UserController.detail(UserController.java:18)

The log above shows that several threads are blocked at line 18 of UserController, which means this is a blocking point and is the reason the interface is slow.
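With hundreds of threads in a dump, spotting the shared frame by eye becomes impractical. The following is a minimal sketch that counts, per thread, the topmost frame belonging to the application; the class HotFrames is hypothetical, and the application package prefix (here it would be com.aibaobei.) is something the caller has to supply.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: finds the application frames that many threads are stuck in. */
final class HotFrames {
    /** Counts, per thread, the topmost frame whose class name starts with appPackage. */
    static Map<String, Integer> topAppFrames(String dump, String appPackage) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        boolean counted = false;                  // has the current thread contributed yet?
        for (String raw : dump.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                counted = false;                  // a quoted name starts a new thread
            } else if (!counted && line.startsWith("at " + appPackage)) {
                counts.merge(line.substring(3), 1, Integer::sum);
                counted = true;
            }
        }
        return counts;
    }
}
```

Applied to the three stacks above with the prefix com.aibaobei., the only entry is UserController.detail(UserController.java:18) with a count of 3, which is the blocking point.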

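4. A thread enters the WAITING state

This situation is relatively rare, but it can happen, and because it is to some extent not reproducible, it is very hard to spot during troubleshooting. The author once ran into a case like this. The scenario involved CountDownLatch, where the main thread is woken to continue only after every parallel task has finished. At the time we used CountDownLatch to coordinate several threads that connected to and exported users' Gmail mailbox data. One of the threads connected to a user's mailbox, but the connection was left hanging by the server, so the thread kept waiting for the server's response. In the end our main thread and the remaining threads were all stuck in the WAITING state.

For a problem like this, readers who have looked at jstack logs will know that under normal conditions most threads in production are in the TIMED_WAITING state, and the problematic thread here is in exactly the same state, which makes it very easy to misjudge. The approach to solving this problem is mainly as follows:

  • Use grep to find all threads in the TIMED_WAITING state in the jstack log and export them to a file such as a1.log. The following is an example of the exported file:

"Attach Listener" #13 daemon prio=9 os_prio=31 tid=0x00007fe690064000 nid=0xd07 waiting on condition [0x0000000000000000]
"DestroyJavaVM" #12 prio=5 os_prio=31 tid=0x00007fe690066000 nid=0x2603 waiting on condition [0x0000000000000000]
"Thread-0" #11 prio=5 os_prio=31 tid=0x00007fe690065000 nid=0x5a03 waiting on condition [0x0000700003ad4000]
"C1 CompilerThread3" #9 daemon prio=9 os_prio=31 tid=0x00007fe68c00a000 nid=0xa903 waiting on condition [0x0000000000000000]

  • After waiting a while, for example 10s, grep the jstack log again and export the result to another file such as a2.log. The result looks like this:

"DestroyJavaVM" #12 prio=5 os_prio=31 tid=0x00007fe690066000 nid=0x2603 waiting on condition [0x0000000000000000]
"Thread-0" #11 prio=5 os_prio=31 tid=0x00007fe690065000 nid=0x5a03 waiting on condition [0x0000700003ad4000]
"VM Periodic Task Thread" os_prio=31 tid=0x00007fe68d114000 nid=0xa803 waiting on condition

  • Repeat step 2 until 3~4 files have been exported, then compare the files and find the user threads that are present in all of them. These threads almost certainly include the problematic waiting thread, because a normal request thread would not still be waiting after 20~30s.

  • Once these threads have been identified, continue by examining their stacks. If a thread is supposed to be waiting, for example an idle thread in a user-created thread pool, its stack will not contain any user-defined classes. Such threads can be excluded, and the remaining threads are basically the problematic ones we are looking for. Their stacks show exactly which code location put the thread into the waiting state.

To tell whether a thread is a user thread, look at the thread name at the start of the line. Frameworks generally name their threads in a very regular way, so the name alone tells us that a thread belongs to some framework, and such threads can basically be excluded. The rest, such as Thread-0 above and any custom thread names we recognize, are what we need to investigate.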
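Comparing the exported files is a set intersection over thread names. The following is a minimal sketch, assuming each snapshot (a1.log, a2.log and so on) has been read into a string; the class StuckThreads is hypothetical. It matches on the thread name only, so threads that share a name across snapshots are treated as the same thread.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: finds thread names that appear in every exported snapshot. */
final class StuckThreads {
    /** Extracts the quoted thread names from grep output such as a1.log. */
    static Set<String> names(String snapshot) {
        Set<String> result = new LinkedHashSet<>();
        for (String line : snapshot.split("\\R")) {
            int end = line.lastIndexOf('"');
            if (line.startsWith("\"") && end > 0) {
                result.add(line.substring(1, end));
            }
        }
        return result;
    }

    /** Keeps only the names that are present in every snapshot. */
    static Set<String> inAll(List<String> snapshots) {
        Set<String> common = null;
        for (String snapshot : snapshots) {
            Set<String> current = names(snapshot);
            if (common == null) {
                common = current;
            } else {
                common.retainAll(current);
            }
        }
        return common == null ? new LinkedHashSet<String>() : common;
    }
}
```

For the two example files above the result is DestroyJavaVM and Thread-0; DestroyJavaVM is a JVM thread and can be excluded by name, which leaves Thread-0.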

After troubleshooting in this way, we can basically conclude that Thread-0 is the thread we are looking for, and its stack shows exactly where it entered the waiting state. In the following example, line 8 of SyncTask caused the thread to wait.

"Thread-0" #11 prio=5 os_prio=31 tid=0x00007f9de08c7000 nid=0x5603 waiting on condition [0x0000700001f89000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
at com.aibaobei.chapter2.eg4.SyncTask.lambda$main$0(SyncTask.java:8)
at com.aibaobei.chapter2.eg4.SyncTask$$Lambda$1/1791741888.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)

5. Deadlock

Deadlocks are basically easy to find, because jstack checks for deadlocks for us and prints the details of the deadlocked threads in its log.

 

 In a jstack log that contains a deadlock, the bottom of the log directly reports which deadlocks exist and gives the stack of each thread involved. In the example analyzed here, two user threads are waiting for each other to release a lock, and both are blocked at line 5 of ConnectTask, so we can go straight to that location and analyze the code to find the cause of the deadlock.
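The same kind of deadlock can be produced and detected inside a single JVM. The following is a minimal sketch: the class DeadlockDemo and the thread names task-1 and task-2 are made up, and the detection uses the standard ThreadMXBean.findMonitorDeadlockedThreads() call instead of jstack. It is not the ConnectTask code from the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

/** Hypothetical demo: two threads take two monitors in opposite order and deadlock. */
final class DeadlockDemo {
    /** Starts a thread that locks first, waits for its peer, then tries to lock second. */
    private static void lockInOrder(Object first, Object second, CountDownLatch bothHeld, String name) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothHeld.countDown();
                try {
                    bothHeld.await();             // wait until the peer holds its first lock too
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {           // never entered: the peer owns this monitor
                    // unreachable in this demo
                }
            }
        }, name);
        t.setDaemon(true);                        // deadlocked threads must not keep the JVM alive
        t.start();
    }

    /** Creates the deadlock and returns how many threads the JVM reports as deadlocked. */
    static int deadlockedThreadCount() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHeld = new CountDownLatch(2);
        lockInOrder(a, b, bothHeld, "task-1");
        lockInOrder(b, a, bothHeld, "task-2");

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findMonitorDeadlockedThreads();
        while (ids == null) {                     // null means no deadlock has been detected yet
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            ids = threads.findMonitorDeadlockedThreads();
        }
        return ids.length;
    }
}
```

Taking the two locks in the same order in both threads removes the cycle, which is the usual fix once the log has pointed to the two code locations.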

 
 

Origin www.cnblogs.com/frankruby/p/11933511.html