An atypical CPU alarm investigation

Recently, one of our online services kept triggering CPU-threshold alarms, and it was not immediately clear where the problem was. After a long investigation I finally found the culprit, and realized this was a rather interesting, "atypical" CPU problem, so I decided to write it up here.

Why call it atypical? Because in my experience, a typical CPU spike is usually caused by an infinite loop in business code, a low-performance RPC blocking a large number of threads, and so on, whereas this CPU problem turned out to be caused by the GC itself.

Let's walk through the investigation process.

Finding the CPU-hogging threads

top

The first step, of course, is to see which threads are using the most CPU, which can be done with the top command:

top -Hp $pid -b -n 1|sed -n "7,17p"

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 94349 app     20   0 11.2g 5.0g  12m S 15.0 32.1 215:03.69 java              
 94357 app     20   0 11.2g 5.0g  12m S 15.0 32.1  88:22.39 java              
 94352 app     20   0 11.2g 5.0g  12m S 13.1 32.1 215:05.71 java              
 94350 app     20   0 11.2g 5.0g  12m S 11.2 32.1 215:04.86 java              
 94351 app     20   0 11.2g 5.0g  12m S 11.2 32.1 215:04.99 java              
 94935 app     20   0 11.2g 5.0g  12m S 11.2 32.1  63:11.75 java              
 94926 app     20   0 11.2g 5.0g  12m S  9.4 32.1  63:10.58 java              
 94927 app     20   0 11.2g 5.0g  12m S  5.6 32.1  63:06.89 java              
 94932 app     20   0 11.2g 5.0g  12m S  5.6 32.1  63:12.65 java              
 94939 app     20   0 11.2g 5.0g  12m S  5.6 32.1  63:01.75 java  

$pid is the process ID of our Java process, and sed -n "7,17p" prints lines 7 through 17: the first lines of top's output are header information, so lines 7 to 17 give us the column header plus the 10 most CPU-hungry threads.

The PID in the first column is the OS-level ID of a thread inside the JVM; we just need to find the corresponding thread in the jstack output to see who is up to no good.

Note, however, that top prints the PID in decimal, while the thread IDs in jstack output are in hexadecimal, so the first job is to convert the PIDs above to hex. Here I only convert the three most CPU-intensive ones:

[app@linux-v-l-02:/app/tmp/]$printf '%x\n' 94349
1708d
[app@linux-v-l-02:/app/tmp/]$printf '%x\n' 94357
17095
[app@linux-v-l-02:/app/tmp/]$printf '%x\n' 94352
17090

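As a quick sanity check, the same conversion can also be done in Java: Integer.toHexString produces exactly the lowercase hex form that matches the nid=0x... field in jstack output. (A small illustrative sketch, not part of the original troubleshooting scripts.)

```java
public class PidToHex {
    public static void main(String[] args) {
        // Convert the decimal thread IDs reported by top into the
        // lowercase hex used by jstack's "nid=0x..." field.
        for (int pid : new int[] {94349, 94357, 94352}) {
            System.out.println(pid + " -> 0x" + Integer.toHexString(pid));
        }
        // 94349 -> 0x1708d, 94357 -> 0x17095, 94352 -> 0x17090
    }
}
```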
jstack

Now that we know the IDs of the CPU-intensive threads, let's go and see which threads they actually are.

First, dump all the threads in the JVM with jstack:

[app@linux-v-l-02:/app/tmp/]jstack -l $pid >>/tmp/jstack.txt

It is worth mentioning that the threads inside the JVM are constantly changing, and so is top's output, so if the jstack and top commands are run at separate times, the thread IDs may well no longer match up. It is therefore best to put the top and jstack commands into one script and run them together, which is in fact what I did.

Now let's see what these three threads, 1708d, 17095 and 17090, actually are:

[app@linux-v-l-02:/app/tmp/]$egrep "1708d|17095|17090" jstack.txt --color
"Gang worker#0 (Parallel GC Threads)" os_prio=0 tid=0x00007f4d4c023000 nid=0x1708d runnable 
"Gang worker#3 (Parallel GC Threads)" os_prio=0 tid=0x00007f4d4c028800 nid=0x17090 runnable 
"G1 Concurrent Refinement Thread#0" os_prio=0 tid=0x00007f4d4c032000 nid=0x17095 runnable 

The nid above is the hexadecimal thread ID. As jstack shows, the threads consuming the most CPU are actually GC threads.

We do monitor JVM FULL GCs, and since this application switched to G1 it has had roughly one FULL GC per week, so we had always assumed our JVM heap was healthy. But every indication was now that our JVM really did have a problem.

GC problems

gc log

We had been printing GC logs all along. Opening them up, there really were a lot of GC pauses, as follows:

2019-08-12T20:12:23.002+0800: 501598.612: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.0907586 secs]
   [Parallel Time: 84.5 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 501598615.0, Avg: 501598615.0, Max: 501598615.0, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 4.9, Avg: 5.0, Max: 5.0, Diff: 0.2, Sum: 19.8]
      [Update RS (ms): Min: 76.6, Avg: 76.7, Max: 76.7, Diff: 0.1, Sum: 306.7]
         [Processed Buffers: Min: 945, Avg: 967.0, Max: 1007, Diff: 62, Sum: 3868]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 2.4, Avg: 2.5, Max: 2.6, Diff: 0.2, Sum: 9.8]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.3]
      [GC Worker Total (ms): Min: 84.2, Avg: 84.2, Max: 84.2, Diff: 0.1, Sum: 336.7]
      [GC Worker End (ms): Min: 501598699.2, Avg: 501598699.2, Max: 501598699.2, Diff: 0.0]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.1 ms]
   [Other: 5.9 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.3 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.1 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.7 ms]
      [Free CSet: 0.2 ms]
   [Eden: 230.0M(1968.0M)->0.0B(1970.0M) Survivors: 8192.0K->8192.0K Heap: 1693.6M(4096.0M)->1082.1M(4096.0M)]
 [Times: user=0.34 sys=0.00, real=0.10 secs] 
2019-08-12T20:12:23.094+0800: 501598.703: [GC concurrent-root-region-scan-start]
2019-08-12T20:12:23.101+0800: 501598.711: [GC concurrent-root-region-scan-end, 0.0076353 secs]
2019-08-12T20:12:23.101+0800: 501598.711: [GC concurrent-mark-start]
2019-08-12T20:12:23.634+0800: 501599.243: [GC concurrent-mark-end, 0.5323465 secs]
2019-08-12T20:12:23.639+0800: 501599.249: [GC remark 2019-08-12T20:12:23.639+0800: 501599.249: [Finalize Marking, 0.0019652 secs] 2019-08-12T20:12:23.641+0800: 501599.251: [GC ref-proc, 0.0027393 secs] 2019-08-12T20:12:23.644+0800: 501599.254: [Unloading, 0.0307159 secs], 0.0366784 secs]
 [Times: user=0.13 sys=0.00, real=0.04 secs] 
2019-08-12T20:12:23.682+0800: 501599.291: [GC cleanup 1245M->1226M(4096M), 0.0041762 secs]
 [Times: user=0.02 sys=0.00, real=0.01 secs] 
2019-08-12T20:12:23.687+0800: 501599.296: [GC concurrent-cleanup-start]
2019-08-12T20:12:23.687+0800: 501599.296: [GC concurrent-cleanup-end, 0.0000487 secs]
2019-08-12T20:12:30.022+0800: 501605.632: [GC pause (G1 Humongous Allocation) (young) (to-space exhausted), 0.3849037 secs]
   [Parallel Time: 165.7 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 501605635.2, Avg: 501605635.2, Max: 501605635.3, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 3.5, Avg: 3.8, Max: 4.4, Diff: 0.9, Sum: 15.2]
      [Update RS (ms): Min: 135.5, Avg: 135.8, Max: 136.0, Diff: 0.5, Sum: 543.3]
         [Processed Buffers: Min: 1641, Avg: 1702.2, Max: 1772, Diff: 131, Sum: 6809]
      [Scan RS (ms): Min: 1.5, Avg: 1.6, Max: 1.6, Diff: 0.0, Sum: 6.2]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 24.1, Avg: 24.4, Max: 24.6, Diff: 0.4, Sum: 97.4]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.1]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [GC Worker Total (ms): Min: 165.6, Avg: 165.6, Max: 165.6, Diff: 0.0, Sum: 662.4]
      [GC Worker End (ms): Min: 501605800.8, Avg: 501605800.9, Max: 501605800.9, Diff: 0.0]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.3 ms]
   [Other: 218.7 ms]
      [Evacuation Failure: 210.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.5 ms]
      [Ref Enq: 0.1 ms]
      [Redirty Cards: 0.3 ms]
      [Humongous Register: 0.2 ms]
      [Humongous Reclaim: 2.2 ms]
      [Free CSet: 0.2 ms]
   [Eden: 666.0M(1970.0M)->0.0B(204.0M) Survivors: 8192.0K->0.0B Heap: 2909.5M(4096.0M)->1712.4M(4096.0M)]
 [Times: user=1.44 sys=0.00, real=0.39 secs] 
2019-08-12T20:12:32.225+0800: 501607.834: [GC pause (G1 Evacuation Pause) (mixed), 0.0800708 secs]
   [Parallel Time: 74.8 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 501607835.5, Avg: 501607835.6, Max: 501607835.6, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 3.7, Avg: 4.0, Max: 4.4, Diff: 0.6, Sum: 16.2]
      [Update RS (ms): Min: 67.8, Avg: 68.0, Max: 68.1, Diff: 0.3, Sum: 272.0]
         [Processed Buffers: Min: 863, Avg: 899.8, Max: 938, Diff: 75, Sum: 3599]

One downside of G1's log is that it is so verbose it is easy to get lost in. For ease of description, I have omitted the less meaningful parts of the GC log above and condensed it into the following three sections:

2019-08-12T20:12:23.002+0800: 501598.612: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.0907586 secs]
   [Parallel Time: 84.5 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 501598615.0, Avg: 501598615.0, Max: 501598615.0, Diff: 0.1]
......
    
   [Eden: 230.0M(1968.0M)->0.0B(1970.0M) Survivors: 8192.0K->8192.0K Heap: 1693.6M(4096.0M)->1082.1M(4096.0M)]
 [Times: user=0.34 sys=0.00, real=0.10 secs] 
2019-08-12T20:12:23.094+0800: 501598.703: [GC concurrent-root-region-scan-start]
2019-08-12T20:12:23.101+0800: 501598.711: [GC concurrent-root-region-scan-end, 0.0076353 secs]
2019-08-12T20:12:23.101+0800: 501598.711: [GC concurrent-mark-start]
2019-08-12T20:12:23.634+0800: 501599.243: [GC concurrent-mark-end, 0.5323465 secs]
2019-08-12T20:12:23.639+0800: 501599.249: [GC remark 2019-08-12T20:12:23.639+0800: 501599.249: [Finalize Marking, 0.0019652 secs] 2019-08-12T20:12:23.641+0800: 501599.251: [GC ref-proc, 0.0027393 secs] 2019-08-12T20:12:23.644+0800: 501599.254: [Unloading, 0.0307159 secs], 0.0366784 secs]
 [Times: user=0.13 sys=0.00, real=0.04 secs] 
2019-08-12T20:12:23.682+0800: 501599.291: [GC cleanup 1245M->1226M(4096M), 0.0041762 secs]
 [Times: user=0.02 sys=0.00, real=0.01 secs] 
2019-08-12T20:12:23.687+0800: 501599.296: [GC concurrent-cleanup-start]
2019-08-12T20:12:23.687+0800: 501599.296: [GC concurrent-cleanup-end, 0.0000487 secs]
2019-08-12T20:12:30.022+0800: 501605.632: [GC pause (G1 Humongous Allocation) (young) (to-space exhausted), 0.3849037 secs]
   [Parallel Time: 165.7 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 501605635.2, Avg: 501605635.2, Max: 501605635.3, Diff: 0.1]
......

   [Eden: 666.0M(1970.0M)->0.0B(204.0M) Survivors: 8192.0K->0.0B Heap: 2909.5M(4096.0M)->1712.4M(4096.0M)]
 [Times: user=1.44 sys=0.00, real=0.39 secs] 
2019-08-12T20:12:32.225+0800: 501607.834: [GC pause (G1 Evacuation Pause) (mixed), 0.0800708 secs]
   [Parallel Time: 74.8 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 501607835.5, Avg: 501607835.6, Max: 501607835.6, Diff: 0.1]
......


The condensed log is much clearer. We can see at least three problems straight away:

  1. Mixed-type Evacuation Pauses have occurred
  2. Frequent G1 Humongous Allocations are leading to to-space exhausted, indicating that large objects are constantly being allocated
  3. GC pause times reach 0.3849037 secs, which is what we can least tolerate

There is also a more serious problem that is less apparent here: logs like these appeared extremely frequently! At peak times they were printed once every 2 seconds.

jmap -histo

From the GC logs above we can basically conclude that the application is constantly allocating large objects.

So what is this large object?

Such a large object is often something like a big List held in a local variable. With jmap -histo we can see which objects in the heap are relatively large and how many instances of each there are.

So first, let's look at how objects are distributed in the heap with jmap -histo $pid:


num   #instances  #bytes  class name
--------------------------------------------
1:       1120   1032796420   [B
2:     838370    105246813   [C
3:     117631     55348463   [I
4:     352652     31033457   java.lang.reflect.Method
5:     665505     13978410   java.lang.String
6:     198567     12368412   [Ljava.lang.Object;
7:     268941      9467465   java.util.HashMap$Node
8:     268941      8064567   java.util.TreeMap$Entry
9:     268941      7845665   java.lang.reflect.Field
10:    268941      7754771   [Ljava.util.HashMap$Node;

....


In general, if we are lucky and the problem is in business code, we can usually spot a business class name in the jmap -histo output.

But unfortunately, there was none.

However, a sharp-eyed reader may have seen at a glance that this heap is actually quite problematic.

Look at the top entry, [B (byte array): it occupies 1,032,796,420 bytes of heap (about 1 GB) across only some 1,120 instances. A simple division shows that each instance averages nearly 1 MB in size!

Clearly these are the large objects we are looking for. But all we know so far is that they are byte arrays; we don't know what is in them, so further investigation is needed.

Why is 1 MB considered a large object? Because our heap is only 4 GB, and G1 uses at most about 2048 regions by default, so each region is 2 MB. When G1 allocates a new object and finds it is larger than half a region, it treats it as a humongous object, which is why the G1 Humongous Allocation pauses occur.
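The arithmetic above can be sketched in a few lines (a rough check, assuming the 4 GB heap and G1's default target of 2048 regions mentioned in the text):

```java
public class HumongousCheck {
    public static void main(String[] args) {
        long heapBytes = 4L * 1024 * 1024 * 1024;  // -Xmx4g
        long regionSize = heapBytes / 2048;        // G1 targets ~2048 regions -> 2 MB each
        long humongousThreshold = regionSize / 2;  // objects larger than half a region are humongous

        // Average size of the [B instances seen in jmap -histo:
        long avgByteArray = 1032796420L / 1120;    // ~922 KB each on average

        System.out.println("region = " + regionSize
                + " bytes, humongous threshold = " + humongousThreshold
                + " bytes, avg [B = " + avgByteArray + " bytes");
        // The 1,048,600-byte buffers found later exceed the 1,048,576-byte threshold
    }
}
```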

jmap -dump:format=b

Use the command jmap -dump:format=b,file=heap.hprof $pid to dump out the contents of the JVM heap. The dump generally can't be inspected directly on the command line, so download it locally and analyze it with a tool. There are many tools that can do this, for example jvisualvm, JProfiler, MAT, and so on.

Here I used jvisualvm: open jvisualvm ==> File ==> Load ==> select the heap.hprof just downloaded, then click "Classes" and sort by size, which gives the figure below.

(image: jvisualvm class view sorted by size)

As you can see, byte array instances account for only 0.9% of the heap's instance count, yet as much as 30% of its memory, which again shows that each instance is a large object.

Next, double-click the first row, "byte[]", to view the details of these byte arrays. Many of the objects are 1,048,600 bytes, which is just over 1 MB. But this view still doesn't show the contents of the arrays, so let's export one of them locally, as shown below:

(image: exporting a byte[] instance from jvisualvm)

After exporting it, open it with Sublime Text, as shown below:

(image: the exported byte array opened in Sublime Text)
As you can see, the actual payload of the array is only about 1 KB (the non-zero bytes at the front); everything after that is meaningless zero padding.

Although we still can't tell which code produced this array, we can at least roughly identify the cause: somewhere, the code news up a 1,048,600-byte array, while the actual data only needs about 1 KB; the rest of the array is left at its default value of 0.
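To check that the tail really is padding, one can measure the "effective" length of such an array by scanning back from the end for the last non-zero byte. This is a hypothetical helper for illustration, not code from the application:

```java
public class EffectiveLength {
    // Length of the array up to and including the last non-zero byte.
    static int effectiveLength(byte[] arr) {
        int i = arr.length - 1;
        while (i >= 0 && arr[i] == 0) {
            i--;
        }
        return i + 1;
    }

    public static void main(String[] args) {
        // Simulate the dumped buffer: ~1 KB of real data in a 1,048,600-byte array.
        byte[] buf = new byte[1048600];
        for (int i = 0; i < 1024; i++) {
            buf[i] = (byte) (1 + (i % 120));  // arbitrary non-zero payload
        }
        System.out.println(effectiveLength(buf)); // prints 1024
    }
}
```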

To confirm the guess, we simply used

 String str = new String(bytearr, "UTF-8");
 System.out.println("str = [" + str + "]");

to print out the contents of the array. The result looked something like this (I've omitted most of the content):

str = [p C0+org.apache.camel.impl.DefaultExchangeHolder�
exchangeIdinFaultFlagoutFaultFlag	exceptioninBodyoutBody	inHeaders
outHeaders
remoteOptRolefaceUrlversionfromIfromUserfailFaceUrl
.....


Searching the code for the associated keywords finally turned up the culprit:

       data = DefaultExchangeHolder.marshal(exchange, false);
       baos = new ByteArrayOutputStream(1048600);// the culprit is here
       hessianOut = new Hessian2Output(baos);
       hessianOut.startMessage();
       hessianOut.writeObject(data);
       hessianOut.completeMessage();
       hessianOut.flush();
       exchangeData = baos.toByteArray();

And the ByteArrayOutputStream constructor:

  public ByteArrayOutputStream(int size) {
      if (size < 0) {
          throw new IllegalArgumentException("Negative initial size: "
                                             + size);
      }
      buf = new byte[size];
  }

In other words, before serializing with Hessian, the code was newing a 1 MB byte array, producing the flood of large objects. Yet this byte array is only used as a buffer, and a ByteArrayOutputStream's buffer is not fixed in size ("The buffer automatically grows as data is written to it"), so there was no need to make it so large in the first place.
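A minimal sketch of the fix: let ByteArrayOutputStream start small (its default internal buffer is 32 bytes) and grow on demand, instead of pre-allocating 1 MB. The Hessian serialization calls are stood in for by a plain write, since those classes aren't reproduced here:

```java
import java.io.ByteArrayOutputStream;

public class GrowingBuffer {
    public static void main(String[] args) {
        // No size argument: the internal buffer starts at 32 bytes and
        // grows automatically -- no need to pre-allocate 1,048,600 bytes.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();

        byte[] payload = new byte[4096];  // stand-in for the serialized exchange data
        baos.write(payload, 0, payload.length);

        byte[] exchangeData = baos.toByteArray();
        System.out.println(exchangeData.length); // prints 4096: sized to the data, not 1 MB
    }
}
```

With this change, each serialization allocates a buffer proportional to the roughly 1 KB of actual data, so no humongous objects are created at all.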


Origin juejin.im/post/5d52390ae51d4561af16dcf0