Analysis of frequent GC (Allocation Failure) and long GC time

preface

This article analyzes a case of frequent GC (Allocation Failure) collections and excessively long young GC pauses.

symptoms

  • The gc throughput gradually dropped from the normal 99.96% to 98%, with a low point of 94%.
  • The young gc time gradually increased from the usual ten or so milliseconds, past 50 ms, and then past 100, 150, 200, and 250 ms.
  • In 8.5 days, more than 9000 gcs occurred. 4 of them were full gcs, averaging nearly 8 seconds; the rest were young gcs (mainly Allocation Failure), averaging more than 270 milliseconds, with a maximum of nearly 7 seconds.
  • The average object creation rate is 10.63 MB/sec and the average promotion rate is 2 KB/sec. CPU usage is normal, with no significant spikes.
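As a rough cross-check of these figures, the gc count and the allocation rate can be turned into an average interval between collections and the volume allocated per interval. This is only a back-of-the-envelope sketch: the variable names are illustrative, and 9000 is the lower bound quoted above, not an exact count.

```python
# Figures quoted in the symptom list (9000 is a lower bound, not an exact count).
days = 8.5
gc_count = 9000
alloc_rate_mb_per_sec = 10.63

seconds = days * 24 * 3600                                  # observation window
interval = seconds / gc_count                               # average seconds between gcs
allocated_per_interval = alloc_rate_mb_per_sec * interval   # MB allocated between gcs

print(f"one gc every {interval:.1f} s, ~{allocated_per_interval:.0f} MB allocated in between")
```

This works out to roughly one collection every 82 seconds with about 867 MB allocated in between, which is in the same range as the eden capacity of a 1 GB young generation and fits collections that are triggered when eden fills up.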

jvm parameters

-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:+UseAdaptiveSizePolicy -XX:MaxHeapSize=2147483648 -XX:MaxNewSize=1073741824 -XX:NewSize=1073741824 -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps

jdk version

java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

full gc

27.066: [Full GC (Metadata GC Threshold) [PSYoungGen: 19211K->0K(917504K)] [ParOldGen: 80K->18440K(1048576K)] 19291K->18440K(1966080K), [Metaspace: 20943K->20943K(1069056K)], 0.5005658 secs] [Times: user=0.24 sys=0.01, real=0.50 secs] 
100.675: [Full GC (Metadata GC Threshold) [PSYoungGen: 14699K->0K(917504K)] [ParOldGen: 18464K->23826K(1048576K)] 33164K->23826K(1966080K), [Metaspace: 34777K->34777K(1081344K)], 0.7937738 secs] [Times: user=0.37 sys=0.01, real=0.79 secs]
195.073: [Full GC (Metadata GC Threshold) [PSYoungGen: 24843K->0K(1022464K)] [ParOldGen: 30048K->44782K(1048576K)] 54892K->44782K(2071040K), [Metaspace: 58220K->58220K(1101824K)], 3.7936515 secs] [Times: user=1.86 sys=0.02, real=3.79 secs] 
242605.669: [Full GC (Ergonomics) [PSYoungGen: 67276K->0K(882688K)] [ParOldGen: 1042358K->117634K(1048576K)] 1109635K->117634K(1931264K), [Metaspace: 91365K->90958K(1132544K)], 22.1573804 secs] [Times: user=2.50 sys=3.51, real=22.16 secs]

There were 4 full gcs: the first three were caused by Metadata GC Threshold, and only the last was caused by Ergonomics.
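To pull the cause and pause out of such lines without reading them by eye, a small parser helps. The following is a minimal sketch that assumes the exact log layout shown above (Parallel collector with -XX:+PrintGCDetails); the names FULL_GC and parse_full_gc are made up for this example.

```python
import re

# Matches Full GC lines in the layout shown in the log above.
FULL_GC = re.compile(
    r"(?P<ts>[\d.]+): \[Full GC \((?P<cause>[^)]+)\).*?"
    r"\[ParOldGen: (?P<old_before>\d+)K->(?P<old_after>\d+)K\((?P<old_cap>\d+)K\)\].*?"
    r"real=(?P<real>[\d.]+) secs\]"
)

def parse_full_gc(line):
    """Return cause, old-generation occupancy and pause for one Full GC line, or None."""
    m = FULL_GC.search(line)
    if m is None:
        return None
    return {
        "timestamp": float(m["ts"]),
        "cause": m["cause"],
        "old_before_pct": 100 * int(m["old_before"]) / int(m["old_cap"]),
        "pause_sec": float(m["real"]),
    }

# First and last Full GC entries from the log above.
LINES = [
    "27.066: [Full GC (Metadata GC Threshold) [PSYoungGen: 19211K->0K(917504K)] [ParOldGen: 80K->18440K(1048576K)] 19291K->18440K(1966080K), [Metaspace: 20943K->20943K(1069056K)], 0.5005658 secs] [Times: user=0.24 sys=0.01, real=0.50 secs]",
    "242605.669: [Full GC (Ergonomics) [PSYoungGen: 67276K->0K(882688K)] [ParOldGen: 1042358K->117634K(1048576K)] 1109635K->117634K(1931264K), [Metaspace: 91365K->90958K(1132544K)], 22.1573804 secs] [Times: user=2.50 sys=3.51, real=22.16 secs]",
]

for line in LINES:
    print(parse_full_gc(line))
```

Grouping the parsed entries by cause makes it easy to separate the metaspace-driven collections from the one caused by Ergonomics.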

Full GC (Metadata GC Threshold)

Java 8 is used here, and the JVM parameters do not specify the initial size or the upper limit of the metaspace. Check the actual values:

jstat -gcmetacapacity 7
   MCMN       MCMX        MC       CCSMN      CCSMX       CCSC     YGC   FGC    FGCT     GCT
       0.0  1136640.0    99456.0        0.0  1048576.0    12160.0 38009    16  275.801 14361.992
  • Ignore the FGC count shown here, because the log analyzed above covers only a quarter of it.
  • MCMX (maximum metaspace capacity, in kB) is more than 1 GB, while MC (current metaspace capacity, in kB) is only about 97 MB. So why was Full GC (Metadata GC Threshold) triggered?

Related parameters

  • -XX:MetaspaceSize: the initial size, which is also the initial threshold (the initial high-water mark). Reaching this value triggers a garbage collection to unload classes, after which the GC adjusts the value: if a large amount of space was released, it lowers the value; if only a little was released, it raises the value, as long as it does not exceed MaxMetaspaceSize.
  • -XX:MaxMetaspaceSize: the maximum size. The default is unlimited, bounded only by the native memory available on the system.
  • -XX:MinMetaspaceFreeRatio: the minimum percentage of Metaspace capacity that should be free after a GC (in other words, it caps how much of the currently allocated size metadata may occupy). If the free ratio is below this value (the maximum occupancy is exceeded), the metaspace is expanded.
  • -XX:MaxMetaspaceFreeRatio: the maximum percentage of Metaspace capacity that should be free after a GC (in other words, it sets the minimum share of the currently allocated size that metadata should occupy). If the free ratio is above this value (occupancy is below the minimum), the metaspace is shrunk.
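The two free-ratio parameters can be read as a lower and an upper bound on the capacity that is acceptable for a given amount of used metadata. The sketch below is my simplified reading of that rule, not HotSpot's actual code: the real implementation also applies alignment, a minimum expansion step, and gradual shrinking. The function names are hypothetical, and the defaults of 40 and 70 are the values reported by PrintFlagsFinal on this machine.

```python
def desired_capacity_bounds(used_kb, min_free_ratio=40, max_free_ratio=70):
    """Capacity range (KB) that keeps the free share between the two ratios."""
    min_capacity = used_kb / (1 - min_free_ratio / 100)  # below this: too little free
    max_capacity = used_kb / (1 - max_free_ratio / 100)  # above this: too much free
    return min_capacity, max_capacity

def resize_action(used_kb, capacity_kb, min_free_ratio=40, max_free_ratio=70):
    """Classify the adjustment implied by the free ratios (simplified)."""
    low, high = desired_capacity_bounds(used_kb, min_free_ratio, max_free_ratio)
    if capacity_kb < low:
        return "expand"
    if capacity_kb > high:
        return "shrink"
    return "keep"

# 20943K used against a threshold of 21296K (21807104 bytes) leaves far less than 40% free.
print(resize_action(20943, 21296))
```

With these numbers the free share is under 2%, so the rule asks for an expansion.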

Since nothing is set explicitly, the defaults on this machine apply:

java -XX:+PrintFlagsFinal | grep Meta
    uintx InitialBootClassLoaderMetaspaceSize       = 4194304                             {product}
    uintx MaxMetaspaceExpansion                     = 5451776                             {product}
    uintx MaxMetaspaceFreeRatio                     = 70                                  {product}
    uintx MaxMetaspaceSize                          = 18446744073709547520                    {product}
    uintx MetaspaceSize                             = 21807104                            {pd product}
    uintx MinMetaspaceExpansion                     = 339968                              {product}
    uintx MinMetaspaceFreeRatio                     = 40                                  {product}
     bool TraceMetadataHumongousAllocation          = false                               {product}
     bool UseLargePagesInMetaspace                  = false                               {product}

MinMetaspaceFreeRatio is 40, MaxMetaspaceFreeRatio is 70, and MetaspaceSize is 21807104 bytes (about 20.8 MB). Full GC (Metadata GC Threshold) occurred three times:

  • The first time, [Metaspace: 20943K->20943K(1069056K)]
  • The second time, [Metaspace: 34777K->34777K(1081344K)]
  • The third time, [Metaspace: 58220K->58220K(1101824K)]

The first Full GC fired with the metaspace at 20943K, just below the initial threshold of 21807104 bytes (21296K), and each later one fired at a higher occupancy, so the metaspace threshold is being adjusted dynamically. The official documentation does not seem to describe the exact adjustment logic, so I will not delve into it for now. As long as the maximum is not exceeded there is no fatal effect, but low-latency applications should avoid the time-consuming gcs caused by this dynamic adjustment. This can be done by measuring the metaspace the application actually needs and setting the initial threshold (-XX:MetaspaceSize) accordingly.
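One plausible reading of the adjustment is that, after each of these collections, the threshold is raised to the smallest capacity that leaves MinMetaspaceFreeRatio percent free. The sketch below checks that assumption against the three log entries; it is an estimate based on the default ratio of 40, not a statement of what HotSpot does internally, and next_threshold_kb is a name invented for this example.

```python
MIN_METASPACE_FREE_RATIO = 40  # default reported by PrintFlagsFinal above

# Metaspace occupancy (KB) reported at each Full GC (Metadata GC Threshold).
used_at_full_gc_kb = [20943, 34777, 58220]

def next_threshold_kb(used_kb, min_free_ratio=MIN_METASPACE_FREE_RATIO):
    """Smallest capacity that leaves min_free_ratio percent free (assumed rule)."""
    return used_kb / (1 - min_free_ratio / 100)

predicted = [round(next_threshold_kb(u)) for u in used_at_full_gc_kb]
for used, nxt in zip(used_at_full_gc_kb, predicted):
    print(f"used {used}K -> estimated next threshold {nxt}K")
```

The estimates of 34905K and 57962K are close to the occupancies at which the second and third Full GCs actually fired (34777K and 58220K), and the last estimate of 97033K is close to the MC value of 99456K reported by jstat, so the assumption at least fits this log.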

Full GC (Ergonomics)

The cause of the last full gc is Ergonomics: UseAdaptiveSizePolicy is enabled, and this full gc was initiated by the JVM's own adaptive sizing logic. The log shows the old generation at 1042358K of 1048576K (about 99%) just before this collection.

GC (Allocation Failure)

Having analyzed the full gcs, let's look at the young gcs: 99% of the log entries are GC (Allocation Failure). Allocation Failure means that a minor gc was triggered because the young generation (eden) did not have enough suitable free space for a new object of the requested size.

-XX:+PrintTenuringDistribution

Desired survivor size 75497472 bytes, new threshold 2 (max 15)
- age   1:   68407384 bytes,   68407384 total
- age   2:   12494576 bytes,   80901960 total
- age   3:      79376 bytes,   80981336 total
- age   4:    2904256 bytes,   83885592 total
  • The Desired survivor size is the amount of survivor space that surviving objects are expected to occupy at most, here 75497472 bytes.
  • The lines below it list, by age, the sizes of the objects that remain in the survivor space after this gc. Age 1 alone totals 68407384 bytes, which is below 75497472, but ages 1 and 2 together total 80901960 bytes, which exceeds it (the overall total is 83885592), so the new threshold becomes 2 (it takes effect at the next gc). Objects whose age exceeds the threshold are promoted to the old generation at the next gc if they are still live.
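The threshold calculation described above can be sketched as a running total over the age table: the first age at which the cumulative size exceeds the desired survivor size becomes the new threshold. This is a simplified sketch of that rule, and the function name and argument layout are my own.

```python
def compute_tenuring_threshold(age_bytes, desired_survivor_size, max_threshold=15):
    """age_bytes[i] is the number of bytes held by objects of age i + 1."""
    total = 0
    for age in range(1, max_threshold + 1):
        if age - 1 < len(age_bytes):
            total += age_bytes[age - 1]
        if total > desired_survivor_size:
            return age          # cumulative size overflows the target at this age
    return max_threshold        # never overflowed: keep the maximum threshold

# Age table from the log excerpt above.
ages = [68407384, 12494576, 79376, 2904256]
print(compute_tenuring_threshold(ages, 75497472))
```

For the excerpt above the running total first exceeds 75497472 at age 2, which matches the "new threshold 2" in the log.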

age list is empty

59.463: [GC (Allocation Failure) 
Desired survivor size 134217728 bytes, new threshold 7 (max 15)
[PSYoungGen: 786432K->14020K(917504K)] 804872K->32469K(1966080K), 0.1116049 secs] [Times: user=0.10 sys=0.01, real=0.20 secs] 

Here, no age distribution follows the Desired survivor size line. This does not mean that nothing survived: the same log entry shows the young generation going from 786432K to 14020K, so about 14 MB of objects were still live after the gc. A more likely explanation is that the Parallel collector (PSYoungGen) used here only reports the desired survivor size and the new threshold, and does not appear to print the per-age table.

Examples of survivor objects after gc

jstat -gcutil -h10 7 10000 10000
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  99.99  90.38  29.82  97.84  96.99    413  158.501     4   14.597  173.098
 11.60   0.00  76.00  29.83  97.84  96.99    414  158.511     4   14.597  173.109
 11.60   0.00  77.16  29.83  97.84  96.99    414  158.511     4   14.597  173.109
  0.00  13.67  60.04  29.83  97.84  96.99    415  158.578     4   14.597  173.176
  0.00  13.67  61.05  29.83  97.84  96.99    415  158.578     4   14.597  173.176
  • Before a ygc, the occupied young generation is eden + S1; after it, eden + S0. The two survivor spaces swap roles at each ygc.
  • Old generation occupancy barely changes across a ygc (29.82% to 29.83%), which means that almost nothing was promoted to the old generation.
  • After a gc, the surviving objects are moved to the other survivor space.
  • Since samples are taken every 10 seconds, each row lags behind the gc: new objects are allocated in eden immediately afterward, so eden already appears partly occupied in the first sample after a gc.
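The YGC and YGCT columns are cumulative, so the pause of an individual young gc can be recovered from consecutive rows whenever exactly one collection happened between two samples. A minimal sketch, with the rows above typed in by hand as (YGC, YGCT) pairs and a helper name of my own choosing:

```python
def young_gc_pauses(samples):
    """samples: (YGC, YGCT) pairs from consecutive jstat -gcutil rows.

    Returns (gc_number, pause_seconds) for every pair of rows that is
    exactly one young gc apart.
    """
    pauses = []
    for (prev_count, prev_total), (count, total) in zip(samples, samples[1:]):
        if count - prev_count == 1:
            pauses.append((count, round(total - prev_total, 3)))
    return pauses

# (YGC, YGCT) columns of the five rows shown above.
samples = [(413, 158.501), (414, 158.511), (414, 158.511), (415, 158.578), (415, 158.578)]
print(young_gc_pauses(samples))
```

For these rows, gc number 414 took about 10 ms and gc number 415 about 67 ms.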

real time > usr time + sys time

722914.974: [GC (Allocation Failure) 
Desired survivor size 109576192 bytes, new threshold 15 (max 15)
[PSYoungGen: 876522K->8608K(941568K)] 1526192K->658293K(1990144K), 0.0102709 secs] [Times: user=0.03 sys=0.00, real=0.01 secs] 
722975.207: [GC (Allocation Failure) 
Desired survivor size 103284736 bytes, new threshold 15 (max 15)
[PSYoungGen: 843168K->39278K(941568K)] 1492853K->688988K(1990144K), 0.3607036 secs] [Times: user=0.17 sys=0.00, real=0.36 secs] 

In around 300 collections, real time is greater than user time + sys time.

  • real: the wall-clock time elapsed from the start to the end of the operation.
  • user: the CPU time consumed in user mode.
  • sys: the CPU time consumed in kernel mode.

Wall-clock time includes time spent waiting, for example for disk I/O or for blocked threads, whereas CPU time does not. On a machine with multiple CPUs or cores, however, the CPU time of all threads is added together, so it is perfectly normal to see user or sys time exceed real time.

user + sys is the CPU time actually consumed, summed over all CPUs. When the work is done by several threads in parallel, their times add up and the sum can exceed the value recorded as real, i.e. user + sys >= real.

More than 300 collections have real time greater than user time + sys time, which points to two possible problems: intensive I/O, or insufficient CPU allocated to the process.
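Counting such entries by hand is tedious, so a small filter over the [Times: ...] part of each log entry is useful. This is a minimal sketch that assumes the log layout shown above; the function name is made up for this example.

```python
import re

TIMES = re.compile(r"\[Times: user=([\d.]+) sys=([\d.]+), real=([\d.]+) secs\]")

def real_exceeds_cpu(line):
    """True if real > user + sys, False if not, None if the line has no Times part."""
    m = TIMES.search(line)
    if m is None:
        return None
    user, sys_time, real = map(float, m.groups())
    return real > user + sys_time

# The Times parts of the two young gc entries shown above.
fast = "[PSYoungGen: 876522K->8608K(941568K)] 1526192K->658293K(1990144K), 0.0102709 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]"
slow = "[PSYoungGen: 843168K->39278K(941568K)] 1492853K->688988K(1990144K), 0.3607036 secs] [Times: user=0.17 sys=0.00, real=0.36 secs]"

print(real_exceeds_cpu(fast), real_exceeds_cpu(slow))
```

Only the second entry is flagged: its real time of 0.36 s is about twice the CPU time it consumed, which is the pattern that suggests waiting rather than computing.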

New Generation Garbage Collection Mechanism

  • Try to allocate the new object on the stack. If that is not possible, try a TLAB allocation. If that is not possible either, consider whether to bypass eden and allocate directly in the old generation (-XX:PretenureSizeThreshold sets the size threshold for large objects: an object larger than this value is allocated directly in the old generation). Otherwise, request space in eden.
  • The new object requests space in eden; if eden has no suitable space, a minor gc is triggered.
  • The minor gc processes the surviving objects in eden and in the from survivor space:
    • If the age of an object has reached the threshold, it is promoted directly to the old generation.
    • If an object to be copied is too large, it is not copied to the survivor space but goes directly into the old generation.
    • If the to survivor space does not have enough room, or runs out of room during copying, the survivor space overflows and the object goes directly into the old generation.
    • Otherwise, if the to survivor space has enough room, the surviving object is copied to the to survivor space.
  • At this point, whatever remains in eden and the from survivor space is garbage. It is cleared, and the released space becomes available for new allocations.
  • After the minor gc, if eden has enough space, the new object is allocated in eden; if eden is still insufficient, the new object is allocated directly in the old generation.
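The copy-or-promote decisions in the steps above can be sketched as a toy model. It is not how the JVM is implemented: objects are reduced to (age, size) pairs, the names are my own, and the comparison "age after this gc exceeds the threshold" is an assumption about where exactly the cutoff lies.

```python
def minor_gc(live_objects, tenuring_threshold, to_space_capacity):
    """Toy model of a minor gc.

    live_objects: (age, size) pairs of objects still reachable in eden and
    the from survivor space, where age is the number of minor gcs survived so far.
    Returns (survivors, promoted_sizes).
    """
    survivors, promoted = [], []
    free = to_space_capacity
    for age, size in live_objects:
        new_age = age + 1
        if new_age > tenuring_threshold:
            promoted.append(size)               # old enough: promote
        elif size > free:
            promoted.append(size)               # too large or survivor overflow
        else:
            survivors.append((new_age, size))   # copy into the to survivor space
            free -= size
    return survivors, promoted

print(minor_gc([(0, 100), (2, 50), (1, 500)], tenuring_threshold=2, to_space_capacity=300))
```

In this run the new object is copied to the survivor space with age 1, the object that already had age 2 is promoted, and the 500-unit object is promoted because it does not fit in the remaining survivor space.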

summary

From the above analysis, the young generation appears to be somewhat too large, which makes ygc pauses long. In addition, the survivor space is mostly empty after each ygc, which means that new objects are created quickly and die young, so the survivor space as sized is barely used. Therefore, consider reducing the size of the young generation, or try switching to G1.

There are a few key points about -XX:+PrintTenuringDistribution to be clear on:

  • Which area the printed object distribution refers to (the survivor space)
  • Whether it is printed before or after the gc (after the gc)
  • Whether a new object's age counts as 0 or 1 when it first arrives in the survivor space

An object's age is the number of minor GCs it has survived. When an object is first allocated, its age is 0. If it survives its first minor GC, its age is incremented by 1 and it moves into the survivor space. Therefore, the age of an object when it first enters the survivor space is 1.

  • Dynamic adjustment of the promotion threshold (new threshold)

If the cumulative size of the objects, summed from the youngest age upward, is greater than the Desired survivor size, the survivor space is considered to have overflowed, and the threshold is recalculated.
