The reason for frequent FullGC turned out to be "open source code"? | JD Cloud technical team

Foreword

A defining feature of Java is that, unlike C and C++, memory does not have to be released manually: the JVM has a garbage collection (GC) mechanism that, as the name implies, reclaims the space occupied by objects that are no longer used. The heap takes up the most memory while the JVM is running. The stack and the method area also take up space, but that space is limited and is not covered in this article. The heap is divided into the young generation and the old generation, so garbage collection can be roughly split into two types: collection of the young generation is called Young GC (YGC), and collection of the old generation is called Full GC. Strictly speaking, a Full GC collects the young generation, the old generation, and the metaspace.

Because a Full GC stops all application threads (STW, Stop The World), we want the system to run Full GC as rarely as possible and, when it must run, to finish as quickly as possible. The rest of this article walks through a case of frequent Full GC that I ran into while developing an operations back-office system.

Incident background

Project Introduction:

Our team works on a back-office management system. Different users are responsible for different functions and need different permissions, so we introduced the mainstream Shiro framework for permission control; it can control the menu bar, buttons, operation boxes, and so on. Along with the framework we introduced the auxiliary component shiro-redis, a cache layer that makes it easier to manage user login information. The memory leak turned out to be in this auxiliary component.

Incident reconstruction:

At 11:30 on Friday we received a monitoring alarm reporting that the system was running Full GC frequently. We immediately did the following:

First: Logged in to the company's UMP monitoring platform (for an open source equivalent, see: [Prometheus+grafana monitoring]) to check the machines' system metrics, and found that the frequent Full GC had lasted from 11:00 to 11:30.

Second: Kept one machine untouched to collect evidence and restarted the others so that the business stayed accessible. Full GC returned to normal after the restart.

Third: Prepared the heap dump command: ./jmap -F -dump:live,format=b,file=/jmapfile.hprof 18362 (-F forces the dump; 18362 is the application's pid, obtained with top -c).

Fourth: Because I did not have permission to export the heap dump myself, I immediately asked operations to export it on the machine with the command above. The point was to capture evidence on the spot, because the heap might look normal again later.
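
For readers who cannot run jmap on the host, a heap dump can also be requested from inside a HotSpot JVM. The sketch below is an alternative shown for illustration, not what the team did in this incident. `HeapDumper` and the file names are hypothetical; `HotSpotDiagnosticMXBean.dumpHeap` is the JDK management API it relies on, and the application would still need some way to invoke it (for example an admin endpoint), which is left out here.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: writes a heap dump of the current JVM to a file. */
final class HeapDumper {
    private HeapDumper() {}

    /**
     * Dumps the heap to {@code target}. With liveOnly = true only reachable
     * objects are written, comparable to the "live" option of jmap -dump.
     */
    static Path dump(Path target, boolean liveOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            // dumpHeap fails if the target file already exists.
            bean.dumpHeap(target.toString(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heapdump");
        Path file = dump(dir.resolve("app.hprof"), true);
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

The resulting .hprof file can be opened in MAT in the same way as a file produced by jmap.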

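From what we know about the JVM, the five common causes of Full GC are:

1. Insufficient old-generation memory (too many large objects, or a memory leak)
2. Insufficient Metaspace
3. Code that explicitly calls System.gc()
4. The pessimistic strategy during YGC
5. Dumping live memory, for example with jmap -dump:live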

Analyze the reasons

1. On the company's SGM monitoring platform (for an open source equivalent, see: [Prometheus+grafana monitoring]), the maximum metaspace size is 256M and usage stayed at 117M before and after the Full GCs, which rules out insufficient Metaspace.

2. We searched the system and its third-party jar packages and found no code that explicitly calls System.gc().

3. The JVM startup parameters include the following two settings, which rules out the pessimistic strategy during YGC:

-XX:CMSInitiatingOccupancyFraction=70      # Trigger Full GC when old-generation occupancy reaches 70%
-XX:+UseCMSInitiatingOccupancyOnly         # Disable the pessimistic strategy during YGC (checking before and after YGC whether a Full GC is needed); run Full GC only when the threshold is reached

4. After checking with the O&M and R&D teams, nobody had run a dump manually, and the machine's command history showed no dump operation either, which rules out a manual dump.
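
The Metaspace check in step 1 was done on the monitoring platform. As a rough in-process equivalent, the standard memory pool MXBeans expose the same kind of figures. This is a minimal sketch: `MetaspaceCheck` is a hypothetical name, and the pool name "Metaspace" is what HotSpot uses on JDK 8 and later, so other JVMs may not report it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

/** Hypothetical helper: reads Metaspace usage from the JVM's memory pool MXBeans. */
final class MetaspaceCheck {
    private MetaspaceCheck() {}

    /** Returns the usage of the pool named "Metaspace", if this JVM exposes one. */
    static Optional<MemoryUsage> metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return Optional.ofNullable(pool.getUsage());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<MemoryUsage> usage = metaspaceUsage();
        if (!usage.isPresent()) {
            System.out.println("No pool named Metaspace on this JVM");
            return;
        }
        MemoryUsage u = usage.get();
        // getMax() returns -1 when no maximum is defined.
        String max = u.getMax() < 0 ? "unlimited" : (u.getMax() >> 20) + " MB";
        System.out.println("Metaspace used " + (u.getUsed() >> 20) + " MB, max " + max);
    }
}
```

A reading that stays far below the maximum, like the 117M out of 256M above, is what rules Metaspace out.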

Preliminary analysis results:

Based on the monitoring platform, the JVM startup parameters, the code search, and the command history above, the main suspect is frequent Full GC caused by insufficient old-generation memory. As engineers, however, we cannot treat elimination alone as proof of the cause, so we still have to confirm the conjecture. The following analysis combines three key inputs: the JVM startup parameters, the Tomcat startup parameters, and the heap dump.

The figure below shows the old-generation memory at the time of the Full GC. Keep the values 72%, 1794 MB, 2496 MB, and 448 MB in mind; they are compared with the calculations below.

Indicator information:

JVM core parameters:

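-Xms2048M                               # Initial heap size at startup
-Xmx4096M                               # Maximum heap size
-Xmn1600M                               # Young generation size (including From and To); From and To take 20% of the young generation by default
-XX:MaxPermSize=256M                    # Maximum non-heap memory, allocated on demand
-XX:MetaspaceSize=256M                  # Metaspace size. JDK 1.8 removed the permanent generation (PermGen) and added the metaspace, which lives in native memory rather than in the JVM heap, so by default its size is limited only by native memory. It stores metadata for classes and class loaders
-XX:CMSInitiatingOccupancyFraction=70   # Trigger Full GC when old-generation occupancy reaches 70%
-XX:+UseCMSInitiatingOccupancyOnly      # Disable the pessimistic strategy during YGC (checking before and after YGC whether a Full GC is needed); run Full GC only when the threshold is reached
-XX:+UseConcMarkSweepGC                 # Use CMS as the garbage collector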

Tomcat core parameters:

maxThreads=750       # Maximum number of threads the Tomcat thread pool can create
minSpareThreads=50   # Initial size of the Tomcat thread pool, i.e. the minimum number of threads it keeps alive
acceptCount=1000     # Maximum length of the request queue Tomcat maintains

These metrics give a rough picture of where the system's performance limits lie. First, the JVM parameters imply the following:

Maximum heap size: 4096M

Young generation: 1600M (1280M Eden, 160M Survivor From, 160M Survivor To)

Maximum old generation size (2496M, the 2496 MB above) = maximum heap (4096M) - young generation (1600M)

Initial heap size: 2048M

Initial old generation size (448M, the 448 MB above) = initial heap (2048M) - young generation (1600M)

According to the JVM startup parameters, a collection starts when old-generation occupancy reaches 70%. When the system was collecting, the old generation was 72% full (the 72% above) and stayed above 70% throughout. Used memory was therefore about 0.72 × 2496 MB ≈ 1797 MB, which matches the 1794 MB above (1794 / 2496 ≈ 71.9%).
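
The arithmetic above can be checked with a few lines of code. This is only a worked calculation: `HeapLayout` and its method names are made up for illustration, and the survivor split assumes the default SurvivorRatio of 8 (Eden : From : To = 8 : 1 : 1), which is consistent with the 1280M/160M/160M figures in the text.

```java
/** Worked arithmetic for the heap layout implied by -Xms, -Xmx and -Xmn. All sizes in MB. */
final class HeapLayout {
    private HeapLayout() {}

    /** Old generation = heap minus young generation. */
    static long oldGen(long heapMb, long youngMb) {
        return heapMb - youngMb;
    }

    /** One survivor space, assuming SurvivorRatio=8 (Eden:From:To = 8:1:1). */
    static long survivor(long youngMb) {
        return youngMb / 10;
    }

    /** Eden is what remains of the young generation after both survivor spaces. */
    static long eden(long youngMb) {
        return youngMb - 2 * survivor(youngMb);
    }

    /** Old-generation occupancy, in MB, at which CMS starts a collection. */
    static double cmsThreshold(long oldGenMb, int occupancyPercent) {
        return oldGenMb * occupancyPercent / 100.0;
    }

    public static void main(String[] args) {
        long xms = 2048, xmx = 4096, xmn = 1600;
        System.out.println("max old gen     = " + oldGen(xmx, xmn));            // 2496
        System.out.println("initial old gen = " + oldGen(xms, xmn));            // 448
        System.out.println("eden            = " + eden(xmn));                   // 1280
        System.out.println("each survivor   = " + survivor(xmn));               // 160
        System.out.println("CMS threshold   = " + cmsThreshold(2496, 70));      // 1747.2
        System.out.println("observed usage  = " + (1794 * 100.0 / 2496) + "%"); // 71.875%
    }
}
```

The observed 1794 MB sits above the 1747.2 MB threshold, which is why a collection is triggered again as soon as the previous one finishes.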

Heap dump analysis:

Before examining the heap dump, check the GC cause with jstat -gccause [pid] 1000. The result is shown in the figure below: the LGCC column gives the cause of the last GC. CMS Initial Mark and CMS Final Remark are the initial-mark and final-remark phases of a CMS collection, the two phases that pause the application (STW, Stop The World).
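
When jstat is not at hand, a process can watch its own collection counts through the standard GarbageCollectorMXBean API. The sketch below is a rough stand-in, not a replacement: it only counts collections per collector and does not report the cause that the LGCC column shows. `GcCounters` is a hypothetical name, and the collector names returned depend on the garbage collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: samples cumulative GC counts and reports the change between samples. */
final class GcCounters {
    private GcCounters() {}

    /** Cumulative collection count per collector since JVM start. */
    static Map<String, Long> snapshot() {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            counts.put(gc.getName(), gc.getCollectionCount()); // -1 if the count is undefined
        }
        return counts;
    }

    /** Number of collections per collector that happened between two snapshots. */
    static Map<String, Long> delta(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> diff = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            long previous = before.getOrDefault(e.getKey(), 0L);
            diff.put(e.getKey(), e.getValue() - previous);
        }
        return diff;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Long> before = snapshot();
        Thread.sleep(1000); // same 1000 ms interval as the jstat command above
        System.out.println("Collections in the last second: " + delta(before, snapshot()));
    }
}
```

A healthy service usually shows zero or very few old-generation collections per interval; a counter that increases in every interval is the pattern this incident produced.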

Heap dump command: jmap -dump:live,format=b,file=jmapfile.hprof [pid]. The exported file is analyzed with MAT (full name: Memory Analyzer), a tool for analyzing heap memory. Download link: http://eclipse.org/mat/downloads.php

The heap dump analysis shows that 50 org.apache.tomcat.util.threads.TaskThread instances take up most of the space, 96.16% in total.

Each TaskThread instance retains about 36M.

The memory details show that the largest and most numerous objects are the SessionInMemory objects stored in ThreadLocal.

Final cause:

Putting the JVM parameters, the Tomcat parameters, and the heap dump together, the cause of the memory leak is that each thread holds a ThreadLocal containing a large number of SessionInMemory objects. Tomcat starts with 50 core threads and each thread retains about 36M, so together they hold about 50 × 36M = 1800M (about 1.8G). Garbage collection is triggered once the old generation reaches 2496M × 0.7 = 1747.2M, and 1.8G is just above that threshold. Because the objects held by the threads cannot be reclaimed, occupancy stays above the threshold and the system runs Full GC over and over.

Locating the problem

The memory analysis above shows that the leak comes from the large number of SessionInMemory objects in each thread. The next step is to read the code carefully to find out why so many objects are never destroyed.

A first look shows that SessionInMemory is a class in the shiro-redis toolkit that wraps the Session information together with its creation time. Its job is to act as a cache layer inside the JVM for the current thread, so that when the system fetches the Session frequently it does not have to go to redis every time. A SessionInMemory object holds the data Shiro stores once it has determined that the user logged in successfully, mainly user information, authentication information, and permission information. This way the user is not authenticated again after a successful login, and Shiro can make permission decisions for different users.

Reading the code revealed an obvious problem in how the locally cached Session is handled. I drew a simple flow chart; before introducing it, here is how the Session is tied to the user's login.

As we all know, the operations back office requires users to log in. After a successful login a cookie is generated and saved in the browser. The cookie stores a key field, sessionId, that identifies the user's status and information. When the user visits a page and an interface is called, Shiro reads the sessionId from the cookie in the Request and uses this unique identifier to create a Session that stores the user's login status and login information. This information is saved in redis. The shiro-redis component takes the Session information fetched from redis and keeps it isolated per thread through ThreadLocal.

The process in the figure above can be summarized as follows: when the user accesses a page, the Session is first looked up in the local cache. If it exists and is less than one second old, it is returned. If it does not exist, a new one is created and returned. If it has expired, it is deleted and a new one is created and returned. Overall, the idea is clear.

Problems hidden in the flow chart (core problem)

1. Multiple threads each copy the same Session, multiplying memory usage (same Session, different threads)

For example: a user logs in and a session is generated in the back office. Assume all requests go to one machine, the first request goes to thread 1, and the second goes to thread 2. The session is the same but the threads are isolated, so thread 1 and thread 2 each create a copy of the same session and store it in ThreadLocal. The larger Tomcat's minimum number of spare threads, the more copies are made. Because Tomcat's core threads are never shut down, the resources they hold are never released. This raises a question: the key of ThreadLocal is a weak reference, so why is it not collected? This is answered in the FAQ below.

2. Old Sessions are never cleared (same thread, different Sessions)

Example 1: Assume all requests go to the same thread on one machine. The user logs in to the back office for the first time, which generates Session1, and the first request goes to thread 1. All requests finish within 1 second, so the Session is not removed (removal is lazy: it only happens when the same Session is accessed again and found to be expired). The user then logs in again, which generates Session2. Because Session2 is not yet in thread 1, it is added to this thread, while Session1 stays there.

Example 2: Following Example 1, if the user's requests under Session1 do not all finish within 1 second, the lazy deletion is carried out, but a new entry is created right after the deletion. That newly created Session is then never deleted once the user logs in again. The conclusion is that whenever a user logs in again, an old Session is left behind in the thread.
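
The first problem, one copy per thread, can be reproduced in isolation. The sketch below is a simplified model and not shiro-redis code: `PerThreadCopies`, the byte array standing in for session data, and the barrier that forces every request onto a different pool thread are all assumptions made for the demonstration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model: one static ThreadLocal cache shared by the worker threads of a pool. */
final class PerThreadCopies {
    private PerThreadCopies() {}

    // A single static ThreadLocal; each thread still gets its own map.
    private static final ThreadLocal<Map<String, byte[]>> CACHE =
            ThreadLocal.withInitial(HashMap::new);

    /** Serves one request per pool thread for the same session and counts the copies made. */
    static int copiesCreated(int poolThreads, String sessionId) {
        AtomicInteger copies = new AtomicInteger();
        CyclicBarrier barrier = new CyclicBarrier(poolThreads);
        ExecutorService pool = Executors.newFixedThreadPool(poolThreads);
        try {
            for (int i = 0; i < poolThreads; i++) {
                pool.submit(() -> {
                    barrier.await(); // every request waits here, so each runs on its own thread
                    if (!CACHE.get().containsKey(sessionId)) {
                        CACHE.get().put(sessionId, new byte[1024]); // stand-in for session data
                        copies.incrementAndGet();
                    }
                    return null;
                });
            }
            pool.shutdown();
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
            return copies.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // One session, 50 worker threads: 50 separate copies are created.
        System.out.println(copiesCreated(50, "session-1")); // prints 50
    }
}
```

With 50 spare threads, as in the Tomcat configuration above, a single logged-in user can end up with up to 50 cached copies of the same session.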

Code analysis

1. RedisSessionDAO.java defines a ThreadLocal variable to provide thread isolation

2. When users access resources such as interfaces, js files, and css files, the request passes through Shiro's interception mechanism. During interception, the doReadSession() method is called frequently to obtain the user's Session information, mainly in order to check the user's permissions.

The following method combines getting and setting the Session. If the Session is not found in ThreadLocal, or the local copy is more than 1 second old, the lookup returns null. In that case the Session is fetched from redis and a new copy is stored in ThreadLocal.

3. Take the sessionMap out of the ThreadLocal and look up the Session in it by sessionId. If it is not found, return null. If it is found, check whether it is more than 1 second old: if not, return the Session; if so, remove it and return null.

4. Get the sessionMap from the ThreadLocal. If it is null, create a new one and save it, because the thread has no sessionMap the first time a user's request reaches it. Then store the Session object in the sessionMap.

To summarize the code flow: the Session is obtained by calling getSessionFromThreadLocal(), which returns null if no Session is found; setSessionToThreadLocal() is then called to store a Session again. A Session that has been held in the current thread for more than 1 second is removed on the next lookup.
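
The get/set flow in steps 3 and 4 can be sketched as follows. This is a simplified model written from the description above, not the shiro-redis source: the session payload is reduced to a String, the current time is passed in as a parameter so that the behaviour is easy to follow, and `clearThisThread()` and `sessionsHeldByThisThread()` are helpers added for the demonstration.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the per-thread session cache described above. */
final class ThreadLocalSessionCache {
    private ThreadLocalSessionCache() {}

    static final long TTL_MILLIS = 1000;

    /** Stand-in for SessionInMemory: the session payload plus its creation time. */
    static final class SessionInMemory {
        final String session;
        final long createdAt;

        SessionInMemory(String session, long createdAt) {
            this.session = session;
            this.createdAt = createdAt;
        }
    }

    private static final ThreadLocal<Map<String, SessionInMemory>> sessionsInThread =
            new ThreadLocal<>();

    /** Step 3: return the cached session, or null if it is absent or older than the TTL. */
    static String getSessionFromThreadLocal(String sessionId, long now) {
        Map<String, SessionInMemory> sessionMap = sessionsInThread.get();
        if (sessionMap == null) {
            return null;
        }
        SessionInMemory cached = sessionMap.get(sessionId);
        if (cached == null) {
            return null;
        }
        if (now - cached.createdAt > TTL_MILLIS) {
            sessionMap.remove(sessionId); // lazy deletion: only this id, only when it is read again
            return null;
        }
        return cached.session;
    }

    /** Step 4: create the map on first use, then store the session. */
    static void setSessionToThreadLocal(String sessionId, String session, long now) {
        Map<String, SessionInMemory> sessionMap = sessionsInThread.get();
        if (sessionMap == null) {
            sessionMap = new HashMap<>();
            sessionsInThread.set(sessionMap);
        }
        sessionMap.put(sessionId, new SessionInMemory(session, now));
    }

    /** Number of sessions the current thread is holding. */
    static int sessionsHeldByThisThread() {
        Map<String, SessionInMemory> sessionMap = sessionsInThread.get();
        return sessionMap == null ? 0 : sessionMap.size();
    }

    /** Drops everything cached for the current thread. */
    static void clearThisThread() {
        sessionsInThread.remove();
    }

    public static void main(String[] args) {
        setSessionToThreadLocal("session1", "first login", 0);
        // The user logs in again. session1 is never read again, so it is never evicted.
        setSessionToThreadLocal("session2", "second login", 5_000);
        getSessionFromThreadLocal("session2", 5_500);
        System.out.println(sessionsHeldByThisThread()); // prints 2
    }
}
```

Nothing in this flow ever visits an entry whose sessionId is no longer requested, which is how old sessions pile up in long-lived threads.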

The analysis of the JVM, Tomcat, the heap dump, and the code has now located the problem: shiro-redis handles the stored SessionInMemory objects improperly, so each thread accumulates more and more of them, which eventually leads to a memory leak and frequent Full GC. We were using shiro-redis 3.2.2, which has this defect. The author released version 3.2.3 in March 2019 to fix it. Note: the problem exists in version 3.2.2 and earlier.

Solve the problem

There are currently four possible solutions. For our system we use Option 1 + Option 4.

| No. | Description | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Option 1 | Each time a session is set, traverse the cache and delete sessions that are expired or null | Active deletion; the deletion frequency depends on how often users access the system | If there are many user visits within 1 second, there are many sessions in total but few invalid ones, so traversing all of them is mostly wasted work and slows down access |
| Option 2 | Drop the ThreadLocal strategy; every request queries the cache (redis) directly | Reduces local memory usage | Accessing redis takes longer than accessing local memory. Testing showed that one interface performs the get-session operation about 16 times, and a page calls dozens of interfaces, so querying redis directly causes performance problems |
| Option 3 | Use a local cache (Guava Cache, EhCache, etc.) with an eviction policy | Threads share one copy in memory, which saves memory and improves system performance | Requires a deep understanding of the framework; integration has a development cost |
| Option 4 | Reduce the number of Tomcat core threads, for example from 50 to 5 | Uses fewer system resources, reduces the number of copies of the same session, and resources held by threads beyond the 5 are reclaimed when those threads are destroyed | Slightly lower concurrent processing capacity |
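
Option 1 can be sketched in a few lines. This is an illustration of the idea only, not the fix that shipped in shiro-redis 3.2.3: `SweepOnSet` and `CachedSession` are hypothetical names, the time is passed in as a parameter, and in real use the map would be the per-thread map held in the ThreadLocal.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of Option 1: sweep expired or null entries every time a session is stored. */
final class SweepOnSet {
    private SweepOnSet() {}

    static final long TTL_MILLIS = 1000;

    /** Stand-in for the cached session object: payload plus creation time. */
    static final class CachedSession {
        final String session;
        final long createdAt;

        CachedSession(String session, long createdAt) {
            this.session = session;
            this.createdAt = createdAt;
        }
    }

    /** Removes every stale entry, then stores the new session. */
    static void put(Map<String, CachedSession> sessions, String id, String session, long now) {
        // This traversal is the cost listed in the table: it walks all entries on every set.
        sessions.values().removeIf(c -> c == null || now - c.createdAt > TTL_MILLIS);
        sessions.put(id, new CachedSession(session, now));
    }

    public static void main(String[] args) {
        Map<String, CachedSession> sessions = new HashMap<>();
        put(sessions, "session1", "first login", 0);
        put(sessions, "session2", "second login", 5_000); // session1 is swept here
        System.out.println(sessions.keySet()); // prints [session2]
    }
}
```

The sweep bounds the map to sessions seen within the last second, at the price of one full traversal per set, which is the trade-off the table describes.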

FAQ

Q: RedisSessionDAO defines only one ThreadLocal variable, sessionsInThread. How can 50 threads hold 50 copies of the same Session?

A: First, let's look at the structure of ThreadLocal. ThreadLocal has a static nested class, ThreadLocalMap, which in turn has an Entry class. Our key and value are stored in an Entry. The key is a weak reference to a ThreadLocal, and it is the same in every thread: it is the static sessionsInThread we defined. So how is thread isolation achieved?

The answer lies in threadLocals, a member variable of Thread whose type is ThreadLocal.ThreadLocalMap. Each thread has its own ThreadLocalMap (created when the thread first uses a ThreadLocal), so the ThreadLocalMap differs from thread to thread, while the key stored in its Entry is the same: the static sessionsInThread variable defined earlier.

When a thread needs the value stored in the Entry, it calls sessionsInThread.get(). This method does three things: it obtains the current thread instance, obtains the ThreadLocalMap from that thread instance, and looks up the value in the ThreadLocalMap using the ThreadLocal as the key.

Getting the ThreadLocalMap from the Thread

Getting the value from the ThreadLocalMap. One more question: why is the Entry fetched from a table array? The reason is simple: a thread does not necessarily have only one ThreadLocal variable. Multiple ThreadLocal variables mean multiple keys, so the entries are kept in the table array.
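
The three steps of get() are easier to see in a toy model. The classes below are a deliberately simplified imitation, not the JDK implementation: the real ThreadLocalMap keeps weakly referenced keys in a hash table array inside Thread, whereas here `MiniThread` carries a plain HashMap. All names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of how ThreadLocal stores values; the real JDK classes differ in detail. */
final class MiniThreadLocalDemo {
    private MiniThreadLocalDemo() {}

    /** Plays the role of Thread: each thread owns its own map (Thread.threadLocals in the JDK). */
    static final class MiniThread extends Thread {
        final Map<MiniThreadLocal<?>, Object> threadLocals = new HashMap<>();

        MiniThread(Runnable task) {
            super(task);
        }
    }

    /** Plays the role of ThreadLocal: it stores nothing itself and only serves as the map key. */
    static final class MiniThreadLocal<T> {
        @SuppressWarnings("unchecked")
        T get() {
            MiniThread current = (MiniThread) Thread.currentThread();   // 1. current thread
            Map<MiniThreadLocal<?>, Object> map = current.threadLocals; // 2. that thread's map
            return (T) map.get(this);                                   // 3. look up by key
        }

        void set(T value) {
            ((MiniThread) Thread.currentThread()).threadLocals.put(this, value);
        }
    }

    // One shared key, like the static sessionsInThread field.
    static final MiniThreadLocal<String> KEY = new MiniThreadLocal<>();

    /** Runs two threads that write and read back through the same key. */
    static String[] runTwoThreads() {
        String[] seen = new String[2];
        MiniThread a = new MiniThread(() -> { KEY.set("copy held by A"); seen[0] = KEY.get(); });
        MiniThread b = new MiniThread(() -> { KEY.set("copy held by B"); seen[1] = KEY.get(); });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] seen = runTwoThreads();
        System.out.println(seen[0] + " / " + seen[1]); // prints copy held by A / copy held by B
    }
}
```

The key is shared, but the value lives in whichever thread's map was used, so 50 threads mean 50 independent values.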

Q: The key of ThreadLocal is said to be a weak reference, which the garbage collector is free to reclaim. Why does the heap dump show that our key was not collected?

A: Good question. Our RedisSessionDAO is a singleton injected by Spring, and the ThreadLocal is defined as a static variable, so it is always strongly reachable and is never collected. As a side note, a ThreadLocal is normally defined as a static variable. If it were a non-static field, every new object would create a new ThreadLocal, which would defeat its purpose.

Q: Why do threads stay alive after they finish their work, and why do the objects they hold not disappear?

A: Because the minimum number of spare threads is set to 50 and concurrency never exceeds 50, Tomcat keeps that minimum number of threads alive, neither creating new ones nor recycling them. ThreadLocalMap is a member variable of the thread, so it is not collected either.
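
The effect of a long-lived pool thread can be shown with a single-thread executor standing in for one Tomcat core thread. This is a minimal sketch under that assumption: `PooledThreadRetention` and the cleanUp flag are made up for the example. The `ThreadLocal.remove()` call in the finally block is the general-purpose remedy for this kind of leak, which goes beyond the options the article evaluates.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Shows that a value stored in a ThreadLocal outlives the task that stored it. */
final class PooledThreadRetention {
    private PooledThreadRetention() {}

    private static final ThreadLocal<String> CACHED = new ThreadLocal<>();

    /** Runs two tasks on one long-lived worker and returns what the second task finds. */
    static String valueSeenByNextTask(boolean cleanUp) {
        ExecutorService pool = Executors.newSingleThreadExecutor(); // one reusable worker thread
        Runnable request1 = () -> {
            try {
                CACHED.set("session left by request 1");
                // ... request handling would happen here ...
            } finally {
                if (cleanUp) {
                    CACHED.remove(); // release the value before the thread returns to the pool
                }
            }
        };
        Callable<String> request2 = CACHED::get;
        try {
            pool.submit(request1).get();
            return pool.submit(request2).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(valueSeenByNextTask(false)); // prints session left by request 1
        System.out.println(valueSeenByNextTask(true));  // prints null
    }
}
```

Without the cleanup, the value stays reachable for as long as the worker thread lives, which for a Tomcat core thread means until the server is restarted.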

Q: Is a new sessionId generated every time an interface is accessed?

A: When an interface is accessed, the system first checks whether the user information is still valid. Only if it is invalid does the user log in again and obtain a new sessionId.

Q: Why does shiro-redis use a 1-second expiration time when caching the Session locally?

A: Because the operations back office differs from business interfaces, which are called continuously. In most back-office scenarios the user opens a page and stays on it to perform some operations. When a page is opened, the browser loads multiple resources, including static resources such as html, css, and js as well as dynamic data from interfaces. The whole loading process should finish within one second where possible; if it takes longer, the user experience suffers. Caching locally for one second is therefore enough.

Lessons learned

Before the alarm:

1. Understand how third-party jar packages work before using them, especially packages developed by individuals; they have not been proven in production at scale, so use them with care

2. Use jvisualvm during local stress testing to observe the JVM

3. Pay attention to monitoring and alarms, learn how to operate the monitoring platform, and be able to look up the system's metrics there

4. Configure JVM and Tomcat parameters sensibly for the business

After the alarm:

1. Capture the system's JVM information as soon as possible, such as the heap dump, GC information, and thread stacks

2. Use the MAT memory analysis tool to help analyze the cause of the problem

Author: Guo Yinli, Jingdong Technology

Source: JD Cloud Developer Community


Origin my.oschina.net/u/4090830/blog/10090621