The pit of frequent FGC (Full GC)

Reproducing the problem

At the start of 2020 I troubleshot a GC problem in production. The most direct symptom: on January 2 the online Elasticsearch query service suddenly raised alarms, and almost at the same time the business side reported that a real-time report query page was throwing errors. The backend logs were piling up with java.lang.OutOfMemoryError: GC overhead limit exceeded, and the Eureka registry page showed that no heartbeat had been detected for the service, so it had been evicted from the service list. It was an emergency: this was the first time any node in this cluster had shown the problem, and there had been no recent release, which caught us off guard. The immediate priority was to keep the business running and restore the service, so we raised the heap from the original -Xms128m -Xmx512m to -Xms1024m -Xmx1024m and restarted. The service came back, the background logs looked normal, and everything seemed fine.
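The post does not say how the new settings were verified; as a minimal sketch (an addition, not part of the original article), a quick check like the one below, run inside the restarted service or as a one-off class, confirms that the new -Xms1024m -Xmx1024m values actually took effect:

    // Sanity check that the resized heap settings were picked up after the restart.
    public class HeapCheck {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            System.out.printf("max heap  : %d MB%n", rt.maxMemory() / (1024 * 1024));
            System.out.printf("total heap: %d MB%n", rt.totalMemory() / (1024 * 1024));
            System.out.printf("free heap : %d MB%n", rt.freeMemory() / (1024 * 1024));
        }
    }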

Investigating the cause

After running for a while the business errors returned. Analyzing the page's Network panel showed that the corresponding backend interface calls were timing out. Some ES queries were taking as long as 49 s, and threads were stalling; many of the stalls were around 15 s.

Initial guesses, given that the service had been running in production for more than six months and ES had accumulated a fair amount of data:

1. Since ES carries the heavy query operations, was ES performance the problem? Checking ES itself, its resource usage and response times were both normal.

2. Was the service code failing to release resources somewhere?
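The article only states that ES resource usage and response times checked out as normal. One way to make that check, sketched below under the assumption that the cluster is reachable at localhost:9200 (the real address is not given in the post), is to query the cluster health and node stats endpoints:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Hits _cluster/health (overall status) and _nodes/stats (JVM/OS resource usage).
    public class EsHealthCheck {
        public static void main(String[] args) throws Exception {
            for (String path : new String[]{"/_cluster/health?pretty", "/_nodes/stats/jvm,os?pretty"}) {
                URL url = new URL("http://localhost:9200" + path);   // assumed address
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(3000);
                conn.setReadTimeout(5000);
                try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                    System.out.println("== " + path + " ==");
                    String line;
                    while ((line = in.readLine()) != null) {
                        System.out.println(line);
                    }
                }
            }
        }
    }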

Measures taken

1. Added log printing for the critical ES queries, and added GC logging to the startup script with the following options: -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:/data/app/server/es-query-server/heapEs.log. Also set -XX:ConcGCThreads to adjust the number of concurrent GC threads (see the verification sketch after this list).

2. Further expanded the heap to -Xms2048m -Xmx2048m.

3. Temporarily took the heavyweight real-time report queries offline, keeping only two lightweight list queries. After an emergency release the service was back to normal again.
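The post does not show how the new flags were confirmed; as a small sketch (an addition, not part of the original), the JVM's RuntimeMXBean can list the startup arguments the process actually received, which is a quick way to confirm the GC-logging and heap options were picked up:

    import java.lang.management.ManagementFactory;
    import java.lang.management.RuntimeMXBean;

    // Prints the JVM input arguments; expect to see -Xloggc:..., -XX:+PrintGCDetails,
    // -Xmx2048m and the other options added to the startup script.
    public class JvmFlagsCheck {
        public static void main(String[] args) {
            RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
            for (String arg : runtime.getInputArguments()) {
                System.out.println(arg);
            }
        }
    }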

Locating and fixing the problem

The error came back the next afternoon. This time the logs showed that the ES queries themselves performed very well, which ruled out ES as the cause. The problem therefore had to be a defect in our own service code. Analyzing the GC log with GCViewer showed that, over the later time period, Full GC had been running continuously while the heap was never effectively reclaimed. That also explains the earlier stalls: the frequent FGCs put the service into repeated stop-the-world pauses.
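Besides reading the log in GCViewer, the same signal can be watched from inside the JVM. The sketch below is an illustration, not part of the original troubleshooting; it uses the standard GarbageCollectorMXBean interface (the old-generation bean is typically named "PS MarkSweep", "MarkSweepCompact" or "ConcurrentMarkSweep", depending on the collector) and prints cumulative collection counts and times, which climb rapidly when Full GC runs back to back without freeing the heap:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Periodically reports cumulative GC counts and time per collector.
    public class GcWatcher {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.printf("%-20s count=%d time=%dms%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                System.out.println("--");
                Thread.sleep(5_000);
            }
        }
    }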

 

We captured a heap dump with jmap -dump:live,format=b,file=dump_075.bin 133333 (133333 being the service's pid) and analyzed the dump file with jvisualvm, the monitoring tool that ships with the JDK.
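As an aside (not from the original post), the same kind of dump can also be produced from inside the JVM via the HotSpot diagnostic MXBean; passing true keeps only live objects, matching jmap's -dump:live behavior:

    import com.sun.management.HotSpotDiagnosticMXBean;
    import java.lang.management.ManagementFactory;

    // Writes an hprof heap dump of live objects; the output path here is illustrative.
    public class HeapDumper {
        public static void main(String[] args) throws Exception {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap("/tmp/dump_075.hprof", true);
        }
    }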

jvisualvm showed that the largest objects in the dump were char[] arrays, most of them belonging to request objects. Given how the service is used, there should not have been nearly that many of them, so we further suspected an infinite loop. In the test environment we restored the large real-time report list query that had been taken offline, and the list page indeed hung and never displayed the query results. The background error logs then pointed to the root cause: a cross-quarter date calculation whose mismatched condition never terminated, producing an endless loop.
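The post does not show the offending code, so the following is only a hypothetical reconstruction of how a cross-quarter date loop can fail to terminate: the loop advances a cursor by whole quarters and exits only on an exact date match, so an end date that is not aligned to a quarter boundary is never matched and the loop allocates forever, which would be consistent with the char[]-dominated heap dump. The fixed version exits once the cursor passes the end date.

    import java.time.LocalDate;
    import java.util.ArrayList;
    import java.util.List;

    public class QuarterRange {
        // BUG (hypothetical): never terminates when "end" is not exactly a quarter step away.
        static List<LocalDate> quarterStartsBuggy(LocalDate start, LocalDate end) {
            List<LocalDate> quarters = new ArrayList<>();
            LocalDate cursor = start.withDayOfMonth(1);
            while (!cursor.equals(end)) {
                quarters.add(cursor);
                cursor = cursor.plusMonths(3);
            }
            return quarters;
        }

        // FIX: terminate as soon as the cursor passes the end date.
        static List<LocalDate> quarterStartsFixed(LocalDate start, LocalDate end) {
            List<LocalDate> quarters = new ArrayList<>();
            LocalDate cursor = start.withDayOfMonth(1);
            while (!cursor.isAfter(end)) {
                quarters.add(cursor);
                cursor = cursor.plusMonths(3);
            }
            return quarters;
        }

        public static void main(String[] args) {
            System.out.println(quarterStartsFixed(LocalDate.of(2019, 1, 1), LocalDate.of(2019, 12, 15)));
            // quarterStartsBuggy(...) with the same arguments would never return.
        }
    }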

Summary

Performance tuning problems are often complex. Locating them accurately depends on how familiar the developer or tester is with the underlying mechanisms, and on using diagnostic tools flexibly to narrow the problem down. Errors that only occur by chance are especially easy to overlook. Accumulating experience and studying the underlying principles helps when troubleshooting complex problems.


Source: blog.csdn.net/pliulijia/article/details/104050854