2020-11-18: A record of a Full GC troubleshooting session

  1. The system suddenly stalled. Because stalls had been seen earlier, GC logging was already enabled. I logged in to the server to look: `top` showed CPU at 100%, the Java process using 4.7 GB of memory, and the GC log filling up with back-to-back Full GC entries;
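For reference, a typical way to enable that kind of GC logging at launch (JDK 8 flag syntax; on JDK 9+ the unified `-Xlog:gc*:file=...` form replaces these flags). The log path and jar name here are placeholders, not the project's actual values:

```shell
java -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/gc.log -jar app.jar
```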
  2. `top -Hp <pid>` showed the CPU was being consumed by the garbage-collection thread ("VM Thread"). To match that thread against `jstack` output, its thread id has to be converted from decimal to hexadecimal; I captured the thread stacks with `jstack` at the same time;
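The decimal-to-hex step looks like this: `top -Hp` lists per-thread CPU usage with decimal thread ids, while `jstack` prints them as hex (`nid=0x...`). A quick conversion (the tid here is just an example):

```shell
# Convert a decimal thread id from `top -Hp` to the hex form jstack uses
printf '%x\n' 30217   # prints 7609, i.e. look for nid=0x7609 in jstack output
```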
  3. I exported a heap dump with `jmap`. Because the file was large and the server's download bandwidth was low, I used `scp` to copy it to the 190 server first, then downloaded it locally with WinSCP;
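The command sequence, roughly, was the following — illustrative only, with `<pid>` and the host as placeholders (note that `-dump:live` triggers a full GC before dumping, so it captures only reachable objects):

```shell
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
scp /tmp/heap.hprof user@<190-server>:/tmp/
```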
  4. I then opened the heap dump file with jvisualvm;
  5. `char[]` instances occupied about 3 GB, with POI objects ranked second and third;
  6. Analyzing the character arrays showed that most of them came from sessions, and looking at the session objects themselves, the system was holding more than 10,000 of them;
  7. So many session objects existed because the system never set a session timeout. In a web project, users simply close the browser rather than clicking "log out", so the server never cleans up the stale sessions;
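For a standard servlet web app, one common place to set that timeout is `web.xml`. This fragment is a sketch, not the project's actual config; 1440 minutes corresponds to the one-day validity period chosen as the fix:

```xml
<!-- inside <web-app> in WEB-INF/web.xml -->
<session-config>
    <!-- timeout is in minutes; 1440 minutes = 1 day -->
    <session-timeout>1440</session-timeout>
</session-config>
```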
  8. Now for the second problem. Since the servers run as a cluster, when one of them went down, the problem server was quickly removed from nginx for troubleshooting. In the afternoon, after another Full GC, the problem server threw an OOM and then returned to normal;
  9. Of course, the available heap at that point was already very small; with continued use, Full GCs would soon resume. The exception log printed by the OOM pointed at POI, so I checked the code for that feature. The data export takes a start and end date, and I suspected some users had not selected one;
  10. The user operation log confirmed it: some users clicked export twice without selecting a date range. That was the fuse that set off this problem;
  11. The fix: set the session validity period to one day, and add date-range validation to the data export, capping transaction-record exports at 6 months. This prevents a full export of more than 100,000 rows from eating too much system memory when a user forgets, or doesn't realize, that they are about to export everything;
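The date-range validation can be sketched like this — the class and method names are mine, not the project's, and this assumes `java.time` is available (JDK 8+):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical validation helper: reject exports with a missing or
// inverted date range, or a span of more than 6 months.
public class ExportRangeValidator {
    static final long MAX_MONTHS = 6;

    static boolean isValidRange(LocalDate start, LocalDate end) {
        if (start == null || end == null || end.isBefore(start)) {
            return false; // a missing date was exactly what triggered the full export
        }
        return ChronoUnit.MONTHS.between(start, end) <= MAX_MONTHS;
    }

    public static void main(String[] args) {
        System.out.println(isValidRange(LocalDate.of(2020, 1, 1), LocalDate.of(2020, 5, 1))); // true
        System.out.println(isValidRange(null, LocalDate.of(2020, 5, 1)));                     // false
        System.out.println(isValidRange(LocalDate.of(2020, 1, 1), LocalDate.of(2020, 8, 2))); // false
    }
}
```

Rejecting the request up front is much cheaper than letting POI build a 100,000-row workbook in heap and discovering the problem at OOM time.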
  12. As for the excessive character arrays, the problem is genuinely hard to pin down because strings are used in so many places. Still, the dump shows the reference chain for each object, and OQL queries are also supported (though I have not used OQL and am not familiar with it), which is basically enough for this kind of troubleshooting;
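For completeness, here is the flavor of query the OQL console accepts. This is a jhat-style query using the `heap.objects` and `count` built-ins; the session class name is my guess for a Tomcat deployment, and I have not run this against the dump in question:

```
// count session objects in the dump (class name is an assumption)
select count(heap.objects('org.apache.catalina.session.StandardSession'))
```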
  13. One more observation from this exercise: the memory size the dump viewer displays for an object appears to be the shallow size of the object itself, not including the objects it references, which is another reason these problems are hard to trace;
  14. This article only records the investigation process and is not meant as a technical tutorial, so the technical points are not written out in detail. If you run into a similar problem and need help, leave a comment to get in touch and I will help as much as I can;


Origin blog.csdn.net/weixin_44182586/article/details/109787989