Troubleshooting an OOM caused by improper use of OSSClient

First published: public account "Zhao Xiake"

Preface

Recently an OOM occurred in production, in a fairly peripheral project. Fortunately this project only handles offline tasks, so the OOM had no impact on online business. This post records the troubleshooting process.

Checking the dump and GC logs

The main JVM parameter settings of the project configuration are as follows:

-Xmx5120m -XX:+PreserveFramePointer -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/usr/local/update/heap_trace.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/update/dump.log 

The maximum heap is set to 5 GB, and the JVM is configured to write GC logs and to dump the heap on OOM. Let's first look at the heap dump produced by the OOM: dump.log is a full 5 GB, so the first guess is a memory leak.

Next I looked at the GC log, heap_trace.log. Each of the last few GCs took about 0.02 seconds and did not release much memory, which again points to a memory leak.

2023-09-18T09:58:28.259+0800: 234057.213: [GC (Allocation Failure) [PSYoungGen: 438400K->7648K(441344K)] 763838K->333358K(961024K), 0.0140907 secs] [Times: user=0.04 sys=0.00, real=0.02 secs]  
2023-09-18T10:01:33.925+0800: 234242.879: [GC (Allocation Failure) [PSYoungGen: 436704K->7344K(441856K)] 762414K->333326K(961536K), 0.0134861 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]  
2023-09-18T10:04:16.426+0800: 234405.380: [GC (Allocation Failure) [PSYoungGen: 437424K->8832K(441856K)] 763406K->335022K(961536K), 0.0147276 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]  
2023-09-18T10:06:30.923+0800: 234539.877: [GC (Allocation Failure) [PSYoungGen: 438912K->11520K(442368K)] 765102K->338158K(962048K), 0.0202829 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]  
2023-09-18T10:08:27.655+0800: 234656.609: [GC (Allocation Failure) [PSYoungGen: 442112K->12272K(442880K)] 768750K->340510K(962560K), 0.0216111 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]  
2023-09-18T10:11:37.773+0800: 234846.727: [GC (Allocation Failure) [PSYoungGen: 442864K->12000K(445440K)] 771102K->340918K(965120K), 0.0243473 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]  
2023-09-18T10:14:56.925+0800: 235045.879: [GC (Allocation Failure) [PSYoungGen: 443616K->8192K(445952K)] 772534K->337110K(965632K), 0.0152287 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]  
2023-09-18T10:17:49.358+0800: 235218.312: [GC (Allocation Failure) [PSYoungGen: 439808K->8432K(445952K)] 768726K->337790K(965632K), 0.0151303 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]  
2023-09-18T10:20:51.356+0800: 235400.310: [GC (Allocation Failure) [PSYoungGen: 441072K->8976K(446464K)] 770430K->338470K(966144K), 0.0159285 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]  
2023-09-18T10:24:05.395+0800: 235594.349: [GC (Allocation Failure) [PSYoungGen: 441616K->9504K(446464K)] 771110K->339358K(966144K), 0.0219962 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]  
2023-09-18T10:26:48.374+0800: 235757.328: [GC (Allocation Failure) [PSYoungGen: 443168K->11680K(446976K)] 773022K->341950K(966656K), 0.0195554 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
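
When reading a log like this, the figure worth tracking is the whole-heap usage after each collection (the second number in `763838K->333358K(961024K)`). Below is a minimal sketch for pulling that number out of a line; the class and method names are made up for illustration, and the regular expression assumes the ParallelGC line format shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class GcLogLine {
    // Matches the whole-heap part of a line: "] 763838K->333358K(961024K),"
    private static final Pattern HEAP =
            Pattern.compile("\\]\\s+(\\d+)K->(\\d+)K\\((\\d+)K\\),");

    /** Returns heap usage in KB after the collection, or -1 if the line does not match. */
    static long heapAfterKb(String line) {
        Matcher m = HEAP.matcher(line);
        return m.find() ? Long.parseLong(m.group(2)) : -1L;
    }
}
```

Running this over every line of heap_trace.log and plotting the values makes it easier to see whether usage after GC keeps creeping upward.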

Analyzing the dump file with JProfiler

Opening the dump file in JProfiler shows that HashMap$Node alone takes up 1 GB.

Selecting Node shows that there are more than 30 million instances.

We selected Merged incoming references and expanded the reference chain of the Node objects step by step. Eventually we found an OSSClient object that referenced them.

This service uploads a lot of files to Alibaba Cloud OSS, so I went looking for the code that uses OSSClient. It calls new OSSClient() every time a file is uploaded. But even if a new local variable is created each time, that shouldn't cause a memory leak by itself, should it?

So I took a look at the source code of DefaultServiceClient. We can see that createHttpClientConnectionManager is called when an OSSClient is created:

    public DefaultServiceClient(ClientConfiguration config) {
        super(config);
        this.connectionManager = createHttpClientConnectionManager();
        this.httpClient = createHttpClient(this.connectionManager);
        RequestConfig.Builder requestConfigBuilder = RequestConfig.custom();
        // ... (rest of the constructor omitted)
    }

createHttpClientConnectionManager uses IdleConnectionReaper to manage the current connections:

    protected HttpClientConnectionManager createHttpClientConnectionManager() {
        SSLContext sslContext = null;
        // ... (creation of connectionManager omitted)
        if (config.isUseReaper()) {
            IdleConnectionReaper.setIdleConnectionTime(config.getIdleConnectionTime());
            IdleConnectionReaper.registerConnectionManager(connectionManager);
        }
        return connectionManager;
    }

In IdleConnectionReaper.registerConnectionManager we can see that every connection manager is stored in a static ArrayList:

    public final class IdleConnectionReaper extends Thread {
        private static final int REAP_INTERVAL_MILLISECONDS = 5 * 1000;
        private static final ArrayList<HttpClientConnectionManager> connectionManagers = new ArrayList<HttpClientConnectionManager>();

        private static IdleConnectionReaper instance;

        private static long idleConnectionTime = 60 * 1000;

        private volatile boolean shuttingDown;

        private IdleConnectionReaper() {
            super("idle_connection_reaper");
            setDaemon(true);
        }

        public static synchronized boolean registerConnectionManager(HttpClientConnectionManager connectionManager) {
            if (instance == null) {
                instance = new IdleConnectionReaper();
                instance.start();
            }
            return connectionManagers.add(connectionManager);
        }
        // ... (remaining methods omitted)
    }

OSSClient provides a shutdown method. When an OSSClient instance is no longer needed, you must call shutdown to release its connections; this removes its connection manager from connectionManagers. So the OOM was indeed caused by improper use in our code.

    // Called when the client is shut down
    @Override
    public void shutdown() {
        IdleConnectionReaper.removeConnectionManager(this.connectionManager);
        this.connectionManager.shutdown();
    }

    // In IdleConnectionReaper
    public static synchronized boolean removeConnectionManager(HttpClientConnectionManager connectionManager) {
        boolean b = connectionManagers.remove(connectionManager);
        if (connectionManagers.isEmpty())
            shutdown();
        return b;
    }
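
To see why a client held only in a local variable can still leak, here is a small self-contained simulation of the pattern. It does not use the OSS SDK: `Registry`, `FakeClient` and `LeakDemo` are hypothetical stand-ins for IdleConnectionReaper's static list and a client that registers itself on construction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the reaper: a static list keeps every registered object reachable.
final class Registry {
    private static final List<Object> MANAGERS = new ArrayList<>();

    static synchronized void register(Object manager) { MANAGERS.add(manager); }
    static synchronized void remove(Object manager) { MANAGERS.remove(manager); }
    static synchronized int size() { return MANAGERS.size(); }
}

// Hypothetical stand-in for a client that registers its connection manager when constructed.
final class FakeClient {
    private final Object connectionManager = new Object();

    FakeClient() { Registry.register(connectionManager); }

    void shutdown() { Registry.remove(connectionManager); }
}

final class LeakDemo {
    /** Creates n clients as local variables and returns how many managers remain registered. */
    static int createClients(int n, boolean callShutdown) {
        int before = Registry.size();
        for (int i = 0; i < n; i++) {
            FakeClient client = new FakeClient(); // goes out of scope at the end of each iteration
            if (callShutdown) {
                client.shutdown();
            }
        }
        return Registry.size() - before;
    }
}
```

Without shutdown, every iteration leaves one more entry in the static list, so the garbage collector can never reclaim it; with shutdown, the list does not grow.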

The same problem is also described on the Alibaba Cloud official website.

Solution:

  1. Define the OSSClient instance as a singleton to avoid instantiating OSSClient multiple times in the application.
  2. Call OSSClient.shutdown() to close an OSSClient instance and release its resources.
  3. Use a try-finally block and call OSSClient.shutdown() in the finally block.
  4. Whenever an application uses an OSSClient, make sure the instance is closed once it is no longer needed.
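
The try-finally option from the list could look roughly like the sketch below. To keep it runnable without the SDK, it uses a hypothetical `Shutdownable` interface and `ClientScope` helper in place of OSSClient; only the `shutdown()` method name is taken from the source shown earlier.

```java
import java.util.function.Function;

// Hypothetical minimal interface standing in for a client with a shutdown() method.
interface Shutdownable {
    void shutdown();
}

final class ClientScope {
    /** Runs work with the client and always shuts the client down afterwards. */
    static <C extends Shutdownable, R> R use(C client, Function<C, R> work) {
        try {
            return work.apply(client);
        } finally {
            client.shutdown(); // runs on normal return and when work throws
        }
    }
}
```

With the real SDK, the same shape would be: create the OSSClient, do the upload inside try, and call shutdown() in finally. This suits short-lived clients; when uploads are frequent, reusing one client is usually cheaper than creating and shutting down one per request.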

Local reproduction

We started the project locally and attached JProfiler to its JVM.

The upload feature is exposed through an HTTP interface, so we wrote a for loop that kept calling this interface and watched how memory changed. After 6 minutes the available memory had dropped to 0.

The server also reported an OOM.

Solving the problem

Rewrite the code using the factory pattern:

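    // Cache one OSSClient per access key so clients are reused instead of re-created.
    // The key ignores endpoint, so this assumes each access key is used with one endpoint.
    public static final Map<String, OSSClient> map = new ConcurrentHashMap<>();

    public static OSSClient getClient(String endpoint, String accessKey, String accessSecret) {
        // computeIfAbsent makes check-and-create atomic, so concurrent callers share one client
        return map.computeIfAbsent(accessKey, key -> new OSSClient(endpoint, key, accessSecret));
    }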

Replace the original code:

    OSSClient client = AliyunUtil.getClient(endpoint, getAccessKey(), getAccessSecret());

We ran the test again and each GC now released memory properly. After 6 minutes, memory usage had not exceeded 200 MB, so the problem was solved.

Summary

This article described how we used JProfiler to troubleshoot a production OOM caused by improper use of Alibaba Cloud's OSSClient. The root cause was that the code never shut down the OSSClient instances it created. Fortunately this did not happen in a core business system, where the consequences would have been much more serious. When using tools provided by others, read the official usage guidance and the source code carefully to avoid similar problems.

Origin blog.csdn.net/whzhaochao/article/details/132992028