【Production Incident】Sharing an OOM incident in production

Foreword

Following the last production CPU alarm, the service started acting up again. After a few calm days, while watching the logs of the production service, I saw OutOfMemoryError appearing frequently, which is what we commonly call OOM. That is serious: frequent OOMs leave the service unavailable. Checking the call traces in Skywalking, almost every trace was red and the service was essentially paralyzed. Because the service is deployed in a distributed fashion in production, operations restarted it immediately. As this is a B-side product, the priority was to let the company's business keep using the service normally, and then to investigate urgently, find the root cause, and fix it.

Reasons for OutOfMemoryError

Let's first understand the reasons for OutOfMemoryError. They come down to two types: insufficient heap memory space and insufficient metaspace.

  • Insufficient heap memory space: the program holds objects that are always referenced (strong references). As long as an object is referenced, it cannot be reclaimed by GC. Once such objects fill the maximum heap size set by -Xmx, the lack of memory naturally triggers a heap memory overflow.
  • Insufficient metaspace: Java 8 introduced metaspace, which replaces the earlier permanent generation. Metaspace lives in native (off-heap) memory and stores class metadata rather than ordinary object instances. It was introduced as a JDK optimization to avoid the memory overflow of the permanent generation, but it can still overflow when it reaches its configured limit (-XX:MaxMetaspaceSize) or when native memory runs out.

Several common cases of heap memory overflow

  • A database query returns too much data, and all of it is loaded into memory, resulting in memory overflow;
  • An infinite loop in the code keeps large objects referenced, so GC cannot reclaim them;
  • Resources such as connection pools and I/O streams are not released manually after use;
  • A static collection holds object references that are never cleared, so the objects stay reachable.
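
The static collection case in the list above can be sketched as follows. This is a minimal illustration, not code from the incident: the class name StaticCacheLeak and its methods are hypothetical, and the payload sizes are kept small so the program finishes without actually running out of memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; not taken from the service described in this post.
class StaticCacheLeak {
    // A static collection stays reachable for as long as its class is loaded,
    // so every element added here remains strongly referenced until it is removed.
    static final List<byte[]> CACHE = new ArrayList<>();

    static void handleRequest(int payloadBytes) {
        byte[] payload = new byte[payloadBytes];
        CACHE.add(payload); // never removed, so retained heap only grows
    }

    // Sums the sizes of all payloads the cache still references.
    static long retainedBytes() {
        long total = 0;
        for (byte[] b : CACHE) {
            total += b.length;
        }
        return total;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            handleRequest(1024);
        }
        System.out.println("entries=" + CACHE.size() + ", retainedBytes=" + retainedBytes());
    }
}
```

With larger payloads or an unbounded number of requests, the same pattern would eventually exhaust the heap, because nothing ever removes entries from the static list.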

The above are some of the common heap memory overflow scenarios. Of course, the problems we actually run into are often stranger than these, and the textbook cases are the ones we rarely meet...

Symptom analysis

According to the error log from the production environment, the memory overflow was reported by Mybatis. Reading the Mybatis source code shows that it uses collection classes internally to store the assembled SQL, so a heap memory overflow is certainly possible: when a SQL statement is large, the collection holding it becomes very large too, and if it cannot be reclaimed, it causes a memory overflow.

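The main reason is that when Mybatis assembles SQL, the generated placeholders and parameter objects are stored in a Map. When a statement has so many parameters that the SQL becomes very long, the Map holds on to them for a relatively long time. With multiple threads doing this at the same time, memory usage becomes very high, and OOM occurs.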

Mybatis source code analysis

Looking at the source code of the DynamicContext class, DynamicContext has a field named bindings of type ContextMap, which extends HashMap and therefore behaves as a Map. The class also has a getBindings method, and ForEachSqlNode calls it. Simply put, ForEachSqlNode uses getBindings to put the SQL parameters and their placeholders into the ContextMap. The main problem is that these parameters and placeholders cannot be reclaimed by GC while the statement is still being processed, so OOM occurs when there are many concurrent queries.
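
To make the growth visible, here is a simplified model of the behavior described above. It is a sketch, not Mybatis source code: the classes ForEachBindingModel and Context are hypothetical stand-ins for ForEachSqlNode and DynamicContext, and the "__frch_" key prefix is modeled on the per-item names Mybatis generates, which should be treated as an assumption here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of foreach expansion; this is not Mybatis source code.
class ForEachBindingModel {
    // Stand-in for DynamicContext: a bindings map plus the SQL text being assembled.
    static final class Context {
        final Map<String, Object> bindings = new HashMap<>();
        final StringBuilder sql = new StringBuilder();
    }

    // Builds "column IN (...)" and binds one uniquely named entry per element.
    static Context expandIn(String column, String item, List<?> values) {
        Context ctx = new Context();
        ctx.sql.append(column).append(" IN (");
        for (int i = 0; i < values.size(); i++) {
            String key = "__frch_" + item + "_" + i; // assumed naming pattern
            ctx.bindings.put(key, values.get(i));
            if (i > 0) {
                ctx.sql.append(", ");
            }
            ctx.sql.append("#{").append(key).append('}');
        }
        ctx.sql.append(')');
        return ctx;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            ids.add(100 + i);
        }
        Context ctx = expandIn("id", "id", ids);
        System.out.println(ctx.sql);
        System.out.println("bindings=" + ctx.bindings.size());
    }
}
```

Both the map and the SQL text grow linearly with the number of IN parameters, so a list with hundreds of thousands of elements produces a correspondingly large map and string for every in-flight query.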

Reproducing the scenario

Then I reproduced the production scenario. I assembled a SQL statement with a much larger parameter list in its IN clause and created 50 threads to execute it, with the JVM options set to -Xmx256m -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError
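
A harness for this kind of reproduction might look like the sketch below. It is not the original test code, which was only shown as screenshots: the class InClauseLoadTest, the id count of 100,000, and the query callback are all assumptions, and the callback is a placeholder where the real mapper call with the large IN list would go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical load harness; the query callback stands in for the real mapper call.
class InClauseLoadTest {
    // Builds the oversized parameter list for the IN clause.
    static List<Long> buildIds(int count) {
        List<Long> ids = new ArrayList<>(count);
        for (long i = 0; i < count; i++) {
            ids.add(i);
        }
        return ids;
    }

    // Submits one query task per pool thread and returns how many completed normally.
    static int run(int threads, int idsPerQuery, Consumer<List<Long>> query)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger completed = new AtomicInteger();
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                query.accept(buildIds(idsPerQuery));
                completed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        return completed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Replace the empty lambda with the real mapper call that uses the IN list.
        int done = run(50, 100_000, ids -> { });
        System.out.println("completed queries: " + done);
    }
}
```

Running it with the JVM options mentioned above keeps the heap small enough for memory pressure to show up quickly once the placeholder is replaced by a real query.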

The logs printed on the console show the service performing Full GC frequently and eventually running into OOM.

Summary

Now that the cause of the problem has been found, the next step is to optimize the SQL in the code and keep assembled statements from growing too large. This is a warning to all of us: code should not be written carelessly, and neither should SQL. Thinking about a problem too simply sometimes brings unpredictable risks.
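
One possible way to keep each statement small is to split a large id list into fixed-size batches and run one query per batch. The post does not say which fix was actually applied, so this is only a sketch under that assumption: the class IdBatcher and the batch size of 1000 are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; one possible mitigation, not necessarily the fix used in the incident.
class IdBatcher {
    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i < 2500; i++) {
            ids.add(i);
        }
        for (List<Long> batch : partition(ids, 1000)) {
            // Each batch would be passed to the query as its own, much smaller IN list.
            System.out.println("batch size: " + batch.size());
        }
    }
}
```

Each batch then yields a short IN clause with a small parameter map, at the cost of more round trips to the database.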

Origin blog.csdn.net/u011397981/article/details/130034305