Online service slowdown and outage caused by the Druid connection pool

1. Background: why we use Druid

After the company's microservice transformation of the product went live, we were using Spring Boot's default connection pool, HikariCP, in the development environment. Why did Spring Boot 2.0 choose HikariCP? See the blog post "Five reasons why Spring Boot 2.0 chose HikariCP as the default database connection pool": https://blog.csdn.net/liuhuiteng/article/details/10762753. In a word, HikariCP has the best raw performance and can outdo every other connection pool.
After stress-testing the product, however, we found that requests were often stuck waiting to obtain a database connection. Checking the company's product base code, the pool size had never been adjusted and was left at the default maximum of 8 connections. We switched to Alibaba's Druid connection pool and turned on its monitoring, and it turned out to be excellent; in terms of raw performance, Druid is still perfectly acceptable.
The main selling point is the monitoring. Druid's powerful monitoring features help both day-to-day development and online operations and maintenance, and the monitoring can be customized and extended through its interfaces.
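As a minimal sketch (assuming the druid or druid-spring-boot-starter dependency is on the classpath; the URL, credentials, pool sizes and filter list are placeholders, not our production values), switching to Druid and enabling its built-in stat monitoring can look roughly like this:

```java
import java.sql.SQLException;

import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import com.alibaba.druid.pool.DruidDataSource;

// A minimal sketch of a Druid DataSource bean with monitoring enabled.
@Configuration
public class DataSourceConfig {

    @Bean
    public DataSource dataSource() throws SQLException {
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://localhost:3306/demo"); // placeholder connection details
        ds.setUsername("demo");
        ds.setPassword("demo");
        ds.setInitialSize(10);  // the defaults are small; size the pool for your real load
        ds.setMaxActive(50);    // the default maximum of 8 was far too low under stress
        ds.setFilters("stat");  // enables Druid's built-in SQL/statistics monitoring
        return ds;
    }
}
```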

2. The problem and its analysis

1. When a problem occurs, quickly confirm the problem type

A few days after one of the services went live on the cloud, it was reported as laggy, slow and eventually down, and several cluster nodes ran into problems one after another that morning.

We immediately got in touch with the operations colleagues and confirmed that the application's JVM memory was normal and that the load on the database instance backing the service was normal. The symptom was that some nodes were fine while others were not, so at that point we concluded that the problem nodes were blocked on the application side.

As for the type of problem, this kind of lag and downtime is almost always one of: JVM memory overflow in the application, high database load, blocked application threads, or some resource leak (for example leaked Redis connections or leaked database pool connections).

2. Analyzing the problem from the logs

Analyzing the logs around the time of the incident, you find that many requests were stuck obtaining a connection from the Druid pool. That calls for some thought: either the connections in the pool were all busy executing SQL, or they had all leaked; Druid's own counters (sketched below) help tell the two apart.
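A rough way to distinguish "all busy" from "all leaked", assuming you can reach the DruidDataSource bean (the same counters are visible on Druid's stat monitoring page):

```java
import com.alibaba.druid.pool.DruidDataSource;

// Prints a snapshot of the pool's state to compare against the thread dump.
public final class PoolSnapshot {

    public static void print(DruidDataSource ds) {
        System.out.printf("active=%d idle=%d maxActive=%d waitingThreads=%d%n",
                ds.getActiveCount(),      // connections currently handed out to callers
                ds.getPoolingCount(),     // connections sitting idle in the pool
                ds.getMaxActive(),        // configured upper bound
                ds.getWaitThreadCount()); // threads blocked waiting for a connection
        // active == maxActive with almost no SQL visible in the thread dump suggests the
        // borrowed connections are stuck (or leaked) rather than busy executing queries.
    }
}
```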

How do we judge whether the connections are busy executing SQL?
Search the thread snapshot for MySQL-related stack frames: a connection taken from the pool should, of course, be executing SQL. In the snapshot, only a handful of SQL statements were actually executing, so is it a leak after all?

How do we judge whether the connections have leaked?
Looking at the technical architecture, apart from Druid itself handing out connections from the pool, the chance of business code grabbing a connection directly is extremely small. On the other hand, if it were a connection leak, why would the thread snapshot still show 8 connections that had not leaked? (One possibility is that exactly 8 connections leaked and the SQL on those 8 is slower, or the concurrency is higher; that could be confirmed further.) At the time we concluded there was no connection leak and moved on to analyzing the thread snapshots step by step.
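For completeness: had a leak been the suspect, Druid can track and reclaim abandoned connections. A minimal sketch follows; the timeout value is illustrative, and the tracking has a cost, so enable it with care in production.

```java
import com.alibaba.druid.pool.DruidDataSource;

// Leak detection via Druid's abandoned-connection tracking.
public final class LeakDetection {

    public static void enable(DruidDataSource ds) {
        ds.setRemoveAbandoned(true);       // reclaim connections the caller never closed
        ds.setRemoveAbandonedTimeout(180); // ...once they have been checked out for 180 seconds
        ds.setLogAbandoned(true);          // log the stack trace that borrowed the connection
    }
}
```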


Further analysis found that requests obtaining a connection from the Druid pool were also blocked on a lock, waiting to lock <0x00000006c69c35f8> (a java.lang.Object), in a Base.loadClass() frame, i.e. stuck in the class loader loading a class.
Why would it get stuck there? I could not figure it out... I was stunned for a few minutes.

I then looked at another thread snapshot: the slow threads were basically all stuck loading a class at the same place.
The class being loaded there is com.mysql.jdbc.MysqlIO.
We then searched the slow, crashing service for com.mysql.jdbc.MysqlIO: the class does not exist there at all. Looking up a class that does not exist is bound to be slow, because the class loader has to traverse every class directory and jar on the classpath, which means disk IO and lock blocking (a rough picture of why is sketched below).
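Why is a missing class so expensive every single time? A simplified paraphrase of java.lang.ClassLoader.loadClass in OpenJDK (not the exact source) shows the reason: only successful loads are cached by findLoadedClass, so a class that does not exist triggers the full parent delegation and classpath scan on every attempt, all while holding the class-loading lock for that name.

```java
// Simplified paraphrase of java.lang.ClassLoader.loadClass (OpenJDK), for illustration only.
protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
    synchronized (getClassLoadingLock(name)) {   // per-name lock (or the loader itself)
        Class<?> c = findLoadedClass(name);      // only SUCCESSFUL loads are cached here
        if (c == null) {
            try {
                c = (parent != null)
                        ? parent.loadClass(name, false)   // delegate to the parent first
                        : findBootstrapClassOrNull(name);
            } catch (ClassNotFoundException notFoundInParent) {
                // fall through and try this loader's own classpath
            }
            if (c == null) {
                c = findClass(name);             // scans this loader's jars/directories (disk IO);
                                                 // throws ClassNotFoundException if nothing matches
            }
        }
        if (resolve) {
            resolveClass(c);
        }
        return c;
    }
}
```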

3. Verification

The problem is thus confirmed: Druid tries to load the non-existent class com.mysql.jdbc.MysqlIO, which forces the class loader to scan all class directories and jars on disk, leading to lock waits and blocked threads.
How do we verify that loading a missing class is slow?
Write a demo that reflectively loads a class that does not exist and one that does, and compare. The comparison is clear: loading a non-existent class can take tens of milliseconds (depending on the number of jars on the classpath), as sketched below.
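A minimal sketch of such a demo, assuming the MySQL 5.x driver is not on the classpath (so com.mysql.jdbc.MysqlIO really is missing); absolute timings depend on how many jars you have:

```java
// Compares the cost of loading an existing class versus a missing one.
public class LoadClassTiming {

    private static long timeMs(String className) {
        long start = System.nanoTime();
        try {
            Class.forName(className);
        } catch (ClassNotFoundException expectedForMissingClass) {
            // ignored: the point is how long the failed lookup takes
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("existing class      : " + timeMs("java.util.ArrayList") + " ms");
        System.out.println("missing class       : " + timeMs("com.mysql.jdbc.MysqlIO") + " ms");
        System.out.println("missing class again : " + timeMs("com.mysql.jdbc.MysqlIO") + " ms"); // just as slow
    }
}
```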
Why does com.mysql.jdbc.MysqlIO not exist in this service?
The class belongs to the MySQL driver package. The project uses the newer 8.x driver, in which the class no longer exists; checking the different versions of the MySQL driver on Maven confirms that the class is gone from version 6 onwards and only exists in the older 5.x drivers.

4. Solution

Pull the Druid source code for the version we were using, take a quick look at it, and comment out the loadClass logic; simply not loading the class is fine. Then build the patched code into a jar, druid-weaver.

The fundamental solution

Directly patching the source to comment out the loadClass logic, as above, solves the performance problem, but in some situations it can make obtaining a connection unstable.
The fundamental solution is to upgrade to druid-1.1.23.
Let's compare the source code:

1.1.22 vs. 1.1.23 (source screenshots comparing the relevant code in the two versions)
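The screenshots do not carry over here, so below is an illustrative sketch of the kind of change involved rather than the actual Druid source (the field and method names are invented): the older behaviour probes the missing class on every call, while the fixed behaviour remembers the outcome of the first attempt so loadClass is never hit again.

```java
// Illustrative only -- NOT the real Druid code; names are invented to show the pattern.
public final class DriverClassProbe {

    private static volatile Class<?> mysqlIoClass;
    private static volatile boolean  mysqlIoMissing; // negative cache

    // Old behaviour (1.1.22-style): a missing class means a full classpath scan on EVERY call.
    static Class<?> probeEveryTime() {
        try {
            return Class.forName("com.mysql.jdbc.MysqlIO");
        } catch (ClassNotFoundException e) {
            return null; // the failure is forgotten immediately
        }
    }

    // Fixed behaviour (1.1.23-style): the first failure is cached, later calls return at once.
    static Class<?> probeOnce() {
        if (mysqlIoClass == null && !mysqlIoMissing) {
            try {
                mysqlIoClass = Class.forName("com.mysql.jdbc.MysqlIO");
            } catch (ClassNotFoundException e) {
                mysqlIoMissing = true;
            }
        }
        return mysqlIoClass;
    }
}
```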

5. Follow-up

Since then the service has not gone down because of Druid and has been running stably. Because we have more than 200 service nodes online, some high-concurrency business services ran into problems as well; it has been confirmed to be the same issue, with an even bigger impact on those services.

6. Suggestions on a few Druid parameters

test-on-borrow = true: it is recommended to turn this off in production, because it really does cost performance. In our production environment, monitoring statistics on Druid's connection-borrow checks show that each check takes a few milliseconds; a request that executes hundreds of SQL statements therefore pays hundreds of milliseconds just on checks. Set it to false, and turn on test-while-idle = true instead to guard against dead or broken connections. A sketch of these settings follows.
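The advice above as a sketch; these are the standard DruidDataSource setters, but the concrete values are illustrative rather than universal defaults.

```java
import com.alibaba.druid.pool.DruidDataSource;

// Recommended validation settings: no per-borrow check, background idle checks instead.
public final class ValidationSettings {

    public static void apply(DruidDataSource ds) {
        ds.setTestOnBorrow(false);                   // no validation query on every borrow
        ds.setTestWhileIdle(true);                   // validate connections in the background instead
        ds.setValidationQuery("SELECT 1");           // query used by the idle check
        ds.setTimeBetweenEvictionRunsMillis(60_000); // how often the idle checker runs
        ds.setMinEvictableIdleTimeMillis(300_000);   // how long a connection may sit idle
    }
}
```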


Origin blog.csdn.net/wf_feng/article/details/121665572