A record of an online failure

The application had been running fine for a week, but suddenly on Monday the app slowed down and some users got stuck on the welcome page. When I got the news my heart skipped a beat: this looked like a performance problem caused by concurrency. The first thing I did was ask the DBA to check resource usage on the application server and the database server. Both were within normal ranges, which was strange, so the only place left to look was the application itself.

I tailed the log and found it was being flooded with the same exception: Pool empty. Unable to fetch a connection in 30 seconds, none available[size:100; busy:100; idle:0; lastwait:30000]. The exception makes it obvious that Tomcat's connection pool was exhausted. My first reaction was to enlarge the pool with spring.datasource.tomcat.max-active=200 and raise the database connection limit to 500. Sure enough, after a while the 200 connections were used up as well: Caused by: org.apache.tomcat.jdbc.pool.PoolExhaustedException: [http-nio-7001-exec-11] Timeout: Pool empty. Unable to fetch a connection in 6 seconds, none available[size:200; busy:199; idle:0; lastwait:6000].
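For reference, this is roughly what the pool setting looks like when done programmatically instead of through the spring.datasource.tomcat.* property used above; it is only a sketch, and the JDBC URL, password, and idle settings are placeholders rather than values from the incident (only max-active=200, the 6-second wait, and the sltapp user appear in the post):

```java
import org.apache.tomcat.jdbc.pool.DataSource;
import org.apache.tomcat.jdbc.pool.PoolProperties;

public class PoolConfigSketch {

    // Build a Tomcat JDBC pool equivalent to the property change mentioned above.
    public static DataSource buildDataSource() {
        PoolProperties p = new PoolProperties();
        p.setUrl("jdbc:mysql://db-host:3306/app");    // placeholder JDBC URL
        p.setDriverClassName("com.mysql.jdbc.Driver");
        p.setUsername("sltapp");                      // the user seen in the database error
        p.setPassword("***");                         // placeholder
        p.setMaxActive(200);   // same effect as spring.datasource.tomcat.max-active=200
        p.setMaxWait(6000);    // wait 6 s before PoolExhaustedException, as in the second log
        p.setMaxIdle(20);      // placeholder idle setting
        DataSource ds = new DataSource();
        ds.setPoolProperties(p);
        return ds;
    }
}
```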

The difference from last time was that the database now threw User sltapp already has more than 'max_user_connections' active connections, meaning the database's own connection limit was hit this time. Checking server resources again, the application server was still normal, showing it could bear the enlarged connection pool, but database resource usage had climbed above 85%, which means the database server itself was under pressure.

Next we turned to the database and took a SHOW FULL PROCESSLIST snapshot every 5 seconds. Two statements in the list kept running for a very long time, more than 80 seconds each. Killing them did not help; they came right back. So it could be determined that these two long-running queries were what filled up the connections.
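For anyone who wants to script that check instead of re-running it by hand, here is a minimal sketch (the JDBC URL and credentials are placeholders, and the 30-second threshold is an arbitrary choice, not a value from the incident) that polls SHOW FULL PROCESSLIST every 5 seconds and prints any statement that has been running too long:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ProcesslistWatcher {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; in the incident the same statement
        // was simply run from a MySQL client.
        String url = "jdbc:mysql://db-host:3306/information_schema";
        try (Connection conn = DriverManager.getConnection(url, "sltapp", "***")) {
            while (true) {
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SHOW FULL PROCESSLIST")) {
                    while (rs.next()) {
                        long seconds = rs.getLong("Time");
                        String info = rs.getString("Info"); // full SQL text, thanks to FULL
                        if (seconds > 30 && info != null) {
                            System.out.printf("%ds  %s%n", seconds, info);
                        }
                    }
                }
                Thread.sleep(5000); // take a snapshot every 5 seconds
            }
        }
    }
}
```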

Having found the cause, I went after that SQL. Analysis showed the statement self-joined a sub-table four times in order to pick up the different order statuses; the sub-table is large, and it was also joined to other large tables, so the cost was heavy. The final solution was to have the SQL query only the main table: after the push-order statuses are read from the main table, the corresponding sub-table rows are fetched by the main table's primary keys, and the statuses are then marked in code. In other words, break the complex statement into simple step-by-step queries and do the remaining logic in Java. After this change, performance improved dramatically: the entire result could be queried in a few seconds.
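As a minimal sketch of that split, with hypothetical table and column names (push_order as the main table, push_order_detail as the sub-table, order_id as the link), since the post does not show the real schema, the two-step version looks roughly like this:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OrderStatusQuery {

    /** Step 1: read only the main table and collect the primary keys. */
    static Map<Long, String> loadMainOrders(Connection conn, String pushStatus) throws Exception {
        Map<Long, String> statusById = new HashMap<>();
        String sql = "SELECT id, push_status FROM push_order WHERE push_status = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, pushStatus);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    statusById.put(rs.getLong("id"), rs.getString("push_status"));
                }
            }
        }
        return statusById;
    }

    /** Step 2: fetch the sub-table rows by the main table's primary keys and mark
     *  the status in Java, instead of self-joining the sub-table four times in SQL. */
    static List<OrderView> loadDetails(Connection conn, Map<Long, String> statusById) throws Exception {
        List<OrderView> result = new ArrayList<>();
        if (statusById.isEmpty()) {
            return result;
        }
        String ids = statusById.keySet().stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
        String sql = "SELECT order_id, detail_status FROM push_order_detail WHERE order_id IN (" + ids + ")";
        try (PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                long orderId = rs.getLong("order_id");
                OrderView v = new OrderView();
                v.orderId = orderId;
                v.pushStatus = statusById.get(orderId);         // status from the main table
                v.detailStatus = rs.getString("detail_status"); // status from the sub-table
                result.add(v);
            }
        }
        return result;
    }

    static class OrderView {
        long orderId;
        String pushStatus;
        String detailStatus;
    }
}
```

The point of the split is that each query touches a single table and can use its primary key or index, while the status marking that previously required four self-joins happens in memory.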

After the change was tested, it went live incrementally. After running for a day the result was clear: the system no longer got stuck and was much faster than before. There were two main causes of this incident: first, concurrency went up; second, the volume of business data grew, which made the original SQL inefficient. For the first point you can add hardware resources and connections, but that does not address the root of the problem; you have to start with the slow SQL.

Originally published at blog.csdn.net/tinalucky/article/details/118098980