[Online troubleshooting] Service-wide avalanche caused by an undersized database connection pool

I worked on a product-based project with moderate traffic and a steady number of daily users. The project had one very strange habit: every Saturday night, during a certain time window, the services were guaranteed to hang ~

The only recovery method at the time was a restart, after which the team would optimize the code and other areas based on the logs from the hang.

I had been on the project for only about two weeks and had no access to the servers, so I asked a colleague with permission to pull the logs from the time of the hang. About 80% of the errors were database errors, or call-chain failures between microservices that were themselves caused by the database hanging.

Among them, the log entry that pointed me to the critical error was the following:

The last packet successfully received from the server was 34,859 milliseconds ago. The last packet sent successfully to the server was 34,859 milliseconds ago.

This error is very common, and most people have probably seen it. Based on it, I checked the configuration of the database connection pool. The project uses druid as its connection pool.

The online configuration at the time was roughly:

maxActive: 50
initialSize: 20
maxWait: 60000
minIdle: 10
… and so on

To me, this connection pool configuration looked sufficient. According to feedback, the MySQL instance on Alibaba Cloud was also very stable, and none of the connection pool metrics had raised an alarm. That made it very strange: why would the services fail to obtain a connection? Saturday night is not the business peak, and there had never been a problem during peak hours, yet the services hung at that same point every Saturday.

I had no specific idea at that point, but I was curious about how druid is implemented, so I read the connection pool source code and stepped through it with a breakpoint. That led to a new discovery: the project's configuration file was never injected into the druid connection pool. For example, the effective maxActive value was still druid's default of 8 ~!

I opened the monitoring page provided by druid and confirmed it: none of the project's configuration had been injected. The configuration prefix that the druid source code reads was completely different from the prefix used in the project's configuration. I then searched Baidu for examples of druid connection pool creation code and finally found that the version of the connection pool used by the project was very old.
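To illustrate the kind of prefix mismatch described above, here is a minimal sketch in plain Java. It is not druid's actual binding code: the class name, the resolveMaxActive helper and both prefixes ("app.datasource." and "pool.") are hypothetical and chosen only for illustration. The only value taken from this case is the default of 8.

```java
import java.util.Properties;

public class PrefixBindingSketch {

    // Default reported in this article for druid's maxActive.
    static final int DEFAULT_MAX_ACTIVE = 8;

    /**
     * Hypothetical binder: looks up "<expectedPrefix>maxActive" and silently
     * falls back to the default when the key is missing.
     */
    public static int resolveMaxActive(Properties config, String expectedPrefix) {
        String raw = config.getProperty(expectedPrefix + "maxActive");
        return raw == null ? DEFAULT_MAX_ACTIVE : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        // Hypothetical key, standing in for what the project's config file contained.
        config.setProperty("app.datasource.maxActive", "50");

        // The binder reads a different prefix, so the configured 50 is never seen.
        System.out.println("mismatched prefix: " + resolveMaxActive(config, "pool."));
        // With the matching prefix the configured value is applied.
        System.out.println("matching prefix:   " + resolveMaxActive(config, "app.datasource."));
    }
}
```

The point of the sketch is that nothing fails loudly: a key under the wrong prefix is simply ignored, which is why the problem only showed up when inspecting the effective values at runtime.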

The project was using version 1.1.0. Because of this version issue, the connection pool configuration never took effect, so each service did not have enough connections. I did not thoroughly study exactly what limit triggers the avalanche (the general cause must be that the database connections are used up and other threads cannot obtain one), but it was basically certain that the problem came from here. With this version, if you want to inject configuration under a custom prefix, you must use the following method:

    // Bind the properties under the "spring.datasource" prefix onto the druid data source bean.
    @ConfigurationProperties(prefix = "spring.datasource")
    @Bean
    public DruidDataSource druidDataSource() {
        return new DruidDataSource();
    }
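The exhaustion itself (connections used up, other threads unable to obtain one) can be reproduced in a rough simulation. The sketch below does not use druid; it models the pool as a java.util.concurrent.Semaphore with one permit per connection. The class name, the countTimeouts helper, the request count of 20 and the 100 ms wait are all assumptions for illustration. As a simplification, every caller that gets a connection holds it until all other callers have either obtained one or given up, which stands in for slow queries.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolExhaustionSketch {

    /** Returns how many of the concurrent callers fail to get a connection within maxWaitMs. */
    public static int countTimeouts(int maxActive, int requests, long maxWaitMs) {
        Semaphore pool = new Semaphore(maxActive);               // one permit per pooled connection
        CountDownLatch attempted = new CountDownLatch(requests); // every caller got a result
        CountDownLatch queriesDone = new CountDownLatch(1);      // slow queries may finish
        AtomicInteger timeouts = new AtomicInteger();
        ExecutorService callers = Executors.newFixedThreadPool(requests);

        for (int i = 0; i < requests; i++) {
            callers.execute(() -> {
                boolean acquired = false;
                try {
                    // Wait up to maxWaitMs for a free connection, like a pool's maxWait.
                    acquired = pool.tryAcquire(maxWaitMs, TimeUnit.MILLISECONDS);
                    if (!acquired) {
                        timeouts.incrementAndGet();
                    }
                    attempted.countDown();
                    if (acquired) {
                        queriesDone.await(); // keep the connection busy (slow query)
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    if (acquired) {
                        pool.release();
                    }
                }
            });
        }

        try {
            attempted.await();       // all callers have a connection or have timed out
            queriesDone.countDown(); // now let the slow queries complete
            callers.shutdown();
            callers.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("simulation interrupted", e);
        }
        return timeouts.get();
    }

    public static void main(String[] args) {
        System.out.println("maxActive=8:  " + countTimeouts(8, 20, 100) + " of 20 callers timed out");
        System.out.println("maxActive=50: " + countTimeouts(50, 20, 100) + " of 20 callers timed out");
    }
}
```

With 8 connections, 12 of the 20 callers give up; with 50, none do. In a microservice call chain, each of those failed callers becomes a failed upstream request, which is consistent with the call-chain errors seen in the logs.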

At that time the latest version of druid was 1.1.10, so I reported the problem and upgraded the database connection pool version in the microservices. After testing confirmed that the connection pool configuration was injected correctly, the change went out at a suitable release window. The result: the project has not experienced a weekend avalanche since the upgrade ~
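One way to make the "confirm the configuration was injected" step repeatable is to compare the intended settings with the values the pool actually reports. The following is only a sketch under assumptions: the class name and the findMismatches helper are hypothetical, and the effective values are hard-coded here, whereas in a real service they would be read from the live pool (for example from its monitoring page). Only maxActive = 8 comes from this case; the other effective values are placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PoolConfigCheckSketch {

    /** Lists every setting whose effective value differs from the intended one. */
    public static List<String> findMismatches(Map<String, Integer> intended,
                                              Map<String, Integer> effective) {
        List<String> mismatches = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : intended.entrySet()) {
            Integer actual = effective.get(entry.getKey());
            if (!entry.getValue().equals(actual)) {
                mismatches.add(entry.getKey() + ": intended " + entry.getValue()
                        + ", effective " + actual);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Intended values, taken from the configuration shown earlier in the article.
        Map<String, Integer> intended = new LinkedHashMap<>();
        intended.put("maxActive", 50);
        intended.put("initialSize", 20);
        intended.put("minIdle", 10);

        // Effective values: maxActive = 8 is what was observed; the rest are placeholders.
        Map<String, Integer> effective = new LinkedHashMap<>();
        effective.put("maxActive", 8);
        effective.put("initialSize", 20);
        effective.put("minIdle", 10);

        for (String mismatch : findMismatches(intended, effective)) {
            System.out.println("Connection pool setting not applied: " + mismatch);
        }
    }
}
```

A check like this could run in a test or at startup, so that a silently ignored configuration is caught before release rather than on a Saturday night.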

The colleagues on the development team had not expected such a small detail to cause such a big problem, and I only found it by coincidence. To summarize: when building a project, be aware of every configuration item and know what each one does. When a problem occurs, start from the error itself and widen your thinking gradually from that small point, rather than immediately suspecting the architecture, the middleware and so on. Start from the basics and ask yourself why so many projects use these open source frameworks without trouble while you run into this kind of problem; most likely you are using them incorrectly. Make a habit of thinking more and learning about the implementation principles of open source frameworks and the ideas behind them ~

Origin blog.csdn.net/cainiao1412/article/details/98886125