Building a more robust system: database connection pool problems (Too many connections, timeouts, and so on)

First of all, a connection pool exists to avoid the cost of repeatedly opening and closing database connections. Without a pool, the only ceiling on the number of connections is the max_connections value configured on the database server (MySQL, for example).

With a pool, the program itself caps the number of connections. A maximum of 20 is a typical setting.

A connection pool has many settings, but the most critical are the maximum number of connections and the timeout for obtaining a connection.

 

		
		<!-- Maximum number of active connections in the pool -->
		<property name="maxActive" value="20" />
		<!-- Maximum time to wait for a connection, in milliseconds -->
		<property name="maxWait" value="60000" />

Most programs configure little more than these two values.
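To make the two settings concrete, here is a minimal sketch of what they mean, using a plain semaphore instead of a real pool. The class `TinyPool` and its methods are hypothetical and belong to no library; a real pool would hand out actual connections and would normally throw an exception on timeout rather than return false.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical toy pool: it only models the two limits and holds no real connections.
class TinyPool {
    private final Semaphore permits;   // plays the role of maxActive
    private final long maxWaitMillis;  // plays the role of maxWait

    TinyPool(int maxActive, long maxWaitMillis) {
        this.permits = new Semaphore(maxActive, true); // fair: waiters are served in order
        this.maxWaitMillis = maxWaitMillis;
    }

    // Returns true if a "connection" was obtained within maxWait, false on timeout.
    boolean tryBorrow() {
        try {
            return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Gives the "connection" back so that a waiting caller can proceed.
    void giveBack() {
        permits.release();
    }
}
```

With `new TinyPool(20, 60000)`, the 21st caller waits up to 60 seconds for someone else to call `giveBack()` before giving up.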

 

First, the Too many connections problem. With a connection pool in place, this error should not occur unless the pool itself has a bug or the pool is initialized in more than one place. The pool already caps the number of connections it opens, so as long as that cap is below the database's max_connections, the database limit is basically never reached.

 

Then there is the timeout problem.

Suppose your program now reports one error: getting a connection timed out.

The first fix that comes to mind is usually to raise the pool's maximum number of connections and lengthen the timeout.

But that only masks the problem; it does not solve it. And having the SQL layer throw an exception is the more dangerous behavior.

 

The reason is that you don't know how far your business logic got before the exception was thrown, and most programs do not wrap DAO calls in try/catch. For example:

A service method:

 

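public void addGold(int gold) {
    // uid is assumed to be a field of the enclosing service
    User user = dao.getUser(uid);
    // add the gold to the user's current balance
    user.setGold(user.getGold() + gold);
    // persist the change; this is the call that may throw
    dao.update(user);
}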

 

 

Suppose the update step fails. What goes wrong? Most systems today use a cache, and by this point the user object has been modified, so the cached value has very likely changed too. The database operation then throws, and the cache and the database no longer agree.
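The failure described above can be reproduced in a few lines. This is only a sketch under assumptions: `GoldDao`, `GoldService` and the in-memory map standing in for the cache are all hypothetical names, and the "database first" ordering is one possible mitigation I am adding for contrast, not a complete answer to cache consistency.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical DAO: the update may throw, for example when no connection is available.
interface GoldDao {
    void updateGold(int uid, int gold);
}

// Hypothetical service that keeps each user's gold in an in-memory cache.
class GoldService {
    final Map<Integer, Integer> cache = new HashMap<>();
    private final GoldDao dao;

    GoldService(GoldDao dao) {
        this.dao = dao;
    }

    // Risky order: the cache changes first, so a failed update leaves it ahead of the database.
    void addGoldCacheFirst(int uid, int gold) {
        int newGold = cache.getOrDefault(uid, 0) + gold;
        cache.put(uid, newGold);
        dao.updateGold(uid, newGold); // if this throws, the cache already holds newGold
    }

    // Alternative order: touch the cache only after the database write has succeeded.
    void addGoldDbFirst(int uid, int gold) {
        int newGold = cache.getOrDefault(uid, 0) + gold;
        dao.updateGold(uid, newGold); // if this throws, the cache is untouched
        cache.put(uid, newGold);
    }
}
```

With a DAO that always throws, `addGoldCacheFirst` leaves the new balance in the cache even though nothing was saved, while `addGoldDbFirst` leaves the cache as it was.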

 

And, as said above, raising those values only covers up and postpones the problem.

So how do we solve it? A get-connection timeout has exactly one direct cause: no connection could be obtained.

Why not? Because some piece of logic has been holding on to connections, so none is left to hand to you, and you time out.

Under normal circumstances this does not happen, so the task is to find the logic that is holding them.

 

Moreover, throwing an exception when no connection can be obtained may damage your existing business logic. Because the failure is fairly random, an operation may throw halfway through: A has executed, B has not. You may say a transaction can roll that back, but A's data has already been synchronized to the cache while B's has not, so the two are inconsistent. Can you roll back the data in the cache?

 

So my suggestion is: do not let maxWait time out lightly. A timeout disrupts the business, and connections are not necessarily blocked forever; sooner or later they become available again.

Take the 1-minute timeout configured above. An operation that has already waited a full minute will not mind waiting a little longer until it can complete.

 

Instead, we have to find the real culprit: the code that holds a connection for a long time.

 

		<!-- Reclaim connections held longer than removeAbandonedTimeout (in seconds) -->
		<property name="removeAbandoned" value="true" />
		<property name="removeAbandonedTimeout" value="10" />
		<!-- Log an error when an abandoned connection is closed -->
		<property name="logAbandoned" value="true" />

 

 

Use removeAbandoned to find out which code holds connections for a long time, and choose the timeout to suit your own business.
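Roughly speaking, this kind of check works by remembering when and where each connection was borrowed. The sketch below shows the idea only; `BorrowTracker` and its methods are hypothetical names of my own, not the pool's real implementation, and the clock is passed in as a parameter to keep the example simple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tracker: remembers when and where each connection was borrowed.
class BorrowTracker {
    private static final class Borrow {
        final long atMillis;   // borrow time
        final Throwable site;  // call stack captured at borrow time
        Borrow(long atMillis, Throwable site) {
            this.atMillis = atMillis;
            this.site = site;
        }
    }

    private final Map<String, Borrow> borrowed = new ConcurrentHashMap<>();
    private final long abandonedTimeoutMillis;

    BorrowTracker(long abandonedTimeoutMillis) {
        this.abandonedTimeoutMillis = abandonedTimeoutMillis;
    }

    void onBorrow(String connectionId, long nowMillis) {
        borrowed.put(connectionId, new Borrow(nowMillis, new Throwable("borrowed here")));
    }

    void onReturn(String connectionId) {
        borrowed.remove(connectionId);
    }

    // Returns the ids of connections held longer than the timeout as of nowMillis.
    List<String> findAbandoned(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Borrow> e : borrowed.entrySet()) {
            if (nowMillis - e.getValue().atMillis > abandonedTimeoutMillis) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    // Where the connection was borrowed; this is the kind of stack trace an abandoned-connection log would show.
    StackTraceElement[] borrowSite(String connectionId) {
        Borrow b = borrowed.get(connectionId);
        return b == null ? new StackTraceElement[0] : b.site.getStackTrace();
    }
}
```

The captured stack trace is the useful part: it points at the business method that took the connection and never gave it back.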

For example, if your business runs many statistical queries and some are genuinely slow, you can configure a separate dataSource with its own settings for them. Still, I think there is basically no business that requires a customer to wait a full minute.

 

Of course, if maxWait never times out, or the timeout is too long, other problems appear: the upper-layer business stays blocked, which drags down the performance of the whole system.

For example, to fetch a result you call a query method. The DAO layer blocks while acquiring a connection until the timeout expires, so the business layer blocks as well. If such calls keep arriving, blocked requests pile up. What is actually blocked is mostly threads: the system either keeps creating new threads or hits its cap on the total number of threads. Either way, all business grinds to a halt.
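The standstill is easy to demonstrate with a bounded worker pool. In this sketch every worker thread blocks as if it were waiting for a connection, and a task that needs no database at all is stuck in the queue behind them. `BlockedWorkersDemo` is a hypothetical name, and a latch stands in for the connection finally being returned.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: a bounded worker pool whose threads are all stuck waiting for a connection.
class BlockedWorkersDemo {
    // Returns true if an unrelated task was starved while every worker waited for a connection.
    static boolean unrelatedTaskIsStarved(int workers) {
        ExecutorService business = Executors.newFixedThreadPool(workers);
        CountDownLatch allWaiting = new CountDownLatch(workers);
        CountDownLatch connectionFreed = new CountDownLatch(1); // stands in for a connection being returned
        try {
            for (int i = 0; i < workers; i++) {
                business.submit(() -> {
                    allWaiting.countDown();
                    try {
                        // Stands in for getConnection() blocking for up to maxWait.
                        connectionFreed.await(60, TimeUnit.SECONDS);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allWaiting.await(); // every worker thread is now occupied
            Future<String> unrelated = business.submit(() -> "no database needed");
            boolean starved = !unrelated.isDone(); // queued behind the blocked workers
            connectionFreed.countDown();           // the slow holder finally returns its connection
            unrelated.get(10, TimeUnit.SECONDS);   // only now can the queued task run
            return starved;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            business.shutdownNow();
        }
    }
}
```

The unrelated task cannot start until a worker is released, which is exactly how one slow connection holder can freeze business that never touches the database.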

 

How long the stall lasts is determined by maxWait, but the situation itself is very complex. Once your system reaches this point things get messy: every operation reports errors, and those errors are of no use to you. Even if the current operation throws an exception, other requests will still try to acquire a connection and block. Failing to get a connection is a symptom, not the cause of the problem.

 

Setting a maxWait timeout does have a benefit, of course. When the system returns to normal (the long-held connections are killed or finish), you are not left with a pile of blocked requests.

 

My recommended setup is two data source configurations, mainly to separate the operations that hold a connection for a long time from ordinary ones. If nothing in the system runs for a long time, one configuration is enough.
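The point of the split is isolation: slow work can exhaust its own pool without taking connections away from ordinary work. A minimal sketch of that idea follows, with semaphores standing in for the two data sources. `TwoPools`, the `Kind` enum and the pool sizes in the usage note are all assumptions for illustration.

```java
import java.util.concurrent.Semaphore;

// Hypothetical router: two independent permit sets stand in for two data sources.
class TwoPools {
    enum Kind { ORDINARY, LONG_RUNNING }

    private final Semaphore ordinary;
    private final Semaphore longRunning;

    TwoPools(int ordinaryMax, int longRunningMax) {
        this.ordinary = new Semaphore(ordinaryMax);
        this.longRunning = new Semaphore(longRunningMax);
    }

    private Semaphore pick(Kind kind) {
        return kind == Kind.LONG_RUNNING ? longRunning : ordinary;
    }

    // Non-blocking borrow from the pool that matches the kind of work.
    boolean tryBorrow(Kind kind) {
        return pick(kind).tryAcquire();
    }

    void giveBack(Kind kind) {
        pick(kind).release();
    }

    int available(Kind kind) {
        return pick(kind).availablePermits();
    }
}
```

For example, with `new TwoPools(20, 5)`, five slow statistical queries can use up every LONG_RUNNING permit while all 20 ORDINARY permits remain available.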

 

Tuning maxWait will neither fix your system nor help you identify the problem.

removeAbandonedTimeout is the setting that reveals which logic is holding a connection for a long time, and that logic is what caused the problem.

 

This setting also cuts off extremely slow operations so that they do not disrupt normal ones. During development, set the value as small as you can so that unreasonable operations are exposed sooner.

In production the value can be set larger. The main difference is that a very large business volume can itself cause some connection delays.

 

I thought about all this while testing how thread pool size relates to program performance.

I created 10,000 threads to execute the same SQL concurrently against a pool with a maximum of 20 connections. The whole run took 70 seconds.

The maxWait timeout was set to 60 seconds.
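A back-of-the-envelope model shows why some of these requests are bound to time out. Assume, purely for illustration, that all 10,000 requests arrive at once, that they are served in batches of 20, and that every query holds its connection for the same time. 70 seconds over 500 batches gives about 140 ms per query; that figure is my assumption, not a measurement from the test. `QueueEstimate` is a hypothetical helper.

```java
// Hypothetical estimate under an idealised model: simultaneous arrival, equal query times.
class QueueEstimate {
    // Counts requests whose wait for a connection would exceed maxWaitMillis.
    static long timedOut(long requests, long poolSize, long queryMillis, long maxWaitMillis) {
        long timedOut = 0;
        long rounds = (requests + poolSize - 1) / poolSize; // batches of poolSize requests
        for (long round = 0; round < rounds; round++) {
            long waitMillis = round * queryMillis;          // time queued before this batch starts
            if (waitMillis > maxWaitMillis) {
                timedOut += Math.min(poolSize, requests - round * poolSize);
            }
        }
        return timedOut;
    }
}
```

Under these assumptions, `QueueEstimate.timedOut(10000, 20, 140, 60000)` returns 1420: the last 71 batches would still be queued when the 60-second limit passes, even though nothing is wrong with the program.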

 

My thinking was this: some requests did wait longer than 60 seconds, but queuing delays under high concurrency are perfectly normal. It is normal for a connection pool to struggle with concurrency this high, and an exception should not be thrown upward just because the wait ran about 10 seconds over the limit. If the program itself is sound, a timeout under high concurrency is excusable, and discarding the operation outright leads to unpredictable results.

Therefore, do not casually set maxWait too short. Setting it too long, on the other hand, only interferes with other services when the system itself has a problem or is under high concurrency.

Keeping the business running is a decision for the higher layers of the system, as discussed in the previous article on building a more stable system with multithreading, rather than something to fuss over at the bottom data layer.

 

 

 

 

 

 

 

 

 

 

 

 

 

 
