HikariCP reconnection failure problem

Phenomenon

The Oracle database crashed at around 22:30 on March 3 and restarted successfully about 5 minutes later. After the restart, one of the two business nodes (identically configured) reconnected automatically, while the other failed to reconnect.

Check

Analyze logs

I first looked at the logs. The node that reconnected successfully was serving requests normally, while the node that failed to reconnect was throwing a large number of timeout exceptions:

image.png

The log shows that the timeout occurred while getting a connection from HikariCP. For comparison, the error log of the node that later reconnected successfully, taken while the database was down, is as follows:

image.png

Here, too, acquiring a connection timed out, but a closer look reveals a difference: the log of the problematic node shows BaseHikariPool, while the log of the healthy node shows HikariPool. Investigating the code showed that two different versions of HikariCP had been pulled into the project: version 3.4.5, declared by the project itself, and version 2.3.13, brought in by Quartz. The problematic node had loaded version 2.3.13, in which the pool class was named BaseHikariPool. So why do two nodes started with the same configuration load different packages? This depends on the order in which the environment loads the jars; see the article listed under References.
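One quick way to confirm which jar a node actually loaded is to ask the JVM where a class came from. The following is a minimal sketch, not part of the original investigation: `ClassOrigin` and its method names are hypothetical, and it only assumes that `com.zaxxer.hikari.HikariDataSource` is on the classpath of the node being checked.

```java
import java.security.CodeSource;

// Hypothetical helper: reports the jar (or directory) a class was loaded from.
final class ClassOrigin {

    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes from the bootstrap loader (e.g. java.lang.String) have no code source.
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    static String locate(String className) {
        try {
            // initialize=false: we only want to know where the class lives.
            return locate(Class.forName(className, false, ClassOrigin.class.getClassLoader()));
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Prints the jar path, whose file name normally carries the version.
        System.out.println(locate("com.zaxxer.hikari.HikariDataSource"));
    }
}
```

Logging this value once at startup on each node would make a mismatch like the one above visible without comparing stack traces.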

View source code

Having established from the logs that the two nodes were running different versions of HikariCP, we compared the design of versions 2.3.13 and 3.4.5 to see whether 2.3.13 has a defect in how it obtains connections.

Version 2.3.13 obtains a connection from the pool as follows:

image.png

Version 3.4.5 obtains a connection from the pool as follows:

image.png

The flow in version 3.4.5 is basically the same as in version 2.3.13, with two small changes:

  • Removed the wait on AbstractQueuedLongSynchronizer
  • Added a SynchronousQueue and waiters for obtaining connections
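To picture the second change, here is a rough sketch of a direct hand-off through a `SynchronousQueue`: a thread returning a connection passes it straight to a thread that is waiting for one. This is a toy model under my own assumptions, not HikariCP's code; `HandoffSketch`, `borrow`, `giveBack` and the use of a `String` as the "connection" are all made up for illustration.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

// Toy model of a hand-off between a returning thread and a waiting borrower.
final class HandoffSketch {
    // A SynchronousQueue has no capacity: an element is transferred only
    // when a producer and a consumer meet. 'true' requests FIFO fairness.
    private final SynchronousQueue<String> handoff = new SynchronousQueue<>(true);

    // Waits up to timeoutMs for a returned connection; null means timeout.
    String borrow(long timeoutMs) {
        try {
            return handoff.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    // Waits up to timeoutMs for a borrower; false means nobody took it.
    boolean giveBack(String connection, long timeoutMs) {
        try {
            return handoff.offer(connection, timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // One borrower waits, one thread returns "conn-1"; the borrower receives it.
    static String demo() {
        HandoffSketch pool = new HandoffSketch();
        String[] received = new String[1];
        Thread borrower = new Thread(() -> received[0] = pool.borrow(5_000));
        borrower.start();
        pool.giveBack("conn-1", 5_000);
        try {
            borrower.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received[0];
    }
}
```

If nobody returns a connection within the timeout, `borrow` yields `null`, which corresponds to the acquisition timeout seen in the logs.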

This is just an optimization of the logic. Both flow charts show that if a connection turns out to be unusable, it is closed first and another one is acquired, so the acquisition logic itself is fine. Leaving network problems aside, only one situation can cause the pool acquisition timeout that appears in the log:

The connection pool is full and no connection is in the STATE_NOT_IN_USE state.

Look at the connection return logic

2.3.13:

image.png

A problem can be seen here. If the CAS operation on the state fails, the connection is never set back to STATE_NOT_IN_USE; its state is left as it was and only a warning is logged. If the pool is full and every connection is stuck this way, subsequent connection acquisitions will time out, which matches the situation described above.

3.4.5:

image.png

Version 3.4.5 sets the state to STATE_NOT_IN_USE directly, without a CAS, so later requests can still obtain the connection from the pool.
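The difference between the two return strategies can be sketched with a toy state machine. This is a simplified model based on the description above, not the HikariCP source: the class, the method names and the numeric state values are my own, and forcing an entry into RESERVED is just a hypothetical way to make the CAS fail (a `compareAndSet` from IN_USE fails only when the entry is not in IN_USE at that moment).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of a pool entry and the two return strategies.
final class ReturnPathSketch {
    // Illustrative state values for this model only.
    static final int NOT_IN_USE = 0;
    static final int IN_USE = 1;
    static final int RESERVED = -2;

    static final class Entry {
        final AtomicInteger state = new AtomicInteger(NOT_IN_USE);
    }

    // Borrow: claim the first idle entry with a CAS, or null if none is idle.
    static Entry borrow(Entry[] pool) {
        for (Entry e : pool) {
            if (e.state.compareAndSet(NOT_IN_USE, IN_USE)) {
                return e;
            }
        }
        return null; // a real pool would wait here until its timeout expires
    }

    // Conditional return (2.3.13-style as described): if the CAS fails,
    // the state is left untouched and the entry never becomes idle.
    static boolean returnWithCas(Entry e) {
        return e.state.compareAndSet(IN_USE, NOT_IN_USE);
    }

    // Unconditional return (3.4.5-style as described): always idle afterwards.
    static void returnUnconditionally(Entry e) {
        e.state.set(NOT_IN_USE);
    }

    public static void main(String[] args) {
        Entry[] pool = { new Entry() };
        Entry e = borrow(pool);
        e.state.set(RESERVED); // hypothetical interference from another code path
        System.out.println("CAS return succeeded: " + returnWithCas(e));
        System.out.println("borrow after CAS return: " + borrow(pool));
        returnUnconditionally(e);
        System.out.println("borrow after unconditional return: " + (borrow(pool) == e));
    }
}
```

In this model, once every entry in a full pool has missed its CAS, `borrow` can never succeed again, whereas the unconditional return always makes the entry available.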

Reproduce

Next I tried to reproduce the problem. I changed the HikariCP version in my local code to 2.3.13, pointed the local database connection at the test-environment database, started the program, and shut the database down after the program had been running for a while. A few minutes later I started the database again and sent a request; the connection pool re-acquired its connections. The problem could not be reproduced for now, which suggests that it only appears under heavy multi-threaded load. We will try again later with multiple threads.
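For the multi-threaded attempt, a small load harness along the following lines could be used. This is only a sketch under my own assumptions: `BorrowStress` and `run` are hypothetical names, and the harness works on any supplier of `AutoCloseable` resources so that it does not depend on a particular HikariCP version.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness: many threads borrow and close a resource in a loop.
final class BorrowStress {

    // Starts 'threads' workers, each borrowing 'iterations' times.
    // Returns how many attempts threw (e.g. a pool acquisition timeout).
    static int run(int threads, int iterations, Callable<? extends AutoCloseable> borrow) {
        AtomicInteger failures = new AtomicInteger();
        ExecutorService workers = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            workers.execute(() -> {
                for (int i = 0; i < iterations; i++) {
                    // try-with-resources closes (returns) the resource each time.
                    try (AutoCloseable resource = borrow.call()) {
                        // A real test would run a short query here.
                    } catch (Exception e) {
                        failures.incrementAndGet();
                    }
                }
            });
        }
        workers.shutdown();
        try {
            workers.awaitTermination(10, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failures.get();
    }
}
```

With a `javax.sql.DataSource` named `dataSource`, the call would look like `BorrowStress.run(50, 1000, dataSource::getConnection)`, started before the database is shut down and kept running through the restart.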

Summarize

For now, comparing the design of the two versions in the code, we can only infer that the connection pool was full and that the CAS that should have set each connection back to STATE_NOT_IN_USE failed. The real cause is still uncertain.

In general, monitoring and logging should be built into the system, and when a problem occurs the JVM information at that moment should be dumped promptly. Without enough information to analyze afterwards, it is impossible to find the root cause. In addition, watch out for version conflicts during development; otherwise inexplicable problems like this one will appear.
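As one example of capturing JVM state at the moment of failure, a thread dump can be taken from inside the process with the standard `java.lang.management` API. This is a minimal sketch; `ThreadDumpSketch` is a hypothetical name, and where the dump is triggered and written is left to the application.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Hypothetical helper: collects a thread dump as text.
final class ThreadDumpSketch {

    static String dump() {
        StringBuilder out = new StringBuilder();
        // true, true: also report locked monitors and ownable synchronizers.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
            // ThreadInfo.toString() prints the thread name, state and the top stack frames.
            out.append(info);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```

Calling `dump()` from the code path that catches the pool timeout would show what the other threads were doing with their connections at that moment.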

References

zhuanlan.zhihu.com/p/417119117

Origin juejin.im/post/7083666343458766878