A case study of MySQL connection interruptions caused by a full TCP buffer

How do you track down a cause that lies outside MySQL itself?

Author: Gong Tangjie, a member of the Aikesheng DBA team. Mainly responsible for MySQL technical support; specializes in MySQL, PostgreSQL, and domestic databases.

Produced by the Aikesheng open source community. Original content may not be used without authorization; for reprinting, please contact the editor and cite the source.

This article is about 1,200 words and is expected to take 3 minutes to read.

Background

While running batch tasks, the application hit a problem: some tasks suddenly lost their database connection and could not finish. The database error log contained "Aborted connection" messages, which indicate that communication between the client and the server was interrupted abnormally.

Analysis

To find the cause, we started from experience and listed the common situations that lead to an aborted connection:

  1. The client did not close the connection properly, that is, it did not call the mysql_close() function.
  2. The client was idle for longer than the number of seconds set by wait_timeout or interactive_timeout, so the server disconnected it automatically.
  3. A packet sent or received by the client exceeded the value of max_allowed_packet, so the connection was interrupted.
  4. The client tried to access the database without the required privileges, used a wrong password, or sent a connection packet that did not contain the correct information.

However, none of these applied to the problem at hand. The tasks had run normally before and the program had not changed, which rules out the first cause. wait_timeout and interactive_timeout were both 28800 seconds (8 hours), far longer than any task runs, which rules out the second. max_allowed_packet was 64M on both the client and the server, a limit the tasks were unlikely to exceed, which rules out the third. Finally, we confirmed that the client's privileges, password, and connection packet were all correct, which rules out the fourth.
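The two parameter-driven checks above can be written down as a small sketch. The function name, the task duration, and the packet size are hypothetical values chosen for illustration; only the variable names and their reported values (28800 and 64M) come from the investigation.

```python
def rule_out_causes(variables, task_seconds, largest_packet_bytes):
    """Return {cause: True if it can be ruled out} for the parameter-driven causes.

    `variables` maps server variable names to their values as strings,
    the way a SHOW VARIABLES result would present them.
    """
    idle_limit = min(int(variables["wait_timeout"]),
                     int(variables["interactive_timeout"]))
    packet_limit = int(variables["max_allowed_packet"])
    return {
        "idle timeout": task_seconds < idle_limit,
        "oversized packet": largest_packet_bytes < packet_limit,
    }


# Values reported in the article.
server_variables = {
    "wait_timeout": "28800",
    "interactive_timeout": "28800",
    "max_allowed_packet": str(64 * 1024 * 1024),
}

# Hypothetical workload figures: a one-hour task with packets of at most 1 MiB.
result = rule_out_causes(server_variables,
                         task_seconds=3600,
                         largest_packet_bytes=1024 * 1024)
print(result)
```

With these inputs both causes come back as ruled out, which mirrors the reasoning in the paragraph above.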

At this point, our preliminary conclusion was that MySQL itself was not at fault and that the problem lay elsewhere.

To narrow things down, we tried changing some network-related kernel parameters on the server:

net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 120
net.core.rmem_default = 2097152
net.core.wmem_default = 2097152
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_syn_backlog = 16384

These parameters are mainly meant to improve the performance and stability of network connections and to keep connections from being closed unexpectedly or timing out. The change made no difference, however: connections were still interrupted abnormally.
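When changing kernel parameters like these, it helps to confirm that the running kernel actually picked them up. The following is a minimal sketch, assuming a Linux host where sysctl values are exposed under /proc/sys; the helper names are hypothetical.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse 'name = value' lines (sysctl.conf style) into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def proc_path(name):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path("/proc/sys") / name.replace(".", "/")


def diff_against_kernel(wanted):
    """Return {name: (wanted, live)} for settings whose live value differs.

    `live` is None when the file does not exist (for example on a non-Linux host).
    """
    mismatches = {}
    for name, value in wanted.items():
        path = proc_path(name)
        live = " ".join(path.read_text().split()) if path.exists() else None
        if live != " ".join(value.split()):
            mismatches[name] = (value, live)
    return mismatches


wanted = parse_sysctl("""
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.tcp_keepalive_time = 120
""")
print(diff_against_kernel(wanted))
```

An empty dict means every listed setting is live; any entry shows the wanted value next to what the kernel currently reports.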

Finally, we turned to packet capture. In Wireshark we noticed something abnormal: the server was sending a large number of ACK packets to the client.

An ACK is a TCP acknowledgment: it confirms that its sender has received the peer's data up to a given sequence number and advertises how much more data it can accept. But why was the server sending so many of them? Our first guess was a network fault: if the client was not receiving the server's ACKs, the server would keep sending them until it timed out or got a response from the client. The network team investigated, however, and found no obvious problem.

Continuing through the capture, we found a second anomaly: the client was sending window warnings to the server, which was the side sending data.

These warnings relate to TCP flow control. They indicate that a receive window has filled up and that the receiver cannot accept more data for the moment.

[TCP Window Full] marks a segment from the sender that fills the receiver's advertised window, meaning the sender has reached the limit of what the receiver can take.

[TCP ZeroWindow] comes from the receiver: it advertises a window of zero, telling the sender that the receive buffer is full and that it should pause sending.

Based on this, we inferred the following cause: the amount of data MySQL had to send was so large that the client's TCP receive buffer filled up, so the server had to wait for the client to consume the buffered data before it could send more. In the meantime, the server kept probing the client to find out whether it could accept data again. If the blocked write makes no progress within a certain period (60 seconds by default), MySQL treats the write as timed out and aborts the connection.
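The mechanism can be reproduced on a single machine without MySQL. The sketch below is only an illustration under stated assumptions: a loopback TCP connection stands in for the client and server, the "reader" never calls recv(), and a 0.5 second socket timeout stands in for the server's write timeout. All names are hypothetical.

```python
import socket


def send_to_stalled_reader(write_timeout=0.5, chunk_size=64 * 1024,
                           max_bytes=256 * 1024 * 1024):
    """Write to a peer that never reads, until the write times out.

    Returns (bytes_handed_to_kernel, timed_out).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)

    # The reader plays the client: a small receive buffer that is never drained.
    reader = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    reader.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
    reader.connect(listener.getsockname())

    # The writer plays the server: it gives up when a write stalls too long.
    writer, _ = listener.accept()
    writer.settimeout(write_timeout)

    chunk = b"x" * chunk_size
    queued = 0
    timed_out = False
    try:
        while queued < max_bytes:
            writer.sendall(chunk)  # blocks once both kernel buffers are full
            queued += chunk_size
    except socket.timeout:
        timed_out = True
    finally:
        writer.close()
        reader.close()
        listener.close()
    return queued, timed_out


queued, timed_out = send_to_stalled_reader()
print(f"queued about {queued} bytes before stalling, timed_out={timed_out}")
```

The first chunks are absorbed by the kernel's send and receive buffers; once those are full the write blocks, and the timeout fires because the reader never frees any space. The exact byte count depends on the operating system's buffer sizing.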

To verify this hypothesis, we checked the MySQL slow log and found many records with Last_errno: 1161.

These records show that MySQL hit a timeout while sending data, and their number was very close to the number of failed tasks in the application. According to the official MySQL documentation, the error means:

Error number: 1161; Symbol: ER_NET_WRITE_INTERRUPTED; SQLSTATE: 08S01

Message: Got timeout writing communication packets
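Comparing the number of such records with the number of failed tasks is easy to script. This is a minimal sketch that assumes each slow log entry carries a "Last_errno: N" field as described above; the sample text is a made-up, stripped-down excerpt, since real entries contain many more fields.

```python
import re
from collections import Counter

# Matches the "Last_errno: N" field inside a slow log entry.
LAST_ERRNO = re.compile(r"Last_errno:\s*(\d+)")


def count_last_errno(slow_log_text):
    """Count how often each Last_errno value appears in the slow log text."""
    return Counter(int(m.group(1)) for m in LAST_ERRNO.finditer(slow_log_text))


# Hypothetical, reduced sample for illustration only.
sample = (
    "# Last_errno: 1161\n"
    "# Last_errno: 0\n"
    "# Last_errno: 1161\n"
)

counts = count_last_errno(sample)
print(counts[1161])  # 2
```

For a real log, read the file contents and pass them to count_last_errno(), then compare counts[1161] with the number of failed tasks.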

Error 1161 means that a network write timed out, and MySQL has a parameter that controls this: net_write_timeout. We raised net_write_timeout to 600, and the batch tasks ran normally.

So the MySQL connections were aborted because the amount of data fetched by the client was too large for the client's TCP buffer. The client had to process the buffered data first. During that time MySQL kept waiting to send the rest, but the client did not free up buffer space within 60 seconds, so MySQL timed out while sending data and aborted the connection.
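One way to apply such a change only to the affected batch connections is to set it per session. The article does not say whether the value was changed globally or per session, so treat the following as a sketch under that assumption. The function accepts any DB-API style cursor; RecordingCursor is a hypothetical stand-in so the example runs without a database.

```python
def set_net_write_timeout(cursor, seconds=600):
    """Raise net_write_timeout for the current session via a DB-API cursor."""
    seconds = int(seconds)
    if seconds < 1:
        raise ValueError("net_write_timeout must be at least 1 second")
    statement = f"SET SESSION net_write_timeout = {seconds}"
    cursor.execute(statement)
    return statement


class RecordingCursor:
    """Hypothetical stand-in for a real MySQL cursor; it only records statements."""

    def __init__(self):
        self.statements = []

    def execute(self, statement):
        self.statements.append(statement)


cursor = RecordingCursor()
set_net_write_timeout(cursor, 600)
print(cursor.statements)
```

With a real driver, the cursor would come from the batch job's own connection, and the call would go right after connecting and before the large query.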

Conclusion

Through the above analysis and attempts, we have reached the following conclusions:

  • The capture contains many ACK packets because the client's buffer was full and the client could not accept data in time. The server kept probing the client until more than 60 seconds had passed (the default value of net_write_timeout is 60), at which point MySQL aborted the connection.
  • The slow log contains many Last_errno: 1161 records because the SQL had in fact finished executing in MySQL. When the result was sent to the client, the amount of data exceeded the client's TCP buffer, and the client application did not process the buffered data within 60 seconds, so MySQL timed out while sending data to the client.
  • Adjusting net_write_timeout at the MySQL level only alleviates the symptom. The root cause is that a single SQL statement fetches more data than the client's buffer can hold, and the application cannot process the buffered data quickly enough, so subsequent sends time out.

Optimization suggestions

  • Process data in batches at the business level, so that a single SQL query does not pull a large amount of data from the server and exhaust the client's TCP buffer.
  • Increasing net_write_timeout in MySQL or enlarging the client's TCP buffer alleviates the situation but does not solve it completely, because an excessive amount of data still hurts performance and stability.
  • Optimize SQL statements so that they return only the data that is needed, for example with LIMIT and WHERE conditions or with aggregate and grouping functions. This reduces the amount of data and improves query efficiency.
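The first suggestion, processing data in batches, can be sketched with keyset pagination: each query fetches a bounded number of rows and continues from the last id seen. To keep the example runnable without a MySQL server, the standard library's sqlite3 module stands in for the database; the table and column names are hypothetical.

```python
import sqlite3


def iter_batches(conn, batch_size=1000):
    """Yield rows in batches, continuing after the last id already seen."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM batch_items "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the largest id in this batch


# Build a small in-memory table with 2500 rows (ids 1..2500).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batch_items (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO batch_items (payload) VALUES (?)",
    [(f"row-{i}",) for i in range(2500)],
)

sizes = [len(batch) for batch in iter_batches(conn, batch_size=1000)]
print(sizes)  # [1000, 1000, 500]
```

Because every result set is capped at batch_size rows, the client can consume each response promptly instead of leaving a large result sitting in its receive buffer. The same WHERE id > ... ORDER BY id LIMIT ... pattern works in MySQL, with the driver's own placeholder syntax.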

For more technical articles, please visit: https://opensource.actionsky.com/

About SQLE

SQLE is a comprehensive SQL quality management platform that covers SQL auditing and management from development through production. It supports mainstream open source, commercial, and domestic databases, provides process automation for development and operations teams, speeds up releases, and improves data quality.

Get SQLE

Type                                         Address
Repository                                   https://github.com/actiontech/sqle
Documentation                                https://actiontech.github.io/sqle-docs/
Release notes                                https://github.com/actiontech/sqle/releases
Data audit plugin development documentation  https://actiontech.github.io/sqle-docs/docs/dev-manual/plugins/howtouse

Origin my.oschina.net/actiontechoss/blog/11054532