Mysql报错：Got an error reading communication packets

一、故障说明

某次一线巡检过程中发现，上午8点、9点左右kafka消息积压数异常增加，且同时Mysql数据库主从同步异常，测试数据库远程登录报错：Host …… is blocked because of mang connections errors,unlock with mysqladmin flush-hosts
在这里插入图片描述

数据库mysqld.log一致刷报错：
在这里插入图片描述

2022-10-10T02:48:00.016084Z 1040571 [Note] Aborted connection 1040571 to db: 'dia user: ‘root’ host: ‘172.10.1.42’ (Got timeout reading communication packets)
2022-10-10T02:48:01.145844Z 1049518 [Note] Got an error reading communication packets
2022-10-10T02:48:03.435683Z 1049519 [Note] Got an error reading communication packets
2022-10-10T02:48:04.400894Z 1045516 [Note] Aborted connection 1045516 to db: ‘dia’ user: ‘root’ host: ‘172.10.1.42’ (Got timeout reading communication packets)
……
2022-10-10T02:48:06.152616Z 1049521 [Note] Got an error reading communication packets
2022-10-10T03:34:51.634286Z 1051314 [Warning] IP address ‘172.10.1.66’ could not be resolved: Name or service not known
2022-10-10T03:34:52.259080Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 7114ms. The settings might not be optimal. (flushed=9 and evicted=0, during the time.)
2022-10-10T03:34:53.295160Z 1051316 [Note] Got an error reading communication packets

二、分析

1、因主从失败同时发生，检查主从，测试验证报：Host …… is blocked because of mang connections errors,unlock with mysqladmin flush-hosts；即Mysql master侧错误连接数太多，需要重置连接计数器，否则后续链接blocked；登录Mysql master执行：flush hosts；忽略错误后完成后重启salve，观察主从重新同步恢复；

2、完成后观察kafka消息解压曲线，下降到0位线；在kafka服务器侧执行命令手动检查，消费积压未见明显异常：

./bin/kafka-consumer-groups.sh --bootstrap-server 172.10.1.17:9092 --group group-dialup --describe

3、11点左右，维护人员进行数据库临时备份，执行：

mysqldump -u root -p --databases dia >/home/dia.sql

这是发现kafka又出现大量消息积压，检查数据库，连接访问受阻，于是又重新重置连接计数，综上，判断本次故障原因就是无效连接数导致了数据库连接被阻塞，消费者无从从kafka消费消息写入数据库，导致消息瞬时积压。

4、检查数据库连接数情况，执行：

mysql> show variables like '%connect%';
mysql> show global status like '%abort%';

在这里插入图片描述
由上图可知，当前连接数并不高，但无效连接Aborted connects达到553530，上图的最大错误连接是修改后的，原值为3000；

其中，Aborted_clients 表连接的mysql客户端被连接被kill的数量，MYSQL认为读到了一个错误的包，并将该连接Aborted；相关资料显示可由–log-warnings控制，我们可以关闭它；官方解释如下，启用后：

Print out warnings such as Aborted connection… to the error log. This option is enabled (1) by default. To disable it, use --log-warnings=0. Specifying the option without a level value increments the current value by 1. Enabling this option by setting it greater than 0 is recommended, for example, if you use replication (you get more information about what is happening, such as messages about network failures and reconnections). If the value is greater than 1, aborted connections are written to the error log, and access-denied errors for new connection attempts are written. See Section B.3.2.10, “Communication Errors and Aborted Connections”.

If a replica server was started with –log-warnings enabled, the replica prints messages to the error log to provide information about its status, such as the binary log and relay log coordinates where it
starts its job, when it is switching to another relay log, when it reconnects after a disconnect, and so forth. The server logs messages about statements that are unsafe for statement-based logging if –log-warnings is greater than 0.

mysql> select @@log_warnings;
mysql> set global log_warnings=1;

Query OK, 0 rows affected

对于Aborted_connects错误的大概原因有：

客户端连接一个没有授权的数据库、密码错误、连接包信息错误、连接超时

对于 Aborted_clients的原因大概有：

客户端没有执行mysql_close()关闭、由于连接一直没有关闭导致时间超过wait_timeout或interactive_timeout这两个变量的值（这两个变量的超时时间是8小时）、客户端在程序执行过程中结束、max_allowed_packet包设的过小、网络原因、线程bug等

5、超时设置导致的潜在问题

mysql> show global variables like '%timeout%';
#会话级
mysql> show session variables like '%timeout%';
#全局变量，可以动态增加
mysql> set global interactive_timeout=28800;
#修改会话级，只对当前会话有效，wait_timeout只有同时修改global级interactive_timeout和wait_timeout才能生效，session下的wait_timeout在连接初始化时，继承global interactive_timeout参数值；session级别的wait_timeout对当前交互连接生效，新连接的session级别wait_timeout会使用global级别的interactive_timeout值覆盖，因为interactive_timeout值是对后续新连接生效
mysql> set session interactive_timeout=28800;
#永久修改
[mysqld]
wait_timeout=100
interactive_timeout=100

在这里插入图片描述

其中：connect_timeout在握手认证阶段（authenticate）起作用，interactive_timeout 和wait_timeout在连接空闲阶段（sleep）起作用，而net_read_timeout和net_write_timeout则是在连接繁忙阶段（query）或者网络出现问题时起作用。

wait_timeout表mysql在关闭一个非交互的连接之前所要等待的秒数，其取值范围为1-2147483(Windows)，1-31536000(linux)，默认值28800（即8h）。上图中当connections如果空闲超过30分钟，如果客户端不自己断开，也不做任何操作，Mysql为了节省资源会自动将其断开，而连接池并不知道该connection已经失效，如果这时有Client请求connection，DBCP将该失效的Connection提供给Client，将会造成异常。interactive_time ：指的是mysql在关闭一个交互的连接之前所要等待的秒数，其取值范围随wait_timeout变动，默认值28800。参数interactive_timeout和wait_timeout配置不当可能导致过多sleep进程；每个连接的建立都会消耗内存，所以MySQL会尽量释放闲置连接，从而释放资源；如果连接一直在累加而不断开，最终肯定会达到MySQL Server的连接上限数，这会报’too many connections’的错误。对于wait_timeout的值设定，应该根据系统的运行情况来判断。在系统运行一段时间后，可以通过show processlist命令查看当前系统的连接状态，如果发现有大量的sleep状态的连接进程，则说明该参数设置的过大，可以进行适当的调整小些。

在这里插入图片描述

注意： 线程启动的时候，wait_timeout 的值取自于全局设置的wait_timeout 或者全局的 interactive_timeout。交互式客户端指的是mysqladmin ，navicate 等工具。非交互式客户端，一般指的是jdbc等驱动等；对于interactive_timeout覆盖wait_timeout，只会在用户登陆的时候才出现.其覆盖操作定义在函数server_mpvio_update_thd中，其他任何情况都不会出现覆盖。一旦会话登陆成功如果想要会话级别修改超时参数，不管交互式还是非交互式都是修改wait_timeout(set wait_timeout)参数才会生效。即只有在新建立连接时，interactive_timeout 才有可能覆盖 wait_timeout的值，同时也只有新建连接时，interactive_timeout才有用。其他任何情况，控制空闲连接超时的都是wait_timeout；或非交互式的客户端连接的session将不会使用interactive_timeout覆盖掉wait_timeout，也就是interactive_timeout没有任何作用了。只有wait_timeout参数会传递到网络层设置，而interactive_timeout参数只会在会话登陆的时候判断是否是交互式客户端会话如果是则进行wait_timeout=interactive_timeout的覆盖，如果不是则不生效的。

现场采用hikari druid连接池；其中：

hirake连接池：springboot框架默认使用这个连接池，速度比druid快

druid德鲁伊连接池：阿里巴巴开发的连接池，对SQL语句进行监控、拦截的功能

Hikari连接池（CP）的配置参考：

# 不同数据源这四个配置都会用到
spring.datasource.url=jdbc:mysql://localhost:3306/test
spring.datasource.username=root
spring.datasource.password=123456
spring.datasource.driver-class-name=com.mysql.jdbc.Driver

#以下的配置项是hikari特有的配置
# 等待连接池分配连接的最大时长（毫秒），超过这个时长还没可用的连接则发生SQLException， 默认:30秒
spring.datasource.hikari.connection-timeout=30000
# 最小连接数
spring.datasource.hikari.minimum-idle=5
# 最大连接数
spring.datasource.hikari.maximum-pool-size=15
# 自动提交
spring.datasource.hikari.auto-commit=true
# 一个连接idle状态的最大时长（毫秒），超时则被释放（retired），默认:10分钟
spring.datasource.hikari.idle-timeout=600000
# 连接池名字
spring.datasource.hikari.pool-name=DatebookHikariCP
# 一个连接的生命时长（毫秒），超时而且没被使用则被释放（retired），默认:30分钟 1800000ms，建议设置比数据库超时时长少60秒
spring.datasource.hikari.max-lifetime=28740000
spring.datasource.hikari.connection-test-query=SELECT 1

#以下是针对MYSQL驱动的配置参数
# 在每个连接中缓存的语句的数量。默认值为保守值25。建议将其设置为250-500之间
spring.datasource.hikari.prepStmtCacheSize = 300
# 缓存的已准备SQL语句的最大长度，默认值是256，但是往往这个长度不够用
spring.datasource.hikari.prepStmtCacheSqlLimit = 2048
# 缓存开关，如果这里设置为false，上面两个参数都不生效
spring.datasource.hikari.cachePrepStmts = true
#较新版本的 MySQL 支持服务器端准备好的语句，这可以提供实质性的性能提升
spring.datasource.hikari.useServerPrepStmts = true

6、数据包大小：太小导致可能的异常连接

mysql> show variables like 'max_allowed_packet'; #下图为67M
#修改
set global max_allowed_packet = 10 * 1024 * 1024;
#永久生效
[mysqld]
max_allowed_packet = 200M

在这里插入图片描述

7、相关经验表明：

可能导致Got Timeout reading communication packets错误的原因有如下几个,更多参看Communication Errors说明：

A client attempts to access a database but has no privileges for it.（没有权限）

A client uses an incorrect password.(密码错误)

A connection packet does not contain the right information.(连接没有包含正确信息)

It takes more than connect_timeout seconds to obtain a connect packet. (获取连接信息起过connect_timeout的时长)

The client program did not call mysql_close() before exiting.(客户端没有调用mysql_close()函数)

The client had been sleeping more than wait_timeout or interactive_timeout seconds without issuing any requests to the server. (客户端的空连接时间过长，超过了wait_timeout和interactive_timeout的时间)

The client program ended abruptly in the middle of a data transfer.(数据传输过程中终结)

8、后检查发现，数据库又再次出现断联的情况，检查日志，发现，与数据库超时连接错误的为，消息入库的3台消费者服务器，如下所示：

在这里插入图片描述

这3台消费者使用了hikari连接池，配置如下：

本次故障中的Got an error reading报错表示：客户端已经成功连接，但是被异常终止了。可能出现在客户端侧（超时的设置），或者mysql侧（超过阈值）；但应侧重更大可能出现在客户端侧。调整客户端连接池参数尝试。