一 遇到的问题: 项目用的play框架,数据库DB2, 持久化框架是Mybatis, 连接池用的是Mybatis原生的,遇到的问题是:有时候抛出如下异常:
play.api.UnexpectedException: Unexpected exception[CompletionException: org.apache.ibatis.exceptions.PersistenceException: ### Error querying database. Cause: com.ibm.db2.jcc.am.DisconnectNonTransientConnectionException: [jcc][t4][2030][11211][4.23.42] A communication error occurred during operations on the connection's underlying socket, socket input stream, or socket output stream. Error location: Reply.fill() - socketInputStream.read (-1). Message: Connection timed out (Write failed). ERRORCODE=-4499, SQLSTATE=08001 ### The error may exist in mapper/mapper.xml ### The error may involve defaultParameterMap ### The error occurred while setting parameters ### SQL: select tmpifro.* from (SELECT MAX(nvl(ifm.opportunity_id,'')) AS opportunity_name, MAX(nvl(IFM.restricted, '')) AS restricted, MAX(nvl(ifc.client_id, '')) AS client_id, MAX(nvl(ifc.disp_name, '')) AS client_name, MAX(nvl(ifc.disp_alt_name, '')) AS client_alt_name, MAX(nvl(ifc.issuing_iot_name, '')) AS issuing_iot, MAX(ifm.oppty_close_date) AS decision_date, MAX(nvl(ifss.sales_stage_name, '')) AS opportunity_sales_stage, MAX(nvl(ifss.sales_stage_id, '')) AS opportunity_sales_stage_id, nvl(ifm.opportunity_id, '') AS opportunity_id, nvl(sum(ifrm.manager_amount_usd), 0) AS total_ts_amount, nvl(sum(ifrm.manager_won_amount), 0) + nvl(sum(ifrm.manager_sol_amount), 0) + nvl(sum(ifrm.manager_atr_amount), 0) AS manager_won_solid_atrisk, nvl(sum(ifrm.manager_key_amount), 0) AS man_key_stretch, nvl(sum(ifrm.manager_str_amount), 0) AS man_stretch, nvl(sum(ifrm.manager_won_amount), 0) AS man_won, nvl(sum(ifrm.manager_sol_amount), 0) AS man_solid, nvl(sum(ifrm.manager_atr_amount), 0) AS man_atrisk, nvl(sum(ifrm.manager_nir_amount), 0) AS man_nir, nvl(sum(ifrm.manager_cq_amount), 0) AS manager_cq_amount, nvl(sum(ifrm.manager_nq_amount), 0) AS manager_nq_amount, MAX(nvl(o.DESCRIPTION, '')) AS opportunity_description, MAX(nvl(o.amount_usdollar, 0)) AS total_opportunity_amount, MAX(nvl(o.assigned_user_id, '')) AS assigned_user_id, MAX(nvl(o.assigned_bp_id, '')) AS assigned_bp_id, MAX(nvl(iw.overall_progress, 0)) AS win_plan_progress, MAX((case when (nvl(o.assigned_user_id, '') = '') then nvl(bp.name, '') else nvl(u.first_name, '') || ' ' || nvl(u.last_name, '') end)) AS opportunity_owner, MAX(nvl(iw.executive_sponsor, '')) AS executive_sponsor_id, MAX(nvl(u1.first_name, '') || ' ' || nvl(u1.last_name, '')) AS ibm_executive_sponsor, MAX(nvl(iw.snd_sponsor, '')) AS sd_sponsor_id, MAX(nvl(u2.first_name, '') || ' ' || nvl(u2.last_name, '')) AS ibm_sd_sponsor, MAX(nvl(PARENT_ACC.NAME, '')) AS account_name, MAX(nvl(stc.comments, '')) AS opportunity_steps_to_closure_summary FROM IBM_FORECAST_ROADMAP IFRM INNER JOIN ( SELECT DISTINCT F_ROADMAP_ID FROM IBM_FN_TO_ROADMAP IFTR WHERE IFTR.LOWEST_NODE_ID =? ) IFTR on IFTR.F_ROADMAP_ID = IFRM.ID INNER JOIN IBM_FDIM_ROADMAP IFM ON IFM.ID = IFRM.FDIM_ROADMAP_ID AND IFM.DELETED = 0 INNER JOIN IBM_FDIM_SALES_STAGE IFSS ON IFSS.ID = IFRM.FDIM_SALES_STAGE_ID AND IFSS.DELETED = 0 INNER JOIN IBM_FDIM_DATE MIFD ON MIFD.ID = IFRM.MANAGER_FDIM_DATE_ID AND MIFD.DELETED = 0 INNER JOIN IBM_FDIM_CMR IFC on IFC.ID = IFRM.FDIM_CMR_ID AND IFC.DELETED = 0 INNER JOIN OPPORTUNITIES O ON O.ID = ifm.OPPORTUNITY_ID AND O.deleted=0 AND O.BP_PRIVATE_OPPORTUNITY = '0' LEFT JOIN IBM_WINPLANS IW ON IW.OPPORTUNITY_ID = ifm.OPPORTUNITY_ID AND IW.DELETED = 0 LEFT JOIN IBM_STEPSTOCLOSURE STC ON STC.OPPORTUNITY_ID = ifm.OPPORTUNITY_ID AND STC.TYPE = 'OSC' AND STC.DELETED = 0 LEFT JOIN USERS U ON U.ID = O.ASSIGNED_USER_ID AND U.DELETED = 0 LEFT JOIN USERS U1 ON U1.ID = IW.EXECUTIVE_SPONSOR AND U1.DELETED = 0 LEFT JOIN USERS U2 ON U2.ID = IW.SND_SPONSOR AND U2.DELETED = 0 LEFT JOIN IBM_BUSINESSPARTNERS BP ON BP.ID = O.ASSIGNED_BP_ID AND BP.DELETED = 0 LEFT JOIN ACCOUNTS ACC ON IFC.client_id = ACC.ID AND ACC.DELETED = 0 LEFT JOIN ACCOUNTS PARENT_ACC ON ACC.PARENT_ID = PARENT_ACC.ID AND PARENT_ACC.DELETED = 0 WHERE IFRM.manager_amount_usd!='0' AND MIFD.QUARTERLY_TIMEPERIOD_ID = ? AND IFRM.REVENUE_TYPE = ? GROUP BY IFM.OPPORTUNITY_ID WITH UR) tmpifro order by total_ts_amount desc LIMIT ?,? WITH UR FOR FETCH ONLY ### Cause: com.ibm.db2.jcc.am.DisconnectNonTransientConnectionException: [jcc][t4][2030][11211][4.23.42] A communication error occurred during operations on the connection's underlying socket, socket input stream, or socket output stream. Error location: Reply.fill() - socketInputStream.read (-1). Message: Connection timed out (Write failed). ERRORCODE=-4499, SQLSTATE=08001] at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:251) at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:182) at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:320) at play.core.server.AkkaHttpServer$$anonfun$2.applyOrElse(AkkaHttpServer.scala:318) at scala.concurrent.Future.$anonfun$recoverWith$1(Future.scala:414)
很明显是: connection 和 DB的连接出问题了,正常情况下,一般过一会,连接好用了,就可以恢复,但是有一种情况是:连接一直不恢复: 用
netstat -anop 命令常看连接池状态发现,如下图: 其中,端口是50001的是与SSL连接,这些连接池中Send-Q中是4101的,表示socket连接的发送队列出现阻塞,连接有问题。
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp6 0 0 :::9000 :::* LISTEN 1/java off (0.00/0/0)
tcp6 0 4101 10.244.1.164:39236 XXXXXX:50001 ESTABLISHED 1/java on (104.87/15/0)
tcp6 0 4101 10.244.1.164:42786 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (18.34/15/0)
tcp6 0 4101 10.244.1.164:39214 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (51.62/12/0)
tcp6 0 4101 10.244.1.164:39218 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (105.38/10/0)
tcp6 0 4101 10.244.1.164:37600 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (5.09/8/0)
tcp6 0 4101 10.244.1.164:33852 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (33.45/10/0)
tcp6 0 0 10.244.1.164:39240 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java keepalive (26.85/0/7)
tcp6 0 4101 10.244.1.164:39250 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (26.02/15/0)
tcp6 0 4101 10.244.1.164:39256 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (41.38/14/0)
tcp6 0 4101 10.244.1.164:33860 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (41.89/15/0)
tcp6 0 0 10.244.1.164:34200 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java keepalive (40.17/0/5)
tcp6 0 0 10.244.1.164:39230 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java keepalive (1072.87/0/0)
tcp6 0 0 10.244.1.164:39224 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java keepalive (1990.37/0/0)
tcp6 0 4101 10.244.1.164:39210 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (105.89/14/0)
tcp6 0 0 10.244.1.164:9000 10.244.1.1:36294 TIME_WAIT - timewait (44.47/0/0)
tcp6 0 0 10.244.1.164:39262 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java keepalive (43.75/0/2)
tcp6 0 4101 10.244.1.164:34206 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (84.20/9/0)
tcp6 0 4101 10.244.1.164:33598 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (39.33/13/0)
tcp6 0 4101 10.244.1.164:33584 XX.XX.XXX.XXXX:50001 ESTABLISHED 1/java on (79.27/15/0)
在这种情况下:连接池中的这些连接其实都是有问题的,但是Mybatis每次都从连接池中得到一个connection,并不知道其是否好用。所以就一直阻塞。
二 原因:首先分析了Mybatis原生的连接池的源码,这里简单说一下:
PooledDataSource.java :: popConnection() 这个函数表示每次都去得到一个连接,
(1)首先,如果连接池不是空,就从连接池取一个,跳到(3);否则跳到(2)
(2)新建一个连接
(3)判断这个连接是否可用,如果可用,就返回这个连接;否则继续(1)
这里重点说一下上面标注红色的步骤,如何判断连接是否可用:这个调用的是PooledConnection.java::isValid(),
/* * Method to see if the connection is usable * * @return True if the connection is usable */ public boolean isValid() { return valid && realConnection != null && dataSource.pingConnection(this); }
重点是dataSource.pingConnection(this),如果mybatis里配置了poolPingEnabled,就会在拿到这个连接的时候,执行poolPingQuery,相当于检查数据库连接是正常,如果正常,就返回这个连接;否则关掉这个坏的连接,然后再去找其他的正常的空闲的连接。
关键我代码里面没有配置poolPingEnabled, 所以,根本没去检查空闲的连接是否好用。导致一直用坏的连接去执行SQL语句。
三 解决方案
(1) 如果继续使用mybatis原生的连接池,就配置如下属性:
poolPingEnabled - 默认值是false,当值为true的时候,将开启ping机制。 poolPingQuery - 对数据库进行ping时所使用的sql。 poolPingConnectionsNotUsedFor 默认值是0,单位是毫秒。我们不能在每次使用连接池之前,都使用ping机制, 这会使每一条sql的执行,都要额外执行一次ping语句。所以使用此属性来避免这种不合理做法。 我们只针对闲置时间超过某个时间的连接,进行ping。本例中的值为1小时,当从连接池中拿出的连接闲置超过1小时,才会对它进行ping。
但是这个的效率非常低下,每次执行一条SQL语句,都要执行一个检查的SQL, 效率不好。
(2) 使用DBCP的连接池:DBCP连接池功能强大,提供了四种检测DB连接是否正常的选择:
testOnCreate
testOnBorrow
testOnReturn
testWhileIdle
其中testWhileIdle ,是另外启线程,定时的去检测连接是否健康,这个比较实用,需要在Mybatis里进行如下配置:
<property name="testWhileIdle" value="true"/> 启用空闲连接健康检查机制 <property name="validationQueryTimeout" value="2"/> 如果执行检查的SQL 2秒没有结果,就认为连接不健康,就关掉 <property name="validationQuery" value="SELECT 1 FROM SYSIBM.SYSDUMMY1"/> 执行检查的SQL, 要一定有返回值,并且快! <property name="timeBetweenEvictionRunsMillis" value="20000"/> 多长时间检查一次 <property name="numTestsPerEvictionRun" value="20"/> 检查的时候,开多少个线程
最后,使用DBCP的 testWhileIdle, 可以在连接不健康的情况下,关闭掉不健康的连接。