dbcp重连失败问题排查记录(timeout设置)

背景

  前几天很悲剧,公司的oracle小型机据说是网卡驱动有问题,导致数据库整整挂了1个多小时,后来切换到了备份数据库上,可用性一下子跌倒了3个9。

  在后续的某一天,凌晨3点进行了网卡升级,又从备份库重新切回主库的时,测试过程中发现dbcp的自动重连部分机器有问题。

问题描述

整个过程出现了两种错误日志: 

异常1: 

Caused by: java.sql.SQLException: The Network Adapter could not establish the connection
        at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70)
        at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:131)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:197)
        at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:525)
        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:413)
        at oracle.jdbc.driver.PhysicalConnection.<init>(PhysicalConnection.java:508)
        at oracle.jdbc.driver.T4CConnection.<init>(T4CConnection.java:203)
        at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:33)
        at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:510)
        at com.alibaba.china.jdbc.SimpleDriver.connect(SimpleDriver.java:101)
        at org.apache.commons.dbcp.DriverConnectionFactory.createConnection(DriverConnectionFactory.java:38)
        at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582)
        at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1148)
        at org.apache.commons.dbcp.AbandonedObjectPool.borrowObject(AbandonedObjectPool.java:79)
        at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)
        at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044)
        at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:113)
        at org.springframework.orm.ibatis.SqlMapClientTemplate.execute(SqlMapClientTemplate.java:190)
        ... 49 more
Caused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection
        at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:328)
        at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:421)
        at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:630)
        at oracle.net.ns.NSProtocol.connect(NSProtocol.java:206)
        at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:966)
        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:292)
        ... 62 more
Caused by: java.net.ConnectException: Connection timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:525)
        at java.net.Socket.connect(Socket.java:475)
        at java.net.Socket.<init>(Socket.java:372)
        at java.net.Socket.<init>(Socket.java:186)
        at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:127)
        at oracle.net.nt.ConnOption.connect(ConnOption.java:126)
        at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:306)
        ... 67 more

异常2:

Caused by: org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, pool error Timeout waiting for idle object
        at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:114)
        at org.apache.commons.dbcp.BasicDataSource.getConnection(BasicDataSource.java:1044)
        at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:113)
        at org.springframework.orm.ibatis.SqlMapClientTemplate.execute(SqlMapClientTemplate.java:190)
        ... 49 more
Caused by: java.util.NoSuchElementException: Timeout waiting for idle object
        at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1134)
        at org.apache.commons.dbcp.AbandonedObjectPool.borrowObject(AbandonedObjectPool.java:79)
        at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)
        ... 52 more

下面再看一下,我们的dbcp配置: (关于dbcp的配置,可以先看下这两篇文章: dbcp基本配置和重连配置 ,   解读dbcp自动重连那些事  )

<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close"> 
	<property name="driverClassName" value="oracle.jdbc.OracleDriver" />
	<property name="url" value="${database.driver.url}" />
	<property name="username"><value>${database.driver.username}</value></property>
	<property name="password" value="${database.driver.password}" />
	<property name="maxActive"><value>14</value></property>
	<property name="initialSize"><value>1</value></property>
        <property name="maxWait"><value>60000</value></property>
        <property name="maxIdle"><value>14</value></property>
        <property name="minIdle"><value>1</value></property>
        <property name="removeAbandoned"><value>true</value></property>
        <property name="removeAbandonedTimeout"><value>180</value></property>
	<property name="timeBetweenEvictionRunsMillis"><value>60000</value></property>
	<property name="minEvictableIdleTimeMillis"><value>1800000</value></property>
	<property name="connectionProperties"><value>bigStringTryClob=true;clientEncoding=GBK;defaultRowPrefetch=50;serverEncoding=ISO-8859-1</value></property>
</bean>

这是oracle的配置,mysql配置时多了个validate配置。

问题分析 

1. 从日志分析来看,对应异常1的只在前面出现了百来次之后,后续的请求都直接抛出了异常2。

2. 从页面使用来看,每次页面返回502错误,一般是等待了60s左右。 这个和maxWait参数配置的60s接近。

3. 当时验证的两组集群,一组业务的机器没问题,全都自动重链ok的。 另一组中,有一台预发布机也能自动重连,但就几台提供正式访问的机器自动重连一直失败.

4. 针对自动重连失败的机器,netstat -nat | grep 1521查看链接为0

通过这几个现象,和查看了dbcp的源码,大致的一个猜想: 

1.  假设说原先的应用中active的链接数为maxActive

2.  在数据库进行关闭后,原先的连接池中的异常会拿到"Broken pipe"异常,从日志中分析也就只有几条,符合。

     如果正在使用的连接会抛出:java.sql.SQLException: 无法从套接字读取更多的数据。

     如果拿到链接正准备使用,会抛出: java.sql.SQLException: OALL8 处于不一致状态

3.  等这些链接,被淘汰完之后,新的请求会尝试调用makeObject()重新去创建链接,因为此时的数据库的服务已经关闭,(理论上应该是"Connection refused",直接是端口不可访问, 但这个估计和我们采用了虚拟ip做数据库热备有关,后面会进行虚拟ip到真实数据库的一个路由过程),所以最后这里抛出了“java.net.ConnectException: Connection timed out”

4.  因为数据库重启过程,机器始终是对外提供服务,而且流量还不小。所以很容易堆积这样正在执行makeObject()创建链接的请求,一旦超过了一个阀值后,后续的链接就会是一个block的过程,block时间到了60s后就直接抛出异常。到时具体看一下dbcp的源码

所以总结一下:

1.  数据库关闭一段时间后,请求阻塞在makeObject()创建链接过程,一直没有释放,导致后续的请求全都是处于block,根本没机会去makeObject()创建链接,即使这时数据库已经重启后恢复了,应用还是无法正常提供服务。

2.  一般的数据库临时性维护,比如重启调整配置,热备数据库切换,网络瞬间断开。这一类问题,基本上在下一次makeObject()时就可以创建好新的链接,不会处于一个长期的阻塞状态,所以自动重连基本没啥影响。

按照这样的解释,因自己能力有限,不能搭建这么一套热备的数据库环境,所以很难再次重现这样的情况。而且基本上数据库维护也就是2这样的情况,基本重链没啥问题。

看一下代码分析:

dbcp使用了common-pools,看一下每次从连接池中获取链接的过程:

public Object borrowObject() throws Exception {
        long starttime = System.currentTimeMillis();
        Latch latch = new Latch();
        byte whenExhaustedAction;
        long maxWait;
        synchronized (this) {
            // Get local copy of current config. Can't sync when used later as
            // it can result in a deadlock. Has the added advantage that config
            // is consistent for entire method execution
            whenExhaustedAction = _whenExhaustedAction;
            maxWait = _maxWait;

            // Add this request to the queue
            _allocationQueue.add(latch);                                                         

            // Work the allocation queue, allocating idle instances and
            // instance creation permits in request arrival order
            allocate();                                                                                      // 位置1
        }

        for(;;) {
            synchronized (this) {
                assertOpen();
            }

            // If no object was allocated from the pool above
            if(latch.getPair() == null) {
                // check if we were allowed to create one
                if(latch.mayCreate()) {
                    // allow new object to be created
                } else {
                    // the pool is exhausted
                    switch(whenExhaustedAction) {
                        case WHEN_EXHAUSTED_GROW:
                            省略.....
                        case WHEN_EXHAUSTED_FAIL:
                            省略.....
                        case WHEN_EXHAUSTED_BLOCK:                                       // 位置5
                            try {
                                synchronized (latch) {
                                    // Before we wait, make sure another thread didn't allocate us an object
                                    // or permit a new object to be created
                                    if (latch.getPair() == null && !latch.mayCreate()) {
                                        if(maxWait <= 0) {
                                            latch.wait();
                                        } else {
                                            // this code may be executed again after a notify then continue cycle
                                            // so, need to calculate the amount of time to wait
                                            final long elapsed = (System.currentTimeMillis() - starttime);
                                            final long waitTime = maxWait - elapsed;
                                            if (waitTime > 0)
                                            {
                                                latch.wait(waitTime);
                                            }
                                        }
                                    } else {
                                        break;
                                    }
                                }
                            } catch(InterruptedException e) {
                                Thread.currentThread().interrupt();
                                throw e;
                            }
                            if(maxWait > 0 && ((System.currentTimeMillis() - starttime) >= maxWait)) {
                                synchronized(this) {
                                    // Make sure allocate hasn't already assigned an object
                                    // in a different thread or permitted a new object to be created
                                    if (latch.getPair() == null && !latch.mayCreate()) {
                                        // Remove latch from the allocation queue
                                        _allocationQueue.remove(latch);
                                    } else {
                                        break;
                                    }
                                }
                                throw new NoSuchElementException("Timeout waiting for idle object");     // 位置6
                            } else {
                                continue; // keep looping
                            }
                        default:
                            throw new IllegalArgumentException("WhenExhaustedAction property " + whenExhaustedAction +
                                    " not recognized.");
                    }
                }
            }
            xxxx 
}


private synchronized void allocate() {
        if (isClosed()) return;

        // First use any objects in the pool to clear the queue
        for (;;) {
            if (!_pool.isEmpty() && !_allocationQueue.isEmpty()) {                                                 // 位置2
                Latch latch = (Latch) _allocationQueue.removeFirst();
                latch.setPair((ObjectTimestampPair) _pool.removeFirst());
                _numInternalProcessing++;
                synchronized (latch) {
                    latch.notify();
                }
            } else {
                break;
            }
        }

        // Second utilise any spare capacity to create new objects
        for(;;) {
            if((!_allocationQueue.isEmpty()) && (_maxActive < 0 || (_numActive + _numInternalProcessing) < _maxActive)) {             // 位置3
                Latch latch = (Latch) _allocationQueue.removeFirst();
                latch.setMayCreate(true);
                _numInternalProcessing++;                                                                                // 位置4
                synchronized (latch) {
                    latch.notify();
                }
            } else {
                break;
            }
        }
    }

说明: 

位置1: 每个数据库请求都会先调用allocate()方法

位置2: 先检查当前空闲的数据库连接,如果有直接拿一个返回,客户端就直接拿到了链接,因为我们没配置testOnBorrow参数,所以客户端会拿到一个Broken pipe的链接

位置3: 当原先没有空闲的链接,检查一下条件: _numActive + _numInternalProcessing < _maxActive。 注意我们的maxActive设置了14。 注意这里的_numInternalProcessing参数

位置4: 增加_numInternalProcessing计数,表明当前正在处理中的连接数。后续在makeObject成功后,就会执行_numInternalProcessing--操作,numInternalProcessing的意即为当前正在处理的连接数,计算maxActive也将其计算在内

位置5:就是当位置3的条件无法满足时,说白了就是_numInternalProcessing = _maxActive = 14时,后续的请求都会进入WHEN_EXHAUSTED_BLOCK,进行maxWait时间的等待,超过时间也就出现了对应的异常2的内容。

看一下makeObject创建链接过程的一些timeout时间设置:

因为从上面的分析来看,造成无法自动重链的问题主要是makeObject进行了一个阻塞动作。是否可以设置合理的socket参数,一般我们知道的主要是两个参数: connectTimeout ,  readTimeout。

这一类timeout设置,都应该石由对应的链接driver驱动提供。

mysql :  (com.mysql.jdbc.Driver)  

方式1: 直接jdbc url中指定参数

jdbc:mysql://127.0.0.1:3066/test?connectTimeout=3000&socketTimeout=60000

方式2: 通过dbcp传递properties参数

<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close"> 
        ..............
	<property name="connectionProperties"><value>connectTimeout=3000&socketTimeout=60000</value></property>
</bean>

具体参见文档: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-configuration-properties.html 

oracle : (oracle.jdbc.OracleDriver) , 没有找到对应的官方文档,通过反编译源码找到配置项

oracle.net.nt.ConnStrategy中定义:  

public void createSocketOptions(Properties paramProperties)
  {
    String str1 = null;
    String str2 = null;
    int i = 0;
    Enumeration localEnumeration = paramProperties.keys();
    while (localEnumeration.hasMoreElements())
    {
      str1 = (String)localEnumeration.nextElement();
      if (str1.equalsIgnoreCase("TCP.NODELAY"))
      {
        i = 1;
        str2 = paramProperties.getProperty("TCP.NODELAY").toUpperCase();
        if (str2.equals("NO"))
          this.socketOptions.put(new Integer(0), "NO");
        else
          this.socketOptions.put(new Integer(0), "YES");
      }
      else if (str1.equalsIgnoreCase("oracle.net.READ_TIMEOUT"))
      {
        str2 = paramProperties.getProperty("oracle.net.READ_TIMEOUT");
        this.socketOptions.put(new Integer(3), str2);
      }
      else if (str1.equalsIgnoreCase("oracle.net.CONNECT_TIMEOUT"))
      {
        str2 = paramProperties.getProperty("oracle.net.CONNECT_TIMEOUT");
        this.socketOptions.put(new Integer(2), str2);
      }
      else if (str1.equalsIgnoreCase("oracle.net.ssl_server_dn_match"))
      {
        str2 = paramProperties.getProperty("oracle.net.ssl_server_dn_match");
        this.socketOptions.put(new Integer(4), str2);
      }
      else if (str1.equalsIgnoreCase("oracle.net.wallet_location"))
      {
        str2 = paramProperties.getProperty("oracle.net.wallet_location");
        this.socketOptions.put(new Integer(5), str2);
      }
      else if (str1.equalsIgnoreCase("oracle.net.ssl_version"))
      {
        str2 = paramProperties.getProperty("oracle.net.ssl_version");
        this.socketOptions.put(new Integer(6), str2);
      }
      else if (str1.equalsIgnoreCase("oracle.net.ssl_cipher_suites"))
      {
        str2 = paramProperties.getProperty("oracle.net.ssl_cipher_suites");
        this.socketOptions.put(new Integer(7), str2);
      }
      else if (str1.equalsIgnoreCase("javax.net.ssl.keyStore"))
      {
        str2 = paramProperties.getProperty("javax.net.ssl.keyStore");
        this.socketOptions.put(new Integer(8), str2);
      }
      else if (str1.equalsIgnoreCase("javax.net.ssl.keyStoreType"))
      {
        str2 = paramProperties.getProperty("javax.net.ssl.keyStoreType");
        this.socketOptions.put(new Integer(9), str2);
      }
      else if (str1.equalsIgnoreCase("javax.net.ssl.keyStorePassword"))
      {
        str2 = paramProperties.getProperty("javax.net.ssl.keyStorePassword");
        this.socketOptions.put(new Integer(10), str2);
      }
      else if (str1.equalsIgnoreCase("javax.net.ssl.trustStore"))
      {
        str2 = paramProperties.getProperty("javax.net.ssl.trustStore");
        this.socketOptions.put(new Integer(11), str2);
      }
      else if (str1.equalsIgnoreCase("javax.net.ssl.trustStoreType"))
      {
        str2 = paramProperties.getProperty("javax.net.ssl.trustStoreType");
        this.socketOptions.put(new Integer(12), str2);
      }
      else if (str1.equalsIgnoreCase("javax.net.ssl.trustStorePassword"))
      {
        str2 = paramProperties.getProperty("javax.net.ssl.trustStorePassword");
        this.socketOptions.put(new Integer(13), str2);
      }
      else
      {
        if (!str1.equalsIgnoreCase("ssl.keyManagerFactory.algorithm"))
          continue;
        str2 = paramProperties.getProperty("ssl.keyManagerFactory.algorithm");
        this.socketOptions.put(new Integer(14), str2);
      }
    }
    if ((i == 0) && (!this.reuseOpt))
      this.socketOptions.put(new Integer(0), "YES");
  }

指定参数: 

<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close"> 
    ..............
	<property name="connectionProperties"><value>oracle.net.CONNECT_TIMEOUT=3000&oracle.net.READ_TIMEOUT=60000</value></property>
</bean>

最后

因为没法模拟热备的oracle环境,对虚ip机制也不了解。所以针对前面提到的makeObject针对数据库关闭时,不是返回"Connection refused",而是返回给我"Connection timed out"

最后看了下,PlainSocketImpl.c, socket的native的一些实现,苦于知识缺乏,以后有机会再研究一下。

异常列表:

if (connect_rv == JVM_IO_INTR) {
	    JNU_ThrowByName(env, JNU_JAVAIOPKG "InterruptedIOException",
			    "operation interrupted");
	} else if (errno == EPROTO) {
	    NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ProtocolException",
                           "Protocol error");
	} else if (errno == ECONNREFUSED) {
	    NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ConnectException",
                           "Connection refused");
	} else if (errno == ETIMEDOUT) {
	    NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "ConnectException",
                           "Connection timed out");
	} else if (errno == EHOSTUNREACH) {
            NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "NoRouteToHostException",
                           "Host unreachable");
        } else if (errno == EADDRNOTAVAIL) {
            NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "NoRouteToHostException",
                             "Address not available");
        } else if ((errno == EISCONN) || (errno == EBADF)) {
            JNU_ThrowByName(env, JNU_JAVANETPKG "SocketException",
                            "Socket closed");
	} else {
	    NET_ThrowByNameWithLastError(env, JNU_JAVANETPKG "SocketException", "connect failed");
	}
	return;
    }

还有线上通过Btrace抓取了下,oracle驱动运行时的timeout参数设置: 

@BTrace
public class OracleTimeoutTracer {

    @OnMethod(clazz = "oracle.net.nt.ConnStrategy", method = "createSocketOptions", location = @Location(value = Kind.ENTRY))
    public static void strategy(@Self Object self, Properties paramProperties) {
        println(str(paramProperties));
    }

    @OnMethod(clazz = "oracle.net.nt.TcpNTAdapter", method = "connect", location = @Location(value = Kind.ENTRY))
    public static void adapter(@Self Object self) {
        Field socketTimeoutField = field("oracle.net.nt.TcpNTAdapter", "sockTimeout");
        Field hostField = field("oracle.net.nt.TcpNTAdapter", "host");
        Field portField = field("oracle.net.nt.TcpNTAdapter", "port");
        int scoketTimeout = (Integer) get(socketTimeoutField, self);
        int port = (Integer) get(portField, self);
        String host = (String) get(hostField, self);

        println(strcat(strcat(strcat(strcat(strcat("host : ", host), ", port : "), str(port)), " , socketTimeout : "),
                       str(scoketTimeout)));
    }
}

输出结果:

{user=alibaba, password=xxxxx, bigStringTryClob=true, defaultRowPrefetch=50, protocol=thin}
host : 10.20.36.19, port : 1521 , socketTimeout : 0

貌似默认的超时时间为0,从oracle driver的代码来看,如果没有显示设置timeout参数,默认都是socket.connectTimeout = socket.setSoTimeout = 0;

socket connect连接时,对应的超时时间为0,按照文档描述是没有timeout限制。 但针对Ip,Port端口不可访问是立马就会知道是个"Connection refused",用telnet测试即可。

针对虚ip(ip漂移技术),下次有机会研究和验证下。

猜你喜欢

转载自agapple.iteye.com/blog/1024508