Diagnosing an abnormal TCP connection count with BTrace

Abstract

This post records how I tracked down an abnormal TCP connection count, relying heavily on BTrace. The whole process was interesting enough to be worth writing down.

Introduction

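This post relies mainly on BTrace, which is very useful for diagnosing hard Java problems. It is built on Java's javaagent mechanism, which lets you investigate what is happening inside a JVM dynamically, without modifying the application.

I had used BTrace before to trace thread creation; this time I used it to trace TCP connections.

A complete, detailed set of BTrace usage examples: https://github.com/gaoxingliang/goodutils/blob/master/btrace/btrace_usage.md

BTrace on GitHub: https://github.com/btraceio/btrace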

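Problem

One day, after a customer upgraded our software, the TCP connection count became abnormal:

The TCP connection count was clearly abnormal after the upgrade.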

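Troubleshooting steps

1. Initial assessment

On the customer's system we upgraded and downgraded several times and confirmed that the problem followed the upgrade. The volume of data we reported had not changed.

The logs showed no errors related to reporting data either. (Our data reporting makes heavy use of HTTPS URLConnection.)

2. First diagnosis with BTrace

Since the problem is about TCP connections, why not simply count how often a connection is opened and see who is opening it?

The BTrace script: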

package btrace;

import com.sun.btrace.AnyType;
import com.sun.btrace.BTraceUtils;
import com.sun.btrace.annotations.*;

import java.util.concurrent.atomic.AtomicInteger;

import static com.sun.btrace.BTraceUtils.*;

/**
 * Monitor the socket creation stats
 *
 * reference :
 * Monitor using java or aop:
 * https://www.javaspecialists.eu/archive/Issue169.html
 *
 * Monitor using BTrace:
 * https://dzone.com/articles/socket-monitoring-now-using
 */
@BTrace
public class MonitorSocket {

    static AtomicInteger doConnectCalled = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger connectCalled = BTraceUtils.newAtomicInteger(0);


    // connectToAddress(InetAddress address, int port, int timeout)
    @OnMethod(
            clazz="/java\\.net\\.AbstractPlainSocketImpl/",
            method="/.*/"
    )
    public static void anyConnect(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        // print the threadName
        if (BTraceUtils.startsWith(pmn, "connect")) {
            println("connect with thread " + BTraceUtils.currentThread());
            BTraceUtils.printArray(args);

            // this will printout current threadDump
            Threads.jstack();
            BTraceUtils.incrementAndGet(connectCalled);
        } else if (BTraceUtils.startsWith(pmn, "doConnect")) {

            /**
             * doConnect is a subcall of connect method in AbstractPlainSocketImpl
             */
//            println("doConnect " + BTraceUtils.currentThread());
//            BTraceUtils.printArray(args);
//            Threads.jstack();
            incrementAndGet(doConnectCalled);
        }
    }

    /**
     * print the metrics every 10 seconds
     */
    @OnTimer(10000)
    public static void stat() {
        println(BTraceUtils.timestamp("yyyy-MM-dd' 'HH:mm:ss") + " StatSconnect=" + getAndSet(connectCalled, 0) + " doConnect=" + getAndSet(doConnectCalled, 0));
    }

}

Attach it by adding the following option to the JVM startup arguments:

-javaagent:../btrace/btrace-agent.jar=script=../btrace/MonitorSocket.class,scriptOutputFile=../logs/btrace.log

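Comparing the two versions, the number of connect calls was indeed much higher:

Before the upgrade:

After the upgrade:

This confirmed once more that far more connection attempts were being made.

Next, BTrace can print the full call stack: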

Threads.jstack();

That gave us the first stack trace:

connect with thread Thread[collector-reporter-cache-4-10,5,main]
[/xx.xx.xx.xx:443, 5000, ]
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java)
java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
java.net.Socket.connect(Socket.java:589)
sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:673)
sun.net.NetworkClient.doConnect(NetworkClient.java:175)
sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162)
com.santaba.agent.http.HttpClient._connect(HttpClient.java:514)
com.santaba.agent.http.HttpClient._connectAll(HttpClient.java:427)
com.santaba.agent.http.HttpClient.post(HttpClient.java:270)
com.santaba.agent.http.HttpClient.post(HttpClient.java:235)

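2. Line-by-line comparison

I compared the stack traces of the two versions in detail and found them identical. The number of post calls at the top was also the same, but the number of connect calls at the bottom was not.

That gave us a direction: somewhere along the call stack of an HTTP request, the old version skips the actual connect call, and we need to find where.

With BTrace again, counting the calls to each method is easy.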


import com.sun.btrace.AnyType;
import com.sun.btrace.BTraceUtils;
import com.sun.btrace.annotations.*;

import java.util.concurrent.atomic.AtomicInteger;

import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class MonitorSocket {

    static AtomicInteger doConnectCalled = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger connectCalled = BTraceUtils.newAtomicInteger(0);

    static AtomicInteger postFeedStream = BTraceUtils.newAtomicInteger(0);

    // com.santaba.agent.http.HttpClient._connect
    static AtomicInteger _connectHttpClient = BTraceUtils.newAtomicInteger(0);

    // sun.net.www.protocol.http.HttpURLConnection.plainConnect
    static AtomicInteger _plainConnect = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger _plainConnect0 = BTraceUtils.newAtomicInteger(0);


    static AtomicInteger _getNewHttpClient = BTraceUtils.newAtomicInteger(0);


    static AtomicInteger _HttpsClientinit = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger _openServer = BTraceUtils.newAtomicInteger(0);

    static AtomicInteger _HttpsClientNew = BTraceUtils.newAtomicInteger(0);


    @OnMethod(
            clazz = "com.santaba.agent.http.AgentHttpService",
            method = "postFeedStream"
    )
    public static void postFeedStream(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        BTraceUtils.incrementAndGet(postFeedStream);
    }

    @OnMethod(
            clazz = "/com\\.santaba\\.agent\\.http\\.HttpClient/",
            method = "/.*/"
    )
    public static void _connectHttpClient(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        // print the threadName
        if (BTraceUtils.compare("_connect", pmn)) {
            incrementAndGet(_connectHttpClient);
        }
    }

    @OnMethod(
            clazz = "/sun\\.net\\.www\\.protocol\\.http\\.HttpURLConnection/",
            method = "/.*/"
    )
    public static void _plainConnect(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        // print the threadName
        if (BTraceUtils.compare("plainConnect", pmn)) {
            incrementAndGet(_plainConnect);
        }
        else if (BTraceUtils.compare("plainConnect0", pmn)) {
            incrementAndGet(_plainConnect0);
        }
    }


    // sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.g
    @OnMethod(
            clazz = "/sun\\.net\\.www\\.protocol\\.https\\.AbstractDelegateHttpsURLConnection/",
            method = "/.*/"
    )
    public static void getNewHttpClient(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        if (BTraceUtils.compare("getNewHttpClient", pmn)) {
            incrementAndGet(_getNewHttpClient);
        }
    }


    @OnMethod(
            clazz = "/sun\\.net\\.www\\.protocol\\.https\\.HttpsClient/",
            method = "/.*/"
    )
    public static void newHttpClient(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        if (BTraceUtils.compare("<init>", pmn)) {
            incrementAndGet(_HttpsClientinit);
        }
        else if (BTraceUtils.compare("New", pmn)) {
            incrementAndGet(_HttpsClientNew);
            println("_HttpsClientNew args:");
            BTraceUtils.printArray(args);
        }
    }

    // sun.net.www.http.HttpClient.openServer
    @OnMethod(
            clazz = "/sun\\.net\\.www\\.http\\.HttpClient/",
            method = "/.*/"
    )
    public static void openServer(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        if (BTraceUtils.compare("open", pmn)) {
            incrementAndGet(_openServer);
        }
    }


    // connectToAddress(InetAddress address, int port, int timeout)
    @OnMethod(
            clazz = "/java\\.net\\.AbstractPlainSocketImpl/",
            method = "/.*/"
    )
    public static void anyConnect(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        // print the threadName
        if (BTraceUtils.startsWith(pmn, "connect")) {
            println("connect with thread " + BTraceUtils.currentThread());
            BTraceUtils.printArray(args);

            // this will printout current threadDump
            Threads.jstack();
            BTraceUtils.incrementAndGet(connectCalled);
        }
        else if (BTraceUtils.startsWith(pmn, "doConnect")) {

            /**
             * doConnect is a subcall of connect method in AbstractPlainSocketImpl
             */
//            println("doConnect " + BTraceUtils.currentThread());
//            BTraceUtils.printArray(args);
//            Threads.jstack();
            incrementAndGet(doConnectCalled);
        }
    }

    /**
     * print the metrics every 10 seconds
     */
    @OnTimer(10000)
    public static void stat() {
        println(BTraceUtils.timestamp("yyyy-MM-dd' 'HH:mm:ss") + " StatSconnect=" + getAndSet(connectCalled, 0) + " doConnect=" +
                getAndSet(doConnectCalled, 0) +
                " postFeedStream=" + getAndSet(postFeedStream, 0) +
                " _connectHttpClient=" + getAndSet(_connectHttpClient, 0) +
                " _plainConnect=" + getAndSet(_plainConnect, 0) +
                " _plainConnect0=" + getAndSet(_plainConnect0, 0) +
                " _getNewHttpClient=" + getAndSet(_getNewHttpClient, 0) +
                " HttpsClient.New=" + getAndSet(_HttpsClientNew, 0) +
                " _newHttpClient=" + getAndSet(_HttpsClientinit, 0) +
                " _openServer=" + getAndSet(_openServer, 0));
    }

}

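This gave us the call count for every method along the stack. (Ignore the final openServer count: the script had a mistake at the time, the package was written incorrectly. Its real count matched that of doConnect.)

Before the upgrade:

After the upgrade:

Note the counts for HttpsClient.New and _newHttpClient (the HttpsClient constructor): they are completely different.

So let's look at the implementation of HttpsClient.New: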

sun.net.www.protocol.https.HttpsClient#New(SSLSocketFactory, URL, HostnameVerifier, Proxy, boolean, int, HttpURLConnection)

    static HttpClient New(SSLSocketFactory sf, URL url, HostnameVerifier hv,
                          Proxy p, boolean useCache,
                          int connectTimeout, HttpURLConnection httpuc)
        throws IOException
    {
        if (p == null) {
            p = Proxy.NO_PROXY;
        }
        PlatformLogger logger = HttpURLConnection.getHttpLogger();
        if (logger.isLoggable(PlatformLogger.Level.FINEST)) {
            logger.finest("Looking for HttpClient for URL " + url +
                " and proxy value of " + p);
        }
        HttpsClient ret = null;
        if (useCache) {
            /* see if one's already around */
            ret = (HttpsClient) kac.get(url, sf);
            if (ret != null && httpuc != null &&
                httpuc.streaming() &&
                httpuc.getRequestMethod() == "POST") {
                if (!ret.available())
                    ret = null;
            }

            if (ret != null) {
                // property setup omitted here

            }
        }
        if (ret == null) {
            ret = new HttpsClient(sf, url, p, connectTimeout);
        } else {
            SecurityManager security = System.getSecurityManager();
            if (security != null) {
                if (ret.proxy == Proxy.NO_PROXY || ret.proxy == null) {
                    security.checkConnect(InetAddress.getByName(url.getHost()).getHostAddress(), url.getPort());
                } else {
                    security.checkConnect(url.getHost(), url.getPort());
                }
            }
            ret.url = url;
        }
        ret.setHostnameVerifier(hv);

        return ret;
    }

One check in here is the key to whether a new HttpsClient object gets created:

/* see if one's already around */
            ret = (HttpsClient) kac.get(url, sf);

Here kac is actually a KeepAliveCache object:

So could the cause be that kac was not returning a cached client?

3. KAC

I ran a Groovy script to list everything held in the kac object of the running JVM:

StringBuilder info = new StringBuilder()

sun.net.www.http.HttpClient.kac.entrySet().forEach({
    en -> k = en.getKey();
        v = en.getValue()
        info.append(String.format("[host=%s,protocol=%s,port=%s,obj=%s]=[size=%s]\n", k.host, k.protocol, k.port, k.obj, v.size()))
        v.forEach({
            t -> info.append(String.format("hc=%s,idleStartTime=%s;", t.hc, new Date(t.idleStartTime)))
        })

        info.append("\n")
})
println info.toString()

Before the upgrade:

[host=MaskIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@ce5a68e]=[size=5]
hc=sun.net.www.protocol.https.HttpsClient(https://MaskIP/santaba/api/reportData?version=27001&platform=windows&sender=agent&Maskurl98559b0sbagent),idleStartTime=Mon Jul 16 23:22:49 PDT 2018;
hc=sun.net.www.protocol.https.HttpsClient(https://MaskIP/santaba/api/reportData?version=27001&platform=windows&sender=agent&Maskurl98559b0sbagent),idleStartTime=Mon Jul 16 23:22:50 PDT 2018;
hc=sun.net.www.protocol.https.HttpsClient(https://MaskIP/santaba/api/reportData?version=27001&platform=windows&sender=agent&Maskurl98559b0sbagent),idleStartTime=Mon Jul 16 23:22:52 PDT 2018;
hc=sun.net.www.protocol.https.HttpsClient(https://MaskIP/santaba/api/getPendingRequests?version=27001&sender=agent&platform=windows&Maskurl98559b0sbagent),idleStartTime=Mon Jul 16 23:22:53 PDT 2018;
hc=sun.net.www.protocol.https.HttpsClient(https://MaskIP/santaba/api/reportActiveHosts?version=27001&platform=windows&sender=agent&Maskurl98559b0sbagent),idleStartTime=Mon Jul 16 23:22:53 PDT 2018;
[host=10.0.3.2,protocol=https,port=-1,obj=com.vmware.vim25.mo.ssl.FilteredSSLSocketFactory@5bba354b]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/sdk),idleStartTime=Mon Jul 16 23:22:47 PDT 2018;

After the upgrade:

[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@2a0af0c9]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:46 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@2d5ba0d2]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/getAlerts?version=27200&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:49 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@5f6b7adb]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:47 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@3a116cda]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:48 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@3b1919d]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:47 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@1b67dcf9]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:49 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@ca8e92d]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:46 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@1cd9a37d]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:48 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@2e8c66b]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:48 PDT 2018;

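Look closely and you can see that nothing is actually being reused from the cache, because every key is different. The KeepAliveCache key is the URL (protocol, host, port) plus the socket factory.

At this point the picture is fairly clear: the SocketFactory passed in is different every time, so keep-alive never takes effect and connections keep being re-established.

The root cause turned out to be in our own code: in order to ignore SSL-related errors, it reset the default factory on every request: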
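The effect of a per-request factory on such a cache can be sketched in a few lines. This is a deliberately simplified model, not the JDK's code: the `Key` class, `countNewConnections` and the host name are hypothetical, and comparing the factory by identity is an assumption that matches the dump above, where every distinct `SSLSocketFactoryImpl@...` instance got its own entry. The real cache also takes a client out while it is in use and puts it back afterwards, which this sketch ignores.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class KeepAliveKeyDemo {

    /** Hypothetical stand-in for the cache key: URL parts plus the factory object. */
    static final class Key {
        final String protocol;
        final String host;
        final int port;
        final Object factory;

        Key(String protocol, String host, int port, Object factory) {
            this.protocol = protocol;
            this.host = host;
            this.port = port;
            this.factory = factory;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            // Assumption: the factory is compared by identity, so two distinct
            // factory instances never match, even if they behave the same.
            return protocol.equals(k.protocol) && host.equals(k.host)
                    && port == k.port && factory == k.factory;
        }

        @Override
        public int hashCode() {
            return Objects.hash(protocol, host, port);
        }
    }

    /** Returns how many requests missed the cache and therefore needed a new connection. */
    static int countNewConnections(int requests, boolean shareFactory) {
        Map<Key, String> cache = new HashMap<>();
        Object shared = new Object();
        int created = 0;
        for (int i = 0; i < requests; i++) {
            // Either reuse one factory for all requests or create one per request.
            Object factory = shareFactory ? shared : new Object();
            Key key = new Key("https", "example.invalid", -1, factory);
            if (!cache.containsKey(key)) {
                cache.put(key, "connection-" + i);
                created++;
            }
        }
        return created;
    }

    public static void main(String[] args) {
        System.out.println("shared factory, new connections: " + countNewConnections(5, true));
        System.out.println("factory per request, new connections: " + countNewConnections(5, false));
    }
}
```

With a shared factory only the first request misses the cache; with a fresh factory per request every lookup misses, which is the pattern seen in the post-upgrade dump.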

HttpsURLConnection httpsConn = (HttpsURLConnection)conn;
            httpsConn.setSSLSocketFactory((SSLSocketFactory)SSLSocketFactory.getDefault());

And SSLSocketFactory.getDefault() creates a new factory on every call, at least on the JDK we were running.

Detailed call chain:
javax.net.ssl.SSLSocketFactory#getDefault ->
javax.net.ssl.SSLContext#getSocketFactory ->
sun.security.ssl.SSLContextImpl#engineGetSocketFactory
    protected SSLSocketFactory engineGetSocketFactory() {
        if (!this.isInitialized) {
            throw new IllegalStateException("SSLContextImpl is not initialized");
        } else {
            return new SSLSocketFactoryImpl(this);
        }
    }
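One way to avoid the problem, sketched here under assumptions rather than taken from the actual fix, is to obtain the factory once and reuse the same instance for every connection, so that the keep-alive cache key stays stable. The class name `SharedFactoryDemo`, the helper `open` and the `example.invalid` URLs are hypothetical. Code that needs a custom factory to ignore SSL errors could cache that factory in the same way.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLSocketFactory;

class SharedFactoryDemo {

    // Created once and reused, so every connection presents the same factory instance.
    private static final SSLSocketFactory SHARED_FACTORY =
            (SSLSocketFactory) SSLSocketFactory.getDefault();

    /** Prepares an HTTPS connection object that uses the shared factory. No network I/O happens here. */
    static HttpsURLConnection open(String url) {
        try {
            HttpsURLConnection conn =
                    (HttpsURLConnection) URI.create(url).toURL().openConnection();
            conn.setSSLSocketFactory(SHARED_FACTORY);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Quick check on the running JDK: does getDefault() hand out the same instance twice?
        System.out.println("getDefault() returns same instance: "
                + (SSLSocketFactory.getDefault() == SSLSocketFactory.getDefault()));

        HttpsURLConnection a = open("https://example.invalid/a");
        HttpsURLConnection b = open("https://example.invalid/b");
        System.out.println("connections share one factory: "
                + (a.getSSLSocketFactory() == b.getSSLSocketFactory()));
    }
}
```

The first line printed by `main` depends on the JDK in use, so it is worth running on the target JVM before concluding anything from it.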


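Summary

1. BTrace's non-intrusive approach is very powerful: even for Sun's internal code, we can see exactly what gets executed and with which arguments.

2. Java's HTTP keep-alive mechanism is keyed on URL + socket factory, not on the URL alone.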


Reposted from blog.csdn.net/scugxl/article/details/81081262