Abstract
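This article records how we tracked down an abnormal TCP connection count. The main tool, used heavily throughout, was BTrace. The whole process was quite interesting, so I wrote it down.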
Introduction
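This article relies mainly on BTrace, which is very useful for diagnosing difficult problems in Java applications. It builds on Java's javaagent mechanism to investigate what happens inside a JVM dynamically, without modifying the application's code.
I had previously used BTrace to trace thread creation; this time it is used mainly to trace TCP connections.
A complete, detailed set of BTrace usage examples: https://github.com/gaoxingliang/goodutils/blob/master/btrace/btrace_usage.md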
BTrace github: https://github.com/btraceio/btrace
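Problem
One day, after a customer upgraded our software, the number of TCP connections became clearly abnormal.
Resolution steps
1. Initial assessment
At the customer's site we upgraded and downgraded several times and confirmed that the problem really did follow the upgrade. We checked the volume of data we report and found it essentially unchanged.
The logs showed no exceptions related to data reporting either. (When reporting data we make heavy use of HTTPS URLConnection.)
2. Initial diagnosis with BTrace
Since the problem is about TCP connections, why not simply count how often a connection is made and who makes it?
The BTrace script is as follows: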
package btrace;

import com.sun.btrace.AnyType;
import com.sun.btrace.BTraceUtils;
import com.sun.btrace.annotations.*;

import java.util.concurrent.atomic.AtomicInteger;

import static com.sun.btrace.BTraceUtils.*;

/**
 * Monitor the socket creation stats
 *
 * reference :
 * Monitor using java or aop:
 * https://www.javaspecialists.eu/archive/Issue169.html
 *
 * Monitor using BTrace:
 * https://dzone.com/articles/socket-monitoring-now-using
 */
@BTrace
public class MonitorSocket {
    static AtomicInteger doConnectCalled = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger connectCalled = BTraceUtils.newAtomicInteger(0);

    // connectToAddress(InetAddress address, int port, int timeout)
    @OnMethod(
        clazz="/java\\.net\\.AbstractPlainSocketImpl/",
        method="/.*/"
    )
    public static void anyConnect(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        // print the threadName
        if (BTraceUtils.startsWith(pmn, "connect")) {
            println("connect with thread " + BTraceUtils.currentThread());
            BTraceUtils.printArray(args);
            // this will printout current threadDump
            Threads.jstack();
            BTraceUtils.incrementAndGet(connectCalled);
        } else if (BTraceUtils.startsWith(pmn, "doConnect")) {
            /**
             * doConnect is a subcall of connect method in AbstractPlainSocketImpl
             */
            // println("doConnect " + BTraceUtils.currentThread());
            // BTraceUtils.printArray(args);
            // Threads.jstack();
            incrementAndGet(doConnectCalled);
        }
    }

    /**
     * print the metrics every 10 seconds
     */
    @OnTimer(10000)
    public static void stat() {
        println(BTraceUtils.timestamp("yyyy-MM-dd' 'HH:mm:ss") + " StatSconnect=" + getAndSet(connectCalled, 0) + " doConnect=" + getAndSet(doConnectCalled, 0));
    }
}
Attach the script by adding the following option to the JVM startup arguments:
-javaagent:../btrace/btrace-agent.jar=script=../btrace/MonitorSocket.class,scriptOutputFile=../logs/btrace.log
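Comparing the btrace.log output before and after the upgrade showed that the upgraded version made far more connect calls, which confirmed once more that many more connections were being requested.
We can then print the full call stack from within the BTrace script: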
Threads.jstack();
That gave us the first stack trace:
connect with thread Thread[collector-reporter-cache-4-10,5,main]
[/xx.xx.xx.xx:443, 5000, ]
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java)
java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
java.net.Socket.connect(Socket.java:589)
sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:673)
sun.net.NetworkClient.doConnect(NetworkClient.java:175)
sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162)
com.santaba.agent.http.HttpClient._connect(HttpClient.java:514)
com.santaba.agent.http.HttpClient._connectAll(HttpClient.java:427)
com.santaba.agent.http.HttpClient.post(HttpClient.java:270)
com.santaba.agent.http.HttpClient.post(HttpClient.java:235)
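2. Line-by-line comparison
I compared the stack traces from the two versions in detail and found them identical, yet the number of connect calls at the bottom differed while the number of calls to the post method at the top was the same.
That suggested the approach: find out exactly where along the HTTP request's call stack the real connect calls go missing.
With BTrace it is again easy to count the calls to each method: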
import com.sun.btrace.AnyType;
import com.sun.btrace.BTraceUtils;
import com.sun.btrace.annotations.*;

import java.util.concurrent.atomic.AtomicInteger;

import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class MonitorSocket {
    static AtomicInteger doConnectCalled = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger connectCalled = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger postFeedStream = BTraceUtils.newAtomicInteger(0);
    // com.santaba.agent.http.HttpClient._connect
    static AtomicInteger _connectHttpClient = BTraceUtils.newAtomicInteger(0);
    // sun.net.www.protocol.http.HttpURLConnection.plainConnect
    static AtomicInteger _plainConnect = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger _plainConnect0 = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger _getNewHttpClient = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger _HttpsClientinit = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger _openServer = BTraceUtils.newAtomicInteger(0);
    static AtomicInteger _HttpsClientNew = BTraceUtils.newAtomicInteger(0);

    @OnMethod(
        clazz = "com.santaba.agent.http.AgentHttpService",
        method = "postFeedStream"
    )
    public static void postFeedStream(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        BTraceUtils.incrementAndGet(postFeedStream);
    }

    @OnMethod(
        clazz = "/com\\.santaba\\.agent\\.http\\.HttpClient/",
        method = "/.*/"
    )
    public static void _connectHttpClient(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        // print the threadName
        if (BTraceUtils.compare("_connect", pmn)) {
            incrementAndGet(_connectHttpClient);
        }
    }

    @OnMethod(
        clazz = "/sun\\.net\\.www\\.protocol\\.http\\.HttpURLConnection/",
        method = "/.*/"
    )
    public static void _plainConnect(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        // print the threadName
        if (BTraceUtils.compare("plainConnect", pmn)) {
            incrementAndGet(_plainConnect);
        }
        else if (BTraceUtils.compare("plainConnect0", pmn)) {
            incrementAndGet(_plainConnect0);
        }
    }

    // sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.g
    @OnMethod(
        clazz = "/sun\\.net\\.www\\.protocol\\.https\\.AbstractDelegateHttpsURLConnection/",
        method = "/.*/"
    )
    public static void getNewHttpClient(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        if (BTraceUtils.compare("getNewHttpClient", pmn)) {
            incrementAndGet(_getNewHttpClient);
        }
    }

    @OnMethod(
        clazz = "/sun\\.net\\.www\\.protocol\\.https\\.HttpsClient/",
        method = "/.*/"
    )
    public static void newHttpClient(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        if (BTraceUtils.compare("<init>", pmn)) {
            incrementAndGet(_HttpsClientinit);
        }
        else if (BTraceUtils.compare("New", pmn)) {
            incrementAndGet(_HttpsClientNew);
            println("_HttpsClientNew args:");
            BTraceUtils.printArray(args);
        }
    }

    // sun.net.www.http.HttpClient.openServer
    @OnMethod(
        clazz = "/sun\\.net\\.www\\.http\\.HttpClient/",
        method = "/.*/"
    )
    public static void openServer(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        if (BTraceUtils.compare("open", pmn)) {
            incrementAndGet(_openServer);
        }
    }

    // connectToAddress(InetAddress address, int port, int timeout)
    @OnMethod(
        clazz = "/java\\.net\\.AbstractPlainSocketImpl/",
        method = "/.*/"
    )
    public static void anyConnect(@ProbeClassName String pcn, @ProbeMethodName String pmn, AnyType[] args) {
        // print the threadName
        if (BTraceUtils.startsWith(pmn, "connect")) {
            println("connect with thread " + BTraceUtils.currentThread());
            BTraceUtils.printArray(args);
            // this will printout current threadDump
            Threads.jstack();
            BTraceUtils.incrementAndGet(connectCalled);
        }
        else if (BTraceUtils.startsWith(pmn, "doConnect")) {
            /**
             * doConnect is a subcall of connect method in AbstractPlainSocketImpl
             */
            // println("doConnect " + BTraceUtils.currentThread());
            // BTraceUtils.printArray(args);
            // Threads.jstack();
            incrementAndGet(doConnectCalled);
        }
    }

    /**
     * print the metrics every 10 seconds
     */
    @OnTimer(10000)
    public static void stat() {
        println(BTraceUtils.timestamp("yyyy-MM-dd' 'HH:mm:ss") + " StatSconnect=" + getAndSet(connectCalled, 0) + " doConnect=" +
            getAndSet(doConnectCalled, 0) +
            " postFeedStream=" + getAndSet(postFeedStream, 0) +
            " _connectHttpClient=" + getAndSet(_connectHttpClient, 0) +
            " _plainConnect=" + getAndSet(_plainConnect, 0) +
            " _plainConnect0=" + getAndSet(_plainConnect0, 0) +
            " _getNewHttpClient=" + getAndSet(_getNewHttpClient, 0) +
            " HttpsClient.New=" + getAndSet(_HttpsClientNew, 0) +
            " _newHttpClient=" + getAndSet(_HttpsClientinit, 0) +
            " _openServer=" + getAndSet(_openServer, 0));
    }
}
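This gave us the call count for every method along the stack. (Ignore the trailing openServer count: the code was wrong at the time because the package was not written correctly, and in practice its count matched that of doConnect.)
Comparing the output before and after the upgrade, note the counts for HttpsClient.New and _newHttpClient (the latter counts HttpsClient constructor calls): they are completely different.
So let's look at the implementation of HttpsClient.New: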
sun.net.www.protocol.https.HttpsClient#New(SSLSocketFactory, URL, HostnameVerifier, Proxy, boolean, int, HttpURLConnection)
static HttpClient New(SSLSocketFactory sf, URL url, HostnameVerifier hv,
                      Proxy p, boolean useCache,
                      int connectTimeout, HttpURLConnection httpuc)
    throws IOException
{
    if (p == null) {
        p = Proxy.NO_PROXY;
    }
    PlatformLogger logger = HttpURLConnection.getHttpLogger();
    if (logger.isLoggable(PlatformLogger.Level.FINEST)) {
        logger.finest("Looking for HttpClient for URL " + url +
            " and proxy value of " + p);
    }
    HttpsClient ret = null;
    if (useCache) {
        /* see if one's already around */
        ret = (HttpsClient) kac.get(url, sf);
        if (ret != null && httpuc != null &&
            httpuc.streaming() &&
            httpuc.getRequestMethod() == "POST") {
            if (!ret.available())
                ret = null;
        }
        if (ret != null) {
            // a few properties are set here; omitted
        }
    }
    if (ret == null) {
        ret = new HttpsClient(sf, url, p, connectTimeout);
    } else {
        SecurityManager security = System.getSecurityManager();
        if (security != null) {
            if (ret.proxy == Proxy.NO_PROXY || ret.proxy == null) {
                security.checkConnect(InetAddress.getByName(url.getHost()).getHostAddress(), url.getPort());
            } else {
                security.checkConnect(url.getHost(), url.getPort());
            }
        }
        ret.url = url;
    }
    ret.setHostnameVerifier(hv);
    return ret;
}
One check in this method is key, because it decides whether a new HttpsClient object has to be created:
/* see if one's already around */
ret = (HttpsClient) kac.get(url, sf);
The kac here is in fact a KeepAliveCache object.
So was the problem that kac was failing to return a cached connection?
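One way to put a number on this from the BTrace counters: every call to HttpsClient.New that does not end in a constructor call must have been served from kac. The sketch below is a hypothetical helper (KeepAliveHitRate is not part of BTrace or the JDK), it assumes that every counted HttpsClient constructor call originates from HttpsClient.New, and the numbers in main are made up for illustration rather than taken from this incident.

```java
// Hypothetical helper: estimates keep-alive cache hits from two BTrace counters.
final class KeepAliveHitRate {

    /** Calls to HttpsClient.New that did not construct a new HttpsClient. */
    static long cacheHits(long newCalls, long initCalls) {
        return Math.max(0, newCalls - initCalls);
    }

    /** Fraction of HttpsClient.New calls served from the cache, in [0, 1]. */
    static double hitRate(long newCalls, long initCalls) {
        if (newCalls <= 0) {
            return 0.0;
        }
        return (double) cacheHits(newCalls, initCalls) / newCalls;
    }

    public static void main(String[] args) {
        // Made-up counter values, for illustration only.
        System.out.println("mostly cached: " + hitRate(100, 5));
        System.out.println("never cached:  " + hitRate(100, 100));
    }
}
```

A hit rate near zero would mean that almost every request builds a new HttpsClient, and therefore opens a new TCP connection.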
3. KAC
I ran a Groovy script inside the JVM to list every entry currently held in kac:
StringBuilder info = new StringBuilder()
sun.net.www.http.HttpClient.kac.entrySet().forEach({ en ->
    k = en.getKey()
    v = en.getValue()
    info.append(String.format("[host=%s,protocol=%s,port=%s,obj=%s]=[size=%s]\n", k.host, k.protocol, k.port, k.obj, v.size()))
    v.forEach({ t ->
        info.append(String.format("hc=%s,idleStartTime=%s;", t.hc, new Date(t.idleStartTime)))
    })
    info.append("\n")
})
println info.toString()
Before the upgrade:
[host=MaskIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@ce5a68e]=[size=5]
hc=sun.net.www.protocol.https.HttpsClient(https://MaskIP/santaba/api/reportData?version=27001&platform=windows&sender=agent&Maskurl98559b0sbagent),idleStartTime=Mon Jul 16 23:22:49 PDT 2018;
hc=sun.net.www.protocol.https.HttpsClient(https://MaskIP/santaba/api/reportData?version=27001&platform=windows&sender=agent&Maskurl98559b0sbagent),idleStartTime=Mon Jul 16 23:22:50 PDT 2018;
hc=sun.net.www.protocol.https.HttpsClient(https://MaskIP/santaba/api/reportData?version=27001&platform=windows&sender=agent&Maskurl98559b0sbagent),idleStartTime=Mon Jul 16 23:22:52 PDT 2018;
hc=sun.net.www.protocol.https.HttpsClient(https://MaskIP/santaba/api/getPendingRequests?version=27001&sender=agent&platform=windows&Maskurl98559b0sbagent),idleStartTime=Mon Jul 16 23:22:53 PDT 2018;
hc=sun.net.www.protocol.https.HttpsClient(https://MaskIP/santaba/api/reportActiveHosts?version=27001&platform=windows&sender=agent&Maskurl98559b0sbagent),idleStartTime=Mon Jul 16 23:22:53 PDT 2018;
[host=10.0.3.2,protocol=https,port=-1,obj=com.vmware.vim25.mo.ssl.FilteredSSLSocketFactory@5bba354b]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/sdk),idleStartTime=Mon Jul 16 23:22:47 PDT 2018;
After the upgrade:
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@2a0af0c9]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:46 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@2d5ba0d2]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/getAlerts?version=27200&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:49 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@5f6b7adb]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:47 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@3a116cda]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:48 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@3b1919d]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:47 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@1b67dcf9]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:49 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@ca8e92d]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:46 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@1cd9a37d]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:48 PDT 2018;
[host=MASKIP,protocol=https,port=-1,obj=sun.security.ssl.SSLSocketFactoryImpl@2e8c66b]=[size=1]
hc=sun.net.www.protocol.https.HttpsClient(https://MASKIP/santaba/api/reportData?version=27200&platform=windows&sender=agent&company=Maskurlsbagent),idleStartTime=Mon Jul 16 23:27:48 PDT 2018;
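Looking closely, there is effectively no caching at all after the upgrade, because every key is different. The KeepAliveCache key is the URL (protocol, host, and port) plus the socket factory.
That makes the picture fairly clear: a different SocketFactory must be passed in on every request, so keep-alive never takes effect and the connection is re-established each time.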
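The effect can be reproduced with a small stand-alone sketch. It does not use the JDK's KeepAliveCache; Key below is a simplified, hypothetical stand-in that assumes the factory part of the key is compared by object identity, which is consistent with the dump above, where each factory instance got its own entry. The host name is a placeholder.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Simplified model of a keep-alive cache keyed by host, port, and factory instance.
final class KeepAliveKeyDemo {

    /** Hypothetical cache key: the factory is compared by identity, not by value. */
    static final class Key {
        final String host;
        final int port;
        final Object factory;

        Key(String host, int port, Object factory) {
            this.host = host;
            this.port = port;
            this.factory = factory;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            return host.equals(k.host) && port == k.port && factory == k.factory;
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port);
        }
    }

    /** Returns how many of the requests had to open a new connection. */
    static int newConnections(int requests, boolean reuseFactory) {
        Map<Key, String> cache = new HashMap<>();
        Object sharedFactory = new Object();
        int opened = 0;
        for (int i = 0; i < requests; i++) {
            // Either reuse one factory or create a fresh one per request.
            Object factory = reuseFactory ? sharedFactory : new Object();
            Key key = new Key("example.invalid", 443, factory);
            if (cache.get(key) == null) {
                opened++;
                cache.put(key, "connection-" + opened);
            }
        }
        return opened;
    }

    public static void main(String[] args) {
        System.out.println("shared factory:      " + newConnections(5, true));
        System.out.println("factory per request: " + newConnections(5, false));
    }
}
```

With a shared factory only the first request misses the cache; with a fresh factory per request every lookup misses, even though host and port never change.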
In the end the cause was our own code, which, in order to ignore SSL-related errors, reset the default factory on every request:
HttpsURLConnection httpsConn = (HttpsURLConnection)conn;
httpsConn.setSSLSocketFactory((SSLSocketFactory)SSLSocketFactory.getDefault());
And this SSLSocketFactory.getDefault() call creates a new factory every time (at least on the JDK version we were running).
Detailed call chain:
javax.net.ssl.SSLSocketFactory#getDefault ->
javax.net.ssl.SSLContext#getSocketFactory ->
sun.security.ssl.SSLContextImpl#engineGetSocketFactory
protected SSLSocketFactory engineGetSocketFactory() {
    if (!this.isInitialized) {
        throw new IllegalStateException("SSLContextImpl is not initialized");
    } else {
        return new SSLSocketFactoryImpl(this);
    }
}
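A minimal sketch of one possible fix, assuming the application only needs the JVM's default factory: resolve the factory once and hand the same instance to every connection, so the keep-alive cache key stays stable. SharedSslFactory and its methods are hypothetical names used for illustration, not the code that actually shipped.

```java
import java.net.URLConnection;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLSocketFactory;

// Hypothetical holder that keeps a single SSLSocketFactory instance per JVM.
final class SharedSslFactory {

    // Resolved once at class initialization; every caller sees the same object.
    private static final SSLSocketFactory SHARED =
            (SSLSocketFactory) SSLSocketFactory.getDefault();

    private SharedSslFactory() {
    }

    static SSLSocketFactory get() {
        return SHARED;
    }

    /** Applies the shared factory to HTTPS connections; other connections are left untouched. */
    static void apply(URLConnection conn) {
        if (conn instanceof HttpsURLConnection) {
            ((HttpsURLConnection) conn).setSSLSocketFactory(SHARED);
        }
    }
}
```

Calling SharedSslFactory.apply(conn) in place of the setSSLSocketFactory line above would keep the factory identical across requests. If the factory does not need to be customized at all, simply not calling setSSLSocketFactory may be enough, but that is worth verifying against the JDK in use.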
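Summary
1. BTrace's non-intrusive tracing is very powerful: even for Sun's internal code, we can see exactly how execution proceeds and which arguments each method receives.
2. Java's HTTP keep-alive implementation depends on the URL plus the socket factory, not on the URL alone.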