Dubbo Source Code Analysis, Part 8: Provider Thread Pool EXHAUSTED, Revisited

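In the earlier post "Dubbo Source Code Implementation, Part 6" we learned that a Provider in a Dubbo cluster has two thread pools: the IO thread pool (unbounded by default) and the business thread pool (200 threads by default). When business concurrency is high, or some business processing slows down, the business thread pool can easily fill up, and the Provider throws "RejectedExecutionException: Thread pool is EXHAUSTED!". This assumes, of course, that we have not configured a waiting queue for the Provider's thread pool.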
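To make the rejection concrete, here is a minimal sketch using only the JDK. It assumes the "no waiting queue" case can be modeled by a fixed-size ThreadPoolExecutor backed by a SynchronousQueue; the class and method names are made up for illustration and are not Dubbo code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ExhaustedPoolDemo {

    /** Submits blocking tasks to a fixed pool without a queue and returns how many were rejected. */
    static int countRejected(int threads, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<Runnable>()); // assumption: models "no waiting queue"
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    // Each task blocks until released, imitating slow business logic.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // all workers are busy and there is nowhere to queue
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected = " + countRejected(2, 5));
    }
}
```

With 2 threads and 5 slow tasks, the first two tasks occupy the workers and the remaining three are rejected immediately, which is the situation the Provider is in when its business pool is full.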

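The Provider has thrown an exception, signaling that it cannot take any more load. In production, however, we found that the Consumer does not react at all: the request it sent just sits there waiting until it times out. This can very easily lead to an "avalanche" across the whole system, because it violates the fail-fast principle. What we want is that, once the Provider cannot accept a request because its thread pool is full, the Consumer learns about it immediately and fails fast so that its thread is released. It later turned out that the Dispatcher was simply misconfigured: the default is all, and we should have set it to message.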
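The difference between failing fast and waiting for a timeout can be sketched from the Consumer's point of view. This is only an illustration built on the JDK's CompletableFuture, not Dubbo's own future type; the class name, method name, and timeout value are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class FailFastDemo {

    /**
     * Waits for a simulated reply the way a blocked consumer thread would.
     * If providerReplies is true, the pending call is completed with an error right away.
     */
    static String awaitReply(boolean providerReplies, long timeoutMillis) {
        CompletableFuture<String> pending = new CompletableFuture<>();
        if (providerReplies) {
            // The provider managed to send its rejection back to the consumer.
            pending.completeExceptionally(
                    new IllegalStateException("Thread pool is EXHAUSTED!"));
        }
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            return "failed fast: " + e.getCause().getMessage();
        } catch (TimeoutException e) {
            // No reply ever arrived, so the thread was held for the full timeout.
            return "timed out after " + timeoutMillis + " ms";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitReply(true, 200));
        System.out.println(awaitReply(false, 200));
    }
}
```

In the first call the waiting thread is released at once; in the second it is held for the whole timeout, which is what piles up Consumer threads when many calls hit an overloaded Provider.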

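Let's look at the source code to see what is really going on, assuming for now that Netty is the framework used for IO. As mentioned last time, NettyHandler, NettyServer, MultiMessageHandler, and HeartbeatHandler all implement the ChannelHandler interface to handle receiving, sending, connecting, disconnecting, and exceptions. These four handlers are currently invoked in order on the IO thread pool. But which handler comes after HeartbeatHandler? This is where the Dispatcher comes in. The Dispatcher is Dubbo's scheduler: it decides whether an operation runs on an IO thread or in the business thread pool. The official diagram is in the thread model documentation (http://dubbo.io/user-guide/demos/线程模型.html).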

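In the official diagram, the ThreadPool that follows the Dispatcher is the business thread pool discussed here. There are five Dispatcher types (all, direct, message, execution, and connection), and the default is all; the official documentation describes each of them.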

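Because the default is all, requests, responses, connect events, disconnect events, and heartbeats are all handed to the business thread pool. That clearly adds load to the business thread pool, which defaults to only 200 threads. Each Dispatcher has a corresponding ChannelHandler, and these ChannelHandlers link the handler invocations into a call chain. If all is configured, the next handler in the chain is AllChannelHandler; if message is configured, it is MessageOnlyChannelHandler. These ChannelHandlers are all subclasses of WrappedChannelHandler, which by default hands request, response, connect, disconnect, and heartbeat operations directly to the wrapped Handler: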

    protected final ChannelHandler handler;

    public void connected(Channel channel) throws RemotingException {
        handler.connected(channel);
    }

    public void disconnected(Channel channel) throws RemotingException {
        handler.disconnected(channel);
    }

    public void sent(Channel channel, Object message) throws RemotingException {
        handler.sent(channel, message);
    }

    public void received(Channel channel, Object message) throws RemotingException {
        handler.received(channel, message);
    }

    public void caught(Channel channel, Throwable exception) throws RemotingException {
        handler.caught(channel, exception);
    }

Clearly, if WrappedChannelHandler's behavior is used as is, the Handler is invoked on the current thread (here, an IO thread). Now let's look at how AllChannelHandler is implemented:

    public void connected(Channel channel) throws RemotingException {
        ExecutorService cexecutor = getExecutorService();
        try {
            cexecutor.execute(new ChannelEventRunnable(channel, handler, ChannelState.CONNECTED));
        } catch (Throwable t) {
            throw new ExecutionException("connect event", channel, getClass() + " error when process connected event .", t);
        }
    }

    public void disconnected(Channel channel) throws RemotingException {
        ExecutorService cexecutor = getExecutorService();
        try {
            cexecutor.execute(new ChannelEventRunnable(channel, handler, ChannelState.DISCONNECTED));
        } catch (Throwable t) {
            throw new ExecutionException("disconnect event", channel, getClass() + " error when process disconnected event .", t);
        }
    }

    public void received(Channel channel, Object message) throws RemotingException {
        ExecutorService cexecutor = getExecutorService();
        try {
            cexecutor.execute(new ChannelEventRunnable(channel, handler, ChannelState.RECEIVED, message));
        } catch (Throwable t) {
            throw new ExecutionException(message, channel, getClass() + " error when process received event .", t);
        }
    }

    // Note: exception handling is also submitted to the business thread pool.
    public void caught(Channel channel, Throwable exception) throws RemotingException {
        ExecutorService cexecutor = getExecutorService();
        try {
            cexecutor.execute(new ChannelEventRunnable(channel, handler, ChannelState.CAUGHT, exception));
        } catch (Throwable t) {
            throw new ExecutionException("caught event", channel, getClass() + " error when process caught event .", t);
        }
    }

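As you can see, AllChannelHandler overrides all of WrappedChannelHandler's key operations and submits each of them to the ExecutorService (here, the business thread pool) for asynchronous processing. The only operation that is not made asynchronous is sent, which is mainly used for replies. The official documentation, however, says that with all, replies are also dispatched to the business thread pool. Is the documentation wrong?

Here is the key point. Once the business thread pool is full, a rejected-execution exception is thrown and passed to the caught method for handling. That method still uses the business thread pool, so the pool is quite likely still full at that moment. The result is that the next handler downstream, HeaderExchangeHandler, never gets a chance to run. The reply that follows exception handling is produced by HeaderExchangeHandler#caught, so in the end NettyHandler#writeRequested is never called either. The Consumer can only wait until it times out and never receives the Provider's thread pool exhausted exception.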

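The conclusion from this analysis: when the Dispatcher is all and the Provider's thread pool fills up, exception handling also needs a thread from the business thread pool. If we are lucky and the pool has a free thread at that moment, the Consumer receives the thread pool exhausted exception sent by the Provider. More likely, though, the pool is still full, so neither the exception handling nor the reply step has a thread to run on. The Provider cannot reply, and the Consumer can only wait until it times out.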

This is also why we sometimes see the thread pool exhausted exception on the Consumer and sometimes see a timeout exception instead.

Why does setting the Dispatcher to message avoid this problem? Look at the implementation of MessageOnlyChannelHandler:

    public void received(Channel channel, Object message) throws RemotingException {
        ExecutorService cexecutor = executor;
        if (cexecutor == null || cexecutor.isShutdown()) {
            cexecutor = SHARED_EXECUTOR;
        }
        try {
            cexecutor.execute(new ChannelEventRunnable(channel, handler, ChannelState.RECEIVED, message));
        } catch (Throwable t) {
            throw new ExecutionException(message, channel, getClass() + " error when process received event .", t);
        }
    }

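That's right: MessageOnlyChannelHandler overrides only the received method of WrappedChannelHandler. This means that only request processing uses the business thread pool, while the other, non-business operations run directly on the IO thread pool. Isn't that exactly what we want? With the message Dispatcher, we no longer get the situation where the Provider's thread pool is full while the Consumer keeps waiting, because the IO thread pool is unbounded by default, so there is always a thread available to handle the exception and send the reply. (If you configure it to be bounded, I have nothing more to say.)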
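The contrast between the two Dispatchers can be reproduced in a small JDK-only simulation. This is a sketch under simplifying assumptions, not Dubbo code: EventHandler, ReplyingHandler, allStyle, and messageStyle are hypothetical stand-ins for ChannelHandler, HeaderExchangeHandler, AllChannelHandler, and MessageOnlyChannelHandler, and the calling thread plays the role of the IO thread that routes a failed received() to caught().

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class DispatcherSimulation {

    /** Hypothetical, heavily simplified stand-in for Dubbo's ChannelHandler. */
    interface EventHandler {
        void received(String message);
        void caught(Throwable error);
    }

    /** Plays the role of the downstream handler that writes replies to the consumer. */
    static final class ReplyingHandler implements EventHandler {
        final List<String> replies = new CopyOnWriteArrayList<>();

        @Override
        public void received(String message) {
            replies.add("result for " + message);
        }

        @Override
        public void caught(Throwable error) {
            replies.add("error reply: " + error.getClass().getSimpleName());
        }
    }

    /** "all"-style wrapper: every event is submitted to the business pool. */
    static EventHandler allStyle(EventHandler next, Executor pool) {
        return new EventHandler() {
            @Override
            public void received(String message) {
                pool.execute(() -> next.received(message));
            }

            @Override
            public void caught(Throwable error) {
                pool.execute(() -> next.caught(error));
            }
        };
    }

    /** "message"-style wrapper: only received() uses the pool; caught() stays on the caller. */
    static EventHandler messageStyle(EventHandler next, Executor pool) {
        return new EventHandler() {
            @Override
            public void received(String message) {
                pool.execute(() -> next.received(message));
            }

            @Override
            public void caught(Throwable error) {
                next.caught(error);
            }
        };
    }

    /** Delivers one request, on the calling ("IO") thread, to a provider whose only worker is busy. */
    static List<String> deliverToExhaustedProvider(boolean useAllStyle) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<Runnable>());
        CountDownLatch release = new CountDownLatch(1);
        pool.execute(() -> awaitQuietly(release)); // occupy the single worker
        ReplyingHandler tail = new ReplyingHandler();
        EventHandler head = useAllStyle ? allStyle(tail, pool) : messageStyle(tail, pool);
        try {
            head.received("request-1");
        } catch (RuntimeException rejected) {
            try {
                head.caught(rejected); // the failure is routed to caught()
            } catch (RuntimeException rejectedAgain) {
                // caught() was rejected too: nobody is left to write a reply
            }
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return tail.replies;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("all:     " + deliverToExhaustedProvider(true));
        System.out.println("message: " + deliverToExhaustedProvider(false));
    }
}
```

In this simulation the all-style run ends with no reply at all, because caught() is rejected by the same full pool, while the message-style run produces one error reply on the calling thread. The real handlers do considerably more (heartbeats, connection events, response encoding), so treat this only as a model of the threading decision.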

So, to reduce the risk of a system-wide avalanche when the Provider's thread pool fills up, we recommend setting the Dispatcher to message:

<dubbo:protocol name="dubbo" port="8888" threads="500" dispatcher="message" />
