Feel free to follow github.com/hsfxuebao. I hope it is useful to you, and if you like it, please give it a Star.

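System design has to account not only for how the code behaves in the normal case but also for how it behaves when something goes wrong. When a service consumer's call to a service provider fails, Dubbo offers several fault-tolerance schemes; the default is Failover Cluster, that is, retry on failure. Let's look at the cluster fault-tolerance modes Dubbo provides.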

1. Creation and invocation diagram of fault-tolerance instances

2. Analysis of the fault-tolerance strategies

2.1 failover

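Failover strategy. When a consumer's call to one server in the provider cluster fails, it automatically tries another server. The number of retries is set by the retries attribute, so a call is attempted at most retries + 1 times.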

org.apache.dubbo.rpc.cluster.support.FailoverClusterInvoker#doInvoke:

public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    List<Invoker<T>> copyInvokers = invokers;
    // Check that the invoker list is not empty
    checkInvokers(copyInvokers, invocation);
    // Name of the method being called over RPC
    String methodName = RpcUtils.getMethodName(invocation);
    // Total number of attempts: the retries attribute plus the initial call
    int len = calculateInvokeTimes(methodName);
    // retry loop.
    RpcException le = null; // last exception.
    // Every invoker tried so far; all of them except the last one have already failed
    List<Invoker<T>> invoked = new ArrayList<Invoker<T>>(copyInvokers.size()); // invoked invokers.
    Set<String> providers = new HashSet<String>(len);

    for (int i = 0; i < len; i++) {
        //Reselect before retry to avoid a change of candidate `invokers`.
        //NOTE: if `invokers` changed, then `invoked` also lose accuracy.
        if (i > 0) {
            // Check whether this cluster invoker has been destroyed
            checkWhetherDestroyed();
            // Refresh the local invoker list
            copyInvokers = list(invocation);
            // check again: the refreshed invoker list must not be empty
            checkInvokers(copyInvokers, invocation);
        }
        // Load balancing
        Invoker<T> invoker = select(loadbalance, invocation, copyInvokers, invoked);
        // Record the selected invoker in the invoked list
        invoked.add(invoker);
        RpcContext.getServiceContext().setInvokers((List) invoked);
        try {
            // Remote call
            Result result = invokeWithContext(invoker, invocation);
            // A retry succeeded: log the last failure at warn level
            if (le != null && logger.isWarnEnabled()) {
                logger.warn("Although retry the method " + methodName
                        + " in the service " + getInterface().getName()
                        + " was successful by the provider " + invoker.getUrl().getAddress()
                        + ", but there have been failed providers " + providers
                        + " (" + providers.size() + "/" + copyInvokers.size()
                        + ") from the registry " + directory.getUrl().getAddress()
                        + " on the consumer " + NetUtils.getLocalHost()
                        + " using the dubbo version " + Version.getVersion() + ". Last error is: "
                        + le.getMessage(), le);
            }
            return result;
        } catch (RpcException e) {
            // Business exceptions are not retried; rethrow immediately
            if (e.isBiz()) { // biz exception.
                throw e;
            }
            // Any other RpcException is remembered as the last error, and the loop retries
            le = e;
        } catch (Throwable e) {
            le = new RpcException(e.getMessage(), e);
        } finally {
            // Record the address of the provider that was just tried
            providers.add(invoker.getUrl().getAddress());
        }
    }  // end-for
    // Every attempt failed: throw the final exception
    throw new RpcException(le.getCode(), "Failed to invoke the method "
            + methodName + " in the service " + getInterface().getName()
            + ". Tried " + len + " times of the providers " + providers
            + " (" + providers.size() + "/" + copyInvokers.size()
            + ") from the registry " + directory.getUrl().getAddress()
            + " on the consumer " + NetUtils.getLocalHost() + " using the dubbo version "
            + Version.getVersion() + ". Last error is: "
            + le.getMessage(), le.getCause() != null ? le.getCause() : le);
}
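
To make the control flow easier to see without the logging and context handling, here is a minimal plain-Java sketch of the same retry loop. It is not Dubbo code: FailoverSketch and SketchBizException are hypothetical names, providers are modeled as Supplier objects, and the next provider is taken in list order instead of through a load balancer.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical marker for business errors, playing the role of RpcException.isBiz().
class SketchBizException extends RuntimeException {
    SketchBizException(String message) {
        super(message);
    }
}

final class FailoverSketch {
    private FailoverSketch() {
    }

    // Makes up to (retries + 1) attempts, moving to the next provider after each failure.
    // Assumption: providers are tried in list order instead of being load-balanced.
    static <R> R invoke(List<Supplier<R>> providers, int retries) {
        if (providers.isEmpty()) {
            throw new IllegalStateException("no provider available");
        }
        int attempts = Math.max(retries + 1, 1);
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            Supplier<R> provider = providers.get(i % providers.size());
            try {
                return provider.get();
            } catch (SketchBizException e) {
                throw e; // business errors are never retried
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next provider
            }
        }
        throw new IllegalStateException("all " + attempts + " attempts failed", last);
    }
}
```

With retries set to 2, a call whose first provider fails and whose second provider succeeds returns after two attempts; a SketchBizException from the first provider ends the call immediately.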

2.2 failfast

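Failfast strategy. The consumer makes a single call and reports an error immediately if it fails. It is typically used for non-idempotent writes, such as inserting a record.

org.apache.dubbo.rpc.cluster.support.FailfastClusterInvoker#doInvoke: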

public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    checkInvokers(invokers, invocation);
    Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
    try {
        return invokeWithContext(invoker, invocation);
    } catch (Throwable e) {
        if (e instanceof RpcException && ((RpcException) e).isBiz()) { // biz exception.
            throw (RpcException) e;
        }
        throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0,
                "Failfast invoke providers " + invoker.getUrl() + " " + loadbalance.getClass().getSimpleName()
                        + " for service " + getInterface().getName()
                        + " method " + invocation.getMethodName() + " on consumer " + NetUtils.getLocalHost()
                        + " use dubbo version " + Version.getVersion()
                        + ", but no luck to perform the invocation. Last error is: " + e.getMessage(),
                e.getCause() != null ? e.getCause() : e);
    }
}

2.3 failsafe

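Failsafe strategy. When a consumer's call to a provider throws an exception, the exception is logged and otherwise ignored, and the caller receives an empty result. This strategy is typically used for relatively unimportant services.

org.apache.dubbo.rpc.cluster.support.FailsafeClusterInvoker#doInvoke: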

public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    try {
        checkInvokers(invokers, invocation);
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        return invokeWithContext(invoker, invocation);
    } catch (Throwable e) {
        logger.error("Failsafe ignore exception: " + e.getMessage(), e);
        return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
    }
}
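
The difference between failfast and failsafe is easiest to see from the caller's side. The following plain-Java sketch contrasts the two; FailfastFailsafeSketch is a hypothetical class, a provider is modeled as a Supplier, and null stands in for the empty default result that the failsafe listing returns.

```java
import java.util.function.Supplier;

final class FailfastFailsafeSketch {
    private FailfastFailsafeSketch() {
    }

    // Failfast: a single attempt; any failure propagates to the caller.
    static <R> R failfast(Supplier<R> provider) {
        return provider.get();
    }

    // Failsafe: a single attempt; a failure is logged and replaced by an empty (null) result.
    static <R> R failsafe(Supplier<R> provider) {
        try {
            return provider.get();
        } catch (RuntimeException e) {
            System.err.println("Failsafe ignore exception: " + e.getMessage());
            return null;
        }
    }
}
```

A caller that uses the failsafe variant therefore has to tolerate an empty result, which is why it suits calls such as audit logging whose outcome the caller does not depend on.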

2.4 failback

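Failback strategy. After a consumer's call to a provider fails, Dubbo records the failed request and retries it on a timer; the number of timed retries is again set by the retries attribute in the configuration. This strategy is typically used for services without strict real-time requirements.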

org.apache.dubbo.rpc.cluster.support.FailbackClusterInvoker#doInvoke:

protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    Invoker<T> invoker = null;
    try {
        checkInvokers(invokers, invocation);
        invoker = select(loadbalance, invocation, invokers, null);
        return invokeWithContext(invoker, invocation);
    } catch (Throwable e) {
        logger.error("Failback to invoke method " + invocation.getMethodName() + ", wait for retry in background. Ignored exception: "
                + e.getMessage() + ", ", e);
        addFailed(loadbalance, invocation, invokers, invoker);
        return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
    }
}

private void addFailed(LoadBalance loadbalance, Invocation invocation, List<Invoker<T>> invokers, Invoker<T> lastInvoker) {
    if (failTimer == null) {
        synchronized (this) {
            if (failTimer == null) {
                failTimer = new HashedWheelTimer(
                        new NamedThreadFactory("failback-cluster-timer", true),
                        1,
                        TimeUnit.SECONDS, 32, failbackTasks);
            }
        }
    }
    RetryTimerTask retryTimerTask = new RetryTimerTask(loadbalance, invocation, invokers, lastInvoker, retries, RETRY_FAILED_PERIOD);
    try {
        failTimer.newTimeout(retryTimerTask, RETRY_FAILED_PERIOD, TimeUnit.SECONDS);
    } catch (Throwable e) {
        logger.error("Failback background works error,invocation->" + invocation + ", exception: " + e.getMessage());
    }
}
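
The record-then-retry idea can be sketched with the JDK's ScheduledExecutorService in place of Dubbo's HashedWheelTimer. This is an illustration under assumptions, not the Dubbo implementation: FailbackSketch is a hypothetical class, the retry period is given in milliseconds, and the same call is simply repeated instead of re-selecting a provider.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class FailbackSketch {
    // Daemon timer thread, so pending retries never keep the JVM alive.
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "failback-sketch-timer");
        t.setDaemon(true);
        return t;
    });
    private final int retries;
    private final long periodMillis;

    FailbackSketch(int retries, long periodMillis) {
        this.retries = retries;
        this.periodMillis = periodMillis;
    }

    // Returns the result, or null if the call failed and was handed to the background timer.
    <R> R invoke(Supplier<R> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            scheduleRetry(call, 1);
            return null;
        }
    }

    // Runs the call again after one period; gives up after `retries` background attempts.
    private void scheduleRetry(Supplier<?> call, int attempt) {
        timer.schedule(() -> {
            try {
                call.get();
            } catch (RuntimeException e) {
                if (attempt < retries) {
                    scheduleRetry(call, attempt + 1);
                }
            }
        }, periodMillis, TimeUnit.MILLISECONDS);
    }

    void shutdown() {
        timer.shutdownNow();
    }
}
```

The caller gets an empty result right away and never sees the outcome of the background retries, so this pattern fits operations such as notifications where eventual delivery is enough.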

2.5 forking

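Forking strategy. The consumer calls several provider servers for the same service in parallel, and the call returns as soon as one of them succeeds. It is typically used for reads with strict real-time requirements, at the cost of extra server resources.

org.apache.dubbo.rpc.cluster.support.ForkingClusterInvoker#doInvoke: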

public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
    try {
        checkInvokers(invokers, invocation);
        // Invokers chosen to be called in parallel
        final List<Invoker<T>> selected;
        // Value of the forks attribute
        final int forks = getUrl().getParameter(FORKS_KEY, DEFAULT_FORKS);
        // Value of the timeout attribute: the time limit for the remote call
        final int timeout = getUrl().getParameter(TIMEOUT_KEY, DEFAULT_TIMEOUT);
        if (forks <= 0 || forks >= invokers.size()) {
            selected = invokers;
        } else {  // forks lies in (0, invokers.size())
            selected = new ArrayList<>(forks);
            while (selected.size() < forks) {
                // Let the load balancer pick one invoker
                Invoker<T> invoker = select(loadbalance, invocation, invokers, selected);
                if (!selected.contains(invoker)) {
                    //Avoid add the same invoker several times.
                    selected.add(invoker);
                }
            }
        }
        RpcContext.getServiceContext().setInvokers((List) selected);

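        // Counter of parallel invocations that ended with an exception
        final AtomicInteger count = new AtomicInteger();

        // Queue that receives the results of the parallel invocations
        final BlockingQueue<Object> ref = new LinkedBlockingQueue<>();

        // Invoke in parallel
        for (final Invoker<T> invoker : selected) {
            // Each call runs on a thread from the pool; this is where the parallelism happens
            executor.execute(() -> {
                try {
                    // Remote call
                    Result result = invokeWithContext(invoker, invocation);
                    // Put this invoker's result on the queue
                    ref.offer(result);
                } catch (Throwable e) {
                    // The invocation threw: increment the failure counter
                    int value = count.incrementAndGet();
                    if (value >= selected.size()) {
                        // Reaching this point means every parallel remote call has failed.
                        // Put the exception on the queue so that the poll() below wakes up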
                        ref.offer(e);
                    }
                }
            });
        }  // end-for
        try {
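                // poll() blocks until ref holds an element or the timeout expires;
                // as soon as one element is offered, the waiting thread wakes up.
                // Note that poll() runs concurrently with the parallel remote calls above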
            Object ret = ref.poll(timeout, TimeUnit.MILLISECONDS);
            if (ret instanceof Throwable) {
                Throwable e = (Throwable) ret;
                throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0, "Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e.getCause() != null ? e.getCause() : e);
            }
            return (Result) ret;
        } catch (InterruptedException e) {
            throw new RpcException("Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e);
        }
    } finally {
        // clear attachments which is binding to current thread.
        RpcContext.getClientAttachment().clearAttachments();
    }
}
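
The queue-plus-counter pattern at the heart of this method can be isolated in a few lines of plain Java. The sketch below is an illustration rather than Dubbo code: ForkingSketch is a hypothetical class, providers are Supplier objects that must return non-null values, and, as a choice of the sketch, a timeout is reported as an exception.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class ForkingSketch {
    private ForkingSketch() {
    }

    // Calls every provider in parallel and returns the first successful result.
    static <R> R invoke(List<Supplier<R>> providers, ExecutorService executor, long timeoutMillis) {
        AtomicInteger failures = new AtomicInteger();
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
        for (Supplier<R> provider : providers) {
            executor.execute(() -> {
                try {
                    queue.offer(provider.get());
                } catch (Throwable e) {
                    // Only the last failure is queued, so one success can still win.
                    if (failures.incrementAndGet() >= providers.size()) {
                        queue.offer(e);
                    }
                }
            });
        }
        try {
            Object first = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (first == null) {
                throw new IllegalStateException("no result within " + timeoutMillis + " ms");
            }
            if (first instanceof Throwable) {
                throw new IllegalStateException("all forked calls failed", (Throwable) first);
            }
            @SuppressWarnings("unchecked")
            R result = (R) first;
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a result", e);
        }
    }
}
```

Queuing an exception only once every fork has failed is what lets a single slow but healthy provider still answer the call; the price is that every fork consumes a thread and a remote request.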

2.6 broadcast

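Broadcast strategy. The call is broadcast to all providers, one after another, and it fails if any one of them reports an error. It is typically used to tell every provider to refresh local resources such as caches or logs.

org.apache.dubbo.rpc.cluster.support.BroadcastClusterInvoker#doInvoke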
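
As a rough illustration of the broadcast behaviour, the following plain-Java sketch calls every provider in turn and reports a failure only after the loop. BroadcastSketch is a hypothetical class, and the choice to keep going after a failure and rethrow the last error at the end is an assumption of the sketch.

```java
import java.util.List;

final class BroadcastSketch {
    private BroadcastSketch() {
    }

    // Calls every provider one by one; if any call failed, rethrows the last failure at the end.
    static void invokeAll(List<Runnable> providers) {
        RuntimeException last = null;
        for (Runnable provider : providers) {
            try {
                provider.run();
            } catch (RuntimeException e) {
                last = e; // keep going so the remaining providers are still notified
            }
        }
        if (last != null) {
            throw last;
        }
    }
}
```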

2.7 available

Available strategy. Scans all invokers and uses the first one that is available.

org.apache.dubbo.rpc.cluster.support.AvailableClusterInvoker#doInvoke
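
A first-available selection amounts to a linear scan. The sketch below shows the idea in plain Java; ProviderSketch and AvailableSketch are hypothetical types, and availability is a simple flag instead of a live connectivity check.

```java
import java.util.List;

// Hypothetical provider description used only by this sketch.
final class ProviderSketch {
    final String address;
    final boolean available;

    ProviderSketch(String address, boolean available) {
        this.address = address;
        this.available = available;
    }
}

final class AvailableSketch {
    private AvailableSketch() {
    }

    // Returns the first provider whose availability flag is set.
    static ProviderSketch select(List<ProviderSketch> providers) {
        for (ProviderSketch provider : providers) {
            if (provider.available) {
                return provider;
            }
        }
        throw new IllegalStateException("no provider available");
    }
}
```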

2.8 mergeable

Mergeable strategy. Merges the results returned by the invokers of multiple groups.

org.apache.dubbo.rpc.cluster.support.MergeableClusterInvoker#doInvoke
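
For a list-valued method, merging can be pictured as concatenating the per-group results. The sketch below assumes exactly that: MergeableSketch is a hypothetical class, each group is modeled as a Supplier of a list, and the groups are called sequentially in map order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

final class MergeableSketch {
    private MergeableSketch() {
    }

    // Calls one provider per group and concatenates the returned lists.
    static <E> List<E> invoke(Map<String, Supplier<List<E>>> groups) {
        List<E> merged = new ArrayList<>();
        for (Supplier<List<E>> group : groups.values()) {
            merged.addAll(group.get());
        }
        return merged;
    }
}
```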

2.9 zone-aware

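When several registries are available for subscription, this fault-tolerance mechanism provides a strategy for deciding how to distribute traffic among them:

A registry marked "preferred=true" has the highest priority.
Otherwise, check the zone the current request belongs to and pick a registry in the same zone first.
Otherwise, balance traffic across all registries according to each registry's weight.
Otherwise, pick any registry that is available.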
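
The four rules form a priority chain, which the following plain-Java sketch spells out. It is an illustration under assumptions: RegistrySketch and its fields are hypothetical, weights are taken to be non-negative integers, and the weighted step uses a simple weighted random pick.

```java
import java.util.List;
import java.util.Random;

// Hypothetical description of one registry; the field names are assumptions of this sketch.
final class RegistrySketch {
    final String name;
    final boolean available;
    final boolean preferred;
    final String zone;
    final int weight;

    RegistrySketch(String name, boolean available, boolean preferred, String zone, int weight) {
        this.name = name;
        this.available = available;
        this.preferred = preferred;
        this.zone = zone;
        this.weight = weight;
    }
}

final class ZoneAwareSketch {
    private ZoneAwareSketch() {
    }

    static RegistrySketch select(List<RegistrySketch> registries, String requestZone, Random random) {
        // 1. A preferred registry wins outright.
        for (RegistrySketch r : registries) {
            if (r.available && r.preferred) {
                return r;
            }
        }
        // 2. Next, a registry in the same zone as the request.
        if (requestZone != null) {
            for (RegistrySketch r : registries) {
                if (r.available && requestZone.equals(r.zone)) {
                    return r;
                }
            }
        }
        // 3. Next, a weighted random pick among the available registries.
        int totalWeight = 0;
        for (RegistrySketch r : registries) {
            if (r.available) {
                totalWeight += r.weight;
            }
        }
        if (totalWeight > 0) {
            int offset = random.nextInt(totalWeight);
            for (RegistrySketch r : registries) {
                if (!r.available) {
                    continue;
                }
                offset -= r.weight;
                if (offset < 0) {
                    return r;
                }
            }
        }
        // 4. Finally, any available registry (reached only when all weights are zero).
        for (RegistrySketch r : registries) {
            if (r.available) {
                return r;
            }
        }
        throw new IllegalStateException("no registry available");
    }
}
```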

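3. Customizing the cluster fault-tolerance strategy through the extension interface

Dubbo ships with a rich set of cluster fault-tolerance strategies, but if you have custom requirements you can build your own on top of the Cluster extension interface that Dubbo provides. To write a custom extension, first implement the Cluster interface: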

public class MyCluster implements Cluster {
    @Override
    public <T> Invoker<T> join(Directory<T> directory) throws RpcException {
        return new MyClusterInvoker<>(directory);
    }
}

In the code above, MyCluster implements the join method of the Cluster interface. Next, extend the AbstractClusterInvoker class to create your own ClusterInvoker:

public class MyClusterInvoker<T> extends AbstractClusterInvoker<T> {
    public MyClusterInvoker(Directory<T> directory) {
        super(directory);
    }
    @Override
    protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers,
        LoadBalance loadbalance)
        throws RpcException {
        checkInvokers(invokers, invocation);
        RpcContext.getContext().setInvokers((List) invokers);
        RpcException exception = null;
        Result result = null;
        // Implement your own cluster fault-tolerance strategy here
        return result;
    }
}

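As the code above shows, the doInvoke method must be overridden; inside it you implement your own cluster fault-tolerance strategy. Then create a file named org.apache.dubbo.rpc.cluster.Cluster under the META-INF/dubbo directory and add the line myCluster=org.apache.dubbo.demo.cluster.MyCluster to it.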
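
The name=class line is what ties the extension name to the implementation. As a rough mental model only, the lookup can be pictured as reading that mapping and instantiating the class by reflection. The sketch below uses java.util.Properties for this; ExtensionFileSketch is a hypothetical class, and Dubbo's real extension loader does considerably more than this.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class ExtensionFileSketch {
    private ExtensionFileSketch() {
    }

    // Resolves an extension name to a new instance using "name=class" lines.
    static Object load(String fileContent, String name) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(fileContent));
        } catch (IOException e) {
            throw new IllegalStateException("cannot read extension file", e);
        }
        String className = mapping.getProperty(name);
        if (className == null) {
            throw new IllegalArgumentException("no extension named " + name);
        }
        try {
            // Assumes the class has an accessible no-argument constructor.
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }
}
```

A misspelled class name in the file surfaces here as an instantiation error at lookup time, so the mapping line is worth checking first when a custom cluster name is not found.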

Finally, switch the cluster fault-tolerance mode to myCluster as follows:

<dubbo:reference id="greetingService"
interface="com.books.dubbo.demo.api.GreetingService" group="dubbo"
cluster="myCluster"/>

Dubbo3 Source Code Series, Part 6: Cluster Fault-Tolerance Strategies