Dubbo source code series 9: cluster fault tolerance (Cluster)

1. Preface

To avoid a single point of failure, applications are usually deployed on at least two machines, and heavily loaded services on more. The same holds in Dubbo: several providers typically serve the same service, so the consumer has to decide which provider to call. The handling of a failed call also has to be designed: retry, throw an exception, or only log the exception. To solve these problems, Dubbo defines the cluster interface Cluster and the Cluster Invoker. A Cluster merges multiple providers into one Cluster Invoker and exposes that Invoker to the consumer, so the consumer only makes remote calls through this single Invoker. Which provider to call, and what to do after a call fails, is left entirely to the cluster module. The cluster module is thus a middle layer between provider and consumer: it shields the consumer from the details of the providers, so the consumer only calls the service and does not need to care about the providers' state.

Load balancing, which selects the Invoker, is not explained in detail in this article; it is covered in a separate article on load balancing.


Dubbo provides a variety of cluster implementations, including but not limited to FailoverCluster, FailfastCluster, FailsafeCluster, FailbackCluster, ForkingCluster, etc., as shown in the following figure:

Each type of cluster has different purposes, and we will analyze them one by one.

2. Cluster fault tolerance

Before analyzing the source code of the dubbo cluster, we need to understand all the components of cluster fault tolerance, including: Cluster, Cluster Invoker, Directory, Router, LoadBalance, etc. The relationship diagram is as follows:

The dubbo cluster work is divided into the following two stages:

1. The first stage takes place during consumer initialization: the cluster implementation class creates a Cluster Invoker instance for the consumer, which is the merge operation in the figure above.

2. The second stage takes place when the consumer makes a remote call. Taking FailoverClusterInvoker (the default) as an example, the call goes through the following steps:

1) The Cluster Invoker first calls the list method of Directory to enumerate the Invoker list (an Invoker can be simply understood as a provider). Directory in turn calls the route method of Router, which filters out Invokers that do not meet the routing rules.

2) After FailoverClusterInvoker gets the Invoker list returned by Directory, it selects one Invoker from the list through LoadBalance.

3) FailoverClusterInvoker passes the parameters to the invoke method of the Invoker selected by LoadBalance to make the actual remote call.

Directory : its purpose is to hold Invokers, and it can be loosely compared to a List<Invoker>. Its implementation class RegistryDirectory is a dynamic service directory that senses changes in the registry configuration, so the Invoker list it holds changes as the content of the registry changes. After each change, RegistryDirectory dynamically adds or removes Invokers and calls Router's route method, filtering out Invokers that do not meet the routing rules.
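
The three steps of the second stage can be sketched in a few lines. This is only an illustration under simplifying assumptions: ClusterFlowSketch, SimpleInvoker and the "always pick the first candidate" load balancer are hypothetical stand-ins, not Dubbo's Directory, Router or LoadBalance classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical, simplified stand-ins for Directory, Router and LoadBalance.
final class ClusterFlowSketch {

    /** A provider-side Invoker reduced to an address plus a call method. */
    static final class SimpleInvoker {
        final String address;

        SimpleInvoker(String address) {
            this.address = address;
        }

        String invoke(String request) {
            return address + " handled " + request;
        }
    }

    /** Step 1: list the invokers and drop those rejected by the routing rule. */
    static List<SimpleInvoker> list(List<SimpleInvoker> all, Predicate<SimpleInvoker> routeRule) {
        List<SimpleInvoker> routed = new ArrayList<>();
        for (SimpleInvoker invoker : all) {
            if (routeRule.test(invoker)) {
                routed.add(invoker);
            }
        }
        return routed;
    }

    /** Step 2: a trivial load balancer that always picks the first candidate. */
    static SimpleInvoker select(List<SimpleInvoker> candidates) {
        return candidates.get(0);
    }

    /** Step 3: list, select, then make the (simulated) remote call. */
    static String call(List<SimpleInvoker> all, Predicate<SimpleInvoker> routeRule, String request) {
        List<SimpleInvoker> routed = list(all, routeRule);
        if (routed.isEmpty()) {
            throw new IllegalStateException("No provider available");
        }
        return select(routed).invoke(request);
    }
}
```

The consumer only ever sees call; which provider ends up serving the request is decided inside the sketch, which mirrors the shielding role of the cluster module.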

The above is the overall workflow of the cluster; it does not yet cover how the cluster tolerates faults. Dubbo mainly provides the following six commonly used fault tolerance strategies:

Failover Cluster : automatic failover

Failfast Cluster : fail fast

Failsafe Cluster : fail safe

Failback Cluster : automatic failure recovery

Forking Cluster : Call multiple service providers in parallel

Broadcast Cluster : calls all service providers one by one and reports an error if any of them throws an exception

3. Cluster source code

3.1 Cluster implementation class

Cluster is an interface with a single method, join, as shown in the figure below:

The Cluster interface is used only to create a Cluster Invoker. Let's take FailoverCluster (the default) as an example and look at the source code:

public class FailoverCluster implements Cluster {

    public final static String NAME = "failover";

    @Override
    public <T> Invoker<T> join(Directory<T> directory) throws RpcException {
        // Create and return a FailoverClusterInvoker object
        return new FailoverClusterInvoker<T>(directory);
    }

}

Every Cluster implementation has a single job: creating a Cluster Invoker. The logic is simple enough that no further analysis is needed here. Let's move on to the source code of Cluster Invoker.

3.2 Cluster Invoker

Cluster Invoker is a kind of Invoker. The logic for selecting a provider and the logic for handling a failed remote call are both encapsulated in the Cluster Invoker.

As described above, the work of the cluster can be divided into two stages. The first stage takes place during consumer initialization; it was analyzed in the article on service reference (creating the Invoker) and is not repeated here. The second stage takes place when the consumer makes a remote call. At that point the invoke method of AbstractClusterInvoker (the parent class of the Cluster Invokers) is called, and operations such as listing Invokers and load balancing are executed. So let's look at the logic of AbstractClusterInvoker's invoke method. The code is as follows:

    // invoke method of AbstractClusterInvoker
    @Override
    public Result invoke(final Invocation invocation) throws RpcException {
        // Check whether this Invoker has been destroyed
        checkWhetherDestroyed();

        // binding attachments into invocation:
        // copy the attachments held in RpcContext into the RpcInvocation
        Map<String, String> contextAttachments = RpcContext.getContext().getAttachments();
        if (contextAttachments != null && contextAttachments.size() != 0) {
            ((RpcInvocation) invocation).addAttachments(contextAttachments);
        }

        // Get the Invoker list from the directory
        List<Invoker<T>> invokers = list(invocation);
        // Initialize the load balancing strategy; RandomLoadBalance is the default.
        // If invokers is not empty, use the strategy named by the loadbalance parameter in the first invoker's url
        LoadBalance loadbalance = initLoadBalance(invokers, invocation);
        // For asynchronous requests, attach an invocation id to the RpcInvocation attachments
        RpcUtils.attachInvocationIdIfAsync(getUrl(), invocation);
        // Call the concrete Cluster Invoker; FailoverClusterInvoker is the Dubbo default, so its doInvoke method runs here
        return doInvoke(invocation, invokers, loadbalance);
    }

The invoke method of AbstractClusterInvoker mainly does the following work:

1) Gets the Invoker list and initializes LoadBalance

2) Calls the template method doInvoke, which each concrete Cluster Invoker class implements
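
The template-method structure described above can be reduced to the following sketch. TemplateClusterInvoker and FirstProviderInvoker are hypothetical names, and strings stand in for Invokers; the point is only that invoke fixes the skeleton while doInvoke carries the strategy.

```java
import java.util.List;

// Hypothetical miniature of the template-method structure; not Dubbo's real classes.
abstract class TemplateClusterInvoker {
    private final List<String> providers;

    TemplateClusterInvoker(List<String> providers) {
        this.providers = providers;
    }

    /** Fixed skeleton: list the candidates, then delegate to the subclass strategy. */
    final String invoke(String request) {
        List<String> candidates = list();
        return doInvoke(request, candidates);
    }

    /** Stand-in for asking the directory for the current provider list. */
    protected List<String> list() {
        return providers;
    }

    /** The hook that each fault tolerance strategy implements. */
    protected abstract String doInvoke(String request, List<String> candidates);
}

/** A strategy that simply "calls" the first candidate. */
final class FirstProviderInvoker extends TemplateClusterInvoker {
    FirstProviderInvoker(List<String> providers) {
        super(providers);
    }

    @Override
    protected String doInvoke(String request, List<String> candidates) {
        return candidates.get(0) + ":" + request;
    }
}
```

Because invoke is final in the sketch, a new strategy can only change what happens after the candidates are listed, which is the same division of labor the article describes for AbstractClusterInvoker and its subclasses.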

Let's take a look at the logic of the Invoker listing method list(Invocation). The call stack captured while debugging is as follows:

doList:581, RegistryDirectory (org.apache.dubbo.registry.integration)
list:85, AbstractDirectory (org.apache.dubbo.rpc.cluster.directory)
list:290, AbstractClusterInvoker (org.apache.dubbo.rpc.cluster.support)
invoke:249, AbstractClusterInvoker (org.apache.dubbo.rpc.cluster.support)

The source code of the whole process is as follows:

    // 1. list method of AbstractClusterInvoker
    protected List<Invoker<T>> list(Invocation invocation) throws RpcException {
        // Call the list method of AbstractDirectory to get the Invoker list
        return directory.list(invocation);
    }


    // 2. list method of AbstractDirectory
    @Override
    public List<Invoker<T>> list(Invocation invocation) throws RpcException {
        if (destroyed) {
            throw new RpcException("Directory already destroyed .url: " + getUrl());
        }

        // Call the doList method of RegistryDirectory
        return doList(invocation);
    }


    // 3. doList method of RegistryDirectory
    @Override
    public List<Invoker<T>> doList(Invocation invocation) {
        if (forbidden) {
            // 1. No service provider 2. Service providers are disabled
            throw new RpcException(RpcException.FORBIDDEN_EXCEPTION, "No provider available from registry " +
                    getUrl().getAddress() + " for service " + getConsumerUrl().getServiceKey() + " on consumer " +
                    NetUtils.getLocalHost() + " use dubbo version " + Version.getVersion() +
                    ", please check status of providers(disabled, not registered or in blacklist).");
        }

        if (multiGroup) {
            return this.invokers == null ? Collections.emptyList() : this.invokers;
        }

        List<Invoker<T>> invokers = null;
        try {
            // Get invokers from cache, only runtime routers will be executed.
            // The router chain filters the Invoker list down to those that satisfy the routing rules
            invokers = routerChain.route(getConsumerUrl(), invocation);
        } catch (Throwable t) {
            logger.error("Failed to execute router: " + getUrl() + ", cause: " + t.getMessage(), t);
        }

        return invokers == null ? Collections.emptyList() : invokers;
    }

After the Invoker list is obtained, LoadBalance is initialized. This logic is relatively simple and is not analyzed here; interested readers can study it on their own. The rest of this section focuses on the doInvoke logic of the different subclasses of AbstractClusterInvoker.

3.2.1 FailoverClusterInvoker

FailoverClusterInvoker automatically switches to another Invoker and retries when a call fails. Dubbo uses this class as the default Cluster Invoker. Let's take a look at its doInvoke source code:

    public Result doInvoke(Invocation invocation, final List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        List<Invoker<T>> copyInvokers = invokers;
        checkInvokers(copyInvokers, invocation);
        // Get the method name
        String methodName = RpcUtils.getMethodName(invocation);
        // Get the retry count; the loop length is the configured value + 1, falling back to 1 if that is not positive
        int len = getUrl().getMethodParameter(methodName, RETRIES_KEY, DEFAULT_RETRIES) + 1;
        if (len <= 0) {
            len = 1;
        }
        // retry loop.
        // Call in a loop, retrying on failure
        RpcException le = null; // last exception.
        List<Invoker<T>> invoked = new ArrayList<Invoker<T>>(copyInvokers.size()); // invoked invokers.
        Set<String> providers = new HashSet<String>(len);
        for (int i = 0; i < len; i++) {
            //Reselect before retry to avoid a change of candidate `invokers`.
            //NOTE: if `invokers` changed, then `invoked` also lose accuracy.
            if (i > 0) {
                checkWhetherDestroyed();
                // Re-list the Invokers before retrying: if a provider has gone down,
                // calling list returns the latest available Invoker list
                copyInvokers = list(invocation);
                // check again: validate the freshly listed Invokers
                checkInvokers(copyInvokers, invocation);
            }
            // Select an Invoker through the load balancing strategy
            Invoker<T> invoker = select(loadbalance, invocation, copyInvokers, invoked);
            // Add the invoker to the invoked list
            invoked.add(invoker);
            // Store the invoked list in the RpcContext
            RpcContext.getContext().setInvokers((List) invoked);
            try {
                // Call the invoke method of the target Invoker
                Result result = invoker.invoke(invocation);
                if (le != null && logger.isWarnEnabled()) {
                    logger.warn("Although retry the method " + methodName
                            + " in the service " + getInterface().getName()
                            + " was successful by the provider " + invoker.getUrl().getAddress()
                            + ", but there have been failed providers " + providers
                            + " (" + providers.size() + "/" + copyInvokers.size()
                            + ") from the registry " + directory.getUrl().getAddress()
                            + " on the consumer " + NetUtils.getLocalHost()
                            + " using the dubbo version " + Version.getVersion() + ". Last error is: "
                            + le.getMessage(), le);
                }
                return result;
            } catch (RpcException e) {
                if (e.isBiz()) { // biz exception.
                    throw e;
                }
                le = e;
            } catch (Throwable e) {
                le = new RpcException(e.getMessage(), e);
            } finally {
                providers.add(invoker.getUrl().getAddress());
            }
        }
        throw new RpcException(le.getCode(), "Failed to invoke the method "
                + methodName + " in the service " + getInterface().getName()
                + ". Tried " + len + " times of the providers " + providers
                + " (" + providers.size() + "/" + copyInvokers.size()
                + ") from the registry " + directory.getUrl().getAddress()
                + " on the consumer " + NetUtils.getLocalHost() + " using the dubbo version "
                + Version.getVersion() + ". Last error is: "
                + le.getMessage(), le.getCause() != null ? le.getCause() : le);
    }

This method first obtains the retry count and then loops up to that count plus one: a successful call returns its result immediately, while a failed call records the exception and moves on to the next iteration. Inside the loop, every retry (i > 0) first calls the parent class's list method to obtain the latest Invoker list; the select method then picks an Invoker through load balancing, and finally that Invoker's invoke method makes the remote call. Business exceptions (isBiz) are rethrown immediately without retrying. If every attempt fails, an RpcException carrying the last error is thrown.
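
The retry loop can be illustrated in isolation. In this sketch, which is an assumption-laden simplification rather than Dubbo code, providers are modeled as functions that may throw, FailoverSketch and pick are hypothetical names, and pick is a trivial stand-in for load balancing that prefers providers not yet tried.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a failover loop; providers are functions that may throw.
final class FailoverSketch {

    static <R> R invokeWithFailover(List<Function<String, R>> providers, String request, int retries) {
        if (providers.isEmpty()) {
            throw new IllegalStateException("No provider available");
        }
        // Same rule as in doInvoke: configured retries + 1, and at least one attempt.
        int attempts = Math.max(retries + 1, 1);
        List<Function<String, R>> invoked = new ArrayList<>();
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            Function<String, R> provider = pick(providers, invoked);
            invoked.add(provider);
            try {
                return provider.apply(request);   // success: return immediately
            } catch (RuntimeException e) {
                last = e;                          // failure: remember it and try the next one
            }
        }
        throw new IllegalStateException(
                "Tried " + attempts + " times, last error: " + last.getMessage(), last);
    }

    /** Prefer a provider that has not been tried yet; otherwise cycle through the list. */
    private static <R> Function<String, R> pick(List<Function<String, R>> providers,
                                                 List<Function<String, R>> invoked) {
        for (Function<String, R> provider : providers) {
            if (!invoked.contains(provider)) {
                return provider;
            }
        }
        return providers.get(invoked.size() % providers.size());
    }
}
```

The sketch leaves out re-listing the providers before each retry and the special handling of business exceptions, both of which the real method performs.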

Let's analyze the key method here, AbstractClusterInvoker's select, which selects an Invoker through load balancing. The source code is as follows:

    /**
     * Select a invoker using loadbalance policy.</br>
     * a) Firstly, select an invoker using loadbalance. If this invoker is in previously selected list, or,
     * if this invoker is unavailable, then continue step b (reselect), otherwise return the first selected invoker</br>
     * <p>
     * b) Reselection, the validation rule for reselection: selected > available. This rule guarantees that
     * the selected invoker has the minimum chance to be one in the previously selected list, and also
     * guarantees this invoker is available.
     *
     * @param loadbalance load balance policy
     * @param invocation  invocation
     * @param invokers    invoker candidates
     * @param selected    exclude selected invokers or not
     * @return the invoker which will final to do invoke.
     * @throws RpcException exception
     */
    // select method of AbstractClusterInvoker
    protected Invoker<T> select(LoadBalance loadbalance, Invocation invocation,
                                List<Invoker<T>> invokers, List<Invoker<T>> selected) throws RpcException {

        if (CollectionUtils.isEmpty(invokers)) {
            return null;
        }
        // Get the name of the method being invoked
        String methodName = invocation == null ? StringUtils.EMPTY : invocation.getMethodName();

        // Get the sticky setting (default false). A sticky connection makes the consumer keep calling the same provider whenever possible, switching only when that provider goes down
        boolean sticky = invokers.get(0).getUrl()
                .getMethodParameter(methodName, CLUSTER_STICKY_KEY, DEFAULT_CLUSTER_STICKY);

        //ignore overloaded method
        // Check whether invokers still contains stickyInvoker; if not, the provider it represents is down, so reset it to null
        if (stickyInvoker != null && !invokers.contains(stickyInvoker)) {
            stickyInvoker = null;
        }
        //ignore concurrency problem
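        // Applies when sticky is true and stickyInvoker != null. If selected contains stickyInvoker, the provider behind it
        // probably failed to serve the request (for example because of network problems) even though it is not down, since its
        // Invoker is still in the invokers list. If selected does not contain stickyInvoker, it has not been tried yet, so its availability is checked next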
        if (sticky && stickyInvoker != null && (selected == null || !selected.contains(stickyInvoker))) {
            // availablecheck tells whether availability checking is enabled; if so, call isAvailable on stickyInvoker and return stickyInvoker directly when the check passes
            if (availablecheck && stickyInvoker.isAvailable()) {
                return stickyInvoker;
            }
        }

        // Reaching this point means stickyInvoker is null, unavailable, or already tried; continue with doSelect to choose an Invoker
        Invoker<T> invoker = doSelect(loadbalance, invocation, invokers, selected);

        // If sticky is true, save the Invoker chosen by load balancing in stickyInvoker so the next request can reuse it
        if (sticky) {
            stickyInvoker = invoker;
        }
        return invoker;
    }

The select method mainly implements the following logic:

1) Get the sticky configuration, which defaults to false

2) Check whether invokers (the list of live Invokers) contains stickyInvoker; if not, the provider represented by stickyInvoker is down and stickyInvoker is set to null

3) If sticky is enabled and stickyInvoker has not already been tried in this call, check whether stickyInvoker is available and return it directly if so

4) If stickyInvoker is null, unavailable, or already tried, call doSelect to choose an Invoker through load balancing

5) If sticky is true, assign the Invoker chosen by the doSelect method to stickyInvoker
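
The sticky behavior is easier to follow with a toy model. The sketch below makes several assumptions: StickySelector and firstUntried are hypothetical names, strings stand in for Invokers, the availability check is omitted, and load balancing is replaced by "first provider not yet tried".

```java
import java.util.List;

// Hypothetical sketch of sticky selection; strings stand in for Invokers.
final class StickySelector {
    private String stickyProvider;   // last provider chosen while sticky was on

    /**
     * @param alive  providers currently listed by the directory
     * @param tried  providers already tried in this call (may be null)
     * @param sticky whether sticky selection is enabled
     */
    String select(List<String> alive, List<String> tried, boolean sticky) {
        if (alive == null || alive.isEmpty()) {
            return null;
        }
        // Forget the sticky provider once it disappears from the live list.
        if (stickyProvider != null && !alive.contains(stickyProvider)) {
            stickyProvider = null;
        }
        // Reuse it when sticky is on and it has not already failed in this call.
        if (sticky && stickyProvider != null && (tried == null || !tried.contains(stickyProvider))) {
            return stickyProvider;
        }
        String chosen = firstUntried(alive, tried);
        if (sticky) {
            stickyProvider = chosen;
        }
        return chosen;
    }

    /** Trivial stand-in for load balancing: first provider not yet tried, else the first one. */
    private static String firstUntried(List<String> alive, List<String> tried) {
        for (String provider : alive) {
            if (tried == null || !tried.contains(provider)) {
                return provider;
            }
        }
        return alive.get(0);
    }
}
```

With sticky on, a second call returns the remembered provider even if the live list is reordered; it is dropped only when it leaves the list or has already been tried within the current call.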

Let's continue to look at the doSelect method of AbstractClusterInvoker. The source code is as follows:

    // doSelect method of AbstractClusterInvoker
    private Invoker<T> doSelect(LoadBalance loadbalance, Invocation invocation,
                                List<Invoker<T>> invokers, List<Invoker<T>> selected) throws RpcException {

        if (CollectionUtils.isEmpty(invokers)) {
            return null;
        }
        if (invokers.size() == 1) {
            return invokers.get(0);
        }
        // Let the load balancing component select an Invoker
        Invoker<T> invoker = loadbalance.select(invokers, getUrl(), invocation);

        //If the `invoker` is in the  `selected` or invoker is unavailable && availablecheck is true, reselect.
        // That is: reselect when the chosen invoker is already in selected, or when it is unavailable and availablecheck is true
        if ((selected != null && selected.contains(invoker))
                || (!invoker.isAvailable() && getUrl() != null && availablecheck)) {
            try {
                // Call the reselect method to choose another Invoker
                Invoker<T> rInvoker = reselect(loadbalance, invocation, invokers, selected, availablecheck);
                if (rInvoker != null) {
                    // If rInvoker is not null, use it as the invoker
                    invoker = rInvoker;
                } else {
                    //Check the index of current selected invoker, if it's not the last one, choose the one at index+1.
                    // Get the position of invoker in invokers
                    int index = invokers.indexOf(invoker);
                    try {
                        //Avoid collision
                        // Take the invoker at (index + 1) modulo the size of invokers, which avoids picking the same one again
                        invoker = invokers.get((index + 1) % invokers.size());
                    } catch (Exception e) {
                        logger.warn(e.getMessage() + " may because invokers list dynamic change, ignore.", e);
                    }
                }
            } catch (Throwable t) {
                logger.error("cluster reselect fail reason is :" + t.getMessage() + " if can not solve, you can set cluster.availablecheck=false in url", t);
            }
        }
        return invoker;
    }

The doSelect method mainly implements the following logic:

1) Select an Invoker through the load balancing component

2) If the selected Invoker has already been tried, or is unavailable while availablecheck is on, call the reselect method to choose again

3) If reselect returns null, take the Invoker at position (index + 1) % invokers.size(), where index is the position of the current Invoker in invokers
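
The wrap-around fallback of step 3 is a one-liner, shown here on a generic list. NextNeighbourSketch and next are hypothetical names used only for this illustration.

```java
import java.util.List;

// Hypothetical helper showing the (index + 1) % size fallback used when reselect finds nothing.
final class NextNeighbourSketch {

    /** Returns the element after current, wrapping around to index 0 at the end of the list. */
    static <T> T next(List<T> candidates, T current) {
        int index = candidates.indexOf(current);   // -1 if current is no longer in the list
        return candidates.get((index + 1) % candidates.size());
    }
}
```

Note that when the current element is missing from the list, indexOf returns -1 and the expression resolves to index 0, so the lookup itself stays in range for any non-empty list.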

Next, look at the reselect method of AbstractClusterInvoker. The source code is as follows:

    /**
     * Reselect, use invokers not in `selected` first, if all invokers are in `selected`,
     * just pick an available one using loadbalance policy.
     *
     * @param loadbalance    load balance policy
     * @param invocation     invocation
     * @param invokers       invoker candidates
     * @param selected       exclude selected invokers or not
     * @param availablecheck check invoker available if true
     * @return the reselect result to do invoke
     * @throws RpcException exception
     */
    // reselect method of AbstractClusterInvoker
    private Invoker<T> reselect(LoadBalance loadbalance, Invocation invocation,
                                List<Invoker<T>> invokers, List<Invoker<T>> selected, boolean availablecheck) throws RpcException {

        //Allocating one in advance, this list is certain to be used.
        List<Invoker<T>> reselectInvokers = new ArrayList<>(
                invokers.size() > 1 ? (invokers.size() - 1) : invokers.size());

        // First, try picking a invoker not in `selected`.
        for (Invoker<T> invoker : invokers) {
            // Check the availability of the invoker
            if (availablecheck && !invoker.isAvailable()) {
                continue;
            }

            // If the selected list does not contain the current invoker, add it to reselectInvokers
            if (selected == null || !selected.contains(invoker)) {
                reselectInvokers.add(invoker);
            }
        }

        // reselectInvokers is not empty: choose an Invoker from it through the load balancing component
        if (!reselectInvokers.isEmpty()) {
            return loadbalance.select(reselectInvokers, getUrl(), invocation);
        }

        // Just pick an available invoker using loadbalance policy
        if (selected != null) {
            for (Invoker<T> invoker : selected) {
                // If the invoker is available and reselectInvokers does not contain it yet, add it to reselectInvokers
                if ((invoker.isAvailable()) // available first
                        && !reselectInvokers.contains(invoker)) {
                    reselectInvokers.add(invoker);
                }
            }
        }

        // reselectInvokers is not empty: choose an Invoker again through the load balancing component
        if (!reselectInvokers.isEmpty()) {
            return loadbalance.select(reselectInvokers, getUrl(), invocation);
        }

        return null;
    }

The reselect method mainly implements the following logic:

1) Traverse the list of live Invokers; if the current Invoker is available (when availablecheck is on) and the selected list does not contain it, add it to reselectInvokers

2) If reselectInvokers is not empty, select an Invoker from it through the load balancing component

3) Otherwise traverse the selected list; if the current Invoker is available and reselectInvokers does not contain it, add it to reselectInvokers

4) If reselectInvokers is not empty, select an Invoker again through the load balancing component; otherwise return null
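
The two passes of reselect can be condensed into the following sketch. It is a simplification under stated assumptions: ReselectSketch is a hypothetical name, strings stand in for Invokers, availability is supplied as a predicate, and "take the first candidate" replaces the load balancing call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the two-pass reselect logic; strings stand in for Invokers.
final class ReselectSketch {

    static String reselect(List<String> all, List<String> tried, Predicate<String> available) {
        List<String> candidates = new ArrayList<>();
        // Pass 1: available providers that have not been tried yet.
        for (String provider : all) {
            if (available.test(provider) && (tried == null || !tried.contains(provider))) {
                candidates.add(provider);
            }
        }
        if (!candidates.isEmpty()) {
            return candidates.get(0);   // stand-in for the load balancing selection
        }
        // Pass 2: fall back to already-tried providers that are still available.
        if (tried != null) {
            for (String provider : tried) {
                if (available.test(provider) && !candidates.contains(provider)) {
                    candidates.add(provider);
                }
            }
        }
        return candidates.isEmpty() ? null : candidates.get(0);
    }
}
```

Returning null when both passes come up empty is what lets the caller fall back to the neighbouring Invoker in doSelect.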

At this point, the analysis of FailoverClusterInvoker is finished, let’s continue to analyze other Cluster Invoker

3.2.2 FailfastClusterInvoker

FailfastClusterInvoker makes only one call and throws an exception immediately if that call fails.

Applicable scenarios: suitable for non-idempotent write operations, such as adding new records

The source code is as follows:

public class FailfastClusterInvoker<T> extends AbstractClusterInvoker<T> {

    public FailfastClusterInvoker(Directory<T> directory) {
        super(directory);
    }

    @Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        checkInvokers(invokers, invocation);
        // Call the select method of AbstractClusterInvoker to choose an Invoker
        Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
        try {
            // Call the invoke method of the Invoker
            return invoker.invoke(invocation);
        } catch (Throwable e) {
            // The call failed: throw the exception right away
            if (e instanceof RpcException && ((RpcException) e).isBiz()) { // biz exception.
                throw (RpcException) e;
            }
            throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0,
                    "Failfast invoke providers " + invoker.getUrl() + " " + loadbalance.getClass().getSimpleName()
                            + " select from all providers " + invokers + " for service " + getInterface().getName()
                            + " method " + invocation.getMethodName() + " on consumer " + NetUtils.getLocalHost()
                            + " use dubbo version " + Version.getVersion()
                            + ", but no luck to perform the invocation. Last error is: " + e.getMessage(),
                    e.getCause() != null ? e.getCause() : e);
        }
    }
}

The logic of the doInvoke method of FailfastClusterInvoker is very simple. First select the Invoker, and then call the invoke method of the Invoker. After the call fails, an exception will be thrown directly. The select method has been analyzed in FailoverClusterInvoker, so I won't repeat it here.

3.2.3 FailsafeClusterInvoker

FailsafeClusterInvoker is the fail-safe Cluster Invoker. Fail-safe means that when an exception occurs during the call, FailsafeClusterInvoker only logs the exception and does not throw it.

Applicable scenarios: suitable for operations such as writing audit logs

The source code is as follows:

public class FailsafeClusterInvoker<T> extends AbstractClusterInvoker<T> {
    private static final Logger logger = LoggerFactory.getLogger(FailsafeClusterInvoker.class);

    public FailsafeClusterInvoker(Directory<T> directory) {
        super(directory);
    }

    @Override
    public Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        try {
            checkInvokers(invokers, invocation);
            // Call the select method of AbstractClusterInvoker to choose an Invoker
            Invoker<T> invoker = select(loadbalance, invocation, invokers, null);
            // Call the invoke method of the invoker
            return invoker.invoke(invocation);
        } catch (Throwable e) {
            // The call failed: do not throw, only log the exception
            logger.error("Failsafe ignore exception: " + e.getMessage(), e);
            // Return an empty result to the consumer
            return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
        }
    }
}

The logic of doInvoke of FailsafeClusterInvoker is simple, so I won’t repeat it here.
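
The contrast between the fail-fast strategy of 3.2.2 and the fail-safe strategy of 3.2.3 comes down to what happens in the catch block. The sketch below shows both side by side; FailModesSketch is a hypothetical name, a Supplier stands in for the selected Invoker, and a null return stands in for the empty result.

```java
import java.util.function.Supplier;

// Hypothetical side-by-side sketch of fail-fast and fail-safe handling.
final class FailModesSketch {

    /** Fail-fast: one attempt; any failure is rethrown to the caller. */
    static <R> R failfast(Supplier<R> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            throw new IllegalStateException("Failfast invoke failed: " + e.getMessage(), e);
        }
    }

    /** Fail-safe: one attempt; failures are logged and an empty (null) result is returned. */
    static <R> R failsafe(Supplier<R> call) {
        try {
            return call.get();
        } catch (RuntimeException e) {
            System.err.println("Failsafe ignore exception: " + e.getMessage());
            return null;
        }
    }
}
```

Neither variant retries, which is why fail-fast suits writes that must not be repeated and fail-safe suits calls whose failure the caller can ignore.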

3.2.4  FailbackClusterInvoker

FailbackClusterInvoker returns an empty result to the service consumer after a call fails, and re-sends the failed call later through a timed task.

Applicable scenarios: perform operations such as message notification

The source code is as follows:

public class FailbackClusterInvoker<T> extends AbstractClusterInvoker<T> {

    private static final Logger logger = LoggerFactory.getLogger(FailbackClusterInvoker.class);

    private static final long RETRY_FAILED_PERIOD = 5;

    private final int retries;

    private final int failbackTasks;

    private volatile Timer failTimer;

    public FailbackClusterInvoker(Directory<T> directory) {
        super(directory);

        int retriesConfig = getUrl().getParameter(RETRIES_KEY, DEFAULT_FAILBACK_TIMES);
        if (retriesConfig <= 0) {
            retriesConfig = DEFAULT_FAILBACK_TIMES;
        }
        int failbackTasksConfig = getUrl().getParameter(FAIL_BACK_TASKS_KEY, DEFAULT_FAILBACK_TASKS);
        if (failbackTasksConfig <= 0) {
            failbackTasksConfig = DEFAULT_FAILBACK_TASKS;
        }
        retries = retriesConfig;
        failbackTasks = failbackTasksConfig;
    }

    // 2. addFailed method of FailbackClusterInvoker
    private void addFailed(LoadBalance loadbalance, Invocation invocation, List<Invoker<T>> invokers, Invoker<T> lastInvoker) {
        if (failTimer == null) {
            synchronized (this) {
                if (failTimer == null) {
                    // Lazily create the timer; the named thread factory mainly serves to give the timer thread a custom name
                    failTimer = new HashedWheelTimer(
                            new NamedThreadFactory("failback-cluster-timer", true),
                            1,
                            TimeUnit.SECONDS, 32, failbackTasks);
                }
            }
        }
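        // Create the retry task; it runs after RETRY_FAILED_PERIOD (5) seconds and is rescheduled with the same delay if it fails again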
        RetryTimerTask retryTimerTask = new RetryTimerTask(loadbalance, invocation, invokers, lastInvoker, retries, RETRY_FAILED_PERIOD);
        try {
            failTimer.newTimeout(retryTimerTask, RETRY_FAILED_PERIOD, TimeUnit.SECONDS);
        } catch (Throwable e) {
            // Scheduling failed: only log the error
            logger.error("Failback background works error,invocation->" + invocation + ", exception: " + e.getMessage());
        }
    }

    // 1. doInvoke method of FailbackClusterInvoker
    @Override
    protected Result doInvoke(Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        Invoker<T> invoker = null;
        try {
            checkInvokers(invokers, invocation);
            // Call the select method of AbstractClusterInvoker to choose an Invoker
            invoker = select(loadbalance, invocation, invokers, null);
            // Call the invoke method of the Invoker
            return invoker.invoke(invocation);
        } catch (Throwable e) {
            // The call failed: only log the exception
            logger.error("Failback to invoke method " + invocation.getMethodName() + ", wait for retry in background. Ignored exception: "
                    + e.getMessage() + ", ", e);
            // Record the call so it can be retried later
            addFailed(loadbalance, invocation, invokers, invoker);
            return AsyncRpcResult.newDefaultAsyncResult(null, null, invocation); // ignore
        }
    }

    @Override
    public void destroy() {
        super.destroy();
        if (failTimer != null) {
            failTimer.stop();
        }
    }

    /**
     * RetryTimerTask
     */
    private class RetryTimerTask implements TimerTask {
        private final Invocation invocation;
        private final LoadBalance loadbalance;
        private final List<Invoker<T>> invokers;
        private final int retries;
        private final long tick;
        private Invoker<T> lastInvoker;
        private int retryTimes = 0;

        RetryTimerTask(LoadBalance loadbalance, Invocation invocation, List<Invoker<T>> invokers, Invoker<T> lastInvoker, int retries, long tick) {
            this.loadbalance = loadbalance;
            this.invocation = invocation;
            this.invokers = invokers;
            this.retries = retries;
            this.tick = tick;
            this.lastInvoker=lastInvoker;
        }

        // 3. run method of FailbackClusterInvoker's RetryTimerTask
        @Override
        public void run(Timeout timeout) {
            try {
                // Call the select method of AbstractClusterInvoker to choose an Invoker, treating the last one tried as already selected
                Invoker<T> retryInvoker = select(loadbalance, invocation, invokers, Collections.singletonList(lastInvoker));
                lastInvoker = retryInvoker;
                // Call the invoke method of retryInvoker
                retryInvoker.invoke(invocation);
            } catch (Throwable e) {
                // The call failed: log the exception
                logger.error("Failed retry to invoke method " + invocation.getMethodName() + ", waiting again.", e);
                if ((++retryTimes) >= retries) {
                    // Retry limit reached: log an error and give up
                    logger.error("Failed retry times exceed threshold (" + retries + "), We have to abandon, invocation->" + invocation);
                } else {
                    // Retry limit not reached: schedule the retry task again
                    rePut(timeout);
                }
            }
        }

        private void rePut(Timeout timeout) {
            if (timeout == null) {
                return;
            }

            Timer timer = timeout.timer();
            if (timer.isStop() || timeout.isCancelled()) {
                return;
            }

            timer.newTimeout(timeout.task(), tick, TimeUnit.SECONDS);
        }
    }
}

The comments in the FailbackClusterInvoker source above already explain its logic, so it is not repeated here.
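
The retry loop of RetryTimerTask can be illustrated with a minimal, self-contained sketch. It is not Dubbo code: it assumes the JDK's ScheduledExecutorService in place of Dubbo's Timer, and the names FailbackRetrySketch and retryInBackground are hypothetical. Unlike FailbackClusterInvoker, which retries in the background, the sketch blocks until the retries finish, purely so that the demo is deterministic.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for RetryTimerTask: the task re-schedules itself after
// each failed attempt until the call succeeds or the retry budget is used up.
class FailbackRetrySketch {

    /** Runs call on a timer and returns the number of attempts that were made. */
    static int retryInBackground(Runnable call, int retries, long tickMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);
        Runnable task = new Runnable() {
            @Override
            public void run() {
                try {
                    attempts.incrementAndGet();
                    call.run();
                    done.countDown();          // success: stop retrying
                } catch (RuntimeException e) {
                    if (attempts.get() >= retries) {
                        done.countDown();      // retry budget exhausted: give up
                    } else {
                        // Same idea as rePut: put the same task back on the timer
                        timer.schedule(this, tickMillis, TimeUnit.MILLISECONDS);
                    }
                }
            }
        };
        timer.schedule(task, tickMillis, TimeUnit.MILLISECONDS);
        try {
            done.await();                      // demo only: wait for the outcome
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            timer.shutdownNow();
        }
        return attempts.get();
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated provider call: fails twice, then succeeds on the third attempt
        int attempts = retryInBackground(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("provider unavailable");
            }
        }, 5, 10);
        System.out.println("attempts = " + attempts); // prints "attempts = 3"
    }
}
```

As in the Dubbo source, a failed attempt only leads to another scheduled attempt while the failure count is below retries; after that the invocation is abandoned.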

3.2.5  ForkingClusterInvoker

ForkingClusterInvoker submits one task per selected service provider to a thread pool at runtime, so that several providers are called concurrently. As soon as one provider returns a successful result, the doInvoke method stops waiting and returns that result.

Applicable scenario: read operations with strict real-time requirements (note: read operations only, because parallel write operations may be unsafe). The trade-off is that it consumes more resources.

The source code is as follows:

public class ForkingClusterInvoker<T> extends AbstractClusterInvoker<T> {

    /**
     * Use {@link NamedInternalThreadFactory} to produce {@link org.apache.dubbo.common.threadlocal.InternalThread}
     * which with the use of {@link org.apache.dubbo.common.threadlocal.InternalThreadLocal} in {@link RpcContext}.
     */
    // Create the thread pool
    private final ExecutorService executor = Executors.newCachedThreadPool(
            new NamedInternalThreadFactory("forking-cluster-timer", true));

    public ForkingClusterInvoker(Directory<T> directory) {
        super(directory);
    }

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        try {
            checkInvokers(invokers, invocation);
            final List<Invoker<T>> selected;
            // Read the forks setting (default 2)
            final int forks = getUrl().getParameter(FORKS_KEY, DEFAULT_FORKS);
            // Read the timeout setting (default 1000)
            final int timeout = getUrl().getParameter(TIMEOUT_KEY, DEFAULT_TIMEOUT);
            // If the forks setting is out of range, use all invokers as selected
            if (forks <= 0 || forks >= invokers.size()) {
                selected = invokers;
            } else {
                selected = new ArrayList<>();
                // Select forks Invokers in a loop and add them to selected, in preparation for invoking them in parallel
                for (int i = 0; i < forks; i++) {
                    // Call AbstractClusterInvoker's select method to choose an Invoker
                    Invoker<T> invoker = select(loadbalance, invocation, invokers, selected);
                    // Add the invoker to selected only if it is not already there
                    if (!selected.contains(invoker)) {
                        //Avoid add the same invoker several times.
                        selected.add(invoker);
                    }
                }
            }
            RpcContext.getContext().setInvokers((List) selected);
            final AtomicInteger count = new AtomicInteger();
            final BlockingQueue<Object> ref = new LinkedBlockingQueue<>();
            for (final Invoker<T> invoker : selected) {
                // Submit one task to the thread pool for each invoker
                executor.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            // Call the Invoker's invoke method
                            Result result = invoker.invoke(invocation);
                            // Put the result into the blocking queue
                            ref.offer(result);
                        } catch (Throwable e) {
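                            // The call failed; increment the failure count
                            int value = count.incrementAndGet();
                            // Once the count reaches the size of selected, every invoker has failed, so put the exception into the blocking queue to be returned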
                            if (value >= selected.size()) {
                                ref.offer(e);
                            }
                        }
                    }
                });
            }
            try {
                // Take the result out of the blocking queue
                Object ret = ref.poll(timeout, TimeUnit.MILLISECONDS);
                // If the result is a Throwable, throw an exception
                if (ret instanceof Throwable) {
                    Throwable e = (Throwable) ret;
                    throw new RpcException(e instanceof RpcException ? ((RpcException) e).getCode() : 0, "Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e.getCause() != null ? e.getCause() : e);
                }
                // Return the result
                return (Result) ret;
            } catch (InterruptedException e) {
                throw new RpcException("Failed to forking invoke provider " + selected + ", but no luck to perform the invocation. Last error is: " + e.getMessage(), e);
            }
        } finally {
            // clear attachments which is binding to current thread.
            RpcContext.getContext().clearAttachments();
        }
    }
}

ForkingClusterInvoker's doInvoke method mainly implements the following logic:

1) Based on the forks configuration, select forks Invokers in a loop and add them to selected, in preparation for invoking several Invokers in parallel.

2) Submit one task per selected Invoker to the thread pool. Each task calls the Invoker's invoke method and stores the result in the blocking queue.

3) Take the first result out of the blocking queue and return it.

A question to think about: why is the exception object added to the blocking queue only when value >= selected.size()?

Answer: When multiple service providers are called in parallel, ForkingClusterInvoker should return a successful result as long as one provider succeeds, even if all the others fail, instead of throwing an exception. Putting the exception object into the blocking queue only when value >= selected.size(), that is, only once every call has failed, guarantees that an exception never sits ahead of a normal result in the queue, so a normal result is always taken out first if there is one.
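
The queue-and-counter technique described above can be tried out in isolation. The following is a minimal sketch rather than Dubbo code: ForkingSketch and forkAll are hypothetical names, plain Callable objects stand in for Invokers, and a thread pool is created per call only to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the forking pattern: call all providers in parallel
// and return whatever reaches the blocking queue first.
class ForkingSketch {

    /**
     * Returns the first successful result, the last Throwable if every
     * provider failed, or null if nothing arrived before the timeout.
     */
    static Object forkAll(List<Callable<String>> providers, long timeoutMillis) {
        ExecutorService executor = Executors.newCachedThreadPool();
        AtomicInteger failures = new AtomicInteger();
        BlockingQueue<Object> ref = new LinkedBlockingQueue<>();
        for (Callable<String> provider : providers) {
            executor.execute(() -> {
                try {
                    ref.offer(provider.call());   // success: publish the result
                } catch (Throwable e) {
                    // Publish the exception only when ALL providers have failed,
                    // so it can never sit ahead of a successful result
                    if (failures.incrementAndGet() >= providers.size()) {
                        ref.offer(e);
                    }
                }
            });
        }
        try {
            return ref.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        } finally {
            executor.shutdownNow();               // sketch only: release the pool
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> providers = Arrays.asList(
                () -> { throw new IllegalStateException("provider A down"); },
                () -> { Thread.sleep(50); return "result from B"; });
        // A fails first, but its exception is not queued because B is still pending
        System.out.println(forkAll(providers, 1000)); // prints "result from B"
    }
}
```

If the early failure of provider A were queued immediately, the caller would receive the exception even though provider B succeeds 50 ms later, which is exactly what the counter check prevents.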

3.2.6  BroadcastClusterInvoker

BroadcastClusterInvoker calls every service provider one by one. If any of them reports an error, BroadcastClusterInvoker throws an exception after the loop has finished. When several providers fail, only the last recorded exception is thrown.

Applicable scenario: notifying all providers to update local resources such as caches or logs.

The source code is as follows:

public class BroadcastClusterInvoker<T> extends AbstractClusterInvoker<T> {

    private static final Logger logger = LoggerFactory.getLogger(BroadcastClusterInvoker.class);

    public BroadcastClusterInvoker(Directory<T> directory) {
        super(directory);
    }

    @Override
    @SuppressWarnings({"unchecked", "rawtypes"})
    public Result doInvoke(final Invocation invocation, List<Invoker<T>> invokers, LoadBalance loadbalance) throws RpcException {
        checkInvokers(invokers, invocation);
        RpcContext.getContext().setInvokers((List) invokers);
        RpcException exception = null;
        Result result = null;
        // Iterate over the Invoker list and call each one in turn
        for (Invoker<T> invoker : invokers) {
            try {
                result = invoker.invoke(invocation);
            } catch (RpcException e) {
                // Record the exception
                exception = e;
                logger.warn(e.getMessage(), e);
            } catch (Throwable e) {
                // Record the exception
                exception = new RpcException(e.getMessage(), e);
                logger.warn(e.getMessage(), e);
            }
        }
        // If any call failed, throw the recorded exception
        if (exception != null) {
            throw exception;
        }
        return result;
    }

}

The logic of BroadcastClusterInvoker's doInvoke method is simple and is covered by the comments above, so it is not repeated here.
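
For readers who want to experiment with the call-everyone-then-fail behaviour outside Dubbo, here is a minimal sketch. BroadcastSketch and broadcast are hypothetical names, and java.util.function.Supplier stands in for Invoker; it is an illustration under those assumptions, not the Dubbo implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of the broadcast pattern: every provider is called,
// and a failure is only rethrown after the whole loop has finished.
class BroadcastSketch {

    /** Calls all providers in order; throws the last recorded failure, if any. */
    static <R> R broadcast(List<Supplier<R>> providers) {
        RuntimeException lastError = null;
        R result = null;
        for (Supplier<R> provider : providers) {
            try {
                result = provider.get();
            } catch (RuntimeException e) {
                lastError = e;                 // remember the failure, keep going
            }
        }
        if (lastError != null) {
            throw lastError;                   // at least one provider failed
        }
        return result;                         // result of the last provider
    }

    public static void main(String[] args) {
        AtomicInteger invoked = new AtomicInteger();
        List<Supplier<String>> providers = Arrays.asList(
                () -> { invoked.incrementAndGet(); return "A refreshed"; },
                () -> { invoked.incrementAndGet(); throw new IllegalStateException("B failed"); },
                () -> { invoked.incrementAndGet(); return "C refreshed"; });
        try {
            broadcast(providers);
        } catch (IllegalStateException e) {
            // Provider C was still called even though provider B failed
            System.out.println("invoked " + invoked.get() + ", error: " + e.getMessage());
            // prints "invoked 3, error: B failed"
        }
    }
}
```

The point of the design is that one failing provider does not stop the notification from reaching the others, while the caller still learns that something went wrong.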

Summary

This article introduces six commonly used cluster fault tolerance methods. Cluster fault tolerance is very important for dubbo. The cluster module sits between the provider and the consumer. It shields the consumer from the details of the provider cluster, so that the consumer can concentrate on making remote calls. In addition, through the cluster module we can orchestrate and optimize the call links between services and govern the services.

Reference:

https://dubbo.apache.org/zh-cn/docs/source_code_guide/cluster.html

Origin blog.csdn.net/ywlmsm1224811/article/details/103063115