A Look at Graceful Restarts in the Dubbo Framework

1. Background

We recently introduced Dubbo services into our production environment. Every time a service was restarted online, a timeout alarm fired. Strangely, restarting either the client or the server had an impact, and the alarms became more pronounced under heavy traffic.

The general alarm information is as follows:

cause: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2021-09-09 11:59:56.822, end time: 2021-09-09 11:59:58.828, client elapsed: 0 ms, server elapsed: 2006 ms, timeout: 2000 ms, request: Request [id=307463, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /XXXXXX:52149 -> /XXXXXX:20880] with root cause]

What could be the reason?

  1. No graceful shutdown?

  2. Too much traffic at the moment of restart, with no warm-up?

  3. Dubbo finishing startup before Spring Boot does, with no delayed exposure?

  4. Unreasonable parameter configuration?

All of the above are possible. After nearly half a month of reading and verifying the Dubbo framework source code, I finally found all the answers, and have carefully written up these pitfall notes.

2. Description

  1. Versions

| Component | Version |
| --- | --- |
| Dubbo | 2.7.7 |
| Netty | 4.0.36.Final |
| Zookeeper | 3.4.9 |

  2. Basic situation

Read requests are idempotent, so we retry them by default; write requests are not retried by default.

The default timeout is 2000ms.

All services run as Docker containers, and Dubbo clients far outnumber service providers, at a ratio of roughly 10:1.

  3. Tip

This article focuses on the technical points and principles related to service restart; it does not cover Dubbo basics, Netty basics, or differences from earlier versions.

3. Key technical points of graceful restart

Dubbo provides solutions for each of the problems above; let's look at them in turn.

  1. Dubbo graceful shutdown mechanism

Dubbo uses the JDK's ShutdownHook to perform a graceful shutdown. The graceful shutdown mechanism implemented in Dubbo consists of six main steps:

(1) On receiving the kill PID process exit signal, the Spring container triggers the container destruction event.

(2) The provider side unregisters its service metadata (deletes the ZK node).

(3) The consumer pulls the latest list of service providers.

(4) The provider sends a readonly event message to notify consumers that the service is unavailable.

(5) The server waits for in-flight tasks to finish and rejects new tasks.

(6) Exit gracefully.
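The whole sequence hangs off the JDK shutdown-hook mechanism. Here is a minimal sketch of that mechanism, not Dubbo's actual DubboShutdownHook (which does the registry unregistration and protocol close at this point); the class and method names are mine:

```java
public class ShutdownHookDemo {

    // Register a hook that runs when the JVM receives SIGTERM (kill PID)
    // or exits normally; Dubbo's own hook performs its cleanup here.
    static Thread registerHook() {
        Thread hook = new Thread(() -> System.out.println("shutdown hook fired"));
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        registerHook();
        System.out.println("app running; the hook fires on kill PID or normal exit");
    }
}
```

Note that `kill -9` bypasses shutdown hooks entirely, which is why a plain SIGTERM should be used for rolling restarts.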

Core code:

    @Override
    public void close(final int timeout) {
        startClose();
        if (timeout > 0) {
            final long max = (long) timeout;
            final long start = System.currentTimeMillis();
            if (getUrl().getParameter(Constants.CHANNEL_SEND_READONLYEVENT_KEY, true)) {
                // Send a readonly event message to notify consumers that the service is unavailable
                sendChannelReadOnlyEvent();
            }
            while (HeaderExchangeServer.this.isRunning()
                    && System.currentTimeMillis() - start < max) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    logger.warn(e.getMessage(), e);
                }
            }
        }
        doClose();
        server.close(timeout);
    }

Related configuration:

dubbo:
  application:
    shutwait: 10000 # graceful-shutdown wait time in milliseconds; defaults to 10s
  2. Dubbo warm-up mechanism

The default weight of a Dubbo service is 100. Dubbo actually provides a pseudo warm-up mechanism: it calculates the weight from how long the service provider has been running, and the load-balancing strategy then ramps traffic up from small to large. Let's look at the Dubbo source to see how service warm-up is implemented. The relevant code is in AbstractLoadBalance#getWeight.

    /**
     * Get the weight of the invoker's invocation which takes warmup time into account
     * if the uptime is within the warmup time, the weight will be reduced proportionally
     *
     * @param invoker    the invoker
     * @param invocation the invocation of this invoker
     * @return weight
     */
    int getWeight(Invoker<?> invoker, Invocation invocation) {
        int weight;
        URL url = invoker.getUrl();
        // Multiple registry scenario, load balance among multiple registries.
        if (REGISTRY_SERVICE_REFERENCE_PATH.equals(url.getServiceInterface())) {
            weight = url.getParameter(REGISTRY_KEY + "." + WEIGHT_KEY, DEFAULT_WEIGHT);
        } else {
            weight = url.getMethodParameter(invocation.getMethodName(), WEIGHT_KEY, DEFAULT_WEIGHT);
            if (weight > 0) {
                // Get the provider's start timestamp
                long timestamp = invoker.getUrl().getParameter(TIMESTAMP_KEY, 0L);
                if (timestamp > 0L) {
                    // Current time minus the provider's start time gives its uptime
                    long uptime = System.currentTimeMillis() - timestamp;
                    if (uptime < 0) {
                        return 1;
                    }
                    // Get the warm-up period; the default is 10 minutes
                    int warmup = invoker.getUrl().getParameter(WARMUP_KEY, DEFAULT_WARMUP);
                    // If uptime is less than warmup, recalculate the weight
                    if (uptime > 0 && uptime < warmup) {
                        // Scale the weight dynamically according to uptime
                        weight = calculateWarmupWeight((int) uptime, warmup, weight);
                    }
                }
            }
        }
        return Math.max(weight, 0);
    }

Let's look at the weight calculation algorithm:

    /**
     * Calculate the weight according to the uptime proportion of warmup time
     * the new weight will be within 1(inclusive) to weight(inclusive)
     *
     * @param uptime the uptime in milliseconds
     * @param warmup the warmup time in milliseconds
     * @param weight the weight of an invoker
     * @return weight which takes warmup into account
     */
    static int calculateWarmupWeight(int uptime, int warmup, int weight) {
        int ww = (int) (uptime / ((float) warmup / weight));
        return ww < 1 ? 1 : Math.min(ww, weight);
    }

The calculation is actually very simple: the longer the service has been running, the higher the weight, and once uptime = warmup the full weight is restored.

Under the defaults (service weight 100, warm-up time 10 minutes):

If the provider has been running for 1 minute, the weight comes out to 10.

If the provider has been running for 5 minutes, the weight comes out to 50.

If the provider has been running for 11 minutes, past the default 10-minute warm-up threshold, no recalculation is done and the default weight of 100 is returned directly.
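The numbers above can be checked by running the formula standalone. The method body below is copied from calculateWarmupWeight; the class name and variables are just for this demo:

```java
public class WarmupWeightDemo {

    // Same formula as AbstractLoadBalance#calculateWarmupWeight
    static int calculateWarmupWeight(int uptime, int warmup, int weight) {
        int ww = (int) (uptime / ((float) warmup / weight));
        return ww < 1 ? 1 : Math.min(ww, weight);
    }

    public static void main(String[] args) {
        int warmup = 600_000; // default warm-up period: 10 minutes
        int weight = 100;     // default weight
        System.out.println(calculateWarmupWeight(60_000, warmup, weight));  // 1 min uptime -> 10
        System.out.println(calculateWarmupWeight(300_000, warmup, weight)); // 5 min uptime -> 50
        System.out.println(calculateWarmupWeight(1_000, warmup, weight));   // just started -> 1 (floor)
    }
}
```

Note the floor of 1: a freshly started provider always keeps a tiny share of traffic rather than zero.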

Reminder: the consistenthash (consistent hashing) load-balancing strategy does not support service warm-up.

Related configuration:

dubbo:
  provider:
    warmup: 600000 # in milliseconds; defaults to 10 minutes
  3. Delayed exposure

Some external containers (such as Tomcat) block calls to the Dubbo service before they have fully started, causing timeouts on the consumer side. This can happen with some probability during a release. To avoid it, set a suitable delay (long enough that Tomcat has started) to achieve a smooth release.

In the source, Dubbo's delayed exposure is mainly reflected in the ServiceBean class and its parent class's ServiceConfig#export.

    public synchronized void export() {
        // Whether the service should be exported at all
        if (!shouldExport()) {
            return;
        }

        if (bootstrap == null) {
            bootstrap = DubboBootstrap.getInstance();
            bootstrap.init();
        }

        checkAndUpdateSubConfigs();

        // init serviceMetadata
        serviceMetadata.setVersion(version);
        serviceMetadata.setGroup(group);
        serviceMetadata.setDefaultGroup(group);
        serviceMetadata.setServiceType(getInterfaceClass());
        serviceMetadata.setServiceInterfaceName(getInterface());
        serviceMetadata.setTarget(getRef());

        if (shouldDelay()) { // Whether exposure should be delayed
            DELAY_EXPORT_EXECUTOR.schedule(this::doExport, getDelay(), TimeUnit.MILLISECONDS);
        } else {
            // The method that actually performs service exposure
            doExport();
        }

        exported();
    }

As the code above shows, Dubbo uses a scheduled task to delay the execution of doExport.

The delayed exposure sequence diagram is as follows:

Related configuration:

dubbo:
  provider:
    delay: 5000 # defaults to null (no delay); in milliseconds
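The delay mechanism itself is just a scheduled task. Here is a minimal sketch of the same idea; the class and method names are mine, and the delay is scaled down from the real 5000 ms for the demo:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DelayedExportSketch {

    // Mirrors the idea of Dubbo's single-threaded DELAY_EXPORT_EXECUTOR
    private static final ScheduledExecutorService DELAY_EXPORT_EXECUTOR =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "delay-export-demo");
                t.setDaemon(true); // daemon thread, so the JVM can exit without an explicit shutdown
                return t;
            });

    // Schedule doExport after delayMs and block until it has run.
    static boolean exportAfter(Runnable doExport, long delayMs) {
        try {
            DELAY_EXPORT_EXECUTOR.schedule(doExport, delayMs, TimeUnit.MILLISECONDS).get();
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        exportAfter(() -> System.out.println(
                "doExport ran after ~" + (System.currentTimeMillis() - start) + " ms"), 100);
    }
}
```

In Dubbo itself the scheduled task fires once and the service stays exported; the delay only shifts the start of exposure, it does not throttle anything afterwards.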
  4. Other

Even after solving all of the above, there were still plenty of timeouts when restarting the service, which we discovered by checking the client logs.

/XXX:57330 -> /XXXX:20880 is established., dubbo version: 2.7.7, current host: XXXX
2021-09-07 15:01:07.748 [NettyClientWorker-1-16] INFO  o.a.d.r.t.netty4.NettyClientHandler   -  [DUBBO] The connection of /XXXX:57332 -> /XXXX:20880 is established., dubbo version: 2.7.7, current host: XXXX

# A quick count shows the client established 3600 long connections at startup
$ less /u01/logs/order-service-api_XXX/dubbo.log  | grep NettyClientWorker- |grep  '2021-09-07 15' | wc -l
3600

With this question in mind, let's check the source code to find out.

DubboProtocol#getClients

    private ExchangeClient[] getClients(URL url) {
        boolean useShareConnect = false;

        // Read the configured connection count; defaults to 0 if not configured
        int connections = url.getParameter(CONNECTIONS_KEY, 0);
        List<ReferenceCountExchangeClient> shareClients = null;
        // if not configured, connection is shared, otherwise, one connection for one service
        if (connections == 0) {
            // Note: if the provider configures connections, shared connections are not used,
            // and the consumer's shareConnections setting has no effect
            useShareConnect = true;

            /*
             * The xml configuration should have a higher priority than properties.
             */
            String shareConnectionsStr = url.getParameter(SHARE_CONNECTIONS_KEY, (String) null);
            connections = Integer.parseInt(StringUtils.isBlank(shareConnectionsStr) ? ConfigUtils.getProperty(SHARE_CONNECTIONS_KEY,
                    DEFAULT_SHARE_CONNECTIONS) : shareConnectionsStr);
            shareClients = getSharedClient(url, connections);
        }

        ExchangeClient[] clients = new ExchangeClient[connections];
        for (int i = 0; i < clients.length; i++) {
            if (useShareConnect) {
                clients[i] = shareClients.get(i);
            } else {
                // Initialize a new connection
                clients[i] = initClient(url);
            }
        }

        return clients;
    }

The problem lies in our provider configuration:

dubbo:
  provider:
    connections: 200

To explain the code above: if connections is not configured, shared connections are used, and the number of shared connections is determined by the consumer's shareConnections setting (default 1). Conversely, if connections is configured, that many connections are established for each service.

Let's take a look at the initClient process:

    private ExchangeClient initClient(URL url) {

        // client type setting.
        String str = url.getParameter(CLIENT_KEY, url.getParameter(SERVER_KEY, DEFAULT_REMOTING_CLIENT));

        url = url.addParameter(CODEC_KEY, DubboCodec.NAME);
        // enable heartbeat by default
        url = url.addParameterIfAbsent(HEARTBEAT_KEY, String.valueOf(DEFAULT_HEARTBEAT));

        // BIO is not allowed since it has severe performance issue.
        if (str != null && str.length() > 0 && !ExtensionLoader.getExtensionLoader(Transporter.class).hasExtension(str)) {
            throw new RpcException("Unsupported client type: " + str + "," +
                    " supported client type is " + StringUtils.join(ExtensionLoader.getExtensionLoader(Transporter.class).getSupportedExtensions(), " "));
        }

        ExchangeClient client;
        try {
            // Is lazy connect configured?
            if (url.getParameter(LAZY_CONNECT_KEY, false)) {
                client = new LazyConnectExchangeClient(url, requestHandler);
            } else {
                // Without lazy connect, the long connection is initialized immediately
                client = Exchangers.connect(url, requestHandler);
            }
        } catch (RemotingException e) {
            throw new RpcException("Fail to create remoting client for service(" + url + "): " + e.getMessage(), e);
        }

        return client;
    }

As the code shows, if lazy connect is not configured, the long connection is initialized immediately. In other words, whenever one of our consumers restarts, it establishes (number of services) × 200 × (number of provider Docker instances) long connections. With 3 services and 6 provider instances, that is exactly 3600 long connections.

Conversely, when the server restarts, ZK notifies the consumers (about 60 Docker instances), and each establishes connections to the newly started provider instance. Each consumer establishes 200 × 3 connections, for a total of 36,000 long connections.

So every service restart has to establish a huge number of long connections, which takes a particularly long time (roughly 10s by a rough calculation).
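A quick back-of-the-envelope check of those counts; the instance numbers are the ones from our environment quoted above, and the class and method names are just for the demo:

```java
public class ConnectionMath {

    // Connections one restarted consumer establishes:
    // services x connections-per-service x provider instances
    static int consumerRestartConnections(int services, int connections, int providerInstances) {
        return services * connections * providerInstances;
    }

    // Connections one restarted provider instance receives:
    // consumers x connections-per-service x services
    static int providerRestartConnections(int consumers, int connections, int services) {
        return consumers * connections * services;
    }

    public static void main(String[] args) {
        System.out.println(consumerRestartConnections(3, 200, 6));  // 3600, matching the log count
        System.out.println(providerRestartConnections(60, 200, 3)); // 36000
    }
}
```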

Optimization: reduce the number of connections. After stress testing, a setting of 2 proved sufficient.

dubbo:
  provider:
    connections: 2

Of course, the server can also leave connections unconfigured (the default) and let the consumer determine the number of long connections; lazy connect can be used when many long connections are needed. As a rule of thumb, the total number of long connections established immediately after a server restart should not exceed 500. With all of the above resolved, the restart timeout problem was finally solved.

Summary

Dubbo's graceful restart turned out to be a big pitfall, and it shows how important it is to understand why each parameter is configured the way it is; otherwise unpredictable problems can result.

In addition, we also hit a thread-pool pitfall, which will be covered in the next article.

Follow me, don't get lost, welcome to like and collect.


Origin blog.csdn.net/weixin_38130500/article/details/120279023