Understanding Long Polling: How Does a Configuration Center Implement Push?

I. Introduction

With traditional static configuration, modifying a configuration means restarting the application. For something like a database connection string that may be acceptable, but for configuration that must take effect at runtime, such as the switch for a feature, restarting the application is far too disruptive. The configuration center was born to solve exactly this kind of problem, and in microservice architectures in particular, teams increasingly prefer to manage configuration centrally through a configuration center.

The core capability of a configuration center is dynamic configuration push, and popular configuration centers such as Nacos and Apollo both provide it. When I first encountered configuration centers, I was curious how the server could detect configuration changes and push them to clients in real time. Before studying the implementation, I assumed configuration centers pushed configuration over a persistent (long) connection. In fact, the two currently popular configuration centers, Nacos and Apollo, do not use long connections at all; they use long polling. This article introduces long polling, a technique that sounds like a relic of the last century, and gives this old tune a fresh performance to see whether it still has a distinctive charm. Code examples are included, presenting a simple configuration-watch workflow.

II. Data interaction modes

As we all know, there are two modes of data interaction: Push (push mode) and Pull (pull mode).

In push mode, the client and server establish a persistent network connection, and when the server has relevant data it pushes it directly to the client over that connection. Its advantage is timeliness: as soon as data changes, the client learns of it immediately. The client-side logic is also simple, since the client need not manage when or how to fetch data. The disadvantage is that the server does not know the client's consumption capacity, so data may pile up on the client faster than it can be processed.

In pull mode, the client actively sends requests to the server to fetch relevant data. The advantage is that the client initiates the process, so there is no data backlog problem as in push mode. The disadvantage is that it may not be timely, and the client must handle the pulling logic itself: when to pull, how to control the pull frequency, and so on.

III. Long polling vs. polling

Let me begin with the difference between long polling and polling; both are implementations of the pull mode.

"Polling" means the client requests data at fixed intervals regardless of whether the server's data has changed; the response may contain updated data or nothing at all. If a configuration center used polling to implement dynamic push, it would face the following problems:

  • Push delay. If the client pulls the configuration every 5 s and a change occurs in the 6th second, the push delay reaches 4 s.
  • Server pressure. Configuration rarely changes, yet frequent polling puts considerable pressure on the server.
  • Push delay and server pressure cannot both be reduced. Shorten the polling interval and the delay drops but the pressure rises; lengthen the interval and the pressure drops but the delay rises.
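As a contrast for the sections that follow, here is a minimal sketch of such a naive polling loop. Everything in it is illustrative: the poll() method stands in for an HTTP GET against the config server, and the interval is shortened to 100 ms so the demo finishes quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class NaivePollingDemo {

    // Placeholder for an HTTP GET against the config server; returns null when unchanged.
    static String poll(String dataId, AtomicInteger serverRequests) {
        serverRequests.incrementAndGet();
        return null; // pretend the config never changes
    }

    // Runs `polls` polling rounds at the given interval and reports how many
    // requests the server had to handle even though nothing ever changed.
    static int runDemo(int polls, long intervalMillis) {
        AtomicInteger serverRequests = new AtomicInteger();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(polls);
        // Poll at a fixed interval whether or not anything changed: a change just
        // after a poll waits almost a full interval before being noticed.
        scheduler.scheduleAtFixedRate(() -> {
            if (done.getCount() == 0) {
                return; // enough rounds for the demo
            }
            String config = poll("user", serverRequests);
            if (config != null) {
                System.out.println("config changed: " + config);
            }
            done.countDown();
        }, 0, intervalMillis, TimeUnit.MILLISECONDS);
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return serverRequests.get();
    }

    public static void main(String[] args) {
        // 100 ms here keeps the demo short; a real client might poll every 5 s.
        int requests = runDemo(5, 100);
        System.out.println("server handled " + requests + " requests, 0 config changes");
    }
}
```

Every tick costs the server a request even when nothing changed; that wasted traffic is exactly the "server pressure" problem listed above.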

"Long polling" avoids these problems. The client initiates a long poll; if the server's data has not changed, the server holds the request until either the data changes or an agreed period elapses, and only then returns. After the response arrives, the client immediately initiates the next long poll. It should now be obvious how a configuration center uses long polling to solve the problems that plain polling runs into:

  • Push delay. As soon as the server's data changes, the long poll ends and the response is returned to the client immediately.
  • Server pressure. The long-polling interval is usually long, say 30 s or 60 s, and holding those connections does not consume much server resource.

The long polling process, using Nacos as an example, is shown in the figure below:

Some readers may ask: why must a long poll time out after a certain period and then be re-initiated? Why can't the server simply hold the request forever? There are two main considerations. The first is connection stability. A long poll ultimately rides on TCP at the transport layer. If the server suffers an abnormal event such as a crash or a full GC, or goes through a routine operation such as a restart, long polling has no application-layer heartbeat, and relying solely on TCP-level keep-alive makes availability hard to guarantee. Giving each long poll a finite timeout is therefore itself a way of ensuring availability. The second is a business requirement of the configuration center scenario. Users may add configuration listeners at any time, and a long poll that is already in flight cannot carry listeners added after it was sent. Configuration centers are therefore generally designed to piggyback newly added listeners onto the next long poll, after the current one ends. If a long poll had no timeout, then as long as the configuration never changed, no response would ever return and the newly added listeners could never take effect.

IV. Long polling design in the configuration center

The figure above showed the overall long polling flow; this section explains the design details of long polling in a configuration center.

The client initiates a long poll

The client initiates an HTTP request whose parameters include the configuration center's address and the watched dataId (to simplify the description, this article treats the dataId as the unique key that locates a configuration). If the configuration does not change, the connection between client and server stays open.

The server watches for data changes

The server maintains the mapping between each dataId and its long-polling connections. When a configuration changes, the server finds the corresponding connections and writes the updated configuration content into their responses. If the configuration does not change within the timeout period, the server finds the timed-out long-polling connections and writes a 304 response instead.

In HTTP, 304 means "Not Modified" and is not an error, which makes it a good fit for the case where the configuration has not changed during a long poll.

The client receives the long-polling response

The client first checks whether the response code is 200 or 304 to determine whether the configuration changed and invokes the corresponding callback, then immediately initiates the next long poll.

The server provides a configuration publishing endpoint

This endpoint is mainly used by the configuration console or a client to publish configuration and thereby trigger configuration changes.

These are the core steps for a configuration center to implement long polling, and they guide the code in the following sections. Before coding, though, a few more points of attention need to be clarified.

A configuration center usually serves a distributed cluster, and the application on each machine watches multiple dataIds, so instances × configurations adds up to a very large number of watched dataIds for the server to maintain. Clearly the server cannot dedicate one thread per long-polling connection, or the thread count would explode; a Tomcat instance has only 200 worker threads by default, and long polling must not block Tomcat's business threads. Configuration centers therefore implement long polling with asynchronous responses, and a convenient way to implement asynchronous HTTP is the AsyncContext mechanism introduced in Servlet 3.0.

Servlet 3.0 is not a particularly new specification; it dates from the same era as Java 6. Spring Boot's embedded Tomcat, for example, has supported Servlet 3.0 for a long time, so there is no need to worry about the AsyncContext mechanism being unavailable.

Spring MVC's DeferredResult is built on the AsyncContext provided by Servlet 3.0, so there is not much fundamental difference between them. I have not studied the source code behind the two in depth, but from a usage perspective AsyncContext is more flexible; for example, it allows customizing the response code. DeferredResult is a higher-level wrapper that helps developers implement an asynchronous response quickly, but without fine-grained control over the response. So in the example below I chose AsyncContext.

V. Long polling implementation in the configuration center

1. Client implementation

@Slf4j
public class ConfigClient {

    private final CloseableHttpClient httpClient;
    private final RequestConfig requestConfig;

    public ConfigClient() {
        this.httpClient = HttpClientBuilder.create().build();
        // ① the httpClient socket timeout must be longer than the agreed long-polling timeout
        this.requestConfig = RequestConfig.custom().setSocketTimeout(40000).build();
    }

    @SneakyThrows
    public void longPolling(String url, String dataId) {
        String endpoint = url + "?dataId=" + dataId;
        // loop instead of recursing, so repeated polls do not grow the call stack
        while (!Thread.currentThread().isInterrupted()) {
            HttpGet request = new HttpGet(endpoint);
            // apply the timeout configuration to the request
            request.setConfig(requestConfig);
            // try-with-resources closes the response on every branch
            try (CloseableHttpResponse response = httpClient.execute(request)) {
                switch (response.getStatusLine().getStatusCode()) {
                    case 200: {
                        String configInfo = EntityUtils.toString(response.getEntity());
                        log.info("dataId: [{}] changed, receive configInfo: {}", dataId, configInfo);
                        break;
                    }
                    // ② a 304 response code marks the configuration as unchanged
                    case 304: {
                        log.info("longPolling dataId: [{}] once finished, configInfo is unchanged, longPolling again", dataId);
                        break;
                    }
                    default: {
                        throw new RuntimeException("unexpected HTTP status code");
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        // httpClient prints a lot of debug logs; turn them off
        Logger logger = (Logger) LoggerFactory.getLogger("org.apache.http");
        logger.setLevel(Level.INFO);
        logger.setAdditive(false);

        ConfigClient configClient = new ConfigClient();
        // ③ watch the configuration under dataId: user
        configClient.longPolling("http://127.0.0.1:8080/listener", "user");
    }

}

There are three main points to note:

  • RequestConfig.custom().setSocketTimeout(40000).build(): the httpClient timeout must be longer than the agreed long-polling timeout; otherwise the client would close the HTTP connection on its own before the server has a chance to respond. Note that the config must also actually be applied to the request via request.setConfig(requestConfig).
  • response.getStatusLine().getStatusCode() == 304: as agreed above, a 304 response code indicates that the configuration has not changed, and the client continues long polling.
  • configClient.longPolling("http://127.0.0.1:8080/listener", "user"): to keep things simple, the example starts a single client watching a single dataId, user (note that the server must be started first).

2. Server implementation

@RestController
@Slf4j
@SpringBootApplication
public class ConfigServer {

    @Data
    private static class AsyncTask {
        // context of the long-polling request, holding both the request and the response
        private AsyncContext asyncContext;
        // timeout flag; cleared when a configuration change answers the poll first
        private boolean timeout;

        public AsyncTask(AsyncContext asyncContext, boolean timeout) {
            this.asyncContext = asyncContext;
            this.timeout = timeout;
        }
    }

    // guava multi-valued Map: one key can map to multiple values
    private Multimap<String, AsyncTask> dataIdContext = Multimaps.synchronizedSetMultimap(HashMultimap.create());

    private ThreadFactory threadFactory = new ThreadFactoryBuilder().setNameFormat("longPolling-timeout-checker-%d")
        .build();
    private ScheduledExecutorService timeoutChecker = new ScheduledThreadPoolExecutor(1, threadFactory);

    // configuration listening endpoint
    @RequestMapping("/listener")
    public void addListener(HttpServletRequest request, HttpServletResponse response) {

        String dataId = request.getParameter("dataId");

        // switch the request to asynchronous mode
        AsyncContext asyncContext = request.startAsync(request, response);
        // disable the container's own async timeout so our scheduler controls the lifecycle
        asyncContext.setTimeout(0L);
        AsyncTask asyncTask = new AsyncTask(asyncContext, true);

        // associate the dataId with the asynchronous request context
        dataIdContext.put(dataId, asyncTask);

        // schedule a task that writes a 304 response after 30 s if nothing has changed
        timeoutChecker.schedule(() -> {
            if (asyncTask.isTimeout()) {
                dataIdContext.remove(dataId, asyncTask);
                response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
                asyncContext.complete();
            }
        }, 30000, TimeUnit.MILLISECONDS);
    }

    // configuration publishing endpoint
    @RequestMapping("/publishConfig")
    @SneakyThrows
    public String publishConfig(String dataId, String configInfo) {
        log.info("publish configInfo dataId: [{}], configInfo: {}", dataId, configInfo);
        Collection<AsyncTask> asyncTasks = dataIdContext.removeAll(dataId);
        for (AsyncTask asyncTask : asyncTasks) {
            // clear the flag so the pending timeout task becomes a no-op
            asyncTask.setTimeout(false);
            HttpServletResponse response = (HttpServletResponse) asyncTask.getAsyncContext().getResponse();
            response.setStatus(HttpServletResponse.SC_OK);
            response.getWriter().println(configInfo);
            asyncTask.getAsyncContext().complete();
        }
        return "success";
    }

    public static void main(String[] args) {
        SpringApplication.run(ConfigServer.class, args);
    }

}

Some notes on the implementation above:

@RequestMapping("/listener") is the configuration listening endpoint and the entry point for long polling. After obtaining the dataId, request.startAsync switches the request to asynchronous mode, so Tomcat's worker thread is released as soon as the method returns instead of being held for the life of the poll.

Then dataIdContext.put(dataId, asyncTask) associates the dataId with the asynchronous request context, so that a configuration publish can find the matching contexts. Note the guava data structure used here, Multimap<String, AsyncTask>: a multi-valued Map in which one key maps to multiple values. You can think of it as a Map<String, List<AsyncTask>>, but Multimap makes some of the concurrent bookkeeping easier. Why multiple values? Because the configuration center's server accepts watches on the same dataId from many clients.
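If guava is not at hand, the same multi-valued mapping can be sketched with the JDK alone. This is an illustrative stand-in for the two operations the server actually uses, not guava's implementation:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// A minimal stand-in for guava's Multimap: one dataId, many waiting tasks.
public class SimpleMultimap<V> {
    private final Map<String, Collection<V>> map = new ConcurrentHashMap<>();

    // register one more waiter under the key
    public void put(String key, V value) {
        map.computeIfAbsent(key, k -> new CopyOnWriteArrayList<>()).add(value);
    }

    // drop a single waiter, as the timeout task does for its own AsyncTask
    public boolean remove(String key, V value) {
        Collection<V> values = map.get(key);
        return values != null && values.remove(value);
    }

    // atomically detach all waiters for a key, as dataIdContext.removeAll(dataId) does
    public Collection<V> removeAll(String key) {
        Collection<V> removed = map.remove(key);
        return removed != null ? removed : List.of();
    }
}
```

A publish then becomes removeAll(dataId) followed by completing each returned task, mirroring the guava version in the server code.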

timeoutChecker.schedule() starts a timed task that writes a 304 response after 30 s. Combined with the client logic above, where receiving a 304 triggers a new long poll, this forms the polling loop.

@RequestMapping("/publishConfig") is the publishing entry point. When a configuration changes, all long polls for that dataId are taken out at once and the change is written into their responses; the timeout flag is cleared so that the still-pending timed task does nothing. This completes one push triggered by a configuration change.

3. Starting configuration monitoring

Start ConfigServer first, then ConfigClient. The client logs its long polling as follows:

22:18:09.185 [main] INFO moe.cnkirito.demo.ConfigClient - longPolling dataId: [user] once finished, configInfo is unchanged, longPolling again
22:18:39.197 [main] INFO moe.cnkirito.demo.ConfigClient - longPolling dataId: [user] once finished, configInfo is unchanged, longPolling again

Publish a configuration:

curl -X GET "localhost:8080/publishConfig?dataId=user&configInfo=helloworld"

The server print log is as follows:

2021-01-24 22:18:50.801  INFO 73301 --- [nio-8080-exec-6] moe.cnkirito.demo.ConfigServer           : publish configInfo dataId: [user], configInfo: helloworld

The client receives the configuration push:

22:18:50.806 [main] INFO moe.cnkirito.demo.ConfigClient - dataId: [user] changed, receive configInfo: helloworld

VI. Implementation details

Why a server-side timer is needed to return 304

In the implementation above, the server uses a timer to return 304 when the configuration has not changed, and the client re-initiates the long poll upon receiving it. I explained earlier why a long poll must be re-initiated after a timeout rather than held by the server until the configuration changes, but some readers may still wonder: why not let the client control the timeout and remove the timer from the server, so that the client simply re-initiates the next long poll after its own timeout fires? Wouldn't that design be simpler? Yet both Nacos and Apollo use such a server-side timer rather than relying on client-side timeouts, for two main reasons:

  • To distinguish this case from a genuine client timeout.
  • Exceptions should express abnormal flows, not normal business flows. A 304 is not a timeout error; it is the normal long-polling outcome when the configuration has not changed, and should not be expressed as a client-side timeout exception.

The client timeout must be configured separately and must be longer than the server's long-polling timeout, just as the demo above sets the client timeout to 40 s while the server judges a long poll to have timed out after 30 s. These two values default to 30 s and 29.5 s in Nacos, and 90 s and 60 s in Apollo.

One long poll carries a batch of dataIds

In the demo above, each dataId initiates its own long poll. A real configuration center certainly cannot be designed this way; the usual optimization is to group dataIds and include them in a single long-polling task in batches. In Nacos, 3000 dataIds are grouped into one long-polling task.
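The batching itself can be sketched as follows. The group size of 3000 follows the Nacos figure quoted above; the method and class names are illustrative, not Nacos APIs:

```java
import java.util.ArrayList;
import java.util.List;

public class LongPollingBatcher {
    static final int GROUP_SIZE = 3000; // per-task dataId count, as quoted for Nacos

    // Split the full list of watched dataIds into long-polling tasks of at most GROUP_SIZE each.
    static List<List<String>> toBatches(List<String> dataIds) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < dataIds.size(); i += GROUP_SIZE) {
            batches.add(new ArrayList<>(dataIds.subList(i, Math.min(i + GROUP_SIZE, dataIds.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> dataIds = new ArrayList<>();
        for (int i = 0; i < 7000; i++) {
            dataIds.add("dataId-" + i);
        }
        // 7000 watched dataIds become 3 long-polling tasks (3000 + 3000 + 1000)
        System.out.println(toBatches(dataIds).size() + " tasks");
    }
}
```

Each batch then becomes one long-polling request, so a client watching thousands of dataIds holds only a handful of connections.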

VII. Long polling vs. long connections

With the implementation details covered, the core of this article is done. Returning to the push and pull modes discussed earlier: while writing this article I asked friends in a chat group how configuration centers implement dynamic push, and most of them assumed a push model over a long connection. In reality, almost all mainstream configuration centers use the long-polling scheme introduced here. Why?

I also read quite a few blog posts, and frankly the reasons they gave did not convince me, so I tried to analyze this established fact from my own perspective:

Long polling is relatively easy to implement: all of the logic can be built on plain HTTP, and HTTP is the communication method most universally accepted.

Because long polling uses HTTP, writing clients in multiple languages is easy, since most languages have mature HTTP clients.

So is a long connection really unsuitable for the configuration center scenario? Some argue that maintaining long connections consumes too many resources and that long polling improves system throughput, but in the configuration center scenario this claim is not backed by actual stress-test data. Benchmark everything, please!

In addition, reading the Nacos 2.0 milestone, I found an interesting plan: both the Nacos registry (currently short polling plus UDP push) and the configuration center (currently long polling) are slated to be migrated to a long-connection model.

Looking back, the long-polling implementation already supports the configuration center well enough; replacing it with long connections needs a suitably compelling reason.

VIII. Summary

This article introduced the differences between the data-interaction models of polling, long polling, and long connections.

It then analyzed how mainstream configuration centers such as Nacos and Apollo implement real-time configuration push through long polling. The real-time perception is still based on client pull, since the data interaction is essentially plain HTTP; the feeling of "push" comes from the server holding the client's response and, after a configuration change, actively writing into the response object before returning it.

Original link

This article is the original content of Alibaba Cloud and may not be reproduced without permission.

Origin blog.csdn.net/weixin_43970890/article/details/113858071