Using JMeter stress test results to analyze how Eureka perceives services that go offline in various ways

Preface

As mentioned at the end of the previous article, Eureka is not very sensitive to services going offline: it can keep an offline instance in the list of available services. When load balancing polls that instance to handle a request, the call is sent, but the interface fails with a timeout, 404, 500, or similar error. This article takes services offline in several ways and uses JMeter stress tests to analyze each case in detail.

1. Design of Eureka-Server

Eureka is designed with three levels of cache: the first level (registry), the second level (readWriteCacheMap, the read-write cache), and the third level (readOnlyCacheMap, the read-only cache). As a classic AP model, it separates reads from writes and sacrifices consistency to ensure high availability. (The source code will be analyzed in a later article.)
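To make the read/write separation concrete, here is a minimal sketch of the three levels. It is a simplified model, not Eureka's source code: the class and method names are hypothetical, the caches are keyed by instance id for brevity, and the periodic synchronization that Eureka runs on a timer is triggered by hand.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the three cache levels; all names here are illustrative.
class ThreeLevelCacheSketch {
    private final Map<String, String> registry = new ConcurrentHashMap<>();          // level 1
    private final Map<String, String> readWriteCacheMap = new ConcurrentHashMap<>(); // level 2
    private final Map<String, String> readOnlyCacheMap = new ConcurrentHashMap<>();  // level 3

    // Writes go to the registry and invalidate the read-write entry immediately.
    void register(String instanceId, String status) {
        registry.put(instanceId, status);
        readWriteCacheMap.remove(instanceId);
    }

    // Removing an instance also touches only levels 1 and 2.
    void cancel(String instanceId) {
        registry.remove(instanceId);
        readWriteCacheMap.remove(instanceId);
    }

    // Stand-in for the periodic task (30 s by default in the article's description):
    // reload level 2 from level 1, then copy level 2 into level 3.
    void syncReadOnlyCache() {
        readWriteCacheMap.clear();
        readWriteCacheMap.putAll(registry);
        readOnlyCacheMap.clear();
        readOnlyCacheMap.putAll(readWriteCacheMap);
    }

    // Clients read from the read-only cache, so they can see stale data.
    boolean clientSees(String instanceId) {
        return readOnlyCacheMap.containsKey(instanceId);
    }

    public static void main(String[] args) {
        ThreeLevelCacheSketch server = new ThreeLevelCacheSketch();
        server.register("svc:8081", "UP");
        server.syncReadOnlyCache();
        server.cancel("svc:8081");
        System.out.println("visible before sync: " + server.clientSees("svc:8081")); // true
        server.syncReadOnlyCache();
        System.out.println("visible after sync: " + server.clientSees("svc:8081"));  // false
    }
}
```

The point of the sketch is only that a cancelled instance stays visible to readers until the next synchronization of the read-only level.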

2. How Eureka + Ribbon perceive an offline service

When a client's service instance goes offline normally, the change travels through the following chain: the instance sends a heartbeat to the Eureka server to update the first-level cache (30 s); the first-level cache synchronizes the change to the second-level cache (immediately); the second-level cache is synchronized to the third-level cache (30 s); the client synchronizes from the third-level cache (30 s); and Ribbon synchronizes the client's cache and updates its service list, upList (30 s). In the extreme case it therefore takes 120 s to detect that a service is offline. (Note that these steps do not run serially; each runs on its own timer, and 30 s is the default interval.)
Based on this process, the rest of the article takes services offline in several ways and uses the JMeter stress test reports to study how quickly Eureka perceives it.
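The 120 s figure in this section is simply the sum of the default intervals along the chain. A minimal sketch of that arithmetic follows; the class name and the step labels are hypothetical, and the numbers are the defaults quoted in this section.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Adds up the per-step delays described in this section (illustrative names only).
class OfflineDetectionDelay {
    // Default intervals in seconds, in the order the change propagates.
    static Map<String, Integer> defaultIntervals() {
        Map<String, Integer> intervals = new LinkedHashMap<>();
        intervals.put("instance -> first-level cache (registry)", 30);
        intervals.put("first-level -> second-level cache", 0); // immediate
        intervals.put("second-level -> third-level cache", 30);
        intervals.put("third-level cache -> client", 30);
        intervals.put("client -> Ribbon server list", 30);
        return intervals;
    }

    // Worst case: every step has to wait for a full interval.
    static int worstCaseSeconds(Map<String, Integer> intervals) {
        return intervals.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        Map<String, Integer> intervals = defaultIntervals();
        intervals.forEach((step, seconds) -> System.out.println(step + ": " + seconds + " s"));
        System.out.println("worst case: " + worstCaseSeconds(intervals) + " s"); // worst case: 120 s
    }
}
```

Because each timer runs independently, the real delay is usually shorter than this upper bound; shortening any single interval lowers the bound by the same amount.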

3. Service calling interface stress testing model

JMeter (100 threads, all requests issued within 3 s) is used to stress test the service-calling interface after services go offline and observe how the interface behaves. The logs are used to observe Ribbon's load balancing.
The exception rate of the responses after the thread group finishes shows whether Eureka and Ribbon failed to update the offline services in time (that is, how well they perceive them), causing the caller to call an unavailable service.
The called service currently has three instances, on ports 8081, 8083, and 8084. In this experiment two of them are taken offline, and the Ribbon load balancing strategy is left at its default.
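The mechanism being measured can be pictured with a small sketch: a round-robin balancer that keeps polling a stale server list and therefore hits instances that are no longer there. This is an illustration under assumptions, not Ribbon's implementation: the class and method names are hypothetical, 8083 and 8084 are arbitrarily chosen as the offline ports, and retries, timeouts, and cache refreshes during the run are not modeled, so it is not meant to reproduce the measured exception rates.

```java
import java.util.List;
import java.util.Set;

// Counts calls that land on offline instances when the cached list is stale.
class StaleRoundRobinSketch {
    static int failedCalls(List<Integer> cachedPorts, Set<Integer> offlinePorts, int requests) {
        int failures = 0;
        for (int i = 0; i < requests; i++) {
            int port = cachedPorts.get(i % cachedPorts.size()); // plain round robin
            if (offlinePorts.contains(port)) {
                failures++; // the caller would see connection refused or a timeout here
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<Integer> cachedPorts = List.of(8081, 8083, 8084); // stale list still has all three
        Set<Integer> offlinePorts = Set.of(8083, 8084);        // assumed offline instances
        System.out.println(failedCalls(cachedPorts, offlinePorts, 6)); // 4
    }
}
```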

4. Several ways for Eureka services to go offline

4.1 Forced offline

This closes the process directly, similar to kill -9 on a server. In the following case I forcibly took two instances of the service offline and immediately made remote calls between services. Because of Eureka's caching mechanism, the offline instances are still in the cached service list, which has not had time to update, but they can no longer process requests, so the caller reports a connection refused error.
In the console, the caller cannot establish a connection: the request reached the target host, but the connection was actively refused.
In Postman, the interface that initiated the call returns a 500 error, indicating an internal server problem.
PS: Going offline this way carries many risks (for example, requests may still be in progress in the process), so it is not recommended.

Pressure test

After 15 s: running the JMeter stress test model gives an exception rate as high as 51%.
After 30 s: running the JMeter stress test model gives an exception rate of 0%.

4.2 Send delete() request

Send an HTTP DELETE request to the Eureka server to delete the service's information from the registry, that is, the data in the first-level cache.

// Assumes the enclosing controller has the fields eurekaClient (EurekaClient),
// eurekaServer (host:port of the Eureka server), and an SLF4J logger named log.
@GetMapping("/service-down")
public String shutDown(@RequestParam List<Integer> portParams, @RequestParam String vipAddress) {
    if (ObjectUtils.isEmpty(portParams)) {
        return "Port list is empty"; // TODO: replace with a custom exception
    }
    List<Integer> successList = new ArrayList<>();
    // Fetch all instances registered under the service name
    List<InstanceInfo> instances = eurekaClient.getInstancesByVipAddress(vipAddress, false);
    // Map of port -> "appName/instanceId", the path suffix of
    // http://eureka-server-url/eureka/apps/{appName}/{instanceId}
    Map<Integer, String> sourceMap = new HashMap<>();
    for (InstanceInfo instance : instances) {
        sourceMap.put(instance.getPort(), instance.getAppName() + "/" + instance.getInstanceId());
    }
    OkHttpClient client = new OkHttpClient();
    for (Integer port : portParams) {
        String serviceInfo = sourceMap.get(port);
        if (serviceInfo == null) {
            log.debug("No registered instance on port " + port);
            continue;
        }
        // Build the DELETE request that removes the instance from the registry
        Request request = new Request.Builder()
                .url("http://" + eurekaServer + "/eureka/apps/" + serviceInfo)
                .delete()
                .build();
        log.debug(request.url().toString());
        // try-with-resources closes the response
        try (Response response = client.newCall(request).execute()) {
            if (response.code() == 200) {
                log.debug(serviceInfo + " taken offline successfully");
                successList.add(port);
            } else {
                log.debug(serviceInfo + " failed to go offline");
            }
        } catch (IOException e) {
            log.error(e.toString());
        }
    }
    return "goodbye service" + successList;
}

This method skips the step in which the client reports to the first-level cache through its heartbeat and clears the first-level cache directly, which saves 30 seconds in the extreme case. In practice it is not advisable: I have only told eureka-server that the service is offline, while the service process, which has not been shut down, keeps sending heartbeats to eureka-server's first-level cache.
As a result, about ten seconds after the deletion, the offline services were registered again.
Right after calling the offline interface I made a remote call, and an exception occurred: the interface response time was too long.
My understanding: calling the offline interface cleared the first-level cache in eureka-server, but the third-level cache and Ribbon's cache were not cleared (that is, updated). Load balancing happened to poll an instance that had been taken offline but not yet updated, and it let the caller send the request successfully, which gave the caller a false impression. Note that up to this point the client's service has not gone offline at the physical level: it is still sending heartbeats to eureka-server's first-level cache, which are then synchronized to the third-level cache, to keep the service available. This period (service renewal) is why the interface response time was too long.
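The re-registration seen here can be sketched as follows. This is a toy model under assumptions, not Eureka's code: the class and method names are hypothetical, and it assumes (as I understand the Eureka client's behavior) that a heartbeat for an unknown instance is answered with 404 and that a still-running client responds by registering again.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model: deleting an instance from the registry while its process keeps running.
class ReRegistrationSketch {
    private final Set<String> registry = ConcurrentHashMap.newKeySet();

    void register(String instanceId) { registry.add(instanceId); }

    // What the DELETE request in this section does on the server side.
    void delete(String instanceId) { registry.remove(instanceId); }

    boolean contains(String instanceId) { return registry.contains(instanceId); }

    // Server side of a heartbeat: 200 if the instance is known, 404 otherwise.
    int renew(String instanceId) { return registry.contains(instanceId) ? 200 : 404; }

    // Client side: a process that is still alive re-registers when its heartbeat is rejected.
    void heartbeat(String instanceId) {
        if (renew(instanceId) == 404) {
            register(instanceId);
        }
    }

    public static void main(String[] args) {
        ReRegistrationSketch server = new ReRegistrationSketch();
        server.register("svc:8083");
        server.delete("svc:8083");
        System.out.println("after delete: " + server.contains("svc:8083"));    // false
        server.heartbeat("svc:8083"); // the process was never stopped
        System.out.println("after heartbeat: " + server.contains("svc:8083")); // true
    }
}
```

Under this model the deletion is undone by the very next heartbeat, which is why a registry deletion alone does not keep a running instance offline.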

Pressure test

After 15 s: running the JMeter stress test model gives an interface exception rate of 0%.
After 30 s: running the JMeter stress test model gives an interface exception rate of 0%.
Why does this method cause no problems at these two points in time? Because by 15 s and 30 s the services that were taken offline have already been renewed, and cross-service requests are still load balanced across all three instances.
Presumably, by the time the second-level cache in eureka-server synchronizes from the first-level cache (registry), the offline services have already been renewed into it through their heartbeats, that is, re-registered. When the third-level cache synchronizes from the second-level cache, all services are online, so no offline service is called.

4.3 Call DiscoveryManager

The client deregisters itself by calling DiscoveryManager's API (the process is not closed). This can be triggered by sending an HTTP request to an interface:

@GetMapping(value = "/service-down")
public void offLine() {
    DiscoveryManager.getInstance().shutdownComponent();
}

To take instances offline by port, I wrote the following, which calls the interface above on each target instance over HTTP:

    /**
     * Take services offline through DiscoveryManager.
     * @param portParams ports of the instances to take offline
     */
    @GetMapping(value = "/service-down-list")
    public String offLine(@RequestParam List<Integer> portParams) {
        List<Integer> successList = new ArrayList<>();
        // Look up the registered instances of this service
        List<InstanceInfo> instances = eurekaClient.getInstancesByVipAddress(appName, false);
        List<Integer> servicePorts = instances.stream().map(InstanceInfo::getPort).collect(Collectors.toList());

        // Take the requested instances offline one by one
        OkHttpClient client = new OkHttpClient();
        portParams.forEach(temp -> {
            if (servicePorts.contains(temp)) {
                Request request = new Request.Builder()
                        .url("http://" + ipAddress + ":" + temp + "/control/service-down")
                        .build();
                // try-with-resources closes the response
                try (Response response = client.newCall(request).execute()) {
                    if (response.code() == 200) {
                        log.debug(temp + " taken offline successfully");
                        successList.add(temp);
                    } else {
                        log.debug(temp + " failed to go offline");
                    }
                } catch (IOException e) {
                    log.error(e.toString());
                }
            }
        });
        return successList + " taken offline gracefully";
    }

After going offline this way, the client service no longer sends heartbeats to eureka-server, and its information in the first-level cache is cleared immediately.
Ideal state: if we take services offline this way and also shorten the Ribbon cache synchronization interval, the second-level-to-third-level cache synchronization interval, and the client's third-level cache synchronization interval, would the probability of polling an offline service be greatly reduced? (Of course, changing these intervals applies only to native Eureka, not to SpringCloudEureka.)

Pressure test

After 15 s: the exception rate is 0%, but instances that no longer exist in the first-level cache are still called.
After 30 s: instances that no longer exist in the first-level cache are still called.
Although the exception rate at both points in time is 0%, right after the API call the load is still balanced to the instances that have been taken offline. (The requests succeed because the process itself is still running.)
Only in the 40-50 s range are the offline services no longer called. This time is mainly spent synchronizing Eureka's caches (second level to third level, third level to client, client to Ribbon).

4.4 Third-party tool Actuator

Use a third-party tool, Actuator, to shut down the service. From what I have read online, with this method the service immediately stops accepting new requests, finishes processing the current ones, and then shuts down (the process is closed). It is relatively simple to implement: introduce the actuator dependency, enable and expose its shutdown endpoint (it is disabled by default), and send a POST request to the target port:

    /**
     * Take a list of services offline through Actuator.
     * @param portParams ports of the instances to shut down
     * @return the ports that were shut down successfully, followed by a status message
     */
    @GetMapping(value = "/service-down-ports")
    public String downServiceByPorts(@RequestParam List<Integer> portParams) {
        if (ObjectUtils.isEmpty(portParams)) {
            return "Port list is empty";
        }
        // Ports that were taken offline successfully
        List<Integer> successList = new ArrayList<>();
        OkHttpClient client = new OkHttpClient();
        portParams.forEach(temp -> {
            // The shutdown endpoint expects a POST; the body is empty
            Request request = new Request.Builder()
                    .url("http://" + ipAddress + ":" + temp + "/actuator/shutdown")
                    .post(RequestBody.create(null, new byte[0]))
                    .build();
            // try-with-resources closes the response
            try (Response response = client.newCall(request).execute()) {
                if (response.code() == 200) {
                    log.debug(temp + " taken offline successfully");
                    successList.add(temp);
                } else {
                    log.debug(temp + " failed to go offline");
                }
            } catch (IOException e) {
                log.error(e.toString());
            }
        });
        return successList + " taken offline gracefully";
    }

The service goes offline, and its data in Eureka's first-level cache is cleared immediately. Shortening the cache synchronization intervals should also be a good choice here (again, for native Eureka).

Pressure test

After 15 s: the exception rate is as high as 50%.
After 30 s: the exception rate is 0%.
Overall, the latter two solutions clear the first-level cache directly, and the service is not renewed. However, since the third-level cache is not updated in real time, there is still a risk of calling offline services and causing interface errors.

Summary

An interface error presupposes a service call; a service call presupposes load balancing; and load balancing presupposes that synchronized information has been pulled into Ribbon's cache. This is where the problem lies: Ribbon loads services that have already gone offline. I thought of clearing Ribbon's cache (forcing an update). This shortens the time it takes to refresh the service list but cannot fundamentally solve the problem, because Ribbon synchronizes its cache from the client, and the client synchronizes its cache from the third-level cache, so in the end the problem lies in the third-level cache. (The second-level cache synchronizes from the first-level cache in a very short time, so it is effectively equivalent to the first-level cache.) However, SpringCloudEureka cannot force an update of the third-level cache.
It seems that the probability of errors can only be reduced by shortening the perception time; errors cannot be avoided completely.
For SpringCloudEureka there are also many ways to optimize the methods above and shorten the perception time. Let's talk about that next time~

Origin blog.csdn.net/weixin_57535055/article/details/134564835