[SpringCloud] Manually update the Ribbon cache through Redis to solve the problem of service offline awareness in the Eureka microservice architecture

Preface

As the stress test results above show, there are no anomalies after using DiscoveryManager to take the service offline. The only drawback is that this way of going offline merely cancels the registration and stops the heartbeat renewal; it never terminates the process. In other words, after the offline API is called, the service instance can still process requests. On top of that, synchronizing Eureka's three levels of cache takes time: what the Eureka-Client pulls from the third-level cache is not a real-time service list, so what Ribbon pulls from the Eureka-Client is not real-time either. Ultimately, Ribbon load-balances requests onto the offline instance, and since its process is still alive, that instance happily handles them. This is why the instances on the two offline ports were still receiving load-balanced requests!
Following this idea: is it possible to bypass the third-level cache and update the Ribbon cache directly when a service goes offline, so as to shorten the perception time?

Let me give you the answer up front: yes.
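For reference on where that 30-second figure comes from: the perception delay is governed by a few well-known refresh intervals. A minimal sketch of the usual knobs, with the default values as I recall them from the Netflix/Spring Cloud documentation (treat them as assumptions and verify against your versions):

eureka:
  client:
    registry-fetch-interval-seconds: 30      # Eureka-Client pulls the registry every 30s
  server:
    response-cache-update-interval-ms: 30000 # readWrite cache -> readOnly cache sync
ribbon:
  ServerListRefreshInterval: 30000           # Ribbon refreshes its server list (ms)

Shrinking these narrows the window but never closes it, which is why the approach below bypasses the timers entirely.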

1. First try

1.1 Service callee update

Manually synchronize service cache information from Eureka-Client:

When analyzing the Ribbon source code earlier, we saw that a request travels from http://service-name/path to http://service-address/path: the caller's request is caught by Ribbon's interceptor and rewritten, via load balancing, into a concrete service address. A very important method in that process is getLoadBalancer("serviceName").
From it you can see that Ribbon uses the service name to obtain both the list of all servers (allServerList) and the list of reachable servers (upServerList). So can we use the same mechanism to fetch the latest available server list ourselves and manually set it into Ribbon's available-server cache, so that Ribbon no longer has to wait for its 30-second synchronization?

Tip: SpringClientFactory, a very important component in our Spring Cloud project, is the factory class Spring Cloud uses to manage and obtain client instances. Through it you can get the load balancer (ILoadBalancer) of a specific service.

Therefore, the following Bean is configured to update the Ribbon cache: whenever the service-offline interface is called to take the specified services offline, the Ribbon cache is synchronized on the spot, instead of waiting for Ribbon's automatic 30-second refresh:

import com.netflix.loadbalancer.BaseLoadBalancer;
import com.netflix.loadbalancer.ILoadBalancer;
import com.netflix.loadbalancer.Server;
import lombok.extern.slf4j.Slf4j;
import org.springframework.cloud.netflix.ribbon.SpringClientFactory;
import org.springframework.context.annotation.Configuration;

import java.util.List;

@Configuration
@Slf4j
public class ClearRibbonCache {

    public void clearRibbonCache(SpringClientFactory clientFactory, List<Integer> portParams) {
        // Get the load balancer of the specified service
        ILoadBalancer loadBalancer = clientFactory.getLoadBalancer("user-service");
        // Actively pull the reachable-server list instead of waiting for the interceptor's
        // passive refresh. Note: this comes from the Eureka-Client, which in turn waits on
        // the third-level cache synchronization.
        List<Server> reachableServers = loadBalancer.getReachableServers();
        // Overwrite the Ribbon load balancer's cached server list
        ((BaseLoadBalancer) loadBalancer).setServersList(reachableServers);
    }
}

So the service-offline interface gains one extra step that updates the cache automatically (if you are not familiar with this interface, see the previous article):

@GetMapping(value = "/service-down-list")
public String offLine(@RequestParam List<Integer> portParams) {
    // parallelStream writes to this list concurrently, so use a synchronized list
    List<Integer> successList = Collections.synchronizedList(new ArrayList<>());
    // Get this application's registered instances
    List<InstanceInfo> instances = eurekaClient.getInstancesByVipAddress(appName, false);
    List<Integer> servicePorts = instances.stream().map(InstanceInfo::getPort).collect(Collectors.toList());

    // Take each requested port offline, one by one
    OkHttpClient client = new OkHttpClient();
    log.error("Start time: {}", System.currentTimeMillis());
    portParams.parallelStream().forEach(temp -> {
        if (servicePorts.contains(temp)) {
            String url = "http://" + ipAddress + ":" + temp + "/control/service-down";
            try (Response response = client.newCall(new Request.Builder().url(url).build()).execute()) {
                if (response.code() == 200) {
                    log.debug(temp + " went offline successfully");
                    successList.add(temp);
                } else {
                    log.debug(temp + " failed to go offline");
                }
            } catch (IOException e) {
                log.error(e.toString());
            }
        }
    });
    log.debug("Start clearing the Ribbon cache");
    clearRibbonCache.clearRibbonCache(clientFactory, portParams);
    return successList + " gracefully taken offline";
}
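For illustration, taking ports 8083 and 8084 offline would then be a single call like the following (the host, port, and /control prefix for this endpoint are assumptions based on the URLs in the code above):

GET http://localhost:8081/control/service-down-list?portParams=8083,8084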

1.2 First attempt at stress testing

Similarly, we use the (100 threads - 3s) JMeter stress-test model to run stress tests 15s and 30s after calling the service-offline interface; the tested interface is an ordinary cross-service call.
15s after going offline: observing the console log output, you can find that the two service instances that were taken offline are still being load-balanced to (offline, but their processes have not exited).
30s after going offline: the situation is exactly the same as at 15s; requests are still balanced onto services that are offline but whose processes have not exited. It seems that updating the cache had no effect.
45s after going offline: only at about 45s after calling the offline API is the offline service completely cleared from every layer of cache. For a release, a delay like that is fatal!

1.3 Problem analysis

In a service-release scenario, this leads to the following business problem. Development calls the API to take two services offline and notifies operations to shut those two processes down; operations runs kill -9 on the two processes and prepares to release the new version. But at that moment a client (user) request arrives at the server and happens to involve a cross-service call. Because Ribbon synchronizes from the Eureka-Client cache, and the Eureka-Client itself takes time to synchronize the third-level cache in Eureka-Server, the available-server list in the Ribbon cache is not the latest: it still contains the services that were taken offline (whose processes have now been killed). The request is then load-balanced by Ribbon onto an instance that development took offline via the API and operations killed with kill -9, so the interface returns errors such as 500, 404, connect timed out, or connection refused, triggering frequent alerts.

1.4 What is synchronized is not the latest list

Look beyond the surface:

Why doesn't manually synchronizing the Ribbon cache work? Is something wrong with the content being synchronized? Set a breakpoint and debug to see what service list we actually got after the services went offline:
By accident I discovered that the "real-time" service list I had naively assumed I was getting was nothing of the sort; the clown was me all along. Ports 8083 and 8084 are clearly offline, yet there they are, still in the available-server list, being set straight back into the Ribbon cache.

It turns out that the service list obtained through that method comes from the Eureka-Client, so this is really the same old problem of the client synchronizing with the third-level cache. Remember the question of why a synchronization delay remains even after updating the cache manually? The list the client syncs from the third-level cache still contains the supposedly offline services, so the list manually pushed into the Ribbon cache contains them too. Seeing this, isn't Eureka's "sacrifice consistency to guarantee availability" on full display?
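You can confirm this staleness for yourself by logging what the client still believes right after the offline call. A minimal debug sketch, reusing the load-balancer lookup from above (the service name is the same "user-service" as before):

// Debug sketch: what does the Eureka-Client/Ribbon side still consider reachable?
ILoadBalancer lb = clientFactory.getLoadBalancer("user-service");
for (Server s : lb.getReachableServers()) {
    // 8083 and 8084 keep showing up here until the third-level cache sync catches up
    log.debug("still reachable according to the client: {}", s.getHostPort());
}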

Can this consistency problem really not be solved?
Actually, I have another trick.
Looking at the service-call chain of the Eureka-Ribbon architecture, updating the Ribbon cache on the service caller gives better assurance that the server list Ribbon balances over stays under my control.
PS: this skips one intermediate attempt, namely adding the filtering logic on the service callee; its stress-test results were exactly the same as before, so it is omitted here and we go straight to the service caller.

2. Second try

2.1 The caller filters out offline services

Filter the offline services out of the obtained server list, and do it on the caller side.
On the caller? Then how does the caller learn which ports the callee has taken offline? Should we use MQ for cross-process communication, or Redis? Here I choose Redis.
With a slight change to the cache-update operation above, the update moves to the service caller, with Redis as the communication channel (the hash data structure is used here). All the callee now needs to do is write the offline port information into Redis:

@GetMapping(value = "/service-down-list")
public String offLine(@RequestParam List<Integer> portParams) {
    // parallelStream writes to this list concurrently, so use a synchronized list
    List<Integer> successList = Collections.synchronizedList(new ArrayList<>());
    // Get this application's registered instances
    List<InstanceInfo> instances = eurekaClient.getInstancesByVipAddress(appName, false);
    List<Integer> servicePorts = instances.stream().map(InstanceInfo::getPort).collect(Collectors.toList());

    // Take each requested port offline, one by one
    OkHttpClient client = new OkHttpClient();
    log.error("Start time: {}", System.currentTimeMillis());
    portParams.parallelStream().forEach(temp -> {
        if (servicePorts.contains(temp)) {
            String url = "http://" + ipAddress + ":" + temp + "/control/service-down";
            try (Response response = client.newCall(new Request.Builder().url(url).build()).execute()) {
                if (response.code() == 200) {
                    log.debug(temp + " went offline successfully");
                    successList.add(temp);
                } else {
                    log.debug(temp + " failed to go offline");
                }
            } catch (IOException e) {
                log.error(e.toString());
            }
        }
    });
    // Notify the caller of the offline ports via Redis (hash structure)
    stringRedisTemplate.opsForHash().put("port-map", "down-ports", portParams.toString());
    return successList + " gracefully taken offline";
}

And the earlier operation of updating Ribbon's available-server list changes slightly too: a manual filtering step is added:

import com.netflix.loadbalancer.BaseLoadBalancer;
import com.netflix.loadbalancer.ILoadBalancer;
import com.netflix.loadbalancer.Server;
import lombok.extern.slf4j.Slf4j;
import org.springframework.cloud.netflix.ribbon.SpringClientFactory;
import org.springframework.context.annotation.Configuration;

import java.util.List;
import java.util.stream.Collectors;

@Configuration
@Slf4j
public class ClearRibbonCache {

    /**
     * Whether this server's port is in the offline-port list
     */
    public static boolean cutDown(List<Integer> ports, Server index) {
        return ports.contains(index.getPort());
    }

    public void clearRibbonCache(SpringClientFactory clientFactory, String portParams) {
        // Get the load balancer of the specified service
        ILoadBalancer loadBalancer = clientFactory.getLoadBalancer("user-service");
        // Actively pull the reachable-server list instead of relying on the interceptor's
        // passive refresh; this still comes from the Eureka-Client's third-level cache sync
        List<Server> reachableServers = loadBalancer.getReachableServers();
        // Filter out the servers whose ports have been taken offline
        List<Integer> portList = StringChange.stringToList(portParams);
        List<Server> ableServers = reachableServers.stream().filter(temp -> !cutDown(portList, temp)).collect(Collectors.toList());
        log.debug("Available server list: {}", ableServers);
        // Overwrite the Ribbon load balancer's cached server list
        ((BaseLoadBalancer) loadBalancer).setServersList(ableServers);
    }
}
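The StringChange.stringToList helper used above is never shown in the article. A minimal sketch of what it presumably does, turning the "[8083, 8084]" string stored in Redis back into a List<Integer> (the class and method names match the usage above; the body is my assumption):

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class StringChange {

    public static List<Integer> stringToList(String s) {
        // "[8083, 8084]" -> ["8083", "8084"] -> [8083, 8084]
        if (s == null || s.length() <= 2) {
            return Collections.emptyList();
        }
        return Arrays.stream(s.substring(1, s.length() - 1).split(","))
                .map(String::trim)
                .filter(t -> !t.isEmpty())
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }
}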

On the service caller, before each cross-service call, the real-time offline ports are fetched from Redis and the Ribbon cache is updated:
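Since the screenshot is not reproduced here, a minimal sketch of what that caller-side call looks like before the AOP refactoring in section 2.3 (the endpoint, restTemplate field, and target URL are illustrative assumptions, not from the source):

@GetMapping("/order-detail")
public String orderDetail() {
    // Pull the real-time offline ports from Redis before the cross-service call
    String ports = (String) stringRedisTemplate.opsForHash().get("port-map", "down-ports");
    if (ObjectUtils.isNotEmpty(ports)) {
        // Rebuild Ribbon's server list without the offline ports
        clearRibbonCache.clearRibbonCache(springClientFactory, ports);
    }
    // This call now load-balances over the filtered list only
    return restTemplate.getForObject("http://user-service/user/info", String.class);
}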

2.2 Second attempt at stress testing

This time a stress test is run immediately after the services go offline. All cross-service requests land on the instances that are still online; the instances that are offline but whose processes are still alive no longer handle any requests.
And there are no exceptions at the 15s and 30s marks either.
It can be seen that actively updating Ribbon's available-server list this way is indeed feasible. Especially in the specific scenario of releasing new services on the operations side, it solves the problem of Eureka perceiving offline services too slowly and Ribbon consequently load-balancing onto unavailable instances.

2.3 Optimization

In fact, if the interfaces that raise alerts during each release can be located precisely, and there are not many of them, manually synchronizing the Ribbon cache inside those business interfaces is no big deal and solves the problem too. But when many, ever-changing interfaces are alerting, the approach above turns bloated. It is also a form of invasive programming, which I really don't recommend!
Speaking of invasive programming, one can't help but think of its non-invasive counterpart: AOP.
Take the module that reports the errors as the aspect and write the Ribbon-update operation into the pointcut expression, and you get the update behavior without touching any existing business code, like this:

import javax.annotation.Resource;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.ObjectUtils;
import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;
import org.springframework.cloud.netflix.ribbon.SpringClientFactory;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

@Aspect
@Component
@Slf4j
public class RequestAspect {

    @Resource
    SpringClientFactory springClientFactory;
    @Resource
    ClearRibbonCache clearRibbonCache;
    @Resource
    private StringRedisTemplate stringRedisTemplate;

    @Before(value = "execution(* com.yu7.order.web.*.*(..))")
    public void refreshBefore(JoinPoint joinPoint) {
        String ports = (String) stringRedisTemplate.opsForHash().get("port-map", "down-ports");
        log.debug("Ports fetched from Redis: {}", ports);
        // The key only has a value after an offline event; no value means nothing
        // is offline and there is nothing to update
        if (ObjectUtils.isNotEmpty(ports)) {
            clearRibbonCache.clearRibbonCache(springClientFactory, ports);
        }
    }
}

A stress test was run again, and the results were exactly as expected~

Written at the end

What I want to say is: this solution really amounts to a general framework and idea, a rough take on service-offline awareness in an Eureka-based microservice architecture. Once the business has more than one calling relationship, once different kinds of services go offline, or when services go offline intermittently, the value in Redis can be lost... The solution needs further refinement (hard-coding problems remain, hehe), and to avoid affecting the business, safety measures such as a TTL should be added to the data saved in Redis. In short, everyone is welcome to make suggestions, improve this together, and solve this problem together!
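As one concrete example of the TTL idea just mentioned, the callee could expire the Redis entry so a stale port list cannot linger forever. A sketch, assuming the same key names as before (note that in Redis a TTL applies to the whole key, not to individual hash fields):

// After writing the offline ports, give the key a bounded lifetime
stringRedisTemplate.opsForHash().put("port-map", "down-ports", portParams.toString());
stringRedisTemplate.expire("port-map", 5, TimeUnit.MINUTES); // TTL covers the whole "port-map" hash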


Origin: blog.csdn.net/weixin_57535055/article/details/134719740