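Problem

While restarting the Eureka Server cluster in production, we found that the order client failed when calling the distributed ID generation service:

Caused by: com.netflix.client.ClientException: Load balancer does not have available server for client: IDG

In other words, the order service could no longer reach the IDG service.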

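Analysis

The Eureka Client cache is refreshed by a scheduled thread that performs a delta (incremental) update every 30 seconds, and Ribbon reads service information from the Eureka Client's local cache every 30 seconds. The error above is raised by Ribbon, which means Ribbon had no information about the IDG service. Further debugging showed that the Eureka Client's local cache was empty. This raises the real question: what goes wrong when a Eureka Client fetches registry information from a Eureka Server that is restarting, or has just finished restarting, and then applies it to its local cache?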
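The chain described above (server registry, then client cache, then Ribbon's server list) can be sketched in plain Java. This is only an illustration under simplifying assumptions: `LocalRegistryCache` and `ServerList` are hypothetical stand-ins, not Eureka or Ribbon classes, and the instance address is a placeholder.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CachePropagationSketch {

    // Hypothetical stand-in for the Eureka Client's local registry cache.
    static class LocalRegistryCache {
        private Map<String, List<String>> apps = new HashMap<>();

        // Models a full fetch: whatever the server returned replaces the whole cache.
        void replaceWith(Map<String, List<String>> fetched) {
            apps = new HashMap<>(fetched);
        }

        List<String> instancesOf(String app) {
            return apps.getOrDefault(app, new ArrayList<>());
        }
    }

    // Hypothetical stand-in for Ribbon's server list, which copies from the local cache.
    static class ServerList {
        private List<String> servers = new ArrayList<>();

        void refreshFrom(LocalRegistryCache cache, String app) {
            servers = new ArrayList<>(cache.instancesOf(app));
        }

        String choose(String app) {
            if (servers.isEmpty()) {
                throw new IllegalStateException(
                        "Load balancer does not have available server for client: " + app);
            }
            return servers.get(0);
        }
    }

    public static void main(String[] args) {
        LocalRegistryCache cache = new LocalRegistryCache();
        ServerList ribbon = new ServerList();

        Map<String, List<String>> healthy = new HashMap<>();
        healthy.put("IDG", Arrays.asList("idg-host:8080")); // placeholder address
        cache.replaceWith(healthy);
        ribbon.refreshFrom(cache, "IDG");
        System.out.println(ribbon.choose("IDG"));

        // A refresh against an empty registry wipes the cache, and Ribbon follows.
        cache.replaceWith(new HashMap<>());
        ribbon.refreshFrom(cache, "IDG");
        try {
            ribbon.choose("IDG");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The point of the sketch is that Ribbon never talks to the server directly, so an empty client cache is enough to produce the error even though the IDG instances themselves are healthy.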

Tracing the problem

Check the Eureka Client log:


2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - Got delta update with apps hashcode 
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - The total number of instances fetched by the delta processor : 0
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.DiscoveryClient - The Reconcile hashcodes do not match, client : UP_5_, server : . Getting the full registry
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG c.n.discovery.shared.MonitoredConnectionManager - Get connection: {}->http://server1:7010, timeout = 5000
2018-06-05 15:11:49.338 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.shared.NamedConnectionPool - [{}->http://server1:7010] total kept alive: 1, total issued: 1, total allocated: 2 out of 200
2018-06-05 15:11:49.339 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG com.netflix.discovery.shared.NamedConnectionPool - Getting free connection [{}->http://server1:7010][null]
2018-06-05 15:11:49.339 [DiscoveryClient-CacheRefreshExecutor-0] DEBUG org.apache.http.impl.client.DefaultHttpClient - Stale connection check

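The log excerpt above shows what happens when the client's CacheRefreshExecutor (the cache refresh thread pool) runs its task:

Line 1: the client receives the delta update together with the server's apps hashcode (empty here).

Line 2: the delta contains 0 instances.

Line 3: after merging the delta, the client's hashcode does not match the server's (client = UP_5_, server = ""), so a full fetch is required.

Lines 4-7: the client starts the full fetch.

The cause is now clear. While the Eureka Server is restarting, its registry is empty, and the Eureka Client happened to fetch from it at exactly that moment. Because the hashcodes do not match, the client falls back to a full fetch, which overwrites the local cache with the empty registry. The local cache is now wrong, and calls to the service fail.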
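The reconcile step seen in the log can be sketched as follows. This is a simplified model, not the Eureka implementation: instances are reduced to their status strings, a delta is treated as additions only, and `hashCodeOf` and `refresh` are hypothetical names. The hashcode format (status, count, underscores) is modelled on the `UP_5_` value in the log.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReconcileSketch {

    // Simplified hashcode: instance count per status, so five UP instances give "UP_5_".
    // An empty registry gives the empty string.
    static String hashCodeOf(List<String> instanceStatuses) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String status : instanceStatuses) {
            counts.merge(status, 1, Integer::sum);
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sb.append(e.getKey()).append('_').append(e.getValue()).append('_');
        }
        return sb.toString();
    }

    // One refresh cycle: merge the delta, compare hashcodes,
    // and fall back to a full fetch that overwrites the cache on mismatch.
    static List<String> refresh(List<String> localCache, List<String> delta,
                                String serverHashCode, List<String> serverFullRegistry) {
        List<String> merged = new ArrayList<>(localCache);
        merged.addAll(delta);
        if (hashCodeOf(merged).equals(serverHashCode)) {
            return merged;
        }
        return new ArrayList<>(serverFullRegistry);
    }

    public static void main(String[] args) {
        List<String> local = Collections.nCopies(5, "UP");
        List<String> empty = Collections.emptyList();

        System.out.println(hashCodeOf(local)); // prints UP_5_

        // The restarting server reports an empty hashcode and an empty registry.
        List<String> after = refresh(local, empty, "", empty);
        System.out.println(after.size()); // prints 0
    }
}
```

In this model the client has no way to tell "the server lost its data" apart from "every instance was deregistered", so it trusts the server and discards five healthy entries.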

Checking the Eureka Server configuration revealed the following:


eureka:
  instance:
      hostname: server2
  client:
    serviceUrl:
      defaultZone: http://server1:7010/eureka/,http://server3:7012/eureka/
    fetch-registry: false 
    register-with-eureka: true   # register this node with the Eureka cluster

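fetch-registry: false means that when the Eureka Server registers with the cluster as a client, it does not pull the full registry. However, when the server starts up, it reads the registry from its own local client (with register-with-eureka: true, the server is itself registered with Eureka as a client) and registers those entries in its own registry. For details, see: Understanding Eureka Server Cluster Synchronization in Depth (Part 10).

In other words, right after startup the registry that the Eureka Server holds as a server is empty. It can only fill it in gradually from the renewals that the other cluster nodes replicate to it.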

With that in mind, change the configuration as follows:


eureka:
  instance:
      hostname: server2
  client:
    serviceUrl:
      defaultZone: http://server1:7010/eureka/,http://server3:7012/eureka/
    fetch-registry: true 
    register-with-eureka: true   # register this node with the Eureka cluster

Setting fetch-registry to true lets the Eureka Server copy the whole registry into its own node as soon as it starts.
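The effect of the flag on the startup sync can be illustrated with a small model. It assumes, as described above, that the server seeds its registry from its embedded client's cache; `embeddedClientCache` and this `syncUp` are hypothetical simplifications, and the instance ids are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

class SyncUpSketch {

    // Hypothetical embedded client: it holds a copy of a peer's registry
    // only when fetchRegistry is true, otherwise its cache stays empty.
    static Map<String, String> embeddedClientCache(boolean fetchRegistry,
                                                   Map<String, String> peerRegistry) {
        return fetchRegistry ? new HashMap<>(peerRegistry) : new HashMap<>();
    }

    // Stand-in for the startup sync: copy the client cache into the
    // server registry and return the number of instances copied.
    static int syncUp(Map<String, String> serverRegistry, Map<String, String> clientCache) {
        serverRegistry.putAll(clientCache);
        return clientCache.size();
    }

    public static void main(String[] args) {
        Map<String, String> peer = new HashMap<>();
        peer.put("idg-1", "IDG");     // placeholder instance ids
        peer.put("order-1", "ORDER");

        for (boolean fetchRegistry : new boolean[] {false, true}) {
            Map<String, String> serverRegistry = new HashMap<>();
            int count = syncUp(serverRegistry, embeddedClientCache(fetchRegistry, peer));
            System.out.println("fetch-registry=" + fetchRegistry + " -> synced " + count);
        }
    }
}
```

With the flag off, the sync has nothing to copy and the server opens with zero instances. With it on, the server starts from a peer's snapshot.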

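Concurrency testing showed that this configuration only reduces the probability of the problem and does not eliminate it. The reason is as follows: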

protected void initEurekaServerContext() throws Exception {
   // ... (much code omitted)
   // Copy the registry from the peer nodes
   int registryCount = this.registry.syncUp();
   // Set the Eureka status to UP; this also starts a scheduled task that
   // evicts clients that have sent no heartbeat for 60 seconds
   this.registry.openForTraffic(this.applicationInfoManager, registryCount);

   // ... (much code omitted)
   EurekaMonitors.registerAllStats();
}

@Override
public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
    // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
    // Expected number of renewals per minute
    this.expectedNumberOfRenewsPerMin = count * 2;
    // Minimum number of renewals per minute
    this.numberOfRenewsPerMinThreshold =
            (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
    logger.info("Got " + count + " instances from neighboring DS node");
    logger.info("Renew threshold is: " + numberOfRenewsPerMinThreshold);
    this.startupTime = System.currentTimeMillis();
    if (count > 0) {
        this.peerInstancesTransferEmptyOnStartup = false;
    }
    DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
    boolean isAws = Name.Amazon == selfName;
    if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
        logger.info("Priming AWS connections for all replicas..");
        primeAwsReplicas(applicationInfoManager);
    }
    logger.info("Changing status to UP");
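    // Set the instance status to UP
    applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
    // Start the scheduled task (runs every 60 seconds by default) that evicts instances that have not renewed within 60 seconds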
    super.postInit();
}

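At first glance the code above looks fine. But suppose the following two things happen at the same time:


Eureka Client   delta fetch
Eureka Server   syncing registry data from the cluster nodes

If a Eureka Client pulls data before the Eureka Server has finished syncing from its peers, the client receives incomplete or empty data. This causes the same problem as before, only with a lower probability.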
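The timing window can be made concrete with a deterministic model in which the server copies one instance per step and a client may fetch between steps. `PartialSyncSketch` is a hypothetical illustration with placeholder instance ids. It does not reproduce Eureka's threading, only the ordering problem.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PartialSyncSketch {
    private final List<String> registry = new ArrayList<>();
    private final Iterator<String> pending;

    PartialSyncSketch(List<String> peerInstances) {
        this.pending = new ArrayList<>(peerInstances).iterator();
    }

    // Copies one more instance from the peer; returns false once nothing is left.
    boolean syncNext() {
        if (!pending.hasNext()) {
            return false;
        }
        registry.add(pending.next());
        return true;
    }

    // What a client's fetch would see at this moment.
    List<String> fetchAll() {
        return new ArrayList<>(registry);
    }

    public static void main(String[] args) {
        PartialSyncSketch server =
                new PartialSyncSketch(Arrays.asList("idg-1", "idg-2", "order-1"));

        server.syncNext();
        System.out.println("mid-sync fetch: " + server.fetchAll());

        while (server.syncNext()) {
            // keep copying until the peer's registry is exhausted
        }
        System.out.println("after sync: " + server.fetchAll());
    }
}
```

A client that fetches mid-sync gets a partial view, and nothing in the response tells it that the view is partial. This is why the fix below gates requests on the server's status.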

Complete solution

Modify the configuration file


eureka:
  instance:
      hostname: server1
      initial-status: STARTING
  client:
    serviceUrl:
      defaultZone: http://server2:7011/eureka/,http://server3:7012/eureka/
    fetch-registry: true 
    register-with-eureka: true

Adding eureka.instance.initial-status: STARTING means that the Eureka Server does not register itself proactively when it has just started. It registers only after the data sync has completed.

Custom filter


public void doFilter(ServletRequest request, ServletResponse response,
                     FilterChain chain) throws IOException, ServletException {
    // Status of this Eureka Server instance
    InstanceInfo myInfo = ApplicationInfoManager.getInstance().getInfo();
    InstanceStatus status = myInfo.getStatus();
    // Reject every request until the server is UP
    if (status != InstanceStatus.UP && response instanceof HttpServletResponse) {
        throw new RuntimeException("Eureka Server status is not UP, do not provide service");
    }
    chain.doFilter(request, response);
}

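The custom filter refuses to serve requests while the Eureka Server's status is not UP. The status changes to UP only after the server has started and finished syncing data, which prevents Eureka Clients from fetching incomplete data.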


// Expose the filter as a bean
@Bean
public CustomerStatusFilter statusFilter() {
    return new CustomerStatusFilter();
}

// Apply the filter to every request path
@Bean
public FilterRegistrationBean someFilterRegistration() {
    FilterRegistrationBean registration = new FilterRegistrationBean();
    registration.setFilter(statusFilter());
    registration.addUrlPatterns("/*");
    return registration;
}
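One possible variation, not part of the original setup, is to answer with an explicit HTTP 503 instead of throwing, so that callers see "service unavailable" rather than a generic server error. The decision logic can be sketched without a servlet container. `StatusGateSketch` is a hypothetical model with a reduced status enum, and in a real filter the 503 branch would correspond to calling `HttpServletResponse.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE)` and returning without invoking the chain.

```java
class StatusGateSketch {

    // Reduced status model; the real InstanceStatus has more values.
    enum Status { STARTING, UP }

    private volatile Status status = Status.STARTING;

    // Called once startup and registry sync have finished.
    void markUp() {
        status = Status.UP;
    }

    // 200 means "pass the request down the chain",
    // 503 means "answer Service Unavailable and stop".
    int responseCode() {
        return status == Status.UP ? 200 : 503;
    }

    public static void main(String[] args) {
        StatusGateSketch gate = new StatusGateSketch();
        System.out.println(gate.responseCode()); // prints 503
        gate.markUp();
        System.out.println(gate.responseCode()); // prints 200
    }
}
```

Whether this is preferable depends on how the Eureka Clients in use react to a 503 compared with an error page, which is worth verifying before adopting it.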

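Drawback: with this filter in place, if the whole cluster is down and the nodes are started one at a time, it takes 150 seconds by default before a node can serve requests normally.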
