Netflix Eureka Source Code Analysis (18): How the Eureka Server Protects Itself During a Network Failure

Copyright notice: this is the blogger's original article and may not be reproduced without the blogger's permission. https://blog.csdn.net/A_Story_Donkey/article/details/82980849

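Suppose there are 20 service instances, and within one minute only 8 of them keep sending heartbeats. Should the eureka server evict the other 12 instances that sent no heartbeat?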

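In this situation it is quite possible that the eureka server itself has a network failure and the services are fine. The network of the machine hosting the eureka server is broken, so the heartbeats from those services cannot get through, and the server never sees a renewal locally.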

For this case the eureka server enters a self-preservation mode, and while it is in that mode it no longer evicts any service instance.

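The registry's evict() method is driven by EvictionTask, a scheduled task that runs every 60s. It checks whether each service instance has failed; if an instance has failed and has sent no heartbeat for too long, it is evicted.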

 

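1. Inside evict(), the first check is whether the number of heartbeats in the last minute is below the expected number of heartbeats per minute. If it is, no service instance is cleaned up at all.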

public abstract class AbstractInstanceRegistry implements InstanceRegistry {

     public void evict(long additionalLeaseMs) {
        logger.debug("Running the evict task");

        // Self-preservation: return immediately without evicting any service instance
        if (!isLeaseExpirationEnabled()) {
            logger.debug("DS: lease expiration is currently disabled.");
            return;
        }

        // We collect first all expired items, to evict them in random order. For large eviction sets,
        // if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
        // the impact should be evenly distributed across all applications.
        List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
        for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
            Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
            if (leaseMap != null) {
                for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                    Lease<InstanceInfo> lease = leaseEntry.getValue();
                    // Call Lease.isExpired() to decide whether this instance's lease has expired, i.e. the instance is considered failed
                    if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                        expiredLeases.add(lease);
                    }
                }
            }
        }

        // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
        // triggering self-preservation. Without that we would wipe out full registry.
        int registrySize = (int) getLocalRegistrySize();
        int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
        int evictionLimit = registrySize - registrySizeThreshold;

        int toEvict = Math.min(expiredLeases.size(), evictionLimit);
        if (toEvict > 0) {
            logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);

            Random random = new Random(System.currentTimeMillis());
            for (int i = 0; i < toEvict; i++) {
                // Pick a random item (Knuth shuffle algorithm)
                int next = i + random.nextInt(expiredLeases.size() - i);
                Collections.swap(expiredLeases, i, next);
                Lease<InstanceInfo> lease = expiredLeases.get(i);

                String appName = lease.getHolder().getAppName();
                String id = lease.getHolder().getId();
                EXPIRED.increment();
                logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
                internalCancel(appName, id, false);
            }
        }
    }
}
@Singleton
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {

    @Override
    public boolean isLeaseExpirationEnabled() {
        if (!isSelfPreservationModeEnabled()) {
            // The self preservation mode is disabled, hence allowing the instances to expire.
            return true;
        }
        // numberOfRenewsPerMinThreshold: the minimum number of heartbeats expected per minute
        // getNumOfRenewsInLastMin(): the total number of heartbeats received in the last minute
        return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
    }
}
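To make the eviction cap in evict() concrete, here is a minimal sketch of just that arithmetic, assuming a renewal percent threshold of 0.85 (the value this article uses further down). The class and method names are made up for illustration and are not part of Eureka.

```java
// Illustrative sketch only: EvictionLimitSketch is a hypothetical class, not Eureka code.
final class EvictionLimitSketch {

    /** Mirrors the arithmetic in evict(): the most leases a single pass may remove. */
    static int evictionLimit(int registrySize, double renewalPercentThreshold) {
        int registrySizeThreshold = (int) (registrySize * renewalPercentThreshold);
        return registrySize - registrySizeThreshold;
    }

    /** The number of leases actually evicted is capped by the limit. */
    static int toEvict(int expiredLeases, int registrySize, double renewalPercentThreshold) {
        return Math.min(expiredLeases, evictionLimit(registrySize, renewalPercentThreshold));
    }

    public static void main(String[] args) {
        // 20 registered instances, 12 expired leases, threshold assumed to be 0.85
        System.out.println(evictionLimit(20, 0.85)); // 20 - 17 = 3
        System.out.println(toEvict(12, 20, 0.85));   // min(12, 3) = 3
    }
}
```

So even when self-preservation does not kick in, a single pass removes at most 15% of the registry, and the random order spreads those removals across applications.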

2. How is the expected number of heartbeats per minute calculated?

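(1) The value is initialized once when the eureka server starts

EurekaBootStrap is the class that performs the startup initialization

registry.openForTraffic(applicationInfoManager, registryCount);

This call initializes numberOfRenewsPerMinThreshold, the minimum number of heartbeats expected per minute. Before that, syncUp() is called to copy the registry from a neighboring eureka server node; any service instance that is not yet registered locally is registered in the local registry.

syncUp() returns registryCount, the number of service instances pulled from the other eureka server, and this number is used as the initial instance count of the local eureka server.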

protected void initEurekaServerContext() throws Exception {
        // Earlier code omitted...
        // Copy the registry from a neighboring eureka server node
        int registryCount = registry.syncUp();
        registry.openForTraffic(applicationInfoManager, registryCount);
}

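The threshold is instance count * 2 * 0.85. Surprisingly, the factor 2 in the expected heartbeat count is hard-coded; the 0.85 comes from serverConfig.getRenewalPercentThreshold().

Suppose you have 20 service instances, each sending a heartbeat every 30 seconds. One instance should therefore send 2 heartbeats per minute, so the expected number of heartbeats per minute is 20 * 2 = 40.

Instance count * 2 * 0.85 = 20 * 2 * 0.85 = 34, so with 20 service instances at least 34 heartbeats per minute are expected. This is the minimum number of heartbeats per minute, derived from the current number of service instances.

A problem the hard-coding can cause: the default heartbeat is once every 30 seconds, but what if it is changed to once every 10 seconds? Each instance then sends 6 heartbeats per minute, and count * 2 is wrong.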
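The arithmetic above, and the effect of the hard-coded factor, can be sketched as follows. hardCodedThreshold follows the quoted openForTraffic() code; intervalAwareThreshold is a hypothetical alternative written only for comparison and is not something this article found in Eureka. Both assume a renewal percent threshold of 0.85.

```java
// Illustrative sketch only: RenewThresholdSketch and its methods are hypothetical names.
final class RenewThresholdSketch {

    /** Same arithmetic as the quoted code: 2 renewals per instance per minute are assumed. */
    static int hardCodedThreshold(int instanceCount, double renewalPercentThreshold) {
        int expectedNumberOfRenewsPerMin = instanceCount * 2;
        return (int) (expectedNumberOfRenewsPerMin * renewalPercentThreshold);
    }

    /** Hypothetical variant: derive the renewals per minute from the heartbeat interval. */
    static int intervalAwareThreshold(int instanceCount, int renewalIntervalSeconds,
                                      double renewalPercentThreshold) {
        double renewsPerInstancePerMin = 60.0 / renewalIntervalSeconds;
        return (int) (instanceCount * renewsPerInstancePerMin * renewalPercentThreshold);
    }

    public static void main(String[] args) {
        System.out.println(hardCodedThreshold(20, 0.85));         // 34
        System.out.println(intervalAwareThreshold(20, 30, 0.85)); // 34, same as the hard-coded value
        System.out.println(intervalAwareThreshold(20, 10, 0.85)); // 102
    }
}
```

With a 10-second heartbeat, 20 healthy instances send 120 heartbeats per minute, yet the hard-coded threshold stays at 34. The server would only enter self-preservation after losing more than 70% of the heartbeats, instead of the intended 15%.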

@Singleton
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {

    @Override
    public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
        // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
        this.expectedNumberOfRenewsPerMin = count * 2;
        // Initialize numberOfRenewsPerMinThreshold: instance count * 2 * 0.85
        this.numberOfRenewsPerMinThreshold =
                (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
        logger.info("Got " + count + " instances from neighboring DS node");
        logger.info("Renew threshold is: " + numberOfRenewsPerMinThreshold);
        this.startupTime = System.currentTimeMillis();
        if (count > 0) {
            this.peerInstancesTransferEmptyOnStartup = false;
        }
        DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
        boolean isAws = Name.Amazon == selfName;
        if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
            logger.info("Priming AWS connections for all replicas..");
            primeAwsReplicas(applicationInfoManager);
        }
        logger.info("Changing status to UP");
        applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
        super.postInit();
    }
}

 

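(2) Registration, cancellation, and failure

The expected number of heartbeats per minute depends on the number of service instances, which keeps changing as instances register, go offline, or fail. On registration, the expected heartbeats per minute increase by 2. On cancellation, they decrease by 2.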

public abstract class AbstractInstanceRegistry implements InstanceRegistry {
 
    /**
     * Registers a new instance with a given duration.
     *
     * @see com.netflix.eureka.lease.LeaseManager#register(java.lang.Object, int, boolean)
     */
    public void register(InstanceInfo registrant, int leaseDuration, boolean isReplication) {
        // The lease does not exist and hence it is a new registration
        synchronized (lock) {
            if (this.expectedNumberOfRenewsPerMin > 0) {
                // Since the client wants to register it, increase the threshold
                // (1 for 30 seconds, 2 for a minute)
                this.expectedNumberOfRenewsPerMin = this.expectedNumberOfRenewsPerMin + 2;
                this.numberOfRenewsPerMinThreshold =
                        (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
            }
        }
        logger.debug("No previous lease information found; it is new registration");
    }
}
@Singleton
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {

    @Override
    public boolean cancel(final String appName, final String id,
                          final boolean isReplication) {
        if (super.cancel(appName, id, isReplication)) {
            replicateToPeers(Action.Cancel, appName, id, null, null, isReplication);
            synchronized (lock) {
                if (this.expectedNumberOfRenewsPerMin > 0) {
                    // Since the client wants to cancel it, reduce the threshold (1 for 30 seconds, 2 for a minute)
                    this.expectedNumberOfRenewsPerMin = this.expectedNumberOfRenewsPerMin - 2;
                    this.numberOfRenewsPerMinThreshold =
                            (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
                }
            }
            return true;
        }
        return false;
    }
}

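Note: when an instance fails and is evicted, I could not find any code that updates the expected heartbeat count. This looks like a bug. If many service instances fail and are evicted, the expected heartbeats per minute do not decrease, while the actual number of instances, and with it the actual number of heartbeats, does. If a larger number of failed instances is evicted automatically, the eureka server can quickly end up in self-preservation mode.

Once the actual heartbeat count no longer exceeds the expected threshold, no more service instances are evicted.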
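The drift described in this note can be reproduced with a small model. This is only a sketch under stated assumptions: every live instance heartbeats every 30 seconds, the renewal percent threshold is 0.85, and eviction leaves the expected count untouched as observed above. ThresholdDriftSketch is a hypothetical class, not Eureka code.

```java
// Illustrative model only: ThresholdDriftSketch is a hypothetical class.
final class ThresholdDriftSketch {
    private static final double RENEWAL_PERCENT_THRESHOLD = 0.85; // assumed value

    private int liveInstances;
    private int expectedNumberOfRenewsPerMin;
    private int numberOfRenewsPerMinThreshold;

    ThresholdDriftSketch(int instanceCount) {
        this.liveInstances = instanceCount;
        this.expectedNumberOfRenewsPerMin = instanceCount * 2;
        recomputeThreshold();
    }

    private void recomputeThreshold() {
        numberOfRenewsPerMinThreshold =
                (int) (expectedNumberOfRenewsPerMin * RENEWAL_PERCENT_THRESHOLD);
    }

    /** Graceful shutdown: like the quoted cancel(), lowers the expectation by 2. */
    void cancel() {
        liveInstances--;
        expectedNumberOfRenewsPerMin -= 2;
        recomputeThreshold();
    }

    /** Eviction as described in the note: the instance is gone, the expectation stays. */
    void evict() {
        liveInstances--;
    }

    /** Assumes each live instance sends 2 heartbeats per minute. */
    int renewsInLastMin() {
        return liveInstances * 2;
    }

    boolean isLeaseExpirationEnabled() {
        return numberOfRenewsPerMinThreshold > 0
                && renewsInLastMin() > numberOfRenewsPerMinThreshold;
    }

    public static void main(String[] args) {
        ThresholdDriftSketch evicted = new ThresholdDriftSketch(20);
        for (int i = 0; i < 3; i++) evicted.evict();
        // 17 healthy instances send 34 heartbeats, but the threshold is still 34
        System.out.println(evicted.isLeaseExpirationEnabled()); // false

        ThresholdDriftSketch cancelled = new ThresholdDriftSketch(20);
        for (int i = 0; i < 3; i++) cancelled.cancel();
        // 34 heartbeats against a threshold lowered to (int) (34 * 0.85) = 28
        System.out.println(cancelled.isLeaseExpirationEnabled()); // true
    }
}
```

In this model, three evictions out of 20 instances are enough to stop all further evictions even though the remaining 17 instances are healthy, whereas three graceful cancellations leave eviction enabled.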

 

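(3) Periodic update

The registry runs a scheduled task, every 15 minutes by default, that counts the service instances. If the number of instances pulled from the other eureka servers is larger than the current number, the values are recalculated. Its main purpose is to stay in sync with the other eureka servers.

It is rarely triggered.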

 

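3. How is the actual number of heartbeats in the last minute calculated?

Focus on the big picture first. We came across MeasuredRate earlier while reading the source, and it was hard to understand at that point because much of the code only makes sense as part of one mechanism. Every incoming heartbeat updates this MeasuredRate, which counts the actual number of heartbeats in each minute.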

public abstract class AbstractInstanceRegistry implements InstanceRegistry {

    private final MeasuredRate renewsLastMin;

    protected AbstractInstanceRegistry(EurekaServerConfig serverConfig, EurekaClientConfig clientConfig, ServerCodecs serverCodecs) {
        // Create the MeasuredRate that counts the actual heartbeats of the last minute; the sample interval is 60s
        this.renewsLastMin = new MeasuredRate(1000 * 60 * 1);
    }

    public boolean renew(String appName, String id, boolean isReplication) {
        // Lease lookup omitted...
        // Each incoming heartbeat increments currentBucket by one
        renewsLastMin.increment();
        return true;
    }

    protected void postInit() {
        // During startup initialization the eureka server starts the MeasuredRate timer (delayed, repeating execution)
        renewsLastMin.start();
        // Remaining logic omitted...
    }
}

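Take a close look at the MeasuredRate class. The interesting technique here: how do you keep an in-memory count per minute, in this case the number of heartbeats in each minute?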

public class MeasuredRate {
    private static final Logger logger = LoggerFactory.getLogger(MeasuredRate.class);
    // Total number of heartbeats in the last minute
    private final AtomicLong lastBucket = new AtomicLong(0);
    // Incremented on every heartbeat and reset to zero every 60 seconds
    private final AtomicLong currentBucket = new AtomicLong(0);

    private final long sampleInterval;
    private final Timer timer;

    private volatile boolean isActive;

    /**
     * @param sampleInterval in milliseconds
     */
    public MeasuredRate(long sampleInterval) {
        this.sampleInterval = sampleInterval;
        this.timer = new Timer("Eureka-MeasureRateTimer", true);
        this.isActive = false;
    }

    public synchronized void start() {
        if (!isActive) {
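            /** public void schedule(TimerTask task, long delay, long period)
             * Runs the task for the first time after delay milliseconds, then repeats it every period milliseconds.
             * delay: milliseconds to wait before the first execution
             * period: interval between successive executions
             */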
            timer.schedule(new TimerTask() {

                @Override
                public void run() {
                    try {
                        // Zero out the current bucket.
                        // 1. Copy the value of currentBucket into lastBucket
                        // 2. Reset currentBucket to zero
                        lastBucket.set(currentBucket.getAndSet(0));
                    } catch (Throwable e) {
                        logger.error("Cannot reset the Measured Rate", e);
                    }
                }
            }, sampleInterval, sampleInterval);

            isActive = true;
        }
    }

    public synchronized void stop() {
        if (isActive) {
            timer.cancel();
            isActive = false;
        }
    }

    /**
     * Returns the count in the last sample interval.
     * Here this is the total number of heartbeats in the last minute.
     */
    public long getCount() {
        return lastBucket.get();
    }

    /**
     * Increments the count in the current sample interval.
     * Every incoming heartbeat increments currentBucket by one.
     */
    public void increment() {
        currentBucket.incrementAndGet();
    }
}
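The two-bucket idea is easier to follow without the Timer. Below is a minimal sketch in which the bucket swap is triggered by hand through a hypothetical rotate() method, standing in for the TimerTask body above. TwoBucketCounterSketch is an illustrative name, not part of Eureka.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only: the rotation is manual so the behaviour is deterministic.
final class TwoBucketCounterSketch {
    // Count of the last completed interval
    private final AtomicLong lastBucket = new AtomicLong(0);
    // Count of the interval that is still running
    private final AtomicLong currentBucket = new AtomicLong(0);

    /** Called once per event, e.g. once per heartbeat. */
    void increment() {
        currentBucket.incrementAndGet();
    }

    /** Stand-in for the timer task: publish the current count and start a new interval. */
    void rotate() {
        lastBucket.set(currentBucket.getAndSet(0));
    }

    /** Returns the count of the last completed interval, never the running one. */
    long getCount() {
        return lastBucket.get();
    }

    public static void main(String[] args) {
        TwoBucketCounterSketch rate = new TwoBucketCounterSketch();
        rate.increment();
        rate.increment();
        rate.increment();
        System.out.println(rate.getCount()); // 0, the first interval has not finished yet
        rate.rotate();
        System.out.println(rate.getCount()); // 3
        rate.increment();
        System.out.println(rate.getCount()); // still 3, the new event is in the running interval
        rate.rotate();
        System.out.println(rate.getCount()); // 1
    }
}
```

One consequence of this design is that readers always see a complete interval, at the cost of the value lagging by up to one sample interval. Right after startup it also reads 0 until the first rotation has happened.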

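4. How self-preservation is triggered

If the actual number of heartbeats in the last minute is lower than the expected number of heartbeats per minute, self-preservation is triggered and no service instance may be evicted. The eureka server assumes that its own network has failed and that a large number of service instances cannot get their heartbeats through.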

@Singleton
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {

    @Override
    public boolean isLeaseExpirationEnabled() {
        if (!isSelfPreservationModeEnabled()) {
            // The self preservation mode is disabled, hence allowing the instances to expire.
            return true;
        }
        // numberOfRenewsPerMinThreshold: the minimum number of heartbeats expected per minute
        // getNumOfRenewsInLastMin(): the total number of heartbeats received in the last minute
        return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
    }
}
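Applied to the opening scenario of this article, the check can be sketched as a pure function. This assumes a 30-second heartbeat and a threshold of 34 for 20 instances, as computed in section 2. LeaseExpirationDecisionSketch is a hypothetical name, and the parameters replace the fields and getters of the real class.

```java
// Illustrative sketch only: same boolean logic as the quoted method, with inputs passed in.
final class LeaseExpirationDecisionSketch {

    static boolean isLeaseExpirationEnabled(boolean selfPreservationModeEnabled,
                                            long renewsInLastMin,
                                            int renewsPerMinThreshold) {
        if (!selfPreservationModeEnabled) {
            // Self-preservation is switched off, so leases may always expire
            return true;
        }
        return renewsPerMinThreshold > 0 && renewsInLastMin > renewsPerMinThreshold;
    }

    public static void main(String[] args) {
        // 20 instances, only 8 still heartbeating: 8 * 2 = 16 heartbeats per minute
        System.out.println(isLeaseExpirationEnabled(true, 16, 34));  // false, nothing is evicted
        // All 20 instances healthy: 40 heartbeats per minute
        System.out.println(isLeaseExpirationEnabled(true, 40, 34));  // true
        // Exactly at the threshold: the comparison is strict
        System.out.println(isLeaseExpirationEnabled(true, 34, 34));  // false
        // Self-preservation disabled: expiration is always allowed
        System.out.println(isLeaseExpirationEnabled(false, 16, 34)); // true
    }
}
```

Note that a threshold of 0 also returns false while self-preservation is enabled, so a server whose threshold is still 0 does not evict anything.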

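5. You need to understand eureka's self-preservation mechanism at the source level

In production this is the part that bites hardest: you find that some service instances have gone offline, yet the eureka console never removes them because self-preservation has kicked in. In a production environment you can choose to turn self-preservation off if that is an option for you. If the eureka server cannot receive heartbeats, the service instances cannot pull the registry from it either, and each instance simply works from its locally cached registry. Leaving self-preservation on is also fine. Having read the source for failure eviction and self-preservation, you can go down to the source level to find out what is happening when something goes wrong in production.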

 

Summary: flow chart of the eureka self-preservation mechanism


Reposted from blog.csdn.net/A_Story_Donkey/article/details/82980849