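When a eureka client shuts down, our own code has to call DiscoveryClient's shutdown() method, which sends a request to the eureka server to take that service instance offline. Often, though, we did not stop the service deliberately; it crashed on its own, so shutdown() is never called and no request is sent to take the instance offline.
For this case eureka has its own automatic failure detection and service instance eviction mechanism.
Eureka relies on heartbeats to detect failures: a service that has died stops sending heartbeats. If the server receives no heartbeat from a service instance for a period of time, it considers the instance down and evicts it.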
(1) Entry point of the automatic failure check: in EurekaBootStrap, when the eureka server starts up and initializes, it calls registry.openForTraffic(applicationInfoManager, registryCount).
registry.openForTraffic(applicationInfoManager, registryCount);
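(2) The last line of PeerAwareInstanceRegistry.openForTraffic() is an easy-to-miss call to postInit(). It schedules a background timer task, EvictionTask, which runs every 60s by default (the interval comes from serverConfig.getEvictionIntervalTimerInMs()).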
@Singleton
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {
    @Override
    public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
        // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
        this.expectedNumberOfRenewsPerMin = count * 2;
        this.numberOfRenewsPerMinThreshold =
                (int) (this.expectedNumberOfRenewsPerMin * serverConfig.getRenewalPercentThreshold());
        logger.info("Got " + count + " instances from neighboring DS node");
        logger.info("Renew threshold is: " + numberOfRenewsPerMinThreshold);
        this.startupTime = System.currentTimeMillis();
        if (count > 0) {
            this.peerInstancesTransferEmptyOnStartup = false;
        }
        DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
        boolean isAws = Name.Amazon == selfName;
        if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
            logger.info("Priming AWS connections for all replicas..");
            primeAwsReplicas(applicationInfoManager);
        }
        logger.info("Changing status to UP");
        applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
        super.postInit();
    }
}
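(3) EvictionTask first computes a compensation time, compensationTimeMs. It covers the case where the gap between two runs of EvictionTask is longer than the configured 60s, for example because of a GC pause or clock skew.
Example: an instance's last heartbeat arrives at 19:55:00. With an effective lease of 3 minutes (see step (5)), it would normally be considered expired at 19:58:00.
Now suppose this run of EvictionTask is itself late and starts 92s after the time it should have run, so the compensation time is 92s. The instance is then treated as expired only if no heartbeat arrived in the 3 minutes after 19:55:00 and none arrived during the 92s delay either, that is, nothing was received up to 19:59:32.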
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    protected void postInit() {
        renewsLastMin.start();
        if (evictionTaskRef.get() != null) {
            evictionTaskRef.get().cancel();
        }
        evictionTaskRef.set(new EvictionTask());
        evictionTimer.schedule(evictionTaskRef.get(),
                serverConfig.getEvictionIntervalTimerInMs(),
                serverConfig.getEvictionIntervalTimerInMs());
    }
    // EvictionTask is an inner class
    class EvictionTask extends TimerTask {
        private final AtomicLong lastExecutionNanosRef = new AtomicLong(0l);
        @Override
        public void run() {
            try {
                // Compute the compensation time, compensationTimeMs
                long compensationTimeMs = getCompensationTimeMs();
                logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
                evict(compensationTimeMs);
            } catch (Throwable e) {
                logger.error("Could not run the evict task", e);
            }
        }
        /**
         * compute a compensation time defined as the actual time this task was executed since the prev iteration,
         * vs the configured amount of time for execution. This is useful for cases where changes in time (due to
         * clock skew or gc for example) causes the actual eviction task to execute later than the desired time
         * according to the configured cycle.
         */
        long getCompensationTimeMs() {
            long currNanos = getCurrentTimeNano();
            long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
            if (lastNanos == 0l) {
                return 0l;
            }
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
            long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
            return compensationTime <= 0l ? 0l : compensationTime;
        }
        long getCurrentTimeNano() { // for testing
            return System.nanoTime();
        }
    }
}
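The arithmetic behind the 92s example in step (3) can be checked with a small standalone sketch. It is not the Eureka source: the class name CompensationTimeSketch, the method compensationMs, and the timestamps are made up for illustration, and the previous and current run times are passed in explicitly instead of being read from System.nanoTime().

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch of the compensation-time arithmetic; names are hypothetical. */
final class CompensationTimeSketch {

    /**
     * Returns how much later than the configured interval the current run started,
     * in milliseconds, and never a negative value. A lastRunNanos of 0 means
     * "no previous run", mirroring the sentinel used in getCompensationTimeMs().
     */
    static long compensationMs(long lastRunNanos, long currentRunNanos, long intervalMs) {
        if (lastRunNanos == 0L) {
            return 0L; // first run: nothing to compare against
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currentRunNanos - lastRunNanos);
        return Math.max(0L, elapsedMs - intervalMs);
    }

    public static void main(String[] args) {
        long intervalMs = 60_000L;                              // assumed 60s eviction interval
        long lastRun = TimeUnit.SECONDS.toNanos(1_000);         // made-up previous run time
        long thisRun = TimeUnit.SECONDS.toNanos(1_000 + 152);   // 60s planned + 92s late
        System.out.println(compensationMs(lastRun, thisRun, intervalMs)); // prints 92000
    }
}
```

A run that starts on time or early yields 0, so the compensation can only lengthen a lease, never shorten it.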
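(4) evict() iterates over every service instance in the registry and calls Lease.isExpired() on each one to decide whether its lease has expired, that is, whether the instance is considered failed. Expired leases are collected into a list.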
public void evict(long additionalLeaseMs) {
    logger.debug("Running the evict task");
    if (!isLeaseExpirationEnabled()) {
        logger.debug("DS: lease expiration is currently disabled.");
        return;
    }
    // We collect first all expired items, to evict them in random order. For large eviction sets,
    // if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
    // the impact should be evenly distributed across all applications.
    List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
    for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
        Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
        if (leaseMap != null) {
            for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                Lease<InstanceInfo> lease = leaseEntry.getValue();
                // Call Lease.isExpired() to check whether this instance's lease has expired (the instance has failed)
                if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                    expiredLeases.add(lease);
                }
            }
        }
    }
    // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
    // triggering self-preservation. Without that we would wipe out full registry.
    int registrySize = (int) getLocalRegistrySize();
    int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
    int evictionLimit = registrySize - registrySizeThreshold;
    int toEvict = Math.min(expiredLeases.size(), evictionLimit);
    if (toEvict > 0) {
        logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);
        Random random = new Random(System.currentTimeMillis());
        for (int i = 0; i < toEvict; i++) {
            // Pick a random item (Knuth shuffle algorithm)
            int next = i + random.nextInt(expiredLeases.size() - i);
            Collections.swap(expiredLeases, i, next);
            Lease<InstanceInfo> lease = expiredLeases.get(i);
            String appName = lease.getHolder().getAppName();
            String id = lease.getHolder().getId();
            EXPIRED.increment();
            logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
            internalCancel(appName, id, false);
        }
    }
}
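The last part of evict(), the per-run limit and the random pick, can be tried in isolation with the sketch below. It assumes plain strings in place of Lease<InstanceInfo> and a renewal percent threshold of 0.85 passed in as a parameter. The class EvictionBatchSketch and its method names are hypothetical and not part of Eureka.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative sketch of batch-limited, randomized eviction; names are hypothetical. */
final class EvictionBatchSketch {

    /** Maximum number of instances one run may evict: registry size minus the protected share. */
    static int evictionLimit(int registrySize, double renewalPercentThreshold) {
        int protectedCount = (int) (registrySize * renewalPercentThreshold);
        return registrySize - protectedCount;
    }

    /** Picks up to 'limit' entries uniformly at random using a partial Knuth (Fisher-Yates) shuffle. */
    static List<String> pickToEvict(List<String> expired, int limit, Random random) {
        List<String> pool = new ArrayList<>(expired); // copy so the caller's list is untouched
        int toEvict = Math.min(pool.size(), limit);
        for (int i = 0; i < toEvict; i++) {
            // Swap a random not-yet-chosen element into position i
            int next = i + random.nextInt(pool.size() - i);
            Collections.swap(pool, i, next);
        }
        return new ArrayList<>(pool.subList(0, toEvict));
    }

    public static void main(String[] args) {
        List<String> expired = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            expired.add("instance-" + i); // made-up instance ids
        }
        int limit = evictionLimit(100, 0.85);
        List<String> picked = pickToEvict(expired, limit, new Random());
        System.out.println("limit=" + limit + ", evicted this run=" + picked.size()
                + ", left for later runs=" + (expired.size() - picked.size()));
    }
}
```

With a registry of 100 instances of which 40 have expired, one run evicts 15 and leaves 25 for later runs. Only the first toEvict positions are shuffled, so the cost is proportional to the batch size and not to the number of expired leases.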
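(5) The expiry check is System.currentTimeMillis() > lastUpdateTimestamp + duration + additionalLeaseMs. Because renew() already sets lastUpdateTimestamp to the renewal time plus duration, duration (90s by default) ends up being counted twice. An instance is therefore considered failed only when 90s * 2 = 180s, or 3 minutes, have passed since its last heartbeat. This is a known eureka bug.
The Javadoc on isExpired() says the same thing: the effective expiry is 2 * duration, and duration defaults to 90s, so an instance is evicted automatically only after 180s without a heartbeat. The Javadoc also states that, because of the possible wide-ranging impact on existing usage, this bug will not be fixed.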
public class Lease<T> {
    /**
     * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
     *
     * Note that due to renew() doing the 'wrong" thing and setting lastUpdateTimestamp to +duration more than
     * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
     * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
     * not be fixed.
     *
     * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
     */
    public boolean isExpired(long additionalLeaseMs) {
        return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
    }
}
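To see where the 2 * duration window comes from, here is a minimal model of the lease. It assumes, as the Javadoc above describes, that renew() stores the renewal time plus duration. It leaves out evictionTimestamp and takes the current time as a parameter so the result is deterministic. The class LeaseExpirySketch is hypothetical and not the real Lease class.

```java
/** Illustrative model of the doubled lease duration; not the Eureka Lease class. */
final class LeaseExpirySketch {
    private final long durationMs;
    private long lastUpdateTimestamp;

    LeaseExpirySketch(long durationMs) {
        this.durationMs = durationMs;
    }

    /** Models the behavior the Javadoc describes: the stored timestamp is already now + duration. */
    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + durationMs;
    }

    /** Same comparison as isExpired(), so duration is added a second time here. */
    boolean isExpired(long nowMs, long additionalLeaseMs) {
        return nowMs > lastUpdateTimestamp + durationMs + additionalLeaseMs;
    }

    public static void main(String[] args) {
        LeaseExpirySketch lease = new LeaseExpirySketch(90_000L); // 90s duration
        lease.renew(0L);                                          // last heartbeat at t = 0
        System.out.println(lease.isExpired(90_001L, 0L));         // prints false
        System.out.println(lease.isExpired(180_001L, 0L));        // prints true
        System.out.println(lease.isExpired(180_001L, 92_000L));   // prints false
    }
}
```

The third call shows the compensation time from step (3) pushing the expiry out by a further 92s.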
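(6) Failed instances are not all evicted in one go. Each run evicts at most 15% of the instances in the registry (registrySize minus registrySize * getRenewalPercentThreshold(), which defaults to 0.85). Any failed instances left over are evicted when EvictionTask next runs. This is eviction in batches.
(7) Within a run, the instances to evict are picked at random from the expired list, up to that run's limit. This is random eviction.
(8) Evicting an instance simply calls the same method used when an instance goes offline, internalCancel(). It removes the instance from the registry, adds it to recentlyChangedQueue, and invalidates the cache.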
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    public void evict(long additionalLeaseMs) {
        // Earlier logic omitted ...
        // Finally, call internalCancel() to evict the service instance
        internalCancel(appName, id, false);
    }
    protected boolean internalCancel(String appName, String id, boolean isReplication) {
        try {
            read.lock();
            CANCEL.increment(isReplication);
            Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
            Lease<InstanceInfo> leaseToCancel = null;
            if (gMap != null) {
                leaseToCancel = gMap.remove(id);
            }
            synchronized (recentCanceledQueue) {
                recentCanceledQueue.add(new Pair<Long, String>(System.currentTimeMillis(), appName + "(" + id + ")"));
            }
            InstanceStatus instanceStatus = overriddenInstanceStatusMap.remove(id);
            if (instanceStatus != null) {
                logger.debug("Removed instance id {} from the overridden map which has value {}", id, instanceStatus.name());
            }
            if (leaseToCancel == null) {
                CANCEL_NOT_FOUND.increment(isReplication);
                logger.warn("DS: Registry: cancel failed because Lease is not registered for: {}/{}", appName, id);
                return false;
            } else {
                leaseToCancel.cancel();
                InstanceInfo instanceInfo = leaseToCancel.getHolder();
                String vip = null;
                String svip = null;
                if (instanceInfo != null) {
                    // Mark the action for this service instance as DELETED
                    instanceInfo.setActionType(ActionType.DELETED);
                    // Add the service instance to the recently changed queue
                    recentlyChangedQueue.add(new RecentlyChangedItem(leaseToCancel));
                    instanceInfo.setLastUpdatedTimestamp();
                    vip = instanceInfo.getVIPAddress();
                    svip = instanceInfo.getSecureVipAddress();
                }
                invalidateCache(appName, vip, svip);
                logger.info("Cancelled instance {}/{} (replication={})", appName, id, isReplication);
                return true;
            }
        } finally {
            read.unlock();
        }
    }
}
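Summary: by default a service instance sends a heartbeat every 30s, which in effect updates the lastUpdateTimestamp of its lease in the registry. When the condition evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs) is true, the lease has expired and the instance is considered down. EvictionTask then calls internalCancel(), the same method used when an instance goes offline, which removes the instance from the registry map, adds it to recentlyChangedQueue, and invalidates the cache.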