9. Source code analysis --- how does SOFARPC implement fault removal?

SOFARPC source code analysis series:

1. Source code analysis --- the SOFARPC extension mechanism (SPI)

2. Source code analysis --- SOFARPC client service reference

3. Source code analysis --- SOFARPC client service invocation

4. Source code analysis --- SOFARPC server-side exposure

5. Source code analysis --- SOFARPC service call

6. Source code analysis --- how does SOFARPC implement load balancing, and how does it compare with Dubbo?

7. Source code analysis --- how does SOFARPC connection management implement heartbeats?

8. Source code analysis --- design patterns seen in the SOFARPC EventBus


In article 7 (Source code analysis --- how does SOFARPC connection management implement heartbeats?) I described how the client maintains long-lived connections to the server. However, there is a case where the long-lived connection between Consumer and Provider is still alive and the registry has not pushed a removal, yet the server is in suspended animation for some reason, such as a long Full GC or a hardware failure (to avoid repetition, this is referred to below as the machine being "hung").

At this point the Consumer should stop calling the Provider, or call it less often, which can be controlled through weights. SOFARPC 5.3.0 and later support single-machine fault removal for RPC. SOFARPC lowers the weight of a service to reduce calls to the abnormal machine, so more traffic reaches healthy machines and service availability improves.

Next, let's look at how weight-based service degradation is implemented. Before reading this article, I hope readers have read the following articles:

  1. 8. Source code analysis --- design patterns seen in the SOFARPC EventBus, because SOFARPC triggers weight degradation through EventBus.
  2. 3. Source code analysis --- SOFARPC client service invocation, which describes how the client calls the server. When the client calls the server it posts an event to the bus, which delivers it to the subscribers.
  3. 6. Source code analysis --- how does SOFARPC implement load balancing, and how does it compare with Dubbo? That article describes how SOFARPC load balancing is implemented and how weights control traffic.

If you understand the above, we can start on the content below.

Example

We first give a server and a client example to make debugging easier.

The official documentation is here: automatic fault removal

Server

public static void main(String[] args) {
    ServerConfig serverConfig = new ServerConfig()
            .setProtocol("bolt") // set the protocol, bolt by default
            .setPort(12200) // set the port, 12200 by default
            .setDaemon(false); // non-daemon thread

    ProviderConfig<HelloService> providerConfig = new ProviderConfig<HelloService>()
        .setInterfaceId(HelloService.class.getName()) // specify the interface
        .setRef(new HelloServiceImpl()) // specify the implementation
        .setServer(serverConfig); // specify the server

    providerConfig.export(); // publish the service
}

Client

public static void main(String[] args) {

        FaultToleranceConfig faultToleranceConfig = new FaultToleranceConfig();
        faultToleranceConfig.setRegulationEffective(true);
        faultToleranceConfig.setDegradeEffective(true);
        faultToleranceConfig.setTimeWindow(10);
        faultToleranceConfig.setWeightDegradeRate(0.5);

        FaultToleranceConfigManager.putAppConfig("appName", faultToleranceConfig);

        ApplicationConfig applicationConfig = new ApplicationConfig();
        applicationConfig.setAppName("appName");

        ConsumerConfig<HelloService> consumerConfig = new ConsumerConfig<HelloService>()
                .setInterfaceId(HelloService.class.getName()) // specify the interface
                .setProtocol("bolt") // specify the protocol
                .setDirectUrl("bolt://127.0.0.1:12200") // specify the direct-connect address
                .setConnectTimeout(2000 * 1000)
                .setApplication(applicationConfig);

        HelloService helloService = consumerConfig.refer();

        while (true) {
            try {
                LOGGER.info(helloService.sayHello("world"));
            } catch (Exception e) {
                e.printStackTrace();
            }

            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

        }
    }

Registering the automatic fault removal configuration

In our client example, the FaultToleranceConfig is registered through FaultToleranceConfigManager.

FaultToleranceConfig faultToleranceConfig = new FaultToleranceConfig();
faultToleranceConfig.setRegulationEffective(true);
faultToleranceConfig.setDegradeEffective(true);
faultToleranceConfig.setTimeWindow(10);
faultToleranceConfig.setWeightDegradeRate(0.5);

FaultToleranceConfigManager.putAppConfig("appName", faultToleranceConfig);

Let's first look inside FaultToleranceConfigManager to see what putAppConfig does.

FaultToleranceConfigManager#putAppConfig

/**
 * All fault-tolerance config of apps
 */
private static final ConcurrentMap<String, FaultToleranceConfig> APP_CONFIGS = new ConcurrentHashMap<String, FaultToleranceConfig>();

public static void putAppConfig(String appName, FaultToleranceConfig value) {
    if (appName == null) {
        if (LOGGER.isWarnEnabled()) {
            LOGGER.warn("App name is null when put fault-tolerance config");
        }
        return;
    }
    if (value != null) {
        APP_CONFIGS.put(appName, value);
        if (LOGGER.isInfoEnabled(appName)) {
            LOGGER.infoWithApp(appName, "Get a new resource, value[" + value + "]");
        }
    } else {
        APP_CONFIGS.remove(appName);
        if (LOGGER.isInfoEnabled(appName)) {
            LOGGER.infoWithApp(appName, "Remove a resource, key[" + appName + "]");
        }
    }
    calcEnable();
}

static void calcEnable() {
    for (FaultToleranceConfig config : APP_CONFIGS.values()) {
        if (config.isRegulationEffective()) {
            aftEnable = true;
            return;
        }
    }
    aftEnable = false;
}

The method above is clear:

  1. Check appName; if it is null, return directly
  2. Put the config we defined into the APP_CONFIGS map (or remove the entry if the config is null)
  3. Call calcEnable, which sets aftEnable to true if any registered config has regulationEffective enabled, and to false otherwise

This completes the configuration of fault removal.
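The three steps above can be sketched in isolation. This is a minimal illustration, not the SOFARPC class: ToleranceRegistrySketch and its methods are hypothetical names, and the per-app config is reduced to a single Boolean standing in for regulationEffective.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for FaultToleranceConfigManager; names are assumptions.
final class ToleranceRegistrySketch {
    // One flag per application, standing in for FaultToleranceConfig#isRegulationEffective.
    private static final ConcurrentMap<String, Boolean> APP_CONFIGS = new ConcurrentHashMap<>();
    private static volatile boolean enabled = false;

    // Store the flag (or remove it when value is null), then recompute the global switch.
    static void put(String appName, Boolean regulationEffective) {
        if (appName == null) {
            return; // step 1: nothing to do without an app name
        }
        if (regulationEffective != null) {
            APP_CONFIGS.put(appName, regulationEffective); // step 2
        } else {
            APP_CONFIGS.remove(appName);
        }
        recalc(); // step 3
    }

    // The global switch is on as soon as one application has regulation turned on.
    private static void recalc() {
        enabled = APP_CONFIGS.values().stream().anyMatch(Boolean::booleanValue);
    }

    static boolean isEnabled() {
        return enabled;
    }

    public static void main(String[] args) {
        put("appA", false);
        System.out.println(isEnabled()); // false
        put("appB", true);
        System.out.println(isEnabled()); // true
        put("appB", null);
        System.out.println(isEnabled()); // false
    }
}
```

The point of the global flag is that the subscriber can skip all bookkeeping with one cheap check when no application has asked for regulation.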

Registering the fault removal module

In 8. Source code analysis --- design patterns seen in the SOFARPC EventBus, we explained that when ConsumerConfig is initialized, the static code block of its parent class runs, which in turn triggers the static code block of RpcRuntimeContext.

RpcRuntimeContext

static {
    if (LOGGER.isInfoEnabled()) {
        LOGGER.info("Welcome! Loading SOFA RPC Framework : {}, PID is:{}", Version.BUILD_VERSION, PID);
    }
    put(RpcConstants.CONFIG_KEY_RPC_VERSION, Version.RPC_VERSION);
    // initialize some context
    initContext();
    // initialize the other modules
    ModuleFactory.installModules();
    // add the JVM shutdown hook
    if (RpcConfigs.getOrDefaultValue(RpcOptions.JVM_SHUTDOWN_HOOK, true)) {
        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            @Override
            public void run() {
                if (LOGGER.isWarnEnabled()) {
                    LOGGER.warn("SOFA RPC Framework catch JVM shutdown event, Run shutdown hook now.");
                }
                destroy(false);
            }
        }, "SOFA-RPC-ShutdownHook"));
    }
}

This static block initializes the other modules through ModuleFactory.

ModuleFactory#installModules

public static void installModules() {
    ExtensionLoader<Module> loader = ExtensionLoaderFactory.getExtensionLoader(Module.class);
    //moduleLoadList defaults to *
    String moduleLoadList = RpcConfigs.getStringValue(RpcOptions.MODULE_LOAD_LIST);
    for (Map.Entry<String, ExtensionClass<Module>> o : loader.getAllExtensions().entrySet()) {
        String moduleName = o.getKey();
        Module module = o.getValue().getExtInstance();
        // judge need load from rpc option
        if (needLoad(moduleLoadList, moduleName)) {
            // judge need load from implement
            if (module.needLoad()) {
                if (LOGGER.isInfoEnabled()) {
                    LOGGER.info("Install Module: {}", moduleName);
                }
                //install the module
                module.install();
                INSTALLED_MODULES.put(moduleName, module);
            } else {
                if (LOGGER.isInfoEnabled()) {
                    LOGGER.info("The module " + moduleName + " does not need to be loaded.");
                }
            }
        } else {
            if (LOGGER.isInfoEnabled()) {
                LOGGER.info("The module " + moduleName + " is not in the module load list.");
            }
        }
    }
}

Four modules are initialized here via SPI:
Fault-Tolerance
sofaTracer-RESTEasy
Lookout
sofaTracer

Here we only explain the fault-tolerance module.

Next we go into the FaultToleranceModule#install method

private Regulator                regulator = new TimeWindowRegulator();

public void install() {
    subscriber = new FaultToleranceSubscriber();
    //register the subscriber for ClientSyncReceiveEvent and ClientAsyncReceiveEvent on the bus
    EventBus.register(ClientSyncReceiveEvent.class, subscriber);
    EventBus.register(ClientAsyncReceiveEvent.class, subscriber);

    String regulatorAlias = RpcConfigs.getOrDefaultValue(RpcOptions.AFT_REGULATOR, "timeWindow");
    regulator = ExtensionLoaderFactory.getExtensionLoader(Regulator.class).getExtension(regulatorAlias);
    //call the init method of TimeWindowRegulator
    regulator.init();
}

Here the subscriber is a FaultToleranceSubscriber instance, subscribed to two events: ClientSyncReceiveEvent and ClientAsyncReceiveEvent.

Then the init method of the Regulator implementation, TimeWindowRegulator, is called.
TimeWindowRegulator#init


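/**
 * Measure strategy (builds the measure models, measures their data, and picks out healthy and abnormal nodes)
 */
private MeasureStrategy                          measureStrategy;

/**
 * Regulation strategy (decides, based on the measure result, whether to degrade or recover)
 */
private RegulationStrategy                       regulationStrategy;

/**
 * Degrade strategy: e.g. adjust the weight
 */
private DegradeStrategy                          degradeStrategy;

/**
 * Recover strategy: e.g. adjust the weight
 */
private RecoverStrategy                          recoverStrategy;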

/**
 * Listener for invocation stat change.
 */
private final InvocationStatListener             listener           = new TimeWindowRegulatorListener();


public void init() {
    String measureStrategyAlias = RpcConfigs
        .getOrDefaultValue(RpcOptions.AFT_MEASURE_STRATEGY, "serviceHorizontal");
    String regulationStrategyAlias = RpcConfigs.getOrDefaultValue(RpcOptions.AFT_REGULATION_STRATEGY,
        "serviceHorizontal");
    String degradeStrategyAlias = RpcConfigs.getOrDefaultValue(RpcOptions.AFT_DEGRADE_STRATEGY, "weight");
    String recoverStrategyAlias = RpcConfigs.getOrDefaultValue(RpcOptions.AFT_RECOVER_STRATEGY, "weight");
    //ServiceHorizontalMeasureStrategy
    measureStrategy = ExtensionLoaderFactory.getExtensionLoader(MeasureStrategy.class).getExtension(
        measureStrategyAlias);
    //ServiceHorizontalRegulationStrategy
    regulationStrategy = ExtensionLoaderFactory.getExtensionLoader(RegulationStrategy.class).getExtension(
            regulationStrategyAlias);
    //WeightDegradeStrategy
    degradeStrategy = ExtensionLoaderFactory.getExtensionLoader(DegradeStrategy.class).getExtension(
            degradeStrategyAlias);
    //WeightRecoverStrategy
    recoverStrategy = ExtensionLoaderFactory.getExtensionLoader(RecoverStrategy.class).getExtension(
        recoverStrategyAlias);

    //TimeWindowRegulatorListener
    InvocationStatFactory.addListener(listener);
}

This mainly initializes, via SPI, the measure strategy, regulation strategy, degrade strategy, and recover strategy. What they are used for is discussed below.

Triggering weight degradation

In 3. Source code analysis --- SOFARPC client service invocation, we saw that a client call ends up in the AbstractCluster#doSendMsg method, which performs a synchronous, asynchronous, one-way, or other call depending on the invoke type and then returns the response.

protected SofaResponse doSendMsg(ProviderInfo providerInfo, ClientTransport transport,
                                 SofaRequest request) throws SofaRpcException {
    ....
    // synchronous call
    if (RpcConstants.INVOKER_TYPE_SYNC.equals(invokeType)) {
        long start = RpcRuntimeContext.now();
        try {
            //BoltClientTransport#syncSend
            response = transport.syncSend(request, timeout);
        } finally {
            if (RpcInternalContext.isAttachmentEnable()) {
                long elapsed = RpcRuntimeContext.now() - start;
                context.setAttachment(RpcConstants.INTERNAL_KEY_CLIENT_ELAPSE, elapsed);
            }
        }
    }
    ....
}

Because the fault-tolerance module subscribed to two events when it was installed, ClientSyncReceiveEvent and ClientAsyncReceiveEvent (one for synchronous calls and one for asynchronous calls), we pick the synchronous call to explain.

In the code snippet above, we see the call to BoltClientTransport#syncSend.

BoltClientTransport#syncSend

public SofaResponse syncSend(SofaRequest request, int timeout) throws SofaRpcException {
    //check the connection
    checkConnection();
    RpcInternalContext context = RpcInternalContext.getContext();
    InvokeContext boltInvokeContext = createInvokeContext(request);
    SofaResponse response = null;
    SofaRpcException throwable = null;
    try {
        //post a ClientBeforeSendEvent to the bus
        beforeSend(context, request);
        response = doInvokeSync(request, boltInvokeContext, timeout);
        return response;
    } catch (Exception e) { // other exceptions
        throwable = convertToRpcException(e);
        throw throwable;
    } finally {
        //post a ClientAfterSendEvent to the bus
        afterSend(context, boltInvokeContext, request);
        //post a ClientSyncReceiveEvent to the bus
        if (EventBus.isEnable(ClientSyncReceiveEvent.class)) {
            //send the provider being called and the ConsumerConfig to the bus
            EventBus.post(new ClientSyncReceiveEvent(transportConfig.getConsumerConfig(),
                    transportConfig.getProviderInfo(), request, response, throwable));
        }
    }
}

In fact, most of the code above is unrelated to this article; the only part that matters is that a ClientSyncReceiveEvent is posted to the bus at the end.

Posting to the bus triggers the onEvent method of the subscriber FaultToleranceSubscriber.

Let's go into FaultToleranceSubscriber#onEvent


public void onEvent(Event originEvent) {
    Class eventClass = originEvent.getClass();

    if (eventClass == ClientSyncReceiveEvent.class) {
        //this checks the aftEnable flag
        if (!FaultToleranceConfigManager.isEnable()) {
            return;
        }
        // synchronous result
        ClientSyncReceiveEvent event = (ClientSyncReceiveEvent) originEvent;
        ConsumerConfig consumerConfig = event.getConsumerConfig();
        ProviderInfo providerInfo = event.getProviderInfo();
        InvocationStat result = InvocationStatFactory.getInvocationStat(consumerConfig, providerInfo);
        if (result != null) {
            //record the number of invocations
            result.invoke();
            Throwable t = event.getThrowable();
            if (t != null) {
                 //record the number of exceptions
                result.catchException(t);
            }
        }
    }  
    ...
}

Here we ignore the other events and keep only the handling of ClientSyncReceiveEvent.
Here we see the InvocationStatFactory factory class, which was also used in TimeWindowRegulator#init above.

After the InvocationStat is returned, its invoke method is called to record the number of calls from the client to the server; if there was an exception, catchException is also called to record the number of exceptions. These two counters are used later for statistics, when the fault removal runs asynchronously.

InvocationStatFactory#getInvocationStat

public static InvocationStat getInvocationStat(ConsumerConfig consumerConfig, ProviderInfo providerInfo) {
    String appName = consumerConfig.getAppName();
    if (appName == null) {
        return null;
    }
    // the application has single-machine fault removal enabled
    if (FaultToleranceConfigManager.isRegulationEffective(appName)) {
        return getInvocationStat(new InvocationStatDimension(providerInfo, consumerConfig));
    }
    return null;
}


public static InvocationStat getInvocationStat(InvocationStatDimension statDimension) {
    //null the first time
    InvocationStat invocationStat = ALL_STATS.get(statDimension);
    if (invocationStat == null) {
        //create a new instance and put it into ALL_STATS
        invocationStat = new ServiceExceptionInvocationStat(statDimension);
        InvocationStat old = ALL_STATS.putIfAbsent(statDimension, invocationStat);
        if (old != null) {
            invocationStat = old;
        }
        //LISTENERS is populated when TimeWindowRegulator#init is called; it contains only one TimeWindowRegulatorListener
        for (InvocationStatListener listener : LISTENERS) {
            listener.onAddInvocationStat(invocationStat);
        }
    }
    return invocationStat;
}

The first time this method is reached for a dimension, it instantiates a ServiceExceptionInvocationStat, puts it into ALL_STATS, then iterates over the LISTENERS of InvocationStatFactory and calls each listener's onAddInvocationStat method.

The only entry in LISTENERS is the TimeWindowRegulatorListener that was added in TimeWindowRegulator#init.

Note that two wrapper classes here are used later: InvocationStatDimension and ServiceExceptionInvocationStat.
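The get-or-create pattern in getInvocationStat can be sketched on its own. This is a simplified illustration under stated assumptions: StatFactorySketch is a hypothetical class, a plain String key replaces InvocationStatDimension, and an AtomicLong replaces the stat object.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical simplification of InvocationStatFactory; not the SOFARPC API.
final class StatFactorySketch {
    private final ConcurrentMap<String, AtomicLong> allStats = new ConcurrentHashMap<>();
    private final List<Consumer<AtomicLong>> listeners = new CopyOnWriteArrayList<>();

    void addListener(Consumer<AtomicLong> listener) {
        listeners.add(listener);
    }

    // Returns the stat for the key, creating it and notifying listeners on first use.
    AtomicLong getStat(String key) {
        AtomicLong stat = allStats.get(key);
        if (stat == null) {
            stat = new AtomicLong();
            // putIfAbsent keeps the winner if another thread created the stat first.
            AtomicLong old = allStats.putIfAbsent(key, stat);
            if (old != null) {
                stat = old;
            }
            for (Consumer<AtomicLong> listener : listeners) {
                listener.accept(stat);
            }
        }
        return stat;
    }

    public static void main(String[] args) {
        StatFactorySketch factory = new StatFactorySketch();
        factory.addListener(stat -> System.out.println("new stat registered"));
        AtomicLong first = factory.getStat("app:service:10.0.0.1");
        AtomicLong second = factory.getStat("app:service:10.0.0.1");
        System.out.println(first == second); // true
    }
}
```

As in the original, a thread that loses the putIfAbsent race still notifies the listeners, so a listener should tolerate seeing the same stat more than once.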

InvocationStatDimension

public class InvocationStatDimension {
    /**
     * One provider of service reference
     */
    private final ProviderInfo   providerInfo;

    /**
     * Config of service reference
     */
    private final ConsumerConfig consumerConfig;

    /**
     * cache value: dimensionKey
     */
    private transient String     dimensionKey;
    /**
     * cache value : originWeight
     */
    private transient Integer    originWeight;
}

The structure of ServiceExceptionInvocationStat:

ServiceExceptionInvocationStat

public class ServiceExceptionInvocationStat extends AbstractInvocationStat {

    /**
     * Instantiates a new Service exception invocation stat.
     *
     * @param invocation the invocation
     */
    public ServiceExceptionInvocationStat(InvocationStatDimension invocation) {
        super(invocation);
    }

    @Override
    public long catchException(Throwable t) {
        //count exceptions: only client timeouts and server-busy errors are counted
        if (t instanceof SofaRpcException) {
            SofaRpcException exception = (SofaRpcException) t;
            if (exception.getErrorType() == RpcErrorType.CLIENT_TIMEOUT
                    || exception.getErrorType() == RpcErrorType.SERVER_BUSY) {
                return exceptionCount.incrementAndGet();
            }
        }
        return exceptionCount.get();
    }
}

Then we just need to look at the fields of its parent class.
AbstractInvocationStat

public abstract class AbstractInvocationStat implements InvocationStat {
    /**
     * statistics dimension
     */
    protected final InvocationStatDimension dimension;
    /**
     * number of invocations
     */
    protected final AtomicLong              invokeCount    = new AtomicLong(0L);
    /**
     * number of exceptions
     */
    protected final AtomicLong              exceptionCount = new AtomicLong(0L);

    /**
     * when useless in one window, this value increment 1. <br />
     * If this value is greater than threshold, this stat will be deleted.
     */
    private final transient AtomicInteger   uselessCycle   = new AtomicInteger(0);
}

These fields will be used again below.
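How the two counters combine into an exception rate can be sketched as follows. This is an assumption-laden simplification: CallStatSketch is a hypothetical class, and java.util.concurrent.TimeoutException stands in for the CLIENT_TIMEOUT and SERVER_BUSY error types that the real class filters on.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical simplified counter; the real class is ServiceExceptionInvocationStat.
final class CallStatSketch {
    private final AtomicLong invokeCount = new AtomicLong();
    private final AtomicLong exceptionCount = new AtomicLong();

    // Record one call.
    long invoke() {
        return invokeCount.incrementAndGet();
    }

    // Assumption for this sketch: only timeouts count as failures of the provider.
    long catchException(Throwable t) {
        if (t instanceof TimeoutException) {
            return exceptionCount.incrementAndGet();
        }
        return exceptionCount.get();
    }

    // Returns -1 when there were no calls, so callers can tell "no data" from "no errors".
    double exceptionRate() {
        long calls = invokeCount.get();
        return calls == 0 ? -1 : (double) exceptionCount.get() / calls;
    }

    public static void main(String[] args) {
        CallStatSketch stat = new CallStatSketch();
        for (int i = 0; i < 4; i++) {
            stat.invoke();
        }
        stat.catchException(new TimeoutException());       // counted
        stat.catchException(new IllegalStateException());  // ignored
        System.out.println(stat.exceptionRate()); // 0.25
    }
}
```

Filtering by error type matters because a business exception thrown by the service says nothing about the health of the machine serving it.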

Implementation of weight degradation

TimeWindowRegulatorListener is an inner class of TimeWindowRegulator.

class TimeWindowRegulatorListener implements InvocationStatListener {
    @Override
    public void onAddInvocationStat(InvocationStat invocationStat) {
        //the measure strategy is not null
        if (measureStrategy != null) {
            //ServiceHorizontalMeasureStrategy
            MeasureModel measureModel = measureStrategy.buildMeasureModel(invocationStat);
            if (measureModel != null) {
                measureModels.add(measureModel);
                startRegulate();
            }
        }
    }

    @Override
    public void onRemoveInvocationStat(InvocationStat invocationStat) {
        if (measureStrategy != null) {
            measureStrategy.removeMeasureModel(invocationStat);
        }
    }
}

This listener simply calls ServiceHorizontalMeasureStrategy#buildMeasureModel, which returns the measure model.

Let's first look at what MeasureModel encapsulates:

MeasureModel

public class MeasureModel {
    /**
     * App name of measure model
     * application name
     */
    private final String                            appName;
    /**
     * service name of measure model
     * the service being called
     */
    private final String                            service;
    /**
     * all dimension statics stats of measure model
     * set of InvocationStat
     */
    private final ConcurrentHashSet<InvocationStat> stats = new ConcurrentHashSet<InvocationStat>();
    ....
}

From these fields we can infer that MeasureModel is keyed by appName + service and holds many InvocationStat instances.

Let's go back to ServiceHorizontalMeasureStrategy#buildMeasureModel

public MeasureModel buildMeasureModel(InvocationStat invocationStat) {
    InvocationStatDimension statDimension = invocationStat.getDimension();
    //AppName + ":" + Service
    String key = statDimension.getDimensionKey();
    MeasureModel measureModel = appServiceMeasureModels.get(key);
    if (measureModel == null) {
        measureModel = new MeasureModel(statDimension.getAppName(), statDimension.getService());
        MeasureModel oldMeasureModel = appServiceMeasureModels.putIfAbsent(key, measureModel);
        if (oldMeasureModel == null) {
            measureModel.addInvocationStat(invocationStat);
            return measureModel;
        } else {
            oldMeasureModel.addInvocationStat(invocationStat);
            return null;
        }
    } else {
        measureModel.addInvocationStat(invocationStat);
        return null;
    }
}

buildMeasureModel does what I described above: it groups the different invocationStat instances into a MeasureModel keyed by appName + service.
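The grouping rule, including the detail that the model is returned only by the call that created it, can be sketched like this. MeasureGroupingSketch is a hypothetical class, and each stat is reduced to a provider IP string for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical simplification of ServiceHorizontalMeasureStrategy#buildMeasureModel.
final class MeasureGroupingSketch {
    // key: appName + ":" + service, value: the providers measured together
    private final ConcurrentMap<String, Set<String>> models = new ConcurrentHashMap<>();

    // Adds the provider to its group; returns the group only if this call created it, else null.
    Set<String> add(String appName, String service, String providerIp) {
        String key = appName + ":" + service;
        Set<String> fresh = ConcurrentHashMap.newKeySet();
        Set<String> existing = models.putIfAbsent(key, fresh);
        Set<String> target = existing == null ? fresh : existing;
        target.add(providerIp);
        return existing == null ? fresh : null;
    }

    int size(String appName, String service) {
        Set<String> group = models.get(appName + ":" + service);
        return group == null ? 0 : group.size();
    }

    public static void main(String[] args) {
        MeasureGroupingSketch sketch = new MeasureGroupingSketch();
        System.out.println(sketch.add("appName", "HelloService", "10.0.0.1") != null); // true
        System.out.println(sketch.add("appName", "HelloService", "10.0.0.2") != null); // false
        System.out.println(sketch.size("appName", "HelloService")); // 2
    }
}
```

Returning null for an existing group lets the caller register each model in its own list exactly once.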

Next, back in TimeWindowRegulatorListener#onAddInvocationStat, the startRegulate method is called.


/**
 * measure thread pool
 */
private final ScheduledService                   measureScheduler   = new ScheduledService("AFT-MEASURE",
                                                                        ScheduledService.MODE_FIXEDRATE,
                                                                        new MeasureRunnable(), 1, 1,
                                                                        TimeUnit.SECONDS);

public void startRegulate() {
    if (measureStarted.compareAndSet(false, true)) {
        measureScheduler.start();
    }
}

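ScheduledService is a thread pool wrapper. The measureScheduler variable instantiates a fixed-rate scheduled thread pool that calls the run method of MeasureRunnable once every second.

MeasureRunnable is an inner class of TimeWindowRegulator: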

private class MeasureRunnable implements Runnable {

    @Override
    public void run() {
        measureCounter.incrementAndGet();
        //iterate over the MeasureModel instances added by TimeWindowRegulatorListener
        for (MeasureModel measureModel : measureModels) {
            try {
                //the time window is 10, so by default the code below is entered only once every 10 seconds
                if (isArriveTimeWindow(measureModel)) {
                    //ServiceHorizontalMeasureStrategy
                    MeasureResult measureResult = measureStrategy.measure(measureModel);
                    regulationExecutor.submit(new RegulationRunnable(measureResult));
                }
            } catch (Exception e) {
                LOGGER.errorWithApp(measureModel.getAppName(), "Error when doMeasure: " + e.getMessage(), e);
            }
        }
    }

    private boolean isArriveTimeWindow(MeasureModel measureModel) {
        //timeWindow defaults to 10
        long timeWindow = FaultToleranceConfigManager.getTimeWindow(measureModel.getAppName());
        return measureCounter.get() % timeWindow == 0;
    }
}
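
The interplay between the one-second tick and the time window can be sketched without a real scheduler. WindowTickSketch is a hypothetical class; tick() is called by hand here, whereas in the code above a fixed-rate scheduler calls run once per second.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the counter-based window check in MeasureRunnable.
final class WindowTickSketch {
    private final AtomicLong counter = new AtomicLong();
    private final long timeWindow;
    private int measured = 0;

    WindowTickSketch(long timeWindow) {
        this.timeWindow = timeWindow;
    }

    // One tick per second; a measurement happens only when the counter hits a multiple of the window.
    void tick() {
        if (counter.incrementAndGet() % timeWindow == 0) {
            measured++;
        }
    }

    int measuredCount() {
        return measured;
    }

    public static void main(String[] args) {
        WindowTickSketch sketch = new WindowTickSketch(10);
        for (int second = 0; second < 25; second++) {
            sketch.tick();
        }
        System.out.println(sketch.measuredCount()); // 2 (at ticks 10 and 20)
    }
}
```

A single shared counter with a modulo check lets applications with different time windows share one scheduler thread.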

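Let's first go to ServiceHorizontalMeasureStrategy#measure to see how a node is judged abnormal or healthy.

How to judge whether a node is abnormal or healthy

Let's not look at the code yet; first a plain-language description of how it works.

  1. First, when FaultToleranceSubscriber#onEvent receives a synchronous or asynchronous result event, it obtains the InvokeStat for this call from the factory (if the InvokeStat already exists it is returned directly; otherwise a new one is created and kept in the cache). Call counts and exception counts are recorded by calling the invoke and catchException methods of the InvokeStat.
  2. Then, in MeasureRunnable, according to the configured time window, when the window is reached a snapshot is created from each InvokeStat of the MeasureModel, representing the calls within the current window.
  3. All nodes are measured and the average exception rate across them is calculated. If a node's exception rate exceeds the average exception rate by a certain multiple, the node is judged abnormal.

I use the official example here to illustrate:
Suppose three nodes provide the same service, with call counts and exception counts as shown in the table: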
               invokeCount   expCount
invokeStat 1   5             4
invokeStat 2   10            1
invokeStat 3   10            0

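Based on the example above, the rough logic of the measure strategy is as follows:

  • First, compute the average exception rate across all IPs of the service, denoted averageExceptionRate. The average exception rate is easy to understand: total exceptions / total calls. In the example above, averageExceptionRate = (4 + 1 + 0) / (5 + 10 + 10) = 0.2.
  • When an IP's call count in the window is less than the service's minimum window call count (leastWindowCount), it is ignored and its state is set to IGNORE. Otherwise it is measured for degradation and recovery. For example, invokeStat 1 has an invokeCount of 5; if leastWindowCount is set to 6, invokeStat 1 is ignored.
  • When the ratio of an IP's exception rate within the time window to the service's average exception rate, windowExceptionRate (named windowExceptionRateMultiple in the source code), is greater than or equal to the configured leastWindowExceptionRateMultiple (the minimum ratio between the window exception rate and the service's average exception rate that triggers degradation), the IP is set to ABNORMAL; otherwise it is set to HEALTH.

windowExceptionRate is the ratio of the exception rate to the service's average exception rate. The exception rate of invokeStat 1 is 4/5 = 0.8, so its windowExceptionRate = 0.8 / 0.2 = 4. Assuming leastWindowExceptionRateMultiple = 4, invokeStat 1 is an abnormal node and must be degraded.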
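The arithmetic of the official example can be checked with a small program. This is a simplified sketch, not the SOFARPC implementation: MeasureExampleSketch and its methods are hypothetical, and it applies one leastWindowCount to every node instead of scaling it by each node's weight.

```java
// Hypothetical worked example of the measure logic for the three nodes in the table.
final class MeasureExampleSketch {
    enum State { IGNORE, HEALTH, ABNORMAL }

    // Average exception rate over the nodes that reached the minimum call count; -1 if none did.
    static double averageRate(long[] invokes, long[] exceptions, long leastWindowCount) {
        long sumCall = 0;
        long sumException = 0;
        for (int i = 0; i < invokes.length; i++) {
            if (invokes[i] >= leastWindowCount) {
                sumCall += invokes[i];
                sumException += exceptions[i];
            }
        }
        return sumCall == 0 ? -1 : (double) sumException / sumCall;
    }

    // Classifies one node against the service-wide average.
    static State classify(long invoke, long exception, double average,
                          long leastWindowCount, double leastMultiple) {
        if (average == -1 || invoke < leastWindowCount) {
            return State.IGNORE;
        }
        if (average == 0) {
            return State.HEALTH;
        }
        double multiple = ((double) exception / invoke) / average;
        return multiple >= leastMultiple ? State.ABNORMAL : State.HEALTH;
    }

    public static void main(String[] args) {
        long[] invokes = {5, 10, 10};
        long[] exceptions = {4, 1, 0};
        double average = averageRate(invokes, exceptions, 1);
        System.out.println("averageExceptionRate = " + average);
        for (int i = 0; i < invokes.length; i++) {
            System.out.println("invokeStat " + (i + 1) + " -> "
                + classify(invokes[i], exceptions[i], average, 1, 4.0));
        }
    }
}
```

With leastWindowCount set to 1 and a multiple of 4, invokeStat 1 comes out ABNORMAL and the other two HEALTH. Raising leastWindowCount to 6 makes invokeStat 1 IGNORE instead, as described in the second bullet.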

Next, let's look at the actual source code:
ServiceHorizontalMeasureStrategy#measure

public MeasureResult measure(MeasureModel measureModel) {

    MeasureResult measureResult = new MeasureResult();
    measureResult.setMeasureModel(measureModel);

    String appName = measureModel.getAppName();
    List<InvocationStat> stats = measureModel.getInvocationStats();
    if (!CommonUtils.isNotEmpty(stats)) {
        return measureResult;
    }

    //1
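    //This method copies the invocation statistics at the current point in time; only the copied InvocationStats are measured.
    //An InvocationStat that has just been removed will not be present in this result.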
    List<InvocationStat> invocationStats = getInvocationStatSnapshots(stats);
    //The time window set through timeWindow in FaultToleranceConfig, 10 by default
    long timeWindow = FaultToleranceConfigManager.getTimeWindow(appName);
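    /* leastWindowCount stays unchanged within one measurement */
    //Minimum number of calls within the time window for an InvocationStat to take part in the statistics. By default, if there are fewer than 10 calls in the time window, no regulation is considered necessary.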
    long leastWindowCount = FaultToleranceConfigManager.getLeastWindowCount(appName);
    //The minimum is 1, i.e. any call within the time window is counted
    leastWindowCount = leastWindowCount < LEGAL_LEAST_WINDOW_COUNT ? LEGAL_LEAST_WINDOW_COUNT
        : leastWindowCount;

    //2.
    /* appWeight is needed both when computing the average exception rate and when measuring a single IP */
    double averageExceptionRate = calculateAverageExceptionRate(invocationStats, leastWindowCount);

    //How many times the average exception rate a machine's rate must reach before it is degraded, 6 by default
    double leastWindowExceptionRateMultiple = FaultToleranceConfigManager
        .getLeastWindowExceptionRateMultiple(appName);

    for (InvocationStat invocationStat : invocationStats) {
        MeasureResultDetail measureResultDetail = null;
        InvocationStatDimension statDimension = invocationStat.getDimension();

        long windowCount = invocationStat.getInvokeCount();
        //3
        //Compute the actual minimum window call count of this invocation from its actual weight
        long invocationLeastWindowCount = getInvocationLeastWindowCount(invocationStat,
                ProviderInfoWeightManager.getWeight(statDimension.getProviderInfo()),
                leastWindowCount);
        //4
        //When the total call count is 0, averageExceptionRate = -1, and the state can be set to IGNORE
        if (averageExceptionRate == -1) {
            measureResultDetail = new MeasureResultDetail(statDimension, MeasureState.IGNORE);
        } else {
            if (invocationLeastWindowCount != -1 && windowCount >= invocationLeastWindowCount) {
                //get the exception rate
                double windowExceptionRate = invocationStat.getExceptionRate();
                //no exceptions at all, so set the state to HEALTH
                if (averageExceptionRate == 0) {
                    measureResultDetail = new MeasureResultDetail(statDimension, MeasureState.HEALTH);
                } else {
                    //5
                    //ratio of this invocationStat's exception rate to the average exception rate
                    double windowExceptionRateMultiple = CalculateUtils.divide(
                            windowExceptionRate, averageExceptionRate);
                    //if this invocationStat's exception rate is at least 6 times (the default multiple) the average, set the state to ABNORMAL
                    measureResultDetail = windowExceptionRateMultiple >= leastWindowExceptionRateMultiple ?
                            new MeasureResultDetail(statDimension, MeasureState.ABNORMAL) :
                            new MeasureResultDetail(statDimension, MeasureState.HEALTH);
                }
                measureResultDetail.setAbnormalRate(windowExceptionRate);
                measureResultDetail.setAverageAbnormalRate(averageExceptionRate);
                measureResultDetail.setLeastAbnormalRateMultiple(leastWindowExceptionRateMultiple);
            } else {
                measureResultDetail = new MeasureResultDetail(statDimension, MeasureState.IGNORE);
            }
        }

        measureResultDetail.setWindowCount(windowCount);
        measureResultDetail.setTimeWindow(timeWindow);
        measureResultDetail.setLeastWindowCount(invocationLeastWindowCount);
        measureResult.addMeasureDetail(measureResultDetail);
    }
    //log the result
    logMeasureResult(measureResult, timeWindow, leastWindowCount, averageExceptionRate,
        leastWindowExceptionRateMultiple);

    InvocationStatFactory.updateInvocationStats(invocationStats);
    return measureResult;
}  

This method is a bit long, so I marked it with numbers; follow the numbered markers.

  1. getInvocationStatSnapshots
public static List<InvocationStat> getInvocationStatSnapshots(List<InvocationStat> stats) {
    List<InvocationStat> snapshots = new ArrayList<InvocationStat>(stats.size());
    for (InvocationStat stat : stats) {
        //copy an InvocationStat
        InvocationStat snapshot = stat.snapshot();
        //if the invoke count is less than or equal to 0
        if (snapshot.getInvokeCount() <= 0) {
            if (stat.getUselessCycle().incrementAndGet() > 6) {
                // no calls for more than 6 time windows, remove the stat
                InvocationStatFactory.removeInvocationStat(stat);
                InvocationStatDimension dimension = stat.getDimension();
                String appName = dimension.getAppName();
                if (LOGGER.isDebugEnabled(appName)) {
                    LOGGER.debugWithApp(appName, "Remove invocation stat : {}, {} because of useless cycle > 6",
                        dimension.getDimensionKey(), dimension.getProviderInfo());
                }
            }
        } else {
            //if it was called, restart the count
            stat.getUselessCycle().set(0);
            snapshots.add(snapshot);
        }
    }
    return snapshots;
}

//ServiceExceptionInvocationStat#snapshot
public InvocationStat snapshot() {
    ServiceExceptionInvocationStat invocationStat = new ServiceExceptionInvocationStat(dimension);
    invocationStat.setInvokeCount(getInvokeCount());
    invocationStat.setExceptionCount(getExceptionCount());
    return invocationStat;
}

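First, this method iterates over all InvocationStat instances and calls snapshot to create a new InvocationStat instance for each.

Second, it checks whether the call count of the new InvocationStat instance is less than or equal to 0. If so, it was not called even once in the time window; the method then checks whether it has gone more than 6 time windows without a call, and if so, deletes its statistics.

Finally it returns the new collection of InvocationStat instances.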
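The snapshot and idle-cleanup behaviour described above can be sketched as follows. SnapshotSketch and its nested Stat class are hypothetical simplifications; the live list stands in for the stats held by the factory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of getInvocationStatSnapshots: copy active stats, evict long-idle ones.
final class SnapshotSketch {
    static final int MAX_USELESS_CYCLES = 6;

    static final class Stat {
        final String name;
        long invokeCount;
        int uselessCycle;

        Stat(String name, long invokeCount) {
            this.name = name;
            this.invokeCount = invokeCount;
        }
    }

    // Returns copies of the stats that were called; removes from 'live' any stat idle for more than 6 windows.
    static List<Stat> snapshots(List<Stat> live) {
        List<Stat> result = new ArrayList<>();
        for (Stat stat : new ArrayList<>(live)) { // iterate over a copy so removal is safe
            if (stat.invokeCount <= 0) {
                if (++stat.uselessCycle > MAX_USELESS_CYCLES) {
                    live.remove(stat);
                }
            } else {
                stat.uselessCycle = 0;
                result.add(new Stat(stat.name, stat.invokeCount));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Stat> live = new ArrayList<>();
        live.add(new Stat("busy", 3));
        live.add(new Stat("idle", 0));
        for (int window = 1; window <= 7; window++) {
            int active = snapshots(live).size();
            System.out.println("window " + window + ": active=" + active + ", tracked=" + live.size());
        }
    }
}
```

Measuring against copies means the counters can keep changing during a measurement without skewing it.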

  2. calculateAverageExceptionRate
private double calculateAverageExceptionRate(List<InvocationStat> invocationStats, long leastWindowCount) {
    long sumException = 0;
    long sumCall = 0;
    for (InvocationStat invocationStat : invocationStats) {

        long invocationLeastWindowCount = getInvocationLeastWindowCount(invocationStat,
            ProviderInfoWeightManager.getWeight(invocationStat.getDimension().getProviderInfo()),
            leastWindowCount);
        //sum the invoke counts and exception counts of all invocationStats
        if (invocationLeastWindowCount != -1
            && invocationStat.getInvokeCount() >= invocationLeastWindowCount) {
            sumException += invocationStat.getExceptionCount();
            sumCall += invocationStat.getInvokeCount();
        }
    }
    if (sumCall == 0) {
        return -1;
    }
    //compute the exception rate
    return CalculateUtils.divide(sumException, sumCall);
}


private long getInvocationLeastWindowCount(InvocationStat invocationStat, Integer weight, long leastWindowCount) {
    //origin weight of the target address
    InvocationStatDimension statDimension = invocationStat.getDimension();
    Integer originWeight = statDimension.getOriginWeight();
    if (originWeight == 0) {
        LOGGER.errorWithApp(statDimension.getAppName(), "originWeight is 0,but is invoked. service["
                + statDimension.getService() + "];ip["
                + statDimension.getIp() + "].");
        return -1;
    } else if (weight == null) { // The address has never been regulated, or has already recovered
        return leastWindowCount;
    } else if (weight == -1) { // The address has been removed
        return -1;
    }
    // Scale the minimum per-window call count by the invocation's current weight relative to its original weight
    double rate = CalculateUtils.divide(weight, originWeight);
    long invocationLeastWindowCount = CalculateUtils.multiply(leastWindowCount, rate);
    return invocationLeastWindowCount < LEGAL_LEAST_WINDOW_COUNT ? LEGAL_LEAST_WINDOW_COUNT
            : invocationLeastWindowCount;
}

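Overall, this method iterates over all InvocationStat instances, sums their invoke counts and exception counts, and computes the average exception rate as (exception count / invoke count). Stats that did not reach their minimum call count in the window are left out of the sums.

The getInvocationLeastWindowCount method mainly performs validation: it returns -1 if the original weight is 0 or if the current weight is -1 (the address has been removed), and it returns leastWindowCount unchanged if the weight is null.
Because the current InvocationStat may already have had its weight degraded, the original minimum per-window call count cannot be applied as-is. It is multiplied by the ratio (current weight / original weight), and the result is raised to LEGAL_LEAST_WINDOW_COUNT if it falls below that floor, giving the actual minimum per-window call count for this invocation.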
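
A small worked sketch of both calculations may help. It is an illustration under stated assumptions, not the SOFARPC code: the flattened Stat class is hypothetical, the floor value of 1 for LEGAL_LEAST_WINDOW_COUNT is assumed (its value is not shown here), and plain double arithmetic with truncation stands in for CalculateUtils.

```java
import java.util.List;

final class AverageRateSketch {
    // Assumed floor for the scaled threshold.
    static final long LEGAL_LEAST_WINDOW_COUNT = 1;

    // Hypothetical, flattened stand-in for InvocationStat plus its dimension.
    static final class Stat {
        final long invokeCount;
        final long exceptionCount;
        final Integer weight;   // null = never regulated, -1 = removed
        final int originWeight;

        Stat(long invokeCount, long exceptionCount, Integer weight, int originWeight) {
            this.invokeCount = invokeCount;
            this.exceptionCount = exceptionCount;
            this.weight = weight;
            this.originWeight = originWeight;
        }
    }

    // Minimum calls a node needs in the window before it is counted; -1 means "skip it".
    static long leastWindowCount(Stat stat, long baseLeastWindowCount) {
        if (stat.originWeight == 0) {
            return -1;
        }
        if (stat.weight == null) {
            return baseLeastWindowCount;
        }
        if (stat.weight == -1) {
            return -1;
        }
        double rate = (double) stat.weight / stat.originWeight;
        long scaled = (long) (baseLeastWindowCount * rate); // truncation is an assumption
        return Math.max(scaled, LEGAL_LEAST_WINDOW_COUNT);
    }

    // Average exception rate over the nodes that received enough calls; -1 if none did.
    static double averageExceptionRate(List<Stat> stats, long baseLeastWindowCount) {
        long sumException = 0;
        long sumCall = 0;
        for (Stat stat : stats) {
            long least = leastWindowCount(stat, baseLeastWindowCount);
            if (least != -1 && stat.invokeCount >= least) {
                sumException += stat.exceptionCount;
                sumCall += stat.invokeCount;
            }
        }
        return sumCall == 0 ? -1 : (double) sumException / sumCall;
    }
}
```

For example, with a base threshold of 10, a node degraded from weight 100 to 50 only needs 5 calls to be counted, while an unregulated node with 4 calls is skipped entirely.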

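  1. The if checks

As we saw when analyzing calculateAverageExceptionRate, averageExceptionRate is -1 when the total invoke count is 0. That means none of the InvocationStat instances were called, so they are marked to be ignored.

Further down there is a check for whether averageExceptionRate is 0. Since averageExceptionRate = (exception count / invoke count), a value of 0 means there were no exceptions, so the state is set to healthy.

  1. windowExceptionRateMultipe
    The windowExceptionRateMultipe variable is the ratio of the exception rate of the invocationStat currently being iterated to the average exception rate. If (exception rate / average exception rate) >= leastWindowExceptionRateMultiple, which defaults to 6, the current invocationStat is marked as abnormal.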
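
The three outcomes described above can be condensed into one small function. This is only a sketch: the classify method and its parameters are hypothetical, and the IGNORE state name is an assumption (HEALTH and ABNORMAL are the MeasureState values used in doRegulate).

```java
final class MeasureStateSketch {
    // HEALTH and ABNORMAL mirror MeasureState; IGNORE is an assumed name for "skip this node".
    enum State { HEALTH, ABNORMAL, IGNORE }

    static State classify(double windowExceptionRate, double averageExceptionRate,
                          double leastWindowExceptionRateMultiple) {
        if (averageExceptionRate == -1) {
            return State.IGNORE;  // nothing was called in this window
        }
        if (averageExceptionRate == 0) {
            return State.HEALTH;  // no exceptions anywhere
        }
        // How many times worse this node is than the average.
        double multiple = windowExceptionRate / averageExceptionRate;
        return multiple >= leastWindowExceptionRateMultiple ? State.ABNORMAL : State.HEALTH;
    }
}
```

With the default multiple of 6, a node failing 75% of its calls while the average is 12.5% is flagged as abnormal, whereas a node at 50% (4 times the average) still counts as healthy.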

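Degrading or recovering according to the MeasureResult

After ServiceHorizontalMeasureStrategy#measure is called, it returns a MeasureResult. A new RegulationRunnable instance is then created and submitted to the regulationExecutor thread pool for execution.

RegulationRunnable is an inner class of TimeWindowRegulator.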

RegulationRunnable#run

RegulationRunnable(MeasureResult measureResult) {
    this.measureResult = measureResult;
}
 
public void run() {
    List<MeasureResultDetail> measureResultDetails = measureResult.getAllMeasureResultDetails();
    for (MeasureResultDetail measureResultDetail : measureResultDetails) {
        try {
            doRegulate(measureResultDetail);
        } catch (Exception e) {
            LOGGER.errorWithApp(measureResult.getMeasureModel().getAppName(),
                "Error when doRegulate: " + e.getMessage(), e);
        }
    }
}

In its run method, RegulationRunnable iterates over all the MeasureResultDetail entries in measureResult and calls doRegulate on each one to degrade or recover the node.

void doRegulate(MeasureResultDetail measureResultDetail) {
    MeasureState measureState = measureResultDetail.getMeasureState();
    InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
    // Whether degradation is actually applied; off by default (ServiceHorizontalRegulationStrategy)
    boolean isDegradeEffective = regulationStrategy.isDegradeEffective(measureResultDetail);

    if (isDegradeEffective) {
        measureResultDetail.setLogOnly(false);
        if (measureState.equals(MeasureState.ABNORMAL)) {
            // Guard against degrading too many nodes: by default at most two nodes can be degraded
            boolean isReachMaxDegradeIpCount = regulationStrategy.isReachMaxDegradeIpCount(measureResultDetail);
            if (!isReachMaxDegradeIpCount) {
                // Degrade the weight (WeightDegradeStrategy)
                degradeStrategy.degrade(measureResultDetail);
            } else {
                String appName = measureResult.getMeasureModel().getAppName();
                if (LOGGER.isInfoEnabled(appName)) {
                    LOGGER.infoWithApp(appName, LogCodes.getLog(LogCodes.INFO_REGULATION_ABNORMAL_NOT_DEGRADE,
                            "Reach degrade number limit.", statDimension.getService(), statDimension.getIp(),
                            statDimension.getAppName()));
                }
            }
        } else if (measureState.equals(MeasureState.HEALTH)) {
            boolean isExistDegradeList = regulationStrategy.isExistInTheDegradeList(measureResultDetail);
            if (isExistDegradeList) {
                // Recover
                recoverStrategy.recover(measureResultDetail);
                regulationStrategy.removeFromDegradeList(measureResultDetail);
            }
            // Otherwise the node was never degraded, so there is nothing to recover.
        }
    } else {
        measureResultDetail.setLogOnly(true);
        if (measureState.equals(MeasureState.ABNORMAL)) {
            // With the switch off, degrade is called mainly to print the log
            degradeStrategy.degrade(measureResultDetail);
            String appName = measureResult.getMeasureModel().getAppName();
            if (LOGGER.isInfoEnabled(appName)) {
                LOGGER.infoWithApp(appName, LogCodes.getLog(LogCodes.INFO_REGULATION_ABNORMAL_NOT_DEGRADE,
                        "Degrade switch is off", statDimension.getService(),
                        statDimension.getIp(), statDimension.getAppName()));
            }
        }
    }
}
}

We analyze the two cases separately.

  1. If the node is abnormal
    First, ServiceHorizontalRegulationStrategy#isReachMaxDegradeIpCount is called.

ServiceHorizontalRegulationStrategy#isReachMaxDegradeIpCount

public boolean isReachMaxDegradeIpCount(MeasureResultDetail measureResultDetail) {
    InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
    ConcurrentHashSet<String> ips = getDegradeProviders(statDimension.getDimensionKey());

    String ip = statDimension.getIp();
    if (ips.contains(ip)) {
        return false;
    } else {
        // Maximum number of IPs that can be regulated for one service
        int degradeMaxIpCount = FaultToleranceConfigManager.getDegradeMaxIpCount(statDimension.getAppName());
        ipsLock.lock();
        try {
            if (ips.size() < degradeMaxIpCount) {
                ips.add(ip);
                return false;
            } else {
                return true;
            }
        } finally {
            ipsLock.unlock();
        }
    }
}

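This method limits how many nodes of a single service can be regulated. For example, if a service has only 3 nodes and 2 of them have already been degraded because of problems, the third should not be degraded as well; at that point someone has to step in manually and find out why this is happening.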
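
The cap can be illustrated with a minimal standalone sketch. The class name, the per-service map, and the use of a synchronized method in place of the ipsLock shown above are all assumptions made to keep the example short.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class DegradeCapSketch {
    private final int maxDegradedIps;
    // Hypothetical registry: service key -> IPs currently degraded for that service.
    private final Map<String, Set<String>> degradedByService = new ConcurrentHashMap<>();

    DegradeCapSketch(int maxDegradedIps) {
        this.maxDegradedIps = maxDegradedIps;
    }

    // Returns true only when the IP is not yet degraded and the cap is already reached.
    synchronized boolean isReachMaxDegradeIpCount(String service, String ip) {
        Set<String> ips = degradedByService.computeIfAbsent(service, k -> ConcurrentHashMap.newKeySet());
        if (ips.contains(ip)) {
            return false; // already degraded, so it may be degraded further
        }
        if (ips.size() < maxDegradedIps) {
            ips.add(ip);  // reserve a slot for this IP
            return false;
        }
        return true;      // cap reached, leave this node alone
    }
}
```

With a cap of 2, the first two abnormal IPs of a service are accepted, a third is refused, and an IP that is already in the set is always allowed through.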

Then WeightDegradeStrategy#degrade is called to degrade the node's weight.
WeightDegradeStrategy#degrade

public void degrade(MeasureResultDetail measureResultDetail) {
    // Call LogPrintDegradeStrategy#degrade, which only prints the log
    super.degrade(measureResultDetail);

    if (measureResultDetail.isLogOnly()) {
        return;
    }

    InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
    String appName = statDimension.getAppName();

    ProviderInfo providerInfo = statDimension.getProviderInfo();
    // Return immediately if the provider has been removed or is still warming up
    if (providerInfo == null || providerInfo.getStatus() == ProviderStatus.WARMING_UP) {
        return;
    }
    // Current weight of the provider
    int currentWeight = ProviderInfoWeightManager.getWeight(providerInfo);
    // Degrade rate applied to the weight
    double weightDegradeRate = FaultToleranceConfigManager.getWeightDegradeRate(appName);
    // Minimum weight, 1 by default
    int degradeLeastWeight = FaultToleranceConfigManager.getDegradeLeastWeight(appName);
    // Degrade rate * current weight
    int degradeWeight = CalculateUtils.multiply(currentWeight, weightDegradeRate);
    // The result must not drop below the minimum weight
    degradeWeight = degradeWeight < degradeLeastWeight ? degradeLeastWeight : degradeWeight;

    // degrade weight of this provider info
    boolean success = ProviderInfoWeightManager.degradeWeight(providerInfo, degradeWeight);
    if (success && LOGGER.isInfoEnabled(appName)) {
        LOGGER.infoWithApp(appName, "the weight was degraded. serviceUniqueName:["
            + statDimension.getService() + "],ip:["
            + statDimension.getIp() + "],origin weight:["
            + currentWeight + "],degraded weight:["
            + degradeWeight + "].");
    }
}

//ProviderInfoWeightManager
public static boolean degradeWeight(ProviderInfo providerInfo, int weight) {
    providerInfo.setStatus(ProviderStatus.DEGRADED);
    providerInfo.setWeight(weight);
    return true;
}

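This method takes the current weight, multiplies it by the degrade rate, and makes sure the result does not drop below the minimum weight.
Finally it calls ProviderInfoWeightManager to set the node's status to DEGRADED and apply the new weight.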
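
The arithmetic is easy to see in isolation. In this sketch the degrade helper is hypothetical, the rate of 0.5 used in the note below is chosen purely for illustration (the configured default is not shown here), and truncation stands in for CalculateUtils.multiply.

```java
final class DegradeWeightSketch {
    // Multiply the current weight by the degrade rate, but never go below the floor.
    static int degrade(int currentWeight, double degradeRate, int leastWeight) {
        int degraded = (int) (currentWeight * degradeRate); // truncation is an assumption
        return Math.max(degraded, leastWeight);
    }
}
```

Starting from weight 100 with an illustrative rate of 0.5 and a minimum of 1, repeated abnormal windows would give 50, 25, 12, 6, 3, 1, and then stay at 1.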

  2. If the node is healthy

ServiceHorizontalRegulationStrategy#isExistInTheDegradeList is called to check whether the current node has been degraded.
ServiceHorizontalRegulationStrategy#isExistInTheDegradeList

public boolean isExistInTheDegradeList(MeasureResultDetail measureResultDetail) {
    InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
    ConcurrentHashSet<String> ips = getDegradeProviders(statDimension.getDimensionKey());
    return ips != null && ips.contains(statDimension.getIp());
}

When isReachMaxDegradeIpCount is called, the IP being degraded is added to the ips set, so here it only needs to be looked up.

If the node has been degraded, WeightRecoverStrategy#recover is called to recover it.
WeightRecoverStrategy#recover

public void recover(MeasureResultDetail measureResultDetail) {
    InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
    ProviderInfo providerInfo = statDimension.getProviderInfo();
    // if provider is removed or provider is warming up
    if (providerInfo == null || providerInfo.getStatus() == ProviderStatus.WARMING_UP) {
        return;
    }
    Integer currentWeight = ProviderInfoWeightManager.getWeight(providerInfo);
    if (currentWeight == -1) {
        return;
    }

    String appName = statDimension.getAppName();
    // Defaults to 2
    double weightRecoverRate = FaultToleranceConfigManager.getWeightRecoverRate(appName);
    // So each recovery step can at most double the weight; it does not jump straight back to originWeight
    int recoverWeight = CalculateUtils.multiply(currentWeight, weightRecoverRate);
    int originWeight = statDimension.getOriginWeight();

    // recover weight of this provider info
    if (recoverWeight >= originWeight) {
        measureResultDetail.setRecoveredOriginWeight(true);
        // Set the provider status to AVAILABLE and restore its weight
        ProviderInfoWeightManager.recoverOriginWeight(providerInfo, originWeight);
        if (LOGGER.isInfoEnabled(appName)) {
            LOGGER.infoWithApp(appName, "the weight was recovered to origin value. serviceUniqueName:["
                + statDimension.getService() + "],ip:["
                + statDimension.getIp() + "],origin weight:["
                + currentWeight + "],recover weight:["
                + originWeight + "].");
        }
    } else {
        measureResultDetail.setRecoveredOriginWeight(false);
        boolean success = ProviderInfoWeightManager.recoverWeight(providerInfo, recoverWeight);
        if (success && LOGGER.isInfoEnabled(appName)) {
            LOGGER.infoWithApp(appName, "the weight was recovered. serviceUniqueName:["
                + statDimension.getService() + "],ip:["
                + statDimension.getIp() + "],origin weight:["
                + currentWeight + "],recover weight:["
                + recoverWeight + "].");
        }
    }
}

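This method is straightforward: it multiplies the current weight by the recover rate, and once the result reaches originWeight it restores the original weight and marks the provider as AVAILABLE; otherwise it applies the partially recovered weight. See the comments in the code above.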
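
The step-by-step recovery can be sketched the same way. The recover helper below is hypothetical and uses truncation in place of CalculateUtils.multiply; the rate of 2 is the default mentioned in the code comment.

```java
final class RecoverWeightSketch {
    // Multiply the current weight by the recover rate, capped at the original weight.
    static int recover(int currentWeight, double recoverRate, int originWeight) {
        int recovered = (int) (currentWeight * recoverRate); // truncation is an assumption
        return Math.min(recovered, originWeight);
    }
}
```

A node degraded to 12 with an original weight of 100 would recover over successive healthy windows as 24, 48, 96, and finally 100, so traffic returns gradually rather than all at once.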

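Summary

In general, FaultToleranceModule is divided into two parts:

  1. FaultToleranceSubscriber, which subscribes to the synchronous and asynchronous result events
  2. Statistics gathered from those invocation events, combined with some built-in strategies, which carry out service degradation and recovery.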

Origin www.cnblogs.com/luozhiyun/p/11333036.html