Spring Cloud Alibaba Sentinel: how it works

        The previous article covered how to use Sentinel. This article goes a step further and analyzes how Sentinel works by reading its source code, which is available in the alibaba/Sentinel repository on GitHub.

Table of contents

I Core concepts

II Source code analysis

1 Entry point of the source code

2 Obtaining the context

3 Building and executing the chain of responsibility

III Sliding time window algorithm

1 Storing data in the sliding time window

2 Reading data from the sliding time window


I Core concepts

        Before reading the source code, we need a few core concepts that make the code much easier to follow. Let's first look at the architecture diagram from Sentinel's official website:

Slot chain: a chain of function slots, each with its own responsibility. Sentinel ships with several built-in slots, and you can also add custom slots; custom slots are executed before FlowSlot. The built-in slots are (source: Sentinel official website):

        NodeSelectorSlot: collects the invocation paths of resources and stores them in a tree structure. The nodes of this tree are introduced below;

        ClusterBuilderSlot: stores resource statistics and caller information, such as the resource's QPS (requests per second), RT (response time) and thread count. These statistics serve as the basis for multi-dimensional flow control and degradation and correspond to the cluster point link. This slot builds the ClusterNode;

        StatisticSlot: records and aggregates runtime metrics in different dimensions for real-time monitoring. Under the hood it uses the sliding time window algorithm, which is introduced later in this article;

        The remaining slots, ParamFlowSlot, SystemSlot, AuthoritySlot, FlowSlot and DegradeSlot, each perform one kind of flow-control or circuit-breaking check: they decide whether a request triggers the corresponding rule. If it does, the request is limited or degraded; otherwise it passes normally.

Node: stores statistics for a resource in different dimensions. There are several kinds of node:

        StatisticNode: the basic statistics node that performs the actual counting;

        EntranceNode: the entrance node, which counts the overall traffic of one Context; its statistical dimension is the context;

        DefaultNode: counts the traffic of one Resource within the current context; its statistical dimension is context + resource;

        ClusterNode: holds the traffic of one Resource across all contexts; its statistical dimension is the resource;
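To make the three statistical dimensions concrete, the following toy model keys its counters the same way the nodes above do. It is not Sentinel code: `ToyNodes`, its maps and `record` are hypothetical names, and plain integers stand in for the real sliding-window statistics.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of the node dimensions; not Sentinel's node classes.
final class ToyNodes {
    // EntranceNode analogue: one counter per context name
    final Map<String, Integer> entrance = new HashMap<>();
    // DefaultNode analogue: one counter per (context, resource) pair
    final Map<String, Integer> byContextAndResource = new HashMap<>();
    // ClusterNode analogue: one counter per resource, shared by all contexts
    final Map<String, Integer> cluster = new HashMap<>();

    // Record one call of a resource inside a named context
    void record(String contextName, String resource) {
        entrance.merge(contextName, 1, Integer::sum);
        byContextAndResource.merge(contextName + "|" + resource, 1, Integer::sum);
        cluster.merge(resource, 1, Integer::sum);
    }
}
```

If two contexts each call `getUser` once, the ClusterNode-style counter for `getUser` reaches 2 while each DefaultNode-style counter stays at 1, which is why a resource-wide limit is checked against the ClusterNode.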

        To make this easier to follow, the diagram below reorganizes the node-related parts of the architecture diagram above:

Context: the context of a resource operation. Every resource operation belongs to a Context. If the program does not specify one, a Context named "sentinel_default_context" is created. One Context may span several resource operations during its lifetime; it is cleaned up when the last resource in it exits, which marks the end of the Context's lifetime;

Entry: represents one resource operation and holds information about the current call. Multiple resource operations within one Context correspond to multiple Entry objects, and each Entry instance records its parent/child relationships with the others.
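The Context/Entry lifecycle can be sketched with a ThreadLocal and a parent pointer per entry. This is a hypothetical model (`ToyContextHolder`, `ToyContext` and `ToyEntry` are made-up names), not Sentinel's ContextUtil or CtEntry; following the description above, it simply removes the context when the outermost entry exits.

```java
// Hypothetical model of the Context / Entry lifecycle; not Sentinel's ContextUtil or CtEntry.
final class ToyContextHolder {
    static final String DEFAULT_NAME = "sentinel_default_context";
    // One context per thread, kept in a ThreadLocal
    private static final ThreadLocal<ToyContext> HOLDER = new ThreadLocal<>();

    static final class ToyContext {
        final String name;
        ToyEntry curEntry;                // innermost entry that has not exited yet

        ToyContext(String name) {
            this.name = name;
        }
    }

    static final class ToyEntry {
        final String resource;
        final ToyEntry parent;            // the entry that was active when this one started
        final ToyContext context;

        ToyEntry(String resource, ToyEntry parent, ToyContext context) {
            this.resource = resource;
            this.parent = parent;
            this.context = context;
        }

        // Exiting makes the parent current again; when the outermost entry exits, the context ends
        void exit() {
            context.curEntry = parent;
            if (parent == null) {
                HOLDER.remove();
            }
        }
    }

    static ToyContext current() {
        return HOLDER.get();
    }

    // Start a resource operation, creating the default context if the thread has none
    static ToyEntry entry(String resource) {
        ToyContext ctx = HOLDER.get();
        if (ctx == null) {
            ctx = new ToyContext(DEFAULT_NAME);
            HOLDER.set(ctx);
        }
        ToyEntry e = new ToyEntry(resource, ctx.curEntry, ctx);
        ctx.curEntry = e;
        return e;
    }
}
```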

II Source code analysis

1 Entry point of the source code

        Now let's dig into the source code behind flow control and degradation. From using Sentinel we can tell that it relies on AOP (aspect-oriented programming): it does not intrude on our business code, yet every request triggers the flow-control and degradation checks. When we define custom resources we use the @SentinelResource annotation, so we can use it as a starting point to find the corresponding aspect in the source code: SentinelResourceAspect, which defines the pointcut, the advice and so on.

// Aspect
@Aspect
public class SentinelResourceAspect extends AbstractSentinelAspectSupport {
    
    // Pointcut
    @Pointcut("@annotation(com.alibaba.csp.sentinel.annotation.SentinelResource)")
    public void sentinelResourceAnnotationPointcut() {
    }
    // Around advice
    @Around("sentinelResourceAnnotationPointcut()")
    public Object invokeResourceWithSentinel(ProceedingJoinPoint pjp) throws Throwable {
        ....
        String resourceName = getResourceName(annotation.value(), originMethod);
        EntryType entryType = annotation.entryType();
        int resourceType = annotation.resourceType();
        Entry entry = null;
        try {
            // This corresponds to the resource-operation object Entry described above
            entry = SphU.entry(resourceName, resourceType, entryType, pjp.getArgs());
            // The flow-control and degradation rules passed: invoke the original method
            return pjp.proceed();
        } catch (BlockException ex) {
            // Blocked by flow control or degradation
            return handleBlockException(pjp, annotation, ex);
        } catch (Throwable ex) {
            ...
        } finally {
            if (entry != null) {
                entry.exit(1, pjp.getArgs());
            }
        }
    }
}
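Stripped of Spring AOP, the advice above is a try/catch/finally around the business call. The sketch below reproduces that shape in plain Java; `ToyAround`, `Gate` and `ToyBlockException` are hypothetical stand-ins for SphU.entry, Entry.exit and BlockException, not Sentinel APIs.

```java
import java.util.function.Supplier;

// Hypothetical plain-Java version of the try/catch/finally shape of invokeResourceWithSentinel.
final class ToyAround {
    static final class ToyBlockException extends RuntimeException {
        ToyBlockException(String msg) {
            super(msg);
        }
    }

    interface Gate {
        void enter(String resource);      // throws ToyBlockException when the call is blocked
        void exit(String resource);
    }

    static <T> T guard(String resource, Gate gate, Supplier<T> body, Supplier<T> blockHandler) {
        boolean entered = false;
        try {
            gate.enter(resource);         // like SphU.entry(...)
            entered = true;
            return body.get();            // like pjp.proceed()
        } catch (ToyBlockException ex) {
            return blockHandler.get();    // like handleBlockException(...)
        } finally {
            if (entered) {
                gate.exit(resource);      // like entry.exit(...), only when entry succeeded
            }
        }
    }
}
```

As in the aspect, exit only runs when the entry succeeded (the `entry != null` check), so a blocked call never reaches exit through this path.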

As the code shows, the key step is the creation of the Entry object, which covers essentially all of Sentinel's processing. Let's follow it further, skip the chain of overloaded methods, and go directly to the following code:

@Override
public Entry entryWithType(String name, int resourceType, EntryType entryType, int count, boolean prioritized,
                           Object[] args) throws BlockException {
    // Step 1: wrap the resource, based on the resource name and the information in the @SentinelResource annotation
    StringResourceWrapper resource = new StringResourceWrapper(name, entryType, resourceType);
    // Step 2: enter Sentinel's processing flow. prioritized defaults to false, meaning the flow does not run in priority mode
    return entryWithPriority(resource, count, prioritized, args);
}

Next comes entryWithPriority, which does three things: 1. obtains the context; 2. builds the chain of responsibility, using the SPI extension mechanism; 3. executes the chain of responsibility.

private Entry entryWithPriority(ResourceWrapper resourceWrapper, int count, boolean prioritized, Object... args)
    throws BlockException {
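    // Step 1: obtain the context
    // Following the code shows that it is read from a ThreadLocal
    Context context = ContextUtil.getContext();
    // A NullContext means the number of contexts has exceeded the threshold: only create the Entry, without rule checking
    if (context instanceof NullContext) {
        return new CtEntry(resourceWrapper, null, context);
    }

    // If there is no context yet, initialize one
    if (context == null) {
        // The default context name is sentinel_default_context, matching the core concepts above
        context = InternalContextUtil.internalEnter(Constants.CONTEXT_DEFAULT_NAME);
    }
    // If the global switch is off, only create the Entry
    if (!Constants.ON) {
        return new CtEntry(resourceWrapper, null, context);
    }

    /* Step 2: build the chain of responsibility
     * Key point: the processing chain is built through SPI extension, and its data structure is a
     * singly linked list. The list decouples the rule logic: there are many flow-control and
     * degradation rules, and putting them all in one place would couple them tightly. Following
     * OOP design principles, each slot has a single responsibility.
     */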
    ProcessorSlot<Object> chain = lookProcessChain(resourceWrapper);


    if (chain == null) {
        return new CtEntry(resourceWrapper, null, context);
    }

    Entry e = new CtEntry(resourceWrapper, chain, context);
    try {
        // Step 3: execute the chain of responsibility against the context and the resource
        chain.entry(context, resourceWrapper, null, count, prioritized, args);
    } catch (BlockException e1) {
        e.exit(count, args);
        throw e1;
    } catch (Throwable e1) {
        RecordLog.info("Sentinel unexpected exception", e1);
    }
    return e;
}

2 Obtaining the context

        Obtaining the context works as follows: 1. look up the context cached in the current thread; 2. if the current thread has no context yet, initialize one. Let's focus on initialization. It uses double-checked locking, the same technique used in the singleton pattern, to create the EntranceNode in a thread-safe way. Following the code leads directly to the trueEnter method:

protected static Context trueEnter(String name, String origin) {
        // Read the context from the current thread again
        Context context = contextHolder.get();
        if (context == null) {
            // No context in the current thread: look up the node in the cache
            Map<String, DefaultNode> localCacheNameMap = contextNameNodeMap;
            DefaultNode node = localCacheNameMap.get(name);
            if (node == null) {
                // The node does not exist: if the number of contexts exceeds the limit, return immediately and skip flow-control checks
                if (localCacheNameMap.size() > Constants.MAX_CONTEXT_NAME_SIZE) {
                    setNullContext();
                    return NULL_CONTEXT;
                } else {
                    // The familiar double-checked locking pattern, for thread safety
                    LOCK.lock();
                    try {
                        node = contextNameNodeMap.get(name);
                        if (node == null) {
                            if (contextNameNodeMap.size() > Constants.MAX_CONTEXT_NAME_SIZE) {
                                setNullContext();
                                return NULL_CONTEXT;
                            } else {
                                // Create the node. It is an EntranceNode, whose statistical dimension is the context (see the core concepts)
                                node = new EntranceNode(new StringResourceWrapper(name, EntryType.IN), null);
                                // Nodes are stored in a tree (see the core concepts); this is where the tree is built
                                Constants.ROOT.addChild(node);

                                Map<String, DefaultNode> newMap = new HashMap<>(contextNameNodeMap.size() + 1);
                                newMap.putAll(contextNameNodeMap);
                                newMap.put(name, node);
                                contextNameNodeMap = newMap;
                            }
                        }
                    } finally {
                        LOCK.unlock();
                    }
                }
            }
            // Create the context from the node and the context name
            context = new Context(node, name);
            context.setOrigin(origin);
            contextHolder.set(context);
        }

        return context;
    }
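trueEnter (and lookProcessChain later on) combine three ideas: lock-free reads from a map that is never mutated in place, double-checked creation under a lock, and a size cap. The following generic sketch isolates that pattern; `CopyOnWriteCache` and its methods are hypothetical, and the `volatile` field is this sketch's own choice for publishing the new map safely.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

// Hypothetical sketch of double-checked, copy-on-write creation with a size cap.
final class CopyOnWriteCache<K, V> {
    private final ReentrantLock lock = new ReentrantLock();
    private final int maxSize;
    // Readers always see a complete, never-modified snapshot
    private volatile Map<K, V> snapshot = Collections.emptyMap();

    CopyOnWriteCache(int maxSize) {
        this.maxSize = maxSize;
    }

    /** Returns the cached value, creating it once; returns null when the cap is reached. */
    V getOrCreate(K key, Function<K, V> factory) {
        V v = snapshot.get(key);                  // first check, without the lock
        if (v != null) {
            return v;
        }
        lock.lock();
        try {
            v = snapshot.get(key);                // second check, under the lock
            if (v == null) {
                if (snapshot.size() >= maxSize) {
                    return null;                  // comparable to returning NULL_CONTEXT
                }
                v = factory.apply(key);
                Map<K, V> copy = new HashMap<>(snapshot);
                copy.put(key, v);
                snapshot = copy;                  // publish the new map with a single write
            }
            return v;
        } finally {
            lock.unlock();
        }
    }

    int size() {
        return snapshot.size();
    }
}
```

Replacing the whole map instead of mutating it lets the hot read path use a plain HashMap without locking; writes happen only once per context name or resource, so the copying cost stays small.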

3 Building and executing the chain of responsibility

        Once the context is built, the next step is to build the chain of responsibility. Let's look at how the slot chain is initialized. Remember the slot order in the architecture diagram? It shows up here. Let's go straight to the code:

// CtSph.lookProcessChain: obtain the chain of responsibility
ProcessorSlot<Object> lookProcessChain(ResourceWrapper resourceWrapper) {
        // Look up the chain in the cache by resource
        ProcessorSlotChain chain = chainMap.get(resourceWrapper);
        if (chain == null) {
            // Double-checked locking: thread-safe, and ensures the chain is created only once
            synchronized (LOCK) {
                chain = chainMap.get(resourceWrapper);
                if (chain == null) {
                    // If the chain cache has reached its maximum size, return null
                    if (chainMap.size() >= Constants.MAX_SLOT_CHAIN_SIZE) {
                        return null;
                    }
                    // Initialize the chain, then put it into the cache
                    chain = SlotChainProvider.newSlotChain();
                    Map<ResourceWrapper, ProcessorSlotChain> newMap = new HashMap<ResourceWrapper, ProcessorSlotChain>(
                        chainMap.size() + 1);
                    newMap.putAll(chainMap);
                    newMap.put(resourceWrapper, chain);
                    chainMap = newMap;
                }
            }
        }
        return chain;
    }

// SlotChainProvider.newSlotChain: initialize the chain of responsibility
public static ProcessorSlotChain newSlotChain() {

        if (slotChainBuilder != null) {
            return slotChainBuilder.build();
        }
        // Load the chain builder from the SPI configuration file
        // META-INF/services/com.alibaba.csp.sentinel.slotchain.SlotChainBuilder
        slotChainBuilder = SpiLoader.of(SlotChainBuilder.class).loadFirstInstanceOrDefault();

        if (slotChainBuilder == null) {
            RecordLog.warn("[SlotChainProvider] Wrong state when resolving slot chain builder, using default");
            slotChainBuilder = new DefaultSlotChainBuilder();
        } else {
            RecordLog.info("[SlotChainProvider] Global slot chain builder resolved: {}",
                slotChainBuilder.getClass().getCanonicalName());
        }
        // Build the chain with the chain builder
        return slotChainBuilder.build();
    }


// DefaultSlotChainBuilder.build: actually builds the chain of responsibility, i.e. the slots
public ProcessorSlotChain build() {

        ProcessorSlotChain chain = new DefaultProcessorSlotChain();
        // Read the SPI configuration file
        // META-INF/services/com.alibaba.csp.sentinel.slotchain.ProcessorSlot
        List<ProcessorSlot> sortedSlotList = SpiLoader.of(ProcessorSlot.class).loadInstanceListSorted();
        // Validate the loaded slots, skipping any that are not AbstractLinkedProcessorSlot instances
        for (ProcessorSlot slot : sortedSlotList) {
            if (!(slot instanceof AbstractLinkedProcessorSlot)) {
                RecordLog.warn("The ProcessorSlot(" + slot.getClass().getCanonicalName() + ") is not an instance of AbstractLinkedProcessorSlot, can't be added into ProcessorSlotChain");
                continue;
            }

            chain.addLast((AbstractLinkedProcessorSlot<?>) slot);
        }

        return chain;
    }

As the analysis above shows, the chain is created by reading configuration files through the SPI extension mechanism. The paths in the comments above are relative: both files live in the sentinel-core submodule. Here are their contents.

META-INF/services/com.alibaba.csp.sentinel.slotchain.SlotChainBuilder:

com.alibaba.csp.sentinel.slots.DefaultSlotChainBuilder

META-INF/services/com.alibaba.csp.sentinel.slotchain.ProcessorSlot:

com.alibaba.csp.sentinel.slots.nodeselector.NodeSelectorSlot
com.alibaba.csp.sentinel.slots.clusterbuilder.ClusterBuilderSlot
com.alibaba.csp.sentinel.slots.logger.LogSlot
com.alibaba.csp.sentinel.slots.statistic.StatisticSlot
com.alibaba.csp.sentinel.slots.block.authority.AuthoritySlot
com.alibaba.csp.sentinel.slots.system.SystemSlot
com.alibaba.csp.sentinel.slots.block.flow.FlowSlot
com.alibaba.csp.sentinel.slots.block.degrade.DegradeSlot

Look at the class names in the file above. Familiar? They match the slots in the architecture diagram, and their top-to-bottom order is exactly the order in which the slots are called. Keep this order in mind: the singly linked list of the chain is built from top to bottom according to this file, so the chain executes in this order.
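The effect of the file order can be illustrated with a minimal singly linked chain. `ToySlot` and `ToyChain` are hypothetical simplifications of AbstractLinkedProcessorSlot and DefaultSlotChainBuilder: each slot does its own work and then calls `fireEntry` to hand control to the next one.

```java
import java.util.List;

// Hypothetical simplification of AbstractLinkedProcessorSlot: each slot knows only its successor.
abstract class ToySlot {
    private ToySlot next;

    void setNext(ToySlot next) {
        this.next = next;
    }

    // Each concrete slot does its own work here and decides whether to call fireEntry
    abstract void entry(List<String> trace);

    // Template method shared by all slots: hand the call to the next slot, if any
    protected void fireEntry(List<String> trace) {
        if (next != null) {
            next.entry(trace);
        }
    }
}

// Hypothetical simplification of DefaultSlotChainBuilder
final class ToyChain {
    // Link the slots in list order, as the builder links them in SPI-file order
    static ToySlot build(List<ToySlot> ordered) {
        for (int i = 0; i + 1 < ordered.size(); i++) {
            ordered.get(i).setNext(ordered.get(i + 1));
        }
        return ordered.isEmpty() ? null : ordered.get(0);
    }

    // A slot that records its name, then passes control to the next slot
    static ToySlot named(String name) {
        return new ToySlot() {
            @Override
            void entry(List<String> trace) {
                trace.add(name);
                fireEntry(trace);
            }
        };
    }
}
```

A rule slot would throw instead of calling `fireEntry` when its check fails, which stops the rest of the chain.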

        Finally, let's look at how the chain operates on resources, including statistics, flow control and degradation. According to the configuration file above, the first slot in the chain is NodeSelectorSlot.

        NodeSelectorSlot.entry is the starting point of the chain: everything is called from here, and the DefaultNode is created here.

    public void entry(Context context, ResourceWrapper resourceWrapper, Object obj, int count, boolean prioritized, Object... args)
        throws Throwable {
        // Look up the DefaultNode by context name
        // From the core concepts, the statistical dimension of a DefaultNode is context + resource
        DefaultNode node = map.get(context.getName());
        if (node == null) {
            // The DefaultNode has not been created yet: create it with double-checked locking
            synchronized (this) {
                node = map.get(context.getName());
                if (node == null) {
                    node = new DefaultNode(resourceWrapper, null);
                    HashMap<String, DefaultNode> cacheMap = new HashMap<String, DefaultNode>(map.size());
                    cacheMap.putAll(map);
                    // Put the new node into the cache
                    cacheMap.put(context.getName(), node);
                    map = cacheMap;
                    // Attach the new node to the node tree
                    ((DefaultNode) context.getLastNode()).addChild(node);
                }

            }
        }

        context.setCurNode(node);
        // 触发下一个节点
        fireEntry(context, resourceWrapper, node, count, prioritized, args);
    }

    // Template method pattern: every slot extends AbstractLinkedProcessorSlot, and the parent class defines the method that triggers the next slot
    public void fireEntry(Context context, ResourceWrapper resourceWrapper, Object obj, int count, boolean prioritized, Object... args)
        throws Throwable {
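        // If the chain has not finished, run the next slot. next is set by the current slot, following the top-to-bottom
        // order of the configuration file; later slots call this method often, and the next slot always depends on the caller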
        if (next != null) {
            next.transformEntry(context, resourceWrapper, obj, count, prioritized, args);
        }
    }
    // Hand over to the next slot
    void transformEntry(Context context, ResourceWrapper resourceWrapper, Object o, int count, boolean prioritized, Object... args)
        throws Throwable {
        T t = (T)o;
        entry(context, resourceWrapper, t, count, prioritized, args);
    }

    

From the slot order in the configuration file, the next call is ClusterBuilderSlot.entry. This method initializes the ClusterNode and links it to the DefaultNode. The ClusterNode stores resource call information and caller information, such as RT and QPS, which serve as the basis for flow control and degradation.

public void entry(Context context, ResourceWrapper resourceWrapper, DefaultNode node, int count,
                      boolean prioritized, Object... args)
        throws Throwable {
        // Create the ClusterNode with double-checked locking
        // From the core concepts, the statistical dimension of a ClusterNode is the resource
        if (clusterNode == null) {
            synchronized (lock) {
                if (clusterNode == null) {
                    // Create the cluster node.
                    clusterNode = new ClusterNode(resourceWrapper.getName(), resourceWrapper.getResourceType());
                    HashMap<ResourceWrapper, ClusterNode> newMap = new HashMap<>(Math.max(clusterNodeMap.size(), 16));
                    newMap.putAll(clusterNodeMap);
                    newMap.put(node.getId(), clusterNode);

                    clusterNodeMap = newMap;
                }
            }
        }
        // Link the ClusterNode to the DefaultNode
        node.setClusterNode(clusterNode);

        // Determine the origin (caller) of the request
        if (!"".equals(context.getOrigin())) {
            Node originNode = node.getClusterNode().getOrCreateOriginNode(context.getOrigin());
            context.getCurEntry().setOriginNode(originNode);
        }
        // Move on to the next slot
        fireEntry(context, resourceWrapper, node, count, prioritized, args);
    }

Next comes StatisticSlot (after LogSlot). This is a key slot: it is the entry point of one of Sentinel's core algorithms, the sliding time window algorithm, which performs the data statistics. Let's look at how the data is recorded here; the sliding time window algorithm itself is explained in detail later, so we won't break it down yet.

public void entry(Context context, ResourceWrapper resourceWrapper, DefaultNode node, int count,
                      boolean prioritized, Object... args) throws Throwable {
        try {
            // First call all remaining slots in the chain, so that the rules are checked
            fireEntry(context, resourceWrapper, node, count, prioritized, args);

            // Increase the thread count. This uses the atomic class LongAdder; interested readers can refer to my earlier article about it
            node.increaseThreadNum();
            // Increase the number of passed requests (sliding time window algorithm)
            node.addPassRequest(count);

            ......
    }
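Notice that StatisticSlot calls fireEntry before it updates any counter, so it only records a pass after every slot behind it has let the request through. A hypothetical sketch of that ordering follows. `ToyStatisticSlot` is not a Sentinel class, the catch branch that counts blocked requests is an assumption about the part elided with "......", and the matching thread-count decrement on exit is not modelled.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of the "run the rules first, then count" ordering in StatisticSlot.
final class ToyStatisticSlot {
    static final class ToyBlockedException extends RuntimeException {}

    // Stands in for the rule slots behind StatisticSlot (FlowSlot, DegradeSlot, ...)
    interface Downstream {
        void check(int count);
    }

    final LongAdder threadNum = new LongAdder();
    final LongAdder passQps = new LongAdder();
    final LongAdder blockQps = new LongAdder();

    void entry(Downstream rules, int count) {
        try {
            rules.check(count);           // like fireEntry(...): run the remaining slots first
            threadNum.increment();        // like node.increaseThreadNum()
            passQps.add(count);           // like node.addPassRequest(count)
        } catch (ToyBlockedException e) {
            blockQps.add(count);          // a rule slot rejected the request (assumed handling)
            throw e;
        }
    }
}
```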

The three slots above prepare for flow control and degradation and record the results afterwards. Next come the slots that check the actual rules. Due to space constraints, we only walk through the source of FlowSlot and DegradeSlot; interested readers can explore the other slots on their own.

        FlowSlot is the slot for flow-control rules: it checks them against the resource statistics collected earlier. This also fills in the fields used when persisting flow-control rules in the previous article; the corresponding resource fields and enum values can be found here.

   
public void entry(Context context, ResourceWrapper resourceWrapper, DefaultNode node, int count,
                      boolean prioritized, Object... args) throws Throwable {
    // Check and apply the flow rules
    checkFlow(resourceWrapper, context, node, count, prioritized);
    // Trigger the next slot
    fireEntry(context, resourceWrapper, node, count, prioritized, args);
}

// Following the call chain leads to this method
public void checkFlow(Function<String, Collection<FlowRule>> ruleProvider, ResourceWrapper resource,
                      Context context, DefaultNode node, int count, boolean prioritized) throws BlockException {
    // Nothing to check if the rule provider or the resource is null
    if (ruleProvider == null || resource == null) {
        return;
    }
    // Get all flow rules for the resource name; following the code shows they are managed by FlowRuleManager
    Collection<FlowRule> rules = ruleProvider.apply(resource.getName());
    if (rules != null) {
        // Check each rule; if a check fails, throw a FlowException
        // Sentinel throws a BlockException when a flow rule is triggered, and FlowException is a subclass of BlockException
        for (FlowRule rule : rules) {
            if (!canPassCheck(rule, context, node, count, prioritized)) {
                throw new FlowException(rule.getLimitApp(), rule);
            }
        }
    }
}

Here is the source of FlowRule, which contains the fields used when persisting flow-control rules. Each field corresponds to a flow-control setting configured in the previous article; when persisting rules, configure them with the following fields and their enum values.

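    // Threshold type: 1 = QPS (default), 0 = number of concurrent threads
    private int grade = RuleConstant.FLOW_GRADE_QPS;

    // Threshold (per instance)
    private double count;

    // Flow-control mode: 0 = direct, 1 = associated, 2 = chain
    private int strategy = RuleConstant.STRATEGY_DIRECT;

    // The associated resource when the mode is "associated"
    private String refResource;

    // Control behavior: 0 = fast fail, 1 = warm up, 2 = queueing
    private int controlBehavior = RuleConstant.CONTROL_BEHAVIOR_DEFAULT;

    // Warm-up period (seconds)
    private int warmUpPeriodSec = 10;

    // Maximum queueing time (milliseconds)
    private int maxQueueingTimeMs = 500;

    // Cluster mode
    private boolean clusterMode;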
    private boolean clusterMode;

Warm-up and queueing rely on two algorithms, the token bucket algorithm and the leaky bucket algorithm. They will be covered in detail in a later article on algorithms; here is a brief explanation.

Token bucket algorithm: the system adds tokens to the bucket at a constant rate. A request must take a token from the bucket before it is processed; when the bucket is empty, the request is rejected.

Leaky bucket algorithm: requests first enter the leaky bucket, which releases them for processing at a fixed rate; when the number of requests in the bucket exceeds its capacity, new requests are rejected directly.
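Both algorithms fit in a few lines if time is passed in explicitly. These are generic textbook sketches with hypothetical names, not Sentinel's WarmUpController or RateLimiterController. In particular, this leaky bucket only decides admit or reject, whereas a queueing limiter would also delay admitted requests so that they leave at the fixed rate.

```java
// Hypothetical textbook sketches with an explicit clock (nowMs); not Sentinel's controllers.
final class Buckets {
    static final class TokenBucket {
        private final long capacity;
        private final double tokensPerMs;
        private double tokens;
        private long lastMs;

        TokenBucket(long capacity, double tokensPerSecond, long startMs) {
            this.capacity = capacity;
            this.tokensPerMs = tokensPerSecond / 1000.0;
            this.tokens = capacity;       // start with a full bucket
            this.lastMs = startMs;
        }

        boolean tryAcquire(long nowMs) {
            // Refill at a constant rate, never above capacity
            tokens = Math.min(capacity, tokens + (nowMs - lastMs) * tokensPerMs);
            lastMs = nowMs;
            if (tokens >= 1) {
                tokens -= 1;
                return true;
            }
            return false;                 // no token: reject
        }
    }

    static final class LeakyBucket {
        private final long capacity;
        private final double leakPerMs;
        private double water;             // requests currently held in the bucket
        private long lastMs;

        LeakyBucket(long capacity, double leakPerSecond, long startMs) {
            this.capacity = capacity;
            this.leakPerMs = leakPerSecond / 1000.0;
            this.lastMs = startMs;
        }

        boolean tryAdd(long nowMs) {
            // Drain at a fixed rate
            water = Math.max(0, water - (nowMs - lastMs) * leakPerMs);
            lastMs = nowMs;
            if (water + 1 <= capacity) {
                water += 1;
                return true;
            }
            return false;                 // bucket full: reject
        }
    }
}
```

The token bucket lets a burst of up to `capacity` requests through at once, while the leaky bucket smooths the output rate; that difference is why the former fits warm-up and the latter fits queueing.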

        Next comes the canPassCheck method. It decides whether to run a cluster flow-control check or a local (single-instance) check; that code is not shown here. We analyze the local check, the passLocalCheck method, which does two things: 1. based on the rule and the context, it selects the node that holds the relevant statistics; 2. based on the control behavior configured in the rule, it calls canPass on the matching controller.

    private static boolean passLocalCheck(FlowRule rule, Context context, DefaultNode node, int acquireCount,
                                          boolean prioritized) {
        // Select the node for this request; following the code, it returns different nodes depending on the context and the rule
        // In "direct" mode, for example, it returns the ClusterNode
        Node selectedNode = selectNodeByRequesterAndStrategy(rule, context, node);
        if (selectedNode == null) {
            return true;
        }
        // Pick the controller that matches the rule's control behavior; there are four controllers:
        // DefaultController: fast fail
        // WarmUpController: warm up
        // RateLimiterController: queueing
        // WarmUpRateLimiterController: warm up + queueing; note that this one cannot be configured directly in the dashboard
        return rule.getRater().canPass(selectedNode, acquireCount, prioritized);
    }

Let's continue with fast fail as the example. A small digression: when reading source code, it helps to follow one simple branch at a time, here DefaultController.canPass. This method does two things: 1. computes the current usage; 2. checks it against the flow rule.

    @Override
    public boolean canPass(Node node, int acquireCount, boolean prioritized) {
        // Get the current thread count or QPS of the node; this is where the sliding window algorithm comes in
        int curCount = avgUsedTokens(node);
        // Current usage + requested amount > threshold
        if (curCount + acquireCount > count) {
            if (prioritized && grade == RuleConstant.FLOW_GRADE_QPS) {
                long currentTime;
                long waitInMs;
                currentTime = TimeUtil.currentTimeMillis();
                waitInMs = node.tryOccupyNext(currentTime, acquireCount, count);
                if (waitInMs < OccupyTimeoutProperty.getOccupyTimeout()) {
                    node.addWaitingRequest(currentTime + waitInMs, acquireCount);
                    node.addOccupiedPass(acquireCount);
                    sleep(waitInMs);
                    throw new PriorityWaitException(waitInMs);
                }
            }
            return false;
        }
        return true;
    }
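Ignoring the prioritized branch, fast fail reduces to one comparison. In this hypothetical sketch (`FastFail` is not a Sentinel class), a per-second counter stands in for avgUsedTokens(node); in Sentinel that usage comes from the node's sliding window and is updated by StatisticSlot, not by the controller.

```java
// Hypothetical sketch of the fast-fail decision; not Sentinel's DefaultController.
final class FastFail {
    private final double threshold;       // like FlowRule.count
    private int usedInCurrentSecond;      // toy stand-in for avgUsedTokens(node)

    FastFail(double threshold) {
        this.threshold = threshold;
    }

    boolean tryPass(int acquireCount) {
        // Reject when granting this request would push usage above the threshold
        if (usedInCurrentSecond + acquireCount > threshold) {
            return false;
        }
        usedInCurrentSecond += acquireCount;  // only passed requests consume quota
        return true;
    }

    // Toy replacement for the window rolling over to the next second
    void nextSecond() {
        usedInCurrentSecond = 0;
    }
}
```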

Finally, let's look at DegradeSlot, whose code differs somewhat from the other rule slots. DegradeSlot is the circuit-breaking slot. Its strategies are based on slow calls (response time), exception ratio and exception count, and these data are only known after the call has completed. So in entry, DegradeSlot only obtains the circuit breakers and checks their current state, while the rules are evaluated against the call results in exit. Let's first look at how the circuit breakers are obtained, and along the way at the DegradeRule entity class used for configuration.

    void performChecking(Context context, ResourceWrapper r) throws BlockException {
        // Get all circuit breakers for the resource
        List<CircuitBreaker> circuitBreakers = DegradeRuleManager.getCircuitBreakers(r.getName());
        if (circuitBreakers == null || circuitBreakers.isEmpty()) {
            return;
        }
        for (CircuitBreaker cb : circuitBreakers) {
            // Check the current circuit-breaker state (breaker states were also discussed in the previous article)
            if (!cb.tryPass(context)) {
                throw new DegradeException(cb.getRule().getLimitApp(), cb.getRule());
            }
        }
    }


public class DegradeRule extends AbstractRule {

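    // Strategy: 0 = slow-call ratio, 1 = exception ratio, 2 = exception count
    private int grade = RuleConstant.DEGRADE_GRADE_RT;

    // Threshold
    private double count;

    // Circuit-breaking duration (seconds)
    private int timeWindow;

    // Minimum number of requests before the breaker can trip
    private int minRequestAmount = RuleConstant.DEGRADE_DEFAULT_MIN_REQUEST_AMOUNT;
    
    // Slow-call ratio threshold
    private double slowRatioThreshold = 1.0d;
    
    // Statistics interval (milliseconds)
    private int statIntervalMs = 1000;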
}

enum State {
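        // Open: the circuit is broken and requests are rejected
        OPEN,
        // Half-open: after the break duration, the next request is let through; if it succeeds the breaker closes, otherwise it opens again
        HALF_OPEN,
        // Closed: the service is normal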
        CLOSED
    }


Next, the circuit-breaker check on exit. The exit method does two things: 1. checks whether the request was already blocked by another slot; if so, it returns without further checks; 2. gets the circuit breakers by resource name and evaluates the circuit-breaking rules.

There are two kinds of circuit breaker: ExceptionCircuitBreaker, for the exception-based rules, and ResponseTimeCircuitBreaker, for the response-time rules. Here we only look at ExceptionCircuitBreaker.

    public void onRequestComplete(Context context) {
        Entry entry = context.getCurEntry();
        if (entry == null) {
            return;
        }
        Throwable error = entry.getError();
        // Error counter of the current window
        SimpleErrorCounter counter = stat.currentWindow().value();
        // If this request threw an exception, increment the error count
        if (error != null) {
            counter.getErrorCount().add(1);
        }
        // Increment the total request count
        counter.getTotalCount().add(1);
        // Evaluate the circuit-breaking rule
        handleStateChangeWhenThresholdExceeded(error);
    }
    private void handleStateChangeWhenThresholdExceeded(Throwable error) {
        // If the breaker is already open, return immediately
        if (currentState.get() == State.OPEN) {
            return;
        }
        // If the breaker is half-open
        if (currentState.get() == State.HALF_OPEN) {
            // A successful request closes the breaker (the state is changed via CAS)
            if (error == null) {
                fromHalfOpenToClose();
            } else {
                // Otherwise reopen the breaker; this also computes when the next half-open attempt may start
                fromHalfOpenToOpen(1.0d);
            }
            return;
        }

        List<SimpleErrorCounter> counters = stat.values();
        long errCount = 0;
        long totalCount = 0;
        // Sum the error count and the total count over all windows
        for (SimpleErrorCounter counter : counters) {
            errCount += counter.errorCount.sum();
            totalCount += counter.totalCount.sum();
        }
        // Fewer requests than the minimum request amount: return
        if (totalCount < minRequestAmount) {
            return;
        }

        double curCount = errCount;
        // For the exception-ratio strategy, compute the ratio
        if (strategy == DEGRADE_GRADE_EXCEPTION_RATIO) {
            curCount = errCount * 1.0d / totalCount;
        }
        // If the exception count or ratio exceeds the threshold, open the breaker
        if (curCount > threshold) {
            transformToOpen(curCount);
        }
    }
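The state transitions above can be exercised with a small exception-ratio breaker. This is a simplified, hypothetical model (`ToyBreaker` is not Sentinel's ExceptionCircuitBreaker): it uses a single counting window instead of a LeapArray, takes the current time as a parameter, and resets its counters when it closes.

```java
// Hypothetical, simplified exception-ratio circuit breaker; not Sentinel's ExceptionCircuitBreaker.
final class ToyBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final double ratioThreshold;  // like DegradeRule.count for the exception-ratio strategy
    private final int minRequestAmount;   // like DegradeRule.minRequestAmount
    private final long recoveryMs;        // like timeWindow, but in milliseconds
    State state = State.CLOSED;
    private long nextRetryMs;
    private long errors;
    private long total;

    ToyBreaker(double ratioThreshold, int minRequestAmount, long recoveryMs) {
        this.ratioThreshold = ratioThreshold;
        this.minRequestAmount = minRequestAmount;
        this.recoveryMs = recoveryMs;
    }

    // Entry side: only looks at the current state (compare tryPass in performChecking)
    boolean tryPass(long nowMs) {
        if (state == State.CLOSED) {
            return true;
        }
        if (state == State.OPEN && nowMs >= nextRetryMs) {
            state = State.HALF_OPEN;      // let a single probe request through
            return true;
        }
        return false;                     // still OPEN, or a probe is already in flight
    }

    // Exit side: record the outcome and change state if needed
    void onComplete(boolean error, long nowMs) {
        if (state == State.OPEN) {
            return;
        }
        if (state == State.HALF_OPEN) {
            if (error) {
                open(nowMs);              // probe failed: break again
            } else {
                state = State.CLOSED;     // probe succeeded: recover and start counting afresh
                errors = 0;
                total = 0;
            }
            return;
        }
        total++;
        if (error) {
            errors++;
        }
        if (total < minRequestAmount) {
            return;                       // too few requests to judge
        }
        if ((double) errors / total > ratioThreshold) {
            open(nowMs);
        }
    }

    private void open(long nowMs) {
        state = State.OPEN;
        nextRetryMs = nowMs + recoveryMs;
    }
}
```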

III Sliding time window algorithm

        The sliding time window algorithm is the core of Sentinel's internal statistics. In the architecture diagram above, the whole time window is a ring array, and each element of the array is a sample window. Every sample window has the same fixed length, so the whole time window also has a fixed length. To compute statistics, Sentinel finds the sample window that contains the current time, takes its statistics, adds the statistics of the other sample windows in the same time window, and the sum is the total for the time window. This calculation has some error, because the current time may not yet have reached the end of the window it falls in; the error is within acceptable bounds, so there is no need to worry about it.
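A ring-array counter in the spirit of LeapArray can be sketched as follows. It is a simplified model with hypothetical names: it only counts passes, takes the time as a parameter, and replaces a stale bucket with compareAndSet rather than using Sentinel's own reset logic; the total covers the sample windows that start within the last `intervalMs`.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical ring-array sliding window in the spirit of LeapArray / WindowWrap / MetricBucket.
final class ToySlidingWindow {
    // One sample window: its start time plus a counter (a MetricBucket would hold several counters)
    static final class Bucket {
        final long windowStart;
        final LongAdder pass = new LongAdder();

        Bucket(long windowStart) {
            this.windowStart = windowStart;
        }
    }

    private final int intervalMs;         // length of the whole time window
    private final int windowLengthMs;     // length of one sample window
    private final AtomicReferenceArray<Bucket> array;

    ToySlidingWindow(int sampleCount, int intervalMs) {
        this.intervalMs = intervalMs;
        this.windowLengthMs = intervalMs / sampleCount;
        this.array = new AtomicReferenceArray<>(sampleCount);
    }

    Bucket currentBucket(long nowMs) {
        int idx = (int) ((nowMs / windowLengthMs) % array.length());
        long windowStart = nowMs - nowMs % windowLengthMs;
        while (true) {
            Bucket old = array.get(idx);
            if (old == null) {
                // First use of this slot
                Bucket fresh = new Bucket(windowStart);
                if (array.compareAndSet(idx, null, fresh)) {
                    return fresh;
                }
            } else if (old.windowStart == windowStart) {
                // The slot already holds the current sample window
                return old;
            } else if (old.windowStart < windowStart) {
                // The slot holds a window from an earlier lap around the ring: replace it
                Bucket fresh = new Bucket(windowStart);
                if (array.compareAndSet(idx, old, fresh)) {
                    return fresh;
                }
            } else {
                // The clock went backwards: count into a detached bucket
                return new Bucket(windowStart);
            }
            // A CAS lost a race with another thread: re-read the slot and retry
        }
    }

    void addPass(long nowMs, int count) {
        currentBucket(nowMs).pass.add(count);
    }

    // Total passes of all sample windows that start within the last intervalMs
    long passInInterval(long nowMs) {
        long sum = 0;
        for (int i = 0; i < array.length(); i++) {
            Bucket b = array.get(i);
            if (b != null && nowMs - b.windowStart < intervalMs) {
                sum += b.pass.sum();
            }
        }
        return sum;
    }
}
```

With 2 samples over 1000 ms, passes recorded at 100 ms and 600 ms land in different slots; a pass at 1100 ms reuses the first slot because its old window (starting at 0 ms) has fallen out of the interval.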

1 Storing data in the sliding time window

        Now let's read the sliding time window source code for a deeper understanding. In the earlier walkthrough, the time-window statistics are recorded by node.addPassRequest in StatisticSlot, so we use that as the entry point. Following the code leads to StatisticNode.addPassRequest, which adds the data through a sliding-window counter; following it further leads to ArrayMetric.addPass.

    public void addPass(int count) {
        // Get the sample window that contains the current time
        WindowWrap<MetricBucket> wrap = data.currentWindow();
        // Record this request in the current sample window
        wrap.value().addPass(count);
    }

Start with the first line of code, which leads to LeapArray.currentWindow(). Before reading that method, let's look at the LeapArray class itself. This class is the ring array in the architecture diagram. Its main fields are:

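public abstract class LeapArray<T> {
    // Length of each sample window
    protected int windowLengthInMs;
    // Number of sample windows in one time window
    protected int sampleCount;
    // Length of the time window
    protected int intervalInMs;
    // Length of the time window in seconds
    private double intervalInSecond;
    // Elements are sample windows; the generic type T here is actually MetricBucket
    protected final AtomicReferenceArray<WindowWrap<T>> array;
    ......
}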
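The three numeric fields are not independent. The sketch below assumes that the sample window length is intervalInMs / sampleCount and that the interval must divide evenly by the sample count. To my knowledge, LeapArray's constructor derives windowLengthInMs this way, but treat it as an assumption. LeapArrayParams is a hypothetical helper that only shows the arithmetic.

```java
// Hypothetical helper showing how the LeapArray fields relate to each other.
final class LeapArrayParams {
    final int sampleCount;          // number of sample windows in one time window
    final int intervalInMs;         // length of the whole time window
    final int windowLengthInMs;     // length of each sample window
    final double intervalInSecond;  // time window length in seconds

    LeapArrayParams(int sampleCount, int intervalInMs) {
        if (sampleCount <= 0 || intervalInMs <= 0 || intervalInMs % sampleCount != 0) {
            throw new IllegalArgumentException(
                "interval must be positive and evenly divisible by sampleCount");
        }
        this.sampleCount = sampleCount;
        this.intervalInMs = intervalInMs;
        this.windowLengthInMs = intervalInMs / sampleCount;
        this.intervalInSecond = intervalInMs / 1000.0;
    }

    public static void main(String[] args) {
        LeapArrayParams p = new LeapArrayParams(2, 1000);
        System.out.println(p.windowLengthInMs + " ms per sample window"); // 500 ms per sample window
        System.out.println(p.intervalInSecond + " s per time window");    // 1.0 s per time window
    }
}
```

More sample windows make the window slide more smoothly, which reduces the error described above, at the cost of more buckets to sum on each read.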

The array elements are of another type, WindowWrap. This class holds the definition of a single sample window:

public class WindowWrap<T> {

    // Length of the sample window
    private final long windowLengthInMs;

    // Start timestamp of the sample window
    private long windowStart;

    // The actual statistics; the type is MetricBucket, which keeps one counter per MetricEvent dimension
    private T value;
    ......
}

With these two classes in mind, let's return to the original call. Following the code, we reach LeapArray.currentWindow(long timeMillis):

    public WindowWrap<T> currentWindow(long timeMillis) {
        if (timeMillis < 0) {
            return null;
        }
        // Index of the sample window for the current time: (current timestamp / sample window length) % number of sample windows
        int idx = calculateTimeIdx(timeMillis);
        // Start time of the current sample window: current time - (current time % sample window length)
        long windowStart = calculateWindowStart(timeMillis);

        while (true) {
            // Use the computed index to fetch the sample window from the time window
            WindowWrap<T> old = array.get(idx);
            // If the sample window does not exist yet, create it
            if (old == null) {
                WindowWrap<T> window = new WindowWrap<T>(windowLengthInMs, windowStart, newEmptyBucket(timeMillis));
                if (array.compareAndSet(idx, null, window)) {
                    return window;
                } else {
                    Thread.yield();
                }
            } else if (windowStart == old.windowStart()) {
                // The existing sample window starts at the computed start time, so it is the same window: return it
                return old;
            } else if (windowStart > old.windowStart()) {
                // The computed start time is later than the existing window's start, so the existing sample window has expired and must be reset
                if (updateLock.tryLock()) {
                    try {
                        return resetWindowTo(old, windowStart);
                    } finally {
                        updateLock.unlock();
                    }
                } else {
                    Thread.yield();
                }
            } else if (windowStart < old.windowStart()) {
                // This should not happen normally unless the server clock is moved back; we will not analyze it further
                return new WindowWrap<T>(windowLengthInMs, windowStart, newEmptyBucket(timeMillis));
            }
        }
    }
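The branches above can be reproduced in a small self-contained sketch. It assumes a hypothetical MiniLeapArray whose buckets hold only a PASS counter (a LongAdder). It also differs from Sentinel in one deliberate way. Sentinel's resetWindowTo resets the expired WindowWrap in place. The sketch instead swaps in a fresh bucket with compareAndSet while holding the lock, then loops, so the same-window branch returns the new bucket.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch of LeapArray.currentWindow(); class and field names are hypothetical.
final class MiniLeapArray {
    // One sample window: its start time plus a single PASS counter.
    static final class Bucket {
        final long windowStart;
        final LongAdder pass = new LongAdder();

        Bucket(long windowStart) {
            this.windowStart = windowStart;
        }
    }

    private final int windowLengthInMs;
    private final AtomicReferenceArray<Bucket> array;
    private final ReentrantLock updateLock = new ReentrantLock();

    MiniLeapArray(int sampleCount, int intervalInMs) {
        this.windowLengthInMs = intervalInMs / sampleCount;
        this.array = new AtomicReferenceArray<>(sampleCount);
    }

    Bucket currentWindow(long timeMillis) {
        int idx = (int) ((timeMillis / windowLengthInMs) % array.length());
        long windowStart = timeMillis - timeMillis % windowLengthInMs;
        while (true) {
            Bucket old = array.get(idx);
            if (old == null) {
                // Empty slot: install a new bucket; if another thread wins the CAS, retry.
                Bucket created = new Bucket(windowStart);
                if (array.compareAndSet(idx, null, created)) {
                    return created;
                }
                Thread.yield();
            } else if (windowStart == old.windowStart) {
                // Same sample window: reuse it.
                return old;
            } else if (windowStart > old.windowStart) {
                // Expired bucket from an earlier cycle: replace it while holding the lock.
                // The CAS skips the swap if another thread already replaced it; the loop re-reads the slot.
                if (updateLock.tryLock()) {
                    try {
                        array.compareAndSet(idx, old, new Bucket(windowStart));
                    } finally {
                        updateLock.unlock();
                    }
                } else {
                    Thread.yield();
                }
            } else {
                // Clock moved backwards: return a bucket that is not stored in the array.
                return new Bucket(windowStart);
            }
        }
    }

    public static void main(String[] args) {
        MiniLeapArray a = new MiniLeapArray(2, 1000); // 2 sample windows of 500 ms
        a.currentWindow(100).pass.add(1);
        a.currentWindow(400).pass.add(1);                     // same [0, 500) bucket
        System.out.println(a.currentWindow(450).pass.sum());  // 2
        System.out.println(a.currentWindow(1100).pass.sum()); // 0, slot 0 now holds [1000, 1500)
    }
}
```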

Once we have the current sample window, the next step is to add the request information to it (MetricBucket.addPass):

    // Note that the multi-dimensional data is stored in a LongAdder[]; here the count is added to the array slot that represents PASS
    public void addPass(int n) {
        add(MetricEvent.PASS, n);
    }
    // Let's take a look at MetricEvent
    public enum MetricEvent {
        PASS,
        BLOCK,
        EXCEPTION,
        SUCCESS,
        RT,
        OCCUPIED_PASS
    }
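The add(MetricEvent.PASS, n) call writes into an array of LongAdder counters with one slot per event type, indexed by the enum's ordinal. A minimal sketch of that layout follows. MiniMetricBucket and its nested Event enum are hypothetical stand-ins, and Event simply mirrors the MetricEvent constants listed above. The reset method is included because an expired sample window must be cleared before reuse.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative stand-in for MetricBucket: one LongAdder per event type,
// indexed by the enum's ordinal. Names are hypothetical.
final class MiniMetricBucket {
    enum Event { PASS, BLOCK, EXCEPTION, SUCCESS, RT, OCCUPIED_PASS }

    private final LongAdder[] counters = new LongAdder[Event.values().length];

    MiniMetricBucket() {
        for (Event e : Event.values()) {
            counters[e.ordinal()] = new LongAdder();
        }
    }

    MiniMetricBucket add(Event event, long n) {
        counters[event.ordinal()].add(n); // thread-safe, low-contention increment
        return this;
    }

    long get(Event event) {
        return counters[event.ordinal()].sum();
    }

    // Clear every counter, e.g. when an expired sample window is recycled.
    void reset() {
        for (LongAdder c : counters) {
            c.reset();
        }
    }

    public static void main(String[] args) {
        MiniMetricBucket b = new MiniMetricBucket();
        b.add(Event.PASS, 3).add(Event.BLOCK, 1);
        System.out.println(b.get(Event.PASS));  // 3
        System.out.println(b.get(Event.BLOCK)); // 1
    }
}
```

LongAdder spreads concurrent increments across internal cells and only combines them in sum(). This suits a counter that is written on every request but read far less often.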

2 Sliding time window algorithm to obtain data

        So far, we have covered how data is added with the sliding time window algorithm. Next, let's look at the source code for reading data with it. When introducing the FlowSlot source code above, we mentioned that the sliding time window algorithm is used in DefaultController.canPass. Do you remember what this method does? DefaultController is the controller for the "direct rejection" (fail fast) flow-control effect. Note the following call in that method:

    // int curCount = avgUsedTokens(node);  gets the current thread count or QPS of the node

    private int avgUsedTokens(Node node) {
        if (node == null) {
            return DEFAULT_AVG_USED_TOKENS;
        }
        // Return a different value depending on the rule's statistic type; here we follow the QPS branch
        return grade == RuleConstant.FLOW_GRADE_THREAD ? node.curThreadNum() : (int)(node.passQps());
    }

    // Step into StatisticNode.passQps
    public double passQps() {
        // QPS = number of passed requests in the current time window / time window length in seconds
        return rollingCounterInSecond.pass() / rollingCounterInSecond.getWindowIntervalInSec();
    }
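To show how this value feeds the rejection decision, here is a sketch of the QPS computation followed by the comparison. To my knowledge, DefaultController rejects a request when curCount + acquireCount > count, but treat that as an assumption. DirectRejectCheck, passQps and canPass are hypothetical names, and the sketch ignores the prioritized-request branch of the real controller. The (int) cast mirrors the truncation in avgUsedTokens above.

```java
// Hypothetical sketch of the "direct rejection" decision for a QPS rule,
// ignoring the prioritized-request branch of the real controller.
final class DirectRejectCheck {
    // QPS = passed requests in the time window / window length in seconds.
    static double passQps(long passInWindow, double windowIntervalInSec) {
        return passInWindow / windowIntervalInSec;
    }

    // Reject when the requests already counted plus this acquisition exceed the threshold.
    static boolean canPass(double currentQps, int acquireCount, double threshold) {
        int curCount = (int) currentQps; // avgUsedTokens() truncates QPS to an int
        return curCount + acquireCount <= threshold;
    }

    public static void main(String[] args) {
        System.out.println(canPass(passQps(9, 1.0), 1, 10));  // true: 9 + 1 <= 10
        System.out.println(canPass(passQps(10, 1.0), 1, 10)); // false: 10 + 1 > 10
    }
}
```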

Next, let's focus on ArrayMetric.pass(), which reads the statistics from the sliding time window:

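    public long pass() {
        // This method should look familiar: it is the one used above to update the sample window when adding data; calling it here refreshes the sample window for the current time
        data.currentWindow();
        long pass = 0;
        // Get all sample windows in the current time window
        List<MetricBucket> list = data.values();

        // Sum the PASS counts recorded in the multi-dimensional data of each sample window
        for (MetricBucket window : list) {
            pass += window.pass();
        }
        return pass;
    }
    // Following the call and skipping the overloads, we reach the method below, which collects all valid sample windows
    public List<T> values(long timeMillis) {
        if (timeMillis < 0) {
            return new ArrayList<T>();
        }
        int size = array.length();
        List<T> result = new ArrayList<T>(size);
        // Iterate over every sample window
        for (int i = 0; i < size; i++) {
            WindowWrap<T> windowWrap = array.get(i);
            // Skip this entry if it is empty or deprecated
            // Deprecated means: current time - sample window start time > time window length, i.e. the sample window does not belong to the current time window
            if (windowWrap == null || isWindowDeprecated(timeMillis, windowWrap)) {
                continue;
            }
            result.add(windowWrap.value());
        }
        return result;
    }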
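The strict ">" in the deprecation check explains why pass() calls data.currentWindow() before reading. Take 2 sample windows of 500 ms and the current time 1500 ms. The bucket that started at 500 ms is exactly 1000 ms old, so the check alone would still count it. However, 1500 ms maps to the same slot ((1500 / 500) % 2 = 1), so the currentWindow() call resets that slot first. The sketch below works through this case. DeprecationDemo and its methods are hypothetical helpers, and plain parallel arrays stand in for the WindowWrap ring.

```java
// Hypothetical illustration of the filter in LeapArray.values(): a bucket is
// deprecated when now - windowStart > intervalInMs (strictly greater).
final class DeprecationDemo {
    static boolean isWindowDeprecated(long now, long windowStart, long intervalInMs) {
        return now - windowStart > intervalInMs;
    }

    // Sums the pass counts of all non-deprecated buckets.
    static long sumValid(long now, long[] starts, long[] passCounts, long intervalInMs) {
        long total = 0;
        for (int i = 0; i < starts.length; i++) {
            if (!isWindowDeprecated(now, starts[i], intervalInMs)) {
                total += passCounts[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long interval = 1000; // 2 sample windows of 500 ms
        // Slot contents at now = 1500 before any refresh: slot 0 -> [1000, 1500), slot 1 -> [500, 1000)
        long[] starts = {1000, 500};
        long[] counts = {4, 7};
        // 1500 - 500 = 1000 is not > 1000, so the old bucket would still be counted.
        System.out.println(sumValid(1500, starts, counts, interval)); // 11
        // Refreshing slot 1 for [1500, 2000), as currentWindow(1500) would, removes it.
        starts[1] = 1500;
        counts[1] = 0;
        System.out.println(sumValid(1500, starts, counts, interval)); // 4
    }
}
```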

Origin blog.csdn.net/weixin_38612401/article/details/126037408