Open source! NetEase Yunxin's hotspot detection platform in practice

Background

For an Internet platform, especially a toB (business-facing) PaaS/SaaS platform, hot keys are an unavoidable problem. As an open system, the platform handles requests from a large number of external systems and a massive number of terminals every day. Although every request must pass the authentication rules defined by the open platform, sudden or abnormal requests still arrive without warning.

For the toB system, such requests include at least the following types:

  • Abnormal traffic caused by customers using the service incorrectly

    PaaS platforms usually expose their services as APIs and SDKs. Although we provide demos and solutions to guide customers toward using our services more elegantly, you can never predict when a customer will call our API in an unexpected way.

  • Traffic from an unknown attacker

    Taking Yunxin as an example, our servers face layer 4 and layer 7 attacks all year round. Layer 4 traffic generally does not reach the backend servers directly, but for layer 7 traffic, in addition to general WAF protection, the business layer also needs to detect and locate the source promptly.

  • Customer stress tests

    Yes, customers often give us big surprises without prior notice. To keep our system from being overwhelmed by sudden traffic, we must identify such traffic quickly and respond appropriately.

  • Abnormal traffic caused by incorrect usage by our customers' customers

    For toB products, the platform directly faces developers or enterprises. A customer's own customers may be end users (toC) or other enterprises. This layered relationship greatly increases uncertainty.

  • Abnormal traffic reaching the platform because a customer is under attack

    From time to time customers also report that they are being attacked by black-market operators and ask us to help with a solution. These operators use a variety of tools, and their attacks often come with heavy traffic. PaaS/SaaS vendors end up as collateral damage.

To handle the stability risks brought by this kind of sudden, abnormal traffic, an open platform usually sets per-tenant QPS limits on its interfaces (a frequency control system) to protect itself. However, frequency control rules are usually defined at the entry layer (the API), which cannot fully describe the root cause of the pressure that abnormal traffic puts on the system (such as a particular redis key or a particular database row). In other words, the frequency control system faces outward, while the source of risk lies in internal resources. Measured against the risk itself, frequency control logic is sometimes rather coarse. In addition, for various reasons the frequency control configuration rarely covers everything, and as the system keeps evolving, the relevant parameters need continual adjustment.

Therefore, in addition to the frequency control system, a hotspot detection platform is also needed. The hotspot detection system focuses on the hotspots themselves that cause stability risks, so that it can evaluate and discover risk points accurately.

Why build our own

We investigated existing hotspot detection solutions, such as JD.com's hotkey (https://gitee.com/jd-platform-opensource/hotkey) and Sohu's hotCaffeine (https://github.com/sohutv/hotcaffeine), and we also consulted NetEase Cloud Music about their internal fork of hotCaffeine. After comparing their features with our needs, we decided to develop our own solution, drawing on the ones above, for the following reasons.

  • The above open source solutions are all tightly bound to etcd. etcd acts as both configuration center and registration center in those systems and cannot be replaced by other services. We want to reuse our existing configuration center and registration center rather than maintain a separate etcd cluster.

  • The above open source solutions focus more on hot key caching than monitoring itself.

  • We want the framework to be as lean as possible, with minimal dependencies.

  • We want to expose the relevant interfaces as plug-ins, so that the framework can be connected easily to the internal systems of different departments and custom business logic can be developed on top of it.

For these reasons, we decided to develop our own hot key detection framework, camellia-hot-key.

System architecture

Architecture diagram

 

 

Architecture principle

As a general-purpose hotspot detection platform, the following problems need to be solved first when designing:

  • How to collect and count the number of hot keys?

  • How to define and manage hot key rules?

  • How to use the hotspot key detection results?

How to collect and count the number of hot keys?

Since a hot key is defined in a global dimension, a centralized server is needed to aggregate the counts. The first instinct is to use a centralized cache such as redis as the counter store. However, with a very large number of keys, redis hits significant performance bottlenecks, and even setting performance aside, the resource overhead is huge. Therefore, we need a solution with low resource overhead.

When we re-examine the hot key detection scenario, we find that true real-time is not required; near real-time (on the order of 100 ms) satisfies most scenarios. Local caching and batch processing can therefore greatly reduce the pressure on the centralized server, and the cache duration and batch size can be tuned to meet the needs of different businesses. As for the server itself, borrowing the slot sharding idea of redis-cluster, we can hash-shard the hot-key-server nodes according to fixed rules so that the same key is always routed to the same server node. Hot key computation is then fully local to one node and, combined with the local cache and batch processing mentioned above, statistics over a very large number of keys can be completed at relatively low cost.
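The combination of local aggregation, periodic batching, and hash routing described above can be sketched roughly as follows. This is a minimal illustration, not the camellia-hot-key implementation: the class `LocalKeyCollector`, its methods, and the use of `String.hashCode()` for routing are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: names are illustrative, not the camellia-hot-key API.
class LocalKeyCollector {
    private final int serverCount;
    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    LocalKeyCollector(int serverCount) {
        this.serverCount = serverCount;
    }

    // Record accesses locally; nothing is sent over the network here.
    void push(String key, long count) {
        counters.computeIfAbsent(key, k -> new LongAdder()).add(count);
    }

    // The same key always maps to the same server index.
    int route(String key) {
        return Math.floorMod(key.hashCode(), serverCount);
    }

    // Intended to be called by a timer (for example every 100 ms):
    // drains the local counters and groups them into one batch per target server.
    // A push racing with flush may be missed; a production version would need to handle that.
    List<Map<String, Long>> flush() {
        List<Map<String, Long>> batches = new ArrayList<>();
        for (int i = 0; i < serverCount; i++) {
            batches.add(new HashMap<>());
        }
        for (String key : counters.keySet()) {
            LongAdder adder = counters.remove(key);
            if (adder != null) {
                batches.get(route(key)).merge(key, adder.sum(), Long::sum);
            }
        }
        return batches;
    }
}
```

With this shape, many `push` calls for one key collapse into a single entry per flush interval, and each server only ever sees the keys that hash to it.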

Besides the centralized server, collection and statistics also have to cope with a massive number of keys. Obviously, we cannot simply record every key within a time window. A simple idea is to bound memory with algorithms such as LRU/LFU. To this end, we investigated open source frameworks such as ConcurrentLinkedHashMap, Guava, and Caffeine (all of which Ben Manes worked on at different times; our respects to him). Caffeine is known as the king of in-process caches, and its W-TinyLFU algorithm delivers a near-optimal cache hit rate, which in the hotspot detection scenario also keeps us from missing hot keys, so Caffeine was the obvious first choice. After further analysis, however, we chose to combine ConcurrentLinkedHashMap and Caffeine to get the best performance.

With that analysis done, the basic architecture is fairly clear. The whole system consists of two parts, the SDK and the server. The SDK uses Caffeine to collect keys and reports them to the server periodically; the server, on receiving them, also uses Caffeine to collect and count keys. Namespaces, whose number is predictable, are managed with ConcurrentLinkedHashMap (a simple LRU is enough there for system protection).

How to define and manage hot key rules?

Hot key detection is obviously a service that needs to flexibly change hot key rules. Hot key rules mainly include two parts:

  • Which keys need to be detected.

  • What counts as a hot key.

For the former, we provide several key matching modes (prefix match, exact match, substring match, and match all) so that businesses can configure rules along different dimensions, and we provide a rule list so that rules are matched in a defined priority order.
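As an illustration of the matching modes and the priority-ordered rule list, here is a minimal sketch. The names `KeyRuleMatcher`, `Rule`, and `Type` are hypothetical and do not correspond to the framework's actual classes; the sketch only assumes that the first matching rule in the list wins.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an ordered rule list; not the framework's actual classes.
class KeyRuleMatcher {
    enum Type { EXACT, PREFIX, CONTAINS, MATCH_ALL }

    static final class Rule {
        final String name;
        final Type type;
        final String pattern;

        Rule(String name, Type type, String pattern) {
            this.name = name;
            this.type = type;
            this.pattern = pattern;
        }

        boolean matches(String key) {
            switch (type) {
                case EXACT: return key.equals(pattern);
                case PREFIX: return key.startsWith(pattern);
                case CONTAINS: return key.contains(pattern);
                default: return true; // MATCH_ALL
            }
        }
    }

    private final List<Rule> rules; // ordered by priority, highest first

    KeyRuleMatcher(Rule... rules) {
        this.rules = Arrays.asList(rules);
    }

    // Returns the first matching rule, or null if the key should not be tracked.
    Rule match(String key) {
        for (Rule rule : rules) {
            if (rule.matches(key)) {
                return rule;
            }
        }
        return null;
    }
}
```

Placing a MATCH_ALL rule last in the list gives a catch-all default, while a key that matches no rule can be discarded without being counted.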

For the latter, we expose two parameters, the time window and the hot key threshold, to define a hot key. The time window is a sliding window. Taking the rule "500 times within 1000 ms" as an example, the framework uses 100 ms as a small window internally and counts over the most recent 10 small windows (100 ms * 10 = 1000 ms), which together form the 1000 ms target window. On the one hand, this identifies a hot key as early as possible; on the other hand, detection does not suffer from jumps at window boundaries.

Hot key rules are used not only on the server side; the framework also pushes them to the SDK, so that keys that do not need detection can be discarded directly on the SDK side, reducing unnecessary network transmission.

As a general-purpose hot key detection service, it will obviously serve different business lines, and even one business line has different business scenarios. Therefore, in addition to hot key rules, we define the concept of a namespace. One or more rules can be defined under each namespace. A service supports multiple namespaces at the same time, and namespaces are isolated from one another.
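The sliding window can be sketched as a small ring of buckets. The following is an assumption-laden illustration rather than the framework's code: the class `SlidingWindowCounter` is hypothetical, timestamps are passed in explicitly and assumed to be non-negative and non-decreasing, and single-threaded access per key is assumed.

```java
import java.util.Arrays;

// Hypothetical sketch; single-threaded use per key is assumed.
class SlidingWindowCounter {
    private final long bucketMillis;
    private final long[] counts;    // count accumulated in each small window
    private final long[] bucketIds; // which small window each slot currently holds

    SlidingWindowCounter(long bucketMillis, int buckets) {
        this.bucketMillis = bucketMillis;
        this.counts = new long[buckets];
        this.bucketIds = new long[buckets];
        Arrays.fill(bucketIds, -1);
    }

    // Adds count at time nowMillis and returns the total over the most recent small windows.
    long addAndSum(long nowMillis, long count) {
        long id = nowMillis / bucketMillis;
        int slot = (int) (id % counts.length);
        if (bucketIds[slot] != id) { // the slot holds an expired window: reuse it
            bucketIds[slot] = id;
            counts[slot] = 0;
        }
        counts[slot] += count;
        long sum = 0;
        for (int i = 0; i < counts.length; i++) {
            if (bucketIds[i] > id - counts.length) { // still inside the target window
                sum += counts[i];
            }
        }
        return sum;
    }
}
```

For the rule "500 times within 1000 ms", one would create `new SlidingWindowCounter(100, 10)` and treat the key as hot whenever `addAndSum` returns 500 or more. Because the total is re-evaluated on every small window, a burst that straddles two fixed 1000 ms windows is still caught.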

How to use the hotspot key detection results?

First, the framework ships with an optional hot-key-cache mode, implemented jointly by the SDK and the server. For hot query scenarios, once the server detects a hot key, it automatically sends the detection result to the SDK, and the SDK automatically caches the value locally, which keeps query requests from passing through and protects the backend cache/database service. To keep the cache fresh, the server also notifies the associated SDKs of update/delete events for cached values, so that cached values are updated as soon as possible. The SDK in turn reports cache hit statistics to the server, which makes it easy for the server to compile statistics.

In addition, the server actively pushes detection results to callbacks defined by the business, and the business can apply its own handling, such as alerting, rate limiting, or blacklisting.

In the early stage of adoption, you may not know how to set the hot key rules. If the threshold is too low, you may be flooded with hot key notifications; if it is too high, the rule may have no effect. Therefore, the server also has a built-in hot key topN detection function that informs the business of the most-accessed keys under a namespace. The business can use this to locate faults, or to set hot key rules that match actual business conditions.
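The SDK-side behaviour of the hot-key-cache mode can be pictured with the sketch below. It is a simplified illustration under stated assumptions: `HotKeyLocalCache`, `markHot`, and `invalidate` are hypothetical names standing in for the notifications the server sends, a `Supplier` stands in for the value loader, and expiry and hit reporting are left out.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch of the SDK-side behaviour; not the CamelliaHotKeyCacheSDK implementation.
class HotKeyLocalCache {
    private final Set<String> hotKeys = ConcurrentHashMap.newKeySet();
    private final ConcurrentHashMap<String, Object> values = new ConcurrentHashMap<>();

    // Would be triggered by a "this key is hot" notification from the server.
    void markHot(String key) {
        hotKeys.add(key);
    }

    // Would be triggered by an update/delete event relayed by the server.
    void invalidate(String key) {
        values.remove(key);
    }

    @SuppressWarnings("unchecked")
    <T> T getValue(String key, Supplier<T> loader) {
        if (!hotKeys.contains(key)) {
            return loader.get(); // not hot: always go to the backend
        }
        // Hot: load once, then serve from local memory until invalidated.
        return (T) values.computeIfAbsent(key, k -> loader.get());
    }
}
```

The point of the design is that only keys already judged hot occupy local memory, so the local cache stays small while still absorbing the bulk of the traffic.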
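One common way to compute a topN over per-key counts is a bounded min-heap, sketched below. This is a generic illustration, not necessarily how the framework computes its topN; the class `TopNKeys` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of a topN selection over per-key counts.
class TopNKeys {
    // Returns the n keys with the highest counts, most-accessed first.
    static List<String> topN(Map<String, Long> counts, int n) {
        // Min-heap on count: the root is the weakest of the current candidates.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(
                Comparator.comparingLong((Map.Entry<String, Long> e) -> e.getValue()));
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest so the heap never exceeds n
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(0, heap.poll().getKey()); // polled in ascending order, so insert at the front
        }
        return result;
    }
}
```

Looking at the counts of the top few keys during a trial period gives a concrete basis for choosing the threshold in a hot key rule.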

Plug-in and custom extensions

The basic principles of the hot key detection framework were described above. From the start, camellia-hot-key was designed with the differing needs of different business lines in mind, so it follows a plug-in design: businesses can use it flexibly and connect it to existing systems easily, without modifying the framework's source code. The plug-in design shows up mainly in the following points:

  • Registration center

Unlike existing open source hot key services, camellia-hot-key is not bound to any registration center. You only need to implement the relevant interface to integrate quickly with an existing registration center, such as zk (built-in), eureka (built-in), etcd, consul, or nacos.

  • Configuration center

camellia-hot-key is not bound to any configuration center either; instead it defines the HotKeyConfigService configuration interface. You only need to implement this interface to host the hot key rules in your existing configuration center (camellia-hot-key has two built-in options: a local configuration file and nacos).

The HotKeyConfigService configuration interface is defined as follows:

public abstract class HotKeyConfigService {
    /**
     * Get the HotKeyConfig
     * @param namespace namespace
     * @return HotKeyConfig
     */
    public abstract HotKeyConfig get(String namespace);

    /**
     * Called after initialization; override this method to read the relevant settings from HotKeyServerProperties
     * @param properties properties
     */
    public void init(HotKeyServerProperties properties) {
    }

    // Callback method
    protected final void invokeUpdate(String namespace) {
        //xxxx
    }
}

In addition, in the design of camellia-hot-key the configuration center only needs to interact with the server. The SDK obtains its configuration (both the initial configuration and later updates) automatically through the server, without connecting to the configuration center directly, which keeps the SDK logic as simple as possible (a thinner SDK).

Monitoring endpoints and custom callbacks

The camellia-hot-key server provides HTTP monitoring endpoints (JSON format and Prometheus format) that expose basic server information such as QPS and backlog.

In addition, a rich callback interface is provided, including but not limited to:

  • HotKeyCallback

    The hot key detection callback. Hot keys are pushed to the business side in real time through this callback. Besides the hot key itself and its current count, the callback also reports which rule the hot key matched and where the hot key came from.

  • HotKeyTopNCallback

    The hot key topN callback. This is a topN statistic in the global dimension (it aggregates data from multiple server nodes), and the framework calls back to the business at a regular interval (1 minute by default).

  • HotKeyCacheStatsCallback

    When the hot key cache function is enabled, the SDK will periodically report the hit status of the hot key cache, and the server will call back the statistical data to the business through this callback interface.

Friendly SDK interface

In order to adapt to different application scenarios, the framework provides two different SDKs, CamelliaHotKeyMonitorSDK and CamelliaHotKeyCacheSDK.

  • CamelliaHotKeyMonitorSDK

Used purely for hot key statistics. Handling of the detection results is left to the business and can be done either on the SDK side or on the server side. There is only one core interface:

    /**
     * Push a key for statistics and hot key detection
     * @param namespace namespace
     * @param key key
     * @param count count
     * @return Result the result
     */
    Result push(String namespace, String key, long count);
  • CamelliaHotKeyCacheSDK

Encapsulates the hot key caching function: the SDK automatically caches detected hot keys locally, and the integrating service only needs to implement the ValueLoader interface. There are three core interfaces:

    /**
     * Get the value of a key
     * If it is a hot key, the local cache is checked first; on a miss, the request falls through to the loader
     * If it is not a hot key, the value is fetched through the loader and returned
     *
     * If the key is updated, hot-key-server broadcasts to all SDKs to update their local caches, keeping cached values fresh
     *
     * If the key is not updated, the SDK still tries to refresh it before the configured expireMillis (only one fall-through per instance)
     *
     * @param namespace namespace
     * @param key key
     * @param loader value loader
     * @return value
     */
    <T> T getValue(String namespace, String key, ValueLoader<T> loader);

    /**
     * The value of the key was updated; call this method to notify hot-key-server, which then broadcasts to everyone
     * @param namespace namespace
     * @param key key
     */
    void keyUpdate(String namespace, String key);

    /**
     * The value of the key was deleted; call this method to notify hot-key-server, which then broadcasts to everyone
     * @param namespace namespace
     * @param key key
     */
    void keyDelete(String namespace, String key);

Performance

We have done a lot of optimization and tuning on camellia-hot-key, mainly in the following areas:

  • Transport protocol

    The SDK and the server interact over long-lived connections (based on netty4). Text protocols such as JSON are abandoned in favor of a more compact binary protocol to improve performance (to keep external dependencies to a minimum, third-party serialization libraries such as pb are not used).

  • Lock-free design

    Requests are hashed on the way from the SDK to the server to ensure that the same key is handled by the same server. After the server receives a message, it hashes by key again when dispatching, so that the same key is only ever processed by one thread. This greatly simplifies the sliding window design and avoids locks (the whole process is lock-free).

  • JDK-17

    We tested with both jdk8 and jdk17 and found that, at the same throughput, jdk17 has lower CPU usage than jdk8.
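The per-key thread affinity behind the lock-free design described above can be sketched as follows. This is a simplified stand-in rather than the framework's code: `KeyAffinityDispatcher` is a hypothetical class, and single-thread executors from the JDK are used only to show the idea (their internal queues are not themselves lock-free). What the sketch demonstrates is that the counting state for a key is touched by exactly one thread, so it needs no synchronization.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: each key is always handled by the same worker thread.
class KeyAffinityDispatcher {
    private final ExecutorService[] workers;
    private final Map<String, Long>[] counters; // one plain HashMap per worker, never shared

    @SuppressWarnings("unchecked")
    KeyAffinityDispatcher(int threads) {
        workers = new ExecutorService[threads];
        counters = new Map[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = Executors.newSingleThreadExecutor();
            counters[i] = new HashMap<>();
        }
    }

    int indexOf(String key) {
        return Math.floorMod(key.hashCode(), workers.length);
    }

    // The update runs on the worker that owns this key, so the HashMap needs no lock.
    void submit(String key, long count) {
        int index = indexOf(key);
        workers[index].execute(() -> counters[index].merge(key, count, Long::sum));
    }

    // Stops the workers, waits for queued tasks to finish, then reads a key's total.
    long shutdownAndGet(String key) {
        for (ExecutorService worker : workers) {
            worker.shutdown();
        }
        try {
            for (ExecutorService worker : workers) {
                worker.awaitTermination(5, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counters[indexOf(key)].getOrDefault(key, 0L);
    }
}
```

Because a key never migrates between threads, per-key state such as a sliding window can be a plain, unsynchronized object owned by its worker.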

The following is a simple performance test result:

 

How we use it

As a general-purpose hot key detection framework, it has been adopted by multiple business lines within NetEase Smart Enterprise, connected seamlessly to existing systems through the custom interfaces. Taking IM as an example, the hotspot detection service is connected at the following points in our business flow:

  • For requests from the IM SDK, detection runs along two dimensions, uid + tenant id + interface and tenant id + interface, to identify abnormal C-side clients and abnormal tenants.

  • For requests from the IM openAPI, detection runs on tenant id + interface, to identify abnormal tenants and abnormal interfaces.

  • At the underlying database level, take db as an example. Through a MyBatis plugin, we connected hotspot detection in a non-intrusive way. The detection key is assembled as type#tenant id#sql#param. The advantage is that, on the one hand, once a hotspot is identified the source tenant can be located quickly, and on the other hand, different rules can be set for different SQL operation types (select/update/insert/delete) and for different tenants.

  • At the cache level, take redis as an example. In addition to the redis key itself, the tenant id is assembled into the detection key to make the source easy to locate. For dao_cache in particular, CamelliaHotKeyCacheSDK is also integrated so that the cache function can be enabled to protect redis when necessary.

The above covers input. As for output, using the custom interfaces provided by the framework, we have connected the internal monitoring and alerting system, so that hotspots are noticed promptly, and we also write the data to the data platform in real time for later tracing. In the future, it will also be connected to the frequency control and flow control system, so that abnormal traffic can be blocked automatically as soon as it is detected.


Summary

Why open source

When developing the camellia-hot-key framework, we designed it as part of the open source project camellia and kept the core lean, with no business logic mixed in, so that everyone can connect it directly to their own systems without modifying the source code.

We hope more people will use it, not just NetEase Smart Enterprise, so that its features and value are put to the fullest use.

All open source enthusiasts are therefore welcome to point out problems, and together we will keep improving camellia, for a real win-win.

Trial

Camellia is an open source project of Yunxin. Besides hot-key, it contains many components that have been fully verified in production, such as redis-proxy, id-gen, and delay-queue. You are welcome to give it the triple: like (Star), follow (PR), and comment (issue)!

github address: https://github.com/netease-im/camellia


Origin blog.csdn.net/netease_im/article/details/131435437