Production-Scale Practice with Sentinel, Alibaba's Open-Source Rate Limiting and Degradation Library


 


 

 

Preface

 

There are already plenty of articles online covering rate-limiting algorithms and Sentinel's features, architecture, and internals, so I won't repeat that content. Instead, I will share the experience and pitfalls from using Sentinel in real work and production environments.

 

If you are doing a technology selection for rate limiting and circuit breaking, this article will give you an objective and valuable reference;

If you plan to use Sentinel in a production environment, this article will help you avoid detours;

If you are preparing for a job interview, it may add highlights to your skill tree and project experience, and keep the verdict "all theory, no practice" off your interview evaluation form;

 

 

 

Is the open-source version of Sentinel the same as Alibaba's internal version? Can it be used in production at scale?

 

Let me give you the answer directly: the open-source and internal versions are the same, and the core code and capabilities are all open source. It can be used in production, but it is not "out of the box": you need to do some secondary development and adaptation. I will expand on these issues in detail below. Of course, I recommend directly using the AHAS Sentinel console and the ASM configuration center on Alibaba Cloud, which are the productized output of these best practices and can save you a lot of time, manpower, and operations cost.

 

 

Overall runtime architecture

 

 

Problems faced in large-scale production use

Looking at the original runtime architecture of the open-source version of Sentinel, several problems are obvious:

 

  1. Rate-limiting and degradation rules are stored in the memory of each application node, so they are lost whenever the application is redeployed or restarted, which is obviously unacceptable in production;
  2. By default, rules are distributed per machine node rather than per application, while the application systems of a normal company are deployed as clusters; this also cannot support cluster-level rate limiting;
  3. Metrics are pulled by the Dashboard and kept in memory for only 5 minutes; if you miss that window, you may not be able to reconstruct the "crisis scene", and you cannot see traffic trends;
  4. If 500+ applications are onboarded for rate limiting and each deploys 4 nodes on average, that is 2,000 nodes in total; the Dashboard will certainly become a bottleneck, and its single-machine thread pool simply cannot cope;

 

How to optimize and solve these problems

Next, let's look at how to solve these obvious problems one by one.

 

First, rate-limiting rules, degradation rules, and so on should be distributed per application, not per single application node. Because Sentinel supports cluster flow control, the open-source Sentinel Dashboard has already been extended for flow rules (FlowControllerV2), but circuit-breaking, system protection, and other rule types have not been extended to support distribution by application dimension. Interested readers can implement this by referring to the implementation of FlowControllerV2.
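
For readers who want to do that extension, here is a minimal sketch of a Nacos-backed rule publisher in the style used by FlowControllerV2, assuming a Nacos ConfigService bean is already configured in the Dashboard; the bean name, data-id suffix, and group are assumed naming conventions, and the same pattern can be copied for DegradeRuleEntity, SystemRuleEntity, and so on:

```java
import java.util.List;

import com.alibaba.csp.sentinel.dashboard.datasource.entity.rule.FlowRuleEntity;
import com.alibaba.csp.sentinel.dashboard.rule.DynamicRulePublisher;
import com.alibaba.fastjson.JSON;
import com.alibaba.nacos.api.config.ConfigService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

// Publishes flow rules to Nacos per application, so the Dashboard can push
// rules by application dimension instead of by machine node.
@Component("flowRuleNacosPublisher")
public class FlowRuleNacosPublisher implements DynamicRulePublisher<List<FlowRuleEntity>> {

    // "-flow-rules" suffix and "SENTINEL_GROUP" are assumed naming conventions.
    private static final String FLOW_DATA_ID_POSTFIX = "-flow-rules";
    private static final String GROUP_ID = "SENTINEL_GROUP";

    @Autowired
    private ConfigService configService; // Nacos config client

    @Override
    public void publish(String app, List<FlowRuleEntity> rules) throws Exception {
        if (rules == null) {
            return;
        }
        configService.publishConfig(app + FLOW_DATA_ID_POSTFIX, GROUP_ID, JSON.toJSONString(rules));
    }
}
```

A matching DynamicRuleProvider reads the same data id back when the rule page is opened, so the Dashboard always shows what is actually stored in the configuration center.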

 

Second, rules should not live only in memory; they should be persisted to a dynamic configuration center, and applications should subscribe to the rules directly from the configuration center. In this way, the Dashboard and the applications are decoupled by the configuration center, a typical producer-consumer model: the Dashboard publishes rules to the configuration center, and every application node subscribes to them.

Taking the Nacos configuration center as an example, Sentinel officially and the community provide demos for persisting and subscribing to flow rules; you can then extend the same pattern to circuit-breaking rules, system protection rules, gateway flow rules, and so on. The basic model is: the Dashboard serializes the xxRuleEntity VO model and saves it to Nacos; the application subscribes from Nacos and deserializes the content into the xxRule domain model.
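
On the application side, the official sentinel-datasource-nacos module does the subscription. A minimal sketch, assuming the Nacos address, group, and data id match what the Dashboard publishes:

```java
import java.util.List;

import com.alibaba.csp.sentinel.datasource.ReadableDataSource;
import com.alibaba.csp.sentinel.datasource.nacos.NacosDataSource;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.TypeReference;

public class SentinelNacosInit {

    public static void registerFlowRuleDataSource() {
        String remoteAddress = "127.0.0.1:8848";   // Nacos server address (placeholder)
        String groupId = "SENTINEL_GROUP";         // must match the Dashboard publisher
        String dataId = "my-app-flow-rules";       // <appName> + "-flow-rules" (assumed convention)

        // Subscribe to flow rules from Nacos and deserialize them into the FlowRule domain model.
        ReadableDataSource<String, List<FlowRule>> flowRuleDataSource =
            new NacosDataSource<>(remoteAddress, groupId, dataId,
                source -> JSON.parseObject(source, new TypeReference<List<FlowRule>>() {}));

        // Any change published to Nacos is pushed to the rule manager automatically.
        FlowRuleManager.register2Property(flowRuleDataSource.getProperty());
    }
}
```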

 

Here I want to warn you about a huge pit ahead: do not copy the handling of the "hot-spot parameter flow rules" and the "black/white list (authority) rules" directly, because the ParamFlowRuleEntity and AuthorityRuleEntity VO models defined in the Dashboard do not match the field definitions of the ParamFlowRule and AuthorityRule domain models. This causes serialization/deserialization failures, which in turn prevent applications from subscribing to and using hot-spot parameter rules and black/white list rules. I will submit a PR for this!

 

The third point: the Dashboard has a scheduling thread pool that polls every application machine node (once per second by default) for its metrics log data, aggregates it, and displays it in the UI (after our modification, it also needs to persist it). This is a typical pull model, which is quite common in the monitoring and metrics field. Because the data is kept in memory and retained for only 5 minutes by default, this is also problematic. The following solutions are recommended:

  1. After the Dashboard pulls the metrics, write them directly into a time-series database, and have the Dashboard read from that database for display (see the sketch after this list). How long to retain the metrics data is up to your business. Taking the open-source InfluxDB as an example, it has built-in retention policies that automatically clean up expired data. You can also use open-source dashboards such as Grafana for querying and aggregation and for all kinds of good-looking overview boards, charts, rankings, and so on;
  2. Change the pull model into a push model and write to the time-series database directly when the metrics log is recorded; for performance you can also write to an MQ as a buffer instead. Apart from the extra latency, the most important thing is that recording metrics must never affect the main business flow;
  3. Keep printing the metrics logs, but do not rely on the Sentinel Dashboard to pull metrics data; instead run a collector on each application node to collect, process, and report the metrics logs. Tools such as the ELK stack can be used;
  4. You can also develop a Prometheus exporter yourself, expose the metrics as a scrape target, and let the Prometheus server pull them periodically; you can then use Prometheus's rich query and aggregation capabilities and display the results through Grafana and the like;
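
As an illustration of the first approach, here is a minimal sketch of the write path for persisting pulled metrics into InfluxDB, assuming the influxdb-java client and the Dashboard's MetricEntity type; the measurement, database, and retention-policy names are assumptions, and the Dashboard's query methods would be reimplemented against InfluxDB in the same way:

```java
import java.util.concurrent.TimeUnit;

import com.alibaba.csp.sentinel.dashboard.datasource.entity.MetricEntity;
import org.influxdb.InfluxDB;
import org.influxdb.dto.Point;

// Write path only: persist each pulled MetricEntity as a point in InfluxDB,
// so retention is handled by an InfluxDB retention policy instead of a
// 5-minute in-memory window.
public class InfluxDbMetricsWriter {

    private static final String DATABASE = "sentinel_metrics"; // assumed database name

    private final InfluxDB influxDB; // connected via InfluxDBFactory.connect(...)

    public InfluxDbMetricsWriter(InfluxDB influxDB) {
        this.influxDB = influxDB;
    }

    public void save(MetricEntity metric) {
        Point point = Point.measurement("sentinel_metric")
            .time(metric.getTimestamp().getTime(), TimeUnit.MILLISECONDS)
            .tag("app", metric.getApp())
            .tag("resource", metric.getResource())
            .addField("passQps", metric.getPassQps())
            .addField("blockQps", metric.getBlockQps())
            .addField("successQps", metric.getSuccessQps())
            .addField("exceptionQps", metric.getExceptionQps())
            .addField("rt", metric.getRt())
            .build();
        // "autogen" is the default retention policy name in InfluxDB 1.x.
        influxDB.write(DATABASE, "autogen", point);
    }
}
```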

 

The figure below is an example of typical time-series data, a data model designed specifically for metrics. Well-known open-source software in this field includes OpenTSDB, InfluxDB, and so on.

 

Grafana rate-limiting overview dashboard (rendering)

The approaches above each have their pros and cons. If you want the smallest possible change and your onboarding and deployment scale is not particularly large (within 500 nodes), choose the first approach.

 

The fourth point is the performance bottleneck the Dashboard hits when pulling and aggregating metrics once many applications and nodes are onboarded. If you chose approach 2, 3, or 4 when solving problem 3, then the built-in Dashboard is only used as a rule-distribution tool (and even rule distribution can be done directly through the Nacos configuration center console), so naturally there is no bottleneck. If you still want to use Sentinel's built-in Dashboard to pull and persist the metrics data, here are a few options:

  1. Split by business domain: applications in different business domains connect to their own Sentinel Dashboard, so the pressure is naturally spread out and bottlenecks become less likely. The advantage is that almost no modification is needed; the disadvantage is the lack of a unified view;
  2. Try to make the built-in Dashboard stateless. As mentioned earlier, applications report heartbeats periodically after startup, and by default the Dashboard keeps a "node information list" in memory. This is typical state data and should be moved to centralized storage such as Redis. You then need to change the "pull metrics" thread pool to run as sharded tasks so the load is shared, for example by scheduling it with ElasticJob. Of course, writes to the time-series database may then become the bottleneck;
  3. Sacrifice a little timeliness of the monitoring metrics and increase the interval of the fetchScheduleService scheduling thread pool in the Sentinel Dashboard, which relieves the pressure on the downstream worker thread pool;

 

As far as I am concerned, I actually recommend the first and third options; both are expedient measures with relatively small changes.

 

Of course, splitting by domain has other benefits. If you onboard 500+ systems, how long would the application list on the left side of the current open-source Dashboard stretch? It would probably become unusable. The UI and interaction design are rudimentary and clearly cannot meet the needs of large-scale production use, but after splitting by domain the experience may improve. One more point: the current open-source Dashboard only provides the most basic login authentication. If you want permission control, auditing, or approval workflows, secondary development is required. If each domain has its own independent Dashboard, the access-control risk is smaller.

Of course, if you do refactor the Dashboard's permission control and UI interaction, I suggest designing it around the application dimension and adding basic features such as search.

 

Other issues

After an application integrates Sentinel, it has to specify the application name, Dashboard address, client port, log configuration, heartbeat settings, and so on at startup, either through JVM -D startup parameters or through a configuration file placed in a specific path. This is not a reasonable design and is intrusive to CI/CD and the deployment environment. I solved this problem on version 1.6.3 and submitted a PR; fortunately, the community addressed it in 1.7.0.
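
For reference, this is roughly what those JVM startup parameters look like; the property names are the standard Sentinel client/transport settings, while the host, port, and path values are placeholders:

```bash
java -Dproject.name=my-app \
     -Dcsp.sentinel.dashboard.server=sentinel-dashboard.example.com:8080 \
     -Dcsp.sentinel.api.port=8719 \
     -Dcsp.sentinel.log.dir=/var/log/sentinel \
     -Dcsp.sentinel.heartbeat.interval.ms=10000 \
     -jar my-app.jar
```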

 

 

Some experience in rule configuration and use

 

 

Please don't get me wrong: I am not going to teach you how to configure and use these features, but how to use them well. Remember the soul-searching questions about rate limiting in my earlier article on the stability-guarantee system? Let's first quickly review the key Sentinel features you are likely to use, and then I will answer the most common questions in a Q&A fashion and share the most valuable experience and suggestions.

 

  1. Single-machine flow control (rate limiting)
  2. Cluster flow control
  3. Gateway flow control
  4. Hot-spot parameter flow control
  5. System adaptive protection
  6. Black/white list (authority) rules
  7. Automatic circuit breaking and degradation

 

 

What threshold should a single-machine rate limit be configured with?

This cannot be decided with a slap on the forehead. Set it too high and it may fail to prevent a failure; set it too low and you risk "killing" legitimate requests prematurely. It has to be configured according to capacity planning and water-level targets, and the prerequisite is that monitoring and alerting are responsive. Two practical approaches:

  1. Follow the idea of single-machine capacity planning: gradually increase the traffic weight/proportion of one node in the soft load balancer until it approaches its limit, record the QPS in that limit state, and apply the 70% water-level standard for a single data center to calculate the single-machine threshold for the resource. For example, if one node tops out at about 200 QPS, a 70% water level gives a threshold of about 140 QPS;
  2. Periodically observe the traffic charts in the monitoring system to obtain the real peak QPS online. If the application and the business are healthy during the peak period, you can treat that peak QPS as the theoretical water level. This method may waste resources, because the peak may never reach the system's carrying limit; it suits services whose traffic follows a regular cycle;

 

Do you really need cluster flow control?

In fact, in most scenarios you do not need cluster flow control; single-machine flow control is sufficient. Think about it carefully: there are really only a few cases where cluster flow control may be needed:

 

  1. When you want a per-machine QPS limit of less than 1, single-machine mode cannot express it; only cluster mode can cap the total QPS of the cluster. For example, an interface with very poor performance can handle at most 0.5 QPS per machine; with 10 machines deployed, the cluster can carry at most 5 QPS in total. That example is a bit extreme, so here is another: we want a certain customer to call an interface at a total of at most 10 QPS, but we actually have 20 machines deployed. This situation is real;
  2. In the figure above, the single-machine threshold is 10 QPS and 3 nodes are deployed, so in theory the cluster can handle a total of 30 QPS; in practice, because traffic is uneven, rate limiting is triggered before the cluster total ever reaches 30. Many people call this unreasonable, but it depends on the real situation. If the "10 QPS" was calculated from the system's carrying capacity in the capacity plan (that is, a node receiving more than 10 QPS might crash), then this limiting result is exactly what we want. If the "10 QPS" is only a business-level restriction and a single node exceeding 10 QPS causes no harm at all (what we essentially want to cap is the total QPS of the whole cluster), then this limiting result is unreasonable and far from optimal;

So it really depends on whether your rate limiting is meant to achieve "overload protection" or to enforce a business-level restriction.

Another point to note: cluster flow control cannot solve the problem of uneven traffic, and rate-limiting components cannot redistribute or reschedule traffic for you. Cluster flow control simply makes the overall limiting effect better when traffic is uneven.

 

The practical recommendation is: cluster flow control (to enforce the business-level limit) + single-machine flow control (as the system safety net to keep any single node from being overwhelmed).
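
As a rough sketch of that combination, here is what the two rules could look like in code; the resource name, thresholds, and flow id are assumptions, the token-server setup for cluster mode is omitted, and in practice the rules would come from the configuration center rather than being hard-coded:

```java
import java.util.Arrays;

import com.alibaba.csp.sentinel.slots.block.ClusterRuleConstant;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.ClusterFlowConfig;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

public class FlowRuleDemo {

    public static void loadRules() {
        // Cluster rule: cap the TOTAL QPS of the whole cluster (business-level limit).
        FlowRule clusterRule = new FlowRule("createOrder");
        clusterRule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        clusterRule.setCount(10);                          // total QPS for the cluster
        clusterRule.setClusterMode(true);
        ClusterFlowConfig clusterConfig = new ClusterFlowConfig();
        clusterConfig.setFlowId(1001L);                    // globally unique id of the cluster rule
        clusterConfig.setThresholdType(ClusterRuleConstant.FLOW_THRESHOLD_GLOBAL);
        clusterConfig.setFallbackToLocalWhenFail(true);    // fall back to local limiting if the token server is down
        clusterRule.setClusterConfig(clusterConfig);

        // Single-machine rule: safety net so one node can never be overwhelmed.
        FlowRule localRule = new FlowRule("createOrder");
        localRule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        localRule.setCount(20);                            // per-node ceiling from capacity planning

        FlowRuleManager.loadRules(Arrays.asList(clusterRule, localRule));
    }
}
```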

 

If the gateway layer already does rate limiting, does the application layer still need it?

Yes, double protection is necessary. By the same reasoning, if upstream aggregation services are configured with rate limits, downstream basic services also need them. Imagine only the upstream is limited: if the upstream issues a large number of retries, couldn't it still overwhelm the downstream basic service? We also need to pay special attention to the thresholds in this case. For example, upstream services A and B both depend on downstream service Y. If A and B are each configured with 100 QPS, then Y must be configured with at least 200 QPS; otherwise some requests will pass through, be partially processed, and then be rejected anyway, which not only wastes resources but, in serious cases, may cause data inconsistency and other problems.

Therefore, it is best to configure thresholds according to the capacity plan of the entire call chain (the shortest-plank-of-the-barrel principle); the earlier traffic is intercepted, the better, and every layer should have rate limiting configured.

 

Is hot-spot parameter flow control practical?

Very practical. It prevents hot data (such as popular shops or dark-horse products) from occupying and consuming excessive system resources, which would otherwise seriously affect the handling of requests for other data.

There is another common requirement. If you have a consumer-facing product, you may want to cap the QPS with which a single user can access an interface; if you have a B2B SaaS product, you may want to cap the QPS of a single tenant on an interface. By default, hot-spot parameter flow control is not designed for such requirements, and you would need to extend a custom slot to implement them. That said, the paramFlowItemList of a hot-spot parameter rule (parameter exception items; for example, allowing the large customer with customer ID = 1 to access a resource at up to 100 QPS) can meet this special demand to some extent. There is also another solution: when defining the resource name in code, embed the business identifier directly (for example, queryAmount#{TenantId}) and then configure each resource name separately in the console.
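
Here is a minimal sketch of a hot-spot parameter rule with a parameter exception item; the resource name, parameter index, thresholds, and the customer id value are assumptions:

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.EntryType;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.flow.param.ParamFlowItem;
import com.alibaba.csp.sentinel.slots.block.flow.param.ParamFlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.param.ParamFlowRuleManager;

public class HotParamDemo {

    public static void loadRules() {
        // Default: each distinct value of parameter #0 (e.g. customerId) gets at most 10 QPS.
        ParamFlowRule rule = new ParamFlowRule("queryAmount")
            .setParamIdx(0)
            .setCount(10);

        // Exception item: the big customer with customerId = 1 is allowed up to 100 QPS.
        ParamFlowItem bigCustomer = new ParamFlowItem()
            .setObject(String.valueOf(1))
            .setClassType(long.class.getName())
            .setCount(100);
        rule.setParamFlowItemList(Collections.singletonList(bigCustomer));

        ParamFlowRuleManager.loadRules(Collections.singletonList(rule));
    }

    public static void query(long customerId) {
        Entry entry = null;
        try {
            // Pass the hot parameter into the entry so Sentinel can count per value.
            entry = SphU.entry("queryAmount", EntryType.IN, 1, customerId);
            // business logic here
        } catch (BlockException ex) {
            // requests for this customerId are rate limited -> fallback
        } finally {
            if (entry != null) {
                entry.exit(1, customerId);
            }
        }
    }
}
```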

 

Why is there system adaptive protection?

This is also a kind of safety net. When the real traffic only slightly exceeds the rate-limit threshold, the overhead of rejecting requests is basically negligible. When the real traffic exceeds the threshold by N times, especially in huge-traffic scenarios such as the Double Eleven promotion, Spring Festival Gala red envelopes, or 12306 ticket sales, the overhead of rejecting requests can no longer be ignored; inside Alibaba this situation is called "the system being touched to death". In such scenarios, adaptive rate limiting does a good job.
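
A minimal sketch of a system protection rule; the thresholds are assumptions and should come from your own capacity planning:

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.slots.system.SystemRule;
import com.alibaba.csp.sentinel.slots.system.SystemRuleManager;

public class SystemRuleDemo {

    public static void loadRules() {
        SystemRule rule = new SystemRule();
        rule.setHighestSystemLoad(8.0);   // load1 threshold (assumed, e.g. roughly CPU cores * 2.5)
        rule.setHighestCpuUsage(0.8);     // reject inbound traffic when CPU usage exceeds 80%
        rule.setAvgRt(500);               // average RT of inbound traffic, in ms
        rule.setMaxThread(200);           // max concurrent inbound threads
        rule.setQps(1000);                // total inbound QPS of this node

        // System rules only apply to inbound traffic (EntryType.IN).
        SystemRuleManager.loadRules(Collections.singletonList(rule));
    }
}
```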

 

Do black and white list restrictions need to be configured?

This feature is very useful if you want to restrict traffic based on the request origin (only letting through requests from designated upstream systems). Sentinel also has a built-in "cluster point link" (resource invocation tree) view, which looks similar to call-chain tracing but serves a different purpose.
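
A minimal sketch of a white-list rule that only lets through requests from designated upstream systems; the resource and application names are assumptions, and the caller's identity has to be passed in through ContextUtil.enter(resource, origin), usually extracted from a request header or RPC attachment by an adapter:

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.context.ContextUtil;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.authority.AuthorityRule;
import com.alibaba.csp.sentinel.slots.block.authority.AuthorityRuleManager;

public class AuthorityRuleDemo {

    public static void loadRules() {
        AuthorityRule rule = new AuthorityRule();
        rule.setResource("queryAmount");
        rule.setStrategy(RuleConstant.AUTHORITY_WHITE);  // white list; AUTHORITY_BLACK for a black list
        rule.setLimitApp("order-service,pay-service");   // origins that are allowed through

        AuthorityRuleManager.loadRules(Collections.singletonList(rule));
    }

    public static void handleRequest(String originApp) {
        // Bind the caller identity (origin) to the current context before entering the resource.
        ContextUtil.enter("queryAmount", originApp);
        try {
            // SphU.entry("queryAmount") + business logic here
        } finally {
            ContextUtil.exit();
        }
    }
}
```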

 

What are the recommendations for using automatic circuit breaking and degradation?

Before configuring automatic circuit breaking, we first need to identify the potentially unstable services and then judge whether they can be degraded. Degradation usually means failing fast, but we can also customize the fallback result: for example, wrap and return a default result (degrading the content), return the cached result of a previous request (degrading freshness), or return a packaged failure prompt, and so on.

 

Degrading weak dependencies and secondary features is usually done manually by flipping switches, whereas Sentinel's circuit breaking is mainly judged and executed automatically on the calling side. Sentinel performs automatic circuit breaking and degradation based on statistics within the time window configured in the rule, such as average response time, error ratio, and error count.

 

For example, our system supports both "balance payment" and "bank card payment", and by default the interfaces for these two features share the same thread pool in the same application. RT jitter or a large number of timeouts on either side can cause a request backlog and exhaust the thread pool. Suppose that, from the business perspective, "balance payment" accounts for a larger share and has a higher protection priority. We can then configure "automatic circuit breaking" on the "bank card payment" interface (which depends on a third party and is unstable) so that it trips when RT keeps climbing or a large number of exceptions occur (provided this does not cause data inconsistency or other problems that affect the business flow), thereby ensuring that "balance payment" can continue to work normally.
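
A minimal sketch of such a rule for the "bank card payment" resource, using the RT-based strategy of the pre-1.8 rule model, together with a fast-fail fallback on the calling side; the resource name and thresholds are assumptions:

```java
import java.util.Collections;

import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

public class DegradeDemo {

    public static void loadRules() {
        DegradeRule rule = new DegradeRule();
        rule.setResource("bankCardPay");
        rule.setGrade(RuleConstant.DEGRADE_GRADE_RT);  // trip on average RT
        rule.setCount(500);                            // average RT threshold, in ms
        rule.setTimeWindow(10);                        // stay open for 10 seconds, then recover
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }

    public static String pay(String orderId) {
        Entry entry = null;
        try {
            entry = SphU.entry("bankCardPay");
            // call the unstable third-party bank card channel here
            return "OK";
        } catch (BlockException ex) {
            // Circuit is open: fail fast with a friendly prompt instead of piling up threads,
            // so "balance payment" keeps working normally.
            return "Bank card payment is temporarily unavailable, please try balance payment";
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```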

 

Summary

This article mainly covered the problems that the open-source version of Sentinel faces in large-scale production use and how to solve them, along with some practical configuration and usage experience. These lessons come from first-line production practice, and I hope they help readers avoid detours. If you have any questions, feel free to leave a comment to discuss.

About the author

Bu Ya, formerly of Alibaba and Ant Financial. Familiar with high-concurrency and high-availability architecture, stability assurance, and so on. Enthusiastic about technical research and sharing, and has published many widely read and reprinted articles such as "Distributed Transactions" and "Distributed Cache".


Source: blog.csdn.net/dinglang_2009/article/details/113094430