How can real-time Metrics computation on Didi's observability platform be both accurate and cost-effective?

At Didi, the Metrics data of the observability platform comes with real-time computing requirements, and those requirements are served by multiple sets of Flink tasks. There are multiple sets because each service needs different indicator calculations for its own business observation, which translates into different data-processing topologies. We have tried to abstract the computing needs that users share, but limited by Flink's task development model and its real-time computing framework, these indicator-computing tasks never became general enough. Using Flink for real-time Metrics calculation and maintaining multiple sets of Flink tasks brings the following problems:

  • General Metrics computing capabilities that should have been abstracted once are rebuilt again and again; their quality is uneven and nothing accumulates.

  • Processing logic is hardcoded ad hoc into the streaming task code and is hard to update and maintain.

  • Releasing, scaling up, or scaling down a Flink job requires restarting the task, which causes delayed, broken, or wrong indicator output.

  • The Flink platform is relatively expensive and accounts for a large share of our internal cost bill, so there is real cost pressure.

To solve these problems, we built a real-time computing engine of our own: observe-compute (OBC for short). The following is an introduction to how OBC is implemented.

Design goals

At the start of the project, OBC set the following design goals:

1. Build a general-purpose real-time computing engine for the domain of Metrics indicator calculation, with the following characteristics:

  • Aligned with industry standards: PromQL is used as the description language for stream-processing tasks

  • Flexible task management and control: policies are configuration-driven, computing tasks take effect in real time, and execution plans can be manually adjusted

  • Traceable computing links: calculation results can be traced back at the policy level

  • Cloud-native and containerized: the engine is deployed in containers and can scale in and out dynamically without downtime

2. Build a product that covers all Metrics calculation needs of the observability platform, replaces the duplicated computing tasks, cuts costs and improves efficiency, and reshapes the collection, transmission, and computation links for observation data.

So far, every engine feature except computing-link traceability has been implemented. OBC has been running stably online for several months, and several sets of core Flink computing tasks of the observability platform have been migrated to it. The migrated tasks are expected to save the platform a cumulative 1 million yuan by the end of the year.

Engine architecture

[Figure: OBC engine architecture]

The engine architecture is shown in the figure above and consists of three components:

  • obc-ruler is the control component. It provides service registration and discovery for the other components, ingests computing policies from external sources, and manages execution plans.

  • obc-distributor consumes Metrics from the Metrics message queue, matches data points against calculation policies, and forwards them to obc-worker according to each policy's execution plan.

  • obc-worker is the component that actually performs the calculations. It completes indicator calculations according to the execution plan and delivers the results to external persistent storage.

Availability discussion

Before going into the core logic of each component, let us first explain our thinking and trade-offs around availability, along with a few concepts introduced to meet the availability goals. This discussion will make the core logic of each component easier to follow.

Thinking about availability

The availability requirements on data in real-time computing scenarios can be divided into a real-time requirement and an accuracy requirement: data should be low-latency and of high quality. Accuracy in the broad sense can be described as "the data is correct and the data is not lost". The requirement that data be correct is what I call the accuracy requirement, and the requirement that data not be lost is what I call the integrity requirement.

For business observation data, looking at the final output, it also makes sense to discuss availability from these three perspectives:

  • Real-time: data should land in storage without large delays

  • Accuracy: data landed in storage must accurately reflect the state of the system

  • Integrity: data landed in storage must not have missing points or breakpoints

[Figure]

A simple observation-data pipeline can be reduced to: collection => storage. In such a pipeline, whatever is collected is what gets stored, so there is no need to discuss data accuracy. Between the two remaining requirements, real-time and integrity, the usual practice, in order to keep implementation complexity and operating overhead down, is to guarantee real-time and sacrifice integrity.

Scenarios where observation data takes part in real-time computation are more complex than the plain pipeline, and data accuracy has to be discussed further. Our internal data flow is observe-agent (collection end) => mq => obc-distributor => obc-worker => persistent storage. To guarantee accurate results, we would need the delay distribution of data reaching obc-worker within the same calculation window to satisfy our assumptions, at-least-once semantics from the collection end all the way to obc-worker, and deduplication plus calculation-window integrity logic implemented on obc-worker. For us, both the design and implementation cost and the extra computing resources needed to guarantee accuracy are on the high side; and even when we used Flink to compute Metrics indicators, we could not guarantee that the final output was exact. OBC therefore makes some necessary compromises in its design: on the premise that calculation results are produced in real time, it tries its best to ensure that results only break (completeness) and are never wrong (accuracy), but it does not promise that the output data is strictly accurate.

Availability design

First, we introduce a concept called cutover time. "Cutover" means switching; the concept is borrowed from m3aggregator. In OBC, the cutover time is set to a moment later than the time at which a configuration change is made, and it is the moment the configuration actually takes effect. This gives obc-distributor and obc-worker several chances to synchronize to the same configuration before the changed configuration becomes effective. The concept was introduced because, when obc-distributor forwards data to obc-worker, it picks the target worker instance on a hash ring formed by the obc-workers. Once a worker instance changes state and the hash ring changes, a problem appears: if the change took effect immediately, individual distributors would perceive the change one after another, and for a short period the worker hash ring in effect across the distributor cluster would be inconsistent. Delaying the actual effective time via the cutover gives the distributors more time to converge on the configuration change. The more specific availability design is divided into the following points:

1. An obc-worker crash, drift, or restart causes at most three breakpoints on some curves and no wrong points.

  • There is a heartbeat mechanism between worker and ruler, synchronized every 3 seconds.

  • The distributor synchronizes the worker hash ring from the ruler every 3 seconds.

  • The ruler updates hash-ring versions based on worker status; each cutover is one version, and historical versions are kept for up to 10 minutes.

    • The ruler judges a worker dead when its heartbeat has not been updated for 8 seconds, or when the worker actively calls the deregistration interface to take itself offline.

    • Hash-ring propagation delay: 8s

[Figure]

2. Drift or restart of obc-distributor causes neither breakpoints nor wrong points.

  • OBC is deployed on our internal cloud platform, which has some support for graceful restarts: before a container drifts or restarts, the platform sends a SIGTERM signal to the business process. The distributor listens for this signal and does two things when it receives it (a minimal sketch of this handling appears after this list):

    • Stop consuming from MQ

    • Flush the Metrics data points cached in memory that have not yet been sent to workers as quickly as possible.

3. A panic of obc-distributor, or a failure of the physical node it runs on, may cause breakpoints and wrong points.

  • In this case the distributor has no chance to clean up; the data cached in memory that has not yet been sent to workers is lost, which makes the calculation results wrong.

  • What does this mean for users? Our old Metrics calculation link was observe-agent (collection end) => mq => router => kafka => flink. The router module in that link does work similar to the distributor's and has the same problem; it has been running for years and users have barely noticed.

4. A policy update takes effect within 60s and causes neither breakpoints nor wrong points in the indicator output.

  • If, after a policy update, the cluster has not yet synchronized to the same version, short-lived wrong points could occur. To solve this, the cutover-time concept is also applied to policies, with the cutover time aligned to 60 seconds.
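As a minimal sketch of the graceful-shutdown handling described in point 2 above — the interfaces and function names here are illustrative assumptions, not OBC's actual code:

```go
package distributor

import (
	"os"
	"os/signal"
	"syscall"
)

// consumer and sendBuffer are hypothetical stand-ins for the distributor's
// MQ consumer and its in-memory buffer of points not yet sent to workers.
type consumer interface{ Stop() }
type sendBuffer interface{ FlushToWorkers() }

// runUntilTerm blocks until the cloud platform sends SIGTERM (container
// drift or restart), then stops MQ consumption and flushes cached points.
func runUntilTerm(c consumer, buf sendBuffer) {
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)

	<-sigCh              // termination signal from the platform
	c.Stop()             // 1. stop consuming new Metrics points from MQ
	buf.FlushToWorkers() // 2. send points still cached in memory to workers
}
```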

Introduction to each component

obc-ruler

The ruler module has two major functions, service registration/discovery and policy management, as shown in the following figure:

[Figure: obc-ruler functions — service registration/discovery and policy management]

The service registration/discovery capability is built in three layers. The kv abstraction layer and the memberlist kv layer use the third-party library grafana/dskit, which supports multiple kv stores such as consul and etcd beneath its kv abstraction. We ultimately chose memberlist kv, an eventually consistent kv built on the data synchronization capability of the gossip protocol. The reason is that the observability system itself should minimize external dependencies: if we relied on external components to provide observation capabilities while those components relied on our observation capabilities for their own stability, we would have a circular dependency. The upper-layer heartbeat and hashring are, plainly put, two keys registered in the kv store together with their conflict-merge logic. The heartbeat key stores each worker's address, registration time, latest heartbeat time, the hash tokens assigned to it, and so on; the hashring key stores the worker hash rings of multiple cutover versions.
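For intuition, here is a rough sketch of what the two keys could hold; the Go types and field names are our own illustration of the description above, not the engine's actual schema:

```go
package ruler

import "time"

// HeartbeatEntry is the value kept under the heartbeat key for one worker.
type HeartbeatEntry struct {
	Addr         string    // worker address
	RegisteredAt time.Time // registration time
	LastBeat     time.Time // latest heartbeat time
	Tokens       []uint32  // hash tokens assigned to this worker
}

// HashRingVersion is one cutover version of the worker hash ring.
type HashRingVersion struct {
	Cutover time.Time         // moment from which this version takes effect
	Ring    map[uint32]string // hash token -> worker address
}

// HashRingValue is the value kept under the hashring key: several versions
// keyed by cutover, with historical versions retained for up to 10 minutes.
type HashRingValue struct {
	Versions []HashRingVersion // sorted by Cutover in ascending order
}
```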

The policy management module is divided into four layers. The bottom layer is a set of loaders responsible for loading configurations from external policy sources; the figure lists the computing-policy loaders of several of our most important products. For existing computing tasks, we write a dedicated parser to convert their computing policies so they can be migrated to OBC; for new computing tasks, we require the computing requirement to be described uniformly in PromQL and parse it with the PromQL parser. The execution plan produced by a parser is a tree, and the optimizer performs some optimizations on it; currently this is just a simple merging of certain operators. The multi-version manager schedules policy updates as a whole and bans abnormal policies.

obc-distributor

The core functions of the distributor module are matching Metrics data points against calculation policies and forwarding them.

[Figure: obc-distributor]

Policy matching filters data points by the labels specified in each policy and tags the matched points with the corresponding policy ID. One of our larger online OBC clusters ingests nearly 10 million points per second with about 12,000 active calculation policies. With that much data, to keep the filtering efficient we place some restrictions on policies: a policy's filter must contain both the __name__ and __ns__ labels, and neither may use regular-expression matching. __name__ is the metric name; __ns__ stands for namespace and identifies the service cluster the metric is reported from. We build a two-level index on these two labels to speed up policy matching.
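A minimal sketch of what such a two-level index can look like — the structure and names are assumptions for illustration; the remaining matchers of each candidate policy would still be evaluated afterwards:

```go
package distributor

// policyIndex indexes policies by the two mandatory equality-matched
// labels: __ns__ (the reporting service cluster) first, then __name__.
type policyIndex struct {
	byNS map[string]map[string][]int64 // __ns__ -> __name__ -> policy IDs
}

// candidates returns the IDs of policies whose __ns__ and __name__
// matchers equal the corresponding labels of an incoming data point.
func (idx *policyIndex) candidates(ns, name string) []int64 {
	byName, ok := idx.byNS[ns]
	if !ok {
		return nil
	}
	return byName[name]
}
```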

The relabel step adds, deletes, or modifies the labels of data points as required by the policy. For a uniform description syntax, we express these rules as PromQL functions and restrict their vector parameters to be either a selector or another relabel function. The mapping step is similar to a dimension-table join in other streaming systems; it is a compromise made to onboard existing computing tasks and will not be described in detail.

The process by which the distributor selects a worker (a code sketch follows this list):

  • Align the data point's event_time: align_time = event_time - event_time % resolution, where resolution is the output resolution of the indicator given in the policy

  • Select the worker hash ring: using align_time, find the latest hash ring in the ring list whose cutover is not greater than align_time

  • Select the worker instance: compute a hash over planid + align_time + the values of a specified set of labels, and use the hash value to locate the worker instance on the ring. The set of labels is produced by the ruler when it analyzes the policy, and is usually empty for policies that process a small amount of data.
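The three steps above could be sketched roughly as follows; the hashRing type, the fnv hash, and the key layout are illustrative assumptions rather than OBC's exact implementation:

```go
package distributor

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashRing is a minimal stand-in for one cutover version of the worker
// ring, with a lookup function mapping a hash value to a worker address.
type hashRing struct {
	Cutover int64 // unix seconds at which this version takes effect
	Lookup  func(h uint32) string
}

// pickWorker selects the worker instance for one data point of one policy.
// rings must be sorted by ascending cutover and contain at least one
// version whose cutover is not greater than the aligned time.
func pickWorker(rings []hashRing, eventTime, resolution, planID int64,
	labelValues []string) string {

	// 1. align the point's event time to the policy's output resolution
	alignTime := eventTime - eventTime%resolution

	// 2. latest ring whose cutover is not greater than alignTime
	i := sort.Search(len(rings), func(i int) bool {
		return rings[i].Cutover > alignTime
	}) - 1
	ring := rings[i]

	// 3. hash planid + align_time + the specified label values onto the ring
	h := fnv.New32a()
	fmt.Fprintf(h, "%d/%d", planID, alignTime)
	for _, v := range labelValues {
		fmt.Fprintf(h, "/%s", v)
	}
	return ring.Lookup(h.Sum32())
}
```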

obc-worker

The core function of the worker is Metrics indicator calculation.

[Figure: obc-worker]

The distributor guarantees that Metrics points belonging to the same policy, the same aggregation dimensions, and the same calculation window are forwarded to the same worker. Each Metrics point received by a worker carries its policy ID, which the worker uses to look up the policy content to compute. The smallest logical unit of a calculation policy is the Action: PromQL functions, binary operations, and aggregation operations are all translated into Actions. In the worker's aggregation matrix, each Action owns a series of time windows aligned to the resolution; within a time window, data that must be calculated together is written to the same computing unit, which is the smallest physical computing unit in the worker. Unless necessary, data entering the same computing unit is not buffered but is folded into the calculation immediately.

This may be hard to follow, so here is an example using the calculation rule sum by (caller, callee) (rpc_counter). The filtering of the raw rpc_counter metric is done on the distributor, and sum by (caller, callee) is handled as one Action. The Action's type is an aggregation operation, the aggregation is a sum, and the output dimensions after aggregation are caller and callee. When processing data points, the worker sends points with identical values for the caller and callee labels to the same aggregation unit, and every time the aggregation unit receives a point it performs sum += point.value.
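A stripped-down sketch of what such an aggregation Action and its computing units could look like — the types and names here are illustrative, not the worker's real data structures:

```go
package worker

// sumUnit is one computing unit: it folds the points of one
// (caller, callee) pair in one aligned window into a running sum
// instead of buffering the raw points.
type sumUnit struct{ sum float64 }

func (u *sumUnit) add(v float64) { u.sum += v }

// sumByAction sketches the Action for "sum by (caller, callee) (rpc_counter)":
// one map of computing units per aligned time window.
type sumByAction struct {
	windows map[int64]map[[2]string]*sumUnit // alignTime -> (caller, callee) -> unit
}

func (a *sumByAction) ingest(alignTime int64, caller, callee string, value float64) {
	if a.windows == nil {
		a.windows = make(map[int64]map[[2]string]*sumUnit)
	}
	units, ok := a.windows[alignTime]
	if !ok {
		units = make(map[[2]string]*sumUnit)
		a.windows[alignTime] = units
	}
	key := [2]string{caller, callee}
	u, ok := units[key]
	if !ok {
		u = &sumUnit{}
		units[key] = u
	}
	u.add(value) // sum += point.value
}
```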

One thing the worker has to get right when handling binary and aggregation operations is the window wait time: how long does a window wait for all the data it needs to arrive? This value directly affects both the timeliness and the accuracy of the indicator output. Based on the transmission-delay distribution of our own Metrics, we set the following defaults (a small helper is sketched after the list):

  • If the step of the original metric is less than or equal to 10s, the window wait time is 25s.

  • If the step of the original metric is greater than 10s, the window wait time is 2*step + 5; if that exceeds 120s, it is capped at 120s.
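Expressed as a small helper (a direct transcription of the two rules above, with seconds as the unit):

```go
package worker

// windowWait returns the default number of seconds a calculation window
// stays open waiting for the data it needs, given the step of the
// original metric in seconds.
func windowWait(stepSeconds int64) int64 {
	if stepSeconds <= 10 {
		return 25
	}
	wait := 2*stepSeconds + 5
	if wait > 120 {
		return 120
	}
	return wait
}
```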

This section describes each component only briefly, covering just the core flow; many of the trade-offs and optimizations we made around performance and functionality will be covered separately when we get the chance.

Supplementary content

To make the above easier to understand, here is some supplementary content.

Calculation policy example

[Figure: calculation policy example]

Inside OBC a policy is represented as a tree. Leaf nodes must be either Metrics filtering conditions or constants, and the leaves may not all be constants (such a policy is meaningless for an event-driven computing engine). A Metrics filtering condition is called a Filter in the policy; Filters are executed on the distributor, and the remaining Actions are executed on the worker. Unless a policy is configured otherwise, the Metrics it matches within the same window are sent to the same worker, where all subsequent Actions are completed; data from different windows may go to different workers. The reason is that spreading load only at the policy granularity cannot achieve load balance across workers.
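As a concrete illustration of the tree shape, here is a hypothetical policy, sum by (caller, callee) (rpc_counter{__ns__="trade"}), expressed with made-up node types: the Filter leaf runs on the distributor and the aggregation Action runs on the worker.

```go
package ruler

// planNode is an illustrative execution-plan node; OBC's real node types
// are not shown in this article.
type planNode struct {
	Kind     string // "filter" runs on the distributor, "action" on the worker
	Expr     string
	Children []*planNode
}

// examplePlan builds the tree for the hypothetical policy
// sum by (caller, callee) (rpc_counter{__ns__="trade"}).
func examplePlan() *planNode {
	filter := &planNode{
		Kind: "filter",
		Expr: `rpc_counter{__ns__="trade"}`, // leaf: Metrics filtering condition
	}
	return &planNode{
		Kind:     "action",
		Expr:     "sum by (caller, callee)", // aggregation Action
		Children: []*planNode{filter},
	}
}
```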

One special case deserves attention: if a single window of some policy contains too much data for a single worker to handle, we manually configure the policy to be spread across multiple workers along some of its aggregation dimensions. If it cannot be spread, or a single worker still cannot cope after spreading, we choose to ban the policy. Because calculation policies are required to filter on the __ns__ label and can only compute over the Metrics of a single service, we have not yet run into a case that required banning a policy. For the case where a single worker cannot cope, one could also implement cascaded computation between workers and process a large task in a merged fashion; if the need arises, this capability can be added quickly on the current architecture.

Support for PromQL

OBC intends PromQL to be the way users describe computing policies, which lowers the cost of understanding and onboarding. As of this writing, our coverage of PromQL syntax and functions is not high. The main reason is that many operations are simply not needed by the stock policies we migrated, so we implement support incrementally, driven by demand. There is also some PromQL syntax that is not suitable for real-time computation, such as the offset modifier, the @ modifier, and subqueries.

Earlier we mentioned that the distributor guarantees that data of the same calculation window of the same policy is forwarded to the same worker. You may be wondering about range vectors: for an operation like irate(http_request_total[10s]), how is irate calculated if the two neighbouring points it needs are forwarded to different workers? If you had this doubt, then I should congratulate myself: you read the earlier sections carefully and understood them.

Range vectors, including the range vector selector and range vector functions, are not supported in the current version of OBC, simply because we do not need them yet. You can think of all our internal data as the Prometheus Gauge type. The so-called Counter metrics for requests on our side actually record the request count within each 10s cycle; values are not accumulated across cycles, as if our Counter data had already had an operation like increase(http_request_total[10s]) applied before being reported. We will certainly consider supporting range-vector related syntax later, but with some syntactic restrictions.

Accurate cluster-granularity request latency

Before it supported Prometheus exporter collection and PromQL data retrieval, Didi's observability platform had no histogram data type. For service request latency, the collection end uses the t-digest algorithm to approximate the latency distribution of each service instance's interfaces and by default reports the 99th, 95th, 90th, and 50th percentile values. Because the original distribution information is missing, we could not give users a reasonably accurate cluster-granularity latency quantile for an interface; they could only approximate it with the average or maximum of the per-instance quantile metrics.

We solved this problem on OBC as well. Concretely, OBC cooperates with the collection end: when the collection end reports latency quantile metrics, it also reports the bucketed latency distribution of the interface. The bucketed data is not written to persistent storage and is only ingested by OBC on demand. OBC extends the corresponding operation into a PromQL aggregation operator named percentile, whose action is to merge the per-instance bucketed distributions into a new latency distribution and produce the requested quantiles of that distribution on demand.
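The idea behind the percentile operator can be sketched as follows; the bucket layout and the merge-then-walk logic are simplifying assumptions (the collection end actually uses t-digest style summaries), so treat this as an outline rather than the actual operator:

```go
package worker

import "sort"

// bucket is one bin of an instance's latency distribution: the number of
// requests whose latency fell into the bin ending at UpperMs.
type bucket struct {
	UpperMs float64
	Count   float64
}

// percentile merges the bucketed distributions reported by all instances
// of a cluster and reads the requested quantile (e.g. 0.99) off the
// merged distribution, returning the upper bound of the matching bucket.
func percentile(instances [][]bucket, q float64) float64 {
	merged := map[float64]float64{}
	total := 0.0
	for _, bs := range instances {
		for _, b := range bs {
			merged[b.UpperMs] += b.Count
			total += b.Count
		}
	}
	if total == 0 {
		return 0
	}
	uppers := make([]float64, 0, len(merged))
	for u := range merged {
		uppers = append(uppers, u)
	}
	sort.Float64s(uppers)

	// walk the merged histogram until the cumulative count reaches q*total
	target, cum := q*total, 0.0
	for _, u := range uppers {
		cum += merged[u]
		if cum >= target {
			return u
		}
	}
	return uppers[len(uppers)-1]
}
```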

Summary and Outlook

To date, OBC has been running stably online for several months. Several core Metrics computing tasks of the observability platform have been migrated to it, with significant cost savings.

As for the project's next iterations, here is one core direction: we hope to bring the collection end into this computing engine and integrate collection with computation, pushing users' computing needs as far forward as possible so that whatever can be pre-computed on the collection end is pre-computed there.

After the computing engine runs, we have even more observation data: the original data plus the new calculation results. Where does all this data end up being stored? Do we choose row or column storage, an existing solution or something self-built, and how do we handle the sheer volume? The next article will tell the story of observation data storage at Didi.



Cloud Native Night Talk

How do you support the computation of observability indicators in your production environment? Feel free to leave a comment. If you would like to talk with us further, you can also message the back end of this account directly.

The author will pick the most thoughtful comment and send a Didi-customized suitcase, wishing you a worry-free trip over the October 1 holiday. The draw takes place at 9 pm on September 26.

