Usage Guide | Tracing Made Simple: Link Topology

Author: Ya Hai

Over the past year, Xiaoyu's business department has carried out a vigorous microservice migration, splitting a large number of business middle-platform applications into finer-grained microservices. To prepare for the upcoming Double Eleven promotion and its stability assurance work, Xiaoyu's supervisor asked her to sort out the key upstream and downstream dependencies of the order center within a week and to bring all parties together to align on the assurance plan in advance. The task worried Xiaoyu a great deal. Normally she dealt only with her direct upstream and downstream business parties; now she had to map the complete dependency path of the order center, and even after much fretting she still did not know where to start. In desperation, Xiaoyu once again turned to the almighty Xiao Ming for help.

In response to Xiaoyu's question, Xiao Ming proposed an idea. A call chain can track the complete call path of a single request, but a single call chain cannot show every call branch, nor can it reflect the strength of a dependency through traffic volume, and combing through call chains one by one costs too much. So why not have a program aggregate a batch of call chains that share a characteristic (such as passing through a certain application or calling a certain interface) into a tree? By analyzing the shape of the tree and the traffic on it, the key nodes and dependency paths can be sorted out quickly. This is the prototype of the link topology feature.

[Figure: call chains of entry application A aggregated into a tree]

As shown in the figure above, entry application A relies on multiple downstream applications at different depths, and each call may take a different path. To sort out the complete call dependencies of application A, multiple call chains can be aggregated into a tree. Each path from the root node to a leaf node represents a traffic flow path, and the state of each node reflects the characteristics of the traffic, such as call count, latency, and error rate. This method of analyzing end-to-end traffic paths and states by aggregating call chains is link topology. The relationship between a link topology and a call chain is like that between a sample set and a single discrete sample point: the former reflects the overall distribution and effectively avoids the influence of a single sample's randomness on the evaluation.
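The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular tracing platform: it assumes each call chain has already been flattened into one root-to-leaf path of `(name, duration_ms, is_error)` hops, and the names `TopoNode` and `aggregate` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TopoNode:
    """One node of the aggregated tree, e.g. an application on a call path."""
    name: str
    calls: int = 0
    errors: int = 0
    total_ms: float = 0.0
    children: dict = field(default_factory=dict)

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0

    @property
    def avg_ms(self) -> float:
        return self.total_ms / self.calls if self.calls else 0.0


def aggregate(chains):
    """Merge call chains into one tree; hops sharing a path prefix share a node."""
    root = TopoNode(name="<root>")
    for chain in chains:
        node = root
        for name, duration_ms, is_error in chain:
            node = node.children.setdefault(name, TopoNode(name))
            node.calls += 1
            node.total_ms += duration_ms
            node.errors += int(is_error)
    return root


# Three made-up call chains that all enter through application A.
chains = [
    [("A", 120, False), ("B", 80, False), ("D", 30, False)],
    [("A", 300, True), ("C", 250, True)],
    [("A", 100, False), ("B", 60, False)],
]
tree = aggregate(chains)
a = tree.children["A"]
print(a.calls, sorted(a.children), a.children["B"].avg_ms)
```

Each node ends up carrying the call count, average latency, and error rate of all the chains that passed through it, which is what makes the tree more informative than any single chain.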

01 Classic application scenarios of link topology

The core value of link topology lies in analyzing the dependency paths and states between nodes to support strong and weak dependency sorting, bottleneck point analysis, impact surface analysis, and fault propagation chain analysis. Let's take a closer look at these classic usages.

(1) Strong and weak dependencies

The most typical and best-known application of link topology is dependency sorting, especially in large-scale distributed systems, where the dependencies among tens of thousands of applications are complex enough to drive operations engineers to despair. The figure below shows the application topology of Taobao's core links in 2012. The dense, cobweb-like dependencies are far beyond what can be sorted out by hand, and with microservices developing rapidly this situation is no longer unusual.

[Figure: application topology of Taobao's core links in 2012]

In a complex business environment, it is not enough to draw the dependency graph; we also need to identify which dependencies are strong ones that affect the core business and which are harmless weak ones. Strong dependencies deserve more manpower and resources to build a more complete safeguard system, such as phone-call alerting and joint stress testing. For weak dependencies, consider whether they can be removed, or set up a second-tier safeguard.

There are mainly the following ways to distinguish between strong and weak dependencies:

  1. Distinguish by traffic volume. This is a simple, crude approach: if the traffic exceeds a certain threshold or ratio, the dependency is classified as strong; otherwise it is weak. The advantage is that it is simple and clear, and the link tracing platform can apply it automatically without human intervention. The disadvantage is that it is not accurate enough: some special, critical dependencies carry little traffic yet directly affect business stability.

  2. Distinguish by synchronous or asynchronous call type. The advantage is that it is simple and easy to apply, and it filters out the noise of asynchronous, non-blocking calls such as messages. The disadvantage is that it filters poorly in businesses dominated by synchronous calls, and may misjudge in businesses dominated by asynchronous calls (such as red envelopes or trending-content pushes).

  3. Manual labeling. Because businesses differ, it is more reliable for each business owner to manually label the strength of the direct downstream dependencies. However, this demands experienced people and a lot of time, and it cannot keep up with business changes.

  4. Semi-manual labeling. The link tracing platform first makes a preliminary classification based on traffic volume and synchronous/asynchronous call type, and experienced engineers then review and correct the labels. This saves labor and can adapt, to a limited extent, to future business changes.
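The semi-manual approach can be sketched as an automatic first pass followed by manual overrides. The function name `classify_dependencies`, the edge fields, and the 20% traffic ratio threshold are all assumptions made for this illustration; a real platform would choose its own rules.

```python
def classify_dependencies(edges, traffic_ratio_threshold=0.2, manual_labels=None):
    """Classify each direct downstream dependency as 'strong' or 'weak'.

    edges: list of dicts with keys 'target', 'calls', 'sync' (hypothetical schema).
    manual_labels: optional {target: 'strong' | 'weak'} corrections from a reviewer.
    """
    manual_labels = manual_labels or {}
    total_calls = sum(e["calls"] for e in edges) or 1
    result = {}
    for e in edges:
        ratio = e["calls"] / total_calls
        # Automatic first pass: synchronous and carrying a large share of traffic.
        auto = "strong" if e["sync"] and ratio >= traffic_ratio_threshold else "weak"
        # A manual label, when present, always wins over the automatic guess.
        result[e["target"]] = manual_labels.get(e["target"], auto)
    return result


# Made-up downstream dependencies of an order center.
edges = [
    {"target": "inventory", "calls": 900, "sync": True},
    {"target": "payment", "calls": 50, "sync": True},     # low traffic but critical
    {"target": "notify-mq", "calls": 400, "sync": False},  # asynchronous message
]
print(classify_dependencies(edges))
print(classify_dependencies(edges, manual_labels={"payment": "strong"}))
```

The second call shows why the manual step matters: the low-traffic `payment` dependency is misjudged as weak by the automatic rules and is corrected by the reviewer's label.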

Sorting out strong and weak dependencies is relatively infrequent work. It usually happens when preparing for a big promotion or a stability assurance campaign, when launching new applications or decommissioning old ones, or when migrating sites to the cloud. For example, before each year's Double Eleven promotion, Alibaba sorts out the strong and weak dependencies of its core businesses and compares them with previous years so that it can provide better targeted protection.

(2) Bottleneck point and impact surface analysis

In problem diagnosis, the most common uses of link topology are bottleneck point analysis and impact surface analysis. The former looks downstream from the current node for the cause of a problem and is mainly used for locating problems; the latter looks upstream from the problem node to determine the affected scope and is mainly used for grading business risk.

Next, we use a database exception case to compare the two usages and their perspectives, as shown in the figure below.

  • One morning, the administrator of application A (Little A) received user feedback that service responses were timing out. The link topology showed that the interface of application D, which application A depends on, was slow, and that the database interface called by application D was also abnormal. Little A therefore told the owner of application D (Little D) to check the database connection status immediately and restore availability as soon as possible. This is bottleneck point analysis.

  • Because Little D was not familiar with the business, half an hour passed without an effective recovery action, and the problem escalated into an exception on the database server. After receiving the database server alarm, the engineer responsible for DB operations traced the business impact surface through the link topology and found that the two directly dependent applications, D and E, had a large number of slow SQL queries, which caused varying degrees of response timeouts in the indirectly dependent applications A and B. The engineer decisively scaled out the database, and normal access was finally restored for all applications. This is impact surface analysis assisting an operations decision.

[Figure: bottleneck point analysis and impact surface analysis in the database exception case]
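The two perspectives in this case amount to walking the same dependency graph in opposite directions. The sketch below is illustrative only: the graph shape (A calls D, B calls E, D and E call the database) is inferred from the story, and the function names `find_bottlenecks` and `impact_surface` are hypothetical.

```python
from collections import deque

# caller -> callees; "DB" stands for the shared database.
calls = {"A": ["D"], "B": ["E"], "D": ["DB"], "E": ["DB"], "DB": []}
# Nodes currently flagged as abnormal in the topology (made-up state).
abnormal = {"A", "D", "DB"}


def find_bottlenecks(start, calls, abnormal):
    """Bottleneck analysis: follow abnormal nodes downstream to the deepest ones."""
    bottlenecks, stack, seen = [], [start], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        bad_children = [c for c in calls.get(node, []) if c in abnormal]
        if bad_children:
            stack.extend(bad_children)
        elif node in abnormal:
            # Abnormal itself, with no abnormal downstream: a likely bottleneck.
            bottlenecks.append(node)
    return bottlenecks


def impact_surface(problem_node, calls):
    """Impact analysis: collect every direct or indirect upstream caller."""
    callers = {}
    for caller, callees in calls.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)
    impacted, queue = set(), deque([problem_node])
    while queue:
        for caller in callers.get(queue.popleft(), []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


print(find_bottlenecks("A", calls, abnormal))   # downstream view from application A
print(sorted(impact_surface("DB", calls)))      # upstream view from the database
```

Starting from application A the walk ends at the database, while starting from the database the walk reaches D and E directly and A and B indirectly, matching the two roles in the story.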

Bottleneck point and impact surface analysis rely mainly on static topology data over a period of time. They do not reflect how node states change over time, so they cannot retrace how a fault propagated. As shown on the right side of the figure above, from this topology alone it is hard to tell whether application D or application E caused the database server exception. Is it possible, then, to replay the changes in the link topology dynamically and analyze the problem source and propagation trend more intuitively? The answer is yes, as described next.

(3) Fault Propagation Chain Analysis

If the time dimension is ignored, the boundary between problem source and impact surface is not very clear: an affected party may become a new problem source and cause a larger failure. To reconstruct how a fault evolved more faithfully, we need to observe and compare a set of static link topology snapshots along a continuous timeline, and recover the fault propagation chain from the changes in node states between snapshots. This is like reconstructing a crime from surveillance video, which is more reliable than relying on a single photo.

Take the database failure in the previous section as an example. At first, requests from application D missed the cache, and the resulting flood of database requests produced slow SQL, which in turn caused response timeouts in the upstream application. As the situation deteriorated, the database server became overloaded, which affected the normal calls of application E. In the end, applications A and B suffered large numbers of response timeouts, and the API gateway began to deny access because it ran out of connections, making services unavailable over a much larger area.
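Comparing snapshots can be reduced to a simple question: at which snapshot did each node first turn abnormal? The following sketch assumes each snapshot is a `(timestamp, {node: is_abnormal})` pair sorted by time; the timestamps (minutes since the start of the incident) and node states are made up to mirror the example, and `propagation_order` is a hypothetical helper name.

```python
def propagation_order(snapshots):
    """Return [(timestamp, node)] ordered by when each node first became abnormal."""
    first_seen = {}
    for ts, states in snapshots:
        for node, is_abnormal in states.items():
            if is_abnormal and node not in first_seen:
                first_seen[node] = ts
    return sorted((ts, node) for node, ts in first_seen.items())


# Made-up snapshots taken every five minutes during the incident.
snapshots = [
    (0,  {"D": True,  "DB": False, "A": False, "E": False, "B": False, "gateway": False}),
    (5,  {"D": True,  "DB": True,  "A": True,  "E": False, "B": False, "gateway": False}),
    (10, {"D": True,  "DB": True,  "A": True,  "E": True,  "B": False, "gateway": False}),
    (15, {"D": True,  "DB": True,  "A": True,  "E": True,  "B": True,  "gateway": True}),
]
for ts, node in propagation_order(snapshots):
    print(f"t+{ts:>2} min: {node} became abnormal")
```

Read top to bottom, the output is the propagation chain: application D first, then the database and application A, then application E, and finally application B and the gateway. A single snapshot taken at minute 15 would show all six nodes as abnormal and give no hint of this order.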

In a real production environment, topology dependencies and fault propagation can be far more complicated. To simplify the analysis, node states can be abstracted into various abnormal events according to certain rules. Observing the number of abnormal events at different times also helps in judging how a fault occurs, propagates, and recovers, as shown in the figure below.

[Figure: number of abnormal events at different times]
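Turning node states into abnormal events and counting them over time can be sketched as follows. The two rules and their thresholds (error rate above 5%, average latency above 1000 ms) are assumptions chosen for this example, as are the names `RULES`, `extract_events`, and `events_per_snapshot`.

```python
from collections import Counter

# Hypothetical rules that turn a node state into named abnormal events.
RULES = {
    "high_error_rate": lambda state: state["error_rate"] > 0.05,
    "slow_response": lambda state: state["avg_ms"] > 1000,
}


def extract_events(timestamp, node_states):
    """Return one (timestamp, node, rule_name) tuple per rule a node violates."""
    return [
        (timestamp, node, rule_name)
        for node, state in node_states.items()
        for rule_name, is_hit in RULES.items()
        if is_hit(state)
    ]


def events_per_snapshot(snapshots):
    """Count abnormal events for every snapshot, including snapshots with none."""
    counts = Counter()
    for ts, node_states in snapshots:
        counts[ts] += len(extract_events(ts, node_states))
    return sorted(counts.items())


# Made-up node states at three points in time (seconds).
snapshots = [
    (0,   {"D": {"error_rate": 0.10, "avg_ms": 300},  "A": {"error_rate": 0.0, "avg_ms": 200}}),
    (60,  {"D": {"error_rate": 0.20, "avg_ms": 1500}, "A": {"error_rate": 0.0, "avg_ms": 1200}}),
    (120, {"D": {"error_rate": 0.00, "avg_ms": 100},  "A": {"error_rate": 0.0, "avg_ms": 150}}),
]
print(events_per_snapshot(snapshots))
```

The event count rises as the fault spreads and drops back to zero after recovery, which gives a compact view of the fault's life cycle without inspecting every node.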

02 Link topology aggregation dimension

The aggregation dimension of a link topology determines the type of its nodes and gives different user roles different analysis perspectives. In practice, the three most typical aggregation dimensions are application, interface, and custom dimensions, which correspond to application topology, interface topology, and business topology.

  • Application topology, as the name implies, aggregates links by application name and reflects the dependencies between applications and the overall traffic status. Because the aggregation is coarse-grained, local anomalies are masked by averages, so it is not suited to fine-grained problem diagnosis; it is better suited to global dependency sorting and to demarcating major faults. Its users tend to be PEs responsible for operations or SREs responsible for stability.

  • Interface topology aggregates links by service interface. Compared with application topology, it is closer to the developer's perspective, because the object of daily iteration is usually a specific service interface. Whether a new interface is going online, an old interface is being retired, or a core interface needs stability assurance, an interface-granularity link topology fits the development and testing process and the division of responsibilities better. Applications and interfaces are the basic objects in link tracing, so the corresponding topologies can be generated automatically by the link tracing platform without much manual intervention, which makes them convenient to use.

  • Business topology is a business-oriented link topology aggregated on custom dimensions. It is usually one level deeper than the interface dimension. For example, an order interface can be further subdivided by commodity category into women's clothing or home appliances, as shown in the figure below. Generally, the link tracing platform cannot generate a business topology automatically; users need to define aggregation rules based on their business characteristics. Custom dimensions also come from many sources, such as manually added custom tags (Attributes), HTTP request input and output parameters, or environment tags of the host machine. The open source community lacks a corresponding standard here, and the commercial implementations of the major vendors differ considerably.

[Figure: business topology of an order interface subdivided by commodity category]
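One way to picture the three dimensions is as three different grouping keys applied to the same span data. The sketch below is an illustration under assumptions: the span schema (`app`, `interface`, `attributes`) and the `category` attribute are hypothetical, and only call counts are aggregated.

```python
from collections import Counter


def app_key(span):
    """Application topology: one node per application."""
    return span["app"]


def interface_key(span):
    """Interface topology: one node per (application, interface)."""
    return (span["app"], span["interface"])


def business_key(attribute):
    """Business topology: split an interface further by a custom attribute."""
    def key(span):
        value = span["attributes"].get(attribute, "unknown")
        return (span["app"], span["interface"], value)
    return key


def aggregate_by(spans, key_fn):
    """Count calls per node for the chosen aggregation dimension."""
    return Counter(key_fn(span) for span in spans)


# Made-up spans of an order center.
spans = [
    {"app": "order-center", "interface": "createOrder", "attributes": {"category": "women's clothing"}},
    {"app": "order-center", "interface": "createOrder", "attributes": {"category": "home appliances"}},
    {"app": "order-center", "interface": "createOrder", "attributes": {"category": "women's clothing"}},
    {"app": "order-center", "interface": "queryOrder", "attributes": {}},
]
print(aggregate_by(spans, app_key))
print(aggregate_by(spans, interface_key))
print(aggregate_by(spans, business_key("category")))
```

The same four spans collapse into one node at application granularity, two nodes at interface granularity, and three nodes once the custom `category` attribute is taken into account, which is why only the last one needs user-defined rules.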

To sum up, the link topologies generated by different aggregation dimensions have different functional positioning and characteristics, as shown in the following table:

[Table: comparison of application topology, interface topology, and business topology]

03 Link topology generation method

To preserve the end-to-end correlation information of link data as much as possible, a link topology is usually generated by aggregating call chain detail data directly, rather than by a second aggregation over metric data. Careful readers may notice a challenging technical problem hidden here: how to balance real-time performance, accuracy, and flexibility when aggregating massive link data. Ideally, we would quickly generate the most faithful link topology from the detail data of every call chain that meets the conditions, achieving "fast, accurate, and flexible" all at once. In practice, however, we cannot have it all. We have to make trade-offs among "fast", "accurate", and "flexible", and this gives rise to different schools of link topology generation.

(1) Real-time aggregation

Real-time aggregation dynamically filters call chains and generates the topology according to query conditions specified by the user. Its advantage is high real-time performance and great flexibility: you can specify arbitrary conditions, such as viewing the topology generated from call chains that took longer than 3 s, or the topology containing only abnormal calls. Its disadvantage is that when the volume of call chain detail data matching the conditions exceeds a certain threshold, the real-time aggregation computing node may be overwhelmed. Real-time aggregation therefore usually gives up some accuracy in exchange for extremely high flexibility and real-time performance.
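A query-time aggregation can be sketched as a filter followed by an edge count. The trace schema and the function name `realtime_topology` are hypothetical, and the `max_traces` cap is an assumption added here to illustrate one way of keeping the computing node from being overwhelmed; once the cap is hit, the result covers only part of the matching data.

```python
def realtime_topology(traces, predicate, max_traces=1000):
    """Aggregate matching traces into edge call counts at query time.

    Returns (edges, used, truncated); truncated=True means the cap was hit and
    the topology was built from only the first max_traces matching traces.
    """
    edges, used, truncated = {}, 0, False
    for trace in traces:
        if not predicate(trace):
            continue
        if used >= max_traces:
            truncated = True
            break
        used += 1
        for caller, callee in trace["edges"]:
            edges[(caller, callee)] = edges.get((caller, callee), 0) + 1
    return edges, used, truncated


# Made-up traces; the query keeps only call chains longer than 3 s.
traces = [
    {"duration_ms": 3500, "has_error": False, "edges": [("A", "B"), ("B", "DB")]},
    {"duration_ms": 200,  "has_error": False, "edges": [("A", "C")]},
    {"duration_ms": 4100, "has_error": True,  "edges": [("A", "B")]},
]
slow = lambda t: t["duration_ms"] > 3000
print(realtime_topology(traces, slow))
print(realtime_topology(traces, lambda t: t["has_error"]))
```

Because the predicate is an ordinary function supplied at query time, any condition the user can express over a trace yields its own topology, which is where the flexibility comes from.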

(2) Offline aggregation

Offline aggregation periodically generates topology data according to a set of predefined aggregation rules. For example, basic topology data for applications and interfaces, which involves no filter conditions, can be generated through offline aggregation. Its advantage is that the horizontal scalability of offline computing can support aggregation over massive link data, and the results are more accurate. Its disadvantage is poor real-time performance: it takes a long time from a change in the aggregation rules to the generation of a new topology. Offline aggregation is often used for the accurate computation of global topology data.
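The horizontal scalability mentioned above comes from the fact that counts can be computed per partition and then merged. The sketch below imitates that map-and-merge shape in a single process; the span schema and function names are hypothetical, and a real offline job would run on a batch computing framework rather than in a loop.

```python
from collections import Counter
from functools import reduce


def aggregate_partition(spans):
    """Predefined rule: count calls per (caller, callee) edge, with no filters."""
    return Counter((s["caller"], s["callee"]) for s in spans)


def merge(partials):
    """Combine the partial results produced independently for each partition."""
    return reduce(lambda left, right: left + right, partials, Counter())


def offline_job(partitions):
    """One periodic run: aggregate every partition, then merge the results."""
    return merge(aggregate_partition(p) for p in partitions)


# Made-up span data for one time window, split into two partitions.
partitions = [
    [{"caller": "A", "callee": "B"}, {"caller": "B", "callee": "DB"}],
    [{"caller": "A", "callee": "B"}, {"caller": "A", "callee": "C"}],
]
print(offline_job(partitions))
```

Since every span in the window is processed, the counts are exact, but the rule is fixed in advance: changing it means waiting for the next run to see a new topology.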

(3) Pre-aggregation

Pre-aggregation is a topology generation method that is feasible in theory. The idea is to keep passing the complete call path information of the entire link downstream, starting from the entry node, and to generate the corresponding pre-aggregated metrics on the client side. Setting aside the length limit on the propagated information and the client-side pre-aggregation overhead, the advantage of this method is that it spares the server the work of converting detail data into topology data, and so achieves both speed and accuracy. The disadvantage is that it does not support custom rules; otherwise the propagation and pre-aggregation overhead would rise sharply and affect the performance and stability of the business process. The schematic diagram of the pre-aggregation principle is shown below.

[Figure: schematic diagram of the pre-aggregation principle]
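The principle can be simulated with in-process calls standing in for remote ones. In this sketch every service appends its name to a path carried in the request headers and increments a local counter keyed by the full path. The header name `x-call-path`, the `Service` class, and the `>` separator are all hypothetical choices for illustration.

```python
from collections import Counter

PATH_HEADER = "x-call-path"  # hypothetical header that carries the upstream path


class Service:
    """A toy service that propagates the call path and pre-aggregates locally."""

    def __init__(self, name, downstream=()):
        self.name = name
        self.downstream = list(downstream)
        self.path_counts = Counter()  # pre-aggregated metric, keyed by full path

    def handle(self, headers=None):
        headers = headers or {}
        upstream_path = headers.get(PATH_HEADER, "")
        path = f"{upstream_path}>{self.name}" if upstream_path else self.name
        self.path_counts[path] += 1
        for service in self.downstream:
            # Pass the extended path on to every downstream call.
            service.handle({PATH_HEADER: path})


d = Service("D")
b = Service("B", [d])
c = Service("C", [d])
a = Service("A", [b, c])

for _ in range(2):
    a.handle()

# The server only has to collect the counters; no trace detail data is needed.
topology = sum((s.path_counts for s in (a, b, c, d)), Counter())
print(topology)
```

Service D can tell traffic arriving via B apart from traffic arriving via C without any server-side reconstruction. The sketch also hints at the cost: the propagated path grows with every hop, and adding custom dimensions to the key would multiply the number of counters each service has to keep.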

04 3D topology

Traffic and resources are like close brothers: traffic affects resource allocation, and resources in turn constrain the state of traffic. Most traffic anomalies are ultimately caused by insufficient resource quotas or unreasonable allocation, such as peak traffic instantly exhausting resources, or hot spots caused by unevenly distributed traffic. When locating the root cause of a traffic anomaly, we often need to analyze the state of the corresponding resources; conversely, when a resource node is abnormal, we also want to know which traffic running on it will be affected. Associating resource data with the link topology that represents traffic, to form a more complete 3D topology, therefore seems a good choice.

As shown in the figure below, the 3D topology not only shows the states and dependencies of the application and interface traffic nodes at the PaaS layer, but also lets you drill down to the states of the corresponding IaaS-layer resources such as processes and instances. By connecting traffic with resources, the 3D topology helps locate traffic anomalies caused by resource bottlenecks more intuitively.

[Figure: 3D topology linking PaaS-layer traffic nodes to IaaS-layer resources]
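At the data level, linking traffic to resources is a join between two layers keyed by application. The sketch below is a deliberately small illustration: the field names, the CPU threshold of 0.9, and the helper names `build_3d_topology`, `hot_instances`, and `affected_apps` are all assumptions.

```python
# Traffic layer: per-application traffic state (made-up numbers).
traffic = {
    "order-center": {"calls": 1200, "error_rate": 0.08},
    "inventory": {"calls": 900, "error_rate": 0.00},
}
# Resource layer: instances that each application runs on (made-up numbers).
resources = {
    "order-center": [{"instance": "i-01", "cpu": 0.95}, {"instance": "i-02", "cpu": 0.40}],
    "inventory": [{"instance": "i-03", "cpu": 0.30}],
}


def build_3d_topology(traffic, resources):
    """Attach the resource layer to every traffic node."""
    return {
        app: {"traffic": stats, "resources": resources.get(app, [])}
        for app, stats in traffic.items()
    }


def hot_instances(topology, app, cpu_threshold=0.9):
    """Drill down: from an abnormal traffic node to its overloaded instances."""
    return [r["instance"] for r in topology[app]["resources"] if r["cpu"] >= cpu_threshold]


def affected_apps(topology, instance):
    """Drill up: from an abnormal instance to the traffic running on it."""
    return [
        app for app, layers in topology.items()
        if any(r["instance"] == instance for r in layers["resources"])
    ]


topology = build_3d_topology(traffic, resources)
print(hot_instances(topology, "order-center"))
print(affected_apps(topology, "i-03"))
```

The two helper functions correspond to the two questions raised earlier: which resources explain a traffic anomaly, and which traffic is affected by an abnormal resource.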

The 3D topology cleverly conveys more information, but it also has a serious flaw: the information is packed too densely. In a complex topology it may be less intuitive than a 2D topology, which greatly reduces its practical value. As shown in the figure below, when there are hundreds of instances and thousands of interfaces, the interaction complexity of the 3D topology significantly reduces the efficiency of diagnosis and troubleshooting, so it is used more for big-screen displays.

[Figure: 3D topology with hundreds of instances and thousands of interfaces]

To reduce the interaction cost of 3D topology, one possible idea is to combine it with intelligent diagnosis technology that automatically highlights abnormal links and narrows the displayed data to the relevant range. However, this places high demands on both technology and product design, and its practicality has yet to be proven in a large number of real production environments.

Origin blog.csdn.net/alisystemsoftware/article/details/130488584