Application of the full link tracking system at the technical operation level

With the introduction of microservices and distributed architectures, applications and basic components have formed a web of distributed call relationships. This complexity has greatly increased the difficulty of stability assurance work such as problem location, bottleneck analysis, capacity evaluation, and rate limiting and degradation. It is this background that gave rise to full link tracking.

One of the core technical points here is the TraceID. It is created when a request comes in through the access layer; it may also be created by an Nginx plug-in and placed in the HTTP header, or generated by the RPC service framework. In every subsequent call, the framework automatically passes this field on to the next service, so business code does not need to handle it.
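As a minimal sketch of this create-once, pass-along behavior, the Python below generates a TraceID when a request arrives without one and copies it onto downstream calls. The header name `X-Trace-Id` and both function names are assumptions made for illustration; they are not taken from the course or from any particular framework.

```python
import uuid

# Hypothetical header name; real systems each define their own.
TRACE_HEADER = "X-Trace-Id"


def ensure_trace_id(headers):
    """Return a copy of the headers that is guaranteed to carry a TraceID.

    If the request already has one (for example set by a gateway),
    it is kept; otherwise a new one is generated at this entry point.
    """
    out = dict(headers)
    if not out.get(TRACE_HEADER):
        out[TRACE_HEADER] = uuid.uuid4().hex
    return out


def outgoing_headers(incoming_headers, extra=None):
    """Build headers for a downstream call, carrying the same TraceID."""
    out = dict(extra or {})
    out[TRACE_HEADER] = incoming_headers[TRACE_HEADER]
    return out


# Entry point: no TraceID yet, so one is created.
request_headers = ensure_trace_id({"Accept": "text/html"})
# Downstream call: the framework copies the ID without business code noticing.
downstream = outgoing_headers(request_headers, {"Accept": "application/json"})
```

In a real framework this logic would live in middleware or an RPC interceptor, which is why business code never has to touch the field.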

With this TraceID, we can string together a complete request link, which is the basis for the scenarios that follow. Let's take a look at the specific technical operation scenarios.

1. Problem location and troubleshooting

When we build a full link tracking system, the primary problem to solve is locating problems quickly and accurately within complicated service call relationships.

There are two main types of common problem scenarios: bottleneck analysis and abnormal error location.

A common case is that a page becomes slow, or a service suddenly raises a large number of timeout alarms. Whether it is a page or a service, in a distributed environment it relies on a large number of other services or basic components in the backend. To locate such problems, we therefore want a detailed view of the call relationship, so that we can quickly and easily determine where the bottleneck occurs.

For example, the figure below shows a case where a page has slowed down. We looked up one particular call by its URL and found that the bottleneck was serious blocking in the query interface of RateReadService. Next, using the detailed IP address information, we can go to that machine or to the monitoring system to investigate what is wrong with this application or host, which may be a machine failure, an application fault, and so on.

As this case shows, with full link tracking in place, locating problems under complex call relationships becomes relatively simple.
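One way to sketch this kind of bottleneck lookup is to compute, for every span in a trace, the time spent in the span itself rather than in its children, and pick the largest. The `Span` fields, the service names other than RateReadService, the IP addresses, and the durations are all invented for illustration, and the sketch assumes child calls run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span of the trace
    service: str
    host: str                 # IP of the machine that served the call
    duration_ms: float


def self_times(spans):
    """Time spent in each span excluding its direct children.

    Assumes children execute sequentially, so their durations add up.
    """
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def find_bottleneck(spans):
    """Return the span with the largest self time."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Made-up trace for one slow page request.
trace = [
    Span("a", None, "ItemPage", "10.0.0.1", 1250.0),
    Span("b", "a", "ItemService", "10.0.0.2", 1200.0),
    Span("c", "b", "RateReadService", "10.0.0.3", 1100.0),
    Span("d", "b", "PriceService", "10.0.0.4", 40.0),
]
slowest = find_bottleneck(trace)
print(slowest.service, slowest.host)
```

With this sample data the slowest span belongs to RateReadService, and its `host` field gives the machine to inspect next.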

2. Service running status analysis

The problem location above mainly concerns a single request or a relatively independent scenario. Beyond that, after collecting a large amount of request and call relationship data, we can also analyze more valuable information about how services are running, such as the following.

1. Service operation quality

An application may expose HTTP services or RPC interfaces. For both types of interface, we can collect data over a period of time to analyze the running status of the service interface, that is, application-layer monitoring. Common monitoring indicators include QPS, RT, and error codes, and their trends can be compared over time. This gives a complete view of how an application, and the services it provides, are running.
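The indicators above can be derived from raw call records roughly as follows. This is only a sketch: the record layout `(interface, rt_ms, status_code)`, the function name, and the choice to count status codes of 500 and above as errors are assumptions, not something the course specifies.

```python
from collections import defaultdict


def interface_metrics(records, window_seconds):
    """Aggregate per-interface QPS, average RT and error rate.

    records: iterable of (interface, rt_ms, status_code) collected
             during a window of `window_seconds` seconds.
    """
    grouped = defaultdict(list)
    for interface, rt_ms, code in records:
        grouped[interface].append((rt_ms, code))

    result = {}
    for interface, calls in grouped.items():
        n = len(calls)
        # Assumption: codes >= 500 count as errors.
        errors = sum(1 for _, code in calls if code >= 500)
        result[interface] = {
            "qps": n / window_seconds,
            "avg_rt_ms": sum(rt for rt, _ in calls) / n,
            "error_rate": errors / n,
        }
    return result


# Made-up records from a 2-second window.
sample = [
    ("/item", 20, 200),
    ("/item", 40, 200),
    ("/item", 60, 500),
    ("/cart", 10, 200),
]
metrics = interface_metrics(sample, window_seconds=2)
```

Computing the same numbers for consecutive windows yields the trend lines that can then be compared.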

2. Application and service dependencies

In addition to the above-mentioned running status of a single application, we can also calculate the dependency relationship and dependency ratio between applications and between services based on the analysis of the call chain, as shown in the figure below.

We can plan the capacity expansion of a single link based on the source dependencies and ratios; at the same time, we can split traffic by destination dependency to provide a basis for expanding downstream applications. Because the dependency ratio is derived entirely from real online calls, it reflects the real business access model.
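To make the two kinds of dependency concrete, the sketch below computes them from a list of observed `(caller, callee)` pairs. The application names, call counts, and function names are hypothetical; a real system would aggregate these from call chain data rather than from an in-memory list.

```python
from collections import Counter


def source_shares(calls, callee):
    """Share of the calls into `callee` that comes from each caller."""
    counts = Counter(src for src, dst in calls if dst == callee)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {src: n / total for src, n in counts.items()}


def destination_ratios(calls, caller, caller_requests):
    """Average number of calls to each downstream per request handled by `caller`."""
    counts = Counter(dst for src, dst in calls if src == caller)
    return {dst: n / caller_requests for dst, n in counts.items()}


# Made-up call edges: "order" receives 4 requests and fans out downstream.
calls = (
    [("web", "order")] * 3
    + [("app", "order")] * 1
    + [("order", "inventory")] * 8
    + [("order", "coupon")] * 2
)

sources = source_shares(calls, "order")
ratios = destination_ratios(calls, "order", caller_requests=4)
```

Here three quarters of the traffic into `order` comes from `web`, and each `order` request produces on average two calls to `inventory`, so doubling `order` traffic would roughly double the load on `inventory` as well.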

At the same time, because our business scenarios and requirements are constantly changing, the calling relationships and dependencies between applications and services are also constantly changing. This requires us to keep analyzing and adjusting strong and weak dependencies, and to pay attention to whether the calls between them are reasonable. A lot of optimization work can be found in this process.

3. Quality of service for dependencies

We also pay attention to the real-time running status and quality of the applications or services we depend on, so that the real-time calling status between applications is visible. Whether the QPS of calls from some application has suddenly increased, or RT has suddenly spiked, can be quickly confirmed through this dependency view.
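A simple way to illustrate this kind of check is to compare the current window with a baseline for every caller-to-callee edge. The data layout, the function name, and the doubling threshold are assumptions chosen for this sketch; production systems typically use more robust baselines.

```python
def detect_spikes(baseline, current, factor=2.0):
    """Flag edges whose QPS or RT grew by at least `factor` over the baseline.

    baseline / current: {(caller, callee): {"qps": float, "rt_ms": float}}
    Returns a list of (edge, metric, baseline_value, current_value).
    """
    alerts = []
    for edge, now in current.items():
        before = baseline.get(edge)
        if before is None:
            continue  # new edge, nothing to compare against
        for metric in ("qps", "rt_ms"):
            if before[metric] > 0 and now[metric] >= factor * before[metric]:
                alerts.append((edge, metric, before[metric], now[metric]))
    return alerts


# Made-up numbers for two dependency edges.
baseline = {
    ("order", "inventory"): {"qps": 100.0, "rt_ms": 20.0},
    ("order", "coupon"): {"qps": 50.0, "rt_ms": 10.0},
}
current = {
    ("order", "inventory"): {"qps": 110.0, "rt_ms": 95.0},
    ("order", "coupon"): {"qps": 120.0, "rt_ms": 11.0},
}
alerts = detect_spikes(baseline, current)
```

With these numbers the check flags the RT of the `order` to `inventory` edge and the QPS of the `order` to `coupon` edge.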

3. Business holography

Business holography is the association of the full link tracking system with business information. The applications described so far are mostly at the technical level, such as locating application or service problems and analyzing the dependencies between applications or services.

But in reality we also encounter many business link analysis scenarios, such as the status of an order at different stages. Suppose a user complains that their order did not receive the free-shipping discount for orders over 100 yuan. We then need to retrieve the user's information from product browsing, through adding to the shopping cart, to placing the order, to determine where the problem lies. This scenario is in fact very similar to the full link tracking of a request.

Therefore, to apply a similar idea at the business level, the unique TraceID on the request link can be associated with business information such as the order ID, user ID, and product ID. When a business problem needs to be investigated, the whole business chain can be extracted based on the corresponding ID and the problem confirmed. This greatly improves the efficiency of solving business problems.
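The association described here can be pictured as an index from business IDs to TraceIDs. The class below is a hypothetical in-memory sketch; the class name, method names, and sample IDs are invented, and a real system would store this mapping in a search or storage backend.

```python
from collections import defaultdict


class BusinessTraceIndex:
    """Maps business IDs (order, user, product, ...) to the TraceIDs that touched them."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def record(self, trace_id, **business_ids):
        """Associate one TraceID with any business IDs known for that request."""
        for kind, value in business_ids.items():
            if value is not None:
                self._by_key[(kind, value)].append(trace_id)

    def traces_for(self, kind, value):
        """Return the TraceIDs recorded for a business ID, in arrival order."""
        return list(self._by_key.get((kind, value), []))


index = BusinessTraceIndex()
# Made-up requests for one user: browse, add to cart, place order.
index.record("t1", user_id="u42", product_id="p7")
index.record("t2", user_id="u42", product_id="p7")
index.record("t3", user_id="u42", order_id="o1001")

user_chain = index.traces_for("user_id", "u42")
order_chain = index.traces_for("order_id", "o1001")
```

Looking up the user ID returns all three requests in order, so each step from browsing to ordering can then be opened as an ordinary trace.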

The wide application of the full link tracking system in technical solutions provides a large amount of online operation data that can be analyzed and processed. From this data, we can extract more valuable information for stable online operation.

This article is a study note for Day 10 in April. The content comes from "Zhao Cheng's Operation and Maintenance System Management Course" on Geek Time. This course is recommended.

Origin blog.csdn.net/key_3_feng/article/details/130070988