Awesome! 40 diagrams to understand the principles and practice of distributed tracing systems

Author | Code Sea
Source | Code Sea

In a microservice architecture, completing a single request often requires the collaboration of multiple modules, multiple middleware components, and multiple machines.

Some of the calls in this chain are serial and some are parallel. So how do we determine which applications, modules, and nodes are called behind a request, and in what order? How do we locate the performance problems of each module? This article will reveal the answers.

This article covers the following topics:

  • Principles and functions of distributed tracing systems

  • Principles and architecture of SkyWalking

  • Our company's practice with the distributed call chain

Principles and functions of distributed tracing systems

To measure the performance of an interface, we generally pay attention to at least the following three indicators:

  • What is the interface's response time (RT)?

  • Are there abnormal responses?

  • Where is most of the time spent?

Monolithic architecture

In the early days of a company, a monolithic architecture like the one below may be adopted. With a monolithic architecture, how do we measure the three indicators above?

The easiest approach is obviously to use AOP:

Use AOP to record the time before and after invoking the business logic to calculate the overall call time, and use AOP to catch exceptions to know where the call failed.
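
As a concrete illustration, here is a minimal Spring AOP sketch of that idea (the package com.example.service and the use of System.out are purely illustrative; a real system would use a logger):

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class TimingAspect {

    // Wrap every method in the (hypothetical) business-service package.
    @Around("execution(* com.example.service..*(..))")
    public Object profile(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.currentTimeMillis();
        try {
            return pjp.proceed();                       // run the actual business logic
        } catch (Throwable t) {
            // The same aspect also tells us which call threw the exception.
            System.err.println(pjp.getSignature() + " failed: " + t.getMessage());
            throw t;
        } finally {
            long cost = System.currentTimeMillis() - start;
            System.out.println(pjp.getSignature() + " took " + cost + " ms");
        }
    }
}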

Microservice architecture

In a monolithic architecture, all services and components run on one machine, so these monitoring indicators are relatively easy to implement. However, as the business grows rapidly, a monolithic architecture will inevitably evolve into a microservice architecture, as follows:

As shown in the picture: a slightly more complex microservice architecture

If a user reports that a page is slow, and we know the request call chain of this page is A -----> C -----> B -----> D, how do we locate which module caused the problem? Each of the services A, B, C, and D is deployed on several machines, so how do we know which machine a given request actually hit?

Clearly, because we cannot accurately locate the exact path each request takes, the microservice architecture brings several pain points:

  1. Troubleshooting is difficult and time-consuming

  2. Specific scenarios are hard to reproduce

  3. System performance bottlenecks are hard to analyze

Distributed call-chain tracing was born to solve these problems, and its main functions are as follows:

  • Automatically collect data

  • Analyze the data to generate the complete call chain: with the complete call chain of a request, problems can be reproduced with a high probability

  • Visualize the data: visualizing the performance of each component helps us locate system bottlenecks and spot problems in time

With a distributed tracing system, the specific link of every request can be located, request-link tracing becomes easy, and the performance bottleneck of each module can be located and analyzed.

The distributed call chain standard: OpenTracing

Now that we know the role of the distributed call chain, let's look at how it is implemented and the principles behind it. First, to solve the problem of API incompatibility between different distributed tracing systems, the OpenTracing specification was born. OpenTracing is a lightweight standardization layer that sits between the application/library and the tracing or log analysis program.

In this way, OpenTracing provides platform-independent, vendor-independent APIs so that developers can easily plug in a tracing system implementation.

Speaking of which, can you think of a similar design in Java? Remember JDBC: by providing a set of standard interfaces for vendors to implement, it lets programmers code against the interfaces without worrying about the specific implementation.

The interface here is effectively a standard, so defining a set of standards is very important for making components pluggable.

Next, let's look at the OpenTracing data model, which mainly has the following three concepts:

  • Trace: a complete request link

  • Span: a single call within the trace (its start time and end time are required)

  • SpanContext: the trace's global context information, such as the traceId

Understanding these three concepts is very important, so I drew a picture to make them clearer:

As shown in the figure, the complete request for an order is one trace. Obviously, there must be a global identifier for this request. Each call is a span, and every call must carry the global traceId so that it can be associated with the request. The traceId is transmitted through the SpanContext, and since it has to be transmitted, it obviously must follow some protocol.

As shown in the figure, if we compare the transmission protocol to a car, the SpanContext to the goods it carries, and the Span to the road, it should be easier to understand.
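
To make the three concepts concrete, here is a minimal sketch against the OpenTracing Java API (assuming the opentracing-api and opentracing-util 0.33 artifacts; the operation name createOrder is just an example):

import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;
import io.opentracing.propagation.TextMapAdapter;
import io.opentracing.util.GlobalTracer;

import java.util.HashMap;
import java.util.Map;

public class OrderTraceExample {
    public static void main(String[] args) {
        Tracer tracer = GlobalTracer.get();     // whichever tracer implementation is registered

        // The whole order request is one Trace; each operation inside it is a Span.
        Span orderSpan = tracer.buildSpan("createOrder").start();
        try {
            // The SpanContext carries the trace-wide identifiers (e.g. the traceId) across processes.
            Map<String, String> carrier = new HashMap<>();
            tracer.inject(orderSpan.context(), Format.Builtin.TEXT_MAP, new TextMapAdapter(carrier));
            // 'carrier' would now be written into the headers/attachments of the downstream call.
        } finally {
            orderSpan.finish();                 // records the end time of this Span
        }
    }
}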

Now that we understand these three concepts, let's see how the distributed tracing system collects the microservice call chain in the figure above:

We can see that a Collector at the bottom layer is quietly collecting data. So what information does the Collector gather for each call?

  1. Global trace_id: this is obvious; it associates every sub-call with the original request

  2. span_id: 0, 1, 1.1, 2 in the figure, which identifies each individual call

  3. parent_span_id: for example, if the span_id of B calling D is 1.1, then its parent_span_id is the span_id of A calling B, which is 1; this links two adjacent calls together
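
To make this concrete, the record reported for each call could be modeled roughly as follows (the field names here are illustrative, not SkyWalking's actual wire format):

// Illustrative sketch of a per-call record reported to the Collector.
public class SpanRecord {
    String traceId;       // global id shared by every call in the same request
    String spanId;        // e.g. "0", "1", "1.1", "2"
    String parentSpanId;  // spanId of the caller, used to link adjacent calls
    String serviceName;   // which service/machine handled this call
    long   startTime;     // call start, in milliseconds
    long   endTime;       // call end; endTime - startTime is this hop's cost
}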

With this, the information collected by the Collector for each call looks as follows:

Based on this information, a visual view of the call chain can obviously be drawn as follows:

And with that, a complete distributed tracing system is realized.

The above implementation looks simple, but several issues deserve careful thought:

  1. How to collect span data automatically, without intruding on business code

  2. How to transfer context across processes

  3. How to ensure the global uniqueness of traceId

  4. With so many requests, will collection affect performance?

Next, let's see how SkyWalking solves these four problems.

Principles and architecture of SkyWalking


How to automatically collect span data

SkyWalking uses plugins plus a javaagent to collect span data automatically, so it is non-invasive to the code; the plugins are pluggable, which gives good extensibility (we will show how to define your own plugin below).
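
For reference, here is a minimal, hypothetical sketch of the javaagent entry point this relies on (registered with -javaagent on the JVM command line). SkyWalking's real agent builds on ByteBuddy and a plugin registry; this only shows where bytecode enhancement hooks in:

import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class SimpleAgent {

    // Called by the JVM before main() when started with -javaagent:simple-agent.jar
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // A real agent would match the classes its plugins care about and
                // rewrite their bytecode to insert span start/stop logic.
                return null;   // null means "leave this class unchanged"
            }
        });
    }
}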

How to transfer context across processes

We know that data is generally divided into a header and a body: HTTP has headers and a body, and RocketMQ has a MessageHeader and a Message Body. The body usually carries business data, so it is not suitable for passing context; the context should be passed in the header, as shown in the figure:

Dubbo's attachment is equivalent to a header, so we put the context in the attachment, which solves the context-transfer problem.

Tip: the context transfer here is handled in the dubbo plugin, completely transparent to the business code. How this plugin is implemented will be analyzed below.
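
The idea can be sketched with Apache Dubbo's RpcContext API (assuming Dubbo 2.7; the key name traceId is illustrative, and the real plugin uses SkyWalking's own header keys and works inside MonitorFilter rather than in business code):

import org.apache.dubbo.rpc.RpcContext;

public class TraceContextPropagation {

    // Consumer side: before the remote call goes out, put the context into the header (attachment).
    public static void inject(String traceId) {
        RpcContext.getContext().setAttachment("traceId", traceId);
    }

    // Provider side: when the request arrives, read the context back out of the attachment.
    public static String extract() {
        return RpcContext.getContext().getAttachment("traceId");
    }
}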

How to ensure the global uniqueness of traceId

To ensure global uniqueness, we can use either a distributed ID service or locally generated IDs. With a distributed ID service, every request must first call the ID issuer, which adds network overhead, so SkyWalking ultimately generates IDs locally, using the famous snowflake algorithm, which offers high performance.

Illustration: id generated by snowflake algorithm

However, the snowflake algorithm has a well-known problem: clock rollback, which may cause duplicate IDs. So how does SkyWalking solve the clock-rollback problem?

Every time an ID is generated, the generation time (lastTimestamp) is recorded. If the current time is found to be earlier than lastTimestamp, a clock rollback has occurred, and a random number is generated as the traceId instead.
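
A simplified sketch of that strategy (not SkyWalking's actual implementation; the bit layout and the worker id are illustrative):

import java.util.concurrent.ThreadLocalRandom;

public class TraceIdGenerator {

    private final long workerId = 1L;   // hypothetical worker id, normally derived per instance
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // Clock rollback detected: fall back to a random id instead of risking duplicates.
            return ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;          // 12-bit sequence within the same millisecond
            if (sequence == 0) {
                // Sequence exhausted: busy-wait until the next millisecond.
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return (now << 22) | (workerId << 12) | sequence;   // timestamp | worker | sequence
    }
}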

Some readers may want to dig deeper here: couldn't the generated random number collide with an already generated global ID? Wouldn't it be better to add another layer of verification?

This is a question of system-design trade-offs. First, checking the uniqueness of the generated random number would undoubtedly add another call and cost some performance.

In fact, the probability of a clock rollback is very small (machine clock disorder greatly affects the business, so machine time is adjusted very cautiously), and the probability of the random numbers colliding is also very small. Taken together, there is really no need to add another layer of global uniqueness checking.

When choosing a technical solution, we must avoid over-engineering.

With so many requests, will collecting all of them affect performance?

If we collected data for every request, the data volume would undoubtedly be huge. But think about it: is it really necessary to collect every request? Actually, no. We can set a sampling frequency and only sample part of the data. By default SkyWalking samples 3 times every 3 seconds and skips the other requests, as shown in the figure:

This sampling frequency is actually enough for analyzing component performance. But what problems arise when sampling 3 times every 3 seconds? Ideally, every service call would happen at exactly the same point in time (as shown in the figure below), so that sampling at the same time point would always work.

In production, however, it is basically impossible for each service to be called at the same point in time, because of network delays along the way. The actual call pattern is more likely to look like this:

In this case, some calls are sampled on service A but not on services B and C, making it impossible to analyze the full call chain. So how does SkyWalking solve this?

The solution is: if the upstream carries the context (indicating it was sampled upstream), the downstream is forced to collect data as well. This guarantees the integrity of the link.
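
The rule can be sketched like this (a simplified illustration, not SkyWalking's actual sampler classes):

import java.util.concurrent.atomic.AtomicInteger;

public class WindowSampler {

    private final int samplesPerWindow;                      // e.g. 3 per 3-second window
    private final AtomicInteger counter = new AtomicInteger();

    public WindowSampler(int samplesPerWindow) {
        this.samplesPerWindow = samplesPerWindow;
    }

    public boolean shouldSample(boolean upstreamSampled) {
        if (upstreamSampled) {
            return true;   // forced sampling keeps the chain complete across services
        }
        return counter.incrementAndGet() <= samplesPerWindow;
    }

    public void resetWindow() {
        counter.set(0);    // a background timer would call this every 3 seconds
    }
}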

SkyWalking's basic architecture

The basic architecture of SkyWalking is as follows. It can be said that almost all distributed tracing systems are composed of these components.

First, node data is sampled at regular intervals; the sampled data is then reported periodically and stored in a persistence layer such as ES or MySQL. Once the data is there, visual analysis naturally follows.

How well does SkyWalking perform?

Next, everyone surely cares about SkyWalking's performance, so let's look at the official benchmark data:

In the figure, blue represents performance without SkyWalking and orange with SkyWalking, measured at a TPS of 5000. Whether in CPU, memory, or response time, the performance loss introduced by SkyWalking is almost negligible.

Next, let's compare SkyWalking with Zipkin and Pinpoint, two other well-known distributed tracing tools in the industry (compared at a sampling rate of 1 per second, 500 threads, and 5000 total requests). In terms of the critical response-time metric, Zipkin (117 ms) and Pinpoint (201 ms) are far behind SkyWalking (22 ms)!

On the performance-loss metric, SkyWalking wins!

Now look at another metric: code intrusiveness. Zipkin requires instrumentation points to be buried in the application, which intrudes heavily on the code, whereas SkyWalking uses javaagent plus plugins to modify bytecode, achieving zero code intrusion. Besides performance and code intrusiveness, SkyWalking also has the following advantages:

  • Multi-language support and rich components: it currently supports Java, .NET Core, PHP, NodeJS, Golang, and Lua, and its plugins cover common components such as dubbo and mysql, which meets most of our needs

  • Extensibility: if a plugin we need is missing, we can write one ourselves following SkyWalking's conventions, and the new plugin will not intrude on the code either

Our practice with the distributed call chain


SkyWalking's application architecture in our company

From the above we can see that SkyWalking has many advantages, so did we adopt all of its components? Actually, no. Let's look at its application architecture in our company:

As the figure shows, we only use SkyWalking's agent for sampling and drop the other three parts: data reporting and analysis, data storage, and data visualization. Why not adopt the full SkyWalking solution? Because our Marvin monitoring ecosystem was already fairly complete before we adopted SkyWalking.

Replacing it entirely with SkyWalking would be unnecessary: first, Marvin already meets our needs in most scenarios; second, the cost of replacing the system is high; third, users would have to relearn a new tool, which carries a high learning cost.

This also teaches us a lesson: for any product, seizing the first-mover advantage is crucial, because the cost of replacing an incumbent is high. Whoever moves first captures the users' minds. It is like WeChat: although its experience is well-crafted, it cannot beat WhatsApp overseas, because the first-mover advantage is gone.

On the other hand, there is no best architecture, only the most suitable one. The essence of architecture design is to weigh trade-offs against the current business scenario.

What changes and practices our company has made around SkyWalking

Our company mainly made the following changes and practices:

  1. Forced sampling in the pre-release environment for debugging

  2. More fine-grained sampling

  3. Embedding the traceId in logs

  4. Developing our own SkyWalking plugins

Forced sampling in the pre-release environment for debugging

From the analysis above, we know the Collector samples periodically in the background. Isn't that enough? Why do we need forced sampling?

It is still about troubleshooting. Sometimes a problem occurs online and we want to reproduce it in the pre-release environment and see the complete call chain of that request, so forced sampling is needed there. To that end we modified SkyWalking's dubbo plugin to support forced sampling.

We attach a key-value pair such as force_flag = true to the request's cookie to indicate forced sampling. When the gateway receives this cookie, it puts force_flag = true into the dubbo attachment, and SkyWalking's dubbo plugin then decides based on this flag: if the value is present, it samples forcibly; if not, it falls back to the normal periodic sampling.
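
The check in the modified plugin can be sketched roughly like this (the class name and surrounding code are illustrative; only the force_flag key follows the description above):

import org.apache.dubbo.rpc.Invocation;

public class ForceSampleCheck {

    // The gateway copied force_flag=true from the request cookie into the dubbo attachment.
    public static boolean shouldForceSample(Invocation invocation) {
        return "true".equals(invocation.getAttachment("force_flag"));
    }
}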

More fine-grained sampling

What do we mean by more fine-grained sampling? Let's first look at SkyWalking's default sampling method, namely unified sampling.

We know this approach samples the first 3 calls every 3 seconds and discards all other requests. The problem is: suppose there are dubbo, mysql, and redis calls on this machine within the same 3 seconds, but if the first three happen to all be dubbo calls, then mysql, redis, and other calls never get sampled. So we modified SkyWalking to implement group sampling, as follows:

That is, redis, dubbo, mysql, and so on are each sampled 3 times every 3 seconds, which avoids the problem.
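
The idea behind group sampling can be sketched like this (an illustration of the approach, not our actual patch):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class GroupSampler {

    private final int samplesPerWindow;   // e.g. 3 per 3-second window, per component type
    private final ConcurrentMap<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    public GroupSampler(int samplesPerWindow) {
        this.samplesPerWindow = samplesPerWindow;
    }

    // componentType is e.g. "dubbo", "mysql", "redis"; each type gets its own budget.
    public boolean shouldSample(String componentType) {
        AtomicInteger c = counters.computeIfAbsent(componentType, k -> new AtomicInteger());
        return c.incrementAndGet() <= samplesPerWindow;
    }

    public void resetWindow() {
        counters.values().forEach(c -> c.set(0));   // a background timer resets every window
    }
}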

How to embed the traceId in the log

Embedding the traceId in the output log makes troubleshooting much easier, so printing the traceId is very necessary. How do we embed it in the log?

We use log4j, so we need to understand log4j's plugin mechanism: log4j lets us customize the log output format with our own plugins. First, we define the log format and embed %traceId in it as a placeholder, as follows:

Then we implement a log4j plugin, as follows:

First, the log4j plugin defines a class that extends LogEventPatternConverter and declares itself as a plugin via the standard @Plugin annotation. The placeholder to replace is specified through the @ConverterKeys annotation, and the replacement itself is done in the format method.
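
A minimal sketch of such a converter, assuming Log4j 2.x and SkyWalking's trace toolkit (apm-toolkit-trace); our in-house plugin follows the same shape:

import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.config.plugins.Plugin;
import org.apache.logging.log4j.core.pattern.ConverterKeys;
import org.apache.logging.log4j.core.pattern.LogEventPatternConverter;
import org.apache.logging.log4j.core.pattern.PatternConverter;
import org.apache.skywalking.apm.toolkit.trace.TraceContext;

// Used together with a pattern layout such as: %d [%traceId] %-5p %c - %m%n
@Plugin(name = "TraceIdConverter", category = PatternConverter.CATEGORY)
@ConverterKeys({"traceId"})
public class TraceIdConverter extends LogEventPatternConverter {

    protected TraceIdConverter(String name, String style) {
        super(name, style);
    }

    // Log4j2 looks up this factory method when it resolves the %traceId placeholder.
    public static TraceIdConverter newInstance(String[] options) {
        return new TraceIdConverter("traceId", "traceId");
    }

    @Override
    public void format(LogEvent event, StringBuilder toAppendTo) {
        // Replace the placeholder with the traceId of the current thread's active trace.
        toAppendTo.append(TraceContext.traceId());
    }
}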

In this way, the traceId we want appears in the log, as follows:

What SkyWalking plugins our company has developed

SkyWalking already provides many plugins, but it does not ship plugins for memcached or druid, so we developed these two plugins ourselves following its conventions.

As for how to implement a plugin, it mainly consists of three parts:

  1. Plugin definition: specifies the plugin's definition class; the plugin is finally packaged and generated according to the definition class specified here

  2. Instrumentation: specifies the aspect and the cut point, i.e., which method of which class should be enhanced

  3. Interceptor: where the enhancement logic is written, to run before, after, or on exception of the method specified in step 2

If this still feels abstract, let's explain it briefly with the dubbo plugin. We know that in a dubbo service, each request is received from netty and submitted to the business thread pool, and before the business method is actually invoked it passes through more than a dozen Filters.

MonitorFilter can intercept all client requests and all requests processed by the server, so we can enhance MonitorFilter: before its invoke method is called, we inject the global traceId into its Invocation's attachment, which guarantees the global traceId exists before the request reaches the real business logic.

So obviously the plugin needs to specify the class to enhance (MonitorFilter) and the method to enhance (invoke). What exactly the enhancement does is the interceptor's job. Let's first look at the instrumentation (DubboInstrumentation) in the dubbo plugin.
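
A simplified sketch of what such an instrumentation class looks like, modeled on SkyWalking's open-source dubbo plugin (package names and method signatures are approximate and depend on the agent version you build against):

import net.bytebuddy.description.method.MethodDescription;
import net.bytebuddy.matcher.ElementMatcher;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.ConstructorInterceptPoint;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.InstanceMethodsInterceptPoint;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.ClassInstanceMethodsEnhancePluginDefine;
import org.apache.skywalking.apm.agent.core.plugin.match.ClassMatch;

import static net.bytebuddy.matcher.ElementMatchers.named;
import static org.apache.skywalking.apm.agent.core.plugin.match.NameMatch.byName;

public class DubboInstrumentation extends ClassInstanceMethodsEnhancePluginDefine {

    @Override
    protected ClassMatch enhanceClass() {
        // Which class to enhance.
        return byName("org.apache.dubbo.monitor.support.MonitorFilter");
    }

    @Override
    public ConstructorInterceptPoint[] getConstructorsInterceptPoints() {
        return new ConstructorInterceptPoint[0];   // no constructor enhancement needed
    }

    @Override
    public InstanceMethodsInterceptPoint[] getInstanceMethodsInterceptPoints() {
        return new InstanceMethodsInterceptPoint[] {
            new InstanceMethodsInterceptPoint() {
                @Override
                public ElementMatcher<MethodDescription> getMethodsMatcher() {
                    return named("invoke");        // which method to enhance
                }

                @Override
                public String getMethodsInterceptor() {
                    // The interceptor that carries the actual enhancement logic (next sketch).
                    return "org.apache.skywalking.apm.plugin.asf.dubbo.DubboInterceptor";
                }

                @Override
                public boolean isOverrideArgs() {
                    return false;
                }
            }
        };
    }
}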

Next, let's look at what the interceptor in the code does. The key steps are listed below:
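
The interceptor's key steps can be sketched as follows, again modeled on SkyWalking's open-source dubbo plugin (error handling, span tags, and peer details are omitted, and SDK signatures vary by agent version):

import java.lang.reflect.Method;

import org.apache.dubbo.rpc.Invocation;
import org.apache.dubbo.rpc.Invoker;
import org.apache.dubbo.rpc.RpcContext;
import org.apache.skywalking.apm.agent.core.context.CarrierItem;
import org.apache.skywalking.apm.agent.core.context.ContextCarrier;
import org.apache.skywalking.apm.agent.core.context.ContextManager;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.EnhancedInstance;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstanceMethodsAroundInterceptor;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.MethodInterceptResult;

public class DubboInterceptor implements InstanceMethodsAroundInterceptor {

    @Override
    public void beforeMethod(EnhancedInstance objInst, Method method, Object[] allArguments,
                             Class<?>[] argumentsTypes, MethodInterceptResult result) {
        Invoker<?> invoker = (Invoker<?>) allArguments[0];     // MonitorFilter.invoke(invoker, invocation)
        Invocation invocation = (Invocation) allArguments[1];
        // 1. Build a ContextCarrier to hold the trace context for this call.
        ContextCarrier carrier = new ContextCarrier();

        if (RpcContext.getContext().isConsumerSide()) {
            // 2. Consumer side: open an exit span and copy the trace context into the attachment.
            ContextManager.createExitSpan(invocation.getMethodName(), carrier, invoker.getUrl().getHost());
            CarrierItem item = carrier.items();
            while (item.hasNext()) {
                item = item.next();
                invocation.getAttachments().put(item.getHeadKey(), item.getHeadValue());
            }
        } else {
            // 3. Provider side: read the attachment back into the carrier and open an entry span.
            CarrierItem item = carrier.items();
            while (item.hasNext()) {
                item = item.next();
                item.setHeadValue(invocation.getAttachment(item.getHeadKey()));
            }
            ContextManager.createEntrySpan(invocation.getMethodName(), carrier);
        }
    }

    @Override
    public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments,
                              Class<?>[] argumentsTypes, Object ret) {
        ContextManager.stopSpan();              // close the span once invoke() has returned
        return ret;
    }

    @Override
    public void handleMethodException(EnhancedInstance objInst, Method method, Object[] allArguments,
                                      Class<?>[] argumentsTypes, Throwable t) {
        ContextManager.activeSpan().log(t);     // record the exception on the active span
    }
}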

First, beforeMethod runs before MonitorFilter's invoke method executes, and correspondingly afterMethod holds the enhancement logic that runs after invoke has executed.

Second, from steps 2 and 3 we can see that whether on the consumer side or the provider side, the global ID is handled accordingly, so that the global traceId is guaranteed to exist when the real business layer is reached. Once the Instrumentation and Interceptor are defined, the last step is to register the definition class in skywalking-plugin.def:

// skywalking-plugin.def file
dubbo=org.apache.skywalking.apm.plugin.asf.dubbo.DubboInstrumentation

The packaged plugin enhances MonitorFilter's invoke method, injecting the global traceId into the Invocation's attachment before invoke executes. All of this happens silently, without intruding on the code.

Summary

This article has introduced the principles of distributed tracing systems step by step, from the basics to the details. I believe everyone now has a deeper understanding of their functions and working mechanisms.

It is especially worth noting that when introducing a technology, we must take the existing technical architecture into account and make the most reasonable choice: SkyWalking has four modules, yet our company only uses its agent's sampling function. There is no best technology, only the most suitable one.

Through this article, I believe everyone has a clearer understanding of how SkyWalking is implemented. This article only touches on SkyWalking's plugin mechanism; it is industrial-grade software after all, and to appreciate its breadth and depth you should read more of its source code.


Origin blog.csdn.net/csdnnews/article/details/108544144