[Architecture design] Distributed tracing

0. Preface

This article introduces the principles of distributed tracing systems, the three pillars of observability, and the OpenTracing standard. It also briefly compares the current mainstream open source distributed tracing systems.

With the rise of application containers and microservices, and with tools such as Docker and Kubernetes, building and deploying microservice applications has become much simpler. But once a large monolithic application is split into microservices, the dependencies and calls between services become extremely complex. The services may be developed by different teams, may be written in different languages, and may call each other over RPC or RESTful APIs. In such an environment, sorting out service dependencies and call relationships, debugging quickly, tracking service processing time, finding performance bottlenecks, and reasonably estimating service capacity all become tricky tasks.

1. Observability and the three pillars

To deal with these problems, the concept of observability (Observability) was introduced into the software field. Traditional monitoring and alerting focus mainly on abnormal conditions and failure factors of the system. Observability is more concerned with showing the running state of the system from within the system itself, more like a self-examination. An observable system pays attention to the state of the application itself rather than to indirect evidence such as the machine or network it runs on. We want to obtain the application's current throughput and latency directly, and to do so we need to expose more of the application's runtime information, deliberately and proactively. In today's development environment, faced with complex systems, our attention gradually shifts from individual points to a combination of points, lines, and surfaces. This lets us understand the system better: not only knowing What, but also answering Why.

Observability currently mainly includes the following three pillars:

  • Logging: records discrete events. Applications usually write log information in a defined format to a file, and a log collector then gathers it for analysis and aggregation. Mature solutions such as ELK exist. Of the three pillars, logs are the most comprehensive and detailed, and under normal circumstances they consume the most storage. Although log events can be strung together in time order, it is hard to reconstruct the complete call path from them;
  • Metrics: usually aggregated information. Compared with Logging, Metrics lose some specific detail but take up much less space than complete logs. They are used for monitoring and alerting, where Prometheus has basically become the de facto standard;
  • Tracing: sits between Logging and Metrics. Along the dimension of a request, it links the call relationships between services and records call durations. This retains the necessary information while connecting scattered log events through Spans, which helps us understand system behavior, debug, and troubleshoot performance problems. Tracing is the focus of this article.

Logging, Metrics, and Tracing each have their own focus, and they also overlap.

In recent years Metrics and Tracing have tended to converge, and many popular APM (application performance management) systems, such as Datadog, now integrate Tracing and Metrics information.

While this article was being written, CNCF announced at KubeCon 2019 the merger of the OpenTracing project with the Google-initiated OpenCensus project. The new project is still under construction, but it has already promised compatibility with the existing OpenTracing protocol.

The following is a CNCF summary of popular software and services commonly used to make systems observable. In the Monitoring column, Prometheus can collect and monitor Metrics on its own, but combined with other charting tools it forms a more powerful and comprehensive monitoring solution:

2. The role of distributed tracing systems (Tracing) and their standards

Distributed tracing systems have developed rapidly and come in many varieties, but they generally share three core steps: code instrumentation, data storage, and query and display.

2.1 Case analysis

The following figure is an example of a distributed call. The client initiates a request. The request first reaches the load balancer, then passes through the authentication service, billing service, then requests the resource, and finally returns the result.

opentracing1.png

After the data is collected and stored, the distributed tracing system generally chooses to use a timing diagram containing a time axis to present this trace.

opentracing2.png
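
As a rough illustration of such a timeline view, here is a minimal Python sketch that lays spans out on a shared time axis as text. The SpanRecord type, the service names, and all timings are invented for this example; they are not taken from any real tracing system.

```python
from dataclasses import dataclass


@dataclass
class SpanRecord:
    """One timed unit of work; names and timings below are made up."""
    service: str
    start_ms: int
    end_ms: int


def render_timeline(spans, width=40):
    """Return one text row per span, positioned on a shared time axis."""
    t0 = min(s.start_ms for s in spans)
    t1 = max(s.end_ms for s in spans)
    scale = width / (t1 - t0)  # assumes the trace has a non-zero duration
    rows = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = round((s.start_ms - t0) * scale)
        length = max(1, round((s.end_ms - s.start_ms) * scale))
        bar = " " * offset + "#" * length
        rows.append(f"{s.service:<14}|{bar:<{width}}| {s.end_ms - s.start_ms} ms")
    return rows


# Hypothetical timings for the request path described in the case above.
trace = [
    SpanRecord("client", 0, 100),
    SpanRecord("load-balancer", 5, 95),
    SpanRecord("auth", 10, 30),
    SpanRecord("billing", 30, 50),
    SpanRecord("resource", 50, 90),
]
for row in render_timeline(trace):
    print(row)
```

Each row starts where the span started and is as long as the span lasted, so nesting and sequencing of the calls can be read off directly.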

However, data collection requires intrusive changes to user code, and the APIs of different tracing systems are not compatible with each other, so switching to another tracing system means major code changes.

2.2 OpenTracing

To solve the problem of API incompatibility between different distributed tracing systems, the OpenTracing specification was born. OpenTracing is a lightweight standardization layer that sits between the application or class library and the tracing or log analysis program.

+-------------+  +---------+  +----------+  +------------+
| Application |  | Library |  |   OSS    |  |  RPC/IPC   |
|    Code     |  |  Code   |  | Services |  | Frameworks |
+-------------+  +---------+  +----------+  +------------+
       |              |             |             |
       |              |             |             |
       v              v             v             v
  +------------------------------------------------------+
  |                     OpenTracing                      |
  +------------------------------------------------------+
     |                |                |               |
     |                |                |               |
     v                v                v               v
+-----------+  +-------------+  +-------------+  +-----------+
|  Tracing  |  |   Logging   |  |   Metrics   |  |  Tracing  |
| System A  |  | Framework B |  | Framework C |  | System D  |
+-----------+  +-------------+  +-------------+  +-----------+

Functional role of Tracing

  • Fault location: the complete path of a request is visible, which makes problems easier to locate than with discrete logs (production environments set a sampling rate, so a debug switch can be used to force full sampling of a specific request);
  • Dependency analysis: a service dependency graph is generated from the call relationships;
  • Performance analysis and optimization: the time spent in each processing unit along the call chain, and its share of the total, is easy to record;
  • Capacity planning and evaluation;
  • Working together with Logging and Metrics to strengthen monitoring and alerting.
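
The dependency analysis item can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the (span_id, parent_span_id, service) tuples and the service names are hypothetical, and real tracing backends derive the graph from their own storage formats.

```python
from collections import defaultdict

# Hypothetical finished spans: (span_id, parent_span_id, service name).
spans = [
    ("1", None, "load-balancer"),
    ("2", "1", "auth"),
    ("3", "1", "billing"),
    ("4", "3", "resource"),
    ("5", "3", "billing"),  # internal call, same service as its parent
]


def service_dependencies(spans):
    """Map each caller service to the set of services it calls."""
    service_of = {span_id: service for span_id, _, service in spans}
    deps = defaultdict(set)
    for _, parent_id, service in spans:
        if parent_id is None:
            continue  # the root span has no caller
        caller = service_of[parent_id]
        if caller != service:  # ignore calls that stay inside one service
            deps[caller].add(service)
    return dict(deps)


print(service_dependencies(spans))
```

Aggregating these edges over many traces gives the service dependency graph.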

Google's paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" first made Tracing popular. Twitter developed Zipkin based on this paper and open sourced it. After that the field flourished, and a large number of open source and commercial Tracing systems appeared.

2.3 OpenTracing Standard

In recent years tracing products have appeared one after another. Mainstream tools today include all-in-one commercial monitoring suites such as Datadog, cloud vendor products such as AWS X-Ray and Google Stackdriver Trace, and open source products such as Zipkin and Jaeger.

The Cloud Native Computing Foundation (CNCF) introduced the OpenTracing standard to promote the standardization of Tracing protocols and tools and to unify Trace data structures and formats. By providing platform-independent and vendor-independent APIs, OpenTracing lets developers easily add (or replace) a tracing system implementation, for example switching the backend from Zipkin to Jaeger or SkyWalking.
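
The idea of coding against a vendor-neutral API can be sketched as follows. The Tracer interface and both backends here are hypothetical stand-ins written for this example, not the actual OpenTracing API. The point is only that application code stays unchanged when the backend is swapped.

```python
from typing import Protocol


class Tracer(Protocol):
    """Hypothetical vendor-neutral interface the application codes against."""

    def start_span(self, operation_name: str) -> str: ...


class InMemoryTracer:
    """Stand-in for one backend: keeps operation names in a list."""

    def __init__(self):
        self.recorded = []

    def start_span(self, operation_name: str) -> str:
        self.recorded.append(operation_name)
        return f"mem:{operation_name}"


class NoopTracer:
    """Stand-in for another backend: discards everything."""

    def start_span(self, operation_name: str) -> str:
        return "noop"


def handle_request(tracer: Tracer) -> str:
    # Application code only knows the Tracer interface, not the backend.
    return tracer.start_span("GET /orders")


backend_a = InMemoryTracer()
print(handle_request(backend_a))     # mem:GET /orders
print(handle_request(NoopTracer()))  # noop
```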

OpenTracing mainly defines the following basic concepts:

  • Trace (call chain): in OpenTracing, a Trace is implicitly defined by the Spans that belong to it. A Trace can be regarded as a directed acyclic graph (DAG) of Spans, and the relationships between Spans are called References;
  • Span: can be understood as one method call, one program block call, or one RPC or database access. Any operation with a well-defined start and end time can be treated as a Span.

Causal relationships between the Spans of a single Trace:

        [Span A]  ←←←(the root span)
            |
     +------+------+
     |             |
 [Span B]      [Span C] ←←←(Span C is a child of Span A, ChildOf)
     |             |
 [Span D]      +---+-------+
               |           |
           [Span E]    [Span F] >>> [Span G] >>> [Span H]
                                       ↑
                                       ↑
                                       ↑
                         (Span G is called after Span F, FollowsFrom)
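
The diagram can be written down as data. The sketch below is a simplified model, not the OpenTracing API: it assumes each span holds at most one reference, and it reads the vertical edges of the diagram as ChildOf and the >>> arrows as FollowsFrom.

```python
CHILD_OF = "ChildOf"
FOLLOWS_FROM = "FollowsFrom"

# span -> (reference type, referenced span), mirroring the diagram above.
references = {
    "B": (CHILD_OF, "A"),
    "C": (CHILD_OF, "A"),
    "D": (CHILD_OF, "B"),
    "E": (CHILD_OF, "C"),
    "F": (CHILD_OF, "C"),
    "G": (FOLLOWS_FROM, "F"),
    "H": (FOLLOWS_FROM, "G"),
}
all_spans = {"A"} | set(references)


def root_spans(all_spans, references):
    """Spans that reference no other span are the roots of the trace."""
    return sorted(s for s in all_spans if s not in references)


def path_to_root(span, references):
    """Follow references upward to reconstruct the causal chain."""
    path = [span]
    while path[-1] in references:
        path.append(references[path[-1]][1])
    return path


print(root_spans(all_spans, references))  # ['A']
print(path_to_root("H", references))      # ['H', 'G', 'F', 'C', 'A']
```

No span list is stored for the Trace itself: it is implied by the spans and their references, which is what "implicitly defined" means.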

Each Span contains an operation name, a start time and an end time, Span Tags for additional information, Span Logs for recording special events within the Span, a SpanContext for propagating the Span's context, and References that define its relationships to other Spans.
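
These fields map naturally onto a small record type. The class below is a toy model for illustration only: the method names set_tag, log_event, and finish are chosen for this sketch, the SpanContext is left out, and none of it is the real OpenTracing Span class.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span with the fields listed above; not the real OpenTracing class."""
    operation_name: str
    start_time: float = field(default_factory=time.time)
    finish_time: Optional[float] = None
    tags: dict = field(default_factory=dict)        # Span Tags: key/value pairs
    logs: list = field(default_factory=list)        # Span Logs: (timestamp, event)
    references: list = field(default_factory=list)  # e.g. ("ChildOf", parent_span_id)

    def set_tag(self, key, value):
        self.tags[key] = value
        return self

    def log_event(self, event):
        self.logs.append((time.time(), event))
        return self

    def finish(self):
        self.finish_time = time.time()
        return self


# Hypothetical usage: one database access recorded as a span.
span = Span("db.query", references=[("ChildOf", "xyz789")])
span.set_tag("db.type", "mysql").log_event("cache miss").finish()
print(span.operation_name, span.tags, [event for _, event in span.logs])
```

Tags describe the span as a whole, while each log entry carries its own timestamp inside the span's lifetime.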

2.4 About SpanContext

SpanContext is a very important concept in OpenTracing. It plays a key role when creating a Span, when injecting call chain information into a transport protocol (Inject), and when extracting it from a transport protocol (Extract).

The data structure of SpanContext is as follows:

SpanContext:
- trace_id: "abc123"
- span_id: "xyz789"
- Baggage Items:
  - special_id: "vsid1738"

  • trace_id and span_id identify the Trace and the Span within it;
  • Baggage Items have the same structure as Span Tags. The only difference is that Span Tags exist only in the current Span and are not passed along the trace, whereas Baggage Items are propagated along the whole call chain.
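
The difference between Span Tags and Baggage Items can be shown with a small model. The SpanContext and Span classes and the start_child helper below are hypothetical and written only for this sketch. The assumption being illustrated is that a child span inherits trace_id and baggage from its parent's context but starts with no tags.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    baggage: dict = field(default_factory=dict)


@dataclass
class Span:
    context: SpanContext
    tags: dict = field(default_factory=dict)


def start_child(parent: Span, span_id: str) -> Span:
    """Child keeps trace_id and a copy of the baggage; tags start empty."""
    ctx = SpanContext(parent.context.trace_id, span_id, dict(parent.context.baggage))
    return Span(context=ctx)


root = Span(SpanContext("abc123", "xyz789", {"special_id": "vsid1738"}))
root.tags["http.method"] = "GET"  # local to the root span only
child = start_child(root, "child001")

print(child.context.baggage)  # {'special_id': 'vsid1738'}
print(child.tags)             # {}
```

Because baggage travels with every downstream call, it is usually kept small.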

To propagate and correlate call relationships across boundaries (between services or protocols), SpanContext must be injected into the transport medium that carries the request downstream, and then extracted from that medium on the downstream side.

Mechanisms provided by the protocol itself, such as HTTP Headers, can usually carry this information, and message middleware such as Kafka also provides a Headers mechanism for the same purpose.

OpenTracing implementations can use the Tracer.Inject(...) and Tracer.Extract(...) methods provided by the API to inject and extract SpanContext.

The following is a pseudo code example:
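
A minimal sketch of the Inject and Extract steps, written in Python with a plain dict standing in for HTTP headers. The header names (x-trace-id, x-span-id, and the x-baggage- prefix) and the free functions inject and extract are assumptions made for illustration. Real OpenTracing tracers define their own wire formats and expose these operations on the Tracer object.

```python
from dataclasses import dataclass, field

BAGGAGE_PREFIX = "x-baggage-"  # hypothetical header naming scheme


@dataclass
class SpanContext:
    trace_id: str
    span_id: str
    baggage: dict = field(default_factory=dict)


def inject(ctx: SpanContext, carrier: dict) -> None:
    """Upstream side: write the context into outgoing headers."""
    carrier["x-trace-id"] = ctx.trace_id
    carrier["x-span-id"] = ctx.span_id
    for key, value in ctx.baggage.items():
        carrier[BAGGAGE_PREFIX + key] = value


def extract(carrier: dict) -> SpanContext:
    """Downstream side: rebuild the context from incoming headers."""
    baggage = {
        key[len(BAGGAGE_PREFIX):]: value
        for key, value in carrier.items()
        if key.startswith(BAGGAGE_PREFIX)
    }
    return SpanContext(carrier["x-trace-id"], carrier["x-span-id"], baggage)


# Upstream service: inject before sending the request.
headers = {"content-type": "application/json"}
inject(SpanContext("abc123", "xyz789", {"special_id": "vsid1738"}), headers)

# Downstream service: extract, then continue the same trace from here.
parent_ctx = extract(headers)
print(parent_ctx)
```

The downstream service would then start its own Span with a reference to parent_ctx, which is how spans from different processes end up in the same Trace.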

3. Current mainstream open source solutions and comparison

The more mainstream open source Tracing projects include Jaeger, Zipkin, Apache SkyWalking, CAT, Pinpoint, and Elastic APM. The source code of all of these projects is hosted on GitHub.

We made a comparison according to the following dimensions:

The following factors need to be considered when introducing existing systems:

  1. Low performance loss
  2. Application-level transparency: minimize intrusion into business code, with the goal of changing as little code as possible, or none at all
  3. Scalability

Based on the above survey, it can be summarized as follows:

  • For a Java-stack application with low cross-language and customization requirements, the low-intrusion Apache SkyWalking project can be given priority. The project is led by Chinese developers and is used by many companies;
  • If multi-language support, customization, and high extensibility matter, Jaeger is preferred (Jaeger is similar to Zipkin and compatible with the original Zipkin protocol, and it has a certain late-mover advantage). Compared with the other solutions, Jaeger and Zipkin focus more on Tracing itself, and their monitoring features are relatively weak;
  • For pure web applications that need no customization and already have an ELK log system in place, Elastic APM can be adopted at low cost;
  • CAT collects metric data based on full logs, which gives it certain advantages for large-scale collection, and it integrates a complete monitoring and alerting mechanism. Many companies in China use it, but it does not support OpenTracing;
  • Pinpoint's main feature is low intrusiveness, with complete APM and call chain tracing functions, but it currently supports only Java and PHP and does not support the OpenTracing standard.


 

Origin blog.csdn.net/qq_41893274/article/details/113959181