0. Preface
This article introduces the principles of distributed tracing systems, the three pillars of "observability", and the OpenTracing standard. It also briefly compares the current mainstream open source distributed tracing systems.
With the rise of application containers and microservices, and with the rapid development of tools such as Docker and Kubernetes, building and deploying microservice applications has become much simpler. But once a large monolithic application is split into microservices, the dependencies and calls between services become extremely complex. These services may be developed by different teams, may use different languages, and may communicate over RPC, RESTful APIs, or other mechanisms. In such an environment, sorting out service dependencies and call relationships, debugging quickly, tracking service processing time, finding performance bottlenecks, and reasonably evaluating service capacity have all become tricky tasks.
- GitHub: https://github.com/opentracing/opentracing-java
- Chinese document: https://wu-sheng.gitbooks.io/opentracing-io/content/
1. Observability and the three pillars
To deal with these problems, the concept of observability was introduced into the software field. Traditional monitoring and alerting focus mainly on abnormal conditions and failure factors of the system. Observability is more concerned with showing the running state of the system from the inside, which is more like a self-examination by the system. An observable system pays more attention to the state of the application itself than to indirect evidence such as the machine or network it runs on. We want to obtain the application's current throughput and latency directly, and to achieve this we need to expose more of the application's runtime information, reasonably and proactively. In today's application development environment, faced with complex systems, our focus gradually shifts from isolated points to a combined view of points, lines, and surfaces. This allows us to understand the system better: not only knowing What, but also answering Why.
Observability currently mainly includes the following three pillars:
- Logging: Logging mainly records discrete events. Applications often write log information in a defined format to a file, and a log collection program then gathers it for analysis and aggregation. Mature solutions such as ELK already exist. Of the three pillars, logs are the most comprehensive and detailed, and under normal circumstances they also consume the most storage. Although all log events can be strung together by time, it is difficult to show the complete call relationship path;
- Metrics: Metrics are usually aggregated information. Compared with Logging they lose some specific detail, but they take far less space than complete logs and can be used for monitoring and alerting. In this area, Prometheus has basically become the de facto standard;
- Tracing: Distributed tracing sits between Logging and Metrics. Along the dimension of a request, it connects the call relationships between services and records how long each call takes. It retains the necessary information and links scattered log events through Spans, which helps us understand system behavior, assists debugging, and supports troubleshooting of performance problems. It is also the focus of this article.
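As a rough illustration of how the three pillars differ in shape, the sketch below records the same handled request as a log line, a metric update, and a span. Everything here is an assumption made for illustration: `handle_request`, the in-memory stores, and the dict-based span are hypothetical and use only the Python standard library. A real system would ship this data to tools such as ELK, Prometheus, or a tracing backend.

```python
import json
import time
from collections import Counter

# Hypothetical in-memory stand-ins for a log file, a metrics registry,
# and a tracing backend.
log_lines = []
request_count = Counter()
spans = []


def handle_request(trace_id, path):
    start = time.time()

    # Logging: one discrete, detailed event per occurrence.
    log_lines.append(json.dumps({
        "ts": start, "level": "INFO", "msg": "request handled",
        "path": path, "trace_id": trace_id,
    }))

    # Metrics: aggregated; the individual request is no longer visible.
    request_count[path] += 1

    # Tracing: a request-scoped record that keeps timing and identity.
    spans.append({
        "trace_id": trace_id,
        "operation": "GET " + path,
        "start": start,
        "finish": time.time(),
    })


handle_request("t1", "/orders")
handle_request("t2", "/orders")

print(len(log_lines), request_count["/orders"], len(spans))
```

Two requests produce two log lines and two spans but a single counter value, which is why metrics take far less space than logs.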
Logging, Metrics, and Tracing each have their own focus as well as overlapping parts.
In recent years, Metrics and Tracing have shown a trend toward integration. Many popular APM (application performance management) systems, such as Datadog, now integrate Tracing and Metrics information.
At the time of writing, at KubeCon 2019, the CNCF announced the merger of the OpenTracing project with OpenCensus, a project initiated by Google. The new project is still under construction, but it has already promised compatibility with the existing OpenTracing protocol.
The CNCF has summarized the popular software and services commonly used to implement observable systems. In the Monitoring category, Prometheus, as the representative, can collect and monitor Metrics on its own, and combined with other graphing tools it provides a more powerful and comprehensive monitoring solution.
2. Distributed tracing system (Tracing) positioning and its standards
Distributed tracing systems have developed rapidly and come in many varieties, but they generally share three core steps: code instrumentation, data storage, and query display.
2.1 Case analysis
The following figure is an example of a distributed call. The client initiates a request. The request first reaches the load balancer, then passes through the authentication service, billing service, then requests the resource, and finally returns the result.
After the data is collected and stored, the distributed tracing system generally chooses to use a timing diagram containing a time axis to present this trace.
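A minimal sketch of such a timing diagram, rendered as text. The span names follow the example above, but the millisecond timings, the nesting depths, and the `render_timeline` helper are hypothetical values chosen for illustration, not data from a real trace.

```python
# Hypothetical spans for the request above:
# (operation, start_ms, finish_ms, nesting depth).
SPANS = [
    ("client request", 0, 100, 0),
    ("load balancer", 5, 95, 1),
    ("auth service", 10, 30, 2),
    ("billing service", 30, 55, 2),
    ("resource request", 55, 90, 2),
]


def render_timeline(spans, width=50):
    """Render each span as a bar positioned on a shared time axis."""
    total = max(finish for _, _, finish, _ in spans)
    lines = []
    for name, start, finish, depth in spans:
        left = start * width // total
        length = max(1, (finish - start) * width // total)
        label = ("  " * depth + name).ljust(22)
        lines.append(label + "|" + " " * left + "#" * length)
    return lines


for line in render_timeline(SPANS):
    print(line)
```

Indentation shows which span is nested inside which, and the bar position and length show when each call started and how long it took.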
However, data collection requires intruding into user code, and the APIs of different tracing systems are not compatible with each other, so switching to another tracing system means major code changes.
2.2 OpenTracing
To solve the problem of API incompatibility between different distributed tracing systems, the OpenTracing specification was created. OpenTracing is a lightweight standardization layer that sits between applications/class libraries and tracing or log analysis programs.
+-------------+  +---------+  +----------+  +------------+
| Application |  | Library |  |   OSS    |  |  RPC/IPC   |
|    Code     |  |  Code   |  | Services |  | Frameworks |
+-------------+  +---------+  +----------+  +------------+
       |              |             |             |
       |              |             |             |
       v              v             v             v
  +------------------------------------------------------+
  |                     OpenTracing                      |
  +------------------------------------------------------+
     |                |                |               |
     |                |                |               |
     v                v                v               v
+-----------+  +-------------+  +-------------+  +-----------+
|  Tracing  |  |   Logging   |  |   Metrics   |  |  Tracing  |
| System A  |  | Framework B |  | Framework C |  | System D  |
+-----------+  +-------------+  +-------------+  +-----------+
Tracing's functional positioning
- Fault location: you can see the complete path of a request, which makes it easier to locate a problem than with discrete logs (because a sampling rate is set in real production environments, a debug switch can be used to force full sampling of a specific request);
- Dependency analysis: generating a service dependency graph from the call relationships;
- Performance analysis and optimization: conveniently recording the time spent by, and the share taken by, each processing unit on the system call path;
- Capacity planning and evaluation;
- Working with Logging and Metrics to strengthen monitoring and alerting.
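To make the dependency analysis item concrete, the sketch below derives a service dependency graph from parent/child span relationships. The span records, their field names, and the `service_dependencies` helper are hypothetical; real tracing backends store richer data and compute this across many traces.

```python
from collections import defaultdict

# Hypothetical span records: each names its service and its parent span
# (None for the root span).
spans = [
    {"span_id": "a", "parent_id": None, "service": "gateway"},
    {"span_id": "b", "parent_id": "a", "service": "auth"},
    {"span_id": "c", "parent_id": "a", "service": "billing"},
    {"span_id": "d", "parent_id": "c", "service": "database"},
    {"span_id": "e", "parent_id": "c", "service": "billing"},  # in-service call
]


def service_dependencies(spans):
    """Map each calling service to the sorted list of services it calls."""
    by_id = {span["span_id"]: span for span in spans}
    deps = defaultdict(set)
    for span in spans:
        parent = by_id.get(span["parent_id"])
        # Only cross-service calls become edges in the dependency graph.
        if parent is not None and parent["service"] != span["service"]:
            deps[parent["service"]].add(span["service"])
    return {caller: sorted(callees) for caller, callees in deps.items()}


print(service_dependencies(spans))
```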
Google's paper "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" first made Tracing popular. Twitter developed Zipkin based on this paper and open sourced the project. After that the field flourished, and a large number of open source and commercial Tracing systems were born.
2.3 OpenTracing Standard
In recent years, call chain monitoring products have emerged one after another. The mainstream tools on the market include all-in-one commercial monitoring solutions such as Datadog, cloud vendor products such as AWS X-Ray and Google Stackdriver Trace, and open source products such as Zipkin and Jaeger.
The Cloud Native Computing Foundation (CNCF) introduced the OpenTracing standard to promote the standardization of Tracing protocols and tools and to unify Trace data structures and formats. By providing platform-independent and vendor-independent APIs, OpenTracing lets developers easily add (or replace) a tracing system implementation, for example switching the backend from Zipkin to Jaeger or SkyWalking.
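The idea of a vendor-independent API can be sketched as follows. This is not the OpenTracing API: `Tracer`, `BackendA`, `BackendB`, and `checkout` are hypothetical names, and both backends are in-memory stand-ins. The point is only that business code depends on the interface, so the backend can be swapped without touching it.

```python
from abc import ABC, abstractmethod


class Tracer(ABC):
    """Hypothetical vendor-neutral interface used by application code."""

    @abstractmethod
    def start_span(self, operation_name):
        """Record the start of an operation and return a span handle."""


class BackendA(Tracer):
    """Stand-in for one tracing system (in-memory, for illustration)."""

    def __init__(self):
        self.recorded = []

    def start_span(self, operation_name):
        self.recorded.append(("A", operation_name))
        return len(self.recorded) - 1


class BackendB(Tracer):
    """Stand-in for another tracing system with a different data format."""

    def __init__(self):
        self.recorded = []

    def start_span(self, operation_name):
        self.recorded.append({"backend": "B", "op": operation_name})
        return len(self.recorded) - 1


def checkout(tracer):
    # Business code only knows the Tracer interface, not the backend.
    tracer.start_span("checkout")
    return "ok"


backend_a, backend_b = BackendA(), BackendB()
checkout(backend_a)
checkout(backend_b)
```

Switching from `BackendA` to `BackendB` changes only the object passed in; `checkout` itself stays the same.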
OpenTracing mainly defines the following basic concepts:
- Trace (call chain): in OpenTracing, a Trace is implicitly defined by the Spans that belong to it. A Trace can be regarded as a directed acyclic graph (DAG) made up of multiple Spans, and the relationships between Spans are called References;
- Span: can be understood as a method call, a program block call, or an RPC/database access. Any program access with a complete time period can be regarded as a Span.
The causal relationships between Spans in a single Trace:
        [Span A]  ←←←(the root span)
            |
     +------+------+
     |             |
 [Span B]      [Span C] ←←←(Span C is a child of Span A, ChildOf)
     |             |
 [Span D]      +---+-------+
               |           |
           [Span E]    [Span F] >>> [Span G] >>> [Span H]
                                       ↑
                                       ↑
                                       ↑
                         (Span G is called after Span F, FollowsFrom)
Each Span contains an operation name, start and end times, Span Tags that carry additional information, Span Logs that record special events within the Span, a SpanContext used to propagate the Span's context, and References that define the relationships between Spans.
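The fields listed above can be sketched as plain data classes. This is an illustrative model, not the OpenTracing API: the class and field names, the `"child_of"` string, and the example values are assumptions.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpanContext:
    trace_id: str
    span_id: str
    baggage: dict = field(default_factory=dict)


@dataclass
class Reference:
    type: str                        # "child_of" or "follows_from"
    referenced_context: SpanContext  # the span this one relates to


@dataclass
class Span:
    operation_name: str
    context: SpanContext
    references: list = field(default_factory=list)
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)
    start_time: float = field(default_factory=time.time)
    finish_time: Optional[float] = None

    def set_tag(self, key, value):
        self.tags[key] = value

    def log(self, event):
        # A log entry is a timestamped event inside the span.
        self.logs.append((time.time(), event))

    def finish(self):
        self.finish_time = time.time()


root = Span("GET /orders", SpanContext("abc123", "span-a"))
child = Span(
    "SELECT orders",
    SpanContext("abc123", "span-b"),
    references=[Reference("child_of", root.context)],
)
child.set_tag("db.type", "sql")
child.log("cache miss")
child.finish()
root.finish()
```

Both spans share the same `trace_id`, and the child's reference to the root's context is what makes the Trace a graph rather than a flat list.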
2.4 About SpanContext
SpanContext is a very important concept in OpenTracing. It plays an important role when creating a Span, when injecting (Inject) call chain information into a transport protocol, and when extracting (Extract) call chain information from a transport protocol.
The SpanContext data structure is as follows:
SpanContext:
- trace_id: "abc123"
- span_id: "xyz789"
- Baggage Items:
  - special_id: "vsid1738"
trace_id and span_id identify the Trace and the Span. Baggage Items have the same structure as Span Tags. The only difference is that a Span Tag exists only in the current Span and is not passed along the whole trace, while Baggage Items are passed along with the call chain.
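The difference between tags and baggage can be sketched as follows. The `Span` class and its `start_child` method are hypothetical, written only to show the propagation rule. The tag key and the operation names are made-up examples.

```python
class Span:
    """Hypothetical span that distinguishes tags from baggage."""

    def __init__(self, name, trace_id, baggage=None):
        self.name = name
        self.trace_id = trace_id
        self.tags = {}                       # local to this span
        self.baggage = dict(baggage or {})   # travels with the call chain

    def start_child(self, name):
        # The child joins the same trace and inherits a copy of the
        # baggage; tags are deliberately not copied.
        return Span(name, self.trace_id, self.baggage)


parent = Span("checkout", "abc123")
parent.tags["http.method"] = "POST"
parent.baggage["special_id"] = "vsid1738"

child = parent.start_child("charge-card")
print(child.tags, child.baggage)
```

Because baggage is copied to every downstream span, it is best kept small. This is a general consequence of the propagation rule rather than something the text above states.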
To propagate call chain information across boundaries (between services or protocols) and associate the calls with each other, the SpanContext must be injected into the transmission medium by the caller and extracted from the transmission medium downstream.
Mechanisms provided by the protocol itself, such as HTTP Headers, can often be used to carry this information. Message middleware such as Kafka also provides a Headers mechanism that serves the same purpose.
OpenTracing implementations can use Tracer.Inject(...) and Tracer.Extract(...) provided by the API to inject and extract the SpanContext.
The following is a pseudo code example:
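A minimal runnable sketch of this flow, using a plain dict as the HTTP header carrier. It does not use a real OpenTracing library: the `inject` and `extract` functions, the `SpanContext` class, and the header names (`x-trace-id`, `x-span-id`, `x-baggage-*`) are assumptions chosen for illustration. Real tracers define their own header formats.

```python
from dataclasses import dataclass, field


@dataclass
class SpanContext:
    trace_id: str
    span_id: str
    baggage: dict = field(default_factory=dict)


# Hypothetical header names; each real tracer defines its own format.
TRACE_HEADER = "x-trace-id"
SPAN_HEADER = "x-span-id"
BAGGAGE_PREFIX = "x-baggage-"


def inject(context, headers):
    """Write the SpanContext into an outgoing header dict (the carrier)."""
    headers[TRACE_HEADER] = context.trace_id
    headers[SPAN_HEADER] = context.span_id
    for key, value in context.baggage.items():
        headers[BAGGAGE_PREFIX + key] = value


def extract(headers):
    """Rebuild the SpanContext from incoming headers, or None if absent."""
    if TRACE_HEADER not in headers or SPAN_HEADER not in headers:
        return None
    baggage = {
        key[len(BAGGAGE_PREFIX):]: value
        for key, value in headers.items()
        if key.startswith(BAGGAGE_PREFIX)
    }
    return SpanContext(headers[TRACE_HEADER], headers[SPAN_HEADER], baggage)


# Caller side: attach the context to the outgoing request headers.
outgoing = {"content-type": "application/json"}
inject(SpanContext("abc123", "xyz789", {"special_id": "vsid1738"}), outgoing)

# Downstream side: recover the context and continue the same trace.
incoming = extract(outgoing)
print(incoming)
```

The downstream service would then create its own Span with a new span_id, the same trace_id, and a ChildOf reference to the extracted context.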
3. Current mainstream open source solutions and comparison
The more mainstream open source Tracing projects include Jaeger, Zipkin, Apache SkyWalking, CAT, Pinpoint, and Elastic APM. The source code of all these projects is hosted on GitHub.
We made a comparison according to the following dimensions:
The following factors need to be considered when introducing existing systems:
- Low performance loss
- Application-level transparency: minimize intrusion into business code, with the goal of changing as little code as possible, or none at all
- Scalability
Based on the above survey, it can be summarized as follows:
- For applications on the Java stack with low cross-language and customization requirements, consider the low-intrusion Apache SkyWalking project first. The project is led by Chinese developers and is used by many companies;
- If multi-language support, customization, and high extensibility matter, prefer Jaeger (Jaeger is similar to Zipkin, is compatible with the original Zipkin protocol, and has a certain late-comer advantage over Zipkin). Compared with the other solutions, Jaeger and Zipkin focus more on Tracing itself, and their monitoring features are relatively weak;
- For pure web applications that need no customization and already have an ELK log system in place, Elastic APM can be adopted at low cost;
- CAT collects metric data based on full log collection, which gives it certain advantages for large-scale collection, and it integrates a complete monitoring and alerting mechanism. Many companies in China use it, but it does not support OpenTracing;
- Pinpoint's main feature is low intrusiveness, with complete APM and call chain tracing functionality, but it currently supports only Java and PHP and does not support the OpenTracing standard.