[Distributed System] Distributed Tracing: Jaeger Installation & Getting Started

1. Background

Before introducing Jaeger, there is some background worth knowing.
Jaeger's implementation follows the OpenTracing specification. So what is the OpenTracing specification?
OpenTracing defines a set of platform-independent and vendor-independent tracing APIs, so that developers can easily add or swap the implementation of a distributed tracing system. It dates back to 2016. A related project is Google's OpenCensus, and the two later merged into OpenTelemetry. Since this article is mainly about Jaeger, please refer to the Alibaba Cloud article "OpenTelemetry - A New Era of Observability" for more of this background.

2. Jaeger and Zipkin

Jaeger is a distributed tracing system developed by Uber and later donated to the CNCF.
Its closest counterpart is Zipkin, which was inspired by Google's Dapper paper, developed by Twitter, and is now maintained by a dedicated team. Note that Zipkin is the older of the two: it was launched in 2012, while the first official version of Jaeger was released in 2017. Even so, the two have roughly the same number of stars on GitHub.

Some comparisons follow (figures as of the time this article was published):

                                 Zipkin                               Jaeger
Stars                            13.3K                                11.4K
First release                    2012                                 2017
Implementation language          Java                                 Go
Open issues                      161                                  332
Versions released                202                                  34
Latest release                   2020-04-16                           2020-06-19
Contributors                     139                                  156
Officially supported languages   C#, Go, Java, JS, Ruby, Scala, PHP   C#, C++, Go, Java, Node.js, Python
Official docs                    Zipkin-docs                          Jaeger-docs
Backend storage                  Cassandra, Elasticsearch             Cassandra 3.4+, Elasticsearch 5.x/6.x/7.x

The languages listed above are the officially supported ones; both projects also have many community-supported client libraries, which can be found on their official websites.
Jaeger is the newcomer, and its architecture borrows from Zipkin's design. The two do not differ much, and the gap between them keeps narrowing.
In addition, both support in-memory storage, which makes it easy to set up a test environment.

2.1 About Jaeger

The features listed in the official documentation:

  • Distributed context propagation
  • Distributed transaction monitoring
  • Root cause analysis
  • Service dependency analysis
  • Performance and latency optimization

Scalability

The Jaeger backend is designed to have no single point of failure and to scale with business needs; Uber uses it to process billions of spans every day.

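A span represents a logical unit of work and carries an operation name, a start time, and a duration. Spans can be nested or sit side by side, and they are ordered relative to one another.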

Native support for OpenTracing

  • Represents traces as directed acyclic graphs via span references
  • Supports strongly typed span tags and structured logs
  • Supports distributed context propagation through baggage
Jaeger's backend, web UI, and instrumentation libraries were all designed and implemented to support the OpenTracing standard.
In addition, the OpenTelemetry project mentioned at the beginning is compatible with OpenTracing, so I recommend using the OpenTelemetry API directly. The OpenTelemetry site lists SDKs and instrumentation libraries for various languages and frameworks.

Cloud native deployment

  • The official Jaeger backend is packaged and published as Docker images;
  • The binaries can load configuration in three ways: CLI options, environment variables, and configuration files;
  • For Kubernetes deployment, see the Kubernetes operator, the Kubernetes templates, and the Helm chart.

Observability
All Jaeger backend components expose metrics to Prometheus by default (other monitoring backends are supported as well);
logs are written to stdout through the third-party logging library zap.

Backward compatibility with Zipkin clients
If you already use Zipkin as your tracing platform and want to migrate to Jaeger, there is not much to worry about: you do not need to rewrite the client code. The Jaeger backend accepts spans in Zipkin format, so you only have to point the clients at the Jaeger backend.

Jaeger's creator has published a book, Mastering Distributed Tracing, which covers all aspects of Jaeger's design and operation as well as distributed tracing in general.

Which of the two to choose is a matter of opinion. Personally, if you or your team mainly develop in Go, I recommend Jaeger: when you run into library problems you can easily read the client or Jaeger source code to locate the issue, perhaps even fix it yourself, or at least file precise bug reports and questions on GitHub.
Reading the source code is also a convenient way to learn from its design.

3. Installation

For the convenience of demonstration, use the officially recommended Docker quick start method:
docker run -d --name=jaeger -p6831:6831/udp -p16686:16686 jaegertracing/all-in-one:latest

Browser Web UI: http://localhost:16686/

Note: The Jaeger packaged in this Docker image stores data in memory and is only suitable for testing; for production use you need to configure a storage backend.
[Figure: Jaeger Web UI]

4. Usage

Now we need some data to explore in the UI.

4.1 Starting an application

The application is the HotROD example from the Jaeger repository.
Here I start it from source:

git clone git@github.com:jaegertracing/jaeger.git jaeger
cd jaeger/examples/hotrod
go run ./main.go all

The example can also be started from a Docker image; see the example's documentation.

Access hotrod's webUI: http://127.0.0.1:8080

Pitfall: The first time I opened this page it was extremely slow and the buttons did not respond. It took me an hour or two to find the cause. In short, your machine needs unrestricted Internet access, because the page loads a JS file from an external address when it opens, which you can see in the browser's developer tools (F12). (This frustrated me enough that I opened an issue. Why does a local example need to load a JS file from the public network instead of bundling it?)
Update 2020-08-20: the main repository has merged my PR fixing this pitfall. Note that the page still loads one JS file and one CSS file from online sources, but both are reachable without a proxy.

Enough complaining; let's continue. Below is the page once it has loaded correctly. There is an ID in the upper left corner, and a new one is generated on every refresh. The four buttons represent four customers. Clicking a button sends an order request, and the response contains a license plate number and an estimated time of arrival.
[Figure: HotROD web UI]

4.2 Send request

Click a button to send an order request. The result is shown in the figure below.
[Figure: HotROD page after sending a request]
The response shows the license plate number, the estimated time of arrival, the request sequence number, and the request latency.

4.2 Viewing the service architecture in Jaeger

Switch to the Jaeger web UI and click System Architecture at the top, then DAG (directed acyclic graph).
[Figure: service dependency graph (DAG)]
This is the microservice architecture of the HotROD application we just started. You can see how many services there are and how they depend on each other.
There are four services (real ones) and two storage backends (simulated by components). The numbers on the edges are call counts. The figure shows that redis receives the most calls, which puzzled me a little, so let's go back to the Jaeger main page and take a look.

4.3 View a trace

[Figure: trace search results for the frontend service]
From the architecture diagram above we can tell that frontend is the top-level service, so all call records can be queried through it.
The result above is what you get after clicking the search button. The first record shows 3 Errors, followed by the services or stages the request went through.
Click it to see the details.
[Figure: trace detail view]
A new term appears here: span. Let's first introduce what a trace is. A trace represents the execution of a transaction or workflow through a (distributed) system. A span represents a single unit of work done in the distributed system and also contains "references" to other spans, which allows multiple spans to be assembled into a complete trace.
For example, take a request that queries user information, /get_user?uid=1, which goes through several steps:

  • First, the request reaches the routing service, which calls a handler according to the path
  • Second, the handler calls a backend service
  • Third, that service dispatches to its own handler according to the route
  • Fourth, the service's handler queries the database

There are four stages here, and each of them takes time. Each stage is a span, and you will notice that each earlier stage encloses all the later ones. This nesting is what the span "references" mentioned above express. Each span also carries some metadata:

  • span ID
  • operation name
  • start time, end time, and duration
  • tags: user-defined labels that make it easier to query, filter, and understand the data
  • logs: records of specific moments or events within the span, plus any other debugging or informational output from the application
  • span context: the state passed across process boundaries to child spans; it is typically used to establish references when building the trace graph
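The span fields and the /get_user example can be put together in a small data-model sketch. Everything here is made up for illustration: the `Span` and `LogEntry` types, the `buildGetUserTrace` helper, the operation names and the timings. It is not the data model of any Jaeger library; it only shows how parent references and timestamps let each stage enclose the stages after it.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// LogEntry is a timestamped event recorded inside a span.
type LogEntry struct {
	Time   time.Time
	Fields map[string]string
}

// Span mirrors the metadata listed above (illustrative only).
type Span struct {
	SpanID        string
	ParentID      string // reference to the parent span; empty for the root
	OperationName string
	Start, End    time.Time
	Tags          map[string]string
	Logs          []LogEntry
}

// Duration is derived from the start and end timestamps.
func (s Span) Duration() time.Duration { return s.End.Sub(s.Start) }

// Encloses reports whether child runs entirely inside s's time window.
func (s Span) Encloses(child Span) bool {
	return !child.Start.Before(s.Start) && !child.End.After(s.End)
}

// buildGetUserTrace models the four stages of /get_user?uid=1 with
// invented timings; each span references the previous one as its parent.
func buildGetUserTrace() []Span {
	base := time.Unix(0, 0)
	ms := func(n int) time.Time { return base.Add(time.Duration(n) * time.Millisecond) }
	return []Span{
		{SpanID: "1", OperationName: "router /get_user", Start: ms(0), End: ms(40),
			Tags: map[string]string{"uid": "1"}},
		{SpanID: "2", ParentID: "1", OperationName: "handler", Start: ms(2), End: ms(38)},
		{SpanID: "3", ParentID: "2", OperationName: "service handler", Start: ms(5), End: ms(35)},
		{SpanID: "4", ParentID: "3", OperationName: "database query", Start: ms(10), End: ms(30),
			Logs: []LogEntry{{Time: ms(12), Fields: map[string]string{"event": "query sent"}}}},
	}
}

func main() {
	// Indent by depth to show how the stages nest.
	for depth, s := range buildGetUserTrace() {
		fmt.Printf("%s%s (%v)\n", strings.Repeat("  ", depth), s.OperationName, s.Duration())
	}
}
```

The root span's duration covers the whole request, which is why the first bar in a trace timeline is always the longest.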

The call flow shown in the trace detail view above is as follows:

  1. The frontend service receives an external HTTP GET request on the route /dispatch
  2. The frontend service calls the /customer endpoint of the customer service
  3. The customer service executes a MySQL query and returns the result to the frontend service
  4. The frontend service then makes an RPC call to the driver service; the endpoint is Driver::findNearest
  5. The driver service calls redis multiple times, and you can see that some of these calls fail
  6. The frontend service then makes multiple HTTP GET calls to the route service on the route /route
  7. Finally, the frontend service returns the result.

Click on any span to see its details, including tags and logs. For a failed redis call, for example, the log section shows more detail, namely what the application code logged.
[Figure: span details with tags and logs]

4.4 Contextualized logging

Simply put, the log section under each span contains only the records produced by that particular call, which is very helpful. In the past we would go straight to tail -f x.log, and with many concurrent requests it was hard to pick out the problem from so many interleaved log records.
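The idea can be sketched without any tracing library: if every log line carries the ID of the request that produced it, the lines for a single request can be pulled out of interleaved output. The `trace_id=` line format, the helper names and the log messages below are all assumptions made for this illustration; Jaeger itself attaches the logs to spans rather than relying on string prefixes.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// traceIDKey is an unexported context key type to avoid collisions.
type traceIDKey struct{}

// WithTraceID stores the trace ID in the request's context.
func WithTraceID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, traceIDKey{}, id)
}

// FormatLog prefixes a message with the trace ID found in ctx.
func FormatLog(ctx context.Context, msg string) string {
	id, ok := ctx.Value(traceIDKey{}).(string)
	if !ok {
		id = "-"
	}
	return fmt.Sprintf("trace_id=%s %s", id, msg)
}

// LinesForTrace picks one request's lines out of interleaved log output.
func LinesForTrace(lines []string, id string) []string {
	prefix := "trace_id=" + id + " "
	var out []string
	for _, l := range lines {
		if strings.HasPrefix(l, prefix) {
			out = append(out, l)
		}
	}
	return out
}

// demoLines simulates two concurrent requests writing to the same log.
func demoLines() []string {
	a := WithTraceID(context.Background(), "aaa")
	b := WithTraceID(context.Background(), "bbb")
	return []string{
		FormatLog(a, "finding nearest drivers"),
		FormatLog(b, "finding nearest drivers"),
		FormatLog(a, "redis timeout"),
	}
}

func main() {
	for _, l := range LinesForTrace(demoLines(), "aaa") {
		fmt.Println(l)
	}
}
```

In the Jaeger UI this filtering is already done: opening a span shows exactly the log entries recorded during that call.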

4.5 Span Tags & Logs

As mentioned earlier, tags are user-defined key-value pairs that can be used to filter traces in Jaeger, while logs record events closely tied to the span's request, usually with a timestamp, such as a redis timeout. (If you really want to, you can also record such a timeout as a tag, for example event=timeout.)

Note: The semantic conventions in the OpenTracing specification repository describe tag names and log fields that cover most situations, so we should follow them rather than invent our own names.

4.6 Call Chain Breakdown

Now focus on the frontend service: collapse all the spans beneath it so that only its direct calls remain, which gives the following figure.
[Figure: trace timeline with downstream spans collapsed]
This figure makes the flow of the trace clearer: which spans run sequentially and which run in parallel.
Sequential spans:

  • /customer
  • /driver.DriverService/FindNearest

Parallel spans:

  • /route (three requests run concurrently at a time; their blue bars overlap on the timeline in the figure)

I hardly need to say how convenient this is: the concurrency in the code is plainly visible.
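The pattern visible in the timeline, two sequential calls followed by /route calls that run at most three at a time, can be reproduced with a small sketch. The `call` and `runDispatch` functions are hypothetical stand-ins and the sleep only simulates latency; this is not HotROD's code, just one common way to bound concurrency in Go (a buffered channel used as a semaphore).

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call stands in for an HTTP or RPC call to a downstream service.
func call(name string) string {
	time.Sleep(5 * time.Millisecond)
	return name + " ok"
}

// runDispatch makes two sequential calls, then `routes` calls to /route
// with at most `limit` of them in flight. It returns all results and
// the peak number of concurrent /route calls that was observed.
func runDispatch(routes, limit int) (results []string, maxInFlight int32) {
	// Sequential part: each call starts only after the previous one returns.
	results = append(results, call("/customer"))
	results = append(results, call("/driver.DriverService/FindNearest"))

	routeResults := make([]string, routes)
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore
	var wg sync.WaitGroup
	var mu sync.Mutex
	var inFlight int32

	for i := 0; i < routes; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` calls are already running
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot after the call is done
			n := atomic.AddInt32(&inFlight, 1)
			mu.Lock()
			if n > maxInFlight {
				maxInFlight = n
			}
			mu.Unlock()
			routeResults[i] = call(fmt.Sprintf("/route #%d", i))
			atomic.AddInt32(&inFlight, -1)
		}(i)
	}
	wg.Wait()
	return append(results, routeResults...), maxInFlight
}

func main() {
	results, peak := runDispatch(10, 3)
	fmt.Println(len(results), "calls finished; peak concurrent /route calls:", peak)
}
```

In a trace, the sequential calls would show up as bars placed one after another and the /route calls as groups of overlapping bars.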

4.6 Filter & Sort

Searches can be filtered by service, tags, and min/max duration.
There is also a sorting function: the search results can be ordered in several ways, as shown in the figure below.
[Figure: sort options in the search results]
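As a rough model of what such a search does, the sketch below filters trace summaries by service, tag, and duration range, then sorts them longest first (one possible ordering). The `TraceSummary` and `Query` types and the sample data are invented for this illustration and say nothing about how the Jaeger query service is implemented.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// TraceSummary is a simplified search result row (illustrative only).
type TraceSummary struct {
	TraceID  string
	Service  string
	Duration time.Duration
	Tags     map[string]string
}

// Query holds the filter criteria; zero values mean "no constraint".
type Query struct {
	Service     string
	Tags        map[string]string
	MinDuration time.Duration
	MaxDuration time.Duration
}

// Matches reports whether a trace satisfies every criterion in the query.
func (q Query) Matches(t TraceSummary) bool {
	if q.Service != "" && t.Service != q.Service {
		return false
	}
	for k, v := range q.Tags {
		if t.Tags[k] != v {
			return false
		}
	}
	if q.MinDuration > 0 && t.Duration < q.MinDuration {
		return false
	}
	if q.MaxDuration > 0 && t.Duration > q.MaxDuration {
		return false
	}
	return true
}

// Search filters the traces and sorts the matches longest first.
func Search(traces []TraceSummary, q Query) []TraceSummary {
	var out []TraceSummary
	for _, t := range traces {
		if q.Matches(t) {
			out = append(out, t)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Duration > out[j].Duration })
	return out
}

// demoSearch runs a sample query and returns the matching trace IDs.
func demoSearch() []string {
	errTag := map[string]string{"error": "true"}
	traces := []TraceSummary{
		{TraceID: "t1", Service: "frontend", Duration: 700 * time.Millisecond, Tags: errTag},
		{TraceID: "t2", Service: "frontend", Duration: 300 * time.Millisecond},
		{TraceID: "t3", Service: "frontend", Duration: 900 * time.Millisecond, Tags: errTag},
		{TraceID: "t4", Service: "driver", Duration: 800 * time.Millisecond, Tags: errTag},
		{TraceID: "t5", Service: "frontend", Duration: 50 * time.Millisecond, Tags: errTag},
	}
	q := Query{Service: "frontend", Tags: errTag, MinDuration: 100 * time.Millisecond, MaxDuration: time.Second}
	var ids []string
	for _, t := range Search(traces, q) {
		ids = append(ids, t.TraceID)
	}
	return ids
}

func main() {
	fmt.Println(demoSearch())
}
```

With the sample data, only the two slow frontend traces carrying error=true remain, with the slower one listed first.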

4.7 Comparing Multiple Traces

Select at least two traces in the search results and click the compare button in the upper right corner.
[Figure: selecting traces to compare]
The picture below is the comparison page.
[Figure: trace comparison page]
I was confused when I first tried to read this picture. If you are interested in the article that introduced this feature, see the original text listed in the reference articles at the end. The following describes how to read the comparison graph.
First, ignore the colors of the blocks: the blocks connected by arrows form the call chain of a trace, and each block is a span.
Generally we only compare two requests of the same kind. The picture above is actually the two traces overlaid. If you pick two different requests to compare, you get the picture below.
[Figure: comparison of two traces from different requests]
Obviously, the traces of requests to two different routes cannot be overlaid, so there is really nothing to compare, but this picture does help explain the colors. Let's zoom in:
[Figure: zoomed-in view of the comparison]

  • Dark green, indicating that this span only exists in trace-B, and A does not have this span
  • Dark red, indicating that this span only exists in trace-A, B does not have this span

But this is not absolute. In the comparison below I saw a span that is also dark green even though both traces contain it; B simply has more of it than A.
[Figure: a dark green span that exists in both traces]
I think this can be read from the value +8%: B has 14 of these spans and A has 13, which is about 8% more. It seems we should pay more attention to this value, though I think the depth of the color still indicates the size of the gap.

Let's look at another picture:
[Figure: comparison showing light green and light red spans]
Here there are two more colors:

  • Light green, indicating that the number of spans in trace-B (the one on the right) exceeds trace-A
  • Light red, indicating that the number of spans in trace-A (the one on the left) is more than trace-B

Finally, the color you will see most is gray, which indicates that the span exists in both traces and occurs the same number of times.
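The bookkeeping behind such a comparison can be sketched as follows: count the spans per operation in each trace, then classify the difference. The category names, the `categorize`, `percentChange` and `Compare` functions and the sample counts are assumptions for illustration. They follow my reading of the legend above rather than the actual logic of the Jaeger UI, and the mapping from categories to colors is deliberately left out since, as noted, it is not clear-cut.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Diff describes how often one operation occurs in trace A and trace B.
type Diff struct {
	Operation     string
	InA, InB      int
	Category      string
	PercentChange int // rounded, relative to A; 0 when A has no such span
}

// categorize classifies the span counts of a single operation.
func categorize(a, b int) string {
	switch {
	case a == 0 && b > 0:
		return "only in B"
	case b == 0 && a > 0:
		return "only in A"
	case b > a:
		return "more in B"
	case a > b:
		return "more in A"
	default:
		return "same"
	}
}

// percentChange returns the rounded change from a to b in percent.
func percentChange(a, b int) int {
	if a == 0 {
		return 0
	}
	return int(math.Round(float64(b-a) / float64(a) * 100))
}

// Compare diffs two traces given as operation -> span count maps.
func Compare(a, b map[string]int) []Diff {
	seen := map[string]bool{}
	var ops []string
	for _, m := range []map[string]int{a, b} {
		for op := range m {
			if !seen[op] {
				seen[op] = true
				ops = append(ops, op)
			}
		}
	}
	sort.Strings(ops) // stable output order
	diffs := make([]Diff, 0, len(ops))
	for _, op := range ops {
		diffs = append(diffs, Diff{
			Operation:     op,
			InA:           a[op],
			InB:           b[op],
			Category:      categorize(a[op], b[op]),
			PercentChange: percentChange(a[op], b[op]),
		})
	}
	return diffs
}

func main() {
	a := map[string]int{"redis GetDriver": 13, "mysql SQL SELECT": 1, "route /route": 10}
	b := map[string]int{"redis GetDriver": 14, "mysql SQL SELECT": 1}
	for _, d := range Compare(a, b) {
		fmt.Printf("%-18s A=%-3d B=%-3d %-10s %+d%%\n", d.Operation, d.InA, d.InB, d.Category, d.PercentChange)
	}
}
```

For 13 spans in A and 14 in B the change is (14 - 13) / 13, roughly 7.7%, which rounds to the +8% shown in the UI.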

So how exactly do you draw conclusions from a comparison? Look at the picture below.
[Figure: comparison where many child spans appear in dark red]
We can infer the following: A and B overlap near the root span, but a large number of child spans are shown in dark red, meaning that trace-B lacks them. This generally indicates that a call failed at one of the gray spans, causing the whole chain of spans below it to disappear.

Such comparisons provide timely and fine-grained clues when investigating incidents, letting us narrow the search quickly and with confidence.

When something like the picture above happens, we should not go straight to the details of the dark red spans; instead, check the logs of the gray spans next to them to locate the problem quickly.

Conclusion

It took a lot of energy and time to write this article, and there may be errors in the text or expressions in the article. Readers are welcome to correct them.

If reproduced, please indicate source!

Reference article:

  1. Take OpenTracing for a HotROD ride
  2. Trace comparisons arrive in Jaeger 1.7

Origin blog.csdn.net/sc_lilei/article/details/107834597