Distributed Tracking System (2): Zipkin's Span Model

In the article "Distributed Tracking System (1): Zipkin's Background and Design", Zipkin's design and data model were briefly introduced. This article describes Zipkin's Span model in detail, along with the design of some "alternative" Span models.

One more note here: strictly speaking, the proper name is Distributed Tracing System. "Tracking" fits human scenarios better, such as a person being tracked, while "tracing" is the term used in the computer field. Not that the distinction matters much in practice; this article will continue to use "tracking".

Zipkin's Span model is modeled almost entirely on the Span model in Dapper. A Span is used to describe one RPC call, so one RPC call should be associated with only one spanId (not counting the parent spanId). A Span in Zipkin mainly contains three parts of data (a minimal structural sketch follows the list below):

• Basic data: used to associate the nodes of the trace tree and to drive the interface display, including traceId, spanId, parentId, name, timestamp and duration. A span whose parentId is null is displayed as the root node of the trace tree and is, of course, also the starting point of the call chain; to save the cost of generating one more spanId and to make the top-level span easy to recognize, the spanId of the top-level span is the same as the traceId. timestamp records the start time of the call and duration the total time the call took, so timestamp + duration is the end time of the call, and duration determines the length of the span's time bar in the trace tree. Note that name is what is displayed on the time bar of the trace tree node.
• Annotation data: used to record key events. There are only four kinds: cs (Client Send), sr (Server Receive), ss (Server Send) and cr (Client Receive), so in the Span model the Annotation list has at most 4 entries. Each key event contains a value, a timestamp and an endpoint: value is one of cs, sr, ss and cr; timestamp is the time the event occurred; endpoint records the machine (ip) and service name (serviceName) where it occurred. It is natural to conclude that cs and cr happen on the same machine, as do sr and ss; for simplicity, the service names of cs and cr can also be kept the same, and likewise for sr and ss. Annotation data is mainly used to display the details of a Span when the user clicks a Span node.
• BinaryAnnotation data: we are not satisfied with showing only the timing of the call chain in the trace tree. If some business data (logs) needs to be bound to a span, it can be written into a BinaryAnnotation. Its structure is exactly the same as that of an Annotation, and inside the Span it is also a list, so it will not be described again here; however, it is not advisable to put too much data into BinaryAnnotation, otherwise performance and the user experience will suffer.
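To make the three parts more concrete, here is a minimal sketch of such a Span structure in Java. The class and field names follow the description above but are illustrative only; this is not Zipkin's actual source code.

```java
import java.util.ArrayList;
import java.util.List;

// Where an event happened: the machine (ip) and service name.
class Endpoint {
    String serviceName;
    String ipv4;
    Endpoint(String serviceName, String ipv4) {
        this.serviceName = serviceName;
        this.ipv4 = ipv4;
    }
}

// A key event: value is one of "cs", "sr", "ss", "cr".
class Annotation {
    String value;
    long timestamp;      // microseconds
    Endpoint endpoint;
    Annotation(String value, long timestamp, Endpoint endpoint) {
        this.value = value;
        this.timestamp = timestamp;
        this.endpoint = endpoint;
    }
}

// Business data bound to a span; same shape as Annotation, but keyed.
class BinaryAnnotation {
    String key;
    String value;
    Endpoint endpoint;
    BinaryAnnotation(String key, String value, Endpoint endpoint) {
        this.key = key;
        this.value = value;
        this.endpoint = endpoint;
    }
}

class Span {
    // Basic data: tree association and interface display.
    long traceId;
    long id;             // spanId; equal to traceId for the top-level span
    Long parentId;       // null marks the root of the trace tree
    String name;         // label shown on the trace tree's time bar
    Long timestamp;      // start time of the call, microseconds
    Long duration;       // total time; end time = timestamp + duration

    // Annotation data: at most cs, sr, ss and cr.
    List<Annotation> annotations = new ArrayList<>();

    // BinaryAnnotation data: optional business data; keep it small.
    List<BinaryAnnotation> binaryAnnotations = new ArrayList<>();
}
```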

      

Now we understand the internal structure of a Span, but this is its final form, that is, the form Zipkin shows to the user after the data has been collected and assembled. When spans are generated they are "incomplete": the Zipkin server has to assemble the collected spans that share the same traceId and spanId into the final, complete Span described above. This may not sound intuitive, so let's illustrate it with the figure below:


 

Zipkin data collection (Figure 1)

 

The figure above already appeared in my first Zipkin blog post, so I will not elaborate on it here. Let's look directly at the Span detail diagram that corresponds to it:



 

Span data flow (Figure 2)

Note that the figure above does not show every detail of the Span (such as name and binaryAnnotation), but that does not affect our analysis. ① and ⑥ in the figure together form one complete RPC call, which takes place between server 0 and server 1. Obviously, the spanId describing this RPC call is 1000, so this is one and the same span; its data simply comes from two different servers (applications): server 0 and server 1. At a lower level, this span is represented by two trace logs, one collected on server 0 and the other collected on server 1, and their traceId, spanId and parentSpanId are all identical! This span will be the top node of the trace tree, because its parentSpanId is null. For step ①, the sr timestamp on server 1 minus the cs timestamp on server 0 is approximately the network time (ignoring clock differences between servers); similarly, for the other steps, sr - cs and cr - ss both yield network time. Now look at request steps ② and ④. In terms of the trace tree they are sub-calls under ①, so their parentSpanId is ①'s spanId, 1000. Steps ② and ④ each generate their own spanId (1001 and 1002 in the figure), so what looks like a simple RPC process in the figure actually produces 6 span logs in total, which will be assembled into 3 spans on the Zipkin server.
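To show what "assembling spans with the same traceId and spanId" could look like, here is a rough sketch that reuses the illustrative Span classes from the earlier code block. It merges the client half of span 1000 (cs/cr logged on server 0) with the server half (sr/ss logged on server 1). This is only an assumption of how a collector might do it, not Zipkin's actual collector code.

```java
class SpanAssembler {
    // Merge two partial logs that share traceId and spanId into one complete Span.
    static Span merge(Span clientHalf, Span serverHalf) {
        Span merged = new Span();
        merged.traceId = clientHalf.traceId;   // identical in both halves
        merged.id = clientHalf.id;             // e.g. 1000
        merged.parentId = clientHalf.parentId; // null for the top-level span
        merged.name = clientHalf.name != null ? clientHalf.name : serverHalf.name;

        // The client half owns cs and cr, so it defines timestamp and duration.
        merged.timestamp = clientHalf.timestamp;
        merged.duration = clientHalf.duration;

        // The complete span ends up with all four key events: cs, sr, ss, cr.
        merged.annotations.addAll(clientHalf.annotations);
        merged.annotations.addAll(serverHalf.annotations);

        // Business data recorded on either side is kept as well.
        merged.binaryAnnotations.addAll(clientHalf.binaryAnnotations);
        merged.binaryAnnotations.addAll(serverHalf.binaryAnnotations);
        return merged;
    }
}
```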

Then comes the question: this call involves 3 spanIds on server 1: 1000, 1001 and 1002. If I want to record business data related to this call on server 1 (via BinaryAnnotation), which Span should that data be bound to? If we had to choose, we would definitely choose 1000, because on server 1 it is not yet known which downstream services this request will call (only server 2 and server 3 are drawn in the figure, but it might call a dozen or more downstream services and generate a dozen or more spanIds), so it seems more reasonable to bind the business data to the parent span (1000) of those spans. Besides, when the business log is generated, the downstream calls may not even have started yet, so it can only be bound to 1000 anyway. A small sketch of this binding is shown below.
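As a hedged sketch of that choice, again reusing the illustrative classes from the first code block: business data recorded on server 1 goes into a BinaryAnnotation on span 1000, not on 1001 or 1002. The key, value and IP address here are invented purely for illustration.

```java
class BusinessDataRecorder {
    // span1000 is the span on server 1 whose spanId is 1000 (parent of 1001 and 1002).
    static void recordOrderId(Span span1000, String orderId) {
        Endpoint service1 = new Endpoint("service1", "192.168.1.101"); // hypothetical ip
        // Bind the business data to the parent span; the downstream spans may not exist yet.
        span1000.binaryAnnotations.add(
                new BinaryAnnotation("order.id", orderId, service1));
    }
}
```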

Let's first take a look at how the Spans from Figure 2 are shown in Zipkin's trace tree, as pictured below:


Zipkin trace tree (Figure 3)

Of course, some of the data will differ from Figure 2 (such as timestamp and duration), but that does not affect our analysis. You can see that the smallest time unit in Zipkin is the microsecond (one thousandth of a millisecond), so the total time of the RPC shown in Figure 3 is 96.2ms. Some people will surely wonder: the RPC passed through four servers, so why are there only three nodes in the trace tree? Because in the trace tree, one Span (one spanId, to be precise) is displayed as exactly one tree node. For example, the tree node Service1 represents the process of Gateway (server 0) calling Service1 (server 1), and the tree node Service2 represents the process of Service1 (server 1) calling Service2 (server 2). Some people will also ask: for the tree node Service1 we recorded four timestamps, cs, sr, ss and cr, yet the time bar is drawn only from cs and cr (duration = cr - cs), so where did sr and ss go (remember that we can compute network time via sr - cs and cr - ss)? We can click on the Service1 node and open the detailed information of that Span (its annotation and binaryAnnotation data), as shown below:

Span details (Figure 4)

Relative Time is the time relative to the start of the trace, i.e. how long after the start each event (cs, sr, ss, cr) occurred; because Service1 is the top node, the Relative Time in the first row is empty. So the network time of the request (Gateway calling Service1) is 10ms, and the network time of the response (Service1 replying to Gateway) is 96.3 - 94.3 = 2ms. In other words, with Zipkin's current page design the network time can only be seen by clicking a tree node and reading the Span's detail page, and even then it requires a small calculation, which is not very intuitive. Taobao's EagleEye (鹰眼) system instead splits the time bar into two colors using the four timestamps cs, sr, ss and cr, which is more intuitive. The calculation itself is simple, as the sketch below shows.
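Here is a small worked example of that calculation, using the relative times described above for the Service1 span (values assumed from Figure 4 and expressed in microseconds):

```java
public class NetworkTimeExample {
    public static void main(String[] args) {
        // Relative times of the four key events of the Service1 span, in microseconds.
        long cs = 0;        // Gateway sends the request
        long sr = 10_000;   // Service1 receives it 10ms later
        long ss = 94_300;   // Service1 sends the response at 94.3ms
        long cr = 96_300;   // Gateway receives the response at 96.3ms

        long requestNetwork  = sr - cs;  // 10_000us = 10ms on the wire for the request
        long responseNetwork = cr - ss;  //  2_000us = 2ms on the wire for the response
        long serverTime      = ss - sr;  // time spent inside Service1
        long duration        = cr - cs;  // length of the time bar in the trace tree

        System.out.printf("request=%dus response=%dus server=%dus total=%dus%n",
                requestNetwork, responseNetwork, serverTime, duration);
    }
}
```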

Probably most people find it a bit awkward that an RPC call crossing four systems shows only three nodes. For the call in Figure 1, what we would rather see is a Service1 node hanging under the Gateway node, meaning Gateway called Service1, and Service2 and Service3 nodes hanging under the Service1 node, meaning Service1 called Service2 and Service3; that is easier to understand. So we came up with the idea that every node (server application) the RPC chain passes through generates one spanId. In the figure, the RPC passes through Gateway, Service1, Service2 and Service3 once each, so 4 spanIds are generated in total (Zipkin generates only 3 spanIds in Figure 2), and the number of spanIds now matches the number of nodes (provided the RPC chain passes through each node only once, i.e. there are no mutual dependencies between nodes). With this design the Span data flows as shown in the figure below:



 

Modified Span data flow (Figure 5)

Figure 5 clearly shows that there are still 6 span logs, but each server node now produces one spanId (1000, 1001, 1002 and 1003) instead of only 3 spanIds as in the original Figure 2. There is another benefit: an RPC call only needs to pass along traceId and spanId, instead of traceId, spanId and parentSpanId as in Zipkin's design. But we immediately found a problem: on the server 1 node in Figure 5, spanId 1001 records two pairs of cs and cr, which makes it impossible to tell which pair corresponds to the call to server 2 and which to the call to server 3. So this design was rejected outright.
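To make the rejected design's problem concrete, here is a tiny sketch (again with the illustrative classes from the first code block; the timestamps and IP are invented) of what span 1001 would look like on server 1: two cs/cr pairs, with nothing in the data saying which pair belongs to the call to server 2 and which to server 3.

```java
class Figure5Ambiguity {
    static Span buildSpan1001() {
        Endpoint service1 = new Endpoint("service1", "192.168.1.101"); // hypothetical ip
        Span span = new Span();
        span.traceId = 1000L;
        span.id = 1001L;
        // First downstream call... but is it the one to server 2 or to server 3?
        span.annotations.add(new Annotation("cs", 1_000_000L, service1));
        span.annotations.add(new Annotation("cr", 1_030_000L, service1));
        // Second downstream call: indistinguishable from the first by spanId alone.
        span.annotations.add(new Annotation("cs", 1_040_000L, service1));
        span.annotations.add(new Annotation("cr", 1_070_000L, service1));
        return span;
    }
}
```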

So we changed our approach: instead of spanId and parentSpanId, we use spanId and childSpanId, where the childSpanId is generated by the parent node and passed down to the child node, as shown in the figure below:



 

New Span data flow (Figure 6)


Figure 6 shows the obvious change: there is no longer a parentSpanId; a childSpanId is used instead, so what is passed between RPCs is traceId and childSpanId, which directly solves the problem encountered in Figure 5. Although the designs of Figures 5 and 6 violate the principle that one RPC call is maintained by the data of a single spanId, they really are easier to accept and understand in the trace tree display (tree nodes correspond one to one to server nodes), and they also reduce the data passed between RPCs, so why not? A rough sketch of the two propagation styles follows.
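To contrast what travels with an RPC in the two designs, here is a rough sketch. The X-B3-* header names follow Zipkin's B3 propagation convention; the Trace-Id / Span-Id names for the childSpanId variant and the generateSpanId() helper are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

class Propagation {
    static long generateSpanId() {
        return ThreadLocalRandom.current().nextLong(Long.MAX_VALUE);
    }

    // Zipkin-style: three ids travel with the call (traceId, new spanId, parentSpanId).
    static Map<String, String> zipkinStyle(long traceId, long callerSpanId) {
        Map<String, String> headers = new HashMap<>();
        headers.put("X-B3-TraceId", Long.toHexString(traceId));
        headers.put("X-B3-SpanId", Long.toHexString(generateSpanId()));
        headers.put("X-B3-ParentSpanId", Long.toHexString(callerSpanId));
        return headers;
    }

    // childSpanId-style (Figure 6): the parent generates the child's id, keeps a copy
    // in its own span log as childSpanId, and only two ids travel with the call.
    static Map<String, String> childSpanIdStyle(long traceId) {
        long childSpanId = generateSpanId();
        // ... record childSpanId in the parent's own span log here ...
        Map<String, String> headers = new HashMap<>();
        headers.put("Trace-Id", Long.toHexString(traceId));
        headers.put("Span-Id", Long.toHexString(childSpanId));
        return headers;
    }
}
```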
