Distributed call tracking and monitoring in practice

The Status Quo of Distributed Call Systems

As Internet architectures have expanded, distributed systems have become increasingly complex, and more and more components are themselves distributed: microservices, messaging, distributed databases, distributed caches, distributed object storage, cross-domain calls, and so on. Together these components form a complex distributed network.

[Figure: a single front-end request fanning out to many back-end services]

As shown on the right side of the figure above, a single request to application A may trigger calls to dozens or even more services behind it; touching one part affects the whole.

If we compare a distributed system to a highway network, each front-end request is a vehicle traveling on the highway, and each application that processes the request is a toll station along the way. Every toll station records the vehicle's passage in a log: time, license plate, station, road, fee, and so on. If the logs from all toll stations are combined, the complete journey of a car can be reconstructed from its unique license plate number. Distributed call tracing and monitoring follow the same idea: track every request and determine which applications it passed through, how long each step took, and other details.

Alibaba's Distributed Call Tracing Implementation: EagleEye

[Figure: EagleEye and the call chain concept]

Alibaba's distributed call tracing is implemented by the EagleEye system, a log-based distributed call tracing system whose design derives from Google's Dapper paper. Its core concept is the call chain: for each request, a globally unique ID (TraceId) is generated, and the otherwise "isolated" call records of different systems are associated through this ID to reconstruct far more valuable data.

[Figure: an example call chain from the production environment]

The figure above shows a call chain from the production environment. The application name column lists the applications the request passed through: the Buy application is hit first, which then calls delivery, tee, inventoryplatform, and others, forming a call tree (indentation in the tree indicates nesting). From the call tree it is easy to see how the front-end request was processed end to end.

Note also that the figure contains rows with white and blue backgrounds. A blue background means the call chain has passed through a message and become an asynchronous message channel, so the subsequent processing is also asynchronous; a white background indicates synchronous processing. Generally, the time a front-end user waits does not include the time spent in the blue-background portion; it covers only the synchronous processing.

The page shown in the figure also makes clear how long each application took to process the request, which makes it very intuitive to locate slow spots. The status information deserves attention as well: if an error occurs, an exception is shown (marked in red in the figure), and by clicking the status code the user can view the details of the error.

EagleEye was launched inside Alibaba in 2013 and now supports Alibaba Group's e-commerce businesses as well as AutoNavi, Youku, and others. Technically it covers the front-end gateway access layer, the remote service call framework (RPC), message queues, databases, and other distributed components. EagleEye is also offered for private cloud deployments.

Usage scenarios

Let's take a look at the specific usage scenarios of the call chain.

Locating exceptions and latency problems

[Figure: locating a problem on the call chain via TraceId]

The TraceId can be found in the error message of the business exception log (TraceId=ac18287913742691251746923 in the figure). Entering that TraceId into the EagleEye system shows the corresponding call chain, on which it is much more intuitive to locate the problem (as shown in the figure above) and to examine it layer by layer until the faulty spot is found.

Monitoring report with call chain drill-down

[Figure: call monitoring report with drill-down]

A distributed call tracing system does more than provide call chains. Because it instruments the calls of all middleware, it can monitor everything that passes through that middleware, so detailed call monitoring reports are produced as a by-product of building the call chains. What distinguishes these reports from other monitoring is that they support drilling up and down: since the call chain is a fine-grained, low-level record, the reports can be sliced along very rich dimensions. In the call report shown above, you can see not only the status of a service but also drill down into the status of the services it calls; in addition, you can drill down from the monitoring report into the call chain itself to view the detailed chain information.

Link analysis

A link is different from a call chain: the link is a statistical concept, while a call chain records a single call. The value of link analysis lies mainly in the following points:

(1) Topology analysis: analyze sources and destinations and identify unreasonable callers;

(2) Dependency analysis: identify failure-prone points, performance bottlenecks, strong dependencies, and similar issues;

(3) Capacity estimation: estimate capacity from the call ratios along the link and the peak QPS;

(4) Anomalous instance identification: find instances in the cluster that differ from the others.

Let's look at these four points in detail.

Topology analysis: analyzing sources and destinations and identifying unreasonable callers

[Figure: global call topology]

The figure above shows the global call topology. You can clearly see the complex call relationships among different applications, and you can also view the call relationships and call frequency between a particular application and the others; the red dots in the figure indicate errors that occurred during calls.

Through this topology, an architect can clearly observe how calls flow through the system. In addition, clicking a node on the global call topology drills down to the single-application link topology shown in the next figure.

[Figure: single-application link topology]

In the single-application link topology centered on a given application, you can see the specific call relationships between that application and the applications upstream and downstream of it on the call chain.

Dependency analysis and capacity estimation

Beyond topology analysis, link analysis also supports dependency analysis, identifying failure-prone points, performance bottlenecks, strong dependencies, and similar issues, and capacity can be estimated from the call ratios along the link and the peak QPS.

[Figure: single-link report]

The figure above is a single-link report. A single-link report is formed by superimposing all call chains that enter through the same HTTP entry point, so it contains every dependency of that entry. The blurred area on the left is a call tree showing the dependencies among applications; unlike a call chain, these dependencies are statistical, so the report includes QPS and other statistical metrics. When estimating capacity, it is easy to analyze the pressure an upstream application places on its downstream services.
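As a worked example of that kind of capacity estimate (the numbers, names, and headroom factor below are illustrative, not taken from the report above):

```java
// Illustrative capacity estimate from a single-link report: if the entry
// receives entryPeakQps requests and the report shows callRatio downstream
// calls per entry request, the downstream service must sustain roughly
// entryPeakQps * callRatio, plus whatever headroom policy is applied.
public class CapacityEstimator {
    public static double requiredDownstreamQps(double entryPeakQps,
                                                double callRatio,
                                                double headroomFactor) {
        return entryPeakQps * callRatio * headroomFactor;
    }

    public static void main(String[] args) {
        // e.g. 5000 QPS at the entry, 2.5 inventory calls per request, 30% headroom
        System.out.println(requiredDownstreamQps(5000, 2.5, 1.3));  // 16250.0
    }
}
```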

On this report you can also do dependency analysis: failure-prone points can be determined from error rates, and places with strong dependencies or blocking errors are also potential failure points. Finally, time-consumption ratios can guide the related performance optimization.

Anomalous instance identification: finding instances that differ from the rest of the cluster

Identifying anomalous instances is another major use of link analysis: it makes it easy to spot instances in a cluster that behave differently from the others.

[Figure: cluster load heat map]

Within a cluster, differences in network conditions or machine configuration mean machines have different processing capacity, so some machines may sit idle while others are busy. As the heat map above shows, load in a cluster can be uneven; with the call chain it is very easy to identify the problematic machines, and anomalous instances can also be found through scatter plots, waveform charts, and similar views.

Identifying an anomalous instance is only the first step of link analysis. The key follow-up is to automatically alert the user or automatically handle the anomaly, rather than having someone manually check every report for problems.

Implementation principles

Before analyzing how the call chain is implemented, a few key concepts need to be clarified:

(1) Globally unique ID: TraceId. A TraceId corresponds to a single request and is guaranteed to be globally unique; it is used to re-associate the individual calls of a call chain. There are three candidate schemes for generating a TraceId:

Scheme 1: allocated centrally by a dedicated node.

Scheme 2: generated locally, without business semantics (e.g. a UUID).

Scheme 3: generated locally, but carrying business semantics.

Scheme 1 depends too heavily on the central node and has performance problems; scheme 2 carries no business semantics. Weighing schemes 1 and 2, we ultimately chose scheme 3.

[Figure: composition of a TraceId]

The figure shows the five components of the TraceId currently in use: the IPv4 address, the millisecond timestamp, a sequence number, a flag bit, and the process PID.
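As a concrete illustration, here is a minimal sketch of how such a locally generated TraceId might be assembled. The exact field widths, the flag value, and the class and method names are assumptions for demonstration, not EagleEye's actual implementation.

```java
import java.lang.management.ManagementFactory;
import java.net.InetAddress;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative TraceId generator: hex-encoded IPv4 + millisecond timestamp
// + per-process sequence number + flag + PID, generated locally with no
// central coordinator. The field layout is an assumption for demonstration.
public class TraceIdGenerator {
    private static final AtomicInteger SEQUENCE = new AtomicInteger(1000);
    private static final String IP_HEX = ipToHex();
    private static final String PID = ManagementFactory.getRuntimeMXBean()
            .getName().split("@")[0];  // process id portion of "pid@host"

    public static String next() {
        long millis = System.currentTimeMillis();
        // keep the sequence a fixed 4 digits; it is also usable for sampling
        int seq = SEQUENCE.getAndIncrement() % 9000 + 1000;
        String flag = "d";  // flag bit, e.g. marking the id format version
        return IP_HEX + millis + seq + flag + PID;
    }

    private static String ipToHex() {
        try {
            byte[] addr = InetAddress.getLocalHost().getAddress();
            StringBuilder sb = new StringBuilder();
            for (byte b : addr) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (Exception e) {
            return "7f000001";  // fall back to 127.0.0.1
        }
    }

    public static void main(String[] args) {
        System.out.println(TraceIdGenerator.next());
    }
}
```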

(2) The unique ID within a call chain: the RpcId (also called SpanId). It is used to reconstruct the call order and the nesting relationships between calls. The call relationships to be represented include synchronous, concurrent, asynchronous, and one-to-many calls. So how should the RpcId be implemented to express all of these relationships?

[Figure: multi-level RpcId numbering]

The scheme used inside Alibaba (labeled scheme 2 in the figure) is multi-level numbering: a generated RpcId also contains all of its ancestors' RpcIds, as illustrated above. The figure makes the concept concrete: from 0.3.1.1 you can tell that its parent is 0.3.1, whose parent in turn is 0.3, and whose grandparent is 0. Compared with single-level numbering, multi-level numbering also reduces cost.
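A small sketch of the multi-level numbering idea (the class and method names are illustrative, not EagleEye's API):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative multi-level RpcId: a child's id is its parent's id plus a
// per-parent child counter, so nesting and ordering can be recovered from
// the id alone (e.g. 0.3.1.1 -> parent 0.3.1 -> grandparent 0.3).
public class RpcId {
    private final String id;
    private final AtomicInteger childSeq = new AtomicInteger(0);

    public RpcId(String id) { this.id = id; }

    // Next child call (synchronous or asynchronous) under this span.
    public RpcId newChild() {
        return new RpcId(id + "." + childSeq.incrementAndGet());
    }

    // The parent id is simply the prefix before the last dot.
    public String parentId() {
        int idx = id.lastIndexOf('.');
        return idx < 0 ? null : id.substring(0, idx);
    }

    public String value() { return id; }

    public static void main(String[] args) {
        RpcId root = new RpcId("0");
        RpcId first = root.newChild();    // 0.1
        RpcId second = root.newChild();   // 0.2
        RpcId nested = second.newChild(); // 0.2.1
        System.out.println(nested.value() + " parent=" + nested.parentId());
    }
}
```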

Several other key concepts are also involved in the call chain:

(1) Call-level attributes, Tags: a K-V set recording information relevant only to the current call. Common tags include the call time, call latency, call type, service name, operation name, response status code, peer server address, and so on.

(2) Pass-through data, UserData (also called Baggage): a K-V set recording information shared by the upstream and downstream nodes of the same call chain. Common UserData includes general-purpose data, special instructions, and call routing controls that influence how certain middleware routes the call.

(3) Call context, RpcContext (also called SpanContext): composed of the TraceId, RpcId, Tags, and UserData together. Within a process the context is kept in a ThreadLocal, transparently to the business code; across the network it is transmitted to the peer along with the actual payload.
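A minimal sketch of what such a context might look like in-process, assuming a ThreadLocal holder and header-based propagation; all names here are illustrative rather than EagleEye's real API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative per-thread call context: TraceId + RpcId plus two K-V sets,
// Tags (local to this call) and UserData/Baggage (propagated downstream).
public class RpcContext {
    private static final ThreadLocal<RpcContext> CURRENT = new ThreadLocal<>();

    private final String traceId;
    private final String rpcId;
    private final Map<String, String> tags = new HashMap<>();      // not propagated
    private final Map<String, String> userData = new HashMap<>();  // propagated to callees

    public RpcContext(String traceId, String rpcId) {
        this.traceId = traceId;
        this.rpcId = rpcId;
    }

    public static void set(RpcContext ctx) { CURRENT.set(ctx); }
    public static RpcContext get() { return CURRENT.get(); }
    public static void clear() { CURRENT.remove(); }

    public String getTraceId() { return traceId; }
    public String getRpcId() { return rpcId; }
    public void tag(String key, String value) { tags.put(key, value); }
    public void baggage(String key, String value) { userData.put(key, value); }

    // On a network call the context is serialized into request headers;
    // the callee reconstructs it and generates a child RpcId.
    public Map<String, String> toHeaders() {
        Map<String, String> headers = new HashMap<>(userData);
        headers.put("trace-id", traceId);
        headers.put("rpc-id", rpcId);
        return headers;
    }
}
```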

Overall architecture

Now let's look at the overall architecture of distributed call tracing.

[Figure: overall EagleEye architecture]

As shown above, at the top of the architecture is the application cluster. Every machine carries middleware instrumented with EagleEye tracing points, which writes data to local log files, and a data collection agent on each machine reads those log files to collect logs in real time. Inside the EagleEye system, a real-time processing cluster computes and analyzes the real-time logs and produces two kinds of data: statistical reports (stored in HBase) and detailed call chain records (stored in HiStore). Data that requires offline analysis is computed on an ODPS offline analysis cluster, mainly for model-building work.

At the bottom of the architecture is the EagleEye console, through which the results produced by the real-time processing cluster can be viewed; the console is also responsible for pushing configuration to the real-time processing cluster and to the tracing instrumentation.

Overall processing flow

The overall data processing flow can be abstracted into four parts: instrumentation, collection, analysis, and storage. Let's look at each part in turn.

The instrumentation client

The instrumentation client must first minimize its impact on business threads and keep resource consumption low. Second, every network request produces at least one call record, so the higher the QPS, the faster data is generated and the harder it is to control cost. In addition, business teams want the instrumentation to be non-intrusive and deployable anywhere, which makes the operating environment relatively complex.

[Figure: instrumentation client output optimizations]

To meet these challenges, inside Alibaba we chose a fully self-built, log-based output scheme. To improve output performance, logs are written through an asynchronous lock-free queue (a customized Disruptor); the output volume is reduced by encoding long strings; log output is buffered to limit the number of IO operations, flushing once per second; and files are written concurrently and efficiently from multiple processes.
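The customized Disruptor used internally is not public; the following is a rough sketch of the same idea (non-blocking appends from business threads, plus a single background thread that batches writes and flushes about once per second), implemented with a plain bounded queue for simplicity.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative async trace-log writer: the business thread only enqueues a
// pre-formatted line (dropping it if the queue is full), and a background
// thread batches the writes and flushes roughly once per second.
public class AsyncTraceLogger {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(65536);
    private final BufferedWriter writer;

    public AsyncTraceLogger(String path) throws IOException {
        this.writer = new BufferedWriter(new FileWriter(path, true));
        Thread flusher = new Thread(this::drainLoop, "trace-log-flusher");
        flusher.setDaemon(true);
        flusher.start();
    }

    // Called on the business thread: never blocks, never throws.
    public void append(String line) {
        queue.offer(line);  // drop on overflow to protect the business thread
    }

    private void drainLoop() {
        List<String> batch = new ArrayList<>(1024);
        try {
            while (true) {
                String first = queue.poll(1, TimeUnit.SECONDS);
                if (first != null) {
                    batch.add(first);
                    queue.drainTo(batch, 1023);
                    for (String line : batch) {
                        writer.write(line);
                        writer.newLine();
                    }
                    batch.clear();
                }
                writer.flush();  // bounded IO: roughly one flush per second
            }
        } catch (IOException | InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```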

On the operations side there are also many optimizations: log files roll over by size and are cleaned up automatically; character encoding and time zones are unified; configuration is pushed and managed globally; and the client automatically degrades on abnormal conditions.

Because the instrumentation must be non-intrusive to the business, internal middleware is instrumented directly. For non-internal middleware such as open-source components, the first approach is to provide instrumented component packages or add instrumentation through extension points; the second is to use AOP-style techniques to enhance the bytecode at the target instrumentation points at runtime; the third is to adapt existing tracing instrumentation such as OpenTracing or Zipkin.

Call chain sampling

The higher the QPS, the more call logs have to be generated. To reduce the overall output volume, sampling is introduced: the decision is based on the sequence number inside the TraceId, and several sampling strategies can be combined:

• 100% sampling: it has an impact on the business, so built-in automatic client-side degradation is best;

• Fixed-threshold sampling: controlled uniformly, globally or per tenant;

• Rate-limited sampling: sample a fixed number of call chains per unit of time at the entry point;

• Exception-first sampling: sample preferentially when a call fails;

• Personalized sampling: enabled per user ID, entry IP, application, call chain entry point, business identifier, and other configurable criteria.

By combining these five sampling strategies, the output of call chain data can be controlled flexibly so that the volume never grows too large; a minimal sketch of such a sampling decision follows below.
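This sketch combines a stable per-chain value with exception-first sampling. EagleEye bases the decision on the sequence number embedded in the TraceId; for simplicity the sketch hashes the whole id as a stand-in stable value, and the threshold is an assumption.

```java
// Illustrative sampling decision: a stable value derived from the TraceId is
// used as a consistent dice roll, so every node on the same call chain makes
// the same keep/drop choice; calls that errored are always kept.
public class Sampler {
    private final int sampleOneIn;  // e.g. 100 keeps roughly 1% of chains

    public Sampler(int sampleOneIn) {
        this.sampleOneIn = sampleOneIn;
    }

    public boolean shouldSample(String traceId, boolean hasError) {
        if (hasError) {
            return true;  // exception-first sampling
        }
        // Fixed-threshold sampling on a stable per-chain value; the real
        // system reads the sequence number out of the TraceId instead.
        int roll = Math.floorMod(traceId.hashCode(), sampleOneIn);
        return roll == 0;
    }

    public static void main(String[] args) {
        Sampler sampler = new Sampler(100);
        System.out.println(sampler.shouldSample("7f000001171234567890123456d4321", false));
        System.out.println(sampler.shouldSample("7f000001171234567890123456d4321", true));
    }
}
```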

Data collection

[Figure: data collection paths, internal and cloud]

Data collection faces two main challenges, one internal and one on the cloud:

• Internal: inside Alibaba the environment is fairly controllable, but the data volume is enormous, hundreds of terabytes per day, so the main concern is saving cost;

• On the cloud: although the data volume is relatively small, the network environment is complex and subject to various restrictions, so collection remains difficult.

For the internal challenge, our solution is to write logs locally at the instrumentation point, have a log agent read them, and let the processing layer of the real-time computation actively pull the logs for processing. This scheme reuses the application machines' own storage for logs, and the pull model prevents excessive traffic bursts.

For the cloud challenge, our solution uses a message queue: the instrumentation layer actively sends messages, the message queue stores them, and the data processing layer subscribes to them. This scheme ensures no data is lost, and the active push improves timeliness and adapts well to varied environments; its cost, however, is relatively high.

Data analysis

[Figure: call-level and link-level analysis]

The stage after data collection is data analysis, which divides into call-level analysis and link-level analysis:

• Call-level analysis works on individual call records and can be done immediately: the records are aggregated in real time along the specified statistical dimensions, with second-level latency. When sampling (described above) is in effect, the instrumentation layer must emit a separate statistics log so that the statistics remain accurate.

• Link-level analysis is the most important analysis on the call chain. It usually uses an offline computation: call records with the same TraceId are first gathered into the same Reduce task, the call chain is reassembled, and then the analysis is performed. Strong/weak dependency analysis, call frequency analysis, and latency bottleneck analysis all belong to link-level analysis. Its difficulties lie in completing missing data, normalizing dimensions, and the explosion of statistical dimensions.
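The offline job itself runs on ODPS; as a simplified, in-memory sketch of the reassembly step, the records could be grouped by TraceId and ordered by RpcId roughly as follows (class and field names are assumptions).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative link-level reassembly: group raw call records by TraceId
// (the grouping a Reduce task effectively performs), then order each chain
// by its multi-level RpcId so the call tree can be analysed.
public class ChainAssembler {
    public record CallRecord(String traceId, String rpcId, String service, long costMs) {}

    public static Map<String, List<CallRecord>> assemble(List<CallRecord> records) {
        Map<String, List<CallRecord>> byTrace = records.stream()
                .collect(Collectors.groupingBy(CallRecord::traceId,
                        Collectors.toCollection(ArrayList::new)));
        // Lexicographic RpcId order is only an approximation (e.g. "0.10" vs
        // "0.2"); a real job would compare the numeric segments.
        byTrace.values().forEach(chain ->
                chain.sort(Comparator.comparing(CallRecord::rpcId)));
        return byTrace;
    }

    public static void main(String[] args) {
        List<CallRecord> records = List.of(
                new CallRecord("t1", "0.1", "delivery", 12),
                new CallRecord("t1", "0", "buy", 80),
                new CallRecord("t2", "0", "buy", 35));
        assemble(records).forEach((traceId, chain) ->
                System.out.println(traceId + " -> " + chain));
    }
}
```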

Data storage

[Figure: storage of time series and call chain data]

Different types of data are stored in different ways:

Time series data is stored on HBase using a modified OpenTSDB scheme that addresses duplicate submission of aggregated data, the explosion of statistical dimensions, and similar problems.

Call chain data storage has gone through three stages:

(1) Stage 1: store in HBase, with the TraceId as the row key;

(2) Stage 2: store in Hadoop/ODPS, sharded by the timestamp and hash of the TraceId and sorted by TraceId within each shard; each column of the call record is compressed in a column-specific way to save storage;

(3) Stage 3: store in HiStore with sharded databases and tables (to be released in the DRDS Platinum edition), again partitioned by the timestamp and hash of the TraceId; HiStore offers columnar storage with high compression ratios, is compatible with the MySQL ecosystem, and suits write-heavy, read-light workloads very well.
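Under the sharding just described, routing a record to its shard is cheap. A hedged sketch of such a routing function follows; the bucket size, naming scheme, and hash choice are assumptions, not the actual HiStore/ODPS layout.

```java
// Illustrative shard routing for call-chain storage: records are first
// partitioned by a coarse time bucket taken from the TraceId's timestamp,
// then spread across shards by a hash of the full TraceId.
public class ChainShardRouter {
    private final int shardCount;

    public ChainShardRouter(int shardCount) { this.shardCount = shardCount; }

    public String route(String traceId, long timestampMs) {
        long hourBucket = timestampMs / 3_600_000L;                 // time-based partition
        int shard = Math.floorMod(traceId.hashCode(), shardCount);  // hash-based shard
        return "chain_" + hourBucket + "_" + shard;                 // e.g. a table or file name
    }
}
```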

Best practices

The call chain is the core of troubleshooting: through it, all kinds of data can be linked together, greatly improving the ability to diagnose problems. Let's look at the call chain's best practice: holographic troubleshooting.

Holographic troubleshooting

[Figure: business questions mapped onto a call chain]

In real troubleshooting we often encounter questions like those in the figure above. They have clear business meaning and at first glance seem unrelated to the call chain, yet the call chain can answer them well. As shown on the right of the figure, the calls carried by nodes A through E on the chain each correspond to concrete business actions: node A handling an HTTP request might represent seller abc placing an order, while the call to B computes seller xyz's shipping fee on that route, and so on. When troubleshooting, the most valuable entry point is to start from the business question and then confirm in the call chain where the problem lies.

[Figure: looking up a call chain from a business event ID]

We can look up the call chain in reverse from a business event ID and follow it to find more upstream and downstream business information. For example, if a problem is found with a transaction order (2135897412389123), we can look up the TraceId bound to that order number. From the TraceId we can see not only the system calls but also business-related events, such as the user placing the order and the current inventory status. In other words, starting from the transaction ID we can see the transaction, inventory, payment, and related information on the call chain, which greatly speeds up troubleshooting.

[Figure: ID associations behind the three example questions]

Returning to the three questions mentioned earlier: finding which order operation caused an abnormal call is an association from TraceId to OrderId; determining whether an abnormal order was caused by some abnormal change the seller made to the shipping fee template of the item is an association from OrderId to ItemId to TemplateId and finally to TraceId; and the third question is usually an association from UserId to TraceId and then to MyBizId.

These questions and their solutions show that the key to holographic troubleshooting is the bidirectional binding between business event IDs and the TraceId/RpcId.

There are three common ways to implement this bidirectional binding (a sketch follows after the list):

(1) Put the business event ID into the call chain's Tags or UserData, establishing an association from the call chain to the business event ID;

(2) Propagate the TraceId into the database's data changes, establishing an association from the call chain to every data change;

(3) Record the TraceId, the business event ID, and related information in the business log, establishing an association between the call chain and the business event log.
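A small sketch combining options (1) and (3), reusing the illustrative RpcContext class sketched earlier; the class name, tag key, and log format are assumptions rather than a real EagleEye or ARMS API.

```java
// Illustrative bidirectional binding: the business event id is written into
// the call context so the chain can be found from the order id, and the
// TraceId is written into the business log so the order leads back to the
// chain. Uses the illustrative RpcContext sketched earlier.
public class OrderService {
    public void placeOrder(String orderId) {
        RpcContext ctx = RpcContext.get();
        if (ctx != null) {
            // (1) chain -> business id: record the order id on the call chain
            ctx.tag("order.id", orderId);
        }
        // (3) business id -> chain: record the TraceId in the business log
        String traceId = ctx != null ? ctx.getTraceId() : "-";
        System.out.printf("ORDER_CREATED orderId=%s traceId=%s%n", orderId, traceId);
    }
}
```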

Alibaba Cloud ARMS currently integrates all three of these bidirectional binding approaches, which users can easily configure in the product.

A panorama of holographic troubleshooting

[Figure: panorama of holographic troubleshooting inside Alibaba]

The figure above is a panorama of Alibaba's internal holographic troubleshooting. At its core is the back-end system that EagleEye covered first, including services, messages, and caches. At the front-end level it ties in front-end user access logs, which can be associated with the TraceId, and the same association is available on the mobile side. At the database level, the TraceId is carried through SQL statements into the database binlog, so during data replication and distribution each data-change record can easily be associated with its TraceId. In addition, business applications can print the TraceId in their own business logs and exception stacks. In this way, every component of the business layer, mobile side, front end, and data layer is associated with the TraceId, which in turn is associated with business identifiers such as the order number, user ID, item ID, logistics order number, and transaction order number. The result is a very powerful ecosystem: from one call chain you can see the related upstream and downstream orders and detailed user information; from an order you can find the related business IDs, expand from those IDs to more related IDs and even to new TraceIds, and so form a mesh of TraceId --> business ID --> new TraceId. Troubleshooting then becomes a matter of searching this mesh for the whole body of information you need.

A three-dimensional monitoring system built with EDAS + ARMS

Alibaba Cloud's EDAS, combined with ARMS, can build a three-dimensional monitoring system. EDAS focuses on application management and governance, covering links and applications, while ARMS focuses more on business operations such as e-commerce transactions, connected vehicles, and retail. Monitoring in fact needs to cover business, link, application, and system all at once, and ARMS and EDAS complement each other to form such a three-dimensional monitoring system.
