Research on Distributed Tracing Systems

Introduction

A distributed tracing system aggregates the work done by the individual components of a distributed system into one complete picture of each request.

Most large companies have built their own distributed tracing system: Google's Dapper, Twitter's zipkin, Taobao's Eagle Eye, Sina's Watchman, Jingdong's Hydra, Vipshop's Microscope, and Wowo's Tracing.

It is an important piece of infrastructure.

Application scenarios

What happens during a bid request? What if the ad slot cannot be found? What if the cookie mapping cannot be found? Look at the call stack.

What is the average bidding QPS? What is the peak QPS? How much does it fluctuate? Monitor QPS.

Why is this request so slow? Which step is the problem? Monitor latency.

The request volume on the database has suddenly increased. Where is it coming from? Call-chain analysis.

What does this operation depend on, a database or a message queue? If a redis instance goes down, which services are affected? Dependency analysis.

Architecture

(~~Stolen from~~) Inspired by Taobao's Eagle Eye (鹰眼).

The pipeline consists of in-application instrumentation, log collection, online and offline analysis, and storage and display of the results.

Instrumentation

No performance burden: it is very hard to promote something inside a company when its value is unproven and it hurts performance!

Because logs have to be written, the higher the business QPS, the heavier the performance impact. This is mitigated by sampling and asynchronous logging.
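To make the sampling and asynchronous logging idea concrete, here is a minimal sketch. The `AsyncLogger` type, its methods, and the "keep one trace in every `rate`" rule are all assumptions made for illustration, not part of any existing library. Sampling is decided per TraceID so that the spans of one request are kept or dropped together, and the business goroutine only does a non-blocking channel send.

```go
package main

import "sync"

// Span is a trimmed, hypothetical stand-in for the real span record.
type Span struct {
	TraceID int64
	Name    string
}

// AsyncLogger (hypothetical) samples by TraceID and hands spans to a
// background goroutine through a buffered channel.
type AsyncLogger struct {
	rate  int64 // keep roughly 1 out of every `rate` traces
	queue chan Span
	wg    sync.WaitGroup
}

// NewAsyncLogger starts one background goroutine that passes every queued
// span to sink (for example, a function that writes to the local daemon).
func NewAsyncLogger(rate int64, buffer int, sink func(Span)) *AsyncLogger {
	l := &AsyncLogger{rate: rate, queue: make(chan Span, buffer)}
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()
		for s := range l.queue {
			sink(s)
		}
	}()
	return l
}

// Sampled decides per trace, so all spans of a trace share the same fate.
func (l *AsyncLogger) Sampled(traceID int64) bool {
	if l.rate <= 1 {
		return true
	}
	return traceID%l.rate == 0
}

// Log never blocks the caller. It returns false when the span was not
// sampled or when the queue is full and the span had to be dropped.
func (l *AsyncLogger) Log(s Span) bool {
	if !l.Sampled(s.TraceID) {
		return false
	}
	select {
	case l.queue <- s:
		return true
	default:
		return false
	}
}

// Close drains the queue and stops the background goroutine.
func (l *AsyncLogger) Close() {
	close(l.queue)
	l.wg.Wait()
}
```

Dropping spans when the queue is full is a deliberate trade-off in this sketch: losing some trace data is preferred over slowing down the business request.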

type Span struct {
  TraceID    int64        // shared by every span of the same request
  Name       string
  ID         int64
  ParentID   int64        // ID of the calling span
  Annotation []Annotation
  Debug      bool
}

A Span represents one specific method call. It has a name and an ID, and carries a series of annotations.

type Annotation struct {
  Timestamp int64
  Value     string
  Host      Endpoint
  Duration  int32
}

A trace is the set of spans that belong to the same request.
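As an illustration of how spans end up sharing a trace, the sketch below creates a root span and derives child spans from it. The helper names (`NewRootSpan`, `NewChild`), the process-local counter used for IDs, the use of the root span's ID as the TraceID, and the convention that a root span has ParentID 0 are all assumptions for this example. A real system needs IDs that are unique across machines.

```go
package main

import "sync/atomic"

// Span mirrors the identifying fields of the struct above.
type Span struct {
	TraceID  int64
	Name     string
	ID       int64
	ParentID int64
}

// nextID is a hypothetical process-local ID source, used only to keep
// the example deterministic.
var nextID int64

func newID() int64 { return atomic.AddInt64(&nextID, 1) }

// NewRootSpan starts a new trace. By assumption the root span has
// ParentID 0 and its own ID doubles as the TraceID.
func NewRootSpan(name string) Span {
	id := newID()
	return Span{TraceID: id, Name: name, ID: id}
}

// NewChild creates the span for a downstream call made by s. It inherits
// the TraceID and records s as its parent.
func (s Span) NewChild(name string) Span {
	return Span{TraceID: s.TraceID, Name: name, ID: newID(), ParentID: s.ID}
}
```

For example, `NewRootSpan("bid").NewChild("cookie-mapping")` yields two spans with the same TraceID, where the second one's ParentID points at the first.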

Collection

A daemon on each machine collects the logs. Each business process sends its traces to the local daemon.

The daemon forwards the traces it has collected one level up.

Collectors are arranged in multiple levels, similar to a pub/sub architecture, which makes load balancing possible.
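The multi-level arrangement can be sketched as follows. The `Collector` interface, the `Forwarder` and `Sink` types, and the round-robin choice of upstream are hypothetical and chosen only for illustration. The text above does not say how load is actually balanced, and a real daemon would forward over the network rather than through an in-process call.

```go
package main

// Span is a trimmed, hypothetical stand-in for the real span record.
type Span struct {
	TraceID int64
	Name    string
}

// Collector is a hypothetical interface implemented by every level.
type Collector interface {
	Collect(batch []Span)
}

// Forwarder buffers spans and sends full batches one level up, rotating
// over its upstream collectors. It is not safe for concurrent use.
type Forwarder struct {
	upstreams []Collector
	batchSize int
	buf       []Span
	next      int
}

// Collect buffers incoming spans and flushes whenever a batch is full.
func (f *Forwarder) Collect(batch []Span) {
	for _, s := range batch {
		f.buf = append(f.buf, s)
		if len(f.buf) >= f.batchSize {
			f.Flush()
		}
	}
}

// Flush sends the buffered spans to the next upstream in round-robin order.
func (f *Forwarder) Flush() {
	if len(f.buf) == 0 || len(f.upstreams) == 0 {
		return
	}
	up := f.upstreams[f.next%len(f.upstreams)]
	f.next++
	up.Collect(f.buf)
	f.buf = nil
}

// Sink is a terminal collector that simply keeps what it receives.
type Sink struct {
	Spans []Span
}

// Collect appends the batch to the sink's storage.
func (s *Sink) Collect(batch []Span) {
	s.Spans = append(s.Spans, batch...)
}
```

Because `*Forwarder` itself satisfies `Collector`, forwarders can be chained to form as many levels as needed.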

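The aggregated data is analyzed in real time and stored for offline analysis.

Offline analysis needs all the log records of one call chain gathered together.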

Analysis

Call-chain tracing: collect the spans that share a TraceID and sort them by time to get the timeline. Link them through ParentID to get the call stack.
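The two reconstructions described above can be sketched like this. The `Start` field is a hypothetical addition standing in for the timestamp of a span's first annotation, and the sketch assumes that a root span has a ParentID that matches no span ID (here 0).

```go
package main

import "sort"

// Span carries the identifying fields plus a hypothetical Start time.
type Span struct {
	TraceID  int64
	Name     string
	ID       int64
	ParentID int64
	Start    int64 // assumed: timestamp of the span's first annotation
}

// Timeline returns the spans of one trace ordered by start time.
func Timeline(spans []Span, traceID int64) []Span {
	var out []Span
	for _, s := range spans {
		if s.TraceID == traceID {
			out = append(out, s)
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Start < out[j].Start })
	return out
}

// CallStack follows ParentID links from the given span up to the root and
// returns the span names with the root first.
func CallStack(spans []Span, spanID int64) []string {
	byID := make(map[int64]Span, len(spans))
	for _, s := range spans {
		byID[s.ID] = s
	}
	var stack []string
	seen := map[int64]bool{} // guards against cycles in corrupted data
	id := spanID
	for {
		s, ok := byID[id]
		if !ok || seen[id] {
			break
		}
		seen[id] = true
		stack = append([]string{s.Name}, stack...)
		id = s.ParentID
	}
	return stack
}
```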

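When an exception is thrown or a timeout occurs, print the TraceID in the log. The TraceID can then be used to look up the call chain and locate the problem.

Dependency metrics:

  • Strong dependency: a failed call directly aborts the main flow
  • High dependency: the probability that a trace calls this dependency is high
  • Frequent dependency: a single trace calls the same dependency many times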
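Two of these metrics can be estimated from collected call records, as in the sketch below. The `Call` and `DepStats` types and the exact formulas are assumptions for illustration: "high dependency" is taken as the share of traces that touch the dependency at least once, and "frequent dependency" as the average number of calls per trace among the traces that use it. Strong dependency is not covered, because it needs failure outcomes that are not modeled here.

```go
package main

// Call is a hypothetical flattened record: one call to a dependency
// (for example "redis" or "mysql") made within a trace.
type Call struct {
	TraceID    int64
	Dependency string
}

// DepStats holds the two estimated metrics for one dependency.
type DepStats struct {
	TraceRatio    float64 // share of traces that call it at least once
	CallsPerTrace float64 // average calls per trace that uses it
}

// DependencyStats aggregates calls per dependency. The total number of
// traces is derived from the calls themselves, so traces without any
// dependency call are not counted (an assumption of this sketch).
func DependencyStats(calls []Call) map[string]DepStats {
	traces := map[int64]bool{}
	perDep := map[string]map[int64]int{} // dependency -> TraceID -> calls
	for _, c := range calls {
		traces[c.TraceID] = true
		if perDep[c.Dependency] == nil {
			perDep[c.Dependency] = map[int64]int{}
		}
		perDep[c.Dependency][c.TraceID]++
	}
	stats := make(map[string]DepStats, len(perDep))
	for dep, byTrace := range perDep {
		total := 0
		for _, n := range byTrace {
			total += n
		}
		stats[dep] = DepStats{
			TraceRatio:    float64(len(byTrace)) / float64(len(traces)),
			CallsPerTrace: float64(total) / float64(len(byTrace)),
		}
	}
	return stats
}
```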

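Offline analysis groups records by TraceID, reconstructs the call relationships from each span's ID and ParentID, and analyzes the shape of the call chain.

Real-time analysis processes each log record directly, without aggregation or reassembly, to obtain the current QPS and latency.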
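A per-record real-time analysis might look like the sketch below. The `Record` and `RealTime` types are hypothetical, and the units are assumptions (timestamps in Unix seconds, durations in milliseconds), since the original structs do not state them. Each record only updates a counter for its second, so no records are grouped or reassembled.

```go
package main

// Record is a hypothetical single log record as seen by the analyzer.
type Record struct {
	Timestamp int64 // assumed: Unix time in seconds
	Duration  int32 // assumed: latency in milliseconds
}

// secondStats accumulates the records that fall into one second.
type secondStats struct {
	count   int
	totalMs int64
}

// RealTime keeps one small counter per second. It is not safe for
// concurrent use and never evicts old seconds; a real analyzer would
// need both.
type RealTime struct {
	buckets map[int64]*secondStats
}

// NewRealTime returns an empty analyzer.
func NewRealTime() *RealTime {
	return &RealTime{buckets: map[int64]*secondStats{}}
}

// Observe processes one record on its own, with no grouping by TraceID.
func (r *RealTime) Observe(rec Record) {
	b := r.buckets[rec.Timestamp]
	if b == nil {
		b = &secondStats{}
		r.buckets[rec.Timestamp] = b
	}
	b.count++
	b.totalMs += int64(rec.Duration)
}

// QPS returns the number of records observed in the given second.
func (r *RealTime) QPS(sec int64) int {
	if b := r.buckets[sec]; b != nil {
		return b.count
	}
	return 0
}

// AvgLatencyMs returns the mean duration in the given second, or 0 if
// nothing was observed.
func (r *RealTime) AvgLatencyMs(sec int64) float64 {
	b := r.buckets[sec]
	if b == nil {
		return 0
	}
	return float64(b.totalMs) / float64(b.count)
}
```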

Storage

Data is retained for two weeks.

Display

The data only has value if it can be read.

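Technology selection

zipkin is a complete solution, but following its getting-started guide, it would not install!

The plan is to assemble our own wheels, using Go wherever possible!

The instrumentation will definitely be written in-house. This one can serve as a reference, but watch out for its performance.

For log collection there are reportedly flume, scribe, and others. I had a look at kids, open-sourced by Zhihu: it is small and neat, and redis's pub/sub protocol works well. heka is more extensible, and real-time analysis could probably be done directly inside it.

For display, ECharts or D3.js are options if a front-end developer can help; I don't know front-end work. graphite can handle data visualization, but installing it on OS X means wrestling with a lot of dependencies!

Preliminary decision: Heka + Influxdb + Grafana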

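Outlook

The difference between tracing and monitoring.

Monitoring can be divided into system monitoring and application monitoring. System monitoring covers overall system load data such as CPU, memory, network, and disk, and can be refined down to per-process data. This kind of information can be obtained directly from the system. Application monitoring needs support from the application, which has to expose the relevant data: for example the QPS of internal requests, request-processing latency, the number of request-processing errors, message-queue length, crashes, and garbage-collection statistics. The main goal of monitoring is to detect anomalies and raise alerts promptly.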

The foundation and core of tracing is the call chain. Most of the related metrics are derived by analyzing call chains. The main goal of tracing is system analysis: finding problems ahead of time is better than fixing them after the fact.

Tracing shares much of its technology stack with application-level monitoring. Both involve data collection, analysis, storage, and display. Only the dimensions of the collected data and the analysis process differ.

Tracing is phase one, and every component evaluated in this research can be reused elsewhere. Once these wheels are in place, phase two can take on more monitoring work.

Our goal: a more complete and more powerful infrastructure!

References

  1. Dapper, Google's tracing system for large-scale distributed systems: the classic paper
  2. Twitter's zipkin: open source, written in Scala, could not get it installed
  3. Slides from a Taobao Eagle Eye technical talk: solid content!
  4. A blog post introducing Wowo's Tracing
  5. Vipshop's Microscope
  6. Slides written by a foreign author
  7. An existing instrumentation library
  8. kids: the log aggregation system open-sourced by Zhihu
  9. Slides introducing heka
  10. graphite: Scalable Realtime Graphing
  11. InfluxDB: an open source distributed time series, events and metrics database

http://www.open-open.com/lib/view/open1422236665926.html
