An In-Depth Look at CAT, an Open Source Distributed Monitoring System

CAT (Central Application Tracking) is a real-time monitoring system that aims for near-complete coverage. It focuses on Java applications and has been adopted by essentially all core applications on the Shanghai side of Meituan-Dianping. It is widely integrated into middleware frameworks (MVC, RPC, database, cache, etc.) and provides system performance metrics, health status, and monitoring alerts for Meituan-Dianping's business lines. Since it was open-sourced in 2014, CAT has also been used in production at many other Internet companies, such as Ctrip, Lufax, Liepin, and Zhaogang. The project is hosted on GitHub at dianping/cat (Central Application Tracking).

This article gives a detailed introduction to CAT's overall design and to the design ideas behind its client and server.

Background

Development of CAT started at the end of 2011, when Dianping was in the early, critical stage of migrating from .NET to Java. Dianping already had its core middleware in place, including the RPC component Pigeon and the unified configuration component Lion, and the Java migration was well on its way toward a service-oriented architecture. As service orientation deepened, the scale of Java deployments in production grew steadily, and more and more problems were exposed. Typical ones were:

  • When a large number of errors occurred, especially in core services, locating the cause took a long time.
  • Reading exception logs required permission to log in to production machines, which made troubleshooting slow.
  • Some simple mistakes were very hard to locate (once the production database was configured to point at Beta, and it took a whole night to find).
  • Many unresolved problems were blamed on the network (in hindsight, the intranet really had very few problems).

Some simple monitoring tools existed at the time (such as Zabbix and Hawk, a system developed in-house). Each tool might do well in one area, but overall service quality was uneven, extensibility was weak, and the tools were not connected to each other, so finding the root cause of a problem meant switching between several systems. Sometimes finding the root cause really came down to luck.

Around that time Wu Qimin, who had worked at eBay for more than ten years, joined Dianping as chief architect. He had a deep understanding of CAL, the very successful system used inside eBay. Against this background, we began to develop Dianping's first-generation monitoring system, CAT.

CAT's prototype and concepts come from eBay's CAL system; CAT itself was originally designed and developed by Wu Qimin while he was working at Dianping. CAT not only enhances CAL's core model but also adds richer reports.

Overall Design

The overall goals of monitoring are to detect faults quickly, locate faults quickly, and assist with performance optimization. To achieve this, we set the following non-functional requirements for the monitoring system:

  • Real-time processing: the value of information decays sharply over time, especially during incident handling.
  • Full data: the original design goal was to collect all data, which brings many benefits.
  • High availability: even when all applications are down, monitoring must still be standing and tell engineers what happened, so that they can recover from the fault and locate the problem.
  • Fault tolerance: failures in CAT itself must not affect the business. If CAT goes down, applications should be unaffected; only monitoring capability is temporarily reduced.
  • High throughput: to reconstruct what really happened, monitoring has to measure from every angle, which requires very high processing throughput.
  • Scalability: the system supports distributed, cross-IDC deployment and scales horizontally.
  • No reliability guarantee: message loss is allowed, which is a very important trade-off. The CAT server currently achieves four nines of reliability. Designing a reliable system and designing an unreliable one are very different things.

Since development began, CAT has adhered to the principle that a simple architecture is the best architecture. It is divided into three main modules: CAT-client, CAT-consumer, and CAT-home.

  • CAT-client provides the low-level SDK used to instrument business code and middleware.
  • CAT-consumer analyzes the data reported by clients in real time.
  • CAT-home is the console that presents the results to users.

In actual development and deployment, CAT-consumer and CAT-home run inside the same JVM, and every CAT server can act as a consumer, as a home, or as both. This reduces the number of layers and also improves system stability.

The figure above shows CAT's current multi-data-center architecture. As it illustrates:

  • The routing center decides which CAT server address a client reports to, based on the data center in which the application runs. Meituan-Dianping currently has data centers in three locations: Guangzhou, Beijing, and Shanghai.
  • Each data center has its own HDFS cluster for storing raw messages.
  • CAT-home can be deployed in one data center or in several. When rendering a page, the home makes cross-data-center calls to the consumers and merges all the data before showing it to the user.
  • In practice, the consumer, home, and routing center are deployed together, and every server node can play any of these roles.

Client Design

Client design is the most critical part of CAT's system design. The client must offer a simple API and be highly reliable: under no circumstances may it affect the performance of the business service, because monitoring is only a side path off the company's core business flow. CAT's primary client is the Java client; a .NET client is also supported, and clients for other languages are currently under development. The design details below are based on the Java client.

Design Architecture

When collecting data, the CAT client uses ThreadLocal, also known as thread-local storage. ThreadLocal is simple in concept: it gives every thread that uses a variable its own copy of the value, a thread-binding mechanism that is somewhat special in Java. Each thread can modify its own copy independently, without conflicting with the copies held by other threads.

In the monitoring scenario, user-facing services run in web containers such as Tomcat or Jetty, and back-end RPC servers such as Dubbo or Pigeon are likewise built on thread pools. When handling a request, business code usually calls back-end services, databases, and caches from a single thread, collects the data, applies business logic, and returns the result to the user. It is therefore a natural fit to keep all monitoring data for a request as a monitoring context in a thread-local variable.

As shown in the figure above, while business logic executes, the monitoring data for the request is kept in the thread context as a tree of monitoring objects. When the business thread finishes the request, the tree is put into an in-memory queue, and a CAT consumer thread asynchronously sends the queued data to the server.
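The flow described here can be illustrated with a minimal sketch. It is not CAT's actual client code: `MonitorNode`, `MonitorContext`, the queue capacity of 1000, and the drop-when-full policy are all assumptions made for illustration, chosen to match the stated rule that monitoring must never block business threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical node of the per-request monitoring tree; not a CAT class. */
class MonitorNode {
    final String name;
    final List<MonitorNode> children = new ArrayList<>();

    MonitorNode(String name) { this.name = name; }
}

/** Hypothetical thread-local monitoring context with an asynchronous hand-off queue. */
class MonitorContext {
    // One stack of open nodes per thread; index 0 is the root of the request tree.
    private static final ThreadLocal<List<MonitorNode>> OPEN =
            ThreadLocal.withInitial(ArrayList::new);

    // Bounded queue (capacity is an assumption): when full, trees are dropped, not blocked on.
    static final BlockingQueue<MonitorNode> QUEUE = new ArrayBlockingQueue<>(1000);

    /** Opens a node under the innermost open node of the current thread. */
    static void begin(String name) {
        List<MonitorNode> open = OPEN.get();
        MonitorNode node = new MonitorNode(name);
        if (!open.isEmpty()) {
            open.get(open.size() - 1).children.add(node);
        }
        open.add(node);
    }

    /** Closes the innermost node; returns false if the finished tree had to be dropped. */
    static boolean end() {
        List<MonitorNode> open = OPEN.get();
        if (open.isEmpty()) {
            return true; // unbalanced end(): nothing to close
        }
        MonitorNode node = open.remove(open.size() - 1);
        if (open.isEmpty()) {
            return QUEUE.offer(node); // root finished: non-blocking hand-off to the sender
        }
        return true;
    }

    /** Starts a daemon thread that drains the queue and passes each tree to the transport. */
    static Thread startSender(Consumer<MonitorNode> transport) {
        Thread sender = new Thread(() -> {
            try {
                while (true) {
                    transport.accept(QUEUE.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "monitor-sender");
        sender.setDaemon(true);
        sender.start();
        return sender;
    }
}
```

A request handler would call `MonitorContext.begin("URL")` on entry, nested calls such as `begin("SQL")` / `end()` around each back-end call, and a final `end()` on exit; only that final call touches the shared queue.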

API Design

How a monitoring API is defined usually depends on one's understanding of the monitoring and performance analysis domain. Typical scenarios are:

  • Timing a piece of code, for example how long a URL request or a SQL statement takes to execute.
  • Counting how often a piece of code runs, for example the number of exceptions thrown by a program or the number of times some logic executes.
  • Running a piece of code periodically, for example to report core indicators such as JVM memory and GC at regular intervals.
  • Key business metrics, such as order count, transaction volume, and payment success rate.

On the basis of the above domain model, CAT designs its own core monitoring objects: Transaction, Event, Heartbeat, and Metric.

A code example of a monitoring API is as follows:
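A minimal, self-contained sketch of this usage pattern follows. It deliberately does not depend on the CAT library: `MiniCat`, `OrderPage`, the metric key `order.count`, and the status string `"0"` are hypothetical stand-ins that only mimic the call shape of a transaction wrapped around business code, with an event and a metric recorded inside it. Heartbeat is omitted because it is reported periodically rather than per request.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a monitoring facade; this is not the CAT library. */
class MiniCat {
    // Collected records, kept in a plain list for demonstration (not thread-safe).
    static final List<String> LOG = new ArrayList<>();

    /** Times a piece of code and records its final status. */
    static class Transaction {
        private final String type;
        private final String name;
        private final long startNanos = System.nanoTime();
        private String status = "unset";

        Transaction(String type, String name) {
            this.type = type;
            this.name = name;
        }

        void setStatus(String status) { this.status = status; }

        void complete() {
            long micros = (System.nanoTime() - startNanos) / 1_000;
            LOG.add("T " + type + " " + name + " " + status + " " + micros + "us");
        }
    }

    static Transaction newTransaction(String type, String name) {
        return new Transaction(type, name);
    }

    /** Counts one occurrence of something, such as an exception. */
    static void logEvent(String type, String name) {
        LOG.add("E " + type + " " + name);
    }

    /** Increments a business metric by one. */
    static void logMetricForCount(String key) {
        LOG.add("M " + key + " +1");
    }
}

/** Hypothetical request handler showing the try/catch/finally shape. */
class OrderPage {
    static void handle(Runnable business) {
        MiniCat.Transaction t = MiniCat.newTransaction("URL", "/order/create");
        try {
            MiniCat.logEvent("URL.Server", "10.1.6.128");
            business.run();
            MiniCat.logMetricForCount("order.count");
            t.setStatus("0"); // "0" marks success in this sketch
        } catch (RuntimeException e) {
            MiniCat.logEvent("Exception", e.getClass().getName());
            t.setStatus(e.getClass().getName());
        } finally {
            t.complete(); // always close the transaction so the duration is recorded
        }
    }
}
```

The important part is the shape rather than the names: the transaction is completed in `finally`, so a failing request is still timed and reported with a non-success status.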

Serialization and Communication

Serialization and communication are critical to the performance of the whole client, and of the server as well.

  • CAT uses a custom serialization protocol, which is considerably more efficient than general-purpose serialization. This matters a great deal when processing large volumes of data in real time.
  • CAT communication uses Netty for NIO-based data transfer. Netty is an excellent NIO framework and is not covered in detail here.

Client Instrumentation

Log instrumentation is one of the most important parts of monitoring; the quality of what is logged determines the quality and efficiency of monitoring. CAT's instrumentation currently focuses on problems, of which a thrown exception is a typical example. My own definition of a problem is anything that does not match expectations: requests that do not complete, response times that are faster or slower than expected, request TPS that is higher or lower than expected, uneven distribution over time, and so on.

In the Internet environment, problems most typically show up where behavior crosses a boundary, including but not limited to:

  • HTTP/REST, RPC/SOA, MQ, Job, Cache, DAL;
  • Search/query engines, business applications, outsourced systems, legacy systems;
  • Third-party gateways/banks, partners/suppliers;
  • Various business metrics, such as user logins, order volume, payment status, and sales.

Problems Encountered

For a Java client used inside business applications, problems usually show up in two places: memory and CPU. Memory problems are typically leaks or heavy memory use that puts pressure on the application's GC; CPU overhead ultimately comes down to how efficient the code is.

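We once hit an extreme case. A business request computed sales for a catering merchant and its branch stores, typically by looping over all branches of the merchant, and it ended in an out-of-memory error. It turned out that the merchant was KFC, which has tens of thousands of branches, and every loop iteration made a database call. Under the normal design, the single monitoring object held in ThreadLocal therefore contained tens of thousands of nodes, which caused severe old-generation GC in the application. The lesson is that framework code cannot assume how business code will use it; you have to allow for problems under any circumstances.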

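We also hit a case on the CPU side. In one client version, CAT persisted the current auto-increment value of the message ID locally using MappedByteBuffer, a memory-mapped file class. In testing the class performed very well, and we only stored a few bytes in it, so in theory nothing should have gone wrong. In one production incident, however, many business threads were blocked on it: when the machine itself had an I/O bottleneck, this operation became very slow as well. The fix was to make that I/O operation asynchronous. The client should therefore be as asynchronous as possible: asynchronous serialization, asynchronous transmission, and asynchronous handling of any code path that may introduce latency.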

Server Design

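The server's main challenge is real-time processing of big data. The back-end CAT compute cluster currently has about 35 physical machines and the storage cluster about 35 physical machines, together processing roughly 100 TB of data per day. A single production machine handles about 110 MB/s at peak, close to saturating a gigabit network link.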

The following sections focus on some design details of the CAT server.

Architecture Design

The overall architecture diagram was given in the introduction. The rough structure of a single consumer machine is as follows:

As the figure above shows, the CAT server is almost entirely asynchronous throughout its real-time processing.

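  • Messages are received over NIO using Netty.
  • When a message reaches the server it is placed in an in-memory queue; a dedicated thread then consumes the queue and dispatches the messages.
  • Every message is handled by a group of threads that concurrently consume from their own queues, which keeps the different kinds of message processing isolated from one another.
  • Messages are first written to local disk and then uploaded to HDFS asynchronously, which avoids a hard dependency on HDFS.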

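When a report analyzer cannot keep up, for example when Transaction report processing is slow, it can be configured to run several Transaction processing threads that consume messages concurrently.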
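The dispatch step can be sketched as a fan-out into one bounded queue per analyzer. The classes `AnalyzerQueue` and `Dispatcher`, the use of `String` as the message type, and the choice to drop and count messages when a queue is full are assumptions for illustration; the last one follows from the earlier design decision that message loss is acceptable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded queue owned by one report analyzer; not CAT's server code. */
class AnalyzerQueue {
    final String name;
    final BlockingQueue<String> queue;
    final AtomicLong dropped = new AtomicLong();

    AnalyzerQueue(String name, int capacity) {
        this.name = name;
        this.queue = new ArrayBlockingQueue<>(capacity);
    }
}

/** Hypothetical dispatcher that copies each incoming message to every analyzer queue. */
class Dispatcher {
    private final List<AnalyzerQueue> targets = new ArrayList<>();

    void register(AnalyzerQueue target) {
        targets.add(target);
    }

    /** Fans one message out to all analyzers without ever blocking the dispatch thread. */
    void dispatch(String message) {
        for (AnalyzerQueue target : targets) {
            if (!target.queue.offer(message)) {
                // A slow analyzer loses data, but it cannot stall the other analyzers.
                target.dropped.incrementAndGet();
            }
        }
    }
}
```

Each queue would be drained by its own worker thread calling `queue.take()`. Running several workers against the same queue is one way to model the configuration option mentioned above, although the analyzer's report model then has to be partitioned per worker or merged afterwards.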

Real-Time Analysis

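Real-time report analysis on the CAT server is the core of the whole monitoring system. What CAT collects from clients is the raw logview, currently about 100 billion messages per day. That is far too many raw messages to work with directly, so rich reports have to be built on top of them to support business troubleshooting and performance analysis.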

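CAT is tailor-made for the characteristics of log messages (such as being read-only) and for the problem scenarios it targets. It partitions all reports by message creation time into one-hour shards, so each hour produces one report. All computation for the current hour's report happens in memory, and every user request for a live report returns the latest real-time result. Historical reports never change, so whether they are real-time does not matter.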

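Almost all CAT report models can be computed incrementally. The computations fall into three kinds: counting, timing, and relationship processing. Counting can be further divided into arithmetic counting and set counting. Typical arithmetic counts include total count (count), sum, mean (avg), maximum/minimum (max/min), throughput (tps), and standard deviation (std). Most of these are straightforward; standard deviation is slightly more involved, and readers can work out the incremental computation for themselves. Set operations, such as the 95th percentile line (the time within which 95% of requests complete) and the 99.9th percentile line (the time within which 99.9% of requests complete), are somewhat more complex and cost more system resources.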
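One possible way to compute these statistics incrementally is sketched below. It is an illustration, not CAT's implementation: the class `DurationStats`, the use of running sums for the standard deviation, and the one-millisecond histogram buckets for percentile lines are all assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical incremental statistics over request durations in milliseconds. */
class DurationStats {
    long count;
    double sum;
    double sumOfSquares;
    double min = Double.MAX_VALUE;
    double max = 0;

    // Set-style statistic: number of requests per whole-millisecond bucket.
    final TreeMap<Integer, Long> histogram = new TreeMap<>();

    /** Folds one observation into every statistic in constant or logarithmic time. */
    void add(double millis) {
        count++;
        sum += millis;
        sumOfSquares += millis * millis;
        min = Math.min(min, millis);
        max = Math.max(max, millis);
        histogram.merge((int) millis, 1L, Long::sum);
    }

    double avg() {
        return count == 0 ? 0 : sum / count;
    }

    /** Population standard deviation from the running sums: sqrt(E[x^2] - E[x]^2). */
    double std() {
        if (count == 0) {
            return 0;
        }
        double mean = avg();
        return Math.sqrt(Math.max(0, sumOfSquares / count - mean * mean));
    }

    /** Smallest bucket such that at least the given percentage of requests is at or below it. */
    int percentile(double percent) {
        long needed = (long) Math.ceil(count * percent / 100.0);
        long seen = 0;
        for (Map.Entry<Integer, Long> bucket : histogram.entrySet()) {
            seen += bucket.getValue();
            if (seen >= needed) {
                return bucket.getKey();
            }
        }
        return 0;
    }
}
```

The arithmetic statistics need only a handful of fields, while the percentile lines need a whole histogram, which is why they cost more. The running-sums formula can lose precision when values are large and close together; Welford's algorithm is a common alternative when that matters.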

Report Modeling

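Each CAT report usually has several dimensions. The transaction report, for example, has five: application, machine, Type, Name, and minute-level distribution. Modeling every combination of dimensions would be flexible but extremely expensive. CAT instead models fixed dimensions, which can be thought of as organizing the five dimensions into a tree of depth 5 that is always accessed from the root, descending one level at a time.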

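The CAT server assigns a separate thread to each report, so there is no lock contention. All report models are non-thread-safe and their data is mutable. The benefit is simplicity and low overhead.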

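CAT report models are generated automatically by a Maven plugin developed in-house. All reports can be merged and trimmed, so two or more reports can easily be combined into one. The report-processing code makes heavy use of the visitor pattern.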
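A fixed-dimension, mergeable report can be sketched with nested maps. This is a hand-written illustration rather than the generated model: the class `TransactionReport` is hypothetical, it keeps only a call count at the leaves, and it leaves out the minute-level distribution to stay short.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical fixed-dimension report: domain -> machine -> type -> name -> call count. */
class TransactionReport {
    final String domain;

    // Deliberately not thread-safe: the sketch assumes a single owning thread per report.
    final Map<String, Map<String, Map<String, Long>>> machines = new HashMap<>();

    TransactionReport(String domain) {
        this.domain = domain;
    }

    /** Descends from the root one level at a time, creating missing levels on the way. */
    void record(String machine, String type, String name, long count) {
        machines.computeIfAbsent(machine, k -> new HashMap<>())
                .computeIfAbsent(type, k -> new HashMap<>())
                .merge(name, count, Long::sum);
    }

    long count(String machine, String type, String name) {
        Map<String, Map<String, Long>> types = machines.get(machine);
        if (types == null) {
            return 0;
        }
        Map<String, Long> names = types.get(type);
        if (names == null) {
            return 0;
        }
        return names.getOrDefault(name, 0L);
    }

    /** Walks the other report from the root and folds every leaf into this one. */
    void merge(TransactionReport other) {
        other.machines.forEach((machine, types) ->
                types.forEach((type, names) ->
                        names.forEach((name, count) -> record(machine, type, name, count))));
    }
}
```

Because every leaf value is additive, merging is a plain tree walk. The same operation can combine two hourly reports into a longer period or combine the reports of several machines.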

Performance Analysis Reports

Fault Detection Reports

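  • Real-time business metric monitoring: every core business defines its own business metrics. There do not need to be many; they are mainly used for 24-hour on-call monitoring to detect business metric problems in real time. In the chart, one line is the current actual value and the other is the baseline, a predicted value computed from historical trends. The figure below captures one such incident, in which a failure in the payment business is plainly visible.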

  • System error dashboard.

  • Real-time database dashboard, service dashboard, cache dashboard, and so on.

Storage Design

CAT's storage has two main parts:

  • Storage of CAT reports.
  • Storage of CAT's raw logview.

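Reports are computed from the logview in real time for business analysis. By default there are hourly, daily, weekly, and monthly reports. CAT's real-time processing always produces hourly statistics, and an hourly report contains statistics down to minute granularity at the finest. Daily, weekly, and monthly reports are all produced by merging hourly reports.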

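Raw logview storage amounts to about 100 TB per day. Because the volume is so large, the data must be compressed, and because raw logviews have to be read by Message-ID, the overall storage requirement is batch compression plus random reads. At the time there was no mature system well suited to this combination, so we developed a file-based store for CAT's use case. Storage has always been the hardest problem, and we continue to improve and optimize it.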

Message ID Design

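Every CAT message has a unique ID that is generated on the client and later used to look up the message content. Take the typical problem of stitching RPC messages together: when A calls B, A generates a Message-ID and passes it to B as part of the call; while B executes, it uses the Message-ID carried in the context as the Message-ID of its current monitoring message.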

A CAT Message-ID looks like ShopWeb-0a010680-375030-2 and consists of four segments:

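  • The first segment is the application name, ShopWeb in this example.
  • The second segment is the IP address of the current machine in hexadecimal; 0a010680 represents 10.1.6.128.
  • The third segment, 375030, is the current system time divided by one hour, that is, the number of whole hours since the epoch.
  • The fourth segment, 2, is the sequence number of this client within the current hour, which increases monotonically.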
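The four-segment layout can be made concrete with a small sketch that builds and parses such an ID. `MessageIds` is a hypothetical helper written for this illustration, not CAT's own ID factory, and it assumes IPv4 addresses and millisecond timestamps.

```java
/** Hypothetical helper for the four-segment ID layout described above; not a CAT class. */
class MessageIds {
    /** Builds app-hexIp-hour-sequence from its parts. */
    static String build(String app, String ipv4, long epochMillis, int sequence) {
        StringBuilder hex = new StringBuilder();
        for (String octet : ipv4.split("\\.")) {
            hex.append(String.format("%02x", Integer.parseInt(octet)));
        }
        long hour = epochMillis / 3_600_000L; // whole hours since the Unix epoch
        return app + "-" + hex + "-" + hour + "-" + sequence;
    }

    /**
     * Returns {app, dotted IPv4, hour, sequence}.
     * Splits from the right so that the application name may itself contain '-'.
     */
    static String[] parse(String id) {
        int p3 = id.lastIndexOf('-');
        int p2 = id.lastIndexOf('-', p3 - 1);
        int p1 = id.lastIndexOf('-', p2 - 1);
        String hex = id.substring(p1 + 1, p2);
        StringBuilder ip = new StringBuilder();
        for (int i = 0; i < 8; i += 2) {
            if (i > 0) {
                ip.append('.');
            }
            ip.append(Integer.parseInt(hex.substring(i, i + 2), 16));
        }
        return new String[] {
            id.substring(0, p1), ip.toString(), id.substring(p2 + 1, p3), id.substring(p3 + 1)
        };
    }
}
```

For example, `MessageIds.parse("ShopWeb-0a010680-375030-2")` yields the application ShopWeb, the address 10.1.6.128, the hour 375030, and the sequence number 2.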

Storage Data Design

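Message storage is the most challenging part of CAT. The key difficulty is that messages are both numerous and large: Meituan-Dianping currently processes about 100 billion messages per day, roughly 100 TB in size, and a single physical machine handles about 100 MB of traffic per second at peak. The CAT server performs real-time computation on this traffic and must also compress the data and write it to disk.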

The overall storage structure is shown in the figure below:

When writing data, CAT writes two files: an index file and a data file.

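  • The data file is GZIP-compressed in segments. Each segment is smaller than 64 KB, so an offset within a segment fits in 16 bits.
  • Each Message-ID needs 48 bits of index. The position of the index entry is determined by the fourth segment of the Message-ID; for example, for the Message-ID ShopWeb-0a010680-375030-2 the index entry sits at offset 2 * 48 bits.
  • Of the 48 bits, the first 32 bits store the offset of the block in the data file, and the last 16 bits store the offset inside the block after decompression.
  • To read a message, CAT first uses the first three segments of the Message-ID to identify the unique index file, then uses the fourth segment to find the index position for that Message-ID. It reads the 48-bit entry from the index file, reads the corresponding block from the data file, decompresses it with GZIP, and finally uses the in-block offset to read the actual message content.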
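The 48-bit index entry can be sketched as follows. `IndexEntry` is a hypothetical class, and the big-endian byte order is an assumption; the source only specifies the 32-bit plus 16-bit split and that the entry position is derived from the sequence number.

```java
import java.nio.ByteBuffer;

/** Hypothetical 48-bit index entry: 32-bit block offset followed by 16-bit in-block offset. */
class IndexEntry {
    static final int ENTRY_BYTES = 6; // 48 bits

    final long blockOffset;  // unsigned 32-bit offset of the compressed block in the data file
    final int offsetInBlock; // unsigned 16-bit offset inside the decompressed block

    IndexEntry(long blockOffset, int offsetInBlock) {
        if (blockOffset < 0 || blockOffset > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("blockOffset does not fit in 32 bits");
        }
        if (offsetInBlock < 0 || offsetInBlock > 0xFFFF) {
            throw new IllegalArgumentException("offsetInBlock does not fit in 16 bits");
        }
        this.blockOffset = blockOffset;
        this.offsetInBlock = offsetInBlock;
    }

    /** Byte position in the index file for a sequence number (fourth Message-ID segment). */
    static long position(int sequence) {
        return (long) sequence * ENTRY_BYTES;
    }

    /** Serializes the entry to 6 bytes (big-endian, which is ByteBuffer's default order). */
    byte[] encode() {
        return ByteBuffer.allocate(ENTRY_BYTES)
                .putInt((int) blockOffset)
                .putShort((short) offsetInBlock)
                .array();
    }

    /** Reads an entry back, masking so that both fields are treated as unsigned. */
    static IndexEntry decode(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        long block = buffer.getInt() & 0xFFFFFFFFL;
        int inBlock = buffer.getShort() & 0xFFFF;
        return new IndexEntry(block, inBlock);
    }
}
```

With fixed-size entries, finding the entry for sequence number 2 is a single seek to byte 12 of the index file, with no search involved. The reader then seeks to `blockOffset` in the data file, decompresses that one block, and starts reading the message at `offsetInBlock`.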

Server Design Summary

CAT's distributed real-time capability comes down to the following factors:

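  • Decentralization, with data processed in partitions.
  • Exploiting the read-only nature of logs: with a one-hour time window, real-time reports are modeled and analyzed in memory, and historical reports are produced by aggregation.
  • In-memory queues, with processing that is fully asynchronous, single-threaded, and lock-free.
  • Global message IDs, with data produced locally and stored centrally.
  • Componentization and a service-oriented approach.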

Summary and Reflections

Finally, let us spend a little time on some of the things we did in practice.

1. The MVP version. The demo took one month and the MVP took three months.

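Why emphasize the MVP? Because a project like this needs support from management and from the business teams. Around 2011 our whole production environment had roughly a thousand machines (virtual machines). Whenever something went wrong, people went to operations to read logs, and everyone knows how painful that is; meanwhile an error in a core service on a single machine could cascade into further problems. We built the MVP to solve this, with roughly two features: first, real-time visibility into the traffic and success rate of every API; second, real-time access to exception logs on the CAT platform. The point is that an MVP should not try to do too much, but a product has to start from one, with a few representative, eye-catching features that win people's support.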

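2. Data quality. Data quality is critical to the whole monitoring system because it determines the quality of the final monitoring reports. We therefore integrate deeply with the database framework, cache framework, RPC framework, web framework, and so on, so that business teams can collect and view this data easily.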

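3. A single-machine development environment. We consider this the most important factor in improving development efficiency for the whole project. A single-machine development environment means that you can start every part of your project on one machine. If everything has to start on a single machine, you are forced to work out what to do when a service you depend on is down; CAT, for example, depends on HDFS. Besides greatly improving development efficiency, a single-machine environment therefore also improves the reliability of the project as a whole.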

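4. The hardest part is driving adoption. The CAT project had only two or three people. During the day we supported business teams going live and ran training, and only at night could we write code. But as the product matured and business usage grew, a good product settled into a virtuous cycle and promotion became much easier.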

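5. An open ecosystem. The larger the company, the more monitoring requirements and report requirements there are. At Meituan-Dianping, for instance, the products have many reports, the technical stack has a great many custom reports, and many business teams bring their own requirements. It is not that these requirements cannot be supported; the work simply can never be finished. In the end we decided to expose all the data in the CAT system through APIs. Many downstream systems inside Meituan-Dianping now depend on CAT data for further report presentation.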

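The CAT project started in 2011. Today the production environment has about three thousand applications, and the number of monitored servers has grown from zero to a few thousand and then to more than twenty thousand. Looking back, the project has been running for more than five years, yet even after five years there is still a great deal left to build.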
