Introduction to Graphite, an open source system for monitoring service internals

Among open source monitoring systems, the best known are Nagios and Cacti. Our company uses Nagios to monitor server and service status in operations, combined with plug-ins that provide email and SMS alerts. Cacti monitors servers through the SNMP protocol and uses RRDTool to draw attractive reports for performance analysis.

These are powerful tools for operations staff, but service developers rarely use them, because they make it hard to monitor the internal running state of the services we develop. If you want to monitor the response time of your own service and chart it every five minutes, or track your service's internal cache hit rate over time, these tools are of little help.

Common approaches

To meet such needs, developers often build a monitoring system themselves: the service regularly sends its internal state to a monitoring server, the states are stored in a database, and the developers then produce the reports on their own. If the service you want to monitor is a cluster, you also need to solve the problem of aggregating the monitoring data.
Another common solution is to print the status data to log files, collect these logs regularly, and then run analysis jobs on the collected results (some teams load them into Hadoop and run MapReduce jobs). Monitoring done this way is far from real time.
To avoid this workload, we have tried and used two open source systems that can monitor internal service data and produce excellent reports: Graphite and Ganglia.

Graphite

What impressed me most about Graphite was how easy it is to use. Cacti uses the SNMP protocol, which means that you need to install an SNMP agent on each monitored node; Ganglia likewise requires you to install gmond on each monitored node to collect information. Graphite uses a simple text protocol: you simply send lines of text to the Graphite server over a TCP socket, for example:

quentinxxz.server.count 1234 1440245016

Here quentinxxz.server.count is the name of a specific metric, 1234 is the value of the metric at that moment, and 1440245016 is the timestamp at which the data was generated. You can then see the corresponding curve in the Graphite web interface.
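As a minimal sketch of the protocol described above, the following Python builds such a line and sends it over TCP. The host name and the helper names format_metric and send_metric are assumptions for illustration; port 2003 is the default port of carbon's plaintext listener, so adjust it if your installation differs.

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    # One protocol line has the form: <name> <value> <timestamp>, ended by a newline.
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the Unix epoch
    return f"{name} {value} {int(timestamp)}\n"


def send_metric(name, value, timestamp=None, host="localhost", port=2003):
    # host and port are assumptions; point them at your own carbon instance.
    line = format_metric(name, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Show the line that would be sent, without opening a connection.
print(format_metric("quentinxxz.server.count", 1234, 1440245016), end="")
```

Calling send_metric("quentinxxz.server.count", 1234) from inside a service sends the current value with the current time as the timestamp.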
Graphite is implemented in Python and consists mainly of three parts:

1. whisper

Whisper is a fixed-size database, similar in design to RRD (round-robin database). Fixed-size means that each whisper data file is created at its full size up front and does not grow afterwards.
For example, we configure the quentinxxz.server.count metric as follows in the /opt/graphite/conf/storage-schemas.conf file:

[quentinxxz]
pattern = quentinxxz.server.count
retentions = 1min:50d,10min:50d

Here 1min means that one point is recorded per minute, and 50d means that the points are kept for 50 days. The archive created for this precision therefore has to hold 1 * 60 * 24 * 50 = 72,000 points.
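The arithmetic can be checked with a short Python sketch. The parser below is a simplified illustration that only understands the unit spellings used in this article; it is not whisper's own parsing code, and the function names are hypothetical.

```python
# Seconds per unit, limited to the spellings used in the configuration above.
UNIT_SECONDS = {"s": 1, "min": 60, "h": 3600, "d": 86400}


def to_seconds(text):
    # Convert a string such as "1min" or "50d" to a number of seconds.
    digits = "".join(ch for ch in text if ch.isdigit())
    unit = text[len(digits):]
    return int(digits) * UNIT_SECONDS[unit]


def points_per_archive(retentions):
    # Return (precision in seconds, number of points) for each archive.
    result = []
    for archive in retentions.split(","):
        precision, duration = archive.strip().split(":")
        step = to_seconds(precision)
        result.append((step, to_seconds(duration) // step))
    return result


print(points_per_archive("1min:50d,10min:50d"))
# prints [(60, 72000), (600, 7200)]
```

The 10min archive covers the same 50 days with one tenth as many points.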
Another interesting aspect of whisper is its powerful aggregation. In the configuration above, 10min is a second, coarser precision. Whisper applies the aggregation method we specify (for example the maximum, minimum, or average of 10 points) and stores the result in a separate archive in which each point covers 10 minutes. The aggregation method is configured in the /opt/graphite/conf/storage-aggregation.conf file.
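To make the idea concrete, here is a small Python sketch that rolls ten 1-minute values up into one 10-minute value. It only illustrates the concept: the function name rollup and the sample values are made up, and the sketch ignores details such as missing points that a real whisper file has to handle.

```python
def rollup(points, factor, method="average"):
    # Aggregate each complete group of `factor` consecutive values into one value.
    methods = {
        "average": lambda xs: sum(xs) / len(xs),
        "max": max,
        "min": min,
        "sum": sum,
    }
    aggregate = methods[method]
    return [
        aggregate(points[i:i + factor])
        for i in range(0, len(points) - factor + 1, factor)
    ]


one_minute = [3, 5, 4, 6, 2, 8, 7, 5, 4, 6,   # first 10 minutes
              1, 1, 2, 2, 3, 3, 4, 4, 5, 5]   # next 10 minutes

print(rollup(one_minute, 10, "average"))  # prints [5.0, 3.0]
print(rollup(one_minute, 10, "max"))      # prints [8, 5]
```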
In addition, RRD does not accept updates timestamped earlier than its most recent update, while whisper does (although there seems to be little demand for this). For more comparisons between RRD and whisper, see http://graphite.wikidot.com/whisper

2. carbon (a Twisted daemon that listens for data)

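Carbon is built on Twisted and is Graphite's backend.
Carbon's main job is to accept connections from the monitored nodes, collect the data for each metric, write that data into a cache, and eventually persist it to whisper storage files. Carbon makes sure that Graphite web can plot metric updates as soon as they arrive. The principle is simple and somewhat similar to Lucene: data received by carbon is first held in a cache and then written to whisper's on-disk storage in batches. Graphite web sends requests to carbon-cache, so a query covers both the data still in the cache and the data already on disk.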
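The cache-then-flush behaviour described above can be modelled with a toy Python class. This is a sketch of the principle only, not carbon's actual code: the class name ToyMetricStore and its methods are hypothetical, and a plain dictionary stands in for the whisper files on disk.

```python
class ToyMetricStore:
    """Toy model of a write-behind cache in front of persistent storage."""

    def __init__(self):
        self.cache = {}  # metric name -> {timestamp: value}, not yet flushed
        self.disk = {}   # metric name -> {timestamp: value}, stands in for whisper files

    def receive(self, name, value, timestamp):
        # A new data point only reaches the in-memory cache at first.
        self.cache.setdefault(name, {})[timestamp] = value

    def flush(self):
        # Write all cached points to "disk" in one batch, then empty the cache.
        for name, points in self.cache.items():
            self.disk.setdefault(name, {}).update(points)
        self.cache.clear()

    def query(self, name):
        # Merge flushed and cached points so the newest data is visible.
        merged = dict(self.disk.get(name, {}))
        merged.update(self.cache.get(name, {}))  # cached points win on conflict
        return sorted(merged.items())


store = ToyMetricStore()
store.receive("quentinxxz.server.count", 1234, 1440245016)
store.flush()                                               # first point is now on "disk"
store.receive("quentinxxz.server.count", 1250, 1440245076)  # second point is only cached
print(store.query("quentinxxz.server.count"))
# prints [(1440245016, 1234), (1440245076, 1250)]
```

The query returns both points even though only the first one has been flushed, which is why a freshly received value can appear on a graph before it is written to disk.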

3. graphite-web

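Graphite web is a web application built on Django, and its main job is naturally to draw and display reports. I do not recommend using Graphite web directly: although it is fairly powerful, I find its interface terribly ugly. Instead I recommend Tessera, a third-party open source front end for Graphite.
Using Tessera still requires Graphite web to be installed, because Tessera requests its data directly from Graphite web. Tessera's interface is quite slick and suits a technical person's taste, which was an important plus that drew me to Graphite.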
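In addition, its flexible configuration lets us freely compose our own dashboards. Enough said; here is a screenshot.

[Screenshot: a Tessera dashboard]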

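Summary of using Graphite

In my own experience, the main advantages of Graphite combined with Tessera are a clean, attractive interface and a simple transport protocol. Its drawback shows up when your application is a large cluster: as far as I can tell, Graphite currently cannot consolidate the data coming from the different servers in the cluster for you. For example, to monitor the cache hit rate of a cluster with 10 search nodes, you need 10 different metric names (usually distinguished by including the host name), which means 10 separate curves. You cannot directly use Graphite to combine them into a single metric or curve that shows the cache hit rate of the search cluster as a whole.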
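One way to get a cluster-wide figure, sketched here in Python, is to combine the per-host values in your own code. The host names search01 and search02, the sample numbers, and the function names are all hypothetical, and the sketch assumes each host reports cache hits and total requests under the same timestamps.

```python
def sum_series(series_by_host):
    # Add per-host series point by point; each series maps timestamp -> value.
    total = {}
    for series in series_by_host.values():
        for timestamp, value in series.items():
            total[timestamp] = total.get(timestamp, 0) + value
    return total


def cluster_hit_rate(hits_by_host, requests_by_host):
    # Cluster-wide hit rate per timestamp: total hits divided by total requests.
    hits = sum_series(hits_by_host)
    requests = sum_series(requests_by_host)
    return {
        ts: hits.get(ts, 0) / requests[ts]
        for ts in sorted(requests)
        if requests[ts] > 0
    }


hits = {
    "search01": {1440245016: 80, 1440245076: 90},
    "search02": {1440245016: 40, 1440245076: 30},
}
requests = {
    "search01": {1440245016: 100, 1440245076: 100},
    "search02": {1440245016: 100, 1440245076: 50},
}
print(cluster_hit_rate(hits, requests))
# prints {1440245016: 0.6, 1440245076: 0.8}
```

The combined values could then be sent to Graphite under a single cluster-level metric name, so that one curve shows the whole cluster.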
