Docker Monitoring Systems (Including Kubernetes and Mesos Monitoring)

PS: The monitoring system is the most important link in the entire operations chain, and arguably in the entire product life cycle.

The value of monitoring and the monitoring system

Monitoring is a very important part of operations. It lets you follow the state of a running system in real time, get early warning of faults, and play back historical state. Monitoring data can also support capacity-planning decisions and supply real user behavior and experience data for performance optimization.

With the rapid growth of Internet services in recent years, users expect more and more from systems. Good monitoring safeguards the system and effectively improves its reliability, availability, and user experience. The value of monitoring shows mainly in the following points:

1. Cost saving

Failures are unavoidable in a production environment. If accurate monitoring and early warning detect anomalies in advance, the problem can be fixed or an emergency plan carried out before the failure occurs, reducing the economic loss it would cause. Monitoring resource usage also makes it possible to reallocate idle resources and save costs.

2. Improve efficiency

Operations staff used to act as firefighters, and without data on how the system was running, troubleshooting was slow. With monitoring, the historical state of the system can be played back and the scene of a failure preserved. When the system needs to be analyzed, monitoring data can be pulled directly to generate trend graphs that show clearly what is wrong, such as a sudden increase in connections, abnormal memory reclamation, or abnormal database connections. This helps locate the cause of a fault quickly and resolve it.

3. Improve quality

Analyzing historical monitoring data reveals a system's performance bottlenecks so that they can be optimized in time, improving the quality of its operation. End-to-end optimization can address both infrastructure performance and application performance, effectively improving user experience.

However, it is not easy to monitor well. A good monitoring system needs to do the following:

[Figure: requirements of a good monitoring system]

A complete monitoring system

An enterprise's monitoring system includes the following components:

  1. Timely and accurate collection of monitoring data
  2. Collection, storage, and archiving of monitoring data
  3. Graphical display of monitoring data
  4. Automated analysis of monitoring data and linked follow-up processing
  5. Alerting and automated handling of alerts
  6. Safety controls for the monitoring tools themselves
  7. Alert response and tracking

[Figure: overview of a complete monitoring system]

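As the figure above shows, a complete monitoring system starts with collecting monitoring data, then stores and processes that data to generate value: for example, automatically generating analysis reports, displaying data graphically, using monitoring-driven linkage to handle events such as high availability, scaling, and rate limiting, and raising alerts. For operations staff, nothing is more pleasing than receiving an alert SMS that is immediately followed by an automatic-recovery SMS, so automated handling of fault alerts is also very important in a monitoring system.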
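
The alert-then-recovery behavior described above can be sketched roughly as follows. This is a minimal illustration, not part of any particular monitoring product: the class name `ThresholdAlert`, the event strings, and the single fixed threshold are all assumptions made for the example.

```python
from typing import Iterable, List, Optional


class ThresholdAlert:
    """Emits one event when a metric crosses a threshold and one when it recovers."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.firing = False  # True while the alert is active

    def observe(self, value: float) -> Optional[str]:
        # Fire only on the transition into the bad state, not on every sample.
        if value > self.threshold and not self.firing:
            self.firing = True
            return "ALERT"
        # Report recovery only on the transition back to a normal value.
        if value <= self.threshold and self.firing:
            self.firing = False
            return "RECOVERED"
        return None


def notifications(values: Iterable[float], threshold: float) -> List[str]:
    """Returns the sequence of events produced by a series of samples."""
    alert = ThresholdAlert(threshold)
    events = []
    for value in values:
        event = alert.observe(value)
        if event is not None:
            events.append(event)
    return events
```

In a real system the "ALERT" event would typically also start an automated remediation step, and the "RECOVERED" event would be sent once that step has brought the metric back to normal.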

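Monitoring data is currently collected in the following ways:

  • Active output

Instrumentation points are added to the application in advance, and the application reports its own data. For example, an application system can write its business status to its logs so that it can be collected.

  • Remote access

The application's status is obtained by calling an interface of the application process. For example, JMX can be used to connect to a Java process and collect its status.

  • Embedded

The application's status is obtained by running an agent inside the process. Current APM products, for example, collect data by embedding the monitoring tool in the application.

  • Bypass

Data is collected from outside the system, for example by probing a website URL, simulating business messages, pinging servers, or monitoring traffic. Traffic can be mirrored from one switch port to another and processed there, which is completely non-intrusive to the business system.

  • Intrusive

Unlike the embedded mode, an intrusive agent runs as an independent process on the host rather than inside the application process. This is the approach most commonly used by monitoring tools today. Zabbix, for example, runs a process on the host to collect the relevant data.

  • CLI

The command line is the most basic approach: for example, writing shell scripts on a Linux system that use top, vmstat, and netstat to collect data, then storing the data in text files for processing.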
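
As an illustration of the active output mode from the list above, the sketch below parses status lines that an application writes to its log. The log format (a timestamp, the marker `STATUS`, then `key=value` pairs) is hypothetical; the article does not prescribe one.

```python
import re
from typing import Dict

# Hypothetical log line format, assumed for this example only:
#   2016-08-19T09:12:37 STATUS orders_pending=12 db_connections=8
STATUS_PATTERN = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)")


def parse_status_line(line: str) -> Dict[str, float]:
    """Extracts numeric key=value pairs from a STATUS log line.

    Lines without the STATUS marker are ignored and yield an empty dict.
    """
    if " STATUS " not in line:
        return {}
    _, _, payload = line.partition(" STATUS ")
    return {key: float(value) for key, value in STATUS_PATTERN.findall(payload)}
```

A collector would run this over new log lines as they arrive and forward the resulting values to the monitoring backend.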

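Different collection methods suit different scenarios. To see a host's CPU usage in real time, for example, you can log in to the host and run top from the command line to view both system-wide and per-process CPU usage. If the status of an application needs to be recorded, the corresponding data must be written to its logs. Three issues need attention when collecting data:

1. Collection interval

How often monitoring data is collected from the application platform is very important, because it determines how timely and useful the data is. Different intervals are set for different monitored objects. For example, logs and instance status are monitored in real time, while status data used for later analysis is collected at longer intervals, around 3-5 minutes.

2. Safety controls for the monitoring tool itself

Some monitoring tools run continuously, intrusive ones in particular, and if they misbehave they can cause failures themselves. For example, an execution error may leave resources unreleased and drive CPU usage up; a process that exits abnormally may be restarted over and over; or a log level set too low may produce a flood of log output that degrades process performance and consumes large amounts of disk space. Monitoring tools must therefore be able to keep themselves under control, and a monitoring tool must go through a trial run in a test environment before it is deployed to production.

3. Triggered data collection

On-the-spot data at the point of an anomaly also needs to be collected, such as thread dumps, heap dumps, and host performance data. This data is lost after a restart, and when a fault cannot be reproduced it becomes essential for analysis. Such data should be collected on a trigger: collection runs when certain conditions are met and stays idle otherwise.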
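
Triggered collection can be sketched as below. The class name `TriggeredCollector`, the cooldown between runs, and the injectable clock are assumptions added for illustration; the cooldown is one simple way to apply the self-control principle from point 2, so that a persistent fault does not produce a dump on every check.

```python
import time
from typing import Callable, Optional


class TriggeredCollector:
    """Runs an expensive collection step only when a condition is met."""

    def __init__(
        self,
        condition: Callable[[float], bool],
        collect: Callable[[], None],
        cooldown_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.condition = condition
        self.collect = collect
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self._last_run: Optional[float] = None  # time of the last collection

    def check(self, sample: float) -> bool:
        """Returns True if this sample triggered a collection."""
        if not self.condition(sample):
            return False
        now = self.clock()
        # Skip collection while the previous run is still within the cooldown.
        if self._last_run is not None and now - self._last_run < self.cooldown_seconds:
            return False
        self._last_run = now
        self.collect()
        return True
```

In practice `collect` might invoke a tool such as jstack to capture a thread dump, while `condition` checks a metric such as CPU usage against a threshold.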

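Monitoring solutions for containers

Traditional monitoring systems were mostly designed for physical or virtual machines, which are static and long-lived: once an environment is installed and configured, it may not change for years. For the monitoring system, both the monitored objects and their monitoring configuration are therefore static, and once monitoring is deployed with the system it needs little further management.

Physical machines, virtual machines, and containers are all host environments from the point of view of an application process, and a container can be thought of as a lightweight virtual machine. But containers are dynamic and short-lived, and especially in a distributed microservice architecture, the number of containers and their IP addresses can change at any time. Applying the traditional monitoring approach here adds complexity. On a physical or virtual machine, installing one monitoring agent is enough. If a physical machine runs a large number of containers and each gets its own agent, the agents' resource consumption grows. Moreover, because containers share resources with the host, performance data collected inside a container reflects the host, which defeats the purpose of collecting it there.

Containers are also usually numerous, so the volume of collected data is large. A container may stop a few minutes after it starts, at which point the data collected from it loses its value. The result is a large amount of worthless monitoring data that is complicated to maintain. So how should containers be monitored? The answer is to monitor from outside the container, on the host. This makes it possible to monitor not only each container's resource usage but also data such as container status and count.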
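
One way to collect per-container data from the host is to read the cgroup accounting files that the kernel maintains for each container. The sketch below assumes a cgroup v1 layout with Docker's cgroupfs driver, where each container has a directory under `/sys/fs/cgroup/memory/docker/`; other setups (cgroup v2, the systemd driver) use different paths, so treat the default path as an assumption.

```python
from pathlib import Path
from typing import Dict

# Assumed location for cgroup v1 with Docker's cgroupfs driver.
DEFAULT_CGROUP_ROOT = "/sys/fs/cgroup/memory/docker"


def container_memory_bytes(container_id: str, cgroup_root: str = DEFAULT_CGROUP_ROOT) -> int:
    """Reads one container's current memory usage from the host side."""
    usage_file = Path(cgroup_root) / container_id / "memory.usage_in_bytes"
    return int(usage_file.read_text().strip())


def all_container_memory(cgroup_root: str = DEFAULT_CGROUP_ROOT) -> Dict[str, int]:
    """Returns memory usage for every container directory under the root.

    The number of entries also gives the container count on this host.
    """
    usage = {}
    for entry in Path(cgroup_root).iterdir():
        if (entry / "memory.usage_in_bytes").is_file():
            usage[entry.name] = container_memory_bytes(entry.name, cgroup_root)
    return usage
```

Because the files are read on the host, nothing needs to be installed inside the containers, and containers that have already exited simply no longer appear.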

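Monitoring containers on a single host

The simplest way to monitor containers on a single host is the docker stats command, which displays the resource usage of all containers, with output like the following:

[Figure: docker stats output]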
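
For scripted use, the same information can be captured and parsed. The sketch below assumes a Docker version whose `docker stats` supports the `--no-stream` and `--format` options; the choice of columns and the tab separator are decisions made for this example.

```python
import subprocess
from typing import Dict, List

# Go-template format string: one tab-separated line per container.
STATS_FORMAT = "{{.Name}}\t{{.CPUPerc}}\t{{.MemPerc}}"


def parse_stats(output: str) -> List[Dict[str, object]]:
    """Parses tab-separated `docker stats` output into dictionaries."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, cpu, mem = line.split("\t")
        rows.append(
            {
                "name": name,
                "cpu_percent": float(cpu.rstrip("%")),
                "mem_percent": float(mem.rstrip("%")),
            }
        )
    return rows


def snapshot() -> List[Dict[str, object]]:
    """Takes a single sample of all running containers (requires Docker)."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", STATS_FORMAT],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_stats(result.stdout)
```

Calling `snapshot()` periodically and storing the results is a crude way to get the history that the interactive command lacks.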

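This gives a clear view of each container's resource usage, but it shows only current values, not trends. cAdvisor, a graphical tool provided by Google, shows the resource usage of each container as well as the host, and can display the trend over a period of time. The cAdvisor dashboard looks like this:

[Figure: cAdvisor dashboard]

cAdvisor is also very simple to install: download and start a cAdvisor container, and it can be accessed at the host IP on the default port 8080.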
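
Trend graphs such as cAdvisor's are derived from raw samples. As a sketch of that step, the function below turns two readings of a cumulative CPU counter into a usage rate. It assumes the counter is reported in nanoseconds of CPU time, which is how container CPU usage is commonly exposed; the sample layout is hypothetical and not cAdvisor's actual data structure.

```python
from typing import Tuple

# (timestamp in seconds, cumulative CPU time in nanoseconds)
Sample = Tuple[float, int]


def cpu_cores_used(previous: Sample, current: Sample) -> float:
    """Returns the average number of CPU cores used between two samples."""
    prev_ts, prev_ns = previous
    curr_ts, curr_ns = current
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # CPU nanoseconds consumed divided by wall-clock nanoseconds elapsed.
    return (curr_ns - prev_ns) / (elapsed * 1e9)
```

A value of 0.5 means the container used half a core on average over the interval; plotting this value over successive intervals gives the trend line.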

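Monitoring containers across multiple hosts

Although cAdvisor collects monitoring data and presents it well, it cannot display data across hosts. With many hosts, a centralized way of aggregating and displaying the data is needed. The classic solution is cAdvisor + InfluxDB + Grafana: a cAdvisor container runs on each host to collect data, the collected data is stored in the time-series database InfluxDB, and dashboards are built with the visualization tool Grafana. The structure is as follows:

[Figure: cAdvisor + InfluxDB + Grafana architecture]

These three tools are also very simple to install and can be brought up quickly by starting three containers, as shown below:

[Figure: commands for starting the InfluxDB, cAdvisor, and Grafana containers]

In the installation steps above, the InfluxDB container is started first and a database dedicated to cAdvisor is created inside it. The cAdvisor container is then started with options that direct it to store its data in InfluxDB. Finally, the Grafana container is started, InfluxDB is configured as its data source in the Grafana UI, and the data to display is customized. A simple multi-host monitoring system is then in place. The Grafana interface is shown below:

[Figure: Grafana dashboard]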
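
To make the storage step more concrete, the sketch below formats one data point in InfluxDB's line protocol, the text format that InfluxDB accepts for writes. The measurement, tag, and field names in the usage are hypothetical and are not the series names that cAdvisor itself writes.

```python
from typing import Dict, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escapes each of the given characters in text."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line_protocol(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Builds one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tags give a stable series key
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(fields[key]) for key in sorted(fields)
    )
    return f"{head} {body} {timestamp_ns}"
```

Tagging every point with the host and container name is what allows Grafana to group and filter data from many hosts in a single dashboard.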

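Monitoring containers on Kubernetes

Newer versions of Kubernetes already integrate cAdvisor, so cAdvisor does not need to be installed separately in a Kubernetes setup: the cAdvisor dashboard can be reached directly at a node's IP on the default port 4194. Kubernetes also provides a component called Heapster that aggregates the data collected by cAdvisor on each node, which is then displayed through Kubedash. The structure is as follows:

[Figure: Heapster and Kubedash architecture on Kubernetes]

In Kubernetes, the master is responsible for scheduling all nodes, so when Heapster runs with Kubernetes it must be started with the kubernetes_master address. Heapster obtains the addresses of all nodes from Kubernetes and then calls the Kubelet HTTP interface on each target node through the node's IP and port 10250. The Kubelet in turn calls the cAdvisor service to obtain performance data for all containers on that node and returns it to Heapster for aggregation. The data is then displayed through Kubedash, whose interface looks like this:

[Figure: Kubedash interface]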
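
The aggregation step can be pictured with a small sketch. This is not Heapster's code: the `ContainerSample` record and the `aggregate` function are hypothetical, and they only illustrate how per-container data rolls up to the pod, namespace, node, and cluster levels.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class ContainerSample(NamedTuple):
    """One container's resource usage, as reported by a node."""
    node: str
    namespace: str
    pod: str
    cpu_cores: float
    memory_bytes: int


def _group_key(sample: ContainerSample, level: str) -> str:
    if level == "cluster":
        return "cluster"
    if level == "pod":
        # Pod names are only unique within a namespace.
        return f"{sample.namespace}/{sample.pod}"
    return getattr(sample, level)


def aggregate(samples: Iterable[ContainerSample], level: str) -> Dict[str, Dict[str, float]]:
    """Sums CPU and memory usage at the requested level."""
    if level not in ("cluster", "node", "namespace", "pod"):
        raise ValueError(f"unknown level: {level}")
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"cpu_cores": 0.0, "memory_bytes": 0}
    )
    for sample in samples:
        bucket = totals[_group_key(sample, level)]
        bucket["cpu_cores"] += sample.cpu_cores
        bucket["memory_bytes"] += sample.memory_bytes
    return dict(totals)
```

The same list of samples can be aggregated at each level in turn, which is how a dashboard can move between cluster-wide totals and individual pods.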

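Monitoring solutions for Mesos

For Mesos, a tool called mesos-exporter exports monitoring data from a Mesos cluster to Prometheus. Prometheus is a monitoring tool that combines a database, graphing, statistics, and alerting. It is also very simple to install: download the package, configure a few parameters such as the monitoring targets, and run it. By default it is accessed on port 9090. For mesos-exporter, a process only needs to be started on each slave node and then added to the Prometheus server's monitoring targets for the data to be collected. The architecture is as follows:

[Figure: mesos-exporter and Prometheus architecture]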
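
An exporter makes its data available to Prometheus as plain text, one sample per line, which the Prometheus server fetches periodically. The sketch below renders a sample in that text format. The metric and label names in the usage are hypothetical and are not the names mesos-exporter actually exposes.

```python
from typing import Dict


def _escape_label_value(value: str) -> str:
    """Escapes backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Renders one sample as: name{label="value",...} value"""
    if not labels:
        return f"{name} {value}"
    rendered = ",".join(
        f'{key}="{_escape_label_value(labels[key])}"' for key in sorted(labels)
    )
    return f"{name}{{{rendered}}} {value}"
```

Labels such as the task and slave identify where each sample came from, which lets Prometheus queries slice resource usage by task.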

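The Prometheus dashboard shows that Prometheus targets can be mesos-exporter as well as cAdvisor.

[Figure: Prometheus targets page]

The Prometheus display interface is shown below:

[Figure: Prometheus display interface]

Comparison of collection tools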

cAdvisor collects resource monitoring data for the host and its containers, such as CPU, memory, filesystem, and network usage statistics. It can also display Docker information and the images downloaded on the host. Because cAdvisor caches data in memory by default, its interface can only show the trend for about the last minute, so older history is not visible. It does, however, support persistent storage backends such as InfluxDB.
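
The effect of in-memory caching can be illustrated with a fixed-size buffer. This is not cAdvisor's implementation: the class `RecentStats` and the default of 60 samples (roughly one minute at one sample per second) are assumptions for the example.

```python
from collections import deque
from typing import Deque, List


class RecentStats:
    """Keeps only the most recent samples in memory."""

    def __init__(self, max_samples: int = 60) -> None:
        # A deque with maxlen silently drops the oldest item when full.
        self._samples: Deque[float] = deque(maxlen=max_samples)

    def add(self, sample: float) -> None:
        self._samples.append(sample)

    def history(self) -> List[float]:
        """Returns the retained samples, oldest first."""
        return list(self._samples)
```

Anything older than the buffer is gone, which is why long-term history requires writing samples to a persistent backend.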

Heapster relies on cAdvisor to collect host and container resource usage on each node, then aggregates the data from all nodes. This makes it possible to see the resources of the entire Kubernetes cluster as well as to view each node or namespace separately, along with the pods under each node or namespace. Detailed resource usage is thus available at the cluster, node, and pod levels. Heapster also stores data in memory by default and supports persistent storage backends such as InfluxDB.

What distinguishes mesos-exporter is that it collects monitoring data for tasks. During resource scheduling, Mesos starts task executors on each slave, and these executors may or may not be containers. mesos-exporter therefore reports resource usage from the perspective of tasks rather than from the perspective of containers that bear no relation to them.

The sections above describe monitoring for several typical architectures, but they are not best practices in themselves. The strengths of each monitoring product need to be combined according to the characteristics of the production environment. For example, Grafana has strong charting capabilities but no alerting function, so it can be combined with Prometheus, whose data-processing capabilities round out analysis and display. Some monitoring products are listed below, though the classification in the table is not strict: Prometheus and Zabbix, for example, each cover collection, display, and alerting. Each product has its own strengths.

[Figure: table of monitoring products by function]

http://www.dockerinfo.net/1718.html
