Dynamic management of Prometheus configuration files

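Prometheus is an open-source monitoring and alerting solution developed at SoundCloud. Work on the code began in 2012, and since it was open-sourced in 2015 the project has had a very active community of users and developers; it now has more than 11,000 stars on GitHub. In 2016 Prometheus became the second project to join the CNCF (Cloud Native Computing Foundation), after Kubernetes.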

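The Google SRE book also mentions Prometheus as an open-source implementation similar to Google's Borgmon monitoring system, and many of its ideas line up with Google's SRE practices. Its most common use today is monitoring alongside the Kubernetes container management system, but don't mistake it for a container-only monitoring tool; once you dig deeper, you will find it can do much more.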

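A side note: I went back and forth for a long time between Prometheus and Open-Falcon. Both are excellent new-generation monitoring solutions. Open-Falcon was open-sourced by Xiaomi and is used by Xiaomi, Kingsoft Cloud, Meituan, JD Finance, Ganji and others. The biggest difference is that Prometheus pulls data while Open-Falcon has data pushed to it; I will leave the pros and cons of the two approaches aside. Briefly, I like Prometheus for roughly five reasons: 1. it works out of the box and is very easy to deploy and operate; 2. its community is very active; 3. it has built-in service discovery; 4. its simple text-based storage format makes custom development easy; 5. most importantly, I really like its alerting component, which supports grouping, inhibition and silencing. None of this is meant to belittle Open-Falcon; as the old saying goes, the best tool is the one that fits you.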

Automatically refreshing configuration files with consul-template

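Because Prometheus actively pulls metrics, the server must be given the list of nodes to monitor. As the number of monitored nodes grows, editing the configuration file for every new node becomes tedious. I use consul-template together with Consul to generate the configuration dynamically; the same approach works for any other service whose configuration changes frequently. An alternative is etcd + confd, and today's mainstream dynamic configuration systems largely fall into these two camps. consul-template fills much the same role as confd, except that it is the templating tool HashiCorp ships for Consul.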

Implementation

Let's first look at a sample Prometheus configuration:

  - job_name: redis_exporter
    static_configs:
      - targets: ['10.100.11.53:9121','10.100.11.54:9121','10.100.11.55:9121']
        labels:
          instance: redis
  - job_name: 'linux'
    static_configs:
      - targets: ['10.100.11.53:9100','10.100.11.54:9100','10.100.11.55:9100']
        labels:
          instance: linux

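To add a new monitored node you only need to add another entry to targets; "instance" is a label that makes the targets easier to tell apart. The problem is that once there are hundreds or thousands of targets, the configuration file becomes very bloated. An experienced operator will think of an include mechanism: pull in other files so that one large main configuration file is split into many small ones. Prometheus does this with file-based service discovery, "file_sd_configs".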

- job_name: 'file_ds'
  file_sd_configs:
  - refresh_interval: 1m
    files:
    - ./conf.d/*.json

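Prometheus watches the files under ./conf.d/ that end in .json and picks up changes automatically (it also re-reads them every refresh_interval); the updated targets then show up on the Prometheus monitoring page.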

Sample file_sd_configs target file

[root@prometheus01 prometheus]# cat conf.d/targets.json |more
[
  {
    "targets": [ "100.100.110.71:9090" ],
    "labels": {
      "env": "product",
      "job": "prometheus",
      "instance": "100.100.110.71_prometheus_server"
    }
  },
  {
    "targets": [ "100.100.110.53:9121" ],
    "labels": {
      "env": "product",
      "job": "redis",
      "instance": "redis53"
    }
  }
]
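
Since Prometheus reads these files unattended, it can help to check a generated file before dropping it into conf.d/. The following is a minimal sketch of such a check, not part of Prometheus itself; the function name validate_file_sd is hypothetical, and the rules it enforces (a JSON array of objects, "targets" as a list of strings, "labels" as a string-to-string map) are the shape shown in the sample above.

```python
import json


def validate_file_sd(text):
    """Return a list of problems found in a file_sd JSON document (empty if none)."""
    try:
        groups = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(groups, list):
        return ["top level must be a JSON array of target groups"]

    problems = []
    for i, group in enumerate(groups):
        if not isinstance(group, dict):
            problems.append(f"group {i}: must be an object")
            continue
        targets = group.get("targets")
        if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
            problems.append(f"group {i}: 'targets' must be a list of strings")
        labels = group.get("labels", {})
        if not isinstance(labels, dict) or not all(
            isinstance(k, str) and isinstance(v, str) for k, v in labels.items()
        ):
            problems.append(f"group {i}: 'labels' must map strings to strings")
    return problems


if __name__ == "__main__":
    sample = '[{"targets": ["100.100.110.53:9121"], "labels": {"env": "product", "job": "redis"}}]'
    print(validate_file_sd(sample))  # an empty list means no problems were found
```

An empty result means the file has the expected shape; anything else lists what to fix before Prometheus sees the file.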

Dynamic file updates with service discovery

Start a single-node Consul:


consul agent -server -bootstrap-expect 1 -bind=100.100.110.71 -client=100.100.110.71 -data-dir=/tmp/consul -node=agent-one -config-dir=/etc/consul.d -ui

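With sub-configuration files in place, adding a monitored node only requires changing the contents of a sub-configuration file. We can define a template for that file in advance and let consul-template render it, so the file is updated dynamically. The steps are as follows: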

Download consul-template:

# Pick the build matching your OS and the version you need from:
# https://releases.hashicorp.com/consul-template/
unzip xxxx.zip
mkdir templates  # create the template directory
cp consul-template /usr/local/bin
consul-template -version

Create the consul-template configuration file

The configuration file follows the HashiCorp Configuration Language (HCL) format. My configuration looks like this:

[root@prometheus01 prometheus]#  cat consul-template.conf 
log_level = "warn"
syslog {
# This enables syslog logging.
enabled = true
# This is the name of the syslog facility to log to.
facility = "LOCAL5"
}
consul {
# auth {
# enabled = true
# username = "test"
# password = "test"
# }
address = "100.100.110.71:8500"
# token = "abcd1234"
retry {
enabled = true
attempts = 12
backoff = "250ms"
# If max_backoff is set to 10s and backoff is set to 1s, sleep times
# would be: 1s, 2s, 4s, 8s, 10s, 10s, ...
max_backoff = "3m"
}
}
template {
source = "/usr/local/prometheus/templates/redis-discovery.ctmpl"
destination = "/usr/local/prometheus/conf.d/redis-discovery.json"
command = ""
command_timeout = "60s"
backup = true
left_delimiter = "{$"
right_delimiter = "$}"
wait {
min = "2s"
max = "20s"
}
}

Main configuration parameters:

syslog: enables syslog, so that the service's logs are recorded in syslog.

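consul: the address of the Consul service discovery endpoint. My setup needs no authentication, so the auth block is commented out. For setting up Consul itself, see my earlier article. The backoff and max_backoff options deserve a mention: backoff sets the initial retry interval used when data cannot be fetched from Consul, and the interval doubles on every retry. For example, with backoff set to 250ms and five retries, the intervals are 250ms, 500ms, 1s, 2s, 4s, until the max_backoff threshold is reached; if max_backoff is 2s, the fifth retry still waits 2s, giving 250ms, 500ms, 1s, 2s, 2s.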
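
The doubling-with-a-cap arithmetic above can be reproduced in a few lines. This is only a sketch of the schedule as described in this article, not consul-template's actual implementation; backoff_schedule is a hypothetical helper and the values are in milliseconds.

```python
def backoff_schedule(backoff_ms, attempts, max_backoff_ms=None):
    """Return the wait before each retry: doubling each time, capped at max_backoff_ms."""
    delays = []
    delay = backoff_ms
    for _ in range(attempts):
        if max_backoff_ms is not None:
            delays.append(min(delay, max_backoff_ms))
        else:
            delays.append(delay)
        delay *= 2  # the interval doubles after every retry
    return delays


if __name__ == "__main__":
    print(backoff_schedule(250, 5))                       # [250, 500, 1000, 2000, 4000]
    print(backoff_schedule(250, 5, max_backoff_ms=2000))  # [250, 500, 1000, 2000, 2000]
```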

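template: defines the template files. The main options are source, destination and command. When backup = true, the previously rendered file is backed up with a .bak suffix.

source: the consul-template template file, that is, the source file to be rendered.

destination: where the rendered output is written. Here that is the location of my Prometheus file-based discovery sub-configuration files, under /usr/local/prometheus/conf.d/.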

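command: the command to run after the file has been rendered successfully. Prometheus detects file changes on its own, so I need no command here and leave it empty. Services such as nginx or haproxy usually need a reload or restart after a configuration change, so for them you could set a command like "nginx -s reload".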

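command_timeout: the timeout for the command above.

left_delimiter and right_delimiter: the delimiters used in the template file. The default is "{{" and "}}"; change them when that causes a conflict. In my case the template files are pushed with Ansible, and "{{ }}" clashes with Jinja2 syntax, so I switched to "{$" and "$}".

When several templates need to be rendered, you can write multiple template blocks.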

Starting the service

Start consul-template and point it at the configuration file.

#./consul-template -config ./consul-template.conf

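Template rendering

Write the consul-template template to match the format of the target file. For example, my template for Prometheus file-based service discovery looks like this: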

[root@prometheus01 prometheus]# cat templates/redis-discovery.ctmpl 
[
{$ range tree "prometheus/redis" $}
{
"targets": ["{$ .Value $}"],
"labels": {
"instance": "{$ .Key $}",
"env": "pre-product"
}
},
{$ end $}
{
"targets": ["100.100.110.71:9090"],
"labels": {
"instance": "prometheus01",
"env": "pre-product"
}
}
]

The template loops over the entries under prometheus/redis/ in Consul's K/V store: "targets" takes each entry's value, and "instance" takes its key.
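
To make the key/value mapping concrete, here is a rough Python equivalent of what the template's range loop produces for each K/V entry. It is only an illustration of the mapping, not how consul-template works internally; render_file_sd and the sample pair ("redis01", "172.16.162.96:9121") are hypothetical.

```python
import json


def render_file_sd(kv_pairs, env="pre-product"):
    """Build file_sd target groups from (key, value) pairs.

    Mirrors the template: the K/V value becomes the scrape target and
    the K/V key becomes the "instance" label.
    """
    groups = [
        {"targets": [value], "labels": {"instance": key, "env": env}}
        for key, value in kv_pairs
    ]
    return json.dumps(groups, indent=2)


if __name__ == "__main__":
    # Hypothetical entry: key "redis01" stored with value "172.16.162.96:9121".
    print(render_file_sd([("redis01", "172.16.162.96:9121")]))
```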

Below is an example of the Consul K/V store; each entry you add corresponds to one "instance: targets" pair in the Prometheus target file.
[Figure: Consul K/V example]

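A small trick: in the generated JSON file, target entries are separated by commas, and the last entry must not be followed by a comma. The loop in the template emits a comma after every entry, so I hard-coded one extra target after the loop; that way the special case never has to be handled.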
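
The effect of that trick can be checked with a small simulation. The sketch below imitates a loop that always emits a trailing comma, with and without a fixed final entry, and tests the result with Python's JSON parser; render_like_template and is_valid_json are hypothetical helpers, not part of consul-template.

```python
import json


def render_like_template(entries, fixed_last=None):
    """Imitate the template: every looped entry is followed by a comma,
    then an optional hard-coded entry closes the array."""
    parts = [json.dumps(entry) + "," for entry in entries]
    if fixed_last is not None:
        parts.append(json.dumps(fixed_last))
    return "[" + "".join(parts) + "]"


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    looped = [{"targets": ["172.16.162.96:9121"]}]
    fixed = {"targets": ["100.100.110.71:9090"]}
    print(is_valid_json(render_like_template(looped)))         # False: trailing comma
    print(is_valid_json(render_like_template(looped, fixed)))  # True
```

The fixed entry also keeps the file valid when the K/V prefix is empty, since the array then contains just that one element.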

Adding data online to update the configuration file dynamically

Now open the Consul UI (port 8500 by default) and, under KEY/VALUE, add keys such as redis01, redis02, … in the prometheus/redis/ directory. The generated configuration file looks like this:

[root@prometheus01 prometheus]# cat conf.d/redis-discovery.json
[

{
"targets": ["172.16.162.96:9121"],
"labels": {
"instance": "redis_172.16.162.96:9121",
"env": "pre-product"
}
},

{
"targets": ["172.16.162.97:9121"],
"labels": {
"instance": "redis_172.16.162.96:9222",
"env": "pre-product"
}
}

]
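
Keys do not have to be entered through the UI; Consul also exposes the K/V store over HTTP (PUT /v1/kv/<key>, with the value as the request body). The sketch below only builds such a request with the standard library; build_kv_put is a hypothetical helper, the address is the Consul agent used in this article, and the key and value are example data.

```python
import urllib.parse
import urllib.request


def build_kv_put(consul_addr, key, value):
    """Build (but do not send) a PUT request that stores value under key in Consul's K/V store."""
    url = f"http://{consul_addr}/v1/kv/{urllib.parse.quote(key)}"
    return urllib.request.Request(url, data=value.encode("utf-8"), method="PUT")


if __name__ == "__main__":
    req = build_kv_put(
        "100.100.110.71:8500",        # Consul address from the config above
        "prometheus/redis/redis01",   # example key under the template's prefix
        "172.16.162.96:9121",         # example value: the exporter address
    )
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen(req) would actually send it; once the key exists, consul-template re-renders the destination file and Prometheus picks up the new target.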

[Figure: Prometheus web monitoring UI]

Reference: http://blog.51cto.com/xujpxm/1964878