An Analysis of the ElastAlert Spike Rule

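The ElastAlert spike rule monitors the volume of events indexed in Elasticsearch and alerts on sudden peaks or troughs. A spike rule mainly needs to specify the following:

1) Whether to watch for a sudden rise or a sudden drop in log volume (spike_type):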

# (Required, spike specific)
# The direction of the spike
# 'up' matches only spikes, 'down' matches only troughs
# 'both' matches both spikes and troughs
spike_type: "down"

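2) When watching for a rise (up), the minimum number of events the current window must contain (threshold_cur).

3) When watching for a rise (up) and the current window has reached threshold_cur, the factor by which the event count must have grown (spike_height) before an alert is sent.

4) Because the rule compares the rise or drop between two consecutive periods, a time range (timeframe) must be specified so that the events in each of the two periods can be counted.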

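Likewise, to watch for a sudden drop in log volume, set the type to down and also specify a minimum count for the reference window (threshold_ref), the factor of the drop (spike_height), and the counting period (timeframe).

For up, specify threshold_cur and spike_height.

For down, specify threshold_ref and spike_height.

Here threshold_cur applies to the event count in the current window, while threshold_ref applies to the event count in the reference window. The reference window is the period of the same length immediately before the current one, and it serves as the baseline for comparison.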
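The relationship between the two windows can be sketched in a few lines of Python. This is only an illustration of the idea, not ElastAlert's own code: the function name `window_counts`, the half-open window boundaries, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def window_counts(event_times, now, timeframe):
    """Count events in the reference and current windows (illustrative only).

    Assumed layout: the current window is [now - timeframe, now) and the
    reference window is the equally sized span immediately before it.
    """
    cur_start = now - timeframe
    ref_start = now - 2 * timeframe
    ref = sum(1 for t in event_times if ref_start <= t < cur_start)
    cur = sum(1 for t in event_times if cur_start <= t < now)
    return ref, cur


now = datetime(2017, 8, 12, 14, 20)
timeframe = timedelta(minutes=10)

# Hypothetical timestamps: 2 events in 14:00-14:10, 3 events in 14:10-14:20.
events = [
    datetime(2017, 8, 12, 14, 2),
    datetime(2017, 8, 12, 14, 7),
    datetime(2017, 8, 12, 14, 11),
    datetime(2017, 8, 12, 14, 15),
    datetime(2017, 8, 12, 14, 19),
]

ref_count, cur_count = window_counts(events, now, timeframe)
print(ref_count, cur_count)
```

The two counts returned here are the numbers that threshold_ref and threshold_cur are checked against, and their ratio is what spike_height is compared with.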

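In the figure below, spike_height = 2 and timeframe is 10 minutes, with one up example and one down example.

In the up example we are watching for a peak, so we only need to set threshold_cur = 20; threshold_ref is not needed. Suppose the events occur as follows: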

eg1:

14:00 -- 14:10 10

14:10 -- 14:20 15

The actual cur is 15, which is below threshold_cur (20), so no alert is raised.

eg2:

14:00 -- 14:10 10

14:10 -- 14:20 30

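The actual cur (30) exceeds threshold_cur (20), so whether an alert fires now depends on whether the growth factor reaches spike_height. The actual growth factor is 30/10 = 3, which is greater than spike_height (2), so an alert is raised.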
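The two checks walked through in eg1 and eg2 can be written as a small Python function. This is a minimal sketch of the decision as described in this post, not a copy of ElastAlert's implementation. The name `is_spike` is hypothetical, treating the boundaries as inclusive is an assumption, and the down case at the end uses made-up numbers because this post gives no figures for it.

```python
def is_spike(ref, cur, spike_type, spike_height,
             threshold_ref=0, threshold_cur=0):
    """Return True if the change from ref to cur should raise an alert.

    ref and cur are the event counts of the reference and current windows.
    """
    # Threshold gate: each window must hold at least its configured minimum.
    if cur < threshold_cur or ref < threshold_ref:
        return False

    # Ratio check, written with multiplication to avoid dividing by zero.
    spike_up = cur >= ref * spike_height
    spike_down = cur * spike_height <= ref

    if spike_type == "up":
        return spike_up
    if spike_type == "down":
        return spike_down
    return spike_up or spike_down  # spike_type == "both"


# eg1: cur = 15 is below threshold_cur = 20, so no alert.
eg1 = is_spike(ref=10, cur=15, spike_type="up",
               spike_height=2, threshold_cur=20)

# eg2: cur = 30 passes the threshold and 30/10 = 3 exceeds spike_height = 2.
eg2 = is_spike(ref=10, cur=30, spike_type="up",
               spike_height=2, threshold_cur=20)

# Hypothetical down case: ref = 40 passes threshold_ref = 20, and 40/10 = 4.
down = is_spike(ref=40, cur=10, spike_type="down",
                spike_height=2, threshold_ref=20)

print(eg1, eg2, down)
```

Checking the threshold first mirrors the order used in the examples: a window that is too quiet never reaches the ratio comparison, which keeps low-traffic noise from triggering alerts.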

In config.yml, I set the scan interval to:

run_every:
  minutes: 1

In the rule file, timeframe is set to 2 minutes:

# (Required, spike specific)
# The size of the window used to determine average event frequency
# We use two sliding windows each of size timeframe
# To measure the 'reference' rate and the current rate
timeframe:
  minutes: 2

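In the log below, a "Queried rule Event spike..." line appears every minute, which means ElastAlert went to Elasticsearch to query for events.

Every two minutes there is a "Ran Event spike from ... to ..." line, which reports how many hits, matches, and alerts sent were counted within one timeframe.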

INFO:elastalert:Queried rule Event spike from 2017-08-12 23:10 CST to 2017-08-12 23:11 CST: 0 / 0 hits
INFO:elastalert:Queried rule Event spike from 2017-08-12 23:11 CST to 2017-08-12 23:12 CST: 0 / 0 hits
INFO:elastalert:Ran Event spike from 2017-08-12 23:10 CST to 2017-08-12 23:12 CST: 0 query hits (0 already seen), 0 matches, 0 alerts sent
INFO:elastalert:Sleeping for 59.928103 seconds
