1. Introduction to metric queries
- Metric queries build on the result set of log queries to create metrics.
- They can be used, for example, to calculate the rate of error logs, or the top N log lines printed over the past 3 hours.
- Combined with the parsers available in log queries, metric queries can compute over simple values extracted from log lines, such as latency or request size. All labels, including those newly created by parsers, can be used for aggregation or to generate new series.
2. Range vector aggregations
LogQL borrows the range vector concept from Prometheus: logs that have already been filtered can be aggregated again along the time dimension. For example, after selecting the past 3 hours of logs, you can compute per-second statistics over them; "per second" here is a time duration. LogQL supports the same time durations as Prometheus:
- ms - milliseconds
- s - seconds
- m - minutes
- h - hours
- d - days - assuming a day always has 24h
- w - weeks - assuming a week always has 7d
- y - years - assuming a year always has 365d
Examples: 5h, 1h30m, 5m, 10s.
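As a rough illustration of how these duration strings combine, here is a small Python sketch (not part of LogQL itself; the function name and unit table are my own) that converts a Prometheus-style duration such as 1h30m into seconds:

```python
import re

# Seconds per unit, matching the Prometheus/LogQL duration units above.
# In the regex alternation, "ms" must come before "m" so it matches first.
UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
         "d": 86400, "w": 604800, "y": 31536000}

def parse_duration(text):
    """Parse a Prometheus-style duration like "1h30m" into seconds."""
    pattern = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")
    pos, total = 0, 0.0
    for match in pattern.finditer(text):
        if match.start() != pos:          # reject gaps / junk between parts
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * UNITS[match.group(2)]
        pos = match.end()
    if pos != len(text) or pos == 0:      # reject trailing junk or empty input
        raise ValueError(f"invalid duration: {text!r}")
    return total

print(parse_duration("1h30m"))  # 5400.0
print(parse_duration("10s"))    # 10.0
```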
Loki supports two types of range aggregation:
- a. log range aggregations (#2.1)
- b. unwrapped range aggregations (#2.2)
2.1 Log range aggregations
A function aggregates the logs over a given interval. The interval is written after the log stream selector, or after the log pipeline.
List of functions:
- rate(log-range): calculates the number of entries per second.
- count_over_time(log-range): counts the entries for each log stream within the given range.
- bytes_rate(log-range): calculates the number of bytes per second for each stream.
- bytes_over_time(log-range): counts the amount of bytes used by each log stream for a given range.
- absent_over_time(log-range): returns an empty vector if the range vector passed to it has any elements, and a 1-element vector with the value 1 if it has no elements. (absent_over_time is useful for alerting on when no time series and log streams exist for a label combination for a certain amount of time.)
【Examples】
- Count the number of job="mysql" logs in 5-minute windows:
count_over_time({job="mysql"}[5m])
More concretely, suppose the log range is:
00:01 --> log1
00:02 --> log2
...
00:10 --> log10
Then count_over_time produces:
00:01 ~ 00:05: 5 log lines in this window.
00:06 ~ 00:10: also 5 log lines in this window.
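The bucketed counts above can be reproduced with a short Python sketch. Note this is a simplified illustration (real Loki evaluates the range vector at each query step); it uses minute offsets as timestamps and mirrors the two fixed 5-minute windows in the walkthrough:

```python
# Ten log lines, one per minute, at 00:01 .. 00:10.
log_timestamps = list(range(1, 11))

def count_over_time(timestamps, window_end, window_minutes=5):
    """Count entries in the half-open window (end - window, end]."""
    return sum(window_end - window_minutes < t <= window_end
               for t in timestamps)

print(count_over_time(log_timestamps, 5))   # 5 -> logs from 00:01 to 00:05
print(count_over_time(log_timestamps, 10))  # 5 -> logs from 00:06 to 00:10
```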
- Over 1-minute windows, take logs with label job="mysql" whose content contains "error" but not "timeout", parse each line as JSON, keep entries whose duration is greater than 10s, and compute the per-second rate. Finally, use sum to add up the values, grouped by host.
sum by (host) (rate({job="mysql"} |= "error" != "timeout" | json | duration > 10s [1m]))
2.2 Unwrapped range aggregations
Unwrapped ranges operate on the values of labels newly extracted by a parser (so the object of aggregation is no longer just the log line itself). The syntax is | unwrap label_identifier.
Built-in conversion functions:
- duration_seconds(label_identifier) (or its short equivalent duration), which will convert the label value into seconds from the go duration format (e.g. 5m, 24s30ms).
- bytes(label_identifier), which will convert the label value to raw bytes applying the bytes unit (e.g. 5 MiB, 3k, 1G).
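To make the bytes() conversion concrete, here is an illustrative Python sketch. The unit table is an assumption covering only the units in the examples above (following the usual Go convention where k/M/G are decimal and Ki/Mi/Gi are binary), not Loki's full table:

```python
import re

# Assumed unit table: decimal k/M/G, binary Ki/Mi/Gi -- just the units
# needed for the examples, not everything Loki accepts.
UNIT_BYTES = {
    "": 1, "b": 1,
    "k": 1000, "kb": 1000, "kib": 1024,
    "m": 1000**2, "mb": 1000**2, "mib": 1024**2,
    "g": 1000**3, "gb": 1000**3, "gib": 1024**3,
}

def to_bytes(value: str) -> float:
    """Convert a size string like "5 MiB", "3k" or "1G" into raw bytes."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([a-zA-Z]*)\s*", value)
    if not match or match.group(2).lower() not in UNIT_BYTES:
        raise ValueError(f"invalid size: {value!r}")
    return float(match.group(1)) * UNIT_BYTES[match.group(2).lower()]

print(to_bytes("5 MiB"))  # 5242880.0
print(to_bytes("3k"))     # 3000.0
print(to_bytes("1G"))     # 1000000000.0
```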
Many other functions are supported as well:
- rate(unwrapped-range): calculates the per-second rate of the sum of all values in the specified interval.
- rate_counter(unwrapped-range): calculates the per-second rate of the values in the specified interval, treating them as a "counter metric".
- sum_over_time(unwrapped-range): the sum of all values in the specified interval.
- avg_over_time(unwrapped-range): the average value of all points in the specified interval.
- max_over_time(unwrapped-range): the maximum value of all points in the specified interval.
- min_over_time(unwrapped-range): the minimum value of all points in the specified interval.
- first_over_time(unwrapped-range): the first value of all points in the specified interval.
- last_over_time(unwrapped-range): the last value of all points in the specified interval.
- stdvar_over_time(unwrapped-range): the population standard variance of the values in the specified interval.
- stddev_over_time(unwrapped-range): the population standard deviation of the values in the specified interval.
- quantile_over_time(scalar, unwrapped-range): the φ-quantile (0 ≤ φ ≤ 1) of the values in the specified interval.
- absent_over_time(unwrapped-range): returns an empty vector if the range vector passed to it has any elements, and a 1-element vector with the value 1 if it has no elements. (absent_over_time is useful for alerting on when no time series and log streams exist for a label combination for a certain amount of time.)
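A few of these unwrapped-range aggregations are easy to sketch in Python over the values extracted in a single window. The quantile here uses linear interpolation between sorted samples, following the Prometheus convention; the sample data is made up for illustration:

```python
def sum_over_time(values):
    return sum(values)

def avg_over_time(values):
    return sum(values) / len(values)

def quantile_over_time(phi, values):
    """phi-quantile (0 <= phi <= 1) with linear interpolation."""
    ordered = sorted(values)
    rank = phi * (len(ordered) - 1)      # fractional index into sorted samples
    low, frac = int(rank), rank - int(rank)
    if low + 1 >= len(ordered):
        return ordered[low]
    return ordered[low] + frac * (ordered[low + 1] - ordered[low])

# e.g. unwrapped request durations (seconds) observed in one window
durations = [2.0, 4.0, 6.0, 8.0, 10.0]
print(sum_over_time(durations))            # 30.0
print(avg_over_time(durations))            # 6.0
print(quantile_over_time(0.5, durations))  # 6.0
```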
【Examples】
- Over 1-minute windows, take logs containing metrics.go, parse each line with logfmt, unwrap the newly extracted bytes_processed label and sum its values, then sum the result by org_id.
sum by (org_id) (
sum_over_time(
{cluster="ops-tools1",container="loki-dev"}
|= "metrics.go"
| logfmt
| unwrap bytes_processed [1m])
)
3. Built-in aggregation operators
Like PromQL, LogQL supports applying a further aggregation on top of a range aggregation. The following operators can aggregate the computed results into new series:
- sum: Calculate sum over labels
- avg: Calculate the average over labels
- min: Select minimum over labels
- max: Select maximum over labels
- stddev: Calculate the population standard deviation over labels
- stdvar: Calculate the population standard variance over labels
- count: Count number of elements in the vector
- topk: Select largest k elements by sample value
- bottomk: Select smallest k elements by sample value
Syntax:
<aggr-op>([parameter,] <vector expression>) [without|by (<label list>)]
topk and bottomk require a parameter: for example, passing 10 to topk returns the top 10 elements.
To group the input vector by certain labels, use by or without: without removes the listed labels from the result (keeping the remaining ones), while by aggregates by the listed labels.
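A minimal Python sketch of these grouping semantics, using sum as the operator (the function name and sample data are made up for illustration): by keeps only the listed labels, without drops them, and samples sharing the remaining label set are summed together.

```python
from collections import defaultdict

def aggregate_sum(samples, by=None, without=None):
    """samples: list of (labels_dict, value). Returns {grouping_key: sum}."""
    grouped = defaultdict(float)
    for labels, value in samples:
        if by is not None:
            kept = {k: v for k, v in labels.items() if k in by}
        else:
            kept = {k: v for k, v in labels.items()
                    if k not in (without or ())}
        grouped[tuple(sorted(kept.items()))] += value
    return dict(grouped)

samples = [
    ({"host": "a", "level": "error"}, 3.0),
    ({"host": "a", "level": "info"},  5.0),
    ({"host": "b", "level": "error"}, 2.0),
]
# by (host): one series per host, levels merged together
print(aggregate_sum(samples, by={"host"}))
# without (host): host dropped, one series per remaining label set (level)
print(aggregate_sum(samples, without={"host"}))
```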
【Examples】
- Over 5-minute windows, for logs under region="us-east1", take the 10 streams with the highest per-second log rate, grouped by name. In other words: list the names of the 10 applications with the highest log throughput.
topk(10,sum(rate({region="us-east1"}[5m])) by (name))
- Count the logs of the mysql job over the past 5 minutes, grouped by level:
sum(count_over_time({job="mysql"}[5m])) by (level)