[Cloud Native Monitoring] mtail, a lightweight log monitoring tool
Foreword
The author has set up a temporary environment on the public cloud; you can log in and try it out first:
http://124.222.45.207:17000/login
Account: root, password: root.2020
Introduction
The three pillars of an observability platform are log monitoring, call-chain (tracing) monitoring, and metric monitoring. Of the three, log monitoring is the most familiar, because the systems we develop are basically inseparable from logs, and reading logs is also the most common way to troubleshoot problems. The defining characteristic of a log is that it records a discrete event: something happens and a log line is produced, providing detailed clues for analyzing and diagnosing problems.
For example, if a program suddenly fails to connect to its MySQL database, the exception log (below) makes it easy to see that the connection failed because of Too many connections:
[ERROR] [2023-05-20 21:14:43] com.alibaba.druid.pool.DruidDataSource.init(629) | init datasource error
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Too many connections
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:408)
at com.mysql.jdbc.Util.getInstance(Util.java:383)
A Too many connections exception is usually caused by exceeding the maximum number of connections the MySQL server allows, and can be resolved by raising the connection limit or closing unused connections. This example illustrates the division of labor between the three pillars. Log monitoring is the most direct tool for analyzing abnormal situations: request parameters, exception events, and so on provide detailed clues for diagnosis. Call-chain monitoring leans toward link-by-link analysis of a request and is often used to find performance problems between services. Metric monitoring deals with numeric data, which suits statistical aggregation and performance trends over time, though it takes experience to spot latent problems from metrics alone; typically, metrics supply supporting evidence for a hypothesis formed during problem analysis. In the Too many connections case above, you could check the MySQL connection-count metric for confirmation, and use its trend line to see when the number of connections started rising abnormally, then investigate further.
When it comes to log monitoring, most people's first reaction is an ELK-based or Loki-based solution. Both collect logs and ship them to a central store for storage, viewing, and analysis. One problem with the ELK approach is that Elasticsearch storage is very large and full-text indexing is expensive: the daily log volume of an ELK monitoring system can easily exceed a terabyte. Another problem is that these logs are discrete and massive, which makes aggregate statistics inconvenient.
Yet in real production, requirements like the following come up all the time:
Counting the daily call volume of a business interface from its logs, to derive order volume, payment amounts, and so on;
Counting the errors triggered in the logs: which errors occur each day, and how many times each one is triggered;
Alerting on log keywords, such as:
MySQL connection exhaustion: Too many connections
JVM memory overflow: java.lang.OutOfMemoryError: Java heap space
Connection refused: java.net.ConnectException: Connection refused
None of these requirements needs the full log stream, so they can be met by converting logs into metrics: transform and compute over the log stream to generate metrics, then let the monitoring system scrape them and use PromQL syntax for aggregation and statistical analysis.
Here I will introduce mtail, a tool from Google. mtail reads logs in a streaming fashion and extracts metrics from them by regular-expression matching. This approach uses the computing power of the target machine itself. Another advantage is that it is non-invasive: no instrumentation points need to be added to the business code. If the business program is supplied by a third-party vendor whose code we cannot change, mtail is a very good fit.
mtail installation and use
1. mtail installation:
[root@VM-4-14-centos tools]# mkdir /disk/tools/mtail
[root@VM-4-14-centos tools]# cd /disk/tools/mtail
[root@VM-4-14-centos mtail]# tar -zxvf mtail_3.0.0-rc51_Linux_x86_64.tar.gz
2. Start mtail:
[root@VM-4-14-centos mtail]# ./mtail --progs /disk/tools/mtail/conf --logs '/disk/tools/mtail/logs/*.log' --logs /var/log/messages --log_dir /disk/tools/mtail/logdir --poll_interval 250ms
The core parameters are as follows:
1. --progs: specifies a directory holding a set of *.mtail files; each mtail file describes a set of regular-expression extraction rules.
2. --logs: the list of log files to monitor. Multiple files can be separated by commas, the --logs flag can be repeated, or a directory can be given with wildcards (the wildcard pattern must be single-quoted). For example:
--logs a.log,b.log,c.log
--logs a.log
--logs b.log
--logs c.log
--logs '/export/logs/*.log'
3. --log_dir: the directory where mtail stores its own logs.
4. --port: the HTTP listening port of the mtail component, default 3903.
After startup, mtail automatically listens on port 3903 and exposes monitoring data conforming to the Prometheus protocol on the /metrics endpoint; Prometheus, Categraf, Telegraf, and similar collectors can then scrape the monitoring data from that endpoint.
Seen this way, the principle is very clear: after starting, mtail finds the relevant log files according to --logs, seeks to the end of each file, and begins a streaming read. For every line read, it checks the line against the rule files specified by --progs; when a rule matches, it extracts time-series data from the line, and the collected metrics are then exposed on port 3903 via /metrics.
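As a rough illustration of this read-match-count loop, here is a minimal Python sketch. It is not mtail's actual implementation, just the idea: stream lines, match a regex rule, update counters, and render them in Prometheus text format.

```python
import re

# Minimal sketch of the mtail principle (not mtail itself): stream log
# lines, match each against a regex rule, and maintain counter metrics.
ERROR_RE = re.compile(r"ERROR|error|Failed|faild")

def consume(lines, metrics):
    """Apply the rules to each line, incrementing counters on matches."""
    for line in lines:
        metrics["lines_total"] = metrics.get("lines_total", 0) + 1
        if ERROR_RE.search(line):
            metrics["error_count"] = metrics.get("error_count", 0) + 1
    return metrics

def render(metrics):
    """Expose the counters in Prometheus text format, as /metrics does."""
    return "\n".join(f"{name} {value}" for name, value in sorted(metrics.items()))
```

A real deployment would keep the file open, follow newly appended lines (like tail -f), and serve render()'s output over HTTP on port 3903.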
Rule grammar
The purpose of mtail is to extract information from logs and pass it to the monitoring system, so metric variables must be declared, named, and exported. Three metric types are supported: counter, gauge, and histogram, and the variable declaration must precede the COND script that uses it.
The standard format is:
COND {
ACTION
}
where COND is a conditional expression: either a regular expression or a boolean condition. For example:
/foo/ {
ACTION1
}
variable > 0 {
ACTION2
}
/foo/ && variable > 0 {
ACTION3
}
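All of the examples above guard counter-style actions, but an action can also assign. For instance, a gauge-type metric can be set directly from a captured value. A hypothetical sketch (the log format "queue depth: N" and the metric name are invented for illustration, not from the original article):

```
gauge queue_depth
/queue depth: (?P<depth>\d+)/ {
  queue_depth = $depth
}
```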
The operators available in COND expressions are as follows:
Relational operators:
< , <= , > , >= , == , != , =~ , !~ , || , && , !
Arithmetic operators:
| , & , ^ , + , - , * , / , << , >> , **
The operators available on exported metric variables are:
= , += , ++ , --
Rule examples
1. Export a counter-type metric lines_total that counts the number of log lines:
# simple line counter
counter lines_total
/$/ {
lines_total++
}
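Once Prometheus scrapes this counter, PromQL can turn it into a throughput figure, e.g. log lines written per second over the last five minutes (a sketch, assuming the metric is scraped as exported above):

```
rate(lines_total[5m])
```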
2. Export a counter-type metric error_count that counts log lines containing any of the four keywords ERROR, error, Failed, or faild:
counter error_count
/ERROR|error|Failed|faild/ {
error_count++
}
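This counter covers the "errors per day" requirement mentioned earlier. For example, the number of error lines in the past day, broken down by instance (a PromQL sketch):

```
sum by (instance) (increase(error_count[1d]))
```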
3. Export a counter-type metric out_of_memory_count that counts occurrences of memory overflow:
counter out_of_memory_count
/java.lang.OutOfMemoryError/ {
out_of_memory_count++
}
Once converted into a metric, it is easy to configure an alert rule with PromQL syntax:
groups:
  - name: memory.rules
    rules:
      - alert: OutOfMemoryError
        expr: increase(out_of_memory_count[1m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "java.lang.OutOfMemoryError"
          description: "{{ $labels.instance }} hit a JVM out-of-memory error"
4. Here I use mtail to monitor the log of n9e-server and extract the number of notifications triggered by each alert rule. An example log line:
2021-12-27 10:00:30.537582 INFO engine/logger.go:19 event(cbb8d4be5efd07983c296aaa4dec5737 triggered) notify: rule_id=9 [__name__=net_response_result_code author=qin ident=10-255-0-34 port=4567 protocol=tcp server=localhost]2@1640570430
Clearly the log contains the keyword notify: rule_id=9. We can match it with a regular expression, count the lines in which it appears, and also extract the rule id from it, reporting the rule id as a label. So we can write the mtail rule like this:
counter mtail_alert_rule_notify_total by ruleid
/notify: rule_id=(?P<ruleid>\d+)/ {
mtail_alert_rule_notify_total[$ruleid]++
}
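Scraped into Prometheus, this labelled counter answers questions like "which alert rules sent the most notifications in the past day" (a PromQL sketch):

```
topk(5, sum by (ruleid) (increase(mtail_alert_rule_notify_total[1d])))
```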
5. Java exception type statistics:
counter exception_count by exception, log
/(?P<exception>[A-Z]*(.[A-Za-z]*)*(Exception|Error)):(?P<log>.*)/ {
exception_count[$exception][$log]++
}
Then write a null-pointer exception and a JVM out-of-memory exception into the log file:
java.lang.NullPointerException: Some error message here.
at com.example.myapp.MyClass.someMethod(MyClass.java:123)
at com.example.myapp.OtherClass.doSomething(OtherClass.java:45)
java.lang.OutOfMemoryError: Java heap space
Dumping heap to d://\java_pid10000.hprof ...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at cn.intsmaze.dump.OOMDump$OOMIntsmaze.<init>(OOMDump.java:27)
at cn.intsmaze.dump.OOMDump.fillHeap(OOMDump.java:34)
at cn.intsmaze.dump.OOMDump.main(OOMDump.java:47)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
Heap dump file created [10195071 bytes in 0.017 secs]
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Access the /metrics endpoint to see the metric data:
# HELP exception_count defined at gj.mtail:7:9-23
# TYPE exception_count counter
exception_count{exception="java.lang.NullPointerException",log=" Some error message here.",prog="gj.mtail"} 3
exception_count{exception="java.lang.OutOfMemoryError",log=" Java heap space",prog="gj.mtail"} 2
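The rule's regular expression can be sanity-checked outside mtail. Here is a small Python check against the sample lines above (mtail itself uses RE2 syntax, but this particular pattern behaves the same under Python's re):

```python
import re

# The same pattern as in the mtail rule above; note the dots are
# unescaped, so `.` matches any character, not just a literal dot.
EXC_RE = re.compile(
    r"(?P<exception>[A-Z]*(.[A-Za-z]*)*(Exception|Error)):(?P<log>.*)"
)

m = EXC_RE.search("java.lang.NullPointerException: Some error message here.")
print(m.group("exception"))  # java.lang.NullPointerException
print(m.group("log"))        # " Some error message here." (leading space kept)
```

The leading space in the log group explains the leading space visible in the exported metric's log label.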
The exception label identifies the Java exception type, and the log label carries a brief exception description.
Note: the log label is not suitable for real production use, because its high cardinality will cause index bloat.
With exceptions converted into metrics, PromQL syntax makes it easy to count how many times each exception type is triggered per day, or to combine with Grafana for real-time statistics, trend lines, and so on.
[For more cloud-native monitoring and operation and maintenance, please follow the WeChat public account: Reactor2020]