[Cloud Native Monitoring] mtail lightweight log monitoring system

Foreword

"The author has built a temporary environment on the public cloud, you can log in to experience it first:"

http://124.222.45.207:17000/login
Account: root / root.2020

Introduction

"The three pillars of the observability platform: log monitoring, call chain monitoring, and metric monitoring. Among them, log monitoring is the most well-known, because our development system is basically inseparable from logs, and it is also the most common way to solve problems. The characteristic of the log is that it is a discrete event, because the generation of an event leads to the generation of a log, which provides more detailed clues for problem analysis and judgment.”

For example: if a program suddenly fails to connect to its MySQL database, the exception log (below) makes it easy to see that the connection failed because of Too many connections:

[ERROR] [2023-05-20 21:14:43] com.alibaba.druid.pool.DruidDataSource.init(629) | init datasource error
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: Too many connections
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:408)
    at com.mysql.jdbc.Util.getInstance(Util.java:383)

A Too many connections exception is usually caused by exceeding the maximum number of connections allowed by the MySQL server, and can be resolved by raising the connection limit or closing unused connections. This example shows the division of labor among the three pillars. Log monitoring is the most direct for analyzing abnormal situations, since it carries request parameters, exception events, and other detailed clues. Call-chain monitoring leans toward link-by-link analysis of request calls and is often used to find performance problems between services. Metric monitoring deals with numerical data, which suits statistical aggregation and trend lines of performance indicators; it takes experience to spot potential problems from metrics alone, and they usually serve as supporting evidence for a hypothesis formed while locating a problem. For the Too many connections failure above, you could check the MySQL connection-count metric for evidence, and use its trend line to see when the connection count started rising abnormally for further investigation.
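For instance, if mysqld_exporter is already scraping the database (an assumption, not something set up in this article), two PromQL queries are enough to compare the current connection count against the configured ceiling:

# current connection count vs. the configured limit (metric names from mysqld_exporter)
mysql_global_status_threads_connected
mysql_global_variables_max_connections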

"When it comes to log monitoring, everyone's first reaction may be ELK's solution or Loki's solution. Both of these solutions collect logs and send them to the center, where they are stored, viewed, and analyzed. The problem with this log solution is Using elasticsearch, the storage capacity is very large, and the full-text indexing overhead is huge. For example, the daily log storage capacity of an ELK monitoring system may exceed TB; another problem is that these logs are discrete and massive, which is not convenient to carry out aggregate statistics."

Meanwhile, in actual production there are often requirements like the following:

  • Count the daily call volume of a business interface from the logs, so as to obtain the order volume, payment amount, and so on;

  • Count the errors in the log: which errors are triggered each day, and how many times each one is triggered;

  • Alert based on log keywords:

    • the MySQL connection exception above: Too many connections

    • a JVM memory overflow: java.lang.OutOfMemoryError: Java heap space

    • a refused connection: java.net.ConnectException: Connection refused

These requirements do not need the full log stream, so they can be met by converting logs to metrics: transform and compute over the log stream to generate metrics, then let the monitoring system scrape them and use PromQL for aggregation and statistical analysis.

Here I will introduce mtail, a tool from Google. mtail reads logs in a streaming fashion and extracts metrics from them through regular-expression matching. This approach uses the computing power of the target machine. Another advantage is that it is non-invasive and requires no instrumentation in the business code; if the business program comes from a third-party vendor and we cannot change its code, mtail is a very good fit.

mtail installation and use

1. mtail installation:

[root@VM-4-14-centos tools]# mkdir /disk/tools/mtail
[root@VM-4-14-centos tools]# cd /disk/tools/mtail
[root@VM-4-14-centos mtail]# tar -zxvf mtail_3.0.0-rc51_Linux_x86_64.tar.gz

2. Start mtail:

[root@VM-4-14-centos mtail]# ./mtail --progs /disk/tools/mtail/conf --logs '/disk/tools/mtail/logs/*.log' --logs /var/log/messages --log_dir /disk/tools/mtail/logdir --poll_interval 250ms

"The core parameters are as follows:"

1. --progs: specifies a directory containing a set of *.mtail files; each .mtail file describes the regular-expression extraction rules

2. --logs: the list of log files to monitor. Multiple files can be separated with commas, the --logs flag can be given multiple times, or a directory with wildcards can be specified; a wildcard pattern must be single-quoted. For example: --logs a.log,b.log,c.log, or --logs a.log --logs b.log --logs c.log, or --logs '/export/logs/*.log'

3. --log_dir: the directory where the mtail component stores its own logs

4. --port: the HTTP port the mtail component listens on, default 3903
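In production you would normally run mtail under a process supervisor rather than from an interactive shell. A minimal systemd unit sketch, assuming the install path and flags from the example above:

# /etc/systemd/system/mtail.service -- minimal sketch, paths taken from the example above
[Unit]
Description=mtail log-to-metrics exporter
After=network.target

[Service]
ExecStart=/disk/tools/mtail/mtail --progs /disk/tools/mtail/conf --logs '/disk/tools/mtail/logs/*.log' --log_dir /disk/tools/mtail/logdir
Restart=on-failure

[Install]
WantedBy=multi-user.target

Reload systemd and enable it with: systemctl daemon-reload && systemctl enable --now mtail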

After startup, mtail listens on port 3903 and exposes monitoring data in the Prometheus exposition format on the /metrics endpoint; Prometheus, Categraf, Telegraf, or similar agents can then scrape the monitoring data from that endpoint.

From this, the principle is clear: after starting, mtail finds the relevant log files according to --logs, seeks to the end of each file, and reads it as a stream. Every time a line is read, it is matched against the rule files specified by --progs; if a rule matches, time-series data is extracted, and the collected metrics are exposed on port 3903 via /metrics.
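A minimal Prometheus scrape job for this setup might look like the following (the target address is an assumption; point it at wherever mtail actually runs):

scrape_configs:
  - job_name: "mtail"
    static_configs:
      - targets: ["localhost:3903"]   # mtail's default HTTP port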

Rule syntax

The purpose of mtail is to extract information from logs and hand it to the monitoring system. Therefore each metric variable must be declared and exported by name; metrics support three types, counter, gauge, and histogram; and the variable declaration must appear before the COND block in the script.

The standard format is:

COND {
  ACTION
}

where COND is a conditional expression. It can be a regular expression or a boolean conditional statement, as follows:

/foo/ {
  ACTION1
}

variable > 0 {
  ACTION2
}

/foo/ && variable > 0 {
  ACTION3
}

The operators available in COND expressions are as follows:

  • Relational operators:

< , <= , > , >= , == , != , =~ , !~ , || , && , !

  • Arithmetic operators:

| , & , ^ , + , - , * , / , << , >> , **

The operators available for exported metric variables are as follows:

= , += , ++ , --
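The = assignment operator is mainly useful with gauge metrics. A minimal sketch, assuming a hypothetical log line such as "queue length: 42":

# export a gauge and set it from a named capture group
gauge queue_length

/queue length: (?P<len>\d+)/ {
  queue_length = $len
}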

Rule examples

"1. Export a counter type indicator lines_total: the number of statistical log lines"

# simple line counter
counter lines_total
/$/ {
  lines_total++
}

"2. Export a counter-type indicator error_count: count the number of log lines with the four keywords ERROR, error, Failed, and faild"

counter error_count
 
/ERROR|error|Failed|faild/ {
  error_count++
}
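Once error_count is scraped, PromQL can turn it into a daily total or a simple alert condition, for example:

increase(error_count[1d])      # errors over the last 24 hours
rate(error_count[5m]) > 0.1    # example threshold on the sustained error rate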

"3. Export a counter-type indicator out_of_memory_count: count the number of occurrences of memory overflow"

counter out_of_memory_count
 
/java.lang.OutOfMemoryError/ {
  out_of_memory_count++
}

Once converted into a metric, it is easy to configure an alert rule with PromQL:

groups:
- name: memory.rules
  rules:
  - alert: OutOfMemoryError
    expr: increase(out_of_memory_count[1m]) > 0
    labels:
      severity: series
    annotations:
      summary: "java.lang.OutOfMemoryError"
      description: "{
    
    { $labels.instance }} 出现JVM内存溢出错误"

"4. Here I use mtail to monitor the log of n9e-server, and extract the number of notify triggered by each alarm rule. This log is an example:"

2021-12-27 10:00:30.537582 INFO engine/logger.go:19 event(cbb8d4be5efd07983c296aaa4dec5737 triggered) notify: rule_id=9 [__name__=net_response_result_code author=qin ident=10-255-0-34 port=4567 protocol=tcp server=localhost]2@1640570430

Clearly the log contains the keyword notify: rule_id=9. We can match it with a regular expression, count the matching lines, and extract the rule id at the same time, reporting it as a label. The mtail rule can be written like this:

counter mtail_alert_rule_notify_total by ruleid

/notify: rule_id=(?P<ruleid>\d+)/ {
    mtail_alert_rule_notify_total[$ruleid]++
}
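With the ruleid label in place, a PromQL query can show which alert rules notify most often, for example:

# top 5 alert rules by notifications over the last 24 hours
topk(5, increase(mtail_alert_rule_notify_total[1d]))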

"5, java exception type statistics"

counter exception_count by exception, log
/(?P<exception>[A-Z]*(.[A-Za-z]*)*(Exception|Error)):(?P<log>.*)/ {
   exception_count[$exception][$log]++
}

Then write a null-pointer exception and a JVM out-of-memory exception into the log file:

java.lang.NullPointerException: Some error message here.
        at com.example.myapp.MyClass.someMethod(MyClass.java:123)
        at com.example.myapp.OtherClass.doSomething(OtherClass.java:45)
java.lang.OutOfMemoryError: Java heap space
Dumping heap to d://\java_pid10000.hprof ...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at cn.intsmaze.dump.OOMDump$OOMIntsmaze.<init>(OOMDump.java:27)
 at cn.intsmaze.dump.OOMDump.fillHeap(OOMDump.java:34)
 at cn.intsmaze.dump.OOMDump.main(OOMDump.java:47)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
Heap dump file created [10195071 bytes in 0.017 secs]
 at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

Access the /metrics endpoint to see the metric data:

# HELP exception_count defined at gj.mtail:7:9-23
# TYPE exception_count counter
exception_count{exception="java.lang.NullPointerException",log=" Some error message here.",prog="gj.mtail"} 3
exception_count{exception="java.lang.OutOfMemoryError",log=" Java heap space",prog="gj.mtail"} 2

The exception label identifies the Java exception type, and the log label carries a short exception description.

Note: the log label is not suitable for real production use, because its high cardinality will cause index bloat in the time-series database.

Once exceptions are converted into metrics, PromQL makes it easy to count how many times each exception type is triggered per day, or to combine with Grafana for real-time statistics, trend lines, and so on.
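For example, a per-exception daily count:

# occurrences of each exception type over the last 24 hours
sum by (exception) (increase(exception_count[1d]))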


[For more on cloud-native monitoring and operations, follow the WeChat public account: Reactor2020]

Origin blog.csdn.net/god_86/article/details/130979779