[Cloud Native Monitoring] Categraf: a unified monitoring data collector

Foreword

"The author has built a temporary environment on the public cloud, you can log in to experience it first:"

http://124.222.45.207:17000/login
账号:root/root.2020

Introduction

Categraf is a monitoring collection agent, similar to Telegraf, Grafana Agent, and Datadog Agent, that aims to provide data collection for all common monitoring targets. It adopts an all-in-one design: in addition to metrics collection, it also aims to support the collection of logs and traces (call chains).

The categraf component integrates five commonly used agents in an all-in-one design. This mainly addresses the operational burden of deploying many separate Prometheus exporters on every machine: all collection work can be done by a single agent, and log and trace collection can be included as well:

  • "ibex agent:" fault self-healing agent, mainly used to execute self-healing scripts;

  • "logs agent:" log collection agent;

  • "metrics agent:" indicator collection agent, adopts plug-in mode design;

  • "prometheus agent:" embedding prometheus sdk, implementation prometheus agentmode;

  • "traces agent:" link collection agent;

Compile and run

The categraf code is hosted on GitHub: https://github.com/flashcatcloud/categraf.

1. Download and compile:

git clone https://github.com/flashcatcloud/categraf.git
cd categraf
# optionally set a Go module proxy if the default proxy is unreachable
# export GO111MODULE=on
# export GOPROXY=https://goproxy.cn
go build

2. Packaging:

tar zcvf categraf.tar.gz categraf conf

A sample categraf.service file is also provided in the conf directory, so that you can use systemd to host categraf.
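A minimal sketch of putting that unit file to use (the commands below assume you run them from the unpacked categraf directory; you may need to edit the ExecStart path inside categraf.service to match your install location):

cp conf/categraf.service /etc/systemd/system/categraf.service
systemctl daemon-reload
systemctl enable --now categraf
systemctl status categraf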

3. Run

# test mode: just print metrics to stdout
./categraf --test

# test system and mem plugins
./categraf --test --inputs system:mem

# print usage message
./categraf --help

# run
./categraf

# run with specified config directory
./categraf --configs /path/to/conf-directory

# only enable system and mem plugins
./categraf --inputs system:mem

# use nohup to start categraf
nohup ./categraf &> stdout.log &

"test"

We often need to test the behavior of a particular collector and quickly check which metrics it outputs. For example, after configuring conf/input.mem/mem.toml, if you want to see which Linux memory metrics are collected, you can run: ./categraf --test --inputs mem

[root@swarm-worker02 categraf-v0.3.3-linux-amd64]# ./categraf --test  --inputs mem                
2023/05/15 23:41:48 I! tracing disabled
2023/05/15 23:41:48 metrics_agent.go:269: I! input: local.mem started
2023/05/15 23:41:48 agent.go:47: I! [*agent.MetricsAgent] started
2023/05/15 23:41:48 agent.go:50: I! agent started
23:41:48 mem_free agent_hostname=swarm-worker02 1264836608
23:41:48 mem_high_free agent_hostname=swarm-worker02 0
23:41:48 mem_sreclaimable agent_hostname=swarm-worker02 279576576
23:41:48 mem_swap_cached agent_hostname=swarm-worker02 0
23:41:48 mem_available agent_hostname=swarm-worker02 5180682240
23:41:48 mem_used agent_hostname=swarm-worker02 520003584
23:41:48 mem_total agent_hostname=swarm-worker02 6017568768
23:41:48 mem_available_percent agent_hostname=swarm-worker02 86.09261380691878
23:41:48 mem_write_back agent_hostname=swarm-worker02 0
23:41:48 mem_used_percent agent_hostname=swarm-worker02 8.64142320674847
23:41:48 mem_high_total agent_hostname=swarm-worker02 0
23:41:48 mem_low_total agent_hostname=swarm-worker02 0

ibex agent

The ibex agent is mainly used for alert self-healing. Fault self-healing automatically executes predefined operations according to a fault-handling plan, so that faults can be repaired without manual intervention. For example, if host disk usage is high, a script can be run to clean up files on the disk; if the CPU or memory usage of a PaaS component is high, a third-party API can be called to restart the component and release resources.
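For example, a disk-cleanup self-healing script might look like the sketch below; the log directory, threshold, and retention period are placeholders to adapt to your environment:

#!/bin/bash
# clean-disk.sh: example self-healing script that removes old log files when disk usage is high
LOG_DIR=/var/log/myapp   # placeholder path
USAGE=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$USAGE" -ge 85 ]; then
    find "$LOG_DIR" -type f -name "*.log" -mtime +7 -delete
    echo "disk usage was ${USAGE}%, removed *.log files older than 7 days"
fi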

Nightingale supports associating a callback address with an alert rule; when the alert fires, the callback invokes a third-party API to achieve self-healing, as shown in the figure below:

(figure: configuring a callback address in an alert rule)

Besides an HTTP address, the callback can also trigger a self-healing script via the ibex component, as shown in the figure below:

(figure: configuring an ibex self-healing script callback)

Here 1 is the ID of the corresponding self-healing script; you can look up the ID on the [Alarm Self-healing] - [Self-healing script] page:

(figure: the [Alarm Self-healing] - [Self-healing script] page)

Self-healing scripts are usually executed in one of two ways: over SSH, or by an agent on the client. Nightingale uses the latter, and the ibex agent has now been integrated into the categraf component:

(figure: ibex agent integrated into categraf)

The ibex-related configuration in categraf looks like this:

# whether to enable the alert self-healing agent
[ibex]
enable = false
## ibex flush interval
interval = "1000ms"
## n9e ibex server rpc address
servers = ["127.0.0.1:20090"]
## temp script dir
meta_dir = "./meta"

The data flow of fault self-healing core components is as follows:

(figure: data flow of the fault self-healing components)

Self-healing scripts and alert configurations are managed on the n9e side and distributed to the ibex server, which independently manages the distribution and execution of self-healing scripts. The ibex agent periodically pulls, via an RPC interface, the script tasks to be executed on the current host and runs them, then reports the execution results and output back to the ibex server, so that the n9e management UI can query and display task execution details from the ibex server.
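The loop described above can be sketched schematically in Go as follows. This is illustrative pseudocode of the pull/execute/report flow, not the real ibex code; the RPC method names and types are assumptions:

package mock

import (
	"log"
	"os/exec"
	"time"
)

// Task is a simplified stand-in for a self-healing script task pulled from ibex server.
type Task struct {
	ID     int64
	Script string // script content to be written under meta_dir and executed
}

// TaskClient abstracts the RPC calls to ibex server; the method names are illustrative only.
type TaskClient interface {
	PullTasks(hostname string) ([]Task, error)
	Report(taskID int64, exitCode int, output string) error
}

// runLoop periodically pulls the tasks assigned to this host, executes them,
// and reports results and output back, mirroring the data flow described above.
func runLoop(client TaskClient, hostname string, interval time.Duration) {
	for {
		tasks, err := client.PullTasks(hostname)
		if err != nil {
			log.Printf("pull tasks failed: %v", err)
			time.Sleep(interval)
			continue
		}
		for _, t := range tasks {
			out, err := exec.Command("sh", "-c", t.Script).CombinedOutput()
			exitCode := 0
			if err != nil {
				exitCode = 1
			}
			if rerr := client.Report(t.ID, exitCode, string(out)); rerr != nil {
				log.Printf("report task %d failed: %v", t.ID, rerr)
			}
		}
		time.Sleep(interval)
	}
}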

The ibex agent component has been integrated into the categraf component, so the ibex agent here is categraf.

The detailed core process of ibex fault self-healing is shown in the figure below:

(figure: detailed ibex fault self-healing process)

prometheus agent

Prometheus Agent mode is an experimental feature introduced in Prometheus v2.32.0. When it is enabled, no blocks are written to the local file system and data cannot be queried locally. If the remote endpoint cannot be reached because of a network failure, data is temporarily buffered on local disk, but only for about two hours.

When Prometheus agent mode is enabled, remote_write must be configured in prometheus.yml, while alertmanager and rules must not be configured: an agent node only collects metrics, and Prometheus will report an error at startup if alertmanager or rules are present.
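A minimal sketch of such a scrape configuration file is shown below; the job name, target, and remote_write URL are placeholders for your own environment, and the file deliberately contains no alerting or rule_files sections:

# in_cluster_scrape.yaml: scrape + remote_write only
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["127.0.0.1:9100"]

remote_write:
  - url: "http://127.0.0.1:9090/api/v1/write"   # placeholder remote write endpoint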

The prometheus agent is configured in conf/prometheus.toml:

[prometheus]
# whether to enable the prometheus agent
enable=false
# the original prometheus configuration file,
# or a newly created configuration file in prometheus format
scrape_config_file="/path/to/in_cluster_scrape.yaml"
## log level, supports debug | warn | info | error
log_level="info"
# the settings below can be left at their defaults
## wal file storage path, default ./data-agent
# wal_storage_path="/path/to/storage"
## wal retention duration in hours, default 2
# wal_min_duration=2

categraf's prometheus agent uses the Prometheus SDK to implement Prometheus agent mode, so existing Prometheus scrape configurations can be used directly:

(figure: reusing Prometheus scrape configuration in categraf)

As shown in the figure above, the typical prometheus agent scenario, as I understand it, is to use Prometheus's powerful service discovery to monitor cloud-native cluster nodes, cAdvisor container metrics, kube-state-metrics (ksm), and upper-level business components.

metrics agent

The metrics agent is the core module of categraf. It collects performance metrics from various components, much like the exporters in the Prometheus ecosystem, but categraf follows the all-in-one design idea and implements each collector as a plug-in.

The collection plug-ins live in the inputs directory of the code base. Each plug-in has its own subdirectory containing the collection code plus the related dashboard JSON (if any) and alert rule JSON (if any). The Linux-related dashboards and alert rules are not scattered across the cpu, mem, disk, etc. collector directories but are kept together in the system directory, which makes them easier to use.

Plug-in configuration files are placed in the conf directory, each with a name starting with input.. Every configuration file is well commented; if something is unclear, just read the code of the corresponding collector under the inputs directory. Go code is very easy to read: if you don't know what a configuration item does, searching for it in the collector code will usually give you the answer quickly.

categraf has implemented roughly 70+ plug-ins. How to use each plug-in is documented in its own README, so the specifics are not covered here. Instead, let's look at the core implementation principle of a plug-in (see the figure below), which is helpful for secondary development:

(figure: core implementation principle of a categraf plug-in)
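To make the plug-in model concrete, here is a heavily simplified, hypothetical sketch of an input plug-in in Go. The interface and type names are illustrative only and are not the exact categraf API; consult the inputs directory for the real definitions:

package mock

// Sample is an illustrative stand-in for a collected metric sample.
type Sample struct {
	Metric string
	Value  float64
	Labels map[string]string
}

// Input is a simplified stand-in for the plug-in interface: the metrics agent
// calls Gather on every enabled plug-in at its configured collection interval.
type Input interface {
	Init() error
	Gather(samples *[]Sample)
}

// MemInput is a toy collector that emits a single gauge.
type MemInput struct{}

func (m *MemInput) Init() error { return nil }

func (m *MemInput) Gather(samples *[]Sample) {
	// A real plug-in would read /proc/meminfo or call a client library here,
	// appending one Sample per metric; Labels become time-series labels.
	*samples = append(*samples, Sample{
		Metric: "mem_available_percent",
		Value:  86.09,
		Labels: map[string]string{"agent_hostname": "swarm-worker02"},
	})
}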

heartbeat

categraf can enable a heartbeat to the n9e backend and use it to report basic node information such as CPU, memory, and OS. The configuration is as follows:

# categraf heartbeat configuration; when enabled, categraf periodically sends node CPU, memory, and other info to the n9e backend in heartbeat packets
[heartbeat]
enable = true

# report os version cpu.util mem.util metadata
url = "http://127.0.0.1:17000/v1/n9e/heartbeat"

# interval, unit: s
interval = 10

# Basic auth username
basic_auth_user = ""

# Basic auth password
basic_auth_pass = ""

## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]

# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100

The heartbeat packet sends basic information about the host running categraf to the backend; the payload is as follows:

data := map[string]interface{}{
	"agent_version": version,                // categraf version
	"os":            runtime.GOOS,           // operating system
	"arch":          runtime.GOARCH,         // architecture
	"hostname":      hostname,               // host name
	"cpu_num":       runtime.NumCPU(),       // number of CPU cores
	"cpu_util":      cpuUsagePercent,        // CPU usage percentage
	"mem_util":      memUsagePercent,        // memory usage percentage
	"unixtime":      time.Now().UnixMilli(), // lets the backend compute the time offset between client and backend
}

After receiving a heartbeat packet, the server calculates the time offset between the backend and the categraf side with req.Offset = time.Now().UnixMilli() - req.UnixTime, writes the information into memory together with the offset, and flushes it to the cache once per second. The Redis key format is n9e_meta_$hostname, as shown in the figure below:

(figure: the n9e_meta_$hostname key in Redis)
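A schematic sketch of that server-side handling in Go is shown below. It is illustrative only, not the actual n9e source; the struct fields, Redis client, and TTL are assumptions:

package mock

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// HeartbeatReq mirrors the heartbeat payload shown above (field set assumed).
type HeartbeatReq struct {
	Hostname string  `json:"hostname"`
	CpuUtil  float64 `json:"cpu_util"`
	MemUtil  float64 `json:"mem_util"`
	UnixTime int64   `json:"unixtime"`
	Offset   int64   `json:"offset"`
}

// cacheHeartbeat computes the backend/client clock offset and stores the
// metadata under the key n9e_meta_$hostname, as described in the article.
func cacheHeartbeat(ctx context.Context, rdb *redis.Client, req *HeartbeatReq) error {
	req.Offset = time.Now().UnixMilli() - req.UnixTime // backend time minus client time

	payload, err := json.Marshal(req)
	if err != nil {
		return err
	}

	key := fmt.Sprintf("n9e_meta_%s", req.Hostname)
	// The TTL is an arbitrary illustrative value, not n9e's real setting.
	return rdb.Set(ctx, key, payload, 10*time.Minute).Err()
}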

