[Cloud Native Monitoring] Categraf: A Unified Monitoring Data Collector
Foreword
The author has set up a temporary demo environment on the public cloud; you can log in and try it first:
http://124.222.45.207:17000/login
Account: root / root.2020
Introduction
Categraf is a monitoring collection agent, similar to Telegraf, Grafana Agent, and Datadog Agent, that aims to provide data collection capabilities for all common monitoring objects. It adopts an all-in-one design: beyond metrics collection, it also aims to support the collection of logs and traces.
The categraf component integrates five commonly used agents in this all-in-one design, which mainly solves the operations burden of having to deploy many Prometheus exporters on each machine: all collection work can be done with a single agent, and log and trace collection can be included as well.
- ibex agent: fault self-healing agent, mainly used to execute self-healing scripts;
- logs agent: log collection agent;
- metrics agent: metrics collection agent, designed around a plugin model;
- prometheus agent: embeds the Prometheus SDK to implement prometheus agent mode;
- traces agent: trace collection agent.
Compile and run
The categraf code is hosted on GitHub: https://github.com/flashcatcloud/categraf.
1. Download and compile:
# export GO111MODULE=on
# export GOPROXY=https://goproxy.cn
go build
2. Packaging:
tar zcvf categraf.tar.gz categraf conf
❝A sample categraf.service file is also provided in the conf directory, so that you can use systemd to host categraf.
❞
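As a reference, a minimal unit file along the lines of the shipped sample might look like the sketch below. The install paths here are assumptions for illustration; check conf/categraf.service in the release package for the authoritative version.

```ini
[Unit]
Description=Categraf monitoring agent
After=network.target

[Service]
Type=simple
WorkingDirectory=/opt/categraf
ExecStart=/opt/categraf/categraf --configs /opt/categraf/conf
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After copying it to /etc/systemd/system/, run `systemctl daemon-reload` and `systemctl enable --now categraf`.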
3. Run
# test mode: just print metrics to stdout
./categraf --test
# test system and mem plugins
./categraf --test --inputs system:mem
# print usage message
./categraf --help
# run
./categraf
# run with specified config directory
./categraf --configs /path/to/conf-directory
# only enable system and mem plugins
./categraf --inputs system:mem
# use nohup to start categraf
nohup ./categraf &> stdout.log &
"test"
We often need to test the behavior of a certain collector, and temporarily check which monitoring indicators the collector outputs. For example, if it is configured and conf/input.mem/mem.toml
wants to see what Linux memory indicators are collected, you can execute the command:./categraf --test --inputs mem
[root@swarm-worker02 categraf-v0.3.3-linux-amd64]# ./categraf --test --inputs mem
2023/05/15 23:41:48 I! tracing disabled
2023/05/15 23:41:48 metrics_agent.go:269: I! input: local.mem started
2023/05/15 23:41:48 agent.go:47: I! [*agent.MetricsAgent] started
2023/05/15 23:41:48 agent.go:50: I! agent started
23:41:48 mem_free agent_hostname=swarm-worker02 1264836608
23:41:48 mem_high_free agent_hostname=swarm-worker02 0
23:41:48 mem_sreclaimable agent_hostname=swarm-worker02 279576576
23:41:48 mem_swap_cached agent_hostname=swarm-worker02 0
23:41:48 mem_available agent_hostname=swarm-worker02 5180682240
23:41:48 mem_used agent_hostname=swarm-worker02 520003584
23:41:48 mem_total agent_hostname=swarm-worker02 6017568768
23:41:48 mem_available_percent agent_hostname=swarm-worker02 86.09261380691878
23:41:48 mem_write_back agent_hostname=swarm-worker02 0
23:41:48 mem_used_percent agent_hostname=swarm-worker02 8.64142320674847
23:41:48 mem_high_total agent_hostname=swarm-worker02 0
23:41:48 mem_low_total agent_hostname=swarm-worker02 0
ibex agent
The ibex agent is mainly used in alert self-healing scenarios. Fault self-healing automatically executes predefined operations according to a fault-handling plan, giving the system the ability to repair faults on its own. For example, if a host's disk usage is high, a script can be executed to clean up disk files; if the CPU or memory usage of a PaaS component is high, a third-party interface can be called to restart the component and release resources.
Nightingale supports associating a callback address with an alert rule; when the alert fires, the third-party interface behind that address is called to achieve alert self-healing, as shown in the figure below:
Besides an HTTP address, the callback can also invoke a self-healing script through the ibex component, as shown in the figure below:
Here, 1 is the ID of the corresponding self-healing script; you can look up the ID on the [Alert Self-healing] - [Self-healing scripts] page:
There are usually two ways to run self-healing scripts: SSH-based execution and agent-based execution on the client. Nightingale uses the latter, and the latest ibex agent has been integrated into the categraf component:
An example of the ibex-related configuration in the categraf component is as follows:
# whether to enable the alert self-healing agent
[ibex]
enable = false
## ibex flush interval
interval = "1000ms"
## n9e ibex server rpc address
servers = ["127.0.0.1:20090"]
## temp script dir
meta_dir = "./meta"
The data flow of fault self-healing core components is as follows:
Self-healing scripts and alert configurations all live on the n9e component side and are distributed to the ibex server component, which independently manages the distribution and execution of fault self-healing scripts. The ibex agent component periodically pulls, via an RPC interface, the script tasks to be executed on the current host and runs them, then reports the execution results and output back to the ibex server, so that the n9e management end can query and display task execution details from the ibex server.
❝The ibex agent component has been integrated into the categraf component, so the ibex agent here is categraf.
❞
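The pull-execute-report loop described above can be sketched in Go. This is a simplified illustration, not categraf's actual code: the Task and Result types and the runTask function are hypothetical, and a shell command stands in for a script distributed over RPC.

```go
package main

import (
	"fmt"
	"os/exec"
)

// Task is a hypothetical stand-in for a script task pulled from ibex server over RPC.
type Task struct {
	ID     int64
	Script string // script content distributed by ibex server
}

// Result is what the agent would report back to ibex server.
type Result struct {
	TaskID int64
	Output string
	Err    error
}

// runTask executes the script locally and captures its combined output,
// mirroring the "execute and report" step of the ibex agent.
func runTask(t Task) Result {
	out, err := exec.Command("sh", "-c", t.Script).CombinedOutput()
	return Result{TaskID: t.ID, Output: string(out), Err: err}
}

func main() {
	// In the real agent, tasks are pulled periodically via the RPC interface.
	task := Task{ID: 1, Script: "echo disk cleanup done"}
	res := runTask(task)
	fmt.Printf("task=%d err=%v output=%s", res.TaskID, res.Err, res.Output)
}
```

In the real component, the loop also deduplicates tasks already executed and streams stdout/stderr back so the n9e UI can display them.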
The detailed core process of ibex fault self-healing is shown in the figure below:
prometheus agent
Prometheus Agent is an experimental feature introduced in Prometheus v2.32.0. When it is enabled, blocks are not generated on the local filesystem and data cannot be queried locally. If the remote endpoint is unreachable due to a network problem, data is temporarily buffered on the local disk, but only for about two hours.
When prometheus agent mode is enabled, remote_write must be configured in prometheus.yml, while alertmanager and rules must not be configured: an agent node only collects metrics, and Prometheus will report an error at startup if alertmanager or rules sections are present.
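To illustrate the constraint above, a minimal agent-mode Prometheus configuration might look like the following sketch; the job name, target, and remote_write URL are placeholders, not values from this article:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["127.0.0.1:9100"]

# remote_write is mandatory in agent mode;
# alerting and rule_files sections must be absent.
remote_write:
  - url: "http://127.0.0.1:9090/api/v1/write"
```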
The prometheus agent configuration in conf/prometheus.toml:
[prometheus]
# whether to start the prometheus agent
enable=false
# the original prometheus configuration file,
# or a newly created prometheus-format configuration file
scrape_config_file="/path/to/in_cluster_scrape.yaml"
## log level, supports debug | warn | info | error
log_level="info"
# the following options can be left at their defaults
## wal file storage path, default ./data-agent
# wal_storage_path="/path/to/storage"
## wal retention duration, default value is 2 hours
# wal_min_duration=2
The categraf component uses the Prometheus SDK to implement prometheus agent mode, so existing Prometheus scrape configurations can be used directly:
As shown in the figure above, a typical prometheus agent scenario, as I understand it, is using Prometheus's powerful service discovery capabilities to monitor cloud-native cluster nodes, cAdvisor containers, kube-state-metrics (ksm), and upper-layer business components.
metrics agent
As the core module of the categraf component, the metrics agent collects performance metrics of various components, similar to Prometheus exporters, but the categraf component follows the all-in-one design idea and implements collectors in a plugin model.
Each collection plugin has its own subdirectory under the inputs directory of the code, containing the collection code, related monitoring dashboard JSON (if any), and alert rule JSON (if any). Linux-related dashboards and alert rules are not scattered across collector directories such as cpu, mem, and disk, but are placed together in the system directory for convenience.
Plugin configuration files are placed in the conf directory, each with a name starting with input.. Every configuration file has detailed comments; if anything is unclear, just read the corresponding collector's code in the inputs directory. Go code is very easy to read: if you don't know what a configuration option does, searching for it in the collector code will quickly give you the answer.
categraf has implemented about 70+ plugins. How to use each plugin is documented in its README, so specific usage is not covered here. Instead, let's look at the core implementation principle of the plugin model (see the figure below), which is helpful for secondary development:
heartbeat
The categraf component can enable a heartbeat to the n9e backend component, reporting basic information about the node, such as CPU, memory, and OS, in heartbeat packets. The configuration is as follows:
# categraf heartbeat configuration; when enabled, categraf periodically sends
# node CPU, memory, and other information to the n9e backend in heartbeat packets
[heartbeat]
enable = true
# report os version cpu.util mem.util metadata
url = "http://127.0.0.1:17000/v1/n9e/heartbeat"
# interval, unit: s
interval = 10
# Basic auth username
basic_auth_user = ""
# Basic auth password
basic_auth_pass = ""
## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]
# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100
The heartbeat packet sends basic information about the server where the categraf component runs to the backend, specifically:
data := map[string]interface{}{
    "agent_version": version,                // categraf version
    "os":            runtime.GOOS,           // operating system
    "arch":          runtime.GOARCH,         // architecture
    "hostname":      hostname,               // host name
    "cpu_num":       runtime.NumCPU(),       // number of CPU cores
    "cpu_util":      cpuUsagePercent,        // CPU usage percentage
    "mem_util":      memUsagePercent,        // memory usage percentage
    "unixtime":      time.Now().UnixMilli(), // lets the backend compute the clock offset between client and backend
}
After the server receives the heartbeat packet, it calculates the time offset between the backend and the categraf collector as req.Offset = (time.Now().UnixMilli() - req.UnixTime), writes the information into memory, and flushes it to Redis every 1 second. The Redis key format is n9e_meta_$hostname, as shown in the figure below:
[For more cloud-native monitoring and operation and maintenance, please follow the WeChat public account: Reactor2020]