[IT Operation and Maintenance] Basic introduction to Prometheus and monitoring platform deployment (Prometheus+Grafana)

1. Basic introduction to Prometheus

Prometheus (written in Go) is an open-source system that combines monitoring, alerting, and a time series database. The popularity of Kubernetes (commonly known as k8s) has driven the adoption of Prometheus. It can monitor hosts, services, and containers, supports many exporters for collecting data, and supports Pushgateway for reporting pushed data. Its performance is sufficient for clusters on the scale of tens of thousands of machines.
https://prometheus.io/docs/introduction/overview/

Time series data: data that records changes in the state of systems and devices in chronological order is called time series data. It appears in many scenarios, such as:

  • The most common example is the logs in our systems.
  • While a driverless vehicle is running, its longitude, latitude, speed, direction, distance to nearby objects, and so on must be recorded and analyzed continuously.
  • Driving trajectory data and traffic volume for each vehicle in a given area
  • Real-time transaction data in the traditional securities industry
  • Real-time operations monitoring data: network interface traffic graphs, the current status of services, and resource usage. For example, a sharp surge, a cliff-like drop, or a gap in a monitored value generally means there is a problem; whenever it happens, you should quickly find out what went wrong.

1.1. Main advantages of time series database

A time series database is mainly used to process data that carries time tags (data that changes in time order, that is, time-serialized data). Data with time tags is also called time series data.

  • Good performance

Relational databases perform poorly on large-scale data, which shows up clearly in I/O. NoSQL databases handle large-scale data better, but still not as well as a time series database.

  • Low storage costs

Because it stores data as metrics in key=value form (tag: keyword=value) and uses an efficient compression algorithm, the average storage cost is about 3.5 bytes per sample, so it saves storage space and effectively reduces I/O.

Prometheus stores time series data very efficiently: each sample takes only about 3.5 bytes. With millions of time series collected every 30 seconds and retained for 60 days, the data takes up about 200 GB (figure quoted from the official documentation).
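To get a feel for where such numbers come from, here is a minimal sketch of the usual capacity estimate: retention time in seconds, times samples ingested per second, times bytes per sample. The series count of exactly 1,000,000 is an assumption (the text only says "millions"), and the function name is made up for this example. The result depends heavily on the series count and the bytes per sample, so treat both this estimate and the quoted 200 GB as rough orders of magnitude.

```python
def estimate_disk_bytes(series: int, scrape_interval_s: float,
                        retention_days: float, bytes_per_sample: float) -> float:
    """Disk usage = retention seconds * samples per second * bytes per sample."""
    samples_per_second = series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Assumed workload: 1,000,000 series, scraped every 30 s, kept for 60 days,
    # at the 3.5 bytes per sample mentioned above.
    size = estimate_disk_bytes(1_000_000, 30, 60, 3.5)
    print(f"{size / 1e9:.1f} GB")
```

Plugging in your own series count and scrape interval gives a quick first guess before sizing a disk.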

1.2. Main features of Prometheus

  1. Multi-dimensional data model: data can be modeled and queried along multiple dimensions.

  2. Flexible query language: PromQL provides flexible queries, and an HTTP query interface makes it easy to combine with Grafana and other components to display data.

  3. No dependence on distributed storage: a single node with local storage is supported. The time series database built into Prometheus can store on the order of millions of samples per second. If a large amount of historical data must be kept, Prometheus can also be connected to a third-party time series database.

  4. Time series data is pulled over HTTP (the pull model), and an open metrics data standard is provided.

  5. The push model is also supported through an intermediate gateway.
    . Pull and push correspond to what Zabbix calls passive and active monitoring; only the names differ. Pull is the default: the monitoring host fetches the data from the monitored host. Push requires the support of an intermediate gateway.

  6. Target service objects are discovered through service discovery or static configuration.

  7. A variety of charts and interface displays are supported, and third-party tools such as Grafana can be used to display the data.

1.3. Prometheus monitoring principle

  1. Prometheus Server periodically scrapes metrics data from its targets.
  2. Every scrape target [host, service] must expose an HTTP endpoint for Prometheus to scrape on schedule. In other words, the target (usually through an exporter) publishes its monitoring data as an accessible web page, and Prometheus determines the state of the host by requesting that URL.

The advantages of the pull method are that it can automatically perform upstream monitoring and horizontal monitoring, requires less configuration, is easier to scale, is more flexible, and makes high availability easier to achieve. Simply put, pulling reduces coupling. In a push system, a failure to push data to the monitoring system can easily end up paralyzing the monitored system, and if many monitored hosts push data to the monitoring host at the same time, the monitoring host may well be unable to process it all. With pulling, the monitored side does not need to be aware that the monitoring system exists and stays completely independent of it, so data collection is controlled entirely by the monitoring system.
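The pull relationship can be sketched in a few lines. This is only a toy, not how Prometheus or an exporter is implemented: the handler plays the monitored target and merely answers GET /metrics, while scrape() plays the monitoring side and decides when to fetch. The metric name demo_requests_total and all function names are made up for the example.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class MetricsHandler(BaseHTTPRequestHandler):
    """The monitored target: it only answers GET /metrics and knows no monitor."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = b"demo_requests_total 42\n"  # hypothetical metric
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def scrape(url: str, timeout: float = 10.0) -> str:
    """The monitoring side: it initiates the request on its own schedule."""
    # Ignore proxy settings from the environment for this local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def demo() -> str:
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return scrape(f"http://127.0.0.1:{port}/metrics")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo(), end="")
```

Note that the target never opens a connection to the monitor; if the monitor is down, the target is unaffected.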

1.4. The meaning of the six major configuration sections of the Prometheus configuration file

  • The major sections of the Prometheus configuration file

    • global: global settings segment
    • scrape_configs: scrape configuration segment (Prometheus acts as a collector)
    • rule_files: alerting and pre-aggregation (recording) rule file segment
    • remote_read: remote query segment
    • remote_write: remote write segment
    • alerting: Alertmanager information segment
  • The picture below shows the Prometheus architecture and how each component interacts and collaborates.

[Image: Prometheus architecture diagram]

  • Most excellent open source projects are modular, so users decide which sections to enable for their business scenario.

Sections enabled | Purpose
scrape_configs | Collector; data is stored locally
scrape_configs + remote_write | Collector + forwarder; data is stored locally and in remote storage
remote_read | Querier; queries data in remote storage
scrape_configs + remote_read | Collector + querier; queries local data and remote storage data
scrape_configs + alerting + rule_files (alert rules) | Collector + alert trigger; queries local data, generates alerts, and sends them to Alertmanager
remote_read + alerting + rule_files (alert rules) | Remote alert trigger; queries remote data, generates alerts, and sends them to Alertmanager
remote_read + remote_write + rule_files (pre-aggregation rules) | Pre-aggregates metrics and writes the resulting metrics to remote storage
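  • Example configuration in YAML format
# Global configuration section
global:
  # Scrape interval
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  # Interval for evaluating alerting and pre-aggregation (recording) rules
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # Scrape timeout
  scrape_timeout: 10s
  # Query log, including timing statistics for each query stage
  query_log_file: /opt/logs/prometheus_query_log
  # Global label set
  # The labels below are added to data from this instance when it is sent to
  # external systems (remote storage, Alertmanager, federation)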
  external_labels:
    account: 'huawei-main'
    region: 'node1'

# Alertmanager information section
alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets:
      # Alertmanager listens on port 9093 by default (9090 is Prometheus itself)
      - "localhost:9093"

# Alerting and pre-aggregation rule files section
rule_files:
    - /etc/prometheus/rules/record.yml
    - /etc/prometheus/rules/alert.yml

# Scrape configuration section
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

# Remote read section
remote_read:
  # prometheus 
  - url: http://prometheus/v1/read
    read_recent: true

  # m3db 
  - url: "http://m3coordinator-read:7201/api/v1/prom/remote/read"
    read_recent: true

# Remote write section
remote_write:
  - url: "http://m3coordinator-write:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 10000
      max_samples_per_send: 60000
    write_relabel_configs:
      - source_labels: [__name__]
        separator: ;
        # Drop series whose metric name starts with one of these prefixes
        regex: '(kubelet_|apiserver_|container_fs_).*'
        replacement: $1
        action: drop
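To see which series the write_relabel_configs rule in the remote_write section would keep out of remote storage, the matching can be mimicked in a few lines. This is a sketch only: Prometheus uses RE2 and anchors relabel regexes at both ends, which re.fullmatch approximates for this simple pattern. The metric names in the usage example are illustrative.

```python
import re

# Same pattern as the write_relabel_configs rule in the example configuration.
DROP_PATTERN = re.compile(r"(kubelet_|apiserver_|container_fs_).*")


def is_dropped(metric_name: str) -> bool:
    """Relabel regexes are anchored at both ends; fullmatch mimics that."""
    return DROP_PATTERN.fullmatch(metric_name) is not None


def filter_for_remote_write(metric_names):
    """Return the metric names that would still be sent to remote storage."""
    return [name for name in metric_names if not is_dropped(name)]


if __name__ == "__main__":
    names = [
        "kubelet_running_pods",               # dropped: kubelet_ prefix
        "apiserver_request_total",            # dropped: apiserver_ prefix
        "node_load1",                         # kept
        "container_fs_reads_total",           # dropped: container_fs_ prefix
        "container_cpu_usage_seconds_total",  # kept: prefix is container_cpu_
    ]
    print(filter_for_remote_write(names))
```

Because of the anchoring, a name that merely contains kubelet_ somewhere in the middle is not dropped.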

2. Deploy prometheus monitoring platform

  • Install and deploy the Prometheus monitoring server
  • Monitor a remote machine
  • Monitor a service: mysql
Prometheus main package: wget https://github.com/prometheus/prometheus/releases/download/v2.11.1/prometheus-2.11.1.linux-amd64.tar.gz

Remote host monitoring plug-in (similar to zabbix-agent): wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz

MySQL service monitoring plug-in: wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.12.1/mysqld_exporter-0.12.1.linux-amd64.tar.gz
  • Experimental topology diagram
    prometheus experimental diagram.png

2.1. Deploy prometheus service monitoring terminal

[root@node1 ~]# tar xf prometheus-2.11.1.linux-amd64.tar.gz -C /usr/local/
[root@node1 ~]# cd /usr/local/prometheus-2.11.1.linux-amd64/
[root@node1 prometheus-2.11.1.linux-amd64]# ./prometheus --config.file=prometheus.yml &

Start testing
Prometheus_1.png

If you see this page, Prometheus has started successfully and is monitoring itself by default. Let's look at the monitoring status of this machine.
Prometheus_2.png

Click Status, then Targets, to see the monitored machines or resources
Prometheus_3.png

Once this machine is listed, you can also enter http://IP or domain name:9090/metrics in the browser, as the page indicates, to view the monitoring data.

Display the monitoring data
http://192.168.98.201:9090/metrics

Prometheus_node1_metrics4.png

If you can see this information, the monitoring data is being exposed and can be displayed normally. The URL shows that Prometheus gathers all of its own monitoring data and serves it as a web page that users can browse. The data follows the time series format, that is, key=value pairs of a metric name (with optional labels) and its value. These values are our monitoring metrics, but they are hard to analyze in this raw form; a graphical display makes them easier to read.
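To make the format concrete, the sketch below parses a few lines in the style of a /metrics page into (name, labels, value) tuples. It is a simplified reader written for this article, not the official parser: it ignores timestamps and does not unescape label values. The sample text and its values are illustrative.

```python
import re

# metric name, optional {labels}, then the value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

SAMPLE = """\
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.25
prometheus_http_requests_total{code="200",handler="/metrics"} 12
"""


def parse_metrics(text: str):
    """Return a list of (name, labels, value) tuples; comments are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and # HELP / # TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Each tuple is one sample of one time series; the metric name plus the label set identifies the series.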

Prometheus also provides charts of its own, which show the status of monitored items at a glance, although the built-in graphs are not very attractive.

Click Graph to display the charts below. Enter keywords in the search bar to match the monitoring items you want to see.

image20200225140312916.png
Here the input is process_cpu_seconds_total, which brings up the CPU usage chart. Note the Graph tab at the upper left of the result area: the page opens on the Console tab by default, so click Graph to see the chart.
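The same query can also be sent through the HTTP query interface mentioned in section 1.2. The sketch below builds the URL for an instant query against /api/v1/query and, when run directly, prints the result. Since process_cpu_seconds_total is a counter, the example wraps it in rate(...[5m]) to get CPU seconds used per second; that expression and the function names are choices made for this example, and the server address is the one used in this walkthrough.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, expr: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    query = urllib.parse.urlencode({"query": expr})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def instant_query(base_url: str, expr: str, timeout: float = 10.0):
    """Run the query and return the list found under data.result."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


if __name__ == "__main__":
    # Address from this walkthrough; adjust it to your own server.
    expr = "rate(process_cpu_seconds_total[5m])"
    for series in instant_query("http://192.168.98.201:9090", expr):
        print(series["metric"], series["value"])
```

This is the same interface Grafana uses later when Prometheus is added as a data source.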

2.2. Monitor a remote business machine

a. Install monitoring client

[root@node2 ~]# tar xf node_exporter-0.18.1.linux-amd64.tar.gz -C /usr/local/
[root@node2 ~]# cd /usr/local/node_exporter-0.18.1.linux-amd64/
[root@node2 node_exporter-0.18.1.linux-amd64]# ls
LICENSE  node_exporter  NOTICE

# Start in the background
[root@node2 node_exporter-0.18.1.linux-amd64]# nohup /usr/local/node_exporter-0.18.1.linux-amd64/node_exporter &
[1] 7281
[root@node2 node_exporter-0.18.1.linux-amd64]# nohup: ignoring input and appending output to 'nohup.out'


# Service port of the monitoring plug-in on the business machine
[root@node2 node_exporter-0.18.1.linux-amd64]# lsof -i :9100
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
node_expo 7281 root    3u  IPv6  42486      0t0  TCP *:jetdirect (LISTEN)

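# Verify: http://monitored-host-name:9100/metrics
http://192.168.98.202:9100/metrics
The data on this machine is now published as an accessible page, so open it in a browser and check whether the relevant data can be retrieved. If it can, everything is working.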

b. Add monitoring information in prometheus

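# After the monitored host is set up, add its information to the main Prometheus configuration file
[root@node1 prometheus-2.11.1.linux-amd64]# tail -4  prometheus.yml 

  - job_name: 'node2' # job name
    static_configs: # job configuration
    - targets: ['192.168.98.202:9100'] # scrape targets
 
 #### Mind the indentation: two spaces per level
     
 # Restart the service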
[root@node1 prometheus-2.11.1.linux-amd64]# pkill prometheus
[root@node1 prometheus-2.11.1.linux-amd64]# ./prometheus --config.file=prometheus.yml &


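Note: Prometheus may report this error on startup
**lock DB directory: resource temporarily unavailable"** 
Cause: Prometheus was not shut down cleanly and the lock file still exists (first make sure no other prometheus process is still running)
rm $prometheus_dir/data/lock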

c. Test verification

After the configuration is done, view the Prometheus page
Prometheus_6.png

The Status, Targets page shows that the monitored machine node2 (192.168.98.202) is now in the monitoring list, and its monitoring data can be viewed in the browser.

Prometheus_node2_metrics5.png

Enter http://192.168.98.202:9100/metrics in the browser to see the data.
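The check on the Targets page can be scripted as well. Prometheus serves the target list as JSON under /api/v1/targets; the sketch below only summarizes such a response and assumes the activeTargets entries carry scrapeUrl and health fields. The payload shown is a trimmed, illustrative example rather than captured output, and the function name is made up for this article.

```python
def target_health(payload: dict) -> dict:
    """Map each active target's scrape URL to its reported health."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return {t["scrapeUrl"]: t["health"] for t in targets}


# Trimmed, illustrative response in the shape assumed above.
EXAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://localhost:9090/metrics", "health": "up"},
            {"scrapeUrl": "http://192.168.98.202:9100/metrics", "health": "up"},
        ]
    },
}

if __name__ == "__main__":
    for url, health in target_health(EXAMPLE_PAYLOAD).items():
        print(f"{health:5} {url}")
```

A target whose health is not "up" is the first thing to look at when a chart stays empty.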

2.3. Monitor a service: mysql

Monitoring MySQL requires two things: MySQL must be present on the system, and a monitoring plug-in must be available. The plug-in has already been downloaded, so we first install MySQL and grant the privileges the plug-in needs to obtain all the required information, then set up the plug-in and modify the Prometheus configuration file.

a. Deploy the MySQL service

[root@node2 ~]# dnf -y install mariadb-server mariadb
[root@node2 ~]# systemctl enable mariadb
Created symlink from /etc/systemd/system/multi-user.target.wants/mariadb.service to /usr/lib/systemd/system/mariadb.service.
[root@node2 ~]# systemctl start mariadb

# Create the monitoring user
MariaDB [(none)]> grant select,replication client,process on *.* to 'hello'@'localhost' identified by '123456';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> flush privileges;
Query OK, 0 rows affected (0.00 sec)

b. Deploy monitoring plug-in

[root@node2 ~]# tar xf mysqld_exporter-0.12.1.linux-amd64.tar.gz -C /usr/local
[root@node2 ~]# vim /usr/local/mysqld_exporter-0.12.1.linux-amd64/.my.cnf
[root@node2 ~]# cat /usr/local/mysqld_exporter-0.12.1.linux-amd64/.my.cnf
[client]
user=hello
password=123456

# Start the exporter
[root@node2 ~]# nohup /usr/local/mysqld_exporter-0.12.1.linux-amd64/mysqld_exporter --config.my-cnf=/usr/local/mysqld_exporter-0.12.1.linux-amd64/.my.cnf &

[root@node2 ~]# lsof -i :9104
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
mysqld_ex 7698 root    3u  IPv6  46415      0t0  TCP *:peerwire (LISTEN)
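A listening port only shows that the exporter is running, not that it can log in to the database. mysqld_exporter reports this through the mysql_up gauge (1 when the database is reachable), which can be read from the text served on port 9104. The helper below is a small sketch for this article; it takes the page text as input so it can be tried without a live exporter.

```python
def mysql_is_up(metrics_text: str) -> bool:
    """Return True when the exporter output contains mysql_up with value 1."""
    for line in metrics_text.splitlines():
        if line.startswith("mysql_up "):
            return float(line.split()[1]) == 1.0
    return False  # metric missing: treat as not up


if __name__ == "__main__":
    # Illustrative page fragment; a real page contains many more metrics.
    page = "# TYPE mysql_up gauge\nmysql_up 1\n"
    print(mysql_is_up(page))
```

If mysql_up is 0, recheck the user and password in .my.cnf and the privileges granted in step a.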

c. Add monitoring in the prometheus main configuration file

# Jobs defined at the end of the configuration file
[root@node1 prometheus-2.11.1.linux-amd64]# tail -10 prometheus.yml 
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node2'
    static_configs:
    - targets: ['192.168.98.202:9100']
  
  - job_name: 'mariadb'
    static_configs:
    - targets: ['192.168.98.202:9104']

d. Restart the prometheus service

[root@node1 prometheus-2.11.1.linux-amd64]# pkill prometheus
[root@node1 prometheus-2.11.1.linux-amd64]# ./prometheus --config.file=prometheus.yml &

e. View services through the monitoring page

Prometheus_node2_mysql9.png

View the related charts on the Graph page

Prometheus_node2_mysql8.png

You can tick the stacked option to display the graph as a stacked chart.

3. Prometheus Grafana data display and alarm

The built-in display of Prometheus is rather plain, so let's switch to another way of displaying data: Grafana. Grafana is an open-source metrics analysis and visualization tool (it does no monitoring itself). It can query, analyze, and visualize the collected data, and it can also raise alarms.

3.1. Deploy grafana

a. grafana installation

Obtain the software package from the official Grafana website: https://grafana.com/

Package installation

[root@manage01 ~]# dnf -y localinstall grafana-6.6.1-1.x86_64...

Service start

# Start the service
[root@manage01 ~]# systemctl enable grafana-server
Created symlink from /etc/systemd/system/multi-user.target.wants/grafana-server.service to /usr/lib/systemd/system/grafana-server.service.
[root@manage01 ~]# systemctl start grafana-server

# Verify that it is running
[root@manage01 ~]# lsof -i :3000
COMMAND     PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
grafana-s 17154 grafana    8u  IPv6  53939      0t0  TCP *:hbci (LISTEN)

After Grafana has started successfully, you can open its page in a browser.

Enter http://IP or domain name:3000 in the browser

grafana1.png
An account and password are required: admin/admin (default)

When you see this page, Grafana has been installed successfully and is working.

Since this is your first login, after entering the account and password you must change the password for security reasons before you can proceed.
grafana2.png
After entering the new password twice, click Save to log in.

b. Grafana page settings-add prometheus data source

After a successful login, the page shows a getting-started guide. Follow it; the main task is to set up a data source for Grafana.

image20200225145354437.png

As the icons show, the steps are: set up a data source, create a dashboard, and add users.
Click Add data source to add a data source.

image20200225145454014.png

Select Prometheus to proceed to the next step.

image20200225153536052.png

The settings in the Auth section are mainly used together with HTTPS. If you use HTTPS, you need certificates, authentication, and so on, and this section has to be configured accordingly.

Fill in the information the page asks for. Errors here are generally caused by typing mistakes. Click Save & Test; a success message confirms that the data source was saved.

grafana6.png

The data source you just added is listed under Data Sources in the drop-down menu of the gear icon in the left navigation bar.

3.2. Draw graphics

a. Dashboard management

graphana7.png

After adding the data source, continue by adding a dashboard so that the data can be viewed as charts. Click New Dashboard.

grafana8.png

The picture shows that you can add a graph to the dashboard, and you can also choose a visualization style.

Either option works here; both provide the same functionality in the end.

Here we select the first one, Add Query.

grafana9.png

After entering the page, there are four icons on the left side, namely:

Data source
image20200225154121203.png

Chart
image20200225154146665.png

Settings
image20200225154213541.png

Alarm
image20200225154254663.png

Following the order of the icons, set up the data source first.

As shown above, in query A, enter the monitoring item you need. To add more items, use Add Query in the upper right corner. When the queries are set, click the chart icon to set the chart style.

grafana10.png

The chart page controls the style of the chart. The main items are:

  • Draw Modes: how data is drawn in the chart. There are three modes: bars, lines, and points.
  • Mode Options: the fill (shading) opacity of the chart and the thickness of the lines.
  • The last group controls whether chart stacking and percentage display are turned on.

After these settings are done, move on to the settings icon.

This page is mainly used to set the chart name and description.

grafana11_1.png

With that, the chart settings are done for now. The alarm settings are covered in detail in the alarm section.

Click Save to save the chart.

grafana11.png

When saving the chart, you are asked to enter a dashboard name. Enter Node2 here.

grafana12.png

After confirming everything is correct, click Save

grafana13.png

The dashboard is ready, and our graph is visible.

Next, set up users. Users are added through an invitation mechanism: we generate an invitation link and send it to the user, who then opens the link to register. Click the Add users button to begin.

grafana14.png

Add a user as required

grafana15.png

Click to invite users

image20200225161454447.png

Enter the user name, user role and click Invite

image20200225161653564.png

After clicking Invite, send the invitation link to the user, or open it in a browser yourself to confirm the invitation.
Switch to another host and open the link in the browser.

image20200225162223119.png

Enter your email address and user password and click Register

image20200225162409462.png

Go back to the host where you are logged in to Grafana as admin. After refreshing, the newly registered user appears. You can also delete the user or modify their permissions.

b. grafana settings – add graphics to monitor cpu load

Click ➕ in the left sidebar, then Choose Visualization

Select the Graph visualization

Enter data items as required:

  • node_load1: load average over the last minute
  • node_load5: load average over the last five minutes
  • node_load15: load average over the last fifteen minutes

Note: If multiple machines are monitored at the same time, the chart displays all of them. To display only one machine, filter the monitoring item by label.

The input format is as follows:

metric_name{instance="monitored-machine-IP:port"}

As shown below

image20200225163828382.png

Only that one machine is then displayed.
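The effect of such a filter can be sketched as follows. selector() renders the expression typed into the query field, and matches() shows what it means: a series is kept only if every listed label has exactly the given value. Both helpers are made up for this article; selector() does not escape quotes in values, and the second instance address in the example is hypothetical.

```python
def selector(metric: str, **labels: str) -> str:
    """Render metric{label="value",...} for exact-match label filters."""
    if not labels:
        return metric
    inner = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{inner}}}"


def matches(series_labels: dict, **wanted: str) -> bool:
    """True when every wanted label is present with exactly that value."""
    return all(series_labels.get(key) == value for key, value in wanted.items())


if __name__ == "__main__":
    print(selector("node_load1", instance="192.168.98.202:9100"))
    series = [
        {"instance": "192.168.98.202:9100", "job": "node2"},
        {"instance": "192.168.98.203:9100", "job": "node3"},  # hypothetical second host
    ]
    print([s for s in series if matches(s, instance="192.168.98.202:9100")])
```

Several labels can be combined in one selector, in which case all of them must match.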

c. grafana settings—use template charts to display MySQL monitoring

mysql monitoring template download
https://github.com/percona/grafana-dashboards

Template settings

# Add the dashboard settings to the Grafana configuration file
[root@manage01 ~]# vim /etc/grafana/grafana.ini 
[root@manage01 grafana]# tail -3 /etc/grafana/grafana.ini 
[dashboards.json]
enabled = true
path = /var/lib/grafana/dashboards

# Unpack the downloaded dashboards
[root@manage01 ~]# unzip grafana-dashboards-master.zip 


# Copy the dashboards to the configured location
[root@manage01 ~]# cd grafana-dashboards-master/
[root@manage01 grafana-dashboards-master]# cp -r dashboards /var/lib/grafana/
[root@manage01 dashboards]# vim /var/lib/grafana/dashboards/MySQL_Overview.json
# Search for pmm-singlestat-panel and replace it with singlestat
# Restart for the change to take effect
[root@manage01 grafana]# systemctl restart grafana-server.service 

Import templates through web interface

grafana20.png

Select ➕ in the left menu, then Import

Select the corresponding json file and import it

grafana21.png

After clicking Import, the dashboard is displayed

image20200225171755015.png

3.3 Grafana alarm

Set the alarm channel of grafana

image20200225174606552.png

Click the bell icon on the left, then Notification channels, then Add channel

image20200225175559736.png

  • Name: enter a name
  • Type: select webhook
  • Send on all alerts: if checked, all alarms are sent through this channel by default.
  • Include image: if checked, a screenshot is sent together with the alarm. The alarm notification used here does not support images, so leave it unchecked.
  • Disable Resolve Message: if checked, no message is sent when the state returns from alarm to normal. Leave it unchecked.
  • Send reminders: if checked, then in addition to the alarm sent when the state first changes to alarm, a repeated alarm is sent at intervals for as long as the alarm state persists.
  • Send reminder every: how often to send repeated alarms. The default here is 30 minutes.
  • Url: the address of the alarm server that receives the webhook
  • Http Method: select POST

When the settings are complete, click Send Test, then check for the alarm email in the mailbox of the address used when registering the account.
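If no alarm service is at hand, a tiny receiver is enough to watch what the webhook channel sends. The sketch below accepts POST requests with a JSON body and keeps them in memory; demo() posts one sample alarm to it locally. The field names ruleName and state in the sample are assumptions about the payload and should be checked against your Grafana version; the port 8080 and all names are arbitrary choices for this example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # alarms seen so far, kept in memory for the demo


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            alarm = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "body is not valid JSON")
            return
        received.append(alarm)
        self.send_response(200)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def demo():
    """Start the receiver on a free local port and post one sample alarm."""
    server = HTTPServer(("127.0.0.1", 0), AlarmHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/"
        body = json.dumps({"ruleName": "CPU load", "state": "alerting"}).encode()
        request = urllib.request.Request(
            url, data=body, headers={"Content-Type": "application/json"})
        # Ignore proxy settings from the environment for this local demo.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as resp:
            return resp.status, list(received)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    # Listen on all interfaces so Grafana on another host can reach it.
    HTTPServer(("0.0.0.0", 8080), AlarmHandler).serve_forever()
```

With the receiver running, its address (for example http://receiver-host:8080/) would go into the Url field of the channel.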

image20200225175125898.png

After setting up and verifying the channel, set an alarm on the chart

onealter7.png

Select the chart, click the drop-down menu next to the chart name, and choose Edit to enter the edit page.

onealter8.png

Select the bell icon, then Create Alert, to set an alert on the chart

onealter9.png

The alarm condition here uses the average CPU load. Because this is an experiment, the threshold is set to 0.5 so that the alarm is easy to trigger.
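The rule boils down to a simple comparison: average the load values in the evaluation window and alert when the average is above 0.5. The sketch below shows just that logic with made-up sample values. Treating an empty window as "not alerting" is an assumption for this example; in Grafana the no-data behaviour is a separate setting of the rule.

```python
from statistics import mean


def is_alerting(values, threshold: float = 0.5) -> bool:
    """True when the average of the window is above the threshold."""
    points = [v for v in values if v is not None]  # skip missing points
    if not points:
        return False  # assumption: an empty window does not alert
    return mean(points) > threshold


if __name__ == "__main__":
    print(is_alerting([0.10, 0.20, 0.15]))  # idle machine, average 0.15
    print(is_alerting([0.40, 0.90, 1.30]))  # under load, average about 0.87
```

A single spike above 0.5 does not trigger the rule by itself; the average over the window has to cross the line.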

onealter10.png

When the settings are complete, a threshold line appears on the chart. Click Save.

Next, increase the CPU load on node2

image20200225220905872.png

image20200225222218919.png

The alarm test is complete.

Origin blog.csdn.net/qq_45277554/article/details/130917620