Prometheus monitoring (1)

The importance of monitoring to enterprises and operations

What is monitoring?

Monitoring is the set of actions that observe and record the behavior and state of a server.

What is alerting?

Alerting analyzes the monitored data. When the data becomes abnormal, or the server behaves in an unexpected way, the alerting mechanism notifies the operations staff or server administrators through some channel.
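As a minimal sketch of that idea (not how Prometheus itself evaluates rules), the check below fires only when several consecutive samples exceed a threshold. The function names, the sample values, and the 90% threshold are all assumptions made for illustration.

```python
from typing import Sequence


def should_alert(samples: Sequence[float], threshold: float,
                 min_consecutive: int = 3) -> bool:
    """Return True if the last `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def notify(message: str) -> str:
    # Placeholder channel (hypothetical): a real system would send e-mail,
    # SMS, or call a paging service here.
    return f"[ALERT] {message}"


cpu_usage = [42.0, 91.5, 93.0, 97.2]  # made-up CPU usage percentages
if should_alert(cpu_usage, threshold=90.0):
    print(notify("CPU usage above 90% for 3 consecutive samples"))
```

Requiring several consecutive samples is one simple way to avoid notifying people about a single short-lived spike.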

The relationship between monitoring and alerting

Data collection

The data that monitoring relies on is obtained through data collection. Besides monitoring, data collection also supports analyzing user behavior and formulating security policies.

Introduction to Prometheus

The advantages and disadvantages of Prometheus compared to traditional monitoring

Advantages

  • Fine monitoring granularity: data can be collected every 1 to 4 seconds
  • Monitoring can be deployed to a cluster quickly using scripts
  • A rich set of components, including exporters, the Pushgateway, and others
  • It is built on a mathematical model and offers many practical functions, which can express the monitoring logic for many complex business requirements
  • Attractive graphs and excellent visualization

Disadvantages

  • If the monitored cluster is too large, a single monitoring server becomes a performance bottleneck. Clustering is not currently supported, so this can only be
    worked around.
  • It consumes a lot of disk space. How much depends on the number of monitored clusters, the number of monitoring items, and how long the data is retained.
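The disk usage point can be made concrete with a rough estimate. The sketch below follows the sizing rule of thumb in the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The function name and the example numbers are assumptions, and real usage varies with the data.

```python
def estimate_disk_bytes(targets: int, series_per_target: int,
                        scrape_interval_s: float, retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough disk estimate: retention_seconds * samples_per_second * bytes_per_sample."""
    # Each time series yields one sample per scrape interval.
    samples_per_second = targets * series_per_target / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical setup: 100 targets, 500 series each, 15 s interval, 15 days retention.
estimate = estimate_disk_bytes(100, 500, 15, 15)
print(f"about {estimate / 1e9:.2f} GB")
```

Halving the scrape interval or doubling the retention time doubles the estimate, which is why all three factors listed above matter.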

Implementation of an ideal monitoring system

Monitoring system design (architecture)

The overall design is critical when building the system. If the design is poor, the monitoring system as a whole will not do its job.

The design part includes the following content:

• Evaluate the system's business processes, business types, and architecture.
  Every enterprise has different products, business directions, program code, and system architecture, so you need a reasonable understanding of the details of each before starting the design.

• Classify the monitoring items you need.
  They can generally be divided into business-level monitoring, system-level monitoring, network monitoring, program code monitoring, log monitoring, user behavior analysis monitoring, and other types of monitoring.

Monitoring type classification

Business monitoring

It can include user access QPS, DAU (daily active users), access status (HTTP status codes), access to business functions (login, registration, chat, upload, messages, SMS, search), product conversion rate, recharge amounts, user complaints, and so on. These are all macro-level (upper-level) concepts.

System monitoring

Mainly concerns the operating system. Basic monitoring items include CPU, memory, disk, I/O, TCP connections, and network traffic (Nagios plugins, Prometheus).
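To illustrate what a system-level item looks like once collected, here is a small sketch that measures disk usage and renders it in the Prometheus text exposition format. The metric name `demo_disk_used_ratio` and the helper names are hypothetical. In practice an existing exporter such as node_exporter would normally provide these metrics.

```python
import shutil
from typing import Dict, Optional


def _escape(value: str) -> str:
    # Label values must escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_gauge(name: str, value: float,
                 labels: Optional[Dict[str, str]] = None,
                 help_text: str = "") -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} gauge")
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines.append(f"{name}{label_str} {value}")
    return "\n".join(lines)


def disk_used_ratio(path: str = "/") -> float:
    """Fraction of the filesystem at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


print(format_gauge("demo_disk_used_ratio", disk_used_ratio("/"),
                   {"mountpoint": "/"}, "Used fraction of the filesystem"))
```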

Network monitoring

Monitoring the network status of the IDC (switches, routers, firewalls, VPN) is essential for Internet companies but is often neglected. Examples include connectivity between intranets (physical intranet, logical intranet, availability zones, the intranet IPs of newly created virtual machines) and the packet loss rate and latency of the external network.
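As a small illustration of the packet loss and latency items, the sketch below summarizes a batch of probe results. It assumes the probes (for example ping round trips) have already been taken elsewhere and that a lost packet is recorded as `None`. The function name and data are made up.

```python
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]]) -> dict:
    """Compute loss rate and average round-trip time; None marks a lost probe."""
    sent = len(rtts_ms)
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    loss_rate = (sent - len(replies)) / sent if sent else 0.0
    avg_rtt = sum(replies) / len(replies) if replies else None
    return {"sent": sent, "loss_rate": loss_rate, "avg_rtt_ms": avg_rtt}


# Four hypothetical probes, one of which was lost.
print(summarize_probes([10.0, None, 30.0, 20.0]))
```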

Log monitoring

Log monitoring is one of the biggest parts of monitoring (Splunk, ELK) and is often designed and built as a separate system. All types of logs need to be collected: syslog, software logs, network equipment logs, and user behavior logs.

Program monitoring

This generally requires working with developers to embed interfaces in the program that expose data directly, or to have the program write logs in a specific format.
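One possible shape for such a log format is a single JSON object per line, which is easy for a collector to parse. This is only a sketch; the field names (`ts`, `event`) and the example event are assumptions, not a format prescribed by the article.

```python
import json
import time


def format_event(event: str, **fields) -> str:
    """Serialize one application event as a single-line JSON log record."""
    record = {"ts": round(time.time(), 3), "event": event}
    record.update(fields)
    # sort_keys keeps the field order stable, which helps when diffing logs.
    return json.dumps(record, sort_keys=True)


# Hypothetical login event emitted by the application.
print(format_event("login", user_id=42, duration_ms=12.5, status="ok"))
```

A log collector can then load each line with `json.loads` instead of relying on fragile pattern matching.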

Construction of monitoring system

It includes the following steps:

  • Build a single-node monitoring server (Prometheus)
  • Deploy a client on a single node
  • Test the single-node client against the server
  • Deploy the collection programs on a single node
  • Deploy the collection programs across the cluster
  • Make the monitoring server highly available (HA) or move it to the cloud
  • Visualize the monitoring data with Grafana
  • Test the alerting system (PagerDuty)
  • Test the alert rules
  • Test monitoring and alerting together
  • Put monitoring into production

Preparation of data collection

Practical languages for writing data collection tools:
shell, Python, PHP, Go, etc.

Data collection method

  • One-time collection: the collection program runs once per collection and then exits. It is more stable and less prone to errors and performance bottlenecks, the development logic is simple, and it is
    quick to implement. For some collection items, however, it is not smart enough or thorough enough. Real-time log collection is an example: counting 200/5xx responses in a log file with one-time runs of diff and grep can be done, but it is slow and neither accurate nor intuitive enough.
  • Background collection: the collection program runs in the Linux background as a daemon process and collects data continuously, for example a daemon written in Python or Go.
    Advantages: high data accuracy, fine collection granularity, and easy management.
    Disadvantages: if the background collection program is not developed carefully, memory leaks, zombie processes, and performance bottlenecks can occur, and the development cycle is long.
  • Bridged collection: the collector runs as a background process but is not independent. It stays associated with the monitoring server, which obtains the collected data through it as a bridge.
    For example: NRPE for Nagios
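The one-time collection mode described above can be sketched as a short script that scans an access log once and counts responses by status class. The regular expression assumes a common or combined access log layout in which the status code follows the quoted request, and the sample lines are invented.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the three-digit status code that follows the quoted request line.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_classes(lines: Iterable[str]) -> Counter:
    """Count log lines per HTTP status class, e.g. '2xx' or '5xx'."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


sample = [
    '10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "POST /login HTTP/1.1" 502 0',
    '10.0.0.3 - - [01/Jan/2024:00:00:03 +0000] "GET /search HTTP/1.1" 200 1024',
]
print(count_status_classes(sample))
```

Because the function accepts any iterable of lines, it can also be given an open file object. A background collector would run the same logic in a long-lived loop, which is where the memory leak and process management risks listed above come in.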

Monitoring data analysis and algorithms

Good alert rules need sound algorithms behind them, and ideally a professional data analysis team provides those algorithms.
Operations staff cannot be expected to be experts in business-level monitoring algorithms, because these have little to do with the operating system and a great deal to do with data analysis.
For example, to monitor user access QPS precisely with Prometheus, you have to decide how to treat a rising QPS curve, a falling QPS curve, a sudden spike, and a comparison with historical data. These are all types of business-level monitoring thresholds, and working out a good algorithm for them requires the help of professional data analysts.
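To make the comparison with historical data concrete, here is one very simple approach: a z-score against a historical baseline. It is a sketch under assumed numbers, not the algorithm a data analysis team would necessarily choose, and the threshold of 3 standard deviations is an arbitrary default.

```python
from statistics import mean, stdev
from typing import Sequence


def classify_qps(history: Sequence[float], current: float,
                 z_threshold: float = 3.0) -> str:
    """Label the current QPS as 'spike', 'drop', or 'normal' versus history."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: any difference at all is a deviation.
        if current == mu:
            return "normal"
        return "spike" if current > mu else "drop"
    z = (current - mu) / sigma
    if z > z_threshold:
        return "spike"
    if z < -z_threshold:
        return "drop"
    return "normal"


# Hypothetical QPS values seen at the same time of day on previous days.
baseline = [100, 102, 98, 101, 99]
for qps in (101, 150, 50):
    print(qps, classify_qps(baseline, qps))
```

A real business metric usually has daily and weekly patterns, so the baseline would need to be taken from comparable time windows.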

Stability test

Whether it uses one-time collection or background collection, anything running on Linux has some impact on the system.

The stability test deploys the collector on a single node for a period of time and observes whether it affects the production environment.
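Part of that observation can be done from inside the collector itself. The sketch below measures the CPU time, wall-clock time, and peak memory of repeated collection runs using the standard `resource` module, which is available on Linux and other Unix systems. The function name and the stand-in workload are assumptions.

```python
import resource
import time
from typing import Callable


def measure_overhead(collect: Callable[[], object], runs: int = 5) -> dict:
    """Run `collect` several times and report the resources it consumed."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    for _ in range(runs):
        collect()
    wall_seconds = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_SELF)
    # User plus system CPU time spent by this process during the runs.
    cpu_seconds = ((after.ru_utime - before.ru_utime)
                   + (after.ru_stime - before.ru_stime))
    return {
        "runs": runs,
        "wall_seconds": wall_seconds,
        "cpu_seconds": cpu_seconds,
        "max_rss_kb": after.ru_maxrss,  # kilobytes on Linux
    }


# Stand-in workload; a real test would call the actual collection function.
print(measure_overhead(lambda: sum(range(100_000)), runs=3))
```

Numbers that keep growing between runs, especially peak memory, are an early sign of the leaks that long-running collectors are prone to.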

Monitoring automation

Batch deployment of monitoring clients, HA reinstallation of monitoring servers, modifications to monitoring items, and changes to the monitored clusters all call for extensive automation in place of manual work, which greatly reduces the maintenance cost of the monitoring system.

Here are a few examples:
Puppet (configuration file deployment)
Jenkins (CI, continuous integration and deployment)
CMDB (the top-level resource management platform and concept in operations automation), etc.
Making good use of tools like these lets you automate monitoring.
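As one small example of this kind of automation, a host inventory (for instance an export from a CMDB) can be turned into a target file for Prometheus file-based service discovery, which expects a JSON list of objects with `targets` and `labels`. The inventory structure, host addresses, and `env` label below are hypothetical. Port 9100 is the default node_exporter port.

```python
import json
from collections import defaultdict
from typing import Dict, List


def build_file_sd(hosts: List[Dict[str, str]], port: int = 9100) -> list:
    """Group inventory hosts by environment into file_sd target groups."""
    groups = defaultdict(list)
    for host in hosts:
        groups[host["env"]].append(f'{host["address"]}:{port}')
    return [
        {"targets": sorted(targets), "labels": {"env": env}}
        for env, targets in sorted(groups.items())
    ]


# Hypothetical inventory, e.g. exported from a CMDB.
inventory = [
    {"address": "10.0.0.11", "env": "prod"},
    {"address": "10.0.0.12", "env": "prod"},
    {"address": "10.0.1.21", "env": "test"},
]
print(json.dumps(build_file_sd(inventory), indent=2))
```

Regenerating this file whenever the inventory changes means new hosts are monitored without anyone editing the Prometheus configuration by hand.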

Graphical display

The collected data and prepared monitoring algorithms ultimately require a good graphic display to play their best role.

Origin blog.csdn.net/m0_49265034/article/details/132446943