Nightingale (Flashcat) V6 monitoring (2): a detailed tour of the Nightingale web console features, with examples

Table of contents

(1): How to forward data to multiple time series databases

(2): Configuration of the monitoring dashboard

(3): Alarm configuration management

  (1): Alarm rules

    ①: Basic configuration

    ②: Rule configuration: Metric and Host alarm types

    ③: Effective-time configuration

    ④: Notification configuration

  (2): Built-in rules

  (3): Blocking rules

  (4): Subscription rules

  (5): Active alarms

  (6): Historical alarms

(3): Time series metrics

(4): Log analysis

(5): Link tracing

(6): Infrastructure

(7): Alarm self-healing

(8): Personnel organization

(9): System configuration

  (1): Data source

  (2): Notification settings

(10): Email alarm integration example

  (1): Create an alarm rule

  (2): Configure SMTP for email alarms

  (3): Trigger an alarm


(1): How to forward data to multiple time series databases

The previous post covered the single-node deployment of the whole Nightingale stack. This post covers how to actually use it: the pages of the web console, what each feature does, and how the features are used in practice.

First, let's look at how to forward data to multiple time series databases, such as VictoriaMetrics and Prometheus.

Open the n9e configuration file (config.toml) and find the [[Pushgw.Writers]] section.

The Url here is the write address of the time series database. In the figure it is the address of a single-node VictoriaMetrics instance. To add a Prometheus database as well, copy the entire [[Pushgw.Writers]] section, paste the copy below the original, and set its Url to the Prometheus address.

Restart n9e and the addition is complete.
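As a rough illustration of the copy-and-paste step, the sketch below prints what the duplicated section could look like. It is only a sketch under assumptions: the two write URLs (VictoriaMetrics on port 8428 and Prometheus on port 9090, both with the /api/v1/write path) are assumed defaults rather than values from this setup, and a real [[Pushgw.Writers]] section carries more fields than Url. Prometheus must also be started so that it accepts remote write (recent versions use the --web.enable-remote-write-receiver flag).

```python
def render_writers(urls):
    """Render one [[Pushgw.Writers]] section per remote-write URL."""
    sections = []
    for url in urls:
        sections.append('[[Pushgw.Writers]]\nUrl = "{}"\n'.format(url))
    return "\n".join(sections)


# Assumed example addresses; replace them with your own databases.
WRITE_URLS = [
    "http://127.0.0.1:8428/api/v1/write",  # single-node VictoriaMetrics (assumed)
    "http://127.0.0.1:9090/api/v1/write",  # Prometheus (assumed)
]

if __name__ == "__main__":
    print(render_writers(WRITE_URLS))
```

Every section in the list receives the same pushed data, which is why adding a database only means adding one more section.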

(A side note on how to monitor more than one host.)

Suppose I have host A and host B, and I also want to monitor host B. I only need to unpack the categraf installation package on host B, open its configuration file, change the url in the following two fields to point at the IP address of n9e, and start categraf. That's it.
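The two fields are not named in the text. In a typical categraf conf/config.toml they are probably the url under [[writers]] and the url under [heartbeat]. Treat those section names, port 17000, and both paths in the sketch below as assumptions, and check them against your own file. The IP address is hypothetical.

```python
def categraf_urls(n9e_ip, port=17000):
    """Build the two n9e endpoints that categraf is assumed to need."""
    base = "http://{}:{}".format(n9e_ip, port)
    return {
        "writers.url": base + "/prometheus/v1/write",  # metrics push (assumed path)
        "heartbeat.url": base + "/v1/n9e/heartbeat",   # host heartbeat (assumed path)
    }


if __name__ == "__main__":
    # 192.168.1.10 is a hypothetical n9e address.
    for field, url in categraf_urls("192.168.1.10").items():
        print(field, "=", url)
```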

If host B also runs VictoriaMetrics on the same port as the time series database used by n9e, and a data source of that type has already been added in the n9e web console, it is recognized automatically and its data can be queried directly on the web page.

 

(2): Configuration of the monitoring dashboard

Next, the features of Nightingale are introduced one by one, starting with the dashboards.

First, open the built-in dashboards under the Dashboard menu and import the ones that match the data you collect. Here we choose Linux host monitoring. Nightingale can monitor many kinds of services, and every service listed in the category column has dashboards that can be imported.

 

Here you can see the available dashboards: the first is for data collected by Prometheus exporters, the second for Telegraf, and the third for Categraf. Because we have only set up monitoring for two Linux hosts, click the Telegraf and Categraf dashboards to see their contents.

Of course, you can also build your own dashboards for whatever data you want to monitor.

 

(3): Alarm configuration management

Alarm management is divided into several subsections: alarm rules, built-in rules, blocking rules, subscription rules, active alarms, and historical alarms.

(1): Alarm rules

Alarm rules, as the name implies, define the conditions under which an alarm is triggered, so that you can monitor the state of your services. Each rule has four parts: basic configuration, rule configuration, effective-time configuration, and notification configuration.

①: Basic configuration

Rule name: A name of your choice for the rule.

Additional tags: Tags in key=value format. If you attach a tag such as service=dream, you can filter on it in later rule processing, for example in blocking and subscription rules.

Remarks: Additional notes on the alarm rule.

②: Rule configuration: Metric and Host alarm types

(1): Metric type alarm

Associated data source: The time series database data source that has been configured.

Alarm condition: When a monitored host or service meets this condition, an alarm is raised.

PromQL: The query language of the Prometheus monitoring system, used to select and aggregate metric data stored in Prometheus. It is flexible and supports many kinds of metric analysis and retrieval. Here we write a trigger condition such as mem_available_percent < 50, so the alarm fires when available memory falls below 50%.

Trigger alarm: There are three severity levels. Level 1 is the most severe and has the highest priority, followed by level 2 and level 3. You can also add more conditions here, so that several conditions can trigger the alarm.

If level suppression is turned on, a higher-level alarm suppresses the lower-level ones, so we are not bothered by duplicate alarms for the same problem.
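Level suppression is easier to see with numbers. The sketch below is a toy model of the behaviour described here, not Nightingale's own code. The three thresholds for mem_available_percent are hypothetical.

```python
def fire_levels(value, conditions, suppress_lower=True):
    """Return the severity levels that fire for one metric value.

    conditions is a list of (level, threshold) pairs. A condition matches
    when value < threshold. Level 1 is the most severe.
    """
    matched = sorted(level for level, threshold in conditions if value < threshold)
    if suppress_lower and matched:
        return matched[:1]  # keep only the most severe matching level
    return matched


# Hypothetical thresholds for mem_available_percent.
CONDITIONS = [(1, 10), (2, 30), (3, 50)]

if __name__ == "__main__":
    print(fire_levels(5, CONDITIONS))                        # suppression on
    print(fire_levels(5, CONDITIONS, suppress_lower=False))  # suppression off
```

With suppression on, a value that satisfies all three conditions produces a single level 1 alarm instead of three alarms.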

(2): Host type alarm

There are three kinds of host alarm rules: machine lost connection, machine cluster lost connection, and machine time offset.

Machine lost connection: The machine can no longer be reached, because of a connectivity problem or because it has crashed.

Machine time offset: Monitors the machine clock. Monitoring depends on accurate time, and a large offset makes data from different machines hard to line up.

Machine cluster lost connection: Set a lost-connection ratio for the cluster. For example, an alarm is raised if 60% of the machines in a cluster have lost connection.

Machine filters:

Business group: Select the machines of one project group, or a project group's cluster.

Label: Filter by the labels attached to the monitored machines on the Infrastructure page.

Machine ID: Select one specific host or several hosts.

Host type alarms can also have multiple alarm conditions and level suppression.

Execution frequency: How often the rule is evaluated.

Duration: How many seconds the condition must hold before the alarm fires.
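Execution frequency and duration work together. The sketch below models that interaction, assuming semantics similar to the "for" clause of a Prometheus alerting rule: the condition must be met at every evaluation across the whole duration. Whether Nightingale counts in exactly this way is an assumption.

```python
def first_fire_time(evaluations, duration_s):
    """Return the timestamp at which the alarm fires, or None.

    evaluations is a time-ordered list of (timestamp_s, condition_met).
    """
    pending_since = None
    for ts, met in evaluations:
        if not met:
            pending_since = None  # the condition broke, so start over
            continue
        if pending_since is None:
            pending_since = ts
        if ts - pending_since >= duration_s:
            return ts
    return None


if __name__ == "__main__":
    # Evaluated every 15 seconds, duration 60 seconds.
    steady = [(0, True), (15, True), (30, True), (45, True), (60, True)]
    flapping = [(0, True), (15, True), (30, False), (45, True), (60, True)]
    print(first_fire_time(steady, 60))
    print(first_fire_time(flapping, 60))
```

A longer duration filters out short spikes at the cost of later alarms.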

③: Effective-time configuration

Effective time: The time windows during which the rule is active and alarms are sent to me.

④: Notification configuration

 

Notification medium: How the alarm is delivered, such as DingTalk, email, or mobile phone.

Alarm receiving group: Which group receives these alarms.

Enable recovery notification: Whether to send a notification when the alarm recovers.

Observation time: If the alarm condition only returns to normal briefly before the observation time has elapsed, the alarm is not treated as recovered. It is treated as recovered only when the condition has stayed normal for the whole observation time. Otherwise the alarm remains active. For example, suppose I set the observation time to 5 minutes and my CPU usage suddenly exceeds 80%, which meets my alarm condition. If, while I am dealing with the problem, CPU usage occasionally drops back to a normal value but rises again shortly afterwards, the alarm is not recovered. If CPU usage is still normal after 5 minutes, the alarm is recovered.
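The CPU example can be replayed in a few lines. This is a sketch of the recovery logic as described in the text, not Nightingale's implementation. The sample timestamps are made up.

```python
def recovery_time(evaluations, observe_s):
    """Return the timestamp at which a firing alarm counts as recovered.

    evaluations is a time-ordered list of (timestamp_s, condition_met)
    taken after the alarm fired. The condition must stay unmet for
    observe_s seconds in a row. Returns None if that never happens.
    """
    normal_since = None
    for ts, met in evaluations:
        if met:
            normal_since = None  # the problem came back, so restart the clock
            continue
        if normal_since is None:
            normal_since = ts
        if ts - normal_since >= observe_s:
            return ts
    return None


if __name__ == "__main__":
    # One sample per minute. True means CPU usage is above 80%.
    samples = [(0, True), (60, False), (120, False), (180, True),
               (240, False), (300, False), (360, False), (420, False),
               (480, False), (540, False)]
    print(recovery_time(samples, 300))
```

The short dip at 60 to 120 seconds does not count, because usage rises again at 180 seconds. Recovery is only declared after five uninterrupted normal minutes.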

Repeat notification interval: How long to wait after one notification before sending the next for the same alarm, for example 60 minutes.

Maximum number of notifications: The maximum number of times the notification is sent, so that it is not sent to us forever.
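The two settings above bound how many messages one long-running alarm produces. The sketch below shows that arithmetic. It is an illustrative model, and the function and parameter names are hypothetical.

```python
def notification_times(fired_at, resolved_at, interval_s, max_sends):
    """Return the timestamps at which notifications go out for one alarm."""
    times = []
    ts = fired_at
    while ts < resolved_at and len(times) < max_sends:
        times.append(ts)
        ts += interval_s
    return times


if __name__ == "__main__":
    # The alarm lasts 4 hours, repeats every 60 minutes, at most 3 sends.
    print(notification_times(0, 4 * 3600, 3600, 3))
```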

Callback address: When an alarm fires, Nightingale calls back a fault-handling platform with the alarm content so that the platform can act on it. For example, if the alarm says my disk is full, I can configure a callback to a platform here, such as Nightingale's own alarm self-healing platform. The platform receives the alarm content and handles it accordingly, for example by cleaning up the disk.

Additional information: Consists of a plan link, a dashboard link, and a description. Plan link: if we have prepared a plan in advance for this kind of failure, paste its link here. Dashboard link: the link to the dashboard of the faulty machine. Description: notes on this alarm rule.

(2): Built-in rules

These are alarm rules shipped with the Nightingale platform. If you are new to it, or find it tedious to write rules yourself, you can clone the rules here directly into a business group.

Rule sets are available for several different collectors.

After cloning, open Alarm rules, check the rules that now exist in that business group, and enable them.

 

(3): Blocking rules

After we receive an alarm and start working on the fault, we do not want Nightingale to keep pushing the same alarm and distracting us. We can set a blocking rule to mute the alarm while we handle the incident.

If there is an active alarm that we no longer want to be notified about, we can click its block button directly.

Then set the blocking time window, the tags to block, and so on.

Another scenario for blocking alarms is planned work: deploying or upgrading a service, changing host settings, shutting down a host, and so on. If we know that what we are about to do will trigger an alarm condition, we can create a blocking rule in advance so that those alarms are blocked automatically.

Blocking time: For example, if the upgrade or change takes 1 hour, set the blocking time to 1 hour.

Blocking event tags: For example, if several rules carry the tag service=dream, then once the blocking rule lists that tag, all alarms from rules with this tag are blocked for one hour.
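A blocking rule is essentially a time window plus a tag filter. The sketch below models that check. The dictionary layout is hypothetical and is not Nightingale's data model.

```python
def is_blocked(alarm_tags, alarm_time, rule):
    """Return True if the alarm falls inside the rule's window and tags."""
    in_window = rule["start"] <= alarm_time < rule["end"]
    tags_match = all(alarm_tags.get(k) == v for k, v in rule["tags"].items())
    return in_window and tags_match


# Hypothetical rule: block service=dream for one hour starting at t=0.
RULE = {"start": 0, "end": 3600, "tags": {"service": "dream"}}

if __name__ == "__main__":
    print(is_blocked({"service": "dream", "host": "a"}, 1800, RULE))
    print(is_blocked({"service": "dream", "host": "a"}, 4000, RULE))
```

Alarms with other tags, or alarms that arrive after the hour is over, are delivered as usual.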

(4): Subscription rules

The first application scenario: In a company, Xiao Wang works in the operations department and Xiao Li in the development department. Xiao Wang is responsible for collecting and handling business alarms. One day Xiao Li develops and launches his own service, called Dream-stack, and wants to receive its alarms himself. However, the alarm rule is controlled by Xiao Wang, and its alarm receiving group is Xiao Wang's operations team. Because the two get along well, Xiao Wang opens Nightingale's subscription rules and forwards every alarm from rules tagged Dream-stack to Xiao Li's receiving group. From then on, Xiao Li also receives the alarms for this service in real time.

In this way, all Dream-stack alarms are subscribed and forwarded to the people in Xiao Li's development team, with email as the notification medium. The subscription can also change the alarm level and the callback address for the forwarded alarms.

The second application scenario: In a company, Xiao Li works in the development department and Xiao Wang in the operations department. One day a service that has gone live suddenly reports a failure, and the receiving group of the alarm rule is Xiao Li's team. Xiao Li receives the alarm and starts working on the failure, but cannot resolve it. Getting anxious, he decides that if he cannot fix it he will hand the problem to Xiao Wang. They are brothers after all, and Xiao Wang is paid more anyway. So Xiao Li sets the "subscription duration exceeded" field of the subscription rule to 10 minutes. If the alarm is still unresolved after 10 minutes, it is pushed to Xiao Wang as a level 1 alarm.
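Both scenarios come down to the same check: does the alarm carry the subscribed tag, and has it been active long enough? The sketch below models that. The field names (tags, team, min_age_s, new_level) and the team names are hypothetical and chosen only for illustration.

```python
def subscribers_for(alarm, subscriptions):
    """Return (team, level) pairs that additionally receive this alarm."""
    receivers = []
    for sub in subscriptions:
        if any(alarm["tags"].get(k) != v for k, v in sub["tags"].items()):
            continue  # tag filter does not match
        if alarm["age_s"] < sub.get("min_age_s", 0):
            continue  # the alarm has not been active long enough yet
        receivers.append((sub["team"], sub.get("new_level", alarm["level"])))
    return receivers


SUBSCRIPTIONS = [
    # Scenario 1: forward Dream-stack alarms to the development team.
    {"tags": {"service": "Dream-stack"}, "team": "dev"},
    # Scenario 2: escalate to operations as level 1 after 10 minutes.
    {"tags": {"service": "Dream-stack"}, "team": "ops",
     "min_age_s": 600, "new_level": 1},
]

if __name__ == "__main__":
    alarm = {"tags": {"service": "Dream-stack"}, "level": 2, "age_s": 700}
    print(subscribers_for(alarm, SUBSCRIPTIONS))
```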

(5): Active alarms

Active alarms show, in real time, which alarms have fired and not yet recovered, that is, which faults are still ongoing.

(6): Historical alarms

As the name suggests, historical alarms are the complete record of past alarms, which makes it easy to review and summarize them.

(3): Time series metrics

Instant query: Write a PromQL statement, just as in Prometheus, to query the data you want.

Quick view: Click through the PromQL statements built into Nightingale to query the metrics you want without typing them.
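The same PromQL statement can also be run against the time series database directly, which helps when checking whether a query returns data at all. The sketch below only builds the request URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query). The base address is an assumed local Prometheus, and no request is actually sent.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL of a Prometheus instant query for a PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


if __name__ == "__main__":
    # http://127.0.0.1:9090 is an assumed address; use your own database.
    print(instant_query_url("http://127.0.0.1:9090", "mem_available_percent < 50"))
```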

(4): Log analysis

This page connects to a log analysis backend. Add an Elasticsearch data source under data sources in System configuration, and you can then view logs in real time here.

(5): Link tracing

As with log analysis, first add a Jaeger data source, then select it here to view and use traces directly.

Topology analysis is also supported.

 

(6): Infrastructure

View the monitored hosts, the machine list and its status, and so on.

(7): Alarm self-healing

Alarm self-healing is a powerful feature that handles alarms automatically. You create self-healing scripts that deal with and resolve faults. The scripts can be written to fit your actual environment, which is very friendly to operations staff because ordinary shell scripts and the like can be used.

Most of the fields are self-explanatory. As an example, here we wrote a script that runs a command to check a port. To use it, note the ID of the self-healing script we just created, which is 2 here.

Open the alarm rule configuration and find the callback address.

In ${ibex}/2, the leading ${ibex}/ is a fixed format and 2 is the ID of our self-healing script. Once this is saved, the tasks in the self-healing script are executed automatically whenever the alarm rule fires. The value of ${ibex} can be customized later in the notification settings under System configuration.
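To make the format concrete, the sketch below splits a callback of this shape into its fixed prefix and the script ID. It is an illustration of the format described here and not the parser Nightingale uses.

```python
import re

# Matches the fixed "${ibex}/" prefix followed by a numeric script ID.
IBEX_CALLBACK = re.compile(r"^\$\{ibex\}/(\d+)$")


def self_healing_script_id(callback):
    """Return the script ID from a ${ibex}/<id> callback, or None."""
    match = IBEX_CALLBACK.match(callback.strip())
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(self_healing_script_id("${ibex}/2"))
```

An ordinary HTTP callback address does not match the pattern, so it would be treated as a normal callback rather than a self-healing task.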

(8): Personnel organization

This section contains the modules for creating users, business groups, and teams, and for managing permissions.

After a user is created, it can be assigned to a team, and a team can be added to a business group. Permission management works by creating roles and granting each role its own set of permissions.

The role is the permission set assigned when a user is created.

Here you can also bind a user directly to a DingTalk robot or a WeCom (Enterprise WeChat) robot for alarm pushes.

How these contact methods are added is explained in the System configuration section below.

(9): System configuration

System configuration is also divided into several modules: data source, notification settings, notification template, single sign-on, alarm engine, and system version.

(1): Data source

Data sources configure the connection details and addresses of the backends Nightingale reads from. Common ones are Prometheus, VictoriaMetrics, and Elasticsearch (for log analysis). Once added, their data can be viewed and monitored in real time under time series metrics, log analysis, or link tracing.

(2): Notification settings

Notification settings are also divided into several small modules:

Callback address: The fault-handling platform configured in the alarm rules earlier. A callback configured here acts as a global callback.

Notification script: Configures custom notification channels, such as phone or SMS alarms. A local file path can also be used directly.

Notification medium: The alarm notification channels. You can add custom channels such as mobile phone, email, and DingTalk.

Contact method: Controls the contact methods available to users, such as mobile phone number or DingTalk robot. The contact methods a user fills in under Personnel organization above are the ones defined here, so robots or other contact methods are added here.

SMTP: The configuration for email alarms, that is, the mail gateway to connect to.

Alarm self-healing: The address behind ${ibex} (as in ${ibex}/2, used in the alarm rule above) is declared and customized here.

(10): Email alarm integration example

Here we run a small experiment to verify Nightingale's rule triggering and alarm push, using QQ Mail as the notification medium.

(1): Create an alarm rule

We create an alarm rule called test. It triggers when a machine has lost connection for more than 5 seconds, the notification medium is email, and the callback address points to the self-healing script with ID 2 that we created above.

 

(2): Configure SMTP for email alarms

First, enable the SMTP service in QQ Mail.

Next, open Nightingale's SMTP settings.

The format is:

Host = "smtp.qq.com"    # fixed value for QQ Mail
Port = 465              # the port is fixed as well
User = "[email protected]"     # your QQ email address
Pass = ""      # the authorization code generated when you enabled SMTP in QQ Mail
From = "[email protected]"     # also your QQ email address
InsecureSkipVerify = true     # leave at the default
Batch = 5           # leave at the default

Then save the settings.
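Before waiting for a real alarm, it can help to confirm that the mailbox credentials work at all. The sketch below is one possible check using Python's standard smtplib, independent of Nightingale. It assumes the same host and port as the settings above, and the addresses and authorization code passed in are your own values. The subject and body text are made up for the test.

```python
import smtplib
from email.message import EmailMessage


def build_test_mail(sender, recipient):
    """Create a small plain-text test message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Nightingale SMTP test"
    msg.set_content("If you can read this, the SMTP settings work.")
    return msg


def send_test_mail(user, auth_code, recipient, host="smtp.qq.com", port=465):
    """Send one test mail over implicit TLS, as used on port 465."""
    msg = build_test_mail(user, recipient)
    with smtplib.SMTP_SSL(host, port, timeout=10) as server:
        server.login(user, auth_code)  # auth_code is the SMTP authorization code
        server.send_message(msg)
```

Calling send_test_mail with your QQ address, the authorization code, and any recipient should deliver one message. A login error at this point indicates a problem with the mailbox settings rather than with Nightingale.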

(3): Trigger an alarm

We have two hosts locally, and our alarm rule fires when a machine loses connection. We shut down one of the Linux hosts, wait 5 seconds, and receive the email alarm notification.

 

That wraps up the second Nightingale post. I worked on it all night, and I believe it is one of the most detailed tutorials on the web console you will find. Readers, please like and share this homegrown gem, Nightingale! People who have tried it speak highly of it. In my view, its main advantage is that it brings monitoring data from multiple sources together on the Nightingale (n9e) dashboards, which makes checking everything very convenient, and the alarm self-healing feature is also very powerful. This series will continue. The next post, due within a week, will cover MySQL monitoring and some other Nightingale features.

 

 

 

 


Origin blog.csdn.net/m0_61323675/article/details/130205190