Part 2. Nightingale n9e (Flashcat) monitoring V6: overview of page functions (monitoring data display, alarm function, personnel organization, system configuration)

One. The three n9e data source types:

1. Time series 2. Logs 3. Traces

(1) Time series data source

  1. Object list
    Displays machine metadata. Machines are grouped here by changing their business group, and alarm rules can later be configured per business group.
  2. Dashboard
    Provides built-in dashboards for some open source services. Dashboards can be cloned, and they support multiple chart styles.

3. Recording rules

(1) A PromQL query can be saved as a new metric and then used in dashboards. When several people view the dashboard, they query only that single precomputed metric, which reduces the query load on the time series database. (2) When multiple alarm rules need to compute the same expression, the PromQL can be saved as a new metric so that it is computed once, which also reduces load.
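The load argument above can be illustrated with a toy model. This is a minimal sketch, not Nightingale code: `FakeTSDB`, its methods, and the metric name `cpu_usage:avg` are all hypothetical, and `evaluate` merely stands in for running an expensive PromQL expression.

```python
class FakeTSDB:
    """Hypothetical stand-in for a time series database that counts raw evaluations."""

    def __init__(self, samples):
        self.samples = samples
        self.raw_evaluations = 0
        self.recorded = {}

    def evaluate(self):
        """Pretend to run an expensive PromQL expression over the raw samples."""
        self.raw_evaluations += 1
        return sum(self.samples) / len(self.samples)

    def run_recording_rule(self, name):
        """Evaluate the expression once and store the result as a new metric."""
        self.recorded[name] = self.evaluate()


# Without a recording rule: five viewers each run the raw expression.
direct = FakeTSDB([0.2, 0.4, 0.9, 0.5])
for _ in range(5):
    direct.evaluate()

# With a recording rule: one evaluation, then five cheap reads of the stored metric.
recorded = FakeTSDB([0.2, 0.4, 0.9, 0.5])
recorded.run_recording_rule("cpu_usage:avg")
values = [recorded.recorded["cpu_usage:avg"] for _ in range(5)]
```

In the sketch the raw expression runs five times in the first case and once in the second, which is the saving the recording rule is meant to provide.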


(2) Log data source

Log collection: configure the data source first.
Log search supports Elasticsearch (ES) query syntax.


(3) Trace data source

Distributed (link) tracing

Tracing is generally used in two scenarios: viewing how time is distributed across a request, and finding where the latency of an API request is high.

Configure a Jaeger-type data source.

Topology analysis

Shows the call relationships between modules.

Two. Alarm function

This part covers alarm rule configuration, built-in rules, shielding rules, subscription rules, active alarms, and historical alarms.

1. Built-in rules:

As with dashboards, there are built-in alarm rules recommended for common open source services, and they can be cloned.

2. Alarm rules:

Metric type alarms:
Level suppression: when a metric has multi-level threshold alarms, level suppression can be turned on. Among those rules, a higher-level alarm then suppresses the lower-level ones, and the lower-level notifications are not sent.
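The effect of level suppression can be sketched as follows. This is only an illustration of the idea, not Nightingale's implementation: the event fields `rule`, `target`, and `severity` are hypothetical, and the sketch assumes that a smaller severity number means a more severe alarm.

```python
def apply_level_suppression(events):
    """For each (rule, target) pair keep only the most severe event.

    Assumption for this sketch: a smaller severity number is more severe.
    """
    best = {}
    for event in events:
        key = (event["rule"], event["target"])
        if key not in best or event["severity"] < best[key]["severity"]:
            best[key] = event
    return list(best.values())


events = [
    {"rule": "disk_usage", "target": "host-a", "severity": 3},
    {"rule": "disk_usage", "target": "host-a", "severity": 1},
    {"rule": "disk_usage", "target": "host-b", "severity": 2},
]
kept = apply_level_suppression(events)
```

Here host-a triggers two levels of the same rule, and only the more severe one survives; host-b keeps its single event.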

Machine type alarms: machine lost (no longer reporting), machine cluster lost (an alarm is raised when a specified percentage of the machines in the cluster are lost), and machine time offset.
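The cluster-level check amounts to a percentage comparison. The following is a rough sketch under stated assumptions: the function name, the `last_seen` mapping of machine name to last report timestamp, and the timeout parameter are hypothetical and do not come from Nightingale.

```python
def cluster_lost_alarm(last_seen, now, timeout_s, percent_threshold):
    """Return True when the share of silent machines reaches the threshold.

    last_seen maps a machine name to the timestamp (seconds) of its last report.
    """
    if not last_seen:
        return False
    lost = [name for name, ts in last_seen.items() if now - ts > timeout_s]
    lost_percent = 100.0 * len(lost) / len(last_seen)
    return lost_percent >= percent_threshold


# Two of four machines have been silent for more than 30 seconds: 50% lost.
last_seen = {"a": 100, "b": 40, "c": 30, "d": 95}
fires_at_50 = cluster_lost_alarm(last_seen, now=100, timeout_s=30, percent_threshold=50)
fires_at_60 = cluster_lost_alarm(last_seen, now=100, timeout_s=30, percent_threshold=60)
```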


Execution frequency: how often the alarm rule is evaluated.
Duration: how long the rule's condition must keep being met before the alarm is triggered.
Effective time: the configuration supports receiving alarms only within specified time periods.
Only effective in this business group: only machines in this business group are matched by the rule.
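How execution frequency and duration interact can be sketched with a small function. This is an illustrative model, not Nightingale's evaluation logic; the function name and parameters are hypothetical, and the condition is assumed to be "value greater than threshold".

```python
def first_firing_time(samples, threshold, interval_s, duration_s):
    """Return the offset in seconds of the first evaluation at which the
    condition has held continuously for at least duration_s, or None.

    samples holds one value per evaluation, spaced interval_s apart.
    """
    breach_started = None
    for i, value in enumerate(samples):
        t = i * interval_s
        if value > threshold:
            if breach_started is None:
                breach_started = t
            if t - breach_started >= duration_s:
                return t
        else:
            # The condition stopped holding, so the duration counter resets.
            breach_started = None
    return None


# Evaluated every 15 s; the condition must hold for 30 s before firing.
fired_at = first_firing_time([95, 70, 92, 93, 96, 97], threshold=90,
                             interval_s=15, duration_s=30)
never_fired = first_firing_time([95, 70, 95, 70], threshold=90,
                                interval_s=15, duration_s=30)
```

The brief spike at the first evaluation does not fire, because the value drops back before the duration has elapsed.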

Notification media: the media offered here are the ones selected in the system configuration.

Observation duration: prevents a metric that fluctuates around the threshold from triggering frequent alarm and recovery notifications.
Notification interval: reduces the interference of frequent alarms.
Maximum number of sends: reduces the interference of frequent alarms.
Callback address: enables alarm self-healing. Configure a callback address and, after the alarm is triggered, it calls back to the fault self-healing platform. You can also use it to integrate your own channels.
Additional information (remarks): you can put a link to the response plan here, or a link to the corresponding dashboard.
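The notification interval and the maximum number of sends together decide whether a still-firing alarm is sent again. A minimal sketch of that decision, with hypothetical names and with the assumption (made only for this sketch) that a maximum of 0 means "no limit":

```python
def should_notify(now, last_sent_at, sent_count, repeat_interval_s, max_sends):
    """Decide whether an alarm that is still firing should be sent again.

    Assumption for this sketch: max_sends == 0 means there is no limit.
    """
    if max_sends and sent_count >= max_sends:
        return False
    if last_sent_at is None:
        # Nothing has been sent yet, so the first notification goes out.
        return True
    return now - last_sent_at >= repeat_interval_s


first = should_notify(now=0, last_sent_at=None, sent_count=0,
                      repeat_interval_s=3600, max_sends=3)
too_soon = should_notify(now=1800, last_sent_at=0, sent_count=1,
                         repeat_interval_s=3600, max_sends=3)
repeat = should_notify(now=3600, last_sent_at=0, sent_count=1,
                       repeat_interval_s=3600, max_sends=3)
capped = should_notify(now=99999, last_sent_at=0, sent_count=3,
                       repeat_interval_s=3600, max_sends=3)
```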


3. Alarm shielding:

  • Temporary shielding while an alarm is being handled (shield it directly from the alarm details).
  • Shielding while a service is being changed; periodic shielding is supported.
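Periodic shielding can be thought of as a tag match combined with a recurring daily time window. The sketch below illustrates that idea only; the function, the tag names, and the window representation are hypothetical rather than Nightingale's data model.

```python
from datetime import datetime, time


def is_shielded(event_tags, event_time, shield_tags, window_start, window_end):
    """True when the event carries every shield tag and falls in the daily window."""
    tags_match = all(event_tags.get(k) == v for k, v in shield_tags.items())
    t = event_time.time()
    if window_start <= window_end:
        in_window = window_start <= t < window_end
    else:
        # The window wraps past midnight, for example 23:00 to 02:00.
        in_window = t >= window_start or t < window_end
    return tags_match and in_window


shield = {"service": "api"}
night = is_shielded({"service": "api", "env": "prod"}, datetime(2023, 4, 18, 23, 30),
                    shield, time(23, 0), time(2, 0))
noon = is_shielded({"service": "api", "env": "prod"}, datetime(2023, 4, 18, 12, 0),
                   shield, time(23, 0), time(2, 0))
other = is_shielded({"service": "db"}, datetime(2023, 4, 18, 23, 30),
                    shield, time(23, 0), time(2, 0))
```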


4. Subscription rules:

  • Beyond the receivers configured in the alarm rule, other people may also need the alarm, for example the developers of the affected business. Subscription rules cover this case, and they can redefine the alarm level, notification media, and so on.
  • Subscription rules can also be used for alarm escalation: if the front-line staff have not dealt with an alarm within an hour, it can be escalated to the person in charge of the business.
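The escalation case above reduces to a check on how long an alarm has stayed unresolved. A toy sketch, assuming hypothetical event fields `first_triggered_at` and `recovered` and a one-hour default taken from the example in the text:

```python
def recipients_for(event, now, primary, escalation, escalate_after_s=3600):
    """Return who should be notified for an event at time `now` (seconds)."""
    unresolved_for = now - event["first_triggered_at"]
    if not event["recovered"] and unresolved_for >= escalate_after_s:
        # Still firing after the escalation delay: add the escalation contacts.
        return primary + escalation
    return primary


event = {"first_triggered_at": 0, "recovered": False}
after_30_min = recipients_for(event, 1800, ["oncall"], ["business-owner"])
after_1_hour = recipients_for(event, 3600, ["oncall"], ["business-owner"])
```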


5. Alarm events:

Historical alarms: export is supported.
Active alarms: the alarms that have not yet recovered.

Alarm events can be aggregated through configuration.
Format: field:<field to aggregate by> (equivalent to a group-by field). For example, severity is the field that holds the alarm level.
The following fields can be used:
[screenshot: list of available fields]
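Aggregating by a field behaves like a group-by. The sketch below parses a rule in the `field:<name>` format described above and groups toy events accordingly. The event layout and the `rule_name` field are assumptions for illustration, and only the `field:` prefix is handled.

```python
from collections import defaultdict


def aggregate(events, rule):
    """Group events by the field named in a 'field:<name>' aggregation rule."""
    kind, _, name = rule.partition(":")
    if kind != "field" or not name:
        raise ValueError(f"unsupported aggregation rule: {rule!r}")
    groups = defaultdict(list)
    for event in events:
        groups[event[name]].append(event["rule_name"])
    return dict(groups)


events = [
    {"rule_name": "cpu high", "severity": 1},
    {"rule_name": "disk full", "severity": 2},
    {"rule_name": "mem high", "severity": 1},
]
by_severity = aggregate(events, "field:severity")
```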

6. Fault self-healing

This applies if you use Nightingale's built-in alarm self-healing.
[Example] The configuration means calling the self-healing rule with ID 3, executed only on the n9e machine.

Execution history

Three. Personnel organization

1. Permission management

Create roles, configure corresponding permission points, and then assign corresponding roles to users
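The role-based model described above (permission points attached to roles, roles attached to users) can be sketched in a few lines. The role names, permission point names, and users below are hypothetical examples, not Nightingale's actual identifiers.

```python
# Hypothetical roles and permission points, for illustration only.
ROLE_PERMISSIONS = {
    "viewer": {"dashboards:view"},
    "operator": {"dashboards:view", "alert_rules:edit", "alert_mutes:edit"},
}
USER_ROLES = {"alice": {"operator"}, "bob": {"viewer"}}


def can(user, permission):
    """True when any role assigned to the user grants the permission point."""
    return any(
        permission in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


alice_can_edit = can("alice", "alert_rules:edit")
bob_can_edit = can("bob", "alert_rules:edit")
```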


2. User management

Users can configure contact methods such as DingTalk. Then add the users to an alarm receiving group. When an alarm rule associated with that group is triggered, notifications are also sent through the members' configured methods, such as DingTalk.
Token-based contact methods can be customized.

Four. System configuration

1. Notification settings

Global callback address: all Nightingale alarms are pushed to this address.
Notification script: calls an SMS or phone gateway through a custom script to implement SMS alarms, phone alarms, and so on.
Notification media: controls which notification media are shown in the alarm rules.
Contact information: controls which contact methods can be selected when a user is created.
SMTP: mail gateway.

Alarm self-healing: the address and other information behind the variables that are configured when Nightingale's built-in alarm self-healing is used.
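For the global callback address listed above, the receiving side can be any HTTP endpoint that accepts a POST. The following is a minimal sketch using only the Python standard library. It assumes the alarm arrives as a JSON body, and the field names `rule_name`, `severity`, and `is_recovered` are assumptions that should be checked against the payload your Nightingale version actually sends.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_event(body):
    """Build a one-line summary from a JSON alarm payload (field names assumed)."""
    event = json.loads(body)
    state = "recovered" if event.get("is_recovered") else "triggered"
    severity = event.get("severity", "?")
    rule_name = event.get("rule_name", "unknown rule")
    return f"[S{severity}] {rule_name} {state}"


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        print(summarize_event(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), CallbackHandler).serve_forever()
```

Calling `serve(8080)` starts the receiver, and the callback address would then point at that host and port. A real self-healing platform would dispatch on the event instead of printing it.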


2. Notification template


3. Single sign-on


4. System version


Origin blog.csdn.net/weixin_62173811/article/details/130213560