One. The three n9e data sources:
1. time series 2. log 3. trace
(1) Time series data source
- Object list
Displays machine metadata. Machines are grouped here by changing their business group. Alarm rules configured later can then be scoped by business group.
- Dashboard
Provides built-in dashboards for some open source services. Dashboards can be cloned.
Dashboards support multiple chart styles.
- Recording rules
(1) You can save a PromQL query as a new metric and use it in dashboards. When multiple people query, they then read a single precomputed metric, which reduces query load on the time series database.
(2) When multiple alarm rules need to calculate the same expression, you can save the calculated PromQL as a new metric to reduce load.
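The effect of a recording rule can be illustrated with a small Python sketch. This is a toy model, not n9e or Prometheus code: `raw_series`, `expensive_query` and the metric name `job:cpu_usage:avg` are made up for illustration. The expression is evaluated once and stored under a new name, so every later reader does a cheap lookup.

```python
# Toy model of a recording rule: evaluate an expensive expression once,
# store the result under a new metric name, and let readers look it up.
# All names here are hypothetical.

raw_series = {
    "cpu_usage{host='a'}": 40.0,
    "cpu_usage{host='b'}": 60.0,
}

evaluations = 0  # counts how often the expensive expression is evaluated


def expensive_query() -> float:
    """Stand-in for a PromQL expression such as avg(cpu_usage)."""
    global evaluations
    evaluations += 1
    return sum(raw_series.values()) / len(raw_series)


recorded = {}  # metric name -> precomputed value


def run_recording_rule() -> None:
    """Evaluate the expression once and save it as a new metric."""
    recorded["job:cpu_usage:avg"] = expensive_query()


run_recording_rule()
# Three dashboards and two alarm rules all read the recorded metric.
readers = [recorded["job:cpu_usage:avg"] for _ in range(5)]
```

Five readers are served by a single evaluation; without the recorded metric each of them would have re-run the expression against the raw series.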
(2) Log data source
Log collection: first configure the data source.
Log retrieval supports Elasticsearch (ES) query syntax.
(3) Trace data source
- Link tracing
Trace is generally used in two scenarios: viewing how time is distributed across a request, and finding where API request latency is high.
Configure a Jaeger type data source.
- Topology analysis
Shows the calling relationships between modules.
Two. Alarm functions
Alarm rule configuration, built-in rules, shielding rules, subscription rules, active alarms, historical alarms
1. Built-in rules:
Similar to dashboards, alarm rules also come with built-in rules recommended for some open source services, which can be cloned.
2. Alarm rules:
Metric type alarms :
Level suppression : when a metric has alarm thresholds at several levels, level suppression can be turned on. Among the configured rules, a higher-level alarm suppresses the lower-level ones, and notifications for the lower-level alarms are not sent.
Machine type alarms : machine lost connection, machine cluster lost connection (alarms when a specified percentage of machines in the cluster lose connection), machine time offset
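Level suppression and the cluster lost-connection check can be sketched in Python as follows. This is only an illustration, not how n9e implements them. The dict keys `metric`, `target` and `severity` are hypothetical, a smaller severity number is assumed to mean a more severe level, and "at least the given percentage" is an assumed reading of the threshold.

```python
# Assumption: a smaller number means a more severe level (1 = most severe).


def suppress_lower_levels(firing):
    """Keep only the most severe firing alarm per (metric, target).

    `firing` is a list of dicts with the hypothetical keys
    'metric', 'target' and 'severity'.
    """
    best = {}
    for alarm in firing:
        key = (alarm["metric"], alarm["target"])
        if key not in best or alarm["severity"] < best[key]["severity"]:
            best[key] = alarm
    return list(best.values())


def cluster_lost(total: int, lost: int, threshold_pct: float) -> bool:
    """True when at least threshold_pct percent of machines lost connection."""
    if total == 0:
        return False
    # Compare without dividing to avoid floating-point rounding surprises.
    return lost * 100 >= threshold_pct * total
```

With two firing alarms on the same metric and machine, only the more severe one is returned, so only one notification would be sent for it.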
Execution frequency : how often the alarm rule is evaluated
Duration : how long the rule's condition must hold before the alarm is triggered
Effective configuration : alarms can be limited to a specified time period
Only takes effect in this business group : only machines in this business group will match these alarms
Notification media : which media are displayed here is selected in the system configuration
Observation duration : prevents a metric that fluctuates around the threshold from causing frequent alarm triggers and recoveries
Repeat notification interval : reduces the interference of frequent alarms
Maximum number of sends : reduces the interference of frequent alarms
Callback address : can be used for alarm self-healing. With a callback address configured, a triggered alarm calls back to the fault self-healing platform. You can also configure your own channels.
Additional information : remarks. You can put the runbook link or the corresponding dashboard link here.
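How the duration and the observation duration interact can be shown with a small replay function. This is a minimal sketch under assumed semantics, not Nightingale's implementation. The function name `evaluate` and its parameters are hypothetical, the condition is assumed to be "value above threshold", and times are in seconds.

```python
def evaluate(samples, threshold, duration, observe):
    """Replay (timestamp, value) samples and return alarm state changes.

    duration: seconds the value must stay above threshold before firing.
    observe:  seconds the value must stay at or below threshold before
              the alarm is considered recovered.
    """
    events = []
    firing = False
    breach_since = None   # when the current breach started
    normal_since = None   # when the value last returned to normal
    for ts, value in samples:
        if value > threshold:
            normal_since = None
            if breach_since is None:
                breach_since = ts
            if not firing and ts - breach_since >= duration:
                firing = True
                events.append((ts, "triggered"))
        else:
            breach_since = None
            if firing:
                if normal_since is None:
                    normal_since = ts
                if ts - normal_since >= observe:
                    firing = False
                    normal_since = None
                    events.append((ts, "recovered"))
    return events
```

A brief dip below the threshold while the alarm is firing does not produce a recovery, because the value has not stayed normal for the whole observation duration.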
3. Alarm shielding:
- Temporarily shield an alarm while handling it (shield it directly from the alarm details)
- Enable alarm shielding during service changes. Periodic shielding is supported.
4. Subscription rules:
- Besides the receivers configured in the alarm rule, others may need the same alarms, for example the developers of the corresponding business. You can configure a subscription rule for them. A subscription rule can redefine the alarm level, media, etc.
- It can also be used for alarm escalation. If the front-line staff have not handled an alarm within an hour, it can be escalated to the person in charge of the business.
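The escalation case can be sketched in Python. This is an illustration of the idea only. The `ActiveAlarm` class, its fields and the `escalate` function are hypothetical and do not reflect n9e's data model.

```python
from dataclasses import dataclass


@dataclass
class ActiveAlarm:
    rule: str
    triggered_at: int   # Unix seconds
    receivers: list     # names of people already being notified


def escalate(alarms, now, after_seconds, owner):
    """Add `owner` as a receiver of alarms active for `after_seconds` or more.

    Returns the rule names of the alarms that were escalated.
    """
    escalated = []
    for alarm in alarms:
        overdue = now - alarm.triggered_at >= after_seconds
        if overdue and owner not in alarm.receivers:
            alarm.receivers.append(owner)
            escalated.append(alarm.rule)
    return escalated
```

Calling `escalate(alarms, now, 3600, "business_owner")` on a schedule would add the business owner to every alarm that has been active for an hour, and would skip alarms that were already escalated.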
5. Alarm events:
Historical alarms : support exporting
Active alarms : alarms that have not yet recovered
Alarm events can be aggregated through configuration.
Format: field:<field to aggregate by>
(equivalent to a GROUP BY field). The severity in the figure is the alarm level field.
Other alarm event fields can be used in the same way.
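Aggregating by a field works like a GROUP BY, which a few lines of Python can show. The event dicts and their keys below are made up for illustration. Only `severity` is named in these notes.

```python
from collections import defaultdict


def aggregate(events, field):
    """Group alarm events by the value of one field (like GROUP BY)."""
    groups = defaultdict(list)
    for event in events:
        # Events that lack the field are collected under an empty key.
        groups[event.get(field, "")].append(event)
    return dict(groups)


events = [
    {"rule": "disk_full", "severity": 1},
    {"rule": "cpu_high", "severity": 2},
    {"rule": "mem_high", "severity": 2},
]
by_severity = aggregate(events, "severity")
counts = {level: len(items) for level, items in by_severity.items()}
```

Here `counts` holds one entry per alarm level, which is the kind of summary an aggregated view of active alarms gives.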
6. Fault self-healing
If you use Nightingale's built-in alarm self-healing:
[Example] means calling the self-healing rule with ID 3, executed only on the n9e machine.
Execution history
Three. Personnel organization
1. Permission management
Create roles, configure corresponding permission points, and then assign corresponding roles to users
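The role and permission-point model can be sketched as two lookups. The role names, permission point names and users below are hypothetical examples, not n9e's actual permission points.

```python
# role -> permission points granted to that role (all names hypothetical)
roles = {
    "viewer": {"dashboard.read"},
    "operator": {"dashboard.read", "alarm_rule.write"},
}

# user -> roles assigned to that user
user_roles = {
    "alice": ["operator"],
    "bob": ["viewer"],
}


def has_permission(user: str, point: str) -> bool:
    """True if any role assigned to `user` grants the permission point."""
    return any(point in roles.get(role, set())
               for role in user_roles.get(user, []))
```

Permissions are never attached to users directly. Changing what a role grants changes it for every user holding that role.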
2. User management
Users can configure contact methods such as DingTalk. Then add users to an alarm receiving group. When an alarm rule associated with that group triggers, its users are notified through DingTalk or their other configured methods.
Token-based contact methods can be customized.
Four. System configuration
1. Notification settings
Global callback address : All Nightingale alarms will be pushed to this address.
Notification script : Call SMS or phone gateway through a custom script to implement SMS or phone alarms, etc.
Notification media : control the notification media displayed in the alarm rules.
Contact methods : control the contact methods that can be selected when creating a user
SMTP : Mail gateway
Alarm self-healing : the address and other settings behind the variables used when Nightingale's built-in alarm self-healing is configured
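A service behind the global callback address could look like the following sketch. It assumes that alarms arrive as an HTTP POST with a JSON body and that the body has the keys `rule_name` and `severity`. Both are assumptions for illustration, so check a real payload before relying on them. `summarize` and `CallbackHandler` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler


def summarize(body: bytes) -> str:
    """Turn one alarm payload into a single log line.

    The keys 'rule_name' and 'severity' are assumed for this sketch.
    """
    event = json.loads(body)
    rule = event.get("rule_name", "unknown")
    severity = event.get("severity", "?")
    return f"[severity {severity}] {rule}"


class CallbackHandler(BaseHTTPRequestHandler):
    """Accepts pushed alarm events and prints a one-line summary."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        print(summarize(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()
```

To try it locally, serve the handler with `HTTPServer(("", 8000), CallbackHandler).serve_forever()` from `http.server` and set the global callback address to that host and port.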