SkyWalking: implementing microservice monitoring alarms

For the basics of using SkyWalking, see:

Official documentation:


SkyWalking alarm function

The alarm function was added in SkyWalking 6.x. Its core is a set of rules defined in the config/alarm-settings.yml file. The alarm configuration has two parts:

  1. Alarm rules: define which metric triggers an alarm and under what conditions.
  2. Webhooks: define which service endpoints are notified when an alarm is triggered.

Alarm rules

The SkyWalking release ships a default config/alarm-settings.yml file that predefines some commonly used alarm rules:

  1. Service average response time over 1 second in the last 3 minutes
  2. Service success rate below 80% in the last 2 minutes
  3. Service 90th-percentile response time over 1000 ms in the last 3 minutes
  4. Service instance average response time over 1 second in the last 2 minutes
  5. Endpoint average response time over 1 second in the last 2 minutes

You can see these predefined rules by opening the config/alarm-settings.yml file. Its contents are as follows:

rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
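
Two details in these rules are easy to misread: `service_sla` is compared against 8000 although its message says 80%, and every message contains a `{name}` placeholder. The sketch below illustrates both under stated assumptions: the SLA metric is stored as a percentage multiplied by 100 (so 10000 means 100%), and `{name}` is replaced by the entity name when the alarm fires. The class and method names are hypothetical and this is not SkyWalking's own code.

```java
/** Illustrative helpers only; the names are hypothetical, not SkyWalking APIs. */
class DefaultRuleNotes {

    /** Converts a service_sla value (assumed to be percentage x 100) to a percentage. */
    static double slaToPercent(long slaValue) {
        return slaValue / 100.0;
    }

    /** Fills the {name} placeholder of a rule message with the entity name. */
    static String render(String template, String entityName) {
        return template.replace("{name}", entityName);
    }

    public static void main(String[] args) {
        // Threshold of service_sla_rule expressed as a percentage
        System.out.println(slaToPercent(8000));
        // Message of service_sla_rule rendered for a service called serviceA
        System.out.println(render("Successful rate of service {name} is lower than 80%", "serviceA"));
    }
}
```

Read this way, `threshold: 8000` with `op: "<"` means "success rate below 80%", which agrees with the rule's message.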

In addition, the distribution provides a config/alarm-settings-sample.yml file, a sample rule file that shows all the configuration items currently supported by alarm rules:

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  endpoint_percent_rule:
    # Metrics value need to be long, double or int
    metrics-name: endpoint_percent
    threshold: 75
    op: <
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 3
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 10
    message: Successful rate of endpoint {name} is lower than 75%
  service_percent_rule:
    metrics-name: service_percent
    # [Optional] Default, match all services in this metrics
    include-names:
      - service_a
      - service_b
    exclude-names:
      - service_c
    threshold: 85
    op: <
    period: 10
    count: 4

Alarm rule parameters:

  • Rule name: the unique name of the rule, shown in the alarm message. It must end with _rule; the prefix can be chosen freely
  • Metrics name: the metric name, as defined in the OAL script. Only long, double and int values are currently supported. See the official OAL script
  • Include names: the entity names the rule applies to, such as service names or endpoint names (optional, defaults to all)
  • Exclude names: the entity names the rule does not apply to, such as service names or endpoint names (optional, defaults to empty)
  • Threshold: the threshold value
  • OP: the comparison operator; >, < and = are currently supported
  • Period: how long the metric is evaluated for the rule. This is a time window, and it follows the clock of the back-end deployment environment
  • Count: within a Period window, an alarm is sent once the number of values that cross the Threshold (as compared by OP) reaches Count
  • Silence period: after an alarm is triggered at time N, no further alarm is raised during TN -> TN + period. It defaults to Period, which means the same alarm (same Id under the same Metrics name) is triggered only once within one Period
  • message: the alarm message
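
To make the interplay of Period, Count, OP and Silence period concrete, here is a minimal Java sketch of a sliding-window check. It is a simplified model written for illustration, not SkyWalking's implementation: it assumes one metric value arrives per minute, and the class name `RuleWindowSketch` and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified model of one alarm rule; hypothetical, not SkyWalking's implementation. */
class RuleWindowSketch {
    private final String op;          // ">", "<" or "="
    private final double threshold;
    private final int period;         // window length in minutes
    private final int count;          // matching values needed inside the window
    private final int silencePeriod;  // minutes to stay quiet after an alarm
    private final Deque<Double> window = new ArrayDeque<>();
    private int silenceLeft = 0;

    RuleWindowSketch(String op, double threshold, int period, int count, int silencePeriod) {
        this.op = op;
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    /** Applies the rule's operator to a single metric value. */
    private boolean matches(double value) {
        switch (op) {
            case ">": return value > threshold;
            case "<": return value < threshold;
            case "=": return value == threshold;
            default: throw new IllegalArgumentException("unsupported op: " + op);
        }
    }

    /** Feeds the value for one minute; returns true if an alarm fires at this minute. */
    boolean onMinute(double value) {
        window.addLast(value);
        if (window.size() > period) {
            window.removeFirst();      // keep only the last `period` values
        }
        if (silenceLeft > 0) {         // still silent after a previous alarm
            silenceLeft--;
            return false;
        }
        long matched = window.stream().filter(this::matches).count();
        if (matched >= count) {
            silenceLeft = silencePeriod;
            return true;
        }
        return false;
    }
}
```

With the settings of `service_resp_time_rule` (`>`, 1000, period 10, count 3, silence-period 5), feeding a value of 1500 every minute fires at minute 3 in this model, stays silent for minutes 4 to 8, and fires again at minute 9.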

Webhook (network hook)

A webhook can be understood as a web-level callback mechanism. Like an event callback in code, it is usually triggered by an event, but it works at the web level: when the event occurs, what gets called back is not a method or function but a service endpoint. In the alarm scenario, the alarm is the event. When it occurs, SkyWalking actively calls a configured endpoint, and that endpoint is the webhook.

SkyWalking sends alarm messages as an HTTP request. The method is POST, the Content-Type is application/json, and the body is the JSON serialization of List<org.apache.skywalking.oap.server.core.alarm.AlarmMessage>. Example JSON data:

[{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "serviceA",
    "id0": 12,
    "id1": 0,
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "alarmMessage xxxx",
    "startTime": 1560524171000
}, {
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "serviceB",
    "id0": 23,
    "id1": 0,
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "alarmMessage yyy",
    "startTime": 1560524171000
}]

Field Description:

  • scopeId, scope: the scope of the entity; all available scopes are listed in org.apache.skywalking.oap.server.core.source.DefaultScopeDefine
  • name: the name of the target scope entity
  • id0: the ID of the scope entity
  • id1: reserved field, not currently used
  • ruleName: the alarm rule name
  • alarmMessage: the alarm message content
  • startTime: the alarm time, as a timestamp in milliseconds
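
Because `startTime` is an epoch timestamp in milliseconds, it usually needs converting before it goes into a notification. A minimal sketch using `java.time` follows; the class name, the output pattern and the choice of UTC are assumptions made for illustration.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for turning the startTime field into readable text. */
class AlarmTimeFormat {
    private static final DateTimeFormatter PATTERN =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Formats an epoch-millisecond alarm time in the given time zone. */
    static String format(long startTimeMillis, ZoneId zone) {
        return PATTERN.format(Instant.ofEpochMilli(startTimeMillis).atZone(zone));
    }

    public static void main(String[] args) {
        // startTime taken from the sample JSON above
        System.out.println(format(1560524171000L, ZoneOffset.UTC)); // prints 2019-06-14 14:56:11
    }
}
```

Passing `ZoneId.systemDefault()` instead of UTC would show the time in the server's local zone.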

E-mail alarms in practice

From the two sections above we know that SkyWalking cannot send alarm information directly to e-mail, text messaging or similar services. When an alarm occurs, it only sends the alarm information to the configured webhook endpoint.

But we cannot keep staring at the endpoint's logs to find out whether a service has raised an alarm. The endpoint itself therefore needs to send an e-mail or text message, which gives us customized alarm notifications.

Now let's get hands-on. The example here is based on Spring Boot. First, add the dependency:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-mail</artifactId>
</dependency>

Configure the mail service:

server:
  port: 9134

# Mail configuration
spring:
  mail:
    host: smtp.163.com
    # Sender's e-mail account
    username: your-mailbox@163.com
    # Sender's authorization key
    password: your-mail-service-key
    default-encoding: utf-8
    port: 465   # Port 465 or 587
    protocol: smtp
    properties:
      mail:
        debug: false
        smtp:
          socketFactory:
            class: javax.net.ssl.SSLSocketFactory

Define a DTO that matches the JSON data sent by SkyWalking, so the endpoint can receive it:

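package com.example.alarmdemo.dto;

import lombok.Data;

// Mirrors one element of the JSON array that SkyWalking posts to the webhook
@Data
public class SwAlarmDTO {

    private Integer scopeId;
    private String scope;
    private String name;
    private Integer id0;
    private Integer id1;
    private String ruleName;
    private String alarmMessage;
    private Long startTime;
}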

Next, define an endpoint that receives SkyWalking's alarm notifications and forwards the data by e-mail:

package com.example.alarmdemo.controller;

import com.example.alarmdemo.dto.SwAlarmDTO;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.mail.SimpleMailMessage;
import org.springframework.mail.javamail.JavaMailSender;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@Slf4j
@RestController
@RequiredArgsConstructor
@RequestMapping("/alarm")
public class SwAlarmController {

    private final JavaMailSender sender;

    @Value("${spring.mail.username}")
    private String from;

    /**
     * Receives alarm notifications from the SkyWalking service and sends them by e-mail
     */
    @PostMapping("/receive")
    public void receive(@RequestBody List<SwAlarmDTO> alarmList) {
        SimpleMailMessage message = new SimpleMailMessage();
        // Sender address
        message.setFrom(from);
        // Recipient address (here the sender mails itself)
        message.setTo(from);
        // Subject
        message.setSubject("Alarm e-mail");
        String content = getContent(alarmList);
        // Message body
        message.setText(content);
        sender.send(message);
        log.info("Alarm e-mail sent...");
    }

    private String getContent(List<SwAlarmDTO> alarmList) {
        StringBuilder sb = new StringBuilder();
        for (SwAlarmDTO dto : alarmList) {
            sb.append("scopeId: ").append(dto.getScopeId())
                    .append("\nscope: ").append(dto.getScope())
                    .append("\nTarget scope entity name: ").append(dto.getName())
                    .append("\nScope entity ID: ").append(dto.getId0())
                    .append("\nid1: ").append(dto.getId1())
                    .append("\nAlarm rule name: ").append(dto.getRuleName())
                    .append("\nAlarm message: ").append(dto.getAlarmMessage())
                    .append("\nAlarm time: ").append(dto.getStartTime())
                    .append("\n\n---------------\n\n");
        }

        return sb.toString();
    }
}

Finally, register the endpoint with SkyWalking. The webhook configuration sits at the end of the config/alarm-settings.yml file, in the format http://{ip}:{port}/{uri}. For example:

[root@localhost skywalking]# vim config/alarm-settings.yml
webhooks:
  - http://127.0.0.1:9134/alarm/receive
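
Before waiting for a real alarm, the endpoint can be exercised by hand. The sketch below posts a payload shaped like the earlier JSON example to the configured URL using only the JDK. The class name `WebhookSmokeTest`, its helper methods and the sample field values are assumptions for illustration; the URL is the one from the configuration above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Hypothetical smoke test that imitates SkyWalking's webhook call. */
class WebhookSmokeTest {

    /** Builds a one-element alarm list; no JSON escaping, so use plain names only. */
    static String samplePayload(String serviceName, String ruleName, long startTimeMillis) {
        return "[{\"scopeId\":1,\"scope\":\"SERVICE\",\"name\":\"" + serviceName
                + "\",\"id0\":12,\"id1\":0,\"ruleName\":\"" + ruleName
                + "\",\"alarmMessage\":\"test alarm\",\"startTime\":" + startTimeMillis + "}]";
    }

    /** POSTs the JSON to the webhook URL and returns the HTTP status code. */
    static int post(String url, String json) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        String body = samplePayload("serviceA", "service_sla_rule", System.currentTimeMillis());
        try {
            System.out.println("HTTP status: " + post("http://127.0.0.1:9134/alarm/receive", body));
        } catch (IOException e) {
            System.out.println("Endpoint not reachable: " + e.getMessage());
        }
    }
}
```

If the alarm application is running, a 200 status together with the controller's log line suggests the endpoint and mail settings are wired up correctly.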

Testing the alarm function

With the alarm endpoint developed and configured, we can run a simple test. The call chain used for the test is shown below:
[Figure: call chain diagram]

I added a line of code that throws an exception to the /producer endpoint, deliberately making the endpoint unavailable:

@GetMapping
public String producer() {
    log.info("received a request");
    int i = 1 / 0;
    return "this message from producer";
}

Then write some test code that makes the service meet the default alarm rule "success rate below 80% in the last two minutes":

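public static void main(String[] args) {
    RestTemplate restTemplate = new RestTemplate();
    for (int i = 0; i < 100; i++) {
        try {
            String result = restTemplate.getForObject("http://127.0.0.1:8936/consumer", String.class);
            log.info(result);
        } catch (RestClientException e) {
            // org.springframework.web.client.RestClientException: RestTemplate throws it on
            // error responses; catch it so the loop still sends all 100 requests
            log.warn("request {} failed: {}", i, e.getMessage());
        }
    }
}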

After running the test code and waiting about two minutes, the console of the alarm endpoint prints a log message:
[Figure: console log output of the alarm endpoint]

At this point the alarm e-mail arrives in the mailbox as expected:
[Figure: the received alarm e-mail]

Origin blog.51cto.com/zero01/2463976