Distributed tracing with SkyWalking: detailed explanation and hands-on practice

SkyWalking

1.SkyWalking Overview


SkyWalking was started as an open source project in 2015 by Wu Sheng, then a developer at Huawei. He was the product manager for Huawei's development cloud monitoring products, leading their planning, technical roadmap, and related research and development, and he is also a member of the OpenTracing distributed tracing standards organization. The project joined the Apache Incubator in 2017. SkyWalking is an application performance monitoring (APM) tool for distributed systems, designed for microservice, cloud native, and container-based (Docker, K8s, Mesos) architectures.

Official site: http://skywalking.apache.org/

GitHub project address: https://github.com/apache/skywalking

Its core features are as follows:

  • Metrics analysis: metrics analysis for services, instances, and endpoints
  • Problem analysis: analyze code at runtime to find the root cause of a problem
  • Service topology: topology map analysis of the services
  • Dependency analysis: dependency analysis of service instances and endpoints
  • Slow service detection: detect slow services and endpoints
  • Performance optimization: suggest optimizations based on service monitoring results
  • Tracing: distributed tracing and context propagation
  • Database monitoring: monitoring and statistics of database access metrics, including detection of slow database statements (SQL included)
  • Service alarms: alarm function for services

Glossary:

  • Service: an application system that provides business resources
  • Endpoint: a functional interface that the application system exposes to the outside
  • Instance: one running copy of a service, typically a process on a physical machine, virtual machine, or container


2.SkyWalking architecture design

The overall architectural design of SkyWalking is shown in the figure below:

[Figure: overall architecture of SkyWalking]

SkyWalking as a whole is divided into a client and a server.

Client: agent component

Collects service-related information (tracing data and statistical data) using probe technology, and reports the collected data to SkyWalking's data collector.

Server: divided into OAP, Storage, and WebUI

OAP: the Observability Analysis Platform. It receives the data reported by the client, analyzes and aggregates it, and stores the results. It also provides query APIs. This module is effectively the Collector of the tracing system.

Storage: SkyWalking's storage medium. The default is H2, and many others are supported, such as Elasticsearch and MySQL.

WebUI: provides graphical pages that display the tracing data, metrics data, and so on.

SkyWalking is built from components and is easy to extend. The main components do the following:

  • Skywalking Agent: collects tracing (call chain) data and metric information and reports it to the Skywalking Collector over HTTP or gRPC.
  • Skywalking Collector: the trace data collector. Its Analysis Core module integrates and analyzes the tracing and metric data sent by the agent and writes the results to storage. Its Query Core module performs secondary statistics and alarm monitoring.
  • Storage: SkyWalking supports mainstream stores such as ElasticSearch, MySQL, TiDB, and H2. H2 is only suitable for temporary stand-alone demos. Our production and test environments currently use ES.
  • SkyWalking UI: the web visualization platform used to display the collected data.

SkyWalking can be integrated without modifying a single line of the original project's code. This statement used to appear in SkyWalking's README, and it is both right and wrong. It is right for the end user, who does not need to modify any code (at least in the vast majority of cases). It is wrong in the sense that the code is in fact modified, just by the agent rather than by the developer. This approach is often called "manipulating the code at runtime". The underlying principle is that the auto-instrumentation agent uses the interface the virtual machine provides for modifying code to add instrumentation code dynamically.

In other words, we do not add tracing points by hand. SkyWalking adds them automatically through the Java agent. The Java agent mechanism can be thought of as an interceptor that runs ahead of the main function: it provides the premain() method, through which Java classes can be modified. As the name suggests, premain runs before main, and SkyWalking uses it to perform the automatic instrumentation.
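To make the premain() mechanism more concrete, here is a minimal sketch of a Java agent. It is not SkyWalking's agent code: the class name PremainSketch, the default prefix com/demo/, and the helper shouldInstrument are hypothetical names chosen for illustration. The sketch only logs the classes it would instrument and leaves their bytecode untouched, whereas a real agent would rewrite the bytes.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

/** Hypothetical demo agent (an illustration only, not SkyWalking's implementation). */
final class PremainSketch {

    /** True when a class, given by its JVM internal name such as "com/demo/Foo", is under the prefix. */
    public static boolean shouldInstrument(String internalName, String prefix) {
        return internalName != null && internalName.startsWith(prefix);
    }

    /** The JVM calls this before the application's main() when started with -javaagent. */
    public static void premain(String agentArgs, Instrumentation inst) {
        // Assumption: the agent argument, if present, is the package prefix to watch.
        final String prefix = (agentArgs == null || agentArgs.isEmpty()) ? "com/demo/" : agentArgs;
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                if (shouldInstrument(className, prefix)) {
                    // A real agent would rewrite classfileBuffer here to add tracing code.
                    System.out.println("would instrument: " + className);
                }
                return null; // null tells the JVM to keep the original bytecode
            }
        });
    }
}
```

For the JVM to find premain(), the agent jar's manifest must name the class in a Premain-Class entry, and the application is then started with -javaagent pointing at that jar.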

3.SkyWalking deployment

See docker-compose.yml in the reference material.

(1)Install OAP

Pull the image

docker pull apache/skywalking-oap-server:latest-es6

Create container

docker run --name skywalking -d -p 1234:1234 -p 11800:11800 -p 12800:12800 --restart always apache/skywalking-oap-server:latest-es6

SkyWalking stores its data in H2 by default, but that data is lost whenever H2 restarts, so ES is used instead of H2 for SkyWalking's data storage. Elasticsearch is already installed in the project and can be used directly. You need to tell OAP the Elasticsearch address (in the official image this is done with the SW_STORAGE and SW_STORAGE_ES_CLUSTER_NODES environment variables). Also pay attention to the ES version: the image tag has to match it.

(2)Install UI

Pull image

docker pull apache/skywalking-ui

Create container

docker run --name skywalking-ui -d -p 8686:8080 --link skywalking:skywalking -e SW_OAP_ADDRESS=skywalking:12800 --restart always apache/skywalking-ui

After successful startup, visit the SkyWalking web UI: http://192.168.xx.xxx:xxxx/


4. Connect the application to SkyWalking

Connecting an application to SkyWalking is very simple: specify the SkyWalking agent component with the -javaagent option when the application starts.

First, find the agent directory in the downloaded SkyWalking distribution.

Then point -javaagent at the skywalking-agent.jar inside it.

In addition, the agent collects data and submits it to the OAP (collector), so the OAP address must be set in the agent's configuration file, config/agent.config (the collector.backend_service setting). The default is the local address, 127.0.0.1:11800.

Next, start the applications one by one. As an example we use xxx-leadnews-admin, xxx-leadnews-admin-gateway, and xxx-leadnews-user from the day 13 code of the Heima Toutiao (Dark Horse Toutiao) course. Only the startup parameters need to change.

Modify the startup parameters as follows (note: point to the location where the agent is stored on your computer):

-javaagent:D:\develop\agent\skywalking-agent.jar
-Dskywalking.agent.service_name=leadnews-admin

Note: if a service is deployed on multiple nodes, it is best to give each node a different service name.

Using the same name also works. In that case the nodes show up as multiple instances of one service.


All three services need the modified startup parameters, each with its own service name. Then start the projects.

Visit: http://192.168.xxx.xxx:xxxx to view the UI of skywalking

Introduction to UI monitoring perspectives and indicators

Let’s interpret some of the unfamiliar indicators:

User Satisfaction Apdex Score

Apdex measures response time against a configured threshold. It expresses the ratio of satisfactory response times to unsatisfactory ones, where response time is measured from the moment an asset is requested until it has been delivered back to the requester.

The administrator, owner, or add-on manager defines a response time threshold T. Every response handled in T or less satisfies the user.

For example, if T is 1.2 seconds and a response completes in 0.5 seconds, the user is satisfied. Responses that take longer than 1.2 seconds leave the user dissatisfied, and responses that take longer than 4.8 seconds (4T) frustrate the user.
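The thresholds above map onto the standard Apdex formula, (satisfied + tolerating / 2) / total, where "tolerating" covers responses between T and 4T. The following is a minimal sketch of that formula. The class and method names are hypothetical, and SkyWalking's own calculation inside OAP may differ in details such as scaling and how failed requests are counted.

```java
/** Hypothetical helper illustrating the standard Apdex formula. */
final class ApdexSketch {

    /** Returns the Apdex score in [0, 1] for the given response times and threshold T (both in ms). */
    public static double apdex(long[] responseTimesMs, long thresholdMs) {
        if (responseTimesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        int satisfied = 0;
        int tolerating = 0;
        for (long t : responseTimesMs) {
            if (t <= thresholdMs) {
                satisfied++;          // within T: satisfied
            } else if (t <= 4 * thresholdMs) {
                tolerating++;         // between T and 4T: counts as half
            }
            // slower than 4T: frustrated, contributes nothing
        }
        return (satisfied + tolerating / 2.0) / responseTimesMs.length;
    }
}
```

With T = 1200 ms, the samples 500, 800, 1500, and 5000 ms give two satisfied, one tolerating, and one frustrated request.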

cpm requests per minute

cpm stands for calls per minute and is a throughput indicator.

The dashboard shows global throughput as well as per-service, per-instance, and per-endpoint throughput and average throughput.

For example, 185 cpm = 185 / 60 ≈ 3.08 requests per second.

SLA Service Level Agreement

A service level agreement expresses the level of service provided and can be used to measure a platform's availability. The allowed downtime for N nines is calculated as follows.

1 year = 365 days = 8760 hours
99%     = 8760 * 1%     => 3.65 days (unavailable for 3.65 days a year; two nines is basically unusable)
99.9%   = 8760 * 0.1%   => 8.76 hours (unavailable for 8.76 hours a year)
99.99%  = 8760 * 0.01%  => 52.6 minutes
99.999% = 8760 * 0.001% => 5.26 minutes

So a single large-scale outage during the year is enough to miss four nines. For most platforms, three nines is about right.
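The same arithmetic can be written as a small helper. This is only a sketch of the calculation in the table, with hypothetical names, and it assumes a 365-day year as the table does.

```java
/** Hypothetical helper reproducing the "N nines" downtime arithmetic. */
final class SlaSketch {

    /** 365 days * 24 hours, the same 8760 hours used in the table. */
    static final double HOURS_PER_YEAR = 365 * 24;

    /** Allowed downtime per year, in hours, for an availability such as 99.9 (percent). */
    public static double allowedDowntimeHours(double availabilityPercent) {
        return HOURS_PER_YEAR * (100.0 - availabilityPercent) / 100.0;
    }

    /** Same value expressed in minutes. */
    public static double allowedDowntimeMinutes(double availabilityPercent) {
        return allowedDowntimeHours(availabilityPercent) * 60.0;
    }
}
```

For example, allowedDowntimeHours(99.9) corresponds to the 8.76 hours row, and allowedDowntimeMinutes(99.99) to the 52.6 minutes row.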

Percent Response percentile statistics

SkyWalking reports a series of percentile values: p50, p75, p90, p95, and p99. For example, "p99: 390" means that 99% of the requests had a response time within 390 ms.
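To illustrate what a percentile value means, here is a minimal sketch using the simple nearest-rank method on raw samples. The names are hypothetical, and this is not how OAP computes its percentiles internally. It only shows the definition: pN is the smallest sample such that at least N% of the samples are less than or equal to it.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentile over raw response times. */
final class PercentileSketch {

    /** Returns the p-th percentile (1..100) of the samples, in the same unit as the input. */
    public static long percentile(long[] samplesMs, int p) {
        if (samplesMs.length == 0 || p < 1 || p > 100) {
            throw new IllegalArgumentException("need samples and 1 <= p <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // 1-based rank = ceil(p / 100 * n), computed with integer arithmetic
        int rank = (int) (((long) p * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }
}
```

With the ten samples 10, 20, ..., 100 ms, p50 is 50 ms and p90 is 90 ms under this method.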


Heatmap

A heatmap (also called a thermal map) uses color depth to show request volume: the darker the color, the more requests. This is much like GitHub Contributions, where more commits give a darker color.

The vertical axis is the response time. Hover the mouse over a cell to see the exact number.

The heat map gives an intuitive feel for the platform's overall traffic on the one hand, and for its overall performance on the other.

5.SkyWalking configuration application alarm

SkyWalking's alarm function was added in version 6.x. Its core is a set of rules defined in the config/alarm-settings.yml file. An alarm definition has two parts:

  1. Alarm rules: define how metric alarms are triggered and which conditions are considered.
  2. Webhooks: define which service endpoints are notified when an alarm is triggered.

5.1.Alarm rules

SkyWalking releases ship a default config/alarm-settings.yml file that predefines some common alarm rules:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_avg_rule:
#    metrics-name: endpoint_avg
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/

Description of alarm rule configuration items:

  • Rule name: The unique name of the rule, which is also shown in the alarm message. It must end with _rule; the prefix is up to you.
  • Metrics name: The metric name, which must be a metric name defined in the OAL script. Only long, double, and int types are currently supported.
  • Include names: The entity names this rule applies to, such as service names or endpoint names (optional; defaults to all).
  • Exclude names: The entity names this rule does not apply to, such as service names or endpoint names (optional; empty by default).
  • Threshold: The threshold value.
  • OP: The comparison operator; >, <, and = are currently supported.
  • Period: How long a window of metric values the rule evaluates. It is a time window based on the clock of the backend deployment environment.
  • Count: Within a Period window, an alarm is sent once the number of values that match the condition (value OP Threshold) reaches Count.
  • Silence period: After an alarm is triggered at time N, the same alarm stays silent during N -> N + silence period. By default it equals Period, which means the same alarm (same ID under the same Metrics name) is triggered at most once per Period.
  • Message: The alarm message text.
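The interplay of Period, Count, OP, and Threshold can be sketched as follows. This is a simplified model written for illustration, not SkyWalking's alarm engine: the names are hypothetical, it assumes one metric value per minute in the window, and it leaves out the silence period.

```java
/** Hypothetical, simplified model of evaluating one alarm rule over a Period window. */
final class AlarmRuleSketch {

    /** True when "value op threshold" holds; op is one of ">", "<", "=". */
    public static boolean matches(double value, String op, double threshold) {
        switch (op) {
            case ">": return value > threshold;
            case "<": return value < threshold;
            case "=": return value == threshold;
            default: throw new IllegalArgumentException("unsupported op: " + op);
        }
    }

    /**
     * windowValues holds the metric values of the last Period (assumed: one per minute).
     * The rule fires when at least `count` of them match the condition.
     */
    public static boolean shouldAlarm(double[] windowValues, String op, double threshold, int count) {
        int matched = 0;
        for (double v : windowValues) {
            if (matches(v, op, threshold)) {
                matched++;
            }
        }
        return matched >= count;
    }
}
```

For service_resp_time_rule (op ">", threshold 1000, period 10, count 3), a window in which three of the ten values exceed 1000 would trigger the alarm under this model.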

The alarm rules predefined in the configuration file are summarized as follows:

  1. The average service response time exceeded 1 second in 3 of the last 10 minutes
  2. The service success rate was below 80% in 2 of the last 10 minutes
  3. A service response time percentile (p50, p75, p90, p95, or p99) exceeded 1 second in 3 of the last 10 minutes
  4. The service instance's average response time exceeded 1 second in 2 of the last 10 minutes
  5. The endpoint's average response time exceeded 1 second in 2 of the last 10 minutes (this rule is commented out by default)

5.2.Webhook (webhook)

A webhook can be understood simply as a web-level callback mechanism, usually triggered by some event. It resembles an event callback in code, except that what gets called back is no longer a method or function but a service interface (an HTTP endpoint). In the alarm scenario, the alarm is the event: when it occurs, SkyWalking actively calls a configured interface, and that interface is the webhook.
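Seen from the calling side, a webhook is just an HTTP POST to a configured URL. The sketch below builds such a request with the JDK's java.net.http API (Java 11 or later). It is an illustration of the idea, not SkyWalking's sender: the class name is hypothetical, the URL is the commented-out sample from alarm-settings.yml, and the JSON body is a made-up placeholder.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical illustration of the caller side of a webhook. */
final class WebhookRequestSketch {

    /** Builds (but does not send) a POST request carrying the alarm JSON to the webhook URL. */
    public static HttpRequest buildAlarmRequest(String webhookUrl, String alarmJson) {
        return HttpRequest.newBuilder(URI.create(webhookUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(alarmJson))
                .build();
    }
}
```

Sending the request would then be a call such as HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding()); the receiving service only has to expose a matching POST endpoint.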


5.3. Email Alert Practice

From the two sections above we know that SkyWalking does not send alarm information directly to email, SMS, or similar services. When an alarm occurs, it only sends the alarm information to the configured webhook interface.

We cannot keep watching that interface's logs by hand to find out whether a service has raised an alarm, so we implement email or SMS sending inside the interface to get customized alarm notifications.

1: First, configure the webhook interface in the config/alarm-settings.yml file.

Note: do this inside the OAP container on 192.168.85.143.

Enter the container: docker exec -it <container ID> /bin/bash

The default location is /skywalking/config

In addition, the webhook interface does not need to be added to every service. One service is enough, and it should be one that is as lightly loaded as possible.

webhooks:
  - http://192.168.85.143:9010/alarm/mailNotify/

After configuration, you need to restart the container

2: Create a new module: heima-leadnews-alarm

pom file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>heima-leadnews</artifactId>
        <groupId>com.heima</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>heima-leadnews-alarm</artifactId>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-mail</artifactId>
        </dependency>
    </dependencies>
</project>

3: Create the configuration file application.yml and add the account and other information needed to send email.

server:
  port: 9010
spring:
  mail:
    host: smtp.163.com # outgoing mail server address
    username: [email protected] # sender account
    password: AWUJNNGRDCEKLUTN # authorization code of that account
    port: 25

  # mail recipients
skywalking:
  alarm:
    from: [email protected] # sender address, same as username above
    receiveEmails:
      - [email protected]

To use a 163 mailbox, you must first enable its SMTP service.

At the same time, write a configuration class that loads the email sender and recipient settings: com.itheima.skywalking.alarm.AlarmEmailProperties

import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.boot.context.properties.EnableConfigurationProperties;
import org.springframework.context.annotation.Configuration;

import java.util.List;

// Binds the skywalking.alarm.* entries of application.yml
@Data
@Configuration
@EnableConfigurationProperties
@ConfigurationProperties("skywalking.alarm")
public class AlarmEmailProperties {

    // Sender address (skywalking.alarm.from)
    private String from;
    // Recipient addresses (skywalking.alarm.receiveEmails)
    private List<String> receiveEmails;

}

4: After an alarm event occurs, SkyWalking calls the configured webhook interface with a POST request. The alarm details are passed in the request body in application/json format, as follows:

[{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "serviceA",
    "id0": 12,
    "id1": 0,
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "alarmMessage xxxx",
    "startTime": 1560524171000
}, {
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "serviceB",
    "id0": 23,
    "id1": 0,
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "alarmMessage yyy",
    "startTime": 1560524171000
}]

We can therefore define a DTO in the alarm service to receive this data: com.xxx.skywalking.dto.AlarmDTO

import lombok.Data;

/**
 * Created by 传智播客*黑马程序员.
 * One element of the JSON array that SkyWalking posts to the webhook.
 */
@Data
public class AlarmDTO {

    private Integer scopeId;
    private String scope;
    private String name;
    private String id0;
    private String id1;
    private String ruleName;
    private String alarmMessage;
    private Long startTime;

}

5: Define an interface that receives the alarm notifications from SkyWalking and emails the data to the people responsible for the system: com.xxx.alarm.controller.AlarmController

package com.xxx.alarm.controller;

import com.xxx.alarm.config.AlarmEmailProperties;
import com.xxx.alarm.model.AlarmDTO;
import lombok.extern.log4j.Log4j2;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.mail.SimpleMailMessage;
import org.springframework.mail.javamail.JavaMailSender;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

import java.util.List;

@RestController
@RequestMapping("/alarm")
@Log4j2
public class AlarmController {

    @Autowired
    private JavaMailSender javaMailSender;

    @Autowired
    private AlarmEmailProperties alarmEmailProperties;

    // Webhook endpoint that SkyWalking POSTs alarm data to
    @PostMapping("/mailNotify")
    public void emailAlarm(@RequestBody List<AlarmDTO> alarmDTOList) {
        SimpleMailMessage mailMessage = new SimpleMailMessage();
        // Sender address
        mailMessage.setFrom(alarmEmailProperties.getFrom());
        // Recipients
        mailMessage.setTo(alarmEmailProperties.getReceiveEmails().toArray(new String[0]));
        // Subject
        mailMessage.setSubject("SkyWalking alarm email");
        // Mail body
        mailMessage.setText(alarmDTOList.toString());
        javaMailSender.send(mailMessage);
        log.info("Alarm email sent");
    }
}
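The controller above puts alarmDTOList.toString() into the mail body, which works but is hard to read. As an optional refinement, the body could be built one line per alarm. The sketch below is only a suggestion under stated assumptions: the class AlarmMailBodySketch and its nested Alarm type are hypothetical stand-ins (Alarm mirrors a few fields of AlarmDTO so the example is self-contained), and the line layout is an arbitrary choice.

```java
import java.time.Instant;
import java.util.List;

/** Hypothetical formatter that turns alarms into a readable plain-text mail body. */
final class AlarmMailBodySketch {

    /** Minimal stand-in for AlarmDTO, holding only the fields used here. */
    static final class Alarm {
        final String scope;
        final String name;
        final String ruleName;
        final String alarmMessage;
        final long startTime; // epoch milliseconds, as in the webhook JSON

        Alarm(String scope, String name, String ruleName, String alarmMessage, long startTime) {
            this.scope = scope;
            this.name = name;
            this.ruleName = ruleName;
            this.alarmMessage = alarmMessage;
            this.startTime = startTime;
        }
    }

    /** One line per alarm: [UTC time] SCOPE name (rule): message */
    public static String format(List<Alarm> alarms) {
        StringBuilder sb = new StringBuilder();
        for (Alarm a : alarms) {
            sb.append('[').append(Instant.ofEpochMilli(a.startTime)).append("] ")
              .append(a.scope).append(' ').append(a.name)
              .append(" (").append(a.ruleName).append("): ")
              .append(a.alarmMessage).append('\n');
        }
        return sb.toString();
    }
}
```

In the controller, the result of such a formatter would be passed to mailMessage.setText(...) in place of the toString() output.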

Package the module, deploy it to the server, and run the service.

6: Test the alarm. SkyWalking has a default alarm rule that fires when the service success rate is below 80% at least 2 times within 10 minutes, so we can trigger it by making requests fail.

Start the following services:

xxxx-leadnews-alarm

xx-leadnews-admin

xx-leadnews-admin-gateway

xx-leadnews-user

xx-leadnews-wemedia

xx-leadnews-article

Perform a user review. First make sure at least one request succeeds, then stop the self-media (wemedia) microservice, send the request a few more times, and wait a few minutes for the email to arrive.

6. Connecting the automated project deployment to SkyWalking

6.1 Overall idea

  • Add the agent component to the startup parameters of every microservice.
  • Mount a host directory when Jenkins starts each container, so that every microservice points to the same agent component on the host.


6.2 Modification of startup parameters

Add the agent startup parameters to the Dockerfile of each microservice, and give each service a different name. Below is the configuration for the admin microservice:

-javaagent:/usr/share/agent/skywalking-agent.jar -Dskywalking.agent.service_name=leadnews-admin


6.3 Upload agent components to the server

Pack the local agent directory into an archive, upload it to the Linux server, move it to the directory /usr/local/skyWalking, and extract it there.

Note: the directory name must be written exactly as shown, because the container mount configured in Jenkins points to this path.

6.4 Modify the project configuration in Jenkins

Modify the Execute shell step in each project's configuration and add a directory mapping when creating the container, so that all containers point to the same agent:

-v /usr/local/skyWalking/agent:/usr/share/agent



Origin blog.csdn.net/Java__EE/article/details/132149295