SkyWalking Usage Guide

Background

With the spread of distributed architectures, system complexity keeps growing, and the interactions between services make it harder to pinpoint performance problems. A fault in any node can cause losses to the business system, so a good monitoring tool for link tracing is urgently needed.

The requirements are as follows.

Functional requirements:

  • Request link tracing: quickly locate faults, shorten troubleshooting time, and determine the scope of a fault
  • Visualize the time spent in each stage of a link, analyze performance, and eliminate business bottlenecks
  • Map service dependencies and make them more reasonable
  • Monitor system metrics such as throughput (TPS), response time, and error records

Non-functional requirements:

  • Low probe overhead: instrumenting service calls itself costs performance, so the tracing component must affect the performance of the business system as little as possible
  • Low code intrusion: as little intrusion into the business system as possible, ideally none
  • Other: the tool should be transparent to users and reduce the burden on developers

SkyWalking

  • Introduction

SkyWalking is a full-link tracing tool.

Its main features are described below.

Dashboard

The Dashboard is the SkyWalking home page. It provides multiple dashboards that visualize metrics, such as the service (APM) dashboard and the database (Database) dashboard.

APM

The APM panel is divided into four dimensions: Global, Service, Instance, and Endpoint (API). Each dimension provides filtering, and each block contains several metrics.

  • Global metrics:

1. Services Load:

Requests per minute for each service.

2. Slow Services:

Services with slow responses: the top N sorted by response time, in ms.

3. Un-Health Services (Apdex):

The Apdex performance index measures user satisfaction with a service. A score of 1 is perfect, and lower scores indicate unhealthy services. Apdex is computed from a configured response-time threshold: it is the share of requests with a satisfactory response time, plus half the share of requests with a tolerable response time, out of all requests. It is used because traditional metrics such as average response time are easily skewed.

4. Slow Endpoints:

Slow endpoints (interfaces), sorted by average response time, in ms.

5. Global Response Latency:

Response-time percentiles: the latency at different percentiles, in ms. For example, a p99 of 3500 ms means that 99% of requests completed within 3500 ms.

6. Global Heatmap:

A heat map of service response times. The color depth of each cell reflects the number of requests with that response time in that time period: the darker the color, the more requests.
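To make the Apdex, percentile, and heat map metrics above concrete, here is a minimal Python sketch. It uses the standard Apdex formula (satisfied at or below the threshold T, tolerating at or below 4T) and a nearest-rank percentile. The function names, the sample latencies, the 500 ms threshold, and the 100 ms bucket width are illustrative assumptions, and SkyWalking's own calculation may differ in detail.

```python
import math
from collections import Counter


def apdex(latencies_ms, threshold_ms):
    """Standard Apdex: (satisfied + tolerating / 2) / total."""
    if not latencies_ms:
        return None
    satisfied = sum(1 for v in latencies_ms if v <= threshold_ms)
    tolerating = sum(
        1 for v in latencies_ms if threshold_ms < v <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(latencies_ms)


def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def heatmap(samples, bucket_ms=100):
    """Count requests per (minute, latency bucket) cell; a higher count means a darker cell."""
    cells = Counter()
    for minute, latency_ms in samples:
        bucket_start = latency_ms // bucket_ms * bucket_ms
        cells[(minute, bucket_start)] += 1
    return dict(cells)


# Hypothetical sample data, not taken from a real SkyWalking deployment.
latencies = [120, 300, 450, 800, 2500]
print(apdex(latencies, threshold_ms=500))   # 3 satisfied, 1 tolerating, 1 frustrated
print(percentile(latencies, 50))
print(percentile(latencies, 99))
print(heatmap([(0, 120), (0, 180), (1, 450)]))
```

With these five samples a single 2500 ms request pulls the average up to 834 ms, while the Apdex score and the median still describe what most requests experienced.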

  • Service (service) dimension:

Service Apdex (number):
Current Apdex score of the service

Service Apdex (line chart):
Apdex score over time

Service Avg Response Time:
Average response time of the service

Service Response Time Percentile:
Response latency at different percentiles

Successful Rate (%) (number):
Request success rate

Successful Rate (%) (line chart):
Request success rate over time

Service Load (CPM - calls per minute) (number):
Calls per minute

Service Load (CPM - calls per minute) (line chart):
Calls per minute over time

Service Instances Load (CPM - calls per minute):
Calls per minute for each instance

Slow Service Instance:
Top N service instances by average latency

Service Instance Successful Rate:
Top N service instances by request success rate
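As a rough illustration of how the service-level numbers above relate to raw requests, the following Python sketch derives CPM, success rate, and a top N list of slow instances from a handful of request records. The record layout, the instance names, and the function names are hypothetical; this is not how SkyWalking stores or aggregates its data.

```python
from collections import defaultdict

# Hypothetical request records: (instance, latency_ms, succeeded)
records = [
    ("search-1", 120, True),
    ("search-1", 900, False),
    ("search-2", 200, True),
    ("search-2", 300, True),
]


def calls_per_minute(records, window_minutes):
    """CPM: number of calls divided by the length of the window in minutes."""
    return len(records) / window_minutes


def success_rate(records):
    """Percentage of requests that succeeded."""
    if not records:
        return None
    succeeded = sum(1 for _, _, ok in records if ok)
    return 100 * succeeded / len(records)


def slowest_instances(records, n=10):
    """Top N instances by average latency, slowest first."""
    by_instance = defaultdict(list)
    for instance, latency_ms, _ in records:
        by_instance[instance].append(latency_ms)
    averages = {name: sum(v) / len(v) for name, v in by_instance.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


print(calls_per_minute(records, window_minutes=2))
print(success_rate(records))
print(slowest_instances(records, n=1))
```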

  • Instance (instance) dimension:


  • Endpoint (API) dimension:


Database:


Topology

The topology map displays the dependencies between services intuitively, which is very helpful when sorting out services. It also supports custom grouping: for example, the three services ai-search, social-search, and social-scan can be placed in one custom group, and the topology map then shows the dependencies among them.

In addition, the topology map shows runtime information for each service, including the development framework type, average response time, throughput, response-time percentiles, Apdex score, and SLA value.
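Conceptually, a topology map is a directed graph built from observed calls between services. The Python sketch below shows that idea under stated assumptions: the call list and the direction of each call are invented for illustration (the article does not say which of the three services calls which), and `build_topology` is a hypothetical helper, not a SkyWalking API.

```python
from collections import Counter

# Hypothetical observed calls as (caller, callee) pairs.
# The directions are assumptions made only for this example.
calls = [
    ("social-scan", "social-search"),
    ("social-search", "ai-search"),
    ("social-scan", "social-search"),
]


def build_topology(calls):
    """Return the service nodes and the call count for each directed edge."""
    edges = Counter(calls)
    nodes = sorted({service for edge in edges for service in edge})
    return nodes, dict(edges)


nodes, edges = build_topology(calls)
print(nodes)
for (caller, callee), count in edges.items():
    print(f"{caller} -> {callee}: {count} call(s)")
```

Each edge count corresponds to the kind of per-link throughput figure that a topology view can display next to a dependency arrow.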


Link tracking:


  • View database operation details:


  • View Redis cache operation details:


Performance analysis:

SkyWalking is very strong at performance analysis. It provides stack-based analysis results that let developers see at a glance how much time each step of a call takes, so that they can optimize in a targeted way.

Performance analysis samples a chosen endpoint through a newly created task and produces a more detailed report than link tracing, including thread stack information and slow-method hints. The steps are as follows:

  • Create a new task:
    In the performance analysis module, click New task, select the service, and fill in the endpoint and the monitoring time.

Reminder: each service can have only one task at a time. A task cannot be changed or deleted once it has been added; it is removed automatically after it expires.

  • Execute requests:
    Call the "/api/searchByWholeOcr" interface several times, then select the task. The monitored data then appears.

Note: because of the sampling settings, the requests must be sent several times in succession. If there are too few requests, no sample data may be collected and analysis will not be possible.

  • Analyze the results:
    In this example, the "/api/searchByWholeOcr" interface took 681 ms. The detailed stack information shows that the most time-consuming operation is the executeSearchRequest() method of the SearchServiceImpl class, which took 563 ms. Most of that time is spent calling ES to perform a full-text search.
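Stack-based analysis of this kind can be understood as counting how often a method appears in periodic thread stack snapshots. The Python sketch below illustrates the idea only: it assumes snapshots taken at a fixed interval and estimates the time spent in each method as snapshot count times interval. The 10 ms interval, the controller frame name, and the snapshot contents are hypothetical, and this is not SkyWalking's actual implementation.

```python
from collections import Counter


def estimate_method_time(snapshots, interval_ms):
    """Estimate time per method as (snapshots containing it) * sampling interval."""
    counts = Counter()
    for stack in snapshots:
        for frame in set(stack):  # count each method once per snapshot
            counts[frame] += 1
    return {frame: n * interval_ms for frame, n in counts.items()}


def slowest_methods(snapshots, interval_ms, n=3):
    """Methods ordered by estimated time, longest first."""
    estimates = estimate_method_time(snapshots, interval_ms)
    return sorted(estimates.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical snapshots of one request thread, outermost frame first.
snapshots = [
    ["SearchController.searchByWholeOcr"],
    ["SearchController.searchByWholeOcr", "SearchServiceImpl.executeSearchRequest"],
    ["SearchController.searchByWholeOcr", "SearchServiceImpl.executeSearchRequest"],
]

for method, estimated_ms in slowest_methods(snapshots, interval_ms=10):
    print(f"{method}: ~{estimated_ms} ms")
```

Because the estimate depends on how many snapshots are captured, a request that finishes between two snapshots leaves little or no data, which is consistent with the advice above to send several requests in succession.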

Final words

SkyWalking is a very good link tracing tool, especially in production environments, where it plays an important supporting role in locating time-consuming interfaces and the links they pass through. You need at least a basic understanding of how to use it before you can locate problems with it. This is "Anqianmahou", an account that focuses on sharing practical, hands-on content. If you found this article well organized and helpful, please like, favorite, and share it. More practical content is on the way...

This article is published by mdnice multi-platform


Origin blog.csdn.net/weixin_42329623/article/details/131756972