Still using Zipkin for distributed tracing? Give SkyWalking a try!

Since the advent of Spring Cloud, microservices have taken off, and enterprise architectures have been shifting from traditional SOA to microservices. But microservices are a double-edged sword: along with their advantages, they bring great difficulties to operations, performance monitoring, and troubleshooting.

In a large project, the service architecture may contain dozens or even hundreds of service nodes, and a single request often spans multiple microservices. Figuring out which nodes a request passes through, and how each node performs, is an urgent problem to solve. This is why APM systems for distributed architectures came into being.

  What is an APM system?

An APM (Application Performance Monitor) system helps you understand system behavior and analyze performance problems, so that when a failure occurs it can be located and fixed quickly.

[Google Dapper](http://bigbully.github.io/Dapper-translation), described in a paper Google made public, can be considered the earliest APM system. It helped Google's developers and operations teams so much that Google shared Dapper's design in that paper.

Since then, many technology companies have designed excellent APM frameworks based on the principles in this paper, such as `Pinpoint` and `SkyWalking`.

Spring Cloud also officially integrates such a system: `Spring Cloud Sleuth`, usually combined with `Zipkin`.

  Basic principles of APM

At present, most APM systems are implemented on the principles of Google's Dapper. Let's briefly look at the concepts and implementation ideas in Dapper.

  Let’s first look at a request call example:

  1. The service cluster includes: front-end (A), two middle layers (B and C), and two back-ends (D and E)

  2. When the user initiates a request, it first reaches the front-end service A, and then A makes RPC calls to service B and service C respectively;

  3. Service B responds to A after processing, but service C still needs to interact with the back-end service D and service E before returning to service A. Finally, service A responds to the user's request;

 


  How can tracking be achieved?

  Google's Dapper designed the following concepts to record request links:

- Span: the basic unit of work in a request. Each call in the link (RPC, REST, database call) creates a Span. Its approximate structure is as follows:

type Span struct {
    TraceID    int64        // identifies one complete request
    Name       string       // name of this unit of work
    ID         int64        // span id of this call
    ParentID   int64        // span id of the caller; null for the top-level service, marking the root span
    Annotation []Annotation // annotations recording details of the call, such as timestamps
}

- Trace: a complete call link, containing multiple spans organized as a tree and identified by a unique TraceID

All the spans of a single request can be chained together through spanId and parentId:


From the moment the request reaches the server until the response is returned, every span in the chain carries the same unique identifier, trace_id.
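To make span linking concrete, here is a minimal Go sketch (types and field names follow the illustrative struct above, not SkyWalking's actual implementation) that reconstructs the call tree of the earlier example, where A calls B and C, and C calls D and E, from a flat list of spans:

```go
package main

import "fmt"

// Span mirrors the illustrative structure above; ParentID 0 marks the root span.
type Span struct {
	TraceID  int64
	Name     string
	ID       int64
	ParentID int64
}

// buildTree groups the spans of one trace into parent -> children lists
// and returns the root span (the one with no parent).
func buildTree(spans []Span) (root Span, children map[int64][]Span) {
	children = make(map[int64][]Span)
	for _, s := range spans {
		if s.ParentID == 0 {
			root = s
		} else {
			children[s.ParentID] = append(children[s.ParentID], s)
		}
	}
	return root, children
}

func main() {
	// The example call chain: A calls B and C; C calls D and E.
	// Every span carries the same TraceID.
	spans := []Span{
		{TraceID: 1, Name: "A", ID: 1, ParentID: 0},
		{TraceID: 1, Name: "B", ID: 2, ParentID: 1},
		{TraceID: 1, Name: "C", ID: 3, ParentID: 1},
		{TraceID: 1, Name: "D", ID: 4, ParentID: 3},
		{TraceID: 1, Name: "E", ID: 5, ParentID: 3},
	}
	root, children := buildTree(spans)
	fmt.Println(root.Name)              // A
	fmt.Println(len(children[root.ID])) // 2 (B and C)
	fmt.Println(len(children[3]))       // 2 (D and E)
}
```

Every span in the list shares TraceID 1; a real collector would first group incoming spans by TraceID and then build one such tree per trace.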

APM selection criteria

Current mainstream APM frameworks all include the following components to collect and display link information:

- Probe (Agent): collects call-link information while the client program runs and sends it to the collector

- Collector: receives the reported data, formats it, and hands it to storage

- Storage: persists the data

- Web UI: aggregates and displays the collected information
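Conceptually, these four components form a pipeline: probes emit span reports, the collector receives them, storage indexes them, and the UI queries them. The toy Go sketch below only illustrates that flow; it is not how any real APM framework is implemented:

```go
package main

import "fmt"

// Report is a span report as a probe would emit it.
type Report struct {
	TraceID string
	Service string
}

// collect plays the collector/storage role: it drains the channel
// the probes write to and indexes the reports by trace id.
func collect(ch <-chan Report) map[string][]Report {
	store := make(map[string][]Report)
	for r := range ch {
		store[r.TraceID] = append(store[r.TraceID], r)
	}
	return store
}

func main() {
	ch := make(chan Report, 10)
	// Probe role: each service reports its span for trace "t1".
	for _, svc := range []string{"gateway", "order", "stock"} {
		ch <- Report{TraceID: "t1", Service: svc}
	}
	close(ch)
	store := collect(ch)
	// UI role: query how many services trace "t1" passed through.
	fmt.Println(len(store["t1"])) // 3
}
```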

Therefore, to select a suitable APM framework, we need to compare how these components differ between frameworks. The main comparison criteria are:

- Probe performance

This is mainly the agent's impact on the service's throughput, CPU and memory. If the probe noticeably degrades a microservice's performance while collecting its runtime data, few people will be willing to use it.

  - Collector scalability

The collector should scale horizontally to support large server clusters and ensure its own high availability.

  - Comprehensive call link data analysis

Data analysis should be fast, and the analysis dimensions as rich as possible. A tracing system that provides feedback quickly enough lets you react quickly to anomalies in production, and ideally it offers code-level visibility to easily locate failure points and bottlenecks.

  - Transparent for development and easy to switch

That is, as an auxiliary component it should intrude as little as possible, ideally not at all, into business systems, stay transparent to users, and add no burden for developers.

  - Complete call chain application topology

Automatically detect the application topology to help you understand the application architecture.

Next, we compare three common APM frameworks against these criteria:

- [Zipkin](https://link.juejin.im/?target=http%3A%2F%2Fzipkin.io%2F): open-sourced by Twitter, a distributed tracing system that collects timing data from services to troubleshoot latency problems in a microservice architecture; it covers data collection, storage, search and presentation.

- [Pinpoint](https://pinpoint.com/): an APM tool for large-scale distributed systems written in Java, a distributed tracing component open-sourced in South Korea.

- [Skywalking](https://skywalking.apache.org/zh/): an excellent APM component developed in China, a system for tracing, alerting on and analyzing the operation of Java distributed application clusters; it is now an Apache top-level project.

  The comparison between the three is as follows:

| Comparison item | Zipkin | Pinpoint | SkyWalking |
| ------------------------------- | -------- | -------- | ---------- |
| Probe performance | Medium | Low | **High** |
| Collector scalability | **High** | Medium | **High** |
| Call-link data analysis | Low | **High** | Medium |
| Transparency to development | Medium | **High** | **High** |
| Call-chain application topology | Medium | **High** | Medium |
| Community support | **High** | Medium | **High** |

As you can see, Zipkin stands out in none of probe performance, transparency to development, or data analysis, so it is hardly the best choice.

Pinpoint has clear advantages in data analysis and transparency to development, but its deployment is relatively complex and it demands substantial hardware resources.

Skywalking has clear advantages in probe performance and transparency to development, and its data analysis capability is also good. Importantly, its deployment is convenient and flexible, making it better suited to small and medium-sized enterprises than Pinpoint.

  Therefore, this article will teach you how to use Skywalking.

  Introduction to Skywalking

  SkyWalking was created in 2015 and provides distributed tracing capabilities. Starting from 5.x, the project evolved into a fully functional Application Performance Management system.

It is used to trace, monitor and diagnose distributed systems, especially those built on microservice architectures, cloud-native or container technologies. It provides the following main capabilities:

  - Distributed tracing and context transfer

  - Analysis of application, instance, and service performance indicators

  - Root cause analysis

- Application topology analysis

  - Application and service dependency analysis

  - Slow service detection

  - Performance optimization

  Official website address: http://skywalking.apache.org/


  Main features:

- Multi-language probes or libraries

  - A Java automatic probe that traces and monitors programs without modifying the source code

  - Other language probes provided by the community

    - [.NET Core](https://github.com/OpenSkywalking/skywalking-netcore)

    - [Node.js](https://github.com/OpenSkywalking/skywalking-nodejs)

- Multiple backend storages: ElasticSearch, H2

- OpenTracing support

  - The Java automatic probe works together with the OpenTracing API

- Lightweight, full-featured backend aggregation and analysis

- Modern Web UI

- Log integration

- Alerts for applications, instances and services

  Skywalking installation

  Let’s first take a look at the official structure diagram of Skywalking:

 


  It is roughly divided into four parts:

- skywalking-oap-server: the Observability Analysis Platform (OAP) service, which collects and processes the data sent by probes

- skywalking-UI: the Web UI service provided by Skywalking, which graphically displays service links, topology maps, traces, performance monitoring, etc.

- Agent: the probe, which obtains call-link and performance information from services and sends it to Skywalking's OAP service

- Storage: storage; Elasticsearch is the usual choice

Skywalking can be deployed on Windows or Linux. Here we install it on Linux; first make sure Elasticsearch is already running in your Linux environment.

  The next installation is divided into three steps:

  - Download the installation package

  - Install Skywalking's OAP service and WebUI

  - Deploy probes in services

  Download the installation package

  The installation package can be downloaded from Skywalking’s official website, http://skywalking.apache.org/downloads/

  The latest version currently is version 8.0.1:

  



  Install OAP service and WebUI

  Install

  Unzip the downloaded installation package to a directory on Linux:

tar xvf apache-skywalking-apm-es7-8.0.1.tar.gz

  Then rename the unzipped folder:

mv apache-skywalking-apm-es7 skywalking

  Enter the unzipped directory:

cd skywalking

  View the directory structure:


  Several key directories:

  - agent: probe

  - bin: startup script

  - config: configuration file

  - logs: logs

  - oap-libs: dependencies

- webapp: WebUI

  Here you need to modify the application.yml file in the config directory. For detailed configuration, see the official website: https://github.com/apache/skywalking/blob/v8.0.1/docs/en/setup/backend/backend-setup.md

  Configuration

  Enter the `config` directory and modify `application.yml`, mainly to change the storage solution from h2 to elasticsearch

  You can use the following configuration directly:

cluster:
  selector: ${SW_CLUSTER:standalone}
  standalone:
core:
  selector: ${SW_CORE:default}
  default:
    role: ${SW_CORE_ROLE:Mixed} # Mixed/Receiver/Aggregator
    restHost: ${SW_CORE_REST_HOST:0.0.0.0}
    restPort: ${SW_CORE_REST_PORT:12800}
    restContextPath: ${SW_CORE_REST_CONTEXT_PATH:/}
    gRPCHost: ${SW_CORE_GRPC_HOST:0.0.0.0}
    gRPCPort: ${SW_CORE_GRPC_PORT:11800}
    gRPCSslEnabled: ${SW_CORE_GRPC_SSL_ENABLED:false}
    gRPCSslKeyPath: ${SW_CORE_GRPC_SSL_KEY_PATH:""}
    gRPCSslCertChainPath: ${SW_CORE_GRPC_SSL_CERT_CHAIN_PATH:""}
    gRPCSslTrustedCAPath: ${SW_CORE_GRPC_SSL_TRUSTED_CA_PATH:""}
    downsampling:
      - Hour
      - Day
      - Month
    # Set a timeout on metrics data. After the timeout has expired, the metrics data will automatically be deleted.
    enableDataKeeperExecutor: ${SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR:true} # If turned off, automatic deletion of metrics data is disabled.
    dataKeeperExecutePeriod: ${SW_CORE_DATA_KEEPER_EXECUTE_PERIOD:5} # How often the data keeper executor runs periodically, unit is minute
    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:3} # Unit is day
    metricsDataTTL: ${SW_CORE_RECORD_DATA_TTL:7} # Unit is day
    # Cache metric data for 1 minute to reduce database queries, and if the OAP cluster changes within that minute,
    # the metrics may not be accurate within that minute.
    enableDatabaseSession: ${SW_CORE_ENABLE_DATABASE_SESSION:true}
    topNReportPeriod: ${SW_CORE_TOPN_REPORT_PERIOD:10} # top_n record worker report cycle, unit is minute
    # Extra model columns are columns defined in the code that are not logically required for aggregation or further queries,
    # and activating them adds memory, network and storage load to OAP.
    # However, when activated, users can see these names in the storage entities, which makes it easier to query the data with 3rd-party tools such as Kibana on ES.
    activeExtraModelColumns: ${SW_CORE_ACTIVE_EXTRA_MODEL_COLUMNS:false}
    # The max length of service + instance names should be less than 200
    serviceNameMaxLength: ${SW_SERVICE_NAME_MAX_LENGTH:70}
    instanceNameMaxLength: ${SW_INSTANCE_NAME_MAX_LENGTH:70}
    # The max length of service + endpoint names should be less than 240
    endpointNameMaxLength: ${SW_ENDPOINT_NAME_MAX_LENGTH:150}
storage:
  selector: ${SW_STORAGE:elasticsearch7}
  elasticsearch7:
    nameSpace: ${SW_NAMESPACE:""}
    clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
    trustStorePath: ${SW_STORAGE_ES_SSL_JKS_PATH:""}
    trustStorePass: ${SW_STORAGE_ES_SSL_JKS_PASS:""}
    dayStep: ${SW_STORAGE_DAY_STEP:1} # Represent the number of days in the one minute/hour/day index.
    user: ${SW_ES_USER:""}
    password: ${SW_ES_PASSWORD:""}
    secretsManagementFile: ${SW_ES_SECRETS_MANAGEMENT_FILE:""} # Secrets management file in the properties format includes the username, password, which are managed by 3rd party tool.
    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:1} # The index shards number is for store metrics data rather than basic segment record
    superDatasetIndexShardsFactor: ${SW_STORAGE_ES_SUPER_DATASET_INDEX_SHARDS_FACTOR:5} # Super data set has been defined in the codes, such as trace segments. This factor provides more shards for the super data set, shards number = indexShardsNumber * superDatasetIndexShardsFactor. Also, this factor effects Zipkin and Jaeger traces.
    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:0}
    # Batch process setting, refer to https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.5/java-docs-bulk-processor.html
    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000} # Execute the bulk every 1000 requests
    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # flush the bulk every 10 seconds whatever the number of requests
    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests
    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
    profileTaskQueryMaxSize: ${SW_STORAGE_ES_QUERY_PROFILE_TASK_SIZE:200}
    advanced: ${SW_STORAGE_ES_ADVANCED:""}
  h2:
    driver: ${SW_STORAGE_H2_DRIVER:org.h2.jdbcx.JdbcDataSource}
    url: ${SW_STORAGE_H2_URL:jdbc:h2:mem:skywalking-oap-db}
    user: ${SW_STORAGE_H2_USER:sa}
    metadataQueryMaxSize: ${SW_STORAGE_H2_QUERY_MAX_SIZE:5000}
receiver-sharing-server:
  selector: ${SW_RECEIVER_SHARING_SERVER:default}
  default:
    authentication: ${SW_AUTHENTICATION:""}
receiver-register:
  selector: ${SW_RECEIVER_REGISTER:default}
  default:

receiver-trace:
  selector: ${SW_RECEIVER_TRACE:default}
  default:
    sampleRate: ${SW_TRACE_SAMPLE_RATE:10000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
    slowDBAccessThreshold: ${SW_SLOW_DB_THRESHOLD:default:200,mongodb:100} # The slow database access thresholds. Unit ms.

receiver-jvm:
  selector: ${SW_RECEIVER_JVM:default}
  default:

receiver-clr:
  selector: ${SW_RECEIVER_CLR:default}
  default:

receiver-profile:
  selector: ${SW_RECEIVER_PROFILE:default}
  default:

service-mesh:
  selector: ${SW_SERVICE_MESH:default}
  default:

istio-telemetry:
  selector: ${SW_ISTIO_TELEMETRY:default}
  default:

envoy-metric:
  selector: ${SW_ENVOY_METRIC:default}
  default:
    acceptMetricsService: ${SW_ENVOY_METRIC_SERVICE:true}
    alsHTTPAnalysis: ${SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS:""}

prometheus-fetcher:
  selector: ${SW_PROMETHEUS_FETCHER:default}
  default:
    active: ${SW_PROMETHEUS_FETCHER_ACTIVE:false}

receiver_zipkin:
  selector: ${SW_RECEIVER_ZIPKIN:-}
  default:
    host: ${SW_RECEIVER_ZIPKIN_HOST:0.0.0.0}
    port: ${SW_RECEIVER_ZIPKIN_PORT:9411}
    contextPath: ${SW_RECEIVER_ZIPKIN_CONTEXT_PATH:/}

receiver_jaeger:
  selector: ${SW_RECEIVER_JAEGER:-}
  default:
    gRPCHost: ${SW_RECEIVER_JAEGER_HOST:0.0.0.0}
    gRPCPort: ${SW_RECEIVER_JAEGER_PORT:14250}

query:
  selector: ${SW_QUERY:graphql}
  graphql:
    path: ${SW_QUERY_GRAPHQL_PATH:/graphql}

alarm:
  selector: ${SW_ALARM:default}
  default:

telemetry:
  selector: ${SW_TELEMETRY:none}
  none:
  prometheus:
    host: ${SW_TELEMETRY_PROMETHEUS_HOST:0.0.0.0}
    port: ${SW_TELEMETRY_PROMETHEUS_PORT:1234}

configuration:
  selector: ${SW_CONFIGURATION:none}
  none:
  grpc:
    host: ${SW_DCS_SERVER_HOST:""}
    port: ${SW_DCS_SERVER_PORT:80}
    clusterName: ${SW_DCS_CLUSTER_NAME:SkyWalking}
    period: ${SW_DCS_PERIOD:20}
  apollo:
    apolloMeta: ${SW_CONFIG_APOLLO:http://106.12.25.204:8080}
    apolloCluster: ${SW_CONFIG_APOLLO_CLUSTER:default}
    apolloEnv: ${SW_CONFIG_APOLLO_ENV:""}
    appId: ${SW_CONFIG_APOLLO_APP_ID:skywalking}
    period: ${SW_CONFIG_APOLLO_PERIOD:5}
  zookeeper:
    period: ${SW_CONFIG_ZK_PERIOD:60} # Unit seconds, sync period. Default fetch every 60 seconds.
    nameSpace: ${SW_CONFIG_ZK_NAMESPACE:/default}
    hostPort: ${SW_CONFIG_ZK_HOST_PORT:localhost:2181}
    # Retry Policy
    baseSleepTimeMs: ${SW_CONFIG_ZK_BASE_SLEEP_TIME_MS:1000} # initial amount of time to wait between retries
    maxRetries: ${SW_CONFIG_ZK_MAX_RETRIES:3} # max number of times to retry
  etcd:
    period: ${SW_CONFIG_ETCD_PERIOD:60} # Unit seconds, sync period. Default fetch every 60 seconds.
    group: ${SW_CONFIG_ETCD_GROUP:skywalking}
    serverAddr: ${SW_CONFIG_ETCD_SERVER_ADDR:localhost:2379}
    clusterName: ${SW_CONFIG_ETCD_CLUSTER_NAME:default}
  consul:
    # Consul host and ports, separated by comma, e.g. 1.2.3.4:8500,2.3.4.5:8500
    hostAndPorts: ${SW_CONFIG_CONSUL_HOST_AND_PORTS:1.2.3.4:8500}
    # Sync period in seconds. Defaults to 60 seconds.
    period: ${SW_CONFIG_CONSUL_PERIOD:1}
    # Consul aclToken
    aclToken: ${SW_CONFIG_CONSUL_ACL_TOKEN:""}

exporter:
  selector: ${SW_EXPORTER:-}
  grpc:
    targetHost: ${SW_EXPORTER_GRPC_HOST:127.0.0.1}
    targetPort: ${SW_EXPORTER_GRPC_PORT:9870}
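One setting worth calling out in the config above is `sampleRate` under `receiver-trace`: its precision is 1/10000, so the default 10000 means 100% of traces are kept. The Go sketch below shows one simple way proportional sampling of that form could work; it is only an illustration, not SkyWalking's actual sampling algorithm:

```go
package main

import "fmt"

// shouldSample keeps roughly rate/10000 of traces, deciding by a
// per-trace sequence number (an illustrative scheme).
func shouldSample(traceSeq, rate int) bool {
	return traceSeq%10000 < rate
}

func main() {
	kept := 0
	for seq := 0; seq < 10000; seq++ {
		if shouldSample(seq, 5000) { // a 50% sample rate
			kept++
		}
	}
	fmt.Println(kept)                    // 5000: half the traces kept
	fmt.Println(shouldSample(42, 10000)) // true: rate 10000 keeps everything
}
```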

Start up

  Make sure elasticsearch is started and the firewall opens ports 8080, 11800, and 12800.

  Enter the `bin` directory and execute the command to run:

./startup.sh

The default UI port is 8080; you can access it at: http://192.168.150.101:8080


  Deploy microservice probes

  Now that the Skywalking server has been started, we still need to add service probes to the microservices to collect data.

  Unzip

  First, unzip the compressed package provided with the pre-class materials:


Extract the `agent` folder to a directory whose path contains no Chinese characters. Its structure is as follows:


Among its files, `skywalking-agent.jar` is the probe we want to use.

  Configuration

If you run the service as a jar package, you can specify the probe with JVM arguments at startup (note that the `-javaagent` option must come before `-jar`, otherwise it is treated as a program argument):

java -javaagent:C:/lesson/skywalking-agent/skywalking-agent.jar -Dskywalking.agent.service_name=ly-registry -Dskywalking.collector.backend_service=192.168.150.101:11800 -jar xxx.jar

In this example, however, we run the services from the IDE and configure the probe there.

Open one of your projects with IDEA. In IDEA, select the run configuration you want to modify, right-click and choose `Edit Configurations`:


  Then in the pop-up window, click `Environment` and select the corresponding expand button behind `VM options`:


  In the expanded input box, enter the following configuration:

-javaagent:C:/lesson/skywalking-agent/skywalking-agent.jar
-Dskywalking.agent.service_name=ly-registry
-Dskywalking.collector.backend_service=192.168.150.101:11800

Notes:

- `-javaagent:C:/lesson/skywalking-agent/skywalking-agent.jar`: the location of the skywalking-agent.jar package; change it to wherever you stored it.

- `-Dskywalking.agent.service_name=ly-registry`: the name of the current service; set it to `ly-registry`, `ly-gateway`, `ly-item-service`, etc., for each service respectively.

- `-Dskywalking.collector.backend_service=192.168.150.101:11800`: the address of Skywalking's OAP service. Communication uses gRPC, so the port is 11800, not 8080.

Start up

Skywalking's probe modifies the class files as the project starts, implanting the instrumentation with **zero intrusion** into the business code, so we only need to start the project for it to take effect.

  Start the project, then access the business interface in the project, and the probe will start working.

  WebUI interface

  Visit: http://192.168.150.101:8080 and you can see that the statistics have come out:


  Performance monitoring of service instances:


  Service topology diagram:


  Link tracking information for a certain request:


  Table view:


The copyright of this article belongs to Dark Horse Programmer Java Training Academy. Reprinting is welcome; please credit the author and source. Thanks!

  Author: Dark Horse Programmer Java Training Academy

  First release: https://java.itheima.com
