Full-Link Log Collection and Tracking Solution - Meituan

July 21, 2022 · Authors: Haiyou, Huaiyu, Yaping, Lisen

1. Background

1.1 Business systems are increasingly complex

With the rapid development of Internet products, ever-changing business environments and user demands bring complicated business requirements. Business systems need to support more and more business scenarios and cover more and more business logic, so system complexity rises rapidly. Meanwhile, with the evolution of microservice architecture, realizing a piece of business logic often depends on cooperation among multiple services. All in all, ever-increasing business system complexity has become the norm.

1.2 Business tracking faces challenges

"Business tracking" has become a key way for business systems to respond to the daily customer complaints and unexpected problems they routinely face. Business tracking can be regarded as restoring the scene of one business execution: the records produced during execution are used to reconstruct the scene, analyze how the business logic ran, and locate problems. It is an important part of overall system construction.

At present, there are two mainstream ways to implement business tracking in distributed scenarios: the log-based ELK solution, and the session tracking solution based on a single request call. However, as business logic grows more complex, these solutions fit current business systems less and less well.

1.2.1 Traditional ELK solution

As an essential capability of the business system, the log is responsible for recording discrete events that occur during the running of the program, and is used for program behavior analysis in the post-event stage, such as what methods have been called, what data has been manipulated, and so on. In distributed systems, the ELK technology stack has become a general solution for log collection and analysis. As shown in Figure 1 below, along with the execution of business logic, business logs will be printed, collected and stored in Elasticsearch (hereinafter referred to as ES) [2].

Figure 1 Business system ELK case

The traditional ELK solution requires developers to print as many logs as possible when writing code, then retrieve and filter the log data related to the business logic from ES by key fields, and finally piece together the scene of the business execution. However, this solution has the following pain points:

  • Cumbersome log collection: although ES provides log retrieval, log data is mostly unstructured text, and it is hard to collect all the relevant logs quickly and completely.
  • Difficult log screening: different business scenarios and business logics overlap, and the business logs printed by overlapping logic interfere with each other, making it hard to filter out the correct associated logs.
  • Time-consuming log analysis: the collected logs are just pieces of discrete data; one can only read the code, reason about the logic, and manually string the logs together to restore the scene as well as possible.

To sum up, as business logic and system complexity grow, the traditional ELK solution becomes ever more time-consuming and labor-intensive in log collection, screening, and analysis, making fast business tracking difficult.

1.2.2 Distributed session tracking solution

In a distributed system, especially a microservice system, a request in a business scenario is often processed along a complex link across multiple services, middleware, and machines. The "distributed session tracking solution" was born to ease troubleshooting on such complex links. Its theory was published by Google in the 2010 paper "Dapper" [3]; Twitter later developed the open-source implementation Zipkin [4].

Almost all frameworks of this type on the market are based on the Google Dapper paper and share a similar architecture: a globally unique distributed id (traceId) strings together the spans of the same request scattered across service nodes, restoring the call relationships, tracking system problems, analyzing call data, and computing system metrics. Distributed session tracking is a session-level tracking capability. As shown in Figure 2 below, a single distributed request is restored into a call link: starting when the client's request reaches the system boundary, it records every service the request flows through until a response is returned to the client.

Figure 2 The whole process of a typical request (from "Dapper")

The main function of distributed session tracking is to analyze the calling behavior of a distributed system, so it cannot be applied well to business logic tracking. Figure 3 below shows a tracking case for an audit business scenario. The business system provides audit capability externally, and an object to be audited must go through two stages, "initial review" and "re-review" (the two are associated by the same taskId), so a complete audit calls the audit interface twice. As shown on the left of the figure, the complete audit scenario involves the execution of many "business logics", while distributed session tracking merely generates the two call links on the right from the two RPC calls; it has no way to accurately describe the execution of the audit scenario's business logic. The problems show up in the following aspects:

Figure 3 Distributed session tracking case

(1) Unable to track multiple calling links at the same time

Distributed session tracking only supports call tracking for a single request. When a business scenario contains multiple calls, multiple call links are generated; since each link is strung together by its own traceId, the links are independent of each other, which makes complete business tracking harder. For example, when troubleshooting a problem in the audit scenario, since the initial review and the re-review are different RPC requests, the two call links cannot be obtained directly at the same time, and an extra mapping between the two traceIds usually has to be stored.

(2) Unable to accurately describe the business logic panorama

The call link generated by distributed session tracking contains only the actual calls of a single request; calls that were not executed, as well as local logic, cannot be reflected in the link, so the panorama of the business logic cannot be accurately described. For example, for the same audit interface, initial-review link 1 includes a call to service b while re-review link 2 does not, because the audit scenario contains a piece of "judgment logic" that cannot be reflected in the call link; it still has to be analyzed manually against the code.

(3) Unable to focus on the logic execution of the current business system

Distributed session tracking covers all the services, components, and machines a single request flows through, including not only the current business system but also many downstream services. When the internal logic of an interface is complex, the depth and complexity of the call link increase significantly. Business tracking, however, only needs to focus on the logic execution of the current business system. For example, the call link generated by the audit scenario involves many internal calls of downstream services, which adds complexity to troubleshooting the current business system.

1.2.3 Summary

The traditional ELK solution is an after-the-fact kind of business tracking: the required logs have to be collected and filtered out of a large number of discrete logs after the event, and then strung together and analyzed manually, a process that is bound to be time-consuming and labor-intensive. The distributed session tracking solution dynamically strings the link together in real time as the call executes, but because it works at the session level and focuses only on call relationships, it cannot be applied well to business tracking.

Therefore, neither the traditional ELK solution nor the distributed session tracking solution can meet increasingly complex business tracking requirements. This article aims at an efficient solution focused on business logic tracking: with the business link as the carrier, business execution logs are organized and strung together efficiently, and the business execution site can be restored and viewed visually, thereby improving the efficiency of locating problems. We call this visualized full-link log tracking.

The following introduces the design ideas and the general scheme of visualized full-link log tracking, as well as the implementation of the new scheme on the Dianping content platform, hoping to provide some ideas for developers of similar business systems.

2. Visualized full-link log tracking

2.1 Design ideas

Visualized full-link log tracking works in advance: business logs are organized efficiently and strung together dynamically at the same time as the business executes, as shown in Figure 4 below. The discrete log data is organized according to the business logic and the execution site is drawn as execution proceeds, making efficient business tracking possible.

Figure 4 Case of business system log tracking

The new solution must answer two key questions: how to organize business logs efficiently, and how to string business logs together dynamically. They are answered one by one below.

Question 1: How to efficiently organize business logs?

To achieve efficient business tracking, the business logic must first be described accurately and completely, forming a panorama of the business logic; business tracking then amounts to restoring, in this panorama, the scene of one business execution from the log data produced during that execution.

The new solution abstracts the business logic and defines business logic links. The "audit business scenario" is used below to illustrate the abstraction process of a business logic link:

  • Logical node: much of a business system's logic can be split by business function into independent business logic units, i.e., logical nodes. A node can be a local method (the "judgment logic" node in Figure 5 below) or a remote call such as RPC (the "Logic A" node in Figure 5 below).
  • Logical link: the business system supports many business scenarios externally, and each business scenario corresponds to one complete business process, which can be abstracted into a logical link composed of logical nodes. The logical link in Figure 5 below describes the "audit business scenario" accurately and completely.

A business trace is the restoration of one particular execution of a logical link. Since the logical link describes the business logic panorama completely and accurately, it can serve as the carrier for efficiently organizing business logs.

Figure 5 Business logic link case

Question 2: How to dynamically concatenate business logs?

The log data produced while business logic executes is stored discretely by nature; what has to be realized is dynamically stringing the logs of each logical node together as the business logic executes, thereby restoring the complete execution site of the business logic.

Since logical nodes often interact with each other through RPC or MQ, the new solution can reuse the distributed parameter pass-through (transparent transmission) capability [5] provided by distributed session tracking to string business logs together dynamically:

  • Through continuous pass-through of parameters in the executing thread and across network communication, the identifiers of the link and its nodes are transmitted without interruption while the business logic executes, coloring the discrete logs.
  • Based on these identifiers, the colored discrete logs are dynamically attached to the node currently executing, gradually converging into a complete logical link and finally realizing efficient organization and visual display of the business execution site.

Unlike the distributed session tracking solution, when multiple distributed calls need to be strung together, the new solution has to choose a public id as the identifier in combination with the business logic. For example, the audit scenario in Figure 5 involves two RPC calls; to ensure that both executions are strung onto the same logical link, the "task id" shared by the initial review and the re-review is chosen as the identifier, which fully realizes the link concatenation and execution-site restoration of the audit scenario.
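As a minimal sketch of how such an identifier can be passed through threads and across RPC/MQ boundaries, consider the hypothetical TraceContext below. The class and method names are assumptions for illustration; in practice the pass-through capability of the distributed session tracking infrastructure [5] would be reused.

  import java.util.HashMap;
  import java.util.Map;

  /** Hypothetical carrier of the series identifier (link id, executed-node marker, etc.). */
  public final class TraceContext {
      private static final ThreadLocal<Map<String, String>> CONTEXT =
              ThreadLocal.withInitial(HashMap::new);

      private TraceContext() {}

      /** Called when the link starts, e.g. put("linkMark", "contentId_type_uuid"). */
      public static void put(String key, String value) {
          CONTEXT.get().put(key, value);
      }

      public static String get(String key) {
          return CONTEXT.get().get(key);
      }

      /** Snapshot to inject into outgoing RPC/MQ headers so the identifier keeps flowing. */
      public static Map<String, String> snapshot() {
          return new HashMap<>(CONTEXT.get());
      }

      /** Restore on the provider/consumer side from the incoming headers. */
      public static void restore(Map<String, String> headers) {
          Map<String, String> ctx = CONTEXT.get();
          ctx.clear();
          ctx.putAll(headers);
      }

      /** Clear at the end of request handling to avoid leaking across pooled threads. */
      public static void clear() {
          CONTEXT.remove();
      }
  }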

2.2 General scheme

Having clarified the two basic questions of efficient log organization and dynamic concatenation, this article takes "logical link 1" of the business system in Figure 4 as the example to describe the general solution in detail. The solution can be broken down into the following steps:

Figure 6 General solution disassembly

2.2.1 Link definition

The meaning of "link definition" is to use a specific language to statically describe a complete logical link . A link is usually composed of multiple logical nodes according to certain business rules . Business rules are the execution relationships between each logical node, including serial , parallel , and conditional branches .

A DSL (Domain Specific Language) is a computer language specially designed for a certain class of tasks. It can define the combination relationships (business rules) of a series of nodes (logical nodes) through JSON or XML. This solution therefore uses a DSL to describe logical links, taking a logical link from abstract definition to concrete realization.

Figure 7 Abstract definition and concrete realization of links

Logical Link 1 - DSL

  [
    {
      "nodeName": "A",
      "nodeType": "rpc"
    },
    {
      "nodeName": "Fork",
      "nodeType": "fork",
      "forkNodes": [
        [
          {
            "nodeName": "B",
            "nodeType": "rpc"
          }
        ],
        [
          {
            "nodeName": "C",
            "nodeType": "local"
          }
        ]
      ]
    },
    {
      "nodeName": "Join",
      "nodeType": "join",
      "joinOnList": [
        "B",
        "C"
      ]
    },
    {
      "nodeName": "D",
      "nodeType": "decision",
      "decisionCases": {
        "true": [
          {
            "nodeName": "E",
            "nodeType": "rpc"
          }
        ]
      },
      "defaultCase": [
        {
          "nodeName": "F",
          "nodeType": "rpc"
        }
      ]
    }
  ]
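For illustration, the static DSL above could be loaded into an in-memory link definition roughly as follows. This is a minimal sketch using Gson (plausible given the GsonUtils that appears later in this article); the NodeDef class and parser are assumptions, not the solution's actual model.

  import com.google.gson.Gson;
  import com.google.gson.reflect.TypeToken;
  import java.util.List;
  import java.util.Map;

  /** Hypothetical in-memory form of one DSL entry; fields mirror the JSON above. */
  class NodeDef {
      String nodeName;
      String nodeType;                          // rpc / local / fork / join / decision
      List<List<NodeDef>> forkNodes;            // fork: parallel branches
      List<String> joinOnList;                  // join: node names to wait for
      Map<String, List<NodeDef>> decisionCases; // decision: branch value -> sub-link
      List<NodeDef> defaultCase;                // decision: fallback branch
  }

  class LinkDefinitionParser {
      /** Parses the static DSL into an executable in-memory link definition. */
      static List<NodeDef> parse(String dslJson) {
          return new Gson().fromJson(dslJson, new TypeToken<List<NodeDef>>() {}.getType());
      }
  }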

2.2.2 Link coloring

The meaning of "link dyeing" is: in the process of link execution, through transparent transmission and serial identification, it is clear which link is being executed and which node has been executed.

Link coloring consists of two steps:

  • Step 1: Determine the series identifier. When the logical link starts, determine the unique identifier, which pins down the link and the nodes to be executed (see how the identifiers can be composed in the sketch after this list).

    • Link unique identifier = business identifier + scenario identifier + execution identifier (the three together determine "one execution of one business scenario")

      • Business ID: gives the link its business meaning, such as a "user id" or an "activity id".

      • Scenario ID: gives the link its scenario meaning; for example, the current scenario is "logical link 1".

      • Execution ID: gives the link its execution meaning. If only a single call is involved, "traceId" can be chosen directly; if multiple calls are involved, a common "public id" shared by the calls is chosen according to the business logic.

    • Node unique identifier = link unique identifier + node name (the two together determine "one logical node within one execution of one business scenario")

      • Node name: the node's unique name preset in the DSL, such as "A".

  • Step 2: Transmit the series identifier. While the logical link executes, the series identifier is transparently transmitted along the complete distributed link, dynamically stringing together the nodes that have executed and thereby coloring the link. For example, in "Logical Link 1":

    • When node "A" triggers execution, the series identifier begins to be transmitted through the subsequent links and nodes, and the coloring of the whole link completes step by step as the business process executes.
    • When the identifier reaches node "E", it means that the "D" conditional branch evaluated to "true"; at the same time node "E" is dynamically strung onto the executed link.
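The sketch below shows one way the two identifiers could be composed. The helper names and the underscore-joined layout are assumptions for illustration, consistent with the "contentId_type_uuid" mark used in the later examples; for the audit scenario, the execution identifier would be the taskId shared by the initial review and the re-review.

  /** Hypothetical helpers composing the series identifiers described above. */
  public final class LinkMarks {
      private LinkMarks() {}

      /** Link unique identifier = business id + scenario id + execution id. */
      public static String linkId(String businessId, String scenarioId, String executionId) {
          return String.join("_", businessId, scenarioId, executionId);
      }

      /** Node unique identifier = link unique identifier + node name preset in the DSL. */
      public static String nodeId(String linkId, String nodeName) {
          return linkId + "_" + nodeName;
      }
  }

  // Audit scenario: both RPC calls share the same taskId as the execution identifier,
  // so both executions are strung onto the same logical link:
  // String link = LinkMarks.linkId("content123", "auditScenario", taskId);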

2.2.3 Link reporting

The meaning of "link reporting" is: during the link execution process, the log is reported in the form of link organization, so as to realize accurate preservation of the business site.

Figure 8 Link reporting diagram

As shown in Figure 8 above, the reported log data includes node logs and business logs. The node log draws an executed node onto the link, recording the node's start, end, input, and output; the business log shows how the specific business logic of a link node executed, recording any data that helps explain that logic, including the input and output parameters exchanged with upstream and downstream, intermediate variables of complex logic, and exceptions thrown during execution.
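To make the two reported payloads concrete, here is a minimal sketch of shapes they might take; the field names are assumptions derived from the description above, not the actual reporting schema.

  /** Hypothetical node log: draws an executed node in the link. */
  record NodeLog(String linkId, String nodeName, String status,
                 long startTime, long endTime, String input, String output) {}

  /** Hypothetical business log: explains how the node's business logic executed. */
  record BusinessLog(String linkId, String nodeName, String level,
                     long logTime, String data, String exceptionStack) {}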

2.2.4 Link storage

The meaning of "link storage" is: to store the logs reported during the link execution and use them for subsequent "on-site restoration". Reported logs can be divided into three categories: link logs, node logs, and business logs:

  • Link log: for a single execution of the link, the basic information of the link, extracted from the logs of the start node and end node, including link type, link meta-information, and link start/end time.
  • Node log: for a single execution of the link, the basic information of each executed node, including node name, node status, and node start/end time.
  • Business log: for a single execution of the link, the business log information inside each executed node, including log level, log time, and log data.

Figure 9 below shows the storage model: it contains link logs, node logs, business logs, and link metadata (configuration data), organized as a tree in which the business identifier serves as the root node for subsequent link queries.

Figure 9 Tree storage structure of links
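As one concrete reading of this tree model, the sketch below shows a possible row-key layout in which the business identifier is the key prefix (the root of the tree), so a prefix scan retrieves all links and nodes of one business object. The layout is an assumption for illustration, not the platform's actual schema.

  /** Hypothetical row keys mirroring the tree in Figure 9: the business id is the root. */
  public final class TraceRowKeys {
      private TraceRowKeys() {}

      // businessId                  -> prefix scan: all link logs of one business object
      // businessId_linkId           -> one link log (one execution of one scenario)
      // businessId_linkId_nodeName  -> one node log plus its business logs
      public static String linkKey(String businessId, String linkId) {
          return businessId + "_" + linkId;
      }

      public static String nodeKey(String businessId, String linkId, String nodeName) {
          return linkKey(businessId, linkId) + "_" + nodeName;
      }
  }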

3. Dianping content platform practice

3.1 Business Features and Challenges

In the Internet age, content is king. The core strategy of a content platform is to build a content pipeline that keeps content flowing to content consumers sustainably, healthily, and valuably, ultimately forming a virtuous circle of "production → governance → consumption → production".

The Dianping and Meituan apps have rich and diverse content, and business parties and partners inside and outside the site have many consumption scenarios. The three parties in the content pipeline have the following demands:

  • Content producers: hope their content is distributed through more channels, gains more traffic, and is liked by consumers.
  • Content governors: hope to act as a "firewall" that lets through only legal and compliant content, while combining machine and human capabilities to enrich content attributes.
  • Content consumers: hope to obtain content that meets their individual needs, sparks their interest, or assists their consumption decisions.

Producers have heterogeneous content models that require different processing, and consumers have individualized requirements for content. If every producer and consumer were connected separately, the heterogeneous content models, differing processing procedures, and differing output thresholds would make integration costly and inefficient. Against this background, the Dianping content platform came into being. As the "governor" of the content pipeline, it realizes unified access, unified processing, and unified output of content:

Figure 10 Business form of the Dianping content platform

  • Unified access: unify the content data model, connect different content producers, and convert heterogeneous content into the platform's common data model.
  • Unified processing: build unified processing capabilities, accumulating and improving general machine-processing and manual-operation capabilities, to ensure content is legal, compliant, and richly attributed.
  • Unified output: build unified output thresholds, connect different content consumers, and provide downstream parties with standardized content data that meets their individual needs.

Figure 11 below shows the core business process of the Dianping content platform. Every piece of content goes through this process, which finally decides whether the content is distributed to each channel.

Figure 11 Business process of the Dianping content platform

Whether content is processed by the content platform in a timely and accurate way is the core concern of both content producers and consumers, and also the main type of customer complaint handled on daily duty. The business tracking construction of the content platform, however, faces the following difficulties and complexities:

  • Many business scenarios: the business process involves many business scenarios with different logic, such as real-time access, manual operation, distribution recalculation, and the other scenarios listed in the figure.
  • Many logical nodes: a business scenario involves many logical nodes, and different content types execute different nodes. For example, within the same real-time access scenario, the logical nodes executed for note content and for live content differ greatly.
  • Multiple triggered executions: a business scenario is triggered multiple times, with logic differing by trigger source. For example, after a note is edited by its author or reviewed by the system, the real-time access scenario is re-executed.

The Dianping content platform processes millions of pieces of content every day, involving millions of business-scenario executions and up to 100 million logical-node executions. Business logs are scattered across different applications, and logs of different content, scenarios, nodes, and executions are mixed together; both log collection and site restoration are cumbersome and time-consuming. Traditional business tracking solutions are increasingly unsuited to the content platform.

The Dianping content platform urgently needed a new solution for efficient business tracking, so we built visualized full-link log tracking. The related practices and results are introduced below.

3.2 Practice and results

3.2.1 Practice

The Dianping content platform is a complex business system supporting many external business scenarios. By sorting out and abstracting these scenarios, multiple business logic links such as real-time access, manual operation, task import, and distribution recalculation can be defined. Since the platform involves many internal services and downstream dependencies and handles a huge volume of content processing every day, a large amount of log data is generated as the business executes; meanwhile, link reporting requires modifying many services. On top of the general full-link log tracking solution, the Dianping content platform therefore carried out the following specific practices.

(1) Support the reporting and storage of large data volume logs

The Dianping content platform implements the log reporting architecture shown in Figure 12, supporting unified log collection, processing, and storage for many services, which supports the construction of log tracking under large data volumes well.

Figure 12 Log reporting architecture of the Dianping content platform

Log collection: each application service asynchronously reports log data, which the log_agent deployed on each machine collects and transmits to a Kafka channel in a unified way. For the small number of services that do not support log_agent, a transit application (shown in the figure) is built.

Log parsing: the collected logs are fed from Kafka into Flink for unified parsing and processing. Logs are classified and aggregated by log type and parsed into link logs, node logs, and business logs.

Log storage: after parsing, the logs are stored according to the tree storage model. Based on an analysis of the storage requirements and the characteristics of the candidate storage options, the Dianping content platform finally chose HBase for storage.

Requirement analysis:

  • OLTP workload: logical-link data is read and written in real time.
  • Large data volume: the number of records is massive and will keep growing.
  • Write-intensive: the peak QPS of log reporting is high.
  • Simple access pattern: simple reads and writes meet the needs.

Selection advantages of HBase:

  • Storage: supports horizontal scaling and rapid capacity expansion.
  • Field query: supports exact-match and prefix-match queries, with fast random access.
  • Economic cost: low storage-media cost.

Overall, the log reporting and storage architecture of log_agent + Kafka + Flink + HBase supports complex business systems well: it naturally handles log reporting from many applications in a distributed setting and suits high-traffic data writing.
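To make the parsing step concrete, below is a minimal Flink sketch that reads the reported logs from Kafka and splits them into link, node, and business logs via side outputs. The broker address, topic name, and the "logType" marker are assumptions for illustration; the real job's classification logic and sinks (writing into the HBase storage model) are not shown.

  import org.apache.flink.api.common.eventtime.WatermarkStrategy;
  import org.apache.flink.api.common.serialization.SimpleStringSchema;
  import org.apache.flink.connector.kafka.source.KafkaSource;
  import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
  import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
  import org.apache.flink.streaming.api.functions.ProcessFunction;
  import org.apache.flink.util.Collector;
  import org.apache.flink.util.OutputTag;

  public class TraceLogParsingJob {
      // Side-output tags for two of the three log types produced by parsing.
      static final OutputTag<String> NODE_LOGS = new OutputTag<String>("node-log") {};
      static final OutputTag<String> BIZ_LOGS  = new OutputTag<String>("business-log") {};

      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

          KafkaSource<String> source = KafkaSource.<String>builder()
                  .setBootstrapServers("kafka:9092")          // hypothetical address
                  .setTopics("trace-logs")                    // hypothetical topic
                  .setGroupId("trace-log-parser")
                  .setStartingOffsets(OffsetsInitializer.latest())
                  .setValueOnlyDeserializer(new SimpleStringSchema())
                  .build();

          SingleOutputStreamOperator<String> linkLogs = env
                  .fromSource(source, WatermarkStrategy.noWatermarks(), "log-source")
                  .process(new ProcessFunction<String, String>() {
                      @Override
                      public void processElement(String raw, Context ctx, Collector<String> out) {
                          // Classify by a type marker assumed to exist in the raw line.
                          if (raw.contains("\"logType\":\"node\"")) {
                              ctx.output(NODE_LOGS, raw);
                          } else if (raw.contains("\"logType\":\"business\"")) {
                              ctx.output(BIZ_LOGS, raw);
                          } else {
                              out.collect(raw);               // link logs on the main stream
                          }
                      }
                  });

          // Each classified stream would then be written to the tree storage model in HBase.
          linkLogs.print();                                   // placeholder sinks
          linkLogs.getSideOutput(NODE_LOGS).print();
          linkLogs.getSideOutput(BIZ_LOGS).print();

          env.execute("trace-log-parsing");
      }
  }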

(2) Realize the low-cost transformation of many back-end services

The Dianping content platform implements a custom log toolkit (the TraceLogger toolkit in Figure 13 below) that hides the reporting details of link tracking and minimizes the cost of transforming the many services. Features of the TraceLogger toolkit include:

  • Mimics slf4j-api: the toolkit is implemented on top of the slf4j framework and exposes the same API as slf4j-api, so there is no learning cost for users.
  • Shields internal details: it encapsulates the whole link-log reporting pipeline, hiding details such as coloring, which reduces users' development cost.

    • Reporting judgment:

      • Link identifier check: when no identifier is present, fallback logs are reported to prevent log loss.

      • Reporting-method check: when the identifier is present, two reporting methods are supported, log file and RPC transfer.

    • Log assembly: implements parameter placeholders, exception-stack output, and so on, and assembles the related data into Trace objects for unified collection and processing.

    • Exception reporting: exceptions are actively reported through an ErrorAPI, compatible with the ErrorAppender of the original log reporting.

    • Log reporting: adapts to the Log4j2 logging framework to perform the final log write.

Figure 13 TraceLogger log toolkit

Below are usage examples of the TraceLogger toolkit for reporting business logs and node logs respectively; the overall transformation cost is low.

Business log reporting: no learning cost and essentially no transformation cost.

  // Before: original log reporting
  LOGGER.error("update struct failed, param:{}", GsonUtils.toJson(structRequest), e);
  // After: full-link log reporting
  TraceLogger.error("update struct failed, param:{}", GsonUtils.toJson(structRequest), e);

Node log reporting : supports both API and AOP reporting methods, which are flexible and low-cost.

  public Response realTimeInputLink(long contentId) {
    // Link start: pass the series identifier (business id + scenario id + execution id)
    TraceUtils.passLinkMark("contentId_type_uuid");
    // ...
    // Local call (node logs reported via the API)
    TraceUtils.reportNode("contentStore", contentId, StatusEnums.RUNNING);
    Response structResp = contentStore(contentId);
    TraceUtils.reportNode("contentStore", structResp, StatusEnums.COMPLETED);
    // ...
    // Remote call
    Response processResp = picProcess(contentId);
    // ...
    return processResp;
  }

  // Node log reported via AOP
  @TraceNode(nodeName = "picProcess")
  public Response picProcess(long contentId) {
    // Image-processing business logic
    // Business log data reporting
    TraceLogger.warn("picProcess failed, contentId:{}", contentId);
    // ... (actual processing result elided)
    return null;
  }

3.2.2 Results

Based on the above practices, the Dianping content platform has realized visualized full-link log tracking: the execution of any piece of content in any business scenario can be tracked with one click, and the execution site can be restored through the visualized link. The tracking effect is shown in the figures below:

[Link query function] : Query the execution of all logical links of the content in real time according to the content id, covering all business scenarios.

Figure 14 Link query

[Link display function] : Visually display the panorama of business logic through the link diagram, and at the same time display the execution status of each node.

Figure 15 link display

[Node details query function] : Supports displaying details of any executed node, including node input, output, and key business logs during node execution.

Figure 16 Node details

At present, the visualized full-link log tracking system has become a "troubleshooting tool" of the Dianping content platform, reducing the time spent on troubleshooting from hours to under 5 minutes. It also serves as a "testing aid": visualized log concatenation and display significantly improve the efficiency of RD self-testing and QA testing. To conclude, the advantages of visualized full-link log tracking are:

  • Low access cost: DSL configuration plus a simple log reporting transformation enables fast access.
  • Wide tracking range: all logical links of any piece of content can be tracked.
  • High usage efficiency: the management console supports visualized query and display of links and logs, simple and fast.

4. Summary and Outlook

With the increasing complexity of distributed business systems, observability is becoming more and more important to their stable operation [6]. As a complex business system in the content pipeline, the Dianping content platform has carried out full-link observability construction to ensure the stability and reliability of content circulation, exploring and building in the three specific directions of Logging, Metrics, and Tracing.

One outcome is the "visualized full-link log tracking" described in this article. Combining Logging and Tracing, it proposes a new general solution for business tracking: during the business execution phase, logs are organized and strung together dynamically against the complete business logic, replacing the inefficient, after-the-fact manual log concatenation of traditional solutions, and ultimately achieving efficient tracking of whole business processes and efficient localization of business problems. In the Metrics direction, the Dianping content platform has also put "visualized full-link metric monitoring" into practice: it displays the system's key business and technical metrics in real time and in multiple dimensions, supports the corresponding alerting and anomaly-attribution capabilities, and gives effective control over the overall operating status of the business system.

In the future, the Dianping content platform will continue to invest in this area, building an observability system [7] that covers alerting, overview, troubleshooting, and analysis, and will keep accumulating and sharing general solutions, hoping to provide some reference and inspiration for the observability construction of other business systems, especially complex ones.

5. References

6. About the author

Haiyou, Huaiyu, Yaping, Lisen, and others are all from the Dianping Division / Content Platform technical team and are responsible for building the Dianping content platform.

Source: blog.csdn.net/star1210644725/article/details/129895877