Exploration and Practice of a Log Printing Optimization Scheme for Distributed Systems

01

   Background

iQIYI's overseas back-end R&D team supports the back-end business of iQIYI's overseas PHONE/PCW/TV clients. Beyond the back-end services of these three clients, the team is also responsible for the overseas points business, pop-up windows, the reservation system for various programs, and so on. It also maintains a series of infrastructure components, such as the IQ operations backend that quickly supports the product's various operational configuration and experimentation demands, the strategy engine that helps product operations achieve refined operations, and the quality assurance platform that provides traffic replay and load testing.

The stable operation of these businesses depends on a complete logging system, so business code often prints a large number of logs to help monitor the running services and troubleshoot problems. However, printing logs has a measurable impact on project performance. We reviewed a lot of material and found plenty of performance comparisons between different logging frameworks, but there is a lack of the refined log printing SOPs needed for engineering practice. Furthermore, distributed systems are now the mainstream; when printing logs, the nodes are stateless and each independently records the information of the same request, which leads to a great deal of redundant information, wasted resources, and degraded service performance.

[Figure: call relationships among service 1, service 2, and service 3]

As shown above, service 1 calls service 3 and then calls service 2 serially, and service 2 in turn depends on service 3 internally. In this case, service 1 records the details of its request to service 3, service 2 records the details of its request to service 3, and service 3 records every request it receives, so a single link ends up containing four copies of essentially the same log.

To solve the above problems, we carried out a dedicated project to optimize log printing in distributed systems, which consists of two parts:

(1) Obtain quantitative data on the actual performance cost that logging imposes on a project, and refine the log printing SOP to help improve project performance and serve as a reference for future log printing.

(2) Consider log printing from the perspective of the whole call link, and implement stateful log recording across distributed service nodes.

Through these two parts, the resource and performance cost of log printing along a distributed link can be reduced, improving system performance and lowering system overhead.

This article mainly shares our exploration, thinking, and practice in optimizing log printing for distributed systems.

02

   Exploration and practice of log printing optimization in a single system

Currently the most popular logging frameworks are log4j, log4j2, and logback. Log4j2 is generally regarded as the upgrade of log4j, so we ran experimental comparisons between the two most mainstream frameworks, log4j2 and logback 1.3.0, obtained performance data, and summarized best practices to provide a process specification for log printing in our business systems.

2.1 Multi-dimensional comparison to obtain best practices for log printing

We chose container deployment with 2C4G resources and deployed the project independently. The project contains an API whose function is to output logs of different sizes according to its input parameters.
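A minimal sketch of what such a benchmark endpoint could look like, assuming Spring Boot and SLF4J; the path, parameter name, and padding logic are illustrative rather than taken from the actual project:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class LogBenchmarkController {

    private static final Logger LOG = LoggerFactory.getLogger(LogBenchmarkController.class);

    // Returns immediately; the only work is printing a log entry of the requested size,
    // so the pressure test mostly measures the cost of the logging framework itself.
    @GetMapping("/log")
    public String log(@RequestParam(name = "sizeKb", defaultValue = "2") int sizeKb) {
        char[] padding = new char[sizeKb * 1024];
        java.util.Arrays.fill(padding, 'x');
        LOG.info("benchmark payload: {}", new String(padding));
        return "ok";
    }
}
```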

2.1.1 Quantitative research on the printing performance of log4j2

log4j2 asynchronous

Log size: 2KB, asynchronous

[Figure: log4j2 asynchronous pressure test data]

log4j2 synchronous

[Figure: log4j2 synchronous pressure test data]

Log4j2 pressure test conclusion

From the above data it is easy to generate the following graph.

[Figure: log4j2 synchronous vs. asynchronous throughput under different concurrency levels]

From the above figure we can see:

  • When concurrency is low, log4j2's synchronous and asynchronous performance is essentially the same. Once concurrency grows beyond a certain point, asynchronous printing clearly outperforms synchronous printing.

  • log4j2 asynchronous printing also has a performance bottleneck.

  • The bottleneck of log4j2's synchronous printing is IO: the IO volume per second is about 2635 × 2KB ≈ 5.15MB.
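For reference, a minimal sketch of how all-asynchronous logging can be enabled in log4j2 via the AsyncLoggerContextSelector (this requires the LMAX Disruptor on the classpath); the exact configuration used in our pressure test is not reproduced here:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class AsyncLog4j2Demo {

    public static void main(String[] args) {
        // Must be set before the first Logger is obtained; it switches every logger to the
        // Disruptor-based asynchronous implementation. Alternatively this can be passed as
        // -DLog4jContextSelector=... on the JVM command line.
        System.setProperty("Log4jContextSelector",
                "org.apache.logging.log4j.core.async.AsyncLoggerContextSelector");

        Logger logger = LogManager.getLogger(AsyncLog4j2Demo.class);
        logger.info("this log event is handed off to a background thread");
    }
}
```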

2.1.2 Quantitative research on the printing performance of logback

logback asynchronous

[Figure: logback asynchronous pressure test data]

logback synchronous printing

[Figure: logback synchronous pressure test data]

Logback pressure test conclusion

The following chart can be generated from the logback pressure test data.

[Figure: logback synchronous vs. asynchronous throughput under different concurrency levels]

From the above figure we can see:

  • When concurrency is low, logback's synchronous and asynchronous performance is essentially the same. Once concurrency grows beyond a certain point, asynchronous printing clearly outperforms synchronous printing.

  • Logback asynchronous printing also has a performance bottleneck.

  • The bottleneck of logback's synchronous printing is also IO: the IO volume per second is about 2650 × 2KB ≈ 5.2MB, roughly the same as log4j2.

2.1.3 Comparison between logback and log4j2

synchronous comparison

[Table: synchronous printing performance of logback vs. log4j2]

Through the above table data, the following comparison chart can be obtained

[Figure: synchronous printing comparison chart of logback vs. log4j2]

asynchronous comparison

[Table: asynchronous printing performance of logback vs. log4j2]

Through the above table data, the following comparison chart can be obtained

[Figure: asynchronous printing comparison chart of logback vs. log4j2]

From the above we can see that:

  • Whether printing synchronously or asynchronously, once concurrency exceeds a certain threshold, logback outperforms log4j2.

  • Within a certain range of concurrency, the performance of logback is comparable to that of log4j2.

2.1.4 Quantitative comparison of the performance of logback in different scenarios

From the conclusions in Section 2.1.3, whether printing synchronously or asynchronously, logback and log4j2 perform equivalently while concurrency stays below a threshold, and logback outperforms log4j2 once concurrency exceeds it. Therefore, below we explore the performance of logback in different scenarios in more detail.

Interface performance data for logback synchronously printing logs of different sizes at the same concurrency

Concurrency = 100, synchronous

[Table: logback synchronous printing performance for different log sizes, concurrency = 100]

Interface performance data for logback asynchronously printing logs of different sizes at the same concurrency

Concurrency = 100, asynchronous

[Table: logback asynchronous printing performance for different log sizes, concurrency = 100]

Through the above data, the following relationship diagram can be drawn

[Figure: relationship between log size and interface performance for logback, concurrency = 100]

From the above figure we can see:

  • As long as the log size stays within a certain range, it has no impact on performance; beyond a certain limit, performance drops significantly.

  • From the synchronous printing data, when concurrency is held constant there is a bottleneck in IO throughput: the IO volume per unit time does not grow proportionally as the size of each IO increases. In our experimental data the bottleneck is around 160MB.

  • Logback asynchronous printing is less sensitive to log size, because once the asynchronous queue is full a strategy of discarding business logs can be adopted.

2.2 Summary of Best Practices

  • Prefer logback as the log output framework to reduce the impact of log printing on project performance.

  • Under high concurrency, when business logs are not strictly required, use logback to print asynchronously.

  • When business logs are strongly depended upon, pay special attention to the neverBlock = true setting when printing asynchronously with logback (with neverBlock = true the appender drops events instead of blocking the business thread when the queue is full). In this case, if a single request prints less than 2KB of logs, the project's IO volume per second should not exceed 5MB; see the configuration sketch below.
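As an illustration, a minimal programmatic logback setup showing where queueSize, discardingThreshold, and neverBlock are configured; real projects would normally express the same thing in logback.xml, and the file path and sizes below are placeholders rather than our production values:

```java
import ch.qos.logback.classic.AsyncAppender;
import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.FileAppender;
import org.slf4j.LoggerFactory;

public class AsyncLogbackConfig {

    public static void configure() {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

        PatternLayoutEncoder encoder = new PatternLayoutEncoder();
        encoder.setContext(context);
        encoder.setPattern("%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n");
        encoder.start();

        FileAppender<ILoggingEvent> file = new FileAppender<>();
        file.setContext(context);
        file.setName("FILE");
        file.setFile("logs/app.log");        // placeholder path
        file.setEncoder(encoder);
        file.start();

        AsyncAppender async = new AsyncAppender();
        async.setContext(context);
        async.setName("ASYNC");
        async.setQueueSize(8192);            // in-memory event queue
        async.setDiscardingThreshold(0);     // 0 = never discard by level; >0 drops TRACE/DEBUG/INFO when the queue is nearly full
        async.setNeverBlock(true);           // drop events instead of blocking the business thread on a full queue
        async.addAppender(file);
        async.start();

        Logger root = context.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME);
        root.addAppender(async);
    }
}
```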

2.3 Engineering project optimization

Based on the above best practices, we selected a project of iQIYI's overseas back-end R&D team for pilot transformation and analyzed performance changes.

2.3.1 Project introduction

iQIYI's overseas back-end R&D team is responsible for the overseas iQIYI back-end business of the PHONE/PCW/TV clients, mainly including the TOC service that stably delivers page business data, the flexible, efficient, and scalable IQ operations backend, the strategy engine for refined operations, and the data center service for synchronizing program data.

We chose a TOC service with a non-card structure as the pilot, because the project contains many APIs and many log printing scenarios. The project is deployed in Singapore in 4C8G containers, and its peak single-instance QPS is around 120.

2.3.2 Performance optimization results

[Figure: P99/P999 interface latency before and after the asynchronous logging transformation]

After the asynchronous transformation, P99 dropped from 78.8ms to 74ms, and P999 dropped from 180ms to 164.5ms.

Different projects will see different results; for projects with heavier single-machine traffic or a larger volume of printed logs, the asynchronous transformation is expected to bring greater performance gains.

2.4 Summary

The main work of single-system log performance optimization was to measure the performance of different logging frameworks and of the same framework in different scenarios; we hope this data can help colleagues facing the same dilemma. In addition, we standardized how logs are printed: by grading the business SLA, each log explains at the appropriate level why it was written and, if it records an exception, how severe that exception is, so that all business engineers can learn the urgency of different alarms in time, which helps them make priority judgments and respond in a process-driven way.

03

   Application of Distributed Variable Sharing in Log Printing

The chapters above covered log printing optimization for a single system, but today's systems are basically distributed, so what are the pain points of log printing in a distributed system, and how can they be solved? Below is our thinking and practice on optimizing link-level log printing in distributed systems.

3.1 Introduction

[Figure: evolution from monolithic systems to distributed systems]

Looking at the evolution of Internet technology, the move from monolithic systems to distributed systems is a defining feature. It is undeniable, however, that while such changes are advantageous overall, new challenges always appear in the details. Monolithic systems have advantages that distributed systems lack, such as local transactions, shared code, and shared variables. From the perspective of log printing, a monolithic system can push calls to the same service down into one place; because the log records live in the same project, they can be inspected together and conventions can be agreed upon. In a distributed system, functional modules are usually split across different development teams, so the log printing of different service nodes is usually not coordinated. As a result, every team logs as much as it can, the logs of the whole link become highly redundant, resources are wasted, and the performance of the distributed system suffers.

3.2 Introduction to the full link tracking system

In a distributed system, an external request often requires multiple internal modules, multiple pieces of middleware, and multiple machines calling one another to complete; some of these calls are serial and some are parallel. In this situation, how can we determine which applications, which modules, and which nodes the whole request touched, in what order, and how each part performed? Link tracing solves exactly these problems: it reconstructs a distributed request into a call link and displays the state of the request across all nodes in one place, for example how long each service node took, which machine the request reached, and the request status at each node. Taking zipkin, one of the full-link tracing systems, as an example, its architecture is as follows:

[Figure: zipkin architecture]

The full-link tracing system supports link log sampling and transparent transmission of variables. We drew on this full-link design idea to optimize our distributed system.

3.3 The use of shared variables for log sampling

3.3.1 Background introduction

This problem is concentrated in our strategy engine system. The strategy engine was designed and implemented to enable refined operations: it identifies the portraits of different user groups and applies different strategies accordingly. As the system landed, the number of connected businesses grew rapidly; it now serves the CARD business of iQIYI's overseas page data, the pop-up window business, the advertising business, the interactive marketing business, the recommendation business, and the navigation and journey business. These systems, however, have certain dependencies and call one another. Because of business requirements, requests to the strategy engine are POST requests, and the gateway log cannot parse POST parameters, so each business must record the details of every request itself. In addition, the strategy engine strongly depends on user portrait data, which is stored in the BI and Facebook services respectively. Experience shows that about 90% of strategy-miss failures are caused by user portrait data not being updated in time, so to make troubleshooting easier the strategy engine records the user portrait data of every request. The QPS of the strategy engine is very high, so it produces about 150G of logs per day. How to optimize this part of the logs elegantly was a pain point for us.

[Figure: traffic sources of the strategy engine]

The traffic of the strategy engine mainly includes pop-up windows, advertisements, top navigation, iq, recommendation, and interactive marketing.

However, we found that some traffic has obvious link characteristics. When a client request resolves the top navigation, the associated page is then fetched:

  • The page may have an associated set of data; in this case the strategy engine must be requested to obtain the page that matches the user's portrait.

  • The page may have multiple sets of cards associated with it; here the strategy engine must be requested to obtain the cards that match the user's portrait.

  • Each card is in turn associated with different business data, including the interactive marketing service and the recommendation service, and interactive marketing and recommendation themselves request the strategy engine to obtain the data that matches the user's portrait.

The analysis shows that for a single user request, each of the microservices involved requests the strategy engine separately, yet within the life cycle of that request the user's request data and the user portrait data are exactly the same.

3.3.2 Solutions

Based on the above analysis, the natural idea is to let the strategy engine record the log uniformly: the strategy engine adds an identifier to the TraceContext indicating whether the request has already been recorded within its life cycle; if it has, it is not recorded again, and if it has not, it is recorded.

[Figure: after service 1 requests the strategy engine, a logged identifier is added to the TraceContext]

As shown in the figure above, after service 1 requests the strategy engine, an identifier is added to the TraceContext indicating that this trace has already been logged. When subsequent nodes, namely service 2 and service 3, request the strategy engine again, the record does not need to be repeated. This removes a large amount of duplicate logging.

However, this raises a problem: if a 5xx error occurs, the strategy engine call fails and the corresponding request is never recorded; the request record is lost, which makes troubleshooting very difficult.

After comparative analysis, we finally determined that these problems can be well solved by sharing variables across the distributed system.

[Figure: numbered requests along the call link checking and setting the shared logBusiness flag in the TraceContext]

Requests 1, 2, 4, 5, 7, and 8 above all check the logBusiness field of the TraceContext: if it is already set, the request does not need to be recorded; if it is not, the log is recorded and logBusiness is set to true.

For example, if all requests return 200, then once request 1 reaches the strategy engine and finds that logBusiness in the TraceContext is false, it records the request and sets logBusiness to true; the subsequent requests 4, 5, 7, and 8 do not need to log again. Even if exceptions occur in 4, 5, 7, or 8, the scene at the strategy engine can still be reconstructed from the traceId and the record written for request 1.

Another example: suppose request 1 fails with a 499 timeout. Because the timeout is transparent to the server side, the strategy engine continues executing, prints the log, and sets logBusiness in the TraceContext to true; but since service 1 timed out, it redundantly records a copy of the data, and the logBusiness it propagates is still false. Request 4 therefore finds logBusiness still false, records the log again, and sets logBusiness to true, so the subsequent request 7 is not recorded.

Similarly, if request 1 fails with a 5xx, request 1 is recorded and request 4 is recorded again.

Therefore, this method minimizes the link logs while still preserving enough information to reconstruct the complete call link.
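A minimal sketch of the check-and-set logic described above. The TraceContext class, its baggage map, and the method names are illustrative placeholders, since the actual tracing implementation is not shown in this article:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LinkAwareRequestLogger {

    private static final Logger LOG = LoggerFactory.getLogger(LinkAwareRequestLogger.class);
    private static final String LOG_BUSINESS = "logBusiness";

    /** Hypothetical trace context: a traceId plus key-value baggage propagated across service calls. */
    public static class TraceContext {
        public final String traceId;
        public final Map<String, String> baggage = new ConcurrentHashMap<>();
        public TraceContext(String traceId) { this.traceId = traceId; }
    }

    /**
     * Called by the strategy engine for every incoming request. Only the first node in the
     * link records the full request details and user portrait; later nodes see logBusiness=true
     * and skip the duplicate record.
     */
    public void recordIfFirstInLink(TraceContext ctx, String requestDetails, String userPortrait) {
        if (Boolean.parseBoolean(ctx.baggage.get(LOG_BUSINESS))) {
            return; // already recorded earlier in this trace, skip the duplicate log
        }
        LOG.info("traceId={} request={} portrait={}", ctx.traceId, requestDetails, userPortrait);
        ctx.baggage.put(LOG_BUSINESS, "true"); // propagated downstream with the trace context
    }
}
```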

3.3.3 Summary

After optimizing the strategy engine service, the log volume dropped from 150G/day to 30G/day, and the queue resources consumed by the Flink collection and log processing tasks were reduced accordingly.

04

   Summary and Outlook

This article introduced our exploration and practice of log printing optimization from two angles: single-system log optimization and distributed-system log optimization. We compared the performance of the current mainstream logging frameworks, obtained data, formed best practices, provided a standard solution for how our business projects print logs, and described the gains one of our projects obtained by improving its log printing. For a specific scenario, we gave an innovative solution to the statelessness of distributed log printing, which solves the problem of different (or the same) distributed nodes logging the same information repeatedly. One more thing worth mentioning: after this round of thinking, we believe distributed shared variables have broad application prospects. Besides the stateful log printing introduced here, the approach can be used in scenarios with zero or weak tolerance for data inconsistency, and it is effective at reducing the traffic pressure on bottleneck services and improving link performance, although this requires the cooperation of data compression and decompression algorithms. We will have the opportunity to share our thinking and practice on this in the future.


