01
Background
iQIYI's overseas back-end R&D team supports the back-end business for iQIYI's overseas PHONE/PCW/TV terminals. Besides the back-end services of the three terminals, its scope also covers the overseas points business, pop-up windows, the reservation system for various programs, and more. The team also maintains a series of infrastructure, such as the IQ background that quickly supports the product's operational configuration and experimentation demands; the strategy engine that helps product operations achieve refined operations; and the quality assurance platform that provides traffic playback and stress testing.
The stable operation of these businesses depends on a complete log system, so business code often prints many logs to help monitor the service and troubleshoot problems. However, log printing has a measurable impact on project performance. Plenty of material compares the logging performance of different frameworks, but the refined log printing SOP needed for engineering practice is missing. Furthermore, distributed systems are now the mainstream, and during log printing, stateless nodes independently record information for the same request, which leads to heavy information redundancy, wastes resources, and hurts service performance.
As shown above, service 1 calls service 3, then serially calls service 2, which in turn depends on service 3. In this case, service 1 records the details of its request to service 3, service 2 records the details of its request to service 3, and service 3 records all incoming requests, so a single link contains four copies of the same log.
To solve these problems, we carried out a dedicated project on optimizing log printing in distributed systems, consisting of two parts:
(1) Obtain quantitative data on how much logging actually costs in project performance, and refine a log printing SOP that helps improve project performance and serves as a reference for future log printing.
(2) Treat log printing as a link-global concern and implement stateful log records across distributed service nodes.
Together, these two parts reduce the resource and performance cost of log printing along a distributed link, improving system performance and reducing system overhead.
This article mainly shares our exploration, thinking, and practice in optimizing log printing for distributed systems.
02
Exploration and practice of log printing optimization in a single system
The most popular frameworks at present are log4j, log4j2, and logback. log4j2 is generally regarded as the successor of log4j, so we ran experimental comparisons between the most mainstream frameworks, log4j2 and logback 1.3.0, to obtain performance data and summarize best practices that give our business systems a process specification for printing logs.
2.1 Multi-dimensional comparison to obtain the best practice for log printing
We chose container deployment (2C4G resources) with the project deployed independently. The project contains an API that outputs logs of different sizes according to its input parameters.
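The core of that API can be sketched as follows. This is an illustrative sketch only; the class and method names are ours, not the project's:

```java
// Minimal sketch of the benchmark endpoint's core logic: build and emit a
// log payload of a caller-specified size. Names here are illustrative only.
public class LogPayloadBenchmark {

    /** Build an ASCII payload of exactly sizeBytes bytes. */
    static String payloadOf(int sizeBytes) {
        StringBuilder sb = new StringBuilder(sizeBytes);
        for (int i = 0; i < sizeBytes; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e.g. a 2KB log line, as used in the pressure tests below
        String line = payloadOf(2 * 1024);
        System.out.println(line.length()); // 2048
    }
}
```

Driving this endpoint at different concurrency levels while varying the payload size gives the data points discussed in the following subsections.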
2.1.1 Quantitative research on the printing performance of log4j2
log4j2 asynchronous
Log size: 2KB, asynchronous
log4j2 synchronous
Log4j2 pressure test conclusion
From the above data it is easy to generate the following graph.
From the above figure we can see:
When concurrency is low, log4j2's synchronous and asynchronous performance is consistent. Once concurrency rises past a certain point, asynchronous printing clearly outperforms synchronous printing.
Log4j2 asynchronous printing also has performance bottlenecks.
The bottleneck of log4j2's synchronous printing is IO: the IO volume per second is about 2635 × 2KB ≈ 5.15MB.
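For reference, asynchronous printing in log4j2 can be enabled per logger via `AsyncLogger`. A minimal configuration sketch (the appender name and file path are placeholders, and fully asynchronous loggers additionally need the LMAX Disruptor dependency on the classpath):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
  <Appenders>
    <RollingFile name="File" fileName="logs/app.log"
                 filePattern="logs/app-%d{yyyy-MM-dd}.log">
      <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
      <Policies>
        <TimeBasedTriggeringPolicy/>
      </Policies>
    </RollingFile>
  </Appenders>
  <Loggers>
    <!-- AsyncLogger hands events to a ring buffer instead of
         writing on the business thread -->
    <AsyncLogger name="com.example.biz" level="info" additivity="false">
      <AppenderRef ref="File"/>
    </AsyncLogger>
    <Root level="info">
      <AppenderRef ref="File"/>
    </Root>
  </Loggers>
</Configuration>
```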
2.1.2 Quantitative research on the printing performance of logback
logback asynchronous
logback synchronous printing
Logback pressure test conclusion
The following charts are easily generated from logback's pressure test data.
From the above figure we can see:
When concurrency is low, logback's synchronous and asynchronous performance is consistent. Once concurrency rises past a certain point, asynchronous printing clearly outperforms synchronous printing.
Logback asynchronous printing also has performance bottlenecks
The bottleneck of logback's synchronous printing is likewise IO: the IO volume per second is about 2650 × 2KB ≈ 5.2MB, roughly the same as log4j2's.
2.1.3 Comparison between logback and log4j2
synchronous comparison
Through the above table data, the following comparison chart can be obtained
asynchronous comparison
Through the above table data, the following comparison chart can be obtained
From the above we can see that:
Whether synchronous or asynchronous, once concurrency exceeds a certain threshold, logback performs better than log4j2.
Within a certain range of concurrency, the performance of logback is comparable to that of log4j2.
2.1.4 Quantitative comparison of the performance of logback in different scenarios
From the conclusions in Section 2.1.3: whether synchronous or asynchronous, below the concurrency threshold logback and log4j2 perform equivalently, and above it logback performs better. Therefore, we explore logback's performance in different scenarios in more detail below.
Interface performance data for logback synchronously printing logs of different sizes at the same concurrency
Concurrency = 100, synchronous
Interface performance data for logback asynchronously printing logs of different sizes at the same concurrency
Concurrency = 100, asynchronous
Through the above data, the following relationship diagram can be drawn
From the above figure we can see:
If the log size stays within a certain range, it has no impact on performance; beyond a certain limit, performance drops significantly.
The synchronous printing data shows that when concurrency is constant there is a bottleneck in IO volume: the IO volume per unit time does not grow proportionally as each log entry grows. In our experiment the bottleneck sits around 160MB.
Logback's asynchronous printing is less sensitive to log size, because once the asynchronous queue fills up, a strategy of discarding business logs can be adopted.
2.2 Summary of Best Practices
Prefer logback as the log output framework to reduce the impact of log printing on project performance.
Under high concurrency, when business logs are not strictly necessary, print asynchronously with logback.
When business logs are a strong dependency, pay special attention to logback's neverBlock = true setting for asynchronous printing (a full queue then discards events instead of blocking the business thread). In that case, if a single request of the project prints less than 2KB of logs, the project's IO should not exceed about 5MB per second.
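The asynchronous setup discussed above looks roughly like the following logback sketch (appender names and the file path are placeholders):

```xml
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>logs/app.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>logs/app-%d{yyyy-MM-dd}.log</fileNamePattern>
    </rollingPolicy>
    <encoder>
      <pattern>%d %-5level %logger{36} [%thread] %msg%n</pattern>
    </encoder>
  </appender>

  <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <queueSize>1024</queueSize>
    <!-- 0 = do not discard TRACE/DEBUG/INFO events early as the queue fills -->
    <discardingThreshold>0</discardingThreshold>
    <!-- neverBlock=true: when the queue is full, drop the event rather than
         block the business thread -->
    <neverBlock>true</neverBlock>
    <appender-ref ref="FILE"/>
  </appender>

  <root level="INFO">
    <appender-ref ref="ASYNC"/>
  </root>
</configuration>
```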
2.3 Engineering project optimization
Based on the above best practices, we selected a project of iQIYI's overseas back-end R&D team for pilot transformation and analyzed performance changes.
2.3.1 Project introduction
iQIYI's overseas back-end R&D team is responsible for iQIYI's overseas PHONE/PCW/TV back-end business, mainly including the TOC service that stably delivers page business data, the flexible, efficient, and scalable IQ operation background, the strategy engine for refined operations, and the data center service that synchronizes program data.
We chose a TOC service with a non-card structure as the pilot, because the project contains many APIs and many log printing scenarios. The project is deployed in Singapore in 4C8G containers, with a peak single-machine QPS of around 120.
2.3.2 Performance optimization results
After the asynchronous transformation, P99 dropped from 78.8ms to 74ms, and P999 dropped from 180ms to 164.5ms.
Different projects differ; for projects with higher single-machine traffic or larger log volumes, we believe the asynchronous transformation will bring an even greater performance improvement.
2.4 Summary
The main work of single-machine log performance optimization was to measure the performance of different log frameworks, and of the same framework in different scenarios; we hope this data helps colleagues facing the same dilemma. In addition, we standardized how logs are printed: by grading business SLAs, a log's level now explains why it was recorded and, for exceptions, how severe they are, so that all business colleagues learn the urgency of different alarms in time, which helps with prioritization and process-based response.
03
Application of Distributed Variable Sharing in Log Printing
The chapters above introduced log printing optimization for a single system, but today's systems are basically distributed. What, then, are the pain points of log printing in a distributed system, and how can they be solved? Below is our thinking and practice on optimizing link-level log printing in distributed systems.
3.1 Introduction
Looking at the evolution of Internet technology, the move from monolithic to distributed systems is a defining feature. Still, while a change is advantageous overall, new challenges always appear in the details. Monolithic systems have advantages that distributed systems lack, such as local transactions, shared code, and shared variables. From the perspective of log printing, a monolithic system can sink calls to the same service: because the log records live in the same project, they can be viewed together and even coordinated. In a distributed system, functional modules are usually split across different development teams, so the log printing of different service nodes usually cannot be coordinated. As a result, essentially every team logs as much as it can, the logs of the whole link become highly redundant, resources are wasted, and the performance of the distributed system degrades.
3.2 Introduction to the full link tracking system
In a distributed system, serving an external request often requires multiple internal modules, multiple pieces of middleware, and multiple machines calling each other, some serially and some in parallel. How, then, can we determine which applications, modules, and nodes the entire request invoked, in what order, and with what performance at each step? Link tracing answers these questions: it reconstructs a distributed request into a call link and centrally displays the call status of the request, for example the time spent on each service node, which machine each request reached, and the request status of each service node. For example, the architecture diagram of zipkin, one such full-link system, is as follows:
The full-link tracing system supports link log sampling and transparent variable passing. We borrow this full-link design idea to optimize our distributed system.
3.3 The use of shared variables for log sampling
3.3.1 Background introduction
This problem is concentrated in our strategy engine system, which was designed and implemented for refined operations: it identifies the portraits of different user groups and deploys different strategies accordingly. As the system landed, the number of connected businesses grew rapidly; at present the CARD business of iQIYI's overseas page data, the pop-up window business, the advertising business, the interactive marketing business, the recommendation business, and the navigation and journey business are all connected. These systems have certain dependencies and call each other. Because business needs dictate that the strategy engine is called via POST, and the gateway log cannot parse POST request parameters, each business must record the details of every request itself. In addition, the strategy engine strongly relies on user portrait data, which is stored in the BI and Facebook services respectively. Past experience shows that about 90% of strategy-miss cases are caused by user portrait data not being updated in time, so to simplify troubleshooting, the strategy engine records the user portrait data requested for every user. The strategy engine's QPS is very high, so the daily log volume is about 150GB. How to elegantly optimize this part of the log was a pain point for us.
The traffic of the strategy engine mainly comes from pop-up windows, advertisements, top navigation, IQ, recommendation, and interactive marketing.
However, we found that some traffic has obvious link characteristics. When a client request resolves the top navigation, the associated page is fetched next.
A page may be associated with several sets of data; in this case, the strategy engine must be requested to obtain the page that matches the user's portrait.
A page is associated with multiple sets of cards; the strategy engine must then be requested to obtain the card that matches the user's portrait.
A card is in turn associated with different business data, including the interactive marketing and recommendation services, and both of these request the strategy engine for data matching the crowd portrait.
The analysis shows that for a single user request, each of the microservices involved requests the strategy engine separately, yet within the request's life cycle, the user's request data and user portrait data are certainly identical.
3.3.2 Solutions
Given the above analysis, a natural idea is to let the strategy engine record uniformly: the strategy engine adds a flag to the TraceContext indicating whether a record has already been made within the request's life cycle; if so, it does not record again, and if not, it records.
As shown in the figure above, after service 1 requests the strategy engine, a flag is added to the TraceContext indicating that this trace has been logged. When subsequent nodes, namely service 2 and service 3, request the strategy engine again, there is no need to record again. This removes a large amount of logging.
However, there is a problem: if a 5xx error occurs, the strategy engine fails, the corresponding request is not recorded, and the request record is lost, making troubleshooting very difficult.
After comparative analysis, we finally determined that sharing variables across the distributed system solves these problems well.
Requests 1, 2, 4, 5, 7, and 8 above all check the logBusiness field of the TraceContext: if it is set, nothing needs to be recorded; if not, the node records the log and sets logBusiness to true.
For example, suppose request 1 succeeds with a 200. After request 1 arrives at the strategy engine and finds that the logBusiness field of the TraceContext is false, the engine records the request and sets logBusiness to true, so the subsequent requests 4, 5, 7, and 8 need not log again. Even if exceptions occur in 4, 5, 7, or 8, the scene of the strategy engine request can still be restored through the traceId and the record made at request 1.
Another example: request 1 fails with a 499 timeout. Because the timeout is transparent to the server, the strategy engine continues executing, prints the log, and sets the logBusiness field of the TraceContext to true; but since the response never reaches service 1, service 1 redundantly records a copy of the data. At request 4, the logBusiness in the TraceContext passed downstream is still false, so the request is recorded again and logBusiness is set to true; the subsequent request 7 then records nothing.
Similarly, if request 1 fails with a 5xx, then request 1 is recorded, and request 4 is recorded again.
Therefore, this method minimizes link logs while preserving the ability to fully restore the log link.
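The first-writer-wins check described above can be sketched as follows. This is a simplified illustration: here the shared flag lives in an in-process map keyed by traceId, whereas in the real system the logBusiness flag travels with the TraceContext across services; all names are ours:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of "log once per trace": only the first node on a trace records. */
public class TraceLogDeduper {

    // Stand-in for the logBusiness flag carried by the TraceContext.
    private final Map<String, Boolean> logBusinessByTrace = new ConcurrentHashMap<>();
    private final AtomicInteger logsWritten = new AtomicInteger();

    /** Record request details only if no node on this trace has done so yet. */
    public void logIfFirst(String traceId, String requestDetails) {
        // putIfAbsent returns null only for the first caller on a given trace,
        // so concurrent callers cannot both decide to log.
        if (logBusinessByTrace.putIfAbsent(traceId, Boolean.TRUE) == null) {
            logsWritten.incrementAndGet();
            System.out.println("trace=" + traceId + " details=" + requestDetails);
        }
    }

    public int logsWritten() {
        return logsWritten.get();
    }

    public static void main(String[] args) {
        TraceLogDeduper deduper = new TraceLogDeduper();
        // Services 1, 2, and 3 all hit the strategy engine within one trace:
        deduper.logIfFirst("trace-200", "user portrait payload");
        deduper.logIfFirst("trace-200", "user portrait payload");
        deduper.logIfFirst("trace-200", "user portrait payload");
        // Only the first call wrote a log.
        System.out.println(deduper.logsWritten());
    }
}
```

In production, the flag must be propagated with the trace (and, as described above, an un-propagated flag after a 499 timeout simply causes one redundant record rather than a lost one).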
3.3.3 Summary
We optimized the strategy engine service accordingly, reducing its logs from 150GB/day to 30GB/day. The queue resources consumed by the Flink collection and log processing tasks dropped accordingly.
04
Summary and Outlook
This article introduced our exploration and practice of log printing optimization for distributed systems, covering both single-system and distributed log optimization. We compared the performance of the current mainstream log frameworks, obtained data, formed best practices, provided a standard solution for how our business projects print logs, and described the benefits one team project gained from the improved printing method. For a specific scenario, we gave an innovative solution to the statelessness of distributed log printing, solving the problem of repeated logging across distributed nodes. One more point worth mentioning: after this series of reflections, we believe distributed shared variables have broad application prospects. Besides the stateful log printing introduced here, the method suits scenarios with zero or weak tolerance for data inconsistency, helping reduce traffic pressure on bottleneck services and improve link performance; of course, this requires the cooperation of data compression and decompression algorithms. In future practice, we will have the opportunity to share our thinking and practice with peers.