How does Baidu's search mid-platform, serving tens of billions of requests, build its observability?

Guide: The Baidu search mid-platform not only carries Aladdin search traffic, but is also committed to building search capabilities for all kinds of vertical businesses. As the business has grown, the system's traffic has reached the tens-of-billions scale. Behind that traffic sit thousands of microservice modules and hundreds of thousands of instances. To keep such a complex system highly available, high-performing, and controllable, observability across all elements and dimensions has become a key capability of the search mid-platform.

This article first introduces what observability is and why it deserves extra attention in the cloud-native era, and then explains how the search mid-platform builds real-time metrics monitoring (Metrics), distributed tracing (Traces), log query (Logs), and topology analysis (Topos).



1. Cloud Native and Observability

1) What is observability

Everyone is familiar with monitoring: wherever there is a system, monitoring is needed to help us notice problems in it. As the industry's traditional technical architectures move toward cloud-native architectures, observability is mentioned on more and more occasions; works such as Distributed Systems Observability and Monitoring in the Time of Cloud Native interpret observability for distributed systems, and the CNCF Cloud Native Definition 1.0 also lists observability as an important characteristic of cloud-native architecture.

Observability is a superset of monitoring. Monitoring focuses on changes in, and alarms on, specific indicators. Observability must not only give a high-level overview of the operating status of every link in a distributed system, but also provide detailed analysis of the system's links when problems occur, so that both developers and operations engineers can "understand" all of the system's behavior.

The basic elements of observability that are widely adopted in the industry today are:

  • Metrics Monitoring (Metrics)

  • Distributed tracing (Traces)

  • Log query (Logs)

After some practice we added a fourth element: topology analysis (Topos). Distributed tracing examines the complete link of a single request from a microscopic point of view, whereas topology analysis looks at problems from a macroscopic point of view. For example, when the QPS of a service rises to several times its usual level, locating the source of the abnormal traffic relies on a topology analysis tool.

2) The necessity of observability under a cloud-native architecture

In the cloud-native era, traditional service architectures and the R&D and operations model are undergoing a paradigm shift. Technologies such as microservices, containerization, and FaaS (serverless) have fundamentally changed how applications are developed and operated. While cloud-native architecture brings an exponential improvement in business iteration efficiency, it also creates new challenges. The move from monolithic applications to microservices decentralizes what used to be a focused system, the complexity of the connections between services grows rapidly, and our grasp of the overall system gradually weakens. In this situation, quickly locating anomalies and keeping the system clear and visible becomes an urgent problem.

 

2. The challenges we face

1) Large system scale

Take log tracing: the search mid-platform's daily request volume has reached tens of billions. If we used a conventional solution (such as the approach in Dapper) and placed the logs in centralized storage, we would have to pay for hundreds of machines, which is very expensive. Some teams sample, or record logs only for failed requests; both approaches have obvious problems in the search mid-platform's scenario: 1. sampling cannot guarantee coverage of online cases; 2. it is difficult to reliably identify bad requests, and users still need traces for some seemingly normal requests (such as false-recall problems).

Similarly, for indicator aggregation, optimizing resource usage and timeliness at such an enormous system scale is also very challenging.

2) Observation requirements shift from applications to scenarios

As the business scenarios in the search mid-platform keep multiplying, our observation perspective has also changed. We used to focus mainly on application-level information, but a single application may now contain dozens of business scenarios whose traffic scales differ completely. If we only watch application-level indicators, the upper layer may not notice when a particular scenario becomes abnormal. The figure below shows a typical example: because its traffic is small, scenario 3 is not reflected in the application-level indicators, so no alarm fires when it misbehaves. Indicators for such subdivided scenarios can also support upper-layer decisions. For example, of two scenarios, one may load synchronously and the other asynchronously, so their timeout requirements differ; scenario-level indicators can guide this kind of fine-grained control.
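As a small illustrative sketch (the label scheme and the alarm rule below are our own assumptions, not the platform's), keeping a scenario label on every indicator is what lets a low-traffic scenario alarm independently of the application-level total:

```python
from collections import defaultdict

# Requests counted in the current one-minute window, keyed by (application, scenario).
# "_total" is the application-level view that the old monitoring relied on.
window_counts = defaultdict(int)

def on_request(app, scenario):
    window_counts[(app, scenario)] += 1
    window_counts[(app, "_total")] += 1

def dropped_scenarios(baselines):
    """Return the (app, scenario) keys whose traffic fell below half of their baseline.

    A small scenario can alarm here even though its drop barely moves (app, "_total").
    """
    return [key for key, base in baselines.items()
            if window_counts[key] < 0.5 * base]
```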

However, subdividing from applications to scenarios causes the system's indicator count to balloon into the millions, which becomes a new challenge for indicator aggregation and computation.

3) Macro analysis of topological links

Under a cloud-native architecture, the connections between applications become ever more complex. Distributed tracing can help us locate the problem behind a specific request, but when the system exhibits macro-level problems such as a sharp rise in traffic, an increase in 97th-percentile latency, or a rising rejection rate, we need topology analysis tools to locate them. Topology analysis also strongly guides upper-level decisions. Take the example on the right side of the figure below: commodity search has two scenarios, and the first involves an operational campaign expected to add 300 QPS of traffic. Without a topology analysis tool, it is hard to evaluate how much capacity headroom each service has.



3. What we have done

Over the past year we explored and put into practice the four elements of observability and released a full-element observation platform, which provides a strong guarantee for the availability of the search mid-platform.

1) Log query and distributed tracing

As the business has grown, the overall log volume of the search mid-platform has reached the PB level. Dumping the log data into offline storage and then indexing it would bring a huge resource overhead, so we adopted a different solution that combines online and offline: store a small amount of seed information offline, and reuse the online logs in place (at essentially zero extra cost).

The specific method is:

  1. At the traffic entry layer, store each request's logid, IP, and access timestamp in a KV store.

  2. When a user retrieves by logid, look up the corresponding IP and timestamp in the KV store.

  3. Use the IP and timestamp to fetch the complete log entry from that instance.

  4. Parse the log with rules to obtain the IPs and timestamps of the downstream instances.

  5. Repeat steps 3 and 4 as a breadth-first traversal to obtain the complete call-link topology (a sketch of this traversal follows the list).
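Below is a minimal sketch of this retrieval flow, assuming hypothetical helpers; `kv_get`, `fetch_log_entry`, and `parse_downstream` are illustrative stand-ins for the entry-layer KV store, the per-instance log fetch, and the rule-based log parser, not actual platform interfaces.

```python
from collections import deque

def trace_by_logid(logid, kv_get, fetch_log_entry, parse_downstream):
    """Breadth-first reconstruction of a request's call-link topology.

    kv_get(logid)                  -> (entry_ip, entry_ts) seed stored at the traffic entry layer
    fetch_log_entry(ip, ts, logid) -> the raw log line retrieved from that instance
    parse_downstream(log_line)     -> list of (downstream_ip, downstream_ts)
    All three callables are assumptions standing in for real services.
    """
    entry_ip, entry_ts = kv_get(logid)                   # step 2: look up seed info in the KV store
    edges = []                                           # (parent_ip, child_ip) call edges
    queue = deque([(None, entry_ip, entry_ts)])
    visited = set()

    while queue:
        parent_ip, ip, ts = queue.popleft()
        if (ip, ts) in visited:
            continue
        visited.add((ip, ts))
        log_line = fetch_log_entry(ip, ts, logid)        # step 3: pull the complete log from the instance
        if parent_ip is not None:
            edges.append((parent_ip, ip))
        for child_ip, child_ts in parse_downstream(log_line):   # step 4: rule-based parsing
            queue.append((ip, child_ip, child_ts))               # step 5: breadth-first expansion
    return edges
```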

But one problem remains: tracing can take a long time.

Each instance has to grep its own log files in full, and when the log files are large and the request link is long, tracing becomes slow and stability suffers. We therefore apply the idea of a dynamic N-point search over time: using the request's timestamp and the fact that log files are ordered by time, we can narrow down the position very quickly.

The following figure gives an example: the log file is the one for 20:00, and we need to find a request logged at 20:15. Since 15 minutes is exactly 1/4 of an hour, we first fseek to the 1/4 position of the file. Suppose the log line at that position turns out to be from 20:13; the remaining part of the file then covers 47 minutes of data, so we shift down by 2/47 of it and fseek again. Repeating this process quickly converges on the detailed log entry we are looking for.
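A simplified sketch of this time-based narrowing, under the assumption that every log line begins with an epoch-second timestamp; the line format, the 1 s tolerance, and the probe limit are illustrative assumptions rather than the real log layout.

```python
import os

def seek_by_time(path, target_ts, tolerance=1.0, max_probes=32):
    """Proportional ("dynamic N-point") search in a time-ordered log file.

    Assumes each line starts with an epoch-second timestamp such as "1620001500 ...".
    Returns a byte offset close to a line whose timestamp is near target_ts.
    """
    def line_at(f, offset):
        f.seek(max(offset, 0))
        if offset > 0:
            f.readline()                          # skip the partial line we landed in
        start = f.tell()
        line = f.readline()
        if not line:                              # ran past the end of the file
            return float("inf"), start
        return float(line.split(maxsplit=1)[0]), start

    size = os.path.getsize(path)
    with open(path, "rb") as f:
        lo, hi = 0, size
        lo_ts, _ = line_at(f, 0)
        hi_ts, _ = line_at(f, max(size - 4096, 0))    # timestamp of a line near the end of the file

        for _ in range(max_probes):
            if hi_ts <= lo_ts:
                break
            # Probe proportionally to where target_ts sits in [lo_ts, hi_ts],
            # e.g. "20:15 is 1/4 of the hour, so fseek to the 1/4 position".
            guess = lo + int((hi - lo) * (target_ts - lo_ts) / (hi_ts - lo_ts))
            guess_ts, guess_off = line_at(f, min(max(guess, lo), hi - 1))
            if abs(guess_ts - target_ts) <= tolerance:
                return guess_off
            if guess_ts < target_ts:
                lo, lo_ts = guess_off, guess_ts
            else:
                hi, hi_ts = guess_off, guess_ts
        return lo
```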

This retrieval method converges very quickly: log retrieval on a single instance is kept within 100 ms, the impact on I/O is essentially negligible, and the user's overall retrieval time stays at the second level.

2) Metrics monitoring

Because our observation perspective has expanded from the application level to the scenario level, and the number of indicators has grown from the tens of thousands to the millions, we redesigned the monitoring architecture. The figure below shows the architecture after the upgrade.

The main idea is to embed a library in every online instance. This library collects all the indicator information and pre-aggregates it to a certain extent; a collector then polls the instances for their indicator data and writes the aggregated data to a TSDB. Note that this differs considerably from some indicator solutions in the industry: instance-level indicators are aggregated in the collector in real time and converted into scenario- or service-level indicators, and the instance-level indicators are then discarded rather than stored in the TSDB. Instance-level indicators have limited reference value, so we analyze how the application is running from the aggregated data.

On top of this architecture we optimized a great deal of computation and storage. From an indicator changing online to its appearing on the platform takes only about 2 s, with very light resource overhead. Take indicator aggregation as an example: the online instance only performs counter accumulation, while the collector keeps a snapshot of the previous scrape and computes the linear difference against the current one. The resource overhead on online instances is therefore negligible, and it is also easy to derive information such as QPS and latency.
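A sketch of the snapshot-and-difference step in the collector; the `scrape` callable, the label scheme, and the 10 s poll interval are assumptions used for illustration, not the platform's real components.

```python
class Collector:
    """Polls cumulative counters from instances, turns them into rates,
    and aggregates instance-level values into scenario/service-level values.

    `scrape(instance)` is an assumed RPC that returns a mapping like
    {("goods_search", "request_count"): 123456, ...} of cumulative counters.
    """

    def __init__(self, scrape, interval_s=10):
        self.scrape = scrape
        self.interval_s = interval_s
        self.last_snapshot = {}        # instance -> {(scenario, metric): cumulative count}

    def poll(self, instances):
        aggregated = {}                # (scenario, metric) -> rate summed over instances
        for inst in instances:
            current = self.scrape(inst)                   # the instance only ever accumulates
            previous = self.last_snapshot.get(inst, {})
            for key, value in current.items():
                delta = value - previous.get(key, 0)      # linear difference vs. the last scrape
                aggregated[key] = aggregated.get(key, 0.0) + delta / self.interval_s
            self.last_snapshot[inst] = current            # keep only the snapshot, not per-instance series
        return aggregated                                 # only scenario-level rates go to the TSDB
```

With request counters, the delta over the poll interval directly yields QPS; applying the same trick to an accumulated latency sum gives the average latency for that interval.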

Besides QPS and latency, we also optimized the way quantile latencies are computed.

The conventional way to compute a latency quantile is to sort the latencies of all requests and take the value at the quantile position. This consumes a lot of resources when the request volume is high, so we use a bucketing approach instead: buckets are divided by latency range, and when a request finishes, the counter of the corresponding latency bucket is incremented by 1. To compute a quantile, we first determine which bucket the quantile falls into, and then treat the data within that bucket as linearly distributed. From this idea, the formula shown below can be derived.
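The original figure with the derived formula is not reproduced here; a reconstruction consistent with the description above (bucket width w, per-bucket counts c_i, cumulative counts C_i, N total requests, quantile q) would be:

```latex
% Bucket k is the first bucket whose cumulative count reaches the target rank qN,
% i.e. C_{k-1} < qN <= C_k, where C_k = c_0 + c_1 + ... + c_k.
% Assuming latencies are linearly (uniformly) distributed inside bucket k:
\[
  t_q \;\approx\; k \cdot w \;+\; \frac{qN - C_{k-1}}{c_k} \cdot w
\]
```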

The advantage is low resource overhead and real-time computation; the disadvantage is some loss of precision, which depends on the bucket granularity. The search mid-platform uses a bucket size of 30 ms, so the error is generally within 15 ms, which is sufficient for performance observation.

3) Topology Analysis

Topology analysis is implemented on top of the metrics-monitoring mechanism described above.

First, the traffic is colored, and the coloring information is passed to each service via RPC. In this way, every span (note: "span" as defined in the Dapper paper) carries the scenario identifier and the name of its upstream span. The scenario identifier distinguishes traffic from different scenarios, and the span name together with the upstream span name establishes the parent-child relationship. These spans reuse the indicator-computation mechanism described above: storing the span information in the indicators yields the corresponding performance data. When a user provides a scenario identifier, the platform extracts all of its indicators and, according to the span information in them, stitches them into a complete call topology.
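A sketch of how the call topology could be stitched together from span-labelled indicators; the record layout with `scenario`, `span`, `parent_span`, and `qps` fields is an assumed schema for illustration, not the platform's real one.

```python
from collections import defaultdict

def build_topology(metric_records, scenario_id):
    """Stitch a call topology out of indicator records that carry span information.

    metric_records: iterable of dicts such as
        {"scenario": "goods_search", "span": "rank", "parent_span": "query", "qps": 1200}
    (an assumed shape). Returns parent span -> [(child span, qps), ...] edges
    for the given scenario identifier.
    """
    edges = defaultdict(list)
    for rec in metric_records:
        if rec.get("scenario") != scenario_id:        # the coloring identifies this scenario's traffic
            continue
        parent = rec.get("parent_span")
        if parent:                                    # span name + upstream span name => parent-child edge
            edges[parent].append((rec["span"], rec.get("qps", 0)))
    return dict(edges)
```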



4. Conclusion

That covers the four basic elements of observability. On top of them we have incubated many application products, such as historical snapshots, intelligent alarming, and rejection analysis, which help us find and analyze problems faster.

Of course, the observability work does not stop here. Relying on this observation system, we are also building adaptive, self-adjusting resilience mechanisms that can automatically tolerate and recover from anomalies, maximizing the system's vitality.



Original link: https://mp.weixin.qq.com/s/5R1vJBhN8KBE0Rj7cWOdgA




