Hundreds of thousands of QPS: Baidu's stability assurance practice for hot-event search

Author | Wen Yan

Introduction 

In the Internet industry, business iterates rapidly and systems change frequently; long-running businesses in particular accumulate more and more historical baggage over time. As Baidu's vertical search product, Aladdin has been through many years of change and carries a heavy legacy, so its business clusters face great challenges when handling the large traffic of major events such as the college entrance examination (gaokao), the Tokyo Olympics, and the Beijing Winter Olympics. Take the college entrance examination as an example: Baidu has run gaokao services since 2013, and after 11 years of accumulation, the gaokao Aladdin now directly handles billions of PV from users searching for exam-related content. The complexity built up over the years poses a huge stability risk. To cope with the enormous traffic of these major events, multiple teams joined forces to quickly establish an assurance mechanism; this article summarizes that practice.

The full text is 3087 words, and the estimated reading time is 8 minutes.

01 Assurance approach

The traffic of a major event is both large and highly time-sensitive, so the assurance work must pay special attention to the system's ability to withstand instantaneous pressure. For example, the gaokao essay topics are a hot topic of public opinion every year, and the moments when each province announces exam scores and admission cutoffs produce instant spikes in interest. Hot moments in sports events such as the Olympics are unpredictable: sometimes it is a table tennis semi-final, sometimes an athlete in a less-followed sport suddenly wins gold. The users who follow an event are not just the millions of sports fans who are normally active; once a hot spot breaks out, a large number of new users arrive, producing unpredictable instantaneous traffic peaks. Users also have high expectations for the timeliness of our event data, and the best experience is data that updates in real time. Together these factors raise the bar for major-event assurance well above ordinary stability work. The conventional approach covers three aspects:

[Figure: the three aspects of assurance (fault discovery, fault management and control, and fault handling)]

02 Fault discovery

The conventional approach is first to sort out the business model, upstream and downstream dependencies, dependency strength, and data links in detail, while also investigating and fixing hidden risks and standardizing logs. For a major event, the dependency links unique to each hot event also need to be sorted out separately:

(1) Upstream and downstream owners: notify them of the expected hot spots and their scope of impact in advance, and prepare emergency plans.

(2) Dependent function points: clarify which business parties or architecture-provided capabilities each function point that users care about during a hot event relies on, and closely watch traffic trends before the hot spot occurs.

(3) Estimate the peak QPS of each hot event, derive the peak QPS of upstream and downstream services from the dependency strength and function points, prepare expansion and degradation plans in advance, and reserve resources for emergency expansion (a minimal estimation sketch follows this list).

(4) Check whether the code behind the core function points of each hot event is robust enough, sort out upstream and downstream bottlenecks, and prepare risk plans and degradation plans.

(5) Plan log governance so that the business and data-link failure points of a hot event can be located quickly, which also makes it easier to build targeted business monitoring and alarms.
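As a concrete illustration of item (3), below is a minimal Python sketch of the kind of back-of-the-envelope capacity estimate involved; the front-end peak, per-instance QPS limits, fan-out ratios, and redundancy factor are hypothetical placeholders, not Baidu's actual figures.

```python
# Minimal sketch of a peak-QPS capacity estimate (hypothetical numbers).
# Given an estimated front-end peak and each module's fan-out relative to that
# traffic, compute how many instances the module needs, with headroom reserved
# for emergency expansion.

ESTIMATED_FRONTEND_PEAK_QPS = 300_000        # assumed peak for a hot event

MODULES = {
    # module name: (fan-out vs. front-end traffic, single-instance QPS limit)
    "query_frontend": (1.0, 2_000),
    "event_data_service": (1.5, 1_500),      # downstream traffic amplification
    "redis_cache_proxy": (3.0, 10_000),
}

REDUNDANCY = 1.5                              # headroom for emergency expansion

def required_instances(peak_qps: int) -> dict:
    plan = {}
    for name, (fan_out, per_instance_qps) in MODULES.items():
        module_peak = peak_qps * fan_out
        plan[name] = int(module_peak * REDUNDANCY // per_instance_qps) + 1
    return plan

if __name__ == "__main__":
    for module, count in required_instances(ESTIMATED_FRONTEND_PEAK_QPS).items():
        print(f"{module}: ~{count} instances")
```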

[Figure: multi-dimensional monitoring system covering function, business, and other dimensions]

The second part of conventional assurance is building a multi-dimensional, multi-perspective monitoring system, such as the function and business monitoring shown in the figure above. For a major event we must not only detect faults in time, but also quickly perceive user-experience problems and track differences against official data in real time. We therefore set the content and frequency of monitoring around the hot events and the data timeliness of their core functions: keep the update interval from the data source as short as possible, configure real-time alarms and response plans for anomalies at every stage of data processing, and monitor end-to-end effect differences across multiple machine rooms in multiple regions in real time, achieving timeliness sensing and protection over the entire data link.
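Because data timeliness is central here, a minimal sketch of a freshness check may help; it assumes each data source or machine room reports the timestamp of its newest record, and the lag threshold is an illustrative value, not the one actually used.

```python
import time

# Minimal sketch of end-to-end timeliness monitoring.
# `latest_updates` maps each data source (or machine room) to the timestamp of
# its newest record; sources whose lag exceeds the threshold are reported so an
# alert can fire and a backup source or pipeline can be switched in.

LAG_THRESHOLD_SECONDS = 60   # assumed acceptable delay for "real-time" data

def stale_sources(latest_updates: dict[str, float],
                  now: float | None = None) -> list[str]:
    now = time.time() if now is None else now
    return [name for name, ts in latest_updates.items()
            if now - ts > LAG_THRESHOLD_SECONDS]

if __name__ == "__main__":
    demo = {"room_beijing": time.time() - 5,
            "room_nanjing": time.time() - 300}   # lagging source
    print(stale_sources(demo))                   # -> ['room_nanjing']
```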

03 Fault management and control

While detecting faults in time, we also need to effectively control the scope of their impact; this is fault management and control. Conventional measures include fault isolation, performance optimization, problem plans, and fault drills. During a major event, all of them revolve around the hot events.

(1) Fault isolation

During a major event, fault isolation is performed around the core modules related to hot events, including business isolation, isolation of dependent services, and storage-layer isolation. Beyond the usual storage and business isolation, the core modules are also strengthened in a targeted way.

[Figure: fault isolation of core modules: business isolation, dependent-service isolation, and storage-layer isolation]

For example, the core module for the Olympics was estimated to reach hundreds of thousands of QPS. This exceeds the peak of the most popular NBA games, and for some services even exceeds the traffic of the 2019 Baidu APP red envelope campaign. It is hard for the service clusters, and even the related architecture, to handle such a volume directly, and the downstream of the core service also amplifies traffic. We therefore built a full-link, multi-level cache and used Redis in the core service to take the traffic directly. Since a single Redis cluster could not carry the estimated peak, we built multiple Redis clusters in multiple locations for the Olympics, based on business characteristics, traffic distribution, and the performance ceiling of the existing Redis cluster, to jointly carry the overall Olympic traffic.
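A minimal sketch of the two ideas mentioned above, a local cache in front of Redis plus spreading keys across several Redis clusters, is shown below; the cluster addresses and hashing scheme are assumptions for illustration, not the actual Olympic deployment.

```python
import hashlib

import redis  # third-party: pip install redis

# Minimal sketch: a local in-process cache in front of several Redis clusters.
# Keys are spread across clusters by hash so that no single cluster has to
# carry the full estimated peak; the local cache absorbs repeated hot reads.
# Addresses below are hypothetical.
CLUSTER_ADDRS = [("redis-a.example", 6379),
                 ("redis-b.example", 6379),
                 ("redis-c.example", 6379)]
clusters = [redis.Redis(host=h, port=p) for h, p in CLUSTER_ADDRS]

local_cache: dict[str, bytes] = {}   # first cache level (in-process, unbounded here)

def pick_cluster(key: str) -> redis.Redis:
    """Route a key to one of the clusters by hashing (illustrative sharding)."""
    idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(clusters)
    return clusters[idx]

def get(key: str) -> bytes | None:
    if key in local_cache:                 # level 1: local cache
        return local_cache[key]
    value = pick_cluster(key).get(key)     # level 2: sharded Redis
    if value is not None:
        local_cache[key] = value
    return value
```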

(2) Performance optimization

Can the current service cluster support the peak traffic of a major event, and can the upstream and downstream services it depends on? To make sure of this, we apply the material summarized earlier: estimate the peak traffic of each hot event, evaluate whether each service in its current state can support the estimated traffic, and optimize the performance of the bottleneck points on the service link.

First, estimate the peak traffic of each module on the search link based on historical data, planned operational activities, and the sorted-out hot events. For example, when estimating traffic for the Tokyo Olympics we found that the 2016 Olympics data was too old to be a useful reference, so we referred instead to the 2021 traffic of popular events such as the NBA and, based on the Olympic business framework and product characteristics, made detailed traffic estimates for each event, each search scenario, and each front-end page.

After the estimated peak is set, evaluate whether each module on the service link needs to be expanded and by how much, then expand capacity before the event. The scope of expansion covers not only the current service link but also the service clusters of the various collaborating business parties.

[Figure: scope of capacity expansion across the service link and collaborating business clusters]

To verify that the expanded capacity can support the estimated traffic, we ran multiple rounds of staged pressure tests at the estimated QPS, simulating real traffic to probe risk points in the services and uncover performance bottlenecks. Once bottlenecks were found, we optimized core business algorithms and logic, removed redundant code, slimmed down business logic, split data, and added hierarchical caching, improving performance together with optimizations in the search architecture.
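As a rough illustration of paced pressure testing, here is a toy Python driver that issues requests at a target QPS; the URL, rate, and duration are placeholders, and a real exercise would use dedicated load tooling and recorded production traffic rather than this sketch.

```python
import threading
import time
import urllib.request

# Toy load driver: issue requests at (approximately) a fixed target QPS and
# count failures. It only illustrates pacing load against an estimated peak.

TARGET_URL = "http://service.example/health"   # hypothetical endpoint
TARGET_QPS = 50
DURATION_SECONDS = 10

failures = 0
lock = threading.Lock()

def fire_once() -> None:
    global failures
    try:
        urllib.request.urlopen(TARGET_URL, timeout=2)
    except Exception:
        with lock:
            failures += 1

def run() -> None:
    interval = 1.0 / TARGET_QPS
    deadline = time.time() + DURATION_SECONDS
    while time.time() < deadline:
        threading.Thread(target=fire_once, daemon=True).start()
        time.sleep(interval)
    print(f"failed requests: {failures}")

if __name__ == "__main__":
    run()
```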

(3) Problem plan

Based on the work above, we compiled a problem plan for the entire service link of each hot event to deal with the various risks in the system. The problem plan covers four aspects: service, risk, degradation, and intervention, with targeted responses from data to functions to services.

[Figure: problem plan covering service, risk, degradation, and intervention]
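A problem plan of this shape can be expressed as configuration plus a runtime switch; the sketch below is a generic illustration with hypothetical feature names and degradation levels, not the plan actually used.

```python
# Minimal sketch of a degradation plan as configuration plus a runtime switch.
# Each feature lists what to do at each degradation level; on-call engineers
# flip the level, and serving code consults it before doing expensive work.

DEGRADATION_PLAN = {
    "medal_table_realtime": {
        1: "serve cached copy, refresh every 60s",
        2: "serve static snapshot, hide live scores",
    },
    "personalized_recommendation": {
        1: "fall back to non-personalized ranking",
        2: "hide the module entirely",
    },
}

current_level: dict[str, int] = {}            # 0 or missing = fully on

def set_level(feature: str, level: int) -> None:
    current_level[feature] = level

def is_degraded(feature: str, at_least: int = 1) -> bool:
    return current_level.get(feature, 0) >= at_least
```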

(4) Fault drill

After the problem plan is sorted out, conduct fault-injection and fault-recovery drills around the core functions of the hot events to verify that the plans are effective and can be executed quickly.
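Fault injection for such drills is often implemented as a wrapper that randomly adds latency or errors to a call; the following is a minimal, hypothetical sketch rather than the drill platform actually used.

```python
import functools
import random
import time

# Minimal sketch of fault injection for drills: a decorator that, with some
# probability, delays a call or raises an error, so that monitoring, alarms,
# and the prepared plans can be exercised before the real event.

def inject_faults(error_rate: float = 0.05, extra_latency_s: float = 0.5):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < error_rate:
                raise RuntimeError("injected fault (drill)")
            time.sleep(random.uniform(0, extra_latency_s))
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(error_rate=0.1)
def fetch_event_data(event_id: str) -> dict:
    return {"event_id": event_id, "status": "ok"}   # stand-in for a real call
```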

04 Fault handling

Beyond detecting faults in time and controlling their impact, faults must also be handled effectively and appropriately during the event itself. First, set up an on-call group for rapid response, and agree on the failure response plan with the owners of every service on the link and of every collaborating party before the hot event. Then work with the operations team on cluster operation and maintenance, taking faulty instances offline automatically. When hot-event traffic surges, traffic can be cut over, rate-limited, or even partially degraded at any time based on the monitoring data. During the event, feedback arrives from many channels, including users; if the reported function already has a manual intervention plan, the problem can be fixed quickly through intervention. Problems that manual intervention cannot cover, as well as important product iterations, still require development and deployment, in which case the collaborating teams are quickly assembled to ship the change together.

[Figure: fault-handling process during major events]
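Traffic limiting of the kind described above is commonly implemented with a token bucket in front of the hot interface; the sketch below is a generic illustration with made-up limits, not Baidu's actual limiter.

```python
import time

# Minimal token-bucket rate limiter sketch: when hot-event traffic spikes,
# requests beyond the configured rate get a degraded response instead of
# overwhelming downstream services.

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limiter = TokenBucket(rate_per_s=1000, burst=2000)   # illustrative limits

def handle_request(request) -> str:
    if not limiter.allow():
        return "degraded response"   # limit traffic, keep the core path alive
    return "full response"
```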

Beyond the conventional fault handling above, we also design plans around the core functional characteristics of each major event. For the college entrance examination, the Olympics, and the Winter Olympics, data timeliness is what users care about most and is the core of the product's competitiveness. In addition to the monitoring and contingency plans already mentioned, we procured primary and backup data sources for the most time-sensitive content, with fast switching and mixed use between them, and built three offline processing platforms that can be switched quickly on failure. We also built a real-time manual intervention platform and decide, based on indicators such as product effect monitoring, user feedback, and service monitoring, whether manual intervention is needed, ensuring product quality and data accuracy to the greatest extent.
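A minimal sketch of the primary/backup data-source idea, with hypothetical fetchers and a flag that a manual-intervention platform could flip, is shown below.

```python
# Minimal sketch of primary/backup data sources with fast switching.
# `fetch_primary` / `fetch_backup` stand in for the two procured data feeds;
# the active source fails over automatically on errors, or can be forced by a
# manual-intervention flag when monitoring or user feedback shows a problem.

forced_source: str | None = None      # set by a (hypothetical) intervention platform

def fetch_primary(key: str) -> dict:
    raise TimeoutError("primary feed unavailable")   # simulated failure

def fetch_backup(key: str) -> dict:
    return {"key": key, "source": "backup"}

def fetch(key: str) -> dict:
    if forced_source == "backup":
        return fetch_backup(key)
    try:
        return fetch_primary(key)
    except Exception:
        return fetch_backup(key)      # automatic failover keeps data flowing

if __name__ == "__main__":
    print(fetch("mens_100m_final"))   # served from the backup feed
```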

05 Summary and reflections

With the above measures, the stability of the search Aladdin services stayed at 99.99%+ during the college entrance examination, the Tokyo Olympics, and the Beijing Winter Olympics, and data updates were almost synchronized with the official data. This both kept the major events stable and gave our products a good user experience.

Search Aladdin involves many services, and there is a great deal to consider and prepare when facing a major event. This article has described how to ensure stability from three aspects: fault discovery, fault management and control, and fault handling. Each major event has its own characteristics, and core business service links carry varying degrees of technical debt and are not equally easy to transform. Paying attention to stability during normal business iterations reduces the difficulty of transformation when the peak traffic of a major event arrives.

We are currently hiring for the position of Search Product R&D Engineer, working mainly on the back end of search production and research, the AI ecosystem, and architecture.

Interested students are welcome to submit their resumes to [email protected]

——END——

Recommended reading

Baidu search trillion-scale feature calculation system practice

Support OC code reconstruction practice through Python script (3): Adaptation of data item use module to access data path

Baidu search intelligent computing power control and allocation method

UBC SDK log level repetition rate optimization practice

Baidu search deep learning model business and optimization practice
