Baidu Zhidao: Cloud Migration and Architecture Evolution

Author | Baidu Zhidao R&D Team

Introduction

Baidu Zhidao (Baidu Knows) is a product line that has been online for more than ten years. It has many business scenarios, an aging architecture, and inconsistent code style. At the same time, the business iterates quickly, carries a heavy overall load, and has high stability requirements, all of which make a full migration to the cloud very challenging. Based on practice, this article describes how Zhidao selected and implemented its cloud migration plan while evolving its architecture in parallel to improve online service stability and disaster recovery capability.

The full text is 5302 words, and the expected reading time is 16 minutes.

01 Background and challenges

1.1 Background

With the rollout of the group's PaaS strategy and Baidu's cloud migration strategy, ORP, the current online operation platform, has officially entered maintenance-only mode and no longer receives feature updates or security fixes. The ORP access layer also cannot meet the requirements of cloud deployment in terms of stability and change efficiency. OXP has gradually become a bottleneck for business development and iteration. To solve this problem, improve resource elasticity, reduce resource costs, gain access to cloud-native capabilities, improve deployment efficiency, and ensure online service stability, we launched the OXP special project and, through it, gradually completed the overall cloud migration and architecture evolution.

1.2 Challenges

1. Zhidao is an old product line with heavy historical debt. Baidu Zhidao has been running for 18 years, with complex business models, many upstream and downstream dependencies, priorities that shifted from period to period, an aging architecture, inconsistent code style, and high refactoring costs;

2. The business develops and iterates quickly. Despite its long history, the product line iterates in an agile way to keep up with change: core scenarios are updated frequently, and more than 780 business requirements go online per year on average. The cloud migration had to be completed without compromising business goals and without the business noticing;

3. Traffic is large, commercial revenue is high, and stability requirements are strict. Zhidao ranks at the top of the knowledge vertical in both traffic and revenue, with average daily PV above 100 million. No traffic or revenue may be lost during the migration, and core service availability must stay above four nines;

4. The architecture must evolve sensibly alongside the migration. Moving to the cloud is a major technical change in Zhidao's history. Beyond bringing cloud-native capabilities to an old product line and optimizing IT costs, we also wanted it to drive the optimization and evolution of Zhidao's overall architecture and to improve disaster recovery capability and online service stability.
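
To make the four-nines target in item 3 concrete, the following minimal sketch converts an availability target into an error budget. It assumes availability is measured per page view and uses the round figure of 100 million daily PV from the text; the function and constant names are illustrative and not part of any Baidu tooling.

```python
from fractions import Fraction


def error_budget(availability: str, total: int) -> int:
    """Return how many of `total` units may fail while still meeting the target.

    `availability` is a decimal string such as "0.9999" so the arithmetic
    stays exact instead of drifting with binary floating point.
    """
    allowed_fraction = 1 - Fraction(availability)
    # Truncate toward zero so the budget is never overstated.
    return int(allowed_fraction * total)


# Assumption: 100 million page views per day, a round figure from the text.
DAILY_PV = 100_000_000
failed_pv_per_day = error_budget("0.9999", DAILY_PV)

# The same target expressed as full-outage seconds over a 90-day quarter.
QUARTER_SECONDS = 90 * 24 * 60 * 60
downtime_seconds_per_quarter = error_budget("0.9999", QUARTER_SECONDS)
```

Under these assumptions the budget is about 10,000 failed page views per day, or roughly 13 minutes of full outage per quarter, which leaves little room for incidents introduced by the migration itself.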

1.3 Benefits

1. All traffic now runs on the cloud, which gives Zhidao elastic resource provisioning, greatly improves the efficiency of scaling out and in, avoids capacity risks caused by traffic fluctuations, and improves online service stability;

2. Elastic container provisioning was introduced, with on-demand use, pay-as-you-go billing, and dynamic adjustment, which optimizes the overall resource footprint of online services; a large number of OXP machines were freed up, greatly reducing IT costs;

3. Zhidao's architecture evolved along with the migration. Core traffic was deployed on the cloud across four data centers in three regions, built from scratch. This reduced the end-to-end latency of core pages, gave them N+1 redundancy for disaster recovery, and improved the business's resilience to risk.

02 Concept introduction

2.1 Zhidao business overview

Zhidao is a traditional text-and-image knowledge content production business. First, users ask questions spontaneously, or unresolved questions are obtained by filtering and mining daily search queries. Second, various kinds of producers are guided to answer questions on different pages and back-end tools, producing answer content. Third, the resulting question-answer pairs are distributed to search, feed, and other scenarios for users to browse and consume.

After years of operation, Zhidao has accumulated a large body of Q&A resources and steadily covers many long-tail needs in the search ecosystem. By identifying user needs, mining high-value leads, and bringing in institutional and MCN accounts, it has also built high-quality content across many categories and gradually formed a relatively stable multi-tier content ecosystem and brand awareness.

picture

2.2 Business Architecture

Zhidao's overall business architecture is shown in the figure below:

picture

2.3 Traffic Architecture

Zhidao's overall traffic architecture before the cloud migration is shown in the figure below:

picture

03 Cloud design and practice

3.1 Selection of cloud solutions

ORP, the PaaS platform widely used by PHP modules across the knowledge vertical, announced that it would stop maintenance at the end of 2022. The existing ORP system also has problems in container orchestration and management, and its budget and resource management is inconsistent with the company's current mechanisms and processes. Zhidao's existing architecture is built natively on ODP and is essentially a large monolithic application. Through this upgrade, Zhidao needed to migrate to a PaaS platform closer to a cloud-native environment and carry out a new round of architecture iteration to arrive at an architecture that fits the current state of the business.

Although Kubernetes, the open-source system for managing containerized applications, is the direction the community is heading, after weighing refactoring cost, deadlines, and available engineering effort, Zhidao made the same choice as the other knowledge vertical product lines for this migration: Pandora as the underlying layer, with the "Zhiyun platform" for resource management and deployment.

3.1.1 Why Pandora

The main considerations were:

1. Pandora already serves the company's main consumer-facing businesses, such as Dasou (main search), feed, Shoubai (the Baidu App), Baijiahao, and Haokan Video. These businesses are close to the knowledge vertical in their scenarios, and detailed research and evaluation showed that Pandora could support the planned changes;

2. Pandora is the only existing PaaS that supports deploying many modules together (up to 2K) without requiring extensive refactoring or merging of business code, which makes it a better fit for the existing ODP monolith;

3. Pandora is currently less easy to use than opera, but Zhiyun makes up for this. Zhiyun also provides the services that ORP offered, including the access layer, static resources, proxy, and data distribution, so ease of use did not change the final decision.

3.1.2 Why Zhiyun

The knowledge vertical and other OXP-based businesses share a distinctive structure: multiple isomorphic apps within one large monolith. No existing PaaS platform supports this requirement. In addition, because Pandora standardizes packaging and services at the lower layer, each business line would need to do its own targeted code changes and regression testing, and much of that work is repetitive. The Zhiyun platform aims to provide a cloud migration solution that better fits knowledge businesses (and other OXP-based businesses). Its main advantages are:

1. Release and change management: in addition to basic release, configuration management, and rollback, it supports the multi-app isomorphic mode and multi-module deployment at its core. This reduces the cost of migrating OXP projects to Zhiyun. Ideally, code bases do not need to be merged or split and can be moved over as they are;

2. Platform services: matching the services OXP already provided, it offers support and solutions for log splitting, scheduled tasks, the access layer, static resources, fly lines, central control, and more, and it lets businesses deploy and customize services through an open service model built on cloud-native ideas;

3. Business runtime environment: rapid deployment and customization of the ODP base runtime environment;

4. Base environment (container): a unified entry point and more convenient operations for day-to-day maintenance.

3.2 Traffic cutover and ramp-up practice

3.2.1 Preparation before going to the cloud

For each traffic cluster, the work required before migrating to Pandora mainly covers the following:

1. Create the product line and applications in Zhiyun. This involves building the product line's basic environment on the Zhiyun platform, creating the basic APP information, applying for ECI data center resources and instance configuration, adding the ODP base runtime environment and data distribution container information, creating the corresponding container component configuration, adding static file storage addresses, modifying deployment paths and configuring derived conf, creating release templates, and so on;

2. Access layer changes and authorization. Create BNS variables for the new APP at the access layer and grant the new BNS access to the relevant databases and Redis instances. Where new data centers are involved, the MySQL and Redis configurations also need to be upgraded and adapted;

3. Business layer changes and testing. Zhidao used this migration to upgrade its back-end language runtime from HHVM to PHP7 at the same time. The newer language version improves security and performance, and PHP7 offers many new syntax features that are unavailable in older versions. PHP7 compatibility issues in each module had to be fixed and verified in offline testing;

4. Add monitoring and log collection. Add noah and sia monitoring at every level for the new APP, adjust monitoring items, and tune alert thresholds; update the log collection paths, merge service groups, and verify offline that logs are stored correctly.

3.2.2 Traffic cutover scheme

  • The small-traffic experiment scheme is shown in the figure below:

picture

  • Access layer transformation:

A Lua script at the access layer implements the small-traffic switching. It defines the following rules:

['strategy_1_1_98']   = {1, 1, 98},
['strategy_5_5_90']   = {5, 5, 90},
['strategy_10_10_80'] = {10, 10, 80},
['strategy_20_20_60'] = {20, 20, 60},
.....,
['strategy_80_20_0']  = {80, 20, 0},
['strategy_95_5_0']   = {95, 5, 0},
['strategy_100_0_0']  = {100, 0, 0}

The script returns one of three results: "opera", "abtest", or "orp". The three numbers in each rule, from left to right, are the probabilities (in percent) of those results, so traffic can be controlled according to the returned value.
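
The rule table above can be read as a weighted choice over three outcomes. The sketch below illustrates that reading in Python; it is not the access layer's Lua script. It lists only the rules shown in the text, and the names `pick_target` and `route`, as well as the use of a per-request random bucket, are assumptions made for illustration.

```python
import random

# Percentages for ("opera", "abtest", "orp"), in that order.
# Only the rules shown in the text are listed; elided ones are omitted.
STRATEGIES = {
    "strategy_1_1_98": (1, 1, 98),
    "strategy_5_5_90": (5, 5, 90),
    "strategy_10_10_80": (10, 10, 80),
    "strategy_20_20_60": (20, 20, 60),
    "strategy_80_20_0": (80, 20, 0),
    "strategy_95_5_0": (95, 5, 0),
    "strategy_100_0_0": (100, 0, 0),
}
TARGETS = ("opera", "abtest", "orp")


def pick_target(strategy: str, bucket: int) -> str:
    """Map a bucket in [0, 100) to a target using the strategy's percentages."""
    if not 0 <= bucket < 100:
        raise ValueError("bucket must be in [0, 100)")
    upper = 0
    for target, weight in zip(TARGETS, STRATEGIES[strategy]):
        upper += weight  # cumulative upper bound of this target's range
        if bucket < upper:
            return target
    raise ValueError("strategy percentages must sum to 100")


def route(strategy: str, rng: random.Random) -> str:
    """Draw a random bucket per request (assumed; a real script might hash a user ID)."""
    return pick_target(strategy, rng.randrange(100))
```

Separating the deterministic `pick_target` from the random draw keeps the mapping easy to verify: with `strategy_1_1_98`, bucket 0 goes to the first result, bucket 1 to the second, and buckets 2 through 99 to the third.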

A new variable, $upstream_target, holds the final proxy value. Its four possible values correspond to experiment-group and control-group traffic on PC and on mobile:

# Set the final proxy value: pc_orp, pc_pandora, wap_orp, wap_pandora
set $upstream_target "${terminal_target}_${target_cluster}";
# End of the Zhidao cloud-migration traffic cutover experiment configuration

A flag is also added to the requests delivered to the service. Its values are "pandora", "abtest", and "orp", which identify traffic in the experiment group, the control group, and the unrelated group respectively.
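
As a rough illustration of how the flag and the client type could combine into the four upstream names, here is a small Python sketch. The mapping from flag to cluster is an assumption (only the experiment group is sent to Pandora, while the control group and unrelated traffic stay on ORP); the article does not spell it out, and `upstream_target` here is a hypothetical helper, not the nginx variable itself.

```python
VALID_TERMINALS = ("pc", "wap")

# Assumption: only the experiment group is proxied to Pandora; the control
# group ("abtest") and unrelated traffic ("orp") remain on the ORP cluster.
CLUSTER_FOR_FLAG = {"pandora": "pandora", "abtest": "orp", "orp": "orp"}


def upstream_target(terminal: str, flag: str) -> str:
    """Build the upstream name the same way the nginx `set` directive concatenates it."""
    if terminal not in VALID_TERMINALS:
        raise ValueError(f"unknown terminal: {terminal!r}")
    cluster = CLUSTER_FOR_FLAG[flag]  # raises KeyError on an unknown flag
    return f"{terminal}_{cluster}"


# Every reachable upstream name under the assumed mapping.
ALL_UPSTREAMS = {
    upstream_target(t, f) for t in VALID_TERMINALS for f in CLUSTER_FOR_FLAG
}
```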

  • Business layer transformation:

The business layer reads the traffic flag above and, for each group, creates and uses a new Eid when issuing business requests. This yields per-page business traffic data for the current experiment and control groups.

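// Traffic flag set by the access layer; empty string when the header is absent
$target = isset($_SERVER['HTTP_X_BD_TARGET']) ? $_SERVER['HTTP_X_BD_TARGET'] : '';
if ($target === 'pandora') {
    // Experiment group
    $adsEids = array(
        'asp'  => array(50001),
    );
} elseif ($target === 'abtest') {
    // Control group
    $adsEids = array(
        'asp'  => array(50002),
    );
}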

3.2.3 Ramp-up

Taking Zhidao's core Q&A page as an example, each ramp-up stage has specific work to focus on and an entry checklist for moving to the next stage. The next stage of the ramp-up experiment can start only when every item on the checklist meets its standard. The details are as follows:

picture
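
The gating rule described above (advance only when every checklist item passes) can be sketched as follows. This is a minimal illustration, assuming the ramp-up stages follow the strategy names listed in section 3.2.2; stages elided in the text are omitted, and the check names in the usage example are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Stages as (experiment %, control %, untouched %), taken from the strategy
# names in section 3.2.2. Stages elided in the text are not filled in.
STAGES: List[Tuple[int, int, int]] = [
    (1, 1, 98), (5, 5, 90), (10, 10, 80), (20, 20, 60),
    (80, 20, 0), (95, 5, 0), (100, 0, 0),
]

Check = Callable[[], bool]


def next_stage(current: int, checklist: Dict[str, Check]) -> Tuple[int, List[str]]:
    """Advance one stage only if every checklist item passes.

    Returns the stage index to use next and the names of any failed checks.
    """
    failed = [name for name, check in checklist.items() if not check()]
    if failed or current >= len(STAGES) - 1:
        return current, failed  # stay put: blocked, or already at full traffic
    return current + 1, failed


# Usage with hypothetical checks: one failing item blocks the ramp-up.
blocked = next_stage(0, {"sla_ok": lambda: True, "latency_ok": lambda: False})
advanced = next_stage(0, {"sla_ok": lambda: True, "latency_ok": lambda: True})
```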

3.2.4 Moving the gateway to the cloud

After the business layer moved to the cloud, the gateway's downstream changed from an ORP environment where instances almost never migrate to a cloud environment where they migrate frequently. The original ORP access layer cannot detect such frequent downstream changes quickly, so it also had to be moved to the cloud. The Janus gateway is already widely used by product lines such as Shoubai (the Baidu App), Encyclopedia, Ask, Experience, and Baijiahao, and has been proven extensively in practice. It has the following advantages over the original inrouter, so Zhidao chose Janus, within the Zhiyun system, for its gateway switch.

picture

After 18 years of iteration, Zhidao's gateway routing and forwarding rules had become bloated, with convoluted logic and very high maintenance costs; a small mistake could cause a serious online incident. As the service that controls traffic forwarding, a gateway should be as simple and clear as possible, with good readability and maintainability. So when the gateway moved to the cloud, the routes were thoroughly refactored and cleaned up at the same time rather than simply relocated. The core process of the migration and refactoring is as follows:

picture

The final results:

1. Responsive downstream awareness: the new gateway detects downstream instance drift promptly and, combined with a retry strategy, makes instance drift almost invisible to the business SLA;

2. Much better maintainability: preview, intranet, and extranet domain names are built separately with clear boundaries, so usage and maintenance are obvious at a glance. The original 2768 lines of nginx forwarding rules were consolidated into 18 rules in the conf file, clearing 18 years of historical baggage;

3. Safer changes: releases are scoped to an individual service, route, or forwarding rule, so the blast radius is controllable. On top of staged releases, checkers, online inspection, and test cases greatly improve the safety of gateway changes.
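
The combination of drift awareness and retries in item 1 can be illustrated with a generic pattern: re-resolve the downstream instance list before each attempt and skip addresses that already failed. The sketch below is not Janus's implementation; `resolve` and `send` are hypothetical hooks standing in for service discovery and the actual request.

```python
from typing import Callable, List


class NoHealthyInstance(Exception):
    """Raised when every attempt fails."""


def call_with_retry(
    resolve: Callable[[], List[str]],
    send: Callable[[str], str],
    max_attempts: int = 3,
) -> str:
    """Send a request, re-resolving the instance list before each attempt.

    `resolve` returns the current downstream addresses; `send` performs the
    request and raises ConnectionError when the instance has gone away.
    """
    tried = set()
    for _ in range(max_attempts):
        # Re-resolve every time so a drifted instance is picked up quickly.
        candidates = [addr for addr in resolve() if addr not in tried]
        if not candidates:
            break
        addr = candidates[0]
        try:
            return send(addr)
        except ConnectionError:
            tried.add(addr)  # do not retry an address that just failed
    raise NoHealthyInstance(f"all attempts failed; tried {sorted(tried)}")
```

In this sketch a request that hits an instance which has just drifted away fails once, then succeeds on the freshly resolved address, so the caller sees a single successful response.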

3.3 Architecture Evolution

3.3.1 Status & Problems

For a long time, more than 95% of Zhidao's main traffic was concentrated in North China: core traffic went to the tc+jx data centers, and non-core traffic went to the yq data center. From the external network access points, through the business layer and the underlying basic services, to important third-party dependencies, no resources or services had been built in any other region. This was an obvious risk: if North China suffered a regional failure, there was no way to shift traffic elsewhere to stop the loss.

3.3.2 Evolution Scheme

1. Core traffic is the QB page on all three clients. It accounts for more than 80% of total traffic and more than 99% of commercial revenue. It has high organic traffic and heavy user interaction, is sensitive to online incidents, and requires availability above four nines. This is Zhidao's lifeline. As part of this cloud migration, four data centers in three regions were built at the same time, so that when a single region fails the QB page can switch quickly to the other two regions, improving overall system reliability;

2. Non-core traffic, meaning everything other than the QB page, has a long history, involves many modules and complex types of underlying resources, and contributes almost nothing to commercial revenue. Each subsystem also carries a relatively small share of traffic. Weighing refactoring cost against benefit, this migration moved non-core traffic to two data centers in North China, which provides redundancy across data centers within one region; other regions will not be built for now;

3. For core traffic, resources had to be built in all three regions along the full link, "external network access point -> BFE -> access layer -> business layer -> in-house dependent services -> third-party dependent services -> underlying storage", and connectivity had to be tested;

4. For core traffic, once service resources at every level were in place, comprehensive stress tests and disaster recovery drills were needed to confirm that data centers in the new regions can carry the traffic and that traffic can be switched freely to the other two regions when one region fails. For important third-party services that cannot be stress-tested online, such as commercial ad requests, a traffic cutover observation plan was worked out jointly with the other side's OP, RD, and QA to avoid online risks caused by the change in traffic distribution;

5. For core traffic, a traffic cutover plan was designed and a synchronization mechanism with third-party services established, so that during the cutover all parties could jointly observe whether upstream and downstream subsystems behaved as expected.
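
The capacity question behind the stress tests in item 4 can be worked through with simple arithmetic: how much extra load must each region absorb when another region fails? The sketch below assumes the 4:3:3 North China : Central China : South China split reported in the summary section, and it assumes a failed region's traffic is redistributed to the survivors in proportion to their normal weights. The function names are illustrative.

```python
from fractions import Fraction
from typing import Dict


def shares_after_failure(weights: Dict[str, int], failed: str) -> Dict[str, Fraction]:
    """Redistribute a failed region's traffic to survivors in proportion to their weights."""
    survivors = {region: w for region, w in weights.items() if region != failed}
    total = sum(survivors.values())
    if total == 0:
        raise ValueError("no surviving region has traffic weight")
    return {region: Fraction(w, total) for region, w in survivors.items()}


def required_headroom(weights: Dict[str, int]) -> Dict[str, Fraction]:
    """For each region, the worst-case load multiplier over all single-region failures."""
    total = sum(weights.values())
    headroom = {}
    for region, w in weights.items():
        normal = Fraction(w, total)
        worst = max(
            shares_after_failure(weights, failed)[region]
            for failed in weights
            if failed != region
        )
        headroom[region] = worst / normal
    return headroom


# Assumption: the 4:3:3 split reported in the summary section.
WEIGHTS = {"north": 4, "central": 3, "south": 3}
HEADROOM = required_headroom(WEIGHTS)
```

Under these assumptions, losing North China raises Central and South China from 30% to 50% of traffic each, so each needs about 1.67 times its normal capacity, while North China needs about 1.43 times its normal capacity to cover the loss of either other region.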

3.3.3 Implementation

After three-region deployment of core traffic was completed, Zhidao's traffic architecture evolved as follows:

picture

04 Summary and Benefits

1. As of March 31, 2023, 100% of traffic was on the cloud and the business had been fully migrated.

picture

2. Traffic cutover to the cloud began in Q3 2022. The SLA met four nines for three consecutive quarters, and the number of online problems introduced by the migration was 0.

picture

3. Zhidao's core pages are now deployed across four data centers in three regions, with online core traffic distributed North China : Central China : South China = 4:3:3. For the first time in its history, Zhidao has N+1 cross-region redundancy for disaster recovery.

picture

4. The average end-to-end latency of the mini-program's core QB interface dropped by 12%, and the 80th-percentile FMP is stable within 1 second, so pages open within a second without any other technical optimization.

picture

picture

5. The adjustment of GTC access points in the three regions was completed, and the allocated cost of public network IPs has decreased month by month since 2023. At the same time, OXP machines were taken offline and returned in batches, saving substantial R&D costs.

picture

——END——


Origin my.oschina.net/u/4939618/blog/10089662