Baidu Search: Trillion-Scale Feature Computation System Practice

Author | Jay

Introduction 

This article introduces Baidu Search's engineering practice for understanding trillions of pieces of content across the entire web, covering topics such as machine learning engineering, resource scheduling, and storage optimization.

The full text is 6648 words, and the estimated reading time is 17 minutes.

01 Business background

Baidu collects a large amount of Internet content. To index this content, it is necessary to first understand it in depth and extract multi-dimensional information, including content semantics, content quality, content security, and so on, in order to further support needs such as content filtering and semantic database construction. Deeply understanding the massive content of the entire web poses a huge challenge, mainly in terms of cost and efficiency.

In terms of cost, the amount of computation is enormous. Beyond the sheer volume of web content (trillion scale) and the large number of features, two trends further intensify the computing pressure. On the one hand, the proportion of images and video in Internet content keeps increasing, and the computation required for images/videos far exceeds that for text. On the other hand, the large-scale application of deep learning, especially the recent rise of large models, has also increased the demand for computing power. In terms of efficiency, making the system easier to use and improving business iteration efficiency as much as possible is one of the core goals of any engineering system.


02 Key ideas

(1) Cost optimization: To meet such a huge demand for computing power, we need to push "increasing revenue and reducing expenditure" to the extreme.

1. "Open source" : Expand the computing resource pool as much as possible, meet the low ROI through procurement, and tap the potential of existing resources is the key. From a company-wide perspective, resource usage is insufficient. There are peaks and troughs in online resources, and there are many idle resources in inventory. However, we mostly do offline computing and do not have high requirements for resource stability. We can combine the two to build a set of flexible computing schedules. system to solve resource problems.

2. "Throttling" : Optimize service performance as much as possible and reduce unit computing costs. Model inference calculations are large, but there is considerable room for optimization. Combining model structure and GPU hardware characteristics for optimization can greatly improve model service per card. Hesitation. In addition, optimizing CPU processing and using Baidu's self-developed Kunlun chips can also reduce unit costs.

(2) Efficiency optimization: As shown in the figure, the overall business process includes two parts: real-time and offline computation. New features need to be refreshed offline over existing data. For data newly collected by the Spider, high-timeliness data is screened out for real-time computation, and the rest is also computed offline; the bulk of the computation lies in the offline part. The main efficiency questions are: how do we support rapid engineering of models, and how do we improve offline computing efficiency?

1. Model service framework & platform : Model engineering is realized through a unified model service framework and a supporting model service platform, which together cover the whole model service life cycle from construction and testing to going online.

2. Feature batch computing platform : To solve the problem of offline feature computing efficiency, a unified batch computing platform was built. It analyzes and deeply optimizes the efficiency and performance bottlenecks in every link, from offline task development to the computation process, to improve efficiency as much as possible.

[Figure: overall business process, covering real-time and offline computation]

03 Technical solution

3.1 Overall architecture

The overall architecture is shown in the figure below. The core parts are the model service platform, batch computing platform, computing scheduling system, and model service framework.

1. Model service framework : Algorithm engineers encapsulate their services with a unified model service framework. For R&D efficiency, Python was chosen as the framework language, but Python's performance problems are also obvious, so a lot of targeted optimization is required. In addition, we continuously integrate a variety of inference optimization methods into the framework to reduce the unit computing cost of services as much as possible.

2. Model service platform : The model service platform supports model service DevOps and capability output. The platform uses the "operator" as its management granularity. An "operator" represents a complete function, such as video classification, and usually requires a combination of multiple model services. Algorithm engineers register operators on the platform, provide meta-information such as the service topology, and generate performance reports through automatic performance parameter tuning, automated stress testing, and so on. The service topology and performance reports are important inputs for subsequent scheduling. The platform also provides functions such as operator retrieval and research trials, and supports other businesses' needs as a shared middle platform.

3. Computing scheduling system : The computing scheduling system performs unified scheduling of traffic and resources. All requests to model services pass through the gateway of the computing scheduling system, which implements traffic policies such as flow control and routing. The computing scheduling system also schedules a variety of idle heterogeneous resources across multiple Baidu PaaS platforms and automatically deploys appropriate operators, providing greater throughput for offline computing.

4. Batch computing platform : The batch computing platform provides task generation, task scheduling, DevOps, and other functions for offline jobs. It builds a storage solution based on HTAP to solve the Scan throughput bottleneck, and connects with the computing scheduling system to support large-scale offline computing.

[Figure: overall architecture]

3.2 Key technical points

This chapter explains the key technical points of the system, including the technical difficulties encountered, our thinking, and the trade-offs made. We also hope to exchange ideas with readers on some of the common issues.

3.2.1 Model service framework

In actual business scenarios, the model service framework has several key issues that need to be solved: business programming model, Python service performance optimization, and inference performance optimization, which are introduced below.

3.2.1.1 Business programming model

Implementing a function often requires combining multiple models and multiple pieces of data processing logic. To express the processing flow abstractly and reuse common logic, the solution is as follows:

  • Describe business logic as a DAG (directed acyclic graph). The nodes on the DAG are called Ops; a DAG consists of multiple Ops, with serial and parallel relationships between them. An Op can be a model inference or a piece of processing logic, and data is passed between Ops via a Context whiteboard. The DAG clearly presents the overall processing flow and improves code readability and maintainability.

  • Build a general Op library. Common logic such as model inference, video frame extraction, and video transcoding is integrated into a general Op library to support reuse across businesses. A business can also customize and extend Ops as needed and register them for use in the framework. A minimal sketch of this DAG/Op abstraction is shown below.
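
The framework itself is not shown in the article, so the following is only a minimal Python sketch with hypothetical Op/DAG classes, illustrating how Ops might read and write a shared Context whiteboard:

# Minimal sketch (hypothetical API, not the actual framework) of the DAG/Op
# programming model: Ops read and write a shared Context "whiteboard".
from abc import ABC, abstractmethod


class Op(ABC):
    @abstractmethod
    def run(self, ctx: dict) -> None:
        """Read inputs from ctx and write outputs back to ctx."""


class ExtractFramesOp(Op):          # general Op: video frame extraction
    def run(self, ctx):
        ctx["frames"] = ["frame0", "frame1"]   # placeholder for real decoding


class VideoClassifyModelOp(Op):     # ModelOp-like Op: inference stand-in
    def run(self, ctx):
        ctx["label"] = "sports" if ctx["frames"] else "unknown"


class DAG:
    def __init__(self, ops):
        self.ops = ops               # topologically ordered for simplicity

    def execute(self, ctx):
        for op in self.ops:
            op.run(ctx)
        return ctx


if __name__ == "__main__":
    dag = DAG([ExtractFramesOp(), VideoClassifyModelOp()])
    print(dag.execute({"video_url": "http://example.com/v.mp4"}))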


3.2.1.2 Python service performance optimization

Choosing Python reduces development costs, but it also introduces the Python GIL (Global Interpreter Lock) problem, which prevents full utilization of multiple CPU cores and greatly limits service throughput. The solution is as follows:

  • Adopt a concurrency scheme of multi-process + asynchronous coroutines + CPU/GPU computation separation. The service includes three types of processes: RPC processes, DAG processes, and model processes, and data is exchanged between them through shared memory/GPU memory. (A minimal sketch of this process split appears after this list.)

  • The RPC process is responsible for network communication. It is developed based on BRPC (open-source version: https://github.com/apache/brpc ). We optimized BRPC's Python implementation to support a concurrency mode of Python multi-process plus coroutines; in actual business scenario tests, performance improved by more than 5x after optimization.

  • The DAG process is responsible for DAG execution (CPU processing), making full use of multiple CPU cores through multiple DAG processes and asynchronous, coroutine-based Op execution. A particularly important Op is ModelOp, which is actually an inference proxy (similar to an RPC stub); the real inference runs in a local model process or a remote service. ModelOp hides the call details and makes it convenient for users to use models.

  • The model process is responsible for model inference (GPU processing). Considering constraints such as limited GPU memory, the model process is separated from the DAG process. The model process supports multiple inference engines such as PyTorch and Paddle, and a lot of inference optimization work has been done. Since Tensor data is usually large, the DAG and model processes transfer Tensors directly through shared GPU memory to avoid unnecessary memory copies.
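
The production framework passes Tensors through shared GPU memory; the sketch below is only an assumption-laden illustration of the zero-copy idea, using Python's standard multiprocessing shared memory to hand an array from a "DAG" process to a "model" process:

# Minimal sketch, assuming a much-simplified setup: a DAG process hands a
# large array to a model process via shared memory instead of copying it.
import numpy as np
from multiprocessing import Process, Queue
from multiprocessing import shared_memory


def model_process(task_queue: Queue, result_queue: Queue):
    shm_name, shape, dtype = task_queue.get()
    shm = shared_memory.SharedMemory(name=shm_name)
    tensor = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    result_queue.put(float(tensor.mean()))       # stand-in for real inference
    shm.close()


if __name__ == "__main__":
    frames = np.random.rand(8, 224, 224, 3).astype(np.float32)
    shm = shared_memory.SharedMemory(create=True, size=frames.nbytes)
    buf = np.ndarray(frames.shape, dtype=frames.dtype, buffer=shm.buf)
    buf[:] = frames                              # write once into shared memory

    tasks, results = Queue(), Queue()
    p = Process(target=model_process, args=(tasks, results))
    p.start()
    tasks.put((shm.name, frames.shape, frames.dtype.name))
    print("model output:", results.get())
    p.join()
    shm.close()
    shm.unlink()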


3.2.1.3 Inference performance optimization

The main optimization methods include inference scheduling, inference optimization, model quantization, and model compression. After optimization, a service's single-card throughput is usually several times that of the native implementation.

1. Inference scheduling : dynamic batching (DynamicBatching) and multi-Stream execution. GPU batch computation is more efficient, but since the service also accepts single real-time requests, batches cannot be formed on the request side. Requests are therefore buffered briefly inside the service to form batches, trading latency for throughput. A Stream can be regarded as a GPU task queue; by default there is a single global one, and tasks execute on it serially, so the compute units sit idle during GPU IO operations (copies between host memory and GPU memory). By creating multiple Streams and letting different inference requests use different Streams, IO and computation can be fully parallelized. (A minimal sketch of dynamic batching appears after this list.)

2. Inference optimization : The mainstream solution in the industry is TensorRT, but in practice there are problems such as dynamic graphs that cannot be made static and incomplete TensorRT Op coverage. To solve these problems, the team developed Poros (open-source version: https://github.com/PaddlePaddle/FastDeploy/tree/develop/poros), which combines TorchScript, graph optimization, TensorRT, vLLM, and other technologies. Without complex model conversion, adding a few lines of code can greatly improve inference performance, a win-win for efficiency and performance. Poros also supports heterogeneous hardware such as Kunlun.

3. Model quantization : Hardware such as GPUs and Kunlun chips has much stronger computing power at low precision. Although quantization causes a small loss in model quality, it brings a significant increase in throughput, so FP16 or even INT8/INT4 quantization is used online. This is also supported by Poros.

4. Model compression : Streamline model parameters and reduce the amount of computation through methods such as model distillation and model pruning. This requires training and is lossy in quality, so it is usually done together with algorithm engineers.
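
Dynamic batching is implemented inside the framework/inference service; the following is only a minimal asyncio sketch with hypothetical names of the "buffer single requests briefly, then run one batched inference" idea:

# Minimal sketch (hypothetical, not the production implementation) of dynamic
# batching: single requests are buffered up to max_batch/max_wait_ms, then
# executed together, trading a little latency for GPU throughput.
import asyncio


class DynamicBatcher:
    def __init__(self, infer_batch_fn, max_batch=16, max_wait_ms=5):
        self.infer_batch_fn = infer_batch_fn
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()
        self._worker = asyncio.create_task(self._loop())

    async def infer(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def _loop(self):
        while True:
            item, fut = await self.queue.get()        # wait for first request
            items, futs = [item], [fut]
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(items) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), timeout)
                    items.append(item)
                    futs.append(fut)
                except asyncio.TimeoutError:
                    break
            outputs = self.infer_batch_fn(items)      # one batched "GPU" call
            for fut, out in zip(futs, outputs):
                fut.set_result(out)


async def main():
    batcher = DynamicBatcher(lambda batch: [x * 2 for x in batch])
    results = await asyncio.gather(*(batcher.infer(i) for i in range(10)))
    print(results)


asyncio.run(main())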


3.2.2 Computational scheduling system

The architecture of the computing scheduling system is shown in the figure below. All request traffic passes through a unified gateway (FeatureGateway), which supports multiple traffic policies such as flow control and routing. Offline jobs also submit their computing requirements through the gateway, which forwards them to the scheduler (SmartScheduler). The scheduler connects to multiple PaaS platforms inside Baidu, continuously detects idle resources, and automatically schedules and deploys appropriate operators based on demand, various metrics, the distribution of idle heterogeneous resources, and so on. Operator meta-information is obtained from the model service platform. After scheduling completes, the scheduler adjusts the gateway's flow control, routing, and other policies.

[Figure: computing scheduling system architecture]

There are two key issues in the system: how to automate the deployment of operators (composite services with complex service topologies), and how to schedule under complex conditions such as unstable traffic distribution and multiple kinds of heterogeneous resources.

3.2.2.1 Automated deployment

To reduce the complexity of scheduler development, declarative programming is adopted; the implementation is based on the Kubernetes controller mechanism. The plan for automated operator deployment is as follows:

1. CRD extension : Use Kubernetes CRDs to define custom objects such as ServiceBundle (the operator deployment package), and use the controller mechanism to perform deployment and other operations against external systems such as PaaS. A ServiceBundle contains the deployment information of all sub-services required by an operator, as well as their topological relationships. When scheduling creates an operator service, sub-services are created layer by layer starting from the bottom; upper-layer sub-services obtain the addresses of downstream sub-services through the communication hosting mechanism. (A sketch of creating such a custom object appears after this list.)

2. Communication hosting : The communication hosting mechanism is implemented based on the configuration center and the model service framework. The service startup command carries a remote configuration address and an AppID, and by loading the remote configuration the downstream service address can be set at startup. A more ideal solution would be to use technologies such as ServiceMesh to decouple architectural capabilities from business logic. However, since we need to deploy into multiple PaaS platforms, the cost of deploying components such as the ServiceMesh sidecar in every PaaS is high, and integrating it into the framework would be too heavy. We therefore built the solution on the configuration center first and will consider migrating when the time is right.
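
The ServiceBundle CRD schema is internal and not given in the article; assuming a hypothetical group/version and a simplified spec, creating such a custom object with the official Kubernetes Python client might look like this:

# Hypothetical sketch: creating a simplified ServiceBundle custom object with
# the Kubernetes Python client. Group/version/fields are assumptions, not the
# real internal CRD schema.
from kubernetes import client, config

config.load_kube_config()                       # or load_incluster_config()

service_bundle = {
    "apiVersion": "scheduling.example.com/v1",  # hypothetical group/version
    "kind": "ServiceBundle",
    "metadata": {"name": "video-classify-bundle"},
    "spec": {
        # sub-services listed bottom-up; upper layers discover downstream
        # addresses via the communication-hosting (config center) mechanism
        "subServices": [
            {"name": "ocr-model", "replicas": 4, "hardware": "gpu"},
            {"name": "video-classify-dag", "replicas": 2, "hardware": "cpu",
             "downstream": ["ocr-model"]},
        ],
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="scheduling.example.com",
    version="v1",
    namespace="default",
    plural="servicebundles",
    body=service_bundle,
)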


3.2.2.2 Scheduling design

Scheduling is a very complex problem. In our scenario, its complexity is mainly reflected in the following aspects:

1. Operator scheduling : The traffic an operator (a composite service) can carry is determined by the capacity of its weakest sub-service. Scheduling must consider the operator as a whole to avoid wasting the resources of its stronger sub-services.

2. Changes in traffic distribution : The performance of some operators is affected by the distribution of the input data. For example, video OCR is affected by video duration and the proportion of on-screen text, and requires adaptive adjustment during scheduling.

3. Multiple kinds of heterogeneous hardware : Some operators can run on several kinds of heterogeneous hardware (Kunlun/GPU/CPU, etc.), while others are bound to a single type. Allocation must ensure that global resources are used most effectively.

4. Other factors : Job priority, resource priority, resource fluctuation, and other factors also affect scheduling. The factors to consider in actual scheduling are very diverse.

Based on the above factors, our scheduling design plan is as follows:

1. Two-stage scheduling : Scheduling is divided into two stages, traffic scheduling and resource scheduling, each of which runs independently. Traffic scheduling allocates the current operator service capacity to each job and synchronizes the result to the gateway to adjust traffic policies; resource scheduling schedules based on resource idleness and operator capacity gaps, and ultimately scales operator service instances up or down.

2. Traffic scheduling : The Adjust phase tunes a normalization coefficient according to task runtime metrics and then uses it to map the QPS required by a task to a NormalizedQps. NormalizedQps is the basis for all subsequent scheduling, which solves the problem of changing traffic distributions. The Sort stage sorts tasks by job priority. The Assign stage allocates the existing operator capacity to each job according to the Sort result and priority. The Bind phase executes the result and synchronizes routing to the gateway. (A minimal sketch of these phases appears after this list.)

3. Resource scheduling : The Prepare phase first converts a job's capacity gap into the corresponding gap in the number of service instances. HardwareFit then assigns the services to be scaled up to the appropriate hardware resource queues and sorts them by resource scarcity, computing cost-effectiveness, and so on. PreAssign then pre-allocates resources for each sub-service. Finally, the GroupAssign stage considers the scheduling satisfaction of each sub-service of a composite service and fine-tunes their capacities to avoid wasting resources.
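
The production scheduler is far more elaborate; the sketch below, with hypothetical data structures, only illustrates the Adjust / Sort / Assign flow of traffic scheduling: QPS is normalized per job, jobs are sorted by priority, and operator capacity is handed out greedily.

# Minimal sketch (hypothetical structures, not the production scheduler) of the
# Adjust / Sort / Assign phases of traffic scheduling.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int            # larger means more important
    required_qps: float
    norm_coeff: float = 1.0  # the Adjust phase tunes this from runtime metrics


def traffic_schedule(jobs, operator_capacity_qps):
    # Adjust: map each job's QPS into NormalizedQps
    for job in jobs:
        job.normalized_qps = job.required_qps * job.norm_coeff
    # Sort: higher-priority jobs are served first
    ordered = sorted(jobs, key=lambda j: j.priority, reverse=True)
    # Assign: greedily hand out the operator's remaining capacity
    remaining = operator_capacity_qps
    plan = {}
    for job in ordered:
        granted = min(job.normalized_qps, remaining)
        plan[job.name] = granted
        remaining -= granted
    return plan               # Bind would push this plan to the gateway


if __name__ == "__main__":
    jobs = [Job("refresh-old-features", 1, 8000, 0.8),
            Job("new-model-backfill", 3, 5000, 1.2)]
    print(traffic_schedule(jobs, operator_capacity_qps=10000))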


3.2.3 Batch computing platform

The batch computing platform needs to solve two problems: when elastic resources are relatively abundant (for example, at night), the Scan throughput of the Table (distributed table system) becomes the bottleneck; and offline task development efficiency needs to be optimized as much as possible. The specific solutions are introduced below.

3.2.3.1 HTAP storage design

Let's first analyze why Table Scan is slow. The main reasons are as follows:

1. Mixed reads and writes : Both OLTP workloads (fetching, updating, etc.) and OLAP workloads (feature batch computation, etc.) access the Table. Multiple read/write patterns are mixed on HDD-backed storage, and the heavy mixed IO causes disk throughput to drop severely.

2. Scan amplification : The Table is stored as a wide table. A given task usually needs only certain columns when scanning, but a Table Scan has to read entire rows and then filter, so IO amplification is severe.

3. High expansion cost : Because OLTP and OLAP reads and writes are mixed, expanding capacity for Scan alone is very expensive; and because the read/write ratio is hard to pin down, it is also hard to estimate how much capacity to add.


From the above analysis, the key problem is the mixed OLTP/OLAP use of the Table. According to industry practice, it is difficult for a single storage engine to satisfy both OLTP and OLAP scenarios. However, for the ease of use of the storage system, we still want one storage system to support both scenarios. We therefore combined our business scenario with industry experience and implemented an HTAP storage solution, as follows:

1. OLAP/OLTP storage separation : Build an efficient OLAP store for batch computing and other OLAP scenarios, reducing the mixed read/write problem caused by sharing the Table between OLAP and OLTP, and allowing each side to be scaled independently as needed.

2. Efficient OLAP storage design : The self-developed OLAP store is built on RocksDB and AFS (Baidu's HDFS-like file system). It adopts incremental synchronization, row-data partitioning, and dynamic columnar merge storage, dividing the full table data into N physical partitions and using the Table's incremental snapshots to update the OLAP store regularly and efficiently (because the Table uses LSM storage underneath, incremental snapshots are far more efficient than full scans). Columnar storage is reorganized according to field access hotspots, so that hot fields are stored together at the physical layer to reduce IO amplification; the layout can also be adjusted dynamically. This solution introduces a data synchronization delay, but in our scenario the timeliness requirement is low, so the problem can be ignored.

3. HTAP SDK : Provide a unified SDK that supports access to both the Table and the OLAP store. Users can run their OLTP and OLAP tasks at the same time on top of the SDK. (A sketch of what such an SDK might look like is shown below.)
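
The actual SDK is internal; the following is only a hypothetical Python sketch of the unified-access idea: point reads/writes go to the OLTP Table, while column-projected scans go to the OLAP store (both backends here are in-memory stand-ins).

# Hypothetical sketch of a unified HTAP SDK facade: OLTP point operations hit
# the Table, OLAP scans hit the columnar store. Backends here are stand-ins.
class HTAPClient:
    def __init__(self, oltp_table, olap_store):
        self.oltp = oltp_table      # distributed Table (point read/write)
        self.olap = olap_store      # columnar OLAP store (high-throughput scan)

    def get(self, key):
        return self.oltp.get(key)                    # OLTP: low-latency lookup

    def put(self, key, row):
        self.oltp.put(key, row)                      # OLTP: update/insert

    def scan(self, columns, partition):
        # OLAP: only the requested columns are read, avoiding scan amplification
        yield from self.olap.scan(columns=columns, partition=partition)


class DictTable(dict):                               # in-memory OLTP stand-in
    def put(self, key, row):
        self[key] = row


class FakeOlap:                                      # in-memory OLAP stand-in
    def __init__(self, rows):
        self.rows = rows

    def scan(self, columns, partition):
        for row in self.rows:
            yield {c: row[c] for c in columns}


client = HTAPClient(DictTable(),
                    FakeOlap([{"url": "a", "quality": 0.9, "body": "..."}]))
client.put("a", {"url": "a", "quality": 0.9})
print(client.get("a"))
print(list(client.scan(columns=["url", "quality"], partition=0)))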


3.2.3.2 Task generation and scheduling

To simplify the development of batch computing tasks, the platform currently provides three task development modes: configuration, KQL, and the offline framework. Their degree of freedom (and development cost) goes from low to high, and their ease of use from high to low:

  • Configuration : For common, frequently used task types, the platform encapsulates them heavily; a task can be generated simply by filling in a configuration on the Web interface.

  • KQL : KQL is a self-developed SQL-like language that provides a variety of built-in functions and supports custom functions (similar to Spark UDFs). Users can query and process data through KQL, for example:

Function classify = {
def classify(cbytes, ids):
    # custom UDF: return True if any byte of the packed classification field
    # matches one of the given class ids
    unique_ids = set(ids)
    classify = int.from_bytes(cbytes, byteorder='little', signed=False)
    while classify != 0:
        tmp = classify & 0xFF
        if tmp in unique_ids:
            return True
        classify = classify >> 8
    return False
}

declare ids = [2, 8];
select * from my_table
convert by json outlet by row filter by function@classify(@cf0:types, @ids);
  • Offline framework : The framework provides functions including data reading/writing and common transformations. Users implement custom logic according to the framework specification, generate an offline task deployment package, and submit it to the platform, which schedules the tasks.

In addition to the above methods, the platform is also exploring the use of large models to generate tasks from natural language. In fact, no matter which method is used, the final offline task is based on the offline framework; the other modes simply provide a higher degree of encapsulation for more specific scenarios.

After a task is generated, it is scheduled to the MapReduce or FaaS platform for execution. Different task generation methods require different preprocessing before scheduling; for example, a KQL task must first be parsed to generate the actual task to schedule, while tasks developed with the framework are prone to all kinds of unexpected problems, so DevOps processes such as automated admission checks are used. When a task runs, it first submits the required operators and expected throughput to the computing scheduling system, then continuously obtains its available quota from the gateway and adaptively adjusts its request delivery rate based on the current number of task instances, failure rate, and other factors.
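
The gateway's quota protocol is not described in detail; the sketch below, with a hypothetical get_quota_from_gateway call, only illustrates the client side of this idea: periodically fetch the currently allowed QPS, pace request delivery accordingly, and back off when failures rise.

# Hypothetical sketch of an offline task's adaptive request delivery loop.
# The gateway endpoint and response fields are assumptions for illustration.
import time
import random


def get_quota_from_gateway(job_id):
    # stand-in for an RPC/HTTP call such as GET /quota?job=<job_id>
    return {"allowed_qps": random.choice([50, 100, 200])}


def send_request(item):
    return random.random() > 0.05                # stand-in for the real RPC


def run_job(job_id, items):
    sent, failed, qps = 0, 0, 1
    for item in items:
        if sent % 100 == 0:                      # refresh quota periodically
            qps = get_quota_from_gateway(job_id)["allowed_qps"]
        failure_rate = failed / max(sent, 1)
        if failure_rate > 0.1:                   # back off when failures rise
            qps = max(1, qps // 2)
        ok = send_request(item)                  # deliver one request
        sent += 1
        failed += 0 if ok else 1
        time.sleep(1.0 / qps)                    # pace deliveries to the quota


if __name__ == "__main__":
    run_job("feature-backfill-001", range(500))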


04 Summary

The system currently supports more than ten business directions such as image search, video search, and search-by-image. It supports the development and launch of hundreds of operators, tens of billions of computation calls per day, and the routine processing and refreshing of trillion-scale content features across the entire web. The era of large AI models has brought many new scenarios and challenges, and there is much worth rethinking; we will explore further in combination with large models in the future.

Recruitment

The department is actively recruiting for multiple positions, including ANN retrieval engineers, model optimization engineers, and distributed computing R&D engineers. People who are willing to embrace challenges and have excellent problem analysis and problem-solving abilities are welcome to join!

Recruitment email: [email protected]

——END——

Recommended reading

Support OC code reconstruction practice through Python script (3): Adaptation of data item use module to access data path

Baidu search intelligent computing power control and allocation method

Baidu search deep learning model business and optimization practice

UBC SDK log level repetition rate optimization practice

Large-scale practice of text-to-image generation: revealing the story behind Baidu Search's AIGC painting tool!
