A detailed look at the Baidu image processing and indexing platform that supports search for 700 million users


Introduction: Baidu search is mainly composed of two parts, "search online" and "search offline". The "online" services respond to user requests, while the "offline" services transform and process data from various sources and feed it to the "online" services. "Search offline" data processing is a typical combination of massive-scale batch and real-time computing.

The full text is 4142 words, and the estimated reading time is 8 minutes.

1. "Offline" and "Online" behind multimodal retrieval

Baidu search is mainly composed of "search online" and "search offline". The "online" services respond to user requests, while the "offline" services transform data from various sources and feed it into the "online" services. "Search offline" data processing is a typical combination of massive-scale batch and real-time computing.

Since 2015, the Baidu App has offered multimodal search, which is the most intuitive way users experience intelligent search. Multimodal retrieval adds visual retrieval and speech retrieval on top of traditional text retrieval.


The offline and online technologies behind "visual retrieval" and "text-to-image retrieval" have a great deal in common. Taking visual retrieval as an example, its product forms include: query suggestion ("guess the word"), larger-size images, image sources, vertical results (short videos, products, etc.), similar-image recommendation, and so on. The core technologies behind them are classification (online GPU model inference) and ANN retrieval.


For ANN retrieval, the main methods currently used are cluster-based GNO-IMI, graph-based HNSW, and locality-sensitive hashing. Selection is driven mainly by the cost of the technical solution and its applicability to the features. For example, GNO-IMI is an open-source solution from Baidu with a relatively small memory footprint, so its cost is acceptable for tens-of-billions-scale ANN retrieval; locality-sensitive hashing, applied to local features such as SIFT, improves recall in photo-recognition scenarios on mobile phones.
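To make the graph-based approach concrete, here is a minimal sketch of ANN retrieval over image embeddings using the open-source hnswlib library. This is not Baidu's internal implementation, and the dimensionality and index parameters are illustrative assumptions only:

```python
import numpy as np
import hnswlib

dim, num_items = 256, 100_000                 # embedding size and corpus size (illustrative)

# Build a graph-based (HNSW) index over image feature vectors.
index = hnswlib.Index(space="ip", dim=dim)    # inner-product similarity
index.init_index(max_elements=num_items, ef_construction=200, M=16)

vectors = np.random.rand(num_items, dim).astype(np.float32)
index.add_items(vectors, np.arange(num_items))

# At query time, retrieve the top-k most similar images for a query embedding.
index.set_ef(64)                              # recall/latency trade-off
query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)
```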

These online technologies rely on more than 100 types of features. Crawling and indexing images from the entire web offline and computing their features takes a huge amount of computing power. In addition, images are attached to web pages on the Internet, so the "image - image link - web page link" relationship must be maintained (both offline data processing and online applications depend on this relationship; for example, tracing an image back to its source requires the URL of the web page it came from).

Against this background, the Search Architecture Department and the Content Technology Architecture Department jointly designed and built the image processing and indexing middle platform, based on their own business and technical characteristics, to achieve the following goals:

  1. Unify data acquisition and processing capabilities: consolidate the data acquisition, processing, and storage logic of image-related businesses, improve engineering efficiency, and reduce storage and computing costs.

  2. Enable image applications at the scale of tens to hundreds of billions to quickly prototype, acquire data, and update data across the whole web.

  3. Build a data channel for real-time image screening and customized distribution, improving the timeliness with which image resources are introduced.

The project is known internally as Project Imazon. Imazon comes from Image + Amazon, where Amazon represents the platform's throughput, DAG processing capacity, and image capacity.

At present, the image processing and indexing middle platform processes billions of images in a single day for complex business scenarios, with second-level real-time indexing at 100 qps and whole-web indexing at 10,000 qps. The platform currently supports the image processing and indexing requirements of multiple business lines, greatly improving the efficiency of business delivery.

2. Architecture and key technologies of the image processing and indexing middle platform

Continuous optimization of search results is inseparable from data and computing power, and revolves mainly around acquisition, storage, and computation. The general capabilities we hope the image processing and indexing middle platform provides include: indexing channels for both time-sensitive data and whole-web images, a high-throughput streaming processing mechanism, the ability to describe the relationships between images and web pages, original image and thumbnail storage, an online processing mechanism, and so on.

2.1 What problems does the image processing and indexing middle platform solve?

The main flow of the image processing and indexing middle platform consists of six stages: web spider (fetching web page content), image content extraction, image spider (crawling images), feature computation (more than 100 feature types), content relationship storage, and index building, as shown below:

[Figure: the six stages of the image processing and indexing pipeline]
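As a rough mental model only, the six stages can be sketched as a chain of operators; every function below is a hypothetical stub standing in for an entire subsystem described above:

```python
# Purely illustrative; all names are hypothetical stand-ins.

def web_spider(page_url: str) -> str:
    return "<html>...</html>"                      # 1. fetch web page content

def extract_image_links(html: str) -> list[str]:
    return ["http://example.com/a.jpg"]            # 2. extract image links from the page

def image_spider(obj_url: str) -> bytes:
    return b"..."                                  # 3. crawl the image bytes

def compute_features(image: bytes) -> dict:
    return {"ocr": "", "clarity": 0.9}             # 4. 100+ feature types (CPU/GPU)

def store_relations(page_url: str, obj_url: str, features: dict) -> None:
    pass                                           # 5. persist page-link-image relations

def build_index(image: bytes, features: dict) -> None:
    pass                                           # 6. build the online retrieval index

def process_page(page_url: str) -> None:
    html = web_spider(page_url)
    for obj_url in extract_image_links(html):
        image = image_spider(obj_url)
        features = compute_features(image)
        store_relations(page_url, obj_url, features)
        build_index(image, features)
```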

2.2 Technical metrics of the image processing and indexing middle platform

The platform's technical metrics are defined from three aspects: architecture, effect, and R&D efficiency.

Architecture metrics include throughput, scalability, and stability:

  • Throughput: maximize throughput within the cost limit; a single data record is on the order of 100 KB (image + features).

  • Scalability: cloud-native deployment with elastic scheduling of computing resources, computing faster when resources are available and more slowly when they are not.

  • Stability: no data loss, with automatic retry and automatic replay; time-sensitive data is processed successfully at minute-level latency, and whole-web data at day-level latency.

Effect metrics focus on data relationships:

  • The image-to-web-page link relationship reflects reality (e.g., when a web page or image goes offline, the relationship is updated).

R&D efficiency indicators include business versatility and language flexibility:

  • Business versatility: support data acquisition and feature iteration for any business that relies on whole-web images.

  • Language flexibility: C++, Go, and PHP.

2.3 Architecture design of the image processing and indexing middle platform

Image processing and indexing is stream processing over unbounded data, so the overall architecture is designed primarily as a streaming real-time processing system that also supports batch input. To address the high throughput requirements and business R&D efficiency, the design adopts elastic, event-driven computing and decouples business logic from the DAG framework in deployment. The details are shown in the figure below and explained later.

[Figure: overall architecture of the image processing and indexing middle platform]

2.4 Infrastructure of the image processing and indexing middle platform

Baidu infrastructure:

  • Storage: table, bdrp (redis), undb, bos

  • Message queue: bigpipe

  • Service framework: baidurpc, GDP(go), ODP(php)

Business infrastructure relied on and built:

  • Pipeline scheduling: Odyssey, which supports every DAG in the architecture overview

  • Flow control system: provides traffic balancing and rate adjustment at the core entry layer

  • Qianren: hosts, schedules, and routes hundreds to thousands of CPU/GPU operators with tens of thousands of instances

  • Content relationship engine: describes image-web page relationships, computes in an event-driven manner, and works with blades for elastic scheduling

  • Offline microservice component: Tigris; the concrete business logic of DAG nodes is executed via remote RPC

3. Optimization practice

The following briefly introduces some of the platform's optimization practices in high-throughput, compute-intensive scenarios.

3.1 Practice of high-throughput streaming architecture

Costs (computing power, storage) are limited. Facing high throughput requirements, targeted optimizations were made on the following problems:

  • Message queues are expensive

  • Insufficient resource utilization caused by traffic spikes and peak-trough variation

  • Data accumulation caused by insufficient computing power

3.1.1 Message Queuing Cost Optimization

In offline streaming data processing, passing data through message queues along the DAG/pipeline is a fairly conventional solution: the persistence of the message queue guarantees that no data is lost (at-least-once delivery). The characteristics of our business are:

  • What flows through the pipeline/DAG is the image plus its features, hundreds of KB per record, so the message queue cost is relatively high.

  • Downstream operators do not necessarily need all of the data, so transparently passing every field through the message queue is not cost-effective.

The specific optimization ideas are as follows:

  • The message queue inside the DAG carries only a reference (a trigger msg), while the operator outputs are stored in a bypass cache.

  • The bypass cache is optimized for high throughput and low cost by exploiting the data's life cycle within the DAG: active deletion and dirty-write optimization.


The specific protocol is designed as:

  • Trigger msg (~bytes): passed through the message queue, point-to-point between operators

  • TigrisTuple (~100 KB): shared between operators through Redis

  • ProcessorTuple (~MB): read and written on demand through the bypass cache
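A minimal sketch of this reference-passing idea, using Redis as the bypass cache; the key layout, TTL, and the queue_send callback are illustrative assumptions, not the internal Bigpipe/Tigris protocol:

```python
import json
import uuid

import redis

cache = redis.Redis(host="localhost", port=6379)  # bypass cache (Redis stand-in)

def publish(queue_send, processor_output: bytes) -> None:
    """Producer side: park the large payload in the bypass cache and send
    only a small trigger message through the message queue."""
    data_key = f"imazon:tuple:{uuid.uuid4().hex}"
    cache.set(data_key, processor_output, ex=3600)    # lifetime bounded by DAG latency
    trigger_msg = json.dumps({"data_key": data_key})  # a few hundred bytes, not ~100 KB
    queue_send(trigger_msg)                           # stand-in for a message queue publish

def consume(trigger_msg: str):
    """Consumer side: fetch the payload on demand; the last reader in the
    DAG deletes it actively instead of waiting for the TTL to expire."""
    data_key = json.loads(trigger_msg)["data_key"]
    payload = cache.get(data_key)
    cache.delete(data_key)  # assumes this operator is the final consumer
    return payload
```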

3.1.2 Traffic smoothing and staggered-peak computation

Peaks, troughs, and spikes in the ingress traffic would force the entire system to be provisioned for peak capacity, leaving resources under-utilized during off-peak periods, as shown below:

[Figure: ingress traffic peaks, troughs, and spikes]

The specific optimization ideas are as follows:

Use back pressure and flow control to maximize the total throughput of the system with a fixed amount of resources:

  • The flow control system smooths the traffic, narrowing the gap between mean and peak so that the capacity utilization of every module in the system stays high.

  • The DAG/pipeline supports back pressure: when a module's capacity is insufficient, back pressure propagates up to the flow control module, which adapts its rate so that peak data is deferred and computed during troughs.

  • To keep delays acceptable for businesses that cannot tolerate lag, data is differentiated by priority so that high-priority data is distributed first (the system's overall throughput is designed to cover at least the throughput of the high-priority data).

△ Figure 3: three-priority flow control
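A toy sketch of the three-priority idea: when back pressure lowers the permitted send rate, low-priority data is deferred first so that high-priority, time-sensitive data keeps flowing. The queue names and the per-tick budget are assumptions made for illustration:

```python
from collections import deque

# Three priority queues; high-priority (time-sensitive) data drains first.
queues = {"high": deque(), "mid": deque(), "low": deque()}

def send_downstream(item) -> None:
    pass  # stand-in for pushing the item into the DAG entry message queue

def dispatch(permitted_qps: int) -> None:
    """Send up to permitted_qps items in this tick, highest priority first.

    permitted_qps shrinks when downstream DAGs apply back pressure, so
    low-priority items naturally lag until off-peak periods."""
    budget = permitted_qps
    for prio in ("high", "mid", "low"):
        while budget > 0 and queues[prio]:
            send_downstream(queues[prio].popleft())
            budget -= 1
```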

3.1.3 Handling data backlogs caused by temporary computing power shortages in high-throughput scenarios

In the whole-web indexing scenario, feature computation is bottlenecked on GPU resources: these features consume a very large number of GPU cards. The problem can be mitigated with ideas such as staggered-peak computation and online-offline co-location with temporarily borrowed resources, but this introduces a new problem: such a large amount of data cannot be buffered inside the offline pipeline, and back pressure must not reduce the processing throughput of the upstream DAG.


Specific optimization ideas:

  • Analyze the bottleneck and split the DAG there; use the storage DB as a "natural flow control" buffer and drive the rest with events (feature computation is scheduled elastically, and once a feature is in place an event triggers the downstream DAG).
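A hedged sketch of the "DB as natural flow control, event driven" idea: the upstream DAG persists its record and returns immediately, GPU feature computation runs whenever resources allow, and an event fired when a feature lands decides whether the downstream DAG is triggered. All names and the in-memory store are illustrative stand-ins:

```python
# Hypothetical glue between the two halves of the split DAG. The dict-based
# store, the REQUIRED feature set, and the print stand in for table storage,
# the real feature list, and the downstream DAG scheduler respectively.

db: dict[str, dict] = {}
REQUIRED = {"ocr", "clarity"}  # illustrative set of features needed before indexing

def upstream_dag_finish(image_id: str, raw_record: dict) -> None:
    """Upstream DAG: persist the record and return; never block on GPU capacity."""
    raw_record.setdefault("features", {})
    db[image_id] = raw_record
    # ...image_id is queued for elastic GPU feature computation (possibly off-peak)...

def on_feature_ready(image_id: str, name: str, value) -> None:
    """Event handler fired when one feature has been written back to the DB."""
    record = db[image_id]
    record["features"][name] = value
    if REQUIRED <= record["features"].keys():  # all required features are in place
        print(f"trigger downstream DAG for {image_id}")  # stand-in for scheduling
```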

3.2 Content Relationship Engine

The content relationships of images on the Internet can be described by a tripartite graph, using the following concepts:

  • f: fromurl, representing a web page; one f can contain multiple o's. Features on the f dimension: title, page type, etc.

  • o: objurl, representing an image link; one o points to exactly one image. Features on the o dimension: dead-link status, etc.

  • c: content sign, the signature of the image content, representing the image itself. Features on the c dimension: image content, OCR, clarity, characters, etc.

  • fo: the edge linking a web page to an image link. Edge features: image context, alt text, etc.

  • oc: the edge linking an image link to the image content. Edge features: image crawl time, etc.
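The f/o/c model can be pictured as three node types plus two edge types. A minimal in-memory sketch follows; the real engine is a Table-based graph store at hundred-billion-node scale, and the field names here are only illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class FromUrl:                 # f: a web page
    url: str
    title: str = ""
    page_type: str = ""
    objurls: set[str] = field(default_factory=set)  # fo edges (context/alt stored alongside)

@dataclass
class ObjUrl:                  # o: an image link; points to exactly one image
    url: str
    is_dead_link: bool = False
    content_sign: str = ""     # oc edge
    crawl_time: int = 0        # oc edge feature

@dataclass
class Content:                 # c: the image content, keyed by its signature
    sign: str
    ocr: str = ""
    clarity: float = 0.0
```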

The content relationship engine needs to be able to describe the following behaviors:

[Figure: behaviors the content relationship engine must support]

To describe the complete relationships among these elements across the Internet, this amounts to a graph database with on the order of 100 billion nodes and petabyte-level storage. The system metrics it needs to achieve are:

  • Write performance:

    • vertex: qps on the order of tens of thousands; single-node attributes of ~100 KB

    • edge: qps on the order of hundreds of thousands

  • Read performance (full-data screening, feature iteration):

    • export point and edge attribute information (scan throughput requirement: GB/s)

To meet these read and write performance requirements, a COF tripartite-graph content relationship engine was designed on top of Table. The core design ideas are:

  • The C table partitions data by a hash prefix to keep scans sequential, and stores the complete relationship (which o's a c comes from, and which f's each o comes from); petabyte-level storage.

  • The O table uses SSD storage to support looking up the c that corresponds to an o.

  • The F table uses SSD media to improve random read performance; it stores the reverse mapping, supporting lookup of o and c from f.
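A hedged sketch of how such row keys might be laid out, so that a short hash prefix spreads C-table rows across partitions while keeping scans sequential; the exact key format of the internal Table system is not public, so this is purely an assumption:

```python
import hashlib

def c_table_rowkey(content_sign: str) -> bytes:
    """C table: a short hash prefix spreads rows across partitions while
    keeping a full scan sequential within each prefix range."""
    prefix = hashlib.md5(content_sign.encode()).hexdigest()[:4]
    return f"{prefix}|{content_sign}".encode()

def o_table_rowkey(objurl: str) -> bytes:
    # O table: random point reads on SSD; look up the c behind an o.
    return hashlib.md5(objurl.encode()).hexdigest().encode()

def f_table_rowkey(fromurl: str) -> bytes:
    # F table: stores the reverse mapping f -> {o, c}, also on SSD.
    return hashlib.md5(fromurl.encode()).hexdigest().encode()
```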


To reduce the I/O bottleneck caused by random writes and to avoid the complexity of full transactions, the correctness of relationships is guaranteed by a version-based approach: versions are verified at read time, and stale entries are retired asynchronously.

3.3 Other Practices

To improve business R&D iteration efficiency and the maintainability of the system itself, we have solved a number of problems, though we are only at the start of the road toward greater "developer happiness". The focus is on R&D efficiency and maintenance cost.

For example, in terms of business onboarding efficiency:

Data source reuse:

  • Problem: 10 businesses' data come in 10 formats, with too many nested protos to understand.

  • Try: move from heterogeneous schemas to a standard schema; manage each OP's inputs and outputs.

DAG output reuse:

  • Problem: reuse must not affect the upstream DAG's processing throughput and speed.

  • Try: chain DAGs via RPC to avoid cascading blockage; for native DAG chaining, handle the data life-cycle problem with copy-on-write & erase.

Resource storage reuse:

  • Problem: "I generated a thumbnail from it, but now the thumbnail won't open! What, the original image has been deleted too?"

  • Try: a multi-tenant mechanism, reference-counted deletion, unified CDN access, and unified online intelligent cropping and compression.

In terms of multilingual support:

  • Problem:

  • Teams want to use C++/Python/PHP/Go, and making the framework compatible with all of them is complicated. "It's slow — whose problem is it?"

  • "I only need to implement my business logic; I don't want to care about the details of the DAG."

  • Try:

  • Unify the DAG framework on a single language, and isolate business logic behind remote RPC.

  • Rpc Echo(trigger msg[in], tigris tuple[in], processor input list[in], processor output list[out])

In terms of maintenance costs:

  • Problem:

  • "Why wasn't this piece of data indexed?"

  • "99+ alert messages (warnings and fatals mixed together) — what should I do? The message queue is backed up again."

  • Try:

  • Distributed app log trace

  • Monitoring & alerting, classified by severity, each with a how-to.

  • Core business metrics: indexing volume per second, indexing latency by quantile, feature coverage, per-business distribution volume per second, data loss rate, proportion of data indexed past its deadline.

  • Core system metrics: PV submitted to the DAG, DAG capacity/utilization, OP status (OK, FAIL, RETRY, ...), OP capacity/utilization, OP latency and timeout rate.

  • Key dependency metrics: throughput, latency, and failures of dependent services; fine-grained monitoring inside each OP.

Author of this issue | imazon

We are currently recruiting for positions in computer vision processing, index retrieval, offline stream processing, and more. Interested readers are welcome to follow the Baidu Geek official account and submit a resume. We look forward to your joining!

Recommended reading

We open-sourced this easy-to-use distributed application configuration center

Detailed explanation of key technologies of Baidu rich media advertisement retrieval and comparison system


Follow the Baidu Geek official account for more technical content, perks, and internal referrals~
