Evolution of Baidu's Smart Mini Program Inspection and Scheduling Solution


Introduction: Baidu Smart Mini Program relies on the global traffic represented by Baidu APP to connect users precisely. Today there are nearly one million Smart Mini Programs online, containing tens of billions of content resources. With such a massive number of pages, finding problematic pages quickly and efficiently, so as to safeguard both content security and user experience, is no small challenge. This article focuses on the security inspection mechanism for online Mini Program content, and in particular on the evolution of the inspection scheduling scheme.

The full text is 6178 words, and the estimated reading time is 16 minutes.

1. Business Introduction

1.1 Introduction to the Inspection Service

Baidu Smart Mini Program relies on Baidu's ecosystem and scenarios and, through the "search + recommendation" distribution of Baidu APP, gives developers a convenient channel for acquiring traffic, greatly reducing their customer-acquisition cost. As the number of Mini Program developers grows, the content quality of online Mini Programs varies widely. Low-quality content (pornography, vulgarity, etc.) exposed online severely harms the user experience, and red-line content (sensitive material, gambling, etc.) can even create serious legal risks and threaten the security of the Mini Program ecosystem. For online Mini Programs we therefore need to build quality-assessment capabilities, inspection mechanisms, and online intervention mechanisms. Through 7x24-hour inspection of online Mini Program content, Mini Programs that do not meet the standards are handled in time, with measures such as rectification within a time limit or forced offlining, ultimately ensuring the online ecological quality and user experience of Mini Programs.

1.2 The goals and core constraints of the inspection and scheduling strategy

At present, Baidu Smart Mini Programs see hundreds of millions of deduplicated page views per day, and the total number of page resources online is as high as tens of billions. Ideally, to control risk comprehensively, every page that should be inspected would be inspected, and online risks would be recalled quickly and accurately.

However, in actual implementation, the following factors or limitations need to be considered:

  • Different Mini Programs (or owning entities) carry different content-security risk. For example, Mini Programs in special categories such as government affairs undergo stricter review before release and are unlikely to violate the rules; conversely, Mini Programs in certain other categories have a relatively high risk of violations. When some Mini Programs under an entity have many historical violations, the probability of cheating or violations in that entity's other Mini Programs is also relatively high. These cases need to be treated differently;

  • Per-Mini-Program crawl quota. Every crawl of a Mini Program page is ultimately an access to that page and translates into load on the Mini Program's servers, and the inspection itself must not affect the stability of developer services. On the Mini Program development platform, developers can explicitly declare the crawl quota their Mini Program allows; for Mini Programs that do not declare a quota explicitly, we set a reasonable crawl threshold based on the Mini Program's traffic (PV, UV, etc.);

  • Resource constraints. Assessing the content security of a page first relies on a spider to crawl the page, render it, and parse the text and images it contains, followed by security detection on that text and those images; crawling, rendering, and detection all consume a large amount of machine resources;

  • Other relevant factors and limitations include: the risk exposure associated with page traffic (a page with high traffic and a page with a single click have very different exposure), differentiation between traffic entrances, and so on.

Therefore, we need to weigh all of the above when designing the inspection scheduling strategy, and continuously tune the strategy and resource usage based on precision/recall evaluation of online cases, so as to discover potential online risks more efficiently and accurately, shorten the time risks stay exposed online, and ultimately keep the Smart Mini Program ecosystem healthy.

2. The evolution process of the inspection scheduling scheme

2.1 V1.0 Inspection and Scheduling Scheme

2.1.1 Top-level architecture

The top-level design of inspection V1.0 is shown in the figure below, and the key components (or processes) included are as follows:

Data source: while online users use Baidu Smart Mini Programs, the client SDK continuously collects instrumentation logs (Mini Program access logs, performance logs, exception logs, etc.) and reports them to the Baidu log center, where they are persisted to storage. These logs are an important data source for the inspection page-discovery strategies. (Note: following Baidu's security guidelines, we do not collect or store user privacy information such as mobile phone numbers obtainable via login state.)

Page discovery strategy: the number of deduplicated Mini Program pages clicked (or visited) per day is in the hundreds of millions. Constrained by the resources available for crawling, rendering, detection and other stages, the page discovery strategy selects the pages most likely to carry risk as the inspection targets.

Inspection platform: the platform itself contains multiple sub-service modules, such as inspection task generation, page dispatch and crawling, and the various detection capabilities (covering risk-control and experience issues, red-line and non-red-line low quality, etc.). The sub-modules interact with each other asynchronously, largely through Kafka.

Low-quality review and signal issuance: because some machine-review capabilities are tuned for high recall, operations staff manually review the risky content recalled by machine review; the manually confirmed low-quality signals are then issued and applied downstream.

Online low-quality intervention (suppression): for the various low-quality risk problems, we have a complete and fine-grained online intervention process based on the characteristics of Mini Program traffic and the risk level of the problem, with penalties of varying severity ranging from page blocking and traffic shut-off to taking the Mini Program offline and even blocking the owning entity.

[Figure: V1.0 inspection top-level architecture]

2.1.2 Implementation of inspection scheduling strategy

The V1.0 inspection strategy uses offline scheduling. Combining characteristics such as Mini Program traffic distribution, industry category, online release cycle, and violation history, we abstracted a number of different strategies from the online data and scheduled them over different periods ranging from hourly to weekly.

In order to balance business demands and resource requirements, the inspection and scheduling plan also considers the following factors:

  • Precise deduplication of page URLs within the same strategy and between different strategies

  • Identification and deduplication of the same page from different channel sources

Mini Program pages are distributed through different channels, such as feed, search, and share/forwarding. The URL of the same page differs slightly across distribution channels, but the page content itself is identical, so we built a dedicated strategy to recognize the same page across different traffic funnels. The purpose of the page deduplication above is to improve the effective utilization of resources.
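As a rough illustration of this kind of cross-channel deduplication (not the actual strategy), the sketch below normalizes a page URL by stripping channel-tracking query parameters and sorting the remaining ones so that variants of the same page collapse to one key; the parameter names in CHANNEL_PARAMS are hypothetical.

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical channel-tracking parameters; the real list would be maintained
# as part of the deduplication strategy.
CHANNEL_PARAMS = {"from", "channel", "source", "_refer"}

def normalize_page_url(url: str) -> str:
    # Drop channel-specific query parameters and sort the rest, so the same
    # page reached from feed / search / share collapses to a single key.
    parts = urlparse(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k not in CHANNEL_PARAMS)
    return urlunparse(parts._replace(query=urlencode(kept), fragment=""))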

2.1.3 Business Challenges

However, with the rapid growth in the number of developers and Mini Programs onboarded to Baidu Smart Mini Program, the number of Mini Program pages surged from billions to tens of billions; at the same time, the introduction of service-oriented businesses raised the timeliness requirement for risk control from days to hours. The existing architecture could no longer meet the risk-control requirements of this business growth.

2.2 V2.0 Inspection and Scheduling Scheme

2.2.1 Design objective (optimization direction)

The detection data of the V1.0 architecture is mainly offline data with a T+1 delay: risk-control problems exposed one day can only be discovered the next day, which lengthens the time a problem stays exposed online. The ultimate goal of the V2.0 architecture is to discover a risky page as soon as it is exposed online. To achieve this, inspection pages are driven primarily by real-time streaming data, supplemented by offline data. In addition, because inspecting a Mini Program page requires crawling its content, which puts pressure on the Mini Program's servers, the crawling of a single Mini Program must be spread out evenly and capped by a per-day quota. The specific design guidelines are as follows:

  • Principle: real-time first, offline as supplement

  • Red line: never exceed a Mini Program's crawl quota; crawling must be spread evenly, and the same Mini Program must not be crawled in concentrated bursts

  • Product requirement: ensure high coverage of page detection

  • Constraint: page-crawling resource limits

2.2.2 Top-level architecture

The top-level architecture of the evolved V2.0 inspection strategy is designed as follows:

[Figure: V2.0 inspection top-level architecture]

Compared with V1.0, V2.0 introduces real-time data streams and controls each Mini Program's crawl quota at a finer granularity, within 5-minute windows.

2.2.3 Breakdown and Implementation of the V2.0 Inspection Scheduling Strategy

The overall implementation of real-time inspection is shown in the figure below and can be divided into three parts: the real-time page discovery strategy, the offline page discovery strategy, and the page scheduling strategy. The real-time page discovery strategy is new relative to the V1.0 architecture: it consumes real-time traffic log data directly and selects pages clicked by users according to certain rules, enabling minute-level discovery of risk-control problems. The offline page discovery strategy is similar in architecture to V1.0 and uses T-1 (previous day) log data as a fallback for the real-time data. Real-time data alone is not sufficient because Mini Program usage QPS has peaks and troughs; during troughs the number of discovered pages drops and the page-crawling and detection capacity would sit idle, so offline data is needed as a supplement. The page scheduling strategy aggregates the real-time and offline strategies, implementing the supplementing of real-time data with offline data as well as page deduplication, and dispatches the selected pages for inspection.

[Figure: overall implementation of the real-time inspection scheme]

2.2.3.1 Offline page discovery strategy

The offline page discovery strategy uses the previous day's logs of users browsing Mini Program pages to count the PV of each page, filters the pages to be inspected through strategies such as false-positive ("accidental injury") pool filtering and crawl quota limits, and writes the candidate pages together with their PV into Doris for inspection scheduling.

The data flow is shown in the figure below. The data passes through the ODS layer (Hive tables storing the raw Mini Program logs), the DWD layer (Hive tables storing users' page-load logs), the DWA layer (Hive tables storing the PV of each page), and the DM layer (Doris tables storing the page information to be detected); the computation between layers of the data warehouse is implemented with Spark.

[Figure: data flow of the offline page discovery strategy (ODS → DWD → DWA → DM)]
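The following is a minimal PySpark sketch of the DWD-to-DWA step described above (counting per-page PV from the previous day's page-load logs); the table and column names are assumptions for illustration, not the production schema.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("offline_page_discovery")
         .enableHiveSupport()
         .getOrCreate())

# DWD layer: one row per page load on the previous day (illustrative names).
dwd = spark.table("smartapp_dwd.page_load_log").where(F.col("dt") == "2022-06-01")

# DWA layer: per-page PV for that day.
dwa = (dwd.groupBy("app_id", "page_url")
          .agg(F.count(F.lit(1)).alias("pv")))

dwa.write.mode("overwrite").saveAsTable("smartapp_dwa.page_pv_daily")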

2.2.3.2 Real-time page discovery strategy

The data source of the real-time page discovery strategy is the Mini Program real-time distribution service, which is the foundation of the Mini Program real-time data warehouse. A data consumer adds a log distribution rule on the management side of the service so that qualifying data is distributed into a designated message queue (Baidu BigPipe); Spark, Flink, or custom programs then consume the messages from that queue for computation. The overall architecture of the real-time data service is shown in the figure below; it achieves second-level latency.

[Figure: overall architecture of the real-time data distribution service]

In the Mini Program real-time data distribution service, a log distribution rule is configured to filter the logs of users opening Mini Programs, and Structured Streaming subscribes to the corresponding message-queue topic. The key fields are first extracted from each log: Mini Program ID, page URL, event time, and so on. Hundreds of millions of pages are clicked on Mini Programs every day, and detecting all of them is unrealistic, so pages with higher PV are selected for detection first. Computing PV over real-time data requires a time interval, which corresponds to the Structured Streaming micro-batch: the window size is set to 5 minutes and the sliding step is also 5 minutes, so windows do not overlap, and the PV of each page within each 5-minute window is computed. Data that arrives too late is discarded via the watermark, which is set to 15 minutes, i.e., data older than 15 minutes is filtered out. The output uses Append mode, so each window is emitted only once with its final result, avoiding repeated submission of pages within a single window. The concepts of windows and watermarks are illustrated in the figure below (the figure is taken from the Spark Structured Streaming official documentation, https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html).

[Figure: Structured Streaming event-time windows and watermark (from the Spark documentation)]
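Below is a minimal Structured Streaming sketch of the windowed PV computation described above: 5-minute tumbling windows, a 15-minute watermark, and Append output. The real source is Baidu BigPipe; a Kafka topic is used here purely as a stand-in, and the field names are assumptions.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("realtime_page_discovery").getOrCreate()

schema = StructType([
    StructField("app_id", StringType()),
    StructField("page_url", StringType()),
    StructField("event_time", TimestampType()),
])

# The production source is Baidu BigPipe; a Kafka topic is used as a stand-in here.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "smartapp-access-log")
       .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# 5-minute tumbling window (slide == window size), 15-minute watermark; with
# Append mode each window's final PV is emitted exactly once when it closes.
page_pv = (events
           .withWatermark("event_time", "15 minutes")
           .groupBy(F.window("event_time", "5 minutes"), "app_id", "page_url")
           .agg(F.count(F.lit(1)).alias("pv")))

query = (page_pv.writeStream
         .outputMode("append")
         .format("console")                 # the real job writes to Elasticsearch
         .trigger(processingTime="5 minutes")
         .start())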

Because the QPS available for page-content crawling is limited, it is impossible to crawl every page for inspection. Pages with high PV within a window are all crawled. Pages with low PV are far too numerous, so they are sampled; Structured Streaming does not support a LIMIT statement, so to sample the data, each page with PV = 1 is assigned a random number rand in the range 0-9999 and only pages with rand < 100 are kept, i.e., the low-PV pages are sampled at 1%. The high-PV pages are then unioned with the sampled low-PV pages, which keeps the inspection QPS below the page-crawling QPS limit.
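Continuing the sketch above, the 1% sampling of PV = 1 pages can be expressed as a single filter on the windowed aggregation, which is equivalent to unioning the high-PV set with the sampled low-PV set.

# Pages with PV == 1 are too numerous to inspect exhaustively, so each is
# tagged with a random integer in [0, 9999] and only rand < 100 (about 1%)
# is kept; pages with PV > 1 are all kept.
pages_to_inspect = (page_pv
    .withColumn("rand", (F.rand() * 10000).cast("int"))
    .where((F.col("pv") > 1) | (F.col("rand") < 100))
    .drop("rand"))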

The selected pages must also pass a series of product-policy restrictions:

  • Some Mini Programs are exempt from review, and their pages are filtered out;

  • The false-positive ("accidental injury") pool contains pages that were recalled by machine review but judged problem-free by human review; filtering against this pool greatly improves detection precision;

  • Large-scale page crawling puts pressure on a Mini Program's servers, and each Mini Program has a crawl quota limit.

Page filtering is implemented with a left join that associates the real-time stream with offline dimension tables. The offline dimension data is updated continuously; the name (path) of the dimension data must stay unchanged while its content is refreshed, so that in every window the real-time stream reads the latest dimension data.
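A sketch of these stream-static left joins (exemption list, false-positive pool, crawl quota), continuing the example above, is given below; the dimension table paths and column names are assumptions. The dimension tables live under fixed names and are refreshed in place so that each micro-batch sees the latest snapshot.

# Offline dimension tables under fixed, unchanging names; their contents are
# refreshed in place (paths and columns are illustrative).
exempt_apps = spark.read.parquet("/dim/exempt_apps")          # columns: app_id
false_pos   = spark.read.parquet("/dim/false_positive_pool")  # columns: page_url
crawl_quota = spark.read.parquet("/dim/crawl_quota")          # columns: app_id, quota

filtered = (pages_to_inspect
    .join(exempt_apps.withColumn("exempt", F.lit(True)), "app_id", "left")
    .where(F.col("exempt").isNull())
    .drop("exempt")
    .join(false_pos.withColumn("fp", F.lit(True)), "page_url", "left")
    .where(F.col("fp").isNull())
    .drop("fp")
    .join(crawl_quota, "app_id", "left"))   # quota is enforced downstream per window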

The final data is written to Elasticsearch. If all data were written under a single index, all inserts, deletions, and updates would hit that one index; its data volume would grow day by day, query and insert efficiency would drop, and deleting historical data would be inconvenient: delete_by_query performs poorly and does not physically delete documents, so it frees no space and brings no performance gain. Instead we use index aliases plus time-based index splitting, whose advantage is that historical data can be dropped by deleting whole historical indexes, which is easy to operate, effectively frees space, and improves performance. Splitting ES indexes this way requires creating an index alias, creating an index template, creating date-stamped indexes, configuring rollover rules, and setting up scheduled tasks for rolling over indexes and deleting old ones.

PUT /%3Conline-realtime-risk-page-index-%7Bnow%2FH%7BYYYY.MM.dd%7C%2B08%3A00%7D%7D-1%3E
{
    "aliases": {
        "online-realtime-risk-page-index": { "is_write_index": true }
    }
}

POST /online-realtime-risk-page-index/_rollover
{
    "conditions": {
        "max_age": "1d",          // split the index by day
        "max_docs": 10000000,
        "max_size": "2gb"
    }
}

2.2.3.3 Page scheduling strategy

The offline and real-time page discovery strategies introduced in the previous two parts produce, respectively, an offline candidate page set and a real-time candidate page set. Based on these two sets, the final page scheduling strategy is described in detail below.

1. Data division, periodic scheduling

The offline data set is divided into multiple batches by [batch]; the real-time data set is divided into multiple batches by [window]. A [scheduled task] processes these partitions periodically. Divide a day into bn batches. If currentMinutes is the number of minutes into the day at which the [scheduled task] is currently running, then the [batch] of [offline data] to process now is batch = currentMinutes * bn / (24 * 60). Correspondingly, the [window start] of the [real-time data] to process now is windowStart = batch * (24 * 60) / bn. Considering the watermark of the real-time processing and the scheduling period of the task, currentMinutes is not taken strictly from the current time but from the current time minus 30 minutes.
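The batch/window arithmetic above, written out as a small sketch; the number of batches per day (bn = 288, i.e., one per 5-minute window) is an assumption for illustration, while the 30-minute offset follows the description.

from datetime import datetime, timedelta

BN = 288                       # batches per day, e.g. one per 5-minute window
MINUTES_PER_DAY = 24 * 60

def current_batch(now: datetime, delay_minutes: int = 30) -> int:
    # Derived from (now - 30 minutes) so that late real-time data covered by
    # the watermark has already been finalized before the batch is processed.
    t = now - timedelta(minutes=delay_minutes)
    current_minutes = t.hour * 60 + t.minute
    return current_minutes * BN // MINUTES_PER_DAY

def window_start_minutes(batch: int) -> int:
    # Start of the real-time window for this batch, in minutes since midnight.
    return batch * MINUTES_PER_DAY // BN

# Example: a task running at 12:10 processes batch 140, whose real-time window
# starts at minute 700 of the day, i.e. 11:40.
b = current_batch(datetime(2022, 6, 1, 12, 10))
print(b, window_start_minutes(b))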

2. Real-time priority, offline supplement

Following the design principles above (system resource limits, crawl pressure on a single Mini Program, and complementarity between the offline and real-time page sets), within a single scheduling cycle all real-time pages are dispatched for detection as long as the limit is not reached; if the real-time pages still do not reach the limit, offline pages are used to fill the gap, so that the system runs at full capacity at all times and resources are used fully and evenly. This scheduling approach also makes the real-time and offline strategies backups for each other: if one strategy fails, the system can still discover pages through the other and submit them for inspection rather than sitting idle. In addition, the pages in the offline data set are sorted by PV in descending order, so that offline pages with more clicks are detected with higher priority. A sketch of this dispatch logic follows below.
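The sketch assumes each page record is a dict carrying a pv field; the quota value and record shape are illustrative.

def dispatch_one_cycle(realtime_pages, offline_pages, cycle_quota):
    # Real-time pages are dispatched first; if they do not fill the cycle's
    # quota, offline pages (highest PV first) top it up.
    selected = list(realtime_pages)[:cycle_quota]
    remaining = cycle_quota - len(selected)
    if remaining > 0:
        offline_sorted = sorted(offline_pages, key=lambda p: p["pv"], reverse=True)
        selected.extend(offline_sorted[:remaining])
    return selected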


3. Page deduplication

Based on the PV distribution of online pages, and in order to maximize the PV coverage of inspections, the page scheduling strategy also adds logic to deduplicate high-PV pages within the same day and to deduplicate medium/low-PV pages within a specified period. Redis is chosen as the storage; the data structure and the shard calculation for page URLs are designed as follows:

Data structure

Set. The URLs of inspected pages are stored in Redis sets as 16-character MD5 digests. The data volume for a single day is too large for one set, so each day's data is split into multiple shards, one set per shard. If a day is divided into 100 shards, there are 100 sets per day.

Shard calculation for a URL (a minimal code sketch follows the diagram below)

1. Convert each character of the Mini Program page URL to an integer and sum them to obtain a number x

2. Take x mod the number of shards to obtain the shard key

Data structure diagram:

[Figure: Redis data structure, one set per day-shard]
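A minimal sketch of the deduplication check built on this structure; the key naming is an assumption, while the 100-shard split and the 16-character MD5 follow the description above.

import hashlib
import redis

NUM_SHARDS = 100
r = redis.Redis(host="localhost", port=6379)

def shard_of(url: str) -> int:
    # Sum the character codes of the URL, then take it modulo the shard count.
    return sum(ord(c) for c in url) % NUM_SHARDS

def seen_before(url: str, day: str) -> bool:
    # Store the inspected URL as a 16-character MD5 digest in the set for its
    # shard; SADD returns 0 when the member already exists (already dispatched).
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()[:16]
    key = f"inspect:dedup:{day}:{shard_of(url)}"
    return r.sadd(key, digest) == 0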

2.2.4 Benefits Review

In the course of the continuous evolution and optimization of the Smart Mini Program inspection platform, the platform's capabilities have improved substantially:

  1. The number of pages inspected per day now reaches tens of millions, and page coverage has improved greatly;

  2. A real-time inspection channel based on real-time data has been added, which greatly shortens the online exposure time of problem pages: problems are discovered at the minute level, and the triage and intervention pipeline has been shortened from days to hours;

  3. The strategy of supplementing real-time data with offline data makes full use of page-crawling and page-detection resources, detecting more pages with fewer resources.

Finally, a quality assurance system for Mini Programs has been established, helping us better discover and handle online problems, control online risks, reduce the proportion of low-quality content online, and keep the Mini Program ecosystem healthy.


3. Reflections and Outlook

This article has covered the business goals and background of inspection and, in particular, the evolution of the Mini Program inspection scheduling strategy, from which it is clear that online quality inspection requires continuous construction and refinement. As the business keeps developing, online resources will become richer, content more diverse, and the volume of page resources will keep growing, so we are bound to face even greater challenges. How to efficiently and accurately recall online risk problems from a massive number of page resources remains the guiding question for the inspection scheduling strategy. We will continue to explore and optimize, and steadfastly safeguard the growing Smart Mini Program business.

