Zhihu's Practice of Building a DMP Platform Architecture Based on Apache Doris

Introduction: Zhihu built a DMP platform to meet its business requirements. This article explains in detail how the DMP works and how its architecture has evolved, and describes how Apache Doris is applied within the DMP platform. It should be very helpful for understanding how a DMP works. Enjoy the read.

Author: Hou Rong, R&D Leader of User Understanding & Data Empowerment

DMP business background

DMP platforms are a well-worn topic. Similar platforms appeared soon after the early advertising systems did, such as Tencent's Guangdiantong and Alibaba's Dharma, which are typical examples of mature DMP platforms in the industry. Zhihu built its own DMP platform partly because Zhihu runs its own on-site operations, and partly because a self-built DMP lets us integrate with internal systems and supports the development and customization of related business requirements.

The DMP business covers three aspects: business models, business scenarios, and business requirements.

Figure 1.1 DMP service

The design goal of the DMP platform is to find our core users and then run marketing operations such as advertising for them, so that core users and our content are better matched.

Business model

The DMP platform has three business models:

  • From off-site to on-site. A typical scenario: during ad delivery, an advertiser maps potential off-site audiences onto the site and has the on-site systems take over these user packages.
  • From on-site to off-site. First find targeted users within Zhihu, then advertise to these users on third-party platforms.
  • On-site operations. This includes content operations, user operations, and campaign operations. On the one hand it increases the exposure of relevant Zhihu content; on the other hand it locates users and precisely addresses the problems and needs of specific groups. We can also improve business results through campaign design.

Business scenarios

Based on these three business models, the main application scenarios are:

  • The feed. Take the recommendation scenario as an example: there are two kinds of requirements, targeted recommendation and targeted boosting. Targeted recommendation means pushing specific content to specific users; targeted boosting means re-scoring (boosting) the candidate content for the targeted users.
  • Real-time bidding on the advertising side. Once we know which crowd packages a user has hit, real-time bidding can be performed and the most suitable ads for that user can be selected by ranking.
  • Detail pages. Detail pages can show pop-up prompts: for example, after a user clicks into a particular detail page, if the user does not yet meet the target conditions, a pop-up window guides the user to meet them.
  • The campaign platform. Set target users for a campaign and show different campaign information to different target groups.
  • The reach system. For example, when sending push notifications, pop-ups, and SMS messages, we can select a specific type of user and then deliver the corresponding push and in-site messages to them.
  • Off-site delivery. Find the right user group and place appropriate ads for them off-site.

Business requirements

Based on the business models and scenarios, the work on the crowd side falls into three categories:

System integration

This generally covers three situations:

  • Which crowd packages a user has hit. Taking the advertising system as an example, crowd package IDs can be mapped to ads, i.e. which ads the user has hit.
  • Internal crowd packages. Internally, a crowd package defines whom to recommend or deliver content to.
  • External advertising. When we filter out a type of user to target off-site, we use an external crowd package. The difference between the two kinds of crowd packages lies in the crowd IDs: one uses the general on-site ID, the other uses the corresponding external IDs of the different delivery platforms.

Crowd targeting

Crowd targeting includes import/export, tag-based selection on certain features, crowd generalization, user volume estimation, and so on.

  • Crowd generalization: starting from a relatively small seed crowd package, find similar features based on rules, then expand the crowd by adjusting the confidence on those similar features.
  • User volume estimation: after selecting a batch of users, the size of that batch needs to be known immediately.

Crowd Insights

This includes profile insight into the users of a group, and comparative analysis between two different groups of people.

Business Process

Since the three scenarios of the current DMP service target different groups of people, different on-site and off-site systems are provided to complete the operations related to these groups.

Accordingly, we organize the crowd targeting functions, perform ID Mapping after acquiring the target users, and recover the users' performance on-site or off-site. We then take the target users, run composition analysis and comparative analysis, and conduct user insight. If the target is achieved, the delivery has succeeded; if not, the operations side makes an assumption: could adding a feature or a special operation further improve the business? Once the hypothesis is proposed, an A/B experiment is designed, and based on its results we adjust the target population. This is our operating process.

Figure 1.2 DMP business process

Closed loop of on-site operations

Crowd targeting. Through tag-based selection, select people who responded to past campaigns, or import people who like this kind of campaign, then generalize them to build the basic crowd package and determine the target group.

Delivery. Since many services on the recommendation side are connected to the feed, the reach system, the detail-page system, and the advertising engine, these systems and services can be used to deliver to the target users in different on-site traffic scenarios.

After delivery. Collect the performance of the delivery and analyze it. For example, if the operation is sending a push, we can analyze who clicked it, their reading time, and other behaviors, learn which users preferred this push, and thus obtain the typical characteristics of the target users.

If the clicks on this push reach the target, the job is done; if not, we make an assumption. For example, we initially predicted that more males than females would click the push, but the final result is the opposite. We then rank the differences with the TGI algorithm, find the characteristic features that distinguish the two groups, and design and run an A/B experiment.

Through the A/B experiment we compare the crowd packages before and after the adjustment and send the push again. If clicks increase, we keep iterating this cycle and eventually find the precise users for our operation scenario.

On-site to off-site delivery

Based on the accumulated user feature data, find the groups within Zhihu that are likely to perform well off-site and delineate their scope. Then, through ID Mapping, convert the on-site IDs into the IDs used by the third-party delivery platform and deliver to them.

Because the systems involved in this process are outside our site, we cannot directly obtain the corresponding event-tracking data to build the data link. We therefore have to download the tracking data from the third-party delivery platform, import it in a similar way, and then continue with the rest of the process. This also makes effect recovery for the whole flow take longer.

Off-site to on-site transfer

Suppose I am an advertiser outside Zhihu who wants to promote a toothpaste product but does not know much about Zhihu's users. From earlier operational research we can learn what people who have bought toothpaste look like. The crowd package obtained from that research can then be converted into Zhihu IDs through ID Mapping and imported to generate the target crowd. However, the toothpaste buyers the advertiser brings may overlap very little with Zhihu's users. In that case the second capability kicks in: crowd generalization.

Crowd generalization connects the small imported seed crowd with Zhihu and joins it with the full set of user features, and trains an AI model on all the features the users have. The model of the seed population, trained on the features of all Zhihu users, is then fed all users' features for inference. This yields target users with confidence scores.

If the advertiser believes, based on the earlier research, that the relevant target group on Zhihu is around 10 million, we can choose a confidence threshold for the target users. For example, a confidence of 0.7 yields 20 million users, and 0.8 yields 10 million. We then pick the 0.8 threshold, connect the result to the advertising engine, deliver, and analyze the effect.

From the operating process above, we can abstract the core functions of the DMP platform: insight, targeting, and ID Mapping.

Profile features

Figure 1.3 DMP profile features

Based on the user profiles above, we constructed the profile features. Tags are the most important part and make up the discrete features. The continuous part includes the user's dwell time and related user behaviors, such as what someone did in a certain place; these are continuous features. Before a feature has been turned into a tag, we refer to it collectively as an ordinary feature.

Function breakdown

Figure 1.4 DMP function breakdown

Expanding the DMP platform's functions to the right gives the business functions. Business functions serve operations, sales, and on-site application systems, and include crowd targeting, crowd insight, and the related ID Mapping. Expanding to the left is the large and important feature-access part.

Counting tags alone, the current DMP platform has 2.5 million of them, and 110 billion rows of user x tag data. Some tags also have real-time requirements from the business, so there is a lot to do during feature access.

Next, we introduce the specific functions.

  • Crowd targeting. This is generally divided into three functions: import/export, feature-based selection, and crowd generalization.
  • Crowd insight. This includes composition analysis and comparative analysis. Composition analysis can be simply understood as a pie chart or a bar chart; comparative analysis compares multiple populations.
  • ID Mapping. Overall, whether the source is oaid, idfa, or phone number, everything is mapped onto Zhihu's unified continuous ID, and this continuous ID is essentially strictly auto-incrementing.
  • Feature access
    • Construction methods are divided into real-time features and offline features.
    • Tag groups can be accessed offline or in real time. Among them, tree tags mainly handle complex scenarios, for example a user's reading of and interaction with topics organized in a multi-select tree structure.

DMP Architecture and Implementation

Figure 2.1 DMP service architecture

I see architecture as an important stage on the way to the end goal, but not a strictly necessary one. As long as we build out all the functions, we can complete all our business requirements; but the system then keeps expanding, maintenance costs keep rising, stability worsens, and eventually nobody can maintain it. What architecture mainly solves for us is how, across many complex business scenarios, to maintain and iterate at low cost and optimize specific modules in a targeted way; it does not solve the actual business function problems by itself.

Based on this understanding of architecture, I break down the business and the overall DMP architecture as follows:

DMP users

The DMP system serves three types of users:

  • Platforms, including the advertising platform, the feed, the advertising engine, and the reach system.
  • Operators, including business-related roles such as operations, delivery, and sales.
  • Internal products, such as feature-development tools and related internal products.

The front-end systems these three types of users connect to are also different.

First, the platforms and systems interface with the DMP's interface layer. There are three main types of interfaces:

The first type: systems such as the advertising engine and the feed frequently request the list of crowd packages a user has hit. In the advertising engine, once the request completes, the crowd package list can be turned directly into ad IDs and the bidding proceeds. The feed is similar: if the current user hits a content or domain tag we want to boost, we boost it. This interface is designed for high stability, high concurrency, and high throughput. Online data shows the difference in load between this interface and the others: it currently carries 100,000 QPS. Because it connects to the company's core systems, it must not jitter or fail, and its stability requirement is S-level; the interface therefore also has multi-node caching and high-concurrency designs so that it can meet the goals of high stability, high concurrency, and high throughput.

The second part is the on-site and off-site crowd packages. This part is similar to the above and is also connected to our core systems: once a crowd package can no longer select its crowd, overall marketing and targeted delivery are affected. For the DMP front end, there is a clear difference from the interface layer: the DMP front end mainly serves our internal operations and sales colleagues. If the DMP front end has an outage, new insights and new crowd targeting simply cannot be created, while the existing crowds keep working normally. Because this part is used by a large number of sales and operations staff rather than by a high-volume request interface, usage complexity must be minimized and operator training costs kept low; the DMP front end therefore needs to be simple to operate and cheap to use.

The third type interfaces with our internal systems, mainly to reduce our day-to-day development costs.

DMP core functions

The DMP supports the core business modules of crowd selection, generalization, and crowd insight; it also supports tag production, ID Mapping, compute task operation and maintenance, and storage.

DMP business modules

The DMP business modules are divided into an upper and a lower layer. The upper business layer aims to reduce the cost of adding new functions, with the focus on functional scalability; the lower layer must keep up with the growth of the population and of the business functions without requiring much additional development or technology investment, i.e. resource scalability.

DMP infrastructure

At the bottom is the infrastructure, whose stability must be guaranteed.

In terms of the underlying databases: the hit-list request interface is mainly backed by Redis; Doris mainly backs the DMP front end and the overall business functions; and the back office is mainly backed by MySQL and TiDB.

Some people ask whether the Redis cost becomes too high. It does not, because the core crowd-selection logic is implemented on Doris and the large volume of related tags is stored in Doris. Only when an ad needs to combine certain features of a target group and complete generalization do we materialize the result for a given crowd package ID, export it, and store it in Redis. Redis therefore mainly carries the high concurrency, and the amount actually stored in it is very small.

DMP Platform Function Inventory

The function inventory is divided into two parts: the business direction and the basic direction.

Figure 2.2 DMP platform function inventory: business direction

Business direction

On the business side, the DMP supports crowd targeting and crowd insight.

Crowd targeting:

  • Crowd estimation: multiple conditions such as gender, age, topics of interest, and the user's phone brand are combined, and the exact size of the resulting crowd must be estimated within 1 second.
  • Crowd selection: after the exact estimate, the estimated result can be converted into a crowd package for delivery and use within minutes.
  • Crowd package generalization: generalization should be as simple as possible. For example, after I select a historical crowd package, I can generalize it and pick a specific confidence level.

Crowd insight:

  • We can explore the profile of the users entering the current campaign and recover the traffic. For example, if I send a push to 1 million people and 30,000 of them click it, the traffic of those 30,000 can be recovered. Compared with the 1 million who received the push, the distinguishing characteristics of those 30,000 can be obtained, which helps us extract more accurate user groups in the future.

Basic direction

In addition, the DMP architecture has some basic functions, including feature construction, ID Mapping, and compute task operation and maintenance.

Figure 2.3 DMP platform function inventory: basic direction

These three basic functions not only let us complete real-time and batch computation quickly, they also help us roll new and old versions in and out. We currently find users through AI, data acquisition, and feature screening, and even the most basic feature such as gender is continuously being optimized, yet there is no way to quickly evaluate the operational impact of each optimization; we therefore need multi-version grayscale releases and progressive rollouts.

feature construction

A feature has two parts: atomic features and derived features.

  • When building atomic features, we need to produce a large number of features with the same baseline from offline or real-time data.
  • Derived features are produced on top of existing features. For example, if we consider a certain group to have high spending power, in a simple scenario we might select women aged 18-25 in first- and second-tier cities and treat this combination as a group with relatively high spending power for cosmetics. We then store this combination as a derived feature to speed up subsequent computation and reduce the cost of operational screening; a sketch of such a derived feature follows this list.
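As a minimal sketch (the table user_tag and its columns are hypothetical, not Zhihu's actual schema), a derived feature like this could be materialized from atomic tags with a single SQL statement:

```sql
-- Hypothetical flat tag table: user_tag(user_id BIGINT, tag_group VARCHAR, tag_value VARCHAR).
-- Materialize the derived feature "high cosmetics spending power"
-- from three atomic tags: gender, age band, city tier.
INSERT INTO user_tag
SELECT DISTINCT
       g.user_id,
       'derived_high_cosmetics_spend' AS tag_group,
       '1'                            AS tag_value
FROM user_tag g
JOIN user_tag a ON a.user_id = g.user_id
JOIN user_tag c ON c.user_id = g.user_id
WHERE g.tag_group = 'gender'    AND g.tag_value = 'female'
  AND a.tag_group = 'age_band'  AND a.tag_value = '18-25'
  AND c.tag_group = 'city_tier' AND c.tag_value IN ('tier_1', 'tier_2');
```

Because only the SELECT logic changes per feature, the same pattern can be wrapped in a management console and filled in by analysts or business operators.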

Feature construction achieves capability isolation, which improves the efficiency of building and launching features.

Mapping capability

This includes device ID mapping, user feature ID mapping, and generalized feature ID mapping. The main goal of this part is a unified ID: turning discontinuous IDs of various kinds into one continuous, unified, auto-incrementing int ID. A minimal sketch of the idea is shown below.
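The following sketch only illustrates the idea of assigning continuous IDs; the tables id_mapping and new_devices and their columns are hypothetical, and in practice this runs as the offline Mapping job described later rather than a single statement:

```sql
-- id_mapping(unified_id BIGINT, device_id VARCHAR): existing mappings.
-- new_devices(device_id VARCHAR): device IDs seen in today's data.
-- Existing devices keep their unified_id; only unseen devices get new,
-- strictly increasing IDs appended after the current maximum.
INSERT INTO id_mapping (unified_id, device_id)
SELECT mx.max_id + ROW_NUMBER() OVER (ORDER BY d.device_id) AS unified_id,
       d.device_id
FROM new_devices d
LEFT JOIN id_mapping m ON m.device_id = d.device_id
CROSS JOIN (SELECT COALESCE(MAX(unified_id), 0) AS max_id FROM id_mapping) mx
WHERE m.device_id IS NULL;
```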

Computing task operation and maintenance

Task operation and maintenance mainly covers DAG scheduling and compute resource management. If you have used Doris, you know that Doris executes every SQL statement as fast as it possibly can. Crowd estimation therefore needs proper queuing; otherwise, when a sudden wave of operational actions or a hot event arrives, several crowd packages may be estimated at once, occupying all resources and interfering with each other. Task operation and maintenance therefore prioritizes resources and runs the crowd selections one by one.

Summary

  • Feature construction achieves capability isolation, which improves the efficiency of building and launching features.
  • ID Mapping hides the difficulty and cost of ID mapping from us. We split the work into three parts: building atomic features, building derived features, and building infrastructure. Once the infrastructure engineers complete this shielding and isolation in the architecture, the feature-construction engineers no longer need to worry about ID Mapping and can focus purely on building features.
  • For compute task operation and maintenance, business developers do not need to know what happens at the bottom layer. One engineer encapsulates the bottom layer and exposes an interface to the upper layers, so the business side can use the underlying functions directly while the underlying complexity stays hidden. Through this abstraction and shielding, launch and construction efficiency improves significantly, and some work can shift from the R&D side to the operations side.

Example: we currently have two kinds of features. The first is atomic features: an atomic feature can be produced by writing a single SQL statement, so both analysts and business product managers can take part in building features. The second is derived features: the operations back office can merge and diff derived features, so some business operators can build derived features directly in the management console. In this way, the bulk of the workload gradually shifts from R&D to the product and business sides, which significantly improves the efficiency of launching new capabilities and features.

DMP core introduction

The core of the DMP has two aspects: data writing/importing, and fast querying/fast reading. Writing and importing belong to the data link and storage; fast querying and fast reading are introduced afterwards.

Feature data link and storage

Figure 3.1 Feature data link and storage

The first part of the write path is the offline link: it runs the relevant SQL against each business's Hive storage and produces a Tag table. After a Tag table lands on Hive, we run offline Mapping. This offline Mapping process asks the user-device core service to generate a unified, continuous user ID, and at the same time converts and uniquely binds data such as imei, idfa, and oaid. If a new user is found during this process, a new ID is generated; for an existing user, the existing ID is fetched. This process produces the ID mapping table, and further processing then yields the mapping between a user's unique ID and the mapped IDs. This is the first table we obtain.

After ID Mapping we collect the enumerations. There are currently 125 tag groups, consisting of 120 offline features and 5 real-time features. Once the development of these 125 data sets is complete, the corresponding atomic features can be taken out directly through Mapping. The reason for the enumeration and collection is that users need tag search: with 2.5 million tags, manual entry would be far too expensive, so we enumerate and collect the tags during offline and real-time processing and write them into Elasticsearch through Bulk Load. In this process, continuous auto-incrementing IDs are also generated to map the inverted list of user tags, i.e. the tag_map table, which is the second table we obtain. There is also a third table, the user behavior table; it is built in our real-time data warehouse, so it is not emphasized separately here.

Based on these three tables, we have formed three sets of storage:

  • The first is the tag search storage on Elasticsearch.
  • The second is on Doris, which is also the core storage.
  • The third is the storage of the overall ID Mapping.

With these three storages in place, various joins and queries can be performed, which is the basis for subsequent insight and crowd targeting.

Some orders of magnitude: user x tag is about 110 billion rows; ID Mapping is a wide table of about 850 million rows; Elasticsearch holds about 2.5 million entries. These orders of magnitude are also why we chose Elasticsearch and Doris.

Crowd targeting process

After the data above is imported, the three tables are formed. They are then used to generate crowd targeting and crowd packages.

Figure 3.2 Crowd targeting process

There are two kinds of crowd targeting processes:

  • The first is to filter crowd tags through the tag shopping cart, estimate the crowd, and finally write the selected crowd back to Redis.
  • The second is crowd generalization: the AI platform completes the model training and the crowd inference, the result is written back to Doris, and the crowd is then selected and labeled by confidence.

A brief description of the two processes:

Tag search. The front end fires a tag search event and completes the tag search: after finding the desired tags by trying various name combinations, we put the tags into the tag shopping cart one by one. Each time tags and combination conditions are added to the crowd shopping cart, the number of matching users is checked.

This loop exists because, in day-to-day operations, we estimate the size of every promotion or target group. If a campaign should really reach only about 2-3 million people but the selection estimates 50 million, our selection conditions are clearly not precise enough. In that case we keep adding more precise conditions and bring the selection down to an appropriate size before forming a crowd package, so this loop keeps running until an appropriate combination of tags/features is found. Once a suitable combination is found, we fix the target and the crowd for these tags, and a crowd package is generated. Generating the crowd package performs the table joins, correlating the original data and the ID Mapping table; if the package is to be exported off-site, the ID Mapping table is used to convert to off-site IDs. The exported crowd package ID and the crowd IDs are then written to Redis, and a notification is sent once the write finishes. A rough sketch of the export join is given below.
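As a rough sketch (table and column names are hypothetical), exporting an off-site crowd package is essentially a join between the selected users and the ID Mapping table:

```sql
-- crowd_package(package_id BIGINT, user_id BIGINT): users selected on site.
-- id_mapping(user_id BIGINT, idfa VARCHAR, oaid VARCHAR): the wide mapping table.
-- Convert the on-site user IDs of package 1001 into the device IDs
-- required by a third-party delivery platform.
SELECT m.idfa, m.oaid
FROM crowd_package p
JOIN id_mapping   m ON m.user_id = p.user_id
WHERE p.package_id = 1001;
```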

If a crowd package is only needed for push and SMS delivery, it does not have to be written to Redis, which frees a lot of storage; it is written to offline storage instead, on the one hand HDFS and on the other hand the object storage we use. Once these files are handed to the push system, the messaging system can take the crowd package directly and deliver the relevant push or in-site messages to that crowd in batches.

Crowd generalization. The generalization process may or may not start with uploading a crowd package. It mainly solves the case where we have the population of some historical campaign and need to generalize it. If that crowd package has clicked our push before, it can be filtered out directly. After the filtering, all user features are joined in for model training; once training completes, inference runs over all users on the site, producing a batch of crowd IDs with confidence scores, which are written back to Doris. During this process another step is started in parallel: filtering the generalization results on the user side, so an appropriate size can be chosen at an appropriate confidence level. A sketch of this confidence filtering is shown below.
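A minimal sketch of the confidence-based filtering step (the table generalization_result and its columns are hypothetical):

```sql
-- generalization_result(task_id BIGINT, user_id BIGINT, confidence DOUBLE):
-- the crowd IDs and scores written back to Doris by the AI platform.
-- Count how many users each confidence bucket would yield, so the operator
-- can pick the threshold that matches the expected audience size.
SELECT FLOOR(confidence * 10) / 10 AS confidence_bucket,
       COUNT(*)                    AS users
FROM generalization_result
WHERE task_id = 42
GROUP BY FLOOR(confidence * 10) / 10
ORDER BY confidence_bucket DESC;
```

In the earlier toothpaste example, this is how one would see that a 0.8 threshold yields roughly the expected 10 million users while 0.7 yields 20 million.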

A few common workflows: after development was completed, the core workflow was adding tags to the shopping cart and completing the selection, plus the traditional crowd generalization flow. After talking with the operations side, however, we found that in daily work they actually combine these flows and use them repeatedly. In practice it looks like this: take the users with historical effect and generalize them; after generalization, the reach of those user characteristics expands accordingly; then the operational feature tags are layered on top to finish the selection and put it to use.

The second pattern is insight and analysis after obtaining the historical effect: viewing the users' profiles and re-selecting according to the tag relationships, layering on a historically positive crowd package before generalization, applying the distribution conditions after generalization, and finally doing the selection and feeding the group into advertising and related delivery. The operations side builds many more complex combinations out of these atomic capabilities before using them.

Crowd targeting performance optimization

background

Figure 3.3 Background and difficulties of crowd targeting performance optimization

The current DMP system has two major functions: crowd targeting and crowd insight. Beneath both sits the underlying function of building the various user profile features. When we break this down, we find that crowd targeting is the pain point for the operations and business sides.

Scenario Requirements

  • Crowd estimation. In delivery and marketing scenarios the operations side has an expected audience size; they build a shopping cart of the corresponding scale and keep adding new features to it. They need to see how many people are selected right after each new feature is added, rather than waiting a long time every time.
  • Crowd selection, targeting hot events. The operations side follows various hot events in its daily work. When a hot event occurs, crowd packages must be selected quickly to publish pushes and recommendations. If the selection takes several minutes, the hot event is missed.

Difficulties

  • First, the data volume is huge, as marked in the figure above.
  • Second, the expected latency is very short: crowd estimation and crowd selection must complete within one second and one minute respectively.

Performance optimization: stage one

In the first stage of optimization, we addressed these two problems through the following points:

Figure 3.3 Stage one of crowd targeting performance optimization

Inverted index and query by condition

Figure 3.4 Inverted index and ID Mapping for crowd targeting performance optimization

  • First, for the inverted index, we changed the query conditions from the original and/or/not into bitmap intersection, union, and difference functions. At the same time, we broke continuous values up into discrete tags. For example, the user's age is an int greater than 0 and less than 100: filtering by numeric order is hard for the operations side to control and leads to unsatisfying selections, so we put an additional tag on the ordered ages, called the age band, such as 18-25, 0-18, and so on.
  • Then we converted the original and/or/not queries into inverted-index queries, and the table is sorted in the order of tag_group, tag_value_id, confidence bucket identifier, and bitmap (a table sketch is given below). On top of this we also perform ID Mapping; the core of ID Mapping during import is turning the user IDs into continuous auto-incrementing ones.
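A rough sketch of such a table in Doris (not the exact production DDL; column names, bucket counts, and the source table user_tag_flat are illustrative). The BITMAP column with BITMAP_UNION aggregation collapses all users of a tag value into one bitmap row:

```sql
CREATE TABLE user_tag_bitmap (
    tag_group    VARCHAR(64)  COMMENT 'tag group, e.g. gender, age_band',
    tag_value_id BIGINT       COMMENT 'enumerated tag value ID from tag_map',
    confidence   TINYINT      COMMENT 'confidence bucket identifier',
    user_bitmap  BITMAP BITMAP_UNION COMMENT 'bitmap over continuous user IDs'
) ENGINE = OLAP
AGGREGATE KEY (tag_group, tag_value_id, confidence)
DISTRIBUTED BY HASH (tag_group) BUCKETS 16;

-- Import: each continuous user ID becomes a one-element bitmap and is
-- merged into its (tag_group, tag_value_id, confidence) row on load.
INSERT INTO user_tag_bitmap
SELECT tag_group, tag_value_id, confidence, TO_BITMAP(user_id)
FROM user_tag_flat;
```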

Query logic changes

Figure 3.5 Query logic changes for crowd targeting performance optimization

The original query expressed and, or, and not in the WHERE clause. Now the query is rewritten so that these conditions become bitmap_and, bitmap_or, and bitmap_not: the business code takes the and/or/not logic that operators configure in the visual back office and turns it into function calls, which is equivalent to moving the WHERE conditions into the function and aggregation logic. A before/after sketch follows.
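A simplified before/after sketch for one targeting condition, "female AND interested in topic A or B AND NOT low-activity". The tag value IDs are illustrative, and the bitmap function names follow the ones used in this article; exact spellings can vary slightly between Doris versions.

```sql
-- Before: and/or/not in the WHERE clause, which turns into IN / NOT IN
-- sub-queries (or self-joins) against the flat 110-billion-row table.
SELECT COUNT(DISTINCT user_id)
FROM user_tag_flat
WHERE user_id IN     (SELECT user_id FROM user_tag_flat
                      WHERE tag_group = 'gender'   AND tag_value_id = 1)
  AND user_id IN     (SELECT user_id FROM user_tag_flat
                      WHERE tag_group = 'topic'    AND tag_value_id IN (2, 3))
  AND user_id NOT IN (SELECT user_id FROM user_tag_flat
                      WHERE tag_group = 'activity' AND tag_value_id = 4);

-- After: the same condition as intersection / union / difference of
-- per-tag bitmaps read from the inverted table.
SELECT bitmap_count(
         bitmap_and_not(bitmap_and(f.bm, t.bm), l.bm)
       ) AS estimated_users
FROM (SELECT bitmap_union(user_bitmap) AS bm FROM user_tag_bitmap
      WHERE tag_group = 'gender'   AND tag_value_id = 1)      f,
     (SELECT bitmap_union(user_bitmap) AS bm FROM user_tag_bitmap
      WHERE tag_group = 'topic'    AND tag_value_id IN (2, 3)) t,
     (SELECT bitmap_union(user_bitmap) AS bm FROM user_tag_bitmap
      WHERE tag_group = 'activity' AND tag_value_id = 4)      l;
```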

After this optimization, however, two problems remain:

The first is that a single bitmap is too large; the second is that the bitmaps are spatially dispersed. Together they make the network IO extremely high every time an intersection/union/difference aggregation is performed.

Doris uses brpc underneath. During data exchange, because each single bitmap is very large, the brpc transmission becomes congested, and sometimes bitmaps of hundreds of megabytes are exchanged. Computing intersections and differences on bitmaps of that size performs very poorly; we could barely select a crowd within 1 minute, and estimating a crowd within 1 second was impossible.

Performance optimization: stage two

Based on the remaining issues, we carried out a second stage of optimization.

Figure 3.6 Stage two of crowd targeting performance optimization

divide and conquer

The core idea of the second stage is divide and conquer. After the first stage went live, we found that crowd estimation was still at the minute level and selection took close to 10 minutes. The divide-and-conquer idea is to take all users of the whole site, identified by continuous auto-incrementing IDs, and split them into groups by range: for example, 0-1 million is one group, 1-2 million is the next, and so on. The intersection/union/difference over all users of the site is then equivalent to the sum of the per-group intersection/union/difference results.

Figure 3.7 Divide and conquer for crowd targeting performance optimization

Data Preset

Having found this rule, we can preset the data according to the grouping: using Doris's Colocate Group feature, the 1 million users of each group are all placed on the same physical machine, which avoids the network overhead. A DDL sketch follows.
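A rough DDL sketch of the preset (again illustrative, not the production schema): a group_id derived from the continuous user ID becomes part of the key and the distribution column, and the colocation property keeps every group's buckets on the same BE node:

```sql
-- group_id = floor(user_id / 1000000): each block of 1 million continuous
-- user IDs forms one group. Tables that share the same "colocate_with" group
-- and distribution key keep matching buckets on the same physical machine,
-- so per-group bitmap computation and joins need no network shuffle.
CREATE TABLE user_tag_bitmap_grouped (
    group_id     INT          COMMENT 'user_id range bucket (1M IDs per group)',
    tag_group    VARCHAR(64),
    tag_value_id BIGINT,
    confidence   TINYINT,
    user_bitmap  BITMAP BITMAP_UNION
) ENGINE = OLAP
AGGREGATE KEY (group_id, tag_group, tag_value_id, confidence)
DISTRIBUTED BY HASH (group_id) BUCKETS 32
PROPERTIES ("colocate_with" = "dmp_user_group");
```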

Operator optimization

Once each group's data sits on a single physical machine, the aggregation operators can also be simplified: the original nesting of bitmap_and_not inside bitmap_count is replaced with one function. Based on a newer version from the Doris team, after a combined function such as bitmap_and_not_count was added, performance improved significantly compared with the nested functions. A small example follows.
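For illustration, both forms below return the same count; the combined operator (available in newer Doris versions) avoids materializing the intermediate bitmap:

```sql
-- bitmap_from_string() builds small test bitmaps inline.
SELECT
    bitmap_count(bitmap_and_not(bitmap_from_string('1,2,3,4'),
                                bitmap_from_string('3,4,5')))   AS nested_form,
    bitmap_and_not_count(bitmap_from_string('1,2,3,4'),
                         bitmap_from_string('3,4,5'))           AS combined_form;
-- Both columns return 2 (elements 1 and 2).
```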

Solution

Based on the ideas above, we designed a new solution.

The new solution combines the three ideas above: changing the query logic, summing the sub-results for estimation, and merging the sub-results for crowd selection.

  • Since the original computation over a few huge bitmaps has been turned into computations over many group-level bitmaps, multi-threaded parallelism can be raised further and computation gets faster. The code is also optimized: composable bitmap_and/or/not functions are merged into the same function at submission time, and during writing the group ID is written together with the corresponding million-user group.
  • The corresponding tag table is written both offline and in real time. After the tag table is written, the users of each tag can be written to different physical machines: for example, 3 million users can be split across three different physical machines. This is configured with the Colocate Group and the group key. Once writing is done, the computation changes from one global computation to independent computations per group. The global bitmap is very large, whereas each group is computed independently on one physical machine, so the speed improves markedly.
  • After each group is computed, the results are merged. For crowd estimation, the merge becomes a simple sum of the numbers from the different physical machines, and the result basically comes back within seconds. For crowd selection, the merge takes the bitmaps from the different physical machines and shuffles them out for a final merge; this shuffle is very small, and the result is produced within 1 minute. A query sketch of this per-group computation and merge is shown below.
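Putting the three ideas together, a crowd-estimation query over the grouped, colocated table could look roughly like this (conditions and IDs are illustrative): each group computes its own bitmap result, and the final merge is just a SUM of per-group counts. For crowd selection, the per-group bitmaps would be merged or exported instead of counted.

```sql
SELECT SUM(cnt) AS estimated_users          -- final merge: a plain sum of ints
FROM (
    SELECT f.group_id,
           -- per-group computation; with colocation the heavy bitmap work stays local
           bitmap_and_not_count(bitmap_and(f.bm, t.bm),
                                COALESCE(l.bm, bitmap_empty())) AS cnt
    FROM (SELECT group_id, bitmap_union(user_bitmap) AS bm
          FROM user_tag_bitmap_grouped
          WHERE tag_group = 'gender'   AND tag_value_id = 1
          GROUP BY group_id) f
    JOIN (SELECT group_id, bitmap_union(user_bitmap) AS bm
          FROM user_tag_bitmap_grouped
          WHERE tag_group = 'topic'    AND tag_value_id IN (2, 3)
          GROUP BY group_id) t ON t.group_id = f.group_id
    LEFT JOIN (SELECT group_id, bitmap_union(user_bitmap) AS bm
          FROM user_tag_bitmap_grouped
          WHERE tag_group = 'activity' AND tag_value_id = 4
          GROUP BY group_id) l ON l.group_id = f.group_id
) per_group;
```

With colocation on group_id, each join and bitmap computation can be done on one machine, and only the small per-group counts (or bitmaps) travel over the network for the final merge.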

Optimization Results

The two screenshots below show the query plans before and after the change.

Figure 2.7 Crowd targeting performance optimization data preset

Before optimization: during the query, we first compute bitmap_and plus bitmap_not (or bitmap_or) for one tag; the other tags go through the same aggregation; after the aggregation there is a Shuffle, and finally a Join. Meanwhile the other branches also aggregate, and after their aggregation they Shuffle and Join as well.

In these aggregation steps, every tag is very expensive: it has to go through aggregation, network transfer, re-aggregation, and another network transfer before the join.

After optimization: the query plan changes significantly. The query only needs a single function during the merge step, and once the merge is done the final result merge can be completed. Whether it is adding up int counts or merging bitmaps, only the last layer remains, so the speed improves dramatically. Crowd estimation, which originally took minutes, now completes in a few hundred milliseconds; even with conditions as complex as thousands of clauses, it completes in about one second.

The crowd selection process is similar: with complex conditions it completes in one to two minutes; with only a few dozen to a hundred conditions, the selection completes in about a minute.

The whole approach mainly splits the data, presets the split data on physical machines in advance through Doris's Colocate principle, and, with this optimization, meets the operational requirements of most scenarios.

Future and Outlook

business direction

Figure 4.1 Future and outlook: business direction

As the red box in the figure shows, the current system flow performs Mapping after crowd targeting. User insight is built around the crowd, and the Mapping, insight, and crowd links are connected to each business side. In this flow, however, how the operations goal is achieved and how the A/B scheme is designed are only loosely coupled with the platform.

In the future, we hope the DMP operation platform will not remain a loosely coupled model but will implement a strongly coupled, strongly bound model with the business. Such an operating model will be more comfortable to use: the entire operation process can be completed on the DMP platform, and the relevant A/B experiments can be designed and continuously optimized there according to the operational results.

Technical direction

Figure 4.2 Future and outlook: technical direction

In the technical construction, the most important thing is crowd selection. The operations side may select with hundreds of conditions, and because these operators may belong to different businesses, their base conditions end up being written very similarly. For such similar base conditions, we will manually create the corresponding pre-merged bitmaps, and later selections will run on these pre-merged features; thanks to the pre-merging, subsequent execution becomes significantly faster.

The first point is query efficiency: regularly scan and parse (with a SQL Parser) all operational crowd-selection SQL; based on the analysis, automatically design pre-aggregations for the common aggregation conditions, synthesize the corresponding bitmaps, and register them as features. At selection time, the same SQL Parser automatically rewrites the selection SQL: if dozens of features are exactly equal to the result of a derived feature, they can be replaced directly with that derived feature. This can further speed up our selection queries. A sketch of such a pre-aggregation follows.
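A possible sketch of this planned pre-aggregation (purely illustrative; the derived tag_value_id and the conditions are made up): a frequently repeated base condition is pre-merged into a single derived bitmap tag that rewritten selection queries can read directly.

```sql
-- Pre-merge the base condition "female AND 18-25 AND tier-1/2 city" into a
-- derived tag (tag_group = 'derived', tag_value_id = 90001). A rewritten
-- selection query then reads this one bitmap instead of recomputing the
-- three-way intersection every time.
INSERT INTO user_tag_bitmap_grouped
SELECT g.group_id, 'derived', 90001, 0,
       bitmap_and(bitmap_and(g.bm, a.bm), c.bm)
FROM (SELECT group_id, bitmap_union(user_bitmap) AS bm
      FROM user_tag_bitmap_grouped
      WHERE tag_group = 'gender'    AND tag_value_id = 1       GROUP BY group_id) g
JOIN (SELECT group_id, bitmap_union(user_bitmap) AS bm
      FROM user_tag_bitmap_grouped
      WHERE tag_group = 'age_band'  AND tag_value_id = 3       GROUP BY group_id) a
  ON a.group_id = g.group_id
JOIN (SELECT group_id, bitmap_union(user_bitmap) AS bm
      FROM user_tag_bitmap_grouped
      WHERE tag_group = 'city_tier' AND tag_value_id IN (1, 2) GROUP BY group_id) c
  ON c.group_id = g.group_id;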

The second point is import speed. We import about 2 TB of data every day and, with roughly five days of data kept, store about 11 TB, which is a relatively large volume. We hope to speed up the import further. We know that in the industry Spark is used to write OLAP engine files directly, so we are also considering whether we can write Doris Tablet files directly with Spark and mount them through the FE, so that the import or write can complete quickly.

Q&A session

Q: How many tags does Zhihu's tag system have? How many records are there? Is the backend one large wide table or multiple tables? When tables are joined during crowd selection, can business staff see the characteristics and size of the selected crowd in real time?

A: Zhihu's tag system is very large, including tags for users, content, commerce, and governance and security. The DMP mainly uses the user tags. Counting only the certified, in-use tag groups, there are nearly 700; adding the uncertified business tags, the number reaches several thousand. Of the user-side tags we are actually using, there are 120 offline tag groups and 5 real-time tag groups, 125 in total.

In terms of records, there are 110 billion rows.

The backend is not one wide table. After the sub-tags are produced, independent source tables tag1, tag2, tag3, and so on are generated. After we write these tables into the DMP, they eventually become one large wide table: inside the DMP it is a single wide table, while on the business side each tag table is independent. At selection time we do not join multiple wide tables; after data processing, the data is written into one table rather than being joined across several wide tables.

Thanks to our optimization, the files stored in this table are no longer scattered by Tag ID, which made queries slow. Instead, we store by key: for example, IDs from 0 to 1 million are stored in the same place. During computation we scan them on the same physical machine and get the result right after the aggregation logic, so the size of the selected crowd can be returned in real time.

Q: Is tag-combination crowd selection based on experience? How is the effect analyzed after delivery? Is a standalone analytics platform used? How do we know the conversion rate of a crowd package? Do conversions go back into tags and get analyzed with another analytics platform?

A: Crowd selection has two parts. The first is selection based on our operational experience, which splits into two branches: selection of known crowds and selection of unknown crowds.

Known-crowd selection means operations is already very clear about the scenario: it is known that the user group being operated is of a certain gender, age range, and so on, and we select based on that historical experience. For completely unknown user characteristics, we select broadly from the whole market.

The difference between the two workflows is accuracy: selection of known user groups is more accurate, and with the known results we almost never need an A/B experiment to complete the delivery. For completely unknown user characteristics selected broadly, we must run a small-traffic A/B experiment. Once we find that the users who click the push share a certain interest, we accumulate experience from that interest, design a new A/B experiment, and keep adjusting the crowd characteristics toward the right scenario until the effect gradually reaches the goal; the unknown crowd then becomes a known crowd.

There is another kind of experience: the advertiser's. The advertiser may have no history on Zhihu but knows who has bought their products, for example the MD5-encrypted phone numbers or MD5-encrypted idfa values. The delivery results from other platforms can be imported to form a base crowd. Through crowd generalization, all on-site features are joined in to train a model, and the AI automatically finds the salient characteristics of the historical buyers and completes the generalized selection. The generalization-based selection then goes through the same loop a few times, after which we know which users should be targeted in this scenario.

We look at conversion rates in a separate place, which is something I want to integrate into the DMP platform later. The conversion rates of different pushes can be viewed on a separate page; on the DMP platform itself, they can only be viewed through effect recovery.

Q: Is the backend entirely based on Doris? How many nodes are in the cluster?

A: The main computation in the backend is based on Doris. We also rely on Redis for high throughput, and for TP workloads we use TiDB. The current Doris cluster has 6 BE nodes, each with 64 cores and 256 GB of memory, and 3 FE nodes with 16 cores and 32 GB each.

Q: Is crowd expansion (generalization) reliable? What proportion of all crowd selections does it account for, and what algorithm is used?

A: Crowd expansion is fairly reliable. Feedback from the operations side tells us that data obtained only from the advertiser or from historical operation results is basically not enough to support a delivery, but once all of our features are added and the model is trained, there is an obvious improvement almost every time; in terms of CTR it can reach 80%-90%. The confidence threshold is adjusted to 80%.

The share of crowd selections that use generalization is smaller than that of ordinary selection. For ordinary selection, the features we already have carry confidence scores, and based on them we can cover most of the operational work. Crowd generalization mainly solves the case where we know nothing about the customers and want to bring in a large random set of on-site users to detect the group's characteristics. That workflow is relatively heavy for the operations side, so generalization is chosen only in that specific case and its share is not large: for example, there are around 300 feature- and tag-based targetings per day, while algorithm-based generalization runs once or twice a day.

I have not looked into which algorithm is used. At present we pass the data to the relevant algorithms of the AI team: we pour all the user features into the automatically trained model, and after training completes we call the model again and feed in all the features for inference.

Q: How should the design work if tables A and B have to be joined to look up tags, with Redis involved? How do you maintain real-time performance?

A: If you look up tags by joining table A and table B, the data volume will explode; that problem does exist. So I suggest building the tags so that, ideally, all tags live in one table. Based on our exploration, our solution is that each physical machine may store more than 1 million users, but every 1-million-ID segment is guaranteed to be on the same physical machine, so the work becomes a scan on that machine plus direct computation after aggregation. There is therefore no two-table join problem; the aggregation happens directly inside the table. We do have several bitmap and/or/not style tag computations, but at the operator level they are already merged into the aggregation operator, and a final data merge is done after the aggregation. The performance is much better this way, and it avoids the blow-up that comes from joining tables A and B.

For the second question, we use this function to complete the ID aggregation of the crowd. When the function finishes, it produces the list of users under the current delivery feature, and the Join is done on that. At that point it is an ordinary join without an explosive row count, and it does not involve hundred-billion-row scans.

Q: Can you explain the 2.5 million tags in more detail?

A: As you can see in Figure 1.3, there are 2.5 million tags mainly because, for example, gender counts as one tag group but contains three tags: male, female, and other. Under the phone brand tag group we currently have nearly 20 phone brand tags. Then there is a considerable number of topic-interest tags in the topic-interest tag group: Zhihu has a great many topics, and some users may be interested in film and television, others in parenting, education, or student content, and each such topic carries a continuous interest value. We will introduce the continuous tags in later articles. As for the current user profile content, everything grouped under tags is discrete; the continuous ones are more like user behaviors or numeric operands.

Q: What is the relationship between tags and features? How are tags produced?

A: We define features as a superset of tags. Roughly 90% of our current features are tags, and the remaining 10% are user behaviors.

Join the community

We welcome more friends who love open source to join the Apache Doris community and take part in building it. Besides submitting PRs or issues on GitHub, you are also welcome to take an active part in the community's daily activities, for example:

Taking part in community writing activities to produce technical analyses and application-practice articles; speaking at online and offline Doris community events; and actively answering questions in the Doris community user groups.

Finally, more open source technology enthusiasts are welcome to join the Apache Doris community, grow together, and build a community ecosystem.


SelectDB is an open source technology company dedicated to providing the Apache Doris community with a team of full-time engineers, product managers, and support engineers, helping the open source community ecosystem prosper and building an international industry standard in the field of real-time analytical databases. SelectDB, a new generation of cloud-native real-time data warehouse built on Apache Doris, runs on multiple clouds and provides users and customers with out-of-the-box capabilities.

Related Links:

SelectDB official website:

https://selectdb.com

Apache Doris official website:

http://doris.apache.org

Apache Doris Github:

https://github.com/apache/doris

Apache Doris developer mailing group:

[email protected]

I want to contribute: https://jinshuju.net/f/nEPj5W
