Exploration and practice of Meituan’s multi-scenario modeling

This article describes the exploration and practice of Meituan's Daojia off-site delivery team in multi-scenario modeling. Against the business background of off-site delivery, it proposes adaptive scene knowledge transfer and scene aggregation techniques that address two problems caused by massive external traffic, namely the large number of scenes and the large differences between them, and that deliver a clear improvement in effectiveness. We hope it offers some inspiration or help.

  • 1 Introduction

  • 2 Adaptive scene modeling

    • 2.1 Adaptive scene knowledge transfer

    • 2.2 Adaptive scene aggregation

  • 3 Summary and outlook

1 Introduction

The Meituan Daojia Demand-Side Platform (hereinafter DSP) is mainly responsible for recommending and placing products or materials on media outside Meituan and for continuously optimizing conversion. As the business has grown, the external channels connected to the DSP have become more numerous, the display forms more diverse, and the differences between material display scenarios (open screen, interstitial, information flow, pop-up window, and so on) more pronounced.

For example, at lunch time users are more likely to click on a fast-food merchant's material shown in [a certain recommendation channel], [a certain App], and [open-screen display] than on a beer-and-barbecue merchant's material shown in [information flow display]. The differences between scenarios are essentially differences in user intent and needs, so the model must be customized for a growing number of scenarios to meet users' personalized needs in each of them.

The industry's classic Mixture-of-Experts architectures (MoE, such as MMoE, PLE, and STAR [1]) can adapt to users' personalized needs in different scenarios to some extent. This architecture uses a gating network to weight and combine the outputs of multiple experts into the final prediction. Early on, we built a multi-scenario modeling solution on the MoE architecture that divided scenarios by material recommendation channel. However, as the business grew, the differences between scenarios widened and the number of scenarios kept increasing. That version of the model struggled to keep up with the business and could not adequately solve the following two problems in the DSP context:

  1. Negative transfer : Take recommendation channels as an example. Because the traffic of different recommendation channels differs in user distribution, behavioral habits, material display forms, and so on, their exposure counts and click-through rates are not of the same order of magnitude (as shown in Figure 1 below, the click-through rates of different channels differ significantly), and the data shows a typical "long tail" pattern. If recommendation channels are used as the basis for multi-scenario modeling, the model tends to learn mostly from the head channels and under-learns the tail channels. At the same time, data from the tail channels introduces "noise" into the learning of the head channels, which leads to negative transfer.

  2. Sparse data and difficult convergence : The DSP displays materials on different external media. When users visit external media, the temporal and spatial background, the contextual information, the App, and the material display position together constitute the current scene. Defined this way, the number of scenes is on the order of 100,000 and the data in each scene is very sparse, so the model is difficult to train adequately on each scene.
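The MoE architecture mentioned above weights and combines expert outputs through a gating network. The following is a minimal NumPy sketch of that combination, not the production model: the single-layer ReLU experts, the softmax gate, and the names `moe_forward`, `expert_ws`, and `gate_w` are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, expert_ws, gate_w):
    """x: [batch, d_in]; expert_ws: [n_experts, d_in, d_out]; gate_w: [d_in, n_experts]."""
    # Every expert processes every sample (dense MoE).
    expert_out = np.einsum("bi,eio->beo", x, expert_ws)    # [batch, n_experts, d_out]
    expert_out = np.maximum(expert_out, 0.0)                # ReLU (assumed activation)
    # The gate produces one weight per expert for each sample.
    gate = softmax(x @ gate_w, axis=-1)                     # [batch, n_experts]
    # Final representation is the gate-weighted sum of expert outputs.
    mixed = np.einsum("be,beo->bo", gate, expert_out)       # [batch, d_out]
    return mixed, gate

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
mixed, gate = moe_forward(x, rng.normal(size=(3, 8, 5)), rng.normal(size=(8, 3)))
print(mixed.shape, gate.sum(axis=1))
```

Because every expert receives gradient from every sample in this dense form, head-channel data can dominate shared experts, which is one way to picture the negative transfer described above.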

Faced with this kind of modeling task, existing methods in the industry transfer knowledge between scenarios. For example, the SAML [2] model uses an auxiliary network to learn knowledge shared across scenes and transfers it to the scene-specific networks, while the ADIN [3] and SASS [4] models use gating units to select global information and fuse it into single-scene information. However, with the complex and changeable traffic in the DSP context, scene differences have sharply increased the number of scenes, and existing methods are not effective across such a large number of sparse scenes.

Therefore, in this article we propose an adaptive scene modeling solution for the DSP context (AdaScene, Adaptive Scenario Model) that approaches the problem from two perspectives: knowledge transfer and scene aggregation. AdaScene maximizes the use of information common to different scenarios by controlling the degree of knowledge transfer, and uses sparse expert aggregation, in which a gating network automatically selects experts to form the scene representation, to alleviate negative transfer. At the same time, we use the gradient of the loss function to guide scene aggregation, which constrains the huge recommendation scene space to a limited range and alleviates the data sparsity problem, yielding an adaptive scene modeling scheme.

9b6ce2889de0b026feefe478f82f10af.jpeg

Figure 1 Differences in scale of different channels

2 Adaptive scene modeling

Before going further, we first introduce how the multi-scenario model is built. It adopts the paradigm of an input-layer Embedding followed by a Mixture-of-Experts (MoE), where the input includes user-side, merchant-side, and scene context features. The loss of the multi-scenario model is aggregated from the loss of each scene, and its loss function takes the following form:
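A form consistent with this description and with the symbols defined just below is a weighted sum of per-scene losses. The per-scene loss symbol $\mathcal{L}_k$ and the subscript on $\alpha$ are notation assumed here.

```latex
\mathcal{L} = \sum_{k=1}^{K} \alpha_k \, \mathcal{L}_k
```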

Here, K is the number of scenes and α is the loss weight of each scene.

The AdaScene adaptive scene model we propose consists of two parts: the scene knowledge transfer module and the scene aggregation module. The model structure is shown in Figure 2 below. The scene knowledge transfer module adaptively controls the degree of knowledge sharing between scenes and uses a sparse expert network to automatically select k experts that form an adaptive scene representation. The scene aggregation module measures the similarity of loss-function gradients between all scenes offline in advance, and then guides scene aggregation by maximizing scene similarity.

ad081988e0f04bc6f56cbfd3351248be.jpeg

Figure 2 Schematic diagram of adaptive scene modeling AdaScene

Next, we introduce the modeling solutions of adaptive scene knowledge transfer and scene aggregation respectively.

2.1 Adaptive scene knowledge transfer

In multi-scenario modeling, the way scenes are defined determines the training samples of the scene experts and therefore strongly affects how well the model fits each scene. However, whichever definition is used, the user distributions of different scenes overlap and user behavior patterns are similar.

To better capture what different scenes have in common, we explored scene knowledge transfer along two dimensions: scene features and scene experts. Taking the multi-scenario model defined by material recommendation channel × App × display form as the base model, we built the Adaptive Knowledge Transfer Network (AKTN), shown in Figure 3 below. This model builds a knowledge transfer bridge between shared scene parameters and private parameters, which adaptively controls the degree of knowledge transfer and alleviates negative transfer.

f59fe29c3e6eaaa85a71db01e2be2df8.jpeg

Figure 3 AKTN (Adaptive Knowledge Transfer Network)

Scene feature adaptation re-weights the features in the input layer based on scene information, selecting the features the model cares most about in the current scene. Scene knowledge transfer operates in the hidden-layer expert network: it controls the flow of common information from the shared experts into the scene-specific information, so that knowledge common to the scenes is passed on.
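The two mechanisms just described can be pictured with a small NumPy sketch. The article does not give the exact formulas, so everything here is an assumption for illustration: the sigmoid gates, the additive fusion, and the names `scene_feature_adaptation`, `knowledge_transfer`, `w_scene`, and `w_gate` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scene_feature_adaptation(field_emb, scene_emb, w_scene):
    """Re-weight feature fields using the scene embedding.
    field_emb: [n_fields, d]; scene_emb: [d_s]; w_scene: [d_s, n_fields]."""
    field_weights = sigmoid(scene_emb @ w_scene)           # one weight in (0, 1) per field
    return field_emb * field_weights[:, None], field_weights

def knowledge_transfer(shared_h, scene_h, scene_emb, w_gate):
    """Gate how much shared-expert output flows into the scene-specific output.
    shared_h, scene_h: [d_h]; w_gate: [d_s, d_h]."""
    g = sigmoid(scene_emb @ w_gate)                        # per-dimension transfer gate
    return scene_h + g * shared_h, g

rng = np.random.default_rng(0)
field_emb = rng.normal(size=(6, 4))                        # 6 feature fields, embedding size 4
scene_emb = rng.normal(size=3)
adapted, field_weights = scene_feature_adaptation(field_emb, scene_emb, rng.normal(size=(3, 6)))
fused, gate = knowledge_transfer(rng.normal(size=5), rng.normal(size=5),
                                 scene_emb, rng.normal(size=(3, 5)))
print(field_weights.round(2), gate.round(2))
```

A gate value near 0 keeps a scene isolated from the shared expert, while a value near 1 passes the shared information through, which is the sense in which the degree of transfer is adaptive.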

These two knowledge transfer methods complement and reinforce each other, and together improve the prediction ability of the multi-scenario model. We compared the experimental effect of the different modules; the results are shown in Table 1 below. Introducing scene knowledge transfer and feature weight optimization brings improvements in both head and tail channels, and the improvement is more pronounced in the low-traffic tail scenarios (see sub-scenarios 2 and 3 in Table 1). This indicates that scene knowledge transfer alleviates negative transfer between scenes.

c295f5df2dbfa20b3b7b8faa593a244d.png

Table 1 AKTN experimental results

Related research and practice [6][7][8] have shown that sparse expert networks are very useful for improving computational efficiency and model quality. Therefore, building on the AKTN model, we further optimized the multi-scenario model at the expert level. Specifically, we replaced the scene knowledge transfer layer with automated sparse expert selection: a gating network selects, from a large pool of experts, those most relevant to the current scene to form an adaptive scene representation. The selection process is shown in Figure 4 below:

7f97e9d441a0fc1e9aa0055776a23232.jpeg

Figure 4 Schematic diagram of sparse expert network

In practice, we combine experts with a differentiable gating network to avoid negative transfer between unrelated tasks. At the same time, introducing a large-scale expert network expands the selection space of the multi-scenario model and better supports the gating network's selection. Given the massive traffic and complex scene characteristics involved, we explored sparse expert gating networks based on existing industry research.

Specifically, we practiced the following sparse gating methods:

  • Method 1 : Use KL divergence to measure the similarity between a sub-scenario and each expert, and select the k experts that best match the current scenario. In the implementation, similarity is computed over a two-dimensional scene × expert matrix, and the k most suitable experts are selected by KL divergence.

  • Method 2 : Each sub-scenario has its own expert selection gating network, so m scenes have m gating networks. Each scene's gating network contains k single-expert selectors [9], and each single-expert selector is responsible for selecting one of the n experts as an expert for the current scene (n is the number of experts). In practice, to improve training efficiency, we truncate the smaller weights in each single-expert selector so that it selects exactly one expert.
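The single-expert selectors of Method 2 can be sketched along the lines of the binary-encoding gate in [9]. This is a simplified NumPy illustration under stated assumptions, not the team's implementation: n is taken to be a power of two, the cubic smooth-step with width `gamma` follows the usual form of that gate, and the function names and the softmax mixing weights `alpha_logits` are chosen for the example.

```python
import numpy as np

def smooth_step(t, gamma=1.0):
    """Cubic smooth-step: 0 for t <= -gamma/2, 1 for t >= gamma/2, smooth in between."""
    t = np.asarray(t, dtype=float)
    mid = -2.0 / gamma**3 * t**3 + 3.0 / (2.0 * gamma) * t + 0.5
    return np.where(t <= -gamma / 2, 0.0, np.where(t >= gamma / 2, 1.0, mid))

def single_expert_selector(z_logits, n_experts, gamma=1.0):
    """Weights over n_experts = 2**m experts from m relaxed binary variables."""
    m = len(z_logits)
    assert n_experts == 2 ** m
    z = smooth_step(z_logits, gamma)                        # each z[j] lies in [0, 1]
    weights = np.ones(n_experts)
    for i in range(n_experts):
        for j in range(m):
            bit = (i >> j) & 1                              # j-th bit of expert index i
            weights[i] *= z[j] if bit else (1.0 - z[j])
    return weights                                          # sums to 1 by construction

def scene_gate(z_logits_per_selector, alpha_logits, n_experts, gamma=1.0):
    """Mix k single-expert selectors with softmax weights (one gate per scene)."""
    alpha = np.exp(alpha_logits - np.max(alpha_logits))
    alpha /= alpha.sum()
    selectors = np.stack([single_expert_selector(z, n_experts, gamma)
                          for z in z_logits_per_selector])  # [k, n_experts]
    return alpha @ selectors                                # [n_experts]

# Two saturated selectors over 16 experts pick experts 5 and 15.
gate = scene_gate(np.array([[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, 1.0, 1.0]]),
                  np.array([0.0, 0.0]), n_experts=16)
print(np.nonzero(gate)[0], gate[[5, 15]])
```

Once the binary variables saturate at 0 or 1, each selector becomes one-hot, so a gate with k selectors activates at most k experts while remaining differentiable during training.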

In the offline experiment, we used material recommendation channel × display form as the scene definition and tried the sparse gating methods above. The offline results are shown in Table 2 below:

a2886bafa1a591ad284ba9db7925c824.png

Table 2 Effect of sparse gating method

It can be seen that the expert aggregation method based on the soft sharing mechanism shares knowledge between scenarios more effectively, because scenarios share the same activated experts. Compared with common gating networks that rely mainly on truncation, its use of binary encoding lets it converge to the target number of experts without losing the information of the other expert networks. Its differentiability also makes training more stable under gradient-based optimization algorithms.

To verify whether the sparse gating network can distinguish different scenarios and capture the differences between them, we took the case of selecting k = 7 out of n = 16 experts and, on the validation set, visualized the utilization rate of each expert and the average weight of the selected experts in different scenarios (Figures 5 to 7). The results show that this method effectively selects different experts to represent each scene.

For example, in Figure 6, KP_1 is more likely to choose the 5th expert, while KP_2 is more likely to choose the 15th expert. In addition, there are obvious differences in the usage rate of each expert and the average weight of selecting experts in different scenarios, indicating that this method can capture the difference in traffic in subdivided scenarios and express it differentially.

feab063117c9b658fccb6a4443a60937.jpeg

Figure 5 Distribution of experts in different display formats under the same channel

ba45a7031ae116fd68f8a10125babea5.jpeg

Figure 6 Distribution of experts across channels for open-screen display

fdb1de17321bf5be6f7d23dd9e5b9272.jpeg

Figure 7 Distribution of experts across channels for information-flow display
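The two statistics behind these figures, the utilization rate of each expert and the average weight of selected experts per scene, could be computed from gate outputs roughly as follows. This is a sketch under assumptions: the function name `expert_usage_by_scene`, the threshold `eps` for counting an expert as selected, and the toy data are illustrative and not from the article.

```python
import numpy as np

def expert_usage_by_scene(gate_weights, scene_ids, eps=1e-8):
    """gate_weights: [n_samples, n_experts]; scene_ids: [n_samples] scene labels.
    Returns {scene: (utilization_rate, mean_selected_weight)}, each of shape [n_experts]."""
    stats = {}
    for scene in np.unique(scene_ids):
        w = gate_weights[scene_ids == scene]
        selected = w > eps                                  # expert counted as used by a sample
        utilization = selected.mean(axis=0)                 # share of samples using each expert
        counts = selected.sum(axis=0)
        # Average weight over the samples that actually selected the expert.
        mean_weight = np.where(counts > 0,
                               (w * selected).sum(axis=0) / np.maximum(counts, 1),
                               0.0)
        stats[scene] = (utilization, mean_weight)
    return stats

gate_weights = np.array([[0.7, 0.3, 0.0],
                         [0.5, 0.0, 0.5],
                         [0.0, 1.0, 0.0]])
scene_ids = np.array(["KP_1", "KP_1", "KP_2"])
stats = expert_usage_by_scene(gate_weights, scene_ids)
print(stats["KP_1"])
```

Plotting these per-scene vectors side by side gives the kind of comparison shown in the figures, where different scenes concentrate on different experts.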

Experiments show that, while each scenario is modeled through a large-scale expert network, the expert aggregation method based on the soft sharing mechanism shares knowledge between scenarios more effectively through the same activated experts. To further explore how the number of experts affects model performance, we designed several groups of comparative experiments based on Method 2, varying the number of experts and the top-k proportion. The results are shown in Table 3 below:

676f0c30fbf78666ab5596238a517de3.png

Table 3 Method 2 parameter adjustment experiment

The experimental data shows that a large-scale expert structure brings positive offline gains, and that as the proportion of selected experts increases (horizontal axis of Table 3), the overall performance of the model also trends upward.

2.2 Adaptive scene aggregation

Ideally, a request (traffic) can be viewed as an independent scenario. However, as mentioned in the introduction, as the DSP business continues to develop, different material display channels, formats, locations, etc. continue to increase. The data of each scene is very sparse, and we cannot effectively train each segmented scene.

Therefore, we need to cluster and merge various recommendation scenarios. We use the scene aggregation method to solve this problem. By measuring the similarity between all scenes and maximizing the similarity to guide scene aggregation, we solve the problem of difficult convergence caused by sparse data. Specifically, we express this problem as:

Therefore, based on the scene knowledge transfer model in Section 2.1, we added the scene aggregation part and proposed a scene aggregation model trained based on the Two-Stage strategy:

  • Stage 1 : Compute the similarity between scenes using a similarity measure, and find the optimal aggregation of scenes with the goal of maximizing the similarity within each group (for example, Scene 1 and Scene 4 can be aggregated into a scene group, SGA);

  • Stage 2 : Based on the scene aggregation method obtained in Stage 1, use cross-entropy loss as the objective function to minimize the cross-entropy loss in each scene.

Stage 2 is consistent with what is described in Section 2.1, so this section mainly explains Stage 1. We believe an effective scene aggregation method should respond adaptively to traffic trends, discover the intrinsic connections between scenes, and automatically adjust the aggregation to the current traffic characteristics. Our first idea was to start from rules, using manual prior knowledge as the basis for scene aggregation and iterating on recommendation channels, display forms, and the cross product of the two. However, this kind of aggregation depends on reliable manual experience and cannot quickly capture changes in massive traffic.

Therefore, we conducted related explorations into modeling methods of relationships between scenes. First, we evaluate the impact between scenes through representation transfer and combined training between scenes during offline training. However, this method has the problems of huge combination space, long training time, and low efficiency.

In multi-task research [10][11][12][13], using gradient information to model the relationship between tasks has proven effective. Similarly, in a multi-scenario model, the similarity between scenes can be modeled from the gradient of each scene's loss function. We therefore use a multi-expert network and automatically solve for the similarity between scenes from the gradient information. The model is illustrated in Figure 8 below:

9c1c7c210a9af0e5fef991539fedb9e9.jpeg

Figure 8 Schematic diagram of scene aggregation
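The idea of reading scene similarity off loss gradients can be illustrated with a small NumPy sketch. It assumes a shared logistic-regression layer standing in for the expert layer and uses cosine similarity between per-scene gradients; both choices, and the names `bce_grad` and `scene_gradient_similarity`, are assumptions for illustration rather than the article's exact method.

```python
import numpy as np

def bce_grad(w, x, y):
    """Gradient of mean binary cross-entropy of a logistic model w.r.t. shared weights w."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return x.T @ (p - y) / len(y)

def scene_gradient_similarity(w, scene_data):
    """Pairwise cosine similarity between per-scene loss gradients.
    scene_data: list of (x, y) pairs, one per scene."""
    grads = np.stack([bce_grad(w, x, y) for x, y in scene_data])
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.maximum(norms, 1e-12)                 # guard against zero gradients
    return unit @ unit.T

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = (x[:, 0] > 0).astype(float)
# Scenes 0 and 1 agree; scene 2 has the opposite click pattern.
scenes = [(x, y), (x, y), (x, 1.0 - y)]
sim = scene_gradient_similarity(np.zeros(4), scenes)
print(sim.round(2))
```

Scenes whose gradients point the same way help each other when trained together, while opposing gradients signal a risk of negative transfer, which is why gradient similarity is a plausible guide for aggregation.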

Based on the above ideas, we made the following attempts on the relationship modeling method between scenes.

1. Gradient Regulation

Based on the observation that gradient information can implicitly represent scene information, we add to the loss function a regularization term on the distance between the gradients of each scene's loss with respect to the expert layer. The overall loss function is shown below. The coefficient of this regularization term represents the relationship between the scenes and the gradient distance at the expert layer, and the distance between gradients is evaluated with a common gradient similarity measure.

ffe66d27f2b9bdce64ab7c6116dd5cf9.png

2. Lookahead Strategy

3. Meta Weights

The Lookahead Strategy explicitly models the relationship between scenes. However, because it computes the scene correlation coefficient from changes in the loss function, training is unstable and fluctuates widely, and it cannot solve for scene similarity the way the Gradient Regulation method does.
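A correlation coefficient based on changes in the loss function can be sketched in the spirit of the lookahead affinity used in the multi-task work cited as [11]: take one gradient step on scene i and measure the relative change in the loss of scene j. The article does not give its formula, so the definition below, the logistic model, the learning rate `lr`, and the function names are assumptions for illustration.

```python
import numpy as np

def bce_loss(w, x, y):
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    p = np.clip(p, 1e-7, 1 - 1e-7)                          # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def bce_grad(w, x, y):
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return x.T @ (p - y) / len(y)

def lookahead_affinity(w, scene_data, lr=0.1):
    """affinity[i, j] = 1 - L_j(w after one step on scene i) / L_j(w)."""
    m = len(scene_data)
    affinity = np.zeros((m, m))
    for i, (xi, yi) in enumerate(scene_data):
        w_look = w - lr * bce_grad(w, xi, yi)               # lookahead step on scene i only
        for j, (xj, yj) in enumerate(scene_data):
            affinity[i, j] = 1.0 - bce_loss(w_look, xj, yj) / bce_loss(w, xj, yj)
    return affinity

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = (x[:, 0] > 0).astype(float)
scenes = [(x, y), (x, y), (x, 1.0 - y)]                     # scene 2 conflicts with scenes 0 and 1
affinity = lookahead_affinity(np.zeros(4), scenes)
print(affinity.round(4))
```

A positive entry means a step on scene i lowers the loss of scene j. Because each entry depends on a single noisy step and a loss ratio, values computed this way on mini-batches can swing from step to step, which is consistent with the instability noted above.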

We used the multi-scenario model of recommended channels and display formats (screen open or not) as the base to explore the above three methods. In order to improve training efficiency, we made the following optimizations when designing the Stage 1 model:

We compared the GAUC of each method; the results are shown in Table 4 below. Compared with manual rules, gradient-based scene aggregation brings a significant improvement, indicating that the gradient of the loss function can, to a certain extent, represent the similarity between scenes and guide the aggregation of multiple scenes.

6160f34516f16665e9349c719b312a3c.png

Table 4 Scenario aggregation experimental data
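For readers unfamiliar with the metric, GAUC is commonly computed as an impression-weighted average of per-group AUC. The sketch below follows that common convention; the article does not state the grouping key or weighting it uses, so grouping by user, weighting by impression count, and skipping single-class groups are assumptions.

```python
import numpy as np

def auc(labels, scores):
    """AUC by pairwise comparison; ties count as 0.5. Returns None if only one class is present."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return None
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def gauc(group_ids, labels, scores):
    """Impression-weighted average of per-group AUC, skipping single-class groups."""
    total, weight = 0.0, 0
    for g in np.unique(group_ids):
        mask = group_ids == g
        a = auc(labels[mask], scores[mask])
        if a is not None:
            total += a * mask.sum()
            weight += mask.sum()
    return total / weight if weight else None

group_ids = np.array(["u1", "u1", "u1", "u2", "u2", "u3", "u3"])
labels = np.array([1, 0, 0, 1, 0, 1, 1])
scores = np.array([0.9, 0.2, 0.4, 0.3, 0.6, 0.5, 0.7])
print(gauc(group_ids, labels, scores))   # u1 AUC 1.0 (3 rows), u2 AUC 0.0 (2 rows), u3 skipped
```

Evaluating within groups removes the effect of score offsets between groups, which matters when scenes have click-through rates of very different magnitudes.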

In order to more comprehensively demonstrate the impact of scene aggregation on model prediction effects, we selected Meta Weights to perform tuning experiments on the number of groups. The specific experimental results are shown in Table 5 below. It can be found that as the number of groups increases, the GAUC increases, and the negative transfer effect between scenes weakens. However, when the number of groups exceeds a certain number, the overall similarity between scenes decreases, and GAUC shows a downward trend.

8a4668e3dcbb1d786d1226c836d94583.png

Table 5 Experimental data on the number of different aggregation scenarios

In addition, we visualized the relationships between some of the scenes learned by the Meta Weights method; the results are shown in Figure 9 below. With scenes on both axes, each square in the figure represents the similarity between a pair of scenes, and the depth of the color indicates the degree of similarity.

c80bcb7a11f4a4b9113bf637b72b166f.jpeg

Figure 9 Example of similarity among some subdivided scenes

The figure shows that, for scenes subdivided by channel and display form, this method learns the correlation between scenes. For example, the information flow under channel A (s16) has low correlation with the other scenes and is estimated as an independent scene, whereas the open-screen display under channel B (s9) and the open-screen display under channel C (s8) are highly correlated and are aggregated into one scene for estimation. The similarity matrix is also not symmetric, which shows that the influence between scenes differs by direction.
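Turning a learned similarity matrix into scene groups can be done in many ways, and the article does not specify its grouping procedure. As one hedged illustration, the sketch below uses greedy agglomerative merging: it averages the two directions of a possibly asymmetric matrix and repeatedly merges the pair of groups with the highest mean similarity until a target number of groups remains. The function name `aggregate_scenes` and the toy matrix are hypothetical.

```python
import numpy as np

def aggregate_scenes(sim, n_groups):
    """Greedy agglomerative grouping on a (possibly asymmetric) similarity matrix."""
    sym = (sim + sim.T) / 2.0                               # average the two directions
    groups = [[i] for i in range(sim.shape[0])]
    while len(groups) > n_groups:
        best, best_pair = -np.inf, None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Mean similarity between all members of the two candidate groups.
                score = sym[np.ix_(groups[a], groups[b])].mean()
                if score > best:
                    best, best_pair = score, (a, b)
        a, b = best_pair
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return [sorted(g) for g in groups]

# Toy asymmetric matrix: scenes 0 and 3 are strongly related, 1 and 2 moderately.
sim = np.array([[1.0, 0.1, 0.2, 0.9],
                [0.1, 1.0, 0.6, 0.1],
                [0.2, 0.5, 1.0, 0.2],
                [0.8, 0.1, 0.1, 1.0]])
print(aggregate_scenes(sim, n_groups=2))   # [[0, 3], [1, 2]]
```

The number of groups is a tuning knob here, mirroring the group-count experiment in Table 5: too few groups mixes dissimilar scenes, too many leaves each group with sparse data.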

3 Summary and outlook

Through our exploration and practice of multi-scenario learning, we have examined the modeling capabilities of recommendation models in different scenarios in depth, and have experimented and optimized along the directions of scene knowledge transfer and scene aggregation. These attempts help us better understand and explain the recommendation model's ability to cope with different types of traffic and scenarios. However, this is only the beginning of our multi-scenario learning research. In the future, we will explore and iterate in the following directions:

  • Better scene division method : The current multi-scenario division is mainly based on channels (channel * display form) as the traffic division method. In the future, more detailed exploration will be carried out in the dimensions of media, display position, media * time, etc.;

  • End-to-end traffic aggregation method : When performing traffic aggregation, the Two-Stage strategy is used for aggregation. However, this approach cannot fully utilize the relevant information in the traffic data. Therefore, there is a need to explore end-to-end traffic scenario aggregation solutions that will more directly and effectively improve the capabilities of recommendation models.

Combined with multi-scenario learning, new methods and technologies will be continuously explored in future research to improve the modeling capabilities of recommendation models for different scenarios and traffic types, creating better user experience and business value.

4 About the author

Wang Chi, Senjie, Shu Li, Wen Shuai, Yin Hua, Xiao Xiong, etc. are all from Meituan Daojia Business Group/Daojia R&D Platform/Chengdu R&D Center.

5 References

[1] STAR: Sheng, Xiang-Rong, et al. "One model to serve all: Star topology adaptive recommender for multi-domain ctr prediction." Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021.

[2] SAML: Chen, Yuting, et al. "Scenario-aware and Mutual-based approach for Multi-scenario Recommendation in E-Commerce." 2020 International Conference on Data Mining Workshops (ICDMW). IEEE, 2020.

[3] ADIN: Jiang, Yuchen, et al. "Adaptive Domain Interest Network for Multi-domain Recommendation." Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022.

[4] SASS: Zhang, Yuanliang, et al. "Scenario-Adaptive and Self-Supervised Model for Multi-Scenario Personalized Recommendation." Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022.

[5] Squeeze-and-Excitation: Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[6] Practice and exploration of contextualized intelligent traffic distribution for Meituan takeaway recommendations

[7] PaLM: https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html

[8] GLaM: https://proceedings.mlr.press/v162/du22c.html

[9] Single expert selector: https://arxiv.org/abs/2106.03760

[10] HOA: https://proceedings.mlr.press/v119/standley20a.html

[11] Gradient Affinity: https://proceedings.neurips.cc/paper/2021/hash/e77910ebb93b511588557806310f78f1-Abstract.html

[12] SRDML: https://dl.acm.org/doi/abs/10.1145/3534678.3539442

[13] Auto-Lambda: https://arxiv.org/abs/2202.03091

[14] MAML: https://arxiv.org/abs/1703.03400


Origin blog.csdn.net/MeituanTech/article/details/132893293