KDD Cup 2020 Multimodalities Recall Competition: Third-Place Solution and Its Application in the Advertising Business

ACM SIGKDD (the ACM SIGKDD Conference on Knowledge Discovery and Data Mining) is a top international conference in the field of data mining. This year, the KDD Cup set up four tracks with five competition tasks in total, covering debiasing (Debiasing), multimodal recall (Multimodalities Recall), automated graph learning (AutoGraph), adversarial learning, and reinforcement learning.

The Meituan search advertising algorithm team won the championship in the Debiasing track (1/1895) and in the AutoGraph track (1/149). In the Multimodalities Recall track, the Meituan Search and NLP team was the runner-up (2/1433), and the Meituan search advertising algorithm team took third place (3/1433).

This article introduces the third-place technical solution in the Multimodalities Recall track, as well as its application and practice in the Meituan search advertising business. We hope it offers some help or inspiration to readers engaged in related work.

Background

Building on its own business scenarios, the search advertising algorithm team of the Meituan in-store advertising platform has been continuously optimizing and innovating with cutting-edge technologies. The team has carried out algorithm research and application in three frontier areas (graph learning, data bias, and multimodal learning) and has achieved good business results.

Drawing on its experience in these three fields, the team chose three closely related tasks in KDD Cup 2020, hoping to apply and extend that experience and to bring further breakthroughs in technology and business. Huang Jianqiang, Hu Ke, Qi Yi, Qu Tan, Chen Mingjian, Zheng Bohang, and Lei Jun from the team, together with Tang Xingyuan of the Chinese Academy of Sciences, formed the team Aister and took part in the AutoGraph, Debiasing, and Multimodalities Recall tasks. The team won the championship in the AutoGraph track (1/149; see "KDD Cup 2020 AutoGraph competition champion technical solution and practice in Meituan advertising"), the championship in the Debiasing track (1/1895; see "KDD Cup 2020 Debiasing competition champion technical solution and practice in Meituan advertising"), and third place in the Multimodalities Recall track (3/1433).

Figure 1 KDD 2020 conference

Multimodal learning is essential for handling the entangled and complementary information that different modalities carry in nature and in daily life. As Internet interaction keeps evolving, multimodal content such as images, text, and video has become increasingly abundant. The same trend is reflected in Meituan's search advertising system.

The search advertising algorithm team has applied multimodal learning techniques with good business results, and won third place in the Multimodalities Recall track of this year's KDD Cup. This article introduces the technical solution for the Multimodalities Recall competition, as well as the team's application and research of multimodal learning in the advertising business. We hope it is helpful or inspiring for readers engaged in related research.

Figure 2 KDD Cup 2020 Multimodalities Recall competition TOP 10 list

Introduction and analysis of the competition task

Task overview

The Multimodalities Recall competition was initiated and organized by the Intelligent Computing Laboratory of Alibaba DAMO Academy, and focuses on multimodal information learning in the e-commerce industry. In 2019, global online e-commerce revenue reached 353 billion US dollars, and forecasts put total revenue at 654 billion US dollars by 2022. Revenue on this scale, growing this fast, reflects huge consumer demand for e-commerce services. Along with this growth, information in the e-commerce industry appears in more and more modalities, such as live streams and blogs. How to bring this multimodal information into traditional search engines and recommendation systems to better serve consumers deserves in-depth discussion by practitioners.

This track provides real data from Taobao in two parts. The first part is the search queries (Query), given as raw text. The second part relates to product images; for intellectual-property reasons, the organizers provide feature vectors extracted from the images with Faster RCNN rather than the images themselves. The two parts are organized as a query-based image recall problem, that is, a recall problem between the text modality and the image modality.

To aid understanding, the track provides a small number of real images with their corresponding raw data. Figure 3 shows a positive example. The Query is "Sweet French Dress". The main subject of the image is a woman in a sweet-style dress. Around the subject there is a lot of cluttered information, including a handbag, some balloons, trademarks, and promotional text. The task itself does not provide the original image, only the feature vectors that Faster RCNN extracts from it, one for each boxed region. On the one hand, Faster RCNN extracts the salient semantic content of the image, which helps model learning; on the other hand, it produces many boxes, and the boxes carry no indication of semantic priority. How to use these boxes to match the text is the core of the task.

The evaluation metric for this competition is NDCG@5. In the test set, each Query comes with about 30 candidate samples, of which about 6 are positive and the rest negative. Contestants design a matching algorithm that returns 5 candidates per Query. If all 5 are positive samples, the Query receives the full score; otherwise, the NDCG value computed from the recalled positive samples is the score of the Query. The final score is the average over all Queries.
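The metric described above can be sketched in a few lines. This is a minimal illustration that assumes binary relevance (1 for a positive sample, 0 for a negative one) and the standard log2 discount; the function names are hypothetical and the official scoring script may differ in detail.

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_labels, k=5):
    """NDCG@k for one Query.

    ranked_labels holds the 0/1 labels of all candidates of the Query,
    ordered by the model's predicted score (best first).
    """
    ideal_dcg = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # the Query has no positive candidate at all
    return dcg_at_k(ranked_labels, k) / ideal_dcg


def mean_ndcg(all_ranked_labels, k=5):
    """Final score: the average NDCG@k over all Queries."""
    return sum(ndcg_at_k(labels, k) for labels in all_ranked_labels) / len(all_ranked_labels)
```

With about 6 positives per Query, the ideal top 5 consists of positives only, so placing any 5 positives in the top 5 yields a score of 1 for that Query.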

Figure 3 Query and Product data example

Data analysis and understanding

This track provides three data sets: a training set, a validation set, and a test set. The basic information of each data set is as follows:

Table 1 Overview of the data set

To explore the data further, we put the original images provided with the validation set side by side with their feature information. The following table gives a set of examples.

Table 2 Positive and negative examples of matching between search phrases and pictures

Based on the above exploration, we summarized three important characteristics of the data set:

The training set differs substantially from the validation and test sets. The training set is far larger, with 3 million Query-Image pairs, more than one hundred times the size of the validation and test sets. In addition, every Query-Image pair in the training set is a positive sample, whereas in the validation set each Query comes with multiple positive and negative images. Visual inspection of the original images and Queries in the validation set shows that its data is of high quality and was most likely labeled manually. Given the cost of manual labeling and the absence of negative samples, the training set most likely records click relationships rather than manually labeled semantic matches. Our solution must account for this basic mismatch between the training distribution and the test distribution.
Image information is complex, and an image often contains multiple objects. Each object is boxed and given as a feature, but the boxes are not semantically equal: some are noise, such as the boxes for sunglasses, scarves, and cameras under the Query "men's high collar sweater", and some are repeated because of how the product is displayed, such as the repeated shoe boxes under the Query "breathable and comfortable children's shoes". On average, an image has 4 boxes. How to denoise and combine the semantic information in these boxes into an overall semantic representation of the image is a focus of modeling.
The Query, given as raw text, differs completely in structure and distribution from commonly used corpora. As the example table shows, a Query is not a natural sentence but a phrase formed by concatenating attributes and a product entity. According to our statistics, 90% of Queries consist of 3-4 words; the training set has about 1.5 million distinct Queries and a vocabulary of about 15,000 words; and grouping by the last word reduces all Queries to about 2,000 categories, each of which is a specific product noun. We need to take these characteristics of the text data into account and process it accordingly.

Challenges

This competition is a multimodal information matching task on e-commerce search data. Starting from the three characteristics of the data set above, we identified two main challenges:

First, the problem of inconsistent distributions. Classical statistical machine learning assumes that the training set and the test set follow the same distribution. When they do not, the model usually learns a biased solution, and performance on the training set is hard to carry over to the validation set. We have to rely on the click signals in the large training set and on the small validation set, which shares the test set's distribution, to design feasible data construction methods and training procedures, and to apply techniques such as transfer learning.

Second, the problem of complex multimodal information matching. How to fuse multimodal information is a basic problem in multimodal learning, and how to semantically match complex multimodal information is a challenge specific to this competition. In the data, product images have multiple boxes, a lot of information, and considerable noise, while user search Queries usually contain several fine-grained attribute words, each of which plays a role in semantic matching. The model design therefore has to handle the complexity of both the image and the Query and perform fine-grained matching.

The search advertising team's solution to these two challenges is described in detail below.

Competition solution

Our solution addresses the two challenges directly and has two main parts. First, it bridges the distributions of the training data and the test set through diversified negative sampling strategies combined with distillation learning, to deal with the inconsistent distributions. Second, it adopts a fine-grained text-image matching network for multimodal information fusion, to deal with complex multimodal matching. Finally, two-stage training and multi-model fusion further improve performance. The overall flow is shown in Figure 4, and each part is described below.

Figure 4 Multi-stage distillation learning framework based on diverse negative sampling

Diversified negative sampling strategies and pre-training

The training set and the test set are inconsistently distributed. The most obvious inconsistency is that the training set contains only positive samples and no negative samples. We need to design a negative sampling strategy to construct negative samples, and the sampled negatives should be as close as possible to the true distribution of the test set. The most intuitive idea is random sampling, which is simple to implement but produces samples quite different from the validation set.

Analysis of the validation set shows that the candidate images under the same Query are usually closely related semantically. For example, under the Query "Sweet French dress", the candidate images are all dresses that differ only in style. This means the task requires matching text and images at a finer attribute granularity. From the two perspectives of image labels and Query words, we can use clustering to narrow the sampling space from the whole data set to semantically similar items, so that negative sampling comes closer to the distribution of the test set.

Based on this analysis, we designed the four sampling strategies shown in Table 3 to construct the sample set. Of the four, the positive and negative samples obtained by random sampling are the easiest to tell apart, and those obtained by sampling on the last word of the Query are the hardest. In training, we first train the benchmark model on the simplest, randomly sampled set; then continue training from that model on the harder sample sets obtained by sampling on image labels and on Query clusters; and finally train on the hardest set, sampled by the last word of the Query. Training from easy to hard, and from far to near, helps the model converge toward the validation set distribution and achieves better results on the test set.

Table 3 Diversified negative sampling
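To make the easiest and the hardest of these strategies concrete, here is a minimal sketch. It assumes the training data is a list of positive (query, image_id) pairs; `build_samples`, the `strategy` values, and `num_neg` are hypothetical names introduced for illustration, and the two clustering-based strategies are left out.

```python
import random
from collections import defaultdict


def last_word(query):
    """The last word of a Query is treated as its product category."""
    return query.split()[-1]


def build_samples(pairs, strategy, num_neg=1, seed=0):
    """Turn positive (query, image_id) pairs into labeled (query, image_id, label) samples.

    strategy="random":    negatives are drawn from all images (easy).
    strategy="last_word": negatives are drawn from images clicked under
                          Queries that share the same last word (hard).
    """
    rng = random.Random(seed)
    positives = defaultdict(set)    # query -> images clicked for it
    by_category = defaultdict(set)  # last word -> images clicked for it
    for query, image in pairs:
        positives[query].add(image)
        by_category[last_word(query)].add(image)
    all_images = set().union(*positives.values())

    samples = []
    for query, image in pairs:
        samples.append((query, image, 1))
        pool = all_images if strategy == "random" else by_category[last_word(query)]
        # Never use an image that is a known positive for this Query.
        candidates = sorted(pool - positives[query])
        for negative in rng.sample(candidates, min(num_neg, len(candidates))):
            samples.append((query, negative, 0))
    return samples
```

Under the easy-to-hard schedule described above, a model would first be trained on `build_samples(pairs, "random")` and later continue training on `build_samples(pairs, "last_word")`.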

Distillation learning

Although the various sampling strategies approximate the true distribution of the test set from different angles, they are still insufficient because no test set information directly guides negative sampling. We therefore use distillation learning [2] to further refine the negative sampling logic and obtain a sample distribution closer to the test set.

As shown in Figure 5, after pre-training on the sample set built by negative sampling from the training set (step 1), we fine-tune the model on the validation set (step 2). We then use the fine-tuned model to produce pseudo-labels on the training set, treat them as soft labels, add them to the loss, and learn jointly with the original 0-1 hard labels (step 3). In this way, training on the training set directly incorporates distribution information from the validation set, brings the model closer to the validation set distribution, and improves the pre-trained model.

Figure 5 Multi-stage distillation learning
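The joint objective in step 3 can be sketched as follows. The article does not give the exact form of the loss, so this is only one plausible reading: a weighted sum of two cross-entropy terms, one against the 0-1 hard label and one against the soft label produced by the fine-tuned model. The weight `alpha` and the function names are assumptions.

```python
import math


def binary_cross_entropy(prob, target, eps=1e-7):
    """Cross-entropy between a predicted probability and a target in [0, 1]."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def distillation_loss(student_prob, hard_label, soft_label, alpha=0.5):
    """Joint loss on one sample.

    hard_label: 1 for a clicked pair, 0 for a sampled negative.
    soft_label: match probability predicted by the fine-tuned model (pseudo-label).
    alpha:      assumed mixing weight between the two terms.
    """
    hard_term = binary_cross_entropy(student_prob, hard_label)
    soft_term = binary_cross_entropy(student_prob, soft_label)
    return (1.0 - alpha) * hard_term + alpha * soft_term
```

For a clicked pair that the fine-tuned model considers a poor semantic match (a low soft label), the soft term pulls the prediction down, which is how validation-set information reaches training on the click data.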

Fine-grained matching network

Multimodal learning is developing rapidly, with new tasks and models appearing one after another. For the problem we face, matching complex images with search Queries, we designed the neural network shown in Figure 6 as our main model, with reference to the winning solution of the VQA competition at CVPR 2017 [1].

The design of the model mainly considered the following three points:

Use a gated fully connected network for semantic mapping. Images and Queries sit at different semantic levels and need to be mapped by functions into the same semantic space. We use two fully connected layers for this. In our experiments, the hidden size of these layers was a sensitive parameter: increasing it moderately improved the model noticeably without adding much computation. In addition, as described in the literature [1], gated fully connected layers further improve the semantic mapping network.
Adopt a bidirectional attention mechanism. Both the image and the Query are composed of finer-grained semantic subunits: an image may contain multiple boxes, each with its own semantic information, and a Query is split into multiple words, each of which also carries its own semantic information. This property of the data comes from the e-commerce search scenario. The model therefore has to consider matching between individual subunits. We use bidirectional attention, between each word and all boxes and between each box and all words, to capture the matching relationships and the importance of these subunits.
Use diversified multimodal fusion strategies. There are many ways to fuse multimodal information, and most of them come down to mathematical operators between the image vector and the Query vector. Because different fusion methods have different characteristics, combining several of them describes the matching relationship more comprehensively. We use three fusion methods, Kronecker product, vector concatenation, and self-attention, to fuse the image vector and the Query vector obtained after semantic space mapping and attention. The fused representation is fed into a fully connected network that outputs the probability of a match.
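The first two design points can be illustrated with a small, dependency-free sketch. The gated layer follows the gated tanh form described in [1], y = tanh(Wx + b) * sigmoid(W'x + b'). The attention uses plain dot-product scores with a softmax, which is an assumption; the article does not specify the scoring function, and all function names here are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_tanh(x, W, b, W_gate, b_gate):
    """Gated fully connected layer: tanh(Wx + b) * sigmoid(W_gate x + b_gate)."""
    hidden = [math.tanh(dot(row, x) + bias) for row, bias in zip(W, b)]
    gate = [1.0 / (1.0 + math.exp(-(dot(row, x) + bias))) for row, bias in zip(W_gate, b_gate)]
    return [h * g for h, g in zip(hidden, gate)]


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys):
    """For every query vector, return a softmax-weighted average of the key vectors."""
    dim = len(keys[0])
    summaries = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        summaries.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return summaries


def bidirectional_attention(words, boxes):
    """words and boxes are lists of vectors already mapped into the same space."""
    word_to_box = attend(words, boxes)  # for each word: which boxes matter
    box_to_word = attend(boxes, words)  # for each box: which words matter
    return word_to_box, box_to_word
```

In this sketch a noisy box, such as sunglasses under a sweater Query, receives a low weight from every word, so it contributes little to the attended image representation.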

In addition, we obtain the initial Query representation from word vectors pre-trained on the training set samples, rather than from popular pre-trained models such as BERT [4]. The main reason is that, as the data analysis shows, a Query is very different from an ordinary natural sentence and is more like a string of specific attribute and category nouns, which clearly differs from the corpora used to pre-train models such as BERT. In fact, our early attempts to introduce pre-trained word vectors such as GloVe [3] showed no obvious benefit over pre-training directly on the Query text. Considering also that BERT is relatively heavy and slows down iteration, we ultimately did not use such language models.

Figure 6 Fine-grained matching network
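Two of the three fusion operators named in the model design, the Kronecker product and vector concatenation, are simple enough to show directly. This is a minimal sketch on plain Python lists; the self-attention fusion is omitted, and joining the fused features into one flat vector is an assumption about how they reach the final fully connected layers.

```python
def kronecker_product(u, v):
    """All pairwise products u[i] * v[j], flattened row by row."""
    return [a * b for a in u for b in v]


def concatenate(u, v):
    """Keep both vectors unchanged, side by side."""
    return list(u) + list(v)


def fuse(image_vec, query_vec):
    """Hypothetical fused input for the final matching layers."""
    return kronecker_product(image_vec, query_vec) + concatenate(image_vec, query_vec)
```

The Kronecker product exposes every multiplicative interaction between image and Query dimensions, while concatenation preserves each modality's own features, which is why the two complement each other.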

Multi-model fusion

With the techniques above, we obtained multiple base models. Each can be fine-tuned on the validation set so that its behavior is closer to the real distribution. On the one hand, the fine-tuning stage can continue to use the neural matching network described above. On the other hand, that network can serve as a feature extractor whose outputs on the smaller validation set are fed to a tree model for retraining. The advantage is that the tree model and the neural network are heterogeneous, so they fuse better. The result we finally submitted is a fusion of multiple neural network models and tree models.

Evaluation results

We use a coarse-grained matching network (the image is represented as the average of all its boxes and the Query as the average of all its words) trained with random sampling as the benchmark model. The following table lists the improvement each part of our solution brings over the benchmark.

Table 4 NDCG improvement of different methods

Advertising business application

The search advertising algorithm team is responsible for search advertising and filter-list advertising on both the Meituan and Dianping platforms. The business covers catering, leisure and entertainment, beauty, hotels, and more, and this variety brings both great room and great challenges for algorithm optimization. In the creative optimization stage of search advertising, the goal is to select a high-quality image for each ad shown to a user, based on the current search terms or filtering intent. The user's search terms and the images differ greatly in dimensionality and granularity of expression. We use multimodal learning to solve this problem, mapping the cross-modal representations into the same space.

As shown in Figure 7, the multimodal network takes ad features, request features, user preferences, and image features as input. Image features are image vectors extracted by a CNN; the other features are crossed through a multi-layer MLP to obtain dense vector representations. Model training is then constrained by two loss functions, an image loss and a multimodal loss. With this modeling approach, the creative optimization model can present the most suitable image for each user's ad results according to the Query.

Figure 7 Multi-modal learning in advertising creative business

The search advertising system is divided into modules such as ad triggering, creative optimization, and click-through rate estimation (at ad granularity). In the creative optimization stage, each ad result has more than ten candidate images, so the online computation is more than ten times that of ad-granularity click-through rate estimation, which places higher demands on performance. Reducing model complexity to cut latency, however, inevitably reduces model accuracy.

To balance performance and accuracy, we borrowed the idea of knowledge distillation and made use of the ad-granularity prediction model, which has high expressive power. As shown in Figure 7 above, the model on the left is a complex ad-granularity click-through rate estimation model, which serves as the teacher network; the model on the right is a simple creative-granularity optimization model, which serves as the student network. In addition to the log loss on the student's own output logit, the student's objective includes the squared error between its logit and the teacher's output logit. This auxiliary loss pushes the student's output toward the teacher's, so the student can improve its accuracy while keeping a relatively small network.

In addition, the two models share the bottom Embedding layer, so the student's bottom-layer parameters are also trained through the teacher model. Beyond accuracy, consistency between modules (such as CTR estimation and creative optimization) is also key to improving the system as a whole. Joint teacher-student training of targets and representations helps unify the objectives of the multiple stages. With the accuracy gain and the consistency of multi-stage objectives, we achieved a significant improvement in online business results.
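The student objective described above, log loss plus a squared error between the two logits, can be written down for a single sample as follows. This is a minimal sketch; the `weight` on the auxiliary term is an assumption, since the article does not say how the two terms are balanced.

```python
import math


def log_loss_from_logit(logit, label):
    """Numerically stable binary log loss computed directly from a logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def student_loss(student_logit, teacher_logit, label, weight=1.0):
    """Log loss on the click label plus squared distance to the teacher's logit."""
    imitation = (student_logit - teacher_logit) ** 2
    return log_loss_from_logit(student_logit, label) + weight * imitation
```

When the student already reproduces the teacher's logit, the auxiliary term vanishes and only the ordinary log loss remains.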

Figure 8 Distillation learning in advertising creative business

Summary and outlook

The KDD Cup is closely connected with industry. Each year's tasks relate to hot topics and practical problems in industry, and the winning solutions over the years have had great influence there. For example, KDD Cup 2012 produced the prototypes of FFM (Field-aware Factorization Machine) and XGBoost, both of which are now widely used in industry.

This year's KDD Cup mainly focuses on automated graph representation learning and recommendation systems. Information in nature usually mixes multiple modalities, and the processing of multimodal information has been a major research topic in recent years. At the same time, multimodal information processing is becoming more and more important in industrial search engines and recommendation systems. Especially with the rise of live streaming, short video, and similar business forms, multimodal learning has become indispensable.

This article introduced the multimodal competition of KDD Cup 2020 and the solution of the Meituan search advertising algorithm team. After exploring the data thoroughly, we analyzed three characteristics of the competition data and identified two challenges: the inconsistent distributions of the training set and the test set, and the matching of complex multimodal information. We handled the inconsistent distributions through diversified negative sampling strategies, distillation learning, pre-training, and fine-tuning, and handled complex multimodal matching with a fine-grained matching network. Both approaches brought significant improvements.

This article also introduced the practical application of multimodal learning in the search advertising business, including joint learning of images and user preferences in the creative optimization model and the use of distillation learning in the creative model. Through the intense, rapid iteration of the competition, the team gained a deeper understanding of multimodal learning. In future work, we will build on this experience to analyze and model more multimodal business scenarios and to unlock the value of the data.

References

[1] Teney, Damien, et al. "Tips and tricks for visual question answering: Learnings from the 2017 challenge." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[2] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).

[3] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

[4] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[5] Zhou, Bolei, et al. "Simple baseline for visual question answering." arXiv preprint arXiv:1512.02167 (2015).

[6] Yu, Zhou, et al. "Deep modular co-attention networks for visual question answering." Proceedings of the IEEE conference on computer vision and pattern recognition. 2019.

About the Author

Qi Yi, Jianqiang, Hu Ke, Lei Jun, and others are all from the search advertising algorithm team of the Meituan advertising platform.

About Meituan AI

Meituan AI takes "helping people eat better and live better" as its core goal. It is committed to exploring cutting-edge artificial intelligence technology in real business scenarios and quickly applying it to real-life service scenarios, to digitize the offline economy.

Meituan AI grew out of the rich demands of Meituan's life service scenarios and has the distinctive advantage of scenario-driven technology. Building on business scenarios and rich data, it applies image recognition, voice interaction, natural language processing, and delivery scheduling technologies in real scenarios such as unmanned delivery, unmanned micro-warehouses, and smart stores, covering all aspects of people's lives. It uses technology to help users improve their quality of life, to make the industry more intelligent, and to build new infrastructure for life services across society.

For more information, please visit: https://ai.meituan.com/

