KDD Cup 2020 Multimodalities Recall competition: runner-up solution and its application in Meituan search

 

ACM SIGKDD (ACM SIGKDD Conference on Knowledge Discovery and Data Mining) is the world's premier international conference in the field of data mining. This year's KDD Cup featured four tracks with a total of five competition tasks, covering data debiasing (Debiasing), multi-modal recall (Multimodalities Recall), automated graph learning (AutoGraph), adversarial learning, and reinforcement learning.

The Meituan search advertising algorithm team ultimately won the championship in the Debiasing track (1/1895) and also took first place in the AutoGraph track (1/149). In the Multimodalities Recall track, the Meituan Search and NLP team finished as runner-up (2/1433), and the Meituan search advertising algorithm team took third place (3/1433).

This article introduces the runner-up technical solution for the multi-modal recall track, as well as its application and practice in the Meituan search business, in the hope of offering some help or inspiration to practitioners working on related problems.

1. Background

Like other e-commerce companies, Meituan's business scenarios involve not only text but also information in multiple modalities such as images, animations, and videos. Meituan Search is a typical multi-modal search engine: its recall and ranking lists contain results in multiple modalities, including POIs, images, text, and videos, and ensuring the relevance between a Query and these multi-modal search results is a major challenge.

Given how closely the Multimodalities Recall task mirrors the challenges of Meituan's search business, the Meituan Search and NLP department formed a team to participate, aiming to sharpen its algorithmic fundamentals and accumulate relevant technical capabilities. The team proposed a multi-modal recall solution based on the fusion of ImageBERT and LXMERT and ultimately won second place (2/1433) (KDD Cup 2020 Recall leaderboard). This article describes the technical solution for the multi-modal recall task and the application of multi-modal technology in Meituan search scenarios.

The relevant code has been open sourced on GitHub:

https://github.com/zuokai/KDDCUP_2020_MultimodalitiesRecall_2nd_Place

Figure 1 KDD Cup 2020 Multimodalities Recall competition TOP 10 list

 

2. Introduction

In 2019, global retail e-commerce sales reached 3.53 trillion U.S. dollars, and e-retail revenue is projected to grow to 6.54 trillion U.S. dollars by 2022. Such rapid growth signals broad development prospects for the e-commerce industry, but it also means increasingly complex market and user needs. As the industry grows, the volume of related multi-modal data also grows, including all kinds of live-streamed product videos, lifestyle content presented as pictures or videos, and so on. New business and new data bring new challenges to e-commerce platforms.

Currently, most e-commerce and retail companies adopt various data analysis and mining algorithms to improve the performance of their search and recommendation systems. In this process, multi-modal semantic understanding is extremely important: a high-quality semantic understanding model helps the platform better understand consumers' needs, return products more relevant to user requests, and significantly improve service quality and user experience.

Against this background, this year's KDD Cup set up a multi-modal recall task (Modern E-Commerce Platform: Multimodalities Recall). The task requires participants to rank all product images in a candidate set according to the user's query and return the 5 most relevant product images. An example follows:

As shown in Figure 2 below, the Query entered by the user is:

leopard-print women's shoes

Based on its semantic information, the image on the left is relevant to this query, while the image on the right is not.

Figure 2 Schematic diagram of multi-modal matching

 

As the example shows, this is a typical multi-modal recall task that can be cast as a Text-Image Matching problem: a multi-modal recall model is trained to score the relevance of each Query-Image sample pair, and the candidates are then sorted by relevance score to determine the final recall list.

2.1 Competition data

The competition data comes from real user Queries and product data on the Taobao platform and consists of three parts: a training set (Train), a validation set (Val), and a test set (Test). Depending on the stage of the competition, the test set is split into testA and testB. The size of the data set, the included fields, and a data sample are shown in Table 1. The actual samples do not contain the images themselves; the example pictures are shown only for ease of reading.

Table 1 Details of the competition data set

 

In terms of data, the points to note are:

  • Each record in the training set (Train) is a relevant Query-Image sample pair, while in the validation set (Val) and test set (Test) each Query has multiple candidate images, and each record is a Query-Image sample pair whose relevance needs to be computed.

  • The organizers extracted multiple bounding boxes from all images with an object detection model (Faster R-CNN) and provided the corresponding 2048-dimensional features for each box, so image feature extraction does not need to be handled in the model.

2.2 Evaluation Index

In this competition, the Normalized Discounted Cumulative Gain over the top 5 recalled results (NDCG@5) was used as the evaluation metric for ranking the relevant results.
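To make the metric concrete, here is a minimal per-query sketch of binary-relevance NDCG@5. It is not the official evaluation script; in particular, the official IDCG may be normalized over all relevant images for the query rather than only the five returned ones.

```python
import numpy as np

def ndcg_at_5(ranked_relevance):
    """NDCG@5 for one query.

    ranked_relevance: 0/1 relevance labels of the top-5 returned images,
    in the order the model ranked them.
    """
    rels = np.asarray(ranked_relevance[:5], dtype=float)
    # DCG of the predicted ranking: gains discounted by log2(rank + 1).
    dcg = np.sum(rels / np.log2(np.arange(2, len(rels) + 2)))
    # Ideal DCG: the same labels rearranged into the best possible order.
    ideal = np.sort(rels)[::-1]
    idcg = np.sum(ideal / np.log2(np.arange(2, len(ideal) + 2)))
    return dcg / idcg if idcg > 0 else 0.0

# Example: only the 1st and 4th of the five returned images are relevant.
print(ndcg_at_5([1, 0, 0, 1, 0]))
```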

3. Classical solution

The competition problem can be cast as a Text-Image Matching task: score the similarity of each Query-Image sample pair, then rank each Query's candidate images by relevance to obtain the final result. There are usually two approaches to multi-modal matching:

  1. Map different modal data to different feature spaces, and then learn an unexplainable distance function through the hidden layer interaction with these features, as shown in Figure 3 (a).

  2. Map different modal data to the same feature space to calculate the interpretable distance (similarity) between different modal data, as shown in Figure 3 (b).

Figure 3 Commonly used multi-modal matching solutions

 

Generally speaking, under the same conditions, because combining image and text features provides the model's hidden layers with more cross-feature information, the model on the left performs better than the one on the right. Our subsequent algorithm design is therefore developed around the approach on the left side of Figure 3.

With the great success of Google's BERT model in natural language processing, more and more researchers in the multi-modal field have borrowed BERT's pre-training approach and developed BERT-style models that fuse other modalities such as images and videos, successfully applying them to tasks such as multi-modal retrieval, VQA, and image captioning. We therefore chose to use BERT-based multi-modal pre-training models (Vision-Language Pre-training, VLP) and to formulate the downstream image-text relevance task as a binary classification problem of whether the image and text match.

At present, the multi-modal VLP algorithm based on the Transformer model is mainly divided into two schools:

  • Single-stream model. In the single-stream model, text information and visual information are fused at the beginning and directly input into Encoder (Transformer) together. Typical single-stream models are ImageBERT [3], VisualBERT [9], VL-BERT [10], etc.

  • Dual-stream model. In the dual-stream model, text information and visual information first pass through two independent Encoder (Transformer) modules, and a cross-modal Transformer then fuses the information from the two modalities. Typical dual-stream models include LXMERT [4], ViLBERT [8], and so on.

4. Our approach: Transformer-Based Ensembled Models (TBEM)

In this competition, we selected the latest Transformer-based VLP algorithms to build the model backbone and added Text-Image Matching as the downstream task. In addition to building the models, we used data analysis to determine model parameters and construct the training data, and after training we applied result post-processing strategies to further improve performance. The overall algorithm flow is shown in Figure 4:

Figure 4 Algorithm flow chart

 

Each step of the algorithm is explained in detail below.

4.1 Data Analysis & Processing

Data analysis and processing are mainly based on the following three considerations:

  • Positive and negative sample construction: since the training data provided by the organizers contains only relevant Query-Image sample pairs, i.e. only positive samples, negative samples must be constructed through data analysis and carefully designed strategies.

  • Model parameter setting: parameters such as the maximum number of image bounding boxes and the maximum Query text length need to be set based on the distribution of the training data.

  • Post-processing of ranking results: the post-processing strategy is determined by analyzing the distribution of the images recalled for each Query.

The basic training data generation strategy is: for each batch, sample positive and negative examples at a 1:1 ratio. Positive samples are the original records in the training set (Train); negative samples are generated by replacing the Query field of a positive sample, with the replacement Query drawn from the training set (Train) according to a certain strategy.

To improve the model's learning, we performed hard negative mining when constructing negative samples: some negative samples were built so that their bounding boxes share a category label with the positive sample, producing positive and negative pairs that are more similar to each other and thereby improving the model's ability to discriminate between them.

The hard negative mining process is shown in Figure 5 below. Both the left and right relevant sample pairs contain the category label "shoes"; the Query of the right pair is used to replace the Query of the left image to construct a hard negative. Learning from such samples improves the model's ability to distinguish between descriptions of different kinds of "shoes".

Figure 5 Schematic diagram of the hard negative mining process
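As an illustration of this idea, below is a minimal sketch of hard-negative construction by shared bounding-box category labels. The field names (`query`, `class_labels`) and the sampling details are assumptions for illustration, not the official data schema or the exact competition strategy.

```python
import random
from collections import defaultdict

def build_label_index(train_samples):
    """Map each bounding-box category label to the training samples containing it."""
    index = defaultdict(list)
    for i, s in enumerate(train_samples):
        for label in set(s["class_labels"]):
            index[label].append(i)
    return index

def sample_hard_negative(idx, train_samples, label_index):
    """Build a negative pair by swapping in a query from a sample sharing a box label."""
    sample = train_samples[idx]
    candidates = [j for label in set(sample["class_labels"])
                  for j in label_index[label]
                  if train_samples[j]["query"] != sample["query"]]
    # In practice one would fall back to random query sampling if no candidate exists.
    j = random.choice(candidates)
    negative = dict(sample)                          # keep the original image features
    negative["query"] = train_samples[j]["query"]    # similar but non-matching query
    negative["label"] = 0                            # mark the pair as non-matching
    return negative
```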

Specifically, the negative sample construction strategy is shown in Table 2:

Table 2 Negative sample query extraction strategy

 

Secondly, by analyzing the distribution of the number of bounding boxes and the query length in the training data, the relevant model parameters are determined: the maximum number of bounding boxes per image is set to 10, and the maximum number of words in the query text to 20. The post-processing strategy is described in detail in Section 4.3.

4.2 Model construction and training

4.2.1 Model structure

Based on the above survey of existing methods in multi-modal retrieval, we selected the SOTA algorithm from each school, namely ImageBERT (single-stream) and LXMERT (dual-stream). Specifically, the two algorithms were improved for the competition task as follows:

The main improvements in the LXMERT model include:

  • The image feature input (Visual Feature) incorporates the text feature of each bounding box's category label.

  • For the Text-Image Matching task, a two-layer fully connected network classifies the fused image-text feature: the first fully connected layer is followed by GeLU [2] activation and then LayerNorm [1] normalization.

  • Use Cross Entropy Loss to train the network after the second fully connected layer.
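Below is a minimal PyTorch sketch of the matching head described in these bullets (Linear, GeLU, LayerNorm, Linear, Cross Entropy). The hidden size of 768 and the exact wiring into LXMERT's fused output are assumptions, not details confirmed by the solution.

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Two-layer classification head over the fused text-image feature."""

    def __init__(self, hidden_size=768):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.act = nn.GELU()                 # GeLU activation after the first layer
        self.norm = nn.LayerNorm(hidden_size)  # LayerNorm before the classifier
        self.fc2 = nn.Linear(hidden_size, 2)   # match / no-match logits
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, fused_feature, labels=None):
        x = self.norm(self.act(self.fc1(fused_feature)))
        logits = self.fc2(x)
        loss = self.loss_fn(logits, labels) if labels is not None else None
        return logits, loss
```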

The improved model structure is shown in Figure 6 below:

Figure 6 The structure of the LXMERT model used in the competition

The feature network is initialized with the pre-trained weights provided by LXMERT, available at: https://github.com/airsplay/lxmert .

ImageBERT : our solution uses two variants of the ImageBERT model, denoted ImageBERT A and ImageBERT B. Their modifications are described below.

ImageBERT A : The improvements based on the original ImageBERT have the following points.

  • Training task : image features and query words are not masked; only the relevance matching task is trained, without other tasks such as MLM.

  • Segment Embedding : all segment embeddings are set to 0; image features and Query text are not encoded separately.

  • Loss function : the Query-Image matching prediction is output at the [CLS] position, and the loss is computed with Cross Entropy Loss.

Following the above strategy, the model is initialized with BERT-Base weights and fine-tuned on that basis. The model structure is shown in Figure 7 below:

Figure 7 The structure of the ImageBERT model used in the competition

 

ImageBERT B : it differs from ImageBERT A in the handling of Position Embedding and Segment Embedding.

  • Position Embedding : the Position Embedding that encodes the image bounding box positions in ImageBERT is removed.

  • Segment Embedding : text tokens are encoded with segment id 0 and image features with segment id 1.

This variant is likewise initialized with BERT-Base weights and fine-tuned on that basis.
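The sketch below illustrates, under our reading of the two variants, how the segment (token-type) ids of the concatenated [text; image-box] input differ between ImageBERT A and ImageBERT B; the sequence layout is an assumption for illustration.

```python
import torch

def build_token_type_ids(num_text_tokens, num_boxes, variant="A"):
    """Segment ids for a concatenated [text tokens ; image box features] input."""
    if variant == "A":
        # ImageBERT A: a single segment id of 0 for both text and image features.
        return torch.zeros(num_text_tokens + num_boxes, dtype=torch.long)
    # ImageBERT B: text tokens get segment id 0, image box features get 1.
    # (Variant B also drops the position embedding of the image boxes,
    #  i.e. box coordinates are not encoded as positions.)
    return torch.cat([torch.zeros(num_text_tokens, dtype=torch.long),
                      torch.ones(num_boxes, dtype=torch.long)])

print(build_token_type_ids(20, 10, variant="B"))
```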

A common innovation across the three models is that the category label information of the image bounding boxes is introduced into the model input. The same idea appears in Microsoft's Oscar paper [7] from May 2020, although that paper's feature usage and loss function settings differ from our solution.

4.2.2 Model training

Training data is constructed with the data generation strategy of Section 4.1, and the three models above are trained separately. Their performance on the validation set (Val) is shown in Table 3.

Table 3 The effect of the model on the validation set (Val) after initial training

 

4.2.3 Fine-tuning the models with different loss functions

After the initial training, the models are further fine-tuned with different loss functions, mainly AMSoftmax Loss [5] and Multi-Similarity Loss [6].

  • Through weight normalization and feature normalization, AMSoftmax Loss reduces the intra-class distance while increasing the inter-class distance, thereby improving the model effect.

  • Multi-Similarity Loss casts deep metric learning as a sample-pair weighting problem. Through alternating sampling and weighting iterations it takes self-similarity, negative relative similarity, and positive relative similarity into account, which helps the model learn better features.
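For reference, here is a minimal PyTorch sketch of AMSoftmax Loss as described above: normalized features and class weights, an additive margin on the target-class logit, and a scale factor. The scale s and margin m values are illustrative defaults, not the values used in the competition; Multi-Similarity Loss is omitted and is described in [6].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax over L2-normalized features and class weights."""

    def __init__(self, feat_dim, num_classes=2, s=30.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        # Cosine similarity between normalized features and normalized class weights.
        cos = F.linear(F.normalize(features), F.normalize(self.weight))
        # Subtract the margin m only from the target-class logit.
        margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), self.m)
        # Scale by s and apply the usual cross-entropy.
        return F.cross_entropy(self.s * (cos - margin), labels)
```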

The specific strategies adopted in our solution are as follows:

  • For LXMERT, add Multi-Similarity Loss after the feature network to form a multi-task learning network with Cross Entropy Loss to fine-tune the model.

  • For ImageBERT A, use AMSoftmax Loss instead of Cross Entropy Loss.

  • For ImageBERT B, the loss function processing method is the same as LXMERT.

After fine-tuning, the performance of each model on the validation set (Val) is shown in Table 4.

Table 4 Effect of loss-function fine-tuning on the validation set (Val)

4.2.4 Model fine-tuning through data oversampling

To further improve model performance, this solution oversamples the training set (Train) based on the similarity between Query fields in the training set (Train) and in the test set (testB). The sampling rules are as follows:

  • For training samples whose Query also appears in the test set (testB), or whose Query has a containment relationship with some test set (testB) Query, oversample in inverse proportion to how often that Query appears in the training set (Train).

  • For training samples whose Query does not appear in the test set (testB), for each test set (testB) query take the top 10 training set (Train) samples ranked by the number of overlapping query words, and oversample those samples 50 times.
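A rough sketch of how such oversampling factors could be computed is shown below; the constants, the inverse-proportion formula, and the simplified handling of the top-10 word-overlap rule are assumptions for illustration, not the exact competition settings.

```python
from collections import Counter

def oversample_factors(train_queries, test_queries, base=50):
    """Illustrative per-query repetition factors for oversampling."""
    freq = Counter(train_queries)
    test_set = set(test_queries)
    factors = {}
    for q in freq:
        if q in test_set or any(q in t or t in q for t in test_set):
            # Rule 1: queries seen in (or contained in) testB queries are
            # repeated inversely to their training frequency.
            factors[q] = max(1, round(base / freq[q]))
        else:
            # Rule 2 (simplified): queries with word overlap against some testB
            # query get a fixed 50x boost; the actual rule keeps only the
            # top-10 overlapping training samples per test query.
            overlap = max(len(set(q.split()) & set(t.split())) for t in test_set)
            factors[q] = base if overlap > 0 else 1
    return factors
```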

After data oversampling, fine-tune the above three models according to the following schemes:

  • For the LXMERT model, the training samples obtained from oversampling are used to further fine-tune the LXMERT model.

  • For the ImageBERT A model, this solution selects samples from the training set (Train) where the words in the query and the test set (Test) Query overlap to further fine-tune the model.

  • For the ImageBERT B model: the training set (Train) contains Query expressions that share the same meaning but differ in word order, such as "sporty men's high-top shoes" and "high-top sporty men's shoes". To improve the model's robustness to this, the Query words are randomly shuffled with a certain probability before further fine-tuning ImageBERT B.
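A minimal sketch of the word-shuffling augmentation mentioned in the last item; the shuffle probability is an assumption.

```python
import random

def maybe_shuffle_query(query, p=0.5, rng=random):
    """Randomly permute the words of a query with probability p,
    so that matching the same image does not depend on word order."""
    words = query.split()
    if rng.random() < p:
        rng.shuffle(words)
    return " ".join(words)

print(maybe_shuffle_query("sporty men's high-top shoes"))
```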

The effects of each model on the validation set (Val) after training are shown in Table 5:

Table 5 Effect on the validation set (Val) after oversampling

 

To make full use of all labeled data, this solution further fine-tunes the models on the validation set (Val). To avoid over-fitting, in the final submission this step was applied only to the ImageBERT A model.

In the phase of predicting the relevance of Query-Image sample pairs, this solution analyzed the phrases contained in the test set (testB) Queries and found that the phrase "sen department" appears frequently in the test set (testB) but never in the training set (Train), whereas the synonymous phrase "forest style" does appear there. To prevent this pair of synonymous phrases from hurting the predictions, we replaced "sen department" with "forest style" in the test set (testB) Queries and used ImageBERT A to predict relevance on the modified test set; the result is recorded as ImageBERT A'.

4.3 Model fusion and post-processing

After the model construction, training, and prediction described above, this solution obtains 4 files of sample-pair relevance scores. The predictions are then ensembled and post-processed according to certain strategies to obtain the ranked set of candidate Images for each Query. The specific steps are as follows:

(1) In the Ensemble stage, this solution takes a weighted sum of the relevance scores produced by the different models as the final relevance score of each sample pair. The weights for LXMERT, ImageBERT A, ImageBERT B, and ImageBERT A' are 0.3:0.2:0.3:0.2 respectively, determined by grid search: the weight of each of the 4 models is varied from 0 to 1, the combination that performs best on the validation set is selected, and the weights are normalized to give the final weights.
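A minimal sketch of this grid search over ensemble weights; the step size and the evaluation callback (e.g. NDCG@5 on the validation set) are assumptions.

```python
import itertools
import numpy as np

def grid_search_weights(model_scores, eval_fn, step=0.1):
    """Search weight combinations for the models on the validation set.

    model_scores: list of arrays of per-pair relevance scores (one per model).
    eval_fn: callback computing the validation metric from a fused score array.
    """
    best_w, best_metric = None, -1.0
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for w in itertools.product(grid, repeat=len(model_scores)):
        if sum(w) == 0:
            continue
        w = np.array(w) / sum(w)                      # normalize weights to sum to 1
        fused = sum(wi * s for wi, s in zip(w, model_scores))
        metric = eval_fn(fused)
        if metric > best_metric:
            best_w, best_metric = w, metric
    return best_w, best_metric
```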

(2) After the relevance scores of all Query-Image sample pairs are obtained, the multiple candidate images of each Query are sorted. In the validation set (Val) and test set (testB), some images appear in the candidate sets of multiple Queries; this solution processes such samples further:

a. Since the same Image usually corresponds to only one Query, an Image is treated as relevant only to the Query for which it has the highest relevance score. Applying this strategy to post-process the results of the ImageBERT B model on the validation set (Val) increased its NDCG@5 from 0.7098 to 0.7486.

b. The multiple Queries corresponding to the same Image often differ only slightly and are semantically close, so the trained model discriminates such samples poorly, and the resulting poorly separated relevance scores lower NDCG@5 to some extent. To address this, we adopted the following operations:

  • If the difference between the relevance scores of the Top1 and Top2 images for the same Query is greater than a certain threshold, only the Query-Image sample pair corresponding to the Top1 image is retained when calculating NDCG@5, and the other sample pairs are deleted.

  • Conversely, if the difference between the relevance scores of the Top1 and Top2 images is less than or equal to the threshold, all sample pairs containing that Image are deleted when calculating NDCG@5.

Applying this strategy to post-process the validation set (Val) results of ImageBERT B with a threshold of 0.92 increased the model's NDCG@5 from 0.7098 to 0.8352.

Strategy b clearly brings a significant improvement, so this solution applies strategy b to the ensembled relevance scores on the test set (testB) to obtain the final relevance ranking.
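For illustration, below is a minimal sketch of post-processing strategy a (keep each Image only for the Query where it scores highest). The threshold-based strategy b is omitted here, since its exact rule depends on the chosen threshold.

```python
from collections import defaultdict

def keep_best_query_per_image(pairs):
    """Strategy (a): if an image is a candidate for several queries,
    keep it only for the query with the highest relevance score.

    pairs: iterable of (query_id, image_id, score) tuples.
    """
    best = {}
    for q, img, s in pairs:
        if img not in best or s > best[img][1]:
            best[img] = (q, s)
    kept = defaultdict(list)
    for img, (q, s) in best.items():
        kept[q].append((img, s))
    # Rank the surviving candidates of each query by score, highest first.
    return {q: sorted(v, key=lambda x: -x[1]) for q, v in kept.items()}
```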

5. Application of multi-modal search in Meituan

As mentioned earlier, Meituan Search is a typical multi-modal search scenario, and multi-modal capabilities have already been deployed in multiple search scenarios. Before introducing the specific deployment scenarios, here is a brief overview of Meituan Search's overall architecture, which is divided into five layers: the data layer, the recall layer, the fine ranking layer, the small-model re-ranking layer, and the final result display layer. The deployment of multi-modality in search is described below following this five-layer structure.

Data layer

Multi-modal representation : a parallel corpus is built from Meituan's massive text and image/video data, and an ImageBERT model is pre-trained on it. The trained model extracts vectorized representations of text and images/videos to serve downstream recall/ranking tasks.

Multi-modal fusion : in the multi-class classification of image/video data, the associated text is introduced to improve the accuracy of classification labels, which serve downstream image/video label-based recall and the selection of display images according to the search query.

Recall layer

Multi-modal representation & fusion : in multi-channel recall scenarios such as content search, video search, and full-text search, image/video classification-label recall and image/video vector recall are introduced to enrich the recall results and improve their relevance.

Fine ranking layer & small-model re-ranking layer

Multi-modal representation & fusion : the ranking model incorporates image/video Embedding features, as well as relevance features between the search query and the displayed images/videos and between the search results and the displayed images/videos, to improve ranking quality.

Display layer

Multi-modal fusion : in the image/video selection stage, relevance information between the images/videos, the Query, and the search results is introduced, so that display images are chosen according to both the search query and the search results, improving the user experience.

Figure 8 Deployment scenarios of multi-modal search in Meituan

 

6. Summary

In this competition, we built multi-modal recall models based on ImageBERT and LXMERT and improved their performance through data preprocessing, result fusion, and post-processing strategies. The resulting model can score and rank the images relevant to a user Query at a fine granularity to produce a high-quality ranked list. Through the competition we gained a deeper understanding of the algorithms and research directions in multi-modal retrieval, and took the opportunity to test how well cutting-edge algorithms can be applied industrially, laying a foundation for further research and deployment. In addition, because the competition scenario is similar to the business scenarios of the Meituan Search and NLP department, the model can also directly empower our business in the future.

At present, the Meituan Search and NLP teams are combining multi-modal information such as text, images, and OCR to carry out MT-BERT multi-modal pre-training, learning better semantic representations by fusing multi-modal features. We are also exploring more downstream tasks, such as image-text relevance, vector recall, multi-modal feature representation, and title generation based on multi-modal information.

References

[1]  Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer Normalization. arXiv preprint arXiv:1607.06450 (2016).

[2]  Hendrycks, D., and Gimpel, K. Gaussian Error Linear Units (GeLUs). arXiv preprint arXiv:1606.08415 (2016).

[3]  Qi, D., Su, L., Song, J., Cui, E., Bharti, T., and Sacheti, A. Imagebert: Cross-modal Pre-training with Large-scale Weak-supervised Image-text Data. arXiv preprint arXiv:2001.07966 (2020).

[4]  Tan, H., and Bansal, M. LXMERT: Learning Cross-modality Encoder Representations from Transformers. arXiv preprint arXiv:1908.07490 (2019).

[5]  Wang, F., Liu, W., Liu, H., and Cheng, J. Additive Margin Softmax for Face Verification. arXiv preprint arXiv:1801.05599 (2018).

[6]  Wang, X., Han, X., Huang, W., Dong, D., and Scott, M. R. Multi-similarity Loss with General Pair Weighting for Deep Metric Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 5022–5030.

[7]  Li X, Yin X, Li C, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks[J]. arXiv preprint arXiv:2004.06165, 2020.

[8]  Lu J, Batra D, Parikh D, et al. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]//Advances in Neural Information Processing Systems. 2019: 13-23.

[9]  Li L H, Yatskar M, Yin D, et al. Visualbert: A simple and performant baseline for vision and language[J]. arXiv preprint arXiv:1908.03557, 2019.

[10]  Su W, Zhu X, Cao Y, et al. Vl-bert: Pre-training of generic visual-linguistic representations[J]. arXiv preprint arXiv:1908.08530, 2019.

[11] Yang Yang, Jia Hao, et al. Exploration and Practice of MT-BERT.

About the Author

Zuo Kai, Ma Chao, Dongshuai, Cao Zuo, King Kong, Zhang Gong, etc., all come from the search and NLP department of the Meituan AI platform.

About Meituan AI

Meituan AI takes "helping people eat better and live better" as its core goal. It is committed to exploring cutting-edge artificial intelligence technology in real business scenarios and rapidly deploying it in real-life service scenarios, driving the digitization of the offline economy.

Meituan AI grew out of Meituan's rich life-service scenarios and has the distinct advantage of being scenario-driven. Grounded in business scenarios and rich data, it applies image recognition, voice interaction, natural language processing, and delivery scheduling technologies to real scenarios such as unmanned delivery, unmanned micro-warehouses, and smart stores, covering all aspects of daily life, using technology to help users improve their quality of life, to make the industry more intelligent, and ultimately to build new life-service infrastructure for society.

For more information, please visit: https://ai.meituan.com/ 


Job Offers

The Meituan Search and NLP Department is hiring search, recommendation, and NLP algorithm engineers on a long-term basis, based in Beijing/Shanghai. Interested candidates are welcome to send their resumes to [email protected] (please note "Search and NLP Department" in the email).

Related reading

|  KDD Cup 2020 Debiasing competition: champion technical solution and its practice in Meituan advertising

|  KDD Cup 2020 automated graph learning (AutoGraph) competition: champion technical solution and its practice in Meituan advertising

|  Practice of MT-BERT in text retrieval tasks

More recommendations

"Meituan Technology Salon 54: Meituan Data Mining Technology-KDD Special" was successfully held on September 5, 2020. This event mainly shared the five achievements Meituan published in KDD 2020, including Meituan’s New works in the scope of delivery, Meituan search advertising winning solutions and applications, the application of multimodal search in the Meituan review search business, the application of DeBias technology in search ranking, and the application of ranking learning in multi-objective learning.

Follow the official [Meituan Technical Team] WeChat account and click [Technical Salon] in the menu bar to view the event videos and slides, as well as details of past Meituan Technical Salon events.
