MT-BERT in Practice: Text Retrieval Tasks

Based on MS MARCO, Microsoft's large-scale machine reading comprehension dataset built from real-world data, the Meituan Search and NLP Center proposed DR-BERT, a BERT-based algorithm for the text retrieval task. DR-BERT was the first model to break 0.4 on the official evaluation metric MRR@10.


This article shares our hands-on experience applying the DR-BERT algorithm to text retrieval tasks, and we hope it proves helpful and inspiring to readers working on retrieval and ranking.

Background

Improving machine reading comprehension (MRC) and open-domain question answering (QA) is an important goal in natural language processing (NLP). Many breakthroughs in artificial intelligence have been driven by large public datasets. In computer vision, for example, object classification models trained on ImageNet have surpassed human performance; similarly, in speech recognition, large speech corpora have enabled deep learning models to improve recognition accuracy dramatically.

In recent years, more and more MRC and QA datasets have appeared with the aim of improving models' natural language understanding. However, these datasets have shortcomings such as limited size and queries that are constructed manually rather than drawn from real users. To address these problems, Microsoft released MS MARCO (Microsoft MAchine Reading COmprehension) [1], a reading comprehension dataset built from large-scale, real-world data. It is based on real queries issued to the Bing search engine and the Cortana intelligent assistant, and contains 1 million queries, 8 million documents, and 180,000 manually edited answers.

On top of MS MARCO, Microsoft defines two tasks. The first is to retrieve and rank documents from the full collection for a given question, i.e., a document retrieval and ranking task; the second is to generate an answer from the question and the given relevant documents, i.e., a QA task. In Meituan's business, document retrieval and ranking algorithms are widely used in search, advertising, and recommendation scenarios. Moreover, running a QA model directly over all candidate documents is prohibitively expensive: QA must rely on a ranking stage to filter out the top documents, so the quality of the ranking algorithm directly affects QA performance. For these reasons, we focus on the document retrieval and ranking task of MS MARCO.

Since its release in October 2018, the MARCO document ranking task has attracted participants from many companies and universities, including Alibaba DAMO Academy, Facebook, Microsoft, Carnegie Mellon University, and Tsinghua University. Building on Meituan's MT-BERT pre-training platform [14], we proposed a BERT-based algorithm for this text retrieval task, called DR-BERT (Enhancing BERT-based Document Ranking Model with Task-adaptive Training and OOV Matching Method). DR-BERT was the first model to break 0.4 on the official evaluation metric MRR@10, and it held the top position from May 21, 2020 (the submission date) to August 12; the organizers also posted a congratulatory tweet, shown in Figure 1 below. The core innovations of DR-BERT are domain-adaptive pre-training, two-stage fine-tuning, and two OOV (Out of Vocabulary) matching methods.

Figure 1 Official congratulatory tweet and the MARCO leaderboard

Related work

Learning to Rank

In information retrieval, many machine-learned ranking models (Learning to Rank) were proposed early on to solve document ranking, including LambdaRank [2] and AdaRank [3]. These models rely on many hand-crafted features. As deep learning became popular in machine learning, researchers proposed a number of neural ranking models such as DSSM [4] and KNRM [5]. These models map the query and the document into a continuous vector space and compute their similarity with a neural network, avoiding tedious manual feature engineering.

Figure 2 Training goals of Pointwise, Pairwise, and Listwise

According to their learning objectives, ranking models can be roughly divided into Pointwise, Pairwise, and Listwise approaches; Figure 2 above illustrates the three. The Pointwise approach directly predicts a relevance score for each query-document pair. Although easy to implement, it ignores the fact that what matters most for ranking is the relative order between documents. Following that idea, the Pairwise approach casts ranking as a comparison between pairs of documents: given a query, each document is compared with every other document to decide which is better, so the model learns the relative relationship between documents.

However, Pairwise ranking has two problems. First, it optimizes the comparison of two documents rather than the ordering of a longer list, which differs from the actual goal of document ranking; second, randomly sampling document pairs easily introduces bias into the training data. To remedy these issues, the Listwise approach extends the Pairwise idea and learns directly from the entire ranked list. Depending on the loss function used, researchers have proposed a variety of Listwise models. For example, ListNet [6] treats the top-1 probability distribution over documents as the ranked list and optimizes a cross-entropy loss, ListMLE [7] optimizes a maximum-likelihood objective, and SoftRank [8] directly optimizes the NDCG ranking metric. Most studies show that Listwise learning produces better ranking results than Pointwise and Pairwise methods.
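
To make the three formulations concrete, below is a minimal PyTorch sketch of the corresponding loss functions; the toy score and label tensors are our own illustrative assumptions, not taken from any of the cited models.

```python
import torch
import torch.nn.functional as F

# Toy relevance scores predicted for four documents of one query,
# and binary labels (1 = relevant, 0 = irrelevant).
scores = torch.tensor([2.1, 0.3, -1.2, 0.8])
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])

# Pointwise: each (query, document) pair is an independent binary example.
pointwise_loss = F.binary_cross_entropy_with_logits(scores, labels)

# Pairwise: every relevant document should outscore every irrelevant one.
pos, neg = scores[labels == 1], scores[labels == 0]
margins = pos.unsqueeze(1) - neg.unsqueeze(0)      # all (positive, negative) pairs
pairwise_loss = F.softplus(-margins).mean()        # logistic pairwise (RankNet-style) loss

# Listwise (ListNet-style): match the softmax distribution over the whole list
# of scores to the distribution implied by the labels via cross-entropy.
target_dist = F.softmax(labels, dim=0)
listwise_loss = -(target_dist * F.log_softmax(scores, dim=0)).sum()

print(pointwise_loss.item(), pairwise_loss.item(), listwise_loss.item())
```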

BERT

Since Google proposed BERT [9] in 2018, pre-trained language models have achieved great success in natural language processing and reached SOTA results on a variety of NLP tasks. BERT is essentially a Transformer-based encoder; the key to its success is the self-attention mechanism in its multi-layer Transformer, which extracts semantic features at different levels and gives the model strong representational power. As shown in Figure 3, BERT training has two parts: pre-training on a large-scale corpus and fine-tuning on the specific downstream task.

Figure 3 BERT structure and training mode

In information retrieval, many researchers have also begun using BERT for ranking. For example, [10] and [11] ran experiments on MS MARCO and far exceeded the best neural ranking models of the time; [10] used a Pointwise learning approach and [11] a Pairwise one. Although these works achieved good results, they did not exploit the comparative information of the ranked list itself. Building on this observation, we combine BERT's semantic representation power with Listwise ranking and obtain a substantial improvement.

Model introduction

Task description

Initial screening of candidates based on DeepCT

Because MS MARCO contains a large number of documents, directly computing the relevance between a query and every document with a deep neural network would be far too slow. Most ranking systems therefore adopt a two-stage approach: the first stage cheaply screens the top-k candidate documents, and the second stage re-ranks these candidates with a deep neural network. Here we use the BM25 algorithm for the first-stage retrieval; commonly used document representations for BM25 include TF-IDF and the like.
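
To make the first stage concrete, the sketch below implements a bare-bones BM25 scorer in Python; the toy corpus, whitespace tokenization, and parameter values are simplifying assumptions rather than the exact MS MARCO setup.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with classic BM25."""
    doc_tokens = [d.lower().split() for d in docs]
    doc_tfs = [Counter(toks) for toks in doc_tokens]
    avg_len = sum(len(toks) for toks in doc_tokens) / len(doc_tokens)
    n_docs = len(docs)

    # Document frequency and IDF for each query term.
    q_terms = query.lower().split()
    df = {t: sum(1 for tf in doc_tfs if t in tf) for t in q_terms}
    idf = {t: math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5)) for t in q_terms}

    scores = []
    for toks, tf in zip(doc_tokens, doc_tfs):
        s = 0.0
        for t in q_terms:
            if tf[t] == 0:
                continue
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(toks) / avg_len))
            s += idf[t] * norm
        scores.append(s)
    return scores

docs = ["the stomach is part of the digestive system",
        "bm25 is a bag of words retrieval function"]
print(bm25_scores("what is the stomach", docs))
```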

However, TF-IDF cannot take the contextual semantics of each word into account. To address this, DeepCT [12] first encodes each document with BERT and then outputs an importance score for every word. Thanks to BERT's strong semantic representation, the importance of each word within a document can be measured well. As shown in Figure 4 below, the darker a word, the more important it is; for instance, "stomach" is more important in the first document.

Figure 4 DeepCT estimates word importance; the same word can have different importance in different documents

The training objective of DeepCT is:

QTR(t, d) = |Q{d,t}| / |Qd|

where QTR(t, d) is the importance score of term t in document d, Qd is the set of questions relevant to document d, and Q{d,t} is the subset of those questions that contain term t. The predicted score can be used as a term frequency (TF), which amounts to re-estimating the importance of each word in the document, so retrieval can then be performed directly with the BM25 algorithm. We use DeepCT as the first-stage retrieval model and take the top-k documents as the candidate set D = {D1, D2, ..., Dk}.
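
The snippet below sketches how DeepCT-style term weights could be turned into pseudo term frequencies for a standard BM25 index; predict_term_weights is a hypothetical placeholder for the BERT-based DeepCT regressor, and the integer scaling is one common choice rather than the paper's exact recipe.

```python
def predict_term_weights(document):
    """Placeholder for the DeepCT model: map each term to an importance in [0, 1].
    A real implementation would run BERT over the document and regress per-term
    QTR scores; here we fake it with a trivial frequency-based heuristic."""
    tokens = document.lower().split()
    return {t: min(1.0, tokens.count(t) / len(tokens) + 0.1) for t in tokens}

def deepct_pseudo_document(document):
    """Repeat each term in proportion to its predicted importance, so that a
    standard BM25 index built over these pseudo-documents effectively uses
    the DeepCT weights as term frequencies."""
    weights = predict_term_weights(document)
    pseudo_tokens = []
    for term, weight in weights.items():
        pseudo_tokens.extend([term] * max(1, round(weight * 10)))
    return " ".join(pseudo_tokens)

print(deepct_pseudo_document("the stomach is part of the digestive system"))
```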

Domain-adaptive pre-training

Our model is built on BERT, but the corpus used to pre-train the original BERT comes from a different domain than the corpus of our task. We reached this conclusion by analyzing the top-10,000 high-frequency words of the two corpora: more than 40% of MARCO's top-10,000 high-frequency words differ from those of the BERT baseline corpus. It is therefore worthwhile to pre-train BERT further on in-domain text. Since MS MARCO is itself a large-scale corpus, we directly use its document content and continue pre-training BERT on MS MARCO with the MLM and NSP objectives.
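
The kind of vocabulary comparison described above can be reproduced with a short script like the one below; the file paths and the regex tokenization are illustrative assumptions, not the original analysis code.

```python
import re
from collections import Counter

def top_k_words(path, k=10000):
    """Count word tokens in a text file and return the set of the top-k most frequent."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(re.findall(r"[a-z']+", line.lower()))
    return {w for w, _ in counts.most_common(k)}

marco_vocab = top_k_words("msmarco_documents.txt")    # hypothetical path
bert_vocab = top_k_words("bert_pretrain_corpus.txt")  # hypothetical path

overlap = marco_vocab & bert_vocab
print(f"overlap: {len(overlap) / len(marco_vocab):.1%}")
print(f"difference: {1 - len(overlap) / len(marco_vocab):.1%}")
```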

Two-stage fine-tuning

Figure 5 Model structure

The following introduces our fine-tuning approach; Figure 5 above shows the model structure. Fine-tuning proceeds in two stages: Pointwise fine-tuning and Listwise fine-tuning.

Question-type-aware Pointwise fine-tuning

In the first fine-tuning stage, the goal is to establish the relationship between the question and the document through Pointwise training. We take the Query-Document pair as input and encode it with BERT to match the question and the document. Since the matching pattern between a question and a document depends strongly on the type of question, we believe the question type should be taken into account at this stage. We therefore encode the question, its type, and the document together with BERT to obtain a deeply interacted semantic representation. Specifically, we concatenate the question type T, the question Q, and the i-th document Di into a single input sequence:

<CLS> T <SEP> Q <SEP> Di <SEP>

where <SEP> is the separator token and the encoding at the <CLS> position represents the Query-Document relationship.

After BERT encoding, we take the representation hi of the <CLS> position in the last layer as the Query-Document relation representation, and obtain a relevance score by applying a softmax over a linear projection of hi. This score is optimized with the cross-entropy loss. Through this training, the model learns different matching patterns for different question types, so this stage can be called type-adaptive model fine-tuning.
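
A minimal sketch of this stage is shown below, assuming a Hugging Face BERT encoder; the way the question type is prepended and the classifier head are our own illustrative choices rather than the exact DR-BERT implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 2)    # relevant vs. not relevant

def pointwise_logits(question_type, question, document):
    # Roughly approximates the <CLS> T <SEP> Q <SEP> Di <SEP> input described above
    # by prepending the type to the question and using the document as segment B.
    inputs = tokenizer(f"{question_type} {question}", document,
                       truncation=True, max_length=256, return_tensors="pt")
    h_cls = encoder(**inputs).last_hidden_state[:, 0]     # hi: <CLS> representation
    return classifier(h_cls)                              # scores fed to softmax

logits = pointwise_logits("DESCRIPTION", "what does the stomach do",
                          "the stomach breaks down food with acid and enzymes")
label = torch.tensor([1])                                 # 1 = relevant (toy label)
loss = nn.functional.cross_entropy(logits, label)
print(loss.item())
```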

Listwise fine-tuning

To let the model directly learn the comparative relationships within a ranked list, we further fine-tune it with a Listwise objective. Specifically, during training, for each question we sample n+ positive and n- negative documents as input, drawn at random from the candidate set D. Note that hardware limits prevent us from feeding all candidate documents into the model at once, which is why we train on randomly sampled subsets.

Using BERT in the same way as in the previous stage, we obtain the representation of each positive and negative document, hi+ and hi-. A single-layer perceptron then reduces the dimensionality of this representation and converts it into a score:

si = W hi + b

where W and b are learnable model parameters. Next, each document's score is normalized against the other documents in the list:

Pi = exp(si) / Σj exp(sj)

This list-level comparison and normalization pits the scores of the positive examples against those of the negative examples and yields the Listwise ranking score. With this ranked list, optimizing the document ordering reduces to maximizing the scores of the positive examples, so the model can be optimized with a negative log-likelihood loss:

L = - Σ_{i ∈ positives} log Pi
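
Putting the pieces together, a minimal sketch of the Listwise objective is given below; doc_reprs stands in for the BERT representations hi+ and hi- of the sampled positive and negative documents, and the shapes and sampling sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, n_pos, n_neg = 768, 2, 6
scorer = nn.Linear(hidden_size, 1)            # single-layer perceptron: si = W hi + b

# Hypothetical <CLS> representations of the documents sampled for one question:
# the first n_pos rows are positives, the remaining n_neg rows are negatives.
doc_reprs = torch.randn(n_pos + n_neg, hidden_size)

scores = scorer(doc_reprs).squeeze(-1)        # one score per document
log_probs = F.log_softmax(scores, dim=0)      # list-level normalization Pi

# Negative log-likelihood over the positive documents:
# maximize the normalized scores of the positives against the whole list.
listwise_loss = -log_probs[:n_pos].mean()
print(listwise_loss.item())
```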

We adopt the two-stage fine-tuning scheme mainly for the following two reasons:

1. We found that first learning question-document relevance and then learning ranking works better than learning ranking directly.

2. MARCO is an under-labeled dataset: many documents relevant to a question are not labeled as 1, and this noise easily causes the model to overfit. The first-stage model can be used to filter noise out of the training data, providing cleaner supervision for the second-stage fine-tuning.

Solving the OOV mismatch problem

BERT uses WordPiece tokenization to keep the vocabulary small and to handle out-of-vocabulary (OOV) words: words not in the vocabulary, i.e. OOV words, are split into sub-word fragments. As shown in Figure 6, the original question contains the word "bogue" and the document contains the word "bogus". WordPiece splits "bogue" into "bog" and "##ue", and "bogus" into "bog" and "##us". Although "bogus" and "bogue" are unrelated words, the shared fragment "bog" produced by WordPiece makes their relevance score spuriously high.

Figure 6 Text before/after BERT WordPiece processing

To solve this problem, we propose an exact-match feature computed on the original words (before WordPiece tokenization). "Exact match" means that a word appears in both the document and the question. Exact matching is an important signal in information retrieval and machine reading comprehension, and previous studies show that many reading comprehension models improve after adding such a feature. Specifically, during fine-tuning we construct an exact-match feature for each word indicating whether it appears in both the question and the document; before encoding, we map this feature to a vector and add it to the original embedding.
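
The sketch below illustrates how such an exact-match feature could be built and added to the token embeddings before encoding; the embedding size and the toy sentences are illustrative assumptions.

```python
import torch
import torch.nn as nn

hidden_size = 768
match_embedding = nn.Embedding(2, hidden_size)    # 0 = no exact match, 1 = exact match

def exact_match_flags(question_words, doc_words):
    """1 for original (pre-WordPiece) document words that also appear in the question."""
    q_set = {w.lower() for w in question_words}
    return torch.tensor([1 if w.lower() in q_set else 0 for w in doc_words])

question = "what is a bogue".split()
document = "bogus chic is a fashion trend".split()

flags = exact_match_flags(question, document)     # tensor([0, 0, 1, 1, 0, 0])
token_embeddings = torch.randn(len(document), hidden_size)    # stand-in for BERT embeddings
combined = token_embeddings + match_embedding(flags)          # feature added before encoding
print(flags)
```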

Figure 7 The working principle of the word reduction mechanism

In addition, we propose the word-reduction mechanism shown in Figure 7. It merges the representations of the sub-tokens produced by WordPiece segmentation, which further alleviates the OOV mismatch problem. Specifically, we average-pool the sub-token representations and use the result as the input to the hidden layer, and, as shown in Figure 7 above, we MASK the hidden-layer positions corresponding to the non-first sub-tokens. It is worth noting that word reduction also helps prevent overfitting: because MARCO's labels are sparse, many positive examples are not labeled as 1, which easily causes the model to overfit these false negatives, and the word-reduction mechanism acts to some extent like Dropout.
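
A minimal sketch of the word-reduction step is given below; word_ids marks which sub-tokens belong to the same original word, and the grouping logic and shapes are illustrative assumptions rather than the exact DR-BERT implementation.

```python
import torch

def reduce_subtokens(hidden_states, word_ids):
    """Average-pool the sub-token vectors of each original word.

    hidden_states: (seq_len, hidden_size) sub-token representations.
    word_ids:      list mapping each sub-token to its word index,
                   e.g. ["the", "bog", "##ue", "word"] -> [0, 1, 1, 2].
    Returns pooled vectors written onto the first sub-token of each word and a
    mask that MASKs (drops) the remaining sub-token positions.
    """
    seq_len, hidden_size = hidden_states.shape
    pooled = hidden_states.clone()
    keep_mask = torch.ones(seq_len, dtype=torch.bool)

    word_ids_t = torch.tensor(word_ids)
    for w in word_ids_t.unique():
        positions = (word_ids_t == w).nonzero(as_tuple=True)[0]
        pooled[positions[0]] = hidden_states[positions].mean(dim=0)  # average pooling
        keep_mask[positions[1:]] = False                             # mask non-first sub-tokens
    return pooled, keep_mask

hidden = torch.randn(4, 8)             # toy sequence: ["the", "bog", "##ue", "word"]
word_ids = [0, 1, 1, 2]
pooled, mask = reduce_subtokens(hidden, word_ids)
print(mask)                            # tensor([ True,  True, False,  True])
```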

Summary and outlook

The sections above have described our DR-BERT model in detail. DR-BERT mainly relies on task-adaptive pre-training and two-stage fine-tuning, and additionally introduces a word-reduction mechanism and an exact-match feature to improve the matching of OOV words. Experiments on the large-scale MS MARCO dataset fully verify the effectiveness of the model. We hope this write-up is helpful or inspiring.

References

[1] Payal Bajaj, Daniel Campos, et al. 2016. "MS MARCO: A Human Generated MAchine Reading COmprehension Dataset". NIPS.

[2] Christopher J. C. Burges, Robert Ragno, et al. 2006. "Learning to Rank with Nonsmooth Cost Functions". NIPS.

[3] Jun Xu and Hang Li. 2007. "AdaRank: A Boosting Algorithm for Information Retrieval". SIGIR.

[4] Po-Sen Huang, Xiaodong He, et al. 2013. "Learning deep structured semantic models for web search using clickthrough data". CIKM.

[5] Chenyan Xiong, Zhuyun Dai, et al. 2017. "End-to-end neural ad-hoc ranking with kernel pooling". SIGIR.

[6] Zhe Cao, Tao Qin, et al. 2007. "Learning to rank: from pairwise approach to listwise approach". ICML.

[7] Fen Xia, Tie-Yan Liu, et al. 2008. "Listwise Approach to Learning to Rank: Theory and Algorithm". ICML.

[8] Mike Taylor, John Guiver, et al. 2008. "SoftRank: Optimising Non-Smooth Rank Metrics". In WSDM.

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv preprint arXiv:1810.04805.

[10] Rodrigo Nogueira and Kyunghyun Cho. 2019. "Passage Re-ranking with BERT". arXiv preprint arXiv:1901.04085 (2019).

[11] Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. "Multi-stage document ranking with BERT". arXiv preprint arXiv:1910.14424 (2019).

[12] Zhuyun Dai and Jamie Callan. 2019. "Context-aware sentence/passage term importance estimation for first stage retrieval". arXiv preprint arXiv:1910.10687.

[13] Hiroshi Mamitsuka. 2017. "Learning to Rank: Applications to Bioinformatics".

[14] Yang Yang, Jia Hao, et al. "Exploration and Practice of BERT in Meituan".


About the Authors

Xingwu, Hongyin, King Kong, Fuzheng, Wuwei, and others are all from the Meituan AI Platform / Search and NLP Center.

Special thanks to Jin Beihong, a researcher at the Institute of Software, Chinese Academy of Sciences, for guidance and help during the MARCO competition and the writing of this article.


