A detailed introduction to "Deeper Text Understanding for IR with Contextual Neural Language Modeling"



Paper title: Deeper Text Understanding for IR with Contextual Neural Language Modeling

Neural networks offer new possibilities for automatically learning complex linguistic patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but have been less explored for understanding the textual content of queries and documents. This paper studies the use of the contextual neural language model BERT to provide deeper text understanding for IR. Experimental results show that BERT's contextual text representations are more effective than traditional word embeddings. Compared with bag-of-words retrieval models, the contextual language model can make better use of linguistic structures, bringing large improvements to queries written in natural language. Combining text understanding ability with search knowledge yields an enhanced pre-trained BERT model that benefits related search tasks where training data is limited.

1. Introduction

Text retrieval requires an understanding of document meaning and of the search task. Neural networks are an attractive solution because they can acquire this understanding from raw document text and from training data. Most neural IR methods focus on learning query-document relevance patterns, i.e. knowledge about the search task. However, learning relevance patterns alone requires a large amount of training data and still does not generalize well to tail queries or new search domains. These issues make pre-trained, general-purpose text understanding models desirable.

Pretrained word representations such as word2vec have been widely used in neural IR. They learn from word co-occurrences in large corpora, providing hints about synonyms and related words. But word co-occurrence is only a shallow bag-of-words understanding of the text. Recently, we have seen rapid progress in text understanding with the introduction of pretrained neural language models such as ELMo and BERT. Unlike traditional word embeddings, they are contextual – the representation of a word is a function of the entire input text, taking into account word dependencies and sentence structure. These models are pretrained on a large set of documents, so contextual representations encode general linguistic patterns. Contextual neural language models outperform traditional word embeddings on various NLP tasks.

Deeper text understanding with contextual neural language models brings new possibilities for IR. This paper explores the impact of BERT on ad-hoc document retrieval. BERT is a state-of-the-art neural language model that is also well suited to search tasks. BERT is trained to predict the relationship between two pieces of text (usually sentences); its attention-based architecture models the local interactions between words in text 1 and words in text 2. It can be viewed as an interaction-based neural ranking model, and thus requires little search-specific architecture engineering.

This paper examines BERT models on two ad-hoc retrieval datasets with different characteristics. Experiments show that fine-tuning a pre-trained BERT model with limited search data can achieve better performance than strong baselines. Contrary to what is observed with traditional retrieval models, longer natural language queries can outperform short keyword queries by a large margin with BERT. Further analysis reveals that stop words and punctuation marks, often ignored by traditional IR methods, play a key role in understanding natural language queries by defining grammatical structures and word dependencies. Finally, augmenting BERT with search knowledge from a large search log yields a pre-trained model that is knowledgeable about both text understanding and the search task, which benefits related search tasks where labeled data is limited.

2. Related Work

Recent neural IR models have made promising progress in learning query-document relevance patterns. One research direction learns text representations tailored to the search task from search signals such as click logs or pseudo-relevance feedback. Another research direction designs neural architectures to capture different matching features, such as exact match signals and passage-level signals. How to understand the textual content of queries and documents has received less attention. Most neural IR models represent text with word embeddings such as word2vec.

Contextual neural language models were proposed to improve traditional word embeddings by incorporating context. One of the best performing neural language models is BERT. BERT is pre-trained on large-scale, open-domain documents to learn general patterns in language. Its pre-training tasks include predicting masked words within a sentence and the relationship between two sentences. BERT has achieved state-of-the-art results on a variety of NLP tasks, including a passage ranking task. Its effectiveness on standard document retrieval tasks remains to be studied.

3. Document Search Using BERT

This work uses an off-the-shelf BERT sentence pair classification architecture, shown in Figure 1. The model takes as input the concatenation of query tokens and document tokens, with a special token "[SEP]" separating the two segments. Tokens are mapped to embeddings. To further separate the query from the document, segment embeddings 'Q' (for query tokens) and 'D' (for document tokens) are added to the token embeddings. To capture word order, position embeddings are added. At each layer, a new contextual embedding is generated for each token by a weighted sum of the embeddings of all tokens. The weights are determined by several attention matrices (multi-head attention). Words with stronger attention are considered more related to the target word. Different attention matrices capture different types of word relations, such as exact matches and synonyms. Attention spans both the query and the document, so query-document interactions are taken into account. Finally, the output embedding of the first token ("[CLS]") is used as a representation of the entire query-document pair. It is fed into a multi-layer perceptron (MLP) to predict the probability of relevance (binary classification). The model is initialized with a pre-trained BERT model to leverage the pre-trained language model, while the final MLP layer is learned from scratch. During training, the whole model is tuned to learn more IR-specific representations.
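The following is a minimal sketch of this sentence-pair setup using the Hugging Face Transformers library (the paper builds on Google's released BERT code, so the library choice and the `bert-base-uncased` checkpoint here are assumptions for illustration): the query is segment A, the passage is segment B, and the "[CLS]" output feeds a binary classification head.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Query and passage are packed into one sequence:
# [CLS] query tokens [SEP] passage tokens [SEP]
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

query = "where are wind power installations located"
passage = "There were 1,200 wind power installations in Germany."
inputs = tokenizer(query, passage, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, 2)

# Probability of the "relevant" class; the classification head here is randomly
# initialized and would be learned during fine-tuning on search data.
score = torch.softmax(logits, dim=-1)[0, 1].item()
print(score)
```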

Figure 1: BERT sentence pair classification architecture

Passage-Level Evidence: Because every pair of tokens interacts, applying BERT to long documents increases memory usage and runtime. A model trained on sentences may also be less effective on long text. We therefore adopt a simple passage-level approach for document retrieval. We split each document into overlapping passages, and the neural ranker predicts the relevance of each passage independently. **The document score is the score of the first passage (BERT-FirstP), the best passage (BERT-MaxP), or the sum of all passage scores (BERT-SumP).** For training, passage-level labels are not available in this work, so we treat all passages from a relevant document as relevant and all passages from a non-relevant document as non-relevant. When document titles are available, the title is added to the beginning of each passage to provide context. A sketch of the splitting and aggregation logic follows.
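A minimal sketch of the passage splitting and score aggregation described above (the 150-word window and 75-word stride follow Section 4; the function names are illustrative, not from the authors' code):

```python
def split_passages(doc_tokens, window=150, stride=75):
    """Overlapping passages: 150-word windows, each shifted forward by 75 words."""
    passages = []
    start = 0
    while True:
        passages.append(doc_tokens[start:start + window])
        if start + window >= len(doc_tokens):
            break
        start += stride
    return passages

def document_score(passage_scores, mode="maxp"):
    """Aggregate per-passage relevance scores into a single document score."""
    if mode == "firstp":
        return passage_scores[0]
    if mode == "maxp":
        return max(passage_scores)
    if mode == "sump":
        return sum(passage_scores)
    raise ValueError(f"unknown aggregation mode: {mode}")

# Example: scores produced by the BERT ranker for three passages of one document.
scores = [0.2, 0.9, 0.4]
print(document_score(scores, "firstp"),
      document_score(scores, "maxp"),
      document_score(scores, "sump"))
```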

Augmenting BERT with Search Knowledge: Some search tasks require both general text understanding (e.g. Honda is a company) and more specific search knowledge (e.g. people want to see special offers for Honda). While pre-trained BERT encodes general language patterns, search knowledge must be learned from labeled search data. Such data is often expensive to collect and takes time to accumulate. Ideally, a ranking model would be pre-trained with both language understanding and search knowledge. We augment BERT with search knowledge by tuning it on a large search log.
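One way to realize this two-stage tuning is sketched below: first fine-tune on query-document pairs from the search log, then continue fine-tuning on the target collection. The `bing_log_loader` and `clueweb_train_loader` data loaders are hypothetical placeholders, and the loop is a generic sketch rather than the authors' actual training setup.

```python
import torch
from transformers import BertForSequenceClassification

def fine_tune(model, dataloader, epochs=1, lr=2e-5, device="cuda"):
    """Generic fine-tuning loop over batches of tokenized (query, passage) pairs
    with binary relevance labels; the loss is computed inside the model."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Stage 1: inject search knowledge from a large search log.
# `bing_log_loader` is a hypothetical DataLoader over Bing query-document pairs.
model = fine_tune(model, bing_log_loader, epochs=1)

# Stage 2: adapt to the target collection with its limited labeled data.
# `clueweb_train_loader` is a hypothetical DataLoader over ClueWeb09-B training folds.
model = fine_tune(model, clueweb_train_loader, epochs=2)
```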

Discussion: Only minor adjustments are needed to apply BERT to search tasks: a passage-based approach to handle long documents, and a concatenation approach to handle multiple document fields. Our goal is to study the value of BERT's contextual language model for search, not to make major extensions to the architecture.

4. Experimental Setup

Datasets: We use two standard text retrieval collections with different characteristics. Robust04 is a news corpus with 0.5M documents and 249 queries. Two versions of each query are included: a short keyword query (title) and a longer natural language query (description). A narrative is also included as guidance for relevance assessment. An example is shown in Table 1. ClueWeb09-B contains 50M web pages and 200 queries with titles and descriptions. Passages are generated with a 150-word sliding window and a stride of 75 words. For ClueWeb09-B, document titles are added to the beginning of each passage. For augmenting BERT with search data, we follow the domain adaptation setting of Dai et al. and use the same Bing search log sample. The sample contains 0.1M queries and 5M query-document pairs.

Table 1: Examples of Robust04 search topics (topic 697)

Baselines and Implementation: Unsupervised baselines use Indri's bag-of-words (BOW) and sequential dependence model (SDM) queries. Learning-to-rank baselines include RankSVM and Coor-Ascent with bag-of-words features. Neural baselines include DRMM and Conv-KNRM. DRMM uses word2vec to model soft word matching; it has proven to be among the best performing neural models on our two datasets. Conv-KNRM learns n-gram embeddings for the search task and shows strong performance when trained on a large search log. The Bing-adapted Conv-KNRM is the state-of-the-art neural IR model when trained with domain adaptation, and is compared with the Bing-augmented BERT. The BERT models are based on the implementation released by Google. Baselines use standard stop word removal and stemming; BERT uses raw text. The supervised models rerank the top 100 documents retrieved by BOW, with 5-fold cross-validation. Due to space constraints, we only report nDCG@20; similar trends were observed with nDCG@10 and MAP@100.
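For reference, a small sketch of the nDCG@20 metric reported here, using the standard linear-gain DCG formulation over graded relevance labels (some IR toolkits use the 2^rel - 1 gain variant instead; which variant was used is not stated in this write-up):

```python
import math

def dcg(relevances, k=20):
    """Discounted cumulative gain with linear gains: rel_i / log2(i + 1), 1-indexed ranks."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_20(ranked_relevances):
    """ranked_relevances: graded labels in the order the system ranked the documents."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Toy example: graded labels of the top-ranked documents for one query.
print(ndcg_at_20([2, 0, 1, 0, 1]))
```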

5. Results and Discussion

This section studies the effectiveness of BERT on document retrieval tasks, the differences between different types of queries, and the impact of augmenting BERT with search logs.

5.1 Pretrained BERT for Document Retrieval

The ranking accuracy of each method is shown in Table 2. On Robust04, the BERT models consistently achieve better search accuracy than the baselines, with roughly a 10% advantage on title queries and a 20% advantage on description queries. On ClueWeb09-B, BERT is comparable to Coor-Ascent on title queries and better on description queries. These results demonstrate the effectiveness of BERT for document retrieval, especially on description queries. Among the neural rankers, Conv-KNRM has the lowest accuracy. Conv-KNRM needs to learn n-gram embeddings from scratch, which is powerful when trained on a large search log but prone to overfitting when trained with only a small amount of data. BERT is pre-trained and less prone to overfitting. DRMM represents words with pre-trained word embeddings. The better performance of the BERT models shows that contextual text representations are more effective for IR than bag-of-words embeddings.

Table 2: Search accuracy on Robust04 and ClueWeb09-B. †Denotes statistically significant improvement over Coor-Ascent, by permutation test at P<0.05.
Figure 2: Visualization of BERT. The colors represent different attention heads, and the darker the color, the higher the attention.

Source of effectiveness: Figure 2 visualizes two layers of the BERT-MaxP model when predicting the relevance of the sentence "There were 1,200 wind power installations in Germany" to the description query "Where are wind power installations located?". Example 1 shows the attention received by the document word "power". The strongest attention comes from "power" in the query (query-document exact term matching) and from the previous and next words of "power" (bigram modeling). Local matching of words and n-grams has proven to be a strong neural IR feature, and BERT is also able to capture it. Example 2 shows that the document word "in" receives the strongest attention from the query word "where". "in" appears in the context "in Germany", so it satisfies the "where" question. Words like "in" and "where" are often ignored by traditional IR methods due to their high document frequency in the corpus. This example shows that, with deeper text understanding, these stop words actually provide important evidence about relevance. In short, the strengths of BERT lie in its architecture and its data. The Transformer architecture allows BERT to extract a variety of effective matching features. The Transformer is pre-trained on a large corpus, so search tasks with small amounts of training data can also benefit from this deep network.
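As an illustration of the kind of analysis behind Figure 2, the sketch below extracts attention weights from a (non-fine-tuned) BERT model for this query-sentence pair and prints the tokens that a chosen document word attends to most. The checkpoint and the layer/head indexing are assumptions for illustration, not the exact setup used to produce the figure.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

query = "Where are wind power installations located?"
sentence = "There were 1,200 wind power installations in Germany."
inputs = tokenizer(query, sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
attn = outputs.attentions[5][0]          # layer 6 of 12, shape: (heads, seq_len, seq_len)
target = tokens.index("in")              # the document word "in" (not present in the query)

# For each attention head, show the three tokens that "in" attends to most strongly.
for head in range(attn.size(0)):
    top = attn[head, target].topk(3)
    print(f"head {head}:",
          [(tokens[i], round(v.item(), 3)) for v, i in zip(top.values, top.indices)])
```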

Title queries vs. description queries: The BERT models benefit more from description queries. On Robust04, description queries with BERT-MaxP outperform the best title-query baseline (SDM) by 23%. Most other ranking methods achieve only similar or worse performance on descriptions than on titles. To our knowledge, this is the first time description queries have been shown to beat title queries by such a large margin. On ClueWeb09-B, BERT manages to close the gap between titles and descriptions. Intuitively, description queries should carry richer information, but it is hard to exploit them fully with traditional bag-of-words methods because term importance is difficult to estimate. Our results show that longer natural language queries are indeed more expressive than keywords, and that their richer information can be leveraged effectively by a deep, contextual neural language model to improve search. Section 5.2 analyzes BERT's ability to understand different types of search queries in more detail.

Robust04 vs. ClueWeb09-B: The BERT models perform better on Robust04 than on ClueWeb09-B. This is likely because Robust04 is closer to the pre-training corpus. Robust04 consists of well-written news articles; the facts its queries look for depend largely on understanding the meaning of the text. ClueWeb09-B documents are web pages that include tables, navigation bars, and other discontinuous text. The task also involves web-search-specific issues such as page authority. Learning such search-specific knowledge may require more training data. We investigate this possibility in Section 5.3.

5.2 Understanding Natural Language Queries

This section studies BERT on three types of queries that require different degrees of text understanding: title, description, and narrative. To test the influence of grammatical structure, keyword versions of the descriptions and narratives were generated by removing stop words and punctuation marks. To test how well BERT understands the logic in a narrative, a "positive" version of each narrative was generated by removing negative conditions (e.g. "the document is not relevant..."). Table 3 shows the performance of SDM, Coor-Ascent, and BERT-MaxP on Robust04. Because BOW has low recall on narratives, the supervised methods use the narrative to rerank the initial results of the title query, which gives narratives a slight advantage over the other query types.

Table 3: Accuracy on different types of Robust04 queries. Percentages show relative gain/loss compared to the title query.

SDM works best with titles. Coor-Ascent is relatively better on descriptions and narratives. These two methods weight words only by term frequency statistics, whereas the importance of a word should depend on the meaning of the entire query. In contrast, BERT-MaxP achieves larger improvements on longer queries by modeling word meaning in context. The keyword versions work better than the original queries for SDM and Coor-Ascent, because stop words are noisy for traditional matching signals such as TF. In contrast, BERT is more effective on the original natural language queries. Although stop words and punctuation do not specify the information need, they build the grammatical structure of the language. BERT is able to capture such structure, enabling deeper query understanding than a flat bag of words. Table 3 also shows a limitation of BERT: it cannot exploit the evidence in negative logical conditions in the narratives; removing the negative conditions does not hurt performance. A sketch of how the keyword query variants can be generated is shown below.
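A small sketch of how the keyword variants described in this section could be produced by stripping punctuation and stop words (the stop word list here is a short illustrative one, not the list actually used in the experiments):

```python
import re

# Short illustrative stop word list; a real setup would use e.g. the Indri or NLTK list.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "for", "to", "is", "are", "were",
             "what", "where", "which", "that", "and", "or", "not", "be", "by"}

def keyword_version(query: str) -> str:
    """Strip punctuation and stop words to mimic the 'keywords' query variant."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(keyword_version("Where are wind power installations located?"))
# -> "wind power installations located"
```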

5.3 Understanding Search Tasks

Text representations trained on corpora are not always consistent with search tasks. Search-specific knowledge is necessary, but requires labeled data for training. The final section investigates whether BERT's language modeling knowledge can be stacked with additional search knowledge to build a better ranker, and whether search knowledge can be learned in a domain-adaptive manner to alleviate the cold-start problem.

Table 4: Accuracy of Bing-enhanced BERT on ClueWeb09-B. †: Statistically significant improvement over Coor-Ascent.

We train BERT on the Bing search log sample and fine-tune it on ClueWeb09-B. The results are shown in Table 4. BERT-FirstP is the best in-domain BERT model on ClueWeb09-B (Table 2). Its pre-trained language model encodes general word associations such as ('Honda', 'car'), but lacks search-specific knowledge such as ('Honda', 'special offer'). Conv-KNRM+Bing is the previous state-of-the-art domain-adapted neural IR model. It is trained on millions of query-document pairs, but does not explicitly model general language patterns. BERT-FirstP+Bing achieves the best performance, confirming that text retrieval requires understanding both the text content and the search task. Simple domain adaptation of BERT yields a pre-trained model with both types of knowledge that improves related search tasks where labeled data is limited.

6. Conclusion

Text understanding is a long-desired feature of text retrieval. Contextual neural language models offer new possibilities for understanding word context and modeling language structure. This paper investigates the impact of the deep neural language model BERT on the task of ad-hoc document retrieval.

Fine-tuning pre-trained BERT models achieves high accuracy on two different search tasks, demonstrating the effectiveness of BERT's language modeling for IR. The contextual model brings large improvements on natural language queries. A corpus-trained language model can be complemented with search knowledge through simple domain adaptation, leading to a strong ranker that models both the meaning of text and its relevance to the search.

People are accustomed to keyword queries because bag-of-words retrieval models cannot effectively extract the key information from natural language. We find that queries written in natural language actually lead to better search results when the system can model language structure. Our findings encourage further research on search systems with natural language interfaces.


Origin blog.csdn.net/zag666/article/details/128428128