Pre-training for retrieval scenarios

1. Retrieval pre-training

1.1 PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval

Three types of pre-training tasks have been proposed in prior work (a small construction sketch follows the list):

  • Inverse Cloze Task (ICT): The query is a sentence randomly drawn from the passage, and the document is the rest of the sentences;
  • Body First Selection (BFS): The query is a random sentence in the first section of a Wikipedia page, and the document is a random passage from the same page;
  • Wiki Link Prediction (WLP): The query is a random sentence in the first section of a Wikipedia page, and the document is a passage from another page that contains a hyperlink to the page of the query.
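
To make the first task concrete, here is a minimal sketch of how one ICT training pair can be built (my own toy illustration, not code from any of these papers; BFS and WLP are built the same way but additionally require Wikipedia page structure):

```python
import random

def ict_pair(passage_sentences):
    """Inverse Cloze Task: pick one sentence as the pseudo-query and keep the rest as the document."""
    i = random.randrange(len(passage_sentences))
    query = passage_sentences[i]
    document = " ".join(passage_sentences[:i] + passage_sentences[i + 1:])
    return query, document

# Toy passage (illustrative only). For BFS the query would come from the lead section of a
# Wikipedia page and the document from another passage of the same page; for WLP the
# document would come from a page that hyperlinks to the query's page.
passage = [
    "PROP is a pre-training method for ad-hoc retrieval.",
    "It samples representative word sets from a document language model.",
    "The model learns to prefer the more representative set.",
]
print(ict_pair(passage))
```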

Motivation and novelty:

The assumption of the query likelihood language model is: p(R=1|q,d) ≈ p(q|d,R=1). In other words, the probability that the document is relevant to the query is approximated by the probability of generating the user's query q under the premise that the document is relevant. For details, see: Document Ranking Model – Query Likelihood.
In the editor's view, the principle is similar in spirit to TF-IDF: both compute a similarity between the query and the document.
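
Where this approximation comes from: the usual derivation applies Bayes' rule to the relevance variable and then drops factors that do not affect the ranking. A short worked version (my own write-up of the standard "mild prior assumption" step, not quoted from the paper):

```latex
% Bayes' rule on the relevance variable R for a fixed query q and document d:
\[
  p(R=1 \mid q, d)
  = \frac{p(q \mid d, R=1)\, p(R=1 \mid d)}{p(q \mid d)}
  \;\propto\; p(q \mid d, R=1)
\]
% The proportionality is ranking-equivalent if the prior p(R=1 | d) is (roughly) uniform
% over documents and the denominator p(q | d) is treated as a document-independent
% normalizer -- the "mild prior assumption" mentioned below.
```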

The key idea is inspired by the traditional statistical language model for IR, specifically the query likelihood model [27] which was proposed in the last century. The query likelihood model assumes that the query is generated as the piece of text representative of the “ideal” document [19]. Based on the Bayesian theorem, the relevance relationship between query and document can then be approximated by the query likelihood given the document language model under some mild prior assumption. Based on this classical IR theory, we propose the Representative wOrds Prediction (ROP) task for pre-training. Specifically, given an input document, we sample a pair of word sets according to the document language model, which is defined by a popular multinomial unigram language model with Dirichlet prior smoothing. The word set with higher likelihood is deemed more “representative” of the document. We then pre-train the Transformer model to predict the pairwise preference between the two sets of words, jointly with the Masked Language Model (MLM) objective. The pre-trained model, namely PROP for short, can then be fine-tuned on a variety of downstream ad-hoc retrieval tasks. The key advantage of PROP is that it is rooted in a good theoretical foundation of IR and can be trained universally over a large-scale text corpus without requiring any special document structure (e.g., hyperlinks).
In short: sample two word sets, score them by query likelihood, and train with a pairwise comparison loss plus the Masked Language Model (MLM) loss. The result is a BERT-like pre-trained model that is better suited to retrieval scenarios.
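
To make the sampling step concrete, here is a minimal, self-contained sketch of building one ROP training instance (a toy corpus and simplified set-size sampling; this is my own illustration, not the authors' implementation):

```python
import math
import random
from collections import Counter

# Toy corpus standing in for a large pre-training collection (illustrative only).
corpus = [
    "information retrieval studies how to rank documents for a query",
    "the query likelihood model scores a document by the probability of the query",
    "dense retrieval encodes queries and documents as low dimensional vectors",
]
docs = [doc.split() for doc in corpus]

# Collection language model p(w|C), used inside Dirichlet prior smoothing.
collection_tf = Counter(w for d in docs for w in d)
collection_len = sum(collection_tf.values())

def doc_lm(doc, mu=2000):
    """Multinomial unigram LM with Dirichlet prior smoothing:
    p(w|d) = (tf(w,d) + mu * p(w|C)) / (|d| + mu)."""
    tf = Counter(doc)
    return {w: (tf[w] + mu * collection_tf[w] / collection_len) / (len(doc) + mu)
            for w in collection_tf}

def sample_word_set(lm, size):
    """Sample a pseudo word set from the document language model."""
    words, probs = zip(*lm.items())
    return random.choices(words, weights=probs, k=size)

def log_likelihood(word_set, lm):
    return sum(math.log(lm[w]) for w in word_set)

def rop_instance(doc, set_size=3):
    """One ROP training instance: two sampled word sets plus a pairwise preference label.
    (The paper samples the set size from a Poisson distribution; a fixed size is used here.)"""
    lm = doc_lm(doc)
    s1, s2 = sample_word_set(lm, set_size), sample_word_set(lm, set_size)
    label = 1 if log_likelihood(s1, lm) > log_likelihood(s2, lm) else 0
    return s1, s2, label  # the Transformer is trained to predict this preference, jointly with MLM

print(rop_instance(docs[0]))
```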

1.2 B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval

This work is a companion piece to PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. The motivation is to address a limitation of PROP's query likelihood: it relies only on a unigram language model and ignores context. B-PROP therefore proposes to use BERT itself to select the representative words.

The most direct way is to use the attention from BERT's [CLS] token to the other tokens as the word weights, but the words selected this way are often common words such as "in", "the", and "of". To solve this problem, the author turns to the divergence-from-randomness model, a probabilistic statistical model from classical retrieval. Grounding the method in this theory is, I think, an innovative point of this article.
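
The rough shape of the idea can be sketched as follows. This is only a toy illustration of divergence-from-randomness-style re-weighting, not the exact B-PROP formulation, and the attention and baseline numbers are made up:

```python
import math

def contrastive_term_weights(terms, attention, baseline):
    """Re-weight each term's [CLS] attention by its information content under a 'random'
    baseline distribution (e.g. collection frequency). Illustrative only, not the exact
    B-PROP formula."""
    total = sum(attention[t] for t in terms)
    p_doc = {t: attention[t] / total for t in terms}   # normalized attention over the doc's terms
    # Per-term cross-entropy-style contribution: common words have a high baseline
    # probability, so they are penalized relative to their raw attention.
    weights = {t: p_doc[t] * (-math.log(baseline[t])) for t in terms}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}

# Hypothetical numbers: "the" receives the most attention but has a high baseline
# probability, so after re-weighting it drops below the content word "retrieval".
terms = ["the", "retrieval", "model"]
attention = {"the": 0.5, "retrieval": 0.3, "model": 0.2}      # made-up [CLS] attention
baseline = {"the": 0.05, "retrieval": 0.001, "model": 0.002}  # made-up collection probabilities
print(contrastive_term_weights(terms, attention, baseline))
```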

When I was reading this part of the article, I wondered: why not just filter with TF-IDF? After reading the divergence-from-randomness theory above, I found that the cross-entropy statistic it uses is, after a little derivation, basically equivalent to TF-IDF. But directly using TF-IDF filtering in a paper would obviously not look as principled. This is not to say that the authors of B-PROP are being opportunistic, only that paper writing requires certain skills, and those skills are rooted in the basic theoretical system.
Regarding divergence from randomness, I also found that TF-IDF is similar in principle to cross-entropy. Putting the two side by side:
TF-IDF: tfidf(t, d) = tf(t, d) · log(N / df(t))
Cross-entropy: H(p, q) = −Σ_t p(t) · log q(t) (remove the sum and look at a single term; editor's small class, haha; for details, see: Cross-Entropy)
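
One way to make the analogy explicit (my own back-of-the-envelope reasoning, not something stated in the paper): if the reference distribution q is estimated from document frequencies, each term's contribution to the cross-entropy has exactly the TF-IDF shape.

```latex
% Cross-entropy between a term distribution p (from the document or query) and a
% reference distribution q estimated from the collection:
\[
  H(p, q) = -\sum_{t} p(t)\,\log q(t)
\]
% Back-of-the-envelope assumptions: p(t) \propto tf(t, d) and q(t) \approx df(t)/N.
% A single term then contributes
\[
  -\,p(t)\,\log q(t) \;\propto\; tf(t, d)\cdot\log\frac{N}{df(t)},
\]
% which is the familiar tf * idf form.
```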

Other related work

  • Document Expansion by Query Prediction
    identified document expansion terms using a sequence-to-sequence model that generated possible queries for which the given document would be relevant.
    This method is a sparse retrieval scheme in the same family as BM25, and it outperforms BM25. The idea is to generate possible queries from a document and append them directly to the original document, which alleviates the vocabulary-mismatch problem (same meaning, different terms) of sparse retrieval. Another simple yet effective paper; a minimal generation sketch appears as the first example after this list.

    Algorithms and models in the field of information retrieval are roughly divided into two categories, sparse and dense. This refers to the way the data is represented in the model. If a model represents query and document as high-dimensional sparse vectors, then the model is "sparse"; if it represents them as relatively low-dimensional dense vectors, then it is "dense". Typical sparse models include TF-IDF and BM25, while typical dense models include most of today's deep learning retrieval models, such as Two-tower BERT. It should be noted that whether the model is sparse or dense has nothing to do with whether it uses deep learning technology, but only depends on how its data is represented.

  • Context-Aware Term Weighting For First Stage Passage Retrieval (interpretation link)
    used a BERT [12] model to learn relevant term weights in a document and generated a pseudo-document representation.
    This is similar to a method for mining query term weights from click data that I worked on during a previous internship at a search engine company (2018). The difference is that this paper not only weights the query terms but also uses the same approach to obtain term weights for the document, and the results are equally effective; a rough architectural sketch appears as the second example after this list.

    Editor's experience: if the click volume is large enough, the click-based method may work even better, because such term weights may be more statistically reliable.
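
Since the doc2query idea from the first item is easy to prototype, here is a minimal sketch (my own illustration, assuming the Hugging Face transformers library; the checkpoint name is an example assumption, not taken from the paper):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

MODEL_NAME = "castorini/doc2query-t5-base-msmarco"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def expand_document(doc_text, num_queries=3):
    """Generate likely queries for a document and append them to the original text,
    so that a sparse index (e.g. BM25) can match them at search time."""
    inputs = tokenizer(doc_text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(
        **inputs,
        max_length=64,
        do_sample=True,            # sampling yields more diverse pseudo-queries
        top_k=10,
        num_return_sequences=num_queries,
    )
    queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    return doc_text + " " + " ".join(queries)

print(expand_document("PROP pre-trains a Transformer with a representative words prediction task."))
```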

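For the term-weighting item, here is a rough architectural sketch (my own illustration with an untrained regression head, assuming transformers and PyTorch; DeepCT's actual training targets and post-processing are described in the paper):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
# In the paper this head is trained against query-based term importance targets;
# here it is left untrained, so the numbers below are meaningless placeholders.
term_weight_head = torch.nn.Linear(encoder.config.hidden_size, 1)

def pseudo_doc_term_freqs(text, scale=10):
    """Map each token to a scalar importance and quantize it into an integer
    pseudo term frequency that an ordinary inverted index can store."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state               # (1, seq_len, hidden)
        weights = term_weight_head(hidden).squeeze(-1).squeeze(0)  # one scalar per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    freqs = {}
    for tok, w in zip(tokens, weights.tolist()):
        if tok in ("[CLS]", "[SEP]"):
            continue
        freqs[tok] = freqs.get(tok, 0) + max(0, round(w * scale))
    return freqs

print(pseudo_doc_term_freqs("context aware term weighting for first stage retrieval"))
```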