A detailed look at the text retrieval benchmark BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Paper title: BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models

This paper presents a benchmark for text retrieval built from 18 existing datasets spanning different domains and task complexities, and evaluates a range of retrieval and reranking models, with a focus on the transfer (zero-shot) setting. The main contribution is a standardized benchmark for zero-shot evaluation of retrieval systems across a variety of tasks and domains. Previous standardized benchmarks have a narrow evaluation scope, either in their task (e.g. MultiReQA focuses only on question answering) or in their retrieval corpus (e.g. KILT retrieves only from Wikipedia). BEIR overcomes these shortcomings and provides an easy-to-use evaluation framework for new retrieval methods.

Existing neural information retrieval (IR) models are usually studied in homogeneous and narrow settings, which provide rather limited insight into their out-of-distribution (OOD) generalization capabilities. To address this issue, and to let researchers broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We carefully select 18 publicly available datasets from different text retrieval tasks and domains, and evaluate 10 state-of-the-art retrieval systems on the BEIR benchmark, covering lexical, sparse, dense, late-interaction, and reranking architectures. Our results show that BM25 is a robust baseline, and that models based on reranking and late interaction achieve the best zero-shot performance on average, but are computationally expensive. In contrast, dense and sparse retrieval models are more computationally efficient but often underperform the other methods, showing that their generalization capabilities leave considerable room for improvement. We hope this framework allows us to better evaluate and understand existing retrieval systems and helps accelerate progress towards more robust and general systems in the future.

1 Introduction

Major natural language processing (NLP) problems rely on a practical and efficient retrieval component as a first step for finding relevant information. Challenging examples include open-domain question answering, claim verification, duplicate question detection, and more. Traditionally, retrieval has been dominated by lexical methods such as TF-IDF or BM25. However, these methods suffer from the lexical gap and can only retrieve documents containing the keywords in the query. In addition, lexical approaches treat queries and documents as bags of words, ignoring word order.

Recently, deep learning, and in particular pre-trained Transformer models such as BERT, has become popular in information retrieval. These neural retrieval systems can improve retrieval performance in a number of fundamentally different ways; a brief overview is given in Section 2.1. Many previous works train neural retrieval systems on large datasets such as Natural Questions (NQ, 133k training examples) or MS MARCO (533k training examples), both of which focus on passage retrieval. In prior work, most methods were also evaluated on these same datasets, where they showed significant performance gains over lexical approaches such as BM25.

However, creating a large training corpus is often time-consuming and expensive, so many retrieval systems are applied in a zero-shot setting, where no training data is available to train the system. So far, it is unclear how well existing trained neural models perform on other text domains or text retrieval tasks. More importantly, it is unclear how well different approaches, such as sparse versus dense embeddings, generalize to out-of-distribution data.

In this work, we propose a new robust and heterogeneous benchmark called BEIR (Benchmarking IR), consisting of 18 retrieval datasets, for comparing and evaluating model generalization. Previous retrieval benchmarks suffer from a relatively narrow evaluation scope, either focusing on a single task, such as question answering, or on a single domain. In BEIR, we focus on diversity and include nine different retrieval tasks: fact checking, citation prediction, duplicate question retrieval, argument retrieval, news retrieval, question answering, tweet retrieval, biomedical IR, and entity retrieval. In addition, we include datasets from different text domains: datasets covering broad topics (e.g. Wikipedia) and specialized topics (e.g. COVID-19 publications), different text types (news articles vs. tweets), different corpus sizes (3.6k to 15M documents), and different query lengths (average query length between 3 and 192 words) and document lengths (average document length between 11 and 635 words).

We use BEIR to evaluate ten different retrieval methods from five broad architectures: lexical, sparse, dense, late interaction, and reranking. From our analysis, we find that no single method consistently outperforms the others across all datasets. Furthermore, a model's in-domain performance does not predict its generalization ability: models fine-tuned on the same training data can generalize very differently. In terms of efficiency, we find a trade-off between performance and computational cost: computationally expensive models, such as reranking and late-interaction models, perform best, while more efficient methods based on dense or sparse embeddings can substantially underperform traditional lexical models such as BM25. Overall, BM25 remains a strong baseline for zero-shot text retrieval.

Finally, we note that the datasets included in the benchmark may suffer from strong lexical bias, possibly because lexical models were used during the annotation or creation of the datasets. This can put non-lexical methods at an unfair disadvantage. We perform an analysis on the TREC-COVID dataset: we manually annotate missing relevance judgments for the systems under test and observe a significant improvement in the performance of non-lexical approaches. Future work therefore requires less biased datasets that allow a fair comparison of all types of retrieval systems.

With BEIR, we take an important step towards a single, unified benchmark for evaluating the zero-shot capabilities of retrieval systems. It allows studying when and why certain methods perform well, and will hopefully drive innovation towards more robust retrieval systems. We release BEIR and integrate different retrieval systems and datasets into a well-documented, easy-to-use and extensible open-source package. BEIR is model-agnostic, welcomes a wide variety of approaches, and allows easy integration of new tasks and datasets. More details are available at https://github.com/UKPLab/beir

2. Related work and background

To our knowledge, BEIR is the first broad, zero-shot information retrieval benchmark. Existing works do not thoroughly evaluate the zero-shot retrieval setting; they focus either on a single task, a small corpus, or a single domain. This hinders the investigation of model generalization across domains and task types.

MultiReQA consists of eight Question Answering (QA) datasets and evaluates sentence-level answer retrieval for a given question. It tests only one task, and five of the eight datasets are based on Wikipedia. Furthermore, MultiReQA evaluates retrieval on fairly small corpora: 6 of the 8 tasks have fewer than 100,000 candidate sentences, which favors dense retrieval over lexical retrieval. KILT consists of five knowledge-intensive tasks covering a total of 11 datasets. These tasks involve retrieval, but retrieval is not their main focus, and KILT only retrieves documents from Wikipedia.

2.1 Neural Retrieval

Information retrieval is the task of searching a collection and returning documents relevant to a query. This paper focuses on text retrieval and uses "document" as a cover term for text of any length in a given collection, while queries are user-issued and can also be of any length. Traditionally, lexical approaches like TF-IDF and BM25 have dominated text retrieval. Recently, there has been growing interest in using neural networks to improve or replace these lexical approaches.

Retriever-based : Lexical retrieval suffers from the lexical gap. To overcome this problem, early techniques proposed using neural networks to improve lexical retrieval systems. Sparse methods such as docT5query identify document expansion terms using a sequence-to-sequence model that generates possible queries to which a given document would be relevant. DeepCT uses a BERT model to learn relevant term weights in a document and generate a pseudo-document representation. Both methods still rely on BM25 for the actual retrieval. Similarly, SPARTA learns token-level contextualized representations with BERT and converts documents into an efficient inverted index. More recently, dense retrieval methods have been proposed: they capture semantic matches and attempt to overcome the (potential) lexical gap. Dense retrievers map queries and documents into a shared dense vector space, which allows document representations to be precomputed and indexed, as sketched below. Dual-encoder architectures based on pretrained Transformers have demonstrated strong performance on various open-domain question answering tasks. This dense approach has recently been extended by hybrid lexical-dense approaches, which aim to combine the strengths of both. Other parallel work proposes an unsupervised domain-adaptation approach that trains a dense retriever by generating synthetic queries for the target domain. Finally, ColBERT (Contextualized Late Interaction over BERT) computes multiple contextualized token-level embeddings for queries and documents and uses a maximum-similarity function to retrieve relevant documents.
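To make the dual-encoder idea concrete, here is a minimal dense-retrieval sketch using the sentence-transformers library. The checkpoint name is only an illustrative assumption; any MS MARCO-trained bi-encoder would work the same way.

```python
# Minimal dual-encoder retrieval sketch (not the paper's exact setup).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-distilbert-base-tas-b")  # assumed checkpoint name

corpus = [
    "BM25 is a bag-of-words ranking function based on term frequencies.",
    "Dense retrievers embed queries and documents into a shared vector space.",
]
queries = ["How do dense retrieval models represent documents?"]

# Document embeddings can be precomputed once and indexed.
doc_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(queries, convert_to_tensor=True)

# Dot-product similarity between each query and every document.
scores = util.dot_score(query_emb, doc_emb)   # shape: [num_queries, num_docs]
best = int(scores.argmax(dim=1)[0])
print(corpus[best])
```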

Reranking-based : Neural reranking methods take the output of a first-stage retrieval system, usually BM25, and rerank the retrieved documents to produce a better ordering. With BERT's cross-attention mechanism, reranking improves performance significantly, but at the cost of high computational overhead.

3. BEIR Benchmark

BEIR aims to provide a one-stop zero-shot evaluation benchmark for many different retrieval tasks. To build a comprehensive evaluation suite, the selection methodology is crucial for collecting tasks and datasets with the desired properties. For BEIR, the selection is guided by the following four factors.
(i) Diverse tasks: Information retrieval is a versatile task, and the lengths of queries and indexed documents can vary between tasks. Sometimes queries are short, like a keyword query, while in other cases they can be long, like a news article; likewise, indexed documents can be long or short.
(ii) Diverse domains: Retrieval systems should be evaluated on various types of domains, from broad ones such as news or Wikipedia to highly specialized ones such as scientific publications in a specific field. We therefore include domains that are representative of real-world problems and range from general to specialized.
(iii) Task difficulty: The benchmark should be challenging, so the difficulty of the included tasks must be adequate. If a task is easily solved by any algorithm, it is useless for comparing models. We assess several tasks based on the existing literature and select popular tasks that are recent, challenging, and not yet solved by existing methods.
(iv) Diverse annotation strategies: Creating retrieval datasets is inherently complex and suffers from annotation biases, which can prevent a fair comparison of methods. To reduce the effect of such biases, we selected datasets created in a number of different ways: some are annotated by crowd workers, some by experts, and some are based on feedback from large online communities.

In total, we included 18 English zero-shot evaluation datasets from 9 different retrieval tasks. Since most of the evaluated methods are trained on the MS MARCO dataset, we also report performance on this dataset, but do not include this result in our zero-shot comparison.

Table 1: Statistics of the datasets in the BEIR benchmark. Some datasets contain documents without titles. Relevancy indicates the relationship of a query to a document: binary (relevant / not relevant) or graded into sub-levels. Avg. D/Q is the average number of relevant documents per query.

Table 1 summarizes the statistics of the datasets in BEIR. Most datasets contain binary relevance judgments (relevant or not relevant), while a few contain fine-grained (graded) relevance judgments. Some datasets have only a small number of relevant documents per query (<2), while others such as TREC-COVID can have up to 500 relevant documents for a query. Of the 19 datasets (including MS MARCO), only 8 have training data, illustrating the practical importance of zero-shot retrieval benchmarks. All datasets except ArguAna have short queries (either a single sentence or 2-3 keywords). Figure 1 gives an overview of the tasks and datasets in the BEIR benchmark.

Figure 1: Overview of various tasks and datasets in the BEIR benchmark

Information retrieval (IR) is ubiquitous: there are many datasets for every task, and even more for retrieval in general. However, it is not feasible to incorporate all of them into an evaluation benchmark. We aim for a balanced mix of tasks and datasets and take care that specific tasks such as question answering are not over-represented. New datasets can easily be integrated into BEIR, and existing models can quickly be evaluated on any new dataset.

3.1 Dataset and diversity analysis

Datasets in BEIR are selected from different domains, ranging from Wikipedia, scientific publications, Twitter, and news to online user communities and more. To measure domain diversity, we use the pairwise weighted Jaccard similarity to compute domain overlap between datasets, i.e. the unigram word overlap between all dataset pairs. Figure 2 shows a heatmap of the pairwise weighted Jaccard scores and a force-directed layout of the resulting clustering: nodes (datasets) that are close together in the graph have high word overlap, while distant nodes have low overlap. From Figure 2, we observe that the weighted Jaccard word overlap across domains is quite low, which indicates that BEIR is a challenging benchmark and that methods must generalize well to different out-of-distribution domains.
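As an illustration of the overlap measure, the following is a minimal sketch of weighted Jaccard similarity over unigram counts. The toy corpora are made up, and the paper's exact preprocessing (tokenization, frequency weighting) may differ.

```python
# Sketch of the pairwise weighted Jaccard similarity used to measure domain overlap.
from collections import Counter

def weighted_jaccard(freq_a: Counter, freq_b: Counter) -> float:
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    vocab = set(freq_a) | set(freq_b)
    num = sum(min(freq_a[w], freq_b[w]) for w in vocab)
    den = sum(max(freq_a[w], freq_b[w]) for w in vocab)
    return num / den if den else 0.0

# Toy example with two tiny "corpora" represented by unigram counts.
a = Counter("the spike protein binds the receptor".split())
b = Counter("the court ruled on the protein patent".split())
print(round(weighted_jaccard(a, b), 3))
```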

Figure 2: Domain overlap between every pair of datasets in the BEIR benchmark. The heatmap (left) shows pairwise weighted Jaccard similarity scores between BEIR datasets; the 2D representation (right) uses NetworkX's force-directed placement algorithm. Datasets from different domains are colored and labeled differently.

3.2 BEIR software and framework

The BEIR software provides an easy-to-use Python framework (pip install beir) for model evaluation. It contains a large number of wrappers for replicating experiments and evaluating models from well-known repositories, including Sentence-Transformers, Transformers, Anserini, DPR, Elasticsearch, ColBERT and Universal Sentence Encoder, which makes the software useful both in academia and industry. The software also provides the common IR metrics, from precision, recall, MAP (Mean Average Precision) and MRR (Mean Reciprocal Rank) to nDCG (Normalized Discounted Cumulative Gain), for any top-k cutoff. One can use the BEIR benchmark to evaluate existing models on new retrieval datasets and to evaluate new models on the included datasets.
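As a usage illustration, here is a minimal evaluation sketch with the beir package, closely following its published examples. The dataset URL, class names and checkpoint name are assumptions that may differ across package versions.

```python
# Minimal zero-shot evaluation sketch with the beir package (pip install beir).
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load one BEIR dataset (URL follows the public examples).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# Any dense bi-encoder can be plugged in; TAS-B is one of the models evaluated in the paper.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=64)
retriever = EvaluateRetrieval(model, score_function="dot", k_values=[1, 10, 100])

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # e.g. {"NDCG@1": ..., "NDCG@10": ..., "NDCG@100": ...}
```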

Datasets are often scattered across the web and come in different file formats, making it difficult to evaluate models across datasets. BEIR introduces a standard format (corpus, queries, and qrels) and converts existing datasets into this simple common data format, enabling faster evaluation on a growing number of datasets.
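For illustration, the standard format roughly looks as follows. This is a hedged sketch based on the public BEIR datasets; exact field names may vary slightly across versions.

```
# corpus.jsonl — one JSON object per document
{"_id": "doc1", "title": "Example title", "text": "Example passage text ..."}

# queries.jsonl — one JSON object per query
{"_id": "q1", "text": "example query text"}

# qrels/test.tsv — tab-separated relevance judgments
query-id    corpus-id    score
q1          doc1         1
```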

3.3 Evaluation metric

Depending on the nature and requirements of real-world applications, retrieval tasks can be either precision- or recall-focused. To obtain comparable results across the different models and datasets in BEIR, we believe it is important to use a single evaluation metric that can be computed consistently across all tasks. Decision-support metrics such as precision and recall are unsuitable because they are not rank-aware. Binary rank-aware metrics such as MRR (Mean Reciprocal Rank) and MAP (Mean Average Precision) cannot assess tasks with graded relevance judgments. We find that Normalized Discounted Cumulative Gain (nDCG@k) provides a good balance for tasks with both binary and graded relevance judgments.
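For reference, a minimal single-query implementation of nDCG@k looks roughly like this; the benchmark itself relies on standard TREC-style evaluation tooling, so this is only an illustrative sketch.

```python
import math

def ndcg_at_k(retrieved_ids, relevance, k=10):
    """nDCG@k for one query.

    retrieved_ids: ranked list of document ids returned by the system.
    relevance: dict mapping document id -> graded relevance (0 = not relevant).
    """
    dcg = sum(
        (2 ** relevance.get(doc_id, 0) - 1) / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Works for binary (0/1) and graded (0/1/2) judgments alike.
print(ndcg_at_k(["d3", "d1", "d7"], {"d1": 2, "d2": 1}, k=10))
```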

4. Experimental setup

We use BEIR to compare different state-of-the-art retrieval architectures, focusing on transformer-based neural methods, and evaluate publicly available pretrained checkpoints. Due to the input-length limitation of transformer-based models, we only use the first 512 word pieces of each document in the experiments with neural architectures.
We group the models according to their architecture: (i) lexical, (ii) sparse, (iii) dense, (iv) late interaction and (v) reranking. Beyond the included models, the BEIR benchmark is model-agnostic, and different model configurations can easily be incorporated in the future.

(i) Lexical retrieval :
(a) BM25 is a commonly used bag-of-words retrieval function based on token matching between two high-dimensional sparse vectors with TF-IDF token weights. We use Anserini with the default Lucene parameters (k1=0.9 and b=0.4); a toy scorer showing where k1 and b enter the formula is sketched below. We index titles (if available) and passages as separate fields of a document. On our leaderboard, we also tested Elasticsearch BM25 and Anserini + RM3 expansion, but found that Anserini BM25 performed best.
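The following toy scorer is not Anserini/Lucene itself, but a hedged sketch that shows how the k1 and b parameters enter the BM25 score.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=0.9, b=0.4):
    """Toy BM25 scorer for one (query, document) pair.

    doc_freq: dict mapping term -> number of documents containing the term.
    k1 controls term-frequency saturation, b controls length normalization
    (the Anserini defaults used in the paper are k1=0.9, b=0.4).
    """
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```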

(ii) Sparse retrieval :
(a) DeepCT uses a BERT-base model trained on MS MARCO to learn context-aware term frequencies (tf). It generates a pseudo-document in which keywords are repeated according to the learned term frequencies. We use Dai and Callan's original setup, combined with BM25 using the default Anserini parameters, which we empirically find to outperform the parameters tuned on MS MARCO.
(b) SPARTA computes similarity scores between non-contextualized query embeddings from BERT and contextualized document embeddings. These scores can be precomputed for a given document, resulting in a 30k-dimensional sparse vector. We fine-tune a DistilBERT model on the MS MARCO dataset and use sparse vectors with 2,000 non-zero entries.
(c) docT5query is a popular document-expansion technique that uses a T5 (base) model trained on MS MARCO to generate synthetic queries, which are appended to the original document for lexical search. We replicate Nogueira and Lin's setup, generating 40 queries per document and using BM25 with the default Anserini parameters.

(iii) Dense retrieval :
(a) DPR is a two-tower bi-encoder trained with a single BM25 hard negative and in-batch negatives. We find that the open-sourced Multi model performs better than the single NQ model in our setting. The Multi-DPR model is a BERT-base model trained on four QA datasets (including passage titles): NQ, TriviaQA, WebQuestions and CuratedTREC.
(b) ANCE is a bi-encoder that constructs hard negatives from an Approximate Nearest Neighbor (ANN) index of the corpus, which is updated in parallel during fine-tuning to select hard negative training instances. We use the publicly available RoBERTa model trained for 600K steps on MS MARCO in our experiments.
(c) TAS-B is a bi-encoder trained with Balanced Topic-Aware Sampling, using dual supervision from a cross-encoder and a ColBERT model. The model is trained with a combination of a pairwise Margin-MSE loss and an in-batch negative loss. We use the publicly available DistilBERT model in our experiments.
(d) GenQ is an unsupervised domain-adaptation method that builds dense retrieval models by training on synthetically generated queries. First, we fine-tune a T5 (base) model for 2 epochs on MS MARCO. Then, for a target dataset, we use a combination of top-k and nucleus sampling (top-k: 25; top-p: 0.95) to generate 5 queries per document; a sketch of this generation step is given after this list. Due to resource constraints, we limit the number of target documents per dataset to 100K. For retrieval, we continue fine-tuning the TAS-B model on the synthetic (query, document) pairs with in-batch negatives. Note that GenQ creates a separate model for each task.
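A hedged sketch of the GenQ-style query-generation step with HuggingFace transformers. The checkpoint name is an assumption and only stands in for a T5-base model fine-tuned on MS MARCO.

```python
# GenQ-style synthetic query generation with top-k / nucleus sampling.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "BeIR/query-gen-msmarco-t5-base-v1"  # assumed public checkpoint
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

doc = "The spike protein of SARS-CoV-2 binds to the ACE2 receptor on human cells."
inputs = tokenizer(doc, truncation=True, max_length=350, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,          # sampling instead of beam search
    top_k=25,                # top-k sampling as described above
    top_p=0.95,              # nucleus sampling as described above
    max_length=64,
    num_return_sequences=5,  # 5 synthetic queries per document
)
for query in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(query)
```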

(iv) Late interaction :
(a) ColBERT encodes queries and passages into bags of multiple contextualized token embeddings. The late-interaction score is the sum, over query tokens, of the maximum dot product with all passage token embeddings (MaxSim), as sketched below. We use ColBERT as a dense retriever (end-to-end retrieval): we first retrieve the top-k candidates via ANN search with faiss (faiss depth = 100), and ColBERT then reranks them by computing the aggregated late interactions. We train a BERT-base-uncased model on the MS MARCO dataset with a maximum sequence length of 300 for 300K training steps.
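A minimal sketch of the MaxSim late-interaction scoring described above, written in PyTorch with random toy embeddings (not the actual ColBERT implementation).

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    query_emb: [num_query_tokens, dim] contextualized query token embeddings.
    doc_emb:   [num_doc_tokens, dim] contextualized document token embeddings.
    For each query token, take the maximum dot product over all document tokens,
    then sum over query tokens.
    """
    sim = query_emb @ doc_emb.T          # [num_query_tokens, num_doc_tokens]
    return sim.max(dim=1).values.sum()   # sum of per-query-token maxima

# Toy example with random embeddings (ColBERT uses 128-dim token vectors).
q = torch.randn(8, 128)
d = torch.randn(200, 128)
print(float(maxsim_score(q, d)))
```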

(v) Reranking model :
(a) BM25+CE reranks the top-100 hits retrieved by a first-stage BM25 (Anserini) model. We evaluate 14 different publicly available cross-attention reranking models from the HuggingFace Model Hub and find that a 6-layer MiniLM cross-encoder with a hidden size of 384 provides the best performance on MS MARCO. The model is trained on MS MARCO in a knowledge-distillation setup with an ensemble of three teachers (BERT-base, BERT-large and ALBERT-large), following the setup of Hofstätter et al. A usage sketch follows.
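A hedged usage sketch of cross-encoder reranking on top of BM25 candidates with sentence-transformers. The checkpoint name is an assumption standing in for the MiniLM cross-encoder described above.

```python
# Rerank first-stage BM25 candidates with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint

query = "what causes the lexical gap in retrieval?"
bm25_top100 = [
    "BM25 only matches documents that share terms with the query.",
    "Dense retrievers embed text into a shared vector space.",
    # ... remaining first-stage candidates
]

# Score every (query, candidate) pair with full cross-attention, then re-sort.
scores = reranker.predict([(query, doc) for doc in bm25_top100])
reranked = [doc for _, doc in sorted(zip(scores, bm25_top100), reverse=True)]
print(reranked[0])
```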

Training settings : The models included in the zero-shot evaluation were originally trained in different ways. DocT5query and DeepCT are trained for document expansion and term re-weighting. Both the cross-encoder (MiniLM) and SPARTA are trained on ranking data. All dense retrieval models (DPR, ANCE, and TAS-B) and ColBERT are trained with a mixture of ranking data and random in-batch negatives (see the sketch below). Another important difference lies in the hard negatives: some models are trained with carefully mined hard negatives while others use simpler ones, which may make the comparison somewhat unfair. DPR uses mined BM25 hard negatives, ColBERT uses the hard negatives provided with the original MS MARCO data, ANCE uses approximate hard negatives mined from an ANN index, and TAS-B uses BM25 hard negatives together with cross-model distillation from cross-encoder and ColBERT models.
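To illustrate the in-batch negatives objective mentioned above, here is a minimal PyTorch sketch. The actual models combine this with hard negatives and, for TAS-B, a Margin-MSE distillation loss, so this is only an illustrative simplification.

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Contrastive loss with random in-batch negatives.

    query_emb, doc_emb: [batch_size, dim]; doc_emb[i] is the positive for query_emb[i],
    and every other document in the batch serves as a negative.
    """
    scores = query_emb @ doc_emb.T           # [batch, batch] similarity matrix
    labels = torch.arange(scores.size(0))    # the diagonal holds the positives
    return F.cross_entropy(scores, labels)

# Toy batch of 4 query/document embedding pairs.
q = torch.randn(4, 768)
d = torch.randn(4, 768)
print(float(in_batch_negatives_loss(q, d)))
```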

5. Results and Analysis

This section evaluates and analyzes the performance of the retrieval models on the BEIR benchmark. Table 2 reports the results of all evaluated systems on the benchmark datasets. We use BM25 as the baseline and compare the other retrieval systems against it. Figure 3 shows on how many datasets each model performs better or worse than BM25.

Table 2: In-domain and zero-shot performance on the BEIR benchmark. All scores are nDCG@10. The best performance on a given dataset is in bold and the second best is underlined. Marked scores denote in-domain performance.
Figure 3: Comparison of zero-shot neural retrieval performance against BM25. The reranking-based model (BM25+CE) and the sparse model docT5query outperform BM25 on more than half of the BEIR evaluation datasets.

1. In-domain performance is not a good indicator of out-of-domain generalization . We observe that BM25 significantly underperforms neural methods by 7-18 points on in-domain MS MARCO. However, on BEIR it proves to be a strong baseline for generalization and often outperforms many more sophisticated methods. This underscores the point that retrieval methods must be evaluated on a broad range of datasets.

2. Term re-weighting fails out of domain, while document expansion captures out-of-domain keyword vocabulary . Both DeepCT and SPARTA use transformer networks to learn term weights. While the two methods perform well in-domain on MS MARCO, they perform worse than BM25 on almost all other datasets. In contrast, docT5query, based on document expansion, is able to add new relevant keywords to documents and performs well on the BEIR datasets: it outperforms BM25 on 11/18 datasets and is competitive on the remaining ones.

3. Dense retrieval models struggle with out-of-distribution data . Dense retrieval models (especially ANCE and TAS-B), which independently map queries and documents into a vector space, perform robustly on some datasets but significantly underperform BM25 on many others. For example, dense retrievers suffer on datasets with a large domain shift from the training data, such as BioASQ, or with a task shift, such as Touché-2020. DPR, the only model not trained on MS MARCO, has the worst generalization performance on the benchmark.

4. Reranking and late-interaction models generalize well to out-of-distribution data . The cross-attention reranking model (BM25+CE) performs best, outperforming BM25 on almost all (16/18) datasets. It only fails on ArguAna and Touché-2020, two retrieval tasks that differ greatly from the MS MARCO training data. The late-interaction model ColBERT computes token embeddings independently for queries and documents and scores (query, document) pairs via the MaxSim operation. It performs slightly weaker than the cross-attention reranker but still outperforms BM25 on 9/18 datasets. It appears that cross-attention and cross-attention-like operations are important for good out-of-distribution generalization.

5. A strong training loss for dense retrieval leads to better out-of-distribution performance . TAS-B provides the best zero-shot generalization among the dense retrievers, surpassing ANCE and DPR on 14/18 and 17/18 datasets, respectively. We speculate that the reason lies in TAS-B's strong training setup, which combines in-batch negatives with a Margin-MSE loss (using a strong ensemble teacher in a knowledge-distillation setting); this training setup shows strong generalization performance. At the same time, the loss also makes the TAS-B model more inclined to retrieve shorter documents, which we analyze next.

6. The TAS-B model is biased towards retrieving shorter documents . TAS-B underperforms ANCE on two datasets: by 17.3 points on TREC-COVID and 7.8 points on Touché-2020. We observe that the lengths of the documents retrieved by these models differ greatly, as shown in Figure 4. On TREC-COVID, the median length of documents retrieved by TAS-B is only 10 words, compared to 160 words for ANCE. Likewise, on Touché-2020 the median lengths are 14 vs. 89 words for TAS-B and ANCE, respectively. This preference for shorter or longer documents is likely due to the training loss used.

Figure 4: Distribution of the lengths of the top-10 documents retrieved by TAS-B (blue, top) and ANCE (orange, bottom). TAS-B is biased towards shorter documents on BEIR.

7. Does domain adaptation help improve the generalization of dense retrievers ? We evaluate GenQ, which further fine-tunes the TAS-B model on synthetic query data. It outperforms TAS-B in specialized domains such as scientific publications, finance, or StackExchange, while it underperforms the original TAS-B model in broader, more general domains such as Wikipedia.

5.1 Efficiency: Retrieval Latency and Index Size

At inference time, a model needs to compare a single query against millions of documents, and thus requires high computational speed to return results in real time. Besides speed, the index size is also critical, since the index is usually kept entirely in memory. We randomly sample 1 million documents from DBPedia and measure latency. For dense models we use exact search, while for the late-interaction model ColBERT we follow the original setting and use approximate nearest-neighbor search. CPU performance is measured on an 8-core Intel Xeon Platinum 8168 CPU @ 2.70GHz, and GPU performance on a single Nvidia Tesla V100 with CUDA 11.0.

Table 3: Estimated average retrieval latency for a single query and index sizes on DBPedia. Models are ordered from best to worst zero-shot performance on BEIR. Lower latency and smaller index sizes are better.

Trade-off between performance and retrieval latency : The best out-of-distribution generalization, achieved by reranking the top-100 BM25 documents or by the late-interaction model, comes at the cost of high latency (>350 ms); these are the slowest methods at inference time. In contrast, dense retrievers are 20-30 times faster (<20 ms) than the reranking model and are among the lowest-latency models. On CPU, the sparse models have a speed advantage (20-25 ms).

Trade-off between performance and index size : The lexical, reranking, and dense methods have the smallest index sizes (<3GB) for storing the 1 million documents from DBPedia. SPARTA requires the second-largest index to store its 30k-dimensional sparse vectors, while ColBERT requires the largest index because it stores multiple 128-dimensional dense vectors per document. The index size becomes especially relevant as the corpus grows: ColBERT needs about 900GB to store the BioASQ index (about 15M documents), while BM25 only needs 18GB.

6. Effect of Annotation Selection Bias

Creating a fully unbiased retrieval evaluation dataset is inherently complex and is influenced by (i) the annotation guidelines, (ii) the annotation setup, and (iii) the human annotators. Furthermore, it is not feasible to manually annotate the relevance of all (query, document) pairs. Instead, existing retrieval methods are used to obtain a pool of candidate documents, which are then labeled for relevance; all other, unseen documents are considered irrelevant. This is a source of selection bias: a new retrieval system may retrieve substantially different results than the systems used for annotation, and these hits are automatically treated as irrelevant.

We find that many BEIR datasets suffer from lexical bias, i.e. lexical retrieval systems such as TF-IDF or BM25 were used to retrieve the candidate documents for annotation. For example, in BioASQ, candidate documents for annotation are retrieved by term matching with boosting. Creating Signal-1M (RT) involved retrieving tweets for queries, and 7 of the 8 techniques used rely on lexical term-matching signals. This lexical bias disadvantages methods that do not rely on lexical matching, such as dense retrieval: hits retrieved without lexical overlap are automatically considered irrelevant, even though they may well be relevant to the query.

To investigate the impact of this bias, we conduct a study on the recent TREC-COVID dataset. TREC-COVID uses a pooling approach to reduce the impact of the aforementioned bias: the annotation pool is constructed from the search results of the individual systems participating in the challenge. Table 4 shows the Hole@10 rate of the tested systems, i.e. how many of the top-10 documents retrieved by each system were not seen by the annotators.
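Hole@k is straightforward to compute from a system's ranking and the qrels; a minimal sketch (with made-up document ids) follows.

```python
def hole_at_k(ranked_doc_ids, judged_doc_ids, k=10):
    """Fraction of a system's top-k results that were never judged by annotators.

    ranked_doc_ids: ranked list of retrieved document ids for one query.
    judged_doc_ids: set of document ids that appear in the qrels for that query
                    (with any relevance label, including 'not relevant').
    """
    top_k = ranked_doc_ids[:k]
    unjudged = [doc_id for doc_id in top_k if doc_id not in judged_doc_ids]
    return len(unjudged) / len(top_k) if top_k else 0.0

# Example: 3 of the top 10 hits were never seen by annotators -> Hole@10 = 0.3
print(hole_at_k([f"d{i}" for i in range(10)], {"d0", "d1", "d2", "d4", "d5", "d6", "d9"}))
```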

Table 4: Hole@10 analysis on TREC-COVID. The additional scores show the performance after annotating the holes (top-10 hits not seen by the annotators) for each model.

The results show large differences between methods: lexical methods like BM25 and docT5query have rather low Hole@10 rates of 6.4% and 2.8%, respectively, indicating that the annotation pool contains most of the top hits of lexical retrieval systems. In contrast, the Hole@10 rates of dense retrieval systems such as ANCE and TAS-B are 14.4% and 31.8%, respectively, suggesting that many of the hits found by these systems were never judged by the annotators. Next, we manually annotated the missing judgments (holes) for all systems, following the original annotation guidelines. During annotation, we did not know which system had retrieved a given document, to avoid preference bias. In total, we annotated 980 additional query-document pairs in TREC-COVID and then recomputed nDCG@10 for all systems with these additional judgments.

As Table 4 shows, we observe only slight improvements for lexical methods, e.g. docT5query only improves from 0.713 to 0.714 after adding the missing relevance judgments. In contrast, for the dense retrieval system ANCE, the performance improves from 0.654 (slightly below BM25) to 0.735, which is 6.7 points above the performance of BM25. ColBERT shows a similar improvement (5.8 points). Although many systems contributed to the TREC-COVID annotation pool, the pool is still biased towards lexical approaches.

7. Conclusions and future work

In this work, we propose BEIR: a heterogeneous benchmark for information retrieval. We provide a broad selection of target tasks, from narrow expert domains to open-domain datasets, covering 9 different retrieval tasks across 18 datasets.

By open-sourcing BEIR and providing a standardized data format and easily adaptable code examples for many different retrieval strategies, we take an important step toward a unified benchmark for evaluating the zero-shot capabilities of retrieval systems. We hope it will drive innovation towards more powerful retrieval systems and new insights into which retrieval architectures perform well across different tasks and domains.

We investigate the effectiveness of ten different retrieval models and demonstrate that in-domain performance does not predict a method's generalization ability in the zero-shot setting. Many methods that outperform BM25 in the in-domain evaluation on MS MARCO perform poorly on the BEIR datasets. Cross-attention reranking, the late-interaction model ColBERT, and the document-expansion technique docT5query perform well overall on the evaluated tasks.

Our study of annotation selection bias highlights the challenges of evaluating new models on existing datasets: even though TREC-COVID is based on pooled predictions from many systems contributed by different teams, we find that the Hole@10 rates of the tested systems vary widely, which negatively impacts non-lexical methods. For a fair evaluation of retrieval methods, better datasets with more diverse pooling strategies are needed. By integrating a large number of different retrieval systems into BEIR, creating such diverse annotation pools becomes much simpler.

8. Limitations of the BEIR benchmark

Although BEIR covers a wide range of tasks and domains, no benchmark is perfect, and BEIR has its limitations. Identifying them is key to interpreting the benchmark results and to designing better benchmarks in the future.

1. Multilingual tasks: While we aim to build a diverse retrieval evaluation benchmark, due to the limited availability of multilingual retrieval datasets, all datasets currently covered in the BEIR benchmark are in English. As a next step for the benchmark, it is worth adding more multilingual datasets (given the selection criteria). Future work could include multi- and cross-lingual tasks and models.

2. Long document retrieval: Most of our tasks have an average document length of a few hundred words, roughly a few paragraphs. Including tasks that require retrieving much longer documents would be highly relevant. However, since transformer-based methods typically have an input limit of 512 word pieces, a fundamentally different setup is required to compare methods on such tasks.

3. Multi-factor retrieval: So far, BEIR focuses on plain text retrieval. In many real-world applications, further signals are used to estimate document relevance, such as PageRank, recency, authority scores or user interactions such as click-through rate. Integrating these signals is often not straightforward for the tested methods and is an interesting research direction.

4. Multi-field retrieval: Retrieval can often be performed over multiple fields. For example, scientific publications have a title, abstract, document body, author list, and journal name. So far, we have only looked at datasets with one or two fields.

5. Task-specific models: In our benchmark, we focus on evaluating models that generalize to a broad range of retrieval tasks. Naturally, for certain tasks or domains, specialized models can easily outperform general-purpose models because they focus on a single task, such as question answering, and perform it well. Such task-specific models do not necessarily need to generalize across all tasks.
