Transformer Memory as a Differentiable Search Index paper reading

Title: Transformer Memory as a Differentiable Search Index

This paper demonstrates that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model mapping string queries directly to relevant docids; in other words, a DSI model can answer a query using only its parameters, which greatly simplifies the entire retrieval process.

We investigate variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between model and corpus size. Experiments show that, with appropriate design choices, DSI significantly outperforms strong baselines such as dual-encoder models. Furthermore, DSI demonstrates strong generalization capabilities, surpassing a BM25 baseline in the zero-shot setting.

Introduction

Information retrieval (IR) systems map a user query q ∈ Q to a ranked list {d1, ..., dn} ⊆ D of relevant documents, usually represented by integers or short strings called document identifiers (docids). The most widely used IR approaches are based on a pipelined retrieve-then-rank strategy. For retrieval, methods based on inverted indexes or nearest-neighbor search are common, while dual encoders trained with contrastive learning are the state of the art.

This paper proposes an alternative architecture, in which a sequence-to-sequence (seq2seq) learning system is used to directly map a query q to a relevant docid j ∈ Y. This scheme, shown in the bottom half of Figure 1, is a sequence-to-sequence encoder-decoder architecture.

Figure 1: Comparison of Dual Encoder (top) and Differentiable Search Index (bottom)

We call this proposed architecture a Differentiable Search Index (DSI), and implement it as a large pretrained Transformer model, building on recent work on large generative language models (LMs). In this architecture, all information about the corpus is encoded in the parameters of the Transformer language model.

At inference time, the trained model takes a text query q as input and outputs a docid j. If desired, beam search can be used to produce a ranked list of potentially relevant docids. As we show, this process can be surprisingly effective when trained properly. In our experiments it consistently outperforms a dual-encoder (DE) baseline, sometimes by a large margin: for a base-sized T5 model, Hits@1 improves by more than 20 points on the smallest corpus, from 12.4% for the DE to 33.9% for DSI, while on a corpus 30 times larger the performance improves by nearly 7 points. These gains increase with larger models: for an 11B-parameter T5 model, Hits@1 exceeds the DE by more than 25 points on the small corpus and more than 15 points on the large corpus. DSI also performs very well in the zero-shot setting, e.g. a 14-point improvement over BM25.

Besides these quantitative improvements, the DSI architecture is also much simpler than that of a DE (see Table 1). A DE system fixes a search procedure (MIPS) and learns internal representations to optimize the performance of that procedure; in contrast, a DSI system contains no special-purpose fixed search procedure, and instead uses standard model inference to map from an encoding of the query to a docid.

As Table 1 suggests, DSI is of particular interest to the machine learning community, because various aspects of retrieval are mapped onto well-established ML tasks. This could lead to new approaches to long-standing IR problems. As an example, since indexing is now a special case of model training, incrementally updating an index becomes a special case of model updating.

Table 1: Information retrieval requires a series of decisions related to the subproblems of document representation, indexing, and retrieval. Structured-docid variants of DSI are also sensitive to a fourth decision: how docids are represented.

In this paper, DSI is applied to moderately sized corpora (ranging from 10k to 320k documents), all derived from a challenging retrieval task, and we leave the important issue of scaling DSI to larger corpora to future work. The task we consider is to retrieve supporting passages for questions from the Natural Questions (NQ) dataset, a challenging task for lexical models.

While the idea of DSI is simple, there are many ways to realize it, some of which work surprisingly well and others surprisingly poorly. Below we explore several variants of the DSI architecture.

Document representation : We explore several ways to represent documents, including a "naive" approach that uses the full text of a document, as well as variants of the bag-of-words representations used by traditional IR engines.

Docid representation : We consider several ways to represent docids. In addition to representing integers as text strings, we consider unstructured atomic docids, where each document is assigned a single unique token, and a simple baseline for constructing structured semantic docids that describe how to navigate to a document through hierarchical clustering of the corpus. Structured docids – either semantically structured via clustering or naively structured as tokenizable integer strings – scale better to large corpora, where the vocabulary the decoder must use for atomic docids becomes very large.

Indexing : A trainable IR system traditionally has two phases: indexing the corpus (i.e. memorizing information about each document), and learning how to retrieve effectively from the index. In DSI, the index is stored in the model parameters, and indexing is simply another kind of model training. Figure 1 suggests one way of indexing a corpus: training on (1) examples (dj, j) that pair a document dj with its docid j, in addition to (2) examples (q, j) that pair a query q with a relevant docid j. In this setup, the examples of type (1) are the "indexing" examples.
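As a minimal illustration (not the paper's data pipeline; the `docs` and `queries` structures below are hypothetical), the two example types can be assembled as plain (input, target) text pairs for a seq2seq trainer:

```python
# Hypothetical toy corpus and labeled queries, for illustration only.
docs = {
    "742": "the transformer architecture relies entirely on attention ...",
    "108": "natural questions is a benchmark for open-domain question answering ...",
}
queries = [("what benchmark is used for open domain qa", "108")]

# Type (1) "indexing" examples: document text -> docid.
indexing_examples = [(text, docid) for docid, text in docs.items()]
# Type (2) retrieval examples: query -> docid.
retrieval_examples = list(queries)

# Both kinds are ordinary text-to-text pairs for the same seq2seq model.
train_examples = indexing_examples + retrieval_examples
```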

While it is clear that examples of type (2) alone do not provide enough information for a system to generalize to new queries, there are many alternatives to examples of type (1) that could plausibly "teach" the model the associations between documents and docids. We explore several of these below and show that some plausible-looking techniques perform very poorly. We also explore alternative multi-task optimization and curriculum learning schemes for combining these types of examples.

Effects of model and corpus size : Since recent results suggest that some properties of large LMs emerge only at very large model sizes, we explore the performance of DSI across a range of model sizes and corpus sizes of 10k, 100k, and 320k documents.

Summary : We show that even naive representations of documents and docids, coupled with appropriate training procedures for fine-tuning modern large LMs, can perform surprisingly well; we propose two improved docid representations, unstructured atomic docids and semantically structured docids, which improve over the naive representations. We show that performance varies widely across indexing/training strategies, and that DSI performance improves significantly and consistently with model size. To our knowledge, this is the first case of generative indexing improving over strong baselines on a well-studied document retrieval task.

Related Work

Autoregressive entity linking describes a related sequence-to-sequence system, in which a mention of an entity in a document – perhaps an implicit mention, for example a question to which the entity is the answer – is mapped to the canonical name of that entity. In the case of Wikipedia, canonical entity names correspond to page titles, so this can be viewed as a type of document retrieval. The approach has been adapted for other uses, such as generating knowledge base triples in canonical form.

The task we consider differs from autoregressive entity linking: our goal is to retrieve documents that contain the answer, not documents whose title is the answer. More importantly, in autoregressive entity linking the generation target is a semantically meaningful name, whereas we allow the target to be an arbitrary docid. This makes our method applicable to general retrieval tasks, but it also raises new questions about docid representation and indexing strategies.

In autoregressive entity linking, generation is constrained to return an output from a fixed set. It is also possible to constrain the output generated by DSI to be a valid docid. Although we did not use this technique, the extent to which it would improve performance is a question worth exploring.

There is a large body of work on retrieval-augmented generation, i.e. retrieving auxiliary documents to augment a language model. These techniques are useful for many tasks, including question answering, but rely on traditional retrieval methods such as DEs. Here, instead of augmenting generation with retrieval, we replace the retrieval process with generation.

Dual encoders are a well-established retrieval paradigm. The key idea is to produce query and document embeddings independently and perform similarity search over the embeddings in a shared vector space. Query and candidate document embeddings are produced by a sequence encoder trained with a form of contrastive loss.
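For reference, a minimal numpy sketch of this kind of contrastive objective with in-batch negatives (a generic formulation, not the exact loss used for the paper's DE baseline):

```python
import numpy as np

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """Softmax contrastive loss with in-batch negatives.

    query_emb, doc_emb: (batch, dim) arrays; row i of doc_emb is the positive
    document for query i, and the other rows in the batch serve as negatives.
    """
    scores = query_emb @ doc_emb.T / temperature          # (batch, batch) similarity matrix
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # cross-entropy on the diagonal
```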

The interpretation of large Transformer models as memory stores has been investigated in previous work. (Roberts et al., 2020) demonstrated success on a closed-book QA (CBQA) task by training a T5 model to retrieve facts encoded in its parameters during pre-training. However, unlike CBQA, the task in this paper is to retrieve complete documents given a question, rather than to generate the answer directly. Meanwhile, "Language Models as Knowledge Bases?" also studied language models as knowledge bases and found that pre-trained LMs may already contain relational knowledge. "Transformer feed-forward layers are key-value memories" analyzes the knowledge encoded in Transformer feed-forward layers. There is also work relating Transformers to associative memory and Hopfield networks, which reinforces the notion that Transformers should intuitively serve as a good associative memory store or search index.

Differentiable Search Index (DSI)

The core idea of the Differentiable Search Index (DSI) is to fully parameterize the traditionally multi-stage retrieve-then-rank pipeline within a single neural model. To do so, a DSI model must support two basic modes of operation.
1. Indexing: a DSI model should learn to associate the content of each document dj with its corresponding docid j. This paper adopts a straightforward sequence-to-sequence (seq2seq) approach that takes document tokens as input and generates identifiers as output.
2. Retrieval: Given an input query, a DSI model should return a ranked list of candidate docids. Here, this is achieved by autoregressive generation.

With these two operations, a DSI model can be trained to index a corpus of documents, optionally fine-tuned on an available labeled dataset (queries and labeled documents), and then used to retrieve relevant documents – all within a single unified model. In contrast to retrieve-then-rank approaches, this type of model allows simple end-to-end training and can easily be used as a differentiable sub-component of a larger, more complex neural model.

Indexing Strategies

We study various indexing strategies designed to learn associations between documents and their identifiers. We train our model to predict docids given a sequence of document tokens. This enables the model to learn which identifier belongs to which document, and can be thought of as a differentiable take on a traditional search index. We consider various alternatives and ablate these settings in subsequent sections. The final strategy adopted is Inputs2Targets with direct indexing.

Indexing Methods

This section discusses the indexing task variants we consider.
Inputs2Targets : We frame this as a seq2seq task of doc_tokens → docid. As the name suggests, this binds the document tokens to the docid in a direct input-to-target fashion. The benefit is that the identifier is the denoising target, which places it closer to the loss function. Since the retrieval task is also concerned with predicting identifiers, this formulation lets the network follow a similar input-target balance with respect to sequence length. A potential weakness is that the document tokens are not denoising targets, so there is no opportunity for general pre-training on document tokens.

Targets2Inputs : This formulation considers the inverse of the above, i.e. docid → doc_tokens, where document tokens are generated from identifiers. Intuitively, this is equivalent to training an autoregressive language model conditioned on docid.

Bidirectional : This formulation trains both Inputs2Targets and Targets2Inputs within the same co-training setup. A prefix token is prepended so that the model knows which direction the task is being performed in.

Span corruption : We also explored a setting that performs span-corruption-based denoising with docid tokens included. In this approach, we concatenate the identifier to the document tokens as a prefix, and spans are then randomly masked as in standard span corruption. The advantages of this approach are that (1) general pre-training also takes place during indexing, and (2) a good balance of docids as denoising targets and as inputs is achieved.
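To make the four formulations concrete, the sketch below shows how one document and its docid might be turned into (input, target) pairs under each variant (the prefix strings and the toy span-corruption helper are illustrative assumptions, not the paper's exact formats):

```python
import random

def toy_span_corruption(tokens):
    """Toy single-span corruption: mask one random contiguous span with a sentinel."""
    n = max(1, len(tokens) // 6)
    start = random.randrange(0, len(tokens) - n + 1)
    corrupted = tokens[:start] + ["<X>"] + tokens[start + n:]
    target = ["<X>"] + tokens[start:start + n]
    return " ".join(corrupted), " ".join(target)

doc_tokens, docid = ["sparse", "retrieval", "with", "inverted", "indexes"], "1037"

# Inputs2Targets: doc_tokens -> docid
i2t = (" ".join(doc_tokens), docid)
# Targets2Inputs: docid -> doc_tokens
t2i = (docid, " ".join(doc_tokens))
# Bidirectional: both directions, distinguished by a task prefix on the input.
bidirectional = [("index: " + " ".join(doc_tokens), docid),
                 ("generate: " + docid, " ".join(doc_tokens))]
# Span corruption: the docid is prepended to the document tokens, and the
# concatenation is used as an ordinary span-corruption denoising example.
span_example = toy_span_corruption([docid] + doc_tokens)
```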

Document Representation Strategies

The previous section explored "how to index". This section examines "what to index?", i.e. how best to represent doc_tokens. We state our choices here and ablate them in later experiments. The final best option is the direct indexing method.

Direct indexing : This strategy represents a document exactly. We take the first L tokens of a document, preserving order, and associate them with the docid.

Set indexing : Documents may contain repeated terms and/or non-informative words (e.g. stopwords). This strategy uses the default Python set operation to remove duplicate terms, and additionally removes stopwords from the document. The rest of the filtered document is passed to the model in the same way as direct indexing.

Inverted index : This strategy maps chunked documents (contiguous blocks of tokens), rather than entire documents, directly to the docid. We randomly subsample a single contiguous chunk of k tokens and associate it with the docid. The main advantage of this approach is that it allows looking beyond the first k tokens.
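A minimal sketch of the three document representation strategies, operating on a whitespace-tokenized document (the tokenization and the tiny stopword list are simplifying assumptions; the paper works with T5 subword tokens):

```python
import random

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # toy stopword list

def direct_index(tokens, L=32):
    """Direct indexing: keep the first L tokens, preserving order."""
    return tokens[:L]

def set_index(tokens):
    """Set indexing: drop duplicate terms and stopwords (first occurrence kept)."""
    seen, kept = set(), []
    for tok in tokens:
        if tok not in seen and tok not in STOPWORDS:
            seen.add(tok)
            kept.append(tok)
    return kept

def inverted_index(tokens, k=32):
    """Inverted index: a randomly sampled contiguous chunk of k tokens."""
    if len(tokens) <= k:
        return list(tokens)
    start = random.randrange(0, len(tokens) - k + 1)
    return tokens[start:start + k]
```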

Representing Docids for Retrieval

Retrieval in the seq2seq-based DSI model is done by decoding docids given an input query. How to decode efficiently depends largely on how the docids are represented in the model. The remainder of this section explores possible ways of representing docids and how their decoding is handled.

Unstructured atomic identifiers : The most naive way of representing documents is to assign each document an arbitrary (possibly random) unique integer identifier. We call these unstructured atomic identifiers.

Given these identifiers, an obvious decoding approach is to learn a probability distribution over the identifiers. In this case, the model is trained to emit one logit for each unique docid (|Ndocuments| of them). This is analogous to the output layer of a standard language model, but extended to include docids. To accommodate this, we expand the output vocabulary of a standard language model as follows:

O = Softmax([W_tokens ; W_docs]^T · h_last)

where [;] is the column-wise concatenation operator, W_tokens ∈ R^(dmodel × |Ntokens|) and W_docs ∈ R^(dmodel × |Ndocuments|), and h_last ∈ R^(dmodel) is the last hidden state of the decoder stack. To retrieve the top-k documents for a given query, we simply sort the output logits and return the corresponding indices. This is reminiscent of standard listwise learning-to-rank, where all documents are considered simultaneously.
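In code, the extended output distribution amounts to appending one column per docid to the output projection; a rough numpy sketch with hypothetical sizes standing in for W_tokens, W_docs, and h_last:

```python
import numpy as np

d_model, n_tokens, n_docs = 512, 32000, 1000          # illustrative sizes

W_tokens = np.random.randn(d_model, n_tokens)         # standard token output embeddings
W_docs = np.random.randn(d_model, n_docs)             # one extra column per docid
h_last = np.random.randn(d_model)                     # last decoder hidden state

# O = Softmax([W_tokens ; W_docs]^T h_last)
logits = np.concatenate([W_tokens, W_docs], axis=1).T @ h_last
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Top-k retrieval: rank the docid slice of the logits and return the indices.
doc_logits = logits[n_tokens:]
top_k_docids = np.argsort(-doc_logits)[:10]
```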

Naively structured string identifiers : We also consider the seemingly unreasonable approach of treating unstructured identifiers, i.e. arbitrary unique integers, as tokenizable strings. We call these naively structured identifiers.

In this representation, retrieval is accomplished by sequentially decoding one docid string at a time. This removes the need for the huge softmax output space that comes with unstructured atomic identifiers. It also removes the need to learn embeddings for each individual docid.

For decoding, beam search is used to obtain the predicted best docid. With this strategy, obtaining a top-k ranking is less straightforward. One could exhaustively comb through the entire docid space and obtain the likelihood of each docid given the query. Instead, we use the partial beam search tree to construct top-k retrieval scores. We find this approximation to be quite efficient and effective in practice.
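For intuition about how decoding can be restricted to valid identifier strings (a constraint the paper mentions in related work but does not use, and the trie structure that Figure 2's beam search navigates), here is a minimal sketch of a character-level trie over docids and the legal next tokens for a given prefix:

```python
def build_trie(docids):
    """Build a character-level trie over the valid docid strings."""
    root = {}
    for docid in docids:
        node = root
        for ch in docid:
            node = node.setdefault(ch, {})
        node["</s>"] = {}          # end-of-sequence marker for a complete docid
    return root

def allowed_next(trie, prefix):
    """Characters (or end-of-sequence) that may legally follow `prefix` during decoding."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return set()
        node = node[ch]
    return set(node.keys())

trie = build_trie(["1234", "1290", "56"])
print(allowed_next(trie, "12"))     # {'3', '9'}
print(allowed_next(trie, "1234"))   # {'</s>'}
```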

Figure 2: Visual example of the hierarchical clustering process for assigning semantically structured identifiers. During inference, beam search navigates this trie to decode the correct docid.

Semantically structured identifiers : All of the approaches to representing docids so far assume that identifiers are assigned arbitrarily. While exploring the limits of arbitrary identifiers is quite interesting, intuition suggests that injecting semantic structure into the docid space can lead to better indexing and retrieval. Therefore, this section explores semantically structured identifiers.

Specifically, our goal is to automatically create identifiers that satisfy the following properties: (1) a docid should capture some information about the semantics of its associated document, and (2) docids should be structured so that the search space is effectively reduced after each decoding step. The result is that semantically similar documents share identifier prefixes.

In this work, we treat this as a fully unsupervised preprocessing step. However, as future work, it may be possible to integrate and automatically learn semantic identifiers in a fully end-to-end manner.

To construct identifiers with this property, we employ a simple hierarchical clustering procedure on document embeddings to induce a decimal tree (or more generally, a trie).

Given a corpus to be indexed, all documents are first clustered into 10 clusters. Each document is assigned an identifier whose first digit is its cluster number, 0-9. For every cluster containing more than c documents, the algorithm is applied recursively, and the result of the next level (the remaining suffix of the identifier) is appended to the existing identifier.

For clusters with c documents or fewer, each element is assigned an arbitrary number from 0 to c-1, and this number is likewise appended to the existing identifier. Although this specific procedure induces a decimal tree, any number of other reasonable strategies could induce a similar kind of trie. In practice, we simply apply k-means to embeddings produced by a small 8-layer BERT model, with c = 100.
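A minimal sketch of this recursive procedure, assuming document embeddings are already computed (scikit-learn's KMeans stands in here; the paper uses embeddings from a small 8-layer BERT model, 10 clusters per level, and c = 100):

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_semantic_docids(embeddings, indices=None, prefix="", k=10, c=100):
    """Recursively cluster documents and build semantically structured docid strings.

    Returns a dict mapping document index -> docid string; documents that fall
    into the same cluster at any level share that prefix of their identifier.
    """
    if indices is None:
        indices = np.arange(len(embeddings))
    if len(indices) <= c:
        # Small cluster: assign arbitrary numbers 0 .. len-1 as the final suffix.
        return {int(idx): prefix + str(pos) for pos, idx in enumerate(indices)}
    docids = {}
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings[indices])
    for cluster in range(k):
        members = indices[labels == cluster]
        if len(members) > 0:
            docids.update(assign_semantic_docids(embeddings, members,
                                                 prefix + str(cluster), k=k, c=c))
    return docids

# Example with random vectors standing in for BERT document embeddings.
doc_embeddings = np.random.randn(5000, 64)
semantic_docids = assign_semantic_docids(doc_embeddings)
```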

Training and Optimization

The DSI models we train are optimized with a seq2seq cross-entropy loss using teacher forcing. We explore two main strategies for training DSI models. The first and more straightforward strategy is to first train a model to perform indexing (memorization), followed by a fine-tuning stage in which the trained model maps queries to docids (i.e. retrieval). The second strategy is to train both tasks together in a multi-task setup. To this end, we structure the co-trained tasks in a manner similar to T5-style co-training (e.g., using task prompts to differentiate them). The latter performs significantly better, especially with a suitable proportion of indexing to retrieval task examples. Therefore, we adopt multi-task learning as the default strategy.
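A sketch of how such a multi-task mixture might be assembled, with task prompts distinguishing the two tasks and a ratio r of indexing to retrieval examples per batch (the prompt strings and sampling scheme are illustrative assumptions; the ratio r is the quantity studied later in Figure 4):

```python
import random

def sample_multitask_batch(indexing_examples, retrieval_examples, batch_size=128, r=32):
    """Sample a batch with roughly r indexing examples per retrieval example.

    Each example is an (input_text, target_docid) pair; a task prompt on the
    input tells the model which of the co-trained tasks it is performing.
    """
    n_retrieval = max(1, batch_size // (r + 1))
    n_indexing = batch_size - n_retrieval
    batch = [("index: " + doc, docid)
             for doc, docid in random.choices(indexing_examples, k=n_indexing)]
    batch += [("retrieve: " + query, docid)
              for query, docid in random.choices(retrieval_examples, k=n_retrieval)]
    random.shuffle(batch)
    return batch
```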

Here we observe that our setting is unique and differs from traditional multi-task learning or transfer learning. In a typical multi-task setting, the two tasks share something in common that can improve the performance of both when they are learned together. However, in our setting the retrieval task is completely dependent on the indexing task: the identifiers used by the retrieval task are completely meaningless without the indexing task. Therefore, in order to solve task B (retrieval), the model must fully learn task A (indexing). This problem setting presents unique and largely unexplored research challenges that may be of interest to the ML community.

Experiments

This section discusses the experimental setup, the datasets used, and comparisons to baselines. We also discuss experimental results, findings, and the effects of the various strategies discussed earlier in the paper. As this is a fairly new concept, the purpose of this work is to present a proof of concept and answer research questions, not to make a "SOTA-esque" comparison. We leave extensive comparisons on other settings and against other baselines to future work.

Dataset : We conduct experiments on the challenging Natural Questions (NQ) dataset. NQ consists of 307K query-document training pairs and 8K validation pairs, where a query is a natural-language question and a document is a Wikipedia article. Given a question, the retrieval task is to identify the Wikipedia article that answers it. To evaluate how DSI models perform at different scales, we construct three sets from NQ to form our testbed, namely NQ10K, NQ100K, and NQ320K, denoting different numbers of total query-document pairs in the combined training and validation splits. NQ320K is the full NQ set and is evaluated using its predetermined training and validation splits. Unlike NQ320K, NQ10K and NQ100K use randomly sampled validation sets. For all datasets, we use the same docid space/budget of 320K tokens for all unstructured atomic and naively structured identifier experiments. Semantically structured identifiers are generated separately for each dataset to prevent semantic information from leaking from larger splits into smaller splits. Text is lowercased. Note that in these datasets there are fewer unique documents than query-document pairs.

Implementation details : All DSI models are initialized from standard pretrained T5 model configurations. The configuration names and corresponding numbers of parameters are Base (0.2B), Large (0.8B), XL (3B), and XXL (11B). For runs with unstructured atomic identifiers, we randomly initialize the identifier embeddings as new parameters and only fine-tune their weights during the indexing stage. We use the Jax/T5X implementation for our experiments. DSI models are trained with a batch size of 128 for a maximum of 1M steps. We pick the best checkpoint based on retrieval validation performance. Our training hardware consists of 128-256 TPUv4 chips for models above 1B parameters; the other models use 64-128 TPUv3 or TPUv4 chips. As a rough estimate, models above 1B parameters typically take at least a full day to converge on NQ320K. We tune the learning rate in {0.001, 0.0005} and the linear warmup over {10K, 100K, 200K, 300K} steps and/or none. Semantically structured identifiers are generated with an 8-layer BERT model and the default k-means clustering in scikit-learn. Based on our early ablation experiments over various DSI settings, the main results presented use the direct indexing (L = 32) and Inputs2Targets indexing strategies. We present results for all docid representation methods, and provide ablation studies for the main results.

Baselines

For baselines, we use a T5-based dual encoder and BM25; we use the gensim package to compute BM25 scores. For the T5-based dual encoder, we train on the NQ pairs with contrastive learning until convergence (≈10K steps) and obtain the top-k nearest neighbors with a ScaNN-like system. For zero-shot retrieval, we also compare with Sentence T5, a state-of-the-art unsupervised baseline that has been specifically pretrained for similarity learning. Two reasons lead us to consider Sentence T5 the relevant dual-encoder baseline for this work, rather than other dense retrieval work such as DPR.

First, we employ exactly the same pretrained model, which enables a controlled comparison of the proposed method without confounding factors. Scientifically, we consider this comparison to a fine-tuned T5 the best we can offer. Second, the fine-tuned T5 dual encoder is essentially identical to DPR in architecture and approach (with minor differences such as parameter sharing, but using the same concept of in-batch negatives).

Experimental results

Table 2: Experimental results on NQ document retrieval. DSI outperforms BM25 and dual-encoder baselines. Among all Docid representation methods, semantic string Docids perform best.
Table 3: Experimental results on zero-shot learning for NQ document retrieval. DSI outperforms BM25, T5 embedding and SentenceT5, the state-of-the-art unsupervised similarity modeling methods. Among Docid representation methods, Atomic Docid performs best in zero-shot learning.

Table 2 reports the fine-tuned retrieval results for NQ10K, NQ100K, and NQ320K, and Table 3 reports the retrieval results for zero-shot learning. For zero-shot learning retrieval, the model is only trained on the indexing task, not on the retrieval task, so the model does not see labeled query→docid data points.

Supervised fine-tuning results : Our results show that DSI outperforms DE across all dataset sizes. On the small dataset (NQ10K), the performance gap between DSI and DE is large; e.g. the best DSI variant outperforms DE by a factor of 2. On NQ100K the gap becomes less pronounced, with the best DSI model (unstructured atomic identifiers) outperforming DE by +5% Hits@1 and Hits@10. On the large dataset (NQ320K), the best DSI model (structured semantic identifiers) outperforms the best DE model by +66% and +4.5% on Hits@1 and Hits@10, respectively.

Zero-shot learning : Table 3 reports the zero-shot retrieval results. To recap, zero-shot retrieval is performed by training only on the indexing task and not the retrieval task; in other words, the model has not seen any annotated query-document pairs. In general, on NQ100K and NQ320K the best results are obtained by DSI with unstructured atomic identifiers. Performance on all NQ datasets surpasses well-established unsupervised retrieval baselines such as BM25. Furthermore, DSI outperforms unsupervised representation learning methods such as SentenceT5, which is trained to learn similarity-aware representations via contrastive learning. We also note that raw T5 embeddings perform extremely poorly and do not produce reasonable results on the unsupervised retrieval task. Given that unsupervised neural methods often struggle to outperform BM25, we find these early results very encouraging.

Document identifiers : A key research question in this paper is the crucial choice of how to represent document identifiers. In general, we find that structured semantic identifiers are helpful and improve over unstructured identifiers. When comparing naive versus semantic string identifiers, it appears that semantic identifiers should be used whenever possible. This is intuitive, since imbuing the target space with semantic structure can ease optimization and allow additional unsupervised representation learning methods to act as external knowledge. The results for unstructured atomic identifiers are mixed, and we had some difficulty optimizing such models. We hypothesize that these difficulties stem in part from the newly initialized softmax layer, and that training such a system from scratch might alleviate them; however, we defer this line of investigation to future work. As it stands, the instability and high variance of unstructured atomic identifiers leads to inconsistent performance across datasets, and these identifiers are also prone to intermittent non-convergence, which we trace back to an optimization-related quirk. However, we also note that unstructured atomic identifiers perform best in the zero-shot retrieval setting, where the performance achieved is often more than double that of the beam-decoding methods.

Indexing strategies : This section explores the effect of different indexing methods. We conduct experiments on NQ100K with the indexing strategies described earlier, training the model with the naive docid method. Without indexing, the model achieves 0% Hits@1. This is intuitive, since docids are meaningless without the indexing task. Second, Inputs2Targets and Bidirectional perform best, with the bidirectional method performing slightly worse than Inputs2Targets (13.2 vs. 13.5). Finally, Targets2Inputs and Span Corruption with docids yield no meaningful results (0% accuracy). This shows that there can be huge differences between indexing strategies: some work reasonably well, while others do not work at all.

Figure 5: Performance of different document representations

Document representation : In this section, we explore the performance of the different document representation strategies described earlier. Figure 5 reports results on NQ320K. Overall, we find that direct indexing works best. We also find the inverted index method difficult to train, since docids are repeatedly exposed to different tokens. We also find that shorter document lengths seem to work well, with performance dropping off substantially beyond 64 tokens, suggesting that it may be harder to optimize over or efficiently memorize documents when the number of document tokens is large. Finally, we find no additional benefit from applying set-based or stopword preprocessing to the document tokens.

Scaling law : Another interesting insight is how the scaling law of DSI differs from dual encoders. Understanding the scaling behavior of Transformers has attracted significant interest in recent years. We find that in DE, the retrieval performance gain from increasing model parameterization appears to be relatively small. In contrast, the scaling characteristics of DSI appear to be more optimistic.

Figure 3: Scaling plot of DSI versus DE for different model sizes. Performance is measured in Hits@1.

Figure 3 depicts the scaling behavior (on a logarithmic scale) of three methods: DE, and DSI with naive and semantic docids. DSI (naive) benefits strongly from scaling from Base to XXL, and there still appears to be room for improvement. Meanwhile, DSI (semantic) starts out competitive with DE Base but performs much better as it is scaled up. Unfortunately, the DE models more or less plateau at smaller parameterizations.

Figure 4: Effect of Multitasking Ratio for Indexing vs. Retrieving Instances

Interaction between indexing and retrieval : Our early experiments showed that learning the indexing task first and then the retrieval task in sequence leads to mediocre performance. We therefore focus on finding a good ratio r for jointly training the indexing and retrieval tasks with multi-task learning. Figure 4 shows the effect of varying the ratio of indexing to retrieval examples. We find that the optimization process is significantly affected by the interplay between the indexing and retrieval tasks. Setting r too high or too low generally results in poor performance. We find that a ratio of 32 generally performs well.

Conclusion

This paper proposes the Differentiable Search Index (DSI), a new paradigm for learning an end-to-end search system in a unified manner, paving the way for the next generation of search. We define novel indexing and retrieval tasks that fully encode the relationship between terms and documents within the parameters of a Transformer model. The paper proposes several ways of representing documents and docids, and explores different model architectures and training strategies. Experiments on the Natural Questions dataset show that DSI outperforms common baselines such as BM25 and dual encoders, both in the standard fine-tuning setting and in the zero-shot setting.

While the model and results presented here are promising, there is a wealth of potential future research building on this work that could improve the approach: for example, exploring alternative strategies for representing documents and docids, and investigating mixture-of-experts models to expand the memory capacity of DSI. One important direction is to explore how such models can be updated for dynamic corpora, where documents may be added or removed. Finally, it may also be interesting to further investigate DSI as an unsupervised representation learning method and/or as a memory store for other language models.

