The importance of embedding models in large language models

Introduction

With the development of large language models, led by ChatGPT, various applications such as ChatPDF, BingGPT, and NotionAI have emerged. Public attention has largely focused on the rapid progress of generative models, while the Embedding models that underpin many large language model applications have received little attention. This article explains why Embedding models are so important for large language models, surveys the current mainstream Embedding training methods, and shares some thoughts from our preliminary exploration of Embedding models.

1 Introduction to Embedding technology and summary of its history

In machine learning and natural language processing, an embedding model maps high-dimensional data (such as text, images, or video) into a low-dimensional space. Simply put, an embedding is an N-dimensional real-valued vector that represents the input data as a point in a continuous numerical space. This article focuses on text embeddings.

Embeddings matter because they can represent the semantics of a word or sentence. Real-valued embedding vectors capture semantic meaning because they are learned from a word's occurrence patterns in context. For example, if two words frequently appear together in similar contexts, their embedding vectors end up close to each other in the vector space, which means they have similar meanings.

The concept of embedding can be traced back to the mid-20th century, when Harris proposed the distributional hypothesis of semantics. By the 1980s, researchers began trying to learn embedding representations of words with neural networks. Since 2010, with the development of deep learning, static word embeddings represented by Word2Vec, GloVe, and FastText have appeared, followed by context-dependent dynamic embeddings represented by ELMo, GPT, and BERT. The latter can better capture the semantics and contextual information of words.

2 The value of Embedding in large models

As mentioned above, and as is well known, embedding vectors carry semantic information: the more similar the meanings of two words, the closer their embedding vectors lie in the space. Real-valued embeddings learn the semantics and context of words from large amounts of data, which makes vector arithmetic possible and allows them to be shared and transferred across different natural language processing tasks.

However, that was Embedding's value before large language models. In the era of large language models, what new value does Embedding offer?

This starts with the shortcomings of ChatGPT-like models. Although they are powerful, they still have the following problems:

  • The training data is not real-time (for example, ChatGPT was trained on data up to September 2021), and retraining is too expensive to be realistic.
  • There is a limit on input length, usually between a few thousand and tens of thousands of tokens.
  • They cannot access documents that are not public.

To address this, OpenAI published a document explaining how an embedding-based two-step search can work around GPT's inability to handle long text and up-to-date data. Two-step search means first searching a text corpus to find the relevant passages, and then appending the retrieved passages to the input of a ChatGPT-like model to obtain the response.

To illustrate with a representative application: when we want a large model to answer questions based on a PDF we provide, we can split the long PDF into chunks, compute an embedding for each chunk, and store them in a vector database. When the user asks "How is xxx implemented in the document?", we embed the question and retrieve from the database the PDF chunks whose embeddings are most similar to the question embedding. Finally, the retrieved chunks and the question are fed to the model together, which addresses both the new-knowledge and long-input problems.
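A minimal sketch of this retrieve-then-ask flow. The embed() function is a hypothetical stand-in (replace it with any real embedding model or API), and a plain Python list stands in for the vector database:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding function: replace with a real embedding model or API call.
    Returns a pseudo-random, L2-normalized vector just so the sketch runs end to end."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

# 1) Index: split the PDF into chunks and store (chunk, embedding) pairs.
chunks = ["chunk 1 text ...", "chunk 2 text ..."]           # produced by splitting the PDF
index = [(c, embed(c)) for c in chunks]                     # stand-in for a vector database

# 2) Retrieve: embed the question and rank chunks by cosine similarity.
question = "How is xxx implemented in the document?"
q_vec = embed(question)
top_chunks = sorted(index, key=lambda item: -float(item[1] @ q_vec))[:3]

# 3) Ask: prepend the retrieved chunks to the prompt of a ChatGPT-like model.
prompt = "\n\n".join(c for c, _ in top_chunks) + "\n\nQuestion: " + question
```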

Therefore, although it is not yet a hot topic of discussion, exploring Embedding models is essential for putting large language models into practice.

3 Mainstream Embedding training methods

As mentioned earlier, OpenAI has already proposed an embedding-based search solution for the problems of long inputs and up-to-date data. Naturally, OpenAI also offers an Embedding model whose training details are not disclosed: text-embedding-ada-002. This is OpenAI's second-generation Embedding model; it uses a single model to handle three downstream tasks: text search, text similarity, and code search. The first generation split these three tasks across five models, whereas the second generation consolidates them into a single model and shows better performance on both Chinese and English tasks.
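As a usage illustration (not a description of how the model was trained), here is a short sketch of calling text-embedding-ada-002 through the openai Python package, assuming the openai>=1.0 client interface and an API key configured in the environment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input=["How is xxx implemented in the document?"],
)
vector = resp.data[0].embedding   # a list of floats (1536 dimensions for ada-002)
print(len(vector))
```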

In this chapter, we review some mainstream Embedding training methods. In recent years, most work on Sentence Embedding has been based on BERT-like models; there is only a small amount of research and open code on obtaining embeddings from Decoder-based models, and the training details of the Embedding paper published by OpenAI are also unclear. Therefore, this chapter mainly covers representative Sentence Embedding methods based on BERT-like models. We discuss our exploration of obtaining embeddings from Decoder-based models in Chapter 4.

In the pre-BERT era, sentence vectors were generally obtained by pooling word embeddings trained with word2vec. In the BERT era, people took advantage of pre-trained language models: first, the [CLS] vector of BERT was used as the sentence representation; then Sentence-BERT cleverly used a Siamese network framework to obtain sentence vectors; and after that, BERT-Flow, BERT-Whitening, SimCSE, R-Drop, ESimCSE and other works appeared one after another. Among them, the best known are BERT-Whitening and SimCSE. Since then, much of the work has been based on contrastive learning, with improvements at the data and training levels in how positive and negative sample pairs are constructed. This chapter gives a brief summary of this line of methods.
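For reference, here is a minimal sketch of the two pooling strategies mentioned above, extracting a sentence vector from a BERT-like model via the [CLS] token or via mean pooling; it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["An example sentence."], return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = model(**batch).last_hidden_state           # [batch, seq_len, dim]

cls_vec = hidden[:, 0]                                   # [CLS] vector as the sentence embedding
mask = batch["attention_mask"].unsqueeze(-1)             # ignore padding positions
mean_vec = (hidden * mask).sum(1) / mask.sum(1)          # mean pooling over real tokens
```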

Since recent Sentence Embedding work mostly revolves around contrastive learning, we first review the basics of contrastive learning.

3.1 Contrastive learning background

Contrastive learning aims to learn effective representations by pulling similar data closer together and pushing dissimilar data farther apart. Given a set of paired samples $D = \{(x_i, x_i^+)\}$, where $x_i$ and $x_i^+$ are semantically similar, the optimization objective is usually the in-batch negatives cross-entropy loss:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j^+)/\tau}}$$

where $h_i$ and $h_i^+$ are the sentence vectors of $x_i$ and $x_i^+$, $N$ is the batch size during training, $\mathrm{sim}(h_i, h_i^+)$ is the cosine similarity of the two vectors, and $\tau$ is the temperature hyperparameter.
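A small PyTorch sketch of this in-batch negatives loss, assuming h and h_pos are pre-computed sentence vectors of shape [N, dim] where row i of each forms a positive pair:

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(h: torch.Tensor, h_pos: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """h[i] and h_pos[i] form a positive pair; all other rows act as in-batch negatives."""
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    sim = h @ h_pos.T / tau                   # [N, N] cosine similarities scaled by temperature
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(sim, labels)       # the diagonal entries are the positives
```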

3.2 Classic methods

In recent years, since the emergence of SimCSE, the sentence embedding field has seen a small wave of renewed research interest. In this section, we review three representative works in some detail (SimCSE, ESimCSE, and CoSENT), and briefly summarize some subsequent representative work.

3.2.1 SimCSE

SimCSE is one of the most influential works in the field of sentence embedding.

It is divided into two versions:

  • Unsupervised SimCSE: positive pairs are two similar representations obtained by applying different dropout masks to the same sentence, and negatives are in-batch negatives;
  • Supervised SimCSE: positive and negative pairs are constructed from NLI datasets. Positives are sentence pairs with an entailment relationship, and negatives are sentence pairs with a contradiction relationship (hard negatives) plus in-batch negatives.

The above is the core idea of SimCSE: simple, effective, and highly inspiring, it led a subsequent wave of research on sentence embedding. A minimal sketch of the unsupervised dropout trick follows.
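The sketch below reuses the in_batch_negatives_loss function from Section 3.1: the same batch of sentences is encoded twice with dropout active, and the two views serve as positive pairs (the [CLS] vector is used here for simplicity):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()                                     # keep dropout active: this *is* the augmentation

sentences = ["A cat sits on the mat.", "The weather is nice today."]
batch = tokenizer(sentences, return_tensors="pt", padding=True)

h     = encoder(**batch).last_hidden_state[:, 0]    # first pass  -> one dropout mask
h_pos = encoder(**batch).last_hidden_state[:, 0]    # second pass -> a different dropout mask

loss = in_batch_negatives_loss(h, h_pos)            # from the sketch in Section 3.1
loss.backward()
```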

3.2.2 ESimCSE

ESimCSE improves on SimCSE in how positive and negative pairs are constructed.

(1) How to construct positive example pairs:

Since SimCSE constructs positive pairs by applying dropout to the same sentence, the two sentences in a positive pair always have the same length, while negative pairs usually have different lengths. This biases the model toward judging sentences of the same or similar length as more similar.

To alleviate this problem, ESimCSE randomly repeats some words in the sentence, which changes the sentence's length without changing its semantics.
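A simple sketch of this word-repetition augmentation, assuming whitespace tokenization; the dup_rate parameter (the fraction of tokens to duplicate) is an illustrative choice:

```python
import random

def word_repetition(sentence: str, dup_rate: float = 0.2) -> str:
    """Randomly duplicate a fraction of the tokens to vary length without changing meaning."""
    tokens = sentence.split()
    if not tokens:
        return sentence
    n_dup = max(1, int(len(tokens) * dup_rate))
    dup_positions = set(random.sample(range(len(tokens)), n_dup))
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i in dup_positions:
            out.append(tok)            # repeat this token once
    return " ".join(out)

print(word_repetition("the cat sat on the mat"))   # e.g. "the cat cat sat on the mat"
```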

(2) How to construct negative example pairs:

In contrastive learning, more negative pairs in principle lead to better contrast. ESimCSE follows this idea, but instead of simply increasing the batch size, it maintains a queue that reuses the sentence embeddings from preceding mini-batches to expand the pool of negatives, and it uses a momentum encoder to produce them. Concretely, since the queued sentence embeddings come from earlier mini-batches, a moving average of the encoder parameters is maintained as a momentum-updated model, and this momentum encoder generates the queued sentence embeddings. When using the momentum encoder, dropout is turned off to narrow the gap between training and prediction. The encoder parameters $\theta_e$ and the momentum encoder parameters $\theta_m$ are updated according to the following formula:

$$\theta_m \leftarrow \lambda\,\theta_m + (1-\lambda)\,\theta_e$$

where $\lambda \in [0, 1)$ is the momentum coefficient. Note that only $\theta_e$ is updated by backpropagation. The momentum encoder is introduced to generate the queued sentence embeddings because the momentum update makes $\theta_m$ evolve more smoothly than $\theta_e$. So, although the embeddings in the queue are produced by encoders at different training steps, the differences between those encoders are small.
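A compact sketch of the momentum update and the negative queue, continuing from the encoder and batch in the SimCSE sketch above; the momentum coefficient and queue size are illustrative values:

```python
import copy
import torch

lam = 0.995                                    # momentum coefficient (illustrative value)
momentum_encoder = copy.deepcopy(encoder)      # initialized as a copy of the encoder
momentum_encoder.eval()                        # dropout off for the momentum branch

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, lam: float) -> None:
    for p_e, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data = lam * p_m.data + (1.0 - lam) * p_e.data   # theta_m <- lam*theta_m + (1-lam)*theta_e

# After each training step: push the momentum encoder's embeddings into a queue of extra negatives.
queue, max_queue = [], 1024
with torch.no_grad():
    extra_negatives = momentum_encoder(**batch).last_hidden_state[:, 0]
queue = (queue + list(extra_negatives))[-max_queue:]
momentum_update(encoder, momentum_encoder, lam)
```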

3.2.3 CoSENT

Early Sentence-BERT suffered from a mismatch between training and prediction and was hard to tune, yet directly optimizing the cosine value used at prediction time usually works particularly poorly. Is there really no way to optimize the cos value directly?

Fortunately, the answer is no. Su Jianlin proposed the CoSENT scheme, a loss function that optimizes the cos value directly:

Denote by $\Omega_{pos}$ the set of all positive sample pairs and by $\Omega_{neg}$ the set of all negative sample pairs. We then hope that for any positive pair $(i, j) \in \Omega_{pos}$ and any negative pair $(k, l) \in \Omega_{neg}$,

$$\cos(u_i, u_j) > \cos(u_k, u_l)$$

where $u_i, u_j, u_k, u_l$ are the corresponding sentence vectors. To put it plainly, we only require that positive pairs be more similar than negative pairs; how much more similar is left to the model. In fact, the same is true of Spearman correlation, a common evaluation metric for semantic similarity: it depends only on the relative order of the predictions, not on their specific values.

For requirements of this kind, the formula from the Circle Loss paper can be used as a solution:

$$\log\left(1 + \sum_{s_i > s_j} e^{\lambda (s_j - s_i)}\right)$$

Simply put, whenever we want $s_i > s_j$ to eventually hold, we add a term $e^{\lambda(s_j - s_i)}$ inside the log. Applied to our scenario, this yields the loss function:

$$\log\left(1 + \sum_{(i,j) \in \Omega_{pos},\,(k,l) \in \Omega_{neg}} e^{\lambda\left(\cos(u_k, u_l) - \cos(u_i, u_j)\right)}\right)$$

where $\lambda > 0$ is a hyperparameter. The above formula is essentially a loss function designed for ranking. It also applies to graded (multi-level) data and can be written in the more general form:

$$\log\left(1 + \sum_{\mathrm{sim}(i,j) > \mathrm{sim}(k,l)} e^{\lambda\left(\cos(u_k, u_l) - \cos(u_i, u_j)\right)}\right)$$

In other words, as long as we believe the true similarity of sample pair $(i, j)$ should be greater than that of $(k, l)$, we can add the corresponding term inside the log; put differently, as long as we can define an ordering over the sample pairs, we can use the CoSENT scheme.

For NLI data, there are three labels: "entailment", "neutral", and "contradiction". We can naturally assume that the similarity of an "entailment" pair is greater than that of a "neutral" pair, and that the similarity of a "neutral" pair is greater than that of a "contradiction" pair, so NLI sentence pairs can be ordered by these three labels. With this ordering, NLI data can also be trained with CoSENT. Likewise, CoSENT is particularly suitable for data such as STS-B, whose labels are themselves scores, because a score label directly provides ordering information.
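A small PyTorch sketch of the CoSENT loss following the formulation above, assuming cos_sim holds the predicted cosine similarity of each sentence pair and labels holds the gold scores (only their ordering is used); lam plays the role of $\lambda$:

```python
import torch

def cosent_loss(cos_sim: torch.Tensor, labels: torch.Tensor, lam: float = 20.0) -> torch.Tensor:
    """cos_sim: [M] predicted cosine similarities of M sentence pairs.
    labels:  [M] graded gold scores; only their relative order matters."""
    scores = lam * cos_sim
    # diff[i, j] = lam * (cos_sim[i] - cos_sim[j]); keep it only where labels[i] < labels[j]
    diff = scores[:, None] - scores[None, :]
    keep = (labels[:, None] < labels[None, :]).float()
    diff = diff - (1.0 - keep) * 1e12                       # mask out non-qualifying pairs
    diff = torch.cat([torch.zeros(1, device=diff.device), diff.flatten()])   # the "1 +" term
    return torch.logsumexp(diff, dim=0)

# Toy usage: three sentence pairs with gold scores 5.0 > 3.0 > 1.0.
loss = cosent_loss(torch.tensor([0.9, 0.2, 0.4]), torch.tensor([5.0, 3.0, 1.0]))
```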

3.2.4 Summary of subsequent work

  • SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples
    • To address the problem that models "cannot distinguish textual similarity from semantic similarity and prefer textually similar sentences regardless of actual semantic differences", it proposes explicitly adding negation words to generate soft negative samples, combined with a bidirectional margin loss.

  • EASE: Entity-Aware Contrastive Learning of Sentence Embedding
    • Emphasizes the importance of entities in sentence vector representations; at the data level, positive and negative entities are used in place of positive and negative samples.
  • CLAIF: Improving Contrastive Learning of Sentence Embeddings from AI Feedback
    • To address the lack of fine-grained supervision signals during training, i.e., the similarity differences among positive pairs are not taken into account, AI feedback from LLMs is introduced to construct sample pairs of varying similarity and to assign them fine-grained similarity scores, which serve as supervision signals to aid the learning of text representations.

3.3 PromptBERT

PromptBERT is another classic work in the field of sentence embedding after SimCSE.

The core of this work is to use prompts to generate sentence representations. The authors argue that the poor performance of native BERT embeddings is mainly due to biases introduced by token frequency, capitalization, and subwords, and that BERT's Transformer layers do not correct this by themselves. Using a prompt exploits the knowledge in BERT's layers more effectively, and taking the representation of [MASK] as the embedding avoids averaging over the tokens as before, thereby avoiding the bias those tokens introduce.

The core idea of the method is relatively simple and consists of two steps:

  1. Use a prompt to generate the sentence representation, e.g. "[X] means [MASK]", where [X] is the input sentence and the output representation of [MASK] is taken as the sentence representation (a minimal sketch follows this list);
  2. Use different prompt templates to generate views for contrastive learning, and continue training in a self-supervised fashion.
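A minimal sketch of step 1, assuming a BERT-like model from the transformers library; the exact template wording here is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "A cat sits on the mat."
template = f'This sentence : "{sentence}" means {tokenizer.mask_token}.'   # illustrative template

batch = tokenizer(template, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state                  # [1, seq_len, dim]

mask_pos = (batch["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
sentence_embedding = hidden[0, mask_pos].squeeze(0)             # representation of [MASK]
```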

3.4 Instructor Embedding

According to OpenAI's paper "Text and Code Embeddings by Contrastive Pre-Training", text similarity and semantic retrieval are two different tasks whose training objectives may conflict to some extent: as training proceeds, a model that gets better at semantic search may get worse at sentence similarity. At the same time, existing Embedding models often perform poorly when facing new tasks and new domains.

Yet an ideal Embedding should clearly have multiple capabilities at once. How can an Embedding model adapt to multiple tasks at the same time and generalize to new domains?

Instructor Embedding designs a new text embedding method based on instruction fine-tuning: an instruction that explains the use case (including task and domain information) is prepended to the text input. During training, Instructor Embedding hand-wrote task instructions for 330 text embedding datasets, and INSTRUCTOR was evaluated on 70 embedding evaluation tasks (64 of which were not seen during training), ranging from classification and information retrieval to semantic textual similarity and text generation evaluation, achieving better overall performance.
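As a usage illustration, here is a short sketch of instruction-prefixed encoding with the publicly released checkpoint, assuming the InstructorEmbedding package and the hkunlp/instructor-large model:

```python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")

# Each input is [instruction, text]; the instruction describes the task and domain.
embeddings = model.encode([
    ["Represent the Science title for retrieval:", "3D Action Recognition with Depth Sensors"],
    ["Represent the question for retrieving supporting documents:", "How do embeddings encode semantics?"],
])
print(embeddings.shape)   # (2, embedding_dim)
```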

4 Embedding related exploration and thinking

The previous chapter reviewed representative Sentence Embedding work based on BERT-like models. It seems reasonable that BERT-like models, with their bidirectional attention, are good at content understanding tasks. However, the strong results of OpenAI's Embedding model, OpenAI's insistence on the Decoder-only architecture, and the rapid development of large models over the past six months make us curious: could a Decoder-only large model also give us a surprise on Embedding tasks?

We made some exploratory attempts around this. In the process, we most wanted to answer two questions:

  • Are BERT-like models really more suitable for Embedding tasks than Decoder-Only architecture models?
  • For Embedding tasks, is the bigger the model, the better?

In the end, after exploring the decoder-only model's padding strategy, pooling strategy, and the degree of anisotropy at different layers, our conclusions were largely consistent with some of the conclusions already published.

Regarding the first question, the paper "How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings" conducted an exploratory experiment comparing the behavior of different layers of BERT and GPT-2. The experimental results are as follows:

According to the above table, we can find:

  • Across the layers, BERT performs significantly better overall than GPT-2.
  • The anisotropy of GPT-2's last layer is relatively severe; its middle and lower layers are more suitable for similarity tasks than the top layer.

Regarding the second question, the Instructor Embedding paper also provides comparative experiments on models with different parameter counts, as shown in the following table:

According to the above table, we can find:

  • Compared with the 335M GTR-Large model, the 4.8B GTR-XXL model, with more than ten times the parameters, shows no significant performance improvement.
  • The 5.8B SGPT-NLI model, which uses a Decoder-only architecture, is beaten by the 4.8B GTR-XXL model, an Encoder-only architecture with a similar number of parameters.

In summary, combined with our experiments, the preliminary conclusion is:

  • From the perspective of model size: on Embedding tasks, increasing the number of model parameters does not necessarily improve performance.
  • From the perspective of model structure: according to current experimental results, BERT-like models with bidirectional attention are indeed better than Decoder-only structures with unidirectional attention.

Of course, since OpenAI has not disclosed the technical details of its Embedding solution, perhaps we simply have not yet found the right way to use GPT for Embedding.
