The Sentence-BERT model for text matching

Foreword

At present, fine-tuning a pre-trained model gives good results on most NLP tasks, but some scenarios require standalone text representations, for example text clustering and text matching (search scenarios).
For text matching, when computing semantic similarity the BERT model has to take both sentences as input at the same time so that the two can interact.

Scenario 1: Given 10,000 sentences, find the most similar sentence pair.
The BERT model has to be run 10,000 × 9,999 / 2 ≈ 50 million times, which is very time-consuming, roughly 65 hours. With Sentence-BERT, the 10,000 sentences only need to be encoded 10,000 times to obtain their embeddings, which takes about 5 seconds; the cosine-similarity computation on top of the cached embeddings is negligible compared with the model's running time.

5 seconds versus 65 hours: that is the enormous gap between representation-based and interaction-based models.
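As a rough illustration of this representation-based workflow, here is a minimal sketch with the sentence-transformers library (the checkpoint name and the tiny sentence list are placeholders chosen for the example; a recent library version providing util.cos_sim is assumed):

```python
from sentence_transformers import SentenceTransformer, util
import torch

# Checkpoint name is illustrative; any pretrained SBERT model can be used.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = [
    "How do I reset my password?",
    "What is the capital of France?",
    "I forgot my password, how can I change it?",
]

# Each sentence is encoded exactly once (n forward passes, not n*(n-1)/2).
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarity on the cached embeddings is essentially free.
scores = util.cos_sim(embeddings, embeddings)
scores.fill_diagonal_(-1.0)                      # ignore self-similarity
i, j = divmod(torch.argmax(scores).item(), len(sentences))
print(sentences[i], "<->", sentences[j], scores[i, j].item())
```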

Scenario 2: A user asks a question; compute the similarity between this question and the standard questions in a database, and return the most similar standard question.
The BERT model has to jointly encode the user question with every standard question, which is very inefficient. Sentence-BERT, however, can encode the standard-question library offline in advance to obtain sentence vectors; when a new question (query) arrives, only the query needs to be encoded, which greatly improves efficiency.
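A minimal sketch of that retrieval flow with sentence-transformers (the checkpoint name and the example questions are assumptions for illustration; util.semantic_search is part of the library's utilities):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative checkpoint

# Offline: encode the standard-question library once and cache the embeddings.
standard_questions = [
    "How do I reset my password?",
    "How do I cancel my order?",
    "Where can I download my invoice?",
]
corpus_embeddings = model.encode(standard_questions, convert_to_tensor=True)

# Online: only the incoming query has to be encoded.
query_embedding = model.encode("I forgot my password", convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)[0]
best = hits[0]
print(standard_questions[best["corpus_id"]], best["score"])
```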

Sentence-BERT paper:
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Code:
https://www.sbert.net/docs/pretrained_models.html
https://github.com/UKPLab/sentence-transformers

Dataset: the LCQMC dataset open-sourced by Harbin Institute of Technology, with more than 200,000 sentence pairs.
Evaluation metrics (a small example follows this list):

  • Spearman rank correlation, which tests for a monotonic relationship
  • Pearson correlation, which tests for a linear relationship;
    Spearman rank correlation is the recommended metric.
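Both metrics can be computed with SciPy; the scores and labels below are made-up toy values for illustration only:

```python
from scipy.stats import pearsonr, spearmanr

# Toy data: model cosine scores vs. gold similarity labels for sentence pairs.
predicted = [0.91, 0.35, 0.78, 0.10, 0.66]
gold      = [1,    0,    1,    0,    1]

pearson_corr, _ = pearsonr(predicted, gold)
spearman_corr, _ = spearmanr(predicted, gold)
print(f"Pearson: {pearson_corr:.4f}  Spearman: {spearman_corr:.4f}")
```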

Why are BERT's native sentence vectors not effective?

[Figure: sentence-embedding benchmark results; BERT's CLS vectors score poorly]
As the figure above shows, BERT's CLS vector is not a good sentence representation.

Reasons:

BERT's embedding space is anisotropic: the word embeddings form a cone-shaped distribution in which high-frequency words cluster at the head of the cone and low-frequency words scatter toward the tail. Because high-frequency words occur so often, they dominate the sentence representation.

Word frequency also affects the geometry of the embedding space: high-frequency words have a smaller L2 distance to the origin and are densely concentrated around it, while low-frequency words lie farther from the origin and are much more sparsely distributed.
[Figures: embedding L2 norm and neighborhood density grouped by word frequency]
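An informal way to probe this claim (not from the original post): compare the L2 norms of the static input embeddings of a very frequent token and a rarer one. The model and token choices below are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")
emb = model.get_input_embeddings().weight        # (vocab_size, hidden_size)

# "的" is extremely frequent in Chinese text; "鑫" is comparatively rare.
for token in ["的", "鑫"]:
    idx = tokenizer.convert_tokens_to_ids(token)
    print(token, emb[idx].norm(p=2).item())
```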
Another reason: text similarity is usually measured with cosine similarity, $\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$. This formula implicitly assumes the vectors are expressed in an orthonormal basis; if the basis is not orthonormal, the result is not representative (i.e., inaccurate).

The principle of Sentence-BERT

There are two ways to get a sentence vector out of BERT: use the CLS vector, or pool the token embeddings (average pooling / max pooling). The pooling can also draw on different layers (the last layer, the last two layers, or the first plus the last layer); average pooling over the last layer is the most common choice. Either way, BERT's raw sentence vectors do not perform well in similarity computation, although RoBERTa's sentence vectors are somewhat better than BERT's.
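A minimal sketch of the two extraction options with Hugging Face transformers (the model name and example sentences are illustrative; mean pooling masks out padding tokens before averaging):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

sentences = ["今天天气真好", "今天天气不错"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    last_hidden = model(**inputs).last_hidden_state      # (batch, seq_len, hidden)

# Option 1: the CLS vector of the last layer.
cls_embeddings = last_hidden[:, 0]

# Option 2: average pooling over the last layer, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()
mean_embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
```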

The commonly used fine-tuning loss is cross-entropy. Sentence-BERT instead changes the training objective so that the resulting sentence embeddings (e.g., the CLS vectors) can be compared directly with cosine similarity. The paper proposes three methods:

The first method: turn the single BERT into a siamese (twin-tower) network; although there are two towers, the weights are shared. The two sentences are fed in separately to obtain their sentence embeddings (CLS vectors), i.e., u and v in the figure; u, v and |u − v| are then concatenated and passed through a softmax classifier. This method supports two tasks: the NLI task (three-way classification) and the STS task treated as binary classification.
[Figure: siamese network with a softmax classifier over (u, v, |u − v|)]
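A bare-bones PyTorch sketch of this classification objective (dimensions and batch contents are dummy values; in the real setup u and v come from the weight-shared BERT towers):

```python
import torch
import torch.nn as nn

dim, num_labels, batch = 768, 3, 8            # 3 labels for the NLI setting
classifier = nn.Linear(3 * dim, num_labels)   # softmax head over (u, v, |u - v|)
criterion = nn.CrossEntropyLoss()

u = torch.randn(batch, dim)                   # sentence-A embeddings (dummy)
v = torch.randn(batch, dim)                   # sentence-B embeddings (dummy)
labels = torch.randint(0, num_labels, (batch,))

features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
loss = criterion(classifier(features), labels)
loss.backward()
```

In the sentence-transformers library this objective corresponds to losses.SoftmaxLoss.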
The second method: compute the cosine similarity between the two outputs u and v and replace the loss function with a squared-error (regression) loss. This method only applies to the STS task.
[Figure: siamese network with a cosine-similarity regression objective]
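A matching sketch of the regression objective in plain PyTorch (dummy tensors stand in for the tower outputs; sentence-transformers implements this as losses.CosineSimilarityLoss):

```python
import torch
import torch.nn.functional as F

u = torch.randn(8, 768, requires_grad=True)    # sentence-A embeddings (dummy)
v = torch.randn(8, 768, requires_grad=True)    # sentence-B embeddings (dummy)
gold_scores = torch.rand(8)                    # gold similarity scores in [0, 1]

# Push the cosine similarity of each pair toward its gold score with MSE.
cosine = F.cosine_similarity(u, v, dim=-1)
loss = F.mse_loss(cosine, gold_scores)
loss.backward()
```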

The third method: replace the loss function with a triplet (hinge-style) loss. Here a denotes the anchor (original) sentence, p a positive example, and n a negative example; the goal is for the distance between the anchor and the positive example to be smaller than the distance between the anchor and the negative example by at least a margin ε. In the paper, the distance metric is the Euclidean distance and the margin is 1.
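A minimal PyTorch sketch of the triplet objective with Euclidean distance and margin 1, matching the settings stated above (dummy tensors stand in for the embeddings of a, p, n; sentence-transformers ships an equivalent losses.TripletLoss):

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)   # Euclidean distance, margin 1

anchor   = torch.randn(8, 768, requires_grad=True)     # embeddings of the original sentences a
positive = torch.randn(8, 768)                         # embeddings of the positive examples p
negative = torch.randn(8, 768)                         # embeddings of the negative examples n

# loss = mean( max(||a - p|| - ||a - n|| + margin, 0) )
loss = triplet_loss(anchor, positive, negative)
loss.backward()
```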

References:
https://blog.csdn.net/u012526436/article/details/115736907?spm=1001.2014.3001.5501
https://zhuanlan.zhihu.com/p/113133510
https://www.sbert.net/docs/training/overview.html
https://zhuanlan.zhihu.com/p/351678987

Original post: https://blog.csdn.net/dzysunshine/article/details/120490675