Similar question generation and similar sentence retrieval: SimBERT → RoFormer-Sim (SimBERT v2)


SimBERT

Open source address: https://github.com/ZhuiyiTechnology/simbert

UniLM:

UniLM is a Transformer model that unifies NLU and NLG capabilities. Microsoft proposed it in May 2019 and upgraded it to UniLMv2 in February 2020.
The core of UniLM is to endow the model with Seq2Seq ability through a special Attention Mask.
If the input is "What do you want to eat" and the target sentence is "White cut chicken", UniLM concatenates the two into a single sequence, [CLS] What do you want to eat [SEP] White cut chicken [SEP], and applies the Attention Mask shown below:
[Figure: the UniLM Seq2Seq Attention Mask over the concatenated sequence]
Seq2Seq only shows that UniLM has NLG ability, so why say it has both NLU and NLG capabilities?
Because of UniLM's special Attention Mask, the six tokens of [CLS] What do you want to eat [SEP] attend only to one another and never to White cut chicken [SEP]. In other words, even though a second sentence is spliced on afterwards, the encoded vectors of the first six tokens are unaffected.
Thanks to this property, UniLM also randomly replaces some input tokens with [MASK], so that the input part can be trained as an MLM task while the output part is trained as a Seq2Seq task. MLM strengthens NLU, Seq2Seq strengthens NLG: two birds with one stone.
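For concreteness, here is a minimal PyTorch sketch of building such an Attention Mask (the function name and the segment-id convention are assumptions of this sketch, not UniLM's released code):

```python
import torch

def unilm_seq2seq_mask(seg_ids):
    """UniLM-style Seq2Seq attention mask. seg_ids: (n,) tensor with 0 for
    the input segment and 1 for the target segment. Input tokens attend
    bidirectionally within the input; target tokens attend to the whole
    input plus earlier target tokens (causal)."""
    n = seg_ids.size(0)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))  # j <= i
    is_input = seg_ids == 0
    # position j is visible from position i if j is in the input, or j <= i
    return is_input.unsqueeze(0) | causal

# For the example above: 6 input tokens "[CLS] What do you want to eat [SEP]"
# followed by the target tokens "White cut chicken [SEP]"
mask = unilm_seq2seq_mask(torch.tensor([0] * 6 + [1] * 4))
```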

SimBERT is trained with supervision; its training corpus is similar sentence pairs collected by the authors. The Seq2Seq part is built as a similar-sentence generation task: given one sentence of a pair, predict the other, as shown in the figure below.

[Figure: SimBERT's training setup]

Assuming SENT_a and SENT_b are a pair of similar sentences, within the same batch both [CLS] SENT_a [SEP] SENT_b [SEP] and [CLS] SENT_b [SEP] SENT_a [SEP] are added to the training data for this similar-sentence generation task; that is the Seq2Seq part.

On the other hand, take out all the [CLS] vectors in the batch to form a sentence-vector matrix V ∈ ℝ^{b×d} (b is batch_size, d is hidden_size), l2-normalize along the d dimension to get Ṽ, and compute pairwise inner products to obtain the b×b similarity matrix ṼṼ^⊤. Multiply it by a scale (we used 30), mask off the diagonal, and apply a softmax to each row, training it as a classification task in which each sample's target label is its similar sentence (the sample itself having been masked out). To put it plainly: all non-similar samples in the batch are treated as negatives, and the softmax raises the similarity of similar samples while lowering the similarity of everything else.
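As an illustration, here is a minimal PyTorch sketch of this in-batch classification loss, assuming similar pairs occupy adjacent positions in the batch (the pairing convention and the function name are assumptions of this sketch, not the released bert4keras implementation):

```python
import torch
import torch.nn.functional as F

def retrieval_loss(cls_vectors, scale=30.0):
    """In-batch similar-sentence classification: rows 2i and 2i+1 of
    cls_vectors (shape (b, d)) are assumed to be a similar pair."""
    v = F.normalize(cls_vectors, dim=-1)        # l2-normalize along d
    sims = scale * (v @ v.t())                  # b x b similarity matrix, times the scale
    sims.fill_diagonal_(float("-inf"))          # mask off the diagonal (the sample itself)
    b = sims.size(0)
    # each sample's target label is its paired similar sentence
    labels = torch.arange(b).view(-1, 2).flip(1).reshape(-1)
    return F.cross_entropy(sims, labels)        # row-wise softmax classification
```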

In the final analysis, the key point is that "the [CLS] vector effectively represents the sentence vector of the input", so it can be used for NLU-related tasks. The final loss is the sum of the Seq2Seq loss and the similar-sentence classification loss.

RoFormer-Sim (SimBERT v2)

Open source address: https://github.com/ZhuiyiTechnology/roformer-sim

The training corpus of RoFormer-Sim consists of two parts:
1. Similar sentences of the interrogative type (as with SimBERT, collected from similar questions on Baidu Zhidao and further cleaned with rules);
2. Similar sentences of the general type (constructed, to a reasonable approximation, as (pseudo-)similar sentence pairs without supervision, via the two schemes below).

Scheme 1 builds on the idea that "answers to the same question are similar": given a ready-made Q&A corpus in which one question has multiple answers, split each answer into sentences, compare the answers with an off-the-shelf similarity function, and keep the sentence pairs whose similarity exceeds a threshold as similar sentence pairs.

Scheme 2 builds on the idea that "sentences in the same article are similar" and is even simpler and more direct: split each article into sentences, compute pairwise similarity with the same off-the-shelf similarity function, and keep the sentence pairs whose similarity exceeds a threshold as similar sentence pairs. The rationale of this scheme is obviously weaker, so its threshold is set higher.
As the "off-the-shelf similarity function" we use a variant of Jaccard similarity; in other words, all that is needed is a rule-based, character-level similarity. The semantic generalization ability is obtained from within-article association and from the pre-trained model itself. A sketch of this mining procedure follows.

With Scheme 1, we constructed about 4.5 million (pseudo-)similar sentence pairs from several reading-comprehension datasets; with Scheme 2, we constructed about 4.7 million (pseudo-)similar sentence pairs from more than 30 GB of raw corpus. Meanwhile, the crawled questions amounted to about 30 million similar-sentence groups (one group can form multiple pairs). Seen this way, the interrogative sentences far outnumber the general-type ones, so we sample the two at a 1:1 ratio to keep the samples of each sentence type balanced.
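For illustration, a minimal sketch of such 1:1 sampling (the function name and batch-building details are assumptions, not the authors' actual pipeline):

```python
import random

def balanced_sample(question_pairs, general_pairs, n):
    """Draw question-type and general-type pairs at a 1:1 ratio so that
    each sentence type is equally represented in the training stream."""
    half = n // 2
    return random.sample(question_pairs, half) + random.sample(general_pairs, half)
```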

Training method
It is basically the same as SimBERT, as shown in the figure below.
The difference is that, to strengthen the model's generation ability, we also randomly replace some tokens of the input sentence with [MASK] when constructing the training corpus. This pre-training method was first proposed by BART.
The difference from BART is that BART is "input a corrupted sentence, output the original sentence", whereas ours is "input a corrupted sentence, output a sentence similar to the original"; in theory our task is even harder.
[Figure: RoFormer-Sim's training setup]
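A minimal sketch of this corpus corruption, assuming a 15% replacement rate (the post does not state the actual rate):

```python
import random

def corrupt(tokens, mask_token="[MASK]", mask_rate=0.15):
    """Randomly replace some input tokens with [MASK] (BART-style noising);
    the target remains a clean sentence similar to the original input."""
    return [mask_token if random.random() < mask_rate else t for t in tokens]
```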

Thanks to this BART-like training, besides directly generating similar sentences, we can also mask parts of a sentence ourselves and let the model diverge and fill in the expansion on its own.

Enlarging the general-sentence corpus and introducing BART-like training improved the generative model noticeably.
However, we found, unexpectedly, that the retrieval model (i.e. the sentence-encoding model) got worse.
Our guess at the reason: the larger corpus and stronger noise make the generation task harder, but for the contrastive learning they make it easier, because samples with a different sentence type or added noise serve as easy negatives. For example, if a batch contains both interrogative and declarative sentences, the model can rule out many negatives from sentence type alone (rather than semantics), which weakens its semantic understanding.
Of course, SimBERT and RoFormer-Sim are essentially similar-sentence augmentation models, and the retrieval model is just a "by-product"; still, we want this by-product to be as good as possible. To this end, after RoFormer-Sim training we transfer SimBERT's retrieval ability to RoFormer-Sim by distillation, so that RoFormer-Sim's retrieval performance is on par with or even better than SimBERT's.
The method of distillation is very simple: for the same batch of sentences, let the sentence vectors from SimBERT be u_1, u_2, …, u_n and those from RoFormer-Sim be v_1, v_2, …, v_n; the distillation loss is

$$\mathcal{L}_{\text{sim}} = \frac{\lambda}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\big(\cos(u_i, u_j) - \cos(v_i, v_j)\big)^2 \tag{1}$$

where λ = 100. Of course, to keep the model from "forgetting" how to generate, the generation loss is added alongside distillation, i.e. ℒ = ℒ_sim + ℒ_gen. Distilling the base version does not take many steps; about 5000 steps complete the training.
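A minimal PyTorch sketch of Eq. (1) (the function name is an assumption):

```python
import torch
import torch.nn.functional as F

def distill_loss(u, v, lam=100.0):
    """Eq. (1): match RoFormer-Sim's pairwise cosine matrix to SimBERT's.
    u: (n, d) SimBERT sentence vectors; v: (n, d) RoFormer-Sim vectors."""
    cu = F.normalize(u, dim=-1) @ F.normalize(u, dim=-1).t()   # cos(u_i, u_j)
    cv = F.normalize(v, dim=-1) @ F.normalize(v, dim=-1).t()   # cos(v_i, v_j)
    n = u.size(0)
    return lam / (n * n) * ((cu - cv) ** 2).sum()
```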

