NLP practice - SBERT-based semantic search, semantic similarity calculation, and unsupervised training with SimCSE, GenQ, and more

0. Some thoughts triggered by SBERT

Since the sentence-transformers idea was proposed last year, contrastive learning has attracted wide attention in deep learning in the first half of this year; methods such as SimCSE in particular have achieved great success in unsupervised and few-shot scenarios. Both Sentence-BERT and SimCSE are very simple in principle and easy to implement.

At the time, because of environment-configuration issues with the sentence-transformers module (I didn't want to upgrade the transformers library too far), I used bert4keras to implement Sentence-BERT in both single-tower and twin-tower (bi-encoder) forms and trained them on the STS dataset and a Chinese translation of STS. The reference paper's description and code use a single encoder, while the figure in the paper appears to show two encoders, so I ran experiments with both structures; the difference turned out to be negligible. I therefore don't think it is worth spending nearly twice as many parameters on a twin-tower structure for tasks such as fine-tuning the semantic space of a pre-trained model.

But when I visited the SBERT website recently, I found that sentence-transformers had been updated to version 2.0, many pre-trained models had been uploaded to Hugging Face, and examples for specific application scenarios had been added, so I decided to introduce that content here, mainly by translating and organizing SBERT's official documentation. I have to admit that the open-source community is remarkably powerful: in just a few months there has been great progress. I previously worked mainly in Keras. I remember Su Jianlin saying that whatever torch can do, Keras can do too, and I have always believed that, but the pace of pre-trained-model development and the activity of the Hugging Face community far exceeded my expectations. If I kept using Keras, I would have to maintain two sets of code and spend a lot of time migrating between the two styles, so I decided to stop being a bert4keras devotee and switch to transformers as my main tool.

1. Introduction to SBERT

This post mainly translates and organizes SBERT's official documentation.
SBERT official help document: www.sbert.net
sentence-transformer GitHub: https://github.com/UKPLab/sentence-transformers
Pre-trained model address on Huggingface: https://huggingface.co/sentence-transformers

The introduction on the official website is already fairly detailed; for more concrete application examples, refer to the examples in the GitHub repository. This post also collects some of the most commonly used applications.

sentence-transformers is built on top of the Hugging Face transformers module; even without the sentence-transformers package installed, its pre-trained models can still be used with transformers alone. As for environment configuration, for the current 2.0 release it is best to upgrade transformers, tokenizers, and related modules to the latest versions, especially tokenizers; otherwise you will get an error when creating a tokenizer.

At present, sentence-transformers has released a total of 98 pre-trained models.
(Figure: some of the pre-trained models listed in the help documentation)
When choosing a pre-trained model, if accuracy is the priority, choose mpnet-base-v2 (this model seems to have also been trained on datasets such as Flickr30k and can reportedly be used to encode images; I have not tried this yet). If the application scenario is Chinese, choose one of the multilingual models, which currently support more than 50 languages; taking semantic similarity as an example, a multilingual model can compare the similarity between a Chinese sentence and an English one. However, there is currently no dedicated Chinese pre-trained model. If model efficiency matters, choose one of the distilled (distil-) models.

If it is a symmetric semantic search problem (query and answer are of similar length, e.g. comparing the semantic similarity of two sentences), use one of the pre-trained models listed at https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models; if it is an asymmetric semantic search problem (the query is very short but the answers to be retrieved are relatively long documents), use one of the models listed at https://www.sbert.net/docs/pretrained-models/msmarco-v3.html.
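
As a quick sketch, loading a model for either case looks the same; the names below are examples taken from those lists at the time of writing, so check the linked pages for current model names:

from sentence_transformers import SentenceTransformer

# Symmetric search / sentence similarity: a general sentence-embedding model
sym_model = SentenceTransformer('paraphrase-mpnet-base-v2')

# Asymmetric search (short query vs. long passage): an MS MARCO model
asym_model = SentenceTransformer('msmarco-distilbert-base-v3')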

2. Basic application

2.1 Semantic similarity calculation

Pretrained models are very simple to use as encoders.

from sentence_transformers import SentenceTransformer, util
# Create the model
# The encoder here can be replaced with mpnet-base-v2, etc.
# The model is downloaded automatically and cached under /root/.cache
# To load a local pre-trained model, pass the local model path instead, similar to huggingface's from_pretrained
model = SentenceTransformer('paraphrase-MiniLM-L12-v2')
# model= SentenceTransformer('path-to-your-pretrained-model/paraphrase-MiniLM-L12-v2/')

# Compute the embeddings
sentence1 = 'xxxxxx'
sentence2 = 'xxxxxx'
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# Compute the semantic similarity
cosine_score = util.pytorch_cos_sim(embedding1, embedding2)

In addition to calculating the semantic similarity of two sentences as above, you can also compare two lists of sentences, which yields a matrix of pairwise scores:

sentences1 = ['The cat sits outside',
             'A man is playing guitar',
             'The new movie is awesome']

sentences2 = ['The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

#Compute embedding for both lists
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings2 = model.encode(sentences2, convert_to_tensor=True)

# Compute cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings1, embeddings2)

If you want to find the most similar sentence pairs within a single list of input sentences:

# Single list of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?']

# Compute embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# Compute pairwise cosine similarities
cosine_scores = util.pytorch_cos_sim(embeddings, embeddings)

# Find the sentence pairs with the highest similarity
pairs = []
for i in range(len(cosine_scores)-1):
    for j in range(i+1, len(cosine_scores)):
        pairs.append({'index': [i, j], 'score': cosine_scores[i][j]})

# Sort the pairs by similarity score in descending order
pairs = sorted(pairs, key=lambda x: x['score'], reverse=True)

for pair in pairs[0:10]:
    i, j = pair['index']
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], pair['score']))

2.2 Semantic Search

Semantic search is essentially the same as similarity calculation: treat sentences1 as the queries and sentences2 as the corpus, use util.pytorch_cos_sim() to compute the similarities, and return the top-k entries with the highest scores.
Here is a simple helper method I wrote:

import torch

def semantic_search(query, corpus_or_emb, topk=1, model=model, corpus=None):
    """
    :param query: the query sentence
    :param corpus_or_emb: candidate answers (list of str) or their pre-computed embeddings (tensor)
    :param topk: how many answers to return
    :param model: the model used for encoding
    :param corpus: the original candidate texts; required when passing pre-computed embeddings
    :return [(most_similar, score)]: results and scores
    ---------------
    ver: 2021-08-23
    by: changhongyu
    """
    q_emb = model.encode(query, convert_to_tensor=True)
    if isinstance(corpus_or_emb, list):
        corpus = corpus_or_emb
        c_emb = model.encode(corpus, convert_to_tensor=True)
    elif isinstance(corpus_or_emb, torch.Tensor):
        c_emb = corpus_or_emb
    else:
        raise TypeError("Attribute 'corpus_or_emb' must be list or tensor.")
    if corpus is None:
        raise ValueError("Pass the original texts via 'corpus' when providing pre-computed embeddings.")
    topk = min(topk, len(corpus))

    cosine_scores = util.pytorch_cos_sim(q_emb, c_emb)[0]
    top_res = torch.topk(cosine_scores, k=topk)

    return [(corpus[int(index.cpu())], float(score.cpu().numpy())) for score, index in zip(top_res[0], top_res[1])]
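
sentence-transformers also ships a built-in util.semantic_search helper that does the same thing and batches the computation over large corpora; a minimal usage sketch with a toy corpus, reusing the model and util imported above:

corpus = ['The weather is lovely today', 'He drove to the stadium', 'It is so sunny outside']
query = 'How is the weather today?'

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# util.semantic_search returns, for each query, a list of dicts with 'corpus_id' and 'score'
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit['corpus_id']], round(hit['score'], 4))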

2.3 Clustering and topic models

In addition to similarity calculation, the embeddings produced by the sentence-transformers models can be used directly as features for clustering. The documentation provides three approaches: k-means, agglomerative clustering, and fast clustering (an agglomerative-clustering sketch follows the k-means example below).
K-means clustering simply uses sklearn's k-means, with SBERT's embeddings as the input features:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Corpus with example sentences
corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'A man is eating pasta.',
          'The girl is carrying a baby.',
          'The baby is carried by the woman',
          'A man is riding a horse.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          'Someone in a gorilla costume is playing a set of drums.',
          'A cheetah is running behind its prey.',
          'A cheetah chases prey on across a field.'
          ]
corpus_embeddings = embedder.encode(corpus)

# Perform k-means clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")

Topic models that use a sentence transformer as the encoder include Top2Vec, BERTopic, and others.

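As an illustrative sketch only (this assumes the separate bertopic package is installed; the API below is BERTopic's, not sentence-transformers', and docs stands for your own collection of documents, ideally a few hundred or more), a sentence-transformers model can be plugged in as the embedding backend roughly like this:

from bertopic import BERTopic

docs = corpus  # placeholder: in practice use a much larger collection of documents

# BERTopic accepts a sentence-transformers model name (or model object) as its embedding backend
topic_model = BERTopic(embedding_model='paraphrase-MiniLM-L6-v2')
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())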

2.4 Image retrieval

sentence-transformers also provides a CLIP-based (ViT) pre-trained model, so it can compute the similarity between images and text. Usage is similar to text.

from sentence_transformers import SentenceTransformer, util
from PIL import Image

#Load CLIP model
model = SentenceTransformer('clip-ViT-B-32')

#Encode an image:
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

#Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

#Compute cosine similarities 
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)

In addition, sentence-transformers supports several other applications, such as retrieve & re-rank pipelines; see the official documentation for details.
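
For re-ranking, the documentation pairs the bi-encoder retrieval step with a CrossEncoder that re-scores the retrieved candidates; a minimal sketch (the model name is one of the MS MARCO cross-encoders listed on the site):

from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

query = 'How many people live in Berlin?'
candidates = ['Berlin has a population of about 3.6 million.',
              'Berlin is the capital of Germany.']

# Score each (query, candidate) pair and sort by descending relevance
scores = cross_encoder.predict([(query, c) for c in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
print(reranked)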

3. Training of unsupervised methods

3.1 SimCSE

SimCSE uses a single encoder and exploits the randomness of the dropout mechanism to construct positive training pairs: the same sentence is encoded twice with different dropout masks, and the two resulting embeddings are pulled closer together in the embedding space.
A simple training example given in the help documentation:

from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers import models, losses
from torch.utils.data import DataLoader

# Define your sentence transformer model (models.Pooling defaults to mean pooling here)
model_name = 'distilroberta-base'
word_embedding_model = models.Transformer(model_name, max_seq_length=32)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
                   "Model will automatically add the noise",
                   "And re-construct it",
                   "You should provide at least 1k sentences"]

# Convert train sentences to sentence pairs
train_data = [InputExample(texts=[s, s]) for s in train_sentences]

# DataLoader to batch your data
train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True)

# Use MultipleNegativesRankingLoss: the two dropout-noised views of a sentence are positives,
# and the other sentences in the batch serve as in-batch negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

# Call the fit method
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    show_progress_bar=True
)

model.save('output/simcse-model')

3.2 TSDAE

Transformer-based Sequential Denoising Auto-Encoder (TSDAE) is a transformer-based encoder-decoder structure that is trained to reconstruct the original input text from a corrupted (noised) version of it.
Training code:

from sentence_transformers import SentenceTransformer
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader

# Define your sentence transformer model using CLS pooling
model_name = 'bert-base-uncased'
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
                   "Model will automatically add the noise", 
                   "And re-construct it",
                   "You should provide at least 1k sentences"]

# Create the special denoising dataset that adds noise on-the-fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)

# DataLoader to batch your data
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Use the denoising auto-encoder loss
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

# Call the fit method
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True
)

model.save('output/tsdae-model')

I haven't read this paper carefully yet; here is the link:
https://arxiv.org/abs/2104.06979

3.3 GenQ

The application scenario for GenQ is asymmetric semantic search. When no labeled (query, passage) pairs are available, a T5 model is first used to generate queries for the passages, constructing a 'silver' dataset, which is then used to fine-tune a bi-encoder SBERT model.
First, a passage is fed to the model to generate query sentences. Note that the pre-trained model here is query-gen-msmarco-t5-large-v1, not the original T5 released by Google; with plain T5, the generated output would not be a corresponding question but a fragment similar to the passage.

In addition, BeIR has only released three such models, all pre-trained on English corpora. So to apply them to Chinese scenarios, you currently have to translate the documents into English first and then translate the generated English questions back.

from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model.eval()

para = "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."

input_ids = tokenizer.encode(para, return_tensors='pt')
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        max_length=64,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=3)

print("Paragraph:")
print(para)

print("\nGenerated Queries:")
for i in range(len(outputs)):
    query = tokenizer.decode(outputs[i], skip_special_tokens=True)
    print(f'{i + 1}: {query}')

Then train an SBERT model with the generated dataset:

from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets
import os

train_examples = []
for para in paras:
    # iterate over all of your unlabeled passages
    for query in para['queries']:
        # adjust to your own data format; here each item is assumed to hold
        # the passage text under 'text' and its generated queries under 'queries'
        train_examples.append(InputExample(texts=[query, para['text']]))

# For the MultipleNegativesRankingLoss, it is important
# that the batch does not contain duplicate entries, i.e.
# no two equal queries and no two equal paragraphs.
# To ensure this, we use a special data loader
train_dataloader = datasets.NoDuplicatesDataLoader(train_examples, batch_size=64)

# Now we create a SentenceTransformer model from scratch
word_emb = models.Transformer('distilbert-base-uncased')
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# MultipleNegativesRankingLoss requires input pairs (query, relevant_passage)
# and trains the model so that it is suitable for semantic search
train_loss = losses.MultipleNegativesRankingLoss(model)


#Tune the model
num_epochs = 3
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=num_epochs, warmup_steps=warmup_steps, show_progress_bar=True)

os.makedirs('output', exist_ok=True)
model.save('output/programming-model')

Now new queries can be answered with the semantic search approach described above.
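
For example, a minimal sketch of querying the freshly trained model with the semantic_search helper from section 2.2 (the path is a placeholder and the 'text' field is assumed from the data format sketched above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('output/programming-model')

corpus = [p['text'] for p in paras]  # the passage texts used for training above
corpus_emb = model.encode(corpus, convert_to_tensor=True)

for text, score in semantic_search('what is python used for?', corpus_emb, topk=3, model=model, corpus=corpus):
    print(round(score, 4), text)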

3.4 CT

CT (Contrastive Tension) is another unsupervised training method. Its idea is to use two encoders to encode two copies of the input separately: for the same sentence, the outputs of encoder 1 and encoder 2 should be similar, while for different sentences the two encoders should produce dissimilar outputs. The training objective therefore maximizes the dot product of the two encodings in the former case and minimizes it in the latter.

Training code:

import math
from datetime import datetime
from sentence_transformers import models, losses, SentenceTransformer
import tqdm

## Training parameters
model_name = 'distilbert-base-uncased'  # change this to a local path if using a local model
batch_size = 16
pos_neg_ratio = 8   # batch_size must be divisible by pos_neg_ratio
num_epochs = 1
max_seq_length = 75
output_name = ''  # suffix appended to the model output path

model_output_path = 'output/train_ct{}-{}'.format(output_name, datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))

# Build the encoder; an SBERT pre-trained model can also be used here
word_embedding_model = models.Transformer(model_name, max_seq_length=max_seq_length)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

################# Read the train corpus  #################
train_sentences = ["Your set of sentences",
                   "Model will automatically add the noise",
                   "And re-construct it",
                   "You should provide at least 1k sentences"]

# For ContrastiveTension we need a special data loader to construct batches with the desired properties
train_dataloader = losses.ContrastiveTensionDataLoader(train_sentences, batch_size=batch_size, pos_neg_ratio=pos_neg_ratio)

# As the loss, we use losses.ContrastiveTensionLoss
train_loss = losses.ContrastiveTensionLoss(model)

warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)  # 10% of train data for warm-up

# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          warmup_steps=warmup_steps,
          optimizer_params={'lr': 5e-5},
          checkpoint_path=model_output_path,
          show_progress_bar=True,
          use_amp=False  # Set to True, if your GPU supports FP16 cores
          )

All in all, sentence-transformers is a very easy-to-use encoder, and the principles behind the functions shown above are easy to understand. On this basis, you can also build more creative functionality for your own application scenarios.

Original post: blog.csdn.net/weixin_44826203/article/details/119868241