How to compare sentence similarities using embeddings from BERT

KOB :

I am using the HuggingFace Transformers package to access pretrained models. As my use case needs functionality for both English and Arabic, I am using the bert-base-multilingual-cased pretrained model. I need to be able to compare the similarity of sentences using something such as cosine similarity. To use this, I first need to get an embedding vector for each sentence, and can then compute the cosine similarity.

Firstly, what is the best way to extract the semantic embedding from the BERT model? Would taking the last hidden state of the model after being fed the sentence suffice?

import torch
from transformers import BertModel, BertTokenizer

model_class = BertModel
tokenizer_class = BertTokenizer
pretrained_weights = 'bert-base-multilingual-cased'

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

sentence = 'this is a test sentence'

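# Tokenize and add the [CLS] and [SEP] special tokens; shape [1, n]
input_ids = torch.tensor([tokenizer.encode(sentence, add_special_tokens=True)])
with torch.no_grad():  # inference only, so no gradients are needed
    output_tuple = model(input_ids)
    last_hidden_states = output_tuple[0]  # shape [1, n, hidden_size]

print(last_hidden_states.size(), last_hidden_states)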

Secondly, if this is a sufficient way to get embeddings from my sentence, I now have another problem: the embeddings have different shapes depending on the length of the original sentence. The output shape is [1, n, hidden_size] (hidden_size is 768 for this model), where n is the number of tokens and varies from sentence to sentence.

In order to compute two vectors' cosine similarity, they need to be the same length. How can I do this here? Could something as naive as first summing across axis=1 still work? What other options do I have?

Swier :

You can use the [CLS] token as a representation for the entire sequence. This token is typically prepended to your sentence during the preprocessing step, and it is typically used for classification tasks (see figure 2 and paragraph 3.2 in the BERT paper).

It is the very first token of the sequence, so its vector sits at index 0 along the token axis of the last hidden state.

Alternatively, you can take the average vector of the sequence (as you suggest, over the token axis, i.e. axis=1), which can yield better results according to the huggingface documentation (3rd tip).
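Both options can be sketched in a few lines. This is only a minimal illustration: the helper names cls_pooling and mean_pooling are my own, and the random tensor stands in for the model's last hidden state so the snippet runs without downloading weights. The attention mask is an assumption about how you would batch sentences of different lengths (padded positions are excluded from the average).

```python
import torch

def cls_pooling(last_hidden_state: torch.Tensor) -> torch.Tensor:
    # [CLS] is the first token, so take index 0 along the token axis.
    # Input shape [batch, n, hidden_size] -> output [batch, hidden_size]
    return last_hidden_state[:, 0, :]

def mean_pooling(last_hidden_state: torch.Tensor,
                 attention_mask: torch.Tensor) -> torch.Tensor:
    # Average over the token axis (dim=1), ignoring padded positions.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [batch, n, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [batch, hidden_size]
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # avoid division by zero
    return summed / counts

# Stand-in data: 2 sentences padded to 5 tokens, hidden_size 768.
# The second sentence has only 3 real tokens (mask is 0 on padding).
torch.manual_seed(0)
hidden = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])

cls_vecs = cls_pooling(hidden)
mean_vecs = mean_pooling(hidden, mask)
print(cls_vecs.shape, mean_vecs.shape)
```

With the real model, the inputs would come from something like enc = tokenizer(sentences, padding=True, return_tensors='pt') and model(**enc).last_hidden_state, with enc['attention_mask'] as the mask. Either way, every sentence ends up as one fixed-size vector regardless of n.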

Note that BERT was not designed for sentence similarity using the cosine distance, though in my experience it does yield decent results.
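Once each sentence is reduced to a fixed-size vector, the comparison itself is short. A small sketch using torch.nn.functional.cosine_similarity; the three toy vectors are made up for illustration and stand in for pooled sentence embeddings.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for pooled sentence embeddings, shape [1, dim]
a = torch.tensor([[1.0, 2.0, 3.0]])
b = torch.tensor([[2.0, 4.0, 6.0]])   # same direction as a
c = torch.tensor([[-1.0, 0.0, 1.0]])  # different direction

# dim=1 compares along the embedding axis, giving one score per row
sim_ab = F.cosine_similarity(a, b, dim=1)
sim_ac = F.cosine_similarity(a, c, dim=1)
print(sim_ab.item(), sim_ac.item())
```

Scores range from -1 to 1, and vectors pointing in the same direction score 1 regardless of their magnitude, which is why summing or averaging over tokens gives the same ranking here.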
