Unsupervised semantic similarity

How to compute semantic similarity between texts when you have no paired labels

1. Pitfalls of BERT

With BERT, the computed similarities between sentences all come out very close to one another. Fine-tuning on my own dataset helped a little. My approach was to take the [CLS] output directly as the sentence vector and then compute cosine similarity (a sketch follows the list below), and the results were a failure. The main problems:

  • The distances between different sentence pairs are all very close to each other.
  • Sentence length also affects the distance: sentences of similar length end up closer together. To address this I tried averaging the word vectors (summing each token's vector and dividing by the number of characters), but the result was much the same.
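
Here is a minimal sketch of the [CLS]-as-sentence-vector approach described above, assuming the HuggingFace transformers library; the model name bert-base-chinese and the example sentences are my assumptions, not necessarily what I used originally.

```python
# Minimal sketch: [CLS] output as the sentence vector, then cosine similarity.
# Assumes HuggingFace transformers; bert-base-chinese is an example model name.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def cls_vector(sentence: str) -> torch.Tensor:
    """Return the final-layer hidden state of the [CLS] token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # [CLS] is the first token

a = cls_vector("今天天气很好")
b = cls_vector("我想去吃饭")
# The problem described above: this tends to be high for almost any pair.
print(torch.nn.functional.cosine_similarity(a, b, dim=0).item())
```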

Later I read a Zhihu answer that addresses this; the original is here: https://www.zhihu.com/question/354129879

Pooling the token vectors does indeed work better than the [CLS] output.
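
A sketch of that pooling, under the same assumptions as above (transformers, bert-base-chinese as an example model); mean pooling with the attention mask is one common way to do it, and masking out padding is my addition.

```python
# Mean pooling over the token vectors instead of taking [CLS].
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def mean_pooled_vector(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state[0]             # (seq_len, hidden_size)
    mask = inputs["attention_mask"][0].unsqueeze(-1)  # (seq_len, 1)
    # Average only over real tokens, ignoring padding positions.
    return (hidden * mask).sum(dim=0) / mask.sum()
```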

In addition, cosine similarity only measures the angle between the two vectors; you can switch to Euclidean distance instead, but the difference is not big.
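
For comparison, here are both metrics on a pair of vectors (the random vectors are just stand-ins for sentence embeddings):

```python
import torch

a = torch.randn(768)  # stand-ins for two sentence vectors
b = torch.randn(768)

cos_sim = torch.nn.functional.cosine_similarity(a, b, dim=0).item()  # angle only
euclidean = torch.dist(a, b).item()  # L2 distance, also reflects magnitude
print(cos_sim, euclidean)
```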

2. Pitfalls of Cilin

Cilin (the Tongyici Cilin thesaurus) did not work well either, worse than BERT. I am not sure whether the problem lies in my method. The method: compute the distance between each word in sentence A and every word in sentence B, and take the nearest word. This is essentially Word Mover's Distance, except that the word-to-word distances are computed with Cilin.
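
Here is a rough sketch of that matching scheme. The tiny CILIN dictionary and the shared-prefix scoring are toy stand-ins for a real Cilin loader, and jieba is assumed for word segmentation; none of this is my original code.

```python
# Greedy nearest-word matching (a simplified Word Mover's Distance) where
# word-to-word similarity comes from Tongyici Cilin category codes.
import jieba

# Toy fragment: word -> Cilin category code. A real Cilin file has far more.
CILIN = {"天气": "Bn01A01", "气候": "Bn01A02", "很好": "Ed01A01", "不错": "Ed01A02"}

def cilin_similarity(w1: str, w2: str) -> float:
    """Score two words by how much of their Cilin code prefix they share."""
    c1, c2 = CILIN.get(w1), CILIN.get(w2)
    if c1 is None or c2 is None:  # Cilin's coverage is limited (see below)
        return 1.0 if w1 == w2 else 0.0
    shared = 0
    for x, y in zip(c1, c2):
        if x != y:
            break
        shared += 1
    return shared / max(len(c1), len(c2))

def sentence_similarity(sent_a: str, sent_b: str) -> float:
    words_a, words_b = list(jieba.cut(sent_a)), list(jieba.cut(sent_b))
    # For each word in A, keep only its nearest match in B, then average.
    best = [max(cilin_similarity(wa, wb) for wb in words_b) for wa in words_a]
    return sum(best) / len(best)

print(sentence_similarity("今天天气很好", "今天气候不错"))
```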

There are two reasons this failed:

  • Cilin's vocabulary coverage is still limited.
  • The method only measures the similarity between words and ignores their importance. Even multiplying by tf-idf weights did not improve the results. My analysis: word segmentation introduces noise, the Cilin dictionary is missing important words, and tf-idf itself is fine, but overall the method just does not work.

You can use tf-idf to filter out words with unimportant weights, but for short texts there is little that can be filtered, as the sketch below illustrates.
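
A sketch of that filtering, assuming scikit-learn and jieba; the toy corpus and the 0.1 threshold are illustrative only.

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["今天天气很好", "今天天气不错", "我想去吃饭"]  # toy corpus
vectorizer = TfidfVectorizer(tokenizer=jieba.cut, token_pattern=None)
tfidf = vectorizer.fit_transform(corpus)
vocab = vectorizer.get_feature_names_out()

def important_words(doc_index: int, threshold: float = 0.1) -> list:
    """Keep only the words whose tf-idf weight exceeds the threshold."""
    row = tfidf[doc_index].toarray()[0]
    return [vocab[i] for i, weight in enumerate(row) if weight > threshold]

# For short texts nearly every word clears the threshold, so little is filtered.
print(important_words(0))
```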

 


Original post: https://blog.csdn.net/qq_20849045/article/details/109333157