Introduction to NLP Sentence Similarity


The following is adapted from a fellow blogger's post: Jingmi » Several methods for calculating sentence similarity in natural language processing

1. Statistics-based methods

1.1. Edit distance calculation

Edit distance, also known as Levenshtein distance, is the minimum number of editing operations required to convert one string into the other. The larger the distance, the more different the two strings are. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character.

For example, we have two strings: string and setting. If we want to convert string into setting, we need these two steps:

  • The first step is to add the character e between s and t.
  • The second step is to replace r with t.

Therefore, their edit distance is 2, which is the minimum number of editing steps (insertions, replacements, deletions) required to convert one string into the other.
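Conceptually, the distance can be computed with the classic dynamic-programming recurrence. Here is a minimal sketch for illustration (the distance library used below implements the same idea):

def levenshtein(s1, s2):
    # previous[j] holds the edit distance between the prefix of s1 processed so far and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            insert = current[j - 1] + 1
            delete = previous[j] + 1
            replace = previous[j - 1] + (c1 != c2)
            current.append(min(insert, delete, replace))
        previous = current
    return previous[-1]

print(levenshtein('string', 'setting'))  # 2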

So how to implement it in Python? We can use the distance library directly:

import distance

def edit_distance(s1, s2):
    return distance.levenshtein(s1, s2)

s1 = 'string'
s2 = 'setting'
print(edit_distance(s1, s2))

Here we directly use the levenshtein() method of the distance library and pass in two strings to obtain the edit distance of the two strings.
The running results are as follows:

2

We can install the distance library directly with pip3: pip3 install distance
With this, if we want to retrieve similar text, we can simply set an edit distance threshold. For example, set the threshold to 2. Here is an example:

import distance

def edit_distance(s1, s2):
    return distance.levenshtein(s1, s2)

strings = [
    '你在干什么',
    '你在干啥子',
    '你在做什么',
    '你好啊',
    '我喜欢吃香蕉'
]

target = '你在干啥'
results = list(filter(lambda x: edit_distance(x, target) <= 2, strings))
print(results)

Here we define several candidate strings and a target string, then filter with an edit distance threshold of 2; the final result contains the strings whose edit distance to the target is 2 or less. The running results are as follows:

['你在干什么', '你在干啥子']

In this way we can roughly filter out similar sentences, but we find that some sentences, such as '你在做什么', are not matched even though their meanings are indeed similar. Therefore edit distance is not an ideal approach, but it is simple and easy to use.

1.2. Calculation of Jaccard coefficient

The Jaccard coefficient, also known as the Jaccard similarity coefficient, is used to compare the similarity and diversity of finite sample sets: the larger the Jaccard coefficient, the higher the sample similarity.
Its calculation is very simple: divide the size of the intersection of the two samples by the size of their union. When the two samples are identical the result is 1, and when they are completely disjoint the result is 0.
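As a minimal sketch, treating each sentence simply as a set of characters (so repeated characters are counted only once):

def jaccard_set_similarity(s1, s2):
    # Jaccard coefficient over character sets: |intersection| / |union|
    set1, set2 = set(s1), set(s2)
    return len(set1 & set2) / len(set1 | set2)

print(jaccard_set_similarity('你在干嘛呢', '你在干什么呢'))  # 4 shared characters / 7 distinct characters ≈ 0.571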
A more general version works on term-frequency vectors, which also accounts for repeated characters. Let's implement it in Python with the help of scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
 
 
def jaccard_similarity(s1, s2):
    def add_space(s):
        return ' '.join(list(s))
    
    # insert spaces between characters so CountVectorizer treats each character as a token
    s1, s2 = add_space(s1), add_space(s2)
    # convert to a term-frequency (TF) matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # intersection: element-wise minimum of the two count vectors
    numerator = np.sum(np.min(vectors, axis=0))
    # union: element-wise maximum of the two count vectors
    denominator = np.sum(np.max(vectors, axis=0))
    # Jaccard coefficient
    return 1.0 * numerator / denominator
 
 
s1 = '你在干嘛呢'
s2 = '你在干什么呢'
print(jaccard_similarity(s1, s2))

Here we use CountVectorizer from the scikit-learn library to compute the TF matrix of the two sentences, then use NumPy to compute their intersection and union, and finally the Jaccard coefficient.

What is worth noting here is the usage of CountVectorizer. Through its fit_transform() method we can convert the strings into a term-frequency matrix. For the two sentences '你在干嘛呢' and '你在干什么呢', CountVectorizer first collects the distinct characters to build the vocabulary, which is:

['么', '什', '你', '呢', '嘛', '在', '干']

The vocabulary itself can be obtained with cv.get_feature_names() (get_feature_names_out() in newer versions of scikit-learn).
After conversion, the vectors variable becomes:

[[0 0 1 1 1 1 1]
 [1 1 1 1 0 1 1]]

Each row holds the frequency of every vocabulary entry in the corresponding sentence; since there are two sentences, the result is a two-dimensional array of length 2. For example, the first sentence '你在干嘛呢' does not contain the character '么', so the entry in the '么' column of the first row is 0, and so on.

Then np.min() with axis=0 takes the element-wise minimum across the two rows, which gives the intersection counts, while np.max() with axis=0 takes the element-wise maximum, which gives the union counts.
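To make that concrete, here is what those two calls produce on the matrix above:

import numpy as np

vectors = np.array([[0, 0, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 0, 1, 1]])
print(np.min(vectors, axis=0))  # [0 0 1 1 0 1 1] -> intersection counts, sum = 4
print(np.max(vectors, axis=0))  # [1 1 1 1 1 1 1] -> union counts, sum = 7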

Summing each of these gives the intersection size and the union size, and their quotient is the Jaccard coefficient. The result is as follows:

0.5714285714285714

The larger this value is, the more similar the two strings are, and vice versa. We can therefore also use this method for filtering by setting a similarity threshold.
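For instance, reusing the jaccard_similarity() function above and the earlier candidate list, a quick sketch of threshold filtering (0.5 is an arbitrary threshold chosen for illustration):

strings = [
    '你在干什么',
    '你在干啥子',
    '你在做什么',
    '你好啊',
    '我喜欢吃香蕉'
]

target = '你在干啥'
results = list(filter(lambda x: jaccard_similarity(x, target) >= 0.5, strings))
print(results)  # ['你在干什么', '你在干啥子']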

1.3. TF calculation

The third approach is to compute the cosine similarity of the two vectors in the TF matrix directly, i.e. the cosine of the angle between the two vectors, which is their dot product divided by the product of their norms. The formula is as follows:

$$\cos\theta = \frac{a \cdot b}{|a||b|}$$

We have obtained the TF matrix above. Now we only need to solve for the cosine value of the angle between the two vectors. The code is as follows:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from scipy.linalg import norm

def tf_similarity(s1, s2):
    def add_space(s):
        return ' '.join(list(s))
    
    # insert spaces between characters so each character is a token
    s1, s2 = add_space(s1), add_space(s2)
    # convert to a TF matrix
    cv = CountVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # cosine similarity: dot product divided by the product of the norms
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))


s1 = '你在干嘛呢'
s2 = '你在干什么呢'
print(tf_similarity(s1, s2))

Here np.dot() computes the dot product of the two vectors and norm() computes each vector's magnitude; combining them gives the TF-based cosine similarity of the two sentences. The results are as follows:

0.7302967433402214

1.4. TF-IDF calculation

In addition to the TF coefficient, we can also compute a TF-IDF coefficient. TF-IDF adds IDF (inverse document frequency) information to the term frequency TF. If you are unfamiliar with it, you can read Ruan Yifeng's explanation, which covers TF-IDF very thoroughly: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html.
IDF(t) can be understood as follows: suppose a word appears in $n$ documents out of a collection of $N$ documents. The idea comes from information theory: the probability of the word appearing in a document is $n/N$, so the information content of observing the word in a document is $-\log(n/N)$. This is similar in spirit to information entropy $-P(x)\log P(x)$, and to the mutual information (information gain) used for filter-based feature selection in data mining and in decision trees. Rewriting $-\log(n/N)$ as $\log(N/n)$ and adding smoothing to avoid zeros (much like Laplace smoothing in Naive Bayes) gives the smoothed IDF formulas used in practice.
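To make the idea concrete, here is a tiny sketch of a smoothed IDF on a toy document collection (this log((N+1)/(n+1)) + 1 variant is the smoothing scikit-learn's TfidfVectorizer applies by default; other libraries use slightly different constants):

import math

def smoothed_idf(n_containing, n_docs):
    # smoothed inverse document frequency: log(N / n) with +1 smoothing
    # so that a term appearing in every document still gets a finite, non-zero weight
    return math.log((n_docs + 1) / (n_containing + 1)) + 1

# toy collection of 4 documents
print(smoothed_idf(1, 4))  # rare word (1 of 4 docs) -> larger IDF, about 1.92
print(smoothed_idf(4, 4))  # ubiquitous word (all 4 docs) -> smaller IDF, exactly 1.0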

Next we again use scikit-learn, this time the TfidfVectorizer module, to implement it. The code is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.linalg import norm


def tfidf_similarity(s1, s2):
    def add_space(s):
        return ' '.join(list(s))
    
    # insert spaces between characters so each character is a token
    s1, s2 = add_space(s1), add_space(s2)
    # convert to a TF-IDF matrix
    cv = TfidfVectorizer(tokenizer=lambda s: s.split())
    corpus = [s1, s2]
    vectors = cv.fit_transform(corpus).toarray()
    # cosine similarity of the two TF-IDF vectors
    return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))


s1 = '你在干嘛呢'
s2 = '你在干什么呢'
print(tfidf_similarity(s1, s2))

The vectors variable here holds the TF-IDF values; its content is as follows:

[[0.         0.         0.4090901  0.4090901  0.57496187 0.4090901  0.4090901 ]
 [0.49844628 0.49844628 0.35464863 0.35464863 0.         0.35464863 0.35464863]]

The running results are as follows:

0.5803329846765686

Therefore, we can also calculate the similarity through the TFIDF coefficient.

1.5. BM25

BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of the inter-relationship between the query terms within a document (e.g., their relative proximity). It is not a single function, but actually a whole family of scoring functions, with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.

The BM25 algorithm is commonly used for search relevance scoring. In one sentence, the main idea is: perform morpheme analysis on the query to produce morphemes $q_i$; then, for each search result document $d$, compute the relevance score between each morpheme $q_i$ and $d$; finally, take a weighted sum of these scores to obtain the relevance score between the query and $d$.

The general formula of the BM25 algorithm is as follows:

$$Score(Q,d) = \sum_i^n W_i \cdot R(q_i,d)$$

where $Q$ denotes the query; $q_i$ denotes a morpheme obtained by analyzing $Q$ (for Chinese, we can treat each word produced by segmenting the query as a morpheme $q_i$); $d$ denotes a search result document; $W_i$ denotes the weight of morpheme $q_i$; and $R(q_i, d)$ denotes the relevance score between morpheme $q_i$ and document $d$.

Let's look first at how to define $W_i$. There are many ways to measure the relevance weight of a term with respect to a document; the most commonly used is IDF. Taking IDF as an example, the formula is:

$$IDF(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

where $N$ is the total number of documents in the index and $n(q_i)$ is the number of documents containing $q_i$.

From the definition of IDF, for a given document collection, the more documents contain $q_i$, the lower the weight of $q_i$. In other words, when many documents contain $q_i$, $q_i$ has little discriminative power, so its importance in judging relevance is low.

Next, let's look at the relevance score $R(q_i, d)$ between morpheme $q_i$ and document $d$. The general form of the relevance score in BM25 is:
$$R(q_i,d) = \frac{f_i(k_1+1)}{f_i+K} \cdot \frac{qf_i(k_2+1)}{qf_i+k_2}$$

$$K = k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)$$
where $k_1$, $k_2$, and $b$ are tuning parameters, usually set empirically; typically $k_1 = 2$ and $b = 0.75$. $f_i$ is the frequency of $q_i$ in document $d$, $qf_i$ is the frequency of $q_i$ in the query, $dl$ is the length of document $d$, and $avgdl$ is the average length of all documents. Since in most cases $q_i$ appears only once in the query, i.e. $qf_i = 1$, the formula can be simplified to:
$$R(q_i,d) = \frac{f_i(k_1+1)}{f_i+K}$$
From the definition of $K$, the parameter $b$ adjusts the impact of document length on relevance: the larger $b$ is, the greater the impact of document length on the relevance score, and vice versa. The longer the document is relative to the average, the larger $K$ becomes and the smaller the relevance score. Intuitively, a longer document has a greater chance of containing $q_i$, so for the same $f_i$, a long document should be considered less relevant to $q_i$ than a short one.

In summary, the relevance score formula of the BM25 algorithm can be summarized as:

$$Score(Q,d) = \sum_i^n IDF(q_i) \cdot \frac{f_i(k_1+1)}{f_i + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{avgdl}\right)}$$
BM25 thus considers four factors: the IDF factor, the document length factor, the document term frequency factor, and the query term frequency factor. The BM25 implementation inside Lucene is simpler than the formula above, which I personally find less satisfactory.
As can be seen from the BM25 formula, by choosing different morpheme analysis methods, morpheme weighting methods, and morpheme-document relevance functions, we can derive different relevance scoring schemes, which gives us considerable flexibility when designing the algorithm.
Here is a simple source code demo, see my github for details: BM25
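The linked demo lives on GitHub; separately, here is a rough, self-contained sketch of the final formula above (not that demo), assuming the query and documents are already segmented into word lists and using the $k_1 = 2$, $b = 0.75$ values mentioned earlier:

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=2.0, b=0.75):
    # doc_freqs maps each term to the number of documents that contain it
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    K = k1 * (1 - b + b * dl / avgdl)
    score = 0.0
    for q in query_terms:
        n_q = doc_freqs.get(q, 0)
        # note: with very small collections this IDF can go negative for very common terms
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))
        score += idf * tf[q] * (k1 + 1) / (tf[q] + K)
    return score

# toy usage with pre-segmented documents
docs = [
    ['你', '在', '干', '什么'],
    ['你', '在', '做', '什么'],
    ['你好', '啊'],
    ['我', '喜欢', '吃', '香蕉'],
]
doc_freqs = Counter(term for d in docs for term in set(d))
avgdl = sum(len(d) for d in docs) / len(docs)
query = ['你', '在', '干', '啥']
# the first document, which shares the comparatively rare term '干' with the query, scores highest
for d in docs:
    print(''.join(d), bm25_score(query, d, doc_freqs, len(docs), avgdl))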

For more details, see: Classic search algorithm: BM25 principle (the format is a bit messy, it is recommended to paste it into Typora for viewing)

2. Methods based on deep learning

2.1. Word2Vec calculation

The above methods are all based on statistics, and statistics-based methods cannot capture semantic similarity. The following method is based on deep learning and solves semantic similarity matching to a certain extent.

Word2Vec, as the name suggests, is actually the process of converting each word into a vector. If you don’t understand, you can refer to: https://blog.csdn.net/itplus/article/details/37969519.

Here we can directly download a pre-trained Word2Vec model. The model link is: https://pan.baidu.com/s/1TZ8GII0CEX32ydjsfMc0zw. It is a 64-dimensional Word2Vec model trained on news, Baidu Encyclopedia, and novel data; the data volume is large and the overall quality is quite good. We can download it and use it directly. Here we use the news_12g_baidubaike_20g_novel_90g_embedding_64.bin file and then implement Sentence2Vec. The code is as follows:

import gensim
import jieba
import numpy as np
from scipy.linalg import norm

model_file = './word2vec/news_12g_baidubaike_20g_novel_90g_embedding_64.bin'
model = gensim.models.KeyedVectors.load_word2vec_format(model_file, binary=True)

def vector_similarity(s1, s2):
    def sentence_vector(s):
        # segment the sentence, then average the word vectors to get a sentence vector
        words = jieba.lcut(s)
        v = np.zeros(64)
        for word in words:
            v += model[word]
        v /= len(words)
        return v

    v1, v2 = sentence_vector(s1), sentence_vector(s2)
    # cosine similarity between the two sentence vectors
    return np.dot(v1, v2) / (norm(v1) * norm(v2))

To obtain the sentence vector, we first segment the sentence, look up the vector for each word, then sum all the vectors and take the average; this gives the sentence vector. Finally, we compute the cosine of the angle between the two sentence vectors.

The calling example is as follows:

s1 = '你在干嘛'
s2 = '你正做什么'
vector_similarity(s1, s2)

The result is as follows:

0.6701133967824016

At this time, if we go back to the original example to see the effect:

strings = [
    '你在干什么',
    '你在干啥子',
    '你在做什么',
    '你好啊',
    '我喜欢吃香蕉'
]

target = '你在干啥'

for string in strings:
    print(string, vector_similarity(string, target))

Still using the previous example, let’s take a look at their matching results. The running results are as follows:

你在干什么 0.8785495016487204
你在干啥子 0.9789649689827049
你在做什么 0.8781992402695274
你好啊 0.5174225914249863
我喜欢吃香蕉 0.582990841450621

It can be seen that the similarity of similar sentences reaches above 0.8, while the similarity of unrelated sentences stays below 0.6. The separation is large, so with Word2Vec we can incorporate some semantic information into the judgment, and the effect is clearly much better.

So overall, Word2Vec-based similarity calculation works very well.
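One caveat about the sentence_vector() code above: model[word] raises a KeyError for any segmented word that is missing from the pretrained vocabulary. A minimal variant that simply skips unknown words (assuming the same 64-dimensional model loaded above) could look like this:

def sentence_vector_safe(s):
    # average only the vectors of words that exist in the model's vocabulary
    words = jieba.lcut(s)
    vectors = [model[word] for word in words if word in model]
    if not vectors:
        # no known words at all: fall back to a zero vector
        return np.zeros(64)
    return np.mean(vectors, axis=0)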

The above sections cover the basic methods for sentence similarity calculation and their Python implementations. The code for this section is available at: https://github.com/AIDeepLearning/SentenceDistance.


In addition, there are more advanced research results in academia; for those, you can refer to some answers on Zhihu: https://www.zhihu.com/question/29978268/answer/54399062.


Origin: blog.csdn.net/u014665013/article/details/90045408