Text similarity calculation - HanLP word segmentation + cosine similarity algorithm

1. Introduction to cosine similarity

Cosine similarity measures how similar two vectors are via the cosine of the angle between them. A cosine close to 1 means the angle approaches 0 degrees and the two vectors are very similar; a cosine close to 0 means the angle approaches 90 degrees and the two vectors are dissimilar.


So how is cosine similarity calculated?

The law of cosines is a mathematical formula that relates the lengths of the sides of a triangle to the cosine of one of its angles.

The law of cosines is expressed as:

c^{2}=a^{2}+b^{2}-2ab\cdot cos\gamma

The Pythagorean theorem is a special case of the law of cosines: when the angle \gamma is a right angle, cos\gamma = 0 and the formula simplifies to c^{2}=a^{2}+b^{2}.

Rearranging the law of cosines gives the formula for the cosine:

cos\gamma =\frac{a^{2}+b^{2}-c^{2}}{2ab}

Here a, b, and c are the lengths of the three sides. Now suppose vector \vec{a} is (x_{1}, y_{1}) and vector \vec{b} is (x_{2}, y_{2}). The length of a vector (also called its modulus, written by wrapping the vector in double vertical bars) is computed from its components:

\left \| \vec{a} \right \|=\sqrt{x_{1}^{2}+y_{1}^{2}}, that is, a^{2}=x_{1}^{2}+y_{1}^{2}; similarly, \left \| \vec{b} \right \|=\sqrt{x_{2}^{2}+y_{2}^{2}}, that is, b^{2}=x_{2}^{2}+y_{2}^{2},

and c=\sqrt{(x_{1}-x_{2})^{2}+(y_{1}-y_{2})^{2}}

Substituting these expressions for a, b, and c into the cosine formula and simplifying yields

cos\theta =\frac{x_{1}x_{2}+y_{1}y_{2}}{\sqrt{x_{1}^{2}+y_{1}^{2}}\times \sqrt{x_{2}^{2}+y_{2}^{2}}}

That is, cos\theta =\frac{\vec{a}\cdot \vec{b}}{\left \| \vec{a} \right \|\times \left \| \vec{b} \right \|}: the numerator is the dot product of vectors a and b, and the denominator is the product of their L2 norms, each of which is the square root of the sum of the squares of all the vector's components. In this form the formula applies to vectors of any dimension, not just two.
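As a quick check with concrete numbers (an illustrative example of ours, not from the original derivation), take \vec{a}=(1,2) and \vec{b}=(2,4):

cos\theta =\frac{1\times 2+2\times 4}{\sqrt{1^{2}+2^{2}}\times \sqrt{2^{2}+4^{2}}}=\frac{10}{\sqrt{5}\times \sqrt{20}}=\frac{10}{10}=1

The two vectors point in exactly the same direction, so the cosine is 1.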

2. The idea behind calculating text similarity

Sentence A: I want to raise a cow so that I can drink fresh milk every day.

Sentence B: I want to drink fresh milk every day, so I plan to raise a cow.

1. Word segmentation

       Segmentation result of A: [I, want, raise, a, cow, in this way, just, can, every day, drink, fresh, of, milk]

       Segmentation result of B: [I, want, every day, drink, fresh, of, milk, so, plan, raise, a, cow]

       (The tokens are English glosses of the segmented Chinese words, which is why function words such as "of" appear as tokens of their own.)
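To see what step 1 actually produces, here is a minimal standalone sketch (the class name SegmentDemo is our own; the full implementation follows in section 3). Term is HanLP's token type, carrying the word text and its part-of-speech tag:

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

import java.util.List;

public class SegmentDemo {
    public static void main(String[] args) {
        // Segment sentence A (in the original Chinese); punctuation comes back as Terms too
        List<Term> terms = HanLP.segment("我想养一头奶牛,这样就可以每天喝新鲜的牛奶。");
        for (Term term : terms) {
            // term.word is the token text; print one token per line
            System.out.println(term.word);
        }
    }
}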

2. Take the union of the two segmentation results

        [I, want, raise, a, cow, in this way, just, can, every day, drink, fresh, of, milk, so, plan]

3. Write the word frequency vector

Using the union from step 2 as the vocabulary, write a word frequency vector for each of sentences A and B: the vector has one position per word in the union, and each position holds the number of times that word occurs in the sentence.

        Sentence A: (1,1,1,1,1,1,1,1,1,1,1,1,1,0,0)

        Sentence B: (1,1,1,1,1,0,0,0,1,1,1,1,1,1,1)

At this point, the question becomes how to calculate the similarity of these two vectors.

4. Calculate the cosine similarity

Substitute the two word frequency vectors into the cosine similarity formula:

cos(\theta )=\frac{1\times 1+1\times 1+1\times 1+1\times 1+1\times 1+1\times 0+1\times 0+1\times 0+1\times 1+1\times 1+1\times 1+1\times 1+1\times 1+0\times 1+0\times 1}{\sqrt{1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+0^{2}+0^{2}}\times \sqrt{1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+0^{2}+0^{2}+0^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}+1^{2}}}

 =\frac{10}{\sqrt{13}\times \sqrt{12}}\approx 0.80064

The cosine of the angle comes out to about 0.80064, fairly close to 1, so sentence A and sentence B above can be considered quite similar.

3. Code implementation

To use HanLP, you need to import maven dependencies first:

        <!-- https://mvnrepository.com/artifact/com.hankcs/hanlp -->
        <dependency>
            <groupId>com.hankcs</groupId>
            <artifactId>hanlp</artifactId>
            <version>portable-1.7.2</version>
        </dependency>
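If the project builds with Gradle rather than Maven, the same artifact can be declared as follows (equivalent coordinates; this snippet is ours, not from the original post):

        // Gradle equivalent of the Maven dependency above
        implementation 'com.hankcs:hanlp:portable-1.7.2'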

The Java code is as follows:

package com.scb.dss.udf;

import com.hankcs.hanlp.HanLP;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class CosineSimilarity {
    /**
     * Compute the similarity of two texts using the cosine similarity algorithm.
     *
     * @param sentence1 the first text
     * @param sentence2 the second text
     * @return the cosine similarity (between 0 and 1 for word frequency vectors)
     */
    public static double getSimilarity(String sentence1, String sentence2) {
        System.out.println("Step 1. Word segmentation");
        List<String> sent1Words = getSplitWords(sentence1);
        System.out.println(sentence1 + "\nSegmentation result: " + sent1Words);
        List<String> sent2Words = getSplitWords(sentence2);
        System.out.println(sentence2 + "\nSegmentation result: " + sent2Words);

        System.out.println("Step 2. Union of the word lists");
        List<String> allWords = mergeList(sent1Words, sent2Words);
        System.out.println(allWords);


        int[] statistic1 = statistic(allWords, sent1Words);
        int[] statistic2 = statistic(allWords, sent2Words);

        // Dot product of vector A and vector B
        double dividend = 0;
        // Sum of the squares of all components of vector A
        double divisor1 = 0;
        // Sum of the squares of all components of vector B
        double divisor2 = 0;
        // Cosine similarity: accumulate the dot product and both squared norms in one pass
        for (int i = 0; i < statistic1.length; i++) {
            dividend += statistic1[i] * statistic2[i];
            divisor1 += Math.pow(statistic1[i], 2);
            divisor2 += Math.pow(statistic2[i], 2);
        }

        System.out.println("Step 3. Word frequency vectors");
        for(int i : statistic1) {
            System.out.print(i+",");
        }
        System.out.println();
        for(int i : statistic2) {
            System.out.print(i+",");
        }
        System.out.println();

        // Dot product of A and B / (square root of A's sum of squares * square root of B's sum of squares)
        return dividend / (Math.sqrt(divisor1) * Math.sqrt(divisor2));
    }

    // 3. Count word frequencies against the union word list
    private static int[] statistic(List<String> allWords, List<String> sentWords) {
        int[] result = new int[allWords.size()];
        for (int i = 0; i < allWords.size(); i++) {
            // Collections.frequency returns how many times the word occurs in the sentence's token list
            result[i] = Collections.frequency(sentWords, allWords.get(i));
        }
        return result;
    }

    // 2. Take the union of the two token lists (duplicates removed, order preserved)
    private static List<String> mergeList(List<String> list1, List<String> list2) {
        List<String> result = new ArrayList<>();
        result.addAll(list1);
        result.addAll(list2);
        return result.stream().distinct().collect(Collectors.toList());
    }

    // 1. Word segmentation
    private static List<String> getSplitWords(String sentence) {
        // Punctuation marks are segmented into Terms of their own; filter them out here
        return HanLP.segment(sentence).stream().map(a -> a.word).filter(s -> !"`~!@#$^&*()=|{}':;',\\[\\].<>/?~!@#¥……&*()——|{}【】‘;:”“'。,、? ".contains(s)).collect(Collectors.toList());
    }
}

Test code:

package com.scb.dss.udf;

import org.junit.Test;

public class CosineSimilarityTest {

    @Test
    public void testGetSimilarity() throws Exception {
        // text3 and text4 are sentences A and B from section 2, in the original Chinese
        String text3 = "我想养一头奶牛,这样就可以每天喝新鲜的牛奶。";
        String text4 = "我想每天喝新鲜的牛奶,所以打算养一头奶牛。";
        System.out.println("Text similarity: " + CosineSimilarity.getSimilarity(text3, text4));
    }
}
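Assuming HanLP segments the two sentences into exactly the token lists shown in section 2, this test should print a similarity of roughly 0.80064, matching the hand calculation above.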

