An introduction to the principle of text similarity calculation based on the SimHash algorithm

Background

One day last week, I searched the Internet, pieced together various incomplete code snippets, dozens of SimHash projects on GitHub, and dozens of related articles, and finally arrived at a reasonably accurate Java implementation of the SimHash algorithm.

Writing things up is a simple yardstick for testing whether you have mastered a knowledge point. This article introduces in detail the principle and implementation process of similar-text retrieval based on the SimHash algorithm.

Application of text similarity

Recently I have been working on a vulnerability-database crawler project that needs to analyze and merge vulnerability information from several vulnerability websites. Entries for the same vulnerability are basically similar across sites, but the wording on different websites may differ slightly. How can records published by different websites that describe essentially the same vulnerability be merged?

This involves calculating text similarity. The basic idea is: when parsing a web vulnerability, first search the system's vulnerability database with the vulnerability summary or title to see whether a similar vulnerability already exists. If it does, treat the two as the same vulnerability and simply add the new source to the existing record.

The classic solution to this problem is the SimHash algorithm, which Google uses for massive-scale text deduplication. It is a locality-sensitive hash (full name: Locality Sensitive Hashing). The full name and the abbreviation "Sim" do not quite match; I don't know why!

The basic process of using SimHash to judge text similarity is:

  1. Calculate the SimHash value of each of the two texts, obtaining two fixed-length binary sequences;
  2. XOR the two binary sequences and count the 1 bits in the result; that count is the Hamming distance between the two texts;
  3. If the Hamming distance is less than 3, the two texts are considered similar (see the sketch below).
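
In Java, steps 2 and 3 boil down to an XOR plus a bit count. Here is a minimal sketch, assuming a `simHash64` method (outlined later in this article) that returns the 64-bit fingerprint of a text as a `long`; the class and method names are mine, not from any library:

```java
public final class SimHashCheck {

    // texts whose fingerprints differ in fewer than 3 bits are treated as similar
    static final int HAMMING_THRESHOLD = 3;

    // Hamming distance = number of 1 bits in the XOR of the two fingerprints
    static int hammingDistance(long hashA, long hashB) {
        return Long.bitCount(hashA ^ hashB);
    }

    static boolean isSimilar(long hashA, long hashB) {
        return hammingDistance(hashA, hashB) < HAMMING_THRESHOLD;
    }
}
```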

Introduction to the SimHash algorithm

Java has HashMap, which everyone is familiar with. It applies a function to a key to determine where a piece of data is stored; to avoid hash collisions, the chosen function should scatter values as widely as possible.

SimHash is different from the hashes we are familiar with: it must generate the same or similar hash values for the same or similar texts, so that the closeness of two hash values reflects the similarity of the input texts.

Its principle is illustrated below:

(Figure: SimHash principle diagram)

The basic process is:

  1. Word segmentation: segment the text to obtain N words word_1, word_2, word_3, …, word_n;
  2. Weighting: compute the word frequencies of the text and set a reasonable weight for each word: weight_1, weight_2, weight_3, …, weight_n;
  3. Hash: compute the hash value hash_i of each word, obtaining a fixed-length binary sequence, generally 64 bits, or 128 bits;
  4. Bit weighting: for each word_i, walk the bits of hash_i, turning each 1 into the positive weight weight_i and each 0 into -weight_i, which gives a new sequence weightHash_i;
  5. Weight accumulation: add the weightHash_i sequences together bit by bit, obtaining a sequence lastHash in which every bit is the accumulated weight of all the words;
  6. Dimensionality reduction: convert lastHash into the 0/1 sequence simHash: set positions whose accumulated weight is greater than zero to 1 and the remaining positions to 0. This is the locality-sensitive hash value of the entire text, namely its fingerprint (a code sketch of all six steps follows this list).
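
The six steps map almost directly onto code. Below is a minimal 64-bit Java sketch under deliberately simple assumptions: whitespace splitting stands in for real word segmentation, raw term frequency serves as the weight, and a toy polynomial hash stands in for a proper 64-bit hash; in practice you would swap in a real segmenter and a hash such as Murmur, as discussed below:

```java
import java.util.HashMap;
import java.util.Map;

public final class SimpleSimHash {

    // Steps 1 + 2: naive whitespace "segmentation" and term-frequency weights.
    // A real implementation would use a segmenter such as hanlp or IKAnalyzer.
    static Map<String, Integer> termWeights(String text) {
        Map<String, Integer> weights = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                weights.merge(word, 1, Integer::sum);
            }
        }
        return weights;
    }

    // Step 3: a stand-in 64-bit hash (illustrative only; MurmurHash is a better choice).
    static long hash64(String word) {
        long h = 1125899906842597L; // arbitrary large prime seed
        for (int i = 0; i < word.length(); i++) {
            h = 31 * h + word.charAt(i);
        }
        return h;
    }

    // Steps 4-6: weight the bits, accumulate, then reduce to a 0/1 fingerprint.
    static long simHash64(String text) {
        int[] lastHash = new int[64];
        for (Map.Entry<String, Integer> e : termWeights(text).entrySet()) {
            long wordHash = hash64(e.getKey());
            int weight = e.getValue();
            for (int bit = 0; bit < 64; bit++) {
                // bit is 1 -> add weight, bit is 0 -> subtract weight
                lastHash[bit] += ((wordHash >> bit) & 1) == 1 ? weight : -weight;
            }
        }
        long simHash = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (lastHash[bit] > 0) {
                simHash |= 1L << bit;
            }
        }
        return simHash;
    }
}
```

With this in place, `Long.bitCount(simHash64(a) ^ simHash64(b))` gives the Hamming distance between two texts.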

There is a classic example of the process from text to SimHash fingerprint:
(Figure: from text to SimHash fingerprint)

Three technical points need to be solved to implement the algorithm:

  1. Word segmentation: tools such as hanlp, IKAnalyzer, and the word segmenter can be used;
  2. Weighting: assign a reasonable weight to each word; you can define your own statistic or simply use word-frequency counts;
  3. Hash algorithm: this is your own choice; you can also use an existing algorithm such as Murmur hash or the JDK's built-in hash (a short snippet follows this list).
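
If you go the Murmur-hash route for point 3, Guava provides an implementation. A small sketch, assuming Guava is on the classpath; here the 128-bit Murmur3 output is truncated to 64 bits to match a 64-bit SimHash:

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class MurmurDemo {
    public static void main(String[] args) {
        // 128-bit Murmur3 from Guava, truncated to 64 bits for a 64-bit SimHash
        long wordHash = Hashing.murmur3_128()
                .hashString("vulnerability", StandardCharsets.UTF_8)
                .asLong();
        System.out.printf("%016x%n", wordHash);
    }
}
```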

Hamming distance

Here is the explanation from Baidu Encyclopedia:

The Hamming distance is named after Richard Wesley Hamming. In information theory, the Hamming distance between two strings of equal length is the number of different characters in the corresponding positions of the two strings. In other words, it is the number of characters that need to be replaced to transform one string into another.

By this definition, each position where the binary sequences differ contributes 1, so XOR-ing the two values and counting the 1 bits in the result gives the Hamming distance of the two strings. Computing it is therefore very simple, as the example below shows.
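
A tiny worked example with two 5-bit values:

```java
// 10110 XOR 11010 = 01100, which contains two 1 bits,
// so the Hamming distance between the two sequences is 2
int distance = Long.bitCount(0b10110L ^ 0b11010L); // distance == 2
```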

Takeaways

This article has introduced the basic concepts; in the next article I will continue with the concrete algorithm implementation. Looking at the flow of the SimHash algorithm, the hardest part to understand is why you can simply replace each 0 or 1 in a word's hash value with its weight and then accumulate those weights across all the words.

Although I did not find a Java implementation of the SimHash algorithm that could be used directly, I did find two fairly detailed articles that were a great help to me in understanding locality-sensitive hashing. They are:

Reference link 1
Reference link 2

Source: blog.csdn.net/wojiushiwo945you/article/details/108812835