Introduction to the principle of simhash

Preface
    Locality-sensitive hashing (LSH) can reduce a document to a short fingerprint of hash bits, and comparing two such numbers is far cheaper than comparing the documents themselves. After looking through many references, I found that Google uses simhash for web page deduplication, and the documents they process every day are at the level of billions. Simhash was proposed by Charikar in 2002; see "Similarity estimation techniques from rounding algorithms".


Let me introduce the main ideas of the algorithm. To make it easy to follow, I will avoid mathematical formulas as much as possible. The algorithm breaks down into these steps (a code sketch follows the list):

1. Word segmentation. Segment the text to be compared into the characteristic words of the article, remove noise words, and assign a weight to each remaining word, giving a weighted word sequence. Assume the weights fall into 5 levels (1~5). For example: "America's 'Area 51' employee said there were 9 flying saucers inside, and had seen gray aliens" ==> after segmentation becomes "America (4) Area 51 (5) employee (3) said (1) inside (2) have (1) 9 (3) flying saucer (5) ever (1) seen (3) gray (4) alien (5)", where the number in parentheses represents how important the word is in the whole sentence; the larger the number, the more important the word.

2. Hash. Turn each word into a hash value through a hash function. For example, "America" hashes to 100101 and "Area 51" hashes to 101011. In this way our string becomes a string of numbers. Remember what I said at the beginning of the article: similarity can only be computed cheaply once the article is turned into numbers. This is where the dimensionality reduction begins.

3. Weighting. Using the hashes from step 2, form a weighted sequence of numbers according to each word's weight: a 1 bit becomes +weight and a 0 bit becomes -weight. For example, the hash of "America" is "100101", which with weight 4 becomes "4 -4 -4 4 -4 4"; the hash of "Area 51" is "101011", which with weight 5 becomes "5 -5 5 -5 5 5".

4. Merge. Add up the weighted sequences of all the words position by position to obtain a single sequence. For example, "4 -4 -4 4 -4 4" for "America" and "5 -5 5 -5 5 5" for "Area 51" are added bit by bit: "4+5, -4+-5, -4+5, 4+-5, -4+5, 4+5" ==> "9 -9 1 -1 1 9". Only two words are counted here as an example; the real calculation accumulates the sequences of all the words.

5. Dimensionality reduction. Turn the "9 -9 1 -1 1 9" from step 4 into a string of 0s and 1s, which forms the final simhash signature: each position greater than 0 is recorded as 1, and every other position as 0. The final result here is "1 0 1 0 1 1".

The whole process is shown as follows:
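Putting the five steps together, here is a minimal sketch in Python. It is only an illustration: the caller is assumed to supply the segmented, weighted words from step 1, md5 (truncated to 64 bits) stands in for the per-word hash, and the 64-bit fingerprint width is my own choice, not something taken from the original paper.

```python
import hashlib


def simhash(weighted_words, bits=64):
    """Build a simhash fingerprint from (word, weight) pairs (steps 2-5)."""
    v = [0] * bits  # one accumulator per bit position

    for word, weight in weighted_words:
        # Step 2: hash each word to a fixed-width integer
        # (md5 truncated to `bits` bits is just a convenient stand-in).
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)

        # Steps 3 and 4: a 1 bit adds the word's weight, a 0 bit subtracts it,
        # accumulated over all words.
        for i in range(bits):
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight

    # Step 5: positions with a positive sum become 1, the rest become 0.
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


# Toy usage with hand-assigned weights (step 1 would normally come from a
# word segmenter plus a weighting scheme such as tf-idf):
words = [("America", 4), ("Area 51", 5), ("employee", 3), ("flying saucer", 5)]
print(format(simhash(words), "064b"))
```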


You may wonder: after all these steps, don't we just end up with a string of 0s and 1s? Wouldn't it be easier to feed the whole text into an ordinary hash function and get 0/1 values directly? In fact, no. Traditional hash functions solve the problem of generating unique values, such as md5 and hashmap. md5 is used to generate a unique signature string; add even one character and the two md5 values look completely different. A hashmap is a data structure for key-value lookup, designed for fast insertion and retrieval. What we are solving, however, is text similarity: we need to judge whether two articles are similar, and the hashcode produced by our dimensionality reduction exists for exactly that purpose. By now it should be clear why the simhash we use can still be used to compute similarity after the article has been turned into a 01 string, whereas a traditional hashcode cannot. Let's run a test with two text strings that differ by only one character: "your mother called you to go home for dinner, go home and go home" and "your mother told you to go home for dinner, go home and go home Luo".

The result calculated by simhash is:

1000010010101101111111100000101011010001001111100001001011001011

1000010010101101011111100000101011010001001111100001101010001011

The result calculated by a traditional hashcode is:

1111111111111111111111111111111110001000001100110100111011011110

1010010001111111110010110011101

We can see that for similar text strings only a few bits of the simhash change, while an ordinary hashcode cannot do this; that is the charm of locality-sensitive hashing. At present, the shingling algorithm proposed by Broder and Charikar's simhash should be regarded as the best algorithms in the field. The paper by Charikar, the inventor of simhash, does not give the concrete simhash algorithm or a proof; "Quantum Turing" worked out the proof that simhash evolved from the random hyperplane hashing algorithm.
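For readers who want to reproduce a similar contrast with the `simhash` sketch above, a small experiment could look like the following. The word-level tokenization, the use of word counts as weights, and md5 as the "traditional" whole-string hash are all assumptions made just for this demo, so the exact bit counts will differ from the figures quoted above.

```python
import hashlib
from collections import Counter


def traditional_hash(text, bits=64):
    # A conventional whole-string hash: changing a single character
    # scrambles the output, so roughly half of the bits flip.
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)


a = "your mother called you to go home for dinner, go home and go home"
b = "your mother told you to go home for dinner, go home and go home Luo"

# Word counts stand in for the step-1 weights; reuses simhash() from the sketch above.
fp_a = simhash(Counter(a.split()).items())
fp_b = simhash(Counter(b.split()).items())

# Count the bits that differ between the two fingerprints.
print("simhash bits changed:    ", bin(fp_a ^ fp_b).count("1"))   # noticeably fewer
print("traditional bits changed:", bin(traditional_hash(a) ^ traditional_hash(b)).count("1"))  # around half of 64
```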

Through this conversion, we turn each text in the library into a simhash code and store it as a long, which greatly reduces the space required. Having solved the space problem, how do we compute the similarity of two simhashes? By counting how many of their 0/1 bits differ? Yes, exactly: we judge whether two simhashes are similar by their Hamming distance. The number of positions at which the corresponding bits (01 strings) of two simhashes differ is called their Hamming distance. For example, 10101 and 00110 differ in the first, fourth, and fifth positions, so the Hamming distance is 3. For binary strings a and b, the Hamming distance equals the number of 1s in the result of a XOR b (the standard way to compute it).
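In code, that is nothing more than an XOR followed by a population count; a minimal version could be:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two simhash fingerprints differ.

    a XOR b has a 1 exactly where the two fingerprints disagree,
    so counting the 1s gives the Hamming distance.
    """
    return bin(a ^ b).count("1")


# The example from the text: 10101 vs 00110 differ in three positions.
print(hamming_distance(0b10101, 0b00110))  # 3
```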

For efficient comparison, we preload the existing texts in the library, convert them to simhash codes, and keep the codes in memory. A new text is first converted to its simhash code and then compared with the codes in memory; in our test, 1,000,000 (100w) comparisons took about 100ms. The speed is greatly improved.
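A naive version of that in-memory check might look like the sketch below: a plain linear scan over the preloaded fingerprints. The function name, the fingerprint list, and the distance threshold of 3 are illustrative assumptions.

```python
def find_near_duplicates(new_fp, stored_fps, max_distance=3):
    """Return the indexes of preloaded fingerprints within max_distance of new_fp."""
    return [
        i for i, fp in enumerate(stored_fps)
        if bin(new_fp ^ fp).count("1") <= max_distance
    ]

# stored_fps would be the library texts already converted to 64-bit simhash
# codes and held in memory; each incoming text is hashed once and scanned
# against the whole list.
```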

To be continued:

1. The current speed is improved, but the amount of data keeps growing. Suppose the volume eventually reaches 100w (1,000,000) new texts per hour. At the current 100ms per comparison pass, one thread processes 10 per second, 60 * 10 per minute, 60 * 10 * 60 = 36,000 per hour, and 60 * 10 * 60 * 24 = 864,000 per day. If our goal is 100w per day, this can be met by adding two threads. But what if we need 100w per hour? Then about 30 threads and the corresponding hardware resources are required to keep up, so the cost goes up as well. Is there a better way to improve the efficiency of the comparison?

2. After extensive testing, simhash works quite well for comparing large texts, for example those of more than 500 words: distances below 3 basically mean the texts are similar, and the misjudgment rate is fairly low. However, if we are dealing with Weibo posts, which are at most 140 characters, simhash is not so ideal. In our tests, a distance of 3 is a reasonable compromise point, while at a distance of 10 the results are already very poor; yet many short texts that do look similar sit at a distance of 10. With a threshold of 3, a large amount of duplicated short text is not filtered out; with a threshold of 10, the error rate on long text becomes very high. How do we solve this?


Reference: http://my.oschina.net/leejun2005/blog/150086

Another article explains minhash: http://blog.csdn.net/heiyeshuwu/article/details/44117473
