Massive text deduplication using SimHash

Table of contents
1. The difference between SimHash and traditional hash functions
2. SimHash algorithm idea
3. SimHash process implementation
4. SimHash signature distance calculation
5. SimHash storage and indexing
6. SimHash limitations
7. References
 
SimHash, introduced in this article, is a locality-sensitive hash, and it is the main algorithm Google uses to deduplicate massive numbers of web pages.
 
1. The difference between SimHash and traditional hash functions
A traditional hash algorithm is only responsible for mapping the original content to a signature value as uniformly and randomly as possible; in principle it is no more than a pseudo-random number generator. If two signatures produced by a traditional hash are equal, the original contents are equal with high probability; if they are not equal, the hash provides no information beyond the fact that the contents differ, because originals that differ by even a single byte will likely produce wildly different signatures. A traditional hash therefore cannot measure the similarity of the original content in the signature dimension. SimHash, on the other hand, is a locality-sensitive hashing algorithm, and the hash signature it generates can represent the similarity of the original content to a certain extent.
 
The problem we are solving here is text similarity: judging whether two articles resemble each other. That is exactly why we reduce dimensionality and generate hash signatures. By this point it should be clear that simhash can still be used to compute similarity even after the articles have been turned into 0/1 strings, while a traditional hash cannot. Let's run a test with two text strings that differ by only one word: "your mother called you to go home for dinner, go home and go home" and "your mother told you to go home for dinner, go home and go home Luo".
 
The result calculated by simhash is:
1000010010101101111111100000101011010001001111100001001011001011
1000010010101101011111100000101011010001001111100001101010001011
The result calculated by a traditional hash is:
0001000001100110100111011011110
1010010001111111110010110011101
 
As you can see, for the two similar texts only a few bits of the simhash signature have changed, while the traditional hash outputs bear no resemblance to each other. This is the charm of locality-sensitive hashing.
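To get a concrete feel for that avalanche behavior, here is a minimal Python sketch. The article does not say which traditional hash produced the strings above, so md5 stands in as an assumption; the two input strings are the example sentences from the test:

```python
import hashlib

a = "your mother called you to go home for dinner, go home and go home"
b = "your mother told you to go home for dinner, go home and go home Luo"

# A traditional hash: changing one word flips roughly half of the output bits.
ha = int(hashlib.md5(a.encode("utf-8")).hexdigest(), 16)
hb = int(hashlib.md5(b.encode("utf-8")).hexdigest(), 16)
print(bin(ha ^ hb).count("1"), "of 128 bits differ")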
 
2. SimHash algorithm idea
Suppose we have massive text data that we need to deduplicate by content. There are many NLP algorithms that solve text deduplication with high precision, but we are deduplicating at big-data scale, which puts heavy demands on the efficiency of the algorithm. A locality-sensitive hash algorithm maps the original text content to a number (a hash signature) such that relatively similar texts get relatively similar hash signatures. SimHash is Google's efficient algorithm for deduplicating massive numbers of web pages: it maps the original text to a 64-bit binary string, and the difference between two such strings then represents the difference between the original contents.
 
3. SimHash process implementation
SimHash was proposed by Charikar in 2002. To keep things easy to follow, this article avoids mathematical formulas and breaks the algorithm into the following steps:
(Note: The specific example is taken from Lanceyan's blog "simhash and Hamming distance for similarity calculation of massive data")
 
1. Tokenization: segment the text to be judged into the feature words of the article, producing a word sequence with noise words removed, and attach a weight to each word. We assume the weights fall into 5 levels (1-5). For example: "U.S. 'Area 51' employees say there are 9 flying saucers inside and that they have seen gray aliens" ==> after segmentation, "US(4) Area-51(5) employee(3) say(1) inside(2) have(1) 9(3) flying-saucer(5) once(1) seen(3) gray(4) alien(5)", where the number in parentheses indicates how important the word is in the whole sentence; the larger the number, the more important the word.
 
2. Hashing: turn each word into a hash value through a hash function. For example, "US" hashes to 100101 and "Area 51" hashes to 101011. Our strings have now become strings of numbers; remember from the beginning of the article that the text has to be turned into numbers before similarity can be computed efficiently. This is the dimensionality reduction in progress.
 
3. Weighting: using the hash values from step 2, form a weighted digit string according to each word's weight: each 1-bit becomes +weight and each 0-bit becomes -weight. For example, the hash of "US" is "100101", which after weighting becomes "4 -4 -4 4 -4 4"; the hash of "Area 51" is "101011", which after weighting becomes "5 -5 5 -5 5 5".
 
4. Merging: accumulate, position by position, the weighted sequences of all the words into a single sequence. For example, "US" gives "4 -4 -4 4 -4 4" and "Area 51" gives "5 -5 5 -5 5 5"; adding each position, "4+5 -4+-5 -4+5 4+-5 -4+5 4+5" ==> "9 -9 1 -1 1 9". Only two words are accumulated here as an example; a real computation accumulates the sequences of all the words.
 
5. Dimensionality reduction: turn the "9 -9 1 -1 1 9" computed in step 4 into a 0/1 string, which forms our final simhash signature: each position greater than 0 is recorded as 1, and each position less than or equal to 0 as 0. The final result is "1 0 1 0 1 1".
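The five steps translate almost directly into code. Below is a minimal sketch, assuming tokenization and weighting (step 1) happen upstream and using md5 as the per-word hash (the article does not prescribe a particular word hash, so that choice is an assumption):

```python
import hashlib

def simhash(weighted_tokens, bits=64):
    """Compute a SimHash signature from (token, weight) pairs,
    following steps 2-5 above."""
    v = [0] * bits  # step 4: per-bit accumulator
    for token, weight in weighted_tokens:
        # step 2: hash each token to a `bits`-wide integer
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            # step 3: +weight where the bit is 1, -weight where it is 0
            v[i] += weight if (h >> i) & 1 else -weight
    # step 5: positive sums become 1-bits, the rest become 0-bits
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

# Step 1 (tokenize + weight) is assumed done upstream; toy input:
tokens = [("US", 4), ("Area-51", 5), ("employee", 3), ("flying-saucer", 5)]
print(f"{simhash(tokens):064b}")
```

Feeding in the weighted words of two similar articles should produce signatures that differ in only a few bits, as in the example at the top of the article.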
 
4. SimHash signature distance calculation
We convert all texts in our database into simhash signatures and store them as long values, which greatly reduces the space required. We have solved the space problem, but how do we compute the similarity of two simhashes? Is it simply counting how many 0/1 positions of the two simhashes differ? Exactly: the Hamming distance tells us whether two simhashes are similar. The number of positions at which the corresponding bits of two simhashes differ is called the Hamming distance between them. For example, 10101 and 00110 differ in the first, fourth, and fifth positions, so their Hamming distance is 3. For binary strings a and b, the Hamming distance equals the number of 1s in the result of a XOR b (the usual algorithm).
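In code, the XOR-and-popcount computation is a one-liner; a quick sketch:

```python
def hamming_distance(sig1: int, sig2: int) -> int:
    """Number of differing bit positions: popcount of sig1 XOR sig2."""
    return bin(sig1 ^ sig2).count("1")

assert hamming_distance(0b10101, 0b00110) == 3  # the example from the text
```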
 
5. SimHash storage and indexing
After the simhash mapping, we have a simhash signature for every text, and we have settled on Hamming distance as the similarity measure. The remaining work is to compute the pairwise Hamming distances between the signatures. In theory that is perfectly fine, but given that our data is massive, shouldn't we consider a more efficient storage scheme? In fact, the signatures output by the SimHash algorithm lend themselves very well to building an index, greatly reducing lookup time. How is it done?
 
Does this remind anyone of a hashmap, a lookup data structure with theoretical O(1) complexity? To look up a key, we pass the key in and quickly get a value back. How does this structure, reputedly the fastest for lookups, work internally? To fetch the value for a key, we pass in the key and compute its hashcode, landing on, say, slot 7; if several values hang off slot 7, we walk its linked list until we find v72. This analysis also shows that if the hashcodes are not well distributed, the hashmap is not necessarily efficient either. We borrow this idea to design our simhash lookup. A sequential scan is clearly out of the question, so can we, like a hashmap, first use keys to cut down the number of sequential comparisons? The scheme is as follows.
Storage:
1. Split a 64-bit simhash signature into four 16-bit binary blocks.
2. Use each of the four 16-bit blocks as a key and check whether an element already exists at the corresponding slot.
3. If the slot is empty, start a linked list there with the signature; if the slot is occupied, append the signature to the tail of the existing list.
 
Lookup:
1. Split the simhash signature to be compared into four 16-bit binary blocks.
2. Use each of the four 16-bit blocks to look up whether elements exist at the corresponding slot in the simhash collection.
3. If elements exist, pull out the linked list and compare sequentially; if any stored simhash is within a certain Hamming distance of the query, a near-duplicate has been found and the lookup is complete.
 
Principle:
We borrow the hashmap idea of looking things up by a hashable key. Because simhash is a locality-sensitive hash, similar strings differ in only a few bit positions. So if we consider two texts duplicates when their 64-bit signatures differ in at most 3 bits, then by the pigeonhole principle, splitting each signature into 4 blocks of 16 bits guarantees that at least one block is bitwise identical between the two; that shared block is exactly the key we can look up. Whether to split into 16-, 8-, or 4-bit blocks should be decided by testing on your own data: smaller blocks catch more bit differences but enlarge the storage. Splitting into four 16-bit blocks takes 4 times the storage of a single copy of the simhashes: we previously calculated that 50 million signatures take about 382 MB, so 4 times that is roughly 1.5 GB, which is still acceptable.
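Here is a minimal sketch of this split-and-index scheme in Python, assuming 64-bit signatures, four 16-bit blocks, and a duplicate threshold of 3 bits (the threshold value is an assumption; tune it on your own data):

```python
from collections import defaultdict

BLOCKS = 4        # a 64-bit signature is split into 4 x 16-bit blocks
BLOCK_BITS = 16
MAX_DISTANCE = 3  # signatures within 3 bits are treated as duplicates

# One table per block position: 16-bit key -> list of full signatures.
tables = [defaultdict(list) for _ in range(BLOCKS)]

def split(sig):
    """Return the four 16-bit blocks of a 64-bit signature."""
    return [(sig >> (i * BLOCK_BITS)) & 0xFFFF for i in range(BLOCKS)]

def store(sig):
    """Storage: append the full signature under each of its 4 block keys."""
    for i, key in enumerate(split(sig)):
        tables[i][key].append(sig)

def query(sig):
    """Lookup: by the pigeonhole argument, any signature within
    MAX_DISTANCE of sig shares at least one 16-bit block with it."""
    seen, hits = set(), []
    for i, key in enumerate(split(sig)):
        for candidate in tables[i].get(key, []):
            if candidate not in seen:
                seen.add(candidate)
                if bin(candidate ^ sig).count("1") <= MAX_DISTANCE:
                    hits.append(candidate)
    return hits
```

Note that each signature is stored four times, once per block table, which is exactly the 4x space overhead estimated above.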
 
6. SimHash limitations
1. When the text content is long, SimHash accuracy is high; for short texts, SimHash accuracy often cannot be guaranteed.
2. How to determine the weight of each term in the text depends on the actual project requirements; in general, IDF weights can be used, as sketched below.
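As a reference point, one common IDF variant is sketched here (the +1 smoothing term is an assumption; many variants exist):

```python
import math

def idf(term, docs):
    """idf(t) = log(N / (1 + df(t))): the rarer a term is across
    the corpus, the larger the weight it receives."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + df))
```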
 
7. References
1. Lanceyan's blog, "Similarity computation for massive data: simhash and short-text lookup" (海量数据相似度计算之simhash短文本查找).
2. Moses S. Charikar, "Similarity Estimation Techniques from Rounding Algorithms", STOC 2002.

 

Source: http://bi.dataguru.cn/article-9604-1.html
