Simhash: introduction and application scenarios

Brief introduction

Simhash is a locality-sensitive hashing algorithm that can be used to deduplicate similar text content.

How it differs from a message digest algorithm

Message digest algorithm: if two inputs differ by only one byte, the generated signatures are still likely to differ considerably.

Simhash algorithm: if two inputs differ by only one byte, the signatures differ only slightly.
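To make the contrast concrete, here is a rough sketch of both behaviours. It is not the code of the simhash module: `features`, `toy_simhash` and `md5_64` are hypothetical names, and hashing each 3-character feature with MD5 and truncating to 64 bits are assumptions chosen only for illustration.

```python
import hashlib


def features(text, width=3):
    """Split text into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + width] for i in range(max(len(text) - width + 1, 1))]


def toy_simhash(text, bits=64):
    """Illustrative simhash: every feature votes +1/-1 on each bit position."""
    votes = [0] * bits
    for feature in features(text):
        digest = hashlib.md5(feature.encode('utf-8')).digest()
        h = int.from_bytes(digest[:bits // 8], 'big')  # assumes bits is 64
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 when the positive votes win at that position.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def md5_64(text):
    """First 64 bits of the MD5 digest of the whole text, as an integer."""
    return int.from_bytes(hashlib.md5(text.encode('utf-8')).digest()[:8], 'big')


a = 'How are you? I am fine. Thanks.'
b = 'How are you? I am fine. Thanks!'
# Count the bit positions in which the two signatures differ.
print('md5 bits changed:    ', bin(md5_64(a) ^ md5_64(b)).count('1'))
print('simhash bits changed:', bin(toy_simhash(a) ^ toy_simhash(b)).count('1'))
```

Because most features of the two texts are identical, most of the votes are identical too, which is why the simhash fingerprints tend to stay close while the digest changes unpredictably.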

Comparing simhash values: the difference between two texts is reflected in the number of bit positions at which their simhash values differ. This number is known as the Hamming distance.

Note:
Simhash works best on long texts (500+ words); for short texts the deviation may be relatively large.

According to the data in Google's paper, for 64-bit simhash values two documents are considered similar or duplicates when their Hamming distance is within 3. This is only a reference value; your own application may need a different threshold, determined by testing.
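The Hamming distance and the threshold check described above can be sketched in a few lines. The names `hamming_distance` and `is_near_duplicate` are hypothetical helpers, and the fingerprints are made-up numbers rather than real simhash values.

```python
def hamming_distance(x, y):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(x ^ y).count('1')  # XOR leaves a 1 wherever the bits disagree


def is_near_duplicate(fp_a, fp_b, k=3):
    """Treat two fingerprints as near-duplicates if they differ in at most k bits."""
    return hamming_distance(fp_a, fp_b) <= k


fp1 = 0xA1B2C3D4E5F60718   # made-up 64-bit fingerprint
fp2 = fp1 ^ 0b101          # same fingerprint with 2 bits flipped
fp3 = fp1 ^ 0xFF           # same fingerprint with 8 bits flipped

print(is_near_duplicate(fp1, fp2))  # True
print(is_near_duplicate(fp1, fp3))  # False
```

Keeping `k` as a parameter makes it easy to tune the threshold against your own data instead of hard-coding 3.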

Simhash module in Python

The simhash module is a Python implementation of the algorithm; the simhash values it produces are 64 bits long.

GitHub address: https://github.com/leonsim/simhash

Let's look at a simple example:

# Simple usage test of the simhash library
# pip install simhash

import re
from simhash import Simhash


def get_features(s):
    """
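    Lowercase the text, strip whitespace and punctuation,
    then split it into overlapping 3-character features.
    :param s: input text
    :return: list of 3-character features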
    """
    width = 3
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]


# Compute the simhash values of these texts
print('%x' % Simhash(get_features('How are you? I am fine. Thanks.')).value)
print('%x' % Simhash(get_features('How are u? I am fine.     Thanks.')).value)
print('%x' % Simhash(get_features('How r you?I    am fine. Thanks.')).value)

From this we can see that a certain amount of preprocessing before computing the simhash is very important.

Getting the distance between two simhash values:

print(Simhash('furuiyang').distance(Simhash('yaokailun')))
print(Simhash('furuiyang').distance(Simhash('ruanyifeng')))
print(Simhash('ruiyang').distance(Simhash('ruiyang')))

In a web crawler project, we generally use simhash in the following pattern:

"""Apply the Hamming distance in a more general pattern"""

import re
from simhash import Simhash, SimhashIndex


def get_features(s):
    """
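    Preprocess the text:
    lowercase it, strip whitespace and punctuation,
    then split it into overlapping 3-character features.
    :param s: input text
    :return: list of 3-character features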
    """
    width = 3
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]


# The data we already have
data = {
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: u'How are you i am fine. blar blar blar blar blar than',
    3: u'This is simhash test.',
}
# (key, Simhash) pairs built from the initial data
objs = [(str(k), Simhash(get_features(v))) for k, v in data.items()]
# Build the index; k=3 is the Hamming distance tolerance for near-duplicate lookups
index = SimhashIndex(objs, k=3)
print(index.bucket_size())  # 11
# Compute the simhash value of a newly arrived piece of data
s1 = Simhash(get_features(u'How are you i am fine. blar blar blar blar blar thank'))
# Look up the keys of the stored entries that are near-duplicates of s1
print(index.get_near_dups(s1))
# Add the new data to the existing index
index.add('4', s1)
print(index.get_near_dups(s1))

If we want to use simhash in a real project, we clearly need to persist the index object.
Therefore, we can consider using a serialization tool.

Serialization: converting an object into binary data.
Deserialization: restoring an object from binary data.
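As a minimal sketch of that round trip, the standard-library pickle module can write an object to disk and read it back. To keep the example free of third-party packages, a plain dict of made-up fingerprints stands in for the index; whether a SimhashIndex object pickles cleanly is an assumption you should verify yourself. `save_object` and `load_object` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize obj to binary data and write it to path."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Read binary data from path and rebuild the original object."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Stand-in for the index: document key -> 64-bit simhash value (made-up numbers).
fingerprints = {'1': 0x1F3A5C7E9B2D4F60, '2': 0x1F3A5C7E9B2D4F61}

path = os.path.join(tempfile.mkdtemp(), 'simhash_index.pkl')
save_object(fingerprints, path)
restored = load_object(path)
print(restored == fingerprints)  # True
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.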

Postscript

My period came today and I am a bit sore. But once the difficult first day is over, I can jump around again. Life is like that too: after the hardest times, there is still a smooth road ahead.

In short, I just do not want to say only negative things to myself.

Updated: 2020-02-04

Origin blog.csdn.net/Enjolras_fuu/article/details/104167456