simhash: Introduction and Use Cases

Introduction

simhash is a locality-sensitive hashing algorithm that can be used to deduplicate near-identical text content.

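How it differs from message digest algorithms

Message digest algorithms: if two inputs differ by only a single byte, the resulting signatures are still very likely to differ greatly.

simhash: if two inputs differ by only a single byte, the resulting signatures differ very little.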
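To make the contrast concrete, here is a minimal sketch of the usual simhash construction: every feature is hashed, each hash votes on every bit position, and the sign of each vote total becomes one bit of the fingerprint. It is an illustration only, not the code of any particular library; the function name `simhash_sketch`, the use of MD5 as the per-feature hash, and the unweighted votes are all assumptions made for this example.

```python
import hashlib


def simhash_sketch(features, bits=64):
    """Illustrative simhash: every feature votes on every bit position."""
    votes = [0] * bits
    mask = (1 << bits) - 1
    for feature in features:
        # Hash the feature with MD5 (an assumption) and keep the low `bits` bits.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for i in range(bits):
            # A set bit votes +1 for its position, a clear bit votes -1.
            votes[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint has a 1 wherever the positive votes won.
    value = 0
    for i, total in enumerate(votes):
        if total > 0:
            value |= 1 << i
    return value


words_a = "the quick brown fox jumps over the lazy dog".split()
words_b = "the quick brown fox jumps over the lazy cat".split()
print("%016x" % simhash_sketch(words_a))
print("%016x" % simhash_sketch(words_b))
```

Because the two word lists share all but one feature, only one set of votes changes between them, so the vote totals tend to keep their signs at most positions. A message digest has no such averaging step.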

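Comparing simhash values: the difference between two texts is expressed by how many binary digits differ between their simhash values. The number of differing bits is known as the Hamming distance.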
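For integer fingerprints, the Hamming distance can be computed by XOR-ing the two values and counting the 1 bits in the result. A small sketch (the helper name `hamming_distance` is just for illustration):

```python
def hamming_distance(a, b):
    """Count the bit positions at which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


x = 0b10110100
y = 0b10011100
print(hamming_distance(x, y))  # 2
```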

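Note:
simhash works best on long texts (500+ characters); on short texts the results can be quite inaccurate.

According to the data given in Google's paper, with 64-bit simhash values two documents can be considered similar, or duplicates, when their Hamming distance is at most 3. This is only a reference value, though; for your own application, testing may lead you to a different threshold.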
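As a rough sketch of how such a threshold might be applied, the code below keeps only the first fingerprint of each group of near-duplicates. The fingerprint values are made up for illustration, the helper names are hypothetical, and the pairwise scan is quadratic, so it only suits small collections.

```python
def is_near_duplicate(a, b, k=3):
    """True if two 64-bit fingerprints differ in at most k bits."""
    return bin(a ^ b).count("1") <= k


def drop_near_duplicates(fingerprints, k=3):
    """Keep the first fingerprint of each near-duplicate group (O(n^2) scan)."""
    kept = []
    for key, value in fingerprints:
        if not any(is_near_duplicate(value, seen, k) for _, seen in kept):
            kept.append((key, value))
    return kept


# Made-up fingerprints, chosen only to exercise the threshold.
docs = [
    ("a", 0xFFFF0000FFFF0000),
    ("b", 0xFFFF0000FFFF0003),  # differs from "a" in 2 bits
    ("c", 0x0000FFFF0000FFFF),  # differs from "a" in all 64 bits
]
print([key for key, _ in drop_near_duplicates(docs)])  # ['a', 'c']
```

Changing `k` is how you would try out a different threshold for your own data.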

The simhash module in Python

This module implements the simhash algorithm in Python; by default, the simhash values it produces are 64 bits long.

GitHub: https://github.com/leonsim/simhash

Let's look at a simple example:

# A simple test of the simhash library
# pip install simhash

import re
from simhash import Simhash


def get_features(s):
    """
    Lowercase the text, strip whitespace and punctuation,
    then split it into overlapping 3-character pieces.
    :param s: the input text
    :return: a list of 3-character features
    """
    width = 3
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]


# Compute the simhash value of each of these texts
print('%x' % Simhash(get_features('How are you? I am fine. Thanks.')).value)
print('%x' % Simhash(get_features('How are u? I am fine.     Thanks.')).value)
print('%x' % Simhash(get_features('How r you?I    am fine. Thanks.')).value)

This also shows that doing some preprocessing before computing the simhash is very important.

Getting the distance between two simhash values:
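A quick way to see why: after normalization, texts that differ only in case, spacing, or punctuation become identical, so they yield exactly the same features. The sketch below uses only the standard library; the helper name `normalize` is just for illustration.

```python
import re


def normalize(text):
    """Lowercase the text and drop everything that is not a word character."""
    return re.sub(r"\W+", "", text.lower())


a = "How are you? I am fine. Thanks."
b = "how ARE you?I    am fine.Thanks"
print(normalize(a))                  # howareyouiamfinethanks
print(normalize(a) == normalize(b))  # True
```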

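from simhash import Simhash

# When given a plain string, the library tokenizes it with its default settings
print(Simhash('furuiyang').distance(Simhash('yaokailun')))
print(Simhash('furuiyang').distance(Simhash('ruanyifeng')))
print(Simhash('ruiyang').distance(Simhash('ruiyang')))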

The pattern we usually follow when using simhash in a crawler project:

"""A more general way of applying the Hamming distance"""

import re
from simhash import Simhash, SimhashIndex


def get_features(s):
    """
    Preprocess the text:
    lowercase it, then strip whitespace and punctuation
    :param s: the input text
    :return: a list of 3-character features
    """
    width = 3
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]


# The data we already have
data = {
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: u'How are you i am fine. blar blar blar blar blar than',
    3: u'This is simhash test.',
}
# Pairs of (key, Simhash object) built from the initial data
objs = [(str(k), Simhash(get_features(v))) for k, v in data.items()]
# Build the index; k=3 is the Hamming distance tolerance for lookups
index = SimhashIndex(objs, k=3)
print(index.bucket_size())  # 11
# Compute the simhash value of a newly arrived item
s1 = Simhash(get_features(u'How are you i am fine. blar blar blar blar blar thank'))
# Look up the keys of the indexed entries that are near-duplicates of it
print(index.get_near_dups(s1))
# Add the new item to the existing index
index.add('4', s1)
print(index.get_near_dups(s1))

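If we want to use simhash in a real project, we obviously need to persist this index object.
So we can consider using a serialization tool.

Serialization: converting an object into binary data.
Deserialization: restoring an object from binary data.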
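No particular serialization tool is named here. As one possibility, the sketch below uses Python's standard pickle module, and it stores a plain dict mapping each key to its integer fingerprint rather than the index object itself; both choices, along with the helper names and the sample values, are assumptions made for this example. An index could then be rebuilt from the loaded pairs at startup.

```python
import os
import pickle
import tempfile


def save_fingerprints(path, fingerprints):
    """Serialize a {key: fingerprint} dict to a binary file."""
    with open(path, "wb") as fh:
        pickle.dump(fingerprints, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_fingerprints(path):
    """Deserialize the dict again. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Made-up 64-bit fingerprints keyed by document id.
stored = {"1": 0x9F8B3C2D1E0F4A5B, "2": 0x0123456789ABCDEF}
path = os.path.join(tempfile.mkdtemp(), "fingerprints.pkl")
save_fingerprints(path, stored)
print(load_fingerprints(path) == stored)  # True
```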

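Afterword

My period started today, and it really does hurt. Still, once the rough first day is over I'll be bouncing around again. Life is probably like that too: once the hardest stretch is behind you, there are smooth roads ahead after all.

Anyway, I just don't want to be the one saying negative things.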

Last updated: 2020-02-04

furuiyang_
