Final chapter: Java implementation of the SimHash algorithm and the similar-text retrieval tool

Background

The previous two articles introduced the SimHash algorithm flow and the similar-text retrieval process based on SimHash fingerprint segmentation. This article presents the concrete code implementation.

As fellow developers know, coding is the easy part: it is mostly a matter of translating the process descriptions above into code. This article does exactly that for the SimHash algorithm and the retrieval tool.

Process review

SimHash algorithm flow:

  1. Word segmentation: segment the text into n words word_1, word_2, word_3, …, word_n;
  2. Weighting: compute the word frequencies of the text and assign each word a reasonable weight weight_1, weight_2, weight_3, …, weight_n;
  3. Hashing: compute the hash value hash_i of each word, obtaining a fixed-length binary sequence, typically 64 or 128 bits;
  4. Weighting the hash: in each hash_i of word_i, turn every 1 bit into the positive weight weight_i and every 0 bit into -weight_i, obtaining a new sequence weightHash_i;
  5. Accumulation: sum the weightHash_i sequences bit by bit over all words, obtaining a sequence lastHash in which each bit is the accumulated weight of all words;
  6. Dimensionality reduction: convert lastHash into a 01 sequence simHash, where positions with a value greater than zero become 1 and positions with a value less than zero become 0. This is the locality-sensitive hash value of the whole text, i.e. its fingerprint.
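Steps 4 to 6 above (weighting, accumulation, dimensionality reduction) can be sketched as a small standalone example. The word list, the weights, and the use of a widened JDK string hash here are illustrative assumptions only; the full implementation later in this article uses the word library's segmentation and its own hash method.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SimHashSketch {
    // Compute a 64-bit SimHash fingerprint from words and their (assumed) weights.
    static String simHash(Map<String, Integer> weightedWords) {
        int bits = 64;
        long[] acc = new long[bits];
        for (Map.Entry<String, Integer> e : weightedWords.entrySet()) {
            // Illustrative 64-bit hash: spread the JDK string hash with a large odd constant.
            long h = e.getKey().hashCode() * 1125899906842597L;
            for (int i = 0; i < bits; i++) {
                // Steps 4 + 5: a 1 bit contributes +weight, a 0 bit contributes -weight,
                // accumulated per bit position across all words.
                acc[i] += ((h >>> i) & 1L) == 1L ? e.getValue() : -e.getValue();
            }
        }
        // Step 6: dimensionality reduction to a 01 string.
        StringBuilder fp = new StringBuilder(bits);
        for (int i = 0; i < bits; i++) {
            fp.append(acc[i] > 0 ? '1' : '0');
        }
        return fp.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> words = new LinkedHashMap<>();
        words.put("interest", 2);
        words.put("reading", 3);
        System.out.println(simHash(words)); // a 64-character 01 fingerprint
    }
}
```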

Flow of the SimHash-based similar-text retrieval tool:

  1. Compute SimHash: compute the SimHash value of the target text, 64 bits long;
  2. Split the SimHash value into 4 segments of 16 bits each and store them in the array hashs;
  3. Traverse the hashs array; with each segment value as the key, call RedisUtil.get(key) to look up a cached match matchedSimHash;
  4. If matchedSimHash is empty for every segment, there is no similar record: cache the SimHash value of the target text and end the search; otherwise, continue;
  5. Traverse the matchedSimHash list and compute the Hamming distance between the current SimHash and each SimHash value in the list. If a distance no greater than the threshold (3) is found, a similar text has been found and the process ends; otherwise, continue;
  6. If the outer traversal over the segments and the inner traversal over each matched list finish without a hit, there is no similar record: cache the SimHash value of the target text and end the search.
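The pigeonhole idea behind the four-way split: two 64-bit fingerprints that differ in at most 3 bits must share at least one identical 16-bit segment. The flow above can be sketched with an in-memory HashMap standing in for the Redis cache; the class name, key scheme, and threshold here are assumptions for illustration, not the tool's real storage layer.

```java
import java.util.HashMap;
import java.util.Map;

public class SegmentIndexSketch {
    // In-memory stand-in for the Redis cache: segment value -> full fingerprint.
    private final Map<String, String> index = new HashMap<>();

    // Split a 64-bit fingerprint string into 4 segments of 16 bits each.
    static String[] segments(String simHash) {
        return new String[] {
            simHash.substring(0, 16), simHash.substring(16, 32),
            simHash.substring(32, 48), simHash.substring(48, 64)
        };
    }

    // Hamming distance between two equal-length 01 strings.
    static int hamming(String a, String b) {
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    // Return a cached similar fingerprint, or cache the new one and return null.
    String searchOrPush(String simHash) {
        for (String seg : segments(simHash)) {
            String candidate = index.get(seg);
            if (candidate != null && hamming(candidate, simHash) <= 3) {
                return candidate; // similar record found
            }
        }
        for (String seg : segments(simHash)) {
            index.put(seg, simHash); // no match: cache every segment
        }
        return null;
    }

    public static void main(String[] args) {
        SegmentIndexSketch index = new SegmentIndexSketch();
        String first = "0".repeat(64);
        System.out.println(index.searchOrPush(first)); // null: nothing cached yet
        String near = "11" + first.substring(2); // Hamming distance 2 from first
        System.out.println(index.searchOrPush(near)); // prints the cached all-zero fingerprint
    }
}
```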

Class diagram design

The main difficulty in implementing the SimHash algorithm lies in word segmentation and weighting. This article uses word, a word segmentation library on GitHub, and implements the algorithm by extending its TextSimilarity class. The classes involved are shown in the class diagram.
(class diagram)

Dependency preparation

Create a project and add two dependencies to pom.xml:

<dependency>
    <groupId>org.apdplat</groupId>
    <artifactId>word</artifactId>
    <version>1.3.1</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.70</version>
</dependency>

Note that the word dependency is not easy to download successfully. Alternatively, you can put word-1.3.1.jar directly into the project directory and add the dependency manually.

Implementation code

The SimHash algorithm flow is implemented in the class SimHashBaseOnWord; its code is as follows:

public class SimHashBaseOnWord extends TextSimilarity {
    private static final Logger LOGGER = LoggerFactory.getLogger(SimHashBaseOnWord.class);

    // generate a 64-bit SimHash by default
    private int hashBitCount = 64;
    public SimHashBaseOnWord(){
    }

    public SimHashBaseOnWord(int hashBitCount) {
        this.hashBitCount = hashBitCount;
    }

    @Override
    protected double scoreImpl(List<Word> words1, List<Word> words2){
        // tag word weights using word frequency
        taggingWeightWithWordFrequency(words1, words2);
        // compute the SimHash value of each word list
        String simHash1 = simHash(words1);
        String simHash2 = simHash(words2);
        // the Hamming distance is undefined for fingerprints of different lengths
        if (simHash1.length() != simHash2.length()) {
            LOGGER.error("The SimHash values of the two texts differ in length; the Hamming distance cannot be computed");
            return 0.0;
        }
        // compute the Hamming distance between the two SimHash values
        int hammingDistance = hammingDistance(simHash1, simHash2);

        int maxDistance = simHash1.length();
        double score = (1 - hammingDistance / (double)maxDistance);
        return score;
    }

    /**
     * Compute the SimHash value of a text.
     * @param text the input text
     * @return the SimHash fingerprint as a 01 string
     */
    public String simHash(String text) {
        List<Word> words = seg(text);
        return this.simHash(words);
    }

    /**
     * Compute the Hamming distance between two equal-length SimHash values.
     * If the lengths differ, the distance cannot be compared and hashBitCount
     * (64 by default) is returned as the maximum distance.
     * @param simHash1 the first SimHash value
     * @param simHash2 the second SimHash value
     * @return the Hamming distance
     */
    public int hammingDistance(String simHash1, String simHash2) {
        if (simHash1.length() != simHash2.length()) {
            return this.hashBitCount;
        }

        int distance = 0;
        int len = simHash1.length();
        for (int i = 0; i < len; i++) {
            if (simHash1.charAt(i) != simHash2.charAt(i)) {
                distance++;
            }
        }

        return distance;
    }

    /**
     * Compute the SimHash value of a word list; word weights were already
     * tagged during segmentation.
     * @param words the word list
     * @return the SimHash value
     */
    private String simHash(List<Word> words) {
        float[] hashBit = new float[hashBitCount];
        words.forEach(word -> {
            float weight = word.getWeight()==null?1:word.getWeight();
            BigInteger hash = hash(word.getText());
            for (int i = 0; i < hashBitCount; i++) {
                BigInteger bitMask = new BigInteger("1").shiftLeft(i);
                if (hash.and(bitMask).signum() != 0) {
                    hashBit[i] += weight;
                } else {
                    hashBit[i] -= weight;
                }
            }
        });

        StringBuilder fingerprint = new StringBuilder();
        for (int i = 0; i < hashBitCount; i++) {
            if (hashBit[i] >= 0) {
                fingerprint.append("1");
            }else{
                fingerprint.append("0");
            }
        }

        return fingerprint.toString();
    }

    /**
     * Compute the hash value of a word, using a common string-hash scheme.
     * @param word the word
     * @return the hash value
     */
    private BigInteger hash(String word) {
        if (word == null || word.length() == 0) {
            return new BigInteger("0");
        }

        char[] charArray = word.toCharArray();
        BigInteger x = BigInteger.valueOf(((long) charArray[0]) << 7);
        BigInteger m = new BigInteger("1000003");
        BigInteger mask = new BigInteger("2").pow(hashBitCount).subtract(new BigInteger("1"));
        long sum = 0;
        for (char c : charArray) {
            sum += c;
        }

        x = x.multiply(m).xor(BigInteger.valueOf(sum)).and(mask);
        x = x.xor(new BigInteger(String.valueOf(word.length())));
        if (x.equals(new BigInteger("-1"))) {
            x = new BigInteger("-2");
        }

        return x;
    }

    /**
     * Segment a text into words.
     * @param text the input text
     * @return the word list
     */
    private List<Word> seg(String text) {
        if(text == null){
            return Collections.emptyList();
        }

        Segmentation segmentation = SegmentationFactory.getSegmentation(SegmentationAlgorithm.MaxNgramScore);
        List<Word> words = segmentation.seg(text);
        if (filterStopWord) {
            // filter out stop words
            StopWord.filterStopWords(words);
        }
        return words;
    }

    public static void main(String[] args) throws Exception {
        String text1 = "我的兴趣爱好是看书";
        String text2 = "看书是我的兴趣爱好";
        String text3 = "我爱好看书";

        SimHashBaseOnWord textSimilarity = new SimHashBaseOnWord();
        double score1pk2 = textSimilarity.similarScore(text1, text2);
        double score1pk3 = textSimilarity.similarScore(text1, text3);
        double score2pk3 = textSimilarity.similarScore(text2, text3);

        String sim1 = textSimilarity.simHash(text1);
        String sim2 = textSimilarity.simHash(text2);
        LOGGER.info(text1 + " and " + text2 + " have Hamming distance: " + textSimilarity.hammingDistance(sim1, sim2));
        System.out.println(text1 + " vs " + text2 + " similarity score: " + score1pk2);
        System.out.println(text1 + " vs " + text3 + " similarity score: " + score1pk3);
        System.out.println(text2 + " vs " + text3 + " similarity score: " + score2pk3);
    }
}

As a result, the Hamming distance between "my hobby is reading" (我的兴趣爱好是看书) and "reading is my hobby" (看书是我的兴趣爱好) is 0.
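A side note on the distance computation: if fingerprints are stored as long values rather than 01 strings, the Hamming distance reduces to one XOR plus a popcount. This is an alternative formulation for illustration, not part of the article's tool, which compares 01 strings character by character.

```java
public class HammingDemo {
    // Hamming distance of two 64-bit fingerprints held as longs:
    // the differing bits are exactly the set bits of a XOR b.
    static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        System.out.println(hamming(0b1011L, 0b0010L)); // bits 0 and 3 differ -> 2
    }
}
```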

Search tool

The complete code of the search tool class SearchBaseOnSimHash, built on SimHashBaseOnWord:

public class SearchBaseOnSimHash {
    // constant: id of the similar record in the store
    public static final String idKey = "id";

    // constant: SimHash fingerprint of the similar record in the store
    public static final String fingerKey = "finger";

    // logger
    private static final Logger LOGGER = LoggerFactory.getLogger(SearchBaseOnSimHash.class);

    // Hamming distance threshold for similarity
    private static int similarityThreshold = 3;

    /**
     * Search for a record similar to the given SimHash value. Algorithm:
     *  1. Split the SimHash into four segments and use each segment as a key
     *     to look up the full cached fingerprint for that segment;
     *  2. If a segment matches a cached entry, compute the Hamming distance
     *     between the cached fingerprint and the current one;
     *  3. If the distance is within the similarity threshold, the record is
     *     considered found and an object holding its id is returned;
     *  4. Otherwise, there is no similar record.
     * @param segments the four segments
     * @param simHash the full SimHash fingerprint
     * @return a JSON object holding the matched record id, or null
     */
    public static JSONObject search(String [] segments , String simHash) {
        if(segments == null || segments.length != 4) {
            return null;
        }

        // a SimHash instance used to compute Hamming distances
        SimHashBaseOnWord simHashByWord = new SimHashBaseOnWord();

        // traverse the segments to look for a similar record
        for (int i = 0; i < segments.length; i++) {
            // the key is a segment value; the value holds the cached record
            Map<String, String> hashSets = RedisUtil.get(segments[i]);
            if(hashSets == null) {
                continue;
            }

            // a segment matched: compute the Hamming distance
            String finger = hashSets.get(fingerKey);
            int hammingDistance = simHashByWord.hammingDistance(finger, simHash);
            if (hammingDistance <= similarityThreshold) {
                JSONObject responseObj = new JSONObject();
                responseObj.put(idKey, hashSets.get(idKey));
                return responseObj;
            }
        }

        return null;
    }

    /**
     * Add a SimHash record to the index: traverse the segments and cache the
     * record under each segment. The segments are passed in because both
     * search and push need them; passing them avoids recomputation.
     * @param segments the four segments
     * @param simHash the full SimHash fingerprint
     * @param data the record to cache
     */
    public static void push(String[] segments,String simHash, JSONObject data) {
        if(segments == null || segments.length != 4) {
            return;
        }

        Map<String,String> redisData = new HashMap<>();
        redisData.put(fingerKey, simHash);
        redisData.put(idKey, data.getString(idKey));

        for (int i = 0; i < segments.length; i++) {
            RedisUtil.put(segments[i], redisData);
        }
    }

    public static void main(String[] args) {
        String text1 = "我的兴趣爱好是看书";
        String text2 = "看书是我的兴趣爱好";
        String text3 = "我爱好看书";

        SimHashBaseOnWord textSimilarity = new SimHashBaseOnWord();
        String simhash = textSimilarity.simHash(text1);

        String[] hashs = new String[4];
        hashs[0] = simhash.substring(0, 16);
        hashs[1] = simhash.substring(16, 32);
        hashs[2] = simhash.substring(32, 48);
        hashs[3] = simhash.substring(48, 64);
        JSONObject target = search(hashs, simhash);

        // no similar record found: cache the new fingerprint
        if (target == null) {
            target = new JSONObject();
            target.put(idKey, "123");
            push(hashs, simhash, target);
        } else {
            System.out.println(target.get(idKey));
        }

        String sim2 = textSimilarity.simHash(text2);
        hashs[0] = sim2.substring(0, 16);
        hashs[1] = sim2.substring(16, 32);
        hashs[2] = sim2.substring(32, 48);
        hashs[3] = sim2.substring(48, 64);
        target = search(hashs, sim2);

        // no similar record found: cache the new fingerprint
        if (target == null) {
            target = new JSONObject();
            target.put(idKey, "456");
            push(hashs, sim2, target);
        } else {
            System.out.println(target.get(idKey));
        }
    }
}

Takeaways

In theory, SimHash-based text similarity should be more accurate on longer texts, yet this implementation based on the word segmentation library is accurate for both long and short texts. Why?

The answer can be found by analyzing its startup logs.

A few things can be learned from them:

  1. It ships a large dictionary of resource words;
  2. It automatically performs word frequency statistics and weight calculation during segmentation;
  3. It is a "heavyweight" tool, and its computation takes longer than other word segmentation tools.


Origin blog.csdn.net/wojiushiwo945you/article/details/108878845