String Search: The Rabin-Karp Algorithm

This article follows on from "String Search: The Brute-Force Method".

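Searching for a string by brute force is simple and intuitive, but its time complexity is too high: O(mn) for a source of length n and a target of length m. The widely accepted standard algorithm for string search is KMP, which runs in O(m+n) time, but KMP is hard to understand and does not generalize well. The Rabin-Karp algorithm introduced in this article is easy to understand and, on average, reaches the same time complexity as KMP.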

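The idea behind Rabin-Karp builds on brute-force matching. Comparing two strings of length m takes O(m) time, whereas comparing two integers takes only O(1). Rabin-Karp therefore adds hashing on top of brute force: each string to be compared is converted to an integer, that is, the hash takes a string as its key and yields an integer as its value. From the properties of hashing it follows that: 1) two different strings may map to equal values; 2) two identical strings must map to equal values; 3) two unequal values must come from two different strings. So when two strings have the same value, Rabin-Karp still has to compare their characters one by one. Compared with brute force, however, it skips the character comparison whenever the two hash values differ, which lowers the cost of string comparison. Two key questions remain.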

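Question 1: how is a string converted to an integer? Take the string "target" as an example. Its hash value can be written as $(t \times 31^5 + a \times 31^4 + r \times 31^3 + g \times 31^2 + e \times 31^1 + t \times 31^0) \% 10^6$, where t, a, r, g, e, t are the ASCII codes of the corresponding characters and 31 is an empirically chosen base. For a long string this number becomes very large and is likely to overflow, so we take the remainder. The divisor $10^6$ is the size of the hash table, that is, the number of possible hash values. Other divisors can be used; the larger the divisor, the lower the chance of a hash collision.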
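The formula above can be sketched in C++ as follows. This is a minimal illustration rather than the article's own code: the helper name `polyHash` and the constants `kBase` and `kMod` are assumptions introduced here. It evaluates the polynomial with Horner's rule and takes the remainder after every step, so the intermediate value stays within the range of `int`.

```cpp
#include <string>

// Hypothetical helper (not from the article): polynomial hash of s with
// base 31 and modulus 10^6. Horner's rule keeps every intermediate value
// below 31 * 10^6 + 256, which fits in a 32-bit int.
int polyHash(const std::string& s) {
    const int kBase = 31;
    const int kMod = 1000000;  // written out in full: in C++, 10^6 is XOR, not a power
    int h = 0;
    for (unsigned char c : s) {  // unsigned char keeps non-ASCII bytes non-negative
        h = (h * kBase + c) % kMod;
    }
    return h;
}
```

For instance, `polyHash("ab")` is (97 × 31 + 98) % 10^6 = 3105. The strings "Aa" and "BB" both hash to 2112 (65 × 31 + 97 = 66 × 31 + 66), a concrete case of two different strings sharing one value, which is why equal hashes still require a character-by-character check.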

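Question 2: how is the hash updated when the string changes? For example, to go from tar to arg, suppose the value of tar is x. Then the value of targ is $(x \times 31 + g) \% 10^6$ . Call this value y. The value of arg is then $(y - t \times 31^3) \% 10^6$ , because the character dropped from the front is t, and its weight in targ is $31^3$ . In code the term $t \times 31^3$ should itself be reduced modulo $10^6$ first, and if the result is still negative it must be turned into a non-negative number: $(y - t \times 31^3 + 10^6) \% 10^6$ .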
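The update step can be sketched as below. The names `roll`, `polyHash`, `kBase`, and `kMod` are hypothetical helpers for illustration, not part of the article. The sketch recomputes the power of 31 on every call to stay short; a real search would compute it once before the main loop.

```cpp
#include <string>

const int kBase = 31;
const int kMod = 1000000;  // 10^6, written out because ^ is XOR in C++

// Reference hash of a whole string (Horner's rule), used to check roll().
int polyHash(const std::string& s) {
    int h = 0;
    for (unsigned char c : s) {
        h = (h * kBase + c) % kMod;
    }
    return h;
}

// Hypothetical helper: given the hash of a window of length len whose first
// character is `outgoing`, return the hash of the window shifted right by
// one position, whose last character is `incoming`.
int roll(int windowHash, int len, unsigned char outgoing, unsigned char incoming) {
    // highPower = 31^len % kMod: the weight of `outgoing` after appending.
    int highPower = 1;
    for (int i = 0; i < len; i++) {
        highPower = (highPower * kBase) % kMod;
    }
    int extended = (windowHash * kBase + incoming) % kMod;  // "tar" -> "targ"
    int shifted = extended - (outgoing * highPower) % kMod; // drop the leading char
    if (shifted < 0) {
        shifted += kMod;  // C++ % can yield a negative result, so fix it up
    }
    return shifted;
}
```

Under these assumptions, `roll(polyHash("tar"), 3, 't', 'g')` gives the same value as `polyHash("arg")`, without rereading the characters a and r.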

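With these two questions answered, the flow of the Rabin-Karp algorithm follows directly. Its average time complexity is O(m+n); in the worst case, when many hash values collide, it degrades to O(mn). The code is as follows: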

#include <cstring>  // strlen
#include <string>   // std::string

using std::string;

class Solution
{
public:
    /*
     * @param source: A source string
     * @param target: A target string
     * @return: An integer as index
     */
    
    int strStr2(const char* source, const char* target) 
    {
        if (source == nullptr || target == nullptr)
        {
            return -1;
        }
        
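        // 10^6 written out in full: in C++ the ^ operator is bitwise XOR,
        // so the expression 10^6 would evaluate to 12.
        const int hash_size = 1000000;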
        int s = strlen(source);
        int t = strlen(target);
        
        if (t == 0)
        {
            return 0;
        }
        
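        // Hash of target; power ends up as 31^t % hash_size, the weight of
        // the character that leaves the window after a new one is appended.
        int target_code = 0;
        int power = 1;
        for (int i = 0; i < t; i++)
        {
            target_code = (target_code * 31 + target[i]) % hash_size;
            power = (power*31) % hash_size;
        }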
        
        int source_code = 0;
        for (int i = 0; i < s; i++)
        {
            source_code = (source_code * 31 + source[i]) % hash_size;
            
            if (i < t - 1)
            {
                continue;
            }
            if (i >= t)
            {
                // The window now holds t + 1 characters: remove the oldest one.
                source_code = source_code - (source[i-t] * power) % hash_size;
                
                if (source_code < 0)
                {
                    source_code += hash_size;
                }
            }
            
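            if (source_code == target_code)
            {
                // Equal hashes may be a collision, so verify the characters.
                // Only the t-character window is copied, not the whole source.
                if (std::string(source + i - t + 1, t) == target)
                {
                    return i - t + 1;
                }
            }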
        }
        
        return -1;
    }
};
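As noted earlier, a larger divisor lowers the chance of a collision. The sketch below is a variant written for illustration, not the article's code: the function name `findAll`, the modulus 1000000007, and the choice to return every match position (with an empty result for an empty target) are all assumptions. It uses 64-bit arithmetic so that products with the larger modulus do not overflow.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical variant: start index of every occurrence of target in source.
std::vector<int> findAll(const std::string& source, const std::string& target) {
    std::vector<int> hits;
    const int64_t kBase = 31;
    const int64_t kMod = 1000000007;  // assumed modulus: a prime near 10^9
    const int s = static_cast<int>(source.size());
    const int t = static_cast<int>(target.size());
    if (t == 0 || t > s) {
        return hits;  // design choice of this sketch: no matches reported
    }

    int64_t targetCode = 0;
    int64_t windowCode = 0;
    int64_t power = 1;  // becomes 31^t % kMod
    for (int i = 0; i < t; i++) {
        targetCode = (targetCode * kBase + static_cast<unsigned char>(target[i])) % kMod;
        power = (power * kBase) % kMod;
    }

    for (int i = 0; i < s; i++) {
        windowCode = (windowCode * kBase + static_cast<unsigned char>(source[i])) % kMod;
        if (i >= t) {
            // Remove the character that just left the window.
            windowCode -= (static_cast<unsigned char>(source[i - t]) * power) % kMod;
            if (windowCode < 0) {
                windowCode += kMod;
            }
        }
        // On a hash hit, verify the characters: different strings can collide.
        if (i >= t - 1 && windowCode == targetCode &&
            source.compare(i - t + 1, t, target) == 0) {
            hits.push_back(i - t + 1);
        }
    }
    return hits;
}
```

Calling `findAll("abcabcabc", "abc")` yields the indices 0, 3, and 6. Overlapping matches are reported as well, so `findAll("aaaa", "aa")` yields 0, 1, and 2.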
