Detailed explanation of the KMP algorithm (with a detailed explanation of the next array)

The key to the KMP algorithm is its next array, which tells us how many positions the pattern string can be shifted after a mismatch so that no comparison is repeated unnecessarily.
We want to compute a transfer function next of length m, where m is the length of the pattern string.

For each prefix of the pattern string, the next array describes the longest prefix that is also a suffix of it (the whole prefix itself excluded). The value stored is the length of that common part minus 1.
Therefore, when a mismatch occurs, we can directly set the subscript of the pattern string to j=next[j] (the variable k in the code below), eliminating unnecessary comparisons.
For example, take the pattern string ptr = ababaca, whose length is 7. Then next[0], next[1], next[2], next[3], next[4], next[5], next[6] describe the longest common prefix and suffix of a, ab, aba, abab, ababa, ababac, ababaca respectively. These are "", "", "a", "ab", "aba", "", "a", so the value of the next array is [-1,-1,0,1,2,-1,0], where -1 means no common prefix and suffix exists, 0 means one exists with a length of 1, and 2 means one exists with a length of 3.
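The table [-1,-1,0,1,2,-1,0] can be cross-checked against the definition with a brute-force sketch like the one below. The name naive_next is hypothetical and the function is not the method the article goes on to use; it only tries every candidate length for every prefix of the string, storing the length minus 1 as in the example above.

```cpp
#include <string>
#include <vector>

// Brute-force reference (hypothetical helper): for every prefix s[0..i],
// try each candidate length from longest to shortest and keep the first
// one whose prefix and suffix agree. Stores length - 1, so -1 means
// "no common prefix and suffix".
std::vector<int> naive_next(const std::string &s)
{
    std::vector<int> next(s.size(), -1);
    for (std::size_t i = 0; i < s.size(); i++)
    {
        // s[0..i] has i + 1 characters, so a proper prefix has at most i
        for (std::size_t len = i; len >= 1; len--)
        {
            // compare prefix s[0..len-1] with suffix s[i-len+1..i]
            if (s.compare(0, len, s, i - len + 1, len) == 0)
            {
                next[i] = static_cast<int>(len) - 1;
                break;
            }
        }
    }
    return next;
}
```

Calling naive_next("ababaca") gives the same seven values as the example. The cost grows roughly cubically with the pattern length in the worst case, which is why a faster construction is needed.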

Here is the KMP code first:

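void pro_next(char *str, int *next, int len); // defined later in this article

int KMP(char *str, int slen, char *ptr, int plen)
{
    if (plen <= 0)
        return -1; // nothing to search for
    int *next = new int[plen];
    pro_next(ptr, next, plen); // compute the next array
    int k = -1;
    for (int i = 0; i < slen; i++)
    {
        while (k > -1 && ptr[k + 1] != str[i]) // ptr and str do not match, and k > -1 (part of ptr has already matched)
            k = next[k]; // fall back
        if (ptr[k + 1] == str[i])
            k = k + 1;
        if (k == plen - 1) // k has reached the last character of ptr
        {
            // cout << "found at position " << i - plen + 1 << endl;
            // k = -1; // reinitialize and look for the next match
            // i = i - plen + 1; // move i back to the match start; the outer loop's i++ then continues the search (two matches are assumed to be allowed to overlap). Thanks to the reader who pointed out the error in the comments.
            delete[] next; // release the table before returning
            return i - plen + 1; // return the position where the match starts
        }
    }
    delete[] next;
    return -1;
}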
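The commented-out lines in the listing hint at reporting every match instead of returning only the first. A minimal sketch of that variant follows, under a few assumptions: kmp_find_all and build_next are hypothetical names, std::string and std::vector replace the raw pointers, and after a full match the sketch falls back with k = next[k] rather than resetting k and moving i backwards as the commented-out lines do.

```cpp
#include <string>
#include <vector>

// Same table as the article's next array: next[q] is the length of the
// longest common prefix and suffix of p[0..q], minus 1.
static std::vector<int> build_next(const std::string &p)
{
    std::vector<int> next(p.size(), -1);
    int k = -1;
    for (std::size_t q = 1; q < p.size(); q++)
    {
        while (k > -1 && p[k + 1] != p[q])
            k = next[k];
        if (p[k + 1] == p[q])
            k = k + 1;
        next[q] = k;
    }
    return next;
}

// Hypothetical helper: start index of every match, overlapping ones included.
std::vector<int> kmp_find_all(const std::string &s, const std::string &p)
{
    std::vector<int> hits;
    if (p.empty())
        return hits; // assumption: an empty pattern yields no matches
    const std::vector<int> next = build_next(p);
    const int plen = static_cast<int>(p.size());
    int k = -1;
    for (int i = 0; i < static_cast<int>(s.size()); i++)
    {
        while (k > -1 && p[k + 1] != s[i])
            k = next[k];
        if (p[k + 1] == s[i])
            k = k + 1;
        if (k == plen - 1)
        {
            hits.push_back(i - plen + 1);
            k = next[k]; // fall back after a match, so i never moves backwards
        }
    }
    return hits;
}
```

For instance, kmp_find_all("abababa", "aba") yields the positions 0, 2 and 4. Falling back through the table after a match keeps the scan moving strictly left to right over the text.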

The KMP function above is relatively easy to follow, so the core of the KMP algorithm is how to compute the next array quickly.
We know that the next array describes the longest common prefix and suffix of each prefix of the pattern string; computing it by brute force for every prefix is too slow.
Suppose next[1], next[2], ..., next[i] have already been obtained, where in this paragraph next[i] stands for the length of the longest common prefix and suffix of the first i characters (the code in this article stores that length minus 1), and next[i+1] is now required. If the two characters at position i and position next[i] are the same (subscripts start from zero), then next[i+1] is equal to next[i] plus 1. If the two characters are not the same, we split the string of length next[i] further to obtain its own longest common length, next[next[i]], and compare the character at that position with the character at position i. This works because the prefix and the suffix of length next[i] are the same string, so they have the same internal structure. If the characters at position next[next[i]] and position i are the same, then next[i+1] is equal to next[next[i]] plus 1. If they are not equal, continue splitting the string of length next[next[i]] in the same way until the length reaches 0.
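The recurrence in this paragraph, with the table holding lengths directly, can be sketched as follows. The name border_lengths and the array name border are hypothetical; border[i] plays the role of next[i] in the paragraph above, and subtracting 1 from border[q+1] gives next[q] in the convention used by the code in this article.

```cpp
#include <string>
#include <vector>

// border[i] = length of the longest proper prefix of the first i characters
// of s that is also a suffix of them (length-based convention).
std::vector<int> border_lengths(const std::string &s)
{
    const int n = static_cast<int>(s.size());
    std::vector<int> border(n + 1, 0); // border[0] and border[1] stay 0
    for (int i = 1; i < n; i++)
    {
        int len = border[i];              // best candidate to extend
        while (len > 0 && s[len] != s[i]) // s[len] follows the prefix, s[i] follows the suffix
            len = border[len];            // split again: border[border[i]], and so on
        if (s[len] == s[i])
            len++;                        // the candidate extends by one character
        border[i + 1] = len;
    }
    return border;
}
```

For the string ababaca this produces 0, 0, 0, 1, 2, 3, 0, 1 for lengths 0 through 7; dropping the first entry and subtracting 1 from the rest gives [-1,-1,0,1,2,-1,0].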

For example, take the string ababab. When q=4 we already know that next[4]=2 (k=2: the substring ababa, made of the first 5 letters of the string, has a longest common prefix and suffix, aba, of length 3, so k=2 and next[4]=2. Whether this result was found by our own observation or calculated by the program is not the point; the point is how the program calculates next[5] from the current result). When we calculate next[5], q=5 and k=2 (the value left after the previous iteration of the loop). What we need to compare is whether str[k+1] and str[q] are equal, that is, whether str[3] and str[5] are equal. Why compare from k+1? Because the previous iteration already guaranteed that str[k] and str[q] are equal (note that this q is the q of the previous iteration), so in this iteration we directly compare str[k+1] with str[q] (this q is the q of the current iteration). Here both characters are b, so k becomes 3 and next[5]=3.

The equal case is easy to understand by drawing it out yourself. The unequal case is more important. For example, take
aabaabcdeaabaa [f]
When q reaches 13, k=4. When q reaches 14, str[k + 1] == str[q] no longer holds (str[5] is b, str[14] is f).
Should we then retry from k=3? No: k=3 is certain to fail,
because the longest common prefix and suffix of the first k+1 characters, aabaa, is aa, which ends at index 1. Any longer candidate cannot match at both ends, so we
can continue directly from k=next[4]=1.
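To make the fallback visible, the sketch below records every value of k that is tried while the entry for one position q is being computed. The helper fallback_chain is hypothetical and exists only for illustration; it uses the same "length minus 1" convention as the rest of the article.

```cpp
#include <string>
#include <vector>

// Returns every k tried while computing next[q], starting with the k
// carried over from position q - 1 (hypothetical helper for illustration).
std::vector<int> fallback_chain(const std::string &s, int q)
{
    std::vector<int> next(s.size(), -1);
    std::vector<int> tried;
    int k = -1;
    for (int p = 1; p <= q && p < static_cast<int>(s.size()); p++)
    {
        if (p == q)
            tried.push_back(k); // candidate inherited from the previous position
        while (k > -1 && s[k + 1] != s[p])
        {
            k = next[k]; // fall back to the next shorter candidate
            if (p == q)
                tried.push_back(k);
        }
        if (s[k + 1] == s[p])
            k = k + 1;
        next[p] = k;
    }
    return tried;
}
```

For the string aabaabcdeaabaaf and q=14, the recorded values are 4, 1, 0, -1: the search jumps from k=4 straight to next[4]=1 without ever trying k=3 or k=2.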
From this we can write the code that computes the next array:

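void pro_next(char *str, int *next, int len)
{
    next[0] = -1; // next[0] is initialized to -1; -1 means no common longest prefix and suffix exists
    int k = -1; // k is initialized to -1
    for (int q = 1; q <= len - 1; q++)
    {
        while (k > -1 && str[k + 1] != str[q]) // if the next character differs, k becomes next[k]; note that next[k] is always less than k, whatever value k takes
        {
            k = next[k]; // fall back
        }
        if (str[k + 1] == str[q]) // if they are the same, k++
        {
            k = k + 1;
        }
        next[q] = k; // store the computed k (the length of the longest common prefix and suffix, minus 1) in next[q]
    }
}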

