The easiest KMP explanation to follow: understand it at a glance

Preface

The KMP algorithm is very simple once you understand it, and the idea behind it is simple too. Before I understood it, I read all kinds of material and only got more confused; even the simplest explanations on the Internet left me lost.
I spent half a day working out how to get this across in the shortest possible time.
There is no theory here, only a walk through the algorithm and its code:

What type of problem does the KMP algorithm solve?

String matching. Given two strings, determine whether one contains the other, and if so, return the position where the match starts.
For example, take the following two strings:

char *str = "bacbababadababacambabacaddababacasdsd";
char *ptr = "ababaca";

str contains ptr in two places, starting at subscripts 10 and 26 of str.

The problem itself is simple, so let's go straight to the algorithm.

Algorithm Description

The usual way to match strings is to start at the first subscript of the target string str (length n), take the substring with the same length as ptr (length m), and compare the two. If they are the same, return the starting subscript of that substring. If they differ, move to the next subscript of str, again take a substring of length m, and compare, until the end of str is reached (in practice the starting subscript only needs to go up to n-m). The time complexity of this approach is O(n*m).
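The brute-force procedure just described can be sketched roughly as follows. This is only an illustration: the function name naive_find and the use of std::string are assumptions made for this sketch and are not part of the original code.

```cpp
#include <string>

// Hypothetical helper (not part of the original code): brute-force matching.
// Tries every starting subscript i of str from 0 to n-m and compares the
// m characters starting there with ptr. Worst case O(n*m).
int naive_find(const std::string &str, const std::string &ptr)
{
    int n = static_cast<int>(str.size());
    int m = static_cast<int>(ptr.size());
    for (int i = 0; i + m <= n; i++)
    {
        int j = 0;
        while (j < m && str[i + j] == ptr[j])
            j++;
        if (j == m)
            return i; // all m characters matched: i is the starting subscript
    }
    return -1; // no match found
}
```

For the two strings shown at the beginning, naive_find(str, ptr) returns 10, the first of the two matches.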

KMP algorithm: can be implemented with a complexity of O(m+n)

Why the time complexity drops:
KMP makes full use of the properties of the pattern string ptr (for example, repeated pieces inside it; and even when nothing repeats, each failed comparison shifts the pattern as far as possible).
It doesn't matter whether that makes sense yet; what I just said does not really dig into the underlying reasons.

Examine the pattern string ptr:
ababaca
Here we want to compute a transition function next, stored as an array of length m.

The next array records, for a given string, the length of the longest prefix that is also a suffix.

For example, for abcjkdabc the longest prefix that is also a suffix is abc.
For cbcbc it is cbc.
For abcbc no such prefix and suffix exist.
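These examples can be checked mechanically. The sketch below simply tries every candidate length from longest to shortest; the function name longest_common_prefix_suffix is made up for this illustration and does not appear in the original code.

```cpp
#include <string>

// Hypothetical helper (not part of the original code): returns the longest
// prefix of s that is also a suffix of s, where the prefix must be shorter
// than s itself. Candidate lengths are tried from longest to shortest.
std::string longest_common_prefix_suffix(const std::string &s)
{
    for (int len = static_cast<int>(s.size()) - 1; len >= 1; len--)
    {
        // compare the first len characters with the last len characters
        if (s.compare(0, len, s, s.size() - len, len) == 0)
            return s.substr(0, len);
    }
    return ""; // no common prefix and suffix exists
}
```

Because the whole string is excluded as a candidate, a string made of one repeated character still gets an answer that is one character shorter than the string itself.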

**Note on the longest prefix: it starts at the first character but does not include the last character. Likewise, the longest suffix ends at the last character but does not include the first.
For example, for aaaa the longest common prefix and suffix is aaa.**
For the pattern string ptr, ababaca, the length is 7, so next[0], next[1], next[2], next[3], next[4], next[5], next[6] are computed for the substrings a, ab, aba, abab, ababa, ababac, ababaca respectively.
The longest common prefix and suffix of these substrings are "", "", "a", "ab", "aba", "", "a", so the next array is [-1,-1,0,1,2,-1,0]. Each stored value is the length minus 1: -1 means no common prefix and suffix exists, 0 means one exists with length 1, and 2 means one exists with length 3. The offset is there to match the code.
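The values above can be cross-checked by building the array straight from the definition, without any of the KMP tricks. This is a slow sketch meant only for verification; the name next_by_definition is an assumption of this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the original code): builds the next array
// directly from the definition. For each prefix ptr[0..q] it looks for the
// longest common prefix and suffix (shorter than that prefix) and stores its
// length minus 1, so -1 means "does not exist".
std::vector<int> next_by_definition(const std::string &ptr)
{
    std::vector<int> next(ptr.size(), -1);
    for (int q = 0; q < static_cast<int>(ptr.size()); q++)
    {
        // the prefix ptr[0..q] has q+1 characters, so candidate lengths are q..1
        for (int len = q; len >= 1; len--)
        {
            if (ptr.compare(0, len, ptr, q + 1 - len, len) == 0)
            {
                next[q] = len - 1;
                break;
            }
        }
    }
    return next;
}
```

For ababaca this yields [-1,-1,0,1,2,-1,0], the same values as listed above.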

In the figure below, the segments marked 1, 2, 3 and 4 are identical, and the parts between 1 and 2 and between 3 and 4 are identical as well. We then find that A and B differ. The earlier algorithm would shift the lower string forward by one position and start comparing from the beginning again, which repeats a lot of comparisons. The current approach shifts the lower string forward until 3 lines up with 2, and then directly compares C with A.

[Figure: the two strings with the identical segments 1, 2, 3, 4 and the characters A, B, C marked]

Code walkthrough

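void cal_next(char *str, int *next, int len)
{
    next[0] = -1; // next[0] starts at -1; -1 means no common longest prefix and suffix exists
    int k = -1;   // k starts at -1
    for (int q = 1; q <= len-1; q++)
    {
        while (k > -1 && str[k + 1] != str[q]) // if the next character differs, k becomes next[k]; note that next[k] is always less than k, whatever k is
        {
            k = next[k]; // backtrack
        }
        if (str[k + 1] == str[q]) // if the characters are the same, k++
        {
            k = k + 1;
        }
        next[q] = k; // store the computed k (length of the longest common prefix and suffix, minus 1) in next[q]
    }
}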

KMP

The matching function is very similar to cal_next; the code speaks for itself. In fact, the whole matching process has already been roughly described above.

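int KMP(char *str, int slen, char *ptr, int plen)
{
    if (plen <= 0) return -1; // guard: cal_next writes next[0], so ptr must not be empty
    int *next = new int[plen];
    cal_next(ptr, next, plen); // compute the next array
    int k = -1;
    for (int i = 0; i < slen; i++)
    {
        while (k > -1 && ptr[k + 1] != str[i]) // ptr and str mismatch, and k > -1 (ptr and str have partially matched)
            k = next[k]; // backtrack
        if (ptr[k + 1] == str[i])
            k = k + 1;
        if (k == plen-1) // k has reached the end of ptr
        {
            //cout << "at position " << i-plen+1 << endl;
            //k = -1; // reset and look for the next match
            //i = i - plen + 1; // move i back to this position so the outer loop's i++ continues the search (assuming two matches may partially overlap); thanks to the commenter who pointed out the error.
            delete[] next; // free the table before returning
            return i-plen+1; // return the starting position
        }
    }
    delete[] next; // free the table before returning
    return -1;
}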

Test

int main()
{
    char str[] = "bacbababadababacambabacaddababacasdsd";
    char ptr[] = "ababaca";
    int a = KMP(str, 37, ptr, 7); // str has 37 characters; a should be 10, the first match
    return 0;
}

Note that if str contains several matches for ptr and you want every matching subscript, the KMP function needs a small change. See the commented-out code above.
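One possible way to collect every match is sketched below. Note that it does not use the commented-out reset (k = -1 plus moving i back); it continues with k = next[k] after each match instead, which is a common alternative that also finds overlapping matches and never moves i backward. The names build_next and kmp_find_all, and the use of std::string and std::vector, are assumptions made for this sketch.

```cpp
#include <string>
#include <vector>

// Same computation as cal_next, rewritten with std::vector for this sketch.
std::vector<int> build_next(const std::string &ptr)
{
    std::vector<int> next(ptr.size(), -1);
    int k = -1;
    for (int q = 1; q < static_cast<int>(ptr.size()); q++)
    {
        while (k > -1 && ptr[k + 1] != ptr[q])
            k = next[k]; // backtrack
        if (ptr[k + 1] == ptr[q])
            k = k + 1;
        next[q] = k;
    }
    return next;
}

// Hypothetical variant of KMP(): returns the starting subscript of every match.
std::vector<int> kmp_find_all(const std::string &str, const std::string &ptr)
{
    std::vector<int> positions;
    if (ptr.empty())
        return positions; // nothing to search for
    std::vector<int> next = build_next(ptr);
    int slen = static_cast<int>(str.size());
    int plen = static_cast<int>(ptr.size());
    int k = -1;
    for (int i = 0; i < slen; i++)
    {
        while (k > -1 && ptr[k + 1] != str[i])
            k = next[k]; // backtrack
        if (ptr[k + 1] == str[i])
            k = k + 1;
        if (k == plen - 1) // a full match ends at subscript i
        {
            positions.push_back(i - plen + 1);
            k = next[k]; // keep the longest common prefix and suffix and go on
        }
    }
    return positions;
}
```

For the str and ptr used in the test, kmp_find_all returns the two subscripts 10 and 26.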

Complexity Analysis

The complexity of computing next is O(m). At first I thought it was O(m^2), but after thinking it through carefully, taking the while loop in cal_next together with the outer for loop and using amortized analysis, it is actually O(m). I'll write this up later once I have thought it through.
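The amortized argument can be checked with a small experiment. In each iteration of the for loop k grows by at most 1, every pass through the while loop lowers k by at least 1, and k never drops below -1, so the statement k = next[k] cannot run more than m-1 times in total. The instrumented copy below counts those steps; the name count_backtracks is an assumption of this sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical instrumented copy of cal_next (not part of the original code):
// returns how many times k = next[k] runs while building next for ptr.
int count_backtracks(const std::string &ptr)
{
    int m = static_cast<int>(ptr.size());
    if (m == 0)
        return 0;
    std::vector<int> next(m, -1);
    int k = -1;
    int backtracks = 0;
    for (int q = 1; q < m; q++)
    {
        while (k > -1 && ptr[k + 1] != ptr[q])
        {
            k = next[k];
            backtracks++; // one backtracking step
        }
        if (ptr[k + 1] == ptr[q])
            k = k + 1;
        next[q] = k;
    }
    return backtracks;
}
```

For ababaca the only backtracking happens at q=5 (two steps), well below the bound of m-1 = 6.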

…………………………………………..Separation line……………………………………..
In fact, the article ends here. What follows only addresses the questions raised in the comments, which I try to answer.

Further explanation (2018-3-14)

After reading the comments, I see that most questions concern these lines in the cal_next(..) function and the KMP() function:

while (k > -1 && str[k + 1] != str[q])
        {
            k = next[k];
        }

and

while (k >-1&& ptr[k + 1] != str[i])
            k = next[k];

This while loop and k=next[k] are very confusing!
Indeed, when I first read these lines I was quite confused too. What do they do? Why are they written this way? Later I ran the code on a computer and gradually understood. These few lines capture the essence of the KMP algorithm very cleanly. Very cool!
Let's look directly at the cal_next(..) function:
first, what does the while loop do?

Before that, let's go back to the full function. It contains one big for() loop. What is this for() loop for?

This for loop computes the values next[0], next[1], ..., next[q], ...

Its last statement, next[q]=k, means that at the end of each iteration we have computed the "length of the longest common prefix and suffix" of the substring made of the first (q+1) letters of ptr. (This was explained earlier!) That "length" is k, with the same minus-1 offset as before.

OK. Now suppose the loop has finished iteration q, that is, next[q] has been computed. How do we compute next[q+1]?

For example, take ababab. When q=4, next[4]=2 (k=2 means that the substring ababa, made of the first 5 letters, has a longest common prefix and suffix of length 3, so k=2 and next[4]=2. Whether you got this result by inspection or from the program is not the point; the point is how the program computes next[5] from the current result). So for ababab, when we compute next[5] we have q=5 and k=2 (the value left over from the previous iteration). What we need to compare is whether str[k+1] and str[q] are equal, that is, whether str[3] and str[5] are equal. Why start comparing at k+1? Because the previous iteration already guaranteed that str[k] equals str[q] (note that this q is the previous iteration's q; think it over and it is easy to see), so in this iteration we directly compare str[k+1] with str[q] (this q is the current iteration's q).
If they are equal, we skip the while(), enter the if(), set k=k+1, and then next[q]=k. So for ababab we get next[5]=3. The program computes this by itself, and it agrees with what we observe.
If they are not equal, we can use "ababac" to illustrate. Since they differ, we enter the while() and execute k=next[k]. This statement means that when str[k + 1] != str[q], we look backward for a smaller k such that str[k + 1]==str[q]. Should we step back one position at a time, or is there a faster way? (I originally wrote that stepping back one at a time also works, i.e. that replacing k = next[k] with k-- is fine. Correction: that is wrong. Replacing k=next[k] with k-- does not work, and comment #25 gives a counterexample.) The program uses a faster way, k = next[k]. The idea is that once str[k + 1] != str[q], i.e. the current suffix cannot be extended, we can skip the middle part and look inside the prefix, because next[k] holds the longest common prefix and suffix of that prefix. So k=next[k] becomes k=next[2], i.e. k=0. Now compare str[0+1] with str[5]; they are not equal either, so k=next[0]=-1 and the loop exits.
(Does this explanation make sense?)
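A concrete counterexample makes the correction about k-- easier to see. The string abaabb used here is just one example picked for this sketch and is not necessarily the one given in the comments; the function build_next and its use_decrement flag are likewise made up for the comparison.

```cpp
#include <string>
#include <vector>

// Builds next for ptr. With use_decrement == false this follows the same
// logic as cal_next. With use_decrement == true the backtracking step
// k = next[k] is replaced by k-- (the incorrect variant, kept here only
// so the two results can be compared).
std::vector<int> build_next(const std::string &ptr, bool use_decrement)
{
    std::vector<int> next(ptr.size(), -1);
    int k = -1;
    for (int q = 1; q < static_cast<int>(ptr.size()); q++)
    {
        while (k > -1 && ptr[k + 1] != ptr[q])
            k = use_decrement ? k - 1 : next[k];
        if (ptr[k + 1] == ptr[q])
            k = k + 1;
        next[q] = k;
    }
    return next;
}
```

For abaabb the correct array is [-1,-1,0,0,1,-1], while the k-- variant produces [-1,-1,0,0,1,1]. At q=5 the k-- variant stops at k=0 because str[1] and str[5] are both b, and then reports ab as a common prefix and suffix, even though the string actually ends in bb. Stepping back one position at a time only checks a single character; k = next[k] only visits positions where the whole prefix is already known to match.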

The above explains this loop in the cal_next() function:

while (k > -1 && str[k + 1] != str[q])
        {
            k = next[k];
        }

To me this is one of the hardest parts to understand. If anything here is wrong, please point it out.

Complexity Analysis:

To analyze the KMP complexity, look directly at the KMP function.

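int KMP(char *str, int slen, char *ptr, int plen)
{
    if (plen <= 0) return -1; // guard: cal_next writes next[0], so ptr must not be empty
    int *next = new int[plen];
    cal_next(ptr, next, plen); // compute the next array
    int k = -1;
    for (int i = 0; i < slen; i++)
    {
        while (k > -1 && ptr[k + 1] != str[i]) // ptr and str mismatch, and k > -1 (ptr and str have partially matched)
            k = next[k]; // backtrack
        if (ptr[k + 1] == str[i])
            k = k + 1;
        if (k == plen-1) // k has reached the end of ptr
        {
            //cout << "at position " << i-plen+1 << endl;
            //k = -1; // reset and look for the next match
            //i = i - plen + 1; // move i back to this position so the outer loop's i++ continues the search (assuming two matches may partially overlap); thanks to the commenter who pointed out the error.
            delete[] next; // free the table before returning
            return i-plen+1; // return the starting position
        }
    }
    delete[] next; // free the table before returning
    return -1;
}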

This is really hard to explain. To put it simply: it is difficult to explain the complexity from the code alone.
[Figure: the matching string moving forward along str]

This figure explains it.

We can see that each time the matching string moves forward, it moves a long way. Suppose the matching string has no repeated prefix and suffix, i.e. every value of next is -1. Then each move shifts the whole matching string forward by up to m positions, after which up to m characters are compared one by one again. In short: move m positions, compare m times. By the time we reach the end, about n comparisons have been made, which is O(n). If the matching string does have repeated prefixes and suffixes, each move is shorter, but fewer characters need to be compared again, so the overall cost is still O(n).
So the complexity is linear.
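The linear bound can also be checked by counting comparisons. In the matching loop, k = next[k] runs at most n times in total (k grows by at most 1 per character of str), and each character of str adds at most two more comparisons, so the count stays within 3n for this loop structure. The sketch below is an instrumented copy of the matching loop; MatchStats and kmp_with_stats are names invented for this illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical result type for this sketch (not part of the original code).
struct MatchStats
{
    int position;    // starting subscript of the first match, or -1
    int comparisons; // character comparisons between ptr and str
};

MatchStats kmp_with_stats(const std::string &str, const std::string &ptr)
{
    MatchStats stats = {-1, 0};
    int slen = static_cast<int>(str.size());
    int plen = static_cast<int>(ptr.size());
    if (plen == 0)
        return stats;

    // build next with the same logic as cal_next
    std::vector<int> next(plen, -1);
    int k = -1;
    for (int q = 1; q < plen; q++)
    {
        while (k > -1 && ptr[k + 1] != ptr[q])
            k = next[k];
        if (ptr[k + 1] == ptr[q])
            k = k + 1;
        next[q] = k;
    }

    // matching loop, same logic as KMP() but with a comparison counter
    k = -1;
    for (int i = 0; i < slen; i++)
    {
        while (k > -1)
        {
            stats.comparisons++;
            if (ptr[k + 1] == str[i])
                break;   // characters agree, stop backtracking
            k = next[k]; // backtrack
        }
        stats.comparisons++;
        if (ptr[k + 1] == str[i])
            k = k + 1;
        if (k == plen - 1)
        {
            stats.position = i - plen + 1;
            return stats;
        }
    }
    return stats;
}
```

Running it on a target of 50 copies of a with the pattern aaaab, a case that forces backtracking at almost every position, still keeps the number of comparisons under 3n = 150.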


Origin http://43.154.161.224:23101/article/api/json?id=324883138&siteId=291194637