What is the KMP algorithm (detailed explanation)

What is the KMP algorithm:

KMP is named after the three masters who discovered it: D. E. Knuth, J. H. Morris and V. R. Pratt. The first of them is the author of "The Art of Computer Programming"!

The problem that the KMP algorithm solves is locating a pattern string inside another string (also called the main string). Put simply, it is the keyword search we use every day. The pattern string is the keyword (called P from here on). If it appears in the main string (called T from here on), return the position where it occurs; otherwise return -1 (the usual convention).

First, there is a very simple idea for this problem: compare the characters one by one from left to right, and if a character fails to match along the way, jump back and shift the pattern string one position to the right. What is so difficult about that?

We can initialize like this, with the i pointer at the start of the main string and the j pointer at the start of the pattern string.

After that, we only need to compare the character that the i pointer points to with the character that the j pointer points to. If they are the same, both pointers move forward one position.

If they differ (in the original example, an A in the main string meets an E in the pattern string), move the i pointer back to position 1 (subscripts start from 0), that is, one past where the current attempt began, move j back to position 0 of the pattern string, and start comparing again.

Based on this idea, we can get the following program:

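/**
 * Brute-force search.
 *
 * @param ts the main string
 * @param ps the pattern string
 * @return the index in the main string where the first match begins, or -1 if there is none
 */
public static int bf(String ts, String ps) {
    char[] t = ts.toCharArray();
    char[] p = ps.toCharArray();
    int i = 0; // position in the main string
    int j = 0; // position in the pattern string
    while (i < t.length && j < p.length) {
        if (t[i] == p[j]) { // the two characters are equal, so compare the next pair
            i++;
            j++;
        } else {
            i = i - j + 1; // on a mismatch, i backtracks to one past the start of this attempt
            j = 0; // j returns to 0
        }
    }
    if (j == p.length) {
        return i - j;
    } else {
        return -1;
    }
}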

The program above works, but it is not good enough!

If a person did this search by hand, they would certainly not move i back to position 1, because the main string contains no A before the mismatch position other than the first A. How do we know there is only one A there? Because we already know that the first three characters matched! (This is very important.) Shifting the pattern to any of those positions is certain to fail. This suggests an idea: leave i where it is and move only j.

The situation above is still fairly benign: at worst we redo a few comparisons. But if you search for "SSSSB" in the main string "SSSSSSSSSSSSSA", each attempt discovers the mismatch only at the last character of the pattern, and then i backtracks. This is clearly the least efficient case.
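To make the cost concrete, here is a small Java sketch that adds a comparison counter to the brute-force search. The class name BruteForceCost and the comparisons field are assumptions added for illustration; the search logic itself follows the bf method above.

```java
// Illustrative sketch: the brute-force search with a comparison counter.
// BruteForceCost and comparisons are hypothetical names, not part of the original code.
final class BruteForceCost {
    static long comparisons; // character comparisons made by the most recent call to bf

    static int bf(String ts, String ps) {
        char[] t = ts.toCharArray();
        char[] p = ps.toCharArray();
        comparisons = 0;
        int i = 0; // position in the main string
        int j = 0; // position in the pattern string
        while (i < t.length && j < p.length) {
            comparisons++;
            if (t[i] == p[j]) {
                i++;
                j++;
            } else {
                i = i - j + 1; // backtrack: characters already read will be read again
                j = 0;
            }
        }
        return j == p.length ? i - j : -1;
    }

    public static void main(String[] args) {
        String text = "SSSSSSSSSSSSSA";
        int pos = bf(text, "SSSSB");
        System.out.println("result = " + pos);
        System.out.println("text length = " + text.length()
                + ", comparisons = " + comparisons);
    }
}
```

On this input the counter ends up several times larger than the length of the main string, because every backtrack re-reads characters that had already been matched.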

The masters could not stand the inefficiency of the brute-force method, so the three of them developed the KMP algorithm. The idea is the one we just saw: use the information already gained from the partial match, keep the i pointer from backtracking, and adjust only the j pointer so that the pattern string moves as far as it validly can.

So the crux of KMP is this: when a character fails to match the main string, where should the j pointer move?

Next, let's work out the rule for moving j ourselves.

In the first example, C and D do not match. Where should j move? Obviously to position 1. Why? Because the A in front of it is the same.

The second example is the same kind of situation. There the j pointer can move to position 2, because the two letters in front of it are the same.

By now a pattern emerges. When a match fails, the position k that j moves to has this property: the first k characters of the pattern are the same as the last k characters before j.

Expressed as a formula:

P[0 ~ k-1] == P[j-k ~ j-1]

This is very important. If it is hard to remember, picture it this way: the block of k characters at the start of P is identical to the block of k characters that ends just before position j.

Once this is understood, it becomes clear why j can move directly to position k.

Because:

When T[i] != P[j],
we have T[i-j ~ i-1] == P[0 ~ j-1], and in particular T[i-k ~ i-1] == P[j-k ~ j-1].
From P[0 ~ k-1] == P[j-k ~ j-1]
it follows that T[i-k ~ i-1] == P[0 ~ k-1].

The formulas are dry. Read them until they make sense; there is no need to memorize them.

This passage only proves why we can move j directly to k without comparing the preceding k characters again.

Now comes the key question: how do we find this k (or rather, these k values)? A mismatch can occur at any position of P, so we need the k that corresponds to every position j. We store them in an array called next, where next[j] = k means: when T[i] != P[j], the j pointer moves to position k.
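One way to pin down what next[j] means is to compute it straight from the definition, trying every candidate k from the largest down. The Java sketch below does that. The class and method names are hypothetical, and this slow, roughly cubic-time approach is only meant to illustrate the definition; it is not how KMP builds the array.

```java
import java.util.Arrays;

// Illustrative sketch: next[] computed directly from its definition.
// NaiveNext, naiveNext and blocksEqual are hypothetical names for this example.
final class NaiveNext {
    // next[j] is the largest k < j with P[0 ~ k-1] == P[j-k ~ j-1];
    // next[0] is -1 by the convention used in this article.
    static int[] naiveNext(String ps) {
        char[] p = ps.toCharArray();
        int[] next = new int[p.length];
        for (int j = 0; j < p.length; j++) {
            if (j == 0) {
                next[j] = -1;
                continue;
            }
            int best = 0; // k = 0 always satisfies the property (empty blocks)
            for (int k = j - 1; k >= 1; k--) {
                if (blocksEqual(p, 0, j - k, k)) {
                    best = k;
                    break; // first hit is the largest k
                }
            }
            next[j] = best;
        }
        return next;
    }

    // True if p[a ~ a+len-1] equals p[b ~ b+len-1].
    static boolean blocksEqual(char[] p, int a, int b, int len) {
        for (int x = 0; x < len; x++) {
            if (p[a + x] != p[b + x]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(naiveNext("ABAB"))); // [-1, 0, 0, 1]
    }
}
```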

Many textbooks and blog posts are vague at this point. They mention it in passing, or simply paste a piece of code. Why is it computed this way? How does one arrive at it? None of that is made clear, even though this is precisely the most critical part of the whole algorithm.

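public static int[] getNext(String ps) {
    char[] p = ps.toCharArray();
    int[] next = new int[p.length];
    if (p.length == 0) {
        return next; // empty pattern: avoids writing next[0] out of bounds
    }
    next[0] = -1; // a mismatch at position 0 means i must move forward
    int j = 0;  // next[j + 1] is the entry being computed
    int k = -1; // candidate value, always equal to next[j] or a fallback of it
    while (j < p.length - 1) {
        if (k == -1 || p[j] == p[k]) {
            next[++j] = ++k; // the matching block grows by one character
        } else {
            k = next[k]; // fall back to a shorter matching block
        }
    }
    return next;
}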

This is probably the most widely circulated way of computing the next array, and the code is very concise. But it is genuinely confusing. What is this calculation based on?

Let's set the code aside for now and derive the idea ourselves. Always keep in mind what the value of next[j] (that is, k) means: when P[j] != T[i], it is the position the j pointer moves to next.

Let's look at the first case: what if the mismatch happens when j is 0?

Here j is already at the far left and cannot move any further, so it is the i pointer that should move forward. That is why the code contains the initialization next[0] = -1;.

What if the mismatch happens when j is 1?

Obviously the j pointer must move back to position 0, because that is the only position in front of it.

The next case is the most important. Compare carefully what happens at position j and at position j+1.

We find a rule:

When P[k] == P[j],
next[j+1] == next[j] + 1

In fact, this can be proved:

Before P[j] we already have P[0 ~ k-1] == P[j-k ~ j-1] (that is, next[j] == k).
If in addition P[k] == P[j], then P[0 ~ k-1] + P[k] == P[j-k ~ j-1] + P[j].
That is, P[0 ~ k] == P[j-k ~ j], so next[j+1] == k + 1 == next[j] + 1.
(It cannot be larger: a longer matching block ending at j would contain a matching block longer than k ending at j-1, contradicting next[j] == k.)

The formulas here are not easy to digest, but they become clearer once you draw the two blocks side by side.
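The rule can also be checked mechanically. The sketch below, in Java, computes next[j] straight from its definition (so the check is not circular) and tests the rule at every position of a pattern. The class name, the method names and the sample patterns are assumptions made for this example.

```java
// Illustrative sketch: checks "P[k] == P[j] implies next[j+1] == next[j] + 1".
// NextRecurrenceCheck, nextByDefinition and ruleHolds are hypothetical names.
final class NextRecurrenceCheck {
    // next[j] by definition: the largest k < j with P[0 ~ k-1] == P[j-k ~ j-1],
    // and -1 when j is 0.
    static int nextByDefinition(String p, int j) {
        if (j == 0) {
            return -1;
        }
        for (int k = j - 1; k > 0; k--) {
            // regionMatches compares p[0 ~ k-1] with p[j-k ~ j-1]
            if (p.regionMatches(0, p, j - k, k)) {
                return k;
            }
        }
        return 0;
    }

    // True if the rule holds at every position j of p where it applies.
    static boolean ruleHolds(String p) {
        for (int j = 1; j + 1 < p.length(); j++) {
            int k = nextByDefinition(p, j);
            if (p.charAt(k) == p.charAt(j)
                    && nextByDefinition(p, j + 1) != k + 1) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String p : new String[] {"ABAB", "ABACDABABC", "AAAAB"}) {
            System.out.println(p + " -> " + ruleHolds(p));
        }
    }
}
```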

What if P[k] != P[j]?

In the code, this case is handled by the statement k = next[k];. Why?

Consider an example in which the block before k, together with P[k], reads [A, B, A, C], while the characters ending at P[j] read [A, B, A, B]. The longest suffix, [A, B, A, B], can no longer match a prefix, but a shorter suffix such as [A, B] or [B] still might. So this step is itself like searching for the string [A, B, A, C]: when the C fails to match (that is, the character at position k differs), the pointer naturally moves to next[k].
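The fallback can be traced in code. The Java sketch below assumes a sample pattern, ABACDABABC, chosen here because it contains the blocks [A, B, A, C] and [A, B, A, B] discussed above; it may not be the exact pattern from the original figure. The class name and the kTried helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: records the values of k tried while computing next[j + 1].
// FallbackTrace and kTried are hypothetical names for this example.
final class FallbackTrace {
    // Same logic as the getNext method in this article.
    static int[] getNext(String ps) {
        char[] p = ps.toCharArray();
        int[] next = new int[p.length];
        if (p.length == 0) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < p.length - 1) {
            if (k == -1 || p[j] == p[k]) {
                next[++j] = ++k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Starts at k = next[j] and follows k = next[k] until P[k] == P[j] or k == -1.
    // The list holds every k visited, in order.
    static List<Integer> kTried(String ps, int j) {
        int[] next = getNext(ps);
        List<Integer> tried = new ArrayList<>();
        int k = next[j];
        tried.add(k);
        while (k != -1 && ps.charAt(k) != ps.charAt(j)) {
            k = next[k];
            tried.add(k);
        }
        return tried;
    }

    public static void main(String[] args) {
        String p = "ABACDABABC"; // assumed sample pattern
        // At j = 8 (a B) the first candidate is k = 3 (a C), which fails,
        // so k falls back to next[3] = 1 (a B), which succeeds.
        System.out.println(kTried(p, 8));  // [3, 1]
        System.out.println(getNext(p)[9]); // 2
    }
}
```

The last value in the list is the k that finally matches, so next[j + 1] is that value plus one.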

With the next array in hand, everything else is easy. We can now write the KMP algorithm:

public static int KMP(String ts, String ps) {
    char[] t = ts.toCharArray();
    char[] p = ps.toCharArray();
    int i = 0; // position in the main string
    int j = 0; // position in the pattern string
    int[] next = getNext(ps);
    while (i < t.length && j < p.length) {
        if (j == -1 || t[i] == p[j]) { // when j is -1 it is i that must move, and j returns to 0
            i++;
            j++;
        } else {
            // i no longer needs to backtrack
            // i = i - j + 1;
            j = next[j]; // j jumps back to the precomputed position
        }
    }
    if (j == p.length) {
        return i - j;
    } else {
        return -1;
    }
}

Compared with the brute-force version, four places have changed. The key one is that i no longer backtracks.
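As a quick sanity check, the two methods can be put together in one small Java program and compared against String.indexOf from the standard library. The class name KmpDemo, the lower-case method name kmp and the test strings are assumptions made for this sketch; the logic mirrors the getNext and KMP methods above.

```java
// Illustrative sketch: getNext and KMP combined, checked against String.indexOf.
// KmpDemo and the sample inputs are hypothetical.
final class KmpDemo {
    static int[] getNext(String ps) {
        char[] p = ps.toCharArray();
        int[] next = new int[p.length];
        if (p.length == 0) {
            return next; // nothing to compute for an empty pattern
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < p.length - 1) {
            if (k == -1 || p[j] == p[k]) {
                next[++j] = ++k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    static int kmp(String ts, String ps) {
        char[] t = ts.toCharArray();
        char[] p = ps.toCharArray();
        int[] next = getNext(ps);
        int i = 0; // position in the main string, never decreases
        int j = 0; // position in the pattern string
        while (i < t.length && j < p.length) {
            if (j == -1 || t[i] == p[j]) {
                i++;
                j++;
            } else {
                j = next[j];
            }
        }
        return j == p.length ? i - j : -1;
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"ABCABCE", "ABCE"},
            {"SSSSSSSSSSSSSA", "SSSSB"},
            {"ABACABAB", "ABAB"},
        };
        for (String[] c : cases) {
            int got = kmp(c[0], c[1]);
            int expected = c[0].indexOf(c[1]);
            System.out.println(c[1] + " in " + c[0] + ": " + got
                    + (got == expected ? " (agrees with indexOf)" : " (differs from indexOf)"));
        }
    }
}
```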

Finally, let's look at a flaw in the above algorithm. Consider this example:

For the pattern in this example, the next array that the above algorithm produces is [-1, 0, 0, 1].

So the next step is to move j to position 1.

It is not hard to see that this step is completely pointless: since the later B did not match, the earlier B cannot match either. The same thing happens with the second A.

Clearly, the problem arises whenever P[j] == P[next[j]].

So we only need to add one more check:

public static int[] getNext(String ps) {
    char[] p = ps.toCharArray();
    int[] next = new int[p.length];
    if (p.length == 0) {
        return next; // empty pattern: avoids writing next[0] out of bounds
    }
    next[0] = -1;
    int j = 0;
    int k = -1;
    while (j < p.length - 1) {
        if (k == -1 || p[j] == p[k]) {
            if (p[++j] == p[++k]) { // the two characters are equal, so skip this position
                next[j] = next[k];
            } else {
                next[j] = k;
            }
        } else {
            k = next[k];
        }
    }
    return next;
}
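To see the effect of the extra check, the two versions can be run side by side. The Java sketch below assumes the pattern ABAB, which is consistent with the array [-1, 0, 0, 1] quoted above; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch: the basic and the improved next arrays for one pattern.
// NextComparison, basicNext and improvedNext are hypothetical names.
final class NextComparison {
    static int[] basicNext(String ps) {
        char[] p = ps.toCharArray();
        int[] next = new int[p.length];
        if (p.length == 0) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < p.length - 1) {
            if (k == -1 || p[j] == p[k]) {
                next[++j] = ++k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    static int[] improvedNext(String ps) {
        char[] p = ps.toCharArray();
        int[] next = new int[p.length];
        if (p.length == 0) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < p.length - 1) {
            if (k == -1 || p[j] == p[k]) {
                if (p[++j] == p[++k]) {
                    next[j] = next[k]; // same character would fail again, so skip it
                } else {
                    next[j] = k;
                }
            } else {
                k = next[k];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        String p = "ABAB"; // assumed example pattern
        System.out.println(Arrays.toString(basicNext(p)));    // [-1, 0, 0, 1]
        System.out.println(Arrays.toString(improvedNext(p))); // [-1, 0, -1, 0]
    }
}
```

In the improved array, a mismatch on the second B sends j straight to 0 instead of to the first B, and a mismatch on the second A gives -1, so i moves forward at once.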


Origin blog.csdn.net/weixin_52622200/article/details/110563434