Explain the KMP algorithm in detail

The KMP algorithm appears in practically every data structures textbook and is one of the best-known algorithms, but unfortunately I didn't understand it at all in my sophomore year.

Since then I have come across many articles explaining KMP. After reading them I more or less knew what was going on, but I always felt there were parts I had not fully understood. I spent the past two days working through it and finally have some insights. I hope that explaining the details of the algorithm in my own words will also serve as a test of whether I really understand it.

 

What is the KMP algorithm:

KMP was discovered by three masters, D. E. Knuth, J. H. Morris, and V. R. Pratt, and is named after them. The first of these is the author of The Art of Computer Programming!

The problem KMP solves is locating a pattern string inside a longer string (also called the main string). Put simply, it is what we usually call keyword search. The pattern string is the keyword (call it P from here on); if it occurs in the main string (call it T from here on), we return the position where it occurs, otherwise we return -1 (the usual convention).

 

There is a very simple approach to this problem: compare character by character from left to right, and if a character fails to match, go back and shift the pattern string one position to the right. What's so hard about that?

We start with pointer i at the beginning of the main string T and pointer j at the beginning of the pattern string P.

 

After that, we just compare the character that i points to with the character that j points to. If they are the same, both pointers move forward one position. If they differ, we have a mismatch:

 

 

Suppose the mismatched characters are A and E. We then move i back to position 1 (subscripts start at 0), reset j to position 0 of the pattern string, and start the comparison again:

 

Based on this idea we can get the following program:

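/**
 * Brute-force search.
 * @param ts the main string
 * @param ps the pattern string
 * @return the index in the main string where the first match begins, or -1 if there is no match
 */
public static int bf(String ts, String ps) {
    char[] t = ts.toCharArray();
    char[] p = ps.toCharArray();
    int i = 0; // position in the main string
    int j = 0; // position in the pattern string
    while (i < t.length && j < p.length) {
        if (t[i] == p[j]) { // characters match, so compare the next pair
            i++;
            j++;
        } else {
            i = i - j + 1; // on a mismatch, i backs up to one past where this attempt started
            j = 0; // j returns to 0
        }
    }
    if (j == p.length) {
        return i - j;
    } else {
        return -1;
    }
}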

The program above works, but it is not good enough! (It reminds me of something my high school math teacher used to say: I can't say you're wrong, I can only say you're not right.)

If a person did this search by hand, they would certainly not move i back to position 1, because apart from the first A there is no A in the main string before the position where the match failed. How do we know there is only one A in that part of the main string? Because we already know the first three characters matched! (This is important.) Shifting the pattern over to any of those positions would certainly fail to match too. So here is an idea: leave i where it is and move only j.

 

The situation above is still close to ideal; at worst we compare once more. But if you search for "SSSSB" in the main string "SSSSSSSSSSSSSA", each attempt only discovers the mismatch at the last character of the pattern, and then i backtracks. That is obviously very inefficient.
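To put a number on that cost, here is a minimal sketch that runs the same loop as bf but counts character comparisons. The class name BruteForceCost and the method countComparisons are my own names for illustration, not part of the original program.

```java
class BruteForceCost {
    // Same loop as bf, but returns how many character comparisons were made
    // instead of the match position.
    static int countComparisons(String ts, String ps) {
        char[] t = ts.toCharArray();
        char[] p = ps.toCharArray();
        int i = 0;
        int j = 0;
        int comparisons = 0;
        while (i < t.length && j < p.length) {
            comparisons++;
            if (t[i] == p[j]) {
                i++;
                j++;
            } else {
                i = i - j + 1; // i backtracks
                j = 0;
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // The worst-case strings from the paragraph above.
        System.out.println(countComparisons("SSSSSSSSSSSSSA", "SSSSB"));
    }
}
```

With a main string of length n and a pattern of length m built this way, almost every starting position costs m comparisons before the mismatch shows up, so the total grows roughly with n times m.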

 

The experts couldn't stand the inefficiency of brute force, so the three of them developed the KMP algorithm. The idea is the one we just saw: "Use the information from the partial match so that the i pointer never backtracks, and adjust the j pointer so that the pattern string moves to a useful position in as few steps as possible."

 

So the whole point of KMP is this: when a character fails to match the main string, where should the j pointer move to?

 

Next, let's work out the rule for moving j ourselves:

 

Suppose C and D do not match. Where should j move? Obviously to position 1. Why? Because the character just before the mismatch is an A, the same as the A at the start of the pattern:

 

The same reasoning applies in the next case:

 

Here j can move to position 2, because the two letters just before the mismatch are the same as the first two letters of the pattern:

 

At this point a pattern emerges: when a match fails at j, the next position k that j should move to has the property that the first k characters of the pattern are the same as the k characters just before j.

Expressed as a formula:

P[0 ~ k-1] == P[j-k ~ j-1]

This is very important. If you feel that it is difficult to remember, you can understand it through the following figure:

 

After understanding this, it should be possible to understand why j can be moved directly to the k position.

Because:

When T[i] != P[j],

we have T[i-j ~ i-1] == P[0 ~ j-1], and in particular T[i-k ~ i-1] == P[j-k ~ j-1].

From P[0 ~ k-1] == P[j-k ~ j-1]

it follows that T[i-k ~ i-1] == P[0 ~ k-1].

The formulas are dry; it is enough to understand them, there is no need to memorize them.

This passage only demonstrates why we can move j straight to k without re-comparing the first k characters.
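The property P[0 ~ k-1] == P[j-k ~ j-1] can also be checked mechanically. The sketch below is deliberately naive: for a given j it tries every candidate k from largest to smallest and returns the first one that satisfies the property. The class name, the method name longestBorder, and the pattern "ABCABD" are assumptions made for illustration only.

```java
class PrefixSuffixCheck {
    // Largest k < j such that P[0 ~ k-1] equals P[j-k ~ j-1]; 0 if there is none.
    static int longestBorder(String p, int j) {
        for (int k = j - 1; k > 0; k--) {
            if (p.substring(0, k).equals(p.substring(j - k, j))) {
                return k;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        String p = "ABCABD"; // hypothetical pattern
        for (int j = 1; j < p.length(); j++) {
            System.out.println("j=" + j + " -> k=" + longestBorder(p, j));
        }
    }
}
```

For this pattern a mismatch at j = 5 (the D) gives k = 2, because the two characters before D are "AB", the same as the first two characters of the pattern. This brute-force check is far too slow to use inside the search itself; it only illustrates what k means.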

 

OK, now for the key point: how do we find this k (or rather, these k values)? A mismatch can occur at any position of P, so we need to compute the k for every position j. We store them in an array called next, where next[j] = k means that when T[i] != P[j], the j pointer moves to position k next.

 

Many textbooks and blog posts are vague at this point, mention it only in passing, or just paste a piece of code. Why is it computed this way? How can it be computed this way? None of that is made clear, and yet this is precisely the most critical part of the whole algorithm.

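public static int[] getNext(String ps) {
    char[] p = ps.toCharArray();
    int[] next = new int[p.length];
    if (p.length == 0) {
        return next; // empty pattern: nothing to fill in, and next[0] would be out of range
    }
    next[0] = -1;
    int j = 0;  // position whose successor next[j + 1] is being computed
    int k = -1; // current candidate value, i.e. next[j]
    while (j < p.length - 1) {
        if (k == -1 || p[j] == p[k]) {
            next[++j] = ++k;
        } else {
            k = next[k];
        }
    }
    return next;
}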

This version of the code for computing the next array is probably the most widely circulated, and it is very concise. But it is really baffling. What is the calculation based on?

Let's set the code aside for now and derive the idea ourselves. Always keep in mind that the value of next[j] (that is, k) means: when P[j] != T[i], this is the position the j pointer moves to next.

Start with the first case: what if the mismatch happens when j is 0?

 

Here j is already at the far left and cannot move back any further, so it is the i pointer that should advance. That is why the code initializes next[0] = -1;.

What about when j is 1?

 

Obviously, the j pointer must move back to position 0, because that is the only position in front of it.

 

The following is the most important, please see the picture below:

  

 

Please compare the two pictures carefully.

We found a pattern:

When P[k] == P[j],

we have next[j+1] == next[j] + 1.

In fact, this can be proved:

Before P[j] we already have P[0 ~ k-1] == P[j-k ~ j-1] (that is, next[j] == k).

Now that P[k] == P[j] as well, we get P[0 ~ k-1] + P[k] == P[j-k ~ j-1] + P[j].

That is: P[0 ~ k] == P[j-k ~ j], so next[j+1] == k + 1 == next[j] + 1.

The formulas here are not very easy to follow, but they become clearer if you look at the picture.

 

What if P[k] != P[j]? For example, as shown in the following figure:

 

In this case the code executes the statement k = next[k];. Why? Look below.

 

Now you should see why k = next[k]! In the example above, we can no longer find the longest suffix [A, B, A, B], but we may still find a shorter string such as [A, B] or [B]. So doesn't this process look just like matching the string [A, B, A, C] against a main string? When C fails to match the main string (that is, the mismatch is at position k), of course the pointer moves to next[k].
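To watch both branches at work, here is a sketch of the same getNext logic with a print statement on every step. The class name NextTrace, the method name getNextTraced, the empty-pattern guard, and the pattern "ABACABAB" are my own additions for illustration.

```java
class NextTrace {
    // Same logic as getNext, printing each step. The guard for an empty
    // pattern is an addition so that next[0] is never written out of range.
    static int[] getNextTraced(String ps) {
        char[] p = ps.toCharArray();
        int[] next = new int[p.length];
        if (p.length == 0) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < p.length - 1) {
            if (k == -1 || p[j] == p[k]) {
                next[++j] = ++k;
                System.out.println("extend:    next[" + j + "] = " + k);
            } else {
                System.out.println("fall back: k " + k + " -> " + next[k]);
                k = next[k];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Hypothetical pattern chosen so that both branches are taken.
        int[] next = getNextTraced("ABACABAB");
        System.out.println(java.util.Arrays.toString(next));
    }
}
```

For this pattern the final array is [-1, 0, 0, 1, 0, 1, 2, 3]. The fall-back lines appear where P[k] != P[j], for example at the C, where k drops from 1 to 0 and then to -1 before next[4] is set to 0.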

 

Once we have the next array, everything else is easy, and we can write out the KMP algorithm:

public static int KMP(String ts, String ps) {
    char[] t = ts.toCharArray();
    char[] p = ps.toCharArray();
    int i = 0; // position in the main string
    int j = 0; // position in the pattern string
    int[] next = getNext(ps);
    while (i < t.length && j < p.length) {
        if (j == -1 || t[i] == p[j]) { // when j is -1, it is i that must advance, and j returns to 0
            i++;
            j++;
        } else {
            // i no longer needs to backtrack
            // i = i - j + 1;
            j = next[j]; // j moves back to the position given by next
        }
    }
    if (j == p.length) {
        return i - j;
    } else {
        return -1;
    }
}
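As a quick check that i really never backtracks, the sketch below repeats the search loop with a comparison counter and compares the result with String.indexOf. The class name KmpCost and the method searchAndCount are hypothetical names of mine; the getNext copy inside it includes an empty-pattern guard so the class runs on its own.

```java
class KmpCost {
    // Copy of getNext so that this class is self-contained.
    static int[] getNext(String ps) {
        char[] p = ps.toCharArray();
        int[] next = new int[p.length];
        if (p.length == 0) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < p.length - 1) {
            if (k == -1 || p[j] == p[k]) {
                next[++j] = ++k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Returns {match index or -1, number of character comparisons}.
    static int[] searchAndCount(String ts, String ps) {
        char[] t = ts.toCharArray();
        char[] p = ps.toCharArray();
        int[] next = getNext(ps);
        int i = 0;
        int j = 0;
        int comparisons = 0;
        while (i < t.length && j < p.length) {
            if (j == -1) { // nothing to compare: advance i, reset j to 0
                i++;
                j++;
                continue;
            }
            comparisons++;
            if (t[i] == p[j]) {
                i++;
                j++;
            } else {
                j = next[j]; // i stays where it is
            }
        }
        int index = (j == p.length) ? i - j : -1;
        return new int[] { index, comparisons };
    }

    public static void main(String[] args) {
        String text = "SSSSSSSSSSSSSA";
        String pattern = "SSSSB";
        int[] r = searchAndCount(text, pattern);
        System.out.println("index=" + r[0] + ", comparisons=" + r[1]);
        System.out.println("indexOf agrees: " + (r[0] == text.indexOf(pattern)));
    }
}
```

Because i only ever moves forward, the number of comparisons stays within about twice the length of the main string, whereas the brute-force loop on the same input grows with the product of the two lengths.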

Compared with the brute-force version, 4 places have changed. The main one is that i no longer backtracks.

 

Finally, let's look at a flaw in the algorithm above. Go back to the first example:

 

Clearly, the next array that our algorithm produces here is [-1, 0, 0, 1].

So the next step is to move j to position 1:

 

It is easy to see that this step is pointless. The later B has already failed to match, so the earlier B is bound to fail as well. The same thing happens with the A at position 2.

Clearly, the problem arises because P[j] == P[next[j]].

So we only need to add a judgment condition:

public static int[] getNext(String ps) {
    char[] p = ps.toCharArray();
    int[] next = new int[p.length];
    if (p.length == 0) {
        return next; // empty pattern: nothing to fill in, and next[0] would be out of range
    }
    next[0] = -1;
    int j = 0;
    int k = -1;
    while (j < p.length - 1) {
        if (k == -1 || p[j] == p[k]) {
            if (p[++j] == p[++k]) { // when the two characters are equal, skip ahead
                next[j] = next[k];
            } else {
                next[j] = k;
            }
        } else {
            k = next[k];
        }
    }
    return next;
}
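To see what the extra condition changes, the sketch below puts the two versions side by side. It assumes the pattern in the example is "ABAB", which is consistent with the next array [-1, 0, 0, 1] quoted above; the class and method names are mine.

```java
class NextCompare {
    // The first version of getNext (with an added empty-pattern guard).
    static int[] plainNext(String ps) {
        char[] p = ps.toCharArray();
        int[] next = new int[p.length];
        if (p.length == 0) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < p.length - 1) {
            if (k == -1 || p[j] == p[k]) {
                next[++j] = ++k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // The improved version: skips positions where P[j] == P[next[j]].
    static int[] improvedNext(String ps) {
        char[] p = ps.toCharArray();
        int[] next = new int[p.length];
        if (p.length == 0) {
            return next;
        }
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < p.length - 1) {
            if (k == -1 || p[j] == p[k]) {
                if (p[++j] == p[++k]) {
                    next[j] = next[k];
                } else {
                    next[j] = k;
                }
            } else {
                k = next[k];
            }
        }
        return next;
    }

    public static void main(String[] args) {
        String p = "ABAB"; // assumed pattern
        System.out.println(java.util.Arrays.toString(plainNext(p)));    // [-1, 0, 0, 1]
        System.out.println(java.util.Arrays.toString(improvedNext(p))); // [-1, 0, -1, 0]
    }
}
```

In the improved array, a mismatch at the second B sends j straight to 0 instead of to the first B, and a mismatch at the second A sends j to -1, so the comparisons that were bound to fail are skipped.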

All right, that's it. The KMP algorithm is done.

It's strange that something that doesn't seem so hard stumped me for so long.

Thinking about it carefully, I was simply too impatient. I used to skim through it carelessly, never worked out many of the details, and assumed I understood it, so my understanding could only ever be half-baked. To really learn something, you need to calm down and take your time.
