A detailed explanation of the KMP algorithm: from brute-force search to KMP to optimized KMP

Given a main string (denoted S) and a pattern string (denoted P), find the position of P in S. This is the string pattern matching problem.

The Knuth-Morris-Pratt algorithm (KMP for short) is one of the commonly used algorithms for solving this problem. It was conceived by Donald Ervin Knuth and Vaughan Pratt in 1974; in the same year James H. Morris independently designed the same algorithm, and the three published it jointly in 1977.

Before moving on, two concepts need to be introduced: the proper prefix and the proper suffix.

A "proper prefix" of a string is any leading part of the string other than the whole string itself; a "proper suffix" is any trailing part other than the whole string itself. For example, the proper prefixes of ABCD are A, AB and ABC, and its proper suffixes are D, CD and BCD. (Nearly all blogs on the Internet simply say "prefix". Strictly speaking, "proper prefix" and "prefix" are different, so it is better not to mix them up!)
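The two definitions can be sketched in a few lines of C++. This is only an illustration: ProperPrefixes and ProperSuffixes are hypothetical helper names that do not appear in the rest of the article, and the sketch lists only the non-empty proper prefixes and suffixes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Non-empty proper prefixes of s: every leading part except s itself.
// (Hypothetical helper, for illustration only.)
std::vector<std::string> ProperPrefixes(const std::string& s)
{
    std::vector<std::string> result;
    for (std::size_t len = 1; len < s.size(); ++len)
        result.push_back(s.substr(0, len));
    return result;
}

// Non-empty proper suffixes of s: every trailing part except s itself.
std::vector<std::string> ProperSuffixes(const std::string& s)
{
    std::vector<std::string> result;
    for (std::size_t len = 1; len < s.size(); ++len)
        result.push_back(s.substr(s.size() - len));
    return result;
}
```

For ABCD, ProperPrefixes gives A, AB, ABC and ProperSuffixes gives D, CD, BCD; a one-character string has neither.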

Naive string matching algorithm
When we first encounter the string pattern matching problem, the first thing that comes to mind is naive string matching (so-called brute-force matching). The code is as follows:

/* String subscripts start at 0 */
int NaiveStringSearch(string S, string P)
{
    int i = 0;    // subscript into S
    int j = 0;    // subscript into P
    int s_len = S.size();
    int p_len = P.size();

    while (i < s_len && j < p_len)
    {
        if (S[i] == P[j])  // equal: advance both
        {
            i++;
            j++;
        }
        else               // not equal: restart one position after the previous start
        {
            i = i - j + 1;
            j = 0;
        }
    }

    if (j == p_len)        // match found
        return i - j;

    return -1;
}

The time complexity of brute-force matching is O(nm), where n is the length of S and m is the length of P. Obviously, this time complexity can hardly meet our needs.
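To get a feel for the O(nm) bound, the following sketch repeats the brute-force loop but counts character comparisons instead of returning a position. CountNaiveComparisons is a hypothetical helper written for this illustration; it is not part of the original code.

```cpp
#include <cassert>
#include <string>

// Same loop as the brute-force search, but returns the number of
// S[i] == P[j] comparisons performed. (Hypothetical helper.)
long long CountNaiveComparisons(const std::string& S, const std::string& P)
{
    int i = 0;
    int j = 0;
    const int s_len = static_cast<int>(S.size());
    const int p_len = static_cast<int>(P.size());
    long long comparisons = 0;

    while (i < s_len && j < p_len)
    {
        ++comparisons;
        if (S[i] == P[j])
        {
            i++;
            j++;
        }
        else
        {
            i = i - j + 1;  // go back to one past the previous start
            j = 0;
        }
    }
    return comparisons;
}
```

With S made of ten a characters and P = "aaab", almost every starting position costs four comparisons before failing, for 31 comparisons in total on a 10-character text. Inputs of this shape are what drive the running time towards n times m.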

Now to the main topic: the KMP algorithm, whose time complexity is O(n+m).

KMP string matching algorithm
Algorithm flow
1) First, compare the first character of the main string "BBC ABCDAB ABCDABCDABDE" with the first character of the pattern string "ABCDABD". Because B does not match A, the pattern string is shifted one position to the right.
2) Because B and A do not match again, the pattern string moves right once more.
3) This continues until a character in the main string is the same as the first character of the pattern string.
4) Then compare the next character of the main string with the next character of the pattern string; they are still the same.
5) This continues until a character in the main string differs from the corresponding character of the pattern string.
6) At this point, the most natural reaction is to shift the whole pattern string right by one position and compare again from its first character. This works, but it is very inefficient, because the "search position" moves back to characters that have already been compared and compares them again.
7) A basic fact is that when the space fails to match D, you already know that the six characters before it are "ABCDAB". The idea of the KMP algorithm is to use this known information: instead of moving the "search position" back over characters that have already been compared, keep moving it forward, which improves efficiency.
8) How is this done? Set up a jump array, int next[], for the pattern string. How this array is computed is introduced later; for now it is enough to know how to use it.
9) When the space fails to match D, the six characters "ABCDAB" before it have been matched. According to the jump array, the next value at the mismatched D is 2, so matching resumes at subscript 2 of the pattern string.
10) Because the space does not match C, and the next value at C is 0, matching resumes at subscript 0 of the pattern string.
11) Because the space does not match A, and the next value here is -1, even the first character of the pattern string fails to match, so the search moves forward one position in the main string.
12) Compare character by character until C and D fail to match. The next value at D is 2, so matching resumes at subscript 2 of the pattern string.
13) Compare character by character up to the last character of the pattern string: a complete match is found and the search is finished.
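The walkthrough above can be replayed in code. The sketch below runs the matching loop with a jump table supplied by the caller and records every value that j jumps to on a mismatch. TraceKmp and its jumps parameter are hypothetical names introduced for this illustration, and the table values passed in are the ones quoted in the steps above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Runs the KMP matching loop with a caller-supplied jump table and
// appends each jump target to `jumps`. Returns the match position or -1.
// (Hypothetical helper, for illustration only.)
int TraceKmp(const std::string& S, const std::string& P,
             const std::vector<int>& next, std::vector<int>& jumps)
{
    assert(next.size() >= P.size());  // one entry per pattern position

    int i = 0;  // subscript into S, never moves backwards
    int j = 0;  // subscript into P
    const int s_len = static_cast<int>(S.size());
    const int p_len = static_cast<int>(P.size());

    while (i < s_len && j < p_len)
    {
        if (j == -1 || S[i] == P[j])
        {
            i++;
            j++;
        }
        else
        {
            j = next[j];         // mismatch: jump within the pattern
            jumps.push_back(j);  // remember where we jumped to
        }
    }
    return (j == p_len) ? i - j : -1;
}
```

Called with the main string "BBC ABCDAB ABCDABCDABDE", the pattern "ABCDABD" and the table {-1, 0, 0, 0, 0, 1, 2}, it returns 15 and records the jumps -1, -1, -1, -1, 2, 0, -1, 2. The last four correspond to steps 9 to 12.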

How to obtain the next array
The next array is built from the "proper prefixes" and "proper suffixes": next[i] equals the length of the longest proper prefix of P[0]...P[i - 1] that is also a proper suffix of it (the case i = 0 is set aside for now and explained below). We again use the pattern string ABCDABD from the example above; its next values are:

i          0   1   2   3   4   5   6   7
P[i]       A   B   C   D   A   B   D  \0
next[i]   -1   0   0   0   0   1   2   0

  • i = 0: for the first character of the pattern string, we uniformly set next[0] = -1;
  • i = 1: the preceding string is A, its longest common proper prefix and suffix has length 0, so next[1] = 0;
  • i = 2: the preceding string is AB, its longest common proper prefix and suffix has length 0, so next[2] = 0;
  • i = 3: the preceding string is ABC, its longest common proper prefix and suffix has length 0, so next[3] = 0;
  • i = 4: the preceding string is ABCD, its longest common proper prefix and suffix has length 0, so next[4] = 0;
  • i = 5: the preceding string is ABCDA, its longest common proper prefix and suffix is A, so next[5] = 1;
  • i = 6: the preceding string is ABCDAB, its longest common proper prefix and suffix is AB, so next[6] = 2;
  • i = 7: the preceding string is ABCDABD, its longest common proper prefix and suffix has length 0, so next[7] = 0.

So why can the jump on a mismatch be derived from the length of the longest common proper prefix and suffix? Take a representative case: if the mismatch occurs at i = 6, the characters before that position are ABCDAB. Look at this string closely: it has AB at both its head and its tail. Since the D at i = 6 failed to match, we can bring the C at i = 2 up to continue the comparison, because the AB already matched at the tail of ABCDAB is identical to the AB at its head. AB is the longest common proper prefix and suffix of ABCDAB, and its length, 2, is exactly the subscript to jump to.
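The definition can also be checked directly, without any cleverness: for each i, try every candidate length from the longest down and keep the first one where the head and the tail of P[0]...P[i - 1] agree. The sketch below does exactly that. NextByDefinition is a hypothetical name, and this brute-force version is meant only for verifying the values, not as the way KMP builds its table.

```cpp
#include <cassert>
#include <string>
#include <vector>

// next[i] = length of the longest proper prefix of P[0..i-1] that is
// also a proper suffix of it; next[0] = -1 by convention.
// Brute force straight from the definition. (Hypothetical helper.)
std::vector<int> NextByDefinition(const std::string& P)
{
    const int p_len = static_cast<int>(P.size());
    std::vector<int> next(p_len + 1, 0);
    next[0] = -1;

    for (int i = 1; i <= p_len; ++i)
    {
        // len < i keeps the prefix and suffix proper.
        for (int len = i - 1; len > 0; --len)
        {
            // Compare P[0..len-1] with P[i-len..i-1].
            if (P.compare(0, len, P, i - len, len) == 0)
            {
                next[i] = len;
                break;
            }
        }
    }
    return next;
}
```

For ABCDABD this yields -1, 0, 0, 0, 0, 1, 2, 0, the same values as in the list above.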

Some readers may object: if the match fails at i = 5, then by the reasoning above the character at i = 1 should be brought up to continue the comparison. But the characters at these two positions are the same, both B. Since they are the same, isn't bringing it up pointless? In fact, there is nothing wrong with the explanation or with the algorithm; the algorithm simply has not been optimized yet. This is explained in detail below, so readers are advised not to get stuck here: skip it for now and it will become clear later.

The idea is that simple. The next step is to implement it in code, as follows:

// next must have room for P.size() + 1 entries
void GetNext(string P, int next[])
{
    int p_len = P.size();
    int i = 0;   // subscript into P
    int j = -1;
    next[0] = -1;

    while (i < p_len)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            next[i] = j;
        }
        else
            j = next[j];
    }
}

Confusing at first sight, isn't it? The code above computes the next[] value for each position in the pattern string.

For a closer analysis, I divide the code into two parts:

(1) What are the roles of i and j?

i and j act like two "pointers", one ahead and one behind; by moving them we find the longest common proper prefix and suffix.

(2) What does the if...else... statement do?

Suppose i and j are at some pair of positions with next[i] = j. This means that, for position i, the longest common proper prefix and suffix of the segment [0, i - 1] are [0, j - 1] and [i - j, i - 1]; that is, these two segments have identical content.

According to the algorithm flow, if P[i] == P[j], then i++; j++; next[i] = j;. If they are not equal, then j = next[j].

next[j] is the length of the longest common proper prefix and suffix of the segment [0, j - 1]. Picture four ellipses laid along P from left to right. The two on the left are that prefix and suffix of [0, j - 1], so they have identical content. Since [i - j, i - 1] has the same content as [0, j - 1], the same holds for the two ellipses on the right, which sit at the start and at the end of [i - j, i - 1]. So the else statement uses the fact that the first ellipse and the fourth ellipse have identical content to speed up the search for the longest common proper prefix and suffix ending at position i: the next candidate to try is j = next[j].

Attentive readers will ask: what is the purpose of j == -1 in the if statement? First, when the program starts, j is initially -1, so evaluating P[i] == P[j] directly would go out of bounds. Second, in the else statement j = next[j], j keeps falling back; if it falls back to -1 (that is, j = next[0]), evaluating P[i] == P[j] would again go out of bounds. To sum up these two points, its purpose is to handle this special boundary case.
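The fall-back j = next[j] works because following next repeatedly visits every common proper prefix and suffix of a string, from the longest to the shortest. The sketch below makes that visible: it builds the table with the same loop as GetNext (using a std::vector sized P.size() + 1 instead of a raw array) and then walks the chain starting from the whole pattern. BorderChain is a hypothetical name used only for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Lengths of all strings that are both a proper prefix and a proper
// suffix of P, longest first, obtained by following next repeatedly.
// (Hypothetical helper, for illustration only.)
std::vector<int> BorderChain(const std::string& P)
{
    const int p_len = static_cast<int>(P.size());

    // Same loop as GetNext, with the table sized p_len + 1.
    std::vector<int> next(p_len + 1, 0);
    next[0] = -1;
    int i = 0;
    int j = -1;
    while (i < p_len)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            next[i] = j;
        }
        else
            j = next[j];
    }

    // Walk the chain next[p_len], next[next[p_len]], ... until -1.
    std::vector<int> chain;
    for (int len = next[p_len]; len >= 0; len = next[len])
        chain.push_back(len);
    return chain;
}
```

For ABABAB the chain is 4, 2, 0, corresponding to ABAB, AB and the empty string. For ABCDABD it is just 0.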

#include <iostream>
#include <string>

using namespace std;

/* P is the pattern string; subscripts start at 0.
   next must have room for P.size() + 1 entries. */
void GetNext(string P, int next[])
{
    int p_len = P.size();
    int i = 0;   // subscript into P
    int j = -1;
    next[0] = -1;

    while (i < p_len)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            next[i] = j;
        }
        else
            j = next[j];
    }
}

/* Find the position of the first occurrence of P in S */
int KMP(string S, string P, int next[])
{
    GetNext(P, next);

    int i = 0;  // subscript into S
    int j = 0;  // subscript into P
    int s_len = S.size();
    int p_len = P.size();

    while (i < s_len && j < p_len) // i and j never run past the ends of S and P
    {
        if (j == -1 || S[i] == P[j])  // the first character of P failed to match, or S[i] == P[j]
        {
            i++;
            j++;
        }
        else
            j = next[j];  // current character failed to match: jump
    }

    if (j == p_len)  // match found
        return i - j;

    return -1;
}

int main()
{
    int next[100] = { 0 };

    cout << KMP("bbc abcdab abcdabcdabde", "abcdabd", next) << endl; // 15

    return 0;
}
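One detail of the program above is worth noting: GetNext writes P.size() + 1 entries, so the fixed int next[100] in main only suits patterns of up to 99 characters. A possible variant, sketched here under the assumption that the caller does not need the table afterwards, lets the search function size and own the table itself. KmpSearch is a hypothetical name, not part of the original code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Position of the first occurrence of P in S, or -1.
// The jump table is sized from the pattern instead of being passed in.
// (Hypothetical variant of the KMP function above.)
int KmpSearch(const std::string& S, const std::string& P)
{
    const int s_len = static_cast<int>(S.size());
    const int p_len = static_cast<int>(P.size());

    // Build next with the same loop as GetNext.
    std::vector<int> next(p_len + 1, 0);
    next[0] = -1;
    int i = 0;
    int j = -1;
    while (i < p_len)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            next[i] = j;
        }
        else
            j = next[j];
    }

    // Matching loop, identical in structure to KMP above.
    i = 0;
    j = 0;
    while (i < s_len && j < p_len)
    {
        if (j == -1 || S[i] == P[j])
        {
            i++;
            j++;
        }
        else
            j = next[j];
    }
    return (j == p_len) ? i - j : -1;
}
```

KmpSearch("bbc abcdab abcdabcdabde", "abcdabd") returns 15, as the original program prints, and a pattern that does not occur gives -1.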

KMP optimization
Take the next table for ABCDABD again. If the match fails at i = 5, the code above continues by comparing the character at i = 1, but the characters at these two positions are the same: both are B. Since they are the same, the comparison is bound to fail again, so isn't it wasted? As mentioned earlier, this happens because KMP has not yet been optimized. How should the code be rewritten to solve this problem? It's very simple:

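/* P is the pattern string; subscripts start at 0.
   nextval must have room for P.size() + 1 entries. */
void GetNextval(string P, int nextval[])
{
    int p_len = P.size();
    int i = 0;   // subscript into P
    int j = -1;
    nextval[0] = -1;

    while (i < p_len)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;

            // when i == P.size(), P[i] is the terminating '\0'
            if (P[i] != P[j])
                nextval[i] = j;
            else
                nextval[i] = nextval[j];  // same character: reuse the earlier jump target
        }
        else
            j = nextval[j];
    }
}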
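To see what the optimization changes, the sketch below computes the optimized table with a std::vector and returns it so the values can be inspected. BuildNextval is a hypothetical name. It follows the same logic as GetNextval, except that it checks i < p_len explicitly instead of relying on the terminating '\0' of the string.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Optimized jump table: if the character at the jump target equals the
// character that just failed, skip straight to that target's own target.
// (Hypothetical vector-based variant of GetNextval.)
std::vector<int> BuildNextval(const std::string& P)
{
    const int p_len = static_cast<int>(P.size());
    std::vector<int> nextval(p_len + 1, 0);
    nextval[0] = -1;

    int i = 0;
    int j = -1;
    while (i < p_len)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            if (i < p_len && P[i] == P[j])
                nextval[i] = nextval[j];  // jumping to j would fail again
            else
                nextval[i] = j;
        }
        else
            j = nextval[j];
    }
    return nextval;
}
```

For ABCDABD the result is -1, 0, 0, 0, -1, 0, 2, 0. Compared with the unoptimized values -1, 0, 0, 0, 0, 1, 2, 0, the entries at i = 4 and i = 5 have changed: a mismatch at i = 5 now jumps to 0 instead of to the identical B at i = 1.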


Origin blog.csdn.net/JACKSONMHLK/article/details/114168906