Given a main string (denoted S) and a pattern string (denoted P), the task is to find the position of P in S. This is the string pattern matching problem.
The Knuth-Morris-Pratt algorithm (KMP for short) is one of the most commonly used algorithms for this problem. It was conceived by Donald Ervin Knuth and Vaughan Pratt in 1974; James H. Morris independently designed it in the same year, and the three published it jointly in 1977.
Before moving on, two concepts need to be introduced: proper prefix and proper suffix (also rendered as "true prefix" and "true suffix").
As the figure above shows, the "proper prefixes" of a string are all of its leading substrings except the string itself; the "proper suffixes" are all of its trailing substrings except the string itself. (Many blogs, in fact almost all of them, just say "prefix". Strictly speaking, "proper prefix" and "prefix" are different things, so it is better not to mix them up!)
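To make the two definitions concrete, here is a minimal sketch that lists them for a given string. ProperPrefixes and ProperSuffixes are hypothetical helper names introduced only for this illustration, and the empty string is left out of both lists as a simplifying assumption.

```cpp
#include <string>
#include <vector>

// Returns every non-empty proper prefix of s: all prefixes except s itself.
std::vector<std::string> ProperPrefixes(const std::string &s)
{
    std::vector<std::string> result;
    for (std::size_t len = 1; len < s.size(); ++len)
        result.push_back(s.substr(0, len));
    return result;
}

// Returns every non-empty proper suffix of s: all suffixes except s itself.
std::vector<std::string> ProperSuffixes(const std::string &s)
{
    std::vector<std::string> result;
    for (std::size_t start = 1; start < s.size(); ++start)
        result.push_back(s.substr(start));
    return result;
}
```

For "ABCD" the first function yields A, AB, ABC and the second yields BCD, CD, D. The string "ABCD" itself appears in neither list.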
Naive string matching algorithm
When we first meet the string pattern matching problem, the first thing that comes to mind is naive string matching (so-called brute-force matching). The code is as follows:
/* String indices start at 0 */
int NaiveStringSearch(const string &S, const string &P)
{
    int i = 0; // index into S
    int j = 0; // index into P
    int s_len = S.size();
    int p_len = P.size();
    while (i < s_len && j < p_len)
    {
        if (S[i] == P[j]) // equal: advance both indices
        {
            i++;
            j++;
        }
        else // not equal: restart one position after the previous start
        {
            i = i - j + 1;
            j = 0;
        }
    }
    if (j == p_len) // match found; it starts at i - j
        return i - j;
    return -1;
}
The time complexity of brute-force matching is O(nm), where n is the length of S and m is the length of P. Clearly, this time complexity is hard to accept for our needs.
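To see where the O(nm) bound comes from, the sketch below repeats the brute-force loop and counts character comparisons. NaiveSearchCounting and its comparisons parameter are hypothetical names added for this illustration only.

```cpp
#include <string>

// Brute-force search that also counts how many character comparisons it makes.
// Returns the position of the first match, or -1 if there is none.
int NaiveSearchCounting(const std::string &S, const std::string &P, long &comparisons)
{
    int s_len = S.size();
    int p_len = P.size();
    int i = 0; // index into S
    int j = 0; // index into P
    comparisons = 0;
    while (i < s_len && j < p_len)
    {
        ++comparisons;
        if (S[i] == P[j])
        {
            i++;
            j++;
        }
        else
        {
            i = i - j + 1; // go back to one past the previous starting point
            j = 0;
        }
    }
    return (j == p_len) ? i - j : -1;
}
```

With S made of nine 'a' characters followed by 'b' and P = "aaab", the function returns 6 after 28 comparisons: each of the seven starting positions costs four comparisons, because the mismatch is only discovered at the last character of the pattern.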
Now for the main topic: the KMP algorithm, whose time complexity is O(n+m).
KMP string matching algorithm
Algorithm flow
1) First, compare the first character of the main string "BBC ABCDAB ABCDABCDABDE" with the first character of the pattern string "ABCDABD". Because B does not match A, the pattern string is shifted one position to the right.
2) B and A again do not match, so the pattern string is shifted right once more.
3) This continues until a character in the main string equals the first character of the pattern string.
4) Then compare the next character of the main string with the next character of the pattern string; they are still the same.
5) This continues until a character in the main string differs from the corresponding character of the pattern string.
6) At this point, the most natural reaction is to shift the whole pattern string right by one position and compare again from its beginning. This works, but it is very inefficient, because the "search position" has to be moved back to positions that have already been compared, and those comparisons are repeated.
7) A basic fact is that when the space fails to match D, you already know that the first six characters are "ABCDAB". The idea of the KMP algorithm is to use this known information: instead of moving the "search position" back over positions already compared, keep moving it forward, which improves efficiency.
8) How is this done? Set up a jump array for the pattern string, int next[]. How this array is computed is introduced later; for now it is enough to be able to use it.
9) When the space fails to match D, the first six characters "ABCDAB" have already been matched. According to the jump array, the next value at the mismatched D is 2, so matching resumes at subscript 2 of the pattern string.
10) Because the space does not match C, and the next value at C is 0, matching resumes at subscript 0 of the pattern string.
11) Because the space does not match A, and the next value here is -1, meaning that even the first character of the pattern string fails to match, the pattern string is shifted one position to the right.
12) Compare character by character until C and D do not match. The next step is therefore to resume matching at subscript 2 of the pattern string.
13) Compare character by character up to the last character of the pattern string; a complete match is found and the search is finished.
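The steps above can be replayed in code. This is a minimal sketch, assuming the jump table for "ABCDABD" is simply handed in as {-1, 0, 0, 0, 0, 1, 2}, the values this article gives for the pattern. SearchWithTable and its jumps parameter are hypothetical names introduced only for this illustration.

```cpp
#include <string>
#include <vector>

// Searches S for P using a precomputed jump table (one entry per pattern index).
// Returns the match position or -1. After every mismatch the new value of j is
// appended to `jumps`, so the sequence of jumps can be inspected afterwards.
int SearchWithTable(const std::string &S, const std::string &P,
                    const std::vector<int> &next, std::vector<int> &jumps)
{
    int s_len = S.size();
    int p_len = P.size();
    int i = 0; // index into S, never moves backwards
    int j = 0; // index into P
    while (i < s_len && j < p_len)
    {
        if (j == -1 || S[i] == P[j])
        {
            i++;
            j++;
        }
        else
        {
            j = next[j]; // jump inside the pattern, i stays where it is
            jumps.push_back(j);
        }
    }
    return (j == p_len) ? i - j : -1;
}
```

Called with the main string and pattern from the steps, it returns 15 and records the jumps -1, -1, -1, -1, 2, 0, -1, 2. The first four are the initial mismatches against A, and the last four correspond to steps 9 to 12.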
How to obtain the next array
The next array is built on the "proper prefix" and "proper suffix" concepts: next[i] equals the length of the longest proper prefix of P[0]...P[i - 1] that is also a proper suffix of it (the case i = 0 is ignored for now and explained below). We again use the table above as the example; for ease of reading it is copied below.

i         0    1    2    3    4    5    6    7
P[i]      A    B    C    D    A    B    D    \0
next[i]   -1   0    0    0    0    1    2    0

- i = 0: for the first character of the pattern string, we set next[0] = -1 by convention;
- i = 1: the preceding string is A, whose longest common proper prefix and suffix has length 0, so next[1] = 0;
- i = 2: the preceding string is AB, whose longest common proper prefix and suffix has length 0, so next[2] = 0;
- i = 3: the preceding string is ABC, whose longest common proper prefix and suffix has length 0, so next[3] = 0;
- i = 4: the preceding string is ABCD, whose longest common proper prefix and suffix has length 0, so next[4] = 0;
- i = 5: the preceding string is ABCDA, whose longest common proper prefix and suffix is A, so next[5] = 1;
- i = 6: the preceding string is ABCDAB, whose longest common proper prefix and suffix is AB, so next[6] = 2;
- i = 7: the preceding string is ABCDABD, whose longest common proper prefix and suffix has length 0, so next[7] = 0.
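The values in the list can be checked mechanically. The sketch below builds the table straight from the definition by trying every candidate length, longest first. NextByDefinition is a hypothetical name, and the function is meant only as a slow cross-check (cubic time in the worst case), not as the way KMP computes the table.

```cpp
#include <string>
#include <vector>

// Builds next[] directly from the definition: next[0] = -1, and for i >= 1,
// next[i] is the length of the longest proper prefix of P[0..i-1] that is
// also a suffix of it. The result has P.size() + 1 entries.
std::vector<int> NextByDefinition(const std::string &P)
{
    int p_len = P.size();
    std::vector<int> next(p_len + 1, 0);
    next[0] = -1;
    for (int i = 1; i <= p_len; ++i)
    {
        for (int len = i - 1; len > 0; --len) // longest candidate first
        {
            // Compare P[0..len-1] with P[i-len..i-1].
            if (P.compare(0, len, P, i - len, len) == 0)
            {
                next[i] = len;
                break;
            }
        }
    }
    return next;
}
```

For "ABCDABD" this returns -1, 0, 0, 0, 0, 1, 2, 0, matching the list entry by entry.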
So why can the jump on a mismatch be derived from the length of the longest common proper prefix and suffix? Take a representative example: if the mismatch happens at i = 6, we know that the string before that position is ABCDAB. Looking closely, this string has AB at both its head and its tail. Since the D at i = 6 does not match, why not bring the C at i = 2 over and continue the comparison from there? We can, because of that AB: AB is the longest proper prefix of ABCDAB that is also its suffix, and its length, 2, is exactly the subscript to jump to.
Some readers may object: if the match fails at i = 5, then by the reasoning above the character at i = 1 should be brought over to continue the comparison, but the characters at these two positions are the same, both B. Since they are the same, isn't bringing it over pointless? This is not a flaw in my explanation, nor in the algorithm; it is just that the algorithm has not been optimized yet. This is explained in detail below, so readers are advised not to get stuck here: skip it for now, and it will become clear further on.
The idea is that simple. The next step is the code, as follows:
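/* next must have room for P.size() + 1 entries: the loop also writes next[p_len] */
void GetNext(const string &P, int next[])
{
    int p_len = P.size();
    int i = 0;  // index into P
    int j = -1; // length of the current common proper prefix and suffix, or -1
    next[0] = -1;
    while (i < p_len)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            next[i] = j;
        }
        else
            j = next[j]; // fall back to a shorter candidate
    }
}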
Bewildering at first sight, isn't it? The code above computes the next[] value for every position of the pattern string.
The specific analysis follows; I divide the code into two parts:
(1) What roles do i and j play?
i and j act like two "pointers", one in front and one behind; by moving them we find the longest common proper prefix and suffix.
(2) What do the if...else... statements do?
Suppose i and j are at the positions shown above. From next[i] = j we know that, for position i, the longest common proper prefix and suffix of the section [0, i - 1] are [0, j - 1] and [i - j, i - 1]; that is, these two sections have the same content.
According to the algorithm flow, if P[i] == P[j], then i++; j++; next[i] = j; if they are not equal, then j = next[j]. See the figure below: next[j] is the length of the longest common proper prefix and suffix of the section [0, j - 1]. As shown in the figure, the two ellipses on the left represent this longest common proper prefix and suffix, that is, the two ellipses mark sections with the same content; for the same reason there are two identical ellipses on the right. So the else statement uses the fact that the first ellipse and the fourth ellipse have the same content to speed up finding the length of the common proper prefix and suffix of the section [0, i - 1].
Attentive readers will ask what the j == -1 test in the if statement is for. First, when the program starts, j is initially -1, so evaluating P[i] == P[j] directly would read out of bounds. Second, in the else statement j = next[j], j keeps falling back; if j is assigned -1 along the way (that is, j = next[0]), evaluating P[i] == P[j] would again read out of bounds. To sum up these two points, the test is there to handle the special boundary case.
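To watch the else branch at work, the sketch below runs the same loop as GetNext but records every fallback as a pair (old j, new j). GetNextTraced and its fallbacks parameter are hypothetical names added for this illustration, and the table is returned as a std::vector rather than written into a caller-supplied array.

```cpp
#include <string>
#include <utility>
#include <vector>

// Same loop as GetNext, with every fallback j -> next[j] recorded.
// Returns the table, which has P.size() + 1 entries.
std::vector<int> GetNextTraced(const std::string &P,
                               std::vector<std::pair<int, int>> &fallbacks)
{
    int p_len = P.size();
    std::vector<int> next(p_len + 1, 0);
    int i = 0;
    int j = -1;
    next[0] = -1;
    while (i < p_len)
    {
        // j == -1 is tested first, so P[-1] is never read.
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            next[i] = j;
        }
        else
        {
            fallbacks.push_back({j, next[j]});
            j = next[j];
        }
    }
    return next;
}
```

For the pattern "ABABC" the table is -1, 0, 0, 1, 2, 0 and the recorded fallbacks are (0, -1), (2, 0), (0, -1). The middle pair is the interesting one: when C fails to extend the match AB, j drops from 2 to next[2] = 0 instead of restarting the comparison from scratch.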
#include <iostream>
#include <string>
using namespace std;

/* P is the pattern string; indices start at 0.
   next must have room for P.size() + 1 entries. */
void GetNext(const string &P, int next[])
{
    int p_len = P.size();
    int i = 0; // index into P
    int j = -1;
    next[0] = -1;
    while (i < p_len)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            next[i] = j;
        }
        else
            j = next[j];
    }
}

/* Find the position of the first occurrence of P in S */
int KMP(const string &S, const string &P, int next[])
{
    GetNext(P, next);
    int i = 0; // index into S
    int j = 0; // index into P
    int s_len = S.size();
    int p_len = P.size();
    while (i < s_len && j < p_len) // i and j stay within bounds here
    {
        if (j == -1 || S[i] == P[j]) // the first character of P failed to match, or S[i] == P[j]
        {
            i++;
            j++;
        }
        else
            j = next[j]; // current character failed to match: jump
    }
    if (j == p_len) // match found
        return i - j;
    return -1;
}

int main()
{
    int next[100] = {0};
    cout << KMP("bbc abcdab abcdabcdabde", "abcdabd", next) << endl; // 15
    return 0;
}
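The listing above relies on a fixed buffer of 100 entries, which is enough for the demo but would overflow for a pattern of 100 or more characters. One possible variation, sketched here under the assumption that allocating the table inside the search is acceptable, sizes it from the pattern with std::vector. KmpSearch is a hypothetical name, not part of the original code.

```cpp
#include <string>
#include <vector>

// Returns the index of the first occurrence of P in S, or -1 if there is none.
int KmpSearch(const std::string &S, const std::string &P)
{
    int s_len = S.size();
    int p_len = P.size();

    // Build the jump table. It needs p_len + 1 entries because the
    // construction loop also writes next[p_len].
    std::vector<int> next(p_len + 1, 0);
    next[0] = -1;
    for (int i = 0, j = -1; i < p_len;)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            next[i] = j;
        }
        else
            j = next[j];
    }

    // Scan the main string; i never moves backwards.
    int i = 0;
    int j = 0;
    while (i < s_len && j < p_len)
    {
        if (j == -1 || S[i] == P[j])
        {
            i++;
            j++;
        }
        else
            j = next[j];
    }
    return (j == p_len) ? i - j : -1;
}
```

KmpSearch("bbc abcdab abcdabcdabde", "abcdabd") returns 15, the same result as the listing above, and the caller no longer has to guess a buffer size.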
Take the table in 3.2 as the example (copied above). If the match fails at i = 5, then according to the code in 3.2 the character at i = 1 is used to continue the comparison, but the characters at these two positions are the same: both are B. Since they are the same, isn't comparing again pointless? As explained in 3.2, this happens because KMP has not been optimized yet. How can the code be rewritten to solve the problem? It is very simple.
/* P is the pattern string; indices start at 0.
   nextval must have room for P.size() + 1 entries. */
void GetNextval(const string &P, int nextval[])
{
    int p_len = P.size();
    int i = 0; // index into P
    int j = -1;
    nextval[0] = -1;
    while (i < p_len)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            if (P[i] != P[j])
                nextval[i] = j;
            else
                nextval[i] = nextval[j]; // same character: keep looking back for a shorter proper prefix
        }
        else
            j = nextval[j];
    }
}
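To see what the optimization changes for the running example, the sketch below wraps the same logic in a function that returns the table as a std::vector. BuildNextval is a hypothetical name used only for this illustration.

```cpp
#include <string>
#include <vector>

// Optimized jump table, same logic as GetNextval, returned as a vector
// with P.size() + 1 entries.
std::vector<int> BuildNextval(const std::string &P)
{
    int p_len = P.size();
    std::vector<int> nextval(p_len + 1, 0);
    int i = 0;
    int j = -1;
    nextval[0] = -1;
    while (i < p_len)
    {
        if (j == -1 || P[i] == P[j])
        {
            i++;
            j++;
            // When i == p_len, P[i] reads the terminating '\0' of the string.
            // If P[i] == P[j], jumping to j would repeat a comparison that is
            // known to fail, so reuse the jump already stored for j.
            nextval[i] = (P[i] != P[j]) ? j : nextval[j];
        }
        else
            j = nextval[j];
    }
    return nextval;
}
```

For "ABCDABD" the result is -1, 0, 0, 0, -1, 0, 2, 0. Compared with the unoptimized table -1, 0, 0, 0, 0, 1, 2, 0, only the entries at i = 4 and i = 5 change: a mismatch at the second B (i = 5) now jumps to subscript 0 instead of to the first B at subscript 1, which is exactly the wasted comparison described above.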