The KMP algorithm is an improved string matching algorithm. It differs from the BF (brute-force) algorithm in what happens when the substring fails to match the main string: instead of starting over from the first character of the substring at the next position of the main string, KMP uses the characters that have already matched to decide where in the substring matching should resume.
The difference between the two algorithms is illustrated below:
Main string: BABCDABABCDABCED
Substring: ABCDABCED
BF algorithm:
Step 1:
B | A | B | C | D | A | B | A | B | C | D | A | B | C | E | D |
A |
Matching starts by comparing the first character of the main string with the first character of the substring; the match fails.
Step 2:
B | A | B | C | D | A | B | A | B | C | D | A | B | C | E | D |
  | A | B | C | D | A | B | C |
Matching restarts from the second character of the main string and the first character of the substring. Six characters match, then the seventh character of the substring (C) meets A in the main string and the match fails.
Step 3:
B | A | B | C | D | A | B | A | B | C | D | A | B | C | E | D |
  |   | A |
Matching restarts from the third character of the main string and the first character of the substring; the match fails.
...........
Step 7:
B | A | B | C | D | A | B | A | B | C | D | A | B | C | E | D |
  |   |   |   |   |   | A |
Matching restarts from the seventh character of the main string and the first character of the substring; the match fails.
Step 8:
B | A | B | C | D | A | B | A | B | C | D | A | B | C | E | D |
  |   |   |   |   |   |   | A | B | C | D | A | B | C | E | D |
The match succeeds. That is the BF algorithm: after each failed match, matching restarts from the next character of the main string and the first character of the substring.
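The BF steps above can be sketched in a few lines of C++. This is only a minimal illustration; the function name `bf_find` is hypothetical, and positions are 0-based here, so step 8 in the walk-through corresponds to index 7.

```cpp
#include <string>

// Brute-force (BF) search: try every start position in the main string.
// Returns the 0-based index of the first match, or -1 if there is none.
// bf_find is a hypothetical name chosen for this sketch.
int bf_find(const std::string& main_str, const std::string& sub) {
    int n = static_cast<int>(main_str.length());
    int m = static_cast<int>(sub.length());
    for (int start = 0; start + m <= n; start++) {
        int j = 0;
        // Compare the substring from its first character at this start position.
        while (j < m && main_str[start + j] == sub[j]) j++;
        if (j == m) return start;  // every character matched
        // Otherwise everything matched so far is thrown away and start moves by one.
    }
    return -1;
}
```

Calling `bf_find("BABCDABABCDABCED", "ABCDABCED")` returns 7, the eighth character of the main string.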
KMP algorithm:
Step 1:
B | A | B | C | D | A | B | A | B | C | D | A | B | C | E | D |
A |
Step 2:
B | A | B | C | D | A | B | A | B | C | D | A | B | C | E | D |
  | A | B | C | D | A | B | C |
Step 3:
B | A | B | C | D | A | B | A | B | C | D | A | B | C | E | D |
  |   |   |   |   | A | B | C |
Step 4:
B | A | B | C | D | A | B | A | B | C | D | A | B | C | E | D |
  |   |   |   |   |   |   | A | B | C | D | A | B | C | E | D |
If this is not clear yet, the explanation below goes through it in detail; after that, tracing the subscripts in the program shows why the substring moves the way it does.
The KMP algorithm has two parts: ① building the next array (the core of the algorithm, since it determines the jumps shown above)
② string matching
Before explaining the next array, we first need prefixes and suffixes, and along the way the longest length at which a prefix matches a suffix.
Take the substring above as the example: ABCDABCED. (A prefix starts at the first character. A suffix ends at the last character, but it is still read from left to right, not backwards; see the listing below.)
Before the first character there is no string, so there is no prefix or suffix and no longest length; I initialize the length to 0.
Before the second character the string is A. A prefix or suffix cannot be the whole string itself, so there is no prefix or suffix and the length is 0.
Before the third character the string is AB: the prefix is A, the suffix is B, they do not match, so the length is 0.
Before the fourth character the string is ABC:
Length 1: the prefix is A, the suffix is C, no match.
Length 2: the prefix is AB, the suffix is BC, no match.
So the longest length is 0.
Before the fifth character the string is ABCD:
Length 1: the prefix is A, the suffix is D, no match.
Length 2: the prefix is AB, the suffix is CD, no match.
Length 3: the prefix is ABC, the suffix is BCD, no match.
So the longest length is 0.
Before the sixth character the string is ABCDA:
Length 1: the prefix is A, the suffix is A, match.
Length 2: the prefix is AB, the suffix is DA, no match.
Length 3: the prefix is ABC, the suffix is CDA, no match.
Length 4: the prefix is ABCD, the suffix is BCDA, no match.
So the longest length is 1.
Before the seventh character the string is ABCDAB:
Length 1: the prefix is A, the suffix is B, no match.
Length 2: the prefix is AB, the suffix is AB, match.
Length 3: the prefix is ABC, the suffix is DAB, no match.
Length 4: the prefix is ABCD, the suffix is CDAB, no match.
Length 5: the prefix is ABCDA, the suffix is BCDAB, no match.
So the longest length is 2.
Before the eighth character the string is ABCDABC:
Length 1: the prefix is A, the suffix is C, no match.
Length 2: the prefix is AB, the suffix is BC, no match.
Length 3: the prefix is ABC, the suffix is ABC, match.
Length 4: the prefix is ABCD, the suffix is DABC, no match.
Length 5: the prefix is ABCDA, the suffix is CDABC, no match.
Length 6: the prefix is ABCDAB, the suffix is BCDABC, no match.
So the longest length is 3.
Before the ninth character the string is ABCDABCE:
Length 1: the prefix is A, the suffix is E, no match.
Length 2: the prefix is AB, the suffix is CE, no match.
Length 3: the prefix is ABC, the suffix is BCE, no match.
Length 4: the prefix is ABCD, the suffix is ABCE, no match.
Length 5: the prefix is ABCDA, the suffix is DABCE, no match.
Length 6: the prefix is ABCDAB, the suffix is CDABCE, no match.
Length 7: the prefix is ABCDABC, the suffix is BCDABCE, no match.
So the longest length is 0.
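The hand procedure above can also be checked mechanically. The following is a rough sketch that tries every candidate length exactly as in the walk-through; the names `longest_prefix_suffix` and `brute_force_table` are hypothetical, and this is deliberately the slow, obvious method, not the way the next array is normally built.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Longest length L, smaller than s.length(), such that the first L characters
// of s equal its last L characters (both read from left to right).
// Hypothetical helper for illustration.
int longest_prefix_suffix(const std::string& s) {
    int n = static_cast<int>(s.length());
    for (int len = n - 1; len > 0; len--) {
        // Compare the prefix s[0, len) with the suffix s[n - len, n).
        if (s.compare(0, len, s, n - len, len) == 0) return len;
    }
    return 0;
}

// table[i], for i = 1 .. sub.length(), is the longest length for the string
// that lies before the i-th character of sub. table[0] is unused.
std::vector<int> brute_force_table(const std::string& sub) {
    std::vector<int> table(sub.length() + 1, 0);
    for (std::size_t i = 1; i <= sub.length(); i++) {
        table[i] = longest_prefix_suffix(sub.substr(0, i - 1));
    }
    return table;
}
```

For `ABCDABCED`, entries 1 to 9 of the result are 0, 0, 0, 0, 0, 1, 2, 3, 0, the same lengths found by hand above.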
From the longest lengths found above we can draw up the next table:

subscript | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
substring | A | B | C | D | A | B | C | E | D |
next | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 0 |

The improved next table:

subscript | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
substring | A | B | C | D | A | B | C | E | D |
next | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |

In the improved table, the second B takes the same next value, 0, as the first B, and the second C takes the same next value, 0, as the first C. With the plain table, a mismatch at the second B (subscript 6) would resume at the first B (subscript 2); the same character is bound to mismatch again, so the second B takes over the first B's value instead. The same reasoning applies to the two Cs.
These are the next array and its improved version.
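The step from the plain table to the improved one can be written as a single left-to-right pass. This is a sketch under the convention used in the tables above (subscripts start at 1, and a next value of k means matching resumes at character k + 1 of the substring); the name `improve_next` is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// plain[i], for i = 1 .. sub.length(), is the plain next value of character i;
// plain[0] is unused. Returns the improved table in the same layout.
// Hypothetical helper for illustration.
std::vector<int> improve_next(const std::string& sub, const std::vector<int>& plain) {
    std::vector<int> improved(plain);
    for (std::size_t i = 2; i <= sub.length(); i++) {
        int k = plain[i];  // after a mismatch at character i, character k + 1 is tried
        // sub[i - 1] is character i and sub[k] is character k + 1 (0-based storage).
        // If they are equal, the retry is bound to fail, so inherit the earlier value.
        if (sub[i - 1] == sub[k]) improved[i] = improved[k + 1];
    }
    return improved;
}
```

Given the plain values 0, 0, 0, 0, 0, 1, 2, 3, 0 for `ABCDABCED`, the pass yields 0, 0, 0, 0, 0, 0, 0, 3, 0, which is the second table.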
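Program for the next array:
    // Fills next[1 .. s2.length()] with the improved next values.
    void Get_next(int *next, const string &s2)
    {
        int m = s2.length();
        int i = 1;
        int j = 0;
        next[i] = 0;
        while (i < m) {
            if (j == 0 || s2[i - 1] == s2[j - 1]) {
                i++;
                j++;
                if (s2[i - 1] == s2[j - 1])
                    next[i] = next[j];  // same character: inherit its value
                else
                    next[i] = j - 1;    // longest prefix/suffix match length
            } else {
                // Fall back to a shorter prefix instead of starting over.
                j = (j == 1) ? 0 : next[j] + 1;
            }
        }
    }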
Once the way the next array is evaluated is clear, the KMP string matching function can be written along the lines of the BF algorithm.
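The following is the complete program for the KMP algorithm:
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    // Build the improved next array; subscripts 1 .. s2.length() are filled.
    void Get_next(int *next, const string &s2)
    {
        int m = s2.length();
        int i = 1;
        int j = 0;
        next[i] = 0;
        while (i < m) {
            if (j == 0 || s2[i - 1] == s2[j - 1]) {
                i++;
                j++;
                if (s2[i - 1] == s2[j - 1])
                    next[i] = next[j];
                else
                    next[i] = j - 1;
            } else {
                // Fall back to a shorter prefix instead of starting over.
                j = (j == 1) ? 0 : next[j] + 1;
            }
        }
    }

    // Matching part: search for s2 in s1, starting at 0-based index pos.
    // Returns the 0-based index of the first match, or -1 if there is none.
    int kmp(const string &s1, const string &s2, int pos)
    {
        vector<int> next(s2.length() + 2);
        Get_next(next.data(), s2);
        int n = s1.length();
        int m = s2.length();
        int i = pos < 0 ? 0 : pos;
        int j = 0;
        while (i < n && j < m) {
            if (s1[i] == s2[j]) {
                i++;
                j++;
            } else if (j == 0) {
                i++;              // nothing matched yet: move on in s1
            } else {
                j = next[j + 1];  // keep i, resume inside s2
            }
        }
        if (j >= m)
            return i - m;
        return -1;
    }

    int main()
    {
        string s1 = "babcdababcdabced";
        string s2 = "abcdabced";
        int pos;
        cin >> pos;
        cout << kmp(s1, s2, pos) << endl;
        return 0;
    }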
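To put a number on the difference between the two algorithms, one option is to count character comparisons. The sketch below is an illustration only: the names `bf_comparisons`, `kmp_comparisons` and `border` are hypothetical, and the KMP part uses the plain (not the improved) next values, stored 0-based so that `border[j]` is the longest prefix/suffix match length of the first j characters of the substring.

```cpp
#include <string>
#include <vector>

// Comparisons made by BF until the first match (or until s1 is exhausted).
int bf_comparisons(const std::string& s1, const std::string& s2) {
    int n = static_cast<int>(s1.length());
    int m = static_cast<int>(s2.length());
    int count = 0;
    for (int start = 0; start + m <= n; start++) {
        int j = 0;
        while (j < m) {
            count++;
            if (s1[start + j] != s2[j]) break;
            j++;
        }
        if (j == m) return count;  // match found
    }
    return count;
}

// Comparisons made by KMP with the plain table, over the same search.
int kmp_comparisons(const std::string& s1, const std::string& s2) {
    int n = static_cast<int>(s1.length());
    int m = static_cast<int>(s2.length());
    int count = 0;
    // border[len]: longest prefix/suffix match length of the first len characters.
    std::vector<int> border(m + 1, 0);
    for (int len = 2, k = 0; len <= m; len++) {
        while (k > 0 && s2[len - 1] != s2[k]) k = border[k];
        if (s2[len - 1] == s2[k]) k++;
        border[len] = k;
    }
    int i = 0, j = 0;
    while (i < n && j < m) {
        count++;
        if (s1[i] == s2[j]) { i++; j++; }
        else if (j == 0) i++;   // nothing matched: advance in s1
        else j = border[j];     // keep i, reuse the matched prefix
    }
    return count;
}
```

For the main string and substring used in this article, BF makes 24 comparisons and KMP with the plain table makes 18; the gap grows when long partial matches occur repeatedly.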
This program reflects only my own understanding. If you find any mistakes, please point them out, and if anything is unclear, leave a comment below. Thank you!