String pattern matching algorithms: BF and KMP

Reprinted from http://www.ruanyifeng.com/blog/2013/05/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.html

1. BF (Brute-Force) Solution

The basic idea:
Compare the first character of the target string s with the first character of the pattern string ss. If they are equal, compare the following characters one by one;
otherwise, start again from the next character of s and compare it with the beginning of ss.
Continue in this way until every character of ss equals the corresponding character of a contiguous substring of s. The match then succeeds, and the position of the first character of ss in s is the position of ss in s. If no such substring exists, the match fails.
(In practice, you can simply use the string's find function here.)
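
The brute-force idea can be sketched in a few lines of C++. This is a minimal illustration, not the original author's code; the function name bf_search is an assumption made for this example, and s and ss follow the naming used above.

```cpp
#include <string>

// Hypothetical helper: returns the 0-based index of the first occurrence
// of the pattern ss in the target s, or -1 if ss does not occur in s.
int bf_search(const std::string &s, const std::string &ss)
{
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(ss.size());
    for (int start = 0; start + m <= n; ++start) // try every alignment in turn
    {
        int j = 0;
        while (j < m && s[start + j] == ss[j])   // compare character by character
            ++j;
        if (j == m)                              // every character of ss matched
            return start;
    }
    return -1;                                   // no alignment matched
}
// bf_search("BBC ABCDAB ABCDABCDABDE", "ABCDABD") returns 15
```

For a successful search this gives the same position as s.find(ss) from the standard library.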

2. KMP

For example, given the string "BBC ABCDAB ABCDABCDABDE", I want to know whether it contains another string, "ABCDABD".
1. First, compare the first character of the string "BBC ABCDAB ABCDABCDABDE" with the first character of the search term "ABCDABD". Because B does not match A, the search term is shifted one position to the right.
2. Because B again does not match A, the search term moves one more position to the right.
3. This continues until a character in the string matches the first character of the search term.
4. Then compare the next character of the string with the next character of the search term; they also match.
5. This continues until a character in the string differs from the corresponding character of the search term. At this point, the brute-force approach is to shift the whole search term one position to the right and compare again from its first character.
6. This is very inefficient, because the search position is moved back to characters that have already been compared, and they are compared all over again.
7. A basic fact is that when the space fails to match D, you already know that the first six characters are "ABCDAB". The idea of the KMP algorithm is to make use of this known information: instead of moving the search position back over characters that have already been compared, keep moving it forward, which improves efficiency.
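
To make the cost of re-comparing concrete, the following sketch counts how many character comparisons the brute-force approach performs over all alignments. The helper name count_bf_comparisons and the sample strings are assumptions chosen for illustration.

```cpp
#include <string>

// Hypothetical helper: counts the character comparisons a brute-force
// search makes when it tries every alignment of term against text.
long count_bf_comparisons(const std::string &text, const std::string &term)
{
    long comparisons = 0;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(term.size());
    for (int start = 0; start + m <= n; ++start)
    {
        for (int j = 0; j < m; ++j)
        {
            ++comparisons;
            if (text[start + j] != term[j])
                break; // mismatch: brute force restarts one position further on
        }
    }
    return comparisons;
}
// count_bf_comparisons("AAAAAAAAAA", "AAAB") returns 28
```

The text has only 10 characters, yet 28 comparisons are made, because most characters are examined again at each new alignment.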
8. How do you know how far to move?
For this, you can compute a "partial match table" for the search term.
First, you need to understand two terms: prefix and suffix. A "prefix" is any leading part of a string that excludes the last character; a "suffix" is any trailing part of a string that excludes the first character.
The "partial match value" is the length of the longest element shared by the prefixes and the suffixes. Taking "ABCDABD" as an example:

- The prefixes and suffixes of "A" are both empty sets, so the length of the common element is 0;

- The prefixes of "AB" are [A] and the suffixes are [B], so the length of the common element is 0;

- The prefixes of "ABC" are [A, AB] and the suffixes are [BC, C], so the length of the common element is 0;

- The prefixes of "ABCD" are [A, AB, ABC] and the suffixes are [BCD, CD, D], so the length of the common element is 0;

- The prefixes of "ABCDA" are [A, AB, ABC, ABCD] and the suffixes are [BCDA, CDA, DA, A]; the common element is "A", with length 1;

- The prefixes of "ABCDAB" are [A, AB, ABC, ABCD, ABCDA] and the suffixes are [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", with length 2;

- The prefixes of "ABCDABD" are [A, AB, ABC, ABCD, ABCDA, ABCDAB] and the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D], so the length of the common element is 0.
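
The definition above translates directly into code. The sketch below computes each partial match value by comparing prefixes with suffixes, longest first. It is written for clarity rather than speed, and the names partial_match_value and partial_match_table are assumptions made for this example.

```cpp
#include <string>
#include <vector>

// Partial match value of s: length of the longest string that is both a
// prefix of s (excluding the last character) and a suffix of s (excluding
// the first character).
int partial_match_value(const std::string &s)
{
    const int n = static_cast<int>(s.size());
    for (int len = n - 1; len > 0; --len)             // longest candidate first
        if (s.compare(0, len, s, n - len, len) == 0)  // prefix == suffix of this length?
            return len;
    return 0;                                         // no common element
}

// One value per leading part of the pattern: "A", "AB", "ABC", ...
std::vector<int> partial_match_table(const std::string &pattern)
{
    std::vector<int> table;
    for (std::size_t i = 1; i <= pattern.size(); ++i)
        table.push_back(partial_match_value(pattern.substr(0, i)));
    return table;
}
// partial_match_table("ABCDABD") yields 0 0 0 0 1 2 0
```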
  
The essence of "partial matching" is that the beginning and the end of a string sometimes repeat. For example, "ABCDAB" contains "AB" twice, so its "partial match value" is 2 (the length of "AB"). When the search term moves, shifting it 4 positions to the right (string length minus partial match value, 6 - 2) brings the first "AB" to the position of the second "AB".
When the space is found not to match D, the first six characters, "ABCDAB", have already been matched. The table shows that the "partial match value" for the last matched character, B, is 2, so the number of positions to shift is given by the following formula:

Number of positions to shift = number of matched characters - corresponding partial match value

Because 6 - 2 equals 4, the search term is shifted 4 positions to the right.
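
The formula can be written as a small function. This is a sketch under the assumption that the partial match table is held in a vector indexed from 0, so the value for the last matched character is at index matched - 1. The name shift_amount is hypothetical, and the case of zero matched characters (move one position) is added here for completeness.

```cpp
#include <vector>

// Hypothetical helper: how many positions to shift the search term after
// `matched` of its leading characters matched and the next one did not.
// pmt is the partial match table of the search term.
int shift_amount(const std::vector<int> &pmt, int matched)
{
    if (matched == 0)
        return 1;                      // nothing matched: just move one position
    return matched - pmt[matched - 1]; // matched characters - partial match value
}
// With the table {0, 0, 0, 0, 1, 2, 0} for "ABCDABD",
// shift_amount(pmt, 6) returns 4
```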
9. Because the space does not match C, the search term must keep moving to the right. Now the number of matched characters is 2 ("AB") and the corresponding "partial match value" is 0, so the shift is 2 - 0 = 2 and the search term moves 2 positions to the right.
10. Because the space does not match A, the search term moves one more position to the right.
11. Compare character by character until C is found not to match D. The shift is again 6 - 2 = 4, so the search term moves 4 positions to the right.
12. Compare character by character until the last character of the search term matches; a complete match has been found and the search is finished. To keep searching (that is, to find all matches), the shift is 7 - 0 = 7, so the search term moves 7 positions to the right; the steps are not repeated here.
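
The whole walkthrough can be sketched as a search that slides the search term exactly as described: on a mismatch it shifts by "matched characters - partial match value" and keeps the characters already known to match. This is an illustrative sketch, not the implementation given later in this article; build_pmt and kmp_by_shifting are hypothetical names.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: partial match table, computed in linear time.
std::vector<int> build_pmt(const std::string &term)
{
    const int m = static_cast<int>(term.size());
    std::vector<int> pmt(m, 0);
    for (int i = 1, k = 0; i < m; ++i)
    {
        while (k > 0 && term[i] != term[k])
            k = pmt[k - 1];            // fall back to a shorter common element
        if (term[i] == term[k])
            ++k;
        pmt[i] = k;
    }
    return pmt;
}

// Returns the 0-based start position of every match of term in text.
std::vector<int> kmp_by_shifting(const std::string &text, const std::string &term)
{
    std::vector<int> found;
    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(term.size());
    if (m == 0 || m > n)
        return found;
    const std::vector<int> pmt = build_pmt(term);
    int start = 0;   // where the search term is currently aligned in the text
    int matched = 0; // leading characters of the term already known to match
    while (start + m <= n)
    {
        while (matched < m && text[start + matched] == term[matched])
            ++matched;
        if (matched == m)
            found.push_back(start);
        if (matched == 0)
            ++start;                   // nothing matched: move one position
        else
        {
            const int keep = pmt[matched - 1];
            start += matched - keep;   // shift = matched - partial match value
            matched = keep;            // never re-compare these characters
        }
    }
    return found;
}
// kmp_by_shifting("BBC ABCDAB ABCDABCDABDE", "ABCDABD") yields the single position 15
```

On the example string the alignments visited are 0, 1, 2, 3, 4, 8, 10, 11, 15, which matches the shifts of 4, 2, 1 and 4 described in the steps above.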

The following is the algorithm implementation

#include <iostream>
#include <string>
#include <vector>
using namespace std;

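// Partial match table, stored as a "next" array: the partial match values
// shifted one position to the right, with next[0] set to -1.
// One extra entry, next[len], lets the search continue after a full match.
void cal_next(const string &str, vector<int> &next)
{
    const int len = str.size();
    next.assign(len + 1, 0); // size the table here so any pattern length is safe
    next[0] = -1;
    int k = -1;
    int j = 0;
    while (j < len)
    {
        if (k == -1 || str[j] == str[k])
        {
            ++k;
            ++j;
            next[j] = k; // the first j characters share a prefix/suffix of length k
        }
        else
            k = next[k]; // fall back to a shorter candidate prefix
    }
}

// Returns the 0-based start position of every match of str2 in str1.
vector<int> KMP(const string &str1, const string &str2, vector<int> &next)
{
    vector<int> vec;
    if (str2.empty())
        return vec; // nothing to search for
    cal_next(str2, next);
    int i = 0; // index into str1 (the text)
    int j = 0; // index into str2 (the pattern)
    const int str1_size = str1.size();
    const int str2_size = str2.size();
    while (i < str1_size)
    {
        // If j == -1, or the current characters match (str1[i] == str2[j]),
        // advance both i and j. Note: the j == -1 test must come first,
        // so that str2[-1] is never read.
        if (j == -1 || str1[i] == str2[j])
        {
            ++i;
            ++j;
        }
        else
            j = next[j]; // mismatch: i stays where it is, the pattern resumes at next[j]
        if (j == str2_size) // full match
        {
            vec.push_back(i - j); // record where the full match starts
            j = next[j];          // reuse the matched tail so no later match is skipped
        }
    }
    return vec;
}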

int main(int argc, char const *argv[])
{
    string str1;
    getline(cin, str1); // text to search; a whole line, since it may contain spaces
    string str2;
    getline(cin, str2); // pattern to look for
    vector<int> vec(str2.size() + 1, 0); // room for the next table of the pattern
    vector<int> vec_test = KMP(str1, str2, vec);
    for (const auto v : vec_test)
        cout << v + 1 << endl; // print each match position, counting from 1
    return 0;
}
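
The program above works with a next table, while the walkthrough uses the partial match table. For the entries that correspond to pattern characters, next is the partial match table shifted one position to the right with -1 placed in front. A minimal sketch of that relationship follows; next_from_pmt is a hypothetical helper and is not part of the program above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: derives the next table (one entry per pattern
// character) from the partial match table of the same pattern.
std::vector<int> next_from_pmt(const std::vector<int> &pmt)
{
    std::vector<int> next(pmt.size(), -1); // next[0] stays -1
    for (std::size_t j = 1; j < pmt.size(); ++j)
        next[j] = pmt[j - 1];              // shift right by one position
    return next;
}
// For "ABCDABD": {0, 0, 0, 0, 1, 2, 0} becomes {-1, 0, 0, 0, 0, 1, 2}
```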


Origin blog.csdn.net/weixin_51216553/article/details/110140770