String matching algorithms (RK algorithm and KMP algorithm)

Algorithm 1: RK (Rabin-Karp) algorithm

Algorithm description:

(1) Compute the hash code of the pattern string

Method 1: add up the values of the characters one by one (a = 1, b = 2, ...);

Method 2: treat the string as a base-26 number and convert it to decimal, e.g. abc = 1x26^2 + 2x26^1 + 3x26^0;

Drawback of Method 2: for a long string, the corresponding decimal number becomes very large.
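As a rough sketch of the two hashing schemes (the function names `additiveHash` and `base26Hash` are illustrative and not part of the program below; both assume lowercase input with a = 1, ..., z = 26):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Method 1: sum of the letter values (a = 1, ..., z = 26).
int additiveHash(const std::string& s) {
    int sum = 0;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');  // assumption: lowercase letters only
        sum += c - 'a' + 1;
    }
    return sum;
}

// Method 2: read the string as a base-26 number.
// A 64-bit value can wrap around once the string is longer than
// 13 letters, which illustrates the drawback of this method.
std::uint64_t base26Hash(const std::string& s) {
    std::uint64_t value = 0;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');  // assumption: lowercase letters only
        value = value * 26 + static_cast<std::uint64_t>(c - 'a' + 1);
    }
    return value;
}
```

With these definitions, `additiveHash("abc")` is 1 + 2 + 3 = 6 and `base26Hash("abc")` is 1x26^2 + 2x26 + 3 = 731. Method 1 stays small but gives the same value for any reordering of the same letters, so it collides more often.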

(2) Compute the hash code of the main string incrementally

For example: main string: abbcefg; pattern string: bce

First compute the hash code of abb. For the second window, bbc, do not recompute from scratch: new hash code = old hash code - 'a' + 'c'
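The incremental update can be sketched as follows. This is only an illustration under the additive hash (Method 1); `windowHashes` is a hypothetical helper, not a function from the program below.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the additive hash (a = 1, ..., z = 26) of every window of
// length `len` in `text`, deriving each value from the previous one.
std::vector<int> windowHashes(const std::string& text, std::size_t len) {
    assert(len > 0 && len <= text.size());
    std::vector<int> hashes;
    int h = 0;
    for (std::size_t i = 0; i < len; ++i) {
        h += text[i] - 'a' + 1;  // hash of the first window
    }
    hashes.push_back(h);
    for (std::size_t i = len; i < text.size(); ++i) {
        // Drop the character leaving the window, add the one entering it.
        h = h - (text[i - len] - 'a' + 1) + (text[i] - 'a' + 1);
        hashes.push_back(h);
    }
    return hashes;
}
```

For the main string abbcefg and a window of length 3, this yields 5, 7, 10, 14, 18 for abb, bbc, bce, cef, efg. Each step costs O(1) instead of O(length of the pattern).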

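(3) Check for hash collisions: equal hash codes do not guarantee equal strings, so when the hash codes match, compare the window with the pattern character by character.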
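A minimal sketch of that verification step, assuming `std::string` inputs (`sameAt` is an illustrative name, not part of the program below):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True when the window of `text` starting at `start` really equals `pattern`.
bool sameAt(const std::string& text, std::size_t start, const std::string& pattern) {
    assert(start + pattern.size() <= text.size());
    return text.compare(start, pattern.size(), pattern) == 0;
}
```

For the pattern bce, the window cbe in abcbefgh has the same additive hash (2 + 3 + 5 = 10) but is a different string, so `sameAt("abcbefgh", 2, "bce")` is false, while `sameAt("abbcefgh", 2, "bce")` is true.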

//20200306
#include<iostream>
#include<cstdio>
#include<cstring>
using namespace std;
const int MAXN = 1000;
char pattern[MAXN],major[MAXN];
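// Additive hash of arr[begin, end): sum of the letter values, a = 1, ..., z = 26.
int hashcode(char* arr,int begin,int end){
    int sum = 0;
    for(int i=begin;i<end;i++){
        sum += (int)arr[i] - 'a' + 1; // ASCII of 'a' is 97
    }
    return sum;
}
// Slide the window to major[begin, end): drop major[begin-1], add major[end-1].
int changehashcode(int sumori,int begin,int end){
    return sumori - major[begin-1] + major[end-1]; // the "- 'a' + 1" offsets cancel out
}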
// Returns true if major[begin, end) differs from pattern,
// i.e. the equal hash codes were only a collision.
bool hashcollision(int begin,int end){
    for(int i=begin;i<end;i++){
        if(major[i] != pattern[i-begin]) return true;
    }
    return false;
}

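int main(){
    int pos = -1;
    scanf("%999s %999s",pattern,major); // input order: pattern first, then main string
    int lenp = strlen(pattern);
    int lenm = strlen(major);
    if(lenp > lenm){ // the pattern cannot occur in a shorter main string
        cout<<pos;
        return 0;
    }
    int hashp = hashcode(pattern,0,lenp);
    int hashm = hashcode(major,0,lenp);
    if(hashm == hashp && !hashcollision(0,lenp)) pos = 0; // match in the first window
    // The last window starts at lenm-lenp, so that index must be included.
    for(int i=1;pos==-1 && i<=lenm-lenp;i++){
        hashm = changehashcode(hashm,i,i+lenp);
        if(hashm != hashp) continue;
        else if(hashcollision(i,i+lenp)) continue;
        else {
            pos = i;
            break;
        }
    }
    cout<<pos;
    return 0;
}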
/* --- Test ---
bce
abbcefgh
// collision
bce
abcbefgh
*/


Algorithm 2: KMP algorithm

1. What is the next array?

　　next[i] is the last index of the prefix in the longest equal prefix/suffix pair of the substring s[0...i]. (Equivalently, next[i] + 1 is the length of the longest equal prefix and suffix.)

　　What are the prefix and suffix here? For the substring s[0...i], s[0...k] (k < i) is a prefix and s[i-k...i] is a suffix of the same length. That is, take k+1 characters counting forward from index 0, and k+1 characters counting backward from the last character. (Note that k = i is not allowed, i.e., the prefix/suffix cannot be the substring itself!)

For example: s[] = abababc

　　When i = 0, the substring is a. No k satisfies k < i = 0, so next[0] = -1;

　　When i = 1, the substring is ab. k = 0 < i = 1: prefix a, suffix b, not equal.

　　　　　　So next[1] = -1;

　　When i = 2, the substring is aba. k = 0 < i = 2: prefix a, suffix a, equal, so k = 0 qualifies;

　　　　　　　　　　　　k = 1 < i = 2: prefix ab, suffix ba, not equal.

　　　　　　Therefore the longest equal prefix/suffix (i.e., the largest k) is k = 0, and next[2] = 0;

　　When i = 3, the substring is abab. k = 0 < i = 3: prefix a, suffix b, not equal;

　　　　　　　　　　　　k = 1 < i = 3: prefix ab, suffix ab, equal, so k = 1 qualifies;

　　　　　　　　　　　　k = 2 < i = 3: prefix aba, suffix bab, not equal.

　　　　　　Therefore the longest equal prefix/suffix (i.e., the largest k) is k = 1, and next[3] = 1;

　　When i = 4 and i = 5, the same reasoning gives next[4] = 2 and next[5] = 3;

　　When i = 6 (the last index, since strlen(s) = 7), the substring is abababc. k = 0 < i = 6: prefix a, suffix c, not equal. Here there is no need to try larger k: every suffix ends in c, and c occurs nowhere else in s, so no prefix can equal a suffix, and next[6] = -1. (In general, a mismatch at one k does not rule out larger k, as the case i = 3 shows.)

　　Finally, we obtain the next array [-1, -1, 0, 1, 2, 3, -1].
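The hand computation above can be checked with a direct, brute-force translation of the definition. This is only a sketch for verification (roughly cubic time); `nextByDefinition` is an illustrative name, not something from the referenced book.

```cpp
#include <string>
#include <vector>

// next[i] = largest k < i with s[0..k] == s[i-k..i], or -1 if no such k exists.
std::vector<int> nextByDefinition(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> next(n, -1);
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < i; ++k) {
            // Compare the prefix s[0..k] with the suffix s[i-k..i];
            // k increases, so the last hit is the largest one.
            if (s.compare(0, k + 1, s, i - k, k + 1) == 0) {
                next[i] = k;
            }
        }
    }
    return next;
}
```

For s = abababc this returns [-1, -1, 0, 1, 2, 3, -1], matching the values derived by hand.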

2. How do we compute the next array? (Computing it by brute force, as in the example above, is too expensive, so we use the following recurrence.)

We compute next[i] from next[i-1]:

　　Suppose we have already computed next[i-1], and let it equal j (so the longest equal prefix/suffix of s[0...i-1] has length j+1). When we read one more character, the longest equal prefix/suffix can grow by at most one, that is, next[i] is at most next[i-1] + 1.

　　(To see this: write s[0...i-1] as xyx, where x is its longest equal prefix/suffix, of length strlen(x). After reading a new character a, the longest equal prefix/suffix of xyxa has length at most strlen(xa), and it equals strlen(xa) if and only if y[0] == a.)

　　In other words, we check whether the newly read character a equals y[0]. What is y[0]? It is the character at index j+1, one past the last index of the prefix, i.e., the character that immediately follows the longest equal prefix of the string read so far.

　　If a == y[0], then next[i] = j + 1;

　　If a != y[0], what should we do?

　　The naive idea is to restart matching from scratch (j = -1), but that is too brute-force. Instead, we fall back to the nearest smaller j that satisfies a == s[j+1]. In other words, in the string xyxa we look for the longest prefix of x that is also a suffix of x and is followed by a. Suppose the first x splits into waz and the second x splits into uw, so the whole string is wazyuwa; we want z to be as short as possible (w as long as possible). How do we find where w ends inside waz? We know the index j of the last character of waz, and by definition next[j] stores the last index of the longest equal prefix/suffix of waz. If the character right after that index is a, we are done; if not, we repeat j = next[j] until it is, or until j reaches -1. The longest equal prefix/suffix of wazyuwa is then wa, with length strlen(wa).
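The recurrence, including the fallback loop j = next[j], can be sketched on a `std::string` as follows. This is a minimal illustration under the same index convention (-1 means no equal prefix/suffix); `buildNext` is a hypothetical name.

```cpp
#include <string>
#include <vector>

// Computes next[] in linear time using next[i-1] to obtain next[i].
std::vector<int> buildNext(const std::string& s) {
    const int n = static_cast<int>(s.size());
    std::vector<int> next(n, -1);
    int j = -1;  // holds next[i-1] at the start of each iteration
    for (int i = 1; i < n; ++i) {
        // Fall back along next[] until s[j+1] matches s[i] or nothing is left.
        while (j != -1 && s[i] != s[j + 1]) {
            j = next[j];
        }
        if (s[i] == s[j + 1]) {
            ++j;  // extend the matched prefix by one character
        }
        next[i] = j;
    }
    return next;
}
```

For s = aabaaab the result is [-1, 0, -1, 0, 1, 1, 2]. The fallback is visible at i = 5: j = 1 fails because s[2] = b differs from s[5] = a, so j drops to next[1] = 0, where s[1] = a matches and next[5] becomes 1.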

3. What is the KMP algorithm?

　　Computing the next array as described above is a process of the string "matching against itself": for each substring s[0...i], we look for s[0...k] (k < i) and s[i-k...i] such that the two are exactly equal and k is as large as possible. String matching works the same way: given a pattern string pattern and a main string text, we look for substrings of text that equal pattern, and this process is similar to computing the next array.

　　Therefore, imitating the computation of next, we first compare the first character of text with the first character of pattern. If they are equal, we go on to compare the second characters, and so on. When a mismatch occurs, we need to shift pattern. If we shift by only one position each time, the method degenerates into the brute-force (BF) algorithm. So how far should we shift? We know that next[i] of the pattern records the last index k of the longest equal prefix/suffix of pattern[0...i], that is, pattern[0...k] and pattern[m...i] (m = i-k) are the same string. Suppose the mismatch occurs between pattern[i+1] and text[i+1], which means pattern[0...i] and text[0...i] match. Then pattern[m...i] and text[m...i] match, hence pattern[0...k] and text[m...i] match, and we can continue by comparing pattern[k+1] with text[i+1], where k + 1 equals next[i] + 1.
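The matching process can be sketched on `std::string` inputs like this. It is an illustration only: `kmpFindAll` is a hypothetical name, positions are 0-based, and returning no matches for an empty pattern is an assumption of this sketch.

```cpp
#include <string>
#include <vector>

// Returns the 0-based start index of every occurrence of pattern in text.
std::vector<int> kmpFindAll(const std::string& text, const std::string& pattern) {
    std::vector<int> positions;
    const int m = static_cast<int>(pattern.size());
    if (m == 0) return positions;  // assumption: empty pattern gives no matches

    // Build next[] for the pattern (same recurrence as for a single string).
    std::vector<int> next(m, -1);
    for (int i = 1, j = -1; i < m; ++i) {
        while (j != -1 && pattern[i] != pattern[j + 1]) j = next[j];
        if (pattern[i] == pattern[j + 1]) ++j;
        next[i] = j;
    }

    int j = -1;  // pattern[0..j] matches the text ending just before index i
    for (int i = 0; i < static_cast<int>(text.size()); ++i) {
        // On a mismatch, shift the pattern by falling back along next[].
        while (j != -1 && text[i] != pattern[j + 1]) j = next[j];
        if (text[i] == pattern[j + 1]) ++j;
        if (j == m - 1) {
            positions.push_back(i - j);  // full match found
            j = next[j];                 // allow overlapping matches
        }
    }
    return positions;
}
```

For text = abababc and pattern = abab, this returns [0, 2]: after the first match, j falls back to next[3] = 1 rather than restarting, which is how the overlapping occurrence at index 2 is found without re-reading text.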

4. Summary

What exactly does the next array mean?

(1) next[i] is the last index of the prefix in the longest equal prefix/suffix pair of the substring s[0...i];

(2) next[i] + 1 is the length of the longest equal prefix/suffix of the substring s[0...i];

(3) when matching the main string against the pattern, next[j] is the position that j falls back to when position j+1 fails to match.

5. Practice problem for the KMP algorithm: Luogu OJ

AC Code:

//20200306
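#include<iostream>
#include<cstdio>
#include<cstring>
// No "using namespace std;" here: a global array named next can clash with std::next.
const int MAXN = 1000000;
char pattern[MAXN],text[MAXN];
int next[MAXN];  // next[i]: last index of the longest equal prefix/suffix of pattern[0..i], or -1
int value[MAXN]; // 1-based start positions of the matches (result format for Luogu OJ)
// Build next[] for pattern with the recurrence described above.
void getNext(){
    int len = std::strlen(pattern);
    next[0] = -1;
    int j = next[0];
    for(int i=1;i<len;i++){
        // fall back until pattern[j+1] matches pattern[i] or no prefix is left
        while(j!=-1 && pattern[i] != pattern[j+1]){
            j = next[j];
        }
        if(pattern[i] == pattern[j+1]) j++;
        next[i] = j;
    }
}
// Find every occurrence of pattern in text; returns the number of matches.
int KMP(){
    getNext();
    int lenpt = std::strlen(pattern);
    int lentx = std::strlen(text);
    int j = next[0];
    int ans = 0;
    for(int i=0;i<lentx;i++){
        while(j!=-1 && text[i] != pattern[j+1]){
            j = next[j];
        }
        if(text[i] == pattern[j+1]) j++;
        if(j == lenpt - 1){
            value[ans++] = i+1-j; // full match: i-j is the 0-based start, +1 makes it 1-based
            j = next[j];          // keep searching for further (possibly overlapping) matches
        }
    }
    return ans;
}
int main(){
    std::scanf("%s %s",text,pattern); // input order: main string first, then pattern
    int ans = KMP();
    for(int i=0;i<ans;i++) std::cout<<value[i]<<'\n';
    int len = std::strlen(pattern);
    // adjust to Luogu OJ: print next[i]+1, the length of the longest equal prefix/suffix
    for(int i=0;i<len;i++) std::cout<<next[i]+1<<" ";
    return 0;
}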


References:

[1] Hu Fan, "Algorithm Notes", pp. 455-464