KMP pattern matching algorithm

Introduction
Brute-force string matching is tedious and slow, and a few senior researchers worked out a new algorithm to replace it. The three are D. E. Knuth, J. H. Morris and V. R. Pratt, so the algorithm is called the Knuth-Morris-Pratt algorithm, or the KMP algorithm for short.
Core idea
The index into the main string only moves forward, one character at a time, and never backtracks. A next array is introduced for the pattern string; it records, for each position, where in the pattern matching should resume when the character at that position fails to match.
Algorithm walkthrough
The KMP algorithm is really an improvement on the brute-force matching algorithm, so we start from brute force to understand how KMP matches.
Example:
main string S = "abcdefgab"
pattern string T = "abcdex"
Brute-force matching process
i is the position in the main string, j is the position in the pattern string.

  1. The first five characters match and the sixth does not. At this point i = 5, j = 5.
    abcdefgab
    abcdex

  2. Now i = 1, j = 0.
    abcdefgab
     abcdex

  3. i = 2, j = 0.
    abcdefgab
      abcdex

  4. i = 3, j = 0.
    abcdefgab
       abcdex

  5. i = 4, j = 0.
    abcdefgab
        abcde...

  6. i = 5, j = 0.
    abcdefgab
         abcd...
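
The six steps above are what a plain brute-force matcher does. Here is a minimal sketch of it in C. The function name bruteForceIndex and the comparisons out-parameter are illustrative assumptions, not part of the original post. Note how i is reset to i - j + 1 after every mismatch; that backtracking is what KMP removes.

```c
#include <string.h>

/* Hypothetical helper (not from the original post): brute-force search.
   Returns the index of T in S, or -1 if T does not occur.
   *comparisons receives the number of character comparisons made. */
int bruteForceIndex(const char *S, const char *T, int *comparisons)
{
    int sLen = (int)strlen(S);
    int tLen = (int)strlen(T);
    int i = 0; /* position in the main string */
    int j = 0; /* position in the pattern string */

    *comparisons = 0;
    while (i < sLen && j < tLen) {
        ++*comparisons;
        if (S[i] == T[j]) {
            ++i;
            ++j;
        } else {
            i = i - j + 1; /* back up to one past the previous start */
            j = 0;         /* restart the pattern */
        }
    }
    return (j == tLen) ? i - tLen : -1;
}
```

For S = "abcdefgab" and T = "abcdex" this sketch returns -1 after 14 character comparisons.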
In the matching process above, the first character "a" of the pattern differs from every character of the "bcdex" that follows it, i.e.
T[0] != T[1]
T[0] != T[2]
T[0] != T[3]
T[0] != T[4]
and step 1 has already established that
S[0] == T[0]
S[1] == T[1]
S[2] == T[2]
S[3] == T[3]
S[4] == T[4]
so it follows that
T[0] != S[1]
T[0] != S[2]
T[0] != S[3]
T[0] != S[4]
Steps 2, 3, 4 and 5 are therefore completely unnecessary: given step 1 (note that this is the premise), we can go straight to step 6.
(Why must step 6 be kept? Because T[0] != T[5], and step 1 showed T[5] != S[5], so nothing can be concluded about the relationship between T[0] and S[5].)
From step 1 to step 6 the main-string index i stays at 5 and never goes back; only the value of j needs to be considered.
So far T[0] differed from all the characters after it. What happens if some of them are equal?
See the next example:
main string S = "abcababca"
pattern string T = "abcabx"

  1. The first five characters match and the sixth does not. At this point i = 5, j = 5.
    abcababca
    abcabx

  2. i = 1, j = 0.
    abcababca
     abcabx

  3. i = 2, j = 0.
    abcababca
      abcabx

  4. i = 3, j = 0.
    abcababca
       abcabx

  5. i = 4, j = 1.
    abcababca
       abcabx

  6. i = 5, j = 2.
    abcababca
       abcabx

By the same reasoning as in the first example, steps 2 and 3 can be omitted.
Steps 4 and 5 can be skipped for a similar reason: within the pattern T[0] == T[3], and step 1 showed T[3] == S[3], therefore T[0] == S[3] (likewise T[1] == T[4] == S[4]), so these comparisons are already known to succeed.
We can go directly to step 6.

Summary
Combining examples 1 and 2: once a mismatch occurs, the pattern position that failed to match decides where to jump.
How far back to jump is determined by how much the pattern repeats itself, specifically the repetition within the characters before the mismatch: the higher the degree of repetition, the farther the jump position is from the first character.
(An optimization for patterns with a very high degree of repetition is discussed later.)

The KMP algorithm introduces the next array to record, for each pattern character, which position j should jump to if that character fails to match the main string.

Determining the position: take the length of the longest string common to the set of prefixes and the set of suffixes of the substring before the mismatch.
Two concepts are needed here.
Prefix: a leading part of the string that must contain the first character and must not include the last character,
such as: the prefixes of ABDD are A, AB, ABD
Suffix: a trailing part of the string that must contain the last character and must not include the first character,
such as: the suffixes of ABDD are D, DD, BDD
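
The two definitions above can be checked directly by brute force. The sketch below is only an illustration of the definition, not how KMP builds its table in practice; the function name longestCommonPrefixSuffix is an assumption made for this example.

```c
#include <string.h>

/* Hypothetical helper (not from the original post): length of the longest
   string that is both a prefix and a suffix, in the sense defined above,
   of the first n characters of s. Returns 0 if there is none. */
int longestCommonPrefixSuffix(const char *s, int n)
{
    int k;
    /* try the longest candidates first; k = n is excluded by definition */
    for (k = n - 1; k > 0; --k) {
        /* prefix s[0..k-1] versus suffix s[n-k..n-1] */
        if (strncmp(s, s + n - k, (size_t)k) == 0) {
            return k;
        }
    }
    return 0;
}
```

For "ABDD" no prefix equals any suffix, so the result is 0. For "abca" the result is 1 (the common string "a"), and for "abcab" it is 2 ("ab").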
Summary
The next values corresponding to examples 1 and 2:
          a  b  c  d  e  x
next[i]  -1  0  0  0  0  0

where
the first character a has no string before it; this case is meaningless and is represented by -1
the second character b is preceded by the string "a", which has no prefix and no suffix, hence no common string: 0
the third character c is preceded by the string "ab", with prefix "a" and suffix "b"; they are not the same: 0
and so on
          a  b  c  a  b  a  b  c  a
next[i]  -1  0  0  0  1  2  1  2  3
where
the first character a has no string before it; this case is meaningless and is represented by -1
the second character b is preceded by the string "a", which has no prefix and no suffix, hence no common string: 0
the third character c is preceded by the string "ab", with prefix "a" and suffix "b"; they are not the same: 0
the fourth character a is preceded by the string "abc", with prefixes "a", "ab" and suffixes "c", "bc"; none are the same: 0
the fifth character b is preceded by the string "abca", with prefixes "a", "ab", "abc" and suffixes "a", "ca", "bca"; the longest common string is "a", of length 1: 1
the sixth character a is preceded by the string "abcab", with prefixes "a", "ab", "abc", "abca" and suffixes "b", "ab", "cab", "bcab"; the longest common string is "ab", of length 2: 2
and so on
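
The tables above can be reproduced with the usual incremental construction. This is a sketch of the plain table only, without the optimization discussed later; the function name buildPlainNext is an assumption made for this example, and the caller is assumed to provide an array with at least one entry per pattern character.

```c
#include <string.h>

/* Hypothetical helper (not from the original post): fills next[0..len-1]
   with the plain table: next[0] = -1, and next[i] is the length of the
   longest common prefix/suffix of the i characters before T[i]. */
void buildPlainNext(const char *T, int *next)
{
    int len = (int)strlen(T);
    int i = 0;  /* position whose successor is being filled in */
    int j = -1; /* length of the current common prefix/suffix, or -1 */

    if (len == 0) {
        return; /* nothing to fill for an empty pattern */
    }
    next[0] = -1;
    while (i < len - 1) {
        if (j == -1 || T[i] == T[j]) {
            ++i;
            ++j;
            next[i] = j;  /* common prefix/suffix grew by one (or restarted at 0) */
        } else {
            j = next[j];  /* fall back to a shorter common prefix/suffix */
        }
    }
}
```

Called on "abcdex" it fills in -1 0 0 0 0 0, and on "abcababca" it fills in -1 0 0 0 1 2 1 2 3.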
Now consider this case:
T         a  a  a  a  c
next[i]  -1  0  1  2  3
Suppose that, while comparing the pattern with the main string, the fourth character T[3] fails to match.
According to the next value 2 of this "a", j jumps to T[2], and T[2] is then compared with the same main-string character.
Do you think that is necessary?
It is certainly a waste of time. Because T[0] == T[1] == T[2] == T[3], if T[3] does not match, no pattern character equal to T[3] can match either. So if the character at the jump target is the same, we go on to its "father" (the jump target's own jump target); if the "father" is also the same, we go to its "grandfather", and so on up through its "ancestors".
Since that is so, we can handle it while building the table: if the character at position i looks the same as the character at its jump target j (T[i] == T[j]), then i simply takes over j's jump target (next[i] = next[j]). Applied layer by layer, every position then points directly at the ancestor it would have reached anyway, however distant that relative is.
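The code is as follows:
#include <string.h>

// Fills Next[0..len-1] for the pattern S, using the optimization described above
void getNext(char *S, int *Next)
{
    int len = strlen(S);
    int i = 0;
    int j = -1;
    Next[0] = -1;  // initialize: nothing precedes the first character
    while (i < len - 1)  // stop after Next[len - 1], so nothing is written past the table
    {
        if (j == -1 || S[i] == S[j])
        {
            ++i;
            ++j;
            if (S[i] == S[j])
            {
                Next[i] = Next[j];  // same character: take over j's jump target
            }
            else
                Next[i] = j;
        }
        else
        {
            j = Next[j];  // fall back within the pattern
        }
    }
}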
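
The "father" and "grandfather" idea can also be written as a second pass over a plain next table, which makes the inheritance step easier to see. This is a sketch under that assumption; the function name optimizeNext and the two-pass layout are illustrative and not taken from the original post.

```c
#include <string.h>

/* Hypothetical helper (not from the original post): derives the optimized
   table nextval from a plain table next for the pattern T.
   Both arrays must hold at least strlen(T) entries. */
void optimizeNext(const char *T, const int *next, int *nextval)
{
    int len = (int)strlen(T);
    int i;

    if (len == 0) {
        return; /* nothing to do for an empty pattern */
    }
    nextval[0] = -1;
    for (i = 1; i < len; ++i) {
        int k = next[i]; /* the "father": where j would jump from i */
        /* If the father holds the same character, it would mismatch too,
           so take over the jump target already computed for the father. */
        nextval[i] = (T[i] == T[k]) ? nextval[k] : k;
    }
}
```

For T = "aaaac" with the plain table -1 0 1 2 3, this produces -1 -1 -1 -1 3: a mismatch on any of the four "a" characters skips every earlier "a" in one jump.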

Searching the main string for the pattern:
#include <stdlib.h>

// Returns the position of T in S, or -1 if T does not occur
int indexKmp(char *S, char *T)
{
    int i = 0;
    int j = 0;
    int sLen = strlen(S);
    int tLen = strlen(T);
    // sized from the pattern length rather than a fixed 32 entries
    int *Next = malloc((tLen + 1) * sizeof(int));
    if (Next == NULL)
    {
        return -1;  // allocation failed
    }
    getNext(T, Next);
    while (i < sLen && j < tLen)
    {
        if (j == -1 || S[i] == T[j])
        {
            ++i;
            ++j;
        }
        else
        {
            j = Next[j];  // i stays where it is; only j falls back
        }
    }
    free(Next);
    if (j == tLen)
    {
        return i - tLen;
    }
    else
    {
        return -1;
    }
}
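
To see that the main-string index never moves backwards, here is a self-contained sketch that runs a KMP search with the plain next table and counts character comparisons. The name kmpIndexCounted and the comparisons out-parameter are assumptions made for this illustration; the sketch also returns -1 if memory allocation fails.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the original post): KMP search using the
   plain next table. Returns the index of T in S, or -1 if T does not occur.
   *comparisons receives the number of character comparisons made. */
int kmpIndexCounted(const char *S, const char *T, int *comparisons)
{
    int sLen = (int)strlen(S);
    int tLen = (int)strlen(T);
    int i = 0;
    int j = -1;
    int result;
    int *next = malloc(((size_t)tLen + 1) * sizeof *next);

    *comparisons = 0;
    if (next == NULL) {
        return -1; /* allocation failed */
    }

    /* build the plain next table for T */
    next[0] = -1;
    while (i < tLen - 1) {
        if (j == -1 || T[i] == T[j]) {
            ++i;
            ++j;
            next[i] = j;
        } else {
            j = next[j];
        }
    }

    /* scan S; i only ever increases */
    i = 0;
    j = 0;
    while (i < sLen && j < tLen) {
        if (j == -1) {
            ++i;   /* pattern restarted: advance without comparing */
            j = 0;
        } else {
            ++*comparisons;
            if (S[i] == T[j]) {
                ++i;
                ++j;
            } else {
                j = next[j]; /* only j falls back */
            }
        }
    }
    result = (j == tLen) ? i - tLen : -1;
    free(next);
    return result;
}
```

For S = "abcdefgab" and T = "abcdex" this sketch returns -1 after 10 character comparisons, and for S = "abcabcabx" and T = "abcabx" it returns 3.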

Origin blog.csdn.net/weixin_41369611/article/details/94723311