Understanding the KMP algorithm

 

Introducing KMP matching by way of brute-force matching

Brute-force matching algorithm


 Problem: given a text string S and a pattern string P, find the position of P in S.

 With brute-force matching, suppose the text string S has been matched up to position i and the pattern string P up to position j. Then:

  • If the current characters match (i.e., S[i] == P[j]), then i++, j++, and matching continues with the next character;
  • If they do not match (i.e., S[i] != P[j]), set i = i - (j - 1) and j = 0. That is, on every failed match, i backtracks and j is reset to 0. (The j characters that had already matched are all thrown away.)
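The two rules above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation; the function name `brute_force_search` is chosen here for the example and does not come from the original text.

```python
def brute_force_search(s: str, p: str) -> int:
    """Return the index in s where p first occurs, or -1 if it does not occur."""
    i, j = 0, 0
    while i < len(s) and j < len(p):
        if s[i] == p[j]:
            # Current characters match: advance both positions.
            i += 1
            j += 1
        else:
            # Mismatch: i backtracks to one past the previous start, j restarts at 0.
            i = i - (j - 1)
            j = 0
    # The whole pattern was matched exactly when j reached len(p).
    return i - j if j == len(p) else -1


print(brute_force_search("BBC ABCDAB ABCDABCDABDE", "ABCDABD"))  # prints 15
```

The backtracking step `i = i - (j - 1)` is what makes the worst case cost proportional to len(S) * len(P).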

 

An example (S on top, P below):

Matching starts from here:

 

Matching proceeds up to this point:

 

Here the characters do not match:

 

i rolls back and matching starts over:

 

The problem with this rollback is:

Before matching step 4, we already know that S[5] = P[1] = B and that P[0] = A, i.e., P[1] != P[0]. So S[5] cannot be equal to P[0], and backtracking to that position is bound to end in a mismatch.

That is, we already knew that the B in the red box in S cannot match the A in the red box in P, so backtracking there is wasted work.

Is there an algorithm in which i never backtracks and only j moves?

Yes: the KMP algorithm. It uses the information gained from the part that has already matched, keeps i from backtracking, and changes only j so that the pattern string jumps as far as possible to a position that can still lead to a match.

 

KMP algorithm

The Knuth-Morris-Pratt string search algorithm, "the KMP algorithm" for short, is commonly used to find where a pattern string P occurs in a text string S. It was published jointly in 1977 by Donald Knuth, Vaughan Pratt, and James H. Morris, and is named after the three of them.

We first state the procedure of the KMP algorithm directly (details follow later):

Suppose the text string S has been matched up to position i and the pattern string P up to position j:

  • If j = -1, or the current characters match (i.e., S[i] == P[j]), then i++, j++, and matching continues with the next character;
  • If j != -1 and the current characters do not match (i.e., S[i] != P[j]), then i stays unchanged and j = next[j]. This means that on a mismatch the pattern string P moves right by j - next[j] positions relative to the text string S. (Unlike brute-force matching above, which moves i.)
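The procedure above might be sketched in Python as shown below. The sketch assumes the common convention that next[0] = -1 and that next[j] is the maximum common prefix/suffix length of P[0..j-1]. The function name `kmp_search` and the hard-coded table `NEXT_ABCDABD` are illustrative assumptions, not part of the original text.

```python
def kmp_search(s: str, p: str, next_table: list[int]) -> int:
    """Return the index in s where p first occurs, or -1 if it does not occur.

    next_table is assumed to follow the convention next_table[0] == -1.
    """
    i, j = 0, 0
    while i < len(s) and j < len(p):
        if j == -1 or s[i] == p[j]:
            # Either the pattern restarts (j == -1) or the characters match.
            i += 1
            j += 1
        else:
            # Mismatch: i stays where it is, only j falls back.
            j = next_table[j]
    return i - j if j == len(p) else -1


# Assumed next array for the pattern "ABCDABD" under the convention above.
NEXT_ABCDABD = [-1, 0, 0, 0, 0, 1, 2]

print(kmp_search("BBC ABCDAB ABCDABCDABDE", "ABCDABD", NEXT_ABCDABD))  # prints 15
```

Note that i is never decreased anywhere in the loop, which is the difference from the brute-force version.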

Example:

Look again at the example above, at the point where the mismatch is found:

 

This time we move to here (i does not change and the pattern string moves 4 positions to the right, so j goes from 6 to 2. Why 4? Because after moving 4 positions, the "AB" at the start of the pattern string lines up with S[8] S[9] and matching can continue, so i does not need to backtrack). The idea of the KMP algorithm is to make use of this known information: the "search position" is never moved back over positions that have already been compared, it only keeps moving forward, and this improves efficiency.

 

It can be seen that, on a mismatch, the number of positions the pattern string moves to the right = number of characters already matched - maximum length value of the character just before the mismatched character

 

Complete example:

Given the text string "BBC ABCDAB ABCDABCDABDE" and the pattern string "ABCDABD", we now match the pattern string against the text string.

Number of positions to shift = number of characters already matched - the corresponding partial match value   ==> maximum length table / next array

The table of maximum common prefix/suffix lengths for each prefix substring of the pattern string (the maximum length table / next array) is:

Pattern character:  A  B  C  D  A  B  D
Maximum length:     0  0  0  0  1  2  0

Let us look at how this table is built:

The maximum length (also called the "partial match value") is the length of the longest element shared by the set of "prefixes" and the set of "suffixes". Take "ABCDABD" as an example:

  - the prefixes and suffixes of "A" are both the empty set, so the length of the common element is 0;

  - the prefixes of "AB" are [A] and the suffixes are [B], so the length of the common element is 0;

  - the prefixes of "ABC" are [A, AB] and the suffixes are [BC, C], so the length of the common element is 0;

  - the prefixes of "ABCD" are [A, AB, ABC] and the suffixes are [BCD, CD, D], so the length of the common element is 0;

  - the prefixes of "ABCDA" are [A, AB, ABC, ABCD] and the suffixes are [BCDA, CDA, DA, A]; the common element is "A", of length 1;

  - the prefixes of "ABCDAB" are [A, AB, ABC, ABCD, ABCDA] and the suffixes are [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", of length 2;

  - the prefixes of "ABCDABD" are [A, AB, ABC, ABCD, ABCDA, ABCDAB] and the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D], so the length of the common element is 0.
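The enumeration above can be reproduced with a short Python sketch that follows the definition literally: it builds the prefix set and the suffix set of each substring and takes the longest common element. It is written for clarity rather than speed, and the names `max_length_table` and `next_from_table` are illustrative. The conversion to a next array assumes the convention next[0] = -1, with the table shifted right by one position.

```python
def max_length_table(p: str) -> list[int]:
    """Maximum common prefix/suffix length for p[:1], p[:2], ..., p itself."""
    table = []
    for end in range(1, len(p) + 1):
        sub = p[:end]
        # Proper prefixes and suffixes: the whole substring itself is excluded.
        prefixes = {sub[:k] for k in range(1, len(sub))}
        suffixes = {sub[k:] for k in range(1, len(sub))}
        common = prefixes & suffixes
        table.append(max((len(c) for c in common), default=0))
    return table


def next_from_table(table: list[int]) -> list[int]:
    """Shift the table right by one and put -1 in front (assumed convention)."""
    return [-1] + table[:-1]


table = max_length_table("ABCDABD")
print(table)                   # prints [0, 0, 0, 0, 1, 2, 0]
print(next_from_table(table))  # prints [-1, 0, 0, 0, 0, 1, 2]
```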

 

 

 Here again is the algorithm procedure from above:

  • If j = -1, or the current characters match (i.e., S[i] == P[j]), then i++, j++, and matching continues with the next character;
  • If j != -1 and the current characters do not match (i.e., S[i] != P[j]), then i stays unchanged and j = next[j]. This means that on a mismatch the pattern string P moves right by j - next[j] positions relative to the text string S. (Unlike brute-force matching above, which moves i.)

 

The matching process for this example is as follows:

 

1. The character A at the start of the pattern string does not match the characters B, B, C, and space at the start of the text string, so the pattern string simply keeps moving one position to the right, until the A of the pattern string matches the fifth character of the text string:

 

2. Matching continues with the following characters. When the last character D of the pattern string fails to match the text string, the pattern string has to move right. But by how many positions? At this point six characters have been matched (ABCDAB), and the maximum length value of the character B just before the mismatched D is 2. So by the earlier conclusion, the shift to the right is 6 - 2 = 4.


3. After the pattern string has moved 4 positions to the right, a mismatch is found again at C. Two characters have been matched (AB), and the maximum length value of the character B is 0, so the shift to the right is 2 - 0 = 2.

 

4. A does not match the space, so the pattern string moves one position to the right.

 

5. The comparison continues until D and C are found not to match. The shift to the right is the number of matched characters, 6, minus the maximum length value 2 of the character B, i.e., 6 - 2 = 4.

 

6. After step 5, the remaining characters CDABD also match, so a complete match is found and the process ends.
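The steps above can be replayed with a small Python sketch that records every shift. It uses the shift rule (number of matched characters minus the maximum length value of the last matched character). The table is hard-coded for "ABCDABD", and the function name `trace_shifts` is a hypothetical helper for this illustration.

```python
def trace_shifts(s: str, p: str, table: list[int]) -> tuple[list[int], int]:
    """Slide p along s and return (list of shifts, match index or -1)."""
    shifts = []
    start = 0    # position in s aligned with p[0]
    matched = 0  # number of pattern characters currently known to match
    while start + len(p) <= len(s):
        while matched < len(p) and s[start + matched] == p[matched]:
            matched += 1
        if matched == len(p):
            return shifts, start
        if matched == 0:
            keep, shift = 0, 1
        else:
            keep = table[matched - 1]  # characters that stay matched after the shift
            shift = matched - keep
        shifts.append(shift)
        start += shift
        matched = keep                 # start + matched is unchanged: no backtracking
    return shifts, -1


TABLE_ABCDABD = [0, 0, 0, 0, 1, 2, 0]  # maximum length table for "ABCDABD"
shifts, index = trace_shifts("BBC ABCDAB ABCDABCDABDE", "ABCDABD", TABLE_ABCDABD)
print(shifts)  # prints [1, 1, 1, 1, 4, 2, 1, 4]
print(index)   # prints 15
```

The four leading shifts of 1 correspond to step 1, and the shifts 4, 2, 1, 4 correspond to steps 2 through 5.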

As the matching process shows, the key is to find the maximum length of the identical prefix and suffix in the pattern string, and to shift the pattern on that basis.

 


Origin www.cnblogs.com/shona/p/12571167.html