[Repost] The KMP string matching algorithm

Author: Ruan Yifeng

Date: May 1, 2013

String matching is one of the basic tasks a computer performs.

For example, given the string "BBC ABCDAB ABCDABCDABDE", I want to know whether it contains another string, "ABCDABD".

Many algorithms can accomplish this task, and the Knuth-Morris-Pratt algorithm (KMP for short) is one of the most commonly used. It is named after its three inventors; the K stands for the well-known computer scientist Donald Knuth.

This algorithm is not easy to understand. There are many explanations online, but they are all hard to read. It was not until I read Jake Boxer's article that I truly understood the algorithm. Below, I try to explain KMP in my own words, in a way that is easier to follow.

1.

First, compare the first character of the string "BBC ABCDAB ABCDABCDABDE" with the first character of the search term "ABCDABD". Because B and A do not match, the search term shifts one position to the right.

2.

Because B and A again do not match, the search term shifts one more position to the right.

3.

This continues until a character in the string is the same as the first character of the search term.

4.

Then compare the next character of the string with the next character of the search term; they are the same as well.

5.

This continues until a character in the string differs from the corresponding character of the search term.

6.

At this point, the most natural reaction is to shift the whole search term one position to the right and compare character by character from the start again. This works, but it is inefficient, because the "search position" has to move back to positions that have already been compared, and those comparisons are repeated.
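
The shift-by-one approach described above can be sketched in Python as follows. This is a minimal illustration, not code from the original article; the function name `naive_search` is made up for this example.

```python
def naive_search(text, word):
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    for start in range(n - m + 1):
        # Every new alignment restarts the comparison at word[0],
        # re-reading characters of text that were already compared.
        matched = 0
        while matched < m and text[start + matched] == word[matched]:
            matched += 1
        if matched == m:
            return start
    return -1


print(naive_search("BBC ABCDAB ABCDABCDABDE", "ABCDABD"))  # prints 15
```

The result is a 0-based index, so 15 means the match begins at the sixteenth character of the string.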

7.

A basic fact is that when the space and D fail to match, you already know that the previous six characters are "ABCDAB". The idea of the KMP algorithm is to make use of this known information: instead of moving the "search position" back to positions that have already been compared, it keeps moving forward, which improves efficiency.

8.

How can this be done? For the search term, you can compute a "Partial Match Table". How this table is generated is explained later; for now we simply use it.

  Search term:          A  B  C  D  A  B  D
  Partial match value:  0  0  0  0  1  2  0

9.

When the space and D fail to match, we know that the previous six characters, "ABCDAB", did match. Looking at the table, the "partial match value" of the last matched character, B, is 2, so the number of positions to shift is given by the following formula:

  Shift = number of matched characters - partial match value of the last matched character

Because 6 - 2 equals 4, the search term shifts 4 positions to the right.
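
The formula can be written as a small Python helper. This is only a sketch: the name `shift` and the hard-coded `TABLE` are illustrative, the table values are the ones derived for "ABCDABD" in the prefix and suffix steps of this article, and the one-position shift when nothing has matched is taken from the walkthrough rather than from the formula itself.

```python
# Partial match values for A, B, C, D, A, B, D in "ABCDABD".
TABLE = [0, 0, 0, 0, 1, 2, 0]


def shift(matched, table):
    """Number of positions to shift after `matched` characters matched."""
    if matched == 0:
        # Nothing matched, so the formula does not apply: shift by one.
        return 1
    # Shift = matched characters - partial match value of the last matched one.
    return matched - table[matched - 1]


print(shift(6, TABLE))  # 6 - 2, prints 4
```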

10.

Because the space and C do not match, the search term has to shift again. This time the number of matched characters is 2 ("AB"), and the corresponding partial match value is 0. So shift = 2 - 0 = 2, and the search term shifts 2 positions to the right.

11.

Because the space and A do not match, the search term shifts one more position to the right.

12.

Compare character by character until C and D are found not to match. Then shift = 6 - 2 = 4, and the search term shifts another 4 positions to the right.

13.

Compare character by character up to the last character of the search term. Everything matches, so the search is complete. If you want to keep searching (that is, to find all matches), shift = 7 - 0 = 7 and the search term shifts 7 positions to the right; this is not repeated here.
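
Putting the steps so far together, the whole search can be sketched in Python roughly as below. The names `partial_match_table` and `kmp_search` are hypothetical, and the table is built directly from the prefix/suffix definition rather than by the faster method real implementations usually use.

```python
def partial_match_table(word):
    """Partial match value for each leading part of word, by definition."""
    table = []
    for end in range(1, len(word) + 1):
        part = word[:end]
        value = 0
        for k in range(1, end):
            # A prefix of length k that is also a suffix of length k.
            if part[:k] == part[end - k:]:
                value = k
        table.append(value)
    return table


def kmp_search(text, word):
    """Return the start index of every occurrence of word in text."""
    table = partial_match_table(word)
    found = []
    start = 0    # where the search term is currently aligned in text
    matched = 0  # characters of word already known to match here
    while start + len(word) <= len(text):
        # Resume at the first character not yet known to match;
        # the position in text never moves backwards.
        while matched < len(word) and text[start + matched] == word[matched]:
            matched += 1
        if matched == len(word):
            found.append(start)
        if matched == 0:
            start += 1
        else:
            keep = table[matched - 1]
            start += matched - keep  # shift = matched - partial match value
            matched = keep           # the overlap is already known to match
    return found


print(kmp_search("BBC ABCDAB ABCDABCDABDE", "ABCDABD"))  # prints [15]
```

On the example string, the alignments visited after the first partial match are 4, 8, 10, 11 and 15, which are the shifts of 4, 2, 1 and 4 from the steps above.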

14.

Next, here is how the "Partial Match Table" is generated.

First, two concepts need to be understood: "prefix" and "suffix". The prefixes of a string are all of its leading substrings that leave out the last character; the suffixes are all of its trailing substrings that leave out the first character.
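
As a quick illustration of these two definitions, here is a short Python sketch. The helper names `prefixes` and `suffixes` are made up for this example.

```python
def prefixes(s):
    """All leading substrings of s that leave out the last character."""
    return [s[:k] for k in range(1, len(s))]


def suffixes(s):
    """All trailing substrings of s that leave out the first character."""
    return [s[k:] for k in range(1, len(s))]


print(prefixes("ABCD"))  # prints ['A', 'AB', 'ABC']
print(suffixes("ABCD"))  # prints ['BCD', 'CD', 'D']
```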

15.

The "partial match value" is the length of the longest element that the prefixes and the suffixes have in common. Taking "ABCDABD" as an example:

  - the prefixes and suffixes of "A" are both empty sets, so the length of the common element is 0;

  - the prefixes of "AB" are [A] and the suffixes are [B], so the length of the common element is 0;

  - the prefixes of "ABC" are [A, AB] and the suffixes are [BC, C], so the length of the common element is 0;

  - the prefixes of "ABCD" are [A, AB, ABC] and the suffixes are [BCD, CD, D], so the length of the common element is 0;

  - the prefixes of "ABCDA" are [A, AB, ABC, ABCD] and the suffixes are [BCDA, CDA, DA, A]; the common element is "A", of length 1;

  - the prefixes of "ABCDAB" are [A, AB, ABC, ABCD, ABCDA] and the suffixes are [BCDAB, CDAB, DAB, AB, B]; the common element is "AB", of length 2;

  - the prefixes of "ABCDABD" are [A, AB, ABC, ABCD, ABCDA, ABCDAB] and the suffixes are [BCDABD, CDABD, DABD, ABD, BD, D], so the length of the common element is 0.
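
The list above can be reproduced with a few lines of Python that follow the definition literally. This is a sketch for checking the values, not an efficient way to build the table, and the name `partial_match_value` is illustrative.

```python
def partial_match_value(s):
    """Length of the longest element shared by the prefixes and suffixes of s."""
    prefixes = {s[:k] for k in range(1, len(s))}
    suffixes = {s[k:] for k in range(1, len(s))}
    common = prefixes & suffixes
    # With no common element the value is 0.
    return max((len(c) for c in common), default=0)


word = "ABCDABD"
table = [partial_match_value(word[:end]) for end in range(1, len(word) + 1)]
print(table)  # prints [0, 0, 0, 0, 1, 2, 0]
```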

16.

The essence of a "partial match" is that the head and the tail of a string sometimes repeat. For example, "ABCDAB" contains two occurrences of "AB", so its partial match value is 2 (the length of "AB"). When the search term shifts, moving the first "AB" 4 positions to the right (string length - partial match value) brings it to the position of the second "AB".

(End)

Reposted at: https://www.cnblogs.com/ericsun/p/3334084.html

Origin blog.csdn.net/weixin_34352449/article/details/93154976