Detailed explanation of the KMP string matching algorithm in data structures (explained through worked problems, simple and easy to understand)

If you have any questions, please feel free to leave a message in the comment area~~~

I recently came across the KMP string matching algorithm while reviewing data structures. After browsing many articles on the Internet, I felt that most of them are not clear or easy to follow, especially from the perspective of solving exam problems. What follows is my personal understanding of the KMP algorithm, explained through problem solving. Please bear with me if anything is unclear~

First, let's look at the definition of the KMP algorithm

KMP algorithm definition

The KMP algorithm is an improved string matching algorithm proposed by D. E. Knuth, J. H. Morris and V. R. Pratt, so it is called the Knuth-Morris-Pratt algorithm (KMP algorithm for short). The core of the KMP algorithm is to use the information gained from a failed match to minimize the number of comparisons between the pattern string and the main string, and so achieve fast matching. It is implemented through a next() function, which contains the partial matching information of the pattern string. The time complexity of the KMP algorithm is O(m+n), where m and n are the lengths of the two strings.

The KMP algorithm is an improved pattern matching algorithm proposed by three scholars on the basis of the Brute-Force algorithm. In the Brute-Force algorithm, when several characters of the pattern string are equal to consecutive characters in the main string but a later character is not equal, the comparison position in the main string has to be rolled back. In this situation the KMP algorithm does not need to roll back the position in the main string, which can greatly improve efficiency.
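To see the rollback that KMP avoids, here is a minimal Python sketch of the Brute-Force approach. The function name brute_force_find and the sample strings are my own assumptions for illustration, not something taken from the original figures.

```python
def brute_force_find(main, pattern):
    """Return the index of the first occurrence of pattern in main, or -1."""
    i = 0  # current position in the main string
    j = 0  # current position in the pattern string
    while i < len(main) and j < len(pattern):
        if main[i] == pattern[j]:
            i += 1
            j += 1
        else:
            # Mismatch: roll the main-string position back to one past
            # the start of this attempt and restart the pattern.
            i = i - j + 1
            j = 0
    return i - j if j == len(pattern) else -1


print(brute_force_find("ababcabcacbab", "abcac"))  # 5
```

The line i = i - j + 1 is the rollback: characters of the main string that were already compared get compared again, and this is the work KMP saves.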

So we can make it clear that the core of the KMP algorithm is how far the substring (pattern string) should move when a character fails to match, so the key is to find this moving distance. With our goal clear, let's explain it with specific problems.

Basic concepts

First of all, let's clarify the basic concepts that will be used in the problems

1: The main string is the string to be matched

2: The substring, also called the pattern string, is the string that is matched against the main string

3: Prefix

4: Suffix

5: Longest equal prefix and suffix length

6: Partial match value PM

7: next array

8: nextval array

With so many concepts listed above, it is natural to feel dizzy and find them difficult. Let's explain them with specific problems. This algorithm is actually very easy to understand.

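The prefixes of a string are all of its leading substrings that do not include the last character. The suffixes are all of its trailing substrings that do not include the first character. The partial match value of a string is the length of its longest prefix that is equal to one of its suffixes. The PM table of a pattern string lists this value for each of its leading substrings, including the whole pattern.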
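To make these definitions concrete, here is a small Python sketch. The function names and the sample string "ababa" are my own choices for illustration and are not necessarily the ones used in the figures. It is deliberately the slow, direct translation of the definitions, not an efficient implementation.

```python
def prefixes(s):
    """All leading substrings of s that exclude the last character."""
    return [s[:k] for k in range(1, len(s))]


def suffixes(s):
    """All trailing substrings of s that exclude the first character."""
    return [s[k:] for k in range(1, len(s))]


def partial_match_value(s):
    """Length of the longest string that is both a prefix and a suffix of s."""
    common = set(prefixes(s)) & set(suffixes(s))
    return max((len(x) for x in common), default=0)


def pm_table(pattern):
    """Partial match value of every leading substring of the pattern."""
    return [partial_match_value(pattern[:k]) for k in range(1, len(pattern) + 1)]


print(prefixes("ababa"))  # ['a', 'ab', 'aba', 'abab']
print(suffixes("ababa"))  # ['baba', 'aba', 'ba', 'a']
print(pm_table("ababa"))  # [0, 0, 1, 2, 3]
```

Note that every suffix is still written from left to right, exactly as it appears in the string.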

Let's work through the above concepts with a specific example

Following these rules, you can list the suffixes one by one. There is a small pitfall here: each suffix is still read from the front, so it is ba bab...

Then, when a character fails to match, the number of positions that the substring needs to move to the right is calculated from the partial match value table above as follows

Number of positions to move = number of matched characters - partial match value of the last matched character

Remember that the value to look up is the partial match value of the last matched character, which is b in the figure below

The method is shown in the figure below. Later mismatches are handled in the same way until the match succeeds or fails

Note that there is an error in the figure: the partial match value of b should be 0
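The move formula can be sketched in Python as follows. This is only my own illustration under stated assumptions: the names pm_table and find_with_pm are hypothetical, and the main string "ababcabcacbab" and pattern "abcac" are assumed sample inputs, not necessarily the ones in the figures.

```python
def pm_table(pattern):
    """Partial match value for every leading substring of the pattern."""
    table = []
    for end in range(1, len(pattern) + 1):
        s = pattern[:end]
        k = len(s) - 1  # longest candidate length for prefix == suffix
        while k > 0 and s[:k] != s[-k:]:
            k -= 1
        table.append(k)
    return table


def find_with_pm(main, pattern):
    """Return (index of first match or -1, list of moves that were made)."""
    pm = pm_table(pattern)
    start = 0    # where the pattern is currently aligned in the main string
    matched = 0  # number of pattern characters matched at this alignment
    moves = []
    while start + len(pattern) <= len(main):
        while matched < len(pattern) and main[start + matched] == pattern[matched]:
            matched += 1
        if matched == len(pattern):
            return start, moves
        if matched == 0:
            move = 1  # nothing matched, just slide by one
        else:
            # moves = matched characters - PM value of the last matched character
            move = matched - pm[matched - 1]
            matched = pm[matched - 1]  # these characters are known to match
        moves.append(move)
        start += move
    return -1, moves


print(find_with_pm("ababcabcacbab", "abcac"))  # (5, [2, 3])
```

In this run the first mismatch happens after 2 matched characters, and the last matched character b has a partial match value of 0, so the pattern moves 2 - 0 = 2 positions. The comparison position start + matched in the main string never decreases, which is the whole point of the algorithm.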

So the question is, since the number of positions to move can already be calculated as above, what are the next and nextval arrays for?

The calculation above is still not intuitive or simple enough, because we have to look up the partial match value of the previous character. So we simplify further: shift the PM table right by one position to get the next array. The first position, left empty by the shift, is filled with -1. The last element, which overflows on the right, is simply discarded. Sometimes, for convenience of calculation, 1 is added to every value of the next array. Handle this flexibly according to the problem. In practice, if the string positions are numbered from 1, add 1 to the whole array; if they are numbered from 0, there is no need to add anything.

At this point, the number of positions to move = the number of matched characters - the next value of the mismatched character (that is, there is no longer any need to look up the partial match value of the previous character)

So the next array in the above figure is -1 0 0 0 1. If you add 1, it becomes 0 1 1 1 2. Readers can verify this by themselves.
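For readers who want to verify this with a program, here is a rough Python sketch of the shift. The helper names pm_table and next_from_pm and the one_based flag are my own hypothetical choices. The pattern "abcac" is an assumed example, picked only because it has five characters and gives the values quoted above; it may not be the pattern in the figure.

```python
def pm_table(pattern):
    """Partial match value for every leading substring of the pattern."""
    table = []
    for end in range(1, len(pattern) + 1):
        s = pattern[:end]
        k = len(s) - 1
        while k > 0 and s[:k] != s[-k:]:
            k -= 1
        table.append(k)
    return table


def next_from_pm(pm, one_based=False):
    """Shift the PM table right by one, fill the gap with -1, drop the overflow."""
    if not pm:
        return []
    nxt = [-1] + pm[:-1]
    if one_based:
        # Strings numbered from 1: add 1 to the whole array.
        nxt = [v + 1 for v in nxt]
    return nxt


pm = pm_table("abcac")
nxt = next_from_pm(pm)
print(pm)                                # [0, 0, 0, 1, 0]
print(nxt)                               # [-1, 0, 0, 0, 1]
print(next_from_pm(pm, one_based=True))  # [0, 1, 1, 1, 2]

# With 0-based positions, a mismatch at position j means j characters matched,
# so the move is j - next[j].
print(4 - nxt[4])                        # 3
```

The last line gives the same move as the PM formula (4 matched characters minus a partial match value of 1), but it reads the value at the mismatched position directly.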

Finally, the nextval array is a further optimization of the KMP algorithm. On a mismatch at position j, if P[j] = P[next[j]], then falling back to next[j] compares an equal character against the same character of the main string, which is bound to fail again. The solution is to apply next recursively and call the updated value nextval[j]. The recursive calculation is

nextval[j] = next[next[j]]

If the characters are still equal after recursing once, continue recursing until they are not equal
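The recursion can be sketched in Python like this. It is a minimal sketch assuming 0-based positions with next[0] = -1, and the function names build_next and build_nextval as well as the sample patterns are my own, not from the original article.

```python
def build_next(pattern):
    """0-based next array with next[0] = -1 (the PM table shifted right by one)."""
    nxt = [-1] * len(pattern)
    i, k = 0, -1
    while i < len(pattern) - 1:
        if k == -1 or pattern[i] == pattern[k]:
            i += 1
            k += 1
            nxt[i] = k
        else:
            k = nxt[k]
    return nxt


def build_nextval(pattern):
    """Follow next repeatedly while the fallback character equals pattern[j]."""
    nxt = build_next(pattern)
    nextval = []
    for j in range(len(pattern)):
        k = nxt[j]
        # Falling back to an equal character would fail again, so keep recursing.
        while k != -1 and pattern[k] == pattern[j]:
            k = nxt[k]
        nextval.append(k)
    return nextval


print(build_next("aaaab"))     # [-1, 0, 1, 2, 3]
print(build_nextval("aaaab"))  # [-1, -1, -1, -1, 3]
print(build_nextval("abcac"))  # [-1, 0, 0, -1, 1]
```

For "aaaab" the effect is easy to see: a mismatch on any of the first four a's would fall back to another a under next, so nextval skips all of those useless comparisons at once. If your textbook numbers positions from 1, add 1 to every value.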

Writing this was not easy. If you found it helpful, please like, follow and bookmark~~~


Origin blog.csdn.net/jiebaoshayebuhui/article/details/130394394