In "[String Processing Python Implementation] The brute force of string pattern matching, the introduction and implementation of the BM algorithm", we introduced the brute force algorithm and the BM algorithm respectively. To recap:
- After each failed round of character matching, the brute force algorithm simply slides the pattern string to the right by one character. It is the simplest and most intuitive approach, but it inevitably performs many meaningless comparisons;
- The BM algorithm uses probing, jump-based rules, so that after each failed round of character matching it can slide the pattern string to the right by as many characters as possible, thereby reducing unnecessary comparisons.
This article introduces another well-known and efficient string matching algorithm, the KMP algorithm. It was proposed jointly by D. E. Knuth, J. H. Morris and V. R. Pratt, and its name is formed from the initials of their surnames.
One, terminology
For the convenience of the descriptions that follow, here are a few terms about strings before the KMP algorithm is introduced. Given a string 'GTGT':
- Prefix substring: '', 'G', 'GT', 'GTG' and 'GTGT' are the prefix substrings of 'GTGT';
- Proper prefix substring: the prefix substrings other than 'GTGT' itself, namely '', 'G', 'GT' and 'GTG', are the proper prefix substrings of 'GTGT';
- Suffix substring: '', 'T', 'GT', 'TGT' and 'GTGT' are the suffix substrings of 'GTGT';
- Proper suffix substring: the suffix substrings other than 'GTGT' itself, namely '', 'T', 'GT' and 'TGT', are the proper suffix substrings of 'GTGT'.
For ease of presentation, all prefix substrings and suffix substrings mentioned from here on refer to proper prefix substrings and proper suffix substrings respectively.
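As a quick check of these definitions, the following minimal sketch lists the prefix and suffix substrings of a string with Python slicing. The helper names `prefixes` and `suffixes` are introduced here for illustration only and are not used elsewhere in this article.

```python
def prefixes(s, proper=False):
    """Return the prefix substrings of s, shortest first.

    With proper=True, s itself is excluded.
    """
    end = len(s) if proper else len(s) + 1
    return [s[:k] for k in range(end)]


def suffixes(s, proper=False):
    """Return the suffix substrings of s, shortest first.

    With proper=True, s itself is excluded.
    """
    end = len(s) if proper else len(s) + 1
    return [s[len(s) - k:] for k in range(end)]


print(prefixes('GTGT'))               # ['', 'G', 'GT', 'GTG', 'GTGT']
print(prefixes('GTGT', proper=True))  # ['', 'G', 'GT', 'GTG']
print(suffixes('GTGT'))               # ['', 'T', 'GT', 'TGT', 'GTGT']
print(suffixes('GTGT', proper=True))  # ['', 'T', 'GT', 'TGT']
```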
Two, algorithm details
First, the core idea of the KMP algorithm in one sentence: make full use of the leading characters of the pattern string that were successfully matched against the main string in the current round, so that before the next round of matching the pattern string can slide as many characters to the right, relative to the main string, as possible.
1. Overview of KMP algorithm flow
Next, we use a concrete case to walk through the matching process of the KMP algorithm intuitively. Assume the main string is 'GTGTCGTTGGGTGTG' and the pattern string is 'GTGTCGTG'.
As with the brute force and BM matching algorithms, the KMP algorithm starts by aligning the main string and the pattern string at their left ends, and then compares them character by character from left to right.
In the first round, as shown in the following figure, the first 7 characters of the pattern string are the same as the characters at the aligned positions of the main string, and the 8th character is the first one that differs.
The question now is how to make full use of the matched substring 'GTGTCGT'. Examining its prefix substrings and suffix substrings, we find that:
- The prefix substring of length 1 and the suffix substring of length 1 are the same, both 'G';
- The prefix substring of length 2 and the suffix substring of length 2 are the same, both 'GT';
- For no other nonzero length are the prefix substring and the suffix substring of 'GTGTCGT' the same.
We call 'GT', the longest substring satisfying this condition, the longest matchable prefix and suffix substring of 'GTGTCGT'.
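The observation above can be reproduced by simply trying every candidate length, longest first. This is a brute-force sketch for illustration only; the function name `longest_matchable` is an assumption of this example, and the KMP algorithm itself obtains the same information more efficiently.

```python
def longest_matchable(s):
    """Return the longest proper prefix of s that is also a proper suffix of s."""
    for k in range(len(s) - 1, 0, -1):  # try the longest candidates first
        if s[:k] == s[len(s) - k:]:
            return s[:k]
    return ''  # only the empty string qualifies


print(repr(longest_matchable('GTGTCGT')))  # 'GT'
```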
With this information, the essence of the KMP algorithm is that the pattern string can be moved to the right until its longest matchable prefix substring 'GT' is aligned with the longest matchable suffix substring 'GT' of the 7 characters already matched in the main string, as shown in the following figure. Here the pattern string slides 5 characters to the right in one step (7 - 2 = 5):
Here, some readers may have a question. For the successfully matched substring 'GTGTCGT', whose longest matchable prefix and suffix substring is known to be 'GT', why is it safe to slide the pattern string a full 5 characters to the right, so that the longest matchable prefix substring lines up with the longest matchable suffix substring, without missing a possible successful match? For example, why not slide the pattern string only two characters to the right, so that the pattern string is aligned with the main string at index 2 (as shown in the figure below)?
This question can be answered with a proof by contradiction:
Suppose that after sliding the pattern string two characters to the right relative to the main string, the pattern string is at a position where it may successfully match the main string. Then, at least at main string indexes 2 to 6, the characters of the pattern string must be the same as the aligned characters of the main string. In other words, the 5-character prefix of the pattern string would have to equal the 5-character suffix 'GTCGT' of the matched substring, so the longest matchable prefix and suffix substring would have length 5 instead of being 'GT'. This contradicts the actual situation, so the hypothesis does not hold.
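The same argument can be checked mechanically for every slide distance. The sketch below assumes a hypothetical helper `shift_is_consistent`: a slide by `shift` characters can only lead to a match if the part of the matched substring that still overlaps the pattern string agrees with the corresponding prefix of the matched substring.

```python
def shift_is_consistent(matched, shift):
    """True if sliding by `shift` does not contradict the characters already matched."""
    overlap = len(matched) - shift
    # The suffix still under the pattern must equal the pattern's prefix of that length.
    return matched[shift:] == matched[:overlap]


matched = 'GTGTCGT'
consistent = [s for s in range(1, len(matched) + 1) if shift_is_consistent(matched, s)]
print(consistent)  # [5, 7]
```

The smallest consistent slide is 5, which is the matched length 7 minus the length 2 of 'GT'. A slide of 7 moves the pattern string past the matched substring entirely, so choosing it would risk skipping the alignment at 5.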
In the second round, as shown in the following figure, the first two characters of the pattern string are the same as the characters at the aligned positions of the main string, and the third character is the first one that differs.
The matched substring is now 'GT'. Its longest matchable prefix and suffix substring clearly has length 0, that is, it can be regarded as ''. Therefore, the pattern string can be slid two more characters to the right directly.
In the third round, as shown in the figure below, the first character of the pattern string already differs from the character at the aligned position of the main string, so the pattern string slides one more character to the right. The end of the pattern string now extends beyond the end of the main string, so it can be determined that the match has failed. At this point, the complete process of the KMP algorithm is over.
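The three rounds above can be replayed with a short simulation. This is a sketch for illustration, not the final implementation: the names `longest_matchable_length` and `sliding_rounds` are assumptions of this example, the prefix and suffix information is found by brute force, and each round re-compares from the start of the pattern string for simplicity.

```python
def longest_matchable_length(s):
    """Length of the longest proper prefix of s that is also a suffix of s."""
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[len(s) - k:]:
            return k
    return 0


def sliding_rounds(haystack, needle):
    """Return (alignment, matched_count) for each round of the walkthrough."""
    n, m = len(haystack), len(needle)
    rounds = []
    start = 0
    while start + m <= n:  # stop once the pattern would overrun the main string
        matched = 0
        while matched < m and haystack[start + matched] == needle[matched]:
            matched += 1
        rounds.append((start, matched))
        if matched == m:
            break  # full match found
        # Slide so the longest matchable prefix lines up with the matching suffix;
        # when nothing matched, slide by one character.
        start += max(1, matched - longest_matchable_length(needle[:matched]))
    return rounds


print(sliding_rounds('GTGTCGTTGGGTGTG', 'GTGTCGTG'))
# [(0, 7), (5, 2), (7, 0)]
```

Each pair gives the main string index at which the pattern string is aligned and the number of characters matched in that round.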
2. Introduce the prefix and suffix substring array
From the case above, you may already have a rough idea. The key to implementing the KMP algorithm is what to do with the substring successfully matched in each round:
- First, find the longest matchable prefix and suffix substring of that substring;
- Then, determine the length of this longest matchable prefix and suffix substring;
- Finally, use this length to determine, before the next round of matching, how many characters the pattern string slides to the right.
At first glance, it seems that after each round of matching we must first obtain the partial matching substring of the main string and the pattern string, and only then determine its longest matchable prefix and suffix substring. In fact, from the matching process of the KMP algorithm described above, we know that the partial matching substring obtained in each round is always a prefix substring of the pattern string.
Therefore, from the pattern string alone, we can list all possible partial matching substrings of the main string and the pattern string (note 1), determine in advance the length of the longest matchable prefix and suffix substring of each of them, and save these lengths in a one-dimensional array pi (note 2).
The subscript of this array represents the length of the partial matching substring, and the value at that subscript is the length of the longest matchable prefix and suffix substring of that partial matching substring.
The description of the pi array above is rather abstract. The following figure shows the pi array corresponding to the pattern string 'GTGTCGTG' used above:
Specifically:
- When the partial matching substring is '', its length is 0, and its longest matchable prefix and suffix substring can be regarded as '' with length 0, so pi[0] = 0;
- When the partial matching substring is 'G', its length is 1, and its longest matchable prefix and suffix substring can be regarded as '' with length 0, so pi[1] = 0;
- When the partial matching substring is 'GT', its length is 2, and its longest matchable prefix and suffix substring can be regarded as '' with length 0, so pi[2] = 0;
- When the partial matching substring is 'GTG', its length is 3, and its longest matchable prefix and suffix substring is 'G' with length 1, so pi[3] = 1;
- When the partial matching substring is 'GTGT', its length is 4, and its longest matchable prefix and suffix substring is 'GT' with length 2, so pi[4] = 2;
- When the partial matching substring is 'GTGTC', its length is 5, and its longest matchable prefix and suffix substring can be regarded as '' with length 0, so pi[5] = 0;
- When the partial matching substring is 'GTGTCG', its length is 6, and its longest matchable prefix and suffix substring is 'G' with length 1, so pi[6] = 1;
- When the partial matching substring is 'GTGTCGT', its length is 7, and its longest matchable prefix and suffix substring is 'GT' with length 2, so pi[7] = 2.
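The values listed above can be reproduced by enumerating every partial matching substring and testing every candidate length. This brute-force sketch (the name `pi_by_enumeration` is an assumption of this example) is only meant to confirm the hand-derived array, not to be the efficient way of building it.

```python
def pi_by_enumeration(needle):
    """Build the pi array by checking every partial matching substring directly."""
    pi = []
    for length in range(len(needle)):  # partial match lengths 0 .. m - 1
        partial = needle[:length]
        best = 0
        for k in range(1, length):  # non-empty proper prefixes and suffixes
            if partial[:k] == partial[length - k:]:
                best = k  # k increases, so the last hit is the longest
        pi.append(best)
    return pi


print(pi_by_enumeration('GTGTCGTG'))  # [0, 0, 0, 1, 2, 0, 1, 2]
```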
3. Use the prefix and suffix substring array
Above, we obtained the prefix and suffix substring array by hand. How is the array applied when running the KMP algorithm?
When the KMP algorithm is used for string matching, the pattern string does not really "slide" relative to the main string in memory. The so-called sliding is reflected in updates to auxiliary pointer variables: a typical implementation uses one pointer variable for the main string, i, and one for the pattern string, j, and updates both during matching. The prefix and suffix substring array is used to update the pointer variable that points into the pattern string.
Specifically, as shown in the following figure:
- In the first round of matching, a mismatch occurs when i = 7 and j = 7. The partial matching substring has length 7, so in code the sliding of the pattern string is reflected in the auxiliary pointer as j = pi[7] = 2;
- In the second round of matching, a mismatch occurs when i = 7 and j = 2. The partial matching substring has length 2, so in code the sliding of the pattern string is reflected in the auxiliary pointer as j = pi[2] = 0.
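The pointer updates described above can be observed by recording every fallback of j during matching. In this sketch the pi array is hard-coded from the hand-derived values for 'GTGTCGTG', the function name `match_with_trace` is an assumption of this example, and the pattern string is assumed to be non-empty.

```python
def match_with_trace(haystack, needle, pi):
    """Match needle in haystack, recording (i, old_j, new_j) at every fallback of j."""
    trace = []
    j = 0
    for i, ch in enumerate(haystack):
        while j > 0 and ch != needle[j]:
            trace.append((i, j, pi[j]))
            j = pi[j]  # the "slide" is just an update of j
        if ch == needle[j]:
            j += 1
        if j == len(needle):
            return i - len(needle) + 1, trace
    return -1, trace


pi = [0, 0, 0, 1, 2, 0, 1, 2]  # hand-derived pi array for 'GTGTCGTG'
index, trace = match_with_trace('GTGTCGTTGGGTGTG', 'GTGTCGTG', pi)
print(trace[:2])  # [(7, 7, 2), (7, 2, 0)]
print(index)      # -1
```

The first two trace entries correspond to the two rounds listed above. Note that i stays at 7 in both: the pointer into the main string never moves backwards.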
4. Generate the prefix and suffix substring array
Above, we obtained the pi array of the pattern string by direct observation. The following describes how to derive the array through analysis, which also prepares for the code that generates the prefix and suffix substring array. Here, the pattern string is denoted by the word needle (note 3).
First, as shown below, we use two pointer variables i and j. Here i is the index of the character that follows the partial matching substring, that is, the index into the pi array; j is the index of the character that follows the longest matchable prefix substring, that is, the value stored in the pi array.
- At the beginning, both the partial matching substring and the longest matchable prefix substring can be regarded as '', so we initialize i = 0 and j = 0, and obviously pi[0] = 0;
- Then i is increased by 1, so that the partial matching substring is 'G' with length 1. Its longest matchable prefix and suffix substring can still be regarded as '' with length 0, so, as discussed earlier, pi[1] = 0;
- Continue to add 1 to i, so that the partial matching substring is 'GT' with length 2. Now needle[j] != needle[i - 1], that is, 'G' != 'T', so the longest matchable prefix and suffix substring can still be regarded as '' with length 0, and pi[2] = 0;
- Continue to add 1 to i, so that the partial matching substring is 'GTG' with length 3. Now, at last, needle[j] == needle[i - 1], that is, 'G' == 'G', so the longest matchable prefix and suffix substring is 'G' with length 1, and pi[3] = pi[2] + 1 = 1. j is then updated with j += 1;
- Continue to add 1 to i, so that the partial matching substring is 'GTGT' with length 4. Now needle[j] == needle[i - 1], that is, 'T' == 'T', so the longest matchable prefix and suffix substring is 'GT' with length 2, and pi[4] = pi[3] + 1 = 2. j is then updated with j += 1;
- Continue to add 1 to i, so that the partial matching substring is 'GTGTC' with length 5. Now needle[j] != needle[i - 1], that is, 'G' != 'C'. The longest matchable prefix and suffix substring turns out to be '' with length 0, but the value of pi[5] cannot be derived directly from pi[4] at this point;
In order to obtain the value of pi[5], that is, to find the longest matchable prefix substring of the partial matching substring 'GTGTC', the problem can be transformed into finding the longest matchable prefix substring of 'GTC'. The reason is that 'GT' is the longest matchable prefix and suffix substring of 'GTGT' and could not be extended by 'C', so any remaining candidate must come from the shorter string 'GT' followed by 'C':
In fact, as shown in the figure below, this is equivalent to backtracking the auxiliary pointer variable j with j = pi[j] = pi[2] = 0. After the backtracking, needle[j] != needle[i - 1] still holds, that is, 'G' != 'C', and j cannot be backtracked any further, so pi[5] = 0.
- Continue to add 1 to i, so that the partial matching substring is 'GTGTCG' with length 6. Now needle[j] == needle[i - 1], that is, 'G' == 'G', so the longest matchable prefix and suffix substring is 'G' with length 1, and pi[6] = pi[5] + 1 = 1. j is then updated with j += 1;
- Continue to add 1 to i, so that the partial matching substring is 'GTGTCGT' with length 7. Now needle[j] == needle[i - 1], that is, 'T' == 'T', so the longest matchable prefix and suffix substring is 'GT' with length 2, and pi[7] = pi[6] + 1 = 2. j is then updated with j += 1.
At this point, we have obtained the pi array by calculation.
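The step-by-step derivation above can be condensed into a loop that also records each backtracking of j. This is a sketch under the same indexing convention as the walkthrough (pi[i] describes the partial matching substring of length i); the name `compute_pi_with_trace` and the trace format are assumptions of this example.

```python
def compute_pi_with_trace(needle):
    """Compute pi incrementally, recording (i, old_j, new_j) at every backtrack."""
    m = len(needle)
    pi = [0] * m  # pi[0] and pi[1] are always 0
    backtracks = []
    j = 0
    for i in range(2, m):
        # needle[i - 1] is the last character of the partial match of length i
        while j != 0 and needle[j] != needle[i - 1]:
            backtracks.append((i, j, pi[j]))
            j = pi[j]
        if needle[j] == needle[i - 1]:
            j += 1
        pi[i] = j
    return pi, backtracks


pi, backtracks = compute_pi_with_trace('GTGTCGTG')
print(pi)          # [0, 0, 0, 1, 2, 0, 1, 2]
print(backtracks)  # [(5, 2, 0)]
```

The single recorded backtrack is the step for 'GTGTC', where j falls back from 2 to pi[2] = 0.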
Three, algorithm implementation
Based on the theoretical analysis above, a complete Python implementation of the KMP algorithm is given below. To improve readability and reduce coupling between functions, two functions are provided:
- kmp_match can be regarded as the main function. It is responsible for finding a match of the pattern string needle in the main string haystack. Its implementation follows the section of this article on using the prefix and suffix substring array;
- compute_max_prefix pre-computes the pi array. Its implementation follows the section of this article on generating the prefix and suffix substring array.
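
    def kmp_match(haystack, needle):
        """Main routine of the KMP algorithm; haystack is the main string, needle is the pattern string."""
        n, m = len(haystack), len(needle)
        if m == 0:
            # An empty pattern matches at index 0, mirroring str.find
            return 0
        # Preprocessing: for each prefix of the pattern string, the length of
        # its longest matchable prefix and suffix substring
        pi = compute_max_prefix(needle)
        j = 0  # number of pattern characters matched so far
        for i in range(n):
            while j > 0 and haystack[i] != needle[j]:
                # On a mismatch, look up the pi array to update the pattern pointer j
                j = pi[j]
            if haystack[i] == needle[j]:
                j += 1
            if j == m:  # match succeeded: return the start index in haystack
                return i - m + 1
        return -1


    def compute_max_prefix(needle):
        """Generate the pi array from the pattern string."""
        m = len(needle)
        pi = [0] * m
        j = 0
        for i in range(2, m):
            while j != 0 and needle[j] != needle[i - 1]:
                # Backtrack the auxiliary pointer j
                j = pi[j]
            if needle[j] == needle[i - 1]:
                j += 1
            pi[i] = j  # the index just past the longest matchable prefix substring equals its length
        return pi


    if __name__ == '__main__':
        haystack = "ATGTGAGCTGGTGTGTGCFAA"
        needle = "GTGTGCF"
        index = kmp_match(haystack, needle)
        print(index)  # 12
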
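Intuitively, the advantage of the KMP algorithm over brute force matching is most visible when the main string and the pattern string contain many repeated characters: a mismatch late in the pattern string then forces brute force to start over from the next position, while in KMP the pointer into the main string never moves backwards.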
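To make this intuition concrete, the following sketch counts character comparisons for both approaches on a highly repetitive input. The function names, the counting convention (one count per character comparison between main string and pattern string) and the sample input are all assumptions of this example, and the pattern string is assumed to be non-empty.

```python
def brute_force_comparisons(haystack, needle):
    """Return (index, comparisons) for brute force matching."""
    n, m = len(haystack), len(needle)
    count = 0
    for start in range(n - m + 1):
        for k in range(m):
            count += 1
            if haystack[start + k] != needle[k]:
                break
        else:
            return start, count  # inner loop finished without a mismatch
    return -1, count


def kmp_comparisons(haystack, needle):
    """Return (index, comparisons) for KMP matching, not counting preprocessing."""
    m = len(needle)
    pi = [0] * m
    j = 0
    for i in range(2, m):  # build the pi array as described in this article
        while j != 0 and needle[j] != needle[i - 1]:
            j = pi[j]
        if needle[j] == needle[i - 1]:
            j += 1
        pi[i] = j
    count = 0
    j = 0
    for i, ch in enumerate(haystack):
        while True:
            count += 1
            if ch == needle[j]:
                j += 1
                break
            if j == 0:
                break
            j = pi[j]  # fall back and compare the same main string character again
        if j == m:
            return i - m + 1, count
    return -1, count


haystack = 'A' * 20 + 'B'
needle = 'AAAAB'
bf_index, bf_count = brute_force_comparisons(haystack, needle)
kmp_index, kmp_count = kmp_comparisons(haystack, needle)
print(bf_index, kmp_index)   # both report the same match position
print(bf_count > kmp_count)  # True
```

On this input both functions find the same position, and KMP needs fewer comparisons because it never re-reads main string characters that are already known to match.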
Note 1: For a pattern string needle of length m, the partial matching substrings referred to in this article are needle[0:k], with k ∈ {0, 1, ..., m - 1}.
Note 2: Many tutorials and books also call this array next.
Note 3: Later, in the code, the main string is named haystack. This draws on the English idiom "search for a needle in a haystack": the main string is likened to the haystack and the pattern string to the needle.