[String Processing in Python] If you still can't fully understand the KMP algorithm after reading this article carefully, follow the network cable and come hit me!

In [String Processing in Python] Brute-force string pattern matching and the introduction and implementation of the BM algorithm, we introduced the brute-force algorithm and the BM algorithm respectively. Of the two:

  • After each failed round of character matching, the brute-force algorithm simply slides the pattern string one character to the right. It is the simplest and most intuitive approach, but it inevitably makes many pointless comparisons;
  • The BM algorithm probes characters and jumps ahead: after each failed round of character matching, it slides the pattern string to the right by as many characters as possible, thereby reducing unnecessary comparisons.
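As a reminder of what "slide one character at a time" looks like in code, here is a minimal sketch of brute-force matching. The function name brute_force_match is made up for this illustration and is not taken from the earlier article.

```python
def brute_force_match(haystack, needle):
    """Return the index of the first occurrence of needle in haystack, or -1."""
    n, m = len(haystack), len(needle)
    # Try every alignment in turn: after a failure the pattern string
    # moves exactly one character to the right.
    for start in range(n - m + 1):
        if haystack[start:start + m] == needle:
            return start
    return -1


print(brute_force_match("GTGTCGTTGGGTGTG", "GTGTCGTG"))       # -1
print(brute_force_match("ATGTGAGCTGGTGTGTGCFAA", "GTGTGCF"))  # 12
```

Every alignment is compared from scratch here, which is exactly the repeated work that KMP tries to avoid.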

This article introduces another well-known and efficient string matching algorithm, the KMP algorithm. It was jointly proposed by D. E. Knuth, J. H. Morris and V. R. Pratt, and its name is formed from the first letters of their three surnames.

One, terminology

For ease of description, before introducing the KMP algorithm, here are a few terms about strings. Given the string 'GTGT':

  • Prefix substring: '', 'G', 'GT', 'GTG' and 'GTGT' are called the prefix substrings of 'GTGT';
  • True prefix substring: the prefix substrings other than 'GTGT' itself, namely '', 'G', 'GT' and 'GTG', are called the true (proper) prefix substrings of 'GTGT';
  • Suffix substring: '', 'T', 'GT', 'TGT' and 'GTGT' are called the suffix substrings of 'GTGT';
  • True suffix substring: the suffix substrings other than 'GTGT' itself, namely '', 'T', 'GT' and 'TGT', are called the true (proper) suffix substrings of 'GTGT'.

For convenience, all prefix substrings and suffix substrings mentioned from here on refer to true prefix substrings and true suffix substrings respectively.
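These definitions are easy to check with Python slicing. The two helper names below, proper_prefixes and proper_suffixes, are made up for this illustration.

```python
def proper_prefixes(s):
    """Return every prefix of s except s itself, shortest first."""
    return [s[:k] for k in range(len(s))]


def proper_suffixes(s):
    """Return every suffix of s except s itself, shortest first."""
    return [s[len(s) - k:] for k in range(len(s))]


print(proper_prefixes("GTGT"))  # ['', 'G', 'GT', 'GTG']
print(proper_suffixes("GTGT"))  # ['', 'T', 'GT', 'TGT']
```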

Two, algorithm details

First, one sentence to summarize the core idea of the KMP algorithm: make full use of the leading characters of the pattern string that were successfully matched against the main string in the current round, so that before the next round of matching the pattern string can slide as many characters as possible to the right relative to the main string.

1. Overview of KMP algorithm flow

Next, we walk through the matching process of the KMP algorithm on a concrete case, assuming that the main string is 'GTGTCGTTGGGTGTG' and the pattern string is 'GTGTCGTG'.

As in the brute-force and BM algorithms, the first step of the KMP algorithm is to align the main string and the pattern string at their left ends. KMP then compares them character by character from left to right.

In the first round, as shown in the following figure, the first 7 characters of the pattern string equal the characters at the aligned positions of the main string, and the 8th character is the first one that differs.

[Figure: round 1, the pattern string aligned with index 0 of the main string; the first 7 characters match and the 8th differs]

The question now is how to make full use of the matched substring 'GTGTCGT'. Comparing its prefix substrings with its suffix substrings, we find:

  • The prefix substring of length 2 and the suffix substring of length 2 are the same, both 'GT';
  • For every other non-zero length, the prefix substring and the suffix substring of 'GTGTCGT' differ. For example, at length 1 the prefix substring is 'G' but the suffix substring is 'T'.

We call the substring 'GT', which satisfies the condition above, the longest matchable prefix and suffix substring of 'GTGTCGT'.

With this information, the essence of the KMP algorithm is to slide the pattern string to the right until its longest matchable prefix 'GT' lines up with the longest matchable suffix 'GT' of the 7 characters already matched in the main string, as shown in the following figure. Here that means sliding the pattern string 7 - 2 = 5 characters to the right in one step:

[Figure: the pattern string after sliding 5 characters, with its prefix 'GT' aligned with 'GT' at indices 5 and 6 of the main string]
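The slide distance can be checked with a few lines of brute force. This is only a sketch for the worked example: the helper name longest_border is made up here, and the real algorithm does not recompute this on every round.

```python
def longest_border(s):
    """Longest string that is both a true prefix and a true suffix of s."""
    # Try the longest candidate first and stop at the first hit.
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return s[:k]
    return ""


matched = "GTGTCGT"                   # the 7 characters matched in round 1
border = longest_border(matched)
shift = len(matched) - len(border)    # how far the pattern string slides
print(repr(border), shift)            # 'GT' 5
```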

Some readers may wonder: given that the longest matchable prefix and suffix substring of the matched substring 'GTGTCGT' is 'GT', why is it safe to slide the pattern string a full 5 characters to the right, lining up the longest matchable prefix with the longest matchable suffix, without missing a possible match? Why not, for example, slide the pattern string only two characters to the right, so that it is aligned with index 2 of the main string (as shown in the figure below)?

This can be answered with a proof by contradiction:

Suppose that after sliding the pattern string two characters to the right it could still match the main string at that position. Then the characters of the main string at indices 2 to 6, 'GTCGT', would have to equal the first 5 characters of the pattern string. But 'GTCGT' is the length-5 suffix of the matched substring 'GTGTCGT', and the first 5 characters of the pattern string are its length-5 prefix, so the longest matchable prefix and suffix substring would have length at least 5 and would be 'GTCGT' rather than 'GT'. This contradicts what we found, so the hypothesis does not hold. The same argument rules out every slide of fewer than 5 characters.

[Figure: the pattern string slid only 2 characters, aligned with index 2 of the main string]
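The contradiction argument can also be checked mechanically. The sketch below (the helper name shift_is_possible is an assumption for this example) tests, for each slide distance, whether the characters that would still overlap the matched region are consistent with it.

```python
matched = "GTGTCGT"  # the 7 characters known to be equal in both strings


def shift_is_possible(matched, shift):
    """True if sliding by `shift` does not contradict the matched characters."""
    overlap = len(matched) - shift
    # After the slide, the first `overlap` characters of the pattern string
    # sit under the last `overlap` characters of the matched region.
    return matched[:overlap] == matched[shift:]


possible = [s for s in range(1, len(matched) + 1) if shift_is_possible(matched, s)]
print(possible)  # [5, 7]
```

The smallest consistent slide is 5, which is exactly the matched length minus the length of 'GT'. A slide of 7 is also consistent, but taking it first would skip the alignment at 5.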

In the second round, as shown in the following figure, the first two characters of the pattern string equal the aligned characters of the main string, and the third character differs.

The matched substring is now 'GT', whose longest matchable prefix and suffix substring is '' with length 0, so the pattern string can slide 2 - 0 = 2 more characters to the right.

[Figure: round 2, the pattern string aligned with index 5 of the main string]

In the third round, as shown in the figure below, the first character of the pattern string already differs from the aligned character of the main string, so the pattern string slides one more character to the right. Now the end of the pattern string extends past the end of the main string, so we can conclude that the match has failed. This completes the whole KMP matching process.

[Figure: round 3, the pattern string aligned with index 7 of the main string]
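The three rounds can be replayed with a small simulation that follows the "sliding" description literally. This is an illustrative sketch, not the implementation given later: kmp_rounds and longest_border_len are made-up names, and the prefix and suffix information is recomputed by brute force on every round for clarity.

```python
def longest_border_len(s):
    """Length of the longest true prefix of s that is also a true suffix."""
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return k
    return 0


def kmp_rounds(text, pattern):
    """Return a list of (alignment index, characters matched) per round."""
    n, m = len(text), len(pattern)
    start, matched = 0, 0
    rounds = []
    while start + m <= n:
        # Extend the match from the characters already known to agree.
        while matched < m and text[start + matched] == pattern[matched]:
            matched += 1
        rounds.append((start, matched))
        if matched == m:
            break                         # full match found
        if matched == 0:
            start += 1                    # nothing to reuse, slide by one
        else:
            keep = longest_border_len(pattern[:matched])
            start += matched - keep       # slide so the prefix meets the suffix
            matched = keep                # these characters need no re-check
    return rounds


print(kmp_rounds("GTGTCGTTGGGTGTG", "GTGTCGTG"))  # [(0, 7), (5, 2), (7, 0)]
```

The output lists the same three alignments as the walkthrough: index 0 with 7 matched characters, index 5 with 2, and index 7 with none.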

2. Introducing the prefix and suffix substring array

From the case above you may already have a rough picture. The key to implementing the KMP algorithm is what to do with the substring that was successfully matched in each round:

  • First find the longest matchable prefix and suffix substring of that substring;
  • Then determine the length of this longest matchable prefix and suffix substring;
  • Finally, use this length to determine how many characters the pattern string slides to the right before the next round of matching.

At first glance, the longest matchable prefix and suffix substring has to be worked out after every round, once the partially matched substring of the main string and the pattern string is known. In fact, from the matching process above we know that the partially matched substring obtained in any round is always a prefix substring of the pattern string.

Therefore, using the pattern string alone, we can list every possible partially matched substring of the main string and the pattern string [1], work out in advance the length of the longest matchable prefix and suffix substring of each one, and save the results in an array pi [2].

The subscript of this one-dimensional array is the length of the partially matched substring, and the value at that subscript is the length of the longest matchable prefix and suffix substring of that partially matched substring.

If the description of the pi array above is hard to follow, the following figure shows the pi array for the pattern string 'GTGTCGTG' used above:

[Figure: the pi array of 'GTGTCGTG', pi = [0, 0, 0, 1, 2, 0, 1, 2] for subscripts 0 to 7]

Specifically:

  • When the partially matched substring is '', its length is 0, and the longest matchable prefix and suffix substring can be taken to be '' with length 0, so pi[0] = 0;
  • When the partially matched substring is 'G', its length is 1, and the longest matchable prefix and suffix substring can be taken to be '' with length 0, so pi[1] = 0;
  • When the partially matched substring is 'GT', its length is 2, and the longest matchable prefix and suffix substring can be taken to be '' with length 0, so pi[2] = 0;
  • When the partially matched substring is 'GTG', its length is 3, and the longest matchable prefix and suffix substring is 'G' with length 1, so pi[3] = 1;
  • When the partially matched substring is 'GTGT', its length is 4, and the longest matchable prefix and suffix substring is 'GT' with length 2, so pi[4] = 2;
  • When the partially matched substring is 'GTGTC', its length is 5, and the longest matchable prefix and suffix substring can be taken to be '' with length 0, so pi[5] = 0;
  • When the partially matched substring is 'GTGTCG', its length is 6, and the longest matchable prefix and suffix substring is 'G' with length 1, so pi[6] = 1;
  • When the partially matched substring is 'GTGTCGT', its length is 7, and the longest matchable prefix and suffix substring is 'GT' with length 2, so pi[7] = 2.
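The list above can be reproduced by applying the definition directly. The sketch below is deliberately naive and only meant for checking small examples by hand; the name pi_by_definition is made up for this illustration.

```python
def pi_by_definition(needle):
    """pi[k] = length of the longest true prefix of needle[:k] that is also its true suffix."""
    pi = []
    for k in range(len(needle)):
        sub = needle[:k]                  # partially matched substring of length k
        best = 0
        for length in range(1, k):        # every non-empty true prefix length
            if sub[:length] == sub[k - length:]:
                best = length
        pi.append(best)
    return pi


print(pi_by_definition("GTGTCGTG"))  # [0, 0, 0, 1, 2, 0, 1, 2]
```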

3. Using the prefix and suffix substring array

We have now obtained the prefix and suffix substring array by hand. How is it used in the KMP algorithm?

When the KMP algorithm runs, the pattern string does not really "slide" relative to the main string in memory. The sliding is expressed by updating auxiliary pointer variables: a typical implementation keeps one pointer variable i for the main string and one pointer variable j for the pattern string, updates both during matching, and uses the prefix and suffix substring array to update the pointer into the pattern string.

Specifically, as shown in the following figures:

  • In the first round, a mismatch occurs at i = 7 and j = 7. The partially matched substring has length 7, so in code the slide of the pattern string becomes the pointer update j = pi[7] = 2;

[Figure: first round, mismatch at i = 7 and j = 7, after which j = pi[7] = 2]

  • In the second round, a mismatch occurs at i = 7 and j = 2. The partially matched substring has length 2, so in code the slide of the pattern string becomes the pointer update j = pi[2] = 0.

[Figure: second round, mismatch at i = 7 and j = 2, after which j = pi[2] = 0]
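To watch these pointer updates happen, the scan can be instrumented to record every fallback of j. In this sketch the pi array is simply copied from the list derived by hand above, and trace_mismatches is a made-up name.

```python
def trace_mismatches(haystack, needle, pi):
    """Run the scan and record (i, j before, j after) at every fallback of j."""
    events = []
    j = 0
    for i, ch in enumerate(haystack):
        while j > 0 and ch != needle[j]:
            events.append((i, j, pi[j]))
            j = pi[j]                     # the "slide", expressed as a pointer update
        if ch == needle[j]:
            j += 1
        if j == len(needle):
            break                         # full match, stop scanning
    return events


pi = [0, 0, 0, 1, 2, 0, 1, 2]             # pi array of 'GTGTCGTG' from the text
events = trace_mismatches("GTGTCGTTGGGTGTG", "GTGTCGTG", pi)
print(events[:2])  # [(7, 7, 2), (7, 2, 0)]
```

The first two events are the two updates described above. Note that i stays at 7 for both: the pointer into the main string never moves backwards.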

4. Generating the prefix and suffix substring array

Above we obtained the pi array of the pattern string by direct observation. The following explains how to derive the array step by step, in preparation for the code that generates it. Here the pattern string is referred to as needle [3].

First, as shown below, we use two pointer variables i and j. i is the index of the character that follows the partially matched substring, i.e., the subscript into the pi array; j is the index of the character that follows the longest matchable prefix substring, i.e., the value stored in the pi array.

  • At the start, the partially matched substring and its longest matchable prefix substring can both be taken to be '', so we initialize i = 0 and j = 0, and clearly pi[0] = 0;

[Figure: initial state, i = 0 and j = 0]

  • Then add 1 to i, so that the partially matched substring is 'G' with length 1. The longest matchable prefix and suffix substring is still '' with length 0, so, as discussed earlier, pi[1] = 0;

[Figure: computing pi[1] for 'G']

  • Add 1 to i again, so that the partially matched substring is 'GT' with length 2. Now needle[j] != needle[i - 1], that is 'G' != 'T', so the longest matchable prefix and suffix substring is still '' with length 0, and pi[2] = 0;

[Figure: computing pi[2] for 'GT']

  • Add 1 to i again, so that the partially matched substring is 'GTG' with length 3. This time needle[j] == needle[i - 1], that is 'G' == 'G', so the longest matchable prefix and suffix substring is 'G' with length 1. Therefore pi[3] = pi[2] + 1 = 1, after which j is updated with j += 1;

[Figure: computing pi[3] for 'GTG']

  • Add 1 to i again, so that the partially matched substring is 'GTGT' with length 4. Now needle[j] == needle[i - 1], that is 'T' == 'T', so the longest matchable prefix and suffix substring is 'GT' with length 2. Therefore pi[4] = pi[3] + 1 = 2, after which j is updated with j += 1;

[Figure: computing pi[4] for 'GTGT']

  • Add 1 to i again, so that the partially matched substring is 'GTGTC' with length 5. Now needle[j] != needle[i - 1], that is 'G' != 'C'. The longest matchable prefix and suffix substring turns out to be '' with length 0, but pi[5] cannot be derived directly from pi[4];

[Figure: computing pi[5] for 'GTGTC', mismatch between needle[2] and needle[4]]

To obtain pi[5], that is, the longest matchable prefix substring of the partially matched substring 'GTGTC', note that any candidate must be a matchable prefix and suffix substring of 'GTGT' followed by 'C'. The longest such candidate, 'GT' (pi[4] = 2), has just failed because needle[2] is 'G' rather than 'C', so the problem reduces to finding the longest matchable prefix substring of 'GTC':

[Figure: reducing the problem for 'GTGTC' to the problem for 'GTC']

In code, as shown in the figure below, this amounts to backtracking the auxiliary pointer variable j: j = pi[j] = pi[2] = 0. After backtracking, needle[j] != needle[i - 1] still holds, that is 'G' != 'C', and j cannot be backtracked any further, so pi[5] = 0.

[Figure: j backtracked to pi[2] = 0, giving pi[5] = 0]

  • Add 1 to i again, so that the partially matched substring is 'GTGTCG' with length 6. Now needle[j] == needle[i - 1], that is 'G' == 'G', so the longest matchable prefix and suffix substring is 'G' with length 1. Therefore pi[6] = pi[5] + 1 = 1, after which j is updated with j += 1;

[Figure: computing pi[6] for 'GTGTCG']

  • Add 1 to i again, so that the partially matched substring is 'GTGTCGT' with length 7. Now needle[j] == needle[i - 1], that is 'T' == 'T', so the longest matchable prefix and suffix substring is 'GT' with length 2. Therefore pi[7] = pi[6] + 1 = 2, after which j is updated with j += 1.

[Figure: computing pi[7] for 'GTGTCGT']

At this point, we have obtained the whole pi array by calculation.
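The derivation above can be replayed in code. The sketch below follows the same i and j bookkeeping and additionally records each backtracking step of j, so the special case at pi[5] is visible. The name build_pi_with_trace and the shape of the recorded steps are choices made for this illustration only.

```python
def build_pi_with_trace(needle):
    """Build the pi array and record, for each i, the backtracking steps of j."""
    m = len(needle)
    pi = [0] * m
    steps = []
    j = 0
    for i in range(2, m):                 # pi[0] and pi[1] are always 0
        fallbacks = []
        while j != 0 and needle[j] != needle[i - 1]:
            fallbacks.append((j, pi[j]))  # j backtracks from j to pi[j]
            j = pi[j]
        if needle[j] == needle[i - 1]:
            j += 1
        pi[i] = j
        steps.append((i, needle[:i], fallbacks, j))
    return pi, steps


pi, steps = build_pi_with_trace("GTGTCGTG")
for i, sub, fallbacks, value in steps:
    print(f"pi[{i}] = {value}  substring={sub!r}  fallbacks={fallbacks}")
print(pi)  # [0, 0, 0, 1, 2, 0, 1, 2]
```

Only the step for 'GTGTC' reports a fallback, (2, 0), which corresponds to j = pi[2] = 0 in the text.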

Three, algorithm implementation

Based on the analysis above, a complete Python implementation of the KMP algorithm is given below. To improve the readability of the code and reduce the coupling between functions, the implementation is split into two functions:

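def kmp_match(haystack, needle):
    """Main KMP routine: haystack is the main string, needle is the pattern string.

    Returns the index of the first match, or -1 if there is none.
    """
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0  # an empty pattern string matches at index 0, like str.find
    # Preprocessing: pi[k] is the length of the longest matchable prefix and
    # suffix substring of needle[:k]
    pi = compute_max_prefix(needle)
    j = 0
    for i in range(n):
        while j > 0 and haystack[i] != needle[j]:
            # On a mismatch, look up the pi array to update the pattern pointer j
            j = pi[j]
        if haystack[i] == needle[j]:
            j += 1
        if j == m:  # full match: return the index where it starts
            return i - m + 1
    return -1


def compute_max_prefix(needle):
    """Generate the pi array from the pattern string."""
    m = len(needle)
    pi = [0] * m
    j = 0
    for i in range(2, m):
        while j != 0 and needle[j] != needle[i - 1]:
            # Backtrack the auxiliary pointer j
            j = pi[j]
        if needle[j] == needle[i - 1]:
            j += 1
        # The index just past the longest matchable prefix substring equals its length
        pi[i] = j
    return pi


if __name__ == '__main__':
    haystack = "ATGTGAGCTGGTGTGTGCFAA"
    needle = "GTGTGCF"
    index = kmp_match(haystack, needle)
    print(index)  # 12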
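One way to gain confidence in an implementation like this is to compare it with Python's built-in str.find on many random inputs. The sketch below carries its own compact copy of the two functions (renamed compute_pi and kmp_find here, so the block runs on its own) and only generates non-empty pattern strings. The alphabet, the lengths and the number of trials are arbitrary choices.

```python
import random


def compute_pi(needle):
    """pi[k] = length of the longest matchable prefix and suffix substring of needle[:k]."""
    pi = [0] * len(needle)
    j = 0
    for i in range(2, len(needle)):
        while j and needle[j] != needle[i - 1]:
            j = pi[j]
        if needle[j] == needle[i - 1]:
            j += 1
        pi[i] = j
    return pi


def kmp_find(haystack, needle):
    """Index of the first occurrence of a non-empty needle, or -1."""
    pi = compute_pi(needle)
    j = 0
    for i, ch in enumerate(haystack):
        while j and ch != needle[j]:
            j = pi[j]
        if ch == needle[j]:
            j += 1
        if j == len(needle):
            return i - j + 1
    return -1


def cross_check(trials=2000, seed=0):
    """Compare kmp_find with str.find on random strings over a tiny alphabet."""
    rng = random.Random(seed)
    for _ in range(trials):
        haystack = "".join(rng.choice("GT") for _ in range(rng.randint(0, 20)))
        needle = "".join(rng.choice("GT") for _ in range(rng.randint(1, 5)))
        assert kmp_find(haystack, needle) == haystack.find(needle), (haystack, needle)
    return trials


print(cross_check())  # 2000
```

A two-letter alphabet is used on purpose: it produces many partial matches, which is where the backtracking of j is exercised most.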

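Intuitively, the KMP algorithm pays off most when the main string and the pattern string contain many repeated characters, because those are the inputs on which each round can reuse what has already been matched instead of starting over.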
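One way to probe this intuition is to count character comparisons on a highly repetitive input. The sketch below is an informal experiment rather than a benchmark: the function names, the counting convention (one count per character comparison) and the test strings are all choices made for this example, and the pattern string is assumed to be non-empty.

```python
def compute_pi(needle):
    """pi[k] = length of the longest matchable prefix and suffix substring of needle[:k]."""
    pi = [0] * len(needle)
    j = 0
    for i in range(2, len(needle)):
        while j and needle[j] != needle[i - 1]:
            j = pi[j]
        if needle[j] == needle[i - 1]:
            j += 1
        pi[i] = j
    return pi


def brute_force_comparisons(haystack, needle):
    """Return (match index or -1, number of character comparisons) for brute force."""
    count = 0
    for start in range(len(haystack) - len(needle) + 1):
        for k in range(len(needle)):
            count += 1
            if haystack[start + k] != needle[k]:
                break
        else:  # inner loop finished without a mismatch
            return start, count
    return -1, count


def kmp_comparisons(haystack, needle):
    """Return (match index or -1, number of character comparisons) for KMP."""
    pi = compute_pi(needle)
    count = 0
    j = 0
    for i, ch in enumerate(haystack):
        while True:
            count += 1
            if ch == needle[j]:
                j += 1
                break
            if j == 0:
                break                     # no shorter prefix left to try
            j = pi[j]                     # fall back and compare again
        if j == len(needle):
            return i - j + 1, count
    return -1, count


haystack, needle = "A" * 30, "AAAAAB"
print(brute_force_comparisons(haystack, needle))  # (-1, 150)
print(kmp_comparisons(haystack, needle))          # (-1, 55)
```

On this input brute force re-reads five 'A' characters at every alignment, while the KMP scan never moves backwards in the main string.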

Four, reference materials


  1. For a pattern string needle of length m, "all possible partially matched substrings" here means needle[0:k] for k ∈ [0, 1, ···, m - 1].

  2. Many tutorials and books call this array next.

  3. In the code, the main string is named haystack. This borrows the English idiom "find a needle in a haystack": the main string is likened to the haystack and the pattern string to the needle.

Origin blog.csdn.net/weixin_37780776/article/details/112912439