String matching algorithm--BF/RK/BM/KMP algorithm notes


I. Overview

As the name implies, string matching is the operation of finding an occurrence of a target string (the pattern string) in a main string.
Traditional string matching algorithms can be grouped into prefix search, suffix search, and substring search.

This article walks through four common algorithms, BF, RK, BM, and KMP, covering the deduction process and the analysis of each.


II. BF algorithm

BF stands for Brute Force. The algorithm takes the simple, crude approach of comparing the main string and the pattern string character by character.

2.1 Deduction process

Main string: GTTATAGCTGGTAGCGGCGAA
Pattern string: GTAGCGGCG

  1. In the first round, the pattern string is compared with the first equal-length substring of the main string (GTTATAGCT). The character at index 0 matches, the character at index 1 matches, and the character at index 2 does not (T in the main string, A in the pattern string).

  2. The pattern string is moved one position to the right and compared with the second equal-length substring of the main string. This time the character at index 0 of the pattern string already fails to match.

  3. ...and so on, until the Nth round.

  4. Once the pattern string has been moved to the right position (here, index 10 of the main string) and every character matches, the comparison ends.
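
The rounds above can be sketched in Python roughly as follows. This is a minimal illustration rather than a reference implementation; the function name bf_search is chosen here for the example and does not come from the original article.

```python
def bf_search(main: str, pattern: str) -> int:
    """Return the index of the first match of pattern in main, or -1."""
    m, n = len(main), len(pattern)
    for start in range(m - n + 1):
        # Compare the pattern with the substring that begins at `start`.
        offset = 0
        while offset < n and main[start + offset] == pattern[offset]:
            offset += 1
        if offset == n:  # every character matched
            return start
    return -1


print(bf_search("GTTATAGCTGGTAGCGGCGAA", "GTAGCGGCG"))  # 10
```

Each failed round moves the pattern string by exactly one position, which is the source of the inefficiency discussed next.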

2.2 Algorithm analysis
As the deduction above shows, the idea behind the BF algorithm is simple and clear, and it is still usable for relatively short strings or in settings where efficiency matters little. Its shortcoming is just as obvious: it is inefficient. After each round the pattern string can only be shifted one position to the right, so many unnecessary comparisons are made.

For example:
Main string: aaaaaaaaaaaaaaaaaaaab
Pattern string: aaab

In this case, in every round the first three characters a of the pattern string match the main string, and the mismatch is only discovered when the last character b of the pattern string is checked.
Assuming the length of the main string is m and the length of the pattern string is n, the worst-case time complexity of the BF algorithm in this extreme case is O(mn).
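
To make the O(mn) bound concrete, the following sketch counts character comparisons for the example above. The helper name bf_count_comparisons is an assumption made for this illustration.

```python
from typing import Tuple


def bf_count_comparisons(main: str, pattern: str) -> Tuple[int, int]:
    """Return (match index or -1, number of character comparisons made)."""
    m, n = len(main), len(pattern)
    comparisons = 0
    for start in range(m - n + 1):
        offset = 0
        while offset < n:
            comparisons += 1
            if main[start + offset] != pattern[offset]:
                break
            offset += 1
        if offset == n:
            return start, comparisons
    return -1, comparisons


main = "aaaaaaaaaaaaaaaaaaaab"
pattern = "aaab"
index, comparisons = bf_count_comparisons(main, pattern)
# In this example every alignment costs len(pattern) comparisons,
# so the total equals (m - n + 1) * n.
print(index, comparisons, (len(main) - len(pattern) + 1) * len(pattern))
```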


III. RK algorithm

RK stands for Rabin-Karp, named after the two inventors of the algorithm, Rabin and Karp. It improves on the BF algorithm by comparing the hash values of two strings instead of the strings themselves.

What does it mean to compare hash values?
Anyone who has used a hash table knows that a string can be converted into an integer by a hash algorithm. This integer is the hashcode: hashcode = hash(string).
Comparing the hashcodes of two strings is a single integer comparison, which is much cheaper than comparing the two strings character by character.

3.1 Deduction process

Main string: abbcefgh
Pattern string: bce

  • Generate the hashcode of the pattern string. There are two ways to do this:
    1) Method 1: Add position by position
    This is the simplest method. Treat a as 1, b as 2, c as 3, and so on, then add up all the characters of the string; the sum is the hashcode. For example, bce = 2+3+5 = 10. Although this algorithm is simple, collisions are likely. For example, bce, bec, and cbe all have the same hashcode.

    2) Method 2: Convert to a base-26 number
    Since the strings consist of the 26 lowercase letters, each string can be treated as a base-26 number.
    bce = 2*(26^2) + 3*26 + 5 = 1435. The advantage is that hash collisions are greatly reduced. The disadvantage is that the computation is heavier and the result may exceed the integer range, in which case it has to be reduced with a modulo operation.

    For the convenience of demonstration, we use the additive hashcode here, so the hashcode of bce is 10.

  • Generate the hashcode of the first equal-length substring of the main string
    abb = 1+2+2 = 5
    Compare the two hashcodes: 5 != 10, so the pattern string does not match the first substring. Continue to the next round.

  • Generate the hashcode of the second equal-length substring of the main string
    bbc = 2+2+3 = 7
    Compare the two hashcodes: 7 != 10, so the pattern string does not match the second substring. Continue to the next round.

  • Generate the hashcode of the third equal-length substring of the main string
    bce = 2+3+5 = 10
    Compare the two hashcodes: 10 == 10, so the two hashcodes are equal.
    Because hash collisions are possible, further verification is required.

  • Compare the two strings character by character
    The hashcode comparison is only a preliminary check. After it succeeds, the two strings still have to be compared character by character, as in the BF algorithm, to determine whether they really match.
    Finally, it is concluded that the pattern string bce is a substring of the main string abbcefgh, and its first occurrence is at index 2.
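
The two hashcode methods and the walkthrough above can be sketched as follows. The function names and the modulus 1_000_000_007 are assumptions made for this illustration; the article only says that some modulo is needed. For simplicity, this version hashes every window from scratch.

```python
def additive_hash(s: str) -> int:
    """Method 1: a=1, b=2, ..., z=26, summed over the string."""
    return sum(ord(ch) - ord("a") + 1 for ch in s)


def base26_hash(s: str, modulus: int = 1_000_000_007) -> int:
    """Method 2: read the string as a base-26 number, reduced modulo `modulus`."""
    value = 0
    for ch in s:
        value = (value * 26 + (ord(ch) - ord("a") + 1)) % modulus
    return value


def rk_search(main: str, pattern: str) -> int:
    """Rabin-Karp with the additive hash; returns the first match index or -1."""
    n = len(pattern)
    target = additive_hash(pattern)
    for start in range(len(main) - n + 1):
        window = main[start:start + n]
        # Equal hashcodes are only a hint; confirm with a real comparison.
        if additive_hash(window) == target and window == pattern:
            return start
    return -1


print(additive_hash("bce"), additive_hash("bec"), additive_hash("cbe"))  # 10 10 10
print(base26_hash("bce"))  # 1435
print(rk_search("abbcefgh", "bce"))  # 2
```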

3.2 Algorithm analysis

  • The time complexity of each hash is O(n). If every substring is hashed, isn't the total time complexity O(mn), the same as BF?
    Answer: The hashes of the substrings are not independent. Starting from the second substring, the hash of each substring can be derived from the hash of the previous one with a simple incremental calculation. For example:
    Suppose the hashcode of the substring abbcefg is known to be 26. How do we calculate the hashcode of the next substring bbcefgd?
    There is no need to add up all the characters of the substring again. Because the new substring drops an a at the front and gains a d at the end:
    new hashcode = old hashcode - 1 + 4 = 26 - 1 + 4 = 29

    The next substring bcefgde is calculated the same way, dropping a b and gaining an e:
    new hashcode = old hashcode - 2 + 5 = 29 - 2 + 5 = 32

  • Algorithm time complexity
    Computing the hash of a single substring from scratch takes O(n), but because each subsequent substring hash is computed incrementally in constant time, hashing all substrings takes O(m) in total. As long as hash collisions are rare, the overall time complexity of the RK algorithm is therefore O(m).
    Compared with the BF algorithm, the RK algorithm compares hash values and so avoids many unnecessary character comparisons, which greatly improves the running time.

  • Disadvantages of the RK algorithm
    Whenever there is a hash collision, the RK algorithm must compare the substring and the pattern string character by character. If there are too many collisions, the RK algorithm degenerates into the BF algorithm.
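
The incremental calculation can be sketched like this, using the additive hashcode from section 3.1. The names char_value and rolling_additive_hashes are illustrative, and the input string abbcefgde is simply the shortest string that contains the three substrings used in the example above.

```python
from typing import Iterator


def char_value(ch: str) -> int:
    """Map a=1, b=2, ..., z=26."""
    return ord(ch) - ord("a") + 1


def rolling_additive_hashes(main: str, n: int) -> Iterator[int]:
    """Yield the additive hashcode of every length-n window of main."""
    if n <= 0 or n > len(main):
        return
    current = sum(char_value(ch) for ch in main[:n])  # O(n), done only once
    yield current
    for end in range(n, len(main)):
        # Drop the character leaving on the left, add the one entering on the right.
        current += char_value(main[end]) - char_value(main[end - n])
        yield current


print(list(rolling_additive_hashes("abbcefgde", 7)))  # [26, 29, 32]
```

Each step after the first costs a constant amount of work regardless of n, which is where the linear total comes from.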


IV. BM algorithm

The BM algorithm is named after its two inventors, Bob Boyer and J Strother Moore.
The BM algorithm uses the "bad character rule" and the "good suffix rule" to move the pattern string as far as possible in each round of comparison, eliminating many of the unnecessary comparisons made by the BF algorithm.

4.1 Deduction process

  1. Bad character rule
    A "bad character" is the character in the main string's substring that fails to match the pattern string.
    Example (main string GTTATAGCTGGTAGCGGCGAA and pattern string GTAGCGGCG, as in section 2.1):
    1) When the pattern string is compared with the first equal-length substring of the main string, the last character T of the substring is a bad character.

    Question: Why is the bad character not the T at index 2 of the main string? Isn't that position checked first?

    Answer: The BM algorithm checks in reverse order, starting from the rightmost character and moving left. The point of this order is that once a bad character is found, we do not need to move the pattern string one position at a time. The two can only match if the character of the pattern string aligned with the bad character T is also a T.

    2) The character at index 1 of the pattern string is also T. So we can shift the pattern string a long way in one step, aligning the T in the pattern string directly with the bad character in the main string, and then start the next round of comparison.
    The further right the bad character is, the longer the shift of the pattern string in the next round, and the more comparisons are saved. This is the benefit of the BM algorithm comparing from right to left.

    3) Next, continue comparing character by character from right to left. The GCG on the right matches, and then the character A in the main string is found to be a bad character.
    Following the same method, the character at index 2 of the pattern string is also A, so the A in the pattern string is aligned with the bad character in the main string and the next round of comparison begins.

    4) Continue comparing character by character from right to left. This time all characters match, and the comparison is complete.

    5) Note: When the bad character does not occur in the pattern string at all, move the pattern string directly to the position just after the bad character in the main string.

  2. Good suffix rule
    A "good suffix" is the suffix of the pattern string that has already matched the corresponding suffix of the substring.

    Example:
    1) In the first round of comparison, the main string's substring and the pattern string are found to share the suffix "GCG". This is the so-called "good suffix".

    If another part of the pattern string contains the same segment "GCG", the pattern string can be moved so that this segment is aligned with the good suffix, and the next round of comparison begins.

    2) Move the pattern string accordingly.

    In this example, the good suffix rule lets the pattern string move further to the right, saving more unnecessary comparisons.

    3) What if no other segment of the pattern string is the same as the good suffix? Can the pattern string be moved directly past the good suffix?
    The answer is no.
    We cannot move the pattern string directly past the good suffix. A special case has to be checked first: whether some prefix of the pattern string matches a suffix of the good suffix. If it does, the pattern string is moved only far enough to align the two, so that it does not move too far.
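
The bad character rule from item 1 can be sketched on its own as follows. This is an illustration under assumptions: the function name is invented here, the table stores only the rightmost index of each character in the pattern string, and the shift is forced to be at least 1 because the raw bad character shift can be zero or negative when that rightmost occurrence lies to the right of the mismatch.

```python
from typing import List, Tuple


def bad_character_search(main: str, pattern: str) -> Tuple[int, List[int]]:
    """Search using only the bad character rule.

    Returns (match index or -1, list of alignments that were tried).
    """
    last_index = {ch: i for i, ch in enumerate(pattern)}  # rightmost occurrence
    m, n = len(main), len(pattern)
    tried = []
    start = 0
    while start <= m - n:
        tried.append(start)
        j = n - 1
        while j >= 0 and main[start + j] == pattern[j]:
            j -= 1  # compare from right to left
        if j < 0:
            return start, tried
        # Align the rightmost occurrence of the bad character with it.
        # A character absent from the pattern maps to -1, which moves the
        # pattern just past the bad character.
        shift = j - last_index.get(main[start + j], -1)
        start += max(1, shift)
    return -1, tried


print(bad_character_search("GTTATAGCTGGTAGCGGCGAA", "GTAGCGGCG"))
# (10, [0, 7, 10])
```

The three alignments 0, 7, and 10 correspond to the three rounds of the bad character walkthrough.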

4.2 Algorithm analysis
When should the bad character rule be used, and when the good suffix rule?

Answer: After each round of character comparison, compute the shift distance given by the bad character rule and the one given by the good suffix rule, then move the pattern string by whichever distance is larger.
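
Combining the two rules might look like the sketch below. It follows one common textbook formulation rather than anything given in the original article: the helper names and the suffix/prefix tables are assumptions, and the good suffix handling is the simple variant described in section 4.1 (another occurrence of the good suffix, otherwise a prefix that matches a suffix of the good suffix).

```python
def build_good_suffix_tables(pattern):
    """suffix[k]: start of another occurrence of the length-k suffix, or -1.
    prefix[k]: True if the length-k suffix is also a prefix of the pattern."""
    n = len(pattern)
    suffix = [-1] * n
    prefix = [False] * n
    for end in range(n - 1):
        j, k = end, 0
        # Extend the common suffix of pattern[:end + 1] and the whole pattern.
        while j >= 0 and pattern[j] == pattern[n - 1 - k]:
            j -= 1
            k += 1
            suffix[k] = j + 1
        if j == -1:
            prefix[k] = True
    return suffix, prefix


def good_suffix_shift(j, n, suffix, prefix):
    """Shift for a mismatch at pattern index j, where j < n - 1."""
    k = n - 1 - j  # length of the good suffix
    if suffix[k] != -1:
        return j - suffix[k] + 1  # align the other occurrence
    for r in range(j + 2, n):
        if prefix[n - r]:  # a suffix of the good suffix is a prefix
            return r
    return n  # nothing reusable: move past the whole window


def bm_search(main, pattern):
    """Return the index of the first match, or -1."""
    m, n = len(main), len(pattern)
    if n == 0:
        return 0
    last_index = {ch: i for i, ch in enumerate(pattern)}
    suffix, prefix = build_good_suffix_tables(pattern)
    start = 0
    while start <= m - n:
        j = n - 1
        while j >= 0 and main[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            return start
        bad_shift = j - last_index.get(main[start + j], -1)
        good_shift = good_suffix_shift(j, n, suffix, prefix) if j < n - 1 else 0
        start += max(bad_shift, good_shift)  # take the larger of the two
    return -1


print(bm_search("GTTATAGCTGGTAGCGGCGAA", "GTAGCGGCG"))  # 10
```

When a good suffix exists, its shift is always at least 1, so taking the maximum also protects against a zero or negative bad character shift.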


V. KMP algorithm

The KMP algorithm is named after the three computer scientists who invented it: D. E. Knuth, J. H. Morris, and V. R. Pratt. The goal of the algorithm is to move the pattern string as far as possible in each round, thereby reducing unnecessary character comparisons. The KMP algorithm focuses on the "matched prefix".

5.1 Deduction process
The KMP algorithm opens the same way as the BF algorithm: align the first characters of the main string and the pattern string, and compare character by character from left to right.

  • First round: The pattern string is compared with the first equal-length substring of the main string. The first 5 characters match, and the 6th character does not; it is a "bad character".
    How can the matched prefix "GTGTG" be put to use? In the prefix "GTGTG", the last three characters "GTG" and the first three characters "GTG" are the same.
    In the next round, a match is only possible if these two identical fragments are aligned. The two fragments are called the longest matchable suffix substring and the longest matchable prefix substring, respectively.

  • Second round: Move the pattern string two positions to the right so that the two "GTG" fragments are aligned, and continue comparing from the bad character A of the main string.
    The character A of the main string is still a bad character. The matched prefix is now shortened to "GTG".
    Following the same reasoning as in the first round, determine the longest matchable suffix substring and the longest matchable prefix substring again. For "GTG", both are "G".

  • Third round: Move the pattern string two positions to the right again so that the two Gs are aligned, and continue comparing from the bad character A of the main string.
    That is the overall idea of the KMP algorithm: within the matched prefix, find the longest matchable suffix substring and the longest matchable prefix substring, and align them directly in the next round, so that the pattern string moves quickly.
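
The "longest matchable" fragments used in each round can be found by brute force, as in this small sketch. The function name is an assumption for illustration, and a real KMP implementation precomputes these lengths instead of searching for them each time.

```python
def longest_matchable_length(matched_prefix: str) -> int:
    """Length of the longest proper prefix of the string that is also its suffix."""
    for length in range(len(matched_prefix) - 1, 0, -1):
        if matched_prefix[:length] == matched_prefix[-length:]:
            return length
    return 0


for matched in ("GTGTG", "GTG"):
    k = longest_matchable_length(matched)
    # The pattern string moves by (matched length - k) positions.
    print(matched, matched[:k], len(matched) - k)
# GTGTG GTG 2
# GTG G 2
```

The shift of 2 in both lines matches the two-position moves in the second and third rounds.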

5.2 Algorithm analysis

  • How do we find the "longest matchable suffix substring" and the "longest matchable prefix substring" of a matched prefix? Do we have to traverse it again in every round?

    Answer: There is no need to traverse it every time. The results can be computed in advance and cached in a collection, then looked up when needed. This collection is called the next array. Generating the next array is the hardest part of the KMP algorithm.

  • Next array
    The next array is a one-dimensional integer array. The index of the array represents "the position just after the matched prefix", and the value of the element is "the position just after the longest matchable prefix substring".

  • Backtracking
    Backtracking applies to the pattern string. How far it backtracks depends on the length of the longest prefix and suffix shared within the matched part of the pattern string.
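
One way to build the next array and use it for searching is sketched below. The example strings are assumptions: the article's figures are not available, so a hypothetical pattern string GTGTGCF and a hypothetical main string beginning with GTGTGA are used, chosen only to be consistent with the walkthrough in section 5.1. The function names are likewise illustrative.

```python
def build_next(pattern):
    """next_table[i] is the length of the longest matchable prefix substring
    of pattern[:i], i.e. where to resume in the pattern after a mismatch at i."""
    next_table = [0] * len(pattern)
    j = 0
    for i in range(2, len(pattern)):
        # Fall back until pattern[j] can extend the current matchable prefix.
        while j != 0 and pattern[j] != pattern[i - 1]:
            j = next_table[j]
        if pattern[j] == pattern[i - 1]:
            j += 1
        next_table[i] = j
    return next_table


def kmp_search(main, pattern):
    """Return the index of the first match, or -1."""
    if not pattern:
        return 0
    next_table = build_next(pattern)
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(main):
        while j > 0 and ch != pattern[j]:
            j = next_table[j]  # backtrack in the pattern only
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - len(pattern) + 1
    return -1


pattern = "GTGTGCF"                 # hypothetical pattern string
main = "GTGTGAGCTGGTGTGTGCFAA"      # hypothetical main string
print(build_next(pattern))          # [0, 0, 0, 1, 2, 3, 0]
print(kmp_search(main, pattern))    # 12
```

The index into the main string never moves backward; only the position in the pattern string backtracks, by following the next array.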


Reprinted: https://blog.csdn.net/bjweimengshu/article/details/104528964?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.control&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog BlogCommendFromMachineLearnPai2-1.control

Origin blog.csdn.net/locahuang/article/details/110186766