"Algorithm"-string pattern matching algorithm 3 [BM algorithm]

1. Overview

The BM (Boyer-Moore) algorithm is a very efficient string matching algorithm. According to experimental statistics, it is 3 to 4 times as fast as the well-known KMP algorithm, and the larger the text, the more pronounced its advantage. The "find" function (Ctrl+F) in many text editors is based on the Boyer-Moore algorithm.

2. Principle

The BM algorithm aligns the pattern string with the main string (the text) and compares them from the end of the pattern string towards its front. When a character of the main string does not match the pattern string, the distance the pattern string slides to the right is determined by two rules, the bad character rule and the good suffix rule, and the larger of the two distances is used. Both tables are built from the pattern string alone, so the "bad character table" and the "good suffix table" can be computed in advance. During the search, look up both tables and take the larger shift.

2.1 Bad character rules

What does "bad character" mean? It is the character of the main string at the position where the main string and the pattern string fail to match. For example, if the main string character T is compared with a different character of the pattern string, T is the bad character.

Bad character rule
When a character in the main string does not match the corresponding character in the pattern string, the mismatched character in the main string is called the bad character, and the pattern string must be moved to the right. Shift = position in the pattern string aligned with the bad character - rightmost position at which the bad character occurs in the pattern string.

Note:

  • If the "bad character" is not included in the pattern string, we record the rightmost occurrence position as -1.
  • The position is counted from 0
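
The formula and the two notes above can be sketched in a few lines of Python. The helper name `bad_char_shift` and the sample strings are assumptions made for illustration; they do not come from the figures of this article.

```python
def bad_char_shift(pattern, mismatch_pos, bad_char):
    """Shift given by the bad character rule (positions count from 0)."""
    # Rightmost position of the bad character in the pattern, -1 if absent
    rightmost = pattern.rfind(bad_char)
    return mismatch_pos - rightmost

# 'S' is compared with the last character of "EXAMPLE" (position 6)
# and does not occur in the pattern: 6 - (-1) = 7
print(bad_char_shift("EXAMPLE", 6, "S"))  # 7

# 'P' is compared with position 6 and occurs at position 4: 6 - 4 = 2
print(bad_char_shift("EXAMPLE", 6, "P"))  # 2
```

If the rightmost occurrence lies to the right of the mismatch position, the result is negative; in that situation the rule gives no usable shift and the other rule has to decide.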

Rationale
Depending on how the bad character occurs in the pattern string, there are three cases. The movement of the pattern string in each case is discussed below.

1. The bad character occurs in the pattern string exactly once

  • The BM algorithm compares from the end of the pattern string. Suppose the main string character T, aligned with the last pattern character G, does not match it, so T is the bad character. Because the last position already fails, this alignment cannot be a match and the pattern string must move to the right. T does occur in the pattern string, so we shift until the T in the pattern string is aligned with the bad character in the main string. Shift = 8 - 1 = 7, because the bad character T is aligned with position 8 of the pattern string (counting from 0) and T occurs in the pattern string at position 1.

2. The bad character occurs in the pattern string more than once

  • When the bad character occurs several times in the pattern string, align its rightmost occurrence in the pattern string with the bad character.
  • Why the rightmost one?
  • If the bad character were aligned with an occurrence further to the left, the pattern string would move too far and a match could be skipped.
  • Aligning the rightmost occurrence gives the smallest shift, so no match is skipped.
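
The point about the rightmost occurrence can be checked on a small made-up case. The strings `text` and `pattern` and the helper `bad_char_positions` below are assumptions chosen for illustration, not the example from the original figures.

```python
def bad_char_positions(pattern, bad_char):
    """All positions of bad_char in the pattern, excluding its last character."""
    return [k for k, c in enumerate(pattern[:-1]) if c == bad_char]

text = "QXAYAZ"
pattern = "XAYAZ"
mismatch_pos = len(pattern) - 1          # first comparison: text 'A' vs pattern 'Z'
bad_char = text[mismatch_pos]            # 'A'
positions = bad_char_positions(pattern, bad_char)   # [1, 3]

shift_rightmost = mismatch_pos - max(positions)     # 4 - 3 = 1
shift_leftmost = mismatch_pos - min(positions)      # 4 - 1 = 3
real_match = text.find(pattern)                     # the pattern really starts at 1

print(shift_rightmost, shift_leftmost, real_match)  # 1 3 1
```

Shifting by 3 would jump over the match that starts at index 1, while shifting by 1 lands exactly on it.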

3. The bad character does not occur in the pattern string

  • If the bad character does not occur in the pattern string, move the pattern string so that it starts at the position just after the bad character in the main string.

  • First, the head of the pattern string is aligned with the head of the main string, and the comparison starts from the tail. "T" does not match "G", so "T" is the bad character; it is aligned with position 8 of the pattern string. "T" does not occur in the pattern string (its rightmost position is taken as -1), so the pattern string can be shifted by 8 - (-1) = 9 positions, which places it directly after "T".
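
To see the three cases at work together, here is a minimal sketch of a search that uses only the bad character rule. It is a simplification for illustration, not the full BM algorithm; the function name `search_bad_char_only` and the sample strings are assumptions.

```python
def search_bad_char_only(text, pattern):
    """Search using only the bad character rule; returns (matches, shifts)."""
    n, m = len(text), len(pattern)
    # Rightmost position of each character, last pattern character excluded
    last = {c: k for k, c in enumerate(pattern[:-1])}
    matches, shifts = [], []
    i = 0
    while m > 0 and i <= n - m:
        j = m - 1
        # Compare from the end of the pattern towards the front
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(i)
            i += 1
        else:
            # -1 stands for "bad character not in the pattern"
            shift = max(1, j - last.get(text[i + j], -1))
            shifts.append(shift)
            i += shift
    return matches, shifts

print(search_bad_char_only("HERE IS A SIMPLE EXAMPLE", "EXAMPLE"))
# ([17], [7, 2, 3, 5])
```

The shifts 7 and 3 come from bad characters that are absent from the pattern, while 2 and 5 come from bad characters that occur in it.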

2.2 Good suffix rules

What does "good suffix" mean? It is the suffix of the pattern string that has already matched the main string when a mismatch is found.

Good suffix rule

When a character mismatches, shift = position of the good suffix in the pattern string - position of the previous (rightmost) occurrence of the good suffix in the pattern string.

Note:

  • The position of a "good suffix" is that of its last character. If "EF" in "ABCDEF" is the good suffix, its position is that of "F", which is 5 (counting from 0).
  • If the "good suffix" occurs only once in the pattern string, the position of its previous occurrence is -1. For example, "EF" occurs only once in "ABCDEF", so its previous position is -1 (it never occurred before).
  • If there are several "good suffixes", then for all except the longest one the previous occurrence must be at the head of the pattern string. For example, suppose the good suffixes of "BABCDAB" are "DAB", "AB" and "B". "DAB" occurs only once, and of the shorter ones only "B" sits at the head, so the good suffix used is "B" and its previous position is the head, position 0. The rule can also be stated this way: if the longest good suffix occurs only once, the pattern string can be rewritten as "(DA)BABCDAB" for the position calculation, that is, a virtual "DA" is added in front.
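
The three notes can be turned into a small sketch. The helper names `previous_position` and `good_suffix_shift` are assumptions introduced here; the sample strings are the ones used in the notes above, plus "bcabab", which appears later in the code section.

```python
def previous_position(pattern, good_suffix):
    """Position (of its last character) of the previous occurrence, or -1."""
    m = len(pattern)
    # The whole good suffix occurring again, ending before the final character
    start = pattern.rfind(good_suffix, 0, m - 1)
    if start != -1:
        return start + len(good_suffix) - 1
    # Otherwise only shorter suffixes sitting at the head of the pattern count
    for length in range(len(good_suffix) - 1, 0, -1):
        if pattern.startswith(good_suffix[-length:]):
            return length - 1
    return -1

def good_suffix_shift(pattern, good_suffix):
    """Shift = position of the good suffix - position of its previous occurrence."""
    return (len(pattern) - 1) - previous_position(pattern, good_suffix)

print(good_suffix_shift("ABCDEF", "EF"))    # 5 - (-1) = 6
print(good_suffix_shift("BABCDAB", "DAB"))  # 6 - 0 = 6, using "B" at the head
print(good_suffix_shift("bcabab", "ab"))    # 5 - 3 = 2
```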

Rationale

  • If the good suffix occurs again elsewhere in the pattern string, shift the pattern string so that this earlier occurrence is aligned with the part of the main string that the good suffix matched, then compare again from the last character of the pattern string towards the front.

  • If no other part of the pattern string matches the good suffix, and no shorter suffix of it sits at the head of the pattern string, shift the entire pattern string to the right by its full length.

3. Code implementation

3.1 Bad character table

  • Record, for each character of the pattern string, the rightmost position at which it occurs (positions start from 0).
  • Because the BM algorithm starts comparing from the end, the last character of the pattern string is left out when the rightmost positions are recorded.
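# Bad character table
def getBMBC(T):
    '''
    Build the bad character table as a dict.
    Each character overwrites the value stored for it earlier,
    so the table ends up holding its rightmost position.
    '''
    BMBC=dict()
    # The last character of the pattern string is excluded
    for i in range(len(T)-1):
        char=T[i]
        # Record the rightmost position seen so far
        BMBC[char]=i
    return BMBC
if __name__=="__main__":
    t="abcdabd"
    print(getBMBC(t))
# Result: {'a': 4, 'b': 5, 'c': 2, 'd': 3}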

3.2 Good suffix table

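  • Record the shift for each good suffix
  • Shift = position of the good suffix in the pattern string - position of the previous (rightmost) occurrence of the good suffix in the pattern string
  • When there is no good suffix (the very first comparison fails), the shift is 0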
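# Good suffix table
def getBMGS(T):
    m=len(T)
    BMGS=dict()
    # With no good suffix, only the bad character rule decides the shift
    BMGS['']=0
    for i in range(m-1):
        # Good suffix of length i+1
        GS=T[m-i-1:]
        # Default: nothing can be reused, so move the whole pattern string
        BMGS[GS]=m
        # A shorter suffix of GS counts only at the head of the pattern string;
        # k runs upward so the longest such prefix wins
        for k in range(1,i+1):
            if T[:k]==T[m-k:]:
                BMGS[GS]=m-k
        for j in range(m-i-1):
            # Candidate earlier occurrence of the whole good suffix
            NGS=T[j:j+i+1]
            if GS==NGS:
                # The last (rightmost) occurrence overwrites earlier ones
                BMGS[GS]=m-j-i-1
    return BMGS
if __name__=="__main__":
    t="bcabab"
    print(getBMGS(t))
# Result: {'': 0, 'b': 2, 'ab': 2, 'bab': 5, 'abab': 5, 'cabab': 5}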

3.3 BM algorithm

  • Build the bad character table and the good suffix table of the pattern string
  • On a mismatch, compute the shift given by each rule
  • Move the pattern string by the larger of the two shifts and compare again
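# Bad character table
def getBMBC(T):
    '''
    Build the bad character table as a dict.
    Each character overwrites the value stored for it earlier,
    so the table ends up holding its rightmost position.
    '''
    BMBC=dict()
    # The last character of the pattern string is excluded
    for i in range(len(T)-1):
        char=T[i]
        # Record the rightmost position seen so far
        BMBC[char]=i
    return BMBC
# Good suffix table
def getBMGS(T):
    m=len(T)
    BMGS=dict()
    # With no good suffix, only the bad character rule decides the shift
    BMGS['']=0
    for i in range(m-1):
        # Good suffix of length i+1
        GS=T[m-i-1:]
        # Default: nothing can be reused, so move the whole pattern string
        BMGS[GS]=m
        # A shorter suffix of GS counts only at the head of the pattern string;
        # k runs upward so the longest such prefix wins
        for k in range(1,i+1):
            if T[:k]==T[m-k:]:
                BMGS[GS]=m-k
        for j in range(m-i-1):
            # Candidate earlier occurrence of the whole good suffix
            NGS=T[j:j+i+1]
            if GS==NGS:
                # The last (rightmost) occurrence overwrites earlier ones
                BMGS[GS]=m-j-i-1
    return BMGS

# BM algorithm

def BM(S,T):
    m=len(T)
    n=len(S)
    indices=[]
    # An empty pattern, or one longer than the text, yields no matches
    if m==0 or m>n:
        return indices
    BMBC=getBMBC(T)  # bad character table
    BMGS=getBMGS(T)  # good suffix table
    i=0  # position in S aligned with T[0]
    while i<=n-m:
        # Compare from the end of the pattern string towards the front
        j=m-1
        while j>=0 and S[i+j]==T[j]:
            j=j-1
        if j<0:
            # Every character matched: record the position, then move on by one
            indices.append(i)
            i+=1
        else:
            # Bad character rule: mismatch position minus rightmost position (-1 if absent)
            bc=j-BMBC.get(S[i+j],-1)
            # Good suffix rule: T[j+1:] is the part that already matched
            gs=BMGS[T[j+1:]]
            # Shift by the larger of the two, and by at least 1
            i+=max(bc,gs,1)
    return indices
if __name__=="__main__":
    s="abcabaabcabacbaab"
    t="baab"
    b=BM(s,t)
    print(b)
# Result: [4, 13]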
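
One way to check an implementation such as `BM` is to compare its output with a brute-force search on the same inputs. The helper below is only a sketch for that purpose; the name `naive_search` is an assumption and it is deliberately simple rather than fast.

```python
def naive_search(S, T):
    """Brute-force reference: every index where T occurs in S (overlaps included)."""
    if not T:
        return []
    return [i for i in range(len(S) - len(T) + 1) if S.startswith(T, i)]

print(naive_search("abcabaabcabacbaab", "baab"))  # [4, 13]
print(naive_search("aab", "ab"))                  # [1]
```

Short inputs in which the pattern overlaps itself, such as "aab" with "ab", are useful test cases because an oversized shift would skip the match.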

Reference materials:
1. https://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html

Origin blog.csdn.net/weixin_45666566/article/details/112463552