In-depth understanding of the BM algorithm

I shared the KMP algorithm with you before. This time I will share an algorithm that is usually faster in practice: the BM (Boyer-Moore) algorithm.

1. Introduction:
The Boyer-Moore string search algorithm is a very efficient string search algorithm. It was designed by Bob Boyer and J Strother Moore in 1977. The algorithm preprocesses only the pattern (the keyword being searched for), not the text being searched.
Next, we use Professor Moore's own example to explain the algorithm:

Main string: HERE IS A SIMPLE EXAMPLE
Pattern string: EXAMPLE

2. Basic principles

Main string: mainStr
Pattern string: patternStr
Index in the main string where the pattern is currently aligned: matchMainIndex
Index in the pattern string of the character being compared: matchPatternIndex

1. Bad character rule

Initially matchMainIndex = 0 and matchPatternIndex = 0, and the pattern is aligned with the start of the main string:

HERE IS A SIMPLE EXAMPLE
EXAMPLE

Comparison starts from the end of the pattern. If the last characters do not match, a single comparison is enough to know that the first 7 characters of the main string cannot be the result we are looking for. Here "S" does not match "E", so "S" is called a "bad character". "S" also does not occur anywhere in the pattern "EXAMPLE". Think about it: however far the pattern is shifted, as long as the shift is less than the length of the pattern, this "S" still faces some character of the pattern and can never equal it. So the pattern can jump straight past it, to matchMainIndex + len(patternStr), which gives the following alignment:

HERE IS A SIMPLE EXAMPLE
       EXAMPLE

Comparison again starts from the end, and "P" does not match "E", so "P" is a bad character. This time, however, "P" does occur in the pattern "EXAMPLE". The pattern is therefore shifted by two positions, so that the "P" in the main string lines up with the "P" in the pattern.
From the example above we can roughly conclude the following.
The bmBc array stores, for each character of the pattern, the smallest distance from an occurrence of that character to the end of the pattern, that is, the distance measured from its rightmost occurrence.
Shift given by the bad character rule = bmBc[c] - (m - 1 - i)
Here c is the bad character, i is the index in the pattern at which the mismatch occurred, and m is the length of the pattern. If c does not occur in the pattern at all, the shift is i + 1.

If the rightmost occurrence of the bad character in the pattern lies to the right of the mismatch position, bmBc[c] - (m - 1 - i) is negative, which would amount to moving the pattern backwards. This is where the good suffix rule comes in.
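
The bad character rule can be sketched in a few lines. This is only an illustration under stated assumptions: it works on single-byte (ASCII) strings, and buildBadCharTable and badCharShift are hypothetical helper names, not part of the implementation in section 3.

```php
<?php
// Hypothetical helper: distance from each character's rightmost
// occurrence in the pattern to the end of the pattern (the bmBc table).
function buildBadCharTable(string $pattern): array
{
    $m = strlen($pattern);
    $bmBc = [];
    for ($i = 0; $i < $m; $i++) {
        // Later occurrences overwrite earlier ones, so the rightmost wins.
        $bmBc[$pattern[$i]] = $m - 1 - $i;
    }
    return $bmBc;
}

// Hypothetical helper: shift suggested by the bad character rule when
// $badChar (from the main string) mismatches the pattern at index $i.
function badCharShift(array $bmBc, int $m, string $badChar, int $i): int
{
    if (!isset($bmBc[$badChar])) {
        return $i + 1; // character absent: move the pattern past it
    }
    return $bmBc[$badChar] - ($m - 1 - $i);
}

$bmBc = buildBadCharTable('EXAMPLE');
echo badCharShift($bmBc, 7, 'S', 6), "\n"; // 7: "S" is not in the pattern
echo badCharShift($bmBc, 7, 'P', 6), "\n"; // 2: "P" is 2 away from the end
echo badCharShift($bmBc, 7, 'E', 2), "\n"; // -4: negative, unusable on its own
```

The last call shows the negative case: the rightmost "E" sits at the very end of the pattern, to the right of the mismatch at index 2, so the rule alone would move the pattern backwards.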

2. Good suffix rule

Continuing the matching above, we reach the following alignment:

HERE IS A SIMPLE EXAMPLE
         EXAMPLE

"MPLE" is now a good suffix, that is, the run of characters at the end of the pattern that has matched. Note that "MPLE", "PLE", "LE" and "E" are all good suffixes.
Next, "I" does not match "A", so "I" is a bad character.
According to the bad character rule, the pattern should be shifted by 2 + 1 = 3 positions ("I" does not occur in the pattern and the mismatch is at index 2).
The good suffix rule distinguishes three cases:
(1) The good suffix occurs again elsewhere in the pattern. Shift the pattern so that this occurrence lines up with the good suffix. If the good suffix occurs more than once, align the rightmost occurrence.
For example, if the pattern is MPLEEXAMPLE, the good suffix MPLE also occurs at the head of the pattern.
(2) The longest good suffix does not occur again in the pattern. In that case, any shorter good suffix that can still be used must occur at the head of the pattern. (Think about it: take MPLE above as an example. Suppose MPLE does not occur again and "PLE" occurs only in the middle of the pattern, not at the head. Then the character before that "P" cannot be "M", because otherwise the whole of MPLE would occur again, which contradicts the assumption. Aligning that "PLE" would therefore place a character other than "M" under the "M" that the main string is already known to contain.) So find the longest prefix of the pattern that equals a suffix of the good suffix, and align it.
For example, if the good suffix is MPLE and MPLE does not occur again in the pattern, but PLE is a good suffix and occurs at the head of the pattern, then PLE is that longest prefix.
(3) The good suffix does not occur again in the pattern, and no prefix of the pattern equals any suffix of the good suffix. The shift is then the length of the pattern.
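
The three cases can be checked by brute force, without any precomputed table. The sketch below is only an illustration, assuming single-byte (ASCII) strings; naiveGoodSuffixShift is a hypothetical name and is not how the implementation in section 3 computes its shifts.

```php
<?php
// Hypothetical helper: shift for a mismatch at pattern index $i, derived
// directly from the three cases (brute force, no precomputed table).
function naiveGoodSuffixShift(string $pattern, int $i): int
{
    $m = strlen($pattern);
    $k = $m - 1 - $i;                 // length of the good suffix
    if ($k === 0) {
        return 1;                     // nothing has matched yet: no good suffix
    }
    $goodSuffix = substr($pattern, $i + 1);

    // Case 1: rightmost earlier occurrence of the whole good suffix.
    for ($start = $i; $start >= 0; $start--) {
        if (substr($pattern, $start, $k) === $goodSuffix) {
            return $i + 1 - $start;
        }
    }
    // Case 2: longest prefix of the pattern that equals a suffix of the good suffix.
    for ($len = $k - 1; $len >= 1; $len--) {
        if (substr($pattern, 0, $len) === substr($pattern, $m - $len)) {
            return $m - $len;
        }
    }
    // Case 3: nothing can be reused, so skip the whole pattern.
    return $m;
}

echo naiveGoodSuffixShift('MPLEEXAMPLE', 6), "\n"; // case 1: 7
echo naiveGoodSuffixShift('EXAMPLE', 2), "\n";     // case 2: 6
echo naiveGoodSuffixShift('ABCD', 1), "\n";        // case 3: 4
```

This version follows the three cases exactly as worded above. The precomputed bmGs table in section 3 can give a larger shift for some patterns, because the suffixes array it is built from also takes into account the character in front of the reoccurring substring.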

Summary:
shift = position of the good suffix - position of its previous occurrence in the pattern
For example, suppose the trailing "AB" of the string "ABCDAB" is the good suffix. Its position is 5 (counting from 0 and taking the index of the last "B"), and its previous occurrence in the pattern is at position 1 (the index of the first "B"). The shift is therefore 5 - 1 = 4 positions, which moves the first "AB" to where the trailing "AB" was.
In the algorithm, the suffixes array stores, for each index i of the pattern, the length of the longest substring ending at i that is also a suffix of the pattern.
The bmGs array stores, for each index i of the pattern, the shift given by the good suffix rule when the mismatch occurs at i.
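
To make the definition of the suffixes array concrete, here is a small sketch that computes it directly from the definition. It assumes single-byte (ASCII) strings, and computeSuffixes is a hypothetical name used only for this illustration.

```php
<?php
// Hypothetical helper: $suffixes[$i] is the length of the longest substring
// ending at index $i that is also a suffix of the whole pattern.
function computeSuffixes(string $pattern): array
{
    $m = strlen($pattern);
    $suffixes = array_fill(0, $m, 0);
    for ($i = 0; $i < $m; $i++) {
        $len = 0;
        // Walk backwards from $i and from the end of the pattern in lockstep.
        while ($len <= $i && $pattern[$i - $len] === $pattern[$m - 1 - $len]) {
            $len++;
        }
        $suffixes[$i] = $len;
    }
    return $suffixes;
}

echo implode(',', computeSuffixes('ABCDAB')), "\n";  // 0,2,0,0,0,6
echo implode(',', computeSuffixes('EXAMPLE')), "\n"; // 1,0,0,0,0,0,7
```

For "ABCDAB", the value 2 at index 1 says that the two characters ending there ("AB") also end the pattern, which is exactly the previous occurrence used in the summary example.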

Applying this to the match above:
the good suffix is MPLE, but only "E" occurs at the head of the pattern, so the shift is 6 - 0 = 6 positions (6 is the index in the pattern of the last character of the good suffix, and 0 is the index of the matching prefix character "E" at the head of the pattern). This is larger than the 3 positions given by the bad character rule, so the pattern is shifted by 6:

HERE IS A SIMPLE EXAMPLE
               EXAMPLE

Comparing from the end again, "P" does not match "E", so "P" is a bad character. According to the bad character rule, the pattern is shifted by 6 - 4 = 2 positions:

HERE IS A SIMPLE EXAMPLE
                 EXAMPLE

The match succeeds.
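
The whole walkthrough can be replayed with a short tracer that records every alignment it visits. This is a sketch under stated assumptions: single-byte (ASCII) strings, a brute-force check for the good suffix instead of the precomputed tables, and hypothetical names (suffixFitsAfterShift, traceAlignments).

```php
<?php
// Hypothetical helper: after shifting the pattern right by $shift, do the
// pattern characters that land under the good suffix (indices > $i) agree with it?
function suffixFitsAfterShift(string $pattern, int $i, int $shift): bool
{
    for ($j = strlen($pattern) - 1; $j > $i && $j - $shift >= 0; $j--) {
        if ($pattern[$j - $shift] !== $pattern[$j]) {
            return false;
        }
    }
    return true;
}

// Hypothetical tracer: returns every alignment (main string index of the
// pattern's first character) visited up to and including the first match.
function traceAlignments(string $main, string $pattern): array
{
    $n = strlen($main);
    $m = strlen($pattern);
    $visited = [];
    $pos = 0;
    while ($m > 0 && $pos <= $n - $m) {
        $visited[] = $pos;
        $i = $m - 1;
        while ($i >= 0 && $main[$pos + $i] === $pattern[$i]) {
            $i--;                       // compare from the end of the pattern
        }
        if ($i < 0) {
            break;                      // the whole pattern matched
        }
        // Bad character rule: line up the rightmost occurrence of the bad
        // character, or jump past it if it is not in the pattern at all.
        $last = strrpos($pattern, $main[$pos + $i]);
        $badShift = $i - ($last === false ? -1 : $last);
        // Good suffix rule by brute force: smallest shift that keeps the
        // already matched suffix consistent (a shift of $m always does).
        $goodShift = 1;
        while ($goodShift < $m && !suffixFitsAfterShift($pattern, $i, $goodShift)) {
            $goodShift++;
        }
        $pos += max($badShift, $goodShift);
    }
    return $visited;
}

echo implode(',', traceAlignments('HERE IS A SIMPLE EXAMPLE', 'EXAMPLE')), "\n";
// 0,7,9,15,17
```

The shifts between successive alignments are 7, 2, 6 and 2, the same steps as in the walkthrough above.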

3. PHP implementation

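<?php
/**
 * Boyer-Moore (BM) string search
 */
class BM
{
    public function execute($mainStr, $patternStr)
    {
        // Bad character table: character => distance from its rightmost occurrence to the end of the pattern
        $bmBc = array();
        // Good suffix table: pattern index => shift to apply when the mismatch occurs there
        $bmGs = array();
        // Length of the main string in UTF-8 characters
        $mainStrLen = mb_strlen($mainStr, 'UTF-8');
        // Length of the pattern string in UTF-8 characters
        $patternStrLen = mb_strlen($patternStr, 'UTF-8');
        // Index in the main string where the pattern is currently aligned
        $matchMainIndex = 0;
        // Index in the pattern string of the character being compared
        $matchPatternIndex = 0;
        // Whether at least one match has been found
        $isMatch = false;
        $mainStrChar = '';

        // An empty pattern has no shift tables and would loop forever below,
        // so it is treated here as "no match" (a choice made for this listing)
        if ($patternStrLen === 0) {
            return false;
        }

        self::preBmBc($patternStr, $patternStrLen, $bmBc);

        self::preBmGs($patternStr, $patternStrLen, $bmGs);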
        // Keep matching while the remaining main string is at least as long as the pattern
        while ($matchMainIndex <= $mainStrLen - $patternStrLen) {
            $matchPatternIndex = $patternStrLen - 1;
            while ($matchPatternIndex >= 0) {
                $patternStrChar = mb_substr($patternStr, $matchPatternIndex, 1, 'UTF-8');

                // Index in the main string of the character being compared
                $tempMatchPatternIndex = $matchPatternIndex + $matchMainIndex;
                // Character of the main string being compared
                $mainStrChar = mb_substr($mainStr, $tempMatchPatternIndex, 1, 'UTF-8');
                // Mismatch: stop comparing and work out how far to shift
                if ($patternStrChar !== $mainStrChar) {
                    break;
                }
                else {
                    $matchPatternIndex--;
                }
            }

            // The whole pattern matched
            if ($matchPatternIndex < 0) {
                echo 'Match found at main string index: ' . "$matchMainIndex\n";
                $isMatch = true;
                $matchMainIndex += $bmGs[0];
            }
            else {
                // Bad character shift: bmBc[c] - (m - 1 - i), or i + 1 if c is not in the pattern
                $tempBmBcIndex = isset($bmBc[$mainStrChar]) ? $bmBc[$mainStrChar] - $patternStrLen + 1 + $matchPatternIndex : $matchPatternIndex + 1;
                // Use the larger of the good suffix shift and the bad character shift
                $matchMainIndex += max($bmGs[$matchPatternIndex], $tempBmBcIndex);
            }
        }
        if (!$isMatch) {
            echo "No match\n";
        }
        return $isMatch;
    }

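    /**
     * Builds the bad character table.
     *
     * @param $patternStr string pattern string
     * @param $patternStrLen int length of the pattern string in UTF-8 characters
     * @param $bmBc array character => distance from its rightmost occurrence to the end of the pattern
     */
    public static function preBmBc($patternStr, $patternStrLen, &$bmBc)
    {
        for ($index = 0; $index < $patternStrLen; $index++) {
            // Current character
            $patternStrChar = mb_substr($patternStr, $index, 1, 'UTF-8');
            // Later occurrences overwrite earlier ones, so only the last (rightmost) occurrence is kept
            $bmBc[$patternStrChar] = $patternStrLen - 1 - $index;
        }
    }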

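    /**
     * Computes the suffixes array.
     *
     * @param $patternStr string pattern string
     * @param $patternStrLen int length of the pattern string
     * @param $suffixes array for each index, the longest match walking backwards from that index against the end of the pattern
     */
    public static function suffixes($patternStr, $patternStrLen, &$suffixes)
    {
        // The last index matches the whole pattern, so its value is the pattern length
        $suffixes[$patternStrLen - 1] = $patternStrLen;
        // Largest valid index
        $indexLen = $patternStrLen - 1;
        for ($index = $patternStrLen - 2; $index >= 0; $index--) {
            // Start comparing at the current index
            $matchPreLen = $index;

            while ($matchPreLen >= 0) {
                // Character at the current comparison index
                $currentChar = mb_substr($patternStr, $matchPreLen, 1, 'UTF-8');

                // Corresponding index near the end of the pattern
                $patternStrIndex = $indexLen - $index + $matchPreLen;
                // Character near the end of the pattern to compare against
                $patternStrChar = mb_substr($patternStr, $patternStrIndex, 1, 'UTF-8');

                // Stop at the first mismatch
                if ($currentChar !== $patternStrChar) {
                    break;
                }
                // Move the comparison index one position towards the front
                $matchPreLen--;
            }
            // Number of characters that matched, walking backwards from $index against the end of the pattern
            $suffixes[$index] = $index - $matchPreLen;
        }
    }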

    /**
     * Builds the good suffix table.
     *
     * @param $patternStr string pattern string
     * @param $patternStrLen int length of the pattern string
     * @param $bmGs array pattern index => shift to apply when the mismatch occurs there
     */
    public static function preBmGs($patternStr, $patternStrLen, &$bmGs)
    {
        // For each index, the longest match walking backwards against the end of the pattern
        $suffixes = array();
        self::suffixes($patternStr, $patternStrLen, $suffixes);

        // Case 3: default every shift to the full pattern length
        for ($index = 0; $index < $patternStrLen; $index++) {
            $bmGs[$index] = $patternStrLen;
        }
        // Next bmGs index that has not yet received a case 2 shift
        $preCurrentGoodStrIndex = 0;
        // Case 2: walk from back to front so that longer prefixes (smaller shifts) are recorded first
        for ($index = $patternStrLen - 1; $index >= 0; $index--) {
            // The pattern's first $index + 1 characters equal its last $index + 1 characters,
            // so this prefix is also a suffix
            if ($suffixes[$index] == $index + 1) {
                // Record the shift for every position in front of patternStrLen - 1 - index
                for (; $preCurrentGoodStrIndex < $patternStrLen - 1 - $index; $preCurrentGoodStrIndex++) {
                    // Only positions that have not been assigned yet
                    if ($bmGs[$preCurrentGoodStrIndex] == $patternStrLen) {
                        $bmGs[$preCurrentGoodStrIndex] = $patternStrLen - 1 - $index;
                    }
                }
            }
        }
        // Case 1: the good suffix occurs again inside the pattern; record the shift
        for ($index = 0; $index < $patternStrLen - 1; $index++) {
            $bmGs[$patternStrLen - 1 - $suffixes[$index]] = $patternStrLen - 1 - $index;
        }
    }
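}

// microtime(true) is used because time() only has one-second resolution
$startTime = microtime(true);
$objBM = new BM();
$objBM->execute('HERE IS A SIMPLE EXAMPLE', 'EXAMPLE');
$endTime = microtime(true);
echo 'Script execution time: ' . ($endTime - $startTime) . 's';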

Notes on the code that generates bmGs (here x is the pattern string and m is patternStrLen):
The first for loop in preBmGs, which sets every bmGs entry to patternStrLen, corresponds to the third case: no prefix of the pattern equals any suffix of the good suffix.
The second for loop, which runs from the back of the pattern to the front, corresponds to the second case:
Q: Why does this for loop traverse from back to front, starting at index = patternStrLen - 1?
A: If two positions index1 and index2 (index1 > index2) both satisfy the condition of the second case, then m-1-index1 < m-1-index2. The check bmGs[preCurrentGoodStrIndex] == patternStrLen guarantees that each position is modified at most once, and the value it should receive is the smaller shift, m-1-index1. Traversing from back to front handles the larger index first, which is why the loop runs in that direction.
Q: What does the condition suffixes[index] == index + 1 mean?
A: It finds a suitable position. By the definition of suffixes, x[index+1-suffixes[index]...index] == x[m-suffixes[index]...m-1]. When suffixes[index] == index+1, the left-hand side becomes x[0...index], which is a prefix of the pattern (a substring of the good suffix occurs at the head of the pattern), so the second case applies.
Q: What does the inner for loop do?
A: It performs the assignment for the second case. The check bmGs[preCurrentGoodStrIndex] == patternStrLen inside it ensures that each position is modified at most once.
The third for loop corresponds to the first case:
Why does it run from front to back, that is, with index increasing? If suffixes[index1] == suffixes[index2] and index1 < index2, then m-1-index1 > m-1-index2, and the latter (the rightmost occurrence) should be the value of bmGs[m-1-suffixes[index1]]. With index increasing, the later, smaller value overwrites the earlier one.
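
To see what each of the three loops contributes, the sketch below rebuilds the table and keeps a copy after every stage. It assumes single-byte (ASCII) strings; bmGsStages is a hypothetical name, and the pattern "ABXABYAB" is chosen only because it exercises all three cases.

```php
<?php
// Hypothetical helper mirroring the three loops of preBmGs; returns the
// table after each stage so the effect of every loop is visible.
function bmGsStages(string $pattern): array
{
    $m = strlen($pattern);
    // suffixes[$i]: longest substring ending at $i that is also a suffix.
    $suffixes = [];
    for ($i = 0; $i < $m; $i++) {
        $len = 0;
        while ($len <= $i && $pattern[$i - $len] === $pattern[$m - 1 - $len]) {
            $len++;
        }
        $suffixes[$i] = $len;
    }
    $stages = [];

    // Stage 1 (case 3): the default shift is the full pattern length.
    $bmGs = array_fill(0, $m, $m);
    $stages[] = $bmGs;

    // Stage 2 (case 2): a prefix of the pattern equals a suffix of the good suffix.
    $j = 0;
    for ($i = $m - 1; $i >= 0; $i--) {
        if ($suffixes[$i] === $i + 1) {
            for (; $j < $m - 1 - $i; $j++) {
                if ($bmGs[$j] === $m) {
                    $bmGs[$j] = $m - 1 - $i;
                }
            }
        }
    }
    $stages[] = $bmGs;

    // Stage 3 (case 1): the good suffix occurs again inside the pattern.
    for ($i = 0; $i < $m - 1; $i++) {
        $bmGs[$m - 1 - $suffixes[$i]] = $m - 1 - $i;
    }
    $stages[] = $bmGs;
    return $stages;
}

foreach (bmGsStages('ABXABYAB') as $stage) {
    echo implode(',', $stage), "\n";
}
// 8,8,8,8,8,8,8,8
// 6,6,6,6,6,6,8,8
// 6,6,6,6,6,3,8,1
```

In the last stage, index 5 is written twice: first with 6 (from index 1) and then with 3 (from index 4). The later write wins, which is the "take the one on the far right" behaviour described in the notes.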

Origin blog.csdn.net/weixin_43885417/article/details/112589547