String Matching Algorithms in Detail

To keep the code rigorous, all the code in this article has been accepted (AC) on LeetCode, so you can use it with confidence.

On the emperor's birthday, the whole country celebrated. As the number one restaurant in the land, Yuanji Restaurant was chosen to supply the dishes for the banquet. This was an unprecedented challenge: it was the first time Yuanji Restaurant had catered the emperor's birthday, and the slightest carelessness could cost someone his head. The whole restaurant was nervously making preparations when a waiter suddenly ran up to Chef Yuan in a panic. What had made the waiter so flustered?

Inside Yuanji Restaurant

Waiter: Bad news, bad news, boss! Something serious has happened.

Chef Yuan: What happened? Speak slowly. Why are you so flustered? (After running the restaurant for so long, he has started putting on airs.)

Waiter: His Majesty ordered 666 dishes from our menu, but the chef who makes West Lake Vinegar Fish has taken leave and gone home to get married. I don't know whether the emperor ordered that dish. If he did, the restaurant is finished.

(Hearing this, Chef Yuan collapsed to the ground in fright. After a while he said:)

Chef Yuan: Stop talking and quickly check whether this dish is among the dishes the emperor ordered!

After a long search and many rounds of checking, they finally confirmed that the emperor had not ordered the dish. Everyone in the restaurant breathed a sigh of relief.

The story above gives a rough idea of what string matching is. Now let's take a closer look.

String matching: let S and T be two given strings. The process of finding the pattern string T in the main string S is called string matching. If T is found in S, the match succeeds and the function returns the position of the first occurrence of T in S; otherwise the match fails and the function returns -1.

Example:

In the figure above, we look for the first occurrence of the pattern string T = baab in the main string S = abcabaabcabac, shown as the red shaded part. The index at which T first appears in S is 4 (string indices start at 0), so we return 4. If T does not appear in S, we return -1.
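To make the expected result concrete, here is a quick check using Python's built-in str.find, which returns the lowest index of a substring or -1 if it is absent. This is only a sanity check of the definition, not one of the algorithms discussed in this article; the variable names S and T simply mirror the text.

```python
# Main string and pattern string from the example above.
S = "abcabaabcabac"
T = "baab"

# str.find returns the index of the first occurrence, or -1 if there is none.
print(S.find(T))      # 4
print(S.find("xyz"))  # -1
```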

An algorithm that solves this problem is called a string matching algorithm. Today we introduce three of them. Remember to check in. You never know, an interviewer might ask about them.

BF algorithm (Brute Force)

This algorithm is easy to understand. We compare the pattern string with the main string character by character; as long as the characters match, we continue with the next character until the whole pattern has been compared. If a pair of characters does not match, we shift the pattern one position to the right, start comparing again from the first character of the pattern, and repeat the steps above. Let's look at an animation of this method, which should make it clear.

After the animation above, the algorithm should be clear. Let's use it to solve the following classic problem.

LeetCode 28. Implement strStr()

Problem description

Given a haystack string and a needle string, find the first position (starting from 0) at which the needle string occurs in the haystack string. Return -1 if it does not occur.

Example 1:

Input: haystack = "hello", needle = "ll"
Output: 2

Example 2:

Input: haystack = "aaaaa", needle = "bba"
Output: -1

Problem analysis

The problem itself is easy to understand, but two points need attention: what to return when the pattern string has length 0 (LeetCode expects 0), and what to return when the pattern string is longer than the main string (-1, since it cannot fit). Let's look at the code.

Solution code
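The original code listing did not survive in this copy, so here is a minimal Python sketch of the BF approach described above. The function name str_str is my own choice, and this is one possible way to write it rather than the author's exact solution.

```python
def str_str(haystack: str, needle: str) -> int:
    """Brute-force search: try every alignment of needle against haystack."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0   # an empty pattern is found at position 0
    if m > n:
        return -1  # the pattern cannot fit into the main string
    # i is the position in haystack where the pattern currently starts.
    for i in range(n - m + 1):
        j = 0
        # Compare character by character while they keep matching.
        while j < m and haystack[i + j] == needle[j]:
            j += 1
        if j == m:
            return i  # the whole pattern matched
        # Otherwise shift the pattern one position to the right (next i).
    return -1


print(str_str("hello", "ll"))   # 2
print(str_str("aaaaa", "bba"))  # -1
```

In the worst case this compares up to m characters at each of the n - m + 1 alignments, which is what the later algorithms try to avoid.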

Let's look at another way of writing the BF algorithm (explicit backtracking). The principle is the same; only the code changes slightly. After watching the animation it should also be clear at a glance. Read it together with the comments in the code and the animation.
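A sketch of the explicit-backtracking form, assuming two indices i (into the main string) and j (into the pattern) that are rewound on a mismatch. The function name str_str_backtrack is hypothetical.

```python
def str_str_backtrack(haystack: str, needle: str) -> int:
    """Brute-force search written with explicit backtracking of both indices."""
    n, m = len(haystack), len(needle)
    i = 0  # current index in haystack
    j = 0  # current index in needle
    while i < n and j < m:
        if haystack[i] == needle[j]:
            # Characters match: advance both indices.
            i += 1
            j += 1
        else:
            # Mismatch: rewind i to one past where this attempt started,
            # and restart the pattern from its first character.
            i = i - j + 1
            j = 0
    # If j reached m, the match started at i - j; an empty needle yields 0.
    return i - j if j == m else -1


print(str_str_backtrack("hello", "ll"))   # 2
print(str_str_backtrack("aaaaa", "bba"))  # -1
```

The line i = i - j + 1 is the backtracking step: the main-string index moves back, which is exactly the wasted work KMP removes.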

BM algorithm (Boyer-Moore)

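We have just covered the BF algorithm, but it has a weakness, as the following situation shows.

As shown in the figure above, with the BF algorithm, every time we hit a mismatched character we shift the pattern one position to the right and start matching from the beginning again. But notice that every character in our pattern abcdex is different. In the first attempt, abcde all match and the comparison fails at x. Because no character repeats in the pattern, there is no need to shift by one position each time and compare again; we can skip some of the steps directly, as in the figure below.

We can skip some of these steps and go straight to the step below. But what principle do we follow?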

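Bad character rule

The BF algorithm compares from front to back, whereas the BM algorithm compares from back to front. Let's walk through the process, still using the example above.

Because BM compares from back to front, we find that the very first character compared already fails to match. We call this character of the main string the bad character; here it is f. Having found the bad character, we check whether the pattern string T contains f. It does not, so we simply shift the pattern to the position just after the bad character, as in the figure below.

What should we do if the bad character does occur in the pattern? See the figure below.

Here the bad character is f, and searching the pattern we find that it contains f. We then shift the pattern T so that the f in the pattern lines up with the bad character. See the figure below.

We then continue comparing from right to left and find that d is a bad character, so we align the d in the pattern with the bad character.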

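Now consider another case: what if the pattern contains the bad character more than once? In that case we align the rightmost occurrence in the pattern with the bad character.

Why align the rightmost matching character with the bad character? Let's see what goes wrong in the example above if we do not follow this rule.

If we do not follow the rule, we skip over a real match. The main string does contain babac, yet the match fails. So we should stick to the rule of aligning the rightmost corresponding character with the bad character.

So far we have covered three kinds of shift: the pattern contains no character corresponding to the bad character, it contains one, or it contains two. Each case shifts by a different number of positions. What determines the shift distance? To find out, let's add indices to the characters in the figure. See the figure below.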
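Since the figure is missing here, the following sketch shows one common way to compute the bad character shift. It assumes the usual formulation: if the mismatch happens at pattern index j and the rightmost occurrence of the bad character in the pattern is at index x (or -1 if it does not occur), the shift is j - x. The helper names and the sample patterns are my own, not taken from the original figures.

```python
def build_bad_char(pattern: str) -> dict:
    """Map each character to the index of its rightmost occurrence in pattern."""
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = i  # later occurrences overwrite earlier ones
    return table


def bad_char_shift(pattern: str, j: int, bad_char: str) -> int:
    """Shift suggested by the bad character rule for a mismatch at pattern index j."""
    x = build_bad_char(pattern).get(bad_char, -1)  # -1: character not in pattern
    return j - x


# Bad character absent from the pattern: move the pattern past it.
print(bad_char_shift("abcdex", 5, "f"))  # 6
# Bad character present once: align that occurrence with it.
print(bad_char_shift("abcdex", 5, "c"))  # 3
# The rightmost occurrence lies to the right of the mismatch: negative shift.
print(bad_char_shift("baaa", 0, "a"))    # -3
```

The last call shows that this rule alone can suggest a non-positive shift, so it cannot be used on its own.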

Now let's consider the following case.

This clearly won't do: the pattern does not move right and may even move left. Is there a way to fix this? Read on.

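Good suffix rule

The good suffix is also easy to understand. As mentioned, BM compares from right to left. Let's look at the following example.

Here shifting by the bad character rule is unreasonable, so we can use the good suffix rule instead. What is a good suffix?

BM compares from right to left. When the bad character is found, at the red shaded position, cac has already matched. This matched cac is our good suffix. We search the pattern for it, and if we find another substring that matches the good suffix, we slide that substring into alignment with the good suffix.

Does that sound like a mouthful? No problem, look at the figure below, where red marks the bad character and green marks the good suffix.

That case is clear now, but consider the following one.

As noted above, if the whole good suffix is not found at the head of the pattern, finding a substring of the good suffix there also works. But why the emphasis on the head?

Let's look at the following case.

But what happens when we find a substring of the good suffix at the head?

Below, an animation walks through the execution of one example step by step.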

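That wraps up the bad character and good suffix rules. The bad character rule is easy to understand, so let's summarize the good suffix rule.

1. If the good suffix occurs elsewhere in the pattern, whether in the middle or at the head, we can shift according to the rule. If it occurs more than once, use the rightmost occurrence as the reference.

2. If the head of the pattern contains a substring of the good suffix (that is, a suffix of the good suffix that is also a prefix of the pattern), we can shift according to the rule; if such a substring appears only in the middle, we cannot.

3. If the mismatch occurs at the very end of the pattern, so that no good suffix exists, shift according to the bad character rule. Many articles leave this out, and it deserves special attention. I found the answer in the following paper, which interested readers can look up.

Boyer R S, Moore J S. A fast string searching algorithm[J]. Communications of the ACM, 1977, 20(10): 762-772.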

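Recall that the bad character rule on its own can produce a negative value, that is, a shift to the left. To solve this, we compute the shift given by the good suffix rule (when a good suffix exists) and the shift given by the bad character rule separately, then take the larger of the two as the number of positions to slide the pattern.

These figures were a real pain to draw. Now let's look at the code. It is a bit long, but I have commented all of it and it has been accepted (AC) on the website. Take a look if you are interested; if not, understanding the bad character and good suffix rules is enough, and you can skip straight to the KMP section.

Let's go over the arrays used in the code. Because the shift distance under both rules depends only on the pattern and not on the main string, we can compute the shift for every case in advance and store the results in arrays.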
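The author's listing is not included in this copy. Below is a Python sketch of BM that follows the rules summarized above. The array names bc, suffix and prefix match the ones mentioned later in the article, but their exact layout here is my assumption: bc maps a character to its rightmost index, suffix[k] holds the start index of another occurrence of the length-k suffix (or -1), and prefix[k] records whether the length-k suffix is also a prefix of the pattern.

```python
def build_good_suffix(pattern):
    """Precompute suffix[k] and prefix[k] for every suffix length k."""
    m = len(pattern)
    suffix = [-1] * m
    prefix = [False] * m
    for i in range(m - 1):
        j, k = i, 0
        # Extend the common suffix of pattern[0..i] and the whole pattern.
        while j >= 0 and pattern[j] == pattern[m - 1 - k]:
            j -= 1
            k += 1
            suffix[k] = j + 1  # larger i overwrites: rightmost occurrence wins
        if j == -1:
            prefix[k] = True   # the length-k suffix is also a prefix
    return suffix, prefix


def good_suffix_shift(j, m, suffix, prefix):
    """Shift for a mismatch at pattern index j (good suffix is pattern[j+1:])."""
    k = m - 1 - j                  # length of the good suffix
    if suffix[k] != -1:            # rule 1: the good suffix occurs elsewhere
        return j - suffix[k] + 1
    for r in range(j + 2, m):      # rule 2: a shorter suffix matches the head
        if prefix[m - r]:
            return r
    return m                       # nothing reusable: skip the whole pattern


def bm_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    bc = {ch: i for i, ch in enumerate(pattern)}  # rightmost index per character
    suffix, prefix = build_good_suffix(pattern)
    i = 0                          # alignment of the pattern in the text
    while i <= n - m:
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1                 # compare from right to left
        if j < 0:
            return i
        x = j - bc.get(text[i + j], -1)      # bad character shift
        y = 0
        if j < m - 1:              # rule 3: a good suffix exists only here
            y = good_suffix_shift(j, m, suffix, prefix)
        i += max(x, y)
    return -1


print(bm_search("abcabaabcabac", "baab"))  # 4
```

When the mismatch is at the last pattern character, the rightmost occurrence of the bad character lies to its left (or is absent), so x is at least 1; otherwise y is at least 1. Taking the maximum therefore always moves the pattern to the right.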

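KMP algorithm (Knuth-Morris-Pratt)

We have just covered the BM algorithm. It is not especially easy, but a careful reading will get you there. Now for another algorithm, one that always appears on the postgraduate entrance exam. BM and KMP are the same in essence, so once you understand BM, understanding KMP takes only minutes.

Let's start with an example.

Note: to make this easier to follow, we show the pattern moving instead of the pointer moving. Relative to the main string the two are equivalent, and in both cases comparison resumes from the pointer position.

With the example above, the idea behind KMP should be quick to grasp. Let's continue.

The example above introduced a term, the longest common prefix-suffix. What does it mean? Let's describe it with a simpler example.

Here the match fails at the red shaded position, and green marks the part that matched. Let's look at the matched part.

Here are all the prefixes and suffixes of the matched part.

The longest common prefix-suffix is shown in the figure below, so this is how we need to shift.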
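To make the term concrete without the figures, here is a small brute-force sketch that lists the proper prefixes and suffixes of a matched part and picks the longest string common to both. The sample string "abcab" is a hypothetical example of mine, not the one from the original figure.

```python
def longest_common_prefix_suffix(s: str) -> str:
    """Longest string that is both a proper prefix and a proper suffix of s."""
    # Proper prefixes and suffixes exclude the whole string itself.
    prefixes = [s[:k] for k in range(1, len(s))]
    suffixes = [s[-k:] for k in range(1, len(s))]
    common = [p for p in prefixes if p in suffixes]
    return max(common, key=len, default="")


matched = "abcab"
print([matched[:k] for k in range(1, len(matched))])   # ['a', 'ab', 'abc', 'abca']
print([matched[-k:] for k in range(1, len(matched))])  # ['b', 'ab', 'cab', 'bcab']
print(longest_common_prefix_suffix(matched))           # ab
```

With "ab" as the longest common prefix-suffix, the pattern can be shifted so that its leading "ab" lines up with the trailing "ab" of the matched part, and comparison resumes right after it.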

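OK, after the figures above, the core principle of KMP is mostly settled. The remaining question is: how do we find the length of the longest common prefix-suffix, and how do we know how far to shift?

As we said for BM, the shift distance does not depend on the main string, only on the pattern, through the values in the bc, suffix and prefix arrays, which tell us how far to move each time. KMP has an array too, called the next array. What does it store?

For each position of the pattern, the next array stores the index of the last character of the prefix in the longest common prefix-suffix. Does that sound a little awkward? Let's illustrate with an example.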
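A sketch of how such a next array can be built, assuming the convention just described: next[i] is the index of the last character of the prefix in the longest common prefix-suffix of pattern[0..i], and -1 means there is none. The function name build_next and the sample patterns are my own.

```python
def build_next(pattern: str) -> list:
    """next[i] = end index of the prefix in the longest common
    prefix-suffix of pattern[0..i], or -1 if there is none."""
    m = len(pattern)
    nxt = [-1] * m
    k = -1  # end index of the current candidate prefix
    for i in range(1, m):
        # Fall back to shorter prefixes until one can be extended by pattern[i].
        while k != -1 and pattern[k + 1] != pattern[i]:
            k = nxt[k]
        if pattern[k + 1] == pattern[i]:
            k += 1  # the prefix grows by one character
        nxt[i] = k
    return nxt


print(build_next("ababab"))  # [-1, -1, 0, 1, 2, 3]
print(build_next("aabaaf"))  # [-1, 0, -1, 0, 1, -1]
```

The length of the longest common prefix-suffix of pattern[0..i] is therefore nxt[i] + 1 under this convention.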

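Once we have the next array, KMP is easy to implement. Let's also look at what the next array is actually used for.

The rest needs no explanation, since it works exactly the same way. Here is the example above turned into an animation matching the one at the start.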

Now let's look at the code. It is commented in detail, so read it carefully.
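The original listing is missing from this copy, so here is a Python sketch of a KMP search under the next-array convention described above (next[i] is the end index of the matching prefix, -1 if none). The table construction is repeated inside the block so that it runs on its own; the names kmp_search and build_next are hypothetical.

```python
def build_next(pattern):
    """next[i] = end index of the prefix in the longest common
    prefix-suffix of pattern[0..i], or -1 if there is none."""
    nxt = [-1] * len(pattern)
    k = -1
    for i in range(1, len(pattern)):
        while k != -1 and pattern[k + 1] != pattern[i]:
            k = nxt[k]
        if pattern[k + 1] == pattern[i]:
            k += 1
        nxt[i] = k
    return nxt


def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    nxt = build_next(pattern)
    k = -1  # index of the last pattern character matched so far
    for i in range(n):  # the index into the main string never moves back
        # On a mismatch, reuse the longest prefix that is also a suffix
        # of the matched part instead of restarting from scratch.
        while k != -1 and pattern[k + 1] != text[i]:
            k = nxt[k]
        if pattern[k + 1] == text[i]:
            k += 1
        if k == m - 1:
            return i - m + 1  # start index of the first full match
    return -1


print(kmp_search("abcabaabcabac", "baab"))  # 4
print(kmp_search("hello", "ll"))            # 2
print(kmp_search("aaaaa", "bba"))           # -1
```

Unlike the explicit-backtracking BF version, i only moves forward here; only the pattern index k falls back, guided by the next array.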

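Note: textbooks define the next array in different ways; what matters is understanding the idea.

OK, that's all for now. I'm exhausted, so the remaining algorithms will have to wait. If you found this article helpful, likes, comments and shares are welcome. Oh wait, I don't have a comment feature yet. Haha.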

Origin blog.csdn.net/z_ssyy/article/details/128738424