String pattern matching - brute force (BF) and the KMP algorithm

Disclaimer: This is the blogger's original article. If you repost it, please credit the source ^_^ https://blog.csdn.net/cprimesplus/article/details/90374592

1. BF algorithm (brute force)

Initialize two indices, one into each string, and compare characters in a loop. If the two characters are equal, advance both indices and compare the next pair, until the whole pattern has matched. If they differ, reset the pattern index to 0, move the target index back to one position after the start of the current attempt, and resume comparing from there.

The code is as follows:

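#include<iostream>
#include<string>
using namespace std;
// Brute-force search for s2 in s1, starting at index pos (0-based).
// Returns the 1-based position of the first match, or -1 if there is none.
int BFMatching(const string& s1, const string& s2, int pos)
{
	int len1 = s1.length();
	int len2 = s2.length();
	if(len2 > len1 || !len2) return -1;
	int i = pos, j = 0;
	while(i<len1 && j<len2){
		if(s1[i] == s2[j]){         // characters agree: advance both indices
			j++;i++;
		}
		else{                       // mismatch: i returns to one past this attempt's start, j to 0
			i = i-j+1;j = 0;
		}
	}
	if(i<=len1 && j==len2) return i-j+1;    // i-j is the 0-based start of the match
	return -1;
}
int main()
{
	string s1 = "HarryPotter";
	string s2 = "rryP";
	cout<<BFMatching(s1, s2, 0);    // prints 3
	return 0;
}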

Is brute force really good enough? (How good can a method called "brute force" be?)

Take an extreme example: the target string is "ababababca" and the pattern string to match is "abababca". Consider the first attempt at matching the pattern.

It is easy to see that the first six characters all match. Just as victory seems within reach, something unfortunate happens: the i pointer has moved to an a, the j pointer has moved to the c, and the two characters mismatch. What now? Under the BF algorithm, the j pointer can only go back to the start of the pattern, and the i pointer fares little better: it has to resume from the position right after its original starting point.

The j pointer tracks how much of the pattern has matched, so rolling it back is understandable; after all, finding the position of an exact match seems to require some back-and-forth comparison. The i pointer, however, indexes the target string, and sending it back to just after its starting point on almost every failed attempt seems excessive.
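To put a number on that cost, here is a minimal sketch that instruments the brute-force loop. The `BFStats` struct and the `bf_with_stats` function are hypothetical helpers introduced only for illustration; unlike `BFMatching`, the sketch always starts at index 0 and reports a 0-based position.

```cpp
#include <string>

// Hypothetical helper (not from the article): the brute-force scan plus counters.
struct BFStats {
    int position;     // 0-based index of the first match, or -1
    int comparisons;  // number of character comparisons performed
    int rollbacks;    // number of times i was moved to a smaller index
};

BFStats bf_with_stats(const std::string& s1, const std::string& s2) {
    BFStats st{-1, 0, 0};
    std::size_t i = 0, j = 0;
    while (i < s1.size() && j < s2.size()) {
        ++st.comparisons;
        if (s1[i] == s2[j]) {
            ++i;
            ++j;
        } else {
            if (j > 1) ++st.rollbacks;  // i - j + 1 is smaller than i only when j > 1
            i = i - j + 1;              // restart one past the start of this attempt
            j = 0;
        }
    }
    if (!s2.empty() && j == s2.size()) st.position = static_cast<int>(i - j);
    return st;
}
```

For the strings used in the KMP `main` further down ("ababababca" and "abababca"), this reports 16 comparisons and one rollback before finding the match at 0-based index 2. Seven of those comparisons belong to the first attempt, and the rollback throws away everything that attempt learned.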

So, is there a way to keep i from rolling back?

This is where the KMP algorithm comes in.

 

2. KMP algorithm

Above we looked at a very bad case for the BF algorithm and posed the corresponding problem: design an algorithm in which i never rolls back toward its original starting point.

Call the target string P and the pattern string S. Looking closely at the figure above again, we notice something: within the region of P marked gray (the part that has already matched), the first four characters and the last four characters are the same, namely abab, and that part matches the first four characters of the pattern S exactly!

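This means we do not need to roll i back to its original starting point.

In other words, the first four gray-marked characters of S are exactly the last four characters of the gray part of P, and because these four characters have already been compared, there is no need to compare them again. j only needs to fall back to S[4], while i stays where it is and continues the comparison with S[4] and the elements after it.

The yellow region is the part still to be compared.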

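The approach above can be summed up in one passage (quoted from Yan Weimin's "Data Structures"): whenever a character comparison fails during a matching pass, the i pointer is not backtracked; instead, the "partial match" result already obtained is used to "slide" the pattern string to the right as far as possible, and then the comparison continues.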

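The "as far as possible" distance mentioned above comes from keeping as many already-matched characters aligned with the pattern as possible; the number of characters kept is the length we are after. For the figure above, that length is 4 (abab). These 4 characters can be skipped outright because they have already been compared: the four characters S[0] to S[3] are the same as the four characters P[2] to P[5], and P[2] to P[5] have already been compared.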

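How do we determine which character of the pattern S the comparison should resume from? This is where the partial match table, PMT (Partial Match Table), comes in.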

 

Computing the PMT

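Suppose we have the string $S_0S_1\ldots S_{n-1}S_n$.

Then the set of its prefixes is: $pre = \{"S_0", "S_0S_1", "S_0S_1S_2", \ldots, "S_0S_1\ldots S_{n-1}"\}$

The set of its suffixes is: $post = \{"S_1S_2\ldots S_n", "S_2S_3\ldots S_n", \ldots, "S_{n-1}S_n", "S_n"\}$

Let $i \in pre$ and $j \in post$.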

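Definition of the PMT value: $PMT = \max\{\,|i| : i \in pre,\ j \in post,\ i = j\,\}$, that is, the length of the longest string that is both a prefix and a suffix.

An example: for babcab, the prefix set is {{b},{ba},{bab},{babc},{babca}} and the suffix set is {{abcab},{bcab},{cab},{ab},{b}}. The intersection of the two sets is {{b}} ({ab} is a suffix but not a prefix, because the two-character prefix is {ba}), so the longest element of the intersection is {b} and the PMT value is 1.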
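The definition can be checked directly with a small sketch. The function name `pmt_value` is hypothetical; it simply tries every candidate length from longest to shortest, which is slow but mirrors the set-intersection definition rather than the efficient method used by KMP.

```cpp
#include <string>

// Hypothetical helper: length of the longest string that is both a proper
// prefix and a proper suffix of s, found by testing each candidate length.
int pmt_value(const std::string& s) {
    for (std::size_t len = s.empty() ? 0 : s.size() - 1; len > 0; --len) {
        // Compare the prefix of length len with the suffix of length len.
        if (s.compare(0, len, s, s.size() - len, len) == 0) {
            return static_cast<int>(len);
        }
    }
    return 0;  // no non-empty prefix is also a suffix
}
```

For the matched part of the earlier example, `pmt_value("ababab")` gives 4, which is the abab overlap that lets j fall back to index 4 instead of 0.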

 

For a specific algorithm to compute the PMT, see: https://blog.csdn.net/x__1998/article/details/79951598#commentsedit

The core idea of computing the PMT:

Take the string and compare it with a copy of itself shifted by one position, so that the upper line supplies the suffixes and the lower line supplies the prefixes compared against them. In other words, the pattern is matched against itself.
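The shifted self-comparison can be sketched as follows. This is one common way to write it, not necessarily the linked post's version, and the names `build_pmt` and `pmt_to_next` are hypothetical. The second function reflects an observation about the `next` array used in the KMP listing: it holds the same values as the PMT, shifted right by one position with -1 placed in front.

```cpp
#include <string>
#include <vector>

// Sketch: pmt[i] = length of the longest proper prefix of p[0..i] that is also its suffix.
std::vector<int> build_pmt(const std::string& p) {
    std::vector<int> pmt(p.size(), 0);
    std::size_t k = 0;  // length of the prefix currently matched against the suffix ending at i
    for (std::size_t i = 1; i < p.size(); ++i) {  // i walks the shifted (suffix) copy
        while (k > 0 && p[i] != p[k]) {
            k = static_cast<std::size_t>(pmt[k - 1]);  // fall back to the next shorter candidate
        }
        if (p[i] == p[k]) ++k;
        pmt[i] = static_cast<int>(k);
    }
    return pmt;
}

// Shift the PMT right by one and put -1 in front to get a next-style table.
std::vector<int> pmt_to_next(const std::vector<int>& pmt) {
    std::vector<int> next_tbl(pmt.size());
    for (std::size_t i = 0; i < pmt.size(); ++i) {
        next_tbl[i] = (i == 0) ? -1 : pmt[i - 1];
    }
    return next_tbl;
}
```

For the pattern "abababca", `build_pmt` yields 0 0 1 2 3 4 0 1 and `pmt_to_next` yields -1 0 0 1 2 3 4 0; the entry 4 at index 6 is the fallback position used after the mismatch on c in the earlier example.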

KMP algorithm code

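#include<iostream>
#include<string>
#define MAX_SIZE 1024
using namespace std;
int nxt[MAX_SIZE];                      // the next table; named nxt because "next" clashes with std::next here
// Fill nxt for the pattern s1. Assumes s1.length() <= MAX_SIZE.
void get_next(const string& s1)
{
	int len = s1.length();
	int i = 0, j = -1;
	nxt[0] = -1;                        // -1 means nothing precedes character 0, so no prefix/suffix match
	while(i < len-1){                   // stop once nxt[len-1] has been filled in
		if(j==-1 || s1[i]==s1[j]){      // when j is -1 the new entry is 0 (nxt[1] is always 0)
			i++;
			j++;
			nxt[i] = j;
		}
		else{
			j = nxt[j];                 // prefix and suffix disagree: jump j back to the position nxt gives
		}
	}
}
// Search for s2 in s1 starting at index pos (0-based); get_next(s2) must be called first.
// Returns the 1-based position of the first match, or -1 if there is none.
int KMP(const string& s1, const string& s2, int pos)
{
	int len1 = s1.length();
	int len2 = s2.length();
	if(!len2 || len2>len1) return -1;
	int i = pos, j = 0;
	while(i<len1 && j<len2){            // stop as soon as either string is exhausted
		if(j==-1 || s1[i]==s2[j]){
			i++;
			j++;
		}
		else{
			j = nxt[j];                 // i stays where it is; only j falls back
		}
	}
	if(j==len2 && i<=len1)
		return i-j+1;
	else
		return -1;
}
int main()
{
	string s1 = "ababababca";
	string s2 = "abababca";
	get_next(s2);
	cout<<KMP(s1, s2, 0)<<endl;         // prints 3
	return 0;
}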
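As a variation, the same idea can be written without the global table and the MAX_SIZE limit. The sketch below is an assumption-labelled alternative, not the article's code: `kmp_find` is a hypothetical name, the table lives in a local `std::vector`, and the result is a 0-based index (the listing above returns a 1-based position).

```cpp
#include <string>
#include <vector>

// Hypothetical variant: returns the 0-based index of the first occurrence of
// pat in text at or after pos, or -1 if there is none.
int kmp_find(const std::string& text, const std::string& pat, std::size_t pos = 0) {
    if (pat.empty() || pos > text.size() || pat.size() > text.size() - pos) return -1;

    const int n = static_cast<int>(text.size());
    const int m = static_cast<int>(pat.size());

    // nxt[j] = index that j falls back to after a mismatch at pat[j]; nxt[0] = -1.
    std::vector<int> nxt(pat.size(), -1);
    for (int i = 0, j = -1; i + 1 < m; ) {
        if (j == -1 || pat[i] == pat[j]) { ++i; ++j; nxt[i] = j; }
        else j = nxt[j];
    }

    int i = static_cast<int>(pos), j = 0;
    while (i < n && j < m) {
        if (j == -1 || text[i] == pat[j]) { ++i; ++j; }  // i only ever moves forward
        else j = nxt[j];                                  // only j falls back
    }
    return j == m ? i - j : -1;
}
```

A quick way to gain confidence in a sketch like this is to compare its result with `std::string::find` on a handful of inputs, treating `std::string::npos` as -1.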

 

