Three major O(n) algorithms for string matching: KMP, Manacher, and extended KMP

Preface

When people hear "string problem", the first reaction is usually: enumeration, hashing, DP, whatever comes to hand.

But what I want to talk about here are proper string algorithms, which are on a different level from the basic string handling you meet when first learning C++.

For example, for a problem with n≤1000 ordinary string operations are enough, but for a problem with n≤10000000 you need a string algorithm.

The three algorithms in this article all belong to the string matching part of string algorithms. First, some background that is worth knowing (or skip it directly):

String matching is one of the oldest and most widely studied problems in computer science. A string is a sequence of characters over a finite alphabet Σ. For example, ATCTAGAGA is a string over the alphabet Σ = {A,C,G,T}. The string matching problem is to find all occurrences of a string P in a larger string T. Here T is called the text and P the pattern, and both are defined over the same alphabet Σ.

Although string matching algorithms have been studied for decades, truly practical algorithms have only appeared in recent years. There is a gap between theoretical research and practical application in this field. Scholars who specialize in algorithms care about algorithms that look great in theory, that is, with good time complexity, while developers only want the algorithm that runs fastest in practice. The two sides rarely pay attention to what the other is doing. Algorithms that combine theoretical research with practical application (such as the BNDM algorithm) have only appeared in recent years. In practice it is often hard to find an algorithm that fits the need: such an algorithm actually exists, but only senior experts know about it.

Consider the following situation. A software developer, a computational biologist, a researcher, or a student has no deep understanding of string matching but now has to solve a text search problem. The overwhelming number of books on the subject drowns the reader in a sea of matching algorithms without giving enough knowledge to choose the most suitable one. In the end, the reader often just implements one of the simplest algorithms, which tends to perform poorly and hurts the quality of the whole system. Worse still, the reader may choose an algorithm that looks beautiful in theory, spend a lot of effort implementing it, and then find that in practice it is no better than a simple algorithm, or even worse. Therefore a "practical" algorithm should be chosen: one that performs well in real applications and that an ordinary programmer can implement within a few hours.

In addition, a well-known fact in string matching research is that "the simpler the idea of the algorithm, the better it works in practice".

Traditional string matching algorithms can be grouped into prefix search, suffix search, and substring search. Representative algorithms include KMP, Shift-And, Shift-Or, BM, Horspool, BNDM, and BOM. The techniques used include sliding windows, bit parallelism, automata, and suffix trees.

(quoted from "a certain Du", i.e. Baidu)

KMP

The KMP algorithm solves one kind of prefix search problem: for every prefix of string B, find the longest prefix of A that is also a suffix of that prefix of B.

Here we call B the text (the string being matched) and A the pattern.

The brute force method is plain enumeration. Suppose string B has been matched up to position i and string A up to position j:

  • If B_i = A_j, then the answer for prefix i is j;
  • If B_i ≠ A_j, then the answer for prefix i must be smaller than j, so j keeps decreasing, each time comparing A_1 ~ A_j with B_{i-j+1} ~ B_i. As soon as a comparison succeeds, the answer is recorded as j;
  • ++i, ++j.

It can be proved that if i and j both start from the first position, the answer obtained this way is always the longest match.
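
As a baseline for comparison, the enumeration above can be sketched as follows. This is only a minimal sketch under my own assumptions: it uses 0-indexed std::string instead of 1-indexed arrays, it tries every candidate length from scratch, and the function name naive_prefix_match is made up for illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Brute force: for every prefix b[0..i], try each candidate length j from
// large to small and compare a[0..j-1] with the last j characters of b[0..i].
// ans[i] = length of the longest prefix of a that is a suffix of b[0..i].
std::vector<int> naive_prefix_match(const std::string& a, const std::string& b) {
    std::vector<int> ans(b.size(), 0);
    for (std::size_t i = 0; i < b.size(); ++i) {
        std::size_t max_len = std::min(a.size(), i + 1);
        for (std::size_t j = max_len; j > 0; --j) {
            // compare b[i+1-j .. i] with a[0 .. j-1]
            if (b.compare(i + 1 - j, j, a, 0, j) == 0) {
                ans[i] = static_cast<int>(j);
                break;  // the first hit is the longest one
            }
        }
    }
    return ans;
}
```

Every candidate is re-compared from scratch here, which is exactly the repeated work that a smarter method should avoid.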

Obviously this can be optimized with preprocessing. Although B_i and A_j fail to match in the second step, a string of length j-1 was matched successfully just before. Unless the answer for this prefix is less than 2, the prefix of A that is finally matched, with its last character removed, must be a suffix of that length j-1 string, which is itself a prefix of A.

So we can preprocess, for every prefix of A, the longest prefix of A that is also a proper suffix of that prefix (much like the problem statement itself). This saves the time spent comparing A_1 ~ A_j with B_{i-j+1} ~ B_i:

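Let fail[i] denote the length of the longest prefix of A, shorter than i, that is also a suffix of the prefix A_1 ~ A_i. For example, for A = ababaca we have fail[1..7] = 0, 0, 1, 2, 3, 0, 1.

From now on let j be the number of characters of A already matched, so the next comparison is between A_{j+1} and B_i. Every time that comparison fails, j can jump directly to the nearest candidate fail[j]: after the jump A_1 ~ A_j is still guaranteed to match, so we only need to compare A_{j+1} and B_i again: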

Specific code:

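for(int i=1,j=0;i<=m;i++){  // i scans B (length m); j = matched length of A (length n)
	// a and b are 1-indexed; a[n+1] must be a sentinel such as '\0' so that a full match falls back
	while(j>0&&a[j+1]!=b[i])j=fail[j];  // mismatch: jump to the next shorter candidate
	if(a[j+1]==b[i])++j;  // match: the matched length becomes j+1
	ans[i]=j;  // longest prefix of A that is a suffix of B[1..i]
}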

Since the two definitions are so similar, preprocessing fail works exactly like the matching itself, with A matched against itself (see the code for details):

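for(int i=2,j=0;i<=n;i++){  // fail[1]=0, so fail is computed from i=2
	while(j>0&&a[j+1]!=a[i])j=fail[j];  // mismatch: fall back inside A itself
	if(a[j+1]==a[i])++j;  // match: extend the current prefix
	fail[i]=j;
}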
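
Putting the two loops together, a self-contained version might look like the sketch below. The assumptions are mine and not part of the snippets above: strings are 0-indexed std::string (so fail is shifted by one position), a full match is handled with an explicit fallback instead of a sentinel character, and build_fail and kmp_match are illustrative names.

```cpp
#include <string>
#include <vector>

// fail[i] = length of the longest proper prefix of a[0..i] that is also its suffix.
std::vector<int> build_fail(const std::string& a) {
    std::vector<int> fail(a.size(), 0);
    for (std::size_t i = 1, j = 0; i < a.size(); ++i) {
        while (j > 0 && a[j] != a[i]) j = fail[j - 1];  // fall back inside a
        if (a[j] == a[i]) ++j;                          // extend the match
        fail[i] = static_cast<int>(j);
    }
    return fail;
}

// ans[i] = length of the longest prefix of a that is a suffix of b[0..i].
std::vector<int> kmp_match(const std::string& a, const std::string& b) {
    std::vector<int> ans(b.size(), 0);
    if (a.empty()) return ans;                          // nothing to match
    std::vector<int> fail = build_fail(a);
    for (std::size_t i = 0, j = 0; i < b.size(); ++i) {
        if (j == a.size()) j = fail[j - 1];             // previous step was a full match
        while (j > 0 && a[j] != b[i]) j = fail[j - 1];  // mismatch: jump back
        if (a[j] == b[i]) ++j;                          // match: extend
        ans[i] = static_cast<int>(j);
    }
    return ans;
}
```

With a = "aba" and b = "ababa", kmp_match returns 1 2 3 2 3; every position whose value equals a.size() marks the end of an occurrence of a in b.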

Why is the complexity O(n)? Many people have this doubt when learning KMP, and after a while they may forget the answer, assume KMP carries an extra log factor or something similar, and not dare to use it.

In fact, the interval of B covered by the current match only ever shifts to the right. Specifically, let the substring of B matched so far be l~r (l=i-j, r=i-1):

 When i and j match successfully, r moves right;

 When i and j fail to match, j jumps back, that is, l moves right;

 r moves right at most |B| times and l moves right at most |B| times, so the total complexity is O(n).
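
To see the bound concretely, one can count the fallback jumps directly. The sketch below is only an illustration under my own assumptions (0-indexed std::string, a non-empty pattern, and the hypothetical name count_fallbacks). It returns how many times j jumps back while scanning b. Since j grows by at most one per character of b and every jump shrinks it by at least one, the count can never exceed |B|.

```cpp
#include <string>
#include <vector>

// Scans b with pattern a (KMP) and returns how many times j fell back via fail[].
// Assumption: a is non-empty.
std::size_t count_fallbacks(const std::string& a, const std::string& b) {
    std::vector<std::size_t> fail(a.size(), 0);
    for (std::size_t i = 1, j = 0; i < a.size(); ++i) {
        while (j > 0 && a[j] != a[i]) j = fail[j - 1];
        if (a[j] == a[i]) ++j;
        fail[i] = j;
    }
    std::size_t jumps = 0;
    for (std::size_t i = 0, j = 0; i < b.size(); ++i) {
        if (j == a.size()) { j = fail[j - 1]; ++jumps; }             // after a full match
        while (j > 0 && a[j] != b[i]) { j = fail[j - 1]; ++jumps; }  // after a mismatch
        if (a[j] == b[i]) ++j;
    }
    return jumps;
}
```

For the pattern "aaab" and a text of twenty 'a' characters, one jump happens at every text position from the fourth onward, 17 in total, which stays below the text length of 20.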

Manacher

The Manacher algorithm is used to find palindromic substrings of a string.

A palindromic substring, that is, a substring that reads the same from both ends, is essentially a string matching problem too.

Palindromes come in two kinds, odd length such as aba and even length such as abba. The two can be handled uniformly: we only need to insert the same separator character between every two characters (and at both ends), giving #a#b#a# and #a#b#b#a#, so that every palindrome is centered on a character.

Therefore, the following preprocessing is used for such problems:

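s[0]='$',s[1]='#';  // s[0] is a sentinel handled separately; n is assumed to start at 1
while((s[n+1]=getchar())!='\n'&&s[n+1]>0)s[n+2]='#',n+=2;  // append each input character followed by '#'
s[n+1]=0;  // terminator, which also acts as the right sentinel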
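
The same preprocessing can also be written for a string that is already in memory. A minimal sketch, assuming the input contains neither '$' nor '#' (interleave is a name chosen here purely for illustration):

```cpp
#include <string>

// Builds "$#a#b#a#" from "aba": '$' is the left sentinel at index 0,
// and '#' is placed before, between, and after the original characters.
std::string interleave(const std::string& raw) {
    std::string s = "$#";
    for (char c : raw) {
        s += c;
        s += '#';
    }
    return s;
}
```

The result always has length 2 * raw.size() + 2, and the original characters sit at the even indices 2, 4, 6, and so on.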

Then start with the basic task: find the longest palindrome centered on each index.

Suppose there are two palindromic substrings, a long one (red) and a shorter one (orange) that lies inside the red one, to the right of its center.

Then there must be another palindrome (blue), the mirror image of the orange one about the center of the red one.

This is obvious, because the red string is symmetric about its center.

So the idea of the algorithm is clear. For each index i, keep the palindrome (l, mid, r) found so far whose right end r is farthest to the right; its center mid is to the left of i. If mid<i≤r, then since the palindromes centered at all earlier indices are already known, a preliminary answer for i can be read off from its mirror position about mid.

But consider the situation where the mirrored palindrome would end at a position r' beyond r.

Since the part up to r' lies outside the red string and is not guaranteed to mirror the left side, we have to pull r' back to r and then extend by brute force.

If r' is strictly inside the red string (r'<r), no further extension is possible, because the length of the mirrored palindrome is already final.

The code looks like this:

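for(int i=1,r=0,mid=0;i<=n;i++){  // (mid, r): center and right end of the rightmost palindrome so far
	if(i<=r)mnc[i]=min(r-i+1,mnc[(mid<<1)-i]);  // start from the mirror position 2*mid-i, clamped to r
	while(s[i+mnc[i]]==s[i-mnc[i]])mnc[i]++;  // extend by brute force; mnc[] is assumed zero-initialized
	if(i+mnc[i]-1>r)r=i+mnc[i]-1,mid=i;  // mnc[i]-1 is the real palindrome length centered at i
	ans=max(ans,mnc[i]-1);  // keep the longest palindrome
}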

Complexity: every successful step of the brute-force extension pushes r, the right end of the saved palindrome, further to the right, and r never moves left, so the total complexity is O(n).
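
For reference, here is a self-contained sketch of the whole procedure on a std::string. It follows the loop above but makes a few assumptions of its own: the input contains none of '$', '#', '^'; a '^' is appended as an explicit right sentinel instead of relying on a zero terminator; and the radius starts at 1 when there is no mirror information. The names longest_palindrome and rad are illustrative.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the length of the longest palindromic substring of raw.
int longest_palindrome(const std::string& raw) {
    std::string s = "$#";                    // '$' = left sentinel
    for (char c : raw) { s += c; s += '#'; }
    s += '^';                                // right sentinel, different from '$'
    int n = static_cast<int>(s.size()) - 2;  // valid centers are 1..n
    std::vector<int> rad(s.size(), 0);       // rad[i] = radius at i, the center counts as 1
    int best = 0;
    for (int i = 1, mid = 0, r = 0; i <= n; ++i) {
        // inside the rightmost palindrome: reuse the mirror position, clamped to r
        rad[i] = (i <= r) ? std::min(r - i + 1, rad[2 * mid - i]) : 1;
        // brute-force extension; the sentinels stop the loop at both ends
        while (s[i + rad[i]] == s[i - rad[i]]) ++rad[i];
        if (i + rad[i] - 1 > r) { r = i + rad[i] - 1; mid = i; }
        best = std::max(best, rad[i] - 1);   // radius - 1 = length in the original string
    }
    return best;
}
```

For example, longest_palindrome("abba") is 4 and longest_palindrome("abc") is 1.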

Extended KMP (EXKMP)

The extended KMP algorithm solves another kind of prefix search problem: for every suffix of string B, find the longest prefix of A that is also a prefix of that suffix.

The problem looks a lot like the KMP problem, but it is actually simpler.

Here we only solve a sub-problem: for every suffix of A, find the longest prefix of A that is also a prefix of that suffix. In other words, for each position of A, find the largest l such that the l characters starting there equal the first l characters of A.

From the explanation of KMP above we know that once this sub-problem is solved, a little thought is enough to see how to handle the full problem...

Let e[i] denote the longest matching length starting from position i.
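
To make the definition concrete before optimizing it, e can be computed directly by comparing characters one by one. This is only a quadratic reference sketch with 0-based indices; naive_e is a hypothetical name used for illustration.

```cpp
#include <string>
#include <vector>

// e[i] = length of the longest common prefix of a and a[i..], computed naively.
std::vector<int> naive_e(const std::string& a) {
    std::vector<int> e(a.size(), 0);
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::size_t len = 0;
        while (i + len < a.size() && a[i + len] == a[len]) ++len;
        e[i] = static_cast<int>(len);
    }
    return e;
}
```

For a = "aaabaab" this gives e = 7 2 1 0 2 1 0. Such a slow version is also handy for checking a faster implementation on random inputs.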

Assume the enumeration has reached position i, and among the positions before i, j is the one for which r=j+e[j]-1 is largest.

If r<i, compute e[i] directly by brute force, then set j to i.

If r≥i (indices start from 0 here, unlike the previous sections), then A[j..r] equals A[0..r-j], so position i corresponds to position i-j, and it is easy to see that:

If e[i-j]<r-i+1, then e[i] must equal e[i-j]: the two strings marked in red are exactly the same, and since e[i-j] cannot be extended to the right, neither can e[i].

If e[i-j]≥r-i+1, then only the match within i~r is guaranteed, and nothing is known beyond r, so we extend by brute force starting from r.

The code is very short:

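e[0]=n;  // the whole string matches itself
for(int i=1,j=0,r=0;i<n;i++){  // e[] for A itself; [j, r] is the matched window reaching farthest right
	e[i]=(i<=r)?min(e[i-j],r-i+1):0;  // reuse e[i-j], clamped to the window
	while(i+e[i]<n&&A[i+e[i]]==A[e[i]])e[i]++;  // brute-force extension, bounded by |A|
	if(i+e[i]-1>r)r=i+e[i]-1,j=i;
}
for(int i=0,j=0,r=-1;i<m;i++){  // match every suffix of B against A; r=-1 means no window yet
	ans[i]=(i<=r)?min(e[i-j],r-i+1):0;
	while(ans[i]<n&&i+ans[i]<m&&B[i+ans[i]]==A[ans[i]])ans[i]++;  // bounds checks avoid comparing the two terminators
	if(i+ans[i]-1>r)r=i+ans[i]-1,j=i;
}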

The correctness is clear; now consider the complexity.

Every successful brute-force extension moves r to the right, and r moves at most n times, so the total complexity is O(n).
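
A self-contained sketch of both passes on std::string is given below. The assumptions here are mine: indices are 0-based, explicit length checks replace any reliance on a terminating character, e[0] is set to the full length, and the names z_array and extended_kmp are illustrative (the array e is what many references call the Z-array).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// e[i] = length of the longest common prefix of a and a[i..].
std::vector<int> z_array(const std::string& a) {
    int n = static_cast<int>(a.size());
    std::vector<int> e(n, 0);
    if (n > 0) e[0] = n;
    for (int i = 1, j = 0, r = 0; i < n; ++i) {
        if (i <= r) e[i] = std::min(e[i - j], r - i + 1);       // reuse, clamped to r
        while (i + e[i] < n && a[i + e[i]] == a[e[i]]) ++e[i];  // brute-force extension
        if (i + e[i] - 1 > r) { r = i + e[i] - 1; j = i; }
    }
    return e;
}

// ans[i] = length of the longest common prefix of a and b[i..].
std::vector<int> extended_kmp(const std::string& a, const std::string& b) {
    int n = static_cast<int>(a.size()), m = static_cast<int>(b.size());
    std::vector<int> e = z_array(a), ans(m, 0);
    for (int i = 0, j = 0, r = -1; i < m; ++i) {                // r = -1: no window yet
        if (i <= r) ans[i] = std::min(e[i - j], r - i + 1);
        while (ans[i] < n && i + ans[i] < m && b[i + ans[i]] == a[ans[i]]) ++ans[i];
        if (i + ans[i] - 1 > r) { r = i + ans[i] - 1; j = i; }
    }
    return ans;
}
```

For a = "aab" and b = "aaabaab", extended_kmp returns 2 3 1 0 3 1 0; positions where the value equals a.size() are the starting points of occurrences of a in b.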

Postscript

Although I wrote for a long time and thought about it for a long time, I still feel that it is very convoluted. If you don't understand, just make do with the code.

Origin blog.csdn.net/weixin_43960287/article/details/111058741