Algorithm Notes (1) - KMP Algorithm

Contents

Brute force matching (BF) algorithm

Basic concept

Analyzing BF Algorithms

Code 

A small test

Time complexity of BF algorithm

KMP algorithm

Basic concept

Analyzing KMP Algorithms

Bring out the next array

Code

Key code explanation

A small test

Time complexity of KMP algorithm


Brute force matching (BF) algorithm

Basic concept

The BF algorithm, short for Brute Force algorithm, is a common pattern matching algorithm. The idea is to compare the first character of the target string S with the first character of the pattern string T. If they are equal, continue with the second character of S and the second character of T; if they are not equal, compare the second character of S with the first character of T instead. The comparison goes on in this way until the final matching result is obtained.

Analyzing BF Algorithms

The definition alone is hard to digest, so let's work through an example together:

Suppose the main string is "ababcabcdabcde" and the substring is "abcd". We need to find out whether the substring appears in the main string and return the subscript of the first match in the main string, or -1 if there is no match.

The natural approach is to match from left to right. If the two characters are equal, both pointers move one position to the right. If they differ, the substring starts over from subscript 0 and the main string starts over one position after where the previous attempt began (the first attempt starts at subscript 0, the next one at subscript 1, and so on).

We can initialize like this: pointer i starts at subscript 0 of the main string and pointer j starts at subscript 0 of the substring.

Following this idea, we compare the characters pointed to by i and j. If they are the same, both pointers move forward by one position.

If they differ (the original illustration shows b against d), i returns to the position right after the one where this attempt started (this attempt started at subscript 0, so i goes to 1), and j returns to subscript 0 and starts again.

Code 

Based on the above analysis, let's start writing the code:

C code:

#include <stdio.h>
#include <string.h>
#include <assert.h>
int BF(const char* str1, const char* str2)
{
	assert(str1 != NULL && str2 != NULL);
	int len1 = (int)strlen(str1);//length of the main string
	int len2 = (int)strlen(str2);//length of the substring
	int i = 0;//current position in the main string
	int j = 0;//current position in the substring
	while (i < len1 && j < len2)
	{
		if (str1[i] == str2[j])
		{
			i++;//equal: both i and j move forward by one
			j++;
		}
		else {//not equal
			i = i - j + 1;//i goes back to one past the start of this attempt
			j = 0;//j goes back to position 0
		}
	}
	if (j >= len2) {//the whole substring was traversed, so a match was found
		return i - j;
	}
	else {
		return -1;
	}

}
int main()
{
	printf("%d\n", BF("ababcabcdabcde", "abcd"));//test; try several examples to check the code
	printf("%d\n", BF("ababcabcdabcde", "abcde"));
	return 0;
}

Java code:

public class Test {
    public static int BF(String str, String sub) {
        if (str == null || sub == null) return -1;
        int strLen = str.length();
        int subLen = sub.length();
        int i = 0; // current position in the main string
        int j = 0; // current position in the substring
        while (i < strLen && j < subLen) {
            if (str.charAt(i) == sub.charAt(j)) {
                i++;
                j++;
            } else {
                i = i - j + 1; // i goes back to one past the start of this attempt
                j = 0;
            }
        }
        if (j >= subLen) {
            return i - j;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(BF("ababcabcdabcde", "abcd"));
        System.out.println(BF("ababcabcdabcde", "abcde"));
    }
}

A small test

After the study above you should have a preliminary understanding of the BF algorithm. To deepen it and put it into practice, try the following exercise:

Exercise >> Implement strStr()

If you are interested, give it a try; we will discuss it together in the next chapter.

Time complexity of BF algorithm

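The best case is when the match succeeds at the very first alignment: only the m characters of the pattern are compared, which is O(m) and effectively O(1) for a short pattern of fixed length.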

The worst case is when every attempt matches all the way to the last character of the pattern and only then finds a difference, for example main string "aaaaab" with substring "aab".

Except for the last attempt, every attempt is compared right to the end of the pattern before the mismatch is discovered.

In this example the pattern fails on the first 3 attempts, each after 3 comparisons, and the 4th attempt matches completely with 3 more comparisons, so no further shifting is needed. The total number of comparisons is (6 - 3 + 1) * 3 = 12.

In general, for a main string of length n and a pattern string of length m, the worst-case time complexity is O((n - m + 1) * m) = O(n * m).
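
As a rough check of this count, the BF loop can be instrumented to count comparisons. This is only a sketch: bf_count_comparisons is a hypothetical helper name that does not appear in the article's code, and it counts one comparison per loop iteration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the article): runs the same BF loop
 * but returns the number of character comparisons performed. */
int bf_count_comparisons(const char* str, const char* sub)
{
	assert(str != NULL && sub != NULL);
	int len1 = (int)strlen(str);
	int len2 = (int)strlen(sub);
	int i = 0;
	int j = 0;
	int comparisons = 0;
	while (i < len1 && j < len2)
	{
		comparisons++;/* one comparison of str[i] with sub[j] */
		if (str[i] == sub[j])
		{
			i++;
			j++;
		}
		else
		{
			i = i - j + 1;/* restart one past the start of this attempt */
			j = 0;
		}
	}
	return comparisons;
}
```

Called as bf_count_comparisons("aaaaab", "aab"), it should return 12, in line with (6 - 3 + 1) * 3.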

Attentive readers will notice that, for searching, there is no need to move i back to position 1 at all. The preceding characters have already been matched, so after moving i to position 1 and j to position 0 the two strings are simply staggered against each other and obviously will not match. Can we drop these unnecessary steps and reduce the pointer backtracking? The idea is to leave i where it is and move only j, which brings us to today's protagonist: the KMP algorithm.

KMP algorithm

Basic concept

The KMP algorithm is an improved string matching algorithm proposed by D. E. Knuth, J. H. Morris and V. R. Pratt, so it is called the Knuth-Morris-Pratt algorithm (KMP algorithm for short). The core of the KMP algorithm is to use the information gained from a failed match to reduce the number of comparisons between the pattern string and the main string as much as possible, and so achieve fast matching. It is implemented through a next() function, which holds the local matching information of the pattern string. The time complexity of the KMP algorithm is O(m+n).

Difference: the only difference between KMP and BF is that i in the main string never goes back, and j does not necessarily go back to position 0.

Analyzing KMP Algorithms

Suppose the main string is "ababcabcdabcde" and the substring is "abcd". We need to find out whether the substring appears in the main string and return the subscript of the first match in the main string, or -1 if there is no match.

1. First, an example of why i in the main string does not roll back

2. Where j falls back to

So how does j know to fall back to the position of subscript 2? This leads us to the next array.

Bring out the next array

The essence of KMP is the next array, written next[j] = k: each j corresponds to a value k, and this k is the position that j moves back to after a mismatch at j. The value of k is calculated like this:

  • Rule: within the successfully matched part, find the two longest equal proper substrings (proper meaning shorter than the matched part itself), one starting at the character with subscript 0 and the other ending at the character with subscript j-1. k is their length.
  • Whatever the data, next[0] = -1 and next[1] = 0. Positions here are subscripts starting from 0, whereas "the n-th item" counts from 1.
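
The rule can also be applied literally in code, which is handy for checking a next array worked out by hand. The sketch below is an assumption-laden illustration rather than the method used later in this article: next_by_definition is a hypothetical name, it requires a non-empty pattern, and it simply tries every candidate length k from longest to shortest, so it is far slower than the real construction.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the article): fills next[] straight
 * from the rule by brute force. Intended for checking only. */
void next_by_definition(int* next, const char* sub, int len)
{
	assert(next != NULL && sub != NULL && len >= 1);
	next[0] = -1;
	for (int j = 1; j < len; j++)
	{
		next[j] = 0;
		/* k < j keeps both substrings shorter than sub[0..j-1] */
		for (int k = j - 1; k >= 1; k--)
		{
			/* compare sub[0..k-1] with sub[j-k..j-1] */
			if (memcmp(sub, sub + (j - k), (size_t)k) == 0)
			{
				next[j] = k;
				break;/* first hit is the longest one */
			}
		}
	}
}
```

For a pattern of length 1 only next[0] is written; for longer patterns next[1] always comes out as 0, in line with the second bullet.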

Exercises for finding the next array:

Exercise 1: Find the next array of "ababcabcdabcde".

-1 0 0 1 2 0 1 2 0 0 1 2 0 0

Exercise 2: Find the next array of "abcabcabcabcdabcde".

-1 0 0 0 1 2 3 4 5 6 7 8 9 0 1 2 3 0

Now for the core part:
by this point, finding the next array by hand should be no problem. The next question is: if we know that next[i] = k, how do we find next[i+1]?
If we can derive the value of next[i+1] from the value of next[i], then we can implement this part in code.
So how do we do it?

First assume that next[i] = k holds. Then P[0]...P[k-1] = P[x]...P[i-1] holds, and since both sides have length k we have x = i-k, which gives P[0]...P[k-1] = P[i-k]...P[i-1].

Now suppose P[k] = P[i]. Then P[0]...P[k] = P[i-k]...P[i], which means next[i+1] = k+1.

So what about P[k] != P[i]? In that case k falls back to next[k] and the comparison with P[i] is repeated, until the two characters are equal or k reaches -1. Either way, next[i+1] = k+1 for the final value of k.
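
As a small worked example of this fallback (the pattern "aabaaab" is chosen here purely for illustration and does not appear in the original article), take next[5] = 2 as known and derive next[6] by setting k = next[k] whenever the characters differ:

    P     = a a b a a a b
    index = 0 1 2 3 4 5 6

    i = 5, k = next[5] = 2: P[2] = b, P[5] = a, not equal, so k = next[2] = 1
    i = 5, k = 1:           P[1] = a, P[5] = a, equal, so next[6] = k + 1 = 2

The full next array for this pattern is -1 0 1 0 1 2 2.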


 

Code

C code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
void GetNext(int* next, const char* sub, int len2)
{
	next[0] = -1;//by convention the first entry is -1 and the second is 0
	if (len2 > 1) next[1] = 0;//a one-character pattern has no second entry
	int k = 0;//k of the previous item
	int j = 2;//next item to fill in
	while (j < len2)
	{
		if (k == -1 || sub[j - 1] == sub[k])
		{
			next[j] = k + 1;
			j++;
			k++;
		}
		else
		{
			k = next[k];
		}
	}
}
int KMP(const char* str, const char* sub, int pos)
{
	assert(str != NULL && sub != NULL);
	int len1 = (int)strlen(str);
	int len2 = (int)strlen(sub);
	assert(pos >= 0 && pos < len1);
	if (len2 == 0) return pos;//an empty pattern matches right at pos
	int i = pos;//i starts traversing from the given subscript
	int j = 0;
	int* next = (int*)malloc(sizeof(int) * len2);//next is allocated with the same length as the substring
	assert(next != NULL);
	GetNext(next, sub, len2);
	while (i < len1 && j < len2)
	{
		if (j == -1 || str[i] == sub[j])//j==-1 handles the case where j has fallen back to next[0]
		{
			i++;
			j++;
		}
		else {
			j = next[j];//not equal: use the next array to find where j goes
		}
	}
	int result = (j >= len2) ? i - j : -1;
	free(next);//release the next array before returning
	return result;
}
int main()
{
	const char* str = "ababcabcdabcde";
	const char* sub = "abcd";
	printf("%d\n", KMP(str, sub, 0));
	return 0;
}

Java code:

public class Test {
    public static void getNext(int[] next, String sub) {
        next[0] = -1;
        if (sub.length() > 1) {
            next[1] = 0; // a one-character pattern has no second entry
        }
        int i = 2; // next item to fill in
        int k = 0; // k of the previous item
        while (i < sub.length()) { // the next array is not complete yet
            if ((k == -1) || sub.charAt(k) == sub.charAt(i - 1)) {
                next[i] = k + 1;
                i++;
                k++;
            } else {
                k = next[k];
            }
        }
    }

    public static int KMP(String s, String sub, int pos) {
        if (sub.isEmpty()) return pos; // an empty pattern matches right at pos
        int i = pos;
        int j = 0;
        int lens = s.length();
        int lensub = sub.length();
        int[] next = new int[sub.length()];
        getNext(next, sub);
        while (i < lens && j < lensub) {
            if ((j == -1) || (s.charAt(i) == sub.charAt(j))) {
                i++;
                j++;
            } else {
                j = next[j];
            }
        }
        if (j >= lensub) {
            return i - j;
        } else {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(KMP("ababcabcdabcde", "abcd", 0));
        System.out.println(KMP("ababcabcdabcde", "abcde", 0));
        System.out.println(KMP("ababcabcdabcde", "abcdef", 0));
    }
}

Key code explanation

if (j == -1 || str[i] == sub[j])
{
	i++;
	j++;
}
else {
	j = next[j];
}

Question: why do we still need j == -1?

When a mismatch occurs while j is 0 (for example at the very first character, where i and j are both 0), we get j = next[j] >> j = next[0] >> j = -1. Without the j == -1 check, the next comparison would use sub[-1], which is not a valid position, and the search could not continue correctly. Yet P[5]~P[8] of the main string match the substring, so giving up here would obviously produce the wrong answer. That is why we add the j == -1 case: it moves i forward by one and brings j back to 0, so matching restarts from the beginning of the substring.

 

next[0] = -1;
next[1] = 0;
int k = 0;//k of the previous item
int j = 2;//next item

By our convention, the first and second entries of the next array are -1 and 0, so there is no problem here. k = 0 is the k value of the previous item, and j = 2 is the next item to fill in.

if (k == -1 || sub[j - 1] == sub[k])
{
	next[j] = k + 1;
	j++;
	k++;
}

From the derivation above we know that if p[i] == p[k] and next[i] = k, then next[i+1] = k+1. Note that here i is j-1, so p[i] == p[k] >> sub[j-1] == sub[k], and next[i+1] = k+1 >> next[j] = k+1.

else
{
	k = next[k];
}

This point was covered above: when p[i] != p[k], k rolls back until p[i] == p[k] is found (or k reaches -1), and then next[i+1] = k+1 is applied.

A small test

Exercise >> Repeated substrings

If you are interested, give it a try; we will discuss it together in the next chapter.

Time complexity of KMP algorithm

Suppose we search for a pattern string of length m in a main string of length n. With the KMP algorithm the time complexity is generally taken to be O(m+n): computing the next array takes O(m), and matching takes O(n).
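
One way to see the linear matching cost is to count comparisons, as in the sketch below. It is an illustration under assumptions, not the article's code: kmp_count_comparisons and its comparisons out-parameter are hypothetical, and the next array is built inline in the same way as GetNext. Since i never moves back, and j can fall back no more often than it has advanced, the matching loop performs at most twice as many comparisons as the main string has characters.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the article): KMP search from subscript 0
 * that also reports the number of character comparisons made while
 * matching. Returns the first match position or -1. */
int kmp_count_comparisons(const char* str, const char* sub, int* comparisons)
{
	assert(str != NULL && sub != NULL && comparisons != NULL);
	int len1 = (int)strlen(str);
	int len2 = (int)strlen(sub);
	*comparisons = 0;
	if (len2 == 0) return 0;/* empty pattern: nothing to compare */

	int* next = (int*)malloc(sizeof(int) * (size_t)len2);
	assert(next != NULL);
	next[0] = -1;
	if (len2 > 1) next[1] = 0;
	int k = 0;
	int t = 2;
	while (t < len2)/* same construction as GetNext */
	{
		if (k == -1 || sub[t - 1] == sub[k])
		{
			next[t] = k + 1;
			t++;
			k++;
		}
		else
		{
			k = next[k];
		}
	}

	int i = 0;
	int j = 0;
	while (i < len1 && j < len2)
	{
		if (j == -1)
		{
			i++;/* no character is compared in this step */
			j++;
		}
		else
		{
			(*comparisons)++;
			if (str[i] == sub[j])
			{
				i++;
				j++;
			}
			else
			{
				j = next[j];/* i stays where it is */
			}
		}
	}
	int result = (j >= len2) ? i - j : -1;
	free(next);
	return result;
}
```

For the main string "aaaaab" and the pattern "aab" this should report 9 comparisons and a match at subscript 3, compared with the 12 comparisons counted for BF on the same input.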

That concludes the explanation of the KMP algorithm. If you spot shortcomings or have better insights into the code, you are welcome to leave a message in the comment area so we can discuss and make progress together!


Origin blog.csdn.net/m0_58367586/article/details/123073696