String matching algorithms: the KMP algorithm

BF algorithm

When it comes to string matching, the first approach that comes to mind is to compare the pattern string with the main string character by character. When a mismatch is encountered, the pattern string is shifted one position forward along the main string, and matching restarts from the first character of the pattern string. This is the BF algorithm, short for "brute force matching algorithm". Its worst-case time complexity is O(n*m), where n and m are the lengths of the main string and the pattern string.

The BM algorithm and the KMP algorithm both improve on the BF algorithm: during matching, they try to skip as many characters as possible.
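The brute force procedure described above can be sketched roughly as follows. The class and method names (`BruteForceSearch`, `bfSearch`) are made up for illustration and do not come from the original article.

```java
final class BruteForceSearch {
    // Returns the index of the first occurrence of patt in txt, or -1 if there is none.
    static int bfSearch(char[] txt, char[] patt) {
        // Try every possible starting position of the pattern in the main string.
        for (int start = 0; start + patt.length <= txt.length; start++) {
            int j = 0;
            // Compare character by character from the start of the pattern.
            while (j < patt.length && txt[start + j] == patt[j]) {
                j++;
            }
            if (j == patt.length) {
                return start; // every pattern character matched
            }
            // Mismatch: the next iteration shifts the pattern by one position
            // and restarts from patt[0], re-reading main-string characters.
        }
        return -1;
    }
}
```

The nested loops are where the O(n*m) worst case comes from: each of up to n starting positions may compare up to m characters.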

KMP algorithm idea

Since we already know which characters have been read when a comparison fails, is it possible to avoid stepping back in the main string and matching again from the next starting position?

The basic idea of the KMP algorithm is to use the part of the pattern string that has already been matched to avoid the "backup" step of the brute force algorithm when a mismatch occurs. In other words, the main-string pointer is never decremented; it only ever moves forward.

For example:

Match the pattern string ABABC against the main string ABABABCAA.

During matching, the last character of the pattern string (C) does not match the corresponding character of the main string (A), but the suffix of the matched part of the main string (AB) equals the prefix of the pattern string (AB). The first two characters of the pattern string can therefore be skipped: move the pattern string forward two positions and continue matching.

So how do we know how many characters to skip each time we encounter a non-matching character?

This is where the next array defined by the KMP algorithm comes in.

For example:

For now we don't care how the next array is generated; let's first look at what it does and how it is used.

When the KMP algorithm encounters a mismatch, it looks up the next value corresponding to the last matched character.

In this example, the next value for the last matched character is 2, so we move the pattern string forward two positions; its first two characters are already known to match, and matching continues from there.

The value of a next array element is the number of leading pattern characters whose comparison can be skipped.

Because the main-string pointer never moves back and a single pass over the main string completes the match, KMP is much more efficient than the brute force algorithm: the total cost is O(n+m), including building the next array.
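The arithmetic of a single mismatch can be written out as a small sketch. The next array for ABABC is taken as given here, and the class and method names are hypothetical.

```java
final class MismatchStep {
    // j is the pattern index of the mismatching character (j > 0).
    // Returns the pattern index at which comparison resumes, i.e. how many
    // leading pattern characters are already known to match.
    static int resumeIndex(int[] next, int j) {
        return next[j - 1];
    }

    // Number of positions the pattern slides forward along the main string.
    static int shift(int[] next, int j) {
        return j - next[j - 1];
    }

    public static void main(String[] args) {
        int[] next = {0, 0, 1, 2, 0}; // next array for ABABC, assumed already computed
        int j = 4;                    // mismatch at patt[4] = 'C' against txt[4] = 'A'
        System.out.println(resumeIndex(next, j)); // prints 2
        System.out.println(shift(next, j));       // prints 2
    }
}
```

In this example both numbers happen to be 2; in general the shift is j minus the next value, while the next value itself is the index where comparison resumes.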

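public int kmpSearch(char[] txt, char[] patt) { // txt is the main string, patt is the pattern string
    if (patt.length == 0) return 0; // guard: an empty pattern trivially matches at position 0
    int[] next = buider_nexts(patt); // assume the next array has already been computed
    int i = 0; // pointer into the main string
    int j = 0; // pointer into the pattern string
    while (true) {
        if (i == txt.length) return -1; // main string exhausted without a match
        if (txt[i] == patt[j]) { // characters match: advance both pointers and keep matching
            i++;
            j++;
        } else if (j > 0) { // mismatch: use the next value to skip the leading pattern characters
            j = next[j - 1];
        } else { // even the first pattern character does not match: advance the main string by one
            ++i;
        }
        if (j == patt.length) return i - j; // j reached the end of the pattern: return the start of the match
    }
}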

 Note that the main string pointer i never decreases, which is also the essence of the KMP algorithm.

Generation of next array

As mentioned before, a value in the next array is the number of pattern characters whose comparison can be skipped when a match fails. But why does this work?

From the previous example, we can see that the last two matched characters, AB, are the same as the first two characters, AB, that are skipped.

In other words, the first four characters of the pattern string have a common prefix and suffix AB, of length 2.

The essence of the next array is to find, for each prefix of the pattern string, the length of its common prefix and suffix, and it must be the longest one.

For example:

In this pattern string, although the prefix A and the suffix A are the same, they are not the longest; ABA is the longest common prefix and suffix, so the next value is 3.

Note, however, that the prefix and suffix we are looking for cannot be the string itself. Skipping as many characters as the full length of the string would be meaningless.

Calculation of next array

Take the pattern string ABABC as an example and calculate its next array.

  • For the first character, a common prefix and suffix obviously does not exist, so next is 0
  • For the first two characters, there is no common prefix and suffix, so next is 0
  • For the first three characters, the common prefix and suffix is A, so next is 1
  • For the first four characters, the common prefix and suffix is AB, so next is 2
  • For the first five characters, there is no common prefix and suffix, so next is 0

But how do we implement this?

We could solve the problem by brute force with nested for loops, but that is too inefficient. In fact, the next array can be computed quickly by recurrence. The clever part is that it keeps reusing information it already knows to avoid repeated work.
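For reference, the brute force approach might look like the sketch below, which checks every candidate length for every prefix. The names are hypothetical; it is meant only as a slow baseline for checking hand calculations, not as the method the article recommends.

```java
final class BruteForceNext {
    // For each prefix patt[0..i], finds the longest proper prefix that is also a suffix.
    static int[] bruteForceNext(char[] patt) {
        int[] next = new int[patt.length];
        for (int i = 0; i < patt.length; i++) {
            // Candidate lengths from longest to shortest. len starts at i, not i + 1,
            // so the prefix patt[0..i] itself is never counted.
            for (int len = i; len > 0; len--) {
                boolean same = true;
                for (int k = 0; k < len; k++) {
                    // Compare the first len characters with the last len characters.
                    if (patt[k] != patt[i - len + 1 + k]) {
                        same = false;
                        break;
                    }
                }
                if (same) {
                    next[i] = len; // first hit is the longest one
                    break;
                }
            }
        }
        return next;
    }
}
```

For ABABC this yields 0, 0, 1, 2, 0, matching the hand calculation, but the three nested loops make it cubic in the pattern length in the worst case.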

For example:

Assume the current common prefix and suffix is known and has length 2. There are two cases when matching continues:

1. If the next character is still the same, it forms a longer common prefix and suffix, whose length is the previous length + 1.

2. If the next character is not the same, we need to check whether a shorter common prefix and suffix exists.

So how do we find a shorter common prefix and suffix?

From the previous calculation, we already know that there is a common prefix and suffix of length 3 in front of the mismatched character B. We can look for a shorter one directly inside the left part (the prefix), and the length of its longest common prefix and suffix can be obtained by looking it up in the table. If it is 1, we go back to the original step and check whether the next characters are the same. If they are, we can build a longer common prefix and suffix, whose length is that value + 1.
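The two cases can be condensed into a single step that extends a known table by one entry. This is only a sketch: the pattern ABACABAB is an assumed example chosen to fit the description (a common prefix and suffix of length 3 in front of a mismatching B), and the names are hypothetical.

```java
final class FallbackStep {
    // Given next[0..i-1] already filled in, computes the entry for position i (i >= 1).
    static int extend(char[] patt, int[] next, int i) {
        int prefixLen = next[i - 1]; // longest common prefix and suffix before patt[i]
        // Case 2: the next character differs, so fall back to a shorter
        // common prefix and suffix by looking it up in the table.
        while (prefixLen > 0 && patt[prefixLen] != patt[i]) {
            prefixLen = next[prefixLen - 1];
        }
        // Case 1: the next character is the same, so the length grows by one.
        if (patt[prefixLen] == patt[i]) {
            prefixLen++;
        }
        return prefixLen;
    }

    public static void main(String[] args) {
        char[] patt = "ABACABAB".toCharArray(); // assumed example pattern
        int[] next = {0, 0, 1, 0, 1, 2, 3, 0};  // entries 0..6 known, entry 7 still to compute
        // Length 3 (ABA) fails because patt[3] = 'C' differs from 'B';
        // the table gives next[2] = 1, and patt[1] = 'B' matches, so the result is 1 + 1.
        System.out.println(extend(patt, next, 7)); // prints 2
    }
}
```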


KMP algorithm next array generation code: 

private static int[] buider_nexts(char[] patt) {
    int[] next = new int[patt.length]; // next array; its first element is 0
    next[0] = 0;
    int prefix_len = 0; // length of the current common prefix and suffix
    int i = 1;
    while (i < patt.length) {
        if (patt[prefix_len] == patt[i]) { // characters match: increment prefix_len and store it in next
            prefix_len++;
            next[i] = prefix_len;
            i++; // move on to the next character
        } else {
            if (prefix_len == 0) { // no common prefix and suffix exists: set next to 0
                next[i] = 0;
                i++;
            } else {
                prefix_len = next[prefix_len - 1]; // mismatch: look up the table for a shorter common prefix and suffix
            }
        }
    }
    return next;
}

 Complete code:

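public int kmpSearch(char[] txt, char[] patt) {
    if (patt.length == 0) return 0; // guard: an empty pattern trivially matches at position 0
    int[] next = buider_nexts(patt); // compute the next array of the pattern string
    int i = 0; // pointer into the main string
    int j = 0; // pointer into the pattern string
    while (true) {
        if (i == txt.length) return -1; // main string exhausted without a match
        if (txt[i] == patt[j]) { // characters match: advance both pointers and keep matching
            i++;
            j++;
        } else if (j > 0) { // mismatch: use the next value to skip the leading pattern characters
            j = next[j - 1];
        } else { // even the first pattern character does not match: advance the main string by one
            ++i;
        }
        if (j == patt.length) return i - j; // j reached the end of the pattern: return the start of the match
    }
}

private int[] buider_nexts(char[] patt) {
    int[] next = new int[patt.length]; // next array; its first element is 0
    next[0] = 0;
    int prefix_len = 0; // length of the current common prefix and suffix
    int i = 1;
    while (i < patt.length) {
        if (patt[prefix_len] == patt[i]) { // characters match: increment prefix_len and store it in next
            prefix_len++;
            next[i] = prefix_len;
            i++;
        } else {
            if (prefix_len == 0) { // no common prefix and suffix exists: set next to 0
                next[i] = 0;
                i++;
            } else {
                prefix_len = next[prefix_len - 1]; // mismatch: look up the table for a shorter common prefix and suffix
            }
        }
    }
    return next;
}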
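To get a feel for the efficiency difference, one option is to count character comparisons for both approaches on a repetitive input. The sketch below is self-contained and uses hypothetical names; its KMP loop follows the same logic as the code above, with a counter added.

```java
final class ComparisonCount {
    // Brute force search; returns the number of character comparisons performed.
    static long bruteForceComparisons(char[] txt, char[] patt) {
        long count = 0;
        for (int start = 0; start + patt.length <= txt.length; start++) {
            int j = 0;
            while (j < patt.length) {
                count++;
                if (txt[start + j] != patt[j]) break;
                j++;
            }
            if (j == patt.length) return count; // match found, stop counting
        }
        return count;
    }

    // KMP search; returns the number of character comparisons performed.
    static long kmpComparisons(char[] txt, char[] patt) {
        if (patt.length == 0) return 0;
        int[] next = buildNext(patt);
        long count = 0;
        int i = 0;
        int j = 0;
        while (i < txt.length) {
            count++;
            if (txt[i] == patt[j]) {
                i++;
                j++;
                if (j == patt.length) return count; // match found, stop counting
            } else if (j > 0) {
                j = next[j - 1]; // i stays where it is
            } else {
                i++;
            }
        }
        return count;
    }

    // Next array built by the recurrence described in the article.
    static int[] buildNext(char[] patt) {
        int[] next = new int[patt.length];
        int prefixLen = 0;
        int i = 1;
        while (i < patt.length) {
            if (patt[prefixLen] == patt[i]) {
                next[i++] = ++prefixLen;
            } else if (prefixLen == 0) {
                next[i++] = 0;
            } else {
                prefixLen = next[prefixLen - 1];
            }
        }
        return next;
    }

    // Helper for building test inputs: n copies of c.
    static char[] repeated(char c, int n) {
        char[] out = new char[n];
        for (int k = 0; k < n; k++) out[k] = c;
        return out;
    }

    public static void main(String[] args) {
        char[] txt = repeated('A', 1000);                              // AAAA...A
        char[] patt = (new String(repeated('A', 9)) + "B").toCharArray(); // AAAAAAAAAB, never matches
        System.out.println("BF comparisons:  " + bruteForceComparisons(txt, patt));
        System.out.println("KMP comparisons: " + kmpComparisons(txt, patt));
    }
}
```

On inputs like this, brute force re-compares almost the whole pattern at every starting position, while the KMP search loop makes at most about two comparisons per main-string character.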

Origin blog.csdn.net/weixin_53922163/article/details/132755827