Understanding the KMP algorithm (update)

KMP is a classic string matching algorithm. The problem it solves: given two strings str1 and str2 of lengths N and M, return the starting position of str2 in str1 if str1 contains str2, and return -1 otherwise. For example, when str1 is abcdabce and str2 is abce, the two match and the result is 4, the 0-based index at which abce starts in str1. When str1 is abcdabce and str2 is abcf, str1 does not contain str2, so the result is -1.

The first method most people think of is brute-force comparison: try to match str2 starting at the first character of str1 and return if it matches; if not, start again from the second character, then the third, and so on until a match is found or every starting position has failed. This works, but its time complexity is O(M*N). Is there a way to reduce it? This is where the KMP algorithm comes in: it finds the match in O(N+M) time.

Before the introduction, we need one concept: the nextArr array of a string str. This array has two properties:

1. Its length is the same as the length of str.
2. nextArr[i] describes the string str[0...i-1] that comes before str[i]. It is the maximum length at which a suffix of that string (one that ends with str[i-1] and does not contain str[0]) matches a prefix of that string (one that starts with str[0] and does not contain str[i-1]).

A simple example makes this clear. Suppose str is abcdabcd. Its nextArr array is [-1,0,0,0,0,1,2,3], which is obtained as follows:

1. When i=0, there is no string before str[0], so the value is -1 by convention.
2. When i=1, the string before str[1] is the single character a, so the value is 0 by convention.
3. When i=2, the string before str[2] is ab. a and b are not equal, so nextArr[2]=0.
4. When i=3, the string before str[3] is abc, so nextArr[3]=0.
5. Similarly, nextArr[4]=0.
6. When i=5, the string before str[5] is abcda. The character a is both a prefix and a suffix of it, so nextArr[5]=1.
7. When i=6, the string before str[6] is abcdab. The string ab is the longest match between a prefix and a suffix, so nextArr[6]=2.
8. Similarly, nextArr[7]=3.

Now let's see how nextArr optimizes the time complexity. Going back to the original problem, we have to judge whether str1 contains str2. Suppose that, for some alignment of str2 against str1, all of str2[0...j-1] has matched, but the match fails at str2[j], that is, str2[j] is not equal to str1[i]. In the matched part str2[0...j-1], let a be the longest prefix and b the longest suffix that match each other, so a and b are equal and their length is nextArr[j]. At this point we no longer slide str2 to the right by only one unit. Instead we slide it j-nextArr[j] units (the matched length minus the maximum common length of prefix and suffix), so that prefix a lines up with the part of str1 that b has just matched, and then continue by comparing str1[i] with str2[nextArr[j]]. The pointer i never moves backward, which is what reduces the time complexity, and this is the core step of the KMP algorithm.

Now suppose we have a method that returns the nextArr array of a string, called getNextArray. The code of the KMP algorithm is then:

	public int getIndex(String str1, String str2) {
		// If either string is null, the pattern is empty, or the pattern is longer than the text, return -1 directly
		if (str1 == null || str2 == null || str2.isEmpty() || str2.length() > str1.length()) {
			return -1;
		}
		char[] ch1 = str1.toCharArray();
		char[] ch2 = str2.toCharArray();
		// i and j are the pointers into str1 and str2; once j moves past the last character of str2, the match has succeeded
		int i = 0;
		int j = 0;
		int[] nextArr = getNextArray(str2);
		while (i < ch1.length && j < ch2.length) {
			if (ch1[i] == ch2[j]) {
				// The characters match, so compare the next position
				i++;
				j++;
			} else if (nextArr[j] == -1) {
				// nextArr[j] is -1 only when j is 0, so even the first pattern character failed: advance in str1
				i++;
			} else {
				// Otherwise slide the pattern to the right; setting j = nextArr[j] is that slide
				j = nextArr[j];
			}
		}
		return j == ch2.length ? i - j : -1;
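	}

Next, how do we obtain the nextArr array? As described earlier, by convention the value for the first character of the string is -1 and the value for the second is 0, that is, nextArr[0] = -1 and nextArr[1] = 0. The remaining values are computed as follows.

Values are computed from left to right, so when we compute nextArr[i], nextArr[i-1] is already known. Call str[i] character A and str[i-1] character B. The value nextArr[i-1] tells us the longest matching prefix and suffix of the string before B: call the prefix region a and the suffix region b. Character C is the character immediately after region a, and character B is the character immediately after region b. It follows that if C equals B, then nextArr[i] = nextArr[i-1] + 1.

If C does not equal B, we look at the prefix and suffix match of the string before C. Suppose C is at index cn. Then nextArr[cn] is the length of the longest matching prefix and suffix of the string before C, which is region a. Call that prefix region n and that suffix region m, and let m' be the rightmost part of region b with the same length as m. Because a and b are equal, m and m' must be equal, so n matches m' as well. Character D is the character immediately after region n. If D equals B, then nextArr[i] = nextArr[cn] + 1.

If they are not equal, we jump further back to character D and repeat the same process as for C. Every jump produces one character to compare with B, and as soon as one is equal, nextArr[i] is determined. If we jump all the way to the leftmost position, where nextArr[0] = -1, the string before A has no matching prefix and suffix, so we set nextArr[i] = 0. The code is as follows:

	public int[] getNextArray(String s) {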
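		char[] ch = s.toCharArray();
		// An empty string has no entries at all
		if (ch.length == 0) {
			return new int[0];
		}
		if (ch.length == 1) {
			return new int[]{-1};
		}
		int[] nextArr = new int[ch.length];
		nextArr[0] = -1;
		nextArr[1] = 0;
		// pos is the position being filled; cn is the index of the character compared with B (ch[pos - 1])
		int pos = 2;
		int cn = 0;
		while (pos < nextArr.length) {
			if (ch[pos - 1] == ch[cn]) {
				// Character B equals character C: extend the match by one
				nextArr[pos++] = ++cn;
			} else if (cn > 0) {
				// Jump back to the next shorter candidate prefix
				cn = nextArr[cn];
			} else {
				// Reached the leftmost position: no prefix and suffix match
				nextArr[pos++] = 0;
			}
		}

		return nextArr;

	}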
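
To try the two methods together, here is a minimal self-contained sketch. It assumes a wrapper class named KmpSketch and static methods, neither of which is in the original post, and it treats an empty pattern as "not found", which is a choice made for this sketch rather than something the post specifies.

```java
import java.util.Arrays;

// Hypothetical wrapper class; the name KmpSketch is an assumption for this sketch.
final class KmpSketch {

    // nextArr[i] is the length of the longest prefix of s[0..i-1] that also
    // ends s[0..i-1] (excluding the whole string); nextArr[0] is -1 by convention.
    static int[] getNextArray(String s) {
        char[] ch = s.toCharArray();
        if (ch.length == 0) {
            return new int[0];
        }
        if (ch.length == 1) {
            return new int[] {-1};
        }
        int[] nextArr = new int[ch.length];
        nextArr[0] = -1;
        nextArr[1] = 0;
        int pos = 2; // position being filled
        int cn = 0;  // index of the character compared with ch[pos - 1]
        while (pos < nextArr.length) {
            if (ch[pos - 1] == ch[cn]) {
                nextArr[pos++] = ++cn;
            } else if (cn > 0) {
                cn = nextArr[cn];
            } else {
                nextArr[pos++] = 0;
            }
        }
        return nextArr;
    }

    // Returns the 0-based index of the first occurrence of str2 in str1, or -1.
    static int getIndex(String str1, String str2) {
        if (str1 == null || str2 == null || str2.isEmpty() || str2.length() > str1.length()) {
            return -1;
        }
        char[] ch1 = str1.toCharArray();
        char[] ch2 = str2.toCharArray();
        int i = 0;
        int j = 0;
        int[] nextArr = getNextArray(str2);
        while (i < ch1.length && j < ch2.length) {
            if (ch1[i] == ch2[j]) {
                i++;
                j++;
            } else if (nextArr[j] == -1) {
                i++;
            } else {
                j = nextArr[j];
            }
        }
        return j == ch2.length ? i - j : -1;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(getNextArray("abcdabcd"))); // [-1, 0, 0, 0, 0, 1, 2, 3]
        System.out.println(getIndex("abcdabce", "abce")); // 4
        System.out.println(getIndex("abcdabce", "abcf")); // -1
    }
}
```

The index i into str1 only ever increases, so the search itself does at most about 2N character comparisons.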

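
For comparison, the brute-force method mentioned at the start can be sketched as below. The class name BruteForceSketch and the method name bruteForceIndex are hypothetical and chosen only for this illustration. Every starting position in str1 is tried in turn, which is where the O(M*N) worst case comes from.

```java
// Hypothetical baseline for comparison with KMP; names are assumptions for this sketch.
final class BruteForceSketch {

    // Returns the 0-based index of the first occurrence of str2 in str1, or -1.
    static int bruteForceIndex(String str1, String str2) {
        if (str1 == null || str2 == null || str2.isEmpty() || str2.length() > str1.length()) {
            return -1;
        }
        // Try every alignment of str2 against str1, restarting from scratch each time.
        for (int start = 0; start + str2.length() <= str1.length(); start++) {
            int j = 0;
            while (j < str2.length() && str1.charAt(start + j) == str2.charAt(j)) {
                j++;
            }
            if (j == str2.length()) {
                return start;
            }
        }
        return -1;
    }
}
```

Unlike KMP, a failed alignment here throws away everything learned from the characters that did match, so the same characters of str1 can be compared many times.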