The principle of the KMP algorithm: understanding "j = next[j]"

Why I wrote this article

  I have been studying data structures recently and learned the KMP algorithm over the past two days. I understood the logic of the algorithm well enough, but got stuck on the code, mainly on the statement j = next[j]. It took a lot of reading and a long time before that statement made sense. I am writing this article partly to record what I learned and partly to help others in the same position: I went through many articles and videos, and almost none of them explained this point clearly, so this is my attempt to fill that gap (what I personally see as a gap).

What is the KMP algorithm

  The KMP algorithm (Knuth-Morris-Pratt algorithm) is a well-known, efficient string matching algorithm. It is named after D. E. Knuth, J. H. Morris, and V. R. Pratt, who proposed it jointly.

The core idea of the KMP algorithm

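  At its core, the KMP algorithm comes down to finding the longest common prefix substring, that is, the longest prefix of a string that is also a suffix of it (excluding the whole string itself). What does that mean? Please look at the picture below: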
insert image description here
  Here, the pointer into the main string and the pointer into the pattern string point to [E] and [F] respectively (the red part in the picture). So how do we work out the longest common prefix substring? Look mainly at the green part of the picture above.
insert image description here
  In the picture above, the data in the two red boxes is the same (both are ABC), and no longer string in the picture meets the requirement, so this is the longest common prefix substring. Then:
insert image description here
  The picture above shows the core idea of the KMP algorithm. Since this article is mainly about the next array, I will stop the overview of KMP here; if you need a more detailed explanation, other bloggers' articles cover it.
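  To make the notion of the longest common prefix substring concrete, here is a minimal brute-force sketch. The function name and the sample string "ABCDABC" are hypothetical choices for illustration (the string in the original picture is not reproduced here); the sketch simply tries every candidate length, longest first.

```python
def longest_common_prefix_suffix(s):
    """Length of the longest proper prefix of s that is also a suffix of s."""
    # Try the longest candidate first; the whole string itself is excluded.
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return k
    return 0


# Hypothetical sample: the prefix "ABC" also appears as the suffix "ABC".
print(longest_common_prefix_suffix("ABCDABC"))  # 3
```

  This brute-force check is only meant to pin down the definition; it is much slower than the next array construction discussed in the rest of the article.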

The headache-inducing next array

  The picture below is the code for generating the next array from Yan Weimin's textbook (C language version).
insert image description here
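  Since the picture is not reproduced here, the following is a Python sketch of the routine, assuming it follows the commonly cited textbook form with 1-based positions. The name get_next, the padding trick, and the variable next_ (renamed to avoid shadowing Python's built-in next) are my own choices, not taken from the picture.

```python
def get_next(pattern):
    """Return a 1-based next array; index 0 is an unused placeholder."""
    s = " " + pattern          # pad so that s[1] is the first character
    n = len(pattern)
    next_ = [0] * (n + 1)      # next_[1] = 0 is already in place
    i, j = 1, 0
    while i < n:
        if j == 0 or s[i] == s[j]:
            i += 1
            j += 1
            next_[i] = j       # assign only after both pointers have moved
        else:
            j = next_[j]       # the rollback statement this article is about
    return next_


print(get_next("ABCABD")[1:])  # [0, 1, 1, 1, 2, 3]
```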
  The first part of this code is fairly easy to understand; the doubts are mainly about the fifth line.
insert image description here
  So let's talk about this j = next[j] statement.

Talk about j=next[j]

  The longest common prefix substring has nothing to do with the main string; it is purely a property of the pattern string. I believe everyone has understood this point, but computing the next array in code is where the trouble starts.
insert image description here
  Keep looking at this picture. Let's start with the code labeled 2. You can understand this line as in the following picture: it initializes the two pointers i and j to the positions shown, and directly assigns 0 to next[1].
insert image description here

  You may be a little confused at this point. In the introduction above we compared the green part (k characters) in front of the pointer, which does not include the character the pointer points to, as shown in the figure below, yet the code labeled 4 directly compares the characters pointed to by i and j.
insert image description here
  In fact, there is nothing wrong here. Notice the order: the characters are compared first, then both pointers are incremented, and only then is the value written into next. In other words, compare first, then move the pointers forward and assign to the next array. The logic used here is this:
insert image description here

  Finding the next array means finding the longest common prefix substring. As shown in the figure above, suppose next[i] for some element (D) of the pattern string is exactly j, the position of the j pointer in the figure. (Only j - 1 characters actually match here, like the two green parts in the picture above, and that count can be zero; don't confuse the two!) The next step is then to solve for i + 1, and there are two cases:
Case 1: The new character D (that is, s[i]) is the same as the j-th character s[j], as shown in the figure below:
insert image description here
  In this case, next[i+1] = next[i] + 1. This is relatively easy to understand, and I believe everyone can see it at a glance.

Case 2: The new character D (that is, s[i]) is different from the j-th character s[j], as shown in the figure below:
insert image description here

  Many people get stuck at this step when reading about the KMP algorithm, and I suspect many readers of this article are here for the same reason. This is where the notorious rollback operation, j = next[j], appears.
  At this point the original green part can no longer be used. To still satisfy the requirement of the longest common prefix substring, we need to shrink the green part, like this:
insert image description here

  The "?" in the blue cell means that we don't yet know which character it holds. The two parts enclosed in orange curly brackets (marked new in the figure above) are the parts we need to compare again.
  Why search this way? By the assumption above, the characters s[j] and s[i] are different, so the match fails and we need to find the longest common prefix substring again. Do we have to start over from the beginning? Absolutely not, and there is no need to. Searching from scratch would be no different from the BF (brute-force) algorithm and would raise the time complexity, which is very wasteful. Don't forget that we still hold plenty of cards in our hand, so there is nothing to be afraid of!
  Now let's look at the cards in our hand. We are asked to provide next[i+1]. Because s[j] and s[i] do not match, we cannot follow the routine of the first case. But, as the saying goes, a starved camel is still bigger than a horse: we still hold cards such as next[i]. And what is next[i]? It is the value of j in the figure below, that is, next[i] = j.
insert image description here
  In plain words, next[i] = j means that, in the pattern string, the j - 1 characters immediately before the i-th character (not including the i-th character itself) are exactly the same as the first j - 1 characters of the pattern string. These two equal strings are the green parts in the figure above. Because the characters s[j] and s[i] differ, we have to match again, and the idea is to pick the longest possible substring among the characters nearest s[i] that still satisfies the conditions of the KMP algorithm.
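  The meaning of next[i] = j described above can be checked mechanically. The sketch below is an illustration under my own assumptions: get_next is a Python rendering of the usual 1-based routine, and next_meaning_holds is a hypothetical helper that compares the first j - 1 characters with the j - 1 characters just before position i.

```python
def get_next(pattern):
    """1-based next array; index 0 is an unused placeholder."""
    s = " " + pattern
    next_ = [0] * (len(pattern) + 1)
    i, j = 1, 0
    while i < len(pattern):
        if j == 0 or s[i] == s[j]:
            i += 1
            j += 1
            next_[i] = j
        else:
            j = next_[j]
    return next_


def next_meaning_holds(pattern):
    """Check that s[1..j-1] equals s[i-j+1..i-1] whenever next[i] = j."""
    s = " " + pattern
    next_ = get_next(pattern)
    for i in range(2, len(pattern) + 1):
        j = next_[i]
        # s[1:j] is the first j - 1 characters; s[i - j + 1:i] is the
        # j - 1 characters immediately before position i.
        if s[1:j] != s[i - j + 1:i]:
            return False
    return True


print(next_meaning_holds("ABCABDABCABCM"))  # True
```

  For example, in "ABCABD" we get next[6] = 3, and the two characters before position 6 ("AB") are indeed the first two characters of the pattern.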
  First we need to understand why the adjusted green part can only be shorter than the original one and can never exceed it. Here is a brief explanation; if you feel you already understand this, skip ahead. The picture below shows the longest common prefix substring (green part) that we have already verified:
insert image description here
  In the second case, where the characters s[j] and s[i] do not match, suppose that a longer common prefix substring could meet the requirement (say, extending it by m characters just works), as shown in the picture below:
insert image description here
  Look at the picture below: dropping the last character of that extended match would give a common prefix substring for s[i] that is longer than the green part we had already verified as the longest. That contradicts what we know, so the longest common prefix substring cannot be extended.
insert image description here
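  The argument above, together with case 1, can be sanity-checked on concrete patterns. This is only a sketch under my own assumptions: get_next is a Python rendering of the usual 1-based routine, and growth_rule_holds is a hypothetical helper that tests three facts: next[i+1] never exceeds next[i] + 1, case 1 gives exactly next[i] + 1, and case 2 gives at most next[i].

```python
def get_next(pattern):
    """1-based next array; index 0 is an unused placeholder."""
    s = " " + pattern
    next_ = [0] * (len(pattern) + 1)
    i, j = 1, 0
    while i < len(pattern):
        if j == 0 or s[i] == s[j]:
            i += 1
            j += 1
            next_[i] = j
        else:
            j = next_[j]
    return next_


def growth_rule_holds(pattern):
    s = " " + pattern
    next_ = get_next(pattern)
    for i in range(1, len(pattern)):
        j = next_[i]
        if next_[i + 1] > j + 1:
            return False       # the green part would have been extended
        if (j == 0 or s[i] == s[j]) and next_[i + 1] != j + 1:
            return False       # case 1 must give exactly next[i] + 1
        if j != 0 and s[i] != s[j] and next_[i + 1] > j:
            return False       # case 2 can only shrink
    return True


print(growth_rule_holds("ABCABDABCABCM"))  # True
```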
  So we have shown that, in this case, the longest common prefix substring can neither be extended nor stay the same; it can only get shorter. How far should it shrink, and by how much at a time? That is where the next array comes in. Let's bring back this picture:
insert image description here

  The most important thing, as you must have seen, is the green part in new. That green part is the card in our hand: it is the longest common prefix substring of s[i].
  What does that mean? Look at the picture below (the upper part of the picture above). Our task is to find the longest common prefix substring of s[i+1] within Area1 and Area2.
insert image description here
  It just so happens that the strings in these two areas are exactly the same, so we only need to study one of them. Let's take Area1 out and study it:
insert image description here
  As you can see, the pointer j still points to the purple part in the figure, and next[j] gives exactly the longest common prefix substring of s[j]! That is:
insert image description here
  But a problem arises. What we want is the longest common prefix substring of s[i+1], so we need the characters right before s[i+1] to match the characters at the beginning exactly. That leaves the following problem; see the picture:
insert image description here

  Yes: we can only guarantee that the green parts in the picture are the same, but we cannot guarantee that the blue square with the question mark matches s[i]. If this is unclear, the following picture may help:
insert image description here
  That is what j = next[j] does. After moving pointer j, we still need to compare whether the characters s[i] and s[j] are the same, but that happens in the next iteration of the loop: if they are the same we are in case 1, and if they differ we are in case 2 again. We keep trying until a match succeeds or the length of the common prefix drops to 0.
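  To watch the rollback happen, here is a small tracing sketch. It assumes the usual 1-based routine (get_next below is my Python rendering of it), and rollback_chain is a hypothetical helper that lists every value of j tried while computing next[i+1] from next[i].

```python
def get_next(pattern):
    """1-based next array; index 0 is an unused placeholder."""
    s = " " + pattern
    next_ = [0] * (len(pattern) + 1)
    i, j = 1, 0
    while i < len(pattern):
        if j == 0 or s[i] == s[j]:
            i += 1
            j += 1
            next_[i] = j
        else:
            j = next_[j]
    return next_


def rollback_chain(pattern, i):
    """Return (values of j tried, resulting next[i+1]) for 1 <= i < len(pattern)."""
    s = " " + pattern
    next_ = get_next(pattern)
    j = next_[i]
    tried = [j]
    while j != 0 and s[i] != s[j]:   # case 2: roll back and compare again
        j = next_[j]
        tried.append(j)
    return tried, j + 1              # case 1 (or j == 0) ends the search


# At i = 12 the character 'C' meets s[6] = 'D', so j rolls back from 6 to 3.
print(rollback_chain("ABCABDABCABCM", 12))  # ([6, 3], 4)
# At i = 6 nothing matches, so j falls all the way to 0.
print(rollback_chain("ABCABDABCABCM", 6))   # ([3, 1, 0], 1)
```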

Python code

  Writing an article is really tiring. Finally, here is a piece of Python code. It only computes the next array and is admittedly not very polished; feel free to take it if you need it.

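# KMP algorithm: compute next[i] for each position of the pattern (1-based)
ch = "ABCABDABCABCM"  # the pattern string
nex = [0, 0]          # nex[0] is an unused placeholder; nex[1] = 0
i = 1                 # 1-based position whose next value is already known
j = 0                 # candidate position to compare against; j == nex[i]
while i < len(ch):
    # ch[i - 1] is the i-th character of the pattern, ch[j - 1] the j-th
    if j == 0 or ch[i - 1] == ch[j - 1]:
        i += 1
        j += 1
        nex.append(j)  # this stores next[i] for the new i
    else:
        j = nex[j]     # roll back to a shorter common prefix and retry
print(nex[1:])         # [0, 1, 1, 1, 2, 3, 1, 2, 3, 4, 5, 6, 4]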
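
  For completeness, here is a sketch of how a next array like this is typically used for the actual matching. It is not part of the original listing: kmp_index and get_next are my own names, positions are 1-based as in the textbook convention, and the function returns 0 when there is no match.

```python
def get_next(pattern):
    """1-based next array; index 0 is an unused placeholder."""
    s = " " + pattern
    next_ = [0] * (len(pattern) + 1)
    i, j = 1, 0
    while i < len(pattern):
        if j == 0 or s[i] == s[j]:
            i += 1
            j += 1
            next_[i] = j
        else:
            j = next_[j]
    return next_


def kmp_index(text, pattern):
    """Return the 1-based position of the first match, or 0 if none."""
    s, t = " " + text, " " + pattern
    next_ = get_next(pattern)
    i, j = 1, 1
    while i <= len(text) and j <= len(pattern):
        if j == 0 or s[i] == t[j]:
            i += 1
            j += 1
        else:
            j = next_[j]   # the main-string pointer i never moves backwards
    return i - len(pattern) if j > len(pattern) else 0


print(kmp_index("ABABABC", "ABABC"))  # 3
```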

Origin blog.csdn.net/weixin_45911959/article/details/123468409