KMP algorithm: principle analysis, the next array explained, and code with detailed comments

What problem does it solve

Before studying an algorithm, we first need to be clear about its purpose, just as when we read a function or method we need to know its input, its output, and what it does.
The KMP algorithm works on strings: it finds a target string inside another string.

To solve this problem, let's first think about the brute-force solution, because most algorithms are improvements on the brute-force solution that exploit some trick. The task is to find the target string inside a string, for example:

S: abcabccabcaaabcabcd
T: abcabcd

In the example above, we want to find T in S and return the index in S where the match starts.
The brute-force solution is easy to come up with.
For each starting position in S, compare the characters of T with the characters of S one by one. This is guaranteed to find T if it is present, but the time complexity is high: O(n*m) in the worst case, where n is the length of S and m is the length of T.
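The brute-force idea can be sketched as follows. This is a minimal illustration rather than code from the original article; the class name BruteForceSearch and the method name indexOf are chosen here for the example.

```java
class BruteForceSearch {
    // Returns the index in s where the first occurrence of t starts, or -1 if t is absent.
    static int indexOf(String s, String t) {
        int n = s.length(), m = t.length();
        // Try every starting position in s that leaves room for all of t.
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            // Compare t against s character by character from this start.
            while (j < m && s.charAt(start + j) == t.charAt(j)) {
                j++;
            }
            if (j == m) {
                return start; // every character of t matched
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // S and T from the example in the text.
        System.out.println(indexOf("abcabccabcaaabcabcd", "abcabcd")); // prints 12
    }
}
```

The two nested loops are where the O(n*m) bound comes from: up to n starting positions, each with up to m comparisons.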

The KMP algorithm makes a "slight" improvement on the brute-force solution: it eliminates some repeated comparisons, which reduces the overall time complexity to O(n+m).

Why is kmp faster

Once we know what problem the algorithm solves and what the brute-force solution looks like, the next question is why the algorithm can be faster. Usually the answer is that it removes a lot of work the brute-force solution repeats (or it takes a simpler route that lets it skip many cases the brute-force solution has to examine).
So what repetition does the KMP algorithm remove?
Let's take a look.

S: abcabccabcaaabcabcd
T: abcabcd

Let's run the brute-force solution starting from the first character of S. The first character of S matches the first character of T, and so do the following ones, until we compare the seventh character of T with the seventh character of S: they differ. What does the brute-force solution do now?
The answer is clear:

S: abcabccabcaaabcabcd
T:  abcabcd

Move T one position to the right, then compare character by character again.

But this alignment obviously fails at once, so T moves one more position to the right. That fails too, so T moves right again. Only at the third shift does the first character finally match:

S: abcabccabcaaabcabcd
T:    abcabcd

The brute-force solution then dutifully compares the second character (equal), the third (equal), and the fourth, which differs, so it has to start over yet again.

So how much work did the brute-force solution do between the first failed match and this alignment? It moved T 3 times and compared T's first character 3 times along the way, 6 operations in total.

Yet anyone looking at the strings can tell at a glance that after the first failure we could move T three positions to the right directly, because the first three characters of T equal the three characters of S they would then line up with. Why spend 6 operations when a single move is enough?

That is exactly the point: what the brute-force method does in 6 steps can be done in one. The question is how.

Take a closer look at the situation when the first match fails:

S: abc"abc"cabcaaabcabcd
T: "abc"abcd

Look at the parts marked with quotation marks. Are they equal? Yes, so we move T to the right until the two quoted parts line up. Because they are equal, we can resume matching (comparing) directly at the character after the quoted part, which saves the repeated work.

S: abc"abc"cabcaaabcabcd
T:    "abc"abcd

That is why KMP is faster. But why can we be sure the two quoted parts are equal?

During the first attempt, we matched a with a, b with b, c with c, then again a with a, b with b, c with c. The part we have already matched, abcabc, starts and ends with the same substring, abc. Because this part of S matched T, the first six characters of T also start and end with abc. So we can slide the head of T directly to the tail of the part of S that was already matched.

If that is hard to follow, a diagram makes it clearer.
The match fails at the purple arrow.
[Figure: S and T aligned, with the mismatch marked by a purple arrow]
The blue and yellow areas are equal: within the part that has been matched successfully, the head and the tail share a (longest) equal area. So the yellow area of S matches the yellow area of T. We move T so that yellow lines up with yellow, then simply resume at the purple position.
[Figure: T moved to the right so that the yellow areas are aligned]

That is why the KMP algorithm is faster. It exploits the structure of the target string: instead of starting over from the beginning every time, it aligns the equal head and tail parts directly, which saves time.

But here is the problem. So far this is how we search by hand. How does an actual algorithm find the equal head and tail of the matched part?
It uses the next array, which we discuss below.

next array

The part we have matched at any moment is always a substring of T starting at position 0. We don't know in advance where the match will fail, but wherever it fails, we need to know the equal head and tail of the part matched so far.
So the idea is simple. Since the matched part always starts at position 0, we take the substring of T ending at each position and record the length of its longest equal head and tail.
Let's use the example above.

    T:      a b c a b c d           (the equal head/tail part must be shorter than the substring)
    Substring                       Length of the equal head/tail of the matched part
    0-0:    a                               0
    0-1:    a b                             0
    0-2:    a b c                           0
    0-3:    a b c a                         1
    0-4:    a b c a b                       2
    0-5:    a b c a b c                     3
    0-6:    a b c a b c d                   0

From this we see that the substrings a, ab, and abc have no equal head and tail. If the match fails right after one of them, there is nothing to reuse, and KMP restarts the comparison from the first character of T.
abcabc is the comfortable case. When the match fails after it, its equal head and tail have length 3, so we can move T 3 positions to the right and skip re-matching the first three characters of T: the character of S where the match failed is compared with the fourth character of T.
The substring 0-6 is meaningless here, because a failure can never follow it: if all of 0-6 has matched, the whole of T has been found.
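The table can be reproduced mechanically by checking, for each substring, every candidate length from longest to shortest. The sketch below does exactly that by brute force; it is only meant to confirm the hand-computed values, not to be the fast construction. The names BorderTableNaive, longestBorder and table are made up for this example.

```java
import java.util.Arrays;

class BorderTableNaive {
    // Length of the longest head of s that is also a tail of s,
    // excluding s itself. Checked by brute force, longest candidate first.
    static int longestBorder(String s) {
        for (int len = s.length() - 1; len > 0; len--) {
            String tail = s.substring(s.length() - len);
            if (s.startsWith(tail)) {
                return len;
            }
        }
        return 0;
    }

    // One entry per substring t[0..end], as in the table in the text.
    static int[] table(String t) {
        int[] result = new int[t.length()];
        for (int end = 0; end < t.length(); end++) {
            result[end] = longestBorder(t.substring(0, end + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(table("abcabcd"))); // prints [0, 0, 0, 1, 2, 3, 0]
    }
}
```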

So the next array stores the length of the equal head and tail for each possible matched part. When the match fails after j characters of T have matched, we look up the stored length b for that matched part and resume the comparison at T[b], which amounts to moving T to the right by j - b positions.
For example, if the match fails after abcabc, the length is 3, so T moves 6 - 3 = 3 positions to the right. If it fails after abcab, the length is 2, so T moves 5 - 2 = 3 positions to the right.

Of course, the array above is not the final next array (it is really the prefix table), and it needs a small adjustment. Because the entry for 0-6 is meaningless, we delete it, shift the whole array one slot to the right, and set the first element to -1 (purely for programming convenience).
The final next array is [-1,0,0,0,1,2,3].
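As a small check of this adjustment, the sketch below shifts the prefix table for abcabcd and prints the result. It also prints, for a mismatch at each index j of T, the index where comparison resumes and the implied shift j - next[j]; that shift column is this sketch's reading of the table, and the names NextArrayDemo and toNext are hypothetical.

```java
import java.util.Arrays;

class NextArrayDemo {
    // Drops the last entry, shifts everything one slot to the right,
    // and puts -1 in the first slot.
    static int[] toNext(int[] prefixTable) {
        int[] next = new int[prefixTable.length];
        if (next.length == 0) {
            return next;
        }
        next[0] = -1;
        System.arraycopy(prefixTable, 0, next, 1, prefixTable.length - 1);
        return next;
    }

    public static void main(String[] args) {
        int[] prefixTable = {0, 0, 0, 1, 2, 3, 0}; // table for "abcabcd" from the text
        int[] next = toNext(prefixTable);
        System.out.println(Arrays.toString(next)); // prints [-1, 0, 0, 0, 1, 2, 3]
        for (int j = 1; j < next.length; j++) {
            System.out.println("mismatch at T[" + j + "]: resume at T[" + next[j]
                    + "], shift " + (j - next[j]));
        }
    }
}
```

Unlike an in-place shift, toNext returns a new array, so the original prefix table stays available for comparison.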

Now that we know what the next array is, one very important problem remains: how to generate it. Analyzing the substrings by hand is easy for us, but a computer cannot eyeball them. If the next array were generated by brute force, that step would be so slow that the whole KMP algorithm could end up no faster than the brute-force search, so we need a fast way to build it, described below.

How to generate next array

The key idea is this:
the length of the current equal head and tail depends on the length of the previous one.

For example, suppose the previous substring is abcab, whose equal length is 2 (positions start at 0).
The current substring is abcabc (the last character c is newly added). We only need to compare the character at position 2 of the substring with the new character c. They are equal, so the current equal length = previous equal length + 1.
And what if the previous equal length was 0? Then we compare the character at position 0 with the new character: if they are equal, the current length is 0 + 1; if not, it stays 0.

That sounds simple, but it is not the whole story.
Suppose the previous length was i, and this time the character at position i is not equal to the new character. What then? Is the length simply 0? No. Let's look at an example.

                                                 Length of the equal head/tail part
Previous:  (a b c d a b c)(a b c d a b c)                  7
Current:   (a b c d a b c)(a b c d a b c) d                ?

In this example, how long is the equal head and tail this time? It cannot be 8, because the new character d differs from the character a at position 7. But is it 0? Definitely not. By hand we can find:

                                                 Length of the equal head/tail part
Previous:  (a b c d a b c)(a b c d a b c)                  7
Current:   (a b c d) a b c a b c d (a b c d)               4

So it is definitely not 0. But that is what our eyes found. How do we build an algorithm to find it?

Here we rely on two facts:

1. When the new character does not extend the previous equal part, the new equal part cannot be longer than the previous one.
2. The new equal part, minus its last character, must lie inside the previous equal part.

Here is why. For 1, the new equal part minus its last character is itself an equal head and tail of the previous substring, so it is at most as long as the previous (longest) one. It cannot be the previous one itself, because extending that one has just failed. So it is strictly shorter, and adding the new character back gives a length of at most the previous length. For 2, look at it from the head: the new equal part minus its last character is a head of the substring that is shorter than the previous equal part, so it lies inside the previous head. The same argument from the tail side places it inside the previous tail.

Based on these two facts, we search.
To find the new equal part, we only need to look inside the previous equal part.
[Figure: the previous equal part, with its own equal head (blue) and tail (yellow)]
So we look only at the second half and find the equal head and tail within it, shown in blue and yellow in the figure above.
That length is 3, so we compare the character at position 3, d, with the new character d. They are equal, so the new equal length is 3 + 1 = 4.

Why not consider the first half here? Because the first half and the second half are exactly the same. A diagram makes this more intuitive.
[Figure: the equal parts inside the first half and inside the second half]
Because the two halves are equal, the yellow part of one corresponds to the yellow part of the other. And because the head and tail inside each equal part are equal, the yellow part and the blue part are also equal. So the head of the whole substring equals the yellow part at its end.

Of course, even after this smaller equal part is found, the new length is not necessarily 4. For example, if the last d were changed to e, the comparison would fail again.
That doesn't matter: we just keep iterating in the same way, looking inside the smaller equal part, until a comparison succeeds or the length reaches 0.
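To make the iteration concrete, the sketch below records which candidate lengths are tried when the last character of a string is appended. It is an illustration written for this explanation; the names FallbackTrace, prefixTable and candidates are made up here.

```java
import java.util.ArrayList;
import java.util.List;

class FallbackTrace {
    // Prefix table: for each prefix of p, the length of its longest equal head and tail.
    static int[] prefixTable(String p) {
        int[] prefix = new int[p.length()];
        for (int i = 1; i < p.length(); i++) {
            int k = prefix[i - 1];
            while (k > 0 && p.charAt(i) != p.charAt(k)) {
                k = prefix[k - 1]; // fall back to the equal part nested inside
            }
            prefix[i] = (p.charAt(i) == p.charAt(k)) ? k + 1 : 0;
        }
        return prefix;
    }

    // Candidate lengths tried, longest first, when the last character of p is appended.
    static List<Integer> candidates(String p) {
        List<Integer> tried = new ArrayList<>();
        int last = p.length() - 1;
        if (last < 1) {
            return tried;
        }
        int[] prefix = prefixTable(p);
        int k = prefix[last - 1];
        tried.add(k);
        while (k > 0 && p.charAt(last) != p.charAt(k)) {
            k = prefix[k - 1];
            tried.add(k);
        }
        return tried;
    }

    public static void main(String[] args) {
        // New character d: 7 fails, 3 succeeds, so the new length is 3 + 1 = 4.
        System.out.println(candidates("abcdabcabcdabcd")); // prints [7, 3]
        // New character e: 7 and 3 fail, and so does 0, so the new length is 0.
        System.out.println(candidates("abcdabcabcdabce")); // prints [7, 3, 0]
    }
}
```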

At this point the logic for building the next array is clear. Let's look at the code below; after walking through it, everything should fall into place.

Java code implementation and analysis

The code below is commented in detail.

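public class kmp {

    /**
     * Builds the prefix table: for every prefix of the pattern, the length of
     * its longest equal head and tail.
     * @param pattern the target string
     * @return the prefix table
     */
    public int[] setPrefix(char[] pattern){
        int len = pattern.length;
        int[] prefix = new int[len];

        for(int i=1; i<len; i++){
            // Length of the longest equal head/tail of the previous substring.
            // k is also the index right after that head, i.e. the character to compare with pattern[i].
            int k = prefix[i-1];

            while(pattern[i]!=pattern[k] && k!=0){
                // On a mismatch, keep falling back to the equal head/tail inside the current equal head.
                // Why k = prefix[k-1]? Leaving pattern[i] aside, the equal part we need must lie
                // inside the current equal part, so it is a smaller equal part nested in it.
                // Inside the head, the sub-head equals the sub-tail; the tail is a copy of the head,
                // so the sub-head of the head equals the sub-tail of the tail.
                // That gives a smaller equal part. Nothing longer can exist, because it would
                // contradict prefix[k-1] being the longest.
                k = prefix[k-1];
            }
            if(pattern[i]==pattern[k]){
                // Found: extend the equal part by one.
                prefix[i] = k+1;
            }
            else{
                // Not found: the length is 0.
                prefix[i] = 0;
            }
        }
        return prefix;
    }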

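    /**
     * Shifts the prefix table one slot to the right and sets the first element
     * to -1, which yields the actual next array. The array is modified in place.
     * @param prefix the prefix table
     * @return the next array
     */
    public int[] movePrefix(int[] prefix){
        for(int i = prefix.length-1; i>0; i--){
            prefix[i] = prefix[i-1];
        }
        prefix[0] = -1;
        return prefix;
    }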

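    /**
     * KMP search: prints the start index of every occurrence of pattern in text.
     * @param pattern the target string
     * @param text the string to search in
     */
    public void kmpSearch(char[] pattern,char[] text){
        if(pattern.length == 0){
            // Nothing to search for; this also avoids indexing into an empty next array.
            return;
        }
        // Build the next array
        int[] prefix = setPrefix(pattern);
        prefix = movePrefix(prefix);
        // Run the KMP search
        // text[i]     len(text)     = M
        // pattern[j]  len(pattern)  = N
        int i = 0, j = 0, M = text.length, N = pattern.length;

        while(i<M){
            if(j>=N){
                // Guards against j >= N causing an out-of-bounds array access
                j = 0;
            }
            if(j == N-1 && text[i] == pattern[j]){
                System.out.println("found pattern at :"+String.valueOf(i-j));
                // After a match is found, keep searching for further occurrences
                j = prefix[j];
                if(j==-1){
                    // Handles cases such as searching for A in AA
                    j++;
                }
            }
            if(text[i] == pattern[j]){
                i++;
                j++;
            }
            else {
                j = prefix[j];
                if(j == -1){
                    // Fell back to -1: nothing matches here, so move on to the next character of text
                    i++;
                    j++;
                }
            }
        }
    }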

    public static void main(String[] args) {
        kmp demo = new kmp();
        char[] pattern = {'A','B','A','B','C','A','B','A','A'};
//        char[] pattern = {'A'};
        char[] text = {'A','B','A','B','A','B','C','A','B','A','A','B','A','C','A','B','A','B','C','A','B','A','A'};
//        char[] text = {'A','A'};

        demo.kmpSearch(pattern,text);
    }
}
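The search above prints its results, which makes it awkward to check automatically. The sketch below is a variant that returns the match positions as a list. Note that it is not the author's code: it uses the unshifted prefix table directly instead of the -1-based next array, treats an empty pattern as having no matches (an assumption), and the names KmpIndices and findAll are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class KmpIndices {
    // Prefix table: for each prefix of p, the length of its longest equal head and tail.
    static int[] prefixTable(String p) {
        int[] prefix = new int[p.length()];
        for (int i = 1; i < p.length(); i++) {
            int k = prefix[i - 1];
            while (k > 0 && p.charAt(i) != p.charAt(k)) {
                k = prefix[k - 1];
            }
            prefix[i] = (p.charAt(i) == p.charAt(k)) ? k + 1 : 0;
        }
        return prefix;
    }

    // Returns the start index of every occurrence of pattern in text, overlaps included.
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        if (pattern.isEmpty()) {
            return hits; // assumption: an empty pattern yields no matches
        }
        int[] prefix = prefixTable(pattern);
        int j = 0; // number of pattern characters currently matched
        for (int i = 0; i < text.length(); i++) {
            while (j > 0 && text.charAt(i) != pattern.charAt(j)) {
                j = prefix[j - 1]; // reuse the equal head and tail of the matched part
            }
            if (text.charAt(i) == pattern.charAt(j)) {
                j++;
            }
            if (j == pattern.length()) {
                hits.add(i - j + 1);
                j = prefix[j - 1]; // keep going to find further occurrences
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll("ABABABCABAABACABABCABAA", "ABABCABAA")); // prints [2, 14]
        System.out.println(findAll("AA", "A"));                              // prints [0, 1]
        System.out.println(findAll("abcabccabcaaabcabcd", "abcabcd"));       // prints [12]
    }
}
```

The text index i never moves backwards, and j can only fall back as many times as it has advanced, which is where the O(n+m) bound comes from.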

References

https://blog.csdn.net/yearn520/article/details/6729426
https://www.bilibili.com/video/BV1Px411z7Yo
Thanks.