Manacher algorithm: clear and detailed code analysis, finding all palindromic substrings in a string

Algorithmic problem introduction

I came across the Manacher algorithm while working on LeetCode 647. Its purpose is to find all palindromic substrings of a string faster than the obvious approaches.
I read many articles and watched many videos about it online and found most of them hard to follow, so I decided to write this post, both to explain the algorithm and to deepen my own understanding for later review.
Below, we go through the algorithm step by step.

First of all, we have to be clear about what a palindrome is.
Baidu Encyclopedia says: a "palindrome string" is a string that reads the same forwards and backwards, such as "level" or "noon". From these two words we can see that a palindrome is characterized by symmetry. In level, l corresponds to l, e corresponds to e, and v corresponds to itself (it is the center point); in noon, n corresponds to n and o corresponds to o.
These two examples cover the two cases of palindromes, namely odd length and even length.

Next, palindromic substrings. For the palindrome level, its substring eve is also a palindrome, and so is its substring v. So level contains 3 palindromic substrings that share its center (counting level itself); other palindromic substrings, such as the single letters l and e, have different centers. We can describe a palindrome by its radius: in level, the distance from the center to either boundary is 3 characters (counting the center itself), so the radius is 3, and for noon the radius is 2. Therefore, the number of palindromic substrings that share a palindrome's center is equal to its radius.

Our problem is: given a string, count all of its palindromic substrings.
A good first idea (center expansion) is this:
we traverse the string, treat each character as a center point, and expand to both sides, each time comparing whether the characters on the two sides are equal. If they are equal, the result is increased by 1 (a new palindromic substring has been found), and we keep expanding until they differ.
This finds all palindromic substrings, but it has two problems.
Problem 1 is parity. An odd-length palindrome has a single center character, and we only need to compare the characters on either side of it, so the odd case is pleasantly simple. The even case is more awkward: in noon, it is hard to say which character is the center. Of course, we can use two pointers: for even length one points to the first o and the other to the second o, and for odd length both point to the v in level. That handles both parities, but every center has to be treated twice, which makes the code more cumbersome.
Problem 2 is that the overall time complexity is high: O(n^2) in the worst case, for example on a string made of one repeated character.
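
To make the center expansion idea concrete, here is a minimal sketch of it in Java. The class and method names (CenterExpansion, countPalindromes, expand) are my own and are not part of the original solution; the two-pointer trick for even lengths is the one described above.

```java
final class CenterExpansion {
    // Counts palindromic substrings by expanding around every possible center.
    static int countPalindromes(String s) {
        int count = 0;
        for (int center = 0; center < s.length(); center++) {
            count += expand(s, center, center);     // odd length: one middle character
            count += expand(s, center, center + 1); // even length: two middle characters
        }
        return count;
    }

    // Expands outward while both ends match; every successful step is one palindrome.
    private static int expand(String s, int left, int right) {
        int found = 0;
        while (left >= 0 && right < s.length() && s.charAt(left) == s.charAt(right)) {
            found++;
            left--;
            right++;
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(countPalindromes("level")); // 7
        System.out.println(countPalindromes("noon"));  // 6
    }
}
```

Both problems are visible here: expand has to be called twice per center, and on a string like "aaaa..." every call runs almost to the ends of the string.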

This is where the Manacher algorithm comes to the rescue: it solves both problems.
Simply put, the Manacher algorithm is an improvement on the rather brute-force center expansion method above. It adds a few checks so that many steps do not need to be recomputed, which brings the time complexity down to O(n). (As in dp, already computed values are recorded to avoid repeated work.)

The core of Manacher algorithm

To solve the two problems above, the Manacher algorithm takes the following two steps.

The first concerns the parity of the palindrome. The algorithm modifies the string by inserting '#' between every pair of adjacent characters and at both ends (# is only an auxiliary symbol and must not occur in the original string, so it does not interfere with the original information). For example, noon becomes #n#o#o#n# and level becomes #l#e#v#e#l#.
This way we no longer have to treat odd and even lengths separately: whether the original string was odd or even in length, the processed string always has odd length, and every palindrome in it has a single center position. This is exactly what we want.
Now let's count how many # we inserted. For a string of length n, we insert n-1 between the characters and 2 at the two ends, n+1 in total, so the processed string has length 2*n+1.
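
The padding step can be sketched in a few lines. HashPadding and pad are hypothetical names of mine, and the sketch assumes that '#' never occurs in the input.

```java
final class HashPadding {
    // Returns "#c0#c1#...#": a '#' before, between and after the characters of s.
    // Assumption: '#' does not occur in s.
    static String pad(String s) {
        StringBuilder t = new StringBuilder(2 * s.length() + 1);
        t.append('#');
        for (int i = 0; i < s.length(); i++) {
            t.append(s.charAt(i)).append('#');
        }
        return t.toString();
    }

    public static void main(String[] args) {
        System.out.println(pad("noon"));           // #n#o#o#n#
        System.out.println(pad("level"));          // #l#e#v#e#l#
        System.out.println(pad("noon").length());  // 9, i.e. 2*4+1
    }
}
```

Note that the even-length palindrome noon is now centered on the middle '#', so a single center position is enough in both cases.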

Then there is the second problem: how do we avoid repeated work and so reduce the time complexity?
Let's first look at where the repetition comes from. Once you see it, you will understand most of the core of the Manacher algorithm.

As mentioned before, the Manacher algorithm is an improvement on center expansion. We now traverse the processed string and keep using center expansion to find the palindrome centered at each point.
Assume that, among the palindromes found so far, we keep track of the one whose right boundary is furthest to the right.
We call the center of this palindrome iMax and its right boundary rMax; its left boundary, which I will call lMax, is 2*iMax-rMax by symmetry.
The center currently being visited is i, and we need to expand around i to find its palindrome. But do we really need to do the comparisons one by one from scratch?
Here we must hold firmly onto the symmetry of a palindrome. Let s1 and s2 be two substrings that are mirror images of each other about iMax and that both lie inside the interval [lMax,rMax] (lMax is just my own name for the left boundary). Then the two are reversals of each other, so if s1 is a palindrome, s2 must be a palindrome too.

Clearly, the current i is greater than iMax (iMax was found by an earlier step of the traversal), and suppose we have already determined that s1 is a palindrome. Then there is no need to check again whether s2 is a palindrome: by symmetry, it must be!

By now you have more or less understood why the Manacher algorithm reduces the time complexity. We traverse from left to right, and the left side has already been examined, so why examine the right side again? By symmetry, the result is the same. (Isn't it a lot like working out the monotonicity of a symmetric function?)
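
The symmetry argument can be checked on a small example. In the sketch below, MirrorDemo, mirror and symmetricAround are names I made up for illustration, and the string "abacaba" plays the role of the palindrome [lMax,rMax] with iMax = 3.

```java
final class MirrorDemo {
    // Reflection of index i across the center iMax.
    static int mirror(int i, int iMax) {
        return 2 * iMax - i;
    }

    // True if every index in [lMax, rMax] holds the same character as its mirror image.
    // Assumes lMax = 2*iMax - rMax, so that mirrored indices stay inside the interval.
    static boolean symmetricAround(String t, int iMax, int lMax, int rMax) {
        for (int i = lMax; i <= rMax; i++) {
            if (t.charAt(i) != t.charAt(mirror(i, iMax))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String t = "abacaba";               // a palindrome with iMax = 3, lMax = 0, rMax = 6
        System.out.println(symmetricAround(t, 3, 0, 6)); // true
        String s1 = t.substring(0, 3);      // centered at 1
        String s2 = t.substring(4, 7);      // centered at mirror(1, 3) = 5
        System.out.println(s1 + " " + s2);  // aba aba
    }
}
```

Because s1 = "aba" is already known to be a palindrome, its mirror image s2 does not need to be compared character by character again.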

Of course, things are not quite that simple. To implement this we need an auxiliary data structure and several case distinctions.

Logical analysis of Manacher algorithm

First of all, how do we know what the s1 segment looks like? In other words, how long is the palindrome s1, where is its center, and what is its radius?
The algorithm uses an array f with one entry per position of the processed string, so its length is 2*n+1.
f[i] stores the radius (counting the center itself) of the longest palindrome centered at position i.
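
To see what f contains, the sketch below fills it by plain center expansion on an already padded string. This is only an illustration of the definition, not the Manacher speed-up itself; RadiusArray and radii are hypothetical names.

```java
import java.util.Arrays;

final class RadiusArray {
    // Brute-force radii on an already padded string such as "#a#b#a#".
    // f[i] counts the center itself, so a lone character has radius 1.
    static int[] radii(String t) {
        int[] f = new int[t.length()];
        for (int i = 0; i < t.length(); i++) {
            int r = 1;
            while (i - r >= 0 && i + r < t.length() && t.charAt(i - r) == t.charAt(i + r)) {
                r++;
            }
            f[i] = r;
        }
        return f;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(radii("#a#b#a#"))); // [1, 2, 1, 4, 1, 2, 1]
    }
}
```

For the center b the radius is 4, and 4-1 = 3 is exactly the length of aba in the original string.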

Then, when the traversal reaches a point i, we only need to look at f[j] for the point j that mirrors i about iMax. In the simplest situation f[i] is at least f[j] (the palindrome centered at i has radius at least f[j]), so center expansion can start its comparisons directly at i+f[j] and i-f[j], which skips the f[j]-1 comparisons nearest to the center.

Of course, it is not always that simple. Several cases have to be distinguished.

Case 1: i is smaller than rMax.
By symmetry, we can obtain f[i] from the point 2*iMax-i, which mirrors i about iMax.
But there are two sub-cases.
Case 1, sub-case A:
When the longest palindrome centered at the mirror point of i also lies within [lMax,rMax], we can naturally take f[i] = f[2*iMax-i] as the starting value.
No problem there.
Case 1, sub-case B, however:
Here s1, the palindrome around the mirror point, extends beyond [lMax,rMax], and the part outside the interval is not covered by the symmetry. So the most we can safely assign to f[i] is rMax-i+1, the largest radius that still stays within rMax.

Case 2:
This one is easy to understand: i>rMax.
Now we cannot find a point inside [lMax,rMax] that mirrors i about iMax, so we cannot use it to initialize f[i]. But we know that a single character is itself a palindrome with radius 1 (counting itself), so here we set f[i]=1.

Those are all the situations in which we can reuse earlier results, and how to handle each of them.
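
The three cases can be collected into one small helper. This is only a sketch: InitialRadius and initialRadius are names I introduce here, and the array f in main holds radii I worked out by hand for the padded string #a#a#a#b#a#a#c#, where the palindrome around b gives iMax = 7 and rMax = 12.

```java
final class InitialRadius {
    // Starting value of f[i] before any further expansion.
    // f must already hold the radii of all positions smaller than i.
    static int initialRadius(int i, int iMax, int rMax, int[] f) {
        if (i >= rMax) {
            return 1;                     // Case 2: only the character itself is known
        }
        int mirror = 2 * iMax - i;        // reflection of i across iMax
        int roomToRight = rMax - i + 1;   // largest radius that still stays within rMax
        // Case 1: sub-case A yields f[mirror], sub-case B yields roomToRight.
        return Math.min(f[mirror], roomToRight);
    }

    public static void main(String[] args) {
        // Radii of positions 0..7 of "#a#a#a#b#a#a#c#"; later positions not yet computed.
        int[] f = {1, 2, 3, 4, 3, 2, 1, 6, 0, 0, 0, 0, 0, 0, 0};
        System.out.println(initialRadius(9, 7, 12, f));  // 2 (sub-case A: f[5] = 2 fits)
        System.out.println(initialRadius(11, 7, 12, f)); // 2 (sub-case B: f[3] = 4 is cut to 12-11+1)
        System.out.println(initialRadius(13, 7, 12, f)); // 1 (Case 2)
    }
}
```

The helper also sends i == rMax down the Case 2 branch; Case 1 would give the same value 1 there, since rMax-i+1 is 1.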

Once this initial f[i] is determined, the longest palindrome centered at i has not necessarily been found yet. We still need to continue center expansion starting from f[i] to find the longest one.
At the same time, whenever the new right boundary exceeds rMax, we have to update iMax and rMax.

That completes the logic of the Manacher algorithm. Now let's look at the code, which is in Java.

Manacher code implementation

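class Solution {
    public int countSubstrings(String s) {
        int n = s.length();
        // The extra '$' on the left and '!' on the right are sentinels: they keep the
        // real positions starting at index 1 (clear of rMax's initial value 0) and stop
        // the expansion at both ends. Assumes s contains none of '$', '#', '!'.
        StringBuilder t = new StringBuilder("$#");
        // Build the new string, padded with '#'
        for (int i = 0; i < n; i++) {
            t.append(s.charAt(i));
            t.append('#');
        }
        n = t.length();
        t.append('!');

        // f[i]: radius (counting the center) of the longest palindrome centered at i
        int[] f = new int[n];
        // iMax: center of the palindrome with the furthest right boundary; rMax: that boundary
        int iMax = 0, rMax = 0, ans = 0;
        for (int i = 1; i < n; i++) {
            // Initialize f[i]: Case 1 (min chooses between sub-cases A and B) or Case 2
            f[i] = i < rMax ? Math.min(rMax - i + 1, f[2 * iMax - i]) : 1;
            // Center expansion by brute force
            while (t.charAt(i + f[i]) == t.charAt(i - f[i])) {
                f[i]++;
            }
            // Update iMax and rMax
            if (i + f[i] - 1 > rMax) {
                iMax = i;
                rMax = i + f[i] - 1;
            }
            // Count: position i contributes (f[i]-1)/2 rounded up, i.e. f[i]/2,
            // palindromic substrings of s; halving removes the effect of the added '#'
            ans += f[i] / 2;
        }
        return ans;
    }
}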
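
One way to gain confidence in the solution is to compare it with a brute-force counter on a few inputs. The sketch below restates the same logic as countSubstrings in a static method so that it compiles on its own; ManacherCheck, manacherCount, bruteCount and isPalindrome are hypothetical names, and the inputs are assumed to contain none of '$', '#', '!'.

```java
final class ManacherCheck {
    // Same logic as countSubstrings above, restated so this sketch is self-contained.
    static int manacherCount(String s) {
        StringBuilder t = new StringBuilder("$#");
        for (int i = 0; i < s.length(); i++) {
            t.append(s.charAt(i)).append('#');
        }
        int n = t.length();
        t.append('!');
        int[] f = new int[n];
        int iMax = 0, rMax = 0, ans = 0;
        for (int i = 1; i < n; i++) {
            f[i] = i < rMax ? Math.min(rMax - i + 1, f[2 * iMax - i]) : 1;
            while (t.charAt(i + f[i]) == t.charAt(i - f[i])) {
                f[i]++;
            }
            if (i + f[i] - 1 > rMax) {
                iMax = i;
                rMax = i + f[i] - 1;
            }
            ans += f[i] / 2;
        }
        return ans;
    }

    // Brute force: test every substring s[lo..hi] directly.
    static int bruteCount(String s) {
        int count = 0;
        for (int lo = 0; lo < s.length(); lo++) {
            for (int hi = lo; hi < s.length(); hi++) {
                if (isPalindrome(s, lo, hi)) {
                    count++;
                }
            }
        }
        return count;
    }

    static boolean isPalindrome(String s, int lo, int hi) {
        while (lo < hi) {
            if (s.charAt(lo++) != s.charAt(hi--)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "a", "aaa", "noon", "level", "abcba"}) {
            System.out.println("\"" + s + "\": " + manacherCount(s) + " vs " + bruteCount(s));
        }
    }
}
```

For "aaa" both counters give 6: three single letters, two copies of aa, and aaa itself.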

Origin blog.csdn.net/qq_34687559/article/details/109560059