Algorithms - Topic on String Matching

Table of contents

1. Overview

2. Brute-force matching

3. Rabin-Karp algorithm

3.1 Algorithm overview

3.2 Algorithm supplement

3.3 Designing the hash function

3.4 Code implementation

4. KMP algorithm

4.1 Overview and process

4.2 Computing the next array


1. Overview

The string matching problem is this: given two strings str1 and str2, for example:

str1: ababa

str2: aba

decide whether str2 is a substring of str1 and, if so, return the position where str2 first appears in str1. Here the answer is 0.
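As a reference point for the expected result, the JDK's own String.indexOf already answers this question. The sketch below only illustrates the contract (0-based index of the first occurrence, or -1 if there is none); the class and method names are made up for illustration.

```java
class SubstringContract {
    // Hypothetical wrapper: returns the 0-based index of the first
    // occurrence of str2 in str1, or -1 if str2 does not occur.
    static int firstOccurrence(String str1, String str2) {
        if (str1 == null || str2 == null) {
            return -1;
        }
        return str1.indexOf(str2);
    }

    public static void main(String[] args) {
        System.out.println(firstOccurrence("ababa", "aba")); // 0
        System.out.println(firstOccurrence("ababa", "bab")); // 1
        System.out.println(firstOccurrence("ababa", "abc")); // -1
    }
}
```

The algorithms in the rest of the article should return the same values as this wrapper on the same inputs.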

This seemingly simple problem has some fairly intricate algorithms behind it. Below I introduce three string matching algorithms, from easy to difficult.

2. Brute-force matching

As the name suggests, we define two pointers: i traverses str1 and j starts at the beginning of str2. If the two pointers point to the same character, both advance together; if not, j returns to the start of str2 and i moves to the position right after the one where the current attempt began. The code is as follows:

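public int subString(String str1,String str2){
        // Handle special cases; check for null before calling length()
        if(str1==null || str2==null || str2.length()>str1.length()){
            return -1;
        }
        // Try every start position i at which str2 still fits inside str1
        for(int i=0;i+str2.length()<=str1.length();i++){
            int j=0;
            // Advance j while the characters agree
            while(j<str2.length() && str1.charAt(i+j)==str2.charAt(j)){
                j++;
            }
            if(j==str2.length()){
                return i; // every character of str2 matched, so i is the first occurrence
            }
        }
        return -1;
    }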

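The worst-case time complexity of this algorithm is O(N*M), where N is the length of str1 and M is the length of str2

This solution is the simplest, but the worst case is reached on inputs with long repeated runs, for example str1 = aaaa...a and str2 = aa...ab, where almost every start position matches M-1 characters before failing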
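To make the O(N*M) behaviour concrete, here is a small sketch that counts character comparisons during a naive scan. The helper names (countComparisons, repeat) are hypothetical and not part of the code above.

```java
class BruteForceCost {
    // Hypothetical helper: runs a naive scan and returns how many
    // character comparisons it performed before finishing.
    static long countComparisons(String text, String pattern) {
        long comparisons = 0;
        for (int i = 0; i + pattern.length() <= text.length(); i++) {
            int j = 0;
            while (j < pattern.length()) {
                comparisons++;
                if (text.charAt(i + j) != pattern.charAt(j)) {
                    break;
                }
                j++;
            }
            if (j == pattern.length()) {
                return comparisons; // stop at the first full match
            }
        }
        return comparisons;
    }

    // Builds a string of n copies of c (avoids String.repeat, which needs Java 11+).
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = repeat('a', 20);
        // 17 start positions, each failing on the 4th character
        System.out.println(countComparisons(text, "aaab")); // 68
        // 17 start positions, each failing on the 1st character
        System.out.println(countComparisons(text, "baaa")); // 17
    }
}
```

Both patterns are absent from the text, but the first costs four times as many comparisons because every attempt almost succeeds.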


3. Rabin-Karp algorithm

3.1 Algorithm overview

The Rabin-Karp algorithm compares hash values instead of comparing characters directly. How does it work?

We first compute the hash value of str2. If str2 has length 3, for example, then as we traverse str1 we compute the hash of each window of 3 consecutive characters and compare it with the hash of str2.

What is the benefit? Still assuming str2 has length 3, suppose the hash of str1[0,3) (positions 0 to 2) differs from the hash of str2. To get the hash of the next window str1[1,4), we do not need to start over: following the rules of the hash function, we remove the contribution of the character at position 0 and add the contribution of the character at position 3. This removal and addition takes O(1) time. The technique is called a rolling hash, and with it the expected running time is O(N+M), provided hash collisions are rare.


3.2 Algorithm Supplement

One part of the algorithm is designing the hash function. The drawback of the method is hash collisions: two clearly different strings may have the same hash value, and this becomes more of a concern as str2 gets longer and more complex. Collisions are unavoidable, so a hash match should be confirmed by comparing the actual characters. According to the data the author consulted, hashing 100,000 different strings generally produces about 0-3 collisions, and hashing a million different strings produces around 110.
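A small illustration of a collision, using the base-31 polynomial hash from section 3.3: the two-character strings "Aa" and "BB" hash to the same value (65*31+97 = 66*31+66 = 2112). The sameString helper is a made-up name; it only sketches the idea of confirming a hash match with a real comparison.

```java
class HashCollisionDemo {
    // Base-31 polynomial hash, as in section 3.3.
    static long hash(String s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // A hash match is only a candidate; confirm it with a real comparison.
    static boolean sameString(String a, String b) {
        return hash(a) == hash(b) && a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(hash("Aa"));             // 2112
        System.out.println(hash("BB"));             // 2112
        System.out.println(sameString("Aa", "BB")); // false
    }
}
```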


3.3 Design hash function

In the Rabin-Karp algorithm, a commonly used hash function for a string of length n is: hash=C0*31^(n-1)+C1*31^(n-2)+...+C(n-1)*31^0

Each character has an ASCII value: C0 is the ASCII value of the character at position 0 of the string, C1 is the ASCII value of the character at position 1, and so on. The process is similar to evaluating a number in base 31, with the first character as the most significant digit.

The hash function is designed as follows:

    public static long hash(String str){
        long res=0;
        for(int i=0;i<str.length();i++){
            res=31*res+str.charAt(i);
        }
        return res;
    }
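The loop above is Horner's rule: multiplying the running value by 31 before adding each character gives the first character the highest power of 31. The sketch below checks this against an explicit sum of powers; powerSumHash is a hypothetical helper written only for the comparison.

```java
class HornerCheck {
    // Horner form, same loop as the hash function above.
    static long hornerHash(String s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Explicit sum: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]*31^0.
    static long powerSumHash(String s) {
        long sum = 0;
        long power = 1; // 31^0, the weight of the last character
        for (int i = s.length() - 1; i >= 0; i--) {
            sum += s.charAt(i) * power;
            power *= 31;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hornerHash("ABA"));   // 64576
        System.out.println(powerSumHash("ABA")); // 64576
    }
}
```

Horner's form needs no separate power computation, which is why it is the usual way to write this hash.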

3.4 Code implementation

The difficulty in the Rabin-Karp algorithm is how to remove one character and add one character when updating the hash value

We assume: str1="BABABA"

                str2="ABA"

The ASCII code value of A is 65, and the ASCII code value of B is 66

Then the hash value of str2 is: 65*31^2+66*31^1+65*31^0

The hash value of the first three characters of str1, str1[0,3)="BAB", is: 66*31^2+65*31^1+66*31^0

The hash value of the next three characters, str1[1,4)="ABA", is: 65*31^2+66*31^1+65*31^0

Comparing the two, the hash of str1[1,4) is obtained by multiplying the hash of str1[0,3) by 31, adding the value of the character at position 3, and subtracting the value of the character at position 0 multiplied by 31^3 (3 is the length of str2)

If the hash instead gives the first character the weight 31^0 (hash=C0+C1*31+C2*31^2), the update is: subtract the value of the character at position 0 from the hash of str1[0,3), divide by 31, then add the value of the character at position 3 multiplied by 31^2 (2 is the length of str2 minus one)
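The first update rule can be checked numerically on this example. The sketch below assumes the hash that weights the first character by the highest power of 31 (the one the code in section 3.3 computes); the roll helper and its parameter names are hypothetical.

```java
class RollingStepDemo {
    static final long BASE = 31;

    static long hash(String s) {
        long h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = BASE * h + s.charAt(i);
        }
        return h;
    }

    // Slides a window of length m one position to the right.
    // highPower must equal BASE^m.
    static long roll(long oldHash, char outgoing, char incoming, long highPower) {
        return oldHash * BASE + incoming - highPower * outgoing;
    }

    public static void main(String[] args) {
        String str1 = "BABABA";
        long highPower = BASE * BASE * BASE;   // 31^3 = 29791
        long h0 = hash(str1.substring(0, 3));  // hash of "BAB"
        long h1 = roll(h0, str1.charAt(0), str1.charAt(3), highPower);
        System.out.println(h0);                // 65507
        System.out.println(h1);                // 64576
        System.out.println(h1 == hash("ABA")); // true
    }
}
```

Written out: 65507*31 + 65 - 66*29791 = 2030717 + 65 - 1966206 = 64576, which is the hash of "ABA".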

We can use an array hash[] to record the rolling process. The code of the rolling process is as follows:

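    public static long[] hash(String str,int M){ // M is the length of str2; assumes M<=str.length()
        // one hash per window: there are str.length()-M+1 windows
        long[] res=new long[str.length()-M+1];
        // 31^M via long multiplication; Math.pow returns a double and loses precision for large M
        long pow=1;
        for(int i=0;i<M;i++){
            pow*=31;
        }
        // first compute the hash of the window [0,M)
        res[0]=hash(str.substring(0,M));
        // start rolling
        for(int i=M;i<str.length();i++){
            char newChar=str.charAt(i);
            char oldChar=str.charAt(i-M);
            // long overflow wraps around consistently (arithmetic mod 2^64), so no explicit % is needed
            res[i-M+1]=res[i-M]*31+newChar-pow*oldChar;
        }
        return res;
    }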

The overall code implementation is as follows:

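package com.atguigu.algorithm;

public class RabinKarp2 {
    final static long seed=31;

    public static void main(String[] args) {
        String s="ABABABA";
        String p="ABA";
        long[] hash=hash(s,p.length());
        long hash1=hash(p);
        for(int i=0;i<hash.length;i++){
            // equal hashes only mark a candidate; compare the characters to rule out a collision
            if(hash[i]==hash1 && s.startsWith(p,i)){
                System.out.println("match:"+i);
            }
        }
    }
    static long hash(String str){
        long hash=0;
        for(int i=0;i<str.length();i++){
            hash=seed*hash+str.charAt(i);
        }
        return hash;
    }
    static long[] hash(String s,int M){
        // M is the length of the shorter string; no window exists if it is longer than s
        if(M>s.length()){
            return new long[0];
        }
        // use the rolling method to build an array with one hash per window
        long[] res=new long[s.length()-M+1];
        // seed^M via long multiplication; Math.pow returns a double and loses precision for large M
        long pow=1;
        for(int i=0;i<M;i++){
            pow*=seed;
        }
        // hash of the first M characters
        res[0]=hash(s.substring(0,M));
        for(int i=M;i<s.length();i++){
            char newChar=s.charAt(i);
            char oldChar=s.charAt(i-M);
            // long overflow wraps around consistently (mod 2^64), so no explicit % is needed
            res[i-M+1]=res[i-M]*seed+newChar-pow*oldChar;
        }
        return res;
    }
}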
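A common variation, not the code above, keeps every hash value reduced modulo a prime so that the numbers stay small, and returns the first matching index instead of printing all of them. The sketch below is one way to write it; the modulus 1_000_000_007 and all names are assumptions chosen for illustration.

```java
class RabinKarpSearch {
    static final long BASE = 31;
    static final long MOD = 1_000_000_007L; // assumed prime modulus

    // Returns the first index at which pattern occurs in text, or -1.
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return 0;
        }
        if (m > n) {
            return -1;
        }
        // BASE^(m-1) mod MOD: the weight of the first character in a window
        long highPower = 1;
        for (int i = 1; i < m; i++) {
            highPower = highPower * BASE % MOD;
        }
        long patternHash = 0;
        long windowHash = 0;
        for (int i = 0; i < m; i++) {
            patternHash = (patternHash * BASE + pattern.charAt(i)) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
        }
        for (int start = 0; ; start++) {
            // Confirm a hash match with a character comparison to rule out collisions
            if (windowHash == patternHash && text.startsWith(pattern, start)) {
                return start;
            }
            if (start + m >= n) {
                return -1; // no further window fits
            }
            // Drop the outgoing character, shift by BASE, append the incoming one
            long withoutFirst = (windowHash - text.charAt(start) * highPower % MOD + MOD) % MOD;
            windowHash = (withoutFirst * BASE + text.charAt(start + m)) % MOD;
        }
    }

    public static void main(String[] args) {
        System.out.println(indexOf("ABABABA", "ABA")); // 0
        System.out.println(indexOf("BABABA", "ABA"));  // 1
        System.out.println(indexOf("ababa", "abc"));   // -1
    }
}
```

Here the outgoing character is removed before the shift, so the weight needed is BASE^(m-1) rather than BASE^m; both orderings give the same window hash.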

4. KMP algorithm

4.1 Overview and process

The KMP algorithm is a compact and powerful string matching algorithm. For the underlying principles, see Zuo Chengyun's algorithm course, p13, on Bilibili. Here I only demonstrate the algorithm's process and code.

The KMP algorithm needs a next array for assistance. For each position in the pattern string, this array records the maximum matching length of the prefix and suffix of the part of the pattern before that position.

In the brute-force solution, i is the pointer that traverses str1 and j is the pointer that traverses str2. When they do not match, that is, str1.charAt(i)!=str2.charAt(j), i goes back to one past the position where the current attempt started, and j goes back to 0 to start matching again.

The point of the next array is that on a mismatch i does not move back at all, and j does not need to go back to 0 every time; it only goes back to next[j].

Assuming the next array has already been computed, the code of the algorithm is as follows:

 public int kmp(String str1,String str2){
        if(str1==null || str2==null|| str1.length()<str2.length()){
            return -1;
        }
        if(str2.isEmpty()){
            return 0; // an empty pattern matches at position 0; getNextArray needs at least one character
        }
        char[] chars1=str1.toCharArray();
        char[] chars2=str2.toCharArray();
        int i1=0;
        int i2=0;
        int[] next=getNextArray(str2);
        while(i1<str1.length() && i2<str2.length()){
            if(chars1[i1]==chars2[i2]){
                i1++;
                i2++;
            }else if(next[i2]==-1){  // next[0] is -1 by default: a mismatch at i2==0, so i2 cannot jump back further and i1 must advance to a new start
                i1++;
            }else{
                i2=next[i2]; // mismatch: jump i2 back
            }
        }
        // either i1 or i2 has run past the end
        return i2==str2.length()?i1-i2:-1;
    }

4.2 next array solution

Before computing the next array, let's first understand what the maximum matching length of a string's prefix and suffix is. Only proper prefixes and suffixes count, that is, ones shorter than the string itself.

Let's take the string bababb as an example:

For position 0, there is no character before it, so we define the maximum matching length as -1

For position 1, there is only one character before it, so the maximum matching length is 0

For position 2, the characters before it are ba. The only candidate prefix is b and the only candidate suffix is a; they are not equal, so the maximum matching length is 0

For position 3, the characters before it are bab. With length 2, the prefix is ba and the suffix is ab, which are not equal; with length 1, the prefix is b and the suffix is also b, so the maximum matching length is 1

For position 4, the characters before it are baba. The prefix ba equals the suffix ba, so the maximum matching length is 2

For position 5, the characters before it are babab. The prefix bab equals the suffix bab, so the maximum matching length is 3

So the next array of bababb is [-1, 0, 0, 1, 2, 3]
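The walkthrough above can be reproduced by applying the definition directly: for each position, try every proper prefix length from longest to shortest and keep the first one that equals the suffix. This is a deliberately slow sketch for checking results by hand, with a hypothetical method name; it is not the iterative method used by KMP.

```java
import java.util.Arrays;

class NextByDefinition {
    // For each position i: the length of the longest proper prefix of s[0,i)
    // that is also a suffix of s[0,i). Position 0 is -1 by convention.
    static int[] nextByDefinition(String s) {
        int[] next = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            if (i == 0) {
                next[i] = -1;
                continue;
            }
            String before = s.substring(0, i);
            int best = 0;
            // Try the longest proper length first
            for (int len = i - 1; len >= 1; len--) {
                String suffix = before.substring(i - len);
                if (before.startsWith(suffix)) {
                    best = len;
                    break;
                }
            }
            next[i] = best;
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(nextByDefinition("bababb")));
        // [-1, 0, 0, 1, 2, 3]
    }
}
```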

The next array is computed iteratively:

First, the next array is initialized:

next[0]=-1      next[1]=0

Let p be the pattern string and k=next[j]. If k<0 or p[j]==p[k], then next[j+1]=k+1;

Otherwise, k falls back to next[k], and this repeats until p[j]==p[k] or k<0 is satisfied

The code is as follows:

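 public static int[] getNextArray(String str){ // assumes str is non-empty
        int[] ret=new int[str.length()];
        ret[0]=-1;
        if(str.length()==1){
            return ret;
        }
        ret[1]=0;
        int j=1;
        int k=ret[j]; // k is the matching length already known for position j
        while(j<str.length()-1){
            if(k<0 || str.charAt(j)==str.charAt(k)){
                ret[++j]=++k; // extend the match by one character
            }else{
                k=ret[k]; // fall back to the next shorter candidate
            }
        }
        return ret;
    }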

The overall KMP code is as follows:

package com.atguigu.substring;

public class KMP {
    // Builds the next array for a non-empty pattern
    public int[] getNextArray(String str){
        int[] ret=new int[str.length()];
        ret[0]=-1;
        if(str.length()==1){
            return ret;
        }
        ret[1]=0;
        int j=1;
        int k=ret[j];
        while(j<str.length()-1){
            if(k<0 || str.charAt(j)==str.charAt(k)){
                ret[++j]=++k;
            }else{
                k=ret[k];
            }
        }
        return ret;
    }
    public int kmp(String str1,String str2){
        if(str1==null || str2==null|| str1.length()<str2.length()){
            return -1;
        }
        if(str2.isEmpty()){
            return 0; // an empty pattern matches at position 0; getNextArray needs at least one character
        }
        char[] chars1=str1.toCharArray();
        char[] chars2=str2.toCharArray();
        int i1=0;
        int i2=0;
        int[] next=getNextArray(str2);
        while(i1<str1.length() && i2<str2.length()){
            if(chars1[i1]==chars2[i2]){
                i1++;
                i2++;
            }else if(next[i2]==-1){  // next[0] is -1 by default: a mismatch at i2==0, so i2 cannot jump back further and i1 must advance to a new start
                i1++;
            }else{
                i2=next[i2]; // mismatch: jump i2 back
            }
        }
        // either i1 or i2 has run past the end
        return i2==str2.length()?i1-i2:-1;
    }
}
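One quick way to gain confidence in a KMP implementation is to compare its results with String.indexOf on a few inputs. The sketch below restates the algorithm in a compact, self-contained form with static methods so it can run on its own; the class name, method names and test strings are illustrative assumptions.

```java
class KmpCrossCheck {
    // next[j] = longest proper prefix of p[0,j) that is also its suffix; next[0] = -1
    static int[] buildNext(String p) {
        int[] next = new int[p.length()];
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < p.length() - 1) {
            if (k < 0 || p.charAt(j) == p.charAt(k)) {
                next[++j] = ++k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    static int kmp(String text, String pattern) {
        if (text == null || pattern == null || text.length() < pattern.length()) {
            return -1;
        }
        if (pattern.isEmpty()) {
            return 0;
        }
        int[] next = buildNext(pattern);
        int i = 0;
        int j = 0;
        while (i < text.length() && j < pattern.length()) {
            if (text.charAt(i) == pattern.charAt(j)) {
                i++;
                j++;
            } else if (next[j] == -1) {
                i++;          // mismatch at the first pattern character
            } else {
                j = next[j];  // reuse the part that is known to match
            }
        }
        return j == pattern.length() ? i - j : -1;
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"ababa", "aba"}, {"bababababb", "bababb"}, {"aaaa", "b"}, {"abc", "abcd"}
        };
        for (String[] c : cases) {
            // The two numbers on each line should be equal
            System.out.println(kmp(c[0], c[1]) + " vs " + c[0].indexOf(c[1]));
        }
    }
}
```

In the second case the pattern bababb first mismatches at its last character; next[5]=3 lets the search resume at pattern position 3 without moving the text pointer back.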

Origin blog.csdn.net/yss233333/article/details/128364975