C++: KMP string matching algorithm

In traditional string matching, we find the position where the string p appears in the string s. We call the string s the main string, and the string p the pattern string.

The idea behind the KMP algorithm is that the main string pointer i never backtracks during matching; only the pattern string pointer j does. In other words, the main string stays fixed while the pattern string slides to the right as far as possible after each mismatch.

Figure 1. The pattern string p = abaabc aligned under the main string s (indices 0 to 22), with p[0] at s[6]: s[6..10] = abaab matches p[0..4], and s[11] mismatches p[5] (c).

Suppose c in the pattern string p mismatches s[11]. The traditional matching algorithm then restarts by comparing s[7] with p[0] (a), whereas KMP continues by comparing s[11] with p[2] (a). At the time of the mismatch, s[6..10] is already known to be abaab, so s[7] is b and cannot match p[0]; that comparison is redundant. So how do we know how far to shift the pattern string to the right when a mismatch occurs?

 

Suppose the main string pointer is at i = 11 and the mismatch occurs at p[j] == 'c', that is, j == 5. The pointer j must then fall back to some position k (here k = 2) for the next comparison, and k must satisfy two relations:

Figure 1 shows that p[3]p[4]=s[9]s[10],

Figure 2 shows that p[0]p[1]=s[9]s[10]

Figure 2. The pattern string after the shift, with p[0] aligned at s[9]: p[0]p[1] = ab lies under s[9]s[10], and the next comparison is s[11] against p[2] (a).

In general, for a mismatch between s[i] and p[j] with fallback position k, the two relations are

p[j-k] ... p[j-1] = s[i-k] ... s[i-1]    (1)

p[0] ... p[k-1] = s[i-k] ... s[i-1]    (2)

which together give

p[0] ... p[k-1] = p[j-k] ... p[j-1]    (3)

Formula 3 reads as follows: when a mismatch at position j of the pattern string sends j back to k, the first k characters of p[0..j-1] must equal its last k characters. The largest k (with k < j) that meets this condition is the position to fall back to. If an array next records the fallback position for every j, then next[j] = k.
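Once such a table exists, matching only needs the fallback step j = next[j]. The following is a minimal sketch of that loop, not the article's SString class: kmpFind is a hypothetical helper name, and the next table is passed in by the caller (written out by hand in the usage note below).

```cpp
#include <string>
#include <vector>

// Hypothetical helper: finds the first occurrence of p in s at or after pos,
// using a caller-supplied next table with next[0] == -1.
// Returns -1 when p does not occur.
int kmpFind(const std::string& s, const std::string& p,
            const std::vector<int>& next, int pos = 0) {
    int i = pos;  // pointer into the main string; it never moves backwards
    int j = 0;    // pointer into the pattern string
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(p.size());
    while (i < n && j < m) {
        if (j == -1 || s[i] == p[j]) {
            ++i;
            ++j;
        } else {
            j = next[j];  // only the pattern pointer falls back
        }
    }
    return j >= m ? i - m : -1;
}
```

With the made-up main string "xxxxxxabaabaabc", the pattern "abaabc" and the table {-1, 0, 0, 1, 1, 2}, the first attempt fails at s[11] against p[5], j falls back to 2 while i stays at 11, and the call returns 9.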

So in fact, the value of next[j] in the pattern string has nothing to do with the main string s.
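To make that independence concrete, here is a sketch that computes next directly from the prefix/suffix condition, looking only at the pattern string. The name nextByDefinition is an assumption for illustration, and the brute-force search is deliberately slow (roughly cubic in the pattern length); it is meant as a reference for checking a faster version, not as the way the table is normally built.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: next[j] is the largest k < j such that the first k
// characters of p equal the k characters that end just before p[j].
// next[0] is -1 by convention.
std::vector<int> nextByDefinition(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> next(m, 0);
    if (m == 0) {
        return next;
    }
    next[0] = -1;
    for (int j = 1; j < m; ++j) {
        // Try the longest candidate first, so the first hit is the maximum k.
        for (int k = j - 1; k > 0; --k) {
            // Compare the prefix p[0..k-1] with p[j-k..j-1].
            if (p.compare(0, k, p, j - k, k) == 0) {
                next[j] = k;
                break;
            }
        }
    }
    return next;
}
```

For "abaabcac" this returns {-1, 0, 0, 1, 1, 2, 0, 1}, and for "aaaab" it returns {-1, 0, 1, 2, 3}.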

Assuming next[j] = k, how do we find next[j+1]?

If P[j] == P[k], then next[j+1] = k + 1, that is, next[j+1] = next[j] + 1.

If P[j] != P[k], the matched prefix cannot be extended, so fall back to k' = next[k] and compare P[j] with P[k']. If they are equal, next[j+1] = k' + 1; if not, repeat with k' = next[k'] until a match is found or no candidate remains. For example, combining Figure 3 and Figure 4: given next[5] = 2, what is next[6]?

Because next[5] = 2, compare P[5] with P[2]: c != a, so the prefix cannot be aligned with the suffix and k falls back further.

Because next[2] = 0, compare P[5] with P[0]: c != a again.

Because next[0] = -1, no candidate k remains.

When no candidate k remains, next[j+1] = 0, so next[6] = 0.

In general, start from k = next[j] and fall back along k = next[k] until P[j] == P[k] or k reaches -1:

k = next[j];
while (k != -1 && p[j] != p[k]) {
    k = next[k];    // fall back to a shorter matching prefix
}
next[j+1] = k + 1;  // k == -1 gives next[j+1] = 0

Figure 3. The pattern string p shown up to p[6]: abaabca.

Figure 4. The pattern string p shown up to p[7]: abaabcac.

Figure 5. The pattern string p = abaabcac.

So the next array of the string abaabcac is next[] = {-1,0,0,1,1,2,0,1}.
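The recurrence for next[j+1] can be sketched as a single linear pass over the pattern string. This is a minimal standalone version using std::string and std::vector rather than the SString class; the name buildNext is an assumption. It produces the plain next table, without the improvement discussed later.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds the plain next table (next[0] == -1).
std::vector<int> buildNext(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> next(m);
    if (m == 0) {
        return next;
    }
    next[0] = -1;
    int j = 0;   // next[j] is already known
    int k = -1;  // current candidate, starting from next[j]
    while (j < m - 1) {
        if (k == -1 || p[j] == p[k]) {
            // Either no candidate is left (next[j+1] = 0) or the prefix extends.
            ++j;
            ++k;
            next[j] = k;
        } else {
            k = next[k];  // fall back to a shorter matching prefix
        }
    }
    return next;
}
```

Calling buildNext("abaabcac") yields {-1, 0, 0, 1, 1, 2, 0, 1}. The loop stops at m - 1 because each successful step writes next[j + 1].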

Improved next:

For example, the string aaaab has next[] = {-1,0,1,2,3}.

 

(Figure: the pattern string p = aaaab aligned under the main string s, indices 0 to 22.)

In the KMP algorithm, when s[9] mismatches p[3] (a), j falls back to next[3] = 2 and s[9] is then compared with p[2]. But p[2] is also a, so this comparison is redundant and must fail again. The improved next eliminates such redundant comparisons.

If the character at the fallback position equals the original character, that is, P[j] == P[k] with k = next[j], set next[j] = next[k] directly.
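As a rough illustration of this rule, the sketch below derives the improved table from a plain next table. The name improveNext is an assumption, and the function expects next to have the same length as p with next[0] == -1.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: derives the improved table from a plain next table.
// Assumes next.size() == p.size() and next[0] == -1.
std::vector<int> improveNext(const std::string& p, const std::vector<int>& next) {
    std::vector<int> improved(next);
    for (std::size_t j = 1; j < p.size(); ++j) {
        const int k = next[j];  // k >= 0 here, because only next[0] is -1
        if (p[j] == p[k]) {
            // Falling back to k would repeat a comparison that must fail,
            // so skip straight to the (already improved) fallback of k.
            improved[j] = improved[k];
        }
    }
    return improved;
}
```

For "aaaab" with {-1, 0, 1, 2, 3} the result is {-1, -1, -1, -1, 3}; for "abaabcac" with {-1, 0, 0, 1, 1, 2, 0, 1} it is {-1, 0, -1, 1, 0, 2, -1, 1}, which matches the values noted in the example program's comment.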

hpp:

//
// Created by hu on 2020/8/5.
//

#ifndef SMARTDONGLIB_SSTRING_H
#define SMARTDONGLIB_SSTRING_H
#include <string>
#include <utility>  // std::move
#include <vector>

namespace SmartDongLib {

    class SString {
    public:
        SString(){ str_="";}
        explicit SString(std::string str): str_(std::move(str)){}
        SString(const  SString& str){ str_=str.str_;}
        int length(){return str_.length();}
        SString copy(){return *this;};
        bool isEmpty(){return str_.empty();}
        void clear(){str_=""; }
        SString concat(const SString& str2){ SString  ret(str_ + str2.str_); return ret;}
        SString subString(int pos, int len){SString ret (str_.substr(pos, len)); return ret;}
        SString subString(int pos=0){SString ret (str_.substr(pos)); return ret;}
        int index( const SString& , int pos =0);
        int index_KMP(SString& str2, int pos=0);
        SString replace(SString src, const SString& target);
        void strinsert(int pos,const SString& T){str_.insert(pos, T.str_);}
        void strdelete(int pos,int len){str_.erase(pos,len);}
        SString operator+(std::string str1){ return SString(str_ + std::move(str1));}
        SString& operator+=(const std::string& str1){ str_ += str1; return *this;}
        bool operator==(const std::string& str1){return str_==str1;}
        std::string get(){return str_;}
        void getnext(int next[]);
    private:
        std::string  str_;
    };
}

#endif //SMARTDONGLIB_SSTRING_H

cpp :

//
// Created by hu on 2020/8/5.
//

#include "sdstructure/linearlist/SString.h"
namespace SmartDongLib {

   inline int SString::index(const SString& str2,  int pos ) {
        int iptr = pos , jptr = 0;
        while (iptr < str_.length() && jptr<str2.str_.length()){
            if (str_[iptr] == str2.str_[jptr]){
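                // Equal: move on to compare the next pair of characters.
                iptr++ ; jptr++;
            } else{
                // Not equal: backtrack. The pattern pointer restarts from 0 and i goes back to
                // the start of this attempt plus 1; the start is i - j because i and j advanced together.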
                iptr = iptr - jptr+1;
                jptr = 0;
            }
        }
        if (jptr >=str2.str_.length()){
            return iptr - jptr;
        }
        return -1;
    }
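    /**
     * Computes the next array of the pattern, then matches character by character;
     * on a mismatch only the pattern pointer falls back: j = next[j].
     * @param substr pattern string to search for
     * @param pos    index in this string at which the search starts
     * @return index of the first match at or after pos, or -1 if there is none
     */
    inline int SString::index_KMP(SString& substr, int pos) {
        int i=pos, j=0;
        int thisLen=length(),sublen=substr.length();
        if (sublen == 0){
            // An empty pattern has no next table; report a match at pos, as index() does.
            return pos;
        }
        // std::vector instead of a variable-length array, which is not standard C++.
        std::vector<int> next(sublen);
        substr.getnext(next.data());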
        while ( i < thisLen && j < sublen){
            if (j==-1 || str_[i] == substr.str_[j]){
                i++;
                j++;
            } else{
                j=next[j];
            }
        }
        if (j >= sublen){
            int ret =i-sublen;
            return ret;
        }
        return -1;
    }

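    inline SString SString::replace(SString src, const SString& target) {
        if (src.str_.empty()){
            return *this;
        }
        int index = index_KMP(src);
        while ( index != -1) {
            str_.erase(index,  src.str_.length());
            str_.insert(index, target.str_);
            // Resume after the inserted text, so a target that contains src
            // is not matched again and the loop cannot run forever.
            index = index_KMP(src, index + static_cast<int>(target.str_.length()));
        }
        return *this;
    }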

    /**
     * Principle: next[j] is the largest k with 0 < k < j such that the k characters
     * before position j (elements j-k ... j-1) equal the first k characters
     * (elements 0 ... k-1); next[0] is -1.
     * This implementation stores the improved table: when str_[i] == str_[j],
     * next[i] takes next[j] instead of j.
     * Example: abaabcac gives -1 0 -1 1 0 2 -1 1.
     * @param next output array with room for at least length() elements
     */
    inline void SString::getnext(int next[]) {

        const int len =str_.length();
        if (len == 0){
            return;
        }
        int i = 0,j=-1;next[0]=-1;
        // Stop at len - 1: each step writes next[i + 1], so i < len would write next[len].
        while (i<len-1){
            if (j==-1 ||str_[i] == str_[j]){
                ++i;++j;
                if (str_[i] != str_[j]){
                    next[i]=j;
                } else{
                    next[i]=next[j];
                }
            } else{
                j=next[j];
            }
        }
    }
}

example:

//
// Created by hu on 2020/8/6.
//

#include <iostream>
#include "sdstructure/linearlist/SString.cpp"
using namespace std;
using namespace SmartDongLib;
int main(){
    SString mainstr("acabaabaabcacaabc");
    SString substr("abaabcac");//-1   0   -1   1   0   2   -1   1
    SString target("ijn");
    // std::vector instead of a variable-length array, which is not standard C++.
    vector<int> next(substr.length());
    substr.getnext(next.data());
    for (int i :next){
        cout<<i<<"   ";
    }


    int index=mainstr.index_KMP(substr);
    int index2=mainstr.index(substr);
    cout<<endl<<index;
    cout<<endl<<index2;
    mainstr.replace(substr,target);
    cout<<endl<<mainstr.get();
}

 


Origin blog.csdn.net/superSmart_Dong/article/details/107924487