Substring search: what you may not know about the KMP algorithm

Preface:

(Before reading this article you should understand the basic idea of the KMP algorithm. In the spirit of keeping things simple, every example in this article is worked through from start to finish.)

 

After reading many of the KMP blog posts available online, I found that most of them present an incomplete version of the algorithm: they use the next array as the finite state automaton that backs off on a mismatch. That is certainly a good approach.

But what we look at here is a better and more complete method: the KMP algorithm with a full DFA.

 

First, the advantages of the method described here over the common approach:

  1. In the worst case, it needs only two-thirds as many operations as the common approach.
  2. In all cases, it performs no more operations on the strings than the common approach, and usually fewer.
  3. It is a more complete and more detailed idea than the common approach, and learning it will give you a new understanding of KMP.

(Readers may want to come back to these claims after reading the article to the end.)

 

I. About the finite state automaton (what is a DFA)

The KMP algorithm simulates the run of a finite state automaton. In the common approach the next array guides the automaton; in this article the dfa array does.

Different automata naturally make the program run differently.

In the KMP algorithm presented here, a two-dimensional array dfa guides the automaton:

  1. Definition: dfa = new int[R][M], where R is the number of distinct characters that may appear in the text (256 for extended ASCII, which is enough under normal circumstances) and M is the length of the pattern string.
  2. Space: dfa takes R times as much space as the next array, but the extra space buys a performance improvement.
  3. Contents: like the next array, dfa stores the position in the pattern at which to restart after a mismatch at each position, but in more detail: for each character that can appear at the mismatch, it stores the specific restart position. The benefit shows up later in the performance analysis.

 

Figure 1: the deterministic finite automaton for the pattern string ABABAC

 

Figure 1 shows the deterministic finite automaton for the pattern string pat = ABABAC.

dfa[A][j] means: when the first j characters of the pattern have matched and the current text character is 'A', the pattern position at which matching continues.

Take Figure 1 as an example. dfa[A][3] covers the case where the pattern ABABAC has been matched up to position 3 (the second B) and the text character is A. Since dfa[A][3] = 1, the pattern backs up to position 1 (the first B), and the next text character is compared against that position.

This looks complicated, but once you understand how the table is constructed you can use it flexibly.

 

1. Constructing the dfa:

We build the dfa with the help of X and j: j points to the current matching position, and X is the restart position used when a match fails. Both j and X start at 0.

For each j, we do the following:

  1. Copy dfa[][X] to dfa[][j] (for the mismatch case).
  2. Set dfa[pat.charAt(j)][j] to j + 1 (for the match case).
  3. Update X.

In code:

(Readers are advised to skim the code first, read it together with the complete example below, and then step through it in a debugger.)

dfa[pat.charAt(0)][0] = 1;
for (int X = 0, j = 1; j < M; j++) { // compute dfa[][j]
    for (int c = 0; c < R; c++) { // mismatch case
        dfa[c][j] = dfa[c][X];
    }
    dfa[pat.charAt(j)][j] = j + 1; // match case
    X = dfa[pat.charAt(j)][X]; // update the restart position
}

 

Based on the code above, here is a complete walkthrough of the construction for the pattern string ABABAC:

① j and X are both 0: set dfa[pat.charAt(0)][0] = 1.

② First iteration, X = 0, j = 1: copy column X to column j, then set dfa[pat.charAt(j)][j] = j + 1 and update X.

X is still 0 after this update, because X = dfa[pat.charAt(j)][X] = dfa[B][0] = 0 (how X changes is discussed further below).

③ Second iteration, X = 0, j = 2: copy column X to column j, then set dfa[pat.charAt(j)][j] = j + 1 and update X.

X = dfa[pat.charAt(j)][X] = dfa[A][0] = 1

④ Third iteration, X = 1, j = 3: copy column X to column j, then set dfa[pat.charAt(j)][j] = j + 1 and update X.

X = dfa[pat.charAt(j)][X] = dfa[B][1] = 2

⑤ Fourth iteration, X = 2, j = 4: copy column X to column j, then set dfa[pat.charAt(j)][j] = j + 1 and update X.

X = dfa[pat.charAt(j)][X] = dfa[A][2] = 3

⑥ Fifth iteration, X = 3, j = 5: copy column X to column j, then set dfa[pat.charAt(j)][j] = j + 1. This is the last column, so X no longer needs updating.

 

 

That completes the construction and gives the final dfa for the pattern string ABABAC.
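As a cross-check on the walkthrough, the following minimal sketch rebuilds the table with the construction loop shown earlier and prints the rows for A, B and C. The names `DfaTableDemo`, `build` and `row` are made up for this illustration, and R = 256 is the extended-ASCII assumption used in this article.

```java
// Illustrative sketch: DfaTableDemo, build and row are hypothetical names.
class DfaTableDemo {
    // Builds dfa[c][j] with the construction loop from this article.
    // Assumes a non-empty pattern whose characters are all below R.
    static int[][] build(String pat, int R) {
        int M = pat.length();
        int[][] dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++) {
                dfa[c][j] = dfa[c][X];      // mismatch: reuse the restart column
            }
            dfa[pat.charAt(j)][j] = j + 1;  // match: advance to the next state
            X = dfa[pat.charAt(j)][X];      // restart position for the next column
        }
        return dfa;
    }

    // Formats one row of the table: the transitions on character c.
    static String row(int[][] dfa, char c) {
        StringBuilder sb = new StringBuilder();
        sb.append(c).append(':');
        for (int j = 0; j < dfa[c].length; j++) {
            sb.append(' ').append(dfa[c][j]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] dfa = build("ABABAC", 256);
        for (char c : new char[] {'A', 'B', 'C'}) {
            System.out.println(row(dfa, c));
        }
        // Prints:
        // A: 1 1 3 1 5 1
        // B: 0 2 0 4 0 4
        // C: 0 0 0 0 0 6
    }
}
```

Column 3 of row A is 1, which agrees with the dfa[A][3] = 1 example given earlier.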

 

By now the idea behind the dfa construction should be clear.

As an exercise, construct the dfa for the pattern string ABRACAD yourself, then compare your result with the figure below and see whether anything differs.

 

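2. Some questions and answers about X:

X is the key to building the dfa. The following questions and answers help clarify the whole construction.

Why is the value of X always available?

A: Because X is always less than j; X walks the path that j has already walked.

Why copy column X to column j?

A: The dfa records every possible choice in each state. If a mismatch in state A can be resumed from state B, we can copy state B into state A first, so that on a mismatch in state A we use state B's choices directly.

When does the position of X change?

A: The next position of X depends on the character j currently points to, the characters j pointed to earlier, and the current position of X. In fact, X may move whether or not the character at j has appeared before.

How does the position of X change?

A: Whenever the characters at j keep lining up with the characters at X, X moves forward by one each time (X advances while the characters match a prefix of the pattern).

When the character at j has not appeared before, X points to 0.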

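3. Examples that illustrate the answers:

The figure above is the dfa array of the pattern ABCDE. ABCDE contains no repeated characters, so X still points to 0 at the end.

In the corresponding extreme case, where the leading character repeats four times, X also has to move four times, but it stops at 3 because the pattern string has been fully processed and X no longer needs to move.

How X moves is something readers need to think through while simulating the dfa construction by hand. Once it is clear, all of KMP is clear; if it is not, go back to the questions above and try answering them yourself, and new insights will follow.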
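To experiment with how X moves, the sketch below records the value of X that is in use while each column j = 1 .. M-1 is filled in. `XTraceDemo` and `trace` are hypothetical names, and AAAAA is only an assumed stand-in for the extreme case with a repeated leading character, since the pattern in the original figure is not given.

```java
import java.util.Arrays;

// Illustrative sketch: XTraceDemo and trace are hypothetical names.
class XTraceDemo {
    // Returns the X used while filling columns j = 1 .. M-1.
    // Assumes a non-empty pattern over characters below 256.
    static int[] trace(String pat) {
        int R = 256;
        int M = pat.length();
        int[][] dfa = new int[R][M];
        int[] xs = new int[M - 1];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            xs[j - 1] = X;                  // the column that column j is copied from
            for (int c = 0; c < R; c++) {
                dfa[c][j] = dfa[c][X];
            }
            dfa[pat.charAt(j)][j] = j + 1;
            X = dfa[pat.charAt(j)][X];
        }
        return xs;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(trace("ABABAC"))); // [0, 0, 1, 2, 3]
        System.out.println(Arrays.toString(trace("ABCDE")));  // [0, 0, 0, 0]
        System.out.println(Arrays.toString(trace("AAAAA")));  // [0, 1, 2, 3]
    }
}
```

The first line matches the X values in the ABABAC walkthrough, the second shows X staying at 0 when no character repeats, and the third shows X advancing by one per column until it stops at 3.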

 

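II. Changing the search method

With a powerful finite state automaton in hand, how do we use it? Is it really more powerful in practice? Let us put the two versions of the code side by side and point out what makes the new one elegant.

The overall idea is the same: loop over the txt string from start to end, keeping track of the position in the pattern string along the way.

1. First, the search code in the common approach: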

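int j = 0; // assumes next[0] == -1
for (int i = 0; i < n; i++) {
    // on a mismatch, fall back through next until a match or j == -1
    while (j > -1 && txt.charAt(i) != pat.charAt(j)) {
        j = next[j];
    }
    if (j == -1 || txt.charAt(i) == pat.charAt(j)) {
        j++;
    }
    if (j == m) {
        return i - m + 1; // start index of the match
    }
}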

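The loop runs from start to end while checking whether j equals m. Note that the for loop contains a while loop, which handles backing off and resuming the match.

The number of operations here is therefore at least i and can exceed it (any for iteration may also run the while loop).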
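The fragment above does not show how next is built. The following is a minimal runnable sketch of the common approach, assuming the convention next[0] = -1 that the loop relies on. `NextKmpDemo`, `buildNext` and `search` are hypothetical names, and returning the text length when nothing is found is a choice made here to mirror the dfa version in this article.

```java
// Illustrative sketch: NextKmpDemo, buildNext and search are hypothetical names.
class NextKmpDemo {
    // next[j] is the pattern position to fall back to after a mismatch at pat[j].
    // Assumes a non-empty pattern and the convention next[0] == -1.
    static int[] buildNext(String pat) {
        int m = pat.length();
        int[] next = new int[m];
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < m - 1) {
            if (k == -1 || pat.charAt(j) == pat.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Returns the start index of the first match, or txt.length() if there is none.
    static int search(String txt, String pat) {
        int n = txt.length();
        int m = pat.length();
        int[] next = buildNext(pat);
        int j = 0;
        for (int i = 0; i < n; i++) {
            while (j > -1 && txt.charAt(i) != pat.charAt(j)) {
                j = next[j];                // back off, possibly several times
            }
            j++;                            // here j == -1 or the characters match
            if (j == m) {
                return i - m + 1;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(search("abfeabcabc", "abc")); // 4
        System.out.println(search("AAAA", "AAAB"));      // 4 (no match: text length)
    }
}
```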

 

2. The search method using the dfa:

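for (j = 0, i = 0; i < N && j < M; i++) {
    j = dfa[txt.charAt(i)][j];
}
if (j == M) {
    System.out.println("match found");
    return i - M; // start index of the match
} else {
    System.out.println("no match");
    return N; // N signals that the pattern was not found
}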

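As you can see, the check for success or failure comes directly after the for loop. The whole process takes exactly i operations, which is never more than the common method.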
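To make the single-lookup loop concrete, this sketch records the state j after each text character while searching for ABABAC. `DfaTraceDemo`, `build` and `states` are hypothetical names, and the text ABABABAC is an example chosen here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: DfaTraceDemo, build and states are hypothetical names.
class DfaTraceDemo {
    // Same construction loop as in this article; assumes characters below 256.
    static int[][] build(String pat) {
        int R = 256;
        int M = pat.length();
        int[][] dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++) {
                dfa[c][j] = dfa[c][X];
            }
            dfa[pat.charAt(j)][j] = j + 1;
            X = dfa[pat.charAt(j)][X];
        }
        return dfa;
    }

    // Returns the state j after each consumed text character.
    static List<Integer> states(String txt, String pat) {
        int[][] dfa = build(pat);
        int M = pat.length();
        List<Integer> out = new ArrayList<>();
        for (int i = 0, j = 0; i < txt.length() && j < M; i++) {
            j = dfa[txt.charAt(i)][j];      // exactly one lookup per text character
            out.add(j);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(states("ABABABAC", "ABABAC"));
        // Prints: [1, 2, 3, 4, 5, 4, 5, 6]
    }
}
```

The drop from 5 to 4 is the mismatch on the sixth text character: the state backs off through dfa[B][5] = 4 while i keeps moving forward, and the final state 6 = M reports a match starting at index 8 - 6 = 2.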

 

 

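III. Performance analysis and comparison

① When characters do not match (this is where the two methods differ most):

With the two-dimensional dfa array as the finite state automaton, every mismatch leads straight to the exact restart position (the dfa has an entry on record for every mismatch case).

With the one-dimensional next array, the position reached after a mismatch is not certain; it is only the first candidate position.

Matching resumes from the position of the longest candidate prefix, and if that does not match either, it moves on to the next candidate (the index moves toward the front of the pattern string).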
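The difference can be seen on a single mismatch. The sketch below, under the assumption that next uses the next[0] = -1 convention, takes the pattern ABABAC in state 4 with text character C and compares the chain of fallbacks through next with the single dfa lookup. `FallbackDemo` and its methods are hypothetical names.

```java
// Illustrative sketch: FallbackDemo and its methods are hypothetical names.
class FallbackDemo {
    // Assumes a non-empty pattern and the convention next[0] == -1.
    static int[] buildNext(String pat) {
        int m = pat.length();
        int[] next = new int[m];
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < m - 1) {
            if (k == -1 || pat.charAt(j) == pat.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Same construction loop as in this article; assumes characters below 256.
    static int[][] buildDfa(String pat) {
        int R = 256;
        int M = pat.length();
        int[][] dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++) {
                dfa[c][j] = dfa[c][X];
            }
            dfa[pat.charAt(j)][j] = j + 1;
            X = dfa[pat.charAt(j)][X];
        }
        return dfa;
    }

    // Follows the next chain from state j on text character c.
    // Returns {resulting state, number of fallback steps taken}.
    static int[] viaNext(String pat, int[] next, int j, char c) {
        int steps = 0;
        while (j > -1 && c != pat.charAt(j)) {
            j = next[j];
            steps++;
        }
        return new int[] {j + 1, steps};
    }

    public static void main(String[] args) {
        String pat = "ABABAC";
        int[] r = viaNext(pat, buildNext(pat), 4, 'C');
        System.out.println("next: state " + r[0] + " after " + r[1] + " fallback steps");
        System.out.println("dfa:  state " + buildDfa(pat)['C'][4] + " after 1 lookup");
        // Prints:
        // next: state 0 after 3 fallback steps
        // dfa:  state 0 after 1 lookup
    }
}
```

Both methods end in the same state; the next array tries positions 2, 0 and -1 in turn, while the dfa reads the answer directly.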

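② When characters match:

The two methods behave the same: i and j each increase by one, and the next for iteration begins.

③ When the worst case occurs:

For the common method: if the text is AAAA and the pattern string is AAAB, the match fails at the last character and j steps back one position at a time. The search then takes up to 2n operations, plus n operations to build the next array, 3n in total.

For the complete KMP algorithm: the case above does not push it to 3n, because i keeps moving forward while j steps back, and the for loop ends when i reaches n. The search therefore takes at most n operations, plus n column steps to build the dfa array (each step also copies a column of R entries), 2n in total.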
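As a rough check of the search-phase counts, the sketch below counts character comparisons in the next-based loop and table lookups in the dfa loop. It uses a text of eight A's with the pattern AAAB, a longer variant of the example above chosen here so the difference is visible. Table construction is not counted, and `OpCountDemo` and its methods are hypothetical names.

```java
// Illustrative sketch: OpCountDemo and its methods are hypothetical names.
class OpCountDemo {
    // Assumes a non-empty pattern and the convention next[0] == -1.
    static int[] buildNext(String pat) {
        int m = pat.length();
        int[] next = new int[m];
        next[0] = -1;
        int j = 0;
        int k = -1;
        while (j < m - 1) {
            if (k == -1 || pat.charAt(j) == pat.charAt(k)) {
                j++;
                k++;
                next[j] = k;
            } else {
                k = next[k];
            }
        }
        return next;
    }

    // Same construction loop as in this article; assumes characters below 256.
    static int[][] buildDfa(String pat) {
        int R = 256;
        int M = pat.length();
        int[][] dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) {
            for (int c = 0; c < R; c++) {
                dfa[c][j] = dfa[c][X];
            }
            dfa[pat.charAt(j)][j] = j + 1;
            X = dfa[pat.charAt(j)][X];
        }
        return dfa;
    }

    // Counts text/pattern character comparisons made by the next-based search.
    static int nextComparisons(String txt, String pat) {
        int[] next = buildNext(pat);
        int m = pat.length();
        int j = 0;
        int count = 0;
        for (int i = 0; i < txt.length(); i++) {
            while (j > -1) {
                count++;                    // one comparison of txt[i] with pat[j]
                if (txt.charAt(i) == pat.charAt(j)) {
                    break;
                }
                j = next[j];
            }
            j++;
            if (j == m) {
                break;                      // match found, stop counting
            }
        }
        return count;
    }

    // Counts dfa table lookups made by the dfa-based search.
    static int dfaLookups(String txt, String pat) {
        int[][] dfa = buildDfa(pat);
        int M = pat.length();
        int count = 0;
        for (int i = 0, j = 0; i < txt.length() && j < M; i++) {
            j = dfa[txt.charAt(i)][j];
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(nextComparisons("AAAAAAAA", "AAAB")); // 13
        System.out.println(dfaLookups("AAAAAAAA", "AAAB"));      // 8
    }
}
```

For this input the next-based loop makes 13 comparisons against 8 lookups for the dfa, which stays within the 2n bound for the common method and equals n for the dfa.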

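Result:

In the usual case, the complete KMP algorithm performs fewer operations than the common algorithm.

Even in the worst case, the complete KMP algorithm needs only two-thirds as many operations as the common method.

By this count, the complete KMP performs better.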

 

 

IV. Complete implementation and test code (Java)

public class KMP {
    private String pat;
    private int[][] dfa;

    public KMP(String pat) { // build the dfa from the pattern string
        this.pat = pat;
        int M = pat.length();
        int R = 256;
        dfa = new int[R][M];
        dfa[pat.charAt(0)][0] = 1;
        for (int X = 0, j = 1; j < M; j++) { // compute dfa[][j]
            for (int c = 0; c < R; c++) { // mismatch case
                dfa[c][j] = dfa[c][X];
            }
            dfa[pat.charAt(j)][j] = j + 1; // match case
            X = dfa[pat.charAt(j)][X]; // update the restart position
        }
    }

    public int search(String txt) {
        int N = txt.length();
        int M = pat.length();
        int j, i;
        for (j = 0, i = 0; i < N && j < M; i++) {
            j = dfa[txt.charAt(i)][j];
        }
        if (j == M) {
            System.out.println("match found");
            return i - M; // start index of the match
        } else {
            System.out.println("no match");
            return N; // N signals that the pattern was not found
        }
    }
}

Test example:

    @Test
    public void KMPTest() {
        KMP kmp = new KMP("abc");
        System.out.println(kmp.search("abfeabcabc")); // the match starts at index 4
    }


Origin www.cnblogs.com/Unicron/p/11746306.html