Suffix Array & Tree & automaton

A summary of the suffix family of string data structures

Some conventions used in this post

The string is called \(S\) and its length is \(n\). The concatenation of strings \(a, b\) is written \(a+b\). Indices start from 1, and \(S_{i,j}\) denotes the substring from position \(i\) to position \(j\).

Between strings, \(<, >\) denote lexicographic comparison.

Links to solutions for all example problems can be found at the end of the text.

\[ \ \]

1: SA (suffix array)

We need the array \(rk[i]\), the rank of suffix \(i\) (no two suffixes share a rank once sorting is finished), the array \(sa[i]\), the starting index of the suffix ranked \(i\), and the auxiliary arrays \(cnt, tmp\).

1-1 suffix sorting

Because the blogger's level is not high enough, I did not learn the DC3 algorithm (\(O(n)\)); what is presented below is the DA (doubling) algorithm (\(O(n\log n)\)).

Suffix sorting, by definition, sorts all suffixes \(S_{i,n}\) of \(S\) (hereafter \(Suf_i\)) in lexicographic order, where the null character (past the end of the string) is lexicographically smallest.

Consider implementing it with doubling + radix sort.

Suppose that for the current length \(k\) the substrings \(S_{i,i+k-1}\) (padded with null characters past the end) have already been sorted; the next step is to sort \(S_{i,i+2k-1}\).

\(S_{i,i+2k-1} = S_{i,i+k-1} + S_{i+k,i+2k-1}\), so the ranks of the two already-sorted halves can be used as a first and a second key, and a radix sort on this pair gives the order for length \(2k\).

If you are not yet comfortable with radix sort, you can first implement each round with a quicksort (for example std::sort), at the cost of an extra log factor.

The process is as follows

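// initialize the sa and rk arrays from the first character
for(k=1;k<=n;k<<=1) {
    rep(i,1,n) tmp[i]=i;
    sort(...);
    // compute sa and rk; note that strings that are currently equal must get the same rk
}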
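
The outline above can be filled in with `std::sort` as a first version before moving to radix sort. The following is only a minimal sketch under my own assumptions: the name `build_sa_sort`, the use of `std::vector`/`std::string`, and the initial rank "character code + 1" are illustrative and not from the original code. It keeps the post's 1-indexed convention and runs in \(O(n\log^2 n)\).

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sort-based doubling (hypothetical helper). Returns a 1-indexed sa; sa[0] is unused.
std::vector<int> build_sa_sort(const std::string& s) {
    int n = (int)s.size();
    assert(n >= 1);
    // rk has 2n+1 entries so that rk[i+k] is always readable;
    // rank 0 stands for the null character past the end of the string.
    std::vector<int> sa(n + 1, 0), rk(2 * n + 1, 0), tmp(n + 1, 0);
    for (int i = 1; i <= n; ++i) {
        sa[i] = i;
        rk[i] = (unsigned char)s[i - 1] + 1;  // order by first character, kept >= 1
    }
    for (int k = 1; k <= n; k <<= 1) {
        // Compare by (rank of first half, rank of second half).
        auto cmp = [&](int a, int b) {
            if (rk[a] != rk[b]) return rk[a] < rk[b];
            return rk[a + k] < rk[b + k];
        };
        std::sort(sa.begin() + 1, sa.end(), cmp);
        // Recompute ranks; suffixes that are still equal share a rank.
        tmp[sa[1]] = 1;
        for (int i = 2; i <= n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        for (int i = 1; i <= n; ++i) rk[i] = tmp[i];
    }
    return sa;
}
```

For example, `build_sa_sort("banana")` orders the suffixes as a, ana, anana, banana, na, nana, that is, start positions 6, 4, 2, 1, 5, 3.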

Code

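#include <cstring>

// Loop macros used in this post (assumed definitions, not shown in the original)
#define rep(i,a,b) for(int i=(a);i<=(b);++i)
#define drep(i,a,b) for(int i=(a);i>=(b);--i)

const int N=200010; // assumed capacity; must be at least 2*n+1 because rk[i+k] is read up to index 2n

int n;
char s[N]; // 1-indexed: the string is s[1..n]
int cnt[N],tmp[N],rk[N],sa[N],lcp[N]; // globals, so index 0 of each array starts at 0

void PreMake(){
    memset(cnt,0,sizeof(int)*256);
    rep(i,1,n) cnt[(unsigned char)s[i]]++;
    rep(i,1,255) cnt[i]+=cnt[i-1];
    rep(i,1,n) rk[i]=cnt[(unsigned char)s[i]],sa[i]=i; // a rough initialization, you need not write it this way
    rep(i,n+1,n*2) rk[i]=0; // clear past the end: the null character has rank 0
    for(int k=1;k<=n;k<<=1) {
        rep(i,0,n) cnt[i]=0;
        rep(i,1,n) cnt[rk[i+k]]++;
        rep(i,1,n) cnt[i]+=cnt[i-1];
        drep(i,n,1) tmp[cnt[rk[i+k]]--]=i; // counting sort by the second key

        rep(i,0,n) cnt[i]=0;
        rep(i,1,n) cnt[rk[i]]++;
        rep(i,1,n) cnt[i]+=cnt[i-1];
        drep(i,n,1) sa[cnt[rk[tmp[i]]]--]=tmp[i]; // stable counting sort by the first key gives sa

        // new ranks; relies on sa[0]=rk[0]=tmp[0]=0, so the first suffix gets rank 1
        rep(i,1,n) tmp[sa[i]]=tmp[sa[i-1]]+(rk[sa[i]]!=rk[sa[i-1]]||rk[sa[i]+k]!=rk[sa[i-1]+k]);
        rep(i,1,n) rk[i]=tmp[i]; // equal strings keep the same rk
    }
}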

In practice this turned out to be really slow, so this template is probably not good enough...

\[ \ \]

1-2 LCP

Another important part of SA is the \(LCP\) array (called the \(height\) array in some places; the name does not matter)

\(LCP\): Longest Common Prefix

\(LCP[i]=LCP(Suf_{sa[i]},Suf_{sa[i+1]})\)

Property 1: \(LCP(Suf_i,Suf_j)=\min(LCP(Suf_i,Suf_k),LCP(Suf_k,Suf_j))\) for \(rk[i]\le rk[k]\le rk[j]\)

First, for any strings \(S_1,S_2,S_3\), we have \(LCP(S_1,S_2)\ge \min(LCP(S_1,S_3),LCP(S_2,S_3))\), which presumably goes without saying.

It remains to prove that \(LCP(Suf_i,Suf_j)>\min(LCP(Suf_i,Suf_k),LCP(Suf_k,Suf_j))\) cannot happen.

Let \(LCP(Suf_i,Suf_k)=a,LCP(Suf_k,Suf_j)=b\)

If \(a>b\), then \(S_{j+b}\ne S_{k+b}\); but because \(a>b\) we also have \(S_{k+b}=S_{i+b}\), i.e. \(S_{i+b}\ne S_{j+b}\), so the strict inequality cannot hold.

If \(b>a\), the same argument applies.

If \(a=b\), then \(S_{i+a}\ne S_{k+a}\) and \(S_{k+a}\ne S_{j+a}\), and because the suffixes are sorted lexicographically, \(S_{i+a}<S_{k+a}<S_{j+a}\). Hence \(S_{i+a}\ne S_{j+a}\), and this case is excluded as well.

This completes the proof (I worked it out myself, so make do with it).

\[ \ \]

Property 2: \(LCP[rk[i+1]-1]\ge LCP[rk[i]-1]-1\)

In fact, with \(LCP[i]\) defined as above, \(LCP[rk[i]-1]=LCP(Suf_i,Suf_{sa[rk[i]-1]})\), the LCP of \(Suf_i\) with the suffix ranked just before it.

Let \(Suf_i=a,Suf_{sa[rk[i]-1]}=b,Suf_{i+1}=c,Suf_{sa[rk[i+1]-1]}=d,LCP[rk[i]-1]=x,LCP[rk[i+1]-1]=y\)

Only the case \(x\ge 2\) is discussed below; for \(x\le 1\) the claim \(y\ge x-1\) holds trivially because \(y\ge 0\).

The string \(c\) is \(a\) with its first letter removed, so \(a_{1,x}=b_{1,x},a_{2,x}=c_{1,x-1}\)

Now \(d\) is lexicographically less than \(c\), so if \(d_{1,x-1}\ne c_{1,x-1}\), then \(d_{1,x-1}<c_{1,x-1}\)

We already know there is a suffix \(e=b_{2,len(b)}\) that satisfies \(e_{1,x-1}=c_{1,x-1}\). Since \(x\ge 2\), \(a\) and \(b\) share their first character, so removing it from \(b<a\) preserves the order and \(e<c\)

Then \(d<e<c\), which contradicts \(d\) being the suffix ranked immediately before \(c\)

Therefore \(d_{1,x-1}=c_{1,x-1}\), i.e. \(y\ge x-1\)

Using Property 2, the \(LCP\) array can be computed in \(O(n)\)

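for(int i=1,h=0;i<=n;++i) {
    if(rk[i]==1) { h=0; lcp[0]=0; continue; } // the smallest suffix has no predecessor
    if(h) h--; // Property 2: lcp[rk[i+1]-1]>=lcp[rk[i]-1]-1
    int j=sa[rk[i]-1]; // the suffix ranked just before Suf_i
    while(i+h<=n && j+h<=n && s[i+h]==s[j+h]) h++;
    lcp[rk[i]-1]=h; // lcp[r] = LCP(Suf_sa[r], Suf_sa[r+1])
}
// Since 0<=h<=n and h decreases at most n times, h increases at most 2n times, so the loop is O(n)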

\[ \ \]

1-3 Application:

1: Query \(LCP(S_{i,n},S_{j,n})\)

By Property 1, this becomes computing \(\min(LCP[rk[i]..rk[j]-1])\) (\(rk[i]\le rk[j]\))

Mind the boundary cases (for example \(i=j\), where the range is empty); maintain the minimum with a sparse table (ST) or a segment tree
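
The range-minimum idea above can be sketched with a sparse table. This is only an illustration under my own assumptions: the struct `LcpQuery` and its interface are hypothetical, and it expects 1-indexed `sa[1..n]` and `lcp[1..n-1]` passed as vectors (with `lcp[i]` as defined in 1-2).

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: answers LCP(Suf_i, Suf_j) in O(1) after O(n log n) preprocessing.
struct LcpQuery {
    int n;
    std::vector<int> rk;
    std::vector<std::vector<int>> st;  // st[j][i] = min(lcp[i .. i+2^j-1])

    LcpQuery(int n_, const std::vector<int>& sa, const std::vector<int>& lcp)
        : n(n_), rk(n_ + 1, 0) {
        assert(n >= 1 && (int)sa.size() >= n + 1 && (int)lcp.size() >= n);
        for (int i = 1; i <= n; ++i) rk[sa[i]] = i;
        st.push_back(lcp);
        for (int j = 1; (1 << j) <= n - 1; ++j) {
            const std::vector<int>& prev = st.back();
            std::vector<int> cur(prev);
            for (int i = 1; i + (1 << j) - 1 <= n - 1; ++i)
                cur[i] = std::min(prev[i], prev[i + (1 << (j - 1))]);
            st.push_back(cur);
        }
    }

    int query(int i, int j) const {
        assert(1 <= i && i <= n && 1 <= j && j <= n);
        if (i == j) return n - i + 1;   // boundary: a suffix compared with itself
        int l = rk[i], r = rk[j];
        if (l > r) std::swap(l, r);
        --r;                            // the answer is min(lcp[l..r])
        int lg = 0;
        while ((1 << (lg + 1)) <= r - l + 1) ++lg;
        return std::min(st[lg][l], st[lg][r - (1 << lg) + 1]);
    }
};
```

With the arrays of "banana" (`sa = {0,6,4,2,1,5,3}`, `lcp = {0,1,3,0,0,2}`), `query(2, 4)` compares "anana" with "ana" and yields 3.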

\[ \ \]

2: Counting the number of distinct substrings

Going through the suffixes in sorted order \(sa[i]\), the answer is \(\sum_{i=1}^{n}(n-sa[i]+1-LCP[i-1])\), with \(LCP[0]=0\)

The suffix \(sa[i]\) provides \(n-sa[i]+1\) prefixes. The number of those prefixes that already appeared in a lexicographically smaller suffix is \(\max_{j<i}LCP(Suf_{sa[i]},Suf_{sa[j]})\), i.e. \(LCP[i-1]\)

SPOJ-DISUBSTR - Distinct Substrings
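
The sum above translates almost directly into code. A minimal sketch, assuming 1-indexed `sa` and `lcp` vectors with `lcp[0] = 0`; the function name `count_distinct_substrings` is my own.

```cpp
#include <cassert>
#include <vector>

// Sum over ranks of (number of prefixes of the suffix) - (prefixes already seen).
long long count_distinct_substrings(int n, const std::vector<int>& sa,
                                    const std::vector<int>& lcp) {
    assert(n >= 1 && (int)sa.size() >= n + 1 && (int)lcp.size() >= n);
    assert(lcp[0] == 0);  // the smallest suffix has no predecessor
    long long total = 0;
    for (int i = 1; i <= n; ++i)
        total += (n - sa[i] + 1) - lcp[i - 1];
    return total;
}
```

For "banana" the suffix lengths sum to 21 and the LCP values sum to 6, so the function returns 15.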

\[ \ \]

3: Finding the LCS (longest common substring)

Concatenate the two strings, with some unusual separator character (one that occurs in neither string) in between

The task then becomes finding the maximum \(LCP(Suf_i,Suf_j)\) over all \(i,j\) whose start indices fall in different strings

In \(sa\) order only the nearest \(i,j\) need to be considered, so for each \(i\) it is enough to find the nearest \(j\), which a two-pointer sweep does


POJ-2774
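
A compact sketch of this idea follows. Several things here are my own simplifications rather than the original solution: the suffixes are sorted naively with `std::string::compare` as a stand-in for the doubling build, indices are 0-based, the separator is assumed to be `'\x01'`, and only adjacent pairs in `sa` order are checked (this suffices because, by Property 1, the LCP of a farther pair is a minimum over the adjacent values in between).

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

int longest_common_substring(const std::string& a, const std::string& b) {
    const char sep = '\x01';  // assumed to occur in neither string
    assert(a.find(sep) == std::string::npos && b.find(sep) == std::string::npos);
    std::string s = a + sep + b;
    int n = (int)s.size(), na = (int)a.size();
    // Naive suffix sort (0-indexed), for illustration only.
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int x, int y) {
        return s.compare(x, std::string::npos, s, y, std::string::npos) < 0;
    });
    int best = 0;
    for (int r = 0; r + 1 < n; ++r) {
        int x = sa[r], y = sa[r + 1];
        if ((x < na) == (y < na)) continue;  // both suffixes start on the same side
        int h = 0;                            // direct LCP of the adjacent pair
        while (x + h < n && y + h < n && s[x + h] == s[y + h]) ++h;
        best = std::max(best, h);
    }
    return best;
}
```

Because the separator appears only once, a match between a suffix of the first string and a suffix of the second can never run across it. For instance, "banana" and "ananas" share "anana", so the result is 5.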

\[ \ \]

4: The \(k\)-th substring in lexicographic order

Since the suffix array is already sorted, the suffix \(sa[i]\) contributes \(n-sa[i]+1-lcp[i-1]\) new substrings

Accumulate these counts into prefix sums \(Sum[i]\). Binary searching for the first \(p\) with \(Sum[p]\ge k\) tells us that the answer is a prefix of the suffix ranked \(p\), but this string may occur more than once

Its length is \(len=k-Sum[p-1]+lcp[p-1]\): the rank among the new prefixes plus the prefixes that had already occurred

All suffixes that contain this string as a prefix form a contiguous segment \([l,r]\) of ranks, satisfying \(\forall i\in[l,r]: LCP(Suf_{sa[i]},Suf_{sa[p]})\ge len\)

This is enough to solve some problems
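
The prefix-sum and binary-search steps can be sketched as follows. This assumes the \(k\)-th smallest distinct substring is wanted, 1-indexed `sa`/`lcp` vectors with `lcp[0] = 0`, and a 0-indexed `std::string`; the name `kth_substring` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

std::string kth_substring(const std::string& s, const std::vector<int>& sa,
                          const std::vector<int>& lcp, long long k) {
    int n = (int)s.size();
    assert(n >= 1 && (int)sa.size() >= n + 1 && (int)lcp.size() >= n);
    // sum[i] = number of distinct substrings contributed by ranks 1..i
    std::vector<long long> sum(n + 1, 0);
    for (int i = 1; i <= n; ++i)
        sum[i] = sum[i - 1] + (n - sa[i] + 1) - lcp[i - 1];
    assert(1 <= k && k <= sum[n]);
    // first rank p with sum[p] >= k
    int p = (int)(std::lower_bound(sum.begin() + 1, sum.end(), k) - sum.begin());
    int len = (int)(k - sum[p - 1]) + lcp[p - 1];
    return s.substr(sa[p] - 1, len);  // sa is 1-indexed, s is 0-indexed
}
```

For "banana" the prefix sums are 1, 3, 5, 11, 13, 15, so `k = 4` lands on rank 3 (the suffix "anana") with `len = 4 - 3 + 3 = 4`, giving "anan".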

\[ \ \]

5: The string of length at least x that occurs most often, and the longest string that occurs at least x times

For the first problem, split the \(LCP\) array into segments in which every \(LCP[i]\ge x\). The suffixes within one segment share the same prefix of length \(x\), so it remains to count the largest number of suffixes that a single segment contains

The second problem can be solved by simply wrapping a binary search around this...
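
Both cases can be sketched on top of the same segment scan. The function names below are my own, the input is assumed to be a 1-indexed `lcp[1..n-1]` vector, and overlapping occurrences are counted.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Largest number of occurrences of any substring of length len:
// a run of c consecutive lcp values >= len groups c+1 suffixes.
int max_occurrences(int n, const std::vector<int>& lcp, int len) {
    assert(1 <= len && len <= n && (int)lcp.size() >= n);
    int best = 1, run = 0;
    for (int i = 1; i <= n - 1; ++i) {
        run = (lcp[i] >= len) ? run + 1 : 0;
        best = std::max(best, run + 1);
    }
    return best;
}

// Longest length whose best substring occurs at least x times (0 if none).
// Binary search works because max_occurrences does not increase with len.
int longest_with_occurrences(int n, const std::vector<int>& lcp, int x) {
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (max_occurrences(n, lcp, mid) >= x) lo = mid;
        else hi = mid - 1;
    }
    return lo;
}
```

On "banana" (`lcp = {0,1,3,0,0,2}`), "a" occurs 3 times, so `max_occurrences(6, lcp, 1)` is 3, and the longest string occurring at least twice is "ana", so `longest_with_occurrences(6, lcp, 2)` is 3.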

Example: [USACO06DEC] Milk Patterns

If the occurrences are additionally required not to overlap, also check the positions at which they occur: POJ-1743

Grouping like this can solve many other problems as well, for example POJ-3294

\[ \ \]

6: Finding repetitions (periods) using LCP

When \(LCP(Suf_i,Suf_j)\ge j-i+1\) (\(i\le j\)), a repetition (period) appears, and the number of repetitions can be determined from the length of the match

Of course, there are other methods as well; the principles are similar
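
How the match length turns into a repetition count can be sketched as follows. With \(d=j-i\) and a match of length \(h\), we have \(S_{i+t}=S_{i+t+d}\) for all \(0\le t<h\), so the block \(S_{i,j-1}\) is repeated \(\lfloor h/d\rfloor+1\) times starting at \(i\). The function name is hypothetical, and the LCP is computed by direct comparison here instead of an SA query, purely to keep the sketch short.

```cpp
#include <cassert>
#include <string>

// Number of consecutive copies of the block S[i..j-1] starting at position i
// (1-indexed positions, i < j), derived from LCP(Suf_i, Suf_j).
int repetitions(const std::string& s, int i, int j) {
    int n = (int)s.size();
    assert(1 <= i && i < j && j <= n);
    int d = j - i, h = 0;
    while (j + h <= n && s[i + h - 1] == s[j + h - 1]) ++h;  // naive LCP
    return h / d + 1;
}
```

For "abababc" with \(i=1, j=3\), the match length is 4 and \(d=2\), so the block "ab" is repeated 3 times.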

POJ-3693

Collection of links to the solutions of the example problems:

SPOJ-DISUBSTR - Distinct Substrings

Origin www.cnblogs.com/chasedeath/p/12213329.html