Suffix Array & Suffix Tree & Suffix Automaton
A summary of the suffix family of string structures
Some of the author's conventions
The string is called \(S\) and its length \(n\); the concatenation of strings \(a, b\) is written \(a+b\); indices start from 1, and \(S_{i,j}\) denotes the substring from \(i\) to \(j\)
Between strings, \(<, >\) denote lexicographic comparison
Solutions to all example problems can be found at the end of the text
\[ \ \]
\[ \ \]
1: SA (suffix array)
We need the array \(rk[i]\), the rank of suffix \(i\) (no two suffixes share a rank), the array \(sa[i]\), the index of the suffix ranked \(i\), and the auxiliary arrays \(cnt, tmp\)
1-1 Suffix sorting
Because the author's level is not high enough, I did not learn the DC3 (\(O(n)\)) algorithm; what is presented below is the DA (doubling, \(O(n \log n)\)) algorithm
Suffix sorting, by definition, sorts every suffix \(S_{i,n}\) of \(S\) (hereafter \(Suf_i\)) in lexicographic order, with the empty character being lexicographically smallest
Consider implementing it with doubling + radix sort
Suppose the substrings \(S_{i,i+k-1}\) of the current length \(k\) have already been sorted (positions past the end count as empty characters); next consider sorting \(S_{i,i+2k-1}\)
Since \(S_{i,i+2k-1}=S_{i,i+k-1}+S_{i+k,i+2k-1}\), the ranks already computed for the two halves can serve as the two keys of a radix sort, which gives the order for length \(2k\)
If you are not yet familiar with how radix sort works here, you can implement it with a quicksort first
The process is as follows
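// initialize the sa and rk arrays from the first character
for(k=1;k<=n;k<<=1) {
    rep(i,1,n) tmp[i]=i;
    sort(...); // order by the key pair (rk[i], rk[i+k])
    // recompute sa and rk; substrings that are currently equal must receive the same rk
}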
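The outline above can be turned into a small runnable sketch by letting `std::sort` do the work of the radix sort, which costs \(O(n \log^2 n)\) instead of \(O(n \log n)\). The function name `buildSaBySort` and the choice to return a 1-based `std::vector` are assumptions made for illustration, not part of the original post.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Doubling with a comparison sort. Returns sa[1..n], the 1-based start
// positions of the suffixes in rank order; sa[0] is unused.
std::vector<int> buildSaBySort(const std::string& s) {
    int n = static_cast<int>(s.size());
    std::vector<int> sa(n + 1, 0), rk(2 * n + 2, 0), tmp(n + 1, 0);
    // Initialize from the first character. rk stays 0 past the end,
    // which models the empty character being the smallest.
    for (int i = 1; i <= n; ++i) {
        sa[i] = i;
        rk[i] = static_cast<unsigned char>(s[i - 1]) + 1;
    }
    for (int k = 1; k <= n; k <<= 1) {
        // Compare by the key pair (rk[i], rk[i+k]).
        auto less = [&](int a, int b) {
            if (rk[a] != rk[b]) return rk[a] < rk[b];
            return rk[a + k] < rk[b + k];
        };
        std::sort(sa.begin() + 1, sa.end(), less);
        // Re-rank: equal key pairs must share the same rank.
        tmp[sa[1]] = 1;
        for (int i = 2; i <= n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (less(sa[i - 1], sa[i]) ? 1 : 0);
        for (int i = 1; i <= n; ++i) rk[i] = tmp[i];
    }
    return sa;
}
```

For `"banana"` the suffixes in sorted order are `a, ana, anana, banana, na, nana`, so the result should list the start positions 6, 4, 2, 1, 5, 3.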
Code
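#include <cstring>
// Assumed helper macros: the author's template is not shown in the post
#define rep(i,a,b) for(int i=(a);i<=(b);++i)
#define drep(i,a,b) for(int i=(a);i>=(b);--i)
#define reg // presumably `register` in the author's template; left empty here
// N is the array bound; it must exceed both 2*n (rk[i+k] is read up to index 2n) and 200
int n;
char s[N];
int cnt[N],tmp[N],rk[N],sa[N],lcp[N];
void PreMake(){
    memset(cnt,0,sizeof(cnt)); // clear all counters
    rep(i,1,n) cnt[(int)s[i]]++;
    rep(i,1,200) cnt[i]+=cnt[i-1];
    rep(i,1,n) rk[i]=cnt[(int)s[i]],sa[i]=i; // a rough initialization; you do not have to write it this way
    rep(i,n+1,n*2) rk[i]=0; // clear the ranks past the end of the string
    for(reg int k=1;k<=n;k<<=1) {
        rep(i,0,n) cnt[i]=0;
        rep(i,1,n) cnt[rk[i+k]]++;
        rep(i,1,n) cnt[i]+=cnt[i-1];
        drep(i,n,1) tmp[cnt[rk[i+k]]--]=i; // sort by the second key
        rep(i,0,n) cnt[i]=0;
        rep(i,1,n) cnt[rk[i]]++;
        rep(i,1,n) cnt[i]+=cnt[i-1];
        drep(i,n,1) sa[cnt[rk[tmp[i]]]--]=tmp[i]; // stable sort by the first key, which yields sa
        rep(i,1,n) tmp[sa[i]]=tmp[sa[i-1]]+(rk[sa[i]]!=rk[sa[i-1]]||rk[sa[i]+k]!=rk[sa[i-1]+k]);
        rep(i,1,n) rk[i]=tmp[i]; // new rk; equal key pairs keep the same rank
    }
}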
In testing this turned out to be really slow, which this template should not be...
\[ \ \]
\[ \ \]
1-2 LCP
Another important part of SA is the \(LCP\) array (called the \(height\) array in some places; the name does not matter)
\(LCP\): Longest Common Prefix
\(LCP[i]\): \(LCP(Suf_{sa[i]},Suf_{sa[i+1]})\)
Property 1: \(LCP(Suf_i,Suf_j)=\min(LCP(Suf_i,Suf_k),LCP(Suf_k,Suf_j))\) \((rk[i]\leq rk[k] \leq rk[j])\)
First, for any strings \(S_1,S_2,S_3\) we have \(LCP(S_1,S_2)\ge \min(LCP(S_1,S_3),LCP(S_2,S_3))\), which presumably goes without saying
It remains to prove that \(LCP(Suf_i,Suf_j)>\min(LCP(Suf_i,Suf_k),LCP(Suf_k,Suf_j))\) cannot happen
Let \(LCP(Suf_i,Suf_k)=a,LCP(Suf_k,Suf_j)=b\)
If \(a>b\), then \(S_{j+b}\ne S_{k+b}\), and \(a>b\) gives \(S_{k+b}=S_{i+b}\), i.e. \(S_{i+b}\ne S_{j+b}\), so it cannot happen
If \(b>a\), the same argument applies
If \(a=b\), then \(S_{i+a}\ne S_{k+a},S_{k+a}\ne S_{j+a}\), and because the suffixes are in lexicographic order, \(S_{i+a}<S_{k+a}<S_{j+a}\), so \(S_{i+a}\ne S_{j+a}\) and this case is ruled out like the ones above
This completes the proof (I worked it out myself, so bear with it)
\[ \ \]
Property 2: \(LCP[rk[i+1]]\ge LCP[rk[i]]-1\)
Here \(LCP[rk[i]]\) actually stands for \(LCP(Suf_i,Suf_{sa[rk[i]-1]})\), the LCP of \(Suf_i\) with the suffix ranked just before it (under the definition of \(LCP[i]\) above, this value sits at index \(rk[i]-1\))
Let \(Suf_i=a,Suf_{sa[rk[i]-1]}=b,Suf_{i+1}=c,Suf_{sa[rk[i+1]-1]}=d,LCP[rk[i]]=x,LCP[rk[i+1]]=y\)
Only the case \(x\ge 2\) is discussed below; when \(x\leq 1\) the claim holds trivially because \(y\ge 0\)
\(c\) is \(a\) with its first letter removed, so \(a_{1,x}=b_{1,x}\) and \(a_{2,x}=c_{1,x-1}\)
Now \(d\) is lexicographically less than \(c\), so if \(d_{1,x-1}\ne c_{1,x-1}\), then \(d_{1,x-1}<c_{1,x-1}\)
We also know that the suffix \(e=b_{2,len(b)}\) satisfies \(e_{1,x-1}=c_{1,x-1}\), and since \(x\ge 2\) and \(b<a\) share their first letter, \(e<c\)
That gives \(d<e<c\), which contradicts \(d\) being ranked immediately before \(c\)
Therefore \(d_{1,x-1}=c_{1,x-1}\), i.e. \(y\ge x-1\)
By Property 2 the \(LCP\) array can be computed in \(O(n)\)
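for(reg int i=1,h=0;i<=n;++i) {
    if(h) h--; // Property 2: the new value is at least the previous one minus 1
    int j=sa[rk[i]-1]; // suffix ranked just before Suf_i; sa[0]=0 when rk[i]==1 (assumes s[0] holds no real character)
    while(i+h<=n && j+h<=n && s[i+h]==s[j+h]) h++;
    lcp[rk[i]-1]=h; // lcp[r] = LCP(Suf_{sa[r]}, Suf_{sa[r+1]})
}
// Since 0<=h<=n and h decreases at most n times, h increases at most 2n times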
\[ \ \]
1-3 Applications:
1: Query \(LCP(S_{i,n},S_{j,n})\)
By Property 1 this becomes \(\min(LCP[rk[i]..rk[j]-1])\) \((rk[i]\leq rk[j])\)
Mind the boundary cases (for \(i=j\) the range is empty and the answer is the suffix length \(n-i+1\)); maintain the minimum with a sparse table (ST) / segment tree
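As a rough illustration of the query, the sketch below builds a sparse table over the adjacent-LCP array and answers \(LCP(Suf_i,Suf_j)\) as a range minimum. The struct name `LcpQuery` is hypothetical, and the suffix order and LCP values are computed by brute force only to keep the example short; in practice they would come from the routines above.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Answers LCP(Suf_i, Suf_j) for 1-based i, j via range minimum on lcp[].
struct LcpQuery {
    int n;
    std::vector<int> sa, rk;
    std::vector<std::vector<int>> st;  // st[j][i] = min(lcp[i .. i + 2^j - 1])

    explicit LcpQuery(const std::string& s)
        : n(static_cast<int>(s.size())), sa(n + 1, 0), rk(n + 1, 0) {
        // Brute-force suffix sort (stand-in for the doubling algorithm).
        std::iota(sa.begin() + 1, sa.end(), 1);
        std::sort(sa.begin() + 1, sa.end(), [&](int a, int b) {
            return s.compare(a - 1, std::string::npos, s, b - 1, std::string::npos) < 0;
        });
        for (int i = 1; i <= n; ++i) rk[sa[i]] = i;
        // lcp[i] = LCP(Suf_{sa[i]}, Suf_{sa[i+1]}), computed naively.
        std::vector<int> lcp(n + 1, 0);
        for (int i = 1; i < n; ++i) {
            int a = sa[i] - 1, b = sa[i + 1] - 1, h = 0;
            while (a + h < n && b + h < n && s[a + h] == s[b + h]) ++h;
            lcp[i] = h;
        }
        st.push_back(lcp);
        for (int j = 1; (1 << j) <= n - 1; ++j) {
            st.push_back(std::vector<int>(n + 1, 0));
            for (int i = 1; i + (1 << j) - 1 <= n - 1; ++i)
                st[j][i] = std::min(st[j - 1][i], st[j - 1][i + (1 << (j - 1))]);
        }
    }

    int query(int i, int j) const {
        if (i == j) return n - i + 1;  // boundary case: a suffix against itself
        int l = rk[i], r = rk[j];
        if (l > r) std::swap(l, r);
        --r;  // the answer is min(lcp[l .. r])
        int lg = 0;
        while ((1 << (lg + 1)) <= r - l + 1) ++lg;
        return std::min(st[lg][l], st[lg][r - (1 << lg) + 1]);
    }
};
```

With `LcpQuery q("banana")`, `q.query(2, 4)` compares `anana` with `ana` and `q.query(3, 5)` compares `nana` with `na`.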
\[ \ \]
2: Counting the number of distinct substrings
Going through the suffixes in sorted order \(sa[i]\), the answer is \(\sum_{i=1}^{n}(n-sa[i]+1-LCP[i-1])\), with \(LCP[0]=0\)
The suffix \(sa[i]\) provides \(n-sa[i]+1\) prefixes; the number of them that already appeared in a lexicographically smaller suffix is the maximum of \(LCP(Suf_{sa[i]},Suf_{sa[j]})\) over \(j<i\), i.e. \(LCP[i-1]\)
Example: SPOJ-DISUBSTR - Distinct Substrings
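The counting formula can be checked with a few lines of code. This is a minimal sketch under the assumption that a brute-force suffix sort is acceptable for the example; the function name `countDistinctSubstrings` is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Sums (n - sa[i] + 1) - LCP[i-1] over all ranks i, with LCP[0] = 0.
long long countDistinctSubstrings(const std::string& s) {
    int n = static_cast<int>(s.size());
    std::vector<int> sa(n + 1, 0);
    // Brute-force suffix sort (stand-in for the doubling algorithm).
    std::iota(sa.begin() + 1, sa.end(), 1);
    std::sort(sa.begin() + 1, sa.end(), [&](int a, int b) {
        return s.compare(a - 1, std::string::npos, s, b - 1, std::string::npos) < 0;
    });
    long long total = 0;
    for (int i = 1; i <= n; ++i) {
        int h = 0;  // LCP with the previous suffix in rank order
        if (i > 1) {
            int a = sa[i - 1] - 1, b = sa[i] - 1;
            while (a + h < n && b + h < n && s[a + h] == s[b + h]) ++h;
        }
        total += (n - sa[i] + 1) - h;  // prefixes not seen at smaller ranks
    }
    return total;
}
```

For `"banana"` there are 21 substrings in total and the adjacent LCP values are 1, 3, 0, 0, 2, so 15 of them are distinct.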
\[ \ \]
3: Finding the LCS (longest common substring)
Concatenate the two strings with some unusual character in between (one that occurs in neither string)
Then the goal is the maximum \(LCP(Suf_i,Suf_j)\) over all \(i,j\) whose indices fall in different strings
In \(sa\) order only the nearest pairs \(i,j\) matter, so it suffices to find the nearest \(j\) for each \(i\), i.e. a two-pointer scan
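A small sketch of this idea follows. It relies on the fact that the best pair is always adjacent in rank order, so only neighbouring suffixes from different strings are compared. The function name `longestCommonSubstring` and the default separator `'\x01'` are assumptions; the separator must not occur in either input. The suffix sort is brute force for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Length of the longest common substring of a and b.
// Assumes `sep` occurs in neither string.
int longestCommonSubstring(const std::string& a, const std::string& b, char sep = '\x01') {
    std::string s = a + sep + b;
    int n = static_cast<int>(s.size());
    int lenA = static_cast<int>(a.size());
    std::vector<int> sa(n + 1, 0);
    // Brute-force suffix sort (stand-in for the doubling algorithm).
    std::iota(sa.begin() + 1, sa.end(), 1);
    std::sort(sa.begin() + 1, sa.end(), [&](int x, int y) {
        return s.compare(x - 1, std::string::npos, s, y - 1, std::string::npos) < 0;
    });
    // 1-based positions 1..lenA belong to a, lenA+1 is the separator, the rest to b.
    auto inA = [&](int p) { return p <= lenA; };
    auto inB = [&](int p) { return p > lenA + 1; };
    int best = 0;
    for (int i = 1; i < n; ++i) {
        int x = sa[i], y = sa[i + 1];
        if (!((inA(x) && inB(y)) || (inB(x) && inA(y)))) continue;
        int h = 0;  // the match cannot cross the separator, which occurs only once
        while (x + h <= n && y + h <= n && s[x + h - 1] == s[y + h - 1]) ++h;
        best = std::max(best, h);
    }
    return best;
}
```

For example, `"banana"` and `"ananas"` share `anana`, so the result for that pair should be 5.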
\[ \ \]
4: \(k\)-th smallest substring
Since the suffix array is already sorted, \(sa[i]\) contributes \(n-sa[i]+1-lcp[i-1]\) new substrings
Take the prefix sums \(Sum[i]\) of these counts; a binary search for the first \(p\) with \(Sum[p]\ge k\) tells us the answer is a prefix of the suffix ranked \(p\), though this substring may occur more than once
Its length is \(len=k-Sum[p-1]+lcp[p-1]\): the rank within the new prefixes plus the \(lcp[p-1]\) prefixes that already appeared at earlier ranks
All suffixes that start with this substring form a contiguous range of ranks \([l,r]\) satisfying \(\forall_{i\in[l,r]}LCP(Suf_{sa[i]},Suf_{sa[p]})\ge len\)
This solves some problems
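The prefix-sum lookup can be sketched as follows, counting distinct substrings in lexicographic order. The function name `kthSubstring` and the convention of returning an empty string when \(k\) is out of range are assumptions for illustration; the suffix order and LCP values are again computed by brute force.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Returns the k-th smallest distinct substring (k is 1-based), or "" if k is out of range.
std::string kthSubstring(const std::string& s, long long k) {
    int n = static_cast<int>(s.size());
    std::vector<int> sa(n + 1, 0), lcp(n + 1, 0);
    // Brute-force suffix sort and adjacent LCPs (stand-ins for the O(n log n) routines).
    std::iota(sa.begin() + 1, sa.end(), 1);
    std::sort(sa.begin() + 1, sa.end(), [&](int a, int b) {
        return s.compare(a - 1, std::string::npos, s, b - 1, std::string::npos) < 0;
    });
    for (int i = 1; i < n; ++i) {
        int a = sa[i] - 1, b = sa[i + 1] - 1, h = 0;
        while (a + h < n && b + h < n && s[a + h] == s[b + h]) ++h;
        lcp[i] = h;  // lcp[0] stays 0
    }
    // sum[i] = number of distinct substrings contributed by ranks 1..i.
    std::vector<long long> sum(n + 1, 0);
    for (int i = 1; i <= n; ++i) sum[i] = sum[i - 1] + (n - sa[i] + 1 - lcp[i - 1]);
    if (k < 1 || k > sum[n]) return "";
    // First rank p with sum[p] >= k.
    int p = static_cast<int>(std::lower_bound(sum.begin() + 1, sum.end(), k) - sum.begin());
    int len = static_cast<int>(k - sum[p - 1]) + lcp[p - 1];
    return s.substr(sa[p] - 1, len);
}
```

For `"banana"` the distinct substrings in order are `a, an, ana, anan, anana, b, ba, ...`, so `k = 3` should give `ana` and `k = 6` should give `b`.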
\[ \ \]
5: The substring of length at least \(x\) that occurs most often, and the longest substring that occurs at least \(x\) times
For the first, cut the \(LCP\) array into segments in which every \(LCP[i]\ge x\); the suffixes within one segment share the same prefix of length \(x\), so just count the largest number of suffixes a segment contains
The second can be done by simply wrapping a binary search around this...
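Both cases can be sketched with one helper that scans the LCP array for the longest run of values at least a given length. The names `adjacentLcp`, `maxGroup`, `maxOccurrences` and `longestRepeated` are hypothetical, and the suffix order is computed by brute force to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// lcp[i] = LCP of the suffixes ranked i and i+1 (1 <= i < n), by brute force.
std::vector<int> adjacentLcp(const std::string& s) {
    int n = static_cast<int>(s.size());
    std::vector<int> sa(n + 1, 0), lcp(n + 1, 0);
    std::iota(sa.begin() + 1, sa.end(), 1);
    std::sort(sa.begin() + 1, sa.end(), [&](int a, int b) {
        return s.compare(a - 1, std::string::npos, s, b - 1, std::string::npos) < 0;
    });
    for (int i = 1; i < n; ++i) {
        int a = sa[i] - 1, b = sa[i + 1] - 1, h = 0;
        while (a + h < n && b + h < n && s[a + h] == s[b + h]) ++h;
        lcp[i] = h;
    }
    return lcp;
}

// Largest number of rank-consecutive suffixes sharing a prefix of length >= len (len >= 1).
int maxGroup(const std::vector<int>& lcp, int n, int len) {
    if (n < len) return 0;  // no substring of that length exists
    int best = 1, run = 1;
    for (int i = 1; i < n; ++i) {
        run = (lcp[i] >= len) ? run + 1 : 1;
        best = std::max(best, run);
    }
    return best;
}

// Case 1: most occurrences among substrings of length at least x.
int maxOccurrences(const std::string& s, int x) {
    return maxGroup(adjacentLcp(s), static_cast<int>(s.size()), x);
}

// Case 2: longest substring occurring at least x times (0 if there is none).
int longestRepeated(const std::string& s, int x) {
    std::vector<int> lcp = adjacentLcp(s);
    int n = static_cast<int>(s.size());
    int lo = 0, hi = n;  // length 0 is treated as always feasible
    while (lo < hi) {
        int mid = (lo + hi + 1) / 2;
        if (maxGroup(lcp, n, mid) >= x) lo = mid; else hi = mid - 1;
    }
    return lo;
}
```

The binary search works because a longer required length can only shrink the groups, so feasibility is monotone in the length.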
Example: [USACO06DEC] Milk Patterns
If the occurrences are required not to overlap, additionally check the positions where they occur: POJ - 1743
This kind of grouping also solves many other problems, e.g. POJ-3294
\[ \ \]
6: Using LCP to find periods
When \(LCP(Suf_i,Suf_j)\ge j-i\) \((i<j)\), the block \(S_{i,j-1}\) is immediately repeated and a period appears; the number of repetitions can be determined from the length of the match
Of course there are other methods as well; the principles are similar
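To make the counting concrete: with block length \(p=j-i\) and match length \(L=LCP(Suf_i,Suf_{i+p})\), the block starting at \(i\) is repeated \(\lfloor L/p\rfloor+1\) times in a row. The sketch below uses a character-by-character LCP for brevity, where a real solution would use the sparse-table query; the names `naiveLcp`, `repetitions` and `hasPeriod` are assumptions for illustration.

```cpp
#include <cassert>
#include <string>

// LCP of the suffixes starting at 1-based positions i and j, by direct comparison.
int naiveLcp(const std::string& s, int i, int j) {
    int n = static_cast<int>(s.size()), h = 0;
    while (i + h <= n && j + h <= n && s[i + h - 1] == s[j + h - 1]) ++h;
    return h;
}

// Number of consecutive copies of the block S[i .. i+p-1] starting at position i.
int repetitions(const std::string& s, int i, int p) {
    int n = static_cast<int>(s.size());
    if (p <= 0 || i < 1 || i + p - 1 > n) return 0;  // the block does not fit
    int len = (i + p <= n) ? naiveLcp(s, i, i + p) : 0;
    return len / p + 1;
}

// True if the whole string has period p, i.e. S[t] == S[t+p] wherever both exist.
bool hasPeriod(const std::string& s, int p) {
    int n = static_cast<int>(s.size());
    if (p < 1 || p >= n) return false;
    return naiveLcp(s, 1, 1 + p) == n - p;
}
```

In `"abababc"` the block `ab` at position 1 matches 4 characters of the suffix starting at position 3, which gives 4 / 2 + 1 = 3 copies.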
Collection of links to the example solutions: