[Study notes] Strings - Generalized Suffix Automaton

I. [Introduction]

I have spent the past week studying the wonderful world of string automata. To borrow a line from \(\text{bztMinamoto}\) that sums up how I feel right now:

I feel like my whole self has turned into an automaton ...... - \(\text{bztMinamoto}\) (Palindromic Automaton study notes)

While doing so, I found that articles online about the generalized \(\text{SAM}\) are scarce and that many of them are incorrect, so I decided to write things up properly.

II. [Motivation]

As is well known, a classic application of the \(\text{SAM}\) is counting the number of distinct substrings of a string. What if we ask the same question for a \(Trie\) instead?

Liu Yanyi wrote the following in a \(2015\) national team paper:

Most string problems that can be handled with a suffix automaton can be extended to a \(Trie\).

We call a \(\text{SAM}\) built on a \(Trie\) a generalized \(\text{SAM}\). Before studying it, make sure you are comfortable with the single-string \(\text{SAM}\).

In practice it can simply be thought of as a multi-string \(\text{SAM}\) QAQ
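As a baseline for what follows, here is a minimal sketch of the single-string case: an ordinary \(\text{SAM}\) and the classic distinct-substring count \(\sum_i (maxlen[i]-maxlen[link[i]])\). The names `SimpleSAM`, `extend` and `countDistinct`, and the restriction to lowercase letters, are assumptions made for this illustration and do not come from the original notes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Single-string suffix automaton over 'a'..'z'.
// State 1 is the root and 0 means "no state", as in the rest of these notes.
struct SimpleSAM {
    struct State {
        int link = 0, maxlen = 0;
        int next[26] = {};
    };
    std::vector<State> st;
    int last = 1;
    SimpleSAM() : st(2) {}

    void extend(char c) {
        assert(c >= 'a' && c <= 'z');  // this sketch only handles lowercase letters
        int ch = c - 'a';
        int z = (int)st.size();
        st.emplace_back();
        st[z].maxlen = st[last].maxlen + 1;
        int p = last;
        while (p && !st[p].next[ch]) { st[p].next[ch] = z; p = st[p].link; }
        if (!p) {
            st[z].link = 1;
        } else {
            int x = st[p].next[ch];
            if (st[p].maxlen + 1 == st[x].maxlen) {
                st[z].link = x;
            } else {
                State copy = st[x];  // clone x: same transitions and suffix link
                copy.maxlen = st[p].maxlen + 1;
                int y = (int)st.size();
                st.push_back(copy);
                while (p && st[p].next[ch] == x) { st[p].next[ch] = y; p = st[p].link; }
                st[x].link = st[z].link = y;
            }
        }
        last = z;
    }
};

// Number of distinct non-empty substrings of s.
long long countDistinct(const std::string& s) {
    SimpleSAM sam;
    for (char c : s) sam.extend(c);
    long long total = 0;
    for (int i = 2; i < (int)sam.st.size(); ++i)
        total += sam.st[i].maxlen - sam.st[sam.st[i].link].maxlen;
    return total;
}
```

For instance, `countDistinct("abab")` should give \(7\) (a, b, ab, ba, aba, bab, abab).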

III. [Algorithm]

When a generalized \(\text{SAM}\) is used for problems with multiple pattern strings, there are \(3\) mainstream approaches circulating online:

\((1).\) Concatenate all pattern strings into one big string with special separator symbols, build a single \(\text{SAM}\) on it, and deal with the separators through some ad hoc special cases when answering queries.

\((2).\) Before inserting each pattern string, reset \(last\) to \(1\) and insert exactly as in an ordinary \(\text{SAM}\), i.e. every string restarts construction from the start state \(1\).

\((3).\) Build a \(Trie\) from all pattern strings and construct the \(\text{SAM}\) by traversing it in \(BFS\) order. When calling \(insert\), \(last\) is the state of the node's parent in the \(Trie\); everything else is the same as in an ordinary \(\text{SAM}\).

The first is not very practical and its complexity is risky. The second is, according to the experts in my training room, a "pirated" version, but because it rarely goes wrong and the code is simple, many people use it. However, by the definition of the generalized \(\text{SAM}\), only the third is the standard approach.

A question arises here: why \(BFS\) rather than \(DFS\)? In some cases \(DFS\) can be forced into \(O(n^2)\) while \(BFS\) cannot. The details are not covered here (searching the whole web, I found only one blog that describes the generalized \(\text{SAM}\) carefully and correctly).

The \(BFS\) code is as follows:

//Trie.tr[x]: transition array of the Trie
//Trie.fa[x]: parent of node x in the Trie
//Trie.c[x]: character on node x in the Trie
//pos[x]: index of the SAM state corresponding to the prefix string of Trie node x (the string spelled by the path root->x)
inline void build(){//construct the generalized SAM by a BFS over the Trie
    for(Re i=0;i<C;++i)if(Trie.tr[1][i])Q.push(Trie.tr[1][i]);//push the first level of characters
    pos[1]=1;//the Trie root 1 maps to the SAM root 1
    while(!Q.empty()){
        Re x=Q.front();Q.pop();
        pos[x]=insert(Trie.c[x],pos[Trie.fa[x]]);//note: last is pos[Trie.fa[x]]
        for(Re i=0;i<C;++i)if(Trie.tr[x][i])Q.push(Trie.tr[x][i]);
    }
}
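To make the offline approach concrete, here is a self-contained sketch that builds a \(Trie\) from a list of strings, runs the same \(BFS\) construction with an ordinary \(insert\), and counts distinct substrings. The names `Trie::add`, `SAM::newState` and `countDistinctOffline`, the `std::vector` storage and the lowercase alphabet are assumptions for illustration only.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <string>
#include <vector>

const int C = 26;  // alphabet size; this sketch assumes 'a'..'z'

struct Trie {
    std::vector<std::array<int, C>> tr;  // tr[x][c]: child of x by c, 0 if absent
    std::vector<int> fa, ch;             // parent and incoming character of each node
    Trie() : tr(2), fa(2, 0), ch(2, 0) {}  // node 1 is the root, node 0 is unused
    void add(const std::string& s) {
        int p = 1;
        for (char c : s) {
            assert(c >= 'a' && c <= 'z');
            int k = c - 'a';
            if (!tr[p][k]) {
                int id = (int)tr.size();
                tr.push_back(std::array<int, C>{});
                fa.push_back(p);
                ch.push_back(k);
                tr[p][k] = id;
            }
            p = tr[p][k];
        }
    }
};

struct SAM {
    std::vector<std::array<int, C>> trans;
    std::vector<int> link, maxlen;
    SAM() : trans(2), link(2, 0), maxlen(2, 0) {}  // state 1 is the root
    int newState(int len) {
        trans.push_back(std::array<int, C>{});
        link.push_back(0);
        maxlen.push_back(len);
        return (int)maxlen.size() - 1;
    }
    int insert(int c, int last) {  // ordinary SAM insert, returns the new state
        int z = newState(maxlen[last] + 1), p = last;
        while (p && !trans[p][c]) { trans[p][c] = z; p = link[p]; }
        if (!p) { link[z] = 1; return z; }
        int x = trans[p][c];
        if (maxlen[p] + 1 == maxlen[x]) { link[z] = x; return z; }
        int y = newState(maxlen[p] + 1);  // split x: y takes the shorter strings
        trans[y] = trans[x];
        while (p && trans[p][c] == x) { trans[p][c] = y; p = link[p]; }
        link[y] = link[x];
        link[z] = link[x] = y;
        return z;
    }
};

// Distinct non-empty substrings over all words, via BFS on the Trie.
long long countDistinctOffline(const std::vector<std::string>& words) {
    Trie trie;
    for (const std::string& w : words) trie.add(w);
    SAM sam;
    std::vector<int> pos(trie.tr.size(), 0);  // pos[x]: SAM state of Trie node x
    pos[1] = 1;
    std::queue<int> q;
    for (int c = 0; c < C; ++c) if (trie.tr[1][c]) q.push(trie.tr[1][c]);
    while (!q.empty()) {
        int x = q.front(); q.pop();
        pos[x] = sam.insert(trie.ch[x], pos[trie.fa[x]]);
        for (int c = 0; c < C; ++c) if (trie.tr[x][c]) q.push(trie.tr[x][c]);
    }
    long long total = 0;
    for (int i = 2; i < (int)sam.maxlen.size(); ++i)
        total += sam.maxlen[i] - sam.maxlen[sam.link[i]];
    return total;
}
```

For example, `countDistinctOffline({"ab", "b"})` should give \(3\) (a, b, ab), since the shared substring b is counted once.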

Building the \(Trie\) first and then constructing the \(\text{SAM}\) is an offline approach. We can also skip the \(Trie\) and construct the automaton directly online:

Before inserting each pattern string, set \(last\) to \(1\), and add special-case checks to the ordinary \(\text{SAM}\) \(insert\) function (note that the "pirated" approach mentioned earlier uses the ordinary \(insert\) without these checks).

The modified \(insert\) code is as follows:

//link[i]: suffix link
//trans[i]: transition array
inline int insert(Re ch,Re last){//append character ch after state last
    if(trans[last][ch]&&maxlen[last]+1==maxlen[trans[last][ch]])return trans[last][ch];
    //the required node already exists (special case 1)
    Re x,y,z=++O,p=last,flag=0;maxlen[z]=maxlen[last]+1;
    while(p&&!trans[p][ch])trans[p][ch]=z,p=link[p];
    if(!p)link[z]=1;
    else{
        x=trans[p][ch];
        if(maxlen[p]+1==maxlen[x])link[z]=x;
        else{//x must be split: copy the part with len<=maxlen[p]+1 into a new node y
            if(maxlen[p]+1==maxlen[z]/*or equivalently: p==last*/)flag=1;//special case 2
            y=++O;maxlen[y]=maxlen[p]+1;
            for(Re i=0;i<C;++i)trans[y][i]=trans[x][i];
            while(p&&trans[p][ch]==x)trans[p][ch]=y,p=link[p];
            link[y]=link[x],link[z]=link[x]=y;
        }
    }
    return flag?y:z;
    //return value: the SAM state that the newly inserted character ends up in;
    //unless this was the last character of its string,
    //the return value is used as last for the next insertion
}

The return value is there to make it easy to keep track of \(last\).
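A possible way to drive this \(insert\) is sketched below: each string starts again from \(last=1\), and the return value is fed back in as the next \(last\). The wrapper `OnlineGSAM` and its methods `newState`, `addString` and `countDistinct` are hypothetical names introduced for this example, and the lowercase alphabet is an assumption.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

const int C = 26;  // alphabet size; this sketch assumes 'a'..'z'

struct OnlineGSAM {
    std::vector<std::array<int, C>> trans;
    std::vector<int> link, maxlen;
    OnlineGSAM() : trans(2), link(2, 0), maxlen(2, 0) {}  // state 1 is the root

    int newState(int len) {
        trans.push_back(std::array<int, C>{});
        link.push_back(0);
        maxlen.push_back(len);
        return (int)maxlen.size() - 1;
    }

    // insert with both special cases; returns the state to use as the next last
    int insert(int c, int last) {
        int t = trans[last][c];
        if (t && maxlen[last] + 1 == maxlen[t]) return t;  // special case 1
        int z = newState(maxlen[last] + 1), p = last;
        while (p && !trans[p][c]) { trans[p][c] = z; p = link[p]; }
        if (!p) { link[z] = 1; return z; }
        int x = trans[p][c];
        if (maxlen[p] + 1 == maxlen[x]) { link[z] = x; return z; }
        bool flag = (p == last);  // special case 2: the clone y is the state we want
        int y = newState(maxlen[p] + 1);
        trans[y] = trans[x];
        while (p && trans[p][c] == x) { trans[p][c] = y; p = link[p]; }
        link[y] = link[x];
        link[z] = link[x] = y;
        return flag ? y : z;
    }

    void addString(const std::string& s) {
        int last = 1;  // every string restarts from the root
        for (char ch : s) {
            assert(ch >= 'a' && ch <= 'z');
            last = insert(ch - 'a', last);
        }
    }

    // Distinct non-empty substrings over all strings added so far.
    long long countDistinct() const {
        long long total = 0;
        for (int i = 2; i < (int)maxlen.size(); ++i)
            total += maxlen[i] - maxlen[link[i]];
        return total;
    }
};
```

Adding "ab" and then "b" should give a count of \(3\), and adding a string that is already present should leave the count unchanged, because special case \(1\) returns the existing states.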

Next, let's look at what these two special cases actually mean:

//special case 1
if(trans[last][ch]&&maxlen[last]+1==maxlen[trans[last][ch]])
    return trans[last][ch];
//special case 2
if(maxlen[p]+1==maxlen[z]/*or equivalently: p==last*/)flag=1;

Special case \(1\) is the easier one to understand: we want to insert a node \(z\) after \(last\) such that \(maxlen[z]=maxlen[last]+1\), and such a node already exists in the \(\text{SAM}\), so it can be returned directly.

Note: the returned node holds the state of several pattern strings, i.e. identical substrings of several different pattern strings are merged into this one node. If you need to record \(endpos\) sizes, keep a separate \(siz\) array for each pattern string and update them one by one, rather than lumping everything together. [example]
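One way to keep a separate \(siz\) array per pattern string is sketched here: mark the state of every prefix of string \(k\) in `siz[k]`, then push the counts up the suffix links in order of decreasing \(maxlen\). The names `CountingGSAM`, `build` and `occurrences` are hypothetical, and the dense `siz[k][state]` table is an assumption that only suits a small number of strings.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

const int C = 26;  // alphabet size; this sketch assumes 'a'..'z'

struct CountingGSAM {
    std::vector<std::array<int, C>> trans;
    std::vector<int> link, maxlen;
    std::vector<std::vector<int>> siz;  // siz[k][s]: occurrences of state s inside string k
    CountingGSAM() : trans(2), link(2, 0), maxlen(2, 0) {}

    int newState(int len) {
        trans.push_back(std::array<int, C>{});
        link.push_back(0);
        maxlen.push_back(len);
        return (int)maxlen.size() - 1;
    }

    int insert(int c, int last) {  // online insert with both special cases
        int t = trans[last][c];
        if (t && maxlen[last] + 1 == maxlen[t]) return t;
        int z = newState(maxlen[last] + 1), p = last;
        while (p && !trans[p][c]) { trans[p][c] = z; p = link[p]; }
        if (!p) { link[z] = 1; return z; }
        int x = trans[p][c];
        if (maxlen[p] + 1 == maxlen[x]) { link[z] = x; return z; }
        bool flag = (p == last);
        int y = newState(maxlen[p] + 1);
        trans[y] = trans[x];
        while (p && trans[p][c] == x) { trans[p][c] = y; p = link[p]; }
        link[y] = link[x];
        link[z] = link[x] = y;
        return flag ? y : z;
    }

    void build(const std::vector<std::string>& words) {
        std::vector<std::vector<int>> ends(words.size());  // prefix states of each word
        for (size_t k = 0; k < words.size(); ++k) {
            int last = 1;
            for (char ch : words[k]) {
                assert(ch >= 'a' && ch <= 'z');
                last = insert(ch - 'a', last);
                ends[k].push_back(last);
            }
        }
        int n = (int)maxlen.size();
        std::vector<int> order(n - 1);
        std::iota(order.begin(), order.end(), 1);  // states 1..n-1
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return maxlen[a] > maxlen[b]; });
        siz.assign(words.size(), std::vector<int>(n, 0));
        for (size_t k = 0; k < words.size(); ++k) {
            for (int s : ends[k]) ++siz[k][s];
            // longer states first, so every state is complete before it is pushed up
            for (int s : order) if (s != 1) siz[k][link[s]] += siz[k][s];
        }
    }

    // How many times w occurs in words[k]; w is assumed non-empty and lowercase.
    int occurrences(const std::string& w, size_t k) const {
        int s = 1;
        for (char ch : w) {
            assert(ch >= 'a' && ch <= 'z');
            s = trans[s][ch - 'a'];
            if (!s) return 0;
        }
        return siz[k][s];
    }
};
```

After `build({"abab", "ab", "b"})`, the query `occurrences("ab", 0)` should return \(2\) while `occurrences("ab", 2)` returns \(0\). A single shared counter per state could not tell these apart.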

Special case \(2\) essentially handles the situation where \(trans[last][ch] \neq NULL\) and \(maxlen[last]+1 \neq maxlen[trans[last][ch]]\).

Let's first look at an illustration of \(insert\) in a single-string \(\text{SAM}\) (from \(\text{hihocoder}\)):

[Figure: \(insert\) in a single-string \(\text{SAM}\), from \(\text{hihocoder}\)]

Starting from \(last\) and jumping along \(link\), a single-string \(\text{SAM}\) always has a stretch where \(trans[p][ch]=NULL\). Once we extend to multiple strings this stretch may be missing: \(trans[last][ch]=x\) already exists, and we find \(maxlen[p]+1 \neq maxlen[x]\) (if they were equal, we would already have returned in special case \(1\)).

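Clearly, at this point no node points to the freshly created node \(z\), and \(z\) records no information. All the newly added information is stored in node \(link[z]=y\) (the copy split off from \(x\)). In general, as an empty node, \(z\) does not affect the answer. (Why is it empty? Its suffix link points to \(y\), the copy of \(trans[last][ch]\), and \(maxlen[y]=maxlen[last]+1\), so \(minlen[z]=maxlen[link[z]]+1=maxlen[y]+1=maxlen[last]+2\). Since \(maxlen[z]=maxlen[last]+1\), the range of lengths for \(z\) is empty.)

From another angle, node \(y\) satisfies \(trans[last][ch]=y\) and \(maxlen[y]=maxlen[last]+1\), which is exactly what we want (as in special case \(1\)), so we can return \(y\).

In fact, usually nothing goes wrong without special case \(2\); it just means one extra \(link\) jump. However, it may fail when collecting certain specific statistics [example], so adding this check is still recommended.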
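The empty node can be observed directly. The sketch below uses the ordinary \(insert\) with \(last\) reset to \(1\) before each string (approach \((2)\), with neither special case) and counts the states whose \(maxlen\) equals that of their suffix link. The names `PlainSAM` and `countEmptyStates` are made up for this illustration, and the lowercase alphabet is an assumption.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

const int C = 26;  // alphabet size; this sketch assumes 'a'..'z'

struct PlainSAM {
    std::vector<std::array<int, C>> trans;
    std::vector<int> link, maxlen;
    PlainSAM() : trans(2), link(2, 0), maxlen(2, 0) {}  // state 1 is the root

    int newState(int len) {
        trans.push_back(std::array<int, C>{});
        link.push_back(0);
        maxlen.push_back(len);
        return (int)maxlen.size() - 1;
    }

    int insert(int c, int last) {  // ordinary insert, no special cases
        int z = newState(maxlen[last] + 1), p = last;
        while (p && !trans[p][c]) { trans[p][c] = z; p = link[p]; }
        if (!p) { link[z] = 1; return z; }
        int x = trans[p][c];
        if (maxlen[p] + 1 == maxlen[x]) { link[z] = x; return z; }
        int y = newState(maxlen[p] + 1);
        trans[y] = trans[x];
        while (p && trans[p][c] == x) { trans[p][c] = y; p = link[p]; }
        link[y] = link[x];
        link[z] = link[x] = y;
        return z;
    }
};

// Number of states that represent no string at all (maxlen == maxlen of link).
int countEmptyStates(const std::vector<std::string>& words) {
    PlainSAM sam;
    for (const std::string& w : words) {
        int last = 1;  // restart from the root for every string
        for (char ch : w) {
            assert(ch >= 'a' && ch <= 'z');
            last = sam.insert(ch - 'a', last);
        }
    }
    int empty = 0;
    for (int i = 2; i < (int)sam.maxlen.size(); ++i)
        if (sam.maxlen[i] == sam.maxlen[sam.link[i]]) ++empty;
    return empty;
}
```

With the strings "ab" and "b", the second insertion finds \(trans[1][b]\) pointing at a state with \(maxlen=2\), splits it, and leaves behind one empty \(z\), so `countEmptyStates({"ab", "b"})` should return \(1\). A single string produces no empty states.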

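Question: how do the online and offline approaches differ?

With special case \(1\) in place, the online approach builds a \(\text{SAM}\) shaped like one built on a \(Trie\). In essence it still constructs the \(\text{SAM}\) on a \(Trie\), just one that is never materialized.

IV. [Complexity of the generalized SAM]

Let \(|T|\) be the size of the \(Trie\), \(|A|\) the alphabet size (which can be treated as a constant), and \(G(T)\) the sum of the depths of all leaves of the \(Trie\).

  • The number of states (nodes) is still linear: \(O(2|T|)\)

  • The number of transitions (edges) is bounded by \(O(|T||A|)\)

  • Offline time complexity: \(O(|T||A|+|T|)\)

  • Online time complexity: \(O(|T||A|+G(T))\)

These properties are rigorously proved in Liu Yanyi's \(2015\) national team paper, so the proofs are not repeated here.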

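V. [Example problem]

Problem link: 诸神眷顾的幻想乡 \(\text{[ZJOI2015] [P3346]}\) \(\text{[Bzoj3926]}\)

Given an unrooted tree with at most \(20\) leaves, where every node carries a number no greater than \(10\), count the number of distinct paths in the tree (two paths are considered the same if the strings formed by concatenating the numbers on their nodes in order are the same).

The first annoyance is that a path can bend (its two endpoints lie in the subtrees of two different children of their \(lca\)), whereas a \(Trie\) and every kind of automaton "accepts" a string by walking straight down from the root (What? Jump along the \(Parent\) tree? Jump all you like, the result is still straight).

So we need a way to straighten the paths. This may not be easy to come up with by guesswork, so here is the conclusion directly:

  • In an unrooted tree, any path becomes a top-down path when the tree is rooted at some leaf (which suits the generalized \(\text{SAM}\)).

Notice that the problem says there are at most \(20\) leaves. What does that imply?

Brute-force enumerate every leaf as the root and traverse the whole tree!

Take all prefix strings from the \(cnt_{leaf}\) rooted trees and build a generalized \(\text{SAM}\) on them; then the number of distinct substrings can be computed. Here a prefix string is the string formed by the path from the root (some leaf of the unrooted tree) to any node (in effect, the \(cnt_{leaf}\) \(Trie\)s are merged into one and the generalized \(\text{SAM}\) is run on it).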
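The leaf-rooting claim can be sanity-checked on small trees with a brute-force sketch such as the one below. It compares the set of all directed path strings with the set of substrings of root-to-node strings taken over every leaf root. The helpers `buildTree`, `walk`, `countAllPaths` and `countViaLeafRoots` are hypothetical; nodes are 0-indexed and colors are given as digit characters. It is also an assumption of this sketch that paths are directed and that single nodes count as paths, which matches what the \(\text{SAM}\) count measures. It is a checker for tiny inputs, not a solution to the problem.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Graph = std::vector<std::vector<int>>;

Graph buildTree(int n, const std::vector<std::pair<int, int>>& edges) {
    Graph g(n);
    for (const auto& e : edges) {
        g[e.first].push_back(e.second);
        g[e.second].push_back(e.first);
    }
    return g;
}

// Appends to out the string of every path that starts at the DFS origin
// and ends at x or somewhere below x (walking away from parent).
void walk(const Graph& g, const std::string& color, int x, int parent,
          std::string& cur, std::vector<std::string>& out) {
    cur.push_back(color[x]);
    out.push_back(cur);
    for (int to : g[x])
        if (to != parent) walk(g, color, to, x, cur, out);
    cur.pop_back();
}

// Reference answer: enumerate every directed path from every start node.
size_t countAllPaths(const Graph& g, const std::string& color) {
    assert(color.size() == g.size());
    std::set<std::string> seen;
    for (int s = 0; s < (int)g.size(); ++s) {
        std::vector<std::string> paths;
        std::string cur;
        walk(g, color, s, -1, cur, paths);
        seen.insert(paths.begin(), paths.end());
    }
    return seen.size();
}

// Leaf-rooted version: root-to-node strings from each leaf, then all their substrings.
size_t countViaLeafRoots(const Graph& g, const std::string& color) {
    assert(color.size() == g.size());
    std::set<std::string> seen;
    for (int r = 0; r < (int)g.size(); ++r) {
        if (g[r].size() != 1) continue;  // only leaves are used as roots
        std::vector<std::string> prefixes;
        std::string cur;
        walk(g, color, r, -1, cur, prefixes);
        for (const std::string& p : prefixes)
            for (size_t i = 0; i < p.size(); ++i)
                for (size_t len = 1; i + len <= p.size(); ++len)
                    seen.insert(p.substr(i, len));
    }
    return seen.size();
}
```

On the path 0-1-2 with colors "010", both functions should return \(5\) (0, 1, 01, 10, 010). On a star whose center has color 0 and whose three leaves have colors 1, 2, 3, both should return \(16\).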

Mind the array sizes and the memory limit.

[Code]

[Offline]

#include<algorithm>
#include<cstring>
#include<cstdio>
#include<queue>
#define Re int //the register keyword was removed in C++17, so plain int is used
#define LL long long
using namespace std;
const int N=4e6+5,N20=2e6+3,Nn=1e5+3;
int n,m,o,x,y,t,C,du[Nn],co[Nn],head[Nn];LL ans;
struct QAQ{int to,next;}a[Nn<<1];
inline void add(Re x,Re y){a[++o].to=y,a[o].next=head[x],head[x]=o;}
inline void in(Re &x){
    int fu=0;x=0;char c=getchar();
    while(c<'0'||c>'9')fu|=c=='-',c=getchar();
    while(c>='0'&&c<='9')x=(x<<1)+(x<<3)+(c^48),c=getchar();
    x=fu?-x:x;
}
struct Trie{
    int O,c[N20],fa[N20],tr[N20][10];
    //fa[x]: parent of x in the Trie
    //c[x]: color (character) of x in the Trie
    Trie(){O=1;}//the root is node 1
    inline int insert(Re p,Re ch){//insert character ch below node p
        if(!tr[p][ch])tr[p][ch]=++O,c[O]=ch,fa[O]=p;
        return tr[p][ch];
    }
}T1;
struct Suffix_Automaton{    
    int O,pos[N],link[N],trans[N][10],maxlen[N];queue<int>Q;
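    //pos[x]: index of the SAM state corresponding to Trie node x (the string spelled by the path 1->x)
    //link[i]: suffix link
    //trans[i]: transition array
    Suffix_Automaton(){O=1;}//the root is state 1
    inline int insert(Re ch,Re last){//identical to the ordinary SAM insert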
        Re x,y,z=++O,p=last;maxlen[z]=maxlen[last]+1;
        while(p&&!trans[p][ch])trans[p][ch]=z,p=link[p];
        if(!p)link[z]=1;
        else{
            x=trans[p][ch];
            if(maxlen[p]+1==maxlen[x])link[z]=x;
            else{
                y=++O;maxlen[y]=maxlen[p]+1;
                for(Re i=0;i<C;++i)trans[y][i]=trans[x][i];
                while(p&&trans[p][ch]==x)trans[p][ch]=y,p=link[p];
                link[y]=link[x],link[z]=link[x]=y;
            }
        }
        return z;
    }
    inline void build(){//construct the generalized SAM by a BFS over the Trie
        for(Re i=0;i<C;++i)if(T1.tr[1][i])Q.push(T1.tr[1][i]);//push the first level of characters
        pos[1]=1;//the Trie root 1 maps to the SAM root 1
        while(!Q.empty()){
            Re x=Q.front();Q.pop();
            pos[x]=insert(T1.c[x],pos[T1.fa[x]]);//note: last is pos[T1.fa[x]]
            for(Re i=0;i<C;++i)if(T1.tr[x][i])Q.push(T1.tr[x][i]);
        }
    }
    inline void sakura(){
        for(Re i=2;i<=O;++i)ans+=maxlen[i]-maxlen[link[i]];
        printf("%lld\n",ans); 
    }
}SAM;
inline void dfs(Re x,Re fa,Re fap){//traverse the tree to build the Trie
    Re xp=T1.insert(fap,co[x]);//remember x's position in the Trie so the children can use it directly
    for(Re i=head[x],to;i;i=a[i].next)
        if((to=a[i].to)!=fa)dfs(to,x,xp);
}
int main(){
//  freopen("123.txt","r",stdin);
    in(n),in(C),m=n-1;
    for(Re i=1;i<=n;++i)in(co[i]);
    while(m--)in(x),in(y),add(x,y),add(y,x),++du[x],++du[y];
    for(Re i=1;i<=n;++i)if(du[i]==1)dfs(i,0,1);//insert the tree into the Trie once per leaf, with that leaf as the root
    SAM.build(),SAM.sakura();
}

[Online]

#include<algorithm>
#include<cstring>
#include<cstdio>
#include<queue>
#define Re int //the register keyword was removed in C++17, so plain int is used
#define LL long long
using namespace std;
const int N=4e6+5,N20=2e6+3,Nn=1e5+3;
int n,m,o,x,y,t,C,du[Nn],co[Nn],head[Nn];LL ans;
struct QAQ{int to,next;}a[Nn<<1];
inline void add(Re x,Re y){a[++o].to=y,a[o].next=head[x],head[x]=o;}
inline void in(Re &x){
    int fu=0;x=0;char c=getchar();
    while(c<'0'||c>'9')fu|=c=='-',c=getchar();
    while(c>='0'&&c<='9')x=(x<<1)+(x<<3)+(c^48),c=getchar();
    x=fu?-x:x;
}
struct Suffix_Automaton{
    int O,link[N],trans[N][10],maxlen[N];
    //link[i]: suffix link
    //trans[i]: transition array
    Suffix_Automaton(){O=1;}//the root is state 1
    inline int insert(Re ch,Re last){
        if(trans[last][ch]&&maxlen[last]+1==maxlen[trans[last][ch]])return trans[last][ch];//the required node already exists
        Re x,y,z=++O,p=last,flag=0;maxlen[z]=maxlen[last]+1;
        while(p&&!trans[p][ch])trans[p][ch]=z,p=link[p];
        if(!p)link[z]=1;
        else{
            x=trans[p][ch];
            if(maxlen[p]+1==maxlen[x])link[z]=x;
            else{
                if(maxlen[p]+1==maxlen[z]/*or equivalently: p==last*/)flag=1;
                y=++O;maxlen[y]=maxlen[p]+1;
                for(Re i=0;i<C;++i)trans[y][i]=trans[x][i];
                while(p&&trans[p][ch]==x)trans[p][ch]=y,p=link[p];
                link[y]=link[x],link[z]=link[x]=y;
            }
        }
        return flag?y:z;
    }
    inline void sakura(){
        for(Re i=2;i<=O;++i)ans+=maxlen[i]-maxlen[link[i]];
        printf("%lld\n",ans); 
    }
}SAM;
inline void dfs(Re x,Re fa,Re fap){//traverse the tree and build the SAM online
    Re xp=SAM.insert(co[x],fap);//remember x's state in the SAM so the children can use it directly
    for(Re i=head[x],to;i;i=a[i].next)
        if((to=a[i].to)!=fa)dfs(to,x,xp);
}
int main(){
//  freopen("123.txt","r",stdin);
    in(n),in(C),m=n-1;
    for(Re i=1;i<=n;++i)in(co[i]);
    while(m--)in(x),in(y),add(x,y),add(y,x),++du[x],++du[y];
    for(Re i=1;i<=n;++i)if(du[i]==1)dfs(i,0,1);//run the online construction once per leaf, with that leaf as the root
    SAM.sakura();
}

VI. [References]

Origin www.cnblogs.com/Xing-Ling/p/12038349.html