"Suffix sorting SA" study notes

  • Foreword

    A moment to ask for likes

    Main reference articles: this one and this one

    If anything here still infringes on someone's work, please send me a private message and I will add the attribution right away qaq. (This topic is so painful that I kept digging through reference material.)

    So... which genius actually came up with something this mind-bending, magical, and painful?


  • What is suffix sorting

    Take Luogu P3809 [Template] Suffix Sorting (SA) as the example.

    Read a string of length \(n\) consisting of uppercase letters, lowercase letters, or digits. Sort all of its non-empty suffixes in lexicographic order from smallest to largest, then output, in that order, the position in the original string of each suffix's first character. Positions are numbered from \(1\) to \(n\).

    Here \(n \le 10^6\).

    Sorting the suffixes directly has complexity \(O(n^2 \log_2 n)\), because each of the \(O(n \log_2 n)\) comparisons can take \(O(n)\) time.

    SA is of course faster than that: \(O(n \log_2 n)\).
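
    For comparison, the direct approach can be sketched as follows. This is only a baseline for checking small inputs, not part of the article's solution; the function name `naive_suffix_order` is made up for this sketch.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper, not from the article: orders suffix start positions
// (1-based, as in the problem) by comparing the suffixes directly.
std::vector<int> naive_suffix_order(const std::string& s) {
    std::vector<int> order(s.size());
    std::iota(order.begin(), order.end(), 0);  // 0-based starts while sorting
    std::sort(order.begin(), order.end(), [&s](int a, int b) {
        // Each call may scan up to n characters, hence the extra factor of n.
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    for (int& start : order) ++start;  // convert to 1-based positions
    return order;
}
```

    For the string "ababa" this returns 5 3 1 4 2. It is fine for testing, but far too slow for \(n\) near \(10^6\).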

    One more note: this problem does not ask for the LCP (longest common prefix), but it is apparently useful, so it is covered later as well.


  • The idea behind suffix sorting

    It borrows the idea of radix sort.

    Radix sort is a kind of "distribution sort", also known as "bucket sort" or bin sort. As the name suggests, it uses part of each key to distribute the elements to be sorted into "buckets", and thereby sorts them.
    ---- Baidu Encyclopedia

    In fact the definition above is not important; suffix sorting simply uses the idea of radix sort.
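
    The part of radix sort that matters here is sorting by two keys with two stable bucket passes: first by the second key, then by the first. A minimal sketch on plain integer pairs, assuming keys lie in [0, max_key]; the names `Pair`, `counting_sort_by` and `radix_sort_pairs` are invented for this illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;  // (first keyword, second keyword)

// Stable counting sort by one key; key_of selects which key to use.
template <class KeyOf>
std::vector<Pair> counting_sort_by(const std::vector<Pair>& items, int max_key,
                                   KeyOf key_of) {
    std::vector<std::size_t> c(max_key + 2, 0);
    for (const Pair& it : items) ++c[key_of(it) + 1];
    // After the prefix sums, c[v] is the first output slot for key v.
    for (int v = 1; v <= max_key + 1; ++v) c[v] += c[v - 1];
    std::vector<Pair> out(items.size());
    // Scanning forward keeps items with equal keys in their input order.
    for (const Pair& it : items) out[c[key_of(it)]++] = it;
    return out;
}

// Sort by (first, second): least significant key first, then the most
// significant key with a stable pass.
std::vector<Pair> radix_sort_pairs(std::vector<Pair> items, int max_key) {
    items = counting_sort_by(items, max_key,
                             [](const Pair& p) { return p.second; });
    items = counting_sort_by(items, max_key,
                             [](const Pair& p) { return p.first; });
    return items;
}
```

    Because the second pass is stable, items with the same first key stay ordered by their second key, which is exactly what the suffix sort relies on.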

    Comparing in lexicographic order, character by character, certainly starts with the first character. So we can first sort the suffixes by their first character, which amounts to sorting the individual characters of the string.

    Unless all characters are distinct, some suffixes will tie on the first character, and to order those we have to compare the second character as well.

    At first glance the complexity then goes up.

    But in fact the second characters have already been compared!

    Because we are sorting suffixes, the second character of suffix \(i\) is the first character of suffix \(i+1\), and that has already been sorted. The first character just serves as the first keyword and the second character as the second keyword.

    That gives the ranking by the first two characters.

    In the same way, the doubling idea lets us reuse the known ranking by the first two characters to get the ranking by the first four, then the first eight, and so on.

    It is not hard to see that once all suffixes have pairwise distinct ranks, the problem is solved.

    Grasping this in the abstract is really hard, so it helps to follow it on a diagram or a small example.

    Each suffix is numbered, as in the problem statement, by the position of its first letter.

    \(s[]\) denotes the string.

    \(sa[i]\) is the number of the suffix whose rank is \(i\).
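
    To see the doubling idea on its own, here is a sketch that sorts (first keyword, second keyword) pairs with std::sort instead of radix sort. That swap makes it \(O(n \log^2 n)\), so it illustrates the idea rather than the article's final method; `doubling_sa` is a made-up name, and the arrays are 1-based with index 0 unused.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Prefix doubling with std::sort (NOT the radix-sort version).
// Returns sa with sa[i] = number of the suffix ranked i; sa[0] is unused.
std::vector<int> doubling_sa(const std::string& str) {
    const int n = static_cast<int>(str.size());
    std::vector<int> sa(n + 1), x(n + 1, 0), nx(n + 1, 0);
    std::iota(sa.begin(), sa.end(), 0);
    if (n == 0) return sa;
    for (int i = 1; i <= n; ++i) x[i] = static_cast<unsigned char>(str[i - 1]);
    for (int k = 1;; k <<= 1) {
        // Second keyword of suffix i: rank of suffix i+k, or 0 if it is empty.
        auto second = [&](int i) { return i + k <= n ? x[i + k] : 0; };
        auto less = [&](int a, int b) {
            if (x[a] != x[b]) return x[a] < x[b];
            return second(a) < second(b);
        };
        std::sort(sa.begin() + 1, sa.end(), less);
        // Re-rank: equal (first, second) pairs share a rank.
        nx[sa[1]] = 1;
        for (int i = 2; i <= n; ++i)
            nx[sa[i]] = nx[sa[i - 1]] + (less(sa[i - 1], sa[i]) ? 1 : 0);
        x = nx;
        if (x[sa[n]] == n) break;  // all ranks distinct: finished
    }
    return sa;
}
```

    For "ababa", the ranks after the first round (first two characters) are \(x[1..5] = 2, 3, 2, 3, 1\); the second round (first four characters) separates everything and gives \(sa = 5, 3, 1, 4, 2\).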

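    What comes next is even more painful.


  • Understanding the suffix sorting code

    I'll go through it block by block. Counting the output loop, I wrote 12 for loops in total.

    First, sort by the first character to determine the initial \(sa[]\).

    We use \(x[i]\) for the first keyword.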

      for(int i=1;i<=n;i++){x[i]=s[i];++c[x[i]];} // count each character
      for(int i=2;i<=m;i++)c[i]+=c[i-1];          // prefix sums
      for(int i=n;i>=1;i--)sa[c[x[i]]--]=i;       // place suffixes back to front

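    First copy \(s[i]\) straight into \(x[i]\). \(c[]\) is the bucket array: this is the bucket-sort idea, counting how many times each character occurs.

    To make assigning positions easy, take a prefix sum. The lexicographically larger a character is, the larger its \(c[]\) value.

    Next comes the sort that determines the initial \(sa[]\). Once more: \(sa[i]\) is the number of the suffix whose rank is \(i\).

    As for why \(c[x[i]]\) is decremented: when a character occurs more than once, the decrement makes sure that suffixes with the same first character still get different positions.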
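
    The first pass can be tried in isolation. The sketch below mirrors the three loops with vectors; `first_char_order` is a hypothetical name, and the default m = 122 (the code of 'z') is taken from the article's main().

```cpp
#include <cassert>
#include <string>
#include <vector>

// First pass only: order suffixes by their first character.
// 1-based result (sa[0] unused); ties end up in increasing start position.
std::vector<int> first_char_order(const std::string& str, int m = 122) {
    const int n = static_cast<int>(str.size());
    std::vector<int> c(m + 1, 0), sa(n + 1, 0);
    for (int i = 1; i <= n; ++i) {
        const int ch = static_cast<unsigned char>(str[i - 1]);
        assert(1 <= ch && ch <= m);  // every character must fit in a bucket
        ++c[ch];
    }
    // c[v] becomes the last slot reserved for character v.
    for (int v = 1; v <= m; ++v) c[v] += c[v - 1];
    // Scan from the back and decrement, so equal characters fill their
    // block from the end and keep their original relative order.
    for (int i = n; i >= 1; --i)
        sa[c[static_cast<unsigned char>(str[i - 1])]--] = i;
    return sa;
}
```

    For "ababa" the three 'a' suffixes take slots 1 to 3 and the two 'b' suffixes take slots 4 to 5, giving \(sa = 1, 3, 5, 2, 4\).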

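    The next block determines, step by step, the ranking by the first two characters, the first four, and so on.

    That is, it uses the doubling idea:

    for(int k=1;k<=n;k=k<<1)

    Then just hack away inside this loop.

    First, define \(y[i]\) as the number of the suffix whose second keyword has rank \(i\); it is what fixes the order in which \(x[]\) is visited.

    Then determine \(y[]\) from the \(sa[]\) of the previous round.

    int num=0;
    for(int i=n-k+1;i<=n;++i)y[++num]=i;
    for(int i=1;i<=n;i++)
        if(sa[i]>k)y[++num]=sa[i]-k;

    \(num\) is just a write pointer.

    First, the last \(k\) positions, i.e. positions \(n-k+1\) through \(n\), are effectively already sorted: the character \(k\) places after them does not exist, so their second keyword is empty and they come first. Store them in \(y[]\) directly.

    If the suffix ranked \(i\) can serve as the second keyword of another position, i.e. \(sa[i]>k\), then position \(sa[i]-k\) (whose first keyword is \(x[sa[i]-k]\)) is appended to \(y[]\).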
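
    A small sketch of just this step, taking the previous round's \(sa[]\) as input. The function name `order_by_second_key` is made up; arrays are 1-based with index 0 unused.

```cpp
#include <cassert>
#include <vector>

// Builds y[] from the previous sa[]: suffix numbers ordered by second keyword.
std::vector<int> order_by_second_key(const std::vector<int>& sa, int k) {
    const int n = static_cast<int>(sa.size()) - 1;
    assert(1 <= k && k <= n);
    std::vector<int> y(n + 1, 0);
    int num = 0;
    // Suffixes starting in the last k positions have an empty second keyword.
    for (int i = n - k + 1; i <= n; ++i) y[++num] = i;
    // Walking sa[] in rank order lists second keywords from small to large;
    // suffix sa[i] is the second keyword of suffix sa[i]-k.
    for (int i = 1; i <= n; ++i)
        if (sa[i] > k) y[++num] = sa[i] - k;
    return y;
}
```

    For "ababa" with \(k=1\) and \(sa = 1, 3, 5, 2, 4\) from the first pass, this yields \(y = 5, 2, 4, 1, 3\): suffix 5 has no second character, suffixes 2 and 4 are followed by 'a', and suffixes 1 and 3 are followed by 'b'.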

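    With the first and second keywords determined, \(sa[]\) can be updated, and the update looks a lot like the opening one.

      for(int i=1;i<=m;i++)c[i]=0;
      for(int i=1;i<=n;i++)c[x[i]]++;
      for(int i=2;i<=m;i++)c[i]+=c[i-1];
      for(int i=n;i>=1;i--){sa[c[x[y[i]]]--]=y[i];y[i]=0;}

    First clear the buckets, then count, then take prefix sums.

    The only change is that sa[c[x[i]]--]=i; becomes sa[c[x[y[i]]]--]=y[i];

    The reason is simple: at the start, \(x[]\) was visited in the order \(1\) to \(n\), whereas here the order is \(y[1]\) to \(y[n]\).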
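
    The effect of visiting suffixes in \(y[]\) order can be checked with a sketch of this pass alone. The name `sort_by_first_key` is hypothetical; x, y and the result are 1-based with index 0 unused, and m is the largest value in x.

```cpp
#include <cassert>
#include <vector>

// Stable counting sort by first keyword x[], visiting suffixes in y[] order,
// so that ties on x[] stay ordered by second keyword.
std::vector<int> sort_by_first_key(const std::vector<int>& x,
                                   const std::vector<int>& y, int m) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size()) - 1;
    std::vector<int> c(m + 1, 0), sa(n + 1, 0);
    for (int i = 1; i <= n; ++i) ++c[x[i]];
    for (int v = 1; v <= m; ++v) c[v] += c[v - 1];
    // Back to front: the suffix with the largest second keyword in a bucket
    // takes that bucket's last slot.
    for (int i = n; i >= 1; --i) sa[c[x[y[i]]]--] = y[i];
    return sa;
}
```

    For "ababa" with \(x = 97, 98, 97, 98, 97\) and \(y = 5, 2, 4, 1, 3\), the result is \(sa = 5, 1, 3, 2, 4\): ordered by the first two characters, with suffixes 1 and 3 ("ab") and suffixes 2 and 4 ("ba") still tied.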

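    Next we update \(x[]\). Obviously this needs the old \(x[]\) and \(sa[]\).

    I was too lazy to declare a new array, so the temporarily unused \(y[]\) holds the current \(x[]\).

    swap(x,y);

    Then update.

      num=1;x[sa[1]]=1;
      for(int i=2;i<=n;i++)
      {
          if(y[sa[i]]==y[sa[i-1]]&&y[sa[i]+k]==y[sa[i-1]+k])x[sa[i]]=num;
          else x[sa[i]]=++num;
      }
      if(num==n)break;
      m=num;//m is the number of distinct ranks, i.e. the bucket range for the next round

    Here \(i\) is a rank, so every suffix number has to be written as \(sa[i]\).

    When two suffixes agree on both keywords, their \(x[]\) values are the same; that is the first if branch.

    Otherwise the rank advances. Since we enumerate in rank order, it is simply \(num+1\).

    Once all \(x[i]\) are distinct, i.e. \(num=n\), the sort is finished.

    And that is this part done..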
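
    The re-ranking step can also be sketched on its own. One difference from the article's code is an assumption of this sketch: instead of reading \(y[sa[i]+k]\) directly and relying on zeros past position \(n\), it uses a bounds-checked helper. The name `rerank` is made up; arrays are 1-based.

```cpp
#include <cassert>
#include <vector>

// Recomputes ranks by the first 2k characters from sa[] (sorted by 2k chars)
// and old_x[] (ranks by the first k characters).
std::vector<int> rerank(const std::vector<int>& sa,
                        const std::vector<int>& old_x, int k) {
    assert(k >= 1 && sa.size() == old_x.size());
    const int n = static_cast<int>(sa.size()) - 1;
    std::vector<int> x(n + 1, 0);
    if (n == 0) return x;
    // Second keyword, with 0 standing for "no characters left".
    auto key2 = [&](int i) { return i + k <= n ? old_x[i + k] : 0; };
    int num = 1;
    x[sa[1]] = 1;
    for (int i = 2; i <= n; ++i) {
        const bool same = old_x[sa[i]] == old_x[sa[i - 1]] &&
                          key2(sa[i]) == key2(sa[i - 1]);
        x[sa[i]] = same ? num : ++num;  // equal pairs share a rank
    }
    return x;
}
```

    For "ababa" with \(k=1\), \(sa = 5, 1, 3, 2, 4\) and old ranks \(97, 98, 97, 98, 97\), the new ranks are \(x[1..5] = 2, 3, 2, 3, 1\). The largest rank is 3, which is less than \(n=5\), so another round is needed.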


  • Complete suffix sorting code (i.e. the complete code for P3809 [Template] Suffix Sorting (SA))

    #include<iostream>
    #include<cstdio>
    #include<cstring>
    #include<cmath>
    #include<utility>   // std::swap
    #include<algorithm> // std::min
    using namespace std;
    const int Maxn=1e6+5;
    char s[Maxn];
    int n,m,x[Maxn],y[Maxn],c[Maxn],sa[Maxn];
    int height[Maxn],rk[Maxn];
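    void SA()
    {
        // initial order: counting sort on the first character
        for(int i=1;i<=n;i++){x[i]=s[i];++c[x[i]];}
        for(int i=2;i<=m;i++)c[i]+=c[i-1];
        for(int i=n;i>=1;i--)sa[c[x[i]]--]=i;
        for(int k=1;k<=n;k=k<<1)
        {
            // y[]: suffixes ordered by second keyword (the k characters after the first k)
            int num=0;
            for(int i=n-k+1;i<=n;++i)y[++num]=i;
            for(int i=1;i<=n;i++)
                if(sa[i]>k)y[++num]=sa[i]-k;
            // stable counting sort by first keyword x[], visiting suffixes in y[] order
            for(int i=1;i<=m;i++)c[i]=0;
            for(int i=1;i<=n;i++)c[x[i]]++;
            for(int i=2;i<=m;i++)c[i]+=c[i-1];
            for(int i=n;i>=1;i--){sa[c[x[y[i]]]--]=y[i];y[i]=0;}
            // swap so that y[] holds the old ranks, then rebuild x[] as ranks by the first 2k characters
            swap(x,y);
            num=1;x[sa[1]]=1;
            for(int i=2;i<=n;i++)
            {
                if(y[sa[i]]==y[sa[i-1]]&&y[sa[i]+k]==y[sa[i-1]+k])x[sa[i]]=num;
                else x[sa[i]]=++num;
            }
            if(num==n)break; // all ranks distinct: done
            m=num;
        }
        for(int i=1;i<=n;i++)
            printf("%d ",sa[i]);
        printf("\n");
    }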
    int main()
    { 
        scanf("%s",s+1);
        n=strlen(s+1);
        m=122;
        SA();
        //LCP(); 
        return 0;
    }
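
    For reuse and testing it can be handy to have the same algorithm as a function that returns \(sa[]\) instead of printing it. The following is one possible vector-based variant, not the author's code: the name `build_sa`, the shift of character codes by one (so that 0 can mean "empty"), and the bounds-checked second keyword are all choices made for this sketch.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Radix-sort prefix doubling; returns 1-based sa (sa[0] unused).
std::vector<int> build_sa(const std::string& str) {
    const int n = static_cast<int>(str.size());
    std::vector<int> sa(n + 1, 0);
    if (n == 0) return sa;
    int m = 256;  // byte values shifted by one, so ranks lie in 1..256
    std::vector<int> x(n + 1, 0), y(n + 1, 0), c(std::max(m, n) + 1, 0);
    for (int i = 1; i <= n; ++i) {
        x[i] = static_cast<unsigned char>(str[i - 1]) + 1;
        ++c[x[i]];
    }
    for (int v = 1; v <= m; ++v) c[v] += c[v - 1];
    for (int i = n; i >= 1; --i) sa[c[x[i]]--] = i;
    for (int k = 1; k <= n; k <<= 1) {
        int num = 0;
        for (int i = n - k + 1; i <= n; ++i) y[++num] = i;  // empty 2nd key
        for (int i = 1; i <= n; ++i)
            if (sa[i] > k) y[++num] = sa[i] - k;
        std::fill(c.begin(), c.begin() + m + 1, 0);
        for (int i = 1; i <= n; ++i) ++c[x[i]];
        for (int v = 1; v <= m; ++v) c[v] += c[v - 1];
        for (int i = n; i >= 1; --i) sa[c[x[y[i]]]--] = y[i];
        x.swap(y);  // O(1); y now holds the old ranks, x is rebuilt below
        auto key2 = [&](int i) { return i + k <= n ? y[i + k] : 0; };
        num = 1;
        x[sa[1]] = 1;
        for (int i = 2; i <= n; ++i) {
            const bool same = y[sa[i]] == y[sa[i - 1]] &&
                              key2(sa[i]) == key2(sa[i - 1]);
            x[sa[i]] = same ? num : ++num;
        }
        if (num == n) break;  // all ranks distinct
        m = num;
    }
    return sa;
}
```

    Calling build_sa("ababa") gives \(sa = 5, 3, 1, 4, 2\), matching the sample output of P3809.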

  • LCP

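    Since this is supposedly useful, let's cover it while we're at it.

    LCP: longest common prefix. Computing it relies on the suffix sort.

    Define \(LCP(i,j)\) as the length of the longest common prefix of suffix \(sa[i]\) and suffix \(sa[j]\).

    Then comes a pile of theorems:

    1. \(LCP(i,j)=LCP(j,i)\)
    2. \(LCP(i,i)=len(sa[i])=n-sa[i]+1\)
    3. LCP Lemma: \(LCP(i,j)=\min(LCP(i,k),LCP(k,j))\ (1 \le i \le k \le j \le n)\)
    4. LCP Theorem: \(LCP(i,j)=\min(LCP(k,k-1))\ (1 \le i < k \le j \le n)\)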

    So how do we prove them?

    Theorems \(1\) and \(2\) are obvious.

    For the LCP Lemma and the LCP Theorem, I'm borrowing from other people's articles.

    1. LCP Lemma

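      Let \(p=\min(LCP(i,k),LCP(k,j))\), so \(LCP(i,k) \ge p\) and \(LCP(k,j) \ge p\).

      Let suffix \(sa[i]\) be \(u\), suffix \(sa[j]\) be \(v\), and suffix \(sa[k]\) be \(w\).

      Then \(u\) and \(w\) agree on their first \(p\) characters, and so do \(v\) and \(w\).

      Hence \(u\) and \(v\) agree on their first \(p\) characters, i.e. \(LCP(i,j) \ge p\).

      Suppose \(LCP(i,j)=q\) with \(q > p\).

      Then \(q \ge p+1\), so \(u[p+1] = v[p+1]\).

      Since \(p=\min(LCP(i,k),LCP(k,j))\), we have \(u[p+1] \ne w[p+1]\) or \(v[p+1] \ne w[p+1]\).

      Because \(i \le k \le j\), the suffixes are in sorted order \(u \le w \le v\), and since all three agree on the first \(p\) characters, \(u[p+1] \le w[p+1] \le v[p+1]\) with at least one inequality strict (a missing character counts as smaller than any real one). So \(u[p+1] \ne v[p+1]\), which contradicts the above.

      Therefore \(LCP(i,j) \le p\).

      In summary, \(LCP(i,j)=p=\min(LCP(i,k),LCP(k,j))\ (1 \le i \le k \le j \le n)\).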

    2. LCP Theorem

      Split \(i\)~\(j\) into \(i\)~\(i+1\) and \(i+1\)~\(j\).

      Then by the LCP Lemma, \(LCP(i,j)=\min(LCP(i,i+1),LCP(i+1,j))\).

      We can keep splitting \(i+1\)~\(j\) in the same way, so the theorem clearly holds.
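
    If the proof feels slippery, the LCP Theorem can at least be sanity-checked by brute force on small strings. The sketch below sorts the suffixes directly and compares every \(LCP(i,j)\) with the minimum over adjacent pairs; `common_prefix` and `lcp_theorem_holds` are names invented for this check, and ranks are 0-based inside the function.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Length of the longest common prefix of two strings.
int common_prefix(const std::string& a, const std::string& b) {
    std::size_t len = 0;
    while (len < a.size() && len < b.size() && a[len] == b[len]) ++len;
    return static_cast<int>(len);
}

// Brute-force check: for all ranks i < j, LCP(i, j) equals the minimum
// LCP of adjacent ranks (k-1, k) over i < k <= j.
bool lcp_theorem_holds(const std::string& s) {
    std::vector<std::string> suf;
    for (std::size_t i = 0; i < s.size(); ++i) suf.push_back(s.substr(i));
    std::sort(suf.begin(), suf.end());  // suf[r] is the suffix of rank r+1
    const int n = static_cast<int>(suf.size());
    for (int i = 0; i < n; ++i) {
        int running_min = -1;  // -1 means "no adjacent pair seen yet"
        for (int j = i + 1; j < n; ++j) {
            const int adjacent = common_prefix(suf[j - 1], suf[j]);
            running_min = running_min < 0 ? adjacent
                                          : std::min(running_min, adjacent);
            if (common_prefix(suf[i], suf[j]) != running_min) return false;
        }
    }
    return true;
}
```

    This is cubic-ish and only meant for short test strings such as "mississippi".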

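    OK, so the next question is how to actually compute it.

    Let \(rk[i]\) be the rank of the suffix numbered \(i\).

    Take care to distinguish it from \(sa[i]\); the two are related by \(sa[rk[i]]=i\).

    Let \(height[i]=LCP(i,i-1)\ (1<i\le n)\) and \(height[1]=0\).

    Since \(LCP(i,j)=\min(LCP(k,k-1))\ (1 \le i < k \le j \le n)\),

    we get \(LCP(i,j)=\min(height[k])\ (i < k \le j)\).

    Let \(h[i]=height[rk[i]]\).

    Since \(sa[rk[i]]=i\), and equivalently \(rk[sa[i]]=i\),

    we have \(height[i]=h[sa[i]]\).

    There is in fact a theorem here: \(h[i] \ge h[i-1]-1\)

    But I don't know how to prove it.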
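
    For what it's worth, a commonly given argument goes roughly as follows. This sketch is not from the original article; it only uses \(rk\), \(sa\), \(height\), \(h\) and the LCP Theorem as defined above, and the symbols \(t\) and \(j\) are introduced just for the sketch.

```latex
% Claim: h[i] >= h[i-1] - 1.
% If h[i-1] <= 1 there is nothing to prove, so let t = h[i-1] >= 2.
\begin{align*}
& j = sa[rk[i-1]-1]
  && \text{(the suffix ranked just before suffix } i-1\text{; they share } t \text{ characters)} \\
& \text{suffix } j+1 < \text{suffix } i
  && \text{(drop the equal first character from suffix } j < \text{suffix } i-1\text{)} \\
& LCP(rk[j+1], rk[i]) \ge t-1
  && \text{(the remaining } t-1 \text{ shared characters)} \\
& LCP(rk[j+1], rk[i]) = \min_{rk[j+1] < k \le rk[i]} height[k] \le height[rk[i]] = h[i]
  && \text{(LCP Theorem)} \\
& \Rightarrow\ h[i] \ge t-1 = h[i-1]-1
\end{align*}
```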

    This theorem can then be used to compute \(height[]\), and from that the LCP.

    void LCP()
    {
        int k=0; //k holds h[i]
        for(int i=1;i<=n;i++)rk[sa[i]]=i; //initialize rk[]
        for(int i=1;i<=n;i++)//enumerate suffix numbers i; rk[i] is the rank
        {
            if(rk[i]==1)continue; //height[1]=0
            if(k)k--; //h[i]>=h[i-1]-1, so start from k-1 and extend character by character
            int j=sa[rk[i]-1];//the suffix ranked just before suffix i
            while(i+k<=n&&j+k<=n&&s[i+k]==s[j+k])k++;//extend character by character
            height[rk[i]]=k;//h[i]=height[rk[i]]
        }
        for(int i=1;i<=n;i++)
            printf("%d ",height[i]);
        printf("\n");
    }

    Then the LCP can be computed.

    By \(LCP(i,j)=\min(height[k])\ (i < k \le j)\):

      int ans=inf;//compute LCP(x,y) for ranks x<y; inf is any value larger than n
      for(int i=x+1;i<=y;i++)
          ans=min(ans,height[i]);
      printf("%d\n",ans);
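
    Putting the two pieces together as functions, under the same 1-based conventions: the sketch below computes \(height[]\) from a given \(sa[]\) and answers one LCP query by scanning. `build_height` and `lcp_of_ranks` are hypothetical names, and resetting k when \(rk[i]=1\) is a small defensive choice of this sketch rather than something the article's code does.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// height[r] = LCP of the suffixes ranked r-1 and r (height[1] = 0).
// sa is 1-based: sa[r] = number of the suffix ranked r.
std::vector<int> build_height(const std::string& str,
                              const std::vector<int>& sa) {
    const int n = static_cast<int>(str.size());
    std::vector<int> rk(n + 1, 0), height(n + 1, 0);
    for (int r = 1; r <= n; ++r) rk[sa[r]] = r;
    int k = 0;  // current h[i]
    for (int i = 1; i <= n; ++i) {
        if (rk[i] == 1) { k = 0; continue; }
        if (k) --k;  // h[i] >= h[i-1] - 1
        const int j = sa[rk[i] - 1];  // suffix ranked just before suffix i
        while (i + k <= n && j + k <= n && str[i + k - 1] == str[j + k - 1])
            ++k;
        height[rk[i]] = k;
    }
    return height;
}

// LCP of the suffixes ranked a and b (a < b): min of height[a+1..b].
int lcp_of_ranks(const std::vector<int>& height, int a, int b) {
    assert(1 <= a && a < b && b < static_cast<int>(height.size()));
    int ans = height[a + 1];
    for (int r = a + 2; r <= b; ++r) ans = std::min(ans, height[r]);
    return ans;
}
```

    For "ababa" with \(sa = 5, 3, 1, 4, 2\), the heights are \(0, 1, 3, 0, 2\), so for example the suffixes ranked 2 and 3 ("aba" and "ababa") have LCP 3. Each query here costs time proportional to \(b-a\).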

\[\text{Since you've read this far, why not leave a like? (whispering)}\]

\[\text{Writing this nearly killed me, awsl}\]

\[\text{by Rainy7}\]


Origin www.cnblogs.com/Rainy7/p/12276010.html