Suffix array and its applications

This post briefly describes the doubling algorithm, which builds a suffix array in \(O(n \log n)\) time


Definition

Let \(S\) be a string of length \(n\). The suffix \(suf[i]\) denotes \(S[i \sim n]\), the part of \(S\) from position \(i\) to the end;

Sort all suffixes of \(S\) in lexicographic order:

  • \(rk[i]\) (\(rank[i]\)) is the rank of \(suf[i]\);
  • \(sa[i]\) is the start position of the suffix ranked \(i\), i.e. the \(i\)-th smallest suffix is \(suf[sa[i]]\);
  • \(height[i]\) is the length of the longest common prefix \((LCP)\) of the suffixes ranked \(i\) and \(i-1\);
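
To make the three arrays concrete, here is a minimal brute-force sketch that computes them straight from the definitions. The names `NaiveSuffixData` and `build_naive` are made up for illustration and are not from the post; the arrays are 1-indexed to match the definitions, and the sorting is deliberately naive.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical container (not from the post). Index 0 of each vector is unused.
struct NaiveSuffixData {
    std::vector<int> sa, rk, height;
};

NaiveSuffixData build_naive(const std::string& s) {
    const int n = (int)s.size();
    NaiveSuffixData d;
    d.sa.assign(n + 1, 0);
    d.rk.assign(n + 1, 0);
    d.height.assign(n + 1, 0);
    for (int i = 1; i <= n; ++i) d.sa[i] = i;
    // Sort the start positions by comparing the suffixes character by character.
    std::sort(d.sa.begin() + 1, d.sa.end(), [&](int a, int b) {
        return s.compare(a - 1, std::string::npos, s, b - 1, std::string::npos) < 0;
    });
    // rk is the inverse permutation of sa.
    for (int i = 1; i <= n; ++i) d.rk[d.sa[i]] = i;
    // height[i] = LCP of the suffixes ranked i and i-1; height[1] stays 0.
    for (int i = 2; i <= n; ++i) {
        int a = d.sa[i] - 1, b = d.sa[i - 1] - 1, k = 0;
        while (a + k < n && b + k < n && s[a + k] == s[b + k]) ++k;
        d.height[i] = k;
    }
    return d;
}
```

For `"banana"` the sorted suffixes are a, ana, anana, banana, na, nana, so `sa[1..6]` is 6, 4, 2, 1, 5, 3 and `height[1..6]` is 0, 1, 3, 0, 0, 2.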

How the algorithm works

Computing the \(sa\) array

Once the \(sa\) array is known, \(rk\) follows directly, since \(rk[sa[i]] = i\);

The first idea that comes to mind is quicksort: with character-by-character comparison the running time is \(O(n^2 \log n)\);

An optimization is to replace the character-by-character comparison with \(Hash\) plus binary search, which brings the time down to \(O(n \log^2 n)\);
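
As a rough sketch of the hash plus binary search idea: a polynomial hash gives the hash of any substring in \(O(1)\), binary search on the length finds the LCP of two suffixes in \(O(\log n)\), and the first differing character decides the comparison. The function name `build_sa_hash`, the base 131, and the choice to let the hash wrap modulo \(2^{64}\) are all assumptions made here, not details from the post; a wrapping hash can collide on adversarial input.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper (not from the post). Returns sa[1..n]; sa[0] is unused.
std::vector<int> build_sa_hash(const std::string& s) {
    const int n = (int)s.size();
    const std::uint64_t B = 131;  // assumed base; arithmetic wraps modulo 2^64
    std::vector<std::uint64_t> h(n + 1, 0), pw(n + 1, 1);
    for (int i = 0; i < n; ++i) {
        h[i + 1] = h[i] * B + (unsigned char)s[i] + 1;
        pw[i + 1] = pw[i] * B;
    }
    // Hash of the len characters starting at 0-based offset p.
    auto get = [&](int p, int len) { return h[p + len] - h[p] * pw[len]; };
    // LCP of the suffixes at 0-based offsets a and b: binary search on the length.
    auto lcp = [&](int a, int b) -> int {
        int lo = 0, hi = n - std::max(a, b);
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (get(a, mid) == get(b, mid)) lo = mid; else hi = mid - 1;
        }
        return lo;
    };
    std::vector<int> sa(n + 1, 0);
    std::iota(sa.begin() + 1, sa.end(), 1);
    std::sort(sa.begin() + 1, sa.end(), [&](int x, int y) -> bool {
        if (x == y) return false;
        int a = x - 1, b = y - 1, l = lcp(a, b);
        if (a + l == n) return true;   // suf[x] is a proper prefix of suf[y]
        if (b + l == n) return false;  // suf[y] is a proper prefix of suf[x]
        return s[a + l] < s[b + l];    // first differing character decides
    });
    return sa;
}
```

The sort makes \(O(n \log n)\) comparisons and each one costs \(O(\log n)\), which is where the \(O(n \log^2 n)\) bound comes from.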

Another idea:

Each suffix can be treated as a sequence of \(ASCII\) codes and sorted with radix sort, which takes \(O(n^2)\) time;

A way to optimize this:

The strings being sorted are all suffixes of \(S\), so they share a lot of common content;

This leads to the following algorithm:

1. Sort the suffixes by their first character, and take the resulting rank as the first key of each suffix;

2. Sort the suffixes by their first two (1 + 1) characters: the first two characters \(S[i], S[i+1]\) of \(suf[i]\) are sorted as a unit. Step 1 already ranked every substring of length 1, so this is the same as sorting the pairs formed by merging those ranks. The resulting rank becomes the new first key of each suffix;

3. Sort the suffixes by their first four (2 + 2) characters, merging the ranks from step 2 in the same way, and take the resulting rank as the new first key of each suffix;

4. Continue with the first eight (4 + 4) characters of each suffix,

······

until all ranks are distinct;
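
The steps above can be sketched as follows. This is only an illustration of the key-merging idea: `build_sa_doubling` is a made-up name, and `std::sort` stands in for the radix sort, so this version runs in \(O(n \log^2 n)\) rather than \(O(n \log n)\).

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper (not from the post). Returns sa[1..n]; sa[0] is unused.
std::vector<int> build_sa_doubling(const std::string& s) {
    const int n = (int)s.size();
    // rk is oversized so that rk[i + k] reads 0 for positions past the end.
    std::vector<int> sa(n + 1, 0), rk(2 * n + 2, 0), tmp(n + 1, 0);
    std::iota(sa.begin() + 1, sa.end(), 1);
    // Step 1: a length-1 substring is ranked by its character code.
    for (int i = 1; i <= n; ++i) rk[i] = (unsigned char)s[i - 1] + 1;
    for (int k = 1;; k <<= 1) {
        // Compare by (rank of the first k characters, rank of the next k).
        auto less = [&](int a, int b) -> bool {
            if (rk[a] != rk[b]) return rk[a] < rk[b];
            return rk[a + k] < rk[b + k];
        };
        std::sort(sa.begin() + 1, sa.end(), less);
        // Re-rank: suffixes with equal key pairs share a rank.
        int cnt = 0;
        for (int i = 1; i <= n; ++i) {
            if (i == 1 || less(sa[i - 1], sa[i])) ++cnt;
            tmp[sa[i]] = cnt;
        }
        for (int i = 1; i <= n; ++i) rk[i] = tmp[i];
        if (cnt == n) break;  // all ranks are distinct
    }
    return sa;
}
```

A position past the end of the string has rank 0, so a suffix that runs out of characters sorts before any longer suffix with the same first key.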

Drawing the process out makes it easier to follow: it is a repeated cycle of sort, merge, sort again. Each round reuses the order already established in the previous round, and every character of every suffix is eventually taken into account with none skipped, so the result is the same as a brute-force radix sort of the suffixes;

Because the merged length doubles each round, about \(\log_2 n\) rounds of sorting are enough, and each round is a radix sort on \(2\) keys in \(O(n)\) time, so the total time complexity stays at \(O(n \log n)\);

Code for this part

#include<bits/stdc++.h>
#define ll long long 
#define mp make_pair
using namespace std;

const int N=100005;
char c[N];
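int n,m=127;// m is the initial size of the value range (ASCII)
int tub[N];// buckets
int fk[N<<1],psk[N<<1];// Note: fk (first key) is the first key of each position; sized 2N because index sa[i]+k can reach 2n
// while psk (pos of second key) is the position whose second key has rank i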
int sa[N],height[N],rk[N];
//ll ans;

inline int read()
{
    int x=0,fl=1;char st=getchar();
    while(st<'0'||st>'9'){ if(st=='-')fl=-1; st=getchar();}
    while(st>='0'&&st<='9') x=x*10+st-'0',st=getchar();
    return x*fl;
}

inline void rsort()// radix sort
{
    for(int i=1;i<=m;i++) tub[i]=0;// clear the buckets
    for(int i=1;i<=n;i++) ++tub[fk[i]];// count each first key
    for(int i=1;i<=m;i++) tub[i]+=tub[i-1];// prefix sums give the last slot for each first key
    for(int i=n;i>=1;i--) sa[tub[fk[psk[i]]]--]=psk[i];// go from back to front in second-key order to fill the current sa
}

inline void SA()
{
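    for(int i=1;i<=n;i++) fk[i]=c[i],psk[i]=i; // initial keys: the character itself
    rsort();// first sort
    for(int k=1;k<=n;k<<=1)// length being merged
    {
        int cnt=0;
        for(int i=n-k+1;i<=n;i++) psk[++cnt]=i;// these positions have nothing to merge with, so their second key is the smallest
        for(int i=1;i<=n;i++)
            if(sa[i]>k) psk[++cnt]=sa[i]-k; // sa[i] is the second half of the block that starts at sa[i]-k
        // so the second keys are appended in sorted order
        rsort();
        swap(fk,psk);// stash the old fk in psk so the next first keys can be computed from it
        fk[sa[1]]=1;
        cnt=1;
        for(int i=2;i<=n;i++) // equal first and second keys mean equal rank; otherwise the rank increases by 1
            fk[sa[i]]=(psk[sa[i]]==psk[sa[i-1]]&&psk[sa[i]+k]==psk[sa[i-1]+k])?cnt:++cnt;
        if(cnt==n) break;// all ranks are already distinct, so we can stop
        m=cnt;// update the value range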
    }
}

int main()
{
    n=read();
    scanf("%s",c+1);
    SA();
//  get_height();
//  ans=(ll) n*(n+1)/2;
//  for(int i=1;i<=n;i++)
//      ans-=height[rk[i]];
//  printf("%lld",ans);
    return 0;
}

Computing the \(height\) array

It relies mainly on one property:

\(height[rk[i]] ≥ height[rk[i-1]]-1\)

Proof sketch:

Suppose the suffix ranked immediately before \(suf[i-1]\) is \(suf[k]\); their longest common prefix has length \(height[rk[i-1]]\);

If \(height[rk[i-1]] ≤ 1\) the inequality holds trivially, so assume it is greater than \(1\). \(suf[i]\) is \(suf[i-1]\) without its first character, and \(suf[k+1]\) is \(suf[k]\) without its first character, so the longest common prefix of \(suf[i]\) and \(suf[k+1]\) has length \(height[rk[i-1]]-1\), and \(suf[k+1]\) is still ranked before \(suf[i]\);

However, \(suf[i]\) and \(suf[k+1]\) are not necessarily adjacent in sorted order (i.e. \(height[rk[i]]\) is not necessarily their longest common prefix),

But even if they are not adjacent, every suffix ranked between \(suf[k+1]\) and \(suf[i]\) shares with \(suf[i]\) a common prefix of length at least \(height[rk[i-1]]-1\);

So \(height[rk[i]] ≥ height[rk[i-1]]-1\) must hold;

This gives an \(O(n)\) implementation

inline void get_height()
{
    int k=0;
    for(int i=1;i<=n;++i) rk[sa[i]]=i;
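    for(int i=1;i<=n;++i)  // i is a position, not a rank
    {
        if(k)--k;// height[rk[i]] >= height[rk[i-1]]-1, so restart from k-1
        int j=sa[rk[i]-1];// position of the suffix ranked just before suf[i] (j=sa[0]=0 when rk[i]==1, and c[0] is '\0')
        while(j+k<=n&&i+k<=n&&c[i+k]==c[j+k])++k;// keep matching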
        height[rk[i]]=k;
    }
}

Some applications

\(LCP\) of any two suffixes

For \(suf[i]\) and \(suf[j]\) with \(rk[i] < rk[j]\), the \(LCP\) length is \(\min\limits_{k=rk[i]+1}^{rk[j]} height[k]\);

This turns the question into a range minimum query (RMQ);
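
One common way to answer such range minimum queries is a sparse table over \(height\); the post does not say which structure it has in mind, so treat the following as one possible sketch. The struct name `LcpQuery` is made up, and the suffix array inside it is built by brute-force sorting only to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the post). All arrays are 1-indexed.
struct LcpQuery {
    int n;
    std::vector<int> sa, rk, height;
    std::vector<std::vector<int>> st;  // st[j][i] = min(height[i .. i + 2^j - 1])

    explicit LcpQuery(const std::string& s) : n((int)s.size()) {
        sa.assign(n + 1, 0); rk.assign(n + 1, 0); height.assign(n + 1, 0);
        for (int i = 1; i <= n; ++i) sa[i] = i;
        // Brute-force sort; any suffix array construction could be used instead.
        std::sort(sa.begin() + 1, sa.end(), [&](int a, int b) {
            return s.compare(a - 1, std::string::npos, s, b - 1, std::string::npos) < 0;
        });
        for (int i = 1; i <= n; ++i) rk[sa[i]] = i;
        for (int i = 2; i <= n; ++i) {
            int a = sa[i] - 1, b = sa[i - 1] - 1, k = 0;
            while (a + k < n && b + k < n && s[a + k] == s[b + k]) ++k;
            height[i] = k;
        }
        // Sparse table: level j is built from two halves of level j-1.
        st.push_back(height);
        for (int j = 1; (1 << j) <= n; ++j) {
            st.push_back(std::vector<int>(n + 1, 0));
            for (int i = 1; i + (1 << j) - 1 <= n; ++i)
                st[j][i] = std::min(st[j - 1][i], st[j - 1][i + (1 << (j - 1))]);
        }
    }

    // LCP length of suf[i] and suf[j], given as 1-indexed start positions.
    int lcp(int i, int j) const {
        assert(1 <= i && i <= n && 1 <= j && j <= n);
        if (i == j) return n - i + 1;  // a suffix compared with itself
        int l = rk[i], r = rk[j];
        if (l > r) std::swap(l, r);
        ++l;                            // the range is height[rk+1 .. rk']
        int lg = 0;
        while ((1 << (lg + 1)) <= r - l + 1) ++lg;
        // Two overlapping blocks of length 2^lg cover [l, r].
        return std::min(st[lg][l], st[lg][r - (1 << lg) + 1]);
    }
};
```

With `LcpQuery q("banana")`, `q.lcp(2, 4)` compares "anana" with "ana" and returns 3.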

Number of distinct substrings

The total number of substrings of \(S\) is \(n \times (n+1)/2\);

Every substring of \(S\) is a prefix of some suffix;

Among these, \(\sum height[i]\) are repeats;

So we have:

\[ans=\dfrac{n\times (n+1)}{2}-\sum height[i]\]
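
A small sketch of this formula in code. The function name `count_distinct_substrings` is made up, and the suffix order is obtained by brute-force sorting to keep the example short; with the post's \(sa\) and \(height\) arrays the final loop would simply subtract `height[i]`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not from the post).
long long count_distinct_substrings(const std::string& s) {
    const int n = (int)s.size();
    std::vector<int> sa(n + 1, 0);  // 1-indexed; sa[0] is unused
    for (int i = 1; i <= n; ++i) sa[i] = i;
    // Brute-force sort, used here only for brevity.
    std::sort(sa.begin() + 1, sa.end(), [&](int a, int b) {
        return s.compare(a - 1, std::string::npos, s, b - 1, std::string::npos) < 0;
    });
    long long ans = 1LL * n * (n + 1) / 2;  // all substrings, repeats included
    for (int i = 2; i <= n; ++i) {
        // h plays the role of height[i]: LCP of the suffixes ranked i and i-1.
        int a = sa[i] - 1, b = sa[i - 1] - 1, h = 0;
        while (a + h < n && b + h < n && s[a + h] == s[b + h]) ++h;
        ans -= h;  // these h prefixes already appeared under the previous suffix
    }
    assert(ans >= 0);
    return ans;
}
```

For `"banana"` this gives \(21 - (1 + 3 + 0 + 0 + 2) = 15\).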

Longest common substring of two strings

For the case where \(dp\) would time out;

A simple approach is:

Concatenate the two strings into one, separated by a character that appears in neither of them (e.g. %), and build the suffix array of the new string;

Take the largest \(height[i]\) for which \(sa[i]\) and \(sa[i-1]\) lie in different original strings;

This approach extends to the longest common substring of more than two strings, at a higher time cost;
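
A minimal sketch of the two-string case. The name `longest_common_substring` and its signature are made up for illustration, the separator defaults to `%` as in the example above and is assumed to occur in neither input, and the suffix order again comes from a brute-force sort rather than the doubling code.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not from the post). Returns the length only.
int longest_common_substring(const std::string& a, const std::string& b, char sep = '%') {
    // Assumption: sep occurs in neither input.
    assert(a.find(sep) == std::string::npos && b.find(sep) == std::string::npos);
    const std::string s = a + sep + b;
    const int n = (int)s.size(), la = (int)a.size();
    std::vector<int> sa(n + 1, 0);  // 1-indexed; sa[0] is unused
    for (int i = 1; i <= n; ++i) sa[i] = i;
    std::sort(sa.begin() + 1, sa.end(), [&](int x, int y) {
        return s.compare(x - 1, std::string::npos, s, y - 1, std::string::npos) < 0;
    });
    // Which original string does 1-indexed position p belong to?
    // 0: a, 1: b, -1: the separator itself.
    auto side = [&](int p) { return p <= la ? 0 : (p == la + 1 ? -1 : 1); };
    int best = 0;
    for (int i = 2; i <= n; ++i) {
        int x = side(sa[i]), y = side(sa[i - 1]);
        if (x < 0 || y < 0 || x == y) continue;  // need one suffix from each string
        // h plays the role of height[i].
        int p = sa[i] - 1, q = sa[i - 1] - 1, h = 0;
        while (p + h < n && q + h < n && s[p + h] == s[q + h]) ++h;
        best = std::max(best, h);
    }
    return best;
}
```

Because the separator occurs only once, a common prefix of a suffix from the first string and a suffix from the second can never extend across it.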

Longest substring that appears at least \(k\) times

Over every window of \(k-1\) consecutive entries of \(height[]\), take the minimum, then take the maximum of these minima;
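
The window minima can be maintained with a monotone deque, which is one standard way to do it; the post does not specify a method. In this sketch the name `longest_repeated_at_least` is made up, \(k = 1\) is handled as a special case (the whole string qualifies), and the suffix order comes from a brute-force sort for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical helper (not from the post). Returns the length only.
int longest_repeated_at_least(const std::string& s, int k) {
    assert(k >= 1);
    const int n = (int)s.size();
    if (k == 1) return n;  // the whole string appears once
    if (k > n) return 0;   // fewer than k suffixes exist
    std::vector<int> sa(n + 1, 0), height(n + 1, 0);  // 1-indexed
    for (int i = 1; i <= n; ++i) sa[i] = i;
    std::sort(sa.begin() + 1, sa.end(), [&](int a, int b) {
        return s.compare(a - 1, std::string::npos, s, b - 1, std::string::npos) < 0;
    });
    for (int i = 2; i <= n; ++i) {
        int a = sa[i] - 1, b = sa[i - 1] - 1, h = 0;
        while (a + h < n && b + h < n && s[a + h] == s[b + h]) ++h;
        height[i] = h;
    }
    const int w = k - 1;  // k adjacent suffixes span k-1 height entries
    std::deque<int> dq;   // indices into height; their values increase front to back
    int best = 0;
    for (int i = 2; i <= n; ++i) {
        while (!dq.empty() && height[dq.back()] >= height[i]) dq.pop_back();
        dq.push_back(i);
        if (dq.front() <= i - w) dq.pop_front();  // index fell out of the window
        // The window height[i-w+1 .. i] is valid once it starts at index 2 or later.
        if (i - 1 >= w) best = std::max(best, height[dq.front()]);
    }
    return best;
}
```

For `"banana"`, \(k = 2\) gives 3 ("ana") and \(k = 3\) gives 1 ("a").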

Origin www.cnblogs.com/yudes/p/SA.html