Talking about the study and application of SAM

Preface: Apart from my early, immature attempts, this is the first summary article from this humble beginner. If anything is wrong, please correct me.

The big three of the suffix family: the suffix array (SA), the suffix automaton (SAM), and the suffix tree.

Today I'm studying the suffix automaton (SAM). I recommend reading Chen Lijie's paper, or this one is also good: blog

What does SAM do? In short, a SAM represents all substrings of a string in the least space.

A SAM accepts every suffix of a string S and carries the information of every substring of S. It is a directed acyclic graph, and among automata that accept exactly the suffixes of S it has the fewest states. If you walk from the initial state root to any state (node) along any path and write down, in order, all the characters you pass through, the result is always a substring of the original string; if you end in a terminal state, the result is a suffix.

Each node stores this information:

deep[] (the field dep in the code): the length of the longest string over all paths from the root to this node. For a node created as np, this is also its position in the parent string.

tr[].son[]: son[c] is the node reached by appending character c, or 0 if there is no such child.

parent: points to the node that holds the longest suffix of this node's strings that does not belong to this node itself. Every string on a path from root to parent is a suffix of the strings on paths to this node. Some other properties will be discussed later.

Specific properties

1. Starting from the initial state (root), the string formed on any path to any node p is a substring of the parent string. (Property 2 can also be deduced from this property.)

2. If node p can accept a new suffix, then every string formed by a path from root to p is a suffix (note that several different strings end at the same node p).

3. If node p can accept a new character to form a suffix, then the node pointed to by p's parent can too, but not vice versa.

Construction

Suppose we append a new character x. That means turning the suffix automaton already built for A into one for Ax, right?

We create a new node np (new point). We also keep a variable last, the node created by the previous insertion (it represents the whole string A), and start from p = last, following parent pointers.

If the current node p has no child x, we add an edge on x from p to np and move on to p's parent (by the nature of parent, every node on this chain holds suffixes of A).

We stop when we jump past the root (p = 0, in which case np's parent is simply root) or when the current node already has a child x. In the second case, call the current node p and the x child of p q.

There will be two situations:

1. deep[q] = deep[p] + 1, that is, q's longest string is exactly p's longest string followed by x. We reached p from last through parent pointers, so p holds suffixes of A. Every string in q is therefore a suffix of Ax, and once np is inserted, q can accept the new suffixes as well.

Then we let np's parent point to q and we are done with this case. What exactly does "can accept the new suffixes" mean? The second case makes that clearer.

2. The other situation is deep[q] > deep[p] + 1: besides p's strings followed by x, q also holds longer strings that are not suffixes of Ax. If we handled this like the first case, property 2 could no longer be guaranteed. Property 2 requires that the whole current string A is a suffix, that is, the paths from root to last are suffixes; and because we reached p by following parent pointers from last, every path from root to p is a suffix of last's strings. To keep the property we cannot link blindly and have to change the method. It is very simple: create a new node nq that takes over most of the functions of q. nq copies q's children and parent, gets deep[nq] = deep[p] + 1, becomes the parent of both q and np, and every node on the parent chain starting at p whose x child is q is redirected to nq.

That completes the construction. The code is as follows:

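// Append character st[k] (1-indexed) to the automaton.
// Assumes node 0 means "no node" and that root, last and cnt are set up before the first call (for example root=last=cnt=1).
void add(int k)
{
    int x=st[k]-'a'+1;
    int p=last,np=++cnt;
    tr[np].dep=k;
    // walk up the parent chain, giving every node without an x child an edge to np
    while(p!=0&&tr[p].son[x]==0)tr[p].son[x]=np,p=tr[p].parent;
    if(p==0)tr[np].parent=root;
    else
    {
        int q=tr[p].son[x];
        if(tr[p].dep==tr[q].dep-1)tr[np].parent=q; // case 1
        else
        {
            // case 2: nq is a copy of q with a shorter dep
            int nq=++cnt;
            tr[nq]=tr[q];tr[nq].dep=tr[p].dep+1;
            tr[q].parent=tr[np].parent=nq;
            // redirect the x edges that pointed to q
            while(p!=0&&tr[p].son[x]==q)tr[p].son[x]=nq,p=tr[p].parent;
        }
    }
    last=np;
}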

That should be understandable; the construction is finished.
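To make property 1 concrete, here is a minimal self-contained sketch, not the code from this post verbatim: the same construction wrapped in a hypothetical struct Sam (a name I made up), plus a helper isSubstring that just walks son[] edges from root. It assumes the input contains only lowercase letters 'a' to 'z', that node 0 means "no node", and that root is node 1.

```cpp
#include <array>
#include <string>
#include <vector>

// Hypothetical wrapper around the construction above (names are illustrative).
struct Sam {
    struct Node {
        int dep = 0, parent = 0;
        std::array<int, 27> son{};  // son[1..26] for 'a'..'z', 0 = no child
    };
    std::vector<Node> tr;  // tr[0] is the null node, tr[1] is root
    int root = 1, last = 1;

    explicit Sam(const std::string& s) : tr(2) {
        for (char c : s) add(c);
    }

    void add(char c) {
        int x = c - 'a' + 1;
        int p = last, np = (int)tr.size();
        tr.emplace_back();
        tr[np].dep = tr[last].dep + 1;  // same value as the position k
        while (p != 0 && tr[p].son[x] == 0) { tr[p].son[x] = np; p = tr[p].parent; }
        if (p == 0) tr[np].parent = root;
        else {
            int q = tr[p].son[x];
            if (tr[q].dep == tr[p].dep + 1) tr[np].parent = q;  // case 1
            else {                                               // case 2
                int nq = (int)tr.size();
                Node copy = tr[q];  // copy first, push_back may reallocate
                tr.push_back(copy);
                tr[nq].dep = tr[p].dep + 1;
                tr[q].parent = tr[np].parent = nq;
                while (p != 0 && tr[p].son[x] == q) { tr[p].son[x] = nq; p = tr[p].parent; }
            }
        }
        last = np;
    }
};

// Property 1 in action: t is a substring iff we can follow its characters from root.
bool isSubstring(const Sam& sam, const std::string& t) {
    int p = sam.root;
    for (char c : t) {
        if (c < 'a' || c > 'z') return false;  // outside the assumed alphabet
        p = sam.tr[p].son[c - 'a' + 1];
        if (p == 0) return false;
    }
    return true;
}
```

For example, after Sam sam("abcbc"), isSubstring(sam, "bcb") is true and isSubstring(sam, "cc") is false. The vector grows as nodes are added, so nothing has to be sized in advance; a fixed array of 2n nodes, as in the code above, works just as well.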

Some black magic

If this were all there was, the practical value of SAM would be a lot lower.

So plenty of clever people came up with this thing one after another: the right set.

What is the right set? As the name implies, it is about the right side: for a node, it is the set of positions of the right endpoint (end) in the original string at which the strings ending at that node occur.

After we build the automaton, we can compute the size of every right set. The processing method is as follows (I give the code directly):

 

for(int i=1;i<=cnt;i++)Rsort[tr[i].dep]++;
for(int i=1;i<=len;i++)Rsort[i]+=Rsort[i-1];
for(int i=cnt;i>=1;i--)sa[Rsort[tr[i].dep]--]=i;
// Have you seen this passage in SA?
/*
    This part is a radix (counting) sort by dep. We sort because the right set has to be
    processed from back to front: the further forward a node is, the shorter the strings
    it forms, so the more positions they appear at.
    But building the SAM creates many nq nodes, so node numbers do not follow dep order and
    we cannot simply loop over node numbers from back to front. That is why we sort.
*/
for(int i=1,p=root;i<=len;i++)
{
    int x=st[i]-'a'+1;
    p=tr[p].son[x];
    right[p]++; // the prefix ending at position i: count it directly
}
for(int i=cnt;i>=1;i--)right[tr[sa[i]].parent]+=right[sa[i]]; // run from back to front
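As a sanity check on what the right set sizes mean, here is a small self-contained sketch that counts how many times a pattern occurs in the string. The struct Sam and the helpers rightSizes and countOccurrences are hypothetical names of mine; the sorting and accumulation follow the code above. It assumes lowercase input only and a non-empty pattern.

```cpp
#include <array>
#include <string>
#include <vector>

// Hypothetical wrapper around the construction from this post.
struct Sam {
    struct Node { int dep = 0, parent = 0; std::array<int, 27> son{}; };
    std::vector<Node> tr;  // tr[0] = null node, tr[1] = root
    int root = 1, last = 1;
    explicit Sam(const std::string& s) : tr(2) { for (char c : s) add(c); }
    void add(char c) {
        int x = c - 'a' + 1, p = last, np = (int)tr.size();
        tr.emplace_back();
        tr[np].dep = tr[last].dep + 1;
        while (p != 0 && tr[p].son[x] == 0) { tr[p].son[x] = np; p = tr[p].parent; }
        if (p == 0) tr[np].parent = root;
        else {
            int q = tr[p].son[x];
            if (tr[q].dep == tr[p].dep + 1) tr[np].parent = q;
            else {
                int nq = (int)tr.size();
                Node copy = tr[q];
                tr.push_back(copy);
                tr[nq].dep = tr[p].dep + 1;
                tr[q].parent = tr[np].parent = nq;
                while (p != 0 && tr[p].son[x] == q) { tr[p].son[x] = nq; p = tr[p].parent; }
            }
        }
        last = np;
    }
};

// Size of the right set of every node, computed as in the post:
// counting sort by dep, mark the prefix nodes, then push counts up the parent links.
std::vector<int> rightSizes(const Sam& sam, const std::string& s) {
    int cnt = (int)sam.tr.size() - 1, len = (int)s.size();
    std::vector<int> bucket(len + 1, 0), order(cnt + 1, 0), right(cnt + 1, 0);
    for (int i = 1; i <= cnt; i++) bucket[sam.tr[i].dep]++;
    for (int i = 1; i <= len; i++) bucket[i] += bucket[i - 1];
    for (int i = cnt; i >= 1; i--) order[bucket[sam.tr[i].dep]--] = i;
    int p = sam.root;
    for (char c : s) { p = sam.tr[p].son[c - 'a' + 1]; right[p]++; }
    for (int i = cnt; i >= 2; i--) {  // order[1] is root, which has no parent
        int u = order[i];
        right[sam.tr[u].parent] += right[u];
    }
    return right;
}

// Number of occurrences of a non-empty pattern t: walk t, then read the right size.
int countOccurrences(const Sam& sam, const std::vector<int>& right, const std::string& t) {
    int p = sam.root;
    for (char c : t) {
        if (c < 'a' || c > 'z') return 0;
        p = sam.tr[p].son[c - 'a' + 1];
        if (p == 0) return 0;
    }
    return right[p];
}
```

For the string abcbc, the pattern bc ends at positions 3 and 5, so its node has a right set of size 2, while abc occurs once.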

 

OK, so now we also have the all-important right set. So what is SAM actually good for?
This beginner can only give a few examples that I know; the more advanced things you will have to practice yourselves QAQ

Practical examples

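1. The longest common substring of two strings

Build the suffix automaton of string A, and then run string B on it.

What does "run" mean? By the nature of SAM, you just walk: for each character of B, check whether the current node has that child. If it does, step to it and the matched length grows by 1. If it does not, follow parent pointers until some node has that child (the matched length becomes that node's deep plus 1) or, if none does, start again from root with length 0. The answer is the largest matched length seen along the way.

That is roughly how this beginner does it, anyway...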
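The walk described above can be sketched like this. It is a minimal version under my own assumptions: lowercase letters only, and a hypothetical struct Sam wrapping the construction from earlier; longestCommonSubstring is a name I picked for illustration.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Hypothetical wrapper around the construction from this post.
struct Sam {
    struct Node { int dep = 0, parent = 0; std::array<int, 27> son{}; };
    std::vector<Node> tr;  // tr[0] = null node, tr[1] = root
    int root = 1, last = 1;
    explicit Sam(const std::string& s) : tr(2) { for (char c : s) add(c); }
    void add(char c) {
        int x = c - 'a' + 1, p = last, np = (int)tr.size();
        tr.emplace_back();
        tr[np].dep = tr[last].dep + 1;
        while (p != 0 && tr[p].son[x] == 0) { tr[p].son[x] = np; p = tr[p].parent; }
        if (p == 0) tr[np].parent = root;
        else {
            int q = tr[p].son[x];
            if (tr[q].dep == tr[p].dep + 1) tr[np].parent = q;
            else {
                int nq = (int)tr.size();
                Node copy = tr[q];
                tr.push_back(copy);
                tr[nq].dep = tr[p].dep + 1;
                tr[q].parent = tr[np].parent = nq;
                while (p != 0 && tr[p].son[x] == q) { tr[p].son[x] = nq; p = tr[p].parent; }
            }
        }
        last = np;
    }
};

// Length of the longest common substring of a and b (lowercase letters assumed).
int longestCommonSubstring(const std::string& a, const std::string& b) {
    Sam sam(a);
    int p = sam.root, cur = 0, best = 0;  // cur = length matched ending at the current char
    for (char c : b) {
        int x = c - 'a' + 1;
        // no child x: fall back to shorter suffixes through the parent links
        while (p != 0 && sam.tr[p].son[x] == 0) p = sam.tr[p].parent;
        if (p == 0) {
            p = sam.root;  // nothing matches, restart
            cur = 0;
        } else {
            cur = std::min(cur, sam.tr[p].dep) + 1;  // after a fallback this is dep[p] + 1
            p = sam.tr[p].son[x];
        }
        best = std::max(best, cur);
    }
    return best;
}
```

For example, abcbc and cbcd share cbc, so the function returns 3. Each fallback along parent only shortens the current match, which is why the whole walk stays linear in the length of B.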

2. Find the number of different substrings

There are actually two ways to do this; I will only do one for the time being.

SAM has a property: the number of different strings contributed by node u = Max(u)-Min(u)+1

Think about it: Max(u) must be the deep of this node, but what about Min(u)?

-Min(u)+1 is just -(Min(u)-1), and Min(u)-1 can be converted into Max(parent of u), so the contribution of u is deep[u]-deep[parent of u].

Why? Because the strings on paths from root to parent are suffixes of the strings at u, namely the ones that are too short to belong to u. You can work through the rest yourself.
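Summing that contribution over all nodes except root gives the answer. A minimal sketch, assuming lowercase input and using a hypothetical struct Sam around the construction from earlier (countDistinctSubstrings is an illustrative name):

```cpp
#include <array>
#include <string>
#include <vector>

// Hypothetical wrapper around the construction from this post.
struct Sam {
    struct Node { int dep = 0, parent = 0; std::array<int, 27> son{}; };
    std::vector<Node> tr;  // tr[0] = null node, tr[1] = root
    int root = 1, last = 1;
    explicit Sam(const std::string& s) : tr(2) { for (char c : s) add(c); }
    void add(char c) {
        int x = c - 'a' + 1, p = last, np = (int)tr.size();
        tr.emplace_back();
        tr[np].dep = tr[last].dep + 1;
        while (p != 0 && tr[p].son[x] == 0) { tr[p].son[x] = np; p = tr[p].parent; }
        if (p == 0) tr[np].parent = root;
        else {
            int q = tr[p].son[x];
            if (tr[q].dep == tr[p].dep + 1) tr[np].parent = q;
            else {
                int nq = (int)tr.size();
                Node copy = tr[q];
                tr.push_back(copy);
                tr[nq].dep = tr[p].dep + 1;
                tr[q].parent = tr[np].parent = nq;
                while (p != 0 && tr[p].son[x] == q) { tr[p].son[x] = nq; p = tr[p].parent; }
            }
        }
        last = np;
    }
};

// Number of distinct non-empty substrings: sum of dep[u] - dep[parent[u]] over non-root nodes.
long long countDistinctSubstrings(const std::string& s) {
    Sam sam(s);
    long long total = 0;
    for (int u = 2; u < (int)sam.tr.size(); u++)  // node 1 is root, skip it
        total += sam.tr[u].dep - sam.tr[sam.tr[u].parent].dep;
    return total;
}
```

For abcbc there are 15 substrings in total, of which b, c and bc each appear twice, so the function returns 12.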

3. Find the K-th largest substring

Preprocess the right set and the sums, and then run a DFS on the SAM. What? You want more details?

1. K-th largest among different strings: preprocess how many strings can be formed starting from each state. This can be done with a DFS or in topological order of the automaton (in practice this is the same as sorting by len, i.e. deep, from small to large, because a smaller len also comes earlier in the topological order), processing the states in reverse (this is a DP on the DAG). Then a DFS finds the K-th largest one.

2. K-th largest when equal strings are counted separately: in addition to the preprocessing above, also preprocess the size of the right set of every state (how many times its strings appear in the original string). Each state then counts that many times instead of once, which changes the sums computed above.

Then handle it the same way during the DFS.
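Both variants can be sketched in one function. This is only an illustration under several assumptions of mine: "K-th" is taken to mean the k-th smallest in lexicographic order (visiting children from 'a' to 'z'; iterating from 'z' to 'a' would rank from the largest instead), k is 1-based, input is lowercase, and std::sort replaces the radix sort for brevity. Sam, kthSubstring and withRepeats are hypothetical names, and the DFS is written as a loop.

```cpp
#include <algorithm>
#include <array>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical wrapper around the construction from this post.
struct Sam {
    struct Node { int dep = 0, parent = 0; std::array<int, 27> son{}; };
    std::vector<Node> tr;  // tr[0] = null node, tr[1] = root
    int root = 1, last = 1;
    explicit Sam(const std::string& s) : tr(2) { for (char c : s) add(c); }
    void add(char c) {
        int x = c - 'a' + 1, p = last, np = (int)tr.size();
        tr.emplace_back();
        tr[np].dep = tr[last].dep + 1;
        while (p != 0 && tr[p].son[x] == 0) { tr[p].son[x] = np; p = tr[p].parent; }
        if (p == 0) tr[np].parent = root;
        else {
            int q = tr[p].son[x];
            if (tr[q].dep == tr[p].dep + 1) tr[np].parent = q;
            else {
                int nq = (int)tr.size();
                Node copy = tr[q];
                tr.push_back(copy);
                tr[nq].dep = tr[p].dep + 1;
                tr[q].parent = tr[np].parent = nq;
                while (p != 0 && tr[p].son[x] == q) { tr[p].son[x] = nq; p = tr[p].parent; }
            }
        }
        last = np;
    }
};

// k-th (1-based) substring in lexicographic order, "" if k is out of range.
// withRepeats = false: equal substrings count once; true: once per occurrence.
std::string kthSubstring(const std::string& s, long long k, bool withRepeats) {
    Sam sam(s);
    int cnt = (int)sam.tr.size() - 1;
    std::vector<int> order(cnt);
    std::iota(order.begin(), order.end(), 1);
    std::sort(order.begin(), order.end(),  // largest dep first
              [&](int a, int b) { return sam.tr[a].dep > sam.tr[b].dep; });
    // weight[u] = how many times a string ending in state u is counted
    std::vector<long long> weight(cnt + 1, withRepeats ? 0 : 1), sum(cnt + 1, 0);
    if (withRepeats) {  // weight = size of the right set
        int p = sam.root;
        for (char c : s) { p = sam.tr[p].son[c - 'a' + 1]; weight[p]++; }
        for (int u : order)
            if (u != sam.root) weight[sam.tr[u].parent] += weight[u];
    }
    weight[sam.root] = 0;  // the empty string is not counted
    for (int u : order) {  // DP on the DAG: children have larger dep, so they come first
        sum[u] = weight[u];
        for (int x = 1; x <= 26; x++)
            if (int v = sam.tr[u].son[x]) sum[u] += sum[v];
    }
    if (k < 1 || k > sum[sam.root]) return "";
    std::string ans;
    int p = sam.root;
    while (k > 0) {
        for (int x = 1; x <= 26; x++) {
            int v = sam.tr[p].son[x];
            if (v == 0) continue;
            if (k > sum[v]) { k -= sum[v]; continue; }  // answer is not below this child
            ans += char('a' + x - 1);
            k -= weight[v];  // skip the copies of the string ending at v
            p = v;
            break;
        }
    }
    return ans;
}
```

For abcbc the distinct substrings in order are a, ab, abc, abcb, abcbc, b, bc, bcb, bcbc, c, cb, cbc, so kthSubstring("abcbc", 7, false) gives bc. With repeats, b and bc each appear twice, so the 7th is b and the 10th is bcb.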

 

 
