Suffix automaton review notes

First off, I found that my original understanding was a mess... so this time let me try a different way of understanding it.

Let the original string be $T$, of length $n$.

For each substring $S$ of $T$ we associate a set $endpos_{S}$: the set of all right end positions at which $S$ occurs in $T$.

Example: $T = "aaabc"$, $S = "aaab"$, then $endpos_{S} = \{4\}$ (indices start from 1, which is a good habit qwq)

So consider another question: are the $endpos$ sets of all substrings of this string pairwise different?

Obviously not, because $S = "aab"$, $S = "ab"$ and $S = "b"$ all have $endpos_{S} = \{4\}$ as well.

So we say these substrings form an $endpos$ equivalence class, and we use $len$ to record the length of the longest string in the class. In this case the longest string is $"aaab"$, so $len = 4$.
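To make the definition concrete, here is a small brute-force sketch that groups substrings by their $endpos$ set. The function name `endpos_classes` is made up for illustration, and the quadratic enumeration is only meant for tiny strings like the one above; it is not how a suffix automaton computes these sets.

```python
from collections import defaultdict

def endpos_classes(t):
    """Group the distinct substrings of t by their endpos set (1-indexed)."""
    endpos = defaultdict(set)
    n = len(t)
    for i in range(n):
        for j in range(i + 1, n + 1):
            # t[i:j] ends at 1-indexed position j
            endpos[t[i:j]].add(j)
    classes = defaultdict(list)
    for s, ends in endpos.items():
        classes[frozenset(ends)].append(s)
    return classes

classes = endpos_classes("aaabc")
members = sorted(classes[frozenset({4})], key=len)
print(members)                 # ['b', 'ab', 'aab', 'aaab']
print(max(map(len, members)))  # 4, the len of this class
```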

Next consider several properties:

If we prepend a character $c$ to a substring $S$ to get a new substring, how are the $endpos$ sets of the two substrings related?

Obviously, $endpos_{'c'+S} \subseteq endpos_{S}$!

This is fairly clear: if $'c'+S$ occurs ending at position $p$, then $S$ also occurs ending at that same position $p$.
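The containment can also be checked exhaustively on a small string. The sketch below uses two hypothetical helper names, `endpos` and `check_prepend_property`; it simply tries every substring and every character of the alphabet.

```python
def endpos(t, s):
    """1-indexed end positions of all occurrences of s in t."""
    return {j for j in range(len(s), len(t) + 1) if t[j - len(s):j] == s}

def check_prepend_property(t):
    """Return True if endpos(c + S) is a subset of endpos(S) for every substring S of t."""
    for i in range(len(t)):
        for j in range(i + 1, len(t) + 1):
            s = t[i:j]
            for c in set(t):
                # an empty endpos set (c + s does not occur) is trivially a subset
                if not endpos(t, c + s) <= endpos(t, s):
                    return False
    return True

print(endpos("acbc", "c"), endpos("acbc", "bc"))
print(check_prepend_property("acbc"))
```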

So based on these containment relations, we can actually link the different $endpos$ equivalence classes into parent-child relationships!

For example:

$T="acbc"$

Listing the substrings in a table:

| substring | a | b | c | ac | bc | acb | cbc | cb | acbc |
|---|---|---|---|---|---|---|---|---|---|
| endpos | 1 | 3 | 2,4 | 2 | 4 | 3 | 4 | 3 | 4 |

Linking these equivalence classes according to the containment relations, we immediately get:

It's a tree!

The strings in a parent node must be suffixes of the strings in its child nodes!
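A brute-force sketch of this tree, under the following assumptions: each class is named by its longest string, the root is the empty string, and the parent of a class is found by dropping the first character of its shortest string. The function name `parent_tree` is made up for illustration.

```python
from collections import defaultdict

def parent_tree(t):
    """Map each endpos class (named by its longest string) to its parent class.

    The root is represented by the empty string "".
    """
    endpos = defaultdict(set)
    for i in range(len(t)):
        for j in range(i + 1, len(t) + 1):
            endpos[t[i:j]].add(j)
    classes = defaultdict(list)
    for s, ends in endpos.items():
        classes[frozenset(ends)].append(s)
    # substring -> longest string of the class it belongs to
    longest_of = {"": ""}
    for members in classes.values():
        longest = max(members, key=len)
        for s in members:
            longest_of[s] = longest
    parent = {}
    for members in classes.values():
        shortest = min(members, key=len)
        # dropping the first character of the shortest string leaves the class
        parent[max(members, key=len)] = longest_of[shortest[1:]]
    return parent

print(parent_tree("acbc"))
```

For $T="acbc"$ this gives three children of the root (the classes of $"a"$, $"acb"$ and $"c"$), with the classes of $"ac"$ and $"acbc"$ hanging under $"c"$.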

If you already know the suffix automaton, you may have noticed that this is exactly the parent tree!

Of course, we have reversed the usual order here... we derived the parent tree first.

So what is this thing good for?

On its own perhaps nothing, but... from this angle we can understand the suffix automaton much better!

Next, consider the suffix automaton:

With this foundation, the suffix automaton is comparatively easy to understand: it is a directed acyclic graph that recognizes every substring of the string, and any string that is not a substring of it must not be recognized!

Meanwhile, the suffix automaton also maintains some pointers (note: these are not edges of the DAG). These pointers form exactly the tree structure described above, and this tree structure is the core of the suffix automaton!

To be clear, the nodes of that tree are exactly the nodes of the suffix automaton!

So when we build the suffix automaton, keeping in mind that the properties of this tree must be maintained may make the construction easier to understand:

Let's start from a simple example:

Example: $T="acc"$

First, we insert the first node $a$:

This is easy to understand. Blue marks the pointers we maintain, and black marks the edges of the DAG.

Insert the second node $c$, giving:

Here a few small problems start to appear:

First: how do we connect the edges?

Intuition says: just connect it directly from the previous node.

But we know the suffix automaton should accept every substring of this string, and that way it simply cannot accept the substring $"c"$!

So it seems we actually need to add a few more edges.

How?

We know that after adding a new node, the strings that are new and not yet recognized must be the new suffixes of the string!

So whatever edges we add must serve to recognize these suffixes.

And what do we get when we remove the new character from a new suffix?

A suffix of the original string!

How do we find the suffixes of the original string?

Just jump up along the pointers!

For every node we jump to, we add an edge from it to the new node!

Back to this example: after filling in the edges, our suffix automaton should look like this:

Next, we consider maintaining the new node's pointer.

Obviously no existing equivalence class is related to it, so it points directly back to the root node.

That's it:

Next, add the next node:

It looks friendly enough.

But we still face the same two questions as above:

How do we maintain the edges?

Jump up along the pointers?

But... the node we jump to already has an outgoing edge on this character!

What do we do?

A bit of trouble...

And there is more trouble below!

We label a few nodes:

Then analyze the relationship between their $len$ values (written $l$ here):

We find $l_{p} > l_{f} + 1$!

What does this mean?

It means that the strings in the equivalence class of $p$ cannot all be obtained simply by appending one letter to the strings in the equivalence class of $f$!

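So a problem arises: after adding a new node $n$, node $p$ loses information!

Why?

Originally $endpos_{p}=\{2\}$, and it maintains two substrings, $"ac"$ and $"c"$, while $l_{p}=2$ (coming from the substring $"ac"$). In other words, we have effectively hidden the string $"c"$.

But when we bring in a $"c"$ node again, the $endpos$ of the substring $"ac"$ does not change, while the $endpos$ of the substring $"c"$ does. Keeping $"ac"$ and $"c"$ together is then no longer valid!

So we need to restructure.

We expand this compressed node and create a new node, namely:

(Since we are splitting this node, $np$ should inherit all the information of $p$; only $len$ needs to be changed.)

We also find that the pointer of $p$ has to change, because its $endpos$ is a subset of the $endpos$ of $np$. Changing the pointer gives:

It seems the pointer of the new node still has not been handled...

Just point it directly at the newly created node, because now $len_{np}=len_{f}+1$ holds.

Wait, why can we set it up directly like this?

Consider $len_{np}-len_{f}=1$: in this example, this says that node $np$ maintains only one string, so no information is compressed in it, and after adding the new $i$-th character the $endpos$ set of this node gains $i$ as well, so the $endpos$ sets remain correct.

So at the very start of the handling, if this condition had already held, there would have been no need to split the node.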
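The whole procedure can be sketched as follows. This is a minimal version of the standard online construction, not necessarily the author's code: the class name `SuffixAutomaton`, its methods, and the array layout are assumptions made for illustration. The variables `n`, `f` and `p` follow the labels used in the text, and `clone` plays the role of $np$. One step the walkthrough does not spell out, but the standard construction needs, is redirecting the edges of `f` and its ancestors from `p` to the clone.

```python
class SuffixAutomaton:
    def __init__(self, text=""):
        self.next = [{}]   # DAG edges: next[v][ch] -> node
        self.link = [-1]   # parent-tree pointers; the root (node 0) has none
        self.len = [0]     # len of each node's equivalence class
        self.last = 0      # node holding the whole current string
        for ch in text:
            self.extend(ch)

    def _new_node(self, length, edges, link):
        self.next.append(edges)
        self.link.append(link)
        self.len.append(length)
        return len(self.len) - 1

    def extend(self, ch):
        n = self._new_node(self.len[self.last] + 1, {}, 0)
        f = self.last
        # jump up the pointers, adding the missing edges to the new node
        while f != -1 and ch not in self.next[f]:
            self.next[f][ch] = n
            f = self.link[f]
        if f != -1:
            p = self.next[f][ch]
            if self.len[p] == self.len[f] + 1:
                self.link[n] = p  # no compressed information, no split needed
            else:
                # split: the clone inherits p's edges and pointer, only len changes
                clone = self._new_node(self.len[f] + 1,
                                       dict(self.next[p]), self.link[p])
                # standard step (assumption, not described in the text):
                # edges that led to p from f and its ancestors now lead to the clone
                while f != -1 and self.next[f].get(ch) == p:
                    self.next[f][ch] = clone
                    f = self.link[f]
                self.link[p] = clone
                self.link[n] = clone
        self.last = n

    def accepts(self, s):
        """True if s can be read from the root along DAG edges."""
        node = 0
        for ch in s:
            if ch not in self.next[node]:
                return False
            node = self.next[node][ch]
        return True

sam = SuffixAutomaton("acc")
print(sam.len, sam.link)
print(sam.accepts("cc"), sam.accepts("ca"))
```

For $T="acc"$ the sketch ends with five nodes: the root, one node per inserted character, and one node created by the split when the second $c$ is added.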

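So the complete suffix automaton ends up looking like this:

If we pull out the blue lines, we can see a tree structure.

We also find that the difference in $len$ between a child node and its parent is exactly the number of strings in the child's own $endpos$ equivalence class (clearly, the lengths of the strings within one equivalence class are consecutive).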
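This counting property can be checked directly from the definitions, without building the automaton. The sketch below is a brute-force check for small strings; the function name `class_sizes_match_len_gaps` is made up for illustration, and the root is taken to be the empty string with $len = 0$.

```python
from collections import defaultdict

def class_sizes_match_len_gaps(t):
    """Check that every endpos class has consecutive lengths and that its
    size equals len(class) - len(parent class)."""
    endpos = defaultdict(set)
    for i in range(len(t)):
        for j in range(i + 1, len(t) + 1):
            endpos[t[i:j]].add(j)
    classes = defaultdict(list)
    for s, ends in endpos.items():
        classes[frozenset(ends)].append(s)
    # substring -> len of its class; the root (empty string) has len 0
    len_of = {"": 0}
    for members in classes.values():
        longest = max(map(len, members))
        for s in members:
            len_of[s] = longest
    for members in classes.values():
        lengths = sorted(map(len, members))
        if lengths != list(range(lengths[0], lengths[-1] + 1)):
            return False  # lengths inside a class are not consecutive
        # the parent class contains the shortest string minus its first character
        parent_len = len_of[min(members, key=len)[1:]]
        if len(members) != lengths[-1] - parent_len:
            return False
    return True

print(class_sizes_match_len_gaps("acc"))
```

A consequence is that summing these differences over all non-root nodes counts the distinct non-empty substrings of $T$.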

With that, both the suffix automaton and the parent tree have been built.

Origin www.cnblogs.com/zhangleo/p/11123277.html