Suffix automaton: learning notes

The magical suffix automaton!!!

It can solve the following problems:
Problem 1: Deciding whether a string is a substring

Just walk the string along the edges of the suffix automaton. If the whole string is consumed without ever hitting a NULL transition, it is a substring of the original string.
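
The walk described above can be sketched as follows. This is a minimal illustration rather than the code posted later in this article: `MiniSam`, `buildSam` and `containsSubstring` are hypothetical names, the alphabet is assumed to be 'a'..'z', and node 1 is the root with 0 meaning "no edge", mirroring the convention used in the full listing.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Minimal suffix automaton over 'a'..'z' (assumed alphabet).
// Node 1 is the root; 0 means "no node".
struct MiniSam {
    std::vector<std::array<int, 26>> tree; // transitions
    std::vector<int> fa, len;              // parent-tree link, longest length
    int last = 1;

    MiniSam() : tree(2), fa(2, 0), len(2, 0) {}

    int newNode(int l) {
        tree.push_back(std::array<int, 26>{});
        fa.push_back(0);
        len.push_back(l);
        return static_cast<int>(len.size()) - 1;
    }

    void add(int c) {
        assert(0 <= c && c < 26);
        int p = last;
        int now = last = newNode(len[p] + 1);
        for (; p && tree[p][c] == 0; p = fa[p]) tree[p][c] = now;
        if (!p) { fa[now] = 1; return; }
        int q = tree[p][c];
        if (len[q] == len[p] + 1) { fa[now] = q; return; }
        int clone = newNode(len[p] + 1);   // split state q
        tree[clone] = tree[q];
        fa[clone] = fa[q];
        fa[now] = fa[q] = clone;
        for (; p && tree[p][c] == q; p = fa[p]) tree[p][c] = clone;
    }
};

MiniSam buildSam(const std::string& s) {
    MiniSam sam;
    for (char ch : s) sam.add(ch - 'a');
    return sam;
}

// Follow the transitions; a missing edge means t is not a substring.
bool containsSubstring(const MiniSam& sam, const std::string& t) {
    int p = 1;
    for (char ch : t) {
        int c = ch - 'a';
        if (c < 0 || c >= 26) return false;
        p = sam.tree[p][c];
        if (p == 0) return false;
    }
    return true;
}
```

For example, with `buildSam("banana")` the query "nan" should succeed and "nab" should fail; the empty string is always accepted because the walk never leaves the root.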

Suffix array: build sa, then start from the smallest suffix and enumerate the suffixes in order while recording the current match position. If the current suffix does not match at that position, move on to the next suffix; otherwise advance the match position by one. If the suffixes run out before the string is fully matched, it is not a substring.


Problem 2: The number of distinct substrings
DP on the DAG. For a node i, f[i] is the number of substrings (excluding the empty string) that start from i. Then $f[i]=\sum_{(i,j)\in Edge}(f[j]+1)$, and f[1] is the answer.
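
A small sketch of this DP, under the same assumptions as the rest of these notes (lowercase letters, root is node 1). `MiniSam`, `countFrom` and `countDistinctSubstrings` are illustrative names, and the recursion is memoized so each node is evaluated once; for very long strings an iterative order by len would avoid deep recursion.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Minimal suffix automaton over 'a'..'z'; node 1 is the root, 0 means "no node".
struct MiniSam {
    std::vector<std::array<int, 26>> tree;
    std::vector<int> fa, len;
    int last = 1;

    MiniSam() : tree(2), fa(2, 0), len(2, 0) {}

    int newNode(int l) {
        tree.push_back(std::array<int, 26>{});
        fa.push_back(0);
        len.push_back(l);
        return static_cast<int>(len.size()) - 1;
    }

    void add(int c) {
        assert(0 <= c && c < 26);
        int p = last;
        int now = last = newNode(len[p] + 1);
        for (; p && tree[p][c] == 0; p = fa[p]) tree[p][c] = now;
        if (!p) { fa[now] = 1; return; }
        int q = tree[p][c];
        if (len[q] == len[p] + 1) { fa[now] = q; return; }
        int clone = newNode(len[p] + 1);
        tree[clone] = tree[q];
        fa[clone] = fa[q];
        fa[now] = fa[q] = clone;
        for (; p && tree[p][c] == q; p = fa[p]) tree[p][c] = clone;
    }
};

// f[i] = sum over transition edges (i, j) of (f[j] + 1); -1 marks "not computed yet".
long long countFrom(const MiniSam& sam, int i, std::vector<long long>& f) {
    if (f[i] >= 0) return f[i];
    long long total = 0;
    for (int c = 0; c < 26; c++) {
        int j = sam.tree[i][c];
        if (j) total += countFrom(sam, j, f) + 1;
    }
    return f[i] = total;
}

// Number of distinct non-empty substrings of s, i.e. f[1].
long long countDistinctSubstrings(const std::string& s) {
    MiniSam sam;
    for (char ch : s) sam.add(ch - 'a');
    std::vector<long long> f(sam.len.size(), -1);
    return countFrom(sam, 1, f);
}
```

As a check by hand, "banana" has 3 + 3 + 3 + 3 + 2 + 1 = 15 distinct substrings of lengths 1 to 6, which is what `countDistinctSubstrings("banana")` is expected to return.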

Suffix array: sum, over all suffixes, the suffix length minus its height value.
Problem 3: Among all substrings of the original string (identical substrings at different positions are counted separately), find the one with the k-th smallest lexicographic rank (Luogu P3975)

First, compute the endpos size of each node, i.e. the number of times the strings of each class occur in the original string. Consider a DP where f[i] is the endpos size of node i. For a node that does not contain any prefix of the original string, $f[i]=\sum_{(i,j)\in Edge}f[j]$, where Edge here means the parent-tree edges (j is a child of i); a node that does contain a prefix gets an extra 1. Then run a second DP g[i], the number of substrings starting from i (excluding the empty string, with repeated occurrences counted separately): $g[i]=\sum_{(i,j)\in Edge}(g[j]+f[j])$, this time over the transition edges of the automaton.

Finally, DFS on the suffix automaton, visiting the outgoing edges in lexicographic order. While the remaining rank is larger than the number of substrings behind an edge, subtract that number and try the next edge; otherwise follow the edge to the node it leads to and repeat the steps above. When the rank becomes non-positive, output the string formed by the path walked so far. For the details, please refer to the solutions of this problem.
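
The two DPs and the final walk can be sketched like this. It is an assumption-laden illustration, not the solution referenced above: `MiniSam`, its `cnt` field (playing the role of f) and `kthWithRepeats` are hypothetical names, the input is assumed to be lowercase letters, nodes are ordered with `std::sort` by len for brevity, and "-1" is returned when k is out of range.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Minimal suffix automaton over 'a'..'z'; node 1 is the root, 0 means "no node".
struct MiniSam {
    std::vector<std::array<int, 26>> tree;
    std::vector<int> fa, len;
    std::vector<long long> cnt;   // endpos size (f in the text)
    int last = 1;

    MiniSam() : tree(2), fa(2, 0), len(2, 0), cnt(2, 0) {}

    int newNode(int l, long long c) {
        tree.push_back(std::array<int, 26>{});
        fa.push_back(0);
        len.push_back(l);
        cnt.push_back(c);
        return static_cast<int>(len.size()) - 1;
    }

    void add(int c) {
        assert(0 <= c && c < 26);
        int p = last;
        int now = last = newNode(len[p] + 1, 1);   // contains a prefix: starts at 1
        for (; p && tree[p][c] == 0; p = fa[p]) tree[p][c] = now;
        if (!p) { fa[now] = 1; return; }
        int q = tree[p][c];
        if (len[q] == len[p] + 1) { fa[now] = q; return; }
        int clone = newNode(len[p] + 1, 0);        // clone: starts at 0
        tree[clone] = tree[q];
        fa[clone] = fa[q];
        fa[now] = fa[q] = clone;
        for (; p && tree[p][c] == q; p = fa[p]) tree[p][c] = clone;
    }
};

// k-th smallest substring (1-based), equal substrings at different positions counted separately.
std::string kthWithRepeats(const std::string& s, long long k) {
    MiniSam sam;
    for (char ch : s) sam.add(ch - 'a');
    int n = static_cast<int>(sam.len.size());
    std::vector<int> order(n - 1);
    std::iota(order.begin(), order.end(), 1);      // nodes 1..n-1
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return sam.len[a] > sam.len[b]; });
    for (int v : order)                            // f: sum over parent-tree children
        if (sam.fa[v]) sam.cnt[sam.fa[v]] += sam.cnt[v];
    std::vector<long long> g(n, 0);
    for (int v : order)                            // g: sum over transition edges
        for (int c = 0; c < 26; c++)
            if (int j = sam.tree[v][c]) g[v] += g[j] + sam.cnt[j];
    if (k < 1 || k > g[1]) return "-1";
    std::string ans;
    int now = 1;
    while (true) {
        for (int c = 0; c < 26; c++) {
            int j = sam.tree[now][c];
            if (!j) continue;
            long long block = g[j] + sam.cnt[j];   // substrings that continue with c
            if (k > block) { k -= block; continue; }
            ans += static_cast<char>('a' + c);
            if (k <= sam.cnt[j]) return ans;       // rank falls on the string itself
            k -= sam.cnt[j];
            now = j;
            break;
        }
    }
}
```

For "aabc" the substrings in order are a, a, aa, aab, aabc, ab, abc, b, bc, c, so rank 3 should give "aa" and rank 4 should give "aab".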


Problem 4: Longest common substring of two strings

Concatenate the two strings with a special character in between and build the suffix automaton of the result. Then, in the same way as the occurrence counting above, compute for each node how many times its substrings occur ending in the first half of the concatenated string and how many times ending in the second half. Finally traverse the nodes and take the largest len among the nodes whose two counts are both non-zero. The same idea also handles the longest common substring of several strings.
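
A possible sketch of this concatenation approach. The names (`MiniSam`, `inA`, `inB`, `longestCommonSubstring`) are made up for illustration, both inputs are assumed to be lowercase letters, and the separator is modelled as a 27th symbol with index 26.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

constexpr int kSigma = 27;   // 'a'..'z' plus one separator symbol (index 26)

// Minimal suffix automaton; node 1 is the root, 0 means "no node".
struct MiniSam {
    std::vector<std::array<int, kSigma>> tree;
    std::vector<int> fa, len;
    std::vector<int> inA, inB;   // occurrences ending in the first / second half
    int last = 1;

    MiniSam() : tree(2), fa(2, 0), len(2, 0), inA(2, 0), inB(2, 0) {}

    int newNode(int l, int a, int b) {
        tree.push_back(std::array<int, kSigma>{});
        fa.push_back(0);
        len.push_back(l);
        inA.push_back(a);
        inB.push_back(b);
        return static_cast<int>(len.size()) - 1;
    }

    // a / b mark which half the newly appended position belongs to.
    void add(int c, int a, int b) {
        assert(0 <= c && c < kSigma);
        int p = last;
        int now = last = newNode(len[p] + 1, a, b);
        for (; p && tree[p][c] == 0; p = fa[p]) tree[p][c] = now;
        if (!p) { fa[now] = 1; return; }
        int q = tree[p][c];
        if (len[q] == len[p] + 1) { fa[now] = q; return; }
        int clone = newNode(len[p] + 1, 0, 0);
        tree[clone] = tree[q];
        fa[clone] = fa[q];
        fa[now] = fa[q] = clone;
        for (; p && tree[p][c] == q; p = fa[p]) tree[p][c] = clone;
    }
};

int longestCommonSubstring(const std::string& a, const std::string& b) {
    MiniSam sam;
    for (char ch : a) sam.add(ch - 'a', 1, 0);
    sam.add(26, 0, 0);                             // separator belongs to neither half
    for (char ch : b) sam.add(ch - 'a', 0, 1);
    int n = static_cast<int>(sam.len.size());
    std::vector<int> order(n - 1);
    std::iota(order.begin(), order.end(), 1);
    std::sort(order.begin(), order.end(),
              [&](int x, int y) { return sam.len[x] > sam.len[y]; });
    for (int v : order) {                          // push counts up the parent tree
        if (!sam.fa[v]) continue;
        sam.inA[sam.fa[v]] += sam.inA[v];
        sam.inB[sam.fa[v]] += sam.inB[v];
    }
    int best = 0;
    for (int v = 1; v < n; v++)
        if (sam.inA[v] > 0 && sam.inB[v] > 0) best = std::max(best, sam.len[v]);
    return best;
}
```

A node counted in both halves cannot contain the separator, because its strings also occur entirely inside the first string, so taking its len directly is safe. For "banana" and "ananas" the expected result is 5 ("anana").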

Suffix array: likewise concatenate the strings, then compute sa and height. For each suffix, find the first suffix after it in sa order that belongs to the other half (this can be done in O(n); the details are left to the reader), compute their lcp, and finally take the maximum.


And more:
[three images]

Many of the features above are not implemented yet... (I am still too weak)

The code is as follows:

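#include <iostream>
#include <string>
#include <queue>
#include <cstring>
#include <cmath>
#include <algorithm>
#include <stack>
#include <map>
// Node capacity. A SAM has at most 2n-1 states, so this supports strings of up to about 5000 characters.
#define MAX 10010
#define ll long long
#define mod 1000000007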
using namespace std;

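struct node {
	int tree[MAX][26]; // SAM transitions (top-down); 0 means no edge
	int fa[MAX];       // parent-tree pointer (bottom-up)
	int len[MAX];      // length of the longest string in each state
	int num[MAX];      // number of distinct substrings readable from the node, empty string included
	int size[MAX];     // |endpos| of each node, valid after getsize()
	int tot;           // number of nodes; node 1 is the root
	int last;          // node of the whole string inserted so far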

	node()
	{
		memset(tree, 0, sizeof(tree));
		tot = 1;
		last = 1;
		fa[1] = 0;
		len[1] = 0;
	}

	void add(int c)
	{
		int p = last;
		int now = last = ++tot;
		size[now] = 1;
		len[now] = len[p] + 1;
		for (; p && tree[p][c] == 0; p = fa[p])
			tree[p][c] = now;
		if (!p)
			fa[now] = 1;
		else
		{
			int l = tree[p][c];
			if (len[l] == len[p] + 1)
				fa[now] = l;
			else
			{
				int neww = ++tot;
				len[neww] = len[p] + 1;
				fa[neww] = fa[l];
				for (int i = 0; i < 26; i++)
					tree[neww][i] = tree[l][i];
				fa[now] = fa[l] = neww;
				for (; p && tree[p][c] == l; p = fa[p])
					tree[p][c] = neww;
			}
		}
	}

	int getnum(int now = 1)
	{
		if (num[now] != 0)
			return num[now];
		num[now] = 1;
		for (int i = 0; i < 26; i++)
		{
			if (tree[now][i])
			{
				int ans = getnum(tree[now][i]);
				num[now] += ans;
			}
		}
		return num[now];
	}

	void getsize()
	{
		int A[MAX] = { 0 };
		int t[MAX] = { 0 };
		for (int i = 1; i <= tot; i++) t[len[i]]++;
		for (int i = 1; i <= tot; i++) t[i] += t[i - 1];
		for (int i = 1; i <= tot; i++) A[t[len[i]]--] = i; // the magic bucket sort: A[1..tot] lists the nodes by increasing len
		for (int i = tot; i >= 1; i--) size[fa[A[i]]] += size[A[i]];
	}

	void getkth(int k, string& ans,int now=1)
	{
		if (now == 1)
			ans = "";
		if (k <= 1)
			return;
		k--;
		for (int i = 0; i < 26; i++)
		{
			int to = tree[now][i];
			if (to != 0)
			{
				if (k > num[to])
					k -= num[to];
				else
				{
					ans += char(i + 'a');
					getkth(k,ans,to);
					return;
				}
			}	
		}
		ans = "-1";
	}

}SAM;

int main()
{
	string a;
	cin >> a;
	for (size_t k = 0; k < a.size(); k++)
		SAM.add(a[k] - 'a');
	SAM.getnum();
	SAM.getsize();
	string ans="yes";
	int len = 1;
	while (ans != "-1")
	{
		SAM.getkth(len++, ans);
		cout << len - 1 << ":" << ans << endl;
	}
	return 0;
}

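So far only two fairly basic functions are implemented: getnum() and getsize().
getnum() computes how many distinct substrings the text string has in total (including the empty string!).
getsize() computes how many times the substrings represented by each node occur in the text string, i.e. |endpos|.
This implementation is important and worth studying.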

By reasoning (read: staring at it) it is not hard (read: it is black magic) to find that $size[i]=\sum_{(i,j)\in Edge}size[j]$.
Every node starts with size 1, except for the replacement nodes created as neww in the third case.

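But we only have the bottom-up fa relation, so we need a topological order. A general-purpose topological sort is overkill here: it needs extra bookkeeping (in-degrees or child lists), while building the SAM takes only O(n) time for a fixed alphabet, and there is a simpler way...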

Again, by reasoning (staring) it is not hard (black magic) to find that the SAM can be split into layers by the value of len.
So sorting the nodes with len as the key gives a topological order of the SAM.

Then comes a bucket sort. After taking prefix sums, the array t can be understood like this:
t[k] is the number of nodes whose length is less than or equal to k.
So isn't the rank of a node of length k exactly t[k]? (This had me going in circles for a long time...)
As for the t[len]-- in the third line, it just makes nodes with the same len get different ranks (nothing more).
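
To see the bucket sort in isolation, here is a hedged sketch that mirrors the three loops of getsize() but uses vectors sized to the input. `MiniSam` and `countOccurrences` are illustrative names rather than part of the listing above, and the input is assumed to be lowercase letters.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Minimal suffix automaton over 'a'..'z'; node 1 is the root, 0 means "no node".
struct MiniSam {
    std::vector<std::array<int, 26>> tree;
    std::vector<int> fa, len;
    std::vector<long long> size;   // becomes |endpos| after the propagation below
    int last = 1;

    MiniSam() : tree(2), fa(2, 0), len(2, 0), size(2, 0) {}

    int newNode(int l, long long sz) {
        tree.push_back(std::array<int, 26>{});
        fa.push_back(0);
        len.push_back(l);
        size.push_back(sz);
        return static_cast<int>(len.size()) - 1;
    }

    void add(int c) {
        assert(0 <= c && c < 26);
        int p = last;
        int now = last = newNode(len[p] + 1, 1);   // ordinary node: size 1
        for (; p && tree[p][c] == 0; p = fa[p]) tree[p][c] = now;
        if (!p) { fa[now] = 1; return; }
        int q = tree[p][c];
        if (len[q] == len[p] + 1) { fa[now] = q; return; }
        int clone = newNode(len[p] + 1, 0);        // replacement node: size 0
        tree[clone] = tree[q];
        fa[clone] = fa[q];
        fa[now] = fa[q] = clone;
        for (; p && tree[p][c] == q; p = fa[p]) tree[p][c] = clone;
    }
};

// Number of occurrences of pattern in s (0 if it does not occur).
long long countOccurrences(const std::string& s, const std::string& pattern) {
    MiniSam sam;
    for (char ch : s) sam.add(ch - 'a');
    int tot = static_cast<int>(sam.len.size()) - 1;
    int maxLen = static_cast<int>(s.size());
    std::vector<int> t(maxLen + 1, 0), A(tot + 1, 0);
    for (int i = 1; i <= tot; i++) t[sam.len[i]]++;           // bucket by len
    for (int k = 1; k <= maxLen; k++) t[k] += t[k - 1];       // t[k] = #nodes with len <= k
    for (int i = 1; i <= tot; i++) A[t[sam.len[i]]--] = i;    // distinct ranks 1..tot
    for (int i = tot; i >= 1; i--) {                          // longest first
        int v = A[i];
        if (sam.fa[v]) sam.size[sam.fa[v]] += sam.size[v];
    }
    int p = 1;
    for (char ch : pattern) {
        int c = ch - 'a';
        if (c < 0 || c >= 26) return 0;
        p = sam.tree[p][c];
        if (!p) return 0;
    }
    return sam.size[p];
}
```

Because a parent in the parent tree always has a strictly smaller len than its children, walking A from the back adds every child into its parent before the parent itself is used. For "banana", the pattern "ana" should be reported 2 times and "a" 3 times.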

With getnum() and getsize() in place, we can solve the k-th smallest substring of the text string!
As for the other features, I will go and learn some more.
The deduplicated version of the k-th smallest is already implemented... (it does not need the auxiliary array size). The version without deduplication comes tomorrow...
Note... always be clear about whether you are running on the parent tree or on the SAM...
Here the fa array stores the parent tree, linked bottom-up, and the tree array stores the SAM transitions, top-down.
The len array stores the length of the longest string of each state, i.e. the length of longest(s).

The version without deduplication is finished (it differs from the deduplicated one in only a few places, so I will not post it...)

Next up is the longest common substring of two strings

(updated from time to time)

Origin blog.csdn.net/Zhang_sir00/article/details/104367280