AC automaton explained + [HDU2222] Keywords Search (AC automaton)

First of all, consider this question:

Given a word W and a text T, how many times does W occur in T? (See the original problem, POJ3461.)

OK, so easy~

Hashing or KMP solves it easily.
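
As a reminder of the single-pattern case, here is a minimal KMP sketch that counts the (possibly overlapping) occurrences of W in T. The function name countOccurrences and the non-empty-word requirement are my own assumptions, not part of the original problem statement.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Counts (possibly overlapping) occurrences of w in t using KMP.
// Hypothetical helper for illustration only.
int countOccurrences(const std::string& w, const std::string& t) {
    assert(!w.empty()); // assumption: the word is non-empty
    const int m = static_cast<int>(w.size());

    // nxt[i] = length of the longest proper prefix of w[0..i]
    // that is also a suffix of w[0..i].
    std::vector<int> nxt(m, 0);
    for (int i = 1, k = 0; i < m; ++i) {
        while (k > 0 && w[i] != w[k]) k = nxt[k - 1];
        if (w[i] == w[k]) ++k;
        nxt[i] = k;
    }

    int count = 0;
    int k = 0; // number of characters of w currently matched
    for (char c : t) {
        while (k > 0 && c != w[k]) k = nxt[k - 1]; // mismatch: fall back
        if (c == w[k]) ++k;
        if (k == m) {          // full match ending at this character
            ++count;
            k = nxt[k - 1];    // keep going, allowing overlaps
        }
    }
    return count;
}
```

For instance, countOccurrences("AZA", "AZAZAZA") returns 3, since the word starts at positions 0, 2 and 4.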

Then there is this one:

Given n query words, each made of at most 50 lowercase letters, and an article of length m, how many of the query words appear in the article? (See the original problem, POJ3630.)

OK, still so easy~

A Trie solves it easily.

Well, if you are asking what KMP and Trie are, then congratulations......

I suggest you read this blog first:


 

If you have read this far, you have mastered Trie and KMP.

So now let me throw a question at you:

Given n query words, each made of at most 50 lowercase letters, and an article of length m, how many of the query words appear in the article? There are multiple test cases.

 

Well, this is where a new algorithm comes in: the AC automaton.

You experts out there have surely heard of the term.

First, a brief overview of the AC automaton (Aho-Corasick automaton; the AC does not stand for Accepted).
Like former teammate when I gzh teacher as (like pressure or hit the duck?).

The algorithm was born at Bell Labs in 1975 and is one of the best-known multi-pattern matching algorithms.

To understand the AC automaton, you first need the basics of the pattern tree (Trie) and of the KMP pattern matching algorithm.

The difference between KMP and the AC automaton is that the former does single-pattern matching, while the latter does multi-pattern matching.

Intuitively, single-pattern matching means there is only one pattern string to look for, while multi-pattern matching means there is more than one.


 

The main steps, three in all:

① Build a Trie from all the pattern strings.

② Construct the prefix pointer (the Fail pointer) for every node in the Trie.

③ Match the main string using the Fail pointers.

In fact, the Fail pointer is very similar to the nxt array in KMP, so the AC automaton can be seen as a combination of Trie and KMP (KMP run on a Trie).

Step one: building the Trie. Surely we all know how (if not, you would not have read this far), so I won't repeat it here.
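
For readers who want a quick refresher anyway, here is a minimal sketch of step one. The struct name Trie, the field wordEnd and the choice of node 0 as the root are assumptions made for this illustration; the full solution further down uses plain arrays with node 1 as the root.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Array-based Trie over 'a'..'z'. Node 0 is the root; a child index of 0
// means "no edge", which is safe because the root is never anyone's child.
struct Trie {
    std::vector<std::array<int, 26>> child; // child[u][c]: node reached from u by letter c
    std::vector<int> wordEnd;               // wordEnd[u]: number of words ending at node u

    Trie() : child(1), wordEnd(1, 0) {}     // start with a zero-filled root

    void insert(const std::string& word) {
        int u = 0;
        for (char ch : word) {
            assert(ch >= 'a' && ch <= 'z'); // assumption: lowercase input only
            int c = ch - 'a';
            if (child[u][c] == 0) {
                // Allocate a fresh node with no outgoing edges.
                child[u][c] = static_cast<int>(child.size());
                child.push_back(std::array<int, 26>{});
                wordEnd.push_back(0);
            }
            u = child[u][c];
        }
        ++wordEnd[u]; // a counter, so repeated words are counted separately
    }

    int size() const { return static_cast<int>(child.size()); }
};
```

Inserting "he", "her" and "he" again creates the root plus three nodes, with a counter of 2 on the node where "he" ends.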

Step two:

Finding the Fail pointers:

As we all know, in KMP the nxt array tells us where to resume matching after a mismatch. The AC automaton's counterpart to the nxt array is the Fail pointer: when a character fails to match, we jump to the node the Fail pointer points to and try to match again. The Fail pointers are what allow the AC automaton to do multi-pattern matching.

So how do we compute the Fail pointers?

With a breadth-first search (BFS).

For the nodes directly connected to the root, a mismatch sends us straight back to root, so their Fail pointers point to root.

For every other node, the Fail pointer is found as follows:
Let the current node be Father and its child node be Child.

To find Child's Fail pointer, take the node F that Father's Fail pointer points to and look among F's children for one that carries the same letter as Child. If there is one, that node is Child's Fail pointer. If not, move on to the node that F's Fail pointer points to and look again. If nothing is found all the way up, Child's Fail pointer points to root.

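An example (to help understanding):

As shown in the figure, root is enqueued first and then dequeued, and we point the Fail pointers of all of root's children at root. So the Fail pointers of h and s both point to root (the red lines), and h and s are enqueued.

Next, h is dequeued and we look for the Fail pointers of h's children. Node h's Fail pointer points to root, and root has no child with the character e, so e's Fail pointer would be empty; an empty pointer is also set to root (the blue line in the figure). e is enqueued. Now s is dequeued and we look for the Fail pointers of its children a and h. s's Fail pointer points to root, and root has no child with the character a, so a's Fail pointer points to root, and a is enqueued. Then for h: s's Fail pointer is again root, and this time root does have a child with the character h, so this h's Fail pointer points to the h node on the second level. The Fail pointers of e, a and h are drawn as blue lines in the figure.

The queue now holds e, a, h. e is dequeued first and we look for the Fail pointer of its child r. e's Fail pointer leads to root, and root has no child with the character r, so r's Fail pointer points to root, and r is enqueued. Then a is dequeued; its Fail pointer is also root, so y's Fail pointer points to root, and y is enqueued. Then h is dequeued. For its child e, we look at h's Fail pointer, which points to the second-level h node; that node has a child with the character e, so the Fail pointer of the e in the bottom row points to the e on the third level. Finally, for h's other child r, we again look at the second-level h node: it has no child with the character r, so we follow its Fail pointer to the root, and the root has no child with the character r either, so r ends up pointing to the root. The Fail pointers of the bottom-row nodes are drawn as green dashed lines.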
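
The BFS rule above can be sketched as follows. This is a direct transcription of the "walk up the Fail chain" rule, not the optimized version used in the full solution later on. The names Automaton and buildFail, the choice of node 0 as the root, and the word set say, she, shr, he, her (reconstructed from the paths mentioned in the walkthrough, since the figure itself is not available) are all assumptions.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <string>
#include <vector>

struct Automaton {
    std::vector<std::array<int, 26>> child; // 0 = no edge; node 0 is the root
    std::vector<int> fail;                  // fail[u] = node u's Fail pointer

    Automaton() : child(1), fail(1, 0) {}

    // Inserts a word and returns the node where it ends.
    int insert(const std::string& word) {
        int u = 0;
        for (char ch : word) {
            assert(ch >= 'a' && ch <= 'z'); // assumption: lowercase input only
            int c = ch - 'a';
            if (child[u][c] == 0) {
                child[u][c] = static_cast<int>(child.size());
                child.push_back(std::array<int, 26>{});
                fail.push_back(0);
            }
            u = child[u][c];
        }
        return u;
    }

    void buildFail() {
        std::queue<int> q;
        // Children of the root: their Fail pointers point to the root.
        for (int c = 0; c < 26; ++c)
            if (child[0][c]) { fail[child[0][c]] = 0; q.push(child[0][c]); }
        while (!q.empty()) {
            int father = q.front(); q.pop();
            for (int c = 0; c < 26; ++c) {
                int ch = child[father][c];
                if (!ch) continue;
                // Start at Father's Fail node F and climb until some node
                // has a child with the same letter, or we reach the root.
                int f = fail[father];
                while (f != 0 && child[f][c] == 0) f = fail[f];
                fail[ch] = child[f][c]; // stays 0 (root) if nothing was found
                q.push(ch);
            }
        }
    }
};
```

With the five assumed words, the node ending she gets the node ending he as its Fail pointer, while the nodes ending say, shr and her all point to the root, which agrees with the walkthrough.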
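Step three:

Matching the text string:

There are two cases during matching:

(1) The current character matches, that is, a tree edge leads from the current node to the target character. Move along that edge, then follow the Fail pointers from the node just reached all the way up to the root. Every node on the way that carries an end-of-word mark is a word that has been matched. Once they are counted, mark those nodes as visited. After that, simply continue from the node just reached, with the text pointer moving on to the next character.

(2) The current character does not match. Go to the node that the current node's Fail pointer points to and try to match there; this stops when the pointer reaches root.

Repeat whichever of the two cases applies until the end of the text string is reached.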

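Example:

Take the same figure again:

 

Suppose the text string is yasherhs. For i=0,1 there is no corresponding path in the Trie, so nothing is done. For i=2,3,4 the pointer p walks down to the lower-left node e (the end of she). That node's count is 1, so cnt is increased by 1, and the node's count is set to -1 to record that this word has already been seen, which prevents double counting. Then temp moves to the node that e's Fail pointer points to and the search continues in the same way, until temp reaches root and the while loop exits. In this process cnt has grown by 2: the two words she and he were found. When i=5, r does not match at p, so p moves to the node its Fail pointer points to, which is the e node on the right (the end of he), and from there steps to node r. Node r's count is 1, so cnt is increased by 1, and the loop again runs until temp reaches root. Finally, for i=6,7 no match is found, and the matching ends.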
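
Putting the three steps together, here is a compact sketch that reproduces the example. The function name countMatchedWords, the choice of node 0 as the root, and the word set say, she, shr, he, her (reconstructed from the walkthrough, since the figure is not available) are assumptions; the variable names p, temp, cnt and count follow the description above.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Returns how many of the inserted words occur in text.
// Each inserted word is counted once; repeated words count separately.
int countMatchedWords(const std::vector<std::string>& words, const std::string& text) {
    std::vector<std::array<int, 26>> child(1); // node 0 = root, 0 = no edge
    std::vector<int> fail(1, 0), count(1, 0);

    // Step 1: build the Trie.
    for (const std::string& w : words) {
        int u = 0;
        for (char ch : w) {
            assert(ch >= 'a' && ch <= 'z'); // assumption: lowercase input only
            int c = ch - 'a';
            if (!child[u][c]) {
                child[u][c] = static_cast<int>(child.size());
                child.push_back(std::array<int, 26>{});
                fail.push_back(0);
                count.push_back(0);
            }
            u = child[u][c];
        }
        ++count[u];
    }

    // Step 2: build the Fail pointers by BFS.
    std::queue<int> q;
    for (int c = 0; c < 26; ++c)
        if (child[0][c]) q.push(child[0][c]);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (int c = 0; c < 26; ++c) {
            int v = child[u][c];
            if (!v) continue;
            int f = fail[u];
            while (f && !child[f][c]) f = fail[f];
            fail[v] = child[f][c];
            q.push(v);
        }
    }

    // Step 3: scan the text.
    int cnt = 0, p = 0;
    for (char ch : text) {
        assert(ch >= 'a' && ch <= 'z');
        int c = ch - 'a';
        while (p && !child[p][c]) p = fail[p]; // case (2): follow Fail on a mismatch
        p = child[p][c];                       // case (1): take the edge (or stay at root)
        // Walk the Fail chain; a node marked -1 means the rest was already counted.
        for (int temp = p; temp && count[temp] != -1; temp = fail[temp]) {
            cnt += count[temp];
            count[temp] = -1;
        }
    }
    return cnt;
}
```

Called with the five assumed words and the text yasherhs, the function returns 3 (she, he and her), which matches the trace above.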

Summary:

Three steps: build a Trie, build the Fail pointers, and run the pattern matching.

Fail pointer ≈ nxt array.

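Practice problems:

"Yi Ben Tong" 2.4 Example 1: Keywords Search
"Yi Ben Tong" 2.4 Exercise 1: 玄武密码 (Xuanwu Cipher)
"Yi Ben Tong" 2.4 Exercise 3: 单词 (Words)
"Yi Ben Tong" 2.4 Exercise 5: 病毒 (Virus)
"Yi Ben Tong" 2.4 Exercise 4: 最短母串 (Shortest Superstring)
"Yi Ben Tong" 2.4 Exercise 6: 文本生成器 (Text Generator)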


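Example 1 explained:

Now let me go through the first AC automaton example problem, Keywords Search.

Problem link

This is a template problem. In fact everything above was explained with this problem in mind, so I won't repeat it here and will mainly go through the code.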

 

#include<bits/stdc++.h>
using namespace std;
int n;
char s[2000001];
int trie[1000001][30];//node 1 is the root, node 0 is a virtual node above it
int que[1000001],endCnt[1000001],nxt[1000001];//not named "end": with using namespace std that name clashes with std::end
int ans,cnt;
void insert(char *str)//build the Trie 
{
	int p=1;
	int len=strlen(str);
	for(int i=0;i<len;i++)
	{
		int ch=str[i]-'a';
		if(!trie[p][ch])
		{
			trie[p][ch]=++cnt;
			memset(trie[cnt],0,sizeof(trie[cnt]));//clear only the rows we are actually going to use 
		}
		p=trie[p][ch];
	}
	endCnt[p]++;//the same word may appear more than once, so count how many words end here rather than whether one does 
}
void build()//build the Fail pointers by BFS 
{
	for(int i=0;i<26;i++)//for convenience, every transition out of node 0 leads to the root, node 1 
		trie[0][i]=1;
	nxt[1]=0;//a mismatch at the root means the character cannot be matched at all 
	que[1]=1;
	int head=1,tail=1;
	while(head<=tail)
	{
		for(int i=0;i<26;i++)
			if(!trie[que[head]][i])trie[que[head]][i]=trie[nxt[que[head]]][i];//note this line; it is explained in detail below 
			else
			{
				que[++tail]=trie[que[head]][i];
				//the node nxt[que[head]] was dequeued earlier, so all its transitions are already filled in
				//and no loop up the Fail chain is needed here 
				nxt[trie[que[head]][i]]=trie[nxt[que[head]]][i];
			}
		head++;//do not forget to advance the queue head 
	}
}
void find(char *str)//matching 
{
	int p=1;
	int len=strlen(str);
	for(int i=0;i<len;i++)
	{
		int flag=p=trie[p][str[i]-'a'];
		while(flag&&endCnt[flag]!=-1)
		{
			ans+=endCnt[flag];
			endCnt[flag]=-1;//mark this node as visited so that it is never counted again 
			flag=nxt[flag];
		}
	}
}
int main()
{
	int T;
	scanf("%d",&T);
	while(T--)
	{
		memset(endCnt,0,sizeof(endCnt));//forget to clear between test cases and you score zero (don't cry) 
		cnt=1;
		ans=0;
		for(int i=0;i<26;i++)
			trie[0][i]=1,trie[1][i]=0;//this is clearing too 
		scanf("%d",&n);
		for(int i=1;i<=n;i++)
		{
			scanf("%s",s);
			insert(s);//read a pattern string and insert it into the Trie 
		}
		build();
		scanf("%s",s);
		find(s);//matching 
		printf("%d\n",ans);
	}
	return 0;
}

 

Now let me explain the point flagged in the code above:

When the node que[head] (written u below) has no transition edge i, we set trie[u][i] to trie[nxt[u]][i]. This does not conform to the structure of a Trie, yet it is correct in the code. Why?

This is in fact a time optimization: a missing transition trie[u][i] is pointed at trie[nxt[u]][i]. In a concrete problem, when u has no transition on i, we would normally have to walk along the Fail pointers of que[head] until we reach the first node v that does have a transition on i, and then take trie[v][i]. Assigning trie[v][i], which is the same as trie[nxt[u]][i], directly to trie[u][i] does that walk once, in advance; it is a time optimization for this whole class of problems. For the same reason, the Fail pointer construction does not handle the case where v has no transition on i. It simply sets nxt[trie[u][i]] = trie[v][i] (trie[v][i] has already been filled in by then).
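
To make the optimization concrete, here is a small sketch of the same idea in isolation. The names TrieGraph, go and run are assumptions. Unlike the code above, the sketch uses node 0 as the root and enqueues the root's children directly instead of relying on a virtual node. The word set is the one assumed for the earlier example.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <string>
#include <vector>

struct TrieGraph {
    // Before build(): Trie edges, 0 = no edge. After build(): every entry
    // is a valid state, so matching never has to loop over Fail pointers.
    std::vector<std::array<int, 26>> go;
    std::vector<int> fail;

    TrieGraph() : go(1), fail(1, 0) {}

    // Inserts a word and returns the node where it ends.
    int insert(const std::string& word) {
        int u = 0;
        for (char ch : word) {
            assert(ch >= 'a' && ch <= 'z'); // assumption: lowercase input only
            int c = ch - 'a';
            if (!go[u][c]) {
                go[u][c] = static_cast<int>(go.size());
                go.push_back(std::array<int, 26>{});
                fail.push_back(0);
            }
            u = go[u][c];
        }
        return u;
    }

    void build() {
        std::queue<int> q;
        for (int c = 0; c < 26; ++c)
            if (go[0][c]) q.push(go[0][c]); // root children keep fail = 0
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int c = 0; c < 26; ++c) {
                int v = go[u][c];
                // fail[u] is shallower than u, so its row is already complete.
                if (v) { fail[v] = go[fail[u]][c]; q.push(v); }
                else   go[u][c] = go[fail[u]][c]; // borrow the transition
            }
        }
    }

    // State reached from the root after reading text: one lookup per character.
    int run(const std::string& text) const {
        int p = 0;
        for (char ch : text) {
            assert(ch >= 'a' && ch <= 'z');
            p = go[p][ch - 'a'];
        }
        return p;
    }
};
```

After build(), the node ending she has a transition on r that leads straight to the node ending her, which is exactly where the Fail-pointer walk in the example ends up.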


rp ++

 

Reproduced from: https://www.cnblogs.com/wzc521/p/11021155.html


Origin blog.csdn.net/weixin_33693070/article/details/93149834