AC automaton
Aho-Corasick
Overview
The AC automaton (Aho-Corasick automaton, also called the Aho-Corasick algorithm) is a well-known multi-pattern string matching algorithm.
Prerequisites
- Trie (important)
- KMP algorithm (understand the role of the Next array)
Typical example and complexity analysis
A typical problem: given a main string S and several pattern strings, count how many of the pattern strings occur in S.
With KMP, matching one pattern string of length m against a main string of length n costs O(n + m).
If KMP is applied to this problem directly, the matching process has to be repeated once for every pattern string.
With t pattern strings, the total complexity is O((n + m)t).
If the main string is long or there are many pattern strings, even KMP becomes too slow.
This is what the AC automaton is for: it obtains the answer in O(n + mt).
Of this, O(mt) is spent building the trie and O(n) is spent traversing the main string.
The running time therefore stays within a small range.
(It does inherit the trie's drawback of high space usage.)
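To get a feel for the gap, the two bounds can be evaluated at the upper limits of the HDU 2222 problem used later in this article (n = 10^6, m = 50, t = 10^4). Constant factors are ignored, so the numbers are only rough orders of magnitude.

```latex
\begin{aligned}
\text{KMP once per pattern:}\quad & (n+m)\,t = (10^{6}+50)\cdot 10^{4} \approx 10^{10} \\
\text{AC automaton:}\quad & n + m\,t = 10^{6} + 50\cdot 10^{4} = 1.5\times 10^{6}
\end{aligned}
```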
About the fail pointer
The defining feature of the AC automaton is that every trie node carries an extra pointer called fail.
Its role is very similar to that of the Next array in KMP.
In KMP, Next[j] gives the position in the pattern string from which matching continues after a mismatch at position j.
In the AC automaton, the fail pointer of each node points to the node from which matching continues after a mismatch at that node.
This is why the automaton can match the main string in O(n).
How the fail pointer is defined:
Suppose matching has reached some node nd of the trie, corresponding to position i of the main string, and the character at position i + 1 of the main string has no corresponding child under nd. A mismatch has then occurred at nd.
We now need the longest prefix of any pattern string that equals a proper suffix of the string spelled out from the root to nd.
The fail pointer of nd points to the node for the last character of that longest prefix.
It follows that, except in the special case where fail points to the root, a node and the node its fail pointer points to represent the same character.
Why must it be the longest prefix that equals a suffix?
This is the essence of KMP. The only difference is that here the suffix belongs to the string matched so far, while the prefix may belong to a different pattern string. Only by moving to the longest such prefix can we guarantee that no answer is missed. In KMP terms, the pattern string is shifted to the right by as little as possible so that no valid match is skipped.
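The definition above can be checked by brute force before looking at the efficient construction. The sketch below is only an illustration of the definition, not the way the automaton computes fail: the function name `longestSuffixPrefix` is a hypothetical helper, and it returns a string (empty meaning the root) instead of a node.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the article's code): returns the longest proper
// suffix of `path` that is also a prefix of at least one pattern string.
// An empty result corresponds to the root node.
std::string longestSuffixPrefix(const std::string& path,
                                const std::vector<std::string>& patterns) {
    // Try proper suffixes from longest to shortest and return the first hit.
    for (std::size_t start = 1; start < path.size(); ++start) {
        const std::string suffix = path.substr(start);
        for (const std::string& p : patterns) {
            // compare() only reports equality if p is at least as long as suffix
            // and its first suffix.size() characters are identical.
            if (p.compare(0, suffix.size(), suffix) == 0) {
                return suffix;
            }
        }
    }
    return "";
}
```

With the four words abs, abi, wasabi and binary used later in this article, the path "wasabi" gives "abi" and the path "ab" gives "b". This runs in time far worse than the BFS construction, so it is only useful for checking small cases by hand.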
Take the main string abcabcabcabc and the pattern string abcabcab as an example. There are 2 matching positions (i.e. answers):
Table 1
| a | b | c | a | b | c | a | b | c | a | b | c |
|---|---|---|---|---|---|---|---|---|---|---|---|
| a | b | c | a | b | c | a | b |   |   |   |   |
Table 2
| a | b | c | a | b | c | a | b | c | a | b | c |
|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   |   | a | b | c | a | b | c | a | b |   |
For the pattern string, two prefixes equal a suffix: ab and abcab
The longer one should be chosen. The shift is then 8 - 5 = 3, which moves the pattern from the alignment in Table 1 to the alignment in Table 2
If the shorter one were chosen, the shift would be 8 - 2 = 6 and the alignment would become
| a | b | c | a | b | c | a | b | c | a | b | c |   |   |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   |   |   |   |   |   | a | b | c | a | b | c | a | b |
The pattern string now extends beyond the end of the main string, so matching ends and only the match of Table 1 is recorded
This is why the Next array in KMP stores the length of the longest prefix that equals a suffix
The AC automaton looks for the longest prefix equal to a suffix for the same reason
For more details, see the fail pointer construction walk-through below
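The abcabcab example can be reproduced with a small KMP sketch. The function names `buildNext` and `kmpSearch` are illustrative choices, and the Next array here is the 0-indexed variant in which `next[i]` is the length of the longest proper prefix of `pattern[0..i]` that is also its suffix; other texts index it differently.

```cpp
#include <string>
#include <vector>

// next[i] = length of the longest proper prefix of pattern[0..i]
// that is also a suffix of pattern[0..i].
std::vector<int> buildNext(const std::string& pattern) {
    const int m = static_cast<int>(pattern.size());
    std::vector<int> next(m, 0);
    for (int i = 1, k = 0; i < m; ++i) {
        while (k > 0 && pattern[i] != pattern[k]) k = next[k - 1];
        if (pattern[i] == pattern[k]) ++k;
        next[i] = k;
    }
    return next;
}

// Returns the 0-based start position of every occurrence of pattern in text.
std::vector<int> kmpSearch(const std::string& text, const std::string& pattern) {
    std::vector<int> hits;
    if (pattern.empty()) return hits;
    const std::vector<int> next = buildNext(pattern);
    const int m = static_cast<int>(pattern.size());
    int k = 0;  // number of pattern characters currently matched
    for (int i = 0; i < static_cast<int>(text.size()); ++i) {
        while (k > 0 && text[i] != pattern[k]) k = next[k - 1];
        if (text[i] == pattern[k]) ++k;
        if (k == m) {
            hits.push_back(i + 1 - m);
            // Fall back to the longest border: a shift of m - next[m - 1].
            k = next[k - 1];
        }
    }
    return hits;
}
```

For the pattern abcabcab, the last entry of the Next array is 5 (the border abcab), and following it once more gives 2 (the border ab). Searching abcabcabcabc reports the start positions 0 and 3, which are the alignments of Table 1 and Table 2.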
About the code below
The code below uses HDU 2222 Keywords Search as the example
Problem statement:
The first line contains a number T, the number of test cases. Each test case starts with a number N (N <= 10000), the number of words used as pattern strings
Each of the next N lines contains one word of length at most 50
The last line is the main string, of length at most 1,000,000
For each test case, output how many of the pattern strings occur in the main string
(The given words may repeat, and repeated words count as different pattern strings: if such a word occurs in the main string, the answer increases by the number of times the word was given. A word that occurs in the main string more than once is only counted at its first occurrence.)
(Because there are many pattern strings, running KMP separately for each one is too slow here.)
fail pointer construction procedure (with diagrams)
Suppose we have the following four words
abs
abi
wasabi
binary
The trie built from them is shown below
Once the trie is built, the fail pointers can be constructed
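The trie for the four words above can be sketched with an index-based layout. This is an alternative to the pointer-based `node` struct used in this article's solution, chosen here only because it needs no manual memory management; the names `Trie`, `Node` and `insert` are assumptions made for the example, and only lowercase letters are handled.

```cpp
#include <array>
#include <string>
#include <vector>

// Index-based trie sketch: node 0 is the root, -1 means "no child".
struct Trie {
    struct Node {
        std::array<int, 26> next;
        int flag = 0;  // number of words ending at this node
        Node() { next.fill(-1); }
    };
    std::vector<Node> nodes;

    Trie() : nodes(1) {}  // start with the root only

    void insert(const std::string& word) {
        int cur = 0;
        for (char ch : word) {
            const int id = ch - 'a';  // assumes 'a'..'z' only
            if (nodes[cur].next[id] == -1) {
                // Record the new index first, then append the node.
                nodes[cur].next[id] = static_cast<int>(nodes.size());
                nodes.emplace_back();
            }
            cur = nodes[cur].next[id];
        }
        ++nodes[cur].flag;
    }
};
```

Inserting abs, abi, wasabi and binary produces 17 nodes including the root: abs and abi share the prefix ab, while wasabi and binary each start a new branch under the root.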
First consider the special cases:
By convention, the fail pointer of the root is NULL (this later marks the end of the iteration)
The fail pointer of every existing child of the root points to the root (a single letter has no proper suffix that could equal a prefix)
Then consider the general case during the iteration:
Suppose the node currently being processed is nd
If the length of the longest prefix equal to a suffix is 0, no such prefix exists and the fail pointer of nd should point to the root
If that length is 1, only the character represented by nd itself matches a prefix, so the fail pointer of nd should point to a child of the root. This happens when the iteration along the fail path of nd's parent reaches the root and the root has a child for that character
If that length is greater than 1, let c be the character represented by nd. As long as the fail pointer of nd's parent already points where it should, the fail pointer of nd can point directly to the child c of the parent's fail node, if that child exists. If that child is NULL (it does not exist), keep iterating along the fail path and check whether some node on it has a child c. If the iteration reaches the fail pointer of the root, which is NULL, it ends; in that case the fail pointer of nd must be set to the root explicitly rather than left as NULL
Making the iteration work
In other words, a deeper node builds its own fail pointer from the fail pointers of shallower nodes
Moreover, the fail pointer of every node points to a node of strictly smaller depth (one closer to the root)
So the fail pointers can be built with a breadth-first search that spreads out from the root
Handle the root and the first layer first, then build the remaining fail pointers
When the BFS visits the children of a node, it scans the characters from a to z in order and checks whether a child exists for each one
The first layer is handled by hand, after which all first-layer nodes are pushed into the queue
Look at the path root->a->b
While looping over the children of node 'a', we find a child for the character 'b' and process it
The parent of 'b' is 'a', and the fail pointer of 'a' points to the root
So we take the root, which is the fail node of b's parent, and check whether it has a child for the character 'b'
It does: 'b' is the first character on the path of "binary". So the fail pointer of the 'b' in root->a->b points to the first 'b' of "binary"
Once its fail pointer is built, push this 'b' into the queue and continue with the next node, the 'a' of root->w->a
When the queue is empty, all fail pointers in the diagram have been built
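The BFS walk-through above can be sketched on an index-based trie. This is an illustration under assumptions, not the article's pointer-based code: the root is index 0 and also serves as the "no fail" marker instead of NULL, each node stores the string spelled from the root purely for inspection, and `failTable` and `FailNode` are hypothetical names.

```cpp
#include <array>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct FailNode {
    std::array<int, 26> next;  // child indices, -1 if absent
    int fail = 0;              // index of the fail node, 0 is the root
    std::string path;          // string from the root, kept only for inspection
    FailNode() { next.fill(-1); }
};

// Builds the trie and its fail pointers, then returns a map from each node's
// path to the path of the node its fail pointer points to ("" is the root).
std::map<std::string, std::string> failTable(const std::vector<std::string>& words) {
    std::vector<FailNode> t(1);
    for (const std::string& w : words) {
        int cur = 0;
        for (char ch : w) {
            const int id = ch - 'a';  // assumes 'a'..'z' only
            if (t[cur].next[id] == -1) {
                t[cur].next[id] = static_cast<int>(t.size());
                t.emplace_back();
                t.back().path = t[cur].path + ch;
            }
            cur = t[cur].next[id];
        }
    }
    std::queue<int> q;
    // First layer: fail already defaults to the root, so just enqueue.
    for (int c = 0; c < 26; ++c)
        if (t[0].next[c] != -1) q.push(t[0].next[c]);
    while (!q.empty()) {
        const int u = q.front();
        q.pop();
        for (int c = 0; c < 26; ++c) {
            const int v = t[u].next[c];
            if (v == -1) continue;
            // Climb the parent's fail path until a node with child c is found
            // or the root is reached.
            int f = t[u].fail;
            while (f != 0 && t[f].next[c] == -1) f = t[f].fail;
            if (t[f].next[c] != -1) f = t[f].next[c];
            t[v].fail = f;  // stays 0 (root) if no node on the path had child c
            q.push(v);
        }
    }
    std::map<std::string, std::string> result;
    for (std::size_t i = 1; i < t.size(); ++i) result[t[i].path] = t[t[i].fail].path;
    return result;
}
```

For the four example words, the entry for "ab" is "b" (the first 'b' of binary, as in the walk-through), the entry for "wasabi" is "abi", and the entry for "abi" is "bi". Every fail target is strictly shorter than its source path, which is the depth property the BFS order relies on.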
Implementing the matching function
Once the trie and its fail pointers are built, the main string can be matched against the pattern strings
For every word, the flag of the node at its last character is incremented by 1 (flag is initially 0)
So the flag of each node is the number of words that end at that node (0 if none)
The pattern strings are matched as a whole through the trie: start at the root and at the first character of the main string
Take the main string "wasabinaryabi" as an example
The first character of the main string is 'w', and the root has a child for 'w', so the trie pointer moves from the root to 'w'
This continues down one path up to the sixth character 'i'. The flag of that 'i' node is 1, so the main string has matched this word; after its flag is added to the answer, the node is marked so that the word is not counted again
Next, the node that the fail pointer of this 'i' points to is processed. As shown in the diagram, this is the last node of root->a->b->i, also an 'i'
Its flag is 1, so another word has matched; add it to the answer and mark the node as well
Then continue along the fail pointers of this 'i'. The middle part of the main string, "inary", leads down to the 'y' at the lower right of the diagram; again the flag is added to the answer and the node is marked
The fail pointer of 'y' points straight to the root, so matching of the main string continues from the root
Finally "abi" is matched, but its 'i' node has already been counted, so it contributes nothing to the answer
After the main string has been processed, the output is 3: the words matched along the way are wasabi, abi and binary
Note that every time a node is matched, the fail path starting at that node must be walked and the flags of all nodes on it added to the answer. Otherwise the following situation goes wrong
The diagram contains two words: abcde and bcd
This is a special case in which one word lies strictly inside another word
Suppose the main string is "abcde". If the fail path is not walked at every node, the only word-ending node ever encountered is the final 'e', and the answer is 1
But bcd is clearly also contained in "abcde", so the fail path must be walked at every node
If it is, then when the 'd' node of abcde is processed, its fail pointer points to the 'd' of "bcd", and the flag of "bcd" is added to the answer
For the detailed matching code, see the Code section below
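Both walk-throughs above (wasabinaryabi, and abcde with bcd inside it) can be checked with a compact sketch. It is written under assumptions that differ from the article's solution: the trie is index-based with the root at 0, a separate `counted` marker records nodes already covered by a fail-path walk, and `countWords`, `AcNode` and the `walkFailPath` switch are hypothetical names. Setting `walkFailPath` to false reproduces the incorrect shortcut of looking only at the node reached by each character.

```cpp
#include <array>
#include <queue>
#include <string>
#include <vector>

struct AcNode {
    std::array<int, 26> next;  // child indices, -1 if absent
    int fail = 0;              // index of the fail node, 0 is the root
    int flag = 0;              // number of words ending at this node
    bool counted = false;      // set once a fail-path walk has passed through
    AcNode() { next.fill(-1); }
};

// Counts how many of `words` occur in `text` (lowercase letters assumed).
int countWords(const std::vector<std::string>& words, const std::string& text,
               bool walkFailPath = true) {
    std::vector<AcNode> t(1);
    for (const std::string& w : words) {  // build the trie
        int cur = 0;
        for (char ch : w) {
            const int id = ch - 'a';
            if (t[cur].next[id] == -1) {
                t[cur].next[id] = static_cast<int>(t.size());
                t.emplace_back();
            }
            cur = t[cur].next[id];
        }
        ++t[cur].flag;
    }
    std::queue<int> q;  // build fail pointers by BFS
    for (int c = 0; c < 26; ++c)
        if (t[0].next[c] != -1) q.push(t[0].next[c]);
    while (!q.empty()) {
        const int u = q.front();
        q.pop();
        for (int c = 0; c < 26; ++c) {
            const int v = t[u].next[c];
            if (v == -1) continue;
            int f = t[u].fail;
            while (f != 0 && t[f].next[c] == -1) f = t[f].fail;
            if (t[f].next[c] != -1) f = t[f].next[c];
            t[v].fail = f;
            q.push(v);
        }
    }
    int ans = 0, cur = 0;
    for (char ch : text) {  // match the main string
        const int id = ch - 'a';
        while (cur != 0 && t[cur].next[id] == -1) cur = t[cur].fail;
        if (t[cur].next[id] != -1) cur = t[cur].next[id];
        // Walk the fail path, stopping at the root or at a node walked before.
        for (int v = cur; v != 0 && !t[v].counted; v = t[v].fail) {
            ans += t[v].flag;
            t[v].counted = true;
            if (!walkFailPath) break;  // shortcut: look at the current node only
        }
    }
    return ans;
}
```

With the words abs, abi, wasabi and binary and the main string wasabinaryabi the result is 3. With the words abcde and bcd and the main string abcde the result is 2, but only 1 when `walkFailPath` is false, because bcd is reachable only through the fail pointer of the 'd' node of abcde.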
Code
First, the node structure and the global variables are defined as follows
struct node
{
    int flag;               // number of words ending at this node
    node *next[26],*fail;   // children for 'a'..'z' and the fail pointer
};
node *root;
char str[1000050];          // input buffer, reused for every word and for the main string
Here flag records the number of words that end at the current node
flag and the next array have the same meaning as in an ordinary trie
Building the trie: addNode
void addNode()
{
    node *nd=root;
    int i,id;
    for(i=0;str[i]!='\0';i++)
    {
        id=str[i]-'a';
        if(nd->next[id]==NULL) // create the child if it does not exist yet
        {
            nd->next[id]=new node;
            nd->next[id]->flag=0;
            for(int j=0;j<26;j++)
                nd->next[id]->next[j]=NULL;
        }
        nd=nd->next[id];
    }
    nd->flag++; // one more word ends at this node
}
This is the same as building an ordinary trie; the fail pointers do not need to be initialized at this point
Building the fail pointers: buildFailPointer
void buildFailPointer()
{
    queue<node*> q; // BFS queue
    root->fail=NULL; // the fail pointer of the root is NULL
    for(int i=0;i<26;i++)
    {
        if(root->next[i]!=NULL) // every existing first-layer node gets fail = root and is pushed into the queue
        {
            root->next[i]->fail=root;
            q.push(root->next[i]);
        }
    }
    while(!q.empty())
    {
        node *nd=q.front();
        q.pop();
        for(int i=0;i<26;i++)
        {
            if(nd->next[i]!=NULL) // if this child exists
            {
                node *tmp=nd->fail; // tmp starts at the fail pointer of nd, the parent of nd->next[i]
                while(tmp!=NULL) // iterate along the fail path
                {
                    if(tmp->next[i]!=NULL) // stop once a node on the path has a child for the same character as nd->next[i]
                    {
                        nd->next[i]->fail=tmp->next[i]; // the fail pointer of the current node points to that child
                        break;
                    }
                    tmp=tmp->fail; // no such child: keep following fail pointers
                }
                if(tmp==NULL) // the iteration passed the root without success: no prefix equals the current suffix, so fail points to the root
                    nd->next[i]->fail=root;
                q.push(nd->next[i]); // push into the queue
            }
        }
    }
}
Since the node being processed is nd->next[i], nd is its parent node
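Matching the main string against the trie and counting the matched words: query
int query()
{
    node *nd=root,*tmp;
    int ans=0,i,id;
    for(i=0;str[i]!='\0';i++)
    {
        id=str[i]-'a';
        while(nd->next[id]==NULL&&nd!=root) // mismatch at nd: follow fail pointers until a child id exists or the root is reached
            nd=nd->fail;
        if(nd->next[id]!=NULL) // at the root, move only if the child exists; otherwise stay at the root
            nd=nd->next[id];
        tmp=nd; // walk the fail path from nd to count every word that ends at the current position
        while(tmp!=root&&tmp->flag!=-1) // -1 marks a node whose fail path has already been walked
        {
            ans+=tmp->flag;
            tmp->flag=-1; // mark as visited so that no word is counted twice
            tmp=tmp->fail;
        }
    }
    return ans;
}
nd is the trie node currently pointed to; it corresponds to the cursor j into the pattern string in KMP
While walking a fail path, a node whose flag is -1 has already been walked through together with the rest of its fail path, so the walk can stop there and save time
A flag of 0 alone is not a safe place to stop: a node where no word ends can still have word-ending nodes further along its fail path, as in the abcde and bcd example. This is why visited nodes are marked with -1 rather than 0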
Complete code for HDU 2222
#include<cstdio>
#include<queue>
using namespace std;
struct node
{
    int flag;               // number of words ending here, -1 once visited by query
    node *next[26],*fail;
};
node *root;
char str[1000050];
void addNode()
{
    node *nd=root;
    int i,id;
    for(i=0;str[i]!='\0';i++)
    {
        id=str[i]-'a';
        if(nd->next[id]==NULL)
        {
            nd->next[id]=new node;
            nd->next[id]->flag=0;
            for(int j=0;j<26;j++)
                nd->next[id]->next[j]=NULL;
        }
        nd=nd->next[id];
    }
    nd->flag++;
}
void buildFailPointer()
{
    queue<node*> q;
    root->fail=NULL;
    for(int i=0;i<26;i++)
    {
        if(root->next[i]!=NULL)
        {
            root->next[i]->fail=root;
            q.push(root->next[i]);
        }
    }
    while(!q.empty())
    {
        node *nd=q.front();
        q.pop();
        for(int i=0;i<26;i++)
        {
            if(nd->next[i]!=NULL)
            {
                node *tmp=nd->fail;
                while(tmp!=NULL)
                {
                    if(tmp->next[i]!=NULL)
                    {
                        nd->next[i]->fail=tmp->next[i];
                        break;
                    }
                    tmp=tmp->fail;
                }
                if(tmp==NULL)
                    nd->next[i]->fail=root;
                q.push(nd->next[i]);
            }
        }
    }
}
int query()
{
    node *nd=root,*tmp;
    int ans=0,i,id;
    for(i=0;str[i]!='\0';i++)
    {
        id=str[i]-'a';
        while(nd->next[id]==NULL&&nd!=root)
            nd=nd->fail;
        if(nd->next[id]!=NULL)
            nd=nd->next[id];
        tmp=nd;
        while(tmp!=root&&tmp->flag!=-1) // stop at the root or at a node already walked
        {
            ans+=tmp->flag;
            tmp->flag=-1; // mark as visited; stopping at flag 0 would miss words such as bcd inside abcde
            tmp=tmp->fail;
        }
    }
    return ans;
}
void solve()
{
    root=new node;
    root->flag=0;
    root->fail=NULL;
    for(int i=0;i<26;i++)
        root->next[i]=NULL;
    int n;
    scanf("%d",&n);
    while(n--)
    {
        scanf("%s",str);
        addNode();
    }
    buildFailPointer();
    scanf("%s",str);
    printf("%d\n",query());
}
int main()
{
    int T;
    scanf("%d",&T);
    while(T--)
        solve();
    return 0;
}
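The complete code allocates a new trie for every test case and never releases the old one. Where memory matters, the nodes could be freed at the end of each test case. The sketch below is a hypothetical addition, not part of the original solution: `freeTrie` is an assumed name, and the node struct is repeated only to keep the example self-contained.

```cpp
#include <cstddef>

struct node
{
    int flag;
    node *next[26], *fail;
};

// Hypothetical helper: deletes every node reachable through next[] and returns
// how many nodes were freed. fail pointers are not followed because they only
// alias nodes that are already owned through next[].
int freeTrie(node *nd)
{
    if (nd == NULL)
        return 0;
    int freed = 1;
    for (int i = 0; i < 26; i++)
        freed += freeTrie(nd->next[i]);
    delete nd;
    return freed;
}
```

One way to use it would be to call `freeTrie(root)` after printing the answer in `solve()` and then set `root` to NULL. The recursion depth equals the length of the longest word, which is at most 50 in this problem.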