[Algorithm] Aho-Corasick automaton (AC automaton): fast multi-pattern string matching


AC automaton

Not "Accepted", but Aho-Corasick


Overview

The AC automaton (Aho-Corasick automaton, also called the AC algorithm) is a well-known multi-pattern string matching algorithm.


Prerequisites

  1. Trie (important)
  2. KMP algorithm (understand the role of the Next array)

Typical example and complexity analysis

A typical problem: given a main string S and several pattern strings T, ask how many of the given patterns occur in S.

With the KMP algorithm, matching a main string of length n against one pattern string of length m costs O(n + m).

If KMP is applied directly to this kind of problem, the matching has to be run once for every pattern string.

With t pattern strings, the complexity is O((n + m)t).

If the main string is long or there are many pattern strings, even KMP becomes too slow.

This is why the AC automaton exists: it obtains the answer in O(n + mt).

Here O(mt) is spent building the trie and O(n) is spent traversing the main string.

So its running time stays within a small range.

(It does inherit the trie's drawback of high space usage, though.)
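To make the O((n + m)t) baseline concrete, here is a minimal sketch of the naive approach, which runs one KMP pass per pattern string. The names buildNext, kmpContains and countByKmp are illustrative and do not come from the original solution.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Next array of KMP: nxt[i] is the length of the longest proper prefix of
// p[0..i] that is also a suffix of p[0..i].
std::vector<int> buildNext(const std::string& p) {
    int m = static_cast<int>(p.size());
    std::vector<int> nxt(m, 0);
    for (int i = 1, k = 0; i < m; ++i) {
        while (k > 0 && p[i] != p[k]) k = nxt[k - 1];
        if (p[i] == p[k]) ++k;
        nxt[i] = k;
    }
    return nxt;
}

// One KMP pass: does pattern p occur in text s? Costs O(n + m).
bool kmpContains(const std::string& s, const std::string& p) {
    if (p.empty()) return true;
    std::vector<int> nxt = buildNext(p);
    int m = static_cast<int>(p.size());
    int k = 0;  // number of pattern characters currently matched
    for (char c : s) {
        while (k > 0 && c != p[k]) k = nxt[k - 1];
        if (c == p[k]) ++k;
        if (k == m) return true;
    }
    return false;
}

// Naive multi-pattern counting: t patterns cost O((n + m) * t) in total.
int countByKmp(const std::string& s, const std::vector<std::string>& patterns) {
    int found = 0;
    for (const std::string& p : patterns)
        if (kmpContains(s, p)) ++found;
    return found;
}
```

Every call to kmpContains rescans the whole main string, which is exactly the repeated work that the AC automaton avoids.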


About the fail pointer

The most distinctive feature of the AC automaton is that every node has an extra pointer called fail.

The role of this pointer is very similar to that of the Next array in the KMP algorithm.

Put simply, in KMP Next[j] gives the position in the pattern string to continue matching from after a mismatch at position j.

In the AC automaton, the fail pointer of each node points to the next node to continue matching from.

This is why the automaton can match the main string in O(n).


How the fail pointer is defined:

  Suppose matching has reached a node nd in the trie, corresponding to position i of the main string, and the character at position i + 1 of the main string has no corresponding child under nd. Then a mismatch occurs at nd.

  At this point we need to find the longest prefix of any pattern string that equals a suffix of the string spelled from the root to nd.

  The fail pointer of nd should point to the node corresponding to the last character of that longest prefix.

Notice that, except in special cases (when fail points to the root), a node and the node its fail pointer points to represent the same character.
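The definition above can be checked by brute force. The sketch below is only an illustration of the definition, not the way the automaton actually computes fail; the function name failTarget is an assumption made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// For the string spelled from the root to a node, return the longest proper
// suffix that is also a prefix of some pattern. The node reached by spelling
// that suffix from the root is where the fail pointer should point; an empty
// result means the fail pointer goes to the root.
std::string failTarget(const std::string& path,
                       const std::vector<std::string>& patterns) {
    // Suffixes are tried from longest to shortest, so the first hit is the longest.
    for (std::size_t start = 1; start < path.size(); ++start) {
        std::string suffix = path.substr(start);
        for (const std::string& p : patterns)
            if (p.compare(0, suffix.size(), suffix) == 0) return suffix;
    }
    return "";
}
```

With the words abs, abi, wasabi and binary, failTarget("wasab", words) gives "ab" and failTarget("ab", words) gives "b", the first letter of "binary". In both cases the node and its fail target end in the same character.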


Why do we look for the longest prefix equal to a suffix?

  This is the essence of the KMP algorithm, except that here the suffix is a suffix of the string matched so far and the prefix may be a prefix of another pattern string. Only by moving to the position of the longest such prefix can we guarantee that every answer is found. From the KMP point of view, this means shifting the pattern string to the right by as little as possible, so that no correct answer is skipped.

  Take the main string abcabcabcabc and the pattern string abcabcab as an example. There are 2 matching positions (i.e. answers):

Table 1:
a b c a b c a b c a b c
a b c a b c a b

Table 2:
a b c a b c a b c a b c
      a b c a b c a b

For this pattern string, there are two prefixes that are also suffixes: ab and abcab.

The longer one should be chosen here. The shift is 8 - 5 = 3, which moves the pattern from the alignment in Table 1 to the alignment in Table 2.

If the shorter one were chosen instead, the shift would be 8 - 2 = 6, and the alignment would become

a b c a b c a b c a b c
            a b c a b c a b

Because the pattern string now extends beyond the end of the main string, matching ends, and only the result of Table 1 is recorded.

This is why the Next array in KMP stores the length of the longest prefix that is also a suffix.

The AC automaton looks for the longest prefix equal to a suffix for the same reason.
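The numbers in this example can be reproduced with a small brute-force sketch. The helper names borderLengths and matchPositions are made up for this illustration, and neither function is part of KMP itself.

```cpp
#include <cassert>
#include <string>
#include <vector>

// All lengths L (0 < L < |p|) such that the prefix of length L equals the
// suffix of length L, longest first. Brute force, for illustration only.
std::vector<int> borderLengths(const std::string& p) {
    std::vector<int> res;
    int m = static_cast<int>(p.size());
    for (int len = m - 1; len >= 1; --len)
        if (p.compare(0, len, p, m - len, len) == 0) res.push_back(len);
    return res;
}

// Start positions of every occurrence of p in s (brute force via find).
std::vector<int> matchPositions(const std::string& s, const std::string& p) {
    std::vector<int> pos;
    for (std::size_t i = s.find(p); i != std::string::npos; i = s.find(p, i + 1))
        pos.push_back(static_cast<int>(i));
    return pos;
}
```

For the pattern abcabcab, borderLengths returns the lengths 5 and 2 (abcab and ab), and matchPositions on abcabcabcabc returns the positions 0 and 3. Shifting by 8 - 5 = 3 moves from the first match to the second, while shifting by 8 - 2 = 6 would skip it.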


For more details, see the walkthrough of the fail pointer construction below.




About the example problem

The code below uses HDU 2222 Keywords Search as the example.

The problem statement, in short:

The first number T is the number of test cases. Each test case gives a number N (N <= 10000), meaning there are N words used as pattern strings.

Each of the next N lines contains one word of length at most 50.

The last line is the main string, of length at most 1,000,000.

For each test case, output how many of the pattern strings occur in the main string.

(The given words may repeat. Repeated words count as different pattern strings, so if such a word occurs in the main string, the answer increases by the number of times it was given. If the same word occurs in the main string more than once, the later occurrences are not counted.)

(Because there are multiple pattern strings, solving this with KMP alone would require one KMP pass per pattern string.)
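The counting rule is easy to get wrong, so here is a brute-force reference that pins it down. It is a sketch for checking small cases only, and the name countPresentWords is an assumption made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Reference answer by brute force: every entry of `words` that occurs at
// least once in `text` adds 1, so a word listed k times adds k in total,
// while repeated occurrences inside `text` add nothing extra.
int countPresentWords(const std::vector<std::string>& words,
                      const std::string& text) {
    int answer = 0;
    for (const std::string& w : words)
        if (text.find(w) != std::string::npos) ++answer;
    return answer;
}
```

For example, the list {abi, abi} against the main string wasabi yields 2, while the single word abi against abiabi yields 1.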




Building the fail pointers (illustrated)

Suppose we have the following four words:

abs

abi

wasabi

binary

The resulting trie is shown below.

[Figure a: the trie built from the four words]

After the tree is built, we start building the fail pointers.

First consider the special cases:

  By convention, the fail pointer of the root is NULL (this is later used to detect the end of the iteration).

  The fail pointers of all existing children of the root point to the root (a single letter has no prefix that equals a suffix).

Then consider the general case handled by iteration:

  Suppose the node currently being processed is nd.

  If the length of the longest prefix equal to a suffix is 0, no prefix matches any suffix, and the fail pointer of nd should point to the root.

  If that length is 1, only the prefix consisting of the character of nd itself matches, so the fail pointer of nd should point to a child of the root; in this case the fail pointer of the parent of nd points to the root.

  If that length is greater than 1, the fail pointer of the parent of nd already points where it should. Let c be the character represented by nd. Then the fail of nd can point directly to the child c of the node that the parent's fail points to, if that child exists. If the child c is NULL (it does not exist), keep following fail pointers to see whether some node on the fail path has a child c. If the iteration finally steps from the root's fail to NULL, the iteration ends, and the fail of nd must be set to the root explicitly rather than left as NULL.


For this iteration to work,

a deeper node must be able to build its own fail pointer from the fail pointers of shallower nodes.

It can also be seen that the fail pointer of every node points to a node of smaller depth than the node itself (that is, always towards the root).

So we can build the fail pointers with a breadth-first search that spreads out from the root.


First handle the root and the first layer, whose fail pointers are built as follows.

[Figure b: fail pointers of the root and the first layer]

When the BFS visits the children of a node, it scans all characters from a to z in lexicographic order and checks whether a child for each character exists.

After the first layer has been handled by hand, all nodes of the first layer are pushed into the queue.

Look at the path root->a->b.

When we loop over the children of node 'a', we find a child for the character 'b' and process it.

The parent of 'b' is 'a', and the fail pointer of 'a' points to the root.

So we look at the root, the fail of the parent of 'b', and check whether it has a child for the character 'b'.

Clearly the first character of the path "binary" is 'b', so such a child exists, and the fail of the 'b' in root->a->b can point to the first 'b' of "binary".

[Figure c: fail pointer of the 'b' in root->a->b]

Once its fail is built, push this 'b' into the queue and continue with the next nodes, for example the 'a' of root->w->a.

When the queue is empty, all fail pointers in the figure have been built.

[Figure d: the trie with all fail pointers]
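Since the figures are not reproduced here, the walkthrough above can be checked with a small sketch. It assumes lowercase input and uses an index-based trie (node 0 is the root) instead of the pointer-based node struct; the Trie type and the failLabel helper are hypothetical names introduced only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <string>
#include <vector>

struct Trie {
    std::vector<std::array<int, 26>> next;  // child index, -1 if absent
    std::vector<int> fail;                  // fail link; node 0 is the root
    std::vector<std::string> label;         // string spelled from the root

    Trie() { newNode(""); }

    int newNode(const std::string& s) {
        std::array<int, 26> row;
        row.fill(-1);
        next.push_back(row);
        fail.push_back(0);
        label.push_back(s);
        return static_cast<int>(next.size()) - 1;
    }

    void add(const std::string& word) {
        int nd = 0;
        for (char ch : word) {
            int id = ch - 'a';
            if (next[nd][id] == -1) {
                int child = newNode(label[nd] + ch);  // may reallocate `next`
                next[nd][id] = child;
            }
            nd = next[nd][id];
        }
    }

    // BFS from the root: a node's fail link only depends on shallower nodes.
    void buildFail() {
        std::queue<int> q;
        for (int c = 0; c < 26; ++c)
            if (next[0][c] != -1) q.push(next[0][c]);  // first layer fails to root
        while (!q.empty()) {
            int nd = q.front();
            q.pop();
            for (int c = 0; c < 26; ++c) {
                int child = next[nd][c];
                if (child == -1) continue;
                int tmp = fail[nd];
                // Climb the parent's fail chain until some node has a child c.
                while (tmp != 0 && next[tmp][c] == -1) tmp = fail[tmp];
                fail[child] = (next[tmp][c] != -1) ? next[tmp][c] : 0;
                q.push(child);
            }
        }
    }

    // Index of the node that spells s, or -1 if there is no such node.
    int find(const std::string& s) const {
        int nd = 0;
        for (char ch : s) {
            nd = next[nd][ch - 'a'];
            if (nd == -1) return -1;
        }
        return nd;
    }
};

// String spelled by the fail target of the node that spells s.
std::string failLabel(const Trie& t, const std::string& s) {
    int nd = t.find(s);
    assert(nd != -1);  // s must be a path in the trie
    return t.label[t.fail[nd]];
}
```

After adding abs, abi, wasabi and binary and calling buildFail, failLabel(t, "ab") is "b", matching the root->a->b step described above, and failLabel(t, "wasabi") is "abi".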




Implementing the matching

After the tree and the fail pointers are built, we match the main string against the pattern strings.

For each word, the flag of the node at its end position is incremented by 1 (flag is initially 0).

The flag value of each node is then as follows.

[Figure e: flag values of the nodes]

Treat the pattern strings as one whole tree, and start from the root and from the first character of the main string.

Take the main string "wasabinaryabi" as an example.

The first character of the main string is 'w'. The root has a child for 'w', so the pointer into the tree moves from the root to 'w'.

The same happens all the way down to the sixth character 'i'. The flag of this 'i' node is 1, so the main string matches this word; the flag is added to the answer and the node is then marked as counted so that it is not counted again.

Because this 'i' node has no children, the next node to look at is the one its fail pointer points to, which is the last node 'i' of root->a->b->i, as shown.

Its flag is 1, so another word is matched; again the flag is added to the answer and the node is marked as counted.

Processing continues with the fail of that 'i', and the following "inary" part leads to the 'y' in the lower right corner, whose flag is added to the answer and which is marked as counted as well.

At this point the fail of 'y' points directly to the root, so matching of the main string continues from the root.

Finally, when "abi" is matched, its 'i' node has already been marked as counted, so it contributes nothing to the answer.

After the main string is matched, the output is 3, for the three words wasabi, abi and binary.

Note that every time a node is matched, we must walk the fail path starting from that node and add every flag on the path to the answer. Otherwise the following situation goes wrong.

The figure contains two words, abcde and bcd. This is a special case in which one word lies completely inside another word.

[Figure f: trie containing abcde and bcd]

Suppose the main string is "abcde". If we did not walk the fail path at every node, the only word end met during matching would be the 'e' node, giving an answer of 1.

But bcd is clearly also contained in "abcde", so the fail path must be walked every time.

Then, when the 'd' node is processed, its fail points to the 'd' of "bcd", and the flag of "bcd" is added to the answer.

For the detailed matching code, see the query function below.
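The matching procedure, including the fail path walk, can be sketched as one self-contained function. This is a minimal illustration under stated assumptions, not the submitted solution: it assumes lowercase input, uses an index-based trie with node 0 as the root, and keeps a separate visited array (an assumption of this sketch) so that a walk stops only at nodes whose fail path was already walked. The name countMatchedWords is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <string>
#include <vector>

int countMatchedWords(const std::vector<std::string>& words,
                      const std::string& text) {
    std::vector<std::array<int, 26>> next(1);
    next[0].fill(-1);
    std::vector<int> fail(1, 0), flag(1, 0);

    // 1. Build the trie; flag counts the words ending at each node.
    for (const std::string& w : words) {
        int nd = 0;
        for (char ch : w) {
            int id = ch - 'a';
            if (next[nd][id] == -1) {
                std::array<int, 26> row;
                row.fill(-1);
                next.push_back(row);
                fail.push_back(0);
                flag.push_back(0);
                next[nd][id] = static_cast<int>(next.size()) - 1;
            }
            nd = next[nd][id];
        }
        ++flag[nd];
    }

    // 2. Build the fail links by BFS, shallow nodes first.
    std::queue<int> q;
    for (int c = 0; c < 26; ++c)
        if (next[0][c] != -1) q.push(next[0][c]);
    while (!q.empty()) {
        int nd = q.front();
        q.pop();
        for (int c = 0; c < 26; ++c) {
            int child = next[nd][c];
            if (child == -1) continue;
            int tmp = fail[nd];
            while (tmp != 0 && next[tmp][c] == -1) tmp = fail[tmp];
            fail[child] = (next[tmp][c] != -1) ? next[tmp][c] : 0;
            q.push(child);
        }
    }

    // 3. Scan the text; after each character walk the fail path.
    std::vector<bool> visited(next.size(), false);
    int ans = 0, nd = 0;
    for (char ch : text) {
        int id = ch - 'a';
        while (nd != 0 && next[nd][id] == -1) nd = fail[nd];
        if (next[nd][id] != -1) nd = next[nd][id];
        // Stop only at nodes already visited, not at nodes whose flag is 0.
        for (int tmp = nd; tmp != 0 && !visited[tmp]; tmp = fail[tmp]) {
            ans += flag[tmp];
            visited[tmp] = true;
        }
    }
    return ans;
}
```

With the words abcde and bcd and the main string abcde, the walk started at the 'd' of abcd reaches the 'd' of bcd, so both words are counted. Because every node is marked at most once, the fail path walks add at most one visit per trie node over the whole scan.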




Code

First, the node structure and the global variables are defined as follows

struct node
{
    int flag;
    node *next[26],*fail;
};

node *root;

char str[1000050];

Here flag records the number of words that end at the current node.

flag and the next array have the same meaning as in an ordinary trie.


Building the trie: addNode

void addNode()
{
    node *nd=root;
    int i,id;
    for(i=0;str[i]!='\0';i++)
    {
        id=str[i]-'a';
        if(nd->next[id]==NULL)
        {
            nd->next[id]=new node;
            nd->next[id]->flag=0;
            for(int j=0;j<26;j++)
                nd->next[id]->next[j]=NULL;
        }
        nd=nd->next[id];
    }
    nd->flag++;
}

  The trie is built in the usual way; the fail pointers do not need to be initialized at this point.


Building the fail pointers: buildFailPointer

void buildFailPointer()
{
    queue<node*> q; // BFS queue
    root->fail=NULL; // the root's fail is NULL
    for(int i=0;i<26;i++)
    {
        if(root->next[i]!=NULL) // every existing first-layer node fails to root and is queued for the search
        {
            root->next[i]->fail=root;
            q.push(root->next[i]);
        }
    }
    while(!q.empty())
    {
        node *nd=q.front();
        q.pop();
        for(int i=0;i<26;i++)
        {
            if(nd->next[i]!=NULL) // if this child exists
            {
                node *tmp=nd->fail; // tmp is the fail of nd, the parent of the node nd->next[i] being processed
                while(tmp!=NULL) // keep iterating
                {
                    if(tmp->next[i]!=NULL) // stop once a node on the fail path has a child for the same character as nd->next[i]
                    {
                        nd->next[i]->fail=tmp->next[i]; // the fail of the current node points to that child
                        break;
                    }
                    tmp=tmp->fail; // no such child: follow the fail pointer again
                }
                if(tmp==NULL) // the iteration passed the root without success: no prefix equals the current suffix, so fail points to root
                    nd->next[i]->fail=root;
                q.push(nd->next[i]); // push into the queue
            }
        }
    }
}

  The node being processed in the search is nd->next[i], so nd is the parent of nd->next[i].


Matching the main string against the tree and counting the words: query

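int query()
{
    node *nd=root,*tmp;
    int ans=0,i,id;
    for(i=0;str[i]!='\0';i++)
    {
        id=str[i]-'a';
        while(nd->next[id]==NULL&&nd!=root) // nd has no child for id: mismatch, follow fail; stop at the root, which cannot go further
            nd=nd->fail;
        if(nd->next[id]!=NULL) // for the case nd is the root: move only if the child for id exists, otherwise stay at the root
            nd=nd->next[id];
        tmp=nd; // walk the fail path from nd to count every word that ends at the current position
        while(tmp!=root&&tmp->flag!=-1) // -1 marks a node whose fail path was already walked
        {
            ans+=tmp->flag; // adds 0 for nodes that do not end a word
            tmp->flag=-1; // mark as visited so it is never counted or walked again
            tmp=tmp->fail; // a flag of 0 must not stop the walk: a later node may still end a word
        }
    }
    return ans;
}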

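  nd is the trie node currently pointed to, analogous to the pattern cursor j in the KMP algorithm.

  When walking the fail path, if we meet a node that has already been visited, the rest of this path has been walked before, so there is no need to continue, which saves time. A flag of 0 alone is not a reason to stop: the node may simply not be the end of a word while a node further along its fail path is (for example, the 'd' of abcd fails to the 'd' of bcd).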




The complete code for HDU-2222

#include<bits/stdc++.h>
using namespace std;

struct node
{
    int flag;
    node *next[26],*fail;
};

node *root;

char str[1000050];

void addNode()
{
    node *nd=root;
    int i,id;
    for(i=0;str[i]!='\0';i++)
    {
        id=str[i]-'a';
        if(nd->next[id]==NULL)
        {
            nd->next[id]=new node;
            nd->next[id]->flag=0;
            for(int j=0;j<26;j++)
                nd->next[id]->next[j]=NULL;
        }
        nd=nd->next[id];
    }
    nd->flag++;
}

void buildFailPointer()
{
    queue<node*> q;
    root->fail=NULL;
    for(int i=0;i<26;i++)
    {
        if(root->next[i]!=NULL)
        {
            root->next[i]->fail=root;
            q.push(root->next[i]);
        }
    }
    while(!q.empty())
    {
        node *nd=q.front();
        q.pop();
        for(int i=0;i<26;i++)
        {
            if(nd->next[i]!=NULL)
            {
                node *tmp=nd->fail;
                while(tmp!=NULL)
                {
                    if(tmp->next[i]!=NULL)
                    {
                        nd->next[i]->fail=tmp->next[i];
                        break;
                    }
                    tmp=tmp->fail;
                }
                if(tmp==NULL)
                    nd->next[i]->fail=root;
                q.push(nd->next[i]);
            }
        }
    }
}

int query()
{
    node *nd=root,*tmp;
    int ans=0,i,id;
    for(i=0;str[i]!='\0';i++)
    {
        id=str[i]-'a';
        while(nd->next[id]==NULL&&nd!=root)
            nd=nd->fail;
        if(nd->next[id]!=NULL)
            nd=nd->next[id];
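        tmp=nd;
        while(tmp!=root&&tmp->flag!=-1) // -1 marks a node whose fail path was already walked
        {
            ans+=tmp->flag;
            tmp->flag=-1; // a flag of 0 must not stop the walk, so visited nodes get -1 instead
            tmp=tmp->fail;
        }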
    }
    return ans;
}

void solve()
{
    root=new node;
    root->flag=0;
    root->fail=NULL;
    for(int i=0;i<26;i++)
        root->next[i]=NULL;
    int n;
    scanf("%d",&n);
    while(n--)
    {
        scanf("%s",str);
        addNode();
    }
    buildFailPointer();
    scanf("%s",str);
    printf("%d\n",query());
}

int main()
{
    int T;
    scanf("%d",&T);
    while(T--)
        solve();

    return 0;
}

Origin www.cnblogs.com/stelayuri/p/12578889.html