HDU 2222 Keywords Search (AC automaton)

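Problem

(The original post showed the problem statement as an image, which is not reproduced here. In short: given a set of keywords and an article, count how many of the keywords occur in the article.)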

Sample input

1
5
she
he
say
shr
her
yasherhs

Sample output

3

Solution

AC automaton = Trie + KMP, and it can be further optimized into a Trie graph.

KMP: finds the positions and the number of occurrences of a single word in an article in O(n) time, where n is the length of the article.
AC automaton: once built, finds the positions and the number of occurrences of every word of a dictionary in an article in O(n) time.

KMP core: ne[i] = j means that, in the pattern P (1-indexed), j is the length of the longest proper prefix of P[1..i] that is also a suffix of P[1..i]. Equivalently, j is the index at which that prefix ends. When P (the word) and S (the article) mismatch, we do not restart the match from scratch: we set j = ne[j] and continue comparing P[j + 1] against the same position of S where the mismatch occurred.
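The KMP idea described above can be sketched end to end as follows. This is a minimal illustration, assuming 1-indexed strings as in the rest of this post; the helper names buildNext and countOccurrences are made up for this example and do not appear in the original code.

```cpp
#include <string>
#include <vector>

// Computes ne[1..n] for the pattern, treated as 1-indexed.
// ne[i] = length of the longest proper prefix of p[1..i] that is also its suffix.
std::vector<int> buildNext(const std::string& pattern) {
    int n = pattern.size();
    std::string p = " " + pattern;  // shift to 1-indexed
    std::vector<int> ne(n + 1, 0);
    for (int i = 2, j = 0; i <= n; i++) {
        while (j && p[i] != p[j + 1]) j = ne[j];  // fall back on mismatch
        if (p[i] == p[j + 1]) j++;
        ne[i] = j;
    }
    return ne;
}

// Counts (possibly overlapping) occurrences of pattern in text.
int countOccurrences(const std::string& pattern, const std::string& text) {
    int n = pattern.size(), m = text.size();
    if (n == 0) return 0;
    std::vector<int> ne = buildNext(pattern);
    std::string p = " " + pattern, s = " " + text;
    int count = 0;
    for (int i = 1, j = 0; i <= m; i++) {
        while (j && s[i] != p[j + 1]) j = ne[j];  // reuse the matched prefix
        if (s[i] == p[j + 1]) j++;
        if (j == n) {   // full match ending at position i
            count++;
            j = ne[j];  // keep going to allow overlapping matches
        }
    }
    return count;
}
```

For instance, buildNext("abab") yields ne[1..4] = 0, 0, 1, 2, and countOccurrences("aba", "ababa") returns 2 because the two occurrences overlap.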

AC automaton core: build a ne[] array on the Trie. For each node, take the string spelled by the path from the root to that node and look at all of its proper suffixes. ne[] stores the end node of the longest such suffix that is also a prefix in the Trie (that is, a path starting at the root). After a mismatch, matching resumes directly from the node that ne[] points to, so other words can be matched without starting over.

Worked example

As shown in the figure, the proper suffixes of the path ending at node e of she are he and e. The longest one that also appears as a prefix in the Trie is he, so the ne pointer of e (she) points to e (he), the end node of that prefix.

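// Compute the next (ne) array; indices start at 1
ne[0] = ne[1] = 0;  // only proper prefixes/suffixes count, so ne[1] = 0 and the loop starts at 2
for (int i = 2, j = 0; i <= n; i++) {
    // At this point j equals ne[i - 1], carried over from the previous iteration
    while (j && p[i] != p[j + 1]) {
        j = ne[j];
    }
    if (p[i] == p[j + 1]) {
        j++;
    }
    ne[i] = j;
}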

The code above is how KMP computes the ne array. Notice that after setting ne[i] = j the loop increments i, so at the start of each iteration j = ne[i - 1]. In other words, the ne array is filled in from front to back, each entry derived from the previous one. In an AC automaton the same update happens layer by layer: the ne value of the current node is used to compute the ne values of its children in the next layer. (The ne of the root and of every first-layer node must be 0, because only proper prefixes count and the string itself is excluded.) A BFS therefore updates the Trie one layer at a time.

// Build the AC automaton
void build() {
    int hh = 0, tt = -1;
    // First-layer nodes have ne = 0; enqueue them as BFS starting points
    for (int i = 0; i < 26; i++) {
        if (tr[0][i]) q[++tt] = tr[0][i];
    }

    while (hh <= tt) {
        int t = q[hh++];
        for (int i = 0; i < 26; i++) {
            int c = tr[t][i];
            if (!c) continue;

            int j = ne[t];
            while (j && !tr[j][i]) j = ne[j];
            if (tr[j][i]) j = tr[j][i];
            ne[c] = j;
            q[++tt] = c;
        }
    }
}

Building the AC automaton, by analogy with KMP (a chain becomes a tree). Every child of the root spells a single letter, so its ne is 0. These nodes are enqueued first and then serve as the current node whose children get updated. In the code, int t = q[hh++]; is the current node (the counterpart of i - 1 in int j = ne[i - 1]); int c = tr[t][i]; is a child of the current node, i.e. the next position (the counterpart of i); and int j = ne[t]; corresponds to int j = ne[i - 1]. Where KMP tests p[i] != p[j + 1] to check whether the letter after j equals the letter at i, the Trie version tests whether tr[j][i] exists, that is, whether node j has a child for letter i (j is a node, and tr[j][i] is its child for the i-th letter). The rest of the code follows the same analogy.
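To make the construction concrete, here is a small self-contained sketch that builds the ne links for the sample dictionary and lets you inspect them. It uses std::vector and std::queue instead of the fixed arrays of this post, and the struct name AhoCorasick is an assumption made only for this example.

```cpp
#include <queue>
#include <string>
#include <vector>

// Illustrative container: tr and ne play the same role as in the post.
struct AhoCorasick {
    std::vector<std::vector<int>> tr;  // tr[node][letter] = child id, 0 if absent
    std::vector<int> ne;               // ne[node] = end node of the longest matching prefix

    AhoCorasick() : tr(1, std::vector<int>(26, 0)), ne(1, 0) {}

    // Inserts a lowercase word and returns the id of its last node.
    int insert(const std::string& word) {
        int p = 0;
        for (char ch : word) {
            int u = ch - 'a';
            if (!tr[p][u]) {
                int id = tr.size();
                tr.push_back(std::vector<int>(26, 0));
                ne.push_back(0);
                tr[p][u] = id;
            }
            p = tr[p][u];
        }
        return p;
    }

    // BFS over the Trie, layer by layer, exactly as described above.
    void build() {
        std::queue<int> q;
        for (int i = 0; i < 26; i++)
            if (tr[0][i]) q.push(tr[0][i]);  // first layer: ne stays 0
        while (!q.empty()) {
            int t = q.front();
            q.pop();
            for (int i = 0; i < 26; i++) {
                int c = tr[t][i];
                if (!c) continue;
                int j = ne[t];
                while (j && !tr[j][i]) j = ne[j];
                if (tr[j][i]) j = tr[j][i];
                ne[c] = j;
                q.push(c);
            }
        }
    }
};
```

After inserting she, he, say, shr and her and calling build(), the ne of the last node of she is the last node of he, the ne of sh is the node h, and the ne of shr and her is the root (0), since no proper suffix of either is a prefix in the Trie.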

Matching the article against the words: start at the root of the Trie and follow the article letter by letter. On a mismatch, move to ne[j] until the letter can be matched or the root is reached. The resulting j is the deepest Trie node that matches a suffix of the first i letters of the article. Whenever j matches, the word ending at ne[j] matches as well: in the figure, once she has been traversed, e (she) points to e (he), so he must also occur. Therefore, when adding to the answer, count not only j but every node reachable from j by following ne.
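The chain walk described above can be observed directly with a sketch that reports every occurrence instead of only counting. This is an illustration under a few assumptions: the function name findAll and the wordAt array are invented for this example, words are lowercase, and the dictionary contains no duplicate words.

```cpp
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Returns (0-based end index in text, matched word) for every occurrence.
std::vector<std::pair<int, std::string>> findAll(const std::vector<std::string>& words,
                                                 const std::string& text) {
    std::vector<std::vector<int>> tr(1, std::vector<int>(26, 0));
    std::vector<int> ne(1, 0);
    std::vector<int> wordAt(1, -1);  // index of the word ending at a node, or -1

    for (int w = 0; w < (int)words.size(); w++) {  // Trie insertion
        int p = 0;
        for (char ch : words[w]) {
            int u = ch - 'a';
            if (!tr[p][u]) {
                int id = tr.size();
                tr.push_back(std::vector<int>(26, 0));
                ne.push_back(0);
                wordAt.push_back(-1);
                tr[p][u] = id;
            }
            p = tr[p][u];
        }
        wordAt[p] = w;
    }

    std::queue<int> q;  // BFS construction of ne
    for (int i = 0; i < 26; i++)
        if (tr[0][i]) q.push(tr[0][i]);
    while (!q.empty()) {
        int t = q.front();
        q.pop();
        for (int i = 0; i < 26; i++) {
            int c = tr[t][i];
            if (!c) continue;
            int j = ne[t];
            while (j && !tr[j][i]) j = ne[j];
            if (tr[j][i]) j = tr[j][i];
            ne[c] = j;
            q.push(c);
        }
    }

    std::vector<std::pair<int, std::string>> hits;
    for (int i = 0, j = 0; i < (int)text.size(); i++) {
        int t = text[i] - 'a';
        while (j && !tr[j][t]) j = ne[j];
        if (tr[j][t]) j = tr[j][t];
        // Walk the ne chain: every word ending on it also ends at position i.
        for (int p = j; p; p = ne[p])
            if (wordAt[p] != -1) hits.push_back({i, words[wordAt[p]]});
    }
    return hits;
}
```

On the sample (words she, he, say, shr, her and article yasherhs) this reports she and he both ending at index 4, then her ending at index 5: three hits, matching the sample output of 3.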

Code

#include<cstdio>
#include<cstring>
#include<iostream>
#include<algorithm>

using namespace std;
const int N = 1e4 + 10, S = 55, M = 1e6 + 10;

int n;
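int tr[N * S][26];  // Trie
int cnt[N * S];     // number of words ending at this node
int idx;            // next free node id in the Trie
char str[M];        // input buffer (keywords, then the article)
int q[N * S];       // BFS queue (array-based)
int ne[N * S];      // ne array of the AC automaton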

// Insert the word currently in str into the Trie
void insert() {
    int p = 0;
    for (int i = 0; str[i]; i++) {
        int u = str[i] - 'a';
        if (!tr[p][u]) tr[p][u] = ++idx;
        p = tr[p][u];
    }
    cnt[p]++;
}

// Build the AC automaton
void build() {
    int hh = 0, tt = -1;
    for (int i = 0; i < 26; i++) {
        if (tr[0][i]) q[++tt] = tr[0][i];
    }

    while (hh <= tt) {
        int t = q[hh++];
        for (int i = 0; i < 26; i++) {
            int c = tr[t][i];
            if (!c) continue;
            int j = ne[t];
            while (j && !tr[j][i]) j = ne[j];
            if (tr[j][i]) j = tr[j][i];
            ne[c] = j;
            q[++tt] = c;
        }
    }
}

int main() {
    int T;
    scanf("%d", &T);
    while (T--) {
        // Reset all state between test cases
        memset(tr, 0, sizeof tr);
        memset(cnt, 0, sizeof cnt);
        memset(ne, 0, sizeof ne);
        idx = 0;

        scanf("%d", &n);
        for (int i = 0; i < n; i++) {
            scanf("%s", str);
            insert();
        }

        build();

        scanf("%s", str);

        int res = 0;
        // Match the article against the words
        for (int i = 0, j = 0; str[i]; i++) {
            int t = str[i] - 'a';
            while (j && !tr[j][t]) j = ne[j];
            if (tr[j][t]) j = tr[j][t];

            int p = j;
            while (p) {
                res += cnt[p];
                cnt[p] = 0;  // count each word only once
                p = ne[p];
            }
        }
        printf("%d\n", res);
    }
    return 0;
}

Trie graph optimization

In the code above, both building the ne array and matching jump backwards through ne one step at a time: if a jump fails, we jump again from the new position. On strong test data these repeated jumps add up, so the while loop is worth optimizing.

The idea is to replace the multi-step while loop with a single step that lands directly where the ne jumps would finally stop. For the child of node t on letter i, that final position is tr[ne[t]][i]: the transition on the same letter from the node that the parent's ne pointer points to. Because BFS processes shallower nodes first, that entry is already final by the time it is read.
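The one-step jump can be checked on the sample dictionary with a short sketch. It is only an illustration: the struct name TrieGraph is an assumption for this example, and vectors replace the fixed arrays used in this post.

```cpp
#include <queue>
#include <string>
#include <vector>

// Illustrative Trie graph: after build(), tr[node][letter] is defined for
// every letter, so matching never needs a while loop over ne.
struct TrieGraph {
    std::vector<std::vector<int>> tr;
    std::vector<int> ne;

    TrieGraph() : tr(1, std::vector<int>(26, 0)), ne(1, 0) {}

    // Inserts a lowercase word and returns the id of its last node.
    int insert(const std::string& word) {
        int p = 0;
        for (char ch : word) {
            int u = ch - 'a';
            if (!tr[p][u]) {
                int id = tr.size();
                tr.push_back(std::vector<int>(26, 0));
                ne.push_back(0);
                tr[p][u] = id;
            }
            p = tr[p][u];
        }
        return p;
    }

    void build() {
        std::queue<int> q;
        for (int i = 0; i < 26; i++)
            if (tr[0][i]) q.push(tr[0][i]);
        while (!q.empty()) {
            int t = q.front();
            q.pop();
            for (int i = 0; i < 26; i++) {
                int p = tr[t][i];
                if (!p) {
                    // Missing child: borrow the transition of ne[t] (already final).
                    tr[t][i] = tr[ne[t]][i];
                } else {
                    // Real child: its ne is one step away, no loop needed.
                    ne[p] = tr[ne[t]][i];
                    q.push(p);
                }
            }
        }
    }
};
```

With the sample words inserted, the node for she has no child r in the plain Trie, but after build() its r transition leads straight to the last node of her (via he), which is where the while loop of the earlier version would have ended up.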

#include<cstdio>
#include<cstring>
#include<iostream>
#include<algorithm>

using namespace std;
const int N = 1e4 + 10, S = 55, M = 1e6 + 10;

int n;
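int tr[N * S][26];  // Trie (becomes the Trie graph after build)
int cnt[N * S];     // number of words ending at this node
int idx;            // next free node id in the Trie
char str[M];        // input buffer (keywords, then the article)
int q[N * S];       // BFS queue (array-based)
int ne[N * S];      // ne array of the AC automaton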

// Insert the word currently in str into the Trie
void insert() {
    int p = 0;
    for (int i = 0; str[i]; i++) {
        int u = str[i] - 'a';
        if (!tr[p][u]) tr[p][u] = ++idx;
        p = tr[p][u];
    }
    cnt[p]++;
}

// Build the AC automaton
void build() {
    int hh = 0, tt = -1;
    for (int i = 0; i < 26; i++) {
        if (tr[0][i]) q[++tt] = tr[0][i];
    }

    while (hh <= tt) {
        int t = q[hh++];
        for (int i = 0; i < 26; i++) {
            // Trie graph optimization
            int p = tr[t][i];
            if (!p) tr[t][i] = tr[ne[t]][i];  // missing child: reuse the transition of ne[t]
            else {
                ne[p] = tr[ne[t]][i];  // existing child: ne is found in one step
                q[++tt] = p;
            }
        }
    }
}

int main() {
    int T;
    scanf("%d", &T);
    while (T--) {
        // Reset all state between test cases
        memset(tr, 0, sizeof tr);
        memset(cnt, 0, sizeof cnt);
        memset(ne, 0, sizeof ne);
        idx = 0;

        scanf("%d", &n);
        for (int i = 0; i < n; i++) {
            scanf("%s", str);
            insert();
        }

        build();

        scanf("%s", str);

        int res = 0;
        // Match the article against the words
        for (int i = 0, j = 0; str[i]; i++) {
            int t = str[i] - 'a';
            // Trie graph optimization: a single transition, no while loop
            j = tr[j][t];

            int p = j;
            while (p) {
                // Walking this chain is O(n^2) in the worst case
                res += cnt[p];
                cnt[p] = 0;  // count each word only once
                p = ne[p];
            }
        }
        printf("%d\n", res);
    }
    return 0;
}
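The chain walk flagged in the comment above can be bounded with a visited marker. The sketch below is not part of the original solution; it shows one common refinement, assuming lowercase input. A node is marked by setting cnt to -1 once it has been counted, and the walk stops at the first marked node, because everything further along its ne chain was already counted when that node was first visited. The function name countKeywords is made up for this example.

```cpp
#include <queue>
#include <string>
#include <vector>

// Counts how many dictionary words occur in text (duplicate words count
// separately, as with cnt[] in the post). Each node is walked at most once.
int countKeywords(const std::vector<std::string>& words, const std::string& text) {
    std::vector<std::vector<int>> tr(1, std::vector<int>(26, 0));
    std::vector<int> ne(1, 0), cnt(1, 0);

    for (const std::string& word : words) {  // Trie insertion
        int p = 0;
        for (char ch : word) {
            int u = ch - 'a';
            if (!tr[p][u]) {
                int id = tr.size();
                tr.push_back(std::vector<int>(26, 0));
                ne.push_back(0);
                cnt.push_back(0);
                tr[p][u] = id;
            }
            p = tr[p][u];
        }
        cnt[p]++;
    }

    std::queue<int> q;  // Trie graph construction
    for (int i = 0; i < 26; i++)
        if (tr[0][i]) q.push(tr[0][i]);
    while (!q.empty()) {
        int t = q.front();
        q.pop();
        for (int i = 0; i < 26; i++) {
            int p = tr[t][i];
            if (!p) tr[t][i] = tr[ne[t]][i];
            else {
                ne[p] = tr[ne[t]][i];
                q.push(p);
            }
        }
    }

    int res = 0;
    for (int i = 0, j = 0; i < (int)text.size(); i++) {
        j = tr[j][text[i] - 'a'];
        // Stop at the root or at a node whose chain was already counted.
        for (int p = j; p && cnt[p] != -1; p = ne[p]) {
            res += cnt[p];
            cnt[p] = -1;  // mark as visited
        }
    }
    return res;
}
```

On the sample, countKeywords({"she", "he", "say", "shr", "her"}, "yasherhs") returns 3. Since every node is marked at most once, the total work of all chain walks is bounded by the number of Trie nodes plus the length of the article.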

Origin blog.csdn.net/qq_44791484/article/details/113788918