Multi-pattern string matching algorithm: AC automaton principle, complexity analysis and code implementation

Multiple pattern string matching

Multi-pattern string matching is common in practice, for example when a platform filters sensitive terms out of what users post.

One way is to use a string matching algorithm to find the sensitive terms in the text and replace them with "***". A single-pattern string matching algorithm can find and replace the sensitive terms one at a time, but in practice, if the sensitive-term dictionary is large and there is a lot of text to check, matching takes too long and sending a message may become slow, which clearly degrades the user experience.

Therefore, an efficient matching algorithm under multiple pattern strings is needed to deal with this scenario.

Filter sensitive words based on Trie tree

A Trie is itself a data structure suited to multi-pattern string matching: the pattern strings are built into a single Trie, and when the set of pattern strings changes, only the Trie needs to be updated.

When matching against the main string, we start from the first character of the main string and walk down the Trie one character at a time. When a character fails to match, we move the starting position in the main string forward by one character and start again from the root of the Trie. In this way, a single pass over the starting positions of the main string matches all pattern strings at once, which is much more efficient than running a single-pattern match for every pattern string.
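The Trie-based scan described above can be sketched as follows. This is a minimal illustration, not the original author's code: the class `TrieMatchDemo`, its `Node` type, and the methods `insert` and `findAll` are hypothetical names, the patterns `abcd` and `bcf` are sample data, and the character set is assumed to be a-z.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: Trie-only multi-pattern matching over lowercase a-z.
class TrieMatchDemo {
    static class Node {
        Node[] children = new Node[26];
        boolean isEnd = false; // true if a pattern string ends at this node
    }

    // Adds one pattern string (a-z only) to the Trie.
    static void insert(Node root, String pattern) {
        Node p = root;
        for (char ch : pattern.toCharArray()) {
            int idx = ch - 'a';
            if (p.children[idx] == null) p.children[idx] = new Node();
            p = p.children[idx];
        }
        p.isEnd = true;
    }

    // Returns every match as {startIndex, length}.
    static List<int[]> findAll(Node root, String text) {
        List<int[]> hits = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node p = root; // every starting position begins again at the root
            for (int i = start; i < text.length(); i++) {
                int idx = text.charAt(i) - 'a';
                // Mismatch (or a character outside a-z): give up on this start position.
                if (idx < 0 || idx >= 26 || p.children[idx] == null) break;
                p = p.children[idx];
                if (p.isEnd) hits.add(new int[] {start, i - start + 1});
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Node root = new Node();
        insert(root, "abcd");
        insert(root, "bcf");
        for (int[] h : findAll(root, "xabcfabcd")) {
            System.out.println("start=" + h[0] + ", length=" + h[1]);
        }
    }
}
```

Note that after a mismatch the inner loop restarts from the root for the next start position, so characters of the main string can be examined more than once. That repeated work is what the failure pointer removes.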

AC automaton principle

The Trie-based multi-pattern matching above is analogous to brute-force matching for a single pattern string. We know that brute-force matching can be made more efficient by introducing the next array, which gives the KMP algorithm. Can the idea of the next array be applied to multi-pattern matching as well?

The answer is yes, and the Trie only needs a small modification. Rather than adding a next array, we add a next pointer to every node of the Trie, known as the failure pointer.

As shown in the figure, the failure pointer of the character c in abc points to the c in the string bcf. When we have matched abc and find that the next character does not match d, we can jump along the failure pointer of c to the c in bcf and continue by trying to match f.
[Figure: failure pointer]
This way, a mismatch no longer forces matching to start over. The idea is the same as the next array. If you are not familiar with how the KMP algorithm works, it is recommended to understand the next array first; the failure pointer is then easy to follow (recommended reading on KMP: Famous string matching algorithm: KMP algorithm principle analysis and code implementation).

The next question is: how do we find the node that each node's failure pointer points to?
The Trie contains all pattern strings. Suppose we want the failure pointer of the node c in the figure above. What we know is that on reaching c, abc is the prefix that has already matched the main string. The next pattern string to try should therefore be another pattern string one of whose prefixes equals a suffix of abc, and that prefix should be the longest such match.

The suffix substrings of abc are bc and c. Only bc matches a prefix of another pattern string (bcf), so the failure pointer of the c in abc should point to the c in bcf.
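The rule can be checked by brute force on strings before involving the Trie at all. The sketch below assumes the pattern set is `abcd` and `bcf` (taken from the example in the text; the full set in the original figure is not known), and `failureTarget` is a hypothetical helper that returns the string spelled by the node the failure pointer should reach, with the empty string standing for the root.

```java
import java.util.Arrays;
import java.util.List;

class FailureTargetDemo {
    // Longest proper suffix of `matched` that is a prefix of some pattern; "" means root.
    static String failureTarget(String matched, List<String> patterns) {
        // Increasing start index means decreasing suffix length, so the first hit is the longest.
        for (int start = 1; start < matched.length(); start++) {
            String suffix = matched.substring(start);
            for (String pattern : patterns) {
                if (pattern.startsWith(suffix)) return suffix;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("abcd", "bcf"); // assumed sample patterns
        System.out.println(failureTarget("abc", patterns)); // prints "bc"
    }
}
```

This brute-force version is only meant to make the definition concrete; it is far slower than deriving each failure pointer from already computed ones.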

Build an automaton

Building the automaton takes two steps:

  1. Build a Trie tree
  2. Initialize the failure pointer of every node

First, let’s take a look at the data structure of each node:

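public class AcNode {
    public char data; // character stored in this node
    public AcNode[] children = new AcNode[26]; // the character set is limited to the 26 letters a-z
    public boolean isEndingChar = false; // true if a pattern string ends at this node
    public int length = -1; // length of the pattern string ending here (valid when isEndingChar is true)
    public AcNode fail; // failure pointer

    public AcNode(char data) {
        this.data = data;
    }
}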

Compared with a plain Trie node, the only addition is the failure pointer.

Therefore, the first step in building an automaton is to build a Trie tree, which will not be discussed in detail here (see Trie tree construction principles, application scenarios and complexity analysis).
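For readers who want something concrete, inserting one pattern string might look like the sketch below. The `insert` method is a hypothetical helper that does not appear in the original post, and the nested `AcNode` is a trimmed copy of the node class above so that the example compiles on its own. The point it illustrates is that `isEndingChar` and `length` have to be set at insertion time, because the matching code relies on both.

```java
class AcTrieInsertDemo {
    // Trimmed copy of the AcNode class, repeated here so the sketch is self-contained.
    static class AcNode {
        char data;
        AcNode[] children = new AcNode[26];
        boolean isEndingChar = false;
        int length = -1;
        AcNode fail;
        AcNode(char data) { this.data = data; }
    }

    // Hypothetical helper: adds one a-z pattern string to the Trie.
    static void insert(AcNode root, String pattern) {
        AcNode p = root;
        for (char ch : pattern.toCharArray()) {
            int idx = ch - 'a';
            if (p.children[idx] == null) p.children[idx] = new AcNode(ch);
            p = p.children[idx];
        }
        p.isEndingChar = true;       // a pattern string ends at this node
        p.length = pattern.length(); // used later to compute the match start index
    }

    public static void main(String[] args) {
        AcNode root = new AcNode('/'); // the root stores a placeholder character
        insert(root, "abcd");
        insert(root, "bcf");
        AcNode end = root.children['b' - 'a'].children['c' - 'a'].children['f' - 'a'];
        System.out.println(end.isEndingChar + " " + end.length); // prints "true 3"
    }
}
```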

The question we have to consider now is, after building the Trie tree, how can we get the failure pointers of all nodes?

From the analysis above, finding the node that a node's failure pointer points to amounts to finding the longest pattern-string prefix that equals a proper suffix of the string spelled out on the path from the root to that node.

In a Trie, a node's failure pointer can only point to a node at a shallower level. The failure pointers can therefore be computed in the same way as the next array: the failure pointer of the current node is derived from nodes whose failure pointers are already known, which is why the construction proceeds level by level.

The failure pointer of the root node is null (conceptually, it points to itself). Now, given that the failure pointer of some node p is q, how do we obtain the failure pointers of p's child nodes?

Situation 1: Take a child node pc of p. If q has a child node qc with the same character as pc, then the failure pointer of pc is qc.
Situation 2: If q has no child node with the same character as pc, set q to q's own failure pointer and look among its child nodes again, repeating until q becomes null, in which case the failure pointer of pc is the root node.

Here is the code that builds the failure pointers:

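// Requires: import java.util.LinkedList; import java.util.Queue;
public void buildFailurePointer(AcNode root) {
    Queue<AcNode> queue = new LinkedList<>();
    root.fail = null;
    queue.add(root);
    while (!queue.isEmpty()) {
        AcNode p = queue.remove(); // take the next node p, level by level
        for (AcNode pc : p.children) { // visit every child of p
            if (pc == null) continue;
            if (p == root) {
                // children of root always fail back to root
                pc.fail = root;
            } else {
                AcNode q = p.fail; // q is the node p's failure pointer points to
                while (q != null) {
                    // does q have a child with the same character as pc?
                    AcNode qc = q.children[pc.data - 'a'];
                    if (qc != null) {
                        // yes: that child is pc's failure pointer
                        pc.fail = qc;
                        break;
                    }
                    q = q.fail; // no: try the next failure pointer up the chain
                }
                if (q == null) {
                    // the chain ran out, so pc fails back to root
                    pc.fail = root;
                }
            }
            queue.add(pc);
        }
    }
}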
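To see the construction at work on the abc and bcf example, the following sketch builds a small Trie and inspects a few failure pointers. It is an illustration under assumptions: the patterns `abcd` and `bcf` are sample data, `build` and `walk` are hypothetical helpers, and the queue loop is a slightly condensed but equivalent form of the method above (a null `p.fail` for the root falls straight through to `pc.fail = root`).

```java
import java.util.ArrayDeque;
import java.util.Queue;

class FailurePointerDemo {
    static class AcNode {
        char data;
        AcNode[] children = new AcNode[26];
        AcNode fail;
        AcNode(char data) { this.data = data; }
    }

    // Hypothetical helper: builds the Trie for a-z patterns, then fills in the failure pointers.
    static AcNode build(String... patterns) {
        AcNode root = new AcNode('/');
        for (String pattern : patterns) {
            AcNode p = root;
            for (char ch : pattern.toCharArray()) {
                int idx = ch - 'a';
                if (p.children[idx] == null) p.children[idx] = new AcNode(ch);
                p = p.children[idx];
            }
        }
        buildFailurePointer(root);
        return root;
    }

    // Level-order construction: a node is processed only after all shallower nodes.
    static void buildFailurePointer(AcNode root) {
        Queue<AcNode> queue = new ArrayDeque<>();
        root.fail = null;
        queue.add(root);
        while (!queue.isEmpty()) {
            AcNode p = queue.remove();
            for (AcNode pc : p.children) {
                if (pc == null) continue;
                AcNode q = p.fail;
                // Climb the failure chain until some node has a child with pc's character.
                while (q != null && q.children[pc.data - 'a'] == null) q = q.fail;
                pc.fail = (q == null) ? root : q.children[pc.data - 'a'];
                queue.add(pc);
            }
        }
    }

    // Hypothetical helper: returns the node reached by spelling `path` from the root.
    static AcNode walk(AcNode root, String path) {
        AcNode p = root;
        for (char ch : path.toCharArray()) p = p.children[ch - 'a'];
        return p;
    }

    public static void main(String[] args) {
        AcNode root = build("abcd", "bcf");
        // The c of abc should fail to the c of bcf, as derived in the text.
        System.out.println(walk(root, "abc").fail == walk(root, "bc"));
        System.out.println(walk(root, "ab").fail == walk(root, "b"));
        System.out.println(walk(root, "abcd").fail == root);
    }
}
```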

After the failure pointers are built, the Trie looks as shown in the figure:
[Figure: Trie with failure pointers]

Matching with the AC automaton

Let the main string be str. Matching starts at the first character of the main string, and in the automaton it starts from the pointer p = root.

  1. Suppose a child node x of p equals str[0]. Then p is updated to x, and we check whether p (now pointing to x) and each node along its chain of failure pointers is the end of a pattern string. Every node that is gives one matched pattern string. After that, we continue with str[1].
  2. If at some step no child node of p matches the current character, the failure pointer comes into play: we look among the child nodes of the node that the failure pointer points to.
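Here is the matching code:

public void match(char[] str, AcNode root) { // str is the main string, root is the automaton
    AcNode p = root;
    for (int i = 0; i < str.length; i++) {
        int idx = str[i] - 'a'; // assumes str contains only the characters a-z
        // If p has no such child, follow failure pointers until a node with that child, or root, is reached
        while (p.children[idx] == null && p != root) {
            p = p.fail; // this is where the failure pointer does its work
        }
        p = p.children[idx]; // move p to the matching child
        if (p == null) { // nothing matched: restart from root
            p = root;
        }
        AcNode tmp = p;
        while (tmp != root) { // report every pattern string that ends at position i
            if (tmp.isEndingChar) {
                int pos = i - tmp.length + 1;
                System.out.println("match start index " + pos + "; length " + tmp.length);
            }
            tmp = tmp.fail;
        }
    }
}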
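Putting the pieces together for the sensitive-term scenario from the beginning of the article, a filter could look like the sketch below. It is one possible arrangement, not the original author's implementation: `build` and `mask` are hypothetical helpers, the word list is sample data, and each matched character is replaced by `*` (a variation on replacing the whole term with "***") so that the output keeps the length of the input. Characters outside a-z simply reset the automaton to the root, which is an assumption made here to keep the 26-slot child array safe.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

class SensitiveWordFilterDemo {
    static class AcNode {
        char data;
        AcNode[] children = new AcNode[26];
        boolean isEndingChar = false;
        int length = -1;
        AcNode fail;
        AcNode(char data) { this.data = data; }
    }

    // Hypothetical helper: builds the Trie for a-z words and then the failure pointers.
    static AcNode build(String... words) {
        AcNode root = new AcNode('/');
        for (String word : words) {
            AcNode p = root;
            for (char ch : word.toCharArray()) {
                int idx = ch - 'a';
                if (p.children[idx] == null) p.children[idx] = new AcNode(ch);
                p = p.children[idx];
            }
            p.isEndingChar = true;
            p.length = word.length();
        }
        Queue<AcNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            AcNode p = queue.remove();
            for (AcNode pc : p.children) {
                if (pc == null) continue;
                AcNode q = p.fail;
                while (q != null && q.children[pc.data - 'a'] == null) q = q.fail;
                pc.fail = (q == null) ? root : q.children[pc.data - 'a'];
                queue.add(pc);
            }
        }
        return root;
    }

    // Returns a copy of text in which every character of every matched word is '*'.
    static String mask(AcNode root, String text) {
        char[] out = text.toCharArray();
        AcNode p = root;
        for (int i = 0; i < text.length(); i++) {
            int idx = text.charAt(i) - 'a';
            if (idx < 0 || idx >= 26) { // outside a-z: no word can continue across it
                p = root;
                continue;
            }
            while (p.children[idx] == null && p != root) p = p.fail;
            p = p.children[idx];
            if (p == null) p = root;
            // Walk the failure chain so shorter words ending at position i are masked too.
            for (AcNode tmp = p; tmp != root; tmp = tmp.fail) {
                if (tmp.isEndingChar) Arrays.fill(out, i - tmp.length + 1, i + 1, '*');
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        AcNode root = build("abcd", "bcf");
        System.out.println(mask(root, "xabcfabcd!")); // prints "xa*******!"
    }
}
```

The inner walk along the failure chain matters when one word is a suffix of a longer path: with the words `abcd` and `bc`, the text `abcx` never completes `abcd`, but `bc` is still found through the failure pointer of the c in abc.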

AC automaton matching efficiency

  1. Building the Trie takes O(m*len) time, where m is the number of pattern strings and len is the average length of the pattern strings.
  2. When building the failure pointers, the most expensive part is the while loop that follows failure pointers upward level by level. Each iteration moves up at least one level, and the height of the tree does not exceed len (taking len here as an upper bound on pattern length), so the time complexity is O(K*len), where K is the number of nodes in the Trie. This is a loose upper bound.
  3. The two steps above run only once to build the automaton, so they do not affect the cost of matching against the main string. During matching, the most expensive part is again the while loop that follows failure pointers, which costs O(len) per character. If the main string has length n, the total matching time complexity is O(n*len).

In practice, sensitive words are not very long on average, so the matching efficiency of the AC automaton is very close to O(n). Only in extreme cases does it degrade to the same efficiency as Trie matching.

Extreme cases are as follows:
[Figure: extreme-case AC automaton]
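The original figure is not available here, so the following sketch uses an assumed pattern set that behaves badly: `a`, `aa`, `aaa`, `aaaa`, where every pattern is a suffix of the next. The hypothetical method `countFailHops` runs the same matching loops as above on an a-z text and counts how many times a failure pointer is followed, which is the quantity that the O(n*len) bound describes.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class ExtremeCaseDemo {
    static class AcNode {
        char data;
        AcNode[] children = new AcNode[26];
        AcNode fail;
        AcNode(char data) { this.data = data; }
    }

    // Hypothetical helper: builds the Trie for a-z patterns and then the failure pointers.
    static AcNode build(String... patterns) {
        AcNode root = new AcNode('/');
        for (String pattern : patterns) {
            AcNode p = root;
            for (char ch : pattern.toCharArray()) {
                int idx = ch - 'a';
                if (p.children[idx] == null) p.children[idx] = new AcNode(ch);
                p = p.children[idx];
            }
        }
        Queue<AcNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            AcNode p = queue.remove();
            for (AcNode pc : p.children) {
                if (pc == null) continue;
                AcNode q = p.fail;
                while (q != null && q.children[pc.data - 'a'] == null) q = q.fail;
                pc.fail = (q == null) ? root : q.children[pc.data - 'a'];
                queue.add(pc);
            }
        }
        return root;
    }

    // Counts every failure-pointer hop made while scanning text (text must be a-z only).
    static long countFailHops(AcNode root, String text) {
        long hops = 0;
        AcNode p = root;
        for (int i = 0; i < text.length(); i++) {
            int idx = text.charAt(i) - 'a';
            while (p.children[idx] == null && p != root) {
                p = p.fail;
                hops++;
            }
            p = p.children[idx];
            if (p == null) p = root;
            // Same walk the match method uses to report pattern strings ending at i.
            for (AcNode tmp = p; tmp != root; tmp = tmp.fail) hops++;
        }
        return hops;
    }

    public static void main(String[] args) {
        AcNode root = build("a", "aa", "aaa", "aaaa");
        System.out.println(countFailHops(root, "aaaaaaaa"));
    }
}
```

With these patterns, almost every position of the text pays for a walk over the full depth of the Trie, so the hop count grows with n*len rather than with n.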

Origin blog.csdn.net/m0_37264516/article/details/86177992