Multiple pattern string matching
Multi-pattern string matching is common in practice, for example when a platform filters sensitive terms out of user posts.
A string matching algorithm is used to find sensitive terms in the text and replace them with "***". A single-pattern string matching algorithm could search for the sensitive terms one by one and replace them, but if the dictionary of sensitive terms is large and there is a lot of text to check, matching takes too long and sending a message becomes slow, which clearly degrades the user experience.
Therefore, an efficient algorithm for matching multiple pattern strings is needed for this scenario.
Filter sensitive words based on Trie tree
A Trie is itself a data structure suited to multi-pattern string matching: all pattern strings are built into one Trie, and when the set of pattern strings changes, only the Trie needs to be updated.
To match the main string, we walk the Trie character by character, starting from the first character of the main string. When a character fails to match, we move the starting position in the main string forward by one character and resume matching from the root of the Trie. This way, each starting position is checked against all pattern strings in a single walk, which is far more efficient than running a single-pattern matching algorithm once per pattern.
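The Trie-based scan described above can be sketched as follows. This is a minimal illustration, not the article's own code: the class name TrieScanSketch, the "start:length" result format, and the example patterns are assumptions, and the character set is assumed to be a~z only.

```java
import java.util.ArrayList;
import java.util.List;

class TrieScanSketch {
    // Minimal Trie node; assumes patterns contain only 'a'..'z'.
    static class Node {
        Node[] children = new Node[26];
        boolean isEnd = false; // true if a pattern string ends at this node
    }

    // Inserts every pattern into one Trie and returns its root.
    static Node build(String[] patterns) {
        Node root = new Node();
        for (String pat : patterns) {
            Node p = root;
            for (char c : pat.toCharArray()) {
                int idx = c - 'a';
                if (p.children[idx] == null) p.children[idx] = new Node();
                p = p.children[idx];
            }
            p.isEnd = true;
        }
        return root;
    }

    // Returns "start:length" for every occurrence of any pattern in text.
    static List<String> findAll(Node root, String text) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node p = root; // every start position restarts from the root
            for (int i = start; i < text.length(); i++) {
                int idx = text.charAt(i) - 'a';
                if (idx < 0 || idx >= 26) break; // character outside a~z
                p = p.children[idx];
                if (p == null) break;            // mismatch: try the next start position
                if (p.isEnd) hits.add(start + ":" + (i - start + 1));
            }
        }
        return hits;
    }
}
```

With the hypothetical patterns "abcd" and "bcf", calling findAll on the text "xabcfabcd" reports "bcf" at index 2 and "abcd" at index 5. Note that after each mismatch the walk restarts from the root, which is the cost the failure pointer in the next section avoids.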
AC automaton principle
The Trie-based multi-pattern matching algorithm above is similar to the brute-force algorithm in single-pattern matching. We know that brute-force matching can be made more efficient by introducing the next array, which gives the KMP algorithm. Can the idea of the next array be applied to multi-pattern matching as well?
The answer is yes, and the Trie only needs a slight modification. Rather than adding a next array, we add a next pointer to each node of the Trie, called the failure pointer.
As shown in the figure, the failure pointer of the character c in the string abc points to the character c in the string bcf. Suppose we have matched abc and then find that the next character of the main string does not match d. At this point we can follow the failure pointer of c to the c in bcf and continue by trying to match f.
This way, a mismatch no longer forces matching to start over. The idea is the same as the next array. If you are not familiar with the principle of the KMP algorithm, it is recommended to understand the next array first; the failure pointer is then easy to follow (recommended reading on KMP: Famous string matching algorithm: KMP algorithm principle analysis and code implementation).
The next question is: how do we find the node that each node's failure pointer points to?
The Trie contains all pattern strings. Suppose we want the failure pointer of the node c on the abc path in the figure above. What we know is that on reaching c, abc is a prefix that has already matched the main string. The pattern string to try next should be another pattern string that has a suffix of abc as its prefix, and that shared part should be the longest possible.
The suffix substrings of abc are c and bc. Among the prefix substrings of the other pattern strings, only the bc of bcf matches, so the failure pointer of the c in abc should point to the c in bcf.
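The rule above can be checked by brute force on a small example. The sketch below is only an illustration of the definition, not how the automaton computes failure pointers; the class name FailureTargetSketch, the method name, and the pattern set {"abcd", "bcf"} are assumptions, since the original figure is not available.

```java
class FailureTargetSketch {
    // Returns the longest proper suffix of `matched` that is also a prefix
    // of some pattern, or "" if no suffix qualifies (fall back to the root).
    static String longestSuffixPrefix(String matched, String[] patterns) {
        // Suffixes are tried from longest to shortest, so the first hit is the longest.
        for (int start = 1; start < matched.length(); start++) {
            String suffix = matched.substring(start);
            for (String pat : patterns) {
                if (pat.startsWith(suffix)) return suffix;
            }
        }
        return "";
    }
}
```

For the matched prefix "abc" and the assumed patterns, the result is "bc", so the failure pointer of the c in abc points to the node reached by walking b, c from the root, which is the c in bcf.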
Build an automaton
Building the automaton takes two steps:
- Build a Trie tree
- Initialize the failure pointer of each node
First, let's take a look at the data structure of each node:
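public class AcNode {
    public char data;                           // character stored in this node
    public AcNode[] children = new AcNode[26];  // character set is limited to the 26 letters a~z
    public boolean isEndingChar = false;        // true if a pattern string ends at this node
    public int length = -1;                     // length of the pattern string ending here
    public AcNode fail;                         // failure pointer
    public AcNode(char data) {
        this.data = data;
    }
}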
Compared with a plain Trie node, the only addition is the failure pointer.
Therefore, the first step in building the automaton is to build a Trie, which is not covered in detail here (see Trie tree construction principles, application scenarios and complexity analysis).
The question now is: once the Trie is built, how do we obtain the failure pointers of all nodes?
From the analysis above, finding the node that a node's failure pointer points to means finding the longest pattern prefix that matches a suffix of the string spelled out by the path from the root to that node.
In a Trie, a node's failure pointer can only point to a node at a shallower level than itself. The failure pointers can therefore be obtained the same way as the next array: the failure pointer of the current node is derived from nodes whose failure pointers are already known, processing the Trie level by level.
The failure pointer of the root node is null (conceptually, the root falls back to itself). Now suppose the failure pointer of some node p is known and points to node q. How do we obtain the failure pointers of p's child nodes?
Situation 1: Compare a child node pc of p with the child nodes of q. If q has a child qc with the same character, the failure pointer of pc is qc.
Situation 2: If q has no child equal to pc, follow q's failure pointer to the next node and repeat the search among its child nodes. If this continues until null is reached, the failure pointer of pc is the root node.
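Here is the code to build the failure pointers:
// Requires: import java.util.LinkedList; import java.util.Queue;
public void buildFailurePointer(AcNode root) {
    Queue<AcNode> queue = new LinkedList<>();
    root.fail = null;
    queue.add(root);
    // Breadth-first traversal: a node's failure pointer is set before its children are processed
    while (!queue.isEmpty()) {
        AcNode p = queue.remove(); // take node p
        for (AcNode pc : p.children) {
            // iterate over the child nodes of p
            if (pc == null) continue;
            if (p == root) {
                // children of root fail back to root
                pc.fail = root;
            } else {
                AcNode q = p.fail; // q is the node p's failure pointer points to
                while (q != null) {
                    // check whether q has a child with the same character as pc
                    AcNode qc = q.children[pc.data - 'a'];
                    if (qc != null) {
                        // found: this is pc's failure pointer
                        pc.fail = qc;
                        break;
                    }
                    q = q.fail; // otherwise keep following failure pointers
                }
                if (q == null) {
                    // reached null, so the failure pointer is root
                    pc.fail = root;
                }
            }
            queue.add(pc);
        }
    }
}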
Once the failure pointers are constructed, the Trie looks as shown in the figure:
Use AC automaton matching
Given the main string str, matching starts from the first character of the main string, and the automaton starts from the pointer p = root.
- If p has a child node x equal to str[0], update p to x, then follow the chain of failure pointers starting from p (which now points to x) and check whether any node on it is the end of a pattern string. If so, a matching pattern string has been found. After that, continue with str[1].
- If at some step no child node of p matches the current character, the failure pointer comes in handy: we search among the child nodes of the node that the failure pointer points to.
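public void match(char[] str, AcNode root) {
    // str is the main string (assumed to contain only a~z), root is the automaton
    AcNode p = root;
    for (int i = 0; i < str.length; i++) {
        int idx = str[i] - 'a';
        // if p has no matching child, look among the children of p's failure node,
        // and keep going until we are back at root
        while (p.children[idx] == null && p != root) {
            p = p.fail; // this is where the failure pointer does its work
        }
        p = p.children[idx]; // after finding a matching character, move p to that node
        if (p == null) { // nothing matched: restart from root
            p = root;
        }
        AcNode tmp = p;
        while (tmp != root) {
            // report every pattern string that ends at the current position
            if (tmp.isEndingChar) {
                int pos = i - tmp.length + 1;
                System.out.println("match start index " + pos + "; length " + tmp.length);
            }
            tmp = tmp.fail;
        }
    }
}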
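Putting the pieces together, the following is a hedged end-to-end sketch of sensitive-word filtering with the automaton. It mirrors the AcNode fields and the build and match logic above, but the class name AcFilterSketch, the method names build and mask, and the example patterns are assumptions. It also masks each matched character with '*' rather than substituting a fixed "***", which keeps overlapping matches simple, and it resets to the root on any character outside a~z.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

class AcFilterSketch {
    static class AcNode {
        char data;
        AcNode[] children = new AcNode[26]; // a~z only
        boolean isEndingChar = false;
        int length = -1;                    // pattern length when isEndingChar is true
        AcNode fail;
        AcNode(char data) { this.data = data; }
    }

    // Builds the Trie, then fills in failure pointers breadth-first.
    static AcNode build(String[] patterns) {
        AcNode root = new AcNode('/');
        for (String pat : patterns) {
            AcNode p = root;
            for (char c : pat.toCharArray()) {
                int idx = c - 'a';
                if (p.children[idx] == null) p.children[idx] = new AcNode(c);
                p = p.children[idx];
            }
            p.isEndingChar = true;
            p.length = pat.length();
        }
        Queue<AcNode> queue = new ArrayDeque<>();
        queue.add(root); // root.fail stays null
        while (!queue.isEmpty()) {
            AcNode p = queue.remove();
            for (AcNode pc : p.children) {
                if (pc == null) continue;
                AcNode q = p.fail;
                // follow failure pointers until some node has a child with pc's character
                while (q != null && q.children[pc.data - 'a'] == null) q = q.fail;
                pc.fail = (q == null) ? root : q.children[pc.data - 'a'];
                queue.add(pc);
            }
        }
        return root;
    }

    // Returns text with every character covered by a matched pattern replaced by '*'.
    static String mask(String text, AcNode root) {
        char[] out = text.toCharArray();
        AcNode p = root;
        for (int i = 0; i < text.length(); i++) {
            int idx = text.charAt(i) - 'a';
            if (idx < 0 || idx >= 26) { p = root; continue; } // outside a~z: restart
            while (p.children[idx] == null && p != root) p = p.fail;
            p = p.children[idx];
            if (p == null) p = root;
            for (AcNode tmp = p; tmp != root; tmp = tmp.fail) {
                if (tmp.isEndingChar) Arrays.fill(out, i - tmp.length + 1, i + 1, '*');
            }
        }
        return new String(out);
    }
}
```

With the hypothetical patterns "abcd" and "bcf", mask("xabcfabcd", root) leaves "xa" untouched and masks the remaining seven characters. The match of "bcf" is reached through the failure pointer of the c in abc, without rescanning any character of the text.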
AC automaton matching efficiency
- The complexity of Trie tree construction is O(m*len), where m is the number of pattern strings and len is the average length of the pattern strings.
- When constructing the failure pointers, the most time-consuming part is following failure pointers layer by layer in the while loop. Each iteration goes up at least one layer, and the height of the tree does not exceed len (taking len here as a bound on the pattern length). Therefore, the time complexity is O(K*len), where K is the number of nodes in the Trie tree.
- The above two steps only need to be executed once to complete the construction, so they do not affect the efficiency of matching against the main string. During matching, the most time-consuming part is likewise the while loop that follows failure pointers, so the cost per character is O(len). If the length of the main string is n, the total matching time complexity is O(n*len).
In practice, when matching sensitive words, the average length of the sensitive words is not very long, so the matching efficiency of the AC automaton is very close to O(n). Only in extreme cases does the efficiency degrade to that of Trie tree matching.
Extreme cases are as follows: