AC automaton: How to achieve multi-pattern matching sensitive word filtering?

Sensitive word filtering is built on string matching: we maintain a dictionary of sensitive words, and when a user submits text, a string matching algorithm checks whether the text contains any of them. Every sensitive word found is replaced with ***.

How can we build a high-performance sensitive word filtering system? With a multi-pattern string matching algorithm.

Sensitive word filtering with single-pattern matching and with a Trie

Of BF, RK, BM, KMP, and the Trie, the first four are single-pattern string matching algorithms; only the Trie is a multi-pattern matching algorithm.

Single-pattern matching looks for one pattern string in one main string. Multi-pattern matching looks for several pattern strings in one main string at the same time.
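
To make the difference concrete, here is a rough sketch of filtering with a single-pattern search repeated once per sensitive word. The class name NaiveFilter and the sample words are made up for illustration, and String.indexOf stands in for whichever single-pattern algorithm (BF, RK, BM, KMP) is used. The main string is scanned once for every word, which is the cost that multi-pattern matching tries to avoid.

```java
import java.util.Arrays;

/** Illustrative only: one single-pattern scan of the text per sensitive word. */
final class NaiveFilter {
    static String filter(String text, String... words) {
        char[] out = text.toCharArray();
        for (String w : words) {
            if (w.isEmpty()) continue;               // an empty word would match everywhere
            int from = 0;
            // indexOf returns -1 once there are no more occurrences of w
            while ((from = text.indexOf(w, from)) >= 0) {
                Arrays.fill(out, from, from + w.length(), '*');
                from++;                              // keep looking for overlapping occurrences
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // prints "a *** and **** day"
        System.out.println(filter("a bad and ugly day", "bad", "ugly"));
    }
}
```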

How do we filter sensitive words with a Trie?

First preprocess the sensitive words into a Trie. If the sensitive word list changes, only the Trie needs to be updated. When a user submits text, treat it as the main string and match it against the Trie starting from its first character. When the match reaches a leaf node of the Trie, or hits a character that does not match, move the starting position in the main string forward by one character and restart the match from the root of the Trie.
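
The procedure above might be sketched as follows. This is a minimal illustration, not the original author's code: the class name TrieFilter and the sample words are hypothetical, and it assumes the sensitive words contain only the letters a~z. Any other character in the text simply ends the current match attempt.

```java
/** Minimal Trie-based filter sketch; assumes sensitive words use only 'a'..'z'. */
final class TrieFilter {
    private static final class Node {
        final Node[] children = new Node[26];
        boolean isEndingChar;                  // true if a sensitive word ends here
    }

    private final Node root = new Node();

    /** Adds one sensitive word to the Trie. */
    void insert(String word) {
        Node p = root;
        for (char ch : word.toCharArray()) {
            int idx = ch - 'a';
            if (p.children[idx] == null) p.children[idx] = new Node();
            p = p.children[idx];
        }
        p.isEndingChar = true;
    }

    /** Replaces every character covered by a sensitive word with '*'. */
    String filter(String text) {
        char[] out = text.toCharArray();
        for (int start = 0; start < text.length(); start++) {   // try every start position
            Node p = root;
            for (int i = start; i < text.length(); i++) {
                char ch = text.charAt(i);
                if (ch < 'a' || ch > 'z') break;                 // outside the assumed alphabet
                p = p.children[ch - 'a'];
                if (p == null) break;                            // mismatch: move start forward
                if (p.isEndingChar) {
                    for (int k = start; k <= i; k++) out[k] = '*';
                }
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        TrieFilter f = new TrieFilter();
        f.insert("bad");
        f.insert("ugly");
        System.out.println(f.filter("a bad and ugly day")); // prints "a *** and **** day"
    }
}
```

Each mismatch throws away the progress made so far and restarts from the root, which is the inefficiency that failure pointers address.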

A more efficient, classic multi-pattern matching algorithm: the AC automaton

The relationship between the Trie and the AC automaton mirrors the relationship between the naive single-pattern matching algorithm and KMP, except that the Trie works on multiple pattern strings. The AC automaton is a Trie with something like KMP's next array added on top; the difference is that this next array is built on a tree.

public class AcNode {
	public char data;
	public AcNode[] children = new AcNode[26]; // character set is limited to the 26 letters a~z
	public boolean isEndingChar = false;       // true if a pattern string ends at this node
	public int length = -1;                    // pattern string length, set when isEndingChar is true
	public AcNode fail;                        // failure pointer
	public AcNode(char data) {
		this.data = data;
	}
}

Building an AC automaton takes two steps:

  • Build a Trie from the pattern strings
  • Build the failure pointers on the Trie (the counterpart of KMP's failure function, the next array)

After the Trie is built, how do we build the failure pointers on top of it?

Suppose there are four pattern strings, c, bc, bcd, and abcd, and the main string is abcd. The Trie built from them looks like this:

        root
       /  |  \
      a   b   c
      |   |
      b   c
      |   |
      c   d
      |
      d

Every node in the Trie has a failure pointer. Suppose p has walked down the Trie to the node c at the end of the path a, b, c (the node highlighted in red in the original figure). The string formed from root to that node is abc. The failure pointer of p points to the last node of the longest pattern-string prefix that matches a suffix substring of abc. Here that prefix is bc, so the pointer goes to the node c at the end of the path b, c.

On the longest matchable suffix substring: the string abc has two suffix substrings, bc and c. We try to match each of them against the other pattern strings. If a suffix substring matches a prefix of some pattern string, we call it a matchable suffix substring. The longest of the matchable suffix substrings is the longest matchable suffix substring.

The failure pointer of node p points to the last node of the pattern-string prefix that corresponds to this longest matchable suffix substring. In the example, it goes from the c at the end of abc to the c at the end of bc.

Building the failure pointers is a level-by-level (breadth-first) traversal of the tree. The failure pointer of root is null, which is treated as pointing to root itself.

Once we have the failure pointer of a node p, how do we find the failure pointers of its child nodes?

Suppose the failure pointer of node p points to node q. For a child pc of p, check whether q has a child with the same character. If q has such a child qc, point the failure pointer of pc to qc. If q has no child with that character, set q = q->fail (fail denotes the failure pointer) and look again, repeating until root has been checked. If no child with the same character is found along the way, point the failure pointer of pc to root.

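Code for building the failure pointers:
// requires: import java.util.LinkedList; import java.util.Queue;
public void buildFailurePointer() {
	Queue<AcNode> queue = new LinkedList<>();
	root.fail = null;
	queue.add(root);
	while (!queue.isEmpty()) {             // breadth-first, level by level
		AcNode p = queue.remove();
		for (int i = 0; i < 26; ++i) {
			AcNode pc = p.children[i];
			if (pc == null) continue;
			if (p == root) {
				pc.fail = root;            // children of root always fail back to root
			} else {
				AcNode q = p.fail;
				while (q != null) {        // walk up the failure chain of p
					AcNode qc = q.children[pc.data - 'a'];
					if (qc != null) {
						pc.fail = qc;
						break;
					}
					q = q.fail;
				}
				if (q == null) {           // no node on the chain has a matching child
					pc.fail = root;
				}
			}
			queue.add(pc);
		}
	}
}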
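
As a worked example, the following self-contained sketch builds the Trie for the four pattern strings c, bc, bcd, and abcd and lists where each node's failure pointer lands. The class FailurePointerDemo and the path field are additions made only for this illustration; the construction follows the same breadth-first idea as the method above, with the root case folded into the general loop.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

final class FailurePointerDemo {
    static final class Node {
        final Node[] children = new Node[26];
        final String path;   // string spelled from root to this node (demo-only field)
        Node fail;
        Node(String path) { this.path = path; }
    }

    /** Builds Trie and failure pointers; maps each node's path to its failure target. */
    static Map<String, String> failureTable(String... patterns) {
        Node root = new Node("");
        for (String pat : patterns) {
            Node p = root;
            for (char ch : pat.toCharArray()) {
                int idx = ch - 'a';
                if (p.children[idx] == null) p.children[idx] = new Node(p.path + ch);
                p = p.children[idx];
            }
        }
        Map<String, String> table = new LinkedHashMap<>();   // keeps BFS visiting order
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node p = queue.remove();
            for (int i = 0; i < 26; i++) {
                Node pc = p.children[i];
                if (pc == null) continue;
                Node q = p.fail;                              // null when p is root
                while (q != null && q.children[i] == null) q = q.fail;
                pc.fail = (q == null) ? root : q.children[i];
                table.put(pc.path, pc.fail == root ? "root" : pc.fail.path);
                queue.add(pc);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // prints {a=root, b=root, c=root, ab=b, bc=c, abc=bc, bcd=root, abcd=bcd}
        System.out.println(failureTable("c", "bc", "bcd", "abcd"));
    }
}
```

The entry abc=bc is the case discussed earlier: the longest suffix substring of abc that is also a prefix of a pattern string is bc.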

How do we match the main string against the AC automaton?

During matching, the main string a is scanned from i = 0, and the AC automaton starts with the pointer p = root.

  • If the node p points to has a child x equal to a[i], move p to x. Then follow the chain of failure pointers from p to check whether any node on it ends a pattern string and handle the matches found. After that, increment i and continue.
  • If the node p points to has no child equal to a[i], set p = p->fail and check again.

The matching code follows. It prints the start position in the main string of every pattern string that can be matched:

public void match(char[] text) {           // text is the main string; assumed to contain only a~z
	int n = text.length;
	AcNode p = root;
	for (int i = 0; i < n; ++i) {
		int idx = text[i] - 'a';
		while (p.children[idx] == null && p != root) {
			p = p.fail;                    // this is where the failure pointer does its work
		}
		p = p.children[idx];
		if (p == null) p = root;           // nothing matched, restart from root
		AcNode tmp = p;
		while (tmp != root) {              // print every pattern string that ends at position i
			if (tmp.isEndingChar) {
				int pos = i - tmp.length + 1;
				System.out.println("match start index " + pos + "; length " + tmp.length);
			}
			tmp = tmp.fail;
		}
	}
}
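
Putting the pieces together, a filter that replaces matches with * instead of printing them could look roughly like the sketch below. It is an illustration under stated assumptions rather than a production implementation: the class name AcFilter is made up, the sensitive words are assumed to contain only a~z, and any other character in the text resets the match to root.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

final class AcFilter {
    private static final class Node {
        final Node[] children = new Node[26];
        int length = -1;                       // pattern length if a pattern ends here
        Node fail;
    }

    private final Node root = new Node();

    AcFilter(String... words) {
        for (String w : words) insert(w);
        buildFailurePointers();
    }

    private void insert(String word) {
        Node p = root;
        for (char ch : word.toCharArray()) {
            int idx = ch - 'a';
            if (p.children[idx] == null) p.children[idx] = new Node();
            p = p.children[idx];
        }
        p.length = word.length();
    }

    private void buildFailurePointers() {
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node p = queue.remove();
            for (int i = 0; i < 26; i++) {
                Node pc = p.children[i];
                if (pc == null) continue;
                Node q = p.fail;               // null when p is root
                while (q != null && q.children[i] == null) q = q.fail;
                pc.fail = (q == null) ? root : q.children[i];
                queue.add(pc);
            }
        }
    }

    /** Single pass over the text; masks every matched sensitive word with '*'. */
    String filter(String text) {
        char[] out = text.toCharArray();
        Node p = root;
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch < 'a' || ch > 'z') { p = root; continue; }   // outside the assumed alphabet
            int idx = ch - 'a';
            while (p != root && p.children[idx] == null) p = p.fail;
            if (p.children[idx] != null) p = p.children[idx];
            for (Node tmp = p; tmp != root; tmp = tmp.fail) {   // all patterns ending at i
                if (tmp.length > 0) Arrays.fill(out, i - tmp.length + 1, i + 1, '*');
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        AcFilter filter = new AcFilter("c", "bc", "bcd", "abcd");
        System.out.println(filter.filter("xabcdx")); // prints "x****x"
    }
}
```

Unlike the plain Trie approach, the position i in the main string never moves backwards; only p jumps along failure pointers.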

Is a sensitive word filtering system based on the AC automaton more efficient than single-pattern string matching?

Building the AC automaton from the sensitive words involves two steps: building the Trie and building the failure pointers.

In terms of time complexity, matching with an AC automaton is about as efficient as matching with a Trie. In practice, most failure pointers point to root, so the failure chains followed during matching are usually short.

https://www.cnblogs.com/sclbgw7/p/9260756.html

https://www.cnblogs.com/hyfhaha/p/10802604.html

Origin blog.csdn.net/ywangjiyl/article/details/104525867