AC multi-pattern matching algorithm

The article is roughly divided into the following 3 parts:

1. Application background;

2. Introduction to AC algorithm and its principle;

3. Java implementation of AC algorithm;

 

1. Application background

Internet applications commonly use keyword detection to stop users from publishing content that contains specified keywords. Game chat systems, character-name checks, forum posts, live-stream bullet comments (danmaku), and similar features all need to check user-submitted content for sensitive keywords.

 

There are usually many keywords to detect: insulting terms, politically sensitive terms, system-specific reserved words, and so on. It is no exaggeration to say that a keyword list often holds thousands or even tens of thousands of entries. At that scale, efficiency becomes a prominent concern: slow keyword detection can be fatal to a large-scale Internet application.

 

Take 8000 keywords as an example. If each keyword is checked separately (for example, with one regular expression or substring search per keyword), the content posted by a user has to be traversed 8000 times. If 100, 1000, or 10,000 users publish content within the same second, the CPU overhead that keyword detection alone places on the server is easy to imagine.
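To make that cost concrete, here is a minimal sketch of the per-keyword approach. The names NaiveKeywordCheck and findKeywords are hypothetical and not part of the article's code; String.contains stands in for whatever per-keyword check (regular expression or substring search) an application might use.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveKeywordCheck {
    /**
     * Returns the keywords that occur in content.
     * The content is scanned once per keyword, so the work grows
     * with the number of keywords as well as with the content length.
     */
    static List<String> findKeywords(String content, List<String> keywords) {
        List<String> hits = new ArrayList<>();
        for (String keyword : keywords) {
            // One independent pass over the content for every keyword
            if (content.contains(keyword)) {
                hits.add(keyword);
            }
        }
        return hits;
    }
}
```

With 8000 keywords the loop body runs 8000 times for every message, which is the overhead the AC algorithm avoids.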

 

The AC multi-pattern matching algorithm solves this efficiency problem. Its matching time complexity is O(n), where n is the length of the content posted by the user (plus the number of matches, if every occurrence has to be reported). It is essentially independent of the number of keywords.

 

2. Introduction to AC algorithm and its principle

The AC (Aho-Corasick) algorithm is a classic multi-pattern matching algorithm published by Alfred V. Aho (co-author of "Compilers: Principles, Techniques, and Tools") and Margaret J. Corasick in 1975. Given a text of length n and a pattern set P = {p1, p2, ..., pm}, it finds all occurrences of the patterns in the text in O(n) time once the keyword tree has been built (plus the number of matches reported), regardless of the size m of the pattern set.

 

The implementation of the AC algorithm can be roughly divided into three steps:
1) Build a tree structure from the sensitive words and mark the end nodes (a node is an end node if a sensitive word ends at it);
2) For each node of the tree, build its failure node (the node to jump to, when matching fails at this node, in order to continue matching);
3) Traverse the user content once and, for each character, look it up in the sensitive-word tree, starting from the current node.
If the match succeeds, the current node moves down to the corresponding child node. If the new current node is an end node, a sensitive word has been found;
if the match fails, the current node jumps to its failure node and matching continues from there, until the match succeeds or the current node is the root node.

 

1) Build a tree structure from the sensitive words and mark the end nodes (a node is an end node if a sensitive word ends at it).
First, there must be a root node;
then traverse all sensitive words and put each character of each word onto the tree as a node. The node of the preceding character is the parent of the node of the current character. If the parent already has a child for that character, no new node is added and the existing one is reused. If the character is the last character of the sensitive word, mark its node as an end node.

For example, take the sensitive words: he, hers, his, erase
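The tree-building step can be sketched as follows. This is a simplified illustration, not the article's Node class: TrieSketch, TrieNode, build, and countNodes are hypothetical names, and the node holds only its children and an end-node flag.

```java
import java.util.HashMap;
import java.util.Map;

class TrieSketch {
    static class TrieNode {
        final Map<Character, TrieNode> children = new HashMap<>();
        boolean terminal; // true if a keyword ends at this node
    }

    /** Inserts every keyword character by character, reusing existing nodes. */
    static TrieNode build(String... words) {
        TrieNode root = new TrieNode();
        for (String word : words) {
            TrieNode cur = root;
            for (char c : word.toCharArray()) {
                // Create the child only if this character has no node yet
                cur = cur.children.computeIfAbsent(c, k -> new TrieNode());
            }
            if (cur != root) {
                cur.terminal = true; // last character of the keyword
            }
        }
        return root;
    }

    /** Counts the nodes of the subtree rooted at node, including node itself. */
    static int countNodes(TrieNode node) {
        int count = 1;
        for (TrieNode child : node.children.values()) {
            count += countNodes(child);
        }
        return count;
    }
}
```

For the four words above, "he" and "hers" share the nodes h and e, and "his" shares h, so the tree has 11 nodes besides the root: h, he, her, hers, hi, his, e, er, era, eras, erase.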

 

2) For each node of the tree, build its failure node (the node to jump to, when matching fails, in order to continue matching).
The failure node of every first-level child node is the root node;
for any other node, reached from its parent by some character, look for a child with the same character under the failure node of its parent. If there is none, continue the search under the failure node of that failure node, and so on, until a matching child is found (it becomes the failure node) or the root node is reached without a match (the root node becomes the failure node).
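The rule above can be sketched as follows. To keep the example short, each node is named by its path from the root (the empty string is the root) instead of being a linked object as in the article's implementation; FailLinkSketch and failLinks are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FailLinkSketch {
    /** Returns the failure node of every tree node; nodes are named by their path, "" is the root. */
    static Map<String, String> failLinks(String... words) {
        // Every non-empty prefix of a keyword is a node of the tree
        Set<String> nodes = new HashSet<>();
        for (String word : words) {
            for (int i = 1; i <= word.length(); i++) {
                nodes.add(word.substring(0, i));
            }
        }
        // Process shallower nodes first, as a breadth-first traversal would
        List<String> order = new ArrayList<>(nodes);
        order.sort(Comparator.comparingInt(String::length));

        Map<String, String> fail = new HashMap<>();
        for (String node : order) {
            String parent = node.substring(0, node.length() - 1);
            char c = node.charAt(node.length() - 1);
            if (parent.isEmpty()) {
                fail.put(node, ""); // first-level node: fail to the root
                continue;
            }
            // Start from the parent's failure node and follow failure nodes
            // until one has a child for c, or the root is reached
            String f = fail.get(parent);
            while (!f.isEmpty() && !nodes.contains(f + c)) {
                f = fail.get(f);
            }
            fail.put(node, nodes.contains(f + c) ? f + c : "");
        }
        return fail;
    }
}
```

For he, hers, his, erase this gives, for example, he -> e, her -> er, erase -> e, and hers -> root. In other words, the failure node of a node is the longest proper suffix of its path that is also a path in the tree.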

 

3) Traverse the user content once and, for each character, look it up in the sensitive-word tree, starting from the current node.
If the match succeeds, the current node moves down to the corresponding child node. If the new current node is an end node, a sensitive word has been found. A sensitive word that is a suffix of the current path also ends at this position; it can be found by following the failure nodes of the current node;
if the match fails, the current node jumps to its failure node and matching continues from there, until the match succeeds or the current node is the root node.

Take the text "herase" as an example to demonstrate.
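The walk over "herase" can be traced with the sketch below. It is an illustration under simplifying assumptions, not the article's implementation: nodes are named by their path from the root (the empty string is the root), and MatchTraceSketch, trace, and buildFail are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MatchTraceSketch {
    /** Returns the current node (named by its path, "" = root) after each character of text. */
    static List<String> trace(String text, String... words) {
        // Every non-empty prefix of a keyword is a node of the tree
        Set<String> nodes = new HashSet<>();
        for (String word : words) {
            for (int i = 1; i <= word.length(); i++) {
                nodes.add(word.substring(0, i));
            }
        }
        Map<String, String> fail = buildFail(nodes);
        List<String> states = new ArrayList<>();
        String cur = "";
        for (char c : text.toCharArray()) {
            // On a mismatch, jump to the failure node until a child for c exists
            // or the root is reached
            while (!cur.isEmpty() && !nodes.contains(cur + c)) {
                cur = fail.get(cur);
            }
            if (nodes.contains(cur + c)) {
                cur = cur + c; // move down to the matching child
            }
            states.add(cur);
        }
        return states;
    }

    /** Computes failure nodes level by level, shallower nodes first. */
    static Map<String, String> buildFail(Set<String> nodes) {
        List<String> order = new ArrayList<>(nodes);
        order.sort(Comparator.comparingInt(String::length));
        Map<String, String> fail = new HashMap<>();
        for (String node : order) {
            String parent = node.substring(0, node.length() - 1);
            char c = node.charAt(node.length() - 1);
            String f = parent.isEmpty() ? "" : fail.get(parent);
            while (!parent.isEmpty() && !f.isEmpty() && !nodes.contains(f + c)) {
                f = fail.get(f);
            }
            boolean found = !parent.isEmpty() && nodes.contains(f + c);
            fail.put(node, found ? f + c : "");
        }
        return fail;
    }
}
```

Calling trace("herase", "he", "hers", "his", "erase") visits the nodes h, he, her, era, eras, erase. The end node he is reached after the second character. At the fourth character, her has no child for 'a', so the current node jumps to its failure node er and moves down to era. The end node erase is reached at the last character.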

3. Java implementation of AC algorithm

Data structure: a node of the keyword tree, Node.java

package cn.buddie.ac.model;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Node class
 *
 * @author buddie
 *
 */
public class Node {
	// Depth of the current node in the tree (the root is 0, its children are 1, ...)
	private int level;
	// Child nodes of the current node, keyed by lowercased character
	private Map<Character, Node> subNodes;
	// Node to jump to when matching fails at the current node
	private Node failNode;
	// Whether a keyword ends at the current node
	private boolean terminal;

	/**
	 * Whether the current node already contains the child node of the specified Key value
	 *
	 * @param c
	 * @return
	 */
	public boolean containSubNode(Character c) {
		if (this.subNodes == null || this.subNodes.isEmpty()) {
			return false;
		}
		return subNodes.containsKey(c);
	}

	/**
	 * Get the child node of the specified Key value
	 *
	 * @param c
	 * @return
	 */
	public Node getSubNode(Character c) {
		if (this.subNodes == null || this.subNodes.isEmpty()) {
			return null;
		}
		return subNodes.get(c);
	}

	/**
	 * Add a child node
	 * 
	 * @param c    the character that leads to the child node
	 * @param node the child node to add
	 */
	public void addSubNode(Character c, Node node) {
		if (this.subNodes == null) {
			this.subNodes = new HashMap<Character, Node>();
		}
		this.subNodes.put(c, node);
	}

	// getter & setter
	public int getLevel() {
		return level;
	}

	public void setLevel(int level) {
		this.level = level;
	}

	public Map<Character, Node> getSubNodes() {
		return subNodes == null ? Collections.emptyMap() : subNodes;
	}

	public void setSubNodes(Map<Character, Node> subNodes) {
		this.subNodes = subNodes;
	}

	public Node getFailNode() {
		return failNode;
	}

	public void setFailNode(Node failNode) {
		this.failNode = failNode;
	}

	public boolean isTerminal() {
		return terminal;
	}

	public void setTerminal(boolean terminal) {
		this.terminal = terminal;
	}

}

 

Build the keyword tree and the failure nodes, ACTree.java

package cn.buddie.ac.tree;

import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

import cn.buddie.ac.model.Node;

public class ACTree {
	private Node rootNode;

	public ACTree(String[] keyWords) {
		// Initialize the tree
		initTree(keyWords);
		// Build the failure jumps
		buildFailLink();
	}

	/**
	 * Initialize the tree
	 * 
	 * @param keyWords the keywords to insert into the tree
	 */
	private void initTree(String[] keyWords) {
		rootNode = new Node();
		rootNode.setSubNodes(new HashMap<Character, Node>());
		char[] charArray;
		for (String keyWord : keyWords) {
			if (keyWord == null || keyWord.isEmpty()) {
				continue;
			}
			charArray = keyWord.toLowerCase().toCharArray();
			buildKeyMap(charArray);
		}
	}

	/**
	 * Build the nodes for the given character array
	 * 
	 * @param charArray the characters of one keyword
	 */
	private void buildKeyMap(char[] charArray) {
		Character c;
		Node curNode = rootNode;
		Node node;
		for (int i = 0; i < charArray.length; i++) {
			c = charArray[i];
			if (curNode.containSubNode(c)) {
				node = curNode.getSubNode(c);
			} else {
				node = new Node();
				node.setLevel(i + 1);
				curNode.addSubNode(c, node);
			}
			if (i == charArray.length - 1) {
				node.setTerminal(true);
			}
			curNode = node;
		}
	}

	/**
	 * Build the failure jumps
	 */
	private void buildFailLink() {
		buildFirstLevelFailLink();
		buildOtherLevelFailLink();
	}

	/**
	 * The failure jump of every first-level child of the root node is the root node
	 */
	private void buildFirstLevelFailLink() {
		Collection<Node> nodes = rootNode.getSubNodes().values();
		for (Node node : nodes) {
			node.setFailNode(rootNode);
		}
	}

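	/**
	 * For every node other than the root and the first-level nodes, the failure jump is
	 * the corresponding child of its parent's failure node (or of a node further along
	 * the failure chain when that child does not exist)
	 */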
	private void buildOtherLevelFailLink() {
		Queue<Node> queue = new LinkedList<Node>(rootNode.getSubNodes().values());
		Node node;
		while (!queue.isEmpty()) {
			node = queue.remove();
			buildNodeFailLink(node, queue);
		}
	}

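	/**
	 * Build the failure jumps for the children of the given node
	 * 
	 * @param node  the node whose children are processed
	 * @param queue the breadth-first queue; the children are appended to it
	 */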
	private void buildNodeFailLink(Node node, Queue<Node> queue) {
		if (node.getSubNodes().isEmpty()) {
			return;
		}
		queue.addAll(node.getSubNodes().values());
		Node failNode = node.getFailNode();
		Set<Character> subNodeKeys = node.getSubNodes().keySet();
		Node subFailNode;
		for (Character key : subNodeKeys) {
			subFailNode = failNode;
			while (subFailNode != rootNode && !subFailNode.containSubNode(key)) {
				subFailNode = subFailNode.getFailNode();
			}
			subFailNode = subFailNode.getSubNode(key);
			if (subFailNode == null) {
				subFailNode = rootNode;
			}
			node.getSubNode(key).setFailNode(subFailNode);
		}
	}

	// getter
	public Node getRootNode() {
		return rootNode;
	}
}

 

Filter and replace utility class, ACFilter.java

package cn.buddie.ac.filter;

import org.apache.commons.lang.StringUtils;

import cn.buddie.ac.model.Node;
import cn.buddie.ac.tree.ACTree;

public class ACFilter {
	public static final Character REPLACE_CHAR = '*';

	private ACTree tree;

	public ACFilter(ACTree tree) {
		this.tree = tree;
	}

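	/**
	 * Filter: replace every keyword found in the text with REPLACE_CHAR
	 * 
	 * @param word the text to filter
	 * @return the filtered text, or the original text if it contains no keyword
	 */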
	public String filter(String word) {
		if (StringUtils.isEmpty(word)) {
			return "";
		}
		char[] words = word.toLowerCase().toCharArray();
		char[] result = null;
		Node curNode = tree.getRootNode();
		Node subNode;
		Character c;
		int fromPos = 0;
		for (int i = 0; i < words.length; i++) {
			c = words[i];
			subNode = curNode.getSubNode(c);
			while (subNode == null && curNode != tree.getRootNode()) {
				curNode = curNode.getFailNode();
				subNode = curNode.getSubNode(c);
			}
			if (subNode != null) {
				curNode = subNode;
			}
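			// Check the current node and the nodes on its failure chain: a keyword that
			// is a suffix of the current path also ends at position i. The first end
			// node found is the longest keyword ending here.
			for (Node n = curNode; n != tree.getRootNode(); n = n.getFailNode()) {
				if (!n.isTerminal()) {
					continue;
				}
				// Start of the match, skipping characters that are already replaced
				int pos = Math.max(i - n.getLevel() + 1, fromPos);
				if (result == null) {
					// Note: the result is built from the lowercased text
					result = word.toLowerCase().toCharArray();
				}
				for (; pos <= i; pos++) {
					result[pos] = REPLACE_CHAR;
				}
				fromPos = i + 1;
				break;
			}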
		}
		if (result == null) {
			return word;
		}
		return String.valueOf(result);
	}

}

 

 

Demo

package cn.buddie.ac;

import cn.buddie.ac.filter.ACFilter;
import cn.buddie.ac.tree.ACTree;

public class WordFilterTest {

	public static void main(String[] args) {
		String[] keyWords = new String[] { "he", "hers", "his", "erase" };
		ACTree tree = new ACTree(keyWords);
		ACFilter filter = new ACFilter(tree);
		String str = "herase";
		str = filter.filter(str);
		System.out.println(str); // prints "******": "he" covers positions 0-1, "erase" covers 1-5
	}
}

 
