Multi-pattern matching - History
Explanation
Multi-pattern (multi-mode) matching means finding every occurrence of a set of pattern strings (keywords) in a text string.
Typical scenarios for multi-pattern matching: (1) keyword filtering, (2) intrusion detection, (3) virus detection, (4) word segmentation, etc.
There are many algorithms for multi-pattern matching; the commonly used ones are (1) the trie, (2) the AC algorithm and (3) the WM algorithm.
AC (Aho-Corasick) algorithm
- AC automaton: the core idea is to use a finite automaton to turn character comparisons into state transitions;
- Important functions: the goto function, the failure function and the output function;
- goto function: guides the state transition when the next character matches;
- failure function: guides the state transition when the match fails;
- output function: describes the states in which a match has occurred, i.e. which patterns are recognised in each state;
- References:
Introduction to state machines: from definition to use http://www.sohu.com/a/230886244_114819
Multi-pattern string matching algorithm: Aho-Corasick (Java) https://www.cnblogs.com/hwyang/p/6836438.html
AC automaton + trie for efficient multi-pattern dictionary matching (Java source) https://blog.csdn.net/wangyangzhizhou/article/details/80964748
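The three functions described above can be sketched as three plain tables. This is a minimal illustration, not the code from the referenced articles: the names build_automaton and search, the integer state numbering and the keyword set are all assumptions made for this example.

```python
from collections import deque


def build_automaton(keywords):
    """Build the goto, failure and output tables (state 0 is the root)."""
    goto = [{}]    # goto[state][char] -> next state
    output = [[]]  # output[state] -> keywords recognised in that state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Failure links are filled in breadth-first order, so the link of a
    # shallower state is always known before a deeper state needs it.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the keywords that end at the failure target.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def search(text, goto, fail, output):
    """Return (start index, keyword) for every occurrence in text."""
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]          # failure function
        state = goto[state].get(ch, 0)   # goto function
        for word in output[state]:       # output function
            hits.append((i - len(word) + 1, word))
    return hits


tables = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", *tables))
```

The text is scanned once and never re-read; a mismatch only moves the automaton along failure links, which is why the matching time does not depend on how many keywords were loaded.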
WM (Wu-Manber) algorithm
- Combines hashing with the bad-character rule of the BM algorithm to achieve fast retrieval by jumping through the text in steps;
- Important table structures: the SHIFT, HASH and PREFIX tables;
- SHIFT table: the bad-character table, which gives the jump distance for each character block;
- HASH table: maps each character block of length B to the patterns that end with that block;
- PREFIX table: stores pattern prefixes so that candidates can be filtered by prefix, which increases efficiency.
- Recommended reading (start with the BM algorithm):
Boyer-Moore string matching algorithm (most detailed) http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html
Multi-pattern matching: the Wu-Manber algorithm https://blog.csdn.net/sunnianzhong/article/details/8875566
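To make the roles of the SHIFT and HASH tables concrete, here is a simplified sketch. It assumes a block size B of 2, requires every pattern to have at least B characters, and leaves out the PREFIX table; the names build_tables and wm_search and the sample text are hypothetical and are not taken from the referenced articles.

```python
def build_tables(patterns, B=2):
    """Build the SHIFT and HASH tables from the first m characters of each pattern."""
    m = min(len(p) for p in patterns)  # shortest pattern length
    if m < B:
        raise ValueError("every pattern needs at least B characters")
    shift = {}  # block -> safe jump distance (missing block: m - B + 1)
    hash_table = {}  # block ending the m-prefix -> candidate patterns
    for p in patterns:
        for j in range(B, m + 1):
            block = p[j - B:j]
            # Distance from this block to the end of the m-prefix.
            shift[block] = min(shift.get(block, m - B + 1), m - j)
        hash_table.setdefault(p[m - B:m], []).append(p)
    return m, shift, hash_table


def wm_search(text, patterns, B=2):
    """Return (start index, pattern) for every occurrence in text."""
    m, shift, hash_table = build_tables(patterns, B)
    hits = []
    pos = m  # exclusive end of the current window
    while pos <= len(text):
        block = text[pos - B:pos]
        jump = shift.get(block, m - B + 1)
        if jump == 0:
            # The block ends some pattern's m-prefix: verify the candidates.
            for p in hash_table.get(block, []):
                if text.startswith(p, pos - m):
                    hits.append((pos - m, p))
            pos += 1
        else:
            pos += jump
    return hits


text = "IL-9 expression and lipase in adult asthma"
print(wm_search(text, ["lipase", "asthma", "expression"]))
```

Blocks that occur in no pattern allow the largest jump, so the search can skip most of the text when the patterns are long and dissimilar. A single very short pattern shrinks every jump, which is one reason the matching speed depends on the loaded patterns.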
Code notes
- Input:
keyword array: ["research", "lipase", "horizontal", "asthma", "positive correlation", "expression", "study"]
text to be matched: "The results show that, IL-17A and IL-9 expression and lipase, CCL11 was positively correlated with decreased levels between adult asthma."
- Output:
successfully matched keywords: ['Research', 'results', 'Lipase', 'horizontal', 'asthma', 'positive correlation']
- Test description: with more than 20,000 keywords and a string of about 2,000 characters, extraction takes less than 2 seconds
AC and WM comparison:
- Memory footprint: the improved WM algorithm uses much less memory than the AC algorithm;
- Preprocessing: the improved WM algorithm needs much less preprocessing time than the AC algorithm;
- Matching speed: the matching speed of the WM algorithm depends strongly on the content of the loaded pattern strings;
- the matching speed of the AC algorithm is independent of the content of the loaded pattern strings;
- Prefixes: if a large number of patterns share similar prefixes, the SHIFT and HASH tables of the improved WM algorithm have more collisions and matching becomes slow.
Extensions
- Pattern matching: a basic string operation in data structures. Given a substring (the pattern), find all occurrences of it in a string. Pattern matching falls into two categories: single-pattern matching and multi-pattern matching;
- Common single-pattern matching algorithms: the naive algorithm (shift by one position after every comparison), Rabin-Karp (the naive algorithm with a hash comparison added), the KMP algorithm (computes the shift dynamically after each mismatch, which improves efficiency), the BM algorithm (compares from the end of the pattern and shifts dynamically using the bad-character rule), and so on;
- AC algorithm optimization: use a double array to reduce memory usage.
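As an illustration of the difference between the naive algorithm and KMP mentioned in the list above, the sketch below implements both for a single pattern. The function names are chosen for this example only; Rabin-Karp and BM are not shown.

```python
def naive_search(text, pattern):
    """Try every start position and shift by one after each attempt."""
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        if text[start:start + len(pattern)] == pattern:
            hits.append(start)
    return hits


def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Scan the text once; on a mismatch fall back using the failure table."""
    if not pattern:
        return []
    fail = build_failure(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return hits


print(build_failure("ababaca"))
print(naive_search("abababa", "aba"))
print(kmp_search("abababa", "aba"))
```

The failure table of KMP plays the same role for one pattern that the failure function of the AC automaton plays for a whole keyword set.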
Code
AC algorithm (Python implementation)
# -*- coding: utf-8 -*-
from collections import deque


class Node(object):
    def __init__(self):
        self.next = {}        # goto function: character -> child node
        self.fail = None      # failure link
        self.isWord = False   # True if a keyword ends at this node
        self.depth = 0        # length of the prefix this node represents


class Ahocorasick(object):
    def __init__(self):
        self.__root = Node()

    def addWord(self, word):
        """
        @param word: keyword to add to the trie
        """
        tmp = self.__root
        for ch in word:
            if ch not in tmp.next:
                tmp.next[ch] = Node()
                tmp.next[ch].depth = tmp.depth + 1
            tmp = tmp.next[ch]
        tmp.isWord = True

    def make(self):
        """
        Build the failure function (turns the trie into an automaton).
        """
        # The queue must be processed first-in first-out (breadth first),
        # so that the failure links of shallower nodes are set before use.
        tmpQueue = deque([self.__root])
        while tmpQueue:
            temp = tmpQueue.popleft()
            for k, v in temp.next.items():
                if temp is self.__root:
                    v.fail = self.__root
                else:
                    p = temp.fail
                    while p is not None:
                        if k in p.next:
                            v.fail = p.next[k]
                            break
                        p = p.fail
                    if p is None:
                        v.fail = self.__root
                tmpQueue.append(v)

    def search(self, content):
        """
        @return: list of matched keywords (can be changed to return the
                 start and end positions instead if needed)
        """
        p = self.__root
        result = []
        currentPosition = 0
        while currentPosition < len(content):
            word = content[currentPosition]
            # Follow failure links until a transition exists or the root is reached
            while word not in p.next and p is not self.__root:
                p = p.fail
            # State transition (back to the root if there is none)
            p = p.next.get(word, self.__root)
            # Report every keyword that ends here, including keywords
            # reachable through the failure links of the current state
            q = p
            while q is not self.__root:
                if q.isWord:
                    startWordIndex = currentPosition - q.depth + 1
                    # result.append((startWordIndex, currentPosition))
                    result.append(content[startWordIndex:currentPosition + 1])
                q = q.fail
            currentPosition += 1
        return result

    # def replace(self, content): "replace part of the characters".
    # The body of this method is truncated in the source and is not reconstructed here.


if __name__ == '__main__':
    # The set-up lines are truncated in the source; they are rebuilt here
    # from the keyword array given in the notes above (variable name ac is assumed).
    ac = Ahocorasick()
    for keyword in ["research", "lipase", "horizontal", "asthma",
                    "positive correlation", "expression", "study"]:
        ac.addWord(keyword)
    ac.make()
    mm = ac.search('The results show that, IL-17A and IL-9 expression and lipase, CCL11 was positively correlated with decreased levels between adult asthma.')
    print(mm)
WM algorithm (Java implementation)
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

/**
 * Wu-Manber fast multi-pattern keyword detection (filtering) algorithm.
 * Note: see the main method for a detailed usage example. When the number of
 * keywords exceeds a certain magnitude, modifying the keyword-loading method
 * is recommended (memory consumption problem).
 */
public class WuManber {
    private int B = 2; // length of the character block X (number of suffix characters of a pattern string)
    private boolean initFlag = false; // whether the tables have been initialized
    private UnionPatternSet unionPatternSet = new UnionPatternSet();
    private int maxIndex = (int) java.lang.Math.pow(2, 16); // maxIndex = 65536
    private int shiftTable[] = new int[maxIndex];
    @SuppressWarnings("unchecked")
    private Vector<AtomicPattern> hashTable[] = new Vector[maxIndex];
    private UnionPatternSet tmpUnionPatternSet = new UnionPatternSet();

    public WuManber() {
    }

    public static void main(String[] args) {
        // 1. Build the object
        WuManber objWM = new WuManber();
        // Get the keywords to be matched (pattern strings)
        Vector<String> vKey = new Vector<>();
        vKey.add("studies");
        vKey.add("asthma");
        vKey.add("results");
        vKey.add("lipase");
        vKey.add("horizontal");
        vKey.add("positive correlation");
        vKey.add("expression");
        // 2. Build the WM dictionary tables and make sure the build succeeded
        if (objWM.addFilterKeyWord(vKey, 0)) {
            long startTime = System.currentTimeMillis(); // start time
            String text = "The results show a positive correlation between IL-17A and IL-9 expression and lipase, decreased levels of CCL11 and adult asthma.";
            // 3. Do the matching and process the results
            List<String> sResult = objWM.macth(text, new Vector<Integer>(0));
            System.out.println(sResult);
            long endTime = System.currentTimeMillis();
            System.out.println("running time:" + (endTime - startTime) + "ms"); // running time
        }
        // 4. Release the storage held by the object
        objWM.clear();
    }

    /**
     * Matching work
     * @param content
     * @param levelSet
     * @return
     */
    public List<String> macth(String content, Vector<Integer> levelSet) {
        List<String> sResult = new ArrayList<>();
        if (initFlag == false)
            init();
        Vector<AtomicPattern> aps = new Vector<AtomicPattern>();
        String preContent = content; // preConvert(content);
        for (int i = 0; i < preContent.length();) {
            char checkChar = preContent.charAt(i);
            if (shiftTable[checkChar] == 0) {
                // checkChar ends at least one pattern: verify the candidates
                Vector<AtomicPattern> tmpAps = findMathAps(preContent.substring(0, i + 1), hashTable[checkChar]);
                aps.addAll(tmpAps);
                if (tmpAps.size() > 0) {
                    sResult.add(tmpAps.get(0).getPattern().str);
                }
                i++;
            } else
                i = i + shiftTable[checkChar];
        }
        parseAtomicPatternSet(aps, levelSet);
        return sResult;
    }

    /**
     * Add keywords. With a large number of keywords, passing them in as a
     * Vector is not recommended because it consumes a lot of memory.
     * @param keyWord
     * @param level
     * @return
     */
    public boolean addFilterKeyWord(Vector<String> keyWord, int level) {
        if (initFlag == true)
            return false;
        UnionPattern unionPattern = new UnionPattern();
        Object[] strArray = keyWord.toArray();
        for (int i = 0; i < strArray.length; i++) {
            String sPattern = (String) strArray[i];
            Pattern pattern = new Pattern(sPattern);
            AtomicPattern atomicPattern = new AtomicPattern(pattern);
            unionPattern.addNewAtomicPattrn(atomicPattern);
            unionPattern.setLevel(level);
            atomicPattern.setBelongUnionPattern(unionPattern);
        }
        tmpUnionPatternSet.addNewUnionPattrn(unionPattern);
        return true;
    }

    /**
     * Validate a character
     * @param ch
     * @return
     */
    private boolean isValidChar(char ch) {
        if ((ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
            return true;
        if ((ch >= 0x4e00 && ch <= 0x7fff) || (ch >= 0x8000 && ch <= 0x952f))
            return true; // Simplified Chinese character code points
        return false;
    }

    /**
     * Wrap up the atomic pattern set
     * @param aps
     * @param levelSet
     */
    private void parseAtomicPatternSet(Vector<AtomicPattern> aps, Vector<Integer> levelSet) {
        while (aps.size() > 0) {
            AtomicPattern ap = aps.get(0);
            UnionPattern up = ap.belongUnionPattern;
            if (up.isIncludeAllAp(aps) == true) {
                levelSet.add(Integer.valueOf(up.getLevel()));
            }
            aps.remove(0);
        }
    }

    /**
     * Find the atomic patterns that match
     * @param src
     * @param destAps
     * @return
     */
    private Vector<AtomicPattern> findMathAps(String src, Vector<AtomicPattern> destAps) {
        Vector<AtomicPattern> aps = new Vector<AtomicPattern>();
        for (int i = 0; i < destAps.size(); i++) {
            AtomicPattern ap = destAps.get(i);
            if (ap.findMatchInString(src) == true)
                aps.add(ap);
        }
        return aps;
    }

    /**
     * Pre-convert the content (remove special characters)
     * @param content
     * @return
     */
    private String preConvert(String content) {
        String retStr = new String();
        for (int i = 0; i < content.length(); i++) {
            char ch = content.charAt(i);
            if (this.isValidChar(ch) == true) {
                retStr = retStr + ch;
            }
        }
        return retStr;
    }

    /**
     * Initialize the shift table and the hash table
     */
    private void init() {
        initFlag = true;
        for (int i = 0; i < maxIndex; i++)
            hashTable[i] = new Vector<AtomicPattern>();
        shiftTableInit();
        hashTableInit();
    }

    /**
     * Clear
     */
    public void clear() {
        tmpUnionPatternSet.clear();
        initFlag = false;
    }

    /**
     * Initialize the shift (jump) table
     */
    private void shiftTableInit() {
        for (int i = 0; i < maxIndex; i++)
            shiftTable[i] = B;
        Vector<UnionPattern> upSet = tmpUnionPatternSet.getSet();
        for (int i = 0; i < upSet.size(); i++) {
            Vector<AtomicPattern> apSet = upSet.get(i).getSet();
            for (int j = 0; j < apSet.size(); j++) {
                AtomicPattern ap = apSet.get(j);
                Pattern pattern = ap.getPattern();
                // second-to-last character: shift 1; last character: shift 0
                if (shiftTable[pattern.charAtEnd(1)] != 0)
                    shiftTable[pattern.charAtEnd(1)] = 1;
                if (shiftTable[pattern.charAtEnd(0)] != 0)
                    shiftTable[pattern.charAtEnd(0)] = 0;
            }
        }
    }

    /**
     * Initialize the HASH table
     */
    private void hashTableInit() {
        Vector<UnionPattern> upSet = tmpUnionPatternSet.getSet();
        for (int i = 0; i < upSet.size(); i++) {
            Vector<AtomicPattern> apSet = upSet.get(i).getSet();
            for (int j = 0; j < apSet.size(); j++) {
                AtomicPattern ap = apSet.get(j);
                Pattern pattern = ap.getPattern();
                // store the patterns under the character for which shiftTable[pattern.charAtEnd(0)] == 0
                if (pattern.charAtEnd(0) != 0) {
                    hashTable[pattern.charAtEnd(0)].add(ap);
                }
            }
        }
    }
}

/**
 * Pattern class
 */
class Pattern {
    public String str;

    Pattern(String str) {
        this.str = str;
    }

    // Character at the given distance from the end (0 = last character); 0 if out of range
    public char charAtEnd(int index) {
        if (str.length() > index) {
            return str.charAt(str.length() - index - 1);
        } else
            return 0;
    }

    public String getStr() {
        return str;
    }
}

/**
 * Atomic pattern class
 */
class AtomicPattern {
    private Pattern pattern;
    public UnionPattern belongUnionPattern;

    AtomicPattern(Pattern pattern) {
        this.pattern = pattern;
    }

    public UnionPattern getBelongUnionPattern() {
        return belongUnionPattern;
    }

    public void setBelongUnionPattern(UnionPattern belongUnionPattern) {
        this.belongUnionPattern = belongUnionPattern;
    }

    public Pattern getPattern() {
        return pattern;
    }

    public void setPattern(Pattern pattern) {
        this.pattern = pattern;
    }

    // True if str ends with this pattern (case-insensitive)
    public boolean findMatchInString(String str) {
        if (this.pattern.str.length() > str.length())
            return false;
        int beginIndex = str.length() - this.pattern.str.length();
        String eqaulLengthStr = str.substring(beginIndex);
        if (this.pattern.str.equalsIgnoreCase(eqaulLengthStr))
            return true;
        return false;
    }
}

/**
 * Union (merged) pattern class
 */
class UnionPattern {
    public Vector<AtomicPattern> apSet;
    private int level;

    // union string
    UnionPattern() {
        this.apSet = new Vector<AtomicPattern>();
    }

    public void addNewAtomicPattrn(AtomicPattern ap) {
        this.apSet.add(ap);
    }

    public Vector<AtomicPattern> getSet() {
        return apSet;
    }

    public boolean isIncludeAllAp(Vector<AtomicPattern> inAps) {
        if (apSet.size() > inAps.size())
            return false;
        for (int i = 0; i < apSet.size(); i++) {
            AtomicPattern ap = apSet.get(i);
            if (isInAps(ap, inAps) == false)
                return false;
        }
        return true;
    }

    private boolean isInAps(AtomicPattern ap, Vector<AtomicPattern> inAps) {
        for (int i = 0; i < inAps.size(); i++) {
            AtomicPattern destAp = inAps.get(i);
            if (ap.getPattern().str.equalsIgnoreCase(destAp.getPattern().str) == true)
                return true;
        }
        return false;
    }

    public void setLevel(int level) {
        this.level = level;
    }

    public int getLevel() {
        return this.level;
    }
}

/**
 * Set of union patterns
 */
class UnionPatternSet {
    // union string set
    public Vector<UnionPattern> unionPatternSet;

    UnionPatternSet() {
        this.unionPatternSet = new Vector<UnionPattern>();
    }

    public void addNewUnionPattrn(UnionPattern up) {
        this.unionPatternSet.add(up);
    }

    public Vector<UnionPattern> getSet() {
        return unionPatternSet;
    }

    public void clear() {
        unionPatternSet.clear();
    }
}