Multi-pattern (multi-mode) matching - History

 

Explanation

Multi-mode (multi-pattern) matching means finding every occurrence of several pattern strings (keywords) in one text string.
Common scenarios for multi-mode matching: (1) keyword filtering (2) intrusion detection (3) virus detection (4) word segmentation, etc.
There are many dedicated multi-mode matching algorithms; the commonly used ones are (1) the Trie tree (2) the AC algorithm (3) the WM algorithm.
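
As a rough illustration of the simplest of these options, the sketch below matches keywords with a plain Trie tree: every start position of the text is walked through the trie. The function names `build_trie` and `trie_search` and the sample keywords are made up for this example and do not come from the original post.

```python
def build_trie(keywords):
    """Build a nested-dict trie; the key None marks the end of a keyword."""
    root = {}
    for word in keywords:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def trie_search(root, text):
    """Return (start, keyword) for every keyword occurrence in text."""
    hits = []
    for start in range(len(text)):
        node = root
        # walk the trie from this start position until no transition exists
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if None in node:
                hits.append((start, node[None]))
    return hits


trie = build_trie(["he", "she", "hers"])
print(trie_search(trie, "ushers"))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The walk restarts from the root at every text position, which is the repeated work the AC algorithm avoids with its failure function.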

AC (Aho-Corasick) algorithm

  1. The core idea of the AC automaton is to use a finite automaton to turn character comparisons into state transitions;
  2. Important functions: the goto function, the failure function and the output function;
  3. goto function: guides the state transition on a successful character match;
  4. failure function: guides the state transition when a match fails;
  5. output function: describes which states indicate that a match has occurred;
  6. References:
    Introduction to state machines, from definition to use: http://www.sohu.com/a/230886244_114819
    Multi-mode string matching algorithm Aho-Corasick (Java): https://www.cnblogs.com/hwyang/p/6836438.html
    AC automaton + trie tree for efficient multi-mode dictionary matching (Java source): https://blog.csdn.net/wangyangzhizhou/article/details/80964748
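
To make the three functions concrete, here is a minimal sketch that stores goto, failure and output as three tables indexed by state number. It is only an illustration under assumed names (`build_automaton`, `find_all`) and sample keywords; it is not the listing from the Code section.

```python
from collections import deque


def build_automaton(keywords):
    """Return the goto, failure and output tables, indexed by state number."""
    goto = [{}]      # goto[state][char] -> next state; state 0 is the root
    fail = [0]       # fail[state] -> state to fall back to on a mismatch
    output = [[]]    # output[state] -> keywords recognised in that state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    queue = deque(goto[0].values())   # depth-1 states keep fail = 0
    while queue:                      # breadth-first, so shallower states are done first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # a state also reports every keyword its failure state reports
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, goto, fail, output):
    """Return (end_index, keyword) for every match in text."""
    hits = []
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((i, word))
    return hits


tables = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", *tables))  # [(3, 'she'), (3, 'he'), (5, 'hers')]
```

Each text character is consumed once; the failure table replaces the restart-from-root step of a plain trie walk.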

WM (Wu-Manber) algorithm

  1. Uses hashing together with the bad-character rule of the BM algorithm to skip ahead in jumps, which gives fast retrieval;
  2. Important table structures: the SHIFT, HASH and PREFIX tables;
  3. SHIFT table: the bad-character (jump distance) table;
  4. HASH table: maps each character block of length B to the corresponding patterns;
  5. PREFIX table: stores pattern prefixes, used to check the prefix of candidates and increase efficiency;
  6. Recommended reading (BM algorithm):
    Boyer-Moore string matching algorithm (most detailed) http://www.ruanyifeng.com/blog/2013/05/boyer-moore_string_search_algorithm.html
    Multi-mode matching Wu-Manber algorithm https://blog.csdn.net/sunnianzhong/article/details/8875566
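
The SHIFT and HASH tables can be sketched as follows. This is a simplified illustration with B = 2 that leaves out the PREFIX table; the names `build_tables` and `wm_search` and the sample patterns are assumptions made for this example.

```python
def build_tables(patterns, B=2):
    """Build the SHIFT and HASH tables over the first m characters of each pattern."""
    m = min(len(p) for p in patterns)      # window length = shortest pattern
    if m < B:
        raise ValueError("every pattern must have at least B characters")
    default_shift = m - B + 1              # used for blocks that occur in no pattern
    shift, hash_table = {}, {}
    for p in patterns:
        for end in range(B, m + 1):        # end = 1-based end position of the block
            block = p[end - B:end]
            shift[block] = min(shift.get(block, default_shift), m - end)
        hash_table.setdefault(p[m - B:m], []).append(p)
    return m, default_shift, shift, hash_table


def wm_search(text, patterns, B=2):
    """Return (start, pattern) for every occurrence of any pattern in text."""
    m, default_shift, shift, hash_table = build_tables(patterns, B)
    hits = []
    pos = m - 1                            # index of the last character of the window
    while pos < len(text):
        block = text[pos - B + 1:pos + 1]
        step = shift.get(block, default_shift)
        if step == 0:
            start = pos - m + 1
            for p in hash_table[block]:    # candidates whose window ends with this block
                if text.startswith(p, start):
                    hits.append((start, p))
            step = 1
        pos += step
    return hits


print(wm_search("the asthma results", ["asthma", "results", "lipase"]))
# [(4, 'asthma'), (11, 'results')]
```

A shift of 0 means the window may end a pattern, so only the candidates in the matching HASH bucket are verified; any other shift lets the window jump without comparing.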

Code Remarks

  1. Input:
    keyword array: ["research", "lipase", "horizontal", "asthma", "positive correlation", "expression", "study"]
    string to be matched: "The results show that, IL-17A and IL-9 expression and lipase, CCL11 was positively correlated with decreased levels between adult asthma. "

  2. Output:
    successfully matched keywords: ['Research', 'results', 'Lipase', 'horizontal', 'asthma', 'positive correlation']

  3. Test description: with more than 20,000 keywords and a string of about 2,000 characters, extraction takes less than 2 seconds

AC and WM comparison:

  1. Memory footprint: the improved WM algorithm uses much less memory than the AC algorithm;
  2. Preprocessing: the improved WM algorithm needs much less preprocessing time than the AC algorithm;
  3. Matching speed: the matching speed of the WM algorithm depends heavily on the content of the loaded pattern strings;
  4. The matching speed of the AC algorithm does not depend on the content of the loaded pattern strings;
  5. Prefix: if a large number of patterns share similar prefixes, the SHIFT and HASH tables of the improved WM algorithm have more conflicts and matching slows down.

Extension

    1. Pattern matching: a basic string operation in data structures. Given a substring (the pattern), find every substring of a string that is equal to it. Pattern matching falls into two categories: single-mode matching and multi-mode matching;
    2. Common single-mode matching algorithms: the naive algorithm (shift by one position after each failed match), Rabin-Karp (adds hashing on top of the naive comparison), KMP (computes the shift dynamically after each failed match, which increases efficiency), BM (compares from the end of the pattern and computes the shift dynamically with the bad-character rule), and so on;
    3. AC algorithm optimization: use a double array to reduce memory usage.
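
As one example of the single-mode algorithms listed above, the sketch below shows KMP. The shift after a failed match comes from a precomputed table instead of always moving by one position. The names `build_next` and `kmp_search` are chosen for this illustration only.

```python
def build_next(pattern):
    """next_[i] = length of the longest proper prefix of pattern[:i + 1] that is also its suffix."""
    next_ = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = next_[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        next_[i] = k
    return next_


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    next_ = build_next(pattern)
    hits = []
    k = 0                                  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = next_[k - 1]               # fall back without moving back in the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = next_[k - 1]               # keep going to find overlapping matches
    return hits


print(kmp_search("abababca", "abab"))  # [0, 2]
```

The failure function of the AC algorithm plays the same role as `next_` here, generalized from one pattern to a trie of patterns.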

Code

AC algorithm (Python implementation)

 

# -*- encoding=utf-8 -*-


class Node(object):
    def __init__(self):
        self.next = {}        # goto function: character -> child node
        self.fail = None      # failure function: node to fall back to on a mismatch
        self.isWord = False   # output function: True if a keyword ends at this node
        self.depth = 0        # length of the prefix this node represents


class Ahocorasick(object):
    def __init__(self):
        self.__root = Node()

    def addWord(self, word):
        """
            @param word: keyword to add to the Trie tree
        """
        tmp = self.__root
        for ch in word:
            if ch not in tmp.next:
                child = Node()
                child.depth = tmp.depth + 1
                tmp.next[ch] = child
            tmp = tmp.next[ch]
        tmp.isWord = True

    def make(self):
        """
            build the fail function
            (builds the automaton's failure function)
        """
        tmpQueue = [self.__root]
        while len(tmpQueue) > 0:
            # pop(0) gives breadth-first order, so the fail pointers of all
            # shallower nodes are already set when a node is processed
            temp = tmpQueue.pop(0)
            for k, v in temp.next.items():
                if temp == self.__root:
                    v.fail = self.__root
                else:
                    p = temp.fail
                    while p is not None:
                        if k in p.next:
                            v.fail = p.next[k]
                            break
                        p = p.fail
                    if p is None:
                        v.fail = self.__root
                tmpQueue.append(v)

    def search(self, content):
        """
            @return: the list of matched strings (can be changed to return
                     the position indexes of the strings instead, if needed)
        """
        p = self.__root
        result = []
        currentPosition = 0

        while currentPosition < len(content):
            word = content[currentPosition]
            # follow the fail links until a transition on this character exists
            while word not in p.next and p != self.__root:
                p = p.fail

            if word in p.next:
                # transition the state machine to the next state
                p = p.next[word]
            else:
                p = self.__root

            # walk the fail chain so that every keyword ending here is reported
            q = p
            while q is not None and q != self.__root:
                if q.isWord:
                    # the node depth gives the keyword length, hence its start position
                    startWordIndex = currentPosition - q.depth + 1
                    # result.append((startWordIndex, currentPosition))
                    result.append(content[startWordIndex:currentPosition + 1])
                q = q.fail

            currentPosition += 1
        return result

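    def replace(self, content):
        """
            Replace part of the characters
        """
        # Assumption: the original body is cut off in the source; masking every
        # matched keyword with '*' is only one plausible reading of the docstring.
        result = content
        for word in self.search(content):
            result = result.replace(word, '*' * len(word))
        return result


if __name__ == '__main__':
    ah = Ahocorasick()
    # keywords taken from the "Code Remarks" section
    for key in ["research", "lipase", "horizontal", "asthma", "positive correlation", "expression", "study"]:
        ah.addWord(key)
    ah.make()
    mm = ah.search('The results show that, IL-17A and IL-9 expression and lipase, CCL11 was positively correlated with decreased levels between adult asthma. ')
    print(mm)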

  

WM algorithm (Java implementation)

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.Vector;

/**
 * Wu-Manber multi-mode fast detection (filtering) algorithm
 * Note: a detailed example is provided in the main method; when the number of keywords loaded exceeds a certain magnitude, modifying the keyword loading method is recommended (memory consumption problem)
 *
 */
public class WuManber {
    private int B = 2; // length of the character block X (number of suffix characters of the pattern string)

    private boolean initFlag = false; // whether the tables have been initialized
    private UnionPatternSet unionPatternSet = new UnionPatternSet();
    private int maxIndex = (int) java.lang.Math.pow(2, 16); // maxIndex = 65536
    private int shiftTable[] = new int[maxIndex];
    private Vector<AtomicPattern> hashTable[] = new Vector[maxIndex];
    private UnionPatternSet tmpUnionPatternSet = new UnionPatternSet();

    public WuManber() {

    }

    public static void main(String[] args) {
        // 1. build the object
        WuManber objWM = new WuManber();

        // get the keywords to be matched (pattern strings)
        Vector<String> vKey = new Vector<>();
        vKey.add("studies");
        vKey.add("asthma");
        vKey.add("results");
        vKey.add("lipase");
        vKey.add("horizontal");
        vKey.add("positive correlation");
        vKey.add("expression");


        // 2. build the WM dictionary tables and make sure the build succeeded
        if (objWM.addFilterKeyWord(vKey, 0)) {

            long startTime = System.currentTimeMillis(); // get the start time


            String text = "The results show a positive correlation between IL-17A and IL-9 expression and lipase, decreased levels of CCL11 and adult asthma.";

            // 3. do the matching and process the results
            List<String> sResult = objWM.macth(text, new Vector<Integer>(0));

            System.out.println(sResult);


            long endTime = System.currentTimeMillis();
            System.out.println("running time: " + (endTime - startTime) + "ms"); // running time
        }

        // 4. release the storage held by the object
        objWM.clear();
    }



    /**
     * Matching
     * @param content
     * @param levelSet
     * @return
     */
    public List<String> macth(String content, Vector<Integer> levelSet) {
        List<String> sResult = new ArrayList<>();
        if (initFlag == false)
            init();
        Vector<AtomicPattern> aps = new Vector<AtomicPattern>();
        String preContent = content; // preConvert(content);
        for (int i = 0; i < preContent.length();) {
            char checkChar = preContent.charAt(i);
            if (shiftTable[checkChar] == 0) {
                // shift 0: this character ends at least one pattern, so verify the candidates
                Vector<AtomicPattern> tmpAps = findMathAps(preContent.substring(0, i + 1), hashTable[checkChar]);
                aps.addAll(tmpAps);
                if (tmpAps.size() > 0) {
                    sResult.add(tmpAps.get(0).getPattern().str);
                }
                i++;
            } else
                i = i + shiftTable[checkChar];
        }
        parseAtomicPatternSet(aps, levelSet);
        return sResult;
    }


    /**
     * Add keywords. When there are many keywords, passing them in as a Vector is not recommended because it consumes a lot of memory
     * @param keyWord
     * @param level
     * @return
     */
    public boolean addFilterKeyWord(Vector<String> keyWord, int level) {
        if (initFlag == true)
            return false;
        UnionPattern unionPattern = new UnionPattern();
        Object[] strArray = keyWord.toArray();
        for (int i = 0; i < strArray.length; i++) {
            String sPattern=(String)strArray[i];
            Pattern pattern = new Pattern(sPattern);
            AtomicPattern atomicPattern = new AtomicPattern(pattern);
            unionPattern.addNewAtomicPattrn(atomicPattern);
            unionPattern.setLevel(level);
            atomicPattern.setBelongUnionPattern(unionPattern);
        }
        tmpUnionPatternSet.addNewUnionPattrn(unionPattern);
        return true;
    }

    /**
     * Validate a character
     * @param ch
     * @return
     */
    private boolean isValidChar(char ch) {
        if ((ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
            return true;
        if ((ch >= 0x4e00 && ch <= 0x7fff) || (ch >= 0x8000 && ch <= 0x952f))
            return true; // code range of simplified Chinese characters
        return false;
    }

    /**
     * Wrap the atomic pattern set
     * @param aps
     * @param levelSet
     */
    private void parseAtomicPatternSet(Vector <AtomicPattern> aps,Vector <Integer> levelSet) {
        while (aps.size() > 0) {
            AtomicPattern ap = aps.get(0);
            UnionPattern up = ap.belongUnionPattern;
            if (up.isIncludeAllAp(aps) == true) {
                levelSet.add(new Integer(up.getLevel()));
            }
            aps.remove(0);
        } 
    } 

    /**
     * Find the atomic patterns
     * @param src
     * @param destAps
     * @return
     */
    private Vector<AtomicPattern> findMathAps(String src, Vector<AtomicPattern> destAps) {
        Vector<AtomicPattern> aps = new Vector<AtomicPattern>();
        for (int i = 0; i < destAps.size(); i++) {
            AtomicPattern ap = destAps.get(i);
            if (ap.findMatchInString(src) == true)
                aps.add(ap);
        }
        return aps;
    }

    /**
     * Pre-convert the content (remove special characters)
     * @param content
     * @return
     */
    private String preConvert(String content) {
        String retStr = new String();
        for (int i = 0; i  < content.length(); i++) {
            char ch = content.charAt(i);
            if (this.isValidChar(ch) == true) {
                retStr = retStr + ch;
            }
        }
        return retStr;
    }

    /**
     * Initialize the shift table and the hash table
     */
    private void init() {
        initFlag = true;
        for (int i = 0; i  < maxIndex; i++)
            hashTable[i] = new Vector <AtomicPattern>();
        shiftTableInit();
        hashTableInit();
    }

    /**
     * Clear
     */
    public void clear() {
        tmpUnionPatternSet.clear();
        initFlag = false;
    }

    /**
     * Initialize the shift table
     */
    private void shiftTableInit() {
        for (int i = 0; i  < maxIndex; i++)
            shiftTable[i] = B;
        Vector <UnionPattern> upSet = tmpUnionPatternSet.getSet();
        for (int i = 0; i  < upSet.size(); i++) {
            Vector <AtomicPattern> apSet = upSet.get(i).getSet();
            for (int j = 0; j  < apSet.size(); j++) {
                AtomicPattern ap = apSet.get(j);
                Pattern pattern = ap.getPattern();
                //System.out.print(pattern.charAtEnd(1)+"\t");// e.g. if pattern.charAtEnd(1)==B, then shiftTable[pattern.charAtEnd(1)]==shiftTable[53]
                if (shiftTable[pattern.charAtEnd(1)] != 0)
                    shiftTable[pattern.charAtEnd(1)] = 1;
                if (shiftTable[pattern.charAtEnd(0)] != 0)
                    shiftTable[pattern.charAtEnd(0)] = 0;
            }
        }
    }

    /**
     * Initialize the HASH table
     */
    private void hashTableInit() {
        Vector <UnionPattern> upSet = tmpUnionPatternSet.getSet();
        for (int i = 0; i  < upSet.size(); i++) {
            Vector <AtomicPattern> apSet = upSet.get(i).getSet();
            for (int j = 0; j  < apSet.size(); j++) {
                AtomicPattern ap = apSet.get(j);
                Pattern pattern = ap.getPattern();
                //System.out.println(pattern.charAtEnd(0));// store the character blocks for which shiftTable[pattern.charAtEnd(0)]==0
                if (pattern.charAtEnd(0) != 0) {
                    hashTable[pattern.charAtEnd(0)].add(ap);
                    //System.out.println(hashTable[pattern.charAtEnd(0)]);   
                }
            }
        }
    }
}

/**
 * Pattern class
 */
class Pattern {
    public String str;

    Pattern(String str) {
        this.str = str;
    }

    public char charAtEnd(int index) {
        if (str.length() > index) {
            return str.charAt(str.length() - index - 1);
        } else
            return 0;
    }

    public String getStr() {
        return str;
    }
}

/**
 * Atomic pattern class
 */
class AtomicPattern {
    private Pattern pattern;
    public UnionPattern belongUnionPattern;
    AtomicPattern(Pattern pattern) {
        this.pattern = pattern;
    }

    public UnionPattern getBelongUnionPattern() {
        return belongUnionPattern;
    }

    public void setBelongUnionPattern(UnionPattern belongUnionPattern) {
        this.belongUnionPattern = belongUnionPattern;
    }

    public Pattern getPattern() {
        return pattern;
    }

    public void setPattern(Pattern pattern) {
        this.pattern = pattern;
    }

    public boolean findMatchInString(String str) {
        if (this.pattern.str.length() > str.length())
            return false;
        int beginIndex = str.length() - this.pattern.str.length();
        String eqaulLengthStr = str.substring(beginIndex);
        if (this.pattern.str.equalsIgnoreCase(eqaulLengthStr))
            return true;
        return false;
    }
}

/**
 * Union pattern class
 */
class UnionPattern {
    public Vector <AtomicPattern> apSet;
    private int level;
    // union string
    UnionPattern() {
        this.apSet = new Vector <AtomicPattern>();
    }
    public void addNewAtomicPattrn(AtomicPattern ap) {
        this.apSet.add(ap);
    }
    public Vector <AtomicPattern> getSet() {
        return apSet;
    }
    public boolean isIncludeAllAp(Vector <AtomicPattern> inAps) {
        if (apSet.size() > inAps.size())
            return false;
        for (int i = 0; i  < apSet.size(); i++) {
            AtomicPattern ap = apSet.get(i);
            if (isInAps(ap, inAps) == false)
                return false;
        }
        return true;
    }
    private boolean isInAps(AtomicPattern ap, Vector <AtomicPattern> inAps) {
        for (int i = 0; i  < inAps.size(); i++) {
            AtomicPattern destAp = inAps.get(i);
            if (ap.getPattern().str.equalsIgnoreCase(destAp.getPattern().str) == true)
                return true;
        }
        return false;
    }
    public void setLevel(int level) {
        this.level = level;
    }
    public int getLevel() {
        return this.level;
    }
}


/**
 * Union pattern set class
 */
class UnionPatternSet {
    // union string set
    public Vector <UnionPattern> unionPatternSet;
    UnionPatternSet() {
        this.unionPatternSet = new Vector <UnionPattern>();
    }
    public void addNewUnionPattrn(UnionPattern up) {
        this.unionPatternSet.add(up);
    }
    public Vector <UnionPattern> getSet() {
        return unionPatternSet;
    }
    public void clear() {
        unionPatternSet.clear();
    }
}

  

 


Origin www.cnblogs.com/fanblogs/p/11139214.html