C++ multi-pattern matching problem

Multi-pattern matching: given a text string T = t1t2...tn and a set of pattern strings P = {p1, p2, ..., pr}, where each pattern pi = pi[1]pi[2]...pi[ni] is a string over a finite alphabet, find all occurrences in T of every pattern pi in P.

Some strings in the pattern set P may be substrings, prefixes, or suffixes of other strings in the set, or even identical to them. The simplest way to solve the multi-pattern matching problem is to run a single-pattern string matching algorithm r times, once for each pattern. This gives a worst-case time complexity of O(|P|) for the preprocessing phase and O(r∗n) for the search phase, where |P| is the total length of all patterns.
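The simple approach can be sketched as follows. This is only an illustration: the function name NaiveMultiSearch is made up for this example, and std::string::find stands in for the single-pattern algorithm. Unlike KMP, std::string::find is not guaranteed to run in linear time, so the O(r∗n) bound only holds if a linear-time single-pattern search is used instead.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Naive multi-pattern search: run one single-pattern search per pattern.
// Returns (pattern index, start position in text) for every occurrence.
// NaiveMultiSearch is a hypothetical name used only for this sketch.
std::vector<std::pair<std::size_t, std::size_t>>
NaiveMultiSearch(const std::string& text,
                 const std::vector<std::string>& patterns) {
    std::vector<std::pair<std::size_t, std::size_t>> hits;
    for (std::size_t k = 0; k < patterns.size(); ++k) {
        if (patterns[k].empty()) continue;  // skip empty patterns
        // Restart one position after each hit so overlapping hits are kept.
        for (std::size_t pos = text.find(patterns[k]);
             pos != std::string::npos;
             pos = text.find(patterns[k], pos + 1)) {
            hits.emplace_back(k, pos);
        }
    }
    return hits;
}
```

For the text "ushers" and the patterns he, she, his, hers, this returns three hits: he at position 2, she at position 1, and hers at position 2. The text is scanned once per pattern, which is exactly the cost that the algorithms below try to avoid.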

As with single-pattern string matching algorithms, multi-pattern string matching algorithms can be divided into the following three types, according to how they search for the pattern strings in the text:

  • Prefix-based search: the search runs from front to back (along the forward direction of the text). The text characters are read one by one and recognized by an automaton built on P. For each text position, the algorithm computes the longest string that is both a suffix of the text read so far and a prefix of some pattern string in P.
    • The well-known Aho-Corasick automaton (AC automaton) algorithm and the Multiple Shift-And algorithm use this approach.
  • Suffix-based search: the search runs from back to front (against the direction of the text), looking for a suffix of a pattern string. The current text position is then moved according to the next occurrence of that suffix. This method avoids reading every text character.
    • The Commentz-Walter algorithm (an extension of the Boyer-Moore algorithm), the Set Horspool algorithm (a simplification of the Commentz-Walter algorithm), and the Wu-Manber algorithm all use this method.
  • Substring-based search: the search also runs from back to front (against the direction of the text). It looks for a substring of the pattern prefixes of length min(len(pi)) and uses the result to decide how far to move the current text position. This approach also avoids reading every text character.
    • The Multiple BNDM algorithm, the Set Backward Dawg Matching (SBDM) algorithm, and the Set Backward Oracle Matching (SBOM) algorithm all use this approach.

It should be noted that most of the multi-pattern string matching algorithms introduced above rely on one basic data structure: the dictionary tree (trie). The famous Aho-Corasick automaton (AC automaton) algorithm was born from combining the KMP algorithm with the dictionary tree structure. The AC automaton algorithm is also one of the most efficient multi-pattern string matching algorithms.

Therefore, the key to learning multi-pattern matching algorithms is to master the dictionary tree and the AC automaton algorithm.

Dictionary tree knowledge

Dictionary tree (Trie): also known as a prefix tree or word search tree, it is a tree structure. As the name suggests, it is a tree that works like a dictionary and is a way of storing one. Each word in the dictionary is represented as a path starting from the root node, and the letters on the edges along that path, joined together, form the corresponding string.

You can see that a trie is a structure that stores strings.


First, define the node: a hash table holds the child nodes, and a flag marks the end of a word.

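#include <string>
#include <unordered_map>
using namespace std;

// One trie node: child links keyed by character, plus an end-of-word flag.
struct Node {
	unordered_map<char, Node*> children;
	bool is_end = false;  // true if a stored word ends at this node
};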

The next step is to declare the dictionary tree class.

class Trie {
public:
	Trie();  // initialize with an empty root

	void Insert(const string& s);          // insert a word
	bool Search(const string& s);          // look up a whole word
	bool StartWith(const string& prefix);  // look up a prefix

private:
	Node* root = nullptr;
};

First comes initialization. The root node does not store a character.

Trie::Trie() {
	root = new Node();
}

Then comes insertion. Starting from the root, follow the child for each character in turn: if the child exists, keep walking down; if it does not, create it first.

void Trie::Insert(const string& s) {
	Node* cur = this->root;
	for (char c : s) {
		// Create the child for this character if it does not exist yet.
		if (cur->children.find(c) == cur->children.end()) cur->children[c] = new Node();
		cur = cur->children[c];
	}
	cur->is_end = true;  // mark the end of the word
}

Next come word lookup and prefix lookup.

bool Trie::Search(const string& s) {
	Node* cur = this->root;
	for (char c : s) {
		auto it = cur->children.find(c);
		if (it == cur->children.end()) return false;  // path breaks: word absent
		cur = it->second;
	}
	return cur->is_end;  // the path must end on a stored word
}

bool Trie::StartWith(const string& prefix) {
	Node* cur = this->root;
	for (char c : prefix) {
		auto it = cur->children.find(c);
		if (it == cur->children.end()) return false;  // path breaks: prefix absent
		cur = it->second;
	}
	return true;  // the whole prefix was walked
}

Finding a word is basically the same as finding a prefix. The difference is that a prefix lookup does not need to check whether the last node carries the end-of-word flag.
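Since the two lookups differ only in that final check, they can share one walk. The sketch below is a variation on the code above, not the code above itself: the names OwnedTrie, TrieNode, and Walk are made up for this example, and std::unique_ptr (C++14 for std::make_unique) is swapped in for the raw pointers so that nodes are freed automatically.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical variant of the trie: nodes are owned by std::unique_ptr,
// and Search/StartWith share a single walk.
class OwnedTrie {
    struct TrieNode {
        std::unordered_map<char, std::unique_ptr<TrieNode>> children;
        bool is_end = false;
    };

public:
    void Insert(const std::string& word) {
        TrieNode* cur = &root_;
        for (char c : word) {
            std::unique_ptr<TrieNode>& child = cur->children[c];
            if (!child) child = std::make_unique<TrieNode>();
            cur = child.get();
        }
        cur->is_end = true;
    }

    // A word is present only if the walk succeeds AND ends on an end flag.
    bool Search(const std::string& word) const {
        const TrieNode* node = Walk(word);
        return node != nullptr && node->is_end;
    }

    // A prefix is present as soon as the walk succeeds.
    bool StartWith(const std::string& prefix) const {
        return Walk(prefix) != nullptr;
    }

private:
    // Follows s from the root; returns nullptr if the path breaks.
    const TrieNode* Walk(const std::string& s) const {
        const TrieNode* cur = &root_;
        for (char c : s) {
            auto it = cur->children.find(c);
            if (it == cur->children.end()) return nullptr;
            cur = it->second.get();
        }
        return cur;
    }

    TrieNode root_;
};
```

After inserting only "her", Search("her") and StartWith("he") are true, while Search("he") is false because no word ends at that node.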

Finally, the complexity of each operation, where n is the length of the word, m is the length of the prefix, and d is the size of the alphabet:

  • Inserting a word: time complexity is O(n); space complexity is O(dn) if each node stores its children in an array, or O(n) if a hash table is used.
  • Finding a word: time complexity is O(n); space complexity is O(1).
  • Finding a prefix: time complexity is O(m); space complexity is O(1).
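To see where the O(dn) figure for the array variant comes from, here is a minimal sketch of an array-based node. It assumes an alphabet of the lowercase letters 'a' to 'z' (so d = 26); the names kAlphabet, ArrayNode, ArrayInsert, and ArraySearch are made up for this example. Every new node reserves d child slots, and inserting a word of length n creates at most n nodes.

```cpp
#include <array>
#include <memory>
#include <string>

constexpr int kAlphabet = 26;  // d: assumed alphabet 'a'..'z'

// Array-based trie node: d child slots are reserved whether used or not.
struct ArrayNode {
    std::array<std::unique_ptr<ArrayNode>, kAlphabet> children;
    bool is_end = false;
};

// Inserts word and returns the number of nodes newly allocated
// (at most word.size()). Returns -1 and inserts nothing if the word
// contains a character outside the assumed alphabet.
int ArrayInsert(ArrayNode& root, const std::string& word) {
    for (char c : word) {
        if (c < 'a' || c > 'z') return -1;
    }
    int created = 0;
    ArrayNode* cur = &root;
    for (char c : word) {
        std::unique_ptr<ArrayNode>& child = cur->children[c - 'a'];
        if (!child) {
            child = std::make_unique<ArrayNode>();
            ++created;
        }
        cur = child.get();
    }
    cur->is_end = true;
    return created;
}

// Looks up a whole word; uses O(1) extra space.
bool ArraySearch(const ArrayNode& root, const std::string& word) {
    const ArrayNode* cur = &root;
    for (char c : word) {
        if (c < 'a' || c > 'z') return false;
        cur = cur->children[c - 'a'].get();
        if (cur == nullptr) return false;
    }
    return cur->is_end;
}
```

The array gives direct indexing without hashing, at the price of d slots per node; the hash table version only stores the children that actually exist.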

Knowledge of AC Automata

AC Automaton (Aho-Corasick Automaton): this algorithm was born at Bell Labs in 1975 and is one of the most famous multi-pattern matching algorithms. Put simply, the AC automaton is built on the Trie structure and combined with the idea of the KMP algorithm.

The construction of an AC automaton has 3 steps:

  1. Construct a dictionary tree (Trie) as the search data structure of the AC automaton.
  2. Using the idea of the KMP algorithm, construct the mismatch pointers. When the current character fails to match, the mismatch pointer lets the search jump to the state for the longest suffix of the matched text that is also a prefix of some pattern, and continue matching from there.
  3. Scan text strings for matches.
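The three steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the names AhoCorasick, State, and CountMatches are made up for this example, states are stored in a vector with state 0 as the root, and after each character the scan walks the chain of mismatch pointers so that patterns which are suffixes of other patterns are also counted. That walk is simple but not the fastest option.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of an AC automaton; state 0 is the root.
class AhoCorasick {
    struct State {
        std::unordered_map<char, int> next;  // trie edges
        int fail = 0;                        // mismatch pointer
        std::vector<std::size_t> ends;       // patterns ending at this state
    };

public:
    explicit AhoCorasick(const std::vector<std::string>& patterns)
        : nodes_(1), pattern_count_(patterns.size()) {
        // Step 1: build the trie.
        for (std::size_t k = 0; k < patterns.size(); ++k) {
            if (patterns[k].empty()) continue;  // ignore empty patterns
            int cur = 0;
            for (char c : patterns[k]) {
                auto it = nodes_[cur].next.find(c);
                if (it != nodes_[cur].next.end()) {
                    cur = it->second;
                } else {
                    int created = static_cast<int>(nodes_.size());
                    nodes_.emplace_back();
                    nodes_[cur].next[c] = created;
                    cur = created;
                }
            }
            nodes_[cur].ends.push_back(k);
        }
        // Step 2: build mismatch pointers breadth-first.
        // Children of the root keep fail = 0.
        std::queue<int> bfs;
        for (const auto& edge : nodes_[0].next) bfs.push(edge.second);
        while (!bfs.empty()) {
            int u = bfs.front();
            bfs.pop();
            for (const auto& edge : nodes_[u].next) {
                char c = edge.first;
                int v = edge.second;
                // Fall back from the parent's pointer until an edge on c exists.
                int f = nodes_[u].fail;
                while (f != 0 && nodes_[f].next.count(c) == 0) f = nodes_[f].fail;
                auto it = nodes_[f].next.find(c);
                nodes_[v].fail = (it != nodes_[f].next.end()) ? it->second : 0;
                bfs.push(v);
            }
        }
    }

    // Step 3: scan the text; returns how many times each pattern occurs.
    std::vector<int> CountMatches(const std::string& text) const {
        std::vector<int> counts(pattern_count_, 0);
        int cur = 0;
        for (char c : text) {
            // On a mismatch, follow mismatch pointers instead of restarting.
            while (cur != 0 && nodes_[cur].next.count(c) == 0) cur = nodes_[cur].fail;
            auto it = nodes_[cur].next.find(c);
            if (it != nodes_[cur].next.end()) cur = it->second;
            // Count patterns ending here and at every state on the fail chain.
            for (int s = cur; s != 0; s = nodes_[s].fail) {
                for (std::size_t k : nodes_[s].ends) ++counts[k];
            }
        }
        return counts;
    }

private:
    std::vector<State> nodes_;
    std::size_t pattern_count_;
};
```

For the patterns he, her, is, his and the text "herishis", CountMatches returns the counts 1, 1, 2, 1: the final "his" also ends an occurrence of "is", which is found through the mismatch pointer.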

Here is a concrete example, shown in a figure.

[Figure: example AC automaton, showing the pattern strings, the "next array", and the text to match]

Let's take a look:

  • The lower left corner shows the strings used to build the dictionary tree.
  • On the right is the "next array" constructed from the mismatch pointers, which is similar to the one in the KMP algorithm.
  • The upper left corner shows the string to match.

Start matching:

Step 1, look up a: the next state is 0, so the state returns to 1;

Step 2, look up s: the next state is 0, so the state returns to 1;

Step 3, look up h: the state becomes 2, continue searching;

Step 4, look up e: a match is complete; he is one of the patterns to match, so the count for he increases by 1;

Step 5, look up r: a match is complete; her is one of the patterns to match, so the count for her increases by 1;

Step 6, look up i: the state becomes 7, continue searching;

Step 7, look up s: a match is complete; is is one of the patterns to match, so the count for is increases by 1;

Step 8, look up h: the state becomes 2, continue searching; if the next character does not exist, return to state 1;

Step 9, look up y: the state becomes 9, continue searching;

By now you probably know how the rest of the search goes. So how do you think the final his should be matched?

It goes through the states 1->2->5->6->8->7.

Because his contains two patterns, his and is, the mismatch pointer links them together. After his is found, the count for his increases by 1. On the way back to state 1, the search passes through state 8, and state 8 is the end of the word is, so the count for is also increases by 1.


That is the principle behind the AC automaton. I hope you read it patiently. The implementation itself is something you will need to write according to your own application.

As usual, if this was useful, please like and bookmark it. Thank you, everyone.


Origin blog.csdn.net/suren_jun/article/details/127561672