AC Automaton (1): Introduction to the Trie Tree (Dictionary Tree)


Previously we introduced the KMP algorithm, which performs single-pattern matching. Checking whether an article contains any of a set of sensitive words is a multi-pattern matching problem. You can still solve it by running KMP once per pattern, but the time complexity becomes O(c*(m+n)), where c is the number of pattern strings, m is the length of a pattern string, and n is the length of the text, so the cost is no longer linear. We study algorithms in order to push a solution as close to optimal as we can, and this is where the AC automaton becomes useful.
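To make the O(c*(m+n)) cost concrete, here is a minimal sketch of the "one KMP pass per pattern" approach. The function names `kmp_contains` and `count_patterns_found` are hypothetical, chosen only for this illustration, and the `fail` array plays the role of the next values mentioned in the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if pattern p occurs in text t, using KMP: O(m + n). */
int kmp_contains(const char *t, const char *p)
{
    size_t n = strlen(t), m = strlen(p), i, k;
    size_t *fail;
    int found = 0;

    if (m == 0)
        return 1;
    fail = (size_t *)malloc(m * sizeof *fail);
    assert(fail != NULL);

    /* fail[i] = length of the longest proper prefix of p[0..i]
       that is also a suffix of p[0..i] */
    fail[0] = 0;
    for (i = 1, k = 0; i < m; i++) {
        while (k > 0 && p[i] != p[k])
            k = fail[k - 1];
        if (p[i] == p[k])
            k++;
        fail[i] = k;
    }

    /* Scan the text once, falling back through fail[] on a mismatch. */
    for (i = 0, k = 0; i < n; i++) {
        while (k > 0 && t[i] != p[k])
            k = fail[k - 1];
        if (t[i] == p[k])
            k++;
        if (k == m) {
            found = 1;
            break;
        }
    }
    free(fail);
    return found;
}

/* One full KMP pass per pattern: c patterns cost O(c * (m + n)) in total. */
int count_patterns_found(const char *t, const char *const patterns[], int c)
{
    int i, hits = 0;
    for (i = 0; i < c; i++)
        hits += kmp_contains(t, patterns[i]);
    return hits;
}
```

Every extra pattern rescans the whole text, which is exactly the repeated work that a structure shared by all patterns is meant to remove.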

The AC automaton is built on the Trie tree and borrows the idea behind KMP: a fail pointer is added to each Trie node, and it plays the same role as the next value in the KMP algorithm. This brings the matching time back to linear, O(N).

 

First, we introduce the Trie tree.

Figure: a Trie tree storing the six strings tea, ten, to, in, inn, int.

 

The basic properties of Trie trees can be summarized as:

(1) The root node contains no character; every other node contains exactly one character.

(2) Concatenating the characters on the path from the root node to a given node yields the string that corresponds to that node.

(3) The child nodes of any one node all contain different characters.

The Trie tree also has a disadvantage: if the system holds a large number of strings that share almost no common prefixes, the corresponding Trie tree consumes a lot of memory.
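The memory cost can be illustrated by counting how many nodes each insertion allocates. This is a minimal sketch that assumes lowercase input and a 26-pointer node layout; the names `MNode`, `mnode_new` and `mem_insert` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHA 26

struct MNode {
    int is_word;
    struct MNode *child[ALPHA];   /* 26 pointers per node, used or not */
};

struct MNode *mnode_new(void)
{
    struct MNode *n = (struct MNode *)calloc(1, sizeof *n);
    assert(n != NULL);
    return n;
}

/* Inserts s and returns how many new nodes had to be allocated for it. */
int mem_insert(struct MNode *root, const char *s)
{
    int created = 0;
    for (; *s != '\0'; s++) {
        int k = *s - 'a';
        assert(k >= 0 && k < ALPHA);   /* assumption: lowercase letters only */
        if (root->child[k] == NULL) {
            root->child[k] = mnode_new();
            created++;
        }
        root = root->child[k];
    }
    root->is_word = 1;
    return created;
}
```

Inserting tea, ten and to allocates 3, 1 and 1 nodes because the prefixes are shared, while abc, def and ghi allocate 3 nodes each. Every node carries 26 pointers, so with 8-byte pointers a single node already takes more than 200 bytes.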

 

 

Now let's look at a basic implementation of the Trie tree.

Insertion (Insert), deletion (Delete) and search (Find) on the letter tree are all simple. A single loop is enough: the i-th iteration moves to the subtree for the i-th letter, and the operation is performed on the node reached at the end. The tree can be stored in a plain array (statically allocated memory) or with pointers (dynamically allocated memory). For the links from a node to its children, there are generally three methods:

(1) Give each node an array as large as the alphabet. The index is the letter of the child, and the entry is the position of that child in the large node array.

(2) Hang a linked list on each node and record its children in some fixed order (less space, but more time).

(3) Record the tree in left-child, right-sibling notation (the least space, but relatively slow and harder to write).
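The third layout is the least obvious of the three, so here is a minimal sketch of it. The names `LcrsNode`, `lcrs_new`, `lcrs_insert` and `lcrs_find` are hypothetical and serve only as an illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Left child, right sibling: each node stores only two pointers. */
struct LcrsNode {
    char ch;                    /* character held by this node */
    int is_word;                /* 1 if a stored string ends here */
    struct LcrsNode *child;     /* first child */
    struct LcrsNode *sibling;   /* next child of the same parent */
};

struct LcrsNode *lcrs_new(char ch)
{
    struct LcrsNode *n = (struct LcrsNode *)calloc(1, sizeof *n);
    assert(n != NULL);
    n->ch = ch;
    return n;
}

/* Linear scan of the sibling list: the time cost of the compact layout. */
static struct LcrsNode *lcrs_child(struct LcrsNode *parent, char ch)
{
    struct LcrsNode *c = parent->child;
    while (c != NULL && c->ch != ch)
        c = c->sibling;
    return c;
}

void lcrs_insert(struct LcrsNode *root, const char *s)
{
    struct LcrsNode *p = root;
    for (; *s != '\0'; s++) {
        struct LcrsNode *c = lcrs_child(p, *s);
        if (c == NULL) {
            c = lcrs_new(*s);
            c->sibling = p->child;  /* push onto the front of the list */
            p->child = c;
        }
        p = c;
    }
    p->is_word = 1;
}

int lcrs_find(struct LcrsNode *root, const char *s)
{
    struct LcrsNode *p = root;
    for (; *s != '\0' && p != NULL; s++)
        p = lcrs_child(p, *s);
    return p != NULL && p->is_word;
}
```

Compared with the array layout, a node here needs two pointers instead of one per letter, but finding a child costs a scan over up to 26 siblings instead of a single index lookup.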

 

 

```c
#include <stdlib.h>

// Number of children per node: one slot for each of the 26 lowercase letters
#define MAX_NUM 26

// Node type: COMPLETED means the path from the root to this node spells a stored string
enum NODE_TYPE
{
    COMPLETED,
    UNCOMPLETED
};

// Node data type
struct Node
{
    enum NODE_TYPE type;
    char ch;
    struct Node* child[MAX_NUM];
};

struct Node* ROOT;

// Create a new node that holds ch and has no children
struct Node* createNewNode(char ch)
{
    struct Node* new_node = (struct Node*) malloc(sizeof(struct Node));
    new_node->ch = ch;
    new_node->type = UNCOMPLETED;

    int i;
    for(i=0; i<MAX_NUM; i++)
        new_node->child[i] = NULL;

    return new_node;
}

// Initialize the Trie tree; the root node holds no character
void initialization()
{
    ROOT = createNewNode('\0');
}

// Map a lowercase letter 'a'..'z' to a child index 0..25
int charToindex(char ch)
{
    return ch - 'a';
}

// Query: returns 1 if the string has been inserted, 0 otherwise
int find(const char chars[], int len)
{
    struct Node* ptr = ROOT;
    int i = 0;

    while(i<len)
    {
        if(ptr->child[charToindex(chars[i])] == NULL)
             break;
        ptr = ptr->child[charToindex(chars[i])];
        i ++;
    }

    return (i == len) && (ptr->type == COMPLETED);
}

// Insert the string into the Trie tree (lowercase letters only)
void insert(const char chars[], int len)
{
    struct Node* ptr = ROOT;
    int i;
    for(i = 0; i<len; i++)
    {
        if(ptr->child[charToindex(chars[i])] == NULL)
        {
            // assignment, not comparison: attach the new child
            ptr->child[charToindex(chars[i])] = createNewNode(chars[i]);
        }

        ptr = ptr->child[charToindex(chars[i])];
    }

    ptr->type = COMPLETED;
}
```
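The listing above covers insertion and search but not deletion. The following is a minimal sketch of one possible way to delete a string: unmark its end node, then prune nodes that no longer lead to any stored string. It is self-contained and uses its own hypothetical names (`TNode`, `trie_insert`, `trie_find`, `trie_erase`) and assumes lowercase input.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHA 26

struct TNode {
    int is_word;
    struct TNode *child[ALPHA];
};

struct TNode *tnode_new(void)
{
    struct TNode *n = (struct TNode *)calloc(1, sizeof *n);
    assert(n != NULL);
    return n;
}

void trie_insert(struct TNode *root, const char *s)
{
    for (; *s != '\0'; s++) {
        int k = *s - 'a';
        assert(k >= 0 && k < ALPHA);
        if (root->child[k] == NULL)
            root->child[k] = tnode_new();
        root = root->child[k];
    }
    root->is_word = 1;
}

int trie_find(const struct TNode *root, const char *s)
{
    for (; *s != '\0' && root != NULL; s++) {
        int k = *s - 'a';
        assert(k >= 0 && k < ALPHA);
        root = root->child[k];
    }
    return root != NULL && root->is_word;
}

static int has_children(const struct TNode *n)
{
    int i;
    for (i = 0; i < ALPHA; i++)
        if (n->child[i] != NULL)
            return 1;
    return 0;
}

/* Returns 1 if node n is now useless (no word ends here, no children)
   and may be freed by the caller. */
static int erase_rec(struct TNode *n, const char *s)
{
    if (*s == '\0') {
        n->is_word = 0;
    } else {
        int k = *s - 'a';
        assert(k >= 0 && k < ALPHA);
        if (n->child[k] == NULL)
            return 0;               /* string not present: nothing to do */
        if (erase_rec(n->child[k], s + 1)) {
            free(n->child[k]);
            n->child[k] = NULL;
        }
    }
    return !n->is_word && !has_children(n);
}

void trie_erase(struct TNode *root, const char *s)
{
    (void)erase_rec(root, s);   /* the root itself is never freed */
}
```

With in, inn and int stored, erasing inn frees only the last n node, because the node for in is still a complete word and still has the child t.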

Trie tree applications

 

(1) String retrieval
Store information about a set of known strings (a dictionary) in a Trie tree in advance, then look up whether other, unknown strings have appeared, or how often they appear.
Examples:
1. Given a vocabulary list of N known words and an article written in lowercase English, output every new word that is not in the vocabulary list, in order of first appearance.
2. Given a dictionary of bad words, each written in lowercase letters, and a text whose lines also consist of lowercase letters, determine whether the text contains any bad word. For example, if rob is a bad word, then the text problem contains a bad word.
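The second example can be sketched with a plain Trie by restarting the walk at every position of the text. This is only an illustration under the assumption of lowercase words, with hypothetical names (`WNode`, `dict_add`, `contains_bad_word`); it is not the AC automaton.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHA 26

struct WNode {
    int is_word;
    struct WNode *child[ALPHA];
};

struct WNode *wnode_new(void)
{
    struct WNode *n = (struct WNode *)calloc(1, sizeof *n);
    assert(n != NULL);
    return n;
}

/* Adds one bad word (lowercase letters only) to the dictionary Trie. */
void dict_add(struct WNode *root, const char *w)
{
    for (; *w != '\0'; w++) {
        int k = *w - 'a';
        assert(k >= 0 && k < ALPHA);
        if (root->child[k] == NULL)
            root->child[k] = wnode_new();
        root = root->child[k];
    }
    root->is_word = 1;
}

/* Returns 1 if any dictionary word occurs as a substring of text.
   From every start position, walk down the Trie until the path ends. */
int contains_bad_word(const struct WNode *root, const char *text)
{
    const char *start;
    for (start = text; *start != '\0'; start++) {
        const struct WNode *p = root;
        const char *s;
        /* any character outside 'a'..'z' stops the walk */
        for (s = start; *s >= 'a' && *s <= 'z'; s++) {
            p = p->child[*s - 'a'];
            if (p == NULL)
                break;
            if (p->is_word)
                return 1;
        }
    }
    return 0;
}
```

Each start position can walk as deep as the longest dictionary word, so the scan is not linear in the text length. Avoiding those restarts is what the fail pointer of the AC automaton is for.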

(2) Longest common prefix of strings
The Trie tree uses the common prefixes of strings to save storage space. Conversely, once a large number of strings are stored in a Trie tree, the common prefix of any of them can be found quickly.
Example:
1. Given N strings of lowercase English letters and Q queries, each query asks for the length of the longest common prefix of two of the strings.
Solution: First build the letter tree for all the strings. The length of the longest common prefix of two strings equals the depth of the nearest common ancestor of their two nodes, that is, the number of characters on the path from the root to that ancestor. The problem therefore becomes the offline Least Common Ancestors (LCA) problem.
The nearest common ancestor problem is itself a classic, and the following methods can be used:
1) Tarjan's classic algorithm, which uses a disjoint set (union-find);
2) Compute the Euler sequence of the letter tree, which converts the problem into the classic Range Minimum Query (RMQ) problem.
(There is plenty of material online about union-find, Tarjan's algorithm, and the RMQ problem.)
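The reduction to LCA can be sketched as follows. For brevity this illustration finds the common ancestor by naively climbing parent pointers, which costs time proportional to the string lengths per query; it is not Tarjan's offline algorithm and not the RMQ method. The names `PNode`, `pnode_new`, `lcp_insert` and `lcp_length` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHA 26

struct PNode {
    int depth;                  /* characters on the path from the root */
    struct PNode *parent;
    struct PNode *child[ALPHA];
};

struct PNode *pnode_new(struct PNode *parent)
{
    struct PNode *n = (struct PNode *)calloc(1, sizeof *n);
    assert(n != NULL);
    n->parent = parent;
    n->depth = parent ? parent->depth + 1 : 0;
    return n;
}

/* Inserts s (lowercase only) and returns the node where it ends. */
struct PNode *lcp_insert(struct PNode *root, const char *s)
{
    struct PNode *p = root;
    for (; *s != '\0'; s++) {
        int k = *s - 'a';
        assert(k >= 0 && k < ALPHA);
        if (p->child[k] == NULL)
            p->child[k] = pnode_new(p);
        p = p->child[k];
    }
    return p;
}

/* Depth of the nearest common ancestor of a and b, which equals the
   length of the longest common prefix of their strings.
   Both nodes must belong to the same Trie. */
int lcp_length(const struct PNode *a, const struct PNode *b)
{
    while (a->depth > b->depth)
        a = a->parent;
    while (b->depth > a->depth)
        b = b->parent;
    while (a != b) {
        a = a->parent;
        b = b->parent;
    }
    return a->depth;
}
```

For tea and ten the two end nodes meet at the node for te, whose depth is 2. For tea and inn they meet only at the root, whose depth is 0.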

(3) Sorting
The Trie tree is a multi-way tree. Traversing the whole tree in preorder, visiting the children of each node in alphabetical order, outputs the stored strings in lexicographic order.
Example:
Given N distinct English names, each consisting of a single word, sort them in lexicographic order from smallest to largest.
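A minimal sketch of sorting by preorder traversal is shown below. It assumes lowercase words shorter than `MAX_LEN` characters and a caller-supplied output array with enough rows; `MAX_LEN`, `SNode`, `sort_insert` and `trie_sorted` are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ALPHA 26
#define MAX_LEN 64   /* assumed upper bound on word length, including '\0' */

struct SNode {
    int is_word;
    struct SNode *child[ALPHA];
};

struct SNode *snode_new(void)
{
    struct SNode *n = (struct SNode *)calloc(1, sizeof *n);
    assert(n != NULL);
    return n;
}

void sort_insert(struct SNode *root, const char *s)
{
    assert(strlen(s) < MAX_LEN);
    for (; *s != '\0'; s++) {
        int k = *s - 'a';
        assert(k >= 0 && k < ALPHA);
        if (root->child[k] == NULL)
            root->child[k] = snode_new();
        root = root->child[k];
    }
    root->is_word = 1;
}

/* Preorder walk: handle the node first, then its children from 'a' to 'z'.
   buf holds the path so far; each stored word is copied to out[*count]. */
static void walk(const struct SNode *n, char *buf, int depth,
                 char out[][MAX_LEN], int *count)
{
    int k;
    if (n->is_word) {
        buf[depth] = '\0';
        strcpy(out[*count], buf);
        (*count)++;
    }
    for (k = 0; k < ALPHA; k++) {
        if (n->child[k] != NULL) {
            buf[depth] = (char)('a' + k);
            walk(n->child[k], buf, depth + 1, out, count);
        }
    }
}

/* Writes the stored words to out in lexicographic order; returns how many. */
int trie_sorted(const struct SNode *root, char out[][MAX_LEN])
{
    char buf[MAX_LEN];
    int count = 0;
    walk(root, buf, 0, out, &count);
    return count;
}
```

Because a node is handled before its children, a word such as in is emitted before inn and int, and visiting children in alphabetical order places all words starting with i before those starting with t.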

(4) As an auxiliary structure for other data structures and algorithms
Examples include suffix trees and AC automata.

 

Finally, the space cost of a Trie tree can be very large: with 26 child slots per node, a tree of depth n can hold on the order of 26^n nodes in the worst case. A double-array Trie can be used to reduce this cost.
