Standard Trie, Compressed Trie, Suffix Trie

ref : https://dsqiu.iteye.com/blog/1705697

 

1. Introduction to tries

A trie, also known as a prefix tree or dictionary tree, is a tree-based data structure and a variant of the hash tree. It is typically used for counting and sorting strings, and search engine systems often use it for word-frequency statistics on text. Its main purpose is to support fast pattern matching on strings, chiefly in information retrieval; the queries a trie supports are whole-word matching and prefix matching. A trie can also be viewed as a deterministic finite automaton (DFA). Finite automata are introduced in another blog post on string pattern matching algorithms (BM, Horspool, Sunday, KMP, KR, AC).

 

2. Standard Trie

Let S be a set of s strings over the alphabet Σ such that no string in S is a prefix of another string in S. A standard trie for S is an ordered tree T with the following properties:

 

  • Every node of T except the root is labeled with a character of Σ.
  • The order of the children of an internal node of T is determined by the canonical ordering of the alphabet Σ.
  • T has s external nodes (leaf nodes), each associated with a string of S, such that concatenating the labels on the path from the root to an external node v yields the string of S associated with v.

The figure shows the standard trie for the strings {bear, bell, bid, bull, buy, sell, stock, stop}.

                                                    

A standard trie T storing a set S of s strings of total length n over an alphabet of size d has the following properties:

 

  1. Each internal node of T has at most d children.
  2. T has s external nodes.
  3. The height of T equals the length of the longest string in S.
  4. The number of nodes of T is O(n).
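The four properties can be checked on the example set with a small sketch. It assumes a nested-dictionary representation (one dictionary per node, keyed by the child's character); the helper names build_trie, count_leaves, count_nodes, height and max_children are illustrative and do not come from the original post.

```python
def build_trie(words):
    """Standard trie as nested dicts: each key labels the child node it maps to."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root

def count_leaves(node):
    """External nodes are the nodes without children."""
    return 1 if not node else sum(count_leaves(c) for c in node.values())

def count_nodes(node):
    """All nodes in the subtree, including this one."""
    return 1 + sum(count_nodes(c) for c in node.values())

def height(node):
    """Number of edges on the longest root-to-leaf path."""
    return 0 if not node else 1 + max(height(c) for c in node.values())

def max_children(node):
    """Largest number of children of any node in the subtree."""
    return max([len(node)] + [max_children(c) for c in node.values()])

WORDS = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
trie = build_trie(WORDS)
s = len(WORDS)                      # number of strings
n = sum(len(w) for w in WORDS)      # total length of all strings

print(max_children(trie) <= 26)                 # property 1, with d = 26
print(count_leaves(trie) == s)                  # property 2
print(height(trie) == max(map(len, WORDS)))     # property 3
print(count_nodes(trie) <= n + 1)               # property 4: at most one node per character, plus the root
```

The leaf count equals s only because no word in the set is a prefix of another, which is the assumption made in the definition.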

Performance: For a string of length m over an alphabet of size d, locating the child pointer in an internal node takes O(d) time, where d is the size of the alphabet (26 for English). If the child pointers are stored in an array indexed directly by character, locating a child drops to O(1). That is not practical for Chinese text, as discussed under the practical applications below, so we keep the O(d) bound here. A successful search follows exactly one path from the root to a leaf, so its time complexity is O(d * m). The worst case for the trie arises when no two strings in the set share a prefix: every internal node except the root then has exactly one child, and the trie needs one node per character, n nodes plus the root. A lookup is still bounded by O(d * m) in this case; what degenerates is the space, since nothing is shared.
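The O(1) child lookup via a directly indexed array can be sketched as follows. This is a minimal illustration that assumes lowercase English letters only; the class ArrayNode and the functions insert and contains are hypothetical names, not taken from the original post.

```python
class ArrayNode:
    """Trie node with one child slot per letter a-z (assumes lowercase ASCII input)."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = [None] * 26
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        i = ord(ch) - ord("a")          # direct index: no scan over the alphabet
        if node.children[i] is None:
            node.children[i] = ArrayNode()
        node = node.children[i]
    node.is_word = True

def contains(root, word):
    node = root
    for ch in word:
        node = node.children[ord(ch) - ord("a")]
        if node is None:
            return False
    return node.is_word

root = ArrayNode()
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    insert(root, w)

print(contains(root, "bull"), contains(root, "bul"), contains(root, "bet"))
```

Each step of the walk costs one array access, so a lookup of a string of length m takes O(m) time, at the price of 26 slots per node.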

 

 

 

Standard tries for Chinese words

      Chinese has far more characters than the 26 letters of English, so an internal node of the trie cannot simply hold an array of 26 pointers. If every node reserved pointer space for tens of thousands of Chinese characters, memory would likely be exhausted, and even the disk footprint would be large.

 

      In general, the following measures are taken:

     (1) Words that start with the same character are grouped into one tree rooted at that character. A set of Chinese words then forms a trie forest, which is stored on disk. The root character of each tree and the disk position of its root are recorded in a character table sorted by Unicode code point. This table can be kept in memory.

     (2) The child pointers of internal nodes are stored in variable-length arrays.

 

     Features: Since Chinese words rarely exceed four characters, the trie is shallow. Most of the lookup time is spent locating the child pointer inside an internal node. The characters that the pointers refer to are therefore kept sorted by Unicode code point, so that once a node is loaded into memory a binary search can be used to improve efficiency.
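Measure (2) combined with the binary search can be sketched like this. It is only an in-memory illustration (the on-disk forest of measure (1) is not modeled), and the names Node, child, child_or_create, insert and contains are assumptions made for the example. Python compares single-character strings by code point, which gives the Unicode ordering described above.

```python
from bisect import bisect_left

class Node:
    """Trie node whose children live in variable-length arrays sorted by code point."""

    def __init__(self):
        self.keys = []      # child characters, kept sorted
        self.kids = []      # child nodes, parallel to self.keys
        self.is_word = False

    def child(self, ch):
        """Binary search for the child labeled ch; None if absent."""
        i = bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.kids[i]
        return None

    def child_or_create(self, ch):
        i = bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.kids[i]
        node = Node()
        self.keys.insert(i, ch)     # insert at the sorted position
        self.kids.insert(i, node)
        return node

def insert(root, word):
    node = root
    for ch in word:
        node = node.child_or_create(ch)
    node.is_word = True

def contains(root, word):
    node = root
    for ch in word:
        node = node.child(ch)
        if node is None:
            return False
    return node.is_word

root = Node()
for w in ["中国", "中文", "中间", "北京"]:
    insert(root, w)

print(contains(root, "中文"), contains(root, "中"), contains(root, "上海"))
```

A node with c children costs O(log c) per lookup and stores only the children it actually has, instead of one slot per possible character.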

 

Standard trie application example


                             

 

Applications, advantages and disadvantages of the standard trie

     (1) Whole-word matching: determine whether a given string exactly matches a word in the set.

     (2) Prefix matching: find all strings in the set that have the string s as a prefix.
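Both operations can be sketched on a nested-dictionary trie. The end-of-word marker END and the function names build_trie, is_word and words_with_prefix are assumptions introduced for this illustration.

```python
END = "$"   # hypothetical end-of-word marker; assumed not to occur in any word

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def is_word(trie, word):
    """Whole-word matching: walk the path and check the end marker."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

def words_with_prefix(trie, prefix):
    """Prefix matching: walk to the prefix node, then collect every word below it."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [(node, prefix)]
    while stack:
        cur, text = stack.pop()
        for key, child in cur.items():
            if key == END:
                found.append(text)
            else:
                stack.append((child, text + key))
    return sorted(found)

trie = build_trie(["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"])
print(is_word(trie, "bid"), is_word(trie, "bi"))
print(words_with_prefix(trie, "be"), words_with_prefix(trie, "st"))
```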

 

     Note: the structure of a trie is not suited to substring search. In this respect its role is very different from that of the PAT tree and of the suffix trie covered later.

 

      Advantages: searching is far more efficient than matching against every string in the set one by one. Checking whether a string s of length m is in the dictionary takes O(m) time.

      Disadvantages: a standard trie uses space poorly. Many nodes may have only one child, and such nodes are pure waste. This motivated the compressed trie described next.

 

 

Compressed Trie

A compressed trie is similar to a standard trie, but it ensures that every internal node has at least two children. This rule is enforced by compressing each chain of single-child nodes into a single edge. Let T be a standard trie. An internal node v of T is called redundant if v has exactly one child and is not the root.

For example, the figure shows the compressed trie for the strings {bear, bell, bid, bull, buy, sell, stock, stop}.
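The compression step, merging each chain of redundant nodes into one edge, can be sketched on nested dictionaries. The functions build_trie, compress and count_nodes are illustrative names, and the sketch assumes that no word is a prefix of another (otherwise the end of the shorter word would be merged away).

```python
def build_trie(words):
    """Standard trie as nested dicts keyed by single characters."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root

def compress(node):
    """Return a compressed trie: keys are edge labels of one or more characters."""
    out = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:                  # child is redundant: absorb it into the label
            (nxt, grandchild), = child.items()
            label += nxt
            child = grandchild
        out[label] = compress(child)
    return out

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.values())

WORDS = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
standard = build_trie(WORDS)
compressed = compress(standard)

print(compressed)
print(count_nodes(standard), count_nodes(compressed))
```

On this set the standard trie has 22 nodes and the compressed trie has 14; the root keeps its two children "b" and "s", and chains such as e-l-l under "s" collapse into the single label "ell".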

 

Properties and advantages of the compressed trie:

     Compared with a standard trie, the number of nodes in a compressed trie is proportional to the number of strings rather than to their total length. A compressed trie T storing a set of s strings over an alphabet of size d has the following properties:

     (1) Each internal node of T has at least two children and at most d children.

     (2) T has s external nodes.

     (3) The number of nodes of T is O(s).

Compression reduces the number of nodes from O(n) for the standard trie to O(s), where n is the total length of the strings in the set and s is the number of strings.

 

Compressed trie application example

 

Suppose the set S is stored as an array of strings S[0], S[1], ..., S[s-1]. Instead of storing a node label X explicitly, we represent it implicitly by a triple (i, j, k) such that X = S[i][j..k]; that is, X is the substring of S[i] consisting of the characters from position j to position k, inclusive.
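As a small illustration of the triple notation, the sketch below decodes a few triples against the example word set. The array order and the helper name label are assumptions made for the example; positions are 0-based and k is inclusive, matching the definition above.

```python
S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]

def label(triple):
    """Decode an implicit label (i, j, k) into the substring S[i][j..k], k inclusive."""
    i, j, k = triple
    return S[i][j:k + 1]

# Each node stores three integers instead of a copy of its label.
print(label((0, 2, 3)))   # characters 2..3 of "bear"
print(label((5, 1, 3)))   # characters 1..3 of "sell"
print(label((6, 1, 2)))   # characters 1..2 of "stock"
```

Because every label has constant size in this form, the total space of the compressed trie is governed by its node count rather than by the lengths of the labels.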


                                  

 

Suffix Trie

 

The string set of a suffix trie consists of the suffixes of a given string. For example, the suffixes of the string "minimize" form the following set S:

 

         s1 = minimize
         s2 = inimize
         s3 = nimize
         s4 = imize
         s5 = mize
         s6 = ize
         s7 = ze
         s8 = e

Taking the common prefixes of these suffixes as internal nodes then yields the suffix trie of "minimize", as shown below:


                                                             

Saving space

Since the total length of the suffixes of a string X of length n is n(n + 1)/2, storing all suffixes of X explicitly requires O(n²) space. The suffix trie represents these strings implicitly and requires only O(n) space.
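The arithmetic can be checked directly for "minimize" (n = 8). The function name suffixes is illustrative; the point of the sketch is that suffix i is fully determined by its start index i, which is what makes an implicit representation possible.

```python
def suffixes(text):
    """All non-empty suffixes of text, longest first."""
    return [text[i:] for i in range(len(text))]

X = "minimize"
n = len(X)

explicit_chars = sum(len(s) for s in suffixes(X))   # characters stored if every suffix is copied
print(suffixes(X))
print(explicit_chars, n * (n + 1) // 2)             # both are 36 for n = 8

# Implicit form: one start index per suffix, n integers in total.
starts = list(range(n))
print(all(X[i:] == s for i, s in zip(starts, suffixes(X))))
```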

Building a suffix trie (illustrated)

When a suffix is inserted and a leaf node is found whose key shares a common prefix with it, that leaf node must be split, as in steps 3 and 4 of the figure. Otherwise a new leaf node is created to store the suffix, as in steps 1 and 2 of the figure.
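The insert-or-split rule can be sketched as a naive construction that inserts the suffixes one by one. This is not a linear-time algorithm and it stores labels explicitly rather than as index pairs; the names Node, insert, build_suffix_trie and to_dict are hypothetical. The sketch also assumes that no suffix is a prefix of another, which holds for "minimize" because its last character occurs nowhere else (in general a unique terminator would be appended).

```python
class Node:
    def __init__(self):
        self.edges = {}     # first character of label -> [label, child Node]

def insert(root, s):
    node = root
    while s:
        edge = node.edges.get(s[0])
        if edge is None:
            node.edges[s[0]] = [s, Node()]      # no shared prefix: create a new leaf
            return
        label, child = edge
        k = 0                                   # length of the common prefix of label and s
        while k < len(label) and k < len(s) and label[k] == s[k]:
            k += 1
        if k < len(label):                      # partial match: split the edge at position k
            mid = Node()
            mid.edges[label[k]] = [label[k:], child]
            edge[0], edge[1] = label[:k], mid
            child = mid
        node, s = child, s[k:]                  # continue below with the unmatched remainder

def build_suffix_trie(text):
    root = Node()
    for i in range(len(text)):
        insert(root, text[i:])
    return root

def to_dict(node):
    """Nested-dict view {label: subtree} for printing and comparison."""
    return {label: to_dict(child) for label, child in node.edges.values()}

print(to_dict(build_suffix_trie("minimize")))
```

For "minimize" the first three suffixes each create a leaf, and inserting "imize" then splits the edge "inimize" into "i" followed by "nimize", which corresponds to the split case described above.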

 

Substring queries on a suffix trie

 

     To find a substring P in the suffix trie T, proceed as follows:

     (1) Starting from the root, traverse all of its child nodes: N1, N2, N3, ....

     (2) If the first character of the key of every child node differs from the first character of P, the substring does not occur and the search ends.

     (3) Otherwise, suppose the key K3 of node N3 has the same first character as P. Match K3 against P:

          If K3.length >= P.length and K3.subString(0, P.length) = P, the match succeeds; otherwise it fails. (Here subString(a, b) covers positions a to b-1.)

          If K3.length < P.length and K3 = P.subString(0, K3.length), take the substring P1 = P.subString(K3.length, P.length), i.e. what remains of P after removing K3. Then repeat steps (1) to (3) with P1, treating N3 as the root. If all characters of P1 are eventually matched, the match succeeds; otherwise it fails.
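Steps (1) to (3) can be sketched as a loop over a compressed suffix trie. To keep the example self-contained, the suffix trie of "minimize" is written out by hand as nested dictionaries mapping each edge label to its subtree; that representation and the name contains_substring are assumptions of the sketch.

```python
# Compressed suffix trie of "minimize", written by hand: {edge label: subtree}.
SUFFIX_TRIE = {
    "mi": {"nimize": {}, "ze": {}},
    "i": {"nimize": {}, "mize": {}, "ze": {}},
    "nimize": {},
    "ze": {},
    "e": {},
}

def contains_substring(node, p):
    while p:
        for key, child in node.items():         # step (1): scan the children
            if key[0] == p[0]:                  # step (3): first characters agree
                break
        else:
            return False                        # step (2): no child starts with p[0]
        if len(key) >= len(p):
            return key.startswith(p)            # p must end inside this key
        if not p.startswith(key):
            return False                        # mismatch inside the key
        node, p = child, p[len(key):]           # descend and continue with the remainder P1
    return True

for query in ["nim", "imi", "mize", "mim", "zen"]:
    print(query, contains_substring(SUFFIX_TRIE, query))
```

The scan over the children in step (1) is the O(d) factor of the query cost; looking the child up by its first character in a hash table would remove it.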

 

Query efficiency: In the algorithm above, a successful match compares exactly P.length characters. Locating a child pointer works as in a standard trie: with an alphabet of size d, the query takes O(d * m) time. In practice d is a fixed constant, and if children are located directly through a hash table, the factor d becomes 1.

        Thus a substring query for P in the suffix trie takes O(m) time, where m is the length of P. Building the suffix trie, however, takes O(dn) time.

Suffix trie applications

A standard trie is suited only to prefix matching and whole-word matching, not to suffix or substring matching. The suffix trie is well suited to both.

 

 

 

 

 

 

 

 

References:

Michael T. Goodrich, Roberto Tamassia. Algorithm Design: Foundations, Analysis, and Internet Examples.

Heart.X.Raid: http://hxraid.iteye.com/blog/618962

Heart.X.Raid: http://hxraid.iteye.com/blog/620414

 


Origin www.cnblogs.com/schips/p/11098165.html