Over the weekend, I got hung up on a tree

Double-array tries are used in many scenarios, but I had never really understood how one is built. Drawing on several experts' articles, I have summarized the construction process of a double-array trie in a way I can understand, together with some practical examples to get a feel for how it is used. The article starts from the basic trie and then works through the rest step by step.

graph LR
A[Array Trie] --> B[List Trie]
B --> C[Hash Trie]
C --> D[Double Array Trie]

Before looking at the double-array trie, let's first look at what a trie is.

Trie (dictionary tree)

Definition of a trie

Trie: also known as a dictionary tree or prefix tree, it is a tree data structure for strings. It arranges a set of strings into the shape of a tree, as in the figure (not included here) for the finite set {AC, ACE, ACFF, AD, CD, CF, ZQ}, where R represents the root node.

When processing strings, a common task is deciding whether a given string is present in a set of strings, and this is the bottleneck of matching algorithms. With an ordinary approach, a linear scan takes O(n) comparisons, binary search takes O(log n), and a TreeMap also takes O(log n), where n is the size of the dictionary. A HashMap takes O(1), but its space cost goes up. So we want a data structure that is fast to query, saves memory, and can still do the matching. A trie meets these requirements. First, a brief look at how a trie works.
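As a rough illustration of the lookups compared above, the sketch below checks membership with a linear scan, a binary search over a sorted array, a TreeSet and a HashSet. The class name LookupDemo and its method names are made up for this example, and TreeSet and HashSet stand in for the TreeMap and HashMap mentioned above (they are backed by those maps).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.TreeSet;

// Hypothetical demo class, not part of the original article.
class LookupDemo {
    static final List<String> WORDS =
            Arrays.asList("AC", "ACE", "ACFF", "AD", "CD", "CF", "ZQ");

    // Linear scan: may compare against every entry, O(n) comparisons.
    static boolean linear(String key) {
        for (String w : WORDS) {
            if (w.equals(key)) {
                return true;
            }
        }
        return false;
    }

    // Binary search over a sorted copy: O(log n) comparisons per lookup.
    // The copy is sorted on every call only to keep the sketch self-contained.
    static boolean binary(String key) {
        String[] sorted = WORDS.toArray(new String[0]);
        Arrays.sort(sorted);
        return Arrays.binarySearch(sorted, key) >= 0;
    }

    // TreeSet is a balanced tree: O(log n) comparisons per lookup.
    static boolean tree(String key) {
        return new TreeSet<>(WORDS).contains(key);
    }

    // HashSet: expected O(1) per lookup, at the cost of extra memory for the table.
    static boolean hash(String key) {
        return new HashSet<>(WORDS).contains(key);
    }
}
```

In real code the sorted array and the sets would be built once and reused; only the lookup step differs in cost.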

How a trie works

Each edge of a trie corresponds to one character, and a path from the root downward spells out a string. A trie does not store the string in a node directly; instead, a word is regarded as the path from the root to some node, and the end node (the node where the word ends) is marked. To query a word, walk down its path from the root: if you arrive at a marked node, the string is in the set; otherwise it is not. The figure (not included here) shows the prefix tree built from the words {"abc", "abcd", "adb", "b", "bcd", "efg", "hik"}. In it, a node with an orange mark is the end of a word (the end of a word is not necessarily a leaf node), the numbers are only node labels, and the path for each word is listed in the table below.

| Word | Path |
| - | - |
| abc | 0-1-2-3 |
| abcd | 0-1-2-3-4 |
| adb | 0-1-2-5 |
| b | 0-6 |
| bcd | 0-6-7-8 |
| efg | 0-9-10-11 |
| hik | 0-12-13-14 |

Note: an orange node is not necessarily a leaf node; that is, the end of a word is not necessarily a leaf. A trie lookup takes at most as many steps as the word has characters, independent of the dictionary size n, and in practice it is faster than binary search: prefix matching proceeds progressively as the path deepens, so the algorithm never needs to re-compare a prefix it has already matched.

Characteristics of a trie

  1. It trades space for time.
  2. The root node contains no character; every other node contains exactly one character.
  3. Concatenating the characters along the path from the root to a node gives the string corresponding to that node.
  4. The children of any node all contain different characters.

An easier way to understand it

For example, suppose we have a list of 10,000 words and need to determine whether the word student is in it. A linear scan takes O(n) comparisons and binary search takes O(log n). A trie lookup depends only on the length of the word, which is why it does better: following the rules of the trie, first find s under the root, then look for t in the subtree of s, and so on, to see whether the whole path for student can be found.
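The character-by-character walk can be sketched as follows. This is a minimal illustration, not the implementation given later in the article: the class WalkDemo, its Node type and the method matchedPrefixLength are hypothetical names introduced here. It reports how many leading characters of a word can be followed from the root, which makes it visible that the work depends on the word length and not on how many words are stored.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not part of the original article.
class WalkDemo {
    // Minimal node: children keyed by character, plus an end-of-word flag.
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        boolean end;
    }

    // Adds one word, creating nodes along its path as needed.
    static void add(Node root, String word) {
        Node n = root;
        for (char c : word.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.end = true;
    }

    // Walks from the root and returns how many leading characters matched.
    static int matchedPrefixLength(Node root, String word) {
        Node n = root;
        int i = 0;
        while (i < word.length() && n.next.containsKey(word.charAt(i))) {
            n = n.next.get(word.charAt(i));
            i++;
        }
        return i;
    }
}
```

With the words student, study and stone stored, looking up student follows all seven characters, while stupid stops after stu and apple stops at the root.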

Implementing a trie

The implementation needs to support the following operations:

  • void insert(String word): add word;
  • void delete(String word): delete word;
  • boolean search(String word): check whether word is in the trie;
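import java.util.HashMap;

/**
 * Prefix tree (trie)
 */
public class TrieTree {
    // Trie node
    class TrieNode {
        public int path;   // number of words whose path passes through this node
        public int end;    // number of words that end at this node
        public HashMap<Character, TrieNode> map;   // child nodes keyed by character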

        public TrieNode() {
            path = 0;
            end = 0;
            map = new HashMap<>();
        }
    }

    private TrieNode root;

    public TrieTree() {
        root = new TrieNode();
    }

    /**
     * Inserts a new word.
     * @param word the word to insert
     */
    public void insert(String word) {
        if (word == null)
            return;
        TrieNode node = root;
        node.path++;
        char[] words = word.toCharArray();
        for (int i = 0; i < words.length; i++) {
            if (node.map.get(words[i]) == null) {
                node.map.put(words[i], new TrieNode());
            }
            node = node.map.get(words[i]);
            node.path++;
        }
        node.end++;
    }

    public boolean search(String word) {
        if (word == null)
            return false;
        TrieNode node = root;
        char[] words = word.toCharArray();
        for (int i = 0; i < words.length; i++) {
            if (node.map.get(words[i]) == null)
                return false;
            node = node.map.get(words[i]);
        }
        return node.end > 0;
    }

    public void delete(String word) {
        // Only delete words that are actually present
        if (search(word)) {
            char[] words = word.toCharArray();
            TrieNode node = root;
            node.path--;
            for (int i = 0; i < words.length; i++) {
                // If no other word passes through the child, drop its whole subtree
                if (--node.map.get(words[i]).path == 0) {
                    node.map.remove(words[i]);
                    return;
                }
                node = node.map.get(words[i]);
            }
            node.end--;
        }
    }

    public int prefixNumber(String pre) {
        if (pre == null)
            return 0;
        TrieNode node = root;
        char[] pres = pre.toCharArray();
        for (int i = 0; i < pres.length; i++) {
            if (node.map.get(pres[i]) == null)
                return 0;
            node = node.map.get(pres[i]);
        }
        return node.path;
    }

    public static void main(String[] args) {
        TrieTree trie = new TrieTree();
        System.out.println(trie.search("程龙颖")); // false
        trie.insert("自然人");
        trie.insert("自然");
        trie.insert("自然语言");
        trie.insert("自语");
        trie.insert("入门");
        System.out.println(trie.search("自然")); // true
        trie.delete("自然语言");
        System.out.println(trie.search("自然语言")); // false
        trie.insert("自然语言");
        System.out.println(trie.search("自然语言")); // true
        System.out.println(trie.prefixNumber("自然")); // 3
    }
}

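A brief look at DFAs

A TrieTree is essentially a deterministic finite automaton (DFA). A DFA is characterized by a finite set of states and a set of edges leading from one state to another, each edge labeled with a symbol; one state is the initial state, and some states are final states. Unlike a nondeterministic finite automaton, a DFA never has two edges leaving the same state that are labeled with the same symbol. In a DFA, each node represents a "state" and each edge represents a "variable" (the input symbol that triggers the transition).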
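To make the trie-as-DFA view concrete, here is a minimal sketch under a few assumptions of my own: states are plain integers with 0 as the initial state, transitions live in a nested map, and the class name DfaDemo is hypothetical. Using a map per state guarantees the DFA property described above, since a state can hold at most one outgoing edge per symbol.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class, not part of the original article.
class DfaDemo {
    // transitions.get(state).get(symbol) is the single next state, if any.
    private final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();
    private final Set<Integer> finals = new HashSet<>();
    private int nextState = 1;   // state 0 is the initial state

    // Adds the path for one word, reusing edges that already exist.
    void addWord(String word) {
        int state = 0;
        for (char c : word.toCharArray()) {
            Map<Character, Integer> edges =
                    transitions.computeIfAbsent(state, k -> new HashMap<>());
            Integer target = edges.get(c);
            if (target == null) {
                target = nextState++;
                edges.put(c, target);
            }
            state = target;
        }
        finals.add(state);   // the state reached by the whole word is a final state
    }

    // Runs the automaton on input and reports whether it stops in a final state.
    boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            Map<Character, Integer> edges = transitions.get(state);
            if (edges == null || !edges.containsKey(c)) {
                return false;   // no edge for this symbol: reject
            }
            state = edges.get(c);
        }
        return finals.contains(state);
    }
}
```

After adding AC, ACE and AD, the automaton accepts AC and ACE but rejects A (not a final state) and ACF (no such edge).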

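Double-array trie

The double-array trie (DoubleArrayTrie, DAT) is an efficient trie implementation proposed by three Japanese researchers that balances query efficiency with storage space. A DAT greatly reduces memory usage.

Advantages

When implementing a trie, we find that every node needs its own array to store its next nodes, which takes up a lot of storage and gives a high space complexity. The double-array trie solves exactly this problem. It is a trie with low space complexity, used in areas such as trie compression, word segmentation and sensitive-word filtering. A DAT is therefore a variant of the prefix tree, and it is likewise a DFA.

Disadvantages

Every state depends on other states, so inserting or deleting a word in the dictionary often requires a global adjustment of the double-array structure, which makes it inflexible.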

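Definition

A trie that originally needed many arrays can be stored with just two, which greatly reduces space complexity. Because it consists of the two arrays base and check, it is called a double-array trie. Specifically, the arrays base[] and check[] maintain the trie: base[] records states and check[] verifies that a state transition is valid. When base[i] is negative, state i marks the end of a string. When state b accepts character c and transitions to state p, the following transition formulas hold: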

p = base[b] + c
check[p] = base[b]
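A single transition step can be sketched like this. The check compares check[p] against base[b], the base value of the state being left, which is what the worked example later in the article uses (check[p4] = check[p5] = base[p1] = 2). The class name Transition, the method next and the convention of returning -1 for a failed transition are assumptions made for this sketch.

```java
// Hypothetical demo class, not part of the original article.
class Transition {
    // Returns the next state p = base[b] + c if the slot belongs to state b,
    // or -1 (a convention chosen for this sketch) if the transition is invalid.
    static int next(int[] base, int[] check, int b, int c) {
        int p = base[b] + c;
        if (p < 0 || p >= check.length) {
            return -1;   // outside the arrays: no such transition
        }
        return check[p] == base[b] ? p : -1;
    }
}
```

For example, with base[0] = 1 and check[67] = 1, calling next(base, check, 0, 66) gives state 67, while a code whose slot has a different check value is rejected.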

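Building the double array

For the dictionary { AC, ACE, ACFF, AD, CD, CF, ZQ }, the double array is built as follows (figure not included here). Before building it, let's go over a few concepts:

  • STATE: a state, i.e. an array index
  • CODE: the state transition value, in practice the character's ASCII code
  • BASE: the array holding the base address of successor nodes; a leaf node has no successor, and its entry marks the end of a character sequence

The walkthrough is based mainly on darts-java. That version makes some improvements to the double-array algorithm; the first is the following initialization.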

    base[0] = 1
    check[0] = 0

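The second improvement is to set each character's code = ascii + 1.

Combined with the state transition formulas of the two arrays, we have the following conditions: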

base[0] = 1
check[0] = 0
p = base[b] + c
check[p] = base[b]

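The overall process of building the double array from base and check is as follows:

  1. Create the root node root and set $base[root] = 1$.
  2. Find the children $root.children_i$ $(i = 1...n)$ of root, and set $check[root.children_i] = base[root] = 1$.
  3. For each element in root.children:
     1) Find its children $element.children_i$ $(i = 1...n)$. Note that if a character is at the end of a character sequence, its children include an empty node whose code is set to 0. Then find a value begin such that every $check[begin + element.children_i.code] = 0$.
     2) Set $base[element] = begin$ and $check[begin + element.children_i.code] = begin$.
     3) Apply step 3 recursively to each $element.children_i$. If some element reached this way has no children, i.e. it is a leaf node, set $base[element]$ to a negative value (usually the negated index of the word in the dictionary).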
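The steps above can be sketched in Java as follows. This is a simplified reading of how darts-java places siblings, as I understand it, and not its actual code: the class name DatBuilder and its helpers are hypothetical, the arrays have a fixed size chosen for the small example, the keys are assumed to be sorted and non-empty, and resizing and error handling are left out. Leaves store the negated 1-based index of the word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified builder, not the darts-java implementation.
class DatBuilder {
    static final int SIZE = 512;              // fixed capacity, enough for the example
    final int[] base = new int[SIZE];
    final int[] check = new int[SIZE];
    private final boolean[] used = new boolean[SIZE];   // begin values already taken
    private final String[] keys;              // must be sorted
    private int nextCheckPos = 0;             // where the next slot search starts

    // A trie node seen as the range [left, right) of sorted keys below it.
    private static class Node {
        int code, depth, left, right;
        Node(int code, int depth, int left, int right) {
            this.code = code; this.depth = depth; this.left = left; this.right = right;
        }
    }

    DatBuilder(String[] sortedKeys) {
        keys = sortedKeys;
        base[0] = 1;                          // step 1: base[root] = 1
        insert(fetch(new Node(0, 0, 0, keys.length)));
    }

    // Collects the children of parent; an end of word becomes a child with code 0.
    private List<Node> fetch(Node parent) {
        List<Node> children = new ArrayList<>();
        int prev = -1;
        for (int i = parent.left; i < parent.right; i++) {
            String key = keys[i];
            if (key.length() < parent.depth) continue;
            int cur = key.length() == parent.depth ? 0 : key.charAt(parent.depth) + 1;
            if (cur != prev || children.isEmpty()) {
                if (!children.isEmpty()) children.get(children.size() - 1).right = i;
                children.add(new Node(cur, parent.depth + 1, i, parent.right));
            }
            prev = cur;
        }
        return children;
    }

    // Finds a begin value for a group of siblings, claims their slots,
    // recurses depth-first, and returns begin (the parent's base value).
    private int insert(List<Node> siblings) {
        int first = siblings.get(0).code;
        int pos = Math.max(first + 1, nextCheckPos) - 1;
        boolean seenFree = false;
        int begin = 0;
        search:
        while (true) {
            pos++;
            if (check[pos] != 0) continue;                 // slot for first child is taken
            if (!seenFree) { nextCheckPos = pos; seenFree = true; }
            begin = pos - first;
            if (used[begin]) continue;                     // begin already used by another node
            for (Node s : siblings) {
                if (check[begin + s.code] != 0) continue search;
            }
            break;
        }
        used[begin] = true;
        for (Node s : siblings) check[begin + s.code] = begin;
        for (Node s : siblings) {
            List<Node> children = fetch(s);
            // Leaf: negative value encoding the word's index; otherwise recurse.
            base[begin + s.code] = children.isEmpty() ? -s.left - 1 : insert(children);
        }
        return begin;
    }
}
```

Built from { AC, ACE, ACFF, AD, CD, CF, ZQ }, this sketch yields base[67] = 2, base[69] = 8 and base[92] = 4 for the root's children A, C and Z, the same values the walkthrough in this article arrives at.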

1. Taking the example above, { AC, ACE, ACFF, AD, CD, CF, ZQ }, we start with

base[0] = 1
check[0] = 0

Note: ASCII table

    65     A
    66     B
    ...

In addition, combining the darts change code = ascii + 1 with i = base[0] + code, where base[0] = 1, gives the following values for each character.

| | root | A | C | D | E | F | Q | Z |
|--|--|--|--|--|--|--|--|--|
| i | 0 | 67 | 69 | | | | | 92 |
| code | 0 | 66 | 68 | 69 | 70 | 71 | 82 | 91 |
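The arithmetic behind these values can be checked with a few lines of Java. The class name RootStates and its two methods are hypothetical; the only facts used are code = ascii + 1 and i = base[0] + code with base[0] = 1.

```java
// Hypothetical demo class, not part of the original article.
class RootStates {
    static final int BASE_ROOT = 1;   // base[0] = 1

    // Character code as described above: ASCII value plus one.
    static int code(char ch) {
        return ch + 1;
    }

    // State index of a child of the root: i = base[0] + code.
    static int rootChild(char ch) {
        return BASE_ROOT + code(ch);
    }
}
```

For the first characters of the dictionary, rootChild gives 67 for A, 69 for C and 92 for Z.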

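2. According to the second step of the build process, every child at depth 1 from root has $check[root.children_i] = base[root] = 1$. In the pattern strings, the three children of root, 'A', 'C' and 'Z', all have a check value of 1. Suppose root reaches the three states $p_1, p_2, p_3$ via A, C and Z respectively; this gives the following matrix.

| | root | A | C | Z |
|--|--|--|--|--|
| i | 0 | 67 | 69 | 92 |
| base | 1 | | | |
| check | 0 | 1 | 1 | 1 |
| state | p0 | p1 | p2 | p3 |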

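3. According to the third step of the build process, state p1 is triggered by 'A', so the base value of 'A' must satisfy the following rule. For every character we need to determine a base value such that all words starting with that character fit into the double array. The children of 'A' are {C, D}, so we need a begin value such that check[begin + 'C'.code] = check[begin + 'D'.code] = 0, i.e. check[begin + 68] = check[begin + 69] = 0. In other words, we need a begin that points at space that has not been used yet.

  a. When begin = 0, check[0 + 68] and check[0 + 69] must both be 0, but check[0 + 69] is already taken by the character 'C', so the condition does not hold.
  b. When begin = 1, check[1 + 68] and check[1 + 69] must both be 0, but check[1 + 68] is already taken by the character 'C', so the condition does not hold.
  c. When begin = 2, check[2 + 68] and check[2 + 69] are both 0, so check[begin + 68] = check[begin + 69] = 0 holds. Therefore base[p1] = begin = 2, with state p1 = 67.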
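The trial-and-error search for begin can be sketched as a small function. It scans begin upward from 0 exactly as the walkthrough does, which is a simplification of my own and not necessarily how darts-java orders its search; the class name BeginSearch and the -1 return value for "no room" are assumptions.

```java
// Hypothetical demo class, not part of the original article.
class BeginSearch {
    // Returns the smallest begin >= 0 such that check[begin + code] == 0
    // for every child code, or -1 if the array has no room.
    static int findBegin(int[] check, int[] codes) {
        int maxCode = 0;
        for (int c : codes) {
            maxCode = Math.max(maxCode, c);
        }
        for (int begin = 0; begin + maxCode < check.length; begin++) {
            boolean free = true;
            for (int c : codes) {
                if (check[begin + c] != 0) {   // slot already owned by another state
                    free = false;
                    break;
                }
            }
            if (free) {
                return begin;
            }
        }
        return -1;
    }
}
```

With check[67] = check[69] = check[92] = 1 (the root's children) and child codes {68, 69} for 'A', the function returns 2, matching case c above.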

p4 = base[p1] + 'C'.code = 2 + 68 = 70, p5 = base[p1] + 'D'.code = 2 + 69 = 71, and check[p4] = check[p5] = base[p1] = 2, which gives the following matrix.

| | root | A | C | Z | C | D |
|--|--|--|--|--|--|--|
| i | 0 | 67 | 69 | 92 | 70 | 71 |
| base | 1 | 2 | | | | |
| check | 0 | 1 | 1 | 1 | 2 | 2 |
| state | p0 | p1 | p2 | p3 | p4 | p5 |

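4. Continuing from the previous step: the children of C are {D, F}, so we need a begin value such that check[begin + 'D'.code] = check[begin + 'F'.code] = 0, i.e. check[begin + 69] = check[begin + 71] = 0.

  a. When begin = 0, check[0 + 69] and check[0 + 71] are taken by the characters C and D.
  b. When begin = 1, check[1 + 69] is taken by the character C.
  c. When begin = 2, check[2 + 69] is taken by the character D.

The values 3 to 7 fail in the same way, because slots 72 to 76 are already taken by the states of A's subtree (see the final table), which is built before C is processed. So for the character 'C', base[p2] = begin = 8, and the next states, for its children {D, F}, are p6 and p7:

  p6 = base[p2] + 'D'.code = 8 + 69 = 77;
  p7 = base[p2] + 'F'.code = 8 + 71 = 79;
  check[p6] = check[p7] = base[p2] = begin = 8;

The matrix is as follows:

| | root | A | C | Z | C | D | D | F |
|--|--|--|--|--|--|--|--|--|
| i | 0 | 67 | 69 | 92 | 70 | 71 | 77 | 79 |
| base | 1 | 2 | 8 | | | | | |
| check | 0 | 1 | 1 | 1 | 2 | 2 | 8 | 8 |
| state | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 |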

The final matrix is:

| | root | A | C | Z | C | D | D | F | Q | E | F | F |
|--|--|--|--|--|--|--|--|--|--|--|--|--|
| i | 0 | 67 | 69 | 92 | 70 | 71 | 77 | 79 | 86 | 142 | 143 | 74 |
| base | 1 | 2 | 8 | 4 | 72 | 76 | 78 | 80 | 83 | 73 | 3 | 75 |
| check | 0 | 1 | 1 | 1 | 2 | 2 | 8 | 8 | 4 | 72 | 72 | 3 |
| state | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | p10 | p11 |

After the end-of-word states for the pattern strings are added, the table becomes:

       root A    C    Z    C    D    D    F    Q    E    F    F    AC   AD   CD   CF   ZQ   ACE  ACFF
i      0    67   69   92   70   71   77   79   86   142  143  74   72   76   78   80   83   73   75
base   1    2    8    4    72   76   78   80   83   73   3    75   -1   -4   -5   -6   -7   -2   -3
check  0    1    1    1    2    2    8    8    4    72   72   3    72   76   78   80   83   73   75
state  p0   p1   p2   p3   p4   p5   p6   p7   p8   p9   p10  p11  p12  p13  p14  p15  p16  p17  p18
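To see the finished arrays in use, the sketch below loads the i, base and check rows of the table above and runs an exact-match lookup over them. The lookup follows the transition formula p = base[b] + code with code = ascii + 1, then takes the code-0 transition to reach the end-of-word state. The class name TableLookup and the choice to return the 0-based dictionary index (or -1) are assumptions made for this example.

```java
// Hypothetical demo class, not part of the original article.
class TableLookup {
    static final int[] base = new int[144];
    static final int[] check = new int[144];

    static {
        // The i, base and check rows of the final table.
        int[] idx = {0, 67, 69, 92, 70, 71, 77, 79, 86, 142, 143, 74, 72, 76, 78, 80, 83, 73, 75};
        int[] b = {1, 2, 8, 4, 72, 76, 78, 80, 83, 73, 3, 75, -1, -4, -5, -6, -7, -2, -3};
        int[] c = {0, 1, 1, 1, 2, 2, 8, 8, 4, 72, 72, 3, 72, 76, 78, 80, 83, 73, 75};
        for (int k = 0; k < idx.length; k++) {
            base[idx[k]] = b[k];
            check[idx[k]] = c[k];
        }
    }

    // Returns the 0-based index of word in the sorted dictionary, or -1 if absent.
    static int exactMatch(String word) {
        int b = base[0];
        for (char ch : word.toCharArray()) {
            int p = b + ch + 1;                   // p = base[b] + code, code = ascii + 1
            if (p >= check.length || check[p] != b) {
                return -1;                        // slot missing or owned by another state
            }
            b = base[p];
        }
        int p = b;                                // end-of-word transition uses code 0
        if (p < 0 || p >= check.length || check[p] != b) {
            return -1;
        }
        int n = base[p];
        return n < 0 ? -n - 1 : -1;               // negative base encodes the word index
    }
}
```

For instance, ACFF walks through states 67, 70, 143 and 74, then reaches state 75 whose base is -3, so the lookup returns index 2; a prefix such as A or ACF ends on a state with no end-of-word slot and returns -1.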

Drawn as a DFA, with each state as a node and each character as a transition condition (different characters trigger different states), the result is the tree in the figure (not included here). In it, the red part is exactly the matrix from the fifth step, and the green part is the output table derived from the pattern set.

References

  • https://blog.csdn.net/u013300579/article/details/78869742
  • https://blog.csdn.net/zhoubl668/article/details/6957830
  • https://github.com/komiya-atsushi/darts-java
  • https://linux.thai.net/~thep/datrie/datrie.html
  • https://www.cnblogs.com/ooon/p/4883159.html
  • https://blog.csdn.net/xlxxcc/article/details/67631988

Origin www.cnblogs.com/zhangxinying/p/12057737.html