Trie Tree (repost)

Original: https://www.cnblogs.com/huangxincheng/archive/2012/11/25/2788268.html

     It has been quite a while since I last wrote in this series. Today's topic is the trie, which goes by many names, such as dictionary tree and prefix tree.

One: Concepts

     Suppose we have the keywords and, as, at, cn, and com. How do we build a trie from them?

Looking at the resulting trie, we can pick out a few interesting properties.

      First: the root node holds no character; every node other than the root holds exactly one character.

      Second: concatenating the characters on the path from the root to a node gives the string that the node represents.

      Third: a prefix shared by several words is stored only once, as a shared chain of nodes.
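
To make the three properties concrete, here is a minimal sketch that inserts the five keywords and counts how many nodes are actually created. The names `Node` and `ConceptDemo` are hypothetical and exist only for this illustration; the sketch assumes lowercase a-z input.

```csharp
using System;

// Hypothetical minimal node: one child slot per lowercase letter 'a'..'z'.
class Node
{
    public Node[] Children = new Node[26];
}

static class ConceptDemo
{
    // Inserts a word and returns how many new nodes had to be created.
    // Assumes the word contains only 'a'..'z'.
    public static int Insert(Node root, string word)
    {
        int created = 0;
        Node current = root;
        foreach (char c in word)
        {
            int k = c - 'a';
            if (current.Children[k] == null)
            {
                current.Children[k] = new Node();
                created++;
            }
            current = current.Children[k];
        }
        return created;
    }

    static void Main()
    {
        var root = new Node();
        string[] words = { "and", "as", "at", "cn", "com" };
        int totalChars = 0, totalNodes = 0;
        foreach (var w in words)
        {
            totalChars += w.Length;
            totalNodes += Insert(root, w);
        }
        // Prints "characters: 12, nodes: 9": the shared prefixes "a" and "c" are stored once.
        Console.WriteLine("characters: {0}, nodes: {1}", totalChars, totalNodes);
    }
}
```

The twelve characters collapse into nine nodes because and, as, and at share the node for "a", while cn and com share the node for "c".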

 

Two: Use cases

     Having learned what a trie is, we naturally want to know what it is used for.

     First: word frequency statistics.

            Some may say word frequency counting is simple: a hash table or a heap gets the job done. But what if memory is limited? Can we still do it that way? Here a trie can save space, because a common prefix is stored only once, in shared nodes.

     Second: prefix matching.

            Take the trie above as an example. If I want all strings that begin with "a", the answer is clearly and, as, at. How would you do this without a trie? The obvious approach has time complexity O(N^2), whereas a trie gets there in O(h), where h is the length of the search word. The lookup is practically instantaneous.

            For example, suppose the string "and" has ID 1 and we insert it into the trie. Borrowing an idea from dynamic programming, we record ID 1 in every node along its path, so that looking up the IDs of strings with the prefix "a", "an", or "and" later becomes trivial.
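
Both use cases can be sketched in a few lines. The code below is an illustration under stated assumptions, not the article's implementation: `SketchNode` and `UseCaseSketch` are hypothetical names, input is assumed to be lowercase a-z, and the IDs 2 to 5 are made up for the demo (only "and" with ID 1 comes from the text).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for this sketch.
class SketchNode
{
    public SketchNode[] Children = new SketchNode[26];
    public int Freq;                               // how many inserted words end at this node
    public HashSet<int> Ids = new HashSet<int>();  // ids of words whose path passes through this node
}

static class UseCaseSketch
{
    // Records the id in every node along the word's path and bumps the end node's frequency.
    public static void Insert(SketchNode root, string word, int id)
    {
        if (word.Length == 0) return;
        SketchNode current = root;
        foreach (char c in word)
        {
            int k = c - 'a';
            if (current.Children[k] == null) current.Children[k] = new SketchNode();
            current = current.Children[k];
            current.Ids.Add(id);
        }
        current.Freq++;
    }

    // Follows the characters of 'text' from the root; returns null if the path does not exist.
    static SketchNode Walk(SketchNode root, string text)
    {
        SketchNode current = root;
        foreach (char c in text)
        {
            current = current.Children[c - 'a'];
            if (current == null) return null;
        }
        return current;
    }

    // Word frequency: h steps, where h is the word length.
    public static int Count(SketchNode root, string word)
    {
        SketchNode node = Walk(root, word);
        return node == null ? 0 : node.Freq;
    }

    // Prefix match: h steps, then the answer is already sitting in the node.
    public static HashSet<int> IdsWithPrefix(SketchNode root, string prefix)
    {
        SketchNode node = Walk(root, prefix);
        return node == null ? new HashSet<int>() : new HashSet<int>(node.Ids);
    }

    static void Main()
    {
        var root = new SketchNode();
        Insert(root, "and", 1);
        Insert(root, "as", 2);
        Insert(root, "at", 3);
        Insert(root, "cn", 4);
        Insert(root, "and", 5);

        Console.WriteLine(Count(root, "and"));               // 2
        Console.WriteLine(IdsWithPrefix(root, "a").Count);   // 4
        Console.WriteLine(IdsWithPrefix(root, "an").Count);  // 2
    }
}
```

Storing the ID set in every node trades memory for lookup speed: the prefix query never has to visit the subtree below the prefix node.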

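Three: Implementation

     By now I think everyone has a rough grasp of the trie, so let's see how to implement it.

1: Define the trie node

     For convenience I use only lowercase English letters. There are 26 letters, so the trie we build is a 26-ary tree in which every node has 26 child slots.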

        #region Trie node
        /// <summary>
        /// Trie node
        /// </summary>
        public class TrieNode
        {
            /// <summary>
            /// 26 child slots, one per letter (a 26-ary tree)
            /// </summary>
            public TrieNode[] childNodes;

            /// <summary>
            /// Word frequency: how many times a word ending at this node was inserted
            /// </summary>
            public int freq;

            /// <summary>
            /// The character stored in this node
            /// </summary>
            public char nodeChar;

            /// <summary>
            /// IDs of the inserted records whose path passes through this node
            /// </summary>
            public HashSet<int> hashSet = new HashSet<int>();

            /// <summary>
            /// Initialization
            /// </summary>
            public TrieNode()
            {
                childNodes = new TrieNode[26];
                freq = 0;
            }
        }
        #endregion

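2: Insert operation

     Since this is a 26-ary tree, we need to decide which branch of the current node a child goes into, that is, which position in childNodes. Here we compute the position with

      int k = word[0] - 'a'.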

        /// <summary>
        /// Insert operation
        /// </summary>
        /// <param name="root">Node to insert under</param>
        /// <param name="word">Remaining characters to insert (assumed to be lowercase a-z)</param>
        /// <param name="id">ID of the record being inserted</param>
        public void AddTrieNode(ref TrieNode root, string word, int id)
        {
            if (word.Length == 0)
                return;

            // Compute the slot index: which of the 26 branches this character belongs to
            int k = word[0] - 'a';

            // If the branch is empty, initialize it
            if (root.childNodes[k] == null)
            {
                root.childNodes[k] = new TrieNode();

                // Record the character
                root.childNodes[k].nodeChar = word[0];
            }

            // Record that this id passes through the node
            root.childNodes[k].hashSet.Add(id);

            var nextWord = word.Substring(1);

            // This was the last character, so count one more occurrence of the word
            if (nextWord.Length == 0)
                root.childNodes[k].freq++;

            AddTrieNode(ref root.childNodes[k], nextWord, id);
        }
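
The slot computation `word[0] - 'a'` only yields a valid index for lowercase a-z; any other character (a hyphen, a digit, an uppercase letter) produces an index outside 0..25 and would throw when used on `childNodes`. A small guard like the following could be placed in front of the insert. The helper name `TryGetIndex` is hypothetical and not part of the article's code.

```csharp
using System;

static class IndexDemo
{
    // Maps 'a'..'z' to 0..25; returns false for any other character.
    public static bool TryGetIndex(char c, out int k)
    {
        k = c - 'a';
        return k >= 0 && k < 26;
    }

    static void Main()
    {
        foreach (char c in "az-G")
        {
            int k;
            if (TryGetIndex(c, out k))
                Console.WriteLine("'{0}' -> slot {1}", c, k);
            else
                Console.WriteLine("'{0}' is outside a-z and cannot be stored", c);
        }
    }
}
```

Whether to skip such words, strip the offending characters, or widen the child array is a design choice the article leaves open.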

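3: Delete operation

     When deleting, we must not only remove the string's ID from the nodes along its path but also decrement the word frequency by one.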

        /// <summary>
        /// Delete operation
        /// </summary>
        /// <param name="root">Node to delete under</param>
        /// <param name="word">Remaining characters of the word to delete</param>
        /// <param name="id">ID of the record being deleted</param>
        public void DeleteTrieNode(ref TrieNode root, string word, int id)
        {
            if (word.Length == 0)
                return;

            // Compute the slot index: which of the 26 branches this character belongs to
            int k = word[0] - 'a';

            // If the branch is empty, the word to delete is not in the trie
            if (root.childNodes[k] == null)
                return;

            var nextWord = word.Substring(1);

            // This was the last character, so decrement the word frequency
            // (the check must be on nextWord; word itself is never empty at this point)
            if (nextWord.Length == 0 && root.childNodes[k].freq > 0)
                root.childNodes[k].freq--;

            // Remove the id from the node on the path
            root.childNodes[k].hashSet.Remove(id);

            DeleteTrieNode(ref root.childNodes[k], nextWord, id);
        }
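
The effect of a delete can be checked with a small round trip: insert two words, delete one, and confirm that both the ID sets and the frequency changed. This is a sketch with hypothetical names (`DelNode`, `DeleteDemo`). Unlike the recursive version, it first verifies that the word is stored before touching any node, and it assumes each ID belongs to exactly one word and that input is lowercase a-z.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for this sketch.
class DelNode
{
    public DelNode[] Children = new DelNode[26];
    public int Freq;
    public HashSet<int> Ids = new HashSet<int>();
}

static class DeleteDemo
{
    public static void Add(DelNode root, string word, int id)
    {
        if (word.Length == 0) return;
        DelNode current = root;
        foreach (char c in word)
        {
            int k = c - 'a';
            if (current.Children[k] == null) current.Children[k] = new DelNode();
            current = current.Children[k];
            current.Ids.Add(id);
        }
        current.Freq++;
    }

    // Returns the node reached by following 'text', or null if the path is missing.
    public static DelNode Find(DelNode root, string text)
    {
        DelNode current = root;
        foreach (char c in text)
        {
            current = current.Children[c - 'a'];
            if (current == null) return null;
        }
        return current;
    }

    // Removes one occurrence of (word, id). Returns false, changing nothing,
    // if the word is not stored under that id.
    public static bool Delete(DelNode root, string word, int id)
    {
        DelNode end = Find(root, word);
        if (word.Length == 0 || end == null || end.Freq == 0 || !end.Ids.Contains(id))
            return false;

        DelNode current = root;
        foreach (char c in word)
        {
            current = current.Children[c - 'a'];
            current.Ids.Remove(id);   // the id no longer passes through this node
        }
        end.Freq--;                   // one occurrence fewer ends here
        return true;
    }

    static void Main()
    {
        var root = new DelNode();
        Add(root, "go", 1);
        Add(root, "good", 2);

        Console.WriteLine(Delete(root, "go", 1));      // True
        Console.WriteLine(Find(root, "go").Freq);      // 0
        Console.WriteLine(Find(root, "g").Ids.Count);  // 1 (only id 2 remains)
        Console.WriteLine(Delete(root, "gone", 3));    // False
    }
}
```

Nodes are never physically removed here, matching the article's approach; pruning empty branches would be a separate optimization.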

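4: Test

   Here I downloaded a vocabulary list from the web, 2279 entries in total. What we want to do now is retrieve the words that begin with "go" and count how often "go" itself occurs.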

        public static void Main()
        {
            Trie trie = new Trie();

            var file = File.ReadAllLines(Path.Combine(Environment.CurrentDirectory, "1.txt"));

            foreach (var item in file)
            {
                var sp = item.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

                // Skip blank or malformed lines: the first token is the ID, the last is the word
                if (sp.Length < 2)
                    continue;

                trie.AddTrieNode(sp.LastOrDefault().ToLower(), Convert.ToInt32(sp[0]));
            }

            Stopwatch watch = Stopwatch.StartNew();

            // Retrieve the strings that begin with "go"
            var hashSet = trie.SearchTrie("go");

            foreach (var item in hashSet)
            {
                Console.WriteLine("ID of matching string: {0}", item);
            }

            watch.Stop();

            Console.WriteLine("Elapsed time: {0} ms", watch.ElapsedMilliseconds);

            Console.WriteLine("\n\nOccurrences of \"go\": {0}\n\n", trie.WordCount("go"));
        }

Now we can take the IDs and look them up in the txt file. Fun, isn't it?

Test file: 1.txt

Complete code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Threading;
using System.IO;

namespace ConsoleApplication2
{
    public class Program
    {
        public static void Main()
        {
            Trie trie = new Trie();

            var file = File.ReadAllLines(Path.Combine(Environment.CurrentDirectory, "1.txt"));

            foreach (var item in file)
            {
                var sp = item.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

                // Skip blank or malformed lines: the first token is the ID, the last is the word
                if (sp.Length < 2)
                    continue;

                trie.AddTrieNode(sp.LastOrDefault().ToLower(), Convert.ToInt32(sp[0]));
            }

            Stopwatch watch = Stopwatch.StartNew();

            // Retrieve the strings that begin with "go"
            var hashSet = trie.SearchTrie("go");

            foreach (var item in hashSet)
            {
                Console.WriteLine("ID of matching string: {0}", item);
            }

            watch.Stop();

            Console.WriteLine("Elapsed time: {0} ms", watch.ElapsedMilliseconds);

            Console.WriteLine("\n\nOccurrences of \"go\": {0}\n\n", trie.WordCount("go"));
        }
    }

    public class Trie
    {
        public TrieNode trieNode = new TrieNode();

        #region Trie node
        /// <summary>
        /// Trie node
        /// </summary>
        public class TrieNode
        {
            /// <summary>
            /// 26 child slots, one per letter (a 26-ary tree)
            /// </summary>
            public TrieNode[] childNodes;

            /// <summary>
            /// Word frequency: how many times a word ending at this node was inserted
            /// </summary>
            public int freq;

            /// <summary>
            /// The character stored in this node
            /// </summary>
            public char nodeChar;

            /// <summary>
            /// IDs of the inserted records whose path passes through this node
            /// </summary>
            public HashSet<int> hashSet = new HashSet<int>();

            /// <summary>
            /// Initialization
            /// </summary>
            public TrieNode()
            {
                childNodes = new TrieNode[26];
                freq = 0;
            }
        }
        #endregion

        #region Insert
        /// <summary>
        /// Insert operation
        /// </summary>
        /// <param name="word">Word to insert (assumed to be lowercase a-z)</param>
        /// <param name="id">ID of the record being inserted</param>
        public void AddTrieNode(string word, int id)
        {
            AddTrieNode(ref trieNode, word, id);
        }

        /// <summary>
        /// Insert operation
        /// </summary>
        /// <param name="root">Node to insert under</param>
        /// <param name="word">Remaining characters to insert</param>
        /// <param name="id">ID of the record being inserted</param>
        public void AddTrieNode(ref TrieNode root, string word, int id)
        {
            if (word.Length == 0)
                return;

            // Compute the slot index: which of the 26 branches this character belongs to
            int k = word[0] - 'a';

            // If the branch is empty, initialize it
            if (root.childNodes[k] == null)
            {
                root.childNodes[k] = new TrieNode();

                // Record the character
                root.childNodes[k].nodeChar = word[0];
            }

            // Record that this id passes through the node
            root.childNodes[k].hashSet.Add(id);

            var nextWord = word.Substring(1);

            // This was the last character, so count one more occurrence of the word
            if (nextWord.Length == 0)
                root.childNodes[k].freq++;

            AddTrieNode(ref root.childNodes[k], nextWord, id);
        }
        #endregion

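        #region Search
        /// <summary>
        /// Looks up a prefix and returns the set of IDs stored under it
        /// </summary>
        /// <param name="s">Prefix to search for</param>
        /// <returns>IDs of all words that start with the prefix; empty if there are none</returns>
        public HashSet<int> SearchTrie(string s)
        {
            HashSet<int> hashSet = new HashSet<int>();

            return SearchTrie(ref trieNode, s, ref hashSet);
        }

        /// <summary>
        /// Looks up a prefix and returns the set of IDs stored under it
        /// </summary>
        /// <param name="root">Node to search under</param>
        /// <param name="word">Remaining characters of the prefix</param>
        /// <param name="hashSet">Result set; replaced by the matching node's ID set when found</param>
        /// <returns>IDs of all words that start with the prefix; empty if there are none</returns>
        public HashSet<int> SearchTrie(ref TrieNode root, string word, ref HashSet<int> hashSet)
        {
            if (word.Length == 0)
                return hashSet;

            int k = word[0] - 'a';

            // No such branch: the prefix is not in the trie, so return the set unchanged
            if (root.childNodes[k] == null)
                return hashSet;

            var nextWord = word.Substring(1);

            if (nextWord.Length == 0)
            {
                // Dynamic-programming idea: the last node of the prefix already holds every id that passed through it
                hashSet = root.childNodes[k].hashSet;
            }

            SearchTrie(ref root.childNodes[k], nextWord, ref hashSet);

            return hashSet;
        }
        #endregion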

        #region Count occurrences of a word

        /// <summary>
        /// Counts how many times the given word was inserted
        /// </summary>
        /// <param name="word">Word to count</param>
        /// <returns>Number of occurrences; 0 if the word is not in the trie</returns>
        public int WordCount(string word)
        {
            int count = 0;

            WordCount(ref trieNode, word, ref count);

            return count;
        }

        /// <summary>
        /// Counts how many times the given word was inserted
        /// </summary>
        /// <param name="root">Node to search under</param>
        /// <param name="word">Remaining characters of the word</param>
        /// <param name="count">Receives the frequency stored at the word's last node</param>
        public void WordCount(ref TrieNode root, string word, ref int count)
        {
            if (word.Length == 0)
                return;

            int k = word[0] - 'a';

            // No such branch: the word is not in the trie, so count stays unchanged
            if (root.childNodes[k] == null)
                return;

            var nextWord = word.Substring(1);

            if (nextWord.Length == 0)
            {
                // The last node of the word stores how many times the word was inserted
                count = root.childNodes[k].freq;
            }

            WordCount(ref root.childNodes[k], nextWord, ref count);
        }

        #endregion

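        #region Update
        /// <summary>
        /// Update operation
        /// </summary>
        /// <param name="newWord">Word to store in place of the old one</param>
        /// <param name="oldWord">Word currently stored under the id</param>
        /// <param name="id">ID of the record being updated</param>
        public void UpdateTrieNode(string newWord, string oldWord, int id)
        {
            UpdateTrieNode(ref trieNode, newWord, oldWord, id);
        }

        /// <summary>
        /// Update operation
        /// </summary>
        /// <param name="root">Root node (not used directly; both steps start from the trie's own root)</param>
        /// <param name="newWord">Word to store in place of the old one</param>
        /// <param name="oldWord">Word currently stored under the id</param>
        /// <param name="id">ID of the record being updated</param>
        public void UpdateTrieNode(ref TrieNode root, string newWord, string oldWord, int id)
        {
            // Delete first
            DeleteTrieNode(oldWord, id);

            // Then add
            AddTrieNode(newWord, id);
        }
        #endregion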

        #region Delete
        /// <summary>
        /// Delete operation
        /// </summary>
        /// <param name="word">Word to delete</param>
        /// <param name="id">ID of the record being deleted</param>
        public void DeleteTrieNode(string word, int id)
        {
            DeleteTrieNode(ref trieNode, word, id);
        }

        /// <summary>
        /// Delete operation
        /// </summary>
        /// <param name="root">Node to delete under</param>
        /// <param name="word">Remaining characters of the word to delete</param>
        /// <param name="id">ID of the record being deleted</param>
        public void DeleteTrieNode(ref TrieNode root, string word, int id)
        {
            if (word.Length == 0)
                return;

            // Compute the slot index: which of the 26 branches this character belongs to
            int k = word[0] - 'a';

            // If the branch is empty, the word to delete is not in the trie
            if (root.childNodes[k] == null)
                return;

            var nextWord = word.Substring(1);

            // This was the last character, so decrement the word frequency
            // (the check must be on nextWord; word itself is never empty at this point)
            if (nextWord.Length == 0 && root.childNodes[k].freq > 0)
                root.childNodes[k].freq--;

            // Remove the id from the node on the path
            root.childNodes[k].hashSet.Remove(id);

            DeleteTrieNode(ref root.childNodes[k], nextWord, id);
        }
        #endregion
    }
}

 


Origin www.cnblogs.com/ajianbeyourself/p/11260133.html