Analysis of the Dictionary Building Process in Google's Native Input Method LatinIME (Part 1)

Go into the cpp directory (pwd = .../cpp/). The command directory contains a file named pinyinime_dictbuilder.cpp, whose source includes the main function; this is the entry point of dictionary construction. Here is the main function:

 25 /**
 26  * Build binary dictionary model. Make sure that ___BUILD_MODEL___ is defined
 27  * in dictdef.h.
 28  */
 29 int main(int argc, char* argv[]) {
 30   DictTrie* dict_trie = new DictTrie();
 31   bool success;
 32   if (argc >= 3)
 33      success = dict_trie->build_dict(argv[1], argv[2]);
 34   else
 35      success = dict_trie->build_dict("../data/rawdict_utf16_65105_freq.txt",
 36                                      "../data/valid_utf16.txt");
 37 
 38   if (success) {
 39     printf("Build dictionary successfully.\n");
 40   } else {
 41     printf("Build dictionary unsuccessfully.\n");
 42     return -1;
 43   }
 44 
 45   success = dict_trie->save_dict("../../res/raw/dict_pinyin.dat");
 46 
 47   if (success) {
 48     printf("Save dictionary successfully.\n");
 49   } else {
 50     printf("Save dictionary unsuccessfully.\n");
 51     return -1;
 52   }
 53 
 54   return 0;
 55 }

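For reference, the builder can presumably be invoked with two arguments to override the default sources (binary name inferred from the source file name; the paths here are hypothetical):

./pinyinime_dictbuilder my_rawdict_utf16.txt my_valid_utf16.txt
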
According to the comment, this function builds the dictionary model (___BUILD_MODEL___ must be defined in dictdef.h). When run without arguments, line 35 executes and the two UTF-16 txt files under data serve as the build sources. If the build succeeds, line 45 saves the result; the output path and file name are easy to read off the code. The resulting dict_pinyin.dat is a binary file that is, internally, a series of data structures written out in sequence. The core logic of building the dictionary model is in the build_dict method, so let's follow it (./share/dicttrie.cpp):

103 #ifdef ___BUILD_MODEL___
104 bool DictTrie::build_dict(const char* fn_raw, const char* fn_validhzs) {
105   DictBuilder* dict_builder = new DictBuilder();
106 
107   free_resource(true);
108 
109   return dict_builder->build_dict(fn_raw, fn_validhzs, this);
110 }

Line 105 creates a dict_builder, the object that actually performs the construction. After creating it, free_resource is called to release the data structures used during dictionary construction; these are exactly the objects that get written out with fwrite into dict_pinyin.dat when the dictionary is saved.
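As a minimal sketch of that save step (an illustration, not the verbatim source; the real save_dict writes the DictTrie's own members in a fixed order), the .dat file is simply each table fwrite()n back to back:

#include <cstddef>
#include <cstdio>

// Hedged sketch: the "series of data structures" layout of dict_pinyin.dat.
// The two table parameters stand in for the real DictTrie members.
static bool save_tables(FILE *fp, const void *table_a, size_t size_a,
                        const void *table_b, size_t size_b) {
  if (fwrite(table_a, size_a, 1, fp) != 1)
    return false;
  if (fwrite(table_b, size_b, 1, fp) != 1)
    return false;
  return true;
}

After the resources are freed, the builder's build_dict method is called and the construction begins (./share/dictbuilder.cpp):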

bool DictBuilder::build_dict(const char *fn_raw,
                             const char *fn_validhzs,
                             DictTrie *dict_trie) {
  ...

  lemma_num_ = read_raw_dict(fn_raw, fn_validhzs, 240000);
  ...

  spl_buf = spl_table_->arrange(&spl_item_size, &spl_num);
  ...
  // Organize all the valid syllables into a trie (the construct method)
  if (!spl_trie.construct(spl_buf, spl_item_size, spl_num,
                          spl_table_->get_score_amplifier(),
                          spl_table_->get_average_score())) {
    free_resource();
    return false;
  }

  printf("spelling tree construct successfully.\n");

  // Fill the spl_idx_arr field of each lemma_arr_ element: the spl_id of each hanzi's pinyin
  // Convert the spelling string to idxs
  for (size_t i = 0; i < lemma_num_; i++) {
    for (size_t hz_pos = 0; hz_pos < (size_t)lemma_arr_[i].hz_str_len;
         hz_pos++) {
      ...
      int spl_idx_num =
        spl_parser_->splstr_to_idxs(lemma_arr_[i].pinyin_str[hz_pos],
                                    strlen(lemma_arr_[i].pinyin_str[hz_pos]),
                                    spl_idxs, spl_start_pos, 2, is_pre);
        ...
      if (spl_trie.is_half_id(spl_idxs[0])) {
        uint16 num = spl_trie.half_to_full(spl_idxs[0], spl_idxs);
        assert(0 != num);
      }
      lemma_arr_[i].spl_idx_arr[hz_pos] = spl_idxs[0];
    }
  }

  ...
  // Sort the lemma items according to the hanzi, and give each unique item
  // an id (updates the idx_by_hz field)
  sort_lemmas_by_hz();
  // Build the single-character table into scis_ and update the
  // hanzi_scis_ids field of lemma_arr_ accordingly
  scis_num_ = build_scis();

  // Construct the dict list
  dict_trie->dict_list_ = new DictList();
  bool dl_success = dict_trie->dict_list_->init_list(scis_, scis_num_,
                                                     lemma_arr_, lemma_num_);
  assert(dl_success);

  // Construct the NGram information
  NGram& ngram = NGram::get_instance();
  ngram.build_unigram(lemma_arr_, lemma_num_,
                      lemma_arr_[lemma_num_ - 1].idx_by_hz + 1);

  ...

  lma_nds_used_num_le0_ = 1;  // The root node
  bool dt_success = construct_subset(static_cast<void*>(lma_nodes_le0_),
                                     lemma_arr_, 0, lemma_num_, 0);
  ...

  if (kPrintDebug0) {
    printf("Building dict succeds\n");
  }
  return dt_success;
}

The main logic of dictionary construction lives in this method; the listing keeps only the key calls, so the build process can be read off directly. First, the contents of the two files are loaded into their data structures: the raw file goes into the lemma_arr_ array. Printing lemma_num_, however, shows that lemma_arr_ actually holds 65101 elements, while rawdict_utf16_65105_freq.txt has 65105 lines. Why four fewer? Putting a breakpoint on the continue in the for loop reveals the cause in the read_raw_dict function:

    // The whole line must have been parsed fully, otherwise discard this one.
    token = utf16_strtok(to_tokenize, &token_size, &to_tokenize);
    if (spelling_not_support || NULL != token) {
      i--;
      continue;
    }

spelling_not_support is true here; printing the current index i shows that the file really does contain unsupported pinyin, e.g., line 6557:

哼 2072.17903804 0 hng

as well as lines 17035, 17036, and 17037:

噷 6.18262663209 1 hm
唔 1126.6237397 0 ng
嗯 31982.2903695 0 ng

The syllables 'hng', 'hm', and 'ng' are not valid entries in the spelling table, so these four lines are discarded; that accounts for exactly the four missing elements.

The content of the valid-hanzi file is stored in the valid_hzs array; printing it in gdb:

(gdb) ptype valid_hzs
type = unsigned short *
(gdb) p valid_hzs
$1 = (ime_pinyin::char16 *) 0x627190
(gdb) p *valid_hzs@10
$2 = {12295, 19968, 19969, 19971, 19975, 19976, 19977, 19978, 19979, 19980}
(gdb) 

valid_hzs is a pointer into an array; only the first ten elements are shown here. The first value, 12295, is the Unicode code point of the character "〇", which is easy to verify. (lemma_arr_ was already dumped in the companion article on the relevant data structures, "Google原生输入法LatinIME词库构建流程分析--相关数据结构分析".) These two arrays are the foundation of everything that follows. After read_raw_dict() finishes, spl_table_->arrange is called; it returns spl_buf, a sorted buffer holding all valid Chinese syllables, 413 of them. Then spl_trie.construct builds a trie over those 413 syllables. Its arguments are, in order: the syllable buffer spl_buf, the size of each item, the number of items, an amplifier used to compute each syllable's score, and an average score. The concrete values are:

(gdb) p spl_num
$7 = 413
(gdb) p spl_item_size
$8 = 8
(gdb) p spl_table_->get_score_amplifier()
$9 = -14.1073904
(gdb) p spl_table_->get_average_score()
$10 = 100 'd'
(gdb) 
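How are the amplifier and the average score used? The exact formula lives in the spelling-table code; here is a minimal sketch, assuming the score byte is the logarithm of a syllable's share of the total frequency mass scaled by the amplifier (the negative amplifier turns negative logs into positive bytes):

#include <cmath>
#include <cstdio>

// Hedged sketch, not the library source: derive a per-syllable score byte
// from that syllable's share of the total frequency mass.
static unsigned char syllable_score(double freq_share, float amplifier) {
  double s = std::log(freq_share) * amplifier;  // negative log * negative amp
  if (s < 0) s = 0;
  if (s > 255) s = 255;
  return static_cast<unsigned char>(s);
}

int main() {
  // A syllable carrying ~0.1% of the mass scores near the average of 100:
  std::printf("%d\n", syllable_score(0.001, -14.1073904f));  // prints 97
  return 0;
}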

Each node of this trie is described by the SpellingNode struct:

(gdb) ptype first_son
type = struct ime_pinyin::SpellingNode {
    ime_pinyin::SpellingNode *first_son;
    ime_pinyin::uint16 spelling_idx : 11;
    ime_pinyin::uint16 num_of_son : 5;
    char char_this_node;
    unsigned char score;
} *
(gdb) 
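Note how spelling_idx (11 bits) and num_of_son (5 bits) are packed into a single uint16. A minimal reconstruction of the struct, assuming a typical 64-bit build, also explains the 0x10 stride between node addresses in the dumps that follow:

#include <cstdint>

// Reconstruction of the ptype output above: an 8-byte pointer, a packed
// 2-byte bitfield (11 + 5 bits), two 1-byte fields, then 4 bytes of padding.
struct SpellingNode {
  SpellingNode *first_son;
  uint16_t spelling_idx : 11;  // enough for all 413 full ids plus half ids
  uint16_t num_of_son : 5;     // at most 31 children per node
  char char_this_node;
  unsigned char score;
};

static_assert(sizeof(SpellingNode) == 16, "assumes LP64 alignment");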

The trie built by spl_trie keeps explicit references to two levels: level 0 is the root node root_, and level 1 (level1_sons_) holds root_'s children. Dumping spl_trie in gdb gives:

spl_trie = @0x6160a0: {static kMaxYmNum = 64, static kValidSplCharNum = 26, static kHalfIdShengmuMask = 1, static kHalfIdYunmuMask = 2, static kHalfIdSzmMask = 4, 
  static kHalfId2Sc_ = "0ABCcDEFGHIJKLMNOPQRSsTUVWXYZz", static char_flags_ = 0x615140 <ime_pinyin::SpellingTrie::char_flags_> "\006\005\005\005\006\005\005\005", static instance_ = 0x6160a0, 
  spelling_buf_ = 0x616510 "A", spelling_size_ = 8, spelling_num_ = 413, score_amplifier_ = -14.1073904, average_score_ = 100 'd', spl_ym_ids_ = 0x628350 "", ym_buf_ = 0x628eb0 "A", ym_size_ = 6, ym_num_ = 33, 
  splstr_queried_ = 0x617200 "ZhUO", splstr16_queried_ = 0x617220, root_ = 0x617240, dumb_node_ = 0x617260, splitter_node_ = 0x617280, level1_sons_ = {0x6172a0, 0x6172b0, 0x6172c0, 0x6172d0, 0x6172e0, 0x6172f0, 
    0x617300, 0x617310, 0x0, 0x617320, 0x617330, 0x617340, 0x617350, 0x617360, 0x617370, 0x617380, 0x617390, 0x6173a0, 0x6173b0, 0x6173c0, 0x0, 0x0, 0x6173d0, 0x6173e0, 0x6173f0, 0x617400}, h2f_start_ = {0, 30, 
    35, 51, 67, 86, 109, 114, 124, 143, 0, 162, 176, 195, 221, 241, 266, 268, 285, 299, 313, 329, 348, 0, 0, 368, 377, 391, 406, 423}, h2f_num_ = {0, 5, 16, 35, 19, 23, 5, 10, 19, 19, 0, 14, 19, 26, 20, 25, 2, 
    17, 14, 14, 35, 19, 20, 0, 0, 9, 14, 15, 37, 20}, f2h_ = 0x627fb0, node_num_ = 496}
__PRETTY_FUNCTION__ = "bool ime_pinyin::DictBuilder::build_dict(const char*, const char*, ime_pinyin::DictTrie*)"

Here root_'s address is 0x617240; root_ itself is also a struct of type SpellingNode. Printing its first_son in gdb:

(gdb) p spl_trie->root_.first_son
$58 = (ime_pinyin::SpellingNode *) 0x6172a0
(gdb) 
level1_sons_ = {0x6172a0, 0x6172b0, 0x6172c0, 0x6172d0, 0x6172e0, 0x6172f0, 
    0x617300, 0x617310, 0x0, 0x617320, 0x617330, 0x617340, 0x617350, 0x617360, 0x617370, 0x617380, 0x617390, 0x6173a0, 0x6173b0, 0x6173c0, 0x0, 0x0, 0x6173d0, 0x6173e0, 0x6173f0, 0x617400}

root_'s first_son is 0x6172a0, which is exactly the address of the first element of level 1: the root points directly at the level-1 node array. Level 1 has 26 entries, and their char_this_node fields are the uppercase letters A through Z:

(gdb) p spl_trie->level1_sons_[0].char_this_node
$60 = 65 'A'
(gdb) p spl_trie->level1_sons_[1].char_this_node
$61 = 66 'B'
(gdb) p spl_trie->level1_sons_[3].char_this_node
$62 = 68 'D'
(gdb) p spl_trie->level1_sons_[2].char_this_node
$63 = 67 'C'
(gdb) p spl_trie->level1_sons_[4].char_this_node
$64 = 69 'E'
(gdb) p spl_trie->level1_sons_[25].char_this_node
$65 = 90 'Z'
(gdb) p spl_trie->level1_sons_[26].char_this_node
Cannot access memory at address 0x330023001e000a
(gdb) 

But not every letter can begin a syllable: the 9th element of level1_sons_ (index 8) corresponds to 'I', so its pointer is 0x0; the same holds for 'U' and 'V' at indices 20 and 21.
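Since each node's children are stored contiguously behind first_son, looking up a letter is a linear scan over at most num_of_son entries. A minimal sketch of such a lookup (an illustration of the layout, not the library's API; it reuses the SpellingNode definition above):

#include <cstddef>

// Children of a node are laid out contiguously, so a child lookup scans
// first_son[0 .. num_of_son).
static const SpellingNode* find_child(const SpellingNode *parent, char c) {
  const SpellingNode *son = parent->first_son;
  for (unsigned i = 0; i < parent->num_of_son; i++) {
    if (son[i].char_this_node == c)
      return son + i;
  }
  return NULL;  // this letter cannot extend the current prefix
}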

Next, how many children does the first element of level1_sons_ (the 'A' node) have?

(gdb) p spl_trie->level1_sons_[0].num_of_son
$67 = 3
(gdb) 

Three. Which three? The answer lies in the spelling buffer (spl_buf); verifying it is easy, just keep following first_son:

(gdb) p spl_trie->level1_sons_[0].first_son[0].char_this_node
$70 = 73 'I'
(gdb) p spl_trie->level1_sons_[0].first_son[1].char_this_node
$71 = 78 'N'
(gdb) p spl_trie->level1_sons_[0].first_son[2].char_this_node
$72 = 79 'O'

They are AI, AN, and AO (and AN has a further child, ANG). With that, the shape of the tree built in spl_trie is becoming clear. Now back to root_'s first_son pointer: it points at a contiguous array of SpellingNode structs, whose contents are:

{first_son = 0x617420, spelling_idx = 1, num_of_son = 3, char_this_node = 65 'A', score = 86 'V'},
  {first_son = 0x617480, spelling_idx = 2, num_of_son = 5, char_this_node = 66 'B', score = 57 '9'},
  {first_son = 0x617620, spelling_idx = 3, num_of_son = 6, char_this_node = 67 'C', score = 72 'H'},
  {first_son = 0x6179e0, spelling_idx = 5, num_of_son = 5, char_this_node = 68 'D', score = 46 '.'},
  {first_son = 0x617c50, spelling_idx = 6, num_of_son = 3, char_this_node = 69 'E', score = 79 'O'},
  {first_son = 0x617cb0, spelling_idx = 7, num_of_son = 5, char_this_node = 70 'F', score = 72 'H'},
  {first_son = 0x617e00, spelling_idx = 8, num_of_son = 4, char_this_node = 71 'G', score = 62 '>'},
  {first_son = 0x617ff0, spelling_idx = 9, num_of_son = 4, char_this_node = 72 'H', score = 64 '@'},
  {first_son = 0x6181e0, spelling_idx = 11, num_of_son = 2, char_this_node = 74 'J', score = 59 ';'},
  {first_son = 0x618380, spelling_idx = 12, num_of_son = 4, char_this_node = 75 'K', score = 70 'F'},
  {first_son = 0x618570, spelling_idx = 13, num_of_son = 6, char_this_node = 76 'L', score = 62 '>'},
  {first_son = 0x618810, spelling_idx = 14, num_of_son = 5, char_this_node = 77 'M', score = 68 'D'},
  {first_son = 0x6189e0, spelling_idx = 15, num_of_son = 6, char_this_node = 78 'N', score = 66 'B'},
  {first_son = 0x618c70, spelling_idx = 16, num_of_son = 1, char_this_node = 79 'O', score = 109 'm'},
  {first_son = 0x618c90, spelling_idx = 17, num_of_son = 5, char_this_node = 80 'P', score = 90 'Z'},
  {first_son = 0x618e50, spelling_idx = 18, num_of_son = 2, char_this_node = 81 'Q', score = 66 'B'},
  {first_son = 0x626ff0, spelling_idx = 19, num_of_son = 5, char_this_node = 82 'R', score = 65 'A'},
  {first_son = 0x6271a0, spelling_idx = 20, num_of_son = 6, char_this_node = 83 'S', score = 46 '.'},
  {first_son = 0x627540, spelling_idx = 22, num_of_son = 5, char_this_node = 84 'T', score = 70 'F'},
  {first_son = 0x6277a0, spelling_idx = 25, num_of_son = 4, char_this_node = 87 'W', score = 61 '='},
  {first_son = 0x627890, spelling_idx = 26, num_of_son = 2, char_this_node = 88 'X', score = 68 'D'},
  {first_son = 0x627a30, spelling_idx = 27, num_of_son = 5, char_this_node = 89 'Y', score = 51 '3'},
  {first_son = 0x627bd0, spelling_idx = 28, num_of_son = 6, char_this_node = 90 'Z', score = 61 '='}

The elements stored in the array behind root_'s first_son are exactly the nodes that the level1_sons_ pointers refer to: level1_sons_ is an array of pointers into that same node array (compare the addresses above). At this point the structure of the tree built by spl_trie is clear.

The diagram above (not reproduced here) sketches the tree using only char_this_node; in reality every node is a full SpellingNode struct. Now let's look further down, at the for loop:

// Fill the spl_idx_arr field of each lemma_arr_ element: the spl_id of each hanzi's pinyin
  // Convert the spelling string to idxs
  for (size_t i = 0; i < lemma_num_; i++) {
    for (size_t hz_pos = 0; hz_pos < (size_t)lemma_arr_[i].hz_str_len;
         hz_pos++) {
      ...
      int spl_idx_num =
        spl_parser_->splstr_to_idxs(lemma_arr_[i].pinyin_str[hz_pos],
                                    strlen(lemma_arr_[i].pinyin_str[hz_pos]),
                                    spl_idxs, spl_start_pos, 2, is_pre);
        ...
      if (spl_trie.is_half_id(spl_idxs[0])) {
        uint16 num = spl_trie.half_to_full(spl_idxs[0], spl_idxs);
        assert(0 != num);
      }
      lemma_arr_[i].spl_idx_arr[hz_pos] = spl_idxs[0];
    }
  }

The outer for loop runs from 0 to 65101, i.e., over the entire lemma_arr_ array; the inner loop walks the pinyin of each hanzi in the lemma and calls spl_parser_->splstr_to_idxs() to map each pinyin string to its syllable id (in the dumps below, for example, "LING" maps to 210 and "YI" to 396). The details of that mapping are deferred to the next post.

After the for loop finishes, the single-character table is built. Before building it, lemma_arr_ is sorted by hanzi string and every lemma gets a unique id:

  // Sort the lemma items according to the hanzi, and give each unique item
  // an id (updates the idx_by_hz field)
  sort_lemmas_by_hz();
  // Build the single-character table into scis_ and update the
  // hanzi_scis_ids field of lemma_arr_ accordingly
  scis_num_ = build_scis();

Next the single-character table scis is built. Note the sort that preceded it: lemma_arr_ is now ordered by the Unicode code points of its hanzi. Here are the first ten elements of the sorted array:

{
	{idx_by_py = 0, idx_by_hz = 1, hanzi_str = {12295, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {1, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {210, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"LING\000\000", 
      "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, 
    hz_str_len = 1 '\001', freq = 248.484543}, 
	{idx_by_py = 0, idx_by_hz = 2, hanzi_str = {19968, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {2, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {396, 0, 0, 0, 0, 0, 0, 0, 0}, 
    pinyin_str = {"YI\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", 
      "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 134392.703}, 
	{idx_by_py = 0, idx_by_hz = 3, hanzi_str = {19969, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {3, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {
      100, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"DING\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", 
      "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 4011.11377}, 
	{idx_by_py = 0, idx_by_hz = 4, hanzi_str = {19969, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {4, 0, 0, 0, 
      0, 0, 0, 0}, spl_idx_arr = {431, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ZhENG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", 
      "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 3.37402463}, 
	{idx_by_py = 0, idx_by_hz = 5, hanzi_str = {19971, 0, 0, 0, 0, 0, 0, 0, 0}, 
    hanzi_scis_ids = {5, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {285, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"QI\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", 
      "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 6313.39502}, 
	{idx_by_py = 0, idx_by_hz = 6, hanzi_str = {
      19975, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {6, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {238, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"MO\000\000\000\000", "\000\000\000\000\000\000", 
      "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', 
    freq = 4.85489225}, 
	{idx_by_py = 0, idx_by_hz = 7, hanzi_str = {19975, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {7, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {370, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {
      "WAN\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", 
      "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 25941.043}, 
	{idx_by_py = 0, idx_by_hz = 8, hanzi_str = {19976, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {8, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {
      426, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ZhANG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", 
      "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 305.971039}, 
	{idx_by_py = 0, idx_by_hz = 9, hanzi_str = {19977, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {9, 0, 0, 0, 
      0, 0, 0, 0}, spl_idx_arr = {315, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"SAN\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", 
      "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 26761.9336}, 
	{idx_by_py = 0, idx_by_hz = 10, hanzi_str = {19978, 0, 0, 0, 0, 0, 0, 0, 0}, 
    hanzi_scis_ids = {10, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {332, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ShANG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", 
      "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 284918.875}
}

Once scis is built, every single-hanzi reading has an id, and these ids are written back into the hanzi_scis_ids fields of lemma_arr_. The size and contents of the finished scis table are as follows (first ten entries only):

(gdb) p scis_num
$141 = 17038
(gdb) p *scis@10
$142 =   {{freq = 0, hz = 0, splid = {half_splid = 0, full_splid = 0}},
  {freq = 248.484543, hz = 12295, splid = {half_splid = 13, full_splid = 210}},
  {freq = 134392.703, hz = 19968, splid = {half_splid = 27, full_splid = 396}},
  {freq = 4011.11377, hz = 19969, splid = {half_splid = 5, full_splid = 100}},
  {freq = 3.37402463, hz = 19969, splid = {half_splid = 29, full_splid = 431}},
  {freq = 6313.39502, hz = 19971, splid = {half_splid = 18, full_splid = 285}},
  {freq = 4.85489225, hz = 19975, splid = {half_splid = 14, full_splid = 238}},
  {freq = 25941.043, hz = 19975, splid = {half_splid = 25, full_splid = 370}},
  {freq = 305.971039, hz = 19976, splid = {half_splid = 29, full_splid = 426}},
  {freq = 26761.9336, hz = 19977, splid = {half_splid = 20, full_splid = 315}}}
(gdb) 

valid_utf16.txt contains 16466 hanzi in total, while scis holds 17037 entries besides the 0th. Why more than valid_utf16.txt? Because some characters have several readings: "丨", for instance, can be read "gun", "e", or "shu", so each reading gets its own entry. You can see this in the dump above: hz = 19969 (丁) appears twice, once with full_splid 100 (DING) and once with 431 (ZhENG). Next, the dictionary list is initialized from lemma_arr_, which by now is sorted by hanzi with ids assigned from 1 upward. The call:

 // Construct the dict list
  dict_trie->dict_list_ = new DictList();
  bool dl_success = dict_trie->dict_list_->init_list(scis_, scis_num_,
                                                     lemma_arr_, lemma_num_);
  assert(dl_success);

init_list in turn calls the fill_scis and fill_list methods:

#ifdef ___BUILD_MODEL___
bool DictList::init_list(const SingleCharItem *scis, size_t scis_num,
                         const LemmaEntry *lemma_arr, size_t lemma_num) {
  if (NULL == scis || 0 == scis_num || NULL == lemma_arr || 0 == lemma_num)
    return false;

  initialized_ = false;

  if (NULL != buf_)
    free(buf_);

  // Calculate the required buffer size
  size_t buf_size = calculate_size(lemma_arr, lemma_num);
  if (0 == buf_size)
    return false;
  // Allocate the buffers
  if (!alloc_resource(buf_size, scis_num))
    return false;
  // Fill the scis_hz_ and scis_splid_ arrays from the scis array
  fill_scis(scis, scis_num);

  // Copy the related content from the array to inner buffer
  fill_list(lemma_arr, lemma_num);

  initialized_ = true;
  return true;
}

fill_scis is just a for loop: it copies the hz field of each scis entry into the scis_hz_ array, and the splid field into the scis_splid_ array:

void DictList::fill_scis(const SingleCharItem *scis, size_t scis_num) {
  assert(scis_num_ == scis_num);

  for (size_t pos = 0; pos < scis_num_; pos++) {
    scis_hz_[pos] = scis[pos].hz;
    scis_splid_[pos] = scis[pos].splid;
  }
}

The initialized scis_hz_ array looks like this (first ten elements only):

(gdb) p *scis_hz_@10
$152 =   {0,
  12295,
  19968,
  19969,
  19969,
  19971,
  19975,
  19975,
  19976,
  19977}

And the scis_splid_ array:

(gdb) p *scis_splid_@10
$154 =   {{half_splid = 0, full_splid = 0},
  {half_splid = 13, full_splid = 210},
  {half_splid = 27, full_splid = 396},
  {half_splid = 5, full_splid = 100},
  {half_splid = 29, full_splid = 431},
  {half_splid = 18, full_splid = 285},
  {half_splid = 14, full_splid = 238},
  {half_splid = 25, full_splid = 370},
  {half_splid = 29, full_splid = 426},
  {half_splid = 20, full_splid = 315}}
(gdb) 

Both arrays have the same length as scis. Immediately after fill_scis, init_list calls fill_list, which initializes the buf_ array:

void DictList::fill_list(const LemmaEntry* lemma_arr, size_t lemma_num) {
  size_t current_pos = 0;

  utf16_strncpy(buf_, lemma_arr[0].hanzi_str,
                lemma_arr[0].hz_str_len);

  current_pos = lemma_arr[0].hz_str_len;

  size_t id_num = 1;

  for (size_t i = 1; i < lemma_num; i++) {
    utf16_strncpy(buf_ + current_pos, lemma_arr[i].hanzi_str,
                  lemma_arr[i].hz_str_len);

    id_num++;
    current_pos += lemma_arr[i].hz_str_len;
  }

  assert(current_pos == start_pos_[kMaxLemmaSize]);
  assert(id_num == start_id_[kMaxLemmaSize]);
}
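The two asserts at the end are telling: start_pos_ and start_id_ appear to partition the buffer and the id space by lemma length. Here is a hedged sketch of how a lemma id could then be mapped back to its hanzi (an assumption drawn from those asserts, not the library's actual accessor; kMaxLemmaSize is 8 in dictdef.h):

#include <cstddef>

typedef unsigned short char16;

// Assumption: length-len lemmas occupy buf[start_pos[len-1] .. start_pos[len])
// as fixed-width records of len char16s, with contiguous ids in
// [start_id[len-1] .. start_id[len]).
static const char16* hanzi_for_id(const char16 *buf, const size_t *start_pos,
                                  const size_t *start_id, size_t id,
                                  size_t max_lemma_size) {
  for (size_t len = 1; len <= max_lemma_size; len++) {
    if (id < start_id[len]) {  // id falls in the length-len class
      size_t nth = id - start_id[len - 1];
      return buf + start_pos[len - 1] + nth * len;
    }
  }
  return NULL;  // id out of range
}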

Back to fill_list itself: it takes the lemma_arr array and its length, and lemma_arr has already been sorted by hanzi. The final buf_ holds 150837 char16s, beginning with:

(gdb) p *buf_@10
$230 = {12295, 19968, 19969, 19969, 19971, 19975, 19975, 19976, 19977, 19978}

These values are the hanzi from rawdict_utf16_65105_freq.txt: the hanzi_str fields of the sorted lemma_arr array are copied into buf_ one after another. A hanzi_str may hold one, two, three, or four hanzi, but they are concatenated into buf_ regardless; presumably the search code tells them apart later (the sketch after the fill_list listing above shows one plausible scheme). This step initializes three arrays (scis_hz_, scis_splid_, and buf_), and only when all three succeed, i.e., dl_success is true, does the assertion pass. The build then constructs the n-gram information, which mainly supports the later prediction feature; its construction is studied separately in another article. Here, let's look at DictBuilder::construct_subset. When called from build_dict it receives item_start=0 and item_end=65101, i.e., it walks lemma_arr_ from the first entry to the last, building the dictionary trie from the root node at level 0:

  // 1. Scan for how many sons
  size_t parent_son_num = 0;
  // LemmaNode *son_1st = NULL;
  // parent.num_of_son = 0;

  LemmaEntry *lma_last_start = lemma_arr_ + item_start;
  uint16 spl_idx_node = lma_last_start->spl_idx_arr[level];

  // Scan for how many sons are to be allocated
  for (size_t i = item_start + 1; i < item_end; i++) {
    LemmaEntry *lma_current = lemma_arr + i;
    uint16 spl_idx_current = lma_current->spl_idx_arr[level];
    if (spl_idx_current != spl_idx_node) {
      parent_son_num++;
      spl_idx_node = spl_idx_current;
    }
  }
  parent_son_num++;

This for loop walks lemma_arr_ to count how many child nodes are needed. When does the count grow? Exactly when spl_idx_current != spl_idx_node. If you read the first article on the LatinIME data structures, you will remember that LemmaEntry has a field spl_idx_arr holding the id of each hanzi's pinyin in the spelling buffer (spl_buf); whenever two adjacent lemmas differ in that id, a new node is needed. Stepping past the loop and printing parent_son_num gives exactly 413, the number of entries in spl_buf, i.e., the total number of valid Chinese syllables.
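The idiom is plain run-length counting over a sorted sequence. A tiny self-contained illustration (the ids are made up for the example):

#include <cstddef>
#include <cstdio>

// Adjacent equal spelling ids collapse into one child node; the trailing
// increment counts the final run.
int main() {
  unsigned short ids[] = {1, 1, 1, 3, 3, 7};  // hypothetical sorted level-0 ids
  size_t n = sizeof(ids) / sizeof(ids[0]);
  size_t parent_son_num = 0;
  unsigned short spl_idx_node = ids[0];
  for (size_t i = 1; i < n; i++) {
    if (ids[i] != spl_idx_node) {
      parent_son_num++;
      spl_idx_node = ids[i];
    }
  }
  parent_son_num++;
  std::printf("%zu\n", parent_son_num);  // prints 3
  return 0;
}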

The analysis of the dictionary construction process does not end here; for reasons of length, Part 2 will continue the walkthrough. Corrections and comments on any errors or omissions are welcome!


Reposted from blog.csdn.net/hello_java_Android/article/details/88838899