Refactoring the CppJieba Chinese word segmentation library with a Double Array Trie (DAT): memory consumption reduced by 99% [2019-11-25]

I. Problem background

The most widely used open-source Chinese word segmentation library is jieba ("结巴", stutterer); its high-performance C++ implementation is CppJieba:
https://github.com/yanyiwu/cppjieba

While using CppJieba in practice, we found that its memory footprint is quite high.

Take a 760,000-word dictionary, 11 MB on disk, as an example: loading two copies of it (e.g. so a user dictionary can be swapped in smoothly) consumes 505 MB of memory.

Some of our backend services are multi-process, so this wastes a great deal of memory, which was hard to accept; we therefore set out to reduce the memory consumption.

After a preliminary investigation to settle on an approach, followed by hands-on refactoring, the 505 MB was eventually brought down to 4.7 MB, a 99% reduction in memory.

The issue is also discussed at https://github.com/yanyiwu/cppjieba/issues/3

The code may be open-sourced later.

II. Implementation process

II.1 Checking the memory distribution

The first step was to use jemalloc's memory profiler to see where the memory goes:

  1. Modify CppJieba's test/demo.cpp, link against jemalloc, and compile it into a binary, jieba_test
  2. Set the environment variable
    export MALLOC_CONF="prof:true,prof_prefix:mem_prof/mem_profile_je.out,lg_prof_interval:20,lg_prof_sample:20"
  3. mkdir mem_prof, then run the test program
  4. jeprof --pdf ./jieba_test mem_prof/mem_profile_je.out.*.heap > mem_profile.pdf

Open mem_profile.pdf and you can see how the memory is distributed.

II.2 Optimization

Clearly, most of the memory is spent on:

  1. Building the Trie tree in Trie.hpp
  2. Loading the idf dictionary file in KeywordExtractor.hpp

The plan was therefore:

1. Replace cppjieba::Trie with a Double Array Trie

Introduce the Double Array Trie (DAT for short, https://github.com/s-yata/darts-clone ) to replace the simple Trie in Trie.hpp, and save the DAT file that darts-clone generates at startup. If the dictionary already has a corresponding DAT file, the process can simply mmap() and attach it and is ready to serve; see the sketch below.
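A minimal sketch of the replacement, using darts-clone's Darts::DoubleArray API; the file name and the word-id scheme here are illustrative, not the actual patched code. darts-clone requires sorted, unique keys, and a common-prefix search returns every dictionary word starting at a given position, which is exactly the lookup cppjieba's Trie performs during segmentation.

    #include <darts.h>

    #include <algorithm>
    #include <string>
    #include <vector>

    // Build the DAT from the dictionary words and persist it, so later
    // startups can reuse the file instead of rebuilding.
    void BuildAndSaveDat(std::vector<std::string> words, const char* dat_path) {
      std::sort(words.begin(), words.end());  // keys must be sorted
      words.erase(std::unique(words.begin(), words.end()), words.end());

      std::vector<const char*> keys;
      std::vector<int> values;  // e.g. row index into a word-info table
      for (std::size_t i = 0; i < words.size(); ++i) {
        keys.push_back(words[i].c_str());
        values.push_back(static_cast<int>(i));
      }

      Darts::DoubleArray da;
      da.build(keys.size(), keys.data(), /*lengths=*/nullptr, values.data());
      da.save(dat_path);  // e.g. "jieba.dat"
    }

    // Segmentation lookup: all dictionary words that start at `text`.
    // Each result carries .value (the word id) and .length (bytes matched).
    std::size_t FindPrefixes(const Darts::DoubleArray& da, const char* text,
                             Darts::DoubleArray::result_pair_type* results,
                             std::size_t max_results) {
      return da.commonPrefixSearch(text, results, max_results);
    }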

Field testing showed that for the 750,000-word dictionary, the DAT file generated by darts-clone is only 24 MB, and it can be mounted with mmap and shared across processes.
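A minimal sketch of that attach step, assuming plain POSIX calls with error handling trimmed; because the pages are file-backed and read-only, all worker processes share one physical copy of the 24 MB array.

    #include <darts.h>

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Attach an existing DAT file without copying it onto the heap.
    bool AttachDat(Darts::DoubleArray& da, const char* dat_path) {
      int fd = open(dat_path, O_RDONLY);
      if (fd < 0) return false;
      struct stat st;
      if (fstat(fd, &st) != 0) { close(fd); return false; }
      void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      close(fd);  // the mapping remains valid after close()
      if (addr == MAP_FAILED) return false;
      // set_array() points the double array at the mapped region;
      // its size argument is in array units, not bytes.
      da.set_array(addr, st.st_size / da.unit_size());
      return true;
    }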

2. KeywordExtractor

KeywordExtractor is not a commonly used feature, so we support passing empty strings for idfPath and stopWordPath, in which case the corresponding data is simply not loaded; a sketch of the idea follows.
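This is a sketch of the idea only; the member names in cppjieba's actual KeywordExtractor differ.

    #include <string>

    // Hypothetical loader guard: an empty path means "feature disabled",
    // so the idf (or stop-word) data is never read into memory.
    void LoadIdfDictIfConfigured(const std::string& idfPath) {
      if (idfPath.empty()) {
        return;  // caller passed "", keyword extraction is not needed
      }
      // ... load idf entries from idfPath as before ...
    }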

II.3 Other issues

1. Supporting hot updates while keeping the dictionary and the DAT consistent

One problem here: the dictionary may be hot-updated, so how do you know whether the current DAT file still matches the dictionary contents?

My approach: take the default dictionary file plus the custom dictionary file, compute an MD5 over their contents, and write it into the DAT file's header. When the DAT file is opened and the MD5 does not match, we know the DAT file is stale and can rebuild it (sketched below).

Computing the MD5 turned out to be fast; measured at startup, it takes about 1 second.
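A minimal sketch of the freshness check, assuming OpenSSL's MD5 routines (the post does not say which MD5 implementation was used) and an illustrative header layout of 32 hex characters in front of the array; darts-clone's save() and open() both take an offset argument, so the array can be stored directly after such a header.

    #include <openssl/md5.h>

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Hex MD5 over the concatenated contents of the dictionary files.
    std::string DictMd5(const std::vector<std::string>& dict_paths) {
      MD5_CTX ctx;
      MD5_Init(&ctx);
      for (const std::string& path : dict_paths) {
        std::ifstream in(path, std::ios::binary);
        std::ostringstream buf;
        buf << in.rdbuf();
        std::string data = buf.str();
        MD5_Update(&ctx, data.data(), data.size());
      }
      unsigned char digest[MD5_DIGEST_LENGTH];
      MD5_Final(digest, &ctx);
      static const char* kHex = "0123456789abcdef";
      std::string hex;
      for (int i = 0; i < MD5_DIGEST_LENGTH; ++i) {
        hex += kHex[digest[i] >> 4];
        hex += kHex[digest[i] & 0xf];
      }
      return hex;  // 32 characters
    }

    // True if dat_path exists and its header matches the current
    // dictionaries, i.e. the cached DAT can be attached as-is.
    bool DatIsFresh(const std::string& dat_path,
                    const std::vector<std::string>& dict_paths) {
      std::ifstream in(dat_path, std::ios::binary);
      char header[32];
      if (!in.read(header, sizeof(header))) return false;  // missing/short
      return std::string(header, sizeof(header)) == DictMd5(dict_paths);
    }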

2. Code cleanup

In addition, the code was cleaned up: dead code in Unicode.hpp was deleted, and duplicated code across FullSegment.hpp, HMMSegment.hpp, MixSegment.hpp, MPSegment.hpp, and QuerySegment.hpp was consolidated.

3. Incompatible changes

  • Because the Double Array Trie does not support dynamically inserting words, the InsertUserWord() method was removed
  • Fixed a bug in how FullSegment.hpp computes maxId

After the overall refactor, the code is more than 100 lines shorter than the original.

The effect after going live was significant.

With memory down to the 2-3 MB level, a dictionary as large as 750,000 words becomes usable even in a mobile environment.

For example, Chinese/English word segmentation can run on iOS or Android,
which means a very good search experience can be built on the client side.

On iOS, CFStringTokenizer can also be used for Chinese word segmentation, but it does not appear to be open source.


Origin: www.cnblogs.com/windydays/p/12536029.html