jieba uses a built-in dictionary for word segmentation and part-of-speech tagging. The dictionary is stored as a plain-text file in which each line holds one entry together with related information, such as its word frequency and part of speech.
Structure
Each line of a typical jieba dictionary holds three space-separated fields, 词语 (word), 词频 (word frequency), and 词性 (part of speech), in this order:
词语 词频 词性
For example:
清华大学 2333 nt
自然语言处理 1012 n
词语 is the word itself, the entry to be recognized.
词频 is a number giving how often the word occurs in the corpus. jieba uses it during segmentation: when a sentence can be split in more than one way, candidates built from higher-frequency words are preferred.
词性 is a tag identifying the word's part of speech (for example, n for noun or v for verb). This field is used during part-of-speech tagging.
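To make the line format concrete, here is a minimal sketch of a parser for such entries. The names `DictEntry` and `parse_entry` are hypothetical helpers written for illustration and are not part of jieba. The sketch assumes the word itself contains no whitespace, and it tolerates a missing frequency or tag, which jieba's README permits for user dictionaries.

```python
from typing import NamedTuple, Optional


class DictEntry(NamedTuple):
    word: str
    freq: Optional[int]  # None when the line omits the frequency
    tag: Optional[str]   # None when the line omits the part-of-speech tag


def parse_entry(line: str) -> DictEntry:
    """Parse one 'word [freq] [tag]' line (illustrative, not jieba's own parser)."""
    parts = line.split()
    if not parts:
        raise ValueError("empty dictionary line")
    word, freq, tag = parts[0], None, None
    for field in parts[1:3]:
        if freq is None and tag is None and field.isdecimal():
            freq = int(field)  # the first numeric field is the frequency
        else:
            tag = field        # anything else is treated as the tag
    return DictEntry(word, freq, tag)


for line in ["清华大学 2333 nt", "自然语言处理 1012 n"]:
    entry = parse_entry(line)
    print(entry.word, entry.freq, entry.tag)
```

Returning a named tuple keeps the three fields addressable by name, which mirrors how the format is described above.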
Custom dictionary
In addition to the default built-in dictionary, jieba also lets users load custom dictionaries:
import jieba
jieba.load_userdict("userdict.txt")
A custom dictionary uses the same format as the built-in one. Loading it lets you override built-in entries or add new ones, so that segmentation better reflects the vocabulary of a specific application or domain.
Modify and extend
jieba's dictionary can also be modified dynamically, for example:
jieba.add_word("特定词", freq=1000, tag="n")
jieba.del_word("不需要的词")
This way, you can add or remove vocabulary as needed while the program is running.
Overall, jieba's dictionary is a flexible and extensible component that supports a wide range of Chinese text processing needs.