What is jieba's dictionary?


jieba uses a built-in dictionary for word segmentation and part-of-speech tagging. The dictionary is stored as a plain-text file in which each line holds one entry together with related information, such as its word frequency and part of speech.

Structure

A typical jieba dictionary has the following format:

word frequency part_of_speech

For example:

清华大学 2333 nt
自然语言处理 1012 n

  • word is the term that the segmenter should recognize.
  • frequency is a number indicating how often the word occurs in the corpus. During segmentation, jieba uses it to choose the most probable split when a sentence can be cut in more than one way.
  • part_of_speech is a tag for the word's part of speech (for example, n for a noun or nt for an organization name). This field is used during part-of-speech tagging.
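To make the entry format and the role of the frequency field more concrete, here is a minimal sketch in plain Python. It is not jieba's actual code: `parse_entry` and `segment` are hypothetical helpers, the entries for 清华 and 大学 are made up for illustration, and the segmenter is a simplified frequency-based model that only approximates the idea of picking the most probable split.

```python
import math


def parse_entry(line):
    """Split one 'word frequency part-of-speech' line into typed fields."""
    word, freq, tag = line.split()
    return word, int(freq), tag


def segment(text, freqs):
    """Toy segmenter: choose the split with the highest total log-probability."""
    log_total = math.log(sum(freqs.values()))
    n = len(text)
    # best[i] = (best score for text[i:], end index of the first word in that split)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Single characters are always allowed; unknown ones count as frequency 1.
            if word in freqs or j == i + 1:
                freq = max(freqs.get(word, 0), 1)
                candidates.append((math.log(freq) - log_total + best[j][0], j))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(text[i:j])
        i = j
    return words


# The first two lines come from the example above; the last two are made up.
lines = ["清华大学 2333 nt", "自然语言处理 1012 n", "清华 500 ns", "大学 3000 n"]
freqs = {word: freq for word, freq, _tag in map(parse_entry, lines)}

print(segment("清华大学", freqs))  # ['清华大学']

freqs["清华大学"] = 1  # pretend the long entry is extremely rare
print(segment("清华大学", freqs))  # ['清华', '大学']
```

With the original frequency the four-character entry is kept whole; once its frequency drops to 1, the split into two more frequent words scores higher. This is why the frequency value matters when an entry competes with shorter words that cover the same characters.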

Custom dictionary

In addition to the default built-in dictionary, jieba also lets users load custom dictionaries:

import jieba
jieba.load_userdict("userdict.txt")  # path to a UTF-8 text file

A custom dictionary uses the same format as the built-in one, although the frequency and part of speech may be omitted from an entry. Loading a custom dictionary lets you add new entries or override built-in ones, so that segmentation more accurately reflects the vocabulary of a specific application or domain.
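As a rough illustration, the sketch below writes a small custom dictionary to a temporary file and loads it. The entries, the `write_userdict` helper, and the file location are assumptions made for this example. jieba is a third-party package, so the sketch skips the jieba calls when it is not installed.

```python
import os
import tempfile

# Hypothetical domain terms: (word, frequency, part-of-speech tag).
entries = [("云计算", 500, "n"), ("区块链", 300, "n")]


def write_userdict(path, entries):
    """Write entries as UTF-8 text, one 'word frequency tag' entry per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            f.write(f"{word} {freq} {tag}\n")


userdict_path = os.path.join(tempfile.mkdtemp(), "userdict.txt")
write_userdict(userdict_path, entries)

try:
    import jieba  # third-party: pip install jieba
except ImportError:
    jieba = None

if jieba is not None:
    jieba.load_userdict(userdict_path)
    # Segment a sentence that contains the custom terms.
    print(jieba.lcut("云计算和区块链"))
```

Keeping domain terms in a separate file like this leaves the built-in dictionary untouched, so the same terms can be reused across programs.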

Modify and extend

jieba's dictionary can also be modified dynamically, for example:

jieba.add_word("特定词", freq=1000, tag="n")
jieba.del_word("不需要的词")

This way, you can add or remove vocabulary as needed while the program is running.
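One way to observe the effect of these calls is to segment the same sentence before adding a word, after adding it, and after deleting it again. The sketch below does this. The sample sentence and the `segment_before_and_after` helper are made up for illustration, and the exact output depends on the installed dictionary, so no output is shown. The jieba calls are skipped when the package is not installed.

```python
try:
    import jieba  # third-party: pip install jieba
except ImportError:
    jieba = None


def segment_before_and_after(sentence, word):
    """Segment `sentence` before adding `word`, after adding it, and after deleting it."""
    before = jieba.lcut(sentence)
    jieba.add_word(word, freq=1000, tag="n")
    added = jieba.lcut(sentence)
    jieba.del_word(word)
    removed = jieba.lcut(sentence)
    return before, added, removed


if jieba is not None:
    results = segment_before_and_after("这是一个特定词", "特定词")
    for label, words in zip(("before", "added", "removed"), results):
        print(label, words)
```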

Overall, jieba's dictionary is a flexible and extensible component that can support a wide range of Chinese text processing needs.


Origin blog.csdn.net/m0_57236802/article/details/133393061