1. Introduction
Official introduction: "Jieba" ("to stutter") Chinese word segmentation: built to be the best Python Chinese word segmentation component.
Note that the official goal is to build the best *Python* Chinese word segmentation component, yet many articles found online claim outright that it is the best Chinese word segmentation component. I think they have misread the original wording. For now, jieba still cannot resolve Chinese ambiguity (this will be demonstrated with real code later), so its accuracy is not the best and it cannot be called the best Chinese word segmentation component. I am recording this here so that I can look for better Chinese word segmentation components later. Natural language processing modules of this kind keep multiplying; as for which is good, one can only say there is no best, only better, because every Chinese word segmentation component is still being updated and improved.
2. Features (The following is quoted from the official readme)
- Supports four word segmentation modes:
- Exact mode attempts to cut the sentence into the most precise segmentation; suitable for text analysis.
- Full mode scans out all the words in the sentence that can form words; it is very fast, but cannot resolve ambiguity.
- Search-engine mode, on top of exact mode, re-segments long words to improve recall; suitable for search-engine word segmentation.
- Paddle mode uses the PaddlePaddle deep learning framework to train a sequence-labeling (bidirectional GRU) network model for word segmentation; part-of-speech tagging is also supported. To use paddle mode, paddlepaddle-tiny must be installed:
pip install paddlepaddle-tiny==1.6.1
Paddle mode requires jieba v0.40 or above; for versions below v0.40, please upgrade jieba with pip install jieba --upgrade. See also the PaddlePaddle official website.
- Supports traditional Chinese word segmentation
- Supports custom dictionaries
- MIT license
A practical note on the fourth mode: with jieba v0.40+ and paddlepaddle 2.0.0 or above, calling jieba.enable_paddle() directly raises an error, because PaddlePaddle 2.x enables dynamic graph mode by default while jieba's paddle code path requires static graph mode:
Traceback (most recent call last):
File "D:\Party_committee_project\党员转出回执收集V1\Lib\identify_name.py", line 15, in <module>
jieba.enable_paddle()
File "D:\Party_committee_project\党员转出回执收集V1\Lib\jieba\_compat.py", line 46, in enable_paddle
import jieba.lac_small.predict as predict
File "D:\Party_committee_project\党员转出回执收集V1\Lib\jieba\lac_small\predict.py", line 43, in <module>
infer_ret = creator.create_model(dataset.vocab_size, dataset.num_labels, mode='infer')
File "D:\Party_committee_project\党员转出回执收集V1\Lib\jieba\lac_small\creator.py", line 32, in create_model
words = fluid.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
File "<decorator-gen-31>", line 2, in data
File "D:\Party_committee_project\党员转出回执收集V1\Lib\paddle\fluid\wrapped_decorator.py", line 25, in __impl__
return wrapped_func(*args, **kwargs)
File "D:\Party_committee_project\党员转出回执收集V1\Lib\paddle\fluid\framework.py", line 442, in __impl__
), "In PaddlePaddle 2.x, we turn on dynamic graph mode by default, and '%s()' is only supported in static graph mode. So if you want to use this api, please call 'paddle.enable_static()' before this api to enter static graph mode." % func.__name__
AssertionError: In PaddlePaddle 2.x, we turn on dynamic graph mode by default, and 'data()' is only supported in static graph mode. So if you want to use this api, please call 'paddle.enable_static()' before this api to enter static graph mode.
Process finished with exit code 1
3. Installation instructions (the following is quoted from the official readme)
The code is compatible with Python 2/3
- Fully automatic installation:
easy_install jieba
or pip install jieba
or pip3 install jieba
- Semi-automatic installation: first download jieba from PyPI, unzip it, and run
python setup.py install
- Manual installation: Place the jieba directory in the current directory or site-packages directory
- Usage: reference the package with
import jieba
- If you need the word segmentation and part-of-speech tagging functions of paddle mode, please install paddlepaddle-tiny first:
pip install paddlepaddle-tiny==1.6.1
4. Algorithm (the following is quoted from the official readme)
- Implements efficient word-graph scanning based on a prefix dictionary, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in the sentence
- Uses dynamic programming to find the maximum-probability path, i.e. the maximum segmentation combination based on word frequency
- For out-of-vocabulary words, uses an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm
5. Main functions
1. Word segmentation
The jieba.cut method accepts four input parameters: the string to be segmented; cut_all, which controls whether full mode is used; HMM, which controls whether the HMM model is used; and use_paddle, which controls whether paddle mode is used. Paddle mode is lazily loaded: the enable_paddle interface installs paddlepaddle-tiny and imports the related code.
The jieba.cut_for_search method accepts two parameters: the string to be segmented, and whether to use the HMM model. It is intended for search engines building inverted indexes, so its granularity is relatively fine.
The string to be segmented may be a unicode/UTF-8 string or a GBK string. Note: passing a GBK string directly is not recommended, as it may be incorrectly and unpredictably decoded as UTF-8.
jieba.cut and jieba.cut_for_search return an iterable generator; use a for loop to obtain each word (unicode) produced by segmentation. Alternatively, jieba.lcut and jieba.lcut_for_search return a list directly.
jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time. jieba.dt is the default tokenizer; all global word segmentation functions are mappings of this tokenizer.
2. Part-of-speech tagging
Next comes the focus of this article: the practical part.
(1) Environment: Python 3.7.2 + Windows 10
(2) Project background: from a specified Outlook inbox, find the emails with a specified subject and identify Chinese person names in their subjects or bodies.
(3) A code example:
import paddle
import jieba.posseg as pseg

paddle.enable_static()  # since version 2.0.0, Paddle enables dynamic graph mode by default, so switch back to static graph mode


def identify_person_name(text):
    """
    Identify person names with the jieba algorithm.
    Special parameter: use_paddle enables paddle mode (default False).
    :param text: sentence, str
    :return: recognition result, str
    """
    try:
        words = pseg.cut(text, use_paddle=True)  # jieba segmentation with part-of-speech tagging
        # words = pseg.cut(text)  # the same call without paddle mode
        name_jieba = []  # person names recognized by jieba
        for word, flag in words:  # each item is a pair(word, flag)
            if flag == 'nr' or flag == 'PER':  # 'nr' (default mode) and 'PER' (paddle mode) mark person names
                name_jieba.append(word)
        if not name_jieba:
            return 'No person name recognized'
        return f'Recognition finished, result: {name_jieba}'
    except Exception as e:
        return e
Why focus on part-of-speech tagging? Because in this project, Chinese person names are extracted via their part-of-speech tags. The code above therefore uses jieba's pseg module, which returns each recognized word together with its part of speech; when the tag is 'nr' or 'PER', the word is a person name.
The paddle-mode part-of-speech tag correspondence table is as follows, containing 24 part-of-speech tags (lowercase letters) and 4 proper-name category tags (uppercase letters):
Label | Meaning | Label | Meaning | Label | Meaning | Label | Meaning |
---|---|---|---|---|---|---|---|
n | common noun | f | directional noun | s | place noun | t | time |
nr | person name | ns | place name | nt | organization name | nw | work title |
nz | other proper noun | v | common verb | vd | verb-adverb | vn | noun-verb |
a | adjective | ad | adverbial adjective | an | nominal adjective | d | adverb |
m | numeral | q | quantifier | r | pronoun | p | preposition |
c | conjunction | u | particle | xc | other function word | w | punctuation |
PER | person name | LOC | place name | ORG | organization name | TIME | time |
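For filtering in code, the person-name tags from the two modes can be collected into a small helper (a convenience sketch, not part of jieba itself):

```python
# 'nr' comes from jieba's default dictionary-based tagger,
# 'PER' from the paddle-mode proper-name categories above
PERSON_TAGS = {'nr', 'PER'}

def is_person(flag):
    """Return True if a jieba part-of-speech flag denotes a person name."""
    return flag in PERSON_TAGS

print(is_person('nr'))   # → True
print(is_person('PER'))  # → True
print(is_person('n'))    # → False
```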
Now let's run it:
print(identify_person_name('钟伟政党组织关系转出回执给张颖'))
The running result is:
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\L84171~1\AppData\Local\Temp\jieba.cache
Recognition finished, result: ['钟伟', '张颖']
Loading model cost 0.565 seconds.
Prefix dict has been built successfully.
The running results show that the first person name was not recognized accurately. The cause is likely Chinese ambiguity, since the "政党" ("political party") in the sentence was treated as a noun, whereas the second name was recognized correctly, being unambiguous and fairly common.
To verify this further, step through with a debugger and inspect all the recognized words and their parts of speech.
The jieba algorithm split the name 钟伟政 into 钟伟, then combined 政 with 党 into a common noun, confirming that the problem is indeed Chinese ambiguity. In this real project, jieba on its own was too inaccurate, so it later had to be used together with other Chinese natural language recognition components, or simply replaced by another component. If you want to know which more accurate Chinese name recognition component that is, please keep following this blogger's posts. You are welcome to discuss natural language processing tools together.