Practical notes and experience on identifying Chinese names with jieba

1. Introduction

Official introduction: "jieba" ("stuttering") Chinese word segmentation: aiming to be the best Python Chinese word segmentation component.

As I read it, the official goal is to *become* the best Python Chinese word segmentation component, yet many articles online flatly call it the best Chinese word segmentation component. I think they have misread the original wording. As things stand, jieba still cannot resolve Chinese ambiguity (demonstrated with actual code later), so its accuracy is not the best and it cannot be called the best Chinese word segmentation component. I am recording this here so that I can look for better components later. Natural language processing modules of this kind keep multiplying; as for which is good, one can only say there is no best, only better, since every Chinese word segmentation component is still being updated and improved.

2. Features (The following is quoted from the official readme)

  • Supports four segmentation modes:
    • Exact mode attempts to cut the sentence into the most precise segmentation and is suitable for text analysis;
    • Full mode scans out every word in the sentence that could form a dictionary word; it is very fast, but cannot resolve ambiguity;
    • Search-engine mode, built on top of exact mode, re-segments long words to improve recall and is suitable for search-engine indexing.
    • paddle mode uses the PaddlePaddle deep-learning framework to train a sequence-labeling (bidirectional GRU) network model for segmentation; part-of-speech tagging is also supported. To use paddle mode, install paddlepaddle-tiny first: pip install paddlepaddle-tiny==1.6.1. paddle mode currently requires jieba v0.40 or above; for older versions, upgrade with pip install jieba --upgrade. See the PaddlePaddle official website.
  • Support traditional Chinese word segmentation
  • Support custom dictionary
  • MIT License Agreement

 A practical note on the fourth mode: with jieba v0.40+ and paddlepaddle 2.0.0 or above, calling jieba.enable_paddle() directly raises an error, because PaddlePaddle 2.x enables dynamic graph mode by default, while jieba's paddle code requires static graph mode:

Traceback (most recent call last):
  File "D:\Party_committee_project\党员转出回执收集V1\Lib\identify_name.py", line 15, in <module>
    jieba.enable_paddle()
  File "D:\Party_committee_project\党员转出回执收集V1\Lib\jieba\_compat.py", line 46, in enable_paddle
    import jieba.lac_small.predict as predict
  File "D:\Party_committee_project\党员转出回执收集V1\Lib\jieba\lac_small\predict.py", line 43, in <module>
    infer_ret = creator.create_model(dataset.vocab_size, dataset.num_labels, mode='infer')
  File "D:\Party_committee_project\党员转出回执收集V1\Lib\jieba\lac_small\creator.py", line 32, in create_model
    words = fluid.data(name='words', shape=[-1, 1], dtype='int64', lod_level=1)
  File "<decorator-gen-31>", line 2, in data
  File "D:\Party_committee_project\党员转出回执收集V1\Lib\paddle\fluid\wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "D:\Party_committee_project\党员转出回执收集V1\Lib\paddle\fluid\framework.py", line 442, in __impl__
    ), "In PaddlePaddle 2.x, we turn on dynamic graph mode by default, and '%s()' is only supported in static graph mode. So if you want to use this api, please call 'paddle.enable_static()' before this api to enter static graph mode." % func.__name__
AssertionError: In PaddlePaddle 2.x, we turn on dynamic graph mode by default, and 'data()' is only supported in static graph mode. So if you want to use this api, please call 'paddle.enable_static()' before this api to enter static graph mode.

Process finished with exit code 1

3. Installation instructions (The following is quoted from the official readme)

The code is compatible with Python 2/3

  • Fully automatic installation: easy_install jieba or pip install jieba / pip3 install jieba
  • Semi-automatic installation: first download jieba from PyPI, unzip it, and run python setup.py install
  • Manual installation: place the jieba directory in the current directory or the site-packages directory
  • Reference it with import jieba
  • To use paddle-mode segmentation and part-of-speech tagging, first install paddlepaddle-tiny: pip install paddlepaddle-tiny==1.6.1.

4. Algorithm (the following is quoted from the official readme)

  • Implements efficient word-graph scanning based on a prefix dictionary, generating a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in the sentence
  • Uses dynamic programming to find the maximum-probability path, yielding the segmentation with maximum probability based on word frequency
  • For unregistered (out-of-vocabulary) words, uses an HMM model based on the word-forming ability of Chinese characters, solved with the Viterbi algorithm
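The first two bullets can be illustrated with a self-contained toy sketch (the dictionary and its frequencies below are made up for the example; jieba's real dict.txt holds hundreds of thousands of entries):

```python
import math

# Toy dictionary with made-up frequencies, standing in for jieba's dict.txt
FREQ = {"去": 5, "北": 2, "京": 2, "北京": 10, "大": 6, "学": 3, "大学": 12, "北京大学": 8}
TOTAL = sum(FREQ.values())

def build_dag(sentence):
    """For each start index k, list every end index j such that sentence[k:j+1] is a word."""
    dag = {}
    for k in range(len(sentence)):
        ends = [k]  # a single character is always allowed as a fallback segment
        for j in range(k + 1, len(sentence)):
            if sentence[k:j + 1] in FREQ:
                ends.append(j)
        dag[k] = ends
    return dag

def max_prob_cut(sentence):
    """Dynamic programming from right to left over the DAG for the max log-probability path."""
    dag = build_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, 0)}  # route[k] = (best log-prob from k to the end, chosen end index)
    for k in range(n - 1, -1, -1):
        route[k] = max(
            (math.log(FREQ.get(sentence[k:j + 1], 1) / TOTAL) + route[j + 1][0], j)
            for j in dag[k]
        )
    # Walk the best path left to right to read off the segmentation
    words, k = [], 0
    while k < n:
        j = route[k][1]
        words.append(sentence[k:j + 1])
        k = j + 1
    return words

print(max_prob_cut("去北京大学"))  # ['去', '北京大学']
```

Here the single high-frequency entry 北京大学 beats the two-word path 北京 + 大学, because one log-probability term outweighs the sum of two.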

5. Main functions

1. Word segmentation

  • jieba.cut accepts four parameters: the string to segment; cut_all, controlling whether to use full mode; HMM, controlling whether to use the HMM model; and use_paddle, controlling whether to use paddle mode. paddle mode is lazily loaded: install paddlepaddle-tiny, then enable it through the enable_paddle interface, which imports the related code;
  • jieba.cut_for_search accepts two parameters: the string to segment, and whether to use the HMM model. This method is suited to segmentation for building a search engine's inverted index, and its granularity is relatively fine.
  • The string to segment may be a unicode or UTF-8 string, or a GBK string. Note: passing a GBK string directly is not recommended, as it may be unpredictably mis-decoded as UTF-8.
  • jieba.cut and jieba.cut_for_search return an iterable generator; use a for loop to obtain each word (unicode) produced by segmentation, or
  • use jieba.lcut and jieba.lcut_for_search, which return a list directly
  • jieba.Tokenizer(dictionary=DEFAULT_DICT) creates a new custom segmenter, which allows different dictionaries to be used at the same time. jieba.dt is the default segmenter, and all global segmentation functions are mappings of this segmenter's methods.

2. Part-of-speech tagging

Next comes the focus of this article: the practical part.

(1) Environment: Python 3.7.2 + Windows 10

(2) Project background: find emails with a specified subject in a specified Outlook inbox, and identify Chinese names in the subjects or bodies of those emails.

(3) Example code:

import paddle
import jieba.posseg as pseg

paddle.enable_static()  # PaddlePaddle 2.x defaults to dynamic graph mode; jieba's paddle code needs static mode

def identify_person_name(text):
    """
    Identify person names with jieba POS tagging.
    :param text: sentence, str
    :return: result message, str
    Note: use_paddle enables paddle mode; default is False
    """
    try:
        words = pseg.cut(text, use_paddle=True)  # jieba segmentation with POS tags
        # words = pseg.cut(text)  # default (non-paddle) mode
        name_jieba = []  # person names recognized by jieba
        for pair_word in words:  # each item is a pair(word, flag)
            if pair_word.flag in ('nr', 'PER'):  # 'nr' in default mode, 'PER' in paddle mode
                name_jieba.append(pair_word.word)

        if not name_jieba:
            return '未识别到人名'  # "no person name recognized"

        return f'已识别完毕,结果为{name_jieba}'  # "recognition finished, result is ..."

    except Exception as e:
        return str(e)

Why focus on part-of-speech tagging? Because in the actual project, Chinese names are obtained through part-of-speech tagging. That is why the code above uses jieba's pseg module, which returns each recognized word together with its part of speech; when the part of speech is 'nr' (or 'PER' in paddle mode), the word is a person's name.

The paddle-mode part-of-speech correspondence table is as follows:

The set of paddle-mode part-of-speech tags and proper-noun category tags contains 24 part-of-speech tags (lowercase letters) and 4 proper-noun category tags (uppercase letters).

Tag   Meaning               Tag   Meaning              Tag   Meaning               Tag    Meaning
n     common noun           f     directional noun     s     place noun            t      time
nr    person name           ns    place name           nt    organization name     nw     work title
nz    other proper noun     v     common verb          vd    verb-adverb           vn     noun-verb
a     adjective             ad    adverb-adjective     an    noun-adjective        d      adverb
m     numeral               q     quantifier           r     pronoun               p      preposition
c     conjunction           u     particle             xc    other function word   w      punctuation
PER   person name           LOC   place name           ORG   organization name     TIME   time

Now run it on a real sentence:

print(identify_person_name('钟伟政党组织关系转出回执给张颖'))

The running result is:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\L84171~1\AppData\Local\Temp\jieba.cache
已识别完毕,结果为['钟伟', '张颖']
Loading model cost 0.565 seconds.
Prefix dict has been built successfully.

The output shows that the first person name was not recognized accurately. The reason is probably Chinese ambiguity: the characters '政党' ("political party") in the sentence were treated as a noun, swallowing the last character of the name '钟伟政'. The second name was recognized correctly, since it involves no ambiguity and is a fairly common name.

To verify this, debug the run and inspect all the recognized words and their parts of speech.

 The jieba algorithm split the name '钟伟政' into '钟伟', attaching the character '政' to '党' to form the common noun '政党' ("political party"), which confirms a problem of Chinese ambiguity. In the actual project, jieba alone was too inaccurate for name recognition, so it later had to be combined with other natural-language Chinese recognition components, or replaced by them outright. If you want to know which more accurate Chinese name-recognition component that is, please keep following this blogger's posts. Comments and exchanges are welcome.


Origin blog.csdn.net/Smile_Lai/article/details/128613900