How to implement part-of-speech tagging using the spacy toolkit

Table of contents

1 Description of the problem

1.1 Introduction to basic knowledge

2 Solving the problem

2.1 Tagging with spaCy's default word splitting

2.2 Tagging with spaCy without splitting words


1 Description of the problem

1.1 Introduction to basic knowledge

spaCy is a natural language processing (NLP) library for Python, implemented in Python and Cython, that covers many common natural language processing tasks.

For general usage of the spaCy toolkit, please refer to another blog post: Click Here

This post shows how to use the spaCy toolkit for part-of-speech tagging, both with its default word splitting and without splitting words (using pos_ and tag_ as examples).
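
Both code examples below assume that spaCy and its small English model are installed. A minimal setup sketch follows (the shell commands, shown as comments, are the standard spaCy install steps):

# Setup sketch: install spaCy and download the small English model first, e.g.
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

# Verify that the model package is available before running the examples below
assert spacy.util.is_package("en_core_web_sm")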

2 Solving the problem

The spaCy toolkit supports part-of-speech tagging for Chinese, English, and other languages. Here, only English is used as an example (a brief Chinese sketch is shown below).
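
As a minimal sketch of the multi-language support, the snippet below tags a Chinese sentence; it assumes the zh_core_web_sm model has been downloaded separately (python -m spacy download zh_core_web_sm), and the example sentence is my own:

import spacy

# Load the small Chinese model (must be downloaded beforehand)
nlp_zh = spacy.load("zh_core_web_sm")

# A sample Chinese sentence
doc = nlp_zh("这是一个用于词性标注的测试句子。")

# Print each token's text, coarse POS, and fine-grained tag
for token in doc:
    print(token.text, token.pos_, token.tag_)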

2.1 Tagging with spaCy's default word splitting

The following code uses the spaCy toolkit to perform English part-of-speech tagging:

import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# A sample English sentence
sentence = "This is a test sentence for POS tagging X-T ."

# Run the pipeline on the sentence
doc = nlp(sentence)

# Iterate over each token and print its text, coarse POS, and fine-grained tag
for token in doc:
    print(token.text, token.pos_, token.tag_)

Running the above code prints one line per token: the token text, its coarse part of speech, and its fine-grained tag.

Result analysis:

token.text, token.pos_, and token.tag_ are, respectively, the word in the sentence, the coarse part of speech of that word, and the fine-grained tag of that word. The output shows that the spaCy toolkit applies its own tokenizer: even though the English text is already separated by spaces, spaCy's tokenization rules split the word "X-T" into "X", "-", and "T".
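
If a tag in the output is unfamiliar, spaCy's built-in explain helper can look up a human-readable description (a small illustration; the label strings passed in are just examples):

import spacy

# Look up human-readable descriptions for POS and tag labels
print(spacy.explain("HYPH"))   # fine-grained tag for a hyphen
print(spacy.explain("NN"))     # fine-grained tag for a singular or mass noun
print(spacy.explain("PROPN"))  # coarse POS label for a proper noun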

2.2 Tagging with spaCy without splitting words

The following code uses the spaCy toolkit to tag English text using only spaces as separators, without splitting words any further:

import spacy

# Load the English model; the parser and NER components are not needed here
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# A tokenizer that only splits on spaces and never splits words further
class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(' ')
        return spacy.tokens.Doc(self.vocab, words=words)

# Replace the default tokenizer with the whitespace tokenizer
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

# Input English sentence
text = 'This is a test sentence for POS tagging X-T .'

# Create a Doc object by running the pipeline
doc = nlp(text)

# Print each word's text, coarse POS, and fine-grained tag
for token in doc:
    print(token.text, token.pos_, token.tag_)

Running the above code again prints each token's text, coarse part of speech, and fine-grained tag.

It can be seen that only spaces are used as separators: "X-T" is kept as a single token and tagged as one word, with no additional splitting. A small side-by-side check is sketched below.
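
As a quick sanity check, the sketch below runs both setups on the same sentence and compares the resulting tokens (the names nlp_default and nlp_ws are my own):

import spacy

# Pipeline with spaCy's default tokenizer
nlp_default = spacy.load('en_core_web_sm')

# Pipeline with the whitespace-only tokenizer from section 2.2
nlp_ws = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        return spacy.tokens.Doc(self.vocab, words=text.split(' '))

nlp_ws.tokenizer = WhitespaceTokenizer(nlp_ws.vocab)

text = 'This is a test sentence for POS tagging X-T .'

# The default tokenizer splits "X-T" into "X", "-", "T"; the whitespace tokenizer keeps it whole
print([t.text for t in nlp_default(text)])
print([t.text for t in nlp_ws(text)])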

Origin blog.csdn.net/weixin_41862755/article/details/130034905