[Deep Learning] spaCy Introduction and Practice: High-Performance Natural Language Processing

This article introduces spaCy, a high-performance natural language processing (NLP) library, focusing on its main features, applications, and practical use. Through working code examples, we show how to use spaCy for tasks such as tokenization, part-of-speech tagging, named entity recognition, and dependency parsing. The article also explores how to use spaCy for text classification and information extraction, how to train custom models, and how spaCy compares with other NLP libraries.

1. Introduction to spaCy

spaCy is a Python library for advanced natural language processing, created by Matthew Honnibal and Ines Montani and first released in 2015. spaCy is designed with high performance, ease of use, and scalability in mind. It ships with a variety of pre-trained models covering multiple languages, including English, French, German, and Chinese, and provides many tools and interfaces so that users can easily build custom NLP applications.

2. Install spaCy

pip install spacy

3. Basic functions of spaCy

Below are some basic functions of spaCy for processing natural language text.

3.1 Tokenization

Tokenization is the process of breaking text down into basic units such as words and punctuation marks. spaCy completes tokenization quickly. First, run the following commands on the command line to install a pre-trained model:

# English model
python -m spacy download en_core_web_sm
# Chinese model
python -m spacy download zh_core_web_sm

Then load the model and tokenize a text in Python:

import spacy

nlp = spacy.load("en_core_web_sm")
# For Chinese text, load the Chinese model instead
# nlp = spacy.load("zh_core_web_sm")

text = "This is a sentence."
doc = nlp(text)

for token in doc:
    print(token.text)

For more on spaCy's tokenization, you can also refer to this post: Word segmentation tools and methods: jieba, spaCy, etc.

3.2 Part-of-speech tagging

Part-of-speech tagging is the process of assigning a part of speech (such as noun or verb) to each word in a text. spaCy uses its pre-trained models to complete part-of-speech tagging automatically.

for token in doc:
    print(token.text, token.pos_)

3.3 Named entity recognition

Named entity recognition (NER) is the process of identifying and classifying named entities (such as names of people, places, companies, etc.) in text. spaCy's pretrained models can automatically recognize many types of named entities.

# The sentence from 3.1 contains no named entities, so use one that does
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)

3.4 Dependency parsing

Dependency parsing is the process of determining the syntactic relationship (such as subject, object, etc.) between words in a text. spaCy can automatically analyze the dependencies between words, which can help us better understand the text structure.

for token in doc:
    print(token.text, token.dep_, token.head.text)

The above lists some of spaCy's basic functions. spaCy also includes many other features, such as text similarity calculation, word vector generation, and sentence boundary detection. You can learn more by reading the official spaCy documentation.
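As a quick illustration of two of these features, here is a minimal sketch of sentence boundary detection and document similarity. Similarity needs a model with word vectors, so this sketch assumes the medium English model has been installed with python -m spacy download en_core_web_md:

import spacy

# Assumes en_core_web_md is installed; the small "sm" models ship
# without word vectors, so similarity scores would be unreliable
nlp = spacy.load("en_core_web_md")

doc = nlp("spaCy is fast. It is also easy to use.")

# Sentence boundary detection
for sent in doc.sents:
    print(sent.text)

# Document similarity based on word vectors
other = nlp("spaCy is quick and simple.")
print(doc.similarity(other))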

4. Text Classification

spaCy's text classification relies mainly on its built-in TextCategorizer component. First, we need to create a training dataset, and then use spaCy to train on it.

When dealing with Chinese text classification, spaCy will try to load the pkuseg Chinese word segmentation library, so you need to install pkuseg and its pre-trained model.

  1. First, make sure the pkuseg library is installed. If not, run the following command to install it:

pip install pkuseg

  2. Then, download the pre-trained pkuseg model from the releases page of the pkuseg-python project on GitHub.

  3. After the download is complete, extract the pre-trained model files into a directory. Suppose you unpacked them into a directory named pkuseg_model.

  4. Next, we need to create and configure the tokenizer:

import spacy
from spacy.lang.zh import Chinese, try_pkuseg_import

class CustomChineseTokenizer(Chinese):
    def initialize(self, get_examples, **kwargs):
        # Load the unpacked pkuseg model from the pkuseg_model directory
        self.pkuseg_seg = try_pkuseg_import(pkuseg_model="pkuseg_model", pkuseg_user_dict=None)
        return super().initialize(get_examples, **kwargs)

nlp = CustomChineseTokenizer()

Here is a complete example showing how to use spaCy for text classification:

import spacy
from spacy.util import minibatch, compounding
from spacy.pipeline.textcat import Config, single_label_cnn_config
import random

from spacy.lang.zh import Chinese, try_pkuseg_import
from spacy.training import Example


# Create the training data
train_data = [
    ("这是一个好消息。", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("我很高兴。", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("这是一个糟糕的经历。", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ("我很沮丧。", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}})
]

# Build the Chinese pipeline with the custom pkuseg tokenizer
class CustomChineseTokenizer(Chinese):
    def initialize(self, get_examples, **kwargs):
        self.pkuseg_seg = try_pkuseg_import(pkuseg_model="pkuseg_model", pkuseg_user_dict=None)
        return super().initialize(get_examples, **kwargs)

nlp = CustomChineseTokenizer()

# Add the TextCategorizer component
config = Config().from_str(single_label_cnn_config)
if "textcat" not in nlp.pipe_names:
    textcat = nlp.add_pipe("textcat", config=config, last=True)
else:
    textcat = nlp.get_pipe("textcat")

# Add the labels
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# Train the model
n_iter = 20
random.seed(1)
spacy.util.fix_random_seed(1)

optimizer = nlp.begin_training()
batch_sizes = compounding(4.0, 32.0, 1.001)
# Training loop
for i in range(n_iter):
    losses = {}
    batches = minibatch(train_data, size=batch_sizes)
    for batch in batches:
        examples = []
        for text, annotations in batch:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            examples.append(example)
        nlp.update(examples, sgd=optimizer, drop=0.2, losses=losses)
    print(f"Iteration {i + 1}, loss: {losses['textcat']}")

# Test the classifier
test_text = "这个消息让人高兴。"
doc = nlp(test_text)
print(f"Text: {test_text}")
for label, score in doc.cats.items():
    print(f"{label}: {score}")

This example starts by creating a simple training dataset containing sentences with positive and negative sentiment. Next, we built the Chinese pipeline and added the TextCategorizer component. We then added labels for positive and negative sentiment and trained for 20 iterations.

After training, we use a test text to see the classifier's predictions.

Iteration 1, loss: 0.25
Iteration 2, loss: 0.24783627688884735
Iteration 3, loss: 0.243463397026062
Iteration 4, loss: 0.23672938346862793
Iteration 5, loss: 0.23108384013175964
Iteration 6, loss: 0.22520506381988525
Iteration 7, loss: 0.20326930284500122
Iteration 8, loss: 0.16069942712783813
Iteration 9, loss: 0.15615935623645782
Iteration 10, loss: 0.12908345460891724
Iteration 11, loss: 0.1116722971200943
Iteration 12, loss: 0.09159457683563232
Iteration 13, loss: 0.07003745436668396
Iteration 14, loss: 0.05477789789438248
Iteration 15, loss: 0.03252990543842316
Iteration 16, loss: 0.02523144707083702
Iteration 17, loss: 0.00921378843486309
Iteration 18, loss: 0.004524371586740017
Iteration 19, loss: 0.003985351417213678
Iteration 20, loss: 0.0013836633879691362
Text: 这个消息让人高兴。
POSITIVE: 0.9919237494468689
NEGATIVE: 0.008076261729001999

Note that this example uses very simple training data and few iterations, so the classifier's performance may not be high. In practice, you need a larger training dataset and more iterations for better performance.

5. Information Extraction

spaCy's named entity recognition can drive simple information extraction. The example below pulls a company name and its headquarters location out of a sentence, using the entity labels predicted by the pre-trained English model.

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Apple Inc. is an American multinational technology company headquartered in Cupertino, California."
doc = nlp(text)

# Extract the company name and the headquarters location
# (if several GPE entities are found, the last one wins)
org, location = None, None
for ent in doc.ents:
    if ent.label_ == "ORG":
        org = ent.text
    elif ent.label_ == "GPE":
        location = ent.text

print(f"{org} is located in {location}.")

6. Train a custom model

Refer to the custom model training guide in the official spaCy documentation.
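As a minimal sketch of what that guide covers, the example below updates the pre-trained English NER component with a few hand-labeled examples, using the same Example-based update loop as the text classification section. The GADGET label, the sentences, and the character offsets are made up for illustration:

import random

import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("GADGET")  # hypothetical custom entity label

# Hypothetical training data: (text, {"entities": [(start, end, label)]})
train_data = [
    ("I just bought a FooPhone.", {"entities": [(16, 24, "GADGET")]}),
    ("The FooPhone has a great screen.", {"entities": [(4, 12, "GADGET")]}),
]

# Update only the NER component, leaving the other pipes untouched
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.select_pipes(disable=other_pipes):
    optimizer = nlp.resume_training()
    for i in range(10):
        random.shuffle(train_data)
        losses = {}
        for text, annotations in train_data:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, drop=0.2, losses=losses)
        print(f"Iteration {i + 1}, losses: {losses}")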

7. Comparison of spaCy with other NLP libraries

  • spaCy vs. NLTK: NLTK is a powerful NLP library, but spaCy is more performant and easier to integrate into a production environment (see the sketch after this list).
  • spaCy vs. TextBlob: TextBlob provides some basic NLP functionality, but spaCy is more focused on high performance and production-grade applications.
  • spaCy vs. Gensim: Gensim is mainly used for topic modeling and document similarity analysis, while spaCy provides a wider range of NLP functions.
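To make the NLTK comparison concrete, here is a small sketch of the same tokenization task in both libraries; it assumes nltk is installed and its punkt tokenizer data has been downloaded via nltk.download("punkt"):

import nltk
import spacy

text = "spaCy is built for production use."

# NLTK: a standalone function call (assumes the punkt data is downloaded)
print(nltk.word_tokenize(text))

# spaCy: one pipeline call yields tokens plus POS tags, entities, etc.
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([token.text for token in doc])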

8. Summary

spaCy is a high-performance, easy-to-use natural language processing library that handles multiple languages and provides many pre-trained models and extensible components. This article introduced spaCy's basic functions and applications, and showed how to use spaCy for various NLP tasks through practical code examples.

9. References

  1. Official spaCy documentation
  2. spaCy GitHub repository
