spaCy: Processing Pipelines

spaCy learning record



Preface

I used the spaCy library in my studies today. I have run into this library before, but I never recorded the problems systematically. I will use this article series as notes, so that if I run into the same problems again I can look them up here.

1. What is spaCy?

spaCy is an industrial-strength natural language processing toolkit; for details, please check the link here. In this post I will write down the first part, on processing pipelines, and cover the rest later.

2. Processing Pipelines

1. Concept introduction

When you call nlp on a text, spaCy first tokenizes it to produce a Doc object. The Doc is then processed in several different steps; this is also referred to as the processing pipeline. The pipeline used by the default models consists of a tagger, a parser, and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.
[Figure: the processing pipeline - text goes through the tokenizer, then the tagger, parser, and ner components, producing a Doc]
The processing pipeline always depends on the statistical model and its capabilities. For example, if a model only contains the data needed to predict entity labels, the pipeline only needs the entity recognizer component. This is why each model specifies the pipeline to use in its metadata.

This is an example with a simple list of components:

"pipeline": ["tagger", "parser", "ner"]

Note: the components above are independent of each other, so if a component you write yourself has no dependencies on them, you do not need to include them all.

The tokenizer is a special component: it is not part of the regular pipeline and does not show up in nlp.pipe_names. It simply turns the text into a Doc, which all other components then operate on. You can also use a custom tokenizer, since nlp.tokenizer is writable.
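As a quick check of both points, here is a minimal sketch, assuming the en_core_web_sm model is installed; the whitespace tokenizer below is only a hypothetical illustration of the fact that nlp.tokenizer is writable:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # ['tagger', 'parser', 'ner'] - the tokenizer is not listed

# Hypothetical custom tokenizer that only splits on spaces
class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)  # nlp.tokenizer is writable
doc = nlp("This is a text")
print([token.text for token in doc])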

2. Processing text

When you call nlp on a text, spaCy tokenizes it and then calls each component on the Doc in order. The processed document is then returned so you can work with it.
The code is as follows (example):

doc = nlp("This is a text")

Here, doc is the processed Doc for the text "This is a text".
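Because the pipeline has already run, the annotations are available on the returned Doc. A minimal sketch, assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a text")

# The pipeline components have already run, so their annotations are on the Doc
print([(token.text, token.pos_) for token in doc])   # part-of-speech tags from the tagger
print([(ent.text, ent.label_) for ent in doc.ents])  # entities from the ner component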

When processing large volumes of text, it is usually more efficient to let the statistical models work on batches of texts. spaCy's nlp.pipe method takes an iterable of texts and yields processed Doc objects; the batching is done internally.

The following shows two ways of processing the texts; the second is more efficient:

texts = ["This is a text", "These are lots of texts", "..."]
- docs = [nlp(text) for text in texts]
+ docs = list(nlp.pipe(texts))
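nlp.pipe also takes a batch_size argument if you want to control how many texts are buffered per batch. A minimal sketch, reusing nlp and texts from above; the batch size of 50 is an arbitrary value for illustration:

# nlp.pipe yields Doc objects lazily, batching the texts internally
for doc in nlp.pipe(texts, batch_size=50):
    print(len(doc))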

Here is a named entity recognition example to illustrate:

import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
for doc in nlp.pipe(texts, disable=["tagger", "parser"]):
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])

-----------------------------------------------------------------------------------
[('$9.4 million', 'MONEY'), ('the prior year', 'DATE'), ('$2.7 million', 'MONEY')]
[('twelve billion dollars', 'MONEY'), ('1b', 'MONEY')]
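If you never need certain components at all, you can also disable them when loading the model rather than on every call to nlp.pipe. A minimal sketch:

import spacy

# Load the model without the tagger and parser; only the ner component runs
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
doc = nlp("Net income was $9.4 million compared to the prior year of $2.7 million.")
print([(ent.text, ent.label_) for ent in doc.ents])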

3. How pipelines work

With spaCy it is easy to create custom pipelines from reusable components; these include spaCy's default tagger, parser and entity recognizer, as well as your own custom processing functions. A pipeline component can be added to an existing nlp object, specified when initializing a Language class, or defined in a model package.

When you load a model, spaCy first consults the model's meta.json. The metadata typically includes the model details, the ID of the language class and an optional list of pipeline components. spaCy then does the following (a rough code sketch follows the list):

  1. Load the language class and data for the given ID via get_lang_class and initialize it. The Language class contains the shared vocabulary, the tokenization rules and the language-specific annotation scheme.
  2. Iterate over the pipeline names and create each component using create_pipe, which looks them up in Language.factories.
  3. Add each pipeline component to the pipeline in order, using add_pipe.
  4. Make the model data available to the Language class by calling from_disk with the path to the model data directory.
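Roughly, these steps correspond to something like the sketch below. This is a simplified illustration, not spaCy's actual implementation, and the data path is a hypothetical placeholder:

from spacy.util import get_lang_class

lang = "en"
pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/en_core_web_sm/data"  # hypothetical path to the model data directory

cls = get_lang_class(lang)   # 1. get the Language class for the ID, e.g. English
nlp = cls()                  #    initialize it with shared vocab and tokenization rules
for name in pipeline:
    component = nlp.create_pipe(name)  # 2. create each component from Language.factories
    nlp.add_pipe(component)            # 3. add it to the pipeline in order
nlp.from_disk(data_path)     # 4. load the model data from the data directory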

Therefore when you use

nlp = spacy.load("en_core_web_sm")

…the meta.json of this model tells spaCy to use the language "en" and the pipeline ["tagger", "parser", "ner"]. spaCy then initializes spacy.lang.en.English, creates each pipeline component and adds it to the processing pipeline. It then loads the model's data from its data directory and returns the modified Language class for you to use as the nlp object.

META.JSON (EXCERPT)
{
  "lang": "en",
  "name": "core_web_sm",
  "description": "Example model for spaCy",
  "pipeline": ["tagger", "parser", "ner"]
}
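After loading, the metadata and the resulting pipeline can be inspected on the nlp object. A minimal sketch:

import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.meta["lang"], nlp.meta["pipeline"])  # metadata read from meta.json
print(nlp.pipe_names)                          # ['tagger', 'parser', 'ner']
print(nlp.pipeline)                            # list of (name, component) tuples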

To be continued…

Summary

Today I used the spaCy library and wrote up this first summary. When I have time I will translate some of the other guides. The page I read today is here: https://spacy.io/usage/processing-pipelines#plugins


Origin blog.csdn.net/qq_42388742/article/details/112095021