Artificial Intelligence (PyTorch) Model Building 8: Using PyTorch to Build a BiLSTM+CRF Model for Simple Named Entity Recognition

Hello everyone, I am Weixue AI. Today I will introduce part 8 of building AI models with PyTorch: using PyTorch to build a BiLSTM+CRF model for simple named entity recognition. The BiLSTM+CRF model is a commonly used sequence labeling algorithm that can be applied to tasks such as part-of-speech tagging, word segmentation, and named entity recognition. This article builds a BiLSTM+CRF model with PyTorch, provides a data sample, and demonstrates the model's training and prediction process on a simple named entity recognition (NER) task. The article is divided into the following sections:

1. Introduction to the BiLSTM+CRF model
2. Mathematical principles of the BiLSTM+CRF model
3. Data preparation
4. Model building
5. Training and evaluation
6. Prediction
7. Summary

1. Introduction to the BiLSTM+CRF model

The BiLSTM+CRF model combines two techniques: a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF). The BiLSTM captures contextual information in the sequence, while the CRF models the dependencies between labels. Concretely, the BiLSTM produces a feature vector for each element of the input sequence, and these feature vectors are fed into the CRF layer, which assigns a label to each element. Combining BiLSTM and CRF lets the model account for correlations between adjacent labels, as a CRF does, while retaining the feature extraction and fitting capabilities of an LSTM.

2. Mathematical principles of the BiLSTM+CRF model

Suppose we have a sequence $\boldsymbol{x} = (x_1, x_2, \dots, x_n)$, where $x_i$ is the input feature at position $i$. We want to label each position, that is, for each position $i$ we predict a label $y_i$. The label sequence is $\boldsymbol{y} = (y_1, y_2, \dots, y_n)$, where $y_i \in \mathcal{L}$ and $\mathcal{L}$ is the set of label categories; $\mathcal{Y}$ denotes the set of all possible label sequences.

The BiLSTM extracts features from the input sequence. It consists of LSTMs running in two directions, processing the input from front to back and from back to front, respectively. At time step $t$, the output of the BiLSTM is $h_t \in \mathbb{R}^{2d}$, where $d$ is the hidden state dimension of each LSTM. Specifically, the forward LSTM processes the input sequence $\boldsymbol{x}$ and outputs the hidden state sequence $\overrightarrow{h} = (\overrightarrow{h_1}, \overrightarrow{h_2}, \dots, \overrightarrow{h_n})$, where $\overrightarrow{h_t}$ is the hidden state of the forward LSTM at time step $t$; the backward LSTM processes $\boldsymbol{x}$ in reverse and outputs the hidden state sequence $\overleftarrow{h} = (\overleftarrow{h_1}, \overleftarrow{h_2}, \dots, \overleftarrow{h_n})$, where $\overleftarrow{h_t}$ is the hidden state of the backward LSTM at time step $t$. The feature at each position $i$ is then $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$, where $[\cdot\,;\cdot]$ denotes vector concatenation.
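As a quick sanity check (a minimal sketch with made-up dimensions), PyTorch's nn.LSTM with bidirectional=True performs exactly this concatenation, so each output vector has dimension $2d$:

import torch
import torch.nn as nn

d = 16                         # hidden state dimension of each directional LSTM
embeds = torch.randn(5, 1, 8)  # a toy embedded sequence: (seq_len=5, batch=1, embedding_dim=8)

bilstm = nn.LSTM(input_size=8, hidden_size=d, bidirectional=True)
h, _ = bilstm(embeds)

# Each h_t is the concatenation [forward_t ; backward_t], so its size is 2d.
print(h.shape)  # torch.Size([5, 1, 32])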

The CRF models the relationships between labels and performs global optimization. The CRF defines a distribution $P(\boldsymbol{y}|\boldsymbol{x})$ over $\mathcal{Y}$, where $\boldsymbol{y} = (y_1, y_2, \dots, y_n)$ is a label sequence. Specifically, the CRF decomposes the probability of a label sequence into a product of per-position and adjacent-pair factors, namely

$$P(\boldsymbol{y}|\boldsymbol{x}) = \prod_{i=1}^{n}\psi_i(y_i|\boldsymbol{x}) \prod_{i=1}^{n-1}\psi_{i,i+1}(y_i,y_{i+1}|\boldsymbol{x})$$

where $\psi_i(y_i|\boldsymbol{x})$ is the probability of predicting label $y_i$ at position $i$, and $\psi_{i,i+1}(y_i,y_{i+1}|\boldsymbol{x})$ is the joint probability of predicting labels $y_i$ and $y_{i+1}$ at adjacent positions. These conditional and joint probabilities can be modeled by a neural network whose input is the feature representation $h_i$ of position $i$.

The global optimization problem of the CRF model can be solved by maximizing the log-likelihood, namely

$$\max_{\boldsymbol{y}} \log P(\boldsymbol{y}|\boldsymbol{x}) = \max_{\boldsymbol{y}} \left[ \sum_{i=1}^{n}\log\psi_i(y_i|\boldsymbol{x}) + \sum_{i=1}^{n-1}\log\psi_{i,i+1}(y_i,y_{i+1}|\boldsymbol{x}) \right]$$

where the maximum ranges over all possible label sequences $\boldsymbol{y}$. A dynamic programming algorithm such as the Viterbi algorithm can be used to find the globally optimal label sequence.
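To make the dynamic programming concrete, here is a minimal Viterbi decoding sketch (an illustrative implementation in log space, separate from the CRF library used later in this article):

import torch

def viterbi_decode(emissions, transitions):
    # emissions: (seq_len, num_tags) log-scores log psi_i(y_i|x);
    # transitions: (num_tags, num_tags) log-scores log psi_{i,i+1}(y_i, y_{i+1}).
    seq_len, num_tags = emissions.shape
    score = emissions[0]  # best score of each tag at position 0
    backpointers = []
    for i in range(1, seq_len):
        # total[j, k]: best path ending in tag j at i-1, then moving to tag k at i
        total = score.unsqueeze(1) + transitions + emissions[i].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    # Trace the best path backwards from the highest-scoring final tag.
    best_tag = score.argmax().item()
    path = [best_tag]
    for best_prev in reversed(backpointers):
        best_tag = best_prev[best_tag].item()
        path.append(best_tag)
    return list(reversed(path))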

In summary, the mathematical principle of the BiLSTM+CRF model can be expressed as:

$$P(\boldsymbol{y}|\boldsymbol{x}) = \prod_{i=1}^{n}\psi_i(y_i|\boldsymbol{x}) \prod_{i=1}^{n-1}\psi_{i,i+1}(y_i,y_{i+1}|\boldsymbol{x})$$

where

$$\psi_i(y_i|\boldsymbol{x}) = \frac{\exp(\boldsymbol{W}_o^{T}\boldsymbol{h}_i + \boldsymbol{b}_o^{T}\boldsymbol{y}_i)}{\sum_{y_i'\in\mathcal{L}}\exp(\boldsymbol{W}_o^{T}\boldsymbol{h}_i + \boldsymbol{b}_o^{T}\boldsymbol{y}_i')}$$

$$\psi_{i,i+1}(y_i,y_{i+1}|\boldsymbol{x}) = \frac{\exp(\boldsymbol{W}_t^{T}\boldsymbol{y}_{i,i+1})}{\sum_{y_i'\in\mathcal{L}}\sum_{y_{i+1}'\in\mathcal{L}}\exp(\boldsymbol{W}_t^{T}\boldsymbol{y}_{i,i+1}')}$$

where $\boldsymbol{W}_o$ and $\boldsymbol{b}_o$ are the parameters of the output layer, $\boldsymbol{W}_t$ is the transition matrix, $\boldsymbol{h}_i$ is the feature representation of position $i$, $\boldsymbol{y}_i$ is the label representation of position $i$, and $\boldsymbol{y}_{i,i+1}$ is the joint label representation of positions $i$ and $i+1$.
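As a concrete check (a toy computation with made-up dimensions), the emission potential $\psi_i$ is just a softmax over per-tag scores computed from $\boldsymbol{h}_i$:

import torch
import torch.nn as nn

num_tags, feat_dim = 5, 32
h_i = torch.randn(feat_dim)                   # feature of position i from the BiLSTM
output_layer = nn.Linear(feat_dim, num_tags)  # holds the parameters W_o and b_o

scores = output_layer(h_i)             # one score per label in L
psi_i = torch.softmax(scores, dim=0)   # psi_i(y_i | x) for every candidate y_i
print(psi_i.sum())                     # tensor(1.) up to rounding: a distribution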


3. Data preparation

Below I will use a simple named entity recognition (NER) task to demonstrate the model's training and prediction process. The dataset consists of sentences, and each word in a sentence is labeled as "B-PER" (beginning of a person name), "I-PER" (inside a person name), "B-LOC" (beginning of a place name), "I-LOC" (inside a place name), or "O" (other).

Data sample:

John B-PER
lives O
in O
New B-LOC
York I-LOC
. O
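Before building the model, this sample needs to be converted into index tensors. Here is one way to do it (helpers such as training_data, word_to_ix, and prepare_sequence are illustrative names, not part of a library):

import torch

training_data = [(
    ["John", "lives", "in", "New", "York", "."],
    ["B-PER", "O", "O", "B-LOC", "I-LOC", "O"],
)]

word_to_ix = {}
for sentence, _ in training_data:
    for word in sentence:
        word_to_ix.setdefault(word, len(word_to_ix))

tag_to_ix = {"B-PER": 0, "I-PER": 1, "B-LOC": 2, "I-LOC": 3, "O": 4}

def prepare_sequence(seq, to_ix):
    # Map a list of tokens (or tags) to a LongTensor of indices.
    return torch.tensor([to_ix[t] for t in seq], dtype=torch.long)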

4. Model building

First, we need to install PyTorch together with the pytorch-crf package, which provides the CRF layer used below:

pip install torch pytorch-crf

Next, we will use PyTorch to build the BiLSTM+CRF model. The complete model code is as follows:

import torch
import torch.nn as nn
import torch.optim as optim

from torchcrf import CRF  # provided by the pytorch-crf package


class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        # Each direction gets hidden_dim // 2 units, so the concatenated
        # BiLSTM output has dimension hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim // 2,
                            num_layers=1, bidirectional=True)

        # Projects BiLSTM features to per-tag emission scores.
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)
        self.crf = CRF(self.tagset_size, batch_first=True)

    def forward(self, sentence):
        embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)
        lstm_out, _ = self.lstm(embeds)
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats  # (seq_len, tagset_size) emission scores

    def loss(self, sentence, tags):
        feats = self.forward(sentence)
        # pytorch-crf returns the log-likelihood, so negate it for a loss.
        return -self.crf(feats.unsqueeze(0), tags.unsqueeze(0))

    def predict(self, sentence):
        feats = self.forward(sentence)
        # decode returns one best tag sequence per batch element.
        return self.crf.decode(feats.unsqueeze(0))[0]
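The model can then be instantiated as follows (the dimensions are illustrative, and word_to_ix/tag_to_ix are the mappings sketched in section 3; hidden_dim should be even because each LSTM direction receives half of it):

EMBEDDING_DIM = 32
HIDDEN_DIM = 64

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)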

5. Training and Evaluation

Next, we will train the model using the training data and print the loss and accuracy after each epoch.

def train(model, optimizer, data):
    for epoch in range(10):
        total_loss = 0
        total_correct = 0
        total_count = 0
        for sentence, tags in data:
            model.zero_grad()
            loss = model.loss(sentence, tags)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

            # model.predict returns a list of tag indices for this sentence.
            prediction = model.predict(sentence)
            total_correct += sum(1 for p, t in zip(prediction, tags) if p == t.item())
            total_count += len(tags)

        print(f"Epoch {epoch + 1}: "
              f"Loss = {total_loss / len(data)}, "
              f"Accuracy = {total_correct / total_count}")
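Putting the pieces together, an end-to-end training run on the toy data might look like this (a sketch that reuses the illustrative training_data, prepare_sequence, model, and optimizer defined above):

data = [(prepare_sequence(sentence, word_to_ix),
         prepare_sequence(tags, tag_to_ix))
        for sentence, tags in training_data]

train(model, optimizer, data)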

6. Prediction

Finally, we will use the trained model to make predictions on new sentences.

def predict(model, sentence):
    # model.predict already returns the best tag-index sequence,
    # decoded by the CRF layer with the Viterbi algorithm.
    return model.predict(sentence)
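For example (an illustrative call that reuses the prepare_sequence helper and mappings sketched in section 3), the predicted indices can be mapped back to tag names:

ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}

test_sentence = prepare_sequence(["John", "lives", "in", "New", "York", "."], word_to_ix)
print([ix_to_tag[ix] for ix in predict(model, test_sentence)])
# e.g. ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O'] after sufficient training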

7. Summary

This article introduced how to build a BiLSTM+CRF model with PyTorch and demonstrated the model's training and prediction process on a simple named entity recognition (NER) task. I hope this article helps you understand the principles of the BiLSTM+CRF model and serves as a reference for your own projects.

Stay tuned for more exciting model building and applications!
