PyTorch - XLNet pre-trained model and named entity recognition

Introduction

Previously we introduced the BERT and GPT-2 pre-trained models and used them for text classification and text generation, respectively. This time we introduce the XLNet pre-trained model and use it for named entity recognition.

Knowledge Points

  • XLNet's improvements over BERT and GPT-2
  • XLNet model structure
  • Using XLNet for named entity recognition

Following the BERT model, the Google team presented the XLNet model in mid-2019. XLNet outperformed BERT on as many as 20 tasks, and on tasks such as question answering, natural language inference, sentiment analysis, and document ranking it surpassed the best results available at the time.

Here are XLNet's test results on GLUE:

Model        MNLI  QNLI  QQP   RTE   SST-2  MRPC  CoLA  STS-B
BERT-Large   86.6  92.3  91.3  70.4  93.2   88.0  60.6  90.0
XLNet-Base   86.8  91.7  91.4  74.0  94.7   88.2  60.2  89.5
XLNet-Large  89.8  93.9  91.8  83.8  95.6   89.2  63.6  91.8

XLNet's improvements over BERT and GPT-2

Shortcomings of BERT

XLNet can be said to be an enhanced version of BERT, but it differs from BERT in many ways. Below, we explain these differences in detail.

As mentioned in the first session, BERT is an autoencoding model (Autoencoding); in other words, BERT is trained with a masked language model (Masked Language Model) objective. When training an autoencoding model, some words of the input sentence are randomly replaced with the [MASK] label, and the model is then trained to predict the masked words.

We can see two drawbacks in this process:

  1. It wrongly assumes that the masked words are independent of one another.
  2. The inputs to pre-training and fine-tuning are inconsistent.

Drawback 1 means that, because several words are covered by [MASK] at prediction time, BERT ignores the influence of the other masked words when it predicts one masked word. That is, it assumes that the masked words are unrelated to each other, and we know this assumption is clearly wrong.

Drawback 2 means that in pre-training we use the [MASK] label to cover some of the words, but when fine-tuning the pre-trained model we never use this label, which leads to a mismatch between the pre-training process and the fine-tuning process.
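
To make drawback 1 concrete, the following short derivation follows the kind of example used in the XLNet paper. The sentence and the choice of masked words are only an illustration: suppose the sentence is [New, York, is, a, city] and the two words to predict are "New" and "York".

```latex
% BERT: both targets are replaced by [MASK] and predicted independently.
\mathcal{J}_{\text{BERT}}
  = \log p(\text{New} \mid \text{is a city})
  + \log p(\text{York} \mid \text{is a city})

% An autoregressive factorization (here: "is a city", then New, then York)
% keeps the dependency between the two targets.
\mathcal{J}_{\text{AR}}
  = \log p(\text{New} \mid \text{is a city})
  + \log p(\text{York} \mid \text{New}, \text{is a city})
```

The second objective lets the prediction of "York" depend on "New", which is exactly the dependency that the independence assumption discards.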

Shortcomings of GPT-2

As introduced last time, GPT-2 is an autoregressive model (Autoregressive); that is, it predicts a word using either the text before it or the text after it as context.

Mathematically, given a text sequence \( x = (x_1, \ldots, x_T) \), the autoregressive model computes the forward product \( p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t}) \), where \( x_{<t} \) denotes the words before \( x_t \). Similarly, backward prediction can be expressed as the product \( p(x) = \prod_{t=1}^{T} p(x_t \mid x_{>t}) \), where \( x_{>t} \) denotes the words after \( x_t \).
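
As a small worked example of the forward product, the sketch below sums log-probabilities from left to right. It makes a simplifying assumption that is not part of GPT-2: each word depends only on the previous word (a bigram table), and the probabilities in `BIGRAM` are made up for illustration.

```python
import math

# Hypothetical conditional probabilities p(next | previous); "<s>" marks the start.
BIGRAM = {
    ("<s>", "new"): 0.5,
    ("new", "york"): 0.8,
    ("york", "city"): 0.25,
}

def sequence_log_prob(tokens, table):
    """Sum log p(x_t | x_{t-1}) from left to right (chain rule, one-word history)."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(table[(prev, tok)])
        prev = tok
    return log_p

log_p = sequence_log_prob(["new", "york", "city"], BIGRAM)
prob = math.exp(log_p)  # 0.5 * 0.8 * 0.25
print(prob)
```

Working in log space turns the product into a sum, which is how such likelihoods are usually accumulated in practice.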

But the obvious drawback of an autoregressive model is that it can only consider context in a single direction. Many downstream tasks, such as natural language understanding, need context from both the forward and backward directions at the same time.

Having introduced the inherent shortcomings of the autoencoding models represented by BERT and the autoregressive models represented by GPT-2, we now explain how XLNet improves on both types of model.

XLNet's improvements

When designing XLNet, the researchers set out to overcome the shortcomings of both autoencoding and autoregressive models while combining their strengths, and designed the permutation language model (Permutation Language Model). As shown below, the idea of the permutation language model is this: since an autoregressive model can only obtain context in one direction, we change the order in which the words are arranged, so that context from both sides of a word is turned into one-directional context.

[Figure: permutation language modeling, predicting word 3 under different orderings of the words]

In the example in the top right of the figure, the order becomes "2 -> 4 -> 3 -> 1" and the training target is to predict word 3, so the model receives the information of word 2 and word 4 as input. In this way the model keeps the characteristics of an autoregressive model while still learning context from both directions.

Note that changing the order does not actually change the position information of the words: the word in the third position still carries the position information of the third position.
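
The effect of an ordering on the available context can be sketched in a few lines. The helper name `visible_context` is made up for this illustration; it simply lists, for each word, the words that come earlier in the chosen order.

```python
def visible_context(order):
    """Map each word position to the positions that precede it in the given order."""
    context = {}
    seen = []
    for pos in order:
        context[pos] = sorted(seen)
        seen.append(pos)
    return context

ctx = visible_context([2, 4, 3, 1])
for pos in sorted(ctx):
    print(f"word {pos} is predicted from words {ctx[pos]}")
```

For the order "2 -> 4 -> 3 -> 1", word 3 is predicted from words 2 and 4, one on each side of it in the original sentence, even though the prediction itself is still one-directional along the order.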

In the actual implementation, the change of order is achieved by applying a mask in the attention mechanism; see the part of the figure below that is inside the red box.

[Figure: two-stream self-attention, with the attention masks marked by a red box]

The other parts of this figure are explained below; here we only discuss the part in the red box. The second row of the upper half shows the mask applied to each word when predicting word 2, given the order "3 -> 2 -> 4 -> 1". The mask values for words 2 and 3 should therefore be 1, meaning the model can see them, and the values for the remaining words should be 0. In the figure, the second and third entries of the second row are indeed marked red. The other rows of the upper half work the same way.

The lower half differs from the upper half in that its mask also prevents the model from seeing the word being predicted; for example, when predicting word 2, the model cannot see word 2 itself.
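
The two masks can be reconstructed from an ordering with a short sketch. This is an illustration under the convention used in the text (1 means visible, 0 means hidden); the function name `permutation_masks` is made up here, and library APIs may use the opposite convention.

```python
def permutation_masks(order):
    """Build the upper-half and lower-half masks for one ordering of the words.

    Rows are the word being predicted, columns the word being attended to,
    both indexed by original position (word 1 is index 0).
    1 = visible, 0 = hidden.
    """
    n = len(order)
    rank = {word: step for step, word in enumerate(order)}
    upper = [[0] * n for _ in range(n)]  # may see itself and earlier words
    lower = [[0] * n for _ in range(n)]  # may see earlier words only
    for target in range(1, n + 1):
        for other in range(1, n + 1):
            if rank[other] <= rank[target]:
                upper[target - 1][other - 1] = 1
            if rank[other] < rank[target]:
                lower[target - 1][other - 1] = 1
    return upper, lower

upper_mask, lower_mask = permutation_masks([3, 2, 4, 1])
for name, mask in (("upper half", upper_mask), ("lower half", lower_mask)):
    print(name)
    for row in mask:
        print(" ", row)
```

With the order "3 -> 2 -> 4 -> 1", the second row of the upper mask has ones for words 2 and 3 only, and the second row of the lower mask has a one for word 3 only, matching the description above.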

XLNet model structure

SentencePiece tokenization

The XLNet model uses SentencePiece for tokenization. SentencePiece is Google's open-source natural language processing toolkit. Its principle is to count how often fragments occur: a fragment that occurs frequently is treated as a word.

What makes SentencePiece distinctive is that it does not rely on prior pre-tokenization but learns directly from the given training set. It also does not perform differently across languages, because it treats every string as a sequence of Unicode characters.

One could say that SentencePiece addresses the poor performance on Chinese word segmentation of the WordPiece method used by BERT.

Two-stream self-attention mechanism

In overall structure, XLNet differs little from BERT; both are based on the Transformer. However, XLNet uses a special attention mechanism, two-stream self-attention (Two-Stream Self-Attention). It uses two representations for each position: the content representation and the query representation.

The content representation encodes the preceding context and also contains the current word. The query representation encodes the preceding context without the current word, together with the position of the current word; it cannot access the content of the current word.

The content representation and the query representation form two information streams. Both streams are passed upward layer by layer, and the final output is taken from the query stream. As the red lines in the lower part of the figure show, the final predictions are output in the same order as the input words.

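[Figure: two-stream self-attention: (a) content stream attention, (b) query stream attention, (c) overview with the attention masks]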

Let us look at each part of this figure in detail. Part (a) shows content stream attention (Content stream attention), and part (b) shows query stream attention (Query stream attention). In (b), only the query representation of word 1 is input, whereas in (a) the content representation of word 1 is input as well. Part (c) shows how the model applies two-stream self-attention, and the right side of (c) shows the attention masks described above. Besides realizing the change of order, the masks also ensure that the model can see the current word in the content stream but cannot see it in the query stream.

In addition to the methods above, XLNet also uses partial prediction. An autoregressive language model predicts from the first word to the last, but at the start of the prediction very little of the sentence is known, so the model is hard to converge. In practice, only the last 1/K of the words in the ordering are predicted, and the first 1 - 1/K are used as context.
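
Partial prediction can be sketched as a split of the ordering into a context part and a target part. The rounding rule used here (`len(order) // k`, with at least one target) is an assumption made for this illustration, and `split_partial_prediction` is a made-up helper name.

```python
import random

def split_partial_prediction(order, k):
    """Split an ordering into context (first ~1 - 1/k) and targets (last ~1/k)."""
    num_targets = max(1, len(order) // k)
    cut = len(order) - num_targets
    return order[:cut], order[cut:]

rng = random.Random(0)       # fixed seed so the sketch is reproducible
order = list(range(1, 7))    # word positions 1..6
rng.shuffle(order)           # one randomly chosen ordering

context, targets = split_partial_prediction(order, k=3)
print("order:  ", order)
print("context:", context)
print("targets:", targets)
```

With six words and K = 3, only the last two words of the ordering are prediction targets, so every target has at least four words of context.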

Named entity recognition with XLNet

Above, we described how XLNet improves on BERT and GPT-2, as well as some special methods used by the XLNet model. Next we will use the pre-trained XLNet model for named entity recognition. Named entity recognition (Named Entity Recognition, NER for short) means recognizing entities with special meaning in text, including place names, organization names, and other proper nouns. Named entity recognition is an important step in information extraction and is widely used in natural language processing.

The training and test data this time come from CoNLL-2003. The CoNLL-2003 dataset is based on a news corpus and marks four kinds of entities: organizations, locations, person names, and entities that belong to none of these three categories. They are named ORG, LOC, PER, and MISC. The first word of an entity is labeled B-ORG, B-LOC, B-PER, or B-MISC, the following words of the entity are labeled I-ORG, I-LOC, I-PER, or I-MISC, and a word labeled O does not belong to any entity.
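
To show how the B-/I-/O labels encode entities, here is a minimal sketch that groups tagged words back into entities. The sentence and the helper `bio_to_spans` are invented for illustration and are not taken from the dataset; stray I- tags that do not continue an entity of the same type are simply ignored.

```python
def bio_to_spans(words, tags):
    """Group B-/I- tagged words into (entity_type, text) pairs."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close the previous one first
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the current entity
            current_words.append(word)
        else:
            # O (or an inconsistent I- tag) ends the current entity
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans

words = ["John", "Smith", "works", "at", "Acme", "Corp", "in", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC", "O"]
spans = bio_to_spans(words, tags)
print(spans)
```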

Here we use the XLNetTokenizer and XLNetModel classes packaged in the PyTorch-Transformers model library to apply the pre-trained XLNet model in practice. First, install PyTorch-Transformers.

!pip install pytorch-transformers==1.0  # Install PyTorch-Transformers

Because the original data format is fairly complex, we have cleaned the data in advance. The data has been labeled according to the following labels.

# Mapping from label strings to label ids
label_dict = {'O':0, 'B-ORG':1, 'I-ORG':2, 'B-PER':3, 'I-PER':4, 'B-LOC':5, 'I-LOC':6, 'B-MISC':7, 'I-MISC':8, 'X':9, '[PAD]':10}

Note that the labels above include two extra tags, X and [PAD]. Their meanings are as follows: tokenization splits some originally complete words apart, and the extra pieces that are split off are given the label X; [PAD] is the label for padding positions.

Next, download the dataset that has been preprocessed in advance. Network disk link: https://pan.baidu.com/s/18jqTwLNM2Vmf7fOzkh7UgA extraction code: zko3

Once you have downloaded the dataset, read the data files.

train_samples = []
train_labels = []

with open('./train.txt', 'r') as f:
    while True:
        s1 = f.readline()  # sentence line
        if not s1:
            # An empty read means the end of the file; stop reading
            break
        s2 = f.readline()  # label line
        _ = f.readline()   # third line of each record, discarded
        train_samples.append(s1.replace('\n', ''))
        train_labels.append(s2.replace('\n', ''))

len(train_samples), len(train_labels)

As mentioned above, tokenization splits some complete words apart, so after tokenization we need to add labels on top of the original data. For example, the word "She's" is split into three tokens, "She", "'", and "s". The original label O of "She's" is then assigned to the token "She", while "'" and "s" are labeled X and X. The following code modifies the labels accordingly.

from pytorch_transformers import XLNetTokenizer

# Use the XLNet tokenizer
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')

input_ids = []
input_labels = []
for text, ori_labels in zip(train_samples, train_labels):
    words = text.split(' ')
    labels = []
    text_ids = []
    # ori_labels is a string and is iterated one character per word,
    # so this relies on one single-character label per word
    for word, label in zip(words, ori_labels):
        if word == '':
            continue
        tokens = tokenizer.tokenize(word)
        for i, token in enumerate(tokens):
            # The first sub-token keeps the word's label; the rest get X (9)
            labels.append(int(label) if i == 0 else 9)
            text_ids.append(tokenizer.convert_tokens_to_ids(token))
    input_ids.append(text_ids)
    input_labels.append(labels)

len(input_ids), len(input_labels)
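
The label alignment can be tried without downloading anything by swapping in a stand-in tokenizer. In the sketch below, `toy_tokenize` is a hypothetical replacement that only splits off punctuation; it is not how SentencePiece segments text, but it reproduces the "She's" example.

```python
import re

X_LABEL = 9  # id of the 'X' label in label_dict

def toy_tokenize(word):
    """Hypothetical stand-in for a subword tokenizer: split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", word)

def align_labels(words, word_labels, tokenize=toy_tokenize):
    """Give the first piece of each word its label and every other piece X."""
    tokens, labels = [], []
    for word, label in zip(words, word_labels):
        for i, piece in enumerate(tokenize(word)):
            tokens.append(piece)
            labels.append(label if i == 0 else X_LABEL)
    return tokens, labels

# 0 is 'O' and 5 is 'B-LOC' in label_dict
tokens, labels = align_labels(["She's", "from", "Paris"], [0, 0, 5])
print(list(zip(tokens, labels)))
```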

After preparing the data, we use TensorDataset() provided by PyTorch to build the training dataset and DataLoader() to build the training data iterator.

import torch
from torch.utils.data import DataLoader, TensorDataset

del train_samples
del train_labels

for j in range(len(input_ids)):
    # Pad each sample's token ids with 0 up to a length of 128
    i = input_ids[j]
    if len(i) != 128:
        input_ids[j].extend([0]*(128 - len(i)))

for j in range(len(input_labels)):
    # Pad each sample's labels with 10 ([PAD]) up to a length of 128
    i = input_labels[j]
    if len(i) != 128:
        input_labels[j].extend([10]*(128 - len(i)))

# Build the dataset and the data iterator, with batch_size set to 8
train_set = TensorDataset(torch.LongTensor(input_ids),
                          torch.LongTensor(input_labels))
train_loader = DataLoader(dataset=train_set,
                          batch_size=8,
                          shuffle=True)
train_loader
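
The padding step and the attention mask that is later derived from it can be sketched together in plain Python. `pad_and_mask` is a made-up helper; unlike the code above it also truncates samples that are longer than the target length, which is an addition for this illustration.

```python
def pad_and_mask(ids, max_len, pad_id=0):
    """Pad (or truncate) a list of token ids and build the matching 0/1 mask."""
    ids = ids[:max_len]
    padded = ids + [pad_id] * (max_len - len(ids))
    # 1 marks a real token, 0 marks padding
    mask = [1 if tok != pad_id else 0 for tok in padded]
    return padded, mask

padded, mask = pad_and_mask([17, 42, 8], max_len=6)
print(padded)
print(mask)
```

Because the mask is derived from the ids alone, any real token whose id happens to equal `pad_id` would also be masked out; that is a limitation of this way of building the mask.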

Check whether the machine has a GPU. If it does, run on the GPU; otherwise run on the CPU.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

Because the pre-trained XLNet model is large and is hosted on servers abroad, first download the pre-trained model from the network disk and place the files in the current directory, since the code below loads the model from "./". Link: https://pan.baidu.com/s/1CySwfsOyh9Id4T85koxAeg extraction code: lah0

Next we build the class used for named entity recognition. Named entity recognition is essentially classification: on top of the XLNet model we add a Dropout layer to prevent overfitting and a Linear fully connected layer.

import torch.nn as nn
from pytorch_transformers import XLNetModel

class NERModel(nn.Module):
    def __init__(self, num_class=11):
        super(NERModel, self).__init__()
        # Load the pre-trained XLNet model
        self.model = XLNetModel.from_pretrained("./")
        self.dropout = nn.Dropout(0.1)
        self.l1 = nn.Linear(768, num_class)

    def forward(self, x, attention_mask=None):
        outputs = self.model(x, attention_mask=attention_mask)
        x = outputs[0]  # shape: batch * seqlen * 768
        x = self.dropout(x)
        x = self.l1(x)
        return x
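
To check the tensor shapes without downloading XLNet, the sketch below keeps the same Dropout and Linear head but replaces the encoder with a plain embedding layer. `ToyTagger` and its `vocab_size` are hypothetical and exist only for this shape check.

```python
import torch
import torch.nn as nn

class ToyTagger(nn.Module):
    """Same classification head as NERModel, with a stand-in encoder."""
    def __init__(self, vocab_size=100, hidden=768, num_class=11):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, hidden)  # stand-in, not XLNet
        self.dropout = nn.Dropout(0.1)
        self.l1 = nn.Linear(hidden, num_class)

    def forward(self, x):
        h = self.encoder(x)               # batch * seqlen * hidden
        return self.l1(self.dropout(h))   # batch * seqlen * num_class

toy = ToyTagger().eval()
with torch.no_grad():
    logits = toy(torch.randint(0, 100, (8, 128)))
pred = logits.argmax(dim=2)               # one label id per token
print(logits.shape, pred.shape)
```

The head produces one score per label for every token position, which is why the predictions are later taken with an argmax over the last dimension.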

Define the loss function. Here we use cross entropy (Cross Entropy) as the loss function.

def loss_function(logits, target, masks, num_class=11):
    criterion = nn.CrossEntropyLoss(reduction='none')
    logits = logits.view(-1, num_class)
    target = target.view(-1)
    masks = masks.view(-1)
    cross_entropy = criterion(logits, target)
    loss = cross_entropy * masks
    loss = loss.sum() / (masks.sum() + 1e-12)  # add 1e-12 so the divisor is never 0
    loss = loss.to(device)
    return loss
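
The arithmetic of the masked average can be followed in a pure Python sketch. The function `masked_cross_entropy` below mirrors the same idea on plain lists; the logits, targets, and mask are invented for illustration.

```python
import math

def masked_cross_entropy(logits, targets, mask):
    """Average per-token cross entropy over positions where mask is 1."""
    total, count = 0.0, 0.0
    for row, tgt, m in zip(logits, targets, mask):
        # Cross entropy for one token: log-sum-exp of the scores minus the target score
        log_z = math.log(sum(math.exp(v) for v in row))
        total += m * (log_z - row[tgt])
        count += m
    return total / (count + 1e-12)

logits = [[0.0, 0.0], [0.0, 0.0], [5.0, -5.0]]  # three tokens, two classes
targets = [0, 1, 1]
mask = [1, 1, 0]                                # the third token is padding
loss = masked_cross_entropy(logits, targets, mask)
print(loss)
```

The third token is badly misclassified, but because its mask value is 0 it contributes nothing; the loss is the average over the two real tokens, each of which costs log 2.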

Instantiate the model class and create the optimizer.

from torch.optim import Adam

model = NERModel()
model.to(device)
model.train()

optimizer = Adam(model.parameters(), lr=1e-5)

Start training.

import time

pre = time.time()

epoch = 3

for i in range(epoch):
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()

        # Build the mask: 1 for real tokens, 0 for padding (id 0)
        mask = (data != 0).float()

        output = model(data, attention_mask=mask)

        # Get the model's predictions
        pred = torch.argmax(output, dim=2)

        loss = loss_function(output, target, mask)
        loss.backward()

        optimizer.step()

        if ((batch_idx+1) % 10) == 1:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss:{:.6f}'.format(
                i+1, batch_idx, len(train_loader), 100. *
                batch_idx/len(train_loader), loss.item()
            ))

        if batch_idx == len(train_loader)-1:
            # Print the results at the end of each epoch
            print('labels:', target)
            print('pred:', pred)

print('Training time:', time.time()-pre)

After training, we can use the validation set to observe how well the trained model performs.

The validation data is read, and the dataset and data iterator are built, in the same way as for the training set.

eval_samples = []
eval_labels = []

with open('./dev.txt', 'r') as f:
    while True:
        s1 = f.readline()
        if not s1:
            break
        s2 = f.readline()
        _ = f.readline()
        eval_samples.append(s1.replace('\n', ''))
        eval_labels.append(s2.replace('\n', ''))

len(eval_samples)

# Modify the labels in the same way as for the training set; not explained again here
input_ids = []
input_labels = []
for text, ori_labels in zip(eval_samples, eval_labels):
    l = text.split(' ')
    labels = []
    text_ids = []
    for word, label in zip(l, ori_labels):
        if word == '':
            continue
        tokens = tokenizer.tokenize(word)
        for i, j in enumerate(tokens):
            if i == 0:
                labels.append(int(label))
                text_ids.append(tokenizer.convert_tokens_to_ids(j))
            else:
                labels.append(9)
                text_ids.append(tokenizer.convert_tokens_to_ids(j))
    input_ids.append(text_ids)
    input_labels.append(labels)

del eval_samples
del eval_labels

for j in range(len(input_ids)):
    # Pad each sample's token ids with 0 up to a length of 128
    i = input_ids[j]
    if len(i) != 128:
        input_ids[j].extend([0]*(128 - len(i)))

for j in range(len(input_labels)):
    # Pad each sample's labels with 10 ([PAD]) up to a length of 128
    i = input_labels[j]
    if len(i) != 128:
        input_labels[j].extend([10]*(128 - len(i)))

# Build the dataset and the data iterator, with batch_size set to 1
eval_set = TensorDataset(torch.LongTensor(input_ids),
                         torch.LongTensor(input_labels))
eval_loader = DataLoader(dataset=eval_set,
                         batch_size=1,
                         shuffle=False)
eval_loader

Set the model to evaluation mode and feed in the validation dataset.

from tqdm import tqdm_notebook as tqdm

model.eval()

correct = 0
total = 0

# Gradients are not needed during evaluation
with torch.no_grad():
    for batch_idx, (data, target) in enumerate(tqdm(eval_loader)):
        data = data.to(device)
        target = target.float().to(device)

        # Build the mask: 1 for real tokens, 0 for padding (id 0)
        mask = (data != 0).float()

        output = model(data, attention_mask=mask)

        # Get the model's predictions
        pred = torch.argmax(output, dim=2)

        # Apply the mask to predictions and targets to simplify the accuracy computation
        pred = pred.float()
        pred = pred * mask
        target = target * mask

        # Keep only the non-padded prefix (batch_size is 1)
        n = mask.sum().int().item()
        pred = pred[:, 0:n]
        target = target[:, 0:n]

        correct += (pred == target).sum().item()
        total += mask.sum().item()

print('Correctly classified labels: {}, total labels: {}, accuracy: {:.2f}%'.format(
    correct, total, 100.*correct/total))
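
The accuracy above counts every non-padded token, including the X pieces. As a variation that is not part of the original evaluation, the sketch below computes token accuracy on plain lists and lets you choose which gold labels to ignore; the helper `masked_accuracy` and the example labels are invented for illustration.

```python
def masked_accuracy(pred, gold, ignore_labels=(10,)):
    """Count correct predictions over positions whose gold label is not ignored."""
    kept = [(p, g) for p, g in zip(pred, gold) if g not in ignore_labels]
    correct = sum(1 for p, g in kept if p == g)
    return correct, len(kept)

gold = [3, 4, 0, 9, 9, 10, 10]   # B-PER, I-PER, O, X, X, [PAD], [PAD]
pred = [3, 0, 0, 9, 9, 0, 0]

with_x = masked_accuracy(pred, gold)                              # ignore padding only
without_x = masked_accuracy(pred, gold, ignore_labels=(9, 10))    # also ignore X
print(with_x, without_x)
```

In this toy case the score drops from 4 of 5 to 2 of 3 once the X pieces are excluded, which shows how easy positions can raise a token-level accuracy figure.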

We can see that the final accuracy is above 90%. In practical entity extraction, rules are sometimes added to extract whole entity words, or the final Softmax layer that predicts the labels is replaced with a CRF to improve accuracy. Since this is not the focus of this session, interested readers can look up the material and study it on their own.

Summary

In this session we learned about XLNet, an upgraded version of BERT and GPT-2 that combines the advantages of both models: it does not introduce the noise produced by mask labels, so there is no mismatch between pre-training and fine-tuning, and it can integrate context from both directions at the same time. We then used XLNet for named entity recognition and obtained good performance.

Origin www.cnblogs.com/wwj99/p/12564136.html