NLP (54): Using the English RoBERTa model to implement text classification in Keras

The English RoBERTa model is a pre-trained model proposed by Facebook in the 2019 paper RoBERTa: A Robustly Optimized BERT Pretraining Approach. It was designed to address a number of issues in the original BERT model, and at the time of its release it topped the leaderboards of many NLP tasks and reached SOTA. The model and code are open-sourced in the fairseq project on GitHub. As is well known, the English RoBERTa model was trained with the Torch framework, so its torch-version checkpoint is the most common one.
  Of course, torch models can also be converted into TensorFlow models. This article introduces how to convert the original torch version of the English RoBERTa model into a TensorFlow model, and how to use that model in Keras to implement English text classification.
  The project structure is shown below:
[Figure: project structure chart]

Model transformation

  This project first converts the original torch version of the English RoBERTa model into a TensorFlow model. This part of the code mainly refers to the GitHub project keras_roberta.
  First, you need to download the RoBERTa base model published by Facebook in the fairseq project. The download page is: https://github.com/pytorch/fairseq/blob/main/examples/roberta/README.md .
[Figure: RoBERTa model download]
  Run the convert_roberta_to_tf.py script to convert the torch model into a TensorFlow model. The specific code is not given here; you can refer to the GitHub project address given later in the article.
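
  At its core, the conversion is a weight copy: each tensor in the torch checkpoint is read out as a NumPy array and written into a TensorFlow checkpoint under the variable name the Keras model expects. The sketch below only illustrates this idea and is not the actual convert_roberta_to_tf.py script; the name_mapping argument and all names in it are assumptions.

import tensorflow as tf

# run in graph mode so that a TF1-style .ckpt file can be written
tf.compat.v1.disable_eager_execution()


def copy_torch_weights_to_tf(torch_state_dict, name_mapping, ckpt_path):
    """Illustrative only: copy torch tensors into same-shaped tf.Variables and save a checkpoint.

    name_mapping: dict mapping torch parameter names to TF variable names (assumed).
    Note: dense-layer kernels may additionally need a transpose, which is omitted here.
    """
    tf_vars = []
    for torch_name, tf_name in name_mapping.items():
        weight = torch_state_dict[torch_name].cpu().numpy()
        tf_vars.append(tf.Variable(initial_value=weight, name=tf_name))
    saver = tf.compat.v1.train.Saver(tf_vars)
    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        saver.save(sess, ckpt_path)   # e.g. 'tf_roberta_base/tf_roberta_base.ckpt'
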
  As for the model's tokenizer, RobertaTokenizer is swapped for GPT2Tokenizer, since RobertaTokenizer inherits from GPT2Tokenizer and the two are very similar. The following script compares the behaviour of the original torch model and the converted TensorFlow model (tf_roberta_demo.py):

import os
import tensorflow as tf
from keras_roberta.roberta import build_bert_model
from keras_roberta.tokenizer import RobertaTokenizer
from fairseq.models.roberta import RobertaModel as FairseqRobertaModel
import numpy as np
import argparse


if __name__ == '__main__':
    roberta_path = 'roberta-base'
    tf_roberta_path = 'tf_roberta_base'
    tf_ckpt_name = 'tf_roberta_base.ckpt'
    vocab_path = 'keras_roberta'

    config_path = os.path.join(tf_roberta_path, 'bert_config.json')
    checkpoint_path = os.path.join(tf_roberta_path, tf_ckpt_name)
    if os.path.splitext(checkpoint_path)[-1] != '.ckpt':
        checkpoint_path += '.ckpt'

    gpt_bpe_vocab = os.path.join(vocab_path, 'encoder.json')
    gpt_bpe_merge = os.path.join(vocab_path, 'vocab.bpe')
    roberta_dict = os.path.join(roberta_path, 'dict.txt')

    tokenizer = RobertaTokenizer(gpt_bpe_vocab, gpt_bpe_merge, roberta_dict)
    model = build_bert_model(config_path, checkpoint_path, roberta=True)  # build the model and load weights

    # encoding test
    text1 = "hello, world!"
    text2 = "This is Roberta!"
    sep = [tokenizer.sep_token]
    cls = [tokenizer.cls_token]
    # 1. first convert the text into BPE tokens with 'bpe_tokenize'
    tokens1 = cls + tokenizer.bpe_tokenize(text1) + sep
    tokens2 = sep + tokenizer.bpe_tokenize(text2) + sep
    # 2. then convert the tokens into ids
    token_ids1 = tokenizer.convert_tokens_to_ids(tokens1)
    token_ids2 = tokenizer.convert_tokens_to_ids(tokens2)
    token_ids = token_ids1 + token_ids2
    segment_ids = [0] * len(token_ids1) + [1] * len(token_ids2)
    print(token_ids)
    print(segment_ids)

    print('\n ===== tf model predicting =====\n')
    our_output = model.predict([np.array([token_ids]), np.array([segment_ids])])
    print(our_output)

    print('\n ===== torch model predicting =====\n')
    roberta = FairseqRobertaModel.from_pretrained(roberta_path)
    roberta.eval()  # disable dropout

    input_ids = roberta.encode(text1, text2).unsqueeze(0)  # batch of size 1
    print(input_ids)
    their_output = roberta.model(input_ids, features_only=True)[0]
    print(their_output)

The output is as follows:

[0, 42891, 6, 232, 328, 2, 2, 713, 16, 1738, 102, 328, 2]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

 ===== tf model predicting =====
[[[-0.01123665  0.05132651 -0.02170264 ... -0.03562857 -0.02836962
   -0.00519008]
  [ 0.04382067  0.07045364 -0.00431021 ... -0.04662359 -0.10770167
    0.1121687 ]
  [ 0.06198474  0.05240346  0.11088232 ... -0.08883709 -0.02932207
   -0.12898633]
  ...
  [-0.00229368  0.045834    0.00811818 ... -0.11751424 -0.06718166
    0.04085271]
  [-0.08509324 -0.27506304 -0.02425355 ... -0.24215901 -0.15481825
    0.17167582]
  [-0.05180666  0.06384835 -0.05997407 ... -0.09398533 -0.05159672
   -0.03988626]]]

 ===== torch model predicting =====
tensor([[    0, 42891,     6,   232,   328,     2,     2,   713,    16,  1738,
           102,   328,     2]])
tensor([[[-0.0525,  0.0818, -0.0170,  ..., -0.0546, -0.0569, -0.0099],
         [-0.0765, -0.0568, -0.1400,  ..., -0.2612, -0.0455,  0.2975],
         [-0.0142,  0.1184,  0.0530,  ..., -0.0844,  0.0199,  0.1340],
         ...,
         [-0.0019,  0.1263, -0.0787,  ..., -0.3986, -0.0626,  0.1870],
         [ 0.0127, -0.2116,  0.0696,  ..., -0.1622, -0.1265,  0.0986],
         [-0.0473,  0.0748, -0.0419,  ..., -0.0892, -0.0595, -0.0281]]],
       grad_fn=<TransposeBackward0>)

As the output shows, the token ids produced by the two models' tokenization are identical.
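
  If you also want to compare the hidden states numerically, a small check can be appended to tf_roberta_demo.py (this snippet is an addition to the script above, not part of the original file):

# maximum absolute difference between the tf and torch hidden states
diff = np.max(np.abs(our_output - their_output.detach().numpy()))
print('max absolute difference:', diff)
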

English text classification

  Next, let's look at how the converted TensorFlow version of the RoBERTa model performs on an English text classification dataset.
  Here we use the SST-2 dataset from GLUE. SST-2 (The Stanford Sentiment Treebank) is a single-sentence classification task containing sentences from movie reviews together with human annotations of their sentiment. The task is to predict the sentiment of a sentence, with two categories: positive sentiment (label 1) and negative sentiment (label 0); only sentence-level labels are used, so this is a binary classification task at the sentence level. For a detailed introduction to this dataset, please refer to: https://nlp.stanford.edu/sentiment/index.html .
  The SST-2 dataset contains 67,349 training samples, 872 validation samples, and 1,820 test samples, stored in tsv format. The code to read the data is as follows (utils/load_data.py):

def read_model_data(file_path):
    # read an SST-2 tsv file: each line is "sentence\tlabel", the first line is a header
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = [_.strip() for _ in f.readlines()]
    for i, line in enumerate(lines):
        if i:  # skip the header line
            items = line.split('\t')
            # convert the 0/1 label into a one-hot vector
            label = [0, 1] if int(items[1]) else [1, 0]
            data.append([label, items[0]])
    return data
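
  A possible call pattern looks like this (the data paths are assumptions; SST-2 ships as train.tsv and dev.tsv):

train_data = read_model_data('data/SST-2/train.tsv')   # assumed path
dev_data = read_model_data('data/SST-2/dev.tsv')       # assumed path
print(len(train_data), len(dev_data))                  # expected: 67349 872
print(train_data[0])                                   # [[one-hot label], sentence]
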

  For the tokenizer, we use GPT2Tokenizer; the code for this part is as follows (utils/roberta_tokenizer.py):

# RoBERTa tokenizer encoding function for a single text
def tokenizer_encode(tokenizer, text, max_seq_length):
    sep = [tokenizer.sep_token]
    cls = [tokenizer.cls_token]
    # 1. first convert the text into BPE tokens with 'bpe_tokenize'
    tokens1 = cls + tokenizer.bpe_tokenize(text) + sep
    # 2. then convert the tokens into ids
    token_ids = tokenizer.convert_tokens_to_ids(tokens1)
    segment_ids = [0] * len(token_ids)
    # pad with 0 up to max_seq_length, or truncate if the text is too long
    pad_length = max_seq_length - len(token_ids)
    if pad_length >= 0:
        token_ids += [0] * pad_length
        segment_ids += [0] * pad_length
    else:
        token_ids = token_ids[:max_seq_length]
        segment_ids = segment_ids[:max_seq_length]

    return token_ids, segment_ids
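
  For example, with the tokenizer built as in tf_roberta_demo.py, a single sentence can be encoded like this (the sentence is just an illustration):

token_ids, segment_ids = tokenizer_encode(tokenizer, "This is Roberta!", max_seq_length=80)
print(len(token_ids), len(segment_ids))   # both padded/truncated to 80
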

  Create the model as follows (model_train.py):

# imports assumed here (adjust to the Keras backend used in the project);
# CONFIG_FILE_PATH and CHECKPOINT_FILE_PATH are defined elsewhere in the project
from keras.layers import Lambda, Dense
from keras.models import Model
from keras.optimizers import Adam
from keras_roberta.roberta import build_bert_model


# build the classification model
def create_cls_model():
    # RoBERTa model
    roberta_model = build_bert_model(CONFIG_FILE_PATH, CHECKPOINT_FILE_PATH, roberta=True)  # build the model and load weights

    for layer in roberta_model.layers:
        layer.trainable = True

    cls_layer = Lambda(lambda x: x[:, 0])(roberta_model.output)    # take the vector at the [CLS] position for classification
    p = Dense(2, activation='softmax')(cls_layer)     # two-class softmax output

    model = Model(roberta_model.input, p)
    model.compile(
        loss='categorical_crossentropy',
        optimizer=Adam(1e-5),   # use a sufficiently small learning rate
        metrics=['accuracy']
    )

    return model

The model parameters are as follows:

# model hyper-parameters
EPOCH = 10              # number of training epochs
BATCH_SIZE = 64         # batch size
MAX_SEQ_LENGTH = 80     # maximum sequence length
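
  Putting the pieces together, training can be sketched as follows. This is not the full model_train.py; the data paths, the tokenizer setup (built as in tf_roberta_demo.py) and the weights file name are assumptions:

import numpy as np

train_data = read_model_data('data/SST-2/train.tsv')   # assumed path
dev_data = read_model_data('data/SST-2/dev.tsv')       # assumed path


def build_inputs(dataset):
    # convert [label, sentence] pairs into model inputs and one-hot labels
    labels, token_ids, segment_ids = [], [], []
    for label, text in dataset:
        ids, segs = tokenizer_encode(tokenizer, text, MAX_SEQ_LENGTH)
        labels.append(label)
        token_ids.append(ids)
        segment_ids.append(segs)
    return [np.array(token_ids), np.array(segment_ids)], np.array(labels)


x_train, y_train = build_inputs(train_data)
x_dev, y_dev = build_inputs(dev_data)

model = create_cls_model()
model.fit(x_train, y_train,
          validation_data=(x_dev, y_dev),
          batch_size=BATCH_SIZE,
          epochs=EPOCH)
model.save_weights('sst2_roberta_cls.weights')   # assumed file name
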

After training, the model reaches an accuracy of 0.9415 and an F1 score of 0.9415 on the validation set, which is a good result.
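
  For reference, the accuracy and F1 score on the validation set can be computed with scikit-learn, reusing x_dev and y_dev from the training sketch above:

from sklearn.metrics import accuracy_score, f1_score

probs = model.predict(x_dev)
y_pred = probs.argmax(axis=-1)          # predicted class per sample
y_true = y_dev.argmax(axis=-1)          # one-hot labels back to 0/1
print('accuracy:', accuracy_score(y_true, y_pred))
print('F1:', f1_score(y_true, y_pred))
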

Model prediction

  We run model prediction (model_predict.py) on new samples; the prediction results are as follows:

Awesome movie for everyone to watch. Animation was flawless.
label: 1, prob: 0.9999607

I almost balled my eyes out 5 times. Almost. Beautiful movie, very inspiring.
label: 1, prob: 0.9999519

Not even worth it. It’s a movie that’s too stupid for adults, and too crappy for everyone. Skip if you’re not 13, or even if you are.
label: 0, prob: 0.9999864
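
  The prediction script can be sketched as follows; the saved-weights file name is an assumption, and the model and tokenizer are built exactly as in the training sketch above:

import numpy as np

model = create_cls_model()
model.load_weights('sst2_roberta_cls.weights')   # assumed file name

text = "Awesome movie for everyone to watch. Animation was flawless."
token_ids, segment_ids = tokenizer_encode(tokenizer, text, MAX_SEQ_LENGTH)
probs = model.predict([np.array([token_ids]), np.array([segment_ids])])[0]
print('label: {}, prob: {}'.format(probs.argmax(), probs.max()))
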

Summary

  This article described how to convert the original torch version of the English RoBERTa model into a TensorFlow model, and how to use that model in Keras to implement English text classification.
  The code for this project is available on GitHub at: https://github.com/percent4/keras_roberta_text_classificaiton .
  Thanks for reading. If you have any questions, please feel free to share~

References

  1. fairseq: https://github.com/pytorch/fairseq
  2. GLUE tasks: https://gluebenchmark.com/tasks
  3. SST-2: https://nlp.stanford.edu/sentiment/index.html
  4. keras_roberta: https://github.com/midori1/keras_roberta
  5. RoBERTa paper: https://arxiv.org/pdf/1907.11692.pdf

Reprinted from: blog.csdn.net/jclian91/article/details/124775338