[Natural Language Processing (NLP)] Text Semantic Matching Based on ERNIE Language Model





About the author: a college student, Huawei Cloud Sharing Expert, Alibaba Cloud Expert Blogger, member of Tengyun Pioneer (TDP), general director of the Yunxi Smart Project, and volunteer of the National Committee of Experts on Computer Teaching and Industrial Practice Resource Construction in Colleges and Universities (TIPCC); also a programming enthusiast who looks forward to learning and making progress together with everyone.

Blog homepage: ぃ Ling Yu が's learning log

This article belongs to the column: Artificial Intelligence




Foreword

(1) Task description

Text matching has always been a fundamental and important direction in natural language processing (NLP); it generally studies the relationship between two texts. Text similarity calculation, natural language inference, question answering systems, information retrieval, and so on can all be regarded as text matching applied to different data and scenarios. To a large extent, these natural language processing tasks can be abstracted into text matching problems. For example, information retrieval can be framed as matching a search term against document resources, a question answering system as matching a question against candidate answers, paraphrase identification as matching two synonymous sentences, a dialogue system as matching the previous utterance against the reply, and machine translation as matching two languages.



(2) Data sources

The dataset is provided by the Tianchi "Public AI Star" Challenge - COVID-19 Similar Sentence Pair Judgment Competition.

During the fight against the epidemic, question-answering applications for epidemic knowledge have been widely promoted. How to classify the similarity of questions and answers with natural language technology remains a valuable problem. For example, identifying similar questions from patients helps to understand their real demands, to match accurate answers quickly, and to improve patients' sense of being served; summarizing similar answers from doctors helps to analyze how standardized the answers are, to keep consultations consistent during the epidemic, and to avoid misdiagnosis.

The competition focuses on real accumulated data in the epidemic-related respiratory field. The data granularity is finer and the judgment is harder than in multi-department text similarity matching, and the question-and-answer data is also more timely. Questions are limited to 20 characters and form relatively standardized sentence pairs.

Dataset example:

# Extract the dataset
!tar -zxvf /home/aistudio/data/data48492/COVID19_sim_competition.tar.gz

# View a few sample rows of the dataset
!head -n 5 COVID19_sim_competition/train.tsv

The output is shown in Figure 1 below:
(Figure 1: the first five rows of train.tsv, showing the text_a, text_b, and label columns)


The dataset gives text pairs (text_a, text_b, where text_a is the query and text_b is the title) and a category (label). A label of 1 means the semantics of text_a and text_b are similar; otherwise they are not similar.

Text matching tasks have been built into PaddleHub since version 1.8.0, and they can be divided into two types: pointwise and pairwise.

  • Pointwise: each sample usually consists of two texts (query, title). The label is 0 or 1; 0 means the query does not match the title, 1 means it does.
  • Pairwise: each sample usually consists of three texts (query, positive_title, negative_title), where positive_title matches the query better than negative_title.

According to this dataset example, the matching task is of pointwise type.
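Purely as an illustration of the two sample shapes (the strings below are placeholders, not rows from the competition data):

# Illustrative placeholders only: the shape of one sample under each matching type.
pointwise_sample = ("query text", "title text", "1")                    # (query, title, label)
pairwise_sample = ("query text", "positive title", "negative title")   # (query, positive_title, negative_title)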

Next, this tutorial shows how to use PaddleHub combined with the pre-trained model ERNIE to complete the pointwise text matching task.

For the pairwise text matching task, refer to this tutorial:
https://aistudio.baidu.com/aistudio/projectdetail/709472


1. Loading a custom dataset with PaddleHub

To load a custom dataset for a text matching task, you only need to inherit from the TextMatchingDataset class and point it at the location where the dataset is stored. The following code example shows how to load a custom dataset into PaddleHub. In this way, we only need to fine-tune the pre-trained model on a small dataset.

# Install PaddleHub 1.8.1
!pip install paddlehub==1.8.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
from paddlehub.dataset.base_nlp_dataset import TextMatchingDataset


class COVID19Competition(TextMatchingDataset):
    def __init__(self, tokenizer=None, max_seq_len=None):
        base_path = 'COVID19_sim_competition'
        super(COVID19Competition, self).__init__(
            is_pair_wise=False,  # matching type: False means pointwise, True means pairwise
            base_path=base_path,
            train_file="train.tsv",  # file path relative to base_path
            dev_file="dev.tsv",  # file path relative to base_path
            train_file_with_header=True,
            dev_file_with_header=True,
            label_list=["0", "1"],
            tokenizer=tokenizer,
            max_seq_len=max_seq_len)

2. Optimizing text matching with the semantic pre-trained model ERNIE

If you are interested in pre-trained models such as Google's BERT or Baidu's ERNIE, it is also worth trying their effect on your own tasks.

Baidu's pre-trained model ERNIE, trained on massive data, is a very good feature extractor. Drawing on the idea of transfer learning, we can use the semantic information it has learned from massive data to assist tasks on small datasets (such as the medical text dataset in this example).

PaddleHub provides a wealth of pre-trained models, and you can easily obtain all pre-trained models in the PaddlePaddle ecosystem. The following shows how to use PaddleHub to load ERNIE with one click to optimize text matching tasks.


(1) PaddleHub one-click loading of ERNIE



import paddlehub as hub

import paddle
paddle.enable_static()

module = hub.Module(name="ernie")

# A pointwise task needs two text slots: query and title (num_slots=2)
inputs, outputs, program = module.context(
    trainable=True, max_seq_len=128, num_slots=2)

The maximum sequence length max_seq_len is an adjustable parameter. The recommended value is 128; it can be adjusted according to the length of the task text, but it should not exceed 512.

num_slots: the number of text inputs in the text matching task. For a pointwise text matching task, num_slots should be 2, meaning query and title; for a pairwise text matching task, num_slots should be 3.

If you want to try other semantic models (such as ernie_tiny, RoBERTa, etc.), just replace the name parameter in hub.Module.
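For instance, a one-line swap might look like the following (ernie_tiny is mentioned above; any other module name must be available in your installed PaddleHub version):

# Swap the pre-trained model by changing only the name parameter
# ("ernie_tiny" is mentioned above; confirm availability in your PaddleHub version).
module = hub.Module(name="ernie_tiny")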



(2) Select Tokenizer to read data

tokenizer = hub.BertTokenizer(vocab_file=module.get_vocab_path(), tokenize_chinese_chars=True)

dataset = COVID19Competition(tokenizer=tokenizer, max_seq_len=128)

module.get_vocab_path() returns the vocabulary file corresponding to the pre-trained model;

tokenize_chinese_chars controls whether Chinese text is tokenized into characters.

NOTE:

  1. If you use Transformer-class models (such as ERNIE, BERT, RoBERTa, etc.), you should choose hub.BertTokenizer.
  2. If you use non-Transformer-class models (such as word2vec_skipgram, tencent_ailab_chinese_embedding_small, etc.), you should choose hub.CustomTokenizer.
  3. The max_seq_len used when creating the dataset object must be consistent with the max_seq_len passed to the module.context interface in the first step.
  4. Here, take out a piece of data and print it: use docs to get the list of samples and labels to get their label values, and print them to get a preliminary impression of the data (a sketch follows below).
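A minimal sketch of note 4, assuming the PaddleHub 1.x dataset exposes get_train_examples() and that each example carries text_a, text_b, and label attributes (verify these names against your PaddleHub version):

# Print a few samples to get a first impression of the data
# (assumes get_train_examples() and the text_a / text_b / label attributes exist).
for example in dataset.get_train_examples()[:3]:
    print("text_a:", example.text_a)
    print("text_b:", example.text_b)
    print("label:", example.label)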

(3) Select the optimization strategy and running configuration

The transfer-learning optimization strategy suited to Transformer models such as ERNIE/BERT is AdamWeightDecayStrategy:

See Strategy for details.

The parameters of AdamWeightDecayStrategy:

  • learning_rate: the maximum learning rate
  • lr_scheduler: the decay schedule; linear_decay and noam_decay are the two available options
  • warmup_proportion: the proportion of training used for warm-up; if set to 0.1, the learning rate increases gradually to learning_rate over the first 10% of training steps
  • weight_decay: weight decay, similar to a model regularization term, to avoid overfitting


strategy = hub.AdamWeightDecayStrategy(
    weight_decay=0.01,
    warmup_proportion=0.1,
    learning_rate=5e-5)

PaddleHub provides many optimization strategies, such as AdamWeightDecayStrategy, ULMFiTStrategy, DefaultFinetuneStrategy, etc. For details, see Strategies.


(4) Select the running configuration

Before performing Finetune, we can set some runtime configurations. For example, the configuration in the following code means:

  • use_cuda: set to False to train on CPU. If your machine has a GPU and the GPU version of PaddlePaddle is installed, we recommend setting this option to True;

  • num_epoch: the number of passes over the training set during Finetune;

  • batch_size: the number of samples fed to the model in each training batch (32 here). Batches are processed in parallel during training, so a larger batch_size improves training efficiency, but it also increases memory load; a batch_size that is too large may cause out-of-memory failures, so choosing a suitable value is an important step;

  • checkpoint_dir: the directory where training parameters and data are saved;

  • eval_interval: the number of steps between performance evaluations on the validation set;

  • strategy: the fine-tuning strategy;

For more run configurations, see RunConfig

config = hub.RunConfig(
    eval_interval=300,
    use_cuda=True,
    num_epoch=3,
    batch_size=32,
    checkpoint_dir='ckpt_ernie_pointwise_matching',
    strategy=strategy)

(5) Establishing the Finetune Task

To complete the pointwise text matching task with the pre-trained model ERNIE, you might think of concatenating the query and title text, feeding the result into ERNIE, taking the CLS feature (pooled_output), and adding a fully connected layer for binary classification. The following figure shows this usage of BERT for sentence-pair classification tasks:

(Figure: BERT sentence-pair classification, where query and title are concatenated and the CLS feature feeds a classifier)

However, the problem with this approach is that ERNIE has a very large number of parameters, which leads to a very large amount of computation and an unsatisfactory prediction speed, so it cannot meet the requirements of online services. To address this problem, PaddleHub's built-in text matching network adopts the Sentence-BERT structure.

(Figure: the Sentence-BERT twin-tower network structure)

Sentence-BERT adopts a twin-tower (Siamese) network structure. The query and the title each enter ERNIE, sharing one set of ERNIE parameters, and produce their respective sequence_output features. The sequence_output is then pooled (PaddleHub uses mean pooling by default; after extensive experiments, the PaddleHub authors found little difference in effect between mean_pooling and max_pooling), and the pooled outputs are recorded as u and v respectively. The three representations (u, v, |u-v|) are then concatenated for binary classification. The network structure is shown in the figure above.
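As a rough illustration of this feature construction, here is a toy numpy sketch with random placeholder tensors (toy dimensions, not the actual PaddleHub implementation):

import numpy as np

# Toy dimensions: sequence length 128, hidden size 768 (placeholders, not real ERNIE outputs).
seq_len, hidden = 128, 768
query_seq_output = np.random.rand(seq_len, hidden)   # sequence_output of the query tower
title_seq_output = np.random.rand(seq_len, hidden)   # sequence_output of the title tower

# Mean pooling over the token dimension gives one vector per text.
u = query_seq_output.mean(axis=0)
v = title_seq_output.mean(axis=0)

# Concatenate (u, v, |u - v|) as input to the binary classifier.
features = np.concatenate([u, v, np.abs(u - v)])      # shape: (3 * hidden,)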

More information about Sentence-BERT can refer to the paper: https://arxiv.org/abs/1908.10084

So how does Sentence-BERT use the Siamese network structure to improve prediction speed?

The advantage of the Siamese network structure is that the query and the title each pass through the same network separately. For example, in an information retrieval task, the title texts in the database can be processed in advance and their sequence_output features stored in the database. When a user issues a query, only the query's sequence_output feature needs to be computed; it is then combined with the stored title features through a simple mean_pooling and fully connected layer for binary classification. In this way, prediction efficiency is greatly improved while model performance is preserved.
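A hedged sketch of that serving pattern follows; encode_with_ernie and match_score below are hypothetical stand-ins for the shared ERNIE tower plus pooling and for the trained classifier, not real PaddleHub APIs:

import numpy as np

def encode_with_ernie(text):
    # Hypothetical stand-in for one pass through the shared ERNIE tower plus mean pooling.
    return np.random.rand(768)

def match_score(u, v):
    # Hypothetical stand-in for the trained classifier over the pooled features.
    return float(np.dot(u, v))

# Offline: encode every candidate title once and cache the vectors.
title_cache = {title: encode_with_ernie(title) for title in ["candidate title A", "candidate title B"]}

# Online: a single ERNIE pass for the query, then only cheap vector operations per cached title.
query_vec = encode_with_ernie("user query")
scores = {title: match_score(query_vec, vec) for title, vec in title_cache.items()}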

For the commonly used Siamese network structure for matching tasks, please refer to: https://blog.csdn.net/thriving_fcl/article/details/73730552


3. Set up Task

With a suitable pre-trained model and a dataset ready for transfer learning, we start to assemble the Task:

  1. Get the context of the module, including the input and output variables and the Paddle Program;
  2. From the output variables, take the word-level feature sequence_output used for text matching;
  3. Connect a matching network after sequence_output to create the Task.

The parameters of PointwiseTextMatchingTask are:

  • dataset: the dataset;

  • query_feature: the query feature extracted by the pre-trained model;

  • title_feature: the title feature extracted by the pre-trained model;

  • tokenizer: the data processor;

  • config: the run configuration;


# Build the transfer network using ERNIE's token-level output
query = outputs["sequence_output"]
title = outputs["sequence_output_2"]

# Create the pointwise text matching task
pointwise_matching_task = hub.PointwiseTextMatchingTask(
    dataset=dataset,
    query_feature=query,
    title_feature=title,
    tokenizer=tokenizer,
    config=config)

4. Start Finetune

We use the finetune_and_eval interface for model training. During fine-tuning, this interface periodically evaluates the model so that we can follow the performance changes throughout training.


run_states = pointwise_matching_task.finetune_and_eval()

5. Use the model to make predictions

When Finetune is completed, we use the model to make predictions. The entire prediction process can be roughly divided into the following steps:

  1. Build the network
  2. Tokenizer that generates prediction data
  3. Switch to the prediction Program
  4. Load pre-trained parameters
  5. Run the Program to make predictions

A sample of prediction data is shown in the code below:

# Sample prediction data
text_pairs = [
    [
        "小孩吃了百令胶囊能打预防针吗",  # query
        "小孩吃了百令胶囊能不能打预防针",  # title
    ],
    [
        "请问呕血与咯血有什么区别?",  # query
        "请问呕血与咯血异同?",  # title
    ]
]

results = pointwise_matching_task.predict(
    data=text_pairs,
    max_seq_len=128,
    label_list=dataset.get_labels(),
    return_result=True,
    accelerate_mode=False)
for index, text in enumerate(text_pairs):
    print("data: %s, prediction_label: %s" % (text, results[index]))

The output is shown in Figure 2 below:
(Figure 2: prediction output, listing each text pair with its predicted label)


Summary

The content of this series of articles is based on my notes and insights from "Natural Language Processing Practice" published by Tsinghua University Press, and the code is developed on Baidu PaddlePaddle. If there is any infringement or impropriety, please send me a private message and I will actively cooperate in handling it. I will reply as soon as I see it!

Finally, I quote a sentence from this event as the conclusion of the article~( ̄▽ ̄~)~:

"The biggest reason for learning is to escape mediocrity: start one day sooner and gain one more day of a wonderful life; start one day later and spend one more day in mediocrity."

ps: For more exciting content, please visit the column of this article, Artificial Intelligence. Everyone is welcome to support and advise~( ̄▽ ̄~)~


Origin: blog.csdn.net/m0_54754302/article/details/126645628