Python-based natural language processing tasks (text classification, text matching, semantic understanding and sequence labeling)

Resource download address : https://download.csdn.net/download/sheziqiong/88308584

1. Environment dependencies

  • Python >= 3.6

  • torch >= 1.1

  • argparse

  • json

  • window

  • numpy

  • packaging

  • re

2. Text classification

This project demonstrates how to fine-tune pre-trained models such as BERT to complete text classification tasks, taking the public Chinese sentiment classification dataset ChnSentiCorp as an example. Run the following command to launch single-machine multi-card distributed training based on DistributedDataParallel, train the model on the training set (train.tsv), and evaluate it on the validation set (dev.tsv):

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_classifier.py --train_data_file ./data/ChnSentiCorp/train.tsv --dev_data_file ./data/ChnSentiCorp/dev.tsv --label_file ./data/ChnSentiCorp/labels.txt --save_best_model --epochs 3 --batch_size 32

Supported configuration parameters:

usage: run_classifier.py [-h] [--local_rank LOCAL_RANK]
                         [--pretrained_model_name_or_path PRETRAINED_MODEL_NAME_OR_PATH]
                         [--init_from_ckpt INIT_FROM_CKPT] --train_data_file
                         TRAIN_DATA_FILE [--dev_data_file DEV_DATA_FILE]
                         --label_file LABEL_FILE [--batch_size BATCH_SIZE]
                         [--scheduler {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup}]
                         [--learning_rate LEARNING_RATE]
                         [--warmup_proportion WARMUP_PROPORTION] [--seed SEED]
                         [--save_steps SAVE_STEPS]
                         [--logging_steps LOGGING_STEPS]
                         [--weight_decay WEIGHT_DECAY] [--epochs EPOCHS]
                         [--max_seq_length MAX_SEQ_LENGTH]
                         [--saved_dir SAVED_DIR]
                         [--max_grad_norm MAX_GRAD_NORM] [--save_best_model]
                         [--is_text_pair]
  • local_rank: Optional, the local process rank in distributed training (set automatically by torch.distributed.launch), the default is -1.

  • pretrained_model_name_or_path: Optional, the name or path of a pre-trained model on Hugging Face, the default is bert-base-chinese.

  • train_data_file: Required, training set data file path.

  • dev_data_file: Optional, validation set data file path, the default is None.

  • label_file: Required, category label file path.

  • batch_size: Optional, batch size; adjust it according to GPU memory and lower it appropriately if memory is insufficient. The default is 32.

  • init_from_ckpt: Optional, path of model parameters to load for warm-starting training. The default is None.

  • scheduler: Optional, the learning rate schedule used by the optimizer, the default is linear.

  • learning_rate: Optional, the maximum learning rate of the optimizer, the default is 5e-5.

  • warmup_proportion: Optional, the proportion of training steps used for learning rate warmup. If it is 0.1, the learning rate increases linearly from 0 to learning_rate during the first 10% of training steps and then slowly decays (see the sketch after this list). The default is 0.

  • weight_decay: Optional, the weight decay coefficient that controls the strength of regularization to prevent overfitting. The default is 0.0.

  • seed: Optional, random seed, the default is 1000.

  • logging_steps: Optional, interval in steps between log prints, the default is 20.

  • save_steps: Optional, interval in steps between saving model parameters, the default is 100.

  • epochs: Optional, number of training epochs, the default is 3.

  • max_seq_length: Optional, the maximum sequence length fed into the pre-trained model; it cannot exceed 512. The default is 128.

  • saved_dir: Optional, the folder in which to save the trained model. By default, models are saved in the checkpoint folder of the current directory.

  • max_grad_norm: Optional, the max_norm parameter for gradient clipping during training, the default is 1.0.

  • save_best_model: Optional, whether to save the model that achieves the best metric on the validation set. When --save_best_model is added to the training command, save_best_model is True; otherwise it is False.

  • is_text_pair: Optional, whether to perform text pair classification. When --is_text_pair is added to the training command, text pairs are classified; otherwise ordinary single-text classification is performed.
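
The scheduler choices above correspond to the schedule names in the transformers library. As a rough sketch of how --scheduler, --learning_rate, --warmup_proportion, --weight_decay and --max_grad_norm typically fit together (not necessarily the exact code in run_classifier.py):

import torch
from transformers import get_scheduler

# Stand-in model; in run_classifier.py this would be the BERT classifier.
model = torch.nn.Linear(768, 2)

epochs = 3
steps_per_epoch = 150                      # len(train_dataloader) in the real script
num_training_steps = epochs * steps_per_epoch
warmup_proportion = 0.1

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)  # --learning_rate, --weight_decay
lr_scheduler = get_scheduler(
    "linear",                              # --scheduler
    optimizer=optimizer,
    num_warmup_steps=int(warmup_proportion * num_training_steps),  # --warmup_proportion
    num_training_steps=num_training_steps,
)

# Inside the training loop, after loss.backward():
#     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # --max_grad_norm
#     optimizer.step()
#     lr_scheduler.step()
#     optimizer.zero_grad()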

The intermediate log of model training is as follows:

2022-05-25 07:22:29.403 | INFO     | __main__:train:301 - global step: 20, epoch: 1, batch: 20, loss: 0.23227, accuracy: 0.87500, speed: 2.12 step/s
2022-05-25 07:22:39.131 | INFO     | __main__:train:301 - global step: 40, epoch: 1, batch: 40, loss: 0.30054, accuracy: 0.87500, speed: 2.06 step/s
2022-05-25 07:22:49.010 | INFO     | __main__:train:301 - global step: 60, epoch: 1, batch: 60, loss: 0.23514, accuracy: 0.93750, speed: 2.02 step/s
2022-05-25 07:22:58.909 | INFO     | __main__:train:301 - global step: 80, epoch: 1, batch: 80, loss: 0.12026, accuracy: 0.96875, speed: 2.02 step/s
2022-05-25 07:23:08.804 | INFO     | __main__:train:301 - global step: 100, epoch: 1, batch: 100, loss: 0.21955, accuracy: 0.90625, speed: 2.02 step/s
2022-05-25 07:23:13.534 | INFO     | __main__:train:307 - eval loss: 0.22564, accuracy: 0.91750
2022-05-25 07:23:25.222 | INFO     | __main__:train:301 - global step: 120, epoch: 1, batch: 120, loss: 0.32157, accuracy: 0.90625, speed: 2.03 step/s
2022-05-25 07:23:35.104 | INFO     | __main__:train:301 - global step: 140, epoch: 1, batch: 140, loss: 0.20107, accuracy: 0.87500, speed: 2.02 step/s
2022-05-25 07:23:44.978 | INFO     | __main__:train:301 - global step: 160, epoch: 2, batch: 10, loss: 0.08750, accuracy: 0.96875, speed: 2.03 step/s
2022-05-25 07:23:54.869 | INFO     | __main__:train:301 - global step: 180, epoch: 2, batch: 30, loss: 0.08308, accuracy: 1.00000, speed: 2.02 step/s
2022-05-25 07:24:04.754 | INFO     | __main__:train:301 - global step: 200, epoch: 2, batch: 50, loss: 0.10256, accuracy: 0.93750, speed: 2.02 step/s
2022-05-25 07:24:09.480 | INFO     | __main__:train:307 - eval loss: 0.22497, accuracy: 0.93083
2022-05-25 07:24:21.020 | INFO     | __main__:train:301 - global step: 220, epoch: 2, batch: 70, loss: 0.23989, accuracy: 0.93750, speed: 2.03 step/s
2022-05-25 07:24:30.919 | INFO     | __main__:train:301 - global step: 240, epoch: 2, batch: 90, loss: 0.00897, accuracy: 1.00000, speed: 2.02 step/s
2022-05-25 07:24:40.777 | INFO     | __main__:train:301 - global step: 260, epoch: 2, batch: 110, loss: 0.13605, accuracy: 0.93750, speed: 2.03 step/s
2022-05-25 07:24:50.640 | INFO     | __main__:train:301 - global step: 280, epoch: 2, batch: 130, loss: 0.14508, accuracy: 0.93750, speed: 2.03 step/s
2022-05-25 07:25:00.529 | INFO     | __main__:train:301 - global step: 300, epoch: 2, batch: 150, loss: 0.04770, accuracy: 0.96875, speed: 2.02 step/s
2022-05-25 07:25:05.256 | INFO     | __main__:train:307 - eval loss: 0.23039, accuracy: 0.93500
2022-05-25 07:25:16.818 | INFO     | __main__:train:301 - global step: 320, epoch: 3, batch: 20, loss: 0.04312, accuracy: 0.96875, speed: 2.04 step/s
2022-05-25 07:25:26.700 | INFO     | __main__:train:301 - global step: 340, epoch: 3, batch: 40, loss: 0.05103, accuracy: 0.96875, speed: 2.02 step/s
2022-05-25 07:25:36.588 | INFO     | __main__:train:301 - global step: 360, epoch: 3, batch: 60, loss: 0.12114, accuracy: 0.87500, speed: 2.02 step/s
2022-05-25 07:25:46.443 | INFO     | __main__:train:301 - global step: 380, epoch: 3, batch: 80, loss: 0.01080, accuracy: 1.00000, speed: 2.03 step/s
2022-05-25 07:25:56.228 | INFO     | __main__:train:301 - global step: 400, epoch: 3, batch: 100, loss: 0.14839, accuracy: 0.96875, speed: 2.04 step/s
2022-05-25 07:26:00.953 | INFO     | __main__:train:307 - eval loss: 0.22589, accuracy: 0.94083
2022-05-25 07:26:12.483 | INFO     | __main__:train:301 - global step: 420, epoch: 3, batch: 120, loss: 0.14986, accuracy: 0.96875, speed: 2.05 step/s
2022-05-25 07:26:22.289 | INFO     | __main__:train:301 - global step: 440, epoch: 3, batch: 140, loss: 0.00687, accuracy: 1.00000, speed: 2.04 step/s
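
With --save_best_model enabled, the parameters of the best model are written to saved_dir (the checkpoint folder by default). As a rough sketch of how such a checkpoint could be used for prediction, assuming the saved file is a plain PyTorch state dict named pytorch_model.bin and that the weights match a standard BertForSequenceClassification head (if the project defines its own model class, load the weights with that class instead):

import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Paths below follow the defaults used in this document; adjust them as needed.
labels = [line.strip() for line in open("./data/ChnSentiCorp/labels.txt", encoding="utf-8") if line.strip()]
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=len(labels))
model.load_state_dict(torch.load("./checkpoint/pytorch_model.bin", map_location="cpu"))
model.eval()

inputs = tokenizer("这家酒店的服务真不错", max_length=128, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(dim=-1).item()])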

When text pair classification is required, simply add the --is_text_pair option. Taking the AFQMC Ant Financial semantic similarity dataset from CLUEbenchmark as an example, you can run the following command for training:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_classifier.py --train_data_file ./data/AFQMC/train.txt --dev_data_file ./data/AFQMC/dev.txt --label_file ./data/AFQMC/labels.txt --is_text_pair --save_best_model --epochs 3 --batch_size 32

Training on different datasets gives the following results on the validation sets:

Task     ChnSentiCorp    AFQMC      TNEWS
dev-acc  0.94083         0.74305    0.56990

TNEWS is the Toutiao news headline classification dataset in CLUEbenchmark.

CLUEbenchmark data set link: https://github.com/CLUEbenchmark/CLUE

3. Text matching

This project shows how to fine-tune a model with the Sentence-BERT structure to complete Chinese text matching tasks. Sentence-BERT adopts a Siamese network structure: the Query and the Title are fed into two BERT encoders that share parameters, producing their respective token embeddings. The token embeddings are then pooled (the paper uses mean pooling), and the outputs are denoted u and v. Finally, the three vectors (u, v, |u-v|) are concatenated and fed into a linear classifier for classification.
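
A minimal sketch of this structure (the pooling and classification head here follow the description above and are not necessarily identical to the implementation in run_sentencebert.py):

import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class SentenceBertClassifier(nn.Module):
    def __init__(self, model_name="bert-base-chinese", num_labels=2):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)   # shared by both sentences
        self.classifier = nn.Linear(self.encoder.config.hidden_size * 3, num_labels)

    def mean_pooling(self, token_embeddings, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    def forward(self, query_inputs, title_inputs):
        u = self.mean_pooling(self.encoder(**query_inputs).last_hidden_state, query_inputs["attention_mask"])
        v = self.mean_pooling(self.encoder(**title_inputs).last_hidden_state, title_inputs["attention_mask"])
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)   # (u, v, |u - v|)
        return self.classifier(features)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = SentenceBertClassifier()
query = tokenizer("花呗如何还款", return_tensors="pt")
title = tokenizer("花呗怎么还钱", return_tensors="pt")
logits = model(query, title)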

When using NLI data for training, you need to add the --is_nli option and --label_file LABEL_FILE. The training command is as follows:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_sentencebert.py --train_data_file ./data/CMNLI/train.txt --dev_data_file ./data/CMNLI/dev.txt --label_file ./data/CMNLI/labels.txt --is_nli --save_best_model --epochs 3 --batch_size 32

Training on different datasets gives the following results on the validation sets:

Task     LCQMC      Chinese-MNLI    Chinese-SNLI
dev-acc  0.86730    0.71105         0.80567

Chinese-MNLI and Chinese-SNLI links: https://github.com/zejunwang1/CSTS

4. Semantic understanding

4.1 SimCSE

The SimCSE model is suitable for matching and retrieval scenarios that lack supervised data but have a large amount of unsupervised data. This project implements the SimCSE unsupervised method and trains the sentence vector representation model on Chinese Wikipedia sentence data.

For more information about SimCSE, please refer to the paper: https://arxiv.org/abs/2104.08821
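
The core idea of the unsupervised method is to encode each sentence twice: because the dropout masks differ between the two forward passes, the two resulting vectors form a positive pair, while the other sentences in the batch serve as negatives. A rough sketch of such a loss (the --dropout, --scale and --margin options in the command below suggest a scaled cosine similarity with an additive margin; the exact formulation in run_simcse.py may differ):

import torch
import torch.nn.functional as F

def simcse_unsup_loss(z1, z2, scale=20.0, margin=0.2):
    # z1, z2: two encodings of the same batch of sentences obtained with
    # different dropout masks, shape (batch_size, hidden_size).
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t()                                            # cosine similarity matrix
    sim = sim - margin * torch.eye(sim.size(0), device=sim.device)   # subtract margin on the positives
    labels = torch.arange(sim.size(0), device=sim.device)            # positives lie on the diagonal
    return F.cross_entropy(sim * scale, labels)

# usage: z1 = pool(encoder(**batch)); z2 = pool(encoder(**batch))  # two passes, two dropout masks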

150,000 sentences are extracted from Chinese Wikipedia and saved in the wiki_sents.txt file under the data/zhwiki/ folder. Run the following command to train a sentence vector representation model with the unsupervised SimCSE method, based on Tencent UER's open-source pre-trained language model uer/chinese_roberta_L-6_H-128 (https://huggingface.co/uer/chinese_roberta_L-6_H-128), and evaluate it on the Chinese-STS-B validation set (https://github.com/zejunwang1/CSTS):

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_simcse.py --pretrained_model_name_or_path uer/chinese_roberta_L-6_H-128 --train_data_file ./data/zhwiki/wiki_sents.txt --dev_data_file ./data/STS-B/sts-b-dev.txt --learning_rate 5e-5 --epochs 1 --dropout 0.1 --margin 0.2 --scale 20 --batch_size 32

The SimCSE sentence vector pre-training model simcse_tiny_chinese_wiki, with num_hidden_layers=6 hidden layers and hidden_size=128, can be obtained from the following link:

model_name                            link
WangZeJun/simcse-tiny-chinese-wiki    https://huggingface.co/WangZeJun/simcse-tiny-chinese-wiki
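
A minimal sketch of how such a sentence vector model can be used to encode sentences and compute similarity (mean pooling over the last hidden states is assumed here; match the pooling to how the model was trained):

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("WangZeJun/simcse-tiny-chinese-wiki")
model = AutoModel.from_pretrained("WangZeJun/simcse-tiny-chinese-wiki")
model.eval()

sentences = ["今天天气不错", "今天天气很好"]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)   # mean pooling
print(F.cosine_similarity(embeddings[0], embeddings[1], dim=0))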

4.2 In-Batch Negatives

All semantically similar text pairs are extracted from the Harbin Institute of Technology LCQMC dataset, the Google PAWS-X dataset, and the Peking University paraphrase corpus PKU-Paraphrase-Bank (https://github.com/zejunwang1/CSTS) to form a training set, which is saved at data/batchneg/paraphrase_lcqmc_semantic_pairs.txt.

Run the following command to train a sentence vector representation model with the In-batch negatives strategy (each text pair in a batch is a positive example, and the other texts in the same batch serve as its negatives), based on Tencent UER's open-source pre-trained language model uer/chinese_roberta_L-6_H-128, on the four GPU cards 0, 1, 2 and 3, and evaluate it on the Chinese-STS-B validation set:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 run_batchneg.py --pretrained_model_name_or_path uer/chinese_roberta_L-6_H-128 --train_data_file ./data/batchneg/paraphrase_lcqmc_semantic_pairs.txt --dev_data_file ./data/STS-B/sts-b-dev.txt --learning_rate 5e-5 --epochs 3 --margin 0.2 --scale 20 --batch_size 64 --mean_loss

Use the model obtained above as a warm start and continue In-batch negatives training on the dataset data/batchneg/domain_finetune.txt:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_batchneg.py --pretrained_model_name_or_path uer/chinese_roberta_L-6_H-128 --init_from_ckpt ./checkpoint/pytorch_model.bin --train_data_file ./data/batchneg/domain_finetune.txt --dev_data_file ./data/STS-B/sts-b-dev.txt --learning_rate 1e-5 --epochs 1 --margin 0.2 --scale 20 --batch_size 32 --mean_loss

This yields a sentence vector pre-training model with num_hidden_layers=6 hidden layers and hidden_size=128:

model_name                        link
WangZeJun/batchneg-tiny-chinese   https://huggingface.co/WangZeJun/batchneg-tiny-chinese

5. Sequence labeling

This project demonstrates how to fine-tune pre-trained models such as BERT to complete sequence labeling tasks. Taking Chinese named entity recognition as an example, training and testing are conducted on four datasets: msra, ontonote4, resume and weibo. The training set and validation set of each dataset are preprocessed into the following format, where each line is a JSON string composed of text and label fields.

{"text": ["我", "们", "的", "藏", "品", "中", "有", "几", "十", "册", "为", "北", "京", "图", "书", "馆", "等", "国", "家", "级", "藏", "馆", "所", "未", "藏", "。"], "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-NS", "I-NS", "I-NS", "I-NS", "I-NS", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
{"text": ["由", "于", "这", "一", "时", "期", "战", "争", "频", "繁", ",", "条", "件", "艰", "苦", ",", "又", "遭", "国", "民", "党", "毁", "禁", ",", "传", "世", "量", "稀", "少", ",", "购", "藏", "不", "易", "。"], "label": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-NT", "I-NT", "I-NT", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}

Run the following command to perform single-machine multi-card distributed training with the BERT+Linear structure on the msra dataset, and evaluate it on the validation set:
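
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_ner.py --train_data_file ./data/ner/msra/train.json --dev_data_file ./data/ner/msra/dev.json --label_file ./data/ner/msra/labels.txt --tag bios --learning_rate 5e-5 --save_best_model --batch_size 32

(This command mirrors the BERT+CRF command further below, with the --use_crf and --crf_learning_rate options removed; adjust it if the script's actual options differ.)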

Most parameters are the same as those introduced in text classification. The following are unique parameters:

  • tag: Optional, entity tagging scheme; the bios and bio schemes are supported. The default is bios.

  • use_crf: Optional, whether to use a CRF layer. When --use_crf is added to the training command, the BERT+CRF model structure is used; otherwise the BERT+Linear model structure is used (see the sketch after this list).

  • crf_learning_rate: Optional, the initial learning rate of the CRF layer parameters, the default is 5e-5.
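
A rough sketch of the two model structures (using the pytorch-crf package for the CRF layer; the actual model classes in run_ner.py may differ):

import torch
from torch import nn
from transformers import BertModel
from torchcrf import CRF  # pip install pytorch-crf

class BertForNER(nn.Module):
    def __init__(self, num_tags, model_name="bert-base-chinese", use_crf=False):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_tags)   # BERT+Linear head
        self.crf = CRF(num_tags, batch_first=True) if use_crf else None       # optional CRF layer

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(hidden)
        if self.crf is not None:
            if labels is not None:
                return -self.crf(emissions, labels, mask=attention_mask.bool())   # CRF negative log-likelihood
            return self.crf.decode(emissions, mask=attention_mask.bool())          # best tag sequences
        if labels is not None:
            # padding positions would normally be masked out or ignored here
            return nn.functional.cross_entropy(emissions.view(-1, emissions.size(-1)), labels.view(-1))
        return emissions.argmax(dim=-1)

# num_tags is the number of labels in labels.txt (the value here is a placeholder)
model = BertForNER(num_tags=13, use_crf=True)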

When training using the BERT+CRF architecture, run the following command:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 run_ner.py --train_data_file ./data/ner/msra/train.json --dev_data_file ./data/ner/msra/dev.json --label_file ./data/ner/msra/labels.txt --tag bios --learning_rate 5e-5 --save_best_model --batch_size 32 --use_crf --crf_learning_rate 1e-4

F1 scores of the models on the different validation sets:

Model        Msra      Resume    Ontonote    Weibo
BERT+Linear  0.94179   0.95643   0.80206     0.70588
BERT+CRF     0.94265   0.95818   0.80257     0.72215

Among them, Msra, Resume and Ontonote were trained for 3 epochs, while Weibo was trained for 5 epochs. For Resume, Ontonote and Weibo, logging_steps and save_steps were both set to 10. For all datasets, the initial learning rate of the BERT parameters was set to 5e-5, the initial learning rate of the CRF parameters was set to 1e-4, and batch_size was set to 32.

