Transformers Examples

Author: Hugging Face | Compiled by: VK | Source: GitHub

This section collects several examples of using the library. All of these examples work for several models, making use of the very similar API shared between the different models.

Important: To run the latest version of the examples, you must install the library from source and install some example-specific requirements. Perform the following steps in a new virtual environment:

git clone https://github.com/huggingface/transformers
cd transformers
pip install .
pip install -r ./examples/requirements.txt

Section and description:

TensorFlow 2.0 models on GLUE: Examples running a BERT TensorFlow 2.0 model on the GLUE tasks.
Language model training: Fine-tune (or train from scratch) the library models on a text dataset. Causal language modeling for GPT/GPT-2, masked language modeling for BERT/RoBERTa.
Language generation: Conditional text generation using the library's autoregressive models: GPT, GPT-2, Transformer-XL and XLNet.
GLUE: Examples running BERT/XLM/XLNet/RoBERTa on the 9 GLUE tasks. The examples feature distributed training and half-precision.
SQuAD: Question answering with BERT/RoBERTa/XLNet/XLM; the examples feature distributed training.
Multiple choice: Examples running BERT/XLNet/RoBERTa on the SWAG/RACE/ARC tasks.
Named entity recognition: Named entity recognition (NER) with BERT on the CoNLL-2003 dataset; the example features distributed training.
XNLI: Examples running BERT/XLM on the XNLI benchmark.
Adversarial evaluation of model performance: Testing a model with adversarial evaluation of natural language inference on the Heuristic Analysis for NLI Systems (HANS) dataset (McCoy et al., 2019).

TensorFlow 2.0 BERT model on GLUE

Based on the script run_tf_glue.py.

Fine-tunes a TensorFlow 2.0 BERT model for sequence classification on the MRPC task of the GLUE benchmark.

The script has an option for mixed precision (Automatic Mixed Precision / AMP) to run the model on Tensor Cores (NVIDIA Volta/Turing GPUs) and future hardware, as well as an option for XLA, which uses the XLA compiler to reduce model runtime. Toggle these options with the USE_XLA and USE_AMP variables in the script. These options and the benchmark below were provided by @tlkh.
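
As a rough illustration, the two switches typically map onto the following TensorFlow 2.0 settings. This is a hedged sketch of the mechanism, not necessarily the script's exact code:

import tensorflow as tf

# Illustrative values; in run_tf_glue.py these are plain variables near the top of the script.
USE_XLA = False
USE_AMP = False

# XLA: compile the graph with the XLA JIT compiler to reduce runtime.
tf.config.optimizer.set_jit(USE_XLA)
# AMP: let TensorFlow automatically run eligible ops in float16 on Tensor Cores.
tf.config.optimizer.set_experimental_options({"auto_mixed_precision": USE_AMP})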

Quick benchmark results from the script (no other modifications):

GPU    Mode    Time (2nd epoch)    Val acc (3 runs)
Titan V FP32 41s 0.8438 / 0.8281 / 0.8333
Titan V AMP 26s 0.8281 / 0.8568 / 0.8411
V100 FP32 35s 0.8646 / 0.8359 / 0.8464
V100 AMP 22s 0.8646 / 0.8385 / 0.8411
1080 Ti FP32 55s -

For the same hardware and hyperparameters (using the same batch size), mixed-precision (AMP) greatly reduces training time.

Language model training

Based on the script run_language_modeling.py.

Fine-tune (or train from scratch) the library models for language modeling on a text dataset, for GPT, GPT-2, BERT and RoBERTa (DistilBERT to be added). GPT and GPT-2 are fine-tuned with a causal language modeling (CLM) loss, while BERT and RoBERTa are fine-tuned with a masked language modeling (MLM) loss.
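
To make the causal objective concrete, here is a minimal sketch (not the script's code) of how the CLM loss is computed: the prediction at each position is scored against the next token, i.e. the labels are the inputs shifted by one position. The masked objective is sketched in the RoBERTa/BERT section below.

import torch
import torch.nn.functional as F

# Toy token ids and fake logits; the values are illustrative only.
vocab_size = 100
input_ids = torch.tensor([[5, 17, 23, 42, 7]])
logits = torch.randn(1, input_ids.size(1), vocab_size)

# Causal LM loss: logits at position t are compared with the token at position t+1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = input_ids[:, 1:].reshape(-1)
clm_loss = F.cross_entropy(shift_logits, shift_labels)
print(clm_loss)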

Before running the following examples, you should obtain a file containing the text on which the language model will be trained or fine-tuned. A good example of such text is the WikiText-2 dataset (https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/).

We will refer to two different files: $TRAIN_FILE, which contains the text used for training, and $TEST_FILE, which contains the text used for evaluation.

GPT-2/GPT and causal language modeling

The following example fine-tunes GPT-2 on WikiText-2. We use the raw WikiText-2 (no tokens were replaced before tokenization). The loss here is the causal language modeling loss.

export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE

Training takes about half an hour on a single K80 GPU, plus about one minute for evaluation. Fine-tuning reaches a perplexity of about 20 on the dataset.
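
Perplexity is simply the exponential of the evaluation cross-entropy loss the script reports; as a quick sanity check (the loss value below is illustrative, not taken from this run):

import math

eval_loss = 3.0                     # illustrative cross-entropy loss in nats
perplexity = math.exp(eval_loss)
print(perplexity)                   # ~20.1, roughly the value quoted above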

RoBERTa/BERT and masked language modeling

The following example fine-tunes RoBERTa on WikiText-2, again using the raw WikiText-2. The loss is different here because BERT and RoBERTa have a bidirectional mechanism; we therefore use the same loss that was used during their pre-training: masked language modeling.

Following the RoBERTa paper, we use dynamic masking rather than static masking. The model may therefore converge slightly more slowly (over-fitting takes more epochs).

We use the --mlm flag so that the script switches to its masked language modeling loss.
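
For reference, the sketch below illustrates what dynamic masking means in practice: a fresh mask is sampled every time a batch is seen, and the usual 80% [MASK] / 10% random token / 10% unchanged rule is applied. This is a hedged illustration of the procedure, not the script's own implementation.

import torch

def dynamic_mask(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    # Sample a fresh mask for this batch (hence "dynamic" rather than fixed at
    # preprocessing time), then apply the 80/10/10 replacement rule.
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked] = -100                      # the loss is computed on masked positions only
    inputs = input_ids.clone()
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    inputs[replaced] = mask_token_id            # 80%: replace with the [MASK] token id
    randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
    inputs[randomized] = torch.randint(vocab_size, labels.shape)[randomized]  # 10%: random token
    return inputs, labels                       # remaining ~10%: left unchanged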

export TRAIN_FILE=/path/to/dataset/wiki.train.raw
export TEST_FILE=/path/to/dataset/wiki.test.raw

python run_language_modeling.py \
    --output_dir=output \
    --model_type=roberta \
    --model_name_or_path=roberta-base \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --do_eval \
    --eval_data_file=$TEST_FILE \
    --mlm

Language Generation

Based on the script run_generation.py.

Conditional text generation using the library's autoregressive models: GPT, GPT-2, Transformer-XL, XLNet and CTRL. Our official demo (https://transformer.huggingface.co) uses a similar script, and there you can try out the various models the library provides.

Example usage:

python run_generation.py \
    --model_type=gpt2 \
    --model_name_or_path=gpt2
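
The same can be done programmatically with the library's generate() API; the prompt and sampling settings below are illustrative choices, not the script's defaults:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Encode a prompt and sample a continuation conditioned on it.
input_ids = tokenizer.encode("The library provides", return_tensors="pt")
output = model.generate(input_ids, max_length=40, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))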

GLUE

Based on the script run_glue.py.

Fine-tunes the library models for sequence classification on the GLUE benchmark. The script can fine-tune the following models: BERT, XLM, XLNet and RoBERTa.

GLUE is made up of nine different tasks. We obtained the following results on the dev set of the benchmark with the uncased BERT base model ("bert-base-uncased"). All experiments ran on a single V100 GPU with a total train batch size between 16 and 64. Some of these tasks have small datasets, and training can lead to high variance in the results between different runs. We report, for each metric, the median of 5 runs (with different random seeds).

Task    Metric    Result
CoLA    Matthews correlation coefficient    49.23
SST-2    Accuracy    91.97
MRPC    F1 / Accuracy    89.47/85.29
STS-B    Pearson / Spearman correlation    83.95/83.70
QQP    Accuracy / F1    88.40/84.31
MNLI    Matched accuracy / Mismatched accuracy    80.61/81.08
QNLI    Accuracy    87.46
RTE    Accuracy    61.73
WNLI    Accuracy    45.07

Some of these results are significantly different from the ones reported on the GLUE benchmark website. For QQP and WNLI, please refer to FAQ #12 on the website (https://gluebenchmark.com/faq).
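
As a trivial worked example of the reporting procedure described above (the median over five runs with different seeds), with made-up numbers:

import statistics

# Illustrative MRPC F1 scores from five runs with different random seeds.
runs = [89.1, 89.5, 88.9, 89.8, 89.4]
print(statistics.median(runs))  # 89.4 would be the reported value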

Before running any of these GLUE tasks, you should download the GLUE data (https://gluebenchmark.com/tasks) by running this script (https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to a directory $GLUE_DIR.

export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC

python run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/$TASK_NAME/

where the task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.

The dev set results will be written to a text file eval_results.txt in the specified output_dir. In the case of MNLI, since there are two separate dev sets (matched and mismatched), there will be a separate output folder called /tmp/MNLI-MM/ in addition to /tmp/MNLI/.
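
If you want to pick the numbers up programmatically, here is a minimal sketch for parsing that file, assuming the key = value layout shown in the result listings below:

# Hypothetical path; substitute the output_dir you passed to run_glue.py.
results = {}
with open("/tmp/MRPC/eval_results.txt") as handle:
    for line in handle:
        key, value = line.strip().split(" = ")
        results[key] = float(value)
print(results)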

Training with apex and 16-bit precision has not been tested on any GLUE task other than MRPC, MNLI, CoLA and SST-2. The following section provides details on how to run half-precision training with MRPC. That said, running half-precision training with the remaining GLUE tasks should not pose any problems, since the data processor for each task inherits from the base DataProcessor class.

MRPC

Fine-tuning example

The following example fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC). It runs in less than 10 minutes on a single K80, and in 27 seconds on a single Tesla V100 16GB with apex installed.

Before running any of these GLUE tasks, you should download the GLUE data (https://gluebenchmark.com/tasks) by running this script (https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e) and unpack it to a directory $GLUE_DIR.

export GLUE_DIR=/path/to/glue

python run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/ 

Our tests with the hyper-parameters of the original implementation (whose evaluation results are documented at https://github.com/google-research/bert#sentence-and-sentence-pair-classification-tasks) give results between 84% and 88%.

Using Apex and mixed precision

Using Apex and 16-bit precision, fine-tuning on MRPC takes only 27 seconds. First install apex (https://github.com/NVIDIA/apex), then run the following example:

export GLUE_DIR=/path/to/glue

python run_glue.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --task_name MRPC \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/MRPC/ \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir /tmp/mrpc_output/ \
  --fp16

Distributed training

Here is an example of distributed training on 8 V100 GPUs. The model used is the BERT whole-word-masking model; it reaches F1 > 92 on MRPC.

export GLUE_DIR=/path/to/glue

python -m torch.distributed.launch \
    --nproc_per_node 8 run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --task_name MRPC \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/MRPC/ \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir /tmp/mrpc_output/

Training with these hyper-parameters gives the following results:

acc = 0.8823529411764706
acc_and_f1 = 0.901702786377709
eval_loss = 0.3418912578906332
f1 = 0.9210526315789473
global_step = 174
loss = 0.07231863956341798

MNLI

The following example uses the BERT-large, uncased, whole-word-masking model and fine-tunes it on the MNLI task.

export GLUE_DIR=/path/to/glue

python -m torch.distributed.launch \
    --nproc_per_node 8 run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-cased \
    --task_name mnli \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir $GLUE_DIR/MNLI/ \
    --max_seq_length 128 \
    --per_gpu_train_batch_size 8 \
    --learning_rate 2e-5 \
    --num_train_epochs 3.0 \
    --output_dir output_dir

The results are as follows:

***** Eval results *****
  acc = 0.8679706601466992
  eval_loss = 0.4911287787382479
  global_step = 18408
  loss = 0.04755385363816904

***** Eval results *****
  acc = 0.8747965825874695
  eval_loss = 0.45516540421714036
  global_step = 18408
  loss = 0.04755385363816904

Multiple choice

Based on the script run_multiple_choice.py.

Fine-tuning on SWAG

Download the SWAG data (https://github.com/rowanz/swagaf/tree/master/data).

# Training on 4 Tesla V100 (16GB) GPUs
export SWAG_DIR=/path/to/swag_data_dir
python ./examples/run_multiple_choice.py \
--model_type roberta \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $SWAG_DIR \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output

Training with the hyper-parameters defined above produces the following results:

***** Eval results *****
eval_acc = 0.8338998300509847
eval_loss = 0.44457291918821606

SQuAD

Based on the script run_squad.py.

Fine-tuning BERT on SQuAD 1.0

This example code fine-tunes BERT on the SQuAD 1.0 dataset. It runs in 24 minutes (with BERT-base) or 68 minutes (with BERT-large) on a single Tesla V100 16GB. The SQuAD data should be downloaded and saved in a $SQUAD_DIR directory.

For SQuAD 2.0, you will also need the corresponding train-v2.0.json and dev-v2.0.json files.

export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
  --model_type bert \
  --model_name_or_path bert-base-cased \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v1.1.json \
  --predict_file $SQUAD_DIR/dev-v1.1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/debug_squad/ 

Training with the previously defined hyper-parameters produces the following results:

F1 = 88.52 
EXACT_MATCH = 81.22 

Distributed training

Here is an example using distributed training on 8 V100 GPUs and the BERT whole-word-masking uncased model; it reaches F1 > 93 on SQuAD 1.1.

python -m torch.distributed.launch --nproc_per_node=8 ./examples/run_squad.py \
    --model_type bert \
    --model_name_or_path bert-large-uncased-whole-word-masking \
    --do_train \
    --do_eval \
    --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./examples/models/wwm_uncased_finetuned_squad/ \
    --per_gpu_eval_batch_size=3   \
    --per_gpu_train_batch_size=3

Training with the previously defined hyper-parameters yields the following results:

F1 = 93.15 
EXACT_MATCH = 86.91 

This fine-tuned model is also available as a checkpoint in the library and can be referenced with the string bert-large-uncased-whole-word-masking-finetuned-squad.
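
As a hedged sketch of how that checkpoint can be used directly (the question and context below are made up for illustration):

from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")
result = qa(question="What dataset was the model fine-tuned on?",
            context="The checkpoint was fine-tuned on the SQuAD 1.1 question answering dataset.")
print(result["answer"], result["score"])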

Fine-tuning XLNet on SQuAD

This example code fine-tunes XLNet on both the SQuAD 1.0 and SQuAD 2.0 datasets. See above for how to download the SQuAD data.

SQuAD 1.0 command:
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --do_lower_case \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --learning_rate 3e-5 \
    --num_train_epochs 2 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./wwm_cased_finetuned_squad/ \
    --per_gpu_eval_batch_size=4  \
    --per_gpu_train_batch_size=4   \
    --save_steps 5000

SQuAD 2.0 command:
export SQUAD_DIR=/path/to/SQUAD

python run_squad.py \
    --model_type xlnet \
    --model_name_or_path xlnet-large-cased \
    --do_train \
    --do_eval \
    --version_2_with_negative \
    --train_file $SQUAD_DIR/train-v2.0.json \
    --predict_file $SQUAD_DIR/dev-v2.0.json \
    --learning_rate 3e-5 \
    --num_train_epochs 4 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --output_dir ./wwm_cased_finetuned_squad/ \
    --per_gpu_eval_batch_size=2  \
    --per_gpu_train_batch_size=2   \
    --save_steps 5000

Larger batch sizes improve performance but consume more memory.
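
If memory rather than compute is the limit, the --gradient_accumulation_steps flag used in the multiple-choice example above raises the effective batch size without raising memory use. The arithmetic is simply the following (the values are illustrative, not a recommendation):

per_gpu_train_batch_size = 2        # what fits in memory per GPU
n_gpus = 1
gradient_accumulation_steps = 12    # illustrative value only
effective_batch_size = per_gpu_train_batch_size * n_gpus * gradient_accumulation_steps
print(effective_batch_size)         # 24 examples per optimizer step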

SQuAD 1.0 results with the previously defined hyper-parameters:
{
"exact": 85.45884578997162,
"f1": 92.5974600601065,
"total": 10570,
"HasAns_exact": 85.45884578997162,
"HasAns_f1": 92.59746006010651,
"HasAns_total": 10570
}

SQuAD 2.0 results with the previously defined hyper-parameters:
{
"exact": 80.4177545691906,
"f1": 84.07154997729623,
"total": 11873,
"HasAns_exact": 76.73751686909581,
"HasAns_f1": 84.05558584352873,
"HasAns_total": 5928,
"NoAns_exact": 84.0874684608915,
"NoAns_f1": 84.0874684608915,
"NoAns_total": 5945
}

XNLI

Based on the script run_xnli.py (https://github.com/huggingface/transformers/blob/master/examples/run_xnli.py).

XNLI (https://www.nyu.edu/projects/bowman/xnli/) is a crowd-sourced dataset based on MultiNLI (http://www.nyu.edu/projects/bowman/multinli/). It is an evaluation benchmark for cross-lingual text representations: pairs of text are labeled with textual entailment annotations in 15 different languages, including high-resource languages such as English and low-resource languages such as Swahili.

Fine-tuning on XNLI

This example code fine-tunes mBERT (multilingual BERT) on the XNLI dataset. It runs in 106 minutes on a single Tesla V100 16GB. The XNLI data should be downloaded and unpacked into a $XNLI_DIR directory.

export XNLI_DIR=/path/to/XNLI

python run_xnli.py \
  --model_type bert \
  --model_name_or_path bert-base-multilingual-cased \
  --language de \
  --train_language en \
  --do_train \
  --do_eval \
  --data_dir $XNLI_DIR \
  --per_gpu_train_batch_size 32 \
  --learning_rate 5e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 128 \
  --output_dir /tmp/debug_xnli/ \
  --save_steps -1

Training with the previously defined hyper-parameters produces the following results:

ACC = 0.7093812375249501 

MM-IMDB

Based on the script run_mmimdb.py (https://github.com/huggingface/transformers/blob/master/examples/mm-imdb/run_mmimdb.py).

MM-IMDb (http://lisi1.unal.edu.co/mmimdb/) is a multimodal dataset of about 26,000 movies that includes images, plots and other metadata.

Training on MM-IMDb

python run_mmimdb.py \
    --data_dir /path/to/mmimdb/dataset/ \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --output_dir /path/to/save/dir/ \
    --do_train \
    --do_eval \
    --max_seq_len 512 \
    --gradient_accumulation_steps 20 \
    --num_image_embeds 3 \
    --num_train_epochs 100 \
    --patience 5

Adversarial evaluation of model performance

This is an example of adversarial evaluation of a natural language inference model with the Heuristic Analysis for NLI Systems (HANS) dataset. The example was provided by Nafise Sadat Moosavi (https://github.com/ns-moosavi).

The HANS dataset can be downloaded from this location (https://github.com/tommccoy1/hans).

This is an example of how test_hans.py can be used:

export HANS_DIR=path-to-hans
export MODEL_TYPE=type-of-the-model-e.g.-bert-roberta-xlnet-etc
export MODEL_PATH=path-to-the-model-directory-that-is-trained-on-NLI-e.g.-by-using-run_glue.py

python examples/hans/test_hans.py \
        --task_name hans \
        --model_type $MODEL_TYPE \
        --do_eval \
        --do_lower_case \
        --data_dir $HANS_DIR \
        --model_name_or_path $MODEL_PATH \
        --max_seq_length 128 \
        --output_dir $MODEL_PATH

This will create the file hans_predictions.txt in MODEL_PATH, which can then be evaluated with hans/evaluate_heur_output.py from the HANS dataset.

The results on the HANS dataset for a BERT model trained on MNLI with batch size 8 and random seed 42 are as follows:

Heuristic entailed results:
lexical_overlap: 0.9702
subsequence: 0.9942
constituent: 0.9962

Heuristic non-entailed results:
lexical_overlap: 0.199
subsequence: 0.0396
constituent: 0.118

Original link: https://huggingface.co/transformers/examples.html
