NLP (66): Fine-tuning a BERT model with the Trainer in HuggingFace

  In the past, when we fine-tuned BERT models with HuggingFace, the code we wrote was fairly involved: data processing, token encoding, model loading, and model training all had to be handled by hand, as anyone working in NLP will know from experience. In fact, HuggingFace provides the datasets module (for data processing) and the Trainer class, which make model training much more convenient. For the datasets module, please refer to the article NLP (62) on the use of Datasets in HuggingFace.
  This article will introduce how to use the Trainer in HuggingFace to fine-tune a BERT model.

Trainer

  Trainer is HuggingFace's model training class; its documentation is at: https://huggingface.co/docs/transformers/main_classes/trainer .
  The Trainer constructor accepts the following parameters:

model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None
args: TrainingArguments = None
data_collator: typing.Optional[DataCollator] = None
train_dataset: typing.Optional[torch.utils.data.dataset.Dataset] = None
eval_dataset: typing.Union[torch.utils.data.dataset.Dataset, typing.Dict[str, torch.utils.data.dataset.Dataset], NoneType] = None
tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None
model_init: typing.Union[typing.Callable[[], transformers.modeling_utils.PreTrainedModel], NoneType] = None
compute_metrics: typing.Union[typing.Callable[[transformers.trainer_utils.EvalPrediction], typing.Dict], NoneType] = None
callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None
optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None)
preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None )

Parameter explanation (a minimal Trainer construction sketch follows this list):

  • model: the pre-trained model to fine-tune
  • args: the TrainingArguments (training arguments) object
  • data_collator: collates dataset elements into a batch; default_data_collator() is used when no tokenizer is provided, otherwise a DataCollatorWithPadding is used
  • train_dataset, eval_dataset: the training set and the validation set
  • tokenizer: the tokenizer used for model training
  • model_init: a function that instantiates the model
  • compute_metrics: the function that computes evaluation metrics on the validation set
  • callbacks: a list of callbacks for the training process
  • optimizers: a tuple of the optimizer and the learning-rate scheduler used in training
  • preprocess_logits_for_metrics: preprocessing applied to the logits before the evaluation phase
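
  Putting the parameters above together, a minimal Trainer construction sketch looks like the following. This is only a sketch: my_model, my_tokenizer, train_ds, and eval_ds are placeholders for objects created as in the complete, runnable example later in this article.

from transformers import Trainer, TrainingArguments

# Minimal sketch: my_model, my_tokenizer, train_ds and eval_ds are placeholders
# for objects created as in the full example later in this article.
trainer = Trainer(
    model=my_model,                                  # the pre-trained model to fine-tune
    args=TrainingArguments(output_dir="./output"),   # training arguments (see below)
    train_dataset=train_ds,                          # tokenized training set
    eval_dataset=eval_ds,                            # tokenized validation set
    tokenizer=my_tokenizer,                          # also used to build the default data collator
    # compute_metrics=...,                           # optional metric function for evaluation
)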

  TrainingArguments is the training-arguments class; its documentation is at: https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments . It has a very large number of parameters (98 of them in transformers version 4.32.1!), so only a few common ones are listed here (a short configuration sketch follows the parameter explanation below):

output_dir: str
overwrite_output_dir: bool = False
evaluation_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no'
per_gpu_train_batch_size: typing.Optional[int] = None
per_gpu_eval_batch_size: typing.Optional[int] = None
learning_rate: float = 5e-05
num_train_epochs: float = 3.0
logging_dir: typing.Optional[str] = None
logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps'
save_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps'
save_steps: float = 500
report_to: typing.Optional[typing.List[str]] = None

Parameter explanation:

  • output_dir: the output directory for the model
  • evaluation_strategy: the evaluation strategy for the model
  1. "no": do not evaluate the model
  2. "steps": evaluate every given number of training steps (the number of steps must also be specified)
  3. "epoch": evaluate after each training epoch
  • per_gpu_train_batch_size, per_gpu_eval_batch_size: the batch sizes of the training set and the evaluation set on each GPU (in current versions these are named per_device_train_batch_size and per_device_eval_batch_size, which apply per GPU/TPU/CPU device and are what the code below uses)
  • learning_rate: the learning rate
  • logging_dir: the directory for log output
  • logging_strategy: the logging strategy; again one of no, steps, or epoch, with the same meanings as above
  • save_strategy: the model-saving strategy; again one of no, steps, or epoch, with the same meanings as above
  • report_to: where the important quantities of training and evaluation (such as loss and accuracy) are reported; the options are azure_ml, clearml, codecarbon, comet_ml, dagshub, flyte, mlflow, neptune, tensorboard, and wandb; use all to report to all of them and no to report to none
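
  For example, when the "steps" strategy is chosen for evaluation, logging, or saving, the corresponding step counts also have to be given. A minimal sketch with purely illustrative values (the output directory name is also only an example):

from transformers import TrainingArguments

# minimal sketch with illustrative values: evaluate, log and save every 500 steps
training_args = TrainingArguments(
    output_dir="./output",            # checkpoints are written here
    evaluation_strategy="steps",
    eval_steps=500,                   # step interval used by the "steps" evaluation strategy
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="steps",
    save_steps=500,
    report_to="tensorboard",          # write loss/metrics for TensorBoard
)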

  Below we use the Trainer to fine-tune BERT, giving sample code for text classification on an English dataset and on a Chinese dataset.

BERT fine-tuning

  Use the datasets module to load the imdb dataset (an English movie-review dataset that is often used for text classification), and load the tokenizer of the bert-base-cased pre-trained model.

import numpy as np
from transformers import AutoTokenizer, DataCollatorWithPadding
import datasets

checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_datasets = datasets.load_dataset('imdb')

  Looking at the dataset, it has three splits: train (training set), test (test set), and unsupervised. We use the train and test splits here, each with 25,000 samples.

raw_datasets
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

  Create a tokenize function for the text, set the maximum length to 300, and use DataCollatorWithPadding as the data_collator.

def tokenize_function(sample):
    return tokenizer(sample['text'], max_length=300, truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
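
  As a quick check of what the collator does (not required for the training pipeline), a couple of tokenized samples can be collated by hand; the resulting tensors are all padded to the length of the longest sequence in the batch:

# sanity check (not needed for training): take two tokenized samples,
# drop the raw text column, and collate them into one padded batch
samples = tokenized_datasets["train"].remove_columns(["text"]).select(range(2))
batch = data_collator([samples[0], samples[1]])
print({k: v.shape for k, v in batch.items()})
# input_ids / token_type_ids / attention_mask share one padded length; labels has shape (2,)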

  Load the classification model with 2 output classes.

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

  Define the compute_metrics function so that accuracy, f1, precision, and recall are reported during evaluation. Then set up the TrainingArguments and the Trainer.

from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
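
# Quick hand check of compute_metrics on made-up values (purely illustrative,
# not part of the training pipeline): EvalPrediction wraps raw logits and label ids.
# np was imported earlier as numpy.
from transformers import EvalPrediction

dummy_pred = EvalPrediction(predictions=np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]),
                            label_ids=np.array([1, 0, 0]))
print(compute_metrics(dummy_pred))  # -> dict with accuracy, f1, precision and recall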

training_args = TrainingArguments(output_dir='imdb_test_trainer', # output directory; created automatically if it does not exist
                                 evaluation_strategy="epoch",
                                 per_device_train_batch_size=32,
                                 per_device_eval_batch_size=32,
                                 learning_rate=5e-5,
                                 num_train_epochs=3,
                                 warmup_ratio=0.2,
                                 logging_dir='./imdb_train_logs',
                                 logging_strategy="epoch",
                                 save_strategy="epoch",
                                 report_to="tensorboard") 

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,  # once tokenizer is passed in, this data_collator can actually be omitted: one is created automatically from the tokenizer
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

  Start model training.

trainer.train()
Epoch  Training Loss  Validation Loss  Accuracy  F1        Precision  Recall
1      0.364300       0.223223         0.910600  0.910509  0.912276   0.910600
2      0.164800       0.204420         0.923960  0.923941  0.924375   0.923960
3      0.071000       0.241350         0.925520  0.925510  0.925759   0.925520
TrainOutput(global_step=588, training_loss=0.20003824169132986, metrics={'train_runtime': 1539.8692, 'train_samples_per_second': 48.705, 'train_steps_per_second': 0.382, 'total_flos': 1.156249755e+16, 'train_loss': 0.20003824169132986, 'epoch': 3.0})
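
  After training, checkpoints have already been written to output_dir according to save_strategy. If needed, the fine-tuned model can also be evaluated and exported explicitly; a short sketch (the directory name here is only illustrative):

# optional follow-up after trainer.train()
metrics = trainer.evaluate()                 # evaluate on eval_dataset and return the metric dict
print(metrics)
trainer.save_model("./imdb_bert_finetuned")  # illustrative path; the saved model can later be loaded with from_pretrained()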

  The above is the fine-tuning of a text-classification model on an English dataset.
  The Chinese dataset is the sougou-mini dataset (4,000 samples in the training set, 495 samples in the test set, 5 output classes), and the pre-trained model is bert-base-chinese. The code is basically the same as for the English dataset; only the pre-trained model, the dataset loading, the maximum length (128 here), and the number of output classes need to be changed. Here is where the code differs:

import numpy as np
from transformers import AutoTokenizer, DataCollatorWithPadding
import datasets

checkpoint = 'bert-base-chinese'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

data_files = {"train": "./data/sougou/train.csv", "test": "./data/sougou/test.csv"}
raw_datasets = datasets.load_dataset("csv", data_files=data_files, delimiter=",")
...
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)
...

The output is as follows:

Epoch  Training Loss  Validation Loss  Accuracy  F1        Precision  Recall
1      0.849200       0.115189         0.969697  0.969449  0.970073   0.969697
2      0.106900       0.093987         0.973737  0.973770  0.975372   0.973737
3      0.047800       0.078861         0.973737  0.973740  0.974117   0.973737

Model evaluation

  The evaluation during training above already reports the various metrics of the model.
  This article also gives standalone model-evaluation code, which will be convenient later for model quantization (dynamic quantization of the BERT model will be introduced in a later article), in order to obtain the inference metrics of the model before and after quantization.
  The evaluation code for the Chinese text-classification model is as follows:

import torch
from transformers import AutoModelForSequenceClassification

MAX_LENGTH = 128
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
checkpoint = f"./sougou_test_trainer_{MAX_LENGTH}/checkpoint-96"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

import pandas as pd

test_df = pd.read_csv("./data/sougou/test.csv")
test_df.head()
text label
0 Number of sessions Competition time Competition place Participation in national and regional champion runner-up final results The first session 1956-1957 UK 11 US Denmark 6... 0
1 Commodity attribute material soft rubber belt plus embossing process + alloy color team emblem hang tag specification 162mm quantity this series of products is unlimited release pattern... 0
2 This afternoon, Shenyang Jinde and Changchun Yatai will meet in Wulihe. In these two teams, there are mostly players from Shenyang, so this game is actually... 0
3 According to our newspaper, the Chinese Football Association has prepared the contract text for negotiation with Troussier, and also booked a room for him in Beijing, but Troussier missed the appointment! ... 0
4 Netizens click to comment to congratulate the Chinese team for winning the fifth consecutive championship. Sohu Sports News Beijing time on May 6th, the 2006 Uber Cup Badminton Tournament was held in Japan... 0
import numpy as np
import time

s_time = time.time()
true_labels, pred_labels = [], [] 
for i, row in test_df.iterrows():
    row_s_time = time.time()
    true_labels.append(row["label"])
    encoded_text = tokenizer(row['text'], max_length=MAX_LENGTH, truncation=True, padding=True, return_tensors='pt').to(device)
    # print(encoded_text)
    logits = model(**encoded_text)
    label_id = np.argmax(logits[0].detach().cpu().numpy(), axis=1)[0]
    pred_labels.append(label_id)
    if i % 100 == 0:
        print(i, (time.time() - row_s_time) * 1000, label_id)

print("avg time: ", (time.time() - s_time) * 1000 / test_df.shape[0])
0 229.3872833251953 0
100 362.0314598083496 1
200 311.16747856140137 2
300 324.13792610168457 3
400 406.9099426269531 4
avg time:  352.44047810332944
true_labels[:10]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
pred_labels[:10]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
from sklearn.metrics import classification_report

print(classification_report(true_labels, pred_labels, digits=4))
              precision    recall  f1-score   support

           0     0.9900    1.0000    0.9950        99
           1     0.9691    0.9495    0.9592        99
           2     0.9900    1.0000    0.9950        99
           3     0.9320    0.9697    0.9505        99
           4     0.9895    0.9495    0.9691        99

    accuracy                         0.9737       495
   macro avg     0.9741    0.9737    0.9737       495
weighted avg     0.9741    0.9737    0.9737       495

Summary

  This article described how to use the Trainer in HuggingFace to fine-tune a BERT model. As can be seen, fine-tuning with the Trainer keeps the code relatively simple while offering rich supporting functionality, making it an ideal way to train models.
  The project code for this article has been open-sourced on GitHub at https://github.com/percent4/PyTorch_Learning/tree/master/huggingface_learning .
  I have also set up a personal blog at https://percent4.github.io/ ; you are welcome to visit~
