[Hugging Face series learning] Fine-tuning a pretrained model

Processing the data

Load a dataset from the Hub

We can use the following code to train a sequence classifier on one batch

import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Obviously, we cannot get good results by training on just two sentences, so we need a larger dataset

  • Hugging Face also hosts many datasets on the Hub, which can be downloaded through load_dataset

    from datasets import load_dataset
    
    raw_datasets = load_dataset("glue", "mrpc")  # the MRPC dataset from the GLUE benchmark
    raw_datasets
    > DatasetDict({
        train: Dataset({
            features: ['sentence1', 'sentence2', 'label', 'idx'],
            num_rows: 3668
        })
        validation: Dataset({
            features: ['sentence1', 'sentence2', 'label', 'idx'],
            num_rows: 408
        })
        test: Dataset({
            features: ['sentence1', 'sentence2', 'label', 'idx'],
            num_rows: 1725
        })
    })
    
    • We can access each pair of sentences by indexing

      raw_train_dataset = raw_datasets["train"]
      raw_train_dataset[0]
      
      > {'idx': 0,
         'label': 1,
         'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
         'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}
      
    • If you want to know the meaning of each field of the dataset, you can inspect it through the features attribute

      raw_train_dataset.features
      > {'sentence1': Value(dtype='string', id=None),
         'sentence2': Value(dtype='string', id=None),
         'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
         'idx': Value(dtype='int32', id=None)}
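    • Since label is a ClassLabel, an integer label can be mapped back to its name, for example with the int2str() method of the datasets library (a quick check, reusing raw_train_dataset from above):

      raw_train_dataset.features["label"].int2str(1)
      > 'equivalent'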
      

Preprocessing a dataset

The tokenizer can handle pairs of sentences directly, which is what BERT expects

inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs
> {
  'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
  • token_type_ids is used to distinguish the first sentence from the second sentence; it may not be available in every tokenizer. It is only returned if the model knows what to do with it, since it has seen it during pre-training.
    • Here, BERT uses token_type_ids during pre-training
    • If we decode, we will find that the format is [CLS] sentence1 [SEP] sentence2 [SEP] (see the check below)
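    • For example, converting the input IDs back to tokens (reusing the inputs from the example above) should give something like:

      tokenizer.convert_ids_to_tokens(inputs["input_ids"])
      > ['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']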

We can have the tokenizer process a list of sentence pairs by giving it the list of first sentences and the list of second sentences

tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

This works but returns a plain dictionary (and requires enough memory to tokenize the whole dataset at once). To keep the data as a dataset and add the tokenized results as new columns, we use Dataset.map() with a tokenize function:

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
  • We pass batched=True when calling map so that the function is applied to batches of elements at once instead of one element at a time, which is much faster
  • We could also change existing fields if the preprocessing function returned a new value for an existing key in the dataset; the columns added by tokenization can be checked as shown below
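A quick look at the resulting columns (a minimal check; the exact ordering may differ):

tokenized_datasets["train"].column_names
> ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask']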

Dynamic padding

The function responsible for putting samples together into a batch is the collate function (a parameter of DataLoader; by default it converts the samples to PyTorch tensors and concatenates them)

We deliberately defer padding and apply it only within each batch, so we avoid padding every input to the overall maximum length with a lot of padding tokens.

  • This will make training faster
  • But it can cause problems on TPUs, which prefer fixed shapes

In order to do this in practice, we have to define a collate function that will apply the correct amount of padding to the items in the dataset we want to batch

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
samples = tokenized_datasets["train"][:8]
# drop the columns the collator cannot pad (the raw strings and the index)
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)
  • A tokenizer needs to be passed at initialization so that the collator knows which padding token to use and whether the model expects padding on the left or on the right; the effect on the batch can be checked as shown below
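Before collation the samples have different lengths; after collation every tensor in the batch is padded to the length of the longest sample. A quick check (the exact numbers depend on the batch, but they should look something like this):

[len(x) for x in samples["input_ids"]]
> [50, 59, 47, 67, 59, 50, 62, 32]

{k: v.shape for k, v in batch.items()}
> {'attention_mask': torch.Size([8, 67]), 'input_ids': torch.Size([8, 67]), 'token_type_ids': torch.Size([8, 67]), 'labels': torch.Size([8])}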

Fine-tuning a model with the Trainer API

Transformers provides a Trainer class to help fine-tune pretrained models on your own data. Once the data processing is done, there are only a few steps left to define the Trainer

We summarize the previous operations

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Training

Before defining the Trainer, we need to define a TrainingArguments class that contains all the hyperparameters. We only have to provide the directory where the model and checkpoints will be saved; the defaults for the other arguments work well for basic fine-tuning

from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

define a model

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Define a Trainer

from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
  • When we pass in the tokenizer, the data_collator used by the Trainer defaults to a DataCollatorWithPadding like the one defined above, so passing data_collator explicitly can be omitted (see the sketch below)
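For example, a minimal Trainer that relies on that default collator (a sketch; the behavior should be equivalent to the definition above):

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,  # the default collator becomes DataCollatorWithPadding(tokenizer)
)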

train

trainer.train()
  • It will report the training loss every 500 steps, but it will not tell us how well the model is actually doing, because:
    • We did not ask the Trainer to evaluate during training (by setting evaluation_strategy to "steps" or "epoch")
    • We did not provide a compute_metrics() function to compute a metric during evaluation, so it would only return the loss, which is not very intuitive

Evaluation

We can use Trainer.predict() to get predictions from our model

predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
>(408, 2) (408,)
  • The output is a named tuple with three fields: predictions, label_ids, and metrics
  • Here metrics only contains the loss and some timing statistics; if we define our own compute_metrics() function and pass it to the Trainer, this field will also contain the result of compute_metrics()
  • predictions contains the logits for each element of the dataset

To compare our predictions to the true labels, we need the index of the maximum value on the second axis:

import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

To build compute_metrics(), we can use the Evaluate library

import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
> {'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}

Finally putting everything together we get the compute_metrics() function

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

At this point we can define a new Trainer

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
  • Note that evaluation is now performed at the end of every epoch; this time, in addition to the training loss, the Trainer will report the validation loss and metrics at the end of each epoch. Launching the run works exactly as before, as shown below
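Relaunching the fine-tuning with this new Trainer is again a single call:

trainer.train()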

The process behind Trainer

A short summary of data processing

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

preparation before training

We need to do some processing on tokenized_datasets, specifically:

  • Remove the columns corresponding to values the model does not expect (such as the sentence1 and sentence2 columns).
  • Rename the column label to labels (the model expects the argument to be named labels).
  • Format the dataset so that it returns PyTorch tensors instead of lists.
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

Load DataLoader

from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)
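To verify that the dataloaders and the collator work together, we can grab one batch and look at the tensor shapes (a quick sanity check; all tensors should have shape [8, seq_len] except labels, which is [8], and seq_len will differ from batch to batch because of dynamic padding and shuffling):

# inspect a single batch from the training dataloader
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}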

load model

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
  • The model should be moved to the GPU if one is available

    import torch
    
    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    model.to(device)
    

load optimizer

from transformers import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

load scheduler

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
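As a sanity check, the number of training steps equals the number of epochs times the number of batches per epoch (with MRPC's 3,668 training examples and a batch size of 8, this should come out to 1,377):

print(num_training_steps)
> 1377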

training loop

from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))  # progress bar

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

evaluate loop

import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()
  • The metric object actually accumulates the results for us batch by batch as we call add_batch() in the prediction loop. Once all batches have been accumulated, we can get the final result with metric.compute()

Up next: supercharging the training loop with Accelerate - Hugging Face Course


Origin blog.csdn.net/qq_52852138/article/details/128997766