Processing the data
Load a dataset from the Hub
We can use the following code to train a sequence classifier
import torch
from torch.optim import AdamW  # transformers' AdamW is deprecated; use PyTorch's implementation
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
"I've been waiting for a HuggingFace course my whole life.",
"This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# This is new
batch["labels"] = torch.tensor([1, 1])
optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
Obviously, training on just two sentences will not give good results, so we need a larger dataset.
Hugging Face also hosts a large number of datasets, which can be downloaded with load_dataset:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")  # the MRPC dataset from the GLUE benchmark
raw_datasets
> DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
We can access each pair of sentences by indexing:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]
> {
    'idx': 0,
    'label': 1,
    'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
    'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'
}
If you want to know what each field of the dataset means, you can inspect the features attribute:
raw_train_dataset.features
> {
    'sentence1': Value(dtype='string', id=None),
    'sentence2': Value(dtype='string', id=None),
    'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
    'idx': Value(dtype='int32', id=None)
}
Preprocessing a dataset
The tokenizer can directly handle a pair of sentences, preparing them the way BERT expects:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs
>{
'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
token_type_ids distinguish the first sentence from the second. Not every tokenizer returns them: they are only returned if the model knows what to do with them, because it has seen them during pre-training.
- BERT uses token_type_ids in its pre-training.
- If we decode the input_ids, we will see that the format is [CLS] sentence1 [SEP] sentence2 [SEP].
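As a quick check (a small sketch re-using the inputs from above), we can convert the IDs back to tokens to make the two segments and the special tokens visible:
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"])
print(tokens)
> ['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']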
We can have the tokenizer process a list of sentence pairs by giving it the list of first sentences and the list of second sentences:
tokenized_dataset = tokenizer(
raw_datasets["train"]["sentence1"],
raw_datasets["train"]["sentence2"],
padding=True,
truncation=True,
)
This returns a dictionary of lists. To add these results back to the original dataset as new columns, we define a tokenization function and apply it with map():
def tokenize_function(example):
return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
- We pass batched=True when calling map so that the function is applied to many elements of the dataset at once instead of one element at a time, which is much faster.
- The preprocessing function can also change existing fields if it returns a new value for an existing key in the dataset (see the quick check below).
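As a quick check (an illustrative snippet; the names follow the code above), we can confirm that map kept the original columns and added the tokenizer's outputs as new ones:
print(tokenized_datasets["train"].column_names)
# the original columns plus 'input_ids', 'token_type_ids' and 'attention_mask'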
Dynamic padding
The function responsible for putting samples together in a batch is called a collate function
(it is a parameter of the DataLoader; by default it converts samples to PyTorch tensors and concatenates them).
We deliberately defer padding and apply it only within each batch, which avoids excessively long inputs padded with many padding tokens.
- This makes training faster.
- But it can cause problems on TPUs, which prefer fixed shapes.
To do this in practice, we define a collate function that applies the right amount of padding to the items of the batch we want to build:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}  # drop the string columns the collator cannot pad
batch = data_collator(samples)
- A tokenizer has to be passed at initialization so that the collator knows which token to pad with and whether the model expects padding on the left or the right of the inputs (see the shape check below).
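As an illustration (a small sketch re-using samples and batch from above), we can compare the raw sequence lengths with the shape of the padded batch:
print([len(x) for x in samples["input_ids"]])   # eight different lengths
print({k: v.shape for k, v in batch.items()})   # every tensor is padded to the longest length in the batch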
Fine-tuning a model with the Trainer API
Transformers provides a Trainer class to help fine-tune a pretrained model on your own data. Once the data processing is done, only a few steps are needed to define the Trainer.
Here is a summary of the previous steps:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(example):
return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Training
Before defining the Trainer, we need to define a TrainingArguments class that contains all the hyperparameters. We only have to provide the directory where the model will be saved; the other arguments can be left at their defaults and still train well:
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer")
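If you do want to change something, TrainingArguments accepts the usual hyperparameters; the values below simply restate the defaults and are only meant as an illustration:
training_args = TrainingArguments(
    "test-trainer",
    num_train_epochs=3,              # default: 3
    per_device_train_batch_size=8,   # default: 8
    learning_rate=5e-5,              # default: 5e-5
)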
define a model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
Define a Trainer
from transformers import Trainer
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
- When we pass in the tokenizer, the data_collator defaults to a DataCollatorWithPadding like the one we defined above, so the data_collator=data_collator line could be omitted.
train
trainer.train()
- This will report the training loss every 500 steps, but it will not tell us how well the model is actually doing, because:
- We did not ask the Trainer to evaluate during training (by setting evaluation_strategy to "steps" or "epoch").
- We did not provide a compute_metrics() function to compute a metric during evaluation, so evaluation would only return the loss, which is not very intuitive.
Evaluation
We can use Trainer.predict() to get predictions from our model:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
>(408, 2) (408,)
- The output of predict() is a named tuple with three fields: predictions, label_ids, and metrics.
- Here metrics only contains the loss and some timing information; if we define our own compute_metrics() function and pass it to the Trainer, this field will also contain its results.
- predictions contains the logits for each element of the dataset.
To compare our predictions with the true labels, we need the index of the maximum value on the second axis:
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
To build compute_metrics(), we can use the Evaluate library:
import evaluate
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
> {'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}
Putting everything together, we get the compute_metrics() function:
def compute_metrics(eval_preds):
metric = evaluate.load("glue", "mrpc")
logits, labels = eval_preds
predictions = np.argmax(logits, axis=-1)
return metric.compute(predictions=predictions, references=labels)
At this point we can define a new Trainer
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
- Note that evaluation will now be performed at the end of every epoch. This time, the Trainer will report the validation loss and metrics at the end of each epoch, in addition to the training loss.
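With this configuration, fine-tuning is launched exactly as before:
trainer.train()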
The process behind Trainer
A short summary of data processing
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(example):
return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
preparation before training
We need to do some post-processing on tokenized_datasets, specifically:
- Remove the columns corresponding to values the model does not expect (such as the sentence1 and sentence2 columns).
- Rename the column label to labels (because the model expects the argument to be named labels).
- Set the format of the dataset so that it returns PyTorch tensors instead of lists.
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names
define the dataloaders
from torch.utils.data import DataLoader
train_dataloader = DataLoader(
tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)
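To verify the dataloaders (a quick check; the exact sequence length will vary because of dynamic padding), we can grab one batch and look at the tensor shapes:
for batch in train_dataloader:
    break
print({k: v.shape for k, v in batch.items()})   # each tensor is [8, seq_len], except labels which is [8]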
load model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
Move the model to the GPU if one is available:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
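Before writing the training loop, we can make sure a batch goes through the model correctly (a sketch re-using the batch grabbed above; not required for training):
batch = {k: v.to(device) for k, v in batch.items()}  # move the batch to the same device as the model
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)            # a scalar loss and logits of shape [8, 2]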
load optimizer
from torch.optim import AdamW  # transformers' AdamW is deprecated; use PyTorch's implementation
optimizer = AdamW(model.parameters(), lr=5e-5)
load scheduler
from transformers import get_scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
"linear",
optimizer=optimizer,
num_warmup_steps=0,
num_training_steps=num_training_steps,
)
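With the MRPC training set (3,668 examples) and a batch size of 8, this comes out to 459 steps per epoch, i.e. 1,377 training steps in total:
print(num_training_steps)
> 1377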
training loop
from tqdm.auto import tqdm
progress_bar = tqdm(range(num_training_steps))  # progress bar
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
evaluate loop
import evaluate
metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])
metric.compute()
The metric object actually accumulates the results of every batch for us as we call add_batch() in the prediction loop. Once all batches have been accumulated, metric.compute() gives the final result.
Next: a full training loop with Accelerate (Hugging Face Course)