Background
Transformers provides a convenient API for fine-tuning large models. Let's walk through the steps of fine-tuning a model with the Trainer class.
Step 1: Load the pre-trained large model
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
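A note on the classification head: distilbert-base-uncased is a base checkpoint without a classification head, so Transformers attaches a new, randomly initialized one (and logs a warning saying so). The head defaults to two labels, which happens to match the dataset used below; for a task with a different number of classes you would pass num_labels explicitly. A hedged sketch:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=3,  # hypothetical: a three-class task
)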
Step 2: Set training hyperparameters
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="path/to/save/folder/",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
)
For example, num_train_epochs=2 trains for two epochs, and both the training and evaluation batch sizes are set to 8 per device.
Step 3: Get the tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
Step 4: Load the dataset
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes")
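rotten_tomatoes is a small binary sentiment dataset that ships with ready-made splits. A quick way to confirm what load_dataset returned (the comment describes the expected structure, not exact output):
print(dataset)
# a DatasetDict with "train", "validation", and "test" splits,
# each containing "text" and "label" columns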
Step 5: Create a tokenization function and specify which field of the dataset to tokenize:
def tokenize_dataset(dataset):
    return tokenizer(dataset["text"])
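To see what this function produces, you can tokenize a single sentence by hand; the tokenizer returns integer token IDs plus an attention mask (a minimal illustration, not one of the original steps):
encoding = tokenizer("This movie was great!")
print(encoding["input_ids"])       # token ids, including the [CLS]/[SEP] markers
print(encoding["attention_mask"])  # 1 for every real token
For corpora with long documents you would typically also pass truncation=True so inputs fit within the model's maximum length; the movie reviews here are short enough that it is not strictly needed.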
Step 6: Call map() to apply the tokenization function to the entire dataset
dataset = dataset.map(tokenize_dataset, batched=True)
Step 7: Use DataCollatorWithPadding to pad each batch dynamically; padding per batch instead of padding the whole dataset to one global length up front is faster:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
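To illustrate the dynamic part: given two examples of different lengths, the collator pads both to the length of the longer one, not to a global maximum. A small sketch, assuming PyTorch is installed since the collator returns tensors by default:
features = [tokenizer("short"), tokenizer("a noticeably longer movie review")]
batch = data_collator(features)
print(batch["input_ids"].shape)  # (2, length of the longer example)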
Step 8: Initialize Trainer
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
Step 9: Start training
trainer.train()
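Once training finishes, two Trainer methods you will usually reach for next are evaluate() and save_model(); a minimal sketch (the output directory is the one set in TrainingArguments):
metrics = trainer.evaluate()  # runs on eval_dataset; returns eval_loss here,
print(metrics)                # since no compute_metrics function was supplied
trainer.save_model()          # writes the fine-tuned model to output_dir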
Summary:
With the API provided by Trainer, you can fine-tune a large model in just nine simple steps and a dozen or so lines of code. Want to give it a try?