Distributed training with huggingface.Accelerate

This article is part of my series of study notes on the huggingface.transformers documentation.
Link to the full series: huggingface transformers package documentation study notes (continuously updated...)

URL of this part: https://huggingface.co/docs/transformers/main/en/accelerate
This article introduces how to use huggingface.accelerate (official documentation: https://huggingface.co/docs/accelerate/index) for distributed training.

In addition, refer to the installation documentation of accelerate: https://huggingface.co/docs/accelerate/basic_tutorials/install

Python environment used for the code in this article: Python 3.9.7, PyTorch 2.0.1, transformers 4.31.0, accelerate 0.22.0

Parallelism lets us train larger models when hardware is limited, and it can speed up training by several orders of magnitude.

1. Installation and configuration

Install: pip install accelerate

Configuration: accelerate config
It will then ask a series of questions; use the up/down arrow keys to switch between options and press Enter to confirm.

It doesn’t matter if you choose a wrong option; you can always rerun accelerate config and change it.

Use the accelerate env command to view the current configuration.
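
Besides the command line, you can also check the resulting setup from Python. The snippet below is a minimal sketch of my own (not from the original post); it prints the device and process information exposed by the Accelerator object, and only shows multiple processes when the script is started with accelerate launch (see section 2):

from accelerate import Accelerator

accelerator = Accelerator()

# each process reports its own rank and device; num_processes reflects the
# settings chosen in `accelerate config` (e.g. 3 when training on 3 GPUs)
print(f"process {accelerator.process_index}/{accelerator.num_processes} "
      f"on device {accelerator.device}, main process: {accelerator.is_main_process}")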

2. Use in code

The script before using accelerate (for a detailed explanation see my earlier blog post Using huggingface.transformers.AutoModelForSequenceClassification to fine-tune a pre-trained model on a text classification task; the native PyTorch version is used, because Trainer already handles distributed training automatically. The metric code is switched to the new evaluate library, and all of the data is used for training):

from tqdm.auto import tqdm

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

import datasets,evaluate
from transformers import AutoTokenizer,AutoModelForSequenceClassification,get_scheduler

dataset=datasets.load_from_disk("download/yelp_full_review_disk")

tokenizer=AutoTokenizer.from_pretrained("/data/pretrained_models/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length",truncation=True,max_length=512)

tokenized_datasets=dataset.map(tokenize_function, batched=True)

# postprocess the dataset
tokenized_datasets=tokenized_datasets.remove_columns(["text"])
# remove the text column, which the model does not use

tokenized_datasets=tokenized_datasets.rename_column("label", "labels")
# rename the label column to labels, because AutoModelForSequenceClassification expects the argument name labels
# (I don't know why the dataset can get away with just calling it label...)

tokenized_datasets.set_format("torch")  # convert the values to torch.Tensor objects

small_train_dataset=tokenized_datasets["train"].shuffle(seed=42)
small_eval_dataset=tokenized_datasets["test"].shuffle(seed=42)

train_dataloader=DataLoader(small_train_dataset,shuffle=True,batch_size=32)
eval_dataloader=DataLoader(small_eval_dataset,batch_size=64)

model=AutoModelForSequenceClassification.from_pretrained("/data/pretrained_models/bert-base-cased",
                                                         num_labels=5)

optimizer=AdamW(model.parameters(),lr=5e-5)

num_epochs=3
num_training_steps=num_epochs*len(train_dataloader)
lr_scheduler=get_scheduler(name="linear",optimizer=optimizer,num_warmup_steps=0,num_training_steps=num_training_steps)

device=torch.device("cuda:1") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch={k:v.to(device) for k,v in batch.items()}
        outputs=model(**batch)
        loss=outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

metric=evaluate.load("accuracy")
model.eval()
for batch in eval_dataloader:
    batch={k:v.to(device) for k,v in batch.items()}
    with torch.no_grad():
        outputs=model(**batch)

    logits=outputs.logits
    predictions=torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

print(metric.compute())

I didn't bother to finish this run; it was estimated to take about 11 hours, which is very slow.

Make the following changes to the script:

from accelerate import Accelerator

accelerator = Accelerator()

# remove the code that moves the model and the data to a specific device

# after the dataloaders, model and optimizer have been created:
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

# in the training loop, replace loss.backward() with
accelerator.backward(loss)

The full code after the changes (with the whole dataset, training was estimated to take about 4 hours on 3 cards, but I didn't want to run that long, so I again use 1000 samples just to run through the whole process). Run the script with accelerate launch <path to the Python script>. The evaluation part is discussed after the code.

from tqdm.auto import tqdm

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

import datasets
from transformers import AutoTokenizer,AutoModelForSequenceClassification,get_scheduler

from accelerate import Accelerator

accelerator = Accelerator()

dataset=datasets.load_from_disk("download/yelp_full_review_disk")

tokenizer=AutoTokenizer.from_pretrained("/data/pretrained_models/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length",truncation=True,max_length=512)

tokenized_datasets=dataset.map(tokenize_function, batched=True)

# postprocess the dataset
tokenized_datasets=tokenized_datasets.remove_columns(["text"])
# remove the text column, which the model does not use

tokenized_datasets=tokenized_datasets.rename_column("label", "labels")
# rename the label column to labels, because AutoModelForSequenceClassification expects the argument name labels
# (I don't know why the dataset can get away with just calling it label...)

tokenized_datasets.set_format("torch")  # convert the values to torch.Tensor objects

small_train_dataset=tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset=tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

train_dataloader=DataLoader(small_train_dataset,shuffle=True,batch_size=32)
eval_dataloader=DataLoader(small_eval_dataset,batch_size=64)

model=AutoModelForSequenceClassification.from_pretrained("/data/pretrained_models/bert-base-cased",
                                                         num_labels=5)

optimizer=AdamW(model.parameters(),lr=5e-5)

train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs=3
num_training_steps=num_epochs*len(train_dataloader)
lr_scheduler=get_scheduler(name="linear",optimizer=optimizer,num_warmup_steps=0,num_training_steps=num_training_steps)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        outputs=model(**batch)
        loss=outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

The evaluation part: you can run the original evaluation code directly, but because accelerate launch starts the script once per process, the evaluation will also be executed once per process.
So in principle I recommend using accelerate only for training, and implementing the evaluation separately on a single card.
If you still want to check the metrics during training, you can evaluate as usual; alternatively, you can wrap the evaluation in if accelerator.is_main_process: so that only the main process (usually the one on the first GPU) executes the evaluation code, and the metric is printed only once.
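
For reference, here is a minimal sketch of that main-process-only evaluation. It is my own adaptation of the evaluation code from the single-card script above, not part of the original post, and it assumes the prepared model, the accelerator and the tokenized small_eval_dataset from the training script are still in scope:

import torch
import evaluate
from torch.utils.data import DataLoader

if accelerator.is_main_process:
    # build the eval DataLoader WITHOUT accelerator.prepare(); a prepared dataloader
    # would only hold this process's shard of the evaluation data
    eval_dataloader = DataLoader(small_eval_dataset, batch_size=64)
    metric = evaluate.load("accuracy")

    model.eval()
    for batch in eval_dataloader:
        # the prepared model lives on accelerator.device, so move each batch there
        batch = {k: v.to(accelerator.device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        predictions = torch.argmax(outputs.logits, dim=-1)
        metric.add_batch(predictions=predictions, references=batch["labels"])
    print(metric.compute())

Accelerate also offers accelerator.gather_for_metrics(), which gathers predictions from every process so that all GPUs can share the evaluation work while the metric is still computed once; the main-process-only version above is simply the closest match to the approach described in this post.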
