GPUs have been in short supply ever since large models became a trend. Many companies do not have enough of them in reserve, let alone individual developers. Is there a way to use the available computing power more efficiently when training models?
In a recent blog post, Sebastian Raschka introduced "gradient accumulation", a method that makes it possible to train with a larger effective batch size when GPU memory is limited, working around the hardware limitation.
Before that, Sebastian Raschka had also shared an article on multi-GPU training strategies for speeding up the fine-tuning of large language models, including mechanisms such as model or tensor sharding, which distribute model weights and computations across different devices to work around GPU memory limits.
Fine-tuning the BLOOM model for classification
Suppose we are interested in adapting a recently pretrained large language model to a downstream task such as text classification. We might then choose BLOOM, an open-source alternative to GPT-3, and specifically the BLOOM version with "only" 560 million parameters, which should fit into the memory of a conventional GPU without any problems (the free version of Google Colab provides a GPU with 15 GB of RAM).
Once you start, you are likely to run into a problem: memory consumption grows quickly during training or fine-tuning, and the only way to train this model is with a batch size of 1.
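To see why a model that fits comfortably for inference becomes tight during training, a rough back-of-the-envelope estimate can help. The sketch below assumes plain 32-bit floats and the Adam optimizer (which keeps two moment estimates per parameter); the function name training_state_gb is made up for illustration. It ignores activations, mixed-precision copies, and framework overhead, so real usage will differ.

```python
NUM_PARAMS = 560_000_000  # BLOOM-560m, as stated in the text
BYTES_FP32 = 4            # assumption: everything stored as 32-bit floats


def training_state_gb(num_params, bytes_per_value=BYTES_FP32):
    """Rough size (in GB) of weights, gradients, and Adam state.

    Illustrative helper, not part of the original script.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value          # one gradient per weight
    adam_moments = 2 * num_params * bytes_per_value   # first and second moments
    total = weights + gradients + adam_moments
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "adam_moments": adam_moments / 1e9,
        "total": total / 1e9,
    }


estimate = training_state_gb(NUM_PARAMS)
for name, size_gb in estimate.items():
    print(f"{name:>13}: {size_gb:.2f} GB")
```

Under these assumptions, weights, gradients, and optimizer state alone take roughly 9 GB of a 15 GB GPU, before any activations for sequences of up to 1024 tokens are stored. That leaves little room for more than one example per batch.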
The code for fine-tuning BLOOM on the target classification task with a batch size of 1 is shown below. You can also download the full code from the GitHub project page:
https://github.com/rasbt/gradient-accumulation-blog/blob/main/src/1_batchsize-1.py
You can copy and paste this code directly into Google Colab, but you also have to place the accompanying local_dataset_utilities.py file in the same folder, because the script imports some dataset utilities from it.
# pip install torch lightning matplotlib pandas torchmetrics watermark transformers datasets -U
import os
import os.path as op
import time
from datasets import load_dataset
from lightning import Fabric
import torch
from torch.utils.data import DataLoader
import torchmetrics
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from watermark import watermark
from local_dataset_utilities import download_dataset, load_dataset_into_to_dataframe, partition_dataset
from local_dataset_utilities import IMDBDataset
def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True, max_length=1024)


def train(num_epochs, model, optimizer, train_loader, val_loader, fabric):
    for epoch in range(num_epochs):
        train_acc = torchmetrics.Accuracy(
            task="multiclass", num_classes=2).to(fabric.device)

        for batch_idx, batch in enumerate(train_loader):
            model.train()

            ### FORWARD AND BACK PROP
            outputs = model(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["label"]
            )
            fabric.backward(outputs["loss"])

            ### UPDATE MODEL PARAMETERS
            optimizer.step()
            optimizer.zero_grad()

            ### LOGGING
            if not batch_idx % 300:
                print(f"Epoch: {epoch+1:04d}/{num_epochs:04d} "
                      f"| Batch {batch_idx:04d}/{len(train_loader):04d} "
                      f"| Loss: {outputs['loss']:.4f}")

            # Track training accuracy on every batch
            model.eval()
            with torch.no_grad():
                predicted_labels = torch.argmax(outputs["logits"], 1)
                train_acc.update(predicted_labels, batch["label"])

        ### MORE LOGGING
        model.eval()
        with torch.no_grad():
            val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2).to(fabric.device)
            for batch in val_loader:
                outputs = model(
                    batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["label"]
                )
                predicted_labels = torch.argmax(outputs["logits"], 1)
                val_acc.update(predicted_labels, batch["label"])

            print(f"Epoch: {epoch+1:04d}/{num_epochs:04d} "
                  f"| Train acc.: {train_acc.compute()*100:.2f}% "
                  f"| Val acc.: {val_acc.compute()*100:.2f}%"
                  )
            train_acc.reset(), val_acc.reset()


if __name__ == "__main__":
    print(watermark(packages="torch,lightning,transformers", python=True))
    print("Torch CUDA available?", torch.cuda.is_available())
    device = "cuda" if torch.cuda.is_available() else "cpu"

    torch.manual_seed(123)
    # torch.use_deterministic_algorithms(True)

    ##########################
    ### 1 Loading the Dataset
    ##########################
    download_dataset()
    df = load_dataset_into_to_dataframe()
    if not (op.exists("train.csv") and op.exists("val.csv") and op.exists("test.csv")):
        partition_dataset(df)

    imdb_dataset = load_dataset(
        "csv",
        data_files={
            "train": "train.csv",
            "validation": "val.csv",
            "test": "test.csv",
        },
    )

    #########################################
    ### 2 Tokenization and Numericalization
    #########################################
    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m", max_length=1024)
    print("Tokenizer input max length:", tokenizer.model_max_length, flush=True)
    print("Tokenizer vocabulary size:", tokenizer.vocab_size, flush=True)

    print("Tokenizing ...", flush=True)
    imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)
    del imdb_dataset
    imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    #########################################
    ### 3 Set Up DataLoaders
    #########################################
    train_dataset = IMDBDataset(imdb_tokenized, partition_key="train")
    val_dataset = IMDBDataset(imdb_tokenized, partition_key="validation")
    test_dataset = IMDBDataset(imdb_tokenized, partition_key="test")

    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=1,
        shuffle=True,
        num_workers=4,
        drop_last=True,
    )
    val_loader = DataLoader(
        dataset=val_dataset,
        batch_size=1,
        num_workers=4,
        drop_last=True,
    )
    test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=1,
        num_workers=2,
        drop_last=True,
    )

    #########################################
    ### 4 Initializing the Model
    #########################################
    fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")
    fabric.launch()

    model = AutoModelForSequenceClassification.from_pretrained(
        "bigscience/bloom-560m", num_labels=2)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

    model, optimizer = fabric.setup(model, optimizer)
    train_loader, val_loader, test_loader = fabric.setup_dataloaders(
        train_loader, val_loader, test_loader)

    #########################################
    ### 5 Finetuning
    #########################################
    start = time.time()
    train(
        num_epochs=1,
        model=model,
        optimizer=optimizer,
        train_loader=train_loader,
        val_loader=val_loader,
        fabric=fabric,
    )
    end = time.time()
    elapsed = end - start
    print(f"Time elapsed {elapsed/60:.2f} min")

    with torch.no_grad():
        model.eval()
        test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2).to(fabric.device)
        for batch in test_loader:
            outputs = model(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["label"]
            )
            predicted_labels = torch.argmax(outputs["logits"], 1)
            test_acc.update(predicted_labels, batch["label"])

    print(f"Test accuracy {test_acc.compute()*100:.2f}%")
The author used Lightning Fabric because it allows developers to flexibly change the number of GPUs and multi-GPU training strategy when running this code on different hardware. It also allows mixed-precision training to be enabled just by adjusting the precision flag. In this case, mixed-precision training can triple training speed and reduce memory requirements by about 25%.
The main code shown above runs inside the if __name__ == "__main__" block, which is recommended when running multi-GPU training with the PyTorch runtime, even though only a single GPU is used here. The first three code sections inside if __name__ == "__main__" handle data loading:
# 1 Loading the Dataset
# 2 Tokenization and Numericalization
# 3 Set Up DataLoaders
Section 4, Initializing the Model, sets up the model, and Section 5, Finetuning, calls the train function, which is where things start to get interesting. The train(...) function implements a standard PyTorch training loop: for every batch, it runs a forward pass, backpropagates the loss, and then immediately updates the model parameters.
The problem with a batch size of 1 is that the gradient updates become very noisy and erratic, as the fluctuating training loss and poor test set performance show when training the model below:
...
torch : 2.0.0
lightning : 2.0.0
transformers: 4.27.2
Torch CUDA available? True
...
Epoch: 0001/0001 | Batch 23700/35000 | Loss: 0.0969
Epoch: 0001/0001 | Batch 24000/35000 | Loss: 1.9902
Epoch: 0001/0001 | Batch 24300/35000 | Loss: 0.0395
Epoch: 0001/0001 | Batch 24600/35000 | Loss: 0.2546
Epoch: 0001/0001 | Batch 24900/35000 | Loss: 0.1128
Epoch: 0001/0001 | Batch 25200/35000 | Loss: 0.2661
Epoch: 0001/0001 | Batch 25500/35000 | Loss: 0.0044
Epoch: 0001/0001 | Batch 25800/35000 | Loss: 0.0067
Epoch: 0001/0001 | Batch 26100/35000 | Loss: 0.0468
Epoch: 0001/0001 | Batch 26400/35000 | Loss: 1.7139
Epoch: 0001/0001 | Batch 26700/35000 | Loss: 0.9570
Epoch: 0001/0001 | Batch 27000/35000 | Loss: 0.1857
Epoch: 0001/0001 | Batch 27300/35000 | Loss: 0.0090
Epoch: 0001/0001 | Batch 27600/35000 | Loss: 0.9790
Epoch: 0001/0001 | Batch 27900/35000 | Loss: 0.0503
Epoch: 0001/0001 | Batch 28200/35000 | Loss: 0.2625
Epoch: 0001/0001 | Batch 28500/35000 | Loss: 0.1010
Epoch: 0001/0001 | Batch 28800/35000 | Loss: 0.0035
Epoch: 0001/0001 | Batch 29100/35000 | Loss: 0.0009
Epoch: 0001/0001 | Batch 29400/35000 | Loss: 0.0234
Epoch: 0001/0001 | Batch 29700/35000 | Loss: 0.8394
Epoch: 0001/0001 | Batch 30000/35000 | Loss: 0.9497
Epoch: 0001/0001 | Batch 30300/35000 | Loss: 0.1437
Epoch: 0001/0001 | Batch 30600/35000 | Loss: 0.1317
Epoch: 0001/0001 | Batch 30900/35000 | Loss: 0.0112
Epoch: 0001/0001 | Batch 31200/35000 | Loss: 0.0073
Epoch: 0001/0001 | Batch 31500/35000 | Loss: 0.7393
Epoch: 0001/0001 | Batch 31800/35000 | Loss: 0.0512
Epoch: 0001/0001 | Batch 32100/35000 | Loss: 0.1337
Epoch: 0001/0001 | Batch 32400/35000 | Loss: 1.1875
Epoch: 0001/0001 | Batch 32700/35000 | Loss: 0.2727
Epoch: 0001/0001 | Batch 33000/35000 | Loss: 0.1545
Epoch: 0001/0001 | Batch 33300/35000 | Loss: 0.0022
Epoch: 0001/0001 | Batch 33600/35000 | Loss: 0.2681
Epoch: 0001/0001 | Batch 33900/35000 | Loss: 0.2467
Epoch: 0001/0001 | Batch 34200/35000 | Loss: 0.0620
Epoch: 0001/0001 | Batch 34500/35000 | Loss: 2.5039
Epoch: 0001/0001 | Batch 34800/35000 | Loss: 0.0131
Epoch: 0001/0001 | Train acc.: 75.11% | Val acc.: 78.62%
Time elapsed 69.97 min
Test accuracy 78.53%
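The fluctuating loss in the log above reflects how much a gradient computed from a single example can vary from one example to the next. The following sketch illustrates this on a toy problem that is purely an assumption for illustration (a one-parameter linear fit in plain Python, not the BLOOM model): it compares how widely the gradient scatters for batch size 1 versus batch size 16.

```python
import random
import statistics

random.seed(123)

# Toy data (assumption for illustration): y = 2 * x plus noise.
TRUE_W = 2.0
data = []
for _ in range(1024):
    x = random.uniform(-1.0, 1.0)
    y = TRUE_W * x + random.gauss(0.0, 0.5)
    data.append((x, y))


def example_gradient(w, x, y):
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def batch_gradient(w, batch):
    """Mean gradient over a batch, as a mean-reduced loss would produce."""
    return sum(example_gradient(w, x, y) for x, y in batch) / len(batch)


def gradient_spread(w, batch_size, num_batches=200):
    """Standard deviation of the batch gradient over randomly drawn batches."""
    grads = [
        batch_gradient(w, random.sample(data, batch_size))
        for _ in range(num_batches)
    ]
    return statistics.stdev(grads)


w = 0.0  # an untrained starting weight
spread_bs1 = gradient_spread(w, batch_size=1)
spread_bs16 = gradient_spread(w, batch_size=16)
print(f"gradient std, batch size 1:  {spread_bs1:.3f}")
print(f"gradient std, batch size 16: {spread_bs16:.3f}")
```

Averaging over 16 examples gives a noticeably more stable gradient estimate than a single example, which is the effect a larger batch size is meant to provide.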
If there are not enough GPUs available for tensor sharding, what can be done to train the model with a larger batch size?
One solution is gradient accumulation, which modifies the training loop described above.
What is Gradient Accumulation?
Gradient accumulation is a way to virtually increase the batch size during training, which is useful when the available GPU memory is insufficient to hold the desired batch size. Instead of updating the model weights after each batch, gradients are computed for smaller batches and accumulated (usually summed or averaged) over multiple iterations. Once enough batches have been accumulated to reach the target "virtual" batch size, the model weights are updated using the accumulated gradient.
In the updated PyTorch training loop, optimizer.step() and optimizer.zero_grad() are no longer called after every batch.
If accumulation_steps is set to 2, optimizer.step() and optimizer.zero_grad() are called only on every second batch. Therefore, running the modified training loop with accumulation_steps=2 has the same effect as doubling the batch size.
For example, if you want to use a batch size of 256 but can only fit a batch size of 64 into GPU memory, you can perform gradient accumulation over four batches of size 64. (Once all four batches have been processed, the accumulated gradients are equivalent to those of a single batch of size 256.) This effectively simulates a larger batch size without requiring more GPU memory or tensor sharding across different devices.
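The equivalence described above can be checked on a small scale. The following is a minimal sketch, not the script from the blog post: a toy nn.Linear model, random data, SGD, and MSELoss stand in for BLOOM, the IMDB data, Adam, and the model's own loss, and plain loss.backward() takes the place of fabric.backward(). Dividing the loss by accumulation_steps is a common convention assumed here so that the summed gradients match the mean over the full batch.

```python
import copy

import torch
from torch import nn

torch.manual_seed(123)

# Toy stand-ins (assumptions): a linear model and random regression data.
features = torch.randn(4, 3)
targets = torch.randn(4, 1)
loss_fn = nn.MSELoss()  # mean-reduced loss

model_full = nn.Linear(3, 1)
model_accum = copy.deepcopy(model_full)  # identical starting weights
opt_full = torch.optim.SGD(model_full.parameters(), lr=0.1)
opt_accum = torch.optim.SGD(model_accum.parameters(), lr=0.1)

# Reference: a single update computed from the full batch of 4 examples.
loss_fn(model_full(features), targets).backward()
opt_full.step()
opt_full.zero_grad()

# Gradient accumulation: 4 micro-batches of size 1, one update at the end.
accumulation_steps = 4
micro_batches = list(zip(features.split(1), targets.split(1)))

for batch_idx, (x, y) in enumerate(micro_batches):
    # Scale the loss so the accumulated gradient equals the full-batch mean.
    loss = loss_fn(model_accum(x), y) / accumulation_steps
    loss.backward()  # adds to the gradients already stored in .grad

    is_update_step = (batch_idx + 1) % accumulation_steps == 0
    is_last_batch = batch_idx + 1 == len(micro_batches)
    if is_update_step or is_last_batch:
        opt_accum.step()       # update weights with the accumulated gradient
        opt_accum.zero_grad()  # start accumulating from zero again

weights_match = all(
    torch.allclose(p_full, p_accum, atol=1e-6)
    for p_full, p_accum in zip(model_full.parameters(), model_accum.parameters())
)
print("Same weights after one update:", weights_match)
```

Only one micro-batch is held in memory at a time, yet the resulting update matches the full-batch update up to floating-point rounding. With stateful layers such as batch normalization, the two would no longer be exactly identical.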
While gradient accumulation helps us train models with larger effective batch sizes, it does not reduce the total computation required. In fact, it can sometimes make training slightly slower, because weight updates are performed less frequently. Nevertheless, it lets us work around the problem that very small batch sizes lead to frequent, noisy updates.
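The bookkeeping behind this trade-off is simple arithmetic. The sketch below uses a hypothetical helper, accumulation_summary, and takes the 35,000 training batches from the logs in this article; it assumes a final, incomplete group of batches still triggers an update.

```python
import math


def accumulation_summary(num_micro_batches, micro_batch_size, accumulation_steps):
    """Effective batch size and pass/update counts for one epoch.

    Illustrative helper, not part of the original script.
    """
    return {
        "effective_batch_size": micro_batch_size * accumulation_steps,
        # Every micro-batch still needs its own forward and backward pass.
        "forward_backward_passes": num_micro_batches,
        # Weights change once per accumulated group of micro-batches.
        "optimizer_steps": math.ceil(num_micro_batches / accumulation_steps),
    }


baseline = accumulation_summary(35_000, micro_batch_size=1, accumulation_steps=1)
accumulated = accumulation_summary(35_000, micro_batch_size=1, accumulation_steps=16)
print("no accumulation:", baseline)
print("16 steps:       ", accumulated)
```

With 16 accumulation steps, the number of forward and backward passes stays at 35,000, while the number of optimizer steps drops from 35,000 to 2,188.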
For example, let's rerun the fine-tuning script with a batch size of 1 and 16 accumulation steps, which simulates a batch size of 16.
The output is as follows:
...
torch : 2.0.0
lightning : 2.0.0
transformers: 4.27.2
Torch CUDA available? True
...
Epoch: 0001/0001 | Batch 23700/35000 | Loss: 0.0168
Epoch: 0001/0001 | Batch 24000/35000 | Loss: 0.0006
Epoch: 0001/0001 | Batch 24300/35000 | Loss: 0.0152
Epoch: 0001/0001 | Batch 24600/35000 | Loss: 0.0003
Epoch: 0001/0001 | Batch 24900/35000 | Loss: 0.0623
Epoch: 0001/0001 | Batch 25200/35000 | Loss: 0.0010
Epoch: 0001/0001 | Batch 25500/35000 | Loss: 0.0001
Epoch: 0001/0001 | Batch 25800/35000 | Loss: 0.0047
Epoch: 0001/0001 | Batch 26100/35000 | Loss: 0.0004
Epoch: 0001/0001 | Batch 26400/35000 | Loss: 0.1016
Epoch: 0001/0001 | Batch 26700/35000 | Loss: 0.0021
Epoch: 0001/0001 | Batch 27000/35000 | Loss: 0.0015
Epoch: 0001/0001 | Batch 27300/35000 | Loss: 0.0008
Epoch: 0001/0001 | Batch 27600/35000 | Loss: 0.0060
Epoch: 0001/0001 | Batch 27900/35000 | Loss: 0.0001
Epoch: 0001/0001 | Batch 28200/35000 | Loss: 0.0426
Epoch: 0001/0001 | Batch 28500/35000 | Loss: 0.0012
Epoch: 0001/0001 | Batch 28800/35000 | Loss: 0.0025
Epoch: 0001/0001 | Batch 29100/35000 | Loss: 0.0025
Epoch: 0001/0001 | Batch 29400/35000 | Loss: 0.0000
Epoch: 0001/0001 | Batch 29700/35000 | Loss: 0.0495
Epoch: 0001/0001 | Batch 30000/35000 | Loss: 0.0164
Epoch: 0001/0001 | Batch 30300/35000 | Loss: 0.0067
Epoch: 0001/0001 | Batch 30600/35000 | Loss: 0.0037
Epoch: 0001/0001 | Batch 30900/35000 | Loss: 0.0005
Epoch: 0001/0001 | Batch 31200/35000 | Loss: 0.0013
Epoch: 0001/0001 | Batch 31500/35000 | Loss: 0.0112
Epoch: 0001/0001 | Batch 31800/35000 | Loss: 0.0053
Epoch: 0001/0001 | Batch 32100/35000 | Loss: 0.0012
Epoch: 0001/0001 | Batch 32400/35000 | Loss: 0.1365
Epoch: 0001/0001 | Batch 32700/35000 | Loss: 0.0210
Epoch: 0001/0001 | Batch 33000/35000 | Loss: 0.0374
Epoch: 0001/0001 | Batch 33300/35000 | Loss: 0.0007
Epoch: 0001/0001 | Batch 33600/35000 | Loss: 0.0341
Epoch: 0001/0001 | Batch 33900/35000 | Loss: 0.0259
Epoch: 0001/0001 | Batch 34200/35000 | Loss: 0.0005
Epoch: 0001/0001 | Batch 34500/35000 | Loss: 0.4792
Epoch: 0001/0001 | Batch 34800/35000 | Loss: 0.0003
Epoch: 0001/0001 | Train acc.: 78.67% | Val acc.: 87.28%
Time elapsed 51.37 min
Test accuracy 87.37%
According to the results above, the loss fluctuates less than before. In addition, test accuracy improved by almost 9 percentage points, from 78.53% to 87.37%. Since the training set is iterated over only once, each training example is seen only once. Training for multiple epochs could further improve predictive performance.
You may also notice that this run finishes faster than the earlier batch-size-1 run. If we increase the virtual batch size to, say, 8 using gradient accumulation, the number of forward passes stays the same. However, since the model is updated only once every eight batches, there are far fewer weight updates, which makes iterating over the samples in one epoch faster.
Conclusion
Gradient accumulation is a technique that simulates larger batch sizes by accumulating the gradients of multiple small batches before performing a weight update. It helps when the available memory is limited and only a small batch size fits into memory.
That said, consider a scenario in which the desired batch size already fits into the available memory. In that case, gradient accumulation may not be necessary. In fact, running with the larger batch size directly may be more efficient, since it allows more parallelism and reduces the number of weight updates required to train the model.
In summary, gradient accumulation is a practical technique for reducing the effect of the noise that very small batch sizes introduce into gradient updates. It is a simple and effective way to work around hardware limitations.
PS: Can this be made to run faster?
Yes. The code can be made to run even faster with torch.compile, which was introduced in PyTorch 2.0. All it takes is one additional line, model = torch.compile(model), after the model is created and before training starts.
The full script is available on GitHub.
In this case, torch.compile shaves about another eight minutes off the training time (from 51.37 to 43.33 min) without affecting modeling performance:
Epoch: 0001/0001 | Batch 26400/35000 | Loss: 0.0320
Epoch: 0001/0001 | Batch 26700/35000 | Loss: 0.0010
Epoch: 0001/0001 | Batch 27000/35000 | Loss: 0.0006
Epoch: 0001/0001 | Batch 27300/35000 | Loss: 0.0015
Epoch: 0001/0001 | Batch 27600/35000 | Loss: 0.0157
Epoch: 0001/0001 | Batch 27900/35000 | Loss: 0.0015
Epoch: 0001/0001 | Batch 28200/35000 | Loss: 0.0540
Epoch: 0001/0001 | Batch 28500/35000 | Loss: 0.0035
Epoch: 0001/0001 | Batch 28800/35000 | Loss: 0.0016
Epoch: 0001/0001 | Batch 29100/35000 | Loss: 0.0015
Epoch: 0001/0001 | Batch 29400/35000 | Loss: 0.0008
Epoch: 0001/0001 | Batch 29700/35000 | Loss: 0.0877
Epoch: 0001/0001 | Batch 30000/35000 | Loss: 0.0232
Epoch: 0001/0001 | Batch 30300/35000 | Loss: 0.0014
Epoch: 0001/0001 | Batch 30600/35000 | Loss: 0.0032
Epoch: 0001/0001 | Batch 30900/35000 | Loss: 0.0004
Epoch: 0001/0001 | Batch 31200/35000 | Loss: 0.0062
Epoch: 0001/0001 | Batch 31500/35000 | Loss: 0.0032
Epoch: 0001/0001 | Batch 31800/35000 | Loss: 0.0066
Epoch: 0001/0001 | Batch 32100/35000 | Loss: 0.0017
Epoch: 0001/0001 | Batch 32400/35000 | Loss: 0.1485
Epoch: 0001/0001 | Batch 32700/35000 | Loss: 0.0324
Epoch: 0001/0001 | Batch 33000/35000 | Loss: 0.0155
Epoch: 0001/0001 | Batch 33300/35000 | Loss: 0.0007
Epoch: 0001/0001 | Batch 33600/35000 | Loss: 0.0049
Epoch: 0001/0001 | Batch 33900/35000 | Loss: 0.1170
Epoch: 0001/0001 | Batch 34200/35000 | Loss: 0.0002
Epoch: 0001/0001 | Batch 34500/35000 | Loss: 0.4201
Epoch: 0001/0001 | Batch 34800/35000 | Loss: 0.0018
Epoch: 0001/0001 | Train acc.: 78.39% | Val acc.: 86.84%
Time elapsed 43.33 min
Test accuracy 87.91%
Note that the slight increase in accuracy compared to before is most likely due to randomness.