Fine-tuning an LLM on a single GPU

GPUs have been a scarce commodity ever since large models became a hot trend. Many companies do not have enough of them, let alone individual developers. Is there any way to use the available compute more efficiently when training models?

In a recent blog post, Sebastian Raschka introduced "gradient accumulation", a method that makes it possible to train models with a larger effective batch size when GPU memory is limited, working around the hardware constraint.

Prior to this, Sebastian Raschka also shared an article on using multi-GPU training strategies to accelerate the fine-tuning of large language models, including mechanisms such as model and tensor sharding, which distribute model weights and computations across devices to get around GPU memory limits.

Fine-tuning the BLOOM model for classification

Suppose we are interested in adopting a recently released pretrained large language model for a downstream task such as text classification. We might then choose an open-source alternative to GPT-3, the BLOOM model, specifically the BLOOM version with "only" 560 million parameters, which should fit into the memory of a conventional GPU without any problems (the free Google Colab tier provides a GPU with 15 GB of RAM).

Once you start, however, you are likely to run into a problem: memory usage grows rapidly during training or fine-tuning. The only way to train this model on such a GPU is with a batch size of 1 (batch size=1).
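A rough back-of-the-envelope estimate (my own, not from the original post) shows why memory fills up so quickly: with plain fp32 Adam training, the weights, gradients, and the two Adam state tensors alone already claim a large share of a 15 GB card, before counting any activations.

# Rough fp32 memory estimate for BLOOM-560M with Adam (activations not included)
params = 560_000_000
bytes_per_value = 4  # fp32

weights   = params * bytes_per_value   # ~2.2 GB
gradients = params * bytes_per_value   # ~2.2 GB
adam_m    = params * bytes_per_value   # first moment,  ~2.2 GB
adam_v    = params * bytes_per_value   # second moment, ~2.2 GB

total_gb = (weights + gradients + adam_m + adam_v) / 1e9
print(f"~{total_gb:.1f} GB before activations")  # ~9.0 GB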

The code to fine-tune BLOOM for the target classification task with a batch size of 1 is shown below. You can also download the full code from the GitHub project page:

https://github.com/rasbt/gradient-accumulation-blog/blob/main/src/1_batchsize-1.py

You can copy and paste this code directly into Google Colab, but you also have to drag and drop the accompanying local_dataset_utilities.py file, from which some dataset utilities are imported, into the same folder.

# pip install torch lightning matplotlib pandas torchmetrics watermark transformers datasets -U

import os
import os.path as op
import time

from datasets import load_dataset
from lightning import Fabric
import torch
from torch.utils.data import DataLoader
import torchmetrics
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from watermark import watermark

from local_dataset_utilities import download_dataset, load_dataset_into_to_dataframe, partition_dataset
from local_dataset_utilities import IMDBDataset


def tokenize_text(batch):
    return tokenizer(batch["text"], truncation=True, padding=True, max_length=1024)


def train(num_epochs, model, optimizer, train_loader, val_loader, fabric):

    for epoch in range(num_epochs):
        train_acc = torchmetrics.Accuracy(
            task="multiclass", num_classes=2).to(fabric.device)

        for batch_idx, batch in enumerate(train_loader):
            model.train()

            ### FORWARD AND BACK PROP
            outputs = model(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["label"]
            )

            fabric.backward(outputs["loss"])

            ### UPDATE MODEL PARAMETERS
            optimizer.step()
            optimizer.zero_grad()

            ### LOGGING
            if not batch_idx % 300:
                print(f"Epoch: {epoch+1:04d}/{num_epochs:04d}"
                      f" | Batch {batch_idx:04d}/{len(train_loader):04d}"
                      f" | Loss: {outputs['loss']:.4f}")

            model.eval()
            with torch.no_grad():
                predicted_labels = torch.argmax(outputs["logits"], 1)
                train_acc.update(predicted_labels, batch["label"])

        ### MORE LOGGING
        model.eval()
        with torch.no_grad():
            val_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2).to(fabric.device)
            for batch in val_loader:
                outputs = model(
                    batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["label"]
                )
                predicted_labels = torch.argmax(outputs["logits"], 1)
                val_acc.update(predicted_labels, batch["label"])

            print(f"Epoch: {epoch+1:04d}/{num_epochs:04d}"
                  f" | Train acc.: {train_acc.compute()*100:.2f}%"
                  f" | Val acc.: {val_acc.compute()*100:.2f}%")
            train_acc.reset(), val_acc.reset()


if __name__ == "__main__":

    print(watermark(packages="torch,lightning,transformers", python=True))
    print("Torch CUDA available?", torch.cuda.is_available())
    device = "cuda" if torch.cuda.is_available() else "cpu"

    torch.manual_seed(123)
    # torch.use_deterministic_algorithms(True)

    ##########################
    ### 1 Loading the Dataset
    ##########################
    download_dataset()
    df = load_dataset_into_to_dataframe()
    if not (op.exists("train.csv") and op.exists("val.csv") and op.exists("test.csv")):
        partition_dataset(df)

    imdb_dataset = load_dataset(
        "csv",
        data_files={
            "train": "train.csv",
            "validation": "val.csv",
            "test": "test.csv",
        },
    )

    #########################################
    ### 2 Tokenization and Numericalization
    #########################################

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m", max_length=1024)
    print("Tokenizer input max length:", tokenizer.model_max_length, flush=True)
    print("Tokenizer vocabulary size:", tokenizer.vocab_size, flush=True)

    print("Tokenizing ...", flush=True)
    imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)
    del imdb_dataset
    imdb_tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])
    os.environ["TOKENIZERS_PARALLELISM"] = "false"

    #########################################
    ### 3 Set Up DataLoaders
    #########################################

    train_dataset = IMDBDataset(imdb_tokenized, partition_key="train")
    val_dataset = IMDBDataset(imdb_tokenized, partition_key="validation")
    test_dataset = IMDBDataset(imdb_tokenized, partition_key="test")

    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=1,
        shuffle=True,
        num_workers=4,
        drop_last=True,
    )

    val_loader = DataLoader(
        dataset=val_dataset,
        batch_size=1,
        num_workers=4,
        drop_last=True,
    )

    test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=1,
        num_workers=2,
        drop_last=True,
    )

    #########################################
    ### 4 Initializing the Model
    #########################################

    fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")
    fabric.launch()

    model = AutoModelForSequenceClassification.from_pretrained(
        "bigscience/bloom-560m", num_labels=2)

    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

    model, optimizer = fabric.setup(model, optimizer)
    train_loader, val_loader, test_loader = fabric.setup_dataloaders(
        train_loader, val_loader, test_loader)

    #########################################
    ### 5 Finetuning
    #########################################

    start = time.time()
    train(
        num_epochs=1,
        model=model,
        optimizer=optimizer,
        train_loader=train_loader,
        val_loader=val_loader,
        fabric=fabric,
    )

    end = time.time()
    elapsed = end - start
    print(f"Time elapsed {elapsed/60:.2f} min")

    with torch.no_grad():
        model.eval()
        test_acc = torchmetrics.Accuracy(task="multiclass", num_classes=2).to(fabric.device)
        for batch in test_loader:
            outputs = model(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["label"]
            )
            predicted_labels = torch.argmax(outputs["logits"], 1)
            test_acc.update(predicted_labels, batch["label"])

    print(f"Test accuracy {test_acc.compute()*100:.2f}%")

The author used Lightning Fabric because it lets developers flexibly change the number of GPUs and the multi-GPU training strategy when running this code on different hardware. It also makes it possible to enable mixed-precision training just by adjusting the precision flag. In this case, mixed-precision training can triple training speed and reduce memory requirements by about 25%.
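For example, switching precision or hardware setups only requires changing the Fabric constructor arguments; the training loop itself stays untouched. The sketch below shows a few variants I believe Lightning 2.0 supports; the script above uses the single-GPU, "16-mixed" configuration.

from lightning import Fabric

# Single GPU, mixed precision (as in the script above)
fabric = Fabric(accelerator="cuda", devices=1, precision="16-mixed")

# Full fp32 on a single GPU
# fabric = Fabric(accelerator="cuda", devices=1, precision="32-true")

# Four GPUs with distributed data parallelism
# fabric = Fabric(accelerator="cuda", devices=4, strategy="ddp", precision="16-mixed")

fabric.launch()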

The main part of the code above runs inside the if __name__ == "__main__" context. This is recommended practice when running PyTorch code that may spawn multiple GPU processes, even though only a single GPU is used here. The next three code sections inside if __name__ == "__main__" take care of data loading:

# 1 Load the dataset

# 2 Tokenization and numericalization

# 3 Set up the data loader

Section 4 initializes the model, and Section 5 (Finetuning) then calls the train function, which is where things start to get interesting. The train(...) function implements a standard PyTorch training loop. An annotated version of the core training loop looks like this:
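(Condensed from the train() function in the script above; the comments are mine, and the periodic logging and accuracy tracking are omitted for brevity.)

for batch_idx, batch in enumerate(train_loader):
    model.train()

    # Forward pass: the model returns both the loss and the logits
    outputs = model(
        batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["label"],
    )

    # Backward pass via Fabric (handles mixed-precision scaling)
    fabric.backward(outputs["loss"])

    # Update the model parameters after every single batch (batch size 1 here)
    optimizer.step()
    optimizer.zero_grad()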

The problem with a batch size of 1 is that the gradient updates become very noisy and erratic, as the fluctuating training loss and the poor test set performance in the training log below show:

...
torch       : 2.0.0
lightning   : 2.0.0
transformers: 4.27.2

Torch CUDA available? True
...
Epoch: 0001/0001 | Batch 23700/35000 | Loss: 0.0969
Epoch: 0001/0001 | Batch 24000/35000 | Loss: 1.9902
Epoch: 0001/0001 | Batch 24300/35000 | Loss: 0.0395
Epoch: 0001/0001 | Batch 24600/35000 | Loss: 0.2546
Epoch: 0001/0001 | Batch 24900/35000 | Loss: 0.1128
Epoch: 0001/0001 | Batch 25200/35000 | Loss: 0.2661
Epoch: 0001/0001 | Batch 25500/35000 | Loss: 0.0044
Epoch: 0001/0001 | Batch 25800/35000 | Loss: 0.0067
Epoch: 0001/0001 | Batch 26100/35000 | Loss: 0.0468
Epoch: 0001/0001 | Batch 26400/35000 | Loss: 1.7139
Epoch: 0001/0001 | Batch 26700/35000 | Loss: 0.9570
Epoch: 0001/0001 | Batch 27000/35000 | Loss: 0.1857
Epoch: 0001/0001 | Batch 27300/35000 | Loss: 0.0090
Epoch: 0001/0001 | Batch 27600/35000 | Loss: 0.9790
Epoch: 0001/0001 | Batch 27900/35000 | Loss: 0.0503
Epoch: 0001/0001 | Batch 28200/35000 | Loss: 0.2625
Epoch: 0001/0001 | Batch 28500/35000 | Loss: 0.1010
Epoch: 0001/0001 | Batch 28800/35000 | Loss: 0.0035
Epoch: 0001/0001 | Batch 29100/35000 | Loss: 0.0009
Epoch: 0001/0001 | Batch 29400/35000 | Loss: 0.0234
Epoch: 0001/0001 | Batch 29700/35000 | Loss: 0.8394
Epoch: 0001/0001 | Batch 30000/35000 | Loss: 0.9497
Epoch: 0001/0001 | Batch 30300/35000 | Loss: 0.1437
Epoch: 0001/0001 | Batch 30600/35000 | Loss: 0.1317
Epoch: 0001/0001 | Batch 30900/35000 | Loss: 0.0112
Epoch: 0001/0001 | Batch 31200/35000 | Loss: 0.0073
Epoch: 0001/0001 | Batch 31500/35000 | Loss: 0.7393
Epoch: 0001/0001 | Batch 31800/35000 | Loss: 0.0512
Epoch: 0001/0001 | Batch 32100/35000 | Loss: 0.1337
Epoch: 0001/0001 | Batch 32400/35000 | Loss: 1.1875
Epoch: 0001/0001 | Batch 32700/35000 | Loss: 0.2727
Epoch: 0001/0001 | Batch 33000/35000 | Loss: 0.1545
Epoch: 0001/0001 | Batch 33300/35000 | Loss: 0.0022
Epoch: 0001/0001 | Batch 33600/35000 | Loss: 0.2681
Epoch: 0001/0001 | Batch 33900/35000 | Loss: 0.2467
Epoch: 0001/0001 | Batch 34200/35000 | Loss: 0.0620
Epoch: 0001/0001 | Batch 34500/35000 | Loss: 2.5039
Epoch: 0001/0001 | Batch 34800/35000 | Loss: 0.0131
Epoch: 0001/0001 | Train acc.: 75.11% | Val acc.: 78.62%
Time elapsed 69.97 min
Test accuracy 78.53%

If we do not have multiple GPUs available for tensor sharding, what can we do to train the model with a larger batch size?

One such solution is gradient accumulation, which modifies the aforementioned training loop.

What is Gradient Accumulation?

Gradient accumulation is a way to virtually increase the batch size during training, which is useful when the available GPU memory is insufficient to hold the desired batch size. With gradient accumulation, gradients are computed for smaller batches and accumulated (usually summed or averaged) over multiple iterations, rather than updating the model weights after every batch. Once the accumulated gradients cover the target "virtual" batch size, the model weights are updated using the accumulated gradients.

See the updated PyTorch training loop below:
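(A sketch of the modification described above, mirroring the loop from the script; accumulation_steps is the new knob, and the loss is divided by accumulation_steps so that the accumulated gradient corresponds to an average over the virtual batch. Whether to average or sum is a design choice, and the author's exact script may differ in small details.)

accumulation_steps = 16  # virtual batch size = accumulation_steps * batch_size

for batch_idx, batch in enumerate(train_loader):
    model.train()

    ### FORWARD AND BACK PROP
    outputs = model(
        batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["label"],
    )

    # Scale the loss so the accumulated gradient averages over the virtual batch
    fabric.backward(outputs["loss"] / accumulation_steps)

    ### UPDATE MODEL PARAMETERS only every accumulation_steps batches
    if not batch_idx % accumulation_steps:
        optimizer.step()
        optimizer.zero_grad()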

If accumulation_steps is set to 2, then optimizer.step() and zero_grad() are only called every second iteration. Therefore, running the modified training loop with accumulation_steps=2 has the same effect as doubling the batch size.

For example, if you want to use a batch size of 256 but can only fit a batch size of 64 into GPU memory, you can perform gradient accumulation over four batches of size 64. (Once all four batches have been processed, the accumulated gradients are equivalent to those of a single batch of size 256.) This effectively simulates a larger batch size without requiring more GPU memory or sharding tensors across different devices.
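A quick sanity check of that arithmetic (variable names are mine, for illustration):

micro_batch_size = 64     # what fits into GPU memory at once
accumulation_steps = 4    # number of micro-batches to accumulate

effective_batch_size = micro_batch_size * accumulation_steps
print(effective_batch_size)  # 256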

While gradient accumulation lets us train with larger effective batch sizes, it does not reduce the total amount of computation. In fact, it can sometimes make training slightly slower, because weight updates are performed less frequently. Nevertheless, it helps us work around the problem that very small batch sizes lead to frequent and noisy updates.

For example, let's now run the code from above with a batch size of 1, using 16 accumulation steps to simulate a batch size of 16.

The output is as follows:

...
torch       : 2.0.0
lightning   : 2.0.0
transformers: 4.27.2

Torch CUDA available? True
...
Epoch: 0001/0001 | Batch 23700/35000 | Loss: 0.0168
Epoch: 0001/0001 | Batch 24000/35000 | Loss: 0.0006
Epoch: 0001/0001 | Batch 24300/35000 | Loss: 0.0152
Epoch: 0001/0001 | Batch 24600/35000 | Loss: 0.0003
Epoch: 0001/0001 | Batch 24900/35000 | Loss: 0.0623
Epoch: 0001/0001 | Batch 25200/35000 | Loss: 0.0010
Epoch: 0001/0001 | Batch 25500/35000 | Loss: 0.0001
Epoch: 0001/0001 | Batch 25800/35000 | Loss: 0.0047
Epoch: 0001/0001 | Batch 26100/35000 | Loss: 0.0004
Epoch: 0001/0001 | Batch 26400/35000 | Loss: 0.1016
Epoch: 0001/0001 | Batch 26700/35000 | Loss: 0.0021
Epoch: 0001/0001 | Batch 27000/35000 | Loss: 0.0015
Epoch: 0001/0001 | Batch 27300/35000 | Loss: 0.0008
Epoch: 0001/0001 | Batch 27600/35000 | Loss: 0.0060
Epoch: 0001/0001 | Batch 27900/35000 | Loss: 0.0001
Epoch: 0001/0001 | Batch 28200/35000 | Loss: 0.0426
Epoch: 0001/0001 | Batch 28500/35000 | Loss: 0.0012
Epoch: 0001/0001 | Batch 28800/35000 | Loss: 0.0025
Epoch: 0001/0001 | Batch 29100/35000 | Loss: 0.0025
Epoch: 0001/0001 | Batch 29400/35000 | Loss: 0.0000
Epoch: 0001/0001 | Batch 29700/35000 | Loss: 0.0495
Epoch: 0001/0001 | Batch 30000/35000 | Loss: 0.0164
Epoch: 0001/0001 | Batch 30300/35000 | Loss: 0.0067
Epoch: 0001/0001 | Batch 30600/35000 | Loss: 0.0037
Epoch: 0001/0001 | Batch 30900/35000 | Loss: 0.0005
Epoch: 0001/0001 | Batch 31200/35000 | Loss: 0.0013
Epoch: 0001/0001 | Batch 31500/35000 | Loss: 0.0112
Epoch: 0001/0001 | Batch 31800/35000 | Loss: 0.0053
Epoch: 0001/0001 | Batch 32100/35000 | Loss: 0.0012
Epoch: 0001/0001 | Batch 32400/35000 | Loss: 0.1365
Epoch: 0001/0001 | Batch 32700/35000 | Loss: 0.0210
Epoch: 0001/0001 | Batch 33000/35000 | Loss: 0.0374
Epoch: 0001/0001 | Batch 33300/35000 | Loss: 0.0007
Epoch: 0001/0001 | Batch 33600/35000 | Loss: 0.0341
Epoch: 0001/0001 | Batch 33900/35000 | Loss: 0.0259
Epoch: 0001/0001 | Batch 34200/35000 | Loss: 0.0005
Epoch: 0001/0001 | Batch 34500/35000 | Loss: 0.4792
Epoch: 0001/0001 | Batch 34800/35000 | Loss: 0.0003
Epoch: 0001/0001 | Train acc.: 78.67% | Val acc.: 87.28%
Time elapsed 51.37 min
Test accuracy 87.37%

According to the results above, the loss fluctuates much less than before. In addition, test set performance improved by nearly 10 percentage points (from 78.53% to 87.37%). Since the training set is only iterated over once, each training example is seen only once; training for multiple epochs could improve predictive performance further.

You may also notice that this run finishes faster than the batch-size-1 run from earlier. If we increase the virtual batch size via gradient accumulation, the number of forward and backward passes stays the same. However, because the model weights are updated only once every 16 batches (accumulation_steps=16), there are far fewer optimizer steps, which allows faster iteration over the samples within one epoch.

Conclusion

Gradient accumulation is a technique that simulates larger batch sizes by accumulating the gradients of multiple small batches before performing a weight update. It helps when available memory is limited and the batch size that fits into memory is small.

But first consider a scenario in which you can already run with the desired batch size, meaning the available memory is large enough to hold it. In that case, gradient accumulation may not be necessary. In fact, running with the larger actual batch size may be more efficient, since it allows more parallelism and reduces the number of weight updates required to train the model.

In summary, gradient accumulation is a practical technique for reducing the impact of small-mini-batch noise on the gradient updates. It is a simple and effective way to work around hardware limitations.

PS: Can this be made to run faster?

Yes. It can be made to run even faster using torch.compile, introduced in PyTorch 2.0. All that is needed is to add a model = torch.compile(model) call, as in the snippet below:
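(A minimal sketch showing only the changed portion of the model-initialization section from the script above; I assume the model is compiled right after it is loaded and before it is handed to fabric.setup(). Check the full script for the exact placement.)

model = AutoModelForSequenceClassification.from_pretrained(
    "bigscience/bloom-560m", num_labels=2)

model = torch.compile(model)  # new in PyTorch 2.0

model, optimizer = fabric.setup(model, optimizer)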

The full script is available on GitHub.

In this case, torch.compile shaves roughly another eight minutes off the training time (51.37 min down to 43.33 min) without affecting predictive performance:

Epoch: 0001/0001 | Batch 26400/35000 | Loss: 0.0320
Epoch: 0001/0001 | Batch 26700/35000 | Loss: 0.0010
Epoch: 0001/0001 | Batch 27000/35000 | Loss: 0.0006
Epoch: 0001/0001 | Batch 27300/35000 | Loss: 0.0015
Epoch: 0001/0001 | Batch 27600/35000 | Loss: 0.0157
Epoch: 0001/0001 | Batch 27900/35000 | Loss: 0.0015
Epoch: 0001/0001 | Batch 28200/35000 | Loss: 0.0540
Epoch: 0001/0001 | Batch 28500/35000 | Loss: 0.0035
Epoch: 0001/0001 | Batch 28800/35000 | Loss: 0.0016
Epoch: 0001/0001 | Batch 29100/35000 | Loss: 0.0015
Epoch: 0001/0001 | Batch 29400/35000 | Loss: 0.0008
Epoch: 0001/0001 | Batch 29700/35000 | Loss: 0.0877
Epoch: 0001/0001 | Batch 30000/35000 | Loss: 0.0232
Epoch: 0001/0001 | Batch 30300/35000 | Loss: 0.0014
Epoch: 0001/0001 | Batch 30600/35000 | Loss: 0.0032
Epoch: 0001/0001 | Batch 30900/35000 | Loss: 0.0004
Epoch: 0001/0001 | Batch 31200/35000 | Loss: 0.0062
Epoch: 0001/0001 | Batch 31500/35000 | Loss: 0.0032
Epoch: 0001/0001 | Batch 31800/35000 | Loss: 0.0066
Epoch: 0001/0001 | Batch 32100/35000 | Loss: 0.0017
Epoch: 0001/0001 | Batch 32400/35000 | Loss: 0.1485
Epoch: 0001/0001 | Batch 32700/35000 | Loss: 0.0324
Epoch: 0001/0001 | Batch 33000/35000 | Loss: 0.0155
Epoch: 0001/0001 | Batch 33300/35000 | Loss: 0.0007
Epoch: 0001/0001 | Batch 33600/35000 | Loss: 0.0049
Epoch: 0001/0001 | Batch 33900/35000 | Loss: 0.1170
Epoch: 0001/0001 | Batch 34200/35000 | Loss: 0.0002
Epoch: 0001/0001 | Batch 34500/35000 | Loss: 0.4201
Epoch: 0001/0001 | Batch 34800/35000 | Loss: 0.0018
Epoch: 0001/0001 | Train acc.: 78.39% | Val acc.: 86.84%
Time elapsed 43.33 min
Test accuracy 87.91%

Note that the slight increase in accuracy compared to before is most likely due to randomness.

