DeepSpeed accelerates large model training

DeepSpeed is a framework from Microsoft that wraps PyTorch models, providing advanced features such as faster model training, reduced GPU memory usage, and convenient distributed training. Here I test DeepSpeed to see whether it can improve the training performance of my transformer model.

In my previous blog post, I introduced how to train a GPT-2 model with SFT to build my own ChatGPT. Based on that model, I will now train with DeepSpeed and see how it performs.

To use DeepSpeed, we first need to define a configuration. The following is the JSON configuration file I defined:

{
    "train_batch_size": 4,
    "steps_per_print": 100,
    "fp16": {
        "enabled": true
    },
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": 0.00006,
          "betas": [0.9, 0.95],
          "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0.000006,
            "warmup_max_lr": 0.00006,
            "warmup_num_steps": 1000,
            "total_num_steps": 40000
        }
    }
}
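As I understand it, WarmupDecayLR warms the learning rate up linearly from warmup_min_lr to warmup_max_lr over warmup_num_steps, then decays it back toward zero by total_num_steps. The sketch below is my own approximation of that shape using the values from the config above, not DeepSpeed's exact implementation (its decay curve may differ):

```python
def approx_warmup_decay_lr(step,
                           warmup_min_lr=6e-6, warmup_max_lr=6e-5,
                           warmup_num_steps=1000, total_num_steps=40000):
    """Rough approximation of the WarmupDecayLR schedule configured above."""
    if step < warmup_num_steps:
        # linear warmup from warmup_min_lr to warmup_max_lr
        return warmup_min_lr + (warmup_max_lr - warmup_min_lr) * step / warmup_num_steps
    # linear decay from the peak toward zero at total_num_steps
    remaining = (total_num_steps - step) / (total_num_steps - warmup_num_steps)
    return max(0.0, warmup_max_lr * remaining)

print(approx_warmup_decay_lr(0))      # starts at warmup_min_lr
print(approx_warmup_decay_lr(1000))   # peaks at warmup_max_lr
print(approx_warmup_decay_lr(40000))  # decayed to zero
```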

Build a normal PyTorch model as described in my previous blog, then wrap it with DeepSpeed:

model, _, _, lr_scheduler = deepspeed.initialize(model=model, model_parameters=optim_groups, config=args.deepspeedcfg)

The args.deepspeedcfg here points to the JSON file we configured previously.
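One detail worth keeping in mind: DeepSpeed requires that train_batch_size equal the per-GPU micro batch size times gradient_accumulation_steps times the number of GPUs. A small sanity check for a config like the one above (the helper function name is mine, for illustration only):

```python
import json

def check_batch_config(cfg, micro_batch_per_gpu, world_size=1):
    """Verify DeepSpeed's batch-size consistency rule:
    train_batch_size == micro_batch_per_gpu * grad_accum * world_size."""
    expected = (micro_batch_per_gpu
                * cfg.get("gradient_accumulation_steps", 1)
                * world_size)
    return cfg["train_batch_size"] == expected

cfg = json.loads('{"train_batch_size": 4, "gradient_accumulation_steps": 1}')
print(check_batch_config(cfg, micro_batch_per_gpu=4))  # True on a single GPU
```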

Then rewrite the training step accordingly:

logits, loss = model(x, y)   # forward pass through the wrapped model
model.backward(loss)         # DeepSpeed handles backward (incl. fp16 loss scaling)
model.step()                 # optimizer step plus lr scheduler step

Finally, I tested it on my local RTX 2080 Ti GPU. Here are the results:

1. Without DeepSpeed, batch_size=4, training 4000 batches: 8050 MB of GPU memory, 748 seconds.

2. With DeepSpeed, batch_size=4, DeepSpeed's AdamW optimizer, ZeRO disabled: 6392 MB of GPU memory, 613 seconds.

3. With DeepSpeed, batch_size=4, PyTorch's AdamW optimizer, ZeRO disabled: 6392 MB of GPU memory, 802 seconds.

4. With DeepSpeed, batch_size=8, DeepSpeed's AdamW optimizer, ZeRO disabled: 8688 MB of GPU memory, 1031 seconds.

5. With DeepSpeed, batch_size=8, DeepSpeed's AdamW optimizer, ZeRO Stage 0: 8688 MB of GPU memory.

6. With DeepSpeed, batch_size=8, DeepSpeed's AdamW optimizer, ZeRO Stage 1: 9382 MB of GPU memory.

7. With DeepSpeed, batch_size=8, DeepSpeed's AdamW optimizer, ZeRO Stage 2: 10336 MB of GPU memory.

8. With DeepSpeed, batch_size=8, DeepSpeed's AdamW optimizer, ZeRO Stage 3: with torch.compile(model) enabled, a deepcopy error is raised; after removing torch.compile, it reports that FP16 is not supported on an AMD CPU. My device also does not support offloading to NVMe, so this test was abandoned.
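For reference, the ZeRO stage used in tests 5–8 is selected through a zero_optimization block in the JSON config, roughly like this (a minimal sketch based on DeepSpeed's config schema as I understand it):

```json
{
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2
    }
}
```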

From these results, using DeepSpeed's own optimizer significantly reduces GPU memory usage and speeds up training. However, after enabling ZeRO, memory usage actually increased at Stages 0, 1, and 2, even though ZeRO is normally supposed to save memory. I don't yet understand why this happens; I will leave it for later analysis.


Origin: blog.csdn.net/gzroy/article/details/132340327