DeepSpeed Configuration Parameters - Quick Start

DeepSpeed is an open-source deep learning optimization library for PyTorch, released by Microsoft. Its main features are:

  • Heterogeneous computing: the ZeRO-Offload mechanism uses CPU and GPU memory at the same time, making it possible to train roughly 10x larger models on a single GPU;
  • Compute acceleration: the Sparse Attention kernels support input sequences up to 10x longer and run up to 6x faster while maintaining accuracy;
  • 3D parallelism: the layers of the model are partitioned across multiple workers (borrowing from NVIDIA's Megatron-LM) to reduce GPU memory usage.

Official document: https://deepspeed.readthedocs.io/en/latest/
Configuration parameter document: https://www.deepspeed.ai/docs/config-json/

Here are several important parameters to explain:

batch size

train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs
// total training batch size = micro-batch size per GPU * number of gradient accumulation steps * number of GPUs
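For example, with train_micro_batch_size_per_gpu = 4, gradient_accumulation_steps = 8, and 2 GPUs, the effective train_batch_size is 4 * 8 * 2 = 64; if train_batch_size is also given explicitly in the config, it must be consistent with this product.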

optimizer

type: Adam, AdamW, OneBitAdam, Lamb, and OneBitLamb are supported

Among them, AdamW, i.e. Adam with decoupled weight decay, is the one used in the typical examples

params: the fields here take the same arguments as the corresponding torch optimizer

For example, for AdamW see https://pytorch.org/docs/stable/optim.html#torch.optim.AdamW

Example:

  "optimizer": {
    "type": "AdamW",
    "params": {
        "lr": 3e-5,
        "betas": [0.8, 0.999],
        "eps": 1e-8,
        "weight_decay": 3e-7
    }
  }

scheduler

type: LRRangeTest, OneCycle, WarmupLR, WarmupDecayLR are supported (see https://deepspeed.readthedocs.io/en/latest/schedulers.html)
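
For example (the numbers here are placeholders, not recommendations), a WarmupLR scheduler that ramps the learning rate from 0 to 3e-5 over the first 500 steps could be configured as:

  "scheduler": {
    "type": "WarmupLR",
    "params": {
        "warmup_min_lr": 0,
        "warmup_max_lr": 3e-5,
        "warmup_num_steps": 500
    }
  }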

fp16

Mixed precision / FP16 training configuration, following NVIDIA's Apex package. (Apex's amp mode is also available in DeepSpeed, but if you use amp you cannot use ZeRO-Offload.)

float32 (FP32, single precision) represents floating-point numbers with 32 bits, while the lower-precision float16 (FP16, half precision) can only represent a narrower range of values. The advantages of FP16 are: the same amount of GPU memory can hold more parameters and more training data; low-precision compute throughput (FLOPS) is higher; and per unit of time, compute units can fetch more data from GPU memory, so training runs faster (adapted from https://zhuanlan.zhihu.com/p/601250710).

The representable range of FP16 is limited, and when training some models the gradient values underflow to 0 in FP16. To keep these gradients representable, the loss can be multiplied by a scaling factor (the loss scale), e.g. 1024, when it is computed; after scaling, numbers very close to 0 become representable in FP16. This happens at the last step of the forward pass, before backpropagation, and the gradients are divided by the same factor again before the optimizer update. There are two strategies for setting the loss scale (a minimal sketch of the mechanism follows the list below):

  • A fixed loss scale value, e.g. somewhere in the range [8, 32000];
  • Dynamic adjustment: the loss scale is initialized to 65536 and is then increased or decreased depending on whether overflow or underflow occurs.
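
A minimal sketch of what static loss scaling does, in plain PyTorch (illustrative only; DeepSpeed applies this automatically when fp16 is enabled, and its dynamic scaler additionally shrinks the scale when gradients overflow):

  import torch

  loss_scale = 1024.0  # fixed scale; dynamic scaling would adjust this value over time

  def scaled_backward_and_step(loss, model, optimizer):
      # Scale the loss so that tiny gradients stay representable in FP16.
      (loss * loss_scale).backward()
      # Unscale the gradients before the optimizer update.
      for p in model.parameters():
          if p.grad is not None:
              p.grad.div_(loss_scale)
      optimizer.step()
      optimizer.zero_grad()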

Combined example:

"fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
}

This configuration enables fp16, sets the initial loss scale to 2^16 = 65536, and uses dynamic adjustment (loss_scale = 0 means dynamic scaling; any other value is used as a fixed loss scale).

(Figure: the training log shows how the loss scale changes over the course of a run.)

zero optimization

stage: ZeRO optimization has four levels: 0, 1, 2, and 3, which correspond to disabled, optimizer state partitioning, optimizer state + gradient partitioning, and optimizer state + gradient + parameter partitioning, respectively.

offload_optimizer: offload the optimizer state to CPU or NVMe and move the optimizer computation to the CPU. Applicable to stages 1, 2, and 3.

offload_param: offload the model parameters to CPU or NVMe. Only valid for stage 3.

Example for stage = 2:

"zero_optimization": {
      "stage": 2,
      "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
      },
      "allgather_partitions": true,
      "allgather_bucket_size": 2e8,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size": 2e8,
      "contiguous_gradients": true
  }

Example with stage = 3:

 "zero_optimization": {
      "stage": 3,
      "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
      },
      "offload_param": {
          "device": "cpu",
          "pin_memory": true
      },
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "stage3_param_persistence_threshold": "auto",
      "stage3_max_live_parameters": 1e9,
      "stage3_max_reuse_distance": 1e9,
      "stage3_gather_16bit_weights_on_model_save": true
  }

csv monitor

The Monitor section logs training details to a Tensorboard-compatible file, WandB, or a simple CSV file.

Here's a csv example:

"csv_monitor": {
    "enabled": true,
    "output_path": "output/ds_logs/",
    "job_name": "train_bert"
}
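
The monitor writes its CSV files under the configured output_path (grouped by job_name, based on the fields above), so a quick way to inspect a run is to load whatever CSVs appear there (the directory layout below is an assumption, not a guaranteed path):

  import glob
  import pandas as pd

  # Load every CSV the monitor produced for this run and peek at it.
  for path in sorted(glob.glob("output/ds_logs/**/*.csv", recursive=True)):
      print(path)
      print(pd.read_csv(path).head())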

(Figure: the loss values recorded over the course of the same training run.)

example

Finally, here are two complete configuration files, for stage 2 and stage 3, that can be used directly. Most parameters are set to auto, which means they are filled in by the framework that consumes the config (e.g. the Hugging Face Trainer integration); a usage sketch follows the second config.

Stage 2:

{
  "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
  },

  "optimizer": {
      "type": "AdamW",
      "params": {
          "lr": "auto",
          "betas": "auto",
          "eps": "auto",
          "weight_decay": "auto"
      }
  },

  "scheduler": {
      "type": "WarmupLR",
      "params": {
          "warmup_min_lr": "auto",
          "warmup_max_lr": "auto",
          "warmup_num_steps": "auto"
      }
  },

  "zero_optimization": {
      "stage": 2,
      "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
      },
      "allgather_partitions": true,
      "allgather_bucket_size": 2e8,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size": 2e8,
      "contiguous_gradients": true
  },

  "csv_monitor" : {
    "enabled": true,
    "job_name" : "stage2_test"
  },

  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 100,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
Stage 3:

{
  "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
  },

  "optimizer": {
      "type": "AdamW",
      "params": {
          "lr": "auto",
          "betas": "auto",
          "eps": "auto",
          "weight_decay": "auto"
      }
  },

  "scheduler": {
      "type": "WarmupLR",
      "params": {
          "warmup_min_lr": "auto",
          "warmup_max_lr": "auto",
          "warmup_num_steps": "auto"
      }
  },

  "zero_optimization": {
      "stage": 3,
      "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
      },
      "offload_param": {
          "device": "cpu",
          "pin_memory": true
      },
      "overlap_comm": true,
      "contiguous_gradients": true,
      "sub_group_size": 1e9,
      "reduce_bucket_size": "auto",
      "stage3_prefetch_bucket_size": "auto",
      "stage3_param_persistence_threshold": "auto",
      "stage3_max_live_parameters": 1e9,
      "stage3_max_reuse_distance": 1e9,
      "stage3_gather_16bit_weights_on_model_save": true
  },

  "csv_monitor" : {
    "enabled": true,
    "job_name" : "stage3_test"
  },

  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 100,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
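
How these files are consumed depends on the front end. With the Hugging Face Transformers Trainer, the path is passed via TrainingArguments(deepspeed="ds_config_stage2.json"), and the Trainer is what resolves the auto values. With the plain DeepSpeed engine, the auto entries must first be replaced with concrete numbers; a minimal sketch (placeholder model and file name) then looks like:

  # Typically launched with the DeepSpeed launcher: deepspeed train.py
  import torch
  import deepspeed

  model = torch.nn.Linear(10, 2)  # placeholder model

  # deepspeed.initialize builds the fp16/ZeRO engine, optimizer and scheduler
  # from the JSON config file.
  engine, optimizer, _, scheduler = deepspeed.initialize(
      model=model,
      model_parameters=model.parameters(),
      config="ds_config_stage2.json",
  )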

Origin: blog.csdn.net/O_1CxH/article/details/129031307