[Deep Learning] A Framework for Large Model Training: Using DeepSpeed

Models keep getting bigger, often several billion or even hundreds of billions of parameters, and a single GPU's memory simply cannot support training or inference at that scale. For example, on an RTX 2090 with 10 GB of memory, merely loading such a model causes an OOM error, let alone the subsequent training and optimization.

As an alternative to PyTorch's traditional DataParallel, the goal of DeepSpeed is to let models with hundreds of millions of parameters be trained and run for inference on an ordinary personal workstation. DeepSpeed is a large-scale distributed training tool released by Microsoft that mainly implements the ZeRO family of parallel training algorithms.

This article briefly introduces the core concepts behind using DeepSpeed for large-scale model training, along with the most basic ways to use it. For more detail, the author strongly recommends the DeepSpeed tutorial on the HuggingFace Transformers website:

Transformer DeepSpeed Integration

Link to the original documentation: DeepSpeed Integration (huggingface.co)

1. The core idea of DeepSpeed

The core idea of DeepSpeed: when GPU memory is not enough, use CPU memory to make up for it.

For example, if we only have a 10 GB GPU, we will probably need around 80 GB of CPU memory to train a large model.
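
As a rough sanity check of why so much memory is needed (a back-of-the-envelope estimate based on the standard mixed-precision Adam accounting, not a DeepSpeed measurement): each parameter costs about 2 bytes for the fp16 weight, 2 bytes for the fp16 gradient, and 12 bytes for the fp32 optimizer states (master weight, momentum, variance), roughly 16 bytes per parameter before activations are even counted.

def model_states_gb(num_params: float, bytes_per_param: int = 2 + 2 + 12) -> float:
    """Approximate size of the 'model states': fp16 weights + fp16 grads + fp32 Adam states."""
    return num_params * bytes_per_param / 1e9

print(model_states_gb(1.5e9))  # ~24 GB for a 1.5B-parameter model -- already beyond a 10 GB GPU
print(model_states_gb(10e9))   # ~160 GB for a 10B-parameter model -- hence the large CPU memory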

Take a look at the description of this concept on the official website:

Why would you want to use DeepSpeed with just one GPU?

  1. It has a ZeRO-offload feature which can delegate some computations and memory to the host’s CPU and RAM, and thus leave more GPU resources for model’s needs - e.g. larger batch size, or enabling a fitting of a very big model which normally won’t fit.
  2. It provides a smart GPU memory management system, that minimizes memory fragmentation, which again allows you to fit bigger models and data batches.

Specifically, DeepSpeed offloads the parameters that the model does not need at the current moment to CPU memory, and moves them back to the GPU when they are needed. "Parameters" here means not only the model weights but also the optimizer states, gradients, and so on.

The more parameters are moved to the CPU, the lighter the load on the GPU; the price is that more frequent CPU-GPU transfers greatly increase the time spent on training and inference. One of the core themes of DeepSpeed is therefore the trade-off between time overhead and memory usage. In summary, DeepSpeed provides:

  1. Optimizer state partitioning (ZeRO stage 1)
  2. Gradient partitioning (ZeRO stage 2)
  3. Parameter partitioning (ZeRO stage 3)
  4. Custom mixed precision training handling
  5. A range of fast CUDA-extension-based optimizers
  6. ZeRO-Offload to CPU and NVMe

The ZeRO optimization in DeepSpeed's config file can be divided into the following stages:

  • ZeRO Stage 1: partition the optimizer states. The optimizer states are split across multiple GPUs, and each process is only responsible for updating its own shard of the parameters.
  • ZeRO Stage 2: additionally partition the gradients. Each GPU only keeps the gradients corresponding to the optimizer states it holds. This makes sense, since gradients and optimizer states are closely linked: knowing a gradient without its matching optimizer state gives you no way to update those model parameters.
  • ZeRO Stage 3: additionally partition the model parameters themselves (i.e., the layers' weights). ZeRO-3 automatically distributes the model parameters across multiple GPUs and gathers them as needed during the forward and backward passes.

Since ZeRO-1 only partitions the optimizer states (a relatively small saving), in practice only ZeRO-2 and ZeRO-3 are usually considered.

2. Using DeepSpeed

How to launch

After using DeepSpeed, your command line will look like this:

deepspeed --master_port 9900 --num_gpus=2 run_s2s.py \
--deepspeed ds_config.json
  • --master_port: the communication port. It is best to specify it explicitly, because the default port may already be occupied (e.g., when several DeepSpeed processes run on the same machine).
  • --num_gpus: the number of GPUs to use; by default all visible GPUs are used.
  • --deepspeed: the config file that specifies the important DeepSpeed parameters.

One of the core tasks when using DeepSpeed is writing a config file (a .json or JSON-like configuration file). In it you specify the parameters you want, for example how to trade time against GPU memory (as mentioned earlier, this is the key trade-off). Among the arguments above, the most important one is therefore --deepspeed, i.e., the ZeRO config file you provide. This is what the rest of this article focuses on.

# 1. Single-GPU usage
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py ...

# Single GPU, specifying which GPU to use
deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ...

# 2. Multi-GPU usage, option 1
python -m torch.distributed.run --nproc_per_node=2 your_program.py <normal cl args> --deepspeed ds_config.json

# Multi-GPU usage, option 2
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json

# 3. Multi-node multi-GPU, option 1: must be launched manually on every node
python -m torch.distributed.run --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 --master_port=9901 your_program.py <normal cl args> --deepspeed ds_config.json

# Multi-node multi-GPU, option 2: create a hostfile and launch from a single node
hostname1 slots=8
hostname2 slots=8
# then run
deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 your_program.py <normal cl args> --deepspeed ds_config.json

# Running on SLURM: omitted here, see the original documentation
# Running in Jupyter: omitted here, see the original documentation

Why can DeepSpeed be useful even with a single GPU?

  1. ZeRO-Offload can move some computation and data to the CPU, reducing the demand on GPU memory.
  2. It provides smart GPU memory management that reduces memory fragmentation.

Passing the deepspeed argument to TrainingArguments

TrainingArguments(..., deepspeed="/path/to/ds_config.json")

# or
ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
TrainingArguments(..., deepspeed=ds_config_dict)
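
Putting the pieces together, here is a minimal, hedged sketch of wiring a DeepSpeed config into a HuggingFace Trainer run. The model name, toy dataset, and ds_config.json path are illustrative assumptions, and the script is meant to be launched with the deepspeed launcher shown above.

from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, Trainer,
                          TrainingArguments)

model_name = "t5-small"  # any seq2seq checkpoint; chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A toy, already-tokenized dataset so the sketch is self-contained.
enc = tokenizer("translate English to German: hello", return_tensors="pt")
lab = tokenizer("hallo", return_tensors="pt")
train_data = [{"input_ids": enc["input_ids"][0],
               "attention_mask": enc["attention_mask"][0],
               "labels": lab["input_ids"][0]}]

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    fp16=True,                    # should agree with "fp16": {"enabled": "auto"} in the config
    deepspeed="ds_config.json",   # the ZeRO config file discussed below
)

trainer = Trainer(model=model, args=args, train_dataset=train_data)
trainer.train()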

2.1 ZeRO Stage 2

Based on the official documentation, the author provides a typical ZeRO stage-2 config file:

{
	"bfloat16": {
		"enabled": "auto"
	},
	"fp16": {
		"enabled": "auto",
		"loss_scale": 0,
		"loss_scale_window": 1000,
		"initial_scale_power": 16,
		"hysteresis": 2,
		"min_loss_scale": 1
	},
	"optimizer": {
		"type": "AdamW",
		"params": {
			"lr": "auto",
			"betas": "auto",
			"eps": "auto",
			"weight_decay": "auto"
		}
	},
	"scheduler": {
		"type": "WarmupLR",
		"params": {
			"warmup_min_lr": "auto",
			"warmup_max_lr": "auto",
			"warmup_num_steps": "auto"
		}
	},
	"zero_optimization": {
		"stage": 2,
		"offload_optimizer": {
			"device": "cpu",
			"pin_memory": true
		},
		"allgather_partitions": true,
		"allgather_bucket_size": 2e8,
		"overlap_comm": true,
		"reduce_scatter": true,
		"reduce_bucket_size": 2e8,
		"contiguous_gradients": true
	},
	"gradient_accumulation_steps": "auto",
	"gradient_clipping": "auto",
	"train_batch_size": "auto",
	"train_micro_batch_size_per_gpu": "auto",
	"steps_per_print": 1e5
}
  • overlap_comm: controls whether communication is overlapped with computation. When set to true, DeepSpeed tries to perform gradient communication in parallel with gradient computation, which reduces communication time and speeds up training. Enabling it also enlarges the buffers used for communication between GPUs: the larger these buffers, the faster the communication and hence training, but the more GPU memory is consumed, and vice versa. overlap_comm is therefore another parameter that involves a trade-off.
  • allgather_bucket_size: controls the bucket size of the all-gather operation. In distributed training, all-gather means that each process collects the tensors from all other processes and concatenates them in order. Splitting tensors into buckets lets data be transferred more efficiently. The larger allgather_bucket_size is, the larger each bucket becomes and the faster communication may be, but the more memory is needed for intermediate results. The appropriate bucket size should be tuned to your situation.
  • reduce_bucket_size: analogous to allgather_bucket_size, but for the all-reduce operation. All-reduce reduces a tensor across all processes (e.g., summation) and broadcasts the result back to every process. Again, larger buckets may speed up communication but require more memory for intermediate results, so the value needs to be tuned.
  • offload_optimizer: as shown above, we set its "device" to "cpu", so DeepSpeed will, following the ZeRO scheme described earlier, keep the optimizer states in CPU memory during training, reducing the memory usage of each GPU.

When overlap_comm is enabled, it uses 4.5 times the allgather_bucket_size and reduce_bucket_size values. If both are set to 5e8, about 9 GB of GPU memory is needed (5e8 x 2 bytes x 2 x 4.5). If the GPU has 8 GB or less, these parameters should be reduced to about 2e8 to avoid OOM, which then needs about 3.6 GB. The same adjustment applies if OOM still occurs on a large GPU.
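
A quick back-of-the-envelope check of the numbers above (this is just the arithmetic from the docs, assuming fp16 elements of 2 bytes each):

def comm_buffer_gb(bucket_size: float, dtype_bytes: int = 2, overlap_factor: float = 4.5) -> float:
    """GPU memory reserved for the allgather + reduce buckets when overlap_comm is enabled."""
    # two buckets (allgather + reduce), each holding bucket_size elements of dtype_bytes,
    # scaled by the ~4.5x factor quoted above
    return bucket_size * dtype_bytes * 2 * overlap_factor / 1e9

print(comm_buffer_gb(5e8))  # 9.0 -> ~9 GB, as stated above
print(comm_buffer_gb(2e8))  # 3.6 -> ~3.6 GB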

deepspeed==0.4.4 added a round_robin_gradients option that parallelizes the gradient copy to CPU during offload; its benefit grows as the number of gradient accumulation steps or the number of GPUs increases.

About auto

Notice that many of the parameters above are set to auto. Because DeepSpeed is integrated into the HuggingFace Transformers framework, many DeepSpeed parameters mirror settings of the Transformers Trainer, for example "optimizer" and "scheduler". The official recommendation is therefore to set these common training parameters to auto, so that when training with the Trainer they are automatically filled in from the Trainer's settings or computed for you.

Of course, you can also set them yourself, but then make sure they match the Trainer's settings, because if they are inconsistent DeepSpeed may still run normally and not report an error right away.

In most cases you only need to pay attention to the DeepSpeed-specific parameters (e.g., offload); for parameters that duplicate Trainer settings, it is strongly recommended to use auto. For the meaning of each parameter and how to choose its value, refer to the detailed introduction on the official website.

All in all, thanks to the auto settings, the config above fits most stage-2 use cases within the Transformers framework.
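
If you do want to override one of the trade-off knobs while keeping the auto fields, one convenient pattern (a sketch; the filename ds_config_zero2.json is an assumption) is to load the JSON, patch it in Python, and pass the resulting dict to TrainingArguments as shown earlier:

import json

from transformers import TrainingArguments

with open("ds_config_zero2.json") as f:   # the stage-2 config shown above, saved locally
    ds_config = json.load(f)

# shrink the communication buckets, e.g. for a GPU with 8 GB or less (see the bucket-size discussion above)
ds_config["zero_optimization"]["allgather_bucket_size"] = 2e8
ds_config["zero_optimization"]["reduce_bucket_size"] = 2e8

args = TrainingArguments(output_dir="out", deepspeed=ds_config)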

2.2 ZeRO Stage 3

{
	"bfloat16": {
		"enabled": false
	},
	"fp16": {
		"enabled": "auto",
		"loss_scale": 0,
		"loss_scale_window": 1000,
		"initial_scale_power": 16,
		"hysteresis": 2,
		"min_loss_scale": 1
	},
	"optimizer": {
		"type": "AdamW",
		"params": {
			"lr": "auto",
			"betas": "auto",
			"eps": "auto",
			"weight_decay": "auto"
		}
	},
	"scheduler": {
		"type": "WarmupLR",
		"params": {
			"warmup_min_lr": "auto",
			"warmup_max_lr": "auto",
			"warmup_num_steps": "auto"
		}
	},
	"zero_optimization": {
		"stage": 3,
		"offload_optimizer": {
			"device": "cpu",
			"pin_memory": true
		},
		"offload_param": {
			"device": "cpu",
			"pin_memory": true
		},
		"overlap_comm": true,
		"contiguous_gradients": true,
		"sub_group_size": 1e9,
		"reduce_bucket_size": "auto",
		"stage3_prefetch_bucket_size": "auto",
		"stage3_param_persistence_threshold": "auto",
		"stage3_max_live_parameters": 1e9,
		"stage3_max_reuse_distance": 1e9,
		"stage3_gather_fp16_weights_on_model_save": true
	},
	"gradient_accumulation_steps": "auto",
	"gradient_clipping": "auto",
	"steps_per_print": 1e5,
	"train_batch_size": "auto",
	"train_micro_batch_size_per_gpu": "auto",
	"wall_clock_breakdown": false
}

As you can see, in addition to the offload_optimizer setting it shares with stage 2, stage 3 also has an offload_param setting: the model parameters themselves are partitioned (and can be offloaded).

  • stage3_max_live_parameters: an upper bound on the number of full (un-partitioned) parameters kept on each GPU at any time.
  • stage3_max_reuse_distance: a measure of how soon a parameter will be used again, which decides whether to discard or keep it. If a parameter will be reused in the near future (within stage3_max_reuse_distance), it is kept to reduce communication overhead. This is especially useful with activation checkpointing.
  • If you encounter OOM, reduce stage3_max_live_parameters and stage3_max_reuse_distance. Their performance impact should be minimal unless activation checkpointing is used. A value of 1e9 consumes roughly 2 GB; the memory is shared between the two settings, so the total is not additive, about 2 GB overall (see the quick check after this list).
  • stage3_gather_16bit_weights_on_model_save: consolidates the fp16 weights when the model is saved. With large models and many GPUs this is expensive in both memory and speed. It is currently required if you plan to resume training; a future update may remove this limitation.
  • sub_group_size: controls the granularity of parameter updates during the optimizer step. Parameters are grouped into buckets of sub_group_size elements, and each bucket is updated one at a time. When used with NVMe offload in ZeRO-Infinity, sub_group_size also controls the granularity at which model states are moved between NVMe and CPU memory during the optimizer step, which prevents very large models from exhausting CPU memory. Leave it at the default when not using NVMe offload; decrease sub_group_size if you hit OOM during the optimizer step; increase it if the optimizer step is slow.
  • The allgather_partitions, allgather_bucket_size, and reduce_scatter configuration parameters are not used in ZeRO-3.
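
A quick check of the "1e9 ≈ 2 GB" figure quoted above, assuming the live parameters are held in fp16 (2 bytes per element); the two thresholds draw from the same pool, so the total stays around 2 GB:

def live_params_gb(num_elements: float, dtype_bytes: int = 2) -> float:
    """Approximate GPU memory held by un-partitioned ("live") parameters in ZeRO-3."""
    return num_elements * dtype_bytes / 1e9

print(live_params_gb(1e9))  # 2.0 -> ~2 GB shared by stage3_max_live_parameters / stage3_max_reuse_distance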

Likewise, many of these values control the memory usage and training efficiency of stage 3 (e.g., sub_group_size), while some can simply be set to auto and left for the Trainer to determine (e.g., reduce_bucket_size, stage3_prefetch_bucket_size, stage3_param_persistence_threshold).

For a detailed description of these parameters and the trade-offs involved, see the official ZeRO-3 Config documentation.

For the same reason, the config file above fits most use cases. Some stage-3-specific parameters may need extra attention; reading the official documentation is recommended.

2.3  ZeRO Infinity

In addition to stage 2 and stage 3, here is a brief introduction to ZeRO-Infinity.

ZeRO-Infinity can be regarded as an advanced version of stage 3 that relies on NVMe support. It can offload all model and optimizer states to CPU memory and NVMe storage. Thanks to the NVMe protocol, ZeRO can use SSDs in addition to CPU memory, which greatly reduces the GPU memory overhead.

Detailed descriptions of ZeRO-Infinity from the official documentation:

DeepSpeed official tutorial:
ZeRO-Infinity has all of the savings of ZeRO-Offload, plus is able to offload more the model weights and has more effective bandwidth utilization and overlapping of computation and communication.
HuggingFace documentation:
It allows for training incredibly large models by extending GPU and CPU memory with NVMe memory. Thanks to smart partitioning and tiling algorithms each GPU needs to send and receive very small amounts of data during offloading so modern NVMe proved to be fit to allow for an even larger total memory pool available to your training process. ZeRO-Infinity requires ZeRO-3 enabled.

NVMe Support

  • ZeRO-Infinity requires ZeRO-3
  • ZeRO-3 is much slower than ZeRO-2. The following adjustments can bring ZeRO-3's speed closer to ZeRO-2's (see the config sketch after this list):
    • Set stage3_param_persistence_threshold to a very large value, e.g. 6 * hidden_size * hidden_size
    • Turn off offload_param (this can greatly improve performance)
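
As a hedged sketch (not a tested recipe) of what ZeRO-Infinity adds, the stage-3 offload sections can point at NVMe instead of CPU. The path /local_nvme and the aio values are illustrative placeholders borrowed from typical examples, and the comments restate the two speed tips above. It is expressed as a Python dict so it can be passed to TrainingArguments(deepspeed=...) like the earlier examples.

ds_config_zero_infinity = {
    "zero_optimization": {
        "stage": 3,  # ZeRO-Infinity requires ZeRO-3
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
        # Speed tips from the list above, when GPU memory allows:
        #   - set "stage3_param_persistence_threshold" very large (e.g. 6 * hidden_size * hidden_size)
        #   - or drop the "offload_param" entry entirely
    },
    "aio": {  # asynchronous I/O tuning for NVMe; placeholder values
        "block_size": 262144,
        "queue_depth": 32,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
}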

2.4 How to choose among the ZeRO stages and offloads

  • From left to right, slower and slower
    Stage 0 (DDP) > Stage 1 > Stage 2 > Stage 2 + offload > Stage 3 > Stage 3 + offloads
  • From left to right, the required GPU memory is getting less and less
    Stage 0 (DDP) < Stage 1 < Stage 2 < Stage 2 + offload < Stage 3 < Stage 3 + offloads

3. Adjustment steps

  1. Set batch_size to 1; an arbitrary effective batch_size can then be reached through gradient accumulation (see the Trainer sketch after this list).
  2. If OOM, enable gradient checkpointing: --gradient_checkpointing 1 (HF Trainer), or model.gradient_checkpointing_enable().
  3. If still OOM, try ZeRO stage 2.
  4. If still OOM, try ZeRO stage 2 plus offload_optimizer.
  5. If still OOM, try ZeRO stage 3.
  6. If still OOM, offload_param to the CPU.
  7. If still OOM, offload_optimizer to the CPU.
  8. If still OOM, lower some defaults, e.g. reduce the beam-search width when using generate.
  9. If still OOM, use mixed-precision training: bf16 on Ampere GPUs, fp16 on older GPUs.
  10. If still OOM, use ZeRO-Infinity: offload_param and offload_optimizer to NVMe.
  11. Once batch_size=1 runs without OOM, measure the effective throughput, then increase batch_size as much as possible.
  12. Finally, tune the parameters: try turning offload off or lowering the ZeRO stage, adjust the batch_size, and keep measuring throughput until performance is satisfactory (tuning can improve performance by about 66%).
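
Steps 1 and 2 above map directly onto Trainer arguments; a brief sketch (the argument values are illustrative):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,    # step 1: micro-batch of 1
    gradient_accumulation_steps=32,   # effective batch size = 1 x 32 x number_of_gpus
    gradient_checkpointing=True,      # step 2: trade compute for memory
    deepspeed="ds_config.json",       # steps 3-7: move through the ZeRO stage 2 / 3 configs
)
# Outside the Trainer, the equivalent of step 2 is: model.gradient_checkpointing_enable()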

Some other suggestions

  1. If training a model from scratch, the hidden size should preferably be divisible by 16.
  2. The batch size should preferably be divisible by 2.

4. Optimizer and Scheduler

  • When offload_optimizer is not enabled, the HF and DS optimizers and schedulers can be mixed according to the following table, except for the combination of an HF scheduler with a DS optimizer:

Combos        | HF Scheduler | DS Scheduler
HF Optimizer  | Yes          | Yes
DS Optimizer  | No           | Yes

4.1 Optimizer

  • A non-DeepSpeed optimizer can be used when offload_optimizer is enabled, as long as it has both CPU and GPU implementations (except LAMB).
  • The main DeepSpeed optimizers are Adam, AdamW, OneBitAdam, and Lamb. These have been thoroughly tested with ZeRO and are recommended.
  • If no optimizer is configured in the config file, the Trainer will automatically set it to AdamW and use the default values of the command-line parameters: --learning_rate, --adam_beta1, --adam_beta2, --adam_epsilon, and --weight_decay.
  • Similar to AdamW, other officially supported optimizers can be configured. Keep in mind that they may have different configuration values. For Adam, for example, weight_decay needs to be set to around 0.01.
  • Additionally, offload works best with DeepSpeed's CPU Adam optimizer. If you want to use a different optimizer with offload, since deepspeed==0.8.3 you also need to add the following to the config:
{
    "zero_force_ds_cpu_optimizer": false
}

4.2 Scheduler

  • DeepSpeed supports the LRRangeTest, OneCycle, WarmupLR, and WarmupDecayLR learning-rate schedulers.
  • Correspondence between the Transformers and DeepSpeed schedulers (a config sketch follows):
    WarmupLR corresponds to --lr_scheduler_type constant_with_warmup
    WarmupDecayLR corresponds to --lr_scheduler_type linear
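
For reference, a sketch of the scheduler block for WarmupDecayLR (the DeepSpeed counterpart of --lr_scheduler_type linear); the numbers are placeholders, and under the HF Trainer they can simply be left as "auto":

scheduler_config = {
    "type": "WarmupDecayLR",
    "params": {
        "warmup_min_lr": 0,        # placeholders; "auto" also works with the HF Trainer
        "warmup_max_lr": 3e-5,
        "warmup_num_steps": 500,
        "total_num_steps": 10000,  # WarmupDecayLR additionally needs the total number of steps
    },
}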

5. Training Accuracy

  • Since fp16 mixed precision greatly reduces memory requirements and speeds up training, consider training without mixed precision only if the model does not do well in this mode. Typically this happens when the model was not pretrained in fp16 mixed precision (for example, a model pretrained in bf16); such a model may overflow, causing the loss to become NaN. If this is the case, use full fp32 mode.
  • On GPUs based on the Ampere architecture, PyTorch 1.7 and later automatically uses the more efficient TF32 format for some operations, while the results are still returned in fp32.
  • With the Trainer, TF32 can be enabled with --tf32 or disabled with --tf32 0 or --no_tf32; by default the PyTorch default is used.

Automatic Mixed Precision

  • fp16
    • You can use either the PyTorch-style AMP backend or the Apex-style backend.
    • This mode is enabled with the --fp16 --fp16_backend amp or --fp16_full_eval command-line arguments.
  • bf16
    • This mode is enabled with the --bf16 or --bf16_full_eval command-line arguments.

NCCL

  • Communication collectives use a separate data type.
  • By default, half-precision training uses fp16 for reduction operations.
  • At the cost of a small overhead, you can make reductions accumulate in fp32 instead:
{
    "communication_data_type": "fp32"
}

apex

  • Apex is a library for accelerating training and improving performance under the PyTorch deep learning framework. Apex provides functions such as mixed precision training, distributed training, and memory optimization to help users increase training speed, expand training scale, and optimize GPU resource utilization.
  • This mode is enabled with the --fp16, --fp16_backend apex, and --fp16_opt_level O1 command-line arguments:
"amp": {
     "enabled": "auto",
     "opt_level": "auto"
}

6. Obtain model parameters

  • DeepSpeed stores the master fp32 weights of the model inside the optimizer states, in files of the form global_step*/*optim_states.pt. So if you only want to resume training from a checkpoint, you can simply keep the defaults.
  • If the model is saved under ZeRO-2, the model weights are stored in fp16 in pytorch_model.bin.
  • If the model is saved under ZeRO-3, the following parameter must be set, otherwise pytorch_model.bin will not be created:
{
  "zero_optimization": {
         "stage3_gather_16bit_weights_on_model_save": true
    }
}
  • Online (in-memory) fp32 weight recovery needs a lot of RAM; details are omitted here, but see the sketch below.
  • Getting the fp32 weights offline, with the zero_to_fp32.py script found in the checkpoint directory:
python zero_to_fp32.py . pytorch_model.bin
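
A minimal sketch of the in-memory ("online") recovery path mentioned above, using DeepSpeed's zero_to_fp32 helpers; the checkpoint path is an assumption and should point at a directory containing the global_step* folder:

from deepspeed.utils.zero_to_fp32 import (
    get_fp32_state_dict_from_zero_checkpoint,  # builds a CPU fp32 state_dict (needs a lot of RAM)
    load_state_dict_from_zero_checkpoint,      # loads the fp32 weights back into an existing model
)

checkpoint_dir = "out/checkpoint-100"  # illustrative path from a Trainer run
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
# or, to keep working with the model object directly:
# model = load_state_dict_from_zero_checkpoint(model, checkpoint_dir)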

7. Model inference

Besides training, sometimes the model is so large that even inference can exhaust GPU memory.

ZeRO inference uses the same configuration as ZeRO-3 training. You just don't need the optimizer and scheduler parts. In fact, you can leave them in the config file if you want to share the same config file as the training. They will be ignored.

Specific reference: ZeRO-Inference

For inference, only ZeRO-3 makes sense, because it is the stage that shards the parameters:

deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json

DeepSpeed-Inference is also available now; see the HuggingFace post on ultra-fast BLOOM model inference with DeepSpeed and Accelerate.

8. Memory estimation

As emphasized repeatedly above, one of the difficulties in using DeepSpeed lies in the trade-off between time and memory.

Offloading more parameters to the CPU reduces GPU memory usage, but it also greatly increases the time overhead.

DeepSpeed provides simple code for estimating the memory requirements:

from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

## specify the model you want to train on your device
model = AutoModel.from_pretrained("t5-large")
## estimate the memory cost (both CPU and GPU)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)

Taking T5-large on a single GPU as an example, the estimator reports the per-GPU and per-CPU memory required for each ZeRO configuration (the output table is not reproduced here).

According to that estimate, without stage 2 or stage 3 (the bottom two rows of the table), training T5-large requires a GPU with at least 12.49 GB of memory (in practice, considering caches, activations, and your batch_size, it may actually take a 24 GB card). With stage 2 and then stage 3, the GPU memory requirement drops dramatically, but CPU memory consumption rises significantly and training time increases accordingly.

Suggestion:
Before using DeepSpeed, run the code above to roughly estimate memory consumption and decide how many GPUs and which ZeRO stage to use.

The principle: if you can train directly on multiple GPUs, don't use ZeRO; if ZeRO-2 is enough, don't use ZeRO-3.

See the official website for details: Memory Requirements

9. Other Supplements

First, stage 2 with only the optimizer offloaded to the CPU. Here is a comparison of GPU memory usage and training speed before and after enabling it:

  • GPU memory: 20513 MiB => 17349 MiB
  • Training speed (estimated from tqdm): 1.3 it/s => 0.77 it/s

GPU memory usage clearly drops, but training also slows down. In the author's particular setup, DeepSpeed has therefore not brought any benefit.

The author's machine has a 24000 MiB GPU; with batch_size 2 the run already uses 20513 MiB. DeepSpeed only frees up about 3000 MiB, which is still not enough to raise the batch_size, so the total training time simply becomes longer.

DeepSpeed is therefore mainly worthwhile when GPU memory is extremely tight (i.e., the model cannot even run with batch_size == 1), or when the memory it saves is just enough to allow a larger batch_size. Otherwise, as in the author's case, it only adds time overhead without bringing other benefits.


The author later also tried stage 3, but it was extremely slow: a training run that originally took 6 hours had, with DeepSpeed stage 3, been running for two days and two nights with no sign of finishing, so the test was terminated.

In addition, when using DeepSpeed stage 2, because the model states are spread across devices, no progress output appears in the console (even though the GPUs keep running at 100% utilization), so you cannot tell whether the program is making progress. This is quite unfriendly to the user.

Some frequently asked questions

Because DeepSpeed relieves GPU memory pressure by occupying CPU memory, when system CPU memory is insufficient the DeepSpeed process gets killed by the operating system, which shows up as DeepSpeed failing to start without reporting any error. It is recommended to first estimate the CPU memory requirement with the estimator described above, then check the machine's free CPU memory with free -h to decide whether DeepSpeed can be used.

It is also possible for the loss to become NaN due to training-precision issues. See: Troubleshooting.

  • The process is killed at startup without printing a traceback: insufficient CPU memory.
  • The loss is NaN: the model was pretrained in bf16 but is being trained or used in fp16. This often happens with models Google pretrained on TPUs, such as T5. In this case, use fp32 or bf16.

