0x0. Preamble
This article explores the process of training the GPT2 model based on the Megatron-related examples in the DeepSpeedExamples repository. It is planned in 3 parts. The first part is how to train the GPT2 model with the original Megatron. The second part is how to train Megatron GPT2 with the features of DeepSpeed. Due to length, this article only covers the first part, and carefully records the problems encountered while training Megatron GPT2 and how to solve them. This article is based on the codebase at https://github.com/microsoft/DeepSpeedExamples/tree/bdf8e59aede8c8e0577e8d4d557298ca8515268f.
0x1. Training GPT2 with Megatron on a single card
First read the README at https://github.com/microsoft/DeepSpeedExamples/tree/bdf8e59aede8c8e0577e8d4d557298ca8515268f/Megatron-LM. The BERT part is not the focus here; the goal is to get GPT2 training and inference running.
As the README first mentions, Megatron is a large, powerful Transformer, and this codebase is used for ongoing research on large Transformer language models. Currently, Megatron supports model-parallel, multi-node training of GPT2 and BERT with mixed precision. The Megatron codebase can efficiently train a 72-layer, 8.3-billion-parameter GPT2 language model on 512 GPUs with 8-way model parallelism and 64-way data parallelism. The authors found that this larger language model (the 8.3-billion-parameter GPT2 above) was able to surpass the then-current GPT2-1.5B wikitext perplexities in only 5 training epochs.
Dependency installation
First, go to the Megatron-LM directory and install the dependencies with pip install -r requirements.txt. Note that requirements.txt depends on TensorFlow, which is only needed for BERT training. I don't care about that here, so I won't install TensorFlow. The content of requirements.txt is as follows:
nltk>=3.4
numpy>=1.15.4
pandas>=0.24.0
sentencepiece>=0.1.8
# tensorflow>=1.12.0
boto3==1.11.11
regex==2020.1.8
When installing, an error will be reported:
ERROR: Could not find a version that satisfies the requirement boto3==1.11.11 (from versions: none)
ERROR: No matching distribution found for boto3==1.11.11
I installed the latest version directly with pip install boto3.
Then follow the tutorial and execute bash scripts/pretrain_gpt2.sh. This produces a PyTorch error:
ModuleNotFoundError: No module named 'torch._six'
This error is caused by a change between PyTorch versions. After searching, I found that I only need to change the line from torch._six import inf to from torch import inf. Continuing, the next error is: AssertionError: make sure to set PATH for wikipedia data_utils/corpora.py. This is because scripts/pretrain_gpt2.sh specifies wikipedia as the training dataset, so we need to point PATH = 'data/wikipedia/wikidump_lines.json' in DeepSpeedExamples/Megatron-LM/data_utils/corpora.py to the wikipedia data we downloaded locally.
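For the torch._six error above, a version-tolerant import is one option when the same code has to run on several PyTorch versions. The following is a minimal sketch and not part of the Megatron code; the final fallback to math.inf is an assumption added here so that the snippet also runs where PyTorch is not installed.

```python
# Try the old private location first, then the public one.
try:
    from torch._six import inf  # older PyTorch releases
except ImportError:
    try:
        from torch import inf  # newer PyTorch releases
    except ImportError:
        # Assumption for this sketch only: fall back to the standard library.
        from math import inf

# Whichever import succeeded, inf behaves as floating-point infinity.
print(inf > 1e308)
```

ModuleNotFoundError is a subclass of ImportError, so a single except clause covers both a missing torch._six module and a missing inf name.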
Prepare training data
When downloading the data, I found that the wikipedia data is too big, so I switched to the webtext dataset. The Megatron README introduces this dataset as follows:
"We utilize the publicly available OpenWebText library (https://github.com/eukaryote31/openwebtext) developed by jcpeterson (https://github.com/jcpeterson/openwebtext) and eukaryote31 (https://github.com/eukaryote31/openwebtext) to download URLs. We then filtered, cleaned, and deduplicated all downloaded content according to the process described in our openwebtext directory. For the content corresponding to Reddit URLs up to October 2018, we arrived at about 37GB of content." 37GB is still too big just to get training running, so I only downloaded the first of the dozens of URL files.
Then copy this file to the openwebtext directory of Megatron-LM.
Next, follow the README of openwebtext to start executing.
pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract
git clone https://github.com/mattilyra/LSH
cd LSH
python setup.py install
Installing LSH ran into problems caused by Python version incompatibility:
lsh/cMinhash.cpp:19292:21: error: ‘PyThreadState’ {aka ‘struct _ts’} has no member named ‘exc_type’; did you mean ‘curexc_type’?
19292 | *type = tstate->exc_type;
This problem can be solved by replacing exc_type with curexc_type.
lsh/cMinhash.cpp:17704:26: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17704 | __pyx_type___pyx_array.tp_print = 0;
This problem can be solved by replacing tp_print with tp_vectorcall_offset.
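The two replacements above can also be applied with a small script instead of editing the generated file by hand. This is only a sketch: the function name patch_cython_source is hypothetical, and a blind textual replacement may also touch unrelated identifiers with the same name, so the resulting diff should be reviewed before rebuilding.

```python
import re


def patch_cython_source(source: str) -> str:
    """Apply the two replacements described above to the generated C++ text."""
    # \b prevents an existing 'curexc_type' from being rewritten a second time.
    source = re.sub(r"\bexc_type\b", "curexc_type", source)
    source = re.sub(r"\btp_print\b", "tp_vectorcall_offset", source)
    return source


# The two offending statements from the compiler output, used as sample input.
sample = "*type = tstate->exc_type;\n__pyx_type___pyx_array.tp_print = 0;\n"
print(patch_cython_source(sample))
```

To patch the real file, read lsh/cMinhash.cpp into a string, pass it through the function, and write the result back before running python setup.py install again.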
Next, execute the command to deduplicate the URL:
python3 blacklist_urls.py RS_2011-01.bz2.deduped.txt clean_urls.txt
I found that clean_urls.txt is empty after executing this command. After looking at the code, I found that the script requires that the deduplicated url file must be in a directory, and pass the path of this directory to the script.
Therefore, create a new urls directory under the current folder and put the URL file into it.
Then execute python3 blacklist_urls.py urls clean_urls.txt to complete deduplication. Next, use https://github.com/eukaryote31/openwebtext/blob/master/download.py to download the text corresponding to the deduplicated URLs.
Downloading everything takes a long time, so I only downloaded the data for 50 URLs as a demonstration. To save the data of each downloaded URL as a json file, you need to modify the default values of --sqlite_meta and --save_uncompressed in download.py, changing them to False and True respectively. After running python3 openwebtext/download.py clean_urls.txt, a scraped folder is generated, and the downloaded text of each URL is saved in its data subfolder.
Then we use the following script (merge_jsons.py) to merge all the txt files in that folder into a single json file, where each line holds the corresponding content in a text field:
import glob
import sys
import json
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, default=".",
                        help="path where all the json files are located")
    parser.add_argument("--output_file", type=str, default="merged_output.json",
                        help="filename where the merged json should go")
    args = parser.parse_args()
    data_path = args.data_path
    out_file = args.output_file
    # Collect every downloaded .txt file in the data directory.
    text_files = glob.glob(data_path + '/*.txt')
    counter = 0
    with open(out_file, 'w') as outfile:
        for fname in text_files:
            counter += 1
            if counter % 1024 == 0:
                print("Merging at ", counter, flush=True)
            with open(fname, 'r') as infile:
                for row in infile:
                    # Write one JSON object per input line, under the "text" key.
                    tmp = {}
                    tmp['text'] = row
                    outfile.write(json.dumps(tmp))
                    outfile.write('\n')
    print("Merged file", out_file, flush=True)
Execute this script to get merged_output.json: python3 merge_jsons.py --data_path DeepSpeedExamples/Megatron-LM/openwebtext/scraped/data.
Next, we execute cleanup_dataset.py under the openwebtext folder to delete all texts with fewer than 128 tokens: python3 cleanup_dataset.py merged_output.json merged_cleand.json.
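The length filter can be illustrated with a small stand-in. This is a sketch of the idea only, not the logic of cleanup_dataset.py: the function name keep_long_documents is hypothetical, and whitespace splitting is used as an assumed substitute for the real tokenizer.

```python
import json


def keep_long_documents(json_lines, min_tokens=128, tokenize=str.split):
    """Yield the JSON lines whose 'text' field has at least min_tokens tokens."""
    for line in json_lines:
        record = json.loads(line)
        # Whitespace tokens are an assumption; swap in a real tokenizer if needed.
        if len(tokenize(record["text"])) >= min_tokens:
            yield line


short_doc = json.dumps({"text": "too short"})
long_doc = json.dumps({"text": " ".join(["word"] * 200)})
kept = list(keep_long_documents([short_doc, long_doc]))
print(len(kept))
```

Passing the tokenizer as a parameter keeps the filter independent of any particular tokenization library.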
Detailed training process and pitfalls
After the data is ready, change --train-data in DeepSpeedExamples/Megatron-LM/scripts/pretrain_gpt2.sh to webtext. Also set the path of the webtext class in DeepSpeedExamples/Megatron-LM/data_utils/corpora.py to the merged_cleand.json we just obtained.
In addition, since I only use a few dozen samples here to demonstrate the training process, I also need to change the --split parameter in DeepSpeedExamples/Megatron-LM/scripts/pretrain_gpt2.sh to 400,300,300, that is, a 4:3:3 ratio between the training, validation, and test sets, so that the test set does not end up with 0 samples.
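To see why a heavily skewed split can leave a subset empty on a tiny dataset, the following sketch computes subset sizes from a split string. It assumes that each subset receives the truncated share of the samples; the real data_utils code may round differently. The function name split_sizes, the sample count of 50, and the skewed ratio 949,50,1 are illustrative choices.

```python
def split_sizes(num_samples, split):
    """Return the number of samples per subset for a split string like '400,300,300'."""
    parts = [int(p) for p in split.split(",")]
    total = sum(parts)
    # Integer arithmetic: each subset gets floor(num_samples * part / total).
    return [num_samples * p // total for p in parts]


print(split_sizes(50, "949,50,1"))     # the last subset is truncated to 0
print(split_sizes(50, "400,300,300"))  # every subset is non-empty
```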
Then you can start training with bash scripts/pretrain_gpt2.sh. Here is part of the training log:
Setting ds_accelerator to cuda (auto detect)
using world size: 1 and model-parallel size: 1
> using dynamic loss scaling
> initializing model parallel with size 1
Pretrain GPT2 model
arguments:
pretrained_bert .............. False
attention_dropout ............ 0.1
num_attention_heads .......... 16
hidden_size .................. 1024
intermediate_size ............ None
num_layers ................... 24
layernorm_epsilon ............ 1e-05
hidden_dropout ............... 0.1
max_position_embeddings ...... 1024
vocab_size ................... 30522
deep_init .................... False
make_vocab_size_divisible_by . 128
cpu_optimizer ................ False
cpu_torch_adam ............... False
fp16 ......................... True
fp32_embedding ............... False
fp32_layernorm ............... False
fp32_tokentypes .............. False
fp32_allreduce ............... False
hysteresis ................... 2
loss_scale ................... None
loss_scale_window ............ 1000
min_scale .................... 1
batch_size ................... 8
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
deepspeed_activation_checkpointing False
clip_grad .................... 1.0
train_iters .................. 320000
log_interval ................. 100
exit_interval ................ None
seed ......................... 1234
reset_position_ids ........... False
reset_attention_mask ......... False
lr_decay_iters ............... None
lr_decay_style ............... cosine
lr ........................... 0.00015
warmup ....................... 0.01
save ......................... checkpoints/gpt2_345m
save_interval ................ 5000
no_save_optim ................ False
no_save_rng .................. False
load ......................... checkpoints/gpt2_345m
no_load_optim ................ False
no_load_rng .................. False
finetune ..................... False
resume_dataloader ............ True
distributed_backend .......... nccl
local_rank ................... None
eval_batch_size .............. None
eval_iters ................... 100
eval_interval ................ 1000
eval_seq_length .............. None
eval_max_preds_per_seq ....... None
overlapping_eval ............. 32
cloze_eval ................... False
eval_hf ...................... False
load_openai .................. False
temperature .................. 1.0
top_p ........................ 0.0
top_k ........................ 0
out_seq_length ............... 256
model_parallel_size .......... 1
shuffle ...................... False
train_data ................... ['webtext']
use_npy_data_loader .......... False
train_data_path ..............
val_data_path ................
test_data_path ...............
input_data_sizes_file ........ sizes.txt
delim ........................ ,
text_key ..................... sentence
eval_text_key ................ None
valid_data ................... None
split ........................ 400,300,300
test_data .................... None
lazy_loader .................. True
loose_json ................... False
presplit_sentences ........... False
num_workers .................. 2
tokenizer_model_type ......... bert-large-uncased
tokenizer_path ............... tokenizer.model
tokenizer_type ............... GPT2BPETokenizer
cache_dir .................... cache
use_tfrecords ................ False
seq_length ................... 1024
max_preds_per_seq ............ None
deepspeed .................... False
deepspeed_config ............. None
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ False
cuda ......................... True
rank ......................... 0
world_size ................... 1
dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> found end-of-document token: 50256
building GPT2 model ...
> number of parameters on model parallel rank 0: 354871296
Optimizer = FusedAdam
learning rate decaying cosine
WARNING: could not find the metadata file checkpoints/gpt2_345m/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
Partition Activations False and Correctness Check False
iteration 100/ 320000 | elapsed time per iteration (ms): 963.3 | learning rate 3.937E-06 | lm loss 8.995377E+00 | loss scale 131072.0 |
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:424: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
warnings.warn(
after 100 iterations memory (MB) | allocated: 6784.88427734375 | max allocated: 11927.470703125 | cached: 13826.0 | max cached: 13826.0
time (ms) | forward: 276.11 | backward: 672.99 | allreduce: 13.96 | optimizer: 14.00 | batch generator: 5.22 | data loader: 4.53
iteration 200/ 320000 | elapsed time per iteration (ms): 950.6 | learning rate 8.625E-06 | lm loss 3.041360E+00 | loss scale 131072.0 |
time (ms) | forward: 259.24 | backward: 674.56 | allreduce: 13.45 | optimizer: 16.63 | batch generator: 0.78 | data loader: 0.14
From the nvidia-smi screenshot, you can also see that Megatron training is running on card 0.
The following StopIteration error may occur during training:
time (ms) | forward: 259.07 | backward: 671.87 | allreduce: 13.03 | optimizer: 16.64 | batch generator: 0.76 | data loader: 0.13
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:713 in <module> │
│ │
│ 710 │
│ 711 │
│ 712 if __name__ == "__main__": │
│ ❱ 713 │ main() │
│ 714 │
│ │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:686 in main │
│ │
│ 683 │ iteration = 0 │
│ 684 │ if args.train_iters > 0: │
│ 685 │ │ if args.do_train: │
│ ❱ 686 │ │ │ iteration, skipped = train(model, optimizer, │
│ 687 │ │ │ │ │ │ │ │ │ lr_scheduler, │
│ 688 │ │ │ │ │ │ │ │ │ train_data_iterator, │
│ 689 │ │ │ │ │ │ │ │ │ val_data_iterator, │
│ │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:415 in train │
│ │
│ 412 │ report_memory_flag = True │
│ 413 │ while iteration < args.train_iters: │
│ 414 │ │ │
│ ❱ 415 │ │ lm_loss, skipped_iter = train_step(train_data_iterator, │
│ 416 │ │ │ │ │ │ │ │ │ │ model, │
│ 417 │ │ │ │ │ │ │ │ │ │ optimizer, │
│ 418 │ │ │ │ │ │ │ │ │ │ lr_scheduler, │
│ │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:369 in train_step │
│ │
│ 366 │ │
│ 367 │ # Forward model for one step. │
│ 368 │ timers('forward').start() │
│ ❱ 369 │ lm_loss = forward_step(data_iterator, model, args, timers) │
│ 370 │ timers('forward').stop() │
│ 371 │ │
│ 372 │ #print_rank_0("loss is {}".format(lm_loss)) │
│ │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:286 in forward_step │
│ │
│ 283 │ │
│ 284 │ # Get the batch. │
│ 285 │ timers('batch generator').start() │
│ ❱ 286 │ tokens, labels, loss_mask, attention_mask, position_ids = get_batch( │
│ 287 │ │ data_iterator, args, timers) │
│ 288 │ timers('batch generator').stop() │
│ 289 │
│ │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:257 in get_batch │
│ │
│ 254 │ # Broadcast data. │
│ 255 │ timers('data loader').start() │
│ 256 │ if data_iterator is not None: │
│ ❱ 257 │ │ data = next(data_iterator) │
│ 258 │ else: │
│ 259 │ │ data = None │
│ 260 │ timers('data loader').stop() │
│ │
│ /home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/utils/data/dataloader.p │
│ y:633 in __next__ │
│ │
│ 630 │ │ │ if self._sampler_iter is None: │
│ 631 │ │ │ │ # TODO(https://github.com/pytorch/pytorch/issues/76750) │
│ 632 │ │ │ │ self._reset() # type: ignore[call-arg] │
│ ❱ 633 │ │ │ data = self._next_data() │
│ 634 │ │ │ self._num_yielded += 1 │
│ 635 │ │ │ if self._dataset_kind == _DatasetKind.Iterable and \ │
│ 636 │ │ │ │ │ self._IterableDataset_len_called is not None and \ │
│ │
│ /home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/utils/data/dataloader.p │
│ y:1318 in _next_data │
│ │
│ 1315 │ │ │ │ # no valid `self._rcvd_idx` is found (i.e., didn't break) │
│ 1316 │ │ │ │ if not self._persistent_workers: │
│ 1317 │ │ │ │ │ self._shutdown_workers() │
│ ❱ 1318 │ │ │ │ raise StopIteration │
│ 1319 │ │ │ │
│ 1320 │ │ │ # Now `self._rcvd_idx` is the batch index we want to fetch │
│ 1321 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
StopIteration
Don't worry, this error just means that there is not enough data to train for that many iters. The reason is that the sampler used when constructing the dataloader is torch.utils.data.SequentialSampler, which samples according to the length of the dataset and therefore has no connection to args.train_iters. As a result, after many iters the data has been read completely and a StopIteration error is raised.
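The behaviour can be reproduced without PyTorch. The sketch below is a plain-Python stand-in for a sequentially sampled dataloader, with hypothetical names. It also shows a cycling wrapper as one possible alternative; this article does not take that route and simply lowers the number of iters.

```python
def sequential_batches(dataset, batch_size):
    """Yield consecutive full batches once, in dataset order."""
    for start in range(0, len(dataset) - batch_size + 1, batch_size):
        yield dataset[start:start + batch_size]


def cycle(make_iterator):
    """Restart the iterator whenever it runs out (assumes it is non-empty)."""
    while True:
        yield from make_iterator()


dataset = list(range(20))

# A single pass holds only len(dataset) // batch_size batches; asking for
# more with next() raises StopIteration, as in the traceback above.
one_pass = sequential_batches(dataset, batch_size=8)
print(len(list(one_pass)))

# The wrapper starts a new pass instead of stopping.
endless = cycle(lambda: sequential_batches(dataset, batch_size=8))
batches = [next(endless) for _ in range(5)]
print(len(batches))
```

Cycling repeats the same samples in the same order, so on a dataset this small it only hides the symptom; reducing the number of iters keeps the run honest about how much data there is.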
Let's adjust the script: change the number of iters to 600 and set the checkpoint saving interval to 500, so that Megatron is guaranteed to save a checkpoint. Then run the script again.
0x2. Running inference on the trained GPT2 model with Megatron on a single card
In DeepSpeedExamples/Megatron-LM/scripts/generate_text.sh, change CHECKPOINT_PATH to the path of the model we trained, here DeepSpeedExamples/Megatron-LM/checkpoints/gpt2_345m, and then execute bash scripts/generate_text.sh in the root directory of Megatron. But an error was reported:
Setting ds_accelerator to cuda (auto detect)
Generate Samples
WARNING: No training data specified
using world size: 1 and model-parallel size: 1
> using dynamic loss scaling
> initializing model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
prepare tokenizer done
building GPT2 model ...
> number of parameters on model parallel rank 0: 354823168
global rank 0 is loading checkpoint /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/checkpoints/gpt2_345m/iter_0000600/mp_rank_00/model_optim_rng.pt
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/generate_samples.py:277 in <module> │
│ │
│ 274 │
│ 275 │
│ 276 if __name__ == "__main__": │
│ ❱ 277 │ main() │
│ 278 │
│ 279 │
│ 280 │
│ │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/generate_samples.py:267 in main │
│ │
│ 264 │ tokenizer = prepare_tokenizer(args) │
│ 265 │ │
│ 266 │ # Model, optimizer, and learning rate. │
│ ❱ 267 │ model = setup_model(args) │
│ 268 │ │
│ 269 │ #setting default batch size to 1 │
│ 270 │ args.batch_size = 1 │
│ │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/generate_samples.py:80 in setup_model │
│ │
│ 77 │ model = get_model(args) │
│ 78 │ │
│ 79 │ if args.load is not None: │
│ ❱ 80 │ │ _ = load_checkpoint( │
│ 81 │ │ │ model, None, None, args) │
│ 82 │ │
│ 83 │ return model │
│ │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/utils.py:305 in load_checkpoint │
│ │
│ 302 │ │ │
│ 303 │ │ # Model. │
│ 304 │ │ try: │
│ ❱ 305 │ │ │ model.load_state_dict(sd['model']) │
│ 306 │ │ except KeyError: │
│ 307 │ │ │ print_rank_0('A metadata file exists but unable to load model ' │
│ 308 │ │ │ │ │ │ 'from checkpoint {}, exiting'.format(checkpoint_name)) │
│ │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/model/distributed.py:90 in load_state_dict │
│ │
│ 87 │ │ return sd │
│ 88 │ │
│ 89 │ def load_state_dict(self, state_dict, strict=True): │
│ ❱ 90 │ │ self.module.load_state_dict(state_dict, strict=strict) │
│ 91 │ │
│ 92 │ ''' │
│ 93 │ def _sync_buffers(self): │
│ │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/fp16/fp16.py:71 in load_state_dict │
│ │
│ 68 │ │ return self.module.state_dict(destination, prefix, keep_vars) │
│ 69 │ │
│ 70 │ def load_state_dict(self, state_dict, strict=True): │
│ ❱ 71 │ │ self.module.load_state_dict(state_dict, strict=strict) │
│ 72 │
│ 73 # TODO: Update overflow check + downscale to use Carl's fused kernel. │
│ 74 class FP16_Optimizer(object): │
│ │
│ /home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py:20 │
│ 41 in load_state_dict │
│ │
│ 2038 │ │ │ │ │ │ ', '.join('"{}"'.format(k) for k in missing_keys))) │
│ 2039 │ │ │
│ 2040 │ │ if len(error_msgs) > 0: │
│ ❱ 2041 │ │ │ raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( │
│ 2042 │ │ │ │ │ │ │ self.__class__.__name__, "\n\t".join(error_msgs))) │
│ 2043 │ │ return _IncompatibleKeys(missing_keys, unexpected_keys) │
│ 2044 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Error(s) in loading state_dict for GPT2Model:
size mismatch for word_embeddings.weight: copying a param with shape torch.Size([50304, 1024]) from checkpoint, the shape in current model is
torch.Size([50257, 1024]).
You can see that the shape of word_embeddings.weight does not match when loading the model. Let's look at the definition of word_embeddings in GPT2:
So this problem should be caused by a different vocab_size (num_tokens) during training and testing. After tracking it down, it turns out that during training the number of tokens is padded so that it is divisible by args.make_vocab_size_divisible_by=128, but there is no such restriction during prediction, which leads to a mismatch in the embedding dimensions. Let's modify the num_token handling logic in DeepSpeedExamples/Megatron-LM/generate_samples.py to make it consistent with training.
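The padding rule can be written as a small helper. This is a sketch with a hypothetical function name; multiplying by the model-parallel size is an assumption, and it makes no difference here because the model-parallel size is 1. For a vocabulary of 50257 it gives 50304, that is, 47 dummy tokens, which agrees with the "padded vocab" line in the training log and with the checkpoint shape in the error message.

```python
def pad_vocab_size(num_tokens, divisible_by=128, model_parallel_size=1):
    """Round num_tokens up to the next multiple of divisible_by * model_parallel_size."""
    multiple = divisible_by * model_parallel_size
    return ((num_tokens + multiple - 1) // multiple) * multiple


padded = pad_vocab_size(50257)
# Prints the padded vocabulary size and the number of dummy tokens added.
print(padded, padded - 50257)
```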
Execute bash scripts/generate_text.sh again, and now we can talk to GPT2: enter a prompt and the model gives a completion; enter stop to end the dialogue.
Since the model here only uses very little data for demonstration, there is basically no good completion effect. Later, the amount of data can be increased to train a better GPT2 dialogue model.
0x3. Parameter count and GPU memory estimation
The article https://zhuanlan.zhihu.com/p/624740065 derives the parameter count and training memory usage of a Transformer with the GPT2 architecture. Here we apply the formulas summarized there to calculate the parameter count of our current GPT2 model and its theoretical GPU memory usage during training.
Parameter estimation
Applying the formula from that article with l=24 and h=hidden_size=1024: 12lh^2 = 12x24x1024x1024 = 301989888, about 0.3B. So the GPT2 model we train here has only about 0.3B parameters. The model name, 345M, also shows that this result is basically consistent with the real size.
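The gap between 0.3B and 345M comes from the terms that 12lh^2 leaves out. The sketch below adds them under the assumption of a standard GPT2 layout: biases and two layernorms per layer (13h), token and position embeddings, a final layernorm, and an output layer that shares the token embedding. The function name is hypothetical. With the padded vocabulary of 50304 the total is 354871296, which agrees with the parameter count printed in the training log.

```python
def gpt2_param_count(num_layers, hidden, vocab, max_positions):
    """Count parameters, assuming a standard GPT2 layout with tied output embeddings."""
    # Attention and MLP weights (12h^2) plus their biases and two layernorms (13h).
    per_layer = 12 * hidden**2 + 13 * hidden
    # Token embedding and learned position embedding.
    embeddings = vocab * hidden + max_positions * hidden
    # Weight and bias of the final layernorm.
    final_layernorm = 2 * hidden
    return num_layers * per_layer + embeddings + final_layernorm


print(12 * 24 * 1024**2)                        # dominant term only
print(gpt2_param_count(24, 1024, 50304, 1024))  # full count under the assumptions above
```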
Estimation of training GPU memory usage
According to the formula in that article, the memory used by model parameters, gradients, and optimizer state during training is about 301989888*20bytes=6039797760bytes=5898240KB=5760MB=5.6G. The memory occupied by activations is then computed as follows:
When we train, b=batch_size=8, s=1024, h=1024, a=num-attention-heads=16, l=24, so l(34bsh + 5bs^2a) = 22951231488 bytes = 21888 MiB, about 21G.
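Both estimates can be reproduced with a few lines of arithmetic. The 20 bytes per parameter and the per-layer activation expression 34bsh + 5bs^2a are taken from the text above; the function names are hypothetical, and the activation figure assumes that no activation checkpointing is used.

```python
def model_state_bytes(num_params, bytes_per_param=20):
    """Memory for parameters, gradients, and optimizer state."""
    return num_params * bytes_per_param


def activation_bytes(b, s, h, a, l):
    """Activation memory for l Transformer layers without checkpointing."""
    return l * (34 * b * s * h + 5 * b * s**2 * a)


GIB = 1024**3
state = model_state_bytes(12 * 24 * 1024**2)
act = activation_bytes(b=8, s=1024, h=1024, a=16, l=24)
# Each line prints the size in bytes followed by the size in GiB.
print(state, round(state / GIB, 3))
print(act, round(act / GIB, 3))
```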
Therefore, the training memory of the 0.3B GPT2 is about 5.6G+21G=26.6G. But in section 0x1 we saw that our card has 24G of memory and that training only consumes 15107MiB=14.75G, which means that the memory occupied by activations is not the 21G we calculated but 14.75-5.6=9.15G. Why?
This is because --checkpoint-activations is turned on in DeepSpeedExamples/Megatron-LM/scripts/pretrain_gpt2.sh, so activation checkpointing is performed. The relevant code is in DeepSpeedExamples/Megatron-LM/mpu/transformer.py:406-413.
It can be seen that, for each Transformer layer, the intermediate activations that Self-Attention and the MLP would otherwise have to keep for the backward pass no longer need to be stored, which reduces GPU memory usage.
0x4. Training the GPT2 model with Megatron on multiple cards
2-card data parallelism
The above completes training of the GPT2 model on a single card. Starting multi-card training is relatively simple. In DeepSpeedExamples/Megatron-LM/scripts/pretrain_gpt2_distributed.sh, change --train-data to webtext and change --train-iters to 600/num_gpus. This script launches data-parallel training, so setting the number of iters to 600/num_gpus scans the same amount of data as the single-card run. The split between the training, validation, and test sets also has to be changed, because the simulated data here is too small: with the original ratio, the test set would end up with 0 samples and an error would be reported. Finally, set GPUS_PER_NODE to 2, which means using 2 cards for data-parallel training. Then start training with bash scripts/pretrain_gpt2_distributed.sh; the log is as follows:
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
using world size: 2 and model-parallel size: 1
> using dynamic loss scaling
> initializing model parallel with size 1
Pretrain GPT2 model
arguments:
pretrained_bert .............. False
attention_dropout ............ 0.1
num_attention_heads .......... 16
hidden_size .................. 1024
intermediate_size ............ None
num_layers ................... 24
layernorm_epsilon ............ 1e-05
hidden_dropout ............... 0.1
max_position_embeddings ...... 1024
vocab_size ................... 30522
deep_init .................... False
make_vocab_size_divisible_by . 128
cpu_optimizer ................ False
cpu_torch_adam ............... False
fp16 ......................... True
fp32_embedding ............... False
fp32_layernorm ............... False
fp32_tokentypes .............. False
fp32_allreduce ............... False
hysteresis ................... 2
loss_scale ................... None
loss_scale_window ............ 1000
min_scale .................... 1
batch_size ................... 8
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
deepspeed_activation_checkpointing False
clip_grad .................... 1.0
train_iters .................. 300
log_interval ................. 100
exit_interval ................ None
seed ......................... 1234
reset_position_ids ........... False
reset_attention_mask ......... False
lr_decay_iters ............... None
lr_decay_style ............... cosine
lr ........................... 0.00015
warmup ....................... 0.01
save ......................... checkpoints/gpt2_345m
save_interval ................ 5000
no_save_optim ................ False
no_save_rng .................. False
load ......................... checkpoints/gpt2_345m
no_load_optim ................ False
no_load_rng .................. False
finetune ..................... False
resume_dataloader ............ True
distributed_backend .......... nccl
local_rank ................... 0
eval_batch_size .............. None
eval_iters ................... 100
eval_interval ................ 1000
eval_seq_length .............. None
eval_max_preds_per_seq ....... None
overlapping_eval ............. 32
cloze_eval ................... False
eval_hf ...................... False
load_openai .................. False
temperature .................. 1.0
top_p ........................ 0.0
top_k ........................ 0
out_seq_length ............... 256
model_parallel_size .......... 1
shuffle ...................... False
train_data ................... ['webtext']
use_npy_data_loader .......... False
train_data_path ..............
val_data_path ................
test_data_path ...............
input_data_sizes_file ........ sizes.txt
delim ........................ ,
text_key ..................... sentence
eval_text_key ................ None
valid_data ................... None
split ........................ 400,300,300
test_data .................... None
lazy_loader .................. True
loose_json ................... False
presplit_sentences ........... False
num_workers .................. 2
tokenizer_model_type ......... bert-large-uncased
tokenizer_path ............... tokenizer.model
tokenizer_type ............... GPT2BPETokenizer
cache_dir .................... cache
use_tfrecords ................ False
seq_length ................... 1024
max_preds_per_seq ............ None
deepspeed .................... False
deepspeed_config ............. None
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ False
cuda ......................... True
rank ......................... 0
world_size ................... 2
dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> found end-of-document token: 50256
building GPT2 model ...
> number of parameters on model parallel rank 0: 354871296
Optimizer = FusedAdam
Optimizer = FusedAdam
learning rate decaying cosine
WARNING: could not find the metadata file checkpoints/gpt2_345m/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
Partition Activations False and Correctness Check False
iteration 100/ 300 | elapsed time per iteration (ms): 1048.5 | learning rate 1.258E-04 | lm loss 4.799004E+00 | loss scale 32768.0 |
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:424: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:424: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
warnings.warn(
after 100 iterations memory (MB) | allocated: 6784.88427734375 | max allocated: 11927.470703125 | cached: 13826.0 | max cached: 13826.0
time (ms) | forward: 284.78 | backward: 749.95 | allreduce: 93.32 | optimizer: 13.60 | batch generator: 14.88 | data loader: 14.19
iteration 200/ 300 | elapsed time per iteration (ms): 1020.9 | learning rate 5.257E-05 | lm loss 7.708308E-02 | loss scale 32768.0 |
time (ms) | forward: 256.87 | backward: 747.37 | allreduce: 93.08 | optimizer: 16.52 | batch generator: 0.71 | data loader: 0.11
iteration 300/ 300 | elapsed time per iteration (ms): 1018.4 | learning rate 1.806E-06 | lm loss 4.669175E-03 | loss scale 32768.0 |
time (ms) | forward: 256.74 | backward: 744.96 | allreduce: 93.51 | optimizer: 16.53 | batch generator: 0.73 | data loader: 0.12
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
validation loss at the end of training for val data | LM loss: 1.170473E+01 | LM PPL: 1.211437E+05
----------------------------------------------------------------------------------------------------
global rank 0 is saving checkpoint at iteration 300 to checkpoints/gpt2_345m/iter_0000300/mp_rank_00/model_optim_rng.pt
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
successfully saved checkpoints/gpt2_345m/iter_0000300/mp_rank_00/model_optim_rng.pt
Evaluating iter 100/100
----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
validation loss at the end of training for test data | LM loss: 1.169765E+01 | LM PPL: 1.202885E+05
-----------------------------------------------------------------------------------------------------
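The validation lines in the log above report both an LM loss and a perplexity (PPL). Assuming the reported loss is the mean per-token cross-entropy in nats, the perplexity is simply its exponential. The small check below (the helper name `perplexity` is ours, not part of Megatron) reproduces the printed PPL values from the printed loss values:

```python
import math


def perplexity(lm_loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(lm_loss)


# Loss values copied from the validation and test lines of the log above.
val_ppl = perplexity(1.170473e01)
test_ppl = perplexity(1.169765e01)

print(f"val PPL:  {val_ppl:.4E}")
print(f"test PPL: {test_ppl:.4E}")
```

Because the log prints the loss with only seven significant digits, the recomputed values agree with the logged PPL only up to rounding.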
Screenshot of video memory usage:
Because this is data parallelism, each card holds a full copy of the model, so the memory usage of each card is about the same as in single-card training.
Inference with the model trained under data parallelism also runs normally:
2-card model parallelism
We use the script DeepSpeedExamples/Megatron-LM/scripts/pretrain_gpt2_model_parallel.sh to run model-parallel training on 2 cards. In addition to the same modifications made for 2-card data parallelism, we also need to remove the --deepspeed argument from this script, because using DeepSpeed also requires supplying a DeepSpeed config file. The DeepSpeed-related training features will be explored in the next article.
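With 2 cards, the difference between the data-parallel run and the model-parallel run comes down to how the two processes are grouped. The sketch below is a minimal illustration, assuming the convention that consecutive ranks form a model-parallel group and that ranks at the same position within their groups form a data-parallel group. The function name `build_parallel_groups` is hypothetical, and this is not the Megatron code itself.

```python
def build_parallel_groups(world_size: int, model_parallel_size: int):
    """Return (model_parallel_groups, data_parallel_groups) as lists of rank lists.

    Assumed convention: consecutive ranks share a model-parallel group.
    """
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by model_parallel_size")

    num_model_groups = world_size // model_parallel_size
    # Each model-parallel group jointly holds one copy of the model.
    model_groups = [
        list(range(g * model_parallel_size, (g + 1) * model_parallel_size))
        for g in range(num_model_groups)
    ]
    # Each data-parallel group holds the same shard and all-reduces its gradients.
    data_groups = [
        list(range(i, world_size, model_parallel_size))
        for i in range(model_parallel_size)
    ]
    return model_groups, data_groups


# 2 cards, model-parallel size 2: one model split across both ranks.
mp2_model, mp2_data = build_parallel_groups(world_size=2, model_parallel_size=2)
# 2 cards, model-parallel size 1: plain data parallelism.
dp2_model, dp2_data = build_parallel_groups(world_size=2, model_parallel_size=1)

print("model-parallel run:", mp2_model, mp2_data)
print("data-parallel run: ", dp2_model, dp2_data)
```

Under this convention, the model-parallel run has a data-parallel size of 1, so the two cards cooperate on a single model replica rather than each processing its own batch with a full copy.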
Run bash scripts/pretrain_gpt2_model_parallel.sh to start model-parallel training on 2 cards. The log is as follows:
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
using world size: 2 and model-parallel size: 2
> using dynamic loss scaling
> initializing model parallel with size 2
Pretrain GPT2 model
arguments:
pretrained_bert .............. False
attention_dropout ............ 0.1
num_attention_heads .......... 16
hidden_size .................. 1024
intermediate_size ............ None
num_layers ................... 24
layernorm_epsilon ............ 1e-05
hidden_dropout ............... 0.1
max_position_embeddings ...... 1024
vocab_size ................... 30522
deep_init .................... False
make_vocab_size_divisible_by . 128
cpu_optimizer ................ False
cpu_torch_adam ............... False
fp16 ......................... True
fp32_embedding ............... False
fp32_layernorm ............... False
fp32_tokentypes .............. False
fp32_allreduce ............... False
hysteresis ................... 2
loss_scale ................... None
loss_scale_window ............ 1000
min_scale .................... 1
batch_size ................... 8
weight_decay ................. 0.01
checkpoint_activations ....... True
checkpoint_num_layers ........ 1
deepspeed_activation_checkpointing False
clip_grad .................... 1.0
train_iters .................. 600
log_interval ................. 100
exit_interval ................ None
seed ......................... 1234
reset_position_ids ........... False
reset_attention_mask ......... False
lr_decay_iters ............... None
lr_decay_style ............... cosine
lr ........................... 0.00015
warmup ....................... 0.01
save ......................... checkpoints/gpt2_345m_mp2
save_interval ................ 5000
no_save_optim ................ False
no_save_rng .................. False
load ......................... checkpoints/gpt2_345m_mp2
no_load_optim ................ True
no_load_rng .................. False
finetune ..................... False
resume_dataloader ............ True
distributed_backend .......... nccl
local_rank ................... 0
eval_batch_size .............. None
eval_iters ................... 100
eval_interval ................ 1000
eval_seq_length .............. None
eval_max_preds_per_seq ....... None
overlapping_eval ............. 32
cloze_eval ................... False
eval_hf ...................... False
load_openai .................. False
temperature .................. 1.0
top_p ........................ 0.0
top_k ........................ 0
out_seq_length ............... 256
model_parallel_size .......... 2
shuffle ...................... False
train_data ................... ['webtext']
use_npy_data_loader .......... False
train_data_path ..............
val_data_path ................
test_data_path ...............
input_data_sizes_file ........ sizes.txt
delim ........................ ,
text_key ..................... sentence
eval_text_key ................ None
valid_data ................... None
split ........................ 400,300,300
test_data .................... None
lazy_loader .................. True
loose_json ................... False
presplit_sentences ........... False
num_workers .................. 2
tokenizer_model_type ......... bert-large-uncased
tokenizer_path ............... tokenizer.model
tokenizer_type ............... GPT2BPETokenizer
cache_dir .................... None
use_tfrecords ................ False
seq_length ................... 1024
max_preds_per_seq ............ None
deepspeed .................... False
deepspeed_config ............. None
deepscale .................... False
deepscale_config ............. None
deepspeed_mpi ................ False
cuda ......................... True
rank ......................... 0
world_size ................... 2
dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
> padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
> found end-of-document token: 50256
building GPT2 model ...
> number of parameters on model parallel rank 0: 178100224
> number of parameters on model parallel rank 1: 178100224
Optimizer = FusedAdam
learning rate decaying cosine
WARNING: could not find the metadata file checkpoints/gpt2_345m_mp2/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
Optimizer = FusedAdam
Partition Activations False and Correctness Check False
iteration 100/ 600 | elapsed time per iteration (ms): 810.9 | learning rate 1.444E-04 | lm loss 5.023855E+00 | loss scale 8192.0 |
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:424: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:424: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
warnings.warn(
after 100 iterations memory (MB) | allocated: 3447.24365234375 | max allocated: 6237.830078125 | cached: 7890.0 | max cached: 7890.0
time (ms) | forward: 252.44 | backward: 550.96 | allreduce: 12.11 | optimizer: 7.26 | batch generator: 7.15 | data loader: 6.35
iteration 200/ 600 | elapsed time per iteration (ms): 844.2 | learning rate 1.210E-04 | lm loss 1.112287E-01 | loss scale 8192.0 |
time (ms) | forward: 242.53 | backward: 589.63 | allreduce: 11.37 | optimizer: 10.92 | batch generator: 4.28 | data loader: 2.71
iteration 300/ 600 | elapsed time per iteration (ms): 824.7 | learning rate 8.518E-05 | lm loss 8.868908E-03 | loss scale 8192.0 |
time (ms) | forward: 240.10 | backward: 572.66 | allreduce: 11.63 | optimizer: 11.32 | batch generator: 3.64 | data loader: 2.12
iteration 400/ 600 | elapsed time per iteration (ms): 790.5 | learning rate 4.666E-05 | lm loss 2.208042E-03 | loss scale 8192.0 |
time (ms) | forward: 233.81 | backward: 547.29 | allreduce: 11.90 | optimizer: 9.11 | batch generator: 1.16 | data loader: 0.21
iteration 500/ 600 | elapsed time per iteration (ms): 792.8 | learning rate 1.574E-05 | lm loss 8.129998E-04 | loss scale 8192.0 |
time (ms) | forward: 234.04 | backward: 549.56 | allreduce: 13.62 | optimizer: 9.02 | batch generator: 0.91 | data loader: 0.16
iteration 600/ 600 | elapsed time per iteration (ms): 787.7 | learning rate 6.939E-07 | lm loss 6.003926E-04 | loss scale 8192.0 |
time (ms) | forward: 234.25 | backward: 544.30 | allreduce: 10.23 | optimizer: 9.00 | batch generator: 0.83 | data loader: 0.12
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
validation loss at the end of training for val data | LM loss: 1.231077E+01 | LM PPL: 2.220759E+05
----------------------------------------------------------------------------------------------------
global rank 1 is saving checkpoint at iteration 600 to checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_01/model_optim_rng.pt
global rank 0 is saving checkpoint at iteration 600 to checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_00/model_optim_rng.pt
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
successfully saved checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_01/model_optim_rng.pt
successfully saved checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_00/model_optim_rng.pt
Evaluating iter 100/100
----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
validation loss at the end of training for test data | LM loss: 1.215604E+01 | LM PPL: 1.902403E+05
-----------------------------------------------------------------------------------------------------
Screenshot of video memory usage:
Because the model parameters are partitioned across the two cards, the peak memory usage of a single card drops from about 15 G under data parallelism to about 9 G.
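The padded vocabulary sizes and parameter counts printed in the two logs can be reproduced by hand, which makes the partitioning concrete. The sketch below assumes a Megatron-style split: the vocabulary is padded to a multiple of make_vocab_size_divisible_by times model_parallel_size; the word embedding and the output dimension of the QKV and first MLP projections are split across ranks; the input dimension of the attention output and second MLP projections is split; and layer norms, position embeddings, and the biases of the input-split layers are replicated on every rank. The function names are ours, not Megatron's.

```python
def padded_vocab_size(vocab: int, divisible_by: int, mp: int) -> int:
    """Round vocab up to the next multiple of divisible_by * mp."""
    multiple = divisible_by * mp
    return -(-vocab // multiple) * multiple  # ceiling division


def params_per_rank(hidden: int, layers: int, vocab: int, max_pos: int,
                    mp: int, divisible_by: int = 128) -> int:
    """Parameter count held by one model-parallel rank (assumed layout)."""
    h = hidden
    v = padded_vocab_size(vocab, divisible_by, mp)

    word_emb = (v // mp) * h                  # vocabulary split across ranks
    pos_emb = max_pos * h                     # replicated
    layernorm = 2 * h                         # weight + bias, replicated

    qkv = h * (3 * h // mp) + 3 * h // mp     # output-split weight and bias
    attn_out = (h // mp) * h + h              # input-split weight, full bias
    mlp_in = h * (4 * h // mp) + 4 * h // mp  # output-split weight and bias
    mlp_out = (4 * h // mp) * h + h           # input-split weight, full bias

    per_layer = 2 * layernorm + qkv + attn_out + mlp_in + mlp_out
    return word_emb + pos_emb + layers * per_layer + layernorm  # + final layer norm


# Settings taken from the argument dumps in the logs (345M GPT2).
single = params_per_rank(hidden=1024, layers=24, vocab=50257, max_pos=1024, mp=1)
sharded = params_per_rank(hidden=1024, layers=24, vocab=50257, max_pos=1024, mp=2)

print(padded_vocab_size(50257, 128, 1), single)
print(padded_vocab_size(50257, 128, 2), sharded)
```

Under these assumptions each rank of the 2-way model-parallel run holds slightly more than half of the single-card parameters, since the replicated tensors and the extra vocabulary padding are not divided. Parameters are only one part of the memory footprint, so the per-card peak does not drop by exactly half.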
If you use this checkpoint directly for inference, loading it fails because the parameters do not match the model definition. This is because this version of the Megatron code does not handle loading checkpoints saved by model-parallel training, so the only way to run inference is to merge the two model-parallel sub-models into a complete single-card model and let Megatron load that.
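To illustrate what such a merge involves in principle, here is a toy sketch on plain nested lists. It assumes weights are stored as [out_features][in_features], that some layers are split along the output dimension, and that others are split along the input dimension. It deliberately ignores details a real tool must handle, such as the ordering of Q, K, and V inside the fused attention projection, the padded vocabulary, and the actual checkpoint key names, so it is not a usable converter for the checkpoints saved above. All function names are hypothetical.

```python
def merge_output_split(shards):
    """Merge shards that were split along the output dimension (rows)."""
    merged = []
    for shard in shards:
        merged.extend(list(row) for row in shard)
    return merged


def merge_input_split(shards):
    """Merge shards that were split along the input dimension (columns)."""
    num_rows = len(shards[0])
    return [
        [value for shard in shards for value in shard[r]]
        for r in range(num_rows)
    ]


# A toy 4x4 "full" weight with distinct entries.
full = [[r * 4 + c for c in range(4)] for r in range(4)]

# Simulate what two model-parallel ranks would each hold.
output_shards = [full[:2], full[2:]]
input_shards = [[row[:2] for row in full], [row[2:] for row in full]]

restored_from_output = merge_output_split(output_shards)
restored_from_input = merge_input_split(input_shards)

print(restored_from_output == full, restored_from_input == full)
```

Each shard has a different shape from the full weight, which is why loading a sharded checkpoint into a single-card model definition reports mismatched parameters.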
However, the Megatron-LM source code used in this article does not provide a model-merging tool, so we will not run inference on the model-parallel checkpoint here. If you want to run inference on a checkpoint from model-parallel training, the easiest way is to use the latest NVIDIA Megatron-LM code directly for both training and inference. It supports pipeline parallelism in addition to model parallelism, and it can load a model with any combination of parallel settings for inference. The official Megatron also provides a tool that converts a checkpoint from its original model-parallel size and pipeline-parallel size to the sizes specified by the user (see https://github.com/NVIDIA/Megatron-LM/tree/main#evaluation-and-tasks).
0x5. Summary
This article is already fairly long, so training with DeepSpeed and Megatron together will be explored in the next note. Here are several good blogs for studying the Megatron source code (the first two are highly recommended; the author's other posts are also worth following, as they are very well written):
- Illustrated Large Model Series: Megatron Source Code Interpretation 1, Distributed Environment Initialization
- Illustrated Large Model Training: Megatron Source Code Interpretation 2, Model Parallelism
- [Source Code Analysis] Model Parallel Distributed Training Megatron (1): Papers & Basics
- [Source Code Analysis] Model Parallel Distributed Training Megatron (2): Overall Architecture
- [Source Code Analysis] Model Parallel Distributed Training Megatron (3): Model Parallel Implementation
- [Source Code Analysis] Model Parallel Distributed Training Megatron (4): How to Set Up Various Parallelism
- [Source Code Analysis] Model Parallel Distributed Training Megatron (5): PipeDream Flush