Notes on training a GPT2 model with DeepSpeed and Megatron-LM (Part 1)

0x0. Preamble

This article explores training the GPT2 model based on the Megatron-related examples in the DeepSpeedExamples repository. It covers two main parts: the first is how to train GPT2 with the original Megatron code, and the second is how to train Megatron GPT2 with DeepSpeed features enabled. Because of its length, this article only covers the first part, carefully recording the problems encountered during Megatron GPT2 training and how they were solved. The article is based on the codebase at https://github.com/microsoft/DeepSpeedExamples/tree/bdf8e59aede8c8e0577e8d4d557298ca8515268f.

0x1. Training GPT2 with Megatron on a single GPU

First, read the README at https://github.com/microsoft/DeepSpeedExamples/tree/bdf8e59aede8c8e0577e8d4d557298ca8515268f/Megatron-LM. The BERT parts are not the focus here; the goal is to run GPT2 training and inference.

The README first notes that Megatron is a large and powerful Transformer codebase used for ongoing research on large Transformer language models. Megatron currently supports model-parallel, multi-node training of GPT2 and BERT with mixed precision. The codebase can efficiently train a 72-layer, 8.3-billion-parameter GPT2 language model on 512 GPUs using 8-way model parallelism and 64-way data parallelism. The authors found that this larger language model (the 8.3B-parameter GPT2) surpasses the GPT2-1.5B wikitext perplexity in only 5 training epochs.

Dependency installation

First, go to the Megatron-LM directory and install the dependencies with pip install -r requirements.txt. Note that requirements.txt includes TensorFlow, which is only needed for BERT training; I don't care about that here, so I skip installing TensorFlow. The contents of requirements.txt are as follows:

nltk>=3.4
numpy>=1.15.4
pandas>=0.24.0
sentencepiece>=0.1.8
# tensorflow>=1.12.0
boto3==1.11.11
regex==2020.1.8

When installing, an error will be reported:

ERROR: Could not find a version that satisfies the requirement boto3==1.11.11 (from versions: none)
ERROR: No matching distribution found for boto3==1.11.11

I simply installed the latest version with pip install boto3.

Then, following the tutorial, execute bash scripts/pretrain_gpt2.sh. This hits a PyTorch error:

ModuleNotFoundError: No module named 'torch._six'

This error is caused by a PyTorch version change. A quick search shows the fix is to change the line from torch._six import inf to from torch import inf. Continuing, the next error is: AssertionError: make sure to set PATH for wikipedia data_utils/corpora.py. This is because scripts/pretrain_gpt2.sh specifies wikipedia as the training dataset, so we need to set the path of the locally downloaded wikipedia data in DeepSpeedExamples/Megatron-LM/data_utils/corpora.py, e.g. PATH = 'data/wikipedia/wikidump_lines.json'.
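
A minimal sketch of these two local edits, expressed as the lines you end up with (the file locations are assumptions based on the error messages; adjust them to your checkout):

# Replacement for the broken "from torch._six import inf" line:
from torch import inf

# In DeepSpeedExamples/Megatron-LM/data_utils/corpora.py, inside the wikipedia class,
# point PATH at the locally downloaded dump (example path from the text above):
PATH = 'data/wikipedia/wikidump_lines.json'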

Prepare training data

When downloading the data, I found the wikipedia dump too large, so I switched to the webtext dataset instead. The Megatron README introduces this dataset as follows:

"We" utilize the publicly available OpenWebText (https://github.com/eukaryote31/openwebtext) library developed by jcpeterson (https://github.com/jcpeterson/openwebtext) and eukaryote31 (https://github.com /eukaryote31/openwebtext) for download URLs. We then filtered, cleaned and deduped all downloaded content according to the process we described in the openwebtext catalog. For the content corresponding to the Reddit URL as of October 2018, we got about 37GB of content. 37G is still too big for running training, so I only downloaded the first 1url file among dozens of urls.

Then copy this file into the openwebtext directory of Megatron-LM.

Next, following the openwebtext README, execute:

    pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract 
    git clone https://github.com/mattilyra/LSH
    cd LSH
    python setup.py install

Installing LSH ran into two compilation problems caused by Python version incompatibility:

lsh/cMinhash.cpp:19292:21: error: ‘PyThreadState’ {aka ‘struct _ts’} has no member named ‘exc_type’; did you mean ‘curexc_type’?
19292 | *type = tstate->exc_type;

This problem can be solved by replacing exc_type with curexc_type.

lsh/cMinhash.cpp:17704:26: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
17704 | __pyx_type___pyx_array.tp_print = 0;

This problem can be solved by replacing tp_print with tp_vectorcall_offset.
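
If you prefer not to edit the generated C++ by hand, a small helper along these lines can apply both replacements before re-running python setup.py install (a sketch assuming lsh/cMinhash.cpp is the only affected file; any similar member-rename errors can be handled the same way):

# patch_cminhash.py: apply the two fixes described above to the Cython-generated C++.
from pathlib import Path

src = Path("lsh/cMinhash.cpp")
code = src.read_text()
# targeted version of the exc_type -> curexc_type fix
code = code.replace("tstate->exc_type", "tstate->curexc_type")
# tp_print no longer exists on PyTypeObject in newer Python versions
code = code.replace("tp_print", "tp_vectorcall_offset")
src.write_text(code)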

Next, execute the command to deduplicate the URL:

python3 blacklist_urls.py RS_2011-01.bz2.deduped.txt clean_urls.txt

I found that clean_urls.txt was empty after running this command. Looking at the code, the script expects the URL file(s) to be deduplicated to sit inside a directory, and the path of that directory (not a single file) must be passed to the script.

Therefore, create a new urls directory under the current folder and put the URL file from before into it.

Then execute python3 blacklist_urls.py urls clean_urls.txt to complete the deduplication. Next, use https://github.com/eukaryote31/openwebtext/blob/master/download.py to download the text corresponding to the deduplicated URLs.


Downloading everything would take a long time, so I only download the data for 50 URLs as a demonstration. To save the data for each downloaded URL as a json file, you need to modify the default values of --sqlite_meta and --save_uncompressed in download.py, changing them to False and True respectively. Then, after running python3 openwebtext/download.py clean_urls.txt, a scraped folder is generated, and the downloaded text of each URL is saved in its data subfolder.

Then we use the following script (merge_jsons.py) to merge all the txt files in that folder into a single json file, where each line is a json object whose text field holds the corresponding content:

import glob
import sys
import json
import argparse

if __name__ == '__main__':

    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, default=".",
        help="path where all the json files are located")

    parser.add_argument("--output_file", type=str, default="merged_output.json",
        help="filename where the merged json should go")

    args = parser.parse_args()

    data_path = args.data_path
    out_file = args.output_file

    text_files = glob.glob(data_path + '/*.txt')

    counter = 0

    with open(out_file, 'w') as outfile:
        for fname in text_files:
            counter += 1

            if counter % 1024 == 0:
                print("Merging at ", counter, flush=True)

            with open(fname, 'r') as infile:
                for row in infile:
                    tmp = {}
                    tmp['text'] = row
                    outfile.write(json.dumps(tmp))
                    outfile.write('\n')


    print("Merged file", out_file, flush=True)

Execute this script to get merged_output.json: python3 merge_jsons.py --data_path DeepSpeedExamples/Megatron-LM/openwebtext/scraped/data.

Next, run cleanup_dataset.py under the openwebtext folder to delete all texts with fewer than 128 tokens: python3 cleanup_dataset.py merged_output.json merged_cleand.json.

Detailed training process and pitfalls

After the data is ready, modify --train-data in DeepSpeedExamples/Megatron-LM/scripts/pretrain_gpt2.sh to webtext. Also set the path used by the webtext class in DeepSpeedExamples/Megatron-LM/data_utils/corpora.py to the merged_cleand.json we just obtained.

In addition, since I only use a few dozen samples to demonstrate the training process, the --split argument in DeepSpeedExamples/Megatron-LM/scripts/pretrain_gpt2.sh also needs to be changed, here to 400,300,300, i.e. a 4:3:3 ratio for the training, test, and validation sets, so that the number of test samples does not end up being 0.

Then start training with bash scripts/pretrain_gpt2.sh. Part of the training log:

Setting ds_accelerator to cuda (auto detect)
using world size: 1 and model-parallel size: 1 
 > using dynamic loss scaling
> initializing model parallel with size 1
Pretrain GPT2 model
arguments:
  pretrained_bert .............. False
  attention_dropout ............ 0.1
  num_attention_heads .......... 16
  hidden_size .................. 1024
  intermediate_size ............ None
  num_layers ................... 24
  layernorm_epsilon ............ 1e-05
  hidden_dropout ............... 0.1
  max_position_embeddings ...... 1024
  vocab_size ................... 30522
  deep_init .................... False
  make_vocab_size_divisible_by . 128
  cpu_optimizer ................ False
  cpu_torch_adam ............... False
  fp16 ......................... True
  fp32_embedding ............... False
  fp32_layernorm ............... False
  fp32_tokentypes .............. False
  fp32_allreduce ............... False
  hysteresis ................... 2
  loss_scale ................... None
  loss_scale_window ............ 1000
  min_scale .................... 1
  batch_size ................... 8
  weight_decay ................. 0.01
  checkpoint_activations ....... True
  checkpoint_num_layers ........ 1
  deepspeed_activation_checkpointing  False
  clip_grad .................... 1.0
  train_iters .................. 320000
  log_interval ................. 100
  exit_interval ................ None
  seed ......................... 1234
  reset_position_ids ........... False
  reset_attention_mask ......... False
  lr_decay_iters ............... None
  lr_decay_style ............... cosine
  lr ........................... 0.00015
  warmup ....................... 0.01
  save ......................... checkpoints/gpt2_345m
  save_interval ................ 5000
  no_save_optim ................ False
  no_save_rng .................. False
  load ......................... checkpoints/gpt2_345m
  no_load_optim ................ False
  no_load_rng .................. False
  finetune ..................... False
  resume_dataloader ............ True
  distributed_backend .......... nccl
  local_rank ................... None
  eval_batch_size .............. None
  eval_iters ................... 100
  eval_interval ................ 1000
  eval_seq_length .............. None
  eval_max_preds_per_seq ....... None
  overlapping_eval ............. 32
  cloze_eval ................... False
  eval_hf ...................... False
  load_openai .................. False
  temperature .................. 1.0
  top_p ........................ 0.0
  top_k ........................ 0
  out_seq_length ............... 256
  model_parallel_size .......... 1
  shuffle ...................... False
  train_data ................... ['webtext']
  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 400,300,300
  test_data .................... None
  lazy_loader .................. True
  loose_json ................... False
  presplit_sentences ........... False
  num_workers .................. 2
  tokenizer_model_type ......... bert-large-uncased
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... GPT2BPETokenizer
  cache_dir .................... cache
  use_tfrecords ................ False
  seq_length ................... 1024
  max_preds_per_seq ............ None
  deepspeed .................... False
  deepspeed_config ............. None
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ False
  cuda ......................... True
  rank ......................... 0
  world_size ................... 1
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> found end-of-document token: 50256
building GPT2 model ...
 > number of parameters on model parallel rank 0: 354871296
Optimizer = FusedAdam
learning rate decaying cosine
WARNING: could not find the metadata file checkpoints/gpt2_345m/latest_checkpointed_iteration.txt 
    will not load any checkpoints and will start from random
Partition Activations False and Correctness Check False
 iteration      100/  320000 | elapsed time per iteration (ms): 963.3 | learning rate 3.937E-06 | lm loss 8.995377E+00 | loss scale 131072.0 |
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:424: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  warnings.warn(
after 100 iterations memory (MB) | allocated: 6784.88427734375 | max allocated: 11927.470703125 | cached: 13826.0 | max cached: 13826.0
time (ms) | forward: 276.11 | backward: 672.99 | allreduce: 13.96 | optimizer: 14.00 | batch generator: 5.22 | data loader: 4.53
 iteration      200/  320000 | elapsed time per iteration (ms): 950.6 | learning rate 8.625E-06 | lm loss 3.041360E+00 | loss scale 131072.0 |
time (ms) | forward: 259.24 | backward: 674.56 | allreduce: 13.45 | optimizer: 16.63 | batch generator: 0.78 | data loader: 0.14

The nvidia-smi output also shows that the Megatron training is running on GPU 0.

The following StopIteration error may occur during training:


time (ms) | forward: 259.07 | backward: 671.87 | allreduce: 13.03 | optimizer: 16.64 | batch generator: 0.76 | data loader: 0.13
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:713 in <module>                 │
│                                                                                                  │
│   710                                                                                            │
│   711                                                                                            │
│   712 if __name__ == "__main__":                                                                 │
│ ❱ 713 │   main()                                                                                 │
│   714                                                                                            │
│                                                                                                  │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:686 in main                     │
│                                                                                                  │
│   683 │   iteration = 0                                                                          │
│   684 │   if args.train_iters > 0:                                                               │
│   685 │   │   if args.do_train:                                                                  │
│ ❱ 686 │   │   │   iteration, skipped = train(model, optimizer,                                   │
│   687 │   │   │   │   │   │   │   │   │      lr_scheduler,                                       │
│   688 │   │   │   │   │   │   │   │   │      train_data_iterator,                                │
│   689 │   │   │   │   │   │   │   │   │      val_data_iterator,                                  │
│                                                                                                  │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:415 in train                    │
│                                                                                                  │
│   412 │   report_memory_flag = True                                                              │
│   413 │   while iteration < args.train_iters:                                                    │
│   414 │   │                                                                                      │
│ ❱ 415 │   │   lm_loss, skipped_iter = train_step(train_data_iterator,                            │
│   416 │   │   │   │   │   │   │   │   │   │      model,                                          │
│   417 │   │   │   │   │   │   │   │   │   │      optimizer,                                      │
│   418 │   │   │   │   │   │   │   │   │   │      lr_scheduler,                                   │
│                                                                                                  │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:369 in train_step               │
│                                                                                                  │
│   366 │                                                                                          │
│   367 │   # Forward model for one step.                                                          │
│   368 │   timers('forward').start()                                                              │
│ ❱ 369 │   lm_loss = forward_step(data_iterator, model, args, timers)                             │
│   370 │   timers('forward').stop()                                                               │
│   371 │                                                                                          │
│   372 │   #print_rank_0("loss is {}".format(lm_loss))                                            │
│                                                                                                  │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:286 in forward_step             │
│                                                                                                  │
│   283 │                                                                                          │
│   284 │   # Get the batch.                                                                       │
│   285 │   timers('batch generator').start()                                                      │
│ ❱ 286 │   tokens, labels, loss_mask, attention_mask, position_ids = get_batch(                   │
│   287 │   │   data_iterator, args, timers)                                                       │
│   288 │   timers('batch generator').stop()                                                       │
│   289                                                                                            │
│                                                                                                  │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/pretrain_gpt2.py:257 in get_batch                │
│                                                                                                  │
│   254 │   # Broadcast data.                                                                      │
│   255 │   timers('data loader').start()                                                          │
│   256 │   if data_iterator is not None:                                                          │
│ ❱ 257 │   │   data = next(data_iterator)                                                         │
│   258 │   else:                                                                                  │
│   259 │   │   data = None                                                                        │
│   260 │   timers('data loader').stop()                                                           │
│                                                                                                  │
│ /home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/utils/data/dataloader.p │
│ y:633 in __next__                                                                                │
│                                                                                                  │
│    630 │   │   │   if self._sampler_iter is None:                                                │
│    631 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/76750)                   │
│    632 │   │   │   │   self._reset()  # type: ignore[call-arg]                                   │
│ ❱  633 │   │   │   data = self._next_data()                                                      │
│    634 │   │   │   self._num_yielded += 1                                                        │
│    635 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    636 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/utils/data/dataloader.p │
│ y:1318 in _next_data                                                                             │
│                                                                                                  │
│   1315 │   │   │   │   # no valid `self._rcvd_idx` is found (i.e., didn't break)                 │
│   1316 │   │   │   │   if not self._persistent_workers:                                          │
│   1317 │   │   │   │   │   self._shutdown_workers()                                              │
│ ❱ 1318 │   │   │   │   raise StopIteration                                                       │
│   1319 │   │   │                                                                                 │
│   1320 │   │   │   # Now `self._rcvd_idx` is the batch index we want to fetch                    │
│   1321                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
StopIteration

Don't worry: this error just means there is not enough data to train for that many iterations. The dataloader here is built with a torch.utils.data.SequentialSampler, which draws samples according to the length of the dataset and is not tied to args.train_iters, so once the data has been fully consumed after some number of iterations, a StopIteration is raised.
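
A minimal illustration of the mismatch (not Megatron code): a plain DataLoader iterator is exhausted after len(dataset)/batch_size steps, while the training loop asks for args.train_iters batches regardless of dataset size.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(40))    # pretend we only have 40 samples
loader = DataLoader(dataset, batch_size=8)   # SequentialSampler by default
data_iterator = iter(loader)

train_iters = 600
for it in range(train_iters):
    try:
        batch = next(data_iterator)          # same call pattern as get_batch()
    except StopIteration:
        print(f"data exhausted at iteration {it}")   # happens at it == 5 here
        break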

Let's adjust the script: change the number of iterations to 600 and set the checkpoint saving interval to 500 so that Megatron saves a checkpoint, then run the script again. This time a checkpoint is written by the end of training.

0x2. Running inference with the trained GPT2 model on a single GPU

Next, modify CHECKPOINT_PATH in DeepSpeedExamples/Megatron-LM/scripts/generate_text.sh to point to the model we just trained, DeepSpeedExamples/Megatron-LM/checkpoints/gpt2_345m, and then run bash scripts/generate_text.sh from the Megatron root directory. But an error is reported:

Setting ds_accelerator to cuda (auto detect)
Generate Samples
WARNING: No training data specified
using world size: 1 and model-parallel size: 1 
 > using dynamic loss scaling
> initializing model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
prepare tokenizer done
building GPT2 model ...
 > number of parameters on model parallel rank 0: 354823168
global rank 0 is loading checkpoint /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/checkpoints/gpt2_345m/iter_0000600/mp_rank_00/model_optim_rng.pt
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/generate_samples.py:277 in <module>              │
│                                                                                                  │
│   274                                                                                            │
│   275                                                                                            │
│   276 if __name__ == "__main__":                                                                 │
│ ❱ 277 │   main()                                                                                 │
│   278                                                                                            │
│   279                                                                                            │
│   280                                                                                            │
│                                                                                                  │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/generate_samples.py:267 in main                  │
│                                                                                                  │
│   264 │   tokenizer = prepare_tokenizer(args)                                                    │
│   265 │                                                                                          │
│   266 │   # Model, optimizer, and learning rate.                                                 │
│ ❱ 267 │   model = setup_model(args)                                                              │
│   268 │                                                                                          │
│   269 │   #setting default batch size to 1                                                       │
│   270 │   args.batch_size = 1                                                                    │
│                                                                                                  │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/generate_samples.py:80 in setup_model            │
│                                                                                                  │
│    77 │   model = get_model(args)                                                                │
│    78 │                                                                                          │
│    79 │   if args.load is not None:                                                              │
│ ❱  80 │   │   _ = load_checkpoint(                                                               │
│    81 │   │   │   model, None, None, args)                                                       │
│    82 │                                                                                          │
│    83 │   return model                                                                           │
│                                                                                                  │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/utils.py:305 in load_checkpoint                  │
│                                                                                                  │
│   302 │   │                                                                                      │
│   303 │   │   # Model.                                                                           │
│   304 │   │   try:                                                                               │
│ ❱ 305 │   │   │   model.load_state_dict(sd['model'])                                             │
│   306 │   │   except KeyError:                                                                   │
│   307 │   │   │   print_rank_0('A metadata file exists but unable to load model '                │
│   308 │   │   │   │   │   │   'from checkpoint {}, exiting'.format(checkpoint_name))             │
│                                                                                                  │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/model/distributed.py:90 in load_state_dict       │
│                                                                                                  │
│    87 │   │   return sd                                                                          │
│    88 │                                                                                          │
│    89 │   def load_state_dict(self, state_dict, strict=True):                                    │
│ ❱  90 │   │   self.module.load_state_dict(state_dict, strict=strict)                             │
│    91 │                                                                                          │
│    92 │   '''                                                                                    │
│    93 │   def _sync_buffers(self):                                                               │
│                                                                                                  │
│ /home/zhangxiaoyu/DeepSpeedExamples/Megatron-LM/fp16/fp16.py:71 in load_state_dict               │
│                                                                                                  │
│    68 │   │   return self.module.state_dict(destination, prefix, keep_vars)                      │
│    69 │                                                                                          │
│    70 │   def load_state_dict(self, state_dict, strict=True):                                    │
│ ❱  71 │   │   self.module.load_state_dict(state_dict, strict=strict)                             │
│    72                                                                                            │
│    73 # TODO:  Update overflow check + downscale to use Carl's fused kernel.                     │
│    74 class FP16_Optimizer(object):                                                              │
│                                                                                                  │
│ /home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py:20 │
│ 41 in load_state_dict                                                                            │
│                                                                                                  │
│   2038 │   │   │   │   │   │   ', '.join('"{}"'.format(k) for k in missing_keys)))               │
│   2039 │   │                                                                                     │
│   2040 │   │   if len(error_msgs) > 0:                                                           │
│ ❱ 2041 │   │   │   raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(     │
│   2042 │   │   │   │   │   │   │      self.__class__.__name__, "\n\t".join(error_msgs)))         │
│   2043 │   │   return _IncompatibleKeys(missing_keys, unexpected_keys)                           │
│   2044                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Error(s) in loading state_dict for GPT2Model:
        size mismatch for word_embeddings.weight: copying a param with shape torch.Size([50304, 1024]) from checkpoint, the shape in current model is 
torch.Size([50257, 1024]).

You can see that the shape of word_embeddings.weight reported when loading the checkpoint does not match the current model. Looking at the definition of word_embeddings in GPT2, its first dimension is the vocabulary size, which explains the 50304 vs. 50257 mismatch.

So this problem is caused by a difference in vocab_size between training and generation. After digging in, it turns out that during training num_tokens is padded so that it is divisible by args.make_vocab_size_divisible_by=128, while generation has no such step, which leads to the embedding dimension mismatch. The fix is to modify the num_tokens handling in DeepSpeedExamples/Megatron-LM/generate_samples.py so that it is consistent with training.
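
For reference, here is a sketch of the padding rule implied by the training logs (the real Megatron code also factors in the model-parallel size, which matches the 2-way model-parallel run later padding 50257 up to 50432; treat the function below as an illustration, not the library's code):

# Pad the tokenizer vocab up to a multiple of make_vocab_size_divisible_by
# (times the model-parallel size), mirroring what the training logs report.
def pad_vocab_size(num_tokens: int, divisible_by: int = 128, model_parallel_size: int = 1) -> int:
    multiple = divisible_by * model_parallel_size
    while num_tokens % multiple != 0:
        num_tokens += 1
    return num_tokens

print(pad_vocab_size(50257))                         # 50304 (47 dummy tokens), as in the single-GPU log
print(pad_vocab_size(50257, model_parallel_size=2))  # 50432 (175 dummy tokens), as in the model-parallel log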

Run bash scripts/generate_text.sh again and we can chat with GPT2: enter a prompt and the model produces a completion; type "stop" to end the dialogue.

Since the model here was trained on only a tiny amount of data for demonstration, the completions are basically not good; with more training data we could later train a better GPT2 dialogue model.

0x3. Estimating parameter count and GPU memory usage

The article at https://zhuanlan.zhihu.com/p/624740065 derives the parameter count and training memory usage of a GPT2-style Transformer. Here we apply the formulas summarized there to the GPT2 model we are training, to estimate its parameter count and the theoretical GPU memory usage during training.

Parameter estimation

Applying the formula params ≈ 12·l·h², with l = 24 and h = 1024 we get 12 × 24 × 1024 × 1024 = 301,989,888 ≈ 0.3B. So the GPT2 model we train here has only about 0.3B parameters. The model name, 345M, also tells us this estimate is basically consistent with the real size.
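
A quick check of this arithmetic against the exact count printed in the training log ("number of parameters on model parallel rank 0: 354871296"):

# 12*l*h^2 approximation vs. the exact parameter count from the log
l, h = 24, 1024
approx = 12 * l * h * h
print(approx)                # 301989888, about 0.3B
print(354_871_296 - approx)  # ~52.9M gap, mostly the word/position embeddings
                             # that the approximation ignores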

Estimating training GPU memory usage

According to the formula from the same article, the memory occupied by model parameters, gradients, and optimizer states during mixed-precision training is roughly 20 bytes per parameter: 301,989,888 × 20 bytes = 6,039,797,760 bytes ≈ 5,760 MB ≈ 5.6 GB. The activation memory is estimated as follows:

With batch_size b = 8, s = 1024, h = 1024, a = num-attention-heads = 16, and l = 24 layers, the activations take (34bsh + 5abs²) · l = 22,951,231,488 bytes ≈ 21,888 MiB ≈ 21 GB.
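
The same arithmetic written out, as a sanity check of the two numbers above:

# Model states: roughly 20 bytes per parameter for mixed-precision Adam training
# (fp16 weights + fp16 grads + fp32 master weights + fp32 Adam m and v).
params = 301_989_888
print(params * 20 / 1024**3)              # ~5.6 GiB

# Activations without checkpointing: (34*b*s*h + 5*a*b*s^2) per layer, times l layers.
b, s, h, a, l = 8, 1024, 1024, 16, 24
act = (34 * b * s * h + 5 * a * b * s * s) * l
print(act, act / 1024**2, act / 1024**3)  # 22951231488 bytes ~= 21888 MiB ~= 21 GiB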

Therefore, training this 0.3B GPT2 should need about 5.6 GB + 21 GB = 26.6 GB of GPU memory. However, our card only has 24 GB, and the memory consumption observed during training in section 0x1 was only 15107 MiB ≈ 14.75 GB, which means the activations take not the 21 GB we calculated but roughly 14.75 − 5.6 ≈ 9.15 GB. Why?

This is because --checkpoint-activations is enabled in DeepSpeedExamples/Megatron-LM/scripts/pretrain_gpt2.sh, i.e. activation checkpointing is used. The relevant code is in DeepSpeedExamples/Megatron-LM/mpu/transformer.py:406-413.


There, each Transformer layer runs its Self-Attention and MLP blocks through the checkpointing wrapper, so the intermediate activations they would otherwise keep for the backward pass are dropped and recomputed during backward, which is what reduces the GPU memory usage.
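
A conceptual sketch of what --checkpoint-activations does (Megatron wires this up through its own checkpointing wrapper in mpu/, and optionally DeepSpeed's; the snippet below only illustrates the idea using torch.utils.checkpoint):

import torch
from torch.utils.checkpoint import checkpoint

class TinyBlock(torch.nn.Module):
    def __init__(self, hidden=1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(hidden, 4 * hidden),
            torch.nn.GELU(),
            torch.nn.Linear(4 * hidden, hidden),
        )

    def forward(self, x):
        return x + self.ff(x)

block = TinyBlock()
x = torch.randn(2, 128, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # intermediates inside the block are not kept
y.sum().backward()                             # the block's forward is re-run here to get gradients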

0x4. Training GPT2 with Megatron on multiple GPUs

Data parallelism on 2 GPUs

With single-GPU GPT2 training done, starting multi-GPU training is relatively simple. In DeepSpeedExamples/Megatron-LM/scripts/pretrain_gpt2_distributed.sh, set --train-data to webtext and change --train-iters to 600/num_gpus. This script launches data-parallel training, so to scan the same amount of data as the single-GPU run we only need 600/num_gpus iterations. The train/validation/test --split ratio also has to be adjusted, because with so little demo data the original ratio would leave the test set with 0 items and raise an error. Finally, set GPUS_PER_NODE to 2 to run data-parallel training on 2 GPUs. Then start training with bash scripts/pretrain_gpt2_distributed.sh; the log is as follows:
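
A quick sanity check of the 600/num_gpus choice: in data parallelism each iteration consumes batch_size samples per GPU, so halving the iteration count keeps the total amount of data seen roughly the same (the values below match the log that follows: train_iters 300, batch_size 8, world size 2).

num_gpus = 2
batch_size_per_gpu = 8
single_card_iters = 600
dp_iters = single_card_iters // num_gpus            # 300
samples_single = single_card_iters * batch_size_per_gpu
samples_dp = dp_iters * batch_size_per_gpu * num_gpus
print(samples_single, samples_dp)                   # 4800 4800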

/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
using world size: 2 and model-parallel size: 1 
 > using dynamic loss scaling
> initializing model parallel with size 1
Pretrain GPT2 model
arguments:
  pretrained_bert .............. False
  attention_dropout ............ 0.1
  num_attention_heads .......... 16
  hidden_size .................. 1024
  intermediate_size ............ None
  num_layers ................... 24
  layernorm_epsilon ............ 1e-05
  hidden_dropout ............... 0.1
  max_position_embeddings ...... 1024
  vocab_size ................... 30522
  deep_init .................... False
  make_vocab_size_divisible_by . 128
  cpu_optimizer ................ False
  cpu_torch_adam ............... False
  fp16 ......................... True
  fp32_embedding ............... False
  fp32_layernorm ............... False
  fp32_tokentypes .............. False
  fp32_allreduce ............... False
  hysteresis ................... 2
  loss_scale ................... None
  loss_scale_window ............ 1000
  min_scale .................... 1
  batch_size ................... 8
  weight_decay ................. 0.01
  checkpoint_activations ....... True
  checkpoint_num_layers ........ 1
  deepspeed_activation_checkpointing  False
  clip_grad .................... 1.0
  train_iters .................. 300
  log_interval ................. 100
  exit_interval ................ None
  seed ......................... 1234
  reset_position_ids ........... False
  reset_attention_mask ......... False
  lr_decay_iters ............... None
  lr_decay_style ............... cosine
  lr ........................... 0.00015
  warmup ....................... 0.01
  save ......................... checkpoints/gpt2_345m
  save_interval ................ 5000
  no_save_optim ................ False
  no_save_rng .................. False
  load ......................... checkpoints/gpt2_345m
  no_load_optim ................ False
  no_load_rng .................. False
  finetune ..................... False
  resume_dataloader ............ True
  distributed_backend .......... nccl
  local_rank ................... 0
  eval_batch_size .............. None
  eval_iters ................... 100
  eval_interval ................ 1000
  eval_seq_length .............. None
  eval_max_preds_per_seq ....... None
  overlapping_eval ............. 32
  cloze_eval ................... False
  eval_hf ...................... False
  load_openai .................. False
  temperature .................. 1.0
  top_p ........................ 0.0
  top_k ........................ 0
  out_seq_length ............... 256
  model_parallel_size .......... 1
  shuffle ...................... False
  train_data ................... ['webtext']
  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 400,300,300
  test_data .................... None
  lazy_loader .................. True
  loose_json ................... False
  presplit_sentences ........... False
  num_workers .................. 2
  tokenizer_model_type ......... bert-large-uncased
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... GPT2BPETokenizer
  cache_dir .................... cache
  use_tfrecords ................ False
  seq_length ................... 1024
  max_preds_per_seq ............ None
  deepspeed .................... False
  deepspeed_config ............. None
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ False
  cuda ......................... True
  rank ......................... 0
  world_size ................... 2
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> found end-of-document token: 50256
building GPT2 model ...
 > number of parameters on model parallel rank 0: 354871296
Optimizer = FusedAdam
Optimizer = FusedAdam
learning rate decaying cosine
WARNING: could not find the metadata file checkpoints/gpt2_345m/latest_checkpointed_iteration.txt 
    will not load any checkpoints and will start from random
Partition Activations False and Correctness Check False
 iteration      100/     300 | elapsed time per iteration (ms): 1048.5 | learning rate 1.258E-04 | lm loss 4.799004E+00 | loss scale 32768.0 |
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:424: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:424: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  warnings.warn(
after 100 iterations memory (MB) | allocated: 6784.88427734375 | max allocated: 11927.470703125 | cached: 13826.0 | max cached: 13826.0
time (ms) | forward: 284.78 | backward: 749.95 | allreduce: 93.32 | optimizer: 13.60 | batch generator: 14.88 | data loader: 14.19
 iteration      200/     300 | elapsed time per iteration (ms): 1020.9 | learning rate 5.257E-05 | lm loss 7.708308E-02 | loss scale 32768.0 |
time (ms) | forward: 256.87 | backward: 747.37 | allreduce: 93.08 | optimizer: 16.52 | batch generator: 0.71 | data loader: 0.11
 iteration      300/     300 | elapsed time per iteration (ms): 1018.4 | learning rate 1.806E-06 | lm loss 4.669175E-03 | loss scale 32768.0 |
time (ms) | forward: 256.74 | backward: 744.96 | allreduce: 93.51 | optimizer: 16.53 | batch generator: 0.73 | data loader: 0.12
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
 validation loss at the end of training for val data | LM loss: 1.170473E+01 | LM PPL: 1.211437E+05
----------------------------------------------------------------------------------------------------
global rank 0 is saving checkpoint at iteration     300 to checkpoints/gpt2_345m/iter_0000300/mp_rank_00/model_optim_rng.pt
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
  successfully saved checkpoints/gpt2_345m/iter_0000300/mp_rank_00/model_optim_rng.pt
Evaluating iter 100/100
----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
 validation loss at the end of training for test data | LM loss: 1.169765E+01 | LM PPL: 1.202885E+05
-----------------------------------------------------------------------------------------------------

Looking at GPU memory usage: since this is data parallelism, the per-card memory usage is about the same as in the single-GPU training run.

Inference with the model trained under data parallelism also runs normally.

Model parallelism on 2 GPUs

We use DeepSpeedExamples/Megatron-LM/scripts/pretrain_gpt2_model_parallel.sh for 2-GPU model-parallel training. Besides changes similar to the 2-GPU data-parallel run, we also need to remove the --deepspeed flag from this script, because using DeepSpeed requires providing a DeepSpeed config file; the DeepSpeed-related training features will be explored in the next article.

Start 2-GPU model-parallel training with bash scripts/pretrain_gpt2_model_parallel.sh. The log:

/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Setting ds_accelerator to cuda (auto detect)
Setting ds_accelerator to cuda (auto detect)
using world size: 2 and model-parallel size: 2 
 > using dynamic loss scaling
> initializing model parallel with size 2
Pretrain GPT2 model
arguments:
  pretrained_bert .............. False
  attention_dropout ............ 0.1
  num_attention_heads .......... 16
  hidden_size .................. 1024
  intermediate_size ............ None
  num_layers ................... 24
  layernorm_epsilon ............ 1e-05
  hidden_dropout ............... 0.1
  max_position_embeddings ...... 1024
  vocab_size ................... 30522
  deep_init .................... False
  make_vocab_size_divisible_by . 128
  cpu_optimizer ................ False
  cpu_torch_adam ............... False
  fp16 ......................... True
  fp32_embedding ............... False
  fp32_layernorm ............... False
  fp32_tokentypes .............. False
  fp32_allreduce ............... False
  hysteresis ................... 2
  loss_scale ................... None
  loss_scale_window ............ 1000
  min_scale .................... 1
  batch_size ................... 8
  weight_decay ................. 0.01
  checkpoint_activations ....... True
  checkpoint_num_layers ........ 1
  deepspeed_activation_checkpointing  False
  clip_grad .................... 1.0
  train_iters .................. 600
  log_interval ................. 100
  exit_interval ................ None
  seed ......................... 1234
  reset_position_ids ........... False
  reset_attention_mask ......... False
  lr_decay_iters ............... None
  lr_decay_style ............... cosine
  lr ........................... 0.00015
  warmup ....................... 0.01
  save ......................... checkpoints/gpt2_345m_mp2
  save_interval ................ 5000
  no_save_optim ................ False
  no_save_rng .................. False
  load ......................... checkpoints/gpt2_345m_mp2
  no_load_optim ................ True
  no_load_rng .................. False
  finetune ..................... False
  resume_dataloader ............ True
  distributed_backend .......... nccl
  local_rank ................... 0
  eval_batch_size .............. None
  eval_iters ................... 100
  eval_interval ................ 1000
  eval_seq_length .............. None
  eval_max_preds_per_seq ....... None
  overlapping_eval ............. 32
  cloze_eval ................... False
  eval_hf ...................... False
  load_openai .................. False
  temperature .................. 1.0
  top_p ........................ 0.0
  top_k ........................ 0
  out_seq_length ............... 256
  model_parallel_size .......... 2
  shuffle ...................... False
  train_data ................... ['webtext']
  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 400,300,300
  test_data .................... None
  lazy_loader .................. True
  loose_json ................... False
  presplit_sentences ........... False
  num_workers .................. 2
  tokenizer_model_type ......... bert-large-uncased
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... GPT2BPETokenizer
  cache_dir .................... None
  use_tfrecords ................ False
  seq_length ................... 1024
  max_preds_per_seq ............ None
  deepspeed .................... False
  deepspeed_config ............. None
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ False
  cuda ......................... True
  rank ......................... 0
  world_size ................... 2
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
configuring data
> padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
> found end-of-document token: 50256
building GPT2 model ...
 > number of parameters on model parallel rank 0: 178100224
 > number of parameters on model parallel rank 1: 178100224
Optimizer = FusedAdam
learning rate decaying cosine
WARNING: could not find the metadata file checkpoints/gpt2_345m_mp2/latest_checkpointed_iteration.txt 
    will not load any checkpoints and will start from random
Optimizer = FusedAdam
Partition Activations False and Correctness Check False
s iteration      100/     600 | elapsed time per iteration (ms): 810.9 | learning rate 1.444E-04 | lm loss 5.023855E+00 | loss scale 8192.0 |
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:424: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:416: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
  warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/cuda/memory.py:424: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
  warnings.warn(
after 100 iterations memory (MB) | allocated: 3447.24365234375 | max allocated: 6237.830078125 | cached: 7890.0 | max cached: 7890.0
time (ms) | forward: 252.44 | backward: 550.96 | allreduce: 12.11 | optimizer: 7.26 | batch generator: 7.15 | data loader: 6.35
 iteration      200/     600 | elapsed time per iteration (ms): 844.2 | learning rate 1.210E-04 | lm loss 1.112287E-01 | loss scale 8192.0 |
time (ms) | forward: 242.53 | backward: 589.63 | allreduce: 11.37 | optimizer: 10.92 | batch generator: 4.28 | data loader: 2.71
 iteration      300/     600 | elapsed time per iteration (ms): 824.7 | learning rate 8.518E-05 | lm loss 8.868908E-03 | loss scale 8192.0 |
time (ms) | forward: 240.10 | backward: 572.66 | allreduce: 11.63 | optimizer: 11.32 | batch generator: 3.64 | data loader: 2.12
 iteration      400/     600 | elapsed time per iteration (ms): 790.5 | learning rate 4.666E-05 | lm loss 2.208042E-03 | loss scale 8192.0 |
time (ms) | forward: 233.81 | backward: 547.29 | allreduce: 11.90 | optimizer: 9.11 | batch generator: 1.16 | data loader: 0.21
 iteration      500/     600 | elapsed time per iteration (ms): 792.8 | learning rate 1.574E-05 | lm loss 8.129998E-04 | loss scale 8192.0 |
time (ms) | forward: 234.04 | backward: 549.56 | allreduce: 13.62 | optimizer: 9.02 | batch generator: 0.91 | data loader: 0.16
 iteration      600/     600 | elapsed time per iteration (ms): 787.7 | learning rate 6.939E-07 | lm loss 6.003926E-04 | loss scale 8192.0 |
time (ms) | forward: 234.25 | backward: 544.30 | allreduce: 10.23 | optimizer: 9.00 | batch generator: 0.83 | data loader: 0.12
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
 validation loss at the end of training for val data | LM loss: 1.231077E+01 | LM PPL: 2.220759E+05
----------------------------------------------------------------------------------------------------
global rank 1 is saving checkpoint at iteration     600 to checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_01/model_optim_rng.pt
global rank 0 is saving checkpoint at iteration     600 to checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_00/model_optim_rng.pt
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/zhangxiaoyu/miniconda3/envs/eval/lib/python3.9/site-packages/torch/nn/modules/module.py:1802: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
  successfully saved checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_01/model_optim_rng.pt
  successfully saved checkpoints/gpt2_345m_mp2/iter_0000600/mp_rank_00/model_optim_rng.pt
Evaluating iter 100/100
----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
 validation loss at the end of training for test data | LM loss: 1.215604E+01 | LM PPL: 1.902403E+05
-----------------------------------------------------------------------------------------------------

Looking at GPU memory usage: because the model parameters are partitioned across the two GPUs, the per-card peak memory has dropped from about 15 GB under data parallelism to about 9 GB.

If we try to run inference with this model directly, loading the checkpoint fails because the parameters do not match the model definition. This version of the Megatron code does not support loading a checkpoint saved from model-parallel training for single-GPU inference, so the only way to let Megatron load it would be to merge the two model-parallel shards into a complete single-GPU model.
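
To see concretely why the direct load fails, one can inspect the two shards saved above. This is purely illustrative: the paths follow the training log, the 'model' key matches the load_checkpoint code shown earlier, and each rank only holds its slice of the sharded layers (about 178M parameters each, per the log).

import torch

ckpt_dir = "checkpoints/gpt2_345m_mp2/iter_0000600"
sd0 = torch.load(f"{ckpt_dir}/mp_rank_00/model_optim_rng.pt", map_location="cpu")
sd1 = torch.load(f"{ckpt_dir}/mp_rank_01/model_optim_rng.pt", map_location="cpu")

for name, param in sd0["model"].items():
    # sharded layers (e.g. the word embedding) show partial shapes here,
    # which is why they cannot be loaded into a full single-GPU model as-is
    print(name, tuple(param.shape))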

However, the Megatron-LM source in this repository does not provide a merging tool, so we will not run inference on the model-parallel checkpoint here. If you want to run inference on a model-parallel checkpoint, the easiest way is to use NVIDIA's latest Megatron-LM code for both training and inference: it supports model parallelism as well as pipeline parallelism and can load checkpoints for inference under any parallel configuration. The official Megatron repository also provides a tool to convert a checkpoint with arbitrary model-parallel and pipeline-parallel sizes into one with whatever model-parallel and pipeline-parallel sizes the user specifies (https://github.com/NVIDIA/Megatron-LM/tree/main#evaluation-and-tasks).

0x5. Summary

This article is already quite long; the next note will explore training with DeepSpeed on top of Megatron. I also recommend several good blog posts on the Megatron source code (the first two are especially recommended; you can follow that blogger, whose posts are very well written).

Origin blog.csdn.net/just_sort/article/details/131173500