GPT series training and deployment - GPT2 environment configuration and model training

        This article is an original article of the blogger, and may not be reproduced without the permission of the blogger.

        This article is part of the column "Python AIGC large model training and inference from scratch", available at "https://blog.csdn.net/suiyingy/article/details/130169592".

        Colossal-AI provides multiple parallel modes for running GPT, and the configurations for the different parallel modes are located in the gpt2_configs folder. The tutorial for running the sample program is at "https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/gpt", and we will run the GPT training program by following its steps. For Colossal-AI environment setup and test runs, please refer to the column article "GPT series training and deployment - Colossal-AI environment configuration and test verification" at "https://blog.csdn.net/suiyingy/article/details/130209217".

Figure 1 GPT training tutorial page

        This section focuses on how to run the GPT training program in Colossal-AI. For more on AIGC model training, inference and deployment, please refer to the column "Python AIGC large model training and inference from scratch" at "https://blog.csdn.net/suiyingy/article/details/130169592". Updates will also be posted simultaneously on the official account given at the end of the article, and the related AIGC model demos will be launched in the RdFast mini program.

1 Environment installation

1.1 Colossal-AI environment 

        For Colossal-AI environment setup and testing, please refer to the column article "GPT series training and deployment - Colossal-AI environment configuration and test verification" at "https://blog.csdn.net/suiyingy/article/details/130209217".

1.2 ColossalAI-Examples environment 

        ColossalAI-Examples contains a number of sample programs, including the GPT training program. Its environment is set up as follows.

git clone https://github.com/hpcaitech/ColossalAI-Examples.git
cd ColossalAI-Examples
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

1.3 Colossal-AI GPT environment

        The environment installation commands are as follows. Note that building LSH requires an older GCC version, i.e. the GCC version should not be too high: the tutorial states that gcc 9.3.0 works, while 10.3.0 does not.

pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract cached-path  -i https://pypi.tuna.tsinghua.edu.cn/simple
git clone https://github.com/mattilyra/LSH.git
cd LSH
python setup.py install

        The following error may be reported when installing LSH. The solution is to replace cMinhash.cpp in the LSH/lsh folder with the cMinhash.cpp provided by ColossalAI-Examples, and then run "python setup.py install" again. A reference command for the file replacement is "cp ~/project/ColossalAI-Examples/language/gpt/tools/LSH/cMinhash.cpp lsh/cMinhash.cpp".

/root/miniconda3/envs/clai/lib/python3.8/site-packages/numpy/core/include/numpy/__multiarray_api.h: At global scope:
/root/miniconda3/envs/clai/lib/python3.8/site-packages/numpy/core/include/numpy/__multiarray_api.h:1477:1: warning: ‘int _import_array()’ defined but not used [-Wunused-function]
 _import_array(void)
 ^~~~~~~~~~~~~
error: command '/usr/bin/gcc' failed with exit code 1

2 Data download

2.1 URL file download

        The GPT2 model is trained on the OpenWebText dataset, which consists mainly of web page text; the complete dataset is 38GB. Colossal-AI provides a download address for OpenWebText, namely "https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ!cc4RgQQZ". The downloaded file is OpenWebText.zip, and after decompression the directory is OpenWebText/Version 1/URLs. It contains 161 text files, each of which records the URLs of the web pages to be downloaded.
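        If you want to verify the unzipped URL files, the following minimal Python sketch counts the files and the URLs they contain; the directory path is the example path used later in this article and should be adjusted to your own location.

import glob
import os

# Example path only; adjust to wherever OpenWebText.zip was unzipped.
url_dir = '/data/data/clai/data/OpenWebText/Version 1/URLs'

url_files = sorted(glob.glob(os.path.join(url_dir, '*')))
total_urls = 0
for path in url_files:
    with open(path, 'r', encoding='utf-8', errors='ignore') as f:
        total_urls += sum(1 for line in f if line.strip())
print(f'{len(url_files)} URL files, {total_urls} URLs in total')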

2.2 Data cleaning

        Since some URLs in the OpenWebText dataset may no longer be accessible, run the following command to clean the URL list.

cd path/to/tools
python Megatron/blacklist_urls.py <path/to/URLs> <path/to/clean_urls.txt>

        Specific examples are as follows:

cd ColossalAI-Examples/language/gpt/tools/
python Megatron/blacklist_urls.py /data/data/clai/data/OpenWebText/Version\ 1/URLs/ /data/data/clai/data/OpenWebText/clean_urls.txt

        After the program runs, a clean_urls.txt file is generated at the specified path, containing 21,269,934 cleaned URL addresses.

2.3 Content download

        We then download the web page content for the cleaned URLs. The download command is as follows, where n_procs is the number of parallel download processes.

cd ColossalAI-Examples/language/gpt/tools
python download/download.py <path/to/clean_urls.txt> --n_procs 50 --output <path/to/raw.json>

        Since downloading the complete dataset takes a long time, we can download only part of the data with the following command. Here, max_urls specifies the maximum number of URLs to download, and timeout sets the timeout for each URL access. Setting a timeout lets the script quickly skip URLs that cannot be reached.

python download/download.py  /data/data/clai/data/OpenWebText/clean_urls.txt --output /data/data/clai/data/OpenWebText/raw.json --max_urls 1000 --timeout 30

        The downloaded web page content is stored in the raw.json file at the specified path. The file contains a series of records in JSON format, each corresponding to the web page content of one URL. The JSON format is {'text': text, 'url': unique_url}; an example is shown below.

{"text": "The space station looks like an airplane or a very bright star moving across the sky, except it doesn't have flashing lights or change direction. It will also be moving considerably faster than a typical airplane (airplanes generally fly at about 600 miles per hour; the space station flies at 17,500 miles per hour).\n\nBelow is a time-lapse photo of the space station moving across the sky.\n\nThe International Space Station is seen in this 30 second exposure as it flies over Elkton, VA early in the morning, Saturday, August 1, 2015. Photo Credit: NASA/Bill Ingalls\n\nVisit the NASA Johnson Flickr Photostream", "url": "http://spotthestation.nasa.gov/sightings/view.cfm?country=United_States&region=Arizona&city=Phoenix#.UvPTWWSwLpM"}

 Figure 2 Raw data in raw.json
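
        To inspect the downloaded data, the minimal sketch below reads raw.json, assuming one JSON record per line as produced by the download script, and prints the URL and text length of the first few records.

import json

# Example path from this article.
raw_path = '/data/data/clai/data/OpenWebText/raw.json'

with open(raw_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        record = json.loads(line)  # {'text': ..., 'url': ...}
        print(record['url'], len(record['text']))
        if i >= 4:  # inspect only the first five records
            break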

3 Data processing

        Related programs are located in the language/gpt/tools folder.

3.1 Remove text that is too short

        Run "python Megatron/cleanup_dataset.py <path/to/raw.json> <path/to/clean.json>" to delete data shorter than 128 tokens. A sample command is shown below. The cleanup_fix_dataset.py script provided alongside it supports more cleanup options. The cleaned data is saved in clean.json.

python Megatron/cleanup_dataset.py /data/data/clai/data/OpenWebText/raw.json  /data/data/clai/data/OpenWebText/clean.json
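
        To illustrate the idea of this step, the sketch below keeps only records whose text contains at least 128 whitespace-separated tokens; this is a simplification of cleanup_dataset.py, which applies additional cleaning rules, and the output path here is a hypothetical demo file.

import json

# Example input path from this article; the 128-token threshold and
# whitespace tokenization are simplifications of the real script.
in_path = '/data/data/clai/data/OpenWebText/raw.json'
out_path = '/data/data/clai/data/OpenWebText/clean_demo.json'

with open(in_path, 'r', encoding='utf-8') as fin, open(out_path, 'w', encoding='utf-8') as fout:
    for line in fin:
        record = json.loads(line)
        # Keep documents with at least 128 whitespace-separated tokens.
        if len(record['text'].split()) >= 128:
            fout.write(json.dumps(record, ensure_ascii=False) + '\n')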

3.2 Delete similar data

        The program uses LSH (locality-sensitive hashing) to find potentially similar data, groups documents with high similarity, keeps only one document from each group, and deletes all other similar documents.

# Find similar data
python Megatron/find_duplicates.py --inputs <path/to/clean.json> url --output <path/to/process_stage_one.json>
# Group similar data
python Megatron/group_duplicate_url.py <path/to/process_stage_one.json> <path/to/process_stage_two.json>
# Remove similar data
python Megatron/remove_group_duplicates.py <path/to/process_stage_two.json> <path/to/clean.json> <path/to/dedup.json>

        The sample commands are as follows, and the processed result is saved in dedup.json.

python Megatron/find_duplicates.py --inputs /data/data/clai/data/OpenWebText/clean.json url --output /data/data/clai/data/OpenWebText/process_stage_one.json
python Megatron/group_duplicate_url.py /data/data/clai/data/OpenWebText/process_stage_one.json /data/data/clai/data/OpenWebText/process_stage_two.json
python Megatron/remove_group_duplicates.py /data/data/clai/data/OpenWebText/process_stage_two.json /data/data/clai/data/OpenWebText/clean.json /data/data/clai/data/OpenWebText/dedup.json
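
        Conceptually, this stage measures text similarity and drops all but one document from each similar group. The actual pipeline uses LSH (MinHash) so that not every pair of documents has to be compared; the sketch below only illustrates the underlying idea with exact Jaccard similarity over character shingles, uses an arbitrary 0.5 threshold, writes to a hypothetical demo file, and is practical only for small samples.

import json
from itertools import combinations

def shingles(text, k=5):
    # Character k-shingles of a document.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

records = []
with open('/data/data/clai/data/OpenWebText/clean.json', 'r', encoding='utf-8') as f:
    for line in f:
        records.append(json.loads(line))

shingle_sets = [shingles(r['text']) for r in records]
drop = set()
# Pairwise comparison (quadratic): keep the first document of each similar pair.
for i, j in combinations(range(len(records)), 2):
    if j not in drop and jaccard(shingle_sets[i], shingle_sets[j]) > 0.5:
        drop.add(j)

with open('/data/data/clai/data/OpenWebText/dedup_demo.json', 'w', encoding='utf-8') as f:
    for idx, r in enumerate(records):
        if idx not in drop:
            f.write(json.dumps(r, ensure_ascii=False) + '\n')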

3.3 Shuffle the data order

        Use the command "shuf <path/to/dedup.json> -o <path/to/train_data.json>" to randomly shuffle the data and save it as train_data.json. The resulting data structure is identical to that in Section 2, namely {'text': text, 'url': unique_url}.

shuf /data/data/clai/data/OpenWebText/dedup.json -o /data/data/clai/data/OpenWebText/train_data.json

4 Chinese dataset processing

        The Chinese dataset uses the Yuan ("源") corpus provided by Inspur, which can be applied for at "https://air.inspur.com/home". The vocabulary file (vocab.txt) can be downloaded from "https://github.com/Shawn-Inspur/Yuan-1.0/blob/main/src/vocab.txt". The directory structure after downloading the complete Yuan data is as follows.

|--dataset
|     |--001.txt
|     |--002.txt
|     |--...
|--vocab.txt

        When training with this dataset, line 44 of train_gpt.py needs to be replaced, as shown below.

from dataset.yuan import YuanDataset
train_ds = YuanDataset(os.environ['DATA'], vocab_path='/path/to/data/vocab.txt', seq_len=gpc.config.SEQ_LEN)

5 Model training

5.1 Set training data path

        Here the OpenWebText dataset is used for training. The dataset path can be set either through an environment variable, i.e. "export DATA=/path/to/train_data.json", or by changing line 44 of train_gpt.py to "train_ds = WebtextDataset('/data/data/clai/data/OpenWebText/train_data.json', seq_len=gpc.config.SEQ_LEN)".
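
        As a sketch, the modified line 44 could read the DATA environment variable with a hard-coded fallback, as shown below; the import paths are assumptions based on the example repository's layout and the Colossal-AI 0.x API, so adjust them if your version differs.

import os

# Assumed imports: gpc follows the Colossal-AI 0.x convention, and the
# dataset class is assumed to live in the example's dataset package.
from colossalai.core import global_context as gpc
from dataset.webtext import WebtextDataset

# Read the path from the DATA environment variable, falling back to a hard-coded path.
data_path = os.environ.get('DATA', '/data/data/clai/data/OpenWebText/train_data.json')
train_ds = WebtextDataset(data_path, seq_len=gpc.config.SEQ_LEN)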

5.2 Model training

        The model training command is "colossalai run --nproc_per_node=<num_gpus> train_gpt.py --config=gpt2_configs/<config_file>". Here, num_gpus is the number of GPUs, and config selects among the GPT2 and GPT3 training parameter configurations.

        A sample command is "colossalai run --nproc_per_node=2 train_gpt.py --config=gpt2_configs/gpt2_vanilla.py". During the run, the two GPUs each occupy 4621MB of GPU memory. The training process is shown in the figure below. If an error is reported during the run, please continue reading below.

 Figure 3 Schematic diagram of GPT2 training

        The GPU memory usage and parallel training modes under other GPT2 training configurations are shown in the table below. TP, PP and DP denote the three parallel modes: Tensor Parallel, Pipeline Parallel and Data Parallel, respectively. The number of GPUs is the product of the three, i.e. TP * PP * DP, and the value of DP is calculated automatically.

 Figure 4 GPT2 training configuration
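
        For reference, the parallel settings in these config files follow the Colossal-AI configuration convention sketched below; the sizes are illustrative values rather than a copy of the repository's gpt2 configs.

# Illustrative parallel settings: 2-stage pipeline with 2-way 1D tensor parallelism.
TENSOR_PARALLEL_SIZE = 2
PIPELINE_SIZE = 2

parallel = dict(
    pipeline=PIPELINE_SIZE,
    tensor=dict(size=TENSOR_PARALLEL_SIZE, mode='1d'),
)

# With 8 GPUs in total, data parallelism is derived automatically:
# DP = 8 / (TP * PP) = 8 / (2 * 2) = 2.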

        If the error "ModuleNotFoundError: No module named 'colossalai.zero.init_ctx'" is reported at runtime, it is recommended to uninstall colossalai with pip uninstall and reinstall it with pip install colossalai. During reinstallation, avoid mirror sites or switch to a different mirror, because the package on some mirror sites may still have this problem.

        If the error "KeyError: 'SLURM_PROCID'" is reported, the running command needs to be replaced with "colossalai run --nproc_per_node=2 train_gpt.py --config=gpt2_configs/gpt2_1d.py --from_torch", that is, add --from_torch.

6 Configuration file introduction

        The configuration files are described in the official documentation at https://github.com/hpcaitech/ColossalAI-Examples/tree/main/language/gpt; they mainly set the parallel mode, the number of training iterations, the batch size, and the hidden dimension. They are not covered in detail here; subsequent articles will walk through the debugging process of the training program in detail.
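
        As an illustration of the kind of settings such a config file defines, the sketch below lists typical entries for a small GPT2 configuration; the names mirror the style of the example configs, but the values are placeholders, and the official gpt2_configs files remain the reference.

# Placeholder values for illustration only; see the official gpt2_configs files for actual settings.
BATCH_SIZE = 4            # micro batch size per data-parallel rank
NUM_EPOCHS = 60           # number of training epochs
SEQ_LEN = 1024            # input sequence length
HIDDEN_SIZE = 768         # hidden dimension (GPT2-small scale)
NUM_ATTENTION_HEADS = 12  # attention heads per layer
DEPTH = 12                # number of transformer layers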

        The GPT3 model training parameter configuration is as follows.

 Figure 5 GPT3 training configuration

