Shengsi MindSpore (Huawei's open-source, self-developed AI framework) application case: a basic example of distributed parallel training (CPU)

This tutorial explains how to use MindSpore to perform data-parallel distributed training on the CPU platform to improve training efficiency.
Complete sample code: distributed_training_cpu

The directory structure is as follows:

└─sample_code
    ├─distributed_training_cpu
    │      resnet.py
    │      resnet50_distributed_training.py
    │      run.sh

Among them, resnet.py defines the ResNet network, resnet50_distributed_training.py defines the training, and run.sh is the distributed training launch script.

If you are interested in MindSpore, you can follow the Shengsi MindSpore community


1. Environment preparation

1. Enter the ModelArts official website

The cloud platform helps users quickly create and deploy models and manage full-cycle AI workflows. Select the cloud platform below to start using Shengsi MindSpore, get the installation command, install the MindSpore 2.0.0-alpha version, and enter the ModelArts official website from the Shengsi tutorial.


Choose CodeLab below to experience it immediately


Wait for the environment to be built


2. Use ModelArts to experience examples

Enter the Shengsi MindSpore official website and click Install at the top.


Get the installation command.

Open a Terminal in ModelArts and enter the installation command

conda install mindspore=2.0.0a0 -c mindspore -c conda-forge
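
After the installation finishes, an optional sanity check is to import MindSpore and run its built-in self-check; the sketch below assumes you run it from the same conda environment.

# Optional sanity check: confirms MindSpore imports and can execute a simple computation.
import mindspore
mindspore.run_check()
# It should report the installed version and that MindSpore was installed successfully.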


Then click Clone a Repository in the sidebar and enter

https://gitee.com/mindspore/docs.git


You can see that the docs project was imported successfully


2. Preparation

1. Download the dataset

This example uses the CIFAR-10 dataset, which consists of 32×32 color images in 10 categories. Each category contains 6,000 images; the training set has 50,000 images in total and the test set has 10,000.
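
If your environment has outbound network access, you can also fetch the archive directly from the terminal instead of uploading it manually (a hedged sketch; the URL below is the official CIFAR-10 binary-version download):

# Download the binary version of CIFAR-10, which is the format this sample expects.
wget https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz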

If you downloaded the dataset locally, upload it to the environment; after decompression, the folder is cifar-10-batches-bin.


tar -zxvf cifar-10-binary.tar.gz 


Note: If you are using ModelArts, there is no need to configure the following content; you can skip ahead to Section 5, Start training, and train the model directly.

2. Configure the distributed environment

Data parallelism on the CPU mainly comes in two modes: single-machine multi-node and multi-machine multi-node (a training process can be understood as a node). Before running the training script, a networking environment needs to be built, which mainly involves configuring environment variables and calling the initialization interface in the training script.

The environment variables are configured as follows:


export MS_WORKER_NUM=8                # Worker number
export MS_SCHED_HOST=127.0.0.1        # Scheduler IP address
export MS_SCHED_PORT=6667             # Scheduler port
export MS_ROLE=MS_WORKER              # The role of this node: MS_SCHED represents the scheduler, MS_WORKER represents the worker

Where:

  • MS_WORKER_NUM: the number of worker nodes. In a multi-machine scenario, this is the sum of the worker nodes on all machines (a two-machine sketch follows this list).
  • MS_SCHED_HOST: the scheduler node's IP address.
  • MS_SCHED_PORT: the scheduler node's service port, used to receive the IPs and service ports reported by the worker nodes and then send all the collected worker IPs and ports back to every worker.
  • MS_ROLE: the node type, either worker (MS_WORKER) or scheduler (MS_SCHED). Whether in single-machine multi-node or multi-machine multi-node mode, one scheduler node must be configured for networking.
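
As a concrete illustration, here is a hedged sketch of the environment variables for a two-machine setup with 4 worker processes per machine; the IP address 192.168.1.10 and the per-machine split are assumptions for illustration, not part of the sample.

# Machine A (192.168.1.10): runs the scheduler plus 4 workers
export MS_WORKER_NUM=8                # total workers across both machines
export MS_SCHED_HOST=192.168.1.10     # scheduler IP, must be reachable from machine B
export MS_SCHED_PORT=6667
# launch 1 process with MS_ROLE=MS_SCHED and 4 processes with MS_ROLE=MS_WORKER

# Machine B: runs 4 workers only, pointing at the scheduler on machine A
export MS_WORKER_NUM=8                # same total on every machine
export MS_SCHED_HOST=192.168.1.10
export MS_SCHED_PORT=6667
export MS_ROLE=MS_WORKER              # launch 4 worker processes here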

The initialization interface call in the training script is as follows:


import mindspore as ms
from mindspore.communication import init

ms.set_context(mode=ms.GRAPH_MODE, device_target="CPU")
ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)
ms.set_ps_context(enable_ssl=False)
init()

Where:

  • ms.set_context(mode=ms.GRAPH_MODE, device_target="CPU"): specifies graph mode (parallelism is not supported in PyNative mode on the CPU) and sets the device to CPU.
  • ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True): specifies the data parallel mode. gradients_mean=True means the gradients are averaged after reduction; currently, gradient reduction on the CPU only supports summation.
  • ms.set_ps_context: configures secure encrypted communication. It can be enabled via ms.set_ps_context(enable_ssl=True); the default is False, which disables secure encrypted communication.
  • init: initializes the node; successful initialization indicates that the networking succeeded (a small verification sketch follows this list).
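
To confirm that the networking works before launching the full ResNet training, the following minimal, hypothetical check script (not part of the sample code) can be launched the same way as the training script, i.e. one scheduler process plus MS_WORKER_NUM worker processes under the environment variables above. Each worker should print its rank and the cluster size once init() returns; the scheduler process only handles networking.

# verify_networking.py -- hypothetical helper, launched like the training script
import mindspore as ms
from mindspore.communication import init, get_rank, get_group_size

ms.set_context(mode=ms.GRAPH_MODE, device_target="CPU")
ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)
ms.set_ps_context(enable_ssl=False)
init()  # returns once the scheduler has collected all workers

# Each worker reports its logical rank and the total number of workers.
print("rank {} of {} is ready".format(get_rank(), get_group_size()))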

3. Load the dataset

During distributed training, datasets are imported in a data-parallel manner. Below, we take CIFAR-10 as an example of importing a dataset in data parallel mode; data_path is the dataset path, that is, the path of the cifar-10-batches-bin folder.

The sample code is as follows


import mindspore as ms
import mindspore.dataset as ds
import mindspore.dataset.vision as vision
import mindspore.dataset.transforms as transforms
from mindspore.communication import get_rank, get_group_size

def create_dataset(data_path, repeat_num=1, batch_size=32):
    """Create training dataset"""
    resize_height = 224
    resize_width = 224
    rescale = 1.0 / 255.0
    shift = 0.0

    # get rank_id and rank_size
    rank_size = get_group_size()
    rank_id = get_rank()
    data_set = ds.Cifar10Dataset(data_path, num_shards=rank_size, shard_id=rank_id)

    # define map operations
    random_crop_op = vision.RandomCrop((32, 32), (4, 4, 4, 4))
    random_horizontal_op = vision.RandomHorizontalFlip()
    resize_op = vision.Resize((resize_height, resize_width))
    rescale_op = vision.Rescale(rescale, shift)
    normalize_op = vision.Normalize((0.4465, 0.4822, 0.4914), (0.2010, 0.1994, 0.2023))
    changeswap_op = vision.HWC2CHW()
    type_cast_op = transforms.TypeCast(ms.int32)

    c_trans = [random_crop_op, random_horizontal_op]
    c_trans += [resize_op, rescale_op, normalize_op, changeswap_op]

    # apply map operations on images
    data_set = data_set.map(operations=type_cast_op, input_columns="label")
    data_set = data_set.map(operations=c_trans, input_columns="image")

    # apply shuffle operations
    data_set = data_set.shuffle(buffer_size=10)

    # apply batch operations
    data_set = data_set.batch(batch_size=batch_size, drop_remainder=True)

    # apply repeat operations
    data_set = data_set.repeat(repeat_num)

    return data_set

Different from the stand-alone case, the num_shards and shard_id parameters need to be passed when constructing the Cifar10Dataset; they correspond to the number of worker nodes and the logical rank of the current node, respectively, and can be obtained through the following framework interfaces:

  • get_group_size: Get the number of worker nodes in the cluster.
  • get_rank: Get the logical serial number of the current worker node in the cluster.

When loading datasets in data parallel mode, it is recommended to specify the same dataset files for every node. If the datasets loaded by different nodes differ, the calculation accuracy may be affected.
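
As a usage sketch (the path below is a placeholder), each worker can check how many batches its shard yields. With 8 workers, a batch size of 32 and drop_remainder=True, each worker should see roughly 50000 / 8 / 32 ≈ 195 batches per epoch.

# Hedged usage sketch: run inside an already-initialized worker process,
# since create_dataset calls get_rank/get_group_size internally.
dataset = create_dataset("/path/to/cifar-10-batches-bin", batch_size=32)
print("batches per worker per epoch:", dataset.get_dataset_size())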

4. Define the model

In data parallel mode, the network is defined in the same way as in the stand-alone case; you can refer to the ResNet network sample script.

For the definitions of the optimizer, the loss function and the training model, please refer to the training model definition.

For the complete training script, refer to the sample code; the training startup code is listed below.


import os
import mindspore as ms
import mindspore.nn as nn
from mindspore import train
from mindspore.communication import init
from resnet import resnet50

def train_resnet50_with_cifar10(epoch_size=10):
    """Start the training"""
    loss_cb = train.LossMonitor()
    data_path = os.getenv('DATA_PATH')
    dataset = create_dataset(data_path)
    batch_size = 32
    num_classes = 10
    net = resnet50(batch_size, num_classes)
    loss = SoftmaxCrossEntropyExpand(sparse=True)
    opt = nn.Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01, 0.9)
    model = ms.Model(net, loss_fn=loss, optimizer=opt)
    model.train(epoch_size, dataset, callbacks=[loss_cb], dataset_sink_mode=True)


if __name__ == "__main__":
    ms.set_context(mode=ms.GRAPH_MODE, device_target="CPU")
    ms.set_auto_parallel_context(parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)
    ms.set_ps_context(enable_ssl=False)
    init()
    train_resnet50_with_cifar10()

The create_dataset and SoftmaxCrossEntropyExpand interfaces in the script are referenced from distributed_training_cpu, and the resnet50 interface is referenced from the ResNet network sample script.
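
If you only want a runnable script without copying SoftmaxCrossEntropyExpand from the sample, a hedged alternative is to swap in MindSpore's built-in sparse softmax cross-entropy loss, which accepts the integer CIFAR-10 labels directly. Note that this replaces the sample's custom loss with a built-in one.

import mindspore.nn as nn

# Built-in stand-in for the sample's custom SoftmaxCrossEntropyExpand loss.
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")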

5. Start training

On the CPU platform, take a single machine with 8 nodes as an example to perform distributed training.

Enter the /home/ma-user/work/docs/docs/sample_code/distributed_training_cpu directory.


Start the training with the following shell script, using the command bash run.sh /dataset/cifar-10-batches-bin; you can see that the training started successfully:


(PyTorch-1.8) [ma-user distributed_training_cpu]$bash run.sh cifar-10-batches-bin
==============================================================================================================
Please run the script with dataset path, such as: 
bash run.sh DATA_PATH
For example: bash run.sh /path/dataset
It is better to use the absolute path.
==============================================================================================================
scheduler start success!
worker 0 start success with pid 8240
worker 1 start success with pid 8241
worker 2 start success with pid 8242
worker 3 start success with pid 8243
worker 4 start success with pid 8244
worker 5 start success with pid 8245
worker 6 start success with pid 8246
worker 7 start success with pid 8247

#!/bin/bash
# run data parallel training on CPU

echo "=============================================================================================================="
echo "Please run the script with dataset path, such as: "
echo "bash run.sh DATA_PATH"
echo "For example: bash run.sh /path/dataset"
echo "It is better to use the absolute path."
echo "=============================================================================================================="
set -e
DATA_PATH=$1
export DATA_PATH=${DATA_PATH}

export MS_WORKER_NUM=8
export MS_SCHED_HOST=127.0.0.1
export MS_SCHED_PORT=8117

# Launch 1 scheduler.
export MS_ROLE=MS_SCHED
python3 resnet50_distributed_training.py >scheduler.txt 2>&1 &
echo "scheduler start success!"

# Launch 8 workers.
export MS_ROLE=MS_WORKER
for((i=0;i<${MS_WORKER_NUM};i++));
do
    python3 resnet50_distributed_training.py >worker_$i.txt 2>&1 &
    echo "worker ${i} start success with pid ${
     
     !}"
done

Among them, resnet50_distributed_training.py is the defined training script.

For multi-machine multi-node scenarios, the corresponding worker nodes need to be started on each machine in the same way to participate in training, but there is only one scheduler node, and it only needs to be started on one of the machines (the one whose IP is MS_SCHED_HOST).

The MS_WORKER_NUM value means that exactly that many worker nodes must be started to participate in training; otherwise, the networking will fail.

Although the training script is also started for the scheduler node, the scheduler is mainly used for networking and will not participate in training.
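
For reference, a hypothetical launcher for a second machine in a two-machine run (4 of the 8 workers here, no scheduler) could look like the sketch below; the scheduler IP and the data path are assumed placeholders, and the port matches run.sh.

#!/bin/bash
# Hypothetical second-machine launcher: workers only, no scheduler.
export DATA_PATH=/path/to/cifar-10-batches-bin
export MS_WORKER_NUM=8                 # total workers across all machines
export MS_SCHED_HOST=192.168.1.10      # IP of the machine running the scheduler (placeholder)
export MS_SCHED_PORT=8117
export MS_ROLE=MS_WORKER
for((i=4;i<8;i++));
do
    python3 resnet50_distributed_training.py >worker_$i.txt 2>&1 &
    echo "worker ${i} started with pid ${!}"
done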

After training for a while, open the worker_0 log; the training information is as follows:


(PyTorch-1.8) [ma-user distributed_training_cpu]$tail -f worker_0.txt 

……
epoch: 1 step: 1, loss is 1.4686084
epoch: 1 step: 2, loss is 1.3278534
epoch: 1 step: 3, loss is 1.4246798
epoch: 1 step: 4, loss is 1.4920032
epoch: 1 step: 5, loss is 1.4324203
epoch: 1 step: 6, loss is 1.432581
epoch: 1 step: 7, loss is 1.319618

Origin blog.csdn.net/qq_46207024/article/details/129665922