Caffe Distributed Training

https://github.com/intel/caffe/wiki/Multinode-guide

https://blog.csdn.net/qq_21368481/article/details/81257265

https://blog.csdn.net/qingsong1001/article/details/82746433

Guide to multi-node training with Intel® Distribution of Caffe*

This is an introduction to multi-node training with the Intel® Distribution of Caffe* framework. All other pages related to multi-node in this wiki are supplementary and are referred to from this guide. By the end of it, you should understand how multi-node training is implemented in Intel® Distribution of Caffe* and be able to train any topology yourself on a simple cluster. Basic knowledge of BVLC Caffe usage may be necessary to understand it fully. Also be sure to check out the performance optimization guidelines.

Introduction

To make the practical part of this guide more comprehensible, the instructions assume you are configuring a cluster of 4 nodes from scratch. You will learn how to configure such a cluster, how to compile Intel® Distribution of Caffe*, how to run training of a particular model, and how to verify that the network has actually trained.

How it works

In case you are not interested in how multi-node training in Intel® Distribution of Caffe* works and just want to run the training, please skip to the practical chapters of this wiki.

Intel® Distribution of Caffe* is designed for both single-node and multi-node operation. Here, the multi-node part is explained.

There are two general approaches to parallelization: data parallelism and model parallelism. The approach used in Intel® Distribution of Caffe* is data parallelism.

Data parallelism

The data parallelization technique runs training on different batches of data on each of the nodes. The data is split among all nodes, but the same model is used on each. This means that the total batch size in a single iteration is equal to the sum of the individual batch sizes of all nodes. For example, suppose a network is trained on 8 nodes, each with a batch size of 128. The total batch size in a single iteration of the Stochastic Gradient Descent algorithm is then 8*128=1024.

Intel® Distribution of Caffe* with MLSL averages the gradients computed by the nodes using the Allreduce operation.
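If you want a feel for raw Allreduce performance on your cluster, the run_intelcaffe.sh helper described later in this guide has a benchmark mode. A sketch, assuming a host file as created later in this guide and Intel MPI's IMB-MPI1 benchmark installed and present in the system path:

$ ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts --mode none \
      --benchmark mpi --network tcp --netmask eth0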

Distribution of data

One approach is to divide your training data set into disjoint subsets of roughly equal size and distribute one subset to each node used for training. Run the multi-node training with the data layer prepared accordingly, which means either preparing a separate proto configuration per node or placing each subset under exactly the same path on every node; see the sketch below.
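For example, with 4 nodes you could push hypothetical per-node subset directories from the master with rsync, keeping the destination path identical on every node (the master keeps its own subset under the same local path):

$ rsync -a /home/data/train_part2/ 192.161.32.2:/home/data/imagenet/train/
$ rsync -a /home/data/train_part3/ 192.161.32.3:/home/data/imagenet/train/
$ rsync -a /home/data/train_part4/ 192.161.32.4:/home/data/imagenet/train/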

An easier approach is to simply distribute the full data set to all nodes and configure the data layer to draw a different subset on each node. Remember to set shuffle: true for the training phase in the prototxt. Since each node has its own unique randomizing seed, it will effectively draw a unique image subset.
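For example, the relevant fragment of the training data layer (this is a piece of the full train_val.prototxt shown later in this guide) looks as follows:

image_data_param {
  source: "/home/data/ilsvrc12_train_lmdb"
  batch_size: 256
  shuffle: true
}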

Communication

Intel® Distribution of Caffe* utilizes the Intel® Machine Learning Scaling Library (MLSL), which provides communication primitives for data parallelism and model parallelism, communication patterns for SGD and its variants (AdaGrad, Momentum, etc.), and distributed weight updates. It is optimized for Intel® Xeon® and Intel® Xeon Phi™ processors and supports Intel® Omni-Path Architecture, InfiniBand, and Ethernet. Refer to the MLSL wiki or the "MLSL Developer Guide and Reference" for more details on the library.

Snapshots

Snapshots are saved only by the node hosting the root process (rank 0). In order to resume training from a snapshot, the snapshot file has to be populated across all nodes participating in the training.
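For example, to populate the snapshot files from the master to all slaves before resuming (using the ansible inventory configured later in this guide; the iteration number and location are hypothetical):

$ ansible ourcluster -m synchronize -a 'src=~/intelcaffe/googlenet_iter_100000.solverstate dest=~/intelcaffe/'
$ ansible ourcluster -m synchronize -a 'src=~/intelcaffe/googlenet_iter_100000.caffemodel dest=~/intelcaffe/'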

Test phase during training

If the test phase is enabled in the solver's protobuf file, all nodes carry out the tests and the results are aggregated by the Allreduce operation. The validation set needs to be present on every machine whose solver protobuf file specifies a test phase. This is important to remember when you want to use the same solver file on all machines instead of working with multiple protobuf files.
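For illustration, the test phase is controlled by the usual Caffe solver fields. A minimal sketch with illustrative values; with the test batch size of 50 used later in this guide, test_iter: 1000 covers a 50000-image validation set:

test_iter: 1000
test_interval: 10000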

Configuring Cluster for Intel® Distribution of Caffe*

This chapter explains how to configure a cluster and which components to install in order to build Intel® Distribution of Caffe* and start distributed training using the Intel® Machine Learning Scaling Library.

Hardware and software configuration

Hardware assumptions for this guide: 4 machines with IP addresses in the range 192.161.32.1 to 192.161.32.4 make up the cluster. Start from a fresh installation of CentOS 7.2 64-bit. The OS image can be downloaded free of charge from the official website; the Minimal ISO is enough. Install the OS on each node (all 4 in our example). Next, upgrade to the latest version of all packages (do this on each node):

# yum upgrade

TIP: You can also execute yum -y upgrade to suppress the prompt asking for confirmation of the operation (unattended upgrade).

Preparing the system

Before installing Intel® Distribution of Caffe* you need to install the prerequisites. Start by choosing the master machine (192.161.32.1 in our example).

On each machine install “Extra Packages for Enterprise Linux”:

# yum install epel-release
# yum clean all

On master machine install "Development Tools" and ansible:

# yum groupinstall "Development Tools"
# yum install ansible

Configuring ansible and ssh

Configure ansible's inventory on the master machine by adding sections ourmaster and ourcluster in /etc/ansible/hosts and filling in the master and slave IPs:

[ourmaster]
192.161.32.1
[ourcluster]
192.161.32.[2:4]

On each slave machine configure SSH authentication using the master machine's public key, so that you can log in with ssh HOSTNAME without a password. Generate an RSA key on the master machine:

$ ssh-keygen -t rsa

And copy the public part of the key to slave machines:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub 192.161.32.2
$ ssh-copy-id -i ~/.ssh/id_rsa.pub 192.161.32.3
$ ssh-copy-id -i ~/.ssh/id_rsa.pub 192.161.32.4

Verify that ansible works by running the ping module from the master machine. The slave machines should respond.

$ ansible ourcluster -m ping

Example output:

192.161.32.2 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
192.161.32.3 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
192.161.32.4 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

The master machine can also ping itself with ansible ourmaster -m ping, and the entire inventory with ansible all -m ping.

Installing tools

On the master machine use ansible to install the packages listed below on the entire cluster by running the following command.

# ansible all -m shell -a 'yum -y install python-devel boost boost-devel cmake numpy \
  numpy-devel gflags gflags-devel glog glog-devel protobuf protobuf-devel hdf5 \
  hdf5-devel lmdb lmdb-devel leveldb leveldb-devel snappy-devel opencv opencv-devel'

Optionally, you can install additional system tools you may find useful.

# ansible all -m shell -a 'yum install -y mc cpuinfo htop tmux screen iftop iperf \
  vim wget'

You might also be required to turn off the firewall on each node (refer to Firewalls and MPI for more information).

# ansible all -m shell -a 'systemctl stop firewalld.service'
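Optionally, to keep the firewall from coming back after a reboot, you can also disable the service permanently:

# ansible all -m shell -a 'systemctl disable firewalld.service'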

The cluster is ready to deploy binaries of Intel® Distribution of Caffe*. Let’s build it now.

Building Intel® Distribution of Caffe*

This chapter explains how to build Intel® Distribution of Caffe* for multi-node (distributed) training of neural networks.

Getting Intel® Distribution of Caffe* Source Code

On master machine execute the following git command in order to obtain the latest snapshot of Intel® Distribution of Caffe* including multi-node support for distributed training.

$ git clone https://github.com/intel/caffe.git intelcaffe

Note: The build of Intel® Distribution of Caffe* will trigger the Intel® Math Kernel Library for Machine Learning (MKLML) and the Intel® Machine Learning Scaling Library (MLSL) to be downloaded to the intelcaffe/external/mkl/ and intelcaffe/external/mlsl/ directories and automatically configured.

Building from Makefile

This section covers only the portion required to build Intel® Distribution of Caffe* with multi-node support using the Makefile. Please refer to the Caffe documentation for general information on how to build Caffe using the Makefile.

Start by changing the working directory to the location where the Intel® Distribution of Caffe* repository has been cloned (e.g. ~/intelcaffe).

$ cd ~/intelcaffe

Make a copy of Makefile.config.example and name it Makefile.config:

$ cp Makefile.config.example Makefile.config

Open Makefile.config in your favorite editor and make sure the USE_MLSL variable is uncommented (it is uncommented by default).

# Intel(r) Machine Learning Scaling Library (uncomment to build with MLSL)
USE_MLSL := 1

Execute make command to build Intel® Distribution of Caffe* with multi-node support.

$ make -j <number_of_physical_cores> -k

Building from CMake

This section covers only the portion required to build Intel® Distribution of Caffe* with multi-node support using CMake. Please refer to the Caffe documentation for general information on how to build Caffe using CMake. Start by changing the working directory to the location where the Intel® Distribution of Caffe* repository has been cloned (e.g. ~/intelcaffe).

$ cd ~/intelcaffe

Create a build directory and change into it.

$ mkdir build
$ cd build

Execute the following CMake command in order to prepare the build:

$ cmake .. -DBLAS=mkl -DUSE_MLSL=1 -DCPU_ONLY=1

Execute make command to build Intel® Distribution of Caffe* with multi-node support.

$ make -j <number_of_physical_cores> -k
$ cd ..

Populating Caffe Binaries across Cluster Nodes

After a successful build, synchronize the intelcaffe directories onto the slave machines. Note that if you built intelcaffe in an NFS directory accessible by all nodes (master and slaves), you do not need to do the synchronization below.

$ ansible ourcluster -m synchronize -a 'src=~/intelcaffe dest=~/'

Running Multi-node Training with Intel® Distribution of Caffe*

Instructions on how to train CIFAR10 and GoogLeNet are explained in more detail in the Multi-node CIFAR10 tutorial and the Multi-node GoogLeNet tutorial. It is recommended to complete the CIFAR10 tutorial before you proceed. Here, GoogLeNet will be trained on a 4-node cluster. If you want to learn more about GoogLeNet training, see the tutorial mentioned above as well.

Before you can train anything, you need to prepare the dataset. It is assumed that you have already downloaded the ImageNet training and validation datasets and that they are stored on each node in the /home/data/imagenet/train directory (training set) and /home/data/imagenet/val (validation set). For details, see the Data Preparation section of the BVLC Caffe examples at http://caffe.berkeleyvision.org/gathered/examples/imagenet.html. You can use your own datasets as well. We prefer to transform the datasets into LMDB format using scripts like examples/imagenet/create_imagenet.sh.
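As a sketch, the variables typically edited at the top of examples/imagenet/create_imagenet.sh before running it might look like this (paths follow the assumptions above; check the script in your checkout, as variable names may differ between versions):

EXAMPLE=examples/imagenet
DATA=data/ilsvrc12
TOOLS=build/tools
TRAIN_DATA_ROOT=/home/data/imagenet/train/
VAL_DATA_ROOT=/home/data/imagenet/val/
RESIZE=true

Then run it from the intelcaffe directory:

$ ./examples/imagenet/create_imagenet.sh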

The next step is to create a machine file ~/mpd.hosts on the master node to control the placement of MPI processes across the machines:

192.161.32.1
192.161.32.2
192.161.32.3
192.161.32.4

Update the data source paths (the lmdb paths in this example) in your model file models/intel_optimized_models/multinode/googlenet_4nodes/train_val.prototxt:

 name: "GoogleNet"
 layer {
   name: "data"
   type: "ImageData"
   top: "data"
   top: "label"
   include {
   phase: TRAIN
   }
   transform_param {
   mirror: true
   crop_size: 224
   mean_value: 104
   mean_value: 117
   mean_value: 123
   }
   image_data_param {
   source: "/home/data/ilsvrc12_train_lmdb"
   batch_size: 256
   shuffle: true
   }
 }
 layer {
 name: "data"
 type: "ImageData"
 top: "data"
 top: "label"
 include {
 phase: TEST
   }
   transform_param {
   crop_size: 224
   mean_value: 104
   mean_value: 117
   mean_value: 123
   }
   image_data_param {
   source: "/home/data/ilsvrc12_val_lmdb"
   batch_size: 50
   new_width: 256
   new_height: 256
   }
 }

Synchronize the intelcaffe directories once more and change your working directory to intelcaffe, as shown below.
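As before, this can be done with ansible's synchronize module:

$ ansible ourcluster -m synchronize -a 'src=~/intelcaffe dest=~/'
$ cd ~/intelcaffe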

For intelcaffe multi-node training there is a mature helper script, scripts/run_intelcaffe.sh; all of its available parameters are listed below. Usually you only need a subset of them to train from scratch, unless you are looking for better training scalability or additional functions such as model re-training or network environment benchmarking. If you encounter any issues, you are welcome to file an issue against the repository.

./scripts/run_intelcaffe.sh
Usage:
  ./scripts/run_intelcaffe.sh [--hostfile host_file] [--solver solver_file]
               [--weights weight_file] [--num_omp_threads num_omp_threads]
               [--network opa/tcp] [--netmask tcp_netmask] [--debug on/off]
               [--mode train/time/none] [--benchmark all/qperf/mpi/none]
               [--iteration iter] [--model_file deploy.prototxt]
               [--snapshot snapshot.caffemodel]
               [--num_mlsl_servers num_mlsl_servers]
               [--internal_thread_pin on/off]
               [--output output_folder]
               [--mpibench_bin mpibench_bin]
               [--mpibench_param mpibench_param]
               [--caffe_bin  caffe_binary_path]
               [--cpu cpu_model]
               [--msg_priority on/off] [--msg_priority_threshold 10000]
               [--mpi_iallreduce_algo mpi_iallreduce_algo]
               [--ppn ppn]

  Parameters:
    hostfile: host file includes list of nodes. Only used if you're running with multinode
    solver: need to be specified a solver file if mode is train
    weight_file: weight file for finetuning
    num_omp_threads: number of OpenMP threads
    network: opa(default), tcp
    netmask: only used if network is tcp, which is the name of network interface card of the host node you're using
    debug: off(default). MLSL debug information is output if it's on
    mode: train(default), time, none(not to run caffe test)
    benchmark: none(disabled by default). Includes qperf, all-reduce performance
      Dependency: user needs to install qperf, Intel MPI library (including IMB-MPI1);
                  and add them in system path.
    iteration and model_file: only used if mode is time (caffe time)
    snapshot: it's specified if train is resumed
    num_mlsl_servers: number of MLSL ep servers
    internal_thread_pin: on(default). pin internal threads to 2 CPU cores for reading data.
    output_folder: output folder for storing results
    mpibench_bin: IMB-MPI1 (default). relative path of binary of mpi benchmark.
    mpibench_param: allreduce (default). parameter of mpi benchmark.
    caffe_binary_path: path of caffe binary.
    cpu_model: specify cpu model and use the optimal settings if the CPU is not
               included in supported list. Value: bdw, knl, skx and knm.
               bdw - Broadwell, knl - Knights Landing, skx - Skylake,
               knm - Knights Mill.
    msg_priority: off (default). Enable/disable message priority in MLSL.
    msg_priority_threshold: 10000 (default), if message priority is on.
    mpi_iallreduce_algo: adjust MPI iallreduce algorithms for synchronizing gradients in nodes.
    ppn: process per node. default is 1.

Now start the training process with the following command:

$ ./scripts/run_intelcaffe.sh --hostfile ~/mpd.hosts --solver \
      models/intel_optimized_models/multinode/googlenet_4nodes/solver.prototxt --network \
      tcp --netmask eth0 --benchmark none

The log from the training process will be written to the result-{date}/outputCluster-{cpu_model}-{num_nodes}.txt file, and the trained model weights will be saved to the path specified in the solver file; here that is models/intel_optimized_models/multinode/googlenet_4nodes/googlenet_iter_350000.caffemodel.
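To follow the training progress while it is running, you can tail the log file; a sketch, assuming the file name pattern above:

$ tail -f result-*/outputCluster-*.txt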

Test the trained network

When the training is finished, you can test how well your network has trained with the following command:

$ ./build/tools/caffe test --model=models/intel_optimized_models/multinode/googlenet_4nodes/train_val.prototxt \
  --weights=models/intel_optimized_models/multinode/googlenet_4nodes/googlenet_iter_350000.caffemodel --iterations=1000

Look at the bottom lines of the output from the above command, which contain loss3/top-1 and loss3/top-5. The values should be around loss3/top-1 = 0.69 and loss3/top-5 = 0.886.

For more information about caffe test, visit the Caffe interfaces page at http://caffe.berkeleyvision.org/tutorial/interfaces.html.


Reprinted from blog.csdn.net/u011808673/article/details/85336434