Image segmentation suite PaddleSeg comprehensive analysis (1) train.py code interpretation

First of all, congratulations to the Baidu team for winning the NeurIPS 2020 challenge championship: https://www.jiqizhixin.com/articles/2020-12-09-2 .
The image segmentation suite PaddleSeg, developed on top of the PaddlePaddle deep learning framework, was used in this competition. Starting from this article, I will continuously update the "Image Segmentation Suite PaddleSeg Comprehensive Analysis" series. Due to my limited personal level, please forgive any errors, thank you.

PaddleSeg is an end-to-end image segmentation development kit developed by Baidu on top of its own PaddlePaddle framework. It contains a variety of mainstream segmentation networks. PaddleSeg is designed in a modular way: models can be assembled through configuration files, which helps developers complete model training and deployment quickly and easily without a deep understanding of image segmentation principles. However, when you need to modify and optimize a model, you still need a solid understanding of both image segmentation principles and the PaddleSeg suite itself. The main content of this article is an interpretation of the PaddleSeg code, intended to help developers further understand the principles of image segmentation and how PaddleSeg implements them. This article covers only PaddleSeg's dynamic graph implementation.

This code interpretation is based on the PaddleSeg dynamic graph version V2.0.0-rc. The source code of the PaddleSeg suite can be downloaded from GitHub with the following command:

!git clone https://github.com/PaddlePaddle/PaddleSeg.git

The PaddleSeg directory contains the following subdirectories and files:

  • configs: configuration files for the different neural networks.
  • contrib: configurations and data for real-world cases.
  • legacy: static graph version code; it is only maintained and receives no new features.
  • docs: documentation.
  • paddleseg: the PaddleSeg core code, including training, evaluation, inference and other modules.
  • tools: utility scripts.
  • train.py: the training entry file.
  • val.py: the model evaluation file.
  • predict.py: the prediction file.

This article series is roughly divided into the following 7 parts:

1. train.py code interpretation: mainly explains the code of the PaddleSeg training entry file, covering how arguments are parsed, how training is started, and what resources are prepared for training.
   Image segmentation suite PaddleSeg comprehensive analysis (1) train.py code interpretation
2. Config code interpretation: mainly explains the code of the Config class. The Config class is instantiated by train.py, and the config object is built from the configuration file specified when running train.py.
   Image segmentation suite PaddleSeg comprehensive analysis (2)
3. Dataset code interpretation: mainly explains the Dataset class. Each dataset is abstracted as a class that inherits from the Dataset class and implements its interface, building the file list used for training.
   Image segmentation suite PaddleSeg comprehensive analysis (3)
4. Data augmentation code interpretation: mainly explains some common algorithms for data preprocessing and augmentation.
   Image segmentation suite PaddleSeg comprehensive analysis (4)
5. Model and backbone code interpretation: mainly explains commonly used models, backbone networks and their algorithms.
   Image segmentation suite PaddleSeg comprehensive analysis (5)
   Image segmentation suite PaddleSeg comprehensive analysis (6)
6. Loss function code interpretation: mainly explains the code and algorithms of commonly used loss functions.
   Image segmentation suite PaddleSeg comprehensive analysis (7)
7. Evaluation code interpretation: explains the code and methods used to evaluate model performance.

1. train.py code interpretation

Neural network model training is carried out through train.py, which is the core entry code of PaddleSeg.

Let's first go through the preparations that are made before training starts.

You can quickly start a training task with the following command.

python train.py --config configs/quick_start/bisenet_optic_disc_512x512_1k.yml

The --config parameter in the command specifies the configuration file for this training run. For a detailed description of the configuration file, please refer to part 2 of this series.
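For orientation, a PaddleSeg configuration file is a YAML file that declares the dataset, the data transforms, the optimizer, the loss and the model. The sketch below shows the typical overall structure; the field values are illustrative placeholders rather than the exact contents of bisenet_optic_disc_512x512_1k.yml:

batch_size: 4
iters: 1000

train_dataset:
  type: OpticDiscSeg           # dataset class name, looked up via the DATASETS manager
  dataset_root: data/optic_disc_seg
  transforms:                  # the data augmentation pipeline
    - type: Resize
      target_size: [512, 512]
    - type: Normalize
  mode: train

optimizer:
  type: sgd

learning_rate:
  value: 0.01
  decay:
    type: poly
    end_lr: 0

loss:
  types:
    - type: CrossEntropyLoss
  coef: [1]

model:
  type: BiSeNetV2              # model class name, looked up via the MODELS manager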

When the train.py script starts executing, several packages are imported, as follows:

from paddleseg.cvlibs import manager, Config
from paddleseg.utils import get_sys_env, logger
from paddleseg.core import train
  • When the manager module is imported, five ComponentManager objects are created: MODELS, BACKBONES, DATASETS, TRANSFORMS and LOSSES. These five ComponentManagers behave like dictionaries and maintain all the corresponding classes in the suite, such as the FCN class, the ResNet class, and so on, so that each class can be looked up by its name (see the sketch after this list).
  • When train.py runs, a config object is created:
cfg = Config(
    args.cfg,
    learning_rate=args.learning_rate,
    iters=args.iters,
    batch_size=args.batch_size)

When the config object is created, the classes specified in the configuration file are looked up through the manager, and objects such as the model and the losses are instantiated.

  • train.py calls the train function, passing members of config as actual parameters. The train function uses the members of config to complete the training work.
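To make the lookup mechanism concrete, here is a minimal sketch of the idea behind ComponentManager. This is a simplified illustration rather than the actual paddleseg.cvlibs.manager source; the class name SimpleComponentManager and its internals are assumptions made for this sketch:

# A simplified, dictionary-like registry illustrating the ComponentManager idea.
class SimpleComponentManager:
    def __init__(self, name):
        self.name = name
        self._components = {}  # maps a class name to the class itself

    def add_component(self, cls):
        # Used as a decorator: register a class under its own name.
        self._components[cls.__name__] = cls
        return cls

    def __getitem__(self, class_name):
        return self._components[class_name]

MODELS = SimpleComponentManager('models')

@MODELS.add_component
class FCN:
    def __init__(self, num_classes=2):
        self.num_classes = num_classes

# A Config-like object can now look up a class by the name string found in the
# YAML file and instantiate it with the parameters given there:
model_cls = MODELS['FCN']
model = model_cls(num_classes=2)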

Let's now interpret train.py in detail, starting with its entry code:

if __name__ == '__main__':
    # Parse the arguments passed when running train.py
    args = parse_args()
    # Call the main function.
    main(args)

First, look at the first line of code:

args = parse_args()

The implementation of parse_args() is as follows (the function definition, the argparse parser creation and the return statement, which were omitted above, are restored here so that the snippet is complete):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description='Model training')
    # Path to the configuration file
    parser.add_argument(
        "--config", dest="cfg", help="The config file.", default=None, type=str)
    # Total number of training iterations
    parser.add_argument(
        '--iters',
        dest='iters',
        help='iters for training',
        type=int,
        default=None)
    # Batch size
    parser.add_argument(
        '--batch_size',
        dest='batch_size',
        help='Mini batch size of one gpu or cpu',
        type=int,
        default=None)
    # Learning rate
    parser.add_argument(
        '--learning_rate',
        dest='learning_rate',
        help='Learning rate',
        type=float,
        default=None)
    # Interval (in iterations) at which the model is saved
    parser.add_argument(
        '--save_interval',
        dest='save_interval',
        help='How many iters to save a model snapshot once during training.',
        type=int,
        default=1000)
    # Path of the model to resume training from, if training needs to be resumed
    parser.add_argument(
        '--resume_model',
        dest='resume_model',
        help='The path of resume model',
        type=str,
        default=None)
    # Directory in which model snapshots are saved
    parser.add_argument(
        '--save_dir',
        dest='save_dir',
        help='The directory for saving the model snapshot',
        type=str,
        default='./output')
    # Number of data loader worker processes; on AI Studio it is currently
    # recommended to set this to 0.
    parser.add_argument(
        '--num_workers',
        dest='num_workers',
        help='Num workers for data loader',
        type=int,
        default=0)
    # Evaluate the model during training
    parser.add_argument(
        '--do_eval',
        dest='do_eval',
        help='Eval while training',
        action='store_true')
    # Logging interval
    parser.add_argument(
        '--log_iters',
        dest='log_iters',
        help='Display logging information at every log_iters',
        default=10,
        type=int)
    # Enable VisualDL to visualize training
    parser.add_argument(
        '--use_vdl',
        dest='use_vdl',
        help='Whether to record the data to VisualDL during training',
        action='store_true')

    return parser.parse_args()
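Putting these arguments together, a typical invocation looks like the following (the parameter values here are illustrative):

python train.py \
       --config configs/quick_start/bisenet_optic_disc_512x512_1k.yml \
       --batch_size 4 \
       --iters 1000 \
       --save_interval 200 \
       --log_iters 10 \
       --save_dir ./output \
       --do_eval \
       --use_vdl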

Then look at the next line of code:

main(args)

The code of main is as follows:

def main(args):
    # Get environment information, such as the operating system type, Python version,
    # Paddle version, number of GPUs, OpenCV version, gcc version, etc.
    env_info = get_sys_env()
    # Print the environment information
    info = ['{}: {}'.format(k, v) for k, v in env_info.items()]
    info = '\n'.join(['\n', format('Environment Information', '-^48s')] + info +
                     ['-' * 48])
    logger.info(info)

    # Determine whether to use the GPU
    place = 'gpu' if env_info['Paddle compiled with cuda'] and env_info[
        'GPUs used'] else 'cpu'
    # Set the device to GPU or CPU
    paddle.set_device(place)
    # Raise an exception if no configuration file is specified.
    if not args.cfg:
        raise RuntimeError('No configuration file specified.')
    # Build the cfg object, which contains settings such as the dataset, image
    # augmentation, model structure and loss functions.
    # It is built from the command-line arguments and the YAML configuration file.
    cfg = Config(
        args.cfg,
        learning_rate=args.learning_rate,
        iters=args.iters,
        batch_size=args.batch_size)
    # Get the train_dataset object from the Config object. train_dataset is iterable.
    train_dataset = cfg.train_dataset
    # Raise an exception if no training dataset is configured.
    if not train_dataset:
        raise RuntimeError(
            'The training dataset is not specified in the configuration file.')
    # If the model is to be evaluated during training, the validation dataset is needed.
    val_dataset = cfg.val_dataset if args.do_eval else None
    # Get the loss functions
    losses = cfg.loss

    msg = '\n---------------Config Information---------------\n'
    msg += str(cfg)
    msg += '------------------------------------------------'
    # Print the detailed configuration.
    logger.info(msg)
    # Call the train function in core/train.py to start training.
    train(
        cfg.model,
        train_dataset,
        val_dataset=val_dataset,
        optimizer=cfg.optimizer,
        save_dir=args.save_dir,
        iters=cfg.iters,
        batch_size=cfg.batch_size,
        resume_model=args.resume_model,
        save_interval=args.save_interval,
        log_iters=args.log_iters,
        num_workers=args.num_workers,
        use_vdl=args.use_vdl,
        losses=losses)

In the train.py script, in addition to using Config to parse the configuration file, the train function in core/train.py is called to complete the training work. Let's first look at the workflow of the train function.

The entire training process consists of two nested loops. The outermost loop is controlled by the total number of iterations, which is configured in the YAML file, as shown in the following code:

iters: 80000

The inner loop is controlled by the data reader: it traverses all the data in the data reader until every sample has been read once. One full pass over the data is usually called an epoch.
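In skeleton form, the two loops look like this (a simplified sketch of the control flow, not the actual PaddleSeg code):

iter = start_iter
while iter < iters:        # outer loop: runs until the configured iteration count is reached
    for data in loader:    # inner loop: one full pass over the data loader is one epoch
        iter += 1
        if iter > iters:   # stop as soon as the total iteration count is hit
            break
        # ... forward pass, loss computation, backward pass, optimizer step ...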

Let's analyze in detail the code of the train function in core/train.py.

The full function, with inline commentary, is shown below:

def train(model,  # model object
          train_dataset,  # training dataset object
          val_dataset=None,  # validation dataset object; may be None if no evaluation is needed during training
          optimizer=None,  # optimizer object
          save_dir='output',  # output directory for saving models
          iters=10000,  # maximum number of training iterations
          batch_size=2,  # batch size
          resume_model=None,  # path of the model weights to resume training from, if needed
          save_interval=1000,  # interval (in iterations) for saving the model
          log_iters=10,  # logging output interval
          num_workers=0,  # number of data loader worker processes; 0 disables multiprocessing
          use_vdl=False,  # whether to use VisualDL
          losses=None):  # loss functions and their coefficients; when multiple loss functions are used, the coefficient of each one must be specified
    # To support multi-GPU training, the number of GPUs is obtained here.
    nranks = paddle.distributed.ParallelEnv().nranks
    # In distributed training, every GPU runs this program, so the program needs to
    # obtain the rank of the current GPU.
    local_rank = paddle.distributed.ParallelEnv().local_rank
    # The iteration at which the loop starts. When resuming training, the starting
    # iteration is read from the saved training state.
    # For example, if the training state was saved at iteration 2000 and training is
    # resumed from it, start_iter will be 2000.
    start_iter = 0
    if resume_model is not None:
        start_iter = resume(model, optimizer, resume_model)
    # Create the directory for saving the output model files.
    if not os.path.isdir(save_dir):
        if os.path.exists(save_dir):
            os.remove(save_dir)
        os.makedirs(save_dir)
    # For multi-GPU training, the parallel training environment must be initialized.
    if nranks > 1:
        # Initialize parallel training environment.
        paddle.distributed.init_parallel_env()
        strategy = paddle.distributed.prepare_context()
        ddp_model = paddle.DataParallel(model, strategy)
    # Create a batch sampler for the dataset; batches are assembled through it. The
    # batch size, whether to shuffle, and whether to drop the trailing samples that
    # cannot form a full batch are specified here.
    batch_sampler = paddle.io.DistributedBatchSampler(
        train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
    # Build a data loader from the dataset and the batch sampler. Multiprocessing can
    # be enabled via num_workers; the worker processes communicate through shared
    # memory. If the shared memory is too small, an error may occur; in that case,
    # try setting num_workers to 0 to disable multiprocessing.
    loader = paddle.io.DataLoader(
        train_dataset,
        batch_sampler=batch_sampler,
        num_workers=num_workers,
        return_list=True,
    )

    if use_vdl:
        from visualdl import LogWriter
        log_writer = LogWriter(save_dir)
    # Start the timer
    timer = Timer()
    avg_loss = 0.0
    iters_per_epoch = len(batch_sampler)
    best_mean_iou = -1.0
    best_model_iter = -1
    train_reader_cost = 0.0
    train_batch_cost = 0.0
    timer.start()

    iter = start_iter
    # Start the loops: the outermost loop is controlled by the iteration count.
    while iter < iters:
        # Inner loop: traverse the data in the data loader.
        for data in loader:
            iter += 1
            if iter > iters:
                break
            # Record the time spent reading data
            train_reader_cost += timer.elapsed_time()
            # The input samples
            images = data[0]
            # The sample labels
            labels = data[1].astype('int64')
            # Edge maps, used by BCELoss
            edges = None
            if len(data) == 3:
                edges = data[2].astype('int64')
                
            # With multiple GPUs, run distributed training through the wrapped model;
            # with a single GPU, call the model object directly.
            if nranks > 1:
                # Obtain predictions via the model's forward pass
                logits_list = ddp_model(images)
            else:
                # Obtain predictions via the model's forward pass
                logits_list = model(images)
            # Compute the loss against the labels
            loss = loss_computation(
                logits_list=logits_list,
                labels=labels,
                losses=losses,
                edges=edges)
            # Compute the gradients of the model parameters
            loss.backward()
            # Run one optimizer step to update the parameters
            optimizer.step()
            # Get the current learning rate of the optimizer.
            lr = optimizer.get_lr()
            if isinstance(optimizer._learning_rate,
                          paddle.optimizer.lr.LRScheduler):
                optimizer._learning_rate.step()
            # Clear the gradients in the model
            model.clear_gradients()
            # Accumulate the loss to compute the average later
            avg_loss += loss.numpy()[0]
            train_batch_cost += timer.elapsed_time()
            train_batch_cost += timer.elapsed_time()
            # Print the training log every log_iters iterations, as configured.
            if (iter) % log_iters == 0 and local_rank == 0:
                avg_loss /= log_iters
                avg_train_reader_cost = train_reader_cost / log_iters
                avg_train_batch_cost = train_batch_cost / log_iters
                train_reader_cost = 0.0
                train_batch_cost = 0.0
                remain_iters = iters - iter
                eta = calculate_eta(remain_iters, avg_train_batch_cost)
                logger.info(
                    "[TRAIN] epoch={}, iter={}/{}, loss={:.4f}, lr={:.6f}, batch_cost={:.4f}, reader_cost={:.4f} | ETA {}"
                    .format((iter - 1) // iters_per_epoch + 1, iter, iters,
                            avg_loss, lr, avg_train_batch_cost,
                            avg_train_reader_cost, eta))
                if use_vdl:
                    log_writer.add_scalar('Train/loss', avg_loss, iter)
                    log_writer.add_scalar('Train/lr', lr, iter)
                    log_writer.add_scalar('Train/batch_cost',
                                          avg_train_batch_cost, iter)
                    log_writer.add_scalar('Train/reader_cost',
                                          avg_train_reader_cost, iter)
                avg_loss = 0.0
            # Based on save_interval in the configuration, decide whether the current
            # model needs to be evaluated.
            if (iter % save_interval == 0
                    or iter == iters) and (val_dataset is not None):
                num_workers = 1 if num_workers > 0 else 0
                mean_iou, acc = evaluate(
                    model, val_dataset, num_workers=num_workers)
                # After evaluation, set the model back to training mode; this mode
                # affects the dropout and batch norm layers.
                model.train()
            # Based on save_interval in the configuration, decide whether the current
            # model needs to be saved.
            if (iter % save_interval == 0 or iter == iters) and local_rank == 0:
                current_save_dir = os.path.join(save_dir,
                                                "iter_{}".format(iter))
                # Create the output directory if it does not exist.
                if not os.path.isdir(current_save_dir):
                    os.makedirs(current_save_dir)
                # Save the model weights
                paddle.save(model.state_dict(),
                            os.path.join(current_save_dir, 'model.pdparams'))
                # Save the optimizer state, which is needed when resuming training.
                paddle.save(optimizer.state_dict(),
                            os.path.join(current_save_dir, 'model.pdopt'))
                # Save the best model.
                if val_dataset is not None:
                    if mean_iou > best_mean_iou:
                        best_mean_iou = mean_iou
                        best_model_iter = iter
                        best_model_dir = os.path.join(save_dir, "best_model")
                        paddle.save(
                            model.state_dict(),
                            os.path.join(best_model_dir, 'model.pdparams'))
                    logger.info(
                        '[EVAL] The model with the best validation mIoU ({:.4f}) was saved at iter {}.'
                        .format(best_mean_iou, best_model_iter))

                    if use_vdl:
                        log_writer.add_scalar('Evaluate/mIoU', mean_iou, iter)
                        log_writer.add_scalar('Evaluate/Acc', acc, iter)
            # Restart the timer
            timer.restart()

    # Sleep for half a second to let dataloader release resources.
    time.sleep(0.5)
    if use_vdl:
        log_writer.close()
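One detail worth highlighting is loss_computation, which combines the outputs of a model that may return several logits tensors (for example a main head plus auxiliary heads) into a single scalar loss. The following is a minimal sketch of this idea; the 'types' and 'coef' keys follow the loss configuration structure described above, but the function body is a simplification rather than the exact PaddleSeg implementation (which, among other things, also handles the edges argument for edge-aware losses):

def loss_computation_sketch(logits_list, labels, losses):
    # losses is assumed to hold two parallel lists: the loss objects under
    # 'types' and one weighting coefficient per loss under 'coef'.
    total_loss = 0
    for i, logits in enumerate(logits_list):
        loss_fn = losses['types'][i]
        coef = losses['coef'][i]
        # Each logits tensor is scored against the labels, weighted by its
        # coefficient, and summed into the total loss.
        total_loss += coef * loss_fn(logits, labels)
    return total_loss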

This concludes the interpretation of train.py, the training entry file of the PaddleSeg suite.

PaddleSeg repository address: https://github.com/PaddlePaddle/PaddleSeg
