vid2vid training + test code walkthrough (debug + train + test), Part II: Training

### Training

### Training with Cityscapes dataset

- First, download the FlowNet2 checkpoint file by running `python scripts/download_models_flownet2.py`.
- Training with 8 GPUs:
- We adopt a coarse-to-fine approach, sequentially increasing the resolution from 512 x 256, 1024 x 512, to 2048 x 1024.
- Train a model at 512 x 256 resolution (`bash ./scripts/street/train_512.sh`)
```bash
#!./scripts/street/train_512.sh
python train.py --name label2city_512 --label_nc 35 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 6 --n_frames_total 6 --use_instance --fg
```
- Train a model at 1024 x 512 resolution (must train 512 x 256 first) (`bash ./scripts/street/train_1024.sh`):
```bash
#!./scripts/street/train_1024.sh
python train.py --name label2city_1024 --label_nc 35 --loadSize 1024 --n_scales_spatial 2 --num_D 3 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 4 --use_instance --fg --niter_step 2 --niter_fix_global 10 --load_pretrain checkpoints/label2city_512
```
If you have TensorFlow installed, you can see TensorBoard logs in `./checkpoints/label2city_1024/logs` by adding `--tf_log` to the training scripts.

- Training with a single GPU:
- We trained our models using multiple GPUs.

The eight-GPU setup above is the authors' (well-resourced) configuration; the following is for single-GPU users.

- For convenience, we provide some sample training scripts (train_g1_XXX.sh) for single GPU users, up to 1024 x 512 resolution. Again a coarse-to-fine approach is adopted (256 x 128, 512 x 256, 1024 x 512). Performance is not guaranteed using these scripts.

- Here, "coarse-to-fine" argument is shown below:

We can see that the article is achieved through high-resolution nested models of video frames; of course, ideally, this is infinitely nested superimposed.


Before formally starting the experiments, I want to talk about how the project handles and uses its data.

For the Cityscapes task, the dataset mode is "temporal"; see `vid2vid/data/temporal_dataset.py`.

Below we walk through the code that handles frame sampling, preprocessing, and so on.

### Copyright (C) 2017 NVIDIA Corporation. All rights reserved. 
### Licensed under the CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode).
import os.path
import random
import torch
from data.base_dataset import BaseDataset, get_img_params, get_transform, get_video_params
from data.image_folder import make_grouped_dataset, check_path_valid
from PIL import Image
import numpy as np

class TemporalDataset(BaseDataset):
    def initialize(self, opt):
        self.opt = opt
        self.root = opt.dataroot                                             # default: 'datasets/Cityscapes/'
        self.dir_A = os.path.join(opt.dataroot, opt.phase + '_A')            # train_A
        self.dir_B = os.path.join(opt.dataroot, opt.phase + '_B')            # train_B
        self.A_is_label = self.opt.label_nc != 0                             # label_nc defaults to 35

        self.A_paths = sorted(make_grouped_dataset(self.dir_A))              # [ [], [], ..., [] ]
        self.B_paths = sorted(make_grouped_dataset(self.dir_B))              # [ [], [], ..., [] ]
        # train_A and train_B differ only in the folder; the files inside correspond one-to-one (same file names)
        
        check_path_valid(self.A_paths, self.B_paths)                         # check that masks and real frames correspond one-to-one (in count)
        
        if opt.use_instance:                
            self.dir_inst = os.path.join(opt.dataroot, opt.phase + '_inst')  # whether to use instance-level foreground/background information
            self.I_paths = sorted(make_grouped_dataset(self.dir_inst))       # [ [], [], ..., [] ]
            check_path_valid(self.A_paths, self.I_paths)                     # likewise check one-to-one correspondence with the masks (in count)

        self.n_of_seqs = len(self.A_paths)                                   # number of video sequences to train on
        self.seq_len_max = max([len(A) for A in self.A_paths])               # number of frames in the longest sequence
        self.n_frames_total = self.opt.n_frames_total                        # default 30; the overall number of frames sampled from each sequence for training

    def __getitem__(self, index):
        tG = self.opt.n_frames_G                                             # default 3: number of frames fed to the generator each time, i.e. the current frame plus the previous (tG-1) frames
        A_paths = self.A_paths[index % self.n_of_seqs]                       # paths of the mask frames of the sequence selected by index
        B_paths = self.B_paths[index % self.n_of_seqs]                       # paths of the real frames of the sequence selected by index
        # take the index modulo n_of_seqs so it stays within bounds
        if self.opt.use_instance:
            I_paths = self.I_paths[index % self.n_of_seqs]                   # if instance maps are used, also load the corresponding inst paths
        
        # setting parameters
        n_frames_total, start_idx, t_step = get_video_params(self.opt, self.n_frames_total, len(A_paths), index)    
                                                                             # returns, for this sequence: 1) the number of frames we can actually take; 2) the index of the first frame; 3) the step between adjacent sampled frames

        # setting transformers
        B_img = Image.open(B_paths[start_idx]).convert('RGB')                # open the first sampled real frame
        params = get_img_params(self.opt, B_img.size)                        # compute: 1) the new size; 2) the crop size; 3) the crop position; 4) whether to flip for augmentation
        # {'new_size': (new_w, new_h), 'crop_size': (crop_w, crop_h), 'crop_pos': (crop_x, crop_y), 'flip': flip}
        
        transform_scaleB = get_transform(self.opt, params)
        transform_scaleA = get_transform(self.opt, params, method=Image.NEAREST, normalize=False) if self.A_is_label else transform_scaleB
        # two preprocessing pipelines; the only difference is that when A is a label map it is not normalized, presumably so the gaps between class values stay large

        # read in images
        A = B = inst = 0
        for i in range(n_frames_total):            
            A_path = A_paths[start_idx + i * t_step]
            B_path = B_paths[start_idx + i * t_step]                                # evenly spaced sampling
            
            Ai = self.get_image(A_path, transform_scaleA, is_label=self.A_is_label)            
            Bi = self.get_image(B_path, transform_scaleB)
            
            A = Ai if i == 0 else torch.cat([A, Ai], dim=0)            
            B = Bi if i == 0 else torch.cat([B, Bi], dim=0) 
                                                                                    # concatenate all sampled frames together

            if self.opt.use_instance:
                I_path = I_paths[start_idx + i * t_step]                
                Ii = self.get_image(I_path, transform_scaleA) * 255.0               # the instance map is a binary image; multiply by 255 to balance its range with the mask and RGB data
                inst = Ii if i == 0 else torch.cat([inst, Ii], dim=0)                

        return_list = {'A': A, 'B': B, 'inst': inst, 'A_path': A_path, 'B_paths': B_path}
        return return_list                                                          # at this point we have sampled the requested 30 (or 6) frames from one sequence; each sample carries the tuple <segment mask, real RGB, inst map, mask paths, RGB paths>

    def get_image(self, A_path, transform_scaleA, is_label=False):
        A_img = Image.open(A_path)        
        A_scaled = transform_scaleA(A_img)
        if is_label:
            A_scaled *= 255.0
        return A_scaled

    def __len__(self):
        return len(self.A_paths)

    def name(self):
        return 'TemporalDataset'

Finally, to sum up:

for each video sequence, do:
    1. Determine the sampling parameters:
        1) the number of frames we need and can actually take from this sequence;
        2) the index of the first frame;
        3) the sampling step (stride) between frames.
    2. Define the data preprocessing (transform).
    3. Read the sampled frames, preprocess them, and concatenate them along the channel dimension.
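To make this concrete, here is a minimal usage sketch of the dataset class. The fake `opt` built with `SimpleNamespace` is an assumption for illustration only; the real project builds its options with its own argument parser, and `get_img_params`/`get_transform` may read additional fields not listed here.

```python
# Minimal sketch: drive TemporalDataset by hand with a stand-in `opt` object.
# The option values below are illustrative guesses, not the project's full option set.
from types import SimpleNamespace

from data.temporal_dataset import TemporalDataset

opt = SimpleNamespace(
    dataroot='datasets/Cityscapes/', phase='train', label_nc=35,
    use_instance=True, n_frames_total=6, n_frames_G=3,
    loadSize=512, fineSize=512, resize_or_crop='scaleWidth', isTrain=True,
)

dataset = TemporalDataset()
dataset.initialize(opt)

sample = dataset[0]
# 'A' stacks the sampled label frames along the channel axis,
# 'B' the corresponding real RGB frames, 'inst' the instance maps.
print(sample['A'].shape, sample['B'].shape)
```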

- For example, to train a 256 x 128 video with a single GPU (`bash ./scripts/street/train_g1_256.sh`)

- First, we train the innermost network of the nested structure (256 x 128).

- Execute the script `./scripts/street/train_g1_256.sh`:

python train.py --name label2city_256_g1 --label_nc 35 --loadSize 256 --use_instance --fg --n_downsample_G 2 --num_D 1 --max_frames_per_gpu 6 --n_frames_total 6
'''
--name                name of the experiment / checkpoint directory
--label_nc            number of label classes
--loadSize            scale images to this size (the longer side)
--use_instance        if specified, add the instance map as a feature for class A
--fg                  if specified, use the foreground-background separation model
--n_downsample_G      number of downsampling layers in netG
--num_D               number of patch scales in each (PatchGAN) discriminator
--max_frames_per_gpu  number of frames loaded onto each GPU at a time
--n_frames_total      number of frames per video sequence used for training
'''

After training, we can look at the experiment folder under `./vid2vid/checkpoints`. We trained on a modest (graduate-student) budget for a total of 20 epochs, and all models are saved once per epoch, so the folder contains one set of checkpoints per epoch.

# The meaning of each saved model

Each model's parameters take up nearly half a gigabyte, so we cannot keep every checkpoint. Two options control how often models are saved:

opt.save_latest_freq  --default=1000
How many training iterations between saves of the "latest" model.

opt.save_epoch_freq   --default=1
How many epochs between model saves.
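How these two frequencies typically interact in a training loop is sketched below. This is a self-contained illustration, not the repository's `train.py`; `save_checkpoint` and the loop sizes are assumed names and numbers.

```python
# Illustrative sketch of how save_latest_freq and save_epoch_freq interact.
# `save_checkpoint` stands in for the real model-saving call; the loop sizes are arbitrary.
def save_checkpoint(label):
    print(f'saving checkpoint: {label}')

def training_loop(n_epochs=20, iters_per_epoch=3000, batch_size=1,
                  save_latest_freq=1000, save_epoch_freq=1):
    total_steps = 0
    for epoch in range(1, n_epochs + 1):
        for _ in range(iters_per_epoch):
            total_steps += batch_size
            # ... forward/backward pass would go here ...
            if total_steps % save_latest_freq == 0:
                save_checkpoint('latest')      # rolling checkpoint, overwritten each time
        if epoch % save_epoch_freq == 0:
            save_checkpoint(epoch)             # a separate checkpoint kept for this epoch

training_loop(n_epochs=2, iters_per_epoch=10, save_latest_freq=5, save_epoch_freq=1)
```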

After training completes, we can look at the results (barring accidents, there will also be intermediate outputs saved along the way).

 


### Training with face datasets

- If you haven't, please first download example dataset by running `python scripts/download_datasets.py`.
- Run the following command to compute face landmarks for training dataset: 

python data/face_landmark_detection.py train

- First, the face dataset has to be prepared (landmarks → edge sketches).

- Run the example script (`bash ./scripts/face/train_512.sh`)

- This is the multi-GPU approach (for those with the hardware to spare):
```bash
python train.py --name edge2face_512 --dataroot datasets/face/ --dataset_mode face --input_nc 15 --loadSize 512 --num_D 3 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 6 --n_frames_total 12 
``` 

- For single-GPU training (my case), run instead:

python train.py --name edge2face_256_g1 --dataroot datasets/face/ --dataset_mode face --input_nc 15 --loadSize 256 --ngf 64 --max_frames_per_gpu 6 --n_frames_total 12 --niter 20 --niter_decay 20 --save_epoch_freq 10

''' If you still run out of GPU memory, try reducing the value of max_frames_per_gpu. '''

 


### Training with pose datasets

- First, make sure you have downloaded the example datasets by running the command below.

python scripts/download_datasets.py

- Run the following script to start training (`bash ./scripts/pose/train_256p.sh`):

# For multi-GPU users:

python train.py --name pose2body_256p --dataroot datasets/pose --dataset_mode pose --input_nc 6 --num_D 2 --resize_or_crop ScaleHeight_and_scaledCrop --loadSize 384 --fineSize 256 --gpu_ids 0,1,2,3,4,5,6,7 --batchSize 8 --max_frames_per_gpu 3 --no_first_img --n_frames_total 12 --max_t_step 4

- Again, for single GPU users, example scripts are in train_g1_XXX.sh. These scripts are not fully tested and please use at your own discretion. If you still hit out of memory errors, try reducing `max_frames_per_gpu`.

python train.py --name pose2body_256p_g1 --dataroot datasets/pose --dataset_mode pose --input_nc 6 --ngf 64 --num_D 2 --resize_or_crop randomScaleHeight_and_scaledCrop --loadSize 384 --fineSize 256 --niter 5 --niter_decay 5 --no_first_img --n_frames_total 12 --max_frames_per_gpu 4 --max_t_step 4 --save_epoch_freq 10

''' If you still run out of GPU memory, try reducing the value of max_frames_per_gpu. '''


### Training with your own dataset

- If your input is a label map, make sure that:

1) it is a single-channel (H x W) tensor;

2) each pixel value is 0, 1, 2, ..., N-1, corresponding to its class;

3) you explicitly specify the value of label_nc during both training and testing.

- If your input is not a label map, be sure to specify `--input_nc N`, where N is the number of channels of the input data (e.g. 3 for RGB input).

- The default preprocessing (transform) is "scaleWidth", which:

scales each frame, keeping the aspect ratio, so that its width equals `opt.loadSize` (512).

  If you want a different behaviour, explicitly set the value of `--resize_or_crop`, summarized as follows:

scaleWidth_and_crop: first scale so the width equals opt.loadSize, then randomly crop to size (opt.fineSize, opt.fineSize).
crop: no scaling, just a random crop to the specified size (opt.fineSize, opt.fineSize).
scaledCrop: crop while keeping the original aspect ratio.
randomScaleHeight: randomly scale the frame, keeping the aspect ratio, so that its height lies between opt.loadSize and opt.fineSize.
none: only make sure both the height and the width are divisible by 32.
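As a rough illustration of what the default `scaleWidth` resizing and the divisibility-by-32 rule do, here is a simplified re-implementation written for clarity; it is not the repository's `get_transform` code, and the helper names are my own.

```python
# Simplified sketch of the default `scaleWidth` resize and the `none` option (not the repo's own code).
from PIL import Image

def scale_width(img, target_width=512, method=Image.BICUBIC):
    w, h = img.size
    if w == target_width:
        return img
    new_h = int(round(target_width * h / w))   # keep the aspect ratio
    return img.resize((target_width, new_h), method)

def make_divisible_by_32(img, method=Image.BICUBIC):
    # Rough equivalent of the `none` option: round both sides down to multiples of 32.
    w, h = img.size
    return img.resize((max(32, w // 32 * 32), max(32, h // 32 * 32)), method)

frame = Image.new('RGB', (2048, 1024))          # dummy image standing in for a video frame
print(scale_width(frame, 512).size)             # -> (512, 256)
print(make_divisible_by_32(Image.new('RGB', (500, 300))).size)  # -> (480, 288)
```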

 


## More Training/Test Details

Below I translate and explain some of the training and test details, based on my own understanding of the project.

First, since the frames are generated sequentially (generating the current frame depends on the previously generated frame), there are three options for obtaining the first frame:

1) translate a single image (frame) with a model such as pix2pix (specify `--use_single_G`);

2) directly use the real first frame (specify `--use_real_img`);

3) force the model to synthesize the first frame itself (specify `--no_first_img`).

Next, a few words about the authors' training strategy:

Suppose we have 8 GPUs, 4 for the generator and 4 for the discriminators, and we want to train on 28 frames (i.e. each sub-sequence generates 28 frames). Each generator GPU generates one frame at a time and passes it on to the next GPU; once 4 frames have been generated, they are sent to the 4 discriminator GPUs. The 4th frame then becomes the input to GPU 1 again to produce the 5th frame, so this cycle is repeated 7 times in total.
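A quick back-of-the-envelope check of that schedule (plain arithmetic; the variable names are only for illustration and assume one frame per generator GPU per pass):

```python
# Back-of-the-envelope check of the GPU pipeline described above.
n_frames_total = 28      # frames generated per sub-sequence
n_gpus_gen = 4           # GPUs assigned to the generator
frames_per_gpu = 1       # assume each generator GPU produces one frame per pass

frames_per_pass = n_gpus_gen * frames_per_gpu
n_passes = n_frames_total // frames_per_pass
print(n_passes)          # -> 7: the generate-then-discriminate cycle repeats 7 times
```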


- Some important flags:
- `n_gpus_gen`: the number of GPUs used for the generator (the rest are used for the discriminators). We put G and D on separate GPUs because, when the images are large, even a single frame may not fit on one GPU. If set to -1, all GPUs are shared by both G and D.
- `n_frames_G`: the number of input frames fed to G, i.e. the previous (n_frames_G - 1) frames plus the current frame (s, x); the default is 3.
- `n_frames_D`: the number of frames fed to the temporal discriminator; the default is 3.
- `n_scales_spatial`: the number of scales in the spatial domain. The default is 3, meaning three nested scales from coarsest to finest, e.g. 256 → 512 → 1024.
- `n_scales_temporal`: vid2vid has two kinds of discriminators, an image discriminator and a video discriminator; the video discriminator Dv takes consecutive frames as input, and this flag is the number of temporal scales of that input. The default is 3; for example, with n_frames_D = 3, training uses 3 → 9 (3^2) → 27 (3^3) frames across the temporal scales.
- `max_frames_per_gpu`: the number of frames fed into one GPU at a time; it depends on your hardware. If your GPU memory is large, increase it; otherwise decrease it. When you run out of GPU memory, try lowering this flag first. The default is 1.
- `max_frames_backpropagate`: this should be the K value described in the paper, i.e. the loss also backpropagates to the previous K-1 frames; the default is 1.
- `n_frames_total`: the number of frames we want to sample from each sequence for training; this value is gradually increased during training.
- `niter_step`: how many epochs before doubling n_frames_total; the default is 5 (see the sketch after this list).
- `niter_fix_global`: if this value is not 0, only the finest scale (the outermost nesting) is trained for that many epochs before all scales are trained together.
- `batchSize`: the number of sub-sequences trained at a time; one sequence is usually already the limit. If you want a larger batchSize, it must satisfy batchSize == n_gpus_gen.
- `no_first_img`: if not specified, the first frame is assumed to be given and the model synthesizes the following frames; if specified, the model will also try to synthesize the first frame.
- `fg`: if specified, use the foreground-background separation approach; in that case make sure `--fg_labels` is consistent with the label values used for foreground objects in your label maps.
- `no_flow`: if specified, flow warping is not used and frames are synthesized directly. The authors found this still works well when the background is static.
- `sparse_D`: if specified, the temporal discriminator is only applied sparsely, to save memory.
- For the meaning of more flags, see `vid2vid/options`.
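To illustrate how `n_frames_total` grows under `niter_step`, here is a small sketch of such a doubling schedule. It only mirrors the behaviour described above; the exact formula in the repository's `get_video_params` may differ, and the starting/maximum frame counts are assumptions taken from the scripts and defaults discussed earlier.

```python
# Sketch of a doubling schedule for the number of training frames per sequence.
# Assumed numbers: start at 6 frames (as in the single-GPU scripts), cap at 30
# (the default n_frames_total in the dataset code), double every `niter_step` epochs.
def frames_for_epoch(epoch, start_frames=6, niter_step=5, max_frames=30):
    doublings = epoch // niter_step
    return min(start_frames * (2 ** doublings), max_frames)

for epoch in (0, 4, 5, 10, 15, 20):
    print(epoch, frames_for_epoch(epoch))
# 0 -> 6, 4 -> 6, 5 -> 12, 10 -> 24, 15 -> 30 (capped), 20 -> 30 (capped)
```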


## Citation

 If you find this useful for your research, please cite the following paper.

```
@inproceedings{wang2018vid2vid,
author = {Ting-Chun Wang and Ming-Yu Liu and Jun-Yan Zhu and Guilin Liu
and Andrew Tao and Jan Kautz and Bryan Catanzaro},
title = {Video-to-Video Synthesis},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, 
year = {2018},
}
```

## Acknowledgments
We thank Karan Sapra, Fitsum Reda, and Matthieu Le for generating the segmentation maps for us. We also thank Lisa Rhee for allowing us to use her dance videos for training. We thank William S. Peebles for proofreading the paper.
This code borrows heavily from [pytorch-CycleGAN-and-pix2pix](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) and [pix2pixHD](https://github.com/NVIDIA/pix2pixHD).


Original post: blog.csdn.net/WinerChopin/article/details/89332581