Vid2Vid installation

Vid2Vid

Prerequisites
Linux or macOS
Python 3
NVIDIA GPU + CUDA cuDNN
Getting started
Installation
Install PyTorch and dependencies from https://pytorch.org
Install the Python dominate and requests libraries:

pip install dominate requests

Clone this repository:

git clone https://github.com/NVIDIA/vid2vid
cd vid2vid

Testing
We include a sample Cityscapes video in the datasets folder.

First, download and compile a snapshot of the FlowNet2 repo from https://github.com/NVIDIA/flownet2-pytorch by running:

python scripts/download_flownet2.py

Please download the pre-trained Cityscapes model by running:

python scripts/download_models.py
To test the model (bash ./scripts/test_2048.sh):
#!./scripts/test_2048.sh
python test.py --name label2city_2048 --loadSize 2048 --n_scales_spatial 3 --use_instance --fg --use_single_G

The test results will be saved to ./results/label2city_2048/test_latest/index.html.

We also provide a smaller model trained with 1 GPU, which produces slightly worse performance at 1024 x 512 resolution.

Please download the model by running:

python scripts/download_models_g1.py    
To test the model (bash ./scripts/test_1024_g1.sh):
#!./scripts/test_1024_g1.sh
python test.py --name label2city_1024_g1 --loadSize 1024 --n_scales_spatial 3 --use_instance --fg --n_downsample_G 2 --use_single_G

You can find more example scripts in the scripts directory.

Dataset
We use the Cityscapes dataset as an example. To train the model on the full dataset, please download it from the official website (registration required).
We apply a pre-trained segmentation algorithm to obtain the corresponding semantic maps (train_A) and instance maps (train_inst).
Please put the obtained images in the datasets folder, in the same way as the sample images.
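
As a rough sketch of the expected layout (train_A and train_inst come from the text above; train_B for the corresponding real frames is an assumption based on the sample images, so please mirror whatever structure the provided samples use):

datasets/Cityscapes/
    train_A/      single-channel semantic label maps
    train_B/      corresponding real video frames (assumed name)
    train_inst/   instance maps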
Training
First, download the FlowNet2 checkpoint file by running:

python scripts/download_models_flownet2.py

Use 8 GPUs for training:

We adopt a coarse-to-fine approach, gradually increasing the resolution from 512 x 256 to 1024 x 512 and then 2048 x 1024.
Train a model at 512 x 256 resolution (bash ./scripts/train_512.sh):
#!./scripts/train_512.sh
python train.py --name label2city_512 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 6 --n_frames_total 6 --use_instance --fg
Train a model at 1024 x 512 resolution (you must first train the 512 x 256 model) (bash ./scripts/train_1024.sh):
#!./scripts/train_1024.sh
python train.py --name label2city_1024 --loadSize 1024 --n_scales_spatial 2 --num_D 3 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 4 --use_instance --fg --niter_step 2 --niter_fix_global 10 --load_pretrain checkpoints/label2city_512
To view intermediate training results, please check ./checkpoints/label2city_1024/web/index.html. If you have TensorFlow installed, you can see TensorBoard logs in ./checkpoints/label2city_1024/logs by adding --tf_log to the training scripts.
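
For example, a minimal sketch of enabling TensorBoard logging for the 1024 x 512 run above (assuming the standard tensorboard CLI is installed; the flags are the same as the command above, with --tf_log appended):

python train.py --name label2city_1024 --loadSize 1024 --n_scales_spatial 2 --num_D 3 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 4 --use_instance --fg --niter_step 2 --niter_fix_global 10 --load_pretrain checkpoints/label2city_512 --tf_log
tensorboard --logdir ./checkpoints/label2city_1024/logs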

Use a single GPU for training:

We train our models using multiple GPUs. For convenience, we provide some sample training scripts (XXX_g1.sh) for single-GPU users, up to 1024 x 512 resolution. Again, a coarse-to-fine approach is adopted (256 x 128, 512 x 256, 1024 x 512). Performance is not guaranteed with these scripts.
For example, to train a 256 x 128 video with a single GPU (bash ./scripts/train_256_g1.sh):

#!./scripts/train_256_g1.sh
python train.py --name label2city_256_g1 --loadSize 256 --use_instance --fg --n_downsample_G 2 --num_D 1 --max_frames_per_gpu 6 --n_frames_total 6
Training at full (2k x 1k) resolution
To train images at full resolution (2048 x 1024), 8 GPUs with at least 24G of memory are required (bash ./scripts/train_2048.sh). If only GPUs with 12G/16G of memory are available, please use the script ./scripts/train_2048_crop.sh, which crops the images during training. Performance is not guaranteed with this script.
Training with your own dataset

If your input is a label map, please generate single-channel label maps whose pixel values correspond to the object labels (i.e. 0, 1, ..., N-1, where N is the number of labels). This is because we need to generate one-hot vectors from the label map (see the sketch after this list). Please use --label_nc N during both training and testing.
If your input is not a label map, please specify --label_nc 0 and --input_nc N, where N is the number of input channels (the default is 3 for RGB images).
The default preprocessing setting is scaleWidth, which scales the width of all training images to opt.loadSize (1024) while maintaining the aspect ratio. If you need other settings, please change them with the --resize_or_crop option. For example, scaleWidth_and_crop first resizes the image to have width opt.loadSize and then randomly crops a region of size (opt.fineSize, opt.fineSize). crop skips the resizing step and only performs random cropping. scaledCrop crops the image while retaining the original aspect ratio. If you do not want any preprocessing, please specify none, which does nothing other than ensuring the image is divisible by 32.
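
To illustrate why the label map must be single-channel with values 0, 1, ..., N-1, here is a minimal, hypothetical sketch of turning such a map into a one-hot tensor with PyTorch; it is only an illustration, not the repository's actual preprocessing code:

import torch

# hypothetical 2 x 4 label map with N = 3 classes (pixel values in {0, 1, 2})
label_map = torch.tensor([[0, 1, 2, 1],
                          [2, 2, 0, 1]])      # shape (H, W), dtype int64
N = 3                                         # would correspond to --label_nc 3

# scatter a 1 into the channel given by each pixel's label value
one_hot = torch.zeros(N, *label_map.shape)
one_hot.scatter_(0, label_map.unsqueeze(0), 1.0)

print(one_hot.shape)                          # torch.Size([3, 2, 4])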
More training/testing details
The way we train the model is as follows: suppose we have 8 GPUs, 4 for the generator and 4 for the discriminator, and we want to train on 28 frames. We also assume that each GPU can generate only one frame at a time. The first GPU generates the first frame and passes it to the next GPU, and so on. After 4 frames are generated, they are passed to the 4 discriminator GPUs to compute the losses. Then the last generated frame becomes the input to the next batch, and the next 4 frames in the training sequence are loaded onto the GPUs. This is repeated 7 times (4 x 7 = 28) to train all 28 frames.
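
As a small sketch of that bookkeeping (numbers taken from the paragraph above; this is only an illustration, not the actual training loop):

# illustration only: 4 generator GPUs, 1 frame per GPU, 28 frames per sequence
n_gpus_gen = 4
max_frames_per_gpu = 1
n_frames_total = 28

frames_per_pass = n_gpus_gen * max_frames_per_gpu   # 4 frames generated per pass
n_passes = n_frames_total // frames_per_pass        # 28 / 4 = 7 passes

for p in range(n_passes):
    start = p * frames_per_pass
    # the generator GPUs produce frames start .. start+3, the discriminator GPUs
    # score them, and the last generated frame seeds the next pass
    print(f"pass {p + 1}: frames {start}..{start + frames_per_pass - 1}")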
Some important flags:
n_gpus_gen: the number of GPUs used for the generator (while the other GPUs are used for the discriminator). We separate the generator and discriminator onto different GPUs because, at high resolutions, even a single frame cannot fit on one GPU. If this number is set to -1, there is no separation and all GPUs are used for both the generator and the discriminator (only for low-resolution images).
n_frames_G: the number of input frames fed into the generator network; i.e., n_frames_G - 1 is the number of past frames conditioned on. The default value is 3 (conditioning on the previous two frames).
n_frames_D: The number of frames to be fed into the time discriminator. The default value is 3.
n_scales_spatial: the number of scales in the spatial domain. We train from the coarsest scale to the finest scale. The default value is 3.
n_scales_temporal: the number of scales of the temporal discriminator. The finest scale takes frames at the original frame rate. Coarser scales subsample the frames by a factor of n_frames_D before feeding them to the discriminator. For example, if n_frames_D=3 and n_scales_temporal=3, the discriminator effectively sees 27 frames (see the sketch after this list). The default value is 3.
max_frames_per_gpu: The number of frames in one GPU during training. If your GPU memory can hold more frames, try setting this number larger. The default value is 1.
max_frames_backpropagate: the number of frames that the loss is backpropagated through to previous frames. For example, if this number is 4, the loss on frame n will be backpropagated to frame n-3. Increasing this number slightly improves performance but also makes training less stable. The default value is 1.
n_frames_total: The total number of frames in the sequence we want to train. We gradually increase this number during training.
niter_step: how many epochs we train before doubling n_frames_total. The default value is 5.
niter_fix_global: if this number is not 0, only the finest spatial scale is trained for this number of epochs before fine-tuning all scales.
batchSize: the number of sequences to train on at a time. We usually set batchSize to 1, since a single sequence is often enough to occupy all GPUs. If you want to use batchSize > 1, currently only batchSize == n_gpus_gen is supported.
no_first_img: if not specified, the model assumes the first frame is given and synthesizes the successive frames. If specified, the model will also try to synthesize the first frame.
fg: If specified, use the foreground-background separation model.
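
To make the n_scales_temporal arithmetic concrete, a small sketch of the 27-frame figure quoted for n_frames_D = 3 and n_scales_temporal = 3 (my own illustration of the subsampling, not code from the repository):

n_frames_D = 3
n_scales_temporal = 3

for s in range(1, n_scales_temporal + 1):
    stride = n_frames_D ** (s - 1)       # subsampling factor at this temporal scale
    covered = n_frames_D * stride        # roughly how many original frames this scale spans
    print(f"scale {s}: stride {stride}, spans ~{covered} frames")

# coarsest scale: 3 * 3**2 = 27, matching the "effectively sees 27 frames" figure above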
For other flags, see options/train_options.py and options/base_options.py for all training flags, and options/test_options.py and options/base_options.py for all test flags.
Citation
If you find this useful for your research, please cite the following.

@article{wang2018vid2vid,
  title={Video-to-Video Synthesis},
  author={Ting-Chun Wang and Ming-Yu Liu and Jun-Yan Zhu and Guilin Liu and Andrew Tao and Jan Kautz and Bryan Catanzaro},
  journal={arXiv},
  year={2018}
}

Acknowledgements
This code borrows heavily from pytorch-CycleGAN-and-pix2pix and pix2pixHD.

Original: https://github.com/NVIDIA/vid2vid
