NVIDIA open-sources DG-Net: using GANs for "Taobao-style" clothes swapping to assist person re-identification

A few days ago, NVIDIA open-sourced the code of DG-Net. Let's take a look back at this CVPR19 oral paper.

The paper, "Joint Discriminative and Generative Learning for Person Re-identification", was presented orally at CVPR19 by researchers from NVIDIA, the University of Technology Sydney (UTS), and the Australian National University (ANU). Training deep learning models usually requires a large amount of labeled data, which is often difficult to collect and annotate. The authors explore using generated data to assist training on the person re-identification (re-ID) task: by generating high-quality pedestrian images and coupling the generator with the re-ID model, the quality of the generated images and the accuracy of re-ID are improved at the same time.
Paper link: https://arxiv.org/abs/1904.07223
Bilibili video: https://www.bilibili.com/video/av51439240/
Tencent video: https://v.qq.com/x/page/t0867x53ady.html

Code: https://github.com/NVlabs/DG-Net

[Figure: low-quality pedestrian images generated by previous methods]

Why: (What are the pain points of previous papers?)

  • Generating high-quality pedestrian images is difficult. Images generated by some previous works are of relatively low quality (as shown above), mainly in two respects: 1. realism: the generated pedestrians are not realistic enough, images are blurry, and backgrounds look fake; 2. extra supervision: additional annotations such as human skeletons or attributes are required to assist generation.
  • If these low-quality generated images are used to train the person re-identification model, a difference (bias) between the generated images and the original dataset is introduced. As a result, previous work either treated all generated images as outliers to regularize the network, or separately trained a model on the generated images and fused it with the original model, or did not use generated images for training at all.
  • At the same time, because labeling is difficult, person re-identification training sets (such as Market-1501 and DukeMTMC-reID) generally contain only around 20,000 images, far smaller than datasets like ImageNet, so the problem of overfitting has not been well resolved.

What: (what does this paper propose, and what problem does it solve?)

  • High-quality pedestrian images can be generated without any additional annotation (pose, attributes, keypoints, etc.). By swapping the extracted codes, the appearances of two pedestrian images are exchanged. These appearances are real variations present in the training set, rather than random noise.
    [Figure: image encoding interchange (swapping the appearance and structure codes of two images)]

  • Part matching is not required to improve the re-identification results. Simply letting the model see more training samples improves its performance. Given N real images, we first generate N×N training images and use them to train the person re-identification model (in the figure below, the first row and first column are the real inputs, and the rest are generated images; a minimal sketch of this pairing scheme follows this list).
    [Figure: the N×N generation matrix]

  • Training forms a loop: the generated images are fed to the re-identification model so that it learns good pedestrian features, and the features extracted by the re-identification model are in turn fed to the generative module to improve the quality of the generated images.
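
Below is a minimal sketch of the N×N pairing idea mentioned above. It is not the official DG-Net code: `E_app`, `E_str`, and `G` stand for a hypothetical appearance encoder, structure encoder, and decoder, which in the real repository are jointly trained networks.

```python
# Hypothetical sketch of the N x N augmentation: combine the appearance code of
# image i with the structure code of image j for every (i, j) pair.
import torch

def generate_nxn(images, E_app, E_str, G):
    """images: (N, 3, H, W) tensor. Returns an (N, N, 3, H, W) tensor where
    entry (i, j) carries the appearance of image i on the structure of image j."""
    with torch.no_grad():
        app_codes = E_app(images)              # (N, d_app)
        str_codes = E_str(images)              # (N, d_str, h, w)
        rows = []
        for i in range(images.size(0)):
            a_i = app_codes[i:i + 1].expand(images.size(0), -1)
            rows.append(G(a_i, str_codes))     # N outputs sharing appearance i
        return torch.stack(rows, dim=0)
```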

How: (how does the paper achieve this?)

  • Definition of features:
    The paper first defines two kinds of codes. One is the appearance code and the other is the structure code. The appearance code is related to the pedestrian's identity, while the structure code captures low-level visual cues.

[Figure: appearance code and structure code]
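
As a rough illustration of the two codes, here is a toy pair of encoders. This is not the official architecture (DG-Net uses a ResNet-based appearance encoder, and feeds a grayscale copy of the image to the structure encoder); it only shows that one branch produces a global identity-related vector while the other keeps a spatial map.

```python
# Toy illustration only, NOT the DG-Net architecture.
import torch
import torch.nn as nn

class AppearanceEncoder(nn.Module):
    """Maps an image to a global appearance code tied to the person's identity."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):                          # x: (B, 3, H, W)
        return self.fc(self.conv(x).flatten(1))    # (B, dim)

class StructureEncoder(nn.Module):
    """Keeps a spatial map of low-level cues such as pose and background."""
    def __init__(self, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, dim, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):                          # x: (B, 3, H, W)
        return self.conv(x)                        # (B, dim, H/4, W/4)
```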

  • Generation part:
  1. Same-ID reconstruction: the appearance codes of different photos of the same person should be the same. As shown in the figure below,
    we can use a self-reconstruction loss (top, similar to an auto-encoder), and we can also use a positive sample with the same ID to reconstruct the image. Both use a pixel-level L1 loss.

[Figure: same-identity reconstruction]
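
A minimal sketch of these two reconstruction terms, using the same hypothetical `E_app` / `E_str` / `G` modules as in the earlier sketches (loss weights omitted):

```python
# Hypothetical sketch of the same-identity reconstruction losses (pixel-level L1).
import torch.nn.functional as F

def same_id_losses(x, x_pos, E_app, E_str, G):
    """x and x_pos are two images of the same person."""
    a, s = E_app(x), E_str(x)
    # 1) self-reconstruction: rebuild x from its own codes (auto-encoder style)
    loss_self = F.l1_loss(G(a, s), x)
    # 2) positive-pair reconstruction: the appearance code of another image of
    #    the same identity should also reconstruct x on x's structure code
    loss_pos = F.l1_loss(G(E_app(x_pos), s), x)
    return loss_self + loss_pos
```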

  2. Different-ID generation:
    This is the most critical part. Given two input images, we can swap their appearance and structure codes to generate two interesting outputs, as shown in the figure below. The corresponding losses are: a GAN loss that enforces realism, and a code reconstruction loss requiring that the appearance/structure codes can be recovered from the generated image.
    There is no random component in the network, so all the variation in the generated images comes from the training set itself; the generated images therefore stay close to the original training distribution.

[Figure: cross-identity generation by swapping appearance and structure codes]
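
A minimal sketch of the cross-identity objectives, assuming a discriminator `D` and the same hypothetical modules as above; the official code uses an LSGAN-style objective with additional terms and loss weights.

```python
# Hypothetical sketch of the cross-identity generation losses.
import torch.nn.functional as F

def cross_id_losses(x1, x2, E_app, E_str, G, D):
    a1, s1 = E_app(x1), E_str(x1)
    a2, s2 = E_app(x2), E_str(x2)
    # swap codes: the appearance of one person on the structure of the other
    x12, x21 = G(a1, s2), G(a2, s1)
    # generator-side adversarial loss (LSGAN-like form): fakes should look real
    loss_gan = ((D(x12) - 1) ** 2).mean() + ((D(x21) - 1) ** 2).mean()
    # code reconstruction: the swapped codes should be recoverable from the outputs
    loss_code = (F.l1_loss(E_app(x12), a1) + F.l1_loss(E_str(x12), s2) +
                 F.l1_loss(E_app(x21), a2) + F.l1_loss(E_str(x21), s1))
    return loss_gan + loss_code
```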

  • Re-ID part:
    For real images, we still use the standard classification cross-entropy loss.
    For generated images, we use two losses. One is L_prime: the trained baseline model acts as a teacher and provides a soft label for the generated image, and we minimize the KL divergence between the re-ID model's prediction and the teacher's. The other is L_fine, which mines the fine-grained details of the image that remain after the appearance has been changed. (See the paper for details.)

[Figure: re-ID learning on real and generated images]
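
A minimal sketch of the cross-entropy and teacher-student (L_prime) terms, assuming `student` and a frozen `teacher` that both output identity logits; the fine-grained mining loss L_fine is omitted here.

```python
# Hypothetical sketch of the re-ID losses on real and generated images.
import torch
import torch.nn.functional as F

def reid_losses(real_img, label, gen_img, student, teacher, T=1.0):
    # real images: standard classification cross-entropy
    loss_ce = F.cross_entropy(student(real_img), label)
    # generated images: KL divergence to the teacher's soft prediction (L_prime)
    with torch.no_grad():
        soft_target = F.softmax(teacher(gen_img) / T, dim=1)
    log_pred = F.log_softmax(student(gen_img) / T, dim=1)
    loss_prime = F.kl_div(log_pred, soft_target, reduction='batchmean')
    return loss_ce + loss_prime
```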

Results:

  • Qualitative results:
  1. Appearance swapping: we tested on three datasets, and the method is relatively robust to occlusion and large lighting changes.

[Figure: appearance swapping results on three datasets]

  2. Appearance interpolation: does the network merely memorize what the generated images look like? To check, we gradually interpolate the appearance code between two inputs, and the generated appearance changes gradually and smoothly.

[Figure: gradual interpolation of the appearance code]
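
A minimal sketch of this interpolation experiment, again using the hypothetical `E_app` / `E_str` / `G` modules from the earlier sketches.

```python
# Hypothetical sketch: keep the structure of x_src, slide its appearance from x_a to x_b.
import torch

def interpolate_appearance(x_src, x_a, x_b, E_app, E_str, G, steps=8):
    with torch.no_grad():
        s = E_str(x_src)
        a0, a1 = E_app(x_a), E_app(x_b)
        frames = []
        for alpha in torch.linspace(0, 1, steps):
            a = (1 - alpha) * a0 + alpha * a1   # linear blend of appearance codes
            frames.append(G(a, s))
    return torch.cat(frames, dim=0)
```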

  3. Failure cases: uncommon patterns such as logos on clothing cannot be faithfully reproduced.

[Figure: failure cases with rare logos and patterns]

  • Quantitative results:
  1. Comparison of generated-image realism (FID) and diversity (SSIM). Lower FID is better; higher SSIM is better.

[Table: FID and SSIM comparison]
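
For reference, a minimal way to compute these metrics (not the paper's exact evaluation protocol): SSIM between two aligned images via scikit-image, and FID with an off-the-shelf tool such as the pytorch-fid package.

```python
# Hypothetical evaluation sketch. FID is usually computed with a ready-made tool,
# e.g. `python -m pytorch_fid <real_dir> <fake_dir>` from the pytorch-fid package.
import numpy as np
from skimage.metrics import structural_similarity

def pairwise_ssim(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """img_a, img_b: (H, W, 3) uint8 arrays of the same size.
    (On scikit-image < 0.19, pass multichannel=True instead of channel_axis.)"""
    return float(structural_similarity(img_a, img_b, channel_axis=-1))
```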

  2. Re-ID results on multiple datasets (Market-1501, DukeMTMC-reID, MSMT17, CUHK03-NP).

[Tables: re-ID accuracy on Market-1501, DukeMTMC-reID, MSMT17, and CUHK03-NP]

Attachment: Video Demo

Bilibili video (backup): https://www.bilibili.com/video/av51439240/
Tencent video (backup): https://v.qq.com/x/page/t0867x53ady.html

Finally, thank you all for reading. Since we are still at an early, exploratory stage, some issues may not have been considered thoroughly enough. If you find anything unclear, you are welcome to share your valuable opinions and discuss them with us. Thank you!

References

[1] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled Samples Generated by GAN Improve the Person Re-identification Baseline in Vitro. ICCV, 2017.
[2] Y. Huang, J. Xu, Q. Wu, Z. Zheng, Z. Zhang, and J. Zhang. Multi-pseudo Regularized Label for Generated Samples in Person Re-identification. TIP, 2018.
[3] X. Qian, Y. Fu, T. Xiang, W. Wang, J. Qiu, Y. Wu, Y.-G. Jiang, and X. Xue. Pose-Normalized Image Generation for Person Re-identification. ECCV, 2018.
[4] Y. Ge, Z. Li, H. Zhao, G. Yin, X. Wang, and H. Li. FD-GAN: Pose-Guided Feature Distilling GAN for Robust Person Re-identification. NIPS, 2018.

About the Author

The lead author of this paper, Zhedong Zheng, is a PhD student at the School of Computer Science at UTS, expected to graduate in June 2021. This paper is the result of his internship at NVIDIA.

Zhedong Zheng has published 8 papers so far. One of them, an ICCV17 spotlight cited more than 300 times, was the first to propose using GAN-generated images to assist feature learning for person re-identification. Another, a TOMM journal paper, was selected as a 2018 Highly Cited Paper by Web of Science, with more than 200 citations. He has also contributed a baseline code for person re-identification to the community, which has more than 1,000 stars on GitHub and has been widely adopted.

In addition, the other authors of the paper include Xiaodong Yang, a video expert at NVIDIA Research; Dr. Zhiding Yu, an expert in face recognition (an author of SphereFace and LargeMargin) with CVPR oral work accepted this year; and Jan Kautz, VP of Research at NVIDIA.

Zheng Zhedong's personal website: http://zdzheng.xyz/

Originally published at: https://blog.csdn.net/Layumi1993/article/details/90257375