ICCV 2019 Best Paper Interpretation: SinGAN, Learning a Generative Model from a Single Natural Image

Author: Xiao Jian, Harbin Engineering University

The paper "SinGAN: Learning a Generative Model from a Single Natural Image", co-authored by researchers from the Technion (Israel Institute of Technology) and Google Research, won the Best Paper Award at ICCV 2019. The following is our interpretation of the article.

This paper proposes SinGAN, an unconditional generative model that can be learned from a single natural image. SinGAN captures the internal patch distribution of the image and generates high-quality, diverse samples with the same visual content. It consists of a pyramid of fully convolutional GANs, each responsible for learning the patch distribution at a different scale of the image. This allows it to generate new samples of arbitrary size and aspect ratio that exhibit significant variability while preserving both the global structure and the fine textures of the training image. In contrast to previous single-image GAN schemes, the method is not limited to texture images and is unconditional (i.e., samples are generated from noise). Extensive experiments show that the samples generated by SinGAN are highly realistic and can be used in a variety of image processing tasks.

Paper address: https://arxiv.org/abs/1905.01164

Source address: https://github.com/tamarott/SinGAN

Supplementary material address: http://webee.technion.ac.il/people/tomermic/SinGAN/SinGAN.htm

Authors: Tamar Rott Shaham, Tali Dekel, Tomer Michaeli (Israel Institute of Technology, Google Research)

Research Background

Generative Adversarial Networks (GANs) have made a huge leap in modeling the high-dimensional distributions of visual data. In particular, unconditional GANs have achieved remarkable success in generating realistic, high-quality samples when trained on class-specific datasets (such as faces or bedrooms). However, modeling the distribution of highly diverse, multi-class datasets (such as ImageNet) remains a major challenge, and usually requires conditioning the generation on another input signal or training the model for a specific task.

This paper brings GANs into a new realm: learning an unconditional generative model from a single natural image. Modeling the internal distribution of a single natural image has long been recognized as a useful prior for many computer vision tasks. A single natural image usually contains enough internal statistics to enable a network to learn a powerful generative model. The authors propose SinGAN, a model with a simple, unified architecture that can handle ordinary natural images with complex structures and textures, without relying on a dataset of images from the same category. This is achieved through a pyramid of fully convolutional GANs, each responsible for capturing the patch distribution at a different scale. After training, SinGAN can generate diverse, high-quality image samples of arbitrary size. These samples are semantically similar to the training image but contain new objects and structures, as shown in Figure 1. SinGAN can also be applied to a variety of image processing tasks, such as paint-to-image conversion, editing, harmonization, super-resolution, and animation.

Figure 1 Image generation models learned from a single training sample. This paper proposes SinGAN, a new unconditional generative model trained on a single natural image. SinGAN uses a multi-scale adversarial training scheme to learn the image's internal statistics across scales, which can then be used to generate new, realistic image samples that preserve the original patch distribution while creating new objects and structures.

Related work

1. Single-image deep models. Several recent works propose training an "overfitted" deep model on a single sample. These are all designed for specific tasks, such as super-resolution reconstruction or texture expansion. InGAN, proposed by Shocher et al., is the first internal-GAN model trained on a single natural image, but its generated samples are conditioned on the input image (i.e., it maps images to images) and it cannot draw random samples. The framework in this paper is purely generative (i.e., it maps noise to image samples) and is therefore suitable for many different image processing tasks. Existing unconditional single-image GAN models have only been studied on texture images; when trained on non-texture images they do not produce meaningful samples. The method in this paper is not limited to textures and can handle general natural images, as shown in Figure 1.

2. Generative models for image processing. GAN-based methods have demonstrated the great advantages of adversarial learning in many image processing tasks, including interactive image editing, sketch-to-image synthesis, and other image-to-image translation tasks. However, all of these methods are trained on datasets of a specific type and usually condition the generation on an additional input signal. This paper does not focus on capturing features common to images of the same class; instead it considers a different source of training data: all overlapping patches at multiple scales of a single natural image. The authors show that a powerful generative model can be learned from such data and used in many image processing tasks.

Method

The goal of this paper is to learn an unconditional generative model that captures the internal statistics of a single training image x. This task is conceptually similar to the conventional GAN setting, except that the training samples here are patches of a single image at multiple scales, rather than whole images from a dataset.

The model is designed to handle general natural images, giving it capabilities beyond texture generation. To capture global properties such as the shapes and arrangement of objects in the image (e.g., sky at the top, ground at the bottom) as well as fine details and texture, SinGAN contains a hierarchy of patch-GANs (Markovian discriminators). Each discriminator is responsible for capturing the patch distribution of x at a different scale, as shown in Figure 2. Although similar multi-scale structures have been explored for GANs, this is the first architecture designed for internal learning from a single image.

1. Multi-scale structure
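The coarse-to-fine sampling performed by this multi-scale structure can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: `psi` stands in for each scale's small fully convolutional net, and nearest-neighbor repetition stands in for the upsampling operator; only the residual form, where each generator adds detail on top of the upsampled output of the coarser scale, follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(x):
    """Stand-in for a scale's conv net; any image-to-image map works for this toy."""
    return 0.1 * np.tanh(x)

def generate(num_scales=4, base_size=8, r=2):
    """Coarse-to-fine sampling: start from pure noise at the coarsest scale,
    then at each finer scale add a residual computed from fresh spatial noise
    plus the upsampled result of the previous scale."""
    size = base_size
    z = rng.normal(size=(size, size))     # z_N: noise at the coarsest scale
    x = psi(z)                            # the coarsest generator has no image input
    for _ in range(num_scales - 1):
        x_up = x.repeat(r, axis=0).repeat(r, axis=1)  # upsample by factor r
        size *= r
        z = rng.normal(size=(size, size)) # fresh spatial noise z_n at this scale
        x = x_up + psi(z + x_up)          # residual generator
    return x

sample = generate()  # an 8x8 start grown through 4 scales yields a 64x64 sample
```

Because every scale injects its own noise map, variation appears at every level of detail, while the upsampled input from the coarser scale constrains the global layout.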

2. Training process
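The GANs are trained sequentially from the coarsest scale to the finest, and the per-scale objective combines an adversarial (WGAN-GP style) term with a reconstruction term that anchors one fixed noise map to the training image. A minimal sketch of the two loss terms, with the gradient penalty and the training loop omitted and all function names illustrative:

```python
import numpy as np

ALPHA = 10.0  # reconstruction weight alpha (the paper uses alpha = 10)

def critic_loss(d_real, d_fake):
    """WGAN critic objective for the discriminator (gradient penalty omitted)."""
    return float(np.mean(d_fake) - np.mean(d_real))

def generator_loss(d_fake, reconstruction, x_n, alpha=ALPHA):
    """Adversarial term plus alpha times the squared reconstruction error.

    `reconstruction` is the generator run on a single fixed noise map
    (random at the coarsest scale, zero elsewhere), which ties one specific
    sample to the training image x_n at every scale.
    """
    l_adv = -float(np.mean(d_fake))
    l_rec = float(np.mean((reconstruction - x_n) ** 2))
    return l_adv + alpha * l_rec
```

The reconstruction term both stabilizes training and provides the starting point used later for tasks such as super-resolution and editing.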

Experimental results

The authors evaluated SinGAN qualitatively and quantitatively on a dataset covering a wide range of image scenes. Qualitative results are shown in Figures 1 and 4: SinGAN preserves the global structure of objects as well as fine texture information, such as the mountains in Figure 1 and the hot-air balloons and pyramids in Figure 4. In addition, the model synthesizes reflections and shadows very realistically.

Figure 4 Randomly generated image samples

When fewer scales are used during training, the effective receptive field at the coarsest scale is smaller, so only fine textures can be captured. As the number of scales increases, larger supporting structures emerge and the global arrangement (positional relationships) of objects is better preserved. At test time, one can choose the scale at which generation starts, and SinGAN's multi-scale structure thereby controls the amount of variation between samples. Starting from the coarsest scale produces large changes in the overall structure, and in some cases with large, salient objects this may yield unrealistic samples. Starting from a finer scale keeps the overall structure intact while varying only the finer image characteristics.

To quantify the realism of the generated images and the extent to which they capture the internal statistics of the training image, the authors use two metrics: an AMT "real/fake" user study and a single-image version of the Fréchet Inception Distance (SIFID). The AMT test found that SinGAN generates very realistic samples, with a high human confusion rate. The results of using SIFID to quantify how well SinGAN captures the internal statistics of x are shown in Table 1. The SIFID values for samples generated starting from scale N-1 are lower than for those generated from scale N, consistent with the user study. The authors also report the correlation between SIFID and the confusion rate on fake images: there is a significant negative correlation, meaning that a smaller SIFID usually corresponds to a higher confusion rate.
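SIFID replaces FID's statistics over many images with statistics over the internal deep-feature distribution of a single image: activations at each spatial location of an early Inception layer are treated as samples. A sketch of the Fréchet-distance computation under that view, assuming the feature maps have already been extracted and flattened (the Inception feature extraction itself is not shown):

```python
import numpy as np

def sqrtm_psd(m):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def sifid(feat_real, feat_fake):
    """Fréchet distance between the feature statistics of ONE real image and
    ONE generated image. feat_*: (num_spatial_locations, channels) arrays of
    activations from an early Inception layer."""
    mu1, mu2 = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    c1 = np.cov(feat_real, rowvar=False)
    c2 = np.cov(feat_fake, rowvar=False)
    s1 = sqrtm_psd(c1)
    # tr sqrt(C1 C2) == tr sqrt(C1^{1/2} C2 C1^{1/2}), and the latter is PSD
    covmean = sqrtm_psd(s1 @ c2 @ s1)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))
```

Identical feature sets give a distance of zero, and the metric grows as the internal patch statistics of the generated image drift from those of the training image.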

Table 1 SIFID values of the two generation modes

Conclusion

This paper introduces SinGAN, a new unconditional generative framework that can be learned from a single natural image, and demonstrates that it can go beyond texture generation to produce diverse, realistic samples for complex natural images. Compared with externally trained generative methods, internal learning has inherent limitations in semantic diversity: for example, if the training image contains only one dog, SinGAN will not generate samples of different dog breeds. Nevertheless, the authors show experimentally that SinGAN provides a very powerful tool for a variety of image processing tasks.

Origin blog.csdn.net/AMiner2006/article/details/102797995