[Diverse Image Translation] 1. You Only Need Adversarial Supervision for Semantic Image Synthesis (OASIS) — paper notes

https://github.com/boschresearch/OASIS

Semantic image synthesis with only adversarial supervision (ICLR 2021)


The paper uses a concept that is easy to trip over: semantic classes. What exactly are semantic classes?

1、This new discriminator provides semantically-aware pixel-level feedback to the generator, partitioning the image into segments belonging to one of the N real semantic classes or the fake class.

In other words, instead of a single real/fake score, the discriminator gives the generator semantically-aware, pixel-level feedback, partitioning the image into segments that belong either to one of the N real semantic classes or to the fake class.

2、Such noise is spatially sensitive, so we can re-sample it both globally (channel-wise) and locally (pixel-wise), allowing to change not only the appearance of the whole scene, but also of specific semantic classes or any chosen areas (see Fig. 2).【locally re-sampled noise can change any chosen area or a specific semantic class】

3、In addition to the N semantic classes from the label maps, all pixels of the fake images are categorized as one extra class. Overall, we have N + 1 classes in the semantic segmentation problem, and thus propose to use a (N+1)-class cross-entropy loss for training.【the label maps contain N semantic classes; all pixels of the fake images are grouped into one extra class, so the segmentation problem has N + 1 classes in total, and an (N+1)-class cross-entropy loss is used for training】

4、For an object with the semantic class c, it will contain pixels from both real and fake images, resulting in two labels, i.e. c and N + 1.【an object of semantic class c will contain pixels from both the real and the fake image, so it ends up with two labels, c and N + 1】

5、With the powerful feedback from the discriminator, OASIS is able to learn the appearance of small or rarely occurring semantic classes (which is reflected in the per-class IoU scores presented in App. A.3), producing plausible results even for complex scenes with rare classes and reducing unnatural artifacts. The class balancing noticeably improves mIoU due to better supervision for rarely occurring semantic classes.【thanks to the discriminator's powerful feedback, OASIS learns the appearance of small or rarely occurring semantic classes, which noticeably improves mIoU】

Summary: the semantic classes referred to here should simply be the segmentation categories (i.e. the labeled regions) of a semantic segmentation map.

Two main contributions of OASIS:

  1. This is achieved via detailed spatial and semantic-aware supervision from our novel segmentation based discriminator. (a novel segmentation-based discriminator)
  2. OASIS can easily generate diverse multi-modal outputs by re-sampling the 3D noise, both globally and locally, allowing to change the appearance of the whole scene and of individual objects. (sampling the 3D noise globally or locally yields diverse images: the whole scene or just a chosen part can be changed while the rest stays unchanged)

Figure 3: SPADE (left) vs. OASIS (right). OASIS outperforms SPADE, while being simpler and lighter: it uses only adversarial loss supervision and a single segmentation-based discriminator, without relying on heavy external networks. Furthermore, OASIS learns to synthesize multi-modal outputs by directly re-sampling the 3D noise tensor, instead of using an image encoder as in SPADE.

3.2 DISCRIMINATOR

Thus, we build our discriminator architecture upon U-Net, which consists of the encoder and decoder connected by skip connections. This discriminator architecture is multi-scale through its design, integrating information over up- and down-sampling pathways and through the encoder-decoder skip connections. For details on the architecture see App. C.1.
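To make the per-pixel output concrete, here is a heavily simplified PyTorch sketch of an encoder-decoder discriminator with a skip connection that predicts (N+1)-class logits for every pixel. This is not the paper's architecture (see App. C.1 for that); the depths and widths here are invented purely for illustration.

```python
import torch
import torch.nn as nn

class UNetDiscriminator(nn.Module):
    """Toy U-Net-style discriminator: per-pixel (N+1)-class logits."""
    def __init__(self, in_channels=3, num_classes=150, base=64):
        super().__init__()
        # Encoder: two downsampling stages.
        self.enc1 = nn.Sequential(nn.Conv2d(in_channels, base, 3, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, 2, 1), nn.LeakyReLU(0.2))
        # Decoder: two upsampling stages; the second consumes a skip connection.
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.LeakyReLU(0.2))
        # Per-pixel classifier head: N real classes + 1 fake class.
        self.head = nn.Conv2d(base, num_classes + 1, 1)

    def forward(self, img):
        e1 = self.enc1(img)                     # (B, 64,  H/2, W/2)
        e2 = self.enc2(e1)                      # (B, 128, H/4, W/4)
        d1 = self.dec1(e2)                      # (B, 64,  H/2, W/2)
        d2 = self.dec2(torch.cat([d1, e1], 1))  # skip connection -> (B, 64, H, W)
        return self.head(d2)                    # (B, N+1, H, W) logits
```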

The segmentation task of the discriminator is formulated to predict the per-pixel class label of the real images, using the given semantic label maps as ground truth.
In addition to the N semantic classes from the label maps, all pixels of the fake images are categorized as one extra class. Overall, we have N + 1 classes in the semantic segmentation problem, and thus propose to use a (N+1)-class cross-entropy loss for training.【the label maps provide N semantic classes and all pixels of fake images form one extra class, so there are N+1 classes in total, hence the proposed (N+1)-class cross-entropy loss for training】
Considering that the N semantic classes are usually imbalanced and that the per-pixel size of objects varies for different semantic classes, we weight each class by its inverse per-pixel frequency, giving rare semantic classes more weight. In doing so, the contributions of each semantic class are equally balanced, and, thus, the generator is also encouraged to adequately synthesize less-represented classes. Mathematically, the new discriminator loss is expressed as:
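From the definitions in the next paragraph (please check the paper for the exact form), the weighted (N+1)-class cross-entropy loss should look roughly like this:

$$\mathcal{L}_D = -\,\mathbb{E}_{(x,t)}\Big[\sum_{c=1}^{N}\alpha_c\sum_{i,j}^{H\times W} t_{i,j,c}\,\log D(x)_{i,j,c}\Big] \;-\; \mathbb{E}_{(z,t)}\Big[\sum_{i,j}^{H\times W}\log D\big(G(z,t)\big)_{i,j,c=N+1}\Big]$$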

where x denotes the real image; (z, t) is the noise-label map pair used by the generator G to synthesize a fake image; and the discriminator D maps the real or fake image into a per-pixel (N+1)-class prediction probability. The ground truth label map t has three dimensions, where the first two correspond to the spatial position (i, j) ∈ H × W, and the third one is a one-hot vector encoding the class c ∈ {1, .., N+1}. The class balancing weight αc is the inverse of the per-pixel class frequency.
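A minimal PyTorch sketch of this loss, assuming the label map comes as per-pixel class indices in [0, N-1] and fake pixels are mapped to the extra index N. The weights mirror the inverse per-pixel class frequency described above, but computing them per batch (rather than over the whole dataset) is my own simplification:

```python
import torch
import torch.nn.functional as F

def class_balancing_weights(label_idx, num_classes):
    """Inverse per-pixel class frequency; rare classes get larger weight.
    label_idx: (B, H, W) integer class indices of the real label maps."""
    counts = torch.bincount(label_idx.flatten(), minlength=num_classes).float()
    freq = counts / counts.sum()                        # per-pixel class frequency
    return torch.where(freq > 0, 1.0 / freq, torch.zeros_like(freq))

def oasis_d_loss(d_real_logits, d_fake_logits, label_idx, num_classes):
    """(N+1)-class cross-entropy for the discriminator.
    d_*_logits: (B, N+1, H, W); label_idx: (B, H, W) with values in [0, N-1]."""
    alpha = class_balancing_weights(label_idx, num_classes).to(d_real_logits.device)
    # Real pixels: predict their semantic class, weighted by alpha. The extra
    # fake class (index N) never occurs as a real target, so its weight of 1
    # is irrelevant for this term.
    alpha_full = torch.cat([alpha, alpha.new_ones(1)])
    loss_real = F.cross_entropy(d_real_logits, label_idx, weight=alpha_full)
    # Fake pixels: every pixel should be assigned to the extra class N.
    fake_target = torch.full_like(label_idx, num_classes)
    loss_fake = F.cross_entropy(d_fake_logits, fake_target)
    return loss_real + loss_fake
```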

3.3 GENERATOR

To stay in line with the OASIS discriminator design, the training loss for the generator is changed to
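By analogy with the discriminator loss (please check the paper for the exact form), the generator is trained so that the discriminator classifies every fake pixel as its real semantic class, using the same class-balancing weights:

$$\mathcal{L}_G = -\,\mathbb{E}_{(z,t)}\Big[\sum_{c=1}^{N}\alpha_c\sum_{i,j}^{H\times W} t_{i,j,c}\,\log D\big(G(z,t)\big)_{i,j,c}\Big]$$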

 We next re-design the generator to enable multi-modal synthesis through noise sampling.
SPADE is deterministic in its default setup, but can be trained with an extra image encoder to generate multi-modal outputs. We introduce a simpler version that enables synthesis of diverse outputs directly from input noise.

A noise tensor of size 64×H×W is concatenated channel-wise with the H×W label map to form a 3D tensor. This 3D tensor is used as the generator input and is also injected at every spatially-adaptive normalization layer.

  • For this, we construct a noise tensor of size 64×H×W, matching the spatial dimensions of the label map H×W.
  • Channel-wise concatenation of the noise and label map forms a 3D tensor used as input to the generator and also as a conditioning at every spatially-adaptive normalization layer.
  • In doing so, intermediate feature maps are conditioned on both the semantic labels and the noise (see Fig. 3); a short code sketch of this construction follows the list.
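A minimal PyTorch sketch of this construction, assuming a one-hot label map with N channels and the global (per-channel) sampling used during training; the function name and shapes are illustrative, not the official code:

```python
import torch

def make_generator_input(label_onehot, z_dim=64):
    """label_onehot: (B, N, H, W) one-hot semantic label map."""
    B, N, H, W = label_onehot.shape
    # Global (channel-wise) sampling: one 64-d vector per image,
    # replicated spatially along height and width.
    z = torch.randn(B, z_dim, 1, 1).expand(B, z_dim, H, W)
    # Channel-wise concatenation -> (B, 64 + N, H, W); this tensor serves as
    # the generator input and as the conditioning at every SPADE layer.
    return torch.cat([z, label_onehot], dim=1)
```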

With such a design, the generator produces diverse, noise-dependent images. As the 3D noise is channel- and pixel-wise sensitive, at test time, one can sample the noise globally, per-channel, and locally, per-segment or per-pixel, for controlled synthesis of the whole scene or of specific semantic objects. For example, when generating a scene of a bedroom, one can re-sample the noise locally and change the appearance of the bed alone (see Fig. 2). 

Note that for simplicity during training we sample the 3D noise tensor globally, i.e. per-channel, replicating each channel value spatially along the height and width of the tensor. We analyse alternative ways of sampling 3D noise during training in App. A.7. Using image styles via an encoder, as in SPADE, is also possible in our setting, by replacing noise with encoder features. Lastly, to further reduce the complexity, we remove the first residual block in the generator, reducing the number of parameters from 96M to 72M (see App. C.2) without a noticeable performance loss (see Table 3).

C.2 GENERATOR ARCHITECTURE

The generator architecture is built from SPADE ResNet blocks and includes a concatenation of 3D noise with the label map along the channel dimension as an option. The generator can be either trained directly on the label maps or with 3D noise concatenated to the label maps. The latter option is shown in Table M.

OASIS generator drops the first residual block used in Park et al. (2019), which decreases the number of learnable parameters from 96M to 72M. The optional 3D noise injection adds another 2M parameters. This sampling scheme is five times lighter than the image encoder used by SPADE (10M).

 Table M: The OASIS generator. N refers to the number of semantic classes, z is noise sampled from a unit Gaussian, y is the label map, interp interpolates a given input to the appropriate spatial dimensions of the current layer.

B.2 MULTI-MODAL IMAGE SYNTHESIS

Multi-modal image synthesis for a given label map is easy for OASIS: we simply re-sample noise like in a conventional unconditional GAN model.
Since OASIS employs a 3D noise tensor (64 channels × height × width), the noise can be re-sampled entirely ("globally") or only for specific regions in the 2D image plane ("locally").
For our visualizations, we replicate a single 64-dimensional noise vector along the spatial dimensions for global sampling.
For local sampling, we re-sample a new noise vector and use it to replace the global noise vector at every spatial position within a restricted area of interest. The results are shown in Figure F. The generated images are diverse and of high quality.
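Here is a minimal sketch of that local re-sampling, assuming a boolean region-of-interest mask (e.g. the pixels of one semantic class); the helper name is mine, not from the official code:

```python
import torch

def resample_noise_locally(z_global, mask):
    """z_global: (B, 64, H, W) spatially replicated global noise.
    mask: (B, 1, H, W) boolean region of interest (e.g. one semantic class)."""
    B, C, H, W = z_global.shape
    # Draw one new 64-d vector per image and replicate it spatially.
    z_new = torch.randn(B, C, 1, 1, device=z_global.device).expand(B, C, H, W)
    # Replace the global noise only inside the region of interest.
    return torch.where(mask, z_new, z_global)
```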

Local noise re-sampling does not have to be restricted to only semantic class areas: in Figure H we sample a different noise vector for the left and right half of the image, as well as for arbitrarily shaped regions. In effect, the two areas can differ substantially. However, often a bridging element is found between two areas, such as clouds extending partly from one region to the other region of the image.

B.3 LATENT SPACE INTERPOLATIONS

Figure I: Global latent space interpolations between images generated by OASIS for various outdoor and indoor scenes in the ADE20K dataset at resolution 256 × 256. (global noise changes the appearance of the whole image)

In Figure I we present images that are the results of linear interpolations in the latent space (see Fig. I), using an OASIS model trained on the ADE20K dataset. To generate the images, we sample two noise vectors z ∈ R^64 and interpolate them with three intermediate points. The images are synthesized for these five different noise inputs while the label map is held fixed. Note that in Figure I we only vary the noise globally, not locally (see Section 3.3 in the main paper).
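A small sketch of this procedure; each interpolated vector is then replicated spatially and fed to the generator together with the fixed label map (helper names are illustrative):

```python
import torch

def global_interpolations(z_a, z_b, steps=3):
    """Linearly interpolate between two 64-d noise vectors with `steps`
    intermediate points, returning steps + 2 vectors including endpoints."""
    ts = torch.linspace(0, 1, steps + 2)
    return [torch.lerp(z_a, z_b, t) for t in ts]

# Usage: two endpoints plus three intermediate noise vectors.
z_a, z_b = torch.randn(64), torch.randn(64)
zs = global_interpolations(z_a, z_b)   # 5 noise vectors, label map held fixed
```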

Figure J: Latent space interpolations in local regions of the 3D noise, corresponding to a single semantic class. The noise is only changed within the restricted area. Trained on the ADE20K dataset at resolution 256 × 256. (local noise changes only the selected region)

In contrast, Figure J shows local interpolations. For this, we only re-sample the 3D noise in the area corresponding to a single semantic class. The effect is that only the appearance of the selected semantic class varies while the rest of the image remains fixed. It can be observed that strong changes in a local area can slightly affect the surroundings if the local area is also very big. 

Training details.

We follow the experimental setting of (Park et al., 2019). The image resolution is set to 256x256 for ADE20K and COCO-Stuff and 256x512 for Cityscapes. The Adam (Kingma & Ba, 2015) optimizer was used with momentums β = (0, 0.999) and constant learning rates (0.0001, 0.0004) for G and D. We did not apply the GAN feature matching loss, and used the VGG perceptual loss only for ablations with λVGG = 10. The coefficient for LabelMix λLM was set to 5 for ADE20K and Cityscapes, and to 10 for COCO-Stuff. All our models use an exponential moving average (EMA) of the generator weights with 0.9999 decay (Brock et al., 2019). All the experiments were run on 4 Tesla V100 GPUs, with a batch size of 20 for Cityscapes, and 32 for ADE20K and COCO-Stuff. The training epochs are 200 on ADE20K and Cityscapes, and 100 for the larger COCO-Stuff dataset. On average, a complete forward-backward pass with batch size 32 on ADE20K takes around 0.95ms per training image.
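For concreteness, a sketch of the optimizer and EMA setup described above, with small stand-in modules in place of the real OASIS generator and discriminator:

```python
import copy
import torch
import torch.nn as nn

# Stand-in modules; in practice these are the OASIS generator/discriminator.
G = nn.Sequential(nn.Conv2d(64 + 151, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(3, 152, 3, padding=1))

# Adam with beta1 = 0, constant learning rates 1e-4 (G) / 4e-4 (D).
opt_g = torch.optim.Adam(G.parameters(), lr=0.0001, betas=(0.0, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=0.0004, betas=(0.0, 0.999))

# Exponential moving average of the generator weights, decay 0.9999.
G_ema = copy.deepcopy(G)

def update_ema(decay=0.9999):
    with torch.no_grad():
        for p_ema, p in zip(G_ema.parameters(), G.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)
```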


Reposted from blog.csdn.net/weixin_43135178/article/details/126860336