More Control for Free! Image Synthesis with Semantic Diffusion Guidance (SDG)


 Fig. 1: Our method allows fine-grained semantic control via language guidance, image guidance, or both, and can be applied to datasets without paired image-caption data.

Abstract.

Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We investigate fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both. Guidance is injected into a pretrained unconditional diffusion model using the gradient of image-text or image matching scores. We explore CLIP-based language guidance as well as both content and style-based image guidance in a unified framework. Our text-guided synthesis approach can be applied to datasets without associated text annotations. We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis, synthesis of images related to a style or content reference image, and examples with both textual and image guidance.

1 Introduction

Most previous text-to-image synthesis methods require image-caption pairs for training, and cannot generalize to datasets without text annotations.

Existing image-guided approaches, on the other hand, cannot generate diverse images with varied pose, structure, and layout from a single reference image.

Our model is based on denoising diffusion probabilistic models (DDPM) [19], which generate an image from a noise map by iteratively removing noise to approach the data distribution of natural images.

     We inject the semantic input by using a guidance function to guide the sampling process of an unconditional diffusion model. This enables more controllable generation in diffusion models and gives us a unified formulation for both language and image guidance. Specifically, our language guidance is based on the image-text matching score predicted by CLIP [41] finetuned on noised images. As for the image guidance, depending on what information we seek in the image, we define two options: content and style guidance. The flexibility of the guidance module allows us to inject either language or image guidance alone or both at once into any unconditional diffusion model without the need for re-training. We propose a self-supervised scheme to finetune the CLIP image encoder without text annotations, from which we obtain the guidance model with minimal cost.
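To make this concrete, the sketch below shows how such a guidance term could be folded into a single reverse sampling step, in the spirit of classifier guidance [10]. The names (`guided_reverse_step`, `guidance_fn`, `uncond_mean`, `uncond_var`) are our own placeholders under stated assumptions, not the authors' code:

```python
import torch

def guided_reverse_step(x_t, t, uncond_mean, uncond_var, guidance_fn, scale=1.0):
    """One reverse diffusion step with semantic guidance (illustrative sketch).

    x_t          -- current noisy sample, shape (B, C, H, W)
    uncond_mean  -- mean mu_theta(x_t, t) from the pretrained unconditional DDPM
    uncond_var   -- posterior variance sigma_t^2 at this timestep (broadcastable tensor)
    guidance_fn  -- differentiable matching score F(x_t, t), e.g. CLIP image-text similarity
    scale        -- guidance strength
    """
    # Differentiate the matching score with respect to the noisy sample.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        score = guidance_fn(x_in, t)                      # one scalar score per sample
        grad = torch.autograd.grad(score.sum(), x_in)[0]  # dF/dx_t

    # Shift the unconditional mean toward higher matching scores;
    # the diffusion model itself is never retrained.
    guided_mean = uncond_mean + scale * uncond_var * grad
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return guided_mean + uncond_var.sqrt() * noise
```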

    Our unified framework is flexible and allows fine-grained semantic control in image synthesis with various applications, as shown in Figure 1. We show that our model can handle:
(1) Text-guided image synthesis with a complex fine-grained text query on any dataset without language annotations;
(2) Image-guided image synthesis with content or style control from an input image, which generates diverse images with different pose, structure, and layout;
(3) Multimodal guidance for image synthesis with both language and image input.

Our guidance network can be injected into off-the-shelf unconditional diffusion models, without the need for finetuning or re-training the diffusion model. We conduct experiments on the FFHQ [12] and LSUN [56] datasets to validate the quality, diversity, and controllability of our generated images, and show various applications of our proposed Semantic Diffusion Guidance.

2 Related Work

Image-guided Synthesis

Several works investigate image synthesis guided by the content of a reference image. ILVR [8] proposes a way to iteratively inject image guidance into a diffusion model, yet it exhibits limited structural diversity in the generated images. Instance-Conditioned GAN [6] uses nearest-neighbor images of a given reference for adversarial training to generate structurally diverse yet semantically relevant images; however, it requires training the GAN with instance-conditioning techniques. Our approach offers better controllability: we propose different types of image guidance, so users can decide how much semantic, structural, or style information to preserve by choosing the guidance type and scale, without re-training the unconditional diffusion model.

Diffusion Models

Diffusion models are a class of generative models consisting of a forward process (signal to noise) and a reverse process (noise to signal). The denoising diffusion probabilistic model (DDPM) [47,19] is a latent variable model in which a denoising autoencoder gradually transforms Gaussian noise into a real signal. Score-based generative models [48,49] train a neural network to predict the score function, which is then used to draw samples via Langevin dynamics. In [50], it is shown that diffusion probabilistic models and score-based generative models fall under the same framework, as both can be viewed as discretizations of stochastic differential equations. Collectively, these models have demonstrated image quality comparable or superior to GANs, with better mode coverage and training stability.
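For reference, the DDPM formulation these works share can be summarized by its forward (noising) and learned reverse (denoising) transitions, following the notation of [19]:

```latex
% Forward process with variance schedule \beta_1, \dots, \beta_T
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)

% Learned reverse process
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```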

Diffusion models have also been explored for conditional generation, such as class-conditional generation, image-guided synthesis, super-resolution, and image-to-image translation [50,10,8,34]. Concurrent work [2] explored text-guided image editing with diffusion models. In this work, we further explore whether diffusion models can be semantically guided by text, by an image, or by both to synthesize realistic images.

[50] (image generation) Score-Based Generative Modeling through Stochastic Differential Equations (ICLR 2021, Oral)
https://github.com/yang-song/score_sde_pytorch

[8] (image generation) ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models
https://arxiv.org/abs/2108.02938

[10] (image synthesis) Diffusion Models Beat GANs on Image Synthesis
https://arxiv.org/abs/2105.05233

[34] (image synthesis) SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
https://arxiv.org/abs/2108.01073

CLIP-guided Generation

CLIP [41] is a powerful vision-language joint embedding model trained on large-scale image-text data. Its representations have been shown to be robust and general enough to perform zero-shot classification and various vision-language tasks on diverse datasets. StyleCLIP [39] and StyleGAN-NADA [12] have demonstrated that CLIP enables text-guided image manipulation without domain-specific image-text pairs. However, its application to image synthesis has not been explored. Our work investigates text- and/or image-guided synthesis using CLIP and an unconditional DDPM.

3 Semantic Diffusion Guidance

We propose Semantic Diffusion Guidance (SDG). The guidance module can be injected into any off-the-shelf unconditional diffusion model without re-training or finetuning it. We only need to finetune the guidance network, which is a CLIP [41] model in our implementation, on images with different levels of noise. We propose a self-supervised finetuning scheme that does not require paired language data to finetune the CLIP image encoder.
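As a rough sketch of what such a self-supervised scheme could look like (our own reading with placeholder names, not the authors' code): the encoder being finetuned sees noised images, the original frozen CLIP image encoder sees the corresponding clean images, and an InfoNCE-style contrastive loss treats each (noised, clean) pair as a positive:

```python
import torch
import torch.nn.functional as F

def selfsup_clip_finetune_loss(noisy_encoder, frozen_encoder, x0, x_t, temperature=0.07):
    """Contrastive finetuning loss for the CLIP image encoder (illustrative sketch).

    noisy_encoder  -- CLIP image encoder being finetuned on noised inputs x_t
    frozen_encoder -- original CLIP image encoder, kept fixed, applied to clean x0
    Other images in the batch serve as negatives, so no text annotations are needed.
    """
    z_noisy = F.normalize(noisy_encoder(x_t), dim=-1)       # (B, D)
    with torch.no_grad():
        z_clean = F.normalize(frozen_encoder(x0), dim=-1)   # (B, D)

    logits = z_noisy @ z_clean.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(x0.size(0), device=x0.device)    # diagonal entries are positives
    return F.cross_entropy(logits, targets)
```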

In Section 3.1, we review the preliminaries on diffusion models and introduce our approach for injecting guidance into the diffusion model for controllable image synthesis. In Section 3.2, we describe the language guidance, which enables the unconditional diffusion model to perform text-to-image synthesis. In Section 3.3, we propose two types of image guidance, which take the content and style information of the reference image as the guidance signal, respectively. In Section 3.5, we explain how we finetune the CLIP guidance network without requiring text annotations in the target domain.
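The three guidance signals can be sketched roughly as follows: language guidance scores image-text similarity in CLIP space, content guidance scores image-image similarity against a reference, and style guidance compares Gram matrices of intermediate features. The callables `clip_img_enc` and `feat_extractor` are placeholders, and the exact layers and distance measures here are simplifications, not the paper's precise choices:

```python
import torch
import torch.nn.functional as F

def language_guidance(clip_img_enc, clip_txt_emb, x_t):
    """F_lang(x_t): cosine similarity between the noised image and the text prompt."""
    z_img = F.normalize(clip_img_enc(x_t), dim=-1)
    return (z_img * F.normalize(clip_txt_emb, dim=-1)).sum(-1)

def content_guidance(clip_img_enc, x_t, x_ref):
    """F_content(x_t): similarity between global embeddings of x_t and the reference."""
    z = F.normalize(clip_img_enc(x_t), dim=-1)
    z_ref = F.normalize(clip_img_enc(x_ref), dim=-1)
    return (z * z_ref).sum(-1)

def gram(feat):
    """Gram matrix of spatial features, feat: (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_guidance(feat_extractor, x_t, x_ref):
    """F_style(x_t): negative Gram-matrix distance to the style reference."""
    diff = gram(feat_extractor(x_t)) - gram(feat_extractor(x_ref))
    return -diff.pow(2).mean(dim=(1, 2))
```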

An overview of our method.

Fig. 2: Our method is based on the DDPM model, which generates an image from a noise map by iteratively removing noise at each timestep. We control the diffusion generation process with Semantic Diffusion Guidance (SDG) using language and/or a reference image. SDG is injected iteratively at each step of the generation process; the figure illustrates the guidance at a single timestep t.


Reposted from blog.csdn.net/weixin_43135178/article/details/126794914