Image Synthesis Using Pretrained Diffusion Models


Text-to-image diffusion models achieve astonishing performance in generating realistic images that match natural language prompts. The release of open-source pretrained models such as Stable Diffusion helps democratize these techniques: pretrained diffusion models allow anyone to create stunning images without requiring large amounts of computing power or a long training process.

Although text-guided image generation offers a degree of control, obtaining an image with a predetermined composition is often tricky, even with elaborate prompting. In fact, standard text-to-image diffusion models offer little control over where the various elements appear in the generated image.

In this article, I explain MultiDiffusion, a state-of-the-art technique based on the paper MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation [1]. This technique enables greater control when placing elements in images generated by text-guided diffusion models. The method presented in the paper is more general and also allows other applications, such as generating panoramic images, but here I restrict the discussion to image synthesis using region-based text prompts. The main advantage of this approach is that it works with out-of-the-box pretrained diffusion models without costly retraining or fine-tuning.

To complement this post with code, I prepared a simple Colab notebook and a GitHub repository containing the code implementation I used to generate the images in this post. The code is based on the Stable Diffusion pipeline included in the Hugging Face diffusers library, but it implements only the parts necessary for this method, to keep it simpler and easier to read.

Diffusion Models

In this section, I review some basic facts about diffusion models. Diffusion models are generative models that produce new data by inverting a diffusion process that maps the data distribution to an isotropic Gaussian distribution. More specifically, given an image, the diffusion process consists of a series of steps, each of which adds a small amount of Gaussian noise to that image. In the limit of an infinite number of steps, the noisy image becomes indistinguishable from pure noise sampled from an isotropic Gaussian distribution.

The goal of the diffusion model is to invert this process by trying to guess the noisy image at step t-1 given the noisy image at step t. This can be done, for example, by training a neural network to predict the noise added at that step and subtracting it from the noisy image.

Once we have trained such a model, we can generate new images by sampling noise from an isotropic Gaussian distribution and using the model to reverse the diffusion process, gradually removing the noise.
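
To make this loop concrete, here is a minimal sketch of a deterministic (DDIM-style) sampling loop. The `model(x, t)` noise predictor and the `alphas_cumprod` noise schedule are placeholders standing in for whatever pretrained network and schedule is used; this is an illustration of the idea, not the exact sampler used in the article.

```python
import torch

@torch.no_grad()
def reverse_diffusion(model, shape, alphas_cumprod, device="cuda"):
    # Start from pure noise sampled from an isotropic Gaussian.
    x = torch.randn(shape, device=device)
    for t in reversed(range(len(alphas_cumprod))):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0, device=device)
        eps = model(x, t)  # network predicts the noise added at this step
        # Estimate the clean image implied by the current noisy image.
        x0 = (x - torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a_bar)
        # Deterministic move to the slightly less noisy image at step t-1.
        x = torch.sqrt(a_bar_prev) * x0 + torch.sqrt(1 - a_bar_prev) * eps
    return x
```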


A text-to-image diffusion model reverses the diffusion process while trying to reach an image that matches the description given in the text prompt. This is usually done by a neural network that, at each step t, predicts the noisy image at step t-1 conditioned not only on the noisy image at step t but also on a text prompt describing the image it is trying to reconstruct.
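
As an illustration of how the text enters each step, the sketch below calls a Stable-Diffusion-style UNet from the Hugging Face diffusers library on a noisy latent together with the prompt embedding. The classifier-free guidance weighting of a conditional and an unconditional prediction is the usual trick for strengthening prompt adherence, not something specific to this article; `unet`, `text_emb`, and `uncond_emb` are assumed to be loaded and precomputed elsewhere.

```python
import torch

@torch.no_grad()
def predict_text_conditioned_noise(unet, latents, t, text_emb, uncond_emb, guidance_scale=7.5):
    # Conditional prediction: noise estimate given the prompt embedding.
    eps_text = unet(latents, t, encoder_hidden_states=text_emb).sample
    # Unconditional prediction: same latent, empty-prompt embedding.
    eps_uncond = unet(latents, t, encoder_hidden_states=uncond_emb).sample
    # Classifier-free guidance: push the estimate towards the prompt.
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)
```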

Many image diffusion models (including Stable Diffusion) do not operate in the original image space but in a smaller, learned latent space. In this way, the required computing resources are reduced with minimal loss of quality. The latent space is usually learned with a variational autoencoder. The diffusion process in the latent space works exactly as before, allowing new latent vectors to be generated from Gaussian noise. From these, a newly generated image is obtained using the decoder of the variational autoencoder.
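
For instance, with the diffusers library the variational autoencoder of Stable Diffusion can be used roughly as follows to move between image space and latent space. This is a sketch; the model id and the [-1, 1] image normalization are assumptions matching the usual Stable Diffusion setup.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-2-base", subfolder="vae"
).to("cuda")

@torch.no_grad()
def to_latents(image):
    # image: (B, 3, H, W) tensor scaled to [-1, 1]; latents are 8x smaller spatially.
    return vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

@torch.no_grad()
def to_image(latents):
    # Undo the scaling and decode back to pixel space in [0, 1].
    decoded = vae.decode(latents / vae.config.scaling_factor).sample
    return (decoded / 2 + 0.5).clamp(0, 1)
```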

Image Composition Using MultiDiffusion

Now let's explain how the MultiDiffusion method obtains controlled image composition. The goal is to have better control over the elements generated in an image by a pretrained text-to-image diffusion model. More specifically, given a general description of an image (e.g., the living room in the cover image), we would like a series of elements specified by text prompts to appear in specific locations (e.g., a red sofa in the center, a houseplant on the left, and a painting on the top right). This is achieved by providing a set of text prompts describing the desired elements and a set of region-based binary masks specifying where each element must be depicted. For example, the image below shows the bounding boxes of the image elements in the cover image.

[Image: bounding boxes of the elements in the cover image]

The core idea of MultiDiffusion for controllable image generation is to combine multiple diffusion processes, each guided by a different prompt, into a coherent, smooth image that shows the content of each prompt in its predetermined region. The region associated with each prompt is specified by a binary mask with the same dimensions as the image. A mask pixel is set to 1 if the prompt must be depicted at that location and to 0 otherwise.
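
A simple way to build such masks is from bounding boxes at the latent resolution (a 512x512 image gives 64x64 latents for Stable Diffusion). The prompts and box coordinates below are made up purely for illustration.

```python
import torch

H, W = 64, 64  # latent-space resolution for a 512x512 image

def box_mask(x0, y0, x1, y1, h=H, w=W):
    # Binary mask that is 1 inside the bounding box and 0 elsewhere.
    m = torch.zeros(1, 1, h, w)
    m[:, :, y0:y1, x0:x1] = 1.0
    return m

prompts = [
    "a photo of a living room",   # background prompt covering the whole image
    "a red sofa",
    "a houseplant",
]
masks = torch.cat([
    box_mask(0, 0, W, H),         # whole image for the background prompt
    box_mask(16, 32, 48, 62),     # hypothetical sofa region (centre-bottom)
    box_mask(2, 24, 16, 60),      # hypothetical plant region (left side)
])
```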

More specifically, let t denote a generic step of the diffusion process running in the latent space. Given the noisy latent vector at timestep t, the model predicts the noise for each of the specified text prompts. From these predicted noises, we obtain a set of latent vectors at timestep t-1 (one per prompt) by removing each predicted noise from the latent vector at timestep t. To obtain the input for the next timestep of the diffusion process, we need to combine these different vectors together. This is done by multiplying each latent vector by the corresponding prompt mask and then taking a mask-weighted per-pixel average. Following this procedure, in the region specified by a particular mask, the latent vector follows the trajectory of the diffusion process guided by the corresponding local prompt. Combining the latent vectors at each step, before predicting the noise, ensures global cohesion of the generated image as well as smooth transitions between the different masked regions.
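
Below is a minimal sketch of one such step written against the Hugging Face diffusers API. Here `unet` is the text-conditioned denoiser, `scheduler.step` removes the predicted noise to obtain the latent at t-1, `masks` are the per-prompt binary masks from the earlier sketch, and `text_embs` are the corresponding prompt embeddings (one per prompt, however they were computed). Classifier-free guidance is omitted to keep the fusion logic visible; treat it as an illustration of the procedure rather than the reference implementation.

```python
import torch

@torch.no_grad()
def multidiffusion_step(latents, t, masks, text_embs, unet, scheduler):
    fused = torch.zeros_like(latents)
    weight = torch.zeros_like(latents)
    for mask, emb in zip(masks, text_embs):
        # Predict the noise for this prompt and take one denoising step.
        eps = unet(latents, t, encoder_hidden_states=emb).sample
        z_prev = scheduler.step(eps, t, latents).prev_sample
        # Accumulate the mask-weighted contribution of this prompt.
        fused += mask * z_prev
        weight += mask
    # Mask-weighted per-pixel average becomes the input of the next step.
    return fused / weight.clamp(min=1e-8)
```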

MultiDiffusion introduces a bootstrapping phase at the beginning of the diffusion process to achieve better adherence to tight masks. During these initial steps, the denoised latent vectors corresponding to the different prompts are not combined together; instead, each is combined with a denoised latent vector corresponding to a constant-color background. Since the layout is usually determined early in the diffusion process, this yields a better match with the specified masks, because the model can initially focus on the masked region alone to depict its prompt.
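
The sketch below illustrates the idea for a single prompt during the bootstrapping steps: outside the prompt's mask, the latent is replaced by a noised version of the latent encoding of a constant-colour image, so the model only has to account for the masked region. Both `bg_latents` (the encoded flat-colour background) and the use of the diffusers `scheduler.add_noise` call are assumptions of this sketch rather than the paper's exact implementation.

```python
import torch

@torch.no_grad()
def bootstrap_latents(latents, bg_latents, mask, t, scheduler):
    # Bring the constant-colour background latent to the current noise level.
    noise = torch.randn_like(bg_latents)
    noisy_bg = scheduler.add_noise(bg_latents, noise, t)
    # Keep the prompt's own latent inside its mask, background elsewhere.
    return mask * latents + (1 - mask) * noisy_bg
```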

Examples

In this section, I show some applications of the method. I used the pretrained Stable Diffusion 2 model hosted on Hugging Face to create all the images in this post, including the cover image.
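
For reference, loading the pretrained model and embedding the regional prompts with the diffusers library looks roughly like this. The model id matches the Stable Diffusion 2 base checkpoint on the Hugging Face Hub; the rest is a sketch that produces the `text_embs` assumed in the earlier snippets (`prompts` is the list defined above).

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base", torch_dtype=torch.float16
).to("cuda")
# Optionally swap in a deterministic DDIM scheduler.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

@torch.no_grad()
def embed_prompt(prompt: str):
    # Tokenize and encode a single prompt with the pipeline's text encoder.
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(pipe.device)
    return pipe.text_encoder(tokens)[0]  # (1, seq_len, hidden_dim)

text_embs = [embed_prompt(p) for p in prompts]
```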

As discussed, a direct application of the method is to obtain an image containing elements generated at predefined locations.


The method also allows specifying the style, or some other property, of a single element to be depicted. For example, this can be used to obtain a sharp subject on a blurred background.


The styles of the elements can also be very different, producing stunning visual effects. For example, the image below was obtained by mixing a high-quality photographic style with a Van Gogh-style painting.


Summary

In this post, we explored a method that combines different diffusion processes to improve control over the images generated by text-conditioned diffusion models. This method enhances control over the locations of the generated image elements and can also seamlessly combine elements depicted in different styles.

One of the main advantages of the described procedure is that it can be used with pre-trained text-to-image diffusion models without the need for fine-tuning, which is usually an expensive process. Another advantage is that controllable image generation is obtained through binary masks, which are easier to specify and handle than more complex conditions.

The main disadvantage of this technique is that it requires one neural-network pass per prompt at each diffusion step in order to predict the corresponding noise. Fortunately, these passes can be batched to reduce the inference-time overhead, at the cost of higher GPU memory usage. Also, some prompts (especially those specified only in a small part of the image) are sometimes ignored, or they cover a larger area than specified by the corresponding mask. While this can be mitigated with bootstrapping steps, an excessive number of them can significantly reduce the overall quality of the image, since fewer steps are left to harmonize the elements together.
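
As a sketch of the batching mentioned above, the per-prompt UNet calls can be collapsed into a single forward pass by repeating the shared latent along the batch dimension. This assumes the prompt embeddings have been stacked into one tensor; it trades GPU memory for wall-clock time, as noted.

```python
import torch

@torch.no_grad()
def batched_noise_prediction(latents, t, text_embs, unet):
    # latents: (1, C, H, W); text_embs: (N, seq_len, dim) for N prompts.
    n = text_embs.shape[0]
    latent_batch = latents.repeat(n, 1, 1, 1)  # same latent for every prompt
    # One forward pass predicts the noise for all prompts at once.
    return unet(latent_batch, t, encoder_hidden_states=text_embs).sample
```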

It is worth mentioning that the idea of combining different diffusion processes is not limited to what is described here; it can also be used for further applications such as panoramic image generation, as described in the paper MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation [1].

I hope you enjoyed this article, and if you want to dive into the technical details, you can check out this Colab notebook and GitHub repository with the code implementation.

References

[1] Omer Bar-Tal, Lior Yariv, Yaron Lipman, Tali Dekel. MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation (2023).

Source: https://towardsdatascience.com/image-composition-with-pre-trained-diffusion-models-772cd01b5022
