How can we make good use of Stable Diffusion? What role does AI play in it? And how can we get the results we want as quickly and cheaply as possible?
With these questions in mind, I began a long read of the paper High-Resolution Image Synthesis with Latent Diffusion Models (address: https://arxiv.org/abs/2112.10752).
The paper alone was confusing, so I also read the article How does Stable Diffusion work? (address: https://stable-diffusion-art.com/how-stable-diffusion-work/)
To summarize up front, the efforts behind Stable Diffusion serve two purposes:
Low-cost, efficient computation: the designed latent space, which makes training and verification far cheaper than working in pixel space.
Conditioning mechanisms: conditional control over the output. Without it, getting the picture we want would be like Monkey Coding, spending unlimited time and resources on random output.
These are the two most important and core parts of the entire content.
With the development of deep neural networks, generative models have made tremendous progress, and the mainstream ones include the following:
AutoRegressive model (AR): generates images pixel by pixel, which makes computation expensive, though the experimental results are quite good.
Variational Autoencoder (VAE): maps image to latent and latent back to image; the generated images tend to be blurry or lose detail.
Flow-based method (Glow)
Generative Adversarial Network (GAN): a generator (G) and a discriminator (D) play a game, pushing the distribution of generated images ever closer to that of real images.
Both AR models and GANs train and infer in pixel space.
▐ How does the model generate images?
Take a cat as an example: when we draw a cat, we start from a blank canvas and gradually refine the outline and the details.
From the flow in the figure, the inference process is as follows:
Generate a random noise image. The noise depends on the random seed; the same seed always produces the same noise image.
Use the noise predictor to estimate how much noise the image contains, producing a predicted noise.
Subtract the predicted noise from the current image.
Repeat steps 2 and 3 until the configured number of steps is reached.
Eventually we get a cat.
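The denoising loop above can be sketched in a few lines of Python. This is only an illustration: `predict_noise` is a hypothetical stand-in for the trained noise predictor (a U-Net in Stable Diffusion), not the real model:

```python
import numpy as np

rng = np.random.default_rng(seed=42)      # same seed -> same starting noise

def predict_noise(image, step):
    # Hypothetical stand-in for the trained noise predictor:
    # it simply "predicts" a fixed fraction of the current image as noise.
    return 0.1 * image

image = rng.standard_normal((64, 64, 3))  # step 1: a random noise image
steps = 20
for t in range(steps, 0, -1):             # steps 2-3, repeated `steps` times
    predicted = predict_noise(image, t)
    image = image - predicted             # subtract the predicted noise
```

With a real U-Net, what remains after the loop is an image rather than noise; here the array merely shrinks, which is enough to show the control flow.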
During this process, we will ask the following questions:
How to get a noise predictor?
How can we control the outcome so that we eventually get a cat, rather than a dog or something else?
Before answering these questions, here is the key definition:
We define a noise predictor ε_θ(x_t, t), where x_t is the noisy image at step t and t is the step index.
Obtaining it is a matter of training; the process is shown in the figure below:
Pick a training picture, for example a cat.
Generate a random noise image.
Superimpose the noise onto the training image to get a picture with some noise. (Between 1 and T steps of noise can be stacked here.)
Train the noise predictor to tell us how much noise was added, adjusting the model weights against the known, correct noise.
In the end we get a reasonably accurate noise predictor; in Stable Diffusion this is a U-Net model.
Through these steps we obtain both a noise encoder and a noise decoder.
PS: the noise encoder is what image2image builds on.
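The training recipe above can be sketched with a toy example. Everything here is illustrative: `add_noise` is a simplified blending schedule, and the single weight matrix `W` is a hypothetical stand-in for the U-Net, adjusted by plain gradient descent to recover the added noise:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000  # total number of noising steps

def add_noise(image, noise, t):
    # Simplified forward noising: blend image and noise by the step fraction.
    alpha = 1.0 - t / T
    return np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * noise

# Toy stand-in for the U-Net: one weight matrix trained to predict the noise.
W = np.zeros((16, 16))
picture = rng.standard_normal((16, 16))    # the "training picture"

for _ in range(500):
    noise = rng.standard_normal((16, 16))  # a random noise image
    t = int(rng.integers(1, T))
    noisy = add_noise(picture, noise, t)   # picture with some noise added
    pred = noisy @ W                       # "predict" the added noise
    # Adjust the weights toward the known, correct noise (MSE gradient).
    grad = noisy.T @ (pred - noise) / noise.size
    W -= 0.5 * grad
```

A linear model cannot really learn this task well; the point is only the shape of the loop: noise, superimpose, predict, compare against the true noise, update.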
All of the noising and noise prediction above happens in pixel space, which causes serious performance problems: a single 1024x1024 RGB image is 1024x1024x3 = 3,145,728 numbers, demanding huge computing resources. Stable Diffusion defines a latent space to solve this problem.
▐ Latent Space
Latent space rests on one theory: the manifold hypothesis.
It assumes that many high-dimensional real-world data sets actually lie on low-dimensional latent manifolds inside the high-dimensional space. Pixel space is full of hard-to-perceive high-frequency detail, and that is exactly the kind of information the latent space compresses away.
So, under this assumption, we first take a picture x in the RGB pixel domain.
There is then an encoder z = ε(x), where z is the expression of x in latent space.
Why is Latent Space feasible?
You may wonder why a VAE can compress an image into a much smaller latent space without apparently losing information.
The reason is that the high dimensionality of images is largely artificial: natural images are highly regular, so they compress into a much smaller space with little perceptible loss.
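A quick back-of-the-envelope calculation shows what this buys us, assuming the factor-of-8 per-side downsampling and 4 latent channels used by Stable Diffusion v1's VAE:

```python
# Pixel space: a 1024x1024 RGB image.
pixel_numbers = 1024 * 1024 * 3
print(pixel_numbers)        # 3,145,728 numbers, matching the figure above

# Latent space (Stable Diffusion v1): each side is downsampled by a
# factor of 8, and the latent has 4 channels instead of 3.
latent_numbers = (1024 // 8) * (1024 // 8) * 4
print(latent_numbers)       # 65,536 numbers

print(pixel_numbers / latent_numbers)  # 48x fewer values to process
```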
▐ Image generation combining latent space and the noise predictor
Generate a random latent space matrix, also called a latent representation, an intermediate expression of the image.
The noise predictor estimates the noise of this latent representation and produces a latent space noise.
Subtract the latent space noise from the latent representation.
Repeat steps 2-3 until the steps run out.
Use the VAE decoder to turn the latent representation into the final image.
Up to this point there is still no conditional control involved; following this process we would only ever end up with a random picture.
Conditioning
This part is critical: without conditional control we are back to Monkey Coding, endlessly churning out random pictures.
During the generation process above you may already have sensed a question: if all we do is remove noise from a pile of noise, why does the result end up carrying image information instead of remaining noise?
Because the noise predictor was trained to predict noise from pictures that contain real content, the noise it predicts essentially comes from training data that carries image information.
So during denoising, all kinds of image information get attached through the predicted noise.
Controlling which training information the noise predictor draws on when predicting noise is the core of conditioning.
▐ Text Conditioning
The flow chart below shows how a prompt is processed and fed to the noise predictor.
Tokenizer
The tokenizer turns natural language into numbers a computer can understand (NLP); it can only convert words into tokens. For example, dreambeach is split by the CLIP tokenizer into dream and beach: one word does not necessarily mean one token. Note also that the tokens for dreambeach are not the same as those for dream and <space>beach. The Stable Diffusion model is currently limited to 75 tokens per prompt, which is not the same thing as 75 words.
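To see why one word does not always equal one token, here is a toy greedy longest-match tokenizer over a tiny made-up vocabulary. The real CLIP tokenizer uses byte-pair encoding with a vocabulary of tens of thousands of entries; this sketch only illustrates the splitting behavior:

```python
# Tiny hypothetical vocabulary, nothing like the real CLIP BPE vocabulary.
VOCAB = {"dream", "beach", "a", "on", "the"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry matching at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character as its own token
            i += 1
    return tokens

print(tokenize("dreambeach"))  # ['dream', 'beach'] - one word, two tokens
```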
Embedding
This likewise uses OpenAI's ViT-L/14 CLIP model. An embedding is a vector of length 768; every token is converted into one, so in the example above we end up with a 4x768 matrix.
Why do we need embeddings at all?
Say we write man: could that also mean gentleman, guy, sportsman, or boy? In the vector space these words sit at increasing distances from man, and we do not necessarily need an exactly literal man. Through the embedding vectors we can decide how close the information used to generate the picture must be. The corresponding Stable Diffusion parameter is the Classifier-Free Guidance scale (CFG), which effectively multiplies that distance: the larger the scale, the less loosely related information can be drawn in and the more closely the prompt is followed; the smaller the scale, the easier it is to pull in weakly related or even unrelated information.
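The guidance combination itself is simple; here is a minimal NumPy sketch of the formula (the same combination appears in the denoising loop of the pipeline code later in this article):

```python
import numpy as np

def classifier_free_guidance(noise_uncond, noise_text, scale):
    # Push the prediction away from the unconditional one, toward the
    # text-conditioned one; a larger scale means stronger prompt adherence.
    return noise_uncond + scale * (noise_text - noise_uncond)

rng = np.random.default_rng(1)
uncond = rng.standard_normal(4)   # prediction with an empty prompt
text = rng.standard_normal(4)     # prediction conditioned on the prompt

print(classifier_free_guidance(uncond, text, 1.0))  # equals the text prediction
print(classifier_free_guidance(uncond, text, 7.5))  # a typical CFG setting
```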
How do we control the embedding?
We often find that Stable Diffusion cannot draw exactly what we want. Here we meet the first form of conditional control: textual inversion.
We give the concept we want a brand-new alias token, mapped to a precise embedding, so that exact embedding is used to generate the image. The embedding can describe a new object or an already existing one; for example, a toy cat token produces the following effect.
text transformer
Cross-attention
LoRA models modify the cross-attention module to change styles. I will study LoRA later; for now I am just quoting that sentence here.
Take blue and eyes: there is a set of images satisfying blue and a set satisfying eyes, and the model takes their intersection. A question: are the corresponding embeddings different in each case? How would the model distinguish blue planet in eye from blue eye in planet? That feels like NLP territory.
To summarize the text2img process:
Stable Diffusion generates a random latent space matrix. This is determined by the seed; with the same seed, the latent space matrix stays the same.
The noise predictor takes the noisy image and the text prompt as input and predicts the noise in latent space.
Subtract the predicted noise from the latent noise and treat the result as the new latent noise.
Keep repeating steps 2-3 for the configured number of steps, for example step = 20.
Finally, the VAE decoder turns the latent representation into the final image.
At this point we can bring in a figure from the Stable Diffusion paper.
Working through the formulas:
The x in the top-left corner is defined as an image in RGB pixel space. The encoder ε transforms it into the latent space representation z = ε(x), and a series of noise encoders then produces z_T, where T is the step count.
This noising chain is exactly the input for img2img: there, the initial noisy latent representation is this z_T obtained by repeatedly adding noise.
For text2img, the initial noisy latent representation is instead drawn at random.
To be honest I did not fully follow the finer details; this part is also explained in ControlNet, and I plan to understand it from the ControlNet side.
SD Encoder Block_1(64x64) -> SD Encoder Block_2(32x32) -> SD Encoder Block_3(16x16) -> SD Encoder Block_4(8x8) -> SD Middle Block(8x8) -> SD Decoder Block_4(8x8) -> SD Decoder Block_3(16x16) -> SD Decoder Block_2(32x32) -> SD Decoder Block_1(64x64)
This is a 64x64 -> 8x8 -> 64x64 process; exactly why will have to wait until I work through the ControlNet paper. Returning to the process diagram, the denoising step loops in the lower-left corner of the latent space, consistent with the flow described above.
Finally, the VAE decoder D outputs the image.
Read together with the figure above, the flow is fairly clear, though what a couple of the symbols in the figure stand for is still not obvious to me. The flow is even clearer alongside the Python code; I removed part of the code and kept only the key calls.
import torch
from tqdm import tqdm
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import (
    AutoencoderKL,
    LMSDiscreteScheduler,
    StableDiffusionPipeline,
    UNet2DConditionModel,
)

torch_device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 1
height, width = 512, 512      # output resolution in pixel space
num_inference_steps = 20      # the "step" count discussed above
guidance_scale = 7.5          # the CFG scale

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)

# The individual components used directly below.
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
scheduler = LMSDiscreteScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)

prompt = ["a photograph of an astronaut riding a horse"]
generator = torch.manual_seed(32)  # the Random that fixes the initial latents

# Tokenize the prompt and encode it into text embeddings.
text_input = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

# Unconditional (empty-prompt) embeddings for classifier-free guidance.
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
    [""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

# Start from a random latent space matrix at 1/8 of the pixel resolution.
latents = torch.randn(
    (batch_size, unet.in_channels, height // 8, width // 8), generator=generator
)
scheduler.set_timesteps(num_inference_steps)
latents = latents * scheduler.init_noise_sigma

# Denoising loop: predict the noise, apply CFG, step the scheduler.
for t in tqdm(scheduler.timesteps):
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, t)
    with torch.no_grad():
        noise_pred = unet(
            latent_model_input, t, encoder_hidden_states=text_embeddings
        ).sample
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (
        noise_pred_text - noise_pred_uncond
    )
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# Scale back and decode the final latents into an image with the VAE.
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample
In the code there is a Scheduler, which is essentially the executor of noising: it controls the noise strength at each step.
The Scheduler keeps adding noise, and the noise predictor predicts and removes it.
For details, see Stable Diffusion Samplers: A Comprehensive Guide (address: https://stable-diffusion-art.com/samplers/)
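As a rough illustration of what a schedule looks like, here is a minimal linear-beta schedule in the DDPM style. The schedulers Stable Diffusion actually ships are refinements of this idea, so treat the numbers as illustrative:

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    # Per-step noise variance, increasing linearly with the step index
    # (a common DDPM-style choice; real schedulers use variants of this).
    return np.linspace(beta_start, beta_end, num_steps)

betas = linear_beta_schedule(1000)
alphas_cumprod = np.cumprod(1.0 - betas)  # how much signal survives to step t
sigmas = np.sqrt((1.0 - alphas_cumprod) / alphas_cumprod)  # noise-to-signal

print(sigmas[0], sigmas[-1])  # noise strength grows monotonically with t
```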
▐ Img2Img
This was actually already explained in the flow chart above; here are the steps:
The input image is turned into a latent space representation by the VAE encoder.
Noise is added to it, T steps of noise in total, with the strength controlled by Denoising strength. There is no real looping when adding noise; the same noise is stacked T times, so it can be computed in one shot.
The noisy image and the text prompt are passed as input, and the noise predictor U-Net predicts a new noise.
Subtract the predicted noise from the noisy image.
Repeat steps 3-4 for the configured number of steps.
The VAE decoder turns the latent representation back into an image.
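The role of Denoising strength can be sketched as a timestep calculation. This follows the convention used by the diffusers library (skip the first (1 - strength) fraction of steps), shown here as a hypothetical helper:

```python
def img2img_timesteps(num_inference_steps, strength):
    # With strength s, the input image is noised only up to step fraction s,
    # so the first (1 - s) fraction of denoising steps is skipped.
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return list(range(t_start, num_inference_steps))

print(len(img2img_timesteps(50, 1.0)))  # 50: behaves like pure text2img
print(len(img2img_timesteps(50, 0.5)))  # 25: keeps more of the input image
```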
▐ Inpainting
Given the principles above, inpainting is simple: noise is added only to the inpainted region, and everything else works like img2img, so effectively only the masked part is regenerated. That is also why the edges of an inpainted area often fail to blend smoothly. If slight changes to the whole picture are acceptable, you can lower the Denoising strength and run the inpainting result through img2img once more.
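The "noise only in the inpainted region" idea can be sketched as a per-step mask blend; `blend_latents` is a hypothetical helper, not Stable Diffusion's actual implementation:

```python
import numpy as np

def blend_latents(original, denoised, mask):
    # mask == 1 inside the inpainted region, 0 elsewhere: only the masked
    # area takes the newly generated content; the rest keeps the original.
    return mask * denoised + (1.0 - mask) * original

rng = np.random.default_rng(2)
original = rng.standard_normal((8, 8))   # latents of the input image
denoised = rng.standard_normal((8, 8))   # freshly generated latents
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                     # inpaint only the central region

out = blend_latents(original, denoised, mask)
```

The hard transition at the mask boundary in this sketch is exactly why inpainted edges often look unsmooth unless the mask is feathered.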
- The text encoder combines OpenCLIP and ViT-G/14; OpenCLIP, after all, is trainable.
- Training images can be smaller than 256x256, enlarging the training set by 39%.
- The U-Net part is three times larger than in v1.5.
- The default output resolution is 1024x1024.
Some common problems with Stable Diffusion
▐ Insufficient facial detail, such as blurry eyes
▐ Too many or too few fingers
Add prompts such as beautiful hands and detailed fingers and hope that some of the generated images meet the requirement.
Alternatively, use inpainting to regenerate the hands repeatedly (the same prompt can be reused here).
We are the Scene Intelligence technology team at Taotian Group. As a team focused on driving business innovation through AI and 3D technology, backed by Taobao's rich business scenarios and massive users and data, we are committed to giving consumers innovative scenario-based shopping-guide experiences, giving merchants efficient scenario-based content-creation tools, and building Taobao's primary consumption entry point for home scenarios. We keep exploring and practicing new technologies, continuously innovating the shopping-guide experience and improving merchants' content productivity, so that users enjoy a better consumption experience and merchants operate more efficiently at lower cost.
This article was originally shared on the WeChat official account 大淘宝技术 (AlibabaMTT).