Stable Diffusion from Scratch




Stable Diffusion really came out of nowhere, ushering in the first year of AIGC. I wonder if you share my confusion: this AI tool doesn't seem all that obedient, does it?


Preface


How can we make good use of Stable Diffusion? What role does AI play in it? How can we get the results we want as quickly and cheaply as possible?


Based on this series of questions, I began a long read of the paper High-Resolution Image Synthesis with Latent Diffusion Models (address: https://arxiv.org/abs/2112.10752).

Of course, the paper alone was confusing, so I also read the article How does Stable Diffusion work? (address: https://stable-diffusion-art.com/how-stable-diffusion-work/)


To give a brief summary first, the efforts behind Stable Diffusion basically serve two purposes:

  1. Low-cost, efficient iteration: the Latent Space design.

  2. Conditioning Mechanisms: conditional control. If the image we want cannot be produced on demand, we are left with Monkey Coding, spending unlimited time and resources.


These are the two most important, central parts of the entire design.


Several ways to generate pictures


With the development of deep neural networks, generative models have made tremendous progress. The mainstream ones include the following:

  1. AutoRegressive models: generate images pixel by pixel, which makes computation expensive, although the experimental results are quite good.

  2. Variational Autoencoders (VAE): image to latent, latent to image; VAEs tend to produce images that are blurry or lacking in detail.

  3. Flow-based methods (e.g. Glow).

  4. Generative Adversarial Networks: a generator (G) and a discriminator (D) play a game that keeps pulling the distribution of generated images closer and closer to that of real images.


Both AR models and GANs are trained and run inference in pixel space.


▐How does the model generate images? 


Take a cat as an example. When we want to draw a cat, we always start from a blank canvas and keep refining the structure and the details.


For AI, a pure-noise image is the ideal blank canvas, similar to what is shown in the figure below.


From the flow in the figure, we can see that the inference process is as follows (a minimal code sketch appears after the list):

  1. Generate a random noise image. This noise depends on the random seed: the same seed always produces the same noise image.

  2. Use the noise predictor to estimate how much noise has been added to the image, producing a predicted noise.

  3. Subtract the predicted noise from the current image.

  4. Keep looping steps 2 and 3 until the configured number of steps is reached.

Eventually we will get a cat.
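Here is a minimal sketch of that loop. Everything in it is a placeholder: `noise_predictor` stands for the trained model, and real samplers rescale the update at every step rather than doing a plain subtraction; the point is only to show the predict-then-subtract structure.

```python
import torch

def generate(noise_predictor, steps=20, seed=42, shape=(3, 512, 512)):
    g = torch.manual_seed(seed)                       # step 1: same seed -> same starting noise
    image = torch.randn(shape, generator=g)
    for t in reversed(range(steps)):                  # repeat steps 2 and 3
        predicted_noise = noise_predictor(image, t)   # step 2: predict the noise in the image
        image = image - predicted_noise               # step 3: subtract the predicted noise
    return image                                      # eventually: a cat
```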


During this process, we will ask the following questions:

  1. How do we get a noise predictor?

  2. How do we control whether we eventually get a cat, rather than a dog or something else?


Before answering these questions, I will introduce some notation:

We define a noise predictor ε_θ(x_t, t), where x_t is the noisy image at step t and t denotes the t-th step.


▐How to get a noise predictor?   


This is a training process. The process is shown in the figure below:

  1. Choose a training picture, such as a cat.

  2. Generate a random noise image.

  3. Add the noise to the training image to obtain a noisy picture (here you can add anywhere from 1 to T steps of noise).

  4. Train the noise predictor to tell us how much noise was added, and adjust the model weights using the known, correct noise as the answer.


Finally we get a reasonably accurate noise predictor. In the Stable Diffusion model, this is a U-Net.
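A minimal sketch of one such training step is below. It assumes a diffusers-style setup (a `UNet2DModel`-like `unet` and a `DDPMScheduler`-like `scheduler` with an `add_noise` method); these names are assumptions, not the exact Stable Diffusion training code.

```python
import torch
import torch.nn.functional as F

def training_step(unet, scheduler, clean_images, optimizer):
    noise = torch.randn_like(clean_images)                 # step 2: a random noise image
    timesteps = torch.randint(                             # step 3: pick how many noise steps (1..T)
        0, scheduler.config.num_train_timesteps,
        (clean_images.shape[0],), device=clean_images.device,
    )
    noisy_images = scheduler.add_noise(clean_images, noise, timesteps)  # superimpose the noise
    noise_pred = unet(noisy_images, timesteps).sample      # step 4: predict how much noise was added
    loss = F.mse_loss(noise_pred, noise)                   # compare against the correct noise answer
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```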


Through this step, we end up with both a noise encoder (the forward noising process) and a noise decoder (the denoising process).

PS: the noise encoder will come into play in image2image.


The noise and noise-predictor processes described above all run in pixel space, so there are huge performance problems. For example, a 1024x1024x3 RGB image corresponds to 3,145,728 numbers, which requires enormous computing resources. Stable Diffusion defines a Latent Space to solve this problem.


  Latent Space


The proposal of Latent Space is based on a theory: the manifold hypothesis.

It assumes that many high-dimensional data sets in the real world actually lie on low-dimensional latent manifolds inside that high-dimensional space. Pixel space is like this: it contains a lot of high-frequency detail that is hard to perceive, and that is exactly the information Latent Space needs to compress away.


So based on this assumption, we first define an image in the RGB pixel space, x ∈ R^(H×W×3).

Then there is an encoder E such that z = E(x), where z is the expression of x in latent space.

There is a downsampling factor f = H/h = W/w, usually chosen as a power of two, f = 2^m. For example, Stable Diffusion v1.5 trains and runs inference on 512x512x3 images, and the intermediate Latent Space representation is 4x64x64 (f = 8). We then have a decoder D that can decode images back out of the Latent Space, x̃ = D(z) = D(E(x)).

In this process, we hope that the two images x and x̃ are as close to each other as possible.
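As a quick back-of-the-envelope check of why this helps (the factor f = 8 for v1.5 comes from the text above):

```python
# 512x512x3 pixels vs. the 4x64x64 latent that SD v1.5 actually denoises.
pixel_values = 512 * 512 * 3          # 786,432 numbers in pixel space
latent_values = 4 * 64 * 64           # 16,384 numbers in latent space
print(pixel_values / latent_values)   # 48.0 -> each denoising step handles ~48x fewer numbers
```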

The whole process is shown in the figure below:


What performs this process is our Variational Autoencoder, i.e. the VAE.

So how is the VAE trained? We need a distance metric between the generated image and the training image.

That is, a metric such as FID (Fréchet Inception Distance).

I will not go into the details, but this metric can be used to measure how faithfully the VAE model restores an image. The training process is very similar to that of the noise encoder and noise predictor.

Here is how Stable Diffusion compares with other methods on the FID metric. The table below comes from the unconditional image generation experiments; essentially it checks whether the Latent Space loses any important information.
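Beyond the benchmark numbers, the degree of restoration is easy to probe yourself with an encode/decode round trip. A rough sketch using diffusers' AutoencoderKL; the 0.18215 scaling factor follows the SD v1.x convention (the same constant appears in the pipeline code later):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

def roundtrip(image):  # image: (1, 3, 512, 512), values scaled to [-1, 1]
    with torch.no_grad():
        latents = vae.encode(image).latent_dist.sample() * 0.18215   # (1, 4, 64, 64)
        recon = vae.decode(latents / 0.18215).sample                 # back to (1, 3, 512, 512)
    return latents, recon   # compare recon against the input to judge the restoration
```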


  • Why is Latent Space feasible?


You may be wondering why VAE can compress an image into a smaller latent space without losing information.


In fact, it is much like how people understand pictures. Natural, well-formed images are not random; they are highly regular: a face has eyes and a nose, a dog has four legs and a predictable shape.
The high dimensionality of images is artificial, whereas natural images can easily be compressed into a much smaller space without losing any information.

You could say that we modify many imperceptible details of an image, such as a hidden watermark or tiny adjustments to brightness and contrast. Is it still the same image after the modification? We can only say that what it expresses is still the same, and no information has been lost.


  The image generation process combining Latent Space and the noise predictor


  1. Generate a random latent space matrix, also called a latent representation, a kind of intermediate representation.

  2. The noise predictor predicts the noise of this latent representation and produces a latent space noise.

  3. Subtract the latent space noise from the latent representation.

  4. Repeat steps 2-3 until the steps are finished.

  5. The VAE decoder turns the latent representation into the final image.


Up to this point there has been no conditioning at all. Following this process, we would only ever end up with a random image.


Conditioning


Conditioning is absolutely key. Without it, all we can do is keep Monkey Coding, producing an endless stream of random images.


You have probably already sensed a problem in the generation process above: if all we do is remove noise from a pile of noise, why does the result end up being an image that carries information, rather than just another pile of noise?


When the noise predictor is trained, it predicts noise from images that already contain content, so the noise it predicts essentially comes from training data that carries image information.


During this denoising process, the noise gets imbued with all kinds of image information.


How we steer the noise predictor toward which training data it draws on when predicting noise is the core of conditioning.


Here we discuss txt2img as the case study.

  Text Conditioning


The flow diagram below shows how a prompt is processed and handed to the noise predictor.


  • Tokenizer


As the figure shows, every word we type gets tokenized. Stable Diffusion v1.5 uses OpenAI's ViT-L/14 CLIP model for this step.

Tokenization turns natural language into numbers a computer can understand (NLP), and it can only convert words into tokens. For example, the CLIP tokenizer splits dreambeach into dream and beach: one word does not necessarily mean one token. At the same time, dreambeach is not equivalent to dream<space>beach. The Stable Diffusion model is currently limited to 75 tokens per prompt, which is not the same thing as 75 words.
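This behaviour is easy to check with the CLIP tokenizer from transformers (the exact subword spellings in the comments are illustrative; the point is the token counts):

```python
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tokenizer.tokenize("dreambeach"))    # e.g. ['dream', 'beach</w>'] -> one word, two tokens
print(tokenizer.tokenize("dream beach"))   # e.g. ['dream</w>', 'beach</w>'] -> different tokens
print(tokenizer.model_max_length)          # 77 positions = start token + 75 prompt tokens + end token
```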


  • Embedding


This, too, uses OpenAI's ViT-L/14 CLIP model. An embedding is a vector of length 768: every token is converted into a 768-dimensional vector, so in the example above we end up with a 4x768 matrix.


Why do we need embeddings at all?


Say we type man. Could that also mean gentleman, guy, sportsman, boy? In the embedding space these words sit at increasing distances from man, and we do not necessarily need a perfectly literal man. Through the embedding vectors we can decide how close the information used to generate the image has to be. The corresponding Stable Diffusion parameter is the CFG (Classifier-Free Guidance) scale. It effectively amplifies that distance, so the larger the scale, the less loosely related information gets pulled in and the more closely the output follows the prompt; the smaller the scale, the easier it is for weakly related or even unrelated information to slip in.
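For reference, classifier-free guidance is usually written as a mix of an unconditional and a conditional noise prediction with guidance scale s:

ε̂ = ε_θ(z_t, ∅) + s · (ε_θ(z_t, c) − ε_θ(z_t, ∅))

The same pattern shows up later in the pipeline code as `noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)`.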


How do we control the embedding?


We often find that Stable Diffusion cannot accurately draw what we want. Here we encounter the first way of applying conditioning: textual inversion.


We define the token we want under a brand-new alias, and that alias maps to a precise token, so the corresponding embedding can be used to generate the image without ambiguity.


The embedding here can represent a brand-new object or some other object that already exists.


For example, we can train a toy cat into the CLIP model, define the word its tokenizer maps to, and fine-tune the Stable Diffusion model at the same time; the new token corresponding to toy cat then produces the effect shown below.
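For what it's worth, this is how a trained textual-inversion embedding is typically plugged in with diffusers; the file name and the trigger word `<toy-cat>` are made-up placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion("./toy_cat_embedding.bin", token="<toy-cat>")  # register the new pseudo-token
image = pipe("a photo of <toy-cat> sitting on a sofa").images[0]
```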


This feels a bit like the idea behind LoRA; I still need to dig into LoRA properly.

text transformer


After the embeddings are obtained, they are fed to the noise predictor through the text transformer.
The transformer can handle many kinds of conditioning, such as class labels, images, and depth maps.

Cross-attention


To be honest, I am not entirely clear on what cross-attention is exactly, but here is an example that illustrates it:
say we use the prompt "A man with blue eyes". Although these are two tokens, Stable Diffusion pairs the two words together.
This ensures we get a man with blue eyes, rather than a man in blue socks or with some other blue element.
(This is the cross-attention between the prompt and the image.)

"LoRA models modify the cross-attention module to change styles." I will study LoRA later; I am just quoting that sentence here.

It feels more like there is a set for blue and a set for eyes, and we take the intersection of the two. Question: are the corresponding embeddings different? How would we distinguish blue planet in eye from blue eye in planet? That feels like NLP territory.
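For intuition, here is a bare-bones cross-attention sketch (illustration only, not Stable Diffusion's actual implementation): the queries come from the image latents and the keys/values come from the text embeddings, so every spatial position decides which prompt tokens matter to it.

```python
import torch
import torch.nn.functional as F

def cross_attention(image_feats, text_embeds, w_q, w_k, w_v):
    # image_feats: (N, d_img) flattened latent positions; text_embeds: (M, d_txt) prompt tokens
    q = image_feats @ w_q                                   # (N, d)
    k = text_embeds @ w_k                                   # (M, d)
    v = text_embeds @ w_v                                   # (M, d)
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # (N, M): which tokens matter where
    return attn @ v                                         # (N, d): text information injected per position
```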

  • Summary of the txt2img process


  1. Stable Diffusion generates a random latent space matrix. This is determined by the random seed: if the seed does not change, the latent space matrix does not change.

  2. The noise predictor takes the noisy image and the text prompt as input and predicts the noise in latent space (the predicted noise).

  3. Subtract the predicted noise from the latent noise and use the result as the new latent noise.

  4. Repeat steps 2-3 for the configured number of steps, e.g. step=20.

  5. Finally, the VAE decoder turns the latent representation into the final image.

At this point it is worth pasting a figure from the Stable Diffusion paper.



Let's work through the formulas by hand:

The x in the upper-left corner is defined as an image in RGB pixel space. Through the encoder E it is turned into the latent space representation z = E(x). A series of noising (encoding) steps then produces z_T, where T denotes the step.


This process is exactly the input for img2img: for img2img, the initial noisy latent representation is this z_T obtained by repeatedly adding noise.


For txt2img, the initial noisy latent representation is simply sampled at random.


Now start from y in the lower-right corner: y stands for the various conditioning inputs, such as text prompts. A domain-specific encoder τ_θ maps y to an intermediate representation τ_θ(y), which is then injected into the U-Net through cross-attention layers implemented as:

Attention(Q, K, V) = softmax(QK^T / √d) · V, with Q = W_Q · φ_i(z_t), K = W_K · τ_θ(y), V = W_V · τ_θ(y).

To be honest I did not fully understand the details; this part is also explained in ControlNet, and I plan to make sense of it from the ControlNet side.


In the figure, the cross-attention part clearly goes from large to small and then back from small to large, which the ControlNet diagram spells out:
SD Encoder Block_1 (64x64) -> SD Encoder Block_2 (32x32) -> SD Encoder Block_3 (16x16) -> SD Encoder Block_4 (8x8) -> SD Middle Block (8x8) -> SD Decoder Block_4 (8x8) -> SD Decoder Block_3 (16x16) -> SD Decoder Block_2 (32x32) -> SD Decoder Block_1 (64x64)
So it is a 64x64 -> 8x8 -> 64x64 process; as for exactly why, that will have to wait until I have torn through the ControlNet paper. Back in the process figure, we can see that the denoising step loops in the lower-left of the Latent Space box, consistent with the flow described above.


Finally, the VAE decoder D outputs the image.


The final objective is as follows:

L_LDM := E_{E(x), y, ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, τ_θ(y))‖²₂ ]

Read alongside the figure above it is fairly clear (the := simply means "is defined as"). The flow is even clearer when read together with the Python code; part of the code has been removed, keeping only the key calls.


```python
import torch
from tqdm.auto import tqdm
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import (
    AutoencoderKL,
    LMSDiscreteScheduler,
    StableDiffusionPipeline,
    UNet2DConditionModel,
)

# Basic settings (assumed values).
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 1
height, width = 512, 512
num_inference_steps = 20
guidance_scale = 7.5

# Load the pipeline and its individual components (the components are used directly below).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
)
vae = AutoencoderKL.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="vae"
).to(torch_device)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(torch_device)
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
).to(torch_device)
scheduler = LMSDiscreteScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)

prompt = ["a photograph of an astronaut riding a horse"]
generator = torch.manual_seed(32)

# Text conditioning: tokenize and embed the prompt.
text_input = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

# Empty-prompt embeddings for classifier-free guidance.
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
    [""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

# txt2img starts from a random latent representation.
latents = torch.randn(
    (batch_size, unet.in_channels, height // 8, width // 8), generator=generator
).to(torch_device)
scheduler.set_timesteps(num_inference_steps)
latents = latents * scheduler.init_noise_sigma

# Denoising loop in latent space.
for t in tqdm(scheduler.timesteps):
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, t)
    with torch.no_grad():
        noise_pred = unet(
            latent_model_input, t, encoder_hidden_states=text_embeddings
        ).sample
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (
        noise_pred_text - noise_pred_uncond
    )
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# Decode the final latent representation with the VAE decoder.
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample
```

This tracks the flow in the figure quite closely.
In the code there is a Scheduler, which is essentially the executor of the noising process: it mainly controls the noise strength at each step.
The Scheduler keeps adding noise, and the noise predictor predicts it and removes it.
For details, see Stable Diffusion Samplers: A Comprehensive Guide (address: https://stable-diffusion-art.com/samplers/).


  Img2Img


This was actually already explained in the flow diagram above. Here are the steps (a small code sketch follows the list):

  1. The input image is converted into a latent space representation by the VAE encoder.

  2. Noise is added to it, T rounds in total, with the strength controlled by the Denoising strength. There is no real loop when adding noise: the same noise is stacked T times, so it can be computed in a single pass.

  3. The noisy image and the text prompt are passed as input to the noise predictor U-Net, which predicts a new noise.

  4. Subtract the predicted noise from the noisy image.

  5. Repeat steps 3-4 for the configured number of steps.

  6. The VAE decoder converts the latent representation back into an image.
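A hedged sketch of this flow with diffusers' img2img pipeline; `strength` plays the role of Denoising strength (how much noise is added to the encoded input before denoising starts), and the input file name is a placeholder:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
init_image = load_image("sketch.png").resize((512, 512))   # placeholder input image

image = pipe(
    prompt="a fantasy landscape, highly detailed",
    image=init_image,
    strength=0.6,              # 0 = keep the input as-is, 1 = behave almost like txt2img
    num_inference_steps=30,
).images[0]
```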


  Inpainting


Given the principles above, Inpainting is simple: noise is only added to the inpainted region, and everything else works the same as Img2Img. It effectively regenerates only the inpainted part, which is why the edges of an inpainted area often turn out not perfectly smooth. If you can accept subtle changes to the whole picture, you can lower the Denoising strength and run the inpainting result through img2img one more time.
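A hedged inpainting sketch with diffusers (the model id is the commonly used inpainting checkpoint; the file names are placeholders); white pixels in the mask mark the region to regenerate:

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
image = load_image("portrait.png").resize((512, 512))   # placeholder source image
mask = load_image("hand_mask.png").resize((512, 512))   # white = region to repaint

result = pipe(
    prompt="beautiful hands, detailed fingers",
    image=image,
    mask_image=mask,
).images[0]
```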


Stable Diffusion v1 vs v2

Starting with v2, the CLIP part switched to OpenCLIP, which made controlling the output much harder. OpenAI's CLIP has a smaller training set and fewer parameters (OpenCLIP is about five times the size of ViT-L/14 CLIP), but ViT-L/14's training data seems to be better curated, with more art and celebrity photos, so its outputs are usually better. As a result, v2 never really took off. That hardly matters now, though: SDXL has arrived.

SDXL model

The SDXL model has 6.6 billion parameters, while v1.5 has only 0.98 billion.
It consists of a Base model and a Refiner model. The Base model handles generation, while the Refiner adds and polishes detail. You can run the Base model alone, but problems such as blurry faces and eyes still need the Refiner.

The main changes in SDXL:
  1. The text encoder combines the OpenCLIP model (ViT-G/14) with OpenAI's CLIP ViT-L. After all, OpenCLIP is trainable.
  2. Training images can be smaller than 256x256, which adds 39% more training data.
  3. The U-Net is three times larger than in v1.5.
  4. The default output resolution is 1024x1024.

Here is a comparison of the results:
From where things stand now, SDXL will sooner or later replace v1.5. In terms of output quality, v2.1 really has been left behind.

Some common problems with Stable Diffusion


  Insufficient facial detail, such as blurry eyes


This can be fixed with VAE files, a bit like SDXL's Refiner.

  Extra or missing fingers


This looks like a problem with no clean solution. Andrew's suggestion is to add prompts such as beautiful hands and detailed fingers and hope that some of the generated images meet the requirement, or to use inpainting to regenerate the hands over and over (you can reuse the same prompt for that).

About the team


We are the Scene Intelligence technology team of Taotian Group. As a team focused on driving business innovation with AI and 3D technology, backed by Taobao's rich business scenarios and massive user base and data, we are committed to providing consumers with innovative, scenario-based shopping guidance, giving merchants efficient scenario-based content creation tools, and building Taobao's primary entry point for home-scenario consumption. We keep exploring and applying new technologies; through continuous innovation and breakthroughs we improve the shopping-guide experience for users and boost merchants' content productivity, so that users enjoy a better consumption experience and merchants can operate more efficiently at lower cost.


This article is shared from the WeChat public account 大淘宝技术 (AlibabaMTT).
