Diffusion Explained in Detail

Troubleshooting errors: https://zhuanlan.zhihu.com/p/622238031

Principle: https://zhuanlan.zhihu.com/p/612854566

Solving the memory overflow problem

  • In short: adjust batch_size and n_samples (smaller values use less memory)

Principles

Diffusion models are generative models used to produce data similar to their training data. In a nutshell, a diffusion model works by iteratively adding Gaussian noise to "corrupt" the training data, and then learning how to remove that noise to recover the data.

A standard diffusion model has two main processes: forward diffusion and reverse diffusion.

  • In the forward diffusion stage, the image is corrupted by gradually adding noise until it becomes pure random noise.

  • In the reverse diffusion stage, the predicted noise is removed step by step through a Markov chain of denoising steps, recovering the data from the Gaussian noise (a minimal sketch of both stages follows this list).

  • A U-Net is well suited to this: it takes an image as input, finds a low-dimensional representation of it through downsampling (which makes it good at extracting the important attributes), and then restores the image through upsampling.
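
To make the forward stage concrete, the noising process has a closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the cumulative product of (1 - beta_t) over the noise schedule. A minimal numpy sketch (the linear schedule and step count are illustrative assumptions, not values from any particular paper):

import numpy as np

# Linear noise schedule (illustrative values); alpha_bar is the cumulative product
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def q_sample(x0, t, eps=None):
    # Closed-form forward diffusion: x_t = sqrt(alpha_bar_t)*x_0 + sqrt(1-alpha_bar_t)*eps
    if eps is None:
        eps = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.random.rand(64, 64, 3)   # stand-in for a training image
x_t = q_sample(x0, t=500)        # the larger t is, the closer x_t is to pure noise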

Mathematically, performing this denoising over T small steps works better than trying to remove all the noise at once. Repeating the process strips the noise away gradually and yields a progressively "cleaner" image: starting from complete noise and iteratively denoising produces better results than attempting to remove the noise from the original image in a single step.
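
A hedged sketch of the corresponding sampling loop, reusing the schedule from the sketch above; predict_noise stands in for a trained noise-prediction U-Net:

def p_sample_loop(predict_noise, shape):
    # DDPM-style ancestral sampling: start from pure noise and remove
    # a little of the predicted noise at each of the T steps.
    x = np.random.randn(*shape)
    for t in reversed(range(T)):
        eps_hat = predict_noise(x, t)   # noise predicted at step t
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:                       # re-inject sampling noise except at the last step
            x = x + np.sqrt(betas[t]) * np.random.randn(*shape)
    return x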


Problems

Existing problems:

However, implementing a diffusion model presents some difficulties. Because all of the Markov states must be held in memory for prediction, multiple instances of a large deep network are kept in memory at once, which makes diffusion models very memory-intensive. Furthermore, diffusion models can get bogged down in imperceptibly fine-grained detail in the image data, making training times prohibitively long (days to months). Paradoxically, fine-grained image generation is one of the diffusion model's main strengths, so this "sweet annoyance" cannot be avoided. Because the computational requirements are so high, training demands a great deal of memory and power, which has made it impossible for most researchers to implement the model in practice.

Solving the problem:

The biggest problem with the diffusion model is that it is extremely "expensive" in both time and money. Stable Diffusion emerged to solve exactly these problems. If we want to generate a 1024x1024 image, the U-Net must work with 1024x1024 noise and generate the image from it. The computation for even a single diffusion step is enormous, let alone iterating many times over the full denoising schedule. One solution is to train on smaller-resolution images and then use an additional neural network to produce the larger-resolution result (super-resolution diffusion).

The Latent Diffusion model, released in 2021, takes a different approach: instead of operating directly on images, it operates in a latent space. By encoding the original data into a smaller space, the U-Net can add and remove noise on a low-dimensional representation.
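
To see why this helps, compare tensor sizes. The numbers below are illustrative, assuming the 8x spatial downsampling and 4 latent channels common in latent-diffusion setups (a 1024x1024x3 image becomes a 128x128x4 latent):

pixel_space = 1024 * 1024 * 3      # 3,145,728 values per U-Net step in pixel space
latent_space = 128 * 128 * 4       #    65,536 values per U-Net step in latent space
print(pixel_space / latent_space)  # -> 48.0, i.e. ~48x fewer values per diffusion step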


Latent Diffusion

The "Latent Diffusion Model" (Latent Diffusion Model) combines the perception ability of GAN, the detail preservation ability of diffusion model and the semantic ability of Transformer to create a more robust and efficient generation model than all the above models. Compared with other methods, Latent Diffusion not only saves memory, but also the generated images maintain diversity and high detail, while the images also preserve the semantic structure of the data.

Any generative learning method has two main phases: perceptual compression and semantic compression.

perceptual compression

In the perceptual compression learning stage, the method must strip away high-frequency details to encapsulate the data in an abstract representation. This step is necessary for building a stable, robust representation of the environment. GANs are good at perceptual compression: they project high-dimensional, redundant pixel-space data into a latent space. A latent vector in that space is a compressed form of the original pixel image and can effectively stand in for it.

More specifically, perceptual compression is captured with an autoencoder structure: the encoder projects high-dimensional data into the latent space, and the decoder recovers the image from the latent space.
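
A minimal sketch of such an encoder/decoder pair, assuming PyTorch; the layer sizes are illustrative, not those of the paper:

import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    # Minimal convolutional autoencoder: encode 3x64x64 pixels down to an
    # 8x16x16 latent, then decode back to pixel space.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 8, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)          # project into the latent space
        return self.decoder(z), z    # reconstruction and latent

recon, z = AutoEncoder()(torch.rand(1, 3, 64, 64))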

semantic compression

In the second stage of learning, the image-generation method must capture the semantic structure present in the data. This conceptual and semantic structure preserves the context and interrelationships of the objects in an image. Transformers are good at capturing semantic structure in text and images. Combining the Transformer's generalization ability with the diffusion model's detail preservation offers the best of both worlds: a way to generate fine-grained, highly detailed images while preserving their semantic structure.

Perceptual loss

The autoencoder in a latent diffusion model captures the perceptual structure of the data by projecting it into the latent space. The authors of the paper train this autoencoder with a special loss function called "perceptual loss". This loss ensures that reconstructions stay confined to the image manifold and reduces the blurriness that appears when using pixel-space losses such as the L1/L2 loss.
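
A hedged sketch of such a loss, assuming PyTorch/torchvision: distances are measured between features of a frozen pretrained network (VGG16 here) instead of between raw pixels. This is a simplification; the paper's autoencoder training additionally includes an adversarial term:

import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen feature extractor; comparing mid-level VGG features penalizes
# perceptual differences rather than per-pixel ones.
vgg_features = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def perceptual_loss(recon, target):
    # L2 distance in feature space (a simplified stand-in for LPIPS);
    # assumes inputs are already normalized to the network's expected range
    return F.mse_loss(vgg_features(recon), vgg_features(target))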


conditional diffusion

Diffusion models are conditional models that depend on a prior. In image-generation tasks, the prior is usually text, an image, or a semantic map. To obtain a latent representation of this prior, a transformer (e.g. CLIP) is used to embed the text/image into a latent vector. The final loss function therefore depends not only on the latent representation of the original image, but also on the conditional latent embedding.
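
In pseudocode matching the style of the CLIP block further below (encoder, embed, and predict_noise are illustrative stand-ins for the VAE encoder, the conditioning transformer, and the U-Net, reusing the noise schedule from the earlier sketch), one training step looks like:

z0 = encoder(image)              # project the image into the latent space
t = np.random.randint(0, T)      # random diffusion timestep
eps = np.random.randn(*z0.shape) # the noise to be added
z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
cond = embed(text)               # conditional latent embedding
loss = np.mean((eps - predict_noise(z_t, t, cond)) ** 2)  # noise-prediction MSE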

Overall structure: [figure: the overall Latent Diffusion architecture]

CLIP

OpenAI released DALL-E and CLIP in January 2021, both multimodal models that combine images and text. DALL-E is a text-based generation model, while CLIP uses text as a supervision signal to train transferable visual models; like ViT, these two works set off a new wave of research.

The full English name of CLIP is Contrastive Language-Image Pre-training; it is a pre-training method (and model) based on contrastive text-image pairs.

CLIP's training data consists of text-image pairs: an image and its corresponding text description. The hope is that, through contrastive learning, the model learns the matching relationship between text and images. CLIP comprises two models: a Text Encoder and an Image Encoder. The Text Encoder extracts text features and can be a text transformer of the kind commonly used in NLP; the Image Encoder extracts image features and can be a CNN or a vision transformer.


  • As can be seen, we use CLIP's multimodal nature to build a dynamic classifier for a specific task: the text features extracted by the Text Encoder act as the classifier's weights, while the image features extracted by the Image Encoder are the classifier's input (sketched below).
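
In the same numpy-pseudocode style as the block below (the prompt construction is illustrative, and features are assumed L2-normalized), zero-shot classification reduces to a matrix product between the two kinds of features:

# One prompt per class, e.g. "a photo of a {label}", embedded by the Text Encoder
text_features = text_encoder(prompts)     # [num_classes, d] -- the "weights"
image_feature = image_encoder(image)      # [d]              -- the "input"
logits = image_feature @ text_features.T  # one similarity score per class
pred = logits.argmax()                    # best-matching label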


# image_encoder - ResNet or Vision Transformer
# text_encoder - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l] - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t - learned temperature parameter

# Extract image features and text features respectively
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# Linearly project both features to the same embedding dimension, then L2-normalize
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# Compute scaled pairwise cosine similarities: [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# Symmetric contrastive loss: equivalent to a cross_entropy_loss over n classes
labels = np.arange(n) # labels of the diagonal (matched) pairs
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
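
Note the design choice here: the n matched image-text pairs sit on the diagonal of the n x n similarity matrix, so each row and each column is treated as an n-way classification problem whose correct class is the diagonal element, and the learned temperature t scales the logits before the softmax.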

Having covered the principle and application of CLIP, let us look back at another question: why CLIP at all? That is, what motivated the work.

In computer vision, the most common transfer-learning approach is to pre-train on a larger-scale dataset such as ImageNet and then fine-tune on a specific downstream task. This pre-training is supervised and requires large amounts of annotated data, so the cost is high.

In recent years, self-supervised methods have emerged, including contrastive-learning methods such as MoCo and SimCLR, and image-mask methods such as MAE and BEiT. The benefit of self-supervision is that labels are no longer required.

However, whether supervised or self-supervised, these methods still require supervised fine-tuning when transferred to downstream tasks; they cannot achieve zero-shot transfer.

  • For supervised models, since the classifier pre-trained on the source dataset has a fixed number of categories, a new classifier must be defined and retrained on each new dataset.

  • For self-supervised models, the proxy task serves only as an aid to representation learning; a new classifier still has to be trained with supervision when transferring to other datasets.

  • In NLP, by contrast, pre-training based on autoregression or masked language modeling is relatively mature, and pre-trained models such as OpenAI's GPT-3 can transfer to downstream tasks directly, zero-shot. This difference arises partly because text and images are two completely different modalities, and partly because NLP models can exploit vast amounts of text collected from the Internet.


Source: https://blog.csdn.net/RandyHan/article/details/131440664