[Generative Models] The DDPM Probabilistic Diffusion Model (Principle + Code)

Foreword

AI painting has woken up from the nightmare of 2018-era DeepDream, and in 2022 OpenAI's DALL·E 2 achieved stunning results, as shown below:
[Figure: sample images generated by DALL·E 2]
AI art draws on a range of mathematical machinery such as the Transformer, VAE, ELBO, and diffusion models; diffusion models are roughly as mathematically involved as VAEs.

The diffusion model (paper: DDPM, Denoising Diffusion Probabilistic Models) received relatively little attention after its publication in 2020, because it is not as simple and direct as a GAN. Recently, however, it has become extremely popular: more than half of the related submissions to conferences such as ICLR now involve it, and the two most advanced text-to-image systems, OpenAI's DALL·E 2 and Google's Imagen, are both based on diffusion models.


1. Common generative models

Let us first compare several important generative models side by side: GAN, VAE, Flow-based Models, and Diffusion Models.

GAN consists of a generator and a discriminator: the generator produces realistic data to "fool" the discriminator, while the discriminator judges whether a sample is real or fabricated. GAN training is really two models learning from each other; one might argue it is less a "confrontation" than a rather harmonious exchange.

VAE likewise trains a generative model x = g(z), which maps a sampled latent distribution onto the probability distribution of the training set. The latent variable z contains both data information and noise; besides reconstructing the input samples, it can also be used to generate new data.
[Figure: schematic comparison of GAN, VAE, flow-based models, and diffusion models]

Diffusion Models are inspired by non-equilibrium thermodynamics. The theory first defines a Markov chain of diffusion steps that slowly adds random noise to the data, and then learns the reverse diffusion process to construct the desired data samples from noise. Unlike VAEs or flow models, diffusion models are learned with a fixed procedure, and the latent variable has high dimensionality (the same as the original data).

2. Intuitive understanding of Diffusion model

A generative model is essentially a probability distribution. As shown in the figure, the left side is a training set whose samples are drawn independently and identically distributed from some distribution p_data. On the right is its generative model (probability distribution): we look for a distribution p_θ that is as close as possible to p_data, and then draw new samples from p_θ to obtain a steady stream of new data.
[Figure: training data sampled from p_data (left) and the fitted model distribution p_θ (right)]
However, p_data is usually very complex in form, and images are very high-dimensional; it is difficult for us to traverse the entire space, and the data samples we can observe are limited.
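Here "closest" is typically measured with the KL divergence, and minimizing it is equivalent to maximum-likelihood training. This standard identity (added here for reference) connects the picture above to the training objective derived later:

$$\min_\theta D_{KL}\big(p_{\text{data}} \,\|\, p_\theta\big) = \min_\theta \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_{\text{data}}(x) - \log p_\theta(x)\big] \iff \max_\theta \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big]$$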

The role of diffusion:

We can add noise to any distribution, including the p_data we care about, so that it eventually becomes a pure-noise distribution N(0, I). How should we understand this?

From the perspective of probability distributions, consider the two-dimensional Swiss-roll-shaped joint distribution p(x, y) in the figure below. The diffusion process q is very intuitive: the originally concentrated, orderly sample points are perturbed by noise and spread outward, eventually becoming a completely disordered noise distribution.
[Figure: a 2D Swiss-roll distribution gradually diffusing into isotropic Gaussian noise]
Viewing the same process on a single image sample, the diffusion process q keeps adding noise to the image until it becomes pure noise, while the reverse diffusion process p generates an image from pure noise. The sample evolves as follows:
[Figure: an image progressively noised by q and then denoised by p]
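As a minimal illustration (not from the original post), the following sketch diffuses 2D Swiss-roll points step by step with the q process; sklearn's make_swiss_roll, the constant noise level, and the step count are arbitrary choices of mine:

import numpy as np
from sklearn.datasets import make_swiss_roll

# Sample 2D points from a Swiss-roll distribution (columns 0 and 2 of the 3D roll)
points, _ = make_swiss_roll(n_samples=1000, noise=0.5, random_state=0)
x = points[:, [0, 2]] / 10.0          # rough normalization

T, beta = 100, 0.02                   # diffusion steps and a constant noise level
for t in range(T):
    eps = np.random.randn(*x.shape)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps   # one q step: shrink signal, add noise

# After enough steps, x is statistically close to N(0, I)
print(x.mean(axis=0), x.std(axis=0))  # roughly [0, 0] and [1, 1]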

3. Formal analysis of Diffusion model

Being generative models, Diffusion Models are used to generate data similar to the training data. Fundamentally, they work by corrupting the training data with progressively added Gaussian noise, and then learning to recover the data by reversing this noising process.

At test time, randomly sampled noise is fed into the model, and the learned denoising process generates data. This is the basic principle illustrated in the figure below.

[Figure: overview of the noising and denoising pipeline]
More specifically, a diffusion model is a latent-variable model that maps to the latent space through a Markov chain. Along the chain, noise is gradually added to the data at each time step t, yielding the posterior q(x_{1:T} | x_0), where x_1, …, x_T are the latent variables. In other words, the latent space of Diffusion Models has the same dimension as the input data.

Posterior probability: in Bayesian statistics, the posterior probability of a random or uncertain event is the conditional probability obtained after the relevant evidence or data is taken into account. (Wikipedia)

A Markov chain is a stochastic process that transitions from one state to another within a state space. It requires the "memoryless" property: the probability distribution of the next state is determined only by the current state and is independent of the earlier events in the sequence.

Diffusion Models are divided into a forward diffusion process and a reverse diffusion process. The figure below shows the diffusion process: from start to finish it is a Markov chain, i.e. a random process that transitions from one state to another in a state space; beneath it is the corresponding image diffusion process.

[Figure: the forward Markov chain x_0 → x_1 → … → x_T, with the corresponding images gradually dissolving into noise]
Finally, the real input image x_0 is asymptotically transformed by the diffusion process into a pure Gaussian-noise image x_T.

Model training focuses on the reverse diffusion process. The goal is to learn the inverse of the forward process: the probability distribution p_θ(x_{t-1} | x_t). By traversing the Markov chain backwards, new data x_0 can be generated.

The biggest difference between Diffusion Models and GANs or VAEs is that data is not produced by a single forward pass of a model; instead it is generated along a Markov chain, with the model learning to predict the noise.
[Figure: data generation along the learned reverse Markov chain]
Besides generating high-quality images, another advantage of Diffusion Models is that there is no adversarial game during training. Adversarial training of a GAN is notoriously hard to debug, because the two competing models are a black box to us. Moreover, in terms of training efficiency, the diffusion model is scalable and parallelizable. How to speed up training, how to add more mathematical structure and constraints, and how to extend it to speech, text, and 3D are active directions producing many new papers.

4. Detailed explanation of the Diffusion Model (mathematical derivation)

As stated above, Diffusion Models consist of a forward process (diffusion process) and a reverse process (reverse diffusion process): the input data is gradually noised, and the noise is then transformed back into a sample from the target distribution. The principle is a Markov chain plus conditional probability distributions; the core question is how to use a neural network to model the probability distributions of this Markov process.

1. Forward process (diffusion process)

Given data x_0 ~ q(x_0), the forward process adds Gaussian noise at each step according to a variance schedule β_1, …, β_T:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
Two important properties are used in the implementation and derivation:

Property 1: the reparameterization trick
The reparameterization trick is used in many works (Gumbel-Softmax, VAE). If we sample directly from a distribution (say, a Gaussian), the sampling operation blocks gradient backpropagation, and sampling x_t from Gaussian noise happens everywhere in diffusion. We therefore use the reparameterization trick to make the operation differentiable:

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

so that the randomness is isolated in ε and gradients flow through μ and σ.
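A minimal PyTorch sketch of the trick (illustrative; the variable names are mine):

import torch

mu = torch.zeros(4, requires_grad=True)        # learnable mean
log_sigma = torch.zeros(4, requires_grad=True) # learnable log-std

# Sampling z ~ N(mu, sigma^2) directly is not differentiable w.r.t. mu and sigma.
# Instead, sample eps ~ N(0, I) and shift/scale it, so gradients can flow:
eps = torch.randn(4)
z = mu + log_sigma.exp() * eps

z.sum().backward()                             # gradients reach mu and log_sigma
print(mu.grad, log_sigma.grad)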

Property 2: x_t at any time step can be expressed directly in terms of x_0 and β_t. With α_t = 1 − β_t and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0,\mathbf{I}), \quad \text{i.e.}\quad q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)$$
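Property 2 in code form, as a small sketch (the linear β schedule here is only an example; the schedules actually used appear in the code section later):

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)   # ᾱ_t = ∏ α_s

def q_sample(x0, t, noise):
    # Jump straight from x_0 to x_t in one step (Property 2)
    sqrt_ab = alphas_cumprod[t].sqrt()
    sqrt_one_minus_ab = (1.0 - alphas_cumprod[t]).sqrt()
    return sqrt_ab * x0 + sqrt_one_minus_ab * noise

x0 = torch.randn(3, 32, 32)                     # a stand-in "image"
xt = q_sample(x0, t=500, noise=torch.randn_like(x0))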

2. Reverse diffusion process

If the forward process is the noising process, then the reverse process is the denoising process of diffusion.

If we could reverse the above process and sample from q(x_{t-1} | x_t), we could recover the original distribution x_0 ~ q(x) starting from Gaussian noise x_T ~ N(0, I). Reference 7 proves that if q(x_t | x_{t-1}) is Gaussian and β_t is small enough, then q(x_{t-1} | x_t) is also Gaussian. However, we cannot simply infer q(x_{t-1} | x_t), so we use a deep learning model (with parameters θ; the current mainstream is a U-Net + attention architecture) to predict such a reverse distribution p_θ (similar to VAE):

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big) \tag{3}$$
In the paper, however, the authors simply fix the variance of the conditional probability p_θ(x_{t-1} | x_t) to β_t, instead of using the network-estimated Σ_θ(x_t, t) mentioned above, so in practice only the mean needs to be estimated by the network.

Both the forward and reverse diffusion processes are Markov chains of Gaussian conditional distributions, applied step by step. The only difference is that in forward diffusion the mean and variance of each conditional Gaussian are fixed in advance (they depend on β_t and x_0), whereas in the reverse process the mean and variance must be learned by our network.

3. Derivation of the reverse conditional probability

Although we cannot obtain the reverse distribution q(x_{t-1} | x_t) directly, once x_0 is known, q(x_{t-1} | x_t, x_0) can be written down in closed form. It takes roughly the following shape:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\big)$$

Bayes' rule:

$$q(x_{t-1} \mid x_t, x_0) = q(x_t \mid x_{t-1}, x_0)\,\frac{q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$$

Substituting the Gaussian forms of the forward process into this formula gives:

$$q(x_{t-1} \mid x_t, x_0) \propto \exp\!\left(-\frac{1}{2}\left(\frac{(x_t-\sqrt{\alpha_t}\,x_{t-1})^2}{\beta_t} + \frac{(x_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\,x_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(x_t-\sqrt{\bar{\alpha}_t}\,x_0)^2}{1-\bar{\alpha}_t}\right)\right)$$

(In the original derivation, Eq. 7-1 is substituted into Bayes' rule (2) and Eq. 7-2 into product formula (1), which yields Eq. 7-3, the combined exponent above.)

The probability density function of the univariate normal distribution is

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

which can be substituted to obtain Eq. (7-4).

Eq. (7-5) can be organized into the form $\frac{1}{2}(ax^2 + bx + c)$ in $x_{t-1}$, that is, $\frac{a}{2}\big(x + \frac{b}{2a}\big)^2 + C$, whose mean is $-\frac{b}{2a}$ and whose variance is $\frac{1}{a}$. A little tidying then gives the variance and mean in (6):

$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t, \qquad \tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0$$

By Property 2 we know $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_t\big)$; substituting this into the formula above gives:

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right)$$

We can see that, given x_0, the mean of the posterior conditional Gaussian depends only on the hyperparameters, x_t, and ε_t, while the variance depends only on the hyperparameters.

With the variance and mean above, we have obtained the analytical form of q(x_{t-1} | x_t, x_0).
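The analytical posterior in code, continuing the schedule sketch from Property 2 (a sketch of mine for a single scalar timestep t, not library code):

def q_posterior(x0, xt, t):
    # Mean and variance of q(x_{t-1} | x_t, x_0)
    ab_t = alphas_cumprod[t]
    ab_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    alpha_t, beta_t = alphas[t], betas[t]

    mean = (alpha_t.sqrt() * (1 - ab_prev) / (1 - ab_t)) * xt \
         + (ab_prev.sqrt() * beta_t / (1 - ab_t)) * x0
    var = (1 - ab_prev) / (1 - ab_t) * beta_t   # this is β̃_t
    return mean, var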

4. Training Loss

How do we train Diffusion Models to obtain the mean μ_θ(x_t, t) and variance Σ_θ(x_t, t) in Eq. (3)? From VAE we already know the role of maximum likelihood estimation: when the real training samples are known and the model parameters are to be determined, maximum likelihood estimation can be used.

In statistics, the likelihood function is a function of the parameters of a statistical model. Given an output x, the likelihood function L(θ|x) of the parameter θ is (numerically) equal to the probability of the variable X given θ: L(θ|x) = P(X = x | θ).

Diffusion Models use maximum likelihood estimation to find the Markov-chain transition distributions of the reverse process; this is the training objective. That is, we maximize the log-likelihood of the model's predicted distribution, which from the loss perspective means minimizing the negative log-likelihood:

$$L = \mathbb{E}_{q(x_0)}\big[-\log p_\theta(x_0)\big]$$

The procedure is much like a VAE: the negative log-likelihood can be optimized via a variational lower bound on the log-likelihood (equivalently, a variational upper bound on the negative log-likelihood, commonly called the VLB).

The KL divergence is an asymmetric statistical distance that measures how much one probability distribution P differs from another distribution Q. For continuous distributions it takes the form:

$$D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x)\,\log\frac{p(x)}{q(x)}\,dx$$

Properties of the KL divergence: $D_{KL}(P \,\|\, Q) \ge 0$, with equality if and only if P = Q; and in general $D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$ (asymmetry).

From the non-negativity of the KL divergence we obtain:

$$-\log p_\theta(x_0) \le -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T} \mid x_0)\,\|\,p_\theta(x_{1:T} \mid x_0)\big) = \mathbb{E}_{q(x_{1:T}\mid x_0)}\left[\log \frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\right] =: L_{VLB}$$

Further, this upper bound on the cross-entropy can be expanded along the Markov chain and simplified into three groups of terms:

$$L_{VLB} = \underbrace{D_{KL}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0}$$

Next we discuss these three cases separately:

First, since the forward process q has no learnable parameters and x_T is pure Gaussian noise, L_T is a constant and can be ignored.

Then, L_{t-1} is a KL divergence, which can be seen as minimizing the distance between two distributions (a small numerical check follows this list):

  1. The first distribution, q(x_{t-1} | x_t, x_0), whose analytical form we derived in the previous section, is a Gaussian with mean and variance
    $$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right), \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$
  2. The second distribution, p_θ(x_{t-1} | x_t), is the target distribution our network is fitting. It is also a Gaussian; its mean is estimated by the network and its variance is set to a constant related to β_t:
    $$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 \mathbf{I}\big)$$
    If two distributions p and q are both Gaussian, their KL divergence has a closed form:
    $$D_{KL}\big(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
    Because the variances of the two distributions here are both constants and irrelevant to optimization, the objective in fact reduces to the squared L2 distance between the two means:
    $$L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\,\big\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\big\|^2\right] + C$$
    Parameterizing the network's mean in the same form, $\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\big)$, and substituting it into the previous formula gives:
    $$L_{t-1} = \mathbb{E}_{x_0,\epsilon}\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\right]$$
    After this derivation it is simply an L2 loss: the network takes as input the image linearly combined with noise, and estimates that noise:
    $$L_{simple} = \mathbb{E}_{t,x_0,\epsilon}\left[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\right]$$
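As a quick numerical check of the Gaussian-KL fact above (illustrative, using torch.distributions; the example numbers are arbitrary):

import torch
from torch.distributions import Normal, kl_divergence

p = Normal(loc=torch.tensor(0.5), scale=torch.tensor(1.2))
q = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.2))

# With equal variances the closed form reduces to ||mu1 - mu2||^2 / (2 sigma^2):
closed_form = (p.loc - q.loc) ** 2 / (2 * q.scale ** 2)
print(kl_divergence(p, q).item(), closed_form.item())  # both are about 0.0868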

5. Training and testing pseudocode

[Figure: Algorithm 1 (Training) and Algorithm 2 (Sampling) from the DDPM paper]

1. Training

[Figure: Algorithm 1 (Training)]
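The training algorithm in PyTorch form, as a minimal sketch. It assumes a noise-prediction network eps_model(x, t), a data iterator data_loader, and the schedule tensors (betas, alphas, alphas_cumprod, T) from the Property 2 sketch above; it is not the repository's exact code:

optimizer = torch.optim.Adam(eps_model.parameters(), lr=2e-4)

for step in range(num_steps):
    x0 = next(data_loader)                            # a batch of clean images (B, C, H, W)
    t = torch.randint(0, T, (x0.shape[0],))           # uniformly sampled timesteps
    eps = torch.randn_like(x0)                        # target noise

    # x_t via Property 2; view() broadcasts the per-sample scalars over C, H, W
    ab = alphas_cumprod[t].view(-1, 1, 1, 1)
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps

    loss = (eps_model(xt, t) - eps).square().mean()   # the simplified L2 objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()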

2. Test

[Figure: Algorithm 2 (Sampling)]
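And a matching sketch of the sampling algorithm under the same assumptions, with the variance fixed to σ_t² = β_t as in the paper:

@torch.no_grad()
def sample(eps_model, shape):
    x = torch.randn(shape)                            # start from x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)
        # mean of p_theta(x_{t-1} | x_t):
        x = (x - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:                                     # add noise at every step except the last
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x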

6. Code analysis

Here is a simple DDPM project recommended for study, trained on the CIFAR-10 dataset:
github.com/abarankab/DDPM
See the repository for usage details.

1. train_cifar.py

import torch
from torch.utils.data import DataLoader
from torchvision import datasets

# 1. Define the model (a UNet inside; expanded below)
diffusion = script_utils.get_diffusion_from_args(args).to(device)
diffusion.load_state_dict(torch.load(args.model_checkpoint))  # optional: resume from a checkpoint

# 2. Optimizer
optimizer = torch.optim.Adam(diffusion.parameters(), lr=args.learning_rate)

# 3. Load the dataset from torchvision
train_dataset = datasets.CIFAR10(root='./cifar_train', train=True,
            download=True, transform=script_utils.get_transform())
train_loader = script_utils.cycle(DataLoader(train_dataset, batch_size=batch_size,
            shuffle=True, drop_last=True, num_workers=2))  # num_workers must be >= 0

for iteration in range(1, 80000):
    diffusion.train()
    x, y = next(train_loader)

    if args.use_labels:
        loss = diffusion(x, y)
    else:
        loss = diffusion(x)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Expansion 1: defining diffusion

model = UNet(img_channels=3, base_channels=128)

# Generate the β schedule for T = 1000 steps (roughly 1e-4 to 0.02)
if args.schedule == "cosine":
    betas = generate_cosine_schedule(args.num_timesteps)        # num_timesteps = 1000
else:
    betas = generate_linear_schedule(args.num_timesteps,
        1e-4 * 1000 / args.num_timesteps,
        0.02 * 1000 / args.num_timesteps)

diffusion = GaussianDiffusion(model, (32, 32), 3, 10, betas,
    ema_decay=0.9999, ema_update_rate=1, ema_start=2000, loss_type='l2')

return diffusion
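For reference, the linear schedule is likely implemented along these lines (a sketch under that assumption, not necessarily the repository's exact code):

import numpy as np

def generate_linear_schedule(T, low, high):
    # β values linearly spaced between low and high, one per timestep
    return np.linspace(low, high, T)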

Expansion 2: UNet
The UNet consists of time_mlp, init_conv(3, 128), down (12 ResidualBlock layers), mid, and up (12 ResidualBlock layers). time_mlp is the learnable embedding of time step t; its definition is given below.
GaussianDiffusion holds a set of precomputed hyperparameters, such as β, the cumulative products of α, and so on:
[Figure: the buffers precomputed inside GaussianDiffusion (β, α, ᾱ, their square roots, etc.)]

class PositionalEmbedding(nn.Module):
    __doc__ = r"""Computes a positional embedding of timesteps.

    Input:
        x: tensor of shape (N)
    Output:
        tensor of shape (N, dim)
    Args:
        dim (int): embedding dimension
        scale (float): linear scale to be applied to timesteps. Default: 1.0
    """

    def __init__(self, dim, scale=1.0):
        super().__init__()
        assert dim % 2 == 0
        self.dim = dim
        self.scale = scale

    def forward(self, x):
        device = x.device
        half_dim = self.dim // 2
        emb = math.log(10000) / half_dim
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = torch.outer(x * self.scale, emb)
        emb = torch.cat((emb.sin(), emb.cos()), dim=-1)
        return emb

self.time_mlp = nn.Sequential(
            PositionalEmbedding(dim=base_channels, scale=1.0),  # base_channels = 128
            nn.Linear(128, 512),
            nn.SiLU(),
            nn.Linear(512, 512),
        )
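A quick usage check of the embedding class above (illustrative):

t = torch.tensor([1.0, 10.0, 100.0])   # three timesteps
emb = PositionalEmbedding(dim=128)(t)
print(emb.shape)                       # torch.Size([3, 128])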

Expansion 3: loss = diffusion(x)

b, c, h, w = x.shape                    # x: (128, 3, 32, 32); y: the corresponding 128 labels

t = torch.randint(0, self.num_timesteps, (b,), device=device)
# randomly pick 128 timesteps t from [0, 1000)

return self.get_losses(x, t, y)

def get_losses(self, x, t, y):
    noise = torch.randn_like(x)         # random Gaussian noise

    # 1. Express x_t in terms of x_0 (Property 2); the commented lines show the actual operation:
    perturbed_x = self.perturb_x(x, t, noise)
    # perturbed_x = extract(self.sqrt_alphas_cumprod, t, x.shape) * x + \
    #               extract(self.sqrt_one_minus_alphas_cumprod, t, x.shape) * noise

    # 2. Estimate the noise from the perturbed image:
    estimated_noise = self.model(perturbed_x, t, y)
    # 2.1 time_emb = self.time_mlp(t)                                    # (128,) -> (128, 512)
    #     emb = math.log(10000) / half_dim                               # ln(10000)/64 ≈ 0.144
    #     emb = torch.exp(torch.arange(half_dim, device=device) * -emb)  # (64,): [1.0, 0.87, ..., 1e-4]
    #     emb = torch.outer(t * self.scale, emb)                         # (128, 64) outer product
    #     emb = torch.cat((emb.sin(), emb.cos()), dim=-1)                # (128, 128)
    #     time_emb = the Linear/SiLU layers of time_mlp                  # (128, 512)
    #
    # 2.2 for layer in self.downs:
    #         x = layer(x, time_emb, y)   # adds time_emb to the features, i.e.:
    #         out += self.time_bias(self.activation(time_emb))[:, :, None, None]
    #         # self.time_bias is Linear(512, 128); activation is SiLU; added directly to the features
    #     for layer in self.mid:
    #         x = layer(x, time_emb, y)
    #     for layer in self.ups:
    #         x = layer(x, time_emb, y)
    #     x = self.activation(self.out_norm(x))
    #     x = self.out_conv(x)            # the output is the noise (same shape as the input)

    if self.loss_type == "l1":
        loss = (estimated_noise - noise).abs().mean()
    elif self.loss_type == "l2":
        loss = (estimated_noise - noise).square().mean()

    return loss

2. sample_images.py (sampling / prediction)

x = torch.randn(batch_size, self.img_channels, *self.img_size, device=device)
# sample random Gaussian noise as x_T

for t in range(self.num_timesteps - 1, -1, -1):        # T = 1000
    t_batch = torch.tensor([t], device=device).repeat(batch_size)
    x = self.remove_noise(x, t_batch, y, use_ema)      # obtain x_{t-1}; the concrete operation is:
    # x = (x - extract(self.remove_noise_coeff, t, x.shape) * self.model(x, t, y)) \
    #     * extract(self.reciprocal_sqrt_alphas, t, x.shape)

The last line of code is exactly the mean of the reverse step:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$

Summary

  1. The Diffusion Model is expressed as a parameterized Markov chain, meaning that each latent variable x_1, …, x_T at time step t depends only on the previous time step t-1, which greatly simplifies the subsequent computation.
  2. The transition distributions p_θ(x_{t-1} | x_t) of the Markov chain are Gaussian. The Gaussian parameters of the forward diffusion process are set directly, while those of the reverse process are learned.
  3. The Diffusion Model's network is highly scalable and robust: any network whose input and output dimensions match can be used, for example a UNet-like architecture, keeping the input and output tensor dimensions of the model equal.
  4. The purpose of the Diffusion Model is to maximize the likelihood of the input data; in practice this means adjusting the model parameters through training to minimize the variational upper bound of the negative log-likelihood of the data.
  5. In the conversion between probability distributions, the Markov assumption lets the variational upper bound in point 4 be computed with closed-form KL divergences, avoiding Monte Carlo sampling.

Origin: blog.csdn.net/qq_45752541/article/details/127956235