Diffusion Model

Summary

The diffusion model is a generative model with an Encoder-Decoder style architecture, divided into a diffusion stage and an inverse diffusion stage. In the diffusion stage, noise is continuously added to the original data so that it moves from its original distribution to a distribution we choose; for example, by repeatedly adding Gaussian noise, the original data distribution is turned into a normal distribution. In the inverse diffusion stage, a neural network is used to restore the data from the normal distribution back to the original data distribution. Its advantage is that each point of the normal distribution maps to real data, which gives the model good interpretability. Its disadvantage is that sampling is iterative and slow, so model training and prediction are inefficient.




1. Introduction

The diffusion model is divided into a diffusion process and an inverse diffusion process. The diffusion process continuously adds Gaussian noise to the original data until the data becomes Gaussian distributed, i.e. $X_0 \rightarrow X_T$. The inverse diffusion process restores the data from Gaussian noise, i.e. $X_T \rightarrow X_0$.

(Figure: the diffusion process $X_0 \rightarrow X_T$ and the inverse diffusion process $X_T \rightarrow X_0$.)

2. Diffusion process

2.1 Defining the Diffusion Process

Assuming that the diffusion process is a Markov chain, Gaussian noise is continuously added to the original data, and each noising step goes from $X_{t-1}$ to $X_t$, so the step is defined by the formula:

$$q(x_t|x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

This formula means that the transition from $x_{t-1}$ to $x_t$ is a Gaussian transformation with mean $\sqrt{1-\beta_t}\,x_{t-1}$ and variance $\beta_t$.

2.2 Reparameterization trick to obtain the iterative formula

Using the reparameterization trick, the formula for adding Gaussian noise at each step is as follows:

$$X_t = \sqrt{1-\beta_t}\,X_{t-1} + \sqrt{\beta_t}\,Z_t$$

  • $X_t$ denotes the data distribution at time $t$
  • $Z_t$ denotes the Gaussian noise added at time $t$, generally fixed as a Gaussian distribution with mean 0 and variance 1
  • $\sqrt{1-\beta_t}\,X_{t-1}$ represents the mean of the distribution at the current time
  • $\sqrt{\beta_t}$ denotes the standard deviation of the distribution at the current time (standard deviation $= \sqrt{\text{variance}}$)

Note: $\beta_t$ is a preset constant between 0 and 1, so the diffusion process contains no learnable parameters.
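
As a small illustration (a sketch, not part of the original post), the per-step formula above can be implemented directly with the reparameterization trick. The names `x_prev` and `beta_t` below are illustrative: an arbitrary data tensor and one scalar from the noise schedule.

import torch

def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    """One diffusion step: X_t = sqrt(1 - beta_t) * X_{t-1} + sqrt(beta_t) * Z_t."""
    z_t = torch.randn_like(x_prev)                       # Z_t ~ N(0, I)
    return (1 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * z_t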

2.3 Get the global diffusion formula

From the iterative formula in 2.2, it can be seen that the only parameter in the diffusion process is $\beta$, and $\beta$ is a preset constant, so there are no unknown parameters to learn during the diffusion process. Therefore, knowing only the initial data distribution $X_0$ and the schedule $\beta_t$, the distribution $X_t$ at any time can be obtained directly. The specific formula is as follows:

$$X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,Z$$

  • $X_0$ is the distribution of the original data
  • $\alpha_t = 1 - \beta_t$
  • $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$
  • $Z$ is a Gaussian distribution with mean 0 and variance 1
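
As a quick sanity check (a sketch that is not part of the original post), one can verify numerically that noising step by step with the per-step formula from 2.2 produces the same distribution as the closed-form formula above. The schedule below simply reuses the one defined later in 2.4.2.

import torch

torch.manual_seed(0)
T = 100
betas = torch.sigmoid(torch.linspace(-6, 6, T)) * (0.5e-2 - 1e-5) + 1e-5
alphas_bar = torch.cumprod(1 - betas, dim=0)

x0 = torch.ones(100000)              # a large batch of identical starting points
x = x0.clone()
for t in range(T):                   # step-by-step noising
    x = torch.sqrt(1 - betas[t]) * x + torch.sqrt(betas[t]) * torch.randn_like(x)

# closed-form jump straight to X_T
x_direct = torch.sqrt(alphas_bar[-1]) * x0 + torch.sqrt(1 - alphas_bar[-1]) * torch.randn_like(x0)

# both should have (approximately) the same mean and variance
print(x.mean().item(), x_direct.mean().item())   # both close to sqrt(alpha_bar_T)
print(x.var().item(), x_direct.var().item())     # both close to 1 - alpha_bar_T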

2.4 Diffusion process realization code

2.4.1 Summary of the diffusion formula

From 2.3, the formula of the diffusion process is:

$$X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,Z$$

where:

  • $X_0$ is the distribution of the original data
  • $\alpha_t = 1 - \beta_t$
  • $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$
  • $Z$ is a Gaussian distribution with mean 0 and variance 1

2.4.2 Code

  1. Use make_s_curve to generate example data and obtain $X_0$:

    # imports needed for this and the following code
    import numpy as np
    import torch
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_s_curve

    # get the original data X_0
    s_curve, _ = make_s_curve(10**4, noise=0.1)
    x_0 = s_curve[:, [0, 2]]/10.0
    # check the shape
    print(np.shape(x_0))
    # plot the data
    data = x_0.T
    fig, ax = plt.subplots()
    ax.scatter(*data, color='red', edgecolor='white')
    ax.axis('off')
    # keep the data as a (10000, 2) tensor for the later code
    dataset = torch.Tensor(x_0).float()
    

    (Figure: scatter plot of the original S-curve data $X_0$.)

  2. Assume there are 100 time steps and set the $\beta$ for all time steps:

    num_steps = 100
    betas = torch.linspace(-6, 6, num_steps)
    betas = torch.sigmoid(betas)*(0.5e-2 - 1e-5)+1e-5
    

    $\beta$ is a very small number between 0 and 1; here the maximum value is 0.5e-2 and the minimum value is 1e-5.

  3. Obtain $\alpha$ ($\alpha = 1 - \beta$):

    alphas = 1 - betas
    
  4. Obtain $\bar{\alpha}_t$ for each time step ($\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$):

    alphas_bar = torch.cumprod(alphas, 0)
    
  5. Obtain $\sqrt{\bar{\alpha}_t}$:

    alphas_bar_sqrt = torch.sqrt(alphas_bar)
    
  6. Obtain $\sqrt{1-\bar{\alpha}_t}$:

    one_minus_alphas_bar_sqrt = torch.sqrt(1-alphas_bar)
    
  7. Given $X_0$ and a time $t$, obtain $X_t$, i.e. $X_t = \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,Z$:

    def x_t(x_0, t):
        # sample X_t directly from X_0 using the closed-form diffusion formula
        noise = torch.randn_like(x_0)
        return (alphas_bar_sqrt[t]*x_0 + one_minus_alphas_bar_sqrt[t]*noise)
    
  8. Diffusion Process Demonstration

    num_shows = 20
    fig, axs = plt.subplots(2, 10, figsize=(28, 3))
    plt.rc('text', color='blue')
    
    for i in range(num_shows):
        j = i//10
        k = i%10
        num_x_t = x_t(dataset, torch.tensor([i*num_steps//num_shows]))
        # dataset has shape (N, 2), so plot the two coordinate columns
        axs[j, k].scatter(num_x_t[:, 0], num_x_t[:, 1], color='red', edgecolor='white')
        axs[j, k].set_axis_off()
        axs[j, k].set_title(r'$q(\mathbf{x}_{'+str(i*num_steps//num_shows)+'})$')
    

    (Figure: scatter plots of $q(x_t)$ at 20 evenly spaced time steps, showing the data gradually turning into Gaussian noise.)

3. Inverse Diffusion Process

3.1 Target formula

The diffusion process continuously adds noise to the original data until it becomes Gaussian noise, and the inverse diffusion process restores the original data from that Gaussian noise. We assume the inverse diffusion process is still a Markov chain; what needs to be learned is the reverse trajectory $X_T \rightarrow X_0$, expressed by the following formula:

$$p_\theta(x_{t-1}|x_t) = \mathcal{N}\left(x_{t-1};\ u_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

In DDPM the variance $\Sigma_\theta$ is usually not learned but fixed to $\beta_t I$ (as in the sampling code in 5.2), so the network only needs to predict the mean $u_\theta$.

3.2 Posterior Conditional Probability

Derive the posterior conditional probability $q(x_{t-1}|x_t, x_0)$. By Bayes' rule it is again a Gaussian:

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \bar{u}(x_t, x_0),\ \bar{\beta}_t I\right)$$

whose variance $\bar{\beta}_t$ is:

$$\bar{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

and whose mean $\bar{u}(x_t, x_0)$ is:

$$\bar{u}(x_t, x_0)=\frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t+\frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0$$

The inverse diffusion model should not know $x_0$ in advance, so $x_0$ must be expressed in terms of $x_t$. From the formula in 2.4 we get:

$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,z_t\right)$$

Substituting this into the mean formula and simplifying gives the posterior mean:

$$\bar{u}_t=\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,z_t\right)$$
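
For completeness, the simplification can be written out explicitly (a short sketch, using $\bar{\alpha}_t = \alpha_t\,\bar{\alpha}_{t-1}$, $\beta_t = 1-\alpha_t$ and $\sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_t}=1/\sqrt{\alpha_t}$):

$$\begin{aligned}
\bar{u}_t &= \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t
 + \frac{\beta_t}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,z_t\right) \\
&= \frac{\alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\,x_t
 - \frac{\beta_t}{\sqrt{\alpha_t}\,\sqrt{1-\bar{\alpha}_t}}\,z_t \\
&= \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,z_t\right),
\end{aligned}$$

where the coefficient of $x_t$ collapses because $\alpha_t(1-\bar{\alpha}_{t-1}) + \beta_t = \alpha_t - \bar{\alpha}_t + 1 - \alpha_t = 1-\bar{\alpha}_t$.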

4. Optimization goals

4.1 Derivation of loss function formula

The loss function is obtained as follows:
Starting from the variational lower bound on the negative log-likelihood and dropping the per-step weighting coefficients, the objective simplifies to predicting the added noise:

$$L_{simple}(\theta) = \mathbb{E}_{t,\,x_0,\,z}\left[\left\| z - z_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,z,\ t\right)\right\|^2\right]$$

which is exactly what the code below implements.
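
For reference, a compact sketch of the standard derivation: the negative log-likelihood is bounded by the variational term

$$-\log p_\theta(x_0) \le \mathbb{E}_q\left[ D_{KL}\!\big(q(x_T|x_0)\,\|\,p(x_T)\big) + \sum_{t>1} D_{KL}\!\big(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\big) - \log p_\theta(x_0|x_1) \right]$$

The first term contains no learnable parameters, and each KL term compares two Gaussians with fixed variance, so it reduces to a squared difference between their means. Writing the model mean in the same form as the posterior mean from 3.2, $u_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}\big(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,z_\theta(x_t,t)\big)$, and discarding the per-term weighting coefficients gives exactly the simplified noise-prediction loss above.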

4.2 Loss function code implementation

# use GPU if available (the device variable is referenced by the code below)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def diffusion_loss_fn(model, x_0, alphas_bar_sqrt, one_minus_alphas_bar_sqrt, n_steps):
    batch_size = x_0.shape[0]
    # sample random timesteps, size (batch_size//2)
    t = torch.randint(0, n_steps, size=(batch_size//2,))
    # pair each t with n_steps-1-t so the sampled timesteps cover the schedule more evenly
    t = torch.cat([t, n_steps-1-t], dim=0)
    t = t.unsqueeze(-1)  # t.shape is (batch_size, 1)
    
    # coefficients sqrt(alpha_bar_t) and sqrt(1-alpha_bar_t) for the sampled timesteps
    a = alphas_bar_sqrt[t].to(device)
    aml = one_minus_alphas_bar_sqrt[t].to(device)
    
    # random noise z ~ N(0, I)
    e = torch.randn_like(x_0).to(device)
    
    # x_t = sqrt(alpha_bar_t)*x_0 + sqrt(1-alpha_bar_t)*z
    x = x_0 * a + e * aml
    
    # the model predicts the noise; regress it against the true noise
    output = model(x, t.squeeze(-1).to(device))
    return (e - output).square().mean()

5. Algorithm process

The overall procedure is the standard DDPM training and sampling loop, which the code below implements.

Training: repeatedly sample $x_0$ from the data, a timestep $t \sim \mathrm{Uniform}(\{1,\dots,T\})$ and noise $z \sim \mathcal{N}(0, I)$; form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,z$; take a gradient step on $\|z - z_\theta(x_t, t)\|^2$.

Sampling: start from $x_T \sim \mathcal{N}(0, I)$ and, for $t = T, \dots, 1$, compute $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,z_\theta(x_t, t)\right) + \sigma_t z$ with $z \sim \mathcal{N}(0, I)$ and $\sigma_t = \sqrt{\beta_t}$.

5.1 Model training code

print('Training the model...')

batch_size = 128
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
num_epoch = 4000
plt.rc('text', color='blue')

model = MLPDiffusion(num_steps)
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(num_epoch):
    for idx, batch_x in enumerate(dataloader):
        batch_x = batch_x.to(device)
        loss = diffusion_loss_fn(model, batch_x, alphas_bar_sqrt, one_minus_alphas_bar_sqrt, num_steps)
        optimizer.zero_grad()
        loss.backward()
        # clip gradients to stabilize training
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.)
        optimizer.step()
        
    if epoch % 100 == 0:
        print(loss.item())
torch.save(model, "model.h5")
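
The training code above calls an MLPDiffusion network whose definition is not included in this excerpt. A minimal sketch of a model with the required interface is given below: it takes `x` of shape (batch, 2) and an integer timestep tensor `t`, and returns a tensor with the same shape as `x`. The layer sizes and the use of one learned embedding per timestep are assumptions for illustration, not necessarily the original author's network.

import torch
import torch.nn as nn

class MLPDiffusion(nn.Module):
    """A small MLP that predicts the noise added at step t (sketch; sizes are assumptions)."""
    def __init__(self, n_steps, num_units=128):
        super().__init__()
        self.linears = nn.ModuleList([
            nn.Linear(2, num_units), nn.ReLU(),
            nn.Linear(num_units, num_units), nn.ReLU(),
            nn.Linear(num_units, num_units), nn.ReLU(),
            nn.Linear(num_units, 2),
        ])
        # one learned embedding vector per diffusion step, for each hidden layer
        self.step_embeddings = nn.ModuleList([
            nn.Embedding(n_steps, num_units) for _ in range(3)
        ])

    def forward(self, x, t):
        for idx, embedding in enumerate(self.step_embeddings):
            x = self.linears[2 * idx](x)        # linear layer
            x = x + embedding(t)                # add the timestep embedding
            x = self.linears[2 * idx + 1](x)    # ReLU
        return self.linears[-1](x)              # project back to 2 dimensions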

5.2 Model sampling code

def p_sample_loop(model, shape, n_steps, betas, one_minus_alphas_bar_sqrt):
    # start from pure Gaussian noise x_T and denoise step by step down to x_0
    cur_x = torch.randn(shape).to(device)
    x_seq = [cur_x]
    for i in reversed(range(n_steps)):
        cur_x = p_sample(model, cur_x, i, betas.to(device), one_minus_alphas_bar_sqrt.to(device))
        x_seq.append(cur_x)
    return x_seq
        
def p_sample(model, x, t, betas, one_minus_alphas_bar_sqrt):
    # one reverse step: x_{t-1} = 1/sqrt(alpha_t) * (x_t - beta_t/sqrt(1-alpha_bar_t) * z_theta) + sigma_t * z
    t = torch.tensor([t]).to(device)
    coeff = betas[t]/one_minus_alphas_bar_sqrt[t]
    eps_theta = model(x, t)
    # compute the mean
    mean = (1 / (1-betas[t]).sqrt())*(x - (coeff*eps_theta))
    z = torch.randn_like(x).to(device)
    # compute the standard deviation
    sigma_t = betas[t].sqrt().to(device)
    sample = mean + sigma_t * z
    return (sample)


model = torch.load("model.h5")
x_seq = p_sample_loop(model, dataset.shape, num_steps, betas, one_minus_alphas_bar_sqrt)   
fig, axs = plt.subplots(1, 10, figsize=(28, 3))
for i in range(1, 11):
    cur_x = x_seq[i*10].detach()
    axs[i-1].scatter(cur_x[:, 0].cpu(), cur_x[:, 1].cpu(), color='red', edgecolor='white')
    axs[i-1].set_axis_off()
    axs[i-1].set_title(r'$q(\mathbf{x}_{'+str(i*10)+'})$')

5.3 Results of the trained model

(Figure: samples from the trained model during the reverse diffusion process, gradually recovering the S-curve distribution.)


Original article: blog.csdn.net/sunningzhzh/article/details/125118688