Summary
The diffusion model is a generative model of the Encoder-Decoder architecture, divided into a diffusion stage and an inverse diffusion stage. In the diffusion stage, noise is added to the original data step by step, so that the data moves from its original distribution to the distribution we expect; for example, repeatedly adding Gaussian noise turns the original data distribution into a normal distribution. In the inverse diffusion stage, a neural network restores the data from the normal distribution to the original data distribution. The advantage is that each point of the normal distribution maps to real data, which gives the model better interpretability. The disadvantage is that iterative sampling is slow, which makes training and prediction inefficient.
1. Introduction
The diffusion model is divided into a diffusion process and an inverse diffusion process. The diffusion process continuously adds Gaussian noise to the original data until the data becomes Gaussian distributed, that is, $X_0 \to X_T$. The inverse diffusion process restores the picture from Gaussian noise, that is, $X_T \to X_0$.
2. Diffusion process
2.1 Defining the Diffusion Process
Assuming the diffusion process is a Markov chain, Gaussian noise is added to the original data step by step. Each step goes from $X_{t-1}$ to $X_t$ and is defined by the formula:
$$q(x_t|x_{t-1}) = N(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_t I)$$
This formula means that the transition from $x_{t-1}$ to $x_t$ is a Gaussian distribution with mean $\sqrt{1-\beta_t}x_{t-1}$ and variance $\beta_t$.
2.2 Reparameterization trick to get the iterative formula
Using the reparameterization trick, each step of adding Gaussian noise can be written as:
$$X_t = \sqrt{1-\beta_t}X_{t-1} + \sqrt{\beta_t}Z_t$$
- $X_t$ represents the data distribution at time t
- $Z_t$ represents the Gaussian noise added at time t, which is generally fixed as a Gaussian distribution with mean 0 and variance 1
- $\sqrt{1-\beta_t}X_{t-1}$ represents the mean of the distribution at the current moment
- $\sqrt{\beta_t}$ represents the standard deviation of the distribution at the current moment (standard deviation = $\sqrt{\text{variance}}$)
Note: $\beta_t$ is a preset constant between 0 and 1, so the diffusion process contains no learnable parameters.
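The iterative formula can be tried out on a single scalar value. The following is a minimal sketch using only the Python standard library; the function names `forward_step` and `diffuse` and the constant schedule are illustrative assumptions, not part of the original code.

```python
import math
import random


def forward_step(x_prev, beta_t, rng):
    """One diffusion step: X_t = sqrt(1 - beta_t) * X_{t-1} + sqrt(beta_t) * Z_t."""
    z_t = rng.gauss(0.0, 1.0)  # Z_t ~ N(0, 1)
    return math.sqrt(1.0 - beta_t) * x_prev + math.sqrt(beta_t) * z_t


def diffuse(x_0, betas, rng):
    """Apply one step per beta and return the trajectory [X_0, ..., X_T]."""
    xs = [x_0]
    for beta_t in betas:
        xs.append(forward_step(xs[-1], beta_t, rng))
    return xs


rng = random.Random(0)
betas = [0.02] * 50  # hypothetical constant schedule, for illustration only
trajectory = diffuse(1.0, betas, rng)
print(len(trajectory))  # T + 1 values: X_0 through X_T
print(trajectory[0], trajectory[-1])
```

Because each step only needs the previous value and a fresh noise sample, no learned parameters are involved, which is the point made in the note above.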
2.3 Get the global diffusion formula
From the iterative formula in 2.2, it can be seen that the only parameter in the diffusion process is $\beta_t$, and $\beta_t$ is a preset constant, so nothing needs to be learned during the diffusion process. Knowing only the initial data distribution $X_0$ and $\beta_t$, the distribution $X_t$ at any time can be obtained directly. The specific formula is as follows:
$$X_t = \sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}Z$$
- $X_0$ is the distribution of the original data
- $\alpha_t = 1 - \beta_t$
- $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$
- $Z$ is a Gaussian distribution with mean 0 and variance 1
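The step from the iterative formula in 2.2 to a closed form in terms of $X_0$ is not spelled out above. A short derivation sketch, writing $\alpha_t = 1 - \beta_t$ and using the fact that the sum of two independent zero-mean Gaussians is again Gaussian with the variances added:

```latex
\begin{aligned}
X_t &= \sqrt{\alpha_t}\,X_{t-1} + \sqrt{1-\alpha_t}\,Z_t \\
    &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,X_{t-2} + \sqrt{1-\alpha_{t-1}}\,Z_{t-1}\right) + \sqrt{1-\alpha_t}\,Z_t \\
    &= \sqrt{\alpha_t\alpha_{t-1}}\,X_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\,Z_{t-1} + \sqrt{1-\alpha_t}\,Z_t \\
    &= \sqrt{\alpha_t\alpha_{t-1}}\,X_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,Z \\
    &\;\;\vdots \\
    &= \sqrt{\bar{\alpha}_t}\,X_0 + \sqrt{1-\bar{\alpha}_t}\,Z
\end{aligned}
```

The fourth line merges the two noise terms: their variances add up to $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$. Repeating the same merge down to $X_0$ gives the product $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$.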
2.4 Diffusion process realization code
2.4.1 Summary Diffusion Formula
From 2.3, the formula of the diffusion process is:
$$X_t = \sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}Z$$
where:
- $X_0$ is the distribution of the original data
- $\alpha_t = 1 - \beta_t$
- $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$
- $Z$ is a Gaussian distribution with mean 0 and variance 1
2.4.2 Code
- Use make_s_curve to generate example data as $X_0$:

      import numpy as np
      import matplotlib.pyplot as plt
      import torch
      from sklearn.datasets import make_s_curve

      # Generate the data X0: keep two of the three coordinates and rescale
      s_curve, _ = make_s_curve(10**4, noise=0.1)
      x_0 = s_curve[:, [0, 2]] / 10.0
      # Check the shape: (10000, 2)
      print(np.shape(x_0))
      # Plot the data (transposed so the x and y coordinates can be unpacked)
      data = x_0.T
      fig, ax = plt.subplots()
      ax.scatter(*data, color='red', edgecolor='white')
      ax.axis('off')
      # Keep the dataset as (num_samples, 2) so it can be batched by a DataLoader later
      dataset = torch.Tensor(x_0)

- Assume 100 time steps and set $\beta$ for every step:

      num_steps = 100
      betas = torch.linspace(-6, 6, num_steps)
      betas = torch.sigmoid(betas) * (0.5e-2 - 1e-5) + 1e-5

  Each $\beta$ is a very small number between 0 and 1; with this schedule the values lie between 1e-5 and 0.5e-2.

- Get $\alpha$ ($\alpha = 1 - \beta$):

      alphas = 1 - betas

- Get $\bar{\alpha}_t$ at each moment ($\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$):

      alphas_bar = torch.cumprod(alphas, 0)

- Get $\sqrt{\bar{\alpha}_t}$:

      alphas_bar_sqrt = torch.sqrt(alphas_bar)

- Get $\sqrt{1-\bar{\alpha}_t}$:

      one_minus_alphas_bar_sqrt = torch.sqrt(1 - alphas_bar)

- Given $X_0$ and time t, get $X_t$, that is, $X_t = \sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}Z$:

      def x_t(x_0, t):
          # Sample Z with the same shape as x_0 and apply the closed-form diffusion formula
          noise = torch.randn_like(x_0)
          return alphas_bar_sqrt[t] * x_0 + one_minus_alphas_bar_sqrt[t] * noise

- Diffusion process demonstration:

      num_shows = 20
      fig, axs = plt.subplots(2, 10, figsize=(28, 3))
      plt.rc('text', color='blue')
      for i in range(num_shows):
          j = i // 10
          k = i % 10
          step = i * num_steps // num_shows
          num_x_t = x_t(dataset, torch.tensor([step]))
          axs[j, k].scatter(num_x_t[:, 0], num_x_t[:, 1], color='red', edgecolor='white')
          axs[j, k].set_axis_off()
          axs[j, k].set_title(r'$q(\mathbf{x}_{' + str(step) + '})$')
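To see what the schedule above actually produces, it can help to print a few of its values. The following is a minimal sketch that restates the same formulas with the Python standard library only; `make_schedule` and its parameter names are illustrative assumptions, not functions from the original code.

```python
import math


def make_schedule(num_steps=100, beta_min=1e-5, beta_max=0.5e-2):
    """Sigmoid-shaped beta schedule over linspace(-6, 6), with alpha and cumulative alpha_bar."""
    xs = [-6.0 + 12.0 * i / (num_steps - 1) for i in range(num_steps)]
    betas = [beta_min + (beta_max - beta_min) / (1.0 + math.exp(-x)) for x in xs]
    alphas = [1.0 - b for b in betas]
    alphas_bar = []
    running = 1.0
    for a in alphas:
        running *= a  # cumulative product of the alphas
        alphas_bar.append(running)
    return betas, alphas, alphas_bar


betas, alphas, alphas_bar = make_schedule()
print("beta range:", min(betas), max(betas))
print("alpha_bar at the first and last step:", alphas_bar[0], alphas_bar[-1])
print("signal / noise coefficients at the last step:",
      math.sqrt(alphas_bar[-1]), math.sqrt(1.0 - alphas_bar[-1]))
```

With these settings the final cumulative product stays well above zero, so $X_T$ is not yet a pure standard normal sample; more steps or larger $\beta$ values push it closer to zero.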
3. Inverse Diffusion Process
3.1 Target formula
The diffusion process continuously adds noise to the original data until Gaussian noise is obtained. The inverse diffusion process restores the original data from the Gaussian noise. We assume that the inverse diffusion process is also a Markov chain. The goal is $X_T \to X_0$, expressed by the following formula:
$$p_\theta(x_{t-1}|x_t) = N(x_{t-1}; u_\theta(x_t, t),\Sigma_\theta(x_t, t))$$
3.2 Posterior Conditional Probability
Derive the posterior conditional probability $q(x_{t-1}|x_t, x_0)$, whose variance $\bar{\beta}_t$ is:
$$\bar{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$$
The mean $\bar{u}(x_t, x_0)$ is:
$$\bar{u}(x_t, x_0)=\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t+\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0$$
The inverse diffusion model does not know $x_0$ in advance, so $x_0$ has to be expressed in terms of $x_t$. Rearranging the diffusion formula in 2.4 gives:
$$x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}z_t\right)$$
Substituting this into the mean formula and simplifying gives the posterior mean:
$$\bar{u}_t=\frac{1}{\sqrt{\alpha_t}}\left(x_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}z_t\right)$$
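The simplification in 3.2 can be checked numerically. The sketch below evaluates the posterior mean once with $x_0$ and once with the noise $z_t$ substituted in, and compares the two. The function names and the numeric values are arbitrary assumptions chosen only for illustration.

```python
import math


def posterior_mean_with_x0(x_t, x_0, alpha_t, abar_t, abar_prev):
    """Posterior mean written in terms of x_t and x_0."""
    beta_t = 1.0 - alpha_t
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    return coef_xt * x_t + coef_x0 * x_0


def posterior_mean_with_noise(x_t, z_t, alpha_t, abar_t):
    """Simplified posterior mean written in terms of x_t and the noise z_t."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * z_t) / math.sqrt(alpha_t)


# Arbitrary illustrative values
alpha_t = 0.98
abar_prev = 0.60                 # alpha_bar at step t-1
abar_t = abar_prev * alpha_t     # alpha_bar at step t
x_0, z_t = 0.3, -1.2
# Build x_t from x_0 and z_t with the closed-form diffusion formula
x_t = math.sqrt(abar_t) * x_0 + math.sqrt(1.0 - abar_t) * z_t

m_x0 = posterior_mean_with_x0(x_t, x_0, alpha_t, abar_t, abar_prev)
m_noise = posterior_mean_with_noise(x_t, z_t, alpha_t, abar_t)
print(m_x0, m_noise)  # the two values agree up to floating-point error
```

The second form is the one the model can use at sampling time, since it needs only $x_t$ and a prediction of the noise rather than the unknown $x_0$.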
4. Optimization goals
4.1 Derivation of loss function formula
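After simplification, the loss is the mean squared error between the Gaussian noise $z$ added in the diffusion process and the noise $z_\theta$ predicted by the model from $x_t$ and $t$, which is what the code in 4.2 computes:
$$L = E_{x_0, z, t}\left[\left\| z - z_\theta\left(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}z,\ t\right)\right\|^2\right]$$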
4.2 Loss function code implementation
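    def diffusion_loss_fn(model, x_0, alphas_bar_sqrt, one_minus_alphas_bar_sqrt, n_steps):
        """Sample a random time step for every example and compute the noise-prediction loss."""
        batch_size = x_0.shape[0]
        device = x_0.device
        # Sample time steps for half of the batch and mirror them (t and n_steps-1-t)
        # so the batch covers the whole range; the slice handles odd batch sizes
        t = torch.randint(0, n_steps, size=((batch_size + 1) // 2,))
        t = torch.cat([t, n_steps - 1 - t], dim=0)[:batch_size]
        t = t.unsqueeze(-1)  # t.shape is (batch_size, 1)
        # Coefficient of x_0: sqrt(alpha_bar_t)
        a = alphas_bar_sqrt[t].to(device)
        # Coefficient of the noise: sqrt(1 - alpha_bar_t)
        aml = one_minus_alphas_bar_sqrt[t].to(device)
        # Random Gaussian noise with the same shape as x_0
        e = torch.randn_like(x_0)
        # Build x_t with the closed-form diffusion formula
        x = x_0 * a + e * aml
        # Predict the noise from x_t and t
        output = model(x, t.squeeze(-1).to(device))
        # Mean squared error between the true and the predicted noise
        return (e - output).square().mean()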
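The loss function draws time steps for only half of the batch and mirrors them for the other half. A minimal standard-library sketch of that pairing is shown below; `sample_mirrored_timesteps` is a hypothetical helper written for illustration, not part of the original code.

```python
import random


def sample_mirrored_timesteps(batch_size, n_steps, rng):
    """Draw half of the steps at random and mirror them as n_steps - 1 - t."""
    half = [rng.randrange(n_steps) for _ in range(batch_size // 2)]
    mirrored = [n_steps - 1 - t for t in half]
    return half + mirrored


rng = random.Random(0)
steps = sample_mirrored_timesteps(batch_size=8, n_steps=100, rng=rng)
print(steps)
# Each step in the first half pairs with one in the second half;
# every pair sums to n_steps - 1
print([a + b for a, b in zip(steps[:4], steps[4:])])  # [99, 99, 99, 99]
```

Pairing an early step with a late one means every batch contains both lightly and heavily noised examples, which is a plausible reason for this choice, although the article does not state it.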
5. Algorithm process
5.1 Model training code
    print('Training model...')
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    batch_size = 128
    # dataset is expected to have shape (num_samples, 2)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    num_epoch = 4000
    plt.rc('text', color='blue')
    model = MLPDiffusion(num_steps)
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(num_epoch):
        for idx, batch_x in enumerate(dataloader):
            batch_x = batch_x.to(device)
            loss = diffusion_loss_fn(model, batch_x, alphas_bar_sqrt, one_minus_alphas_bar_sqrt, num_steps)
            optimizer.zero_grad()
            loss.backward()
            # Clip the gradient norm to keep training stable
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.)
            optimizer.step()
        # Print the loss of the last batch every 100 epochs
        if epoch % 100 == 0:
            print(loss.item())
    # Save the whole model object
    torch.save(model, "model.h5")
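The training code uses a class `MLPDiffusion` that is not defined in this article. The following is only a guess at what such a noise-prediction network might look like, written so that it matches the calls `MLPDiffusion(num_steps)` and `model(x, t)`. The layer sizes, the step embedding and the parameter names `data_dim` and `num_units` are assumptions.

```python
import torch
import torch.nn as nn


class MLPDiffusion(nn.Module):
    """Hypothetical noise-prediction network: maps (x_t, t) to noise with the shape of x_t."""

    def __init__(self, n_steps, data_dim=2, num_units=128):
        super().__init__()
        # One learned embedding vector per time step
        self.step_embedding = nn.Embedding(n_steps, num_units)
        self.input_layer = nn.Linear(data_dim, num_units)
        self.hidden_layer = nn.Linear(num_units, num_units)
        self.output_layer = nn.Linear(num_units, data_dim)
        self.activation = nn.ReLU()

    def forward(self, x, t):
        # x: (batch, data_dim) float tensor
        # t: integer tensor of step indices with shape (batch,) or (1,)
        t_emb = self.step_embedding(t)  # broadcasts over the batch when t has shape (1,)
        h = self.activation(self.input_layer(x) + t_emb)
        h = self.activation(self.hidden_layer(h) + t_emb)
        return self.output_layer(h)


model = MLPDiffusion(100)
x = torch.randn(8, 2)
t = torch.randint(0, 100, (8,))
print(model(x, t).shape)  # same shape as x
```

Adding the step embedding to the hidden activations is one simple way to let the same network handle every noise level; other ways of conditioning on t would work as well.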
5.2 Model sampling code
    def p_sample_loop(model, shape, n_steps, betas, one_minus_alphas_bar_sqrt):
        """Start from Gaussian noise x_T and sample x_{T-1}, ..., x_0 in turn."""
        device = next(model.parameters()).device
        cur_x = torch.randn(shape).to(device)
        x_seq = [cur_x]
        for i in reversed(range(n_steps)):
            cur_x = p_sample(model, cur_x, i, betas.to(device), one_minus_alphas_bar_sqrt.to(device))
            x_seq.append(cur_x)
        return x_seq

    def p_sample(model, x, t, betas, one_minus_alphas_bar_sqrt):
        """Sample x_{t-1} from x_t."""
        t = torch.tensor([t]).to(x.device)
        coeff = betas[t] / one_minus_alphas_bar_sqrt[t]
        eps_theta = model(x, t)
        # Compute the mean
        mean = (1 / (1 - betas[t]).sqrt()) * (x - (coeff * eps_theta))
        z = torch.randn_like(x)
        # Compute the standard deviation
        sigma_t = betas[t].sqrt()
        sample = mean + sigma_t * z
        return sample

    # weights_only=False is needed on recent PyTorch versions to load a fully pickled model
    model = torch.load("model.h5", weights_only=False)
    # dataset.shape is expected to be (num_samples, 2)
    x_seq = p_sample_loop(model, dataset.shape, num_steps, betas, one_minus_alphas_bar_sqrt)
    fig, axs = plt.subplots(1, 10, figsize=(28, 3))
    for i in range(1, 11):
        cur_x = x_seq[i * 10].detach()
        axs[i - 1].scatter(cur_x[:, 0].cpu(), cur_x[:, 1].cpu(), color='red', edgecolor='white')
        axs[i - 1].set_axis_off()
        axs[i - 1].set_title(r'$q(\mathbf{x}_{' + str(i * 10) + '})$')