Foreword
Improved Denoising Diffusion Probabilistic Models (IDDPM) builds on the earlier Denoising Diffusion Probabilistic Models (DDPM).
The key formulas were covered in the previous post, "DDPM principle and code analysis", and are not repeated here. This article focuses on the improvements and the code.
It draws on the video "58, the PyTorch code of Improved Diffusion, an in-depth line-by-line explanation"; the uploader's explanation is very clear and well worth watching.
This article is constantly being updated...
DDIM optimizes sampling and uses the respacing technique to reduce the number of sampling steps; see "DDIM principle and code (Denoising diffusion implicit models)".
Reproducing the paper's code took some fiddling. Thanks to the article "Install and configure MPI under Ubuntu 20.04".
The mpi4py library kept failing to install; it turned out that mpicc was not installed.
Code
The walkthrough is mainly based on the official OpenAI code, openai/improved-diffusion.
This part covers forward diffusion, reverse diffusion, sampling, and the loss computation. The model itself, a UNet with attention, is not covered here.
The focus is the GaussianDiffusion class in improved_diffusion/gaussian_diffusion.py. Only the core code is excerpted; robustness code such as asserts and type conversions is left out. If you need to run it, please use the code in the original repository.
GaussianDiffusion
init
The constructor takes betas as its argument, and betas are produced by the function below. The original DDPM uses the linear schedule, while IDDPM uses the cosine schedule.
# gaussian_diffusion.py
def get_named_beta_schedule(schedule_name, num_diffusion_timesteps):
    """
    Get a pre-defined beta schedule for the given name.
    """
    if schedule_name == "linear":
        # Linear schedule from Ho et al, extended to work for any number of
        # diffusion steps.
        scale = 1000 / num_diffusion_timesteps
        beta_start = scale * 0.0001
        beta_end = scale * 0.02
        return np.linspace(
            beta_start, beta_end, num_diffusion_timesteps, dtype=np.float64
        )
    elif schedule_name == "cosine":
        return betas_for_alpha_bar(
            num_diffusion_timesteps,
            lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
        )
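The cosine branch calls betas_for_alpha_bar, which is not shown in the excerpt. The following is a minimal sketch of what such a helper might look like, modeled on the function of the same name in the repository; the max_beta=0.999 clipping default and the helper name cosine_alpha_bar are assumptions made for illustration. Each beta is derived from the ratio of two consecutive values of the continuous alpha-bar curve.

```python
import math
import numpy as np

def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
    """Discretize a continuous alpha_bar(t), t in [0, 1], into per-step betas."""
    betas = []
    for i in range(num_diffusion_timesteps):
        t1 = i / num_diffusion_timesteps
        t2 = (i + 1) / num_diffusion_timesteps
        # beta_i = 1 - alpha_bar(t2) / alpha_bar(t1); clipped (assumed default)
        # so that beta does not reach 1 near the end of the chain.
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.array(betas)

def cosine_alpha_bar(t):
    # Same expression as the lambda in the "cosine" branch above.
    return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

cosine_betas = betas_for_alpha_bar(1000, cosine_alpha_bar)
```

With 1000 steps the first beta is very small and the last one hits the clipping value, which is why the clip is there at all.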
alphas_cumprod is $\overline{\alpha}_t$, alphas_cumprod_prev is $\overline{\alpha}_{t-1}$, and alphas_cumprod_next is $\overline{\alpha}_{t+1}$.
alphas = 1.0 - betas
self.alphas_cumprod = np.cumprod(alphas, axis=0)
self.alphas_cumprod_prev = np.append(1.0, self.alphas_cumprod[:-1])
self.alphas_cumprod_next = np.append(self.alphas_cumprod[1:], 0.0)
$\sqrt{\overline{\alpha}_t}$ is stored as sqrt_alphas_cumprod
self.sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod)
$\sqrt{1-\overline{\alpha}_t}$ is stored as sqrt_one_minus_alphas_cumprod
self.sqrt_one_minus_alphas_cumprod = np.sqrt(1.0 - self.alphas_cumprod)
$\log(1-\overline{\alpha}_t)$ is stored as log_one_minus_alphas_cumprod
self.log_one_minus_alphas_cumprod = np.log(1.0 - self.alphas_cumprod)
$\frac{1}{\sqrt{\overline{\alpha}_t}}$ is stored as sqrt_recip_alphas_cumprod
self.sqrt_recip_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod)
$\sqrt{\frac{1}{\overline{\alpha}_t}-1}$ is stored as sqrt_recipm1_alphas_cumprod
self.sqrt_recipm1_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod - 1)
The posterior variance is $\widetilde{\beta}_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t$
# calculations for posterior q(x_{t-1} | x_t, x_0)
self.posterior_variance = (
    betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
)
Take the log of the posterior variance. The first entry is replaced by the second one before the log is taken.
# log calculation clipped because the posterior variance is 0 at the
# beginning of the diffusion chain.
self.posterior_log_variance_clipped = np.log(
    np.append(self.posterior_variance[1], self.posterior_variance[1:])
)
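To see why the clipping is needed, here is a small NumPy sketch under the assumption of a linear schedule with 1000 steps (the schedule choice and variable names outside the excerpt are illustrative only). Since $\overline{\alpha}_{t-1}$ is defined as 1 at the first step, the posterior variance there is exactly 0 and its log would be $-\infty$.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000, dtype=np.float64)  # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])

posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)

# Index 0 is exactly zero because alphas_cumprod_prev[0] == 1.0.
first_variance = posterior_variance[0]

# Clipping: reuse the value at index 1 for index 0 before taking the log.
posterior_log_variance_clipped = np.log(
    np.append(posterior_variance[1], posterior_variance[1:])
)
```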
$\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$, where the coefficient of $X_0$ corresponds to posterior_mean_coef1 and the coefficient of $X_t$ corresponds to posterior_mean_coef2.
self.posterior_mean_coef1 = (
    betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
)
self.posterior_mean_coef2 = (
    (1.0 - self.alphas_cumprod_prev)
    * np.sqrt(alphas)
    / (1.0 - self.alphas_cumprod)
)
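A quick algebraic sanity check of the two coefficients, sketched in NumPy with an assumed linear schedule: if $X_t$ sits exactly at the mean of $q(X_t|X_0)$, that is $X_t = \sqrt{\overline{\alpha}_t}X_0$, the posterior mean should reduce to $\sqrt{\overline{\alpha}_{t-1}}X_0$. This only holds when coef1 carries the factor $\beta_t$, as in the code.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000, dtype=np.float64)  # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)
alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])

posterior_mean_coef1 = betas * np.sqrt(alphas_cumprod_prev) / (1.0 - alphas_cumprod)
posterior_mean_coef2 = (
    (1.0 - alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - alphas_cumprod)
)

# Noise-free case: coef1 * x0 + coef2 * sqrt(alpha_bar_t) * x0 == sqrt(alpha_bar_{t-1}) * x0
lhs = posterior_mean_coef1 + posterior_mean_coef2 * np.sqrt(alphas_cumprod)
rhs = np.sqrt(alphas_cumprod_prev)
```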
q_mean_variance
Given (x_start, t), return the mean and variance of the forward distribution
$q(X_t|X_0) = N(X_t; \sqrt{\overline{\alpha}_t}X_0, (1-\overline{\alpha}_t)I)$
mean = (
    _extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
)
variance = _extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape)
log_variance = _extract_into_tensor(
    self.log_one_minus_alphas_cumprod, t, x_start.shape
)
q_sample
Use the reparameterization trick to obtain the noised image
$X_t = \sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}~\epsilon$
return (
    _extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
    + _extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape)
    * noise
)
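The helper _extract_into_tensor is used everywhere but not shown. As a rough NumPy stand-in (the name extract and the array-based q_sample below are hypothetical, not the repository's PyTorch code), it picks one schedule value per sample in the batch and broadcasts it to the image shape:

```python
import numpy as np

def extract(arr, t, broadcast_shape):
    """NumPy stand-in for _extract_into_tensor: index arr by t, then broadcast."""
    res = arr[t]                      # shape (B,)
    while res.ndim < len(broadcast_shape):
        res = res[..., None]          # append singleton axes: (B, 1, 1, ...)
    return np.broadcast_to(res, broadcast_shape)

betas = np.linspace(1e-4, 0.02, 1000, dtype=np.float64)  # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
sqrt_alphas_cumprod = np.sqrt(alphas_cumprod)
sqrt_one_minus_alphas_cumprod = np.sqrt(1.0 - alphas_cumprod)

def q_sample(x_start, t, noise):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    return (
        extract(sqrt_alphas_cumprod, t, x_start.shape) * x_start
        + extract(sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise
    )

rng = np.random.default_rng(0)
x_start = rng.standard_normal((4, 3, 8, 8))  # batch of 4 toy "images"
t = np.array([0, 10, 500, 999])              # one timestep per sample
x_t = q_sample(x_start, t, rng.standard_normal(x_start.shape))
```

Each sample in the batch can therefore be noised to a different timestep in a single call.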
q_posterior_mean_variance
The mean and variance of the posterior $q(X_{t-1}|X_t, X_0)$
$\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$
posterior_mean = (
    _extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start
    + _extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
)
$\widetilde{\beta}_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t$ was already computed in the init function, so it does not need to be computed again
posterior_variance = _extract_into_tensor(self.posterior_variance, t, x_t.shape)
p_mean_variance
The input here is $X_t$ at time $t$; the function predicts the mean and variance of $X_{t-1}$.
The variance can be either learned or fixed.
(1) The variance is learned, under the following condition
if self.model_var_type in [ModelVarType.LEARNED, ModelVarType.LEARNED_RANGE]:
The model output must be split along the channel dimension
model_output, model_var_values = th.split(model_output, C, dim=1)
There are two sub-cases here. With ModelVarType.LEARNED, the network predicts the log-variance directly
if self.model_var_type == ModelVarType.LEARNED:
    model_log_variance = model_var_values
    model_variance = th.exp(model_log_variance)
In IDDPM, the network instead predicts a range: it outputs the interpolation weight $v$ in the following formula.
$\Sigma_{\theta}(X_t, t)=\exp(v\log\beta_t + (1-v)\log\widetilde{\beta}_t)$
Because $\widetilde{\beta}_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t$ and $1-\overline{\alpha}_{t-1} < 1-\overline{\alpha}_t$, we have $\widetilde{\beta}_t < \beta_t$.
So max_log is $\log\beta_t$
max_log = _extract_into_tensor(np.log(self.betas), t, x.shape)
and min_log is $\log\widetilde{\beta}_t$
min_log = _extract_into_tensor(
    self.posterior_log_variance_clipped, t, x.shape
)
Map the predicted value from [-1, 1] to [0, 1]
# The model_var_values is [-1, 1] for [min_var, max_var].
frac = (model_var_values + 1) / 2
Then apply the formula $\Sigma_{\theta}(X_t, t)=\exp(v\log\beta_t + (1-v)\log\widetilde{\beta}_t)$, with frac playing the role of $v$
model_log_variance = frac * max_log + (1 - frac) * min_log
model_variance = th.exp(model_log_variance)
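The interpolation can be checked on a single timestep with plain NumPy. The numbers for $\beta_t$ and $\widetilde{\beta}_t$ below are made up for illustration, and the function name interpolate_log_variance is hypothetical. A raw output of -1 should give the small variance, +1 the large variance, and 0 their geometric mean, because the interpolation happens in log space.

```python
import numpy as np

def interpolate_log_variance(model_var_values, min_log, max_log):
    frac = (model_var_values + 1) / 2            # [-1, 1] -> [0, 1]
    return frac * max_log + (1 - frac) * min_log

beta_t = 0.01          # illustrative value for beta_t
beta_tilde_t = 0.004   # illustrative value for the posterior variance
max_log, min_log = np.log(beta_t), np.log(beta_tilde_t)

v_raw = np.array([-1.0, 0.0, 1.0])               # three possible network outputs
model_variance = np.exp(interpolate_log_variance(v_raw, min_log, max_log))
```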
(2) The variance is fixed (not learned).
DDPM uses $\beta_t$, while IDDPM offers two options: $\beta_t$ or $\widetilde{\beta}_t$.
The large variance is $\beta_t$,
ModelVarType.FIXED_LARGE: (
    # for fixedlarge, we set the initial (log-)variance like so
    # to get a better decoder log likelihood.
    np.append(self.posterior_variance[1], self.betas[1:]),
    np.log(np.append(self.posterior_variance[1], self.betas[1:])),
),
The small variance is $\widetilde{\beta}_t$
ModelVarType.FIXED_SMALL: (
    self.posterior_variance,
    self.posterior_log_variance_clipped,
),
Note that the values computed above are arrays over all timesteps, so we only need to take out the entries for the current timestep t
model_variance = _extract_into_tensor(model_variance, t, x.shape)
model_log_variance = _extract_into_tensor(model_log_variance, t, x.shape)
Next comes the prediction of the mean
(1) The model predicts the mean of $X_{t-1}$
if self.model_mean_type == ModelMeanType.PREVIOUS_X:
so the output is used directly as the mean
model_mean = model_output
In addition, $X_0$ is derived from it; this value is not used in training but is used in evaluation
pred_xstart = process_xstart(
    self._predict_xstart_from_xprev(x_t=x, t=t, xprev=model_output)
)
(2) The model predicts $X_0$
if self.model_mean_type == ModelMeanType.START_X:
The output only passes through a post-processing function
pred_xstart = process_xstart(model_output)
(3) The model predicts the noise
if self.model_mean_type == ModelMeanType.EPSILON:
    pred_xstart = process_xstart(
        self._predict_xstart_from_eps(x_t=x, t=t, eps=model_output)
    )
In cases (2) and (3), the mean is then obtained by substituting the predicted $X_0$ into the posterior mean formula $\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$
model_mean, _, _ = self.q_posterior_mean_variance(
    x_start=pred_xstart, x_t=x, t=t
)
_predict_xstart_from_xprev
Solve the posterior mean formula for $X_0$, treating the predicted xprev as $\widetilde{\mu}$: $\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$
return (  # (xprev - coef2*x_t) / coef1
    _extract_into_tensor(1.0 / self.posterior_mean_coef1, t, x_t.shape) * xprev
    - _extract_into_tensor(
        self.posterior_mean_coef2 / self.posterior_mean_coef1, t, x_t.shape
    )
    * x_t
)
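The inversion can be verified with a round trip: compute the posterior mean from a known $X_0$, then recover $X_0$ from it. This is a NumPy sketch with an assumed linear schedule and an arbitrary timestep; coef1 and coef2 stand for posterior_mean_coef1 and posterior_mean_coef2.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000, dtype=np.float64)  # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)
alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])
coef1 = betas * np.sqrt(alphas_cumprod_prev) / (1.0 - alphas_cumprod)
coef2 = (1.0 - alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - alphas_cumprod)

rng = np.random.default_rng(0)
t = 300                                   # arbitrary timestep for the check
x_start = rng.standard_normal(16)
x_t = rng.standard_normal(16)

# Forward direction: posterior mean of X_{t-1} given (X_t, X_0).
xprev = coef1[t] * x_start + coef2[t] * x_t
# Inverse direction, same form as _predict_xstart_from_xprev.
x_start_recovered = (1.0 / coef1[t]) * xprev - (coef2[t] / coef1[t]) * x_t
```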
_predict_xstart_from_eps
Solving the forward formula $X_t = \sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}~\epsilon$ for $X_0$ gives
$X_0 = \frac{1}{\sqrt{\overline{\alpha}_t}}(X_t-\sqrt{1-\overline{\alpha}_t}~\epsilon)$
Expanding the above formula gives
$X_0 = \frac{1}{\sqrt{\overline{\alpha}_t}}X_t - \sqrt{\frac{1}{\overline{\alpha}_t}-1}~\epsilon$
return (
    _extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t
    - _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * eps
)
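As a check that the two buffers sqrt_recip_alphas_cumprod and sqrt_recipm1_alphas_cumprod really invert the forward noising step, the following NumPy sketch (assumed linear schedule, arbitrary timestep) noises a vector and then recovers it from the known noise:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000, dtype=np.float64)  # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
sqrt_recip_alphas_cumprod = np.sqrt(1.0 / alphas_cumprod)
sqrt_recipm1_alphas_cumprod = np.sqrt(1.0 / alphas_cumprod - 1)

rng = np.random.default_rng(0)
t = 750                                   # arbitrary timestep for the check
x_start = rng.standard_normal(16)
eps = rng.standard_normal(16)

# Forward: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
x_t = np.sqrt(alphas_cumprod[t]) * x_start + np.sqrt(1.0 - alphas_cumprod[t]) * eps
# Inverse: x_0 = x_t / sqrt(alpha_bar_t) - sqrt(1 / alpha_bar_t - 1) * eps
x_start_recovered = (
    sqrt_recip_alphas_cumprod[t] * x_t - sqrt_recipm1_alphas_cumprod[t] * eps
)
```

During sampling the true noise is unknown, so the model's predicted noise takes the place of eps and the result is only an estimate of $X_0$.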
p_sample
Sample $X_{t-1}$ given $X_t$
out is {"mean": model_mean, "variance": model_variance, "log_variance": model_log_variance, "pred_xstart": pred_xstart}
# Get the mean, variance, and log-variance of X[t-1], and the prediction of X[0]
out = self.p_mean_variance(
    model,
    x,
    t,
    clip_denoised=clip_denoised,
    denoised_fn=denoised_fn,
    model_kwargs=model_kwargs,
)
Reparameterized sampling; nonzero_mask switches the noise off when t == 0
noise = th.randn_like(x)
nonzero_mask = (
    (t != 0).float().view(-1, *([1] * (len(x.shape) - 1)))
)  # no noise when t == 0
sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
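The role of nonzero_mask is easiest to see on a tiny batch. The sketch below is a NumPy stand-in for the sampling line (p_sample_step is a hypothetical name, and the mean and log-variance are random or constant placeholders rather than model outputs): the mask is 0 for samples at t == 0, so the final step returns the mean without added noise.

```python
import numpy as np

def p_sample_step(mean, log_variance, t, noise):
    # 0 where t == 0, 1 elsewhere; reshaped to (B, 1, 1, ...) for broadcasting.
    nonzero_mask = (t != 0).astype(mean.dtype).reshape(-1, *([1] * (mean.ndim - 1)))
    # std = exp(0.5 * log_variance)
    return mean + nonzero_mask * np.exp(0.5 * log_variance) * noise

rng = np.random.default_rng(0)
mean = rng.standard_normal((2, 3, 4, 4))          # placeholder for out["mean"]
log_variance = np.full_like(mean, np.log(0.01))   # placeholder for out["log_variance"]
t = np.array([0, 5])                              # first sample is at the last step
sample = p_sample_step(mean, log_variance, t, rng.standard_normal(mean.shape))
```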
The p_sample_loop and p_sample_loop_progressive functions call this function iteratively.
p_sample_loop_progressive
Start from $X_T$: use the supplied noise if there is one, otherwise draw pure Gaussian noise
if noise is not None:
    img = noise
else:
    img = th.randn(*shape, device=device)
Timestep indices in reverse order: T-1, T-2, …, 0
indices = list(range(self.num_timesteps))[::-1]
Sample step by step, feeding each output back in as the next input
for i in indices:
    t = th.tensor([i] * shape[0], device=device)
    with th.no_grad():
        out = self.p_sample(
            model,
            img,
            t,
            clip_denoised=clip_denoised,
            denoised_fn=denoised_fn,
            model_kwargs=model_kwargs,
        )
        yield out
        img = out["sample"]
_vb_terms_bpd
vb stands for variational bound, and bpd stands for bits per dimension
Compute the true mean and variance of the posterior distribution q
true_mean, _, true_log_variance_clipped = self.q_posterior_mean_variance(
    x_start=x_start, x_t=x_t, t=t
)
Compute the mean and variance that the model predicts for the distribution p
out = self.p_mean_variance(
    model, x_t, t, clip_denoised=clip_denoised, model_kwargs=model_kwargs
)
Compute the KL divergence between the two Gaussian distributions, then divide by $\log 2$ to convert from nats to bits.
$L_{t-1} = D_{KL}(q(X_{t-1}|X_t, X_0)~||~p_\theta(X_{t-1}|X_t))$
kl = normal_kl(
    true_mean, true_log_variance_clipped, out["mean"], out["log_variance"]
)
kl = mean_flat(kl) / np.log(2.0)
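normal_kl is not shown in the excerpt. For two univariate Gaussians parameterized by mean and log-variance, the closed-form KL divergence can be sketched in NumPy as follows (a re-implementation for illustration, not the repository's PyTorch code):

```python
import numpy as np

def normal_kl(mean1, logvar1, mean2, logvar2):
    """KL(N(mean1, exp(logvar1)) || N(mean2, exp(logvar2))), elementwise, in nats."""
    return 0.5 * (
        -1.0
        + logvar2
        - logvar1
        + np.exp(logvar1 - logvar2)
        + (mean1 - mean2) ** 2 * np.exp(-logvar2)
    )

# Identical distributions: KL is zero.
kl_same = normal_kl(np.array(0.3), np.array(-1.0), np.array(0.3), np.array(-1.0))
# Unit variances, means one apart: KL is 0.5 nats.
kl_shift_nats = normal_kl(np.array(0.0), np.array(0.0), np.array(1.0), np.array(0.0))
# Dividing by log(2) converts nats to bits, as in the excerpt.
kl_shift_bits = kl_shift_nats / np.log(2.0)
```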
$L_0=-\log p_{\theta}(X_0|X_1)$
This term models a discrete distribution using differences of the Gaussian cumulative distribution function
decoder_nll = -discretized_gaussian_log_likelihood(
    x_start, means=out["mean"], log_scales=0.5 * out["log_variance"]
)
decoder_nll = mean_flat(decoder_nll) / np.log(2.0)
Combine the two: use the decoder NLL at t = 0 and the KL divergence at every other timestep
output = th.where((t == 0), decoder_nll, kl)
discretized_gaussian_log_likelihood
# improved_diffusion/losses.py
def discretized_gaussian_log_likelihood(x, *, means, log_scales):
    """
    Compute the log-likelihood of a Gaussian distribution discretizing to a
    given image.
    :param x: the target images. It is assumed that this was uint8 values,
              rescaled to the range [-1, 1].
    :param means: the Gaussian mean Tensor.
    :param log_scales: the Gaussian log stddev Tensor.
    :return: a tensor like x of log probabilities (in nats).
    """
    assert x.shape == means.shape == log_scales.shape
    centered_x = x - means
    inv_stdv = th.exp(-log_scales)
    plus_in = inv_stdv * (centered_x + 1.0 / 255.0)
    cdf_plus = approx_standard_normal_cdf(plus_in)
    min_in = inv_stdv * (centered_x - 1.0 / 255.0)
    cdf_min = approx_standard_normal_cdf(min_in)
    log_cdf_plus = th.log(cdf_plus.clamp(min=1e-12))
    log_one_minus_cdf_min = th.log((1.0 - cdf_min).clamp(min=1e-12))
    cdf_delta = cdf_plus - cdf_min
    log_probs = th.where(
        x < -0.999,
        log_cdf_plus,
        th.where(x > 0.999, log_one_minus_cdf_min, th.log(cdf_delta.clamp(min=1e-12))),
    )
    assert log_probs.shape == x.shape
    return log_probs
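One way to convince yourself that this is a proper discrete distribution is to evaluate it on all 256 pixel levels and check that the probabilities sum to 1. The NumPy port below is a sketch for that purpose; the tanh-based approx_standard_normal_cdf is a common approximation that I assume matches the repository's helper of the same name, and the mean and scale values are arbitrary.

```python
import numpy as np

def approx_standard_normal_cdf(x):
    # tanh-based approximation of the standard normal CDF (assumed form)
    return 0.5 * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def discretized_gaussian_log_likelihood(x, means, log_scales):
    centered_x = x - means
    inv_stdv = np.exp(-log_scales)
    # CDF at the upper and lower edge of each pixel bin (bin half-width is 1/255)
    cdf_plus = approx_standard_normal_cdf(inv_stdv * (centered_x + 1.0 / 255.0))
    cdf_min = approx_standard_normal_cdf(inv_stdv * (centered_x - 1.0 / 255.0))
    log_cdf_plus = np.log(np.clip(cdf_plus, 1e-12, None))
    log_one_minus_cdf_min = np.log(np.clip(1.0 - cdf_min, 1e-12, None))
    log_delta = np.log(np.clip(cdf_plus - cdf_min, 1e-12, None))
    # Edge bins absorb the tails; interior bins use the CDF difference.
    return np.where(
        x < -0.999,
        log_cdf_plus,
        np.where(x > 0.999, log_one_minus_cdf_min, log_delta),
    )

pixel_values = np.arange(256) / 127.5 - 1.0   # all uint8 levels rescaled to [-1, 1]
log_probs = discretized_gaussian_log_likelihood(
    pixel_values, means=np.full(256, 0.1), log_scales=np.full(256, np.log(0.05))
)
total_probability = np.exp(log_probs).sum()
```

The interior terms telescope, so the total is 1 up to the small amount added by the 1e-12 clamp.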
training_losses
If the loss type is KL (or rescaled KL)
if self.loss_type == LossType.KL or self.loss_type == LossType.RESCALED_KL:
Call the _vb_terms_bpd function described above
terms["loss"] = self._vb_terms_bpd(
    model=model,
    x_start=x_start,
    x_t=x_t,
    t=t,
    clip_denoised=False,
    model_kwargs=model_kwargs,
)["output"]
For the MSE loss, the handling depends on what the model predicts.
elif self.loss_type == LossType.MSE or self.loss_type == LossType.RESCALED_MSE:
(1) If the model predicts the variance
if self.model_var_type in [
    ModelVarType.LEARNED,
    ModelVarType.LEARNED_RANGE,
]:
Split the output into model_output and model_var_values
B, C = x_t.shape[:2]
assert model_output.shape == (B, C * 2, *x_t.shape[2:])
model_output, model_var_values = th.split(model_output, C, dim=1)
frozen_out = th.cat([model_output.detach(), model_var_values], dim=1)
terms["vb"] = self._vb_terms_bpd(
    model=lambda *args, r=frozen_out: r,
    x_start=x_start,
    x_t=x_t,
    t=t,
    clip_denoised=False,
)["output"]
Since the model has already produced its output once, it does not need to run again, so the model passed in is simply a function that ignores its arguments and returns the stored output. In frozen_out the mean prediction is detached, so the VB term trains only the variance and does not affect the optimization of the mean.
model=lambda *args, r=frozen_out: r
The lambda is an anonymous function, equivalent to
def fun(*args, r=frozen_out):
    return r
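The default argument r=frozen_out matters: Python evaluates default values once, when the lambda is created, so the function keeps returning that object regardless of what it is called with. A small pure-Python illustration, with a list standing in for the tensor:

```python
frozen_out = [1.0, 2.0, 3.0]   # stands in for the tensor already produced by the network

# Same form as in the excerpt: ignore positional arguments, return the captured value.
model = lambda *args, r=frozen_out: r

result = model("x_t", "t")     # arguments are accepted and discarded

# Rebinding the outer name later does not change what the lambda returns,
# because the default r was bound when the lambda was created.
captured = frozen_out
frozen_out = None
result_after_rebind = model("x_t", "t")
```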
If the loss is rescaled,
if self.loss_type == LossType.RESCALED_MSE:
    # Divide by 1000 for equivalence with initial implementation.
    # Without a factor of 1/1000, the VB term hurts the MSE term.
    terms["vb"] *= self.num_timesteps / 1000.0
Next, the regression target depends on what the model predicts.
It can be the posterior mean of $X_{t-1}$, $X_0$, or the noise
target = {
    ModelMeanType.PREVIOUS_X: self.q_posterior_mean_variance(
        x_start=x_start, x_t=x_t, t=t
    )[0],
    ModelMeanType.START_X: x_start,
    ModelMeanType.EPSILON: noise,
}[self.model_mean_type]
Then simply compute the MSE loss
terms["mse"] = mean_flat((target - model_output) ** 2)
Finally, add the loss terms together
if "vb" in terms:
    terms["loss"] = terms["mse"] + terms["vb"]
else:
    terms["loss"] = terms["mse"]
In short, the vb term is part of the loss only when the variance is learned.