IDDPM principle and code analysis

Foreword

Improved Denoising Diffusion Probabilistic Models (IDDPM) improves on the earlier Denoising Diffusion Probabilistic Models (DDPM).
Several important formulas were covered in the previous post, DDPM principle and code analysis, and are not repeated here. This article focuses on the improvements and the code.

This article draws on the video "58, the PyTorch code of Improved Diffusion, an in-depth line-by-line explanation". The uploader explains it very clearly, and it is well worth watching.

This article is constantly being updated...

DDIM optimizes the sampling process, using the respacing technique to reduce the number of sampling steps. See DDIM principle and code (Denoising diffusion implicit models).

Reproducing the paper's code took some effort. I followed the article Install and configure MPI under Ubuntu 20.04, with thanks to its author.
The mpi4py library would not install at first; it turned out that mpicc was not installed.



The code

The walkthrough is based on OpenAI's official code, openai/improved-diffusion.
This part focuses on forward diffusion, reverse diffusion, sampling, and loss calculation. The model itself is a U-Net with attention and is not covered here.

The focus is the GaussianDiffusion class in improved_diffusion/gaussian_diffusion.py. Only the core of the code is excerpted; robustness code such as asserts and type conversions is omitted. To run it, refer to the code in the original repository.

GaussianDiffusion

init

The constructor takes betas as a parameter; they are generated as follows. The original DDPM uses the linear schedule, while IDDPM uses the cosine schedule.

# gaussian_diffusion.py
def get_named_beta_schedule(schedule_name, num_diffusion_timesteps):
    """
    Get a pre-defined beta schedule for the given name.
    """
    if schedule_name == "linear":
        # Linear schedule from Ho et al, extended to work for any number of
        # diffusion steps.
        scale = 1000 / num_diffusion_timesteps
        beta_start = scale * 0.0001
        beta_end = scale * 0.02
        return np.linspace(
            beta_start, beta_end, num_diffusion_timesteps, dtype=np.float64
        )
    elif schedule_name == "cosine":
        return betas_for_alpha_bar(
            num_diffusion_timesteps,
            lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
        )
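The helper betas_for_alpha_bar is called above but not shown. The following is a minimal sketch of what such a helper does, written from the IDDPM paper's definition $\beta_t = 1 - \overline{\alpha}_t / \overline{\alpha}_{t-1}$; the max_beta=0.999 clip and the exact signature are assumptions here, so check the original repository for the real implementation.

```python
import math
import numpy as np

def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
    """Discretize a continuous alpha_bar(t), t in [0, 1], into per-step betas."""
    betas = []
    for i in range(num_diffusion_timesteps):
        t1 = i / num_diffusion_timesteps
        t2 = (i + 1) / num_diffusion_timesteps
        # beta = 1 - alpha_bar(t2) / alpha_bar(t1), clipped (assumed max_beta)
        # so that it never reaches 1 at the end of the chain.
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.array(betas, dtype=np.float64)

# Same cosine alpha_bar as in get_named_beta_schedule above.
cosine_alpha_bar = lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
betas = betas_for_alpha_bar(1000, cosine_alpha_bar)
```

With the cosine schedule the first betas are very small and the last one hits the clip value.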

alphas_cumprod is $\overline{\alpha}_t$, alphas_cumprod_prev is $\overline{\alpha}_{t-1}$, and alphas_cumprod_next is $\overline{\alpha}_{t+1}$.

alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas, axis=0)
alphas_cumprod_prev = np.append(1.0, self.alphas_cumprod[:-1])
alphas_cumprod_next = np.append(self.alphas_cumprod[1:], 0.0)

sqrt_alphas_cumprod is $\sqrt{\overline{\alpha}_t}$

sqrt_alphas_cumprod = np.sqrt(self.alphas_cumprod)

sqrt_one_minus_alphas_cumprod is $\sqrt{1-\overline{\alpha}_t}$

sqrt_one_minus_alphas_cumprod = np.sqrt(1.0 - self.alphas_cumprod)

log_one_minus_alphas_cumprod is $\log(1-\overline{\alpha}_t)$

log_one_minus_alphas_cumprod = np.log(1.0 - self.alphas_cumprod)

sqrt_recip_alphas_cumprod is $\frac{1}{\sqrt{\overline{\alpha}_t}}$

sqrt_recip_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod)

sqrt_recipm1_alphas_cumprod is $\sqrt{\frac{1}{\overline{\alpha}_t}-1}$

sqrt_recipm1_alphas_cumprod = np.sqrt(1.0 / self.alphas_cumprod - 1)

posterior_variance is $\widetilde{\beta}_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t$

# calculations for posterior q(x_{t-1} | x_t, x_0)
posterior_variance = (
   betas * (1.0 - self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
)

Take the log of the posterior variance:

# log calculation clipped because the posterior variance is 0 at the
# beginning of the diffusion chain.
posterior_log_variance_clipped = np.log(
   np.append(self.posterior_variance[1], self.posterior_variance[1:])
)

$\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$, where the coefficient of $X_0$ corresponds to posterior_mean_coef1 and the coefficient of $X_t$ corresponds to posterior_mean_coef2.

posterior_mean_coef1 = (
    betas * np.sqrt(self.alphas_cumprod_prev) / (1.0 - self.alphas_cumprod)
)
posterior_mean_coef2 = (
   (1.0 - self.alphas_cumprod_prev)
   * np.sqrt(alphas)
   / (1.0 - self.alphas_cumprod)
)
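As a sanity check on the quantities above, here is a small NumPy sketch that rebuilds them outside the class. The linear schedule with 1000 steps is just an illustrative choice. It checks one identity that follows from the formulas: averaging the posterior mean over $X_t \sim q(X_t|X_0)$ must give $\sqrt{\overline{\alpha}_{t-1}}X_0$, so coef1 + coef2 * $\sqrt{\overline{\alpha}_t}$ should equal $\sqrt{\overline{\alpha}_{t-1}}$.

```python
import numpy as np

# Illustrative linear schedule (assumed values: 1000 steps, 1e-4 to 0.02).
betas = np.linspace(1e-4, 0.02, 1000, dtype=np.float64)
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas, axis=0)
alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])

posterior_variance = (
    betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)
)
posterior_mean_coef1 = (
    betas * np.sqrt(alphas_cumprod_prev) / (1.0 - alphas_cumprod)
)
posterior_mean_coef2 = (
    (1.0 - alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - alphas_cumprod)
)

# Identity implied by the posterior mean formula (see lead-in).
lhs = posterior_mean_coef1 + posterior_mean_coef2 * np.sqrt(alphas_cumprod)
rhs = np.sqrt(alphas_cumprod_prev)
```

The first entry of posterior_variance is exactly 0 because alphas_cumprod_prev[0] is 1, which is why the log variance is clipped by reusing the entry at index 1.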



q_mean_variance

Takes (x_start, t) and returns the mean and variance of
$q(X_t|X_0) = N(X_t; \sqrt{\overline{\alpha}_t}X_0, (1-\overline{\alpha}_t)I)$

mean = (
     _extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
)
variance = _extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape)
log_variance = _extract_into_tensor(
   self.log_one_minus_alphas_cumprod, t, x_start.shape
)
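The helper _extract_into_tensor appears throughout but is not shown in this article. Its job is to pick one coefficient per batch element (indexed by t) and broadcast it to the image shape. Below is a NumPy stand-in under the hypothetical name extract_into_array; the upstream helper works on torch tensors, so treat this only as an illustration of the indexing and broadcasting.

```python
import numpy as np

def extract_into_array(arr, timesteps, broadcast_shape):
    """NumPy stand-in (hypothetical name) for _extract_into_tensor."""
    res = np.asarray(arr, dtype=np.float32)[timesteps]  # one value per batch item
    while res.ndim < len(broadcast_shape):
        res = res[..., None]                            # append singleton dims
    return np.broadcast_to(res, broadcast_shape)

coeffs = np.array([0.1, 0.2, 0.3, 0.4])  # toy per-timestep coefficients
t = np.array([3, 0])                     # a different timestep per sample
out = extract_into_array(coeffs, t, (2, 3, 8, 8))
```

Each sample in the batch gets its own scalar coefficient, repeated over channels and pixels.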

q_sample

Uses reparameterization to obtain the noised image:
$X_t = \sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}~\epsilon$

_extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start
+ _extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape)
* noise
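A minimal NumPy sketch of the same forward step, assuming a linear schedule and a [B, C, H, W] batch (both are illustrative choices, not taken from the repository):

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)      # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)

def q_sample(x_start, t, noise):
    # Per-sample coefficients, reshaped to broadcast over [B, C, H, W].
    a = np.sqrt(alphas_cumprod)[t].reshape(-1, 1, 1, 1)
    s = np.sqrt(1.0 - alphas_cumprod)[t].reshape(-1, 1, 1, 1)
    return a * x_start + s * noise

x_start = rng.uniform(-1, 1, size=(2, 3, 4, 4))
noise = rng.standard_normal(x_start.shape)
t = np.array([0, 999])                     # first and last timestep
x_t = q_sample(x_start, t, noise)
```

At t = 0 the result is almost the clean image; at the last step it is almost pure noise.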



q_posterior_mean_variance

The mean and variance of the posterior:
$\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$

posterior_mean = (
            _extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start
            + _extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
        )

$\widetilde{\beta}_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t$ was already computed in the init function, so there is no need to compute it again:

posterior_variance = _extract_into_tensor(self.posterior_variance, t, x_t.shape)



p_mean_variance

The input here is $x$ at time $t$; the function predicts the mean and variance at time $t-1$.
The variance can be either learned or fixed.
(1) The variance is learned, which is the case under the following condition:

if self.model_var_type in [ModelVarType.LEARNED, ModelVarType.LEARNED_RANGE]:

The model output then needs to be split along the channel dimension:

model_output, model_var_values = th.split(model_output, C, dim=1)

There are two cases here. In the first, the model predicts the log variance directly:

if self.model_var_type == ModelVarType.LEARNED:
    model_log_variance = model_var_values
    model_variance = th.exp(model_log_variance)

In IDDPM, the model instead predicts a range: it predicts $v$ in the following formula.
$\Sigma_{\theta}(X_t, t)=\exp(v\log\beta_t + (1-v)\log\widetilde{\beta}_t)$

Because $\widetilde{\beta}_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\beta_t$ and $1-\overline{\alpha}_{t-1} < 1-\overline{\alpha}_t$, it follows that $\widetilde{\beta}_t < \beta_t$.

So max_log is $\log \beta_t$

max_log = _extract_into_tensor(np.log(self.betas), t, x.shape)

And min_log is $\log \widetilde{\beta}_t$

min_log = _extract_into_tensor(
    self.posterior_log_variance_clipped, t, x.shape
)

Map the predicted value from [-1, 1] to [0, 1]:

# The model_var_values is [-1, 1] for [min_var, max_var].
frac = (model_var_values + 1) / 2

Then apply the formula $\Sigma_{\theta}(X_t, t)=\exp(v\log\beta_t + (1-v)\log\widetilde{\beta}_t)$, with frac playing the role of $v$:

model_log_variance = frac * max_log + (1 - frac) * min_log
model_variance = th.exp(model_log_variance)
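The interpolation above can be tried in isolation. This NumPy sketch assumes a linear schedule and uses a hypothetical function name, learned_range_variance; it shows that a raw output of -1 gives $\widetilde{\beta}_t$, +1 gives $\beta_t$, and 0 gives their geometric mean, since the interpolation is linear in log space.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])
posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)
posterior_log_variance_clipped = np.log(
    np.append(posterior_variance[1], posterior_variance[1:])
)

def learned_range_variance(model_var_values, t):
    """Hypothetical helper: map a raw output in [-1, 1] to a variance."""
    max_log = np.log(betas)[t]                   # log beta_t
    min_log = posterior_log_variance_clipped[t]  # log beta-tilde_t
    frac = (model_var_values + 1) / 2            # [-1, 1] -> [0, 1]
    return np.exp(frac * max_log + (1 - frac) * min_log)

t = 500
var_low = learned_range_variance(-1.0, t)
var_mid = learned_range_variance(0.0, t)
var_high = learned_range_variance(1.0, t)
```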



(2) The variance is fixed.
DDPM uses $\beta_t$, while IDDPM offers two options: $\beta_t$ or $\widetilde{\beta}_t$.
The large variance is $\beta_t$:

ModelVarType.FIXED_LARGE: (
    # for fixedlarge, we set the initial (log-)variance like so
    # to get a better decoder log likelihood.
    np.append(self.posterior_variance[1], self.betas[1:]),
    np.log(np.append(self.posterior_variance[1], self.betas[1:])),
),

The small variance is $\widetilde{\beta}_t$:

ModelVarType.FIXED_SMALL: (
    self.posterior_variance,
    self.posterior_log_variance_clipped,
),

Note that the results above are arrays over all timesteps, so we only need to extract the entries for the current timestep t:

model_variance = _extract_into_tensor(model_variance, t, x.shape)
model_log_variance = _extract_into_tensor(model_log_variance, t, x.shape)

Next comes the prediction of the mean.
(1) Predict the mean of $X_{t-1}$

if self.model_mean_type == ModelMeanType.PREVIOUS_X:

In this case the model output is used directly:

model_mean = model_output

Along the way, $X_0$ is also predicted. It is not used in training, but it is used in evaluation.

pred_xstart = process_xstart(
    self._predict_xstart_from_xprev(x_t=x, t=t, xprev=model_output)
)



(2) Predict $X_0$

if self.model_mean_type == ModelMeanType.START_X:

The output goes through a post-processing function:

pred_xstart = process_xstart(model_output)



(3) Predict the noise

ModelMeanType.EPSILON
pred_xstart = process_xstart(
   self._predict_xstart_from_eps(x_t=x, t=t, eps=model_output)
)

The mean is then obtained from the posterior formula, using pred_xstart as $X_0$:
$\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$

model_mean, _, _ = self.q_posterior_mean_variance(
   x_start=pred_xstart, x_t=x, t=t
)



_predict_xstart_from_xprev

$X_0$ is recovered by solving the posterior mean formula for $X_0$, with the model's prediction of $X_{t-1}$ in place of $\widetilde{\mu}$: $\widetilde{\mu}(X_t, X_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t} X_0 + \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_{t}} X_t$

return (  # (xprev - coef2*x_t) / coef1
    _extract_into_tensor(1.0 / self.posterior_mean_coef1, t, x_t.shape) * xprev 
    - _extract_into_tensor(
       self.posterior_mean_coef2 / self.posterior_mean_coef1, t, x_t.shape
    )
    * x_t
)
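A quick round-trip check of this inversion in NumPy, assuming a linear schedule and an arbitrary timestep (t = 300 is only an example): compute the posterior mean from a known $X_0$ and $X_t$, then recover $X_0$ with the (xprev - coef2 * x_t) / coef1 expression.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)
alphas_cumprod_prev = np.append(1.0, alphas_cumprod[:-1])
coef1 = betas * np.sqrt(alphas_cumprod_prev) / (1.0 - alphas_cumprod)
coef2 = (1.0 - alphas_cumprod_prev) * np.sqrt(alphas) / (1.0 - alphas_cumprod)

t = 300
x_start = rng.uniform(-1, 1, size=(3, 4, 4))
x_t = rng.standard_normal(x_start.shape)

# Forward: posterior mean of X_{t-1} given X_t and X_0.
xprev = coef1[t] * x_start + coef2[t] * x_t
# Inverse: the expression used in _predict_xstart_from_xprev.
recovered = (1.0 / coef1[t]) * xprev - (coef2[t] / coef1[t]) * x_t
```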



_predict_xstart_from_eps

Solving the q_sample formula $X_t = \sqrt{\overline{\alpha}_t}X_0+\sqrt{1-\overline{\alpha}_t}~\epsilon$ for $X_0$ gives
$X_0 = \frac{1}{\sqrt{\overline{\alpha}_t}}(X_t-\sqrt{1-\overline{\alpha}_t}~\epsilon)$
which simplifies to
$X_0 = \frac{1}{\sqrt{\overline{\alpha}_t}}X_t - \sqrt{\frac{1}{\overline{\alpha}_t}-1}~\epsilon$

_extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t
- _extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * eps
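The same kind of round trip works for the noise parameterization. In this NumPy sketch (linear schedule and t = 999 are illustrative assumptions), an image is noised with the q_sample formula and then recovered exactly when the true noise is supplied:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
sqrt_recip_alphas_cumprod = np.sqrt(1.0 / alphas_cumprod)
sqrt_recipm1_alphas_cumprod = np.sqrt(1.0 / alphas_cumprod - 1)

t = 999
x_start = rng.uniform(-1, 1, size=(3, 4, 4))
eps = rng.standard_normal(x_start.shape)

# Forward noising (q_sample).
x_t = np.sqrt(alphas_cumprod[t]) * x_start + np.sqrt(1.0 - alphas_cumprod[t]) * eps
# Inversion used by _predict_xstart_from_eps.
pred_xstart = sqrt_recip_alphas_cumprod[t] * x_t - sqrt_recipm1_alphas_cumprod[t] * eps
```

In practice eps is the network's estimate rather than the true noise, so pred_xstart is only as good as that estimate.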



p_sample

Sample $X_{t-1}$ based on $X_t$.
out is the dict {"mean": model_mean, "variance": model_variance, "log_variance": model_log_variance, "pred_xstart": pred_xstart}

# Get the mean, variance, and log variance of X[t-1], and the predicted X[0]
out = self.p_mean_variance(
    model,
    x,
    t,
    clip_denoised=clip_denoised,
    denoised_fn=denoised_fn,
    model_kwargs=model_kwargs,
)

Reparameterized sampling. nonzero_mask is 0 where t == 0, so no noise is added at the final step:

noise = th.randn_like(x)
sample = out["mean"] + nonzero_mask * th.exp(0.5 * out["log_variance"]) * noise
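A NumPy sketch of this sampling step, with a hypothetical function name sample_step and toy inputs. The mask construction here is an assumption about how nonzero_mask is built (1 where t != 0, reshaped to broadcast over the image dimensions):

```python
import numpy as np

def sample_step(mean, log_variance, t, rng):
    """Hypothetical helper: draw X_{t-1} from N(mean, exp(log_variance))."""
    noise = rng.standard_normal(mean.shape)
    # 1.0 where t != 0, 0.0 where t == 0; shape [B, 1, 1, 1] for broadcasting.
    nonzero_mask = (t != 0).astype(mean.dtype).reshape(-1, *([1] * (mean.ndim - 1)))
    # exp(0.5 * log_variance) is the standard deviation.
    return mean + nonzero_mask * np.exp(0.5 * log_variance) * noise

rng = np.random.default_rng(0)
mean = np.zeros((2, 3, 4, 4))
log_variance = np.full_like(mean, np.log(0.01))
t = np.array([0, 5])  # first sample is at the final step, second is not
sample = sample_step(mean, log_variance, t, rng)
```

The sample at t = 0 equals the predicted mean exactly, while the other one is perturbed.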

The p_sample_loop and p_sample_loop_progressive functions call this function iteratively.



p_sample_loop_progressive

Start from $X_T$: either the supplied noise or a fresh standard Gaussian sample.

if noise is not None:
   img = noise
else:
   img = th.randn(*shape, device=device)

The timestep indices run in reverse: T-1, T-2, …, 0

indices = list(range(self.num_timesteps))[::-1]

Sample step by step, feeding each result into the next step:

for i in indices:
    t = th.tensor([i] * shape[0], device=device)
    with th.no_grad():
        out = self.p_sample(
                model,
                img,
                t,
                clip_denoised=clip_denoised,
                denoised_fn=denoised_fn,
                model_kwargs=model_kwargs,
             )
        yield out
        img = out["sample"]



_vb_terms_bpd

vb stands for variational lower bound, and bpd for bits per dimension.

Find the true mean and variance of the q distribution

true_mean, _, true_log_variance_clipped = self.q_posterior_mean_variance(
     x_start=x_start, x_t=x_t, t=t
)

Find the mean and variance of the model predictions for the p-distribution

out = self.p_mean_variance(
           model, x_t, t, clip_denoised=clip_denoised, model_kwargs=model_kwargs
)

Compute the KL divergence between the two Gaussian distributions:
$L_{t-1} = D_{KL}(q(X_{t-1}|X_t, X_0)~||~p_\theta(X_{t-1}|X_t))$

kl = normal_kl(
    true_mean, true_log_variance_clipped, out["mean"], out["log_variance"]
)
kl = mean_flat(kl) / np.log(2.0)
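For reference, the KL divergence between two diagonal Gaussians given by (mean, log variance) has a closed form. The sketch below is a NumPy version written from that standard formula, not copied from the repository; the division by log 2 converts nats to bits, which is what the bpd unit needs.

```python
import numpy as np

def normal_kl(mean1, logvar1, mean2, logvar2):
    """KL(N(mean1, exp(logvar1)) || N(mean2, exp(logvar2))), elementwise, in nats."""
    return 0.5 * (
        -1.0
        + logvar2
        - logvar1
        + np.exp(logvar1 - logvar2)
        + (mean1 - mean2) ** 2 * np.exp(-logvar2)
    )

kl_same = normal_kl(0.0, 0.0, 0.0, 0.0)   # identical distributions
kl_shift = normal_kl(0.0, 0.0, 1.0, 0.0)  # unit variance, means 1 apart
kl_shift_bits = kl_shift / np.log(2.0)    # nats -> bits
```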



$L_0=-\log p_{\theta}(X_0|X_1)$
The discrete distribution is modeled as a difference of cumulative distribution function values.

decoder_nll = -discretized_gaussian_log_likelihood(
    x_start, means=out["mean"], log_scales=0.5 * out["log_variance"]
)
decoder_nll = mean_flat(decoder_nll) / np.log(2.0)

Combine the two: use the decoder NLL at t = 0 and the KL divergence at all other timesteps.

output = th.where((t == 0), decoder_nll, kl)



discretized_gaussian_log_likelihood

improved_diffusion/losses.py

def discretized_gaussian_log_likelihood(x, *, means, log_scales):
    """
    Compute the log-likelihood of a Gaussian distribution discretizing to a
    given image.

    :param x: the target images. It is assumed that this was uint8 values,
              rescaled to the range [-1, 1].
    :param means: the Gaussian mean Tensor.
    :param log_scales: the Gaussian log stddev Tensor.
    :return: a tensor like x of log probabilities (in nats).
    """
    assert x.shape == means.shape == log_scales.shape
    centered_x = x - means
    inv_stdv = th.exp(-log_scales)
    plus_in = inv_stdv * (centered_x + 1.0 / 255.0)
    cdf_plus = approx_standard_normal_cdf(plus_in)
    min_in = inv_stdv * (centered_x - 1.0 / 255.0)
    cdf_min = approx_standard_normal_cdf(min_in)
    log_cdf_plus = th.log(cdf_plus.clamp(min=1e-12))
    log_one_minus_cdf_min = th.log((1.0 - cdf_min).clamp(min=1e-12))
    cdf_delta = cdf_plus - cdf_min
    log_probs = th.where(
        x < -0.999,
        log_cdf_plus,
        th.where(x > 0.999, log_one_minus_cdf_min, th.log(cdf_delta.clamp(min=1e-12))),
    )
    assert log_probs.shape == x.shape
    return log_probs
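To see why the two edge cases (x < -0.999 and x > 0.999) matter, here is a NumPy port of the function, together with a tanh-based approx_standard_normal_cdf. The CDF approximation is an assumption about what the helper computes, and keyword-only arguments are dropped for brevity. Evaluated on all 256 pixel levels, the bin probabilities telescope and sum to 1.

```python
import numpy as np

def approx_standard_normal_cdf(x):
    # Assumed tanh-based approximation of the standard normal CDF.
    return 0.5 * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def discretized_gaussian_log_likelihood(x, means, log_scales):
    centered_x = x - means
    inv_stdv = np.exp(-log_scales)
    # CDF at the upper and lower edge of each pixel bin (bin width 2/255).
    cdf_plus = approx_standard_normal_cdf(inv_stdv * (centered_x + 1.0 / 255.0))
    cdf_min = approx_standard_normal_cdf(inv_stdv * (centered_x - 1.0 / 255.0))
    log_cdf_plus = np.log(np.clip(cdf_plus, 1e-12, None))
    log_one_minus_cdf_min = np.log(np.clip(1.0 - cdf_min, 1e-12, None))
    log_delta = np.log(np.clip(cdf_plus - cdf_min, 1e-12, None))
    # The lowest bin absorbs the left tail, the highest bin the right tail.
    return np.where(
        x < -0.999,
        log_cdf_plus,
        np.where(x > 0.999, log_one_minus_cdf_min, log_delta),
    )

pixel_values = np.arange(256) / 127.5 - 1.0  # uint8 levels rescaled to [-1, 1]
log_probs = discretized_gaussian_log_likelihood(
    pixel_values, means=np.full(256, 0.1), log_scales=np.full(256, np.log(0.2))
)
total_prob = np.exp(log_probs).sum()
```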



training_losses

If the loss type is KL:

if self.loss_type == LossType.KL or self.loss_type == LossType.RESCALED_KL:

Call the _vb_terms_bpd function described earlier:

terms["loss"] = self._vb_terms_bpd(
                model=model,
                x_start=x_start,
                x_t=x_t,
                t=t,
                clip_denoised=False,
                model_kwargs=model_kwargs,
            )["output"]



The MSE loss is handled differently depending on what the model predicts.

elif self.loss_type == LossType.MSE or self.loss_type == LossType.RESCALED_MSE:

(1) If the model predicts the variance:

if self.model_var_type in [
    ModelVarType.LEARNED,
    ModelVarType.LEARNED_RANGE,
]:

Split the output into model_output and model_var_values:

B, C = x_t.shape[:2]
assert model_output.shape == (B, C * 2, *x_t.shape[2:])
model_output, model_var_values = th.split(model_output, C, dim=1)
frozen_out = th.cat([model_output.detach(), model_var_values], dim=1)
terms["vb"] = self._vb_terms_bpd(
    model=lambda *args, r=frozen_out: r,
    x_start=x_start,
    x_t=x_t,
    t=t,
    clip_denoised=False,
)["output"]

Because the model has already produced its output once, there is no need to run it again, so the model passed in is simply a function that returns the stored output. frozen_out detaches the mean prediction so that the vb term trains only the variance and does not affect the optimization of the mean.

model=lambda *args, r=frozen_out: r

The lambda is an anonymous function, equivalent to the following named function:

def fun(*args, r=frozen_out):
    return r
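A small pure-Python illustration of how this lambda behaves. The list stands in for the frozen_out tensor, and the explanation that the default argument pins the value at definition time is my reading of the idiom, not something stated in the repository:

```python
frozen_out = [1.0, 2.0, 3.0]  # stand-in for the frozen_out tensor

# Default arguments are evaluated once, when the lambda is created,
# so r keeps pointing at the object frozen_out referred to at that moment.
model = lambda *args, r=frozen_out: r

first = model("x_t", "t")  # positional arguments are accepted and ignored
frozen_out = None          # rebinding the outer name does not change r
second = model()
```

Whatever arguments _vb_terms_bpd passes to this fake model, it always gets the stored output back.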

If the loss type is the rescaled MSE, the vb term is scaled:

if self.loss_type == LossType.RESCALED_MSE:
    # Divide by 1000 for equivalence with initial implementation.
    # Without a factor of 1/1000, the VB term hurts the MSE term.
    terms["vb"] *= self.num_timesteps / 1000.0

Next, the target depends on what the model predicts.
It can be the posterior mean of $X_{t-1}$, $X_0$, or the noise.

target = {
    ModelMeanType.PREVIOUS_X: self.q_posterior_mean_variance(
        x_start=x_start, x_t=x_t, t=t
    )[0],
    ModelMeanType.START_X: x_start,
    ModelMeanType.EPSILON: noise,
}[self.model_mean_type]

Then compute the MSE loss:

terms["mse"] = mean_flat((target - model_output) ** 2)

Finally, combine the loss terms:

if "vb" in terms:
    terms["loss"] = terms["mse"] + terms["vb"]
else:
    terms["loss"] = terms["mse"]
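To make the shapes concrete, here is a NumPy sketch of the combination with toy numbers. mean_flat is written here as an average over all non-batch dimensions, which is an assumption about the upstream helper, and the vb values are made up for illustration:

```python
import numpy as np

def mean_flat(x):
    """Average over every dimension except the batch dimension (assumed behaviour)."""
    return x.mean(axis=tuple(range(1, x.ndim)))

rng = np.random.default_rng(0)
target = rng.standard_normal((4, 3, 8, 8))
model_output = target + 0.1  # toy prediction that is off by 0.1 everywhere

terms = {
    "mse": mean_flat((target - model_output) ** 2),  # shape [B]
    "vb": np.full(4, 0.02),                          # made-up per-sample vb term
}
if "vb" in terms:
    terms["loss"] = terms["mse"] + terms["vb"]
else:
    terms["loss"] = terms["mse"]
```

Each entry of terms is a per-sample vector; averaging over the batch happens later in the training loop.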

In short, when the variance is learned, the loss includes an additional vb term.

Origin blog.csdn.net/weixin_43850253/article/details/128275723