[New direction of generative models]: score-based generative models

0.Preface

Recently (June 2021) a new trending paradigm of generative models has emerged: the score-based generative model. To introduce it in one sentence:

By learning score functions (gradients of log probability density functions) on a large number of noise-perturbed data distributions, and then sampling with Langevin dynamics, we obtain samples consistent with the training set. This new class of generative models is called score-based generative models (closely related to diffusion probabilistic models).

This score-based generative model has the following advantages:

  • ① GAN-level sample quality without adversarial training
  • ② Flexible model architectures
  • ③ Exact log-likelihood computation
  • ④ Uniquely identifiable representation learning
  • ⑤ The process is reversible. My understanding is that there is no need to train a separate feature/inversion network as with StyleGAN, and it may not require as much computation as flow-based models.

The purpose of this blog is to introduce the motivation, basic concepts, and potential applications of score-based generative models [1]. This article is mainly translated from the blog of Dr. Yang Song (Ph.D., Stanford University), a pioneer in this field.

The picture below was shared by Twitter user Simone.

[Figure]

1 Introduction

Currently, generative models can be mainly divided into two broad categories based on the way they represent probability distributions:

  • Likelihood-based models: directly learn the distribution's PDF (probability density function) or PMF (probability mass function) via (approximate) maximum likelihood. Typical likelihood-based methods include autoregressive models [2], normalizing flow models (such as NICE, FLOW, etc.) [3], energy-based models (EBMs) [4], and VAEs [5].

[Figure: examples of likelihood-based models]

  • Implicit generative models: the probability distribution is represented implicitly through the sampling process of the model, as in GANs, where new samples are obtained by feeding random Gaussian vectors into the generator network.

[Figure: a GAN generator as an example of an implicit generative model]
Both types of generative models have significant limitations. Likelihood-based models must keep the normalizing constant tractable in order to compute likelihoods (more on this later), which strongly restricts the network architecture (the network cannot be designed freely, e.g., via NAS), or they must rely on surrogate objectives that only approximate maximum likelihood during training. The biggest problem with implicit generative models is that they require adversarial training, which is notoriously unstable [6].

This blog introduces the score-based generative model proposed by Dr. Song, which solves/avoids the problems just mentioned. The core idea of the score-based generative model is:

Model the gradient of the log probability density function, a quantity known as the (Stein) score function [7].

Such score-based generative models do not need to deal with the normalizing constant the way likelihood-based models do. Moreover, they train very well on noise-perturbed data: this class of methods can recover the image underneath the noise and achieves good sample quality.

It performs well in image generation [8, 9], audio synthesis (WaveGrad, DiffWave), shape generation [10], and music generation. It even performs better than GANs in the field of audio synthesis!

When the noise perturbation process is given by a stochastic differential equation (SDE), score-based generative models become mathematically connected to flow-based models such as FLOW, which enables exact likelihood computation and representation learning.

In addition, modeling and estimating the score facilitates solving inverse problems (which, I think, is also a strength of flow models such as FLOW and NICE). These inverse problems include:

  • image inpainting[8,9]
  • image colorization[9]
  • Medical image reconstruction and compressed sensing, etc.

2. The score function, score-based models, and score matching

Suppose we have a dataset $\{x_1, x_2, \dots, x_N\}$, where each $x_i$, $i \in \{1, \dots, N\}$, is drawn independently (i.i.d.) from an underlying data distribution $p(x)$. The goal of a generative model is to fit this data distribution $p(x)$ so that we can sample from it at will and generate new data points that follow the distribution.

In order to construct such a generative model, we first need a way to represent the probability distribution. One way, as mentioned above, is used by likelihood-based models: directly model the probability density function (p.d.f.) or probability mass function (p.m.f.).

Let $f_\theta(\mathbf{x}) \in \mathbb{R}$ be a real-valued function parameterized by learnable parameters $\theta$. Then a p.d.f. can be defined by the following formula:

$$p_\theta(\mathbf{x}) = \frac{e^{-f_\theta(\mathbf{x})}}{Z_\theta}$$

Here, $Z_\theta > 0$ is a normalizing constant depending on $\theta$, chosen so that $\int p_\theta(\mathbf{x})\,\mathrm{d}\mathbf{x} = 1$. The function $f_\theta(\mathbf{x})$ is often called an unnormalized probabilistic model, or an energy-based model (EBM).

We can train $p_\theta(\mathbf{x})$ by maximizing the log-likelihood of the data [11]:

$$\max_\theta \sum_{i=1}^{N} \log p_\theta(\mathbf{x}_i)$$

However, this requires $p_\theta(\mathbf{x})$ to be a normalized PDF, which poses a challenge:

we must evaluate the normalizing constant $Z_\theta$, which is typically an intractable quantity for a general $f_\theta(\mathbf{x})$.

Therefore, to make maximum likelihood training feasible, likelihood-based models adopt one of the following two strategies, both of which (especially in flow-based models) can add substantial computational cost:

  • Restrict the model architecture (causal convolutions in autoregressive models, invertible networks in normalizing flow models) so that $Z_\theta = 1$.

  • Approximate the normalizing constant (variational inference in VAEs, or the MCMC sampling used in contrastive divergence).

Score-based models sidestep the normalizing constant by modeling the score function instead of the density function. For a distribution $p(\mathbf{x})$, its score function is defined as:

$$\nabla_\mathbf{x} \log p(\mathbf{x})$$

Models of the score function are collectively called score-based models, written $\mathbf{s}_\theta(\mathbf{x})$; the goal is to make $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_\mathbf{x} \log p(\mathbf{x})$. Taking the energy-based model $p_\theta(\mathbf{x}) = \frac{e^{-f_\theta(\mathbf{x})}}{Z_\theta}$ as an example and expanding, we get:

$$\mathbf{s}_\theta(\mathbf{x}) = \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) = -\nabla_\mathbf{x} f_\theta(\mathbf{x}) - \underbrace{\nabla_\mathbf{x} \log Z_\theta}_{=0} = -\nabla_\mathbf{x} f_\theta(\mathbf{x})$$

It can be seen that $\mathbf{s}_\theta(\mathbf{x})$ is independent of the normalizing constant $Z_\theta$. This property greatly expands the family of usable generative models, because we no longer need to design special architectures to keep $Z_\theta$ tractable, as the likelihood-based methods above must.
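To see this independence concretely, here is a minimal PyTorch sketch (all names are illustrative) that computes the score of an unnormalized Gaussian energy model with autograd; the result matches the analytic score and never touches $Z_\theta$.

```python
import torch

# Unnormalized energy: f_theta(x) = ||x - mu||^2 / (2 * sigma^2).
# The density is p(x) = exp(-f_theta(x)) / Z_theta, but Z_theta never appears below.
mu, sigma = 1.0, 2.0

def f_theta(x):
    return ((x - mu) ** 2).sum() / (2 * sigma ** 2)

x = torch.tensor([0.5, -1.0], requires_grad=True)
# Score = grad_x log p(x) = -grad_x f_theta(x); Z_theta drops out since it does not depend on x.
score_autograd = torch.autograd.grad(-f_theta(x), x)[0]
score_analytic = -(x.detach() - mu) / sigma ** 2  # known score of N(mu, sigma^2 I)
print(score_autograd, score_analytic)  # identical up to numerical precision
```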

[Figure: Parameterizing probability density functions. No matter how you change the model family and parameters, the density has to stay normalized (the area under the curve must integrate to one).]

[Figure: Parameterizing score functions. No need to worry about normalization.]

Similar to the likelihood-based approach, we can train a score-based model by minimizing the Fisher divergence between the model and the data distribution:

$$\mathbb{E}_{p(\mathbf{x})}\big[\|\nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x})\|_2^2\big]$$

Intuitively, the Fisher divergence is the squared $\ell_2$ distance between the ground-truth data score and the score-based model. However, since the data score $\nabla_\mathbf{x} \log p(\mathbf{x})$ is unknown, we cannot optimize the Fisher divergence directly. Fortunately, there is a family of methods called score matching [12, 13, 14] that minimize the Fisher divergence without access to the ground-truth data score.

Score matching objectives can be estimated directly on a dataset and optimized with stochastic gradient descent (SGD), analogous to the log-likelihood objective used for training likelihood-based models (which requires a known normalizing constant).

We can therefore train a score-based model by optimizing a score matching objective, without any adversarial learning!

In addition, the score matching objective gives us great flexibility in model architecture design: the Fisher divergence does not require $\mathbf{s}_\theta(\mathbf{x})$ to be the actual score function of any normalized distribution, i.e., no strong assumption is placed on $\mathbf{s}_\theta(\mathbf{x})$. In practice, the only requirement is that the score-based model be a vector-valued function with the same input and output dimensionality, which is easy to satisfy.

In short, we can represent a distribution by modeling its score function, and such a model can be trained with score matching techniques.
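As a concrete illustration, below is a minimal sketch (PyTorch; all names are my own) of a score-based model: just a vector-valued network whose output has the same dimensionality as its input, together with a single-noise-level denoising score matching loss, one simple member of the score matching family [13].

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """A toy score model: maps R^d -> R^d (same input and output dimensionality)."""
    def __init__(self, dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return self.net(x)  # s_theta(x), an estimate of grad_x log p(x)

def denoising_score_matching_loss(score_model, x, sigma=0.1):
    """Denoising score matching at a single noise level sigma [13].
    The score of the perturbation kernel N(x_tilde; x, sigma^2 I) is -(x_tilde - x) / sigma^2,
    so the model is trained to predict -z / sigma, where x_tilde = x + sigma * z."""
    z = torch.randn_like(x)
    x_tilde = x + sigma * z
    target = -z / sigma
    return ((score_model(x_tilde) - target) ** 2).sum(dim=-1).mean()

# Usage sketch: fit the score of a toy 2-D dataset.
score_model = ScoreNet(dim=2)
optimizer = torch.optim.Adam(score_model.parameters(), lr=1e-3)
x = torch.randn(256, 2) * 0.5 + 1.0  # stand-in for real training data
loss = denoising_score_matching_loss(score_model, x)
loss.backward()
optimizer.step()
```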

3. Langevin dynamics

Once we have trained a score-based model $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_\mathbf{x} \log p(\mathbf{x})$, we can use Langevin dynamics [15, 16] to iteratively draw samples.

Langevin dynamics is an MCMC procedure (Markov Chain Monte Carlo, a family of statistical algorithms for drawing random samples from complex distributions) that samples from the data distribution $p(\mathbf{x})$ using only its score function $\nabla_\mathbf{x} \log p(\mathbf{x})$. Specifically, it initializes the chain from an arbitrary prior distribution $\mathbf{x}_0 \sim \pi(\mathbf{x})$ and then iterates as follows:

$$\mathbf{x}_{i+1} \leftarrow \mathbf{x}_i + \epsilon \nabla_\mathbf{x} \log p(\mathbf{x}_i) + \sqrt{2\epsilon}\,\mathbf{z}_i, \qquad i = 0, 1, \dots, K$$

Here $\mathbf{z}_i \sim \mathcal{N}(0, I)$. When $\epsilon \to 0$ and $K \to \infty$, under some regularity conditions $\mathbf{x}_K$ converges to a sample from the actual data distribution $p(\mathbf{x})$; the error between the two is negligible when $\epsilon$ is small enough and $K$ is large enough. This means Langevin dynamics can sample from exactly the same distribution as the original data!
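A minimal sketch of this Langevin sampling loop (PyTorch; `score_model` and the step settings are illustrative assumptions):

```python
import torch

def langevin_dynamics(score_model, x_init, eps=1e-3, n_steps=1000):
    """Unadjusted Langevin dynamics:
    x_{i+1} = x_i + eps * score(x_i) + sqrt(2 * eps) * z_i,  with z_i ~ N(0, I)."""
    x = x_init.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        with torch.no_grad():
            x = x + eps * score_model(x) + (2 * eps) ** 0.5 * z
    return x

# Usage sketch: start the chain from an arbitrary prior, e.g. a standard Gaussian.
# samples = langevin_dynamics(score_model, torch.randn(64, 2))
```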

[Animation: Langevin dynamics sampling using only the score function, http://yang-song.github.io/assets/img/score/langevin.gif]

4. The most basic score-based model and its problems

So far, we have discussed how to train a score-based model with score matching and how to sample from it with Langevin dynamics. However, this naive approach often does not work well in practice; this section focuses on its hidden pitfalls.

[Figure: low-quality samples produced by a naive score-based model]
This naive version fails visibly because of some problems with score matching that had not been carefully explored in earlier work.

A key challenge is that the estimated score function is inaccurate in low-density regions, where few data points are available. Recall that the score-based model is trained by minimizing the Fisher divergence between the true data score and the model output:

$$\mathbb{E}_{p(\mathbf{x})}\big[\|\nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x})\|_2^2\big] = \int p(\mathbf{x})\,\|\nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x})\|_2^2\,\mathrm{d}\mathbf{x}$$

Because the squared $\ell_2$ error is weighted by the data density $p(\mathbf{x})$, it is largely ignored in low-density regions, where few data points exist. As a result, the estimated scores in those regions are unreliable, leading to subpar results, as shown below:

[Figure: estimated scores are accurate only in high-density regions]
When sampling with Langevin dynamics, the initial sample is very likely to land in a low-density region (especially when the data lives in a high-dimensional space). An inaccurate score-based model will then derail the Langevin dynamics from the very beginning and prevent it from generating high-quality samples representative of the data distribution!

5. Score-based models with multiple noise perturbations

How do we get around the difficulty of accurate score estimation in low-density regions described in Part 4? The idea is to perturb the data points with noise and train our score-based model on the noisy data.

When the magnitude of the noise is large enough, it can populate low-density regions and thus improve the accuracy of the estimated scores. For example, the figure below shows the result of perturbing a Gaussian mixture with additional Gaussian noise:

[Figure: estimated scores of a noise-perturbed Gaussian mixture]

But this raises another question: how do we choose an appropriate noise magnitude for the perturbation? Larger noise covers more of the low-density regions and gives better score estimates, but it over-corrupts the data and moves the perturbed distribution away from the original one.

Smaller noise, on the other hand, causes less corruption of the original data distribution, but does not cover the low-density regions as well as we would like.

To get the best of both worlds, Dr. Song proposed using multiple scales of noise perturbation simultaneously [8, 9]. Suppose we always perturb the data with isotropic Gaussian noise of mean zero, and let there be $L$ noise scales with standard deviations in increasing order $\sigma_1 < \sigma_2 < \dots < \sigma_L$. First, we perturb the data distribution $p(\mathbf{x})$ with each noise scale:

$$p_{\sigma_i}(\mathbf{x}) = \int p(\mathbf{y})\,\mathcal{N}(\mathbf{x};\, \mathbf{y},\, \sigma_i^2 I)\,\mathrm{d}\mathbf{y}$$

Note that sampling from $p_{\sigma_i}(\mathbf{x})$ is easy: draw $\mathbf{x} \sim p(\mathbf{x})$ and compute $\mathbf{x} + \sigma_i \mathbf{z}$ with $\mathbf{z} \sim \mathcal{N}(0, I)$.

Second, we train a noise-conditional score-based model $\mathbf{s}_\theta(\mathbf{x}, i)$ (a Noise Conditional Score Network, NCSN) to estimate the score function of every noise-perturbed distribution $\nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x})$, so that:

$$\mathbf{s}_\theta(\mathbf{x}, i) \approx \nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x}) \qquad \text{for all } i = 1, 2, \dots, L$$

[Figure: perturbing the data with multiple scales of Gaussian noise and estimating the score function of each perturbed distribution]
The next step is straightforward. The training objective for $\mathbf{s}_\theta(\mathbf{x}, i)$ is a weighted sum of Fisher divergences over all $L$ noise scales. Specifically, we use the following objective function:

$$\sum_{i=1}^{L} \lambda(i)\,\mathbb{E}_{p_{\sigma_i}(\mathbf{x})}\big[\|\nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, i)\|_2^2\big]$$

The only remaining choice is the weighting function $\lambda(i)$; in Dr. Song's paper, $\lambda(i) = \sigma_i^2$. This objective can then be optimized with score matching, exactly as for the simplest score-based model $\mathbf{s}_\theta(\mathbf{x})$.
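A minimal sketch of this weighted objective with $\lambda(i) = \sigma_i^2$, written in denoising score matching form (PyTorch; the noise-conditional model `score_model(x, i)`, taking the noise index as an extra argument, is an assumed interface):

```python
import torch

def ncsn_loss(score_model, x, sigmas):
    """Weighted denoising score matching over L noise scales with lambda(i) = sigma_i^2.
    score_model(x, i) should approximate grad_x log p_{sigma_i}(x)."""
    loss = 0.0
    for i, sigma in enumerate(sigmas):
        z = torch.randn_like(x)
        x_tilde = x + sigma * z            # a sample from p_{sigma_i}
        target = -z / sigma                # score of the Gaussian perturbation kernel
        residual = score_model(x_tilde, i) - target
        loss = loss + sigma ** 2 * residual.flatten(start_dim=1).pow(2).sum(dim=-1).mean()
    return loss

# Usage sketch: geometrically spaced noise scales from small to large.
# sigmas = torch.logspace(-2, 0, steps=10)   # 0.01, ..., 1.0
```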

After training the noise-conditional score-based model $\mathbf{s}_\theta(\mathbf{x}, i)$, we can sample from it by running Langevin dynamics for $i = L, L-1, \dots, 1$ in sequence. This procedure is called annealed Langevin dynamics (defined in Algorithm 1 of [8]); it is called "annealed" because the noise magnitude is gradually reduced over the course of sampling.
[Figure: annealed Langevin dynamics gradually reduces the noise scale while sampling]
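A minimal sketch of annealed Langevin dynamics (illustrative; the step-size schedule $\alpha_i = \epsilon \cdot \sigma_i^2 / \sigma_L^2$ follows the spirit of Algorithm 1 in [8], but treat the exact constants as assumptions):

```python
import torch

def annealed_langevin_dynamics(score_model, x_init, sigmas, eps=2e-5, n_steps_each=100):
    """Run Langevin dynamics at each noise scale in turn, from the largest sigma to the smallest."""
    x = x_init.clone()
    for i in reversed(range(len(sigmas))):            # i = L-1, ..., 0 (largest noise first)
        alpha = eps * (sigmas[i] / sigmas[-1]) ** 2   # assumed step-size schedule
        for _ in range(n_steps_each):
            z = torch.randn_like(x)
            with torch.no_grad():
                x = x + alpha * score_model(x, i) + (2 * alpha) ** 0.5 * z
    return x
```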

Here are some practical suggestions for training a score-based generative model with multiple noise scales:

  • It is best to have hundreds or even thousands of levels of noise from low to high.
  • Use a U-Net style architecture for the score model.
  • At test time, apply an exponential moving average (EMA) of the model weights.

[Animation]

Annealed Langevin dynamics for the Noise Conditional Score Network (NCSN) model [8] trained on CelebA: starting from pure noise, the images are modified according to the scores until clean samples emerge. This method achieved the state-of-the-art Inception score on CIFAR-10 at the time.

Following the recipes above, we can generate high-quality image samples comparable to GANs, as shown below:

[Figure: generated samples]

6. Score-based generative modeling with stochastic differential equations (SDEs)

From the previous discussion, we know that perturbing the data with noise at many levels and scales is key to the success of score-based generative model training. If we push the number of noise scales to infinity, we obtain the most powerful framework to date built on score-based generative models. It not only generates higher-quality samples, but also supports exact log-likelihood computation, faster sampling, representation learning with better-disentangled features, and inverse problem solving.

Dr. Song provides a Google Colab notebook that walks through training a score-based model on MNIST step by step; there are also more sophisticated codebases for more complex tasks.

6.1 Perturbing data with SDEs (stochastic differential equations)

As the number of noise scales approaches infinity, we are essentially perturbing the data with a continuously growing amount of noise. In this case, the noise perturbation procedure is a continuous-time stochastic process.

[Animation: perturbing data with a continuous-time stochastic process]

As the GIF from [1] shows, more and more of the information in the original image is destroyed as the stochastic process progresses.

So how can we represent this stochastic process in a concise way? Many stochastic processes (diffusion processes in particular) are solutions of stochastic differential equations (SDEs). In general, an SDE has the following form:

$$\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}$$

Here $\mathbf{f}(\cdot, t): \mathbb{R}^d \rightarrow \mathbb{R}^d$ is the drift coefficient, $g(t) \in \mathbb{R}$ is the diffusion coefficient, $\mathbf{w}$ denotes standard Brownian motion, and $\mathrm{d}\mathbf{w}$ can be viewed as infinitesimal white noise. The solution of this stochastic differential equation is a collection of continuous random variables $\{\mathbf{x}(t)\}_{t \in [0, T]}$, which trace stochastic trajectories as the time index $t$ runs from $0$ to $T$.

Let $p_t(\mathbf{x})$ denote the marginal probability density function of $\mathbf{x}(t)$. Here $t \in [0, T]$ is analogous to the noise index $i = 1, 2, \dots, L$ of the finite case, and $p_t(\mathbf{x})$ is analogous to $p_{\sigma_i}(\mathbf{x})$. In particular, $p_0(\mathbf{x}) = p(\mathbf{x})$ is the original data distribution (no noise perturbation).

After perturbing $p(\mathbf{x})$ with the stochastic process for a sufficiently long time $T$, $p_T(\mathbf{x})$ becomes a simple noise distribution, which we call the prior distribution; it is analogous to $p_{\sigma_L}(\mathbf{x})$ in the finite-noise-scale setting.

There are many ways to perturb data, and choosing to do so with an SDE is nothing special. For example, the following SDE perturbs the data with Gaussian noise of mean zero and exponentially growing variance, analogous to perturbing with $\mathcal{N}(0, \sigma_1^2 I), \mathcal{N}(0, \sigma_2^2 I), \dots, \mathcal{N}(0, \sigma_L^2 I)$ in the finite case:

$$\mathrm{d}\mathbf{x} = e^{t}\,\mathrm{d}\mathbf{w}$$

Therefore, the choice of SDE should be viewed as a model hyperparameter, much like $\{\sigma_1, \sigma_2, \dots, \sigma_L\}$. For image generation tasks, Dr. Song's work provides three SDEs that are well suited to this domain.
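For reference, here is a minimal sketch of the two helper functions used by the sampler code later in this post, assuming (as in Dr. Song's MNIST Colab) a perturbation SDE of the form $\mathrm{d}\mathbf{x} = \sigma^t\,\mathrm{d}\mathbf{w}$; treat the exact formulas as assumptions tied to that particular choice.

```python
import numpy as np
import torch

sigma = 25.0  # assumed noise-scale hyperparameter, as in the Colab example

def marginal_prob_std(t, sigma=sigma):
    """Standard deviation of the perturbation kernel p_{0t}(x(t) | x(0)) for dx = sigma^t dw.
    Integrating the variance of the injected noise gives std(t) = sqrt((sigma^(2t) - 1) / (2 ln sigma))."""
    return torch.sqrt((sigma ** (2 * t) - 1.0) / (2.0 * np.log(sigma)))

def diffusion_coeff(t, sigma=sigma):
    """Diffusion coefficient g(t) = sigma^t of the SDE dx = sigma^t dw."""
    return sigma ** t

# Perturbing data then reduces to x(t) = x(0) + marginal_prob_std(t) * z, with z ~ N(0, I).
```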

6.2 Reversing the SDE to generate samples

Recall that annealed Langevin dynamics samples from each noise-perturbed distribution in sequence using Langevin dynamics. For our SDE formulation (infinitely many noise scales), a similar idea applies: we reverse the perturbation process to turn noise back into data.

[Animation: reversing the perturbation process generates data from noise]

Importantly, any SDE of the above form can be reversed, and the corresponding reverse SDE has an explicit closed form:

$$\mathrm{d}\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g^2(t)\,\nabla_\mathbf{x} \log p_t(\mathbf{x})\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}$$

Here $\mathrm{d}t$ denotes a negative infinitesimal time step, since the reverse SDE is solved backwards in time (from $t = T$ to $t = 0$). To solve it, we need to estimate $\nabla_\mathbf{x} \log p_t(\mathbf{x})$, which is exactly the score function of $p_t(\mathbf{x})$.

[Figure: solving the reverse SDE turns noise back into data, yielding a score-based generative model]

6.3 Estimating the reverse SDE with score-based models and score matching

As mentioned in 6.2, we need to estimate $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ in order to solve the reverse SDE and recover the image, audio, or other data that existed before the noise perturbation. To this end, we train a Time-Dependent Score-Based Model $\mathbf{s}_\theta(\mathbf{x}, t)$ such that $\mathbf{s}_\theta(\mathbf{x}, t) \approx \nabla_\mathbf{x} \log p_t(\mathbf{x})$. This is the continuous-time analogue of the noise-conditional score-based model $\mathbf{s}_\theta(\mathbf{x}, i)$.

The training objective for $\mathbf{s}_\theta(\mathbf{x}, t)$ is straightforward: a continuous, weighted mixture of Fisher divergences,

$$\mathbb{E}_{t \sim \mathcal{U}(0, T)}\,\mathbb{E}_{p_t(\mathbf{x})}\big[\lambda(t)\,\|\nabla_\mathbf{x} \log p_t(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, t)\|_2^2\big]$$

where $\mathcal{U}(0, T)$ denotes the uniform distribution on $[0, T]$ and $\lambda: \mathbb{R} \rightarrow \mathbb{R}_{>0}$ is a positive weighting function over time.

When the weighting is chosen as $\lambda(t) = g^2(t)$, there is a remarkable connection between this weighted mixture of Fisher divergences and the KL divergence:

[equation: the connection between the KL divergence and the weighted mixture of Fisher divergences]

Here, $p_t$ and $q_t$ denote the distributions of $\mathbf{x}_t$ when $\mathbf{x}(0) \sim p_0$ and $\mathbf{x}(0) \sim q_0$, respectively.

Because of this special connection between the KL divergence and the Fisher divergence, and the equivalence between minimizing KL divergence and maximum likelihood estimation, $\lambda(t) = g^2(t)$ is called the likelihood weighting function.

As discussed before, our objective function, the mixture of Fisher divergences, can be optimized efficiently with score matching methods such as denoising score matching [17] and sliced score matching [14].
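A minimal sketch of this training loss in denoising score matching form, modelled on the structure of the loss in Dr. Song's Colab (the image-shaped broadcasting and the `marginal_prob_std` helper follow the assumptions sketched in 6.1):

```python
import torch

def loss_fn(score_model, x, marginal_prob_std, eps=1e-5):
    """Denoising score matching for the time-dependent score model (a sketch).
    With lambda(t) = std(t)^2, the weighted residual simplifies to ||std(t) * s_theta(x_t, t) + z||^2."""
    t = torch.rand(x.shape[0], device=x.device) * (1.0 - eps) + eps  # t ~ U(eps, 1)
    z = torch.randn_like(x)
    std = marginal_prob_std(t)[:, None, None, None]
    perturbed_x = x + std * z                      # sample from p_t(x | x(0))
    score = score_model(perturbed_x, t)
    return torch.mean(torch.sum((score * std + z) ** 2, dim=(1, 2, 3)))
```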

Once the score-based model is trained, we can plug it into the reverse SDE and use it for data sampling.

[Figure: overview of score-based generative modeling with SDEs]

6.4 How to solve the reverse SDE

With numerical SDE solvers, we can simulate the reverse SDE and thus the reverse stochastic process used to generate samples. Perhaps the simplest numerical SDE solver is the Euler-Maruyama method. Applied to our reverse SDE, it discretizes the SDE with finite time steps and small Gaussian noise: choose a small negative time step $\Delta t \approx 0$, initialize $t = T$, and then iterate as follows until $t \approx 0$:

$$\Delta\mathbf{x} \leftarrow \big[\mathbf{f}(\mathbf{x}, t) - g^2(t)\,\mathbf{s}_\theta(\mathbf{x}, t)\big]\,\Delta t + g(t)\,\sqrt{|\Delta t|}\,\mathbf{z}_t, \qquad \mathbf{x} \leftarrow \mathbf{x} + \Delta\mathbf{x}, \qquad t \leftarrow t + \Delta t$$

Here $\mathbf{z}_t \sim \mathcal{N}(0, I)$. The Euler-Maruyama update is very similar in spirit to Langevin dynamics: both update $\mathbf{x}$ by following the score function, perturbed with Gaussian noise.

Besides the Euler-Maruyama method, other numerical solvers can be applied directly to the reverse SDE, such as the Milstein method [18] and stochastic Runge-Kutta methods [19]. In Dr. Song's ICLR 2021 paper, a reverse diffusion solver is also proposed; it is similar to Euler-Maruyama but better tailored to solving reverse-time SDEs.

For our reverse SDE, two special properties allow more flexible sampling:

  • ① We have an estimate of the score $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ at any time $t$ through our time-dependent score-based model $\mathbf{s}_\theta(\mathbf{x}, t)$.
  • ② We only care about sampling from each marginal distribution $p_t(\mathbf{x})$; samples at different time steps do not have to lie on the same trajectory of the reverse SDE.

Based on these two properties, we can apply Markov chain Monte Carlo methods to fine-tune the trajectories obtained from a numerical SDE solver. Specifically, Dr. Song proposed Predictor-Corrector samplers: the predictor is a numerical SDE solver step that predicts the sample at the next time step, and the corrector is an MCMC procedure (such as Langevin dynamics) that refines the sample using the estimated score.
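A minimal sketch of one Predictor-Corrector iteration for the VE-type SDE assumed above (the Langevin corrector with a signal-to-noise-based step size is a common heuristic; treat the exact constants as assumptions rather than the paper's definitive algorithm):

```python
import torch

def pc_step(score_model, x, t, step_size, diffusion_coeff, snr=0.16):
    """One Predictor-Corrector iteration (sketch).
    Corrector: a Langevin MCMC step using the current score estimate.
    Predictor: an Euler-Maruyama step of the reverse SDE dx = -g(t)^2 * score dt + g(t) dw,
    where f(x, t) = 0 for the VE-type SDE assumed here."""
    # --- Corrector (Langevin dynamics) ---
    grad = score_model(x, t)
    grad_norm = grad.reshape(grad.shape[0], -1).norm(dim=-1).mean()
    noise_norm = float(x[0].numel()) ** 0.5
    langevin_eps = 2 * (snr * noise_norm / grad_norm) ** 2   # assumed step-size heuristic
    x = x + langevin_eps * grad + (2 * langevin_eps) ** 0.5 * torch.randn_like(x)

    # --- Predictor (Euler-Maruyama / reverse diffusion) ---
    g = diffusion_coeff(t)[:, None, None, None]
    x_mean = x + (g ** 2) * score_model(x, t) * step_size
    x = x_mean + g * step_size ** 0.5 * torch.randn_like(x)
    return x, x_mean  # x_mean is the noise-free estimate, used at the final step
```

In a full sampler this step is run under `torch.no_grad()` for `t` decreasing from 1 to a small `eps`, exactly like the Euler-Maruyama loop in the MNIST code below.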

Comparing against the MNIST code below, the $\sigma^{2t}$ that appears in the sampling process is exactly $g(t)^2$. Taking 500 iterations as an example, each line of the loop can be matched one-to-one with the Euler-Maruyama formula above to recover the data from before the perturbation.

num_steps = 500  # @param {'type':'integer'}
def Euler_Maruyama_sampler(score_model, 
                           marginal_prob_std,
                           diffusion_coeff, 
                           batch_size=64, 
                           num_steps=num_steps, 
                           device='cuda', 
                           eps=1e-3):
  """Generate samples from score-based models with the Euler-Maruyama solver.

  Args:
    score_model: A PyTorch model that represents the time-dependent score-based model.
    marginal_prob_std: A function that gives the standard deviation of
      the perturbation kernel.
    diffusion_coeff: A function that gives the diffusion coefficient of the SDE.
    batch_size: The number of samplers to generate by calling this function once.
    num_steps: The number of sampling steps. 
      Equivalent to the number of discretized time steps.
    device: 'cuda' for running on GPUs, and 'cpu' for running on CPUs.
    eps: The smallest time step for numerical stability.
  
  Returns:
    Samples.    
  """
  t = torch.ones(batch_size, device=device)
  init_x = torch.randn(batch_size, 1, 28, 28, device=device) \
    * marginal_prob_std(t)[:, None, None, None]
  time_steps = torch.linspace(1., eps, num_steps, device=device)
  step_size = time_steps[0] - time_steps[1]
  x = init_x
  with torch.no_grad():
    for time_step in tqdm.notebook.tqdm(time_steps):      
      batch_time_step = torch.ones(batch_size, device=device) * time_step
      g = diffusion_coeff(batch_time_step)  # g(t): diffusion coefficient of the SDE.
      mean_x = x + (g**2)[:, None, None, None] * score_model(x, batch_time_step) * step_size
      x = mean_x + torch.sqrt(step_size) * g[:, None, None, None] * torch.randn_like(x)      
  # Do not include any noise in the last sampling step.
  return mean_x


With Predictor-Corrector sampling and better score-network architectures, Dr. Song's method achieved SOTA sample quality on CIFAR-10, with results even more impressive than those of StyleGAN2!

References

[1] Generative Modeling by Estimating Gradients of the Data Distribution (blog post), Yang Song, 2021-05-05
[2] The neural autoregressive distribution estimator
[3] NICE: Non-linear independent components estimation
[4] A Tutorial on Energy-Based Learning
[5] Auto-encoding variational bayes
[6] Unrolled Generative Adversarial Networks
[7] A kernelized Stein discrepancy for goodness-of-fit tests
[8] Generative Modeling by Estimating Gradients of the Data Distribution
[9] Improved Techniques for Training Score-Based Generative Models
[10] Learning Gradient Fields for Shape Generation
[11] Maximum likelihood estimation
[12] Estimation of non-normalized statistical models by score matching
[13] A connection between score matching and denoising autoencoders
[14] Sliced score matching: A scalable approach to density and score estimation
[15] Correlation functions and computer simulations, 1981, G. Parisi.
[16] Representations of knowledge in complex systems, 1994, U. Grenander, M.I. Miller.
[17] A connection between score matching and denoising autoencoders
[18] Milstein method
[19] Runge–Kutta method (SDE)

Origin: blog.csdn.net/g11d111/article/details/118026427