Reading notes: Generative Modeling by Estimating Gradients of the Data Distribution

Overview

The paper proposes a score-based generative model and applies it to image generation.
It first reviews traditional score-based generative modeling, then analyzes the problems of the traditional approach, and finally proposes an algorithm that solves them: the noise conditional score network (NCSN).

Introduction to traditional score-based generative modeling

Assume the data in the dataset follow the distribution $p_{data}(\mathbf{x})$.
The goal of generative modeling is to learn a model that can generate new samples from $p_{data}(\mathbf{x})$.
The score function of a probability density $p(\mathbf{x})$ is defined as $\nabla_\mathbf{x}\log p(\mathbf{x})$.
A score network is a neural network $s_\theta$ with parameters $\theta$ that attempts to approximate the score function.
Score-based generative modeling generates new samples from the data distribution in two steps: learn the score function (score matching), then sample with Langevin dynamics.

score matching

Using the score matching algorithm, we can directly train a score network $s_\theta(\mathbf{x})$ to estimate $\nabla_\mathbf{x}\log p_{data}(\mathbf{x})$ without first modeling $p_{data}(\mathbf{x})$ itself. The advantage is that the intractable normalization constant of the probability density can be avoided. For details, see the introduction to the score matching algorithm.
The optimization objective of score matching is
$$\frac{1}{2}\mathbb{E}_{p_{data}}\big[\|\mathbf{s}_\theta(\mathbf{x})-\nabla_\mathbf{x}\log p_{data}(\mathbf{x})\|^2_2\big]$$
This objective requires $\nabla_\mathbf{x}\log p_{data}(\mathbf{x})$, which is a non-parametric estimation problem and hard to compute. Fortunately, the objective is equivalent (up to a constant) to
$$\mathbb{E}_{p_{data}}\big[\text{tr}(\nabla_\mathbf{x}\mathbf{s}_\theta(\mathbf{x}))+\tfrac{1}{2}\|\mathbf{s}_\theta(\mathbf{x})\|^2_2\big]$$
Minimizing this yields $\mathbf{s}_\theta(\mathbf{x})$; in practice the expectation is replaced by the sample average.
However, computing $\text{tr}(\nabla_\mathbf{x}\mathbf{s}_\theta(\mathbf{x}))$ is very expensive for high-dimensional data. Denoising score matching and sliced score matching are two commonly used improvements for high-dimensional data; a sketch of the sliced variant follows.
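Below is a minimal PyTorch sketch of sliced score matching, which replaces the expensive trace term with random projections $\mathbf{v}^\top\nabla_\mathbf{x}\mathbf{s}_\theta(\mathbf{x})\,\mathbf{v}$. The `score_net` interface (a network mapping a batched $\mathbf{x}$ to a score of the same shape) is an assumption for illustration, not the authors' exact implementation.

```python
import torch

def sliced_score_matching_loss(score_net, x, n_projections=1):
    """Sliced score matching sketch: estimate tr(grad_x s_theta(x)) with
    random projections v^T J v, avoiding the full Jacobian.
    Assumes x is a batched tensor and score_net(x) has the same shape as x."""
    x = x.detach().requires_grad_(True)
    total = 0.0
    for _ in range(n_projections):
        v = torch.randn_like(x)                              # random projection direction
        s = score_net(x)                                     # s_theta(x)
        grad_sv = torch.autograd.grad((s * v).sum(), x, create_graph=True)[0]
        trace_est = (grad_sv * v).flatten(1).sum(dim=1)      # v^T J v, per sample
        norm_term = 0.5 * (s ** 2).flatten(1).sum(dim=1)     # 1/2 * ||s_theta(x)||^2
        total = total + (trace_est + norm_term).mean()
    return total / n_projections
```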

Langevin dynamics

Langevin dynamics is a Markov chain Monte Carlo (MCMC) method that can sample from the probability density $p_{data}(\mathbf{x})$ using only the score function $\nabla_\mathbf{x}\log p(\mathbf{x})$.
Given an initial distribution $\tilde{\mathbf{x}}_0\sim\pi(\mathbf{x})$ and a fixed step size $\epsilon>0$, the Langevin method repeats the following update:
$$\tilde{\mathbf{x}}_t=\tilde{\mathbf{x}}_{t-1}+\frac{\epsilon}{2}\nabla_\mathbf{x}\log p(\tilde{\mathbf{x}}_{t-1})+\sqrt{\epsilon}\,\mathbf{z}_t$$ where $\mathbf{z}_t\sim\mathcal{N}(0,\mathbf{I})$. As $\epsilon\rightarrow 0$ and $T\rightarrow\infty$, the distribution of $\tilde{\mathbf{x}}_T$ converges to $p_{data}(\mathbf{x})$.
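As a concrete illustration, here is a minimal sketch of this update loop, assuming a hypothetical `score_fn` that returns $\nabla_\mathbf{x}\log p(\mathbf{x})$ for a batch of points:

```python
import torch

def langevin_dynamics(score_fn, x0, eps=1e-4, n_steps=1000):
    """Unadjusted Langevin dynamics:
    x_t = x_{t-1} + (eps/2) * score(x_{t-1}) + sqrt(eps) * z_t,  z_t ~ N(0, I).
    As eps -> 0 and n_steps -> infinity, x_T approaches a sample from p_data."""
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * eps * score_fn(x) + eps ** 0.5 * z
    return x
```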

Problems with traditional score-based generative modeling

Problems with the manifold hypothesis

The manifold hypothesis states that data in the real world tend to concentrate on low-dimensional manifolds embedded in a high-dimensional space (a.k.a., the ambient space).

Under the manifold assumption, score-based generative models have two problems:

  1. $\nabla_\mathbf{x}\log p_{data}(\mathbf{x})$ is undefined when $\mathbf{x}$ is confined to a low-dimensional manifold.
  2. The score matching estimator is consistent only when the support of the data distribution is the whole space.

Problems with low-density areas


  1. In low-density regions of the data, there are not enough samples to learn the score function accurately.
  2. When two modes of the data distribution are separated by a low-density region, Langevin dynamics cannot recover the relative weights of the two modes in reasonable time and may not converge to the true distribution. For example, suppose $p_{data}(\mathbf{x})=\pi p_{1}(\mathbf{x})+(1-\pi)p_{2}(\mathbf{x})$, where $p_{1}$ and $p_{2}$ have disjoint supports. A short derivation (below) shows that the weight $\pi$ does not affect the score function.
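To see why, note that on the support of $p_1$ (where $p_2(\mathbf{x})=0$),
$$\nabla_\mathbf{x}\log p_{data}(\mathbf{x})=\nabla_\mathbf{x}\big(\log\pi+\log p_{1}(\mathbf{x})\big)=\nabla_\mathbf{x}\log p_{1}(\mathbf{x}),$$
and similarly $\nabla_\mathbf{x}\log p_{data}(\mathbf{x})=\nabla_\mathbf{x}\log p_{2}(\mathbf{x})$ on the support of $p_2$, so the score carries no information about $\pi$, and samples produced from it cannot reflect the correct mode weights.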

Noise Conditional Score Network

To solve the above problems, the author improves traditional score-based generative modeling.
The author proposes to 1) perturb the data with noise of various levels; 2) use a single conditional score network to estimate the scores corresponding to all noise levels simultaneously.
After the conditional score network is trained, samples are generated with Langevin dynamics by first using the scores corresponding to large noise and then gradually reducing the noise level. This smoothly transfers the benefits of large noise to small noise, where the perturbed data are almost indistinguishable from the original data.

The principle is as follows: by adding noise, the data fill the low-density regions, which improves the accuracy of the estimated scores.
Larger noise covers more of the low-density regions and thus gives better score estimates, but it over-corrupts the data and changes the original distribution significantly. Smaller noise corrupts the original data distribution less, but does not cover the low-density regions as much as we would like. Therefore, the author proposes to perturb the data with noise at multiple scales.

Noise Conditional Score Networks

Let $\{\sigma_i\}_{i=1}^L$ be a sequence of noise levels satisfying $\frac{\sigma_{1}}{\sigma_{2}}=\cdots=\frac{\sigma_{L-1}}{\sigma_{L}}>1$, and let $q_\sigma(\mathbf{x})\triangleq\int p_{data}(\mathbf{t})\,\mathcal{N}(\mathbf{x}\mid\mathbf{t},\sigma^2\mathbf{I})\,d\mathbf{t}$ be the data distribution after noise perturbation. We want to learn a noise conditional score network $s_\theta(\mathbf{x},\sigma)$ that estimates the score of the perturbed data, i.e., $s_\theta(\mathbf{x},\sigma)\approx\nabla_\mathbf{x}\log q_\sigma(\mathbf{x})$. Note that this is a conditional score network: compared with the traditional $s_\theta(\mathbf{x})$, its input has an extra $\sigma$.
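For intuition, such a geometric sequence of noise levels with a constant ratio can be generated as follows; the endpoint values and $L$ here are illustrative assumptions, not necessarily the paper's settings.

```python
import numpy as np

# Geometric sequence sigma_1 > sigma_2 > ... > sigma_L with a constant ratio
# sigma_i / sigma_{i+1} > 1. Endpoints and L are illustrative choices.
sigma_1, sigma_L, L = 1.0, 0.01, 10
sigmas = np.geomspace(sigma_1, sigma_L, num=L)
print(sigmas)                    # 1.0, 0.599, 0.359, ..., 0.0167, 0.01
print(sigmas[:-1] / sigmas[1:])  # constant ratio of about 1.67
```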
The author considers the problem of image generation, so for the architecture of $s_\theta(\mathbf{x},\sigma)$ the author chose a U-Net.

For the training of the noise conditional score network, the author chooses the noise distribution
$$q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x})=\mathcal{N}(\tilde{\mathbf{x}}\mid\mathbf{x},\sigma^{2}\mathbf{I})$$
For a given noise level $\sigma$, the denoising score matching objective is
$$\ell(\theta,\sigma)=\frac{1}{2}\mathbb{E}_{p_{data}(\mathbf{x})}\mathbb{E}_{\tilde{\mathbf{x}}\sim\mathcal{N}(\mathbf{x},\sigma^{2}\mathbf{I})}\Big[\Big\|s_\theta(\tilde{\mathbf{x}},\sigma)+\frac{\tilde{\mathbf{x}}-\mathbf{x}}{\sigma^2}\Big\|_2^2\Big]$$
Combining the objectives over all noise levels gives the unified training objective
$$\mathcal{L}(\theta;\{\sigma_i\}_{i=1}^L)\triangleq\frac{1}{L}\sum_{i=1}^L\lambda(\sigma_i)\,\ell(\theta,\sigma_i)$$
where $\lambda(\sigma_i)>0$ is a per-level weight; the author chooses $\lambda(\sigma)=\sigma^2$.
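The following is a minimal PyTorch sketch of this training loss, under a few assumptions: `score_net(x_noisy, idx)` is a hypothetical interface taking the noisy batch and per-sample noise-level indices and returning a tensor shaped like $\mathbf{x}$, `sigmas` is a tensor sorted from large to small, and one noise level is sampled per example as a stochastic approximation of the average over all $L$ levels.

```python
import torch

def ncsn_loss(score_net, x, sigmas):
    """Multi-level denoising score matching loss with lambda(sigma) = sigma^2.
    The score_net interface and per-example level sampling are assumptions."""
    idx = torch.randint(0, len(sigmas), (x.shape[0],), device=x.device)
    sigma = sigmas[idx].view(-1, *([1] * (x.dim() - 1)))      # sigma broadcastable to x
    x_noisy = x + torch.randn_like(x) * sigma                 # x~ ~ N(x, sigma^2 I)
    target = -(x_noisy - x) / sigma ** 2                      # grad_x~ log q_sigma(x~ | x)
    score = score_net(x_noisy, idx)                           # s_theta(x~, sigma)
    sq_err = ((score - target) ** 2).flatten(1).sum(dim=1)    # squared norm per sample
    return 0.5 * (sigmas[idx] ** 2 * sq_err).mean()           # lambda(sigma) = sigma^2
```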

annealed Langevin dynamics

After the noise conditional score network $s_\theta(\mathbf{x},\sigma)$ has been trained, the author proposes the annealed Langevin dynamics algorithm to generate samples (Algorithm 1 in the paper): the scores corresponding to large noise are used first, and the noise is then gradually reduced. A sketch is given below.
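A minimal sketch of annealed Langevin dynamics, assuming the same hypothetical `score_net(x, idx)` interface as above and a `sigmas` tensor sorted from large to small; the step size at level $i$ is scaled as $\alpha_i=\epsilon\cdot\sigma_i^2/\sigma_L^2$, as in Algorithm 1.

```python
import torch

def annealed_langevin_dynamics(score_net, x0, sigmas, eps=2e-5, T=100):
    """Run T Langevin steps at each noise level, from the largest sigma to the
    smallest, using the samples of one level to initialise the next."""
    x = x0.clone()
    for i, sigma in enumerate(sigmas):                # sigmas sorted large -> small
        alpha = eps * (sigma / sigmas[-1]) ** 2       # alpha_i = eps * sigma_i^2 / sigma_L^2
        idx = torch.full((x.shape[0],), i, dtype=torch.long, device=x.device)
        for _ in range(T):
            z = torch.randn_like(x)
            x = x + 0.5 * alpha * score_net(x, idx) + alpha ** 0.5 * z
    return x
```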

References

Yang Song's blog: "Generative Modeling by Estimating Gradients of the Data Distribution"
NeurIPS 2019: "Generative Modeling by Estimating Gradients of the Data Distribution"
