Generative Adversarial Network (GAN) Principle Derivation and Network Construction Ideas

0 Preface

Imagine this scenario: you own a studio that produces forgeries of famous paintings, while the genuine paintings, created by the old masters, are kept in a collection room. Your forgeries are appraised by connoisseurs alongside the originals, and your ultimate goal is to become a master forger whose fakes pass for the real thing. The road there is bumpy: the first task is to make the fake indistinguishable from the genuine. Fooling a novice is easy, after all, you can pass a fake off as real to an untrained eye, but deceiving an authoritative appraisal expert is very hard. And since your goal is mastery, you will not settle for fooling novices. So you find someone who is determined to become an appraisal expert, and you ask him to examine your forgeries together with genuine paintings. He tells you his verdicts, so you learn whether your paintings were spotted and can refine your craft accordingly; in turn, you tell him which paintings were actually fakes, so his appraisal skill keeps improving as well.


In this example, the forger who imitates famous paintings corresponds to the generator, which captures the underlying distribution of real data samples and generates new samples; the appraisal expert who judges whether a painting is genuine corresponds to the discriminator, which judges whether its input is real data or a generated sample.

1 Introduction

Generative Adversarial Nets (GAN) were proposed by Ian J. Goodfellow et al. in 2014. The basic idea comes from the two-player zero-sum game of game theory: a GAN consists of a generator and a discriminator, trained through adversarial learning. The goal is to estimate the underlying distribution of the data samples and to generate new samples from it. Optimizing a GAN is a minimax game whose target is a Nash equilibrium, at which point the generator has recovered the data distribution. GANs are widely studied in image and visual computing, speech and language processing, information security, game playing, and other fields, and have broad application prospects.

2 Basic principles

The core idea of GAN comes from the Nash equilibrium of game theory. The two players in the game are a generator (Generator) and a discriminator (Discriminator): the generator tries to learn the real data distribution as closely as possible, while the discriminator tries to determine correctly whether its input comes from the real data or from the generator. To win the game, the two participants must continuously optimize their generation and discrimination abilities, respectively; this learning and optimization process amounts to finding a Nash equilibrium between the two.

The computation process and structure of GAN are shown in the figure above. Any differentiable function can serve as the generator or the discriminator, so we represent the discriminator and the generator by differentiable functions $D$ and $G$; their inputs are real data $x$ and a random variable $z$, respectively. $G$ maps the random variable $z$ (drawn, e.g., from a Gaussian distribution, denoted $p_z$) to $G(z)$, whose probability distribution, usually denoted $p_g$, should be as close as possible to the real data distribution $p_{data}$.

For the discriminator $D$: it outputs 1 if the input comes from real data, and 0 if the input is $G(z)$. The goal of $D$ is thus binary classification of the data source:

  • Real: drawn from the real data distribution, $x \sim p_{data}$;
  • Fake: produced by the generator, $G(z) \sim p_g$.

The goal of $G$ is to make the discriminator's output on its fake data, $D(G(z))$, agree with its output on real data, $D(x)$.

$D$ and $G$ compete against each other and are optimized iteratively, so the performance of both keeps improving. When $D$'s discriminative ability has improved to a high level (it has become an appraisal expert) yet it still cannot identify the data source correctly, we can consider that the generator $G$ has learned the distribution of the real data (it has become a master forger).

3 Objective function

Following Section 2, the discriminator's goal is: if the input $x$ comes from $p_{data}$, then $D(x)$ should be as large as possible; if the input comes from $p_g$, then $1-D(G(z))$ should be as large as possible. To make the objective easier to work with, take the logarithm of both, giving $\log D(x)$ and $\log(1-D(G(z)))$. Combining the two terms:
$$\max_D \; E_{x\sim p_{data}}\big[\log D(x)\big] + E_{z\sim p_z}\big[\log(1-D(G(z)))\big]$$
In practice each expectation is approximated by a sample mean, e.g. $E_{x\sim p_{data}}[\log D(x)] \approx \frac{1}{N}\sum_{i=1}^N \log D(x_i)$.
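As a sketch of this sample-mean approximation (the logistic stand-in discriminator and the choice $p_{data} = N(2,1)$ are illustrative assumptions, not part of the original derivation):

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Toy stand-in discriminator: a fixed logistic function, for illustration only.
    return 1.0 / (1.0 + np.exp(-x))

# Monte Carlo estimate of E_{x~p_data}[log D(x)], with p_data = N(2, 1):
# (1/N) * sum_i log D(x_i)
x_real = rng.normal(loc=2.0, scale=1.0, size=100_000)
estimate = float(np.mean(np.log(D(x_real))))
```

Since $D(x) < 1$ everywhere, each $\log D(x_i)$ is negative, so the estimate is a negative number close to the true expectation.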

The goal of the generator is for its output $G(z)$ to be recognized as real data by the discriminator $D$ as often as possible, i.e. $D(G(z))$ should be as large as possible, or equivalently $1-D(G(z))$ as small as possible. Taking the logarithm as before, the objective is:
$$\min_G \; E_{z\sim p_z}\big[\log(1-D(G(z)))\big]$$
From the two objective functions, $D$ and $G$ are in a relationship where one's gain is exactly the other's loss: together they form a zero-sum game.

Recall the example in Section 0: we must fool not just a novice but the appraisal expert. In terms of $D$ and $G$, the fake samples generated by $G$ must fool a top-notch $D$. In the GAN framework, $D$ is only a tool; the ultimate goal is $G$. This yields the overall optimization objective of GAN, a minimax problem, described as follows:
$$\min_G \max_D \; E_{x\sim p_{data}}\big[\log D(x)\big] + E_{z\sim p_z}\big[\log(1-D(G(z)))\big]$$
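The two players' objectives share a single value function $V(D,G)$. A minimal sketch evaluating it from samples (the Gaussian choices for $p_{data}$ and $p_z$, the logistic discriminator, and the identity generator are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def V(D, G, n=100_000):
    """Monte Carlo estimate of E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]."""
    x = rng.normal(2.0, 1.0, size=n)   # real samples: p_data = N(2, 1)
    z = rng.normal(0.0, 1.0, size=n)   # noise: p_z = N(0, 1)
    return float(np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z)))))

D = lambda x: 1.0 / (1.0 + np.exp(-x))  # toy logistic discriminator
G = lambda z: z                          # toy generator: identity map

v = V(D, G)
```

The discriminator tries to push this value up, the generator to push it down; both terms are logarithms of probabilities, so $V$ is always negative.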

4 Existence of Global Optimal Solution

In the previous section we constructed the overall objective function of GAN by intuition; the global optimum should be reached at $p_{data} = p_g$ (i.e., the fake samples generated by $G$ are indistinguishable from real samples).

To train a GAN, we train the discriminator $D$ to maximize its accuracy in distinguishing real data $x$ from generated data $G(z)$, while training the generator $G$ to minimize $\log(1-D(G(z)))$. Alternating optimization is used: first fix the generator $G$ and optimize the discriminator $D$ so that its discriminative accuracy is maximized; then fix the discriminator $D$ and optimize the generator $G$ so that $D$'s discriminative accuracy is minimized. The global optimum is reached if and only if $p_{data} = p_g$.

In practice, within one round of updates, the parameters of $D$ are generally updated $k$ times for each single update of the parameters of $G$.

Next we verify that $p_{data} = p_g$ holds at the global optimum. Since for $z \sim p_z$ the output $G(z)$ is distributed exactly as $x \sim p_g$, we can rewrite the value function as
$$V(D,G) = E_{x\sim p_{data}}\big[\log D(x)\big] + E_{x\sim p_g}\big[\log(1-D(x))\big]$$
First, for a given generator $G$, consider optimizing only the discriminator $D$, i.e. $\max_D V(D,G)$, where
$$\begin{aligned} V(D,G) &= E_{x\sim p_{data}}\big[\log D(x)\big] + E_{x\sim p_g}\big[\log(1-D(x))\big]\\ &= \int p_{data}\,\log D \,dx + \int p_g\,\log(1-D)\,dx\\ &= \int \big[\, p_{data}\,\log D + p_g\,\log(1-D)\,\big]\,dx \end{aligned}$$

The step above uses basic probability theory; a brief review:

Let the continuous random variable $X$ have probability density $f(x)$. If the integral $\int_{-\infty}^{+\infty} x f(x)\,dx$ converges absolutely, it is called the mathematical expectation (or mean) of $X$, denoted $E(X)$:
$$E(X) = \int_{-\infty}^{+\infty} x f(x)\,dx$$
Similarly, if the integral $\int_{-\infty}^{+\infty} g(x) f(x)\,dx$ converges absolutely, it is called the mathematical expectation (or mean) of the random variable $Y = g(X)$:
$$E(Y) = E\big[g(X)\big] = \int_{-\infty}^{+\infty} g(x) f(x)\,dx$$
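This rule ("expectation of a function equals the density-weighted integral") can be checked numerically; a sketch with the illustrative choices $X \sim N(0,1)$ and $g(x) = x^2$, where the true value is $E[g(X)] = \mathrm{Var}(X) = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# E[g(X)] two ways, with X ~ N(0, 1) and g(x) = x^2 (true value: 1)
g = lambda x: x ** 2
f = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)  # standard normal density

# 1) the integral definition, via a Riemann sum over a wide grid
xs = np.linspace(-10.0, 10.0, 200_001)
by_integral = float(np.sum(g(xs) * f(xs)) * (xs[1] - xs[0]))

# 2) the sample-mean estimate
by_sampling = float(np.mean(g(rng.normal(size=1_000_000))))
```

Both routes agree with the true value 1, which is exactly why the expectations in $V(D,G)$ can be rewritten as integrals over the densities $p_{data}$ and $p_g$.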

Since only $D$ varies here, take the partial derivative of $V(D,G)$ with respect to $D$:
$$\begin{aligned} \frac{\partial}{\partial D}V(D,G) &= \frac{\partial}{\partial D}\int \big[\, p_{data}\,\log D + p_g\,\log(1-D)\,\big]\,dx\\ &= \int \frac{\partial}{\partial D}\big[\, p_{data}\,\log D + p_g\,\log(1-D)\,\big]\,dx\\ &= \int \Big[\, p_{data}\cdot\frac{1}{D} + p_g\cdot\frac{-1}{1-D}\,\Big]\,dx \end{aligned}$$
We want $\max_D V(D,G)$, so set the integrand of the derivative to zero pointwise (for every $x$):
$$p_{data}\cdot\frac{1}{D} + p_g\cdot\frac{-1}{1-D} = 0$$
Solving for $D$ gives:
$$D_G^* = \frac{p_{data}}{p_{data}+p_g}$$
This is the optimal discriminator.
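This can be sanity-checked pointwise: for fixed density values $p_{data}(x)$ and $p_g(x)$, the integrand $p_{data}\log D + p_g\log(1-D)$ should peak exactly at $D^* = p_{data}/(p_{data}+p_g)$. A sketch with illustrative density values:

```python
import numpy as np

# Fixed (illustrative) density values at a single point x
p_data, p_g = 0.7, 0.2

# The pointwise integrand of V(D, G) as a function of the scalar D in (0, 1)
integrand = lambda d: p_data * np.log(d) + p_g * np.log(1.0 - d)

ds = np.linspace(1e-4, 1.0 - 1e-4, 100_000)
d_best = float(ds[np.argmax(integrand(ds))])   # numerical maximizer
d_star = p_data / (p_data + p_g)               # closed-form optimum: 0.7/0.9
```

The grid maximizer lands on the closed-form value, confirming the calculus above.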

Recall that $D(x)$ denotes the probability that the input $x$ comes from a real sample ($x \sim p_{data}$), and $G(z)$ takes an input $z \sim p_z$ and outputs a sample $x \sim p_g$.

Substituting $D_G^* = \frac{p_{data}}{p_{data}+p_g}$ into the overall objective gives:
$$\begin{aligned} \min_G\max_D V(D,G) &= \min_G V(D_G^*, G)\\ &= \min_G E_{x\sim p_{data}}\big[\log D_G^*\big] + E_{x\sim p_g}\big[\log(1-D_G^*)\big]\\ &= \min_G E_{x\sim p_{data}}\Big[\log \frac{p_{data}}{p_{data}+p_g}\Big] + E_{x\sim p_g}\Big[\log\Big(1-\frac{p_{data}}{p_{data}+p_g}\Big)\Big]\\ &= \min_G E_{x\sim p_{data}}\Big[\log \frac{p_{data}}{p_{data}+p_g}\Big] + E_{x\sim p_g}\Big[\log \frac{p_g}{p_{data}+p_g}\Big]\\ &= \min_G E_{x\sim p_{data}}\Big[\log\Big(\frac{p_{data}}{(p_{data}+p_g)/2}\cdot\frac12\Big)\Big] + E_{x\sim p_g}\Big[\log\Big(\frac{p_g}{(p_{data}+p_g)/2}\cdot\frac12\Big)\Big]\\ &= \min_G E_{x\sim p_{data}}\Big[\log\frac{p_{data}}{(p_{data}+p_g)/2}\Big] + E_{x\sim p_g}\Big[\log\frac{p_g}{(p_{data}+p_g)/2}\Big] - \log 2 - \log 2\\ &= \min_G KL\Big(p_{data}\,\Big\|\,\frac{p_{data}+p_g}{2}\Big) + KL\Big(p_g\,\Big\|\,\frac{p_{data}+p_g}{2}\Big) - \log 4\\ &\ge -\log 4 \end{aligned}$$

The last step uses the KL divergence, i.e. relative entropy (whose value lies in $[0, +\infty)$):

Let $P(x)$ and $Q(x)$ be two probability distributions of the random variable $X$. For discrete and continuous random variables respectively, the relative entropy is defined as:
$$KL(P\,\|\,Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)}$$

$$KL(P\,\|\,Q) = \int P(x)\log\frac{P(x)}{Q(x)}\,dx$$
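A sketch computing the discrete KL divergence and checking the two properties the bound above relies on: non-negativity, and equality to zero when $P = Q$ (the specific distributions are illustrative):

```python
import numpy as np

def kl(p, q):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

d_pq = kl(p, q)  # positive: p differs from q
d_pp = kl(p, p)  # exactly zero: a distribution has zero divergence from itself
```

Note that KL divergence is not symmetric in general: `kl(p, q)` and `kl(q, p)` need not be equal.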

A detailed treatment of KL divergence (its concept, properties, and derivation) is planned as a separate article.

The reason for rewriting, e.g., $\log\frac{p_g}{p_{data}+p_g}$ as $\log\big(\frac{p_g}{(p_{data}+p_g)/2}\cdot\frac12\big)$ is that the denominator $p_{data}+p_g$ is the sum of two probability densities and integrates to 2 rather than 1, so it is not itself a valid probability density. Dividing by 2 turns it into the mixture density $\frac{p_{data}+p_g}{2}$, to which the definition of KL divergence applies.

Clearly, equality holds when $p_{data} = \frac{p_{data}+p_g}{2} = p_g$. Therefore $p_g^* = p_{data}$, and at that point $D_G^* = \frac12$. This shows that when the overall GAN objective reaches its global optimum, the generator $G$ maps the input random variable $z \sim p_z$ to $G(z) \sim p_g$ with $p_g = p_{data}$, while the discriminator $D$ can no longer tell whether its input is a real or a fake sample: regardless of whether $x$ comes from the real data or from $G$, it can only output a probability of $\frac12$.
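The bound can also be checked numerically: for any pair of distributions, $KL(p_{data}\,\|\,m) + KL(p_g\,\|\,m) - \log 4$ with $m = \frac{p_{data}+p_g}{2}$ is at least $-\log 4$, with equality exactly when $p_g = p_{data}$. A sketch using discrete illustrative distributions:

```python
import numpy as np

def kl(p, q):
    # Discrete KL divergence: sum_x p(x) * log(p(x) / q(x))
    return float(np.sum(p * np.log(p / q)))

def v_at_optimal_d(p_data, p_g):
    """min_G max_D V at D*: KL(p_data||m) + KL(p_g||m) - log 4, with m = (p_data+p_g)/2."""
    m = (p_data + p_g) / 2.0
    return kl(p_data, m) + kl(p_g, m) - np.log(4.0)

p_data = np.array([0.5, 0.3, 0.2])
p_g_bad = np.array([0.2, 0.3, 0.5])  # generator far from the data distribution
p_g_opt = p_data.copy()              # generator exactly matches the data distribution

v_bad = v_at_optimal_d(p_data, p_g_bad)  # strictly above -log 4
v_opt = v_at_optimal_d(p_data, p_g_opt)  # exactly -log 4
```

The quantity $KL(p_{data}\|m) + KL(p_g\|m)$ is (twice) the Jensen-Shannon divergence between $p_{data}$ and $p_g$, which vanishes only when the two distributions coincide.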

5 Network construction and training

Generally speaking, the discriminator $D$ and the generator $G$ are two independent neural networks. Depending on the application, they can be multi-layer perceptrons (MLP), convolutional neural networks (CNN), Seq2Seq models, or other architectures.

The discriminator is generally a simple classification network whose output lies in $[0,1]$ and represents the probability that the input comes from a real sample: an output greater than 0.5 means the discriminator believes the input is real, otherwise it believes the input is a fake produced by the generator.

The generator is more involved. Its input is noise, usually drawn from a Gaussian distribution (other distributions also work; there is no hard requirement, and Gaussians perform well in practice), from which it produces a fake sample resembling the real data. For example, to generate an image with a GAN, the generator takes one or more random noise values and maps them through the neural network into a two-dimensional image.
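The noise-to-image mapping can be sketched as a small forward pass (the layer sizes, tanh activations, and the 8×8 output shape are illustrative assumptions; a real generator would learn these weights by training):

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, W1, b1, W2, b2, img_shape=(8, 8)):
    """Toy MLP generator: noise vector -> hidden layer -> flat pixels -> 2-D image."""
    h = np.tanh(z @ W1 + b1)        # hidden layer
    flat = np.tanh(h @ W2 + b2)     # pixel values squashed into (-1, 1)
    return flat.reshape(*img_shape)

latent_dim, hidden, n_pixels = 16, 32, 64
W1 = rng.normal(0, 0.1, (latent_dim, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, n_pixels));   b2 = np.zeros(n_pixels)

z = rng.normal(size=latent_dim)     # Gaussian noise input
img = generator(z, W1, b1, W2, b2)  # an 8x8 "image"
```

With random weights the output is of course noise shaped like an image; training against the discriminator is what pushes $p_g$ toward $p_{data}$.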


The training procedure was outlined at the beginning of Section 4; we add some details here.

In each training iteration of a GAN, stochastic gradient descent (SGD) is used to update the parameters (generally denoted $\theta_D$ and $\theta_G$). The overall idea is: after randomly initializing all parameters, each training round first trains the discriminator so that it outputs $\ge 0.5$ for real samples and $\le 0.5$ for fake samples, then fixes the discriminator; it then trains the generator so that the samples it generates receive a discriminator output of at least $0.5$ as often as possible.
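This alternating scheme can be sketched end-to-end on a 1-D toy problem: real data from $N(3,1)$, a one-parameter generator $G(z)=\theta+z$, a logistic discriminator $D(x)=\sigma(ax+b)$, and hand-derived gradients. All specific choices here (learning rates, batch size, step counts, and the commonly used non-saturating generator update that maximizes $\log D(G(z))$ instead of minimizing $\log(1-D(G(z)))$) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

mu_real = 3.0                 # real data distribution: N(3, 1)
theta = 0.0                   # generator parameter: G(z) = theta + z
a, b = 0.1, 0.0               # discriminator: D(x) = sigmoid(a*x + b)
lr_d, lr_g, batch, k = 0.1, 0.05, 64, 1

for _ in range(2000):
    for _ in range(k):                            # k discriminator updates per round
        xr = rng.normal(mu_real, 1.0, batch)      # real samples
        xf = theta + rng.normal(0.0, 1.0, batch)  # fake samples G(z)
        dr, df = sigmoid(a * xr + b), sigmoid(a * xf + b)
        # gradient ascent on E[log D(x_real)] + E[log(1 - D(G(z)))]
        a += lr_d * (np.mean((1 - dr) * xr) - np.mean(df * xf))
        b += lr_d * (np.mean(1 - dr) - np.mean(df))
    # generator update (non-saturating: gradient ascent on E[log D(G(z))])
    xf = theta + rng.normal(0.0, 1.0, batch)
    df = sigmoid(a * xf + b)
    theta += lr_g * np.mean((1 - df) * a)
```

Under these settings the generator's shift parameter $\theta$ drifts toward the real mean 3, at which point the discriminator can no longer separate the two sample streams.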

The details of SGD and of $D$'s loss function (generally the cross-entropy loss) are basic neural network material and are not repeated here.


Source: blog.csdn.net/meng_xin_true/article/details/128488476