GAN algorithm notes

This article contains my study notes on Ian Goodfellow et al.'s paper "Generative Adversarial Nets". If you find an error, corrections by comment or private message are welcome.

1. Introduction

The problem the GAN model addresses

In the first paragraph the authors point out the motivation for this work: avoiding two limitations of existing deep generative models, namely
(1) the intractable probabilistic computations that arise in maximum likelihood estimation and related strategies;
(2) the difficulty of leveraging the benefits of piecewise linear units in the generative setting.

PS: A deep generative model learns the distribution of the original sample data, so that new samples following that distribution can be generated.

Components of the GAN model

The GAN model consists of two parts: a generative model \(G\) (Generative Model) and a discriminative model \(D\) (Discriminative Model). The paper's analogy for the two parts is a team of counterfeiters versus the police. The discriminative model (the police) determines whether a sample comes from the data distribution or from the model distribution, while the generative model (the counterfeiters) tries to produce fake samples that fool the discriminative model (the police). This competition drives both sides to keep improving their methods, until fake samples are completely indistinguishable from real ones. In the paper's framework, the generative model is an MLP that maps random noise to fake samples, and the discriminative model is an MLP that takes a sample as input and judges whether it is a real sample. The authors note that both models can be trained using only the backpropagation algorithm and the dropout algorithm.

ps: An MLP (Multi-Layer Perceptron) is a neural network with at least one hidden layer (i.e., a layer other than the single input layer and single output layer), as shown in the figure.
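As a minimal sketch (my own toy example, not code from the paper), an MLP with one hidden layer can be written in a few lines of numpy; the layer sizes here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: input -> hidden (ReLU) -> output (sigmoid)."""
    h = np.maximum(0.0, x @ W1 + b1)              # hidden layer, a piecewise linear unit
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # one probability per input row

# Hypothetical sizes: 3 input features, 5 hidden units, 1 output.
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

out = mlp_forward(rng.normal(size=(4, 3)), W1, b1, W2, b2)
print(out.shape)  # (4, 1): one output for each of the 4 inputs
```

The sigmoid output makes this shape directly usable as the discriminator \(D\) described above.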

2. Related work

In this part the authors review prior work on deep generative networks. Since I have not yet studied deep learning systematically, I am skipping it for now.

3. Adversarial nets

The authors define two functions:

  • \(G\left(\boldsymbol{z}; \theta_{g}\right)\) represents the function mapping the noise \(\boldsymbol{z}\) into the data sample space, where \(\theta_{g}\) denotes the parameters of the MLP.
  • \(D(\boldsymbol{x}; \theta_d)\) takes a sample \(\boldsymbol{x}\) as input and outputs the probability (a scalar) that the sample came from the real data.

The GAN objective

The GAN objective is defined by the following optimization problem:
\[\min_G \max_D V(D, G)=\mathbb{E}_{\boldsymbol{x} \sim p_{\text {data }}(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z} \sim p_{z}}[\log (1-D(G(\boldsymbol{z})))]\]
This equation consists of two parts: the first term, \(\mathbb{E}_{\boldsymbol{x} \sim p_{\text {data }}(\boldsymbol{x})}[\log D(\boldsymbol{x})]\), can be understood as the expectation of the discriminative model correctly judging real data, and the second term, \(\mathbb{E}_{\boldsymbol{z} \sim p_{z}}[\log (1-D(G(\boldsymbol{z})))]\), as the expectation of the discriminative model correctly recognizing fake samples.
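To get a feel for the value function, here is a Monte Carlo estimate of \(V(D,G)\) on a toy 1-D problem of my own (the data distribution \(N(4,1)\), the fixed discriminator, and both generators are made-up illustrations, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def V(D, G, n=100_000):
    """Monte Carlo estimate of V(D, G) = E_{x~p_data}[log D(x)]
    + E_{z~p_z}[log(1 - D(G(z)))], with toy p_data = N(4, 1), p_z = N(0, 1)."""
    x = rng.normal(loc=4.0, scale=1.0, size=n)   # real samples
    z = rng.normal(size=n)                       # noise samples
    return np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(G(z))))

D = lambda x: 1.0 / (1.0 + np.exp(-(x - 2.0)))   # a fixed toy discriminator
G_bad = lambda z: z          # fakes stay near 0, easy for D to detect
G_good = lambda z: z + 4.0   # fakes match p_data exactly

print(V(D, G_bad), V(D, G_good))
```

The easily-detected generator yields a larger \(V\) (the discriminator wins both terms), while the perfect generator drives \(V\) down, which is exactly the min-max tension in the objective.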

Relative entropy?

Looking back at this formula, a question suddenly occurred to me: why does the formula take the log of the probability before taking the expectation?
I have not fully worked it out; it feels somewhat like the concepts of cross entropy and relative entropy, but I cannot put the formula above into one-to-one correspondence with the definition of relative entropy.
For a start, see this link: the various kinds of "entropy" in machine learning (the blog page loads really slowly).

Image understanding

I stared at the picture below for a whole evening without understanding it; only this morning did I figure out what it means.

The blue dashed line shows the probability distribution judged by the discriminative model, the black dotted line shows the distribution of the original (real) data, and the green solid line shows the distribution produced by the generative model.

  • Figure (a): training has not started, or has only just begun; neither the discriminative model nor the generative model has been trained much, so the discriminator's curve fluctuates, and the generated distribution is still some distance from the real data distribution.
  • Figure (b): after some training, the discriminative model can distinguish generated samples from original samples; the height of the blue dashed line at a given abscissa indicates the probability that a sample there is real, with a higher curve meaning a higher probability of being real.
  • Figure (c): after further training, the generated samples move closer to the original samples, while the discriminative model still discriminates relatively well.
  • Figure (d): after sufficient training, the distributions of the original and generated samples essentially coincide, and the discriminative model loses its discriminating power (outputting 1/2 everywhere).

Algorithm 1

Original description of Algorithm 1

Algorithm 1 described in Python-style pseudocode:

for i in range(number_of_training_iterations):
    for j in range(k):
        sample m noise samples from the noise prior
        sample m examples from the data generating distribution
        update the discriminator by stochastic gradient ascent
    sample m noise samples from the noise prior
    update the generator by stochastic gradient descent

4. Theoretical Results

Global optimum: \(p_g = p_{\text {data }}\)

The first proposition given is: when the generative model \(G\) is fixed, the optimal discriminator is
\[D^{*}_{G}(\boldsymbol{x})=\frac{p_{\text {data }}(\boldsymbol{x})}{p_{\text {data }}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}\]
A proof is also given; the idea is simple: any function of the form \(y \mapsto a \log (y)+b \log (1-y)\) attains its maximum on the interval \([0,1]\) at \(y=\frac{a}{a+b}\).
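The claim about where \(a \log (y)+b \log (1-y)\) peaks is easy to check numerically (my own sanity check, with arbitrary values of \(a\) and \(b\)):

```python
import numpy as np

# Check that f(y) = a*log(y) + b*log(1-y) peaks at y = a / (a + b) on (0, 1).
a, b = 0.3, 0.7
ys = np.linspace(1e-4, 1 - 1e-4, 100_001)
f = a * np.log(ys) + b * np.log(1.0 - ys)
y_star = ys[np.argmax(f)]
print(y_star, a / (a + b))  # both close to 0.3
```

With \(a = p_{\text{data}}(\boldsymbol{x})\) and \(b = p_g(\boldsymbol{x})\), this is exactly the pointwise maximization that yields \(D^{*}_{G}\).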
Thus, the original objective can be replaced with the following:
\[\begin{aligned} C(G) &=\max _{D} V(G, D) \\ &=\mathbb{E}_{\boldsymbol{x} \sim p_{\text {data }}}\left[\log D_{G}^{*}(\boldsymbol{x})\right]+\mathbb{E}_{\boldsymbol{z} \sim p_{z}}\left[\log \left(1-D_{G}^{*}(G(\boldsymbol{z}))\right)\right] \\ &=\mathbb{E}_{\boldsymbol{x} \sim p_{\text {data }}}\left[\log D_{G}^{*}(\boldsymbol{x})\right]+\mathbb{E}_{\boldsymbol{x} \sim p_{g}}\left[\log \left(1-D_{G}^{*}(\boldsymbol{x})\right)\right] \\ &=\mathbb{E}_{\boldsymbol{x} \sim p_{\text {data }}}\left[\log \frac{p_{\text {data }}(\boldsymbol{x})}{P_{\text {data }}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}\right]+\mathbb{E}_{\boldsymbol{x} \sim p_{g}}\left[\log \frac{p_{g}(\boldsymbol{x})}{p_{\text {data }}(\boldsymbol{x})+p_{g}(\boldsymbol{x})}\right] \end{aligned}\]
Using the KL divergence (relative entropy) formula
\[D_{KL}(P \| Q)=\sum_{i=1}^{n} p_{i} \log \left(\frac{p_{i}}{q_{i}}\right)\]
the expression above can be rewritten as
\[C(G)=V(G, D^{*}_{G})=-\log (4)+KL\left(p_{\text {data }} \| \frac{p_{\text {data }}+p_{g}}{2}\right)+KL\left(p_{g} \| \frac{p_{\text {data }}+p_{g}}{2}\right)\]
which can in turn be rewritten using the JS divergence:
\[C(G)=-\log (4)+2 \cdot JSD\left(p_{\text {data }} \| p_{g}\right)\]
The JS divergence is non-negative and equals 0 exactly when the two distributions coincide, so \(C(G)\) attains its minimum value \(-\log (4)\) when the JS divergence is 0, and the unique solution at that point is \(p_g = p_{\text {data }}\).
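This identity can be verified numerically on small discrete distributions (the two distributions below are arbitrary numbers of my own; natural log is used throughout, matching the derivation):

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(P || Q) for discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def C(p_data, p_g):
    """C(G) with D fixed at its optimum D*(x) = p_data/(p_data + p_g)."""
    d_star = p_data / (p_data + p_g)
    return float(np.sum(p_data * np.log(d_star))
                 + np.sum(p_g * np.log(1.0 - d_star)))

p_data = np.array([0.2, 0.5, 0.3])
p_g = np.array([0.4, 0.4, 0.2])
m = (p_data + p_g) / 2
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(C(p_data, p_g), -np.log(4) + 2 * jsd)   # the two sides agree
print(C(p_data, p_data))                      # equals -log(4) when p_g = p_data
```

In particular, evaluating \(C\) at \(p_g = p_{\text{data}}\) returns exactly \(-\log(4)\), and any other \(p_g\) gives a strictly larger value.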

5. Experiments

6. Advantages and disadvantages

The authors note that the current drawbacks are that there is not yet an explicit representation of \(p_g(\boldsymbol{x})\), and that during training \(D\) must be kept well synchronized with \(G\) (in particular, \(G\) must not be trained too much while \(D\) lags behind).
An advantage is that no Markov chains are needed; only backpropagation is used to obtain gradients.


Source: www.cnblogs.com/xieldy/p/11651447.html