Various divergences
Entropy
The amount of information carried by distribution P.
Equivalently, the minimum expected code length needed to encode samples drawn from P using a code based on P.
Cross entropy
The amount of information in distribution P as seen from the perspective of distribution Q.
Equivalently, the expected code length needed to encode samples drawn from P using a code based on Q.
Why can cross entropy be used as a loss? (reference)
The entropy of the training-sample distribution P is constant, so minimizing cross entropy is equivalent to minimizing the KL divergence, i.e. the amount of information lost when fitting the training-data distribution with the current model distribution.
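This identity, H(P, Q) = H(P) + KL(P || Q), can be checked numerically. Below is a minimal numpy sketch with two hypothetical discrete distributions p and q (the names and values are illustrative, not from the original text):

```python
import numpy as np

def entropy(p):
    # H(P): average information content of P (natural log)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    # H(P, Q): expected code length when encoding P-samples with a Q-based code
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

def kl_divergence(p, q):
    # KL(P || Q): extra code length paid for using Q instead of P
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])  # "training data" distribution (fixed)
q = np.array([0.5, 0.3, 0.2])  # model distribution being fitted

# H(P) is constant, so minimizing H(P, Q) over q minimizes KL(P || Q).
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
```

Since H(P) does not depend on the model, gradient descent on the cross entropy and on the KL divergence drive q toward the same optimum.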
KL divergence
Non-negative; asymmetric.
The amount of information lost when using distribution Q to approximate distribution P.
Equivalently, the extra code length incurred by encoding samples from P with a code based on Q.
JS divergence
Symmetric; bounded between 0 and 1 (with log base 2); the more similar two distributions are, the smaller it is.
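These properties can be verified with a small sketch that builds JS from KL, using log base 2 so the value lies in [0, 1] (the distributions a and b below are hypothetical examples):

```python
import numpy as np

def kl2(p, q):
    # KL divergence with log base 2
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js(p, q):
    # JS(P || Q) = (KL(P || M) + KL(Q || M)) / 2, where M = (P + Q) / 2
    m = 0.5 * (p + q)
    return 0.5 * kl2(p, m) + 0.5 * kl2(q, m)

a = np.array([0.8, 0.2, 0.0])
b = np.array([0.0, 0.2, 0.8])
print(js(a, b), js(b, a))  # symmetric: both values are equal
print(js(a, a))            # identical distributions -> 0
print(js(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # disjoint supports -> 1
```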
GAN principle
From the loss defined for the original GAN discriminator, we can derive the optimal form of the discriminator; under that optimal discriminator, the original GAN generator loss is equivalent to minimizing the JS divergence between the real distribution and the generated distribution.
With G fixed, the optimal D is determined; substituting it into max_D V(G, D) yields an expression in terms of the JS divergence, with minimum -2 log 2.
Minimizing this expression is therefore the same as minimizing the JS divergence, which forces P_g = P_r at the optimum.
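The derivation above can be summarized in the standard notation, with P_r the real distribution and P_g the generated one:

```latex
% Original GAN value function
V(G, D) = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{x \sim P_g}[\log(1 - D(x))]

% Optimal discriminator for a fixed G
D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)}

% Substituting D^* back in
\max_D V(G, D) = 2\,\mathrm{JS}(P_r \parallel P_g) - 2\log 2
```

Since JS is non-negative and zero exactly when the two distributions coincide, the minimum value -2 log 2 is reached only at P_g = P_r.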
Training problems
- G and D are trained against each other: after updating G, the JS divergence does become smaller, but the V(G, D) surface also changes, so the next max_D V(G, D) may become larger, meaning D's current fit to the two distributions has gotten worse.
Solution: update D several times for each update of G.
- JS divergence problem, addressed by adding noise
Images are generated by mapping a low-dimensional vector into a high-dimensional space, so P_r and P_g almost never have non-negligible overlap. As a result, no matter how far apart they are, the JS divergence stays at a constant (log 2), which eventually makes the generator's gradient (approximately) 0: vanishing gradients.
- The improved generator loss leads to instability & mode collapse (lack of diversity)
It is equivalent to minimizing KL(P_g || P_r) while simultaneously maximizing JS(P_r || P_g); the two objectives pull in opposite directions, so the gradient is unstable.
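The identity behind this claim, under the optimal discriminator D* and up to an additive constant, is:

```latex
\mathbb{E}_{x \sim P_g}\left[-\log D^*(x)\right]
  = \mathrm{KL}(P_g \parallel P_r) - 2\,\mathrm{JS}(P_r \parallel P_g) + \text{const}
```

The first term wants the distributions close while the second rewards pushing them apart, which is the source of the instability.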
The KL term above has a further problem: it is asymmetric.
The first kind of error, generating a sample that does not exist in the real data, is penalized heavily; the second kind, failing to generate samples that do appear in the real data, is penalized lightly. The generator therefore prefers to produce safe, repetitive samples rather than risk diversity: mode collapse.
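The asymmetric penalties can be made concrete with a toy numeric sketch. The three modes A, B, C below are hypothetical: the real data covers A and B, and we compare the two kinds of generator error under KL(P_g || P_r):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence with a small epsilon to avoid log(0)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p_r = np.array([0.5, 0.5, 0.0])  # real distribution: modes A and B only

# Error 1: generator puts mass on mode C, which never occurs in real data.
p_g_unreal = np.array([0.5, 0.0, 0.5])

# Error 2: generator ignores mode B entirely (lack of diversity).
p_g_collapsed = np.array([1.0, 0.0, 0.0])

print(kl(p_g_unreal, p_r))     # huge penalty for generating unreal samples
print(kl(p_g_collapsed, p_r))  # mild penalty (~log 2) for missing a real mode
```

Since the penalty for unreal samples dwarfs the penalty for missing modes, the generator gravitates toward the collapsed solution.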
WGAN
Earth-Mover (EM) distance
Over all possible joint distributions (couplings) of the real and generated distributions, compute the expected distance between a real sample and a generated sample, and take the infimum of this expectation.
In other words, under the optimal coupling, it is the minimum cost of moving P_r onto P_g.
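In the usual notation, with Π(P_r, P_g) the set of all joint distributions whose marginals are P_r and P_g:

```latex
W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)}
              \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right]
```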
Compared with the KL and JS divergences, the advantage of the Wasserstein distance is that even when the two distributions do not overlap, it still reflects how far apart they are.
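A sketch of the classic two-point example: P_r is a unit mass at position 0 and P_g a unit mass at position theta. The supports are disjoint for any theta != 0, so JS is stuck at the constant log 2, while the Wasserstein distance shrinks smoothly as theta approaches 0:

```python
import numpy as np

def js(p, q):
    # JS divergence (natural log) between discrete distributions on a shared support
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0])  # all mass at position 0
q = np.array([0.0, 1.0])  # all mass at position theta

for theta in [4.0, 2.0, 1.0, 0.5]:
    w = abs(theta - 0.0)       # Wasserstein-1 between two unit point masses
    print(theta, js(p, q), w)  # JS stays at log 2 ~ 0.693; W tracks the gap
```

The constant JS gives the generator no gradient to follow, whereas the Wasserstein distance decreases as the distributions move closer.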
WGAN
The critic takes f(x) on real samples and -f(x) on generated samples; the parameters w are restricted so that f satisfies Lipschitz continuity.
Differences from the original GAN:
1. Loss function
-
Parameters are clipped (truncated) to satisfy the Lipschitz condition
-
The sigmoid is removed from the discriminator
because the original D(x) fits a value in [0, 1], whereas here the discriminator (critic) fits the Wasserstein distance.
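As a sketch, one critic update with weight clipping might look like this in numpy. A hypothetical linear critic f(x) = w . x stands in for the real network, and names like clip_c are illustrative (actual WGAN implementations use deep networks and RMSProp):

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.normal(size=2) * 0.01  # critic parameters
clip_c = 0.01                  # weight-clipping threshold (WGAN paper default)
lr = 5e-5                      # learning rate

real = rng.normal(loc=1.0, size=(64, 2))   # samples from P_r
fake = rng.normal(loc=-1.0, size=(64, 2))  # samples from P_g

# Critic loss: -(E[f(real)] - E[f(fake)]); note: no log, no sigmoid.
# For the linear critic, the gradient of this loss w.r.t. w is:
grad_w = -(real.mean(axis=0) - fake.mean(axis=0))
w = w - lr * grad_w
# Clip the weights so the critic stays (roughly) Lipschitz.
w = np.clip(w, -clip_c, clip_c)

critic_loss = -((real @ w).mean() - (fake @ w).mean())
print(critic_loss, w)
```

In a full training loop this step runs several times per generator update, matching the "update D multiple times" fix described earlier.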
Relativistic GANs