PyTorch Neural Network Practical Study Notes_36 Applications of Divergence in Neural Networks: f-Divergence + f-GAN Implementation + Mutual Information Neural Estimation + GAN Model Training Tricks

1 Application of Divergence in Unsupervised Learning

In neural-network loss calculation, maximizing or minimizing the divergence between two data distributions has become one of the effective training methods for unsupervised models.

In unsupervised model training, not only the KL divergence and JS divergence can be used; other ways of measuring the difference between distributions can be used as well. f-GAN generalizes these distribution-measuring approaches and extracts the underlying rule: it uses a unified f-divergence to provide a general framework for training GAN models based on measuring the difference between distributions.

1.1 Brief introduction of f-GAN

f-GAN is a framework for training GANs rather than a specific GAN method. It makes it easy to apply various divergences to GAN training; in other words, f-GAN is a "factory" that produces GAN models.

The GAN models it produces share a common feature: they make no prior assumptions about the data, they minimize a measure of the difference between the generated sample distribution and the real sample distribution, and they attempt to solve the general problem of data sample generation (commonly used in unsupervised training).

1.2 Variational Divergence Minimization (VDM) based on f-divergence

The variational divergence minimization method refers to training the parameters of a model by minimizing the variational distance between two data distributions; it is the general method used by f-GAN. In f-GAN, the distance between data distributions is measured with the f-divergence.

1.2.1 Scope of application of the variational divergence minimization method

The training method of the WGAN model and the training method of autoencoders also belong to the VDM family. All GAN models conforming to the f-GAN framework can be trained with the VDM method; the VDM method is suitable for training GAN models.

1.2.2 f-divergence

Given two distributions P and Q, with p(x) and q(x) the corresponding probability functions of x, the f-divergence can be expressed as

$$D_f(P \,\|\, Q) = \int_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) dx$$

The f-divergence is itself a divergence "factory": the generating function f(x) in the formula must be specified before it can be used, and the f-divergence then produces the corresponding measurement algorithm according to the concrete form of f(x).
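For illustration (this sketch is an addition to these notes, not the book's code), the formula can be evaluated directly for discrete distributions once a generating function f has been chosen; with f(u) = u·log u it reduces to the ordinary KL divergence:

```python
import torch

def f_divergence(p, q, f):
    """Compute D_f(P||Q) = sum_x q(x) * f(p(x)/q(x)) for discrete distributions."""
    ratio = p / q
    return torch.sum(q * f(ratio))

# Generating function for the KL divergence: f(u) = u * log(u)
kl_generator = lambda u: u * torch.log(u)

p = torch.tensor([0.4, 0.4, 0.2])
q = torch.tensor([0.3, 0.3, 0.4])

print(f_divergence(p, q, kl_generator))   # f-divergence with the KL generator
print(torch.sum(p * torch.log(p / q)))    # direct KL divergence, same value
```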

 2 Implement f-GAN with Fenchel conjugate function

2.1 Definition of the Fenchel conjugate function (convex conjugate)

The Fenchel conjugate (convex conjugate) function means that for every convex, lower semi-continuous function f(x) there exists a conjugate function f* defined as

$$f^*(t) = \max_{x \in \mathrm{dom}(f)} \{\, xt - f(x) \,\}$$

In the formula, f*(t) is a function of the variable t; dom(f) is the domain of f(x); the max means that, for a given abscissa t, the expression xt − f(x) is evaluated for every x in the domain, and the ordinate is taken from the point on the highest of the straight lines {xt − f(x)}, as shown in the figure.
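The definition can be checked numerically. The sketch below is an added illustration (not the book's code): it approximates f*(t) by sampling x on a grid and taking the maximum of xt − f(x), using the TV generating function f(x) = |x − 1| / 2 discussed in the next subsection; for this f, the conjugate is known to satisfy f*(t) = t on its domain |t| ≤ 1/2.

```python
import torch

def fenchel_conjugate(f, t, x_grid):
    # f*(t) = max_x { x*t - f(x) }, approximated on a finite grid of x values
    values = x_grid * t - f(x_grid)
    return values.max()

f = lambda x: torch.abs(x - 1) / 2             # Total Variation generating function
x_grid = torch.linspace(-5.0, 5.0, steps=10001)

for t in [-0.5, 0.0, 0.25, 0.5]:
    print(t, fenchel_conjugate(f, t, x_grid).item())   # each result is close to t
```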

2.2 Properties of Fenchel Conjugate Functions

In Figure 8-23 there is one thick line and several thin lines. The thin lines are the straight lines xt − f(x) produced by several randomly sampled values of x, and the thick line is the conjugate function f* of the generating function. The generating function in Figure 8-23 is f(x) = |x − 1| / 2, and the algorithm corresponding to this function is the Total Variation (TV) algorithm. The TV algorithm is often used for image denoising and restoration.

2.3 Applying the Fenchel conjugate function to the f-divergence
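The key result of this step (written here in the standard form used by the f-GAN paper, since the original equation is not reproduced in these notes) is that substituting f(u) = max_t {tu − f*(t)} into the f-divergence definition yields a variational lower bound that needs only samples from P and Q:

$$
D_f(P \,\|\, Q)
= \int_x q(x) \max_{t \in \mathrm{dom}(f^*)} \left\{ t\,\frac{p(x)}{q(x)} - f^*(t) \right\} dx
\;\ge\; \max_{T} \Big( \mathbb{E}_{x \sim P}\big[ T(x) \big] - \mathbb{E}_{x \sim Q}\big[ f^*\!\big(T(x)\big) \big] \Big)
$$

where T(x) is an arbitrary function; in f-GAN, T is represented by the discriminator, which maximizes the right-hand side, while the generator minimizes it.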

 2.4 Generating various GANs with f-GAN 

The corresponding GAN can be obtained by substituting the specific algorithm in Figure 8-22 into Equation (8-40). Interestingly, among the GANs derived from f-GAN, many known GAN models can be found. Looking back at individual models through this general rule makes our understanding of GANs more thorough. An example is as follows:
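A rough sketch of what one instance produced by this "factory" looks like in code (an added illustration; netG, netD, real_batch, and z are assumed to be defined elsewhere). Here the KL-divergence instance is used, for which the output activation is g_f(v) = v and the conjugate is f*(t) = exp(t − 1):

```python
import torch

# Assumed to be defined elsewhere: netD (discriminator), netG (generator),
# real_batch (a batch of real samples), z (a batch of noise vectors).

# f-GAN instance for the KL divergence:
#   output activation g_f(v) = v, conjugate f*(t) = exp(t - 1)
g_f    = lambda v: v
f_star = lambda t: torch.exp(t - 1)

def d_loss(netD, netG, real_batch, z):
    # Discriminator maximizes the variational bound, so its loss is the negative bound.
    t_real = g_f(netD(real_batch))
    t_fake = g_f(netD(netG(z).detach()))
    return -(t_real.mean() - f_star(t_fake).mean())

def g_loss(netD, netG, z):
    # Generator minimizes the bound; only the fake-sample term depends on it.
    t_fake = g_f(netD(netG(z)))
    return -f_star(t_fake).mean()
```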

 2.5 Activation function of the discriminator in f-GAN
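The original table for this subsection is not reproduced here; the idea is that the raw discriminator output V(x) is passed through an output activation g_f whose range matches the domain of f*. The following choices are quoted (up to notation) from the f-GAN paper and should be treated as a reference sketch:

```python
import torch
import torch.nn.functional as F

# Output activations g_f mapping the raw discriminator output v into dom(f*),
# for several divergences covered by the f-GAN framework.
output_activation = {
    "GAN":             lambda v: -F.softplus(-v),     # -log(1 + e^{-v})
    "KL":              lambda v: v,
    "Reverse-KL":      lambda v: -torch.exp(-v),
    "Pearson-chi2":    lambda v: v,
    "Total-Variation": lambda v: 0.5 * torch.tanh(v),
}
```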

 3 Mutual Information Neural Estimation

Mutual Information Neural Estimation (MINE) is a method for estimating mutual information based on a neural network. It is trained with the backpropagation algorithm and estimates the mutual information between high-dimensional continuous random variables; it can maximize or minimize mutual information, improve the adversarial training of generative models, and break through the bottleneck of supervised-learning classification tasks. (The referenced paper is arXiv:1801.04062, 2018)

3.1 Convert Mutual Information to KL Divergence

The formula for mutual information was introduced earlier. It can be expressed as the relative entropy of the joint probability distribution of the two random variables X and Y with respect to the product of their marginal distributions, namely

$$I(X; Y) = D_{KL}\big(P_{(X,Y)} \,\|\, P_X \otimes P_Y\big)$$

This shows that mutual information can be calculated by computing a KL divergence.

3.2 Two dual representations of KL divergence

The KL divergence is asymmetric; it can be converted into a dual representation for calculation. There are two dual representation formulas: the Donsker-Varadhan representation and the dual f-divergence representation.

The dual f-divergence representation gives a looser lower bound than the Donsker-Varadhan representation, which leads to less accurate estimates. Therefore, the Donsker-Varadhan representation is generally used.
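For reference (the formulas themselves are not reproduced in these notes), the two dual representations described above can be written as follows, where T ranges over functions for which both expectations are finite:

Donsker-Varadhan representation:
$$ D_{KL}(P \,\|\, Q) = \sup_{T} \; \mathbb{E}_{P}[T] - \log \mathbb{E}_{Q}\big[e^{T}\big] $$

Dual f-divergence representation:
$$ D_{KL}(P \,\|\, Q) \ge \sup_{T} \; \mathbb{E}_{P}[T] - \mathbb{E}_{Q}\big[e^{T-1}\big] $$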

3.3 Applications of KL Divergence in Neural Networks

4 Experience and skills for stable training of GAN

4.1 Classification of GAN training failures

Training GAN models is a well-recognized challenge in neural networks. Most training failures fall into two main cases: mode dropping and mode collapse.

  • Mode dropping refers to a lack of diversity in the simulated samples generated by the model; that is, the generated data covers only a subset of the original data set. For example, the MNIST data distribution has 10 categories in total, but the simulated data produced by the generator contains only one of them.
  • Mode collapse: the simulated samples generated by the generator are highly repetitive (concentrated on a single mode) and of low quality.

4.2 GAN training techniques

4.2.1 Reduce the learning rate

In general, a higher learning rate can be used when training a model with larger batches. However, when the model exhibits mode dropping, you can try lowering the learning rate and training again from scratch.
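A minimal illustration (the numbers are only examples, not values from the original text): restarting training with a smaller learning rate simply means re-initializing the networks and rebuilding the optimizers with a lower lr.

```python
import torch

# netG and netD are assumed to be freshly re-initialized generator/discriminator modules.
lr = 1e-4   # e.g. reduced from an earlier 2e-4 when mode dropping is observed
optimizerG = torch.optim.Adam(netG.parameters(), lr=lr, betas=(0.5, 0.999))
optimizerD = torch.optim.Adam(netD.parameters(), lr=lr, betas=(0.5, 0.999))
```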

4.2.2 Label Smoothing

Label smoothing can effectively alleviate mode collapse during training, and it is easy to understand and implement: if the label of a real image would normally be set to 1, change it to a slightly lower value (such as 0.9). This prevents the discriminator from trusting the class labels too much, i.e. from relying on a very limited set of features to decide whether an image is real or fake.
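A minimal sketch of one-sided label smoothing (the 0.9 value comes from the text above; netD and real_batch are assumed to be defined elsewhere):

```python
import torch

criterion = torch.nn.BCEWithLogitsLoss()

d_real = netD(real_batch)                       # raw (un-sigmoided) discriminator outputs
smooth_labels = torch.full_like(d_real, 0.9)    # real labels set to 0.9 instead of 1.0
loss_real = criterion(d_real, smooth_labels)
```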

4.2.3 Multiscale Gradients

This technique is often used to generate larger simulated images (1024 pixels by 1024 pixels). The approach is similar to the traditional U-Net used for semantic segmentation, but the model pays more attention to multi-scale gradients: the multi-scale images obtained by downsampling the real pictures, together with the multi-scale feature maps output by the generator's skip-connection branches, are fed to the discriminator, forming the MSG-GAN architecture. (The referenced paper is arXiv:1903.06048, 2019)
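A minimal sketch of the "real-image pyramid" half of this idea (the generator's skip-connection outputs and the MSG-GAN discriminator are omitted; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def real_image_pyramid(real_images, num_scales=4):
    """Downsample real images into a multi-scale list, largest scale first."""
    pyramid = [real_images]
    for _ in range(num_scales - 1):
        pyramid.append(F.avg_pool2d(pyramid[-1], kernel_size=2))
    return pyramid

# In MSG-GAN these multi-scale real images are fed to the discriminator together with
# the intermediate (multi-scale) outputs of the generator's skip-connection branches.
```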

4.2.4 Replacing the loss function

In the f-GAN family of training methods, different divergence measures lead to different degrees of training instability. In this case, a different measure can be used as the model's loss function in order to find a more suitable solution.
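As one concrete example of swapping the measure (the hinge loss is used here purely as an illustration; it is not singled out in the original text), the usual BCE-based adversarial loss can be replaced without touching the rest of the training loop:

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    # Hinge loss for the discriminator; d_real/d_fake are raw discriminator outputs.
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    # Hinge loss for the generator.
    return -d_fake.mean()
```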

4.2.5 Estimation method by means of mutual information

When training a model, the MINE method can also be used to assist the training.

The MINE method is a general training method that can be used with various models (autoencoder networks, adversarial networks). In GAN training, using the MINE method to assist model training yields better performance, as shown in Figure 8-27.

The left side of Figure 8-27 is the result generated by the GAN model; the right side is the generated result after using MINE-assisted training. As you can see, the space covered by the simulated data (yellow dots) on the right side of the graph is more consistent with the original data (blue dots).

 4.3 Overview of the MINE method

The MINE method relies on two main techniques: turning the mutual information estimate into a neural network model, and using the dual representation of the KL divergence to compute the loss. The most valuable part is the idea behind these two techniques: the technique of expressing mutual information with a neural network can be applied to many more model structures, and the loss function can likewise use different distribution-measurement algorithms depending on the specific task. [See the hands-on example in the next section.]
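A minimal sketch of the two techniques just described (the statistics-network layout and variable names are placeholders; the loss is the negative Donsker-Varadhan bound):

```python
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """Neural network T(x, y) used by MINE to estimate mutual information."""
    def __init__(self, x_dim, y_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

def mine_lower_bound(T, x, y):
    # Joint samples: (x, y) pairs drawn together.
    t_joint = T(x, y)
    # Marginal samples: y shuffled within the batch to break the pairing.
    y_shuffled = y[torch.randperm(y.size(0))]
    t_marginal = T(x, y_shuffled)
    # Donsker-Varadhan estimate: E_P[T] - log E_Q[e^T]
    return t_joint.mean() - torch.log(torch.exp(t_marginal).mean())

# Training step (sketch): maximize the bound, i.e. minimize its negative.
# T = StatisticsNetwork(x_dim, y_dim); opt = torch.optim.Adam(T.parameters(), lr=1e-4)
# loss = -mine_lower_bound(T, x_batch, y_batch); loss.backward(); opt.step()
```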

Origin blog.csdn.net/qq_39237205/article/details/123775566