Preliminary study of Gan (1): image-related applications (SGan, Wgan, Wgan-gp, Cgan, LapGan, PGGan, StyleGan)

Background

During my internship I was asked to do Gan-related research. In the end I had to solve a sequence generation problem, and this post records the process.

Introduction

The basic idea of the original Gan is that there are two networks: a generator network (generator) and a discriminator network (discriminator).
The generator network takes a random vector as input and outputs something meaningful that we want, such as an image.
The discriminator network takes the thing we need to judge as input and outputs a value representing its quality, i.e. whether it is real or fake.
Prepare some real data and start training. In each iteration, first fix the generator and train the discriminator on a mix of generator outputs and real data, labelled 0 and 1 respectively; then fix the discriminator and train the generator against it, so that real and fake data can no longer be told apart. Borrowing a picture from a senior:

[Figure: generator/discriminator training diagram]
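As a minimal sketch of this alternating procedure (assuming PyTorch, a generator G mapping noise to samples, and a discriminator D that ends in a sigmoid and returns one probability per sample; all names here are mine):

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, real, z_dim=100):
    """One iteration: first train D with labels 1 (real) / 0 (fake), then train G."""
    n, device = real.size(0), real.device
    ones = torch.ones(n, 1, device=device)
    zeros = torch.zeros(n, 1, device=device)

    # Step 1: fix G (detach its output) and train the discriminator.
    fake = G(torch.randn(n, z_dim, device=device)).detach()
    d_loss = F.binary_cross_entropy(D(real), ones) + F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Step 2: leave D's weights alone and train the generator so its samples are judged as real.
    g_loss = F.binary_cross_entropy(D(G(torch.randn(n, z_dim, device=device))), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```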
Why do we need two networks instead of a single generative network? Take image generation as an example:
with only a generator, the common approach is to reconstruct images by encoding and decoding, the decoding part being what the generator does. In practice this does not work very well, because the pixels of the image are generated largely independently and it is hard for the network to account for the relationships between them; achieving the same effect would require a much more complex network. A discriminator, on the other hand, can judge and score the generated image as a whole, which works better in practice.
Mathematical principles
Given real sample data and generated samples $G(z)$ (with $z$ the random input vector), the purpose of the discriminator D is to tell them apart as well as possible:
$$\max_D V(D, G) = \mathbb{E}_{x\sim P_r}[\log D(x)] + \mathbb{E}_{x\sim P_g}[\log(1 - D(x))]$$
The purpose of the generator G is to keep the discriminator from telling them apart: $\min_G \max_D V(D, G)$.
During training, G is first fixed and D is trained; the paper simplifies this to show that the optimal discriminator is
$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)}$$
where $P_r(x)$ and $P_g(x)$ are the probability densities of the real samples and the generated samples at $x$. In other words, this is the function that the discriminator network D ends up fitting.
Substituting the trained (optimal) D back into the generator's objective, one finds that G is minimizing the JS divergence between the two distributions, which finally drives them closer together.
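Concretely, substituting $D^*$ back into the objective (as in the original Gan paper) gives
$$\min_G V(G, D^*) = \mathbb{E}_{x\sim P_r}\!\left[\log\frac{P_r(x)}{P_r(x)+P_g(x)}\right] + \mathbb{E}_{x\sim P_g}\!\left[\log\frac{P_g(x)}{P_r(x)+P_g(x)}\right] = 2\,\mathrm{JS}(P_r\,\|\,P_g) - 2\log 2,$$
which is minimized exactly when $P_r = P_g$.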
Theoretically, at the end of training the samples produced by the generator are indistinguishable from real samples, i.e. $P_r(x) = P_g(x)$, and the discriminator D stabilizes at 1/2. In practice this is hard to reach: Gan training is very unstable, and it is difficult to find a good indicator of when training is finished.

Problems:
1. In the early stage of training, the real and generated distributions barely overlap, so the JS divergence is essentially a constant ($\log 2$) and provides almost no gradient. If D is trained too well, G suffers from vanishing gradients; this makes Gan training very difficult.
2. The original author's improved generator loss amounts to a KL-divergence-based objective: it punishes generating unrealistic samples very heavily but punishes missing modes of the real data only lightly, so the generated samples lack diversity and generalize poorly; in practical applications the results are not outstanding.

WGan

wgan: Changes the role of the discriminator. Originally the discriminator was a simple binary classifier judging whether data is real or fake; now it estimates the Wasserstein distance (EM distance) instead, turning the classification problem into a regression problem and avoiding the vanishing-gradient problem of the original Gan.

EM distance explanation: suppose there are two distributions $P_r$ and $P_g$. A transport plan $r(P_r, P_g)$ is a scheme that moves all the mass of $P_r$ onto $P_g$; the average moving distance of the mass under this plan measures how different the two distributions are. Among all plans, the smallest such average moving distance is defined as the EM distance between the two distributions.
wgan uses the EM distance in place of the JS divergence. Since the EM distance cannot be computed directly, wgan proposes a function (the critic) to approximate it; the mathematical tool used is the Kantorovich-Rubinstein duality (which I don't fully understand).
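For reference, the EM distance and the dual form that the wgan critic $f$ approximates, as stated in the Wgan paper, are
$$W(P_r, P_g) = \inf_{\gamma\in\Pi(P_r,P_g)} \mathbb{E}_{(x,y)\sim\gamma}[\,\|x-y\|\,] = \sup_{\|f\|_L\le 1} \mathbb{E}_{x\sim P_r}[f(x)] - \mathbb{E}_{x\sim P_g}[f(x)],$$
where the supremum is taken over 1-Lipschitz functions (this constraint is what the weight clipping discussed below tries to enforce).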

Question: a recent article (March 2021), "Wasserstein GANs Work Because They Fail (to Approximate the Wasserstein Distance)", shows that the discriminator in wgan actually fits the EM distance very poorly, and that when other methods fit the EM distance better, the generator's results get worse instead; why wgan works is therefore still something of a mystery.

WGan-GP

wgan-gp: In practice wgan clips the weights to a small range [-0.01, 0.01], but because the wgan loss wants the gap between the positive and negative terms to be as large as possible, the weights end up concentrated at the two endpoints -0.01 and 0.01, which hurts the network's fitting ability. Wgan-gp adds a gradient penalty instead: the weights are no longer restricted, and a term is added directly to the loss so that the closer the gradient norm is to a chosen constant k, the better.
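For reference, the critic loss with the gradient penalty, as given in the Wgan-gp paper (penalty weight $\lambda$, target norm $k = 1$), is
$$L = \mathbb{E}_{\tilde{x}\sim P_g}[D(\tilde{x})] - \mathbb{E}_{x\sim P_r}[D(x)] + \lambda\,\mathbb{E}_{\hat{x}\sim P_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\big].$$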

Compared with wgan, weight clipping is no longer needed; only a penalty term is added to the discriminator's loss: compute the gradient at a sample point and keep its norm close to k (k = 1). Strictly, the Lipschitz condition would require sampling the entire sample space, which is obviously impossible, so the authors sample the space between real and generated data by linear interpolation: $\hat{x} = \epsilon\, x_{real} + (1-\epsilon)\, x_{fake}$, where $\epsilon$ is a random number in [0, 1].
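A minimal PyTorch sketch of this penalty term (my own helper, assuming image-shaped 4-D inputs and a critic module that returns one score per sample):

```python
import torch

def gradient_penalty(critic, real, fake):
    """Gradient penalty on points interpolated between real and generated samples."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)  # the [0, 1] random number
    interp = eps * real + (1 - eps) * fake
    interp.requires_grad_(True)

    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # keep the graph so the penalty itself can be backpropagated
    )[0]
    grads = grads.view(grads.size(0), -1)
    # Penalize the gradient norm for deviating from k = 1
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```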

LSGan

LSGan: Replaces the cross-entropy loss of the original Gan with a least-squares loss, again to ease Gan training difficulty and the vanishing-gradient problem. In practical use it does not stand out compared with Wgan-gp.
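The least-squares objectives, following the LSGan paper with 0/1 target labels, are
$$\min_D \tfrac{1}{2}\,\mathbb{E}_{x\sim P_r}[(D(x)-1)^2] + \tfrac{1}{2}\,\mathbb{E}_{z}[D(G(z))^2], \qquad \min_G \tfrac{1}{2}\,\mathbb{E}_{z}[(D(G(z))-1)^2].$$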

DCGan

DCgan (Deep Convolutional): Replaces fully connected layers with convolutional layers, uses transposed convolution (deconvolution) for upsampling, applies batchnorm in every layer, and generally uses leakyrelu as the activation function. The simplest modification really is the most effective one for working with images; most simple projects today can be implemented with Dcgan, which makes it very practical.
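A minimal sketch of a Dcgan-style 64x64 generator in PyTorch (the layer sizes are illustrative, not the exact paper configuration; the input is the noise vector reshaped to (N, z_dim, 1, 1)):

```python
import torch.nn as nn

def dcgan_generator(z_dim=100, channels=3, base=64):
    """Transposed convolutions + batchnorm + ReLU, with a tanh output in [-1, 1]."""
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, base * 8, 4, 1, 0, bias=False),     # 1x1   -> 4x4
        nn.BatchNorm2d(base * 8), nn.ReLU(True),
        nn.ConvTranspose2d(base * 8, base * 4, 4, 2, 1, bias=False),  # 4x4   -> 8x8
        nn.BatchNorm2d(base * 4), nn.ReLU(True),
        nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1, bias=False),  # 8x8   -> 16x16
        nn.BatchNorm2d(base * 2), nn.ReLU(True),
        nn.ConvTranspose2d(base * 2, base, 4, 2, 1, bias=False),      # 16x16 -> 32x32
        nn.BatchNorm2d(base), nn.ReLU(True),
        nn.ConvTranspose2d(base, channels, 4, 2, 1, bias=False),      # 32x32 -> 64x64
        nn.Tanh(),
    )
```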
Images I generated myself with Dcgan:
[Figure: sample images generated with Dcgan]

Resolution 64x64.

CGan

Cgan (Conditional Gan): Adds labels to training. A real sample no longer carries just a single 1 marking it as real; it can carry its own class label, and the generator's input also includes the corresponding label information so that it generates data for that label. The advantage is that the generator's output can be controlled. The difference from the traditional Gan is that extra input channels, one per class, are added to both the generator and the discriminator; the channel of the corresponding class is set to 1 and the rest to 0. This is equivalent to training under a given condition.
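A small sketch of the extra-channel conditioning described above (the helper name add_label_channels is mine; labels is a LongTensor of class indices):

```python
import torch

def add_label_channels(images, labels, num_classes):
    """Append num_classes one-hot planes to a batch of shape (N, C, H, W):
    the plane of each sample's class is all 1, the others are all 0."""
    n, _, h, w = images.shape
    onehot = torch.zeros(n, num_classes, h, w, device=images.device)
    onehot[torch.arange(n), labels] = 1.0
    return torch.cat([images, onehot], dim=1)
```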

LapGan

LapGan: Generates high-resolution images using a Laplacian pyramid, stacking Cgans repeatedly to produce images of higher and higher resolution. The generator's job is to generate the levels of the image's Laplacian pyramid.

Training: downsample the training image I0 to get I1, upsample I1 to get L0, and subtract L0 from I0 to get the real residual; feed L0 into the generator to get the generated residual, and hand both to the discriminator for training.
At generation time, starting from a low-resolution image, the generator produces the upsampling residual at each step, and the high-resolution image is synthesized by summation.
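A sketch of how the training targets for one pyramid level could be computed (bilinear resampling is my assumption; the paper's exact filtering may differ):

```python
import torch.nn.functional as F

def pyramid_level_targets(i0):
    """One Laplacian-pyramid step: I1 = downsample(I0), L0 = upsample(I1),
    and the real residual I0 - L0 is what the generator has to imitate."""
    i1 = F.avg_pool2d(i0, 2)
    l0 = F.interpolate(i1, scale_factor=2, mode="bilinear", align_corners=False)
    real_residual = i0 - l0
    return i1, l0, real_residual
```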

PG-Gan

PG-Gan (Progressive Growing): To generate high-resolution images, Pg-Gan changes the whole training process: it starts with 4x4 generated images and lets the generator and discriminator grow step by step until 1024x1024 faces are produced. The results are better than LapGan.

Improvements:
1. Smooth fade-in: when a new resolution is added, the high-resolution output is a blend of the new generator block and an upsampled version of the low-resolution output (see the sketch after this list).

2. Abandon the use of deconvolution and use upsampling + convolution instead.
3. To increase sample diversity, a statistic representing diversity is appended near the end of the discriminator; concretely, the standard deviation across the samples of a minibatch is computed on the feature map.
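A rough sketch of the smooth fade-in from point 1 (the to_rgb layers and the alpha schedule are assumptions about a typical implementation, not the paper's exact code):

```python
import torch.nn.functional as F

def faded_output(x_low, new_block, to_rgb_new, to_rgb_old, alpha):
    """Blend the freshly grown high-resolution branch with a plain
    upsampling of the previous low-resolution output; alpha goes from 0 to 1."""
    old = F.interpolate(to_rgb_old(x_low), scale_factor=2, mode="nearest")
    new = to_rgb_new(new_block(x_low))
    return alpha * new + (1 - alpha) * old
```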

StyleGan

StyleGan: Built on top of Pg-Gan, it adds a noise mapping network so that the style of the output can be controlled.
1. The initial random noise is no longer fed directly into the generator, because the ability to control visual features with the input vector is very limited: the vector must follow the probability density of the training data, so the model cannot map parts of the input vector to individual features, a phenomenon known as feature entanglement. By passing the noise through another (mapping) network first, the model can produce a vector that does not have to follow the training-data distribution and that has less correlation between features.

2. Random noise is added to each module to control small details and increase the randomness of the samples.
3. Batchnorm is dropped and replaced with adain (adaptive instance normalization); see the sketch after this list.
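A minimal sketch of adain (in StyleGan the style_scale and style_bias would come from a learned affine transform of the mapping network's output w; here they are just assumed inputs of shape (N, C, 1, 1)):

```python
import torch  # x, style_scale and style_bias are torch tensors

def adain(x, style_scale, style_bias, eps=1e-5):
    """Normalize each feature map of each sample, then rescale and shift it
    with the style parameters (adaptive instance normalization)."""
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    return style_scale * (x - mean) / std + style_bias
```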

Effects (pictures from other blogs):
[Figure: StyleGan sample results from other blogs]

Gan tuning tips:
1. For image generation, compress the real image pixel values to [-1, 1] and use tanh as the last layer of the generator (see the sketch after this list).
2. When training the generator, label its fake samples as real.
3. Replace relu with leakyrelu, and replace maxpooling with avgpooling.
4. Do not use exactly 0 and 1 as the fake/real labels; use random numbers in 0-0.3 for fake and 0.7-1.2 for real instead.
5. Do not feed the discriminator a mix of real and fake samples in the same batch; it works better when a batch contains only real samples or only generated samples.
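A tiny illustration of tips 1 and 4 (the function names are mine):

```python
import torch

def to_tanh_range(img_uint8):
    """Tip 1: scale [0, 255] pixels to [-1, 1] to match the generator's tanh output."""
    return img_uint8.float() / 127.5 - 1.0

def soft_labels(batch_size, real=True):
    """Tip 4: soft labels of roughly 0.7-1.2 for real samples and 0-0.3 for fake ones."""
    if real:
        return 0.7 + 0.5 * torch.rand(batch_size)
    return 0.3 * torch.rand(batch_size)
```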

Most of the formulas and pictures in this article come from the papers that proposed the corresponding Gans; if you really want to understand how the different Gan variants are implemented, read the papers. The next chapter deals with applying Gans to sequence generation.


Origin blog.csdn.net/yzsjwd/article/details/119330091