Advanced Deep Learning [8]: Introduction to Basic GAN (Generative Adversarial Network) Concepts: Nash Equilibrium, Generator, Discriminator, Encoder, Decoder, and GAN Application Scenarios

The [Introduction to Advanced Deep Learning] must-read series covers activation functions, optimization strategies, loss functions, model tuning, normalization algorithms, convolutional models, sequence models, pre-trained models, adversarial neural networks, and more.


This column is mainly intended to help beginners grasp the relevant knowledge quickly. Follow-up posts will continue to analyze the principles behind deep learning so that readers can build up theory while practicing projects, knowing not only what works but also why it works.

Disclaimer: Some projects are classic online projects intended for quick learning; practical material (competitions, papers, real-world applications, etc.) will be added later.

Column subscription: Introduction to Advanced Deep Learning column

Introduction to the basic concepts of GANs (Generative Adversarial Networks)

1. Game theory

Game theory can be thought of as a model of interactions between two or more rational agents or players.

Rationality is the key word, because it is the basis of game theory. Simply put, rationality means that every agent knows that all other agents are just as rational, with the same level of understanding and knowledge; and that, given the actions of the other agents, each agent always prefers the higher payoff.

Now that we know what it means to be rational, let's look at some other keywords related to game theory:

  • Games : In general, games are made up of a set of players, actions/strategies and ultimate payoffs. For example: auctions, chess, politics, etc.

  • Player : A player is a rational entity that participates in any game. For example: bidders at auctions, rock-paper-scissors players, politicians participating in elections, etc.

  • Payoffs : Payoffs are the rewards all players receive when they achieve a certain outcome. It can be positive or negative. As we discussed before, each agent is selfish and wants to maximize their payoff.

2. Nash Equilibrium

A Nash equilibrium, also known as a non-cooperative game equilibrium, is the "cornerstone" of game-theoretic methods in artificial intelligence.

A Nash equilibrium is a combination of strategies, one per participant, in which no participant can benefit by changing their strategy alone; that is, everyone's strategy is the best response to everyone else's strategies. In other words, a strategy combination is a Nash equilibrium if no player gains by changing their own strategy while all the others keep theirs unchanged.

The classic example is the Prisoner's Dilemma:

**Background:** Two suspects, A and B, are interrogated separately by the police, so A and B have no chance to collude.

**Rewards and punishments:** The police tell A and B separately: if neither confesses, each is sentenced to 3 years; if both confess, each is sentenced to 5 years; if one confesses and the other does not, the confessor is sentenced to 1 year and the other party to 10 years.

**Result:** Both A and B choose to confess, and each is sentenced to 5 years. This is the Nash equilibrium.

Judging from the rewards and punishments, neither confessing looks like the best solution, with the lightest sentences. In reality this is not what happens: A and B cannot communicate, so each reasons from their own interests.

Suspect A reasons as follows:

  • If B confesses: if I also confess, I am sentenced to 5 years; if I do not confess, I am sentenced to 10 years;

  • If B does not confess: if I confess, I am sentenced to only 1 year; if I do not confess, I am sentenced to 3 years;

So regardless of whether B confesses, confessing is the optimal strategy for A.

Suspect B reasons the same way, so out of their own rationality both choose to confess. This outcome is the Nash equilibrium point.
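To make the reasoning concrete, here is a minimal Python sketch (the payoff numbers follow the story above; the brute-force best-response check is purely illustrative) confirming that (confess, confess) is the only Nash equilibrium:

```python
from itertools import product

# Payoffs from the story above, written as (A's sentence, B's sentence) in years.
payoff = {
    ("confess", "confess"): (5, 5),
    ("confess", "silent"):  (1, 10),
    ("silent",  "confess"): (10, 1),
    ("silent",  "silent"):  (3, 3),
}
strategies = ["confess", "silent"]

def is_nash(a, b):
    # Neither player can reduce their own sentence by deviating alone.
    a_best = all(payoff[(a, b)][0] <= payoff[(alt, b)][0] for alt in strategies)
    b_best = all(payoff[(a, b)][1] <= payoff[(a, alt)][1] for alt in strategies)
    return a_best and b_best

print([pair for pair in product(strategies, strategies) if is_nash(*pair)])
# -> [('confess', 'confess')]
```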

3. Why the input of the GAN generator is noise

The input of the GAN generator is random noise, so that a different image is produced each time. If the input were completely arbitrary, we would not know what features the generated images would have and the results would be uncontrollable, so the noise is usually sampled from a prior random distribution. Commonly used distributions are:

  • Gaussian distribution: the most widely used probability distribution for continuous variables;

  • Uniform distribution: a simple distribution over a continuous variable x.

Introducing random noise makes the generated images diverse. For example, in the figure below, different noise vectors z produce different digits:
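As a minimal sketch (in PyTorch), sampling z from the two priors looks like this; the 100-dimensional noise follows the generator description below, while the batch size of 16 is an assumption:

```python
import torch

z_gauss = torch.randn(16, 100)            # Gaussian (standard normal) noise
z_uniform = torch.rand(16, 100) * 2 - 1   # Uniform noise on [-1, 1]
```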

4. Generator

The generator G is a network that produces images; it can be a multi-layer perceptron, a convolutional network, an autoencoder, etc. It receives a random noise vector z and generates an image from it, denoted G(z). The model structure in the figure below illustrates, step by step, how the generator turns noise into an image:

1) Input: a 100-dimensional noise vector;

2) Two fully connected layers Fc1 and Fc2, followed by a Resize, expand the noise vector into 128 feature maps of size 7×7;

3) Upsampling enlarges the feature maps to 128 feature maps of size 14×14;

4) The first convolution Conv1 produces 64 feature maps of size 14×14;

5) Upsampling enlarges the feature maps to 64 feature maps of size 28×28;

6) The second convolution Conv2 converts them into a single-channel 1×28×28 image, i.e., the generated handwritten digit. In this way the input noise z is gradually transformed into an image.

Tip: The fully connected layers perform a dimensionality transformation into a higher-dimensional space, which makes it convenient to expand the noise vector. Because fully connected layers are relatively expensive to compute, later improved GANs remove them.

Tip: The last layer usually uses the tanh() activation: it acts both as an activation and as a normalization, squashing the generator output to [-1, 1] before it is fed to the discriminator. This also makes GAN training more stable, speeds up convergence, and improves generation quality.
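A minimal PyTorch sketch of the generator described above. The layer names (Fc1, Fc2, Conv1, Conv2) and feature-map sizes follow the text; the width of Fc1, the kernel sizes, and the padding are assumptions for illustration:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        self.fc1 = nn.Linear(z_dim, 1024)               # Fc1: expand the noise vector (width assumed)
        self.fc2 = nn.Linear(1024, 128 * 7 * 7)         # Fc2: enough units for 128 maps of 7x7
        self.up = nn.Upsample(scale_factor=2)           # nearest-neighbour upsampling
        self.conv1 = nn.Conv2d(128, 64, 3, padding=1)   # Conv1: 64 maps of 14x14
        self.conv2 = nn.Conv2d(64, 1, 3, padding=1)     # Conv2: single-channel 28x28 image
        self.act = nn.ReLU()

    def forward(self, z):
        x = self.act(self.fc1(z))
        x = self.act(self.fc2(x))
        x = x.view(-1, 128, 7, 7)                    # Resize: vector -> 128 x 7 x 7
        x = self.act(self.conv1(self.up(x)))         # upsample to 14x14, then Conv1
        return torch.tanh(self.conv2(self.up(x)))    # upsample to 28x28, Conv2, tanh -> [-1, 1]
```

For example, `Generator()(torch.randn(16, 100))` returns a tensor of shape (16, 1, 28, 28).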

5. Discriminator

The inputs to the discriminator D are real images and images produced by the generator; its goal is to distinguish generated images from real images as well as possible, which is a binary classification problem. The model structure in the figure below illustrates how the discriminator tells real images from fake ones:

  • Input: a single-channel image of size 28×28 pixels (not a fixed value; it can be changed to suit the task).

  • Output: a binary classification, i.e., whether the sample is real or fake.

1) Input: a 28×28×1 image;

2) The first convolution conv1 produces 64 feature maps of size 26×26; max pooling pool1 then reduces them to 64 feature maps of size 13×13;

3) The second convolution conv2 produces 128 feature maps of size 11×11; max pooling pool2 then reduces them to 128 feature maps of size 5×5;

4) Resize flattens the multi-dimensional feature maps into a one-dimensional vector;

5) Two fully connected layers fc1 and fc2 produce a vector representation of the original image;

6) Finally, a Sigmoid activation outputs the discrimination probability, i.e., the binary classification result of whether the image is real or fake.
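A minimal PyTorch sketch of the discriminator described above. The feature-map sizes follow the text; the 3×3 kernels with no padding (28→26, 13→11), the width of fc1, and the LeakyReLU activation are assumptions:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, 3)        # conv1: 28x28 -> 26x26, 64 maps
        self.conv2 = nn.Conv2d(64, 128, 3)      # conv2: 13x13 -> 11x11, 128 maps
        self.pool = nn.MaxPool2d(2)             # pool1 / pool2: halve the spatial size
        self.fc1 = nn.Linear(128 * 5 * 5, 256)  # fc1: vector representation (width assumed)
        self.fc2 = nn.Linear(256, 1)            # fc2: a single score
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        x = self.pool(self.act(self.conv1(x)))  # 64 x 13 x 13
        x = self.pool(self.act(self.conv2(x)))  # 128 x 5 x 5
        x = x.flatten(1)                        # Resize: flatten to a vector
        x = self.act(self.fc1(x))
        return torch.sigmoid(self.fc2(x))       # probability that the image is real
```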

6. GAN loss function

During the training process, the goal of the generator G (Generator) is to generate real pictures as much as possible to deceive the discriminator D (Discriminator). The goal of D is to distinguish the pictures generated by G from the real pictures as much as possible. In this way, G and D constitute a dynamic "game process".

What is the outcome of this game? In the ideal case, G can generate images G(z) that "pass for real": D can no longer tell whether an image produced by G is real or fake, so D(G(z)) = 0.5.

The formula is as follows:

$$\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x\sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z\sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$

V(D, G) on the left-hand side measures the difference between generated and real images and takes the form of the binary (real/fake) cross-entropy loss. The objective contains two parts, $\min_G$ and $\max_D$:

$\max_{D} V(D,G)$ means that, with the generator G fixed, the discriminator D is trained by maximizing the cross-entropy V(D,G). D's training goal is to correctly distinguish the real image x from the generated image G(z). The stronger D's discrimination ability, the larger D(x) should be, so the first term on the right grows, and the smaller D(G(z)) should be, so the second term on the right grows as well. V(D,G) therefore becomes larger, which is why the objective takes the maximum over D ($\max_D$).

$\min_{G}\max_{D} V(D,G)$ means that, with the discriminator D fixed, the generator G is trained: G should minimize the cross-entropy V(D,G) that the discriminator has maximized. At this point only the second term on the right matters. G wants its generated images to be as close to real as possible and to deceive the discriminator, i.e., D(G(z)) should be as large as possible, which makes V(D,G) smaller. That is why the objective takes the minimum over G ($\min_G$).

  • $x \sim p_{data}(x)$: represents a real image (a sample from the real data distribution);

  • $z \sim p_{z}(z)$: represents a sample from the Gaussian prior, i.e., the noise;

  • D(x): the probability that x is a real image; an output of 1 means the image is certainly real, and an output of 0 means it cannot be real.

The right-hand side is simply the binary cross-entropy loss on the left expanded and written as expectations over the two probability distributions. For the detailed derivation, see the original paper Generative Adversarial Nets.
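A minimal sketch of objective (1) written with binary cross-entropy in PyTorch. `Generator` and `Discriminator` are the networks sketched above; `real` is a batch of real images, `z` a batch of noise vectors, and the helper names `d_loss` and `g_loss` are introduced here purely for illustration:

```python
import torch
import torch.nn.functional as F

def d_loss(D, G, real, z):
    """Discriminator loss: maximizing V(D,G) over D is equivalent to
    minimizing -log D(x) - log(1 - D(G(z)))."""
    fake = G(z).detach()                 # freeze G while training D
    pred_real, pred_fake = D(real), D(fake)
    return (F.binary_cross_entropy(pred_real, torch.ones_like(pred_real)) +
            F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake)))

def g_loss(D, G, z):
    """Generator loss in the non-saturating form: maximize log D(G(z))
    (see the note in the training section below)."""
    pred_fake = D(G(z))
    return F.binary_cross_entropy(pred_fake, torch.ones_like(pred_fake))
```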

7. Model training

A GAN contains two networks, the generator G and the discriminator D. How do we train them both?

During training, the discriminator D is trained first: real images are given the real label 1 and the fake images produced by the generator G are given the fake label 0; together they form a batch that is fed to D. When computing the loss, D's prediction for the real images is pushed towards "real" and its prediction for the generated fake images is pushed towards "fake". In this step only the discriminator's parameters are updated; the generator's parameters are not.

Then the generator G is trained: Gaussian noise z is fed to G, and the generated fake images are given the real label 1 and passed to the discriminator D. When computing the loss, D's prediction for the generated fake images is pushed towards "real". In this step only the generator's parameters are updated; the discriminator's parameters are not.

Note: Early in training, when G generates poorly, D rejects the generated samples with high confidence because they are clearly different from the training data, so log(1 − D(G(z))) saturates (it is nearly constant and its gradient is close to 0). We therefore train G by maximizing log D(G(z)) instead of minimizing log(1 − D(G(z))); compare this with the second term on the right of formula (1).
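A minimal alternating training loop, using the `Generator`, `Discriminator`, `d_loss`, and `g_loss` sketches above (the learning rate, the Adam betas, and the `dataloader` of real images scaled to [-1, 1] are assumptions for illustration):

```python
import torch

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))

for real, _ in dataloader:                 # real: (B, 1, 28, 28) images in [-1, 1]
    z = torch.randn(real.size(0), 100)     # Gaussian noise

    # Step 1: update only D (G's output is detached inside d_loss)
    opt_d.zero_grad()
    d_loss(D, G, real, z).backward()
    opt_d.step()

    # Step 2: update only G (D's parameters are not stepped here)
    opt_g.zero_grad()
    g_loss(D, G, z).backward()
    opt_g.step()
```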

8. Unstable model training

The main reasons why GAN training is unstable are:

  • Non-convergence: it is difficult to make the two models G and D converge at the same time;

  • Mode collapse: the generator G produces only a single mode or a limited set of modes;

  • Slow training: The gradient of the generator G vanishes.

When training GAN, the following training techniques can be adopted:

1) The activation function of the last layer of the generator uses tanh(), and the output is normalized to [-1, 1];

2) The real image is also normalized to [-1,1];

3) Do not set the learning rate too high; an initial value of 1e-4 is a reasonable reference, and the learning rate can be reduced gradually as training progresses;

4) Prefer the Adam optimizer: SGD is designed for finding a minimum, whereas GAN training is a game problem and tends to oscillate with SGD;

5) Avoid ReLU and MaxPool to reduce the possibility of sparse gradients; use the LeakyReLU activation instead. Downsampling can be replaced by average pooling or strided convolution, and upsampling can use PixelShuffle or ConvTranspose2d with stride;

6) Add noise: adding noise to both real and generated images makes the discriminator's task harder, which helps stability;

7) If there is labeled data, try to use label information for training;

8) Label smoothing: instead of setting the label of a real image to 1, use a slightly lower value such as 0.9, so that the discriminator does not become over-confident in its classification (a small sketch of tips 4 and 8 follows this list).
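A minimal illustration of tips 4 and 8: configuring Adam and smoothing the real labels when computing the discriminator's loss (the betas and the `real_images` batch are assumptions; the 0.9 value follows tip 8):

```python
import torch
import torch.nn.functional as F

opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))  # tip 4: Adam, lr 1e-4

pred_real = D(real_images)                           # real_images assumed scaled to [-1, 1]
smooth_labels = torch.full_like(pred_real, 0.9)      # tip 8: 0.9 instead of 1.0
loss_real = F.binary_cross_entropy(pred_real, smooth_labels)
```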

9. Encoder

The goal of the Encoder is to encode the input into a low-dimensional vector representation, or embedding. The mapping function is as follows:

$$V \to R^{d} \tag{2}$$

The input V is mapped to an embedding $z_i \in R^{d}$, as shown in the figure below:

The Encoder is generally a convolutional neural network, mainly composed of convolutional layers, pooling layers, and BatchNormalization layers. The convolutional layers extract local features of the image, the pooling layers downsample the image and pass scale-invariant features to the next layer, and BN normalizes the distribution of the training images to speed up learning. (The Encoder's network structure is not limited to convolutional neural networks.)

Taking face encoding as an example, the Encoder compresses a face image into a short vector that contains the main information of the face; for example, the elements of the vector may represent skin colour, eyebrow position, eye size, and so on. By learning from many different faces, the encoder captures what faces have in common:
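A minimal convolutional Encoder sketch in PyTorch: convolution + BatchNorm + pooling layers that compress a 1×28×28 image into a d-dimensional embedding (the layer widths and d = 64 are assumptions for illustration):

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                       # 28x28 -> 14x14
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),                       # 14x14 -> 7x7
        )
        self.fc = nn.Linear(64 * 7 * 7, d)         # the short vector (embedding)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))
```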

10. Decoder

The goal of the Decoder is to use the embedding output by the Encoder to decode the structural information about the graph.

The input is the embeddings of a node pair, and the output is a real number that measures the similarity of the two nodes. The mapping relationship is as follows:

$$R^{d} \times R^{d} \to R^{+} \tag{3}$$

The Decoder upsamples the reduced feature vector/feature map and then applies convolution to the upsampled result. The purpose is to recover the geometry of objects and compensate for the loss of detail caused by the pooling layers in the Encoder.

Taking face encoding and decoding as an example, after the Encoder encodes a face, the Decoder learns the characteristics of the face, i.e., it restores the short vector back into a face image, as shown in the figure below:
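A minimal Decoder sketch that mirrors the Encoder above, upsampling the d-dimensional embedding back into a 1×28×28 image (the layer widths are assumptions; tanh keeps the output in [-1, 1] as in the generator):

```python
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.fc = nn.Linear(d, 64 * 7 * 7)
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=2),                # 7x7 -> 14x14
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),                # 14x14 -> 28x28
            nn.Conv2d(32, 1, 3, padding=1), nn.Tanh(),  # image in [-1, 1]
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 7, 7))
```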

11. GAN applications

Let's take a look at some interesting applications of GAN:

  • image generation

    Image generation is a basic problem for generative models, and GANs can generate images of higher quality than previous generative models, such as realistic face images.

  • super resolution

    When an image is enlarged, it becomes blurry. A GAN can expand a 32×32 image into a realistic 64×64 image, increasing the resolution while enlarging it.

  • image restoration

    Complete missing parts of an image; this can also be used to remove tattoos, TV station logos, watermarks, etc.

  • image-to-image translation

    Generate an image in a different style from a given image, for example turning a horse into a zebra or an aerial photo into a map.

  • anime-style scenery

    Convert landscape images into anime-style images

  • cartoon faces

    Generate face images in cartoon style

  • image colorization

    Colorize black-and-white images

  • text to image

    Generate corresponding images based on text descriptions

GANs have many applications, far beyond those listed above.
