Advanced Deep Learning [9]: An overview of generative adversarial networks (GANs), representative variant models, training strategies, GAN applications in computer vision and common datasets, and approaches to open problems

[Introduction to advanced deep learning] must-see series, including activation function, optimization strategy, loss function, model tuning, normalization algorithm, convolution model, sequence model, pre-training model, adversarial neural network, etc.

This column is mainly intended to help beginners grasp the relevant knowledge quickly. In follow-up posts we will keep analyzing the principles behind deep learning, so that readers can build up knowledge while working on projects and understand not only what a method does but also why it works.

Disclaimer: Some of the projects are classic ones found online, included so that readers can learn quickly; hands-on material (competitions, papers, practical applications, etc.) will be added in the future.

Column Subscription: Deep Learning Introduction to Advanced Columns

A Survey of Generative Adversarial Networks (GANs)

1. Generation and discrimination

1.1 Generative models

A generative model is a model that describes how data is generated; it is a probabilistic model. Wikipedia defines it as follows: in probability and statistics, a generative model is a model that can randomly generate observable data, especially given certain hidden parameters. It specifies a joint probability distribution over observations and label sequences. In machine learning, generative models can be used to model data directly (for example, sampling data according to the probability density function of a variable), or to establish conditional probability distributions between variables; the conditional distribution can be derived from a generative model using Bayes' theorem. In layman's terms, such a model lets us generate new data that is not in the training set. As shown in Figure 1, given many pictures of horses, the generative model learns from them what a horse looks like. It can then generate very realistic horse images that do not belong to the training set.

Figure 1 Processing flow of a generative model

Most of the models we commonly use are discriminative models. As shown in Figure 2, a discriminative model can be understood simply as classification, for example deciding whether an image shows a cat, a dog, or something else. In Figure 2 we train a discriminative model to tell whether a painting is by Van Gogh: the model extracts features from the paintings in the dataset and classifies them, and so identifies which ones were painted by Van Gogh.

Therefore, the difference between a generative model and a discriminative model is:

  1. A generative model's dataset does not need labels the way a discriminative model's does. (A generative model can also use labels, in which case it generates images of the category given by the label.) Generative modeling is therefore closer to unsupervised learning, while discriminative modeling is a form of supervised learning.

  2. Mathematical representation:

    Discriminative model: p(y|x), the probability of label y given an observation x.

    Generative model: p(x), the probability of observing x. If labels are available, it is expressed as p(x|y), the probability of observing x given label y.
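The link between the two representations is Bayes' theorem: a class-conditional generative model p(x|y) together with a prior p(y) yields the discriminative quantity p(y|x). The following is a minimal sketch with made-up numbers; the labels, the single binary feature, and the function name `posterior` are illustrative assumptions, not part of the original text.

```python
def posterior(prior, likelihood, x):
    """Return p(y | x) for every label y using Bayes' theorem.

    prior:      dict mapping label y -> p(y)
    likelihood: dict mapping label y -> dict mapping observation x -> p(x | y)
    """
    joint = {y: prior[y] * likelihood[y][x] for y in prior}  # p(x, y)
    evidence = sum(joint.values())                            # p(x)
    return {y: joint[y] / evidence for y in joint}


# Hypothetical toy numbers: two labels and one binary feature.
prior = {"horse": 0.5, "zebra": 0.5}
likelihood = {
    "horse": {"striped": 0.1, "plain": 0.9},
    "zebra": {"striped": 0.9, "plain": 0.1},
}

post = posterior(prior, likelihood, "striped")
print(post)
```

With these numbers, observing "striped" makes "zebra" the far more probable label, which is exactly what a discriminative model would learn to output directly.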

Figure 2 Processing flow of a discriminative model

GAN was born from combining the characteristics of generative and discriminative models: the two are trained against each other dynamically, and the optimal solution is sought at the equilibrium between them.

2. What is GAN?

2.1 The adversarial idea

The core idea of GANs is adversarial training. Adversarial ideas have been applied successfully in many fields, such as machine learning, artificial intelligence, computer vision, and natural language processing. AlphaGo's defeat of the world's top human players sparked public interest in artificial intelligence, and an intermediate version of AlphaGo used two networks competing with each other. Adversarial examples are another instance: they are examples that are very different from real examples yet are classified into a real category with high confidence, or that differ only slightly from real examples yet are classified into the wrong category. This is a very active research topic.

Adversarial machine learning is a minimax problem. The defender builds the classifier that we want to work correctly and searches the parameter space for parameters that reduce the classifier's cost as much as possible. At the same time, the attacker searches over the model's inputs to maximize that cost. Adversarial ideas appear in adversarial networks, adversarial learning, and adversarial examples.
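To make the minimax structure concrete, here is a minimal sketch over a tiny, hypothetical cost table: the defender picks a parameter setting, the attacker then picks the input that maximizes the cost, and the defender chooses the setting whose worst case is smallest. The parameter names, inputs, and cost values are invented for illustration.

```python
# cost[theta][x]: classifier cost for parameter setting theta on attacker input x.
# All names and numbers are hypothetical.
cost = {
    "theta_a": {"x1": 0.2, "x2": 0.9},
    "theta_b": {"x1": 0.5, "x2": 0.6},
}


def worst_case(theta):
    """Inner maximization: the attacker picks the input with the highest cost."""
    return max(cost[theta].values())


# Outer minimization: the defender picks the parameters with the best worst case.
best_theta = min(cost, key=worst_case)
print(best_theta, worst_case(best_theta))
```

Note that `theta_a` has the lowest cost on `x1` but is not chosen, because the attacker can answer with `x2`.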

The theoretical background of the adversarial idea is game theory. Game theory is both a branch of modern mathematics and an important subject in operations research. It studies the interaction of formalized incentive structures, and provides mathematical theory and methods for studying phenomena that involve conflict or competition. Game theory considers the predicted and actual behavior of individuals in a game and studies their optimal strategies. Biologists use game theory to understand and predict certain outcomes of evolution.

2.2 Generative Adversarial Network (GAN)

As the name suggests, a GAN is a neural network that generates data through an adversarial process. A GAN generally consists of a generator and a discriminator. The generator keeps producing data that gets closer and closer to the real targets, and the discriminator keeps trying to tell the generator's outputs apart from the real targets. Take image super-resolution as an example. An ordinary neural network uses loss functions to supervise the difference between the generated image and the ground truth from different angles (pixels, feature maps, etc.), and finds by optimization the model parameters that minimize the loss. A GAN instead generates an image with the generator and lets the discriminator dynamically judge the difference between the generated image and the real image. In the figure below, for comparison, the left eye shows the original image and the right eye shows the image after processing by the GAN. The GAN clearly makes the originally blurred image sharper and the fine texture more prominent.

Figure 4 Example of GAN model effect for image super-resolution

Of course, GANs are not limited to image super-resolution; image translation, image understanding, image inpainting, and other tasks can also use GANs.

GANs were proposed to overcome the shortcomings of other generative algorithms. The basic idea behind adversarial learning is that the generator tries to create examples that are as realistic as possible in order to fool the discriminator, while the discriminator tries to distinguish fake examples from real ones. Both the generator and the discriminator improve through adversarial learning. This adversarial process gives GANs a significant advantage over other generative algorithms. More specifically, GANs have the following advantages:

  • GANs can generate a sample in parallel, which is not possible for some other generative algorithms (for example, autoregressive models that produce one element at a time).
  • There are few restrictions on the design of the generator.
  • GANs are subjectively judged to produce better examples than other methods.

The figure below shows a classic GAN. Let us first look at what its two models do. The first is the discriminative model, the Discriminator in the right half of the figure. Common network architectures such as VGG and ResNet are generally used as its backbone. It takes an image as input (such as $X_{real}$ or $X_{fake}$) and outputs a probability used to judge whether the image is real or fake (above 0.5 is judged real, below 0.5 fake); the 0.5 threshold is just a convention. The second is the generative model, the Generator. It is also built on classic network architectures, with convolutional layers, pooling layers, etc. added, removed, or modified for the problem at hand. The Generator takes a set of random numbers Z as input and outputs an image. The figure shows two datasets: a real dataset, and a fake dataset created by the generator network. With this picture in mind, let us restate the goals of a GAN:

  • The goal of the discriminator network: to tell whether an input picture comes from the real sample set or the fake sample set. If the input is a real sample the output should be close to 1, and if the input is a fake sample the output should be close to 0; the discriminator then discriminates well.

  • The goal of the generator network: to produce samples that are as realistic as possible, so that the discriminator cannot tell whether a sample is real or fake.

A GAN thus consists mainly of two parts: the generator network and the discriminator network. The latent variable $ z $ (usually random noise drawn from a Gaussian distribution) is passed through the Generator to produce $ X_{fake} $, and the discriminator is responsible for deciding whether its input is a generated sample $X_{fake}$ or a real sample $X_{real}$.

Figure 5 Schematic diagram of the GAN model structure

The loss is as follows:

$$ \min_G \max_D V(D,G) = \min_G \max_D \; E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log (1 - D(G(z)))] $$

For the discriminator D this is a binary classification problem: maximizing V(D,G) is equivalent to minimizing the binary cross-entropy loss commonly used in such problems. The generator G, in order to fool D as much as possible, needs to maximize the probability D(G(z)) that D assigns to generated samples, that is, to minimize $ \log (1 - D(G(z))) $. (Note: the $ \log D(x) $ term does not depend on the generator G, so it can be ignored here.)
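The relation between V(D,G) and binary cross-entropy can be checked numerically. The sketch below estimates V from a batch of discriminator outputs and compares it with the cross-entropy computed with target 1 for real samples and target 0 for generated ones. The function names and the sample probabilities are illustrative assumptions.

```python
import math


def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1)."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def discriminator_bce(d_real, d_fake):
    """Binary cross-entropy with target 1 for real and target 0 for generated."""
    real_loss = sum(-math.log(p) for p in d_real) / len(d_real)
    fake_loss = sum(-math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_loss + fake_loss


def generator_objective(d_fake):
    """The term of V that depends on G; the generator tries to minimize it."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs D(x) on real data and D(G(z)) on fakes.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.2, 0.1, 0.3]

v = value_function(d_real, d_fake)
print(v, discriminator_bce(d_real, d_fake))
```

The two printed numbers differ only in sign, so maximizing V over D is the same as minimizing the cross-entropy. Raising D(G(z)) lowers `generator_objective`, which is the direction the generator pushes in.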

In actual training, the generator and the discriminator are trained alternately: train D first, then G, and repeat. Note that the generator minimizes ${\max _D}V(D,G)$, that is, the maximum value of $V(D,G)$ over D. To bring V(D,G) close to that maximum, the discriminator is trained for k iterations before each single generator iteration (in practice k is usually set to 1). With the generator G fixed, we can differentiate V(D,G) with respect to D to find the optimal discriminator $ {D^ * }(x) $:

$$ D^*(x) = \frac{p_{data}(x)}{p_g(x) + p_{data}(x)} $$

Substituting the optimal discriminator into the objective function above shows that, under the optimal discriminator, the generator's objective is equivalent to minimizing the JS divergence (JSD, Jensen-Shannon divergence) between ${p_{data}}(x)$ and ${p_g}(x)$. It can be proved that when G and D have sufficient capacity the model converges and the two reach a Nash equilibrium. At that point ${p_{data}}(x) = {p_g}(x)$, and the discriminator predicts probability $ \frac{1}{2} $ for samples from either ${p_{data}}(x)$ or ${p_g}(x)$; that is, generated samples are indistinguishable from real ones.
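The claims in this derivation can be verified on small discrete distributions. The sketch below computes the optimal discriminator, evaluates the value function at it, and checks the standard identity V(D*, G) = 2·JSD(p_data, p_g) − log 4, which is what "equivalent to minimizing the JS divergence" means. The two distributions are arbitrary illustrative choices, and all probabilities are kept strictly positive so that every logarithm is defined.

```python
import math


def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_g(x) + p_data(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]


def value(d, p_data, p_g):
    """V(D, G) written as a sum over a discrete sample space."""
    return sum(pd * math.log(dx) + pg * math.log(1.0 - dx)
               for dx, pd, pg in zip(d, p_data, p_g))


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical distributions over three outcomes.
p_data = [0.1, 0.2, 0.7]
p_g = [0.3, 0.4, 0.3]

d_star = optimal_discriminator(p_data, p_g)
v_star = value(d_star, p_data, p_g)
print(v_star, 2 * jsd(p_data, p_g) - math.log(4))
```

When the generator matches the data exactly, `optimal_discriminator` returns 1/2 everywhere, the JS divergence is zero, and the value function equals −log 4.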

3. The development of GAN

With advances in information technology and the steady upgrading of hardware computing power, artificial intelligence is flourishing. Within machine learning, generative models continue to attract researchers' attention. They are widely used in computer vision for tasks such as image and video generation; in natural language processing for tasks such as information steganography and text generation; and in audio for speech synthesis. In all of these tasks generative models have shown impressive results. GAN research remains active in computer vision, medicine, natural language processing, and other fields. Research on the GAN model itself focuses mainly on two aspects. The first is theoretical: improving the stability of GANs and solving their training problems, or enriching their structure from different perspectives (such as information theory and model efficiency). The second is the variant structures and application scenarios of GANs in different application fields. Besides image synthesis, GANs have been applied successfully to image super-resolution, image captioning, image restoration, text-to-image translation, semantic segmentation, object detection, generative adversarial attacks, machine translation, and image fusion and denoising.

In 2014, Ian Goodfellow proposed the GAN model. Since then, GANs have quickly become the most popular generative model. During this period of rapid growth, many popular architectures have appeared, such as DCGAN, StyleGAN, BigGAN, StackGAN, Pix2pix, Age-cGAN, and CycleGAN. Figure 6 shows the GAN family. The left part mainly contains models improved to solve practical problems such as image translation, text-to-image, image generation, and video translation; the right part mainly addresses problems in the GAN framework itself. Traditional generative models can be traced back to the RBM in the 1980s and to the AutoEncoder, which was later combined with deep neural networks. GANs followed and became the most popular generative model.

Figure 6 Schematic diagram of the development of the classic GAN model

4. Introduction to the classic GAN model

Table 1 Classification of different types of GANs

Table 1 organizes algorithm-based GAN methods, classifying existing methods by training strategy, structural change, training technique, and type of supervision. This article selects classic models and methods for illustration.

4.1 Representative variants of GAN

4.1.1 InfoGAN

The principle is simple. In InfoGAN the input vector z is divided into two parts, c and z'. c can be understood as an interpretable latent variable, and z' as incompressible noise. The hope is that, by constraining the relationship between c and the output, each dimension of c will correspond to a semantic feature of the output; for handwritten digits, for example, stroke thickness or slant. To introduce c, the authors constrain it through mutual information, which can also be understood as a kind of auto-encoding: the generator's output is passed through a classifier to see whether c can be recovered. This can in fact be regarded as an auto-encoder run in reverse. The discriminator otherwise works as in a regular GAN.

Figure 7 Schematic diagram of InfoGAN structure

In practice the classifier and the discriminator share parameters and differ only in the last layer: the classifier outputs a vector, and the discriminator outputs a scalar.

In terms of the loss function, the objective of InfoGAN becomes:

$$ \min_G \max_D V_I(D,G) = V(D,G) - \lambda I(c; G(z,c)) $$

Compared with the original GAN there is an additional term $ \lambda I(c;G(z,c)) $, the mutual information between c and the generator's output. The larger this term, the more closely c is related to the output.
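To see what the mutual information term rewards, the sketch below computes I(c; x) exactly for two tiny joint distributions over a binary code c and a binary "output" x. This is only an illustration of the quantity itself: real generator outputs are high-dimensional, so InfoGAN cannot compute the term exactly and instead optimizes a lower bound through an auxiliary network. The tables and the function name are illustrative assumptions.

```python
import math


def mutual_information(joint):
    """I(c; x) in nats for a joint table with joint[i][j] = p(c=i, x=j)."""
    p_c = [sum(row) for row in joint]        # marginal over c
    p_x = [sum(col) for col in zip(*joint)]  # marginal over x
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (p_c[i] * p_x[j]))
    return mi


# The output is fully determined by c: the code is perfectly recoverable.
informative = [[0.5, 0.0],
               [0.0, 0.5]]

# The output ignores c: the code cannot be recovered at all.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

print(mutual_information(informative), mutual_information(independent))
```

The first table gives log 2 (one full bit of information about c survives in the output), the second gives 0, which is the case the extra term penalizes.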

Why is InfoGAN effective? Intuitively, if each dimension of c has a clear effect on the output, the classifier can recover the original c from x. If c has no appreciable effect on the output, the classifier cannot recover it. The results of InfoGAN are shown below: changing the categorical variable generates different digits, and changing the continuous variables changes the slant and stroke thickness.

Figure 8 InfoGAN results

4.1.2 Conditional GANs (cGANs)

GANs can be extended to a conditional model if both the discriminator and the generator rely on some additional information. The objective function of conditional GANs is:

$$ \min_G \max_D V(D,G) = E_{x \sim p_{data}(x)}[\log D(x|y)] + E_{z \sim p_z(z)}[\log (1 - D(G(z|y)))] $$

We can see that the generator of InfoGAN is similar to that of CGAN. However, InfoGAN's latent code is not known in advance; it is discovered through training. In addition, InfoGAN has an additional network Q that outputs the conditional variable $Q(c|x)$.

Based on CGAN, we can generate samples conditioned on class labels, text, bounding boxes, and keypoints. Text-to-photorealistic image synthesis has been carried out with stacked generative adversarial networks (SGANs). CGANs have been used for convolutional face generation, face aging, image translation, synthesis of outdoor images with specific scene attributes, natural image description, and 3D-aware scene manipulation. Chrysos et al. proposed a robust CGAN. Thekumparampil et al. discussed the robustness of conditional GANs to noisy labels. Conditional CycleGAN uses CGAN with cycle consistency. Mode seeking GANs (MSGANs) propose a simple yet effective regularization term to address the mode collapse problem of CGANs.

The discriminator of the original GAN [3] is trained to maximize the log-likelihood that it assigns to the correct source:

$$ L = E[\log P(S = real|X_{real})] + E[\log (P(S = fake|X_{fake}))] $$

The objective function of the auxiliary classifier GAN (AC-GAN) has two parts: the log-likelihood $L_S$ of the correct source, which is the $L$ given above, and the log-likelihood $L_C$ of the correct class label:

$$ L_C = E[\log P(C = c|X_{real})] + E[\log (P(C = c|X_{fake}))] $$
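The two log-likelihood terms can be estimated from a batch as plain averages. The sketch below assumes the discriminator has already produced, for each sample, the probability it assigns to the correct source and the probability it assigns to the correct class; the function names and the example probabilities are hypothetical.

```python
import math


def mean_log(probs):
    """Average log-probability over a batch."""
    return sum(math.log(p) for p in probs) / len(probs)


def source_log_likelihood(p_real_given_real, p_fake_given_fake):
    """L_S: log-likelihood of the correct source (real vs. generated)."""
    return mean_log(p_real_given_real) + mean_log(p_fake_given_fake)


def class_log_likelihood(p_class_given_real, p_class_given_fake):
    """L_C: log-likelihood of the correct class label."""
    return mean_log(p_class_given_real) + mean_log(p_class_given_fake)


# Hypothetical probabilities assigned to the correct answer for each sample.
l_s = source_log_likelihood([0.9, 0.8], [0.7, 0.6])
l_c = class_log_likelihood([0.95, 0.9], [0.5, 0.4])
print(l_s, l_c)
```

Both quantities are at most 0 and reach 0 only when every correct answer receives probability 1.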

Figure 9 Schematic diagram of cGAN results

Illustration of pix2pix: a conditional GAN is trained to map grayscale → color. The discriminator learns to classify between real {grayscale, color} tuples and fake ones (synthesized by the generator). Unlike in the original GAN, both the generator and the discriminator observe the input grayscale image, and the pix2pix generator has no noise input.

Figure 10 Schematic diagram of generator and discriminator

The entire network structure is shown in the figure above, where z is the random input of the generator network, y is the condition, and x is the real sample. Training proceeds as for ordinary GANs: train the discriminator, then the generator, alternating until the discriminator cannot distinguish real samples from generated ones. The difference is that the discriminator D must distinguish three kinds of input:

  1. The condition and a real picture that matches the condition; the expected output is 1.

  2. The condition and a real picture that does not match the condition; the expected output is 0.

  3. The condition and the output of the generator network; the expected output is 0.

In the cGANs paper, the model was tested on the MNIST dataset. The added condition is the label of each image: the input of the generator G is a random vector together with the label of the picture to be generated, and the input of the discriminator D is the real picture with its corresponding label, and the generated picture. Some generated pictures are shown below.

Figure 11 Schematic diagram of cGAN generation results

When training an ordinary GAN, if only pictures of the digit 0 are used as real samples, the GAN can only generate pictures of that digit; to generate all digits 0-9 in a controlled way, 10 different GANs would have to be trained. With conditions added, that is, the label corresponding to each image sample, the samples of all 10 digits and their labels can be fed into one network at the same time, and a single GAN can generate pictures of all ten digits from 0 to 9.
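One common way to feed a label into the generator is to concatenate a one-hot encoding of the label with the noise vector. The sketch below shows only this input construction; the noise dimension of 4 and the helper names are illustrative assumptions, and concatenation is one possible conditioning mechanism rather than the only one.

```python
import random


def one_hot(label, num_classes):
    """Encode an integer label as a one-hot list of floats."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def generator_input(z, label, num_classes=10):
    """Concatenate the noise vector z with the one-hot condition y."""
    return list(z) + one_hot(label, num_classes)


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(4)]  # hypothetical 4-dimensional noise

# Ask the (not shown) generator for a picture of the digit 7.
conditioned = generator_input(z, label=7)
print(len(conditioned), conditioned[4:])
```

Changing only `label` while keeping `z` fixed is how a single trained network is asked for different digits.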

4.1.3 CycleGAN

CycleGAN is essentially two mirror-symmetric GANs that form a ring network. The two GANs share the two generators and each has its own discriminator, so there are two generators and two discriminators in total. Each one-way GAN has two losses, giving four losses in total.

Figure 12 Cycle Consistency Loss

In the paper, the mean square error loss is finally used to express:

$$ L_{LSGAN}(G, D_Y, X, Y) = E_{y \sim p_{data}(y)}[(D_Y(y) - 1)^2] + E_{x \sim p_{data}(x)}[(1 - D_Y(G(x)))^2] $$
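In code, least-squares adversarial losses are usually written as two separate functions, because the discriminator and the generator use different targets for generated samples. The sketch below follows that common formulation, which is an assumption here: the discriminator is pushed toward 1 on real samples and toward 0 on generated ones, while the generator is pushed toward making the discriminator output 1 on generated samples (the second term of the formula above). Function names and inputs are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def lsgan_discriminator_loss(d_real, d_fake):
    """Discriminator: outputs on real samples -> 1, on generated samples -> 0."""
    return mean([(p - 1.0) ** 2 for p in d_real]) + mean([p ** 2 for p in d_fake])


def lsgan_generator_loss(d_fake):
    """Generator: make the discriminator output 1 on generated samples."""
    return mean([(1.0 - p) ** 2 for p in d_fake])


# Hypothetical discriminator outputs D_Y(y) and D_Y(G(x)).
d_real = [0.9, 1.0]
d_fake = [0.1, 0.0]
print(lsgan_discriminator_loss(d_real, d_fake), lsgan_generator_loss(d_fake))
```

Unlike the logarithmic loss, the squared error stays finite and keeps a non-zero gradient even when the discriminator is very confident.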

The network architecture of CycleGAN is shown in the figure:

Figure 13 Schematic diagram of CycleGAN structure

A typical advantage of CycleGAN over pix2pix is that it can be trained on two image sets without paired examples. The mapping still has to be learned in a way that ensures a meaningful relationship between the input image and the generated image, that is, the input and output share some features.

Briefly, the model takes an input image from domain DA and passes it to the first generator, GeneratorA→B, whose task is to transform an image from domain DA into an image in the target domain DB. The newly generated image is then passed to the other generator, GeneratorB→A, whose task is to convert it back into an image CyclicA in the original domain DA; this is comparable to an autoencoder. The output image must be similar to the original input image, which defines a meaningful mapping that did not originally exist in the unpaired dataset.
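The round trip described above can be written as a cycle consistency loss. The sketch below uses the mean absolute (L1) difference between an image and its reconstruction, applied in both directions; the choice of L1 and the toy "generators" that simply scale pixel values are assumptions made for illustration, with images represented as flat lists of floats.

```python
def l1_distance(a, b):
    """Mean absolute difference between two images stored as flat lists."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def cycle_consistency_loss(gen_ab, gen_ba, batch_a, batch_b):
    """Penalize A -> B -> A and B -> A -> B round trips that do not return home."""
    forward = sum(l1_distance(gen_ba(gen_ab(x)), x) for x in batch_a) / len(batch_a)
    backward = sum(l1_distance(gen_ab(gen_ba(y)), y) for y in batch_b) / len(batch_b)
    return forward + backward


# Hypothetical stand-ins for the two generators.
def gen_ab(image):
    return [2.0 * p for p in image]


def gen_ba(image):
    return [0.5 * p for p in image]


def collapsed_gen_ba(image):
    return [0.0 for _ in image]  # ignores its input entirely


batch_a = [[0.1, 0.4], [0.3, 0.2]]
batch_b = [[0.2, 0.8], [0.6, 0.4]]

consistent = cycle_consistency_loss(gen_ab, gen_ba, batch_a, batch_b)
collapsed = cycle_consistency_loss(gen_ab, collapsed_gen_ba, batch_a, batch_b)
print(consistent, collapsed)
```

A pair of generators that invert each other incurs no penalty, while a generator that discards its input is penalized, which is what ties the output to the input without paired data.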

4.2 Training strategy of GANs

Although a unique solution exists in theory, GANs are difficult to train and often unstable, for a number of reasons. One difficulty is that the optimal weights of a GAN correspond to a saddle point of the loss function rather than a minimum.

There are many papers on GAN training. Yadav et al. stabilized GANs with prediction methods. The two time-scale update rule (TTUR) uses separate learning rates for the discriminator and the generator to ensure that the model converges to a stable local Nash equilibrium. Arjovsky carried out extensive theoretical work toward fully understanding GAN training: analyzing why GANs are hard to train, rigorously studying and demonstrating the saturation and instability problems that arise in training, examining practical and theoretically grounded directions for alleviating them, and introducing new tools for studying them. Liang et al. regard GAN training as a continual learning problem. One way to improve GAN training is to watch for empirical "symptoms" that may occur during training. These include: the generative model collapses, generating very similar samples for different inputs; the discriminator loss quickly converges to zero and provides no gradient updates to the generator; and the model has difficulty converging.
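Why a saddle point is harder to reach than a minimum can be seen in the simplest possible two-player game. The sketch below uses V(x, y) = x·y as a stand-in for the GAN objective (an assumption made purely for illustration): one player minimizes over x, the other maximizes over y, and the only equilibrium is the saddle point (0, 0). Updating both players at once with plain gradient steps moves the iterates away from that point.

```python
import math


def simultaneous_gda(x, y, lr, steps):
    """Simultaneous gradient descent (on x) and ascent (on y) for V(x, y) = x * y.

    Returns the distance from the saddle point (0, 0) after every step.
    """
    distances = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x  # dV/dx = y, dV/dy = x
        x, y = x - lr * grad_x, y + lr * grad_y
        distances.append(math.hypot(x, y))
    return distances


distances = simultaneous_gda(x=1.0, y=1.0, lr=0.1, steps=100)
print(distances[0], distances[-1])
```

Each step multiplies the distance from the equilibrium by sqrt(1 + lr²), so the iterates spiral outward no matter how small the learning rate is. Real GAN objectives are far more complicated, but the example shows why the training procedure itself, and not only the loss, needs care.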

4.2.1 Improved GAN model based on input and output

Improvements based on input and output refer to changes to the input of G and the output of D. In the basic GAN the input of G is a random variable in the latent space, so improvements target either the latent space or the latent variable. Improving the latent variable aims to give finer control over the details of generated samples, while improving the latent space aims to better separate different generation modes. The output of D is a real/fake classification; together with the objective function it can be changed to a multi-class output, or the Softmax layer can be removed so that D outputs a feature vector directly, which optimizes the training process and enables effects such as semi-supervised learning.

The proponents of BiCoGAN argue that the inputs z and c of the model proposed by Mirza are entangled with each other. They therefore add an encoder (denoted E) that learns the inverse mapping from samples back to the two inputs of the generator, encoding c more accurately and improving model performance. As shown in Figure 14, the concatenation of z and c (denoted $\hat{z}$) is fed to the generator to obtain the output $G(\hat{z})$, the real sample x is fed to the encoder to obtain the output E(x), and the discriminator receives $[G(\hat{z}), \hat{z}]$ or $[x, E(x)]$ as input and judges whether it comes from the generator or from real data. Since the label of a real sample x can be regarded as c, and E(x) can be split into z' and c', making c and c' as close as possible is also a training goal, so that the encoder learns the inverse mapping. The paper proposes EFL (extrinsic factor loss) to measure the distance between the two distributions $p_c$ and $p_{c'}$, and proposes the objective function shown in formula (6).

Figure 14 BiCoGAN model

IcGAN (invertible conditional GAN) builds on Mirza's model by adding two pre-trained encoders, $E_z$ and $E_y$. $E_z$ produces the random variable z in the latent space and $E_y$ produces the original condition y; y is then modified to y' and used as the input condition of the cGAN, which controls the details of the synthesized image (as shown in Figure 15). The article proposes three ways of sampling from a distribution to obtain y': when y is a binary vector, KDE (kernel density estimation) can be used to fit the distribution and sample from it; when y is a real-valued vector, label vectors from the training set can be selected and interpolated directly; and when a condition is not unique across the training set, $p_{data}$ can be sampled directly.

Figure 15 IcGAN model

DeLiGAN is suited to scenarios with little training data and many classes. The DeLiGAN model is shown in Figure 16. Gurumurthy et al. proposed parameterizing the latent space with a GMM (Gaussian mixture model): a Gaussian component is selected at random and, via the reparameterization trick, a sample is drawn from that Gaussian distribution. However, the GMM is a simplifying assumption, which limits the model's ability to approximate more complex distributions.

Figure 16 DeLiGAN model
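Sampling from a Gaussian-mixture latent space with reparameterization can be sketched in a few lines: pick a component at random, draw standard normal noise, then shift and scale it with that component's mean and standard deviation. In the sketch below the component means and standard deviations are fixed, made-up numbers and the component is chosen uniformly; both are assumptions for illustration, since in a trained model these would be learned parameters.

```python
import random


def sample_latent(means, sigmas, rng):
    """Draw z from a randomly chosen Gaussian component via reparameterization.

    z = mu_k + sigma_k * eps with eps ~ N(0, I), so z is a deterministic,
    differentiable function of the component parameters mu_k and sigma_k.
    """
    k = rng.randrange(len(means))  # choose a component uniformly at random
    eps = [rng.gauss(0.0, 1.0) for _ in means[k]]
    z = [mu + sigma * e for mu, sigma, e in zip(means[k], sigmas[k], eps)]
    return k, z


# Hypothetical mixture with three 2-dimensional components.
means = [[-1.0, -1.0], [1.0, 1.0], [0.0, 2.0]]
sigmas = [[0.2, 0.2], [0.2, 0.2], [0.2, 0.2]]

rng = random.Random(0)
component, z = sample_latent(means, sigmas, rng)
print(component, z)
```

Because the randomness enters only through `eps`, gradients from the generator can flow back into the means and standard deviations of the mixture.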

The proponents of NEMGAN (noise engineered mode matching GAN) proposed a mode-matching strategy that performs better when the training set is imbalanced. From the generated samples, the corresponding representation in the latent space is learned, giving a prior distribution over the latent modes. The multiple modes of the generated samples are thereby separated and matched with the modes of the real samples, which ensures that the generated samples cover multiple modes of the real data and alleviates mode collapse.

FCGAN (fully conditional GAN) builds on Mirza's model and feeds the additional information c into every layer of the neural network. This improves the quality of conditionally generated samples to some extent, but the model is computationally inefficient when c is complex or a large vector.

SGAN (semi-supervised learning GAN) is a semi-supervised model that can reconstruct label information for a dataset; the model is shown in Figure 17. It turns D into a combination of classifier and discriminator: the output of D covers N classes of real samples plus one class for generated samples, N+1 classes in total. When an unlabeled sample is fed to the model and the discriminator classifies it as real, the discriminator's output can be used as the sample's label.

Figure 17 SGAN model
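How an N+1-way output doubles as both a real/fake decision and a class prediction can be sketched as follows. The convention that the last logit is the "generated" class, the helper names, and the example logits are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret_output(logits):
    """Split an (N+1)-way output into a real/fake probability and a class label.

    The first N entries are the real classes; the last entry is "generated".
    """
    probs = softmax(logits)
    p_generated = probs[-1]
    p_real = 1.0 - p_generated
    predicted_class = max(range(len(probs) - 1), key=lambda k: probs[k])
    return p_real, predicted_class


# Hypothetical logits for N = 3 real classes plus the "generated" class.
p_real, label = interpret_output([0.1, 3.0, 0.2, -1.0])
print(p_real, label)
```

For an unlabeled sample that is judged real, `label` is the pseudo-label the model would assign to it.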

AC-GAN (auxiliary classifier GAN) combines features of the Mirza model and the Odena model. G takes a random variable and class information c as input, and D outputs both the probability that the sample is real or fake and the class probabilities. In conditional generation this method can therefore also output the class to which a generated sample belongs.

4.2.2 Improved GAN model based on generator

Improvements based on the generator aim to raise the quality of generated samples and avoid mode collapse, so that the model can generate multiple types of samples with diversity within each type. Ideas include: using ensemble learning to combine the modes learned by several weak generators; designing a multi-generator architecture in which each generator focuses on learning a specific mode, so that the model as a whole covers multiple modes; and using ideas from multi-agent systems to create competition and cooperation among multiple generators.

The proponents of AdaGAN proposed an iterative training algorithm that incorporates ensemble learning. In each iteration a weak generator is trained on the training samples with their mixture weights, and it is then combined, by weighted mixing, with the generator obtained in the previous iteration to give the result of the current iteration. After several iterations the resulting generator combines the modes learned by multiple weak generators, which alleviates the mode collapse caused by missing modes and yields better samples. However, mixing multiple generator networks makes the input latent space discontinuous, so new latent variables cannot be obtained by interpolation as in the basic GAN.
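The weighted mixing step can be sketched as bookkeeping over a list of generators and their mixture weights. In the sketch below each new weak generator enters the mixture with weight beta and the existing mixture is scaled by 1 − beta. The fixed beta of 0.5 and the three toy generators (1-D Gaussians around different modes) are assumptions for illustration; how the weak generators are trained and how the training samples are reweighted is not shown.

```python
import random


def add_generator(weights, beta):
    """New mixture = (1 - beta) * old mixture + beta * new weak generator."""
    return [w * (1.0 - beta) for w in weights] + [beta]


def sample_from_mixture(generators, weights, rng):
    """Pick one weak generator according to the mixture weights, then sample."""
    chosen = rng.choices(generators, weights=weights, k=1)[0]
    return chosen(rng)


# Hypothetical weak generators, each covering a single mode.
generators = [lambda r: r.gauss(-2.0, 0.1)]
weights = [1.0]

generators.append(lambda r: r.gauss(2.0, 0.1))
weights = add_generator(weights, beta=0.5)

generators.append(lambda r: r.gauss(0.0, 0.1))
weights = add_generator(weights, beta=0.5)

rng = random.Random(0)
samples = [sample_from_mixture(generators, weights, rng) for _ in range(1000)]
print(weights)
```

Sampling first selects a discrete generator index, which is why the latent space of the combined model is no longer a single continuous space that can be interpolated.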

MADGAN (multi-agent diverse GAN) consists of multiple generators and one discriminator; the model is shown in Figure 18. The discriminator judges whether an input sample is real or generated and, if generated, which generator produced it. Each generator focuses on learning a specific mode, and the model lets the generators learn independently. The final generated samples come from multiple generators that have learned different modes, which explicitly guarantees sample diversity and alleviates mode collapse.

Figure 18 MADGAN model

MGAN mitigates the mode collapse problem with an idea similar to that of MADGAN, and its model is shown in Figure 19. It adds a classifier that shares its weights with the discriminator except for the Softmax layer and is used to judge which generator a generated sample came from. The discriminator is only responsible for distinguishing real samples from generated samples.

Figure 19 MGAN model

The MPMGAN (message passing multi-agent GAN) model is a multi-generator model that introduces a message passing mechanism: the output of each generator is passed as a message to the other generators. Under this message sharing mechanism, every generator has two objectives, a cooperation objective and a competition objective. The cooperation objective encourages the samples of the other generators to be better than its own; the competition objective encourages its own samples to be better than those of the other generators. The two objectives together improve the quality of the generated samples.

Figure 20 MPMGAN model

4.2.3 Improved GAN model based on discriminator

During GAN training, the initial generated samples are of poor quality and the discriminator can distinguish them easily, which makes the generator train slowly at the start. Improving the discriminator so that it matches the current ability of the generator helps speed up training, and making it recognize multiple modes can alleviate the mode collapse problem. Ideas for improvement include enabling a single discriminator to recognize more modes, and making each of several discriminators focus on specific modes.

The PacGAN model is shown in Figure 21. PacGAN "packs" several samples from the same source (all real or all generated) and feeds them into the discriminator together, so that the discriminator sees a group of samples at every step. Since the discriminator can perceive the diversity of each group it receives, the generator has to produce diverse samples in order to fool it, which helps alleviate the mode collapse problem.

Figure 21 PacGAN model
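The packing step can be sketched as follows. This is a minimal illustration that concatenates feature vectors; the function name `pack`, the pack size `m`, and the toy data are assumptions, and a real implementation would concatenate image tensors before the discriminator's first layer.

```python
def pack(samples, m):
    """Group consecutive samples into packs of size m by concatenating
    their feature vectors; leftovers that do not fill a pack are dropped."""
    packs = []
    for start in range(0, len(samples) - m + 1, m):
        packed = []
        for s in samples[start:start + m]:
            packed.extend(s)
        packs.append(packed)
    return packs

real = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
fake = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # a collapsed generator

real_packs = pack(real, m=2)   # each pack is labelled "real" as a whole
fake_packs = pack(fake, m=2)   # each pack is labelled "generated" as a whole
```

In a pack from the collapsed generator both halves are identical, which is exactly the lack of diversity a discriminator operating on packs can pick up.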

The authors of the GMAN (generative multi-adversarial networks) model argue that improving the discriminator too much makes the objective function too harsh, which inhibits the learning of the generator. They therefore propose a method that borrows from ensemble learning: multiple discriminators are set up, and the generator learns from their aggregated results, which accelerates the convergence of the network. The GMAN model is shown in Figure 22.

Figure 22 GMAN model

DropoutGAN sets up a group of discriminators. At the end of each training batch, the result of each discriminator is dropped with a certain probability, and the remaining results are aggregated and fed back to the generator, so that the generator is not limited to fooling one specific discriminator. Its authors argue that mode collapse is the generator overfitting to a specific discriminator or to a static ensemble of discriminators: the generator learns special conditions that make the discriminator output "real" instead of learning the sample modes. In DropoutGAN the set of discriminators changes dynamically, so the generator cannot learn such special conditions and has to learn a variety of sample modes, which helps alleviate mode collapse. The DropoutGAN model is shown in Figure 23.

Figure 23 DropoutGAN model
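The drop-then-aggregate step can be sketched as below. Averaging as the aggregation rule and the fallback used when every discriminator is dropped are assumptions of this sketch, as are the function name and the toy loss values.

```python
import random

def aggregate_with_dropout(disc_losses, drop_prob, rng):
    """Average the generator's losses against the discriminators that survive dropout.
    If every discriminator is dropped, keep one at random so the generator
    always receives feedback (an assumption of this sketch)."""
    kept = [loss for loss in disc_losses if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(disc_losses)]
    return sum(kept) / len(kept)

rng = random.Random(42)
losses = [0.9, 0.4, 0.7, 0.2]   # generator loss against each of four discriminators
feedback = [aggregate_with_dropout(losses, drop_prob=0.5, rng=rng) for _ in range(5)]
```

Each call uses a different random subset of discriminators, so the feedback signal the generator sees changes from batch to batch even when the individual losses do not.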

D2GAN (dual discriminator GAN) sets up two discriminators, D1 and D2, which use the forward KL divergence and the reverse KL divergence respectively, to make full use of the complementary statistical properties of the two. D1 is rewarded for correctly judging that a sample comes from the real sample distribution, and D2 is rewarded for correctly judging that a sample comes from the generated sample distribution. The generator tries to fool both discriminators at the same time to improve the quality of generated samples. The D2GAN model is shown in Figure 24.

Figure 24 D2GAN model

The authors of the StabilizingGAN model argue that real samples are concentrated in the space while generated samples are scattered at the beginning, so the discriminator can accurately judge almost all generated samples early in training, which yields uninformative gradients and slows down generator training. They therefore propose to train a group of discriminators with limited views at the same time. Each discriminator looks at one projection of the space, and the generator gradually satisfies the constraints of all discriminators, which stabilizes training and improves the quality of generated samples.

The EBGAN (energy-based GAN) model (shown in Figure 25) introduces the idea of an energy function. The larger the difference between things, the higher the energy, so samples near the real distribution have lower energy. Its authors designed a discriminator composed of an encoder and a decoder and use the MSE (mean squared error) between a sample and its reconstruction as the energy function; the goal of the generator is to produce samples with minimal energy. BEGAN (boundary equilibrium GAN) also uses an autoencoder in place of the discriminator, following the model proposed by ZHAO et al.

Figure 25 EBGAN model
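The idea of an autoencoder acting as an energy function can be shown with a toy sketch. The hand-written `toy_autoencoder` below is a hypothetical stand-in for a trained encoder-decoder; it is assumed to have learned that real data lies on the line where both coordinates are equal.

```python
def mse(a, b):
    """Mean squared error between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def toy_autoencoder(x):
    """Stand-in for the encoder-decoder discriminator: projects every input
    onto the 'data manifold' where all coordinates are equal."""
    mean = sum(x) / len(x)
    return [mean] * len(x)

def energy(x):
    """Energy of a sample = reconstruction error of the autoencoder."""
    return mse(x, toy_autoencoder(x))

real_like = [0.8, 0.8]    # lies on the assumed data manifold
generated = [0.8, -0.4]   # lies far from it

low = energy(real_like)
high = energy(generated)
```

A sample that the autoencoder reconstructs well gets low energy, while a sample far from the learned manifold gets high energy, which is the signal the generator tries to minimize.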

4.2.4 Improved GAN model based on multi-module combination

Besides fitting the real sample distribution better, other directions for improving the GAN model include faster network convergence, sharper generated images, and application to semi-supervised learning. This type of work adjusts the module structure and optimizes different factors so that the model achieves a specific purpose.

GRAN (generative recurrent adversarial networks) is a recurrent generative model that repeatedly generates output conditioned on the previous state, and finally obtains generated samples that are more in line with human intuition.

StackGAN builds a two-stage model on top of the conditional GAN of MIRZA et al. (as shown in Figure 26). It uses a text description as additional information: stage one generates a lower-resolution image and passes it to stage two, and stage two outputs a higher-resolution image, thereby increasing the resolution of the generated image.

Figure 26 StackGAN model

The authors of the ProgressGAN model argue that small-scale images can ensure diversity without losing details. They train a series of progressively larger WGAN-GP networks step by step and finally generate high-definition images.

TripleGAN adds a classifier network that generates labels for real samples, while the generator generates samples for real labels; the discriminator judges whether a received sample-label pair is a real sample with its real label. In this way a better classifier and a better generator are trained at the same time, which extends GANs to labeling unlabeled samples. The TripleGAN model is shown in Figure 27.

Figure 27 TripleGAN model

The authors of the ControlGAN model point out that the discriminator in the MIRZA model is responsible for two tasks at once, classifying real samples and telling real samples from fake ones. They therefore split it into an independent classifier and an independent discriminator, which allows fine-grained control over the characteristics of generated samples during conditional generation. The ControlGAN model is shown in Figure 28.

Figure 28 ControlGAN model

SGAN (several local pairs GAN) uses several local network pairs and one global network pair, and each pair consists of a generator and a discriminator. Each local pair is trained with its own fixed partner, with no information exchanged between different local pairs, and the global pair is trained using the local pairs. Since each local pair can learn one mode, updating the global pair from the local pairs ensures that the global pair combines multiple modes, thereby alleviating the mode collapse problem. The SGAN model is shown in Figure 29.

Figure 29 SGAN model

The authors of the MemoryGAN model argue that the latent space has a continuous distribution, whereas the structures of different classes are discontinuous. They therefore add a memory network that both the generator and the discriminator can access, so that the two learn the cluster distribution of the data and this problem is reduced.

4.2.5 Improved GAN model based on the idea of model crossover

Borrowing ideas from other generative models and from other fields to improve the GAN model can also optimize model performance or expand its application scenarios.

DCGAN replaces the multi-layer perceptron in the basic GAN model with a CNN (convolutional neural network) without pooling layers (as shown in Figure 30), and uses a global pooling layer instead of the fully connected layer to reduce the amount of computation. This improves the quality of generated samples and eases the problem of unstable training.

Figure 30 CNN in DCGAN model

CapsuleGAN uses a capsule network as the framework of the discriminator (as shown in Figure 31). A capsule replaces a neuron and turns the node output from a scalar into a vector. A neuron detects one specific pattern, whereas a capsule can detect a whole class of patterns, which improves the generalization ability of the discriminator and thus the quality of generated samples.

Figure 31 Basic principle of CapsuleGAN

VAEGAN uses GAN to improve the quality of samples generated by VAE. The idea is that in a VAE, the encoder encodes the real distribution into the latent space, and the decoder restores the latent space to the real distribution. The decoder alone can be used as a generative model, but the generated samples are of poor quality, so they are fed into the discriminator.

The authors of the DEGAN (decoder-encoder GAN) model argue that, because the input random variable follows a Gaussian distribution, the generator has to map the entire Gaussian distribution to images, which cannot reflect the real sample distribution. Drawing on the idea of VAE, they add a pre-trained encoder and decoder to the GAN, which map the random variable into a variable containing information about the real sample distribution before it is passed to the GAN, thereby accelerating convergence and improving generation quality.

AAE (adversarial auto-encoder) combines AE and GAN by adding the adversarial idea to the hidden layer of an AE (auto-encoder). The discriminator judges whether a code comes from the hidden layer of the encoder or from samples of the target distribution, which pushes the distribution of the encoder's codes toward that target distribution.

BiGAN uses an encoder to extract features from real samples, uses a decoder in the role of the generator, and uses the discriminator to tell whether a feature-sample pair comes from the encoder or from the decoder. In the end the encoding and decoding mappings approach each other's inverse, so that random variables are mapped to real data. ALI and BiGAN are essentially the same, with only minor differences. The BiGAN model is shown in Figure 32.

Figure 32 BiGAN model

MatAN (matching adversarial network) replaces the discriminator with a Siamese network to take the correct label into account in the generator objective function. Siamese networks are used to measure the similarity between real and generated data. This method is effective for speeding up generator training.

The authors of the SAGAN (self-attention GAN) model observe that GANs synthesize classes with few structural constraints well but have difficulty capturing complex patterns. They address this problem by introducing a self-attention mechanism into the network.

KDGAN uses the idea of KD (knowledge distillation). The model includes a lightweight classifier as the student network, a large and complex teacher network, and a discriminator. Both the classifier and the teacher network generate labels, and the two learn from each other's outputs through mutual distillation, so that a lightweight classifier with better performance can finally be trained.

IRGAN uses a GAN to combine the generative retrieval model and the discriminative retrieval model from the field of IR (information retrieval), and trains the generator with policy-gradient reinforcement learning, achieving better performance on typical information retrieval tasks. The IRGAN model is shown in Figure 33.
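A self-attention layer lets every position of a feature map draw on every other position, not only on its local neighbourhood. The sketch below is a simplified illustration: it uses the raw features as queries, keys, and values (a real layer would use learned projections), and the residual scale `gamma` and all names are assumptions of this sketch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(features, gamma):
    """features: one vector per spatial position.
    Each output is the input plus gamma times an attention-weighted
    sum over the features of ALL positions."""
    out = []
    for query in features:
        weights = softmax([sum(q * k for q, k in zip(query, key)) for key in features])
        attended = [sum(w * v[d] for w, v in zip(weights, features))
                    for d in range(len(query))]
        out.append([x + gamma * a for x, a in zip(query, attended)])
    return out

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unchanged = self_attention(features, gamma=0.0)   # attention switched off
attended = self_attention(features, gamma=1.0)    # attention fully added
```

With `gamma` at zero the layer is the identity, so a network can start from purely local behaviour and gradually learn how much long-range information to mix in.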

Figure 33 IRGAN model

LapGAN borrows ideas from the field of image processing. It uses three groups of cGANs, trains the networks on images downsampled step by step as in a Gaussian pyramid, and upsamples images step by step as in a Laplacian pyramid, so as to reconstruct high-resolution images from blurred ones.

QuGAN combines the idea of GAN with the idea of quantum computing. The generator corresponds to a generating circuit and the discriminator to a discriminating circuit. The generating circuit imitates the wave function of the real circuit as closely as possible, and the discriminating circuit determines, by measuring only an auxiliary bit, whether the input wave function comes from the generating circuit or the real circuit.

The authors of the BayesianGAN model argue that it is difficult to model explicitly the distribution that a GAN learns implicitly, so they propose to use the stochastic gradient Hamiltonian Monte Carlo method to marginalize the weights of the two neural networks, which makes the data representation interpretable.
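The pyramid decomposition itself can be sketched on a one-dimensional signal. Pair averaging as the downsampling filter and value repetition as the upsampling step are simplifying assumptions. In this sketch the residuals are computed from the true signal, whereas in LapGAN a conditional GAN at each level would generate the residual from the upsampled coarse image.

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]

def upsample(signal):
    """Double the resolution by repeating each value."""
    return [v for v in signal for _ in range(2)]

def build_laplacian_pyramid(signal, levels):
    """Return (residuals from fine to coarse, coarsest low-resolution signal)."""
    residuals = []
    current = signal
    for _ in range(levels):
        coarse = downsample(current)
        residuals.append([c - u for c, u in zip(current, upsample(coarse))])
        current = coarse
    return residuals, current

def reconstruct(residuals, coarsest):
    """Go from coarse to fine: upsample, then add back the missing detail."""
    current = coarsest
    for residual in reversed(residuals):
        current = [u + r for u, r in zip(upsample(current), residual)]
    return current

signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
residuals, coarsest = build_laplacian_pyramid(signal, levels=3)
restored = reconstruct(residuals, coarsest)
```

Each residual holds exactly the detail lost by one downsampling step, which is why generating residuals level by level is an easier task than generating the full-resolution image at once.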

5. Application of GAN

GANs are powerful generative models that can generate realistic samples from random vectors. We neither need to know the exact real data distribution nor make any mathematical assumptions about it. These advantages have led to GANs being widely used in image processing, computer vision, sequential data, and other fields. The picture above classifies different GANs by their actual application scenarios, including image super-resolution, image synthesis and processing, texture synthesis, object detection, video synthesis, audio synthesis, multimodal transformation, etc.

5.1 Computer Vision and Image Processing

The most successful applications of GANs are image processing and computer vision, such as image super-resolution, image synthesis and processing, and video processing.

5.1.1 Super-resolution (SR)

Image super-resolution aims to transform low-resolution images into high-resolution ones without distortion while keeping good accuracy and speed. It also addresses pain points in scenarios such as medical diagnosis, video surveillance, and satellite remote sensing, so the technology has considerable practical value. Deep-learning-based image super-resolution can be divided into three types: supervised, unsupervised, and domain-specific. The SR-GAN model uses a parameterized residual network as the generator and a VGG network as the discriminator, and its loss function is a weighted combination of a content loss and an adversarial loss. Compared with other models such as deep convolutional networks, it improves both the accuracy and the speed of super-resolution and learns and represents image texture details better, so it has achieved good results in the field of super-resolution.
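The weighted combination of content loss and adversarial loss can be written as a short sketch. The function names, the feature vectors, and the default `adv_weight` of 1e-3 are assumptions chosen for illustration; in practice the content loss is computed on images or on feature maps from a pre-trained network.

```python
import math

def content_loss(sr_features, hr_features):
    """Mean squared error between features of the super-resolved and the real image."""
    return sum((s - h) ** 2 for s, h in zip(sr_features, hr_features)) / len(hr_features)

def adversarial_loss(d_prob_sr):
    """Generator's adversarial term: small when the discriminator
    assigns a high 'real' probability to the super-resolved image."""
    return -math.log(d_prob_sr)

def generator_loss(sr_features, hr_features, d_prob_sr, adv_weight=1e-3):
    """Weighted combination of content loss and adversarial loss."""
    return content_loss(sr_features, hr_features) + adv_weight * adversarial_loss(d_prob_sr)

sr = [0.5, 0.5]   # hypothetical features of the generated high-resolution image
hr = [0.5, 1.0]   # hypothetical features of the ground-truth image

loss_when_fooling = generator_loss(sr, hr, d_prob_sr=0.9)
loss_when_caught = generator_loss(sr, hr, d_prob_sr=0.1)
```

The content term keeps the output close to the ground truth, while the small adversarial term nudges it toward images the discriminator accepts as real, which is where the sharper texture detail comes from.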

5.1.2 Image synthesis and processing

Human face

  • Pose related: Disentangled Representation learning GAN (DR-GAN) was proposed for pose-invariant face recognition. Huang et al. propose a two-pathway GAN (TP-GAN) for photorealistic frontal view synthesis by simultaneously perceiving local details and global structure. Ma et al. propose a pose-guided person generation network (PG2), which synthesizes images of a person in arbitrary poses from an image of that person and a novel pose. Cao et al. proposed a high-fidelity pose-invariant model for GAN-based high-resolution face frontalization. Siarohin et al. propose deformable GANs for pose-based human image generation. A pose-robust spatial-aware GAN (PSGAN) was proposed for customizable makeup transfer.
  • Portrait related: APDrawingGAN generates artistic portraits from face photos with hierarchical GANs, and is also available as WeChat-based software. GANs are also used in other face-related applications, such as face attribute alteration and portrait editing.
  • Face generation: The quality of faces generated by GANs has improved year by year, as can be seen in Sebastian Nowozin's GAN lecture material. Faces generated by the original GAN have low visual quality and can only serve as a proof of concept. Radford et al. used a better neural network structure, a deep convolutional neural network, to generate faces. Roth et al. address the instability of GAN training, allowing the use of larger architectures such as ResNet. Karras et al. use multi-scale training to generate megapixel face images with high fidelity. In face generation every generated object is a face, and most face datasets consist of people looking directly at the camera, so the task is relatively simple.

General objects

Getting GANs to work on classification datasets such as ImageNet is somewhat harder, because ImageNet has a thousand different object classes, although the quality of the generated images improves every year.

While most papers use GANs to synthesize two-dimensional images, Wu et al. synthesize three-dimensional (3-D) samples using GANs and volumetric convolutions. Wu et al. synthesized novel objects such as cars, chairs, couches, and tables. Im et al. utilize recurrent adversarial networks to generate images. Yang et al. propose a layered recurrent GAN (LR-GAN) for image generation.

Image restoration

Image completion is a traditional image restoration task whose purpose is to fill in the missing or covered parts of an image, and it is widely used in practice. Most completion methods rely on low-level cues: they find small patches in neighboring regions of the image and create synthetic content similar to those patches. Using this principle, Wang Haiyong et al. realized facial expression recognition under partial occlusion with high recognition efficiency. Unlike existing models that search for patches to synthesize the completion, the model proposed in related research literature uses a CNN to generate the content of the missing regions. The algorithm is trained with a reconstruction loss, two adversarial losses, and a semantic parsing loss to guarantee pixel quality and local-global content consistency.

Interaction between humans and the image generation process

Many applications involve interaction between humans and the image generation process. Realistic image manipulation is difficult because it requires modifying the image in a user-controlled manner while keeping it realistic. If the user lacks artistic skill, the edited result easily drifts away from the manifold of natural images. Interactive GAN (iGAN) defines a class of image editing operations and constrains their outputs to always lie on the learned manifold.

5.1.3 Texture Synthesis

Texture synthesis is a classic problem in the image field. Markovian GANs (MGAN) is a texture synthesis method based on GANs. By capturing the texture data of Markovian patches, MGAN can quickly generate stylized videos and images, enabling real-time texture synthesis. Spatial GAN (SGAN) is the first to apply GAN with fully unsupervised learning to texture synthesis. Periodic Spatial GAN (PSGAN) is a variant of SGAN that can learn periodic textures from single images or complex large datasets.

5.1.4 Object Detection

How do we learn object detectors that are invariant to deformation and occlusion? One approach is a data-driven strategy: collect large-scale datasets with object examples under different conditions and hope that the final classifier learns invariance from these examples. But can a dataset cover all deformations and occlusions? Some deformations and occlusions are so rare that they almost never appear in practice, yet we still want a method that is invariant to them. Wang et al. use GANs to generate instances with deformation and occlusion; the goal of the adversary is to generate instances that are difficult for the object detector to classify. Using a segmentor and GANs, SeGAN detects objects in an image that are occluded by other objects. To solve the small object detection problem, Li et al. proposed perceptual GAN, and Bai et al. proposed an end-to-end multi-task GAN (MTGAN).

5.1.5 Video

Villegas et al. propose a deep neural network for predicting future frames in natural video sequences using GANs. Denton and Birodkar proposed a new model named Disentangled Representation Network (DRNET), which learns disentangled image representations from videos based on GANs. Related research literature proposes a new video-to-video synthesis method (video2video) under the framework of generative adversarial learning. MoCoGAN decomposes motion and content to generate videos. GANs have also been used in other video applications such as video prediction and video retargeting.

A video can be decomposed frame by frame into a sequence of images, so video generation and prediction build on GAN-based image generation. Generally speaking, a video consists of a relatively static background and dynamically moving objects. VGAN takes this into account with a two-stream generator: a 3D CNN generates the moving foreground to predict the next frames, and a 2D CNN generates the static background, which keeps the background still. Pose-GAN adopts a hybrid VAE and GAN method, which uses the VAE to estimate future object motion from the current pose and a hidden representation of past poses.
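One way to merge a moving-foreground stream with a static-background stream is a per-pixel mask, sketched below on tiny one-dimensional "frames". The mask, the function name, and the toy values are assumptions of this sketch; the text above does not specify how the two streams are combined.

```python
def compose_frame(foreground, background, mask):
    """Blend per pixel: mask=1 takes the moving foreground, mask=0 the static background."""
    return [m * f + (1.0 - m) * b for f, b, m in zip(foreground, background, mask)]

background = [0.2, 0.2, 0.2, 0.2]        # one static background shared by all frames
foregrounds = [[0.9, 0.0, 0.0, 0.0],     # a bright object moving left to right
               [0.0, 0.9, 0.0, 0.0],
               [0.0, 0.0, 0.9, 0.0]]
masks = [[1.0, 0.0, 0.0, 0.0],           # where the foreground is visible in each frame
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]

video = [compose_frame(f, background, m) for f, m in zip(foregrounds, masks)]
```

Because the background is generated once and reused for every frame, only the foreground stream has to model motion over time.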

Video-based GANs need to consider not only spatial modeling, but also temporal modeling, i.e., the motion between each adjacent frame in a video sequence. MoCoGAN is proposed to learn motion and content in an unsupervised manner, which divides the latent space of images into content space and motion space. DVD-GAN is able to generate longer, higher-resolution videos based on the BigGAN architecture, while introducing a scalable, video-specific generator and discriminator architecture.

5.1.6 Other image and vision applications

GANs have been used in other image processing and computer vision tasks such as object deformation, semantic segmentation, visual saliency prediction, object tracking, image decluttering, natural image matting, image inpainting, image fusion, image completion, image classification.

5.2 Sequential data

GANs have also achieved success on sequential data such as natural language, music, speech, and time series.

5.2.1 Natural Language Processing

The performance of GANs on images has led many researchers to propose GAN-based models for text generation. SeqGAN combines GAN with reinforcement learning, which gets around the fact that an ordinary GAN cannot generate discrete sequences: a training signal can still be passed back to the generator when the generated data is discrete. Such methods can be used in scenarios such as speech data generation and machine translation. The MaskGAN model introduces an Actor-Critic architecture to fill in missing text based on the context.

Beyond text generation, GANs can also generate images from text. StackGAN takes a text description as input and produces a high-resolution image that matches it, which realizes cross-modal generation between text and image. In addition, CookGAN generates food images from text recipes from the perspective of an image causal chain, and TiVGAN generates continuous video sequences from text.
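The reason reinforcement learning helps with discrete outputs can be seen in a toy policy-gradient (REINFORCE) sketch. Here the "sequence" is a single token, the hand-written `toy_discriminator` supplies the reward, and the learning rate and vocabulary size are arbitrary assumptions; the full method additionally needs a way to score partially generated sequences, which is omitted here.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of numbers."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, rng, lr=0.5):
    """Sample a discrete token, score it, and move the logits along
    reward * grad log pi(token), where grad log pi = onehot(token) - probs."""
    probs = softmax(logits)
    token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    reward = reward_fn(token)
    return [l + lr * reward * ((1.0 if i == token else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]

def toy_discriminator(token):
    """Hypothetical discriminator score: token 2 looks 'real', the others do not."""
    return 0.9 if token == 2 else 0.1

rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
for _ in range(500):
    logits = reinforce_step(logits, toy_discriminator, rng)
final_probs = softmax(logits)
```

Sampling a token is not differentiable, but the update only needs the gradient of the log-probability of the sampled token, so the discriminator's score can steer the generator without backpropagating through the sampling step.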

5.2.2 Music

GANs are used to generate music, for example Continuous RNN-GAN (C-RNN-GAN), Objective-Reinforced GAN (ORGAN), and SeqGAN.

5.2.3 Speech and audio

GANs have been used in speech and audio analysis such as synthesis, enhancement, and recognition.

5.3 Other applications

5.3.1 Medical field

In general, there are two approaches to using GANs in medical imaging. The first focuses on the generation phase, which helps capture the underlying structure of the training data in order to create realistic images, enabling GANs to better handle data scarcity and patient privacy issues. The second focuses on the discrimination stage, where the discriminator can be thought of as a prior learned on unprocessed images and can thus act as a detector for abnormal or generated images.

Generation phase: Sandfort et al. propose a data augmentation model based on CycleGAN to improve generalization in CT segmentation. Han et al. proposed a GAN-based two-stage method for unsupervised anomaly detection in MRI scans.

Discrimination stage: Tang et al. proposed a CT image segmentation method based on stacked generative adversarial networks. The first network reduces the noise in the CT image, and the second creates a higher-resolution image with enhanced boundaries. Dou et al. proposed GANs for MRI and CT that handle domain transfer efficiently by aligning the feature spaces of the source and target domains in an unsupervised manner.

5.3.2 3D reconstruction

GANs can complete or reconstruct the three-dimensional shape of objects in three-dimensional space, which improves and extends 3D reconstruction technology. Wang et al. proposed a hybrid structure that combines the 3D-ED-GAN model with a long-term recurrent convolutional network (LRCN). Wu et al. proposed the 3D-VAE-GAN model, which uses recent advances in volumetric convolutional networks and generative adversarial networks to generate 3D objects from a probabilistic space. Related research literature introduces a new GAN training model to produce detailed 3D shapes of objects. The model is trained with the Wasserstein distance with gradient penalty, which improves the realism of the results. This architecture can even reconstruct 3D shapes from 2D images and perform shape completion.

Figure 34 3D of real-world item scans

3D-RecGAN reconstructs the full 3D structure of a given object from a single arbitrary depth view. The model is an encoder-decoder 3D deep neural network built on the GAN structure and combines two objective losses: a loss for 3D object reconstruction and a modified Wasserstein GAN loss. Deep autoencoder GANs (AE-EMD) have also been used for semantic part editing, shape analogy, shape interpolation, and shape completion of 3D objects through algebraic operations.

5.3.3 Data Science

GANs have been used in data generation, neural network generation, data augmentation, spatial representation learning, network embedding, heterogeneous information networks, and mobile user evaluation.

GANs have been widely used in many other domains, such as malware detection, chess games, steganography, privacy protection, social robotics, and network pruning.

6. Common data sets

Generally speaking, the datasets used by image-based GAN methods are built from existing images by upsampling or downsampling and adding perturbations. The processed image and the original image are used as an image pair for training the GAN. For other data such as video and text, existing open-source (or closed-source) datasets are likewise preprocessed, and the original data is used as the label for network training. However, datasets produced in this way are never fully representative of reality. Five datasets used to train GANs are introduced below.

6.1 Abstract Art Dataset

This dataset contains 2782 abstract art images scraped from wikiart.org. The data can be used to build GANs that generate synthetic images of abstract art. The dataset contains images of real abstract art by artists such as Van Gogh, Dali, and Picasso.

6.2 High content screening with C. elegans

These data contain images from screens for new antibiotics using the roundworm C. elegans. The images show roundworms infected with a pathogen called Enterococcus. Some images are of roundworms that have not been treated with the antibiotic ampicillin, while others are of infected roundworms that have been treated with ampicillin. For those interested in applying GANs to an interesting drug discovery problem, this is a great place to start!

6.3 Abnormal lung chest X-ray

This dataset contains clinically labeled chest X-ray images by radiologists. There were 336 chest X-ray images corresponding to tuberculosis and 326 images corresponding to healthy individuals. This is a great data source for those interested in using GANs for medical image data synthesis.

6.4 Fake face

These data actually contain synthetic images of human faces generated by GAN. These images were obtained from the website This person does not exist. The website generates a new fake face image, made with a GAN, every time you refresh the page. This is a great set of data to start with for generating synthetic images with GANs.

6.5 Glasses or no glasses

This dataset contains face images with glasses and face images without glasses. Although these images were generated using a GAN, they can also serve as training data for generating other synthetic images.

7. Frontier issues

Since GANs are popular throughout the deep learning field, many of their limitations have been addressed recently. Nevertheless, some research problems remain open.

7.1 Mode collapse problem

Although existing research has made many attempts and some progress on mode collapse, solving it is still the main challenge faced by GANs.

Several studies have tried to explain why GANs suffer mode collapse. One view treats the generator as a parameterized description of an N-dimensional manifold: when the dimension of the tangent space at some point on the manifold is less than N, moving along certain directions at that point does not change the data, so the generator produces a single kind of data. Another view, based on optimal transport theory, regards the generator's mapping from the latent distribution to the distribution on the manifold as a transport map that has discontinuity points; since neural networks can currently only approximate continuous maps, the result is meaningless outputs and mode collapse. A third observation is that when mode collapse occurs, the singular values of the discriminator's weight matrices drop sharply, so the mode collapse problem can be tackled starting from this singular value problem.

Compared with the training of ordinary neural networks, the GAN model involves a game between the generator G and the discriminator D, which complicates the mode collapse problem. All in all, research on GAN mode collapse is still in its infancy; it starts from various angles, and no unified framework has yet been formed to explain the problem. If future work starts from the game mechanism of GAN and considers the relevant factors of the generator and the discriminator together, it will help to solve this problem.

7.2 Effect of Training Set Samples

The performance of a neural network depends mainly on the characteristics of the model itself and on the real sample set used for training. The training quality of a GAN is likewise affected by its training set. On the one hand, the inherent data distribution of the sample set may affect the training efficiency and generation quality of a GAN. For example, one study defines an intra-class distance set and an inter-class distance set on the sample set and proposes a distance-based separability index to quantify how separable the samples are. It points out that when samples of different classes are mixed under the same distribution, the sample set is hardest to separate, and supervised learning on it is unlikely to perform well. This offers a reference for designing metrics that evaluate the quality of GAN-generated samples. On the other hand, a defining feature of a GAN is that it learns the real sample distribution, so it needs enough real samples to train well. Obtaining a good GAN model from a small training set is both challenging and worthwhile. GANs also place high demands on the quality of the training set, and high-quality datasets are often hard to obtain. Which data affect model performance, how to avoid the negative impact of low-quality samples, and how to relax the quality requirements on the training set are therefore directions for future research.
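
To make the idea of a distance-based separability index concrete, here is a small NumPy sketch. The text above only says that the index is built from the intra-class and inter-class distance sets. Comparing the two sets with a two-sample Kolmogorov-Smirnov statistic and averaging over classes is an assumption of this sketch, and the function names and toy data are hypothetical. Each class needs at least two samples.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))

def separability_index(features, labels):
    """Average, over classes, of the gap between intra- and inter-class distances."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for c in np.unique(labels):
        inside = features[labels == c]
        outside = features[labels != c]
        d_in = pairwise_distances(inside, inside)
        # Keep each unordered pair once and drop the zero self-distances.
        intra = d_in[np.triu_indices(len(inside), k=1)]
        inter = pairwise_distances(inside, outside).ravel()
        scores.append(ks_statistic(intra, inter))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
# Two far-apart clusters: intra- and inter-class distances do not overlap.
far_apart = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
                       rng.normal(100.0, 1.0, size=(200, 2))])
# Both classes drawn from the same distribution: the hardest case to separate.
same_dist = rng.normal(0.0, 1.0, size=(400, 2))

separated = separability_index(far_apart, labels)
mixed = separability_index(same_dist, labels)
print(separated, mixed)
```

Under this formulation the index is close to 1 for well-separated classes and close to 0 when the classes share one distribution, which matches the hardest case described above.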

There have also been studies on reducing the number of training samples required. With transfer learning, a pre-trained generator and discriminator are fine-tuned on a suitable set of samples, but the results are poor when samples are severely insufficient or differ greatly from the pre-training data. Some researchers hold that the singular values of the network weights are related to the semantics of the generated samples. They apply singular value decomposition to the weights and fine-tune only the singular values of the pre-trained model, which allows training with fewer samples. Meta-learning has been applied to GANs and has achieved some results on few-shot training. Other work modifies the GAN loss function with a reconstruction loss and a triplet loss, bringing the idea of self-supervised learning into GANs, and has also achieved some results on few-shot training.
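
The singular-value fine-tuning idea can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the cited method's implementation: a single weight matrix is decomposed as W = U diag(s) V^T, U and V^T stay frozen, and only s is updated by gradient descent. The squared-error objective toward a hypothetical `target` matrix stands in for a real GAN loss.

```python
import numpy as np

def decompose(weight):
    """Thin SVD of a weight matrix: weight = (u * s) @ vh."""
    return np.linalg.svd(weight, full_matrices=False)

def singular_value_gradient(u, vh, grad_weight):
    """Chain rule for W = U diag(s) V^T: dL/ds_i = u_i^T (dL/dW) v_i."""
    return np.einsum("ji,jk,ik->i", u, grad_weight, vh)

def fine_tune_singular_values(weight, grad_fn, steps=100, lr=0.5):
    """Gradient descent on the singular values only; U and V^T stay frozen."""
    u, s, vh = decompose(weight)
    for _ in range(steps):
        current = (u * s) @ vh
        s = s - lr * singular_value_gradient(u, vh, grad_fn(current))
    return (u * s) @ vh, s

rng = np.random.default_rng(0)
pretrained = rng.normal(size=(6, 4))
u0, s0, vh0 = decompose(pretrained)

# Hypothetical target that differs from the pre-trained weights only in its spectrum.
target_s = s0 * np.array([1.5, 1.0, 0.5, 2.0])
target = (u0 * target_s) @ vh0

# Toy loss 0.5 * ||W - target||^2, whose gradient with respect to W is W - target.
tuned, tuned_s = fine_tune_singular_values(pretrained, lambda w: w - target)
print(tuned_s.size, pretrained.size)
```

In this example only 4 values are trained instead of all 24 entries of the matrix, which illustrates why the approach suits small training sets: the number of free parameters is much smaller.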

There have also been studies on relaxing the quality requirements on training samples. NRGAN includes an image generator and a noise generator, which learn the data distribution and the noise distribution of the real samples, respectively. This allows it to generate noise-free samples from a noisy training set without requiring the noise distribution to be specified in advance.

At present, research on how training set samples affect GANs is still in its infancy. Reducing the size of the training set often leads to poor support for complex patterns, while relaxing the quality requirements on training samples relies on too many assumptions. Follow-up work should study the causes of these limitations further and use the findings to make the methods applicable to more realistic scenarios.

7.3 Intersection with research on model robustness issues

The robustness of a neural network is its ability to keep its output stable when the input data are slightly perturbed. Research on GANs and research on neural network robustness are closely related and complement each other. On the one hand, GANs use adversarial examples to train the network model, which helps improve the model's robustness. On the other hand, robustness research is intrinsically related to improving GANs. For example, the loss of a deep neural network is smoother near its peak after adversarial training, and enforcing a Lipschitz condition in a CNN can give the model both good robustness and good accuracy. Such work is a useful reference for improving GANs, especially for evaluating the quality of generated adversarial samples and for studying the generator's objective.

Some researchers describe the robustness of a neural network on a dataset in two ways: adversarial frequency and adversarial severity. Adversarial frequency reflects how likely adversarial perturbations are to occur on the dataset, and adversarial severity reflects how far the output deviates when a perturbation occurs. This approach is useful for evaluating the quality of adversarial sample datasets generated by GANs and can guide generator training. Other researchers proposed a neural network security analysis method based on symbolic linear relaxation, which treats an adversarial perturbation as a special case of a security property violation. The framework can define five different security property constraints, which refines the characterization of adversarial perturbation results. These works contribute to a taxonomy of design goals for GAN generators.
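
The two robustness measures can be illustrated with a toy example. The text above gives only informal descriptions, so the formalization here is an assumption: each sample gets a pointwise robustness value rho (the smallest perturbation size that changes the prediction), adversarial frequency is the fraction of samples with rho at or below a budget `eps`, and adversarial severity is the mean rho among those samples. The linear classifier and the data points are hypothetical and were chosen because rho has a closed form for them.

```python
import numpy as np

def pointwise_robustness(x, w, b):
    """L2 distance from each row of x to the decision boundary w.x + b = 0.

    For a linear classifier this is the smallest perturbation that flips the label.
    """
    return np.abs(x @ w + b) / np.linalg.norm(w)

def adversarial_frequency_and_severity(rho, eps):
    """Frequency: share of samples with rho <= eps. Severity: mean rho among them."""
    rho = np.asarray(rho, dtype=float)
    vulnerable = rho <= eps
    frequency = float(vulnerable.mean())
    severity = float(rho[vulnerable].mean()) if vulnerable.any() else float("nan")
    return frequency, severity

# Hypothetical linear classifier whose decision boundary is the line x0 = 0.
w = np.array([1.0, 0.0])
b = 0.0
points = np.array([[0.1, 5.0], [-0.3, 2.0], [2.0, 0.0], [-4.0, 1.0]])

rho = pointwise_robustness(points, w, b)  # distances 0.1, 0.3, 2.0, 4.0
frequency, severity = adversarial_frequency_and_severity(rho, eps=0.5)
print(frequency, severity)
```

Two of the four points lie within 0.5 of the boundary, so the frequency is 0.5 and the severity is their mean distance, 0.2. For deep networks rho has no closed form and would have to be estimated or bounded.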

8. Summary

In this article, we survey existing Generative Adversarial Network (GAN) models from several angles. First, existing GAN methods are classified by training strategy, structural change, training technique, supervision type, and so on, and the improvements made by different GANs are introduced using classic networks as examples. Then, the basic structure of a GAN is described in detail, along with the development of more recent generative adversarial networks. Next, classic and commonly used GAN models are introduced in terms of practical application scenarios. We also selected five commonly used GAN datasets from Kaggle and introduced each of them, with a link attached to each dataset name. Finally, the current frontier issues of generative adversarial networks are discussed.

Origin blog.csdn.net/sinat_39620217/article/details/130982913