[Computer Vision | Generative Adversarial Networks] Conditional Generative Adversarial Network (CGAN)

This series of blog posts consists of notes on deep learning / computer vision papers; please credit the source when reposting.

Title: Conditional Generative Adversarial Nets

Link: [1411.1784] Conditional Generative Adversarial Nets (arxiv.org)

Abstract

Generative Adversarial Nets [8] have recently been introduced as a novel approach to training generative models.

In this work, we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data $y$ we wish to condition on to both the generator and the discriminator. We show that this model can generate MNIST digits conditioned on class labels.

We also illustrate how this model can be used to learn a multi-modal model, and provide a preliminary image tagging example in which we show how this approach can generate descriptive tags that are not part of the training labels.

1 Introduction

Generative adversarial networks have recently been introduced as an alternative framework for training generative models that sidesteps many intractable probabilistic computations.

Adversarial networks have the following advantages:

  • Markov chains are never needed; only backpropagation is used to obtain gradients,

  • no inference is required during learning, and

  • a wide variety of factors and interactions can easily be incorporated into the model.

Furthermore, as demonstrated in [8], it can produce state-of-the-art log-likelihood estimates and realistic samples.

In an unconditioned generative model, there is no control over the modes of the data being generated.

However, by conditioning the model on additional information, it is possible to direct the data generation process. Such conditioning could be based on class labels, on partial data for inpainting as in [5], or even on data from a different modality.

In this work, we show how to build conditional generative adversarial networks. For empirical results, we present two sets of experiments: one on the MNIST digit dataset conditioned on class labels, and one on the MIR Flickr 25,000 dataset [10] for multi-modal learning.

2 Related Work

2.1 Multimodal Learning for Image Labeling

Despite the many recent successes of supervised neural networks (especially convolutional networks) [13, 17], it remains challenging to scale these models to accommodate an extremely large number of predicted output categories. A second problem is that most of the work to date has focused on learning a one-to-one mapping from input to output. However, many interesting problems are more naturally thought of as probabilistic one-to-many mappings. For example, in the case of image labeling, there may be many different labels that can appropriately be applied to a given image, and different (human) annotators may use different (but often synonymous or related) terms to describe the same image.

  • One way to solve the first problem

    • is to exploit additional information from other modalities: e.g. learning vector representations of labels using natural language corpora where geometric relationships are semantically meaningful.
    • When making predictions in such a space, we benefit from the fact that even when our prediction is wrong we are often still close to the truth (e.g. predicting "table" instead of "chair"), and from the fact that we can naturally generalize our predictions to labels that were not seen during training.
    • Works such as [3] have shown that even a simple linear mapping from image feature space to word representation space can improve classification performance.
  • One way to solve the second problem

    • is to use a conditional probabilistic generative model: the input is treated as the conditioning variable, and the one-to-many mapping is instantiated as a conditional predictive distribution.
    • [16] took a similar approach to this problem and trained a multimodal deep Boltzmann machine on the MIR Flickr 25,000 dataset, as we did in this work.

Furthermore, in [12] the authors showed how to train a supervised multimodal neural language model and they were able to generate descriptive sentences for images.

3 Conditional Generative Adversarial Networks

3.1 Generative Adversarial Networks

Generative adversarial networks have recently been introduced as a novel approach to training generative models.

They consist of two "adversarial" models: a generative model G, which captures the data distribution; and a discriminative model D, which estimates the probability of a sample coming from training data or G. Both G and D can be nonlinear mapping functions, such as multi-layer perceptrons.

To learn the generator's distribution $p_g$ over data $x$, the generator builds a mapping function from a prior noise distribution $p_z(z)$ to data space, $G(z; \theta_g)$. The discriminator, $D(x; \theta_d)$, outputs a single scalar representing the probability that $x$ came from the training data rather than from $p_g$.

Both G and D are trained simultaneously: we adjust the parameters of G to minimize $\log(1 - D(G(z)))$ and the parameters of D to maximize $\log D(x)$, as if they were playing a two-player minimax game with value function $V(G, D)$:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log(1 - D(G(z)))\right] \tag{1}$$
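
To make Equation 1 concrete, here is a minimal PyTorch sketch of one training step (not from the paper, which predates PyTorch); `G`, `D` and the optimizers `opt_g`, `opt_d` are assumed to be defined elsewhere, and `D` is assumed to output a probability in (0, 1):

```python
import torch

def gan_step(G, D, opt_d, opt_g, x_real, noise_dim=100):
    """One optimisation step of the minimax game in Eq. (1) (sketch)."""
    z = torch.rand(x_real.size(0), noise_dim)   # sample z ~ p_z(z), uniform prior
    x_fake = G(z)

    # Discriminator: maximise log D(x) + log(1 - D(G(z)))
    # (written here as minimising the negation)
    d_loss = -(torch.log(D(x_real)).mean() +
               torch.log(1.0 - D(x_fake.detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: minimise log(1 - D(G(z)))
    g_loss = torch.log(1.0 - D(x_fake)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```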

3.2 Conditional Generative Adversarial Networks

Generative adversarial networks can be extended to a conditional model if both the generator and the discriminator are conditioned on some additional information $y$. Here $y$ can be any kind of auxiliary information, such as class labels or data from other modalities. We can perform the conditioning by feeding $y$ into both the discriminator and the generator as an additional input layer.

In the generator, the prior input noise $p_z(z)$ and $y$ are combined in a joint hidden representation, and the adversarial training framework allows considerable flexibility in how this hidden representation is composed.¹

In the discriminator, $x$ and $y$ are presented as inputs to the discriminative function (in this case again embodied by an MLP).
The objective function of the resulting two-player minimax game is then as in Equation 2:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x \mid y)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log(1 - D(G(z \mid y)))\right] \tag{2}$$
Figure 1 illustrates the structure of a simple conditional adversarial network.

Figure 1: Conditional Generative Adversarial Networks
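
A minimal sketch of the structure in Figure 1, with illustrative layer sizes (not the paper's exact architecture): the condition $y$ (e.g. a one-hot class label) is simply concatenated with $z$ in the generator and with $x$ in the discriminator, so the training step sketched earlier only changes by calling `D(x, y)` and `G(z, y)`:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=100, cond_dim=10, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, 1200), nn.ReLU(),
            nn.Linear(1200, out_dim), nn.Sigmoid())

    def forward(self, z, y):
        return self.net(torch.cat([z, y], dim=1))   # G(z | y)

class ConditionalDiscriminator(nn.Module):
    def __init__(self, in_dim=784, cond_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + cond_dim, 1200), nn.ReLU(),
            nn.Linear(1200, 1), nn.Sigmoid())

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))   # D(x | y)
```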

4 Experimental Results

4.1 Unimodal

We trained a conditional generative adversarial network on MNIST images conditioned on their class labels, encoded as one-hot vectors. In the generator network, a 100-dimensional noise prior $z$ is drawn uniformly from within the unit hypercube. Both $z$ and $y$ are mapped to hidden layers with rectified linear unit (ReLU) activations [4, 11], with layer sizes of 200 and 1000 respectively, before both are mapped to a second, combined hidden ReLU layer of size 1200. We then have a final layer of sigmoid units as output for generating the 784-dimensional MNIST samples.
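
A sketch of this generator in PyTorch (a modern re-statement, not the original Pylearn2 code); the one-hot label dimension of 10 and the layer sizes follow the description above:

```python
import torch
import torch.nn as nn

class MNISTGenerator(nn.Module):
    """z and y are mapped to separate ReLU layers (200 and 1000 units),
    combined into a 1200-unit ReLU layer, then mapped to a 784-dim sigmoid output."""
    def __init__(self):
        super().__init__()
        self.z_branch = nn.Sequential(nn.Linear(100, 200), nn.ReLU())
        self.y_branch = nn.Sequential(nn.Linear(10, 1000), nn.ReLU())
        self.joint = nn.Sequential(
            nn.Linear(200 + 1000, 1200), nn.ReLU(),
            nn.Linear(1200, 784), nn.Sigmoid())

    def forward(self, z, y):
        h = torch.cat([self.z_branch(z), self.y_branch(y)], dim=1)
        return self.joint(h)

# Usage: z drawn uniformly inside the unit hypercube, y a one-hot label.
# g = MNISTGenerator()
# x_fake = g(torch.rand(64, 100), torch.eye(10)[torch.randint(0, 10, (64,))])
```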

The discriminator maps $x$ to a maxout [6] layer with 240 units and 5 pieces, and $y$ to a maxout layer with 50 units and 5 pieces. Both hidden layers are mapped to a joint maxout layer with 240 units and 4 pieces before being fed into the sigmoid output layer. (The exact architecture of the discriminator is not critical, as long as it has sufficient capacity; we find that maxout units are typically well suited to the task.)
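
Since maxout units are less common today, here is a sketch of the described discriminator with a hand-rolled `Maxout` module (a reconstruction, not the original code; dropout omitted for brevity):

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout layer [6]: a linear map to (units * pieces) features,
    followed by a max over the `pieces` dimension."""
    def __init__(self, in_dim, units, pieces):
        super().__init__()
        self.units, self.pieces = units, pieces
        self.linear = nn.Linear(in_dim, units * pieces)

    def forward(self, x):
        out = self.linear(x).view(-1, self.units, self.pieces)
        return out.max(dim=2).values

class MNISTDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.x_branch = Maxout(784, 240, 5)   # image branch
        self.y_branch = Maxout(10, 50, 5)     # label branch
        self.joint = Maxout(240 + 50, 240, 4)
        self.out = nn.Sequential(nn.Linear(240, 1), nn.Sigmoid())

    def forward(self, x, y):
        h = torch.cat([self.x_branch(x), self.y_branch(y)], dim=1)
        return self.out(self.joint(h))
```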

The model was trained using stochastic gradient descent with mini-batches of size 100 and an initial learning rate of 0.1, which was decreased exponentially down to 0.000001 with a decay factor of 1.00004. Momentum was initialised at 0.5 and increased up to 0.7. Dropout [9] with a probability of 0.5 was applied to both the generator and the discriminator, and the point of the best log-likelihood estimate on the validation set was used as the stopping point.
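
A sketch of how this schedule might be implemented, assuming `G` and `D` are the modules from the sketches above; our reading is that the learning rate is divided by 1.00004 after every update until it reaches 1e-6, and the paper does not say how momentum is ramped, so the linear ramp below is an assumption:

```python
import torch

lr, momentum = 0.1, 0.5
opt_d = torch.optim.SGD(D.parameters(), lr=lr, momentum=momentum)
opt_g = torch.optim.SGD(G.parameters(), lr=lr, momentum=momentum)

def update_schedule(step, total_steps):
    """Exponential learning-rate decay with floor, plus a momentum ramp (assumed)."""
    global lr, momentum
    lr = max(lr / 1.00004, 1e-6)
    momentum = min(0.5 + 0.2 * step / total_steps, 0.7)
    for opt in (opt_d, opt_g):
        for group in opt.param_groups:
            group["lr"], group["momentum"] = lr, momentum
```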

Table 1 shows Gaussian Parzen window log-likelihood estimates for the MNIST test data. 1000 samples were drawn from each of the 10 classes, and a Gaussian Parzen window was fit to these samples. We then estimate the log-likelihood of the test set under the Parzen window distribution. (See [8] for more details on how this estimate is constructed.)
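
A minimal sketch of the Gaussian Parzen window estimate, assuming the bandwidth `sigma` is chosen using a validation set as in [8]:

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(samples, test_x, sigma):
    """Average log-likelihood of test points under a Gaussian Parzen window:
    each generated sample is the centre of an isotropic Gaussian with std `sigma`."""
    n, d = samples.shape
    lls = []
    for x in test_x:
        # log of the mixture density (1/n) * sum_i N(x; samples_i, sigma^2 I)
        sq = -0.5 * np.sum((x - samples) ** 2, axis=1) / sigma ** 2
        const = -0.5 * d * np.log(2 * np.pi * sigma ** 2) - np.log(n)
        lls.append(logsumexp(sq) + const)
    return float(np.mean(lls))
```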

Table 1: MNIST log-likelihood estimates based on Parzen windows. We followed the same procedure as [8] to compute these values.

The conditional GAN results we show are comparable to some other network-based results, but are surpassed by several other approaches, including the unconditional GAN. We present these results more as a proof of concept than as a demonstration of efficacy, and believe that with further exploration of hyperparameter space and architecture, the conditional model should match or exceed the unconditional results.

Figure 2 shows some generated samples. Each row is conditioned on a label, and each column is a different generated sample.

Figure 2: Generated MNIST digits, each row conditioned on one label

4.2 Multimodal

Photo sites like Flickr are rich sources of labeled data in the form of images and their associated user-generated metadata (UGM), in particular user tags. User-generated metadata differs from more "canonical" image labeling schemes in that it is typically more descriptive and semantically closer to how humans describe images in natural language, rather than just identifying the objects present in an image. Another aspect of UGM is that synonyms are ubiquitous and different users may use different vocabulary to describe the same concept, so it becomes important to normalize these tags effectively. Conceptual word embeddings [14] can be very useful here, since related concepts end up being represented by similar vectors.

In this section, we demonstrate automated tagging of images with multi-label predictions, using conditional adversarial networks to generate a (possibly multi-modal) distribution of tag vectors conditioned on image features.

For image features, we pre-train a convolutional model similar to the one from [13] on the full ImageNet dataset with 21,000 labels [15]. We use the output of the last fully connected layer, with 4096 units, as the image representation.

For the word representation, we first gather a corpus of text from the concatenation of user tags, titles and descriptions in the metadata of the YFCC100M² dataset. After pre-processing and cleaning the text, we train a skip-gram model [14] with a word-vector size of 200. We omit any word appearing fewer than 200 times in the vocabulary, which leaves a dictionary of size 247,465.
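
A sketch of this step using gensim's Word2Vec (our choice of tooling, not the paper's); `tag_sentences` is a hypothetical variable holding the tokenized tag/title/description corpus, and the window size and worker count are assumptions:

```python
from gensim.models import Word2Vec

# Skip-gram model, 200-dimensional vectors, dropping words seen fewer than 200 times.
w2v = Word2Vec(sentences=tag_sentences, vector_size=200, sg=1,
               min_count=200, window=5, workers=4)
print(len(w2v.wv))           # vocabulary size (247,465 in the paper)
word_vec = w2v.wv["flower"]  # 200-dimensional vector for an example tag
```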

During training of the adversarial network we keep the convolutional model and the language model fixed; backpropagating through these models is left as future work.

For our experiments, we use the MIR Flickr 25,000 dataset [10] and extract image and tag features using the convolutional model and language model described above. Images without any tags are omitted, and annotations are treated as extra tags. The first 150,000 examples are used as the training set; images with multiple tags are repeated in the training set, once for each associated tag.

For evaluation, we generate 100 samples for each image and, for each sample, find the 20 closest words in the vocabulary by cosine similarity of their vector representations. We then select the 10 most frequent words across all 100 samples. Table 4.2 shows some samples of user-assigned tags and annotations alongside the generated tags.
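
A sketch of this evaluation procedure with NumPy; `samples` holds the 100 generated tag vectors for one image, and `vocab_vecs` / `vocab_words` hold the skip-gram vectors and their corresponding words:

```python
import numpy as np
from collections import Counter

def generate_tags(samples, vocab_vecs, vocab_words, top_words=20, final=10):
    """For each generated tag vector, take the `top_words` nearest vocabulary words
    by cosine similarity, then keep the `final` most frequent words overall."""
    # normalize so that a dot product equals cosine similarity
    v = vocab_vecs / np.linalg.norm(vocab_vecs, axis=1, keepdims=True)
    s = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    counts = Counter()
    for sims in s @ v.T:                       # one row per generated sample
        nearest = np.argsort(-sims)[:top_words]
        counts.update(vocab_words[i] for i in nearest)
    return [w for w, _ in counts.most_common(final)]
```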

The generator of the best working model receives Gaussian noise of size 100 as the noise prior and maps it to a 500-dimensional ReLU layer. The 4096-dimensional image feature vector is mapped to a 2000-dimensional ReLU hidden layer. Both of these layers are mapped to a joint 200-dimensional linear layer, which outputs the generated word vectors.

The discriminator consists of 500- and 1200-dimensional ReLU hidden layers for the word vectors and image features respectively, and a maxout layer with 1000 units and 3 pieces as the joint layer, which finally feeds into a single sigmoid unit.
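
A sketch of this generator/discriminator pair, assuming the `Maxout` module from the MNIST sketch above is in scope (dropout again omitted); the layer sizes follow the two paragraphs above:

```python
import torch
import torch.nn as nn

class TagGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.z_branch = nn.Sequential(nn.Linear(100, 500), nn.ReLU())
        self.img_branch = nn.Sequential(nn.Linear(4096, 2000), nn.ReLU())
        self.out = nn.Linear(500 + 2000, 200)   # linear output: a generated word vector

    def forward(self, z, img_feat):
        h = torch.cat([self.z_branch(z), self.img_branch(img_feat)], dim=1)
        return self.out(h)

class TagDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_branch = nn.Sequential(nn.Linear(200, 500), nn.ReLU())
        self.img_branch = nn.Sequential(nn.Linear(4096, 1200), nn.ReLU())
        self.joint = Maxout(500 + 1200, 1000, 3)
        self.out = nn.Sequential(nn.Linear(1000, 1), nn.Sigmoid())

    def forward(self, word_vec, img_feat):
        h = torch.cat([self.word_branch(word_vec), self.img_branch(img_feat)], dim=1)
        return self.out(self.joint(h))
```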

The model was trained using stochastic gradient descent with mini-batches of size 100 and an initial learning rate of 0.1, which was decreased exponentially down to 0.000001 with a decay factor of 1.00004. Momentum was initialised at 0.5 and increased up to 0.7. Dropout with a probability of 0.5 was applied to both the generator and the discriminator.

Hyperparameter and architecture choices were obtained through cross-validation and a mixture of random grid search and manual selection (albeit within a limited search space).

5 Future Work

The results shown in this paper are very preliminary, but they demonstrate the potential of conditional adversarial networks and show promise for interesting and useful applications.

In future explorations between now and the workshop, we expect to present more complex models, as well as a more detailed and thorough analysis of their performance and properties.

Table 2: Samples of generated tags

Also, in the current experiments we only use each tag individually. By using multiple tags at the same time (effectively posing the generative problem as one of "set generation"), we hope to achieve better results.

Another obvious direction for future work is to build joint training schemes to learn language models. Work such as [12] shows that we can learn appropriate language models for specific tasks.

Acknowledgements

This work was developed within the Pylearn2 framework [7], and we would like to thank its developers. We would also like to thank Ian Goodfellow for helpful discussions during his affiliation with the Université de Montréal. The authors gratefully acknowledge the support of the vision and machine learning and production engineering teams at Flickr (in alphabetical order: Andrew Stadlen, Arel Cordero, Clayton Mellina, Cyprien Noel, Frank Liu, Gerry Pesavento, Huy Nguyen, Jack Culpepper, John Ko, Pierre Garrigues, Rob Hess, Stacey Svetlichnaya, Tobi Baumgartner, and Ye Lu).

References

  1. Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In ICML’2013.
  2. Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014). Deep generative stochastic networks trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning (ICML’14).
  3. Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al. (2013). Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129.
  4. Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323.
  5. Goodfellow, I., Mirza, M., Courville, A., and Bengio, Y. (2013a). Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 548–556.
  6. Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013b). Maxout networks. In ICML’2013.
  7. Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.
  8. Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In NIPS’2014.
  9. Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
  10. Huiskes, M. J. and Lew, M. S. (2008). The mir flickr retrieval evaluation. In MIR ’08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA. ACM.
  11. Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In ICCV’09.
  12. Kiros, R., Zemel, R., and Salakhutdinov, R. (2013). Multimodal neural language models. In Proc. NIPS Deep Learning Workshop.
  13. Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS’2012).
  14. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track.
  15. Russakovsky, O. and Fei-Fei, L. (2010). Attribute learning in large-scale datasets. In European Conference of Computer Vision (ECCV), International Workshop on Parts and Attributes, Crete, Greece.
  16. Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann machines. In NIPS’2012.
  17. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2014). Going deeper with convolutions. arXiv preprint arXiv:1409.4842.

  1. For now we simply feed the conditioning input and the noise prior as inputs to a single hidden layer of an MLP, but one could imagine using higher-order interactions, allowing for complex generation mechanisms that would be extremely difficult to work with in a traditional generative framework. ↩︎

  2. Yahoo Flickr Creative Common 100M dataset: http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67. ↩︎


Source: blog.csdn.net/I_am_Tony_Stark/article/details/132231270