Clear images can be generated without any convolution: for the first time, a Chinese doctoral student has tried to build a GAN out of two Transformers.

"Attention is really becoming 'all you need'."

Selected from arXiv. Authors: Yifan Jiang et al. Compiled by the Machine Heart editorial team.

Recently, CV researchers have taken a strong interest in Transformers and have achieved a number of breakthroughs with them. These results suggest that the Transformer has the potential to become a powerful general-purpose model for computer vision tasks such as classification, detection, and segmentation.

A natural question is how far the Transformer can go in computer vision. How does it perform on harder visual tasks, such as image generation with Generative Adversarial Networks (GANs)?

Driven by this curiosity, Yifan Jiang and Zhangyang Wang of the University of Texas at Austin, together with Shiyu Chang of IBM Research, conducted the first experimental study of this question and built a GAN with a pure Transformer architecture, entirely free of convolution, which they named TransGAN. Compared with other Transformer-based vision models, building a GAN with Transformers alone appears more challenging: realistic image generation sets a higher bar than tasks such as classification, and GAN training itself is highly unstable.

Structurally, TransGAN consists of two parts: a memory-friendly Transformer-based generator that progressively increases the feature resolution while reducing the embedding dimension, and a patch-level Transformer-based discriminator.

The researchers also found that TransGAN benefits significantly from data augmentation (more so than standard GANs), from a multi-task co-training strategy for the generator, and from a locally initialized self-attention that emphasizes the smoothness of natural-image neighborhoods. These findings suggest that TransGAN can scale effectively to larger models and higher-resolution image datasets.
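As a rough illustration of the "locally initialized self-attention" idea, one way to restrict each token's attention to its spatial neighbors on the H×W token grid is a boolean attention mask, as in the PyTorch sketch below; the window size and how such a mask would be relaxed during training are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def local_attention_mask(h, w, window=1):
    """Boolean mask of shape (h*w, h*w): True where attention is *blocked*.
    Token i may only attend to tokens within a (2*window+1)^2 spatial neighborhood."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1)         # (h*w, 2) grid coordinates
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)   # Chebyshev distance between tokens
    return dist > window  # can be passed as attn_mask to nn.MultiheadAttention

# Example: an 8x8 token grid where each token sees only its 3x3 neighborhood
mask = local_attention_mask(8, 8, window=1)   # (64, 64) boolean mask
```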

The experiments show that the best-performing TransGAN is highly competitive with current SOTA GANs built on convolutional backbones. Specifically, TransGAN reaches an IS of 10.10 and an FID of 25.32 on STL-10, setting a new SOTA.

The study indicates that GANs may not need to depend on convolutional backbones and the many specialized modules that come with them, and that a pure Transformer is capable enough to generate images.

In the discussion around the paper, some readers joked, "Attention is really becoming 'all you need'."

However, some researchers also voiced a concern: with Transformers sweeping the entire community, how are small laboratories supposed to survive?

If Transformers really do become a must-have across the community, how to improve the computational efficiency of this type of architecture will become a difficult research question.

GAN based on pure Transformer

Transformer encoder as a basic block

The researchers chose the Transformer encoder (Vaswani et al., 2017) as the basic block and tried to make only minimal changes to it. The encoder consists of two parts: the first is a multi-head self-attention module, and the second is a feed-forward MLP (multi-layer perceptron) with a GELU nonlinearity. Layer normalization is applied before both components (Ba et al., 2016), and both are wrapped in residual connections.
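For illustration, such a pre-LayerNorm encoder block can be sketched in PyTorch roughly as follows; this is a minimal sketch, and the embedding width, number of heads, and MLP ratio are placeholder choices rather than the authors' settings.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-LN Transformer encoder block: multi-head self-attention + GELU MLP,
    each preceded by layer normalization and wrapped in a residual connection."""
    def __init__(self, dim=384, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                                   # x: (batch, num_tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        x = x + self.mlp(self.norm2(x))                      # residual around the MLP
        return x
```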

Memory-friendly generator

Transformers in NLP take each word as an input token (Devlin et al., 2018). However, if images were generated pixel by pixel by stacking Transformer encoders in the same way, even a low-resolution image (e.g., 32×32) would yield a long sequence (1,024 tokens), and since self-attention scales quadratically with sequence length, the overhead becomes prohibitive.

To avoid this excessive overhead, the researchers take inspiration from a common design in CNN-based GANs: increasing the resolution iteratively over multiple stages (Denton et al., 2015; Karras et al., 2017). Their strategy is to gradually lengthen the input sequence while reducing the embedding dimension.

As shown on the left of Figure 1 below, the researchers propose a memory-friendly, multi-stage Transformer-based generator:

[Figure 1: The TransGAN architecture, with the memory-friendly Transformer-based generator on the left and the patch-level Transformer-based discriminator on the right.]

Several encoder blocks are stacked in each stage (the defaults are 5, 2, and 2). With this staged design, the researchers gradually increase the resolution of the feature map until it reaches the target resolution H_T×W_T. Specifically, the generator takes random noise as input and passes it through an MLP to obtain a vector of length H×W×C. This vector is reshaped into a feature map of resolution H×W (by default H=W=8), where each point is a C-dimensional embedding. The feature map is then treated as a length-64 sequence of C-dimensional tokens and combined with a learnable positional encoding.
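A rough PyTorch sketch of this input pipeline might look like the following; the noise dimension and embedding width C are placeholders, and a single linear layer stands in for the MLP.

```python
import torch
import torch.nn as nn

class GeneratorInput(nn.Module):
    """Maps a noise vector to an initial 8x8 grid of C-dimensional tokens
    with a learnable positional embedding (dimensions are illustrative)."""
    def __init__(self, noise_dim=128, h=8, w=8, dim=384):
        super().__init__()
        self.h, self.w, self.dim = h, w, dim
        self.mlp = nn.Linear(noise_dim, h * w * dim)              # noise -> vector of length H*W*C
        self.pos_emb = nn.Parameter(torch.zeros(1, h * w, dim))   # learnable positional encoding

    def forward(self, z):                                         # z: (batch, noise_dim)
        x = self.mlp(z).view(-1, self.h * self.w, self.dim)       # (batch, 64, C) token sequence
        return x + self.pos_emb

# Example
tokens = GeneratorInput()(torch.randn(4, 128))                    # -> (4, 64, 384)
```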

Similar to BERT (Devlin et al., 2018), the Transformer encoder used in this study takes the embedded tokens as input and recursively computes the correspondence between every pair of tokens. To synthesize higher-resolution images, the researchers insert an upsampling module, composed of a reshaping step and a pixelshuffle module, after each stage.

Concretely, the upsampling module first reshapes the 1D token sequence into a 2D feature map X_0 of shape H×W×C, then uses the pixelshuffle module to upsample its spatial resolution while downsampling the embedding dimension, producing an output X'_0 of shape 2H×2W×C/4. The 2D feature map X'_0 is then flattened back into a 1D sequence of embedded tokens, where the number of tokens becomes 4HW and the embedding dimension becomes C/4. Thus, at each stage the resolution (H, W) is doubled while the embedding dimension C is reduced to a quarter of its input value. This trade-off moderates the explosion in memory and compute requirements.
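Below is a minimal sketch of such an upsampling step, using PyTorch's pixel-shuffle operator; the dimensions in the example are chosen purely for illustration.

```python
import torch
import torch.nn as nn

class UpsampleStage(nn.Module):
    """Reshape a token sequence into a 2D map, pixel-shuffle to double the spatial
    resolution while quartering the embedding dimension, then flatten back to tokens."""
    def __init__(self, h, w):
        super().__init__()
        self.h, self.w = h, w
        self.shuffle = nn.PixelShuffle(upscale_factor=2)

    def forward(self, x):                                       # x: (batch, h*w, c)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, self.h, self.w)     # 1D tokens -> 2D map (b, c, h, w)
        x = self.shuffle(x)                                     # -> (b, c // 4, 2h, 2w)
        return x.flatten(2).transpose(1, 2)                     # -> (b, 4*h*w, c // 4)

# Example: 64 tokens of dim 384 -> 256 tokens of dim 96
out = UpsampleStage(8, 8)(torch.randn(2, 64, 384))              # (2, 256, 96)
```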

The researchers repeat this process over multiple stages until the resolution reaches (H_T, W_T). The embedding dimension is then projected to 3 to obtain the RGB image.
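Putting the pieces together, a schematic 32×32 generator with the quoted stage depths (5, 2, 2) could be organized as follows. This is only a sketch of the overall data flow built from standard PyTorch layers, not the authors' released implementation; widths, head counts, and the noise size are placeholders.

```python
import torch
import torch.nn as nn

def stage(dim, depth):
    """A stack of pre-LN Transformer encoder layers operating on a token sequence."""
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, dim_feedforward=4 * dim,
                                       activation="gelu", batch_first=True, norm_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class TransGANGeneratorSketch(nn.Module):
    """Schematic 32x32 generator: 8x8 -> 16x16 -> 32x32 token grids, then a linear head to RGB."""
    def __init__(self, noise_dim=128, dim=384):
        super().__init__()
        self.dim = dim
        self.input = nn.Linear(noise_dim, 8 * 8 * dim)
        self.pos = nn.ParameterList([nn.Parameter(torch.zeros(1, n, d))
                                     for n, d in [(64, dim), (256, dim // 4), (1024, dim // 16)]])
        self.stages = nn.ModuleList([stage(dim, 5), stage(dim // 4, 2), stage(dim // 16, 2)])
        self.shuffle = nn.PixelShuffle(2)
        self.to_rgb = nn.Linear(dim // 16, 3)

    def upsample(self, x, h, w):
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.shuffle(x)                                  # doubles H, W; quarters C
        return x.flatten(2).transpose(1, 2)

    def forward(self, z):
        x = self.input(z).view(-1, 64, self.dim)             # noise -> 8x8 grid of tokens
        sizes = [(8, 8), (16, 16), (32, 32)]
        for i, (blocks, (h, w)) in enumerate(zip(self.stages, sizes)):
            x = blocks(x + self.pos[i])                      # encoder blocks at this resolution
            if i < 2:
                x = self.upsample(x, h, w)                   # grow to the next resolution
        rgb = self.to_rgb(x)                                 # project embedding dim to 3 channels
        return rgb.transpose(1, 2).reshape(-1, 3, 32, 32)

img = TransGANGeneratorSketch()(torch.randn(2, 128))          # (2, 3, 32, 32)
```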


Tokenized input for the discriminator

Unlike the generator, which must synthesize every pixel accurately, the proposed discriminator only needs to distinguish real images from fake ones. This allows the researchers to tokenize the input image semantically at a coarser patch level (Dosovitskiy et al., 2020).

As shown on the right of Figure 1 above, the discriminator takes image patches as input. The researchers split the input image into an 8 × 8 grid of patches, where each patch can be regarded as a "word". The 8 × 8 patches are then transformed into a 1D sequence of token embeddings through a linear flattening layer, where the number of tokens is N = 8 × 8 = 64 and the embedding dimension is C. After that, a learnable positional encoding is added and a [cls] token is prepended to the 1D sequence. After passing through the Transformer encoder, the classification head uses only the [cls] token to output the real/fake prediction.
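A corresponding sketch of such a patch-level discriminator is shown below; the depth, width, and head count are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TransGANDiscriminatorSketch(nn.Module):
    """Patch-level discriminator sketch: split a 32x32 image into an 8x8 grid of patches,
    embed each patch with a linear layer, prepend a [cls] token, run a Transformer encoder,
    and predict real/fake from the [cls] output."""
    def __init__(self, img_size=32, grid=8, dim=384, depth=7, heads=4):
        super().__init__()
        self.grid = grid
        patch = img_size // grid                                   # 4x4-pixel patches
        self.embed = nn.Linear(patch * patch * 3, dim)             # linear "flatten" layer
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))            # [cls] token
        self.pos = nn.Parameter(torch.zeros(1, grid * grid + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 1)                              # real/fake logit

    def forward(self, img):                                        # img: (batch, 3, 32, 32)
        b, c, h, w = img.shape
        p = h // self.grid
        x = img.reshape(b, c, self.grid, p, self.grid, p)          # cut into an 8x8 grid of patches
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, self.grid ** 2, p * p * c)
        x = self.embed(x)                                          # (batch, 64, dim) patch tokens
        x = torch.cat([self.cls.expand(b, -1, -1), x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])                                  # classify from the [cls] token

logit = TransGANDiscriminatorSketch()(torch.randn(2, 3, 32, 32))   # (2, 1)
```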

Experiments

Results on CIFAR-10

The researchers compared TransGAN with recent convolution-based GANs on the CIFAR-10 dataset; the results are shown in Table 5 below:

[Table 5: Comparison of TransGAN with convolution-based GANs on CIFAR-10.]

As shown in Table 5 above, TransGAN outperforms AutoGAN (Gong et al., 2019) and also beats many competitors on the IS score, such as SN-GAN (Miyato et al., 2018), the improved MMD-GAN (Wang et al., 2018a), and MGAN (Hoang et al., 2018); it is second only to Progressive GAN and StyleGAN v2.

In terms of FID, the study found that TransGAN even surpasses Progressive GAN and falls only slightly behind StyleGAN v2 (Karras et al., 2020b). Examples generated on CIFAR-10 are visualized in Figure 4 below:

[Figure 4: Visual examples generated by TransGAN.]

Results on STL-10

The researchers also applied TransGAN to STL-10, another popular benchmark with 48×48 resolution. To match the target resolution, they enlarged the input feature map of the first stage from 8×8 = 64 tokens to 12×12 = 144 tokens (so that two upsampling stages yield 12 → 24 → 48). They then compared the proposed TransGAN-XL against both automatically searched and hand-designed ConvNets; the results are shown in Table 6 below:

[Table 6: Comparison of TransGAN-XL with automatically searched and hand-designed ConvNets on STL-10.]

Unlike on CIFAR-10, the study found that TransGAN outperforms all current models here, achieving new SOTA performance in both IS and FID.

High-resolution generation

Since TransGAN performs well on the standard benchmarks CIFAR-10 and STL-10, the researchers applied it to the more challenging CelebA 64 × 64 dataset. The results are shown in Table 10 below:

[Table 10: TransGAN-XL results on CelebA 64 × 64.]

TransGAN-XL achieves an FID of 12.23, showing that it is also suited to higher-resolution tasks. Visual results are shown in Figure 4.

Limitations

Although TransGAN has achieved good results, it still leaves considerable room for improvement compared with the best hand-designed GANs. At the end of the paper, the authors point out several specific directions for improvement:

  • More sophisticated tokenization of G and D, e.g., using semantic grouping (Wu et al., 2020).
  • Pre-training the Transformers with pretext tasks, which may improve on the multi-task co-training (MT-CT) used in this study.
  • Stronger forms of attention, e.g., (Zhu et al., 2020).
  • More efficient forms of self-attention (Wang et al., 2020; Choromanski et al., 2020), which would not only improve model efficiency but also save memory, helping to generate images at even higher resolution.

About the Author


Yifan Jiang is a first-year doctoral student in the Department of Electrical and Computer Engineering at the University of Texas at Austin (he previously studied at Texas A&M University for one year). He received his bachelor's degree from Huazhong University of Science and Technology, and his research interests lie in computer vision, deep learning, and related areas. He currently works mainly on neural architecture search, video understanding, and advanced representation learning, advised by Zhangyang Wang, an assistant professor in the same department.

During his undergraduate studies, Yifan Jiang interned at ByteDance AI Lab. This summer, he will join Google Research as an intern.

Personal homepage: https://yifanjiang.net/

Reference link: https://www.reddit.com/r/MachineLearning/comments/ll30kf/r_transgan_two_transformers_can_make_one_strong/

 
