[Paper] 2102.DALL-E: Zero-Shot Text-to-Image Generation (the start of generating all kinds of imaginative images from text)


Main references:

OpenAI official blog: https://openai.com/blog/dall-e/
2102. DALL-E: Zero-Shot Text-to-Image Generation
2204. DALL-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents
Paper downloads (Baidu Pan): https://pan.baidu.com/s/1KLvYrTTXlDCBv1HfU1q5Kg?pwd=0828
Zhihu analysis of DALL-E: https://zhuanlan.zhihu.com/p/480947973
CSDN (kunli) interpretation, with code: https://blog.csdn.net/u012193416/article/details/126108145
Zhihu Q&A on how DALL-E is implemented: https://www.zhihu.com/question/447757686

Prior knowledge

  1. Resnet (residual network structure, Deep residual learning for image recognition)
  2. Transformer (Attention is all you need)
  3. dVAE (Discrete Variational Autoencoder)
  4. CLIP (Contrastive Language–Image Pre-training)

Method overview

DALL-E consists of three independently trained models: a dVAE, a Transformer, and CLIP. The dVAE is trained in essentially the same way as an ordinary VAE (but with discrete latent codes), and the Transformer is trained with a generative pre-training objective similar to GPT-3.

In the first stage, the 256×256 image is divided into a 32×32 grid of patches, and the encoder of the trained discrete VAE maps each patch to one entry of an 8192-word visual codebook, so each image is represented by 1024 tokens.
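As a rough illustration of this first stage, here is a minimal PyTorch sketch of mapping a 256×256 image to a 32×32 grid of discrete codes. The `DvaeEncoder` below is a toy stand-in, not OpenAI's trained dVAE; only the shapes and the argmax-over-codebook step mirror the description above.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8192   # size of the visual codebook
GRID = 32           # factor-8 downsampling of 256x256 -> 32x32 grid of patches

class DvaeEncoder(nn.Module):
    """Toy stand-in: maps a 256x256 RGB image to 32x32 logits over 8192 codes."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1),    # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(128, VOCAB_SIZE, 4, stride=2, padding=1),  # 64 -> 32
        )

    def forward(self, x):
        return self.net(x)  # (B, 8192, 32, 32) logits over the codebook

encoder = DvaeEncoder()
image = torch.randn(1, 3, 256, 256)               # dummy 256x256 RGB image
logits = encoder(image)                           # (1, 8192, 32, 32)
image_tokens = logits.argmax(dim=1).flatten(1)    # (1, 1024) codes in [0, 8192)
assert image_tokens.shape[1] == GRID * GRID
print(image_tokens.shape)                         # torch.Size([1, 1024])
```

In the real model the dVAE is trained with a gumbel-softmax relaxation so gradients can flow through the discrete choice; at inference time the hard argmax is used, as above.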

In the second stage, the text is encoded with a BPE encoder into at most 256 tokens (shorter texts are padded to 256); these 256 text tokens are concatenated with the 1024 image tokens to form a sequence of length 1280, which is used to train a 12-billion-parameter Transformer.
In the third stage, several candidate images are sampled from the model, and the samples are ranked with the concurrently released CLIP model to select the generated image that best matches the text.
(Overview adapted from Jin Xuefeng's Zhihu answer: https://www.zhihu.com/question/447757686/answer/1764970196)
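The two later stages can be summarised in a short sketch as well. The helpers below (`build_sequence`, `rerank`) and the padding id are my own illustrative stand-ins for the pipeline just described, not DALL-E's actual code; `clip_score` is assumed to return a text-image similarity such as CLIP's cosine similarity.

```python
import torch

TEXT_LEN, IMG_LEN = 256, 1024   # 256 text tokens + 1024 image tokens = 1280
PAD_ID = 0                      # assumed padding id for short captions

def build_sequence(text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    """Stage 2: pad/truncate the BPE text to 256 tokens, then append the 1024 image tokens."""
    text = text_tokens[:TEXT_LEN]
    if text.numel() < TEXT_LEN:
        pad = torch.full((TEXT_LEN - text.numel(),), PAD_ID, dtype=text.dtype)
        text = torch.cat([text, pad])
    return torch.cat([text, image_tokens])   # length 1280, fed to the Transformer

def rerank(candidate_images, caption, clip_score):
    """Stage 3: sort sampled images by CLIP similarity to the caption, best first."""
    return sorted(candidate_images, key=lambda img: clip_score(img, caption), reverse=True)

# Example with dummy tokens:
text_tokens = torch.randint(1, 16384, (40,))    # e.g. 40 BPE ids (the paper reports a 16,384-word text vocabulary)
image_tokens = torch.randint(0, 8192, (1024,))  # visual codes from the dVAE encoder
print(build_sequence(text_tokens, image_tokens).shape)  # torch.Size([1280])
```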

Original paper

Abstract

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset.
These assumptions may involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training.
We describe a simple approach based on a Transformer that autoregressively models the text and image tokens as a single stream of data.
With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

Introduction

Work on text-to-image synthesis began around 2015.

Figure 1

Comparison of original images (top) and reconstructions from the discrete VAE (bottom). The encoder downsamples the spatial resolution by a factor of 8. Although details (for example, the texture of the cat's fur, the writing on the storefront, and the thin lines in the illustration) are sometimes lost or distorted, the main features of each image are still recognizable. A large vocabulary of 8192 codes is used to reduce information loss.
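As a back-of-the-envelope check on these numbers (my own arithmetic, not from the paper): factor-8 downsampling of a 256×256 image leaves a 32×32 grid, i.e. 1024 tokens, and each token drawn from an 8192-word codebook carries 13 bits, so the token representation is roughly 1.6 KB versus about 192 KB of raw RGB pixels.

```python
import math

grid = 256 // 8                                 # 32x32 token grid after factor-8 downsampling
num_tokens = grid * grid                        # 1024 image tokens
bits_per_token = math.log2(8192)                # 13 bits for a vocabulary of 8192
token_bytes = num_tokens * bits_per_token / 8   # 1664 bytes of codes
raw_bytes = 256 * 256 * 3                       # 196608 bytes of raw 8-bit RGB
print(num_tokens, token_bytes, round(raw_bytes / token_bytes))  # 1024 1664.0 118
```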

Figure 2: Images generated in a variety of styles

With varying degrees of reliability, our model appears to be able to combine distinct concepts in plausible ways, create anthropomorphized versions of animals, render text, and perform some kinds of image-to-image translation. Prompts:

  • a: a tapir made of an accordion / a tapir with the texture of an accordion
  • b: an illustration of a hedgehog in a Christmas sweater walking a dog
  • c: a neon sign that reads "backprop" / "backprop" neon sign
  • d: the exact same cat on the top as a sketch on the bottom



Original post: blog.csdn.net/imwaters/article/details/125644146