Vector Quantized Diffusion Model for Text-to-Image Synthesis

Source video: CVPR 2022 Paper Sharing Session - Text-to-Image Synthesis Based on VQ-Diffusion, by Microsoft Technology (bilibili): https://www.bilibili.com/video/BV13Y4y1r7CH/?spm_id_from=333.1007.top_right_bar_window_dynamic.content.click&vd_source=4aed82e35f26bb600bc5b46e65e25c22

Methods before 2021 were mostly GAN-based: text and noise are fed into a generator network, and after an image is generated, a discriminator judges both whether it matches the text and whether it looks real. This approach has two drawbacks: 1. it can only model a single domain, e.g. a GAN trained on faces can only generate face images; 2. it struggles to model scenes containing multiple objects. The alternative shown on the right is GPT-style autoregression, e.g. DALL-E: given a text prompt, the image is generated token by token, from the top-left corner of the image to the bottom-right. For complex, diverse images this has two problems: if an early token is wrong, all subsequent generation is conditioned on that mistake, and the strictly sequential decoding is very slow.
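The raster-order decoding described above can be sketched as follows. This is a toy illustration, not DALL-E's actual model: `next_token_logits` is a hypothetical stand-in for a trained transformer forward pass, and the 16-entry codebook and greedy decoding are assumptions made just to keep the control flow runnable.

```python
# Sketch of raster-order autoregressive image-token generation (DALL-E style).
import random

def next_token_logits(text, tokens):
    # Hypothetical model call: in a real system this is one transformer
    # forward pass conditioned on the text and all previously generated tokens.
    random.seed(len(tokens))
    return [random.random() for _ in range(16)]  # toy 16-entry codebook

def generate_image_tokens(text, grid=32):
    tokens = []
    for _ in range(grid * grid):   # grid*grid strictly sequential steps
        logits = next_token_logits(text, tokens)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens                  # an early wrong token conditions every later step

tokens = generate_image_tokens("a red bird", grid=4)
print(len(tokens))  # 16
```

Both weaknesses are visible here: each step depends on the full prefix (so errors propagate), and the loop cannot be parallelized at sampling time.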

Contributions: 1. introducing denoising diffusion into text-to-image generation; 2. proposing the VQ-Diffusion algorithm; 3. achieving roughly 15x faster inference than autoregressive methods.

A diffusion model has two processes. The forward process (reading the figure right to left) is a Markov chain that adds noise: as noise is added step by step, the image eventually becomes pure noise. The reverse process denoises: a network is trained to remove the noise step by step until a clean image is recovered.
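The forward process can be sketched in a few lines. This is a minimal illustration of the standard Gaussian forward chain, not the paper's discrete diffusion; the toy 8-element "image" and the constant noise schedule `beta = 0.1` are assumptions for the sketch.

```python
import math
import random

def forward_step(x, beta):
    # q(x_t | x_{t-1}) = N(sqrt(1-beta) * x_{t-1}, beta * I): shrink the
    # signal slightly and add fresh Gaussian noise (one Markov step).
    return [math.sqrt(1 - beta) * v + math.sqrt(beta) * random.gauss(0, 1)
            for v in x]

def diffuse(x0, betas):
    x = x0
    for beta in betas:   # after enough steps, x is (near) pure noise
        x = forward_step(x, beta)
    return x

random.seed(0)
x0 = [1.0] * 8                    # toy "image"
xT = diffuse(x0, [0.1] * 50)      # signal scaled by sqrt(0.9)**50 ~ 0.07
# The reverse process trains a network p_theta(x_{t-1} | x_t) to undo each
# step; sampling runs it from pure noise x_T back to a clean x_0.
```

After 50 steps the original signal is attenuated to about 7% of its magnitude while the accumulated noise has near-unit variance, which is the "pure noise" endpoint the text describes.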

VQ-Diffusion does not operate in raw pixel space but in a quantized latent space. The resolution in pixel space is very high; modeling every pixel with a transformer would make the sequence far too long. Therefore, to compress the spatial resolution, a VQ-VAE is used to turn the image into a lower-resolution grid of discrete codes. For example, the 256x256 image above becomes a 32x32 code grid after compression.
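The core of the VQ-VAE quantization step is a nearest-neighbor lookup into a learned codebook: each encoder output vector is replaced by the index of its closest codebook entry. A minimal sketch, with a hypothetical 3-entry, 2-dimensional codebook standing in for a real learned one:

```python
# Vector-quantization step of a VQ-VAE: map each continuous encoder output
# vector to the index of its nearest codebook entry (squared L2 distance).
def quantize(z, codebook):
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist(vec, codebook[k]))
            for vec in z]

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]  # toy learned codebook
z = [[0.1, -0.2], [0.9, 1.1]]                    # two encoder output vectors
print(quantize(z, codebook))  # [0, 1]
```

In the real model this runs over every spatial position of the encoder's 32x32 feature map, so the whole image is reduced to 1024 discrete indices that the diffusion model then operates on.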

In the second step, the mask-and-replace noising strategy is introduced; all noise is added in the discrete code space. There are two ways to corrupt a code: the first masks it, replacing it with a special [MASK] token; the second replaces it with another randomly chosen code. After noising, each image is therefore a sequence made up of original codes, random codes, and mask codes, and the model learns to recover the original code sequence from this noisy sequence together with the text.
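One forward noising step of this mask-and-replace strategy can be sketched as below. The per-step probabilities `gamma` (mask) and `beta` (random replace), the `MASK = -1` sentinel, and the vocabulary size are illustrative assumptions; the paper schedules these probabilities over timesteps.

```python
import random

MASK = -1  # assumed sentinel index for the special [MASK] token

def corrupt(tokens, gamma, beta, vocab_size, rng):
    # One forward step: with prob gamma a code becomes [MASK]; with prob
    # beta it is replaced by a uniformly random code; otherwise it is kept.
    out = []
    for t in tokens:
        if t == MASK:
            out.append(MASK)  # once masked, a token stays masked
            continue
        r = rng.random()
        if r < gamma:
            out.append(MASK)
        elif r < gamma + beta:
            out.append(rng.randrange(vocab_size))
        else:
            out.append(t)
    return out

rng = random.Random(0)
x = list(range(10))                                   # clean code sequence
xt = corrupt(x, gamma=0.3, beta=0.2, vocab_size=16, rng=rng)
```

The denoising network is then trained to predict the clean sequence `x` from the corrupted `xt` plus the text embedding, which is what lets generation proceed in parallel over all positions instead of token by token.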


Origin blog.csdn.net/u012193416/article/details/132523097