Text-to-image methods before 2021 were all based on GANs. Generally, the text and a noise vector are fed into a generator network; once an image is generated, the discriminator judges whether it matches the text and, at the same time, whether it is real or fake. This approach has two disadvantages: 1. It can only model a single kind of scene. For example, a GAN trained on faces can only generate face-related images; 2. It cannot model scenes that contain multiple objects. The right side shows a GPT-based method such as DALL-E: for a given text, it starts from the upper-left corner of the image and generates the image block by block, in order from upper left to lower right. For complex and diverse pictures, however, if an earlier token is wrong, the subsequent generation goes wrong as well, and generation is very slow.
1. Introduces denoising diffusion into the field of text-to-image generation; 2. Proposes the VQ-Diffusion algorithm; 3. Is 15 times faster than autoregressive generation.
The diffusion model has two steps. The forward step (read from right to left in the figure) adds noise as a Markov process: as noise is added to an image again and again, it eventually becomes a pure-noise image. The reverse step is denoising: a network removes the noise from the noisy image step by step until the final picture is obtained.
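The forward step can be illustrated with a minimal sketch. It assumes the standard Gaussian formulation, in which each step scales the previous state and adds fresh noise; the linear noise schedule `betas`, the toy 8x8 "image", and the function name `forward_diffusion` are illustrative assumptions, not details from the notes above.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Run the Markov noising chain and return every intermediate state."""
    states = [x0]
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        # Each step depends only on the previous state (Markov property):
        # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        states.append(x)
    return states

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                 # toy stand-in for an image
betas = np.linspace(1e-4, 0.2, 200)  # assumed noise schedule
states = forward_diffusion(x0, betas, rng)

# Fraction of the original signal that survives all steps; it shrinks toward 0,
# which is why the final state is essentially pure noise.
signal_kept = float(np.prod(np.sqrt(1.0 - betas)))
```

The reverse step would train a network to undo one such step at a time; that part is omitted here because it requires a learned model.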
VQ-Diffusion does not operate in raw pixel space but in a quantized space. The image resolution in pixel space is very high, so if a transformer models every pixel, the sequence becomes very long, which makes modeling difficult. Therefore, to compress the resolution of the image space, a VQ-VAE is used to turn the image into discrete codes at a lower resolution. For example, the picture above has a resolution of 256x256, which becomes 32x32 after compression.
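The following sketch shows the quantization idea and the resulting sequence lengths. It is a toy version under stated assumptions: the random `features` array stands in for the output of a VQ-VAE encoder, and the codebook size (16) and feature dimension (4) are arbitrary illustrative values.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # features: (H, W, D), codebook: (K, D)
    h, w, d = features.shape
    flat = features.reshape(-1, d)
    # Squared Euclidean distance between every feature and every code.
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(h, w)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 4))       # toy codebook: K=16, D=4
features = rng.standard_normal((32, 32, 4))   # stand-in for encoder output
codes = quantize(features, codebook)          # (32, 32) grid of integer codes

pixel_tokens = 256 * 256    # sequence length if every pixel is a token
code_tokens = codes.size    # sequence length after compression to 32x32
```

With these numbers the transformer sees 1,024 tokens instead of 65,536, a 64-fold reduction in sequence length.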
In the second step, the mask and replace strategies are introduced. All noise is added in the discrete space, and there are two ways to add it. The first is masking: a code is randomly removed and replaced with a mask token. The second is replacement: a code is randomly swapped for another code. After noise is added, the result is a sequence made up of random codes and mask codes, and the original image can be restored from this noisy code sequence together with the text information.
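A simplified sketch of one mask-and-replace corruption step follows. It is an illustration under assumptions rather than the exact transition used by VQ-Diffusion: the function name, the probabilities, and the choice to reserve index `num_codes` for the mask token are all hypothetical.

```python
import numpy as np

def mask_and_replace(codes, mask_prob, replace_prob, num_codes, rng):
    """Corrupt a grid of discrete codes by masking some and replacing others."""
    mask_id = num_codes  # assumed: one extra index reserved for the mask token
    u = rng.random(codes.shape)
    random_codes = rng.integers(0, num_codes, size=codes.shape)

    noisy = codes.copy()
    # Strategy 1: mask out a code.
    noisy[u < mask_prob] = mask_id
    # Strategy 2: replace a code with a randomly drawn code.
    replace = (u >= mask_prob) & (u < mask_prob + replace_prob)
    noisy[replace] = random_codes[replace]
    return noisy

rng = np.random.default_rng(0)
num_codes = 16
codes = rng.integers(0, num_codes, size=(32, 32))
noisy = mask_and_replace(codes, mask_prob=0.3, replace_prob=0.1,
                         num_codes=num_codes, rng=rng)
```

Each position is left unchanged, masked, or replaced, so the output mixes original codes, random codes, and mask tokens. A denoising network conditioned on the text would then be trained to recover `codes` from `noisy`.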