AI painting is hugely popular right now. Taking Kunlun Wanwei's AIGC work as an example, this article looks at the model algorithms behind AI painting.

1. Introduction

Recently, AI painting has pushed artificial intelligence back into the public eye. In the early days of the field, it was widely assumed that AI could only handle a narrow set of rigid tasks, such as playing chess or answering quiz questions, with no creativity involved. Few people back then would have imagined that AI could one day paint, compose music, and write poetry; things once considered uniquely human now have AI in the mix as well.
Today we look at the currently popular topic of AI painting and ask whether AI is genuinely creative or merely rearranging what it has already seen.
Many models can be used for AI painting. Here we focus on two: Conditional GAN and Stable Diffusion. Commercial products already exist; for example, Kunlun Wanwei's AI painting service is built on a Stable Diffusion branch model and has achieved considerable results.

2. GAN

Here we discuss how a Conditional GAN (conditional Generative Adversarial Network) can be used for AI painting. Before getting to Conditional GAN, let's first look at what a GAN is.

2.1 Generation

Generative networks have long been seen as a path toward giving AI creativity. Generation covers text generation, image generation, audio generation, and more.
GAN is a relatively mature family of generative networks, usually used to generate images. It has many variants, including DCGAN, CycleGAN, and others.

2.2 Experts and fakes

GAN stands for Generative Adversarial Network. GANs are often explained with two opposing roles: a counterfeiter, responsible for making fakes, and an appraiser, responsible for spotting them. Neither is an expert at first, but by competing against each other both improve, until eventually the counterfeiter can produce fakes that almost no one can tell from the real thing. At that point we discard the appraiser and put the counterfeiter to work for us.
The counterfeiter above is the G network, the Generator; the appraiser is the D network, the Discriminator. They learn from each other and each becomes an expert at its own task. This is the core idea of GAN.

2.3 Generator

Below we discuss the Generator and Discriminator of a GAN, using the generation of anime avatars as an example.
Let's start with the Generator, which plays the counterfeiter in a GAN and is the part that actually produces images. The Generator takes as input a random variable drawn from a simple, known distribution, such as a Gaussian. From this random input, the network produces a very long vector, which we can reshape into w×h×3, i.e. a color image.
[Figure: the Generator maps a random vector to an image]
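To make the idea concrete, here is a minimal sketch (not the DCGAN network used below) that maps Gaussian noise through a placeholder fully connected layer and reshapes the result into images; all sizes here are illustrative:

import torch
import torch.nn as nn

w, h = 64, 64                         # target image size (illustrative)
z = torch.randn(16, 100)              # a batch of random vectors from a Gaussian
fc = nn.Linear(100, w * h * 3)        # placeholder network producing one long vector per z
images = fc(z).reshape(-1, 3, h, w)   # reshape each vector into a 3-channel h x w image
print(images.shape)                   # torch.Size([16, 3, 64, 64])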
The specific structure of the Generator can vary, but it is usually based on a convolutional network. For example, in DCGAN the Generator consists of five transposed-convolution (deconvolution) layers; its structure is shown below:

[Figure: DCGAN Generator architecture]
It takes a 100-dimensional vector as input and outputs a 64×64×3 image. A PyTorch implementation is as follows:

import torch
import torch.nn as nn

# Hyperparameters follow the PyTorch DCGAN tutorial: nz is the latent vector size,
# ngf/ndf are the base feature-map counts of the Generator/Discriminator,
# and nc is the number of image channels.
nz, ngf, ndf, nc = 100, 64, 64, 3

class Generator(nn.Module):
    def __init__(self, ngpu):
        super(Generator, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d( nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # state size. (ngf*8) x 4 x 4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # state size. (ngf*4) x 8 x 8
            nn.ConvTranspose2d( ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # state size. (ngf*2) x 16 x 16
            nn.ConvTranspose2d( ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            nn.ConvTranspose2d( ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. (nc) x 64 x 64
        )

    def forward(self, input):
        return self.main(input)
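As a quick sanity check, here is a sketch of how this Generator might be called, assuming the nz, ngf, and nc values defined above:

netG = Generator(ngpu=1)
z = torch.randn(16, nz, 1, 1)   # 16 latent vectors shaped as 1x1 feature maps
fake_images = netG(z)
print(fake_images.shape)        # torch.Size([16, 3, 64, 64]); Tanh keeps values in [-1, 1]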

2.4 Discriminator

The Discriminator plays an equally important role in a GAN. It is a network that takes an image as input. The input images include real images (the anime images we collected) and fake images (images produced by the Generator), and the network outputs a single result. This result can be interpreted as the probability that the input is a real image, or as a class label (0 for fake, 1 for real). The Discriminator's goal is to adjust its parameters so that it can tell fake images from real ones.

[Figure: the Discriminator takes an image and outputs a real/fake score]
The structure of the Discriminator is also not very constrained; it is usually a convolutional network. Again following DCGAN, here is a PyTorch implementation:

class Discriminator(nn.Module):
    def __init__(self, ngpu):
        super(Discriminator, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is (nc) x 64 x 64
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf) x 32 x 32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*2) x 16 x 16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*4) x 8 x 8
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*8) x 4 x 4
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid()
        )

    def forward(self, input):
        return self.main(input)

What is special here, compared with the Generator, is the use of LeakyReLU instead of ReLU.
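LeakyReLU keeps a small, non-zero slope for negative inputs, so gradients can still flow through the Discriminator. A quick illustration, assuming the imports above:

x = torch.tensor([-2.0, -0.5, 0.0, 1.0])
print(nn.ReLU()(x))          # tensor([0., 0., 0., 1.])
print(nn.LeakyReLU(0.2)(x))  # tensor([-0.4000, -0.1000,  0.0000,  1.0000])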

2.5 Combining the two

With a Generator and a Discriminator, we can assemble a GAN.
At the start, neither network knows anything: the Generator cannot generate convincing images, and the Discriminator cannot tell real from fake. Training a GAN proceeds in the following steps.

  • Step 1: Train the Discriminator. At this stage the images produced by the Generator are essentially noise; training the Discriminator first teaches it to distinguish real images from noise.
  • Step 2: Fix the Discriminator and train the Generator, so that the images the Generator produces can fool the Discriminator.
  • Step 3: Repeat steps 1 and 2, alternately training the Discriminator and the Generator, until the Generator's images meet our needs.
  • Step 4: Use the Generator to generate images.

The steps are illustrated in the figure below:

[Figure: GAN training process, alternating Discriminator and Generator updates]
This is the GAN training process: in essence, the Generator and the Discriminator are trained in alternation. A PyTorch implementation follows:

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.utils as vutils

# The rest of the DCGAN tutorial setup is assumed here: device, ngpu, lr, beta1,
# num_epochs, a dataloader over the anime image dataset, and the weights_init function.

# Create the generator
netG = Generator(ngpu).to(device)
if (device.type == 'cuda') and (ngpu > 1):
    netG = nn.DataParallel(netG, list(range(ngpu)))
netG.apply(weights_init)

# Create the Discriminator
netD = Discriminator(ngpu).to(device)

if (device.type == 'cuda') and (ngpu > 1):
    netD = nn.DataParallel(netD, list(range(ngpu)))
netD.apply(weights_init)

criterion = nn.BCELoss()

fixed_noise = torch.randn(64, nz, 1, 1, device=device)
real_label = 1.
fake_label = 0.
optimizerD = optim.Adam(netD.parameters(), lr=lr, betas=(beta1, 0.999))
optimizerG = optim.Adam(netG.parameters(), lr=lr, betas=(beta1, 0.999))

# Training Loop

# Lists to keep track of progress
img_list = []
G_losses = []
D_losses = []
iters = 0

print("Starting Training Loop...")
# For each epoch
for epoch in range(num_epochs):
    # For each batch in the dataloader
    for i, data in enumerate(dataloader, 0):

        ############################
        # (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
        ###########################
        ## Train with all-real batch
        netD.zero_grad()
        # Format batch
        real_cpu = data[0].to(device)
        b_size = real_cpu.size(0)
        label = torch.full((b_size,), real_label, dtype=torch.float, device=device)
        # Forward pass real batch through D
        output = netD(real_cpu).view(-1)
        # Calculate loss on all-real batch
        errD_real = criterion(output, label)
        # Calculate gradients for D in backward pass
        errD_real.backward()
        D_x = output.mean().item()

        ## Train with all-fake batch
        # Generate batch of latent vectors
        noise = torch.randn(b_size, nz, 1, 1, device=device)
        # Generate fake image batch with G
        fake = netG(noise)
        label.fill_(fake_label)
        # Classify all fake batch with D
        output = netD(fake.detach()).view(-1)
        # Calculate D's loss on the all-fake batch
        errD_fake = criterion(output, label)
        # Calculate the gradients for this batch, accumulated (summed) with previous gradients
        errD_fake.backward()
        D_G_z1 = output.mean().item()
        # Compute error of D as sum over the fake and the real batches
        errD = errD_real + errD_fake
        # Update D
        optimizerD.step()

        ############################
        # (2) Update G network: maximize log(D(G(z)))
        ###########################
        netG.zero_grad()
        label.fill_(real_label)  # fake labels are real for generator cost
        # Since we just updated D, perform another forward pass of all-fake batch through D
        output = netD(fake).view(-1)
        # Calculate G's loss based on this output
        errG = criterion(output, label)
        # Calculate gradients for G
        errG.backward()
        D_G_z2 = output.mean().item()
        # Update G
        optimizerG.step()

        # Output training stats
        if i % 50 == 0:
            print('[%d/%d][%d/%d]\tLoss_D: %.4f\tLoss_G: %.4f\tD(x): %.4f\tD(G(z)): %.4f / %.4f'
                  % (epoch, num_epochs, i, len(dataloader),
                     errD.item(), errG.item(), D_x, D_G_z1, D_G_z2))

        # Save Losses for plotting later
        G_losses.append(errG.item())
        D_losses.append(errD.item())

        # Check how the generator is doing by saving G's output on fixed_noise
        if (iters % 500 == 0) or ((epoch == num_epochs-1) and (i == len(dataloader)-1)):
            with torch.no_grad():
                fake = netG(fixed_noise).detach().cpu()
            img_list.append(vutils.make_grid(fake, padding=2, normalize=True))

        iters += 1

After a period of training, we can generate anime-style images. For the full DCGAN code, see https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html#sphx-glr-beginner-dcgan-faces-tutorial-py.

3. Conditional GAN

With the GAN described above we can generate anime images, but the generation is uncontrollable: we only know that it produces anime images, we cannot control what they contain, and we cannot generate images from a description. This is a limitation of the plain GAN. A variant called Conditional GAN was therefore proposed to solve this problem.

3.1 Generator

Conditional GAN differs from GAN in the inputs its Generator and Discriminator receive. The Generator receives not only a random variable but also a condition vector, which can be, for example, the encoding of a sentence. The Generator thus becomes a network that takes two vectors as input and outputs an image.
[Figure: the conditional Generator takes a text vector x and a random vector z]
For example, in the figure above, we encode the sentence "red eyes" into a vector, give it to the Generator, and have it generate an anime image with red eyes. By changing x we get different images, and because of the random variable z we can get different images even for the same x.
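Below is a minimal sketch of what such a conditional Generator could look like, assuming the text description has already been encoded into a fixed-size vector x; the layer sizes mirror the DCGAN Generator above and are purely illustrative, not Kunlun Wanwei's actual model:

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, nz=100, ntext=128, ngf=64, nc=3):
        super().__init__()
        self.main = nn.Sequential(
            # the condition vector x is concatenated with the noise z on the channel axis
            nn.ConvTranspose2d(nz + ntext, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z, x):
        # z: (batch, nz, 1, 1) noise; x: (batch, ntext, 1, 1) encoded description
        return self.main(torch.cat([z, x], dim=1))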

For the network to learn the relationship between the text description and the image, we need to prepare a dataset of such pairs (text description, image).

3.2 Discriminator

The Discriminator also takes two inputs: the image produced by the Generator and the condition x that was fed to the Generator. It then outputs whether the pair is correct.
[Figure: the conditional Discriminator takes both the image and the condition x]
The training data fed to the Discriminator needs to include (correct description, matching real image) pairs as class 1, and both (correct description, generated image) pairs and (correct description, real image that does not match the description) pairs as class 0.

If the (correct description, real but mismatched image) pairs are not included as training data, the network does not learn to tie the image to the description, and the results are much worse.
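As a sketch of how these three kinds of pairs might be fed to a conditional Discriminator (netG, netD, criterion, device, b_size, and nz are assumed to be set up as in the GAN training loop above, with netG and netD now taking the condition as a second argument):

# real_imgs and texts form a matched batch from the (description, image) dataset
ones = torch.ones(b_size, device=device)
zeros = torch.zeros(b_size, device=device)

# (correct description, matching real image) -> label 1
loss_real = criterion(netD(real_imgs, texts).view(-1), ones)

# (correct description, generated image) -> label 0
fake_imgs = netG(torch.randn(b_size, nz, 1, 1, device=device), texts)
loss_fake = criterion(netD(fake_imgs.detach(), texts).view(-1), zeros)

# (correct description, real image that does not match it) -> label 0
mismatched = real_imgs[torch.randperm(b_size)]
loss_mismatch = criterion(netD(mismatched, texts).view(-1), zeros)

lossD = loss_real + loss_fake + loss_mismatch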

Once we have these Generator and Discriminator networks, we can train them in much the same way as a plain GAN. The resulting Generator is our AI painter: we give it a text description, and it returns a matching picture.

4. Stable Diffusion

Stable Diffusion and Conditional GAN have a lot in common: both are used for text-to-image generation, so both models take a text input together with Gaussian noise that influences the image. The difference lies in the network architecture, and Stable Diffusion introduces Latent Diffusion to make training smoother.

Latent Diffusion consists of three parts: an autoencoder, a U-Net, and a text encoder.

The autoencoder consists of an encoder and a decoder. The encoder's output is handed to the U-Net for processing, and the U-Net's output is eventually passed to the decoder.

The U-Net receives the encoder's output along with a sentence vector provided by the text encoder. The figure below shows the structure of the U-Net.
[Figure: U-Net architecture]
Because the U-Net operates in a low-dimensional latent space, Latent Diffusion is fast and effective. The overall Stable Diffusion process is as follows:

[Figure: overall Stable Diffusion pipeline]
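As an aside, if you just want to run a public Stable Diffusion model rather than assemble the three parts yourself, the Hugging Face diffusers library wraps the VAE, U-Net, and text encoder into one pipeline. A minimal usage sketch follows; the checkpoint name is a public English-language model, not Kunlun Wanwei's:

import torch
from diffusers import StableDiffusionPipeline

# The pipeline bundles the text encoder, the U-Net, and the VAE decoder described above.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe("an anime girl with red eyes").images[0]
image.save("result.png")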

5. Hands-on with Kunlun Wanwei's Tiangong Qiaohui (SkyPaint)

There are many ready-made AI painting platforms, and compared with GAN, Stable Diffusion is better at painting. Here we take Kunlun Tiangong's SkyPaint for a quick hands-on. SkyPaint uses the world's first multilingual Stable Diffusion branch model and is one of the few text-to-image models in China that supports both Chinese and English.

During training, the Kunlun Wanwei AI painting model mainly adopts the following strategies:

  • It adds the ability to use Chinese prompts while remaining compatible with the original stable_diffusion English prompt model, so the English prompt handbooks accumulated by existing users still work;
  • It uses a parallel corpus on the order of 150 million pairs to optimize the prompt model for Chinese-English alignment. This corpus covers not only translation data but also Chinese-English prompt pairs frequently used by users, classical poetry, subtitles, encyclopedia entries, image-caption text, and material from many other scenarios and tasks;
  • Training combines a model-distillation scheme with a bilingual alignment scheme: a teacher model distills the student model, supplemented by a decoder language-alignment task to assist training.

In both text-to-image and image-to-text applications, Kunlun Tiangong's Tiangong Qiaohui SkyPaint model is comparable to the most advanced models in the AI painting field. The table below compares the performance of different models on the Flickr30K-CN dataset.

[Table: performance of different models on the Flickr30K-CN dataset]

Below are a few test examples.

  1. A cat with a hat and a sword.
    My original idea was to get something like Puss in Boots, and the results below do have some of that flavor.

[Images: SkyPaint results for the cat prompt]
2. Van Gogh's Starry Night

The first result is somewhat similar to the original painting, while the remaining ones diverge from it.

[Images: SkyPaint results for the Starry Night prompt]
3. A red helicopter taking off from the eternally snow-capped mountains of Alaska.

This prompt contains a lot of detail: the red helicopter, the take-off, and so on. The results below show that the AI picked up on these details, and none of the images feels particularly inconsistent, although the propellers still look a bit off on close inspection.

[Images: SkyPaint results for the helicopter prompt]
You can try out AI painting for yourself.

6. Summary and Outlook

AI painting is not simply copying existing images. Take Conditional GAN as an example: when we train it, what we are really learning is the distribution of images. A 64×64×3 image with 8-bit values contains 64×64×3 = 12288 numbers, each taking one of 256 values, so there are 256^12288 possible images, and only a vanishingly small fraction of them are images we would actually want. The Generator maps z from a simple distribution (such as a Gaussian) onto this complex distribution of images. Once that mapping is learned, we only need to sample a point from the distribution of z to obtain a corresponding image. That is exactly what the Generator does.
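A quick back-of-the-envelope check of that count:

w, h, c = 64, 64, 3
values_per_image = w * h * c     # 12288 numbers describe one image
# each number is 8-bit (256 possible values), so the number of possible images is
# 256 ** 12288, an astronomically large space in which usable images are a tiny sliver
print(values_per_image)          # 12288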
Based on its forward-looking judgment of artificial intelligence technology, Kunlun Wanwei began investing in the AIGC field in 2020, building a training cluster of 200 GPUs, putting in tens of millions of yuan, and assembling an R&D team of more than 200 people. Between the end of 2020 and April 2021 it developed a Chinese GPT-3 model with tens of billions of parameters; in August 2021 it began building a conversational bot on top of its own large text model; in January 2022 it launched the SkyMusic music laboratory, which reached the best results in the field by April 2022; and in September 2022 it launched AIGC products for programming, images, and text. It is worth mentioning that the AI image, AI text, and AI programming models have been open sourced on GitHub.

Official website (experience link): http://www.kunlun.com/
Kunlun Tiangong open source addresses:
Github: https://github.com/SkyWorkAIGC
Huggingface: https://huggingface.co/SkyWork
Related websites:
SkyPaint:
https://sky-paint.singularity-ai.com
SkyCode:
https://sky-code.singularity-ai.com
SkyText:
https://openapi.singularity-ai.com
We can believe that, with continued technical innovation in AIGC model algorithms, the open-source AIGC algorithm and model community will grow ever stronger, the barrier to using and learning AIGC technology across industries will gradually fall, and a new era belonging to AIGC will arrive.

Original article: https://blog.csdn.net/ZackSock/article/details/128360162