AutoEncoder and VAE

Author: Sherlock
Link: https://zhuanlan.zhihu.com/p/27549418
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please indicate the source.

What is an autoencoder

Autoencoders were originally used as a data compression method. They have two characteristics:

1) They are highly data-specific: an autoencoder can only compress data similar to what it was trained on. This is fairly intuitive, because the features a neural network extracts are closely tied to its training set. An autoencoder trained on human faces performs poorly when compressing pictures of animals in nature, because it has only learned the features of faces, not the features of natural pictures.

2) The compression is lossy, because information is inevitably lost during dimensionality reduction.

In 2012, people found that using autoencoders for layer-by-layer pre-training of convolutional networks made it possible to train deeper networks, but it soon turned out that a good initialization strategy is much more effective than laborious layer-by-layer pre-training. In 2015, Batch Normalization made it possible to train deeper networks effectively, and by the end of 2015 residual learning (ResNet) let us train neural networks of essentially any depth.

So today autoencoders have two main applications: data denoising and dimensionality reduction for visualization. However, an autoencoder can also be used for something else: generating data.

We discussed GANs earlier. Compared with a GAN, an autoencoder has some advantages and some disadvantages. Let's first look at its advantages.

First, a GAN generates pictures from random Gaussian noise, which means we cannot generate a picture of a type we specify: we have no way of knowing which noise produces the picture we want, unless we try every possible initial noise. With an autoencoder, we can pass a picture of the desired type through the encoder to obtain its encoded distribution. This is equivalent to knowing the noise distribution that corresponds to each picture, so we can generate the picture we want by choosing a specific noise.

Second, and this is both a strength and a limitation of generative adversarial networks: a GAN learns through an adversarial process to distinguish "real" pictures from "fake" ones, but the pictures obtained this way are only as realistic as possible, and nothing guarantees that their content is what we want. In other words, the generator may produce background patterns that look as real as possible but contain no actual object.

Structure of the autoencoder

First we give the general structure of the autoencoder.

From the figure above we can see two parts: the first is the encoder and the second is the decoder. The encoder and decoder can be any model; usually we use neural networks for both. The input data is reduced to a code by one neural network and then decoded by another neural network into generated data that should match the original input. The parameters of the encoder and decoder are trained by comparing the two and minimizing the difference between them. After training, we can take out the decoder, pass in an arbitrary code, and hope that the decoder generates data similar to the original data; in the example in the figure above, the hope is to generate an almost identical picture.

Is this possible? In fact, it is. Below we use PyTorch to implement a simple autoencoder.

First, we build a simple multilayer perceptron to implement it.

from torch import nn


# Expects flattened images of shape (batch, 28*28) scaled to [-1, 1].
class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.ReLU(True),
            nn.Linear(128, 64),
            nn.ReLU(True),
            nn.Linear(64, 12),
            nn.ReLU(True),
            nn.Linear(12, 3)
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 12),
            nn.ReLU(True),
            nn.Linear(12, 64),
            nn.ReLU(True),
            nn.Linear(64, 128),
            nn.ReLU(True),
            nn.Linear(128, 28*28),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

Here we define a simple 4-layer network as the encoder, with ReLU activations between the layers and a final output of 3 dimensions. The decoder mirrors it: it takes the 3-dimensional code as input and outputs the data for a 28x28 image. Pay special attention to the last activation function, Tanh, which maps the final output to the range -1 to 1. This is because the input pictures have been scaled to the range -1 to 1, and the output must match.
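The scaling to the range -1 to 1 mentioned above can be sketched in plain Python. This is only a minimal sketch that assumes 8-bit pixel values from 0 to 255; the helper names to_tanh_range and from_tanh_range are hypothetical and do not come from the original code.

```python
def to_tanh_range(pixel):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return (pixel / 255.0 - 0.5) / 0.5


def from_tanh_range(value):
    """Map a Tanh output in [-1, 1] back to an 8-bit pixel value."""
    return round((value * 0.5 + 0.5) * 255.0)


if __name__ == "__main__":
    for p in (0, 128, 255):
        scaled = to_tanh_range(p)
        print(p, scaled, from_tanh_range(scaled))
```

In a PyTorch data pipeline the same mapping is commonly done with torchvision's ToTensor followed by Normalize with mean 0.5 and standard deviation 0.5; whether the original training script does exactly this is an assumption.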

The training process is also fairly simple. We use the mean squared error as the loss function, which compares the generated picture with the original picture pixel by pixel.
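As a rough illustration of this training process, here is a minimal sketch of a few optimization steps. It makes several assumptions that are not in the original: a random stand-in batch instead of real images, a smaller stand-in model, and the Adam optimizer with a learning rate of 1e-3.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small stand-in model; the layer sizes here are illustrative only.
model = nn.Sequential(
    nn.Linear(28 * 28, 64), nn.ReLU(),
    nn.Linear(64, 3),                    # 3-dimensional code
    nn.Linear(3, 64), nn.ReLU(),
    nn.Linear(64, 28 * 28), nn.Tanh(),   # outputs in [-1, 1]
)
criterion = nn.MSELoss()  # mean of the squared per-pixel differences
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed settings

# Random stand-in batch already scaled to [-1, 1]; real code would load images.
batch = torch.rand(16, 28 * 28) * 2 - 1

losses = []
for step in range(20):
    output = model(batch)
    loss = criterion(output, batch)  # the target is the input itself
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The key point is that the target passed to the loss is the input batch itself, so no labels are needed.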

We can also replace the multilayer perceptron with a convolutional neural network, which is better at extracting features from pictures.

from torch import nn


# Convolutional version; the shape comments assume a (b, 1, 28, 28) input.
class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=3, padding=1),  # b, 16, 10, 10
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=2),  # b, 16, 5, 5
            nn.Conv2d(16, 8, 3, stride=2, padding=1),  # b, 8, 3, 3
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=1)  # b, 8, 2, 2
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 16, 3, stride=2),  # b, 16, 5, 5
            nn.ReLU(True),
            nn.ConvTranspose2d(16, 8, 5, stride=3, padding=1),  # b, 8, 15, 15
            nn.ReLU(True),
            nn.ConvTranspose2d(8, 1, 2, stride=2, padding=1),  # b, 1, 28, 28
            nn.Tanh()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

Here we use nn.ConvTranspose2d(), a transposed convolution. It reverses the change in spatial shape made by a convolution, so it is sometimes loosely called deconvolution, although it does not compute the mathematical inverse of a convolution.
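The shape comments in the convolutional autoencoder can be checked with the standard output-size formulas. The sketch below uses plain Python and assumes the default dilation of 1 and no output padding; the helper names conv_out and conv_transpose_out are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or max pooling along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0):
    """Output size of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel


# Encoder layers as (kernel, stride, padding): conv, pool, conv, pool
encoder_sizes = [28]
for kernel, stride, padding in [(3, 3, 1), (2, 2, 0), (3, 2, 1), (2, 1, 0)]:
    encoder_sizes.append(conv_out(encoder_sizes[-1], kernel, stride, padding))

# Decoder layers as (kernel, stride, padding): three transposed convolutions
decoder_sizes = [encoder_sizes[-1]]
for kernel, stride, padding in [(3, 2, 0), (5, 3, 1), (2, 2, 1)]:
    decoder_sizes.append(conv_transpose_out(decoder_sizes[-1], kernel, stride, padding))

print(encoder_sizes)  # [28, 10, 5, 3, 2]
print(decoder_sizes)  # [2, 5, 15, 28]
```

The decoder does not retrace the encoder's sizes step by step; it only needs to end at 28 so that the output can be compared with the input.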

The images we get from the convolutional network are better. I will not show the results here; you can see them on our GitHub.

Variational Autoencoder

The variational autoencoder (VAE) is an upgraded version of the autoencoder. Its structure is similar: it also consists of an encoder and a decoder.

Recall what we did in the autoencoder: we input a picture and encode it into a hidden vector. This is better than picking random noise, because the hidden vector contains information about the original picture. We then decode the hidden vector to get a picture corresponding to the original.

But this way we cannot actually generate arbitrary pictures, because we have no way to construct the hidden vector ourselves; we have to encode a picture to find out what its hidden vector is. The variational autoencoder solves this problem.

The principle is simple. We only need to add a constraint to the encoding process that forces the hidden vectors it produces to roughly follow a standard normal distribution. This is the biggest difference from an ordinary autoencoder.

This makes generating a new picture very simple: we only need to feed the decoder a random hidden vector drawn from a standard normal distribution, and it can generate a picture without an original picture having to be encoded first.
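The generation step described above can be sketched as follows. The decoder here is an untrained stand-in with assumed layer sizes (a 20-dimensional hidden vector and 784 outputs), so its outputs are meaningless; in practice we would use the decoder of a trained VAE.

```python
import torch
from torch import nn

torch.manual_seed(0)

latent_dim = 20  # assumed size of the hidden vector

# Untrained stand-in for a trained VAE decoder (hypothetical layer sizes).
decoder = nn.Sequential(
    nn.Linear(latent_dim, 400), nn.ReLU(),
    nn.Linear(400, 28 * 28), nn.Sigmoid(),
)

with torch.no_grad():
    z = torch.randn(8, latent_dim)           # 8 hidden vectors drawn from N(0, I)
    images = decoder(z).view(-1, 1, 28, 28)  # reshape to (batch, channel, H, W)
```

No encoder and no input picture appear anywhere in this snippet, which is exactly the point of forcing the hidden vectors toward a standard normal distribution.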

In practice, we need to make a trade-off between the accuracy of the model and how closely the hidden vector follows the standard normal distribution. The accuracy of the model refers to how similar the picture generated by the decoder is to the original picture. We can let the network make this trade-off itself, and doing so is simple: we define a loss for each of the two, add them together as the total loss, and let the network decide how to reduce this total. We also need a way to measure how similar two distributions are. If you have read the mathematical derivation of GANs, you know that KL divergence does exactly this. Here we use the KL divergence as the loss for the difference between the distribution of the hidden vector and the standard normal distribution; the other loss is still the mean squared error between the generated picture and the original picture.

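We can give the formula of KL divergence for two distributions P and Q with densities p(x) and q(x):

\[ D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx \]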

Here the variational autoencoder uses a trick called "reparameterization", which makes the KL divergence straightforward to compute and keeps the sampling step differentiable, so the network can still be trained with backpropagation.

Instead of producing a single hidden vector each time, the encoder now produces two vectors, one representing the mean and one representing the standard deviation, and the hidden vector is synthesized from these two statistics. This is also simple: draw a sample from a standard normal distribution, multiply it by the standard deviation, and add the mean. Here we assume that the encoded hidden vector follows a normal distribution, and we want its mean to be as close to 0 as possible and its standard deviation as close to 1 as possible. The paper gives a detailed derivation of the formula for this loss; interested readers can consult it.
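To make the reparameterization step concrete, here is a minimal sketch in plain Python, without PyTorch. It assumes the encoder outputs a mean and a log variance for each dimension; the function name reparameterize and the example values are hypothetical.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Return z = mu + sigma * eps with eps ~ N(0, 1), element by element."""
    z = []
    for m, lv in zip(mu, logvar):
        sigma = math.exp(0.5 * lv)   # sigma = sqrt(exp(log sigma^2))
        eps = rng.gauss(0.0, 1.0)    # all randomness lives in eps
        z.append(m + sigma * eps)
    return z


rng = random.Random(0)
mu = [2.0, -1.0]
logvar = [math.log(0.25), 0.0]       # standard deviations 0.5 and 1.0
samples = [reparameterize(mu, logvar, rng) for _ in range(20000)]

# Empirical mean and standard deviation of the first dimension
mean0 = sum(s[0] for s in samples) / len(samples)
std0 = math.sqrt(sum((s[0] - mean0) ** 2 for s in samples) / len(samples))
print(mean0, std0)
```

The empirical mean and standard deviation of the first dimension should come out close to 2.0 and 0.5. Because the random part is isolated in eps, the hidden vector is a simple function of the mean and standard deviation, which is what lets gradients flow back to the encoder.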

Below is the PyTorch implementation of this loss:

import torch
from torch import nn

# Binary cross-entropy summed over all pixels and samples (this is not MSE;
# nn.MSELoss(reduction='sum') would match the mean squared error in the text).
# reduction='sum' replaces the deprecated size_average=False.
reconstruction_function = nn.BCELoss(reduction='sum')


def loss_function(recon_x, x, mu, logvar):
    """
    recon_x: generated images
    x: original images
    mu: latent mean
    logvar: latent log variance
    """
    BCE = reconstruction_function(recon_x, x)
    # KL divergence between N(mu, sigma^2) and N(0, 1):
    # KLD = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return BCE + KLD
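The KL term in this loss is the closed form of the KL divergence between a normal distribution N(mu, sigma^2) and the standard normal distribution. As a sanity check, the sketch below compares that closed form with a direct numerical integration of p(x) log(p(x) / q(x)) in one dimension. The function names, the integration range, and the step count are assumptions chosen for illustration.

```python
import math


def kl_closed_form(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)) in one dimension, with logvar = log(sigma^2)."""
    return -0.5 * (1.0 + logvar - mu ** 2 - math.exp(logvar))


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def kl_numeric(mu, logvar, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule approximation of the integral of p(x) * log(p(x) / q(x))."""
    sigma = math.exp(0.5 * logvar)
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        p = normal_pdf(x, mu, sigma)
        q = normal_pdf(x, 0.0, 1.0)
        if p > 0.0:  # skip points where p underflows to zero
            total += p * math.log(p / q) * dx
    return total


if __name__ == "__main__":
    print(kl_closed_form(1.0, math.log(0.25)), kl_numeric(1.0, math.log(0.25)))
```

The closed form is zero exactly when the mean is 0 and the variance is 1, which is why minimizing it pulls the hidden vector toward the standard normal distribution.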

In addition, the variational autoencoder not only lets us generate hidden vectors at random, but also improves the generalization ability of the network.

Finally, here is the code for the VAE:

import torch
import torch.nn.functional as F
from torch import nn


class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()

        self.fc1 = nn.Linear(784, 400)
        self.fc21 = nn.Linear(400, 20)  # latent mean
        self.fc22 = nn.Linear(400, 20)  # latent log variance
        self.fc3 = nn.Linear(20, 400)
        self.fc4 = nn.Linear(400, 784)

    def encode(self, x):
        h1 = F.relu(self.fc1(x))
        return self.fc21(h1), self.fc22(h1)

    def reparametrize(self, mu, logvar):
        # sigma = exp(0.5 * log(sigma^2))
        std = torch.exp(0.5 * logvar)
        # randn_like draws eps ~ N(0, I) on the same device and dtype as std,
        # replacing the explicit CUDA check and the deprecated Variable wrapper
        eps = torch.randn_like(std)
        return eps * std + mu

    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        # sigmoid keeps the outputs in (0, 1), as nn.BCELoss requires
        return torch.sigmoid(self.fc4(h3))

    def forward(self, x):
        # x: flattened images of shape (batch, 784) with values in [0, 1]
        mu, logvar = self.encode(x)
        z = self.reparametrize(mu, logvar)
        return self.decode(z), mu, logvar

The results of the VAE are much better than those of the ordinary autoencoder. The result is shown below.

 

The shortcomings of the VAE are also obvious. It directly computes the mean squared error between the generated picture and the original picture instead of learning adversarially like a GAN, which makes the generated pictures somewhat blurry. There is already work that combines VAE and GAN, using the structure of a VAE but training it adversarially; you can refer to this paper for details.

