In-depth understanding of autoencoders (generate images with variational autoencoders)

This post summarizes material from Deep Learning by Goodfellow, Bengio, and Courville (nicknamed the "flower book") [2] and Deep Learning with Python by François Chollet [1].

autoencoder

An autoencoder is a type of neural network that is trained to attempt to copy its input to its output. Inside the autoencoder there is a hidden layer $h$ that produces an encoding (code) used to represent the input.

We can think of an autoencoder as consisting of two parts: an encoder, represented by the function $h = f(x)$, and a decoder that produces the reconstruction $r = g(h)$.
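
As a rough illustration of the two parts, here is a minimal NumPy sketch. The layer sizes, the random weights, and the `tanh` nonlinearity are assumptions made for this example only; they do not come from either book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration
input_dim, code_dim = 8, 3
W_enc = rng.normal(size=(code_dim, input_dim)) * 0.1
W_dec = rng.normal(size=(input_dim, code_dim)) * 0.1

def f(x):
    """Encoder: map an input x to its code h = f(x)."""
    return np.tanh(W_enc @ x)

def g(h):
    """Decoder: map a code h to a reconstruction r = g(h)."""
    return W_dec @ h

x = rng.normal(size=input_dim)
h = f(x)   # encoding of the input
r = g(h)   # reconstruction, same shape as the input
print(h.shape, r.shape)
```

With untrained weights the reconstruction `r` is of course meaningless; training adjusts `f` and `g` so that `r` approximates `x`.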

But an autoencoder that simply copies its input is not particularly useful. We usually impose constraints on the autoencoder so that it can only copy approximately, and only copy inputs that resemble the training data. These constraints force the model to prioritize which aspects of the input to copy, so it tends to learn useful properties of the data.

Traditionally, autoencoders were used for dimensionality reduction or feature learning. In recent years, the connection between autoencoders and latent variable models has brought autoencoders to the forefront of generative modeling. The key idea of image generation is to find a low-dimensional latent space of representations in which any point can be mapped to a realistic image. Once such a latent space has been found, we can sample points from it at random and map them to image space, producing images that have never been seen before.

GANs and VAEs (variational autoencoders) are two different strategies for learning such latent spaces of image representations. VAEs are great for learning well-structured latent spaces, where specific directions encode meaningful axes of variation in the data. GANs can generate highly realistic images, but their latent space may be less structured and less continuous. VAEs are covered later in this post.


undercomplete autoencoder

A traditional image autoencoder takes an image, maps it to a latent vector space through an encoder module, and then decodes it back to an output with the same dimensions as the original image through a decoder module. The autoencoder is then trained using the input images themselves as the target data.

As mentioned above, copying the input to the output sounds useless, but we usually do not care about the output of the decoder. Instead, we hope that training the autoencoder to copy its input will result in $h$ taking on useful properties.

One way to get useful features from an autoencoder is to constrain $h$ to have a smaller dimension than $x$. An autoencoder whose encoding dimension is smaller than the input dimension is called an undercomplete autoencoder. Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data.

But if the encoder and decoder are given too much capacity, the autoencoder can perform the copying task without capturing any useful features of the data.
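
To make the undercomplete setting concrete, the following is a minimal sketch of a linear autoencoder trained by plain gradient descent on mean squared error. The toy data, the purely linear encoder and decoder, the learning rate, and the number of steps are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption): 200 points in 5-D that lie on a 2-D subspace
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(200, 2)) @ basis

code_dim = 2  # encoding dimension is smaller than the input dimension (5)
W_enc = rng.normal(size=(5, code_dim)) * 0.1
W_dec = rng.normal(size=(code_dim, 5)) * 0.1

def reconstruction_error(X, W_enc, W_dec):
    """Mean squared error between X and its reconstruction g(f(X))."""
    R = (X @ W_enc) @ W_dec
    return np.mean((R - X) ** 2)

initial_error = reconstruction_error(X, W_enc, W_dec)

lr = 0.01
for _ in range(2000):
    H = X @ W_enc                    # codes h = f(x)
    R = H @ W_dec                    # reconstructions r = g(h)
    G = 2.0 * (R - X) / X.size       # gradient of the MSE with respect to R
    grad_dec = H.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_error = reconstruction_error(X, W_enc, W_dec)
print(initial_error, final_error)
```

Because the data here really is two-dimensional, a 2-D code is enough to reconstruct it well; with real images the bottleneck forces the model to keep only the most salient structure.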

regularized autoencoder

An undercomplete autoencoder whose encoding dimension is smaller than the input dimension can learn the most salient features of the data distribution, but if such an autoencoder is given too much capacity, it cannot learn any useful information. We can allow autoencoders to learn interesting latent representations of data by imposing various constraints on the encoding (i.e., the output of the encoder).

sparse autoencoder

A sparse autoencoder simply adds a sparsity penalty $\Omega(h)$ on the encoding layer to the reconstruction error during training:
$$L(x, g(f(x))) + \Omega(h)$$

The notation is the same as above: $h = f(x)$ is the output of the encoder. Sparse autoencoders are commonly used to learn features for other tasks such as classification.
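
The training criterion above can be sketched in a few lines of NumPy. The text does not fix a particular form for $\Omega(h)$; this example assumes an L1 penalty $\Omega(h) = \lambda \sum_i |h_i|$ and a mean squared reconstruction error, and the example vectors are made up.

```python
import numpy as np

def sparse_autoencoder_loss(x, r, h, lam=1e-3):
    """Reconstruction error L(x, r) plus an (assumed) L1 sparsity penalty on h."""
    reconstruction = np.mean((x - r) ** 2)   # L(x, g(f(x))) as mean squared error
    penalty = lam * np.sum(np.abs(h))        # Omega(h) = lam * sum_i |h_i|
    return reconstruction + penalty

x = np.array([1.0, 0.0, 1.0, 0.0])           # input
r = np.array([0.9, 0.1, 0.8, 0.0])           # reconstruction g(f(x))
dense_code = np.array([0.5, -0.4, 0.3])      # many active units
sparse_code = np.array([0.5, 0.0, 0.0])      # few active units

loss_dense = sparse_autoencoder_loss(x, r, dense_code, lam=0.1)
loss_sparse = sparse_autoencoder_loss(x, r, sparse_code, lam=0.1)
print(loss_dense, loss_sparse)
```

For the same reconstruction quality, the code with fewer active units gets the lower loss, which is what pushes the encoder toward sparse representations.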

Denoising Autoencoder

Traditional autoencoders minimize the reconstruction error
$$L(x, g(f(x)))$$

The denoising autoencoder (DAE) instead minimizes
$$L(x, g(f(\tilde{x})))$$

where $\tilde{x}$ is a copy of $x$ that has been corrupted by some form of noise. A denoising autoencoder must therefore undo this corruption rather than simply copy its input. We introduce a corruption process $C(\tilde{x} \mid x)$: a conditional distribution giving the probability of producing the corrupted sample $\tilde{x}$ from a data sample $x$. The autoencoder then learns a reconstruction distribution $p_{\text{reconstruct}}(x \mid \tilde{x})$ from training pairs $(x, \tilde{x})$ as follows:

  1. Sample a training example $x$ from the training data;
  2. Sample a corrupted version $\tilde{x}$ from $C(\tilde{x} \mid x)$;
  3. Use $(x, \tilde{x})$ as a training example for estimating the autoencoder's reconstruction distribution $p_{\text{reconstruct}}(x \mid \tilde{x}) = p_{\text{decoder}}(x \mid h)$, where $h = f(\tilde{x})$ is the encoder output.

We can think of the denoising autoencoder as mapping the corrupted data point $\tilde{x}$ back to the original data point $x$.
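
The three-step procedure above can be sketched as follows. The choice of additive Gaussian noise for $C(\tilde{x} \mid x)$, the noise level, and the identity "encoder" and "decoder" used as stand-ins are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, noise_std=0.3):
    """Sample x_tilde from C(x_tilde | x); here additive Gaussian noise (an assumption)."""
    return x + rng.normal(scale=noise_std, size=x.shape)

def denoising_loss(x, x_tilde, f, g):
    """L(x, g(f(x_tilde))): the target is the clean x, the input is the corrupted x_tilde."""
    return np.mean((x - g(f(x_tilde))) ** 2)

def identity(v):
    """Stand-in for an encoder or decoder that just copies its input."""
    return v

x = rng.uniform(size=10)                                # step 1: sample a training example
x_tilde = corrupt(x)                                    # step 2: sample a corrupted version
loss = denoising_loss(x, x_tilde, identity, identity)   # step 3: score the pair (x, x_tilde)
print(loss)
```

A model that merely copies its input leaves the noise in place and pays a nonzero loss, so the denoising objective cannot be solved by copying.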

[Figure: a denoising autoencoder maps corrupted points $\tilde{x}$ back toward the original data points $x$.]
In the figure, the dotted circle represents the corruption process $C(\tilde{x} \mid x)$, and what the autoencoder learns is the vector field $g(f(x)) - x$.

contractive autoencoder

Another strategy for regularizing autoencoders is to use a penalty term $\Omega$ that depends on both the code and the input:
$$L(x, g(f(x))) + \Omega(h, x)$$

$$\Omega(h, x) = \lambda \sum_i \Vert \nabla_x h_i \Vert^2$$

This forces the model to learn a function that does not change much when $x$ changes slightly. An autoencoder regularized in this way is called a contractive autoencoder (CAE).
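
For a concrete feel for the penalty, the sketch below evaluates $\Omega(h, x)$ for a single-layer sigmoid encoder, where the gradients have a closed form, and compares the result with a finite-difference estimate. The sigmoid encoder, its random weights, and $\lambda = 1$ are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # hypothetical encoder weights: 5 inputs -> 3 code units
b = rng.normal(size=3)

def encode(x):
    """Sigmoid encoder h = sigmoid(W x + b)."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def contractive_penalty(x, lam=1.0):
    """Omega(h, x) = lam * sum_i ||grad_x h_i||^2, using the analytic Jacobian."""
    h = encode(x)
    jacobian = (h * (1.0 - h))[:, None] * W   # row i is grad_x h_i
    return lam * np.sum(jacobian ** 2)

def numerical_penalty(x, lam=1.0, eps=1e-6):
    """The same quantity via central finite differences, as a sanity check."""
    columns = [(encode(x + eps * e) - encode(x - eps * e)) / (2 * eps)
               for e in np.eye(x.size)]
    return lam * np.sum(np.stack(columns, axis=1) ** 2)

x = rng.normal(size=5)
analytic = contractive_penalty(x)
numeric = numerical_penalty(x)
print(analytic, numeric)
```

The penalty is small where the code barely reacts to small changes in `x`, which is exactly the behavior the regularizer rewards.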


Variational Autoencoder

Instead of compressing an input image into a fixed encoding in a latent space, a VAE turns the image into the parameters of a statistical distribution, namely a mean and a variance. Essentially, this means we assume that the input image was generated by a statistical process, and that the randomness of this process should be taken into account during encoding and decoding. The VAE then uses the mean and variance parameters to randomly sample one element of the distribution and decodes that element back to the original input.


From a technical point of view, VAE works as follows:

  1. An encoder module converts the input sample input_img into two parameters in the latent space, z_mean and z_log_variance;
  2. We assume that the latent normal distribution is capable of generating the input image, and randomly sample a point z from it: z = z_mean + exp(0.5 * z_log_variance) * epsilon, where epsilon is a random tensor of small values and exp(0.5 * z_log_variance) is the standard deviation corresponding to the log-variance;
  3. A decoder module maps this point in the latent space back to the original input image.

Because epsilon is random, this process ensures that every point close to the latent location where input_img was encoded can be decoded into something similar to input_img, which forces the latent space to be continuously meaningful. Any two nearby points in the latent space will decode to highly similar images. Continuity, combined with the low dimensionality of the latent space, forces every direction in the latent space to encode a meaningful axis of variation of the data, which makes the latent space very well structured.
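
The sampling step can be checked numerically with a short NumPy sketch. It assumes that z_log_var stores the logarithm of the variance, so the standard deviation is taken as exp(0.5 * z_log_var); the particular means, variances, and sample count are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(z_mean, z_log_var):
    """Draw z = z_mean + sigma * epsilon with epsilon ~ N(0, I)."""
    epsilon = rng.standard_normal(size=z_mean.shape)
    sigma = np.exp(0.5 * z_log_var)   # log-variance -> standard deviation
    return z_mean + sigma * epsilon

n_samples = 100_000
# Hypothetical encoder outputs: mean (1, -2) and variance (0.25, 4) in a 2-D latent space
z_mean = np.tile([1.0, -2.0], (n_samples, 1))
z_log_var = np.tile(np.log([0.25, 4.0]), (n_samples, 1))

z = sample_z(z_mean, z_log_var)
print(z.mean(axis=0), z.std(axis=0))
```

The empirical mean and standard deviation of `z` should come out close to (1, -2) and (0.5, 2). Writing the sample as a deterministic function of the parameters plus independent noise is what lets gradients flow back into the encoder.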

The approximate code of VAE is as follows:

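z_mean, z_log_variance = encoder(input_img)  # encode the input into mean and log-variance parameters

z = z_mean + exp(0.5 * z_log_variance) * epsilon  # sample a latent point using random noise epsilon

reconstructed_img = decoder(z)  # decode z back into an image

model = Model(input_img, reconstructed_img)  # instantiate the autoencoder model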

Let's define the encoder network first:

import keras
from keras import layers
from keras import backend as K
from keras.models import Model
import numpy as np

img_shape = (28, 28, 1)
batch_size = 16
latent_dim = 2  # dimensionality of the latent space: a 2D plane

input_img = keras.Input(shape=img_shape)

# Convolutional feature extractor; the strided layer halves the spatial size to 14 x 14
x = layers.Conv2D(32, 3, padding='same', activation='relu')(input_img)
x = layers.Conv2D(64, 3, padding='same', activation='relu',
                  strides=(2, 2))(x)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
# Remember the feature-map shape so the decoder can reshape back to it
shape_before_flattening = K.int_shape(x)

x = layers.Flatten()(x)
x = layers.Dense(32, activation='relu')(x)

# The input image ends up encoded into these two parameters
z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)

The following code uses z_mean and z_log_var to generate a latent-space point z. In Keras, everything needs to be a layer, so code that is not part of a built-in layer should be wrapped in a Lambda layer (or a custom layer):

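def sampling(args):
    # Reparameterization: draw epsilon ~ N(0, I), then shift and scale it
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim),
                              mean=0., stddev=1.)
    # exp(0.5 * log-variance) is the standard deviation, consistent with the KL term in the loss
    return z_mean + K.exp(0.5 * z_log_var) * epsilon

z = layers.Lambda(sampling)([z_mean, z_log_var])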

The following code shows the implementation of the decoder:

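# Input where z is fed to the decoder
decoder_input = layers.Input(K.int_shape(z)[1:])

# Upsample the latent vector to the size of the last encoder feature map
x = layers.Dense(np.prod(shape_before_flattening[1:]),
                 activation='relu')(decoder_input)

# Reshape into a feature map with the same shape as before the encoder's Flatten layer
x = layers.Reshape(shape_before_flattening[1:])(x)

# Transposed convolution with stride 2 brings the spatial size back to 28 x 28
x = layers.Conv2DTranspose(32, 3, padding='same', activation='relu',
                           strides=(2, 2))(x)

# One output channel with sigmoid activation gives pixel values in [0, 1]
x = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)

# Decoder model that turns decoder_input into a decoded image
decoder = Model(decoder_input, x)

# Apply the decoder to z to recover the decoded image
z_decoded = decoder(z)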

A VAE is trained with two losses: a reconstruction loss and a regularization loss. To set this up, we write a custom layer that uses the built-in add_loss layer method to create the loss we want:

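class CustomVariationalLayer(keras.layers.Layer):

    def vae_loss(self, x, z_decoded):
        x = K.flatten(x)
        z_decoded = K.flatten(z_decoded)
        # Reconstruction loss: how well the decoded image matches the input
        xent_loss = keras.metrics.binary_crossentropy(x, z_decoded)
        # Regularization loss: KL-divergence term pulling the latent distribution toward N(0, I)
        kl_loss = -5e-4 * K.mean(
            1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
        return K.mean(xent_loss + kl_loss)

    def call(self, inputs):
        x = inputs[0]
        z_decoded = inputs[1]
        loss = self.vae_loss(x, z_decoded)
        self.add_loss(loss, inputs=inputs)
        return x  # this output is not used, but a layer must return something

# Call the custom layer on the input and the decoded output to obtain the final model output
y = CustomVariationalLayer()([input_img, z_decoded])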
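
The expression inside kl_loss above has the same form as the closed-form KL divergence between a diagonal Gaussian and the standard normal. The NumPy sketch below evaluates the unscaled version, $-\frac{1}{2}\sum_j (1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)$; the function name and test values are assumptions for illustration, and the layer above uses a mean with a factor of -5e-4 rather than a sum with -0.5, which down-weights this term relative to the reconstruction loss.

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_log_var):
    """KL( N(z_mean, exp(z_log_var)) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * np.sum(
        1.0 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1)

# The divergence vanishes when the encoder outputs exactly the prior
kl_zero = kl_to_standard_normal(np.zeros(2), np.zeros(2))

# Moving the mean away from the origin is penalized
kl_shifted = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
print(kl_zero, kl_shifted)
```

This is the term that keeps encodings of different images from drifting arbitrarily far apart in the latent space.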

Finally, instantiate the model and start training. Because the loss is included in a custom layer, there is no need to specify an external loss (i.e. loss=None) at compile time, which means no target data needs to be passed in during training.

from keras.datasets import mnist

vae = Model(input_img, y)
vae.compile(optimizer='rmsprop', loss=None)
vae.summary()

(x_train, _), (x_test, y_test) = mnist.load_data()

x_train = x_train.astype('float32') / 255.
x_train = x_train.reshape(x_train.shape + (1,))
x_test = x_test.astype('float32') / 255.
x_test = x_test.reshape(x_test.shape + (1,))

vae.fit(x=x_train, y=None,
        shuffle=True,
        epochs=10,
        batch_size=batch_size,
        validation_data=(x_test, None))

If you run into an error at this point, add the following statements at the very beginning and rerun all of the code above:

from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

Once the model is trained, we can use the decoder network to turn arbitrary latent-space vectors into images:

import matplotlib.pyplot as plt
from scipy.stats import norm

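n = 15  # display a 15 x 15 grid of digits (225 in total)
digit_size = 28
figure = np.zeros((digit_size * n, digit_size * n))
# Map evenly spaced probabilities through the inverse normal CDF (ppf) so the
# grid covers the latent space according to the standard normal prior
grid_x = norm.ppf(np.linspace(0.05, 0.95, n))
grid_y = norm.ppf(np.linspace(0.05, 0.95, n))

for i, yi in enumerate(grid_x):
    for j, xi in enumerate(grid_y):
        z_sample = np.array([[xi, yi]])
        # Repeat z so that it forms a complete batch
        z_sample = np.tile(z_sample, batch_size).reshape(batch_size, 2)
        x_decoded = decoder.predict(z_sample, batch_size=batch_size)
        # Reshape the first digit in the batch from 28 x 28 x 1 to 28 x 28
        digit = x_decoded[0].reshape(digit_size, digit_size)
        figure[i * digit_size:(i + 1) * digit_size,
               j * digit_size:(j + 1) * digit_size] = digit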
        
plt.figure(figsize=(10, 10))
plt.imshow(figure, cmap='Greys_r')
plt.show()

As we follow a path through the latent space, we observe a gradual morphing of one digit into another.
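
One simple way to follow such a path is linear interpolation between two latent points. The helper below is a hypothetical sketch (the function name and the two endpoints are made up); it only builds the latent coordinates and does not call the trained decoder.

```python
import numpy as np

def latent_path(z_start, z_end, steps=10):
    """Return `steps` evenly spaced latent points on the line from z_start to z_end."""
    t = np.linspace(0.0, 1.0, steps)[:, None]   # interpolation weights, one per row
    return (1.0 - t) * z_start + t * z_end

# Hypothetical endpoints in the 2-D latent space
path = latent_path(np.array([-1.5, 0.0]), np.array([1.5, 0.0]), steps=7)
print(path.shape)
```

Each row of `path` is a latent vector that could be passed to decoder.predict in the same way as z_sample above, yielding one frame of the morphing sequence.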


References

[1] F. Chollet, Deep Learning with Python.

[2] I. J. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016, http://www.deeplearningbook.org.

Origin blog.csdn.net/myDarling_/article/details/128426946