Encoder-decoder architecture

1. Communication system

Before looking at the encoder-decoder in deep learning, we first introduce encoding and decoding, and modulation and demodulation, in communication systems.

  1. The purpose of communication is to exchange messages, which requires that the information received matches the information sent. However, the communication process is subject to interference.
  2. Original signal (analog signal): the signal initially formed at the transmitting end of a communication system. Its waveform is varied and irregular, so it is easily distorted by interference.
  3. The signal therefore needs to be processed so that it resists interference and its waveform is not severely deformed.

Hence the need for encoding and decoding, and for modulation and demodulation:

  1. At the sending end, source coding regularizes the original signal $S_1$, and channel coding reduces the distortion rate of the signal and makes errors easier to detect.
  2. Modulation is then used to strengthen the signal further, giving the intermediate signal $S_{enhance}$.
  3. At the receiving end, channel decoding and source decoding are performed as the inverse operations to obtain the output signal $S_1'$.
  4. This achieves high-fidelity, efficient signal transmission.

Encoding schemes: inverted non-return-to-zero coding, Manchester coding, Miller coding, modified Miller coding, error control coding.
Modulation schemes: amplitude-shift keying, frequency-shift keying, phase-shift keying, subcarrier modulation.
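
As a concrete illustration of one scheme from the list above, here is a minimal sketch of Manchester coding in Python. The bit-to-level mapping (0 as high-then-low, 1 as low-then-high) is one of the two conventions in common use and is an assumption here, as are the function names `manchester_encode` and `manchester_decode`. Every bit carries a mid-bit transition, so a pair without a transition can be flagged as corrupted.

```python
def manchester_encode(bits):
    """Map each bit to two half-bit levels (assumed convention: 0 -> 1,0 and 1 -> 0,1)."""
    levels = []
    for bit in bits:
        levels.extend((1, 0) if bit == 0 else (0, 1))
    return levels


def manchester_decode(levels):
    """Invert manchester_encode; reject pairs that lack the mid-bit transition."""
    if len(levels) % 2 != 0:
        raise ValueError("expected an even number of half-bit levels")
    bits = []
    for first, second in zip(levels[0::2], levels[1::2]):
        if first == second:
            raise ValueError("missing mid-bit transition: signal was corrupted")
        bits.append(0 if first == 1 else 1)
    return bits


message = [1, 0, 1, 1]
encoded = manchester_encode(message)   # sending end
decoded = manchester_decode(encoded)   # receiving end
```

The decoder is the exact inverse of the encoder, which mirrors the sender/receiver symmetry described in the steps above.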

2. Encoder-Decoder

2.1 Introduction

In machine learning, many problems can be abstracted into a similar model:

Machine translation: convert a sentence in one language into a sentence in another language.
Automatic summarization: extract a summary from a piece of text.
Image captioning: generate a text description for an image, i.e. convert image data into text data.

Completing these conversions directly with a single function $y = f(x)$ can be difficult. For example, in machine translation the lengths of the input and output are not fixed, and they may not be equal.

Therefore, the input data $x$ is first converted into an intermediate representation $z$, and $z$ is then mapped to the output data $y$. This is the encoder-decoder architecture. (It resembles the encoding and decoding process of a communication system: original signal $S_1$ → intermediate signal $S_{enhance}$ → output signal $S_1'$.)

2.2 Encoder-Decoder architecture

Encoder-decoder is an abstract concept for deep learning models. Many models are based on this architecture, such as CNN, RNN, LSTM and Transformer.

  1. Encoder: Responsible for converting input into features
  2. Decoder: Responsible for converting features into targets
Figure: Generalized encoder-decoder architecture

  • CNN (Convolutional Neural Network): can be viewed as the case in which the decoder does not need to accept input of its own.
Figure: CNN encoder-decoder architecture

  • RNN (Recurrent Neural Network): can be viewed as the case in which the decoder does need to accept input.
Figure: RNN encoder-decoder architecture

2.3 Detailed introduction

2.3.1 Principal component analysis (PCA)

PCA is an unsupervised dimensionality reduction algorithm. It maps a high-dimensional vector $x$ to a low-dimensional vector $z$, provided that $z$ retains the main information of $x$. Applying Formula 1 performs the dimensionality reduction. This process is similar to an encoder: it encodes the high-dimensional vector $x$ into the low-dimensional vector $z$ (written $y$ in Formula 1).
$W$: projection matrix, computed from the sample set; $m$: mean vector of the sample set.
$$y = W(x - m) \tag{1}$$

Sometimes we need to recover the original vector $x$ from the reduced vector $z$. Applying Formula 2 performs this data reconstruction, which is the reverse of dimensionality reduction. The reconstruction process is similar to a decoder: it decodes the original high-dimensional vector $x$ (approximately) from the low-dimensional vector $z$ (written $y$ in Formula 2).
$$W^T y + m \tag{2}$$
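
The two formulas can be tried out with a few lines of NumPy. This is a minimal sketch, not a reference implementation: the toy data set, the choice of two retained components, and the names `pca_encode` and `pca_decode` are all assumptions made for illustration. The rows of `W` are the leading eigenvectors of the sample covariance matrix, which is one standard way to compute the projection matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample set (assumption): 200 points in 3-D that lie close to a 2-D plane.
latent = rng.normal(size=(200, 2))
basis = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
X = latent @ basis + 0.01 * rng.normal(size=(200, 3)) + np.array([5.0, -2.0, 3.0])

m = X.mean(axis=0)                        # mean vector of the sample set
cov = np.cov(X - m, rowvar=False)         # 3x3 sample covariance
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :2].T             # rows = top-2 principal directions, shape (2, 3)


def pca_encode(x):
    """Formula 1: project a centered sample onto the principal directions."""
    return W @ (x - m)


def pca_decode(y):
    """Formula 2: map the low-dimensional code back to the original space."""
    return W.T @ y + m


x = X[0]
y = pca_encode(x)        # 2-D code
x_rec = pca_decode(y)    # approximate reconstruction of x
```

Because the toy data is almost two-dimensional, `x_rec` ends up very close to `x`; with data that spreads across all dimensions, the reconstruction error would be larger.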

2.3.2 Autoencoder (AE)

An autoencoder is a special neural network used for feature extraction and dimensionality reduction. The simplest autoencoder consists of an input layer, a hidden layer, and an output layer. The mapping of the hidden layer acts as the encoder, and the mapping of the output layer acts as the decoder.

During training, the encoder (hidden layer) maps the input vector $x$ to the encoded vector $z$; the decoder (output layer) maps $z$ to the reconstructed vector $y$ (an approximation of the input vector $x$). The encoder and decoder are trained simultaneously, and the training goal is to minimize the reconstruction error, i.e. the error between the reconstructed vector $y$ and the input vector $x$, which is very similar to PCA. The label of a sample $x$ is the sample itself.

After training, only the encoder is used at prediction time; the decoder is no longer needed. The encoder's output is then used further, for classification, regression and other tasks.

Figure: Autoencoder. Input layer: 6 neurons (the input data is a 6-dimensional vector); hidden layer: 3 neurons (encoded vector); output layer: 6 neurons (reconstructed vector)

Let $h$ be the mapping function of the encoder, $g$ the mapping function of the decoder, $l$ the number of training samples, and $\theta$ and $\theta'$ the parameters to be learned for the encoder and decoder respectively. With the Euclidean distance loss, the objective optimized during training is Formula 3:
$$\min \frac{1}{2l} \sum_{i=1}^{l} \left\| x_i - g_{\theta'}(h_\theta(x_i)) \right\|_2^2 \tag{3}$$
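
To make the objective in Formula 3 concrete, the following is a minimal NumPy sketch of a 6-3-6 autoencoder trained by plain gradient descent. It is deliberately simplified: both layers are linear with no bias or activation, the data is random, and the learning rate and step count are arbitrary. All of these choices, and the variable names, are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
l, d_in, d_hid = 100, 6, 3                      # samples, input size, hidden size
X = rng.normal(size=(l, d_in))                  # toy training set (assumption)

W_enc = 0.1 * rng.normal(size=(d_in, d_hid))    # encoder parameters (theta)
W_dec = 0.1 * rng.normal(size=(d_hid, d_in))    # decoder parameters (theta')


def h(X):
    """Encoder: map inputs to 3-D codes."""
    return X @ W_enc


def g(Z):
    """Decoder: map codes back to 6-D reconstructions."""
    return Z @ W_dec


def loss(X):
    """Formula 3: mean over samples of half the squared reconstruction error."""
    return np.sum((X - g(h(X))) ** 2) / (2 * len(X))


initial_loss = loss(X)
lr = 0.05
for _ in range(300):
    Z = h(X)
    Y = g(Z)
    dY = (Y - X) / l                    # gradient of the loss w.r.t. Y
    grad_dec = Z.T @ dY                 # gradient w.r.t. decoder weights
    grad_enc = X.T @ (dY @ W_dec.T)     # gradient w.r.t. encoder weights
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final_loss = loss(X)
```

Note that the sample itself serves as the label: the only training signal is the difference between `X` and its reconstruction.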

2.3.3 Variational autoencoder (VAE)

A variational autoencoder is a deep generative model used to generate data such as images and sound, similar in purpose to a generative adversarial network (GAN). Despite the name, variational autoencoders differ considerably from autoencoders and serve a completely different purpose.

Consider a data generation problem such as writing, for example writing handwritten digits like those in the MNIST data set.

Figure: MNIST handwritten digit recognition

If you first collect training samples and then have the algorithm output ("write") those samples unchanged, the generated samples will have no diversity. The solution is as follows:

Figure: First generate some random numbers, then transform them to generate complex sample data

Simple random-number-based generation: first learn features from images of written characters, then randomly perturb those features to generate new samples. Variational autoencoders adopt this idea. The structure is shown in the figure below:

Figure: Autoencoder (latent variables: features learned from images)

The encoder and decoder are trained at the same time. The training objective is the right-hand side of Formula 4:
$$\log p(x) - D_{KL}[q(z|x) \,\|\, p(z|x)] = E_{z \sim q}[\log p(x|z)] - D_{KL}[q(z|x) \,\|\, p(z)] \tag{4}$$

After training, the prediction phase can generate samples directly. First, a random number is drawn from the normal distribution $Z \sim N(0, 1)$ and fed to the decoder; the result is the generated sample. At this stage the encoder is no longer needed, and there is no need to apply the mean and variance transformations to the random numbers.

Figure: Prediction stage
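
The difference between the training path and the prediction path can be sketched in NumPy. This is a hedged illustration under several assumptions: $q(z|x)$ is taken to be a diagonal Gaussian, for which the KL term on the right-hand side of Formula 4 has a closed form against $N(0, I)$; the decoder is a hypothetical, untrained linear layer with a sigmoid; and the names `reparameterize` and `kl_to_standard_normal` are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, data_dim = 2, 4
W_dec = rng.normal(size=(latent_dim, data_dim))   # hypothetical, untrained decoder weights


def decoder(z):
    """Toy decoder: linear layer followed by a sigmoid, outputs values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var)


def reparameterize(mu, log_var):
    """Training only: shift and scale standard normal noise by the encoder outputs."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


# Training path: the encoder would supply mu and log_var (fixed by hand here).
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -0.5])
z_train = reparameterize(mu, log_var)
kl = kl_to_standard_normal(mu, log_var)

# Prediction path: no encoder, no mean/variance transform, just N(0, I) noise.
z = rng.standard_normal(latent_dim)
sample = decoder(z)
```

At prediction time only the last two lines are needed, which matches the description above: the encoder and the mean and variance transformations drop out.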

2.3.4 Dense prediction: fully convolutional networks

The goal of image segmentation is to determine which object each pixel of an image belongs to, that is, to classify every pixel. It is a dense prediction problem: a prediction is made pixel by pixel.

Disadvantages of a standard convolutional network (CNN): (1) After repeated convolution and pooling, the image shrinks, so the final output cannot be matched to every pixel of the original image. (2) The fully connected layers that follow the convolutional layers map the image to a fixed-length vector, which does not fit the segmentation task.

Advantages of a fully convolutional network (FCN): (1) Using deconvolution, an output image the same size as the original input is obtained from the preceding convolutional feature maps. (2) The fully connected layers of the CNN are removed and replaced with convolutions.

An FCN predicts the class of every pixel of the input image from the convolutional feature maps. The network accepts input images of any size and produces an output image of the same size, with input and output pixels in one-to-one correspondence. It supports end-to-end, pixel-to-pixel training.

The first half of an FCN consists of convolutional and pooling layers, which act as the encoder and extract features from the input image. The second half consists of deconvolution layers, which act as the decoder and decode the result image from the features. A typical network structure is shown in the figure below.

Figure: Fully convolutional network FCN

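The first half of the SegNet semantic segmentation network is the encoder, composed of multiple convolutional and pooling layers. The second half is the decoder, composed of multiple upsampling and convolutional layers. The last layer of the decoder is a softmax layer, used to classify pixels.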

Figure: SegNet semantic segmentation network

Here, the encoder network generates feature maps that carry semantic information; the decoder network maps the low-resolution feature maps output by the encoder back to the size of the input image for pixel-by-pixel classification.
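
The shape bookkeeping behind this encoder-decoder split can be sketched in NumPy. The example is only a stand-in for the real networks: 2x2 max pooling plays the role of the encoder, nearest-neighbour upsampling replaces the learned deconvolution or upsampling layers, and the per-class weights are hypothetical random numbers rather than trained filters.

```python
import numpy as np


def max_pool_2x2(x):
    """Halve height and width by taking the maximum of each 2x2 block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


def upsample_2x(x):
    """Double height and width by repeating each value (nearest neighbour)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def softmax(logits, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


image = np.arange(64, dtype=float).reshape(8, 8)

# "Encoder": resolution shrinks from 8x8 to 2x2.
features = max_pool_2x2(max_pool_2x2(image))

# "Decoder": resolution is restored to 8x8.
restored = upsample_2x(upsample_2x(features))

# Final softmax layer: one score per class and pixel, then per-pixel labels.
num_classes = 3
rng = np.random.default_rng(0)
class_weights = rng.normal(size=(num_classes, 1, 1))   # hypothetical 1x1 filters
probs = softmax(class_weights * restored[None, :, :], axis=0)
labels = probs.argmax(axis=0)
```

The point of the sketch is that `labels` has the same height and width as `image`, so each input pixel receives exactly one class.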

2.3.5 Sequence-to-sequence learning

In some problems, the length of the input sequence and the output sequence are not necessarily equal, and the length of the output sequence is unknown.

For example, in machine translation, when a sentence in one language is translated into another language, the length of the sentence (the number of words it contains) generally changes. The English "what's your name" is a sequence of 3 words, while its Chinese translation is a sequence of 4 Chinese characters.

A standard RNN cannot handle this situation, where the input and output sequences differ in length. One way to solve this kind of problem is sequence-to-sequence learning, referred to as seq2seq.

The seq2seq framework consists of two parts, the encoder and the decoder, both of which are recurrent neural networks. The task is to predict (map) one sequence from another: $S_{src} \to S_{dst}$, where the former is the source sequence and the latter is the target sequence; their lengths may differ.

The encoder network takes the input sequence and uses the hidden state $h_T$ produced at the last time step $T$ as the encoding $v$ of the sequence. $v$ is a fixed-length vector that contains the information of the input sequence at all time steps $1, \dots, T$. The decoder network takes $v$ and $y_t$ as input at each time step, from which the conditional probability of the target sequence $y_1, \dots, y_{T'}$ can be computed.

For machine translation, the encoder receives each word of the source-language sentence in turn, up to eos, and finally obtains the semantic vector $v$. The decoder is first fed bos, the beginning-of-sentence token, and from bos and $v$ predicts the probability of each word being the next word, selecting the word (or several words) with the highest probability. This word is then fed into the decoder together with $v$ to obtain the next word. The cycle continues until eos, the end-of-sentence token, is produced, at which point the translation ends. (Beam search is used here.) As shown in the figure below:

Figure: Machine translation using seq2seq
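
The encode-then-decode loop can be written out as a small NumPy sketch. Several things here are assumptions and not part of the original description: the weights are random and untrained, so the output tokens are meaningless; the vocabulary, the weight names and the way $v$ is injected at every step are illustrative choices; and decoding is greedy (the single best word at each step), which is the simplest special case of beam search.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<bos>", "<eos>", "a", "b", "c"]   # toy vocabulary (assumption)
BOS, EOS = 0, 1
V, H = len(vocab), 8

E = rng.normal(size=(V, H))                 # token embeddings (untrained)
W_xh = 0.5 * rng.normal(size=(H, H))        # input-to-hidden weights
W_hh = 0.5 * rng.normal(size=(H, H))        # hidden-to-hidden weights
W_vh = 0.5 * rng.normal(size=(H, H))        # context-to-hidden weights (decoder)
W_hy = rng.normal(size=(H, V))              # hidden-to-vocabulary weights


def encode(src_ids):
    """Run the encoder RNN and return the last hidden state as the code v."""
    h = np.zeros(H)
    for token in src_ids:
        h = np.tanh(E[token] @ W_xh + h @ W_hh)
    return h


def greedy_decode(v, max_len=10):
    """Start from bos, feed v and the previous word at each step, stop at eos."""
    h = np.zeros(H)
    token, output = BOS, []
    for _ in range(max_len):
        h = np.tanh(E[token] @ W_xh + h @ W_hh + v @ W_vh)
        token = int(np.argmax(h @ W_hy))    # most probable next word
        if token == EOS:
            break
        output.append(token)
    return output


v = encode([2, 3, 4, EOS])                  # source sentence "a b c <eos>"
translation = greedy_decode(v)
```

Whatever the length of the source sentence, `v` has the same fixed size, and the length of `translation` is decided by the decoder alone. That is how the framework decouples input length from output length.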

2.3.6 Combination of CNN and RNN

FCN combines a CNN with a CNN; seq2seq combines an RNN with an RNN.

Within the encoder-decoder framework, a CNN and an RNN can also be combined with each other, and the pairing can be chosen flexibly to suit the task.

2.3.6.1 From image to text

Mapping from images to text refers to generating text explanations for images or videos.

(1) The CNN is the encoder, used to extract the semantic features of the image.
(2) The RNN is the decoder; its input is the semantic features of the image and its output is a text sequence of variable length. The structure is shown below:

Figure: From image to text

2.3.6.2 From text to image

A recurrent neural network combined with a deep convolutional generative adversarial network can generate an image from a piece of text, translating the visual concepts in the text into a pixel representation. As shown below:

Figure: From text to image

Similar to machine translation, an encoder-decoder framework is used.

(1) The text is converted into a vector (the semantic information of the text) by a CNN and an RNN. The output vector corresponds to the image.
(2) A generative adversarial network (a deep convolutional network) is trained on the generated text vectors and is responsible for generating images. The mapping it implements is: $\mathbb{R}^z \times \mathbb{R}^T \to \mathbb{R}^D$

Here $z$ is the dimension of the random noise, $T$ is the dimension of the vector obtained by vectorizing the text, and $D$ is the dimension of the generated image. The generative adversarial network takes a random noise vector and a text feature vector as input and outputs an image of the specified size.
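
The signature of this mapping can be illustrated with a short NumPy sketch. It only shows how the two inputs are combined into one image-sized output: the generator here is a single untrained linear layer with a tanh, whereas a real deep convolutional generator would use stacked transposed convolutions. The dimensions and names (`Z_DIM`, `T_DIM`, `generator`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_DIM, T_DIM = 16, 32            # noise and text-vector dimensions (assumed)
IMG_H, IMG_W = 8, 8
D_DIM = IMG_H * IMG_W            # number of pixels in the generated image

W_gen = 0.1 * rng.normal(size=(Z_DIM + T_DIM, D_DIM))   # hypothetical, untrained


def generator(noise, text_vec):
    """Map (noise, text vector) to an image with pixel values in [-1, 1]."""
    joint = np.concatenate([noise, text_vec])
    pixels = np.tanh(joint @ W_gen)
    return pixels.reshape(IMG_H, IMG_W)


text_vec = rng.normal(size=T_DIM)   # stand-in for the text encoder output

# Same text, different noise: two different images for one description.
img_a = generator(rng.standard_normal(Z_DIM), text_vec)
img_b = generator(rng.standard_normal(Z_DIM), text_vec)
```

The noise input is what lets a single text description yield many different images.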

3. Others

  1. In the Transformer, the encoder works in parallel during both training and testing; the decoder works in parallel during training and serially during testing.
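
Why the decoder can run in parallel during training but not during testing can be shown with a causal mask. The sketch below is simplified on purpose: it uses a single attention head with queries, keys and values all equal to the input (no learned projections), and the function name `masked_self_attention` is an assumption. During training the whole target sequence is known, so all positions are computed in one masked pass; during testing each position is computed only after the previous outputs exist.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; masked entries (-inf) get weight 0."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def masked_self_attention(X):
    """Self-attention where position t may only attend to positions <= t."""
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ X


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))          # 5 target positions, 4 features each

# Training: one pass over the full sequence.
parallel = masked_self_attention(X)

# Testing: step t only sees the prefix produced so far.
serial = np.stack([masked_self_attention(X[: t + 1])[-1] for t in range(5)])
```

Both computations give the same result, so the mask lets training reproduce the step-by-step behaviour of testing without stepping through the sequence.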

Origin blog.csdn.net/deer2019530/article/details/129675690