Deep Learning: Common Neural Networks

1. Deep Belief Network (DBN)

In 2006, the "father of neural networks" Geoffrey Hinton unveiled the deep belief network, which solved the training problem of deep neural networks in one fell swoop and spurred the rapid development of deep learning.
A Deep Belief Network (DBN) is a probabilistic generative model that can establish a joint probability distribution over input data and output categories.
The deep belief network tackles the optimization problem of deep neural networks through layer-by-layer training. Layer-by-layer training gives the entire network good initial weights, so that the network can reach the optimal solution with only fine-tuning.
Each hidden layer of the deep belief network plays a dual role: it serves as the hidden layer for the layer before it and as the visible layer for the layer after it.
In the layer-by-layer training, the most important building block is the "restricted Boltzmann machine" (RBM); the deep belief network as a whole can be regarded as a stack of restricted Boltzmann machines.

Boltzmann machine (BM)

The Boltzmann Machine (BM), proposed by Hinton in 1986, is a stochastic neural network rooted in statistical mechanics. In this kind of network a neuron has only two states (inactive and activated), represented by the binary values 0 and 1, and the value of the state is determined according to a probabilistic law.
Because this probabilistic law takes a form similar to the Boltzmann distribution proposed by the famous statistical physicist Ludwig Boltzmann, the network is named the "Boltzmann machine".
In physics, the Boltzmann distribution describes the energy distribution of the molecules of an ideal gas in thermal equilibrium under a conservative external force.
In statistical learning, if we regard the model to be learned as a high-temperature object, the learning process can be regarded as cooling it down until it reaches thermal equilibrium. Once the energy has converged to its minimum, the thermal equilibrium becomes stable; that is, when the energy is lowest the network is most stable, and at that point the network is optimal.

The Boltzmann machine (BM) can be used for both supervised and unsupervised learning.
In unsupervised learning, the hidden variables can be seen as internal feature representations of the visible variables and can learn complex regularities in the data. The price is that training a Boltzmann machine takes an extremely long time.

Restricted Boltzmann Machine (RBM)

A Restricted Boltzmann Machine (RBM)
is obtained by removing the intra-layer connections of the Boltzmann machine (BM); restricting the connections in this way yields the "restricted" Boltzmann machine.
It is a two-layer neural network, with a visible layer and a hidden layer.
The visible layer receives the data and the hidden layer processes it; the two layers are fully connected to each other, and there are no connections within the same layer.
The restricted Boltzmann machine feeds the hidden-layer output back to the visible layer to reconstruct the input, and by cyclically minimizing the reconstruction error between the visible layer and the hidden layer it finds a set of weight coefficients that make this error as small as possible.
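As a rough illustration of this reconstruction loop, here is a minimal sketch of an RBM trained with one step of contrastive divergence (CD-1, the standard way Hinton proposed to train RBMs) on toy binary data; the layer sizes, learning rate, and random "training set" are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1
W = rng.normal(0, 0.01, size=(n_visible, n_hidden))  # visible-to-hidden weights
b_v = np.zeros(n_visible)                            # visible-layer bias
b_h = np.zeros(n_hidden)                             # hidden-layer bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    """One contrastive-divergence (CD-1) update for a single binary sample v0."""
    global W, b_v, b_h
    # upward pass: hidden probabilities and a binary hidden sample
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    # downward pass: reconstruct the visible layer, then recompute hidden probabilities
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # move the weights so the reconstruction gets closer to the data
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_v += lr * (v0 - p_v1)
    b_h += lr * (p_h0 - p_h1)
    return np.mean((v0 - p_v1) ** 2)  # reconstruction error for this sample

data = (rng.random((20, n_visible)) > 0.5).astype(float)  # toy binary "training set"
for epoch in range(50):
    err = np.mean([cd1_step(v) for v in data])
print("final mean reconstruction error:", round(float(err), 4))
```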

The traditional back-propagation method applied to deep structures is feasible in principle, but in practice it cannot overcome the problem of gradient dispersion.

Gradient dispersion (gradient vanishing): when the error is backpropagated, the farther it propagates, the smaller the gradient becomes and the slower the parameters are updated.

This causes the parameters of the hidden layers near the output layer to have already converged, while near the input layer the hidden-layer parameters have hardly changed and may still be close to their randomly chosen initial values.
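A toy numerical illustration of this effect, assuming a plain chain of sigmoid units with the weights omitted: each backward step multiplies the gradient by the sigmoid derivative, which is at most 0.25, so the gradient reaching layers near the input becomes vanishingly small.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
grad = 1.0  # gradient at the output layer
for layer in range(20, 0, -1):   # walk backward from layer 20 toward layer 1
    s = sigmoid(rng.normal())    # activation of this layer (random pre-activation)
    grad *= s * (1 - s)          # multiply by the sigmoid derivative, which is <= 0.25
    if layer in (15, 10, 5, 1):
        print(f"gradient reaching layer {layer}: {grad:.2e}")
```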

2. Convolutional Neural Network (CNN)

A convolutional neural network is a neural network that uses convolution in place of general matrix multiplication in at least one of its layers.

What is convolution

Convolution is a mathematical operation performed on two functions; we call \((f*g)(n)\) the convolution of \(f\) and \(g\).

  • Continuous definition
    \((f*g)(n) = \int_{-\infty}^{\infty} f(\tau)\, g(n-\tau)\, d\tau\)
  • Discrete definition (a small numerical check follows these definitions)
    \((f*g)(n) = \sum_{\tau=-\infty}^{\infty} f(\tau)\, g(n-\tau)\)
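As a quick check of the discrete definition, here is a sketch in NumPy; the sequences f and g are arbitrary examples, and the nested loop computes the same sum as np.convolve.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.5, 0.5])

def conv_manual(f, g):
    # (f*g)(n) = sum over tau of f(tau) * g(n - tau), for finite sequences
    n_out = len(f) + len(g) - 1
    out = np.zeros(n_out)
    for n in range(n_out):
        for tau in range(len(f)):
            if 0 <= n - tau < len(g):
                out[n] += f[tau] * g[n - tau]
    return out

print(conv_manual(f, g))   # [0.5 1.5 2.5 1.5]
print(np.convolve(f, g))   # matches the manual loop
```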

Let \(x = \tau\) and \(y = n - \tau\); then \(x + y = n\), which corresponds to the straight line below.

(figure: conv_ops)

Sweeping across these straight lines one after another is just like rolling up a towel, hence, as the name implies, "convolution".
(figure: conv_ops)

In convolutional networks, the essence of convolution is a weighted summation: the kernel function g serves as the weighting factor applied to the input function f.
In effect, the binary function \(U(x, y) = f(x) g(y)\) is rolled up into a unary function \(V(t)\), jokingly called a "dimensionality-reduction strike".
\(V(t) = \int_{x+y=t} U(x, y) \, dx\); the functions f and g (equivalently, the variables x and y) have equal status, and a natural choice is to roll up along the straight line \(x + y = t\).

(figure: dice)

Finding the probability that two dice add up to 4 is an application scenario of convolution.

  • The probability distribution of the first die is f(1), f(2), ..., f(6)
  • The probability distribution of the second die is g(1), g(2), ..., g(6)

The probability that the two dice add up to 4 is \(f(1)g(3) + f(2)g(2) + f(3)g(1)\).
In standard form: \((f*g)(4) = \sum_{m=1}^{3} f(4-m)\, g(m)\)
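A small numerical check of this example, sketched with NumPy and assuming two fair six-sided dice:

```python
import numpy as np

f = np.full(6, 1 / 6)                 # P(first die shows 1..6)
g = np.full(6, 1 / 6)                 # P(second die shows 1..6)
sum_dist = np.convolve(f, g)          # sum_dist[k] = P(total = k + 2)
print(sum_dist[4 - 2])                # P(total = 4) = 3/36 ≈ 0.0833
print(f[0]*g[2] + f[1]*g[1] + f[2]*g[0])  # the same three terms written out
```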

Making steamed buns

A machine continuously produces steamed buns. Suppose the production rate is f(t);
then the total number of buns produced in a day is
\(\int_{0}^{24} f(t) \, dt\)
A bun gradually goes stale after it is produced; let the staleness function be g(t), so that, for example, 10 buns that have been out for a time t have gone stale by
\(10 \cdot g(t)\)
The total staleness, at the end of the day, of all the buns produced during the day is then
\(\int_{0}^{24} f(t)\, g(24-t) \, dt\)

Cooking fish

Convolution can be regarded as cooking: the input function is the raw material and the kernel function is the recipe. For the same input function, a carp:

  • if soy sauce has the larger weight in the kernel function, braised fish is output
  • if sugar and vinegar have the larger weight in the kernel function, West Lake vinegar fish is output
  • if chili has the larger weight in the kernel function, Korean-style spicy fish is output

Image Processing

Suppose a picture contains noise and we want to smooth it; the image can first be represented as a matrix.

If you want to smooth the point \(a_{1,1}\), you can take the matrix \(f\) formed by the neighborhood of \(a_{1,1}\), convolve it with the kernel \(g\), and then fill the result back.

\(f\) and \(g\) are computed as shown below; the calculation actually runs in the reverse direction, just like rolling up the towel.
(figure: conv_ops)
Computing \(c_{1,1}\) is written as \((f*g)(1,1) = \sum_{k=0}^{2} \sum_{h=0}^{2} f(h,k)\, g(1-h, 1-k)\)
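A minimal sketch of this 2-D convolution in NumPy, assuming a 3x3 averaging kernel and zero padding at the borders; the image values are made up for illustration.

```python
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]                      # true convolution flips the kernel
    padded = np.pad(image, ((kh // 2,), (kw // 2,)))  # zero padding keeps the output size
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

image = np.array([[1, 2, 1],
                  [2, 9, 2],      # the 9 is a noisy pixel to be smoothed
                  [1, 2, 1]], dtype=float)
kernel = np.full((3, 3), 1 / 9)   # simple averaging kernel
print(conv2d(image, kernel))
```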


Convolutional neural network features

The characteristics of the convolution operation make this kind of neural network well suited to processing data with a grid-like structure.
Typical grid data is the digital image: whether grayscale or color, it is a set of scalars or vectors defined on a two-dimensional grid of pixels.
Convolutional neural networks are widely used in image and text recognition and have gradually expanded to other fields such as natural language processing.

  • Sparse connectivity
    The kernel function of a convolutional layer is usually much smaller than the image.
    The image may have thousands of pixels in each dimension, but the kernel is at most tens of pixels across.
    Choosing a small kernel helps to discover subtle local details in the image and improves the storage and computational efficiency of the algorithm.
  • Parameter sharing
    The same parameters are used in multiple places in the model: in each round of training, a single kernel function is convolved with every block of the image.
  • Translation invariance
    When the input of the convolution is translated, the output is translated by the same amount, which shows that the translation operation and the action of the kernel function commute (a small numerical check follows this list).
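The check below is a sketch on a 1-D signal with NumPy; the signal and kernel are arbitrary examples, chosen so that nothing wraps around the array edges.

```python
import numpy as np

signal = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0])
kernel = np.array([0.25, 0.5, 0.25])

shifted_signal = np.roll(signal, 2)                   # translate the input by 2 positions
out_then_shift = np.roll(np.convolve(signal, kernel, mode="same"), 2)
shift_then_out = np.convolve(shifted_signal, kernel, mode="same")
print(np.allclose(out_then_shift, shift_then_out))    # True: the two orders agree
```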

Convolutional neural network layers

After the input image is fed into the convolutional neural network, it passes repeatedly through convolutional, excitation, and pooling layers, and finally the fully connected layer outputs the classification result (a minimal layer stack is sketched in code after the list below).

  • Input layer
    Receives the input data; usually some preprocessing is done, such as mean subtraction, normalization, PCA/whitening, etc.
  • Convolutional layer
    The convolutional layer is the core part of the convolutional neural network. Its parameters are one or more randomly initialized kernel functions. The kernel scans the input image row by row and column by column like a searchlight; all the convolution results computed during the scan form a matrix, and this new matrix is called a feature map. The features obtained by the convolutional layer are generally sent on to the excitation layer for processing.
  • Excitation layer
    Its main function is to apply a non-linear mapping to the output of the convolutional layer. Common excitation (activation) functions are sigmoid, tanh, ReLU, Leaky ReLU, ELU, and Maxout.
  • Pooling layer
    Sandwiched between successive convolutional and excitation layers, it compresses the amount of data and the number of parameters to reduce overfitting.
    In short, if the input is an image, the main function of the pooling layer is to compress the image.
    A common method, max pooling, divides the feature map into several rectangular regions and keeps the maximum value in each region.
  • Fully connected layer
    All neurons between the two layers are connected by weights; the fully connected layer usually sits at the end of the convolutional neural network and outputs the classification result.
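Here is a minimal sketch of such a layer stack, assuming PyTorch is available, with 28x28 grayscale inputs and 10 output classes; all sizes are illustrative assumptions, not taken from the text.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional layer: 16 learnable kernels
    nn.ReLU(),                                   # excitation (activation) layer
    nn.MaxPool2d(2),                             # pooling layer: halves the spatial size
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # fully connected layer outputs class scores
)

x = torch.randn(1, 1, 28, 28)                    # a fake input image (batch of 1)
print(model(x).shape)                            # torch.Size([1, 10])
```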

In the training of a convolutional neural network, the parameters to be trained are the convolution kernels.

Convolution kernel: that is, the kernel function used for convolution.

The function of the convolutional neural network is to extract features of the input object layer by layer. Training also uses back propagation; continuously updating the parameters improves the accuracy of image feature extraction.

3. Generative Adversarial Network (GAN)

GAN (Generative Adversarial Network) is a generative model proposed by Goodfellow and others in 2014. Inspired by the zero-sum game of game theory, it treats the generation problem as a confrontation and game between two networks, a generator and a discriminator.

This method was proposed by Goodfellow et al. in 2014. The generative adversarial network consists of a generator and a discriminator.
The generator takes random samples from the latent space as input, and its output needs to imitate the real samples in the training set as closely as possible.
The input of the discriminator is either a real sample or the output of the generator, and its purpose is to distinguish the generator's output from the real samples as well as possible.

The main advantage of GAN is that it goes beyond the classification and feature-extraction functions of traditional neural networks and can generate new data that follows the characteristics of real data.
The two networks improve through confrontation; after improving, the confrontation continues, and the data produced by the generative network becomes better and better, approaching the real data, so that the desired data (pictures, sequences, videos, etc.) can be generated.

Generator

The generator produces synthetic data from given noise (generally drawn from a uniform or normal distribution), trying to produce data that is ever closer to reality.
The generator is like the White Bone Demon, trying to simulate the latent distribution of the real data samples from random noise in order to produce fake samples that pass for real ones.

Discriminator

The discriminator distinguishes the generator's output from real data, trying to tell real data and generated data apart as well as possible.

The discriminator is Sun Wukong, using his fiery golden eyes to judge whether the input is real data that is harmless to humans and animals, or an impostor disguised by the generator.

The generative adversarial network can be seen as a breakthrough in deep learning.

Both the generator and the discriminator can be implemented as deep neural networks, building a data-generation model in which the generator fits the distribution of the data samples as accurately as possible. In terms of learning method, adversarial learning is unsupervised learning.

Training the network is equivalent to a max-min problem on the objective function (written out after the list below):

  • Maximize: maximize the discriminator's accuracy in distinguishing real data from fake data
  • Minimize: minimize the probability that the data produced by the generator is detected by the discriminator
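For reference, the value function of this max-min game, as written in the original 2014 paper (G is the generator, D the discriminator, \(p_{data}\) the real-data distribution and \(p_z\) the noise distribution):

\(\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))]\)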

The traditional generative model first defines the form of the distribution and then solves for its parameters. For example, on the premise that the data follows a normal distribution, the generative model solves for the mean and variance from the samples by methods such as maximum likelihood estimation.

The generative adversarial network gets rid of the dependence on a predefined model distribution and does not limit the dimensionality of what is generated, which greatly expands the range of data samples that can be generated; it can also incorporate different loss functions, increasing the freedom of design.

4. Recurrent Neural Network (RNN)

The abbreviation RNN can stand for either Recurrent Neural Network or Recursive Neural Network. The recurrent neural network can be regarded as a special case of the recursive neural network, and the recursive neural network can be regarded as a generalization of the recurrent neural network.
The convolutional neural network shares parameters in space, which allows the same kernel function to be applied to different regions of an image.
Shift this parameter sharing to the time dimension, letting the neural network use the same weight coefficients to process data that arrives in order, and the result is the recurrent neural network.

  • Time
    The recurrent neural network introduces the "time" dimension and is suitable for processing time-series data.
    The recurrent neural network splits variable-length input into small blocks of equal length and then processes them with the same weights, thereby handling input of arbitrary length.
    For example, your mother suddenly calls you from the kitchen: "The food is ready, hurry up..."; even if you do not hear the rest clearly, you can guess, nine times out of ten, that she is telling you to come and eat.


  • Memory
    The output of the recurrent neural network at time t depends on the input at the current time and also on the network's output at the previous time t-1, or even earlier.
    In this sense, the recurrent neural network introduces a feedback mechanism and thus has a memory function. The memory function lets the recurrent neural network extract information from the sequence itself.
    The internal information of the input sequence is stored in the hidden layer of the neural network and flows through the hidden layer over time. The memory characteristic of the recurrent network can be expressed by the formula
    \(h_t = f(W x_t + U h_{t-1})\)

Explanation: the weighted input at time t, \(x_t\), and the weighted hidden-layer state at time t-1, \(h_{t-1}\), together form the input of the transfer function, and the result is the hidden-layer output at time t, \(h_t\).
\(W\) is the weight matrix from input to state and \(U\) is the transition matrix from state to state.
Training the recurrent neural network means continuously adjusting the parameters \(W\) and \(U\) according to the error between the output and the true result until the preset requirement is met; the training method is again a gradient-based back-propagation algorithm.
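A minimal sketch of this recurrence in NumPy, assuming tanh as the transfer function f; the matrix sizes and random inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 4, 3, 5
W = rng.normal(0, 0.5, size=(hidden_size, input_size))   # input-to-state weights
U = rng.normal(0, 0.5, size=(hidden_size, hidden_size))  # state-to-state weights

h = np.zeros(hidden_size)              # initial hidden state
for t in range(steps):
    x_t = rng.normal(size=input_size)  # input at time t
    h = np.tanh(W @ x_t + U @ h)       # h_t = f(W x_t + U h_{t-1}): memory of past inputs
    print(f"t={t}, h_t={np.round(h, 3)}")
```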

A feedforward neural network also has memory characteristics to a certain degree: once the network parameters are optimized, the optimized parameters carry traces of the past data. But this memory is limited to the training data set; when the trained network is applied to a new test data set, its parameters are not further adjusted according to its performance on the test data.

Bidirectional RNN

For example, in a TV series a character appears in the third episode, and we want to predict that character's name. The content of the first two episodes alone may not be enough, so you also use the content of the fourth and fifth episodes to predict what happens in the third; this is the idea behind the bidirectional RNN.

If you want the recurrent neural network to use information from the future, you need to establish a direct connection between the current state and a state at a later time; this gives the bidirectional recurrent neural network.
The bidirectional recurrent network includes two passes, a forward computation and a backward computation.

  • In the forward computation, the hidden-layer state at time t, \(h_t\), is related to the past state \(h_{t-1}\)
  • In the backward computation, the hidden-layer state \(h_t\) is related to the future state \(h_{t+1}\)
    At each time t, the bidirectional recurrent network computes the forward and backward results separately and uses both together as the final output of the hidden layer.

Deep RNN

Introducing a deep structure into the recurrent neural network yields the deep recurrent neural network.
For example, when you study English, reading the words once before a test is rarely enough to remember them all; usually you go over them several times and then focus on the words that are half-remembered or not yet memorized.

Compared with a bidirectional RNN, a deep bidirectional RNN has several more hidden layers, because its underlying idea is that a large amount of information cannot be remembered in a single pass.
A deep bidirectional RNN is built on this idea: the state of each hidden layer, \(h_t^i\), depends not only on the state of the previous hidden layer at the same time, \(h_t^{i-1}\), but also on the state of the same hidden layer at the previous time, \(h_{t-1}^{i}\).
The role of the depth structure is to build up an increasingly clear representation. As in a cloze test, you need to choose the right word according to the context: some blanks can be inferred from the sentence they sit in, which corresponds to the dependence of a single hidden layer along the time dimension; other blanks may require reading the whole paragraph or the full text, which corresponds to joint dependence along the time dimension and the space (depth) dimension.

Recursive RNN

A recursive neural network can process data with a hierarchical structure and can be regarded as a generalization of the recurrent network.

The characteristic of the recurrent neural network is that it shares parameters along the time dimension and unrolls over the sequence being processed; if the unrolling instead follows a tree structure, the result is the recursive neural network. The recursive neural network first converts the input data into a certain topological structure, then recursively applies the same weight coefficients over that structure, and obtains a structured prediction by traversing it.

For example, "teacher at two universities" has ambiguity. If it is simply split into word sequences, the ambiguity cannot be resolved.
The recurrent neural network breaks a complete sentence into a combination of several components through a tree structure, and the generated vector is not the root node of the tree structure.

5. Long Short-Term Memory Network (LSTM)

Long Short-Term Memory (LSTM) is a kind of time-recurrent neural network, specially designed to solve the long-term dependency problem of ordinary RNNs (recurrent neural networks). The paper was first published in 1997. Thanks to its unique design, LSTM is suited to processing and predicting important events with very long intervals and delays in a time series.

An RNN introduces memory by sharing parameters over time, so that earlier information can be applied to the current task, but this kind of memory is usually limited in depth.

For example, Dragon Ball Super or Naruto releases a new episode every week. Even after a week's gap, we can seamlessly connect the previous episode with the plot of the new one. The memory of an RNN, however, has no such continuity: never mind a week, it would probably have forgotten everything within five minutes.

Like human memory, LSTM can selectively memorize information over longer time intervals; it judges, according to the characteristics of each piece of information, whether it should be forgotten or remembered.
LSTM is used to achieve long-term memory, in principle of arbitrary length. The model must be able to judge the value of information, deciding on its own which information should be preserved and which discarded; each unit must also be able to decide which part of the memory needs to be used immediately.

Four components

An LSTM usually consists of the following four modules (the standard gate equations are sketched after the list):

  • Memory cell
    stores a value or state; the storage period can be long-term or short-term
  • Input gate
    decides what information is stored into the memory cell
  • Forget gate
    decides what information is discarded from the memory cell
  • Output gate
    decides what information is output from the memory cell
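For reference, a common formulation of these four components (a sketch of the standard LSTM equations; \(\sigma\) is the sigmoid function, \(\odot\) denotes element-wise multiplication, and the \(W\), \(U\), \(b\) matrices and vectors are learned parameters):

\(f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)\)  (forget gate)
\(i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)\)  (input gate)
\(o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)\)  (output gate)
\(\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)\)  (candidate memory)
\(c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\)  (memory cell update)
\(h_t = o_t \odot \tanh(c_t)\)  (hidden-state output)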


Source: www.cnblogs.com/chenqionghe/p/12688780.html