Deep Learning, Part One: Basic Concepts and Classification Neural Networks

I. The Concept of Deep Learning

1. What is deep learning?

Deep learning is a form of machine learning that originated from the concept of artificial neural networks. A multilayer perceptron containing multiple hidden layers is a deep learning structure. Deep learning forms more abstract high-level representations (attribute categories or features) by combining low-level features, in order to discover distributed feature representations of data. It is a new field of machine learning research whose motivation is to build neural networks that simulate the way the human brain analyzes and learns; it interprets data by mimicking the mechanisms of the human brain.

 

2. The basic transformation (the concept of a layer)

A neural network is built up layer by layer, so what exactly does each layer do?

  • Mathematical formula: $\vec{y} = a(W \cdot \vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the bias vector, $W$ is the weight matrix, and $a()$ is the activation function. Each layer simply applies this operation to its input $\vec{x}$ to obtain $\vec{y}$.
  • Mathematical understanding: each layer performs the five kinds of operations listed below on the input space (the set of input vectors), completing the transformation from the input space to the output space (from the row space of the matrix to its column space). Note: the word "space" is used because what is being classified is not a single thing but a class of things, and the space is the set of all individuals of that class.
    • 1. Raising / reducing the dimensionality
    • 2. Enlarging / shrinking
    • 3. Rotating
    • 4. Translating
    • 5. "Bending". Among these five operations, 1, 2 and 3 are performed by $W \cdot \vec{x}$, operation 4 is performed by $+\vec{b}$, and operation 5 is achieved by $a()$.

Mathematical understanding of each layer of a neural network: a linear transformation followed by a nonlinear transformation, projecting the input space into another space.

From the perspective of linear separability: a neural network learns how to use the linear transformation of the weight matrix plus the nonlinear activation function to project the original input space into a linearly separable / sparse space, in order to classify / regress.
Increasing the number of nodes: increases the dimensionality, i.e. the capacity for linear transformation.
Increasing the number of layers: increases the number of activation functions, i.e. the number of nonlinear transformations.
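
To make the layer operation concrete, here is a minimal NumPy sketch of the transformation $\vec{y} = a(W \cdot \vec{x} + \vec{b})$ described above; the sizes (a 2-dimensional input mapped to a 3-dimensional output) and the ReLU activation are illustrative choices, not taken from the text.

```python
import numpy as np

def relu(z):
    # nonlinear activation a(): this is the "bending" operation
    return np.maximum(0.0, z)

# one layer: 2-dimensional input -> 3-dimensional output (sizes are illustrative)
W = np.array([[ 0.5, -1.0],
              [ 1.2,  0.3],
              [-0.7,  0.8]])    # weight matrix: dimension change / scaling / rotation
b = np.array([0.1, -0.2, 0.0])  # bias vector: translation

x = np.array([1.0, 2.0])        # input vector
y = relu(W @ x + b)             # linear transformation, then nonlinear transformation
print(y)                        # 3-dimensional output vector
```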

 

3. Model Training

Since the learning process is learning the weight matrices that control how the space is transformed, the next question is how to learn the weight matrix $W$ of each layer. This is done by comparing the network's current predicted value with the target value we actually want, and then updating the weight matrix of each layer according to the discrepancy between the two (for example, if the network's predicted value is too high, the weights are adjusted to make it predict lower, and the adjustment continues until the network can predict the target). We therefore need to define "how different the predicted value is from the target value"; this is the loss function or objective function, which measures the difference between the predicted value and the target value. The larger the output of the loss function (the loss), the greater the difference. Training the neural network then becomes a matter of making the loss as small as possible. The method used is gradient descent: the loss is continually reduced by moving in the direction opposite to the gradient of the loss at the current point. How far each step moves is controlled by the learning rate.
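
A minimal NumPy sketch of the loop just described (predict, measure the loss, move the weights against the gradient by a step controlled by the learning rate), fitting a single weight to toy data; the data and the learning rate are made up for the example.

```python
import numpy as np

# toy data: targets generated by t = 3 * x, so the weight we want to learn is 3.0
x = np.array([1.0, 2.0, 3.0, 4.0])
t = 3.0 * x

w = 0.0                # current weight (the network's only parameter here)
learning_rate = 0.05   # controls how far each update moves

for step in range(100):
    y = w * x                           # predicted values
    loss = 0.5 * np.sum((y - t) ** 2)   # loss: how far the predictions are from the targets
    grad = np.sum((y - t) * x)          # d loss / d w
    w -= learning_rate * grad           # move against the gradient

print(w, loss)  # w approaches 3.0 and the loss approaches 0
```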

 

4. Gradient Descent

Problems with gradient descent:

However, training a neural network with gradient descent has two major problems.

Problem one: local minima

Gradient descent finds a local minimum of the loss function, while what we want is the global minimum. As shown in the figure below, we hope the loss can drop to the lowest dark-blue point on the right, but it may get "stuck" at a local minimum on the left.

Methods that try to solve the "stuck in a local minimum" problem fall into two categories:

  • Adjusting the step: adjust the learning rate so that each update takes a different "step size". Common methods are:
  • Stochastic gradient descent (SGD): each update uses the gradient computed from a single sample
  • Mini-batch gradient descent: each update uses the average of the gradients computed over a small batch of samples
  • Momentum: does not consider only the gradient computed from the current sample; Nesterov Momentum: an improvement on Momentum (a sketch combining mini-batch gradient descent and momentum follows this list)
  • Adagrad, RMSProp, Adadelta, Adam: these methods reduce the learning rate according to rules during training, and some of them also incorporate momentum
  • Optimizing the starting point: sensible weight initialization and pre-training give the network a good "starting point", just as the starting point on the far right is better than the one on the far left. Common methods are: Gaussian-distributed initial weights, uniformly distributed initial weights, Glorot initialization, He initialization, and sparse-matrix initial weights
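
The NumPy sketch below illustrates two of the "step adjustment" ideas from the list, mini-batch gradient descent combined with momentum, on a toy one-weight problem; the data, batch size, momentum coefficient and learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
t = 3.0 * x                    # toy targets: the "true" weight is 3.0

w, velocity = 0.0, 0.0
learning_rate, momentum, batch_size = 0.05, 0.9, 16

for step in range(200):
    idx = rng.integers(0, len(x), size=batch_size)          # draw a small random mini-batch
    xb, tb = x[idx], t[idx]
    grad = np.mean((w * xb - tb) * xb)                      # gradient averaged over the mini-batch
    velocity = momentum * velocity - learning_rate * grad   # momentum keeps part of the previous step
    w += velocity

print(w)  # close to the true weight 3.0
```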

Problem two: computing the gradient

The data handled by machine learning is high-dimensional, so the first issue is how to compute the gradient quickly rather than taking forever; the second is how to update the weights of the hidden layers. The solution is the computation graph together with the back-propagation algorithm. What you need to know is that back-propagation is simply a method for computing gradients, just as the fast Fourier transform (FFT) is a method for computing the Fourier transform. With the concept of a computation graph, computing the gradient becomes more systematic and convenient.
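
As a minimal illustration that back-propagation is just a systematic way of computing gradients on the computation graph, here is a scalar sketch: one forward pass through z = w·x + b, y = relu(z), loss = ½(y − t)², followed by a backward pass that applies the chain rule node by node. All values are made up for the example.

```python
# forward pass through a tiny computation graph
x, t = 2.0, 1.0            # one input and its target (illustrative values)
w, b = 0.5, 0.1            # current weights

z = w * x + b              # linear node
y = max(0.0, z)            # ReLU activation node
loss = 0.5 * (y - t) ** 2  # loss node

# backward pass: traverse the same graph in reverse, multiplying local derivatives (chain rule)
dloss_dy = y - t                   # d loss / d y
dy_dz = 1.0 if z > 0 else 0.0      # d relu / d z
dloss_dz = dloss_dy * dy_dz
dloss_dw = dloss_dz * x            # because d z / d w = x
dloss_db = dloss_dz * 1.0          # because d z / d b = 1
print(dloss_dw, dloss_db)
```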

 

II. Feedforward Neural Networks

A feedforward neural network is one kind of artificial neural network. In this network, each neuron starts from the input layer, receives the output of the previous layer as its input, and passes its output on to the next layer, all the way to the output layer. There is no feedback anywhere in the network, so it can be represented as a directed acyclic graph.
A feedforward neural network uses a one-directional, multi-layer structure. Each layer contains several neurons; neurons within the same layer are not connected to each other, and information is passed between layers in one direction only. The first layer is called the input layer, the last layer is the output layer, and the layers in between are called hidden layers. There may be a single hidden layer or several.
 
  • Network structure: 2-dimensional input $\rightarrow$ hidden layer $\rightarrow$ 1-dimensional output

  • Structural expressions:
    • Forward pass: $y = M(x) = relu(W_{h} \cdot relu(W_{x} \cdot x + b_{x}) + b_{h})$ (1)
      • $y$ denotes a value of the random variable $Y$, $x$ denotes a value of the random variable $X$, and $M(x)$ is our neural network model; the right-hand side of the equation is its concrete expression.
    • Loss function: $loss = \frac{1}{2} \sum\limits_i (y_i - t_i)^2$
      • This loss simply compares the differences between all the values of $y$ and $t$.
  • Overall structure: the figure on the left shows the network structure. The green boxes represent operations, also called layers. In this structure, the input $x$ passes through hid_layer to compute the hidden-layer value $h$, which is passed to out_layer to compute the predicted value $y$; the prediction is then compared with the true value $t$ to obtain the loss, and after the gradients are obtained by backward differentiation, the $W$ and $b$ of every layer are updated.

 

  • Forward pass: if you zoom into hid_layer, from bottom to top, you can see that $W_h$ is first initialized with the truncated_normal method, then matrix-multiplied with the input $x$, $b_h$ is added, and the result goes through the activation before being sent to the out_layer that computes $y$. $y$ is computed in exactly the same way as $h$, but with different weights $W_o$ and bias $b_o$. Finally, the computed predicted value $y$ and the true value $t$ are used together to compute the loss.

  • Backward pass: if you zoom into train, and then into the gradients inside it, you can see that the framework starts from the loss and works backward step by step to obtain the gradients of $W$ and $b$ in each layer.

  • Weight update: the gradients of $W$ and $b$ obtained for each layer are used to update the corresponding $W$ and $b$, and the learning rate controls how large each update is. (beta1_power and beta2_power are parameters of the Adam update method; for now it is enough to know that the core of each weight update is its own corresponding gradient.) A sketch of this whole forward / backward / update cycle follows.
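
Putting the three bullets above together, the following NumPy sketch runs the same cycle for the network of equation (1): a forward pass, a backward pass that obtains the gradients of each layer's W and b, and a plain gradient-descent weight update. It is not the author's original code (which used truncated_normal initialization and the Adam optimizer inside a framework); the hidden size, the toy data, the small positive output bias and the learning rate are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

# 2-dimensional input, small hidden layer, 1-dimensional output (hidden size is assumed)
dim_in, dim_hid, dim_out = 2, 8, 1
W_x = rng.normal(0.0, 0.1, (dim_hid, dim_in));  b_x = np.zeros(dim_hid)
W_h = rng.normal(0.0, 0.1, (dim_out, dim_hid)); b_h = np.full(dim_out, 0.5)  # positive bias keeps the output ReLU active at the start

x = np.array([1.0, -2.0])   # one toy input
t = np.array([2.0])         # its target value
learning_rate = 0.05

for step in range(200):
    # forward pass: y = relu(W_h . relu(W_x . x + b_x) + b_h), as in equation (1)
    z1 = W_x @ x + b_x
    h = relu(z1)
    z2 = W_h @ h + b_h
    y = relu(z2)
    loss = 0.5 * np.sum((y - t) ** 2)

    # backward pass: chain rule from the loss back to every W and b
    d_z2 = (y - t) * (z2 > 0)
    g_Wh, g_bh = np.outer(d_z2, h), d_z2
    d_z1 = (W_h.T @ d_z2) * (z1 > 0)
    g_Wx, g_bx = np.outer(d_z1, x), d_z1

    # weight update: move every parameter against its own gradient
    W_h -= learning_rate * g_Wh; b_h -= learning_rate * g_bh
    W_x -= learning_rate * g_Wx; b_x -= learning_rate * g_bx

print(y, loss)  # y moves toward t and the loss shrinks
```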

III. Recurrent Neural Networks (RNN)

Feedforward network: a feedforward network after window processing with a window size of 3 frames

  • Dynamic diagram: the left side shows the network before it is unrolled along the time dimension, the right side after unrolling (at any single moment only the gray part is actually active). Because of how a feedforward network works, the predictions at different time steps are completely independent; we can only make it take the preceding and following context into account through windowing.
 

 

  • Mathematical expression: $h_t = \phi(W_{xh} \cdot concat(x_{t-1}, x_t, x_{t+1}) + b)$, where concat means joining the vectors into a single higher-dimensional vector (see the sketch after this list).
  • Parameters to learn: $W_{xh}$ and $b$ have to be learned from a large amount of data.
  • To learn the relationships among all dimensions (39) at all time steps (3), i.e. 39 × 3 input connections, a lot of data is needed.
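
A minimal NumPy sketch of the windowed feedforward computation above: three consecutive frames are concatenated into one longer vector before the usual layer operation is applied. The 39-dimensional frames follow the text; the hidden size, the tanh choice for φ and the random values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_frame, dim_hidden = 39, 16                    # 39-dim frames as in the text; hidden size is assumed

# W_xh must cover all three frames at once: 39 * 3 input columns
W_xh = rng.normal(0.0, 0.1, (dim_hidden, dim_frame * 3))
b = np.zeros(dim_hidden)

x_prev, x_t, x_next = (rng.normal(size=dim_frame) for _ in range(3))
window = np.concatenate([x_prev, x_t, x_next])    # concat: one larger 117-dimensional vector
h_t = np.tanh(W_xh @ window + b)                  # phi is assumed to be tanh here
print(h_t.shape)                                  # (16,)
```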

Recurrent network: there is no longer the concept of a window size, but of a time step.

  • Dynamic diagram: the left side shows the network before it is unrolled along the time dimension, drawn as a loop, where the black square represents a time delay. After unrolling (right side), you can see that the current hidden state $h_t$ does not depend only on the current input $x_t$; it is also related to the hidden state $h_{t-1}$ of the previous time step.
 

  • Mathematical expression: $h_t = \phi(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b)$. $h_t$ is still determined by the information from $x_t$ transformed by $W_{xh}$, but there is an additional term here: $W_{hh} \cdot h_{t-1}$, the information obtained from the previous hidden state $h_{t-1}$ transformed by a different matrix $W_{hh}$.
  • Note: $W_{xh}$ has dim_input rows and dim_hidden_state columns, while $W_{hh}$ is a square matrix whose rows and columns are both dim_hidden_state.
  • Parameters to learn: the feedforward network uses three time steps to learn $W_{xh}$ once, while the recurrent network uses the same three time steps to learn $W_{xh}$ and $W_{hh}$ three times. In other words, the weight matrices are shared across all time steps. This is the most prominent advantage of the recurrent network over the feedforward network.
A recurrent neural network is the neural network variant in which this sharing property exists along the temporal structure.

Sharing weights across time is the core of the core of the recurrent network structure.
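
To make the recurrence and the weight sharing concrete, here is a NumPy sketch of $h_t = \phi(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b)$ unrolled over three time steps; the same $W_{xh}$ and $W_{hh}$ are reused at every step. The matrix shapes follow the note above ($W_{xh}$ is dim_input × dim_hidden_state, so the products are written with the vector on the left); the hidden size, the tanh choice for φ and the random inputs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_input, dim_hidden_state = 39, 16   # 39-dim input frames as in the text; hidden size is assumed

W_xh = rng.normal(0.0, 0.1, (dim_input, dim_hidden_state))         # input -> hidden
W_hh = rng.normal(0.0, 0.1, (dim_hidden_state, dim_hidden_state))  # hidden -> hidden (square matrix)
b = np.zeros(dim_hidden_state)

xs = [rng.normal(size=dim_input) for _ in range(3)]  # inputs for three time steps
h = np.zeros(dim_hidden_state)                       # initial hidden state

for x_t in xs:
    # the SAME W_xh and W_hh are used at every time step: the weights are shared across time
    h = np.tanh(x_t @ W_xh + h @ W_hh + b)

print(h.shape)  # (16,)
```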

 

 
