First, the concept of deep learning
1. What is deep learning
Deep learning is a branch of machine learning that originated from the concept of artificial neural networks. A multilayer perceptron with multiple hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations (attribute categories or features), in order to discover distributed feature representations of data. It is a newer research field within machine learning, motivated by building neural networks that simulate the way the human brain analyzes and learns, imitating the brain's mechanisms to interpret data.
2. Basic transformations (the concept of a layer)
A neural network is built layer by layer, so what exactly does each layer do?
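- Mathematical formula: $\vec{y} = a(W \cdot \vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix, and $a()$ is the activation function. Each layer simply applies this one operation to the input $\vec{x}$ to obtain $\vec{y}$.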
- Mathematical interpretation: the following five operations act on the input space (the set of input vectors) to carry out the transformation input space -> output space (from the row space of the matrix to its column space). Note: the word "space" is used because what is being classified is not a single thing but a class of things; a space is the set of all individuals of that class.
- 1. Raising / reducing the dimension
- 2. Scaling up / down
- 3. Rotation
- 4. Translation
- 5. "Bending"
- Of these five operations, 1, 2 and 3 are carried out by the multiplication with the weight matrix, 4 by adding the offset vector, and 5 by the activation function.
Mathematical view of each layer: a linear transformation followed by a nonlinear transformation maps the input space into another space.
Linear separability view: what a neural network learns is how to use linear transformations (matrices) plus nonlinear transformations (activation functions) to project the original input space into a linearly separable / sparse space, in which classification / regression can be done.
Increasing the number of nodes: increases the dimension, i.e. increases the capacity of the linear transformation.
Increasing the number of layers: increases the number of activation functions, i.e. increases the number of nonlinear transformations.
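The single-layer operation described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `layer`, the choice of tanh as the activation, and the concrete numbers are assumptions made for the example, not taken from the original article.

```python
import numpy as np

def layer(x, W, b, a=np.tanh):
    """One layer: multiply by W, add the offset b, then apply the activation a."""
    return a(W @ x + b)

x = np.array([1.0, 2.0])             # 2-dimensional input vector
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])            # 3x2 matrix: raises the dimension from 2 to 3
b = np.array([0.0, -2.0, -3.0])      # offset vector: translation
y = layer(x, W, b)                   # tanh: the nonlinear "bending"
print(y.shape)
```

Adding rows to `W` corresponds to adding nodes (a higher output dimension), while stacking several calls to `layer` corresponds to adding layers.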
3. Model Training
Since learning means finding the right weight matrices to control the spatial transformations, the next question is how to learn the weight matrix W of each layer. We compare the network's current prediction with the target we actually want, then update each layer's weight matrix according to the discrepancy between the two (for example, if the prediction is too high, adjust the weights so that it predicts lower, and keep adjusting until the network can predict the target). This requires defining how to measure the difference between the predicted value and the target value: that is the loss function, or objective function. The higher the output of the loss function (the loss), the greater the difference. Training a neural network thus becomes the process of making the loss as small as possible. The method used is gradient descent: repeatedly move from the current point in the direction opposite to the gradient of the loss, so that the loss keeps decreasing. How far each move goes is controlled by the learning rate.
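As a minimal sketch of this training loop, consider a single weight `w`, a prediction `w * x`, and a squared-error loss. The model, the loss, and all numbers here are assumptions chosen to keep the example tiny; they are not from the original article.

```python
def loss(w, x, t):
    """Squared difference between the prediction w * x and the target t."""
    return (w * x - t) ** 2

def grad(w, x, t):
    """Derivative of the loss with respect to w."""
    return 2 * (w * x - t) * x

w, x, t = 0.0, 2.0, 6.0      # the ideal weight is 3, since 3 * 2 = 6
learning_rate = 0.05          # controls how far each update moves
history = []
for step in range(50):
    history.append(loss(w, x, t))
    w -= learning_rate * grad(w, x, t)   # move against the gradient

print(w, history[0], history[-1])
```

Each update moves `w` in the direction opposite to the gradient, so the recorded loss shrinks from step to step and `w` approaches 3.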
4. Gradient Descent
Problems with gradient descent:
However, training a neural network with gradient descent raises two major problems.
Problem one: local minima
Gradient descent finds a local minimum of the loss function, whereas we want the global minimum. As shown in the figure below, we would like the loss to reach the lowest point, the dark blue region on the right, but it may get "stuck" in a local minimum on the left.
Methods that try to solve the "stuck in a local minimum" problem fall into two categories:
- Adjusting the step: adjust the learning rate so that the "step" taken by each update differs. Common methods are:
- Stochastic gradient descent (SGD): each update uses the gradient computed from a single sample
- Mini-batch gradient descent: each update uses the average of the gradients computed from several samples
- Momentum: takes into account not only the gradient computed from the current sample but also previous updates; Nesterov momentum: an improvement on momentum
- Adagrad, RMSProp, Adadelta, Adam: these methods reduce the learning rate during training according to certain rules, and some of them also incorporate momentum
- Optimizing the starting point: initialize the weights sensibly (weight initialization) or pre-train the network so that it gets a good "starting point"; in the figure, for example, a starting point on the far right is better than one on the far left. Common methods are: Gaussian-distributed initial weights, uniformly distributed initial weights, Glorot initialization, He initialization, and sparse-matrix initial weights
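The effect of the starting point can be illustrated with a one-dimensional toy loss that has a shallow local minimum on the left and a deeper global minimum on the right. The function `f` below is a hypothetical example chosen for this purpose, not the loss surface from the original figure.

```python
def f(x):
    """Toy loss with two minima: a local one near x = -1.13, a global one near x = 1.30."""
    return x**4 - 3 * x**2 - x

def df(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x - 1

def descend(x, learning_rate=0.01, steps=500):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x -= learning_rate * df(x)
    return x

x_left = descend(-2.0)    # starts on the far left, ends in the local minimum
x_right = descend(2.0)    # starts on the far right, ends in the global minimum
print(x_left, f(x_left))
print(x_right, f(x_right))
```

With the same learning rate and number of steps, the two runs settle in different minima, which is why initialization and step-size strategies both matter.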
Problem two: computing the gradient
The data handled by machine learning is high-dimensional, so how can the gradient be computed quickly, rather than over years? And how are the weights of the hidden layers updated? The solution is the computational graph together with the back-propagation algorithm. What needs to be understood is that back-propagation is simply a method for computing gradients; its contribution is comparable to that of the fast Fourier transform (FFT). The concept of the computational graph, in turn, makes the computation of gradients more systematic and convenient.
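To make the idea concrete, here is a hedged sketch of back-propagation on a tiny computational graph with three nodes (a linear step, a tanh activation, and a squared-error loss). The graph, the function names, and the numbers are assumptions for illustration; the result is checked against a numerical derivative.

```python
import math

def forward_backward(w, b, x, t):
    """Forward pass through the graph, then backward pass via the chain rule."""
    # Forward: each line is one node of the graph.
    z = w * x + b
    y = math.tanh(z)
    loss = (y - t) ** 2
    # Backward: walk the same nodes in reverse order.
    dloss_dy = 2 * (y - t)
    dy_dz = 1 - y ** 2               # derivative of tanh
    dloss_dz = dloss_dy * dy_dz
    dloss_dw = dloss_dz * x
    dloss_db = dloss_dz
    return loss, dloss_dw, dloss_db

def numerical_grad(f, v, eps=1e-6):
    """Central finite difference, used only to check the result."""
    return (f(v + eps) - f(v - eps)) / (2 * eps)

w, b, x, t = 0.5, -0.2, 1.5, 0.3
loss, gw, gb = forward_backward(w, b, x, t)
gw_num = numerical_grad(lambda v: forward_backward(v, b, x, t)[0], w)
gb_num = numerical_grad(lambda v: forward_backward(w, v, x, t)[0], b)
print(gw, gw_num, gb, gb_num)
```

The backward pass reuses the intermediate values of the forward pass, so all gradients come out of one sweep over the graph instead of one perturbed evaluation per parameter.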
Second, feed-forward neural networks
- Network architecture: 2-dimensional input, 1-dimensional output
- Structure expression:
- Forward pass: (1)
- The left-hand side denotes the value of the output random variable as produced by our neural network model; the right-hand side of the equals sign is the concrete expression.
- Loss function:
- This loss simply compares all the values in the prediction with the corresponding values in the true target.
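- Overall structure: the graph on the left shows the network structure. Green boxes denote operations, also called layers. In this structure, the input passes through hid_layer, which computes the values of the hidden layer; these are passed on to out_layer, which computes the prediction; the prediction is then compared with the true value to compute the loss, and the gradients obtained by differentiating backward are used to update the weights and offsets of every layer.
- Forward pass: zooming into hid_layer and reading from bottom to top, the weights are first initialized with the truncated_normal method, then matrix-multiplied with the input; the offset is added, and after the activation the result is sent to out_layer, which computes the prediction. The prediction is computed in exactly the same way as the hidden layer values, but with a different weight and offset. Finally, the prediction and the true value are used together to compute the loss.
- Backward pass: zooming into train, and then into gradients inside it, shows that the framework starts from the loss and works backward step by step to obtain the gradients of the weights and offsets in every layer.
- Weight update: the gradients computed for the weights and offsets of each layer are used to update the corresponding weights and offsets, with the learning rate controlling how large each update is. (beta1_power and beta2_power are parameters of the Adam update method; for now it is enough to know that the core of the weight update is the corresponding gradient.)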
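The four steps above (structure, forward pass, backward pass, weight update) can be sketched end to end in NumPy. This is not the article's framework code: the toy data, the names `W_h`, `b_h`, `W_o`, `b_o`, the tanh activation, the mean-squared-error loss, plain normal initialization (instead of truncated_normal) and plain gradient descent (instead of Adam) are all assumptions made to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: 2-dimensional inputs, 1-dimensional targets in [-1, 1].
x = rng.uniform(-1.0, 1.0, size=(64, 2))
t = 0.5 * (x[:, :1] - x[:, 1:])                   # shape (64, 1)

# Parameters of the hidden layer and the output layer (assumed names).
W_h = rng.normal(0.0, 0.5, size=(2, 4)); b_h = np.zeros(4)
W_o = rng.normal(0.0, 0.5, size=(4, 1)); b_o = np.zeros(1)

learning_rate = 0.1
losses = []
for step in range(1000):
    # Forward pass: hidden layer, then output layer, then loss.
    h = np.tanh(x @ W_h + b_h)
    y = np.tanh(h @ W_o + b_o)
    losses.append(np.mean((y - t) ** 2))

    # Backward pass: start from the loss and apply the chain rule layer by layer.
    d_y = 2.0 * (y - t) / len(x)
    d_zo = d_y * (1.0 - y ** 2)
    g_W_o, g_b_o = h.T @ d_zo, d_zo.sum(axis=0)
    d_zh = (d_zo @ W_o.T) * (1.0 - h ** 2)
    g_W_h, g_b_h = x.T @ d_zh, d_zh.sum(axis=0)

    # Weight update: each parameter moves against its own gradient.
    W_o -= learning_rate * g_W_o; b_o -= learning_rate * g_b_o
    W_h -= learning_rate * g_W_h; b_h -= learning_rate * g_b_h

print(losses[0], losses[-1])
```

An optimizer such as Adam would replace only the last two lines of the loop; the gradients themselves are computed in the same way.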
Third, recurrent neural networks (RNN)
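Feed-forward network: a feed-forward network applied after windowing with a window size of 3 frames
- Animation: the left side shows the network before unrolling along the time dimension, the right side after unrolling (at any single time step only the grey part is actually working). By its nature, a feed-forward network makes the predictions at different time steps completely independent. The only way to let it take the surrounding context into account is windowing.
- Mathematical formula: here concat denotes concatenating the vectors into one vector of larger dimension.
- Learned parameters: the weight matrix and offset vector have to be learned from a large amount of data.
- Learning the relationships among all dimensions (39) at every time step (3), that is 39*3 of them, requires a lot of data.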
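The windowing step can be sketched as follows, assuming 39-dimensional frames and a window of 3 as in the text. The hidden size of 16, the edge padding by repeating the first and last frame, the tanh activation, and the names `W` and `b` are assumptions for the example only.

```python
import numpy as np

dim_input, window, dim_hidden = 39, 3, 16
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, dim_input))          # 5 time steps of 39-dim frames

# Repeat the edge frames so that every time step has a previous and a next frame.
padded = np.concatenate([frames[:1], frames, frames[-1:]], axis=0)     # (7, 39)

# concat: join previous, current and next frame into one 117-dim vector.
windows = np.stack([padded[i:i + window].reshape(-1)
                    for i in range(len(frames))])                      # (5, 117)

W = rng.normal(0.0, 0.1, size=(window * dim_input, dim_hidden))        # (117, 16)
b = np.zeros(dim_hidden)
h = np.tanh(windows @ W + b)                      # one independent prediction per step
print(windows.shape, W.shape, h.shape)
```

The weight matrix has one row per dimension per frame in the window (39*3 = 117 rows), which is why widening the window directly increases the number of parameters to learn.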
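Recurrent network: there is no longer a window size; instead there is the time step
- Animation: the left side shows the loop-style representation before unrolling along the time dimension, where the black square denotes a time delay. After unrolling, on the right, one can see that the hidden state at the current time step depends not only on the input at the current time step but also on the hidden state of the previous time step.
- Mathematical formula: as before, the hidden state is partly determined by the current input after a transformation,
- but there is now an additional piece of information, derived from the previous time step's hidden state through a different transformation.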
- Note: the input-to-hidden weight matrix has dim_input rows and dim_hidden_state columns, while the hidden-to-hidden weight matrix is a square matrix whose rows and columns both number dim_hidden_state.
- Learned parameters: a feed-forward network needs three time steps to learn its weight matrix once, whereas a recurrent network can use the same three time steps to learn both of its weight matrices three times. In other words, the weight matrices are shared across all time steps. This is the most prominent advantage of a recurrent network over a feed-forward network.
A recurrent neural network is a neural network variant with a sharing property along its temporal structure.
Sharing across time is the very core of the recurrent network structure.
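A minimal sketch of this weight sharing, assuming 39-dimensional inputs and 3 time steps as in the text: the names `W_xh`, `W_hh` and `b`, the hidden size of 16, the tanh activation and the zero initial state are assumptions for the example, while `dim_input` and `dim_hidden_state` follow the shapes described in the note above.

```python
import numpy as np

dim_input, dim_hidden_state, time_steps = 39, 16, 3
rng = np.random.default_rng(0)

W_xh = rng.normal(0.0, 0.1, size=(dim_input, dim_hidden_state))         # input -> hidden
W_hh = rng.normal(0.0, 0.1, size=(dim_hidden_state, dim_hidden_state))  # hidden -> hidden
b = np.zeros(dim_hidden_state)

x = rng.normal(size=(time_steps, dim_input))
h = np.zeros(dim_hidden_state)                    # initial hidden state
states = []
for x_t in x:
    # The same W_xh, W_hh and b are reused at every time step.
    h = np.tanh(x_t @ W_xh + h @ W_hh + b)
    states.append(h)
states = np.stack(states)

rnn_params = W_xh.size + W_hh.size + b.size
window_params = time_steps * dim_input * dim_hidden_state + dim_hidden_state
print(states.shape, rnn_params, window_params)
```

The recurrent parameter count does not depend on the number of time steps, whereas the windowed feed-forward layer of the same hidden size grows with the window.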