Deep learning - recurrent neural network notes

  The recurrent neural network (Recurrent Neural Network, RNN) and the recursive neural network (Recursive Neural Network, RNN) share the same abbreviation; this note covers only the recurrent kind. One remark before starting: some blog posts distinguish the two only by name without going into any detail. After a lot of searching in Chinese, I found that several enthusiastic experts do explain them carefully, and for now I recommend the write-up at https://zybuluo.com/hanbingtao/note/541458. Most of these notes are adapted from that article.

  Fully connected neural networks and convolutional neural networks can only process inputs one by one, in isolation: the previous input and the next input are treated as completely unrelated. However, some tasks require processing sequences of information, where earlier and later inputs are related. For example, when we understand the meaning of a sentence, it is not enough to understand each word of the sentence in isolation; we need to process the whole sequence formed by connecting those words. Likewise, when we process video, we cannot analyze each frame alone; we must analyze the whole sequence formed by connecting the frames. For this, we need another important type of neural network in deep learning: the recurrent neural network (Recurrent Neural Network).

  Below is a simple recurrent neural network, consisting of an input layer, a hidden layer, and an output layer:

(Figure: a simple recurrent neural network with input x, hidden layer s, output o, and a self-loop on the hidden layer weighted by W)

  If we remove the arrowed circle labeled W in the figure above, it becomes an ordinary fully connected neural network. x is a vector representing the value of the input layer (the individual neuron nodes are not drawn); s is a vector representing the value of the hidden layer (only one hidden node is drawn here, but you should imagine this layer actually has multiple nodes, as many as the dimension of the vector s); U is the weight matrix from the input layer to the hidden layer; o is a vector representing the value of the output layer; V is the weight matrix from the hidden layer to the output layer. The value s of the recurrent network's hidden layer depends not only on the current input x, but also on the value of the hidden layer at the previous time step. The weight matrix W is the weight by which the previous value of the hidden layer re-enters as input this time.
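To make the shapes concrete, the description above implies the following dimensions (the size symbols m, n, and r for the input, hidden, and output layers are introduced here for illustration and are not in the original):

$$x \in \mathbb{R}^{m},\quad s \in \mathbb{R}^{n},\quad o \in \mathbb{R}^{r},\quad U \in \mathbb{R}^{n \times m},\quad W \in \mathbb{R}^{n \times n},\quad V \in \mathbb{R}^{r \times n}$$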

  If we unroll the figure above, the recurrent neural network can also be drawn like this:

(Figure: the same network unrolled along the time axis, with inputs x_{t-1}, x_t, x_{t+1} and hidden states s_{t-1}, s_t, s_{t+1})

The network receives the input x_t at time t; the value of the hidden layer is then s_t, and the output value is o_t. The key point is that the value of s_t depends not only on x_t but also on s_{t-1}. We can use the following formulas to express how a recurrent neural network is computed:

$$o_t = g(V s_t) \qquad (\text{Formula 1})$$

$$s_t = f(U x_t + W s_{t-1}) \qquad (\text{Formula 2})$$

Formula 1 is the formula for the output layer, which is a fully connected layer: each of its nodes is connected to every node of the hidden layer. V is the weight matrix of the output layer and g is its activation function. Formula 2 is the computation of the hidden layer, which is the recurrent layer. U is the weight matrix for the input x, W is the weight matrix by which the previous value s_{t-1} enters as input this time, and f is the activation function.
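As a minimal numpy sketch of Formula 1 and Formula 2 (the choices f = tanh and g = softmax, and the toy sizes, are assumptions for illustration; the note does not fix the activation functions here):

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, V):
    # Formula 2: the hidden state depends on the current input and the previous state
    s_t = np.tanh(U @ x_t + W @ s_prev)        # f = tanh (assumed)
    # Formula 1: the output layer is fully connected to the hidden layer
    o_star = V @ s_t
    e = np.exp(o_star - o_star.max())
    o_t = e / e.sum()                          # g = softmax (assumed)
    return s_t, o_t

# toy dimensions (illustrative): input 4, hidden 3, output 2
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))
s = np.zeros(3)
for x in rng.normal(size=(5, 4)):              # a length-5 input sequence
    s, o = rnn_step(x, s, U, W, V)
```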

If we repeatedly substitute Formula 2 for s_{t-1} into Formula 1, we get:

$$o_t = g(V s_t) = g\big(V f(U x_t + W s_{t-1})\big) = g\Big(V f\big(U x_t + W f(U x_{t-1} + W f(U x_{t-2} + \cdots))\big)\Big)$$

The expression of the output value o_t in terms of s_t is thus affected by the previous input values x_{t-1}, x_{t-2}, x_{t-3}, ..., which is why a recurrent neural network can look back at any number of earlier input values.

Bidirectional Recurrent Neural Networks

The figure below shows a bidirectional recurrent network:

Consider first the calculation of y_2. As the figure shows, the hidden layer of a bidirectional recurrent neural network stores two values: one, A, participates in the forward calculation, and the other, A', participates in the backward calculation. The final output value y_2 depends on A_2 and A'_2 and can be computed as follows:

$$y_2 = g(V A_2 + V' A'_2)$$

$$A_2 = f(W A_1 + U x_2), \qquad A'_2 = f(W' A'_3 + U' x_2)$$

The general rule can now be seen: in the forward calculation, the value of the hidden layer s_t is related to s_{t-1}; in the backward calculation, the value of the hidden layer s'_t is related to s'_{t+1}; the final output depends on the sum of the forward and backward calculations. Modeled on Formula 1 and Formula 2, we can write the calculation of a bidirectional recurrent neural network:

$$o_t = g(V s_t + V' s'_t)$$

$$s_t = f(U x_t + W s_{t-1})$$

$$s'_t = f(U' x_t + W' s'_{t+1})$$

From the three formulas above we can see that the forward and backward calculations do not share weights; that is, U and U', W and W', and V and V' are all different weight matrices.
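A minimal numpy sketch of this bidirectional forward pass, assuming f = tanh and equal hidden sizes in both directions; U2, W2, and V2 stand in for U', W', and V', and the output activation g is left out for brevity:

```python
import numpy as np

def bidirectional_forward(xs, U, W, V, U2, W2, V2):
    """Forward pass of a bidirectional RNN (toy sketch, tanh assumed)."""
    T, n = len(xs), W.shape[0]        # assumes both directions use hidden size n
    A = np.zeros((T + 1, n))          # forward states; A[0] is the initial state
    A2 = np.zeros((T + 2, n))         # backward states; A2[T+1] is the initial state
    for t in range(1, T + 1):         # forward direction: s_t depends on s_{t-1}
        A[t] = np.tanh(U @ xs[t - 1] + W @ A[t - 1])
    for t in range(T, 0, -1):         # backward direction: s'_t depends on s'_{t+1}
        A2[t] = np.tanh(U2 @ xs[t - 1] + W2 @ A2[t + 1])
    # each output sums the contributions of the two directions (g omitted)
    return [V @ A[t] + V2 @ A2[t] for t in range(1, T + 1)]
```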

Deep Recurrent Neural Networks

  The recurrent neural networks described so far have only one hidden layer. Of course, we can also stack two or more hidden layers, which gives a deep recurrent neural network, as shown below:

(Figure: a deep recurrent network with several stacked hidden layers, each with its own recurrent connection)
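A hedged sketch of one time step of such a stacked network, assuming each layer feeds its hidden state upward as the next layer's input and uses tanh (the original figure does not spell out the wiring):

```python
import numpy as np

def deep_rnn_step(x_t, states, layers):
    """One time step of a stacked RNN; `layers` is a list of (U_i, W_i) pairs
    and `states` holds the previous hidden state of each layer."""
    new_states, inp = [], x_t
    for (U_i, W_i), s_prev in zip(layers, states):
        s_i = np.tanh(U_i @ inp + W_i @ s_prev)  # same recurrence at every layer
        new_states.append(s_i)
        inp = s_i                                # feed upward to the next layer
    return new_states
```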

Recurrent Neural Network Training

The training algorithm for recurrent neural networks: BPTT

BPTT is the training algorithm for the recurrent layer. Its basic principle is the same as that of the BP algorithm, and it likewise contains three steps:

  1. Forward-compute the output value of each neuron;
  2. Backward-compute the error term δ_j of each neuron, which is the partial derivative of the error function E with respect to the neuron's weighted input net_j;
  3. Compute the gradient of each weight.

Finally, the weights are updated with the stochastic gradient descent algorithm. Since https://zybuluo.com/hanbingtao/note/541458 carries out the numerical derivation with matrix operations, which I had somewhat forgotten after a long time, after much searching I found that the mathematical presentation at https://www.jianshu.com/p/87aa03352eb9 is easier to understand. The recurrent layer is shown below:

(Figure: the complete topology of the RNN, showing the parameters U, V, and W at each time step)

The figure above shows the complete topology of the RNN network, and from it we can see the network's parameters. Here we only analyze the network's behavior and the mathematical derivation at time t. At time t the network receives an input x_t, and the neuron state s_t at this moment is expressed as:

$$s_t = \tanh(U x_t + W s_{t-1})$$

For convenience, we make the following substitutions:

$$s_t^* = U x_t + W s_{t-1}, \qquad s_t = \tanh(s_t^*)$$

$$o_t^* = V s_t, \qquad o_t = \mathrm{softmax}(o_t^*)$$
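A small numpy sketch of this forward pass that caches the intermediates s*_t, s_t, and o_t needed later by BPTT (toy code under the tanh/softmax substitutions above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(xs, U, W, V):
    """Forward pass caching s*_t, s_t, and o_t for every time step."""
    s = [np.zeros(W.shape[0])]               # s[0] is the initial hidden state
    s_star, o = [], []
    for x_t in xs:
        s_star.append(U @ x_t + W @ s[-1])   # s*_t = U x_t + W s_{t-1}
        s.append(np.tanh(s_star[-1]))        # s_t  = tanh(s*_t)
        o.append(softmax(V @ s[-1]))         # o_t  = softmax(o*_t), o*_t = V s_t
    return s_star, s, o
```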

The RNN's loss function is the cross entropy (Cross Entropy), one of the most widely used loss functions in machine learning. Its usual expression is as follows:

$$L = -\sum_{i=1}^{n} y_i \ln y_i^*$$

The formula above is the scalar form of cross entropy, where y_i is the true label value and y*_i is the prediction given by the model. The outermost summation is there because the model's output is generally a multi-dimensional vector, and only by summing the losses over all n dimensions do we obtain the true loss value. When cross entropy is applied to an RNN, some changes are needed. First, the RNN's output is in vector form, so there is no need to sum all the dimensions together; the loss can be expressed directly as a vector. Second, since the RNN model deals with sequence problems, its loss cannot be the loss of just one moment; it should include the losses of all N time steps.
The loss function of the RNN model at time t is therefore written as:

$$L_t = -\,y_t \odot \ln o_t$$
The loss function over all N time steps (the global loss) is expressed in the following form:

$$L = \sum_{t=1}^{N} L_t = -\sum_{t=1}^{N} y_t \odot \ln o_t$$

To be clear: y_t is the true label of the input at time t, o_t is the model's prediction, and N denotes all N time steps. For ease of writing, Loss is abbreviated as L below. Before closing this subsection, we add the derivative formula of the softmax function:

$$\frac{\partial o_t}{\partial o_t^*} = \mathrm{diag}(o_t) - o_t\, o_t^{\mathsf T}$$
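A quick numerical check of this Jacobian formula in numpy (a sketch; the test point z is arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# check d o / d o* = diag(o) - o o^T against central finite differences
z = np.array([0.3, -1.2, 0.8])
o = softmax(z)
analytic = np.diag(o) - np.outer(o, o)
eps = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
assert np.allclose(analytic, numeric, atol=1e-8)
```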

The BPTT Algorithm

Because the RNN model is tied to a time sequence, the BP (back propagation) algorithm cannot be used directly. The BPTT algorithm was proposed for the special situation of the RNN problem; its full name is back propagation through time. The foundation of this method is still the ordinary chain rule of differentiation, and we now begin the concrete derivation. Although the RNN's global loss involves all N time steps, for simplicity the derivation below focuses only on the loss function at time t.

First, find the differential of the loss function at time t with respect to o_t*:

$$\frac{\partial L_t}{\partial o_t^*} = \frac{\partial L_t}{\partial o_t} \cdot \frac{\partial o_t}{\partial o_t^*}$$
Find the differential of the loss function with respect to the parameter V:

$$\frac{\partial L_t}{\partial V} = \frac{\partial L_t}{\partial o_t^*} \otimes s_t = \frac{\partial L_t}{\partial o_t^*}\, s_t^{\mathsf T}$$
Therefore, the differential of the global loss with respect to the parameter V is:

$$\frac{\partial L}{\partial V} = \sum_{t=1}^{N} \frac{\partial L_t}{\partial o_t^*}\, s_t^{\mathsf T}$$
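A one-line numpy rendering of this V gradient, using the simplified form ∂L_t/∂o*_t = o_t − y_t that is derived later in this note (function name and argument layout are mine):

```python
import numpy as np

def grad_V(o_seq, y_seq, s_seq):
    """Global gradient w.r.t. V: sum over t of (dL_t/do*_t) outer s_t."""
    return sum(np.outer(o_t - y_t, s_t)
               for o_t, y_t, s_t in zip(o_seq, y_seq, s_seq))
```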
Find the differential of the loss function at time t with respect to s_t*:

$$\frac{\partial L_t}{\partial s_t^*} = \left(V^{\mathsf T} \frac{\partial L_t}{\partial o_t^*}\right) \odot (1 - s_t \odot s_t)$$
Find the differential of the loss function at time t with respect to s*_{t-1}:

$$\frac{\partial L_t}{\partial s_{t-1}^*} = \left(W^{\mathsf T} \frac{\partial L_t}{\partial s_t^*}\right) \odot (1 - s_{t-1} \odot s_{t-1})$$
Find the partial differential of the loss function at time t with respect to the parameter U. Note: since this is a time-sequence model, the differential with respect to U at time t is related to all of the previous t-1 time steps. In a concrete computation one may cap the backtracking at the previous n time steps, but in the derivation all of the previous t-1 steps must be included:

$$\frac{\partial L_t}{\partial U} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial s_k^*}\, x_k^{\mathsf T}$$
Therefore, the partial differential of the global loss with respect to U is:

$$\frac{\partial L}{\partial U} = \sum_{t=1}^{N} \sum_{k=1}^{t} \frac{\partial L_t}{\partial s_k^*}\, x_k^{\mathsf T}$$
Find the partial differential of the loss function at time t with respect to the parameter W. By the same reasoning as above, all of the previous t-1 time steps must still be computed here:

$$\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial s_k^*}\, s_{k-1}^{\mathsf T}$$

Therefore, the differential of the global loss with respect to the parameter W is:

$$\frac{\partial L}{\partial W} = \sum_{t=1}^{N} \sum_{k=1}^{t} \frac{\partial L_t}{\partial s_k^*}\, s_{k-1}^{\mathsf T}$$
At this point, the differentials of the global loss function with respect to the three main parameters have all been obtained. Collected together:

$$\frac{\partial L}{\partial V} = \sum_{t=1}^{N} \frac{\partial L_t}{\partial o_t^*}\, s_t^{\mathsf T}, \qquad \frac{\partial L}{\partial U} = \sum_{t=1}^{N} \sum_{k=1}^{t} \frac{\partial L_t}{\partial s_k^*}\, x_k^{\mathsf T}, \qquad \frac{\partial L}{\partial W} = \sum_{t=1}^{N} \sum_{k=1}^{t} \frac{\partial L_t}{\partial s_k^*}\, s_{k-1}^{\mathsf T}$$
Next we further simplify the differential expressions above, mainly the differential of the loss at time t with respect to o_t and with respect to s_t*. Given the expression of the loss function at time t, find the differential with respect to o_t (⊘ denotes element-wise division):

$$\frac{\partial L_t}{\partial o_t} = -\,y_t \oslash o_t$$
Differentiating the softmax function:

$$\frac{\partial o_t}{\partial o_t^*} = \mathrm{diag}(o_t) - o_t\, o_t^{\mathsf T}$$
Therefore:

$$\frac{\partial L_t}{\partial o_t^*} = \big(\mathrm{diag}(o_t) - o_t\, o_t^{\mathsf T}\big)\big(-\,y_t \oslash o_t\big) = o_t \sum_i y_{t,i} - y_t$$
And because the label y_t is a one-hot vector:

$$\sum_i y_{t,i} = 1$$
we have:

$$\frac{\partial L_t}{\partial o_t^*} = o_t - y_t, \qquad \frac{\partial L_t}{\partial s_t^*} = \big(V^{\mathsf T}(o_t - y_t)\big) \odot (1 - s_t \odot s_t)$$
With the mathematical derivation above, we obtain the gradient formulas of the global loss with respect to the three parameters U, V, and W:

$$\frac{\partial L}{\partial V} = \sum_{t=1}^{N} (o_t - y_t)\, s_t^{\mathsf T}$$

$$\frac{\partial L}{\partial U} = \sum_{t=1}^{N} \sum_{k=1}^{t} \frac{\partial L_t}{\partial s_k^*}\, x_k^{\mathsf T}$$

$$\frac{\partial L}{\partial W} = \sum_{t=1}^{N} \sum_{k=1}^{t} \frac{\partial L_t}{\partial s_k^*}\, s_{k-1}^{\mathsf T}$$
Because the differential formulas for the parameters U and W involve not only time t but also all of the previous t-1 time steps, no direct closed-form expression can be written for them. However, the recursive formula given above for the differential of the loss at time t with respect to s*_{t-1} makes solving them quite simple, so it is not repeated here.
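Putting the whole derivation together, here is a hedged numpy sketch of the BPTT gradients for U, V, and W; it assumes f = tanh, g = softmax, and the cross-entropy loss as in the derivation above, and the function name bptt_grads is mine:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt_grads(xs, ys, U, W, V):
    """BPTT gradients of the global loss w.r.t. U, W, V (sketch, not a
    reference implementation)."""
    # forward pass, caching states (s[0] is the initial hidden state)
    s = [np.zeros(W.shape[0])]
    o = []
    for x_t in xs:
        s.append(np.tanh(U @ x_t + W @ s[-1]))
        o.append(softmax(V @ s[-1]))
    # backward pass
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    for t in range(len(xs)):
        d_o_star = o[t] - ys[t]                    # dL_t/do*_t = o_t - y_t
        dV += np.outer(d_o_star, s[t + 1])         # dL/dV term at time t
        d_s_star = (V.T @ d_o_star) * (1 - s[t + 1] ** 2)
        for k in range(t, -1, -1):                 # walk back to the first step
            dU += np.outer(d_s_star, xs[k])        # dL_t/ds*_k outer x_k
            dW += np.outer(d_s_star, s[k])         # s[k] here plays s_{k-1}
            d_s_star = (W.T @ d_s_star) * (1 - s[k] ** 2)
    return dU, dW, dV
```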

This completes the mathematical derivation of the BPTT algorithm. As the final results show, the partial differentials in the three formulas are very simple and can be substituted directly into a concrete optimization. For this kind of optimization problem, the most common method is gradient descent. For the RNN problem in this article, the gradient update formulas of the three parameters (with learning rate η) can be constructed as:

$$V \leftarrow V - \eta\, \frac{\partial L}{\partial V}, \qquad U \leftarrow U - \eta\, \frac{\partial L}{\partial U}, \qquad W \leftarrow W - \eta\, \frac{\partial L}{\partial W}$$
Relying on the gradient update formulas above, the three parameters can be solved for iteratively until their values converge.
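A toy gradient-descent loop wired to the bptt_grads sketch above (all sizes, the learning rate η = 0.05, the epoch count, and the random data are illustrative assumptions):

```python
import numpy as np

# toy problem: input size 4, hidden size 8, output size 3, sequence length 6
rng = np.random.default_rng(0)
U = rng.normal(size=(8, 4)) * 0.1
W = rng.normal(size=(8, 8)) * 0.1
V = rng.normal(size=(3, 8)) * 0.1
xs = rng.normal(size=(6, 4))                   # one input sequence
ys = np.eye(3)[rng.integers(0, 3, size=6)]     # one-hot labels per step
eta = 0.05                                     # learning rate
for epoch in range(200):                       # iterate toward convergence
    dU, dW, dV = bptt_grads(xs, ys, U, W, V)
    U -= eta * dU                              # U <- U - eta * dL/dU
    W -= eta * dW                              # W <- W - eta * dL/dW
    V -= eta * dV                              # V <- V - eta * dL/dV
```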
