Long short-term memory (LSTM) in plain language, step by step, with code!

1. What is LSTM

As you read this article, you infer the meaning of each word based on your understanding of the words you have already seen. You do not throw everything away and start thinking with a blank mind; your thoughts have persistence. The LSTM is equipped with this same property.

This article introduces another commonly used type of gated recurrent neural network: long short-term memory (LSTM) [1]. Its structure is slightly more complex than that of the gated recurrent unit, it likewise addresses the gradient decay problem in RNNs, and it can be seen as an extension of the GRU.

If you first understand how the GRU works, the LSTM will be much easier to follow. See: Three steps to understand the gated recurrent unit (GRU).

The LSTM introduces three gates, namely the input gate, the forget gate, and the output gate, together with a memory cell of the same shape as the hidden state (some literature treats the memory cell as a special kind of hidden state), which records additional information.

2. Input gate, forget gate, and output gate

Like the reset gate and update gate in the gated recurrent unit, the inputs to the LSTM gates are the input Xt at the current time step and the hidden state Ht-1 from the previous time step, and their outputs are computed by fully connected layers whose activation function is the sigmoid function. As a result, every element of these three gates lies in the range [0, 1]. As shown below:

Specifically, suppose the number of hidden units is h. Given the minibatch input at time step t, \(X_t \in \mathbb{R}^{n \times d}\) (n samples, d inputs), and the hidden state from the previous time step, \(H_{t-1} \in \mathbb{R}^{n \times h}\), the three gates are computed as follows:

Input gate: \[I_t=\sigma(X_tW_{xi}+H_{t-1}W_{hi}+b_i)\]

Forget gate: \[F_t=\sigma(X_tW_{xf}+H_{t-1}W_{hf}+b_f)\]

Output gate: \[O_t=\sigma(X_tW_{xo}+H_{t-1}W_{ho}+b_o)\]
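
To make the shapes concrete, here is a minimal NumPy sketch of the three gate computations for a single time step. The sizes n, d, h and the parameter names (W_xi, W_hi, b_i, and so on) are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, d, h = 4, 10, 8                                    # batch size, input size, hidden units (arbitrary)
rng = np.random.default_rng(0)

X_t    = rng.normal(size=(n, d))                      # current time-step input X_t
H_prev = rng.normal(size=(n, h))                      # previous hidden state H_{t-1}

# Randomly initialized gate parameters, for illustration only
W_xi, W_hi, b_i = rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros(h)
W_xf, W_hf, b_f = rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros(h)
W_xo, W_ho, b_o = rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros(h)

I_t = sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)       # input gate, elements in (0, 1)
F_t = sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)       # forget gate
O_t = sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)       # output gate
```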

3. Candidate memory cell

Next, the LSTM needs to compute the candidate memory cell \(\tilde{C}_t\). Its computation is similar to that of the three gates introduced above, but it uses the tanh function, whose range is [-1, 1], as the activation function. As shown below:

Specifically, the candidate memory cell at time step t is computed as follows:

\[\tilde{C}_t=\tanh(X_tW_{xc}+H_{t-1}W_{hc}+b_c)\]
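
Continuing the NumPy sketch above (same X_t, H_prev, and sizes), the candidate memory cell uses the same linear form as the gates but with a tanh activation, so every element lies in (-1, 1); W_xc, W_hc, b_c are again illustrative names.

```python
# Randomly initialized candidate-cell parameters, for illustration only
W_xc, W_hc, b_c = rng.normal(size=(d, h)), rng.normal(size=(h, h)), np.zeros(h)

C_tilde = np.tanh(X_t @ W_xc + H_prev @ W_hc + b_c)   # candidate memory cell, elements in (-1, 1)
```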

4. Memory cell

We can use the input gate, forget gate, and output gate, whose elements lie in [0, 1], to control the flow of information in the hidden state; this is usually done by elementwise multiplication (denoted ⊙). The memory cell at the current time step, \(C_t \in \mathbb{R}^{n \times h}\), combines the information of the memory cell from the previous time step and the candidate memory cell of the current time step, with the forget gate and input gate controlling the flow of information:

\[C_t=F_t⊙C_{t-1}+I_t⊙\tilde{C}_t\]

As shown in the figure below, the forget gate controls whether the information in the memory cell Ct-1 of the previous time step is passed on to the current time step, while the input gate controls how the current time-step input Xt flows into the memory cell of the current time step through the candidate memory cell C̃t. If the forget gate stays approximately 1 and the input gate stays approximately 0, the past memory cells are preserved and carried through time to the current time step. This design helps cope with gradient decay in recurrent neural networks and better captures dependencies between time steps that are far apart in a time series.
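
In the sketch, this update is plain elementwise arithmetic; C_prev stands in for C_{t-1} and is assumed to be all zeros at the first time step.

```python
C_prev = np.zeros((n, h))                             # previous memory cell C_{t-1} (zeros at t = 1)

# ⊙ is just elementwise multiplication; with F_t ≈ 1 and I_t ≈ 0 this would copy C_prev unchanged
C_t = F_t * C_prev + I_t * C_tilde
```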

5. Hidden state

With the memory cell in place, we can use the output gate to control the flow of information from the memory cell Ct to the hidden state Ht:

\[H_t=O_t⊙tanh(C_t)\]

The tanh function here ensures that the elements of the hidden state take values between -1 and 1. Note that when the output gate is approximately 1, the memory cell's information is passed to the hidden state for use by the output layer; when the output gate is approximately 0, the memory cell's information is kept only within the cell itself. The figure below shows the full computation of the hidden state in the LSTM:
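
The last line of the sketch completes one full LSTM step: only the output-gated tanh of the memory cell reaches the hidden state.

```python
H_t = O_t * np.tanh(C_t)                              # hidden state, elements in (-1, 1)
print(H_t.shape)                                      # (4, 8): one h-dimensional hidden state per sample
```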

6. Differences between LSTM and GRU

The LSTM and the GRU are very similar in structure; they differ in the following ways:

  1. In both models the new memory is computed from the previous state and the current input, but the GRU has a reset gate that controls how much of the previous state flows in, whereas the LSTM has no comparable gate;
  2. They produce the new state in different ways: the LSTM has two separate gates, the forget gate and the input gate, while the GRU has only a single update gate;
  3. The LSTM can regulate the newly produced state through an output gate, whereas the GRU applies no regulation to its output.
  4. The advantage of the GRU is that it is a simpler model, so it is easier to build a larger network; having only two gates, it is also faster to compute, which makes it easier to scale up the model.
  5. The LSTM is more powerful and flexible, since it has three gates instead of two.

7. Can the LSTM use other activation functions?

Regarding the choice of activation functions: in the LSTM, the forget gate, input gate, and output gate use the sigmoid function as their activation function, while the candidate memory is generated with the hyperbolic tangent (tanh) as the activation function.

It is worth noting that both of these activation functions are saturating: once the input reaches a certain magnitude, the output no longer changes noticeably. If a non-saturating activation function such as ReLU were used instead, the gating effect would be hard to achieve.

The output of the sigmoid function lies between 0 and 1, which matches the physical meaning of a gate; when the input is very large or very small, the output is very close to 1 or 0, ensuring that the gate is fully open or closed. Tanh is used when generating the candidate memory because its output lies between -1 and 1, which matches the zero-centered feature distributions found in most scenarios. In addition, tanh has a larger gradient than sigmoid near an input of 0, which usually makes the model converge faster.
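
As a quick numeric illustration of the saturation argument (not taken from the original text): sigmoid squashes large-magnitude inputs toward 0 or 1, which is exactly the open/closed behavior a gate needs, while ReLU is unbounded on the positive side and has no such switch-like shape.

```python
import numpy as np

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sigmoid = 1.0 / (1.0 + np.exp(-x))    # ≈ [0.00005, 0.27, 0.5, 0.73, 0.99995]: saturates near 0 and 1
relu    = np.maximum(0.0, x)          # [0, 0, 0, 1, 10]: unbounded above, no saturation

print(sigmoid)
print(relu)
```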

The choice of activation function is not fixed once and for all, but whichever function is used should be a reasonable one.

8. Code implementation

MNIST classification -- implementing an LSTM in TensorFlow
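
The linked article contains the full implementation; what follows is only a minimal, independent sketch of the same idea using the TensorFlow Keras API, where each 28×28 MNIST image is fed to the LSTM as a sequence of 28 rows of 28 pixels. The layer sizes and training settings here are arbitrary choices, not those of the linked code.

```python
import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1]; each image becomes 28 time steps of 28 features
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28)),
    tf.keras.layers.LSTM(128),                         # 128 hidden units (arbitrary)
    tf.keras.layers.Dense(10, activation="softmax"),   # 10 digit classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=128,
          validation_data=(x_test, y_test))
```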

The easy-to-understand machine learning article series


9. References

[1] Dive into Deep Learning (《动手学深度学习》)


Author: @mantchs

GitHub:https://github.com/NLP-LOVE/ML-NLP

Everyone is welcome to join the discussion and help improve this project! QQ group: 541954936 (NLP interview study group)
