LSTM

Learning video: https://www.youtube.com/watch?v=R1S2xpF3O8Q

Recurrent Neural Networks

Humans don't start their thinking from scratch every moment. As you read this, you understand each word based on your understanding of the words you have already seen. You don't throw everything away and start thinking from a blank slate again. Your thoughts have persistence.
Traditional neural networks can't do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at each point in a movie. It's unclear how a traditional neural network could use its reasoning about earlier events in the film to inform later ones.
RNNs address this problem. They are networks that contain loops, allowing information to persist.


RNN contains loops

In the diagram above, a module of the neural network, A, reads some input x_t and outputs a value h_t. The loop allows information to be passed from one step of the network to the next.
These loops make RNNs look rather mysterious. However, if you think about it, they are not so different from an ordinary neural network. An RNN can be thought of as multiple copies of the same network, each module passing a message to its successor. Consider what happens if we unroll the loop:


An unrolled RNN

This chain-like structure reveals that RNNs are intimately related to sequences and lists. They are the most natural neural network architecture for this kind of data.
And they are used! Over the past few years, RNNs have been applied with considerable success to speech recognition, language modeling, translation, image captioning, and more; the list keeps growing. For a richer set of interesting RNN applications, I recommend Andrej Karpathy's blog post, The Unreasonable Effectiveness of Recurrent Neural Networks.
The key to these successes is the use of LSTMs, a special kind of RNN that performs much better than the standard version on many tasks. Almost all of the exciting results based on RNNs have been achieved with LSTMs, and it is these LSTMs that this post explores.

Long-Term Dependencies

One of the appeals of RNNs is the idea that they can connect previous information to the present task, such as using earlier segments of a video to inform the understanding of the current segment. If RNNs can do this, they are extremely useful. But can they really? The answer is: it depends.
Sometimes we only need recent information to perform the present task. For example, consider a language model that predicts the next word based on the previous ones. If we are trying to predict the last word in "the clouds are in the sky", we don't need any further context: the next word is fairly obviously going to be "sky". In such cases, where the gap between the relevant information and the place it is needed is small, RNNs can learn to use the past information.


A small gap between the relevant information and the place where it is needed

But there are also more complex cases. Suppose we are trying to predict the last word in "I grew up in France... I speak fluent French". Recent information suggests that the next word is probably the name of a language, but to narrow down which language, we need the context of France mentioned much earlier, far from the current position. It is entirely possible for the gap between the relevant information and the point where it is needed to become very large.
Unfortunately, as that gap grows, RNNs lose the ability to learn to connect information that far apart.


A large gap between the relevant information and the place where it is needed

In theory, RNNs are absolutely capable of handling such long-term dependencies. A person could carefully pick parameters for them to solve toy problems of this form. In practice, however, RNNs do not seem to be able to learn them successfully. Bengio, et al. (1994) studied this problem in depth and found some fairly fundamental reasons why training RNNs on it is very difficult.
Fortunately, however, LSTMs do not have this problem!

LSTM Networks

A Long Short-Term Memory network, usually just called an LSTM, is a special kind of RNN capable of learning long-term dependencies. LSTMs were proposed by Hochreiter & Schmidhuber (1997), and were later refined and popularized by Alex Graves, among others. They have achieved considerable success on a wide variety of problems and are now widely used.
LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!
All RNNs have the form of a chain of repeating neural network modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.


Repeating modules in standard RNNs contain a single layer
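As a concrete illustration of that single-layer module, here is a minimal NumPy sketch of the recurrence h_t = tanh(W · [h_{t-1}, x_t] + b). The dimensions, the random parameters W and b, and the toy sequence are assumptions made purely for demonstration; in practice the parameters would be learned by training:

```python
import numpy as np

def rnn_step(x_t, h_prev, W, b):
    """One step of a vanilla RNN: a single tanh layer applied to
    the concatenation of the previous hidden state and the current input."""
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)

# Toy dimensions and random (untrained) parameters, purely for illustration.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)

# "Unrolling the loop": the same module is applied to every element of the sequence.
h = np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):  # a toy sequence of 5 input vectors
    h = rnn_step(x_t, h, W, b)
print(h)  # the hidden state after reading the whole sequence
```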

An LSTM has this same chain-like structure, but the repeating module is organized differently. Instead of a single neural network layer, there are four, interacting in a very special way.


The repeating module in an LSTM contains four interacting layers

Don't worry about the details yet. We will walk through the LSTM diagram step by step. First, let's get familiar with the notation used in the diagrams.


Notation used in the LSTM diagrams

In the diagram above, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, such as vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation of vectors, while a line forking denotes its content being copied, with the copies going to different locations.

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running across the top of the diagram.
The cell state is a bit like a conveyor belt. It runs straight down the entire chain, with only a few minor linear interactions, so it is very easy for information to flow along it unchanged.


The cell state

LSTMs do have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed of a sigmoid neural network layer and a pointwise multiplication operation.


A gate: a sigmoid layer followed by a pointwise multiplication

The sigmoid layer outputs numbers between 0 and 1, describing how much of each component should be let through. A value of 0 means "let nothing through", while a value of 1 means "let everything through"!

An LSTM has three of these gates to protect and control the cell state.
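To make the gating mechanism concrete, here is a tiny NumPy sketch of a sigmoid activation followed by a pointwise multiplication; the numbers are invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A hypothetical vector of information flowing through the network,
# and a gate activation computed by a sigmoid layer (values made up).
information = np.array([0.9, -1.2, 0.4])
gate = sigmoid(np.array([4.0, -4.0, 0.0]))  # roughly [0.98, 0.02, 0.50]

filtered = gate * information               # pointwise multiplication
print(filtered)                             # roughly [0.88, -0.02, 0.20]
```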

Step-by-Step LSTM Walk-Through

The first step in our LSTM is to decide what information we are going to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate layer. It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 means "completely keep this", while a 0 means "completely get rid of this".
Let's return to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.
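Written out, with W_f and b_f denoting the learned weights and bias of this layer, the forget gate computes:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)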


Deciding what information to discard

The next step is to decide what new information we are going to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we will update. Then, a tanh layer creates a vector of new candidate values, \tilde{C}_t, that could be added to the state. In the next step, we will combine these two to create an update to the state.
In the example of our language model, we would want to add the gender of the new subject to the cell state, to replace the old one we are forgetting.
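In the same notation, the input gate and the candidate vector are:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)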


Deciding what new information to store

It is now time to update the old cell state, C_{t-1}, into the new cell state C_t. The previous steps have already decided what to do; now we just need to actually do it.
We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t * \tilde{C}_t, the new candidate values scaled by how much we decided to update each state value.
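Putting these together, the cell state update is:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t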
In the case of the language model, this is where we would actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.


Updating the cell state

Finally, we need to decide what we are going to output. This output will be based on our cell state, but will be a filtered version of it. First, we run a sigmoid layer that decides which parts of the cell state we are going to output. Then we put the cell state through tanh (pushing the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
In the language model example, since it has just seen a subject, it might want to output information relevant to a verb, in case that is what comes next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if a verb does follow.
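With the same conventions, the output step is:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t * \tanh(C_t)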


Deciding what to output
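Putting the four steps together, here is a minimal NumPy sketch of a single LSTM step built directly from the equations above. The parameter names (W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o) follow the notation of this post, and the random, untrained parameters and toy sequence are assumptions for demonstration only, not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step: forget gate, input gate + candidate, cell update, output gate."""
    W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o = params
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)             # what to forget from the cell state
    i_t = sigmoid(W_i @ z + b_i)             # which values to update
    C_tilde = np.tanh(W_C @ z + b_C)         # candidate values
    C_t = f_t * C_prev + i_t * C_tilde       # new cell state

    o_t = sigmoid(W_o @ z + b_o)             # which parts of the state to output
    h_t = o_t * np.tanh(C_t)                 # filtered cell state becomes the output
    return h_t, C_t

# Toy usage with random (untrained) parameters.
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
params = []
for _ in range(4):  # one (W, b) pair per layer: forget, input, candidate, output
    params += [rng.standard_normal((hidden_size, hidden_size + input_size)),
               np.zeros(hidden_size)]

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):  # a toy sequence of 5 inputs
    h, C = lstm_step(x_t, h, C, params)
print(h)
```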

Variants of the LSTM

Everything described so far is a fairly normal LSTM. But not all LSTMs are the same. In fact, almost every paper involving LSTMs uses a slightly different version. The differences are minor, but a few are worth mentioning.
One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds "peephole connections". This means that we let the gate layers also look at the cell state.


Peephole connections

In the diagram above, peepholes have been added to every gate, but many papers add peepholes to only some of the gates.
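With peepholes on every gate, the gate equations take the cell state as an extra input:

f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i)
o_t = \sigma(W_o \cdot [C_t, h_{t-1}, x_t] + b_o)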

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what new information to add, we make those decisions together. We only forget when we are about to input something in its place, and we only input new values into the parts of the state where we have just forgotten something older.
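In this variant the input gate is simply the complement of the forget gate, so the cell state update becomes:

C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t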


Coupled forget and input gates

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit (GRU), introduced by Cho, et al. (2014). It combines the forget and input gates into a single update gate. It also merges the cell state and the hidden state, among other changes. The resulting model is simpler than a standard LSTM and has become a very popular variant.
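In the usual formulation, a GRU step computes an update gate z_t, a reset gate r_t, and a candidate hidden state, then mixes the old and candidate hidden states:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t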


GRU

These are only a few of the most notable LSTM variants. There are many others, such as the Depth Gated RNN of Yao, et al. (2015). There are also completely different approaches to tackling long-term dependencies, such as the Clockwork RNN of Koutnik, et al. (2014).
Which of these variants is best? Do the differences matter? Greff, et al. (2015) compare the popular variants and conclude that they are all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures and found some that worked better than LSTMs on certain tasks.


Screenshot of results from Jozefowicz, et al. (2015)
