Understanding LSTM Networks (translation)

**Understanding LSTM Networks**
Original article: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Author: Christopher Olah (Google Brain, Research Scientist)

https://colah.github.io/about.html

**Recurrent Neural Networks**
Humans don't start their thinking from scratch every second. As you read this essay, you understand each word based on your understanding of the words before it. You don't throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can't do this, and it seems like a major shortcoming. For example, imagine you want to classify what kind of event is happening at every point in a movie. It's unclear how a traditional neural network could use its reasoning about previous events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist.

Recurrent Neural Networks have loops.

In the diagram above, a chunk of neural network, A, looks at some input x_t and outputs a value h_t. A loop allows information to be passed from one step of the network to the next.

These loops make recurrent neural networks seem a bit mysterious. However, if you think about it for a moment, it turns out they aren't all that different from an ordinary neural network. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Consider what happens if we unroll the loop:

An unrolled recurrent neural network.

This chain-like nature reveals that recurrent neural networks are intimately related to sequences and lists. They're the natural neural network architecture to use for such data.

And they certainly are used! In the last few years, RNNs have been applied with incredible success to a variety of problems: speech recognition, language modeling, translation, image captioning, and so on. I'll leave discussion of the amazing feats one can achieve with RNNs to Andrej Karpathy's blog post, The Unreasonable Effectiveness of Recurrent Neural Networks. But they really are pretty amazing.

Essential to these successes is the use of "LSTMs", a very special kind of recurrent neural network which, for many tasks, works much better than the standard version. Almost all of the exciting results based on recurrent neural networks are achieved with them. It's these LSTMs that this essay will explore.

The Problem of Long-Term Dependencies

One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using earlier video frames to help understand the current frame. If RNNs could do this, they'd be extremely useful. But can they? It depends. Sometimes we only need to look at recent information to perform the present task; where the gap between the relevant information and the place it's needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text "I grew up in France… I speak fluent French." Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It's entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.


In theory, RNNs are absolutely capable of handling such "long-term dependencies." A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don't seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] (http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf) and Bengio, et al. (1994) (http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf), who found some fairly fundamental reasons why it might be difficult.

Thankfully, LSTMs don't have this problem!

LSTM Networks

Long Short Term Memory networks, usually just called "LSTMs", are a special kind of RNN, capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber (1997), and were refined and popularized by many people in subsequent work. They work tremendously well on a large variety of problems, and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.


The repeating module in a standard RNN contains a single layer.
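To make the "single tanh layer" concrete, here is a minimal NumPy sketch of one step of this repeating module. The weight names (W_xh, W_hh, b_h) are illustrative choices for this example, not taken from the original post:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the repeating module is a single tanh layer.

    x_t:    input vector at time t, shape (input_size,)
    h_prev: previous hidden state, shape (hidden_size,)
    """
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    return h_t

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W_xh = rng.standard_normal((hidden_size, input_size))
W_hh = rng.standard_normal((hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):  # a sequence of 5 inputs
    h = rnn_step(x, h, W_xh, W_hh, b_h)
```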

LSTMs also have this chain-like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

The repeating module in an LSTM contains four interacting layers.

Don't worry about the details of what's going on here. We'll walk through the LSTM diagram step by step later. For now, let's just get comfortable with the notation we'll be using.

In these diagrams, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.

The Core Idea Behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged.


The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through.

A value of 0 means "let nothing through."

A value of 1 means "let everything through!"

An LSTM has three of these gates, to protect and control the cell state.
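As a rough sketch of the gate mechanism just described, in NumPy (the function and weight names here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gate(h_prev, x_t, W, b, values):
    """A gate: a sigmoid layer plus a pointwise multiplication.

    The sigmoid outputs lie in (0, 1): a value near 0 blocks the
    corresponding component of `values`, a value near 1 lets it through.
    """
    g = sigmoid(W @ np.concatenate([h_prev, x_t]) + b)
    return g * values
```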

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer." It looks at h_{t-1} and x_t, and outputs a number between 0 and 1 for each number in the cell state C_{t-1}. A 1 represents "completely keep this," while a 0 represents "completely get rid of this."

Let's go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

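In the standard LSTM notation, with σ the sigmoid function, [h_{t-1}, x_t] the concatenation of the previous hidden state and the current input, and W_f, b_f the forget gate's learned weights and bias, this step is written as:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$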

The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we'll update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step, we'll combine these two to create an update to the state.

In the example of our language model, we'd want to add the gender of the new subject to the cell state, to replace the old one we're forgetting.

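In the same notation, the input gate and the candidate vector are:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right)$$

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$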

It's now time to update the old cell state, C_{t-1}, into the new cell state C_t. The previous steps already decided what to do; we just need to actually do it.

We multiply the old state by f_t, forgetting the things we decided to forget earlier. Then we add i_t * C̃_t, the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.

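In equation form, the update combines the two previous steps (with * denoting pointwise multiplication):

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$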

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides which parts of the cell state we're going to output.

Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what comes next. For example, it might output whether the subject is singular or plural, so that we know what form the verb should take if that's what follows.

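In the same notation, the output step is:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right)$$

$$h_t = o_t * \tanh(C_t)$$

Putting the four steps together, here is a minimal NumPy sketch of one LSTM step following the equations above. The weight layout (one matrix per gate, acting on the concatenated [h_{t-1}, x_t]) is one common convention, and the parameter names are illustrative rather than taken from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM step, following the walkthrough above.

    Each W_* has shape (hidden_size, hidden_size + input_size) and acts on
    the concatenation [h_prev, x_t]; each b_* has shape (hidden_size,).
    """
    z = np.concatenate([h_prev, x_t])

    f_t = sigmoid(W_f @ z + b_f)          # forget gate: what to throw away
    i_t = sigmoid(W_i @ z + b_i)          # input gate: which values to update
    C_tilde = np.tanh(W_C @ z + b_C)      # candidate values for the cell state
    C_t = f_t * C_prev + i_t * C_tilde    # new cell state
    o_t = sigmoid(W_o @ z + b_o)          # output gate: what to expose
    h_t = o_t * np.tanh(C_t)              # new hidden state (filtered cell state)

    return h_t, C_t
```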

Variants on Long Short Term Memory

What I've described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but some of them are worth mentioning.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds "peephole connections." This means that we let the gate layers look at the cell state.

The diagram above adds peepholes to all the gates, but many papers will give some peepholes and not others.
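With peepholes on all three gates, each gate also looks at the cell state, along the lines of:

$$f_t = \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right)$$

$$i_t = \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right)$$

$$o_t = \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right)$$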

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what new information to add, we make those decisions together. We only forget when we're going to input something in its place, and we only input new values to the state when we forget something older.

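In equation form, the coupled variant ties the input gate to the forget gate, replacing i_t with 1 − f_t:

$$C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t$$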

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single "update gate." It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

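For reference, the standard GRU update in the notation used above is (biases omitted for brevity; note there is no separate cell state, everything lives in h_t):

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)$$

$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)$$

$$\tilde{h}_t = \tanh\left(W \cdot [r_t * h_{t-1}, x_t]\right)$$

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$$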

These are only a few of the most notable LSTM variants. There are lots of others, such as the Depth Gated RNNs of Yao, et al. (2015) (https://arxiv.org/pdf/1508.03790v2.pdf). There are also some completely different approaches to tackling long-term dependencies, like the Clockwork RNNs of Koutnik, et al. (2014) (https://arxiv.org/pdf/1402.3511v1.pdf).

Which of these variants is best? Do the differences matter? Greff, et al. (2015) did a nice comparison of popular variants, finding that they're all about the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks (http://proceedings.mlr.press/v37/jozefowicz15.pdf).

Conclusion

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really do work a lot better for most tasks!

Written down as a set of equations, LSTMs look pretty intimidating. Hopefully, walking through them step by step in this essay has made them a bit more approachable.

LSTMs were a big step in what we can accomplish with RNNs. It's natural to wonder: is there another big step? A common opinion among researchers is: "Yes, and it's attention!" The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu, et al. (2015) do exactly this, and it might be a fun starting point if you want to explore attention. There have been a number of really exciting results using attention, and it seems like many more are around the corner…

Attention isn't the only exciting thread in RNN research. For example, the work of Kalchbrenner, et al. (2015) seems extremely promising. Work using RNNs in generative models, such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015), also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to be even more so!

Acknowledgments

I'm grateful to a number of people for helping me better understand LSTMs, commenting on the visualizations, and providing feedback on this post.

I'm very grateful to my colleagues at Google for their helpful feedback, especially Oriol Vinyals, Greg Corrado, Jon Shlens, Luke Vilnis, and Ilya Sutskever. I'm also thankful to many other friends and colleagues for taking the time to help me, including Dario Amodei and Jacob Steinhardt. I'm especially grateful to Kyunghyun Cho for extremely thoughtful correspondence about my diagrams.

Before this post, I practiced explaining LSTMs during two seminar series I taught on neural networks. Thanks to everyone who participated in those for their patience with me, and for their feedback.
