Step by Step to LSTM: Unraveling the Design Principles Behind LSTM

P.S.: Hey hey, don't just bookmark without leaving a like, okay? _(:з"∠)_

Hmmmm ...

Want to understand exactly why every formula in LSTM is designed the way it is? Want to see how the simple RNN evolves, step by step, into LSTM? Want to see through the mechanism by which LSTM works? Congratulations, you've opened the right article!

Zero, prerequisite knowledge 1:

In the previous article "From Feedforward to Feedback: Parsing RNN", Xiao Xi started from a simple feedforward neural network with no hidden layer and derived the simple recurrent neural network:

y(t)=f(X(t)\cdot W + y(t-1) \cdot V+ b)

This recurrent neural network has no hidden layer; it is named the "simple RNN".
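As a concrete sketch, the recurrence above can be written in a few lines of NumPy. This is a minimal illustration, not code from the article: the dimensions are made up, and tanh stands in for the activation f.

```python
import numpy as np

# Minimal sketch of the simple RNN: y(t) = f(x(t)·W + y(t-1)·V + b).
# Dimensions and tanh-as-f are illustrative choices.
def simple_rnn_step(x_t, y_prev, W, V, b):
    return np.tanh(x_t @ W + y_prev @ V + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))   # input-to-output weights
V = rng.normal(size=(2, 2))   # feedback (output-to-output) weights
b = np.zeros(2)

y = np.zeros(2)               # y(0): no previous decision yet
for t in range(5):            # unroll five time steps
    y = simple_rnn_step(rng.normal(size=3), y, W, V, b)
print(y.shape)                # (2,)
```

Each step's decision y depends on the current input x(t) and, through V, on the previous step's decision.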

This says that each time the network makes a decision, it takes into account the decision it made at the previous moment. As shown in the figure:

[Figure] The lower half of each ball represents the inner product of two vectors; the upper half, the activation applied to that inner product.

While this simple feedback loop does make each moment's decision depend on the previous moment's decision, it is hard to argue that this alone amounts to anything like memory!

Think about how people actually behave in a sequential task. Say you are stacking building blocks; each action goes through these sub-steps:

1. Your eyes see the block currently in your hand.
2. You recall the current state of the highest tower built so far.
3. Combining the information from 1 and 2, you decide where to place the block this time.

I believe clever readers already see what I mean. The first step is the external input x at the current time, i.e. x(t); the second step is recalling historical information/memory; the third step integrates the input x with the historical memory to infer the decision, which corresponds to the RNN's output y(t).

Did anyone notice something amazing about step 2?!! When we recall history, we generally do not recall raw specifics like the shape of each block, but a vaguer, more macroscopic scene. In this example, that scene is an abstract memory distilled from recent behavior: the "topographic map of the highest tower of blocks"!

In other words, when people perform sequential tasks, especially slightly complex ones, the subconscious approach is not to feed the previous output y(t-1) directly back in, but rather to feed back something vague and abstract. What is that something?

Of course, it is the hidden node h of the neural network! That is, what people subconsciously use is the historical memory after a period of integration, h, not merely the output of the previous moment. The output, in turn, is read off the hidden nodes. So a more faithful model of the subconscious looks like this:

[Figure] Memory is stored in and flows through the hidden units; the output is read from the hidden units.

Adding a hidden layer to the recurrent neural network in this way yields the classic RNN, the "standard RNN".
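In code, the only change from the simple RNN is that the feedback now flows through a hidden state h, and the output is read off h. A minimal sketch, with all parameter names invented for illustration:

```python
import numpy as np

# Standard RNN step: memory flows through the hidden units h,
# and the output is read from them. Parameter names are illustrative.
def standard_rnn_step(x_t, h_prev, W_xh, W_hh, b_h, W_hy, b_y):
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)  # fuzzy, integrated memory
    y_t = h_t @ W_hy + b_y                           # output read off the hidden state
    return h_t, y_t

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
W_xh = rng.normal(size=(n_in, n_hid))
W_hh = rng.normal(size=(n_hid, n_hid))
b_h  = np.zeros(n_hid)
W_hy = rng.normal(size=(n_hid, n_out))
b_y  = np.zeros(n_out)

h = np.zeros(n_hid)
for t in range(5):
    h, y = standard_rnn_step(rng.normal(size=n_in), h, W_xh, W_hh, b_h, W_hy, b_y)
print(h.shape, y.shape)   # (4,) (2,)
```

Note that what is carried from step to step is h, the integrated abstract memory, not the raw output y.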

The change from simple RNN to standard RNN, and its significance, will be very important for what follows.

Zero, prerequisite knowledge 2:

The article "From Feedforward to Feedback: Parsing RNN" also gave a simple proof that during error backpropagation, the gradient computed along the direction of forward propagation decays or amplifies exponentially, and this is a mathematical certainty. Hence the RNN's memory unit is short-term.
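The exponential decay or amplification is easy to see numerically: backpropagating through T steps multiplies the gradient by roughly the same recurrent factor T times. The numbers below are purely illustrative:

```python
# Repeatedly multiplying a gradient by a recurrent factor w over T steps:
# w < 1 makes it vanish, w > 1 makes it explode; only w = 1 keeps it intact.
T = 50
results = {}
for w in (0.9, 1.0, 1.1):
    grad = 1.0
    for _ in range(T):
        grad *= w
    results[w] = grad
print(results)   # 0.9 -> ~0.005 (vanished), 1.0 -> 1.0, 1.1 -> ~117 (exploded)
```

Fifty steps are enough to shrink the gradient by a factor of ~200 or blow it up by a factor of ~100, which is why the RNN's memory of distant inputs fades so fast.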

First, how to transport gradient information losslessly?

Well then, let us borrow the RNN's old design experience and start from the simple version, that is, a network without a hidden layer whose output simply feeds back to the input, and design a new network that avoids vanishing and exploding gradients so it can remember long-distance dependencies!

So how do we keep the gradient from vanishing or exploding exponentially as it flows through time?

It sounds quite hard, but a primary-school student could answer it: make the computed gradient identically 1! Because any power of 1 is still 1 (¯∇¯)

Following this silly-sounding idea, we design a long-term memory unit, denoted c (c denotes the long-term memory unit throughout what follows), whose mathematical model is simply:

c(t) = c(t-1)

Now the derivative along the time axis during error backpropagation is the constant 1, so the error can propagate all the way to the front of the network without loss, and the far end of the network can learn dependencies on the distant past.


Passerby: Excuse me?

Don't worry, don't worry. Suppose our information is stored in c; then c can carry that information all the way to the output layer without trouble, right? And the gradient of the loss computed at time T with respect to the information stored in c can flow all the way back to time 0 without any loss, right? Right (¯∇¯)

So the information transport problem is solved; next we must solve the problems of loading information in and unloading it.

Second, how to load information into the long-term memory unit?

First, of course, we must define what counts as new information. We might define it the way the simple RNN does: combine the current external input x(t) with the network's output at the previous time step (the feedback) y(t-1) to obtain the new information the network acquires at the current moment, denoted \hat c(t). That is:

\hat c(t) = f(W\cdot x(t)+V\cdot y(t-1))

Thanks to @hoshino042 in the comments for pointing out a typo here.

Good, the new information \hat c is defined. Now consider how to add \hat c into c. If we put this question to a primary-school student, two answers might come back:

1. Multiply it in!
2. Add it in!

So which of the two works?

Actually, a moment's thought settles it: multiplication acts more like an operation that controls or scales information (for example, multiplying any number by 0 makes it vanish, which corresponds to shutting a valve; multiplying by any number greater than 1 enlarges its scale; and so on), while addition superimposes new information onto old information.

Let us discuss the multiplicative and additive operations in depth, since this is critical for understanding LSTM. You will, of course, need to know partial derivatives, differentiation of composite functions, and the chain rule; with these three pieces of basic calculus you can follow along. Readers whose foundations are lacking can skip this discussion.

Multiplication:

With multiplication, the mathematical model for adding information to the long-term memory unit is:

c(t)=c(t-1)\cdot \hat c(t)

So the complete mathematical model of the network is:

c(t)=c(t-1)\cdot \hat c(t)   Eq. [0.1]
\hat c(t) = f(W\cdot x(t)+V\cdot y(t-1))   Eq. [0.2]
y(t)=f(c(t))   Eq. [0.3]

For convenience of calculation, assume as before that the activation is linear (i.e. effectively no activation function; in fact tanh is approximately linear for small values, and ReLU is linear for positive inputs, so this assumption is quite reasonable). The network model then simplifies to:

y(t)=y(t-1) \cdot (W\cdot x(t)+V\cdot y(t-1))   Eq. [1]

Suppose the network runs for T time steps and then the loss is computed. To update the network's weight V at time t=0, we need the partial derivative of the loss with respect to V at t=0, i.e. we compute

\frac{\partial loss(t=T)}{\partial V(t=0)}

where loss(t=T)=f_{loss}(y(t=T))

(f_loss(·) being the loss function)

A quick calculation gives \frac{\partial loss(t=T)}{\partial V(t=0)}=f_{loss}'\times y'(t=T), where f_loss' carries the gradient (parameter-update information) we want to pass along. So our target of discussion is y'(t=T), written out in full as

\frac{\partial y(t=T)}{\partial V(t=0)} Official [2]

When differentiating with respect to V, the other variables (namely W and x) naturally become constants. To simplify greatly, drop the W\cdot x(t) terms outright (keeping only the dominant power of y); then Eq. [1] can be expanded directly:

y(T) = v\cdot y(T-1)^2 = v^{2^T-1}\cdot y(0)^{2^T}

Differentiating with respect to v then gives

y' = (2^T-1)\cdot v^{2^T-2}\cdot y(0)^{2^T} = a\cdot v^{2^T-2}

If the RNN's v^T is a sound-speed gradient explosion/vanishing, then this v^{2^T-2} is a light-speed explosion/vanishing!

So multiplying new information directly into the long-term memory unit only makes things worse, completely defeating the original idea behind c(t)=c(t-1) of keeping the derivative constant. This also shows that a multiplicative update is not a simple superposition of information but a form of control and scaling.

Thanks to @Cheng Yi in the comments for improving the presentation here.

Addition:

What if the rule is changed to addition? The model for adding information then becomes

c(t)=c(t-1)+\hat c(t)

Proceeding as before, assuming linear activation and substituting into the network model gives

y(T) = y(T-1)+\hat c(T)

= y(T-1)+x\cdot w+ v\cdot y(T-1)

\approx (1+v)^T\cdot y(0)+T\cdot x\cdot w

Eh? There is still an exponential term. But because v now carries an offset of 1, explosion is far more likely than vanishing, and with gradient truncation (clipping), the explosion can be eased to a large degree.

Ah ~ the probability of vanishing gradients is much smaller, and exploding gradients can barely be tamed; this already looks much better than the RNN. After all, with the explosion under control, the gradient vanishes more slowly, which means longer-distance memory.

Thus, for adding information to the long-term memory unit, the additive rule is clearly superior to the multiplicative one. This again confirms that addition is better suited to superimposing information, while multiplication is better suited to control and scaling.
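The contrast is easy to verify numerically: merging a constant piece of "new information" into c for T steps grows the stored value exponentially under multiplication but only linearly under addition. Toy numbers, illustrative only:

```python
# c(t) = c(t-1) * c_hat  vs  c(t) = c(t-1) + c_hat, with a constant c_hat = 1.5.
T = 30
c_mul = 1.0
c_add = 1.0
for _ in range(T):
    c_mul *= 1.5   # multiplicative write: exponential drift
    c_add += 1.5   # additive write: linear growth
print(c_mul)       # 1.5**30, roughly 1.9e5 -- explodes
print(c_add)       # 1 + 30*1.5 = 46.0     -- stays tame
```

The same contrast holds for the gradients flowing back through these two update rules, which is the point of the derivation above.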

So we settle on the additive rule. The network we have designed so far looks like this:

c(t)=c(t-1)+\hat c(t)   Eq. [3.1]

\hat c(t) = f(W\cdot x(t)+V\cdot y(t-1))   Eq. [3.2]

y(t)=f(c(t))   Eq. [3.3]

Is there a way to let information still be loaded and shipped while making the gradient even less likely to vanish, and the probability and extent of gradient explosion smaller still?

Think about it: the frequency at which we add new information to long-term memory should be very low. In real life, only a few moments are worth remembering for a long time; the information of most moments is forgotten within a few days. So the current model, which tries to memorize the information of every single time step, is clearly unreasonable; we should remember only what deserves to be recorded.

Clearly, choosing whether or not to record the new information is a control operation, so the multiplicative rule should be used. We therefore put a control valve in front of the new information, changing Eq. [3.1] to

c(t)=c(t-1)+g_{in}\cdot \hat c(t)

We call g_in the "input gate"; it ranges from 0.0 to 1.0.

To achieve that range, the sigmoid function is the obvious choice of activation for the input gate, since sigmoid's output always lies between 0.0 and 1.0. We therefore use sigmoid as the activation for the input gate and for control gates in general:

g_x = sigmoid(...)

Of course, this is the case of controlling a single long-term memory unit. In practice we will certainly set up many memory cells, or the brain capacity would be too low to be of use. Each long-term memory cell then has its own input gate; mathematically we use \otimes to denote elementwise multiplication, and the capital letter C to denote the set of long-term memory cells. That is:

C(t)=C(t-1)+g_{in}\otimes \hat C(t)   Eq. [4]

Ah ~ since the input gate opens only when necessary, in most cases Eq. [4] reduces to C(t) = C(t-1), which is our ideal state. The explosions caused by the additive accumulation are thus greatly reduced, and vanishing gradients become lighter still.
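A small numeric sketch of Eq. [4], with invented values: the input gate decides per cell how much of the candidate \hat C is written in, and a closed gate leaves that cell's memory untouched.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# C(t) = C(t-1) + g_in ⊗ C_hat(t): elementwise gated write.
C_prev = np.array([0.2, -0.5, 1.0])
C_hat  = np.array([1.0,  1.0, 1.0])
g_in   = sigmoid(np.array([-10.0, 0.0, 10.0]))   # ≈ [0, 0.5, 1]: closed / half / open
C_new  = C_prev + g_in * C_hat
print(np.round(C_new, 2))   # [0.2  0.   2. ] -- the closed gate leaves cell 0 unchanged
```

Large negative pre-activations close the gate (sigmoid ≈ 0), large positive ones open it (sigmoid ≈ 1).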

Third, the problem caused by frequent loading

Wait: thoughtful readers may have noticed a problem. What if the network reads a very information-rich text, so that the input gate, in its excitement, stays wide open and gorges itself trying to memorize all of it?

Obviously, the value of c would become very large!

Remember that when the network produces output, c must be activated (see Eq. [0.3]). When c becomes large, the outputs of common activation functions such as sigmoid and tanh saturate completely! For example, tanh:

[Figure] The tanh activation curve.

When c is large, tanh is close to 1, and making c larger still is meaningless, because it is saturated! The brain cannot memorize that much!
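The saturation is easy to check numerically:

```python
import numpy as np

# Once c is large, tanh(c) is pinned at 1: further growth of c carries
# no additional output information.
for c in (1.0, 5.0, 20.0, 100.0):
    print(c, np.tanh(c))
# tanh(1) ≈ 0.762, tanh(5) ≈ 0.9999; tanh(20) and tanh(100) are 1.0 to machine precision
```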

What to do about this? The saturation-free ReLU seems like an option, but we cannot restrict the network's output activation to ReLU alone, right? That would defeat the purpose of the design!

So what should we do?

Actually, reflecting on how we ourselves work gives the answer. The reason we can remember childhood events and also last year's events without feeling the brain run out of space is that we... forget! So we also need to add a gate for forgetting; call it the "forget gate". When each new moment arrives, the memory first passes through the forget gate to forget some things, and only then do we consider whether to accept the new information of this moment.

The forget gate clearly controls the degree to which memory fades, so it too uses the multiplicative operation. Our network design thus evolves into:

c(t)=g_{forget}c(t-1)+g_{in}\cdot \hat c(t)

Or vector form:

C(t)=g_{forget}\otimes C(t-1)+g_{in}\otimes \hat C(t)
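A tiny sketch of this two-gate update, with invented values: the forget gate can wipe a cell while the input gate refills it, or keep a cell while ignoring new input.

```python
import numpy as np

# C(t) = g_forget ⊗ C(t-1) + g_in ⊗ C_hat(t), all elementwise.
C_prev   = np.array([5.0, 5.0])
C_hat    = np.array([1.0, 1.0])
g_forget = np.array([0.0, 1.0])   # cell 0: forget everything; cell 1: keep everything
g_in     = np.array([1.0, 0.0])   # cell 0: accept new info;   cell 1: ignore it
C_new = g_forget * C_prev + g_in * C_hat
print(C_new)   # [1. 5.] -- cell 0 was overwritten, cell 1 preserved
```

With the forget gate in place, c no longer grows without bound, which is exactly what keeps the output activation out of saturation.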

Well ~ this solves the problem of adding new information to our long-term memory unit in a controlled way, and it also thoughtfully and elegantly resolves the embarrassing situation where overly rich input leaves the input gate grinning "from ear to ear". Now it is time to consider how our long-term memory unit produces output ~

Fourth, how should the network output?

One might say there is nothing to consider: isn't the current output just the activation of the current memory? Didn't we already write y(t) = f(c(t)) (where f(·) is the activation function)?

Think again. If a person has 10,000 brain cells holding long-term memories, each remembering one thing, do we recall the contents of all 10,000 cells at every moment while handling the task at hand? Clearly not; we output only the part held by the brain cells relevant to the current task. So the output of our long-term memory unit should likewise get a valve, and the output should be:

y(t)=g_{out} \cdot f(c(t))
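Numerically, with invented values, the output valve works like this: cells holding the same memory contribute differently depending on how far their output gate is open.

```python
import numpy as np

# y(t) = g_out ⊗ f(C(t)): only cells with an open output gate are read out.
C     = np.array([3.0, 3.0, 3.0])    # identical memory in all three cells
g_out = np.array([1.0, 0.5, 0.0])    # fully open / half open / closed
y = g_out * np.tanh(C)
print(np.round(y, 3))   # [0.995 0.498 0.   ]
```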

Ah ~ now it finally looks right.

Fifth, what controls the control gates?

Finally, let us define what each of our control gates (input gate, forget gate, output gate) is controlled by.

The obvious answer is to let each gate be controlled by the external input x(t) at the current time and the output y(t-1) at the previous time, i.e. g_x(t)=f(W\cdot x(t)+V\cdot y(t-1))...... right?

That reasoning would be fine in an RNN, but! Don't forget that our newly designed network has a pile of extra valves! Note the output gate in particular: once it closes, the memory f(c(t)) it controls is cut off, and at the next moment each gate is controlled only by the current external input x(t)! This clearly violates our design goal of letting decisions take as much historical information into account as possible. What to do?

The simplest remedy is to also wire the long-term memory unit into each gate: feed the previous moment's long-term memory c(t-1) into the input gate and the forget gate, and the current long-term memory c(t) into the output gate (by the time information flows to the output gate, the current long-term memory has already been computed). That is:

g_{in}(t) = sigm(W\cdot x(t)+V\cdot y(t-1)+U\cdot c(t-1))
g_{forget}(t) = sigm(W\cdot x(t)+V\cdot y(t-1)+U\cdot c(t-1))
g_{out}(t) = sigm(W\cdot x(t)+V\cdot y(t-1)+U\cdot c(t))

Of course, this scheme of letting each gate see the long-term memory is a patch added by later researchers; these connections from the long-term memory cells to the gate units are called "peephole" connections.

Sixth, the simple version is complete

Any questions so far? It really does look fine now. Our simple version of the network is complete. To summarize:

C(t)=g_{forget}\otimes C(t-1)+g_{in}\otimes \hat C(t)
\hat C(t) = f(W\cdot x(t)+V\cdot y(t-1))
y(t)=g_{out} \otimes f(C(t))

g_{in}(t) = sigm(W\cdot x(t)+V\cdot y(t-1)+U\cdot C(t-1))
g_{forget}(t) = sigm(W\cdot x(t)+V\cdot y(t-1)+U\cdot C(t-1))
g_{out}(t) = sigm(W\cdot x(t)+V\cdot y(t-1)+U\cdot C(t))

Let us name it the "gated simple RNN"! (Not an academically recognized name.)
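Putting the summary above into code gives one step of this gated simple RNN. This is a sketch under the article's equations: the peephole terms are modeled as elementwise (diagonal) weights U, and all parameter names are invented for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One step of the gated simple RNN (no hidden layer yet). Peepholes: the
# input/forget gates see C(t-1); the output gate sees the freshly computed C(t).
def gated_simple_rnn_step(x, y_prev, C_prev, p):
    g_in     = sigmoid(p['Wi'] @ x + p['Vi'] @ y_prev + p['Ui'] * C_prev)
    g_forget = sigmoid(p['Wf'] @ x + p['Vf'] @ y_prev + p['Uf'] * C_prev)
    C_hat    = np.tanh(p['Wc'] @ x + p['Vc'] @ y_prev)
    C        = g_forget * C_prev + g_in * C_hat
    g_out    = sigmoid(p['Wo'] @ x + p['Vo'] @ y_prev + p['Uo'] * C)
    y        = g_out * np.tanh(C)
    return y, C

rng = np.random.default_rng(0)
n_in, n = 4, 3
p = {}
for g in 'ifco':
    p['W' + g] = rng.normal(scale=0.5, size=(n, n_in))
    p['V' + g] = rng.normal(scale=0.5, size=(n, n))
    p['U' + g] = rng.normal(scale=0.5, size=n)   # diagonal peephole weights
del p['Uc']                                      # the candidate C_hat has no peephole
y, C = np.zeros(n), np.zeros(n)
for t in range(10):
    y, C = gated_simple_rnn_step(rng.normal(size=n_in), y, C, p)
print(y.shape, C.shape)   # (3,) (3,)
```

Diagonal peephole weights (a vector U multiplied elementwise with C) are one common convention; the equations above leave U's exact form open.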

Seventh, evolving into the standard version

But as great designers, how could we stop at the simple version! Just as the simple RNN was promoted to the standard RNN, let us promote our design to a standard version: add the hidden layer!

Why add a hidden layer? As mentioned at the beginning of this article, this is the core difference between the simple RNN and the standard RNN, and one of the reasons RNNs and their variants can star in deep learning. Imitating the standard RNN's practice, we directly use hidden units h in place of the final output y:

C(t)=g_{forget}\otimes C(t-1)+g_{in}\otimes \hat C(t)
\hat C(t) = f(W\cdot x(t)+V\cdot h(t-1))
h(t)=g_{out} \otimes f(C(t))
y(t)=h(t)

g_{in}(t) = sigm(W\cdot x(t)+V\cdot h(t-1)+U\cdot C(t-1))
g_{forget}(t) = sigm(W\cdot x(t)+V\cdot h(t-1)+U\cdot C(t-1))
g_{out}(t) = sigm(W\cdot x(t)+V\cdot h(t-1)+U\cdot C(t))

Thanks to @Cheng Yi in the comments for correcting a symbol error in the formulas.

Clearly, since h can be truncated by the output gate at any time, we can intuitively understand h as a short-term memory unit.

From a mathematical point of view, it is short-term memory because the gradient flowing through h follows the chain-multiplied path h(t) -> c(t) -> h(t-1) (until the input or output gate closes), and, as proved earlier, such a multiplicative path makes the gradient explode or vanish. A vanished gradient means the memory is gone; hence h is a short-term memory unit.

By the same reasoning, c is the long-term memory unit: as long as the gradient travels only along the c path, there is no chain of serial multiplications, so vanishing gradients can be avoided; and the forget gate's activation also avoids gradient saturation. Hence c serves as long-term memory.

There, our standard version of the network is also complete! Feeling overloaded with information, and a bit scrambled? Don't worry, Xiao Xi will now walk you through one feedforward pass of the network to tie it all together:

When a new time step t arrives:

1. First, the long-term memory c(t-1) passes through the forget gate g_forget, which erases part of the information.
2. g_forget is controlled by the external input x(t) at the current time, the previous moment's output (short-term memory) h(t-1), and the previous long-term memory c(t-1).
3. Next, the current external input x(t) and the previous short-term memory h(t-1) are combined to compute the new information of the current moment, \hat c(t).
4. The input gate g_in then controls how much of the current new information \hat c(t) is written into the long-term memory unit, producing the new long-term memory c(t).
5. g_in is controlled by x(t), h(t-1), and c(t-1).
6. The long-term memory c(t) is activated, ready to be output.
7. The output gate g_out then selects, from the accumulated memory c(t), the part relevant to the current moment, producing the focused memory h(t), part of which is then output as y(t).
8. g_out is controlled by x(t), h(t-1), and the current long-term memory c(t).
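The eight sub-steps above translate directly into code. Here is a sketch of one standard LSTM step, with the peepholes omitted (matching the figure below) and all parameter names invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One standard LSTM step, following the eight sub-steps in the text.
def lstm_step(x, h_prev, C_prev, p):
    g_forget = sigmoid(p['Wf'] @ x + p['Vf'] @ h_prev)   # steps 1-2: erase some memory
    C_hat    = np.tanh(p['Wc'] @ x + p['Vc'] @ h_prev)   # step 3: new information
    g_in     = sigmoid(p['Wi'] @ x + p['Vi'] @ h_prev)   # step 5: write gate
    C        = g_forget * C_prev + g_in * C_hat          # step 4: new long-term memory
    g_out    = sigmoid(p['Wo'] @ x + p['Vo'] @ h_prev)   # step 8: output gate
    h        = g_out * np.tanh(C)                        # steps 6-7: focused memory
    return h, C                                          # y(t) = h(t)

rng = np.random.default_rng(1)
n_in, n = 4, 3
p = {}
for g in 'fcio':
    p['W' + g] = rng.normal(scale=0.5, size=(n, n_in))
    p['V' + g] = rng.normal(scale=0.5, size=(n, n))
h, C = np.zeros(n), np.zeros(n)
for t in range(10):
    h, C = lstm_step(rng.normal(size=n_in), h, C, p)
print(h.shape, C.shape)   # (3,) (3,)
```

Note that C is carried forward only through the additive update, while h is regenerated each step from the gated readout of C.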

Macroscopically, it looks like this:

[Figure] The full network, macroscopic view. (Peephole connections not drawn.)

That finishes the feedforward pass; leave the backward gradient propagation to the deep-learning framework's automatic differentiation. (Students with masochistic tendencies can try deriving the above process by hand.)

Eighth, giving it a name

Well then, finally, let us summarize the design process of the whole article:

1. To solve the RNN's vanishing-gradient problem and let the gradient propagate losslessly, we came up with the trivially gradient-friendly model c(t) = c(t-1), and called c the "long-term memory unit".

2. Then, to load new information into the long-term memory unit safely and stably, we introduced the "input gate".

3. Then, to solve the activation-saturation problem caused by loading new information too many times, we introduced the "forget gate".

4. Then, to let the network choose the appropriate memory to output, we introduced the "output gate".

5. Then, to solve the problem that a closed output gate cuts the gates off from the memory they should be watching, we introduced the "peephole" connections.

6. Then, upgrading the simple feedback structure into the standard-RNN-style structure that feeds back a fuzzy historical memory, we introduced the hidden units h, and found that the fuzzy historical memory stored in h is short-lived, so h is called the short-term memory unit.

7. Since the network has both long-term and short-term memory, it is simply named the "Long Short-Term Memory neural network (LSTM)"!


