Introduction to Deep Learning (62) Recurrent Neural Network - Bidirectional Recurrent Neural Network

Foreword

The core content comes from blog link 1 and blog link 2; please support the original authors.
This article is a personal record to prevent forgetting.

Recurrent Neural Networks - Bidirectional Recurrent Neural Networks

Courseware

The future is important

I am _____
I am _____ very hungry,
I am _____ very hungry, I could eat half a pig.

I am happy.
I am not very hungry,
I am very very hungry, I could eat half a pig.

  • Depending on the past and future context, very different words can fill in the blank
  • So far, our RNNs only look at the past
  • For fill-in-the-blank tasks, the future context is also available

Bidirectional RNN

(Figure: bidirectional RNN architecture)

  • A forward RNN hidden layer
  • A backward RNN hidden layer
  • Combine the two hidden states to get the output

Inference

(Figure: bidirectional RNN inference)

Summary

  • Bidirectional recurrent neural networks use a hidden layer that is updated in the reverse direction to exploit future temporal information
  • They are usually used to extract sequence features or fill in blanks, not to predict the future

Textbook

In sequence learning, we have so far assumed that the goal is to model the next output given what has been observed so far (for example, in a time series or in a language model). While this is a typical scenario, it is not the only one. What else might we need? Consider the following three tasks of filling in a blank in a text sequence.

I am ___.
I am ___ hungry.
I am ___ hungry, I could eat half a pig.

Depending on the amount of information available, we might fill in the blank with very different words, such as "happy", "not", and "very". Clearly, the end of each phrase (if available) conveys important information about which word to pick, and a sequence model that cannot exploit this will perform poorly on related tasks. For example, to do well in named entity recognition (e.g., to decide whether "Green" refers to "Mr. Green" or to the color green), context ranges of different lengths are equally important. To get some inspiration for solving the problem, let us first take a detour to probabilistic graphical models.

1 Dynamic Programming in Hidden Markov Models

This section serves to illustrate the dynamic programming problem. The specific technical details are not important for understanding the deep learning models, but they help us think about why we use deep learning and why we choose particular architectures.

If we want to solve this problem with a probabilistic graphical model, we could design a latent variable model: at any time step $t$, we assume that there exists some hidden variable $h_t$ that governs the observed emission $x_t$ via the probability $P(x_t \mid h_t)$. Moreover, any transition $h_t \to h_{t+1}$ is given by some state transition probability $P(h_{t+1} \mid h_t)$. This probabilistic graphical model is a hidden Markov model (HMM), as shown in the figure below.
(Figure: hidden Markov model)
Thus, for a sequence of $T$ observations, we have the following joint probability distribution over the observed and hidden states:
$$P(x_1, \ldots, x_T, h_1, \ldots, h_T) = \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t), \text{ where } P(h_1 \mid h_0) = P(h_1).$$
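
To make the joint distribution concrete, here is a minimal sketch that evaluates this product for one particular hidden-state path of a toy HMM; the transition matrix A, emission matrix B, and initial distribution pi0 below are made-up illustrative values, not taken from the text.

import torch

# Toy HMM with k=2 hidden states and 3 observation symbols (assumed values).
# A[i, j] = P(h_{t+1}=j | h_t=i), B[i, x] = P(x_t=x | h_t=i), pi0[i] = P(h_1=i)
A = torch.tensor([[0.7, 0.3],
                  [0.4, 0.6]])
B = torch.tensor([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
pi0 = torch.tensor([0.6, 0.4])

def joint_prob(xs, hs):
    """P(x_1,...,x_T, h_1,...,h_T) = prod_t P(h_t|h_{t-1}) P(x_t|h_t),
    with P(h_1|h_0) = P(h_1)."""
    p = pi0[hs[0]] * B[hs[0], xs[0]]
    for t in range(1, len(xs)):
        p = p * A[hs[t - 1], hs[t]] * B[hs[t], xs[t]]
    return p

print(joint_prob(xs=[0, 2, 1], hs=[0, 1, 1]))  # probability of one specific path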

Now, suppose we observe all of the $x_i$ except $x_j$, and our goal is to compute $P(x_j \mid x_{-j})$, where $x_{-j} = (x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_T)$. Since $P(x_j \mid x_{-j})$ contains no hidden variables, we consider summing over all possible combinations of choices for $h_1, \ldots, h_T$. If any $h_i$ can take on $k$ distinct values (a finite number of states), this means that we need to sum over $k^T$ terms, which is in general an intractable task. Fortunately, there is an elegant solution: dynamic programming.
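
To make the cost concrete, here is a brute-force sketch (using the same assumed toy parameters as in the sketch above) that marginalizes over every one of the $k^T$ hidden-state paths; it is only feasible for tiny $T$.

import itertools
import torch

# Same illustrative toy HMM as above (assumed values): k=2 states, 3 symbols.
A = torch.tensor([[0.7, 0.3],
                  [0.4, 0.6]])
B = torch.tensor([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
pi0 = torch.tensor([0.6, 0.4])
k = A.shape[0]

def brute_force_evidence(xs):
    """P(x_1,...,x_T) by summing the joint over all k^T hidden-state paths."""
    total = torch.tensor(0.)
    for hs in itertools.product(range(k), repeat=len(xs)):   # k^T combinations
        p = pi0[hs[0]] * B[hs[0], xs[0]]
        for t in range(1, len(xs)):
            p = p * A[hs[t - 1], hs[t]] * B[hs[t], xs[t]]
        total += p
    return total

print(brute_force_evidence([0, 2, 1]))  # cost grows as k^T, hence the need for DP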

To see how dynamic programming works, consider summing over the hidden variables $h_1, \ldots, h_T$ one at a time. According to the formula above, this yields:
$$\begin{aligned}
&P(x_1, \ldots, x_T) \\
=& \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\
=& \sum_{h_1, \ldots, h_T} \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \sum_{h_2, \ldots, h_T} \underbrace{\left[\sum_{h_1} P(h_1) P(x_1 \mid h_1) P(h_2 \mid h_1)\right]}_{\pi_2(h_2) \stackrel{\mathrm{def}}{=}} P(x_2 \mid h_2) \prod_{t=3}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \sum_{h_3, \ldots, h_T} \underbrace{\left[\sum_{h_2} \pi_2(h_2) P(x_2 \mid h_2) P(h_3 \mid h_2)\right]}_{\pi_3(h_3) \stackrel{\mathrm{def}}{=}} P(x_3 \mid h_3) \prod_{t=4}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \dots \\
=& \sum_{h_T} \pi_T(h_T) P(x_T \mid h_T).
\end{aligned}$$

In general, we write the forward recursion as:
$$\pi_{t+1}(h_{t+1}) = \sum_{h_t} \pi_t(h_t) P(x_t \mid h_t) P(h_{t+1} \mid h_t).$$
The recursion is initialized as $\pi_1(h_1) = P(h_1)$. In abstract (simplified) form it can be written as $\pi_{t+1} = f(\pi_t, x_t)$, where $f$ is some learnable function. This looks very much like the update equation in the latent variable models we discussed in the context of recurrent neural networks.
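
As an illustration (not from the original text), the following sketch runs this forward recursion on the same assumed toy HMM as above; it should reproduce the brute-force result, but in time linear in $T$.

import torch

# Same illustrative toy HMM as above (assumed values).
A = torch.tensor([[0.7, 0.3],       # A[i, j] = P(h_{t+1}=j | h_t=i)
                  [0.4, 0.6]])
B = torch.tensor([[0.5, 0.4, 0.1],  # B[i, x] = P(x_t=x | h_t=i)
                  [0.1, 0.3, 0.6]])
pi0 = torch.tensor([0.6, 0.4])      # pi_1(h_1) = P(h_1)

def forward_recursion(xs):
    """Return all pi_t vectors and the evidence P(x_1, ..., x_T)."""
    pis = [pi0]
    for t in range(len(xs) - 1):
        # pi_{t+1}(h_{t+1}) = sum_{h_t} pi_t(h_t) P(x_t | h_t) P(h_{t+1} | h_t)
        pis.append((pis[-1] * B[:, xs[t]]) @ A)
    # P(x_1, ..., x_T) = sum_{h_T} pi_T(h_T) P(x_T | h_T)
    return pis, (pis[-1] * B[:, xs[-1]]).sum()

pis, p_x = forward_recursion([0, 2, 1])
print(p_x)  # matches the brute-force sum above, without enumerating k^T paths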

Entirely analogously to the forward recursion, we can also sum over the same set of hidden variables with a backward recursion. This yields:
$$\begin{aligned}
& P(x_1, \ldots, x_T) \\
=& \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\
=& \sum_{h_1, \ldots, h_T} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot P(h_T \mid h_{T-1}) P(x_T \mid h_T) \\
=& \sum_{h_1, \ldots, h_{T-1}} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_T} P(h_T \mid h_{T-1}) P(x_T \mid h_T)\right]}_{\rho_{T-1}(h_{T-1}) \stackrel{\mathrm{def}}{=}} \\
=& \sum_{h_1, \ldots, h_{T-2}} \prod_{t=1}^{T-2} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_{T-1}} P(h_{T-1} \mid h_{T-2}) P(x_{T-1} \mid h_{T-1}) \rho_{T-1}(h_{T-1}) \right]}_{\rho_{T-2}(h_{T-2}) \stackrel{\mathrm{def}}{=}} \\
=& \ldots \\
=& \sum_{h_1} P(h_1) P(x_1 \mid h_1) \rho_1(h_1).
\end{aligned}$$

Thus, we can write the backward recursion as:
$$\rho_{t-1}(h_{t-1}) = \sum_{h_t} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \rho_t(h_t),$$
initialized with $\rho_T(h_T) = 1$. Both the forward and backward recursions allow us to sum over all values of $(h_1, \ldots, h_T)$ for the $T$ hidden variables in $\mathcal{O}(kT)$ (linear rather than exponential) time. This is one of the great benefits of probabilistic inference with graphical models. It is also a very special instance of a general message passing algorithm. Combining the forward and backward recursions, we can compute
$$P(x_j \mid x_{-j}) \propto \sum_{h_j} \pi_j(h_j) \rho_j(h_j) P(x_j \mid h_j).$$
For notational simplicity, the backward recursion can also be written abstractly as $\rho_{t-1} = g(\rho_t, x_t)$, where $g$ is a learnable function. Again, this looks very much like an update equation, just running backwards, unlike what we have seen so far in recurrent neural networks. Indeed, hidden Markov models benefit from knowing future data when it is available.
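
A matching sketch of the backward recursion, and of combining both passes to score the missing observation $P(x_j \mid x_{-j})$ up to normalization, again with the same assumed toy parameters (the helper names forward_backward and blank_scores are illustrative):

import torch

# Same illustrative toy HMM as above (assumed values).
A = torch.tensor([[0.7, 0.3],
                  [0.4, 0.6]])
B = torch.tensor([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
pi0 = torch.tensor([0.6, 0.4])

def forward_backward(xs):
    T = len(xs)
    pis = [pi0]                      # forward: pi_{t+1} = (pi_t * P(x_t|.)) @ A
    for t in range(T - 1):
        pis.append((pis[-1] * B[:, xs[t]]) @ A)
    rhos = [torch.ones(A.shape[0])]  # backward: initialized with rho_T = 1
    for t in range(T - 1, 0, -1):
        # rho_{t-1}(h_{t-1}) = sum_{h_t} P(h_t|h_{t-1}) P(x_t|h_t) rho_t(h_t)
        rhos.insert(0, A @ (B[:, xs[t]] * rhos[0]))
    return pis, rhos                 # pis[j], rhos[j] correspond to position j (0-based)

def blank_scores(xs, j):
    """Scores proportional to P(x_j = x | x_{-j}) for every symbol x."""
    pis, rhos = forward_backward(xs)
    # P(x_j | x_{-j}) is proportional to sum_{h_j} pi_j(h_j) rho_j(h_j) P(x_j | h_j)
    scores = (pis[j] * rhos[j]) @ B
    return scores / scores.sum()

print(blank_scores([0, 2, 1], j=1))  # posterior over the blank at position 1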

2 Bidirectional Model

If we want a mechanism in recurrent neural networks that offers look-ahead ability comparable to hidden Markov models, we need to modify the recurrent neural network design. Fortunately, this is conceptually easy: instead of only running an RNN in forward mode starting from the first token, we add another RNN that runs backwards from the last token. Bidirectional RNNs add a hidden layer that passes information in the backward direction so that such information can be processed more flexibly. The figure below depicts the architecture of a bidirectional recurrent neural network with a single hidden layer.
(Figure: architecture of a bidirectional RNN with a single hidden layer)
In fact, this is not too dissimilar to the forward and backward recursions of dynamic programming in hidden Markov models. The main distinction is that the equations in the HMM have a specific statistical meaning. Bidirectional recurrent neural networks lack such an easy interpretation; we can only treat them as generic, learnable functions. This shift epitomizes a design principle of modern deep networks: first use the type of functional dependencies of classical statistical models, then parameterize them in a generic form.

2.1 Definition

Bidirectional recurrent neural networks were proposed by Schuster and Paliwal (1997). Let us look at the details of such a network.

For any time step $t$, given a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ ($n$: number of examples; $d$: number of inputs in each example), let the hidden layer activation function be $\phi$. In the bidirectional architecture, the forward and backward hidden states for this time step are $\overrightarrow{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$ and $\overleftarrow{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$, respectively, where $h$ is the number of hidden units. The forward and backward hidden state updates are as follows:
$$\begin{aligned}
\overrightarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(f)} + \overrightarrow{\mathbf{H}}_{t-1} \mathbf{W}_{hh}^{(f)} + \mathbf{b}_h^{(f)}),\\
\overleftarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(b)} + \overleftarrow{\mathbf{H}}_{t+1} \mathbf{W}_{hh}^{(b)} + \mathbf{b}_h^{(b)}),
\end{aligned}$$
where the weights $\mathbf{W}_{xh}^{(f)} \in \mathbb{R}^{d \times h}, \mathbf{W}_{hh}^{(f)} \in \mathbb{R}^{h \times h}, \mathbf{W}_{xh}^{(b)} \in \mathbb{R}^{d \times h}, \mathbf{W}_{hh}^{(b)} \in \mathbb{R}^{h \times h}$ and the biases $\mathbf{b}_h^{(f)} \in \mathbb{R}^{1 \times h}, \mathbf{b}_h^{(b)} \in \mathbb{R}^{1 \times h}$ are all model parameters.

Next, the forward hidden state $\overrightarrow{\mathbf{H}}_t$ and the backward hidden state $\overleftarrow{\mathbf{H}}_t$ are concatenated to obtain the hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times 2h}$ that is fed into the output layer. In a deep bidirectional recurrent neural network with multiple hidden layers, this information is passed as input to the next bidirectional layer. Finally, the output layer computes the output $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ ($q$: number of output units):
$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.$$
Here, the weight matrix $\mathbf{W}_{hq} \in \mathbb{R}^{2h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer. In fact, the two directions can have different numbers of hidden units.
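
The following is a minimal from-scratch sketch of this forward computation; the tensor shapes follow the definitions above, while the random parameter values and the helper name birnn_forward are illustrative assumptions rather than part of any library.

import torch

# Illustrative sizes: batch n, inputs d, hidden units h, outputs q, T time steps
n, d, h, q, T = 4, 8, 16, 10, 5
phi = torch.tanh

# Separate parameters for the forward (f) and backward (b) directions
W_xh_f, W_hh_f, b_h_f = torch.randn(d, h) * 0.01, torch.randn(h, h) * 0.01, torch.zeros(h)
W_xh_b, W_hh_b, b_h_b = torch.randn(d, h) * 0.01, torch.randn(h, h) * 0.01, torch.zeros(h)
W_hq, b_q = torch.randn(2 * h, q) * 0.01, torch.zeros(q)

def birnn_forward(X):                      # X has shape (T, n, d)
    H_f = torch.zeros(X.shape[1], h)       # forward hidden state
    H_b = torch.zeros(X.shape[1], h)       # backward hidden state
    fwd, bwd = [], []
    for t in range(X.shape[0]):            # left-to-right pass
        H_f = phi(X[t] @ W_xh_f + H_f @ W_hh_f + b_h_f)
        fwd.append(H_f)
    for t in reversed(range(X.shape[0])):  # right-to-left pass
        H_b = phi(X[t] @ W_xh_b + H_b @ W_hh_b + b_h_b)
        bwd.insert(0, H_b)
    # Concatenate both directions, then apply the output layer at every step
    outputs = [torch.cat((hf, hb), dim=1) @ W_hq + b_q for hf, hb in zip(fwd, bwd)]
    return torch.stack(outputs)            # shape (T, n, q)

print(birnn_forward(torch.randn(T, n, d)).shape)  # torch.Size([5, 4, 10])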

2.2 Computational cost of the model and its application

A key property of bidirectional recurrent neural networks is that they use information from both ends of the sequence to estimate the output. That is, we use information from both past and future observations to predict the current one. But for predicting the next token, such a model is not what we need: when predicting the next token, the future tokens are by definition unavailable, so the model will not achieve good accuracy. Specifically, during training we can use both past and future data to estimate the present blank word, whereas at test time we only have past data, so accuracy suffers. The experiment below illustrates this point.

Another serious problem is that bidirectional recurrent neural networks are exceedingly slow. The main reasons are that forward propagation requires both forward and backward recursions in the bidirectional layers, and that backpropagation depends on the results of forward propagation. Hence the chain of gradient computations is very long.

In practice, bidirectional layers are used only sparingly, for example for filling in missing words, token annotation (e.g., for named entity recognition), and encoding sequences as a step in a sequence-processing pipeline (e.g., for machine translation).
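
For instance, here is a minimal sketch of the feature-extraction use case: a bidirectional LSTM encodes a sequence and a linear head scores each token. The layer sizes and the tagging head are illustrative assumptions, not prescribed by the text.

import torch
from torch import nn

num_inputs, num_hiddens, num_tags = 32, 64, 9    # illustrative sizes
encoder = nn.LSTM(num_inputs, num_hiddens, bidirectional=True)
tag_head = nn.Linear(2 * num_hiddens, num_tags)  # 2*h: both directions are concatenated

X = torch.randn(35, 8, num_inputs)               # (num_steps, batch_size, num_inputs)
features, _ = encoder(X)                         # (num_steps, batch_size, 2*num_hiddens)
tag_scores = tag_head(features)                  # per-token scores, e.g. for entity tagging
print(features.shape, tag_scores.shape)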

3 Misapplication of bidirectional recurrent neural networks

Since bidirectional recurrent neural networks use both past and future data, we cannot blindly apply such a language model to arbitrary prediction tasks. Although the model produces reasonable perplexity, its ability to predict future tokens may be severely flawed. We use the sample code below as a warning against using it in the wrong context.

import torch
from torch import nn
from d2l import torch as d2l

# Load the data
batch_size, num_steps, device = 32, 35, d2l.try_gpu()
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
# Define a bidirectional LSTM model by setting bidirectional=True
vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
num_inputs = vocab_size
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers, bidirectional=True)
model = d2l.RNNModel(lstm_layer, len(vocab))
model = model.to(device)
# Train the model
num_epochs, lr = 500, 1
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)

output:

perplexity 1.1, 109857.9 tokens/sec on cuda:0
time travellerererererererererererererererererererererererererer
travellerererererererererererererererererererererererererer

4 Summary

  • In the bidirectional recurrent neural network, the hidden state of each time step is determined by the data before and after the current time step at the same time.

  • Bidirectional recurrent neural networks have similarities to the "forward-backward" algorithm in probabilistic graphical models.

  • Bidirectional recurrent neural networks are mainly used for sequence encoding and estimation of observations given a bidirectional context.

  • Bidirectional recurrent neural networks are very expensive to train due to the longer gradient chains.


Origin: blog.csdn.net/qq_52358603/article/details/128376751