Natural Language Processing (NLP) Study Notes - RNN Model

Analysis of the RNN Architecture

1. Getting to know the RNN model

1.1. What is the RNN model

RNN (Recurrent Neural Network) generally takes sequence data as input, captures the relationships between the elements of the sequence through the internal structure of the network, and generally produces output in the form of a sequence as well.

General single-layer neural network structure:

RNN single-layer network structure:

The single-layer network structure after expanding the RNN with time steps:

The recurrent mechanism of the RNN allows the hidden-layer result of the previous time step to be used as part of the input at the current time step (in addition to the normal input, the current time step also receives the hidden-layer output of the previous step), so that it influences the output of the current time step.

1.2. The role of the RNN model

Because the RNN structure makes good use of the relationships within a sequence, it is well suited to continuous sequential input that occurs in nature, such as human language and speech, and it is widely used in NLP tasks such as text classification, sentiment analysis, intent recognition, and machine translation.


Below we will conduct a simple analysis with an example of user intent recognition:

  • Step 1: The user enters "What time is it ?". We first perform basic tokenization on it, because the RNN works sequentially and receives only one word at a time.

  • Step 2: First feed the word "What" to the RNN; it produces an output O1.

  • Step 3: Then feed the word "time" to the RNN. At this time step the RNN uses not only "time" but also the output O1 of the previous hidden layer as input, and produces the output O2.

  • Step 4: Repeat this process until all words have been processed.

  • Step 5: Finally, the output O5 of the last hidden layer is passed on for further processing to parse the user's intent (a minimal code sketch of this loop follows below).
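To make the loop above concrete, here is a minimal sketch in PyTorch. The vocabulary, embedding size, hidden size and the three intent classes are all hypothetical; nn.RNNCell is used so that exactly one word is fed per step, as described above.

import torch
import torch.nn as nn

# Hypothetical vocabulary and sizes, only to illustrate the per-word loop
vocab = {"What": 0, "time": 1, "is": 2, "it": 3, "?": 4}
embedding = nn.Embedding(len(vocab), 8)     # each word -> an 8-dimensional vector
cell = nn.RNNCell(8, 16)                    # one RNN time step: (input, hidden) -> new hidden
intent_head = nn.Linear(16, 3)              # e.g. 3 hypothetical intent classes

h = torch.zeros(1, 16)                      # initial hidden state
for word in ["What", "time", "is", "it", "?"]:
    x = embedding(torch.tensor([vocab[word]]))  # current word as input
    h = cell(x, h)                              # O1..O5: each output is also fed to the next step
intent_scores = intent_head(h)                  # the last output is used to parse the intent
print(intent_scores.shape)                      # torch.Size([1, 3])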


1.3. Classification of RNN models

Here we will classify RNN models from two perspectives. The first perspective is the structure of input and output, and the second perspective is the internal structure of RNN.

Classification according to the structure of the input and output:

  • N vs N - RNN
  • N vs 1 - RNN
  • 1 vs N - RNN
  • N vs M - RNN

Classified according to the internal structure of RNN:

  • Traditional RNN
  • LSTM
  • Bi-LSTM
  • GRU
  • Bi-GRU

N vs N - RNN:

This is the most basic structural form of RNN; its defining feature is that the input and output sequences have equal length. Because of this restriction, its range of application is relatively small; it can be used, for example, to generate verses whose lines have equal length.


N vs 1 - RNN :

Sometimes the input of the problem we need to handle is a sequence, but the required output is a single value rather than a sequence. How should we model this? We only need to apply a linear transformation to the output h of the last hidden time step; in most cases, to make the result easier to interpret, we also apply sigmoid or softmax. This structure is often used for text classification problems.
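As a rough sketch of this N vs 1 structure (all sizes below are hypothetical), we can take the hidden state of the last time step, apply a linear layer and then softmax:

import torch
import torch.nn as nn

rnn = nn.RNN(8, 16)                      # input_size=8, hidden_size=16 (hypothetical)
classifier = nn.Linear(16, 2)            # e.g. 2 classes for binary text classification

x = torch.randn(10, 4, 8)                # (sequence_length, batch_size, input_size)
output, hn = rnn(x)                      # hn holds the hidden state of the last time step
logits = classifier(hn[-1])              # linear transformation of the last hidden output h
probs = torch.softmax(logits, dim=-1)    # softmax to obtain a clearer class distribution
print(probs.shape)                       # torch.Size([4, 2])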


1 vs N - RNN :

What if the input is not a sequence but the output is? The most common approach is to let the single input act on every output step. This structure can be used for tasks such as generating text from an image.


N vs M - RNN: 

This is an RNN structure without restrictions on the input and output lengths. It consists of two parts, an encoder and a decoder, each of which is internally some type of RNN; it is also called the seq2seq architecture. The input data first passes through the encoder, which finally outputs a hidden variable c. The most common practice is to let this hidden variable c act on every decoding step of the decoder, so that the input information is used effectively.
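A minimal sketch of this idea (not a full seq2seq model; module names and sizes are illustrative, and teacher forcing is assumed for the decoder inputs) could look like this, with the encoder's final hidden variable c concatenated to the decoder input at every step:

import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid_dim)
        self.decoder = nn.GRU(out_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, out_dim)

    def forward(self, src, tgt):
        # src: (src_len, batch, in_dim), tgt: (tgt_len, batch, out_dim)
        _, c = self.encoder(src)               # c: (1, batch, hid_dim), the hidden variable
        ctx = c.expand(tgt.size(0), -1, -1)    # let c act on every decoding step
        dec_out, _ = self.decoder(torch.cat([tgt, ctx], dim=2), c)
        return self.out(dec_out)               # (tgt_len, batch, out_dim)

model = TinySeq2Seq(8, 16, 10)
y = model(torch.randn(5, 3, 8), torch.randn(7, 3, 10))
print(y.shape)                                 # torch.Size([7, 3, 10])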

The seq2seq architecture was first proposed for machine translation. Because its input and output lengths are unrestricted, it is now the most widely used RNN structure and has been applied in many fields such as machine translation, reading comprehension, and text summarization.


2. Traditional RNN model

2.1. Internal structure diagram of traditional RNN

Structural Explanation Diagram:

Internal structure analysis:

We focus on the square block in the middle. It has two inputs, h(t-1) and x(t), which represent the hidden-layer output of the previous time step and the input of the current time step. After entering the RNN structure they are "fused" together: according to the structure diagram, this fusion means concatenating the two into a new tensor [x(t), h(t-1)]. This new tensor then passes through a fully connected (linear) layer that uses tanh as its activation function, producing the output h(t) of this time step, which in turn enters the structure together with x(t+1) as input to the next time step, and so on.

Internal structure process demo:

According to the structural analysis, the internal calculation formula is obtained: 
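h(t) = tanh(W · [x(t), h(t-1)] + b)

where [x(t), h(t-1)] is the concatenated input described above, and W and b are the weight and bias of the fully connected layer.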

The role of the activation function tanh:

It helps regulate the values flowing through the network; the tanh function compresses values to between -1 and 1.

Using the traditional RNN tool in PyTorch:

Location: In the torch.nn toolkit, callable via torch.nn.RNN.

nn.RNN class initialization main parameter explanation:

  • input_size: The size of the feature dimension in the input tensor x.
  • hidden_size: The size of the feature dimension in the hidden layer tensor h.
  • num_layers: The number of hidden layers.
  • nonlinearity: The choice of activation function, the default is tanh.

nn.RNN class instantiation object main parameter explanation:

  • input: input tensor x.
  • h0: Initialized hidden layer tensor h.

Example of nn.RNN usage:

# Import the required packages
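# nn.RNN parameter meaning: (input_size, hidden_size, num_layers)
# Input tensor dimensions: (sequence_length, batch_size, input_size)
# h0 dimensions: (num_layers * num_directions, batch_size, hidden_size)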
>>> import torch
>>> import torch.nn as nn
>>> rnn = nn.RNN(5, 6, 1)
>>> input = torch.randn(1, 3, 5)
>>> h0 = torch.randn(1, 3, 6)
>>> output, hn = rnn(input, h0)
>>> output
tensor([[[ 0.4282, -0.8475, -0.0685, -0.4601, -0.8357,  0.1252],
         [ 0.5758, -0.2823,  0.4822, -0.4485, -0.7362,  0.0084],
         [ 0.9224, -0.7479, -0.3682, -0.5662, -0.9637,  0.4938]]],
       grad_fn=<StackBackward>)

>>> hn
tensor([[[ 0.4282, -0.8475, -0.0685, -0.4601, -0.8357,  0.1252],
         [ 0.5758, -0.2823,  0.4822, -0.4485, -0.7362,  0.0084],
         [ 0.9224, -0.7479, -0.3682, -0.5662, -0.9637,  0.4938]]],
       grad_fn=<StackBackward>)

Advantages of traditional RNN:

Because of its simple internal structure and low demands on computing resources, the traditional RNN has far fewer parameters in total than the variants we will study later (LSTM and GRU), and it performs well on short-sequence tasks.

Disadvantages of traditional RNN:

Practice has shown that the traditional RNN performs very poorly when it has to capture associations across long sequences. The reason is that during backpropagation, an overly long sequence leads to abnormal gradient computation, namely vanishing or exploding gradients.

What are vanishing and exploding gradients?

According to the backpropagation algorithm and the chain rule, the calculation of the gradient can be simplified to the following formula:
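In simplified form, the gradient through n time steps is a repeated product of factors of the kind

Dn = σ'(z1)·w1 · σ'(z2)·w2 · ... · σ'(zn)·wn

where σ' denotes the derivative of the activation function (sigmoid in this illustration) and the w are the recurrent weights.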

The derivative of the sigmoid function is bounded within [0, 0.25]. If the values of w in the formula are also less than 1, then after multiplying many such factors the final gradient becomes vanishingly small; this phenomenon is called the vanishing gradient. Conversely, if we artificially increase w so that it is greater than 1, the repeated multiplication may make the gradient excessively large, which is called the exploding gradient.

Hazards of vanishing or exploding gradients:

If gradients vanish during training, the weights cannot be updated and training eventually fails. If gradients explode, the gradients become too large and the network parameters receive huge updates; in extreme cases the results overflow (NaN values).


3. LSTM model

LSTM (Long Short-Term Memory), also known as the long short-term memory structure, is a variant of the traditional RNN. Compared with the classic RNN, it can effectively capture semantic associations across long sequences and alleviate vanishing or exploding gradients. At the same time, the structure of LSTM is more complex; its core structure can be divided into four parts:

  • forget gate
  • input gate
  • cell state
  • output gate

3.1. Internal structure diagram of LSTM

Structure Explanation Diagram:

Structure diagram and calculation formula of the forget gate:

Forget gate structure analysis:

This is very similar to the internal computation of the traditional RNN. First, the input x(t) of the current time step is concatenated with the hidden state h(t-1) of the previous time step to obtain [x(t), h(t-1)]. This is then transformed by a fully connected layer and finally activated by the sigmoid function to obtain f(t). We can regard f(t) as a gate value, like the degree to which a door is open or closed; the gate value scales the tensor that passes through the gate. The gate value of the forget gate acts on the cell state of the previous time step and represents how much past information is forgotten. Because this gate value is computed from x(t) and h(t-1), the whole formula means: based on the input of the current time step and the hidden state h(t-1) of the previous time step, decide how much of the past information carried by the previous cell state should be forgotten.
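Written as a formula, the forget gate computation described above is

f(t) = σ(W_f · [x(t), h(t-1)] + b_f)

where σ denotes the sigmoid function and W_f, b_f are the weights and bias of the forget gate's fully connected layer.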

Demonstration of the internal structure of the forget gate:

The role of the activation function sigmoid:

It helps regulate the values flowing through the network; the sigmoid function compresses values to between 0 and 1.


The structure diagram and calculation formula of the input gate part:

Input gate structure analysis:

We can see that the input gate has two calculation formulas. The first generates the gate value of the input gate; it is almost identical to the forget gate formula, differing only in the target that the gate value later acts on: the input gate's value determines how much of the current input information needs to be filtered. The second formula of the input gate is the same as the internal computation of the traditional RNN; for LSTM, however, its result is the candidate (not yet updated) cell state rather than the hidden state as in the classic RNN.
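Written as formulas, the two computations described above are

i(t) = σ(W_i · [x(t), h(t-1)] + b_i)
C̃(t) = tanh(W_C · [x(t), h(t-1)] + b_C)

where i(t) is the input gate value and C̃(t) is the candidate (not yet updated) cell state.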

The process demonstration of the internal structure of the input gate:

Cell state update diagram and calculation formula:

Cell State Update Analysis:

The structure and formula of the cell state update are easy to understand. There is no fully connected layer here: we simply multiply the forget gate value just obtained by the cell state C(t-1) from the previous time step, and add to it the product of the input gate value and the candidate (not yet updated) cell state obtained at the current time step. The result is the updated C(t), which becomes part of the input of the next time step. The whole cell state update process is thus an application of the forget gate and the input gate.
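Written as a formula, the cell state update described above is

C(t) = f(t) * C(t-1) + i(t) * C̃(t)

where * denotes element-wise multiplication.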

Demonstration of the cell state update process:


The structure diagram and calculation formula of the output gate part:

Output gate structure analysis:

The output gate also has two formulas. The first computes the gate value of the output gate, in the same way as the forget gate and the input gate. The second uses this gate value to produce the hidden state h(t): the gate value acts on the tanh activation of the updated cell state C(t), and the resulting h(t) becomes part of the input of the next time step. The whole output gate process therefore serves to produce the hidden state h(t).
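Written as formulas, the two computations described above are

o(t) = σ(W_o · [x(t), h(t-1)] + b_o)
h(t) = o(t) * tanh(C(t))

where o(t) is the output gate value.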

Demonstration of the internal structure process of the output gate:


What is Bi-LSTM?

Bi-LSTM is bidirectional LSTM. It does not change the internal structure of LSTM itself; it simply applies the LSTM twice, in opposite directions, and then concatenates the two LSTM results as the final output.

Bi-LSTM structure analysis:

We can see in the figure that the sentence "I love China", i.e. the input sequence, is processed by the LSTM twice, once from left to right and once from right to left, and the resulting tensors are concatenated as the final output. This structure can capture specific preceding or following features in language syntax and strengthen semantic associations, but it also doubles the number of model parameters and the computational cost. Generally, one should evaluate the corpus and the available computing resources before deciding whether to use this structure.

Using the LSTM tool in PyTorch:

Location: In the torch.nn toolkit, callable via torch.nn.LSTM.

nn.LSTM class initialization main parameter explanation:

  • input_size: The size of the feature dimension in the input tensor x.
  • hidden_size: The size of the feature dimension in the hidden layer tensor h.
  • num_layers: The number of hidden layers.
  • bidirectional: Whether to use a bidirectional LSTM; it is used if set to True, and not used by default.

Explanation of the main parameters of the nn.LSTM class instantiation object:

  • input: input tensor x.
  • h0: Initialized hidden layer tensor h.
  • c0: initialized cell state tensor c.

nn.LSTM usage example:

# Meaning of the LSTM parameters: (input_size, hidden_size, num_layers)
# Meaning of the input tensor dimensions: (sequence_length, batch_size, input_size)
# Meaning of the dimensions of the initial hidden state tensor and the initial cell state tensor:
# (num_layers * num_directions, batch_size, hidden_size)

>>> import torch.nn as nn
>>> import torch
>>> rnn = nn.LSTM(5, 6, 2)
>>> input = torch.randn(1, 3, 5)
>>> h0 = torch.randn(2, 3, 6)
>>> c0 = torch.randn(2, 3, 6)
>>> output, (hn, cn) = rnn(input, (h0, c0))
>>> output
tensor([[[ 0.0447, -0.0335,  0.1454,  0.0438,  0.0865,  0.0416],
         [ 0.0105,  0.1923,  0.5507, -0.1742,  0.1569, -0.0548],
         [-0.1186,  0.1835, -0.0022, -0.1388, -0.0877, -0.4007]]],
       grad_fn=<StackBackward>)
>>> hn
tensor([[[ 0.4647, -0.2364,  0.0645, -0.3996, -0.0500, -0.0152],
         [ 0.3852,  0.0704,  0.2103, -0.2524,  0.0243,  0.0477],
         [ 0.2571,  0.0608,  0.2322,  0.1815, -0.0513, -0.0291]],

        [[ 0.0447, -0.0335,  0.1454,  0.0438,  0.0865,  0.0416],
         [ 0.0105,  0.1923,  0.5507, -0.1742,  0.1569, -0.0548],
         [-0.1186,  0.1835, -0.0022, -0.1388, -0.0877, -0.4007]]],
       grad_fn=<StackBackward>)
>>> cn
tensor([[[ 0.8083, -0.5500,  0.1009, -0.5806, -0.0668, -0.1161],
         [ 0.7438,  0.0957,  0.5509, -0.7725,  0.0824,  0.0626],
         [ 0.3131,  0.0920,  0.8359,  0.9187, -0.4826, -0.0717]],

        [[ 0.1240, -0.0526,  0.3035,  0.1099,  0.5915,  0.0828],
         [ 0.0203,  0.8367,  0.9832, -0.4454,  0.3917, -0.1983],
         [-0.2976,  0.7764, -0.0074, -0.1965, -0.1343, -0.6683]]],
       grad_fn=<StackBackward>)

Advantages of LSTMs:

The gate structure of LSTM can effectively mitigate the vanishing or exploding gradients that may occur in long-sequence problems. Although it cannot eliminate the phenomenon entirely, it performs better than the traditional RNN on longer sequences.

Disadvantages of LSTM:

Due to the relatively complex internal structure, the training efficiency is much lower than that of traditional RNN under the same computing power.

4. GRU model

GRU (Gated Recurrent Unit), also called the gated recurrent unit structure, is another variant of the traditional RNN. Like LSTM, it can effectively capture semantic associations across long sequences and alleviate vanishing or exploding gradients. At the same time, its structure and computation are simpler than LSTM's. Its core structure can be divided into two parts:

  • update gate
  • reset gate

4.1. The internal structure diagram and calculation formula of GRU

Structure Explanation Diagram:

GRU update gate and reset gate structure diagram

Internal structure analysis:

As with the gating in LSTM analysed above, we first compute the gate values of the update gate and the reset gate, z(t) and r(t): in both cases, x(t) is concatenated with h(t-1), linearly transformed, and then activated by sigmoid. The reset gate value then acts on h(t-1), indicating how much information from the previous time step may be used. This reset h(t-1) is used in a basic RNN computation: it is concatenated with x(t), linearly transformed, and activated by tanh to obtain a new candidate h(t). Finally, the update gate value acts on the new candidate h(t) while (1 - gate value) acts on h(t-1); adding the two gives the final hidden state output h(t). This means the update gate can retain or discard previous results: when the gate value tends to 1 the output is the new h(t), and when it tends to 0 the output is the h(t-1) of the previous time step.
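Written as formulas (bias terms omitted for brevity), the computations described above are

z(t) = σ(W_z · [x(t), h(t-1)])
r(t) = σ(W_r · [x(t), h(t-1)])
h̃(t) = tanh(W · [x(t), r(t) * h(t-1)])
h(t) = z(t) * h̃(t) + (1 - z(t)) * h(t-1)

where z(t) and r(t) are the update and reset gate values, h̃(t) is the candidate hidden state, and * denotes element-wise multiplication.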

Bi-GRU follows the same logic as Bi-LSTM: it does not change the internal structure, but applies the model twice in opposite directions and concatenates the two results as the final output. For details, see the Bi-LSTM section above.

Using the GRU tool in PyTorch:

Location: In the torch.nn toolkit, callable via torch.nn.GRU.

nn.GRU class initialization main parameter explanation:

  • input_size: The size of the feature dimension in the input tensor x.
  • hidden_size: The size of the feature dimension in the hidden layer tensor h.
  • num_layers: The number of hidden layers.
  • bidirectional: Whether to use a bidirectional GRU; it is used if set to True, and not used by default.

Explanation of the main parameters of the nn.GRU class instantiation object:

  • input: input tensor x.
  • h0: Initialized hidden layer tensor h.

Example of nn.GRU usage:

>>> import torch
>>> import torch.nn as nn
>>> rnn = nn.GRU(5, 6, 2)
>>> input = torch.randn(1, 3, 5)
>>> h0 = torch.randn(2, 3, 6)
>>> output, hn = rnn(input, h0)
>>> output
tensor([[[-0.2097, -2.2225,  0.6204, -0.1745, -0.1749, -0.0460],
         [-0.3820,  0.0465, -0.4798,  0.6837, -0.7894,  0.5173],
         [-0.0184, -0.2758,  1.2482,  0.5514, -0.9165, -0.6667]]],
       grad_fn=<StackBackward>)
>>> hn
tensor([[[ 0.6578, -0.4226, -0.2129, -0.3785,  0.5070,  0.4338],
         [-0.5072,  0.5948,  0.8083,  0.4618,  0.1629, -0.1591],
         [ 0.2430, -0.4981,  0.3846, -0.4252,  0.7191,  0.5420]],

        [[-0.2097, -2.2225,  0.6204, -0.1745, -0.1749, -0.0460],
         [-0.3820,  0.0465, -0.4798,  0.6837, -0.7894,  0.5173],
         [-0.0184, -0.2758,  1.2482,  0.5514, -0.9165, -0.6667]]],
       grad_fn=<StackBackward>)

Advantages of GRU:

GRU and LSTM have similar effects: when capturing long-sequence semantic associations they can effectively suppress vanishing or exploding gradients, performing better than the traditional RNN, and GRU's computational complexity is lower than LSTM's.

Disadvantages of GRU:

GRU still cannot completely solve the vanishing-gradient problem. At the same time, as an RNN variant it inherits a major drawback of the RNN structure itself: it cannot be computed in parallel. As data volumes and model sizes keep growing, this becomes a key bottleneck for the future development of RNNs.

5. Attention mechanism

5.1. What is attention?

When we observe something, the reason we can quickly judge what it is (allowing, of course, for the judgment to be wrong) is that our brain can quickly focus on the most recognizable parts of the thing, rather than having to observe it from beginning to end before reaching a conclusion. The attention mechanism arose from this idea.

What is an attention calculation rule?

It requires three specified inputs: Q (query), K (key) and V (value). The attention result is then obtained through a calculation formula and represents the attention representation of the query under the action of the key and value. When the inputs satisfy Q = K = V, the rule is called a self-attention calculation rule.

Common attention calculation rules:

  • Concatenate Q and K along the feature dimension, apply a linear transformation, process the result with softmax, and finally perform a tensor multiplication with V.

  • Concatenate Q and K along the feature dimension, apply a linear transformation followed by tanh activation, sum the result internally, process it with softmax, and finally perform a tensor multiplication with V.

  • Take the dot product of Q with the transpose of K, divide by a scaling factor, apply softmax to obtain the result, and finally perform a tensor multiplication with V (see the formula below).
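The third rule, written as a formula, is the scaled dot-product attention used in the Transformer:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

where d_k is the feature dimension of K and √d_k is the scaling factor.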

Note: when the attention weight matrix and V are both three-dimensional tensors whose first dimension is the batch size, the bmm operation is used; bmm is a batched matrix multiplication operation.

Demonstration of bmm operation:

# If argument 1 has shape (b × n × m) and argument 2 has shape (b × m × p), the output has shape (b × n × p)
>>> import torch
>>> input = torch.randn(10, 3, 4)
>>> mat2 = torch.randn(10, 4, 5)
>>> res = torch.bmm(input, mat2)
>>> res.size()
torch.Size([10, 3, 5])

5.2. What is the attention mechanism

The attention mechanism is the deep-learning network component that carries the attention calculation rules; it includes the necessary fully connected layers and related tensor processing so that it can be integrated with the network it is applied to. An attention mechanism that uses a self-attention calculation rule is called a self-attention mechanism.

Note: in the NLP field, most attention mechanisms are currently applied within the seq2seq architecture, i.e. encoder-decoder models.

5.3. The role of the attention mechanism

Attention mechanism on the decoder side: it allows the decoder to focus effectively on the encoder's output according to the model objective, improving the result when that output is used as the decoder's input. This remedies the earlier situation in which the encoder output was a single fixed-length tensor that could not store very much information.

Attention mechanism on the encoder side: it mainly solves the representation problem; it is equivalent to a feature-extraction process and produces an attention representation of the input. Self-attention is generally used here.

5.4. Implementation steps of attention mechanism

  • Step 1: Apply the attention calculation rule to Q, K and V.
  • Step 2: Depending on the method used in step 1: if it is a concatenation-based method, Q must be concatenated with the result of step 1; if it is the transposed dot-product method, which is generally self-attention with Q and V identical, there is no need to concatenate with Q.
  • Step 3: Finally, so that the whole attention mechanism outputs a tensor of the specified size, apply a linear layer to the result of step 2 to obtain the final attention representation of Q.

Code analysis of common attention mechanisms:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Attn(nn.Module):
    def __init__(self, query_size, key_size, value_size1, value_size2, output_size):
        """The initialization function takes 5 parameters: query_size is the size of the
           last dimension of query, key_size is the size of the last dimension of key,
           value_size1 is the size of the second-to-last dimension of value,
           value_size2 is the size of the last dimension of value,
           i.e. value = (1, value_size1, value_size2),
           and output_size is the size of the last dimension of the output."""
        super(Attn, self).__init__()
        # Store the parameters on the instance
        self.query_size = query_size
        self.key_size = key_size
        self.value_size1 = value_size1
        self.value_size2 = value_size2
        self.output_size = output_size

        # Linear layer needed for step 1 of the attention mechanism.
        self.attn = nn.Linear(self.query_size + self.key_size, value_size1)

        # Linear layer needed for step 3 of the attention mechanism.
        self.attn_combine = nn.Linear(self.query_size + value_size2, output_size)


    def forward(self, Q, K, V):
        """The forward function takes three inputs, Q, K and V. Following common practice
           for model training, the tensors fed to the attention mechanism are assumed here
           to be three-dimensional."""

        # Step 1: compute according to the chosen calculation rule,
        # using the first (most common) rule:
        # concatenate Q and K along the feature dimension, apply a linear
        # transformation, and use softmax to obtain the attention weights.
        attn_weights = F.softmax(
            self.attn(torch.cat((Q[0], K[0]), 1)), dim=1)

        # Second half of step 1: multiply the resulting weight matrix with V.
        # When both are 3-D tensors whose first dimension is the batch size, use bmm;
        # unsqueeze(0) expands the 2-D weight tensor to 3-D.
        attn_applied = torch.bmm(attn_weights.unsqueeze(0), V)

        # Step 2: indexing with [0] reduces the dimension; because the first
        # calculation rule was used, Q must be concatenated with the result of step 1.
        output = torch.cat((Q[0], attn_applied[0]), 1)

        # Step 3: apply a linear layer to the result of step 2 to perform a linear
        # transformation, then use unsqueeze(0) to expand the dimension so that the
        # output is again a three-dimensional tensor.
        output = self.attn_combine(output).unsqueeze(0)
        return output, attn_weights

Invocation:

query_size = 32
key_size = 32
value_size1 = 32
value_size2 = 64
output_size = 64
attn = Attn(query_size, key_size, value_size1, value_size2, output_size)
Q = torch.randn(1,1,32)
K = torch.randn(1,1,32)
V = torch.randn(1,32,64)
out = attn(Q, K, V)
print(out[0])
print(out[1])

Output:

tensor([[[ 0.4477, -0.0500, -0.2277, -0.3168, -0.4096, -0.5982,  0.1548,
          -0.0771, -0.0951,  0.1833,  0.3128,  0.1260,  0.4420,  0.0495,
          -0.7774, -0.0995,  0.2629,  0.4957,  1.0922,  0.1428,  0.3024,
          -0.2646, -0.0265,  0.0632,  0.3951,  0.1583,  0.1130,  0.5500,
          -0.1887, -0.2816, -0.3800, -0.5741,  0.1342,  0.0244, -0.2217,
           0.1544,  0.1865, -0.2019,  0.4090, -0.4762,  0.3677, -0.2553,
          -0.5199,  0.2290, -0.4407,  0.0663, -0.0182, -0.2168,  0.0913,
          -0.2340,  0.1924, -0.3687,  0.1508,  0.3618, -0.0113,  0.2864,
          -0.1929, -0.6821,  0.0951,  0.1335,  0.3560, -0.3215,  0.6461,
           0.1532]]], grad_fn=<UnsqueezeBackward0>)


tensor([[0.0395, 0.0342, 0.0200, 0.0471, 0.0177, 0.0209, 0.0244, 0.0465, 0.0346,
         0.0378, 0.0282, 0.0214, 0.0135, 0.0419, 0.0926, 0.0123, 0.0177, 0.0187,
         0.0166, 0.0225, 0.0234, 0.0284, 0.0151, 0.0239, 0.0132, 0.0439, 0.0507,
         0.0419, 0.0352, 0.0392, 0.0546, 0.0224]], grad_fn=<SoftmaxBackward>)
