Deep Learning Series Notes 09: Recurrent Neural Network (RNN)

1. Sequence data and models

1.1 Sequence data

  • Much real-world data has a time-series structure; movie ratings, for example, are neither fixed nor random but change over time
  • In statistics, making predictions beyond known observations is called extrapolation, and making estimates between existing observations is called interpolation.


Sequence data calls for a new deep network architecture, the RNN.
Unlike the data used earlier for image classification and recognition, the data here includes a time dimension, and the observations are correlated with one another.

1.2 Sequence model - autoregressive model and latent variable autoregressive model


1. Autoregressive models

  • In practice the full history x(t-1), …, x1 may be unnecessarily long, so we only consider a window of length τ, i.e. we use the observation sequence x(t-1), …, x(t-τ)
  • The advantage of this is that for t > τ the number of parameters stays fixed
  • It is called an autoregressive model because the model performs regression on itself
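As a brief formal sketch (notation mine, following the description above), the τ-step autoregressive model estimates

$$\hat{x}_t \sim P(x_t \mid x_{t-1}, \ldots, x_{t-\tau}),$$

so the conditioning set, and with it the number of parameters, stays fixed once $t > \tau$.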

2. Latent autoregressive models

  • A latent-variable autoregressive model keeps a summary h_t of the past observations, and updates the prediction x̂_t and the summary h_t at the same time
  • x̂_t is estimated from P(x_t | h_t), and the summary is updated as h_t = g(h(t-1), x(t-1))
  • Because h_t is never observed, this type of model is called a latent-variable autoregressive model


1.3 Sequence Model - Hidden Markov Model

In the autoregressive approximation we estimate x_t from x(t-1), …, x(t-τ) instead of the full history x(t-1), …, x1. If this approximation is exact, the sequence is said to satisfy the Markov condition.

If τ = 1, we obtain a first-order Markov model.


  • Assume that the current data point depends only on the τ most recent data points: every time a new value is predicted, only the past τ observations are consulted. τ can be chosen freely; a smaller τ gives a simpler model and a larger τ a more complex one. The advantage of this assumption is that τ is fixed and does not grow over time, which also matches intuition: the further a past event lies from the event being predicted, the weaker the correlation between them.
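For reference, the factorization implied by the first-order Markov assumption (τ = 1) is the standard one:

$$P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}), \qquad P(x_1 \mid x_0) = P(x_1).$$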

1.4 Causality

Sequence data has a natural direction: causality runs forward in time, so future events cannot influence the past, and it is usually easier to predict the next observation from past ones than to reason backwards from the future.

1.5 Summary

  • Sequence data is temporally correlated, so interpolating within observed data is usually easier than extrapolating beyond it.
  • An autoregressive model regresses x_t on a fixed window of the τ most recent observations, which keeps the number of parameters fixed (the Markov assumption).
  • A latent-variable autoregressive model instead summarizes the whole past in a hidden state h_t and updates it together with the prediction.

2. Text preprocessing

The core task of text preprocessing is to convert the words in raw text into training samples.

In this section, we will go through common preprocessing steps for parsing text. These steps usually include:

  • 1. Load the text into memory as a string .
  • 2. Split the string into tokens (such as words and characters) .
  • 3. Build a vocabulary that maps split tokens to numeric indices.
  • 4. Convert the text into sequences of numeric indices so that the model can operate on it easily.

2.1 Read the dataset

import collections
import re
from d2l import torch as d2l
d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt', '090b5e7e70c295757f55df93cb0a180b9691891a')

def read_time_machine():
    """Load the time machine dataset into a list of text lines."""
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    # Replace non-letter characters with spaces and lowercase everything
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]

lines = read_time_machine()

2.2 Tokenization

Each text line is split into a list of tokens; a token is the basic unit of text. The function returns a list of token lists, where each token is a string.

Tokenization is a common operation in NLP: splitting a sentence or a piece of text into tokens (words or characters represented as strings).

def tokenize(lines, token='word'):  # takes the list of text lines as input
    """Split text lines into word or character tokens."""
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        print('ERROR: unknown token type: ' + token)

tokens = tokenize(lines)
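A quick usage illustration (the example strings below are mine, not from the dataset); the results follow directly from line.split() and list(line):

tokenize(['the time machine'], token='word')  # -> [['the', 'time', 'machine']]
tokenize(['abc'], token='char')               # -> [['a', 'b', 'c']]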

2.3 Create vocabulary

Tokens are strings, but the model needs numbers as input, so the string type is inconvenient for the model to use. We therefore build a dictionary, usually called a vocabulary, that maps each string token to a numeric index starting from 0.
Mapping the split tokens to numeric indices converts the text into index sequences that the model can manipulate easily.

class Vocab:
    """Vocabulary for text."""
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []
        # Sort tokens by frequency of occurrence
        counter = count_corpus(tokens)
        self._token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                   reverse=True)
        # The unknown token '<unk>' gets index 0
        self.idx_to_token = ['<unk>'] + reserved_tokens
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}
        for token, freq in self._token_freqs:
            if freq < min_freq:
                break
            if token not in self.token_to_idx:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

    @property
    def unk(self):  # the unknown token has index 0
        return 0

    @property
    def token_freqs(self):
        return self._token_freqs

def count_corpus(tokens):  #@save
    """Count token frequencies."""
    # Here `tokens` is a 1D list or a 2D list
    if len(tokens) == 0 or isinstance(tokens[0], list):
        # Flatten a list of token lists into a single list
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)
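A minimal follow-up sketch (assumed usage in the d2l style, not shown above) that builds the vocabulary and carries out step 4 of the pipeline, converting the tokenized text into a flat sequence of numeric indices:

vocab = Vocab(tokens)
# Flatten the token lists and map every token to its index
corpus = [vocab[token] for line in tokens for token in line]
print(len(vocab))    # vocabulary size, including '<unk>'
print(corpus[:10])   # the first few token indices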

2.4 Summary

  • Text is one of the most common forms of sequence data.
  • To preprocess text, we typically split the text into tokens, build a vocabulary, map token strings to numeric indices, and convert text data into token indices for model manipulation.

3. Recurrent Neural Networks

What is an RNN model?
An RNN (Recurrent Neural Network) generally takes sequence data as input, captures the dependencies between elements of the sequence through the internal structure of the network, and usually produces a sequence as output as well.

3.1 Introduction to RNNs


  • The recurrence mechanism of the RNN feeds the hidden-layer result of the previous time step back in as part of the input at the current time step (so the current input consists of the normal input plus the previous step's hidden-layer output), allowing it to influence the output at the current step.
  • The role of the RNN model:
    Because the RNN structure makes good use of the relationships within a sequence, it can handle naturally continuous input sequences such as human language and speech, and it is widely used for NLP tasks such as text classification, sentiment analysis, intent recognition, and machine translation.

For example, feed the sentence "What time is it?" to the machine word by word.
After four recurrent steps, the final output O5 is analyzed to determine what the user asked.

3.2 Classification of RNN models

① Classify according to input and output structure

  • N vs N - RNN
  • N vs 1 - RNN
  • 1 vs N - RNN
  • N vs M - RNN

N vs N - RNN
This is the most basic structural form of RNN. Its defining feature is that the input and output sequences have equal length. Because of this restriction its scope of application is relatively narrow; it can be used, for example, to generate matching verses of equal length.
N vs 1 - RNN
Sometimes the input is a sequence but the required output is a single value rather than a sequence. We only need to apply a linear transformation to the output h of the last hidden step; in most cases a sigmoid or softmax is then applied to make the result easier to interpret. This structure is often used for text classification.
1 vs N - RNN
How does an RNN handle the case where the input is not a sequence but the output is? The most common approach is to let that single input act on every output step. This structure can be used, for example, to generate text descriptions from images.
N vs M - RNN
This RNN structure places no restriction on the input and output lengths. It consists of two parts, an encoder and a decoder, each of which is internally some kind of RNN; it is also known as the seq2seq architecture. The input first passes through the encoder, which outputs a hidden (context) variable c; the most common method is then to let c act on each decoding step of the decoder, so that the input information is used effectively.
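As a hedged illustration of the encoder–decoder idea (the class names and sizes are my own placeholders, and this sketch uses the simpler variant in which c initializes the decoder's hidden state rather than acting on every decoding step):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size)
    def forward(self, x):            # x: (src_len, batch, input_size)
        _, c = self.rnn(x)           # c: (1, batch, hidden_size), the context variable
        return c

class Decoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size)
    def forward(self, y, c):         # y: (tgt_len, batch, input_size)
        out, _ = self.rnn(y, c)      # the context c initializes the decoder state
        return out

enc, dec = Encoder(5, 6), Decoder(5, 6)
src = torch.randn(4, 3, 5)           # source sequence of length N = 4
tgt = torch.randn(7, 3, 5)           # target sequence of length M = 7
print(dec(tgt, enc(src)).shape)      # torch.Size([7, 3, 6])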

② Classify according to the internal structure of RNN

  • Traditional RNN
  • LSTM
  • Bi - LSTM (Bidirectional Long Short-Term Memory Network)
  • GRU
  • Bi - GRU

The following will introduce one by one according to this classification method.

3.3 Traditional RNN model

3.3.1 Internal structure analysis:

At each time step the traditional RNN cell takes the current input x_t together with the hidden state h(t-1) from the previous step, combines them in a fully connected layer, and applies a tanh activation. The result h_t serves both as the output of the current step and as the hidden state passed to the next step.

3.3.2 Data flow process:

The hidden state flows through the cell step by step: h(t-1) and x_t enter the cell, are combined and squashed by tanh, and the resulting h_t flows both to the output of the current step and onward to the next time step.

3.3.3 Formula:

The role of the tanh activation: it helps regulate the values flowing through the network by compressing them into the range between -1 and 1.
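For reference, the update rule used by a tanh RNN cell (the convention of PyTorch's nn.RNN; other texts may write the weights differently):

$$h_t = \tanh\left(W_{ih}\, x_t + b_{ih} + W_{hh}\, h_{t-1} + b_{hh}\right)$$

The output of the step is then computed from $h_t$, for example through a fully connected layer.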

3.3.4 Experiment code:

# Import the required packages
import torch
import torch.nn as nn
# Instantiate the RNN object
# 1st argument: input_size (dimension of the input tensor x)
# 2nd argument: hidden_size (dimension of the hidden layer, i.e. number of hidden units)
# 3rd argument: num_layers (number of hidden layers)
rnn = nn.RNN(5, 6, 1)
# Set up the input tensor x
# 1st dimension: sequence_length (length of the input sequence)
# 2nd dimension: batch_size (number of samples in the batch)
# 3rd dimension: input_size (dimension of the input tensor x)
input1 = torch.randn(1, 3, 5)
# Set up the initial hidden state h0
# 1st dimension: num_layers * num_directions (number of layers * number of directions)
# 2nd dimension: batch_size (number of samples in the batch)
# 3rd dimension: hidden_size (dimension of the hidden layer)
h0 = torch.randn(1, 3, 6)
# Feed the input tensor into the RNN and get the outputs
output, hn = rnn(input1, h0)
print(output)
print(output.shape)
print(hn)
print(hn.shape)
-----------------------------------------
tensor([[[ 0.2020,  0.3738,  0.8060, -0.6857, -0.6111,  0.6123],
         [-0.9363,  0.3544, -0.2019,  0.8183, -0.1817, -0.6506],
         [-0.6587,  0.6482, -0.8166, -0.5486, -0.0163,  0.7191]]],
       grad_fn=<StackBackward0>)
torch.Size([1, 3, 6])
tensor([[[ 0.2020,  0.3738,  0.8060, -0.6857, -0.6111,  0.6123],
         [-0.9363,  0.3544, -0.2019,  0.8183, -0.1817, -0.6506],
         [-0.6587,  0.6482, -0.8166, -0.5486, -0.0163,  0.7191]]],
       grad_fn=<StackBackward0>)
torch.Size([1, 3, 6])
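A small sanity check (my own addition): with a single layer, a single direction, and a length-1 input sequence, the last time step of output coincides with the final hidden state hn, which is why the two tensors printed above are identical:

print(torch.allclose(output[-1], hn[0]))  # True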

3.3.5 Advantages and disadvantages:

  • Advantages of traditional RNN:
    Its internal structure is simple and it needs little computation; compared with the variants introduced later (LSTM and GRU) it has far fewer parameters, and it performs well on short-sequence tasks.
  • Disadvantages of traditional RNN:
    In practice it performs poorly at capturing dependencies across long sequences, because backpropagating through a very long sequence makes the gradients vanish or explode (see the sketch below).
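A brief sketch of why this happens (standard analysis, not from the original post): backpropagation through time multiplies one Jacobian per time step,

$$\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=t+1}^{T} \operatorname{diag}\left(1 - h_k \odot h_k\right) W_{hh},$$

so if the relevant singular values of $W_{hh}$ stay below 1 the product shrinks exponentially (vanishing gradients), and if they stay above 1 it grows exponentially (exploding gradients).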

3.4 LSTM long short-term memory network

3.4.1 Internal structure analysis:

LSTM (Long Short-Term Memory) is a variant of the traditional RNN. Compared with the classic RNN it captures semantic dependencies across long sequences more effectively and alleviates vanishing or exploding gradients. Its structure is more complex, and its core can be analyzed in four parts: the forget gate, the input gate, the cell state, and the output gate (a standard set of equations for these parts is sketched after the list below).


  • Forget gate: determines how much of the past information is forgotten
  • Input gate
  • Cell state update
  • Output gate
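For reference, one standard formulation of the four parts (the convention used by PyTorch's nn.LSTM; other texts may use slightly different symbols):

$$f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \quad \text{(forget gate)}$$
$$i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \quad \text{(input gate)}$$
$$\tilde{c}_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \quad \text{(candidate cell state)}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad \text{(cell state update)}$$
$$o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \quad \text{(output gate)}$$
$$h_t = o_t \odot \tanh(c_t)$$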

3.4.2 Data flow process:

  • Forget gate
  • Input gate
  • Cell state update
  • Output gate

3.4.3 Experiment code:

# Import the required packages
import torch
import torch.nn as nn
# Instantiate the LSTM object
# 1st argument: input_size (dimension of the input tensor x)
# 2nd argument: hidden_size (dimension of the hidden layer, i.e. number of hidden units)
# 3rd argument: num_layers (number of hidden layers)
lstm = nn.LSTM(5, 6, 2)
# Initialize the input tensor x
# 1st dimension: sequence_length (length of the input sequence)
# 2nd dimension: batch_size (number of samples in the batch)
# 3rd dimension: input_size (dimension of the input tensor x)
input1 = torch.randn(1, 3, 5)
# Initialize the hidden state h0 and the cell state c0
# 1st dimension: num_layers * num_directions (number of layers * number of directions)
# 2nd dimension: batch_size (number of samples in the batch)
# 3rd dimension: hidden_size (dimension of the hidden layer)
h0 = torch.randn(2, 3, 6)
c0 = torch.randn(2, 3, 6)
# Feed input1, h0 and c0 into the LSTM and get the outputs
output, (hn, cn) = lstm(input1, (h0, c0))
print(output)
print(output.shape)
print(hn)
print(hn.shape)
print(cn)
print(cn.shape)
---------------------------------------
tensor([[[-0.0356,  0.1013, -0.4488, -0.2720, -0.0605, -0.2809],
         [-0.0743,  0.3319,  0.1953,  0.3076, -0.4295,  0.0784],
         [-0.2240,  0.1658,  0.1031,  0.3426, -0.2790,  0.3442]]],
       grad_fn=<StackBackward0>)
torch.Size([1, 3, 6])
tensor([[[ 0.1035,  0.0796, -0.0350,  0.3091, -0.0084, -0.0795],
         [ 0.1013,  0.4979, -0.3049,  0.3802,  0.2845, -0.1771],
         [ 0.0804, -0.2093, -0.0581, -0.3859,  0.3678, -0.2731]],

        [[-0.0356,  0.1013, -0.4488, -0.2720, -0.0605, -0.2809],
         [-0.0743,  0.3319,  0.1953,  0.3076, -0.4295,  0.0784],
         [-0.2240,  0.1658,  0.1031,  0.3426, -0.2790,  0.3442]]],
       grad_fn=<StackBackward0>)
torch.Size([2, 3, 6])
tensor([[[ 0.1972,  0.1682, -0.0902,  0.9651, -0.0115, -0.1569],
         [ 0.1968,  1.4286, -0.5794,  0.9468,  0.7288, -0.3405],
         [ 0.2432, -1.5347, -0.1129, -1.4662,  0.5249, -0.6214]],

        [[-0.0889,  0.4005, -1.2702, -0.5516, -0.0938, -0.6681],
         [-0.1985,  0.6989,  0.4673,  1.0849, -0.7235,  0.2078],
         [-0.4790,  0.4915,  0.3270,  0.6981, -0.6362,  0.6638]]],
       grad_fn=<StackBackward0>)
torch.Size([2, 3, 6])
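Two remarks consistent with the shapes above (my own notes): output covers every time step of the top layer, so its first dimension is the sequence length (1), while hn and cn hold the final hidden and cell states of both layers, so their first dimension is num_layers (2). The final hidden state of the top layer therefore equals the last time step of output:

print(torch.allclose(output[-1], hn[-1]))  # True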

3.4.4 Advantages and disadvantages:

  • Advantages of LSTM:
    The LSTM structure effectively mitigates the vanishing or exploding gradients that can occur on long sequences. Although it does not eliminate the phenomenon, it performs better than the traditional RNN on longer sequences.
  • Disadvantages of LSTM:
    Because its internal structure is relatively complex, its training efficiency is much lower than that of the traditional RNN under the same computing power.

3.5 GRU Gated Recurrent Unit

GRU (Gated Recurrent Unit) is another variant of the traditional RNN. Like LSTM, it can effectively capture semantic dependencies across long sequences and alleviate vanishing or exploding gradients, while its structure and computation are simpler than LSTM's. Its core can be analyzed in two parts: the update gate and the reset gate (a standard set of equations is given below).

3.5.1 Internal structure analysis:

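For reference, one standard formulation of the two gates (the convention used by PyTorch's nn.GRU; other texts may order the terms differently):

$$r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \quad \text{(reset gate)}$$
$$z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \quad \text{(update gate)}$$
$$\tilde{h}_t = \tanh\left(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\right) \quad \text{(candidate hidden state)}$$
$$h_t = (1 - z_t) \odot \tilde{h}_t + z_t \odot h_{t-1}$$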

3.5.3 Experiment code:

# Import the required packages (same as in the previous examples)
import torch
import torch.nn as nn
# Instantiate the GRU object
# 1st argument: input_size (dimension of the input tensor x)
# 2nd argument: hidden_size (dimension of the hidden layer, i.e. number of hidden units)
# 3rd argument: num_layers (number of hidden layers)
gru = nn.GRU(5, 6, 2)
# Initialize the input tensor input1
# 1st dimension: sequence_length (length of the sequence)
# 2nd dimension: batch_size (number of samples in the batch)
# 3rd dimension: input_size (dimension of the input tensor x)
input1 = torch.randn(1, 3, 5)
# Initialize the hidden state h0
# 1st dimension: num_layers * num_directions (number of layers * number of directions)
# 2nd dimension: batch_size (number of samples in the batch)
# 3rd dimension: hidden_size (dimension of the hidden layer)
h0 = torch.randn(2, 3, 6)
# Feed input1 and h0 into the GRU and get the outputs
output, hn = gru(input1, h0)
print(output)
print(output.shape)
print(hn)
print(hn.shape)
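As with the LSTM example, a quick shape check (my own note): with sequence length 1, batch size 3, hidden size 6 and two layers, output has shape (1, 3, 6) and hn has shape (2, 3, 6), and the last time step of output equals the final hidden state of the top layer:

print(torch.allclose(output[-1], hn[-1]))  # True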

3.5.4 Advantages and disadvantages:

  • Advantages of GRU:
    GRU serves the same purpose as LSTM: when capturing long-sequence semantic dependencies it effectively suppresses vanishing or exploding gradients, performs better than the traditional RNN, and is computationally cheaper than LSTM.
  • Disadvantages of GRU:
    GRU still does not completely solve the vanishing-gradient problem, and as an RNN variant it shares a major drawback of the family: computation across time steps cannot be parallelized. As data volumes and model sizes keep growing, this becomes a key bottleneck for the further development of RNNs.


Origin blog.csdn.net/weixin_45751396/article/details/127288277