LSTM and one-dimensional convolution dimension change analysis

When reproducing SketchMate recently, the RNN branch used an LSTM together with one-dimensional convolutions, so let's sort out how the tensor dimensions change along the way.

Table of contents

1. LSTM long short-term memory network

2. One-dimensional convolution Conv1d

3. Code Verification


1. LSTM long short-term memory network

The structure of an LSTM network can be represented by the following figure (source: Li Mu, Dive into Deep Learning, Section 9.2; figure not reproduced here).

The calculation process, written from scratch:

import torch

def lstm(inputs, state, params):
    [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o,
     W_xc, W_hc, b_c, W_hq, b_q] = params
    (H, C) = state
    outputs = []
    for X in inputs:
        I = torch.sigmoid((X @ W_xi) + (H @ W_hi) + b_i)     # input gate
        F = torch.sigmoid((X @ W_xf) + (H @ W_hf) + b_f)     # forget gate
        O = torch.sigmoid((X @ W_xo) + (H @ W_ho) + b_o)     # output gate
        C_tilda = torch.tanh((X @ W_xc) + (H @ W_hc) + b_c)  # candidate cell state
        C = F * C + I * C_tilda                              # new cell state
        H = O * torch.tanh(C)                                # new hidden state (the output)
        Y = (H @ W_hq) + b_q                                 # output layer
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H, C)
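
To check the dimension flow of this scratch implementation, here is a minimal sketch (not from the original post; vocab_size, num_hiddens and the batch shape are made-up values) that initializes random parameters and runs one forward pass through the lstm function above:

import torch

# Hypothetical sizes, chosen only to make the shapes visible
vocab_size, num_hiddens = 28, 32
num_steps, batch_size = 10, 4

def three():
    # each gate needs W_x (input-to-gate), W_h (hidden-to-gate) and a bias
    return (torch.randn(vocab_size, num_hiddens) * 0.01,
            torch.randn(num_hiddens, num_hiddens) * 0.01,
            torch.zeros(num_hiddens))

W_xi, W_hi, b_i = three()  # input gate
W_xf, W_hf, b_f = three()  # forget gate
W_xo, W_ho, b_o = three()  # output gate
W_xc, W_hc, b_c = three()  # candidate cell state
W_hq = torch.randn(num_hiddens, vocab_size) * 0.01  # output layer weight
b_q = torch.zeros(vocab_size)                       # output layer bias
params = [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o,
          W_xc, W_hc, b_c, W_hq, b_q]

X = torch.rand(num_steps, batch_size, vocab_size)  # (seq, batch, feature)
state = (torch.zeros(batch_size, num_hiddens),     # H_0
         torch.zeros(batch_size, num_hiddens))     # C_0
Y, (H, C) = lstm(X, state, params)
print(Y.shape)  # torch.Size([40, 28]) = (num_steps * batch_size, vocab_size)
print(H.shape)  # torch.Size([4, 32])  = (batch_size, num_hiddens)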

Regarding torch.nn.LSTM(), the parameters and dimension changes are explained as follows:

(1) The input parameter list includes:

  • input_size: The feature dimension of the input data, usually embedding_dim (the dimension of the word vector);
  • hidden_size: the dimension of the hidden layer in LSTM;
  • num_layers: the number of layers of the recurrent neural network;
  • bias: whether to use bias terms; default True;
  • batch_first: worth paying attention to. The data we feed in usually has shape = (batch_size, seq_length, embedding_dim), but batch_first defaults to False, so by default the batch_size and seq_length dimensions must be swapped before the data is sent to the LSTM; if batch_first=True, this swap is not needed (see the sketch after this list);
  • dropout: default 0, meaning no dropout (dropout is applied between stacked LSTM layers, so it only has an effect when num_layers > 1);
  • bidirectional: default False, meaning a bidirectional LSTM is not used.
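
As a quick illustration of the batch_first point above, here is a minimal sketch (all sizes are made up for the example) showing the two equivalent ways of feeding a (batch_size, seq_length, embedding_dim) tensor:

import torch
import torch.nn as nn

batch_size, seq_length, embedding_dim, hidden_size = 8, 20, 32, 64
x = torch.rand(batch_size, seq_length, embedding_dim)

# Default batch_first=False: the LSTM expects (seq_length, batch_size, input_size),
# so the first two dimensions must be swapped before the call.
lstm_seq_first = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size)
out1, _ = lstm_seq_first(x.transpose(0, 1))
print(out1.shape)  # torch.Size([20, 8, 64])

# batch_first=True: the tensor can be passed as-is.
lstm_batch_first = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size,
                           batch_first=True)
out2, _ = lstm_batch_first(x)
print(out2.shape)  # torch.Size([8, 20, 64])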

(2) The data passed to LSTM includes input, (h_0, c_0):

  • input: tensor of shape = [seq_length, batch_size, input_size];
  • h_0: tensor of shape = [num_layers * num_directions, batch_size, hidden_size], containing the initial hidden state for each sentence in the current batch; num_layers is the number of LSTM layers, and num_directions is 2 if bidirectional=True, otherwise 1 (a single direction);
  • c_0: same shape as h_0, containing the initial cell state for each sentence in the current batch;

If h_0 and c_0 are not provided, they default to zeros (verified in the sketch below).
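
The zero-default behaviour is easy to check; the following sketch (arbitrary sizes) shows that passing explicit zero tensors for (h_0, c_0) gives the same result as passing nothing:

import torch
import torch.nn as nn

seq_length, batch_size, input_size, hidden_size, num_layers = 15, 4, 10, 20, 2
lstm = nn.LSTM(input_size, hidden_size, num_layers)
x = torch.rand(seq_length, batch_size, input_size)

num_directions = 1  # bidirectional=False
h_0 = torch.zeros(num_layers * num_directions, batch_size, hidden_size)
c_0 = torch.zeros(num_layers * num_directions, batch_size, hidden_size)

out_default, _ = lstm(x)               # h_0, c_0 omitted -> zeros are used
out_explicit, _ = lstm(x, (h_0, c_0))  # explicit zero initial states
print(torch.allclose(out_default, out_explicit))  # True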

(3) LSTM output data includes output, (h_t, c_t):

  • output.shape = [seq_length, batch_size, num_directions * hidden_size]; it contains the output features of the last LSTM layer, i.e. the hidden state h_t at every time step t of each sentence in the batch.
  • h_t.shape = [num_directions * num_layers, batch, hidden_size]
  • c_t.shape = h_t.shape

h_t contains the hidden state at the last time step of each sentence, and c_t contains the cell state at the last time step, so neither depends on the sentence length seq_length. For a unidirectional LSTM, output[-1] is equal to h_t[-1] (the last layer's final hidden state), because output[-1] holds the hidden state of the last word of each of the batch_size sentences (verified in the sketch below).

Note that in an LSTM the hidden state is what gets exposed as the output, while the cell state always stays inside the LSTM, recording long-term information.
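
The relationship between output and h_t can be checked directly. The sketch below (arbitrary sizes; single layer, unidirectional) confirms that the last time step of output equals the final hidden state:

import torch
import torch.nn as nn

seq_length, batch_size, input_size, hidden_size = 12, 3, 8, 16
lstm = nn.LSTM(input_size, hidden_size)  # num_layers=1, bidirectional=False
x = torch.rand(seq_length, batch_size, input_size)

output, (h_t, c_t) = lstm(x)
print(output.shape)  # torch.Size([12, 3, 16]) = (seq_length, batch_size, num_directions * hidden_size)
print(h_t.shape)     # torch.Size([1, 3, 16])  = (num_directions * num_layers, batch_size, hidden_size)
print(c_t.shape)     # torch.Size([1, 3, 16])

# The last time step of output is exactly the final hidden state of the (only) layer
print(torch.allclose(output[-1], h_t[-1]))  # True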

2. One-dimensional convolution Conv1d

One-dimensional convolution, as the name implies, performs convolution along one dimension and is usually used to process sequence data (the original post illustrates the process with a figure, not reproduced here).

Note: the convolution kernel itself is not one-dimensional (it covers all input channels); "one-dimensional" means the kernel slides in only one direction.

nn.Conv1d expects input of shape [batch_size, in_channels, seq_len], so data of shape [batch_size, seq_len, embedding_dim] must first be transposed to [batch_size, embedding_dim, seq_len].

After convolution (default stride 1, no padding), the shape becomes [batch_size, out_channels, seq_len - kernel_size + 1].

The convolution slides along the last (sequence) dimension. out_channels is the number of output channels, which equals the number of convolution kernels.
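
A minimal sketch of this shape change (sizes made up), including the transpose that nn.Conv1d requires:

import torch
import torch.nn as nn

batch_size, seq_len, embedding_dim = 16, 100, 32
out_channels, kernel_size = 48, 5

x = torch.rand(batch_size, seq_len, embedding_dim)
conv = nn.Conv1d(in_channels=embedding_dim, out_channels=out_channels,
                 kernel_size=kernel_size)

# nn.Conv1d expects (batch_size, in_channels, seq_len), so swap the last two dims
y = conv(x.transpose(1, 2))
print(y.shape)  # torch.Size([16, 48, 96]) = (batch_size, out_channels, seq_len - kernel_size + 1)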

3. Code Verification

import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, dropout=0.1, n_layers=1):
        super(SketchRNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers

        # Three 1-D convolutions: Conv1d(in_channels, out_channels, kernel_size)
        self.conv1d_1 = nn.Conv1d(input_size, 48, 5)
        self.dropout_1 = nn.Dropout(0.1)
        self.conv1d_2 = nn.Conv1d(48, 64, 5)
        self.dropout_2 = nn.Dropout(0.1)
        self.conv1d_3 = nn.Conv1d(64, 96, 3)
        self.dropout_3 = nn.Dropout(0.1)
        # dropout is the 6th positional argument of nn.LSTM, so pass it by keyword;
        # it only applies between stacked layers (no effect when n_layers=1)
        self.lstm_1 = nn.LSTM(96, hidden_size, n_layers, dropout=dropout,
                              batch_first=True, bidirectional=True)
        # 186 time steps remain after the three convolutions (for a 196-point input); *2 for the two directions
        self.fc_mu1 = nn.Linear(hidden_size * 186 * 2, output_size)
        # self.fc_mu2 = nn.Linear(128, output_size)

    def forward(self, inputs, hidden):
        inputs = inputs.transpose(1, 2)
        print('inputs:', inputs.shape)

        output = self.conv1d_1(inputs)
        print('conv1d_1:', output.shape)
        output = self.dropout_1(output)
        output = self.conv1d_2(output)
        print('conv1d_2:', output.shape)
        output = self.dropout_2(output)
        output = self.conv1d_3(output)
        print('conv1d_3:', output.shape)
        output = self.dropout_3(output)
        output = output.transpose(1, 2)  # swap dims back to (batch, seq_len, channels) for the LSTM
        print('output.transpose:', output.shape)

        output, (hidden, x) = self.lstm_1(output, hidden)  # x is the final cell state

        output = output.contiguous()  # copies the tensor into contiguous memory (if it is not already), independent of the original, so view() can be applied
        output = output.view(output.size(0), -1)
        output_lstm = self.fc_mu1(output)
        # output = self.fc_mu2(output_lstm)
        output = F.log_softmax(output_lstm, dim=1)
        return output, output_lstm

model = SketchRNN(3, 256, 40, dropout=0.1)
# run the full model once to trace the conv dimension changes (512 sketches, 196 points, 3 features)
inputs = torch.rand([512, 196, 3])
print('inputs:', inputs.shape)
output, output_lstm = model(inputs, None)

# reproduce the LSTM part in isolation on the post-convolution shape
a = torch.rand([512, 186, 96])
lstm_1 = nn.LSTM(96, hidden_size=256, num_layers=1, dropout=0.1, batch_first=True, bidirectional=True)
output, (hidden, x) = lstm_1(a, None)
print(output.shape)
print(hidden.shape)
print(x.shape)
output = output.contiguous()
print(output.shape)
output = output.view(output.size(0), -1)
print(output.shape)
fc_mu1 = nn.Linear(256 * 186 * 2, 40)
output_lstm = fc_mu1(output)
print(output_lstm.shape)
output = F.log_softmax(output_lstm, dim=1)
print(output.shape)

output:

inputs: torch.Size([512, 196, 3])
inputs: torch.Size([512, 3, 196])
conv1d_1: torch.Size([512, 48, 192])
conv1d_2: torch.Size([512, 64, 188])
conv1d_3: torch.Size([512, 96, 186])
output.transpose: torch.Size([512, 186, 96])
# LSTM
output.shape:torch.Size([512, 186, 512])
hidden.shape:torch.Size([2, 512, 256])
x.shape:torch.Size([2, 512, 256])
# contiguous() (deep copy)
view:torch.Size([512, 95232])  # 186*512=95232
output_lstm.shape:torch.Size([512, 40])

The entire dimension change, end to end (shown as a diagram in the original post): [512, 196, 3] → transpose → [512, 3, 196] → conv1d_1 → [512, 48, 192] → conv1d_2 → [512, 64, 188] → conv1d_3 → [512, 96, 186] → transpose → [512, 186, 96] → bidirectional LSTM → [512, 186, 512] → flatten → [512, 95232] → fc_mu1 → [512, 40].

Original post: blog.csdn.net/qq_54708219/article/details/129885635