Understanding the LSTM and a handwritten LSTM implementation in one article

        `torch.nn.LSTM` is the PyTorch class used to create a Long Short-Term Memory (LSTM) network. LSTM is a variant of the Recurrent Neural Network (RNN) used to process sequence data.

Official LSTM API documentation 

The following are the main parameters of `torch.nn.LSTM` (used to configure and customize the behavior of the LSTM layer):

1. `input_size` (required parameter): The size of the feature dimension of the input data. This is the dimension of the feature vector of the input sequence.

2. `hidden_size` (required parameter): the dimension size of the hidden state of the LSTM unit. This determines the dimensions of the output and internal hidden states of the LSTM layer.

3. `num_layers` (optional parameter, default is 1): the number of stacked layers of the LSTM layer. You can stack multiple LSTM layers together to increase the model's capacity and representation capabilities.

4. `bias` (optional parameter, default is True): a Boolean value that determines whether to include a bias term in the LSTM unit.

5. `batch_first` (optional parameter, default is False): a Boolean value specifying the shape of the input data. If set to True, the shape of the input data should be `(batch_size, sequence_length, input_size)`, otherwise `(sequence_length, batch_size, input_size)`.

6. `dropout` (optional parameter, default is 0.0): dropout probability applied to the outputs of each LSTM layer except the last (so it only takes effect when `num_layers > 1`). This helps prevent overfitting.

7. `bidirectional` (optional parameter, default is False): a Boolean value that specifies whether to use bidirectional LSTM. If set to True, the LSTM will have forward and backward hidden states to better capture the contextual information of the sequence.

8. `device` (optional parameter): specifies the device on which the LSTM layer's parameters are created, such as CPU or GPU.

9. `dtype` (optional parameter): specifies the data type of the parameters, such as `torch.float32` or `torch.float64`.

Note that, unlike Keras, `torch.nn.LSTM` has no `return_sequences` parameter: the `output` tensor always contains the output of every time step, while `h_n` holds only the hidden state of the last time step.

        These parameters allow you to configure the LSTM layer according to the specific task and model architecture. Depending on your needs, you can flexibly choose different parameter values to build different LSTM models, as in the small construction sketch below.
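For example, a minimal construction sketch (the sizes here are made up purely for illustration):

import torch.nn as nn

# A 2-layer bidirectional LSTM: 10-dimensional inputs, 20-dimensional hidden state,
# batch-first tensors, and 20% dropout between the stacked layers.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2,
               bias=True, batch_first=True, dropout=0.2, bidirectional=True)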

Input to LSTM

 

The `torch.nn.LSTM` layer takes as input the tensor `input` and, optionally, a tuple `(h_0, c_0)` of initial states. After constructing the layer, it is called as follows:

lstm_layer = torch.nn.LSTM(input_size, hidden_size, batch_first=True)
output, (h_n, c_n) = lstm_layer(input, (h_0, c_0))

where:

(1) `input`

        `input` is usually a three-dimensional tensor; its exact shape depends on the `batch_first` parameter. The input tensor includes the following dimensions:

1. Batch Dimension: This is the number of samples in the data. If `batch_first` is set to True, the batch dimension will be the first dimension; otherwise, the batch dimension will be the second dimension.

2. Sequence Length Dimension: This is the number of time steps, i.e. the length of the sequence (the number of data points in the input sequence).

3. Feature Dimension: This is the number of features of each input data point. It represents the dimension of the input feature vector `x_t` at each time step.

Based on the above description, here are two common input shapes:

- If `batch_first` is True:
    - The shape of the input tensor is `(batch_size, sequence_length, input_size)`.
    - `batch_size` is the batch size, indicating the number of samples processed simultaneously.
    - `sequence_length` is the length of the sequence, i.e. the number of time steps.
    - `input_size` is the dimension of the input feature vector.

- If `batch_first` is False:
    - The shape of the input tensor is `(sequence_length, batch_size, input_size)`.
    - `sequence_length` is the length of the sequence, i.e. the number of time steps.
    - `batch_size` is the batch size, indicating the number of samples processed simultaneously.
    - `input_size` is the dimension of the input feature vector.

        Note that this only describes the shape of the input; the layer's `input_size` parameter must match the input's feature dimension (while `hidden_size` determines the feature dimension of the output). Depending on your specific task and data, you will need to organize the input data into appropriately shaped tensors and then pass them to the `torch.nn.LSTM` layer for forward propagation.
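As a quick sanity check of the two layouts (the sizes here are toy values assumed for illustration):

import torch
import torch.nn as nn

batch_size, sequence_length, input_size, hidden_size = 2, 3, 4, 5

# batch_first=True: input is (batch_size, sequence_length, input_size)
lstm_bf = nn.LSTM(input_size, hidden_size, batch_first=True)
out_bf, _ = lstm_bf(torch.randn(batch_size, sequence_length, input_size))
print(out_bf.shape)   # torch.Size([2, 3, 5])

# batch_first=False (default): input is (sequence_length, batch_size, input_size)
lstm_tf = nn.LSTM(input_size, hidden_size)
out_tf, _ = lstm_tf(torch.randn(sequence_length, batch_size, input_size))
print(out_tf.shape)   # torch.Size([3, 2, 5])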

(2) `(h_0, c_0)`

        `(h_0, c_0)` is a tuple containing the initial hidden state and the initial cell state.
   - `h_0`: the initial hidden state, with shape `(num_layers * num_directions, batch_size, hidden_size)`. `num_layers` is the number of stacked LSTM layers and `num_directions` is 1 or 2, depending on whether a bidirectional LSTM is used.
   - `c_0`: the initial cell state, also with shape `(num_layers * num_directions, batch_size, hidden_size)`.
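A small sketch of building zero-valued initial states for a 2-layer bidirectional LSTM (sizes assumed for illustration; if `(h_0, c_0)` is omitted entirely, PyTorch initializes both states to zeros automatically):

import torch

num_layers, num_directions = 2, 2            # bidirectional=True gives num_directions = 2
batch_size, hidden_size = 8, 16
h_0 = torch.zeros(num_layers * num_directions, batch_size, hidden_size)
c_0 = torch.zeros(num_layers * num_directions, batch_size, hidden_size)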

Output of LSTM

The output of a `torch.nn.LSTM` layer is usually a tuple `(output, (h_n, c_n))` containing two elements, where:

1. `output`: a tensor containing the LSTM output at each time step. Its shape is `(batch_size, sequence_length, num_directions * hidden_size)` when `batch_first=True`, where:
   - `sequence_length` is the length of the sequence, that is, the number of time steps.
   - `batch_size` is the batch size, indicating the number of samples processed simultaneously.
   - `num_directions` is 1 or 2, depending on whether bidirectional LSTM is used.
   - `hidden_size` is the dimension size of the hidden state of the LSTM unit.

2. `(h_n, c_n)`: is a tuple containing the hidden state and cell state of the last time step.
   - `h_n`: is the hidden state of the last time step, its shape is `(num_layers * num_directions, batch_size, hidden_size)`. `num_layers` is the number of stacked layers of the LSTM layer, `num_directions` is 1 or 2, depending on whether bidirectional LSTM is used.
   - `c_n`: is the cell state of the last time step, and its shape is also `(num_layers * num_directions, batch_size, hidden_size)`.

        Depending on the needs of your task, you can choose to use the output of all time steps or only that of the last time step.

        Generally, if you only care about the final output, you can use the last time step of `output` (for example `output[:, -1, :]` when `batch_first=True`) or `h_n`. If you need the complete sequence of per-time-step outputs, use `output`. These outputs can be passed to other layers or used for subsequent processing of the task.
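For a single-layer, unidirectional LSTM, the last time step of `output` is exactly the final hidden state, as this small sketch (toy sizes assumed) illustrates:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=5, batch_first=True)
x = torch.randn(2, 3, 4)
output, (h_n, c_n) = lstm(x)

last_step = output[:, -1, :]                 # (batch_size, hidden_size)
print(torch.allclose(last_step, h_n[-1]))    # True: same values as the final hidden state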

Weight parameters of LSTM

`torch.nn.LSTM` has the following main weight parameters (used to capture long-term dependencies in sequences):

1. `weight_ih_l[k]`: the input-to-hidden weight of LSTM layer `k` (where `k` is the layer index). Its shape is `(4 * hidden_size, input_size)`, where `hidden_size` is the size of the LSTM hidden state and `input_size` is the feature dimension of the input data. This weight controls how the input data affects the state of the LSTM cell.

2. `weight_hh_l[k]`: the hidden-to-hidden weight of LSTM layer `k`. Its shape is `(4 * hidden_size, hidden_size)`. This weight controls how the hidden state of the previous time step affects the hidden state of the current time step.

3. `bias_ih_l[k]` and `bias_hh_l[k]`: the input-to-hidden and hidden-to-hidden bias terms of LSTM layer `k`. Both have shape `(4 * hidden_size,)`. These biases adjust the contributions of the input and the hidden state.

The factor of 4 in the shapes above comes from the four gates of the LSTM unit: the input gate (i), forget gate (f), cell (candidate) gate (g), and output gate (o), stacked in that order in PyTorch. The LSTM uses these gates to control the flow of information and thus capture long-term dependencies.
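A quick way to confirm these shapes on a concrete layer (toy sizes assumed):

import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=5)
print(lstm.weight_ih_l0.shape)   # torch.Size([20, 4])  -> (4 * hidden_size, input_size)
print(lstm.weight_hh_l0.shape)   # torch.Size([20, 5])  -> (4 * hidden_size, hidden_size)
print(lstm.bias_ih_l0.shape)     # torch.Size([20])     -> (4 * hidden_size,)
print(lstm.bias_hh_l0.shape)     # torch.Size([20])     -> (4 * hidden_size,)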

        To access and modify these weight parameters, you can use the `state_dict` attribute to get or set the model's weights. For example, if you have a `torch.nn.LSTM` model named `lstm_model`, you can get a dictionary of weight parameters with `lstm_weights = lstm_model.state_dict()`. You can then extract and modify specific weight parameters from the `lstm_weights` dictionary. Please note that modifying the weight parameters may affect the performance of the model, so proceed with caution.

You can also obtain the model's weight parameters by iterating over them:

for k, v in lstm_model.named_parameters():
    print(k, v)  # print the name and value of each weight parameter
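As a small sketch of the modification workflow mentioned above (purely illustrative; it assumes `lstm_model` is an existing `nn.LSTM` instance, and zeroing a bias has no practical purpose here):

import torch

lstm_weights = lstm_model.state_dict()
lstm_weights["bias_ih_l0"] = torch.zeros_like(lstm_weights["bias_ih_l0"])  # replace one parameter
lstm_model.load_state_dict(lstm_weights)                                   # write it back into the model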

Code

        The following code first calls the official API and then implements the same LSTM forward pass by hand.


# Video link:
# https://www.bilibili.com/video/BV1zq4y1m7aH/?spm_id_from=333.788&vd_source=fb7bfda367c76676e2483b9b60485e57

# Handwritten LSTM implementation
# Define constants
import torch
import torch.nn as nn
batch_size, T, input_size, hidden_size = 2, 3, 4, 5


input = torch.randn(batch_size, T, input_size)
c_0 = torch.randn(batch_size, hidden_size)  # initial cell state, not a trainable parameter
h_0 = torch.randn(batch_size, hidden_size)  # initial hidden state

# Call the official API
lstm_layer = nn.LSTM(input_size=input_size, hidden_size=hidden_size, batch_first=True)
output, (h_n, c_n) = lstm_layer(input, (h_0.unsqueeze(0), c_0.unsqueeze(0)))
print("LSTM API")
print("output:\n", output)
print("h_n:\n", h_n)
print("c_n:\n", c_n)

# for k, v in lstm_layer.named_parameters():
#     print(k, v)
lstm_weight = lstm_layer.state_dict()  # use the state_dict attribute to get (or set) the model's weights
print("lstm_weight:\n", lstm_weight)

# Write an LSTM forward pass by hand
def lstm_forward(input, initial_states, w_ih, w_hh, b_ih, b_hh):
    """
    Manual single-layer, unidirectional LSTM forward pass.

    :param input: input tensor, shape [batch_size, T, input_size]
    :param initial_states: tuple (h_0, c_0), each of shape [batch_size, hidden_size]
    :param w_ih: input-to-hidden weights, shape [4*hidden_size, input_size]
    :param w_hh: hidden-to-hidden weights, shape [4*hidden_size, hidden_size]
    :param b_ih: input-to-hidden bias, shape [4*hidden_size]
    :param b_hh: hidden-to-hidden bias, shape [4*hidden_size]
    :return: (output, (h_n, c_n)), where output has shape [batch_size, T, hidden_size]
    """
    h_0, c_0 = initial_states  # initial states
    batch_size, T, input_size = input.shape
    hidden_size = w_ih.shape[0] // 4
    prev_h = h_0
    prev_c = c_0

    batch_w_ih = w_ih.unsqueeze(0).tile(batch_size, 1, 1) # [batch_size, 4*hidden_size, input_size]
    batch_w_hh = w_hh.unsqueeze(0).tile(batch_size, 1, 1) # [batch_size, 4*hidden_size, hidden_size]
    output_size = hidden_size
    output = torch.zeros(batch_size, T, output_size) # output sequence

    for t in range(T):
        x = input[:, t, :] # input vector at the current time step, [batch_size, input_size]
        w_times_x = torch.bmm(batch_w_ih, x.unsqueeze(-1)) # [batch_size, 4*hidden_size, 1]
        w_times_x = w_times_x.squeeze(-1) # [batch_size, 4*hidden_size]

        w_times_h_prev = torch.bmm(batch_w_hh, prev_h.unsqueeze(-1)) # [batch_size, 4*hidden_size, 1]
        w_times_h_prev = w_times_h_prev.squeeze(-1)  # [batch_size, 4*hidden_size]

        # Compute the input gate (i), forget gate (f), cell gate (g), and output gate (o)
        i_t = torch.sigmoid(w_times_x[:, :hidden_size] + w_times_h_prev[:, :hidden_size]
                            +b_ih[ :hidden_size] + b_hh[ :hidden_size])
        f_t = torch.sigmoid(w_times_x[:, hidden_size:2*hidden_size] + w_times_h_prev[:, hidden_size:2*hidden_size]
                            + b_ih[hidden_size:2*hidden_size] + b_hh[hidden_size:2*hidden_size])
        g_t = torch.tanh(w_times_x[:, 2*hidden_size:3*hidden_size] + w_times_h_prev[:, 2*hidden_size:3*hidden_size]
                            + b_ih[2*hidden_size:3*hidden_size] + b_hh[2*hidden_size:3*hidden_size])
        o_t = torch.sigmoid(w_times_x[:, 3*hidden_size:4*hidden_size] + w_times_h_prev[:, 3*hidden_size:4*hidden_size]
                            + b_ih[3*hidden_size:4*hidden_size] + b_hh[3*hidden_size:4*hidden_size])
        prev_c = f_t * prev_c + i_t * g_t
        prev_h = o_t * torch.tanh(prev_c)

        output[:, t, :] = prev_h

    return output, (prev_h, prev_c)


output_custom, (h_final_custom, c_final_custom) = lstm_forward(
    input=input, initial_states=(h_0, c_0),
    w_ih=lstm_layer.weight_ih_l0, w_hh=lstm_layer.weight_hh_l0,
    b_ih=lstm_layer.bias_ih_l0, b_hh=lstm_layer.bias_hh_l0)

print("LSTM custom")
print("output_custom:\n", output_custom)
print("h_final_custom:\n", h_final_custom)
print("c_final_custom:\n", c_final_custom)

Visual understanding of LSTM model input and output

(Figures omitted in this copy.)

Pictures and text from: "Detailed explanation of LSTM parameters in pytorch (a picture helps you better understand each parameter)", xjtuwfj's blog on CSDN.

Origin: blog.csdn.net/m0_48241022/article/details/132775071