LSTM notes

Lstm can be regarded as a special form of RNN, which is an improved recurrent neural network.

It can solve the problem that RNN cannot handle long-distance dependencies.

The specific features of LSTM can be summarized as one more unit state and three gates (forgetting gate, input gate, output gate) when the hidden layer is transferred, which work together on the hidden layer unit of the original RNN.

The forgetting gate $eq?f_%7Bt%7D$ acts on the hidden state of the previous sequence $eq?C_%7Bt-1%7D$ . Determine the effect of the previous time series hidden unit state on the current hidden unit.

$eq?f_%7Bt%7D*C_%7Bt-1%7D$

The input gate $eq?i_%7Bt%7D$ acts on the input of the current hidden unit $eq?%5Ctilde%7BC%7D_%7Bt%7D$ (the input is the output of the previous hidden unit $eq?h_%7Bt-1%7D$ and the input of the current hidden unit). Determines the effect of the current input on the current hidden unit.

$eq?%5Ctilde%7BC%7D_%7Bt%7D%3Dtanh%28W_%7BC%7D%5Ccdot%20%5Bh_%7Bt-1%7D%2CX_%7Bt%7D%5D+b_%7BC%7D%29$

$eq?i_%7Bt%7D*%5Ctilde%7BC%7D_%7Bt%7D$

The forgetting gate combined with the input gate determines the current hidden layer state output $eq?C_%7Bt%7D$ , and the output gate $eq?o_%7Bt%7D$ acts on the current output to generate the current hidden unit output.

$eq?C_%7Bt%7D%3Df_%7Bt%7D*C_%7Bt-1%7D+i_%7Bt%7D*%5Ctilde%7BC%7D_%7Bt%7D$

$eq?h_%7Bt%7D%3Dtanh%28o_%7Bt%7D*C_%7Bt%7D%29$

where $eq?f_%7Bt%7D%2Ci_%7Bt%7D%2Co_%7Bt%7D$ are all linear combinations of the inputs of the current hidden unit, and then normalized by the sigmoid function:

$eq?f_%7Bt%7D%3D%5Csigma%28W_%7Bf%7DX_%7Bt%7D+U_%7Bf%7Dh_%7Bt-1%7D+b_%7Bf%7D%29$

$eq?i_%7Bt%7D%3D%5Csigma%28W_%7Bi%7DX_%7Bt%7D+U_%7Bi%7Dh_%7Bt-1%7D+b_%7Bi%7D%29$

$eq?o_%7Bt%7D%3D%5Csigma%28W_%7Bo%7DX_%7Bt%7D+U_%7Bo%7Dh_%7Bt-1%7D+b_%7Bo%7D%29$

1. Lstm implementation in Pytorch----torch.nn.lstm function

parameter:

input_size: The number of features of the input data, the dimension of X in the above figure
hidden_size: hidden unit size, the dimension of h in the above figure
num_layer: Number of hidden layers. LSTM can be stacked in multiple layers in space
bias: bias
bidirectional: whether it is bidirectional LSTM
batch_first: When acting on input data, use batch as the first dimension

2. The entire model input data is a three-dimensional tensor with a size of (seq_len, batch, input_size)

seq_len: number of time steps.

It is equivalent to entering a sentence containing several words. Each word of a sentence is input to the model in turn, producing steps in time. Usually all sentence padding is of fixed length

input_size: The number of features of the input data, which is also the size of the input hidden unit data, equivalent to the feature dimension (dimension) of each word
batch: batch size. It is equivalent to how many sentences are input to the model at one time and processed at the same time.

For example, enter the following verse:

goose

Quxiang Xiangtiange

White hair floating in green water

Anthurium stirs clear waves

Assume that the feature dimension of each word is 6, and the batch size is specified to be 4.

Then in this example: seq_len=5 (the first sentence is automatically padded to 5 characters), batch=4, input_size=6

Each time 4 characters are input into the model, the first batch is: "goose", "qu", "white", "red"

Then the question comes. These four words are not in the same sentence. Their hidden states should be independent of each other and should not affect the next sequential word in different sentences.

We have two parameters that record the hidden unit data: h, c. In h,c, the status of each word (timing) of each sentence in the same batch is saved separately.

3. In addition to the input data, the input model also has two initial states ( , ) of each hidden unit . Note that in pytorch, we do not have to initialize the initial state. $eq?h_%7B0%7D$ $eq?c_%7B0%7D$

Its size is still a tensor

h0(num_layers*num_directions,batch,hidden_size)
c0( num_layers*num_directions,batch,hidden_size)

num_layers: Number of LSTM layers, number of layers of hidden units (spatial dimension)

hidden_size: the dimension of the hidden unit (the dimension of )

batch: batch size

Another dimension, batch (batch size), cannot be shown in the figure. It actually ensures that different words in the same batch have their own status data. You can imagine repeating the same sequence a batch of times.

4. The output data of LSTM is also a three-dimensional tensor with a size of (seq_len,batch,num_directions*hidden_size)

The content is the output of the last layer of hidden units

Parameter meaning is the same as above

also,

5. The output data also includes ( $eq?h_%7Bn%7D$ , $eq?c_%7Bn%7D$ ), which saves the status of each hidden unit in the nth time series.

Its size is:

hn(num_layers*num_directions,batch,hidden_size)
cn( num_layers*num_directions,batch,hidden_size)

Parameter meaning is the same as above

Rotate the above picture 90 degrees and expand it to different timings. The input, output, hidden unit output and overall state are shown in the figure below

6.A calling example:

input_size=300；hidden_size=100；num_layers=1；batch_size=64；seq_len=7

7.An application example

Use the LSTM model as the encoder layer and splice the linear transformation model (nn.Linear) as the decoder layer to implement an emotion classification model.

Here is the code snippet for model creation:

class SentimentNet(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 bidirectional, weight, labels, use_gpu, **kwargs):
        super(SentimentNet, self).__init__(**kwargs)
        self.num_hiddens = num_hiddens
        self.num_layers = num_layers
        self.use_gpu = use_gpu
        self.bidirectional = bidirectional
        self.embedding = nn.Embedding.from_pretrained(weight)
        self.embedding.weight.requires_grad = False
        self.encoder = nn.LSTM(input_size=embed_size, hidden_size=self.num_hiddens,
                               num_layers=num_layers, bidirectional=self.bidirectional,
                               dropout=0)
        if self.bidirectional:
            self.decoder = nn.Linear(num_hiddens * 4, labels)
        else:
            self.decoder = nn.Linear(num_hiddens * 2, labels)

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        states, hidden = self.encoder(embeddings.permute([1, 0, 2]))
        encoding = torch.cat([states[0], states[-1]], dim=1)
        outputs = self.decoder(encoding)
        return outputs

For other codes such as data cleaning and model training, please see this project

https://github.com/SamaelChen/machine-learning-practice-code/blob/master/pytorch/langage%20model/lstm-sentiment.ipynb