When to use PyTorch LSTM's output (output layer) and when to use h (hidden state)

This article assumes you already know how an LSTM works.
Here we only discuss the outputs of PyTorch's LSTM; the same reasoning applies to all RNN-style networks.

Unidirectional LSTM

1-layer LSTM

Here it is assumed that the input batch_size is 8,
the sentence length is 10,
the word-vector dimension is 128,
the LSTM hidden dimension is 50, and
there is a single LSTM layer.

import torch
import torch.nn as nn

batch_size = 8        # batch size is 8
seq_len = 10          # sentence length is 10
embedding_size = 128  # word-vector dimension is 128

x = torch.rand((batch_size,seq_len,embedding_size))

input_size = embedding_size  # for the LSTM input, each word's dimension is the word-vector dimension
hidden_size = 50
num_layers = 1

lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, bidirectional=False)

output,(h,c) = lstm(x)

print('output.shape = ',output.shape)
print('h.shape = ',h.shape)

The output is:

output.shape =  torch.Size([8, 10, 50])
h.shape =  torch.Size([1, 8, 50])

The output can be understood as containing the information of every word (10 words per sentence, each represented by a 50-dimensional vector).
Note that the LSTM propagates through time, so the vector for each word in output is produced after taking all of the preceding words (i.e. the left context) into account.

A sketch for a single sentence:
[figure omitted: single-layer unidirectional LSTM unrolled over one sentence]


print(torch.all(output[:,-1,:] == h[0]).item())

The output is:

True

This shows that the hidden state h holds the information of the last word of each sentence. As explained above, each word's vector in output already accounts for all preceding words, so `h[0] == output[:,-1,:]` (h is three-dimensional; taking h[0] makes it two-dimensional) is the vector that contains the information of the entire sentence.
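To make the recurrence concrete, here is a minimal sketch (reusing lstm, x, output and seq_len from above) that feeds the sentence in one word at a time and carries the (h, c) state forward by hand; the step-by-step result should match the single call, which is exactly why the final step's hidden state summarizes the whole sentence.

state = None                                       # (h, c); None means zero-initialized, as in the single call above
step_outputs = []
for t in range(seq_len):
    step_out, state = lstm(x[:, t:t+1, :], state)  # process one word, reusing the carried state
    step_outputs.append(step_out)

stepwise_output = torch.cat(step_outputs, dim=1)   # [8, 10, 50]
print(torch.allclose(stepwise_output, output))     # should print True (up to floating-point tolerance)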

2-layer LSTM

The code is the same as above, except that the LSTM now has 2 layers.

batch_size = 8
seq_len = 10
embedding_size = 128

x = torch.rand((batch_size,seq_len,embedding_size))

input_size = embedding_size  # for the input, each word's dimension is the word-vector dimension
hidden_size = 50
num_layers = 2

lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, bidirectional=False)

output,(h,c) = lstm(x)

print('output.shape = ',output.shape)
print('h.shape = ',h.shape)

The output is:

output.shape =  torch.Size([8, 10, 50])
h.shape =  torch.Size([2, 8, 50])

print(torch.all(output[:,-1,:] == h[0]).item())
print(torch.all(output[:,-1,:] == h[1]).item())

The output is:

False
True

A sketch for a single sentence:
[figure omitted: 2-layer unidirectional LSTM unrolled over one sentence]

As you can see, output has the same shape as in the single-layer case: it holds the per-word outputs of the topmost layer, and stacking a second layer simply lets the network transform the information further.
h[0] and h[1] are the final hidden states of the first and second layers respectively; both summarize the entire sentence, but what each layer learns will differ. This is also why output[:,-1,:] matches h[1] (the top layer) rather than h[0].
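To see how the second layer consumes the first layer's output, a minimal check (reusing the 2-layer lstm just defined) is to print the input-to-hidden weight shapes per layer; the second layer's input dimension is 50, i.e. the first layer's hidden size.

# print only the input-to-hidden weights of each layer
for name, p in lstm.named_parameters():
    if name.startswith('weight_ih'):
        print(name, tuple(p.shape))

This should show weight_ih_l0 with shape (200, 128) and weight_ih_l1 with shape (200, 50): the first layer reads the 128-dimensional word vectors, the second reads the first layer's 50-dimensional outputs.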



Bidirectional LSTM

1-layer LSTM

Again it is assumed that the input batch_size is 8,
the sentence length is 10,
the word-vector dimension is 128,
the LSTM hidden dimension is 50, and
there is a single LSTM layer (now with bidirectional=True).

batch_size = 8
seq_len = 10
embedding_size = 128

x = torch.rand((batch_size,seq_len,embedding_size))

input_size = embedding_size  # for the input, each word's dimension is the word-vector dimension
hidden_size = 50
num_layers = 1

lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)

output,(h,c) = lstm(x)

print('output.shape = ',output.shape)
print('h.shape = ',h.shape)

The output is:

output.shape =  torch.Size([8, 10, 100])
h.shape =  torch.Size([2, 8, 50])

A sketch for a single sentence:
[figure omitted: 1-layer bidirectional LSTM unrolled over one sentence]


print(torch.all(output[:,-1,:hidden_size]==h[0]).item())
print(torch.all(output[:,0,hidden_size:]==h[1]).item())

The output is:

True
True

As you can see, output is obtained by concatenating, for each word, its vector from the forward LSTM with its vector from the reverse LSTM.
h[0] is the whole-sentence information produced by the forward LSTM.
h[1] is the whole-sentence information produced by the reverse LSTM.
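To make the reverse direction concrete, here is a minimal sketch (hypothetical, reusing lstm, x, output, input_size and hidden_size from above): it copies the reverse-direction weights into a fresh unidirectional LSTM, runs it over the flipped sentence, and flips the result back, which should reproduce the reverse half of output.

# copy the reverse-direction parameters into a standalone unidirectional LSTM
rev_lstm = nn.LSTM(input_size, hidden_size, 1, batch_first=True, bidirectional=False)
rev_lstm.weight_ih_l0.data.copy_(lstm.weight_ih_l0_reverse.data)
rev_lstm.weight_hh_l0.data.copy_(lstm.weight_hh_l0_reverse.data)
rev_lstm.bias_ih_l0.data.copy_(lstm.bias_ih_l0_reverse.data)
rev_lstm.bias_hh_l0.data.copy_(lstm.bias_hh_l0_reverse.data)

out_flipped, _ = rev_lstm(torch.flip(x, dims=[1]))               # read the sentence right to left
reverse_half = torch.flip(out_flipped, dims=[1])                 # realign with the original word order
print(torch.allclose(reverse_half, output[:, :, hidden_size:]))  # should print True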

Note also that the forward LSTM and the reverse LSTM have separate parameters that are updated independently; their weights are not shared, as the following code shows.

# all_weights[0] holds the forward direction's parameters, all_weights[1] the reverse direction's;
# every pair differs, so each comparison prints False (i.e. not shared)
for i in range(len(lstm.all_weights[0])):
    if torch.any(lstm.all_weights[0][i] != lstm.all_weights[1][i]):
        print(False)

The output is:

False
False
False
False

2-layer LSTM

A two-layer bidirectional LSTM works the same way as before, except that each layer now contributes both a forward and a reverse hidden state.

batch_size = 8
seq_len = 10
embedding_size = 128

x = torch.rand((batch_size,seq_len,embedding_size))

input_size = embedding_size  # for the input, each word's dimension is the word-vector dimension
hidden_size = 50
num_layers = 2

lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, bidirectional=True)

output,(h,c) = lstm(x)

print('output.shape = ',output.shape)
print('h.shape = ',h.shape)

The output is:

output.shape =  torch.Size([8, 10, 100])
h.shape =  torch.Size([4, 8, 50])

print(torch.all(output[:,-1,:hidden_size]==h[2]).item())
print(torch.all(output[:,0,hidden_size:]==h[3]).item())

The output is:

True
True

The sketch is:
[figure omitted: 2-layer bidirectional LSTM unrolled over one sentence]

A multi-layer LSTM processes the input layer by layer: each layer's input is the previous layer's output. The 2-layer LSTM here first runs one bidirectional layer over the sentence and then feeds that layer's output vectors into the second layer as its input.

We can confirm this from the weight shapes:

# for each direction of each layer, print the input-to-hidden and hidden-to-hidden weight shapes
for i in range(len(lstm.all_weights)):
    for w in lstm.all_weights[i][:2]:
        print(w.shape)

The output is:

torch.Size([200, 128])
torch.Size([200, 50])
torch.Size([200, 128])
torch.Size([200, 50])
torch.Size([200, 100])
torch.Size([200, 50])
torch.Size([200, 100])
torch.Size([200, 50])

The torch.Size([200, 100]) entries are the input-to-hidden weights of the second layer: its input is the first layer's output, i.e. the forward and reverse 50-dimensional vectors concatenated into 100 dimensions. (200 is 4 * hidden_size, because the four gates' weight matrices are stacked together.)
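To make the layer-by-layer view explicit, here is a minimal sketch (hypothetical, reusing x, embedding_size and hidden_size from above) that builds the same structure from two separate 1-layer bidirectional LSTMs; the weights are initialized independently, so only the shapes correspond to the 2-layer model.

layer1 = nn.LSTM(embedding_size, hidden_size, 1, batch_first=True, bidirectional=True)
layer2 = nn.LSTM(2 * hidden_size, hidden_size, 1, batch_first=True, bidirectional=True)

out1, _ = layer1(x)                      # [8, 10, 100]: per-word outputs of the first layer
out2, (h2, c2) = layer2(out1)            # [8, 10, 100]: the second layer consumes the first layer's output
print(out1.shape, out2.shape, h2.shape)  # h2 is [2, 8, 50]: forward and reverse states of this one layer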


When to use the output layer information (output)

Take the bidirectional LSTM as an example:
output holds one vector per word, and because the LSTM is bidirectional each of those vectors takes the full context into account (with a unidirectional LSTM, only the left context is considered).
So for tasks that classify every word in a sentence, such as part-of-speech tagging or named entity recognition, we need the output layer information, as in the sketch below.
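A minimal sketch of a token-level classifier on top of output, reusing the bidirectional output of shape [8, 10, 100] and hidden_size from above (num_tags and tagger are hypothetical names, not from the original post).

num_tags = 5                                   # e.g. the number of entity or POS labels (hypothetical)
tagger = nn.Linear(2 * hidden_size, num_tags)  # bidirectional -> 2 * hidden_size features per word
logits = tagger(output)                        # [8, 10, num_tags]: one score vector per word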


When to use the hidden state information (h)

Take the bidirectional LSTM as an example:
the hidden state summarizes the entire sentence, and the bidirectional LSTM gives us both the result of reading the sentence forwards and the result of reading it backwards.
So for tasks that classify the whole sentence, such as sentiment classification, we simply concatenate the two results and feed them to a linear layer. Typically h[-2] (the deepest layer's forward final state) and h[-1] (the deepest layer's reverse final state) are used, as sketched below.
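A minimal sketch of a sentence-level classifier on top of h, reusing the bidirectional h and hidden_size from above (num_classes and classifier are hypothetical names).

num_classes = 2                                      # e.g. positive / negative (hypothetical)
sentence_vec = torch.cat([h[-2], h[-1]], dim=1)      # [8, 100]: deepest forward + deepest reverse state
classifier = nn.Linear(2 * hidden_size, num_classes)
logits = classifier(sentence_vec)                    # [8, num_classes]: one prediction per sentence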


Afterword

In fact, all RNN-style networks work this way. The encoder side of the Transformer, and of BERT, is likewise a stack of encoder blocks: the sentence's hidden representation is encoded and re-encoded layer by layer, and in the end we usually only need the hidden states of the last layer.

A common practice, however, is to combine the first layer's hidden states with the last layer's (by summing, or by average pooling) to form the embedding vector of the whole sentence. Borrowing the idea of residual networks, this helps avoid the information loss that can occur when information passes through an overly deep network.
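As a rough sketch of that pooling idea (all_layer_states is a hypothetical list of [batch, seq_len, dim] tensors, one per encoder layer, such as the per-layer hidden states a Transformer encoder can return):

first_layer = all_layer_states[0]
last_layer = all_layer_states[-1]
sentence_embedding = ((first_layer + last_layer) / 2).mean(dim=1)  # combine the two layers, then average over words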
