Seq2Seq combined with an attention mechanism
Foreword
Code reference:
https://github.com/jasoncao11/nlp-notebook/tree/master/4-2.Seq2seq_Att
Thanks to the author: almost all of the text summarization code can be used as is.
Only a small part needs to be modified, probably because of version differences.
This code has been run successfully. If you have any questions, please leave a comment and we can discuss them together.
If there is something wrong with my understanding, please point it out.
Reference:
https://www.bilibili.com/video/BV1op4y1U7ag?t=1013
https://github.com/bentrevett/pytorch-seq2seq
https://github.com/DD-DuDa/nlp_course
https://zhuanlan.zhihu.com/p/383866592
This article follows on from:
[Text Summary (2)] Seq2Seq of pytorch
https://blog.csdn.net/WTYuong/article/details/129683262
Attention
Note: the attention used in seq2seq is no longer commonly used.
It is worth taking a closer look at the attention in the Transformer, which is more widely used and simpler.
In the previous article, we said that the Encoder encodes the entire input into a single vector, context.
This vector comes from the output of the last layer of the Encoder.
The Decoder decodes the corresponding sentence using only this vector.
The problem
The first problem: can this context vector really contain all the information of the input sentence? Imagine translating a sentence of 100 words with a context of only 200 dimensions. It has to capture not only each word but also word order, semantics, and so on, which is almost impossible.
The second problem: even assuming the context vector really contains all the information, can the Decoder really translate everything just by looking at this single vector? At each step the Decoder has to extract the information for the corresponding position from it.
A simple analogy: suppose you are the decoder. You listen to a one-minute speech in English and take notes as you go. After it ends, you have to translate the whole minute into Chinese using only the bits and pieces you wrote down.
The solution
So what should we do?
It's very simple: listen to one sentence, pause, translate it, and then continue :)
Here you can get a feel for one of the ideas behind attention: alignment (align).
At each step of translation, the model needs to focus on the corresponding input location.
Example: suppose the model needs to translate "Change your life today". At its first step, the Decoder needs to know that the first input to the Encoder was "change", and then the Decoder looks at this "change" to translate it.
How to pay attention
The Encoder does not need any changes; what changes is mainly the input of the Decoder.
The Decoder input changes from [context vector + Embedding]
to [context vector + attention_output + Embedding].
The linear layer of the Decoder changes accordingly.
attention_output
Without attention, the first input of the decoder is [encoder's last hidden layer output + embedding]. The vector output by this hidden layer is called $h_0$.
With attention, we first compute the score between each encoder hidden state $s_1, s_2, s_3, s_4, s_5$ and $h_0$: $\mathrm{score}(h_0, s_k),\ k = 1, \dots, 5$.
Softmax then turns all the scores into a probability distribution with values in [0, 1], giving the attention weights $a_k,\ k = 1, \dots, 5$.
Finally, compute the attention output: the weighted sum $c$ of the encoder states with the attention weights, $c = \sum_{k=1}^{5} a_k s_k$.
The following decoder steps work the same way.
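As a rough illustration of the three steps above (score, softmax, weighted sum), here is a minimal pure-Python sketch. It assumes a plain dot product as the score function purely for brevity; the model in this article uses a learned score (a linear layer, tanh, and a vector v) instead. The vectors h0 and states are made-up toy numbers, and the helper names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into weights in [0, 1] that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_output(h0, states):
    # Step 1: score(h0, s_k) for every encoder state
    # (the dot product is an assumption made for this sketch)
    scores = [dot(h0, s) for s in states]
    # Step 2: softmax -> attention weights a_k
    weights = softmax(scores)
    # Step 3: weighted sum c = sum_k a_k * s_k
    dim = len(states[0])
    c = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return weights, c

# Toy values: one decoder state h0 and five 2-dimensional encoder states s1..s5
h0 = [1.0, 0.0]
states = [[0.1, 0.9], [2.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
weights, c = attention_output(h0, states)
```

The encoder state with the highest score receives the largest weight, so c is pulled towards that state. This is the "alignment" idea in numeric form.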
code structure
Code that is the same as in the previous seq2seq article is not explained in detail.
Model structure definition: model.py
# -*- coding: utf-8 -*-
import random
import torch.nn as nn
import torch
import torch.nn.functional as F
Encoder function
Compared with the previous model using a two-layer GRU, a bidirectional RNN is now used.
For bidirectional RNNs, there are two RNNs per layer. A forward RNN traverses the embedded sentence from left to right (shown in green in the figure below), and a backward RNN traverses the embedded sentence from right to left (teal).
All that needs to be done in the code is to set bidirectional = True
and then pass the embedded sentence to the RNN as before.
The Encoder class builds the encoder. The internal RNN uses torch's built-in GRU, and the parameters are:
input_dim: the size of the input vocabulary
emb_dim: the dimension of the embedding
enc_hid_dim: the size of the encoder hidden layer
dec_hid_dim: the size of the decoder hidden layer
dropout: the dropout probability
forward parameter:
src: the source text, already converted from words to indices through the vocabulary
forward returns the overall output of the Encoder (hidden) and the output of each Encoder state (outputs). The per-state outputs are used later to compute attention.
Optional: to keep the pad symbols in a sequence from influencing the attention computed later, the pack_padded_sequence method from nn.utils.rnn
can be applied to drop the pad symbols after doc_len. The Encoder code in this article does not use this step.
doc_len: the true length of each sample. When running the RNN, only the states up to this length are computed, and the pad symbols are skipped.
The input of pack_padded_sequence is the embedding of the word sequence together with the true length of each sequence, so that the pad symbols after doc_len are not processed when the sequence is computed.
packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, doc_len)
After the RNN has run, every sequence with doc_len < max_len is padded again so that the result forms a matrix, which is convenient for GPU computation. The pad_packed_sequence method is used for this.
Its input is packed_outputs, the packed sequence computed by the RNN. The padded positions come back as zeros, which keeps padding information out of the subsequent attention computation.
outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs)
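The two one-line snippets above can be put together into a small runnable sketch. The tensor sizes and the random embedded tensor are made up for illustration. Passing batch_first=True (to match the batch-first GRU used in this article) and enforce_sorted=False (so the lengths need not be sorted) are assumptions added here; the snippets above omit them.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

batch_size, max_len, emb_dim, hid_dim = 3, 5, 4, 6
embedded = torch.randn(batch_size, max_len, emb_dim)  # stands in for the embedding output
doc_len = torch.tensor([5, 3, 2])                     # true lengths, kept on the CPU

rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)

# Pack: the GRU skips the positions beyond each sequence's true length.
packed_embedded = pack_padded_sequence(
    embedded, doc_len, batch_first=True, enforce_sorted=False
)
packed_outputs, hidden = rnn(packed_embedded)

# Unpack: back to a padded tensor; padded positions are filled with zeros.
outputs, out_len = pad_packed_sequence(packed_outputs, batch_first=True)
# outputs = [batch size, max len, hid dim * 2]
# out_len = the true lengths again
```

With packing, hidden holds the state at each sequence's true last token rather than at a pad position.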
In the actual implementation, the transformation of matrix dimensions is cumbersome. For matrix operations, it is often necessary to increase or decrease dimensions or exchange the order of dimensions. The code has given annotations. It is recommended to debug it by yourself to experience the process of dimension transformation.
The input of the encoder is the source text; the outputs are the per-step outputs and hidden_state, whose size must be set to match the decoder (dec_hid_dim).
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        #src = [batch size, src len]
        embedded = self.dropout(self.embedding(src))
        #embedded = [batch size, src len, emb dim]
        outputs, hidden = self.rnn(embedded)
        #outputs = [batch size, src len, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        #hidden [-2, :, : ] is the last of the forwards RNN
        #hidden [-1, :, : ] is the last of the backwards RNN
        #initial decoder hidden is final hidden state of the forwards and backwards
        # encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        #outputs = [batch size, src len, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        return outputs, hidden
hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
Because a bidirectional GRU is used, the final hidden state has a forward part and a backward part.
In this example only one GRU layer is built, so the final hidden actually has shape [2, batch size, hid dim].
To turn it into the dimension of the decoder hidden layer, the two vectors are concatenated and passed through a linear layer, followed by tanh.
The result is h0, which is the initial hidden state of the decoder.
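To make the shapes concrete, here is a minimal sketch that runs a one-layer bidirectional GRU on random data and builds h0 the same way. The sizes are arbitrary illustration values, and the standalone rnn and fc layers stand in for self.rnn and self.fc of the Encoder.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, src_len, emb_dim = 2, 7, 8
enc_hid_dim, dec_hid_dim = 16, 32

rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True, batch_first=True)
fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

embedded = torch.randn(batch_size, src_len, emb_dim)
outputs, hidden = rnn(embedded)
# outputs = [batch size, src len, enc hid dim * 2]
# hidden  = [2, batch size, enc hid dim]  (one layer, two directions)

forward_last = hidden[-2]   # forward RNN after reading the last token
backward_last = hidden[-1]  # backward RNN after reading the first token

# The same vectors also sit inside outputs: the forward half at the last
# position and the backward half at the first position.
same_fwd = torch.allclose(outputs[:, -1, :enc_hid_dim], forward_last)
same_bwd = torch.allclose(outputs[:, 0, enc_hid_dim:], backward_last)

# Concatenate, project to the decoder size, squash with tanh -> h0
h0 = torch.tanh(fc(torch.cat((forward_last, backward_last), dim=1)))
# h0 = [batch size, dec hid dim]
```

The linear layer is needed because the concatenated vector has size enc_hid_dim * 2, while the decoder GRU expects a hidden state of size dec_hid_dim.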
Attention module
1. self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
This is W1 in the figure below. h has dimension dec_hid_dim (the dimension of the decoder hidden layer), and sk is the k-th encoder state, the forward and backward vectors merged together, with dimension enc_hid_dim * 2.
The output of this layer has shape [batch size, src len, dec hid dim].
2. self.v = nn.Linear(dec_hid_dim, 1, bias = False)
Corresponding to the figure below, each position needs to be converted into a single score.
The output shape becomes [batch size, src len].
3. hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
Assuming this is h0, it has to be merged with each of the src_len encoder states sk, k=1...5, so h0 is repeated src_len times.
4. energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2)))
5. F.softmax(attention, dim=1)
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)

    def forward(self, hidden, encoder_outputs):
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        src_len = encoder_outputs.shape[1]
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        #hidden = [batch size, src len, dec hid dim]
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2)))
        #energy = [batch size, src len, dec hid dim]
        attention = self.v(energy).squeeze(2)
        #attention= [batch size, src len]
        return F.softmax(attention, dim=1)
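To check the shapes noted in the comments, the same five steps can be run on random tensors. This is only a sketch: the sizes are arbitrary, the random hidden and encoder_outputs stand in for real encoder results, and the standalone attn and v layers mirror self.attn and self.v.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, src_len = 2, 5
enc_hid_dim, dec_hid_dim = 16, 32

attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)  # plays the role of W1
v = nn.Linear(dec_hid_dim, 1, bias=False)                       # maps each position to one score

hidden = torch.randn(batch_size, dec_hid_dim)                   # e.g. h0
encoder_outputs = torch.randn(batch_size, src_len, enc_hid_dim * 2)

# Repeat the decoder state once per source position
repeated = hidden.unsqueeze(1).repeat(1, src_len, 1)
# repeated = [batch size, src len, dec hid dim]

# Pair every source position with the decoder state, then project
energy = torch.tanh(attn(torch.cat((repeated, encoder_outputs), dim=2)))
# energy = [batch size, src len, dec hid dim]

scores = v(energy).squeeze(2)
# scores = [batch size, src len]

a = F.softmax(scores, dim=1)
# a = [batch size, src len]; each row sums to 1
```

The softmax runs over dim=1, the source positions, so each example gets its own distribution over the input tokens.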
Decoder
1. self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
emb_dim corresponds to the word embedding of the output word. enc_hid_dim * 2 corresponds to attention_output; the factor of 2 is there because the encoder is bidirectional.
2. self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
The input of the decoder's linear layer at each step is made up of attention_output, the decoder output, and the word embedding of the output word.
3. weighted = torch.bmm(a, encoder_outputs)
This is C(t). The input of each decoder step is then [hidden_state + C(t) + embedding].
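The torch.bmm call in step 3 is just the weighted sum of encoder states, computed for the whole batch at once. The following sketch, with made-up sizes and random weights, compares it against an explicit weighted sum.

```python
import torch

torch.manual_seed(0)

batch_size, src_len, enc_hid_dim = 2, 4, 3

# Attention weights: one distribution over source positions per example
a = torch.softmax(torch.randn(batch_size, src_len), dim=1)
encoder_outputs = torch.randn(batch_size, src_len, enc_hid_dim * 2)

# [batch, 1, src len] x [batch, src len, enc hid dim * 2]
#   -> [batch, 1, enc hid dim * 2]
weighted = torch.bmm(a.unsqueeze(1), encoder_outputs)

# The same result written as an explicit weighted sum over source positions
manual = (a.unsqueeze(2) * encoder_outputs).sum(dim=1, keepdim=True)
```

The middle dimension of size 1 is kept so that weighted can be concatenated with the embedded token along dim=2 before it goes into the GRU.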
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim, batch_first=True)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, inputs, hidden, encoder_outputs):
        #inputs = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        inputs = inputs.unsqueeze(1)
        #inputs = [batch size, 1]
        embedded = self.dropout(self.embedding(inputs))
        #embedded = [batch size, 1, emb dim]
        a = self.attention(hidden, encoder_outputs)
        #a = [batch size, src len]
        a = a.unsqueeze(1)
        #a = [batch size, 1, src len]
        weighted = torch.bmm(a, encoder_outputs)
        #weighted = [batch size, 1, enc hid dim * 2]
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        #rnn_input = [batch size, 1, (enc hid dim * 2) + emb dim]
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        #output = [batch size, seq len, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [batch size, 1, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        embedded = embedded.squeeze(1)
        output = output.squeeze(1)
        weighted = weighted.squeeze(1)
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        #prediction = [batch size, output dim]
        return prediction, hidden.squeeze(0)
training + validation
# model, criterion, optimizer, device, train_iter, val_iter, N_EPOCHS and CLIP are defined elsewhere
import numpy as np
import torch
from tqdm import tqdm

loss_vals = []
loss_vals_eval = []
for epoch in range(N_EPOCHS):
    model.train()
    epoch_loss = []
    pbar = tqdm(train_iter)  # wrap the iterator in a progress bar
    pbar.set_description("[Train Epoch {}]".format(epoch))  # set the progress bar description
    for i, batch in enumerate(pbar):
        trg = batch.trg
        src = batch.src
        trg, src = trg.to(device), src.to(device)
        model.zero_grad()
        output = model(src, trg)
        #trg = [batch size, trg len]
        #output = [batch size, trg len, output dim]
        output_dim = output.shape[-1]
        output = output[:,1:,:].reshape(-1, output_dim)
        trg = trg[:,1:].reshape(-1)
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
        epoch_loss.append(loss.item())
        optimizer.step()
        pbar.set_postfix(loss=loss.item())
    loss_vals.append(np.mean(epoch_loss))

    model.eval()
    epoch_loss_eval = []
    pbar = tqdm(val_iter)
    pbar.set_description("[Eval Epoch {}]".format(epoch))
    with torch.no_grad():  # no gradients are needed during validation
        for i, batch in enumerate(pbar):
            trg = batch.trg
            src = batch.src
            trg, src = trg.to(device), src.to(device)
            output = model(src, trg)
            #trg = [batch size, trg len]
            #output = [batch size, trg len, output dim]
            output_dim = output.shape[-1]
            output = output[:,1:,:].reshape(-1, output_dim)
            trg = trg[:,1:].reshape(-1)
            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]
            loss = criterion(output, trg)
            epoch_loss_eval.append(loss.item())
            pbar.set_postfix(loss=loss.item())
    loss_vals_eval.append(np.mean(epoch_loss_eval))
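The slicing and reshaping before criterion is probably the least obvious part of the loop, so here is a small standalone sketch of it on random data. Several things are assumptions made for this sketch and are not taken from the code above: that criterion is nn.CrossEntropyLoss with ignore_index set to the pad index, the value PAD_IDX = 0, and that position 0 of trg holds the start-of-sequence token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, trg_len, output_dim = 2, 5, 10
PAD_IDX = 0  # hypothetical pad index, chosen for this sketch

output = torch.randn(batch_size, trg_len, output_dim)      # stands in for the model scores
trg = torch.randint(1, output_dim, (batch_size, trg_len))  # target token ids
trg[1, 3:] = PAD_IDX                                       # the second sample is shorter

# Drop position 0, then merge the batch and time dimensions so that
# every remaining token becomes one row for the loss function.
flat_output = output[:, 1:, :].reshape(-1, output_dim)
flat_trg = trg[:, 1:].reshape(-1)
# flat_output = [(trg len - 1) * batch size, output dim]
# flat_trg    = [(trg len - 1) * batch size]

# Assumed loss: cross entropy that skips pad positions
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss = criterion(flat_output, flat_trg)
```

With ignore_index, the pad positions contribute nothing to the loss, so the average is taken over real tokens only.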