[Text Summary (3)] Seq2seq with Attention in PyTorch

Preface

Code reference:
https://github.com/jasoncao11/nlp-notebook/tree/master/4-2.Seq2seq_Att
Many thanks to the author: almost all of the text-summarization code is available there;
only a small part needed to be modified, probably due to version differences.

This code runs end to end. If you have any questions, please leave a comment and we can discuss them together.
If anything in my understanding is wrong, corrections and suggestions are welcome.

Reference:
https://www.bilibili.com/video/BV1op4y1U7ag?t=1013
https://github.com/bentrevett/pytorch-seq2seq
https://github.com/DD-DuDa/nlp_course
https://zhuanlan.zhihu.com/p/383866592

This article follows on from the previous one:
[Text Summary (2)] Seq2Seq in PyTorch
https://blog.csdn.net/WTYuong/article/details/129683262

Attention

Note: the form of attention used in seq2seq is not the most common one nowadays.
It is worth taking a closer look at the attention in the Transformer, which is more widely used and simpler.

In the previous article, we said that the encoder encodes all of the inputs into a single vector, the context; this vector comes from the output of the last layer of the Encoder.

The Decoder then decodes the corresponding sentence using only this vector.

Questions

The first question: can this context vector really contain all the information in the input sentence? Imagine translating a sentence of 100 words when the context is only 200-dimensional: it has to encode not only every word but also word order, semantics, and so on, which is almost impossible.

The second question: even assuming the context vector really does contain all the information, can the Decoder translate everything correctly just by looking at this single vector? At each step, the decoder needs to extract the information for the corresponding position from this one vector.

A simple analogy: suppose you are the decoder, listening to a one-minute speech in English and taking notes as you listen; after it ends, you have to translate the whole minute into Chinese using only the bits and pieces you wrote down.

Solution

So what should we do?

It's very simple: listen to one sentence, pause, translate it, and then continue :)

Here you can already see one of the key ideas of attention: alignment (align).

At each step of translation, our model needs to focus on the corresponding input location.

Example: suppose the model needs to translate "Change your life today". At the first step, the Decoder needs to know that the first word fed into the Encoder is "Change", and it then looks at this "Change" while translating.

How to add attention

The Encoder does not need any changes; what changes is mainly the input of the Decoder.

Decoder input: from [context vector + Embedding]

to [context vector + attention_output + Embedding].

The linear layer of the Decoder changes accordingly.

attention_output


In the absence of attention, the first input of the decoder would be [encoder's last hidden layer output + embedding]; the vector output by this hidden layer is called h0. With attention, we additionally:
1. Compute the score between h0 and each encoder hidden state s1, s2, s3, s4, s5: score(h0, sk), k = 1, ..., 5.
2. Use softmax to turn all the scores into a probability distribution over [0, 1]: ak, k = 1, ..., 5.
3. Compute the attention output, i.e. the weighted sum c of the encoder states using the attention weights ak.
And so on for the following steps; see the sketch after this list.
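
To make these steps concrete, here is a tiny self-contained sketch with made-up numbers (one example, five 4-dimensional encoder states; a plain dot-product score is used as a stand-in for the learned score function defined later in the Attention module):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
s = torch.randn(5, 4)                # encoder states s1..s5, [src len, dim]
h0 = torch.randn(4)                  # decoder's initial hidden state h0, [dim]

scores = s @ h0                      # score(h0, sk), k = 1..5
a = F.softmax(scores, dim=0)         # attention weights ak (a probability distribution)
c = (a.unsqueeze(1) * s).sum(dim=0)  # attention output: weighted sum of the encoder states

print(a, a.sum())                    # the weights sum to 1
print(c.shape)                       # torch.Size([4])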

Code structure

Code that is identical to the plain seq2seq article is not explained in detail again.

Model structure definition: model.py

The model definition code:

# -*- coding: utf-8 -*-
import random
import torch.nn as nn
import torch 
import torch.nn.functional as F

Encoder function

Compared with the previous model, which used a two-layer GRU, a bidirectional RNN is now used.
In a bidirectional RNN there are two RNNs per layer: a forward RNN that traverses the embedded sentence from left to right, and a backward RNN that traverses it from right to left.
All that needs to be done in the code is to set bidirectional=True and then pass the embedded sentence to the RNN as before.

The Encoder class builds the encoder. The internal RNN uses torch's built-in GRU, and its parameters are:

input_dim: the size of the input vocabulary
emb_dim: the dimension of the embedding
enc_hid_dim: the size of the encoder hidden layer
dec_hid_dim: the size of the decoder hidden layer (the bidirectional encoder state is projected to this size)
dropout: the dropout probability

forward parameters:

src: the source text, already converted from words to indices via the vocabulary

forward returns the overall output of the Encoder together with the output at every Encoder step; the per-step outputs are used later to compute the attention.
Optionally, to keep the pad symbols in a sequence from influencing the subsequent attention computation, the pack_padded_sequence / pad_packed_sequence utilities in nn.utils.rnn can be used to exclude the pad symbols after doc_len.
doc_len is the real length of each sequence; the RNN then only computes states up to that length instead of running over the pad symbols. pack_padded_sequence takes the embedded word sequence and the real lengths as input, so the pad symbols after doc_len are not processed:
packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, doc_len)

After the RNN has run, every sequence with doc_len < max_len is padded again so that the batch forms a rectangular matrix suitable for GPU computation. pad_packed_sequence does this; its input is the packed_outputs computed by the RNN, and the padded positions can then be masked out in the subsequent attention computation:
outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs)
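
A small self-contained sketch of this optional packing/unpacking step (assuming batch_first tensors and a doc_len tensor of true lengths; the GRU below is only a stand-in for the encoder's RNN):

import torch
import torch.nn as nn

rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
embedded = torch.randn(3, 10, 8)    # [batch size, max len, emb dim]
doc_len = torch.tensor([10, 7, 4])  # true length of each sequence (kept on the CPU)

packed_embedded = nn.utils.rnn.pack_padded_sequence(
    embedded, doc_len, batch_first=True, enforce_sorted=False)
packed_outputs, hidden = rnn(packed_embedded)

# pad back to a rectangular tensor; positions after each doc_len are zero
# and can be masked out in the attention computation
outputs, lengths = nn.utils.rnn.pad_packed_sequence(packed_outputs, batch_first=True)
print(outputs.shape)  # torch.Size([3, 10, 32])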

In the actual implementation, the fiddly part is handling the matrix dimensions: the operations often require adding or removing a dimension or permuting the dimension order. The shapes are annotated in the code, and it is worth stepping through it yourself to follow the dimension changes.

The input of the encoder is the source text; the outputs are the per-step encoder outputs and the hidden state, whose sizes are set by the constructor arguments.

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()       
        self.embedding = nn.Embedding(input_dim, emb_dim)       
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True, batch_first=True)     
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)     
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):     
        #src = [batch size, src len]
        embedded = self.dropout(self.embedding(src))
        #embedded = [batch size, src len, emb dim]
        outputs, hidden = self.rnn(embedded)
        #outputs = [batch size, src len, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        #outputs = [batch size, src len, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        return outputs, hidden
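
A quick shape check of this Encoder with made-up hyperparameters (vocabulary size 100, embedding dimension 32, encoder hidden size 64, decoder hidden size 128):

enc = Encoder(input_dim=100, emb_dim=32, enc_hid_dim=64, dec_hid_dim=128, dropout=0.5)
src = torch.randint(0, 100, (4, 9))  # a fake batch: 4 sequences of length 9
outputs, hidden = enc(src)
print(outputs.shape)  # torch.Size([4, 9, 128]) = [batch size, src len, enc hid dim * 2]
print(hidden.shape)   # torch.Size([4, 128])    = [batch size, dec hid dim]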

hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
Because a bidirectional GRU is used, the last hidden state has both a forward and a backward part. Only one GRU layer is built in this example, so the final hidden state has shape [2, batch size, hid dim].
To turn this into the decoder's hidden dimension, the two vectors are concatenated and passed through a linear layer, i.e. a single linear transformation.
This finally gives h0, the first hidden state of the decoder.

Attention module

1. self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim): this corresponds to the weight W1 in the attention score formula. The decoder hidden state h has dimension dec_hid_dim, while sk, the k-th encoder state with the forward and backward directions concatenated, has dimension enc_hid_dim * 2. The output of this layer has shape [batch size, src len, dec hid dim].
2. self.v = nn.Linear(dec_hid_dim, 1, bias = False): converts the energy at each input position into a single score; the output shape becomes [batch size, src len].
3. hidden = hidden.unsqueeze(1).repeat(1, src_len, 1): taking hidden to be h0, it has to be combined with every encoder hidden state sk, k = 1, ..., src_len, so h0 is repeated src_len times.
4. energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))): concatenates the repeated h0 with the encoder outputs and applies the linear layer and tanh to obtain the energy.
5. F.softmax(attention, dim=1): turns the scores into attention weights that sum to 1 over the source positions.

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()       
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        src_len = encoder_outputs.shape[1]     
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)                
        #hidden = [batch size, src len, dec hid dim]      
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        #energy = [batch size, src len, dec hid dim]
        attention = self.v(energy).squeeze(2)        
        #attention= [batch size, src len]        
        return F.softmax(attention, dim=1)
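
Continuing the shape check from the Encoder above, the attention weights can be sanity-checked like this (the tensors here are random placeholders with the same shapes):

attn = Attention(enc_hid_dim=64, dec_hid_dim=128)
hidden = torch.randn(4, 128)              # decoder hidden state, [batch size, dec hid dim]
encoder_outputs = torch.randn(4, 9, 128)  # [batch size, src len, enc hid dim * 2]
a = attn(hidden, encoder_outputs)
print(a.shape)       # torch.Size([4, 9]) = [batch size, src len]
print(a.sum(dim=1))  # each row sums to 1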

Decoder

1. self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim): emb_dim corresponds to the embedding of the output word, and enc_hid_dim * 2 corresponds to attention_output (the factor of 2 because the encoder is bidirectional).
2. self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim): the output linear layer of the decoder, whose input includes attention_output, the decoder output, and the embedding of the output word.
3. weighted = torch.bmm(a, encoder_outputs): this computes C(t), the context vector at step t; the input to each decoder step is then [hidden_state + C(t) + embedding]. A small shape check follows below.
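
As a quick check of the shapes in step 3 (made-up sizes consistent with the earlier examples):

a = torch.rand(4, 1, 9)                   # attention weights, [batch size, 1, src len]
encoder_outputs = torch.randn(4, 9, 128)  # [batch size, src len, enc hid dim * 2]
weighted = torch.bmm(a, encoder_outputs)  # the context vector C(t)
print(weighted.shape)                     # torch.Size([4, 1, 128])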

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention       
        self.embedding = nn.Embedding(output_dim, emb_dim)        
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim, batch_first=True)        
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)      
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, inputs, hidden, encoder_outputs):             
        #inputs = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]        
        inputs = inputs.unsqueeze(1)
        #inputs = [batch size, 1]        
        embedded = self.dropout(self.embedding(inputs))
        #embedded = [batch size, 1, emb dim]        
        a = self.attention(hidden, encoder_outputs)                
        #a = [batch size, src len]     
        a = a.unsqueeze(1)        
        #a = [batch size, 1, src len]
        weighted = torch.bmm(a, encoder_outputs)      
        #weighted = [batch size, 1, enc hid dim * 2]     
        rnn_input = torch.cat((embedded, weighted), dim = 2)    
        #rnn_input = [batch size, 1, (enc hid dim * 2) + emb dim]           
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))       
        #output = [batch size, seq len, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]    
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [batch size, 1, dec hid dim]
        #hidden = [1, batch size, dec hid dim]       
        embedded = embedded.squeeze(1)
        output = output.squeeze(1)
        weighted = weighted.squeeze(1)        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))        
        #prediction = [batch size, output dim]        
        return prediction, hidden.squeeze(0)
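
The training loop below calls model(src, trg), so a Seq2Seq module that ties the Encoder, Attention, and Decoder together is assumed even though it is not listed in this section. A minimal sketch of such a wrapper (following the pattern of the referenced bentrevett repository, adapted to the batch_first tensors used above, with teacher forcing) might look like this:

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        #src = [batch size, src len], trg = [batch size, trg len]
        batch_size, trg_len = trg.shape
        trg_vocab_size = self.decoder.output_dim
        # tensor that collects the decoder predictions for every target step
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
        encoder_outputs, hidden = self.encoder(src)
        inputs = trg[:, 0]  # the first decoder input is the <sos> token
        for t in range(1, trg_len):
            output, hidden = self.decoder(inputs, hidden, encoder_outputs)
            outputs[:, t, :] = output
            # with probability teacher_forcing_ratio feed the ground-truth token,
            # otherwise feed the decoder's own best guess
            teacher_force = random.random() < teacher_forcing_ratio
            inputs = trg[:, t] if teacher_force else output.argmax(1)
        return outputs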

Training and validation
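
The loop assumes that the model, optimizer, loss function, and the train_iter / val_iter data iterators already exist. A hedged sketch of that setup (all hyperparameter values and PAD_IDX below are placeholders; the real values come from the data-preparation code, which is not shown in this article):

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
N_EPOCHS = 10  # placeholder
CLIP = 1       # gradient clipping threshold, placeholder
PAD_IDX = 1    # placeholder: index of the <pad> token in the target vocabulary

attn = Attention(enc_hid_dim=64, dec_hid_dim=128)
enc = Encoder(input_dim=10000, emb_dim=32, enc_hid_dim=64, dec_hid_dim=128, dropout=0.5)
dec = Decoder(output_dim=10000, emb_dim=32, enc_hid_dim=64, dec_hid_dim=128, dropout=0.5, attention=attn)
model = Seq2Seq(enc, dec, device).to(device)  # Seq2Seq is the wrapper sketched above

optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)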

loss_vals = []
loss_vals_eval = []
for epoch in range(N_EPOCHS):
    model.train()
    epoch_loss= []
    pbar = tqdm(train_iter)
    pbar.set_description("[Train Epoch {}]".format(epoch))  # set a description for the progress bar
    for i, batch in enumerate(pbar):
        trg = batch.trg
        src = batch.src
        trg, src = trg.to(device), src.to(device)
        model.zero_grad()
        output = model(src, trg)
        #trg = [batch size, trg len]
        #output = [batch size, trg len, output dim]        
        output_dim = output.shape[-1]       
        output = output[:,1:,:].reshape(-1, output_dim)
        trg = trg[:,1:].reshape(-1)               
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]     
        loss = criterion(output, trg)    
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP)
        epoch_loss.append(loss.item())
        optimizer.step()
        pbar.set_postfix(loss=loss.item())
    loss_vals.append(np.mean(epoch_loss))
    
    model.eval()
    epoch_loss_eval = []
    pbar = tqdm(val_iter)
    pbar.set_description("[Eval Epoch {}]".format(epoch))
    with torch.no_grad():  # no gradients are needed during evaluation
        for i, batch in enumerate(pbar):
            trg = batch.trg
            src = batch.src
            trg, src = trg.to(device), src.to(device)
            output = model(src, trg)
            #trg = [batch size, trg len]
            #output = [batch size, trg len, output dim]
            output_dim = output.shape[-1]
            output = output[:,1:,:].reshape(-1, output_dim)
            trg = trg[:,1:].reshape(-1)
            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]
            loss = criterion(output, trg)
            epoch_loss_eval.append(loss.item())
            pbar.set_postfix(loss=loss.item())
    loss_vals_eval.append(np.mean(epoch_loss_eval))

Original article: https://blog.csdn.net/wtyuong/article/details/129580187