Attention mechanism (3): Bahdanau attention


Attention mechanism

The attention mechanism is a technique that allows neural networks to focus on key information and ignore unimportant parts when processing sequence data. It has been widely used in natural language processing, computer vision, speech recognition, and other fields.

The main idea of the attention mechanism is to make the model pay more attention to important inputs by assigning different weights to the input signals at different positions when processing sequence data. For example, when processing a sentence, the attention mechanism can adjust how much attention the model pays to each word according to that word's importance. This technique can improve the performance of the model, especially when dealing with long sequences.
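
As a toy sketch of this weighting idea (not tied to any particular model), attention can be viewed as a softmax-weighted average of the input representations:

import torch
import torch.nn.functional as F

inputs = torch.randn(4, 3)          # 4 input positions, each a 3-dimensional representation
scores = torch.randn(4)             # one relevance score per position (however it is computed)
weights = F.softmax(scores, dim=0)  # attention weights: non-negative and summing to 1
attended = (weights.unsqueeze(1) * inputs).sum(dim=0)  # weighted combination, shape (3,)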

In deep learning models, attention mechanisms are usually implemented by adding extra network layers that learn how to compute weights and apply those weights to the input signal. Common attention mechanisms include self-attention and multi-head attention.

In conclusion, the attention mechanism is a very useful technique that can help neural networks to better process sequential data and improve the performance of the model.



Review of seq2seq

Suppose the input sequence is $x_1, \ldots, x_T$, where $x_t$ is the $t$-th word in the input text sequence. At time step $t$, the recurrent neural network converts the input feature vector $\mathbf{x}_t$ of the word $x_t$ and the hidden state $h_{t-1}$ of the previous time step into the hidden state $h_t$ of the current time step.

Encoder

We use a function $f$ to describe the transformation performed by the recurrent layer of the recurrent neural network:

$$h_t = f(x_t, h_{t-1})$$

In summary, the encoder converts the hidden states of all time steps into a context variable through a chosen function $q$:

$$c = q(h_1, \ldots, h_T)$$
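
A minimal sketch of this encoder, assuming a GRU for $f$ and taking $q$ to simply return the last hidden state (one common choice):

import torch
from torch import nn

T, batch_size, embed_size, num_hiddens = 7, 4, 8, 16
f = nn.GRU(embed_size, num_hiddens)         # the recurrent transformation f
x = torch.randn(T, batch_size, embed_size)  # input features x_1, ..., x_T
h_all, h_last = f(x)                        # h_all stacks h_1, ..., h_T
c = h_last                                  # q(h_1, ..., h_T) = h_T in this sketch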

Decoder

The context variable $c$ output by the encoder encodes the entire input sequence $x_1, \ldots, x_T$. Given the output sequence $y_1, \ldots, y_{T'}$ from the training dataset, for each time step $t'$ (distinct from the input-sequence or encoder time step $t$), the decoder output $y_{t'}$ depends on the previous output subsequence $y_1, \ldots, y_{t'-1}$ and the context variable $c$, i.e. $P(y_{t'} \mid y_1, \ldots, y_{t'-1}, c)$.

To model this conditional probability over sequences, we can use another recurrent neural network as the decoder. At any time step $t'$ of the output sequence, the RNN takes the output $y_{t'-1}$ from the previous time step and the context variable $c$ as its input, and combines them with the previous hidden state $s_{t'-1}$ to produce the hidden state $s_{t'}$. Therefore, the transformation of the decoder's hidden layer can be represented by a function $g$:

$$s_{t'} = g(y_{t'-1}, c, s_{t'-1})$$
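
Correspondingly, a minimal sketch of one decoder step $g$, assuming a GRUCell and an already-embedded previous output token (all names here are illustrative):

import torch
from torch import nn

batch_size, embed_size, num_hiddens = 4, 8, 16
g = nn.GRUCell(embed_size + num_hiddens, num_hiddens)  # the decoder recurrence g

y_prev = torch.randn(batch_size, embed_size)   # embedding of the previous output y_{t'-1}
c = torch.randn(batch_size, num_hiddens)       # context variable c
s_prev = torch.randn(batch_size, num_hiddens)  # previous decoder hidden state s_{t'-1}

# s_{t'} = g(y_{t'-1}, c, s_{t'-1}): feed the concatenation [y_{t'-1}; c] and the previous state
s_t = g(torch.cat([y_prev, c], dim=1), s_prev)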

Attention

Bahdanau attention is an attention mechanism used in neural machine translation, first proposed by Dzmitry Bahdanau et al. in 2015.

In traditional neural machine translation models, an encoder encodes a source-language sentence into a fixed-length vector, and a decoder uses this vector to generate the target-language sentence. However, this fixed-length representation has some drawbacks, such as difficulty handling longer sentences and an inability to capture the complex dependencies between different words in a sentence.

Bahdanau attention addresses these issues by introducing a dynamic alignment mechanism. At each time step, the decoder computes an alignment distribution from its previous hidden state and all hidden states of the encoder; this distribution indicates which encoder positions are most useful for the output at the current time step. The decoder then uses the correspondingly weighted encoder hidden states to generate the output at the current time step.

This dynamic alignment mechanism enables the model to better capture the complex dependencies between the source language and the target language, thereby improving the accuracy and fluency of translation. Bahdanau attention has been widely used in machine translation, speech recognition, image captioning, and other fields, and has become one of the mainstream methods in neural machine translation.

Bahdanau model

This new attention-based model is the same as the seq2seq model above, except that the context variable $c$ at any decoding time step $t'$ is replaced by $c_{t'}$. Suppose there are $T$ tokens in the input sequence; the context variable at decoding time step $t'$ is the output of attention pooling:

$$c_{t'} = \sum_{t=1}^{T} \alpha(s_{t'-1}, h_t)\, h_t$$

where the decoder hidden state $s_{t'-1}$ at time step $t'-1$ is the query, the encoder hidden states $h_t$ are both the keys and the values, and the attention weight $\alpha$ is computed using an additive attention scoring function.
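
A minimal sketch of this additive scoring and pooling step (written out with illustrative weight names; it mirrors what d2l's AdditiveAttention does internally, but is not that class's code):

import torch
from torch import nn

T, batch_size, num_hiddens = 7, 4, 16
W_q = nn.Linear(num_hiddens, num_hiddens, bias=False)  # projects the query s_{t'-1}
W_k = nn.Linear(num_hiddens, num_hiddens, bias=False)  # projects the keys h_t
w_v = nn.Linear(num_hiddens, 1, bias=False)            # maps the combined features to a scalar score

s_prev = torch.randn(batch_size, num_hiddens)   # decoder state s_{t'-1}, the query
h = torch.randn(batch_size, T, num_hiddens)     # encoder states h_1, ..., h_T, the keys and values

# Additive score for each t: w_v^T tanh(W_q s_{t'-1} + W_k h_t)
scores = w_v(torch.tanh(W_q(s_prev).unsqueeze(1) + W_k(h))).squeeze(-1)  # (batch_size, T)
alpha = torch.softmax(scores, dim=-1)              # attention weights alpha(s_{t'-1}, h_t)
c_t = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)  # c_{t'} = sum_t alpha * h_t, shape (batch_size, num_hiddens)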


Implementation

Model definition

import torch
from torch import nn
from d2l import torch as d2l


#@save
class AttentionDecoder(d2l.Decoder):
    """The base interface for decoders with an attention mechanism"""
    def __init__(self, **kwargs):
        super(AttentionDecoder, self).__init__(**kwargs)

    @property
    def attention_weights(self):
        raise NotImplementedError

Next, let's implement a recurrent neural network decoder with Bahdanau attention in the Seq2SeqAttentionDecoder class below. To initialize the decoder's state, the following inputs are required:

  1. The hidden states of the encoder's final layer at all time steps, which are used as the keys and values for attention;

  2. The hidden state of the encoder at all layers at the final time step, which is used to initialize the hidden state of the decoder;

  3. The valid length of the encoder, so that padding tokens are excluded from attention pooling.

At each decoding time step, the final-layer hidden state of the decoder at the previous time step is used as the query. The attention output and the input embedding are then concatenated to form the input of the RNN decoder.

class Seq2SeqAttentionDecoder(AttentionDecoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqAttentionDecoder, self).__init__(**kwargs)
        self.attention = d2l.AdditiveAttention(
            num_hiddens, num_hiddens, num_hiddens, dropout)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(
            embed_size + num_hiddens, num_hiddens, num_layers,
            dropout=dropout)
        self.dense = nn.Linear(num_hiddens, vocab_size)

    def init_state(self, enc_outputs, enc_valid_lens, *args):
        # outputs shape: (batch_size, num_steps, num_hiddens)
        # hidden_state shape: (num_layers, batch_size, num_hiddens)
        outputs, hidden_state = enc_outputs
        return (outputs.permute(1, 0, 2), hidden_state, enc_valid_lens)

    def forward(self, X, state):
        # enc_outputs shape: (batch_size, num_steps, num_hiddens)
        # hidden_state shape: (num_layers, batch_size, num_hiddens)
        enc_outputs, hidden_state, enc_valid_lens = state
        # After embedding and permuting, X shape: (num_steps, batch_size, embed_size)
        X = self.embedding(X).permute(1, 0, 2)
        outputs, self._attention_weights = [], []
        for x in X:
            # query shape: (batch_size, 1, num_hiddens)
            query = torch.unsqueeze(hidden_state[-1], dim=1)
            # context shape: (batch_size, 1, num_hiddens)
            context = self.attention(
                query, enc_outputs, enc_outputs, enc_valid_lens)
            # Concatenate on the feature dimension
            x = torch.cat((context, torch.unsqueeze(x, dim=1)), dim=-1)
            # Reshape x to (1, batch_size, embed_size + num_hiddens)
            out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state)
            outputs.append(out)
            self._attention_weights.append(self.attention.attention_weights)
        # After the fully connected layer transformation, outputs shape:
        # (num_steps, batch_size, vocab_size)
        outputs = self.dense(torch.cat(outputs, dim=0))
        return outputs.permute(1, 0, 2), [enc_outputs, hidden_state,
                                          enc_valid_lens]

    @property
    def attention_weights(self):
        return self._attention_weights

This code implements the core of the Bahdanau attention model. The Bahdanau attention model is a model for sequence-to-sequence (seq2seq) learning; its core idea is that, at each step of the decoder, the model dynamically selects different parts of the encoder output according to the current decoder state in order to compute the output at the current moment.

Specifically, the implementation process of this code segment is as follows:

  1. At each step of the decoder, the current decoder state and the encoder outputs enc_outputs are first fed into the attention computation to obtain the current context vector context. The computation uses the additive attention mechanism, where query is the decoder state at the current moment, the keys and values are the encoder outputs enc_outputs, and enc_valid_lens is the valid length of the encoder.

  2. The input x at the current moment and the context vector context are concatenated along the feature dimension to form a new input x.

  3. The new x and the decoder state hidden_state are fed into the GRU, which produces the output out at the current moment and the new decoder state hidden_state.

  4. The output out at the current moment is appended to the output list outputs.

  5. The attention weights at the current moment are appended to the attention weight list _attention_weights.

These steps are repeated until the outputs at all time steps of the decoder have been computed.

During implementation, note that the dimensions of the input data must be consistent with the design of the model. Specifically, the input data X should have shape (batch_size, seq_length), where batch_size is the batch size and seq_length is the sequence length. The recurrent neural network (RNN) layer in the model expects input of shape (seq_length, batch_size, input_size), so the input data must have its dimensions transposed. In this code, PyTorch's permute() function is used to transform the dimensions of the input data.
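
A tiny illustration of this transposition with permute() (shapes only, values are arbitrary):

import torch

batch_size, seq_length, embed_size = 4, 7, 8
X = torch.randn(batch_size, seq_length, embed_size)  # (batch_size, seq_length, embed_size)
X_rnn = X.permute(1, 0, 2)                           # (seq_length, batch_size, embed_size), as the RNN expects
print(X_rnn.shape)  # torch.Size([7, 4, 8])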

At the same time, when computing attention at each step, the encoder's valid length enc_valid_lens is used to mask the invalid (padding) positions so that they do not affect the attention computation. Here this happens automatically inside the AdditiveAttention module, which applies a masked softmax over the attention scores.
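
The effect of this masking can be sketched in isolation as follows (a standalone illustration of the idea, not d2l's internal implementation):

import torch

scores = torch.randn(2, 5)         # attention scores for 2 queries over 5 key positions
valid_lens = torch.tensor([3, 5])  # the first sequence has only 3 real (non-padding) tokens

# Mark positions beyond the valid length and push their scores far negative before softmax
mask = torch.arange(5)[None, :] < valid_lens[:, None]  # (2, 5) boolean mask
masked_scores = scores.masked_fill(~mask, -1e6)
weights = torch.softmax(masked_scores, dim=-1)         # padded positions receive (near-)zero weight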

Specifically, the decoder's constructor accepts the following parameters:

vocab_size: the size of the vocabulary
embed_size: the size of the embedding vector
num_hiddens: the number of hidden units in the RNN layer
num_layers: the number of layers in the RNN layer
dropout: the probability of dropout

In the constructor, the decoder first calls the constructor of its parent class AttentionDecoder for initialization. Then, it defines four modules:

self.attention: An additive attention model that takes as input the output of the encoder and the hidden state of the decoder, and returns a context vector to guide the decoder to generate the target sequence at the current time step.

self.embedding: An embedding layer that converts the target sequence into an embedding vector.

self.rnn: An RNN layer built from GRU units that takes the concatenation of the context vector and the embedding vector as input and returns the output and hidden state at the current time step.

self.dense: A fully connected layer that maps the RNN output at each time step to scores over the target vocabulary.

Next, the decoder defines an init_state method that takes the encoder's output, the encoder's effective length, and any other parameters and transforms them into the decoder's initial state. This method swaps the first and second dimensions of the encoder's output and adds the effective length as a third element to the state.

Finally, the decoder defines a forward method that takes as input the target sequence and the state of the decoder, and returns the resulting sequence and its final state. In this method, the target sequence is first converted into an embedding vector, and then each time step in the sequence is iterated over. At each time step, the decoder uses the attention mechanism to compute a context vector and concatenates it with the embedding vector as input to the RNN. The decoder then adds the RNN output to one list and the attention weights to another list. Finally, the decoder uses a fully connected layer to transform all outputs into scores over the vocabulary, and returns the resulting sequence and its state. Attention weights can be accessed through the property self.attention_weights.

In natural language processing, word embedding refers to mapping words into a low-dimensional real-valued vector space. By representing each word as a vector, words can be processed more conveniently in a neural network.
In this code, nn.Embedding is a PyTorch module used to define a word-vector embedding layer. In the initialization function of Seq2SeqAttentionDecoder, self.embedding is an instance of nn.Embedding, which encodes words into their corresponding word vectors. Specifically, given an integer corresponding to the index of a word, the embedding layer converts that integer into a fixed-size vector, which is then used as input to the neural network.
In this code, self.embedding receives X as input, where X is an integer tensor with shape (batch_size, num_steps). This embedding layer converts each integer into a word vector, and these word vectors form a tensor of shape (batch_size, num_steps, embed_size). In this way, we can encode a sequence of text into a corresponding sequence of word vectors and use these vectors in a neural network for processing.
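
A small, self-contained example of this shape transformation (the hyperparameters here are arbitrary and chosen only for illustration):

import torch
from torch import nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=8)  # vocab_size=10, embed_size=8
X = torch.zeros((4, 7), dtype=torch.long)                     # (batch_size, num_steps) of word indices
E = embedding(X)
print(E.shape)  # torch.Size([4, 7, 8]), i.e. (batch_size, num_steps, embed_size)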

encoder = d2l.Seq2SeqEncoder(vocab_size=10, embed_size=8, num_hiddens=16,
                             num_layers=2)
encoder.eval()
decoder = Seq2SeqAttentionDecoder(vocab_size=10, embed_size=8, num_hiddens=16,
                                  num_layers=2)
decoder.eval()
X = torch.zeros((4, 7), dtype=torch.long)  # (batch_size,num_steps)
state = decoder.init_state(encoder(X), None)
output, state = decoder(X, state)
output.shape, len(state), state[0].shape, len(state[1]), state[1][0].shape
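
As a sanity check, the shapes implied by these hyperparameters (vocab_size=10, num_hiddens=16, num_layers=2, batch_size=4, num_steps=7) should work out as follows; this is derived from the code above rather than a recorded run:

# output: (batch_size, num_steps, vocab_size)
assert output.shape == (4, 7, 10)
# state = [enc_outputs, hidden_state, enc_valid_lens], so len(state) == 3
assert state[0].shape == (4, 7, 16)  # enc_outputs: (batch_size, num_steps, num_hiddens)
assert state[1].shape == (2, 4, 16)  # hidden_state: (num_layers, batch_size, num_hiddens)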

Training

embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1
batch_size, num_steps = 64, 10
lr, num_epochs, device = 0.005, 250, d2l.try_gpu()

train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)
encoder = d2l.Seq2SeqEncoder(
    len(src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqAttentionDecoder(
    len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
net = d2l.EncoderDecoder(encoder, decoder)
d2l.train_seq2seq(net, train_iter, lr, num_epochs, tgt_vocab, device)


Test

engs = ['go .', "i lost .", 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
for eng, fra in zip(engs, fras):
    translation, dec_attention_weight_seq = d2l.predict_seq2seq(
        net, eng, src_vocab, tgt_vocab, num_steps, device, True)
    print(f'{eng} => {translation}, ',
          f'bleu {d2l.bleu(translation, fra, k=2):.3f}')


Attention visualization

# Reconstruct the attention weights recorded during prediction into a
# (1, 1, num_output_steps, num_steps) tensor (assumed structure of
# dec_attention_weight_seq: one (1, 1, num_steps) weight matrix per output step)
attention_weights = torch.cat(
    [step[0][0][0] for step in dec_attention_weight_seq], 0).reshape(
    (1, 1, -1, num_steps))
# Plus one to include the end-of-sequence token
d2l.show_heatmaps(
    attention_weights[:, :, :, :len(engs[-1].split()) + 1].cpu(),
    xlabel='Key positions', ylabel='Query positions')

