Learn TF Attention while writing code

1. Attention background introduction

In general, an attention mechanism lets a network concentrate its resources on the parts of the input that require attention and selectively down-weight the parts that matter little to the result. The idea comes from human visual attention: because people have limited energy and cannot attend to every detail, they selectively ignore unnecessary or unimportant parts and focus on the important ones, and this perception in turn guides where the eyes move next [19]. By focusing computation on the parts that are useful for the task, the attention mechanism greatly reduces the complexity of the network.

In 2014, Mnih et al. proposed a recurrent neural network model with a visual attention mechanism (Recurrent Models of Visual Attention) and found that adding attention let the recurrent network better identify the important parts of an image, which improved both the model's computational efficiency and its image recognition accuracy [21]. In 2015, Bahdanau et al. introduced the attention mechanism to natural language processing for the first time, applying it to machine translation and alignment tasks; compared with traditional neural network models, accuracy improved considerably [22]. Later that year, Luong et al. built on Bahdanau's attention and proposed two variants, the global attention model and the local attention model, which further improved seq2seq models on natural language processing tasks [23]. The main difference between the two is that the global attention model uses the vectors of all tokens when computing attention, while the local attention model uses only the vectors of a subset of the tokens. In 2016, Yin et al. proposed the attention-based CNN (ABCNN), the first work to introduce the attention mechanism into a convolutional neural network for sentence modeling; experiments on an answer selection (AS) dataset showed a large improvement in classification accuracy [24]. In 2017, Vaswani et al. trained a translation model using only the self-attention mechanism and fully connected networks, and achieved higher accuracy than earlier neural network models [10].

[10] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems. 2017: 5998-6008.

[21] Mnih V, Heess N, Graves A, et al. Recurrent models of visual attention[C]//Advances in Neural Information Processing Systems. 2014: 2204-2212.

[22] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[J]. Computer Science, 2014.

[23] Luong M T, Pham H, Manning C D. Effective approaches to attention-based neural machine translation[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, 2015: 1412-1421.

[24] Yin W, Schütze H, Xiang B, et al. ABCNN: Attention-based convolutional neural network for modeling sentence pairs[J]. Computer Science, 2015.

2. What is Attention?

The attention mechanism is an important concept in machine learning and artificial intelligence. It simulates the attention involved in human perception processes such as vision and hearing. Its goal is to let the model pay more attention to task-relevant parts of the input and ignore task-irrelevant information. The mechanism was originally inspired by the way the human brain processes information.

The basic principles of the attention mechanism are as follows:

  1. Input information: First, the attention mechanism receives input information, which can be sequence data, images, speech, etc.

  2. Query, key and value: For each input, the attention mechanism introduces three components: the query, the keys and the values, which are usually produced by learned neural network layers. The query represents what you want to focus on, the keys represent features of the input information, and the values are the information associated with each key.

  3. Weight assignment: The attention mechanism computes weights from the relationship between the query and each key, and these weights determine how much each value contributes to the final output. The weights are usually computed with some form of similarity measure (such as dot product or scaled dot product).

  4. Weighted sum: Multiply each value by its computed weight and sum the results to obtain the final output. This output reflects the parts of the input that the model focused on while processing it (a minimal code sketch of steps 2-4 follows this list).

  5. Repeat: The above process is often repeated multiple times so that the model can dynamically adjust attention in different contexts.
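
The following is a minimal numpy sketch of steps 2-4 above, assuming dot-product similarity and made-up toy matrices (illustrative only, not the Keras implementation):

import numpy as np

# toy inputs: 2 queries and 3 key/value pairs, each of dimension 4 (made-up numbers)
rng = np.random.default_rng(0)
query = rng.random((2, 4))
key = rng.random((3, 4))
value = rng.random((3, 4))

scores = query @ key.T                                                  # similarity of each query with each key
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # normalize into a distribution (softmax)
output = weights @ value                                                # weighted sum of the values
print(weights.shape, output.shape)                                      # (2, 3) (2, 4)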

The core idea of the attention mechanism is to let the model automatically determine which parts to focus on when processing input information, thereby improving its performance on various tasks. It is widely used in natural language processing, computer vision and speech processing, for example the Transformer model in machine translation, the U-Net model in image segmentation, and the Listen, Attend and Spell (LAS) model in speech recognition.

In general, the attention mechanism can help the model better understand and utilize input information, improving the performance and generalization ability of the model.

3. Why Attention?

LSTM and GRU only partially alleviate the long-range dependency problem of recurrent neural networks; their ability to memorize information is limited, and their sequential computation restricts parallelism. If such a model is to remember a lot of information, it has to be made more complex. The attention mechanism emerged to solve these problems: it selects the important information from a large amount of input, which keeps the model simpler, and it lends itself to efficient parallel computation. The attention computation is essentially a matching process: a query vector is matched against key-value pairs, and the output is produced from the matched values.

The calculation of attention generally has three stages. The first stage computes the correlation or similarity between the query vector Q and each key K_i, giving the attention score S_i:

S_i = f(Q, K_i)

The second stage applies the softmax function to the scores from the first stage, normalizing them into a probability distribution a_i: the numerator maps each score to (0, +∞), and the denominator is the sum over all scores. The formula is as follows:

a_i = softmax(S_i) = e^{S_i} / \sum_j e^{S_j}

In the third stage, the values are weighted by the coefficients from the second stage and summed to obtain the final Attention value:

Attention(Q, K, V) = \sum_i a_i V_i
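
As a small worked example of the three stages (with made-up numbers): suppose the query matches two keys with scores S_1 = 1 and S_2 = 2, and the corresponding values are V_1 = 1 and V_2 = 3. The softmax stage gives a_1 = e^1/(e^1 + e^2) ≈ 0.269 and a_2 = e^2/(e^1 + e^2) ≈ 0.731, so Attention ≈ 0.269·1 + 0.731·3 ≈ 2.46; the output is pulled toward the value whose key matched the query more strongly.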

4. Introduction to TF attention API

Attention class

tf.keras.layers.Attention(use_scale=False, score_mode="dot", **kwargs)

Dot-product attention layer, a.k.a. Luong-style attention.

Inputs are a query tensor of shape [batch_size, Tq, dim], a value tensor of shape [batch_size, Tv, dim], and a key tensor of shape [batch_size, Tv, dim]. The calculation follows these steps:

  1. Calculate scores with shape [batch_size, Tq, Tv] as a query-key dot product: scores = tf.matmul(query, key, transpose_b=True).
  2. Use scores to calculate a distribution with shape [batch_size, Tq, Tv]: distribution = tf.nn.softmax(scores).
  3. Use distribution to create a linear combination of value with shape [batch_size, Tq, dim]: return tf.matmul(distribution, value).
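
As a quick check of these shapes, the layer can be called on small random tensors (a minimal sketch assuming TensorFlow 2.x; the tensor contents are random, only the shapes matter):

import tensorflow as tf

# made-up sizes: batch_size=1, Tq=2, Tv=3, dim=4
query = tf.random.normal((1, 2, 4))
value = tf.random.normal((1, 3, 4))
key = tf.random.normal((1, 3, 4))

layer = tf.keras.layers.Attention()  # defaults: use_scale=False, score_mode="dot"
output, scores = layer([query, value, key], return_attention_scores=True)
print(scores.shape)   # (1, 2, 3) -> [batch_size, Tq, Tv]
print(output.shape)   # (1, 2, 4) -> [batch_size, Tq, dim]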

5. Experimental code

5.1. Verify and understand the TF Attention layer, passing only the query and value matrices (when no key is given, the layer uses value as the key).

import numpy as np
import tensorflow as tf


def softmax(t):
    # softmax along the last axis
    s_value = np.exp(t) / np.sum(np.exp(t), axis=-1, keepdims=True)
    return s_value

def numpy_attention(inputs,
        mask=None,
        training=None,
        return_attention_scores=False,
        use_causal_mask=False):
    # The signature mirrors tf.keras.layers.Attention.call; only the default
    # dot-product path is reproduced here, the other arguments are unused.
    query = inputs[0]
    value = inputs[1]
    key = inputs[2] if len(inputs) > 2 else value  # key defaults to value, as in Keras

    score = np.matmul(query, key.transpose())      # query-key dot product
    attention_score_np = softmax(score)            # normalize the scores per query
    result = np.matmul(attention_score_np, value)  # weighted sum of the values
    print('attention score in numpy =', attention_score_np)
    print('result in numpy = ', result)


def verify_logic_in_attention_with_query_value():
    query_data = np.array(
        [[1, 0.0, 1],[2, 3, 1]]
    )
    value_data = np.array(
        [[2, 1.0, 1],[1, 4, 2 ]]
    )
    print(query_data.shape)

    numpy_attention([query_data, value_data], return_attention_scores=True)
    print("=============following is keras attention output================")

    attention_layer= tf.keras.layers.Attention()

    result, attention_scores = attention_layer([query_data, value_data], return_attention_scores=True)

    print('attention_scores = ', attention_scores)
    print('result=', result)


if __name__ == '__main__':
    verify_logic_in_attention_with_query_value()

Output:

(2, 3)
attention score in numpy = [[5.0000000e-01 5.0000000e-01]
 [3.3535013e-04 9.9966465e-01]]
result in numpy =  [[1.5        2.5        1.5       ]
 [1.00033535 3.99899395 1.99966465]]
=============following is keras attention output================
attention_scores =  tf.Tensor(
[[5.0000000e-01 5.0000000e-01]
 [3.3535014e-04 9.9966466e-01]], shape=(2, 2), dtype=float32)
result= tf.Tensor(
[[1.5       2.5       1.5      ]
 [1.0003353 3.998994  1.9996647]], shape=(2, 3), dtype=float32)

5.2. Verify and understand the TF Attention layer, passing the query, key and value matrices.

def verify_logic_in_attention_with_query_key_value():
    query_data = np.array(
        [[1, 0.0, 1],[2, 3, 1]]
    )
    value_data = np.array(
        [[2, 1.0, 1],[1, 4, 2 ]]
    )
    key_data = np.array(
        [[1, 2.0, 2], [3, 1, 0.1]]
    )
    print(query_data.shape)

    numpy_attention([query_data, value_data, key_data], return_attention_scores=True)
    print("=============following is keras attention output================")

    attention_layer= tf.keras.layers.Attention()

    result, attention_scores = attention_layer([query_data, value_data, key_data], return_attention_scores=True)

    print(attention_layer.get_weights())  # [] -- the default layer has no trainable weights
    print('attention_scores = ', attention_scores)
    print('result=', result)


if __name__ == '__main__':
    verify_logic_in_attention_with_query_key_value()

Output:

(2, 3)
attention score in numpy = [[0.47502081 0.52497919]
 [0.7109495  0.2890505 ]]
result in numpy =  [[1.47502081 2.57493756 1.52497919]
 [1.7109495  1.86715149 1.2890505 ]]
=============following is keras attention output================
[]
attention_scores =  tf.Tensor(
[[0.47502086 0.52497923]
 [0.7109495  0.28905058]], shape=(2, 2), dtype=float32)
result= tf.Tensor(
[[1.4750209 2.5749378 1.5249794]
 [1.7109495 1.8671517 1.2890506]], shape=(2, 3), dtype=float32)
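
The empty list printed by get_weights() above is expected: with the default use_scale=False the layer has no trainable weights. If use_scale=True is passed, the layer learns a single scalar that multiplies the scores; a brief sketch of that variant (assumed behavior in TF 2.x, reusing the query_data/value_data/key_data matrices from 5.2):

scaled_layer = tf.keras.layers.Attention(use_scale=True)
_ = scaled_layer([query_data, value_data, key_data])  # calling the layer builds its weights
print(scaled_layer.get_weights())  # expected: one scalar scale weight, initialized to 1.0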

Origin blog.csdn.net/keeppractice/article/details/132635732