Multi-Head Attention in the Transformer - Why Do You Need Multiple Heads?

Why does the Transformer use a multi-head attention mechanism?

Multiple heads can learn features and information in different dimensions. But why can they learn information from different dimensions?

The answer is: the multi-head attention mechanism is built from several single self-attention modules. Self-attention learns data features by generating Q, K, and V matrices, and each self-attention head ultimately produces output features in one representation subspace. With multi-head attention, the model can therefore learn feature information across multiple subspaces, which lets it understand the data from multiple perspectives. In addition, the heads are computed in parallel, which suits current hardware architectures and improves computational efficiency. A detailed explanation follows.

First, it should be clear that the building block of multi-head attention is a single self-attention module, so let's first walk through how self-attention is computed.

The calculation of self-attention is shown on the left of the figure below: Q, K, and V are obtained by applying learned linear transformations to the input.

Finally, the output of self-attention is obtained through the formula below, which we can interpret as feature information in a single representation subspace.
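For reference, this is the standard scaled dot-product attention formula from "Attention Is All You Need":

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vectors; dividing by $\sqrt{d_k}$ keeps the dot products from growing too large before the softmax.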

The multi-head attention mechanism is shown in the figure below. It allows the model to attend to different positions from different representation subspaces (the projection matrices that produce Q, K, and V in the figure above are learned parameters), thereby capturing features and information at different levels. Each head can learn different relationships, which allows the model to understand the data from different perspectives.
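To make this concrete, here is a minimal PyTorch-style sketch of multi-head attention (the class name and dimensions are hypothetical, not taken from the figure): the input is projected into h sets of Q, K, V, scaled dot-product attention runs on every head in parallel, and the per-head outputs are concatenated and projected back to the model dimension.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        # Learned projections that produce Q, K, V and the final output
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, n, _ = x.shape
        # Project and split into heads: (batch, heads, seq_len, d_k)
        q = self.w_q(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention, computed for all heads in parallel
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        out = torch.softmax(scores, dim=-1) @ v
        # Concatenate the heads and apply the output projection
        out = out.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.w_o(out)

x = torch.randn(2, 10, 512)              # (batch, seq_len, d_model)
print(MultiHeadAttention()(x).shape)     # torch.Size([2, 10, 512])
```

Splitting d_model into h heads of size d_k = d_model / h keeps the total computation roughly the same as one big head, while letting each head attend in its own subspace.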

Some common questions:

1. Why are Q and K generated with different weight matrices? Why can't we use the same projection and take the dot product of the result with itself?

Answer: 1) Using different weight matrices increases the expressive power of the model: the separate projections let the model learn to match different features in different subspaces. 2) Different weight matrices also give the model more useful parameters, so it can learn more information instead of relying on a single weight matrix. A small sketch illustrating why a shared matrix is limiting follows.
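As a minimal illustration (PyTorch, hypothetical dimensions): if Q and K come from the same projection, the raw score matrix QK^T is always symmetric, meaning token i scores token j exactly as token j scores token i, which restricts the attention patterns the model can express; separate projections remove this constraint.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8                        # hypothetical embedding size
x = torch.randn(5, d_model)        # 5 tokens

# Shared projection for Q and K: the score matrix is forced to be symmetric.
w_shared = nn.Linear(d_model, d_model, bias=False)
q_s = k_s = w_shared(x)
scores_shared = q_s @ k_s.T
print(torch.allclose(scores_shared, scores_shared.T))   # True

# Separate projections: the score matrix need not be symmetric,
# so token i can attend to j differently than j attends to i.
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
scores_sep = w_q(x) @ w_k(x).T
print(torch.allclose(scores_sep, scores_sep.T))          # almost surely False
```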

2. Why does self-attention use dot-product attention instead of additive attention? What is the difference in complexity between the two?

Answer: Dot-product attention usually performs better in practice. The two are similar in theoretical complexity: scoring every query-key pair costs on the order of O(n^2 * d), where n is the sequence length and d is the feature dimension. However, dot-product attention can be implemented with highly optimized matrix-multiplication routines, so it is much faster and more space-efficient in practice. The two scoring functions are shown below.
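For reference, the two scoring functions being compared are the scaled dot-product score used in the Transformer and the additive (Bahdanau-style) score, where $W_1$, $W_2$, and $v$ are learned parameters:

$$\mathrm{score}_{\text{dot}}(q, k) = \frac{q^{\top} k}{\sqrt{d_k}}, \qquad \mathrm{score}_{\text{add}}(q, k) = v^{\top} \tanh\!\left(W_1 q + W_2 k\right)$$

Both require a score for every query-key pair, hence the O(n^2 * d) term; the dot-product version simply reduces to one large matrix multiplication.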

3. What is the structure of the Transformer encoder?

Answer: The encoder is a stack of multiple identical blocks, as shown in the figure below. Each block contains two sub-modules, a multi-head attention module and a position-wise feed-forward module, and each sub-module is wrapped in a residual connection followed by a layer-normalization layer. A minimal code sketch of one block follows.
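A minimal PyTorch-style sketch of one encoder block (hypothetical hyperparameters, post-norm layout as in the original paper, not necessarily the figure's exact configuration):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: multi-head attention + feed-forward,
    each followed by a residual connection and layer normalization."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-module 1: multi-head self-attention + residual + LayerNorm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))
        # Sub-module 2: position-wise feed-forward + residual + LayerNorm
        x = self.norm2(x + self.drop(self.ff(x)))
        return x

x = torch.randn(2, 10, 512)          # (batch, seq_len, d_model)
print(EncoderBlock()(x).shape)       # torch.Size([2, 10, 512])
```

A full encoder simply stacks several such blocks (plus input embeddings and positional encodings, which are omitted here).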


Origin blog.csdn.net/qq_39333636/article/details/134649271