[Deep Learning] Detailed Explanation of Multi-Head Attention Mechanism

One-sentence summary:
The "multi-head" in the multi-head attention mechanism is not the same idea as stacking convolutional layers in a convolutional neural network. Stacking convolutional layers is roughly like copying a single convolutional block num_layers times, and each layer can operate independently. Multi-head attention, by contrast, can be understood as splitting the input features into smaller pieces, giving each piece its own trainable weight parameters, and then sharing the output of the same hidden layer; a single head cannot be treated as a complete, independent encoder-decoder architecture that runs on its own.

I am a beginner and not sure whether my understanding is correct; please point out any mistakes, and discussion is welcome.

import torch
from torch import nn
from d2l import torch as d2l

#@save
class MultiHeadAttention(nn.Module):
    """Multi-head attention."""
    def __init__(self, key_size, query_size, value_size, num_hiddens,
                 num_heads, dropout, bias=False, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.num_heads = num_heads
        self.attention = d2l.DotProductAttention(dropout)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=bias)
        self.W_k = nn.Linear(key_size, num_hiddens, bias=bias)
        self.W_v = nn.Linear(value_size, num_hiddens, bias=bias)
        self.W_o = nn.Linear(num_hiddens, num_hiddens, bias=bias)

    def forward(self, queries, keys, values, valid_lens):
        # Shape of queries, keys, values:
        # (batch_size, no. of queries or key-value pairs, num_hiddens)
        # Shape of valid_lens: (batch_size,) or (batch_size, no. of queries)
        # After the transformation below, the shape of queries, keys, values is:
        # (batch_size * num_heads, no. of queries or key-value pairs,
        #  num_hiddens / num_heads)
        queries = transpose_qkv(self.W_q(queries), self.num_heads)
        keys = transpose_qkv(self.W_k(keys), self.num_heads)
        values = transpose_qkv(self.W_v(values), self.num_heads)

        if valid_lens is not None:
            # Along axis 0, copy the first item (a scalar or a vector)
            # num_heads times, then copy the second item, and so on.
            valid_lens = torch.repeat_interleave(
                valid_lens, repeats=self.num_heads, dim=0)

        # Shape of output:
        # (batch_size * num_heads, no. of queries, num_hiddens / num_heads)
        output = self.attention(queries, keys, values, valid_lens)

        # Shape of output_concat: (batch_size, no. of queries, num_hiddens)
        output_concat = transpose_output(output, self.num_heads)
        return self.W_o(output_concat)
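
For reference, the transpose_qkv and transpose_output helpers used above are not listed in this post. A sketch of what they look like in the d2l book (split the last dimension across the heads, fold the head dimension into the batch dimension, and invert that for the output):

#@save
def transpose_qkv(X, num_heads):
    """Reshape for the parallel computation of multiple attention heads."""
    # Input X: (batch_size, no. of queries or key-value pairs, num_hiddens)
    X = X.reshape(X.shape[0], X.shape[1], num_heads, -1)
    # (batch_size, num_heads, no. of queries or key-value pairs, num_hiddens / num_heads)
    X = X.permute(0, 2, 1, 3)
    # (batch_size * num_heads, no. of queries or key-value pairs, num_hiddens / num_heads)
    return X.reshape(-1, X.shape[2], X.shape[3])

#@save
def transpose_output(X, num_heads):
    """Reverse the operation of transpose_qkv."""
    X = X.reshape(-1, num_heads, X.shape[1], X.shape[2])
    X = X.permute(0, 2, 1, 3)
    return X.reshape(X.shape[0], X.shape[1], -1)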

This code defines a multi-head attention (MultiHeadAttention) class, which implements the forward propagation process of the multi-head attention mechanism.

The line-by-line explanation is as follows:

  1. class MultiHeadAttention(nn.Module):

    • A class called MultiHeadAttention is defined, which inherits from nn.Module.
  2. def __init__(self, key_size, query_size, value_size, num_hiddens, num_heads, dropout, bias=False, **kwargs):

    • The initialization function is used to create a MultiHeadAttention object.
    • Parameters:
      • key_size: the feature dimension of the keys.
      • query_size: the feature dimension of the queries.
      • value_size: the feature dimension of the values.
      • num_hiddens: the number of hidden units.
      • num_heads: the number of attention heads.
      • dropout: the dropout probability applied to the attention weights.
      • bias: whether to use a bias term in the linear transformations.
      • **kwargs: additional keyword arguments.
    • In the initialization function, the linear transformation layers required for multi-head attention (W_q, W_k, W_v, W_o) are created.
  3. def forward(self, queries, keys, values, valid_lens):

    • The forward propagation function defines the calculation logic of multi-head attention.
    • Parameters:
      • queries: the query tensor, of shape (batch_size, number of queries or key-value pairs, num_hiddens).
      • keys: the key tensor, of shape (batch_size, number of queries or key-value pairs, num_hiddens).
      • values: the value tensor, of shape (batch_size, number of queries or key-value pairs, num_hiddens).
      • valid_lens: the tensor of valid lengths, of shape (batch_size,) or (batch_size, number of queries).
    • In the forward pass, the following steps are performed:
  4. queries = transpose_qkv(self.W_q(queries), self.num_heads)

    • Linearly transform the query tensor and transform its shape for multi-head attention computation.
    • self.W_q(queries) performs a linear transformation on the query tensor.
    • The transpose_qkv function reshapes the transformed tensor to meet the computational requirements of multi-head attention.
  5. keys = transpose_qkv(self.W_k(keys), self.num_heads)

    • Performs a linear transformation on the key tensor, and performs a shape transformation.
  6. values = transpose_qkv(self.W_v(values), self.num_heads)

    • Linearly transforms a tensor of values, and performs a shape transformation.
  7. if valid_lens is not None:

    • Checks for the existence of a valid length.
  8. valid_lens = torch.repeat_interleave(valid_lens, repeats=self.num_heads, dim=0)

    • If valid lengths are provided, they are repeated along axis 0 according to the number of attention heads so that their shape matches the multi-head attention computation.
  9. output = self.attention(queries, keys, values, valid_lens)

    • Calls self.attention to compute the multi-head attention and obtain the output tensor.

  10. output_concat = transpose_output(output, self.num_heads)

    • A shape transformation is applied to the output tensor of multi-head attention to restore the original layout (batch_size, number of queries, num_hiddens).
  11. return self.W_o(output_concat)

    • The reshaped tensor is passed through a final linear transformation (W_o) to obtain the final output of multi-head attention.

Overall, this code defines a multi-head attention class in which queries, keys, and values are processed by linear and shape transformations, the attention output is computed with DotProductAttention, and the result is passed through a final linear transformation to obtain the output of multi-head attention.
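
As a quick sanity check of the shapes, here is a small usage sketch; the concrete sizes are arbitrary, and it assumes d2l's DotProductAttention plus the helper functions above are available:

num_hiddens, num_heads = 100, 5
attention = MultiHeadAttention(num_hiddens, num_hiddens, num_hiddens,
                               num_hiddens, num_heads, dropout=0.5)
attention.eval()

batch_size, num_queries, num_kvpairs = 2, 4, 6
valid_lens = torch.tensor([3, 2])
X = torch.ones((batch_size, num_queries, num_hiddens))   # queries
Y = torch.ones((batch_size, num_kvpairs, num_hiddens))   # keys and values
print(attention(X, Y, Y, valid_lens).shape)               # torch.Size([2, 4, 100])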

Q: Where are the multiple heads here?

A: The multiple heads are reflected in the following aspects:

  1. self.num_heads: the num_heads parameter passed to the initialization function specifies the number of heads for multi-head attention. Each attention head has its own query, key, and value linear transformation parameters.

  2. queries = transpose_qkv(self.W_q(queries), self.num_heads), keys = transpose_qkv(self.W_k(keys), self.num_heads), values = transpose_qkv(self.W_v(values), self.num_heads): these three lines apply linear transformations to the queries, keys, and values, and then reshape them with the transpose_qkv function. This reshaping turns the original query, key, and value tensors into tensors of shape (batch_size * num_heads, number of queries or key-value pairs, num_hiddens / num_heads), where num_hiddens / num_heads is the number of hidden units for each attention head.

  3. output_concat = transpose_output(output, self.num_heads): at the end of the forward pass, the output of multi-head attention is reshaped by the transpose_output function back to the shape (batch_size, number of queries, num_hiddens).

Through the above operations, the input query, key, and value tensors are divided into multiple attention heads, and each attention head performs independent linear transformation and attention calculation. Finally, the outputs of multiple attention heads are merged by shape transformation to obtain the final output of multi-head attention. Each attention head can focus on different features and contexts, thus capturing richer information and enhancing the representation and expressiveness of the model.
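
To make the splitting concrete, here is a small sketch (using the transpose_qkv helper shown earlier and arbitrary sizes) of how a tensor and the valid lengths change shape:

X = torch.ones((2, 4, 100))          # (batch_size, num_queries, num_hiddens)
H = transpose_qkv(X, num_heads=5)
print(H.shape)                       # torch.Size([10, 4, 20]), i.e. (batch_size * num_heads, num_queries, num_hiddens / num_heads)

valid_lens = torch.tensor([3, 2])    # one valid length per sample
print(torch.repeat_interleave(valid_lens, repeats=5, dim=0))
# tensor([3, 3, 3, 3, 3, 2, 2, 2, 2, 2])  -> one copy per head, shape (batch_size * num_heads,)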

Q: num_hiddens / num_heads is the number of hidden units of each attention head. That is to say, the multi-head version expands the 0th dimension of the original tensor to batch_size * num_heads, but at the same time the total number of hidden units is evenly distributed across the heads. In fact, the total number of elements of the original tensor = batch_size * (number of queries or key-value pairs) * num_hiddens, and the total number of elements after the multi-head transformation = batch_size * num_heads * (number of queries or key-value pairs) * num_hiddens / num_heads. Are these two equal?

A: Your understanding is correct.

In multi-head attention, parallel computation over multiple attention heads is achieved by expanding the 0th dimension of the original tensor (usually batch_size) to batch_size * num_heads. At the same time, to keep the total number of hidden units unchanged, the num_hiddens hidden units are evenly distributed across the attention heads, so each attention head gets num_hiddens / num_heads hidden units.

The calculation of the total number of elements you mention is also correct. The total number of elements of the original tensor is batch_size * (number of queries or key-value pairs) * num_hiddens, and the total number of elements after the multi-head transformation is batch_size * num_heads * (number of queries or key-value pairs) * (num_hiddens / num_heads). These two are equal: the multi-head transformation keeps the total number of hidden units constant and only redistributes them across the attention heads to enable parallel computation.
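
A tiny numeric check of this element-count argument, with arbitrary sizes:

batch_size, n, num_hiddens, num_heads = 2, 4, 100, 5   # n = number of queries or key-value pairs
total_before = batch_size * n * num_hiddens
total_after = (batch_size * num_heads) * n * (num_hiddens // num_heads)
assert total_before == total_after == 800               # the reshape only rearranges elements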

Q: Why should the total number of hidden units remain constant?

A: The reason why the total number of hidden units remains constant in multi-head attention is to ensure that the model's expressive power and representation ability are not affected.

In multi-head attention, each attention head has its own linear transformation parameters, namely query, key, and value projection matrices. Per head, each of these matrices has shape (input size) x (num_hiddens / num_heads); when query_size = key_size = value_size = num_hiddens, that is num_hiddens x (num_hiddens / num_heads), where num_hiddens is the original number of hidden units and num_hiddens / num_heads is the number of hidden units for each attention head.
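
In the d2l-style implementation above, these per-head matrices are stored together inside single nn.Linear layers and only split afterwards by transpose_qkv. A quick look at the stored weight shape, with illustrative sizes:

attention = MultiHeadAttention(100, 100, 100, 100, 5, dropout=0.5)
print(attention.W_q.weight.shape)   # torch.Size([100, 100]), i.e. (num_hiddens, query_size)
# Conceptually this holds 5 head-specific projections of shape (20, 100) stacked
# along the output dimension, since num_hiddens / num_heads = 20.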

By keeping the total number of hidden units constant, i.e. num_hiddens, we can ensure that each attention head is able to use the same total amount of information in multi-head attention. Although the number of hidden units per attention head is reduced, the whole model can still capture sufficiently rich features and contextual information through the parallel computation and information combination of multiple attention heads.

If we change the total number of hidden units in multi-head attention, such as increasing or decreasing it, then the expressive power and representation ability of the model will change. Increasing the total number of hidden units increases the complexity and number of parameters of the model, possibly leading to overfitting. Reducing the total number of hidden units may result in a model that is not expressive enough to capture enough features and contextual information.

Therefore, in order to maintain the expressiveness and representation ability of the model, the total number of hidden units is usually kept constant in multi-head attention and simply split evenly across the attention heads to enable parallel computation.

Q: So, the so-called multi-head essentially multiplies the number of trainable weight matrices applied to the input features, while the underlying information elements stay the same? For example, a single head has only three trainable matrices Wq, Wk, and Wv, and with multiple heads the count becomes Wq * num_heads, Wk * num_heads, and Wv * num_heads?

A: Yes, your understanding is correct.

In multi-head attention, each attention head has its own linear transformation parameters. Taking the linear transformations of the query (Q), key (K), and value (V) as an example, single-head attention has only one query matrix Wq, one key matrix Wk, and one value matrix Wv, whereas multi-head attention has multiple attention heads, and each attention head has its own query matrix Wq, key matrix Wk, and value matrix Wv.

Thus, the number of trainable weight matrices in multi-head attention is indeed the single-head count multiplied by the number of attention heads. For example, if single-head attention has 3 linear transformation matrices (Wq, Wk, Wv) and multi-head attention has 5 attention heads, then multi-head attention has 3 matrices times 5 heads, i.e. 15 matrices. Note, however, that each per-head matrix is correspondingly smaller (its output dimension is num_hiddens / num_heads), so the total number of scalar parameters stays the same, consistent with the previous answer.
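
A rough check of the bookkeeping, assuming the d2l-style implementation where each head's matrices are slices of one shared projection:

query_size, num_hiddens, num_heads = 100, 100, 5
per_head_hiddens = num_hiddens // num_heads              # 20 hidden units per head
weights_full_W_q = num_hiddens * query_size              # 10000 weights in the shared W_q
weights_per_head = per_head_hiddens * query_size         # 2000 weights in each head's slice
assert weights_full_W_q == num_heads * weights_per_head  # more (smaller) matrices, same total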

With multi-head attention, the model is able to utilize different attention heads to learn different features and contextual information, which enhances the representation and expressiveness of the model. Each attention head can focus on different aspects of the input features, thus providing a more comprehensive and diverse information representation.

Q: So the multi-head attention mechanism is different from the multiple convolution kernels of a convolutional neural network? In a convolutional neural network, the hidden representation is multiplied by the number of kernels or layers (num_layers) rather than being split, right?

A: Yes, the multi-head attention mechanism is different from the multi-layer convolution kernel in the convolutional neural network.

In multi-head attention, the attention heads act like different linear transformations computed in parallel. Each attention head has its own query, key and value linear transformation parameters, and the attention weights and context vectors are computed in parallel. Multi-head attention captures different features and contextual information through the combination of multiple attention heads.

In a convolutional neural network, "multiple convolution kernels" refers to the use of several convolution kernels in the same layer for parallel computation. Each convolution kernel performs a convolution operation on the input data and extracts specific features. Together, the kernels capture features at different levels and scales through parallel computation and weight sharing (each kernel is reused at every spatial position).
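
For comparison, a single convolutional layer with several kernels (output channels) might look like the following; the sizes are illustrative only:

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
print(conv.weight.shape)   # torch.Size([8, 3, 3, 3]): one 3x3 kernel over 3 input channels per output channel
# Each of the 8 kernels has its own weights and produces its own feature map;
# within a kernel, the same weights are reused at every spatial position.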

In multi-head attention, multiple heads perform parallel computations with different linear transformations in the same layer. Each attention head has its own linear transformation parameters and computes its own attention weights and context vectors. There is no parameter sharing between attention heads; each head's context vector is obtained by a weighted sum of the values using its attention weights.

In a convolutional neural network, the multiple convolution kernels of a layer are likewise computed in parallel. Each convolution kernel extracts different features through its convolution operation and outputs its own feature map. The kernels themselves have distinct parameters; the parameter sharing in convolution is spatial, i.e. the same kernel is reused at every position of the input, which reduces the number of parameters and improves the generalization ability of the model.

Therefore, multi-head attention and multi-layer convolutional kernels in ConvNets are different mechanisms for capturing different types of features and contextual information in different tasks and scenarios.

Attached is a diagram, compiled from the discussion area, of how the tensor shapes change in multi-head attention:


Origin blog.csdn.net/weixin_44624036/article/details/131019298