[Natural Language Processing (NLP)] A detailed explanation of the inputs and outputs of the BERT pre-trained model, and of CNN and LSTM layers stacked on BERT

1. Input and output of BertModel

from transformers import BertModel

bert = BertModel.from_pretrained('bert-base-chinese')
out = bert(context, attention_mask=mask)  # context: token-id tensor, mask: attention mask (both described below)

1. Input

The input tensor context of the BERT model needs to meet the following requirements:

  1. Tensor shape: context should be a 2D tensor with shape [batch_size, sequence_length], where

    • batch_size is the batch size of the input samples,
    • sequence_length is the length of the input sequence.
  2. Data type: the data type of context should be an integer type, e.g. torch.LongTensor.

  3. Value range: the values in context should be word indices in the vocabulary. Typically, special symbols in the vocabulary are assigned predefined indices, such as [PAD], [UNK], [CLS] and [SEP]. The remaining words are mapped to their corresponding indices in the vocabulary.

In addition, in order to effectively control the model's attention and improve computational efficiency, an attention_mask tensor can be passed in. attention_mask is a binary tensor (consisting of 0s and 1s) of the same shape as the input tensor, which indicates which positions are valid (1 means valid) and which positions are padding (0 means padding). The attention weights for the padded positions are set to zero, so the model does not attend to them.

Note that the length of the input tensor is limited by the maximum sequence length of the BERT model (512 tokens for standard BERT); parts exceeding the maximum length need to be truncated or otherwise processed.

To sum up, the input context tensor of the BERT model should be a two-dimensional integer tensor of shape [batch_size, sequence_length], used in combination with the attention_mask tensor to identify the padded positions.
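As a concrete illustration, the following minimal sketch (not from the original code; the example sentences are arbitrary) shows how the context and mask tensors are typically produced with the matching tokenizer:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
bert = BertModel.from_pretrained('bert-base-chinese')

# Tokenize a small batch; padding/truncation keep all rows the same length
batch = tokenizer(['今天天气很好', '你好'], padding=True, truncation=True,
                  max_length=32, return_tensors='pt')
context = batch['input_ids']       # [batch_size, sequence_length], torch.LongTensor
mask = batch['attention_mask']     # same shape; 1 = real token, 0 = padding

out = bert(context, attention_mask=mask)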

2. Output

For example, when the BERT model is called as above, the output out contains several attributes: last_hidden_state, pooler_output, hidden_states, past_key_values, attentions, and cross_attentions. Note that hidden_states and attentions are only populated when the model is called with output_hidden_states=True and output_attentions=True respectively, and past_key_values and cross_attentions only apply when the model is configured as a decoder.

The following is the meaning of each attribute in the output of the BERT model:

  1. last_hidden_state: This is the output of the last hidden layer of the BERT model. It is a tensor of shape [batch_size, sequence_length, hidden_size] representing the context-sensitive representation of each input token. This tensor contains the hidden state of every position in the input sequence.

last_hidden_state[:, 0] is the representation of the first position (i.e. the [CLS] token).
In the BERT model, a special [CLS] token is prepended to represent the summary information of the entire sequence. last_hidden_state[:, 0] is the representation of this [CLS] token, a tensor of shape [batch_size, hidden_size].
This [CLS] representation can be used as a summary of the entire sequence or a sentence-level representation, typically for downstream classification or sentence-level feature extraction. In some tasks, last_hidden_state[:, 0] is used as the representation of the entire sequence for sentiment classification, text matching and similar tasks.
Note that last_hidden_state[:, 0] is a per-sample representation; if multiple samples are processed in a batch, the batch_size dimension corresponds to the number of samples.

  2. pooler_output: This is the output of the BERT model after the pooling operation: the [CLS] hidden state passed through an additional linear layer with a tanh activation. It is a tensor of shape [batch_size, hidden_size] representing a pooled representation of the entire input sequence. It is often used as a sentence-level representation for classification or sentence-level feature extraction in downstream tasks.

  3. hidden_states: This is the output of all hidden layers of the BERT model. It is a tuple containing the output of each layer, where each element has shape [batch_size, sequence_length, hidden_size]. hidden_states[0] is the output of the embedding layer, hidden_states[1] is the output of the first Transformer layer, and so on; the last element, hidden_states[-1], equals last_hidden_state. These per-layer outputs can be used for more detailed analysis or for special tasks.

  4. past_key_values: These are the cached key-value pairs used to generate the next predicted token. It is a tuple containing the key and value tensors computed in previous forward passes. It is used in generative (decoder) settings such as text generation to preserve previous state across multi-step predictions and avoid recomputation.

  5. attentions: These are the attention weights produced by the self-attention mechanism. It is a tuple with one attention weight tensor per layer, each of shape [batch_size, num_heads, sequence_length, sequence_length], representing how much each position attends to every other position.

  6. cross_attentions: These are the attention weights produced by the cross-attention mechanism when the model is used as a decoder. It is a tuple with one tensor per layer, each of shape [batch_size, num_heads, sequence_length, sequence_length], representing how much each position attends to another input sequence (e.g. the encoder output, or the other sentence in a sentence-pair task).

These attributes expose the BERT model's outputs at different levels and from different attention mechanisms, and can be used according to the requirements of the task.
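A short sketch (reusing the bert, context and mask defined earlier) of how to request and inspect these optional outputs; the printed shapes assume bert-base with hidden_size 768 and 12 layers:

out = bert(context, attention_mask=mask,
           output_hidden_states=True, output_attentions=True)

print(out.last_hidden_state.shape)     # [batch_size, sequence_length, 768]
print(out.pooler_output.shape)         # [batch_size, 768]
print(len(out.hidden_states))          # 13 = embedding layer + 12 Transformer layers
print(out.attentions[0].shape)         # [batch_size, 12, sequence_length, sequence_length]
cls_vec = out.last_hidden_state[:, 0]  # [CLS] representation, [batch_size, 768]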


2. Input and output of CNN

import torch
from torch import nn
import torch.nn.functional as F
from transformers import BertModel

def conv_and_pool(x, conv):
    x = F.relu(conv(x)).squeeze(3)             # [batch_size, num_filters, output_length]
    x = F.max_pool1d(x, x.size(2)).squeeze(2)  # [batch_size, num_filters]
    return x

num_filters = 256
filter_sizes = (2, 3, 4)
hidden_size = 768  # hidden size of bert-base-chinese
convs = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (k, hidden_size))
             for k in filter_sizes])

bert = BertModel.from_pretrained('bert-base-chinese')

# context and mask are the token-id and attention-mask tensors from section 1
encoder_out = bert(context, attention_mask=mask).last_hidden_state  # [batch_size, sequence_length, hidden_size]
out = encoder_out.unsqueeze(1)  # [batch_size, 1 (in_channels), sequence_length, hidden_size]
out = torch.cat([conv_and_pool(out, conv) for conv in convs], 1)  # [batch_size, num_filters * len(convs)]

1. nn.Conv2d

convs = nn.ModuleList([nn.Conv2d(1, num_filters, (k, hidden_size)) for k in filter_sizes])

This line of code defines a list of convolutional layers, convs, where each convolutional layer is created with nn.Conv2d.

nn.Conv2d is the PyTorch class for two-dimensional convolutional layers. Here, a convolutional layer object is created with nn.Conv2d(1, num_filters, (k, hidden_size)).
The input of each convolutional layer is a four-dimensional tensor of shape [batch_size, in_channels, sequence_length, embedding_size]. The dimensions are explained as follows:

  • batch_size is the batch size of the input samples.
  • in_channels is the number of input channels, usually 1 for text data, indicating single-channel input.
  • sequence_length is the length of the input sequence, i.e. the number of tokens.
  • embedding_size is the embedding dimension of each token in the input sequence.

nn.ModuleList composes the multiple convolutional layer objects produced by the list comprehension into a module list (here assigned to convs), so that their parameters are registered as part of the model.

In this code snippet, filter_sizes is a tuple containing the widths of multiple convolution kernels. Specifically, the list comprehension and nn.ModuleList create three convolutional layer objects, with kernel widths 2, 3, and 4.

The likely purpose of this design is to use multi-scale convolutions in tasks such as text classification, extracting features from windows of different sizes. Each convolution kernel produces an output feature map that feeds into subsequent processing or classification. By using kernels of different widths, the model can capture semantic information over different ranges simultaneously, improving its ability to understand the input text.
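An illustrative check of the multi-scale outputs (reusing the convs defined above; the batch and sequence sizes are arbitrary):

x = torch.randn(4, 1, 32, 768)  # [batch_size, 1, sequence_length, hidden_size]
for conv in convs:
    print(conv(x).shape)
# torch.Size([4, 256, 31, 1]) for k=2
# torch.Size([4, 256, 30, 1]) for k=3
# torch.Size([4, 256, 29, 1]) for k=4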

2. conv(out)

conv = nn.Conv2d(1, num_filters, (k, hidden_size))

  1. out is the input tensor before the convolutional layer, with shape [batch_size, in_channels, sequence_length, hidden_size].

    • batch_size is the batch size of the input samples.
    • in_channels is the number of input channels, usually 1, since in this example the input is a single-channel sequence.
    • sequence_length is the length of the input sequence.
    • hidden_size is the hidden dimension, i.e. the dimension of the feature vector at each position.
  2. conv(out) is the output tensor after the convolution operation, with shape [batch_size, out_channels, output_length, feature_size].

    • batch_size is the same as for the input tensor.
    • out_channels is the number of output channels of the convolutional layer, determined by num_filters.
    • output_length is the length of the output sequence after the convolution, which depends on the input sequence length, the kernel size and the padding.
    • feature_size is the size of the last dimension after the convolution, determined by the kernel width and the hidden dimension (computed below).

Calculation method of feature_size

To compute feature_size, one needs to know the size of the convolution kernel and the hidden dimension.

Suppose the kernel has size (k, hidden_size), where k is the kernel width along the sequence dimension and hidden_size is the hidden dimension. In a 2D convolution, the kernel slides along both dimensions: the sequence length and the hidden dimension.

Assuming no padding and a stride of 1, each output dimension follows the formula output = input - kernel + 1. Along the sequence dimension, the kernel width k reduces the input length sequence_length to:

output_length = sequence_length - k + 1

Along the hidden dimension, the kernel spans the full hidden_size, so:

feature_size = hidden_size - hidden_size + 1 = 1

To sum up, with a kernel of size (k, hidden_size), the output tensor of the convolution has shape [batch_size, out_channels, output_length, 1], where out_channels is the number of output channels of the convolutional layer and output_length is computed from the input sequence length and the kernel width as above. The trailing dimension of size 1 is what the .squeeze(3) call in conv_and_pool removes, leaving [batch_size, out_channels, output_length].
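A quick shape check (with illustrative sizes, not from the original post) confirming these formulas:

import torch
from torch import nn

batch_size, sequence_length, hidden_size = 4, 32, 768
k, num_filters = 3, 256

conv = nn.Conv2d(1, num_filters, (k, hidden_size))
x = torch.randn(batch_size, 1, sequence_length, hidden_size)
y = conv(x)
print(y.shape)  # torch.Size([4, 256, 30, 1]): output_length = 32 - 3 + 1 = 30, feature_size = 1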

3. F.max_pool1d

F.max_pool1d is the PyTorch function for one-dimensional max pooling. Its input and output tensor dimension requirements are as follows:

  1. Dimensionality requirements for the input tensor: the input tensor should have shape [batch_size, channels, sequence_length], where

    • batch_size is the batch size of the input samples,
    • channels is the number of input channels, usually corresponding to the number of output channels of the convolutional layer,
    • sequence_length is the length of the input sequence.
  2. Dimensions of the output tensor: the output tensor has the same number of dimensions as the input, i.e. shape [batch_size, channels, output_length], where

    • batch_size is the same as for the input tensor,
    • channels is the same as for the input tensor,
    • output_length is the length of the output sequence after max pooling, which depends on the pooling window size, stride and padding.
  3. x = torch.nn.functional.max_pool1d(x, x.size(2))

    • x is the input tensor, assumed to have shape [batch_size, channels, sequence_length].
    • x.size(2) returns the size of the third dimension of x, i.e. the input sequence length sequence_length. Using it as the window size makes the window cover the entire sequence, so output_length is 1; this is global max pooling, and the subsequent .squeeze(2) removes that singleton dimension.

Note that F.max_pool1d pools only over the last dimension of the input tensor, i.e. the sequence dimension. The pooling window size, stride and padding can be specified via parameters. In a one-dimensional max pooling operation, the maximum value within each window is extracted to form the output tensor.

If the shape of the input tensor does not meet these requirements, it can be reshaped with the appropriate functions, e.g. torch.unsqueeze to add a dimension or torch.transpose to swap dimensions, so that it satisfies the requirements of F.max_pool1d.
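A small global-max-pooling sketch (sizes are illustrative) of the pattern used in conv_and_pool:

import torch
import torch.nn.functional as F

x = torch.randn(4, 256, 30)          # [batch_size, channels, sequence_length]
pooled = F.max_pool1d(x, x.size(2))  # window spans the whole sequence -> [4, 256, 1]
pooled = pooled.squeeze(2)           # [4, 256]
print(pooled.shape)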

4. Stacking the CNN on top of the BERT pre-trained model

When stacking a CNN on top of the BERT pre-trained model for classification, the two candidate inputs to the convolutional layer, last_hidden_state and pooler_output, have different characteristics and applicability:

  1. last_hidden_state: last_hidden_state is the output of the last hidden layer of the BERT model, a tensor of shape [batch_size, sequence_length, hidden_size]. When using last_hidden_state as the input to a convolutional layer, consider the following:

    • Applicability: last_hidden_state contains a context-sensitive representation of each input token and can capture detailed information about the input sequence. It is therefore suitable for tasks that rely on local features for classification or processing, such as text classification and named entity recognition. Through convolutions, local features of different sizes can be extracted for finer-grained analysis and modeling of the input.
    • Note: since last_hidden_state has shape [batch_size, sequence_length, hidden_size], before applying the convolution it must be converted to a 4D tensor of shape [batch_size, 1, sequence_length, hidden_size] to match the input requirements of the convolutional layer.
  2. pooler_output: pooler_output is the output of the BERT model after the pooling operation, a tensor of shape [batch_size, hidden_size]. When using pooler_output as the input to a convolutional layer, consider the following:

    • Applicability: pooler_output can be seen as a pooled representation of the entire input sequence, carrying higher-level semantic information. It is therefore suitable for tasks that classify or process whole sequences, such as sentence-level sentiment classification and text similarity. Convolutions can further transform the features in pooler_output for deeper analysis and modeling of the input sequence.
    • Note: since pooler_output has shape [batch_size, hidden_size], before applying the convolution it must be converted to a 4D tensor of shape [batch_size, 1, 1, hidden_size] to match the input requirements of the convolutional layer.

In practical applications, the choice of which output to feed to the convolutional layer depends on the task requirements and data characteristics. If the task needs detailed local features, use last_hidden_state; if the task cares more about overall semantics or a sentence-level representation, use pooler_output. Different combinations and variants can also be tried to find the input representation best suited to the task. A complete sketch of the last_hidden_state variant follows.
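A minimal end-to-end sketch (the class name, hyperparameters and number of classes are illustrative, not from the original post) of a BERT + CNN classifier using last_hidden_state:

import torch
from torch import nn
import torch.nn.functional as F
from transformers import BertModel

class BertCNN(nn.Module):
    def __init__(self, num_classes, num_filters=256, filter_sizes=(2, 3, 4)):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-chinese')
        hidden = self.bert.config.hidden_size  # 768 for bert-base
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (k, hidden)) for k in filter_sizes])
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def conv_and_pool(self, x, conv):
        x = F.relu(conv(x)).squeeze(3)                # [B, num_filters, L-k+1]
        return F.max_pool1d(x, x.size(2)).squeeze(2)  # [B, num_filters]

    def forward(self, context, mask):
        enc = self.bert(context, attention_mask=mask).last_hidden_state
        out = enc.unsqueeze(1)                        # [B, 1, L, hidden]
        out = torch.cat([self.conv_and_pool(out, c) for c in self.convs], 1)
        return self.fc(out)                           # [B, num_classes]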

3. Input and output of LSTM

from transformers import BertModel
from torch import nn

bert = BertModel.from_pretrained('bert-base-chinese')
lstm = nn.LSTM(input_size, rnn_hidden_size, num_layers,
               bidirectional=True, batch_first=True, dropout=dropout)
# nn.LSTM(input feature size, hidden state size, number of LSTM layers, whether bidirectional,
#         whether the first dimension of the input tensor is the batch dimension, dropout rate,
#         bias=True: whether to use bias terms)
# Note: input_size must equal BERT's hidden_size (768 for bert-base)

encoder_out = bert(context, attention_mask=mask).last_hidden_state  # [batch_size, sequence_length, hidden_size]
out, _ = lstm(encoder_out)

The parameters of nn.LSTM() are as follows:

  • input_size: the size of the input features.
  • hidden_size: the size of the hidden state.
  • num_layers: the number of LSTM layers.
  • bias: whether to use bias terms; defaults to True.
  • batch_first: whether the first dimension of the input tensor is the batch dimension; defaults to False.
  • dropout: the dropout rate applied to the output of each LSTM layer except the last; defaults to 0 and only takes effect when num_layers > 1.
  • bidirectional: whether to use a bidirectional LSTM; defaults to False.

The types and dimensions of the model's inputs and outputs are as follows (these shapes assume batch_first=False, the default; with batch_first=True the first two dimensions of input and output are swapped):

  • Input parameters:

    • input: input tensor of shape [sequence_length, batch_size, input_size], where
      • sequence_length is the length of the input sequence,
      • batch_size is the batch size of the input samples,
      • input_size is the size of the input features.
    • h_0: initial hidden state tensor of shape [num_layers * num_directions, batch_size, hidden_size], where
      • num_layers is the number of LSTM layers,
      • num_directions is the number of directions of the LSTM (2 for bidirectional, 1 for unidirectional),
      • batch_size is the batch size of the input samples,
      • hidden_size is the size of the hidden state.
    • c_0: initial cell state tensor of shape [num_layers * num_directions, batch_size, hidden_size], with the same dimensions as h_0.
  • Output result:

    • output: output sequence tensor of shape [sequence_length, batch_size, num_directions * hidden_size], where
      • sequence_length is the length of the input sequence,
      • batch_size is the batch size of the input samples,
      • num_directions is the number of directions of the LSTM (2 for bidirectional, 1 for unidirectional),
      • hidden_size is the size of the hidden state.
    • h_n: the hidden state tensor at the last time step, of shape [num_layers * num_directions, batch_size, hidden_size], with the same dimensions as h_0.
    • c_n: the cell state tensor at the last time step, of shape [num_layers * num_directions, batch_size, hidden_size], with the same dimensions as h_0.

Note that the dimensions and types of the inputs and outputs depend on the actual shapes and settings of the input tensors and parameters. The descriptions above cover the general case; specific dimensions and types may vary with the input data shape and model parameters.

Example:

import torch
import torch.nn as nn

input_size = 10
hidden_size = 20
num_layers = 2
batch_size = 4
sequence_length = 6
num_directions = 1

lstm = nn.LSTM(input_size, hidden_size, num_layers, bidirectional=False, batch_first=False)

input = torch.randn(sequence_length, batch_size, input_size)
h_0 = torch.randn(num_layers * num_directions, batch_size, hidden_size)
c_0 = torch.randn(num_layers * num_directions, batch_size, hidden_size)

output, (h_n, c_n) = lstm(input, (h_0, c_0))

print("Output shape:", output.shape)
print("Hidden state shape:", h_n.shape)
print("Cell state shape:", c_n.shape)

If you want to add an LSTM layer on top of the output of the BERT pre-trained model, note that the BERT output last_hidden_state already has the shape [batch_size, sequence_length, hidden_size] expected by an LSTM with batch_first=True (with input_size equal to BERT's hidden_size), so it can be passed directly to the LSTM layer, without the shape conversion the CNN required.
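A minimal sketch (hidden size, layer count and class count are illustrative; bert, context and mask are reused from above) of feeding the BERT output straight into an LSTM and classifying from the last time step:

lstm = nn.LSTM(input_size=768, hidden_size=256, num_layers=2,
               bidirectional=True, batch_first=True, dropout=0.1)
fc = nn.Linear(256 * 2, 10)  # 2 directions; 10 is a hypothetical number of classes

encoder_out = bert(context, attention_mask=mask).last_hidden_state  # [batch_size, sequence_length, 768]
out, _ = lstm(encoder_out)  # [batch_size, sequence_length, 512]
logits = fc(out[:, -1])     # take the last time step -> [batch_size, 10]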

Origin blog.csdn.net/weixin_44624036/article/details/131365191