Self-Study Large Language Models: BERT

The BERT model was proposed by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. It is a bidirectional Transformer pre-trained with a combination of the masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.
BERT aims to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, a pretrained BERT model needs only one additional output layer to be fine-tuned into a state-of-the-art model for a wide range of tasks, such as question answering and language inference, without extensive task-specific architectural modifications.
BERT is trained on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) objectives. It is effective at predicting masked tokens and at NLU in general, but it is not optimal for text generation.
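As a concrete illustration of the MLM objective (this snippet is my own sketch using the Hugging Face transformers library, not code from the paper), the fill-mask pipeline with the public bert-base-uncased checkpoint predicts a masked token:

from transformers import pipeline

# Minimal sketch of masked-token prediction with a public BERT checkpoint.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The pipeline returns the most likely fillers for the [MASK] position.
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))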

2018 was a pivotal year for breakthroughs in machine learning models for natural language processing (NLP). Our conceptual understanding of how to better capture the underlying meanings and relationships of words and sentences is constantly evolving, and the NLP community keeps releasing powerful components that are free to download and use in your own models and pipelines. This advance has been dubbed NLP's ImageNet moment, in reference to how machine learning in computer vision took off a few years earlier.

The release of BERT is an important milestone, widely regarded as the start of a new era in NLP. BERT broke multiple records on language-related tasks. Soon after the paper, the code for the BERT model was open-sourced, and versions of the model pre-trained on large-scale datasets were made available for download. This is significant because anyone building a machine learning model that involves language processing can now use this powerful engine as an off-the-shelf component, saving the time, effort, knowledge, and resources that would otherwise go into training such a model from scratch.
BERT builds on many innovative ideas emerging in the NLP community, including but not limited to: semi-supervised sequence learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from the AI2 and UW CSE teams), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI Transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer model (by Vaswani et al.).

Two steps in the development of BERT. You can download the model pretrained in step 1 (trained on unannotated data) and only worry about fine-tuning it in step 2.

The paper provides two model sizes for BERT:

BERT BASE - comparable in size to the OpenAI Transformer, included for performance comparison
BERT LARGE - a ridiculously large model that achieves the state-of-the-art results reported in the paper
BERT is basically a trained Transformer encoder stack. Now is a good time to refer you to my previous article, The Illustrated Transformer, which explains the Transformer model - the foundational concept for BERT and for the ideas we will discuss next.


Both BERT model sizes have a large number of encoder layers (which the paper calls Transformer Blocks): 12 in the Base version and 24 in the Large version. They also have larger hidden sizes (768 and 1024 respectively) and more attention heads (12 and 16 respectively) than the default configuration in the Transformer reference implementation from the original paper (6 encoder layers, 512 hidden units, and 8 attention heads).

Model Input

The first input token is supplied with a special [CLS] token, for reasons that will become apparent later. CLS here stands for classification.

Just like the Transformer's ordinary encoder, BERT takes a sequence of words as input, which keeps flowing up the stack. Each layer applies self-attention, passes its result through a feed-forward network, and then hands it off to the next encoder.

Architecturally, this is identical to the Transformer so far (aside from size, which is just a configuration we can set). It is at the output that we first start to see how things diverge.

Model Output
Each position outputs a vector of size hidden_size (768 in BERT Base). For the sentence classification example above, we focus only on the output of the first position (the position to which we passed the special [CLS] token).


This vector can now be used as input to the classifier of our choice. The paper achieves good results using only a single-layer neural network as a classifier.


If you have more labels (for example, if you are an email service that tags emails as "spam", "not spam", "social", and "promotional"), you just adjust the classifier network to have more output neurons, which are then passed through softmax.
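As a minimal sketch of that setup (my own illustration, not the paper's code; the four labels are hypothetical), you can put a single linear layer and a softmax on top of the [CLS] output of a pretrained BERT:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, 4)  # e.g. spam / not spam / social / promotional

inputs = tokenizer("Win a free cruise, click here!", return_tensors="pt")
with torch.no_grad():
    cls_vector = bert(**inputs).last_hidden_state[:, 0, :]  # the vector at the [CLS] position
probs = torch.softmax(classifier(cls_vector), dim=-1)
print(probs)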

Similarities to Convolutional Networks
For those with a computer vision background, this vector hand-off should be reminiscent of what happens between the convolutional part of a network like VGGNet and the fully connected classification part at the end of the network.


A New Era of Embeddings

These new developments brought a shift in how words are encoded. Until now, word embeddings have been a major force in how leading NLP models process language; methods such as Word2Vec and GloVe have been widely used for such tasks. Before pointing out what has changed, let's review how they are used.

Word Embeddings Review
For machine learning models to process words, they need some numerical representation that the model can use in its computations. Word2Vec showed that we can use a vector (a list of numbers) to represent a word in a way that captures semantic, or meaning-related, relationships (e.g. the ability to tell whether words are similar or opposite, or that the relationship between "Stockholm" and "Sweden" is the same as that between "Cairo" and "Egypt") as well as syntactic, or grammar-based, relationships (e.g. the relationship between "had" and "has" is the same as that between "was" and "is").

The field quickly realized that it is a great idea to use embeddings pre-trained on vast amounts of text data rather than training them alongside the model on what is frequently a small dataset. It therefore became possible to download lists of words and their embeddings generated by pre-training with Word2Vec or GloVe. Here is an example of the GloVe embedding of the word "stick" (with an embedding vector size of 200):

The GloVe word embedding of the word "stick" - a vector of 200 floats (rounded to two decimal places). It goes on for 200 values.
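For illustration, a static embedding is just a lookup table, so comparing words reduces to vector math. The sketch below assumes a standard GloVe text file (glove.6B.200d.txt) is available locally; the file path and word choices are mine:

import numpy as np

# Sketch: load a GloVe text file into a dict and compare two words.
# Adjust the path to wherever the GloVe vectors are stored.
embeddings = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        embeddings[word] = np.asarray(values, dtype=np.float32)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The same "stick" vector is used regardless of context.
print(cosine(embeddings["stick"], embeddings["branch"]))
print(cosine(embeddings["stick"], embeddings["glue"]))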

ELMo: Context Matters

If we use this GloVe representation, then the word "stick" would be represented by this same vector no matter the context. "Wait a minute," said a number of NLP researchers (Peters et al., 2017, McCann et al., 2017, and Peters et al., 2018 in the ELMo paper), "'stick' has multiple meanings depending on where it's used. Why not give it an embedding based on the context it's used in - one that captures both the word's meaning in that context and other contextual information?" And so, contextualized word embeddings were born.

Contextualized word embeddings can give words different embeddings based on the meaning they carry in the context of the sentence.
Instead of using a fixed embedding for each word, ELMo looks at the entire sentence before assigning each word in it an embedding. It uses a bidirectional LSTM trained on a specific task to create these embeddings.

ELMo was an important step toward pre-training in the context of NLP. The ELMo LSTM is trained on a massive dataset in the language of our dataset, and we can then use it as a component in other models that need to handle language.

What is ELMo's secret?

ELMo gains its understanding of language by being trained to predict the next word in a sequence of words—a task known as language modeling. This is convenient because we have a lot of text data from which such a model can learn without labels.

A step in ELMo's pre-training process: given "Let's stick to" as input, predict the next most likely word - a language modeling task. When trained on a large dataset, the model begins to pick up language patterns. In this example, it is unlikely to guess the next word exactly. More realistically, after a word like "hang", it will assign a higher probability to a word like "out" (to form "hang out") than to "camera".

We can see the hidden state of each unrolled LSTM step peeking out from behind ELMo's head. These come in handy for the embedding process after pre-training is done.

ELMo actually goes a step further by training a bidirectional LSTM - so that its language model not only perceives the next word, but also the previous one.

ELMo comes up with the contextualized embedding by combining the hidden states (and the initial embedding) in a certain way (concatenation followed by a weighted sum).
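A toy sketch of that combination step (the tensors below are random stand-ins for real LSTM outputs, and the shapes are only illustrative): each token's contextualized embedding is a softmax-normalized, learned weighted sum over the layer outputs, scaled by a task-specific scalar.

import torch

num_layers, seq_len, dim = 3, 6, 1024      # initial embedding + 2 biLSTM layers
layer_outputs = torch.randn(num_layers, seq_len, dim)

s = torch.softmax(torch.randn(num_layers), dim=0)  # learned, softmax-normalized layer weights
gamma = torch.tensor(1.0)                          # learned task-specific scale

# Weighted sum over the layer dimension -> one contextual vector per token.
elmo_embedding = gamma * (s[:, None, None] * layer_outputs).sum(dim=0)
print(elmo_embedding.shape)  # torch.Size([6, 1024])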

ULM-FiT: Nailing Down Transfer Learning in NLP
ULM-FiT introduced methods to effectively use much of what the model learns during pre-training - more than just embeddings, and more than just contextualized embeddings. ULM-FiT introduced a language model and a process to effectively fine-tune that language model for various tasks.

NLP finally has a way to do transfer learning like computer vision.

Transformer: Beyond LSTM
The release of the Transformer paper and code, and the results it achieved on tasks such as machine translation, led some in the field to think of Transformers as a replacement for LSTMs. The fact that Transformers handle long-term dependencies better than LSTMs reinforced this.

The Transformer's encoder-decoder structure makes it ideal for machine translation. But how would you use it for sentence classification? How would you use it to pre-train a language model that can be fine-tuned for other tasks? (Downstream tasks are what the field calls the supervised learning tasks that utilize a pre-trained model or component.)

OpenAI Transformer: Pre-training a Transformer Decoder for Language Modeling
It turns out we don't need an entire Transformer to adopt transfer learning and fine-tunable language models for NLP tasks. We can get by with just the Transformer's decoder. The decoder is a natural choice for language modeling (predicting the next word) because it is built to mask future tokens - a valuable property here.


The OpenAI Transformer consists of the decoder stack from the Transformer.
The model stacks twelve decoder layers. Since there is no encoder in this setup, these decoder layers do not have the encoder-decoder attention sublayer that vanilla Transformer decoder layers have. They do, however, still have the self-attention layer (masked so it doesn't peek at future tokens).

With this structure, we can proceed to train the model on the same language modeling task: predicting the next word using massive (unlabeled) datasets. That is, throw the text of 7,000 books at it and let it learn! Books are great for this kind of task because they allow the model to learn to associate related information even when it is separated by a lot of text - something you don't get, for example, when training on tweets or short articles.

The OpenAI Transformer is now ready to be trained to predict the next word on a dataset consisting of 7,000 books.
Transfer Learning to Downstream Tasks
Now that the OpenAI Transformer is pre-trained and its layers have been tuned to handle language reasonably well, we can start using it for downstream tasks. Let's first look at sentence classification (classifying an email as "spam" or "not spam"):


The OpenAI paper outlines some input transformations to handle inputs for different types of tasks. The figure below in the paper shows the model structure and input transformations for performing different tasks.


Isn't that smart?

BERT: From decoder to encoder
The OpenAI Transformer gave us a fine-tunable, Transformer-based pre-trained model. But something was lost in the transition from LSTMs to Transformers: ELMo's language model was bidirectional, while the OpenAI Transformer only trains a forward language model. Could we build a Transformer-based model whose language model looks both forward and backward (in technical terms, is "conditioned on both left and right context")?

"Hold my beer," says R-rated BERT.

Masked language model
"We're going to use a Transformer encoder," said BERT.

"That's crazy," replied Ernie. "Everybody knows bidirectional conditioning would let each word indirectly see itself in a multi-layered context."

"We'll use MASKs," said BERT confidently.

BERT's clever language modeling task masks 15% of the words in the input and asks the model to predict the missing words.
Finding the right task to train a Transformer stack of encoders is a complex hurdle that BERT resolves by adopting the "masked language model" concept from earlier literature (where it is called the cloze task).

Beyond masking 15% of the input, BERT also mixes things up a bit to improve how the model is fine-tuned later. Sometimes it randomly replaces a word with another word and asks the model to predict the correct word in that position.
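In the paper's recipe, each position chosen among the 15% is replaced with [MASK] 80% of the time, replaced with a random token 10% of the time, and left unchanged 10% of the time. Below is a simplified sketch of that corruption step (the function and its arguments are my own, not the reference implementation):

import random

def mask_tokens(token_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Simplified BERT-style masking over an already-tokenized sequence.
    corrupted, labels = list(token_ids), [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:
            labels[i] = tok                 # the model must predict the original token here
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = mask_token_id                 # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted[i] = random.randrange(vocab_size)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return corrupted, labels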

Two-sentence tasks
If you look back at the input transformations the OpenAI Transformer uses to handle different tasks, you will notice that some tasks require the model to say something intelligent about two sentences (e.g. are they simply paraphrased versions of each other? Given a Wikipedia entry as one input and a question about that entry as another input, can we answer the question?).

In order for BERT to better handle the relationship between multiple sentences, the pre-training process includes an additional task: Given two sentences (A and B), is it possible that B is the sentence after A?

The second task in BERT pre-training is a two-sentence classification task. The tokenization in this diagram is oversimplified, since BERT actually uses WordPieces rather than words as tokens - so some words are broken down into smaller chunks.
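To see what the model actually receives for this two-sentence task, here is a quick sketch (the sentence pair is my own example): the tokenizer packs both segments into one sequence with [CLS] and [SEP] tokens plus a token_type_ids mask separating A from B.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.", "He bought a gallon of milk.")

# Both segments end up in one sequence, delimited by [CLS] and [SEP].
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["token_type_ids"])  # 0s for sentence A, 1s for sentence B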
Task-Specific Models
The BERT paper shows a number of ways to use BERT for different tasks.

BERT for feature extraction
The fine-tuning approach is not the only way to use BERT. Just like ELMo, you can use pre-trained BERT to create contextualized word embeddings. You can then feed these embeddings into your existing model - a process the paper shows yields results not far off from fine-tuning BERT on a task such as named entity recognition.

Which vector works best as a contextualized embedding? I think it depends on the task. The paper examines six choices (compared to the fine-tuned model, which achieves a score of 96.4).
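One of the stronger-scoring choices in that comparison is concatenating the last four hidden layers. A sketch of extracting such features with the transformers library (my own illustration, not the paper's code):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tokenizer("BERT makes nice contextual features", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # tuple: embedding layer + 12 encoder layers

# Concatenate the last four layers for every token position.
features = torch.cat(hidden_states[-4:], dim=-1)   # shape (1, seq_len, 4 * 768)
print(features.shape)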

Try BERT

Check out the code in the BERT repository:

The model is constructed in modeling.py (class BertModel) and is pretty much identical to a vanilla Transformer encoder.
run_classifier.py is an example of the fine-tuning process. It also builds the classification layer for the supervised model. If you want to build your own classifier, check out the create_model() method in that file.

Several pre-trained models are available for download. These cover BERT Base and BERT Large, for languages such as English and Chinese, plus a multilingual model covering 102 languages, all trained on Wikipedia.

BERT doesn't treat words as tokens. Rather, it looks at WordPieces. tokenization.py is the tokenizer that converts your words into WordPieces suitable for BERT.
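A quick sketch of what that looks like in practice (the example sentence is mine): uncommon words get split into sub-word pieces prefixed with ##.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("The embeddings are contextualized"))
# e.g. ['the', 'em', '##bed', '##ding', '##s', 'are', 'context', '##ual', '##ized']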

# Import Python libraries and prepare the environment
!pip install transformers seqeval[gpu]

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import torch
from torch.utils.data import Dataset, DataLoader
# Import BertTokenizer, BertConfig, and BertForTokenClassification from transformers

from transformers import BertTokenizer, BertConfig, BertForTokenClassification

# Use GPU compute if available
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)






BertConfig

This is the configuration class used to store the configuration of a BertModel or TFBertModel. It is used to instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the default values yields a configuration similar to that of the BERT bert-base-uncased architecture.

# Initialize a BERT bert-base-uncased style configuration
from transformers import BertConfig, BertModel

configuration = BertConfig()

# Initialize a model (with random weights) from the bert-base-uncased style configuration
model = BertModel(configuration)

# Access the model configuration
configuration = model.config

Parameter explanation (a small configuration sketch follows this list):
vocab_size (int, optional, defaults to 30522) — Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling BertModel or TFBertModel.
hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
hidden_act (str or Callable, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "silu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities.
max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g. 512, 1024, or 2048).
type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling BertModel or TFBertModel.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
position_embedding_type (str, optional, defaults to "absolute") — Type of position embedding. Choose one of "absolute", "relative_key", or "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", see Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", see Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
is_decoder (bool, optional, defaults to False) — Whether the model is used as a decoder or not. If False, the model is used as an encoder.
use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/value attentions (not used by all models). Only relevant if config.is_decoder=True.
classifier_dropout (float, optional) — The dropout ratio for the classification head.
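For instance, here is a sketch of overriding a few of these fields to define a smaller-than-base BERT; the values are arbitrary and only illustrate the parameters above:

from transformers import BertConfig, BertModel

# A smaller BERT defined through BertConfig; weights are randomly initialized.
small_config = BertConfig(
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=512,
)
small_model = BertModel(small_config)
print(sum(p.numel() for p in small_model.parameters()))  # parameter count of the smaller model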

BertTokenizer

Build a BERT tokenizer. Based on WordPiece.

This tokenizer inherits from PreTrainedTokenizer, which contains most of the main methods. Users should refer to this superclass for more information on these methods.

Parameters

vocab_file ( str ) – file containing vocabulary.
do_lower_case ( bool, optional , defaults to True) — Whether to lowercase the input when tokenizing.
do_basic_tokenize ( bool, optional , defaults to True) — Whether to do basic tokenization before WordPiece.
never_split ( Iterable, optional ) – Collection of tokens that will never be split during tokenization. Only valid if do_basic_tokenize=True
unk_token ( str, optional , defaults to "[UNK]") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
sep_token ( str, optional , defaults to "[SEP]") — The separator token, used when building a sequence from multiple sequences, e.g. two sequences for sequence classification, or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
pad_token ( str, optional , defaults to "[PAD]") — The token used for padding, for example when batching sequences of different lengths.
cls_token ( str, optional , defaults to "[CLS]") — The classifier token, used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of a sequence built with special tokens.
mask_token ( str, optional , defaults to "[MASK]") — The token used for masking values. This is the token used when training this model with masked language modeling; it is the token the model will try to predict.
tokenize_chinese_chars ( bool, optional , defaults to True) — whether to tokenize Chinese characters.
For Japanese, this should probably be disabled (see this issue).

strip_accents ( bool, optional ) — Whether or not to strip all accents. If this option is not specified, it will be determined by the value of do_lower_case (as in the original BERT).
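In practice you rarely pass vocab_file by hand; the usual route is to load a pretrained vocabulary. A short sketch that also inspects a few of the special tokens listed above:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

# The special tokens described in the parameter list.
print(tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token,
      tokenizer.unk_token, tokenizer.mask_token)
print(tokenizer("Hello BERT")["input_ids"])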

Here are the methods of this class:

build_inputs_with_special_tokens

Parameters:
token_ids_0 ( List[int]) — List of IDs to which the special tokens will be added.
token_ids_1 ( List[int], optional ) — Optional second list of IDs for sequence pairs.

Returns:
List[int]

A list of input IDs with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A BERT sequence has the following format:

Single sequence: [CLS] X [SEP]
sequence pair: [CLS] A [SEP] B [SEP]
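A brief sketch of calling this method directly, with token IDs obtained from the tokenizer first (the example strings are mine):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("how are you"))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("fine thanks"))

# Wrap the raw IDs with [CLS]/[SEP] for a sequence pair.
with_special = tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)
print(tokenizer.convert_ids_to_tokens(with_special))
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', 'thanks', '[SEP]']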

get_special_tokens_mask

( token_ids_0 : List[int], token_ids_1 : Optional[List[int]] = None, already_has_special_tokens : bool = False ) → List[int]

Parameters

token_ids_0 ( List[int]) — List of IDs.
token_ids_1 ( List[int], optional ) — Optional second list of IDs for sequence pairs.
already_has_special_tokens ( bool, optional , defaults to False) — Whether or not the token list is already formatted with special tokens for the model.

Returns:

List[int]

A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Retrieves sequence IDs from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's prepare_for_model method.

create_token_type_ids_from_sequences

( token_ids_0 : List[int], token_ids_1 : Optional[List[int]] = None ) → List[int]

Parameters

token_ids_0 ( List[int]) — List of IDs.
token_ids_1 ( List[int], optional ) — Optional second list of IDs for sequence pairs.

Returns:

List[int]

A list of token type IDs according to the given sequence(s).

Create a mask from the two sequences passed, to be used in a sequence-pair classification task. A BERT sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence        | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).
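A short sketch that reproduces this mask with the method itself (the example strings are mine):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
ids_a = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("how are you"))
ids_b = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("fine thanks"))

print(tokenizer.create_token_type_ids_from_sequences(ids_a, ids_b))
# [0, 0, 0, 0, 0, 1, 1, 1]  -> 0s cover [CLS] A [SEP], 1s cover B [SEP]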

save_vocabulary

( save_directory : str, filename_prefix : Optional[str] = None )

Saves the vocabulary.

class transformers.BertModel

( config, add_pooling_layer = True )

Parameters

config ( BertConfig ) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Bert Model transformer, outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models, such as downloading or saving models, resizing the input embeddings, pruning heads, and more.

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module, and refer to the PyTorch documentation for all matters related to general usage and behavior.

The model can behave as an encoder (with only self-attention) or as a decoder. When used as a decoder, a layer of cross-attention is added between the self-attention layers, following the architecture described in "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.

To run as a decoder, the is_decoder parameter in the model's configuration parameters needs to be set to True. To use this model in a Seq2Seq model, both the is_decoder parameter and the add_cross_attention parameter need to be set to True, and encoder_hidden_states needs to be provided as input in the forward pass.
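A sketch of that configuration (purely illustrative: the encoder hidden states below are random placeholders rather than real encoder output, and the cross-attention weights are newly initialized rather than loaded from the checkpoint):

import torch
from transformers import BertConfig, BertModel, BertTokenizer

# Turn on decoder behavior and cross-attention in the configuration.
config = BertConfig.from_pretrained("bert-base-uncased", is_decoder=True, add_cross_attention=True)
decoder = BertModel.from_pretrained("bert-base-uncased", config=config)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello", return_tensors="pt")
encoder_states = torch.randn(1, 10, config.hidden_size)   # stand-in for real encoder output

outputs = decoder(**inputs, encoder_hidden_states=encoder_states)
print(outputs.last_hidden_state.shape)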

forward

Parameters:

input_ids ( torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary.

attention_mask ( torch.FloatTensor of shape (batch_size, sequence_length), optional ) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked,
0 for tokens that are masked.

token_type_ids ( torch.LongTensor of shape (batch_size, sequence_length), optional ) — Segment token indices to indicate the first and second portions of the input. Indices selected in [0, 1]: 0 corresponds to a sentence A token,
1 corresponds to a sentence B token.

position_ids ( torch.LongTensor of shape (batch_size, sequence_length), optional ) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1].

head_mask ( torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional ) — Mask to nullify selected heads of the self-attention modules. Mask values selected in [0, 1]: 1 indicates the head is not masked,
0 indicates the head is masked.

inputs_embeds ( torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional ) — Optionally, instead of passing input_ids, you can choose to pass an embedded representation directly. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix provides.

output_attentions ( bool, optional ) — Whether or not to return the attentions tensors of all attention layers. See attentions under the returned tensors for more detail.

output_hidden_states ( bool, optional ) — Whether or not to return the hidden states of all layers. See hidden_states under the returned tensors for more detail.

return_dict ( bool, optional ) — Whether or not to return a ModelOutput instead of a plain tuple.

encoder_hidden_states ( torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional ) — Sequence of hidden states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

encoder_attention_mask ( torch.FloatTensor of shape (batch_size, sequence_length), optional ) — Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked,
0 for tokens that are masked.

past_key_values ( tuple(tuple(torch.FloatTensor)) of length config.n_layers, with each tuple having 4 tensors of shape (batch_size, num_heads, sequence_length - 1, embed_size_per_head)) — Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.

use_cache ( bool, optional ) — If set to True, past_key_values key/value states are returned and can be used to speed up decoding (see past_key_values).

Returns: transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions or tuple(torch.FloatTensor)




from transformers import AutoTokenizer, BertModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
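To also retrieve the per-layer hidden states and attention maps mentioned in the parameter list above, the same forward call can request them explicitly. A small self-contained sketch (my own addition to the documentation example):

import torch
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

print(outputs.last_hidden_state.shape)   # (batch_size, sequence_length, 768)
print(outputs.pooler_output.shape)       # (batch_size, 768)
print(len(outputs.hidden_states))        # embedding layer + 12 encoder layers = 13
print(len(outputs.attentions))           # one attention tensor per encoder layer = 12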

BertForPreTraining

class transformers.BertForPreTraining

Parameters

config ( BertConfig ) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bert Model with two heads on top, as done during pre-training: a masked language modeling head and a next sentence prediction (classification) head.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch module, and refer to the PyTorch documentation for all matters related to general usage and behavior.

from transformers import AutoTokenizer, BertForPreTraining
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs)

prediction_logits = outputs.prediction_logits
seq_relationship_logits = outputs.seq_relationship_logits
