Fine-tuning BERT in practice: using Transformers to fine-tune BERT for question answering and text classification

1. Introduction to BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained natural language processing model released by Google in 2018. Its core is the Transformer encoder, which is pre-trained without supervision on a large-scale corpus and then fine-tuned for specific NLP tasks. BERT is a bidirectional deep learning model: it considers all words in the context at once, which helps it understand the meaning of a sentence. BERT has been shown to achieve state-of-the-art results on several NLP tasks, including question answering, text classification, and named entity recognition.

  • BERT is a natural language understanding model based on deep neural networks that learns the semantics and structure of language from large amounts of unlabeled text.

  • The innovation of BERT is its bidirectional Transformer encoder, which considers context in both directions at once and therefore captures richer language features.

  • BERT's pre-training tasks are MLM and NSP, which learn word-level and sentence-level representations, respectively. MLM is a cloze task: some words in the input text are randomly replaced with the special symbol [MASK], and the model must predict the masked words. NSP is a binary classification task: given two sentences, the model must judge whether the second follows the first.

  • BERT has brought significant improvements to a variety of natural language processing tasks, such as question answering, sentiment analysis, named entity recognition, and text classification. It not only improves model performance but also simplifies fine-tuning: only a small number of task-specific layers need to be added on top of the pre-trained model to adapt it to different tasks (a conceptual sketch follows at the end of this list).

  • BERT has also spawned many improved or extended models, such as RoBERTa, ALBERT, XLNet, and ELECTRA. These models optimize or extend BERT in different ways, for example by training on more data, reducing the number of parameters, or changing the pre-training tasks.
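
To make "a small number of task-related layers on top of the pre-trained model" concrete, here is a conceptual sketch (ours, not code from this article): a classifier that reuses a pre-trained BertModel from the Transformers library and adds a single new linear layer over the pooled [CLS] representation. The class name BertClassifier is only illustrative.

import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    """A pre-trained BERT encoder plus one new task-specific classification layer."""
    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)                       # reused pre-trained encoder
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)   # the only new layer

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        # The pooled [CLS] representation is used as the sequence embedding.
        return self.classifier(outputs.pooler_output)                           # class logits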

2. The basic principle of BERT

2.1. Fine-tuning BERT

Figure 1: BERT's pre-training and fine-tuning process. Except for the output layer, the same model structure is used for pre-training and fine-tuning. The parameters of the same pre-trained model are used to initialize models for different downstream tasks, and during fine-tuning all parameters are updated. [CLS] is a special symbol added in front of every input sample, and [SEP] is a special delimiter used to separate sentences (e.g. question/answer).

Briefly, Figure 1 shows how BERT learns language knowledge from a large amount of unlabeled text and is then fine-tuned for different tasks. BERT uses a multi-layer Transformer encoder as its model structure, which considers context in both the left and right directions at once. BERT uses two pre-training tasks, Masked Language Model (MLM) and Next Sentence Prediction (NSP), to learn word-level and sentence-level representations.

  • Masked Language Model (MLM)

MLM is a pre-training task designed to let the model learn bidirectional language representations. Some words in a sentence are randomly replaced with the [MASK] symbol, and the model must predict what the replaced words were; to do so, it has to use contextual information to understand the meaning of the sentence. To reduce the mismatch between the pre-training and fine-tuning stages, MLM also replaces a selected word with a random word, or keeps it unchanged, with a certain probability.

  • Next Sentence Prediction (NSP)

NSP is another pre-training task that lets the model learn the relationship between two sentences. Two sentences are fed into the model, and the model judges whether the second sentence follows the first in the original text. To do this, the model needs to understand the logical and semantic connections between sentences. NSP helps downstream tasks that process multi-sentence input, such as question answering and natural language inference. A short sketch of both pre-training tasks follows this list.
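
Both pre-training tasks can be probed directly with the Transformers library. The following is a minimal sketch (not part of the original article's code; the example sentences are made up) that uses the fill-mask pipeline to illustrate MLM and BertForNextSentencePrediction to illustrate NSP:

import torch
from transformers import pipeline, BertTokenizer, BertForNextSentencePrediction

# MLM: predict the word hidden behind [MASK]
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK].")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))

# NSP: judge whether sentence B follows sentence A
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
sentence_a = "He went to the station early in the morning."
sentence_b = "The train left exactly on time."
encoding = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = nsp_model(**encoding).logits
# index 0 is the score for "B follows A", index 1 for "B is a random sentence"
print("is next sentence:", logits.argmax(dim=1).item() == 0)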

Through these two tasks, BERT learns general language knowledge. Adding a small number of task-specific layers on top of the pre-trained model then adapts it to different downstream tasks, such as question answering, sentiment analysis, named entity recognition, and text classification.

The BERT paper itself uses the SQuAD dataset as a fine-tuning example. In the original BERT, the classification head on top of the [CLS] token outputs the Next Sentence Prediction result, i.e. whether sentence A and sentence B are related contexts, while MLM is applied to tokens within each sentence. SQuAD uses a different head: the task is to find an answer span, so the model needs output classifiers that mark the start and end of the answer inside the passage. Fine-tuning BERT therefore means adjusting the parameters of (at least) this output head so that the model can give specific answers to specific questions.

Regarding the fine-tuning of BERT, there are two possibilities:

  • The first possibility is to fine-tune only the classification output head, keeping the large number of pre-trained BERT parameters unchanged. This avoids retraining a model with that many parameters, which would demand computing resources and time that are often unavailable; the internal parameters of BERT can be ignored and only the output head needs attention (see the sketch after this list).

  • The other possibility is to fine-tune all of BERT's parameters during fine-tuning. Generally this is not necessary, because for most tasks fine-tuning the output head is enough. However, each task should be analyzed on its own: when the task is more complex, it may be necessary to adjust BERT as a whole and update the original BERT parameters to meet the new requirements.

Which option to choose can be decided by comparing the fine-tuning results in your actual situation.
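
As a sketch of the first option, and assuming the BertForQuestionAnswering model used later in this article, the encoder can be frozen so that only the output head receives gradient updates:

from transformers import BertForQuestionAnswering

model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# Option 1: freeze the pre-trained encoder so that only the output head is fine-tuned
for param in model.bert.parameters():
    param.requires_grad = False

# Only the span-prediction head (qa_outputs) remains trainable
print([name for name, p in model.named_parameters() if p.requires_grad])
# ['qa_outputs.weight', 'qa_outputs.bias']

Option 2 is simply the default behaviour of the training loop shown in section 3.5, where every parameter receives gradient updates.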

3. Fine-tuning BERT for question answering with PyTorch

Before any fine-tuning, the original BERT cannot, in principle, complete a question answering task: it was only trained on MLM and NSP, not on question answering. To make BERT support question answering, we fine-tune it on the SQuAD dataset and then use the fine-tuned model to answer questions.

3.1. Stanford QA Dataset (SQuAD)


The SQuAD (Stanford Question Answering Dataset) dataset is a standard dataset released by Stanford University that is commonly used for question answering tasks. It extracts questions and answers from Wikipedia, and each question is paired with a context passage. For example:

  • Question: What kind of animals are visible in Yellowstone?

  • Context: Yellowstone National Park is home to a variety of animals. Some of the park's larger mammals include the grizzly bear, black bear, gray wolf, bison, elk, moose, mule deer, and white-tailed deer.

  • Answer: grizzly bear, black bear, gray wolf, bison, elk, moose, mule deer, and white-tailed deer.

The answers have one important characteristic: the answer must be contained in the context passage. SQuAD is therefore an extractive task; you do not compose an answer from scratch, you only have to locate the answer words in the passage. The answer may be a single word or a span of several words. In short, the SQuAD dataset asks you to extract a contiguous span of words from a larger text as the answer to the question, as illustrated below.
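
As a small illustration of this extractive format (a made-up record in the spirit of SQuAD, not an actual dataset entry), the answer is stored as a character offset into the context, and the answer text is always a contiguous slice of that context:

# Hypothetical SQuAD-style record
context = ("Yellowstone National Park is home to a variety of animals. Some of the park's "
           "larger mammals include the grizzly bear, black bear, gray wolf, bison, elk, "
           "moose, mule deer, and white-tailed deer.")
question = "What kind of animals are visible in Yellowstone?"
answer_text = "grizzly bear, black bear, gray wolf, bison, elk, moose, mule deer, and white-tailed deer"
answer_start = context.index(answer_text)   # character offset stored in the dataset

# Extractive QA only has to locate this span; it never rewrites the answer.
assert context[answer_start:answer_start + len(answer_text)] == answer_text
print(answer_start, answer_text)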

3.2. Dataset Feature Extraction

Convert the training examples of the SQuAD 2.0 dataset into input features for the BERT model and save these features to disk, so that they do not have to be recomputed and can be loaded directly later.

import pickle
from transformers.data.processors.squad import SquadV2Processor, squad_convert_examples_to_features
from transformers import BertTokenizer

# Initialize the SQuAD processor, dataset, and tokenizer
processor = SquadV2Processor()
train_examples = processor.get_train_examples('SQuAD')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Convert the SQuAD 2.0 examples into BERT input features
train_features = squad_convert_examples_to_features(
    examples=train_examples,
    tokenizer=tokenizer,
    max_seq_length=384,
    doc_stride=128,
    max_query_length=64,
    is_training=True,
    return_dataset=False,
    threads=1
)

# Save the features to disk
with open('SQuAD/training_features.pkl', 'wb') as f:
    pickle.dump(train_features, f)

3.3. Original BERT Q&A

Ask the question directly of the BERT model that has not been fine-tuned. The implementation code is as follows:

from transformers import BertForQuestionAnswering, BertTokenizer, AdamW
import torch
from torch.utils.data import TensorDataset

# Use the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Download BERT without fine-tuning
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased').to(device)

# Evaluate the performance of BERT before fine-tuning
def china_capital():
    question, text = "What is the population of Shenzhen? ", "The population of Shenzhen is approximately 13 million."
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs.to(device))
    answer_start_index = torch.argmax(outputs.start_logits)
    answer_end_index = torch.argmax(outputs.end_logits) + 1
    predict_answer_tokens = inputs['input_ids'][0][answer_start_index:answer_end_index]
    predicted_answer = tokenizer.decode(predict_answer_tokens)
    print("深圳的人口是多少?", predicted_answer)

china_capital() 

Output result:

What is the population of Shenzhen? what is the population of shenzhen? [SEP] the population of shenzhen is

Judging by the results, it failed to return the correct answer.

3.4. Load SQuAD feature dataset

Before training, the features of the SQuAD 2.0 dataset need to be loaded and converted into PyTorch tensors. The tensors are then combined into a training dataset, and a dataloader randomly samples and batches the training data for use during training.

import torch
from transformers import BertTokenizer, BertForQuestionAnswering, AdamW
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers.data.processors.squad import SquadV2Processor, SquadExample, squad_convert_examples_to_features

# Load the features of the SQuAD 2.0 dataset
import pickle
with open('SQuAD/training_features.pkl', 'rb') as f:
    train_features = pickle.load(f)

# Define the training parameters
train_batch_size = 8
num_epochs = 3
learning_rate = 3e-5

# Convert the features into PyTorch tensors
all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_attention_mask = torch.tensor([f.attention_mask for f in train_features], dtype=torch.long)
all_token_type_ids = torch.tensor([f.token_type_ids for f in train_features], dtype=torch.long)
all_start_positions = torch.tensor([f.start_position for f in train_features], dtype=torch.long)
all_end_positions = torch.tensor([f.end_position for f in train_features], dtype=torch.long)

# For demonstration, only the first 100 samples are used for training
num_samples = 100
train_dataset = TensorDataset(
    all_input_ids[:num_samples],
    all_attention_mask[:num_samples],
    all_token_type_ids[:num_samples],
    all_start_positions[:num_samples],
    all_end_positions[:num_samples])
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=train_batch_size)

Here we use the pickle library to open the feature file (training_features.pkl) saved on disk and load the feature data into the variable train_features.

The feature data is then converted into PyTorch tensors: torch.tensor() turns each field of the features into a tensor.

Next, the code combines these tensors into a training dataset (train_dataset) using TensorDataset. It also slices out a subset so that only the first 100 samples are used for training in this example; the subset size is controlled by the variable num_samples.

Finally, RandomSampler draws random samples from the training dataset, and DataLoader turns it into an iterable dataloader (train_dataloader) that splits the data into mini-batches of the specified size (train_batch_size) for training.

3.5. Fine-tuning BERT with SQuAD

After loading the dataset and preprocessing the data, we load the BERT model and the optimizer and fine-tune the BERT model.

# Load the BERT model and the optimizer
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased').to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Fine-tune BERT
for epoch in range(num_epochs):
    for step, batch in enumerate(train_dataloader):
        model.train()
        optimizer.zero_grad()
        input_ids, attention_mask, token_type_ids, start_positions, end_positions = tuple(t.to(device) for t in batch)
        outputs = model(input_ids=input_ids, 
                        attention_mask=attention_mask, 
                        token_type_ids=token_type_ids, 
                        start_positions=start_positions, 
                        end_positions=end_positions)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        # Print the training loss every 5 steps
        if step % 5 == 0:
            print(f"Epoch [{epoch+1}/{num_epochs}], Step [{step+1}/{len(train_dataloader)}], Loss: {loss.item():.4f}")

# Save the fine-tuned model
model.save_pretrained("04 BERT/SQuAD/SQuAD_finetuned_bert")            

First, BertForQuestionAnswering.from_pretrained('bert-base-uncased') loads a pre-trained BERT model and moves it to the specified device (device), which can be the CPU or a GPU depending on your setup.

Next, the AdamW optimizer is used to optimize the parameters of the BERT model. AdamW is a commonly used optimization algorithm for updating model weights during fine-tuning, and lr=5e-5 specifies the initial learning rate.

Fine-tuning then runs in a loop: the outer loop runs for num_epochs training epochs, and the inner loop iterates over the training data, where train_dataloader is the dataloader that yields the batches.

At each step, the code puts the model into training mode (model.train()), zeroes the optimizer's gradients (optimizer.zero_grad()), and moves the batch to the chosen device. The inputs are then passed to the BERT model in a forward pass. The output contains the loss (outputs.loss), which is backpropagated (loss.backward()) before the optimizer updates the model parameters (optimizer.step()).

The code also prints the training loss every 5 steps.

Finally, model.save_pretrained() saves the fine-tuned model to the specified path. This stores the model's weights and configuration so that the fine-tuned model can be loaded and used later.

3.6. Use the fine-tuned BERT to do reasoning and answer questions

After training is complete, we can use the trained BERT model for inference and question answering. Here we call the china_capital() function defined earlier and ask the fine-tuned model the same question.
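
If inference runs in a fresh session, the fine-tuned weights saved in section 3.5 would first be reloaded. A minimal sketch, assuming the save path used above:

import torch
from transformers import BertForQuestionAnswering, BertTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Load the weights written by model.save_pretrained() in section 3.5
model = BertForQuestionAnswering.from_pretrained("04 BERT/SQuAD/SQuAD_finetuned_bert").to(device)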

china_capital() 

Output result:

is approximately 13 million

From the result we can see that after fine-tuning, BERT correctly understands the question and finds the right answer.

4. Fine-tuning BERT for text classification with Transformers

Transformer models have shown amazing results in most tasks in the field of natural language processing. The combination of transfer learning and large-scale Transformer language models has become the standard for state-of-the-art NLP.

Next we describe how to fine-tune BERT (or other Transformer models) for text classification with the Huggingface Transformers library on a dataset of your choice. If you want to train BERT from scratch, pre-training would be required instead.

We will use the 20 Newsgroups dataset as the fine-tuning data; it contains about 18,000 news posts on 20 different topics. If you have a custom dataset suitable for classification, you can follow similar steps with only a few changes.

4.1. Settings

First, let's install the Huggingface Transformers library, along with the other libraries we need:

pip install -U transformers
pip install -U accelerate

Import the necessary modules:

import torch
from transformers.file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import numpy as np
import random
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

Next, we create a helper function that makes results reproducible across runs by setting the random seed in the different modules:

def set_seed(seed: int):
    """
    Helper function to set the seed in random, numpy, torch and/or tf (if installed)
    for reproducible behavior.

    Args:
        seed (:obj:`int`): the seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # it is safe to call this function even when CUDA is not available
    if is_tf_available():
        import tensorflow as tf
        # set the seed in the tf module
        tf.random.set_seed(seed)

set_seed(1)

We will use the BERT model; more specifically, the bert-base-uncased pre-trained weights from the library.

# The model we are going to train: uncased base BERT
# Text classification models can be browsed here: https://huggingface.co/models?filter=text-classification
model_name = "bert-base-uncased"

# maximum sequence length for each document/sentence sample
max_length = 512

max_length is the maximum length of our sequences; in other words, we only keep the first 512 tokens of each document or post. You can change it, but it is recommended to check memory consumption during training before increasing it.

4.2. Load the dataset

Next, we load a BertTokenizerFast object to tokenize and encode the text for input into the BERT model. Two parameters matter here:

  • model_name: specifies the name of the pre-trained model to load, such as "bert-base-uncased" or "bert-base-chinese". Different pre-trained models may have different vocabularies and tokenization rules, so choose the model that suits your task and data.

  • do_lower_case: specifies whether to lowercase the text. Generally, if your pre-trained model is uncased (such as "bert-base-uncased"), set this parameter to True; if it is cased (such as "bert-base-cased"), set it to False.

# Load the tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)


Define a read_20newsgroups function to download and load the dataset:

def read_20newsgroups(test_size=0.2):
  # Download and load the 20 newsgroups dataset from sklearn's repository
  dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers", "footers", "quotes"))
  documents = dataset.data
  labels = dataset.target

  # Split the dataset into training and test sets, and return the data and the label names
  return train_test_split(documents, labels, test_size=test_size), dataset.target_names

# Call the function
(train_texts, valid_texts, train_labels, valid_labels), target_names = read_20newsgroups()

train_texts and valid_texts are lists of documents (lists of strings) for the training and validation sets, respectively; train_labels and valid_labels are lists of integer labels from 0 to 19. target_names is a list of the names of our 20 labels.

Now we encode the corpus using a tokenizer:

# Tokenize the dataset: truncate when longer than max_length and pad when shorter than max_length
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=max_length)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=max_length)

We set truncation to True so that tokens beyond max_length are dropped, and we set padding to True so that documents shorter than max_length are padded with empty tokens.

The following code wraps our tokenized text data in a torch Dataset:

class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

# Convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_labels)
valid_dataset = NewsGroupsDataset(valid_encodings, valid_labels)

Since we're going to use the Trainer from the Transformers library, which expects our dataset to be a torch.utils.data.Dataset, we write a simple class implementing the __len__() method, which returns the number of samples, and the __getitem__() method, which returns the data sample at a particular index.

4.3. Training model

Now that we have our data ready, let's download and load our BERT model and its pretrained weights:

# Load the model onto the GPU (cuda)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to("cuda")

We use the BertForSequenceClassification class from the Transformers library and set num_labels to the number of available labels, which is 20.

The model is then moved to the CUDA GPU for execution. If you are running on a CPU (not recommended), simply remove the to() call.

Before we start fine-tuning the model, let's create a simple function to compute the metric we care about. Here it is accuracy, but you are free to include other metrics such as precision and recall; a variant with those added is sketched after the code below.

from sklearn.metrics import accuracy_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # compute accuracy using sklearn
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }
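
If you also want precision, recall, and F1, a variant of compute_metrics could use sklearn's precision_recall_fscore_support; this is a sketch assuming the same pred object that the Trainer passes in:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # macro-average precision/recall/F1 over the 20 classes
  precision, recall, f1, _ = precision_recall_fscore_support(
      labels, preds, average="macro", zero_division=0)
  return {
      'accuracy': accuracy_score(labels, preds),
      'precision': precision,
      'recall': recall,
      'f1': f1,
  }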

The code below uses the TrainingArguments class to specify our training parameters, such as the number of epochs, the batch size, and a few others:

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model at the end of training (the default metric is loss)
    # but you can specify metric_for_best_model to change it to accuracy or another metric
    logging_steps=400,               # log and save weights every logging_steps
    save_steps=400,
    evaluation_strategy="steps",     # evaluate every logging_steps
)

Each parameter is explained in the code comments. I chose 8 as the training batch size because that was the maximum that fit in memory in the Google Colab environment. If you encounter CUDA out-of-memory errors, reduce this value; if you are using a more powerful GPU, increasing the batch size will speed up training significantly. You can also tune other parameters, for example increasing the number of epochs for better results.

logging_steps and save_steps are both set to 400, which means the model is evaluated and a checkpoint is saved every 400 steps. Consider increasing this value if you reduce the batch size below 8, because saving checkpoints consumes a lot of disk space and can fill up the environment's storage.
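
If disk space is a concern, TrainingArguments also accepts a save_total_limit argument that automatically deletes older checkpoints. A minimal variant of the arguments above, shown only to illustrate this option:

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    logging_steps=400,
    save_steps=400,
    save_total_limit=2,              # keep only the two most recent checkpoints on disk
    evaluation_strategy="steps",
)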

Then we pass the training arguments, the datasets, and the compute_metrics callback to our Trainer object:

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # callback that computes the metrics of interest
)

Train the model:

# Train the model
trainer.train()

The training process may take minutes or hours depending on your environment; here are the results from a run on Google Colab:

***** Running training *****
  Num examples = 15076
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 5655
 [5655/5655 1:38:57, Epoch 3/3]
Step    Training Loss   Validation Loss Accuracy
400    2.375800    1.362277    0.615915
800    1.248300    1.067971    0.670822
1200    1.107000    0.983286    0.705305
1600    1.069100    0.974196    0.714589
2000    0.960900    0.880331    0.735013
2400    0.729300    0.893299    0.730769
2800    0.671300    0.863277    0.758621
3200    0.679900    0.868441    0.752785
3600    0.651800    0.862627    0.762599
4000    0.501500    0.884086    0.761538
4400    0.377700    0.876371    0.778249
4800    0.395800    0.891642    0.777984
5200    0.341400    0.889924    0.782493
5600    0.372800    0.894866    0.779841

TrainOutput(global_step=5655, training_loss=0.8157047524692526, metrics={'train_runtime': 5942.2004, 'train_samples_per_second': 7.611, 'train_steps_per_second': 0.952, 'total_flos': 1.1901910025060352e+16, 'train_loss': 0.8157047524692526, 'epoch': 3.0})

As can be seen from the training results, the validation loss gradually decreases while the accuracy rises to about 78%.

We set load_best_model_at_end to True so that the best-performing model is automatically loaded at the end of training; we can confirm this with the evaluate() method:

# Evaluate the current model after training
trainer.evaluate()

Output result:

{'eval_loss': 0.8626272082328796,
 'eval_accuracy': 0.7625994694960212,
 'eval_runtime': 115.0963,
 'eval_samples_per_second': 32.755,
 'eval_steps_per_second': 1.642,
 'epoch': 3.0}

Now that we have trained the model, save the model for later inference:

# Save the fine-tuned model and the tokenizer
model_path = "20newsgroups-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Output result:

('20newsgroups-bert-base-uncased/tokenizer_config.json',
 '20newsgroups-bert-base-uncased/special_tokens_map.json',
 '20newsgroups-bert-base-uncased/vocab.txt',
 '20newsgroups-bert-base-uncased/added_tokens.json',
 '20newsgroups-bert-base-uncased/tokenizer.json')

4.4. Reload model/tokenizer

# Only needed when running in a separate Python file, not in the same notebook session
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=len(target_names)).to("cuda")
tokenizer = BertTokenizerFast.from_pretrained(model_path)

4.5. Performing inference

The function below takes a text string, tokenizes and encodes it with our tokenizer, computes the output probabilities with the softmax function, and returns the predicted label:

def get_prediction(text):
    # prepare the text as a tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to("cuda")
    # run inference with the model
    outputs = model(**inputs)
    # get the output probabilities by applying softmax
    probs = outputs[0].softmax(1)
    # apply argmax to get the predicted label
    return target_names[probs.argmax()]

4.6. Example predictions

Here is an example:

# Example 1: baseball classification
text = """
This newsgroup is a discussion platform for baseball fans and players. 
It covers topics such as game results, statistics, strategies, rules, equipment, and history. You can also find news and opinions about professional baseball leagues, such as MLB, NPB, KBO, and CPBL. 
If you love baseball, this is the place for you to share your passion and knowledge with other enthusiasts.
"""
print(get_prediction(text))

Output result:

rec.sport.baseball

As expected, the model predicts the baseball newsgroup. Here is a second example:

# Example 2: computer graphics
text = """
This newsgroup is a discussion platform for computer graphics enthusiasts and professionals. It covers topics such as algorithms, software, hardware, formats, standards, and applications of computer graphics. 
You can also find tips and tutorials on how to create and manipulate graphics using various tools and techniques. If you are interested in computer graphics, this is the place for you to learn and exchange ideas with other experts.
"""
print(get_prediction(text))

Output result:

comp.graphics

This is the computer graphics label, as expected!

Another example:

# Example 3: medical news
text = """
Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus.
Most people infected with the COVID-19 virus will experience mild to moderate respiratory illness and recover without requiring special treatment.  
Older people, and those with underlying medical problems like cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness.
"""
print(get_prediction(text))

Output result:

sci.med

To see which categories the news dataset contains, we can print the target_names variable:

[
   'alt.atheism',
   'comp.graphics',
   'comp.os.ms-windows.misc',
   'comp.sys.ibm.pc.hardware',
   'comp.sys.mac.hardware',
   'comp.windows.x',
   'misc.forsale',
   'rec.autos',
   'rec.motorcycles',
   'rec.sport.baseball',
   'rec.sport.hockey',
   'sci.crypt',
   'sci.electronics',
   'sci.med',
   'sci.space',
   'soc.religion.christian',
   'talk.politics.guns',
   'talk.politics.mideast',
   'talk.politics.misc',
   'talk.religion.misc'
]

5. Conclusion

This article describes how to fine-tune a BERT model on your own dataset using the Huggingface Transformers library. BERT is a Transformer-based bidirectional encoder pre-trained to capture the relationships between the words in a sentence, and it has brought significant performance gains to several natural language processing tasks. By fine-tuning on specific downstream tasks, BERT can be adapted to various NLP applications, including question answering, sentiment analysis, and named entity recognition, and experiments show that it achieves state-of-the-art results on several benchmarks.
