Natural language processing practical project 8 - building the BERT model and training BERT for the entity extraction and recognition task

Hello everyone, I am Weixue AI. Today I will introduce natural language processing practical project 8: building the BERT model and training it for the entity extraction and recognition task. BERT is a deep learning model for natural language processing that is trained to understand the contextual relationships between words and to provide high-quality language representations for downstream tasks. Its structure is a stack of Transformer encoders, and each Transformer encoder layer is built around multi-head self-attention. During pre-training, the model improves the quality of its language representations by predicting masked words and by judging the relationship between two sentences. In an entity recognition task, BERT can be used as a feature extractor: the context-dependent vector representation of each token is fed into a classifier to complete the recognition.

1. The framework of the BERT model

The basic structure of BERT is a multi-layer Transformer encoder. The Transformer encoder is built around self-attention, a mechanism that lets the model capture the important relationships between different words. Concretely, BERT uses multiple self-attention heads to generate a vector representation for each token in a text sequence while taking the context of the whole sentence into account. These vector representations are combined and refined from the lower layers to the higher ones, allowing the model to learn increasingly complex semantic structure.
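To make this concrete, here is a minimal sketch (my addition, not part of the original project code) that loads the same bert-base-chinese checkpoint used later in this article and prints the shape of the per-token contextual vectors produced by the encoder:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("李华山是一个优秀的程序员。", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# last_hidden_state holds one context-dependent vector per token: (batch, seq_len, hidden_size)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)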

The BERT model comes in two main pre-trained configurations:
1. BERT-Base: 12 encoder layers, 12 self-attention heads and a hidden size of 768, with about 110M parameters in total.
2. BERT-Large: 24 encoder layers, 16 self-attention heads and a hidden size of 1024, with about 340M parameters in total.
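As a quick check (illustrative only), the configuration shipped with the bert-base-chinese checkpoint used below corresponds to the BERT-Base setting:

from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-chinese")
print(config.num_hidden_layers)    # 12 encoder layers
print(config.num_attention_heads)  # 12 self-attention heads
print(config.hidden_size)          # hidden size of 768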

2. BERT pre-training and fine-tuning

The training of BERT is divided into two stages: pre-training and fine-tuning.

2.1 Pre-training:

In the pre-training phase, BERT's innovation is to train bidirectionally on a large amount of unlabeled text. At this stage, BERT introduces two pre-training tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
Masked Language Modeling (MLM): During training, a portion of the words in an input sentence are randomly replaced with a special mask symbol ([MASK]). The goal of the model is to predict the masked words from the contextual information in the rest of the sentence. This enables the model to learn bidirectional semantic information.
Next Sentence Prediction (NSP): This task lets the model learn to understand the relationship between sentences. Given a pair of sentences, the model must predict whether the second sentence immediately follows the first. This helps BERT handle tasks that require understanding the relationship between multiple sentences, such as question answering and natural language inference.
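To illustrate the MLM objective, here is a minimal sketch (my addition, using the ready pre-trained masked-language-model head rather than training it) that asks bert-base-chinese to fill in a masked character:

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
mlm_model = BertForMaskedLM.from_pretrained("bert-base-chinese")

inputs = tokenizer("北京是中国的首[MASK]。", return_tensors="pt")
with torch.no_grad():
    logits = mlm_model(**inputs).logits

# locate the [MASK] position and print the model's top prediction for it
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(-1)
print(tokenizer.decode(predicted_id))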

2.2 Fine-tuning:

After the pre-training phase is completed, the BERT model has learned rich semantic representations. For an actual NLP task, we then fine-tune the pre-trained BERT. The fine-tuning stage needs comparatively little labeled data to optimize the model for the specific task.
During fine-tuning, a task-specific neural network layer, such as a fully connected layer or a convolutional layer, is usually added on top of the BERT model and trained end to end together with the entire BERT model. During training, the loss is computed on the labeled data and the parameters are updated by gradient descent. After fine-tuning, BERT is able to produce results targeted at the specific task.
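Conceptually, the task head added during fine-tuning can be as small as a dropout plus a linear layer on top of the encoder output. The sketch below is a simplified illustration of that idea for token classification (in section 3 the ready-made BertForTokenClassification class plays this role):

import torch.nn as nn
from transformers import BertModel

class BertTokenClassifier(nn.Module):
    def __init__(self, num_labels, model_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)  # pre-trained encoder
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        # contextual vector for every token, then a per-token label score
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(self.dropout(hidden))  # (batch, seq_len, num_labels)

During fine-tuning, the parameters of both the new classifier and the underlying BERT encoder are updated end to end.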

3. Training BERT to realize the task of entity extraction and recognition

The following is complete code for Chinese named entity recognition with BERT, using PyTorch and the Hugging Face Transformers library. First, load the third-party libraries and read in the data samples.

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForTokenClassification, AdamW
from sklearn.model_selection import train_test_split
from tqdm import tqdm

def load_data_from_txt(file_path):
    # each non-empty line holds one character and its BIO tag, separated by whitespace
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        token, label = line.split()
        data.append((token, label))
    return data

file_path = 'ner_data.txt'
data = load_data_from_txt(file_path)
print(data)

ner_data.txt file data sample:

李 B-PER
华 I-PER
山 I-PER
是 O
一 O
个 O
优 O
秀 O
的 O
程 O
序 O
员 O
。 O
阿 B-ORG
里 I-ORG
巴 I-ORG
巴 I-ORG
是 O
一 O
家 O
著 O
名 O
的 O
中 B-LOC
国 I-LOC
公 O
司 O
。 O
陈 B-PER
明 I-PER
在 O
北 B-LOC
京 I-LOC
上 O
了 O
一 O
所 O
大 O
学 O
。 O

Model loading and training:

# Preprocessing: tokenizer and label mapping
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
# the label set also reserves B-EDU/I-EDU tags, although the sample data above contains no EDU entities
label_map = {"B-PER": 0, "I-PER": 1, "B-ORG": 2, "I-ORG": 3, "B-EDU": 4, "I-EDU": 5, "B-LOC": 6, "I-LOC": 7, "O": 8}
inverse_label_map = {v: k for k, v in label_map.items()}

class NERDataset(Dataset):
    def __init__(self, data, tokenizer, label_map):
        self.data = data
        self.tokenizer = tokenizer
        self.label_map = label_map

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # each sample is a single character and its BIO tag
        token, label = self.data[idx]
        input_ids = self.tokenizer.encode(token, add_special_tokens=False)
        label_id = self.label_map[label]
        return torch.tensor(input_ids, dtype=torch.long), torch.tensor(label_id, dtype=torch.long)

dataset = NERDataset(data, tokenizer, label_map)
train_data, val_data = train_test_split(dataset, test_size=0.2, random_state=42)
train_loader = DataLoader(train_data, batch_size=1, shuffle=True)
val_loader = DataLoader(val_data, batch_size=1, shuffle=False)

# Model
model = BertForTokenClassification.from_pretrained("bert-base-chinese", num_labels=len(label_map))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer (transformers' AdamW; in newer versions of transformers this class is deprecated,
# and torch.optim.AdamW is commonly used instead)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    total_correct = 0
    total_count = 0

    for batch in tqdm(train_loader):
        input_ids, labels = batch
        input_ids = input_ids.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_correct += (outputs.logits.argmax(-1) == labels).sum().item()
        total_count += labels.size(0)

    avg_loss = total_loss / total_count
    accuracy = total_correct / total_count
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}")

Running result:

100%|██████████| 97/97 [00:39<00:00,  2.47it/s]
Epoch 1/10, Loss: 1.4691, Accuracy: 0.5979
100%|██████████| 97/97 [00:40<00:00,  2.42it/s]
Epoch 2/10, Loss: 1.3695, Accuracy: 0.6598
100%|██████████| 97/97 [00:39<00:00,  2.48it/s]
Epoch 3/10, Loss: 1.2924, Accuracy: 0.5979
100%|██████████| 97/97 [00:39<00:00,  2.48it/s]
Epoch 4/10, Loss: 1.3100, Accuracy: 0.6701
100%|██████████| 97/97 [00:37<00:00,  2.59it/s]
Epoch 5/10, Loss: 1.2179, Accuracy: 0.6598
100%|██████████| 97/97 [00:40<00:00,  2.39it/s]
Epoch 6/10, Loss: 0.9726, Accuracy: 0.6495
100%|██████████| 97/97 [00:39<00:00,  2.46it/s]
Epoch 7/10, Loss: 1.0536, Accuracy: 0.6186
100%|██████████| 97/97 [00:40<00:00,  2.42it/s]
Epoch 8/10, Loss: 0.9458, Accuracy: 0.6907
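The training loop above never uses the val_loader it builds, and the article stops at the training metrics. As a minimal sketch (my addition, assuming the trained model, tokenizer, inverse_label_map, device and val_loader defined above are still in scope), the snippet below measures accuracy on the validation split and then tags a new sentence one character at a time, mirroring how the dataset feeds the model:

model.eval()

# 1) accuracy on the held-out validation split (mirrors the training loop, without the backward pass)
correct, count = 0, 0
with torch.no_grad():
    for input_ids, labels in val_loader:
        input_ids, labels = input_ids.to(device), labels.to(device)
        logits = model(input_ids).logits                 # shape (1, 1, num_labels)
        correct += (logits.argmax(-1).squeeze(-1) == labels).sum().item()
        count += labels.size(0)
print(f"Validation accuracy: {correct / count:.4f}")

# 2) tag a new sentence character by character, as the dataset does
text = "王小明在上海工作。"
with torch.no_grad():
    for ch in text:
        ids = tokenizer.encode(ch, add_special_tokens=False)
        input_ids = torch.tensor([ids], dtype=torch.long).to(device)
        pred = model(input_ids).logits.argmax(-1).item()
        print(ch, inverse_label_map[pred])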

Origin blog.csdn.net/weixin_42878111/article/details/130928796