Natural Language Processing Practical Project 13 - The Complete Process of Training a Keyword Extraction Model Based on a GRU Model and NER

Hello everyone, I am Weixue AI. Today I will introduce natural language processing practical project 13: the complete process of training a keyword extraction model based on the GRU model and NER. This article covers keyword extraction sample data, GRU model construction and training, named entity recognition (NER), and model evaluation and application. The goal of the project is to achieve accurate and robust keyword extraction by training a GRU model, and to improve the extraction results by integrating an NER model. The project provides a complete pipeline that can be adjusted and extended according to actual needs.

Contents:
1. GRU model introduction
2. NER method to extract keywords
3. NER method code implementation
4. Summary

1. Introduction to GRU model

GRU (Gated Recurrent Unit) is a variant of the recurrent neural network used for modeling sequential data. Compared with the traditional RNN structure, GRU introduces a gating mechanism to capture long-term dependencies and to alleviate the vanishing and exploding gradient problems.

The main components of the GRU model include:

1. Update Gate: determines how much of the previous hidden state is carried over to the current time step and how much new information is written in. It passes the current input and the previous hidden state through a sigmoid function and outputs values between 0 and 1 that act as update weights.

2. Reset Gate: decides how much of the previous hidden state should be ignored when computing the new candidate state. It also applies a sigmoid function to the current input and the previous hidden state; a value close to 0 means the corresponding part of the past is "forgotten".

3. Candidate Hidden State: a tanh layer applied to the current input together with the reset-gated previous hidden state, producing the new information that may be written into the hidden state.

4. Hidden State: used to store information about the sequence, passed along and updated at each time step. The update gate interpolates between the previous hidden state and the candidate hidden state to produce the final hidden state; the corresponding update equations are given below.
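For reference, the gating described above is commonly written with the following update equations (here following the convention used by PyTorch's nn.GRU, where \sigma is the sigmoid function and \odot is element-wise multiplication; some references swap the roles of z_t and 1 - z_t):

r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)
h_t = (1 - z_t) \odot \tilde{h}_t + z_t \odot h_{t-1}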

The GRU model has the following advantages in sequence modeling tasks:

Handling long-term dependencies: the gating mechanism lets the GRU model selectively update and retain information in a sequence, so it handles long-term dependencies better than a plain RNN.

Alleviating gradient problems: thanks to the gating mechanism, the GRU model effectively alleviates vanishing and exploding gradients, improving training effectiveness and stability.

Fewer parameters: compared with the long short-term memory network (LSTM), the GRU model has fewer parameters and is easier to train and tune; a quick comparison follows below.
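As a quick check of this last point, parameter counts can be compared directly in PyTorch (a minimal sketch; the sizes 128/128 are arbitrary and chosen only for illustration):

import torch.nn as nn

# A GRU layer has 3 gate weight groups while an LSTM layer has 4,
# so with identical sizes the GRU holds roughly 3/4 of the LSTM's parameters.
gru = nn.GRU(input_size=128, hidden_size=128)
lstm = nn.LSTM(input_size=128, hidden_size=128)
print(sum(p.numel() for p in gru.parameters()))   # 99072
print(sum(p.numel() for p in lstm.parameters()))  # 132096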

2. NER method to extract keywords

NER can be used for keyword extraction by identifying named entities in the text and selecting keywords from them. Compared with traditional keyword extraction methods, NER has the following advantages:

1. Accuracy: NER can precisely locate specific entities in the text and therefore provides more accurate keyword extraction results.

2. Context understanding: NER does not simply extract surface words; it interprets entities according to their context, which improves the accuracy of keyword extraction.

3. Adaptability to multiple domains: because NER understands context, it can be applied to keyword extraction in different domains, such as news, medicine, and law.

The workflow of NER usually includes the following steps:

1. Data preparation: collect and annotate training data; the annotations should include the position of each entity in the text and its corresponding label.

2. Feature extraction: select appropriate features from the text to represent entities, such as part of speech and surrounding context; these features are used to train the model.

3. Model training: use the labeled training data to train a machine learning model, such as a Conditional Random Field (CRF) or a Recurrent Neural Network (RNN).

4. Label prediction: apply the trained model to new text and mark the position and category of each entity.

5. Post-processing: depending on the task requirements, post-process the NER results, for example filtering out irrelevant entities or merging adjacent entities.

6. Keyword extraction: from the extracted entities, select those with key meaning as keywords. A minimal sketch of steps 5 and 6 follows this list.
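As an illustration of steps 5 and 6, turning a predicted tag sequence back into keyword entities could look like the following (a minimal sketch, not taken from the original project; extract_keywords is a hypothetical helper name, and the BIO tag scheme matches the one used in the code below):

def extract_keywords(tokens, tags, keep_types=("PER", "LOC")):
    """Collect B-/I- tagged spans and return the surface strings of the kept entity types."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity starts here
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:   # continuation of the current entity
            current.append(token)
        else:                                    # "O" closes any open entity
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return [word for word, etype in entities if etype in keep_types]

# Example usage with a toy tag sequence:
print(extract_keywords(["李明", "想去", "北京", "游玩"], ["B-PER", "O", "B-LOC", "O"]))
# -> ['李明', '北京']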

3. NER method code implementation

import torch
import torch.nn as nn
from torch.optim import Adam

# Define the model: embedding -> GRU -> linear projection to tag scores
class KeywordExtractor(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(KeywordExtractor, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size)
        # batch_first=True so the GRU reads inputs as (batch, seq_len, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, input):
        # input: (batch, seq_len) -> embedded: (batch, seq_len, hidden_size)
        embedded = self.embedding(input)
        output, hidden = self.gru(embedded)
        # Project every hidden state to a score vector over the tag set
        output = self.linear(output.reshape(-1, self.hidden_size))
        return output.view(input.size(0), -1, output.size(1))


# Prepare the training data: word-segmented sentences with BIO tags
train_data = [
    ("我 爱 北京", ["O", "O", "B-LOC"]),
    ("张三 是 中国 人", ["B-PER", "O", "B-LOC", "O"]),
    ("李四 是 美国 人", ["B-PER", "O", "B-LOC", "O"]),
    ("我 来自 北京", ["O", "O", "B-LOC"]),
    ("我 来自 广州", ["O", "O", "B-LOC"]),
    ("王五 去 英国 玩", ["B-PER", "O", "B-LOC", "O"]),
    ("我 喜欢 上海", ["O", "O", "B-LOC"]),
    ("刘东 是 北京 人", ["B-PER", "O", "B-LOC", "O"]),
    ("李明 来自 深圳", ["B-PER", "O", "B-LOC"]),
    ("我 计划 去 香港 旅行", ["O", "O","O", "B-LOC", "O"]),
    ("你 想去 法国 吗", ["O", "O", "B-LOC", "O"]),
    ("福州 是 你的 家乡 吗", ["B-LOC", "O", "O", "O", "O"]),
    ("张伟 和 王芳 一起 去 新加坡", ["B-PER", "O", "B-PER", "O", "O", "B-LOC"]),
    # More training samples...
]

# Build the vocabulary and tag set
word2idx = {"<PAD>": 0, "<UNK>": 1}
tag2idx = {"O": 0, "B-LOC": 1, "B-PER": 2}
for sentence, tags in train_data:
    for word in sentence.split():
        if word not in word2idx:
            word2idx[word] = len(word2idx)
    for tag in tags:
        if tag not in tag2idx:
            tag2idx[tag] = len(tag2idx)
idx2word = {idx: word for word, idx in word2idx.items()}
idx2tag = {idx: tag for tag, idx in tag2idx.items()}

# Hyperparameters
input_size = len(word2idx)
output_size = len(tag2idx)
hidden_size = 128
num_epochs = 100
batch_size = 2
learning_rate = 0.001

# Instantiate the model and the loss function
model = KeywordExtractor(input_size, hidden_size, output_size)
criterion = nn.CrossEntropyLoss()

# Define the optimizer
optimizer = Adam(model.parameters(), lr=learning_rate)


# Convert a space-separated sentence into a tensor of token indices
def prepare_sequence(seq, to_idx):
    idxs = [to_idx.get(token, to_idx["<UNK>"]) for token in seq.split()]
    return torch.tensor(idxs, dtype=torch.long)

# Pad the data
def pad_sequences(data):
    # Find the length of the longest sentence in the batch
    max_length = max(len(item[0].split()) for item in data)
    aligned_data = []
    for sentence, tags in data:
        words = sentence.split()
        # Pad words with the <PAD> token and tags with the "O" label
        padded_words = words + ['<PAD>'] * (max_length - len(words))
        padded_tags = tags + ['O'] * (max_length - len(tags))
        aligned_data.append((' '.join(padded_words), padded_tags))

    return aligned_data

# Train the model
for epoch in range(num_epochs):
    for i in range(0, len(train_data), batch_size):
        batch_data = train_data[i:i + batch_size]
        batch_data = pad_sequences(batch_data)

        inputs = torch.stack([prepare_sequence(sentence, word2idx) for sentence, _ in batch_data])
        targets = torch.LongTensor([tag2idx[tag] for _, tags in batch_data for tag in tags])

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs.view(-1, output_size), targets)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch + 1, num_epochs, i + 1, len(train_data) // batch_size, loss.item()))

# Test the model on a new sentence
test_sentence = "李明 想去 北京 游玩"
with torch.no_grad():
    inputs = prepare_sequence(test_sentence, word2idx).unsqueeze(0)
    outputs = model(inputs)
    _, predicted = torch.max(outputs.data, 2)
    tags = [idx2tag[idx.item()] for idx in predicted.squeeze()]
    print('Input sentence:', test_sentence)
    print('Keyword tags:', tags)

Running results:

Epoch [99/100], Step [3/6], Loss: 0.0006
Epoch [99/100], Step [5/6], Loss: 0.0002
Epoch [99/100], Step [7/6], Loss: 0.0002
Epoch [99/100], Step [9/6], Loss: 0.0006
Epoch [99/100], Step [11/6], Loss: 0.0006
Epoch [99/100], Step [13/6], Loss: 0.0009
Epoch [100/100], Step [1/6], Loss: 0.0003
Epoch [100/100], Step [3/6], Loss: 0.0006
Epoch [100/100], Step [5/6], Loss: 0.0002
Epoch [100/100], Step [7/6], Loss: 0.0002
Epoch [100/100], Step [9/6], Loss: 0.0005
Epoch [100/100], Step [11/6], Loss: 0.0006
Epoch [100/100], Step [13/6], Loss: 0.0009

Input sentence: 李明 想去 北京 游玩
Keyword tags: ['B-PER', 'O', 'B-LOC', 'O']
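From this tag sequence, the keywords themselves can be recovered by pairing each token with its predicted tag (a minimal follow-up sketch, not part of the original script):

# Keep only the tokens whose predicted tag marks an entity
keywords = [w for w, t in zip(test_sentence.split(), tags) if t != 'O']
print('Keywords:', keywords)  # expected: ['李明', '北京']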

4. Summary

Named entity recognition (NER) is a natural language processing technique whose purpose is to identify and extract named entities with specific meanings from text. These named entities can be person names, place names, organization names, times, dates, and other words with specific meanings.

The task of NER is to label each word in the text with a predefined named entity category. Common categories include person names (PERSON), place names (LOCATION), and organization names (ORGANIZATION). Through NER, the key information in a text can be extracted, helping to understand its meaning and context.

The core idea of NER is to combine machine learning and natural language processing techniques, using a trained model to analyze and process text. Commonly used approaches include rule-based, statistical, and machine-learning-based methods. Among them, machine-learning-based methods are trained on large-scale labeled data sets and improve recognition accuracy by learning the patterns and regularities of named entities.

NER has a wide range of application scenarios in practice, including information extraction, intelligent search, and question answering systems. The keywords extracted by NER can be used for further information processing and analysis, which helps improve text understanding and downstream processing.

In summary, named entity recognition (NER) is an important natural language processing technique that can extract named entities with specific meanings from text. It plays an important role in various application scenarios and provides strong support for text analysis and information extraction.
