Natural Language Processing Practical Project 5: Processing Text Data for Model Input, Taking Named Entity Recognition as an Example — NLP Model Training from 0 to 1

Hello everyone, I am Weixue AI. Today I bring you Natural Language Processing Practical Project 5: processing text data for model input, taking named entity recognition as an example. Suppose we have a named entity recognition (NER) task that needs to identify entities such as person names, places, and organizations in text, and we have some sample data annotated with entity labels. This article shows how to process and load that data so it can be fed into a model; data processing is the first step of any training pipeline.

1. Sample data set

Each character in our dataset is paired with a label, and a sample (sentence) may contain one or more entities. The data format is as follows:

陈 B-NAME
学 M-NAME
明 E-NAME
: O
1 O
9 O
7 O
7 O
年 O
5 O
月 O
出 O
生 O
, O
大 B-EDU
学 E-EDU
毕 O
业 O
, O
高 B-TITLE
级 M-TITLE
经 M-TITLE
济 M-TITLE
师 E-TITLE
。 O

Here, each line contains one character and its corresponding entity label. The labels follow the BMEO scheme: B marks the beginning of an entity (e.g., B-NAME marks the first character of a person's name), M marks the inside of an entity (e.g., M-NAME), E marks the end of an entity (e.g., E-NAME marks the last character of a name), and O marks characters that do not belong to any entity.


2. Data processing and loading

Here we walk through the data processing steps in detail.

1. Read data:

First, we need to read the data from file. We read it line by line, storing characters and labels separately, and we also load a vocabulary file (vocab.txt) that will be used to build the word table. Use the following code:

import numpy as np
import random

# Load the labeled data
with open("data.txt", "r", encoding="utf-8") as f:
    data = f.readlines()

# Load the vocabulary file
with open("vocab.txt", "r", encoding="utf-8") as fs:
    vocab = fs.readlines()
vocablist = []
for lines in vocab:
    if lines.strip():  # skip empty lines
        word = lines.split()  # each entry is a list like ["字"]
        vocablist.append(word)

# Group (character, label) pairs into sentences; blank lines separate sentences
sentences = []
words, labels = [], []
for line in data:
    if line.strip():  # not an empty line
        word, label = line.split()
        words.append(word)
        labels.append(label)
    else:  # blank line: the current sentence is complete
        if words:
            sentences.append((words, labels))
            words, labels = [], []
if words:  # in case the file does not end with a blank line
    sentences.append((words, labels))

2. Vocabulary and label encoding:

We need to convert characters and labels into numerical representations. Before doing this, we first build a word table from the vocabulary file and a label table from the dataset.

word_vocab = set(word[0] for word in vocablist)
label_vocab = set(label for sentence in sentences for label in sentence[1])

word2idx = {word: idx + 2 for idx, word in enumerate(word_vocab)}
word2idx["<PAD>"] = 0
word2idx["<UNK>"] = 1
print(word2idx)

label2idx = {label: idx for idx, label in enumerate(label_vocab)}

Now convert each sentence's characters and labels into their numerical representations:

data = []
for words, labels in sentences:
    word_ids = [word2idx.get(word, 1) for word in words]  # 1 is the index for <UNK>
    label_ids = [label2idx[label] for label in labels]
    data.append((word_ids, label_ids))
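
As a quick sanity check (a hypothetical snippet, not required by the pipeline), you can invert the mappings and confirm that the first converted sample round-trips back to characters and tags. Note that word_vocab and label_vocab are Python sets, so the assigned ids can differ between runs; save word2idx and label2idx to disk if you need stable encodings.

# Inverse mappings for inspection (ids vary across runs because sets are unordered)
idx2word = {idx: word for word, idx in word2idx.items()}
idx2label_check = {idx: label for label, idx in label2idx.items()}

sample_word_ids, sample_label_ids = data[0]
print([idx2word.get(i, "<UNK>") for i in sample_word_ids])
print([idx2label_check[i] for i in sample_label_ids])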

3. Split the dataset

In order to train and evaluate the model, we need to split the dataset into a training set, a validation set, and a test set. At this point each item in the data list is a (word_ids, label_ids) tuple. We shuffle the data and then split it:

def split_data(data, train_ratio=0.8, valid_ratio=0.1):
    "将数据拆分为训练集、验证集和测试集"
    total_samples = len(data)
    train_samples = int(train_ratio * total_samples)
    valid_samples = int(valid_ratio * total_samples)

    train_data = data[:train_samples]
    valid_data = data[train_samples: train_samples + valid_samples]
    test_data = data[train_samples + valid_samples:]

    return train_data, valid_data, test_data

random.shuffle(data)
train_data, valid_data, test_data = split_data(data)
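
A quick size check (a hypothetical snippet; the counts depend on your dataset) verifies the 80/10/10 split:

print(len(train_data), len(valid_data), len(test_data))  # roughly 80% / 10% / 10%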

4. Pad the sequences

Since the inputs to the neural network need to be of the same length, we pad the input sequences. We can pad both the word and label sequences with the <PAD> token (which has index 0). Note that label ids also start at 0, so a padded label position shares its id with a real tag; for a quick demo this is tolerable, but in practice you may want to mask padded positions (see the evaluation sketch in the training section below) or reserve a dedicated padding label.

def pad_sequences(sequences, maxlen=None, padding="post"):
    """Pad each sequence with zeros up to maxlen (the longest sequence by default)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)

    # Integer dtype, since these feed an embedding layer / sparse loss
    padded_sequences = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for i, seq in enumerate(sequences):
        if padding == "post":
            padded_sequences[i, :len(seq)] = seq
        else:  # pre-padding
            padded_sequences[i, -len(seq):] = seq

    return padded_sequences

train_inputs, train_labels = zip(*train_data)
train_inputs = pad_sequences(train_inputs)
train_labels = pad_sequences(train_labels)

valid_inputs, valid_labels = zip(*valid_data)
valid_inputs = pad_sequences(valid_inputs)
valid_labels = pad_sequences(valid_labels)

test_inputs, test_labels = zip(*test_data)
test_inputs = pad_sequences(test_inputs)
test_labels = pad_sequences(test_labels)
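
After padding, each split is a dense integer matrix. A quick sanity check (a hypothetical snippet; actual sizes depend on your dataset) confirms that inputs and labels stay aligned:

print(train_inputs.shape, train_labels.shape)  # e.g. (num_train_samples, max_train_len)
assert train_inputs.shape == train_labels.shape  # one label per character
assert valid_inputs.shape == valid_labels.shape

Each split is padded to its own maximum length, which is fine here because the model below accepts variable-length input (Input(shape=(None,))).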

3. Model building and training

Now we are ready to feed the data into a model. Using a deep learning framework such as TensorFlow/Keras, the processed data can be fed to the model in batches for training. Here we train with Keras; feel free to swap in a model of your own.

from tensorflow.keras import Model, layers, Input

# Define the model: embedding -> BiLSTM -> per-timestep softmax over labels
input_layer = Input(shape=(None,))
embedding_layer = layers.Embedding(input_dim=len(word2idx), output_dim=128)(input_layer)
lstm_layer = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(embedding_layer)
output_layer = layers.TimeDistributed(
    layers.Dense(len(label2idx), activation="softmax")
)(lstm_layer)

model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Train the model
num_epochs = 10
batch_size = 32
model.fit(x=train_inputs, y=train_labels, batch_size=batch_size, epochs=num_epochs, validation_data=(valid_inputs, valid_labels))

Training output:

...
Epoch 7/10
108/108 [==============================] - 4s 38ms/step - loss: 0.1368 - accuracy: 0.9654 - val_loss: 0.2655 - val_accuracy: 0.9324
Epoch 8/10
108/108 [==============================] - 4s 38ms/step - loss: 0.1882 - accuracy: 0.9491 - val_loss: 0.2092 - val_accuracy: 0.9370
Epoch 9/10
108/108 [==============================] - 4s 38ms/step - loss: 0.1552 - accuracy: 0.9587 - val_loss: 0.1423 - val_accuracy: 0.9672
Epoch 10/10
108/108 [==============================] - 4s 38ms/step - loss: 0.1401 - accuracy: 0.9680 - val_loss: 0.1787 - val_accuracy: 0.9674
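
The split earlier also produced a test set that training never touches. Below is a minimal evaluation sketch (my own addition, not part of the original pipeline); the sample_weight mask zeroes out padded positions, whose input id is 0, so padding does not inflate the accuracy:

# Evaluate on the held-out test set, masking out <PAD> positions (input id 0)
test_weights = (test_inputs != 0).astype("float32")
test_loss, test_acc = model.evaluate(
    x=test_inputs,
    y=test_labels,
    sample_weight=test_weights,
    batch_size=batch_size,
)
print(f"test loss: {test_loss:.4f}, test accuracy: {test_acc:.4f}")

Note: on older Keras versions, 2D (per-timestep) sample weights may require compiling with sample_weight_mode="temporal".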

4. Model prediction

idx2label = {idx: label for label, idx in label2idx.items()}
# Raw input text sentence
input_text = "陈明,男,1967年出生,本科学历,现在在微学软件有限公司上班,是董事长职位."

# Preprocess the input text: the model is character-level, so split into characters
input_words = list(input_text)
input_word_ids = [word2idx.get(word, 1) for word in input_words]  # 1 is the index for <UNK>
input_word_ids = np.array(input_word_ids)[None, :]  # add the batch dimension

# Run the model to get predictions
predictions = model.predict(input_word_ids)
pred_label_ids = np.argmax(predictions, axis=-1)

# Convert predicted label ids back to label strings
predicted_labels = [idx2label[label_id] for label_id in pred_label_ids[0]]

# Show each character with its predicted entity label
for word, label in zip(input_words, predicted_labels):
    print(f"{word} {label}")

The purpose of this article is to show how text data is processed and fed into a model before training, which is the key step. Stay tuned for more.

 
