PaddleNLP information extraction in practice (entity recognition, relation extraction)

Named entity recognition

NER (Named Entity Recognition) identifies the entities of interest in a sentence.
In this walkthrough, the annotation tool is brat, the labeling scheme is BIO, and the training framework is PaddleNLP; a small illustration of BIO labeling follows.
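As an illustration of the BIO scheme (the sentence and label set below are invented for demonstration, not taken from the original data):

```python
# Illustrative BIO labeling for "长沙是湖南省的省会"
# (Changsha is the capital of Hunan Province); the label set is hypothetical.
tokens = ["长", "沙", "是", "湖", "南", "省", "的", "省", "会"]
labels = ["B-LOC", "I-LOC", "O", "B-LOC", "I-LOC", "I-LOC", "O", "O", "O"]

# Every character gets exactly one tag: B-XXX starts an entity of type XXX,
# I-XXX continues it, and O means the character is outside any entity.
for token, label in zip(tokens, labels):
    print(token, label)
```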

The training algorithm uses ERNIE.
ERNIE (Enhanced Representation through Knowledge Integration) is a pre-training model released by Baidu. It extends the word-level masking in BERT to three levels of knowledge masking, so that the model can learn more linguistic knowledge, and it surpasses BERT on a range of practical tasks.
ERNIE divides language knowledge into three categories: word level (Basic-Level), phrase level (Phrase-Level), and entity level (Entity-Level). By masking (occluding) objects at these three levels, the model learns both the grammatical and the semantic knowledge contained in language. For example, if the training sentence is [Changsha is the capital city of Hunan Province] and the place-name entity [Changsha] is randomly masked, the model can, to a certain extent, learn the relationship between [Changsha] and [Hunan Province]; that is, it learns richer semantic knowledge.

The specific steps are as follows:

(1) Load custom data
1. Each training example contains a sentence of text and a label for every Chinese character in that text. The input sentence then needs further data processing, such as tokenization and mapping tokens to vocabulary ids.
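A minimal sketch of loading such data with paddlenlp.datasets.load_dataset; the file name train.txt and the line format (characters and labels joined by "\002", separated by a tab) are assumptions for illustration:

```python
from paddlenlp.datasets import load_dataset

# Assumed file format (one example per line): characters joined by "\002",
# a tab, then the per-character BIO labels joined by "\002".
def read_data(data_path):
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            words, labels = line.rstrip("\n").split("\t")
            yield {"tokens": words.split("\002"), "labels": labels.split("\002")}

# lazy=False loads everything into memory as a MapDataset.
train_ds = load_dataset(read_data, data_path="train.txt", lazy=False)
```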
(2) Data processing
1. The pre-training model ERNIE processes Chinese data in units of characters. Each pre-trained model comes with a corresponding built-in tokenizer; specify the name of the model you want to use and the matching tokenizer is loaded.
2. The tokenizer converts the raw input text into a form of input data that the model can accept.
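A small sketch of this step; the model name ernie-1.0 and the example sentence are placeholders:

```python
from paddlenlp.transformers import ErnieTokenizer

# Load the tokenizer that matches the pre-trained model by name.
tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")

# Convert raw text into model inputs: token ids plus segment (token type) ids.
encoded = tokenizer("长沙是湖南省的省会")
print(encoded["input_ids"])       # ids of [CLS], each character, and [SEP]
print(encoded["token_type_ids"])  # all zeros for a single sentence
```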
(3) Model training and evaluation
The process of model training usually has the following steps:

1. Take a batch of data from the dataloader.
2. Feed the batch to the model for a forward pass.
3. Pass the forward-pass output to the loss function to compute the loss, and to the evaluation method to compute the evaluation metric.
4. Back-propagate the loss to update the parameters, then repeat the above steps.
After each training epoch, the program runs one evaluation to assess how well the current model performs. A minimal sketch of this loop follows.
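This is a compressed sketch of such a training loop, assuming an ErnieForTokenClassification model, a hypothetical label set, and that train_loader / dev_loader are paddle.io.DataLoader objects yielding (input_ids, token_type_ids, seq_lens, labels) batches:

```python
import paddle
from paddlenlp.transformers import ErnieForTokenClassification
from paddlenlp.metrics import ChunkEvaluator

label_list = ["B-LOC", "I-LOC", "O"]  # hypothetical label set
model = ErnieForTokenClassification.from_pretrained("ernie-1.0", num_classes=len(label_list))
loss_fn = paddle.nn.CrossEntropyLoss(ignore_index=-1)
optimizer = paddle.optimizer.AdamW(learning_rate=2e-5, parameters=model.parameters())
metric = ChunkEvaluator(label_list=label_list)

# train_loader / dev_loader are assumed to exist and yield
# (input_ids, token_type_ids, seq_lens, labels) batches.
for epoch in range(3):
    model.train()
    for input_ids, token_type_ids, seq_lens, labels in train_loader:
        logits = model(input_ids, token_type_ids)   # 2. forward pass
        loss = loss_fn(logits, labels)               # 3. compute loss
        loss.backward()                              # 4. back-propagate
        optimizer.step()                             #    update parameters
        optimizer.clear_grad()

    # Evaluate once per epoch with chunk-level precision/recall/F1.
    model.eval()
    metric.reset()
    for input_ids, token_type_ids, seq_lens, labels in dev_loader:
        logits = model(input_ids, token_type_ids)
        preds = logits.argmax(axis=-1)
        metric.update(*metric.compute(seq_lens, preds, labels))
    precision, recall, f1 = metric.accumulate()
    print(f"epoch {epoch}: P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```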
(4) Model prediction
The model saved after training can be used for prediction; calling the predict() function gives a one-click prediction, and the model outputs its results on the test data, as sketched below.
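A minimal sketch of what such a prediction step could look like; the checkpoint path ./ernie_ckpt, the label set, and the predict() helper below are illustrative placeholders rather than a fixed PaddleNLP API:

```python
import paddle
from paddlenlp.transformers import ErnieForTokenClassification, ErnieTokenizer

label_list = ["B-LOC", "I-LOC", "O"]  # hypothetical label set
id2label = dict(enumerate(label_list))

# Load the fine-tuned weights saved during training (path is a placeholder).
model = ErnieForTokenClassification.from_pretrained("./ernie_ckpt", num_classes=len(label_list))
tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")
model.eval()

def predict(text):
    """One-off prediction for a single sentence: returns (character, tag) pairs."""
    encoded = tokenizer(text)
    input_ids = paddle.to_tensor([encoded["input_ids"]])
    token_type_ids = paddle.to_tensor([encoded["token_type_ids"]])
    logits = model(input_ids, token_type_ids)
    pred_ids = logits.argmax(axis=-1).numpy()[0][1:-1]   # drop [CLS] / [SEP]
    return list(zip(list(text), [id2label[int(i)] for i in pred_ids]))

print(predict("长沙是湖南省的省会"))
```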

Information extraction

Information extraction aims to extract structured knowledge, such as entities, relations, and events, from unstructured natural-language text. For a given natural-language sentence, all SPO (subject, predicate, object) triples that satisfy the schema constraints are extracted according to a predefined schema set, as illustrated below.
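As a rough illustration of the idea (the schema and triple below are invented examples):

```python
# A predefined schema describes, for each relation (predicate), the expected
# subject and object types; this one is an invented example.
schema = {"predicate": "capital", "subject_type": "province", "object_type": "city"}

# For the sentence "Changsha is the capital city of Hunan Province", the SPO
# (subject, predicate, object) triple satisfying this schema would be:
spo = {"subject": "Hunan Province", "predicate": "capital", "object": "Changsha"}
```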
To be continued...
