C.4.5 PaddleNLP UIE--Quickly improve performance with small samples (including doccano annotation)

About this NLP column: data augmentation, intelligent annotation, intent recognition and multi-class classification algorithms, text information extraction, multimodal information extraction, interpretability analysis, performance tuning, model compression, and more.

As the saying goes, one generation plants the trees and the next enjoys the shade. This column covers the topics listed above, with integrated project code that saves you time and helps you quickly implement tasks and set up research baselines.

Related articles:
1. Express order information extraction [1]: BiGRU+CRF with pre-trained word vector optimization
2. Express order information extraction [2]: ERNIE 1.0 to ERNIE-Gram + CRF pre-trained models
3. Express order information extraction [3]: five annotated samples are enough to improve accuracy and quickly complete the express order information task
   1) PaddleNLP universal information extraction UIE [1], industrial application examples: information extraction (entity and relation extraction, Chinese word segmentation, precise entity labeling, sentiment analysis, etc.), text error correction, question answering, chatbots, custom training
   2) PaddleNLP UIE [2]: quickly improving performance with small samples (including doccano annotation)
Highly recommended: Data annotation platform doccano: introduction, installation, usage, and pitfalls

Code source included at the end of the article

0. Information extraction definition and difficulties

Information extraction is the task of automatically extracting structured information from unstructured or semi-structured text. It mainly includes entity recognition, relation extraction, event extraction, sentiment analysis, and opinion extraction. Information extraction spans many domains and is technically demanding. Its main difficulties are listed below.


  • Cross-domain and cross-task requirements: Knowledge is hard to transfer between domains, for example from the general domain to a vertical domain or from one vertical domain to another, and the extraction tasks themselves differ (entities, relations, events, and so on).
  • High degree of customization: Each information extraction task (entities, relations, events, etc.) needs its own model, so development cost and machine resource consumption are very high.
  • Little or no training data: Data in some domains is scarce and hard to obtain, and the domain expertise required raises the bar for data labeling.

In response to the above problems, the Institute of Software of the Chinese Academy of Sciences and Baidu jointly proposed UIE (Unified Structure Generation for Universal Information Extraction), a universal information extraction technique that unifies many tasks, published at ACL'22. UIE achieved SOTA performance on four information extraction tasks (entities, relations, events, and sentiment) across 13 datasets under fully supervised, low-resource, and few-shot settings.

Building on ERNIE 3.0, the knowledge-enhanced large NLP model from the Wenxin family, PaddleNLP brings out the strong potential of UIE on Chinese tasks and has open-sourced the first industrial-grade solution for universal information extraction. It can complete a variety of information extraction tasks quickly with no labeled data, or with only a small amount of it.

Link to guide: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie

1. Use the PaddleNLP Taskflow tool to solve the difficulties of information extraction (Chinese version)

1.1 Install PaddleNLP

! pip install --upgrade paddlenlp
! pip show paddlenlp

1.2 Use Taskflow UIE tasks to see the effect

Human resources entry certificate information extraction

from paddlenlp import Taskflow
# Fields to extract from the employment certificate
schema = ['姓名', '毕业院校', '职位', '月收入', '身体状况']
ie = Taskflow('information_extraction', schema=schema)
ie('兹证明凌霄为本单位职工,已连续在我单位工作5 年。学历为嘉利顿大学毕业,目前在我单位担任总经理助理  职位。近一年内该员工在我单位平均月收入(税后)为  12000 元。该职工身体状况良好。本单位仅此承诺上述表述是正确的,真实的。')
[{
    
    '姓名': [{
    
    'text': '凌霄',
    'start': 3,
    'end': 5,
    'probability': 0.9042383385504706}],
  '毕业院校': [{
    
    'text': '嘉利顿大学',
    'start': 28,
    'end': 33,
    'probability': 0.9927952662605009}],
  '职位': [{
    
    'text': '总经理助理',
    'start': 44,
    'end': 49,
    'probability': 0.9922470268350594}],
  '月收入': [{
    
    'text': '12000 元',
    'start': 77,
    'end': 84,
    'probability': 0.9788556518998917}],
  '身体状况': [{
    
    'text': '良好',
    'start': 92,
    'end': 94,
    'probability': 0.9939678710475306}]}]
# Jupyter Notebook pretty-prints output by default; in other editors, use Python's built-in pprint module to format the output

from pprint import pprint
pprint(ie('兹证明凌霄为本单位职工,已连续在我单位工作5 年。学历为嘉利顿大学毕业,目前在我单位担任总经理助理  职位。近一年内该员工在我单位平均月收入(税后)为  12000 元。该职工身体状况良好。本单位仅此承诺上述表述是正确的,真实的。'))

Medical pathology analysis

schema = ['肿瘤部位', '肿瘤大小']
ie.set_schema(schema)
ie('胃印戒细胞癌,肿瘤主要位于胃窦体部,大小6*2cm,癌组织侵及胃壁浆膜层,并侵犯血管和神经。')
[{
    
    '肿瘤部位': [{
    
    'text': '胃窦体部',
    'start': 13,
    'end': 17,
    'probability': 0.9601818899487213}],
  '肿瘤大小': [{
    
    'text': '6*2cm',
    'start': 20,
    'end': 25,
    'probability': 0.9670914301489972}]}]
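
The printed `start` and `end` values appear to follow Python slice conventions (start inclusive, end exclusive), so `text[start:end]` should reproduce the `text` field. The sketch below checks this on the pathology result above. `check_offsets` is a hypothetical helper, not part of PaddleNLP, and it works on the plain result dictionary, so it runs without loading the model.

```python
from typing import Any, Dict, List, Tuple

Result = Dict[str, List[Dict[str, Any]]]


def check_offsets(text: str, result: Result) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (field, span) pairs whose 'text' differs from text[start:end]."""
    mismatches = []
    for field, spans in result.items():
        for span in spans:
            if "start" not in span:
                continue  # classification results carry no offsets
            if text[span["start"]:span["end"]] != span["text"]:
                mismatches.append((field, span))
    return mismatches


# Input sentence and result copied from the example above
text = "胃印戒细胞癌,肿瘤主要位于胃窦体部,大小6*2cm,癌组织侵及胃壁浆膜层,并侵犯血管和神经。"
result = {
    "肿瘤部位": [{"text": "胃窦体部", "start": 13, "end": 17,
                  "probability": 0.9601818899487213}],
    "肿瘤大小": [{"text": "6*2cm", "start": 20, "end": 25,
                  "probability": 0.9670914301489972}],
}

print(check_offsets(text, result))  # an empty list means every span lines up
```

A check like this is useful when the extracted offsets are later used to highlight spans in the original text.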

1.3 Use Taskflow UIE for entity extraction, relation extraction, event extraction, sentiment classification, and opinion extraction

# Entity extraction
schema = ['时间', '赛手', '赛事名称']
ie.set_schema(schema)
ie('2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!')
[{
    
    '时间': [{
    
    'text': '2月8日上午',
    'start': 0,
    'end': 6,
    'probability': 0.9857379716035553}],
  '赛手': [{
    
    'text': '中国选手谷爱凌',
    'start': 24,
    'end': 31,
    'probability': 0.7232891682586384}],
  '赛事名称': [{
    
    'text': '北京冬奥会自由式滑雪女子大跳台决赛',
    'start': 6,
    'end': 23,
    'probability': 0.8503080086948529}]}]
# Relation extraction
schema = {'歌曲名称': ['歌手', '所属专辑']}
ie.set_schema(schema)
ie('《告别了》是孙耀威在专辑爱的故事里面的歌曲')
[{
    
    '歌曲名称': [{
    
    'text': '告别了',
    'start': 1,
    'end': 4,
    'probability': 0.629614912348881,
    'relations': {
    
    '歌手': [{
    
    'text': '孙耀威',
       'start': 6,
       'end': 9,
       'probability': 0.9988381005599081}],
     '所属专辑': [{
    
    'text': '爱的故事',
       'start': 12,
       'end': 16,
       'probability': 0.9968462078543183}]}},
   {
    
    'text': '爱的故事',
    'start': 12,
    'end': 16,
    'probability': 0.28168707817316374,
    'relations': {
    
    '歌手': [{
    
    'text': '孙耀威',
       'start': 6,
       'end': 9,
       'probability': 0.9951415104192272}]}}]}]
# Event extraction
schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']}  # an event is selected through its trigger word, named in the form "xxx触发词"
ie.set_schema(schema)
ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。')
[{
    
    '地震触发词': [{
    
    'text': '地震',
    'start': 56,
    'end': 58,
    'probability': 0.9977425555988333,
    'relations': {
    
    '地震强度': [{
    
    'text': '3.5级',
       'start': 52,
       'end': 56,
       'probability': 0.998080217831891}],
     '时间': [{
    
    'text': '5月16日06时08分',
       'start': 11,
       'end': 22,
       'probability': 0.9853299772936026}],
     '震中位置': [{
    
    'text': '云南临沧市凤庆县(北纬24.34度,东经99.98度)',
       'start': 23,
       'end': 50,
       'probability': 0.7874014521275967}],
     '震源深度': [{
    
    'text': '10千米',
       'start': 63,
       'end': 67,
       'probability': 0.9937974422968665}]}}]}]
# Sentiment classification
schema = '情感倾向[正向,负向]'  # a classification task lists its labels inside []
ie.set_schema(schema) 
ie('这个产品用起来真的很流畅,我非常喜欢')
[{
    
    '情感倾向[正向,负向]': [{
    
    'text': '正向', 'probability': 0.9990024058203417}]}]
# Opinion extraction
schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']}  # the opinion extraction schema is fixed; reuse it as-is for later opinion extraction
ie.set_schema(schema)  # Reset schema
ie('地址不错,服务一般,设施陈旧')
[{
    
    '评价维度': [{
    
    'text': '地址',
    'start': 0,
    'end': 2,
    'probability': 0.9888139270606509,
    'relations': {
    
    '观点词': [{
    
    'text': '不错',
       'start': 2,
       'end': 4,
       'probability': 0.9927845886615216}],
     '情感倾向[正向,负向]': [{
    
    'text': '正向', 'probability': 0.998228967796706}]}},
   {
    
    'text': '设施',
    'start': 10,
    'end': 12,
    'probability': 0.9588298547520608,
    'relations': {
    
    '观点词': [{
    
    'text': '陈旧',
       'start': 12,
       'end': 14,
       'probability': 0.928675281256794}],
     '情感倾向[正向,负向]': [{
    
    'text': '负向', 'probability': 0.9949388606013692}]}},
   {
    
    'text': '服务',
    'start': 5,
    'end': 7,
    'probability': 0.9592857070501211,
    'relations': {
    
    '观点词': [{
    
    'text': '一般',
       'start': 7,
       'end': 9,
       'probability': 0.9949359182521675}],
     '情感倾向[正向,负向]': [{
    
    'text': '负向', 'probability': 0.9952498258302498}]}}]}]
# Cross-task, cross-domain extraction
schema = ['寺庙', {'丈夫': '妻子'}]  # this schema combines entity extraction and relation extraction
ie.set_schema(schema)
ie('李治即位后,让身在感业寺的武则天续起头发,重新纳入后宫。')
[{
    
    '寺庙': [{
    
    'text': '感业寺',
    'start': 9,
    'end': 12,
    'probability': 0.9888581774497425}],
  '丈夫': [{
    
    'text': '李治',
    'start': 0,
    'end': 2,
    'probability': 0.989690572797457,
    'relations': {
    
    '妻子': [{
    
    'text': '武则天',
       'start': 13,
       'end': 16,
       'probability': 0.9987625986790256}]}}]}]
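
Nested results like the relation extraction output above are often easier to consume as (subject, relation, object) triples. The sketch below flattens them and drops low-confidence spans, such as the second 歌曲名称 candidate (probability about 0.28). `to_triples` and the 0.5 threshold are illustrative assumptions, not PaddleNLP API and not a recommended value.

```python
from typing import Any, Dict, List, Tuple

Result = Dict[str, List[Dict[str, Any]]]


def to_triples(result: Result, min_prob: float = 0.5) -> List[Tuple[str, str, str]]:
    """Flatten a nested UIE-style result into (subject, relation, object) triples."""
    triples = []
    for subjects in result.values():
        for subj in subjects:
            if subj["probability"] < min_prob:
                continue  # skip unreliable subjects and everything under them
            for relation, objects in subj.get("relations", {}).items():
                for obj in objects:
                    if obj["probability"] >= min_prob:
                        triples.append((subj["text"], relation, obj["text"]))
    return triples


# Abridged copy of the relation extraction output above (offsets omitted)
song_result = {
    "歌曲名称": [
        {"text": "告别了", "probability": 0.629614912348881,
         "relations": {
             "歌手": [{"text": "孙耀威", "probability": 0.9988381005599081}],
             "所属专辑": [{"text": "爱的故事", "probability": 0.9968462078543183}]}},
        {"text": "爱的故事", "probability": 0.28168707817316374,
         "relations": {
             "歌手": [{"text": "孙耀威", "probability": 0.9951415104192272}]}},
    ]
}

for triple in to_triples(song_result):
    print(triple)
```

The threshold is a trade-off: a higher value removes spurious subjects like the album name, but may also drop correct low-confidence spans.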

1.4 Some tips for using Taskflow UIE

1.4.1. Adjust batch_size to improve prediction efficiency

from paddlenlp import Taskflow
schema = ['费用']
# batch_size sets how many texts go through the model at once; use a smaller value when resources are limited
ie = Taskflow('information_extraction', schema=schema, batch_size=2)
ie(['二十号21点49分打车回家46块钱', '8月3号往返机场交通费110元', '2019年10月17日22点18分回家打车46元', '三月三0号23点10分加班打车21元'])
[{
    
    '费用': [{
    
    'text': '46块钱',
    'start': 13,
    'end': 17,
    'probability': 0.9781786110574338}]},
 {
    
    '费用': [{
    
    'text': '110元',
    'start': 11,
    'end': 15,
    'probability': 0.9504088995163151}]},
 {
    
    '费用': [{
    
    'text': '46元',
    'start': 21,
    'end': 24,
    'probability': 0.9753814247531167}]},
 {
    
    '费用': [{
    
    'text': '21元',
    'start': 15,
    'end': 18,
    'probability': 0.9761294626311425}]}]
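
Taskflow does the batching internally. The following sketch only illustrates what `batch_size=2` means for the four inputs above: they are grouped into two batches, and each batch goes through the model together. `chunked` is a hypothetical helper written for this illustration.

```python
from typing import Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [
    "二十号21点49分打车回家46块钱",
    "8月3号往返机场交通费110元",
    "2019年10月17日22点18分回家打车46元",
    "三月三0号23点10分加班打车21元",
]

batches = list(chunked(texts, batch_size=2))
for i, batch in enumerate(batches):
    print(f"batch {i}: {len(batch)} texts")
```

A larger batch means fewer model calls for the same inputs but more memory per call, which is why the right value depends on the hardware.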

1.4.2. Use UIE-Tiny model to speed up model prediction

from paddlenlp import Taskflow
schema = ['费用']
# uie-tiny is a smaller model that predicts faster
ie = Taskflow('information_extraction', schema=schema, batch_size=2, model='uie-tiny')
ie(['二十号21点49分打车回家46块钱', '8月3号往返机场交通费110元', '2019年10月17日22点18分回家打车46元', '三月三0号23点10分加班打车21元'])
[{
    
    '费用': [{
    
    'text': '46块钱',
    'start': 13,
    'end': 17,
    'probability': 0.8945340489542026}]},
 {
    
    '费用': [{
    
    'text': '110元',
    'start': 11,
    'end': 15,
    'probability': 0.9757676375014448}]},
 {
    
    '费用': [{
    
    'text': '46元',
    'start': 21,
    'end': 24,
    'probability': 0.860397941604333}]},
 {
    
    '费用': [{
    
    'text': '21元',
    'start': 15,
    'end': 18,
    'probability': 0.8595131018474689}]}]
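
Comparing the probabilities printed in 1.4.1 (default model) and 1.4.2 (uie-tiny) gives a rough feel for the speed/quality trade-off. The numbers below are copied from those two outputs. Note that these are model confidence scores, not accuracy; a real comparison would need a labeled test set.

```python
from statistics import mean

# '费用' probabilities copied from the output in 1.4.1 (default model)
default_probs = [0.9781786110574338, 0.9504088995163151,
                 0.9753814247531167, 0.9761294626311425]
# '费用' probabilities copied from the output in 1.4.2 (model='uie-tiny')
tiny_probs = [0.8945340489542026, 0.9757676375014448,
              0.860397941604333, 0.8595131018474689]

mean_default = mean(default_probs)
mean_tiny = mean(tiny_probs)

print(f"default model mean confidence: {mean_default:.4f}")
print(f"uie-tiny mean confidence:      {mean_tiny:.4f}")
print(f"difference:                    {mean_default - mean_tiny:.4f}")
```

On these four sentences both models extract the same spans, and the smaller model is simply less confident on three of them.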

2. Small samples improve UIE effect

The UIE base model in Taskflow was trained on a large number of labeled samples, but its extraction quality is still unsatisfactory in some niche domains. UIE can be improved quickly with a small number of samples.
Why can a few samples improve UIE? UIE is modeled mainly as a prompt-based method, and fine-tuning prompt-based models on small samples is very effective. Below we use a specific case to show the effect of fine-tuning UIE.
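
As a rough illustration of what "prompt-based" means here: each schema entry becomes a prompt that is paired with the input text, and the model extracts the span that answers that prompt. The sketch below expands a Taskflow-style schema into top-level prompts and builds a second-stage relation prompt from an extracted subject. The `subject的relation` template is an assumption about how prompts are composed for Chinese text, and both function names are hypothetical.

```python
from typing import Dict, List, Union

Schema = Union[str, List, Dict]


def top_level_prompts(schema: Schema) -> List[str]:
    """Collect the first-stage prompts (entity types, subjects, triggers)."""
    if isinstance(schema, str):
        return [schema]
    if isinstance(schema, dict):
        return list(schema.keys())
    prompts: List[str] = []
    for item in schema:
        prompts.extend(top_level_prompts(item))
    return prompts


def relation_prompt(subject_text: str, relation: str) -> str:
    """Second-stage prompt: ask for `relation` of an already extracted subject.

    Assumed template, for illustration only.
    """
    return f"{subject_text}的{relation}"


schema = ['寺庙', {'丈夫': '妻子'}]
text = '李治即位后,让身在感业寺的武则天续起头发,重新纳入后宫。'

for prompt in top_level_prompts(schema):
    print((prompt, text))                     # stage 1: (prompt, text) pairs
print((relation_prompt('李治', '妻子'), text))  # stage 2, after '李治' is found
```

Because every task is reduced to the same (prompt, text) to span format, a handful of labeled examples in that format is enough to adapt the model to a new domain.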

2.1 Extraction of voice reimbursement work order information

1. Background

Within a certain company, taxi fares can be reimbursed by voice input. A speech recognition (ASR) model transcribes the speech into text, and information is then extracted from that text. Four fields are extracted: time, departure place, destination, and expense. With these four fields, the reimbursement work order can be filled in.

2. Challenge

At present, the out-of-the-box Taskflow UIE task does not yet reach industrial quality for such a highly vertical task, so the UIE model needs to be fine-tuned to improve its results. A failing case is shown below.

ie.set_schema(['时间', '出发地', '目的地', '费用'])
ie('10月16日高铁从杭州到上海南站车次d5414共48元')  # the departure place and destination are not extracted accurately
[{
    
    '时间': [{
    
    'text': '10月16日',
    'start': 0,
    'end': 6,
    'probability': 0.9552445817793149}],
  '出发地': [{
    
    'text': '杭州',
    'start': 9,
    'end': 11,
    'probability': 0.5713024802221334}],
  '费用': [{
    
    'text': '48元',
    'start': 24,
    'end': 27,
    'probability': 0.8932524634666485}]}]
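
A quick way to spot such failures programmatically is to compare the schema against the fields that actually came back. `missing_fields` is a hypothetical helper, and the sample dictionary is an abridged copy of the output printed above.

```python
from typing import Any, Dict, List

Result = Dict[str, List[Dict[str, Any]]]


def missing_fields(schema: List[str], result: Result) -> List[str]:
    """Return the schema fields for which the model returned no span."""
    return [field for field in schema if not result.get(field)]


schema = ['时间', '出发地', '目的地', '费用']
baseline_result = {
    '时间': [{'text': '10月16日', 'start': 0, 'end': 6,
              'probability': 0.9552445817793149}],
    '出发地': [{'text': '杭州', 'start': 9, 'end': 11,
                'probability': 0.5713024802221334}],
    '费用': [{'text': '48元', 'start': 24, 'end': 27,
              'probability': 0.8932524634666485}],
}

print(missing_fields(schema, baseline_result))
```

Here the destination is missing entirely, and the departure place is returned with a probability of only about 0.57.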

2.2 Label data

For a detailed version, see the reference link on the doccano annotation process.
We recommend the data annotation platform doccano for labeling. This case also connects annotation to training: after the data is exported from doccano, the doccano.py script converts it into the form the model expects, so the two steps link up seamlessly. To make this work, label the data on the doccano platform according to the following rules:

Step 1. Install doccano locally (do not run inside AI Studio, local test environment python=3.8)

$ pip install doccano

Step 2. Initialize the database and account (username and password can be replaced with custom values)

$ doccano init

$ doccano createuser --username my_admin_name --password my_password

Step 3. Start doccano

  • Start doccano's web server in one terminal window and keep the window open

$ doccano webserver --port 8000

  • Start doccano's task queue in another window

$ doccano task

Step 4. Run doccano to annotate entities and relationships

  • Open a browser (Chrome is recommended), enter http://127.0.0.1:8000/ in the address bar, and press Enter to get the following interface.

  • Log in to your account. Click LOGIN in the upper right corner and enter the username and password set in Step 2.

  • Create a project. Click CREATE in the upper left corner to jump to the following interface.

    • Check Sequence Labeling
    • Fill in the Project name and other required information
    • Check Allow overlapping entity and Use relation labeling
    • Once the project is created, the video on its homepage walks through the seven steps from data import to export.

  • Set labels. In the Labels section, click Actions, then either Create Label to add labels manually or Import Labels to import them from a file.

    • At the top, Span is for entity labels and Relation is for relation labels; the two need to be set separately.
  • Import data. In the Datasets section, click Actions, then Import Dataset to import text data from a file.

    • According to the example given in File format, select the appropriate format to import the custom data file.
    • After the import is successful, it will jump to the data list.
  • Label the data. Click the Annotate button at the far right of each record to start labeling. The Label Types switch on the right side of the annotation page toggles between entity labels and relation labels.

    • Entity labeling: You can label entities by directly selecting text with the mouse.
    • Relationship annotation: First click on the relationship label to be annotated, and then click on the corresponding head and tail entities in sequence to complete the relationship annotation.
  • Export data. In the Datasets section, click Actions, then Export Dataset to export the labeled data.


Convert annotated data into data required for UIE training

  • Save the annotation data exported from doccano in the ./data/ directory. For the voice reimbursement work order scenario, the annotated data can be downloaded directly.

Documentation for each task

https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/uie/doccano.md

! wget https://paddlenlp.bj.bcebos.com/datasets/erniekit/speech-cmd-analysis/audio-expense-account.jsonl
! mv audio-expense-account.jsonl ./data/
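
Before converting, it can help to inspect the exported file. The sketch below parses one JSONL record and lists each labeled span. The field names (`text`, `entities`, `label`, `start_offset`, `end_offset`, `relations`) are an assumption about doccano's export format for sequence labeling projects with relations enabled, and the record itself is a hand-written example, not a line from audio-expense-account.jsonl. Check both against your own export.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical record in the assumed doccano export layout
line = json.dumps({
    "id": 1,
    "text": "10月16日高铁从杭州到上海南站车次d5414共48元",
    "entities": [
        {"id": 0, "label": "时间", "start_offset": 0, "end_offset": 6},
        {"id": 1, "label": "出发地", "start_offset": 9, "end_offset": 11},
        {"id": 2, "label": "目的地", "start_offset": 12, "end_offset": 16},
        {"id": 3, "label": "费用", "start_offset": 24, "end_offset": 26},
    ],
    "relations": [],
}, ensure_ascii=False)


def entity_spans(record: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (label, surface text) for every annotated entity in a record."""
    text = record["text"]
    return [(e["label"], text[e["start_offset"]:e["end_offset"]])
            for e in record["entities"]]


record = json.loads(line)  # in practice: one json.loads per line of the file
for label, surface in entity_spans(record):
    print(label, surface)
```

Reading the real file works the same way: open it with `encoding="utf-8"` and call `json.loads` on each non-empty line.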

Run the following code to convert the annotated data into the data required for UIE training.
Here --splits 0.2 0.8 0.0 assigns 20% of the data to the training set, 80% to the validation set, and none to the test set.

Configurable parameter description

  • doccano_file: Data annotation file exported from doccano.
  • save_dir: The directory where the training data is saved. Defaults to the data directory.
  • negative_ratio: Maximum ratio of negative to positive examples. This parameter applies only to extraction tasks. Constructing an appropriate number of negative examples can improve model quality. The number of negative examples depends on the actual number of labels: maximum number of negative examples = negative_ratio * number of positive examples. This parameter applies only to the training set and defaults to 5. To keep the evaluation metrics accurate, the validation and test sets are built with all negative examples by default.
  • splits: The proportions of the training, validation, and test sets when splitting the data. The default is [0.8, 0.1, 0.1], which splits the data 8:1:1 into training, validation, and test sets.
  • task_type: Select the task type. There are two types of tasks available: extraction and classification.
  • options: Specify the category label of the classification task. This parameter is only valid for classification type tasks.
  • prompt_prefix: Declare the prompt prefix information of the classification task. This parameter is only valid for classification type tasks.
  • is_shuffle: Whether to shuffle the dataset randomly. Defaults to True.
  • seed: Random seed, default is 1000.
! python preprocess.py --input_file ./data/audio-expense-account.jsonl --save_dir ./data/ --negative_ratio 5 --splits 0.2 0.8 0.0 --seed 1000
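
To make the `splits`, `is_shuffle`, `seed`, and `negative_ratio` parameters concrete, here is a minimal sketch of a seeded shuffle-and-split and of the negative example cap. It is not the conversion script's actual implementation; `split_dataset` and `max_negatives` are hypothetical names, and the 50 integers stand in for annotated records.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(examples: Sequence[T],
                  splits: Sequence[float] = (0.8, 0.1, 0.1),
                  is_shuffle: bool = True,
                  seed: int = 1000) -> Tuple[List[T], List[T], List[T]]:
    """Split examples into train/dev/test according to three proportions."""
    if len(splits) != 3 or abs(sum(splits) - 1.0) > 1e-9:
        raise ValueError("splits must be three proportions that sum to 1")
    items = list(examples)
    if is_shuffle:
        random.Random(seed).shuffle(items)  # same seed, same order
    n_train = round(len(items) * splits[0])
    n_dev = round(len(items) * splits[1])
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]          # whatever remains
    return train, dev, test


def max_negatives(num_positives: int, negative_ratio: int = 5) -> int:
    """Upper bound on negative examples, as described for negative_ratio."""
    return negative_ratio * num_positives


records = list(range(50))  # stand-ins for annotated records
train, dev, test = split_dataset(records, splits=(0.2, 0.8, 0.0), seed=1000)
print(len(train), len(dev), len(test))
print(max_negatives(num_positives=3, negative_ratio=5))
```

With only 20% of the data used for training, the large validation set gives a more reliable picture of how well few-shot fine-tuning generalizes.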

2.3 Training UIE model

  • Use the annotated data for few-shot training. Model parameters are saved in the ./checkpoint/ directory.

Tips: A GPU environment is recommended; otherwise you may run out of memory. In a CPU environment, switch the model to uie-tiny and adjust batch_size accordingly.

To improve accuracy, set --num_epochs to a larger value so the model trains for more epochs.

Configurable parameter description:

  • train_path: Training set file path.
  • dev_path: Validation set file path.
  • save_dir: Model storage path, default is ./checkpoint.
  • learning_rate: Learning rate, default is 1e-5.
  • batch_size: Batch size. Adjust it to fit GPU memory and lower it if memory is insufficient. The default is 16.
  • max_seq_len: Maximum length of a text segment. Input longer than this is split automatically. The default is 512.
  • num_epochs: Number of training epochs, default is 100.
  • model: The pre-trained model to fine-tune. Options are uie-base and uie-tiny; the default is uie-base.
  • seed: Random seed, default is 1000.
  • logging_steps: Number of steps between log prints, default is 10.
  • valid_steps: Number of steps between evaluations, default is 100.
  • device: Device used for training, either CPU or GPU.
! python finetune.py --train_path ./data/train.txt --dev_path ./data/dev.txt --save_dir ./checkpoint --model uie-tiny --learning_rate 1e-5 --batch_size 2 --max_seq_len 512 --num_epochs 50 --seed 1000 --logging_steps 10 --valid_steps 10
#! python finetune.py --train_path ./data/train.txt --dev_path ./data/dev.txt --save_dir ./checkpoint --model uie-base --learning_rate 1e-5 --batch_size 16 --max_seq_len 512 --num_epochs 50 --seed 1000 --logging_steps 10 --valid_steps 10
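
How `batch_size`, `num_epochs`, and `valid_steps` interact can be estimated with simple arithmetic. The sketch assumes the final partial batch of each epoch is kept and that evaluation runs once every `valid_steps` optimizer steps. The figure of 40 training examples is made up for illustration, since the real count depends on how many lines the conversion step writes to train.txt.

```python
import math
from typing import Tuple


def training_schedule(num_examples: int, batch_size: int,
                      num_epochs: int, valid_steps: int) -> Tuple[int, int, int]:
    """Return (steps per epoch, total steps, number of evaluations)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # partial batch kept
    total_steps = steps_per_epoch * num_epochs
    num_evaluations = total_steps // valid_steps
    return steps_per_epoch, total_steps, num_evaluations


# Settings from the uie-tiny command above, with a hypothetical 40 examples
per_epoch, total, evals = training_schedule(
    num_examples=40, batch_size=2, num_epochs=50, valid_steps=10)
print(f"{per_epoch} steps/epoch, {total} steps in total, {evals} evaluations")
```

If evaluation on the validation set is slow, raising `valid_steps` reduces how often it runs without changing the training itself.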
  • Use the fine-tuned model parameters to test again the cases that could not be extracted correctly.
from paddlenlp import Taskflow

schema = ['时间', '出发地', '目的地', '费用']

few_ie = Taskflow('information_extraction', schema=schema, task_path='./checkpoint/model_best')

few_ie(['10月16日高铁从杭州到上海南站车次d5414共48元',
        '10月22日从公司到首都机场38元过路费'])
[{
    
    '时间': [{
    
    'text': '10月16日',
    'start': 0,
    'end': 6,
    'probability': 0.9998620769863464}],
  '出发地': [{
    
    'text': '杭州',
    'start': 9,
    'end': 11,
    'probability': 0.997861665709749}],
  '目的地': [{
    
    'text': '上海南站',
    'start': 12,
    'end': 16,
    'probability': 0.9974161074329579}],
  '费用': [{
    
    'text': '48',
    'start': 24,
    'end': 26,
    'probability': 0.950222029031579}]},
 {
    
    '时间': [{
    
    'text': '10月22日',
    'start': 0,
    'end': 6,
    'probability': 0.9995716364718135}],
  '目的地': [{
    
    'text': '首都机场',
    'start': 10,
    'end': 14,
    'probability': 0.9984550308953608}],
  '费用': [{
    
    'text': '38',
    'start': 14,
    'end': 16,
    'probability': 0.9465688451171062}]}]
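
To quantify the improvement, the per-field confidence before and after fine-tuning can be put side by side. The probabilities below are copied from the two outputs for the sentence 10月16日高铁从杭州到上海南站车次d5414共48元 (before: section 2.1, after: the output above). `compare_results` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional, Tuple

Result = Dict[str, List[Dict[str, Any]]]


def top_probability(result: Result, field: str) -> Optional[float]:
    """Highest probability returned for a field, or None if it is missing."""
    return max((s["probability"] for s in result.get(field, [])), default=None)


def compare_results(fields: List[str], before: Result,
                    after: Result) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """Map each field to (probability before, probability after)."""
    return {f: (top_probability(before, f), top_probability(after, f))
            for f in fields}


fields = ['时间', '出发地', '目的地', '费用']
before = {
    '时间': [{'text': '10月16日', 'probability': 0.9552445817793149}],
    '出发地': [{'text': '杭州', 'probability': 0.5713024802221334}],
    '费用': [{'text': '48元', 'probability': 0.8932524634666485}],
}
after = {
    '时间': [{'text': '10月16日', 'probability': 0.9998620769863464}],
    '出发地': [{'text': '杭州', 'probability': 0.997861665709749}],
    '目的地': [{'text': '上海南站', 'probability': 0.9974161074329579}],
    '费用': [{'text': '48', 'probability': 0.950222029031579}],
}

report = compare_results(fields, before, after)
for field, (p_before, p_after) in report.items():
    print(field, p_before, p_after)
```

For this sentence, the destination is found only after fine-tuning, and the confidence of every other field goes up.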

Project link

Link to this project:
https://aistudio.baidu.com/aistudio/projectdetail/4160689?contributionType=1

Project homepage:
https://aistudio.baidu.com/aistudio/usercenter

Origin blog.csdn.net/sinat_39620217/article/details/125167816#comments_28642943