Introduction to the Qianyan Dataset Competition Task

Competition topic

General Information Extraction Task Evaluation

Describe a variety of information extraction tasks within a unified general framework, with a focus on how well related technologies adapt and transfer to new and unseen information extraction tasks and paradigms.

Introduction

Information extraction aims to turn the information in unstructured text into structured form. It is a foundational technology and an important research area in natural language processing, and it has attracted wide attention from academia and industry. Traditional information extraction tasks and evaluations usually target a specific text domain and a single extraction task, which makes it difficult to evaluate how related technologies and methods perform across general scenarios and tasks.
To this end, the Institute of Software of the Chinese Academy of Sciences, Baidu, and the Qianyan open-source dataset project jointly launched the industry's first general information extraction evaluation: the Qianyan General Information Extraction Competition.
This leaderboard is the regular-season version of Qianyan General Information Extraction. It is open to NLP developers for long-term registration and submission, and no submission deadline is set. The task settings are consistent with the competition: rather than being limited to the traditional single-task evaluation paradigm, it describes a variety of different information extraction tasks with a unified general framework. It emphasizes how well the relevant technical methods adapt and transfer to new and unseen information extraction tasks and paradigms, so as to meet the field's practical need for rapid iteration and rapid transfer and to stay closer to real business applications.
———————————————————————
The information extraction task aims to automatically extract structured information from unstructured text according to specific extraction requirements. Here, the extraction requirements are given by the extraction framework (schema) of the task, which consists of extraction categories (person names, company names, company listing events, etc.) and target structures (entity, relation, event, etc.). This is a Chinese information extraction task: given a specific extraction framework s and a piece of free text x, extract all information structures Y (entities, relations, events) from x. Different extraction frameworks yield different information structures, as the following examples show:

Extraction Framework Example 1: Financial Event Extraction

Input text

Ningbo Rongbai New Energy Technology Co., Ltd. (referred to as "Rongbai Technology", stock code: 688005) is listed on the Science and Technology Innovation Board.

Extraction requirement


Event definition

A listing is the process in which an enterprise publicly issues shares to investors through a stock exchange for the first time, in order to raise funds for its development.
<Listed company> was listed on <listed sector> at <listing time>, and raised a total of <financing amount>.

Argument definition

  • Listed company: refers to a joint stock limited company whose shares are listed and traded on a stock exchange after approval by the State Council or by the securities management department authorized by the State Council.
  • Listing time: refers to the time at which the shares, as approved by the securities management department, are listed and traded on the stock exchange.
  • Listed sector: refers to the main board, the SME board, GEM, and others.
  • Financing amount: refers to the total capital raised by the listed company through the act of listing.
    ————————————————————————

Extraction Framework Example 2: Winter Olympics Event Extraction

Input text

On the morning of February 8th, in the women's freestyle skiing final of the Beijing Winter Olympics, Chinese player Gu Ailing won the gold medal with 188.25 points!

Extraction requirement


Extraction Framework Example 3: Person Information Extraction

Input text

On the morning of February 8th, in the women's freestyle skiing final of the Beijing Winter Olympics, Chinese player Gu Ailing won the gold medal with 188.25 points!

Extraction requirement


Example output 3


Extraction Framework Example 4: Dialogue Emotion Extraction


Dataset introduction

The data and extraction frameworks in this evaluation mainly come from the Qianyan data platform and from Baidu's general information extraction application cases. The evaluation builds a variety of extraction frameworks across multiple fields and scenarios, covering domains such as medicine, law, and finance, and extraction tasks such as entity extraction, relation extraction, and event extraction. The goal is to assess both the information extraction ability of existing technology in the general domain and its ability to transfer to new tasks and scenarios. Participants can use existing models and the Qianyan platform, together with publicly available datasets, to quickly construct data and transfer existing models.
The evaluation also encourages participants to use publicly available datasets and knowledge bases to construct training data through semi-supervised learning and distant supervision.
The dataset consists of two parts:

  • 6 Seen Schemas (known frameworks)
    • These mainly come from the data available on the Qianyan platform and the AI Studio platform. Participants can build models on this platform data. This track mainly evaluates the ability of existing technologies to build models from labeled data.
  • 4 Unseen Schemas (unknown frameworks)
    • These mainly come from Baidu's extraction cases. The organizers provide only a small amount of validation data, which is used to confirm the extraction requirements with the contestants and to validate models. This track mainly evaluates how well existing technologies transfer to new extraction requirements.

The evaluation data is released in three parts:

  • Seen Schema definition files and validation data. This part of the data mainly comes from the datasets on the Qianyan dataset platform. Each Schema contains structure and type definitions and comes with a small amount of validation data. The validation data helps contestants confirm the labeling specifications (such as labeling boundaries).
  • Unseen Schema definitions and a small amount of corresponding validation data. Each Schema contains structure and type definitions and comes with a small amount of validation data. The validation data helps contestants confirm the labeling specifications (such as labeling boundaries).
  • Test set data (final test set). Participants extract information from plain text according to the corresponding extraction requirements (covering both seen and unseen schemas) and submit the extraction results.

Data description

Extraction framework definition

The extraction framework definition file is in YAML format. Each extraction framework file contains definition information for entities, relations, and events.


Training set file

The training set file for each extraction framework is a jsonlines file. Each line is one training instance, consisting of the input text X, the extraction framework S (schema), and the target structure Y (entity, relation, event). The offset values index characters of the original Chinese text, so they do not line up with the English translation shown here. A sample looks like this:

{"text": "Ningbo Rongbai New Energy Technology Co., Ltd. (referred to as \"Rongbai Technology\", stock code: 688005) was listed on the Science and Technology Innovation Board. road capital blessing.", "entity": [], "relation": [], "event": [{"type": "listed", "text": "listed", "args": [{"type": "Listed sector", "offset": [38, 39, 40], "text": "Science and Technology Innovation Board"}, {"type": "Listed company", "offset": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], "text": "Ningbo Rongbai New Energy Technology Co., Ltd."}]}], "schema": "Financial Information"}

The commonly used fields in each training instance are:

  • text: input text
  • schema: the corresponding extraction framework
  • entity: entity annotation result
  • relation: relation annotation result
  • event: event annotation result
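
As a rough illustration of how these fields can be read, here is a minimal Python sketch that parses one jsonlines row into the input text, the schema name, and the three target lists. The helper names (parse_instance, load_instances) and the shortened sample row are assumptions made for this example, not part of the official tooling.

```python
import json

# Hypothetical jsonlines row modeled on the training sample shown earlier;
# the "offset" lists are omitted here to keep the example short.
SAMPLE_LINE = json.dumps(
    {
        "text": "Ningbo Rongbai New Energy Technology Co., Ltd. is listed "
                "on the Science and Technology Innovation Board.",
        "entity": [],
        "relation": [],
        "event": [
            {
                "type": "listed",
                "text": "listed",
                "args": [
                    {"type": "Listed sector",
                     "text": "Science and Technology Innovation Board"},
                    {"type": "Listed company",
                     "text": "Ningbo Rongbai New Energy Technology Co., Ltd."},
                ],
            }
        ],
        "schema": "Financial Information",
    },
    ensure_ascii=False,
)


def parse_instance(line):
    """Split one jsonlines row into input text X, schema S, and targets Y."""
    record = json.loads(line)
    # Missing target lists are treated as empty (an assumption of this sketch).
    targets = {key: record.get(key, []) for key in ("entity", "relation", "event")}
    return record["text"], record.get("schema"), targets


def load_instances(lines):
    """Parse every non-empty line from a file object or any iterable of lines."""
    return [parse_instance(line) for line in lines if line.strip()]


text, schema, targets = parse_instance(SAMPLE_LINE)
# Map each argument role of the first event to its extracted text.
roles = {arg["type"]: arg["text"] for arg in targets["event"][0]["args"]}
print(schema, roles)
```

For a real file, load_instances can be given an open file handle, for example one opened with encoding="utf-8" on a training file downloaded from the platform.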

Test set file

The common fields in each test instance are:

  • text: input text
  • schema: the corresponding extraction framework
  • id: id of the instance to extract from

Submission format

Model predictions are submitted to AI Studio as a UTF-8 encoded jsonlines file, and the platform scores them online and updates the ranking in real time. Each line of the file is a JSON object holding the prediction for one instance. Contestants must submit results for all test samples. If the model produces no output for an instance, its target structure lists (entity, relation, event) are left empty.
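
The sketch below shows one way such a submission file could be produced in Python. It assumes that each submitted row carries the test instance id plus the entity, relation, and event lists under the same keys as the training data; the exact row layout should be checked against the official sample. The function names and the dummy predictor are hypothetical.

```python
import json


def build_submission_rows(test_instances, predict):
    """Build one result row per test instance, with empty lists when nothing is found."""
    rows = []
    for instance in test_instances:
        # `predict` is a hypothetical model wrapper returning a dict or None.
        prediction = predict(instance["text"], instance["schema"]) or {}
        rows.append(
            {
                "id": instance["id"],
                "entity": prediction.get("entity", []),
                "relation": prediction.get("relation", []),
                "event": prediction.get("event", []),
            }
        )
    return rows


def write_submission(path, rows):
    """Write rows as UTF-8 jsonlines: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            # ensure_ascii=False keeps Chinese text readable in the file.
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def dummy_predict(text, schema):
    """Stand-in predictor used only to exercise the code path."""
    if "listed" in text:
        return {"event": [{"type": "listed", "text": "listed", "args": []}]}
    return None


test_instances = [
    {"id": 0, "text": "Rongbai Technology is listed.", "schema": "Financial Information"},
    {"id": 1, "text": "No event here.", "schema": "Financial Information"},
]
rows = build_submission_rows(test_instances, dummy_predict)
print(rows[1])
```

Every test instance produces a row, so instances with no prediction still appear in the file with empty lists.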

Evaluation content

The competition is evaluated on the records that the extraction system extracts from the input text. Extraction tasks of different paradigms are uniformly expressed as tuples, and evaluation is performed on the deduplicated set of tuples. The evaluation script automatically converts results in the submission format into tuples and scores them. The evaluated tuples include pairs and triples.

The basic elements of a tuple are:

  • Span: a text block extraction result, given as a string (no offset is required)

  • Type label: a label representing a type (e.g. entity type, event type)

  • Association label: a label representing an association (e.g. relation type, event argument type)

The tuples evaluated are:

  • (span, type label): representative tasks include entity extraction (entity mention span, entity type) and event trigger word recognition (trigger word span, event type)

  • (association label, span1, span2): representative tasks include relation extraction (relation type, subject span, object span) and emotional triplets (emotional polarity, opinion object span, emotional expression span)

  • (type label, association label, span): representative tasks include event argument identification (event type, argument role, argument span)

Please note that this evaluation focuses on information extraction rather than labeling. Therefore, information that appears multiple times in the same piece of text is evaluated after deduplication. For example, if the same entity appears several times in one input text, the model only needs to output one pair for it; if several identical pairs are output, the evaluation script automatically deduplicates them.
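
To make the tuple view concrete, here is a minimal Python sketch that converts one record into deduplicated sets of pairs and triples. It is not the official evaluation script. The event layout follows the training sample; the entity layout (type and text) and the relation layout (a type plus two args holding subject and object) are assumptions of this sketch.

```python
def to_tuples(record):
    """Convert one record into deduplicated tuple sets, grouped by task."""
    tuples = {"entity": set(), "relation": set(), "trigger": set(), "argument": set()}
    for entity in record.get("entity", []):
        # (span, type label)
        tuples["entity"].add((entity["text"], entity["type"]))
    for relation in record.get("relation", []):
        # (association label, span1, span2); args layout is assumed.
        subject, obj = relation["args"][:2]
        tuples["relation"].add((relation["type"], subject["text"], obj["text"]))
    for event in record.get("event", []):
        # (span, type label) for the trigger word
        tuples["trigger"].add((event["text"], event["type"]))
        for arg in event.get("args", []):
            # (type label, association label, span) for each argument
            tuples["argument"].add((event["type"], arg["type"], arg["text"]))
    return tuples


gold = {
    "entity": [{"type": "person", "text": "Gu Ailing"}],
    "relation": [],
    "event": [
        {
            "type": "listed",
            "text": "listed",
            "args": [{"type": "Listed sector",
                      "text": "Science and Technology Innovation Board"}],
        }
    ],
}
# The same entity is predicted twice; the set keeps a single pair.
predicted = {
    "entity": [
        {"type": "person", "text": "Gu Ailing"},
        {"type": "person", "text": "Gu Ailing"},
    ],
    "relation": [],
    "event": [],
}

gold_tuples = to_tuples(gold)
predicted_tuples = to_tuples(predicted)
matched_entities = predicted_tuples["entity"] & gold_tuples["entity"]
print(len(predicted_tuples["entity"]), len(matched_entities))
```

Using Python sets gives the deduplication for free, and set intersection yields the tuples that a scoring step would count as correct.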

Evaluation metrics


Overall score


Takeaways

Work through this competition thoroughly and at a steady pace, tackling it question by question. Finishing it within a month would already be progress, and each step adds to your own competition experience.

Origin blog.csdn.net/kuxingseng123/article/details/129420321