Recently, driven by experimental needs, we collected and organized datasets for relation extraction, mainly SemEval, Wiki80, and NYT10. At present, fully-supervised relation extraction is generally evaluated on SemEval, while distantly-supervised relation extraction is generally evaluated on NYT10.
SemEval
Dataset source
The SemEval dataset comes from Task 8 of the 2010 International Workshop on Semantic Evaluation: "Multi-Way Classification of Semantic Relations Between Pairs of Nominals".
Dataset introduction
Task: given a sentence and two annotated nominals, select the most suitable relation from a given list of relations.
The dataset contains 9+1 relation types in total; the proportion of each type is shown in the figure below:
Source: https://github.com/thunlp/OpenNRE/tree/master/benchmark, the format is json
The SemEval folder contains four files:
semeval_rel2id.json: a mapping between each relation and its index. The same relation is split into two depending on the order of the two entities e1 and e2 (for example, Product-Producer(e1,e2) and Product-Producer(e2,e1)), so together with the relation "Other" there are 19 relations (indices 0-18).
semeval_train.txt & semeval_val.txt: the original SemEval-2010 Task 8 training set contains 8,000 samples; here it has been split into train (6,507 samples) and val (1,493 samples). Each line is a JSON sample, and samples with the same relation are grouped together.
semeval_test.txt: same format as train and val; contains 2,717 samples.
Sample format:
Example: {"token": ["trees", "grow", "seeds", "."], "h": {"name": "trees", "pos": [0, 1]}, "t": {"name": "seeds", "pos": [2, 3]}, "relation": "Product-Producer(e2,e1)"}
It contains four keys:
"token": the tokenized sentence
"h": the name and position of the head entity in the sample
"t": the name and position of the tail entity in the sample
"relation": the relation between the two entities. In this example the relation is Product-Producer(e2,e1), meaning entity 1 (the head entity) is the Producer and entity 2 (the tail entity) is the Product.
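As a sketch of how this format is consumed (assuming each line of semeval_train.txt is one JSON object, as the example above suggests), the entity mentions can be recovered from the token spans like this:

```python
import json

# One line of semeval_train.txt, taken from the sample above
line = ('{"token": ["trees", "grow", "seeds", "."], '
        '"h": {"name": "trees", "pos": [0, 1]}, '
        '"t": {"name": "seeds", "pos": [2, 3]}, '
        '"relation": "Product-Producer(e2,e1)"}')

sample = json.loads(line)
tokens = sample["token"]

# "pos" is a [start, end) span over the token list
head = " ".join(tokens[sample["h"]["pos"][0]:sample["h"]["pos"][1]])
tail = " ".join(tokens[sample["t"]["pos"][0]:sample["t"]["pos"][1]])

print(head, sample["relation"], tail)  # trees Product-Producer(e2,e1) seeds
```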
The SemEval dataset is manually annotated and contains no noise.
References
Data official website: http://semeval2.fbk.eu/semeval2.php?location=tasks#T11
Data source: https://github.com/thunlp/OpenNRE/tree/master/benchmark
Statistics: https://blog.csdn.net/qq_29883591/article/details/88567561
Wiki80
Dataset source
According to the OpenNRE description ("We also provide a new dataset Wiki80, which is derived from FewRel."), Wiki80 is extracted from the FewRel dataset released by Tsinghua University.
Dataset introduction
Task: given a sentence and two annotated nominals, select the most suitable relation from a given list of relations.
The dataset contains 80 relations in total; each relation has 700 samples, for a total of 56,000 samples.
Relation | Number |
---|---|
place served by transport hub | 700 |
mountain range | 700 |
religion | 700 |
participating team | 700 |
contains administrative territorial entity | 700 |
head of government | 700 |
country of citizenship | 700 |
original network | 700 |
heritage designation | 700 |
performer | 700 |
participant of | 700 |
position held | 700 |
has part | 700 |
location of formation | 700 |
located on terrain feature | 700 |
architect | 700 |
country of origin | 700 |
publisher | 700 |
director | 700 |
father | 700 |
developer | 700 |
military branch | 700 |
mouth of the watercourse | 700 |
nominated for | 700 |
movement | 700 |
successful candidate | 700 |
followed by | 700 |
manufacturer | 700 |
instance of | 700 |
after a work by | 700 |
member of political party | 700 |
licensed to broadcast to | 700 |
headquarters location | 700 |
sibling | 700 |
instrument | 700 |
country | 700 |
occupation | 700 |
residence | 700 |
work location | 700 |
subsidiary | 700 |
participant | 700 |
operator | 700 |
characters | 700 |
occupant | 700 |
genre | 700 |
operating system | 700 |
owned by | 700 |
platform | 700 |
tributary | 700 |
winner | 700 |
said to be the same as | 700 |
composer | 700 |
league | 700 |
record label | 700 |
distributor | 700 |
screenwriter | 700 |
sports season of league or competition | 700 |
taxon rank | 700 |
location | 700 |
field of work | 700 |
language of work or name | 700 |
applies to jurisdiction | 700 |
notable work | 700 |
located in the administrative territorial entity | 700 |
crosses | 700 |
original language of film or TV show | 700 |
competition class | 700 |
part of | 700 |
sport | 700 |
constellation | 700 |
position played on team / speciality | 700 |
located in or next to body of water | 700 |
voice type | 700 |
follows | 700 |
spouse | 700 |
military rank | 700 |
mother | 700 |
member of | 700 |
child | 700 |
main subject | 700 |
Total | 56000 |
Note: the 56,000 samples here are train and val counted together.
The Wiki80 folder contains 3 files:
Wiki80_rel2id.json: a mapping between each relation and its index, 80 relations in total. Unlike SemEval, these relations do not encode the order of the entities.
Wiki80_train.txt & wiki80_val.txt: train (50,400 samples) and val (5,600 samples), 56,000 samples in total.
The dataset does not include a test set.
Sample format:
Example: {"token": ["Vahitahi", "has", "a", "territorial", "airport", "."], "h": {"name": "territorial airport", "id": "Q16897548", "pos": [3, 5]}, "t": {"name": "vahitahi", "id": "Q1811472", "pos": [0, 1]}, "relation": "place served by transport hub"}
The sample format is almost identical to SemEval, but the head and tail entities carry an additional id attribute.
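A minimal sketch of turning a sample's relation name into its label index, assuming each line of wiki80_train.txt is one JSON object and wiki80_rel2id.json maps relation names to integer ids (the two-entry mapping below is a toy stand-in, not the real file):

```python
import json

# Toy stand-in for wiki80_rel2id.json (the real file maps all 80 relations)
rel2id = {"place served by transport hub": 0, "mountain range": 1}

line = ('{"token": ["Vahitahi", "has", "a", "territorial", "airport", "."], '
        '"h": {"name": "territorial airport", "id": "Q16897548", "pos": [3, 5]}, '
        '"t": {"name": "vahitahi", "id": "Q1811472", "pos": [0, 1]}, '
        '"relation": "place served by transport hub"}')

sample = json.loads(line)
label = rel2id[sample["relation"]]

# The Wikidata ids ("Q...") identify the entities; SemEval samples lack them
print(label)  # 0 under the toy mapping above
```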
The Wiki80 dataset is manually annotated and contains no noise.
References:
Data source: https://github.com/thunlp/OpenNRE/tree/master/benchmark
Data reference: https://opennre-docs.readthedocs.io/en/latest/get_started/benchmark
Statistics: measured by ourselves
NYT10
Dataset source:
NYT10 is the most commonly used dataset for distantly-supervised relation extraction. It comes from the 2010 paper "Modeling Relations and Their Mentions without Labeled Text" and was produced by aligning the NYT corpus with Freebase via distant supervision.
Dataset introduction
Task: given a sentence and two annotated nominals, select the most suitable relation from a given list of relations.
The dataset contains 52+1 relations in total (including NA); the distribution of each relation across the samples is as follows:
The NYT10 folder contains 4 files:
Nyt10_rel2id.json: the 53 relations and their corresponding indices
Nyt10_train.txt: 466,876 samples
Nyt10_val.txt: 55,167 samples
Nyt10_test.txt: 172,448 samples
Note: NYT10 is built by distant supervision, so samples are distributed in bags, i.e., samples containing the same entity pair are grouped together.
Sample format:
Example:
{"text": "Hundreds of bridges were added to the statewide inventory after an earthquake in 1994 in Northridge , a suburb of Los Angeles .", "relation": "/location/neighborhood/neighborhood_of", "h": {"id": "/guid/9202a8c04000641f800000000008fe6d", "name": "Northridge", "pos": [89, 99]}, "t": {"id": "/guid/9202a8c04000641f80000000060b2879", "name": "Los Angeles", "pos": [114, 125]}}
The format is similar to Wiki80's, except that the text in NYT10 is not tokenized: samples store a raw "text" string instead of a "token" list, and the positions index characters rather than tokens.
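Because NYT10 stores raw, untokenized text, "pos" is a character span into "text". A sketch of recovering the entity mentions, using the example above:

```python
import json

# The NYT10 example above, as one JSON line
line = ('{"text": "Hundreds of bridges were added to the statewide inventory '
        'after an earthquake in 1994 in Northridge , a suburb of Los Angeles .", '
        '"relation": "/location/neighborhood/neighborhood_of", '
        '"h": {"id": "/guid/9202a8c04000641f800000000008fe6d", '
        '"name": "Northridge", "pos": [89, 99]}, '
        '"t": {"id": "/guid/9202a8c04000641f80000000060b2879", '
        '"name": "Los Angeles", "pos": [114, 125]}}')

sample = json.loads(line)
text = sample["text"]

# "pos" holds [start, end) character offsets into the raw text
head = text[sample["h"]["pos"][0]:sample["h"]["pos"][1]]
tail = text[sample["t"]["pos"][0]:sample["t"]["pos"][1]]

print(head, tail)  # Northridge Los Angeles
```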
The NYT10 dataset is obtained via distant supervision and contains noise.
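The bag structure mentioned above (samples sharing an entity pair grouped together) can be reconstructed by keying on the head/tail entity ids. A minimal sketch with two toy samples (the "..." text and the ids "A"/"B" are placeholders, not real data):

```python
import json
from collections import defaultdict

# Two toy lines in the NYT10 format; "..." stands in for real sentence text
lines = [
    '{"text": "...", "relation": "/location/location/contains", '
    '"h": {"id": "A", "name": "x", "pos": [0, 1]}, '
    '"t": {"id": "B", "name": "y", "pos": [2, 3]}}',
    '{"text": "...", "relation": "/location/location/contains", '
    '"h": {"id": "A", "name": "x", "pos": [0, 1]}, '
    '"t": {"id": "B", "name": "y", "pos": [2, 3]}}',
]

# A bag is the set of sentences mentioning the same (head, tail) entity pair
bags = defaultdict(list)
for line in lines:
    s = json.loads(line)
    bags[(s["h"]["id"], s["t"]["id"])].append(s)

print(len(bags), len(bags[("A", "B")]))  # 1 2
```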
References
Data source: https://github.com/thunlp/OpenNRE/tree/master/benchmark
Related paper: https://link.springer.com/content/pdf/10.1007%2F978-3-642-15939-8_10.pdf
Statistics: measured by ourselves
All of the data here comes from thunlp. The other commonly used datasets, TACRED and ACE 2005, require an LDC account to download from their official websites. If anyone is willing to provide them, thank you very much!