[Dataset Analysis] TACRED Relation Extraction Dataset Analysis (1): Understanding a Single Instance

Contents

[Dataset Analysis] TACRED Relation Extraction Dataset Analysis (1): Understanding a Single Instance
[Dataset Analysis] TACRED Relation Extraction Dataset Analysis (2): Counting Classes and Instances
[Dataset Analysis] TACRED Relation Extraction Dataset Analysis (3): Relation Distribution
[Dataset Analysis] TACRED Relation Extraction Dataset Analysis (4): Checking for Duplicates Between the Train Set and Valid Set

I recently got my hands on a relation extraction dataset, TACRED, and analyzed single instances, the relation distribution, and so on. Here I share the analysis approach and the code.

1. Analyzing a single instance

{'label': 'org:founded',
 'text': 'Zagat Survey , the guide empire that started as a hobby for Tim and Nina Zagat in 1979 as a two-page typed list of New York restaurants compiled from reviews from friends , has been put up for sale , according to people briefed on the decision .',
 'ents': [['Zagat', 1, 5, 0.5], ['1979', 82, 86, 0.5]],
 'ann': [['Q140258', 0, 12, 0.57093775], ['Q7804542', 60, 78, 0.532475]]}

As you can see, each instance is a JSON object with the following fields:

{'label': '',
 'text': '',
 'ents': [[head entity, head entity start offset, head entity end offset, ], [tail entity, tail entity start offset, tail entity end offset, ]]}
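To make the field layout concrete, here is a minimal sketch that unpacks the raw instance above. The variable names are my own, the `ann` field is omitted, and I read the fourth element of each `ents` entry as a score, though the raw format does not name it:

```python
# The raw instance shown above (the 'ann' field is omitted here).
inst = {
    'label': 'org:founded',
    'text': ('Zagat Survey , the guide empire that started as a hobby for '
             'Tim and Nina Zagat in 1979 as a two-page typed list of New York '
             'restaurants compiled from reviews from friends , has been put up '
             'for sale , according to people briefed on the decision .'),
    'ents': [['Zagat', 1, 5, 0.5], ['1979', 82, 86, 0.5]],
}

# Each 'ents' entry looks like [entity string, start offset, end offset, score].
head_name, head_start, head_end = inst['ents'][0][:3]
tail_name, tail_start, tail_end = inst['ents'][1][:3]

print(inst['label'])         # org:founded
print(head_name, tail_name)  # Zagat 1979
```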

I converted the data into a format and key naming that I prefer, which makes accessing the data more convenient for me. You may want to convert it as well, since the analysis in the following parts is based on the converted dataset.

The dict keys are as follows:

{"text": , "relation": , "h": {"id": , "name": , "pos": }, "t": {"id": , "name": , "pos": }}

After conversion, an instance looks like this:

{
    "text":"Zagat Survey , the guide empire that started as a hobby for Tim and Nina Zagat in 1979 as a two-page typed list of New York restaurants compiled from reviews from friends , has been put up for sale , according to people briefed on the decision .",
    "relation":"org:founded",
    "h":{
        "id":"0",
        "name":"Zagat",
        "pos":[
            1,
            5
        ]
    },
    "t":{
        "id":"1",
        "name":"1979",
        "pos":[
            82,
            86
        ]
    }
}

NOTE:

  1. An instance consists of three parts: {text, h, t}. Both h and t are dicts themselves, each containing three parts: {id, name, pos}.
  2. The original dataset has no ids for h and t, so I assigned them the values 0 and 1 respectively. I also added a pos field to h and t, which gives the position of the head or tail entity within the sentence.
  3. A dict can be converted to and from JSON, which makes storage and loading more standardized.
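As a quick illustration of point 3, here is a minimal sketch of the dict-to-JSON round trip. The instance values are copied from the converted example above, with the long sentence shortened to a placeholder for readability:

```python
import json

# The converted instance from above, with the sentence abbreviated.
instance = {
    'text': 'Zagat Survey , the guide empire ...',
    'relation': 'org:founded',
    'h': {'id': '0', 'name': 'Zagat', 'pos': [1, 5]},
    't': {'id': '1', 'name': '1979', 'pos': [82, 86]},
}

line = json.dumps(instance)      # dict -> JSON string (one line, ready to write)
restored = json.loads(line)      # JSON string -> dict when reading back
print(restored == instance)      # True
```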

2. Code

import json

def convert_dataset(old_path, new_path):
    # Read the raw instances and write one converted JSON object per line.
    with open(old_path, 'r', encoding='utf-8') as f, \
         open(new_path, 'w', encoding='utf-8') as f_op:
        for i in json.load(f):
            instance = {
                'text': i['text'],
                'relation': i['label'],
                # The raw data has no entity ids, so assign '0' to the head
                # entity and '1' to the tail entity.
                'h': {
                    'id': '0',
                    'name': i['ents'][0][0],
                    'pos': [i['ents'][0][1], i['ents'][0][2]],
                },
                't': {
                    'id': '1',
                    'name': i['ents'][1][0],
                    'pos': [i['ents'][1][1], i['ents'][1][2]],
                },
            }
            json.dump(instance, f_op)
            f_op.write('\n')

# train_path, valid_path and test_path hold the paths to the raw files.
convert_dataset(train_path, 'tacred_train.txt')
convert_dataset(valid_path, 'tacred_valid.txt')
convert_dataset(test_path, 'tacred_test.txt')
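Since the converter writes one JSON object per line, reading a converted file back is just a matter of parsing each line. A small sketch (`load_converted` is a helper name of my own):

```python
import json

def load_converted(path):
    # Read a file produced by convert_dataset: one JSON instance per line.
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Example:
# data = load_converted('tacred_train.txt')
# print(len(data), data[0]['relation'], data[0]['h']['name'])
```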

References

[1] TACRED official site: https://nlp.stanford.edu/projects/tacred/

Reprinted from blog.csdn.net/xiangduixuexi/article/details/107217121