SQuAD Data Preprocessing (2)

Below is a worked example of SQuAD data preprocessing; use it to understand the functions defined in preprocess.py.


Preface

A step-by-step walkthrough of the data processing.


1. Data Processing

Loading and parsing the data

# load dataset json files

train_data = load_json('data/squad_train.json')
valid_data = load_json('data/squad_dev.json')
-------------------------------------------------------
out:
Length of data:  442
Data Keys:  dict_keys(['title', 'paragraphs'])
Title:  University_of_Notre_Dame
Length of data:  48
Data Keys:  dict_keys(['title', 'paragraphs'])
Title:  Super_Bowl_50
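
load_json itself is defined in preprocess.py. A minimal sketch consistent with the printed output above, assuming the standard SQuAD v1.1 layout (a dict whose 'data' key holds one dict per article):

import json

def load_json(path):
    # Read a SQuAD-style JSON file and return the list of articles.
    # Assumes the standard SQuAD v1.1 layout, where each article is
    # a dict with 'title' and 'paragraphs' keys.
    with open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)['data']
    print('Length of data: ', len(data))
    print('Data Keys: ', data[0].keys())
    print('Title: ', data[0]['title'])
    return data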
# parse the json structure into a list of dictionaries, each holding
# a (context, question, label) triple plus the question id and answer text
train_list = parse_data(train_data)
valid_list = parse_data(valid_data)

print('Train list len: ',len(train_list))
print('Valid list len: ',len(valid_list))
-------------------------------------------------------
out:
Train list len:  87599
Valid list len:  34726
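
parse_data flattens the nested article -> paragraph -> question structure. A sketch matching the comment above (one dict per answer, carrying the question id, the context, the question, the answer text, and the answer's character span as the label):

def parse_data(data):
    # Flatten the nested SQuAD structure into a flat list of dicts.
    # 'label' is the answer's character span [start, end) in the context.
    examples = []
    for article in data:
        for para in article['paragraphs']:
            context = para['context']
            for qa in para['qas']:
                for ans in qa['answers']:
                    start = ans['answer_start']
                    examples.append({'id': qa['id'],
                                     'context': context,
                                     'question': qa['question'],
                                     'answer': ans['text'],
                                     'label': [start, start + len(ans['text'])]})
    return examples

Each train question carries a single reference answer, so the train list length matches the question count; dev questions carry several reference answers each, which is why the valid list is larger than the dev set's ~10.5k questions.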

Building the vocabulary

# convert the lists to dataframes
train_df = pd.DataFrame(train_list)
valid_df = pd.DataFrame(valid_list)

train_df.head()

(output: the first five rows of train_df)
Dropping outliers (examples that are too long)

# get the indices of outlier examples and drop them from the dataframe

%time drop_ids_train = filter_large_examples(train_df)
train_df.drop(list(drop_ids_train), inplace=True)

%time drop_ids_valid = filter_large_examples(valid_df)
valid_df.drop(list(drop_ids_valid), inplace=True)
-----------------------------------------------------
out:
Wall time: 1min 25s
Wall time: 34.7 s
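
filter_large_examples is defined in preprocess.py as well. A sketch under the assumption that an "outlier" is simply a context or question with far more tokens than usual (the cutoffs below are illustrative, not the author's exact values):

def filter_large_examples(df):
    # Return the set of row indices whose context or question is
    # abnormally long; the caller drops these rows from the dataframe.
    # The 400/30-token cutoffs are illustrative assumptions.
    ctx_lens = df.context.str.split().str.len()
    qst_lens = df.question.str.split().str.len()
    mask = (ctx_lens > 400) | (qst_lens > 30)
    return set(df[mask].index)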
# gather the text used to build the vocabulary
vocab_text = gather_text_for_vocab([train_df, valid_df])
print("Number of sentences in the dataset: ", len(vocab_text))
-----------------------------------------------------
out:
Number of sentences in the dataset:  118441
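
gather_text_for_vocab pools the raw text the vocabularies are built from. A plausible sketch, assuming it collects the unique contexts and questions of every dataframe passed in:

def gather_text_for_vocab(dfs):
    # Collect unique contexts and questions from each dataframe; the
    # word and character vocabularies are built from this combined text.
    text = []
    for df in dfs:
        text.extend(df.context.unique().tolist())
        text.extend(df.question.unique().tolist())
    return text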

Building word- and character-level vocabularies

%time word2idx, idx2word, word_vocab = build_word_vocab(vocab_text)
print("----------------------------------")
%time char2idx, char_vocab = build_char_vocab(vocab_text)
-----------------------------------------
out:
raw-vocab: 110478
vocab-length: 110480
word2idx-length: 110480
Wall time: 22.5 s
----------------------------------
raw-char-vocab: 1401
char-vocab-intersect: 232
char2idx-length: 234
Wall time: 1.81 s
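
The printed counts hint at the implementation: vocab-length is raw-vocab + 2, so two special tokens (say <pad> and <unk>) are prepended, and the character vocabulary is likewise a filtered set of frequent characters plus the same 2 specials. A sketch along those lines, assuming whitespace tokenization and an illustrative character-frequency cutoff:

from collections import Counter

def build_word_vocab(texts):
    # Count every token, then prepend <pad>/<unk>; this reproduces the
    # raw-vocab -> raw-vocab + 2 pattern seen in the output above.
    counter = Counter(w for text in texts for w in text.split())
    print('raw-vocab:', len(counter))
    word_vocab = ['<pad>', '<unk>'] + sorted(counter, key=counter.get, reverse=True)
    print('vocab-length:', len(word_vocab))
    word2idx = {w: i for i, w in enumerate(word_vocab)}
    print('word2idx-length:', len(word2idx))
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word, word_vocab

def build_char_vocab(texts):
    # Keep only reasonably frequent characters (the cutoff of 20 is an
    # illustrative assumption), then prepend the 2 special tokens.
    counter = Counter(c for text in texts for c in text)
    print('raw-char-vocab:', len(counter))
    common = [c for c, n in counter.items() if n >= 20]
    print('char-vocab-intersect:', len(common))
    char_vocab = ['<pad>', '<unk>'] + common
    char2idx = {c: i for i, c in enumerate(char_vocab)}
    print('char2idx-length:', len(char2idx))
    return char2idx, char_vocab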

Cleaning up errors

This step removes examples whose labels were broken by symbol and whitespace mismatches during tokenization.

# numericalize the contexts and questions of the training and validation sets
%time train_df['context_ids'] = train_df.context.apply(context_to_ids, word2idx=word2idx)
%time valid_df['context_ids'] = valid_df.context.apply(context_to_ids, word2idx=word2idx)
%time train_df['question_ids'] = train_df.question.apply(question_to_ids, word2idx=word2idx)
%time valid_df['question_ids'] = valid_df.question.apply(question_to_ids, word2idx=word2idx)
print(train_df.loc[0])
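
context_to_ids and question_to_ids simply map each token to its vocabulary index. A minimal sketch, assuming out-of-vocabulary tokens fall back to <unk>:

def context_to_ids(text, word2idx):
    # Map a string to a list of word indices; OOV tokens map to <unk>.
    return [word2idx.get(tok, word2idx['<unk>']) for tok in text.split()]

def question_to_ids(text, word2idx):
    # Identical to context_to_ids in this sketch.
    return context_to_ids(text, word2idx)

Because Series.apply forwards keyword arguments, the word2idx=word2idx in the calls above reaches these functions directly.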
# get the indices of rows with tokenization errors (caused by stray
# whitespace, symbols, and the like) and drop those rows
train_err = get_error_indices(train_df, idx2word)
valid_err = get_error_indices(valid_df, idx2word)

train_df.drop(train_err, inplace=True)
valid_df.drop(valid_err, inplace=True)
-----------------------------------------------------
out:
Number of error indices: 1000
Number of error indices: 428
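
get_error_indices flags rows whose answer can no longer be found as a contiguous span of the tokenized context (the whitespace and symbol mismatches mentioned above), since no valid start/end label can be produced for them. A sketch, reusing the whitespace tokenization assumed earlier:

def get_error_indices(df, idx2word):
    # idx2word mirrors the real signature; it is unused in this sketch.
    err = []
    for idx, row in df.iterrows():
        ctx = row['context'].split()
        ans = row['answer'].split()
        n = len(ans)
        # flag the row if the answer tokens do not appear as a
        # contiguous span of the context tokens
        if not any(ctx[i:i + n] == ans for i in range(len(ctx) - n + 1)):
            err.append(idx)
    print('Number of error indices:', len(err))
    return err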

# get the start and end positions of the answer within the context;
# this is the label for training QA models
train_label_idx = train_df.apply(index_answer, axis=1, idx2word=idx2word)
valid_label_idx = valid_df.apply(index_answer, axis=1, idx2word=idx2word)

train_df['label_idx'] = train_label_idx
valid_df['label_idx'] = valid_label_idx
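
index_answer turns each answer into token-level start and end positions inside the context, consistent with the get_error_indices sketch above:

def index_answer(row, idx2word):
    # Locate the answer tokens inside the context tokens and return the
    # inclusive [start, end] token positions -- the training label.
    # idx2word mirrors the real signature; it is unused in this sketch.
    ctx = row['context'].split()
    ans = row['answer'].split()
    n = len(ans)
    for i in range(len(ctx) - n + 1):
        if ctx[i:i + n] == ans:
            return [i, i + n - 1]
    return None  # unreachable once get_error_indices has filtered the rows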

Summary

An example record after all the processing:

print(train_df.loc[0])

(output: the fully processed row, now including context_ids, question_ids, and label_idx)


Reposted from blog.csdn.net/qq_42388742/article/details/112123754