-
Requirement: merge the training, validation, and test sets of the type A data from all folders, and do the same separately for the type B data.
-
Place the raw data files under the data directory; once processing is complete, the results are stored under the match_data directory.
The data format is as follows:
- Import the required packages
import os
import json
from tqdm import tqdm
- File path and directory related operations
data_path = "data/" # root directory
path_list = os.listdir(data_path) # list the sub-directories under the root
print(path_list)
- Separately define lists for storing various types of data
# Define separate lists for holding each kind of data
train_data_a = []
valid_data_a = []
train_data_b = []
valid_data_b = []
test_data_a = []
test_data_b = []
- Process a single sentence
# Normalize a single sentence
def process(sentence):
    sentence = sentence.replace(" ", "") # strip all spaces
    sentence = sentence.lower() # lowercase
    return sentence
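As a quick sanity check, the behaviour of process can be sketched as follows (the function is redefined inline so the snippet runs on its own):

```python
# Redefined inline for a self-contained check
def process(sentence):
    sentence = sentence.replace(" ", "")  # strip all spaces
    return sentence.lower()               # lowercase

print(process("Hello World"))  # helloworld
```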
- Write data to a new file
# Write the processed data to a new file
def write_data(path, data):
    with open(path, "w", encoding="utf-8") as fw: # open the target file
        for ele in data: # write one row per record
            ele = [_.replace("\t", "") for _ in ele] # remove any stray tabs inside each field
            fw.write("\t".join(ele) + "\n") # tab-separate the columns; end each row with a newline
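A minimal usage sketch of write_data, writing a couple of hypothetical rows to a temporary file and reading them back (write_data is redefined inline so the snippet is self-contained):

```python
import os
import tempfile

def write_data(path, data):
    with open(path, "w", encoding="utf-8") as fw:
        for ele in data:
            ele = [_.replace("\t", "") for _ in ele]
            fw.write("\t".join(ele) + "\n")

rows = [("source1", "target1", "0"), ("source2", "target2", "1")]  # hypothetical rows
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
write_data(tmp_path, rows)

with open(tmp_path, encoding="utf-8") as fr:
    back = [line.rstrip("\n").split("\t") for line in fr]
# back == [["source1", "target1", "0"], ["source2", "target2", "1"]]
```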
- Loop over the data folders and process each kind of file
# Process each kind of file in turn
for dir_name in path_list: # iterate over the folder names
    # Training set
    with open(data_path + dir_name + "/train.txt", "r", encoding="utf-8") as fr: # open the training file in each folder
        for line in tqdm(fr): # read the file line by line
            data = json.loads(line.strip()) # parse each line as JSON
            if "A" in dir_name: # store type A and type B data separately
                train_data_a.append((process(data["source"]), process(data["target"]), process(data["labelA"])))
            else:
                train_data_b.append((process(data["source"]), process(data["target"]), process(data["labelB"])))
    # Validation set
    with open(data_path + dir_name + "/valid.txt", "r", encoding="utf-8") as fr:
        for line in tqdm(fr):
            data = json.loads(line.strip())
            if "A" in dir_name:
                valid_data_a.append((process(data["source"]), process(data["target"]), process(data["labelA"])))
            else:
                valid_data_b.append((process(data["source"]), process(data["target"]), process(data["labelB"])))
    # Test set (no labels, so keep the sample id instead)
    with open(data_path + dir_name + "/test_with_id.txt", "r", encoding="utf-8") as fr:
        for line in tqdm(fr):
            data = json.loads(line.strip())
            if "A" in dir_name:
                test_data_a.append((process(data["source"]), process(data["target"]), process(data["id"])))
            else:
                test_data_b.append((process(data["source"]), process(data["target"]), process(data["id"])))
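A single input line can be traced by hand; the sample line below is hypothetical but uses the same keys (source, target, labelA) the loop above reads:

```python
import json

# Hypothetical sample line in the assumed one-JSON-object-per-line format
line = '{"source": "Hello World", "target": "hello world", "labelA": "1"}'
data = json.loads(line.strip())

# Apply the same normalization the loop applies via process()
row = (data["source"].replace(" ", "").lower(),
       data["target"].replace(" ", "").lower(),
       data["labelA"])
print(row)  # ('helloworld', 'helloworld', '1')
```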
- Print the length of the data list
# Print the length of each data list
print(len(train_data_a))
print(len(train_data_b))
print(len(valid_data_a))
print(len(valid_data_b))
print(len(test_data_a))
print(len(test_data_b))
- Write the processed new data to their respective files
# Write the processed data to their respective files
write_data("./match_data/train_A.txt", train_data_a)
write_data("./match_data/train_B.txt", train_data_b)
write_data("./match_data/valid_A.txt", valid_data_a)
write_data("./match_data/valid_B.txt", valid_data_b)
write_data("./match_data/test_A.txt", test_data_a)
write_data("./match_data/test_B.txt", test_data_b)
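For downstream use, the merged files can be read back by splitting each row on the tab separator; read_match_data below is a hypothetical helper, not part of this notebook, and the round-trip uses a temporary file rather than the real match_data output:

```python
import os
import tempfile

def read_match_data(path):
    # Split each tab-separated row back into its columns
    with open(path, encoding="utf-8") as fr:
        return [line.rstrip("\n").split("\t") for line in fr]

# Round-trip on a hypothetical file in the same format as the output above
tmp_path = os.path.join(tempfile.mkdtemp(), "train_A.txt")
with open(tmp_path, "w", encoding="utf-8") as fw:
    fw.write("sourcea\ttargeta\t1\n")
print(read_match_data(tmp_path))  # [['sourcea', 'targeta', '1']]
```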
- The integrated data format is as follows:
- The next notebook records how to convert the data processed here into BERT's input format.