Data preprocessing (1): processing and writing the source data

  • Requirement: merge the training, validation and test sets of the type A data from all folders, and do the same separately for the type B data.

  • Place the data files to be processed under the data directory; once processing is complete, the results are stored under the match_data directory.
  • The data format: each line of a source file is a JSON object whose fields include source, target, and either a class label (labelA / labelB) or an id.
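As a hedged illustration, here is what a single input line might look like, given the field names the processing code reads (source, target, labelA / labelB); the values themselves are invented:

```python
import json

# Hypothetical sample lines; only the field names are taken from the
# processing code below, the contents are made up for illustration.
sample_a = '{"source": "今天 天气 很好", "target": "今天 天气 不错", "labelA": "1"}'
sample_b = '{"source": "How Are You", "target": "how are you", "labelB": "0"}'

record = json.loads(sample_a)  # each line parses into a dict
print(record["source"], record["labelA"])
```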

  • Import the required packages

import os
import json
from tqdm import tqdm
  • File path and directory operations
data_path = "data/" # root directory
path_list = os.listdir(data_path) # list the entries under the root directory
print(path_list)
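A minimal, self-contained sketch of this step, using a throwaway directory tree (the folder names A_corpus / B_corpus are invented stand-ins for the real dataset folders):

```python
import os
import tempfile

# Build a disposable directory tree mimicking the assumed layout:
# <root>/<dataset_dir>/... (the directory names are invented).
root = tempfile.mkdtemp()
for name in ["A_corpus", "B_corpus"]:
    os.makedirs(os.path.join(root, name))

path_list = os.listdir(root)  # just the entry names, in arbitrary order
print(sorted(path_list))      # → ['A_corpus', 'B_corpus']
```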
  • Define lists for storing each kind of data
# Define lists for storing each kind of data
train_data_a = []
valid_data_a = []
train_data_b = []
valid_data_b = []
test_data_a = []
test_data_b = []
  • Processing a single sentence
# Process a single sentence
def process(sentence):
    sentence = sentence.replace(" ", "") # remove spaces
    sentence = sentence.lower() # lowercase
    return sentence
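For example, the helper simply strips spaces and lowercases (non-ASCII characters such as Chinese pass through unchanged):

```python
def process(sentence):
    sentence = sentence.replace(" ", "")  # remove spaces
    sentence = sentence.lower()           # lowercase
    return sentence

print(process("Hello World 你好"))  # → helloworld你好
```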
  • Writing the data to a new file
# Write the data to a new file
def write_data(path, data):
    with open(path, "w", encoding="utf-8") as fw: # open the target file
        for ele in data: # write the data line by line
            ele = [_.replace("\t", "") for _ in ele] # strip tab characters from each field
            fw.write("\t".join(ele) + "\n") # separate columns with tabs, end each line with a newline
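A self-contained round-trip sketch of write_data, writing to a temporary file to show the tab-separated output format (the sample tuple is invented; note how an embedded tab in a field is stripped so it cannot corrupt the column layout):

```python
import os
import tempfile

def write_data(path, data):
    with open(path, "w", encoding="utf-8") as fw:
        for ele in data:
            ele = [_.replace("\t", "") for _ in ele]  # strip embedded tabs
            fw.write("\t".join(ele) + "\n")           # tab-separated columns

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
write_data(path, [("src\ttext", "tgt", "1")])  # the tab inside "src\ttext" is removed

with open(path, encoding="utf-8") as fr:
    content = fr.read()
print(repr(content))  # → 'srctext\ttgt\t1\n'
```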
  • Loop over the data folders and handle each kind of file
# Process each kind of file
for dir_name in path_list: # iterate over the directory names
    # Process the training-set file
    with open(data_path + dir_name + "/train.txt", "r", encoding="utf-8") as fr: # open the training-set file in each folder
        for line in tqdm(fr): # read the file line by line
            data = json.loads(line.strip()) # parse each line as JSON
            if "A" in dir_name: # store type A and type B data separately
                train_data_a.append((process(data["source"]), process(data["target"]), process(data["labelA"])))
            else:
                train_data_b.append((process(data["source"]), process(data["target"]), process(data["labelB"])))

    # Process the validation-set file
    with open(data_path + dir_name + "/valid.txt", "r", encoding="utf-8") as fr:
        for line in tqdm(fr):
            data = json.loads(line.strip())
            if "A" in dir_name:
                valid_data_a.append((process(data["source"]), process(data["target"]), process(data["labelA"])))
            else:
                valid_data_b.append((process(data["source"]), process(data["target"]), process(data["labelB"])))

    # Process the test-set file
    with open(data_path + dir_name + "/test_with_id.txt", "r", encoding="utf-8") as fr:
        for line in tqdm(fr):
            data = json.loads(line.strip())
            if "A" in dir_name:
                test_data_a.append((process(data["source"]), process(data["target"]), process(data["id"])))
            else:
                test_data_b.append((process(data["source"]), process(data["target"]), process(data["id"])))
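The routing logic above can be sketched without touching the file system, using invented in-memory lines and folder names in place of the real files:

```python
import json

# In-memory stand-ins for the per-folder files (lines and names are invented).
folders = {
    "A_corpus": ['{"source": "A B", "target": "a b", "labelA": "1"}'],
    "B_corpus": ['{"source": "C D", "target": "c d", "labelB": "0"}'],
}

def process(sentence):
    return sentence.replace(" ", "").lower()

train_a, train_b = [], []
for dir_name, lines in folders.items():
    for line in lines:
        data = json.loads(line.strip())
        if "A" in dir_name:  # route by folder name, as in the loop above
            train_a.append((process(data["source"]), process(data["target"]), process(data["labelA"])))
        else:
            train_b.append((process(data["source"]), process(data["target"]), process(data["labelB"])))

print(train_a)  # → [('ab', 'ab', '1')]
print(train_b)  # → [('cd', 'cd', '0')]
```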
  • Print the lengths of the data lists
# Print the lengths of the data lists
print(len(train_data_a))
print(len(train_data_b))
print(len(valid_data_a))
print(len(valid_data_b))
print(len(test_data_a))
print(len(test_data_b))
  • Write the processed data to their respective files
# Write the processed data to their respective files
os.makedirs("./match_data", exist_ok=True) # make sure the output directory exists
write_data("./match_data/train_A.txt", train_data_a)
write_data("./match_data/train_B.txt", train_data_b)
write_data("./match_data/valid_A.txt", valid_data_a)
write_data("./match_data/valid_B.txt", valid_data_b)
write_data("./match_data/test_A.txt", test_data_a)
write_data("./match_data/test_B.txt", test_data_b)
  • The integrated data format: each output line holds the processed source, target, and label (or id for the test sets), separated by tabs.
  • The next note records how to convert the data produced here into BERT's input format.

Origin blog.csdn.net/weixin_40605573/article/details/115919736