Some time ago we needed to do semantic text matching, but the company did not have enough labeled data for supervised learning, so we had to fall back on open-source datasets. The open-source dataset had been cleaned into JSON format; we extract the fields from the JSON and save them as txt for later use. In the JSON file, each sample has three fields, "sen1", "sen2", and "label", with each field written on its own line.
The processed txt file stores one sample per line: the first sentence, the second sentence, and the label, separated by tabs.
The processing code is as follows:
sen1 = []
sen2 = []
label = []

# The file is read as plain text, one field per line, for example a line of
# the form   "sen1": "some sentence",
with open('./1.json', encoding='utf-8') as f:
    for line in f:
        if ':' not in line:
            continue  # braces, brackets, and blank lines carry no field
        pos = line.index(':')
        key = line[:pos]  # text before the first colon, i.e. the field name
        if "sen1" in key:
            # Skip ': "' after the key and drop the trailing '",\n'.
            sen1.append(line[pos + 3:len(line) - 3])
        elif "sen2" in key:
            sen2.append(line[pos + 3:len(line) - 3])
        elif "label" in key:
            # The label is the last field, so only '"\n' follows the value.
            label.append(line[pos + 3:len(line) - 2])

# "a+" appends to the output file; rerunning the script adds the rows again.
with open('./1.txt', "a+", encoding='utf-8') as write_file:
    for s1, s2, lab in zip(sen1, sen2, label):
        write_file.write(s1 + "\t" + s2 + "\t" + lab + "\n")
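
The string slicing above depends on the exact spacing and line layout of the JSON file. As a possible alternative, the sketch below parses the file with Python's standard json module instead. It assumes, which the original does not confirm, that 1.json is a single JSON array of objects with "sen1", "sen2", and "label" keys. The function names records_to_lines and convert are made up for illustration.

```python
import json


def records_to_lines(records):
    """Turn a list of {"sen1", "sen2", "label"} dicts into tab-separated lines."""
    lines = []
    for rec in records:
        # str() keeps this working if the label is stored as a number.
        fields = [str(rec["sen1"]), str(rec["sen2"]), str(rec["label"])]
        lines.append("\t".join(fields) + "\n")
    return lines


def convert(json_path, txt_path):
    """Read a JSON array of records and write one tab-separated sample per line."""
    # Assumption: the whole file is one JSON array, not one object per line.
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    # "w" overwrites the output, so reruns do not duplicate rows.
    with open(txt_path, "w", encoding="utf-8") as out:
        out.writelines(records_to_lines(records))
    return len(records)
```

Under those assumptions, calling convert('./1.json', './1.txt') would produce the same tab-separated layout as the script above, except that it overwrites 1.txt rather than appending to it.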