Task Details
Basic requirements: write the data to the output file in the format shown in the samples below. Note that every 暂无数据 ("no data available") in the input file is written to the output file as 暂无, and every None is written as NULL. The sample data file contains 240 records in total.
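The two substitution rules can be sketched as a small helper. The name normalize_placeholder is hypothetical and does not appear in the program further down; it only illustrates the mapping, assuming placeholders may carry surrounding whitespace.

```python
def normalize_placeholder(value: str) -> str:
    """Map the input placeholders to the forms required in the output."""
    text = value.strip()
    # A field that is exactly "None" becomes "NULL".
    if text == 'None':
        return 'NULL'
    # The long "no data" placeholder is shortened wherever it occurs.
    return text.replace('暂无数据', '暂无')


print(normalize_placeholder(' None'))          # NULL
print(normalize_placeholder('电梯:暂无数据'))    # 电梯:暂无
print(normalize_placeholder('5/18层'))         # 5/18层
```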
Input File Sample
A sample of the input data file ori_data follows:
Tue Mar 19 16:23:02 2019,杭州租房网 > 萧山租房 > 钱江世纪城租房 > 佳境天城人合苑租房 , 合租·佳境天城人合苑4室1厅, 2430元/月(季付价), 公寓 独立卫生间 近地铁 押一付一 随时看房 , 合租 4室1厅2卫 16㎡ 朝南 房屋信息 基本信息 发布:12天前 入住:随时入住 租期:暂无数据 看房:随时可看 楼层:5/18层 电梯:暂无数据 车位:暂无数据 用水:暂无数据 用电:暂无数据 燃气:暂无数据 采暖:暂无数据 ,None, 地址和交通距离地铁2号线-振宁路329m, end
Tue Mar 19 16:23:02 2019,杭州租房 > 滨江租房 > 浦沿租房 > 朗诗寓·东信大道店租房 , None, 朗诗寓·东信大道店 2550元/月起 ,None, None, None, None, 地址和交通, end
Sample output file
A sample of the output data file dea_data follows:
Tue Mar 19 杭州 萧山 钱江世纪城 佳境天城人合苑 2430元/月 16㎡ 4室 2卫 1厅 朝南 5/18层 NULL 随时入住 暂无 暂无 暂无 暂无 暂无 暂无
Tue Mar 19 杭州 滨江 浦沿 朗诗寓·东信大道店 2550元/月 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
Task Analysis
The first thing to notice is that the sample file uses Unix (LF) line endings, so line endings and file encoding may need to be handled explicitly.
Secondly, it is easy to see that the fields in each row are comma-separated, which suggests using Python's csv module to process them.
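As a quick illustration of what csv.reader produces for this kind of line, here is a minimal sketch that parses the second sample row from an in-memory string instead of the real file. The variable names are made up for the example.

```python
import csv
import io

# The second sample row from ori_data, held in memory for the demonstration.
sample = ('Tue Mar 19 16:23:02 2019,杭州租房 > 滨江租房 > 浦沿租房 > 朗诗寓·东信大道店租房 ,'
          ' None, 朗诗寓·东信大道店 2550元/月起 ,None, None, None, None, 地址和交通, end\n')

# csv.reader accepts any iterable of lines, so a StringIO stands in for the file.
rows = list(csv.reader(io.StringIO(sample, newline='')))
fields = rows[0]

print(len(fields))          # 10
print(fields[0])            # Tue Mar 19 16:23:02 2019
print(repr(fields[-1]))     # ' end'
print(fields[3].strip())    # 朗诗寓·东信大道店 2550元/月起
```

With the default dialect the space after each comma stays in the field, which is why the values are stripped before they are compared.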
Before processing, the useless data should be stripped out first, so that the later steps stay clearly organized.
The sample output has a specific format, so we can import Python's re module and use regular expressions to pick out the data.
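The following is a minimal sketch of regex-based extraction, run on an abridged copy of the detail column from the first sample row. The helper find_or_null is a hypothetical name, and the patterns use \d+ (at least one digit) as an assumption; the program below uses \d* instead.

```python
import re

# Abridged detail column from the first sample row.
info = ('合租 4室1厅2卫 16㎡ 朝南 房屋信息 基本信息 发布:12天前 入住:随时入住 '
        '租期:暂无数据 楼层:5/18层 电梯:暂无数据')


def find_or_null(pattern: str, text: str) -> str:
    """Return the first match of pattern in text, or 'NULL' if there is none."""
    match = re.search(pattern, text)
    return match.group() if match else 'NULL'


print(find_or_null(r'\d+㎡', info))       # 16㎡
print(find_or_null(r'\d+室', info))       # 4室
print(find_or_null(r'\d+/\d+层', info))   # 5/18层
print(find_or_null(r'\d+~\d+年', info))   # NULL
```

Returning 'NULL' from one place avoids calling .group() on a failed search, which would raise AttributeError.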
Source Code and Brief Description
The program is described by the comments in the following code.
#!/usr/bin/env python3
import csv
import re

with open('ori_data', mode='r', encoding='utf-8', newline='') as csv_in_file:
    # utf-8 is set explicitly so the Chinese text is written the same way on every platform
    with open('dea_data_output', mode='w', encoding='utf-8', newline='') as out_file:
        filereader = csv.reader(csv_in_file)
        for row_list in filereader:
            # Create the string that will be written to the output file
            out_str = ''
            # Remove the useless fields that every row has in common
            row_list.pop()
            row_list.pop(2)
            row_list.pop(3)
            # Clean up the first three columns
            # (the date pattern assumes every row is dated Tue Mar 19)
            row_list[0] = re.search(r'Tue Mar 19', row_list[0]).group()
            row_list[1] = ''.join(row_list[1].split())
            row_list[1] = ''.join(row_list[1].replace('>', '').replace('租房', ' ').replace('网', '').rstrip())
            row_list[2] = re.search(r'(\d*)元/月', row_list[2]).group()
            # Remove the row-specific useless fields, depending on the list length
            if len(row_list) == 9:
                row_list.pop()
                row_list.pop()
                row_list.pop()
                row_list.pop()
                row_list.pop()
            elif len(row_list) == 8:
                row_list.pop()
                row_list.pop()
                row_list.pop()
                row_list.pop()
            elif len(row_list) == 7:
                row_list.pop()
                row_list.pop()
                row_list.pop()
            elif len(row_list) == 6:
                row_list.pop()
                row_list.pop()
            # Rewrite the last column
            if row_list[-1].strip() == 'None':
                row_list[-1] = 'NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL'
            else:
                s = ''
                # Floor area
                s += re.search(r'(\d*㎡)', row_list[-1]).group()
                s += ' '
                # Bedrooms, bathrooms, living rooms
                s += re.search(r'(\d*室)', row_list[-1]).group()
                s += ' '
                s += re.search(r'(\d*卫)', row_list[-1]).group()
                s += ' '
                s += re.search(r'(\d*厅)', row_list[-1]).group()
                s += ' '
                # Orientation
                s += re.search(r'朝\w', row_list[-1]).group()
                s += ' '
                # Floor
                if re.search(r'(\d*/\d*层)', row_list[-1]) is None:
                    s += 'NULL'
                else:
                    # A match that starts with '/' has no floor number, so treat it as missing
                    if re.search(r'(\d*/\d*层)', row_list[-1]).group()[0] == '/':
                        s += 'NULL'
                    else:
                        s += re.search(r'(\d*/\d*层)', row_list[-1]).group()
                s += ' '
                # Lease term
                if re.search(r'(\d*~\d*年)', row_list[-1]) is None:
                    s += 'NULL'
                else:
                    s += re.search(r'(\d*~\d*年)', row_list[-1]).group()
                s += ' '
                # Move-in date
                if re.search(r'随时入住', row_list[-1]) is None:
                    s += 'NULL'
                else:
                    s += re.search(r'随时入住', row_list[-1]).group()
                s += ' '
                # Elevator
                if re.search(r'电梯:有', row_list[-1]):
                    s += '有 '
                elif re.search(r'电梯:无', row_list[-1]):
                    s += '无 '
                elif re.search(r'电梯:暂无数据', row_list[-1]):
                    s += '暂无 '
                # Parking space
                if re.search(r'车位:免费', row_list[-1]):
                    s += '免费 '
                elif re.search(r'车位:租用', row_list[-1]):
                    s += '租用 '
                elif re.search(r'车位:暂无数据', row_list[-1]):
                    s += '暂无 '
                # Water
                if re.search(r'用水:民水', row_list[-1]):
                    s += '民水 '
                elif re.search(r'用水:商水', row_list[-1]):
                    s += '商水 '
                elif re.search(r'用水:暂无数据', row_list[-1]):
                    s += '暂无 '
                # Electricity
                if re.search(r'用电:民电', row_list[-1]):
                    s += '民电 '
                elif re.search(r'用电:商电', row_list[-1]):
                    s += '商电 '
                elif re.search(r'用电:暂无数据', row_list[-1]):
                    s += '暂无 '
                # Gas
                if re.search(r'燃气:有', row_list[-1]):
                    s += '有 '
                elif re.search(r'燃气:无', row_list[-1]):
                    s += '无 '
                elif re.search(r'燃气:暂无数据', row_list[-1]):
                    s += '暂无 '
                # Heating
                if re.search(r'采暖:自采暖', row_list[-1]):
                    s += '自采暖'
                elif re.search(r'采暖:集中供暖', row_list[-1]):
                    s += '集中'
                elif re.search(r'采暖:暂无数据', row_list[-1]):
                    s += '暂无'
                row_list[-1] = s
            # Append the cleaned fields to the output string
            out_str += row_list[0] + ' ' + row_list[1] + ' ' + row_list[2] + ' ' + row_list[3] + '\n'
            # Write to the file
            out_file.write(out_str)
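The six label checks at the end of the program (elevator, parking, water, electricity, gas, heating) all follow the same pattern, so they could also be driven by a lookup table. The sketch below is one possible alternative, not the author's code: the names LABEL_VALUES and extract_labeled_fields are hypothetical, the value lists are copied from the branches above, and, unlike the program, it emits NULL when none of the known values is present.

```python
# Label -> {value as it appears in the input: value written to the output}.
# The entries mirror the if/elif branches of the program above.
LABEL_VALUES = {
    '电梯': {'有': '有', '无': '无', '暂无数据': '暂无'},
    '车位': {'免费': '免费', '租用': '租用', '暂无数据': '暂无'},
    '用水': {'民水': '民水', '商水': '商水', '暂无数据': '暂无'},
    '用电': {'民电': '民电', '商电': '商电', '暂无数据': '暂无'},
    '燃气': {'有': '有', '无': '无', '暂无数据': '暂无'},
    '采暖': {'自采暖': '自采暖', '集中供暖': '集中', '暂无数据': '暂无'},
}


def extract_labeled_fields(info: str) -> list:
    """Return the six labeled values in output order, 'NULL' where unknown."""
    out = []
    for label, mapping in LABEL_VALUES.items():
        for raw, cleaned in mapping.items():
            # Same substring test as re.search(r'电梯:有', ...) in the program.
            if f'{label}:{raw}' in info:
                out.append(cleaned)
                break
        else:
            # Assumption: a missing or unrecognized value is reported as NULL.
            out.append('NULL')
    return out


info = '电梯:暂无数据 车位:暂无数据 用水:暂无数据 用电:暂无数据 燃气:暂无数据 采暖:暂无数据'
print(' '.join(extract_labeled_fields(info)))   # 暂无 暂无 暂无 暂无 暂无 暂无
```

Always appending exactly one value per label keeps every output line at the same number of columns, which the if/elif chains do not guarantee when a label is missing.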
Output
The dea_data file on the left is the original sample output; the dea_data_output file on the right is the output produced by running the program above.
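The same comparison can be made without a side-by-side viewer. The sketch below uses the standard difflib module; the function name diff_files is hypothetical, and both files are assumed to be UTF-8 encoded.

```python
import difflib
from pathlib import Path


def diff_files(expected_path: str, actual_path: str) -> list:
    """Return unified-diff lines between two text files (empty if identical)."""
    expected = Path(expected_path).read_text(encoding='utf-8').splitlines()
    actual = Path(actual_path).read_text(encoding='utf-8').splitlines()
    return list(difflib.unified_diff(expected, actual,
                                     fromfile=expected_path,
                                     tofile=actual_path,
                                     lineterm=''))


# Example call, assuming both files are in the working directory:
#     for line in diff_files('dea_data', 'dea_data_output'):
#         print(line)
```

An empty result means the program output matches the reference file line for line.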