Lab Data Processing Task Summary (2020-03-14)

If you want the data set and the code, please click here.


Task Details

Basic requirements
Write the sample data to the output file in the format shown in the examples below. Note that every 暂无数据 ("no data yet") in the input file is written to the output file as 暂无, and every None is written as NULL. The sample data file contains 240 records in total.
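The two substitution rules can be sketched as a small helper. This is only an illustration: the function name normalize_placeholder is made up here and does not appear in the program further down.

```python
def normalize_placeholder(value: str) -> str:
    """Map the input placeholders to the form required in the output file.

    Hypothetical helper for illustration; not part of the original program.
    """
    text = value.strip()
    if text == 'None':
        return 'NULL'
    # Shorten every '暂无数据' to '暂无'
    return text.replace('暂无数据', '暂无')


print(normalize_placeholder(' None'))
print(normalize_placeholder('电梯:暂无数据'))
```

Values that contain neither placeholder pass through unchanged apart from the surrounding whitespace being stripped.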
Sample input file
A sample of the data file ori_data follows:

Tue Mar 19 16:23:02 2019,杭州租房网 >  萧山租房 >  钱江世纪城租房 >   佳境天城人合苑租房  , 合租·佳境天城人合苑4室1厅, 2430元/月(季付价), 公寓 独立卫生间 近地铁 押一付一 随时看房 , 合租 4室1厅2卫 16㎡ 朝南  房屋信息  基本信息 发布:12天前 入住:随时入住   租期:暂无数据 看房:随时可看   楼层:5/18层 电梯:暂无数据   车位:暂无数据 用水:暂无数据   用电:暂无数据 燃气:暂无数据   采暖:暂无数据  ,None, 地址和交通距离地铁2号线-振宁路329m, end 
Tue Mar 19 16:23:02 2019,杭州租房 >  滨江租房 >  浦沿租房 >   朗诗寓·东信大道店租房  , None,  朗诗寓·东信大道店  2550元/月起  ,None, None, None, None, 地址和交通, end 

Sample output file
A sample of the output file dea_data follows:

Tue Mar 19 杭州  萧山  钱江世纪城  佳境天城人合苑 2430元/月 16㎡ 4室 2卫 1厅 朝南 5/18层 NULL 随时入住 暂无 暂无 暂无 暂无 暂无 暂无
Tue Mar 19 杭州  滨江  浦沿  朗诗寓·东信大道店 2550元/月 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL

Task Analysis

  First, the sample file uses Unix (LF) line endings, so the file's encoding and line-ending format may need to be taken into account.
  Second, the fields in each row are clearly comma-separated values, which suggests using Python's csv module to process them.
  Before extracting anything, the fields that are not needed should be discarded, which keeps the later processing steps clearly organized.
  Because the sample output follows a specific format, we can import Python's re module and use regular expressions to pick out the required data.
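As a quick check of this plan, the sketch below parses one record with csv and pulls three values out with re. The record is a shortened version of the first sample row (same comma layout, fewer attributes), and the patterns are illustrative rather than the exact ones used in the program.

```python
import csv
import re

# Shortened version of the first sample record: nine comma-separated fields
line = ('Tue Mar 19 16:23:02 2019,杭州租房网 >  萧山租房 , 合租·佳境天城人合苑4室1厅, '
        '2430元/月(季付价), 公寓 近地铁 , 合租 4室1厅2卫 16㎡ 朝南 楼层:5/18层 ,None, 地址和交通, end \n')

# csv.reader accepts any iterable of lines, so a one-element list is enough
row = next(csv.reader([line]))
print(len(row))

# Regular expressions pick the interesting values out of the free-text fields
price = re.search(r'\d+元/月', row[3]).group()
area = re.search(r'\d+㎡', row[5]).group()
floor = re.search(r'\d+/\d+层', row[5]).group()
print(price, area, floor)
```

Splitting on commas with csv first and only then applying regular expressions keeps each pattern short, because it only has to search one field instead of the whole line.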

Source Code and Brief Description

  The program is described by the comments in the code below.

#!/usr/bin/env python3

import csv
import re

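with open('ori_data', mode='r', encoding='utf-8', newline='') as csv_in_file:
    # Write the output as UTF-8 too, so the Chinese text does not depend on the platform default encoding
    with open('dea_data_output', mode='w', encoding='utf-8', newline='') as out_file: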
        filereader = csv.reader(csv_in_file)
        for row_list in filereader:
            # Build the output string that will be written to the output file
            out_str = ''

            # Remove the useless fields that every row has in common
            row_list.pop()
            row_list.pop(2)
            row_list.pop(3)

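            # Normalize the first three columns
            # Keep only the weekday, month and day of the timestamp (e.g. 'Tue Mar 19')
            row_list[0] = re.search(r'\w{3} \w{3}\s+\d{1,2}', row_list[0]).group()
            # Drop all whitespace, then strip '>' and '网' and turn each '租房' into a two-space separator
            row_list[1] = ''.join(row_list[1].split())
            row_list[1] = row_list[1].replace('>', '').replace('租房', '  ').replace('网', '').rstrip()
            # Keep only the monthly rent, e.g. '2430元/月'
            row_list[2] = re.search(r'(\d*)元/月', row_list[2]).group()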

            # Remove the remaining useless trailing fields:
            # rows with 6 to 9 fields are all cut down to their first four columns
            if 6 <= len(row_list) <= 9:
                del row_list[4:]

            # Rebuild the last column (the detailed house information)
            if row_list[-1].strip() == 'None':
                row_list[-1] = 'NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL'
            else:
                s = ''
                # Floor area
                s += re.search(r'(\d*㎡)', row_list[-1]).group()
                s += ' '
                # Bedrooms (室), bathrooms (卫) and living rooms (厅)
                s += re.search(r'(\d*室)', row_list[-1]).group()
                s += ' '
                s += re.search(r'(\d*卫)', row_list[-1]).group()
                s += ' '
                s += re.search(r'(\d*厅)', row_list[-1]).group()
                s += ' '
                # Orientation
                s += re.search(r'朝\w', row_list[-1]).group()
                s += ' '
                # Floor, e.g. '5/18层'; NULL if it is absent or the floor number itself is missing
                floor = re.search(r'(\d*/\d*层)', row_list[-1])
                if floor is None or floor.group()[0] == '/':
                    s += 'NULL'
                else:
                    s += floor.group()
                s += ' '
                # Lease term, e.g. '1~3年'; NULL if it is absent
                lease = re.search(r'(\d*~\d*年)', row_list[-1])
                if lease is None:
                    s += 'NULL'
                else:
                    s += lease.group()
                s += ' '
                # Move-in date (only '随时入住', "move in any time", is recognized)
                if re.search(r'随时入住', row_list[-1]) is None:
                    s += 'NULL'
                else:
                    s += re.search(r'随时入住', row_list[-1]).group()
                s += ' '
                # Elevator
                if re.search(r'电梯:有', row_list[-1]):
                    s += '有 '
                elif re.search(r'电梯:无', row_list[-1]):
                    s += '无 '
                elif re.search(r'电梯:暂无数据', row_list[-1]):
                    s += '暂无 '
                # Parking space
                if re.search(r'车位:免费', row_list[-1]):
                    s += '免费 '
                elif re.search(r'车位:租用', row_list[-1]):
                    s += '租用 '
                elif re.search(r'车位:暂无数据', row_list[-1]):
                    s += '暂无 '
                # Water supply
                if re.search(r'用水:民水', row_list[-1]):
                    s += '民水 '
                elif re.search(r'用水:商水', row_list[-1]):
                    s += '商水 '
                elif re.search(r'用水:暂无数据', row_list[-1]):
                    s += '暂无 '
                # Electricity supply
                if re.search(r'用电:民电', row_list[-1]):
                    s += '民电 '
                elif re.search(r'用电:商电', row_list[-1]):
                    s += '商电 '
                elif re.search(r'用电:暂无数据', row_list[-1]):
                    s += '暂无 '
                # Gas
                if re.search(r'燃气:有', row_list[-1]):
                    s += '有 '
                elif re.search(r'燃气:无', row_list[-1]):
                    s += '无 '
                elif re.search(r'燃气:暂无数据', row_list[-1]):
                    s += '暂无 '
                # Heating
                if re.search(r'采暖:自采暖', row_list[-1]):
                    s += '自采暖'
                elif re.search(r'采暖:集中供暖', row_list[-1]):
                    s += '集中'
                elif re.search(r'采暖:暂无数据', row_list[-1]):
                    s += '暂无'
                row_list[-1] = s

            # Append the four columns to the output string
            out_str += row_list[0] + ' ' + row_list[1] + ' ' + row_list[2] + ' ' + row_list[3] + '\n'

            # Write the line to the output file
            out_file.write(out_str)
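
The six if/elif chains for elevator, parking, water, electricity, gas and heating all follow the same pattern, so they could also be driven by a lookup table. The sketch below is one possible refactoring, not the program's actual code: ATTRIBUTES and extract_attributes are names introduced here for illustration. It also differs from the program in one respect: when none of the known values is found, it emits a placeholder (assumed here to be NULL) so that the columns stay aligned, whereas the chains above append nothing in that case.

```python
# Label -> {raw value in the listing: value written to the output file}.
# The entries mirror the if/elif chains in the program above.
ATTRIBUTES = {
    '电梯': {'有': '有', '无': '无', '暂无数据': '暂无'},
    '车位': {'免费': '免费', '租用': '租用', '暂无数据': '暂无'},
    '用水': {'民水': '民水', '商水': '商水', '暂无数据': '暂无'},
    '用电': {'民电': '民电', '商电': '商电', '暂无数据': '暂无'},
    '燃气': {'有': '有', '无': '无', '暂无数据': '暂无'},
    '采暖': {'自采暖': '自采暖', '集中供暖': '集中', '暂无数据': '暂无'},
}


def extract_attributes(info, missing='NULL'):
    """Return the six attribute values in output order (hypothetical helper)."""
    values = []
    for label, mapping in ATTRIBUTES.items():
        for raw, out in mapping.items():
            if label + ':' + raw in info:
                values.append(out)
                break
        else:
            # Assumption: keep the column count fixed when no known value is present
            values.append(missing)
    return values


sample = '楼层:5/18层 电梯:暂无数据   车位:暂无数据 用水:暂无数据   用电:暂无数据 燃气:暂无数据   采暖:暂无数据'
print(' '.join(extract_attributes(sample)))
```

Adding a new attribute or a new allowed value then only means adding an entry to the table.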

Output

  In the screenshot, the original sample output file dea_data is on the left, and dea_data_output, the file produced by running the program above, is on the right.
(Screenshot: dea_data and dea_data_output side by side)
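
The same comparison can be done in code instead of by eye. The sketch below uses the standard difflib module; the helper names diff_lines and diff_files are made up for this example, and it assumes that both files are UTF-8 encoded, which the article does not state for the reference file.

```python
import difflib


def diff_lines(expected, actual):
    """Return unified-diff lines; an empty list means the inputs are identical."""
    return list(difflib.unified_diff(expected, actual,
                                     fromfile='dea_data', tofile='dea_data_output',
                                     lineterm=''))


def diff_files(expected_path='dea_data', actual_path='dea_data_output'):
    """Compare the reference output with the generated output line by line."""
    # Assumption: both files are UTF-8 encoded
    with open(expected_path, encoding='utf-8') as f_exp, \
         open(actual_path, encoding='utf-8') as f_act:
        return diff_lines(f_exp.read().splitlines(), f_act.read().splitlines())


# Small in-memory demonstration with one differing line
for diff_line in diff_lines(['Tue Mar 19 杭州', 'NULL'], ['Tue Mar 19 杭州', '暂无']):
    print(diff_line)
```

Calling diff_files() in the directory that holds both files returns an empty list when the generated output matches the reference exactly.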


Origin blog.csdn.net/qq_45554010/article/details/104908663