Python case study | Processing and analyzing rental website data tables


In this hands-on project, the rental listings of "Beijing Lianjia.com" were scraped in the previous article, producing the data table bj_lianJia.csv shown in Figure 1. The table contains the columns ID, district name (district), street name (street), community name (community), floor information (floor), whether there is an elevator (lift), area (area), house orientation (toward), house type (model), and rent (rent).


■ Figure 1 Part of the "Beijing Lianjia.com" rental data table

01. Case implementation

This section breaks down the tasks as follows.

(1) Duplicate rows: delete duplicate rows.

(2) Missing values: the lift and rent columns contain missing values, and a different handling method is used for each.

(3) Content format cleaning.

① Delete "m2" in the area column so that the column holds only numbers, which is convenient for subsequent data analysis.

② Delete the spaces between characters in the toward column, for example, "南 北" becomes "南北".

③ Convert the model column to the format "*室*厅*卫" (rooms, halls, bathrooms), for example, a value with 2 rooms and 1 bathroom but no hall count becomes "2室0厅1卫".

(4) Attribute reconstruction: separate the total number of floors from the floor column to form a new column named "totalfloor". For example, "中楼层/6层" yields the total floor count "6".

(5) Compute summary statistics for the rent column.

The implementation steps and code for these tasks are as follows.

(1) Import the libraries. The re library is Python's standard regular expression library and is mainly used for string matching. The code is as follows.

import pandas as pd
import re

(2) Read the data. Use the read_csv() function of the Pandas library to read the rental data set bj_lianJia.csv of "Beijing Lianjia.com". Here header=0 means that the first row of the file supplies the column names, and usecols selects columns 1 to 9 by position, which skips the "ID" column at position 0. The columns read are: floor (floor), whether there is an elevator (lift), district name (district), street name (street), community name (community), area (area), house orientation (toward), house type (model), and rent (rent). The code is as follows.

# Read columns 1-9 only; column 0 (ID) is skipped
df = pd.read_csv('bj_lianJia.csv', encoding='gbk', header=0,
                 usecols=[1, 2, 3, 4, 5, 6, 7, 8, 9])
print(df)

The output is:

            floor      lift   district     ...        model       
0         中楼层/6层     无       房山      ...     2室2厅1卫   
1         低楼层/17层    有       顺义      ...     3室1卫
2         中楼层/6层     无       大兴      ...     2室1厅1卫
...
4338      高楼层/28层    有       朝阳      ...     2室1厅1卫
4339      低楼层/2层     有       怀柔      ...     5室2厅5卫
4340      低楼层/4层     无       通州      ...     4室2厅3卫
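
To see the effect of header and usecols in isolation, here is a minimal sketch that reads a hypothetical two-row sample from memory instead of the real file. The sample text and the names sample_csv and sample are assumptions made for illustration only.

```python
import io
import pandas as pd

# Hypothetical sample; the real bj_lianJia.csv has more columns and rows.
sample_csv = io.StringIO(
    "ID,floor,lift,rent\n"
    "0,中楼层/6层,无,3500\n"
    "1,低楼层/17层,有,5200\n"
)

# header=0: the first row supplies the column names.
# usecols=[1, 2, 3]: keep the columns at positions 1..3 and skip position 0 (ID).
sample = pd.read_csv(sample_csv, header=0, usecols=[1, 2, 3])
print(sample.columns.tolist())  # ['floor', 'lift', 'rent']
```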

(3) Duplicate rows. First check for duplicate rows with the duplicated() method of the Pandas library, which marks every row that repeats an earlier row. If duplicates exist, remove them with the drop_duplicates() method. The code is as follows.

print('----检测有无重复行----')
print(len(df[df.duplicated()]))  # number of duplicate rows
df.drop_duplicates(inplace=True)  # modify df in place
print('----打印删除重复行后 df 的行数----')
print(len(df))

The output is:

----检测有无重复行----
15
----打印删除重复行后 df 的行数----
4326
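
The same two calls can be tried on a small frame. This is only a sketch on hypothetical data (the frame toy and its values are made up), showing that duplicated() flags the second and later copies of a row and that drop_duplicates() keeps the first one.

```python
import pandas as pd

# Hypothetical data: row 2 repeats row 0 exactly.
toy = pd.DataFrame({
    "community": ["A", "B", "A", "C"],
    "rent": [3500, 5200, 3500, 4100],
})

dup_mask = toy.duplicated()        # True for rows that repeat an earlier row
n_duplicates = int(dup_mask.sum())
deduped = toy.drop_duplicates()    # keeps the first occurrence of each row
print(n_duplicates, len(deduped))  # 1 3
```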

(4) Missing values. First count the number of missing values in each column. The code is as follows.

print('----未做缺失值处理之前----')
print(df.isnull().sum())  # missing-value count per column

The output is:

----未做缺失值处理之前----
floor       0
lift        8
district    0
street      0
community   0
area        0
toward      0
model       0
rent        4
dtype: int64

The lift column has 8 missing values and the rent column has 4. The two columns are handled differently: missing values in the lift column are filled with the fixed value "未知" (unknown), while rows with a missing value in the rent column are deleted. The code is as follows.

print('----将 lift 列的缺失值填充为"未知"---')
df['lift'] = df['lift'].fillna('未知')  # assign back rather than fill a column selection in place
print(df.isnull().sum())
print('----将 rent 有缺失值的行直接删除----')
df.dropna(subset=['rent'], inplace=True)
print(df.isnull().sum())
print(len(df))   # number of rows left after deleting rows with a missing rent

The output is:

----将 lift 列的缺失值填充为"未知"---
floor       0
lift        0
district    0
street      0
community   0
area        0
toward      0
model       0
rent        4
dtype: int64
----将 rent 有缺失值的行直接删除----
floor       0
lift        0
district    0
street      0
community   0
area        0
toward      0
model       0
rent        0
dtype: int64
4322
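
The two strategies can be compared side by side on a small frame. The sketch below uses hypothetical values (the frame toy is made up): one missing lift value is filled, and one row with a missing rent is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
toy = pd.DataFrame({
    "lift": ["有", None, "无", "有"],
    "rent": [3500.0, 5200.0, np.nan, 4100.0],
})

toy["lift"] = toy["lift"].fillna("未知")  # fill: the row is kept
toy = toy.dropna(subset=["rent"])         # delete: the row is removed
missing_after = toy.isnull().sum()
print(int(missing_after.sum()))  # 0
print(len(toy))                  # 3
```

Filling keeps the row count unchanged, which suits a descriptive column such as lift; deleting is the stricter choice and is used for rent because a listing without a rent value cannot take part in the rent statistics.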

After rows with missing values are deleted, the index of the DataFrame is no longer contiguous. It can be renumbered with the reset_index() method; passing drop=True discards the old index instead of keeping it as a new column. The code is as follows.

df = df.reset_index(drop=True)
print(df)

The output is:

            floor      lift   district     ...        model       
0         中楼层/6层     无       房山      ...     2室2厅1卫   
1         低楼层/17层    有       顺义      ...     3室1卫
2         中楼层/6层     无       大兴      ...     2室1厅1卫
...
4319      中楼层/8层     有       朝阳      ...     3室1厅1卫
4320      高楼层/28层    有       朝阳      ...     2室2厅1卫
4321      低楼层/2层     无       怀柔      ...     5室2厅5卫
[4322 rows x 9 columns]
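
A minimal sketch of the index gap and its repair, on a hypothetical three-row frame (toy is an illustrative name):

```python
import pandas as pd

toy = pd.DataFrame({"rent": [3500.0, None, 4100.0]})
toy = toy.dropna(subset=["rent"])
print(toy.index.tolist())  # [0, 2]  (label 1 is gone)

toy = toy.reset_index(drop=True)  # renumber from 0, discard the old labels
print(toy.index.tolist())  # [0, 1]
```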

(5) Content format cleaning.

① Delete "m2" in the area column. First, use the findall() function of the regular expression library re to extract the number from each value in the area column. The resulting area list no longer contains "m2"; the numbers are then written back to the data table. The values are still strings at this point, as the dtype in the output shows. The code is as follows.

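# Extract the number (integer or decimal) from each area value; findall returns one list per row
area = [re.findall(r'\d+(?:\.\d+)?', a) for a in df['area'].values.tolist()]
# Flatten the per-row lists back into a single column
df['area'] = [i for j in range(len(area)) for i in area[j]]
print(df.loc[:5, 'area'])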

The output is:

0     85.00
1     107.00
2     72
3     71.13
4     54.41
5     132.00
Name: area, dtype: object
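
If the column is needed as real numbers rather than strings, one possible alternative is the vectorized str.extract() followed by astype(float). This is a sketch on hypothetical values, not the method used in this article; the unit symbol "㎡" and the column name area_num are assumptions.

```python
import pandas as pd

# Hypothetical values; the unit suffix is assumed to be "㎡".
toy = pd.DataFrame({"area": ["85.00㎡", "72㎡", "71.13㎡"]})

# Capture the leading number (integer or decimal) and convert it to float.
toy["area_num"] = (
    toy["area"].str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
)
print(toy["area_num"].tolist())  # [85.0, 72.0, 71.13]
```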

② Delete the spaces between the characters in the toward column. The str.replace() method of the Series object is used here. Its syntax is Series.str.replace(pat, repl), where pat is the string or pattern to be replaced and repl is the new string. In the following code, df['toward'] is a Series. The pattern is the regular expression '\s+', which matches one or more whitespace characters, and the replacement is an empty string, so every run of spaces is deleted. In pandas 2.0 and later, regex=True must be passed for pat to be treated as a regular expression. The code is as follows.

df['toward'] = df['toward'].str.replace(r'\s+', '', regex=True)
print(df.loc[:5, 'toward'])

The output is:

0  南北
1  南
2  南北
3  东
4  东
5  南北
Name: toward, dtype: object
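
The behavior of the pattern can be checked on a few hypothetical values (the Series toward below is made up), including one with a run of two spaces:

```python
import pandas as pd

toward = pd.Series(["南 北", "南", "东  南"])

# r"\s+" matches one or more whitespace characters; each run is replaced by "".
cleaned = toward.str.replace(r"\s+", "", regex=True)
print(cleaned.tolist())  # ['南北', '南', '东南']
```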

③ Convert the content format of the model column to "*室*厅*卫". In the original data table the house type is not written uniformly: some values give rooms, halls and bathrooms (for example "3室2厅1卫"), some give only rooms and bathrooms (for example "3室1卫"), and a few are "未知室1厅1卫". They are unified to "*室*厅*卫" by these rules: room counts are written with 室, a hall count that is not given is written as 0厅, and an unknown room count (未知) is written as 0室. The code is as follows.

print("----首先将 model 列中'未知'替换为'0'----")
dff = df[df['model'].str.contains('未知') == True]  # rows whose model contains '未知'
print('替换前:\n', dff)
df.loc[dff.index, 'model'] = dff['model'].str.replace('未知', '0')
print('替换后:\n', df.loc[dff.index])
print("----然后将 model 列统一为*室*厅*卫----")
# Extract all numbers from each model value, e.g. ['2', '1', '1'] or ['3', '1']
model_n = [re.findall(r'\d+', m) for m in df['model'].values.tolist()]
new_model = list()
for m in model_n:
    if len(m) == 3:    # rooms, halls, bathrooms
        new_model.append(m[0] + '室' + m[1] + '厅' + m[2] + '卫')
    elif len(m) == 2:  # rooms, bathrooms; the hall count is missing
        new_model.append(m[0] + '室' + '0厅' + m[1] + '卫')
df['model'] = new_model
print(df.loc[:5, 'model'])

The output is:

----首先将 model 列中'未知'替换为'0'----
替换前:
             floor      lift   district   ...     model         rent  
3964      低楼层/25层    有       海淀     ...   未知室1厅1卫   38000.0
[1 rows x 9 columns]
替换后:
             floor      lift   district   ...     model         rent  
3964      低楼层/25层    有       海淀     ...   0室1厅1卫       38000.0
[1 rows x 9 columns]
----然后将 model 列统一为*室*厅*卫----
0    2室1厅1卫
1    3室0厅1卫
3    3室0厅2卫
4    2室1厅1卫
5    3室2厅2卫
Name: model, dtype: object
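
The conversion rules can also be gathered into one small function, which makes them easy to test on single values. This is a sketch only: the helper name normalize_model is hypothetical, and returning the text unchanged for unexpected formats is an assumption, not something this article specifies.

```python
import re

def normalize_model(text):
    """Return text in the form '<n>室<n>厅<n>卫' (hypothetical helper)."""
    # An unknown room count becomes 0, then all numbers are pulled out in order.
    nums = re.findall(r"\d+", text.replace("未知", "0"))
    if len(nums) == 3:        # rooms, halls, bathrooms
        rooms, halls, baths = nums
    elif len(nums) == 2:      # rooms, bathrooms; the hall count is missing
        rooms, baths = nums
        halls = "0"
    else:
        return text           # assumption: leave unexpected formats unchanged
    return f"{rooms}室{halls}厅{baths}卫"

print(normalize_model("3室2厅1卫"))     # 3室2厅1卫
print(normalize_model("3室1卫"))        # 3室0厅1卫
print(normalize_model("未知室1厅1卫"))  # 0室1厅1卫
```

With pandas, such a function could be applied to the whole column with df['model'].map(normalize_model).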

(6) Attribute reconstruction: separate the total number of floors from the floor column to form a new column. The string method str.split() is used here; it splits each string on the given delimiter and, with expand=True, returns the pieces as the columns of a DataFrame. For example, "中楼层/6层" is split on "/" into "中楼层" and "6层". The first piece is written back to df['floor']. The number in the second piece is extracted with str.slice(), which drops the trailing "层", and is written to df['totalfloor']. The code is as follows.

dff = df['floor'].str.split('/', expand=True)  # column 0: floor level, column 1: total floors
df['floor'] = dff[0]
df['totalfloor'] = dff[1].str.slice(0, -1, 1)  # drop the trailing '层'
print(df.loc[:5, ['floor', 'totalfloor']])

 The output is:

     floor     totalfloor
0    中楼层         6
1    低楼层         17
2    中楼层         6
3    中楼层         8
4    中楼层         4
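
Slicing off the last character leaves totalfloor as a string. If an integer column is preferred, one possible variant, sketched here on hypothetical values, extracts the digits with str.extract() and converts them with astype(int).

```python
import pandas as pd

toy = pd.DataFrame({"floor": ["中楼层/6层", "低楼层/17层"]})

parts = toy["floor"].str.split("/", expand=True)  # two columns: level, total
toy["floor"] = parts[0]
# Keep only the digits of the second piece and store them as integers.
toy["totalfloor"] = parts[1].str.extract(r"(\d+)", expand=False).astype(int)
print(toy["floor"].tolist())       # ['中楼层', '低楼层']
print(toy["totalfloor"].tolist())  # [6, 17]
```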
 

(7) Statistically analyze the rent column data.
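
One possible way to carry out this step is sketched below on a hypothetical four-row frame. The choice of describe() and of a per-district mean is an assumption for illustration; the frame toy and its values are made up.

```python
import pandas as pd

# Hypothetical data standing in for the cleaned table.
toy = pd.DataFrame({
    "district": ["朝阳", "朝阳", "海淀", "通州"],
    "rent": [6000.0, 8000.0, 7000.0, 3000.0],
})

summary = toy["rent"].describe()  # count, mean, std, min, quartiles, max
mean_by_district = toy.groupby("district")["rent"].mean()
print(summary["mean"])           # 6000.0
print(mean_by_district["朝阳"])  # 7000.0
```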

(8) Save the processed data. The code is as follows.

# Write the cleaned table; the index is saved as a column named 'ID'
df.to_csv('newbj_lianJia.csv', encoding='gbk', index_label='ID')
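
The effect of index_label can be seen without writing a file: when no path is given, to_csv() returns the CSV text. The frame below is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"floor": ["中楼层", "低楼层"], "rent": [3500.0, 5200.0]})

# With no path, to_csv returns a string; index_label names the index column.
csv_text = toy.to_csv(index_label="ID")
print(csv_text.splitlines()[0])  # ID,floor,rent
```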


Origin blog.csdn.net/qq_41640218/article/details/131888307