Summary of Data Projects - Rental Housing Data Analysis (Complete)

Datawhale practical insights

Author: Pi Qianchao, Xiamen University, Datawhale member

Complete Analysis of Shenzhen Rental Housing Data

From the first article on Shenzhen rental data analysis, published in November 2020, to this modeling and prediction piece built on the deep learning framework Keras, here are the characteristics of the three articles:

1. The first article:

Written in November 2020. At the time the author, a data analyst, had learned Python, SQL, web scraping, visualization, and some common machine learning algorithms and models, so the focus of the first article was statistical and visual analysis. Readers of that article will know it contains many attractive visualization charts (some are shown below).

A picture is worth a thousand words. Statistics and visualization charts let you see data distributions and trends quickly and intuitively. The visualization library used in the article is Plotly, an excellent interactive visualization library that is well worth learning~

[Figure: sample visualization charts from the first article]

Article address:

https://mp.weixin.qq.com/s/DEsclUfdnmVqICiK5rM57Q

2. The second article:

Written in March 2022; the author is still a data analyst. From the end of 2020 to the beginning of 2022, roughly a year, the author picked up more topics such as machine learning algorithms, feature engineering, and model interpretability.

In that article, the author spent considerable effort on preprocessing and feature engineering of the 10 fields, focusing on how to encode them for input into different regression models and on comparing the various models.

Finally, the author explores model interpretability, mainly with the popular interpretability library SHAP.

SHAP treats every feature as a "contributor": for each sample the model produces a predicted value, and the SHAP value is the contribution attributed to each feature of that sample.
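As a hedged illustration of how SHAP is typically used, here is a generic sketch with a fitted tree-based regressor (reg and X are hypothetical names, not necessarily the second article's exact setup):

import shap

# reg: a fitted tree-based regressor, X: its feature matrix (both hypothetical)
explainer = shap.TreeExplainer(reg)
shap_values = explainer.shap_values(X)   # per-sample, per-feature contributions
shap.summary_plot(shap_values, X)        # global overview of feature impact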

For studying feature engineering, the author recommends the book "Introduction and Practice of Feature Engineering".

[Figure: illustration from the second article]

Article address:

https://mp.weixin.qq.com/s/iO47yo6IgYgw6xZ8W-8lbQ

3. The third article (this article)

While writing this third article (the one you are reading), the author is still a data analyst. This year the learning focus has shifted to deep learning and Kaggle competitions. The author has recently studied some DL fundamentals and the Keras framework for modeling classification and regression problems, and this article walks through the basic workflow: building the network, compiling it, and training it.

After studying the DL modeling process in more depth, the next step will be to optimize the existing model!

[Figure: illustration from the third article]

A final note: these three articles only represent part of the author's learning experience and knowledge summary. If you see a better way to handle any of the details, or anything that seems off, please point it out so we can discuss it together~

Reply "Shenzhen" (深圳) in the backstage of the Datawhale official account to download the rental data.

This article is the third analysis of the Shenzhen rental data, covering data preprocessing, sampling, Keras-based modeling, and more:

[Figure: outline of this article]

import libraries

import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
# plt.style.use("fivethirtyeight")
plt.style.use('ggplot')

import sklearn.preprocessing as pre_processing
from imblearn.over_sampling import SMOTE   
from sklearn.preprocessing import StandardScaler,MinMaxScaler

import tensorflow as tf
from keras import models
from keras import layers

np.random.seed(123)

pd.options.mode.chained_assignment = None

Basic data information

read data

df = pd.read_excel("leyoujia.xls")
df.head()
[Output: first five rows of the data]

data shape

In [3]:

# data shape

df.shape

Out[3]:

(2020, 12)

df.shape returns a tuple: the first value is the number of rows, the second the number of columns, i.e., the number of fields.

Field types

In [4]:

# field types of the data

df.dtypes

Out[4]:

Most fields are strings; only money, the value to be predicted, is numeric:

name           object
layout         object
location       object
size           object
sizeInside     object
zhuangxiu      object
numberFloor    object
time           object
zone           object
position       object
money           int64
way            object
dtype: object

In [5]:

# missing values in the data

df.isnull().sum()

Out[5]:

name           0
layout         0
location       0
size           0
sizeInside     0
zhuangxiu      0
numberFloor    0
time           6  # missing values
zone           0
position       0
money          0
way            0
dtype: int64

Missing value handling

find missing values

The time field has six missing values; locate the rows that contain them:

[Output: the six rows with missing time values]
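A minimal sketch of that lookup; df0 is the name reused by the fill step below, so it is defined here (assuming a simple isnull filter, matching the screenshot):

# rows whose time field is missing; df0's index drives the fill below
df0 = df[df["time"].isnull()]
df0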

fill missing values

There are several ways to fill in missing values:

  • Fill in a specific value

  • Fill in a statistic of the existing data, such as the mean or mode

  • Fill in the value of the previous or next row, etc.
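For reference, one-line sketches of those generic options in pandas (illustrative only; none of them is what this article ultimately does):

df["time"].fillna("2020年")               # a specific value
df["time"].fillna(df["time"].mode()[0])   # a statistic of the existing data
df["time"].ffill()                        # the previous row's value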

Instead, this article looks up the actual completion year of each community online and fills it in directly:

In [7]:

# years found online: 2019 2003 2004 2019 2019 2020

times = ["2019年", "2003年", "2004年", "2019年", "2019年", "2020年"]

# fill the missing values via the corresponding index positions (column 7 is time)
for i in range(len(df0)):
    df.iloc[df0.index.tolist()[i], 7] = times[i]

In [8]:

df.isnull().sum()

Out[8]:

After filling, there are no missing values left:

name           0
layout         0
location       0
size           0
sizeInside     0
zhuangxiu      0
numberFloor    0
time           0
zone           0
position       0
money          0
way            0
dtype: int64

preprocessing

The following preprocesses each field in turn, including substantial feature engineering work.

name

The community name is not helpful for modeling; delete it directly:

In [9]:

df.drop("name",axis=1,inplace=True)  # inplace=True表示原地修改

layout

The layout splits into three attributes: rooms (室), halls (厅), and bathrooms (卫). Rows where the layout is "商铺" (shop) are dropped:

In [10]:

df[df["layout"] == "商铺"]
[Output: rows where layout is 商铺]

Extract the number of rooms, halls, and bathrooms with the Pandas extract function:

df1 = df["layout"].str.extract(r'(?P<shi>\d)室(?P<ting>\d)厅(?P<wei>\d)卫')
df1.head()
[Output: extracted shi/ting/wei columns]
# merge back into the original data
df = pd.concat([df1,df],axis=1)
# drop the original layout column in place
df.drop("layout",axis=1,inplace=True)
[Output: data after merging and dropping layout]

Drop rows with nulls in the three new fields (this removes the shop listings, whose layout did not match the pattern):

# drop rows with nulls in the three fields
df.dropna(subset=["shi","ting","wei"],inplace=True)
df
[Output: data after dropping nulls]

location

Count the number of listings for each orientation:

In [14]:

df["location"].value_counts()

Out[14]:

朝南     552
朝南北    284
朝东南    241
朝北     241
朝西南    174
朝西北    142
朝东北    140
朝东     132
朝西      92
朝东西      2
Name: location, dtype: int64

In [15]:

The distribution of rent across different orientations:

fig = px.violin(df,y="money",color="location")

fig.show()
[Figure: violin plot of money by location]

By common sense, units facing north and south are pricier than those facing east and west. Here we take the maximum rent for each orientation:

# maximum rent under each orientation:
price = (df.groupby("location")["money"].max()
         .reset_index()
         .sort_values("money")
         .reset_index(drop=True))

price
[Output: maximum rent by orientation, ascending]

Note: the orientations are then encoded by their rank in this sorted table, from the cheapest (朝东西, east-west) up to the most expensive (朝南北, north-south).

This differs from the second analysis, which used a custom encoding order based on the violin plot of the price distribution:

# custom order from the second article: location = ["朝东西","朝东北","朝西","朝西北","朝东","朝西南","朝东南","朝南","朝北","朝南北"]

location = price["location"].tolist()
location_dict = {}

for n, i in enumerate(location):
    location_dict[i] = n+1  # codes start at 1
    
df["location"] = df["location"].map(location_dict)
df.head()
[Output: data after encoding location]

size and sizeInside

Processing the construction area (size) and inner area (sizeInside): extract the digits and decimal point from the raw strings. Two methods are shown:

In [18]:

df.dtypes

Out[18]:

shi            object
ting           object
wei            object
location        int64
size           object
sizeInside     object
zhuangxiu      object
numberFloor    object
time           object
zone           object
position       object
money           int64
way            object
dtype: object

In [19]:

# 1. extract by splitting the string
df["size"] = df["size"].apply(lambda x: x.split("面积")[1].split("㎡")[0])
df.head()
[Output: extracted size values]
# 2. extract with a regular expression
df["sizeInside"] = df["sizeInside"].str.extract(r'面积(?P<sizeInside>[\d.]+)')
df.head()
[Output: extracted sizeInside values]

zhuangxiu

The decoration grade (zhuangxiu) is encoded with a custom hard-coded mapping.

In [21]:

df["zhuangxiu"].value_counts()

Out[21]:

精装    1172
普装     747
豪装      62
毛坯      19
Name: zhuangxiu, dtype: int64

The subjective intuition: 毛坯 (bare shell) is the lowest grade and 豪装 (luxury) the highest, so a custom hard-coded mapping is used directly:

In [22]:

# hard-coded mapping
zhuangxiu = {"毛坯":1,"普装":2, "精装":3, "豪装":4}
zhuangxiu

Out[22]:

{'毛坯': 1, '普装': 2, '精装': 3, '豪装': 4}

In [23]:

df["zhuangxiu"] = df["zhuangxiu"].map(zhuangxiu)

numberFloor

Floor level also has a strong influence on price. The following analyzes the relationship between low, middle, and high floors and money:

In [24]:

# extract the floor level (low/middle/high)

df["numberFloor"] = df["numberFloor"].apply(lambda x: x.split("(")[0])
df.head()
[Output: numberFloor reduced to 低楼层/中楼层/高楼层]
# relationship between floor level and money
fig = px.violin(df,y="money",color="numberFloor")

fig.show()
[Figure: violin plot of money by floor level]

One-hot encode the floor levels with the get_dummies function:

[Output: the three one-hot dummy columns]
# one-hot encode the low/middle/high floor levels

df = (df.join(pd.get_dummies(df["numberFloor"]))
    .rename(columns={"中楼层":"middleFloor",
                    "低楼层":"lowFloor",
                    "高楼层":"highFloor"}))

df.drop("numberFloor", axis=1, inplace=True)

df.head()
[Output: data after one-hot encoding]

time

Completion time of houses in the community:

In [30]:

df["time"].value_counts()

# partial results
2003年建成    133  # count
2005年建成    120
2006年建成    114
2004年建成    111
2010年建成    104
2007年建成    101
2016年建成     94
2008年建成     92
2002年建成     79
2015年建成     78

Extract the specific year from the raw strings:

[Output: year extracted from time]
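The extraction itself is shown in the screenshot; a minimal sketch, assuming a simple regex over strings like 2003年建成:

# keep only the 4-digit year from strings like "2003年建成"
df["time"] = df["time"].str.extract(r"(\d{4})", expand=False)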

Convert it to a numeric value and compute the interval from 2022:

# convert time to numeric
df["time"] = df["time"].astype("float")
# interval between the completion year and the current year
df["time"] = 2022 - df["time"]
df.head()
[Output: time as years since completion]

zone+position

The impact of the administrative district (zone) and specific location (position) on price:

In [33]:

df["zone"].value_counts()

Out[33]:

龙岗      548
福田      532
龙华      293
南山      218
宝安      173
罗湖      167
光明       32
坪山       31
盐田        5
大鹏新区      1
Name: zone, dtype: int64

In [34]:

fig = px.violin(df,y="money",color="zone")

fig.show()
[Figure: violin plot of money by zone]

Merge zone and position, compute the mean rent of each combined location, and encode by the rank of that mean:

df["zone_position"] = df["zone"] + "_" + df["position"]

zone_position_mean = (df.groupby("zone_position")["money"].mean()
                      .reset_index()
                      .sort_values("money",ascending=True,ignore_index=True)) # 升序排列

zone_position_mean
[Output: mean rent by zone_position, ascending]
zone_position = zone_position_mean["zone_position"].tolist()

zone_position_dict = {}

for n, i in enumerate(zone_position):
    zone_position_dict[i] = n+1     
df["zone_position"] = df["zone_position"].map(zone_position_dict)
df.drop(["zone","position"],axis=1,inplace=True)  # 原地删除

df.head()
[Output: data after encoding zone_position]

way

The rental mode; only two categories are kept here: whole rent (整租) and shared rent (合租).

In [39]:

fig = px.violin(df,y="money",color="way")

fig.show()

The impact of the rental terms on price; clearly, deposit-one-pay-one and deposit-two-pay-one style terms are the most common.

[Figure: violin plot of money by way]

Extract whole rent vs. shared rent and encode the two types:

# whole rent vs shared rent
df["way"] = df["way"].apply(lambda x: x.split(" ")[0])
df["way"] = df["way"].map({"整租":1,"合租":0})

df["way"].value_counts()  # 1 = whole rent, 0 = shared rent
[Output: counts of the two rental types]

Most listings are whole rent and very few are shared; the two classes are imbalanced, so sampling is applied later.

Sampling processing

As noted above, the whole-rent and shared-rent samples are extremely imbalanced. Upsampling is used here to increase the number of shared-rent samples until the two classes are equal:

Before sampling:

[Output: sampling code and class counts before sampling]

After sampling:

[Output: class counts after sampling]
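The resampling code appears only in the screenshots above; here is a minimal sketch with imblearn's SMOTE (imported at the top), assuming way is the class label and the remaining fields are numeric-convertible at this point. The variable name smoted_df matches the one used below:

# upsample the minority class (shared rent) with synthetic samples
# interpolated between existing neighbors
smote = SMOTE(random_state=123)
X_res, y_res = smote.fit_resample(df.drop("way", axis=1), df["way"])

smoted_df = pd.concat([X_res, y_res], axis=1)
smoted_df["way"].value_counts()  # both classes now equal in size

Because SMOTE interpolates between samples, integer-like fields come out fractional, which is why the type-conversion step below rounds them.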

type conversion

The fields above have been preprocessed and encoded; some field types still need converting:

[Output: field types after preprocessing]
col1 = ["shi","ting","wei","time"]
for i in col1:
    smoted_df[i] = smoted_df[i].apply(lambda x: round(x))

col2 = ["size", "sizeInside"]
for i in col2:
    smoted_df[i] = smoted_df[i].astype(float)

All become numeric:

[Output: all fields now numeric]

modeling

The following models the smoted_df data:

Features and Labels

X = smoted_df.drop("money",axis=1)  # features
y = smoted_df["money"]  # labels

Split data

from sklearn.model_selection import train_test_split

# 80/20 split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=44)

data standardization

mean = X_train.mean(axis=0)
X_train -= mean  
std = X_train.std(axis=0)
X_train /= std

# test set: normalize with the training set's mean and standard deviation
X_test -= mean 
X_test /= std
[Output: standardized training data]

Neural networks generally work better with small input values, so the dependent variable is likewise converted to units of ten thousand (万) yuan:

In [58]:

y_train = y_train / 10000
y_test = y_test / 10000

y_train[:5]

Out[58]:

3201    0.6000
3013    0.2154
597     0.5000
1524    0.4300
3354    0.1737
Name: money, dtype: float64

build network

The training set has 3000+ samples, so a very small network is used: just two hidden layers with 64 units each.

The last layer has a single unit and no activation function: a pure linear layer, as is standard for scalar regression.

In [59]:

model = models.Sequential()
model.add(tf.keras.layers.Dense(64,
                                activation="relu",
                                input_shape=(X_train.shape[1],)))

model.add(tf.keras.layers.Dense(64,
                                activation="relu"))

model.add(tf.keras.layers.Dense(1))

compile network

In this model the loss function is mse (mean squared error), the square of the difference between the predicted value and the actual target value.

The monitored metric is mae (mean absolute error), the absolute value of the difference between the predicted value and the actual target value.
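For reference, a small numpy sketch of the two metrics (y_true and y_pred are hypothetical arrays, not variables from this notebook):

y_true = np.array([0.60, 0.21, 0.50])   # hypothetical targets, in 万 yuan
y_pred = np.array([0.55, 0.25, 0.48])   # hypothetical predictions

mse = np.mean((y_true - y_pred) ** 2)   # what the "mse" loss computes
mae = np.mean(np.abs(y_true - y_pred))  # what the "mae" metric computes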

In [60]:

model.compile(optimizer="rmsprop",  # optimizer
              loss="mse",  # loss function
              metrics=["mae"]  # metric: mean absolute error
             )

network architecture

View the basic architecture of the whole network:

In [61]:

[Output: network architecture summary]
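The summary in the screenshot is presumably produced by:

model.summary()  # layer-by-layer output shapes and parameter counts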

training network

Introduce 5-fold cross-validation: carve a validation set out of the training data:

In [62]:

k = 5
number_val = len(X_train) // k  # size of each validation fold
number_epochs = 20
all_mae_scores = []
all_loss_scores = []

for i in range(k):
    # fold i serves as the validation set
    vali_X = X_train[i * number_val: (i+1) * number_val]
    vali_y = y_train[i * number_val: (i+1) * number_val]
    
    # training set: everything outside fold i
    part_X_train = np.concatenate([X_train[:i * number_val],
                                  X_train[(i+1) * number_val:]],
                                  axis=0
                                 )
    part_y_train = np.concatenate([y_train[:i * number_val],
                                  y_train[(i+1) * number_val:]],
                                  axis=0
                                 )
    
    # train the model
    history = model.fit(part_X_train,
                        part_y_train,
                        epochs=number_epochs,
                        # pass in the validation data
                        validation_data=(vali_X, vali_y),
                        batch_size=300,
                        verbose=0  # 0 = silent, 1 = progress log
                       )
    
    mae_history = history.history["mae"]
    loss_history = history.history["loss"]
    all_mae_scores.append(mae_history)
    all_loss_scores.append(loss_history)

network metrics

# per-epoch average across the k folds
average_mae = [np.mean([x[i] for x in all_mae_scores]) for i in range(number_epochs)]
average_loss = [np.mean([x[i] for x in all_loss_scores]) for i in range(number_epochs)]
# mean for each of the 20 epochs
average_mae 

# results
[0.14793895035982133,
 0.12548727840185164,
 0.1141199067234993,
 0.1111918956041336,
 0.10730082243680954,
 0.10863531827926635,
 0.10383812189102173,
 0.10521284639835357,
 0.10574782490730286,
 0.10005746781826019,
 0.10514769405126571,
 0.10096234679222107,
 0.10278342366218567,
 0.0960465505719185,
 0.10629244297742843,
 0.09704757779836655,
 0.09838753342628478,
 0.10160793513059616,
 0.0998133972287178,
 0.09991184771060943]

The overall mean of loss and mae is as follows:

# overall means
print("mae的均值:",np.mean(average_mae))
print("loss的均值:",np.mean(average_loss))

mae的均值: 0.1068765591084957
loss的均值: 0.058070002682507026

mae is about 0.1; since the unit is ten thousand yuan, the predicted value differs from the actual value by roughly 1,000 yuan.

model evaluation

Evaluate the model with the evaluate function on the test data:

In [65]:

model.evaluate(X_test, y_test)
25/25 [==============================] - 0s 3ms/step - loss: 0.0696 - mae: 0.1295

Out[65]:

[0.06960742175579071, 0.12945905327796936]

The loss is 0.0696 and the mae about 0.1295, meaning the predicted value differs from the actual value by about 0.1295 万 yuan, roughly 1,295 yuan.

loss-mae visualization

# plot loss and mae
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = average_loss
mae_values = average_mae

epochs = range(1,len(loss_values) + 1)

plt.plot(epochs,  # epoch numbers
         loss_values,  # loss values
         "r",  
         label="loss"
        )

plt.plot(epochs,
         mae_values,
         "b",
         label="mae"
        )

plt.title("Loss and Mae")
plt.xlabel("Epochs")
plt.legend()

plt.show()
[Figure: loss and MAE curves]

Introduce regularization

The main options are: adding an L1 or L2 regularization term, adding a Dropout layer, and applying an early stopping strategy.

Here the L2 regularization term is introduced:

[Screenshot: network rebuilt with an L2 regularization term, plus the new cross-validation averages]
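A minimal sketch of what the screenshot likely contains: the same network with an L2 penalty on each hidden layer's weights (the coefficient 0.001 is an assumed value, not taken from the screenshot):

# rebuild the network with L2 weight regularization on the hidden layers
model = models.Sequential()
model.add(tf.keras.layers.Dense(64,
                                activation="relu",
                                kernel_regularizer=tf.keras.regularizers.l2(0.001),
                                input_shape=(X_train.shape[1],)))
model.add(tf.keras.layers.Dense(64,
                                activation="relu",
                                kernel_regularizer=tf.keras.regularizers.l2(0.001)))
model.add(tf.keras.layers.Dense(1))

model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])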
mae的均值: 0.11811128027737142
loss的均值: 0.07401402860879899

The newly generated loss-mae visualization:

# plot loss and mae
import matplotlib.pyplot as plt

history_dict = history.history
loss_values = average_loss
mae_values = average_mae

epochs = range(1,len(loss_values) + 1)

# training curves
plt.plot(epochs,  # epoch numbers
         loss_values,  
         "r",  # red
         label="loss"
        )

plt.plot(epochs,
         mae_values,
         "b",
         label="mae"
        )

plt.title("Loss and Mae")
plt.xlabel("Epochs")
plt.legend()

plt.show()
[Figure: loss and MAE curves after regularization]

re-evaluation

In [69]:

model.evaluate(X_test, y_test)
25/25 [==============================] - 0s 3ms/step - loss: 0.0634 - mae: 0.1101

Out[69]:

[0.06338492780923843, 0.110066257417202]

# before regularization: [0.06960742175579071, 0.12945905327796936]

After introducing the regularization term, the model improves: both loss and mae drop. mae becomes about 0.11, so the gap between predicted and true values is about 1,100 yuan.

summary

This article starts from rental data scraped from the web and works through basic data exploration, missing-value handling, feature engineering, class-imbalance sampling, and Keras-based deep learning model building and optimization, completing a prediction of rental prices with the final error held to about 1,100 yuan.

After adding the regularization term, both the loss and the mae of the model improved~

Appendix

Before closing out the three-article summary, here is the crawler code; you only need to modify the request header to run it:

1. Crawler code

The source code of the rental-data crawler is provided below in two different versions:

import pandas as pd
import numpy as np

import json
from lxml import etree
import requests
import xlwt
import re
import time

1. Crawler code based on xpath (crawls 100 pages of the site)

The author recently debugged it again; it runs directly.

[Screenshot: full xpath-based crawler source]
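That screenshot holds the author's full script. As a rough sketch of the approach only (the selectors below are illustrative assumptions, not the site's verified structure):

# illustrative sketch of an xpath crawl over 100 listing pages
headers = {"User-Agent": "your own User-Agent"}  # replace with a real UA

rows = []
for page in range(1, 101):
    url = f"https://shenzhen.leyoujia.com/zf/?n={page}"
    html = requests.get(url, headers=headers).content.decode("utf-8", "ignore")
    tree = etree.HTML(html)
    # assumed selector for one listing card; the real class names may differ
    for card in tree.xpath('//div[contains(@class, "item")]'):
        rows.append({"text": card.xpath("string(.)").strip()})
    time.sleep(1)  # throttle requests

df = pd.DataFrame(rows)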

2. Crawling based on regex parsing (single-page crawl and parse)

More recently, the author has parsed the fields with regular expressions:

import pandas as pd
import numpy as np
import requests
import re

url = "https://shenzhen.leyoujia.com/zf/?n=1"
headers = {"User-Agent": "your own User-Agent"}  # replace with your own request header

# fetch the page and decode the response for parsing
response = requests.get(url=url, headers=headers).content.decode('utf-8', 'ignore')

parse different fields

[Screenshots: regex extraction of individual fields]
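As a sketch of the idea only (the pattern below is an illustrative assumption about the page's HTML, not the author's exact expression), one field might be pulled out like this:

# illustrative only: extract candidate listing titles from the decoded HTML
names = re.findall(r'title="(.*?)"', response)
names[:5]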

Regular parsing expressions for other fields:

[Screenshots: regex patterns for the remaining fields]

Putting this together wasn't easy; if you found it useful, a like, share, and follow are appreciated~

Original article: blog.csdn.net/Datawhale/article/details/125093131