Project description
This project uses second-hand housing listings for Lanzhou from Fangtianxia. The data is cleaned and processed, each dimension is analyzed, and the feature variables with a significant effect on housing prices are identified. The analysis covers the overall state of Lanzhou's second-hand housing market, its price levels, and the factors that influence price, and ends with a housing price prediction model.
Questions
- Explore the relationship between unit price, quantity, total price and administrative area
- Explore the relationship between other factors and total price
- House type distribution
- Analyze the age of the building
- Use machine learning models to build regression analysis models for price prediction
Data understanding
Import module
import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
Retrieve data
# Load the data
df = pd.read_csv('./data/house.csv', encoding='gbk')
Data analysis and processing
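## 1) Overall shape of the data
print(f'Number of samples: {df.shape[0]}')
## 2) Check for duplicate rows
df.duplicated().sum()
## 3) Check for missing values
df.isnull().sum()
## 4) Inspect the data types
df.dtypes
## 5) Unique label values
print(df['朝向'].unique())      # orientation
print(df['楼层'].unique())      # floor level
print(df['装修'].unique())      # decoration
print(df['产权性质'].unique())  # property-rights type
print(df['住宅类别'].unique())  # housing type
print(df['建筑结构'].unique())  # building structure
print(df['建筑类别'].unique())  # construction type
print(df['区域'].unique())      # district
print(df['建筑年代'].unique())  # construction year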
Preliminary exploratory results:
- Duplicates and missing values need to be handled
- Building area, construction year, and unit price need conversion (strip the unit suffix)
- Floor and district labels need to be consolidated
Data cleaning
Data format conversion
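# Data format conversion
# '暂无' means "not available"; treat it as missing
df.replace('暂无', np.nan, inplace=True)
# Strip the unit suffixes ('平米' = square meters, '元/平米' = yuan per square meter)
df['建筑面积'] = df['建筑面积'].map(lambda x: x.replace('平米', '')).astype('float')
df['单价'] = df['单价'].map(lambda x: x.replace('元/平米', '')).astype('float')
def process_year(year):
    # Keep only the four-digit year; leave missing values untouched
    if pd.notna(year):
        year = str(year)[:4]
    return year
df['建筑年代'] = df['建筑年代'].map(process_year)
# Merge the two naming schemes for floor level into low / middle / high
floor = {'低楼层': '低', '中楼层': '中', '高楼层': '高', '低层': '低', '中层': '中', '高层': '高'}
df['楼层'] = df['楼层'].map(floor)
def process_area(area):
    # Drop the '区' (district) or '县' (county) suffix, except for '新区' (New Area)
    if isinstance(area, str) and area != '新区':
        area = area.replace('区', '').replace('县', '')
    return area
df['区域'] = df['区域'].map(process_area)
df.replace('nan', np.nan, inplace=True)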
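The lambda-based conversions above call `.replace` on every value, which would raise an error if a value in those columns were missing. A minimal sketch of a NaN-tolerant variant using pandas string methods follows. The three sample rows are invented for illustration; only the column names come from the source.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; the real data comes from house.csv.
raw = pd.DataFrame({
    '建筑面积': ['89.5平米', '暂无', '120平米'],
    '单价': ['9800元/平米', '10500元/平米', '暂无'],
    '建筑年代': ['2008年', '暂无', '2015年'],
})

# Treat the placeholder '暂无' ("not available") as missing first.
clean = raw.replace('暂无', np.nan)

# .str methods pass missing values through, so no explicit NaN checks are needed.
clean['建筑面积'] = clean['建筑面积'].str.replace('平米', '', regex=False).astype(float)
clean['单价'] = clean['单价'].str.replace('元/平米', '', regex=False).astype(float)

# Keep the first four digits of the construction year; missing stays missing.
clean['建筑年代'] = clean['建筑年代'].str.extract(r'(\d{4})', expand=False)

print(clean)
```

Whether this is preferable depends on the data: if the area and unit price columns never contain missing values, the original lambdas behave the same.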
Duplicate value handling
# Handle duplicate values
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
Missing value handling
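# Missing values: inspect the non-null count per column (rows with missing values are dropped later, before modeling)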
df.info()
Data visualization
Box plot analysis
# Descriptive statistics
df.describe()
- The smallest second-hand house is 15.9 square meters and the largest is 423.43 square meters. The cheapest costs 140,000 yuan and the most expensive 145 million yuan.
- A typical house is about 80-118 square meters and costs about 830,000-1.48 million yuan.
- Houses whose total price exceeds the upper limit are treated as outliers and deleted outright, so that the analysis only covers homes most buyers can afford.
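# Drop listings priced above 2 million yuan (总价 is in units of 10,000 yuan)
df.drop(index=df[df['总价'] > 200].index, inplace=True)
# Save as a new file (to_excel no longer accepts an encoding argument in pandas 2.0+)
df.to_excel('house.xlsx', index=False)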
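In a box plot, the "upper limit" is conventionally the third quartile plus 1.5 times the interquartile range. The source does not say how its threshold of 200 was chosen, so the following is only a sketch of that conventional rule, run on invented prices in units of 10,000 yuan.

```python
import pandas as pd

# Invented total prices in units of 10,000 yuan, for illustration only.
prices = pd.Series([60, 83, 95, 110, 125, 148, 160, 400, 1450], name='总价')

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as high outliers.
upper_fence = q3 + 1.5 * iqr

kept = prices[prices <= upper_fence]
print(f'upper fence: {upper_fence}, kept {len(kept)} of {len(prices)} rows')
```

A fixed, rounded threshold is easier to explain to readers, while the computed fence adapts if the data changes.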
The relationship between unit price, quantity, total price and administrative area
Districts rank in the same order by average unit price, average total price, and number of listings: Chengguan District is highest and Gaolan County is lowest.
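A per-district comparison like this can be produced with a single grouped aggregation. The sketch below uses invented listings; the column names 区域, 单价, and 总价 follow the source, and the output column names are chosen here for illustration.

```python
import pandas as pd

# Invented listings: 总价 in units of 10,000 yuan, 单价 in yuan per square meter.
listings = pd.DataFrame({
    '区域': ['城关', '城关', '城关', '安宁', '安宁', '皋兰'],
    '单价': [14000, 13000, 15000, 11000, 10000, 6000],
    '总价': [140, 120, 160, 105, 95, 55],
})

# One row per district: mean unit price, mean total price, number of listings.
by_district = (
    listings.groupby('区域')
    .agg(mean_unit_price=('单价', 'mean'),
         mean_total_price=('总价', 'mean'),
         n_listings=('总价', 'size'))
    .sort_values('mean_unit_price', ascending=False)
)
print(by_district)
```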
The relationship between area and total price
In general, the larger the area, the higher the total price.
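One way to put a number on this relationship is the Pearson correlation between area and total price. The values below are invented, so the result says nothing about the Lanzhou data; it only shows the call.

```python
import pandas as pd

# Invented areas (square meters) and total prices (units of 10,000 yuan).
toy = pd.DataFrame({
    '建筑面积': [60, 80, 95, 118, 140],
    '总价': [62, 85, 99, 130, 150],
})

# Pearson correlation coefficient; values near 1 indicate a strong linear relationship.
r = toy['建筑面积'].corr(toy['总价'])
print(f'correlation between area and total price: {r:.3f}')
```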
The relationship between orientation and total price
Feng shui: buyers often prefer a house with a north-south orientation, because such a house gets good natural light and stays warm in winter and cool in summer, which makes it comfortable to live in.
Houses whose orientation includes south or north are relatively more expensive.
The relationship between decoration and total price
Decoration level has some effect on the total price: the better the decoration, the higher the price.
The relationship between floor and total price
Floor level has little effect on the total price.
The relationship between elevator and total price
A house with an elevator is more expensive than a house without an elevator.
The relationship between school district housing and total price
Houses near schools will be more expensive.
Analysis of the age of the building and its relationship to the total price
Most second-hand houses for sale were built more than ten years ago, which matches expectations; newly built houses are rarely resold.
Very old houses are cheaper, houses built after 2008 are more expensive, and the newest houses (built after 2017) are cheaper than those built in the years just before.
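The periods named above (up to 2008, after 2008, after 2017) can be compared by binning the construction year and taking a median per bin. This is a sketch on invented rows; the bin edges are an assumption read off the text, not taken from the original analysis.

```python
import pandas as pd

# Invented sample; years and prices (units of 10,000 yuan) are illustrative only.
sample = pd.DataFrame({
    '建筑年代': [1995, 2003, 2006, 2010, 2014, 2016, 2018, 2019],
    '总价': [70, 85, 90, 130, 125, 135, 110, 105],
})

# Right-closed bins: (1900, 2008], (2008, 2017], (2017, 2100].
bins = [1900, 2008, 2017, 2100]
labels = ['2008 or earlier', '2009-2017', 'after 2017']
sample['period'] = pd.cut(sample['建筑年代'], bins=bins, labels=labels)

# observed=True keeps only the periods that actually occur in the data.
median_by_period = sample.groupby('period', observed=True)['总价'].median()
print(median_by_period)
```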
The relationship between the nature of property rights, housing type, building structure, construction type and total price
The nature of property rights, housing type, building structure, and construction type all have a certain influence on the price.
The relationship between house type and total price
House type has a larger effect on the total price, and the numbers of bedrooms, living rooms, and bathrooms each affect it differently.
Most demand is focused on 2 or 3 bedrooms, 1 or 2 living rooms, 1 or 2 bathrooms.
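A distribution like this can be read from a frequency table of the layout strings. The sketch below uses a handful of invented layouts in the same 室/厅/卫 format that the 户型 column uses.

```python
import pandas as pd

# Invented layouts: 室 = bedrooms, 厅 = living rooms, 卫 = bathrooms.
layouts = pd.Series(
    ['2室1厅1卫', '3室2厅1卫', '2室1厅1卫', '3室2厅2卫', '1室1厅1卫', '2室2厅1卫'],
    name='户型',
)

# Share of listings per layout, most common first.
share = layouts.value_counts(normalize=True)
print(share)
```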
Model building and prediction
Remove all missing values
# Drop every row that has a missing value
d1 = df.dropna().reset_index(drop=True)
Decompose house type
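# Split the house type string into bedrooms (室), living rooms (厅), and bathrooms (卫)
def apart_room(x):
    # Number before '室'
    room = x.split('室')[0]
    return int(room)
def apart_hall(x):
    # Number between '室' and '厅'
    hall = x.split('厅')[0].split('室')[1]
    return int(hall)
def apart_wc(x):
    # Number between '厅' and '卫'
    wc = x.split('卫')[0].split('厅')[1]
    return int(wc)
d1['室'] = d1['户型'].map(apart_room)
d1['厅'] = d1['户型'].map(apart_hall)
d1['卫'] = d1['户型'].map(apart_wc)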
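The three helper functions above can also be written as one regular expression with three capture groups. This is an alternative sketch, not the author's method, and it assumes every value has the form "N室N厅N卫"; the sample rows are invented.

```python
import pandas as pd

# Invented layouts in the assumed "N室N厅N卫" format.
layouts = pd.DataFrame({'户型': ['3室1厅1卫', '2室2厅1卫', '4室2厅2卫']})

# Each capture group becomes one column of the extracted frame.
parts = layouts['户型'].str.extract(r'(\d+)室(\d+)厅(\d+)卫').astype(int)
parts.columns = ['室', '厅', '卫']

layouts = layouts.join(parts)
print(layouts)
```

A value that does not match the pattern yields missing cells, and the `astype(int)` call then fails, which surfaces malformed rows early.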
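Encoding
# Encode ordered categories (ranked by their effect on price as seen in the visualizations above; a larger number means a larger effect)
# Unordered categorical variables cannot be entered directly and must be converted to dummy variables
# Ordinal (ordered categorical) variables can be entered into the model directly
# Orientation
map1 = {'南': 5, '南北': 6, '北': 1, '西南': 10, '东西': 4, '东': 2, '东北': 8, '东南': 9, '西': 3, '西北': 7}
d1['朝向'] = d1['朝向'].map(map1)
# Decoration
map2 = {'毛坯': 1, '简装修': 2, '精装修': 3, '中装修': 4, '豪华装修': 5}
d1['装修'] = d1['装修'].map(map2)
# Elevator (the keys keep the trailing space found in the raw data)
map3 = {'有 ': 1, '无 ': 0}
d1['电梯'] = d1['电梯'].map(map3)
# Property-rights type
map4 = {'商品房': 6, '个人产权': 5, '商品房(免税)': 7, '普通商品房': 4, '经济适用房': 2, '房改房': 3, '限价房': 8, '房本房': 1}
d1['产权性质'] = d1['产权性质'].map(map4)
# Housing type
map5 = {'普通住宅': 4, '经济适用房': 3, '公寓': 1, '商住楼': 2, '酒店式公寓': 5}
d1['住宅类别'] = d1['住宅类别'].map(map5)
# Building structure
map6 = {'平层': 4, '开间': 2, '跃层': 5, '错层': 1, '复式': 3}
d1['建筑结构'] = d1['建筑结构'].map(map6)
# Construction type
map7 = {'板楼': 4, '钢混': 5, '塔板结合': 3, '平房': 6, '砖混': 1, '塔楼': 7, '砖楼': 2}
d1['建筑类别'] = d1['建筑类别'].map(map7)
# District
map8 = {'城关': 6, '安宁': 5, '七里河': 4, '西固': 3, '榆中': 2, '永登': 1}
d1['区域'] = d1['区域'].map(map8)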
# Drop houses built after 2019 and convert the construction year to house age
d1['建筑年代'] = d1['建筑年代'].astype('int32')
d1.drop(index=d1[d1['建筑年代'] > 2019].index, inplace=True)
d1['房龄'] = d1['建筑年代'].map(lambda x: 2020 - x)
d1.drop(columns=['建筑年代'], inplace=True)
X = d1.drop(columns=['总价'])
y = d1['总价']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
poly = PolynomialFeatures(degree=2)
x_train = poly.fit_transform(X_train.values)
# Reuse the transformer fitted on the training set instead of refitting on the test set
x_test = poly.transform(X_test.values)
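The comments in the encoding step say that unordered categories must be turned into dummy variables, while the code assigns each category an integer rank. For comparison, here is a sketch of the dummy-variable route using `pd.get_dummies` on an invented frame. It is an alternative to the ranking above, not what the source does.

```python
import pandas as pd

# Invented rows; 朝向 (orientation) is treated here as an unordered category.
toy = pd.DataFrame({
    '朝向': ['南', '南北', '东南', '南'],
    '建筑面积': [89.0, 120.0, 95.0, 76.5],
})

# One 0/1 column per orientation value; the numeric column is left unchanged.
encoded = pd.get_dummies(toy, columns=['朝向'], prefix='朝向', dtype=int)
print(encoded)
```

Dummy variables avoid implying that, say, orientation 10 is "twice" orientation 5, at the cost of more columns, which grows quickly once degree-2 polynomial features are added.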
Lasso regression
# Lasso regression
la = Lasso(alpha=0.1, max_iter=100000)
la.fit(x_train, y_train)
print(f'Training set score: {round(la.score(x_train, y_train), 2)}')
print(f'Test set score: {round(la.score(x_test, y_test), 2)}')
Random forest
# Random forest
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
print(f'Training set score: {round(rf.score(x_train, y_train), 2)}')
print(f'Test set score: {round(rf.score(x_test, y_test), 2)}')
Decision tree
# Decision tree
dt = DecisionTreeRegressor(max_depth=6)
dt.fit(x_train, y_train)
print(f'Training set score: {round(dt.score(x_train, y_train), 2)}')
print(f'Test set score: {round(dt.score(x_test, y_test), 2)}')
K-nearest neighbors
# K-nearest neighbors
kn = KNeighborsRegressor(n_neighbors=20)
kn.fit(x_train, y_train)
print(f'Training set score: {round(kn.score(x_train, y_train), 2)}')
print(f'Test set score: {round(kn.score(x_test, y_test), 2)}')
Across the models compared, the test-set score (the R² returned by score()) stays above 0.70.
The random forest scores above 0.90 on the training set, and its test-set score is also the best of the models.
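Scores from a single train/test split depend on the split. A common way to make such a comparison more stable is k-fold cross-validation, sketched here on synthetic regression data (not the housing data) with the same four model types. The hyperparameters mirror the ones above where given; `n_estimators=50` and the fixed seeds are assumptions added to keep the run short and repeatable.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data so the sketch is self-contained; it says nothing about housing prices.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=33)

models = {
    'Lasso': Lasso(alpha=0.1, max_iter=100000),
    'RandomForest': RandomForestRegressor(n_estimators=50, random_state=33),
    'DecisionTree': DecisionTreeRegressor(max_depth=6, random_state=33),
    'KNN': KNeighborsRegressor(n_neighbors=20),
}

# Mean R^2 over 5 folds for each model.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='r2')
    cv_scores[name] = scores.mean()
    print(f'{name}: mean R^2 = {cv_scores[name]:.2f}')
```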
Scenario simulation
Consider a family of three whose child is about to start school and whose adults work in Chengguan District. They need to buy a house, and the assumed requirements are: 3 bedrooms, 1 living room, 1 bathroom (3, 1, 1); an area of about 95 square meters (95); school-district housing (1); facing southeast (9); medium decoration (4); no elevator (0); personal property rights (5); ordinary residence (4); single-level layout (4); steel-concrete (5); Chengguan District (6); house age of 10 years (10).
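Turning such a wish list into a prediction means building one feature row in the same column order as the training data, passing it through the already fitted polynomial transformer, and calling `predict`. The sketch below is hedged in several ways: the column list and its order are hypothetical (the real order is whatever the training frame holds), the column name 学区房 for school-district housing is an assumption, and the model is trained on synthetic numbers purely so the code runs. The orientation code 9 is the value the encoding table assigns to 东南 (southeast).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical feature order; in practice use the columns of the training frame.
columns = ['室', '厅', '卫', '建筑面积', '学区房', '朝向', '装修', '电梯',
           '产权性质', '住宅类别', '建筑结构', '建筑类别', '区域', '房龄']

# Synthetic training data so the sketch runs; it carries no real price information.
rng = np.random.default_rng(33)
X_fake = pd.DataFrame(rng.integers(0, 10, size=(200, len(columns))), columns=columns)
y_fake = X_fake['建筑面积'] * 10 + rng.normal(0, 1, size=200)

poly_demo = PolynomialFeatures(degree=2)
model = RandomForestRegressor(n_estimators=50, random_state=33)
model.fit(poly_demo.fit_transform(X_fake.values), y_fake)

# The family's requirements, keyed by column name so the order cannot drift.
wish = {'室': 3, '厅': 1, '卫': 1, '建筑面积': 95, '学区房': 1, '朝向': 9,
        '装修': 4, '电梯': 0, '产权性质': 5, '住宅类别': 4, '建筑结构': 4,
        '建筑类别': 5, '区域': 6, '房龄': 10}
row = pd.DataFrame([wish])[columns]

# Reuse the fitted transformer: transform, never fit, on new inputs.
predicted = model.predict(poly_demo.transform(row.values))
print(predicted.shape)
```

With the real model, `predicted[0]` would be the estimated total price in units of 10,000 yuan; here the number itself is meaningless because the training data is synthetic.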