Visual analysis of the overall situation of second-hand houses + house price prediction based on Python [500010099]

Project description

This project uses second-hand housing listings for Lanzhou from Fangtianxia. The data is cleaned and processed, each dimension is analyzed, and the feature variables that significantly affect house prices are selected. The analysis covers the overall market, price levels, and price-influencing factors for Lanzhou's second-hand housing, and ends with a house price prediction model.

Questions to answer

  • Explore the relationship between unit price, quantity, total price and administrative area
  • Explore the relationship between other factors and total price
  • House type distribution
  • Analyze the age of the building
  • Use machine learning models to build regression analysis models for price prediction

Data understanding

Import module

import pandas as pd
import numpy as np
import random
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

Load the data

# Load the data (the CSV file is GBK-encoded)
df = pd.read_csv('./data/house.csv', encoding='gbk')

Data analysis and processing

## 1) Overall size of the data
print(f'Total number of samples: {df.shape[0]}')

image.png

## 2) Check for duplicate rows
df.duplicated().sum()

image.png

## 3) Check for missing values
df.isnull().sum()

image.png

## 4) Inspect the data types
df.dtypes

image.png

## 5) Unique values of each categorical column
print(df['朝向'].unique())
print(df['楼层'].unique())
print(df['装修'].unique())
print(df['产权性质'].unique())
print(df['住宅类别'].unique())
print(df['建筑结构'].unique())
print(df['建筑类别'].unique())
print(df['区域'].unique())
print(df['建筑年代'].unique())

image.png
Preliminary exploratory results:

  • Duplicates and missing values need to be handled
  • Building area, building age, and unit price need to be converted to numbers (the units must be removed)
  • Floor and region labels need to be consolidated

Data cleaning

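Data format conversion
# Convert data formats
df.replace('暂无', np.nan, inplace=True)  # '暂无' means "not available"
# Strip the units 平米 (square meters) and 元/平米 (yuan per square meter)
df['建筑面积'] = df['建筑面积'].map(lambda x: x.replace('平米','')).astype('float')
df['单价'] = df['单价'].map(lambda x: x.replace('元/平米','')).astype('float')
def process_year(year):
    # Keep only the four-digit year. NaN is not None, so a missing year
    # becomes the string 'nan' here and is turned back into NaN below.
    if year is not None:
        year = str(year)[:4]
    return year
df['建筑年代'] = df['建筑年代'].map(process_year)
# Merge the two naming schemes for floors into low (低), middle (中), high (高)
floor = {'低楼层': '低', '中楼层': '中', '高楼层': '高', '低层': '低', '中层': '中', '高层': '高'}
df['楼层'] = df['楼层'].map(floor)
def process_area(area):
    # Drop the 区 (district) / 县 (county) suffix, except for 新区 (New Area)
    if area != '新区':
        area = area.replace('区','').replace('县','')
    return area
df['区域'] = df['区域'].map(process_area)
# Turn the literal string 'nan' produced above into real missing values
df.replace('nan', np.nan, inplace=True)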
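
The unit-stripping step above calls x.replace inside map, which raises an error if a cell is missing. A minimal sketch of a missing-value-tolerant variant follows, assuming the raw columns look like the hypothetical sample below (the values are made up for illustration, not taken from the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical sample mimicking the raw columns; values are illustrative only.
raw = pd.DataFrame({
    '建筑面积': ['89.5平米', '暂无', '120平米'],
    '单价': ['12000元/平米', '9500元/平米', '暂无'],
})

# Treat the placeholder '暂无' ("not available") as missing before any string work.
raw = raw.replace('暂无', np.nan)

# The .str accessor propagates missing values instead of raising on them.
raw['建筑面积'] = raw['建筑面积'].str.replace('平米', '', regex=False).astype('float')
raw['单价'] = raw['单价'].str.replace('元/平米', '', regex=False).astype('float')

print(raw.dtypes)
```

Whether the real columns contain the placeholder at all is not shown in the source, so this is only a defensive alternative.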
Duplicate value handling
# Drop duplicate rows and rebuild the index
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)
Missing value handling
# Inspect the non-null counts per column
df.info()

image.png

Data visualization

Box plot analysis

image.png

# Descriptive statistics
df.describe()

image.png

  • The smallest second-hand house is 15.9 square meters and the largest is 423.43 square meters. The cheapest costs 140,000 yuan and the most expensive 145 million yuan.
  • Typical listings are about 80-118 square meters and cost about 830,000-1.48 million yuan.
  • Listings whose total price is above the upper limit are treated as outliers and deleted directly, so that only houses most buyers could afford are considered.
# Drop listings priced above 2 million yuan (总价 is in units of 10,000 yuan)
df.drop(index=df[df['总价'] > 200].index, inplace=True)
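
The code above uses a fixed cut-off of 200 (2 million yuan). One common way to derive an "upper limit" from a box plot is the Tukey rule, Q3 + 1.5 × IQR. The sketch below applies it to hypothetical prices. This is an assumption about how such a limit could be computed, not necessarily how the author chose 200.

```python
import pandas as pd

# Hypothetical total prices in units of 10,000 yuan (illustrative only).
prices = pd.Series([60, 83, 95, 110, 120, 148, 160, 900], name='总价')

# Quartiles and the upper whisker of a standard box plot (Tukey rule).
q1, q3 = prices.quantile([0.25, 0.75])
upper_limit = q3 + 1.5 * (q3 - q1)

# Keep only the rows at or below the upper limit.
kept = prices[prices <= upper_limit]
print(f'upper limit: {upper_limit:.1f}, kept {len(kept)} of {len(prices)} rows')
```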

# Save as a new file (to_excel no longer accepts an encoding argument in pandas 2.x)
df.to_excel('house.xlsx', index=False)
The relationship between unit price, quantity, total price and administrative area

The average unit price, average total price, and number of listings rank the districts in the same order, with Chengguan District highest and Gaolan County lowest.
image.png
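
The per-district comparison can be sketched with a grouped aggregation. The column names (区域, 单价, 总价) come from the source; the rows and the result column names (mean_unit_price, mean_total_price, listings) are hypothetical and only illustrate the shape of the computation.

```python
import pandas as pd

# Hypothetical rows using the source's column names; values are made up.
sample = pd.DataFrame({
    '区域': ['城关', '城关', '安宁', '安宁', '皋兰'],
    '单价': [14000.0, 12000.0, 10000.0, 9000.0, 5000.0],
    '总价': [150.0, 130.0, 100.0, 95.0, 45.0],
})

# Mean unit price, mean total price, and listing count per district,
# sorted from the most to the least expensive district.
summary = (
    sample.groupby('区域')
          .agg(mean_unit_price=('单价', 'mean'),
               mean_total_price=('总价', 'mean'),
               listings=('总价', 'size'))
          .sort_values('mean_unit_price', ascending=False)
)
print(summary)
```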

The relationship between area and total price

In general, the larger the area, the higher the total price.
image.png

The relationship between orientation and total price

Feng shui: buyers often prefer houses that face north and south, because such houses get good daylight and ventilation and stay warm in winter and cool in summer, which makes them comfortable to live in.
Houses whose orientation includes south or north are relatively more expensive.
image.png

The relationship between decoration and total price

The decoration level has some impact on the total price: the better the decoration, the higher the price.
image.png

The relationship between floor and total price

Different floors have little impact on the total price.
image.png

The relationship between elevator and total price

A house with an elevator is more expensive than a house without an elevator.
image.png

The relationship between school district housing and total price

Houses near schools will be more expensive.
image.png

Analysis of the age of the building and its relationship to the total price

Most of the second-hand houses for sale were built more than ten years ago, which matches expectations: new houses are rarely resold.
Very old houses are cheaper, houses built after 2008 are more expensive, and the newest houses (built after 2017) are cheaper than those built in the preceding years.
image.png
image.png

The relationship between the nature of property rights, housing type, building structure, construction type and total price

The nature of property rights, housing type, building structure, and construction type all have a certain influence on the price.
image.png
image.png
image.png
image.png

The relationship between house type and total price

House type has a larger impact on the total price, and the numbers of bedrooms, living rooms, and bathrooms each affect it differently.
Most demand is focused on 2 or 3 bedrooms, 1 or 2 living rooms, 1 or 2 bathrooms.
image.png
image.png

Model building and prediction

Remove all missing values
# Drop every row that still contains a missing value
d1 = df.dropna().reset_index(drop=True)
Decompose house type
# Split the layout string (of the form 'N室M厅K卫') into bedrooms (室), living rooms (厅), and bathrooms (卫)
def apart_room(x):
    room = x.split('室')[0]
    return int(room)
def apart_hall(x):
    hall = x.split('厅')[0].split('室')[1]
    return int(hall)
def apart_wc(x):
    wc = x.split('卫')[0].split('厅')[1]
    return int(wc)
d1['室'] = d1['户型'].map(apart_room)
d1['厅'] = d1['户型'].map(apart_hall)
d1['卫'] = d1['户型'].map(apart_wc)
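
The three split-based helpers above raise an error on any layout string that does not follow the 'N室M厅K卫' pattern. A minimal alternative sketch uses a single regular expression, so that non-matching rows become missing values instead. The sample strings below are hypothetical.

```python
import pandas as pd

# Hypothetical layout strings; the last one does not match the pattern.
layouts = pd.Series(['3室1厅1卫', '2室2厅1卫', '1室0厅1卫', '暂无'], name='户型')

# One capture group per count; rows that do not match yield NaN in every column.
parts = layouts.str.extract(r'(\d+)室(\d+)厅(\d+)卫')
parts.columns = ['室', '厅', '卫']

# Convert the captured digit strings to numbers (float, because of the NaN row).
parts = parts.apply(pd.to_numeric)
print(parts)
```

Because d1 was built with dropna, the unmatched case may never occur in the source data; the regular expression mainly makes the assumed format explicit.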
Encoding
# Encode ordered categories. The ranks follow the visualizations above:
# the larger the number, the stronger the effect on price.
# Unordered categories cannot be entered directly; they must be converted to dummy variables.
# Ordinal (ordered multi-class) variables can be entered into the model directly.
# Orientation
map1 = {'南':5, '南北':6, '北':1, '西南':10, '东西':4, '东':2, '东北':8, '东南':9, '西':3, '西北':7}
d1['朝向'] = d1['朝向'].map(map1)
# Decoration level
map2 = {'毛坯':1, '简装修':2, '精装修':3, '中装修':4, '豪华装修':5}
d1['装修'] = d1['装修'].map(map2)
# Elevator (the keys include a trailing space, as in the source)
map3 = {'有 ':1, '无 ':0}
d1['电梯'] = d1['电梯'].map(map3)
# Nature of property rights
map4 = {'商品房':6, '个人产权':5, '商品房(免税)':7, '普通商品房':4, '经济适用房':2, '房改房':3, '限价房':8, '房本房':1}
d1['产权性质'] = d1['产权性质'].map(map4)
# Housing type
map5 = {'普通住宅':4, '经济适用房':3, '公寓':1, '商住楼':2, '酒店式公寓':5}
d1['住宅类别'] = d1['住宅类别'].map(map5)
# Building structure
map6 = {'平层':4, '开间':2, '跃层':5, '错层':1, '复式':3}
d1['建筑结构'] = d1['建筑结构'].map(map6)
# Construction type
map7 = {'板楼':4, '钢混':5, '塔板结合':3, '平房':6, '砖混':1, '塔楼':7, '砖楼':2}
d1['建筑类别'] = d1['建筑类别'].map(map7)
# Region
map8 = {'城关':6, '安宁':5, '七里河':4, '西固':3, '榆中':2, '永登':1}
d1['区域'] = d1['区域'].map(map8)
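
Series.map with a dictionary returns NaN for any value that is not a key. The region mapping above, for example, has no entry for 皋兰 (Gaolan), which appears earlier in the analysis. A small check like the following sketch can list categories the mapping would silently turn into missing values. The sample series is hypothetical.

```python
import pandas as pd

# Same keys as the region mapping in the source.
region_codes = {'城关': 6, '安宁': 5, '七里河': 4, '西固': 3, '榆中': 2, '永登': 1}

# Hypothetical region labels, two of which are absent from the mapping.
regions = pd.Series(['城关', '皋兰', '安宁', '红古'], name='区域')

encoded = regions.map(region_codes)

# Labels that turned into NaN because the mapping has no key for them.
unmapped = sorted(regions[encoded.isna()].unique())
print('categories missing from the mapping:', unmapped)
```

Running the same check on each encoded column before modeling would show whether rows need to be dropped or the mappings extended.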
# Drop houses built after 2019 and convert the construction year into house age (relative to 2020)
d1['建筑年代'] = d1['建筑年代'].astype('int32')
d1.drop(index=d1[d1['建筑年代']>2019].index,inplace=True)
d1['房龄'] = d1['建筑年代'].map(lambda x: 2020-x)
d1.drop(columns=['建筑年代'],inplace=True)
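
The comments in the encoding step state that unordered categories need dummy variables, while the code assigns integer ranks to every column. For comparison, here is a minimal sketch of dummy (one-hot) encoding with pandas.get_dummies on a hypothetical two-column sample. It is an alternative to the rank mapping, not the approach used in the source.

```python
import pandas as pd

# Hypothetical sample: one numeric column and one unordered categorical column.
sample = pd.DataFrame({
    '建筑面积': [89.5, 120.0, 60.0],
    '朝向': ['南北', '东南', '南北'],
})

# Each category becomes its own 0/1 column; the original column is removed.
encoded = pd.get_dummies(sample, columns=['朝向'], prefix='朝向', dtype=int)
print(encoded.columns.tolist())
```

Dummy variables avoid imposing an artificial order on categories such as orientation, at the cost of more columns.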

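# Features and target (this assumes d1 now holds only numeric feature columns plus 总价)
X = d1.drop(columns=['总价'])
y = d1['总价']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)
# Expand the features to degree-2 polynomial terms: fit on the training set only,
# then apply the same transformation to the test set
poly = PolynomialFeatures(degree=2)
x_train = poly.fit_transform(X_train.values)
x_test = poly.transform(X_test.values)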
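
One way to make sure every preprocessing step is fitted on the training data only is to wrap the steps and the model in a scikit-learn pipeline. The sketch below uses synthetic data, and the StandardScaler step is an addition that is not in the source (Lasso penalties are sensitive to feature scale).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data standing in for the encoded housing features (illustrative only).
rng = np.random.default_rng(33)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

# fit() fits each step on the training data; score() only transforms the test data.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),  # assumption: not part of the original workflow
    Lasso(alpha=0.1, max_iter=100000),
)
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f'test R^2: {test_r2:.2f}')
```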
Lasso regression
# Lasso regression
la = Lasso(alpha=0.1,max_iter=100000)
la.fit(x_train,y_train)
print(f'Training set score: {round(la.score(x_train,y_train),2)}')
print(f'Test set score: {round(la.score(x_test,y_test),2)}')

image.png

Random forest
# Random forest
rf = RandomForestRegressor()
rf.fit(x_train,y_train)
print(f'Training set score: {round(rf.score(x_train,y_train),2)}')
print(f'Test set score: {round(rf.score(x_test,y_test),2)}')

image.png

Decision tree
# Decision tree
dt = DecisionTreeRegressor(max_depth = 6)
dt.fit(x_train,y_train)
print(f'Training set score: {round(dt.score(x_train,y_train),2)}')
print(f'Test set score: {round(dt.score(x_test,y_test),2)}')

image.png

K nearest neighbors
# k nearest neighbors
kn = KNeighborsRegressor(n_neighbors=20)
kn.fit(x_train,y_train)
print(f'Training set score: {round(kn.score(x_train,y_train),2)}')
print(f'Test set score: {round(kn.score(x_test,y_test),2)}')

image.png
Across the models compared, the test-set score stays above 0.70 (for scikit-learn regressors, score returns the coefficient of determination R²).
The random forest scores above 0.90 on the training set and also has the best test-set score of the models tried.
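
A single train/test split can make model rankings depend on the particular split. A hedged sketch of comparing the same model families with 5-fold cross-validation follows. It runs on synthetic data, so the scores say nothing about the housing data, and the n_estimators and random_state settings are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 4))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 1.0, size=300)

models = {
    'random forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'decision tree': DecisionTreeRegressor(max_depth=6, random_state=0),
    'k nearest neighbors': KNeighborsRegressor(n_neighbors=20),
}

# Mean R^2 over 5 folds for each model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f'{name}: mean R^2 = {score:.2f}')
```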

Scenario simulation

Consider a family of three whose child is about to start school and whose parents work in Chengguan District. They want to buy a house, with the following assumed requirements (encoded values in parentheses): 3 bedrooms, 1 living room, 1 bathroom (3, 1, 1); an area of about 95 square meters (95); school-district housing (1); southeast orientation (10); medium decoration (中装修, 4); no elevator (0); personal property rights (个人产权, 5); ordinary residence (普通住宅, 4); single-level layout (平层, 4); reinforced concrete (钢混, 5); Chengguan (城关, 6); house age of 10 years (10). Note that map1 above encodes 东南 (southeast) as 9 and 西南 (southwest) as 10, so the orientation code should be checked against the intended direction.
image.png
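
Predicting for one buyer means passing a single-row, two-dimensional array through the same fitted polynomial transformer before calling predict. The sketch below shows the mechanics on synthetic data with three stand-in features. The feature values and the training data are hypothetical, not the housing model from this article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures

# Synthetic training data: 3 numeric features standing in for the encoded columns.
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 10, size=(100, 3))
y_train = 5.0 * X_train[:, 0] + X_train[:, 1] + rng.normal(0, 0.5, size=100)

# Fit the transformer and the model on the training data.
poly = PolynomialFeatures(degree=2)
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(poly.fit_transform(X_train), y_train)

# One scenario = one row; reuse the fitted transformer with transform().
buyer = np.array([[3.0, 1.0, 1.0]])
predicted = rf.predict(poly.transform(buyer))
print(f'predicted value: {predicted[0]:.2f}')
```

The order of the values in the row must match the column order used during training.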

Origin blog.csdn.net/s1t16/article/details/135390532