【Hands-On Project】Analysis and prediction of second-hand housing prices in Beijing

Project Introduction
This project analyzes second-hand housing data for Beijing, driven by individual needs. It uses the analysis to identify housing characteristics and patterns, and applies machine learning models to make simple predictions.

Data source
The data was collected by crawling two third-party real-estate agency websites, Lianjia and Anjuke, and is used for learning purposes only.

Objective
Housing prices are among the most closely watched topics in Beijing. This project therefore studies the prices of second-hand housing in Beijing through the following analyses:

Second-hand housing prices by region in Beijing
Number of second-hand listings by region in Beijing
Second-hand housing prices in Xicheng District, Dongcheng District, and Haidian District
Housing prices and number of listings by floor area

Techniques and Tools
This project performs the data analysis in Python.

Data analysis: pandas, numpy, matplotlib

1. Data import and cleaning

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the Lianjia second-hand housing data
lianjia_df = pd.read_csv('./lianjia.csv')
print(lianjia_df.head())
print('\n')

# Drop the columns that are not needed ['Id', 'Direction', 'Elevator', 'Renovation'] so the data can be merged with the Anjuke data
drop = ['Id', 'Direction', 'Elevator', 'Renovation']
lianjia_df_clean = lianjia_df.drop(drop, axis=1)

# Reorder the columns as ['Region', 'District', 'Garden', 'Layout', 'Floor', 'Year', 'Size', 'Price']
columns = ['Region', 'District', 'Garden', 'Layout', 'Floor', 'Year', 'Size', 'Price']
lianjia_df_clean = pd.DataFrame(lianjia_df_clean, columns=columns)
print(lianjia_df_clean.head())
print('\n')

# Count the non-null values in the Region column
lianjia_total_num = lianjia_df_clean['Region'].count()
print(lianjia_total_num)

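The column reordering above silently tolerates a misspelled or missing column name. The sketch below, on a made-up two-row frame, contrasts two ways of selecting and reordering columns: `reindex`, which fills a missing column with NaN, and label-based selection, which raises `KeyError`. The constructor call `pd.DataFrame(df, columns=...)` used above is expected to behave like the `reindex` variant. The frame `raw` and the list `wanted` are illustrative assumptions, not part of the project data.

```python
import pandas as pd

# Hypothetical two-row frame with one unwanted column ('Id')
raw = pd.DataFrame({'Id': [1, 2], 'Price': [400.0, 600.0], 'Region': ['东城', '西城']})

wanted = ['Region', 'Size', 'Price']   # 'Size' is missing from raw on purpose

# reindex keeps going and fills the missing column with NaN
loose = raw.reindex(columns=wanted)

# label-based selection fails loudly instead
try:
    strict = raw[wanted]
except KeyError as err:
    strict = None
    print('missing column:', err)

print(loose)
```

If a missing column should stop the script rather than produce an all-NaN column, the stricter `raw[wanted]` form is the safer choice.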

# Import the Anjuke second-hand housing data
anjuke_df = pd.read_csv('./anjuke.csv')
print(anjuke_df.head())
print('\n')

# Split the Region column with regular expressions:
# the middle part becomes District, the first part stays as Region
anjuke_df['District'] = anjuke_df['Region'].str.extract(r'.+?-(.+?)-.+?', expand=False)
anjuke_df['Region'] = anjuke_df['Region'].str.extract(r'(.+?)-.+?-.+?', expand=False)
print(anjuke_df.head())
print('\n')

# Clean the data and reorder the columns
columns = ['Region', 'District', 'Garden', 'Layout', 'Floor', 'Year', 'Size', 'Price']
anjuke_df = pd.DataFrame(anjuke_df, columns=columns)
print(anjuke_df.head())
print('\n')

# Count the non-null values in the Region column
anjuke_total_num = anjuke_df['Region'].count()
print(anjuke_total_num)

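The regular-expression step is easier to follow on a few sample strings. The sketch below assumes the Anjuke Region values look like 'Region-District-Street', which is what the patterns above imply. The sample values are made up. It uses one pattern with two named groups, a variation on the two separate `str.extract` calls above, and shows that a value that does not match becomes NaN.

```python
import pandas as pd

# Hypothetical sample values in the assumed 'Region-District-Street' format
sample = pd.DataFrame({'Region': ['朝阳-望京-广顺北大街', '海淀-中关村-科学院南路', '未知']})

# One pattern with two named groups yields both columns in a single pass
parts = sample['Region'].str.extract(r'^(?P<Region>[^-]+)-(?P<District>[^-]+)-')
sample['District'] = parts['District']
sample['Region'] = parts['Region']

print(sample)
```

Rows that end up with NaN in Region or District are later removed by any `dropna` step, so it is worth checking how many values fail to match before cleaning.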

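# Combine the datasets: merge the Lianjia data with the Anjuke data
df = pd.merge(lianjia_df_clean, anjuke_df, how='outer')
print(df)
print('\n')

# Add a column: price per square meter
df['PriceMs'] = df['Price'] / df['Size']
print(df)
print('\n')

# Clean the combined data again (null values, duplicates)
# dropna returns a new DataFrame, so the result must be assigned back
df = df.dropna(how='any')
df.drop_duplicates(keep='first', inplace=True)

# Some villas have abnormal unit prices; drop listings at or above 250,000 yuan per square meter
df = df.loc[df['PriceMs']<25]   # Price is in units of 10,000 yuan, so 25 means 250,000 yuan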

total_num = anjuke_total_num + lianjia_total_num
df_num = df['Region'].count()
drop_num = total_num - df_num

print(total_num)
print(df_num)
print(drop_num)

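The cleaning steps above can be traced on a toy example. The following is a minimal sketch with made-up rows, not the project's data. It stacks the two sources with `pd.concat` instead of the outer `pd.merge` used above, which is a common alternative when both frames share the same columns. The frames `a` and `b` and the reduced column list are assumptions for illustration.

```python
import pandas as pd

cols = ['Region', 'Size', 'Price']
# Hypothetical rows; Price is in units of 10,000 yuan, Size in square meters
a = pd.DataFrame([['东城', 50.0, 400.0], ['西城', 60.0, 600.0]], columns=cols)
b = pd.DataFrame([['西城', 60.0, 600.0], ['海淀', 80.0, None], ['朝阳', 10.0, 500.0]],
                 columns=cols)

# concat stacks the rows of both sources (used here instead of an outer merge)
df = pd.concat([a, b], ignore_index=True)
df['PriceMs'] = df['Price'] / df['Size']

# dropna returns a new frame, so the result has to be assigned back
df = df.dropna(how='any').drop_duplicates(keep='first')

# keep listings below 25, i.e. 250,000 yuan per square meter
df = df[df['PriceMs'] < 25]

dropped = len(a) + len(b) - len(df)
print(df)
print(dropped)
```

Of the five input rows, one has a missing price, one is an exact duplicate, and one exceeds the price threshold, so two rows remain.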

2. Data visualization analysis

Compare the average second-hand housing price and the number of second-hand listings across the regions of Beijing.

# Count the second-hand listings in each region of Beijing
df_house_count = df.groupby('Region')['Price'].count().sort_values(ascending=False)
print(df_house_count)

print('\n')

# Compute the mean second-hand price per square meter in each region of Beijing
df_house_mean = df.groupby('Region')['PriceMs'].mean().sort_values(ascending=False)
print(df_house_mean)

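The two grouped statistics can also be computed together. The sketch below uses made-up listings and `agg` to get the count and the mean in one pass. It also shows that sorting each statistic separately gives the two Series different index orders. The data and the variable names `stats`, `by_count`, and `by_mean` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical listings; PriceMs is in 10,000 yuan per square meter
df = pd.DataFrame({
    'Region': ['西城', '西城', '海淀', '朝阳', '朝阳', '朝阳'],
    'PriceMs': [12.0, 10.0, 9.0, 6.0, 7.0, 8.0],
})

# agg computes count and mean in one pass over the groups
stats = df.groupby('Region')['PriceMs'].agg(['count', 'mean'])

# Each statistic is sorted on its own, so the two index orders differ
by_count = stats['count'].sort_values(ascending=False)
by_mean = stats['mean'].sort_values(ascending=False)

print(by_count)
print(by_mean)
```

Because the orders differ, any chart should take its labels from the same Series that supplies its values.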
Option 1:

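plt.style.use('ggplot')
plt.rc('font', family='SimHei', size=13)  # set after the style so it is not overridden; SimHei renders the Chinese labels
plt.figure(figsize=(20,10))

plt.subplot(211)
plt.title('各区域二手房平均价格的对比', fontsize = 20)  # "Comparison of average second-hand housing prices by region"
plt.ylabel('二手房平均价格 (万/平方米)', fontsize = 15)  # "Average price (10,000 yuan per square meter)"
bar1 = plt.bar(np.arange(len(df_house_mean.index)),  df_house_mean.values, color='c')
plt.show()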

As the output of Option 1 shows, the x-axis has no region labels, so the chart needs further improvement.

Option 2:
Option 2 improves on Option 1.
First, loop over the bar1 container from Option 1 and print each element:

for i in bar1:
    print(i)

Each element is a Rectangle patch with an x position and a width. Based on this, write a function that places the index values on the horizontal axis at the center of each bar:

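def auto_x(bar, x_index):
    """Label the x-axis with x_index, one tick at the center of each bar."""
    x = []
    for i in bar:
        # center of the bar = left edge + half of the width
        x.append(i.get_x() + i.get_width()/2)
    plt.xticks(x, x_index)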

Call the function:

auto_x(bar1, df_house_mean.index)
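The arithmetic behind auto_x can be checked without opening a plot window. The sketch below is an assumption-laden illustration: it uses the non-interactive Agg backend, made-up labels and heights, and the object-oriented `set_xticks` and `set_xticklabels` calls in place of `plt.xticks`. It confirms that the left edge plus half the width recovers the positions the bars were drawn at.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

labels = ['A', 'B', 'C']      # hypothetical category names
heights = [3.0, 2.0, 1.0]     # hypothetical bar heights

fig, ax = plt.subplots()
bars = ax.bar(np.arange(len(labels)), heights)

# Bars are centered on their x values by default, so left edge + width/2
# recovers the positions the bars were drawn at
centers = [rect.get_x() + rect.get_width() / 2 for rect in bars]
ax.set_xticks(centers)
ax.set_xticklabels(labels)

print(centers)
plt.close(fig)
```

Since the centers coincide with the x values passed to `bar`, the same ticks could be set from `np.arange(len(labels))` directly.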

The complete code for Option 2, that is, Option 1 with the x-axis labels added:

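def auto_x(bar, x_index):
    """Label the x-axis with x_index, one tick at the center of each bar."""
    x = []
    for i in bar:
        # center of the bar = left edge + half of the width
        x.append(i.get_x() + i.get_width()/2)
    plt.xticks(x, x_index)

plt.style.use('ggplot')
plt.rc('font', family='SimHei', size=13)  # set after the style so it is not overridden; SimHei renders the Chinese labels
plt.figure(figsize=(20,10))

plt.subplot(211)
plt.title('各区域二手房平均价格的对比', fontsize = 20)  # "Comparison of average second-hand housing prices by region"
plt.ylabel('二手房平均价格 (万/平方米)', fontsize = 15)  # "Average price (10,000 yuan per square meter)"
bar1 = plt.bar(np.arange(len(df_house_mean.index)),  df_house_mean.values, color='c')

auto_x(bar1, df_house_mean.index)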

plt.show()

Option 3:
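Option 2 works, but it is not the best practice. Option 3 is more concise: it passes the region names to plt.bar as the x values, so matplotlib labels the axis directly.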

# Compare the average second-hand housing price across regions
plt.style.use('ggplot')
plt.rc('font', family='SimHei', size=13)  # set after the style so it is not overridden
plt.figure(figsize=(20,10))

plt.subplot(211)
plt.title('各区域二手房平均价格的对比', fontsize = 20)  # "Comparison of average second-hand housing prices by region"
plt.ylabel('二手房平均价格 (万/平方米)', fontsize = 15)  # "Average price (10,000 yuan per square meter)"
bar1 = plt.bar(df_house_mean.index,  df_house_mean.values, color='c')

# Compare the number of second-hand listings across regions (second panel of the same figure)
plt.subplot(212)
plt.title('各区域二手房数量的对比', fontsize = 20)  # "Comparison of the number of second-hand listings by region"
plt.ylabel('二手房数量', fontsize = 15)  # "Number of listings"
bar2 = plt.bar(df_house_count.index,  df_house_count.values, color='c')

plt.show()


3. Pie chart visualization

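# Share of second-hand listings in each region
plt.figure(figsize=(6,6))
plt.title('各区域二手房数量的百分比', fontsize=20)  # "Share of second-hand listings by region"
ex = [0]*len(df_house_count)
ex[0] = 0.1   # pull out the first (largest) wedge
print(ex)
plt.pie(df_house_count, radius=1, autopct='%1.f%%', labels=df_house_count.index, explode=ex)

plt.show()

# Price per square meter in each region, shown as a share of the total
plt.figure(figsize=(6,6))
plt.title('各区域二手房每平方米房价的百分比', fontsize=20)  # "Share of price per square meter by region"
ex = [0]*len(df_house_mean)
ex[0] = 0.1
print(ex)
# Labels must come from df_house_mean, which is sorted differently from df_house_count
plt.pie(df_house_mean, radius=1, autopct='%1.f%%', labels=df_house_mean.index, explode=ex)

plt.show()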

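The percentages printed on each wedge come from normalizing the values to fractions of their sum. The sketch below reproduces that calculation on made-up counts, so the chart labels can be checked against plain numbers. The Series `counts` and the names `share` and `explode` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical listing counts per region, already sorted in descending order
counts = pd.Series({'朝阳': 30, '海淀': 15, '西城': 5})

# plt.pie normalizes the values to fractions of their sum; the same
# numbers can be computed directly to check what the chart will display
share = counts / counts.sum() * 100

# Offset only the first (largest) wedge, as in the charts above
explode = [0.1] + [0] * (len(counts) - 1)

for name, pct in share.items():
    print(f'{name}: {pct:.0f}%')
```

For the mean price per square meter, such a share is only a relative comparison between regions. It does not represent a part of any real total, so a bar chart is usually the clearer choice for that statistic.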

Origin blog.csdn.net/lingchen1906/article/details/127932535