Visualizing Beijing Housing Prices (2002-2018)

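This is the second practice article in my data visualization series. It follows the previous article, which visualized marriage and divorce rates across China's provinces (2002-2018).

This article mainly uses Python data visualization to analyze how housing prices in Beijing changed from 2002 to 2018.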

To keep the article easy to follow, the narrative is kept light on code; the complete code is collected in the appendix at the end.

I. Loading the data and the required libraries
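
As a rough sketch of the loading step, the snippet below writes a tiny stand-in CSV and reads it back the same way the appendix reads the real file. The sample rows and the temporary file are assumptions made for illustration; only the `encoding='gbk'`, `index_col` and `parse_dates` arguments come from the appendix code.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real file: two made-up rows saved as GBK.
sample_csv = "tradeTime,price,floor\n2016-08-09,31680,高 26\n2012-03-01,25000,低 6\n"
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", encoding="gbk") as handle:
    handle.write(sample_csv)

# encoding="gbk" lets pandas decode the Chinese characters in the file.
raw = pd.read_csv(sample_path, encoding="gbk")

# The same file again, indexed by transaction date and sorted chronologically,
# which is the shape needed for a price-over-time plot.
by_date = pd.read_csv(
    sample_path, encoding="gbk", index_col="tradeTime", parse_dates=True
).sort_index()

print(raw.shape)
print(by_date.index.min())
```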

II. Checking the data

Once the data has loaded successfully, its correctness needs to be verified.
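
One possible way to run such a check is sketched below. The rows are invented for illustration, and the particular checks (shape, dtypes, positive prices and areas) are my own suggestion rather than the exact steps used for this article.

```python
import pandas as pd

# Made-up rows that mimic a few columns of the dataset.
df = pd.DataFrame({
    "tradeTime": ["2016-08-09", "2012-03-01", "2017-11-20"],
    "price": [31680, 25000, 61000],
    "square": [131.0, 60.5, 88.2],
    "district": [7, 2, 10],
})

print(df.head())      # first rows: do the values look plausible?
print(df.shape)       # (rows, columns)
print(df.dtypes)      # are the numeric columns really numeric?
print(df.describe())  # value ranges: a negative price would be a red flag

# A basic sanity check that can be automated.
values_are_positive = bool((df["price"] > 0).all() and (df["square"] > 0).all())
print(values_are_positive)
```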

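III. Data processing

When processing data, we generally start by looking for missing values.

Six columns of the data contain missing values.

Next, we count the missing values in each of those columns.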
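
The two steps above can be sketched as follows. The small DataFrame is invented, and the choice of `subway` and `elevator` as the columns with gaps is only an assumption for the example.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the three columns contain gaps.
df = pd.DataFrame({
    "price": [31680, 25000, 61000, 42000],
    "subway": [1.0, np.nan, 0.0, 1.0],
    "elevator": [np.nan, np.nan, 1.0, 0.0],
})

# Number of missing values in every column.
missing_per_column = df.isnull().sum()

# Names of the columns that contain at least one missing value.
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

# Share of missing values per column, useful when deciding whether to drop it.
missing_share = df.isnull().mean()

print(columns_with_missing)
print(missing_per_column)
```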

The meaning of each column in the data is as follows:

url: the URL from which the data was fetched
id: the ID of the transaction
lng: longitude
lat: latitude
cid: community ID
tradeTime: the time of the transaction
followers: the number of people following the transaction
price: the average price per square meter
square: the area of the house (in square meters)
livingRoom: the number of living rooms
drawingRoom: the number of drawing rooms
kitchen: the number of kitchens
bathroom: the number of bathrooms
constructionTime: the year of construction
floor: the height of the house. The Chinese characters in this field will be converted to English in the next version.
buildingType: tower (1), bungalow (2), combination of plate and tower (3), plate (4)
renovationCondition: other (1), rough (2), simplicity (3), hardcover (4)
buildingStructure: unknown (1), mixed (2), brick and wood (3), brick and concrete (4), steel (5), steel-concrete composite (6)
ladderRatio: the ratio between the number of residents on the same floor and the number of elevators ("ladders"). It describes how many ladders a resident has on average.
elevator: has an elevator (1) or does not (0)
fiveYearsProperty: whether the owner has held the property for less than 5 years

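The codes in the district column refer to the following areas:

1: Dongcheng
2: Fengtai
3: Yizhuang
4: Daxing
5: Fangshan
6: Changping
7: Chaoyang
8: Haidian
9: Shijingshan
10: Xicheng
11: Tongzhou
12: Mentougou
13: Shunyi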
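
The code-to-name table above translates directly into a dictionary lookup. This is a minimal sketch with invented rows; `district_name` is a hypothetical new column, used here so that the original codes stay available.

```python
import pandas as pd

# District codes and names, copied from the list above.
DISTRICT_NAMES = {
    1: "Dongcheng", 2: "Fengtai", 3: "Yizhuang", 4: "Daxing",
    5: "Fangshan", 6: "Changping", 7: "Chaoyang", 8: "Haidian",
    9: "Shijingshan", 10: "Xicheng", 11: "Tongzhou", 12: "Mentougou",
    13: "Shunyi",
}

# Made-up rows for illustration.
df = pd.DataFrame({"district": [7, 10, 1, 12], "price": [60000, 95000, 90000, 40000]})

# Series.map looks every code up in the dictionary; unknown codes become NaN.
df["district_name"] = df["district"].map(DISTRICT_NAMES)

print(df)
```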

Check how often each value occurs:
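
A small sketch of such a count, assuming it is done with `value_counts` (the appendix does not show this step, so the method and the toy rows are assumptions):

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "district": [7, 7, 10, 2, 7, 10],
    "elevator": [1, 0, 1, 1, 1, 0],
})

# value_counts returns the distinct values with their frequencies,
# sorted from most to least common.
district_counts = df["district"].value_counts()
elevator_counts = df["elevator"].value_counts()

print(district_counts)
print(elevator_counts)
```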

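If too large a share of a column is missing, or if the column does not carry key information, the column can simply be dropped.

Here,

axis=1 drops columns

axis=0 drops rows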
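
The difference between the two axis values is easiest to see on a tiny example. The data below is invented; the first `drop` call mirrors the one in the appendix, while the row-wise calls are added only for contrast.

```python
import numpy as np
import pandas as pd

# Made-up data with one gap in the "subway" column.
df = pd.DataFrame({
    "price": [31680, 25000, 61000],
    "subway": [1.0, np.nan, 0.0],
})

columns_with_missing = [col for col in df.columns if df[col].isnull().any()]

# axis=1: the labels are column names, so the whole "subway" column goes away.
without_columns = df.drop(columns_with_missing, axis=1)

# axis=0: the labels are row labels, so only the row labelled 1 goes away.
without_rows = df.drop([1], axis=0)

# dropna(axis=0) removes every row that contains a missing value.
rows_without_gaps = df.dropna(axis=0)

print(without_columns.columns.tolist())
print(without_rows.index.tolist())
```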

IV. Data visualization

Overall trend of Beijing housing prices, 2002-2018 (unit price)

Unit: yuan per square meter
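
The appendix plots every transaction directly. As an alternative sketch, which is my own suggestion and not the method used for the chart, the unit price can first be reduced to one median per year, which gives a smoother trend line. The rows below are invented.

```python
import pandas as pd

# Made-up transactions for illustration.
df = pd.DataFrame({
    "tradeTime": pd.to_datetime([
        "2012-03-01", "2012-09-15", "2016-08-09", "2016-12-01", "2017-05-20",
    ]),
    "price": [25000, 27000, 31680, 50000, 61000],
})

# Group by the calendar year of the transaction and take the median unit price.
yearly_median = df.groupby(df["tradeTime"].dt.year)["price"].median()

print(yearly_median)
```

The resulting Series is indexed by year and can be handed to `sns.lineplot` in the same way as the raw prices.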

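Relationship between total price and floor area for Beijing homes:

Relationship between total price and floor area, by district

Median unit price by district (yuan per square meter)

Xicheng is the highest, followed by Dongcheng.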
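
The appendix computes these medians with a loop over the district codes. A shorter equivalent, sketched here with `groupby`, is shown below; the prices are invented and do not reproduce the real figures.

```python
import pandas as pd

# Made-up unit prices for three district codes (10, 1 and 7).
df = pd.DataFrame({
    "district": [10, 10, 1, 1, 7, 7, 7],
    "price": [90000, 110000, 85000, 95000, 60000, 70000, 65000],
})

# One median per district, sorted from most to least expensive.
median_price = (
    df.groupby("district")["price"].median().sort_values(ascending=False)
)

print(median_price)
```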

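District vs. unit price (box plot)

District vs. unit price (categorical scatter plot)

Total price by district

Chaoyang clearly has more luxury homes priced above 30 million yuan than any other district, followed by Xicheng. Wealthy buyers are probably concentrated in Chaoyang, Xicheng and Haidian.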

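Median total price by district

The median reflects what ordinary buyers can afford. Xicheng, Dongcheng and Haidian are the highest.

Distribution of floor area by district

Chaoyang and Changping have more large homes.

Median floor area by district

Changping ranks first; Xicheng has the smallest median.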

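V. What to look out for when buying a home

1. What price range do people prefer?

Homes priced under 10 million yuan have more followers, which suggests that this price range appeals to the widest group of buyers.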
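
The article reads this off a scatter plot. One way to put a number on it, sketched here as an assumption rather than the method actually used, is to bin the total price and compare the median follower count per bin. The rows are invented, and `totalPrice` is assumed to be in units of 10,000 yuan.

```python
import pandas as pd

# Made-up rows; totalPrice is assumed to be in units of 10,000 yuan.
df = pd.DataFrame({
    "totalPrice": [250, 480, 900, 1500, 3200, 700],
    "followers": [40, 65, 30, 8, 2, 55],
})

# Hypothetical price bands: under 5M, 5-10M, 10-20M and above 20M yuan.
bins = [0, 500, 1000, 2000, float("inf")]
labels = ["<5M", "5-10M", "10-20M", ">20M"]
df["price_band"] = pd.cut(df["totalPrice"], bins=bins, labels=labels)

# Median number of followers in each band.
median_followers = df.groupby("price_band", observed=False)["followers"].median()

print(median_followers)
```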

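2. The effect of the subway on prices

In every district, homes near a subway clearly have a unit price several thousand yuan higher than homes without one. The effect is especially pronounced in Haidian, Mentougou, Chaoyang and Fangshan, which suggests that many commuters live in these districts.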
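
The size of this gap can be computed rather than read off the box plot. The sketch below uses invented prices for two district codes, so the resulting premiums are illustrative only; the `subway` column is assumed to hold 1 for homes near a subway and 0 otherwise.

```python
import pandas as pd

# Made-up prices for district codes 8 and 5, with and without a subway nearby.
df = pd.DataFrame({
    "district": [8, 8, 8, 8, 5, 5, 5, 5],
    "subway":   [1, 1, 0, 0, 1, 1, 0, 0],
    "price":    [82000, 78000, 74000, 70000, 33000, 31000, 28000, 26000],
})

# Median unit price per (district, subway) pair, with subway as columns.
median_by_subway = (
    df.groupby(["district", "subway"])["price"].median().unstack("subway")
)

# Premium = median price with subway minus median price without.
premium = median_by_subway[1] - median_by_subway[0]

print(premium)
```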

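3. Construction year vs. unit price (unit: 10,000 yuan per square meter)

Homes built in the most recent years and old homes are generally more expensive than other relatively new homes. This may be a matter of location: old homes are mostly in the old city, where prices are higher.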

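4. Construction year vs. total price (unit: million yuan per home)

5. Which floors do people prefer?

Homes on the 5th and 6th floors are the most sought after.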
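
The appendix extracts the floor number by splitting the `floor` text on a space and keeping the second part. The sketch below does the same with vectorized string methods; the sample strings are assumptions about the format, and `floor_number` is a hypothetical helper column.

```python
import pandas as pd

# Made-up rows; the "label number" format of floor is an assumption.
df = pd.DataFrame({
    "floor": ["高 26", "低 6", "中 6", "未知", "顶 5"],
    "followers": [12, 30, 25, 3, 18],
})

# Split on the space; rows without a second part get a missing value there.
parts = df["floor"].str.split(" ", expand=True)

# Convert the second part to a number; anything unparseable becomes NaN.
df["floor_number"] = pd.to_numeric(parts[1], errors="coerce")

# Keep only rows with a usable floor number, then compare follower counts.
clean = df.dropna(subset=["floor_number"])
median_followers = clean.groupby("floor_number")["followers"].median()

print(median_followers)
```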

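6. The effect of five-year ownership on prices

The effect of policy is visible here: homes that have not yet reached five years of ownership generally have a higher unit price.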

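VI. Summary

The above is a Python visualization of Beijing housing price data, and it reveals some interesting information. I had planned to show some of this on a heat map, but because of the Surveying and Mapping Law, the Beijing map data in echarts.js can no longer be used. If anyone knows of another tool, please leave me a comment. Suggestions for improvement are welcome, and thank you!

Appendix

All the code for this article: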

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

file_path = "../input/lianjia/new.csv"

file_content = pd.read_csv(file_path,encoding='gbk')

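# Reload with tradeTime as a datetime index, then sort chronologically.
# sort_index() returns a new DataFrame, so the result has to be assigned.
file_temporary = pd.read_csv(file_path,encoding='gbk',index_col='tradeTime',parse_dates=True)

df2 = file_temporary.sort_index()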

plt.figure(figsize=(30,6))
sns.lineplot(data=df2['price'],label='the trend of housing price in beijing')

df = pd.DataFrame(file_content)
with_missing_value = [col for col in df.columns if df[col].isnull().any()]

count_missing_null_column = (df.isnull().sum())

df = df.drop(with_missing_value,axis=1)

df['price'].shape

sns.regplot(x=df.loc[0:10000,'square'],y=df.loc[0:10000,'totalPrice'])

df_copy= df.copy()

# Map the numeric district codes to names in one step. This avoids chained
# assignment, which pandas warns about and which may not update df_copy.
district_names = {
    1: 'Dongcheng', 2: 'Fengtai', 3: 'Yizhuang', 4: 'Daxing',
    5: 'Fangshan', 6: 'Changping', 7: 'Chaoyang', 8: 'Haidian',
    9: 'Shijingshan', 10: 'Xicheng', 11: 'Tongzhou', 12: 'Mentougou',
    13: 'Shunyi',
}
df_copy['district'] = df['district'].map(district_names)

sns.lmplot(x="square", y="totalPrice", hue="district", data=df_copy)

plt.figure(figsize=(30,6))
sns.swarmplot(x=df_copy.loc[0:10000,'district'],
              y=df_copy.loc[0:10000,'price'])

plt.figure(figsize=(30,6))
sns.boxplot(x=df_copy.loc[0:10000,'district'], y=df_copy.loc[0:10000,'price'])

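# Median unit price per district (codes 1-13). The list is named price_medians
# rather than len so that the built-in len() stays usable later in the script.
price_medians = []
for i in range(1, 14):
    price_medians.append(df.price[df.district == i].median())
print(price_medians)

plt.figure(figsize=(16,6))
sns.barplot(x=['Dongcheng','Fengtai','Yizhuang','Daxing','Fangshan','Changping',
               'Chaoyang','Haidian','Shijingshan','Xicheng','Tongzhou','Mentougou',
               'Shunyi'],
            y=price_medians)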

plt.figure(figsize=(20,6))
sns.swarmplot(x=df_copy.loc[0:10000,'district'],
              y=df_copy.loc[0:10000,'totalPrice'])

# Median total price per district (codes 1-13).
total_len = []
for i in range(1, 14):
    total_len.append(df.totalPrice[df.district == i].median())

print(total_len)
plt.figure(figsize=(16,6))
sns.barplot(x=['Dongcheng','Fengtai','Yizhuang','Daxing','Fangshan','Changping',
               'Chaoyang','Haidian','Shijingshan','Xicheng','Tongzhou','Mentougou',
               'Shunyi'],
            y=total_len)

plt.figure(figsize=(20,6))
sns.swarmplot(x=df_copy.loc[0:10000,'district'],
              y=df_copy.loc[0:10000,'square'])

plt.figure(figsize=(20,6))
sns.boxplot(x=df_copy.loc[0:10000,'district'],
            y=df_copy.loc[0:10000,'square'])

# Median floor area per district (codes 1-13).
total_square = []
for i in range(1, 14):
    total_square.append(df.square[df.district == i].median())

plt.figure(figsize=(20,6))
sns.regplot(x=df.loc[0:10000,'followers'],y=df.loc[0:10000,'totalPrice'])

pd_construct_time = pd.read_csv(file_path,encoding='gbk',index_col='constructionTime',parse_dates=True)

df_construct = pd.DataFrame(pd_construct_time)

plt.figure(figsize=(40,6))
sns.lineplot(data=df_construct["price"])

plt.figure(figsize=(40,6))
sns.lineplot(data=df_construct["totalPrice"])

plt.figure(figsize=(40,6))
sns.boxplot(x="constructionTime", y="totalPrice", data=df)

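# Note: these two plots need the subway and fiveYearsProperty columns. If either
# was among the columns with missing values dropped earlier, build df_copy from
# the data before that drop.
plt.figure(figsize=(40,20))
ax = sns.boxplot(x="district", y="price", hue="subway",data=df_copy, palette="Set3")

plt.figure(figsize=(40,20))
ax = sns.boxplot(x="district", y="price", hue="fiveYearsProperty",data=df_copy, palette="Set3")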

plt.figure(figsize=(70,20))
ax = sns.boxplot(x="district", y="price", hue="buildingStructure",data=df, palette="Set3")

plt.figure(figsize=(50,6))
sns.boxplot(x=df.loc[0:10000,'floor'],y=df.loc[0:10000,'followers'])

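# Split each floor value on the space and keep the second part together with
# the matching follower count; values without exactly two parts are skipped.
first = []
second = []
for i, j in zip(df.floor, df.followers):
    temp = i.split(" ")
    if len(temp) == 2:
        first.append(temp[1])
        second.append(j)

output = pd.DataFrame({
    'floor': first,
    'followers': second
})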

output.to_csv('floor-followers.csv',index=False)

floor_price = pd.read_csv('floor-followers.csv')

df = pd.DataFrame(floor_price)

df_sorted=df.sort_values(by='floor')

plt.figure(figsize=(30,30))
sns.boxplot(x="floor",y="followers",data=df_sorted)
