Python data analysis and visualization (real estate data)

Data link: https://pan.baidu.com/s/1I0w4129XYEW2Iwvc4rm1pA 
Extraction code: hdc3 

Note:

I crawled the data myself. If anyone wants to see how, I will post the crawler part in a later update.

Table of contents

Foreword (data source)

1. Data processing
   1. Data import
   2. Data preprocessing

2. Feature extraction
   1. Data standardization
   2. LDA topic classification and model optimization

3. Data analysis and visualization
   1. K-means clustering and model tuning
   2. Visualization (using pyecharts)
      1. The top 10 communities with the highest housing prices in Beijing
      2. The top 10 communities with the lowest housing prices in Beijing and Yanjiao
      3. The average housing price of communities in Beijing and its surrounding districts
      4. Distribution of real estate housing prices in various regions of the Beijing metropolitan area
      5. Distribution of real estate popularity levels in various regions of the Beijing metropolitan area
      6. Comment popularity distribution (pie chart, by administrative district)
      7. Heatmap of popularity level (by price level)

Summary and Epilogue


Foreword (data source)

Fangtianxia website: https://newhouse.fang.com/house/s/b81-b91/

This is a small data visualization project I put together myself. I hope it helps; if you have any questions, leave a message in the comment area~

1. Data processing

1. Data import

import pandas as pd
import numpy as np  
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('北京小区数据信息.csv')
df.head()

2. Data preprocessing

Remove the non-Beijing areas from the data

(The site recommends some listings outside Beijing based on what Beijing users look for; I only want Beijing and its immediate surroundings)

df = df[~df.所在区.isin(['非北京周边','海阳城区','宝坻','秦皇岛','永清','涞水','怀来','天津','霸州','大厂','廊坊','涿州','固安','崇礼'])]  # ~ negates the boolean mask (the unary - form is deprecated)
df = df[~df.均价.isin(['价格待定元/㎡'])]  # drop listings whose price is still TBD

Let's process the data and add some more meaningful columns:

1. The average price of real estate in various regions of Beijing

The price data comes in two formats: xxx 元/㎡ (yuan per square meter), and starting from xxxx 万元/套 (ten thousand yuan per unit);

First handle the xxx 元/㎡ data

estate_single = df[df['均价'].str.contains('元/㎡')].copy()  # listings priced per square meter
estate_single['均价'] = [int(i.split('元/㎡')[0]) for i in estate_single['均价']]
estate_mean = estate_single[['所在区', '均价']].groupby('所在区').mean()
estate_mean.reset_index(inplace=True)
estate_mean

Next, handle the xxxx 万元/套 data

import re
estate_tao = df[df['均价'].str.contains('套')].reset_index(drop=True)
strinfo = re.compile('万元|/套|起')
# strip the Chinese characters, leaving only the number

estate_tao['均价'] = estate_tao['均价'].apply(lambda x: strinfo.sub('', x))
estate_tao['均价'] = estate_tao['均价'].astype(int)
# convert str to int

Let's take a look:

estate_tao.sort_values(by='均价', ascending=False).head(10)
# rank the listings that are priced per unit (套)

Converting the xxx 万元/套 prices into xxx 元/㎡ is not rigorous, so if data quality matters to you, it is better to simply drop the per-unit prices and analyze only the 元/㎡ data.
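If you take that stricter route, here is a minimal sketch (df_strict is a hypothetical name; it assumes df still holds the raw 均价 strings filtered above):

# stricter route: keep only the per-square-meter listings and drop the 套 ones
df_strict = df[~df['均价'].str.contains('套')].copy()
df_strict['均价'] = df_strict['均价'].str.replace('元/㎡', '', regex=False).astype(int)

In this post, though, I keep the per-unit data and estimate the unit prices manually: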

# since these properties display a per-unit (套) price, the unit price has to be estimated manually based on reality
for i in range(len(estate_tao)):
    if estate_tao.loc[i, '小区名称'] == '恒大丽宫':
        estate_tao.loc[i, '均价'] *= 5    # 恒大丽宫 units are extremely large
    elif estate_tao.loc[i, '均价'] >= 1500:
        estate_tao.loc[i, '均价'] *= 30   # e.g. 圆明天颂 has large floor plans, so cheaper per square meter
    elif estate_tao.loc[i, '均价'] <= 1000 and estate_tao.loc[i, '所在区'] != '朝阳' and estate_tao.loc[i, '所在区'] != '海淀':
        estate_tao.loc[i, '均价'] *= 100  # e.g. 兴创荣墅
    elif estate_tao.loc[i, '均价'] <= 1000 and estate_tao.loc[i, '所在区'] == '朝阳':
        estate_tao.loc[i, '均价'] *= 130  # e.g. 北京书院 has small floor plans, so pricier per square meter
    elif estate_tao.loc[i, '均价'] <= 1000 and estate_tao.loc[i, '所在区'] == '海淀':
        estate_tao.loc[i, '均价'] *= 130
    elif 1000 < estate_tao.loc[i, '均价'] < 1500:
        estate_tao.loc[i, '均价'] *= 55   # e.g. 玖瀛府, larger floor plans
    else:
        estate_tao.loc[i, '均价'] *= 75

Combine the two processed datasets (元/㎡ and 元/套):

estate = pd.concat([estate_single, estate_tao], axis=0)
estate['小区名称'].value_counts()  # check whether there are any duplicates

 

Remove duplicates:

estate = estate.drop_duplicates()
# drop duplicate rows

2. Feature extraction

1. Data standardization

The main goal is to derive a popularity index for each property, so the comment counts are standardized and popularity is determined by the number of comments:

I normalize the comment counts to the 0-1 range: the maximum count maps to 1, the minimum to 0, and everything in between is scaled (mapped) linearly into that interval. The normalized count is then stored as a standardized-popularity column in the dataset.
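Min-max scaling is just (x - min) / (max - min); here is a hand-rolled sketch of what MinMaxScaler computes in the block below:

# manual min-max scaling, equivalent to MinMaxScaler(feature_range=(0, 1))
comments = estate['评论数'].astype(float)
manual_scaled = (comments - comments.min()) / (comments.max() - comments.min())
print(manual_scaled.head())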

from sklearn import preprocessing
a = pd.DataFrame(estate['评论数'])
min_max_normalizer = preprocessing.MinMaxScaler(feature_range=(0,1))
# feature_range sets the bounds of the transformed values, default (0,1)
scaled_data = min_max_normalizer.fit_transform(a)
# scale (map) the data into the fixed interval
price_frame_normalized = pd.DataFrame(scaled_data)
# wrap the transformed data in a DataFrame
print(price_frame_normalized)

# new column - standardized popularity (stores the property's popularity value);
# assign the raw array rather than price_frame_normalized, because estate's index
# is no longer 0..n-1 after filtering, so index alignment would produce NaNs
estate['标准化热度'] = scaled_data

# keep a copy of the data; it will come in handy for later analysis
import copy
estate1 = copy.deepcopy(estate)
estate1

2. LDA topic classification and model optimization

Looking at the dataset as a whole, there is a district column, but it takes 19 distinct values, which does not visualize intuitively. Moreover, every district contains both well-connected properties and fairly remote ones, so I decided to run LDA topic classification on the detailed address text to derive a location type for each property.

First, replace the empty-string rows with NaN, then delete them;

# step 1: replace empty-string cells with NaN so they can be dropped
estate['详细地址'].replace(to_replace=r'^\s*$', value=np.nan, regex=True, inplace=True)
# also set addresses containing Latin letters to NaN
estate['详细地址'].replace(to_replace=r'[a-zA-Z]', value=np.nan, regex=True, inplace=True)
print(estate['详细地址'])

# step 2: drop every row that contains NaN
estate.dropna(axis=0, how='any', inplace=True)

Then strip the digits from the addresses: specific numbers do not help LDA topic classification, and because their units and meanings are hard to tell apart, they can easily hurt the results;

# strip digits and the literal "北京"
strinfo = re.compile('[0-9]|北京')
estate['详细地址'] = estate['详细地址'].apply(lambda x: strinfo.sub('', x))


Then use jieba to segment the Chinese text of each row:

import jieba
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
# convert to str, otherwise jieba raises "'float' object has no attribute 'decode'"
estate = pd.DataFrame(estate['详细地址'].astype(str))

def chinese_word_cut(mytext):
    return ' '.join(jieba.cut(mytext))

# add a column with the segmented text
estate['content_cutted'] = estate['详细地址'].apply(chinese_word_cut)
print(estate.content_cutted.head())

Then build a TfidfVectorizer to compute the TF-IDF values of the vocabulary for the LDA analysis that follows:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# number of features
n_features = 2000


tf_vectorizer = TfidfVectorizer(strip_accents='unicode',
                                max_features=n_features,
                                max_df=0.99,
                                min_df=0.002)  # drop words that appear in too many or too few documents
tf = tf_vectorizer.fit_transform(estate.content_cutted)

print(tf.shape)
print(tf)

 


Next, run the LDA analysis with the number of topics set to 3:

from sklearn.decomposition import LatentDirichletAllocation

# number of topics
n_topics = 3

# note: old sklearn versions call this parameter n_topics instead of n_components
lda = LatentDirichletAllocation(n_components=n_topics,
                                max_iter=100,
                                learning_method='online',
                                learning_offset=50,
                                random_state=0)
lda.fit(tf)

# topic-word matrix (the equivalent of model.topic_word_)
print(lda.components_)
# one row per topic, one column per keyword
print(lda.components_.shape)

# topic - keyword distribution
def print_top_words(model, tf_feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):  # lda.components_ is the topic-word matrix
        print('Topic #%d:' % topic_idx)
        print(' '.join([tf_feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]))
        print("")


n_top_words = 10
tf_feature_names = tf_vectorizer.get_feature_names()  # use get_feature_names_out() on sklearn >= 1.0
# call the function
print_top_words(lda, tf_feature_names, n_top_words)

 

Finally, visualize the LDA topic model to get a more intuitive feel for the classification results:

import pyLDAvis
import pyLDAvis.sklearn  # the model comes from scikit-learn, so import the sklearn adapter
keshihua_data = pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
pyLDAvis.display(keshihua_data)

The classification results show that the detailed addresses of all properties split cleanly into 3 categories, each with its own topic keywords. The detailed interpretation is given in the data analysis section.
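To actually attach the resulting location type to each property, a small sketch (the 主题 column name is my own invention; it assumes lda and tf from the steps above):

# assign each address its dominant LDA topic (0, 1 or 2)
doc_topic = lda.transform(tf)              # document-topic probabilities, shape (n_docs, 3)
estate['主题'] = doc_topic.argmax(axis=1)  # hypothetical column: most likely topic per address
print(estate['主题'].value_counts())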

3. Data Analysis and Visualization

1. K-means clustering and model tuning

I want to use the dataset to see whether a property's popularity is related to its price, and whether all properties can be grouped into a few representative types based on price and popularity.

estate1 is the DataFrame copied earlier (the code is at the end of 2. Feature extraction, 1. Data standardization).
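Before clustering, a quick sanity check on the price-popularity relationship (a sketch, assuming 均价 and 评论数 in estate1 are both numeric):

# Pearson correlation between average price and comment count
print(estate1['均价'].corr(estate1['评论数']))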

estate1['评论数'] = estate1['评论数'] * 100  # scale comment counts up so they sit on a scale comparable to prices
wnc = list(estate1.groupby(['均价', '评论数']).groups.keys())  # unique (price, comment count) pairs

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
X = wnc
# K-means clustering
clf = KMeans(n_clusters=3)
y_pred = clf.fit_predict(X)
print(clf)
print(y_pred)
# visualization
x = [n[0] for n in X]
y = [n[1] for n in X]
plt.scatter(x, y, c=y_pred, marker='x')
plt.title("Kmeans-House Data")
plt.xlabel("price")
plt.ylabel("redu")
plt.legend(["house"])
plt.show()

 

It turns out that clustering into three categories does not work well: the properties simply split into high, medium and low price bands, and the popularity dimension barely shows up in the grouping.

After adjusting the number of clusters to 5, the clustering looks much better:

from sklearn.cluster import KMeans
X = wnc
# K-means clustering, now with 5 clusters
clf = KMeans(n_clusters=5)
y_pred = clf.fit_predict(X)
print(clf)
print(y_pred)
# visualization
x = [n[0] for n in X]
y = [n[1] for n in X]
plt.scatter(x, y, c=y_pred, marker='x')
plt.title("Kmeans-House Data")
plt.xlabel("price")
plt.ylabel("redu")
plt.legend(["house"])
plt.show()

The cluster plot divides the dataset into: light green for popular and relatively affordable properties; cyan in the lower left for cheap but less popular ones; purple for low-popularity properties in the mid price range; yellow for moderately popular mid-to-high-end properties; and dark blue in the lower right for ultra-high-end, unpopular properties.

The cluster plot shows that the popular properties all average below 130,000 yuan/㎡, with most of them concentrated between 50,000 and 100,000 yuan/㎡.
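Rather than eyeballing k, a common alternative (a sketch, not part of the original analysis) is the elbow method: plot the within-cluster sum of squares for a range of k and look for the bend:

# elbow method: inertia (within-cluster sum of squares) for k = 1..9
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    inertias.append(km.inertia_)
plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()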

2. Visualization (using pyecharts)

Working from this small dataset, let's draw some meaningful charts.

# scale the comment counts back down (e.g. 154 → 1.54)
estate1['评论数'] = estate1['评论数'] / 100

from pyecharts.charts import Bar
from pyecharts.commons.utils import JsCode
from pyecharts import options as opts  # import the configuration options

high_top10 = estate1[['小区名称','均价']].sort_values(by='均价', ascending=False)[:10]
# top 10 highest prices

low_top10 = estate1[['小区名称','均价']].sort_values(by='均价', ascending=True)[:10]
# top 10 lowest prices

Time to draw! If you want different colors, just search online for the pyecharts (ECharts) color codes.

1. The top 10 communities with the highest housing prices in Beijing

color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#8d4653'}, {offset: 1, color: '#8d4653'}], false)"""
bar = (
    Bar()
    .add_xaxis(high_top10['小区名称'].values.tolist())
    .add_yaxis("均价", high_top10['均价'].tolist(),itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
    .set_global_opts(title_opts=opts.TitleOpts(title='北京市房价最高的前10个小区',pos_top='2%',pos_left = 'center'),
            legend_opts=opts.LegendOpts(is_show=False),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-45)),
            yaxis_opts=opts.AxisOpts(name="均价",name_location='middle',name_gap=50,name_textstyle_opts=opts.TextStyleOpts(font_size=16)))
)
bar.render_notebook()

 

2. The top 10 communities with the lowest housing prices in Beijing and Yanjiao 

color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#FFFFFF'}, {offset: 1, color: '#8085e8'}], false)"""
bar = (
    Bar()
    .add_xaxis(low_top10['小区名称'].values.tolist())
    .add_yaxis("均价", low_top10['均价'].tolist(),itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
    .set_global_opts(title_opts=opts.TitleOpts(title='北京市及燕郊房价最低的前10个小区',pos_top='2%',pos_left = 'center'),
            legend_opts=opts.LegendOpts(is_show=False),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-45)),
            yaxis_opts=opts.AxisOpts(name="均价",name_location='middle',name_gap=50,name_textstyle_opts=opts.TextStyleOpts(font_size=16)))
)
bar.render_notebook()

3. The average housing price of communities in Beijing and its surrounding districts

meanprice = estate1[['所在区', '均价']].groupby('所在区').mean()
color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#FFFFFF'}, {offset: 1, color: '#f7a35c'}], false)"""
xdata = list(meanprice.index)
ydata = [int(price) for price in meanprice['均价'].values.tolist()]
bar = (
    Bar()
    .add_xaxis(xdata)
    .add_yaxis("均价", ydata,itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
    .set_global_opts(title_opts=opts.TitleOpts(title='北京市及周边各区小区房价均价',pos_top='2%',pos_left = 'center'),
            legend_opts=opts.LegendOpts(is_show=False),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-45)),
            yaxis_opts=opts.AxisOpts(name="均价",name_location='middle',name_gap=50,name_textstyle_opts=opts.TextStyleOpts(font_size=16)))
)
bar.render_notebook()

 

4. Distribution of real estate housing prices in various regions of the Beijing metropolitan area

First, classify the properties by price level:

bins = [0, 5000, 10000, 50000, 100000, 150000, 1000000]
group_names = ['低的档','中低档', '中档', '中高档', '高的档','超高档']
estate1['小区级别'] = pd.cut(estate1['均价'], bins, labels=group_names)
estate1.head()

Then draw:

# use == rather than substring matching so the level labels cannot shadow each other
estate1['超高档'] = estate1['小区级别'].apply(lambda x: 1 if x == '超高档' else 0)
estate1['高的档'] = estate1['小区级别'].apply(lambda x: 1 if x == '高的档' else 0)
estate1['中高档'] = estate1['小区级别'].apply(lambda x: 1 if x == '中高档' else 0)
estate1['中档'] = estate1['小区级别'].apply(lambda x: 1 if x == '中档' else 0)
estate1['中低档'] = estate1['小区级别'].apply(lambda x: 1 if x == '中低档' else 0)
estate1['低的档'] = estate1['小区级别'].apply(lambda x: 1 if x == '低的档' else 0)
qu_huxing = estate1[['所在区','低的档','中低档', '中档', '中高档', '高的档','超高档']].groupby('所在区').sum()
# draw
bar = (
    Bar()
    .add_xaxis(list(qu_huxing.index))
    .add_yaxis("低的档", qu_huxing['低的档'].values.tolist(),stack='stack1')
    .add_yaxis("中低档", qu_huxing['中低档'].values.tolist(),stack='stack1')
    .add_yaxis("中档", qu_huxing['中档'].values.tolist(),stack='stack1')
    .add_yaxis("中高档", qu_huxing['中高档'].values.tolist(),stack='stack1')
    .add_yaxis("高的档", qu_huxing['高的档'].values.tolist(),stack='stack1')
    .add_yaxis("超高档", qu_huxing['超高档'].values.tolist(),stack='stack1')
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(title_opts=opts.TitleOpts(title='北京都市圈各区域楼盘房价级别分布',pos_top='2%',pos_left = 'center'),
                    xaxis_opts=opts.AxisOpts(name='不同级别',axislabel_opts=opts.LabelOpts(rotate=-45)),       
                    yaxis_opts=opts.AxisOpts(name='小区数量'),
                    legend_opts=opts.LegendOpts(is_show=True, pos_top = '15%'))
)
bar.render_notebook()

 

5. Distribution of real estate popularity levels in various regions of the Beijing metropolitan area

bins = [0, 100, 500, 1000, 2000]  # cap the bins at 2000 comments
group_names = ['冷门', '一般', '热门', '非常热门']
estate1['热度级别'] = pd.cut(estate1['评论数'], bins, labels=group_names)
estate1.head()

estate1['非常热门'] = estate1['热度级别'].apply(lambda x: 1 if x == '非常热门' else 0)
estate1['热门'] = estate1['热度级别'].apply(lambda x: 1 if x == '热门' else 0)  # == so that '热门' does not also match '非常热门'
estate1['一般'] = estate1['热度级别'].apply(lambda x: 1 if x == '一般' else 0)
estate1['冷门'] = estate1['热度级别'].apply(lambda x: 1 if x == '冷门' else 0)

qu_huxing = estate1[['所在区', '冷门', '一般', '热门', '非常热门']].groupby('所在区').sum()
# draw
bar = (
    Bar()
    .add_xaxis(list(qu_huxing.index))
    .add_yaxis("冷门", qu_huxing['冷门'].values.tolist(),stack='stack1')
    .add_yaxis("一般", qu_huxing['一般'].values.tolist(),stack='stack1')
    .add_yaxis("热门", qu_huxing['热门'].values.tolist(),stack='stack1')
    .add_yaxis("非常热门", qu_huxing['非常热门'].values.tolist(),stack='stack1')
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(title_opts=opts.TitleOpts(title='北京都市圈各区域楼盘热度级别分布',pos_top='2%',pos_left = 'center'),
                    xaxis_opts=opts.AxisOpts(name='不同热度',axislabel_opts=opts.LabelOpts(rotate=-45)),       
                    yaxis_opts=opts.AxisOpts(name='小区数量'),
                    legend_opts=opts.LegendOpts(is_show=True, pos_top = '15%'))
)
bar.render_notebook()

6. Comment popularity distribution (pie chart, by administrative district)

This data was entered manually. If you'd rather not type it all out, you can run value_counts() on the data and copy the output.
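A hedged alternative to typing the numbers in (a sketch; note that 评论数 was rescaled earlier, so scale it back if you want raw counts):

# derive the pie-chart data from the DataFrame instead of entering it manually
import numpy as np
district_comments = estate1.groupby('所在区')['评论数'].sum()
labels = np.array(district_comments.index)
sizes = np.array(district_comments.values)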

import plotly.offline
import plotly.graph_objs
import cufflinks as cf
import numpy as np
cf.go_offline()  # these two lines configure offline chart generation
cf.set_config_file(offline=True, world_readable=True)

labels=np.array(['东城', '丰台', '北京周边', '大兴', '密云', '平谷', '延庆', '怀柔', '房山', '昌平', '朝阳','海淀', '燕郊', '石景山', '西城', '通州', '门头沟', '顺义', '香河'])
sizes=np.array([136,5065,5927,9920,1516,988,277,629,6081,3959,5402,1412,2541,2042,234,4668,2910,4482,2126])
# build the trace and configure its parameters
trace = plotly.graph_objs.Pie(labels=labels, values=sizes)
layout = plotly.graph_objs.Layout(title='评论热度分布')
# put the trace in a list
data = [trace]
# combine the data part and the layout part into a Figure object
fig = plotly.graph_objs.Figure(data=data, layout=layout)
# plotly.offline.iplot embeds the generated figure in the notebook
plotly.offline.iplot(fig)

7. Heatmap of popularity level (by price level)

estate_cor = estate1[['小区级别', '热度级别', '小区名称']].groupby(['小区级别', '热度级别']).count()
estate_cor.reset_index(inplace=True)
estate_cor.rename(columns={'小区名称':'小区数'}, inplace=True)
estate_cor['小区数'].fillna(value=0, inplace=True)
estate_cor

  

import seaborn as sns
import matplotlib.pyplot as plt

sns.set(font="simhei")

plt.figure(figsize=(10,6), dpi=100)
estate_cor['小区数'] = [int(i) for i in estate_cor['小区数']]
hmap = estate_cor.pivot(index="小区级别", columns="热度级别", values="小区数")
sns.heatmap(hmap, annot=True, fmt="d",cmap="OrRd")

 

Summary and Epilogue

This article focused mainly on the visualization part; more in-depth content, such as loading the data into a database and a Django dashboard visualization, will come in later updates.
If you have any questions, just leave a message in the comment area~~~

Origin: blog.csdn.net/weixin_50706330/article/details/127039274