Python Yichang housing data acquisition and analysis

1. Preface
I had nothing to do over the weekend and have been thinking about housing lately, so I decided to write a blog post, both to review what I learned before and to offer a little inspiration to colleagues who need to buy a house. The workload turned out to be considerable; it's a small project in its own right. There are two main parts: acquiring housing data for Yichang (Yiling District), and analyzing that data to surface some interesting problems and phenomena.
2. Data Acquisition
The first thing that comes to mind for getting data off the Internet is, of course, writing a crawler. I only write simple ones; the Scrapy framework favored by the experts can wait, but simple requests are enough for routine crawling (movie listings, pictures, comments, and so on). I should thank Lianjia here: it is unusually friendly to crawler novices, who don't need to worry about crashing someone's site. It's as if Lianjia were saying, "It's fine, crawl whatever resources you want from my site; if it goes down, that's on me." The crawler code follows. I won't over-explain it: if you're familiar with this, you'll follow along; if not, just enjoy the show O(∩_∩)O~

import requests
from lxml import etree
import pandas as pd
import re

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3868.400'
}
class House_price_analyze:
    def __init__(self, city_name, district_name, house_num):
        self.city_name = city_name
        self.district_name = district_name
        self.house_num = house_num
        self.urls = []

    def Page_urls(self):
        # Lianjia shows 30 listings per results page; walk the paginated
        # listing pages and collect every listing's detail-page URL.
        for i in range(1, int(self.house_num / 30) + 1):
            out_url = 'https://' + self.city_name + '.lianjia.com/ershoufang/' + self.district_name + '/pg' + str(i) + '/'
            response = requests.get(out_url, headers=headers)
            res_xpath = etree.HTML(response.text)
            inner_url = res_xpath.xpath("//ul[@class='sellListContent']/li/a/@href")
            for item in inner_url:
                self.urls.append(item)

    def Basic_information(self):
        # Visit each detail page and pull out the fields of interest with regexes.
        self.Data = []
        self.success_count = 0
        self.fail_count = 0
        for url in self.urls:
            response = requests.get(url, headers=headers)
            try:
                introduce = re.findall(r"<title>(.*?)</title>", response.text)[0]
                community_name = re.findall(r'class="info no_resblock_a">(.*?)</a>', response.text)[0]
                main_info = re.findall(r'<div class="mainInfo">(.*?)</div>', response.text)[0]
                area = re.findall(r'</div><div class="area"><div class="mainInfo">(.*?)</div>', response.text)[0]
                sub_info = ''.join(re.findall(r'<div class="subInfo">(.*?)</div>', response.text))
                unit_price = re.findall(r'<span class="unitPriceValue">(.*?)<i>', response.text)[0]
                total_price = re.findall(r'<span class="total">(.*?)</span>', response.text)[0]
                pattern = re.findall(r"resblockPosition:'(.*?)'", response.text)[0]
                longitude = pattern.split(',')[0]
                latitude = pattern.split(',')[1]

                data = {
                    'introduce': introduce,
                    'community_name': community_name,
                    'main_info': main_info,
                    'area': area,
                    'sub_info': sub_info,
                    'unit_price': unit_price,
                    'total_price': total_price,
                    'longitude': longitude,
                    'latitude': latitude
                }
                self.success_count += 1
                print(data, 'listing {} scraped successfully'.format(self.success_count))
                self.Data.append(data)
            except Exception:
                # A regex found nothing (layout differs, listing removed, etc.).
                self.fail_count += 1
                print('exception on listing, {} failures so far'.format(self.fail_count))

        house_data = pd.DataFrame(self.Data)
        house_data.to_csv(self.city_name + '_' + self.district_name + '_Basic_infomation.csv')

Yichang = House_price_analyze('yichang', 'yilingqu', 1200)
Yichang.Page_urls()
Yichang.Basic_information()
print('Done: {} listings scraped successfully, {} failed'.format(Yichang.success_count, Yichang.fail_count))

About 1,200 listings were collected for Yiling District. The fields include the listing title, community name, floor plan, building height, building structure, decoration status (hardcover or not), area, total price, unit price, and geographic location (longitude and latitude). The data shown in the figure below is actually not the raw crawled data: I accidentally deleted the sub_info column while stripping the "square meter" suffix from the area column. I didn't want to crawl again (it took more than half an hour (┭┮﹏)┭┮), and since this has little impact on the later analysis, we work with this version.
Property data obtained
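Incidentally, the "square meter" cleanup mentioned above can be sketched as follows (a minimal sketch with toy values; the real column comes from the crawled CSV, and I'm assuming the suffix is the literal text '平米'):

```python
import pandas as pd

# Toy stand-in for the crawled area column; the '平米' suffix is an assumption.
house_data = pd.DataFrame({'area': ['126.5平米', '89平米']})
house_data['area'] = (
    house_data['area']
    .str.replace('平米', '', regex=False)  # strip the "square meter" suffix
    .astype(float)                         # make the column numeric for describe()/plots
)
print(house_data['area'].tolist())  # [126.5, 89.0]
```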

3. Data Analysis
First, remove duplicates with the drop_duplicates method: 44 duplicate rows go, leaving 1,156.
1. Start with an overview of the three columns we care about: area, unit price, and total price.

import pandas as pd
import seaborn as sns

house_data = pd.read_csv(r'yichang_yilingqu_Basic_infomation.csv')
house_data = house_data.drop_duplicates(subset = 'introduce')
house_data_describe = house_data[['area','unit_price','total_price']]
print(house_data_describe.describe())

Running it gives the table below. The current average unit price in Yiling District is 6,841 yuan per square meter, and the average total price is 871,400 yuan (total_price is in units of 10,000 yuan).

              area    unit_price  total_price
count  1156.000000   1156.000000  1156.000000
mean    126.142189   6841.746540    87.146946
std      52.835946   1403.095185    49.650589
min      34.000000   3424.000000    16.000000
25%      96.460000   5938.750000    63.000000
50%     122.000000   6792.500000    79.800000
75%     137.025000   7613.250000    95.650000
max     738.680000  16322.000000   665.000000

2. Draw a scatter plot of area against total price to get a feel for the whole dataset. As shown below, a few outlying samples stand out, such as listings over 700 square meters or priced above 4 million yuan. Seeing them, I can't help wondering: what kind of house is more than 700 square meters? Are there really listings over 4 million yuan in Yiling District?

sns.scatterplot(x = 'area',y = 'total_price',data = house_data)

Scatter plot of total house price and area
Let's pull out the listings with an area above 400 square meters or a total price above 4 million yuan and see what's going on.

# total_price is in units of 10,000 yuan, so 400 means 4 million.
# Store the outliers separately so house_data stays intact for later analysis.
outliers = house_data[(house_data['area'] > 400) | (house_data['total_price'] > 400)]
print(outliers['introduce'])

Running it gives:

266    国宾一号别墅,诚心出售,独栋独院_宜昌夷陵区夷陵区国宾一号A区二手房(宜昌链家)
320    夷陵万达旁 稀有独栋别墅 户型好 有大花园_宜昌夷陵区夷陵区国宾一号A区二手房(宜昌链家)
465    私房整栋出售,一楼门面,二..四楼可出租可居住_宜昌夷陵区夷陵区黄金路25号二手房(宜昌链家)
600    国宾一号联排别墅 精装修业主自住房出租 可看一线湖景_宜昌夷陵区夷陵区国宾一号A区二手房(...
963    长江市场繁华地段,独栋别墅.大花园大露台_宜昌夷陵区夷陵区玛歌庄园二手房(宜昌链家)
1199   昌耀电力别墅区三层别墅 全新毛坯可以几代同堂_宜昌夷陵区夷陵区昌耀沁园二手房(宜昌链家)

From the results we can see these are basically all villas, so the high prices are reasonable. But one of them (No. 465) is an entire private building sold as a whole, shopfront on the first floor and rentable upper floors. You don't see that every day; as expected, poverty limits my imagination.

3. Seeing this, some readers may be curious how area and price are distributed for villas versus hardcover (fully decorated) homes. The plots below make this visual (I removed some outlying samples first, which makes the picture clearer).
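Note that the `villa` and `hardcover` columns used in the plot below are not created by the snippets shown earlier; a plausible reconstruction (my assumption: keyword matching on the listing title, '别墅' for villa and '精装' for hardcover) would be:

```python
import pandas as pd

# Toy listing titles; the real ones come from the crawled 'introduce' column.
house_data = pd.DataFrame({'introduce': ['国宾一号别墅,诚心出售', '精装三房看湖景', '简装两房']})
house_data['villa'] = house_data['introduce'].str.contains('别墅')      # villa keyword
house_data['hardcover'] = house_data['introduce'].str.contains('精装')  # hardcover keyword
print(house_data['villa'].tolist(), house_data['hardcover'].tolist())
```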

import matplotlib.pyplot as plt

f, [ax1, ax2] = plt.subplots(1, 2, figsize=(10, 5))
sns.set_style({'font.sans-serif': ['simhei', 'Arial']})  # font that can render Chinese labels

# Left: villas vs. ordinary homes; right: hardcover vs. basic decoration.
sns.scatterplot(x='area', y='total_price', hue='villa', data=house_data, ax=ax1)
ax1.set_title('房屋面积与总价的关系(别墅与商品房)', fontsize=10)
ax1.set_xlabel('房屋面积')
ax1.set_ylabel('房屋总价')

sns.scatterplot(x='area', y='total_price', hue='hardcover', data=house_data, ax=ax2)
ax2.set_title('房屋面积与总价的关系(精装与简装)', fontsize=10)
ax2.set_xlabel('房屋面积')
ax2.set_ylabel('房屋总价')

Running it: villas really are expensive; the cheapest is 1.5 million yuan, and most are above 2 million. Now for the hardcover samples: at the same area, hardcover homes are priced higher overall than non-hardcover ones. That's intuitive and easy to understand: decoration costs money, and even if the decoration itself were free, the construction workers' meals still cost money (`・ω・´)
4. Use a hexbin heat map to see the joint distribution of house area and selling price in Yiling District.

g = sns.jointplot(x=house_data['area'], y=house_data['total_price'], kind="hex", color="b")

Running it: house areas in Yiling District have two concentrated intervals, roughly 85 to 95 and 125 to 135 square meters. Total prices mostly concentrate between 750,000 and 1,000,000 yuan.
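The concentrated intervals can also be read off numerically rather than from the hex plot; a sketch with toy areas (assuming 10-square-meter bins):

```python
import pandas as pd

# Bin areas into 10-square-meter intervals and count listings per bin;
# the most crowded interval mirrors the darkest hex cells.
area = pd.Series([88, 90, 92, 128, 130, 131, 133, 134, 60])
bins = pd.cut(area, bins=range(30, 160, 10))
counts = bins.value_counts().sort_values(ascending=False)
print(counts.index[0])  # the busiest interval
```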
5. Use categorical counts to look at the sample distribution, answering two questions: what is the mix of floor plans on sale in Yiling District, and which communities have the most listings?

f, [ax1, ax2] = plt.subplots(1, 2, figsize=(30, 5))
sns.set_style({'font.sans-serif': ['simhei', 'Arial']})

# Count listings per floor plan and per community.
df_house_count_1 = house_data.groupby('main_info')['total_price'].count().sort_values(ascending=False).to_frame().reset_index()
df_house_count_2 = house_data.groupby('community_name')['total_price'].count().sort_values(ascending=False).to_frame().reset_index()

# Only show categories with more than 30 listings. (The original code filtered
# df_house_count_1 with df_house_count_2's mask, which mixes the two frames.)
sns.barplot(x='main_info', y='total_price', data=df_house_count_1[df_house_count_1['total_price'] > 30], ax=ax1)
ax1.set_title('各户型在售数量', fontsize=10)
ax1.set_xlabel('户型')
ax1.set_ylabel('数量(个)')

sns.barplot(x='community_name', y='total_price', data=df_house_count_2[df_house_count_2['total_price'] > 30], ax=ax2)
ax2.set_title('各小区在售数量', fontsize=10)
ax2.set_xlabel('小区')
ax2.set_ylabel('数量(个)')

Running it: the three-bedroom, two-hall floor plan has more than 600 listings, over 50% of everything on sale. As for listings per community, Evergrande Oasis has the richest supply with more than 70 units; the other major communities (Qingjiang Runcheng, Diaoyutai, and so on) each have around 50 listings on sale, which is a decent selection.
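The ">50%" share claim can be checked with a normalized count; a sketch (toy floor-plan strings standing in for the crawled main_info values):

```python
import pandas as pd

# Share of each floor plan among listings on sale.
main_info = pd.Series(['3室2厅', '3室2厅', '3室2厅', '2室2厅', '4室2厅'])
shares = main_info.value_counts(normalize=True)
print(round(shares['3室2厅'], 2))  # 0.6
```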
6. Market prices and listing counts are clearly related, which raises the question: which communities have the highest prices?

df = house_data[house_data['total_price'] > 80][['unit_price', 'community_name', 'total_price']]
# Rank communities by mean total price; the leading entries with too few
# listings (unrepresentative pricing) are sliced away.
df_1 = df.groupby('community_name')['total_price'].mean().sort_values(ascending=False)[3:10]

df_1.plot(kind='bar')
plt.title('总价最高小区')
plt.ylabel('均价(万)')

Running it (I excluded communities with very few listings, e.g. fewer than 5 on sale, since their pricing carries a large subjective bias and isn't representative): the top four communities by average price are Guobin No. 1, Hujin Garden, Wushi Villa, and Huayu Villa. On the Baidu map below you can see these communities all seem to sit on the banks of lakes and rivers; with villas on top of that, prices couldn't fall even if they wanted to. Just look and sigh ╮(╯﹏╰)╭.
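Excluding thinly-listed communities, as described above, can be done with a groupby filter; a sketch (toy data, using the 5-listing threshold mentioned):

```python
import pandas as pd

# Keep only communities with at least 5 listings, then rank by mean total price.
df = pd.DataFrame({
    'community_name': ['A', 'A', 'A', 'A', 'A', 'B'],
    'total_price': [100, 110, 120, 130, 140, 999],
})
filtered = df.groupby('community_name').filter(lambda g: len(g) >= 5)
ranking = filtered.groupby('community_name')['total_price'].mean().sort_values(ascending=False)
print(ranking.index.tolist())  # ['A']: community B is dropped (only one listing)
```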
7. Use the Baidu map API to display a heat map of housing prices across Yiling District. This part is more involved, but the final result looks great.
First, pull the longitude, latitude, and total price out of the data and convert them into JSON-style records that can be fed to the Baidu map API.

import pandas as pd

data = pd.read_csv("yichang_yilingqu_Basic_infomation.csv")
df = data[['total_price', 'longitude', 'latitude']]

# Write one {lat, lng, total_price} object literal per listing; the output is
# meant to be pasted into the `var points = [...]` array of the Baidu map demo.
with open('经纬度总价.json', 'w') as file:
    for total_price, lng, lat in df.values:
        file.write('{"lat":' + str(lat) + ',"lng":' + str(lng) + ',"total_price":' + str(total_price) + '},')

After the conversion you will see a file named 经纬度总价.json in the current folder. Next, call the Baidu Map API. First register a key on Baidu Maps' official open platform to get a call entry point. Once registered, open Baidu's heat-map example page, copy everything in its source-code editor, replace the key with your own, paste all the records from your JSON file into the `var points` array, and then tweak a few parameters to make the heat map prettier. I won't go into more detail here; if anything is unclear, give me a like and message me privately, O(∩_∩)O~
(1) Fill in the key.
(2) Fill in the coordinate and price data.
(3) Modify the parameters.
Running it: the darker the color, the higher the price. One thing I'm unsure of: do multiple samples at the same spot stack up and darken it? For example, if the average price in the Diaoyutai No. 1 community is 6,500 yuan per square meter, which isn't very high, could the color still turn dark just because there are many samples there? I don't know; I'll leave it like this for now. If anyone knows, please correct me!
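On that stacking question: point-density heat maps do generally accumulate overlapping points, so many cheap listings at one spot can read "hotter" than one expensive listing. A workaround (a sketch, not specific to the Baidu API) is to collapse listings to one point per location and use the mean unit price as the weight:

```python
import pandas as pd

# Toy coordinates: two listings share one spot, a third sits elsewhere.
df = pd.DataFrame({
    'longitude': [111.3201, 111.3203, 111.4000],
    'latitude': [30.7501, 30.7502, 30.8000],
    'unit_price': [6500, 6500, 12000],
})
# Round to 3 decimal places (roughly a 100 m grid) so nearby listings
# fall onto the same grid point.
df['lng_r'] = df['longitude'].round(3)
df['lat_r'] = df['latitude'].round(3)
points = df.groupby(['lng_r', 'lat_r'], as_index=False)['unit_price'].mean()
print(len(points))  # 2: the co-located pair collapses into one weighted point
```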

As the map shows, a large share of Yiling District's listings cluster near the Huangbai River, since that is the old town. At the same time there are many listings along Development Avenue, and their prices aren't low; perhaps that is where Yiling District's future development will focus. Of course, that's just a guess read off the prices.


4. Conclusion
Finally done! Much of this content is material I'd fumbled through piecemeal before, now written up in one place. Honestly, the coding process is tedious: some operations that look easy aren't easy to implement. For instance, I didn't really understand DataFrame's groupby function and had to search Baidu as I went, and none of it came easily. But looking back at these charts, I suddenly feel coding is a boring yet interesting process, and I kept learning and growing through it. Having finished, I'm more comfortable with crawlers, DataFrame operations, and seaborn plotting, which is no small gain. Next on my study plan is more data structures and algorithms, and I hope to keep blogging and sharing experience sooner rather than later. All happiness comes from ordinary hard work and persistence. Keep going, everyone~!

Origin blog.csdn.net/weixin_44086712/article/details/113087604