After several years of rapid housing price increases, the market has cooled under the country's vigorous regulation and is slowly stabilizing, with prices down from their peak. So what are prices now, and what will the next trend be? Here we can use Python to capture recent house price data for analysis.
Module installation
The following modules need to be installed; if they are already installed, this step can be skipped:
# Install the required modules
pip3 install bs4
pip3 install requests
pip3 install lxml
pip3 install numpy
pip3 install pandas
Configure request headers
Generally, when we crawl a website, we encapsulate the request header information to cope with the site's anti-crawling mechanisms. The simplest approach, shown below, is to randomly pick a client (User-Agent) string for each request:
# List of client User-Agent strings
USER_AGENTS = [
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]
import random

# Create the request header information
def create_headers():
    headers = dict()
    headers["User-Agent"] = random.choice(USER_AGENTS)
    headers["Referer"] = "http://www.ke.com"
    return headers
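To see what the function produces, here is a quick self-contained run (with the User-Agent pool trimmed to two entries for brevity):

```python
import random

# Trimmed copy of the User-Agent pool, just for this demonstration
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
]

def create_headers():
    headers = dict()
    headers["User-Agent"] = random.choice(USER_AGENTS)
    headers["Referer"] = "http://www.ke.com"
    return headers

# Every call yields a header dict with a randomly chosen client string
headers = create_headers()
print(headers["User-Agent"])
```

Because the User-Agent is picked fresh on each call, repeated requests present themselves as different clients.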
Configure proxy IP
In addition to the request headers configured above, if you make a large number of requests from the same IP, that IP is likely to be blocked; once blocked, requests to the site will simply time out. To avoid being blocked, it is best to crawl through proxy IPs. So where do we find usable proxies? The code below scrapes a free proxy list site:
# Import modules
from bs4 import BeautifulSoup
import requests
from lib.request.headers import create_headers

# Define variables
proxys_src = []
proxys = []

# Request and collect proxy addresses
def spider_proxyip(num=10):
    try:
        url = 'http://www.xicidaili.com/nt/1'
        # Fetch the proxy IP list page
        req = requests.get(url, headers=create_headers())
        source_code = req.content
        # Parse the returned html
        soup = BeautifulSoup(source_code, 'lxml')
        # Get the table rows
        ips = soup.findAll('tr')
        # Iterate over the rows
        for x in range(1, len(ips)):
            ip = ips[x]
            tds = ip.findAll("td")
            proxy_host = "{0}://".format(tds[5].contents[0]) + tds[1].contents[0] + ":" + tds[2].contents[0]
            proxy_temp = {tds[5].contents[0]: proxy_host}
            # Add to the proxy pool
            proxys_src.append(proxy_temp)
            if x >= num:
                break
    except Exception as e:
        print("Error while fetching proxy addresses:")
        print(e)
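Once the pool is filled, a proxy entry can be passed to requests on each crawl. A minimal sketch of that step, using a hypothetical pool in the same shape the function above builds (the addresses below are made up):

```python
import random

# Hypothetical proxy pool, in the shape spider_proxyip() builds:
# each entry maps a scheme label to "SCHEME://host:port"
proxys = [
    {"HTTP": "HTTP://123.56.74.13:8080"},
    {"HTTP": "HTTP://58.240.53.196:3128"},
]

def pick_proxies():
    """Pick a random pool entry, converted to the mapping requests expects."""
    proxy_temp = random.choice(proxys)
    # requests wants lowercase scheme keys, e.g. {"http": "..."}
    return {scheme.lower(): addr for scheme, addr in proxy_temp.items()}

# A crawl request would then pass it along, for example:
# requests.get(url, headers=create_headers(), proxies=pick_proxies(), timeout=10)
print(pick_proxies())
```

Rotating through the pool this way spreads the requests across several IPs instead of hammering the site from one.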
House price data object
Here we model each new house's price information as a NewHouse object; once the scraped data is saved as objects, further processing becomes much more convenient. The NewHouse object code looks like this:
# New house object
class NewHouse(object):
    def __init__(self, xiaoqu, price, total):
        self.xiaoqu = xiaoqu
        self.price = price
        self.total = total

    def text(self):
        return self.xiaoqu + "," + \
               self.price + "," + \
               self.total
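The object can be exercised on its own with made-up values (the estate name and prices below are purely illustrative), which shows the CSV-style line that text() renders for the output file:

```python
class NewHouse(object):
    def __init__(self, xiaoqu, price, total):
        self.xiaoqu = xiaoqu
        self.price = price
        self.total = total

    def text(self):
        # Join the three fields into one comma-separated line
        return self.xiaoqu + "," + self.price + "," + self.total

# Build an object from made-up scraped values and render one line
house = NewHouse("Example Estate", "56000元/平", "300万")
print(house.text())  # Example Estate,56000元/平,300万
```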
Get the price information and save it
With the above preparations in place, let's take Beike (ke.com) as an example and crawl its new house data for Beijing in batches, saving it locally. In fact, once the data is captured it can be saved in any format, including to a database. Since the main point here is how to capture the data, we save it in the simplest form: a txt text file.
# Required imports (create_headers and NewHouse are defined above)
import re
import math
import requests
from bs4 import BeautifulSoup

# Create the file and prepare to write
with open("newhouse.txt", "w", encoding='utf-8') as f:
    # Prepare to fetch the new house data
    total_page = 1
    loupan_list = list()
    page = 'http://bj.fang.ke.com/loupan/'
    # Build the request headers
    headers = create_headers()
    # Request the url and get the response
    response = requests.get(page, timeout=10, headers=headers)
    html = response.content
    # Parse the returned html
    soup = BeautifulSoup(html, "lxml")
    # Get the total number of pages
    try:
        page_box = soup.find_all('div', class_='page-box')[0]
        matches = re.search(r'.*data-total-count="(\d+)".*', str(page_box))
        total_page = int(math.ceil(int(matches.group(1)) / 10))
    except Exception as e:
        print(e)
    print('Total pages: ' + str(total_page))
    # Configure the request headers
    headers = create_headers()
    # Iterate from the first page
    for i in range(1, total_page + 1):
        page = 'http://bj.fang.ke.com/loupan/pg{0}'.format(i)
        print(page)
        response = requests.get(page, timeout=10, headers=headers)
        html = response.content
        # Parse the response
        soup = BeautifulSoup(html, "lxml")
        # Get the listing (estate) elements
        house_elements = soup.find_all('li', class_="resblock-list")
        # Iterate over the elements we want
        for house_elem in house_elements:
            price = house_elem.find('span', class_="number")
            desc = house_elem.find('span', class_="desc")
            total = house_elem.find('div', class_="second")
            loupan = house_elem.find('a', class_='name')
            # Start cleaning the data
            try:
                price = price.text.strip() + desc.text.strip()
            except Exception as e:
                price = '0'
            loupan = loupan.text.replace("\n", "")
            # Continue cleaning the data
            try:
                total = total.text.strip().replace(u'总价', '')
                total = total.replace(u'/套起', '')
            except Exception as e:
                total = '0'
            # Save the fields as an object
            loupan = NewHouse(loupan, price, total)
            print(loupan.text())
            # Add the new house info to the list
            loupan_list.append(loupan)
    # Loop over the collected data and write it to the file
    for loupan in loupan_list:
        f.write(loupan.text() + "\n")
The code is complete; now we can run it with the command python newhouse.py to capture the data. The captured results are shown in the following figure:
Summary
This article introduced how to use Python to batch-capture new house data from a real estate site; you can then compare each day's results against historical data to judge the general trend of the real estate market. The work mainly involves parsing the html with BeautifulSoup, and it is not difficult to see how the whole implementation works. I hope this walkthrough provides some help.
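As a follow-up, pandas (installed at the start but not used above) is a natural fit for that daily analysis. A minimal sketch of loading the saved file; the sample rows, column labels, and the price-extraction step are my own illustration, not part of the crawler:

```python
import pandas as pd

# Write a tiny sample in the same three-column format the crawler saves
with open("newhouse.txt", "w", encoding="utf-8") as f:
    f.write("Estate A,56000元/平,300万\n")
    f.write("Estate B,48000元/平,0\n")

# Load it with pandas; the column names here are our own labels
df = pd.read_csv("newhouse.txt", header=None,
                 names=["xiaoqu", "price", "total"], encoding="utf-8")

# Pull the leading digits out of strings like "56000元/平" for numeric work
df["price_num"] = df["price"].str.extract(r"(\d+)", expand=False).astype(float)

print(df.head())
print("Listings captured:", len(df))
```

With a numeric price column, day-over-day files can be concatenated and compared to track the trend.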