I start by choosing second-hand homes in Hangzhou as the scraping target.
Page Analysis
Problem
Lianjia only displays the first 100 pages of results; listings beyond page 100 are simply not shown, so a single query can fetch at most 3,000 listings (100 pages × 30 listings per page).
Solution
My workaround is to scrape by category, making sure each category holds fewer than 3,000 listings. Here I use floor area as the filter parameter:
- Under 50 m²: 2378
- 50-70 m²: 3532
- 70-90 m²: 5787
- 90-120 m²: 2640
- 120-140 m²: 2602
- 140-160 m²: 984
- 160-200 m²: 920
- Above 200 m²: 848
The 50-70 m² and 70-90 m² categories each exceed 3,000 listings, so I subdivide those two further by number of bedrooms before scraping; the category slugs are sketched below.
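Lianjia encodes these filters as path segments in the listing URL. Only the a8 slug, used in the main() function later in this post, is confirmed here to mean the above-200 m² bracket; the a1-a7 codes in the sketch below are my assumption by analogy and should be verified against the site's filter URLs.

# Assumed mapping of Lianjia area-filter slugs to area brackets;
# only 'a8' (above 200 m², used in main() below) is confirmed by this article.
AREA_SLUGS = {
    'a1': 'under 50 m²',
    'a2': '50-70 m²',
    'a3': '70-90 m²',
    'a4': '90-120 m²',
    'a5': '120-140 m²',
    'a6': '140-160 m²',
    'a7': '160-200 m²',
    'a8': 'above 200 m²',
}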
Code Implementation
import requests
from fake_useragent import UserAgent
from lxml import etree
import pandas as pd
import numpy as np
import time
import json
from collections import OrderedDict  # used to build ordered dictionaries
import re
import os
import glob
Page Parsing Function
- Input: the page content (response.text)
- Output: a Pandas DataFrame and the total page count for the category
Extracting a dict from a string
We can do the conversion with the json module:
>>> import json
>>> user_info= '{"name" : "john", "gender" : "male", "age": 28}'
>>> user_dict = json.loads(user_info)
>>> user_dict
{'name': 'john', 'gender': 'male', 'age': 28}
But converting with json has one potential pitfall.
Note: the JSON grammar requires that strings inside arrays and objects use double quotes; single quotes are not allowed.
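For example, a single-quoted version of the same string fails to parse. When the input really is a Python-style (single-quoted) literal rather than JSON, the standard library's ast.literal_eval is an alternative:

>>> json.loads("{'name' : 'john'}")
Traceback (most recent call last):
  ...
json.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
>>> import ast
>>> ast.literal_eval("{'name' : 'john'}")
{'name': 'john'}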
def parse(text):
    selector = etree.HTML(text)
    ### Parse the total page count
    totalPageStr = selector.xpath('//div[@class="page-box fr"]/div[1]/@page-data')[0]  # a string that contains a dict
    totalPageDict = json.loads(totalPageStr)
    totalPage = totalPageDict["totalPage"]
    ### Parse the listings
    sellList = selector.xpath('//ul[@class="sellListContent"]/li')
    house = []
    for sell in sellList:
        link = sell.xpath('a/@href')[0]
        title = sell.xpath('div[@class="info clear"]/div[@class="title"]/a/text()')[0]
        address = sell.xpath('div[@class="info clear"]/div[@class="address"]/div[@class="houseInfo"]/a/text()')[0]
        # Interior details of the house
        houseInfo = sell.xpath('div[@class="info clear"]/div[@class="address"]/div[@class="houseInfo"]/text()')[0].split('|')
        room = houseInfo[1]
        area = houseInfo[2]
        orientation = houseInfo[3]
        if len(houseInfo) >= 5:
            decoration = houseInfo[4]
        else:
            decoration = []  # empty placeholder when the field is missing
        if len(houseInfo) == 6:
            elevator = houseInfo[5]
        else:
            elevator = []  # empty placeholder when the field is missing
        # Overall position information
        positionIcon = sell.xpath('div[@class="info clear"]/div[@class="flood"]/div[@class="positionInfo"]/text()')[0]
        positionIconTemp = re.split(r"年建|-", positionIcon)  # split the string into segments
        floor = positionIconTemp[0][:-4]  # floor information
        year = positionIconTemp[0][-4:]  # year built
        genre = positionIconTemp[1].strip()  # building type
        positionInfo = sell.xpath('div[@class="info clear"]/div[@class="flood"]/div[@class="positionInfo"]/a/text()')[0]
        # Follower information
        followInfo = sell.xpath('div[@class="info clear"]/div[@class="followInfo"]/text()')[0]
        followInfoTemp = followInfo.split('/')
        follower = followInfoTemp[0].split('人')[0]  # number of followers
        interestedFollower = re.split('共|次', followInfoTemp[1])[1]  # number of viewings
        datetime = followInfoTemp[2].strip()  # time since listing
        # Tags
        tag = []
        tagList = sell.xpath('div[@class="info clear"]/div[@class="tag"]/span')
        for tags in tagList:
            tag.append(tags.xpath('text()')[0])
        # Price
        totalPrice = sell.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[@class="totalPrice"]/span/text()')[0]
        unitPrice = sell.xpath('div[@class="info clear"]/div[@class="priceInfo"]/div[@class="unitPrice"]/span/text()')[0]
        # Plain dicts were unordered before Python 3.7; OrderedDict keeps the
        # columns in insertion order, which matters when writing to CSV
        houseDict = OrderedDict()
        houseDict['link'] = link
        houseDict['title'] = title
        houseDict['address'] = address
        houseDict['room'] = room
        houseDict['area'] = area
        houseDict['orientation'] = orientation
        houseDict['decoration'] = decoration
        houseDict['elevator'] = elevator
        houseDict['floor'] = floor
        houseDict['year'] = year
        houseDict['genre'] = genre
        houseDict['positionInfo'] = positionInfo
        houseDict['follower'] = follower
        houseDict['interestedFollower'] = interestedFollower
        houseDict['datetime'] = datetime
        houseDict['tag'] = tag
        houseDict['totalPrice'] = totalPrice
        houseDict['unitPrice'] = unitPrice
        # collect each listing's dict into one list
        house.append(houseDict)
    df = pd.DataFrame(house)
    return df, totalPage
Page Request Function
def getData(url, headers):
    try:
        time.sleep(1)  # throttle: pause between requests
        response = requests.get(url, headers=headers)
        text = response.text
        return text
    except Exception as e:
        time.sleep(10)
        print(url)
        print("requests fail, retry!")
        return getData(url, headers)  # retry by calling itself recursively
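One caveat: because getData retries by calling itself, a URL that keeps failing (for example, once the crawler gets blocked) recurses without bound. A minimal bounded variant, with max_retries as an illustrative parameter:

def getDataBounded(url, headers, max_retries=5):
    # same throttle-and-retry behaviour, but gives up after max_retries attempts
    for attempt in range(max_retries):
        try:
            time.sleep(1)
            response = requests.get(url, headers=headers)
            return response.text
        except Exception:
            time.sleep(10)
            print("requests fail, retry %d/%d: %s" % (attempt + 1, max_retries, url))
    raise RuntimeError("giving up on %s after %d retries" % (url, max_retries))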
Main Function
def main():
    # Build the request headers
    ua = UserAgent()
    headers = {
        'User-Agent': ua.random,
        'Host': 'hz.lianjia.com',
        'Referer': 'https://hz.lianjia.com/ershoufang/pg1/'
    }
    url = "https://hz.lianjia.com/ershoufang/a8/pg{}/"  # a8: the above-200 m² bracket
    # Fetch page 1 to get the total page count
    text = getData(url.format('1'), headers)
    total_df, page = parse(text)
    print(page)
    # Crawl the remaining pages
    for i in range(2, int(page) + 1):
        text = getData(url.format(str(i)), headers)
        df, _ = parse(text)
        total_df = pd.concat([total_df, df], axis=0)
    # Save to a CSV file
    total_df.to_csv('./data/House-Second-Hangzhou-above200.csv', sep=',', header=True, index=False)

main()
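Since every category has to be scraped in a separate run, it may be cleaner to parameterize the crawl by slug instead of editing the URL by hand. A minimal sketch, where scrape_category is a hypothetical refactor of main() built on the assumed AREA_SLUGS mapping above:

def scrape_category(slug, headers):
    # hypothetical refactor of main(): the area slug becomes a parameter
    url = "https://hz.lianjia.com/ershoufang/" + slug + "/pg{}/"
    text = getData(url.format('1'), headers)
    total_df, page = parse(text)
    for i in range(2, int(page) + 1):
        df, _ = parse(getData(url.format(str(i)), headers))
        total_df = pd.concat([total_df, df], axis=0)
    total_df.to_csv('./data/House-Second-Hangzhou-%s.csv' % slug, index=False)

Looping for slug in AREA_SLUGS: scrape_category(slug, headers) would then cover all eight brackets, with the two oversized brackets still needing the bedroom subdivision described earlier.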
CSV Merge Function
The category-by-category runs above produced all of the Hangzhou second-hand housing data as separate CSV files; now I merge them into a single CSV file.
def merge():
    csv_list = glob.glob('*.csv')  # find all CSV files in the current directory
    print('Found %s CSV files' % len(csv_list))
    print('Processing............')
    for i in csv_list:  # read each CSV file in turn
        with open(i, 'rb') as source:
            fr = source.read()
        with open('House-Second-Hangzhou.csv', 'ab') as f:  # append to the combined file
            f.write(fr)
    print('Merge complete!')
merge()
Found 13 CSV files
Processing............
Merge complete!
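Note that merge() concatenates the files byte for byte, so the header row of every file after the first survives as a junk data row in the combined CSV. A pandas-based sketch that parses each file and writes a single header instead:

def merge_pandas():
    # read each category file with its own header, then concatenate the frames
    frames = [pd.read_csv(name) for name in glob.glob('*.csv')]
    combined = pd.concat(frames, axis=0, ignore_index=True)
    combined.to_csv('House-Second-Hangzhou.csv', index=False)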
df_read = pd.read_csv("House-Second-Hangzhou.csv")
df_read.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19716 entries, 0 to 19715
Data columns (total 18 columns):
link 19716 non-null object
title 19716 non-null object
address 19716 non-null object
room 19716 non-null object
area 19716 non-null object
orientation 19716 non-null object
decoration 19716 non-null object
elevator 19716 non-null object
floor 19713 non-null object
year 19716 non-null object
genre 18869 non-null object
positionInfo 19716 non-null object
follower 19716 non-null object
interestedFollower 19716 non-null object
datetime 19716 non-null object
tag 19716 non-null object
totalPrice 19716 non-null object
unitPrice 19716 non-null object
dtypes: object(18)
memory usage: 2.7+ MB
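Every column comes back as object dtype, so numeric fields will need converting before any analysis, for example:

# totalPrice is stored as strings (in units of 10,000 CNY); errors='coerce'
# turns unparseable values (such as stray header rows from the byte-level
# merge) into NaN
df_read['totalPrice'] = pd.to_numeric(df_read['totalPrice'], errors='coerce')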
With that, nearly 20,000 Hangzhou second-hand housing listings have been scraped in full; next I will move on to analyzing the data.