Scraping Beijing Rental Price Data with Python

The first entry in my practice log!

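Beijing's rental market has been in an uproar lately. Since I will be renting soon myself, I took the chance to practice web scraping on Beijing rental prices. Plenty of people have scraped housing prices already, but I still want to share a slightly different approach.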

First, the data source. The currently popular sites Lianjia (链家), Ziroom (自如), and Danke (蛋壳) would all work.

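I looked closely at all three. Ziroom renders its prices as images, while Lianjia and Danke use plain text. Scraping Ziroom would therefore seem to call for an image-recognition library. But since the only symbols are the digits 0-9, nothing that sophisticated is needed: it is enough to map each digit to the background-position value used on the page.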

For example, -240px corresponds to the digit 0:

background-position:-240px ----0
background-position:-30px  ----1
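
The lookup described above could be sketched roughly like this. Only the two offsets listed (-240px for 0 and -30px for 1) come from this post; the rest of the table would have to be filled in by inspecting the page, and the function name `decode_price` is made up for illustration.

```python
import re

# Only these two offsets come from the post; the other digits would
# have to be added after inspecting the page.
OFFSET_TO_DIGIT = {-240: "0", -30: "1"}

def decode_price(styles, offset_to_digit=OFFSET_TO_DIGIT):
    """Turn a list of CSS style strings into the digit string they display."""
    digits = []
    for style in styles:
        match = re.search(r"background-position:\s*(-?\d+)px", style)
        if match is None:
            raise ValueError("no background-position in: " + style)
        # An offset missing from the table raises KeyError
        digits.append(offset_to_digit[int(match.group(1))])
    return "".join(digits)

print(decode_price(["background-position:-30px", "background-position:-240px"]))
```

Each digit of a price is one element with its own offset, so decoding is just a dictionary lookup per element, with no image recognition involved.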

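A beginner's scraper should be as simple as possible, so Ziroom is out. Lianjia and Danke have similar page structures, but Danke does not show the number of pages, so you cannot tell when you have reached the last page, which makes errors likely.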

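Lianjia is a bit friendlier: it tells you the total number of listings in Beijing and how many pages there are. Most interesting of all, it shows how many people have viewed each listing, which the other sites do not offer and which can be a highlight of the later data analysis.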

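On to the scraping itself. The data set is small (under ten thousand records), so there is no need to set up the scrapy framework, though you can if you want to.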

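The fields I chose are title, price, layout, floor area, and view count. (Only after finishing did I realize that building age would also be an interesting field; add it to the analysis if you need it.)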

Workflow:

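From the Beijing rental page (https://bj.lianjia.com/zufang/), get the links for each district, such as Changping, Chaoyang, and so on
Get each district's maximum page number and construct the URL of every page
Parse each page. I use XPath here rather than BeautifulSoup; Chrome's XPath tooling is very handy: locate the element you need in the developer tools, then right-click to copy its XPath
Finally, save everything to CSV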
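
As a rough sketch of the second step: the page count comes from a JSON attribute with a `totalPage` key, and each page URL is the district URL plus `pg` and the page number. The helper name `build_page_urls`, the `curPage` key, and the district path below are assumptions made for illustration.

```python
import json

def build_page_urls(area_url, page_data):
    """Return one URL per result page, given the raw page-data JSON string."""
    total_pages = json.loads(page_data)["totalPage"]
    base = area_url.rstrip("/") + "/"
    return [base + "pg" + str(page) for page in range(1, total_pages + 1)]

# Hypothetical attribute value and district path, for illustration only
sample = '{"totalPage": 3, "curPage": 1}'
for url in build_page_urls("https://bj.lianjia.com/zufang/chaoyang/", sample):
    print(url)
```

Knowing the total up front means the loop has a fixed end, which is exactly what is missing on a site that hides its page count.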

Full code with comments:

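import requests
# Throttle requests so the site is not hit too often. A proxy is another option,
# but the ones I tried were unstable; with this little data I skip disguising
# the IP and just slow down with time.sleep.
import time
from lxml import etree
# json parses the pagination data; the number of listings on each district's last page varies
import json
import csv
# Fetch a page with requests, sending a browser User-Agent
def response(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'}
    # A timeout keeps a stalled connection from hanging the crawl
    return requests.get(url, headers=headers, timeout=10)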
# Collect the link for every district listed on the page
def get_area(url):
    content=etree.HTML(response(url).text)
    areas_name=content.xpath('//*[@id="filter-options"]/dl[1]/dd/div/a/text()')
    areas_url=content.xpath('//*[@id="filter-options"]/dl[1]/dd/div/a/@href')

    for i in range(1,len(areas_name)):   # start at 1: index 0 is skipped, only the actual districts are crawled
        area_name=areas_name[i]
        area_url='https://bj.lianjia.com'+areas_url[i]
        get_detailurl(area_name,area_url)
    
# Build each page's URL from the total page count
def get_detailurl(area_name,area_url):
    content=etree.HTML(response(area_url).text)
    # page-data holds the pagination info as JSON (the last page may not be full)
    pages =json.loads(content.xpath('/html/body/div[4]/div[3]/div[2]/div[2]/div[2]/@page-data')[0])['totalPage']
    for page in range(1,pages+1):
        url=area_url+'pg'+str(page)
        print('Current: '+area_name+', '+str(page)+' of '+str(pages),url)
        get_house_info(area_name,url)
        
# Parse the listings on one page
def get_house_info(area, url, retries=3):
    time.sleep(1)
    # Wrap in try: requests and XPath lookups fail often
    try:
        content = etree.HTML(response(url).text)
        maxdital = len(content.xpath('//*[@id="house-lst"]/li'))
        rows = []
        for i in range(1, maxdital+1):
            # Only the li[index] part of the XPath changes between listings
            li = '//*[@id="house-lst"]/li['+str(i)+']'
            title = content.xpath(li+'/div[2]/h2/a/text()')[0]
            price = content.xpath(li+'/div[2]/div[2]/div[1]/span/text()')[0]
            room_type = content.xpath(li+'/div[2]/div[1]/div[1]/span[1]/span/text()')[0]
            # Drop the last 4 characters of the area text (presumably the trailing unit)
            square = str(content.xpath(li+'/div[2]/div[1]/div[1]/span[2]/text()')[0])[:-4]
            # xpath returns a list; join it so the CSV gets plain text
            people_flow = ''.join(content.xpath(li+'/div[2]/div[3]/div/div[1]/span/text()'))
            rows.append([area, title, price, room_type, square, people_flow])
        # Write only after the whole page parsed, so a retry cannot duplicate rows
        with open('租房.csv', 'a', encoding='utf-8', newline='') as f:
            csv.writer(f).writerows(rows)  # csv.writer quotes titles that contain commas
    except Exception as e:
        print('Error on', url, repr(e))
        if retries <= 0:
            return  # give up instead of retrying forever
        print('Retrying...')
        time.sleep(10)
        return get_house_info(area, url, retries-1)
        
    
def main():
    url = 'https://bj.lianjia.com/zufang'
    get_area(url)
    
if __name__ == '__main__':
    main()
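
One thing to watch when saving rows: a listing title can contain a comma, which shifts the columns if the fields are simply joined with commas. A minimal sketch of how the standard csv module quotes such fields (the sample row and the helper name `rows_to_csv` are made up for illustration):

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize rows to CSV text, quoting any field that contains a comma."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up row: district, title, price, layout, area, view count
sample_rows = [["Chaoyang", "2BR, near subway", "6500", "2 rooms 1 hall", "60", "12"]]
text = rows_to_csv(sample_rows)
print(text)

# Reading it back yields the same six fields, comma in the title included
parsed = next(csv.reader(io.StringIO(text)))
print(len(parsed))
```

The same writer works on a real file opened with newline='' in place of the in-memory buffer.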

The scraped result:

The next step is the data analysis, once I finish reading up on pandas over the next couple of days.
