Scrapy example: scraping Anjuke rental listings

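This time we crawl the Anjuke website to collect rental listings for Changning District, Shanghai. Based on an article from a WeChat official account.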

As before, the crawler is built with the Scrapy framework. Steps:

1. Inspect the page
2. items.py
3. spider.py
4. pipelines.py
5. settings.py

Inspecting the page

Rental listings for Changning District, Shanghai: https://sh.zu.anjuke.com/fangyuan/changning/

items.py

Define the fields that will hold the scraped data here.

import scrapy

class AnjukespiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    price = scrapy.Field()
    rent_type = scrapy.Field()
    house_type = scrapy.Field()
    area = scrapy.Field()
    towards = scrapy.Field()
    floor = scrapy.Field()
    decoration = scrapy.Field()
    building_type = scrapy.Field()
    community = scrapy.Field()

spider.py

Write the spider file here. It tells the crawler what to scrape and how to scrape it.

import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from anjukeSpider.items import AnjukespiderItem


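# Define the spider class
class anjuke(scrapy.spiders.CrawlSpider):
    # Spider name
    name = 'anjuke'
    # Start page
    start_urls = ['https://sh.zu.anjuke.com/fangyuan/changning/']
    # Crawl rules
    rules = (
        # The page has a "next page" button, so follow=True is meant to crawl every page
        Rule(LinkExtractor(allow=r'fangyuan/p\d+/'), follow=True),
        # Detail pages also show [recommended] listings that are not necessarily
        # in Changning District, so follow=False keeps the crawler from following them
        Rule(LinkExtractor(allow=r'https://sh.zu.anjuke.com/fangyuan/\d{10}'), follow=False, callback='parse_item'),
    )
    # Callback: mostly XPath expressions, covered in the previous example, so not repeated here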
    def parse_item(self, response):
        item = AnjukespiderItem()
        # Rent
        item['price'] = int(response.xpath("//ul[@class='house-info-zufang cf']/li[1]/span[1]/em/text()").extract_first())
        # Rental type
        item['rent_type'] = response.xpath("//ul[@class='title-label cf']/li[1]/text()").extract_first()
        # Layout
        item['house_type'] = response.xpath("//ul[@class='house-info-zufang cf']/li[2]/span[2]/text()").extract_first()
        # Floor area (the page text ends with the unit '平方米', square metres)
        item['area'] = int(response.xpath("//ul[@class='house-info-zufang cf']/li[3]/span[2]/text()").extract_first().replace('平方米',''))
        # Orientation
        item['towards'] = response.xpath("//ul[@class='house-info-zufang cf']/li[4]/span[2]/text()").extract_first()
        # Floor
        item['floor'] = response.xpath("//ul[@class='house-info-zufang cf']/li[5]/span[2]/text()").extract_first()
        # Decoration
        item['decoration'] = response.xpath("//ul[@class='house-info-zufang cf']/li[6]/span[2]/text()").extract_first()
        # Building type
        item['building_type'] = response.xpath("//ul[@class='house-info-zufang cf']/li[7]/span[2]/text()").extract_first()
        # Residential community
        item['community'] = response.xpath("//ul[@class='house-info-zufang cf']/li[8]/a[1]/text()").extract_first()
        yield item
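One thing to watch in parse_item: extract_first() returns None when an XPath matches nothing, and int(None) raises a TypeError, so a single listing with a missing price or area would make that callback fail. A minimal sketch of a more forgiving conversion follows. The helper name to_int is hypothetical and not part of the original project, and it simply takes the first run of digits, so a value such as '75.5平方米' would be truncated to 75.

```python
import re


def to_int(text):
    """Return the first integer found in text, or None if there is none."""
    if text is None:
        # extract_first() found no matching node
        return None
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


print(to_int('3500'))
print(to_int(' 75平方米 '))
print(to_int(None))
```

Inside parse_item it could be used as item['price'] = to_int(response.xpath(...).extract_first()), which leaves the field as None instead of raising.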

pipelines.py

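Save the scraped data. Here it is only saved as JSON.

This part is optional. Without a pipeline, the same result comes from passing extra options at run time: scrapy crawl anjuke -o anjuke.json -t json

The general form is: scrapy crawl <spider name> -o <output file> -t <format> (Scrapy can also infer the format from the file extension, so -t can usually be omitted.)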

from scrapy.exporters import JsonItemExporter


class AnjukespiderPipeline(object):
    def __init__(self):
        self.file = open('zufang_shanghai.json', 'wb')  # output file path
        self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        print('write')
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        print("close")
        self.exporter.finish_exporting()
        self.file.close()
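The pipeline above writes one JSON array through JsonItemExporter. As a point of comparison, here is a sketch of a variant that writes JSON Lines (one object per line) using only the standard library. The class name JsonLinesPipeline and the default file name zufang_shanghai.jl are assumptions made for this illustration. It relies on the fact that a Scrapy pipeline is a plain class with process_item and, optionally, open_spider and close_spider methods.

```python
import json


class JsonLinesPipeline:
    """Hypothetical alternative pipeline: one JSON object per line."""

    def __init__(self, path='zufang_shanghai.jl'):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts; the file is opened here
        # rather than in __init__.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item instances and plain dicts alike.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
```

A line-per-item file can be read back incrementally and stays valid even if the crawl stops early, whereas a JSON array needs its closing bracket.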

settings.py

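Edit the settings file to enable the pipeline.

Set a download delay so that requests are not sent too quickly and the site does not block the crawler.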

ITEM_PIPELINES = {
    'anjukeSpider.pipelines.AnjukespiderPipeline': 300,
}

DOWNLOAD_DELAY = 2
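A fixed delay is the simplest form of throttling. The fragment below sketches a few related Scrapy settings that are often combined with it. The setting names are real Scrapy settings, but the values are illustrative assumptions, not numbers tuned for Anjuke.

```python
# settings.py (sketch; values are illustrative assumptions)
DOWNLOAD_DELAY = 2                    # base delay in seconds, as above
RANDOMIZE_DOWNLOAD_DELAY = True       # vary the wait between 0.5x and 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # at most one request in flight per domain
AUTOTHROTTLE_ENABLED = True           # adapt the delay to observed response latency
AUTOTHROTTLE_START_DELAY = 2          # initial delay used by AutoThrottle
AUTOTHROTTLE_MAX_DELAY = 10           # upper bound when the site responds slowly
```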

Open a command line, change to the project root directory, and run
```
scrapy crawl [spider name]
```

PS F:\ScrapyProject\anjukeSpider\anjukeSpider> scrapy crawl anjuke

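Run complete

60 items were scraped (from 61 requests in total, per the stats below), and the JSON file was generated at the configured path.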

2018-10-22 09:02:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 40861,
 'downloader/request_count': 61,
 'downloader/request_method_count/GET': 61,
 'downloader/response_bytes': 1925879,
 'downloader/response_count': 61,
 'downloader/response_status_count/200': 61,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 22, 1, 2, 55, 245128),
 'item_scraped_count': 60,
 'log_count/DEBUG': 122,
 'log_count/INFO': 9,
 'request_depth_max': 1,
 'response_received_count': 61,
 'scheduler/dequeued': 61,
 'scheduler/dequeued/memory': 61,
 'scheduler/enqueued': 61,
 'scheduler/enqueued/memory': 61,
 'start_time': datetime.datetime(2018, 10, 22, 1, 0, 29, 555537)}
2018-10-22 09:02:55 [scrapy.core.engine] INFO: Spider closed (finished)
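The stats show 61 requests, 60 scraped items and request_depth_max of 1. That is what one start page plus 60 detail pages would produce, which suggests that no pagination links were followed in this run. One possible cause can be checked offline: LinkExtractor applies its allow patterns as regular-expression searches on each URL, so the two rules can be tried against sample URLs with the re module. The three URLs below are made up for illustration, and the actual URL layout of the site is an assumption here.

```python
import re

# The two allow patterns from the spider's rules
PAGE_RULE = r'fangyuan/p\d+/'
DETAIL_RULE = r'https://sh.zu.anjuke.com/fangyuan/\d{10}'


def classify(url):
    """Report which rule, if any, would pick up the given URL."""
    if re.search(DETAIL_RULE, url):
        return 'detail'
    if re.search(PAGE_RULE, url):
        return 'page'
    return 'ignored'


# Hypothetical URLs, not taken from the site
for url in [
    'https://sh.zu.anjuke.com/fangyuan/p2/',
    'https://sh.zu.anjuke.com/fangyuan/changning/p2/',
    'https://sh.zu.anjuke.com/fangyuan/1234567890',
]:
    print(classify(url), url)
```

The page pattern requires p followed by digits directly after fangyuan/, so a district URL of the second form would be ignored. If the site paginates district listings that way, a pattern such as r'fangyuan/changning/p\d+/' would be needed to reach the later pages.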

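The crawler is now complete, but the scraped data is not easy to read as it stands and still needs to be visualized (with the pyecharts module). That part will be covered in a separate post on using pyecharts.

pyecharts official documentation: http://pyecharts.org/#/zh-cn/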
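Before any charting, a quick numeric summary already makes the exported data easier to read. The sketch below averages the rent per community using only the standard library. The function name average_price_by_community and the sample records are made up for illustration; the records only mirror the price and community fields defined in items.py.

```python
import json
from collections import defaultdict


def average_price_by_community(items):
    """Map each community name to the mean price of its listings."""
    totals = defaultdict(lambda: [0, 0])  # community -> [price sum, count]
    for item in items:
        price = item.get('price')
        community = item.get('community')
        if price is None or community is None:
            continue  # skip incomplete records
        totals[community][0] += price
        totals[community][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


# Made-up sample in the same shape as the exported items
sample = json.loads(
    '[{"price": 3000, "community": "A"},'
    ' {"price": 5000, "community": "A"},'
    ' {"price": 4200, "community": "B"}]'
)
print(average_price_by_community(sample))
```

With the real output, the list would come from json.load on zufang_shanghai.json opened with encoding='utf-8'.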
