Scrapy crawler for multi-level pages

Project requirements

  • Goal description

    【1】Build on the code that crawls the first-level (listing) pages
    【2】Data scraped from the first-level pages (same as before):
        2.1) Car link
        2.2) Car name
        2.3) Car price
    【3】Data scraped from the second-level (detail) pages (see the XPath verification sketch below):
        3.1) Mileage:       //ul[@class="assort clearfix"]/li[2]/span/text()
        3.2) Displacement:  //ul[@class="assort clearfix"]/li[3]/span/text()
        3.3) Transmission:  //ul[@class="assort clearfix"]/li[4]/span/text()
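
  A quick way to sanity-check these detail-page XPaths is to run them against a small HTML sample with Scrapy's Selector. This is only a minimal sketch - the sample_html below is invented stand-in markup, not the real Guazi page.

    # Sketch: verify the detail-page XPaths against sample HTML.
    # NOTE: sample_html is an invented stand-in, not real Guazi markup.
    from scrapy import Selector

    sample_html = """
    <ul class="assort clearfix">
        <li><span>2018-06</span></li>
        <li><span>3.2万公里</span></li>
        <li><span>1.5T</span></li>
        <li><span>自动</span></li>
    </ul>
    """

    sel = Selector(text=sample_html)
    print(sel.xpath('//ul[@class="assort clearfix"]/li[2]/span/text()').get())  # mileage
    print(sel.xpath('//ul[@class="assort clearfix"]/li[3]/span/text()').get())  # displacement
    print(sel.xpath('//ul[@class="assort clearfix"]/li[4]/span/text()').get())  # transmission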
    

Implementation (upgrading the original project)

  • Step 1-items.py

    # Add the fields needed for the second-level (detail) page data
    
    import scrapy
    
    class GuaziItem(scrapy.Item):
        # define the fields for your item here like:
        # First-level page: link, name, price
        url = scrapy.Field()
        name = scrapy.Field()
        price = scrapy.Field()
        # Second-level page: time, mileage, displacement, transmission
        time = scrapy.Field()
        km = scrapy.Field()
        disp = scrapy.Field()
        trans = scrapy.Field()
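
  A scrapy.Item only accepts keys that were declared as Field()s, which is why every detail-page field has to be added to GuaziItem before the spider can fill it. A small standalone sketch of that behavior (only a subset of the fields shown):

    # Sketch: a scrapy.Item rejects keys that were never declared as Field()s.
    import scrapy

    class GuaziItem(scrapy.Item):
        url = scrapy.Field()
        km = scrapy.Field()

    item = GuaziItem()
    item['km'] = '3.2万公里'      # fine: declared field
    try:
        item['color'] = '白色'    # raises KeyError: field was never declared
    except KeyError as err:
        print(err)
    print(dict(item))             # items convert to plain dicts for the pipeline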
    
  • Step 2-car2.py

    """
        Override the start_requests() method - very efficient
    """
    # -*- coding: utf-8 -*-
    import scrapy
    from ..items import GuaziItem
    
    class GuaziSpider(scrapy.Spider):
        # Spider name
        name = 'car2'
        # Domains the spider is allowed to crawl
        allowed_domains = ['www.guazi.com']
        # 1. Drop the start_urls variable
        # 2. Override the start_requests() method
        def start_requests(self):
            """Generate all URLs to crawl and hand them to the scheduler queue in one go"""
            for i in range(1, 6):
                url = 'https://www.guazi.com/bj/buy/o{}/#bread'.format(i)
                # scrapy.Request(): hands the request to the scheduler queue
                yield scrapy.Request(url=url, callback=self.parse)
    
        def parse(self, response):
            # Base xpath: matches the list of node objects, one per car
            li_list = response.xpath('//ul[@class="carlist clearfix js-top"]/li')
            for li in li_list:
                # Instantiate the GuaziItem class from items.py
                # (a fresh item per car, so concurrent detail requests don't overwrite each other)
                item = GuaziItem()
                item['url'] = 'https://www.guazi.com' + li.xpath('./a[1]/@href').get()
                item['name'] = li.xpath('./a[1]/@title').get()
                item['price'] = li.xpath('.//div[@class="t-price"]/p/text()').get()
                # The meta parameter of Request() passes data between parse callbacks;
                # the item comes back together with the response
                yield scrapy.Request(url=item['url'], meta={'meta_1': item}, callback=self.detail_parse)
    
        def detail_parse(self, response):
            """Parse callback for the car detail page"""
            # Get the meta data passed on by the previous parse callback
            item = response.meta['meta_1']
            item['km'] = response.xpath('//ul[@class="assort clearfix"]/li[2]/span/text()').get()
            item['disp'] = response.xpath('//ul[@class="assort clearfix"]/li[3]/span/text()').get()
            item['trans'] = response.xpath('//ul[@class="assort clearfix"]/li[4]/span/text()').get()
    
            # This record is now fully extracted; hand it to the pipeline
            yield item
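
  On newer Scrapy versions (1.7+), cb_kwargs is a cleaner alternative to meta for passing the item between callbacks. A hedged sketch of how the two callbacks above could look with it, keeping the same field names (detail_parse shortened to one field for brevity):

    # Sketch: the same two callbacks inside GuaziSpider, passing the item via
    # cb_kwargs (Scrapy 1.7+) instead of meta. Illustrative, not the original code.
    def parse(self, response):
        for li in response.xpath('//ul[@class="carlist clearfix js-top"]/li'):
            item = GuaziItem()
            item['url'] = 'https://www.guazi.com' + li.xpath('./a[1]/@href').get()
            item['name'] = li.xpath('./a[1]/@title').get()
            item['price'] = li.xpath('.//div[@class="t-price"]/p/text()').get()
            # entries in cb_kwargs arrive as keyword arguments of the callback
            yield scrapy.Request(url=item['url'], cb_kwargs={'item': item},
                                 callback=self.detail_parse)

    def detail_parse(self, response, item):
        item['km'] = response.xpath('//ul[@class="assort clearfix"]/li[2]/span/text()').get()
        yield item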
    
  • Step 3-pipelines.py

    # Store the data in MongoDB; we won't build out MySQL table fields here - feel free to add that yourself
    # MongoDB pipeline
    import pymongo
    from .settings import MONGO_HOST, MONGO_PORT, MONGO_DB, MONGO_SET

    class GuaziMongoPipeline(object):
        def open_spider(self, spider):
            """Runs only once, when the spider starts; used to connect to MongoDB"""
            self.conn = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
            self.db = self.conn[MONGO_DB]
            self.myset = self.db[MONGO_SET]

        def process_item(self, item, spider):
            car_dict = dict(item)
            self.myset.insert_one(car_dict)
            return item
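
  The pipeline above imports its MongoDB settings directly from settings.py. An alternative sketch that reads the same settings through crawler.settings (a standard Scrapy pattern) and also closes the connection when the spider finishes; the setting names match Step 4, everything else is illustrative:

    # Sketch: same MongoDB pipeline, but pulling the settings from the crawler
    # and closing the connection in close_spider().
    import pymongo

    class GuaziMongoPipeline(object):
        @classmethod
        def from_crawler(cls, crawler):
            pipe = cls()
            pipe.host = crawler.settings.get('MONGO_HOST')
            pipe.port = crawler.settings.getint('MONGO_PORT')
            pipe.db_name = crawler.settings.get('MONGO_DB')
            pipe.set_name = crawler.settings.get('MONGO_SET')
            return pipe

        def open_spider(self, spider):
            self.conn = pymongo.MongoClient(self.host, self.port)
            self.myset = self.conn[self.db_name][self.set_name]

        def process_item(self, item, spider):
            self.myset.insert_one(dict(item))
            return item

        def close_spider(self, spider):
            self.conn.close()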
    
  • Step 4-settings.py

    # Define the MongoDB-related variables
    MONGO_HOST = 'localhost'
    MONGO_PORT = 27017
    MONGO_DB = 'guazidb'
    MONGO_SET = 'guaziset'
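
  For the pipeline to run at all, it also has to be enabled in ITEM_PIPELINES in settings.py. The module path below assumes the project package is called 'guazi'; adjust it to the real package name:

    # Sketch: enable the MongoDB pipeline (the 'guazi' package name is an assumption).
    ITEM_PIPELINES = {
        'guazi.pipelines.GuaziMongoPipeline': 300,
    }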
    

Origin blog.csdn.net/weixin_49304690/article/details/112371487