Detailed usage of the Scrapy framework

Using the Scrapy framework

  • Coding workflow for pipeline-based persistent storage

    • Parse the data in the spider file

    • Encapsulate the parsed data into an Item-type object

    • Submit the item object to the pipeline

    • The pipeline's process_item method receives the item and performs some form of persistent storage

    • Enable the pipeline in the settings file

      ITEM_PIPELINES = {
         'frist_scrapy.pipelines.FristScrapyPipeline': 300,
      }
      
      # uncomment this block in settings.py (it is commented out by default)
    • Notes (see the settings sketch after this list):

      1. When are multiple pipeline classes needed?
        - Each pipeline class corresponds to one form of persistent storage
      
      2. The return item in process_item:
        - Passes the item on to the next pipeline class that will be executed
      
      3. If writing a dict directly to Redis raises an error:
        - pip install redis==2.10.6
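
      A minimal sketch of registering two pipeline classes at once; the second class (MysqlPipeline) is hypothetical and only illustrates note 1 above. The lower the number, the earlier the pipeline runs, and each process_item must return the item so the next pipeline receives it:

      ITEM_PIPELINES = {
         'frist_scrapy.pipelines.FristScrapyPipeline': 300,  # runs first (lower value = higher priority)
         'frist_scrapy.pipelines.MysqlPipeline': 301,        # hypothetical second pipeline, runs next
      }
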
  • Full-site data crawling

    • Sending requests manually

      yield scrapy.Request(url=new_url,callback=self.parse)
      
      # the meta parameter (a dict) can also be passed along
      yield scrapy.Request(url=new_url,callback=self.parse,meta={'item':item})
    • Summary: when to use yield (combined sketch after this list)

      1. When submitting an item to the pipeline
      2. When sending a request manually
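
      A minimal sketch combining both uses of yield in one spider; the spider name, URLs and XPath are placeholders, not from the original project:

      import scrapy
      
      class SketchSpider(scrapy.Spider):          # hypothetical spider for illustration only
          name = 'sketch'
          start_urls = ['https://example.com/page/1.html']
          url = 'https://example.com/page/%s.html'
          pageNum = 1
      
          def parse(self, response):
              for row in response.xpath('//li'):
                  # case 1: yield an item (a plain dict also works) to submit it to the pipeline
                  yield {'title': row.xpath('./a/text()').extract_first()}
      
              # case 2: yield a Request to manually send the next page's request
              if self.pageNum < 5:
                  self.pageNum += 1
                  yield scrapy.Request(url=self.url % self.pageNum, callback=self.parse)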
  • How to send a POST request:

      yield scrapy.FormRequest(url=new_url,callback=self.parse,formdata={})
    
    • Why the URLs in the start_urls list are sent as GET requests (a POST sketch follows the snippet below):

      The parent class's original implementation of start_requests:
      def start_requests(self):
          for url in self.start_urls:
              yield scrapy.Request(url,callback=self.parse)
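
      To POST the start URLs instead, start_requests can be overridden to yield FormRequest objects; a minimal sketch in which the form data is a placeholder:

      def start_requests(self):
          # send a POST for each start URL instead of the default GET
          for url in self.start_urls:
              yield scrapy.FormRequest(url, callback=self.parse, formdata={'kw': 'value'})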
      
  • Five core components (objects)

    (architecture diagram of Scrapy's five core components)

    • Gain some understanding of how Scrapy achieves asynchrony

    • Understand the call flow among the related methods and instantiated objects

    • The role of components:

      Engine (Scrapy)
          Handles the data flow of the whole system and triggers events (the core of the framework)
      
      Scheduler
          Accepts the requests sent over by the engine, pushes them into a queue, and returns them when the engine asks again. It can be pictured as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL to crawl next and also removes duplicate URLs.
      
      Downloader
          Downloads page content and hands it back to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model)
      
      Spiders
          The spiders do the main work: they extract the needed information, the so-called items, from specific pages. Links can also be extracted from the pages so that Scrapy continues crawling the next page.
      
      Item Pipeline
          Responsible for processing the items the spiders extract from pages. Its main functions are persisting items, validating them, and removing unneeded data. After a page is parsed by a spider, the items are sent to the pipeline and processed through several specific steps in order.
  • How to improve Scrapy's crawling efficiency (collected in a settings sketch after this list)

    Increase concurrency:
        By default Scrapy runs 16 concurrent requests; this can be raised as appropriate. In the settings file, set CONCURRENT_REQUESTS = 100 to raise the concurrency to 100.
    
    Lower the log level:
        Running Scrapy produces a large amount of log output; to reduce CPU usage, set the log level to INFO or ERROR. In the settings file: LOG_LEVEL = 'ERROR'
    
    Disable cookies:
        If cookies are not actually needed, disable them during crawling to reduce CPU usage and improve efficiency. In the settings file: COOKIES_ENABLED = False
    
    Disable retries:
        Re-requesting (retrying) failed HTTP requests slows the crawl down, so retries can be disabled. In the settings file: RETRY_ENABLED = False
    
    Reduce the download timeout:
        When a very slow link is being crawled, lowering the download timeout lets stuck requests be abandoned quickly, which improves efficiency. In the settings file: DOWNLOAD_TIMEOUT = 10 sets a 10-second timeout.
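
    A minimal settings.py sketch collecting the options above (the values are the ones suggested here; adjust to taste):

    # settings.py: efficiency-related options discussed above
    CONCURRENT_REQUESTS = 100   # raise concurrency from the default of 16
    LOG_LEVEL = 'ERROR'         # only log errors to reduce CPU spent on logging
    COOKIES_ENABLED = False     # skip cookie handling when cookies are not needed
    RETRY_ENABLED = False       # do not retry failed requests
    DOWNLOAD_TIMEOUT = 10       # give up on slow responses after 10 seconds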
  • Request parameter passing

    • Purpose: helps Scrapy implement deep crawling

      • Deep crawling:
        • The data to be crawled is not all on the same page
    • Requirement: crawl the name and description of each entry from https://www.4567tv.tv/frim/index1.html

    • Implementation process

      • Passing the parameter (a sketch using meta follows the spider code below):

        yield scrapy.Request(url,callback,meta), which passes the meta dict to the callback
      • Receiving the parameter in the callback

        response.meta
    • Code:

      # -*- coding: utf-8 -*-
      import scrapy
      from ..items import MvItem  # MvItem is defined in the project's items.py (shown further below)
      
      
      class MvspidersSpider(scrapy.Spider):
          name = 'mvspiders'
          # allowed_domains = ['https://www.4567tv.tv/frim/index1.html']
          start_urls = ['https://www.4567tv.tv/frim/index1.html']
      
          url = "https://www.4567tv.tv/index.php/vod/show/id/5/page/%s.html"
          pageNum = 1
      
          def parse(self, response):
      
              li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
      
              for li in li_list:
                  a_href = li.xpath('./div/a/@href').extract_first()
                  url = 'https://www.4567tv.tv/' + a_href
      
                  # manually send a request for the detail-page URL
                  # request parameter passing:
                  # the meta parameter is a dict that is handed to the callback
                  yield scrapy.Request(url,callback=self.infoparse)
      
              # for full-site crawling
              if self.pageNum < 5:
                  self.pageNum += 1
                  new_url = self.url%self.pageNum
                  # recursively schedule parse for the next page
                  yield scrapy.Request(new_url,callback=self.parse)
      
          def infoparse(self,response):
      
              title = response.xpath("/html/body/div[1]/div/div/div/div[2]/h1/text()").extract_first()
      
              content = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
      
              # encapsulate the parsed data into an item and submit it to the pipeline
              item = MvItem()
              item['title'] = title
              item['content'] = content
              yield item
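
      When the name and the description live on different pages, the partially filled item can be carried into the detail-page callback through meta. A minimal sketch of the two methods; the list-page XPath for the name is a placeholder:

      def parse(self, response):
          for li in response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li'):
              item = MvItem()
              item['title'] = li.xpath('./div/a/@title').extract_first()   # placeholder XPath for the name on the list page
              detail_url = 'https://www.4567tv.tv/' + li.xpath('./div/a/@href').extract_first()
              # hand the half-filled item to the callback through meta
              yield scrapy.Request(detail_url, callback=self.infoparse, meta={'item': item})

      def infoparse(self, response):
          item = response.meta['item']   # receive the item passed along with the request
          item['content'] = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
          yield item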
      

      items.py defines the fields you need to encapsulate into an item: for each piece of data, add a corresponding field name in this class followed by scrapy.Field(). Field is simply an alias of the built-in dict class; it provides no extra methods or attributes and is only used to support the class-attribute-based item declaration syntax. A short usage illustration follows the class below.

      # -*- coding: utf-8 -*-
      
      # Define here the models for your scraped items
      #
      # See documentation in:
      # https://docs.scrapy.org/en/latest/topics/items.html
      
      import scrapy
      class MvItem(scrapy.Item):
          # define the fields for your item here like:
          title = scrapy.Field()
          content = scrapy.Field()
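
      A quick illustration of that dict-like behaviour (not from the original post): an item is filled and read through keys, and dict(item) converts it to a plain dictionary:

      item = MvItem()
      item['title'] = 'some title'        # assign by key, just like a dict
      item['content'] = 'some summary'
      print(item['title'])                # read by key
      print(dict(item))                   # convert to a plain dict: {'title': ..., 'content': ...}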
      

      pipelines.py implements the various forms of data storage (their registration in settings.py is sketched after the code):

      # -*- coding: utf-8 -*-
      
      # Define your item pipelines here
      #
      # Don't forget to add your pipeline to the ITEM_PIPELINES setting
      # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
      
      # write to a text file
      import pymysql
      from redis import Redis
      class DuanziproPipeline(object):
          fp = None
          def open_spider(self,spider):
              print('Spider started......')
              self.fp = open('./duanzi.txt','w',encoding='utf-8')
          # called once per item; the item parameter is the item object received from the spider
          def process_item(self, item, spider):
              # print(item)  # item behaves like a dict
              self.fp.write(item['title']+':'+item['content']+'\n')
              return item  # pass the item on to the next pipeline class to be executed
          def close_spider(self,spider):
              self.fp.close()
              print('Spider finished!!!')
      # write the data to MySQL
      class MysqlPipeLine(object):
          conn = None
          cursor = None
          def open_spider(self,spider):
              self.conn = pymysql.Connect(host='127.0.0.1',port=3306,user='root',password='222',db='spider',charset='utf8')
              print(self.conn)
          def process_item(self,item,spider):
              sql = 'insert into duanzi values ("%s","%s")'%(item['title'],item['content'])
              self.cursor = self.conn.cursor()
              try:
                  self.cursor.execute(sql)
                  self.conn.commit()
              except Exception as e:
                  print(e)
                  self.conn.rollback()
              return item
          def close_spider(self,spider):
              self.cursor.close()
              self.conn.close()
      
      # write the data to Redis
      class RedisPileLine(object):
          conn = None
          def open_spider(self,spider):
              self.conn = Redis(host='127.0.0.1',port=6379)
              print(self.conn)
          def process_item(self,item,spider):
              self.conn.lpush('duanziData',item)  # with newer redis-py versions this may raise an error; see the redis==2.10.6 note above
              return item
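
      To run all three pipelines, each class has to be registered in ITEM_PIPELINES. A sketch assuming the project module is named duanziPro (the module name is a guess based on the class names; adjust it to the real project name):

      ITEM_PIPELINES = {
         'duanziPro.pipelines.DuanziproPipeline': 300,  # text file, runs first
         'duanziPro.pipelines.MysqlPipeLine': 301,      # MySQL
         'duanziPro.pipelines.RedisPileLine': 302,      # Redis, runs last
      }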
      


Origin www.cnblogs.com/zhufanyu/p/12013073.html