Python Scrapy framework

Introduction

What is a framework?

To put it bluntly, a framework is essentially a semi-finished project: a project skeleton that already has many common functions integrated and is highly reusable.

Scrapy is an application framework written to crawl websites and extract structured data. It is well known and very powerful. As a framework, it is a highly reusable project template with many functions already integrated (high-performance asynchronous downloading, scheduling queues, distributed crawling, parsing, persistence, etc.). When learning a framework, the focus is on its characteristics and on how to use each of its functions.

How to learn the framework in the early stage?

You just need to learn how to use the various functions integrated into the framework! Do not delve into the source code of the framework in the early stage!

Install

Linux/macOS:
      pip install scrapy (from any directory)

Windows:

      a. pip install wheel (from any directory)

      b. Download the Twisted wheel file from: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

      c. In a terminal, cd into the download directory and run: pip install Twisted‑17.1.0‑cp35‑cp35m‑win_amd64.whl
      Note: if this step fails, try a different version of the .whl file

      d. pip install pywin32 (from any directory)

      e. pip install scrapy (from any directory)

After installation, type the scrapy command in a terminal and press Enter; if there is no "command not found" message, the installation succeeded.

On Windows you can also download and install Visual Studio (its C++ build tools can help if any of the packages above fail to compile).
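
In addition to the command-line check above, a quick sanity check from Python itself (a minimal sketch; it simply imports the package and prints its version):

      import scrapy
      print(scrapy.__version__)  # prints the installed Scrapy version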

Basic use

  • Create project

    • scrapy startproject <project name>

    • Project directory structure:

      • firstBlood   # the folder that contains the project; it is recommended to open this folder in PyCharm
            ├── firstBlood      # project root package
            │   ├── __init__.py
            │   ├── items.py        # defines the format (fields) of the data to be stored
            │   ├── middlewares.py  # all middlewares
            │   ├── pipelines.py    # all pipelines
            │   ├── settings.py     # crawler/project configuration
            │   └── spiders         # spiders folder; the spider code will be written here later
            │       └── __init__.py
            └── scrapy.cfg          # scrapy project configuration; do not delete or modify it
        
        
  • Create a spider (crawler) file:

    • cd project_name (enter the project directory)
    • scrapy genspider <spider file name (any custom name)> <starting url>
      • (For example: scrapy genspider first www.xxx.com)
    • After it is created successfully, a .py spider file is generated in the spiders folder.
  • Write the spider file

    • Understand the different parts of a spider file

    • import scrapy

      class FirstSpider(scrapy.Spider):
          # Spider name: the unique identifier of the spider file; this value is used to locate this one specific spider
          name = 'first'  # no need to change
          # Allowed domains: if enabled, scrapy could only send requests to URLs under the Baidu domain
          # allowed_domains = ['www.baidu.com']
          # Starting URL list: scrapy sends a GET request to each URL stored in the list
          start_urls = ['https://www.baidu.com/', 'https://www.sogou.com']

          # Used exclusively for data parsing
          # Parameter response: the response object corresponding to the request
          # The number of times parse is called depends on the number of elements in start_urls
          def parse(self, response):
              print('The response object is:', response)
      
      
  • Configuration file modification: settings.py (see the settings.py sketch after this list)

    • Do not obey the robots protocol: ROBOTSTXT_OBEY = False
    • Specify the log level to output: LOG_LEVEL = 'ERROR'
    • Set the User-Agent (UA spoofing): USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36'
  • Run the project

    • scrapy crawl <spider name>: runs the spider and shows the log output (recommended)
      scrapy crawl <spider name> --nolog: runs the spider without showing the log output (rarely used)
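
Putting the three settings together, a minimal sketch of the relevant lines in settings.py (the USER_AGENT string is only an example; copy one from your own browser):

      # settings.py
      ROBOTSTXT_OBEY = False   # do not obey the robots protocol
      LOG_LEVEL = 'ERROR'      # only show ERROR-level log output
      USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36'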
      

Data parsing

  • Note: if the terminal is still inside the first project's folder, run cd ../ to return to the parent directory before creating another project.

  • Create a new project for data parsing:

    • Create the project: scrapy startproject <project name>
    • cd <project name>
    • Create a spider file: scrapy genspider <spider file name> www.xxx.com
  • Configuration file modification: settings.py (the same three settings as before; see the settings.py sketch above)

    • Do not obey the robots protocol: ROBOTSTXT_OBEY = False
    • Specify the log level to output: LOG_LEVEL = 'ERROR'
    • Set the User-Agent (UA spoofing): USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36'
  • Write the spider file: spiders/duanzi.py

    • import scrapy

      class DuanziSpider(scrapy.Spider):
          name = 'duanzi'
          # allowed_domains = ['www.xxx.com']
          # Send a request to the home page
          # scrapy sends a GET request to each URL in the list
          start_urls = ['https://ishuo.cn/duanzi']

          def parse(self, response):
              # How to get data out of the response:
              # call the xpath method to parse the response data with XPath expressions
              li_list = response.xpath('//*[@id="list"]/ul/li')
              for li in li_list:
                  # content = li.xpath('./div[1]/text()')[0]
                  # title = li.xpath('./div[2]/a/text()')[0]
                  # # <Selector xpath='./div[2]/a/text()' data='一年奔波,尘缘遇了谁'>
                  # print(title)  # a Selector object; the string we want is stored in its data attribute
                  # Parsing option 1:
                  # title = li.xpath('./div[2]/a/text()')[0]
                  # content = li.xpath('./div[1]/text()')[0]
                  # # extract() takes the value of the data attribute out of the Selector object
                  # print(title.extract())
                  # print(content.extract())
                  # Parsing option 2:
                  # title and content are selector lists, each containing only one element
                  title = li.xpath('./div[2]/a/text()')
                  content = li.xpath('./div[1]/text()')
                  # extract_first() takes the data value out of the Selector at index 0 of the list
                  print(title.extract_first())
                  print(content.extract_first())
      
      

Persistent storage

Two options:

  • Persistent storage based on terminal commands
  • Pipeline-based persistent storage (recommended)

Persistent storage based on terminal commands

  • Only the return value of the parse method can be stored in a text file with a specified suffix.

  • Coding process:

    • In the crawler file, encapsulate all the crawled data into the return value of the parse method.

      • import scrapy

        class DemoSpider(scrapy.Spider):
            name = 'demo'
            # allowed_domains = ['www.xxx.com']
            start_urls = ['https://ishuo.cn/duanzi']

            def parse(self, response):
                # Call the xpath method to parse the response data with XPath expressions
                li_list = response.xpath('//*[@id="list"]/ul/li')
                all_data = []  # all of the crawled data is collected in this list
                for li in li_list:
                    title = li.xpath('./div[2]/a/text()').extract_first()
                    content = li.xpath('./div[1]/text()').extract_first()
                    # Package the title and content into the return value of parse
                    dic = {
                        'title': title,
                        'content': content
                    }
                    all_data.append(dic)

                return all_data
        
        
    • Store the return value of the parse method into a text file with the specified suffix:

      • scrapy crawl <spider file name> -o duanzi.csv
  • Summary:

    • Advantage: simple and convenient
    • Disadvantage: significant limitations
      • Data can only be stored in text files, not in a database
      • The output file suffix is restricted to a fixed set; .csv is usually used
      • The data to be stored must be packaged into the return value of the parse method
Pipeline-based persistent storage

Advantage: greatly improves the efficiency of data storage

Disadvantage: requires more coding

Coding process

1. Perform data parsing in the spider file

def parse(self, response):
    # Call the xpath method to parse the response data with XPath expressions
    li_list = response.xpath('//*[@id="list"]/ul/li')
    for li in li_list:
        title = li.xpath('./div[2]/a/text()').extract_first()
        content = li.xpath('./div[1]/text()').extract_first()

2. Encapsulate the parsed data into an object of type Item

  • 2.1 Define relevant fields in the items.py file

    • class SavedataproItem(scrapy.Item):
          # define the fields for your item here like:
          # name = scrapy.Field()
          # For every field you crawl, define a class variable here to store it
          title = scrapy.Field()
          content = scrapy.Field()
      
  • 2.2 Import the Item class into the spider file, instantiate an item object, and store the parsed data in the item object.

    • # At the top of the spider file, import the Item class
      # (assuming the project is named savedataPro, judging by the generated item class name)
      from savedataPro.items import SavedataproItem

      # Inside the spider class:
          def parse(self, response):
              # Call the xpath method to parse the response data with XPath expressions
              li_list = response.xpath('//*[@id="list"]/ul/li')
              for li in li_list:
                  title = li.xpath('./div[2]/a/text()').extract_first()
                  content = li.xpath('./div[1]/text()').extract_first()
                  # Instantiate an item-type object
                  item = SavedataproItem()
                  # Access the item's two fields with bracket notation and assign
                  # the parsed values to them
                  item['title'] = title
                  item['content'] = content
      

3. Submit the item object to the pipeline

  • # Submit the item object, now holding the parsed data, to the pipeline
    yield item
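
Putting steps 1–3 together, a sketch of what the complete spider looks like once the item is submitted (it assumes the demo spider and the SavedataproItem class shown above, and that the project is named savedataPro):

      import scrapy
      from savedataPro.items import SavedataproItem  # assumed project name

      class DemoSpider(scrapy.Spider):
          name = 'demo'
          start_urls = ['https://ishuo.cn/duanzi']

          def parse(self, response):
              li_list = response.xpath('//*[@id="list"]/ul/li')
              for li in li_list:
                  # parse the two fields and wrap them in an item
                  item = SavedataproItem()
                  item['title'] = li.xpath('./div[2]/a/text()').extract_first()
                  item['content'] = li.xpath('./div[1]/text()').extract_first()
                  # submit one item per entry to the pipeline
                  yield item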
    

Note: the pipeline must be enabled in settings.py before it will receive anything (see step 6 below).


4. Receive item type objects in the pipeline (pipelines.py is the pipeline file)

  • The pipeline can only receive item-type objects; it cannot receive objects of any other type.

  • class SavedataproPipeline:
        # process_item receives the item objects submitted by the spider file
        # The item parameter is the item-type object received by the pipeline
        def process_item(self, item, spider):
            print(item)
            return item
    

5. Perform any form of persistent storage operation on the received data in the pipeline

  • Can be stored in a file or in a database

  • # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


    # useful for handling different item types with a single interface
    from itemadapter import ItemAdapter


    class SavedataproPipeline:
        # Override the parent class's methods
        fp = None
        def open_spider(self, spider):
            print('I am the open_spider method; I run exactly once, when the project starts!')
            self.fp = open('duanzi.txt', 'w', encoding='utf-8')

        # process_item receives the item objects submitted by the spider file
        # The item parameter is the item-type object received by the pipeline
        # process_item is called as many times as the spider file submits an item to it
        def process_item(self, item, spider):
            # an item-type object is essentially a dictionary
            # print(item)
            # get the title and content from the item dictionary
            title = item['title']
            content = item['content']
            self.fp.write(title + ':' + content + '\n')
            print(title, ': crawled and saved successfully!')
            return item

        def close_spider(self, spider):
            print('Executed once, when the spider finishes!')
            self.fp.close()
    

6. Enable the pipeline mechanism in the configuration file

  • Note: By default, the pipeline mechanism is not enabled and needs to be enabled manually in the configuration file.
  • Uncomment ITEM_PIPELINES in settings.py to enable the pipeline mechanism, as in the sketch below.
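
A minimal sketch of the uncommented setting (assuming the project from the example above is named savedataPro; the key is the import path of the pipeline class and the value 300 is its priority, where lower values run earlier):

      # settings.py
      ITEM_PIPELINES = {
          'savedataPro.pipelines.SavedataproPipeline': 300,
      }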

Origin: blog.csdn.net/jiuwencj/article/details/130348233