Course Note 7: Scrapy Framework - Configurable Crawler

Build a basic crawler

1. Create a new project:

scrapy startproject scrapyuniversaldemo

2. View the available templates and create a spider from the crawl template

scrapy genspider -l
# listing the templates is optional
scrapy genspider -t crawl movie ssr1.scrape.center

3. Use Rule objects in the spider's rules attribute to define the crawling and parsing logic for the index pages

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import MovieItem


class MovieSpider(CrawlSpider):
    name = 'movie'
    allowed_domains = ['ssr1.scrape.center']
    start_urls = ['http://ssr1.scrape.center/']

    rules = (
        Rule(LinkExtractor(restrict_css='.item .name'), callback='parse_detail', follow=True),
        Rule(LinkExtractor(restrict_css='.next'), follow=True)
    )

Three parameters of Rule are used here: link_extractor, callback, follow

  • link_extractor: a LinkExtractor object that defines which links to extract; each extracted link automatically generates a Request;
  • callback: the method that handles the response for each extracted link and returns a list of Item or Request objects (do not name it parse, because CrawlSpider uses parse internally);
  • follow: a Boolean specifying whether links extracted from the responses matched by this rule should be followed further (that is, used to generate more Requests). If it is not set, it defaults to True when callback is None and to False otherwise.

Only one LinkExtractor parameter, restrict_css, is used here; its commonly used parameters are listed below (a short sketch follows the list):

  • restrict_css: extract links only from the regions of the page matched by the given CSS selector(s)
  • restrict_xpaths: extract links only from the regions matched by the given XPath expression(s)
  • tags: the tags to extract links from; defaults to ('a', 'area')
  • attrs: the attributes to extract links from, used together with tags; defaults to ('href',)
  • unique: whether to deduplicate the extracted links; defaults to True
  • strip: whether to strip leading and trailing whitespace from the extracted values; defaults to True
  • allow: a regular expression (or list of them) acting as a whitelist for link URLs
  • deny: a regular expression (or list of them) acting as a blacklist for link URLs
  • allow_domains: a whitelist of domains
  • deny_domains: a blacklist of domains
  • deny_extensions: a blacklist of file extensions (the default list includes 7z, 7zip, apk, dmg, ico, iso, tar, etc.)
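
For reference, a minimal sketch combining several of these parameters; the allow/deny patterns and the domain list are illustrative only and are not needed for this project:

from scrapy.linkextractors import LinkExtractor

# Hypothetical extractor: only take detail links from the list area,
# stay on ssr1.scrape.center and skip any /login URLs.
extractor = LinkExtractor(
    restrict_css='.item .name',            # limit extraction to this region
    allow=r'/detail/\d+',                  # URL whitelist (regex), illustrative
    deny=r'/login',                        # URL blacklist (regex), illustrative
    allow_domains=['ssr1.scrape.center'],  # domain whitelist
    unique=True,                           # deduplicate links (the default)
)
# extractor.extract_links(response) returns the Link objects found in a Response.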

4. Create a parse_detail method in the spider to define the parsing logic for the detail pages

# ... imports omitted ...


class MovieSpider(CrawlSpider):
    # ... omitted ...

    def parse_detail(self, response):
        item = MovieItem()
        item['name'] = response.css('.item h2::text').get()
        item['categories'] = response.css('.categories .category span::text').getall()
        item['cover'] = response.css('.cover::attr(src)').get()
        item['published_at'] = response.css('.info span::text').re_first(r'(\d{4}-\d{2}-\d{2})\s?上映')
        item['score'] = response.css('.score::text').get().strip()
        item['drama'] = response.css('.drama p::text').get().strip()
        yield item

5. Define MovieItem in items.py and declare the required fields

import scrapy


class MovieItem(scrapy.Item):
    name = scrapy.Field()
    cover = scrapy.Field()
    categories = scrapy.Field()
    published_at = scrapy.Field()
    drama = scrapy.Field()
    score = scrapy.Field()

At this point the crawler already follows the index and detail pages and yields the items defined above (the output screenshot is omitted here).
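
It can be run from the project directory; exporting the items to a file is optional and the file name is arbitrary:

scrapy crawl movie
# or, to also export the scraped items:
scrapy crawl movie -o movies.json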

Improve it into a semi-configurable crawler

However, this implementation is not yet configurable and needs to be modified:

6. Rewrite the parse_detail method using an Item Loader

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import MovieItem
from ..loaders import MovieItemLoader


class MovieSpider(CrawlSpider):
    name = 'movie'
    allowed_domains = ['ssr1.scrape.center']
    start_urls = ['http://ssr1.scrape.center/']

    rules = (
        Rule(LinkExtractor(restrict_css='.item .name'), callback='parse_detail', follow=True),
        Rule(LinkExtractor(restrict_css='.next'), follow=True)
    )

    def parse_detail(self, response):
        loader = MovieItemLoader(item=MovieItem(), response=response)
        # Declare a MovieItem and instantiate MovieItemLoader with this Item and the Response
        loader.add_css('name', '.item h2::text')
        # Call add_css to extract the data and assign it to the name field
        loader.add_css('categories', '.categories .category span::text')
        loader.add_css('cover', '.cover::attr(src)')
        loader.add_css('published_at', '.info span::text', re=r'(\d{4}-\d{2}-\d{2})\s?上映')
        loader.add_css('score', '.score::text')
        loader.add_css('drama', '.drama p::text')
        yield loader.load_item()
        # Call load_item to populate and return the Item

Two ItemLoader constructor parameters are used here, item and response; selector is also commonly available (a small sketch follows the list):

  • item: the Item object to fill; methods such as add_xpath, add_css and add_value are then called to populate it
  • response: the Response object, used to construct the selector that extracts the data
  • selector: a Selector object used to extract the data, as an alternative to passing response
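
As a hedged illustration of these constructor arguments (not required for this project), an ItemLoader can also be built from a Selector instead of a Response:

from scrapy.loader import ItemLoader
from scrapy.selector import Selector
from scrapyuniversaldemo.items import MovieItem  # the Item defined in step 5

# Sketch: build a loader from a Selector created out of an HTML snippet.
selector = Selector(text='<div class="item"><h2>Movie Title</h2></div>')
loader = ItemLoader(item=MovieItem(), selector=selector)
loader.add_css('name', '.item h2::text')
print(loader.load_item())  # {'name': ['Movie Title']} with the default processors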

7. Create loaders.py (at the same level as items.py) and define a subclass of ItemLoader in it

  • The default ItemLoader is not sufficient here, so it is subclassed to add the behaviour the project needs;
  • Each ItemLoader field has an Input Processor and an Output Processor;
  • Input Processor: processes the data as soon as it is extracted; the results are collected and stored inside the ItemLoader and are not yet assigned to the Item;
  • Output Processor: after all the data has been collected, load_item is called to populate and generate the Item object; just before each field is assigned, its Output Processor is applied to the collected data.

from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, Identity, Compose


class MovieItemLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # Set the default output processor:
    # like get(), it takes the first extracted value of each field as the final result
    categories_out = Identity()
    # Keep the extracted list unchanged, overriding the default above
    score_out = Compose(TakeFirst(), str.strip)
    drama_out = Compose(TakeFirst(), str.strip)
    # Take the first value and strip surrounding whitespace, overriding the default above

Three processors provided by Scrapy (via the itemloaders package) are used here: TakeFirst, Identity and Compose. The commonly available processors are listed below (a short sketch follows the list):

  • TakeFirst: returns the first non-empty value of a list, similar to get()
  • Identity: returns the input unchanged
  • Join: joins a list into a string (separated by spaces by default)
  • Compose: chains several functions to process a single input value
  • MapCompose: chains several functions and applies them to every element of a list input
  • SelectJmes: queries JSON (pass in a key and it returns the value found; the jmespath library must be installed first)
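
A small sketch of how these processors behave on plain Python values, independent of any spider:

from itemloaders.processors import TakeFirst, Identity, Join, Compose, MapCompose

print(TakeFirst()(['', 'Action', 'Drama']))              # 'Action'  (first non-empty value)
print(Identity()(['Action', 'Drama']))                    # ['Action', 'Drama']  (unchanged)
print(Join(', ')(['Action', 'Drama']))                    # 'Action, Drama'
print(Compose(TakeFirst(), str.strip)([' 9.5 ']))         # '9.5'
print(MapCompose(str.strip, str.upper)([' a ', ' b ']))   # ['A', 'B']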


The output is the same as before. At this point the crawler is semi-configurable.

Improve it into a fully configurable crawler

8. Create a universal Spider (universal.py); the second argument to genspider is normally a domain and here serves only as a placeholder

scrapy genspider -t crawl universal universal

9. Create a configs directory (at the same level as the spiders directory) with movie.json inside it, then extract the attributes from the Spider and move them into this JSON file

{
  "spider": "universal",
  "type": "电影",
  "home": "https://ssr1.scrape.center/",
  "settings": {
    "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.43"
  },
  "start_urls": [
    "https://ssr1.scrape.center/"
  ],
  "allowed_domains": [
    "ssr1.scrape.center"
  ],
  "rules": [
    {
      "link_extractor": {
        "restrict_css": ".item .name"
      },
      "follow": true,
      "callback": "parse_detail"
    },
    {
      "link_extractor": {
        "restrict_css": ".next"
      },
      "follow": true
    }
  ]
}

10. Create utils.py (at the same level as items.py) to read the JSON defined above, so it can later be loaded into the Spider dynamically

from os.path import realpath, dirname, join
import json


def get_config(name):
    path = join(dirname(realpath(__file__)), 'configs', f'{name}.json')
    with open(path, 'r', encoding='utf-8') as f:
        return json.loads(f.read())
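
As a quick sanity check, the helper can be called directly (assuming the movie.json from step 9 exists):

config = get_config('movie')
print(config['spider'])           # 'universal'
print(config['allowed_domains'])  # ['ssr1.scrape.center']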

11. Create an entry run.py in the project root directory (same level as scrapy.cfg)

from scrapy.utils.project import get_project_settings
from scrapyuniversaldemo.utils import get_config
from scrapy.crawler import CrawlerProcess
import argparse

parser = argparse.ArgumentParser(description='Universal Spider')
parser.add_argument('name', help='name of spider to run')
args = parser.parse_args()
name = args.name
# argparse requires a name argument at runtime
# (the name of the corresponding JSON configuration file)


def run():
    config = get_config(name)
    # Load the JSON configuration file via get_config
    spider = config.get('spider', 'universal')
    # Get the name of the Spider to use for crawling
    project_settings = get_project_settings()
    settings = dict(project_settings.copy())
    # Get the project-wide settings
    settings.update(config.get('settings', {}))
    # Merge the settings from the configuration file into the project settings
    process = CrawlerProcess(settings)
    # Create a CrawlerProcess: starting the crawl from code allows the Spider
    # and its startup configuration to be customised flexibly
    process.crawl(spider, **{'name': name})
    process.start()


if __name__ == '__main__':
    run()
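
The entry script is then run with the configuration name as its argument (here the movie.json created in step 9):

python run.py movie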

12. Add an __init__ method to universal.py to initialise the configuration

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..utils import get_config

class UniversalSpider(CrawlSpider):
    name = 'universal'

    def __init__(self, name, *args, **kwargs):
        config = get_config(name)
        # Receive the name argument and read the configuration file via get_config
        self.config = config
        self.start_urls = config.get('start_urls')
        self.allowed_domains = config.get('allowed_domains')
        rules = []
        # Initialise start_urls, allowed_domains and rules from the configuration
        for rule_kwargs in config.get('rules'):
            # Iterate over the rules configuration
            link_extractor = LinkExtractor(**rule_kwargs.get('link_extractor'))
            # Each rule's configuration is the rule_kwargs dict;
            # read its link_extractor entry and build a LinkExtractor from it
            rule_kwargs['link_extractor'] = link_extractor
            # Put the LinkExtractor object back into the rule_kwargs dict
            rule = Rule(**rule_kwargs)
            # Initialise a Rule object with rule_kwargs
            rules.append(rule)
            # Collect the Rule objects into a rules list
        self.rules = rules
        super(UniversalSpider, self).__init__(*args, **kwargs)
        # Assign the rules list before calling CrawlSpider.__init__, which compiles them
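
Besides run.py, Scrapy's spider-arguments mechanism can also pass name on the command line; note that this path does not apply the per-config settings block, which only run.py merges:

scrapy crawl universal -a name=movie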

13. Extract the parsing logic from the original Spider and move it into the JSON file

The new item key sits at the same level as rules: class and loader name the Item and ItemLoader classes to use, and attrs defines the extraction rules for each field (JSON does not allow comments, so these annotations cannot stay inside the file itself).

{
  "spider": "universal",
  "type": "电影",
  "home": "https://ssr1.scrape.center/",
  "settings": {
    "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36 Edg/98.0.1108.43"
  },
  "start_urls": [
    "https://ssr1.scrape.center/"
  ],
  "allowed_domains": [
    "ssr1.scrape.center"
  ],
  "rules": [
    {
      "link_extractor": {
        "restrict_css": ".item .name"
      },
      "follow": true,
      "callback": "parse_detail"
    },
    {
      "link_extractor": {
        "restrict_css": ".next"
      },
      "follow": true
    }
  ],
  "item": {
//item和rules同级并列
    "class": "MovieItem",
    "loader": "MovieItemLoader",
//分别代表Item和ItemLoader的所使用的类
    "attrs": {
//定义attrs属性来定义每个字段的提取规则
      "name": [
        {
          "method": "css",
          "arg": ".item h2::text"
        }
      ],
      "categories": [
        {
          "method": "css",
          "args": ".categories button span::text"
        }
      ],
      "cover": [
        {
          "method": "css",
          "arg": ".cover::attr(src)"
        }
      ],
      "published_at": [
        {
          "method": "css",
          "arg": ".info span::text",
          "re": "(\\d{4}-\\d{2}-\\d{2})\\s?上映"
        }
      ],
      "score": [
        {
          "method": "css",
          "arg": ".score::text"
        }
      ],
      "drama": [
        {
         "method": "css",
          "arg": ".drama p::text"
        }
      ]
    }
  }
}

14. Dynamically load the above configuration into the parse_detail method

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..utils import get_config
from .. import items
from .. import loaders

class UniversalSpider(CrawlSpider):
    name = 'universal'

    def __init__(self, name, *args, **kwargs):
        ...  # omitted, same as step 12

    def parse_detail(self, response):
        item = self.config.get('item')
        # Get the item configuration from the JSON file loaded in __init__
        if item:
            cls = getattr(items, item.get('class'))()
            # Look up the configured Item class and instantiate it
            loader = getattr(loaders, item.get('loader'))(cls, response=response)
            # Look up the configured ItemLoader class and instantiate it with the Item and Response
            for key, value in item.get('attrs').items():
                # Iterate over the fields defined in attrs
                for extractor in value:
                    # Dispatch on the method field and call the matching add_* method
                    if extractor.get('method') == 'xpath':
                        loader.add_xpath(key, extractor.get('arg'), **{'re': extractor.get('re')})
                    if extractor.get('method') == 'css':
                        loader.add_css(key, extractor.get('arg'), **{'re': extractor.get('re')})
                    if extractor.get('method') == 'value':
                        loader.add_value(key, extractor.get('arg'), **{'re': extractor.get('re')})
            yield loader.load_item()
            # Once all the configured rules have been applied, call load_item to produce the Item
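
The "value" branch is not exercised by movie.json. As a hedged illustration, a hypothetical attrs entry could fill a field with a constant value instead of a selector (the source field here is made up and would have to be added to MovieItem):

      "source": [
        {
          "method": "value",
          "arg": "ssr1.scrape.center"
        }
      ]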


Compared with an ordinary crawler, the following parts are added: movie.json, universal.py, loaders.py, utils.py and run.py.

movie.json: its content is essentially migrated from the original movie.py

universal.py: the universal (configurable) crawler itself

utils.py: helps the crawler read the JSON configuration files

loaders.py: roughly the "middleware" between the crawler and the Item, which post-processes the data (the JSON only defines how to extract the data, and the crawler has to stay generic, so the processing cannot be hard-coded there; the loaders fill that gap, at least that is how I understand it)

run.py: the project launcher, which pairs a crawler with its JSON configuration file.


Source: blog.csdn.net/weixin_58695100/article/details/122793427