Python crawler: the Scrapy library

General introduction

Scrapy is an open source Python framework for scraping website data. It provides a powerful and flexible set of tools for extracting the required data from a website. Scrapy is built on top of the Twisted asynchronous network library, so it can handle large numbers of concurrent requests efficiently. Here are some of Scrapy’s main features and components:

  1. Selectors: Scrapy uses XPath and CSS selectors to locate and extract data from web pages. This makes it very convenient to locate and extract the required information when processing HTML or XML documents.

  2. Item: A container that defines the structured data to be extracted from the web page. By creating a custom Item class, the structure of the data can be standardized, making the data extraction process clearer and more maintainable.

  3. Pipeline: Pipeline is the component that processes data extracted by Spiders. By writing custom pipelines, data can be cleaned, validated, and stored, for example by saving it to a database or exporting it to a file.

  4. Middleware: Middleware hooks into Scrapy's handling of requests and responses. Middlewares can modify a request before it is sent to the server and a response after it comes back, which makes it possible to add custom behavior such as proxies or custom user agents to the crawling process (a minimal sketch follows this list).

  5. Downloader: The component responsible for sending HTTP requests and receiving HTTP responses. Scrapy's downloader supports concurrent requests, and the level of concurrency can be configured through settings.

  6. Scheduler: The component that controls when the Spider's requests are sent. The scheduler maintains a request queue and dispatches requests according to certain rules so that data can be crawled efficiently.

  7. Spider Middleware: Similar to downloader middleware, but specifically used to process the responses going into Spiders and the requests and items coming out of them.

  8. Project: The Scrapy project is an overall structure that contains crawlers, Item definitions, pipelines and other configurations. A Scrapy project can contain multiple spiders, and each spider defines specific crawling rules.
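
As an illustration of item 4, here is a minimal downloader-middleware sketch (the class name, header value, and module path are illustrative, not part of Scrapy): it rewrites the User-Agent header before each request is sent and is enabled through the DOWNLOADER_MIDDLEWARES setting.

class CustomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Set a custom User-Agent header on every outgoing request
        request.headers['User-Agent'] = 'MyCrawler/1.0'
        return None  # returning None lets Scrapy continue processing the request

# In settings.py (my_project is a placeholder for the actual project name):
# DOWNLOADER_MIDDLEWARES = {'my_project.middlewares.CustomUserAgentMiddleware': 543}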

Scrapy makes it easy to build a flexible, efficient, and easy-to-maintain web crawler. Here is a simple Scrapy crawler example:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Code that extracts data from the page
        title = response.css('h1::text').get()
        yield {'title': title}

The above code defines a Spider named my_spider with the starting URL http://example.com. In the parse method, a CSS selector extracts the title from the page, and yield passes the result on to the pipelines for processing.
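
Inside a Scrapy project, this spider would normally be run with the scrapy crawl my_spider command. As a sketch of running it as a standalone script instead, Scrapy's CrawlerProcess can be used (the output file name is illustrative):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEEDS': {'output.json': {'format': 'json'}},  # export scraped items to a JSON file
})
process.crawl(MySpider)
process.start()  # blocks until the crawl is finished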

Related modules

The Scrapy library contains several important modules, each with specific functions and used for different tasks. The following are some commonly used modules in the Scrapy library:

  1. scrapy.Spider:
    • Core module used to define the basic structure and behavior of the crawler.
    • Developers need to create a class that inherits from scrapy.Spider and define the starting URLs as well as the rules for following links, extracting data, and so on.
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Crawler logic: extract data and/or follow links here
        yield {'url': response.url}
  2. scrapy.Item:
    • A container used to define structured data that needs to be extracted from a web page.
    • Developers standardize the data structure by creating custom Item classes.
import scrapy

class MyItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
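
As a usage sketch (ItemSpider is an illustrative name; MyItem is the class defined above), an Item is instantiated and populated like a dictionary inside a Spider and then yielded:

import scrapy

class ItemSpider(scrapy.Spider):
    name = 'item_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        item = MyItem()
        item['title'] = response.css('h1::text').get()
        item['link'] = response.url
        yield item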
  3. scrapy.Selector:
    • A tool for extracting data from web pages that supports both XPath and CSS selectors.
    • In Spider, you can use response.css or response.xpath to create a Selector object and use the corresponding selector expression to extract data.
title = response.css('h1::text').get()
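
For comparison, the same title can be extracted with an XPath expression, and .getall() returns every match instead of only the first:

title = response.xpath('//h1/text()').get()
titles = response.css('h1::text').getall()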
  4. scrapy.Request:
    • An object used to define the HTTP request to be sent.
    • In a Spider, you can use scrapy.Request to create a request object and specify a callback function to handle the response.
yield scrapy.Request(url='http://example.com', callback=self.parse)
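
A common pattern is to yield new requests for the links found on a page. The following is a minimal sketch (LinkSpider and parse_detail are illustrative names):

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'link_spider'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Follow every link on the page; urljoin resolves relative URLs
        for href in response.css('a::attr(href)').getall():
            yield scrapy.Request(url=response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'title': response.css('h1::text').get()}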
  5. scrapy.loader.ItemLoader:
    • A tool for loading Items, providing a convenient API to populate Item fields.
    • An ItemLoader can be used in a Spider to load data into an Item.
from scrapy.loader import ItemLoader

loader = ItemLoader(item=MyItem(), response=response)
loader.add_css('title', 'h1::text')
loader.add_value('link', response.url)
yield loader.load_item()
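
ItemLoaders become more useful when combined with input/output processors. A minimal sketch, assuming a recent Scrapy version where the processors live in the bundled itemloaders package (MyItemLoader is an illustrative name):

from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class MyItemLoader(ItemLoader):
    # Keep only the first extracted value for each field
    default_output_processor = TakeFirst()
    # Strip surrounding whitespace from every value loaded into 'title'
    title_in = MapCompose(str.strip)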
  6. Item Pipeline:
    • The component for processing data extracted by Spiders; a pipeline is a plain class with a process_item method, usually defined in the project's pipelines.py and enabled through the ITEM_PIPELINES setting.
    • Developers can write custom pipelines to process, validate or store data.
class MyPipeline:
    def process_item(self, item, spider):
        # Logic for processing the item goes here
        return item
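
A pipeline only runs once it is enabled in the project's settings; the number controls the order in which pipelines run (my_project is a placeholder for the actual project name):

ITEM_PIPELINES = {
    'my_project.pipelines.MyPipeline': 300,  # lower values run earlier (range 0-1000)
}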
  7. scrapy.settings:
    • Holds the configuration settings of a Scrapy project. Parameters such as the download delay or the enabled middlewares are usually set in the project's settings.py file.
BOT_NAME = 'my_project'
DOWNLOAD_DELAY = 2
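
Settings can also be overridden for a single spider through the custom_settings class attribute, as in this small sketch (PoliteSpider is an illustrative name):

import scrapy

class PoliteSpider(scrapy.Spider):
    name = 'polite_spider'
    start_urls = ['http://example.com']
    # These values override the project-wide settings for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 8,
    }

    def parse(self, response):
        yield {'url': response.url}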
  8. scrapy.exceptions:
    • Contains the exception classes of the Scrapy library, which can be used to handle or signal exceptional conditions in the crawler.
from scrapy.exceptions import CloseSpider

raise CloseSpider('Crawling stopped due to a specific condition')
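
Another commonly used exception is DropItem, which a pipeline can raise to discard an item. A minimal sketch (ValidationPipeline is an illustrative name):

from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        # Discard items that are missing a title
        if not item.get('title'):
            raise DropItem('Missing title in %s' % item)
        return item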

This is only a brief introduction to some of the commonly used modules in the Scrapy library. In practice, developers can explore and use other modules as needed to build more powerful and customized crawlers.

Origin blog.csdn.net/weixin_74850661/article/details/134403057