Crawler framework Scrapy study notes-2

Preface

Scrapy is a powerful Python crawler framework that is widely used to crawl and process data on the Internet. This article introduces Scrapy's architecture, workflow, and installation steps, and walks through a sample crawler in detail, aiming to help beginners understand how to build and run their own web crawlers with Scrapy.

Scrapy architecture overview

Scrapy consists of the following main components:

  • Engine : Controls the flow of data between all components and triggers events when certain actions occur.
  • Scheduler : Receives Requests from the engine, deduplicates and enqueues them, and hands them back to the engine when asked.
  • Downloader : Fetches the response for each Request sent by the engine and returns the Response to the engine.
  • Spider : Parses response content and extracts Item data or generates additional Requests.
  • Item Pipeline : Processes the Items extracted by the spider, performing follow-up work such as data cleaning and storage.
  • Downloader Middlewares : Hooks for customizing and extending the behaviour of the downloader.
  • Spider Middlewares : Hooks for customizing and extending the behaviour of the spiders.

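For orientation, this is the default project layout generated by scrapy startproject (using the demo project created later in this article); the comments are a rough reading aid showing which component each file relates to, not an official mapping.

demo/
    scrapy.cfg              # project configuration entry point
    demo/
        items.py            # Item definitions (the data containers)
        middlewares.py      # Downloader / Spider middlewares
        pipelines.py        # Item Pipelines
        settings.py         # project settings (concurrency, delays, pipelines, ...)
        spiders/            # Spider classes live here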

Scrapy workflow

  1. The Engine opens a website (issues the start Requests) and hands them to the Scheduler to be enqueued.
  2. The Scheduler dequeues a Request and sends it, through the Downloader Middlewares, to the Downloader for execution.
  3. The Downloader fetches the page data and returns a Response through the Downloader Middlewares.
  4. The Response is sent to the Spider through the Spider Middlewares.
  5. The Spider parses the Response, extracting Items and generating new Requests.
  6. Items pass through the Spider Middlewares and are sent to the Item Pipeline for processing.
  7. Requests pass through the Spider Middlewares and are sent back to the Scheduler to be enqueued.
  8. The Scheduler sends the next Request to the Downloader and steps 2-7 repeat.

A more visual Scrapy workflow

Think of Scrapy as a data-collection factory; its components correspond to the following roles:

Role correspondence

  • Engine - Supervisor
  • Scheduler - warehouse manager
  • Downloader - Buyer
  • Spider - Processing and Assembling Worker
  • Item Pipeline - Quality Inspection Department and Finished Product Warehouse
  • Downloader Middlewares - Procurement Assistant
  • Spider Middlewares - Pipeline Management Engineer

Work process

  1. The supervisor instructs the warehouse manager to provide raw materials - the Engine opens the website and hands the Requests to the Scheduler
  2. The warehouse manager takes the raw-material orders out in sequence - the Scheduler processes the Requests and dequeues them
  3. The order is handed to the buyer for purchasing - the Request is sent to the Downloader through the Downloader Middlewares
  4. The buyer goes out to purchase and the assistant pre-processes the goods - the Downloader fetches the data and returns a Response through the Downloader Middlewares
  5. The purchased raw materials are sent to the workers for processing and assembly - the Response is sent to the Spider via the Spider Middlewares
  6. The engineer checks that the workflow is followed and optimizes it - the Spider parses the Response, extracting Items and generating Requests
  7. Finished products are sent to quality inspection and the finished-product warehouse - Items are sent to the Item Pipeline via the Spider Middlewares
  8. The workers report that they need more raw materials - Requests are sent back to the Scheduler via the Spider Middlewares
  9. Repeat steps 2-8 until the order is complete - the Scheduler sends the next Request to the Downloader and the cycle continues

Scrapy installation

It is recommended to create a separate environment, using Virtualenv or Conda
(there is a separate write-up on semi-automatically using a .bat script to package and migrate Python projects).

Python 3.9
Scrapy 2.5.1 -> scrapy-redis (0.7.2)

Note: because of recent Scrapy upgrades, scrapy-redis does not work with the newest Scrapy releases,
so version 2.5.1 is used here for learning; you can upgrade later as scrapy-redis catches up.
After the installation completes, also pin the pyOpenSSL/cryptography versions as shown below.
Finally, run scrapy version and scrapy version --verbose in the console;
if the version numbers are displayed, the installation succeeded.

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple scrapy==2.5.1
pip install pyopenssl==22.0.0
pip install cryptography==36.0.2 
scrapy version
scrapy version --verbose 

Scrapy usage workflow

  1. Create a project
    scrapy startproject <project name>
  2. Enter the project directory
    cd <project name>
  3. Generate a spider
    scrapy genspider <spider name> <website domain>
  4. Adjust the spider
    set start_urls
    and implement the parsing logic in the parse method

For example, you can run:

scrapy startproject demo
cd demo
scrapy genspider example example.com

You will get:

demo/demo/spiders/example.py
# Import the scrapy framework
import scrapy

# Define the spider class ExampleSpider, which inherits from scrapy.Spider
class ExampleSpider(scrapy.Spider):
    # Name of the spider
    name = 'example'
    # List of domains the spider is allowed to crawl
    allowed_domains = ['example.com']
    # List of start URLs
    start_urls = ['http://example.com/']

    # Callback that parses the response content
    def parse(self, response):
        pass
        # Simply print the response content
        # print(response.text)
        # You can use selectors here to extract data
        # Use the yield keyword to yield further Requests
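
As a minimal sketch of what those commented hints could become, the parse method below extracts the page title and follows in-page links; the CSS selectors are generic placeholders, not tied to any particular site.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # extract one field with a CSS selector (placeholder selector)
        yield {'title': response.css('title::text').get()}
        # follow links found on the page and parse them with the same callback
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)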


  5. Adjust the settings configuration file
    To silence the log output (keeping only useful content), just adjust the logging level:
    LOG_LEVEL = "WARNING"
demo/demo/settings.py

# Scrapy settings for the demo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings in the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'demo'  # bot name

SPIDER_MODULES = ['demo.spiders']  # where the spider modules live
NEWSPIDER_MODULE = 'demo.spiders'  # where newly generated spiders are placed

# Crawl responsibly by identifying yourself (and your website) in the User-Agent
# USER_AGENT = 'demo (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure the maximum number of concurrent requests performed by Scrapy (default: 16)
# (Scrapy runs requests as coroutine-style asynchronous tasks by default)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests to the same website (default: 0); it is recommended to enable and increase it
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also the autothrottle settings and docs
# DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable the Telnet console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'demo.middlewares.DemoSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'demo.middlewares.DemoDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#    'demo.pipelines.DemoPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html

# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should send in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
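
In practice a beginner only touches a handful of these settings. The following is a hedged example of the values adjusted over the course of this article (log level, politeness delay, and a pipeline entry); treat the exact numbers as suggestions rather than requirements.

# commonly adjusted values in demo/demo/settings.py (example only)
LOG_LEVEL = 'WARNING'        # hide INFO/DEBUG noise, keep warnings and errors
ROBOTSTXT_OBEY = True        # respect robots.txt
DOWNLOAD_DELAY = 1           # wait between requests to the same site
# ITEM_PIPELINES = {'demo.pipelines.DemoPipeline': 300}   # enable once a pipeline exists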

  6. Run the scrapy program
    scrapy crawl <spider name>
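
Before writing any pipeline code, you can also let Scrapy's built-in feed exports dump the yielded items straight to a file; in Scrapy 2.x, -O overwrites the output file and -o appends to it. For the example spider generated above this would look like:

scrapy crawl example -O items.json
scrapy crawl example -o items.csv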

Example

Create the project

scrapy startproject xyx
cd xyx
scrapy genspider xiaoyx 4399.com

xyx/xyx/spiders/xiaoyx.py

import scrapy

class XiaoyxSpider(scrapy.Spider):
    name = 'xiaoyx'  # name of the spider
    allowed_domains = ['4399.com']  # domains the spider is allowed to crawl
    start_urls = ['http://www.4399.com/flash/']  # initial URL to crawl

    def parse(self, response):
        # Extract data with XPath
        li_list = response.xpath("//ul[@class='n-game cf']/li")  # li elements containing the game information
        for li in li_list:
            name = li.xpath("./a/b/text()").extract_first()  # game name
            fenlei = li.xpath("./em/a/text()").extract_first()  # category
            shijian = li.xpath("./em/text()").extract_first()  # date/time

            # Return the extracted data with yield
            yield {
                "name": name,      # game name
                "fenlei": fenlei,  # category
                "shijian": shijian  # date/time
            }

This code is a web crawler written using the Scrapy framework to extract game information from the 'http://www.4399.com/flash/' website. Here is a detailed explanation of the code:

  • import scrapy: imports the Scrapy framework module.

  • class XiaoyxSpider(scrapy.Spider): defines a Scrapy spider class, XiaoyxSpider.

  • name = 'xiaoyx': specifies the name of the spider.

  • allowed_domains = ['4399.com']: sets the domains that may be crawled, ensuring only pages under the specified domain are fetched.

  • start_urls = ['http://www.4399.com/flash/']: specifies the initial URL; the spider starts running from this address.

  • def parse(self, response): defines the method that processes the response. Scrapy calls it automatically with the response obtained from the start URL.

  • li_list = response.xpath("//ul[@class='n-game cf']/li"): uses an XPath selector to select the list of li elements that contain the game information.

  • Inside the loop, XPath is used to extract the game name, category, and date from each li element.

  • The yield statement returns the extracted data as dictionaries, each dictionary representing one game record.

The main function of this crawler is to extract game information from the specified web page and output it as dictionaries. This is only one part of a crawler project; typically you will also need to configure other settings and data-processing pipelines to save or further process the crawled data.

yield and return

Inside the Scrapy engine there is a loop that keeps calling next(). Scrapy's engine is an event-driven, asynchronous framework whose core behaviour can be summarized as an event loop: a continuously iterating process responsible for the following work:

  1. Scheduling Requests : The engine starts from the starting URL of the crawler, generates initial requests and puts them into the request queue.

  2. Downloading Requests : The Downloader gets the request from the request queue, sends an HTTP request through the Downloader Middleware, and then waits for the response.

  3. Handling Responses : Once the downloader obtains the response, the engine will send the response to the crawler middleware (Spider Middleware) and crawler callback function (Spider Callback). This is where the crawler code actually processes and parses the page content.

  4. Generating New Requests : In the crawler callback function, you can generate new requests and put them into the request queue for subsequent processing.

  5. Repeat the above process : The engine will continue to execute the above steps in a loop until there are no more requests in the request queue, or the predetermined crawl depth or other termination conditions are reached.

This event loop approach allows Scrapy to crawl data efficiently and perform multiple requests asynchronously. Through the generator, Scrapy can achieve non-blocking crawling, that is, when a request is waiting for a response, the engine can continue to process other requests without waiting for the current request to complete.

This event loop and asynchronous operations are key features of Scrapy, which help to efficiently handle large numbers of HTTP requests and improve the performance of the crawler. At the same time, Scrapy also provides many configuration options and middleware, allowing users to flexibly control and customize the crawling process.

In this Scrapy crawler, yield is used instead of return because the Scrapy framework is asynchronous. Using yield lets you produce data as a generator and hand it back to the Scrapy engine, which can then process the items one by one and pass them to the downstream pipelines for processing, saving, or other operations. This asynchronous approach is useful when crawling large amounts of data or when requests take a long time, because it does not block the execution of the crawler; data is processed as soon as it is ready.

In contrast, using return hands back the data all at once and ends the execution of the parse method. This can make the crawler less efficient, because it waits for the data to be fully assembled before continuing with the next request, which may degrade performance, especially when a large number of pages need to be crawled or asynchronous operations need to complete.

In short, yield lets you process data asynchronously, which improves the efficiency and performance of the crawler, while return can hold the crawler up and reduce efficiency. Therefore, it is generally recommended to use yield to generate and return data in Scrapy crawlers.
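
To make the contrast concrete, here is a sketch (the class and the second method name are made up for illustration): the generator version hands each item to the engine as soon as it is yielded, while the return version builds the whole list in memory before anything is handed back.

import scrapy

class YieldVsReturnSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = 'yield_vs_return_demo'
    start_urls = ['http://www.4399.com/flash/']

    def parse(self, response):
        # generator style: each item reaches the engine as soon as it is yielded
        for li in response.xpath("//ul[@class='n-game cf']/li"):
            yield {"name": li.xpath("./a/b/text()").extract_first()}
        # follow-up requests can be yielded from the same callback, e.g.
        # yield scrapy.Request(next_page_url, callback=self.parse)

    def parse_all_at_once(self, response):
        # return style: the whole list is built in memory first, then handed back in one go
        return [
            {"name": li.xpath("./a/b/text()").extract_first()}
            for li in response.xpath("//ul[@class='n-game cf']/li")
        ]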

xyx/xyx/pipelines.py

Note: to use a pipeline, you need to enable it under ITEM_PIPELINES in settings.py:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'xyx.pipelines.XyxPipeline': 300,
}

The following is a Scrapy data-processing pipeline that takes the data extracted by the spider and saves it to a CSV file called "data.csv":

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class XyxPipeline:
    def process_item(self, item, spider):
        print("I am the pipeline, and what I see is:", item)
        with open("data.csv", mode="a", encoding="utf-8") as f:
            f.write(f"{item['name']},{item['fenlei']},{item['shijian']}\n")
        return item

This code is a custom Scrapy pipeline. Its main function is to save the data extracted by the crawler into a CSV file named "data.csv" and print out the data on the console.

The explanation is as follows:

  • class XyxPipeline:: defines a pipeline class named XyxPipeline.

  • process_item(self, item, spider): the method every Scrapy pipeline must implement. It is called automatically to process the data passed in from the spider. In this example it receives two parameters: item, the data item passed from the spider, and spider, the spider instance currently running.

  • print("I am the pipeline, and what I see is:", item): prints each data item to the console while processing, which is useful for debugging and inspecting the data.

  • with open("data.csv", mode="a", encoding="utf-8") as f:: a file operation that opens a CSV file named "data.csv" for writing. mode="a" opens the file in append mode so that previously written data is not overwritten.

  • f.write(f"{item['name']},{item['fenlei']},{item['shijian']}\n"): writes the extracted data to the file in CSV format.

  • return item: finally, process_item returns item, indicating that processing of this item is finished.

To use this pipeline, you need to add it to the ITEM_PIPELINES setting in Scrapy's settings and give it an appropriate priority. That way, when the spider extracts data, each item will be passed to this pipeline for processing and saving.
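
As a possible refinement (a sketch, not part of the original project): instead of re-opening data.csv for every item, the file can be opened once per crawl using Scrapy's open_spider/close_spider hooks and written through the csv module, which also escapes commas inside field values.

import csv

class XyxCsvPipeline:
    # hypothetical alternative to XyxPipeline, for illustration only
    def open_spider(self, spider):
        # called once when the spider starts: open the file a single time
        self.file = open("data.csv", mode="a", encoding="utf-8", newline="")
        self.writer = csv.writer(self.file)

    def close_spider(self, spider):
        # called once when the spider finishes: release the file handle
        self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow([item["name"], item["fenlei"], item["shijian"]])
        return item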

Pipeline detailed explanation

In Scrapy's Pipeline, data items (items) are passed from the spider to the Pipeline, and the Pipeline's process_item method receives two parameters: item and spider. Their meaning is as follows:

  1. item : the item parameter is a Python dictionary (or a similar object) containing the data extracted by the spider, i.e. the crawled fields. Inside process_item you can access and process these fields to clean, validate, transform or store the data.

    For example:

    def process_item(self, item, spider):
        name = item['name']  # access a field of the data item
        # process the data
        # ...
        return item  # optional: return the processed item
    
  2. spider : the spider parameter is the spider instance currently running. In most cases the data processing in a Pipeline does not need it, but it provides information about the current spider's state and configuration, allowing the data to be handled more flexibly.

Summary: the Pipeline's process_item method receives each data item extracted by the spider as a parameter. After processing, it can optionally return the processed item; the returned item is passed on to the next Pipeline (if there are several) or used for final storage and output. The role of a Pipeline is to process and transform data between the spider and storage.
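
To illustrate how items travel through several pipelines in priority order, here is a hedged sketch of a validation pipeline that drops incomplete items before they reach the storage pipeline; the class name and the settings wiring below are made up for this example.

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class ValidateItemPipeline:
    # runs first (a lower number in ITEM_PIPELINES means higher priority)
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if not adapter.get("name"):
            raise DropItem("missing name field")   # the item never reaches later pipelines
        return item                                # passed on to the next pipeline

# hypothetical wiring in settings.py:
# ITEM_PIPELINES = {
#     "xyx.pipelines.ValidateItemPipeline": 200,
#     "xyx.pipelines.XyxPipeline": 300,
# }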

Run

To silence the log output (keeping only useful content), you only need to adjust the logging level:
LOG_LEVEL = "WARNING"

scrapy crawl xiaoyx

You can also create xyx/runner.py

# -*- coding = utf-8 -*-
"""
# @Time : 2023/9/16 22:55
# @Author : FriK_log_ff 374591069
# @File : runner.py
# @Software: PyCharm
# @Function: (describe the project function here)
"""
from scrapy.cmdline import execute

if __name__ == "__main__":
    execute("scrapy crawl xiaoyx".split())

The function of this code is to call the Scrapy command line tool to run the Scrapy crawler named "xiaoyx". Before executing, please make sure that you have correctly configured the Scrapy project and defined the crawler named "xiaoyx".

To run this script, execute it from the command line, making sure your working directory is the root of the Scrapy project (the directory containing scrapy.cfg) and that Scrapy is installed. Once executed, it will launch the spider and start scraping data from the web pages.
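
An alternative way to run the spider from a script (a sketch, assuming it is executed from the project root so that settings.py can be located) is CrawlerProcess, which blocks until the crawl finishes:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    # load settings.py from the surrounding Scrapy project
    process = CrawlerProcess(get_project_settings())
    process.crawl("xiaoyx")   # spider name, as declared in its name attribute
    process.start()           # blocks here until the crawl is finished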

Summary

Scrapy is a powerful web crawler framework suitable for a wide range of data-scraping scenarios. By understanding its architecture and workflow, you can make better use of Scrapy to build your own crawler projects. In this article, we introduced the basic concepts of Scrapy and provided installation steps and sample crawler code, hoping to help you get started with Scrapy and apply this powerful tool in real projects to obtain the data you need. When using Scrapy, remember to follow each site's rules and crawl ethically and legally.

Special statement:
This tutorial is purely a technical share! It is in no way intended to provide technical support to anyone with malicious intent, and no joint liability is assumed for any misuse of the techniques described. The purpose of this tutorial is to record and share the process of learning the technology.

Original article: blog.csdn.net/qq_42531954/article/details/132908232