Python crawlers: Scrapy (with an example), macOS environment

1. Concept

Scrapy is an open-source web crawling framework written in Python, designed to crawl websites and extract structured data from their pages.

Scrapy is built on the Twisted asynchronous networking framework, which lets it download pages concurrently and therefore much faster.

Official documentation: https://docs.scrapy.org/
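
If Scrapy is not installed yet, it can be installed with pip (assuming a working Python 3 and pip on macOS; your exact setup may vary):

pip install scrapy   # installs Scrapy and its dependencies, including Twisted
scrapy version       # quick check that the command is available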

2. Workflow

  • The start URLs in the spider are built into request objects —> spider middleware —> engine —> scheduler
  • The scheduler hands a request —> engine —> downloader middleware —> downloader
  • The downloader sends the request and receives a response —> downloader middleware —> engine —> spider middleware —> spider
  • The spider extracts new URLs and assembles them into request objects —> spider middleware —> engine —> scheduler, then step 2 repeats
  • The spider extracts data —> engine —> item pipeline, which processes and saves the data (see the sketch after this list)
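
A minimal sketch of steps 4 and 5 above (the spider name, URL, and XPath here are hypothetical, only to show the two kinds of objects a spider callback can yield):

import scrapy

class FlowDemoSpider(scrapy.Spider):
    name = 'flow_demo'  # hypothetical spider, only to illustrate the workflow
    start_urls = ['https://example.com/list']

    def parse(self, response):
        # step 4: extract a URL, wrap it in a Request, and yield it back to the engine/scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
        # step 5: extracted data is yielded as an item and handed to the item pipeline
        yield {'title': response.xpath('//title/text()').extract_first()}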

2.1 Basic crawler process

(figure: diagram of the basic crawler process)

2.2 Basic crawler module relationship

(figure: diagram of how the basic crawler modules relate to each other)

2.3 Scrapy workflow

(figure: diagram of the Scrapy workflow)

3. The specific role of each module in Scrapy

(figure: diagram of what each Scrapy module does)

4. Example (crawling the 100 most recently updated shows from the American drama site meijutt.tv)


4.1 Create a project

cd Desktop  # I like to keep projects on the desktop
scrapy startproject movie  # create a Scrapy project called movie

You should see a folder with the following layout on the desktop, which means the project was created successfully.
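
(A sketch of the standard scrapy startproject output; exact contents may differ slightly between Scrapy versions.)

movie/
    scrapy.cfg            # project configuration entry point
    movie/
        __init__.py
        items.py          # data model (item) definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines for processing/saving data
        settings.py       # project settings
        spiders/          # spider files go here
            __init__.py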


4.2 Create a crawler

cd movie  # first change into the movie project directory
scrapy genspider meiju meijutt.tv  # create a spider named meiju, limited to the meijutt.tv domain

Look inside the spiders directory and you will find a new file, meiju.py: the spider we just created.

4.3 Edit crawler

  • items.py defines the data model (the fields of each item), and pipelines.py processes and saves the data passed back by the Spider (crawler); where to save it is up to the developer.
  • middlewares.py defines and implements middleware components.
  • settings.py is the configuration file for the whole project.
  • The spiders/ directory holds our spider files.

4.3.1 meiju.py

import scrapy
from movie.items import MovieItem


class MeijuSpider(scrapy.Spider):  # inherit from scrapy.Spider
    name = 'test'  # the spider's name (used later by scrapy crawl)
    allowed_domains = ['meijutt.tv']  # allowed domain
    start_urls = ['https://www.meijutt.tv/new100.html']  # the full start URL

    def parse(self, response):
        # select every li under the ul whose class attribute is "top-list  fn-clear"
        movies = response.xpath('//ul[@class="top-list  fn-clear"]/li')
        for each_movie in movies:
            item = MovieItem()
            item['number'] = each_movie.xpath('.//div[@class="lasted-num fn-left"]//text()').extract()[0]  # episode number
            item['name'] = each_movie.xpath('.//h5/a//text()').extract()[0]  # show name
            item['url'] = each_movie.xpath('.//h5/a/@href').extract()[0]  # show URL
            item['time'] = each_movie.xpath('.//div[@class="lasted-time new100time fn-right"]//text()').extract()[0]  # last update time
            yield item  # yield the item (a generator, so items are handed to the pipeline one by one)
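
The XPath expressions above can be tried out interactively before putting them into the spider, for example with scrapy shell (a sketch; the selectors assume the same page structure as the code above):

scrapy shell https://www.meijutt.tv/new100.html
>>> movies = response.xpath('//ul[@class="top-list  fn-clear"]/li')
>>> movies[0].xpath('.//h5/a//text()').extract_first()   # show name of the first entry
>>> movies[0].xpath('.//h5/a/@href').extract_first()     # its URL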

4.3.2 items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    number = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()
    time = scrapy.Field()
    # pass
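
A MovieItem instance behaves like a dictionary, which is how meiju.py fills it; assigning a field that was not declared raises a KeyError (a small hypothetical snippet for illustration):

item = MovieItem()
item['name'] = 'Example'   # declared field, assigned like a dict key
dict(item)                 # -> {'name': 'Example'}
# item['foo'] = 1          # would raise KeyError: 'foo' is not a declared Field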

4.3.3 Edit the configuration file settings.py

Enable the item pipeline by adding (or uncommenting) the following:

ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 100,
}
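
The number 100 is the pipeline's order value: it can range from 0 to 1000, and pipelines with lower values run first. Several pipelines can be chained, for example (JsonWriterPipeline is a hypothetical second pipeline, sketched in 4.3.4 below):

ITEM_PIPELINES = {
    'movie.pipelines.MoviePipeline': 100,       # runs first
    'movie.pipelines.JsonWriterPipeline': 200,  # hypothetical, runs after MoviePipeline
}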

4.3.4 Set up the data-processing script pipelines.py

import json  # imported here but not used yet; see the sketch below for a pipeline that uses it

class MoviePipeline(object):
    def process_item(self, item, spider):
        # pass-through pipeline: each item is returned unchanged
        return item
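
The pipeline above just passes every item through, so nothing is saved yet. Below is a minimal sketch of a pipeline that writes each item to a JSON Lines file (JsonWriterPipeline and movies.jl are names made up for this example; to use it, register it in ITEM_PIPELINES as shown in 4.3.3):

import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the spider is opened
        self.file = open('movies.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider is closed
        self.file.close()

    def process_item(self, item, spider):
        # write each item as one line of JSON
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item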

4.4 Start the crawler

Open a terminal inside the project directory (the one containing scrapy.cfg) and run:

scrapy crawl test  # this must match the spider's name attribute, NOT the file name and NOT a function name!
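
Scrapy can also export the yielded items directly to a file with its built-in feed export, which is a quick way to inspect the result without touching the pipeline:

scrapy crawl test -o movies.json   # write all scraped items to movies.json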


If the scraped items are printed in the terminal output as in the figure below, the content we need has been crawled successfully.
(figure: terminal output listing the scraped items)

Original article: blog.csdn.net/a6661314/article/details/124673353