Scrapy framework: a beginner's practical notes

These are a crawler newbie's first attempt at Scrapy, so please point out any mistakes!

Install Python + PyCharm

There are many online installation tutorials, so I won’t repeat them here.
PyCharm crack reference: https://blog.csdn.net/qs17809259715/article/details/90115751

Introduction to the Scrapy framework

Scrapy is a fast, high-level Python framework for screen scraping and web crawling. It crawls websites and extracts structured data from their pages, which can be used for data mining, monitoring, and automated testing, and it can be customized to fit specific needs.

Scrapy architecture diagram


The functions of each component are as follows

  • Scrapy Engine: handles the data flow of the whole system and is its core component.
  • Scheduler: accepts Requests sent by the engine, pushes them into a queue, and returns them when the engine asks for them again.
  • Downloader: fetches the web content for the Requests sent by the engine and returns the resulting Responses to the Spider.
  • Spiders: process the Responses to extract the required fields (i.e. Items), or extract further links from the Responses so that Scrapy can keep crawling.
  • Item Pipeline: processes the Items produced by the Spider, cleans the data, and saves what is needed.
  • Downloader Middlewares: mainly process the requests and responses passing between the Scrapy engine and the downloader.
  • Spider Middlewares: mainly process the Responses going into and the Requests coming out of the Spiders.

Scrapy work flow chart


  • The SPIDERS yield requests and send them to the ENGINE.
  • The ENGINE passes each request on to the SCHEDULER without any processing.
  • The SCHEDULER (URL scheduler) generates the next request and hands it to the ENGINE.
  • The ENGINE takes the request, filters it layer by layer through the MIDDLEWARE, and sends it to the DOWNLOADER.
  • The DOWNLOADER fetches the response data from the Internet, passes it back through the MIDDLEWARE, and sends it to the ENGINE.
  • The ENGINE returns the response data to the SPIDERS; the parse() method of the SPIDERS processes the response, parses out items or new requests, and sends them to the ENGINE.
  • The ENGINE sends the items to the ITEM PIPELINES and the requests to the SCHEDULER.

Note: the whole program only stops when no requests are left in the scheduler (this also means that Scrapy will re-download URLs whose download failed).
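
To make this cycle concrete, here is a minimal sketch (my own illustration, not from the original post) of a spider whose parse() yields both items, which go to the item pipelines, and new requests, which go back to the scheduler:

import scrapy

class FlowDemoSpider(scrapy.Spider):
    # hypothetical spider, used only to illustrate the request/item cycle
    name = "flow_demo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # an item: ENGINE -> ITEM PIPELINES
        yield {"title": response.xpath("//title/text()").get()}
        # a request: ENGINE -> SCHEDULER -> DOWNLOADER -> back into parse()
        for href in response.xpath("//a/@href").getall():
            yield response.follow(href, callback=self.parse)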

Install Scrapy

pip install pywin32

pip install zope.interface

pip install Twisted

pip install pyOpenSSL

pip install Scrapy
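
After installing, a quick sanity check (my own addition, not part of the original steps) is to print the version from a Python console:

import scrapy
print(scrapy.__version__)  # prints the installed Scrapy version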

Create scrapy project

In the PyCharm terminal, enter:

scrapy startproject [project_name]

Move into the newly created project folder

cd [project_name]

Create a spider

scrapy genspider -t crawl [spider_name] [domain_to_crawl]

Create start.py (crawler startup file)

from scrapy import cmdline

cmdline.execute("scrapy crawl [spider_name]".split())
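
Running this start.py from PyCharm is equivalent to typing scrapy crawl [spider_name] in the terminal; the advantage is that the spider can be launched and debugged with PyCharm's normal Run/Debug tools.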

The project and spider are now created.

Hands-on: crawling Biquge

These are beginner's notes, so experienced readers please go easy on me.

Crawl requirements

Crawl the origin_url, title, author, class_ (novel category), and introduction of every novel on the Biquge homepage, and store them in JSON format.
(The chapter text itself is too large, so it is not crawled.)

Website analysis

  • Open Chrome and go to the Biqugex homepage https://www.biqugex.com, then right-click and Inspect; the URLs of all the novels match the pattern /book_.+

  • Open a novel's detail page; all the information we need sits under div class='book' (the XPath expressions for these fields can be tested first, see the scrapy shell sketch below).
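
Before writing the spider, the XPath expressions can be tried out interactively with scrapy shell. A sketch (the book URL below is a hypothetical example of a /book_ page):

# in the terminal:  scrapy shell "https://www.biqugex.com/book_1"
# then, at the interactive prompt:
response.xpath("//h2/text()").get()                           # novel title
response.xpath("//div[@class='small']/span[1]/text()").get()  # "作者:..." (author)
response.xpath("//div[@class='small']/span[2]/text()").get()  # "分类:..." (category)
response.xpath("//div[@class='intro']/text()[1]").get()       # introduction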

Create a biqugex project

In the PyCharm terminal, enter:

scrapy startproject biqugex

cd biqugex

scrapy genspider -t crawl bqg_spider biqugex.com

Create start.py (crawler startup file)

from scrapy import cmdline

cmdline.execute("scrapy crawl bqg_spider".split())

The project and spider are now created.

settings.py (configuration file)

ROBOTSTXT_OBEY = False  # do not obey robots.txt

DOWNLOAD_DELAY = 1  # 1-second download delay

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
}  # set the default request headers, including the User-Agent

DOWNLOADER_MIDDLEWARES = {
   'biqugex.middlewares.BiqugexDownloaderMiddleware': 543,
}  # uncomment DOWNLOADER_MIDDLEWARES (the module path must match the project name, biqugex)

ITEM_PIPELINES = {
   'biqugex.pipelines.BiqugexPipeline': 300,
}  # uncomment ITEM_PIPELINES

items.py (define the fields to scrape)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BiqugexItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    origin_url = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    class_ = scrapy.Field()
    introduction = scrapy.Field()
    content = scrapy.Field()  # reserved for chapter text, which is not crawled in this example
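
A Scrapy Item behaves much like a dict, which is why the pipeline below can call dict(item). A quick sketch with made-up values:

from biqugex.items import BiqugexItem

item = BiqugexItem(title="Example novel", author="Somebody")
item["class_"] = "Fantasy"       # fields can also be set like dictionary keys
print(dict(item))                # e.g. {'title': 'Example novel', 'author': 'Somebody', 'class_': 'Fantasy'}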

bqg_spider.py (crawler)

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from biqugex.items import BiqugexItem

class BqgSpider(CrawlSpider):
    name = 'bqg_spider'
    allowed_domains = ['biqugex.com']
    start_urls = ['https://www.biqugex.com']

    rules = (
        Rule(LinkExtractor(allow=r'/book_.+'), callback='parse_item', follow=False),
    )  # follow links matching /book_.+ and send each response to parse_item

    def parse_item(self, response):  # extract the needed fields with XPath, then yield an item
        origin_url = response.url
        title = response.xpath("//h2/text()").get().strip()
        author = response.xpath("//div[@class='small']/span[1]/text()").get().lstrip('作者:').strip()
        class_ = response.xpath("//div[@class='small']/span[2]/text()").get().lstrip('分类:').strip()
        introduction = response.xpath("//div[@class='intro']/text()[1]").get().replace("\n", "").replace("\r", "").strip()

        item = BiqugexItem(
            origin_url = origin_url,
            title = title,
            author = author,
            class_ = class_,
            introduction = introduction
        )

        yield item
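
One caveat with the extraction above: .get() returns None when an XPath matches nothing, so the chained .strip() would raise an AttributeError on a page with a different layout. A small defensive helper (my own sketch, not part of the original spider) could be used instead:

def safe_text(response, query, default=""):
    # return the stripped text of the first XPath match, or the default if nothing matched
    value = response.xpath(query).get()
    return value.strip() if value else default

# e.g. title = safe_text(response, "//h2/text()")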

pipelines.py (save the items to a JSON file)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import json

class BiqugexPipeline(object):
    def __init__(self):
        self.file = open('novel.json', 'w', encoding='utf-8')

    def open_spider(self, spider):  # Scrapy calls this hook when the spider starts (the original name start_spider is never called)
        print("Spider started " + "<==" * 10)

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"  # ensure_ascii=False keeps the Chinese text readable
        self.file.write(line)
        return item

    def close_spider(self, spider):  # called when the spider finishes
        print("Spider finished " + "<==" * 10)
        self.file.close()
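
A custom pipeline is not the only way to produce JSON. Scrapy's built-in feed exports do the same job; a minimal sketch, assuming Scrapy 2.1 or newer for the FEEDS setting (on older versions, running scrapy crawl bqg_spider -o novel.jl from the terminal gives a similar result):

# in settings.py: write one JSON object per line, keeping Chinese characters readable
FEEDS = {
    "novel.jl": {"format": "jsonlines", "encoding": "utf8"},
}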

Run the crawler

  • Run start.py and, after a short wait, the scraped results appear in novel.json.

Concluding remarks

This simple crawl of Biquge is meant to build familiarity with the basic Scrapy workflow. If there are any errors, please correct me!

Original post: https://blog.csdn.net/else_tdk/article/details/102229291