Scrapy framework introduction
Scrapy is a framework for rapidly developing web crawlers: a working crawler takes only a small amount of code.
Advantages
- Reduces the amount of duplicate code
- Improves development efficiency
Scrapy components
- spiders folder: stores the crawlers; each crawler must have a unique name
- __init__.py: initialization file
- settings.py: configuration file that holds the crawler's settings
- middlewares.py: middleware file, used to intercept requests and configure middleware, e.g. setting cookies, the User-Agent, or proxy IPs
- pipelines.py: pipeline file, used to persist the scraped data
- items.py: item definitions, declaring the fields of the entities to be scraped
Developing with the Scrapy framework
As an example, we will crawl the Douban Top 250 movie list.
In a Python environment, open a command line in your development directory and install Scrapy with pip:

```shell
pip install Scrapy
```
After the installation completes, create a crawler project. The project name can be anything; here it is douban_scrapy:

```shell
scrapy startproject douban_scrapy
```
After creation, enter the generated project directory:

```shell
cd douban_scrapy
```
Inside the project directory, create a crawler. The crawler name can be anything; here it is douban. After the name comes the start URL of the target site:

```shell
scrapy genspider douban https://movie.douban.com/top250
```
After creating the crawler, you can see it in the spiders folder. Next, open the settings.py file in the project directory and add the following configuration:
```python
# Log level: only show warnings and above
LOG_LEVEL = "WARNING"

# Disguise the crawler with a browser User-Agent. You can copy one from your
# browser's developer tools; it does not have to match the one below.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200"

# Respect the site's robots.txt rules
ROBOTSTXT_OBEY = True

# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 8

# Delay between requests (in seconds), to avoid the IP being blocked for
# requesting too frequently
DOWNLOAD_DELAY = 3
```
After configuring the crawler, open the items.py file and define the item class for the fields to be scraped:

```python
import scrapy


class MovieItem(scrapy.Item):
    title = scrapy.Field()
    rank = scrapy.Field()
    subject = scrapy.Field()
```
Open the crawler file just created (douban.py) and modify the parse method, which parses the downloaded HTML:

```python
import scrapy

# Remember to import the item class this way (relative to the project package)
from ..items import MovieItem


class DoubanSpider(scrapy.Spider):
    name = "douban"

    # This method parses the HTML of each downloaded page
    def parse(self, response):
        sel = scrapy.Selector(response)
        list_items = sel.css('#content > div > div.article > ol > li')
        for list_item in list_items:
            movie_item = MovieItem()
            movie_item['title'] = list_item.css('span.title::text').extract_first()
            movie_item['rank'] = list_item.css('span.rating_num::text').extract_first()
            movie_item['subject'] = list_item.css('span.inq::text').extract_first()
            yield movie_item
```
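The selector logic above depends on Scrapy, but the underlying idea (walk the list items and pull out span texts) can be sketched with the standard library's html.parser. This is a simplified, hypothetical stand-in for the CSS selectors, not Scrapy's implementation; the class name TitleCollector and the sample HTML are ours:

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text inside <span class="title"> tags, mimicking in
    spirit what list_item.css('span.title::text') does in the spider."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'span' and dict(attrs).get('class') == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data)


# A tiny HTML fragment shaped like two entries of the Top 250 list
html = ('<ol><li><span class="title">The Shawshank Redemption</span></li>'
        '<li><span class="title">Farewell My Concubine</span></li></ol>')
parser = TitleCollector()
parser.feed(html)
print(parser.titles)  # ['The Shawshank Redemption', 'Farewell My Concubine']
```

In the real spider, Scrapy's selectors do this traversal for you; the sketch only shows why each `list_item` yields at most one title.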
Open the pipelines.py file and configure persistence.
Method 1: save to an Excel file
First install the openpyxl library:

```shell
pip install openpyxl
```

Then add an Excel pipeline class:
```python
import openpyxl


class ExcelPipeline:
    def __init__(self):
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        self.ws.title = 'Top'
        self.ws.append(('Title', 'Rating', 'Subject'))

    def close_spider(self, spider):
        self.wb.save('movie_data.xlsx')

    def process_item(self, item, spider):
        title = item.get('title', '')
        rank = item.get('rank', '')
        subject = item.get('subject', '')
        self.ws.append((title, rank, subject))
        return item
```
Method 2: save to a MySQL database
First install the pymysql driver:

```shell
pip install pymysql
```

Then add a MySQL pipeline class:
```python
import pymysql


class MysqlPipeline:
    def __init__(self):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    password='1234', database='spider',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()
        self.data = []

    def close_spider(self, spider):
        # Write any remaining buffered rows before closing the connection
        if len(self.data) > 0:
            self.__write_to_DB()
        self.conn.close()

    def process_item(self, item, spider):
        title = item.get('title', '')
        rank = item.get('rank', 0)
        subject = item.get('subject', '')
        self.data.append((title, rank, subject))
        # Flush to the database in batches of 100 rows
        if len(self.data) == 100:
            self.__write_to_DB()
            self.data.clear()
        return item

    def __write_to_DB(self):
        self.cursor.executemany(
            'insert into tb_douban_movie(title, rating, subject) values (%s, %s, %s)',
            self.data
        )
        self.conn.commit()
```
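The buffer-and-flush pattern above (accumulate 100 rows, write them in one executemany call, and flush the remainder when the spider closes) can be sketched without a database. BatchWriter below is a hypothetical stand-in in which a plain list plays the role of the table:

```python
class BatchWriter:
    """Minimal sketch of MysqlPipeline's batching, assuming a list can
    stand in for the database table (no pymysql involved)."""
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = []   # rows waiting to be written
        self.written = []  # stands in for tb_douban_movie

    def process_item(self, item):
        self.buffer.append(item)
        # Flush once a full batch has accumulated, like process_item()
        if len(self.buffer) == self.batch_size:
            self._flush()
        return item

    def close(self):
        # Flush any remainder, mirroring close_spider()
        if self.buffer:
            self._flush()

    def _flush(self):
        # The executemany() + commit() equivalent
        self.written.extend(self.buffer)
        self.buffer.clear()


writer = BatchWriter(batch_size=100)
for i in range(250):
    writer.process_item(('title %d' % i, 9.0, 'subject'))
writer.close()
print(len(writer.written))  # 250: two full batches plus a final flush of 50
```

Batching like this trades a little memory for far fewer round trips to the database, which matters when each insert would otherwise be a separate network call.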
After adding one or both pipeline classes, open the settings.py file again and register them:

```python
ITEM_PIPELINES = {
    # Pipeline priority: lower numbers run earlier
    "douban_scrapy.pipelines.ExcelPipeline": 300,
    "douban_scrapy.pipelines.MysqlPipeline": 200,
}
```
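Scrapy sorts ITEM_PIPELINES by value and passes every item through each pipeline in ascending order, so with the numbers above MysqlPipeline (200) sees each item before ExcelPipeline (300). A minimal sketch of that ordering, with hypothetical stand-in functions rather than real pipeline classes:

```python
# Stand-ins for the two pipelines; each appends a marker so the
# traversal order is visible in the final item.
def mysql_stage(item):
    item.append('mysql')  # priority 200: runs first
    return item


def excel_stage(item):
    item.append('excel')  # priority 300: runs second
    return item


# Same shape as ITEM_PIPELINES: callable -> priority number
pipelines = {excel_stage: 300, mysql_stage: 200}

# Sort by priority value, then chain the item through each stage
order = [fn for fn, prio in sorted(pipelines.items(), key=lambda kv: kv[1])]
item = []
for stage in order:
    item = stage(item)
print(item)  # ['mysql', 'excel']
```

Order matters when one pipeline modifies or drops items: a filtering pipeline with a low number can discard an item before a slower export pipeline ever sees it.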
Finally, open a command line in the project directory and start the crawler:

```shell
cd douban_scrapy
scrapy crawl douban
```

When it finishes, you will find the Excel file (movie_data.xlsx) and/or the rows in the database, and the crawl is complete.