Analysis and examples of Scrapy crawler framework (China University MOOC)

Scrapy framework

 Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs for data mining, information processing, or storing historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is versatile and can be used for data mining, monitoring, and automated testing.

 Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:

(Figure: Scrapy architecture diagram)

Components

Scrapy Engine
The engine is responsible for controlling the flow of data among all components of the system and for triggering events when certain actions occur. See the Data Flow section below for details.

Scheduler
The scheduler accepts requests from the engine and enqueues them so that they can be handed back later when the engine asks for them.

Spiders
Spiders are classes written by Scrapy users to parse responses and extract items (i.e. the scraped data) or additional URLs to follow. Each spider is responsible for handling one specific website (or a few).

Item Pipeline
The Item Pipeline is responsible for processing the items extracted by the spiders. Typical tasks include cleaning, validation, and persistence (for example, storing the item in a database).

Downloader middlewares
Downloader middlewares are specific hooks between the engine and the downloader; they process the requests passed from the engine to the downloader and the responses passed back from the downloader to the engine. They provide a simple mechanism for extending Scrapy by inserting custom code.
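
For illustration, here is a minimal sketch of a downloader middleware; the class name and the User-Agent value are assumptions, not code from the original post. A middleware like this is enabled through the DOWNLOADER_MIDDLEWARES setting.

# A minimal downloader middleware sketch (class name and header value are assumptions)
class DefaultUserAgentMiddleware:
    def process_request(self, request, spider):
        # Called for each request on its way from the engine to the downloader
        request.headers.setdefault('User-Agent', 'Mozilla/5.0')
        return None  # None means: continue handling the request normally

    def process_response(self, request, response, spider):
        # Called for each response on its way from the downloader back to the engine
        return response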

Spider middlewares
Spider middlewares are specific hooks between the engine and the spiders; they process spider input (responses) and spider output (items and requests). They provide a simple mechanism for extending Scrapy by inserting custom code.

Data flow

The data flow in Scrapy is controlled by the execution engine, and the process is as follows:

  1. The engine opens a website (opens a domain), finds the spider that handles that website, and asks the spider for the first URL(s) to crawl.
  2. The engine obtains the first URLs to crawl from the spider and schedules them as Requests in the scheduler.
  3. The engine asks the scheduler for the next URL to crawl.
  4. The scheduler returns the next URL to crawl to the engine, and the engine forwards it to the downloader through the downloader middleware (request direction).
  5. Once the page finishes downloading, the downloader generates a Response for the page and sends it to the engine through the downloader middleware (response direction).
  6. The engine receives the Response from the downloader and sends it to the spider for processing through the spider middleware (input direction).
  7. The spider processes the Response and returns scraped Items and (follow-up) new Requests to the engine.
  8. The engine sends the scraped Items (returned by the spider) to the Item Pipeline and the Requests (returned by the spider) to the scheduler.
  9. The process repeats (from step 2) until there are no more requests in the scheduler, and the engine closes the website.

Scrapy installation and project generation

Download method

Windows:

Install pip from https://pip.pypa.io/en/latest/installing.html

Open a command line window and confirm that pip is installed correctly:

pip --version

Install scrapy:

pip install scrapy

Project initialization

startproject
Syntax: scrapy startproject <project_name>
Requires a project: no
Creates a Scrapy project named project_name in a directory called project_name.

Open a command line window, switch to the directory where you want to create the project, and initialize a project named myproject:

scrapy startproject myproject

After creation, enter the project we just created; its directory structure is as follows:

scrapy.cfg
myproject/
    __init__.py 
    items.py	
    pipelines.py
    middlewares.py
    settings.py
    spiders/
        __init__.py
        ...

Next, you can use the scrapy command to manage and control your project. For example, to create a new spider, we take the China University MOOC site as an example:

genspider
Syntax: scrapy genspider [-t template] <name> <domain>
Requires a project: yes
Creates a spider in the current project.
This is only a shortcut for creating spiders: it generates a spider from a predefined template, but you can also write the spider source file yourself.
scrapy genspider course https://www.icourse163.org/

After creation, a course.py file will be generated in the spiders directory.
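
Its contents are roughly the default spider template generated by Scrapy (the exact output may vary slightly between Scrapy versions):

# spiders/course.py -- roughly what genspider produces
import scrapy


class CourseSpider(scrapy.Spider):
    name = 'course'
    allowed_domains = ['www.icourse163.org']
    start_urls = ['https://www.icourse163.org/']

    def parse(self, response):
        pass
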
Next, let's look at the project configuration file, settings.py:

BOT_NAME: the project name

USER_AGENT: the user agent string

ROBOTSTXT_OBEY: whether to obey the robots.txt protocol; defaults to True

CONCURRENT_REQUESTS: the maximum number of concurrent requests

DOWNLOAD_DELAY: the download delay in seconds, which controls how often the crawler fetches pages. Adjust it to your project, neither too fast nor too slow. The example value in the generated settings.py is 3 seconds (i.e. pause 3 seconds after each fetch); 1 second is usually a good trade-off, and a fraction of a second is fine when there are many pages to fetch.

COOKIES_ENABLED: whether cookies are kept between requests; disable it here if you do not need them

DEFAULT_REQUEST_HEADERS: the default request headers. The USER_AGENT described above is actually sent as one of these headers; set them according to the content you are crawling.

ITEM_PIPELINES: the item pipelines. The number 300 is the priority; the lower the number, the earlier the pipeline runs. This setting is commented out by default.
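
Putting these together, the corresponding entries in settings.py might look roughly like this; the user agent string and the delay are example values, not the ones from the original post (ITEM_PIPELINES is shown later, in the pipeline section):

# settings.py (excerpt) -- example values, adjust for your own project
BOT_NAME = 'myproject'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # example UA string

ROBOTSTXT_OBEY = True        # obey robots.txt

CONCURRENT_REQUESTS = 16     # maximum number of concurrent requests
DOWNLOAD_DELAY = 1           # wait 1 second between requests

COOKIES_ENABLED = False      # do not keep cookies between requests

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}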

When using Scrapy, you can declare the settings module you want to use through the environment variable SCRAPY_SETTINGS_MODULE.

The value of SCRAPY_SETTINGS_MODULE must be written in Python path syntax, such as myproject.settings. Note that the settings module must be on the Python import search path.

Then you can start the crawl: open the command line and run scrapy crawl <spider> to start crawling.

crawl
Syntax: scrapy crawl <spider>
Requires a project: yes
Starts crawling with the given spider.
scrapy crawl course
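
Alternatively, the crawl can be started from a plain Python script instead of the command line; a minimal sketch (the file name run_course.py is an assumption):

# run_course.py -- start the 'course' spider from a script (file name is an assumption)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('course')   # the spider name created by genspider
process.start()           # blocks until the crawl is finished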

Log module

 Scrapy provides a logging facility, which you can use through the scrapy.log module. The current underlying implementation uses Twisted logging, but this may change in the future.

 The log service must be started by explicitly calling scrapy.log.start().

Scrapy provides 5 levels of logging:

CRITICAL - critical errors
ERROR - regular errors
WARNING - warning messages
INFO - informational messages
DEBUG - debugging messages

The log level can be set with the command line option --loglevel/-L or the LOG_LEVEL setting. If the level is set to WARNING, only messages at WARNING level or above (WARNING, ERROR and CRITICAL) are logged. For example, to log a message at WARNING level:

from scrapy import log
log.msg("This is a warning", level=log.WARNING)

The recommended way to log from a spider is to use the Spider's log() method, which automatically fills in the spider argument when calling scrapy.log.msg(). The other arguments are passed to msg() unchanged.
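
For example, inside a spider callback (a sketch using the legacy API described here; newer Scrapy versions expose self.logger instead):

import scrapy
from scrapy import log   # legacy logging API

class CourseSpider(scrapy.Spider):
    name = 'course'

    def parse(self, response):
        # self.log() fills in the spider argument automatically
        self.log("Parsed %s" % response.url, level=log.WARNING)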

You can also wrap a logging module of your own and import it wherever it is needed:

from .course_logger import logger
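
The original post shows the custom module only as a screenshot; a minimal sketch of what such a course_logger.py might contain, based on Python's standard logging module (file name, log file and format are assumptions):

# course_logger.py -- assumed contents of the custom logging helper
import logging

logger = logging.getLogger('course')
logger.setLevel(logging.INFO)

_handler = logging.FileHandler('course.log', encoding='utf-8')
_handler.setFormatter(logging.Formatter('%(asctime)s [%(levelname)s] %(message)s'))
logger.addHandler(_handler)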

Practical example

Next, we take China University MOOC as an example and crawl some information about its free public courses.

Take Advanced Mathematics (3) as an example; the course address will be in one of the following two formats:

https://www.icourse163.org/course/NUDT-42002
https://www.icourse163.org/course/NUDT-42002?tid=1462924451

First, define the content you want to extract in items.py. For example, we extract the course's name, ID, lecturers, description, and school, and create a field for each. scrapy.Field() essentially adds a key to a dictionary without assigning it a value yet; the value is filled in later, once the data has been extracted.

import scrapy


class MoocItem(scrapy.Item):
    term_id = scrapy.Field()
    title = scrapy.Field()
    description = scrapy.Field()
    college = scrapy.Field()
    lector = scrapy.Field()

Approach

We only need to crawl a single page, so the main work is analyzing the HTML and writing XPath expressions to select nodes or node sets, much like paths in a computer file system.

re — regular expression operations
The re module is Python's built-in string matching module. Many of its functions are implemented on top of regular expressions, which perform fuzzy matching on strings to extract the parts you need; regular expression syntax is common to virtually all languages.
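
The course's term ID is extracted this way below with re.search; a small standalone illustration (the HTML fragment is made up to mirror the real page):

import re

# A made-up fragment resembling the script block on the course page
html = 'window.termDto = { termId : "1462924451", price : 0 }'

match = re.search(r'termId : "(\d+)"', html)
if match:
    print(match.group(1))   # -> 1462924451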

yield
The Scrapy framework performs different actions depending on the type of the object the spider yields (a sketch of both cases follows this list):
a. If it is a scrapy.Request object, Scrapy fetches the link the object points to and calls the object's callback function once the request completes.
b. If it is a scrapy.Item object (or a plain dict), Scrapy passes it to pipelines.py for further processing.
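
A minimal sketch of both cases (the spider name and URLs are placeholders, not the spider used below):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://www.icourse163.org/course/NUDT-42002']

    def parse(self, response):
        # Case a: the Request is scheduled and its callback is called with the response
        yield scrapy.Request('https://www.icourse163.org/', callback=self.parse_index)
        # Case b: an item (here a plain dict) is handed over to the item pipelines
        yield {'title': response.xpath('//title/text()').get()}

    def parse_index(self, response):
        pass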

# parse() of the course spider; requires `import re` and `import demjson`
# at the top of the spider module
def parse(self, response):
    item = {
        'term_id': re.search(r'termId : "(\d+)"', response.text).group(1),
        'title': response.xpath("//meta[@name= 'description']/@content").extract_first().split(',')[0],
        'description': response.xpath("//meta[@name= 'description']/@content").extract_first().split(',')[1][10:],
        'college': response.xpath("//meta[@name= 'keywords']/@content").extract_first().split(',')[1],
    }
    # The lecturer data is embedded in a <script> tag as JavaScript literals
    lectors = []
    script = response.css('script:contains("window.staffLectors")::text').get()
    chiefLector_str = ''.join(re.findall(r'chiefLector = \{([^}]*)\}', script))
    chiefLector_list = re.sub(r'\s+', '', ' '.join(chiefLector_str.split())).strip()
    chiefLector = demjson.decode("{" + chiefLector_list + "}")
    lectors.append(chiefLector)
    staffLectors_str = ''.join(re.findall(r'staffLectors = \[([^\[\]]+)\]', script))
    staffLectors_list = re.sub(r'\s+', '', ' '.join(staffLectors_str.split())).strip()
    staffLector = demjson.decode("[" + staffLectors_list + "]")
    if staffLector:
        for staff in staffLector:
            lectors.append(staff)
    item['lector'] = lectors
    yield item

Extracting the lecturer information is a bit more troublesome. The lecturer data lives inside a script tag, so regular expressions are used to match the variable names (chiefLector and staffLectors) and then capture the content inside the braces or brackets, and demjson is needed to parse the resulting JSON-like text.

demjson
demjson is a third-party Python library for working with JSON data; it needs to be installed separately (for example with pip install demjson). It provides classes and functions for encoding and decoding data in the language-neutral JSON format (often used as a simpler alternative to XML in Ajax web applications). The implementation tries to comply with the JSON specification (RFC 4627) as closely as possible, while still offering many optional extensions that allow less restrictive JavaScript syntax. It includes full Unicode support, including UTF-32, BOMs, and surrogate pair processing.
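
demjson is used here because the lecturer objects embedded in the script tag are JavaScript-style literals rather than strict JSON. A small illustration (the literal below is made up): the standard json module would reject unquoted keys and single quotes, while demjson.decode accepts them.

import demjson

# A made-up JavaScript-style literal: unquoted keys and single-quoted strings
js_text = "{id: 1462924451, name: 'NUDT', picUrl: ''}"

data = demjson.decode(js_text)
print(data['name'])   # -> NUDT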

The pipelines.py pipeline can process the extracted data, for example by storing it in a MongoDB database. Don't forget to enable the pipeline in the settings once the code is written (the corresponding settings entry is shown after the code below).

from pymongo import MongoClient

class MyprojectPipeline:
    MONGO_URL = "mongodb://localhost:27017"
    MONGO_DB = "mooc"
    MONGO_TABLE = "course"

    client = MongoClient(MONGO_URL)
    db = client[MONGO_DB]

    def process_item(self, item, spider):
        # called once for every item yielded by the spider
        self.save_to_mongo(item)

        return item

    def save_to_mongo(self, data):
        # insert_one() replaces insert(), which was removed in recent pymongo versions
        if self.db[self.MONGO_TABLE].insert_one(data):
            print("SAVE SUCCESS", data)
            return True
        return False
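
The settings entry that enables this pipeline (mentioned above) looks like this; 300 is the priority, and lower numbers run earlier:

# settings.py -- enable the item pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.MyprojectPipeline': 300,
}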

The data is stored successfully.
