Python Crawler Notes 5.4 - Using the Scrapy Framework's items Module

Overview

This series of notes is a simple tutorial written while learning Python crawler techniques, mainly to explain and consolidate my own knowledge; if it happens to be useful to you as well, so much the better.
The Python version used is 3.7.4.

In this article, we cover the items module.

Introduction to items

The main goal of items is to extract structured data from unstructured sources (usually web pages). Scrapy spiders can return extracted data as Python dicts. Although convenient and familiar, Python dicts lack structure: it is easy to mistype a field name or return inconsistent data, especially in a large project with many spiders. To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

Various Scrapy components use extra information provided by items: exporters look at the declared fields to compute the columns to export, serialization can be customized using item field metadata, trackref tracks Item instances to help find memory leaks (see Debugging memory leaks with trackref), and so on.
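As a small illustration of field metadata, here is a minimal sketch; the serializer key is one that Scrapy's built-in exporters consult, while ProductItem and its fields are hypothetical, used only for illustration:

import scrapy


class ProductItem(scrapy.Item):
    # Field objects accept arbitrary metadata as keyword arguments
    name = scrapy.Field()
    # 'serializer' is read by Scrapy's built-in exporters when writing output
    price = scrapy.Field(serializer=str)

The metadata can be read back through the class's fields attribute, e.g. ProductItem.fields['price'].get('serializer').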

Declaring items

Items are declared using a simple class-definition syntax with Field objects. Sample code (items.py):

import scrapy


class QsbkItem(scrapy.Item):
    # Define the item's data fields
    author = scrapy.Field()
    content = scrapy.Field()
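To see the dictionary-like API (and the structure it enforces) in action, here is a minimal sketch; it assumes the QsbkItem class declared above is importable from the project's qsbk.items module:

from qsbk.items import QsbkItem

# items are created and read much like dicts
item = QsbkItem(author='someone', content='a joke')
print(item['author'])  # 'someone'

# unlike a plain dict, assigning to an undeclared field fails loudly
try:
    item['auther'] = 'typo'  # misspelled field name
except KeyError as err:
    print('rejected:', err)

# convert to a real dict when one is needed, e.g. for JSON serialization
print(dict(item))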

Using items

Example code:

import scrapy
# Import the item class
from qsbk.items import QsbkItem


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']
    base_url = 'https://www.qiushibaike.com'

    def parse(self, response):
        # SelectorList
        # Parse the page
        content_left = response.xpath('//div[@id="content-left"]/div')
        # Extract the data
        for dz_div in content_left:
            # Selector
            author = dz_div.xpath(".//h2/text()").get().strip()
            content_tmp = dz_div.xpath(".//div[@class='content']//text()").getall()
            content = ''.join(content_tmp).strip()
            # Return the data wrapped in an item
            item = QsbkItem(author=author, content=content)
            # yield the item to the pipeline
            yield item

The QsbkItem class we defined can be thought of as a dictionary (although, strictly speaking, it is not a dict).

Continuing on: when different Item classes are used to store different kinds of data, and that data is later passed to the pipeline, you can use isinstance(item, MyspiderItem) to determine which Item class a piece of data belongs to and process each kind differently (this was explained in "Python Crawler Notes 5.2 - Using the Scrapy Framework's pipeline Module").
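As a reminder of how that looks on the pipeline side, here is a minimal sketch of a pipelines.py; QsbkItem comes from the items.py above, and the print call is a hypothetical stand-in for whatever processing your project actually does:

from qsbk.items import QsbkItem


class QsbkPipeline:
    def process_item(self, item, spider):
        # dispatch on the concrete Item class so different kinds of
        # data can be handled differently
        if isinstance(item, QsbkItem):
            print(item['author'], item['content'])  # hypothetical handling
        # always return the item so later pipelines can still see it
        return item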


Source: blog.csdn.net/Zhihua_W/article/details/103970628