Python Crawler 5.4 - Using the items Module of the scrapy Framework
Overview
This series of articles is a simple tutorial on Python crawler techniques, written to explain and consolidate my own knowledge; if it happens to be useful to you, so much the better.
The Python version used is 3.7.4.
In this article we cover the items module.
Introduction to items
The main goal of scraping is to extract structured data from unstructured sources, typically web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders. To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.
Various components of Scrapy use extra information provided by items: exporters look at the declared fields to figure out the columns to export, serialization can be customized using Item field metadata, trackref tracks Item instances to help find memory leaks (see the Scrapy documentation on debugging memory leaks with trackref), and so on.
Declaring items
Items are declared using a simple class definition syntax and Field objects. Sample code (items.py):
```python
import scrapy


class QsbkItem(scrapy.Item):
    # define the item's data fields
    author = scrapy.Field()
    content = scrapy.Field()
```
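Because Scrapy may not be installed where you are reading this, the snippet below uses a plain-Python stand-in (an assumption, not Scrapy's real implementation) to illustrate the dictionary-like API such an item provides: declared fields can be set and read like dict keys, while assigning to an undeclared field raises KeyError, which is how Item catches mistyped field names.

```python
# Plain-Python stand-in mimicking scrapy.Item's behaviour (assumption:
# this mirrors, but is not, the real class) to show the dict-like API.
class FakeItem(dict):
    fields = ()  # names of the declared fields

    def __setitem__(self, key, value):
        # Only declared fields may be set; a typo raises KeyError
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class QsbkItem(FakeItem):
    fields = ('author', 'content')


item = QsbkItem()
item['author'] = 'some author'    # declared field: accepted
item['content'] = 'some content'  # declared field: accepted
print(item['author'])             # read back like a dict

try:
    item['rating'] = 5            # undeclared field: rejected
except KeyError as exc:
    print('rejected:', exc)
```

This is why items are safer than plain dicts in a large project: a misspelled field name fails loudly at the spider instead of silently producing inconsistent data downstream.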
Using items
Example code:
```python
import scrapy
# import the item class
from qsbk.items import QsbkItem


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']
    base_url = 'https://www.qiushibaike.com'

    def parse(self, response):
        # SelectorList
        # parse the page
        content_left = response.xpath('//div[@id="content-left"]/div')
        # extract the data
        for dz_div in content_left:
            # Selector
            author = dz_div.xpath(".//h2/text()").get().strip()
            content_tmp = dz_div.xpath(".//div[@class='content']//text()").getall()
            content = ''.join(content_tmp).strip()
            # return the data wrapped in an item
            item = QsbkItem(author=author, content=content)
            # yield the item to the pipeline
            yield item
```
The QsbkItem class we defined can be understood as a dictionary (although it is not actually one). When different item classes are used to store different kinds of data, the pipeline that receives them can use isinstance(item, MyspiderItem) to determine which Item class a piece of data belongs to and process each kind differently (this was explained in "Python Crawler 5.2 - Using the pipeline Module of the scrapy Framework").
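As a sketch of that dispatch pattern (the item classes here are plain-Python stand-ins for scrapy.Item subclasses, and OtherItem is a hypothetical second item type; process_item follows the signature Scrapy pipelines use):

```python
# Stand-in item classes (assumption: in a real project these would be
# scrapy.Item subclasses imported from items.py; OtherItem is hypothetical).
class QsbkItem(dict):
    pass


class OtherItem(dict):
    pass


class MultiItemPipeline:
    """Route items to different handling based on their class."""

    def process_item(self, item, spider):
        if isinstance(item, QsbkItem):
            item['source'] = 'qsbk'   # handle joke items
        elif isinstance(item, OtherItem):
            item['source'] = 'other'  # handle the other kind of data
        return item


pipeline = MultiItemPipeline()
result = pipeline.process_item(QsbkItem(author='someone'), spider=None)
print(result['source'])  # qsbk
```

Returning the item at the end is required so the next pipeline in the chain still receives it.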
Other blog post links
- Python Crawler 1.1 - urllib Basic Usage Tutorial
- Python Crawler 1.2 - urllib Advanced Usage Tutorial
- Python Crawler 1.3 - requests Basic Usage Tutorial
- Python Crawler 1.4 - requests Advanced Usage Tutorial
- Python Crawler 2.1 - BeautifulSoup Usage Tutorial
- Python Crawler 2.2 - xpath Usage Tutorial
- Python Crawler 3.1 - json Usage Tutorial
- Python Crawler 3.2 - csv Usage Tutorial
- Python Crawler 3.3 - txt Usage Tutorial
- Python Crawler 4.1 - threading (Multi-threading) Usage Tutorial
- Python Crawler 4.2 - ajax (Dynamic Page Crawling) Usage Tutorial
- Python Crawler 4.3 - selenium Basic Usage Tutorial
- Python Crawler 4.4 - selenium Advanced Usage Tutorial
- Python Crawler 4.5 - tesseract (Image Captcha Recognition) Usage Tutorial
- Python Crawler 5.1 - A Simple Introduction to the scrapy Framework
- Python Crawler 5.2 - Using the pipeline Module of the scrapy Framework
- Python Crawler 5.3 - Using the spider Module [Request and Response] of the scrapy Framework