[Scrapy Framework] Version 2.4.0 Source Code: All Configuration Directory Index
Introduction
The main goal of data scraping is to extract structured data from unstructured sources (usually web pages).
This chapter walks through the Item examples from the source code. Since real data-processing steps tend to be cumbersome, the processing here is stripped down to the bare essentials. In the crawler examples elsewhere in this column, if the data handling feels tedious, feel free to skip ahead and just read the examples.
Items provide dict-like containers whose data can be read, written, and modified. Scrapy supports several item types:
dictionaries : plain Python dicts are accepted as items.
Item objects : support the same operations as a dictionary.
from scrapy.item import Item, Field

class CustomItem(Item):
    one_field = Field()
    another_field = Field()
dataclass objects : items can be defined with dataclasses, which declare field names together with their types and support serialization.
from dataclasses import dataclass

@dataclass
class CustomItem:
    one_field: str
    another_field: int
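As a quick check of the dataclass form above, such items behave like ordinary dataclasses and convert to plain dicts. A minimal standard-library sketch (the field values here are made up for illustration):

```python
from dataclasses import dataclass, asdict

@dataclass
class CustomItem:
    one_field: str
    another_field: int

# Instantiate and convert to a plain dict, as an item pipeline might do
item = CustomItem(one_field="desktop", another_field=3)
print(asdict(item))  # {'one_field': 'desktop', 'another_field': 3}
```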
attrs objects : items can also be defined with the attrs package, which supports type declarations and value converters on each attribute.
import attr

@attr.s
class CustomItem:
    one_field = attr.ib(type=str)
    another_field = attr.ib(converter=float)
Use Items
Declare field
Item subclasses are declared with simple class-definition syntax and Field attributes: each Field names one field of the scraped content, and the scraped data is filled in under those field names, much like the columns of a table.
import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
Field data
- Create Items
>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)
- Get the value of Items
>>> product['name']
Desktop PC
>>> product.get('name')
Desktop PC
>>> product['price']
1000
# Accessing an undefined field raises an error, just like a dict
>>> product['lala'] # get the value of an undefined field
Traceback (most recent call last):
...
KeyError: 'lala'
- Set the value of Items
>>> product['last_updated'] = 'today'
>>> product['last_updated']
today
- Dictionary operation Items
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
- Copy Items
product2 = product.copy()
# or
product2 = Product(product)
- Dictionary creation Items
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
- Data type extension
# Extend an item by subclassing it directly
class DiscountedProduct(Product):
discount_percent = scrapy.Field(serializer=str)
discount_expiration_date = scrapy.Field()
# Redefine an inherited field, changing its serializer
class SpecificProduct(Product):
name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
Use in Spider
import scrapy
from myproject.items import Product

class ProductSpider(scrapy.Spider):  # hypothetical spider class, shown for context
    name = "products"

    def parse(self, response):
        item = Product()
        item["name"] = response.xpath('//div[@class="xxx"]/text()').extract()
        item["price"] = response.xpath('//div[@class="xxx"]/text()').extract()
        item["stock"] = response.xpath('//div[@class="xxx"]/text()').extract()
        item["tags"] = response.xpath('//div[@class="xxx"]/text()').extract()
        item["last_updated"] = response.xpath('//div[@class="xxx"]/text()').extract()
        yield item