[Scrapy Framework] Version 2.4.0 Source Code: Items in Detail


Introduction

The main goal of data scraping is to extract structured data from unstructured sources (usually web pages).

This chapter walks through the examples from the source code and documentation. I personally find the data-processing steps cumbersome, so I have pared them down to the bare minimum. If the data handling in the column's crawler examples also feels tedious, feel free to skip ahead and just read the examples here.

Items provide dictionary-like containers for scraped data: they can be read, written, and modified. Scrapy supports several item types:

dictionaries : plain Python dicts.

Item objects : support the same operations as a dictionary.

from scrapy.item import Item, Field

class CustomItem(Item):
    one_field = Field()
    another_field = Field()

dataclass objects : let you declare field names with type annotations, and support serialization.

from dataclasses import dataclass

@dataclass
class CustomItem:
    one_field: str
    another_field: int

attrs objects : let you declare field names with types and converters, and support serialization.

import attr

@attr.s
class CustomItem:
    one_field = attr.ib(type=str)
    another_field = attr.ib(converter=float)
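
All of these item types can be handled through one common interface: Scrapy 2.2+ depends on the itemadapter library for exactly this purpose. A minimal sketch of its dict-like API, using whichever CustomItem definition above is in scope:

from itemadapter import ItemAdapter

# ItemAdapter wraps dicts, Item subclasses, dataclass and attrs objects alike
item = CustomItem(one_field='a', another_field=2)
adapter = ItemAdapter(item)
adapter['one_field'] = 'b'            # dict-style write
print(adapter.get('another_field'))   # dict-style read
print(adapter.asdict())               # copy the item out as a plain dict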

Using Items

Declaring fields

Item subclasses are declared with simple class-definition syntax: each class attribute assigned a Field() object declares one named field, and the scraped data is then filled into those fields like entries in a table.

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
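
Field() itself is just a dict subclass that holds per-field metadata, such as the serializer above; Scrapy does not interpret the keys itself, other components (e.g. the item exporters) do. The declared fields are collected into the class-level fields attribute, so the metadata can be inspected at runtime:

>>> Product.fields['last_updated']
{'serializer': <class 'str'>}
>>> Product.fields['name']
{}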

Working with Item data

  1. Creating Items
>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)
  2. Getting Item values
>>> product['name']
'Desktop PC'
>>> product.get('name')
'Desktop PC'
>>> product['price']
1000
# Missing keys raise a KeyError, exactly as a dict does
>>> product['lala'] # getting an undefined field value
Traceback (most recent call last):
    ...
KeyError: 'lala'
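# Unlike indexing, .get() can return a default for missing keys
>>> product.get('lala', 'unknown field')
'unknown field'
# Membership tests distinguish populated values from declared fields
>>> 'last_updated' in product          # not populated yet
False
>>> 'last_updated' in product.fields   # but declared on the class
True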
  3. Setting Item values
>>> product['last_updated'] = 'today'
>>> product['last_updated']
'today'
  4. Dictionary operations on Items
>>> list(product.keys())
['name', 'price', 'last_updated']

>>> list(product.items())
[('name', 'Desktop PC'), ('price', 1000), ('last_updated', 'today')]
  5. Copying Items
product2 = product.copy()
# or build a new instance from an existing one
product2 = Product(product)
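# copy() is shallow: nested mutable values are shared between copies.
# Item also provides deepcopy() for a fully independent copy
product3 = product.deepcopy()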
  6. Creating Items from dictionaries
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
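# The reverse direction also works: build a plain dict from an item
>>> dict(product)
{'name': 'Desktop PC', 'price': 1000, 'last_updated': 'today'}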
  7. Extending Items
# Extend an item type directly by subclassing it with additional fields
class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()

# Or redefine an existing field, reusing its metadata and swapping the
# serializer (a sketch of such a serializer callable follows below)
class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
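
my_serializer above is a placeholder the source never defines. A field serializer is just a callable that the item exporters apply to the field value on export; a minimal hypothetical sketch:

def my_serializer(value):
    # called by the item exporters with the raw field value;
    # whatever it returns is what gets exported
    return str(value).upper()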

Using Items in a Spider

import scrapy

from myproject.items import Product


class ProductSpider(scrapy.Spider):
    # minimal spider wrapper; the class and name are implied by the article
    name = 'products'

    def parse(self, response):
        item = Product()
        # .extract() returns a list of every match (alias of .getall())
        item["name"] = response.xpath('//div[@class="xxx"]/text()').extract()
        item["price"] = response.xpath('//div[@class="xxx"]/text()').extract()
        item["stock"] = response.xpath('//div[@class="xxx"]/text()').extract()
        item["tags"] = response.xpath('//div[@class="xxx"]/text()').extract()
        item["last_updated"] = response.xpath('//div[@class="xxx"]/text()').extract()
        yield item
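
Once yielded, the item travels through the item pipeline. As a hedged sketch of how it is typically consumed downstream (the PricePipeline name and the cleanup logic are illustrative, not from the article), a pipeline receives each item via process_item():

from itemadapter import ItemAdapter

class PricePipeline:
    # Illustrative pipeline: collapses the scraped price list
    # down to a single cleaned value.
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        prices = adapter.get('price') or []
        adapter['price'] = prices[0].strip() if prices else None
        return item

Such a pipeline would be enabled through the project's ITEM_PIPELINES setting.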
