Scrapy (5): A Detailed Explanation of Items

I was upset for a while today, so I went to the river to listen to the sound of the water, and I came back thinking more clearly. I am still too impatient. I have to calm down and study hard, and I still have to work on both my career and a side job. I'm under too much stress.


I keep telling myself that when my talent can't yet support my ambitions, I should calm down and learn. When my finances can't yet support my ideals, I should work steadily, invest and manage money steadily, and keep buying assets. I think it is best to invest regularly in Bitcoin, Ethereum, the CSI 500, the Hang Seng Index, and dividend indexes. In my view these are all undervalued right now, and I have a feeling that this year will be full of opportunities.


Today's topic is the Scrapy Item.


The main goal of scraping is to extract structured data from unstructured sources, typically web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.
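To make the problem concrete, here is a small sketch that uses only plain dicts. The two parse functions and their field names are made up for illustration; the point is that nothing stops one of them from misspelling a key.

```python
# Two hypothetical spider callbacks that return plain dicts.
def parse_product_a():
    return {"name": "Desktop PC", "price": 1000}


def parse_product_b():
    # Typo: "pirce" instead of "price". Python accepts it silently.
    return {"name": "Laptop PC", "pirce": 1500}


records = [parse_product_a(), parse_product_b()]

# Collect every key that appears in any record.
all_keys = sorted({key for record in records for key in record})
print(all_keys)  # ['name', 'pirce', 'price']

# The typo only surfaces later, when some consumer looks up "price".
missing_price = [r["name"] for r in records if "price" not in r]
print(missing_price)  # ['Laptop PC']
```

With a declared Item, the misspelled key would be rejected at assignment time instead of slipping through to the output.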

To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

Various Scrapy components use the extra information provided by Items: exporters look at the declared fields to work out which columns to export, serialization can be customized through Item field metadata, trackref tracks Item instances to help find memory leaks (see Debugging memory leaks with trackref), and so on.
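As a rough analogy for why declared fields matter to an exporter, the sketch below uses only the standard library csv module, not Scrapy's own exporters. The list of declared field names and the sample rows are assumptions for illustration. Because the column list comes from the declaration rather than from the data, every row gets the same columns even when some fields are not populated.

```python
import csv
import io

# Declared field names, in declaration order (assumed for this sketch).
declared_fields = ["name", "price", "stock", "last_updated"]

# Populated values only; neither row fills every declared field.
rows = [
    {"name": "Desktop PC", "price": 1000},
    {"name": "Laptop PC", "price": 1500, "stock": 3},
]

buffer = io.StringIO()
# restval is written for any declared field missing from a row.
writer = csv.DictWriter(buffer, fieldnames=declared_fields, restval="")
writer.writeheader()
writer.writerows(rows)

lines = buffer.getvalue().splitlines()
print(lines[0])  # name,price,stock,last_updated
print(lines[1])  # Desktop PC,1000,,
print(lines[2])  # Laptop PC,1500,3,
```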


Declaring Items

Items are declared using a simple class definition syntax and Field objects. Here is an example:

import scrapy

class Product(scrapy.Item):
   name = scrapy.Field()
   price = scrapy.Field()
   stock = scrapy.Field()
   last_updated = scrapy.Field(serializer=str)

Note

Those familiar with Django will notice that Scrapy Items are declared in a way similar to Django Models, except that Scrapy Items are much simpler because there is no concept of different field types.



Item Fields

Field objects are used to specify the metadata of each field. One example is the serializer function given for the last_updated field in the example above.

You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects. For the same reason, there is no reference list of all available metadata keys. Each key defined in a Field object may be used by a different component, and only that component knows about it. You can also define and use any other Field key in your own project, according to your needs. The main goal of Field objects is to provide a way to define all field metadata in one place. Typically, the components whose behavior depends on each field use certain field keys to configure that behavior. You must refer to each component's documentation to see which metadata keys it uses.
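The following sketch shows how a component might consume one metadata key. It is not Scrapy code: the fields mapping is a plain-dict stand-in for an Item's field metadata, and serialize_item is a hypothetical helper written for this post. Only the "serializer" key is read; any other key would simply be ignored by this component.

```python
import datetime

# Field metadata as plain dicts, keyed by field name (a stand-in, not Scrapy).
fields = {
    "name": {},
    "price": {},
    "last_updated": {"serializer": str},
}


def serialize_item(values, fields):
    """Apply each field's optional 'serializer' callable to its value."""
    result = {}
    for name, value in values.items():
        serializer = fields[name].get("serializer")
        result[name] = serializer(value) if serializer else value
    return result


values = {
    "name": "Desktop PC",
    "price": 1000,
    "last_updated": datetime.date(2020, 12, 25),
}
serialized = serialize_item(values, fields)
print(serialized)
# {'name': 'Desktop PC', 'price': 1000, 'last_updated': '2020-12-25'}
```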

It is important to note that the Field objects used to declare the item do not stay assigned as class attributes. Instead, they can be accessed through the Item.fields attribute.
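One way such behavior can be built is with a metaclass that moves Field declarations out of the class namespace and into a fields dictionary. The sketch below is a simplified stand-in written for this post, not Scrapy's actual implementation; Field, MiniItemMeta, and MiniItem are hypothetical names defined here.

```python
class Field(dict):
    """Stand-in for a field declaration: a plain dict of metadata."""


class MiniItemMeta(type):
    def __new__(mcs, name, bases, attrs):
        fields = {}
        # Inherit fields declared on parent classes.
        for base in bases:
            fields.update(getattr(base, "fields", {}))
        # Move Field declarations out of the class namespace.
        for key in list(attrs):
            if isinstance(attrs[key], Field):
                fields[key] = attrs.pop(key)
        attrs["fields"] = fields
        return super().__new__(mcs, name, bases, attrs)


class MiniItem(metaclass=MiniItemMeta):
    pass


class Product(MiniItem):
    name = Field()
    last_updated = Field(serializer=str)


print(hasattr(Product, "name"))        # False
print(sorted(Product.fields))          # ['last_updated', 'name']
print(Product.fields["last_updated"])  # {'serializer': <class 'str'>}
```

The declarations read like class attributes, but after the class is created they are reachable only through the fields dictionary.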

Working with Items

Here are some examples of common tasks performed with items, using the Product item declared above. You will notice that the API is very similar to the dict API.

Creating items

>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)

Getting field values

>>> product['name']
'Desktop PC'
>>> product.get('name')
'Desktop PC'

>>> product['price']
1000

>>> product['last_updated']
Traceback (most recent call last):
   ...
KeyError: 'last_updated'

>>> product.get('last_updated', 'not set')
'not set'

>>> product['lala'] # getting unknown field
Traceback (most recent call last):
   ...
KeyError: 'lala'

>>> product.get('lala', 'unknown field')
'unknown field'

>>> 'name' in product  # is name field populated?
True

>>> 'last_updated' in product  # is last_updated populated?
False

>>> 'last_updated' in product.fields  # is last_updated a declared field?
True

>>> 'lala' in product.fields  # is lala a declared field?
False

Setting field values

>>> product['last_updated'] = 'today'
>>> product['last_updated']
'today'

>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
   ...
KeyError: 'Product does not support field: lala'

Accessing all populated values

To access all populated values, just use the typical dict API:

>>> product.keys()
['price', 'name']

>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]

Other common tasks

Copying items:

>>> product2 = Product(product)
>>> print(product2)
Product(name='Desktop PC', price=1000)

>>> product3 = product2.copy()
>>> print(product3)
Product(name='Desktop PC', price=1000)

Creating dicts from items:

>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}

Creating items from dicts:

>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
   ...
KeyError: 'Product does not support field: lala'
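Since an unknown key raises KeyError, raw data sometimes needs to be filtered before it is passed to the constructor. The helper below is a hypothetical sketch, not part of Scrapy; it uses a plain dict of declared field names as a stand-in for the item's fields.

```python
def keep_declared(raw, declared_fields):
    """Return a copy of `raw` containing only keys present in `declared_fields`."""
    return {key: value for key, value in raw.items() if key in declared_fields}


# Stand-in for the declared fields of the Product item.
declared = {"name": {}, "price": {}, "stock": {}, "last_updated": {}}

raw = {"name": "Laptop PC", "price": 1500, "lala": "ignored"}
clean = keep_declared(raw, declared)
print(clean)  # {'name': 'Laptop PC', 'price': 1500}
```

With Scrapy, the same helper could be given Product.fields as the second argument, and the result passed to Product(...). Whether silently dropping unknown keys is acceptable depends on the project; raising the error is often the safer default.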

Extending Items

You can extend Items by declaring a subclass of the original Item (to add more fields or to change some metadata for some fields).

For example:

class DiscountedProduct(Product):
   discount_percent = scrapy.Field(serializer=str)
   discount_expiration_date = scrapy.Field()

You can also extend field metadata by starting from the previous field metadata and appending more values or changing existing ones, like this:

class SpecificProduct(Product):
   name = scrapy.Field(Product.fields['name'], serializer=my_serializer)

This adds (or replaces) the serializer metadata key for the name field, keeping all the previously existing metadata values.
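This works because, as the Field section of this post explains, a Field behaves like an ordinary dict, so the call has the same shape as the dict constructor with a mapping plus keyword arguments. The sketch below uses plain dicts only; my_serializer and the "required" key are hypothetical and exist just to show that untouched keys survive.

```python
def my_serializer(value):
    # Hypothetical serializer used only for this sketch.
    return str(value).upper()


# Pretend the parent class declared `name` with two metadata keys.
parent_name_field = {"serializer": str, "required": True}

# Same shape as Field(Product.fields['name'], serializer=my_serializer).
child_name_field = dict(parent_name_field, serializer=my_serializer)

print(child_name_field["required"])                     # True
print(child_name_field["serializer"] is my_serializer)  # True
print(parent_name_field["serializer"] is str)           # True
```

The parent metadata is copied rather than modified, so the original field keeps its own serializer.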

Item objects

class scrapy.item.Item([arg])

Return a new Item, optionally initialized from the given argument.

Items replicate the standard dict API, including its constructor. The only additional attribute provided by Items is:

fields

A dictionary containing all the declared fields of this Item, not only those that have been populated. The keys are the field names and the values are the Field objects used in the Item declaration.

Field objects

class scrapy.item.Field([arg])


The Field class is just an alias for the built-in dict class and does not provide any extra functionality or attributes. In other words, Field objects are plain old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.




Origin blog.51cto.com/15067249/2574444