I was upset for a while today, so I went to the river to listen to the sound of the water and came back able to think clearly. I am still too impetuous. I have to calm down and study hard, and I still have to figure out my career and a side job. I'm too stressed.
I keep telling myself that when my talent can't yet support my ambitions, I should calm down and learn. When my finances can't support my ideals, I should work in a down-to-earth way, invest and manage money in a down-to-earth way, and keep buying assets. It is better to invest regularly in Bitcoin, Ethereum, the CSI 500, the Hang Seng Index, and a dividend index. As I see it, these indexes are currently undervalued, and I always feel that this year is definitely a year full of opportunities.
Today's topic is Scrapy Items.
The main goal of scraping is to extract structured data from unstructured sources (usually web pages). Scrapy spiders can return the extracted data as Python dicts. Although convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or to return inconsistent data, especially in larger projects with many spiders.
To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect scraped data. They provide a dictionary-like API and have a convenient syntax for declaring their available fields.
Various Scrapy components use the extra information provided by Items: exporters look at the declared fields to determine which columns to export, serialization can be customized using Item field metadata, trackref tracks Item instances to help find memory leaks (see Debugging memory leaks with trackref), and so on.
Declaring Items
Items are declared using a simple class definition syntax and Field objects. Here is an example:
import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
Note
Those familiar with Django will notice that Scrapy Items are declared similarly to Django Models, except that Scrapy Items are much simpler because there is no concept of different field types.
Item Fields
Field objects are used to specify metadata for each field. One example is the serializer function specified for the last_updated field in the example above.
You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects. For the same reason, there is no reference list of all available metadata keys. Each key defined in a Field object can be used by a different component, and only that component knows about it. You can also define and use any other Field key in your project, according to your own needs. The main goal of Field objects is to provide a way to define all field metadata in one place. Typically, the components whose behavior depends on each field use certain field keys to configure that behavior. You must refer to each component's documentation to see which metadata keys it uses.
It is important to note that the Field objects used to declare the item do not stay assigned as class attributes. Instead, they can be accessed through the Item.fields attribute.
Using Items
Here are some examples of common tasks performed with items, using the Product item declared above. You will notice that the API is very similar to the dict API.
Creating items
>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)
Getting field values
>>> product['name']
'Desktop PC'
>>> product.get('name')
'Desktop PC'
>>> product['price']
1000
>>> product['last_updated']
Traceback (most recent call last):
...
KeyError: 'last_updated'
>>> product.get('last_updated', 'not set')
'not set'
>>> product['lala'] # getting unknown field
Traceback (most recent call last):
...
KeyError: 'lala'
>>> product.get('lala', 'unknown field')
'unknown field'
>>> 'name' in product # is name field populated?
True
>>> 'last_updated' in product # is last_updated populated?
False
>>> 'last_updated' in product.fields # is last_updated a declared field?
True
>>> 'lala' in product.fields # is lala a declared field?
False
Setting field values
>>> product['last_updated'] = 'today'
>>> product['last_updated']
'today'
>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Accessing all populated values
To access all populated values, just use the typical dict API:
>>> product.keys()
['price', 'name']
>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]
Other common tasks
Copying items:
>>> product2 = Product(product)
>>> print(product2)
Product(name='Desktop PC', price=1000)
>>> product3 = product2.copy()
>>> print(product3)
Product(name='Desktop PC', price=1000)
Creating dicts from items:
>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}
Creating items from dicts:
>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')
>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Extending Items
You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.
For example:
class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()
You can also extend field metadata by reusing the previous field metadata and appending more values, or changing existing values, like this:
class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)
This adds (or replaces) the serializer metadata key for the name field, keeping all previously existing metadata values.
Item objects
class scrapy.item.Item([arg])
Return a new Item optionally initialized from the given argument.
Items replicate the standard dict API, including its constructor. The only additional attribute provided by Items is:
fields
A dictionary containing all declared fields for this Item, not only those that have been populated. The keys are the field names and the values are the Field objects used in the Item declaration.
Field objects
class scrapy.item.Field([arg])
The Field class is just an alias to the built-in dict class and does not provide any extra functionality or attributes. In other words, Field objects are plain old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.
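To see why a separate marker class is useful, here is a simplified pure-Python sketch of the general pattern: a metaclass scans the class body for instances of the marker type and gathers them into a fields mapping. This is not Scrapy's actual implementation, and the names DeclarativeMeta and MiniItem are invented for the example.

```python
class Field(dict):
    """Marker type: a plain dict that the metaclass can recognise."""


class DeclarativeMeta(type):
    """Collects Field attributes into a `fields` mapping on the class."""

    def __new__(mcs, name, bases, attrs):
        fields = {}
        # Inherit fields declared on parent classes.
        for base in bases:
            fields.update(getattr(base, "fields", {}))
        # Move Field attributes from the class body into `fields`.
        for key in [k for k, v in attrs.items() if isinstance(v, Field)]:
            fields[key] = attrs.pop(key)
        attrs["fields"] = fields
        return super().__new__(mcs, name, bases, attrs)


class MiniItem(metaclass=DeclarativeMeta):
    pass


class Product(MiniItem):
    name = Field()
    last_updated = Field(serializer=str)
```

If fields were declared with bare dicts instead, the metaclass would have no reliable way to tell a field declaration apart from any other dict-valued class attribute.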