[Scrapy Framework] Version 2.4.0 Source Code: All Configuration Directory Index
Introduction
This chapter walks through the Item Loader examples from the Scrapy source code. In my view the data-processing steps are cumbersome, so the crawler examples in this column keep data processing to the bare minimum. If you find the processing details in this article tedious, feel free to skip them and just read the examples.
An Item Loader provides a convenient way to populate an Item. An Item is the container for the scraped data, and an Item Loader is the mechanism that fills that container with the extracted input.
ItemLoader parameter definition
item : The Item object being populated by this loader.
context : The currently active context of this ItemLoader.
default_item_class : The item class (or factory) used to instantiate the item when none is passed to the __init__ method.
default_input_processor : The input processor used for fields that do not specify one.
default_output_processor : The output processor used for fields that do not specify one.
default_selector_class : The class used to construct the selector of this ItemLoader when only a response is passed to the __init__ method. It is ignored if a selector is passed to the __init__ method. This attribute is sometimes overridden in subclasses.
selector : The Selector object to extract data from.
add_css (field_name, css, *processors, **kw):
Similar to ItemLoader.add_value() but receives a CSS selector instead of a value. The selector is used to extract a list of strings from the Selector associated with this ItemLoader.
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_css('name', 'p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_css('price', 'p#price', re='the price is (.*)')
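add_value (field_name, value, *processors, **kw): Process and then add the given value for the given field. The value is first passed through get_value() together with the processors and kwargs, then through the field's input processor, and the result is appended to the data already collected for that field. If field_name is None, value must be a dict mapping field names to values, so several fields can be added at once.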
loader.add_value('name', 'Color TV')
loader.add_value('colours', ['white', 'blue'])
loader.add_value('length', '100')
loader.add_value('name', 'name: foo', TakeFirst(), re='name: (.+)')
loader.add_value(None, {'name': 'foo', 'sex': 'male'})
add_xpath (field_name, xpath, *processors, **kw):
Similar to ItemLoader.add_value() but receives an XPath expression instead of a value. The expression is used to extract a list of strings from the Selector associated with this ItemLoader.
# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')
get_collected_values (field_name): Returns the collected values of the field.
get_css (css, *processors, **kw): Similar to ItemLoader.get_value() but receives a CSS selector instead of a value. The selector is used to extract a list of strings from the Selector associated with this ItemLoader.
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_css('p#price', TakeFirst(), re='the price is (.*)')
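get_output_value (field_name): Returns the values collected for the given field after they have been parsed by the output processor. This method does not populate or modify the item.
get_value (value, *processors, **kw): Process the given value with the given processors and keyword arguments. The re keyword argument is a regular expression that is applied to the value, before the processors run, to extract data from it.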
>>> from itemloaders import ItemLoader
>>> from itemloaders.processors import TakeFirst
>>> loader = ItemLoader()
>>> loader.get_value('name: foo', TakeFirst(), str.upper, re='name: (.+)')
'FOO'
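The order of operations in the call above (regular expression first, then each processor in turn) can be modelled in a few lines of plain Python. This is only a sketch for illustration, not the library's code: get_value_sketch and its regex parameter are made-up names, and operator.itemgetter(0) stands in for TakeFirst().

```python
import re
from operator import itemgetter


def get_value_sketch(value, *processors, regex=None):
    """Toy model of the get_value() flow (hypothetical helper)."""
    if regex is not None:
        # The regular expression runs first and yields a list of matches.
        values = value if isinstance(value, (list, tuple)) else [value]
        value = [match for v in values for match in re.findall(regex, v)]
    # Then each processor receives the result of the previous one.
    for processor in processors:
        if value is None:
            break
        value = processor(value)
    return value


result = get_value_sketch("name: foo", itemgetter(0), str.upper, regex=r"name: (.+)")
print(result)  # FOO
```

The regular expression turns 'name: foo' into ['foo'], the first processor picks 'foo', and str.upper produces 'FOO'.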
get_xpath (xpath, *processors, **kw): Similar to ItemLoader.get_value() but receives an XPath expression instead of a value. The expression is used to extract a list of strings from the Selector associated with this ItemLoader.
# HTML snippet: <p class="product-name">Color TV</p>
loader.get_xpath('//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')
load_item() : Populate the item with the data collected so far and return it.
nested_css (css, **context): Create a nested loader with a CSS selector. The supplied selector is applied relative to the selector associated with this ItemLoader.
nested_xpath (xpath, **context): Create a nested loader with an XPath selector. The supplied selector is applied relative to the selector associated with this ItemLoader.
replace_css (field_name, css, *processors, **kw): Similar to add_css() but replaces the collected data instead of adding to it.
replace_value (field_name, value, *processors, **kw): Similar to add_value() but replaces the collected data with the new value instead of adding to it.
replace_xpath (field_name, xpath, *processors, **kw): Similar to add_xpath() but replaces the collected data instead of adding to it.
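The difference between the add_* and replace_* families comes down to appending versus overwriting the values collected for a field. A minimal model with a plain dict, where collected, add_value and replace_value are illustrative stand-ins rather than the loader's internals:

```python
# Values already collected for the 'name' field (made-up data).
collected = {"name": ["Color TV"]}


def add_value(field_name, values):
    # Adding keeps what was collected before and appends the new values.
    collected.setdefault(field_name, []).extend(values)


def replace_value(field_name, values):
    # Replacing discards what was collected before.
    collected[field_name] = list(values)


add_value("name", ["Plasma TV"])
after_add = list(collected["name"])

replace_value("name", ["LED TV"])
after_replace = list(collected["name"])

print(after_add)      # ['Color TV', 'Plasma TV']
print(after_replace)  # ['LED TV']
```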
Data processing example
Define Items
import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
Populating the Item in a Spider
from scrapy.loader import ItemLoader
from myproject.items import Product

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    # Values from both XPath expressions are collected into the same 'name' field
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()
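Working with dataclass items
A dataclass item requires every field without a default to be passed when it is created, which conflicts with the way a loader fills fields one at a time. Declaring each field with a type and a default value through field() avoids the problem.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class InventoryItem:
    name: Optional[str] = field(default=None)
    price: Optional[float] = field(default=None)
    stock: Optional[int] = field(default=None)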
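To see why the defaults matter, the sketch below creates the item with no arguments and then fills the fields one by one. It uses only the standard library and the sample values are made up; it imitates incremental population by hand and does not involve an ItemLoader.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class InventoryItem:
    name: Optional[str] = field(default=None)
    price: Optional[float] = field(default=None)
    stock: Optional[int] = field(default=None)


# Works with no arguments because every field has a default.
item = InventoryItem()

# Fields can now be filled incrementally.
item.name = "Color TV"
item.price = 1200.0

print(asdict(item))  # {'name': 'Color TV', 'price': 1200.0, 'stock': None}
```

Without the defaults, InventoryItem() would raise a TypeError for the missing arguments.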
Input/output processor
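Each Item Loader has one input processor and one output processor for each field. The input processor runs as soon as data is received through add_xpath(), add_css() or add_value(), and its result is collected and kept inside the ItemLoader. After all the data has been collected, ItemLoader.load_item() is called; the output processor is then applied to the collected data, and its result is the final value assigned to the item.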
l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1) # (1)
l.add_xpath('name', xpath2) # (2)
l.add_css('name', css) # (3)
l.add_value('name', 'test') # (4)
return l.load_item() # (5)
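The collect-then-output flow of steps (1) to (5) can be imitated with a toy class. MiniLoader below is a hypothetical stand-in written for this article, not Scrapy's implementation; it keeps a single input processor and a single output processor for all fields to stay short.

```python
from collections import defaultdict


class MiniLoader:
    """Toy model of the input/output processor flow (not Scrapy's code)."""

    def __init__(self, input_processor, output_processor):
        self.input_processor = input_processor
        self.output_processor = output_processor
        self.collected = defaultdict(list)

    def add_value(self, field_name, value):
        values = value if isinstance(value, list) else [value]
        # The input processor runs immediately; its result is appended.
        self.collected[field_name].extend(self.input_processor(values))

    def load_item(self):
        # The output processor runs once per field over everything collected.
        return {
            name: self.output_processor(values)
            for name, values in self.collected.items()
        }


loader = MiniLoader(
    input_processor=lambda values: [v.strip() for v in values],
    output_processor=" ".join,
)
loader.add_value("name", "  Color ")  # like steps (1) to (4): collect
loader.add_value("name", ["TV  "])
item = loader.load_item()             # like step (5): produce the item
print(item)  # {'name': 'Color TV'}
```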
Declare Item Loader
Input and output processors are declared with the _in and _out suffixes on the field name. Defaults for all fields can also be declared with ItemLoader.default_input_processor and ItemLoader.default_output_processor.
from scrapy.loader import ItemLoader
# In Scrapy 2.4 the processors live in the itemloaders package;
# scrapy.loader.processors is deprecated.
from itemloaders.processors import TakeFirst, MapCompose, Join

class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    # str replaces the Python 2 unicode type
    name_in = MapCompose(str.title)
    name_out = Join()
    price_in = MapCompose(str.strip)
    # ...
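As a rough picture of what these declarations do to the data, the plain-Python functions below imitate the effect of MapCompose(str.title), Join(), MapCompose(str.strip) and TakeFirst() on made-up values. They are simplified stand-ins with illustrative names, not the processors themselves.

```python
def name_in(values):
    # Like MapCompose(str.title): apply the function to every value.
    return [value.title() for value in values]


def name_out(values):
    # Like Join(): join the collected values with a single space.
    return " ".join(values)


def price_in(values):
    # Like MapCompose(str.strip): strip every value.
    return [value.strip() for value in values]


def take_first(values):
    # Like TakeFirst(): first value that is neither None nor ''.
    for value in values:
        if value is not None and value != "":
            return value


name = name_out(name_in(["color", "tv"]))
price = take_first(price_in(["  1200 ", " 1300"]))
print(name)   # Color Tv
print(price)  # 1200
```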
Declare the input/output processor in the Field definition
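Input and output processors can also be declared directly in the Field metadata, which is very convenient. For both input and output processors, the order of precedence is:
- Item Loader field-specific attributes: field_in and field_out (highest precedence)
- Field metadata (input_processor and output_processor keys)
- Item Loader defaults: ItemLoader.default_input_processor and ItemLoader.default_output_processor (lowest precedence)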
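The precedence order above can be expressed as a small lookup function. This is a sketch under assumptions: SketchLoader, FIELD_METADATA and resolve_input_processor are hypothetical names invented for the illustration, and a plain dict stands in for the Field metadata.

```python
def title_all(values):
    return [v.title() for v in values]


def strip_all(values):
    return [v.strip() for v in values]


def identity(values):
    return values


class SketchLoader:
    # Loader default (lowest precedence).
    default_input_processor = staticmethod(identity)
    # Field-specific attribute (highest precedence).
    name_in = staticmethod(title_all)


# Stand-in for Field metadata (middle precedence).
FIELD_METADATA = {
    "name": {"input_processor": strip_all},
    "price": {"input_processor": strip_all},
    "stock": {},
}


def resolve_input_processor(loader, field_name):
    processor = getattr(loader, f"{field_name}_in", None)
    if processor is not None:
        return processor
    processor = FIELD_METADATA.get(field_name, {}).get("input_processor")
    if processor is not None:
        return processor
    return loader.default_input_processor


print(resolve_input_processor(SketchLoader, "name").__name__)   # title_all
print(resolve_input_processor(SketchLoader, "price").__name__)  # strip_all
print(resolve_input_processor(SketchLoader, "stock").__name__)  # identity
```

The 'name' field has both a loader attribute and metadata, and the loader attribute wins.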
import scrapy
# In Scrapy 2.4 the processors live in the itemloaders package
from itemloaders.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_price(value):
    # Keep only purely numeric strings; anything else returns None
    if value.isdigit():
        return value

class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
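filter_price works as a filter because it implicitly returns None for anything that is not a digit string, and MapCompose ignores None results. The effect can be reproduced with a list comprehension; the raw values are made up, and the comprehension only approximates what MapCompose does with None.

```python
def filter_price(value):
    if value.isdigit():
        return value


raw_prices = ["1200", "call us", "1300"]

# Apply the function to each value and drop the None results.
kept = [v for v in (filter_price(raw) for raw in raw_prices) if v is not None]
print(kept)  # ['1200', '1300']
```

Note that str.isdigit() also rejects strings such as '$1200' or '1,200', so those values would be dropped as well.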
Item Loader context
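The Item Loader context is a dict of arbitrary key/value pairs shared by all input and output processors of the Item Loader. A processor that declares a loader_context parameter receives this dict and can use it to adjust its behavior.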
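def parse_length(text, loader_context):
    unit = loader_context.get('unit', 'm')
    # ... length parsing code goes here ...
    return parsed_length

# Three ways to initialize or modify context values:

# 1. Modify the context of an existing loader
loader = ItemLoader(product)
loader.context['unit'] = 'cm'

# 2. Pass the value as a keyword argument when instantiating the loader
loader = ItemLoader(product, unit='cm')

# 3. Pass the value to the processor in the loader declaration
class ProductLoader(ItemLoader):
    length_out = MapCompose(parse_length, unit='cm')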
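One way to picture how a processor receives the context is to bind the dict to the loader_context parameter with functools.partial. The sketch below does that by hand; the parsing body and the UNITS_PER_METRE table are assumptions filled in for illustration, since the article leaves the parsing code elided.

```python
from functools import partial

# Hypothetical table: how many of each unit make one metre.
UNITS_PER_METRE = {"m": 1, "cm": 100}


def parse_length(text, loader_context):
    unit = loader_context.get("unit", "m")
    # Assumed parsing logic: convert the number to metres.
    return float(text) / UNITS_PER_METRE[unit]


# Bind the context to the processor by hand.
context = {"unit": "cm"}
bound_parse_length = partial(parse_length, loader_context=context)

length_cm = bound_parse_length("100")
length_default = parse_length("2.5", {})
print(length_cm)       # 1.0
print(length_default)  # 2.5
```

With an empty context the function falls back to metres, which mirrors the default passed to loader_context.get().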
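Reusing and extending Item Loaders
As a project grows and gains more spiders, maintenance becomes harder, especially when each spider needs slightly different parsing rules. Item Loaders support ordinary class inheritance, so common settings can live in a base loader and site-specific differences in subclasses.
The first example below removes dashes from product names. I recommend handling this kind of symbol removal in the later data-cleaning stage instead.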
# Example 1: strip surrounding dashes before the parent's name_in runs
from itemloaders.processors import MapCompose
from myproject.ItemLoaders import ProductLoader

def strip_dashes(x):
    return x.strip('-')

class SiteSpecificLoader(ProductLoader):
    name_in = MapCompose(strip_dashes, ProductLoader.name_in)

# Example 2: remove CDATA markers before the parent's name_in runs
from itemloaders.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
from myproject.utils.xml import remove_cdata

class XmlProductLoader(ProductLoader):
    name_in = MapCompose(remove_cdata, ProductLoader.name_in)
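The subclass processors chain a site-specific step in front of the parent's processor. The order of execution can be checked with plain functions; parent_name_in is a made-up stand-in for ProductLoader.name_in (assumed here to title-case each value), and the composition is written out by hand rather than with MapCompose.

```python
def strip_dashes(x):
    return x.strip("-")


def parent_name_in(values):
    # Stand-in for ProductLoader.name_in (assumption: title-cases values).
    return [value.title() for value in values]


def site_specific_name_in(values):
    # Site-specific cleanup runs first, then the inherited processing.
    return parent_name_in([strip_dashes(value) for value in values])


cleaned = site_specific_name_in(["---plasma tv---"])
print(cleaned)  # ['Plasma Tv']
```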