[Scrapy Framework] "Version 2.4.0 Source Code" Item Loaders Detailed Explanation

Index of all source code analysis articles in this series

[Scrapy Framework] Version 2.4.0 Source Code: All Configuration Directory Index

Introduction

This chapter introduces the examples from the source code. I personally find this style of data processing somewhat cumbersome, so the processing steps here have been reduced to their simplest form. If the data handling in the crawler examples of this column feels heavy, feel free to skip it and just look at the examples.

Item Loaders provide a very convenient way to populate Items. The Item provides a container for the scraped data, and the Item Loader provides a convenient mechanism for filling that container.

ItemLoader parameters and methods

item : The item instance being populated by this Item Loader.

context : The currently active context of this Item Loader.

default_item_class : An Item class (or factory) used to instantiate items when no item is given in the __init__ method.

default_input_processor : The default input processor used for fields that do not specify one.

default_output_processor : The default output processor used for fields that do not specify one.

default_selector_class : The class used to construct the selector of this ItemLoader when only a response is given in the __init__ method. If a selector is given in the __init__ method, this attribute is ignored. This attribute is sometimes overridden in subclasses.

selector : The Selector object from which data is extracted.
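
A minimal sketch of how the default_* attributes above are typically set on a loader subclass; the class name and the processors chosen here are illustrative, not taken from the source:

from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class DefaultsLoader(ItemLoader):
    # Items default to plain dicts; every field is stripped on input and
    # reduced to its first value on output, unless a field-specific
    # processor overrides these defaults.
    default_item_class = dict
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()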

add_css(field_name, css, *processors, **kw):
Similar to ItemLoader.add_value() but receives a CSS selector instead of a value. The selector is used to extract a list of strings from the selector associated with this ItemLoader.

# HTML snippet: <p class="product-name">Color TV</p>
loader.add_css('name', 'p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_css('price', 'p#price', re='the price is (.*)')

add_value(field_name, value, *processors, **kw): Process and then add the given value for the given field. The value is first passed through get_value() with the given processors and kwargs, then through the field's input processor, and the result is appended to the data collected for that field.

loader.add_value('name', 'Color TV')
loader.add_value('colours', ['white', 'blue'])
loader.add_value('length', '100')
loader.add_value('name', 'name: foo', TakeFirst(), re='name: (.+)')
loader.add_value(None, {'name': 'foo', 'sex': 'male'})

add_xpath(field_name, xpath, *processors, **kw):
Similar to ItemLoader.add_value() but receives an XPath expression instead of a value. The expression is used to extract a list of strings from the selector associated with this ItemLoader.

# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')

get_collected_values(field_name): Returns the values collected so far for the given field.
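
For example, assuming the default (pass-through) input processor, the values collected so far can be inspected before any output processor runs:

loader.add_value('colours', ['white', 'blue'])
loader.get_collected_values('colours')  # ['white', 'blue']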

get_css(css, *processors, **kw): Similar to ItemLoader.get_value() but receives a CSS selector instead of a value. The selector is used to extract a list of strings from the selector associated with this ItemLoader.

# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_css('p#price', TakeFirst(), re='the price is (.*)')

get_output_value(field_name): Returns the collected values for the given field, parsed using the output processor. This method does not populate or modify the item.
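
For example, assuming TakeFirst() is the field's output processor, the collected list is reduced without modifying the item:

loader.add_value('name', ['Color TV', 'LED TV'])
loader.get_output_value('name')  # 'Color TV' when TakeFirst() is the output processor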

get_value(value, *processors, **kw): Process the given value with the given processors and keyword arguments.

>>> from itemloaders import ItemLoader
>>> from itemloaders.processors import TakeFirst
>>> loader = ItemLoader()
>>> loader.get_value('name: foo', TakeFirst(), str.upper, re='name: (.+)')
'FOO'

get_xpath(xpath, *processors, **kw): Similar to ItemLoader.get_value() but receives an XPath expression instead of a value, which is used to extract a list of strings from the selector associated with this ItemLoader.

# HTML snippet: <p class="product-name">Color TV</p>
loader.get_xpath('//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.get_xpath('//p[@id="price"]', TakeFirst(), re='the price is (.*)')

load_item() : Populates the item with the data collected so far, and returns it.

nested_css(css, **context): Creates a nested loader with a CSS selector. The supplied selector is applied relative to the selector associated with this ItemLoader.

nested_xpath(xpath, **context): Creates a nested loader with an XPath selector. The supplied selector is applied relative to the selector associated with this ItemLoader.
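
A brief sketch of nested loaders, following the pattern in the Scrapy documentation; the footer markup and the social/email fields are hypothetical:

loader = ItemLoader(item={}, response=response)
# Selectors on the nested loader are relative to //footer
footer_loader = loader.nested_xpath('//footer')
footer_loader.add_xpath('social', 'a[@class="social"]/@href')
footer_loader.add_xpath('email', 'a[@class="email"]/@href')
# load_item() is called on the outer loader; the nested loader shares its item
loader.load_item()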

replace_css(field_name, css, *processors, **kw): Similar to add_css() but replaces the collected data instead of appending to it.

replace_value(field_name, value, *processors, **kw): Similar to add_value() but replaces the collected data with the new value instead of appending it.

replace_xpath(field_name, xpath, *processors, **kw): Similar to add_xpath() but replaces the collected data instead of appending to it.
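
A small illustration of the difference between adding and replacing collected values (the values are arbitrary):

loader.add_value('name', 'Color TV')
loader.add_value('name', 'LED TV')       # collected: ['Color TV', 'LED TV']
loader.replace_value('name', 'OLED TV')  # collected: ['OLED TV'], previous values discarded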

Data processing example

Define Items

import scrapy
 
class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

Loading fill data in Spider

from scrapy.loader import ItemLoader
from myproject.items import Product
 
def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()

Defining items as dataclasses

Define the field types and default values of the item.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class InventoryItem:
    name: Optional[str] = field(default=None)
    price: Optional[float] = field(default=None)
    stock: Optional[int] = field(default=None)
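
A hedged sketch of filling such a dataclass item through an ItemLoader. The type hints do not convert values automatically, so the processors below (the loader name and processor choices are illustrative) perform the conversion:

from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class InventoryLoader(ItemLoader):
    # Each field is reduced to a single value; price and stock are converted
    # explicitly because the annotations alone do not cast scraped strings.
    default_item_class = InventoryItem
    default_output_processor = TakeFirst()
    price_in = MapCompose(float)
    stock_in = MapCompose(int)

loader = InventoryLoader()
loader.add_value('name', 'Color TV')
loader.add_value('price', '1200')
loader.load_item()  # InventoryItem(name='Color TV', price=1200.0, stock=None)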

Input/output processor

Each Item Loader has one input processor and one output processor for each field. The input processor runs as soon as data is received (through add_xpath(), add_css() or add_value()). The output processor runs when ItemLoader.load_item() is called, after all data has been collected, and its result is the value finally assigned to the field.

l = ItemLoader(Product(), some_selector)
l.add_xpath('name', xpath1) # (1) data from xpath1 goes through the name input processor and is collected
l.add_xpath('name', xpath2) # (2) data from xpath2 goes through the same input processor and is appended to (1)
l.add_css('name', css)      # (3) same as (2), but extracted with a CSS selector
l.add_value('name', 'test') # (4) the literal value also passes through the name input processor
return l.load_item()        # (5) the collected data passes through the name output processor and is assigned to the item

Declare Item Loader

Input and output processors are declared with the _in and _out suffixes on field names; the class-level defaults ItemLoader.default_input_processor and ItemLoader.default_output_processor can also be overridden.

from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose, Join
 
class ProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
 
    name_in = MapCompose(str.title)
    name_out = Join()

    price_in = MapCompose(str.strip)
 
    # ...

Declare the input/output processor in the Field definition

Input and output processors can also be declared directly in the Field metadata, which is very convenient. The processor for a field is resolved in the following order of precedence:

  1. field_in and field_out attributes defined in the Item Loader
  2. Field metadata (the input_processor and output_processor keys)
  3. The defaults defined in the Item Loader (default_input_processor and default_output_processor)

import scrapy
from itemloaders.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags
 
def filter_price(value):
    if value.isdigit():
        return value
 
class Product(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags),
        output_processor=Join(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(remove_tags, filter_price),
        output_processor=TakeFirst(),
    )
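
A short usage sketch showing that the loader picks these processors up from the field metadata; the CSS selectors here are assumptions about the page markup:

from scrapy.loader import ItemLoader

def parse(self, response):
    loader = ItemLoader(item=Product(), response=response)
    loader.add_css('name', 'p.product-name')  # remove_tags on input, Join() on output
    loader.add_css('price', 'p#price')        # remove_tags + filter_price on input, TakeFirst() on output
    return loader.load_item()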

Item Loader context

The Item Loader context is a dict of arbitrary key/value pairs shared by all input and output processors of the Item Loader. It can be set when declaring, instantiating, or using the Item Loader, as the three variants below show.

def parse_length(text, loader_context):
    unit = loader_context.get('unit', 'm')
    # ... length parsing code goes here ...
    return parsed_length

# (1) Modify the currently active context of an existing loader
loader = ItemLoader(product)
loader.context['unit'] = 'cm'

# (2) Pass the context value when instantiating the loader
loader = ItemLoader(product, unit='cm')

# (3) Declare the context value on the processor itself
class ProductLoader(ItemLoader):
    length_out = MapCompose(parse_length, unit='cm')

Rewrite and extend Item Loaders

As a project grows and data requirements become more specific, it becomes convenient to extend and reuse Item Loaders, keeping common processors in a base loader and overriding them per site.

The example below strips the specified characters (leading and trailing dashes) at the loader level; personally, I recommend handling this kind of cleanup in a later data-cleaning stage.

from itemloaders.processors import MapCompose
from myproject.ItemLoaders import ProductLoader

def strip_dashes(x):
    return x.strip('-')

class SiteSpecificLoader(ProductLoader):
    name_in = MapCompose(strip_dashes, ProductLoader.name_in)

The same approach works for format-specific loaders, for example removing CDATA wrappers from XML feeds:

from itemloaders.processors import MapCompose
from myproject.ItemLoaders import ProductLoader
from myproject.utils.xml import remove_cdata

class XmlProductLoader(ProductLoader):
    name_in = MapCompose(remove_cdata, ProductLoader.name_in)

Origin blog.csdn.net/qq_20288327/article/details/113494261