Scrapy (6): Item Loaders in detail

Item loaders provide a convenient mechanism for populating items with data scraped from a website.


Declaring item loaders


Item loaders are declared like Items, using a class definition syntax. For example:


from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose, Join

class DemoLoader(ItemLoader):

   default_output_processor = TakeFirst()

   title_in = MapCompose(str.title)
   title_out = Join()

   size_in = MapCompose(str.strip)

   # more field-specific processors can be declared here

As you can see in the code above, input processors are declared using the _in suffix and output processors using the _out suffix.

The ItemLoader.default_input_processor and ItemLoader.default_output_processor attributes declare the default input and output processors.
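To make the processor conventions above concrete, here is a plain-Python sketch (no Scrapy required) of what MapCompose and Join do when used as title_in and title_out; the function names here are illustrative, not Scrapy API:

```python
# Illustrative sketch of the title_in / title_out pipeline above.
# MapCompose applies a function to every extracted value (dropping None
# results); Join then concatenates the surviving values with a separator.
def apply_input_processor(values):
    # Equivalent in spirit to MapCompose(str.title)
    results = []
    for v in values:
        r = str.title(v)
        if r is not None:
            results.append(r)
    return results

def apply_output_processor(values):
    # Equivalent in spirit to Join() with the default ' ' separator
    return ' '.join(values)

raw = ['hello', 'item loaders']         # values as extracted by a selector
collected = apply_input_processor(raw)  # ['Hello', 'Item Loaders']
print(apply_output_processor(collected))
```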


Using item loaders to populate items


To use an item loader, first instantiate it, either with a dict-like object (an item) or without one, in which case the item is automatically instantiated from the Item class specified in the ItemLoader.default_item_class attribute.

Then collect values into the item loader, typically using selectors. You can add more than one value to the same item field; the item loader will apply the corresponding processors to combine them.


The following code demonstrates how an item is populated using an item loader:


from scrapy.loader import ItemLoader
from demoproject.items import Product

def parse(self, response):
   l = ItemLoader(item=Product(), response=response)
   l.add_xpath("title", "//div[@class='product_title']")
   l.add_xpath("title", "//div[@class='product_name']")
   l.add_xpath("desc", "//div[@class='desc']")
   l.add_css("size", "div#size::text")
   l.add_value("last_updated", "yesterday")
   return l.load_item()

As shown in the code above, two different XPath expressions are used to extract data for the title field via the add_xpath() method:

1. //div[@class="product_title"]
2. //div[@class="product_name"]

After that, a similar call is used for the description (desc) field. The size data is extracted with the add_css() method, and last_updated is filled with the value "yesterday" using the add_value() method.

Once all the data is collected, the ItemLoader.load_item() method is called; it returns the item populated with the data extracted by the add_xpath(), add_css(), and add_value() calls.


Input and output processors

Each field of an item loader contains an input processor and an output processor.


When data is extracted, the input processor processes it, and the result is collected and stored inside the item loader.

After collecting the data, the ItemLoader.load_item() method is called to obtain the populated Item object.

Finally, the output processor is applied to the collected data, and its result is the final value assigned to the item field.


The following code demonstrates how to call the input and output processors for a specific field:


l = ItemLoader(Product(), some_selector)
l.add_xpath("title", xpath1) # [1]
l.add_xpath("title", xpath2) #  [2]
l.add_css("title", css) # [3]
l.add_value("title", "demo") # [4]
return l.load_item() # [5]

Line [1]: The title data is extracted from xpath1 and passed through the input processor, and the result is collected and stored in the item loader.


Line [2]: Similarly, title is extracted from xpath2 and passed through the same input processor; its result is appended to the data collected in [1].


Line [3]: title is extracted from the css selector and passed through the same input processor, and the result is appended to the data collected in [1] and [2].


Line [4]: Next, the value "demo" is assigned and passed through the same input processor.



Line [5]: Finally, the data collected internally for all fields is passed through the output processor, and the final value is assigned to the item.
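The collect-then-output flow described in steps [1] through [5] can be sketched in plain Python (no Scrapy required; TinyLoader is an illustrative stand-in, not Scrapy API):

```python
class TinyLoader:
    """Minimal stand-in showing how an item loader collects values."""
    def __init__(self, output_processor):
        self._values = {}             # field name -> list of collected values
        self._output = output_processor

    def add_value(self, field, value):
        # Every add call appends to the field's list; nothing is overwritten
        self._values.setdefault(field, []).append(value)

    def load_item(self):
        # The output processor runs once, over everything collected
        return {field: self._output(vals)
                for field, vals in self._values.items()}

l = TinyLoader(output_processor=' '.join)
l.add_value('title', 'from-xpath1')   # step [1]
l.add_value('title', 'from-xpath2')   # step [2]
l.add_value('title', 'demo')          # step [4]
print(l.load_item())                  # step [5]
```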



Declaring input and output processors


Input and output processors can be declared in the item loader (ItemLoader) definition, as shown above. Besides that, they can also be specified in the Item Field metadata.

For example:


import scrapy
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from w3lib.html import remove_tags

def filter_size(value):
   if value.isdigit():
       return value

class Product(scrapy.Item):
   title = scrapy.Field(
       input_processor=MapCompose(remove_tags),
       output_processor=Join(),
   )
   size = scrapy.Field(
      input_processor=MapCompose(remove_tags, filter_size),
      output_processor=TakeFirst(),
   )

>>> from scrapy.loader import ItemLoader
>>> il = ItemLoader(item=Product())
>>> il.add_value('title', [u'Hello', u'<strong>world</strong>'])
>>> il.add_value('size', [u'<span>100</span>'])
>>> il.load_item()

It displays the following output:

{'title': u'Hello world', 'size': u'100'}

Item loader context


The item loader context is a dict of arbitrary key/value pairs shared among all input and output processors in the item loader.

For example, suppose you have a function parse_length:


def parse_length(text, loader_context):
   unit = loader_context.get('unit', 'cm')
   # You can write parsing code of length here
   return parsed_length

By accepting a loader_context argument, the function tells the item loader that it can receive the item loader context. There are several ways to modify the values of the item loader context:
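For illustration, here is a hypothetical runnable version of parse_length; the unit-conversion logic and the "<number> <unit>" input format are assumptions made for this sketch:

```python
def parse_length(text, loader_context=None):
    # The target unit comes from the loader context, defaulting to 'cm'
    loader_context = loader_context or {}
    unit = loader_context.get('unit', 'cm')
    value, _, text_unit = text.partition(' ')   # e.g. "180 mm"
    length = float(value)
    # Convert between the only two units this sketch understands
    if text_unit == 'mm' and unit == 'cm':
        length /= 10.0
    elif text_unit == 'cm' and unit == 'mm':
        length *= 10.0
    return length

print(parse_length('180 mm', {'unit': 'cm'}))  # 18.0
```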


By modifying the currently active item loader context:


loader = ItemLoader(product)
loader.context["unit"] = "mm"

On item loader instantiation:


loader = ItemLoader(product, unit="mm")

In the item loader declaration, for those input/output processors that support being instantiated with an item loader context:


class ProductLoader(ItemLoader):
   length_out = MapCompose(parse_length, unit="mm")

ItemLoader objects


An ItemLoader object returns a new item loader for populating the given item. It has the following class signature:


class scrapy.loader.ItemLoader([item, selector, response, ]**kwargs)

Nested loaders

Nested loaders are used to parse related values from a subsection of a document. Without nested loaders, you would need to specify the full XPath or CSS expression for every value you want to extract.

For example, suppose you want to extract data from a page header:


<header>
 <a class="social" href="http://facebook.com/whatever">facebook</a>
 <a class="social" href="http://twitter.com/whatever">twitter</a>
 <a class="email" href="mailto:[email protected]">send mail</a>
</header>

Next, you can create a nested loader with the header selector and add the related values to it:


loader = ItemLoader(item=Item())
header_loader = loader.nested_xpath('//header')
header_loader.add_xpath('social', 'a[@class = "social"]/@href')
header_loader.add_xpath('email', 'a[@class = "email"]/@href')
loader.load_item()

Reusing and extending item loaders

Item loaders are designed to ease maintenance, which becomes a fundamental problem as your project grows and acquires more spiders.

For example, suppose a site encloses its product names in three dashes (e.g. ---DVD---). If you don't want those dashes in the final product names, you can remove them by reusing and extending the default product item loader, as the following code shows:


from scrapy.loader.processors import MapCompose
from demoproject.ItemLoaders import DemoLoader

def strip_dashes(x):
   return x.strip('-')

class SiteSpecificLoader(DemoLoader):
   title_in = MapCompose(strip_dashes, DemoLoader.title_in)
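Since MapCompose chains its functions from left to right over each value, prepending strip_dashes means the dashes are removed before DemoLoader's original title_in processor runs. Here is a plain-Python sketch of the effective per-value pipeline, assuming title_in applies str.title as in the earlier DemoLoader declaration:

```python
def strip_dashes(x):
    return x.strip('-')

def title_pipeline(value):
    # MapCompose(strip_dashes, DemoLoader.title_in) applies the
    # site-specific cleanup first, then the base processor (str.title)
    value = strip_dashes(value)
    return value.title()

print(title_pipeline('---dvd player---'))  # Dvd Player
```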

Available built-in processors

The following are some commonly used built-in processors:

class scrapy.loader.processors.Identity


Returns the original values unchanged. For example:


>>> from scrapy.loader.processors import Identity
>>> proc = Identity()
>>> proc(['a', 'b', 'c'])
['a', 'b', 'c']

class scrapy.loader.processors.TakeFirst


Returns the first non-null/non-empty value from the list of received values. For example:


>>> from scrapy.loader.processors import TakeFirst
>>> proc = TakeFirst()
>>> proc(['', 'a', 'b', 'c'])
'a'

class scrapy.loader.processors.Join(separator=u' ')


Returns the values joined with the given separator. The default separator is u' ', making it equivalent to the function u' '.join. For example:


>>> from scrapy.loader.processors import Join
>>> proc = Join()
>>> proc(['a', 'b', 'c'])
u'a b c'
>>> proc = Join('<br>')
>>> proc(['a', 'b', 'c'])
u'a<br>b<br>c'

class scrapy.loader.processors.SelectJmes(json_path)


Queries the value using the provided JSON path and returns the output. It requires the jmespath library to be installed.

For example:


>>> from scrapy.loader.processors import SelectJmes, Compose, MapCompose
>>> proc = SelectJmes("hello")
>>> proc({'hello': 'scrapy'})
'scrapy'
>>> proc({'hello': {'scrapy': 'world'}})
{'scrapy': 'world'}

The following code shows how to query JSON strings by combining it with json.loads:


>>> import json
>>> proc_single_json_str = Compose(json.loads, SelectJmes("hello"))
>>> proc_single_json_str('{"hello": "scrapy"}')
u'scrapy'
>>> proc_json_list = Compose(json.loads, MapCompose(SelectJmes('hello')))
>>> proc_json_list('[{"hello":"scrapy"}, {"world":"env"}]')
[u'scrapy']
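The Compose and MapCompose processors imported above are built-ins too. The difference between them: Compose passes the whole input through each function in turn, while MapCompose applies the function chain to every element of the input, dropping None results. A plain-Python sketch of that difference (illustrative, not Scrapy's actual implementation):

```python
def compose(*functions):
    # Compose: the entire input flows through each function in turn
    def process(value):
        for f in functions:
            value = f(value)
        return value
    return process

def map_compose(*functions):
    # MapCompose: each function is applied to every element; None is dropped
    def process(values):
        for f in functions:
            values = [r for r in (f(v) for v in values) if r is not None]
        return values
    return process

print(compose(lambda v: v[0], str.upper)(['hello', 'world']))  # HELLO
print(map_compose(str.upper)(['hello', 'world']))              # ['HELLO', 'WORLD']
```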


Origin blog.51cto.com/15067249/2574439