[Scrapy Framework] Version 2.4.0 Source Code: Output Files (Feed Exports) in Detail

Portal: index of all source-code analysis articles in this series

[Scrapy Framework Analysis] Version 2.4.0 Source Code: All Configuration Directory Index

Introduction

The most common requirement when implementing a crawler is to save the scraped data properly, or in other words to generate an "output file" (usually called an "output feed") containing the scraped data so that other systems can consume it.

You can skip this section if all of the scraped content is written directly to a database or data warehouse instead.

Scrapy ships with feed exports out of the box and supports multiple serialization formats and storage backends.
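For example, a minimal sketch of enabling a feed in a project's settings.py (the file name items.json is just an illustration):

# settings.py -- minimal sketch: write all scraped items to one JSON file
FEEDS = {
    'items.json': {'format': 'json'},
}

Since 2.4.0 the same effect is available from the command line with scrapy crawl <spider> -O items.json, where -O overwrites an existing file and the older -o option appends to it.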

Serialization formats

Feed exports use Item exporters under the hood. The supported formats are JSON, JSON Lines, CSV, XML, and others.

Format       FEEDS "format" value   Item exporter
JSON         json                   JsonItemExporter
JSON Lines   jsonlines              JsonLinesItemExporter
CSV          csv                    CsvItemExporter
XML          xml                    XmlItemExporter
Pickle       pickle                 PickleItemExporter
Marshal      marshal                MarshalItemExporter
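Each format is handled by the corresponding item exporter class. A minimal sketch of using one of them directly (the file name and item are made up for illustration):

from scrapy.exporters import JsonItemExporter

# exporters write bytes, so the output file must be opened in binary mode
with open('items.json', 'wb') as f:
    exporter = JsonItemExporter(f, indent=4)
    exporter.start_exporting()
    exporter.export_item({'name': 'example', 'price': 9.99})
    exporter.finish_exporting()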

Data storage (storage backends)

When using feed exports, you define where the feed is stored with a URI, given as a key of the FEEDS setting (the older FEED_URI setting is deprecated). The storage backend type is selected by the URI scheme.

The built-in storage backends are: local filesystem, FTP, S3 (requires botocore), Google Cloud Storage (requires google-cloud-storage), and standard output.

Storage URI parameters

The storage URI can also contain parameters that are replaced when the feed is created:

  • %(time)s – replaced by a timestamp when the feed is created
  • %(name)s – replaced by the spider name

Any other named parameter is replaced by the spider attribute of the same name. For example, %(site_id)s is replaced by the spider.site_id attribute when the feed is created.

# Stored on FTP, one directory per spider
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json

# Stored on S3, one directory per spider
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
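A minimal sketch of how these substitutions resolve in practice; the spider name, start URL and site_id value are assumptions:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    site_id = 42                    # %(site_id)s in the feed URI resolves to this attribute
    start_urls = ['https://example.com']
    custom_settings = {
        'FEEDS': {
            'exports/%(site_id)s/%(name)s-%(time)s.json': {'format': 'json'},
        },
    }

    def parse(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}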

Storage backends

Storage backend              URI scheme   Required library       Example URI
Local filesystem             file         -                      file:///tmp/export.csv (Unix-style path)
FTP                          ftp          -                      ftp://user:password@ftp.example.com/path/to/export.csv
S3                           s3           botocore               s3://aws_key:aws_secret@mybucket/path/to/export.csv
Google Cloud Storage (GCS)   gs           google-cloud-storage   gs://mybucket/path/to/export.csv
Standard output              stdout       -                      stdout:
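For the S3 backend, credentials can be embedded in the URI as in the example above, or supplied through Scrapy's AWS settings; the bucket name and keys below are placeholders:

# settings.py -- sketch of an S3-backed feed using settings-level credentials
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'
FEED_STORAGE_S3_ACL = 'private'      # optional canned ACL for uploaded files
FEEDS = {
    's3://mybucket/scraping/feeds/%(name)s/%(time)s.json': {'format': 'jsonlines'},
}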

Settings

  • FEEDS (mandatory)
  • FEED_EXPORT_ENCODING
  • FEED_STORE_EMPTY
  • FEED_EXPORT_FIELDS
  • FEED_EXPORT_INDENT
  • FEED_STORAGES
  • FEED_STORAGE_FTP_ACTIVE
  • FEED_STORAGE_S3_ACL
  • FEED_EXPORTERS
  • FEED_EXPORT_BATCH_ITEM_COUNT

FEEDS

A dictionary in which every key is a feed URI (or a pathlib.Path object) and every value is a nested dictionary of options for that feed. This setting is required to enable the feed export feature. For example, in settings.py:

import pathlib

FEEDS = {
    'items.json': {
        'format': 'json',
        'encoding': 'utf8',
        'store_empty': False,
        'fields': None,
        'indent': 4,
        'item_export_kwargs': {
            'export_empty_fields': True,
        },
    },
    '/home/user/documents/items.xml': {
        'format': 'xml',
        'fields': ['name', 'price'],
        'encoding': 'latin1',
        'indent': 8,
    },
    pathlib.Path('items.csv'): {
        'format': 'csv',
        'fields': ['price', 'name'],
    },
}

List of main parameters:

Added in   Key                  Description
-          format               Mandatory; the serialization format to use
-          batch_item_count     Falls back to FEED_EXPORT_BATCH_ITEM_COUNT
2.3.0      encoding             Falls back to FEED_EXPORT_ENCODING; encoding of the feed
2.3.0      fields               Falls back to FEED_EXPORT_FIELDS; fields to export and their order
2.3.0      indent               Falls back to FEED_EXPORT_INDENT; indentation level
2.3.0      item_export_kwargs   dict of keyword arguments passed to the item exporter class
2.4.0      overwrite            Whether to overwrite the output file if it already exists (True) or append to it (False)
2.4.0      store_empty          Falls back to FEED_STORE_EMPTY; whether to export empty feeds
2.4.0      uri_params           Falls back to FEED_URI_PARAMS; a function that sets the parameters applied to the feed URI (see the sketch below)
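A minimal sketch of a custom uri_params function (the module path and the extra %(day)s placeholder are assumptions); in 2.4.0 the function modifies the params dict in place:

# myproject/utils.py
def uri_params(params, spider):
    # expose an extra placeholder usable in feed URIs, e.g. exports/%(name)s-%(day)s.csv
    params['day'] = params['time'][:10]   # params['time'] is an ISO-like timestamp string

# settings.py
FEED_URI_PARAMS = 'myproject.utils.uri_params'
FEEDS = {
    'exports/%(name)s-%(day)s.csv': {'format': 'csv'},
}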

FEED_STORAGES_BASE
The default dictionary of built-in feed storage backends (URI scheme → storage class):

{
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
    'gs': 'scrapy.extensions.feedexport.GCSFeedStorage',
}
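The FEED_STORAGES setting is merged on top of this base dictionary, so a project can register a new URI scheme or disable a built-in one; the scheme and class path below are hypothetical:

# settings.py
FEED_STORAGES = {
    'myscheme': 'myproject.storages.MyFeedStorage',  # hypothetical custom backend
    'ftp': None,                                     # None disables a built-in scheme
}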

FEED_EXPORTERS_BASE
The default dictionary of built-in feed exporters (format name → exporter class):

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
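Likewise, FEED_EXPORTERS is merged on top of this base dictionary, which is how a custom serialization format gets registered; the format name 'upperjl' and the module path below are assumptions:

# myproject/exporters.py -- sketch of a custom exporter built on a built-in one
from scrapy.exporters import JsonLinesItemExporter

class UppercaseNameExporter(JsonLinesItemExporter):
    """JSON Lines exporter that upper-cases the 'name' field before writing."""
    def export_item(self, item):
        item = dict(item)
        if 'name' in item:
            item['name'] = item['name'].upper()
        super().export_item(item)

# settings.py
FEED_EXPORTERS = {'upperjl': 'myproject.exporters.UppercaseNameExporter'}
FEEDS = {'items.jl': {'format': 'upperjl'}}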

FEED_EXPORT_BATCH_ITEM_COUNT
If set to an integer greater than 0, Scrapy splits the output into multiple files, each storing at most that many items. The default is 0 (no splitting).
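A minimal sketch (the paths are assumptions); when batching is enabled, the feed URI should contain %(batch_time)s or %(batch_id)d so each file gets a distinct name:

# settings.py
FEED_EXPORT_BATCH_ITEM_COUNT = 100        # at most 100 items per output file
FEEDS = {
    'exports/%(name)s/batch-%(batch_id)d.json': {'format': 'json'},
}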

Origin blog.csdn.net/qq_20288327/article/details/113499876