[Scrapy Framework Analysis] Version 2.4.0 Source Code: All Configuration Directory Index
Introduction
One of the most common requirements when implementing a crawler is to persist the scraped data properly, that is, to generate an "export file" (usually called an "output feed") from the scraped data for other systems to consume.
You can skip this section if all of your scraped content goes straight into a data warehouse.
Scrapy ships with built-in feed export support, covering multiple serialization formats and storage backends.
Serialization formats
Feed exports use Item exporters for serialization. The supported formats include JSON, JSON Lines, CSV, and XML, among others:
Serialization format | `format` value | Item exporter |
---|---|---|
JSON | json | JsonItemExporter |
JSON Lines | jsonlines | JsonLinesItemExporter |
CSV | csv | CsvItemExporter |
XML | xml | XmlItemExporter |
Pickle | pickle | PickleItemExporter |
Marshal | marshal | MarshalItemExporter |
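To make the difference between the json and jsonlines formats concrete, here is a standalone sketch using only the standard json module (it mimics the exporters' output shapes; it does not use Scrapy itself):

```python
import json

items = [{"name": "book", "price": 10}, {"name": "pen", "price": 2}]

# "json" format: a single JSON array containing every item
# (the shape JsonItemExporter produces)
as_json = json.dumps(items)

# "jsonlines" format: one JSON object per line
# (the shape JsonLinesItemExporter produces)
as_jsonlines = "\n".join(json.dumps(item) for item in items)

print(as_json)
print(as_jsonlines)
```

The jsonlines form can be appended to and streamed line by line, which is why it is usually preferred for large crawls.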
Data storage (Storage)
When using feed exports, you define the storage backend through a URI (set via FEED_URI, or as a key of the FEEDS dictionary). The available storage backend types depend on the URI scheme.
The built-in storage backends are: local filesystem, FTP, S3 (requires boto), and standard output.
Storage URI parameters
The storage URI can also contain parameters, which are replaced when the feed is created:
- %(time)s - replaced by a timestamp when the feed is created
- %(name)s - replaced by the spider name
Any other named parameter is replaced by the spider attribute of the same name. For example, %(site_id)s is replaced by the spider.site_id attribute when the feed is created.
```
# Store on FTP, one directory per spider
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
# Store on S3, one directory per spider
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
```
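The substitution itself is ordinary Python %-style string formatting over a parameter dict. A minimal sketch (the values below stand in for what Scrapy collects at feed-creation time and are made up for illustration):

```python
# Hypothetical values standing in for what Scrapy gathers when the feed is created.
params = {
    "name": "books_spider",          # spider.name
    "time": "2020-10-19T12-00-00",   # feed-creation timestamp
    "site_id": "42",                 # a custom spider.site_id attribute
}

uri_template = "s3://mybucket/scraping/feeds/%(name)s/%(time)s.json"
print(uri_template % params)
# s3://mybucket/scraping/feeds/books_spider/2020-10-19T12-00-00.json
```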
Storage backends
Storage type | System notes | URI scheme | Required library | Example |
---|---|---|---|---|
Local file system | Unix | file | - | file:///tmp/export.csv |
FTP | - | ftp | - | ftp://user:password@ftp.example.com/path/to/export.csv |
S3 | - | s3 | botocore | s3://aws_key:aws_secret@mybucket/path/to/export.csv |
Google Cloud Storage (GCS) | - | gs | google-cloud-storage | gs://mybucket/path/to/export.csv |
Standard output | - | stdout | - | stdout: |
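Scrapy selects the backend by the URI scheme. A simplified sketch of that dispatch using urllib.parse (the mapping mirrors the FEED_STORAGES_BASE dictionary shown later; the function name is invented for illustration):

```python
from urllib.parse import urlparse

# Simplified mirror of the scheme -> storage-backend mapping.
SCHEME_TO_BACKEND = {
    "file": "FileFeedStorage",
    "ftp": "FTPFeedStorage",
    "s3": "S3FeedStorage",
    "gs": "GCSFeedStorage",
    "stdout": "StdoutFeedStorage",
}

def pick_backend(uri: str) -> str:
    """Return the backend name for a feed URI; a bare path with no scheme is a local file."""
    scheme = urlparse(uri).scheme
    return SCHEME_TO_BACKEND.get(scheme, "FileFeedStorage")

print(pick_backend("s3://mybucket/path/to/export.csv"))  # S3FeedStorage
print(pick_backend("/tmp/export.csv"))                   # FileFeedStorage
```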
Settings
- FEEDS (mandatory)
- FEED_EXPORT_ENCODING
- FEED_STORE_EMPTY
- FEED_EXPORT_FIELDS
- FEED_EXPORT_INDENT
- FEED_STORAGES
- FEED_STORAGE_FTP_ACTIVE
- FEED_STORAGE_S3_ACL
- FEED_EXPORTERS
- FEED_EXPORT_BATCH_ITEM_COUNT
Output (FEEDS)
FEEDS is a dictionary in which every key is a feed URI (or a pathlib.Path object) and every value is a nested dictionary of options for that feed. This setting is required to enable the feed export feature.
```python
import pathlib

FEEDS = {
    'items.json': {
        'format': 'json',
        'encoding': 'utf8',
        'store_empty': False,
        'fields': None,
        'indent': 4,
        'item_export_kwargs': {
            'export_empty_fields': True,
        },
    },
    '/home/user/documents/items.xml': {
        'format': 'xml',
        'fields': ['name', 'price'],
        'encoding': 'latin1',
        'indent': 8,
    },
    pathlib.Path('items.csv'): {
        'format': 'csv',
        'fields': ['price', 'name'],
    },
}
```
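Per-feed options that are omitted fall back to the corresponding global FEED_* settings. A sketch of that fallback as a plain dict merge (the names here are illustrative, not Scrapy's actual internals):

```python
# Global defaults (standing in for the FEED_* settings) and one feed's options.
global_defaults = {"encoding": None, "store_empty": False, "indent": 0}
feed_options = {"format": "csv", "indent": 4}

# Per-feed values win; anything missing falls back to the global setting.
effective = {**global_defaults, **feed_options}
print(effective)
```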
Main per-feed options:
Since version | Option | Description |
---|---|---|
- | format | serialization format (mandatory) |
- | batch_item_count | falls back to FEED_EXPORT_BATCH_ITEM_COUNT; number of items per output file |
2.3.0 | encoding | falls back to FEED_EXPORT_ENCODING; sets the output encoding |
2.3.0 | fields | falls back to FEED_EXPORT_FIELDS; sets the fields to export |
2.3.0 | indent | falls back to FEED_EXPORT_INDENT; sets the indentation amount |
2.3.0 | item_export_kwargs | dict of keyword arguments passed to the item exporter class |
2.4.0 | overwrite | whether to overwrite the file if it already exists (True) or append to its content (False) |
2.4.0 | store_empty | falls back to FEED_STORE_EMPTY; whether to export empty feeds |
2.4.0 | uri_params | falls back to FEED_URI_PARAMS; names a function that sets extra parameters applied to the feed URI |
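The uri_params option names a function that receives the parameter dict and the spider, letting you add custom placeholders to feed URIs. A hedged sketch, assuming the in-place-update style of the 2.x API (the function body and the stand-in spider are illustrative):

```python
def uri_params(params, spider):
    # Make an extra %(spider_class)s placeholder available in feed URIs.
    params["spider_class"] = type(spider).__name__

# Quick demonstration with a stand-in spider object.
class DemoSpider:
    name = "demo"

params = {"name": "demo"}
uri_params(params, DemoSpider())
print(params["spider_class"])  # DemoSpider
```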
FEED_STORAGES_BASE
The base dictionary of built-in feed storage backends, keyed by URI scheme:
```python
FEED_STORAGES_BASE = {
    '': 'scrapy.extensions.feedexport.FileFeedStorage',
    'file': 'scrapy.extensions.feedexport.FileFeedStorage',
    'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
}
```
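Each of these classes follows the same small interface: open() returns a writable file-like object and store() persists it. A minimal in-memory sketch of that interface (this class is invented for illustration and is not part of Scrapy):

```python
from io import BytesIO

# Minimal sketch of the feed-storage interface:
# open() hands out a writable file object, store() persists its contents.
class InMemoryFeedStorage:
    def __init__(self, uri):
        self.uri = uri
        self.stored = None

    def open(self, spider):
        return BytesIO()

    def store(self, file):
        # A real backend would upload or write to disk here.
        self.stored = file.getvalue()

storage = InMemoryFeedStorage("mem://demo")
f = storage.open(spider=None)
f.write(b'{"name": "book"}\n')
storage.store(f)
print(storage.stored)  # b'{"name": "book"}\n'
```

A custom class like this could be registered for a new URI scheme through the FEED_STORAGES setting, which is merged on top of the base dictionary above.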
FEED_EXPORTERS_BASE
The base dictionary of built-in item exporters, keyed by the format value:
```python
FEED_EXPORTERS_BASE = {
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}
```
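Note that 'jl' is simply an alias for 'jsonlines'. User entries in the FEED_EXPORTERS setting are merged over this base dictionary, so a project can register new formats or override existing ones. A sketch of that merge (the custom class path is made up):

```python
# Base mapping shown above, abbreviated to two entries for this sketch.
FEED_EXPORTERS_BASE = {
    "json": "scrapy.exporters.JsonItemExporter",
    "jsonlines": "scrapy.exporters.JsonLinesItemExporter",
}

# Project setting: register a hypothetical YAML exporter under format "yaml".
FEED_EXPORTERS = {"yaml": "myproject.exporters.YamlItemExporter"}

exporters = {**FEED_EXPORTERS_BASE, **FEED_EXPORTERS}
print(exporters["yaml"])  # myproject.exporters.YamlItemExporter
```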
FEED_EXPORT_BATCH_ITEM_COUNT
When set to an integer N greater than 0, Scrapy generates multiple output files, storing at most N items in each one.
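The batching logic amounts to starting a new output file every N items; Scrapy distinguishes the files via placeholders such as %(batch_id)d in the feed URI. A simplified sketch of the grouping step (the helper function is illustrative, not Scrapy's implementation):

```python
def split_into_batches(items, batch_item_count):
    """Group items into consecutive batches of at most batch_item_count each."""
    batches = []
    for i in range(0, len(items), batch_item_count):
        batches.append(items[i:i + batch_item_count])
    return batches

# Seven items with a batch size of 3 produce three output files.
print(split_into_batches(list(range(7)), 3))  # [[0, 1, 2], [3, 4, 5], [6]]
```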