Official document: https://docs.scrapy.org/en/latest/topics/commands.html
Global commands covered below (no project required): genspider, runspider
Project-only commands covered below: parse
genspider-->Note: -t specifies the template used to create the spider; the default is basic
Syntax: scrapy genspider [-t template] <name> <domain>
Requires project: no
Usage example:
$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
$ scrapy genspider example example.com
Created spider 'example' using template 'basic'
$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
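For reference, the basic template produces a minimal spider skeleton. With a recent Scrapy version the generated example.py looks roughly like this (exact contents vary by version):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Fill in extraction logic here
        pass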
runspider-->Note: the advantage is that it runs a spider from a single .py file without any project; <spider_file.py> is a path to the source file (as in the example below), not a spider name
Syntax: scrapy runspider <spider_file.py>
Requires project: no
Run a spider self-contained in a Python file, without having to create a project.
Example usage:
$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
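A complete single-file spider that runspider can execute might look like the sketch below; the file name, target site, and selectors are illustrative assumptions:

# myspider.py - a self-contained spider; no project needed
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

Run it with scrapy runspider myspider.py; add -o quotes.json to write the scraped items to a file.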
parse-->Highly recommended for debugging: you can specify which callback to run, for example: scrapy parse https://www.baidu.com -c parse_detail --spider=tencent (a sketch of such a spider appears at the end of this section)
Syntax: scrapy parse <url> [options]
Requires project: yes
Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if not given.
Supported options:
- --spider=SPIDER: bypass spider autodetection and force use of a specific spider
- -a NAME=VALUE: set spider argument (may be repeated)
- --callback or -c: spider method to use as callback for parsing the response
- --meta or -m: additional request meta that will be passed to the callback request; must be a valid JSON string. Example: --meta='{"foo": "bar"}'
- --cbkwargs: additional keyword arguments that will be passed to the callback; must be a valid JSON string. Example: --cbkwargs='{"foo": "bar"}'
- --pipelines: process items through pipelines
- --rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
- --noitems: don't show scraped items
- --nolinks: don't show extracted links
- --nocolour: avoid using pygments to colorize the output
- --depth or -d: depth level for which the requests should be followed recursively (default: 1)
- --verbose or -v: display information for each depth level
Usage example:
$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------------------
[{'name': 'Example item',
'category': 'Furniture',
'length': '12 cm'}]
# Requests -----------------------------------------------------------------
[]
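To make the -c parse_detail example above concrete, here is a hypothetical project spider named tencent whose detail callback can be exercised in isolation; the URL, selectors, and field names are assumptions for illustration:

import scrapy


class TencentSpider(scrapy.Spider):
    name = "tencent"
    start_urls = ["https://careers.tencent.com/"]

    def parse(self, response):
        # Normal crawl path: queue every detail page for parse_detail
        for href in response.css("a.job-item::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Debug just this callback against a single URL with:
        #   scrapy parse <detail-url> --spider=tencent -c parse_detail
        yield {
            "title": response.css("h1::text").get(),
            "url": response.url,
        }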