Detailed explanation of common Scrapy commands

 Official document: https://docs.scrapy.org/en/latest/topics/commands.html

 

Global commands: startproject, genspider, settings, runspider, shell, fetch, view, version

Project-only commands: crawl, check, list, edit, parse, bench

 

genspider-->Note: -t specifies the template used to create the spider; the default is basic

  • Syntax: scrapy genspider [-t template] <name> <domain>

  • Requires project: no

Usage example:

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'
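
For reference, the spider file produced by the default basic template looks roughly like this (the class name and attribute values follow the <name> and <domain> arguments you passed):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com"]

    def parse(self, response):
        # The basic template leaves the parsing logic for you to fill in
        pass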

 

runspider-->Advantage: it needs no project to run; spider_file.py is the path to a self-contained spider file

  • Syntax: scrapy runspider <spider_file.py>

  • Requires project: no

Run a spider self-contained in a Python file, without having to create a project.

Example usage:

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
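
A minimal self-contained spider that runspider can execute on its own might look like the sketch below; the quotes.toscrape.com URL and the CSS selectors are only illustrative assumptions:

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item dict per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

Saving this as myspider.py and running scrapy runspider myspider.py starts the crawl without any project scaffolding.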

 

parse-->Highly recommended for debugging: you can specify exactly which callback to run, for example: scrapy parse https://www.baidu.com -c parse_detail --spider=tencent

  • Syntax: scrapy parse <url> [options]

  • Requires project: yes

Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if not given.

Supported options:

  • --spider=SPIDER: bypass spider autodetection and force use of specific spider

  • -a NAME=VALUE: set spider argument (may be repeated)

  • --callback or -c: spider method to use as callback for parsing the response

  • --meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string. Example: --meta='{"foo": "bar"}'

  • --cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid JSON string. Example: --cbkwargs='{"foo": "bar"}'

  • --pipelines: process items through pipelines

  • --rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response

  • --noitems: don’t show scraped items

  • --nolinks: don’t show extracted links

  • --nocolour: avoid using pygments to colorize the output

  • --depth or -d: depth level for which the requests should be followed recursively (default: 1)

  • --verbose or -v: display information for each depth level

Usage example:

$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': 'Example item',
 'category': 'Furniture',
 'length': '12 cm'}]

# Requests  -----------------------------------------------------------------
[]
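
To see how the -c/--callback option maps onto spider code, consider the hypothetical spider below, which exposes a parse_item method; the spider name, URLs, and selectors are placeholders:

import scrapy


class MyProjectSpider(scrapy.Spider):
    name = "myproject"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # Normal crawl entry point: follow links to detail pages
        for href in response.css("a.item::attr(href)").getall():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # The method exercised by: scrapy parse <url> -c parse_item
        yield {
            "name": response.css("h1::text").get(),
            "category": response.css(".category::text").get(),
        }

With this spider in a project, scrapy parse http://www.example.com/ -c parse_item --spider=myproject fetches that single page, runs only parse_item on it, and prints the scraped items and follow-up requests, as in the output above.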

Origin blog.csdn.net/zhu6201976/article/details/106604970