Command line tool

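Scrapy is controlled through the scrapy command-line tool, referred to here as the "Scrapy tool" to distinguish it from its sub-commands, which are simply called "commands" or "Scrapy commands".

The Scrapy tool provides several commands for multiple purposes, and each one accepts a different set of arguments and options.

Configuration settings

Scrapy looks for configuration parameters in ini-style scrapy.cfg files in standard locations: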

/etc/scrapy.cfg or c:\scrapy\scrapy.cfg (system-wide),
~/.config/scrapy.cfg ($XDG_CONFIG_HOME) and ~/.scrapy.cfg ($HOME) for global (user-wide) settings, and
scrapy.cfg inside a Scrapy project’s root (see next section).

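Settings from these files are merged in the listed order of preference: user-defined values have higher priority than system-wide defaults, and project-wide settings override all others when defined.

Scrapy also understands, and can be configured through, a number of environment variables. Currently these are: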
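The merging of a system-wide and a user-wide file can be approximated with Python's standard configparser module, where values read later override values read earlier. This is only a sketch: the helper name merge_cfg_files and the temporary file names are made up for illustration, and Scrapy's own lookup code may differ in detail.

```python
import configparser
import tempfile
from pathlib import Path


def merge_cfg_files(paths):
    """Read INI files in order; values in later files override earlier ones."""
    parser = configparser.ConfigParser()
    parser.read(paths)  # files that do not exist are skipped silently
    return parser


# Two throwaway files standing in for a system-wide and a user-wide scrapy.cfg.
tmp = Path(tempfile.mkdtemp())
system_cfg = tmp / "system_scrapy.cfg"
user_cfg = tmp / "user_scrapy.cfg"
system_cfg.write_text("[settings]\ndefault = systemwide.settings\n")
user_cfg.write_text("[settings]\ndefault = userwide.settings\n")

# Lowest priority first, highest priority last.
merged = merge_cfg_files([tmp / "missing.cfg", system_cfg, user_cfg])
print(merged.get("settings", "default"))  # userwide.settings
```

Listing the files from lowest to highest priority keeps the override rule in one place: whichever file is read last wins.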

SCRAPY_SETTINGS_MODULE (see Designating the settings)
SCRAPY_PROJECT (see Sharing the root directory between projects)
SCRAPY_PYTHON_SHELL (see Scrapy shell)

Default structure of Scrapy projects

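Before delving into the command-line tool and its sub-commands, let's first look at the directory structure of a Scrapy project.

Though it can be modified, all Scrapy projects have the same file structure by default, similar to this: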

scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

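The directory where the scrapy.cfg file resides is known as the project root directory. That file contains the name of the Python module that defines the project settings. Here is an example: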

[settings]
default = myproject.settings

Sharing the root directory between projects

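A project root directory, the one that contains scrapy.cfg, may be shared by multiple Scrapy projects, each with its own settings module.

In that case, you must define one or more aliases for those settings modules under [settings] in your scrapy.cfg file (the file can be edited in any editor, PyCharm for instance):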

[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings

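By default, the scrapy command-line tool uses the default settings. Use the SCRAPY_PROJECT environment variable to specify a different project for scrapy to use: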

$ scrapy settings --get BOT_NAME
Project 1 Bot
$ export SCRAPY_PROJECT=project2
$ scrapy settings --get BOT_NAME
Project 2 Bot
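The alias lookup behind the session above can be sketched in a few lines of Python. The function settings_module is a hypothetical helper written for this note, not part of Scrapy; it only illustrates the idea of falling back to the default entry when SCRAPY_PROJECT is not set.

```python
import configparser

CFG_TEXT = """
[settings]
default = myproject1.settings
project1 = myproject1.settings
project2 = myproject2.settings
"""


def settings_module(cfg_text, environ):
    """Pick the settings module for the alias in SCRAPY_PROJECT, else 'default'."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    alias = environ.get("SCRAPY_PROJECT", "default")
    return parser.get("settings", alias)


# The environment is passed in as a plain dict so the example is repeatable.
print(settings_module(CFG_TEXT, {}))                              # myproject1.settings
print(settings_module(CFG_TEXT, {"SCRAPY_PROJECT": "project2"}))  # myproject2.settings
```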

Using the `scrapy` tool

You can start by running the Scrapy tool with no arguments; it prints some usage help and the available commands:

Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
[...]

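The first line prints the currently active project if you are inside a Scrapy project. In this example it was run from outside a project. If run from inside a project, it would print something like this: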

Scrapy X.Y - project: myproject

Usage:
  scrapy <command> [options] [args]

[...]
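One plausible way to decide whether a working directory is "inside a project" is to walk up the directory tree until a scrapy.cfg file is found. The sketch below assumes exactly that rule; find_project_root is a hypothetical helper, and the temporary layout exists only to exercise it.

```python
import tempfile
from pathlib import Path


def find_project_root(start):
    """Return the nearest directory at or above `start` containing scrapy.cfg, or None."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        if (directory / "scrapy.cfg").is_file():
            return directory
    return None


# Build a throwaway project layout to try the helper on.
root = Path(tempfile.mkdtemp()).resolve()
(root / "scrapy.cfg").write_text("[settings]\ndefault = myproject.settings\n")
spiders_dir = root / "myproject" / "spiders"
spiders_dir.mkdir(parents=True)

# Starting deep inside the project still finds the project root.
print(find_project_root(spiders_dir) == root)  # True
```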

Creating projects

The first thing you typically do with the scrapy tool is create your Scrapy project:

scrapy startproject myproject [project_dir]

That will create a Scrapy project under the project_dir directory. If project_dir wasn’t specified, project_dir will be the same as myproject.

Next, you go inside the new project directory:

cd project_dir

And you’re ready to use the scrapy command to manage and control your project from there.

Controlling projects

You use the scrapy tool from inside your projects to control and manage them.

For example, to create a new spider:

scrapy genspider mydomain mydomain.com

Some commands (such as crawl) must be run from inside a Scrapy project; see the command reference below for details.

Available tool commands

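This section lists the built-in commands with a description and some usage examples. Remember, you can always get more information about each command by running: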

scrapy <command> -h

And you can see all available commands with:

scrapy -h

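There are two kinds of commands: those that only work from inside a Scrapy project (project-only commands) and those that also work without an active project (global commands), though the latter may behave slightly differently when run from inside a project, because they then use the project's overridden settings.

Global commands (as described below):

startproject
genspider
settings
runspider
shell
fetch
view

Project-only commands (as described below):

crawl
check
list
edit
parse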

startproject

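Syntax: scrapy startproject <project_name> [project_dir]
Requires project: no

Creates a new Scrapy project named project_name under the project_dir directory. If project_dir is not specified, it will be the same as project_name.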

Usage example:

$ scrapy startproject myproject

genspider

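Syntax: scrapy genspider [-t template] <name> <domain>
Requires project: no

Creates a new spider in the current folder, or in the current project's spiders folder if called from inside a project. The <name> parameter is set as the spider's name, while <domain> is used to generate the allowed_domains and start_urls spider attributes.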

Usage example:

$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

$ scrapy genspider example example.com
Created spider 'example' using template 'basic'

$ scrapy genspider -t crawl scrapyorg scrapy.org
Created spider 'scrapyorg' using template 'crawl'

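This is just a convenient shortcut for creating spiders from pre-defined templates, but certainly not the only way to create them. You can write the spider source code files yourself instead of using this command.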
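To make the idea of a template concrete, here is a rough sketch of how a spider file could be rendered from a name and a domain with Python's string.Template. The template text, the class-naming rule and the render_spider helper are all assumptions made for illustration; the files Scrapy actually generates may look different.

```python
from string import Template

# Hypothetical template, loosely modelled on a minimal spider; not Scrapy's own file.
SPIDER_TEMPLATE = Template('''\
import scrapy


class ${classname}(scrapy.Spider):
    name = "${name}"
    allowed_domains = ["${domain}"]
    start_urls = ["https://${domain}"]

    def parse(self, response):
        pass
''')


def render_spider(name, domain):
    """Fill the template; the class name is derived from the spider name."""
    classname = name.capitalize() + "Spider"
    return SPIDER_TEMPLATE.substitute(classname=classname, name=name, domain=domain)


# The rendered text is what would be written to a file such as example.py.
source = render_spider("example", "example.com")
print(source)
```

Note that the domain feeds both allowed_domains and start_urls, which matches the role the genspider description gives to that argument.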

crawl

Syntax: scrapy crawl <spider>
Requires project: yes

Start crawling using a spider.

Usage examples:

$ scrapy crawl myspider
[ ... myspider starts crawling ... ]

check

Syntax: scrapy check [-l]
Requires project: yes

Run contract checks, that is, verify each spider callback against the contracts declared in its docstring.

Usage examples:

$ scrapy check -l
first_spider
  * parse
  * parse_item
second_spider
  * parse
  * parse_item

$ scrapy check
[FAILED] first_spider:parse_item
>>> 'RetailPricex' field is missing

[FAILED] first_spider:parse
>>> Returned 92 requests, expected 0..4
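Contracts are written as @-prefixed lines in a callback's docstring; @url, @returns and @scrapes are among the built-in ones. The sketch below shows such a docstring on a plain function and pulls the contract lines out with a hypothetical helper, contract_lines. It is a simplified illustration of where contracts live, not how Scrapy evaluates them, and the URL and field names are placeholders.

```python
def parse_item(response):
    """Parse a product page.

    @url http://www.example.com/some/page.html
    @returns items 1 1
    @scrapes name category length
    """


def contract_lines(callback):
    """Collect docstring lines that start with '@' (a simplified illustration)."""
    doc = callback.__doc__ or ""
    return [line.strip() for line in doc.splitlines() if line.strip().startswith("@")]


# Print one contract per line, in the order they appear in the docstring.
for line in contract_lines(parse_item):
    print(line)
```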

list

Syntax: scrapy list
Requires project: yes

List all available spiders in the current project. The output is one spider per line.

Usage example:

$ scrapy list
spider1
spider2

edit

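Syntax: scrapy edit <spider>
Requires project: yes

Edit the given spider using the editor defined in the EDITOR environment variable or, if that is unset, the EDITOR setting.

In most cases this command is provided only as a convenient shortcut; developers are of course free to choose any tool or IDE to write and debug spiders.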

Usage example:

$ scrapy edit spider1

fetch

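Syntax: scrapy fetch <url>
Requires project: no

Downloads the given URL using the Scrapy downloader and writes the contents to standard output.

The interesting thing about this command is that it fetches the page the way the spider would download it. For example, if the spider has a USER_AGENT attribute that overrides the user agent, the command will use that one.

So this command can be used to see how your spider would fetch a certain page.

If used outside a project, no spider-specific behaviour is applied and it simply uses the default Scrapy downloader settings.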

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider
--headers: print the response's HTTP headers instead of the response's body
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

Usage examples:

$ scrapy fetch --nolog http://www.example.com/some/page.html
[ ... html content here ... ]

$ scrapy fetch --nolog --headers http://www.example.com/
{'Accept-Ranges': ['bytes'],
 'Age': ['1263   '],
 'Connection': ['close     '],
 'Content-Length': ['596'],
 'Content-Type': ['text/html; charset=UTF-8'],
 'Date': ['Wed, 18 Aug 2010 23:59:46 GMT'],
 'Etag': ['"573c1-254-48c9c87349680"'],
 'Last-Modified': ['Fri, 30 Jul 2010 15:30:18 GMT'],
 'Server': ['Apache/2.2.3 (CentOS)']}

view

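Syntax: scrapy view <url>
Requires project: no

Opens the given URL in a browser, as your Scrapy spider would "see" it. Sometimes spiders see pages differently from regular users, so this can be used to check what the spider "sees" and confirm it is what you expect.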

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them)

Usage example:

$ scrapy view http://www.example.com/some/page.html
[ ... browser starts ... ]

shell

Syntax: scrapy shell [url]
Requires project: no

Starts the Scrapy shell for the given URL, or empty if no URL is given. UNIX-style local file paths are also supported, either relative or absolute.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider
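-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them). This only affects the URL you pass as an argument on the command line; once you are inside the shell, fetch(url) still follows HTTP redirects by default.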

Usage example:

$ scrapy shell http://www.example.com/some/page.html
[ ... scrapy shell starts ... ]

$ scrapy shell --nolog http://www.example.com/ -c '(response.status, response.url)'
(200, 'http://www.example.com/')

# shell follows HTTP redirects by default
$ scrapy shell --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(200, 'http://example.com/')

# you can disable this with --no-redirect
# (only for the URL passed as command line argument)
$ scrapy shell --no-redirect --nolog http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F -c '(response.status, response.url)'
(302, 'http://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F')

parse

Syntax: scrapy parse <url> [options]
Requires project: yes

Fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if none is given.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider
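-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--meta or -m: additional request meta that will be passed to the callback request. This must be a valid JSON string.
--cbkwargs: additional keyword arguments that will be passed to the callback. This must be a valid JSON string.
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback (i.e. spider method) to use for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level for which the requests should be followed recursively
--verbose or -v: display information for each depth level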
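Because --meta and --cbkwargs must be valid JSON strings, it can help to build them from Python dicts rather than typing the JSON by hand. The following sketch assembles a scrapy parse command line from the options listed above; build_parse_command is a hypothetical helper, and the spider name and meta values are placeholders.

```python
import json
import shlex


def build_parse_command(url, spider, callback, meta=None, cbkwargs=None):
    """Assemble a `scrapy parse` command line; dicts are serialised to JSON."""
    argv = ["scrapy", "parse", f"--spider={spider}", "-c", callback]
    if meta is not None:
        argv += ["--meta", json.dumps(meta)]
    if cbkwargs is not None:
        argv += ["--cbkwargs", json.dumps(cbkwargs)]
    argv.append(url)
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(argv)


command = build_parse_command(
    "http://www.example.com/",
    spider="myspider",
    callback="parse_item",
    meta={"page": 2},
)
print(command)
```

Letting json.dumps and shlex.join handle the serialising and quoting avoids the most common mistake with these options, namely JSON that the shell has mangled before Scrapy sees it.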

Usage example:

$ scrapy parse http://www.example.com/ -c parse_item
[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'name': 'Example item',
 'category': 'Furniture',
 'length': '12 cm'}]

# Requests  -----------------------------------------------------------------
[]

settings

Syntax: scrapy settings [options]
Requires project: no

Get the value of a Scrapy setting.

If used inside a project it shows the project's value for the setting; otherwise it shows the Scrapy default value for that setting.

Example usage:

$ scrapy settings --get BOT_NAME
scrapybot
$ scrapy settings --get DOWNLOAD_DELAY
0

runspider

Syntax: scrapy runspider <spider_file.py>
Requires project: no

Run a spider that is self-contained in a Python file, without having to create a project.

Example usage:

$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]

Honestly, much of this still feels quite deep to me, and I have yet to fully grasp what some of these commands are for.
