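Key points
1. Consider basing your own spiders on CrawlSpider, which offers more built-in functionality.
2. CSS selectors can be used to parse data, but XPath is more powerful.
3. How Scrapy follows links.
4. Data can be saved to a JSON file, but the JSON Lines format also supports appending. For more complex applications, consider writing an item pipeline.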

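This tutorial assumes that Scrapy is already installed correctly. If it is not, see the installation guide.
(Translator's note: under Anaconda, running conda install scrapy is enough.)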

We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.

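This tutorial walks through the following tasks:
1. Creating a new Scrapy project
2. Writing a spider to crawl a site and extract data
3. Exporting the scraped data using the command line
4. Changing the spider to follow links automatically
5. Using spider arguments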

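Scrapy is written in Python. To get the most out of Scrapy, it helps to know some Python.

If you are already familiar with other programming languages and want to learn Python quickly, we recommend Dive Into Python 3 or the Python Tutorial.

If you are new to programming and want to start with Python, the online book Learn Python The Hard Way may be useful, or see this list of Python resources for non-programmers.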

Creating a project

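Before you start scraping, you need to set up a new Scrapy project. Choose a directory to store your code in and run the following at the command prompt: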

scrapy startproject tutorial

This creates a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # the project's Python module
        __init__.py

        items.py          # project items definition file

        middlewares.py    # project middlewares file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # directory where the spiders are stored
            __init__.py

Our first Spider

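Spiders are classes that you define and that Scrapy uses to scrape information from a website. They must subclass scrapy.Spider and define:
* the initial requests to make
* how to follow links in the pages (optional)
* how to parse the downloaded page content to extract data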

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

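As the example shows, our Spider subclasses scrapy.Spider and defines some attributes and methods:
* name: identifies the spider. It must be unique within a project; two spiders cannot share the same name.
* start_requests(): must return an iterable of Requests (either a list of requests or a generator function) from which the spider begins to crawl. Subsequent requests are generated successively from these initial requests.
* parse(): called to handle the response downloaded for each request. The response parameter is an instance of TextResponse, which holds the page content and has further helper methods for working with it.
The parse() method usually parses the response, extracts the scraped data as dicts, finds new URLs to follow, and creates new requests from them.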

How to run our spider

To put the spider to work, go to the project's top-level directory and run:

scrapy crawl quotes

This command runs the spider named "quotes", which sends some requests to quotes.toscrape.com. The output looks similar to this:

... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...

The current directory should now contain two new files, quotes-1.html and quotes-2.html, holding the content of the two URLs specified in the code above.
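The file names come from the page number embedded in each URL. As a minimal illustration in plain Python (the helper name filename_for is hypothetical and not part of Scrapy), the expression used in parse() behaves like this:

```python
def filename_for(url):
    """Build the output file name the same way parse() does above."""
    # Splitting 'http://quotes.toscrape.com/page/1/' on "/" gives a list
    # ending in [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


for url in ('http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'):
    print(filename_for(url))
```

Note that this relies on the trailing slash: for a URL ending in /page/1 (no slash), index -2 would be the word page rather than the number.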

What just happened under the hood?

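Scrapy schedules the scrapy.Request objects returned by the spider's start_requests() method. Upon receiving a response for each one, it instantiates a Response object and calls the callback method associated with the request (in our example, the parse() method), passing the response as the argument.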
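To make the request/callback cycle concrete, here is a toy sketch in plain Python. It is not how Scrapy is implemented (Scrapy is asynchronous and far more elaborate); FAKE_SITE, run and the tuple-based "request" are all assumptions made up for illustration.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, next page URL or None).
FAKE_SITE = {
    "/page/1/": ("first page", "/page/2/"),
    "/page/2/": ("second page", None),
}


def parse(url, body, next_url):
    """Callback: yield one item, then optionally a follow-up request."""
    yield {"url": url, "body": body}
    if next_url is not None:
        yield (next_url, parse)  # a "request" here is just (url, callback)


def run(start_requests):
    """Process requests in order, handing each response to its callback."""
    queue = deque(start_requests)
    items = []
    while queue:
        url, callback = queue.popleft()
        body, next_url = FAKE_SITE[url]        # stands in for the download
        for result in callback(url, body, next_url):
            if isinstance(result, tuple):      # a new request: schedule it
                queue.append(result)
            else:                              # a scraped item: collect it
                items.append(result)
    return items


items = run([("/page/1/", parse)])
print(items)
```

The point of the sketch is only the division of labour: the spider supplies requests and callbacks, and the engine decides when each callback runs.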

A shortcut to the start_requests method

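Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can define a start_urls class attribute containing a list of URLs. The default implementation of start_requests() then uses this list to create the initial requests for your spider.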

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

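The parse() method is called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback.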

Extracting data

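The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell.

scrapy shell 'http://quotes.toscrape.com/page/1/'

Note: always enclose URLs in quotes when running the Scrapy shell from the command line; otherwise URLs that contain arguments (for example, the & character) will not work.
On Windows, use double quotes; on Linux, use single quotes.

The command produces output like the following: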

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

Using the shell, you can select elements with response.css():

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

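The result of running response.css('title') is a list-like object called SelectorList. It represents a list of Selector objects and lets you further refine the query or extract the data.
To extract the text from the title above, you can run: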

>>> response.css('title::text').extract()
['Quotes to Scrape']

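There are two things to note here.
The first is that we added ::text to the CSS query, which means we want to select only the text directly inside the <title> element. Without ::text, we get the full title element, including its tags: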

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .extract() is a list, because we are dealing with a SelectorList instance. When you only want the first result, you can do:

>>> response.css('title::text').extract_first()
'Quotes to Scrape'

Alternatively, you could write:

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'

However, using .extract_first() avoids an IndexError: when no element matches the selection, .extract_first() returns None.

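There is a lesson here: most scraping code should be resilient to things not being found on a page, so that even if some parts fail to be scraped, you still get the rest of the data.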
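One way to apply this principle is to guard any post-processing against a missing value. The sketch below is plain Python; the helper name clean is hypothetical, and the inputs stand in for what .extract_first() might return.

```python
def clean(value, default=None):
    """Strip whitespace from an extracted string, tolerating a missing value."""
    # value is None when the selector matched nothing on the page.
    return value.strip() if value is not None else default


found = clean("  Albert Einstein \n")
missing = clean(None, default="")
print(repr(found), repr(missing))
```

Calling .strip() directly on the result of .extract_first() would raise an AttributeError whenever the element is absent; the guard lets the rest of the item be scraped anyway.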

Besides the extract() and extract_first() methods, you can also use the re() method to extract data using regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
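The three results above follow from how regular-expression groups work. As a rough stand-in using only the standard re module (re_like is a hypothetical helper and only an approximation of the selector method, not its implementation), the same patterns can be tried on the plain title string:

```python
import re


def re_like(text, pattern):
    """Return all matches as a flat list of strings."""
    flat = []
    for match in re.findall(pattern, text):
        # findall gives strings for zero or one group, tuples for two or more.
        if isinstance(match, tuple):
            flat.extend(match)
        else:
            flat.append(match)
    return flat


title = 'Quotes to Scrape'
whole = re_like(title, r'Quotes.*')
word = re_like(title, r'Q\w+')
groups = re_like(title, r'(\w+) to (\w+)')
print(whole, word, groups)
```

When the pattern contains groups, only the captured groups are returned, which is why the last pattern yields two separate strings rather than the full match.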

To find suitable CSS selectors, you can open the response page in your web browser from the shell with view(response). Browser tools or extensions such as Firebug or Selector Gadget can help.

XPath: a brief intro

Besides CSS, Scrapy selectors also support XPath expressions:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

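XPath expressions are very powerful and are the foundation of Scrapy Selectors. In fact, CSS selectors are converted to XPath under the hood.
While perhaps not as popular as CSS selectors, XPath expressions offer more power: besides navigating the structure, they can also look at the content. With XPath you can, for example, select a link that contains the text "next page". This makes XPath well suited to scraping, so we encourage you to learn it even if you already know how to construct CSS selectors; it will make scraping much easier.

We won't cover much XPath in this tutorial, but you can read more about using XPath with Scrapy Selectors here. To learn XPath, we recommend this tutorial to learn XPath through examples and this tutorial to learn "how to think in XPath".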

Extracting quotes and authors

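Now that you know a bit about selection and extraction, let's go on to write the code that extracts the quotes from the web page.
Each quote is represented by HTML elements that look like this: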

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Let's first experiment in the Scrapy shell to find out how to extract the data we want:

$ scrapy shell 'http://quotes.toscrape.com'

We get a list of selectors for the quote elements with:

>>> response.css("div.quote")

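Each of the selectors returned by the query above allows us to run further queries over its sub-elements. Let's assign the first selector to a variable, so that we can work on one particular quote: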

>>> quote = response.css("div.quote")[0]

Now let's extract the title, author, and tags from that quote:

>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

Given that the tags are a list of strings, we can use the .extract() method to get all of them:

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each piece, we can now iterate over all the quote elements and put them together into a Python dict:

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
>>>

Extracting data in our spider

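Let's get back to our spider. So far it doesn't extract any data in particular; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into the spider.
A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield keyword in the callback, as shown here: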

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

If you run this spider, it outputs the extracted data along with the log:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Storing the scraped data

The simplest way to store the scraped data is to use Feed exports, with the following command:

scrapy crawl quotes -o quotes.json

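This generates a quotes.json file containing all the scraped items, serialized as JSON.
For historical reasons, Scrapy appends to a given file instead of overwriting its contents. If you run this command twice without removing the file before the second run, you end up with a broken JSON file.
You can also use other formats, such as JSON Lines: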

scrapy crawl quotes -o quotes.jl

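The JSON Lines format is stream-like, so new records can easily be appended to it. It doesn't have the same problem as JSON when you run the command twice. Also, since each record is on a separate line, you can process big files without having to fit everything in memory. Some tools, such as JQ, can process this format at the command line.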
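The difference between the two formats can be checked with the standard json module alone. This is a small sketch under stated assumptions: the records are made-up sample items, and an in-memory buffer stands in for the quotes.jl file.

```python
import io
import json

first_run = [{"author": "Albert Einstein"}, {"author": "J.K. Rowling"}]
second_run = [{"author": "André Gide"}]

# JSON Lines: one JSON document per line, so a second run can simply append.
jl_file = io.StringIO()            # in-memory stand-in for quotes.jl
for batch in (first_run, second_run):
    for item in batch:
        jl_file.write(json.dumps(item) + "\n")

# Each line is parsed on its own; nothing requires loading the whole file.
records = [json.loads(line) for line in jl_file.getvalue().splitlines()]
print(len(records))

# Plain JSON: a second array appended after the first is not valid JSON.
json_text = json.dumps(first_run) + json.dumps(second_run)
try:
    json.loads(json_text)
    appended_json_is_valid = True
except json.JSONDecodeError:
    appended_json_is_valid = False
print(appended_json_is_valid)
```

Reading line by line is also what makes large .jl files manageable: a real reader would iterate over the file object instead of calling splitlines() on the full text.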

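For small projects, that should be enough. If you want to do more complex processing on the scraped items, you can write an Item Pipeline. A placeholder file for Item Pipelines was created along with the project, in tutorial/pipelines.py. You don't need to implement any item pipelines if you only want to store the scraped items.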
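For a sense of what such a pipeline might look like, here is a minimal sketch. The class name StripQuoteMarksPipeline and its behaviour are hypothetical; the only part taken from Scrapy is the general shape of a pipeline, a plain class with a process_item(item, spider) method that returns the item.

```python
class StripQuoteMarksPipeline:
    """Hypothetical pipeline: remove the curly quote marks around 'text'."""

    def process_item(self, item, spider):
        text = item.get("text")
        if text is not None:
            # \u201c and \u201d are the opening and closing curly quotes.
            item["text"] = text.strip("\u201c\u201d")
        return item


# Outside Scrapy, the method can be exercised directly with a plain dict.
pipeline = StripQuoteMarksPipeline()
cleaned = pipeline.process_item({"text": "\u201cIt is our choices.\u201d"}, spider=None)
print(cleaned["text"])
```

To try it in the project, the class would go in tutorial/pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py.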

Following links

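Suppose that, instead of scraping only the first two pages, we want to scrape every page on the site. How do we do that?
We already know how to extract data from pages, so let's see how to follow the links in them.
The first step is to extract the link we want to follow. Examining the page, we find that the link to the next page has the following markup: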

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want its href attribute. For that, Scrapy supports a CSS extension that selects the attribute contents:

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

Now let's modify the spider to follow the link to the next page and extract data from it:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

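After extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL with urljoin() (since the links can be relative), and yields a new request for the next page. It registers itself as the callback for that request, which keeps the crawl going through all the pages.
This example shows Scrapy's mechanism for following links: when you yield a Request in a callback method, Scrapy schedules that request to be sent and registers a callback to run when its response arrives.
Using this, you can build complex crawlers that follow links according to rules you define and extract different kinds of data depending on the page being visited.
Our example creates a sort of loop, following the link to the next page until it reaches the last page. This suits blogs, forums, and other sites with pagination.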
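The URL-joining step can be tried on its own with the standard library. The sketch below uses urllib.parse.urljoin as a stand-in for response.urljoin(), on the assumption that the response's URL serves as the base; the two variables are sample values taken from the example above.

```python
from urllib.parse import urljoin

current_page = 'http://quotes.toscrape.com/page/1/'   # URL of the response
next_href = '/page/2/'     # value extracted from li.next a::attr(href)

# A leading slash makes the link relative to the site root, not to /page/1/.
absolute = urljoin(current_page, next_href)
print(absolute)
```

Passing the relative value '/page/2/' straight to scrapy.Request would not work, which is why the spider joins it with the current URL first.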

A shortcut for creating Requests

As a shortcut for creating Request objects explicitly, you can use response.follow():

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

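Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin(). Note that response.follow() just returns a Request instance; you still have to yield this request.
Instead of a string, you can also pass a selector to response.follow; this selector should extract the necessary attributes: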

for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code above can be shortened further:

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

More examples and patterns

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

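This spider starts from the main page. It follows all the links to the author pages, calling the parse_author() callback for each of them, and follows the pagination links with the parse() callback.
Here we pass the callbacks to response.follow() as positional arguments; the same works for scrapy.Request().
The parse_author() callback defines a helper function that extracts and cleans up the result of a CSS query, and yields a dict with the author data.
Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hitting servers too hard because of a programming mistake. This behaviour can be configured with the DUPEFILTER_CLASS setting.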
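The idea behind duplicate filtering can be sketched in a few lines of plain Python. This is a deliberate simplification and an assumption-laden toy: drop_duplicates is a hypothetical helper that compares raw URL strings, whereas Scrapy's own filter works on request fingerprints.

```python
def drop_duplicates(urls):
    """Yield each URL only the first time it is seen."""
    seen = set()
    for url in urls:
        if url in seen:
            continue          # already requested: skip it
        seen.add(url)
        yield url


author_links = [
    '/author/Albert-Einstein',
    '/author/J-K-Rowling',
    '/author/Albert-Einstein',   # the same author quoted a second time
]
unique = list(drop_duplicates(author_links))
print(unique)
```

With a filter like this in front of the scheduler, the spider can yield a request for every author link it finds without tracking which pages it has already visited.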

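Hopefully by now you have a good understanding of how Scrapy follows links.

For another example of following links, see the CrawlSpider class. It is a generic spider that implements a small rules engine, and you can write your own spiders on top of it.

Also, a common pattern is to build an item with data from more than one page; see the trick to pass additional data to the callbacks.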

Using spider arguments

You can pass command-line arguments to your spiders with the -a option when running them:

scrapy crawl quotes -o quotes-humor.json -a tag=humor

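These arguments are passed to the spider's __init__ method and become spider attributes by default.

In this example, the value provided for the tag argument is available as self.tag. You can use this to make the spider fetch only quotes with a specific tag, building the URL from the argument: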

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass tag=humor to this spider, it only visits URLs from the humor tag, such as
http://quotes.toscrape.com/tag/humor.
You can learn more about handling spider arguments here.
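How an argument turns into an attribute and then into a URL can be imitated without Scrapy. In this sketch, ArgsDemo is a hypothetical stand-in class (not a scrapy.Spider subclass), and copying keyword arguments onto the instance is an assumption that mirrors the default behaviour described above.

```python
class ArgsDemo:
    """Hypothetical stand-in for a spider that accepts -a arguments."""

    def __init__(self, **kwargs):
        # Each "-a name=value" pair arrives as a keyword argument (a string).
        self.__dict__.update(kwargs)

    def start_url(self):
        url = 'http://quotes.toscrape.com/'
        # getattr with a default keeps the spider usable without -a tag=...
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        return url


print(ArgsDemo(tag='humor').start_url())
print(ArgsDemo().start_url())
```

Because argument values arrive as strings, anything numeric (a page limit, for example) would need to be converted explicitly inside the spider.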
