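In this section:

Creating a Scrapy project
Writing a spider
Running the spider from the command line
Passing arguments to the spider from the command line

Resources for learning Python:
"Dive Into Python 3", "Python Tutorial", "Learn Python The Hard Way"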

Creating the project

Run this on the command line: scrapy startproject tutorial

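tutorial/
    scrapy.cfg            # deployment configuration file (used together with scrapyd)

    tutorial/             # the project's Python package
        __init__.py

        items.py          # items: define which data to extract (dict-like)

        middlewares.py    # middleware definitions (advanced)

        pipelines.py      # clean the data and save it, e.g. to a database

        settings.py       # project settings

        spiders/          # directory that holds the spiders
            __init__.py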

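Writing a spider

The spider scrapes quotes and their authors from http://quotes.toscrape.com/; the tag whose quotes are scraped can be chosen on the command line.

Creating the spider

Change into the project directory and generate the spider: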

(base) E:\blog>cd tutorial

(base) E:\blog\tutorial>scrapy genspider quotes toscrape.com

Testing extraction expressions in the scrapy shell

Run: scrapy shell "http://quotes.toscrape.com/"

Always quote the URL. On Windows, use double quotes; on Unix, single quotes are the usual choice.

response.css('.quote')
quote = response.css('.quote')[0]

In [6]: quote.css('.text::text').extract_first()
Out[6]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [7]: quote.css('.author::text').extract_first()
Out[7]: 'Albert Einstein'

In [8]: quote.css('.tag::text').extract()
Out[8]: ['change', 'deep-thoughts', 'thinking', 'world']

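Extracting data

CSS, XPath

response.css('title')  # returns a SelectorList object (list-like); css() and xpath() can be called on it again

response.css('title')[0] # returns a Selector object; css() and xpath() can be called on it as well

extract(), extract_first()
extract() serializes the matched nodes to text: on a SelectorList it returns a list of strings, on a single Selector one string. extract_first() returns only the first match of a SelectorList.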

In [4]: response.css('title').extract()
Out[4]: ['<title>Quotes to Scrape</title>']

In [5]: response.css('title::text').extract()
Out[5]: ['Quotes to Scrape']

When nothing matches, extract() returns an empty list and extract_first() returns None.

re()

Extracts data with a regular expression, similar to re.findall()

In [7]: response.css('title::text').re(r'\w+')
Out[7]: ['Quotes', 'to', 'Scrape']
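Since re() is described as working like re.findall(), the standard-library call below reproduces the same result on the title text. The string is copied from the shell output above; nothing here touches Scrapy itself.

```python
import re

# The text that 'title::text' selects on the page shown above.
title_text = "Quotes to Scrape"

# \w+ matches each run of word characters, so the title splits into words.
words = re.findall(r"\w+", title_text)
print(words)
```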

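Writing the spider code

How a spider works: it starts from a given set of URLs, parses each response (HTML/JSON) to obtain data and new URLs, and keeps crawling from those new URLs.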
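The crawl loop described above can be sketched with the standard library alone. This is a toy model, not how Scrapy schedules requests internally: FAKE_SITE and crawl() are hypothetical names, and an in-memory dict stands in for downloading and parsing pages.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (data on the page, links on the page).
FAKE_SITE = {
    "/": (["quote A"], ["/page/2/"]),
    "/page/2/": (["quote B"], ["/page/3/", "/"]),
    "/page/3/": (["quote C"], []),
}


def crawl(start_urls):
    """Breadth-first crawl: each page yields data plus new URLs to visit."""
    frontier = deque(start_urls)   # URLs waiting to be fetched
    seen = set(start_urls)         # URLs already scheduled, to avoid loops
    items = []
    while frontier:
        url = frontier.popleft()
        data, links = FAKE_SITE[url]   # stands in for "download and parse"
        items.extend(data)
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return items


print(crawl(["/"]))
```

In a real spider the parse() callback plays the role of the loop body: it yields the data and the follow-up requests.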

In Scrapy, a spider is a class that inherits from scrapy.Spider

Implementation

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['toscrape.com']

    def start_requests(self):
        tag = getattr(self, 'tag', None)
        if tag:
            url = 'http://quotes.toscrape.com/tag/' + tag
        else:
            url = 'http://quotes.toscrape.com/'
        yield scrapy.Request(url)

    def parse(self, response):
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
                'tags': quote.css('.tag::text').extract(),
            }
        next_page = response.css('.next a::attr(href)').extract_first()
        if next_page:
            yield response.follow(next_page)

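Notes:

name: uniquely identifies the spider within the project
allowed_domains: the domains the spider is allowed to crawl (optional)
start_requests(): supplies the starting URLs; it returns an iterable of scrapy.Request objects or is written as a generator
parse(): extracts new data and URLs; it is the default callback for every request; its response argument is a TextResponse object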
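To make the effect of allowed_domains concrete, the sketch below checks a URL's host against such a list. It is a simplified stand-in rather than Scrapy's own offsite filtering code: the helper name is_allowed is hypothetical, and it assumes that subdomains of a listed domain count as allowed.

```python
from urllib.parse import urlsplit


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["toscrape.com"]
print(is_allowed("http://quotes.toscrape.com/page/2/", allowed))  # subdomain of a listed domain
print(is_allowed("http://example.com/", allowed))                 # unrelated domain
```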

Run the spider: scrapy crawl quotes

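Simplifying the code

Attribute: `start_urls`

If the spider starts with just a few plain GET requests, the URLs can be assigned as a list to the class attribute start_urls. This works because the default implementation of start_requests() builds a scrapy.Request for each URL in that attribute, with parse() as the callback.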

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
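The filename logic in parse() above can be tried without running a crawl. The helper name page_filename is hypothetical; it only repeats the two string operations from the spider.

```python
def page_filename(url):
    """Build the output filename the same way parse() above does."""
    page = url.split("/")[-2]          # second-to-last piece of the split URL
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))
print(page_filename("http://quotes.toscrape.com/page/2/"))
```

Note that the approach depends on the trailing slash: for a URL ending in /page/1 the second-to-last piece is "page", not the page number.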

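Extracting URLs from the response and sending new requests

Method 1: build an absolute URL, just as for the initial requests

next_page = response.css('li.next a::attr(href)').extract_first() # yields the relative link '/page/2/'
if next_page is not None:
    next_page = response.urljoin(next_page) # join it into an absolute URL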
    yield scrapy.Request(next_page, callback=self.parse)
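The joining step resolves the link against the URL of the page being parsed. The standard-library urljoin() behaves much the same way, so it serves here as a rough illustration; the base URL is an assumed example of the current page.

```python
from urllib.parse import urljoin

# Assumed URL of the page currently being parsed.
base = "http://quotes.toscrape.com/page/1/"

root_relative = urljoin(base, "/page/2/")    # link starting at the site root
path_relative = urljoin(base, "page/2/")     # link relative to the current path
already_absolute = urljoin(base, "http://quotes.toscrape.com/tag/humor/")

print(root_relative)
print(path_relative)
print(already_absolute)
```

The second case shows why the leading slash matters: without it the link is appended to the current path instead of replacing it.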

Method 2: use response.follow()

# response.follow() accepts a relative link directly
yield response.follow(next_page, callback=self.parse)

# response.follow() accepts a selector object directly
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

# response.follow() accepts an <a> element directly and uses its href attribute automatically
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

Another spider

A spider that makes use of response.follow()

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first(default='').strip()  # default='' keeps strip() from failing on None when nothing matches

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

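This spider extracts the author URLs from each page, visits the author pages, and extracts the author details.

By default Scrapy does not send duplicate requests, so each author page is visited only once.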
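One way to picture duplicate filtering is a set of request fingerprints. The sketch below is a simplified stand-in and not Scrapy's actual filter: fingerprint() and SeenFilter are hypothetical names, and the normalization (sorted query parameters, fragment dropped) is an assumption made for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url):
    """Hash the method plus a normalized URL (sorted query, no fragment)."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    normalized = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    return hashlib.sha1((method + " " + normalized).encode("utf-8")).hexdigest()


class SeenFilter:
    """Remembers fingerprints and reports whether a request was seen before."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, method, url):
        fp = fingerprint(method, url)
        if fp in self._seen:
            return True
        self._seen.add(fp)
        return False


seen = SeenFilter()
first = seen.is_duplicate("GET", "http://quotes.toscrape.com/author/Albert-Einstein")
second = seen.is_duplicate("GET", "http://quotes.toscrape.com/author/Albert-Einstein")
print(first, second)
```

The first request passes and the repeat is flagged, which is why an author quoted many times still results in a single visit to the author page.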

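Links can also be followed with the CrawlSpider class.

If one item has to be assembled from several pages, use the meta parameter of scrapy.Request() to pass along the data extracted from earlier pages.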

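Passing data to the spider from the command line

Run on the command line: scrapy crawl quotes -o quotes-humor.json -a tag=humor

Values passed with -a reach the spider as strings. By default they are turned into spider attributes: the tag above, for example, is available as self.tag.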
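Because every value arrives as a string, anything numeric has to be converted by hand. The sketch below uses a plain stand-in class instead of a real spider to show the idea; SpiderStandIn and the max_pages argument are hypothetical.

```python
class SpiderStandIn:
    """Hypothetical stand-in that turns keyword arguments into attributes."""

    def __init__(self, **kwargs):
        # Each -a name=value pair arrives as a keyword argument with a str value.
        self.__dict__.update(kwargs)


# As if the command line contained: -a tag=humor -a max_pages=3
spider = SpiderStandIn(tag='humor', max_pages='3')

tag = getattr(spider, 'tag', None)                   # plain string
max_pages = int(getattr(spider, 'max_pages', '1'))   # convert explicitly
missing = getattr(spider, 'author', None)            # None when not passed

print(tag, max_pages, missing)
```

Using getattr() with a default, as the quotes spider does for tag, keeps the spider working when the argument is omitted.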
