Basic usage of the scrapy crawler framework

1. Installation issues

  • Asked about it on Baidu Tieba:
    [screenshot]
  • The official docs actually recommend using scrapy inside a virtual environment (the commands for that are sketched below):
    [screenshot]
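  • Following that recommendation, creating a virtual environment and installing scrapy into it looks roughly like this (a minimal sketch, assuming Python 3 on Windows; the project path matches the prompts used later in this post):
E:\Graduation-design\spider>python -m venv venv
E:\Graduation-design\spider>venv\Scripts\activate
(venv) E:\Graduation-design\spider>pip install scrapy
(venv) E:\Graduation-design\spider>scrapy version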

2. The official example

  • Two attributes and one method that every spider must define. Their defaults in the scrapy.Spider base class:
	name = None
	self.start_urls = []  # set in __init__ when the subclass defines no start_urls

	def parse(self, response):
	    raise NotImplementedError('{}.parse callback is not defined'.format(self.__class__.__name__))
  • code
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    # list-comprehension version (pages 1-10)
    # start_urls = [f'http://quotes.toscrape.com/page/{page}/' for page in range(1,11)]

    # parse callback; response carries the page source
    def parse(self, response):
        # select the data with CSS selectors
        for selector in response.css('div.quote'):
            # the quote text
            text = selector.css('span.text::text').get()
            # the author
            author = selector.xpath('span/small/text()').get()
            # yield the data; items must be yielded if you want to save them
            # for now a plain dict works; later this becomes an Item
            items = {
                "quote": text,
                "author": author
            }
            yield items

        # pagination:
        # 1. find the URL of the next page
        # 2. request it; the response comes back to parse()
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page:
            # option 1: build the absolute URL by hand
            # url = 'http://quotes.toscrape.com' + next_page
            # yield scrapy.Request(url)
            # option 2: let response.follow resolve the relative URL
            # yield response.follow(next_page)

            yield response.follow(next_page, callback=self.parse)

  • Run it from a terminal:
(venv) E:\Graduation-design\spider\scrapy>scrapy runspider website_test.py
  • Save the data as JSON:
(venv) E:\Graduation-design\spider\scrapy>scrapy runspider website_test.py -o quotes.json

3. Installing and using scrapy in a virtual environment

  • Create a project:
(venv) E:\Graduation-design\spider\scrapy>scrapy startproject quotetutorial
  • Create a spider: scrapy genspider [-t template] <name> <domain>
(venv) E:\Graduation-design\spider\scrapy>scrapy genspider quotes quotes.toscrape.com
  • Run the spider:
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes
  • Test selectors in a shell from the terminal: scrapy shell <url>
>>> quotes = response.css('.quote')
>>> type(quotes)
<class 'scrapy.selector.unified.SelectorList'>
>>> quotes[0]
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>
>>> quotes[0].css(.text)
  File "<console>", line 1
    quotes[0].css(.text)
                  ^
SyntaxError: invalid syntax
>>> quotes[0].css('.text')
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“T...'>]
>>> quotes[0].css('.text::text')
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world as we have created it is a...'>]
>>> quotes[0].css('.text::text').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> quotes[0].css('.text::text').getall()
['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
>>> quotes[0].css('.tags tag::text').getall()
[]
>>> quotes[0].css('.tags tag::text').get()
>>> quotes[0].css('.tags .tag::text').getall()
['change', 'deep-thoughts', 'thinking', 'world']
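  • The same data can be selected with XPath; two equivalents of the CSS queries above (a sketch based on the page structure used by the spider code below; the expected output mirrors the CSS results):
>>> quotes[0].xpath('.//span[@class="text"]/text()').get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> quotes[0].xpath('.//a[@class="tag"]/text()').getall()
['change', 'deep-thoughts', 'thinking', 'world']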

  • code:
# -*- coding: utf-8 -*-
import scrapy
from ..items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').get()
            author = quote.css('.author::text').get()
            tags = quote.css('.tags .tag::text').getall()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # follow the "next" link; the last page does not have one, so guard against None
        next_page = response.css('.pager .next a::attr(href)').get()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
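  • items.py: the spider imports QuoteItem, so items.py needs a matching class. A minimal definition consistent with the three fields used above (a sketch; the name of the class scrapy generates by default may differ):
import scrapy


class QuoteItem(scrapy.Item):
    # one Field per key assigned in the spider
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()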
  • Saving the data (the exporter is chosen from the file extension):
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.json
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.jl
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.csv
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.xml
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.pickle
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o quotes.marshal
(venv) E:\Graduation-design\spider\scrapy\quotetutorial>scrapy crawl quotes -o ftp://...
  • Enable the pipelines in settings.py:
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'quotetutorial.pipelines.TextPipeline': 300,  # priority: lower values run first
   'quotetutorial.pipelines.MongoPipeline': 400,
}
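
  • pipelines.py is not shown above; here is a minimal sketch of what the two enabled classes could look like, assuming pymongo is installed and that MONGO_URI / MONGO_DB are custom settings (the text-length limit and the 'quotes' collection name are also assumptions):
import pymongo
from scrapy.exceptions import DropItem


class TextPipeline:
    # drop items with no text and shorten overly long quotes (assumed behaviour)
    limit = 50

    def process_item(self, item, spider):
        if not item.get('text'):
            raise DropItem('missing text')
        if len(item['text']) > self.limit:
            item['text'] = item['text'][:self.limit].rstrip() + '...'
        return item


class MongoPipeline:
    # write each item to MongoDB; connection details come from settings.py

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['quotes'].insert_one(dict(item))
        return item
With the priorities above, TextPipeline (300) processes each item before MongoPipeline (400) stores it.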

4. scrapy command-line usage
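
  • Quick reference for the built-in commands used in this post, plus a few other common ones from scrapy's standard command set:
scrapy -h                          # list available commands
scrapy startproject <name>         # create a new project
scrapy genspider <name> <domain>   # generate a spider skeleton
scrapy crawl <spider> [-o file]    # run a spider inside a project
scrapy runspider <file.py>         # run a standalone spider file
scrapy shell <url>                 # interactive selector testing
scrapy list                        # list the spiders in the current project
scrapy version                     # print the installed version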

Reposted from blog.csdn.net/xieyipeng1998/article/details/104602622