Getting Started with Scrapy


  • Install: pip install scrapy

  • Debug in the interactive shell (install ipython via pip first for a better shell experience):
    scrapy shell "http://www.baidu.com"

    • View the response headers: response.headers
    • View the response body: response.body
  • Create a project:
    scrapy startproject first_obj
    Directory structure:

    • first_obj directory
      • middlewares: middleware
      • items: item definitions (the format of the scraped data)
      • pipelines: persistence
      • settings: configuration file
    • scrapy.cfg: project configuration
  • Create a spider:
    cd first_obj
    scrapy genspider baidu baidu.com

  • Run the spider:
    scrapy crawl baidu [--nolog] [-o baidu.json/baidu.csv]

  • Other commands:

    1. List all spiders available in the current project: scrapy list
    2. Download the given URL and write the fetched content to standard output: scrapy fetch <url>
    3. Open the given URL in a browser, as your Scrapy spider would "see" it: scrapy view <url>
    4. Get the value of a Scrapy setting: scrapy settings --get <setting>
    5. Print the Scrapy version: scrapy version
    6. Run a quick performance benchmark: scrapy bench
  • Configuration file:
    settings.py:
    ROBOTSTXT_OBEY: whether to obey the target site's robots.txt rules
    CONCURRENT_REQUESTS: the maximum number of concurrent requests Scrapy will perform

All setting names must be uppercase.
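
For reference, a minimal settings.py sketch covering the two settings above (the concrete values are illustrative assumptions; CONCURRENT_REQUESTS defaults to 16):

# settings.py -- minimal sketch for the first_obj project
BOT_NAME = 'first_obj'

SPIDER_MODULES = ['first_obj.spiders']
NEWSPIDER_MODULE = 'first_obj.spiders'

# Whether to obey the target site's robots.txt rules
ROBOTSTXT_OBEY = True

# Maximum number of concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 16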

  • Basic operations
  1. Selector
from scrapy.selector import Selector

# Build a Selector from the response and pick out each item block
hxs = Selector(response=response)
img_list = hxs.xpath("//div[@class='item']")
for item in img_list:
    # Relative XPath from each item: read the share-title attribute and the image URL
    title = item.xpath("./div[@class='news-content']/div[@class='part2']/@share-title").extract()[0]
    url = item.xpath("./div[@class='news-pic']/img/@original").extract_first().strip('//')
  2. yield
from scrapy.http import Request

# Collect pagination links and schedule them back through the same parse callback
page_list = hxs.xpath(r'//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()

for page in page_list:
    yield Request(url=page, callback=self.parse)
  3. pipeline
# spiders/chouti.py
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request

from ..items import ChouTiItem


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    def parse(self, response):
        hxs = Selector(response=response)
        img_list = hxs.xpath("//div[@class='item']")
        for item in img_list:
            title = item.xpath("./div[@class='news-content']/div[@class='part2']/@share-title").extract()[0]
            url = item.xpath("./div[@class='news-pic']/img/@original").extract_first().strip('//')
            obj = ChouTiItem(title=title, url=url)
            yield obj

# items.py
import scrapy


class ChouTiItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
  • pipelines.py
    • Pipeline execution order:
      1. Check whether the Pipeline class defines a from_crawler method.
      If it does: obj = Pipeline.from_crawler(crawler)
      If it does not: obj = Pipeline()
      2. The spider opens: obj.open_spider()
      3. While the spider runs:
      parse() executes and yields items...
      obj.process_item() is called for each yielded item
      4. The spider closes: obj.close_spider()
    • Usually implementing process_item is all you need.
from scrapy.exceptions import DropItem


class SavePipeline(object):
    def __init__(self, v):
        self.v = v
        self.file = open('chouti.txt', 'a+')

    def process_item(self, item, spider):
        # Do the work / persist the item here.
        # Returning the item lets the following pipelines keep processing it.
        self.file.write(str(item) + '\n')
        return item

        # To discard the item so no later pipeline sees it:
        # raise DropItem()

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at startup to create the pipeline object.
        :param crawler:
        :return:
        """
        val = crawler.settings.get('SIX')
        return cls(val)

    def open_spider(self, spider):
        """
        Called when the spider starts.
        :param spider:
        :return:
        """
        print('spider opened')

    def close_spider(self, spider):
        """
        Called when the spider closes.
        :param spider:
        :return:
        """
        self.file.close()
        print('spider closed')
# settings.py
# The integer assigned to each pipeline determines its position: items pass through pipelines in order from lower to higher numbers, conventionally in the 0-1000 range.
# Once raise DropItem() is raised, the item is not passed to any later pipeline.
ITEM_PIPELINES = {
   'fone.pipelines.SavePipeline': 300,
}
  • Note: ITEM_PIPELINES in settings.py is global (every spider runs these pipelines). To special-case an individual spider, check spider.name inside the pipeline method in pipelines.py:
    def process_item(self, item, spider):
        if spider.name == 'chouti':
            pass
  4. Deduplication
  • Default dedup settings:
	DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
	DUPEFILTER_DEBUG = False
	# Directory for the file that records visited requests, e.g. /root/ gives a final path of /root/requests.seen
	JOBDIR = ""
  • Custom filter:
    1) Define a filter class, e.g. the following in fone/rfp.py:
class RepeatUrl:
    def __init__(self):
        self.visited_url = set()

    @classmethod
    def from_settings(cls, settings):
        """
        Called at initialization.
        :param settings:
        :return:
        """
        return cls()

    def request_seen(self, request):
        """
        Check whether the current request has been visited before.
        :param request:
        :return: True if it has already been visited; False if not
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        Called when the crawl starts.
        :return:
        """
        print('open replication')

    def close(self, reason):
        """
        Called when the crawl finishes.
        :param reason:
        :return:
        """
        print('close replication')

    def log(self, request, spider):
        """
        Log duplicate requests.
        :param request:
        :param spider:
        :return:
        """
        print('repeat', request.url)
    2) Point DUPEFILTER_CLASS at the custom class in settings.py:
DUPEFILTER_CLASS = 'fone.rfp.RepeatUrl'
  5. Custom extensions
from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('SIX')
        ext = cls(val)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        print('open')

    def spider_closed(self, spider):
        print('close')
# settings.py
EXTENSIONS = {
   'fone.extensions.MyExtension': 100,
}
  • For more hooks, see the other signals exposed by scrapy.signals (from scrapy import signals); a small example follows.
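
A minimal sketch under stated assumptions (the ItemCountExtension name and the counting logic are this example's own; item_scraped and spider_closed are built-in Scrapy signals), showing an extension that hooks an additional signal:

from scrapy import signals


class ItemCountExtension(object):
    def __init__(self):
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # item_scraped fires for every item that passes all pipelines without being dropped
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def item_scraped(self, item, spider):
        self.count += 1

    def spider_closed(self, spider):
        print('%s scraped %d items' % (spider.name, self.count))

It would be registered through EXTENSIONS exactly like MyExtension above.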
  6. Middleware
    middlewares.py is generated when the project is created.
  • Spider middleware (SpiderMiddleware): the class methods and their execution order (a minimal sketch follows this list):

    • process_spider_input: called once the response has been downloaded, before it is handed to parse (2)
    • process_spider_output: called with the results the spider returns; must return an iterable containing Request or Item objects (3)
    • process_spider_exception: called on exceptions; returning None passes the exception on to the following middleware, while an iterable containing Response or Item objects is handed to the scheduler or the pipelines (4)
    • process_start_requests: called when the spider starts, with an iterable of Request objects (1)
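
A minimal sketch of such a spider middleware (the class name and the pass-through bodies are this example's own choices; the fone.middlewares path used in the registration is likewise an assumption):

class SpiderMiddleware1(object):
    def process_start_requests(self, start_requests, spider):
        # (1) called when the spider starts; must return an iterable of Request objects
        for request in start_requests:
            yield request

    def process_spider_input(self, response, spider):
        # (2) called for each downloaded response before it reaches the callback (e.g. parse)
        return None

    def process_spider_output(self, response, result, spider):
        # (3) called with what the callback returns; must return an iterable of Request/Item objects
        for item_or_request in result:
            yield item_or_request

    def process_spider_exception(self, response, exception, spider):
        # (4) called when a callback raises; returning None keeps passing the exception on
        return None


# settings.py
SPIDER_MIDDLEWARES = {
   'fone.middlewares.SpiderMiddleware1': 543,
}
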
  • Downloader middleware (DownloaderMiddleware) example code and notes:

    • Downloader middleware has a wide range of uses, especially the pattern of returning None or a Response object from process_request
class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called for every request that is about to be downloaded, by each downloader middleware's process_request in turn.
        :param request:
        :param spider:
        :return:
            None: continue with the remaining middlewares and download the request
            Response object: stop running process_request and start running process_response
            Request object: stop the middleware chain and send the Request back to the scheduler
            raise IgnoreRequest: stop running process_request and start running process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called with the downloaded response on its way back to the spider.
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: handed on to the process_response of the other middlewares
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when the download handler or a process_request() (of a downloader middleware) raises an exception.
        :param request:
        :param exception:
        :param spider:
        :return:
            None: keep handing the exception to the following middlewares
            Response object: stop running the remaining process_exception methods
            Request object: stop the middleware chain; the request will be rescheduled for download
        """
        return None
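
A concrete illustration of that common case (a hedged sketch; the class name, header value and priority number are this example's own choices): a downloader middleware that stamps a header onto every outgoing request and returns None so the normal download still happens:

class CustomHeaderMiddleware(object):
    def process_request(self, request, spider):
        # Modify the outgoing request, then return None so the remaining
        # middlewares and the downloader itself continue as usual
        request.headers['User-Agent'] = 'Mozilla/5.0 (scrapy demo)'
        return None


# settings.py
DOWNLOADER_MIDDLEWARES = {
   'fone.middlewares.CustomHeaderMiddleware': 543,
}
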
  7. Custom commands
  • Create a directory (any name, e.g. commands) at the same level as spiders
  • Inside it, create a crawlall.py file (the file name becomes the command name)
# commands/crawlall.py
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
  • Add COMMANDS_MODULE = 'project_name.directory_name' to settings.py
  • Run scrapy crawlall from the project directory


Reposted from blog.csdn.net/super2feng/article/details/85296213