Scrapy Summary

Spiders

Starting requests in Scrapy

The simple way

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # responses for start_urls are sent to the default parse() callback
        self.log(response.url)

Overriding start_requests

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log(response.url)

start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the spider will begin to crawl from. Subsequent requests are generated successively from these initial ones.

The spider class scrapy.Spider

https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy-spider

CrawlSpider

https://docs.scrapy.org/en/latest/topics/spiders.html#crawlspider

CSVFeedSpider

https://docs.scrapy.org/en/latest/topics/spiders.html#csvfeedspider

The Selector class

Built-in selectors: https://docs.scrapy.org/en/latest/topics/selectors.html#module-scrapy.selector

The common extraction methods are xpath(), css(), and re().

The Request class:

https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#request-objects

errbacks (using the error-handling callback parameter)

https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#using-errbacks-to-catch-exceptions-in-request-processing

The Response class

https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#response-objects

Scheduler

To be added.

Downloader

To be added.

Engine

To be added.

Pipelines

item

https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#document-topics/items

Items support the same operations as dictionaries; Item.fields returns all of an item's declared fields.

Custom ItemLoader

https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#declaring-item-loaders

Declaring input and output processors

https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#declaring-input-and-output-processors

MongoDB pipeline example

https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#write-items-to-mongodb
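The linked MongoDB example follows the standard pipeline interface (open_spider / close_spider / process_item). A runnable sketch of that interface, with an in-memory list standing in for the pymongo collection:

```python
class InMemoryPipeline:
    def open_spider(self, spider):
        # the real example connects to MongoDB here
        self.items = []

    def close_spider(self, spider):
        # the real example closes the MongoDB connection here
        pass

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item  # always return the item so later pipelines receive it

# exercise the pipeline by hand (Scrapy normally drives these calls)
pipeline = InMemoryPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "Widget"}, spider=None)
pipeline.close_spider(spider=None)
print(pipeline.items)
```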

Splash pipeline example (taking a screenshot of an item)

https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#take-screenshot-of-item

Exporting items (CSV, JSON, etc. via feed exports)

https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#feed-exports
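Note that the built-in feed exports cover CSV, JSON, XML, and similar formats; there is no native Excel exporter, though CSV files open in Excel directly. A settings.py fragment using the FEEDS setting (Scrapy ≥ 2.1; older versions use FEED_URI / FEED_FORMAT instead):

```python
# settings.py fragment: export scraped items to two feeds at once
FEEDS = {
    "items.csv": {"format": "csv", "encoding": "utf-8"},
    "items.json": {"format": "json"},
}
```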


Downloader middleware

Documentation:

https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#downloader-middleware

Implementing proxy IPs:

https://www.jianshu.com/p/8449b9c397bb
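A minimal sketch of the idea from the article above: a downloader middleware that sets request.meta['proxy'], which Scrapy's built-in HttpProxyMiddleware honours. The proxy addresses are placeholders:

```python
import random
from types import SimpleNamespace

PROXIES = ["http://127.0.0.1:8888", "http://127.0.0.1:8889"]  # placeholder pool

class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each request."""

    def process_request(self, request, spider):
        # HttpProxyMiddleware picks up request.meta['proxy'] downstream
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # None means: continue processing this request normally

# quick check with a stand-in request object (no running crawler needed)
request = SimpleNamespace(meta={})
RandomProxyMiddleware().process_request(request, spider=None)
print(request.meta["proxy"])
```

To enable it, register the class in DOWNLOADER_MIDDLEWARES in settings.py (the module path here is hypothetical), e.g. `{"myproject.middlewares.RandomProxyMiddleware": 350}`.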

Writing your own downloader middleware:

https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#writing-your-own-downloader-middleware

Built-in downloader middleware reference:

https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#built-in-downloader-middleware-reference

Spider middleware

Documentation

https://docs.scrapy.org/en/latest/topics/spider-middleware.html#spider-middleware

Settings

https://yiyibooks.cn/__trs__/zomin/Scrapy15/index.html#settings


Reposted from www.cnblogs.com/ycg-blog/p/12514161.html