Python Study Notes: Python Web Scraping Basics, 18-2 Scrapy-shell

# scrapy-shell

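- scrapy shell tutorial
- shell
- Starting it
   - Linux: open a terminal (Ctrl+Alt+T on many desktops), then run scrapy shell "url:xxxx" (note the double quotes)
   - Windows: scrapy shell "url:xxx"
   - Once started, it automatically downloads the page at the given URL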
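
The double quotes matter because URLs often contain characters such as & that a command-line shell would otherwise interpret itself. As a rough sketch (the helper name `build_scrapy_shell_argv` and the example URL are made up for illustration), the snippet below shows that a quoted URL stays a single argument, and how the same command could be built as an argument list where no quoting is needed:

```python
import shlex


def build_scrapy_shell_argv(url, code=None):
    """Hypothetical helper: build an argument list for `scrapy shell`."""
    argv = ["scrapy", "shell", url]
    if code is not None:
        # -c evaluates the snippet inside the shell, prints the result and exits
        argv += ["-c", code]
    return argv


# Assumed example URL; the & would be special to a POSIX shell if left unquoted
url = "https://example.com/search?q=scrapy&page=2"

# What you would type by hand: the quotes keep the URL in one piece
typed = 'scrapy shell "https://example.com/search?q=scrapy&page=2"'
parsed = shlex.split(typed)
print(parsed)

# The same command as a list, e.g. for subprocess.run(argv)
argv = build_scrapy_shell_argv(url, code="response.status")
print(argv)
```

Passing the list to `subprocess.run` would start the shell without any quoting concerns, since each list element is handed over as one argument.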

- Once the download finishes, the page content is stored in the response variable; use response whenever you need it
- response

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>百度一下，你就知道</title>'>]
>>> response.xpath('//title').extract()
['<title>百度一下，你就知道</title>']
>>> response.xpath('//title').extract()[0]
'<title>百度一下，你就知道</title>'

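   - The crawled content is stored in response
   - response.body is the page source (raw bytes)
   - response.headers holds the HTTP headers of the response
   - response.xpath() selects content using XPath syntax
   - response.css() selects content using CSS syntax
- selector
   - A selector lets you pick out exactly the content you want
   - response.selector.xpath: response.xpath is a shortcut for selector.xpath
   - response.selector.css: response.css is a shortcut for selector.css
   - selector.extract: returns the node content as unicode strings
   - selector.re: selects content with a regular expression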

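Addendum (2018-09-24 21:21:33):

If a site blocks crawlers, you can set headers from within scrapy shell; see steps 1-2-3-4 in the figure below.

Pay special attention to step 4, which sets redirect=False/True.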

To save the result locally, continue by running the following code:

>>> response.xpath('/html/head/title/text()')
[<Selector xpath='/html/head/title/text()' data='IT之家 - 数码，科技，生活 - 软媒旗下'>]
>>> response.xpath('/html/head/title/text()').extract()
['IT之家 - 数码，科技，生活 - 软媒旗下']
>>> response.css('.cate-title')
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' cate-title ')]" data='<h2 class="cate-title">资讯</h2>'>, <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' cate-title ')]" data='<h2 class="cate-title">极客</h2>'>, <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' cate-title ')]" data='<h2 class="cate-title">微软</h2>'>, <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' cate-title ')]" data='<h2 class="cate-title">苹果</h2>'>, <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' cate-title ')]" data='<h2 class="cate-title">资源</h2>'>]
