1. Scrapy response
1.1 Response attributes and methods
(1) body: the body of the HTTP response, as bytes
(2) body_as_unicode: the response body as a string (older API; text is preferred)
(3) copy: returns a copy of the response
(4) css: run CSS selectors against the response
(5) encoding: the encoding of the response
(6) headers: the response headers
(7) meta: data carried over from the request, for use while processing the response
(8) replace: returns a new response with the given attributes replaced
(9) request: the Request object that produced this HTTP response
(10) selector: Scrapy's selector, bound to this response
(11) status: the HTTP status code, e.g. 200 or 400
(12) text: the response body as text (str)
(13) url: the URL of the HTTP response
(14) urljoin: builds an absolute URL from a relative one
(15) xpath: run XPath selectors against the response
① Code:
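A minimal sketch (my own, since the original code here was a screenshot) showing a few of these attributes inside a spider's parse() method:

def parse(self, response):
    self.log(response.url)        # URL of the response
    self.log(response.status)     # HTTP status code, e.g. 200
    self.log(response.headers)    # response headers
    self.log(response.encoding)   # encoding of the response
    # urljoin builds an absolute URL from a relative one
    self.log(response.urljoin("/index.html"))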
② Result:
1.2 Response subclasses
TextResponse, HtmlResponse, XmlResponse
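A small sketch (not from the original notes): for an HTML page, parse() receives an HtmlResponse, which subclasses TextResponse:

from scrapy.http import HtmlResponse, TextResponse

def parse(self, response):
    # both checks print True for an HTML page
    self.log(isinstance(response, HtmlResponse))
    self.log(isinstance(response, TextResponse))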
2. Scrapy selector
Scrapy's matching core is the Selector.
BeautifulSoup is convenient to use, but its parsing is relatively slow.
lxml parses quickly.
Scrapy's selectors combine the advantages of both.
① Code:
from scrapy.selector import Selector

def parse(self, response):
    selector = Selector(response)
    self.log(selector)
② Result:
2.1 Lookups supported by Selector objects
2.1.1 CSS lookups
| Expression | Description | Example |
| --- | --- | --- |
| * | all elements | * matches every tag |
| tag | a specific tag | img matches all img tags |
| tag1,tag2 | multiple tags | img,a matches both img and a tags |
| tag1 tag2 | descendant tags | img a matches a tags inside img |
| [attrib=value] | a specific attribute value | [id=1] matches tags whose id equals 1 |
① Code (calling through a Selector):
from scrapy.selector import Selector

def parse(self, response):
    selector = Selector(response)
    img_list = selector.css("img")
    for img in img_list:
        self.log(img)
② Code (calling css() directly on the response; simplified):
No need to import Selector; the output is identical to the version above.
def parse(self, response):
    img_list = response.css("img")
    for img in img_list:
        self.log(img)
③ Result:
2.1.2 XPath lookups
(1) When writing XPath in Scrapy there are no separate methods like text(), attrib(), or tag(); we write these directly into the matching expression itself.
(2) . refers to the current node (see the relative-lookup sketch after the table below)
(3) .. refers to the parent node
| Expression | Description | Example |
| --- | --- | --- |
| / | the document root, or one level down | /html/body/div selects the div |
| text() | text content | /html/body/div/a/text() selects the a tag's text |
| @attrib | an attribute | /html/body/div/a/@href selects the a tag's href |
| * | wildcard for anything | /html/body/*[@class='hello'] selects every tag whose class equals hello; /html/body/a/@* selects all attributes of the a tag |
| [] | predicate (filter) | /html/body/div[4] selects the 4th div; /html/body/div[@class="xxx"] selects divs whose class is xxx |
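A short sketch of points (2) and (3) above (my own illustration): prefixing a nested lookup with "." keeps it relative to the current node, while ".." climbs to the parent:

def parse(self, response):
    for div in response.xpath("//div"):
        # "." restricts the lookup to the current div node
        links = div.xpath(".//a/@href").extract()
        # ".." climbs from the div back to its parent node
        parent = div.xpath("..").extract_first()
        self.log(links)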
① Code:
def parse(self, response):
    img_list = response.xpath("//img/@src")
    for img in img_list:
        self.log(img)
② Result:
2.1.3 re lookups
re() cannot be used on its own; it can only be chained after a CSS or XPath match.
① Code:
def parse(self, response):
    img_list = response.xpath("//img/@src")
    for img in img_list:
        img = img.re(".*png$")
        self.log(img)
② Result:
2.2 Methods that return string results
2.2.1 extract(): for a single object
① Code:
def parse(self, response):
    img_list = response.xpath("//img/@src")
    for img in img_list:
        img = img.extract()
        self.log(img)
② Result:
2.2.2 extract_first(): for a list
① Code:
def parse(self, response):
    img_list = response.xpath("//img/@src")
    first_img = img_list.extract_first()
    self.log(first_img)
② Result:
3. Scrapy item
3.1 Introduction to items
(1) One big advantage of Scrapy is that it lets us define a data model: with Item we define a model class, much like a Django model, though not identical.
(2) Scrapy creates a default item class in the project; we can define the fields of our data model inside it.
(3) Every field of a Scrapy item is declared as a Field.
(4) An item returns its parsed results in dictionary form (see the short sketch after this list).
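A quick illustration of point (4), using the ScrapyTest item defined in section 3.2 below (the sample value is made up):

item = ScrapyTest()
item["src"] = "//example.com/a.png"  # hypothetical sample value
print(dict(item))                    # {'src': '//example.com/a.png'}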
3.2 Using items
① Code (spider):
import scrapy
from ScrapySpider.items import ScrapyTest
from scrapy import Request

class TestSpider(scrapy.spiders.Spider):
    name = "baiduSpider"

    def start_requests(self):
        url = "https://www.baidu.com/"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
        }
        yield Request(url, headers=headers)

    def parse(self, response):
        img_list = response.xpath("//img/@src")
        for img in img_list:
            item = ScrapyTest()
            item["src"] = img.extract()
            self.log(item)
② Code (items):
import scrapy

class ScrapyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()

# inherits from ScrapyspiderItem
class ScrapyTest(ScrapyspiderItem):
    src = scrapy.Field()
③ Result:
4. Scrapy pipeline
At this point we can structure our data as items, but to actually output the data we still need pipelines.
4.1 Introduction to pipelines
(1) The first step in using pipelines is to uncomment the pipelines entry (ITEM_PIPELINES) in settings.py; a sketch follows this list.
This setting has two parts:
① the location (import path) of the pipeline
② its priority, ranging from 1 to 1000; the smaller the value, the earlier it runs
(2) The spider must yield items, i.e. contain a generator step that produces items.
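A minimal sketch of that setting, matching this tutorial's project name ScrapySpider:

# settings.py
ITEM_PIPELINES = {
    # "import path of the pipeline class": priority (lower runs first)
    "ScrapySpider.pipelines.ScrapyspiderPipeline": 300,
}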
4.2 Using pipelines
① Code (spider):
import scrapy
from ScrapySpider.items import ScrapyTest
from scrapy import Request

class TestSpider(scrapy.spiders.Spider):
    name = "baiduSpider"

    def start_requests(self):
        url = "https://www.baidu.com/"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
        }
        yield Request(url, headers=headers)

    def parse(self, response):
        img_list = response.xpath("//img/@src")
        for img in img_list:
            item = ScrapyTest()
            item["src"] = img.extract()
            self.log(item)
            yield item
② Code (pipelines):
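The original screenshot is missing here; a minimal sketch of what a pipeline needs — a process_item method that receives every item the spider yields:

class ScrapyspiderPipeline(object):
    def process_item(self, item, spider):
        # the item behaves like a dict; here we just log its src field
        spider.log(item["src"])
        return item  # pass the item on to any later pipelines

Section 5.3 below shows a full version that actually downloads each image.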
③ Result:
5. Scrapy project example
5.1 Create the spider file qiushi.py
import scrapy
from ScrapySpider.items import ScrapyTest

class qiushiTest(scrapy.spiders.Spider):
    name = "qiushi"

    def start_requests(self):
        url = "https://www.qiushibaike.com/"
        headers = {
            "Referer": "https://www.baidu.com/link?url=0NjZXCRuEfuf8lcVVYy8j3o_548KY5Nvc_GHkq6auqOxoY7-LnODt6dLkTcihaWC&wd=&eqid=8e7edbd1000211e4000000055bc19182",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
        }
        yield scrapy.Request(url, headers=headers)

    def parse(self, response):
        img_list = response.xpath("//img/@src")
        for img in img_list:
            item = ScrapyTest()
            item["src"] = img.extract()
            self.log(item)
            yield item
5.2 items.py
import scrapy

class ScrapyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()

# inherits from ScrapyspiderItem
class ScrapyTest(ScrapyspiderItem):
    src = scrapy.Field()
5.3 pipelines.py
from urllib import request

class ScrapyspiderPipeline(object):
    def process_item(self, item, spider):
        src = item["src"]
        url = "http:" + src  # src is expected to be protocol-relative, e.g. //host/a.png
        if "?" in src:
            URL = src.split("?")[0]       # split on "?"; keep the part before the query string
            name = URL.rsplit("/", 1)[1]  # split once on "/" from the right; keep the filename
        else:
            name = src.rsplit("/", 1)[1]
        print("===========")
        print(name)
        print(url)
        path = "F:\\img\\" + name
        try:
            request.urlretrieve(url, path)
        except Exception as e:
            print(e)
        else:
            print("%s is downloaded" % name)
        return item
5.4 run.py
from scrapy import cmdline
cmdline.execute("scrapy crawl qiushi".split())
5.5 Result