Scrapy framework -- Response

1.Scrapy response

1.1 Response methods and attributes

(1) body: the body of the HTTP response, as bytes
(2) body_as_unicode: the response body as a string (older API; text is the modern equivalent)
(3) copy: return a copy of this response
(4) css: run a CSS selector query against the response
(5) encoding: the encoding of the response
(6) headers: the response headers
(7) meta: parameters passed along for handling the response (shortcut for response.request.meta)
(8) replace: return a new response with some attributes replaced
(9) request: the Request object that produced this response
(10) selector: Scrapy's Selector instance bound to this response
(11) status: the HTTP status code, e.g. 200 or 400
(12) text: the response body as text
(13) url: the URL of this response
(14) urljoin: build an absolute URL from a relative one
(15) xpath: run an XPath query against the response

 

① Code:
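A minimal sketch of a parse() method that logs several of the attributes listed above (it assumes a spider set up like the ones in the later sections):

def parse(self, response):
    self.log(response.url)               # URL of the response
    self.log(response.status)            # HTTP status code, e.g. 200
    self.log(response.encoding)          # encoding of the response
    self.log(response.headers)           # response headers
    self.log(response.text[:100])        # first 100 characters of the body as text
    self.log(response.urljoin("/more"))  # build an absolute URL from a relative one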

② Result:

1.2 Response subclasses

TextResponse, HtmlResponse, XmlResponse
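A quick sketch to check which subclass you are actually handed inside parse() (the spider setup is assumed, as in the later examples):

def parse(self, response):
    # For an HTML page Scrapy hands parse() an HtmlResponse, a TextResponse
    # subclass, which is why .text, .css() and .xpath() are available.
    self.log(type(response).__name__)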

 

2. Scrapy selector

Scrapy's core matching mechanism is the Selector.

BeautifulSoup is convenient to use, but it parses relatively slowly.

lxml parses quickly.


Scrapy's Selector combines the advantages of both.

 

① Code:

from scrapy.selector import Selector

# parse() method of a spider: wrap the response in a Selector and log it
def parse(self, response):
    selector = Selector(response)
    self.log(selector)

 

② Result:

 

2.1 What Selector objects support

2.1.1 CSS queries

Expression        Description           Example
*                 all elements          matches every tag
tag               a specific tag        img: all img tags
tag1,tag2         multiple tags         img,a: both img and a tags
tag1 tag2         descendant tags       img a: a tags nested under img
[attrib=value]    attribute filter      [id="1"]: tags whose id attribute is 1
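A short sketch applying the expressions from the table inside a parse() method; the tag names and the attribute value are only illustrative:

def parse(self, response):
    everything = response.css("*")          # all elements
    imgs = response.css("img")              # a specific tag
    imgs_and_links = response.css("img,a")  # multiple tags
    nested_links = response.css("div a")    # a tags nested under div
    by_attr = response.css('[id="1"]')      # tags whose id attribute is "1"
    self.log(len(everything))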

① Code (calling Selector explicitly)

from scrapy.selector import Selector

# parse() method of a spider: select every img tag and log each match
def parse(self, response):
    selector = Selector(response)
    img_list = selector.css("img")
    for img in img_list:
        self.log(img)

 

② Code (calling css() directly on the response; simplified)

There is no need to import Selector, and the output is the same as before.

# parse() method of a spider: the response exposes css() directly
def parse(self, response):
    img_list = response.css("img")
    for img in img_list:
        self.log(img)

 

③ Result:

 

2.1.2 XPath queries

(1) In Scrapy's XPath there are no separate methods such as text(), attrib() or tag(); these are written directly inside the XPath expression.

(2) . refers to the current node

(3) .. refers to the parent node

 

Expression    Description                   Example
/             the document root / a level   /html/body/div: select the div
text()        text content                  /html/body/div/a/text(): the text of the a tag
@attrib       an attribute                  /html/body/div/a/@href: the href attribute of the a tag
*             wildcard (anything)           /html/body/*[@class='hello']: all tags whose class equals hello
                                            /html/body/a/@*: all attributes of the a tag
[]            predicate (filter)            /html/body/div[4]: the 4th div
                                            /html/body/div[@class="xxx"]: divs whose class is xxx
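A short sketch applying the expressions from the table inside a parse() method; the paths are only illustrative:

def parse(self, response):
    divs = response.xpath("/html/body/div")         # a path from the document root
    texts = response.xpath("//a/text()")            # the text of every a tag
    hrefs = response.xpath("//a/@href")             # the href attribute of every a tag
    hellos = response.xpath('//*[@class="hello"]')  # any tag whose class is "hello"
    fourth = response.xpath("/html/body/div[4]")    # the 4th div under body
    self.log(hrefs.extract())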

 

① Code:

# parse() method of a spider: select the src attribute of every img tag
def parse(self, response):
    img_list = response.xpath("//img/@src")
    for img in img_list:
        self.log(img)

② Result:

 

2.1.3 Regular-expression queries (re)

re() cannot be used on its own; it can only be chained after a css() or xpath() match.

 

① Code:

# parse() method of a spider: keep only the src values that end with "png"
def parse(self, response):
    img_list = response.xpath("//img/@src")
    for img in img_list:
        img = img.re(".*png$")   # re() returns a list of matching strings
        self.log(img)

 

② Result:

 

2.2 Methods that return string results

2.2.1 extract(): for a single selector object

① Code:

# parse() method of a spider: extract() turns each matched selector into a string
def parse(self, response):
    img_list = response.xpath("//img/@src")
    for img in img_list:
        img = img.extract()
        self.log(img)

 

② Result:

2.2.2 extract_first(): for a list of selectors

① Code:

# parse() method of a spider: extract_first() returns only the first match as a string
def parse(self, response):
    img_list = response.xpath("//img/@src")
    first_src = img_list.extract_first()
    self.log(first_src)

② Result:

 

3.Scrapy item

3.1 Item introduction

(1) A big advantage of Scrapy is that it can define a data model: with Item we define a model class for our data (an ordinary class definition), in a way similar to Django models, though not identical.

(2) Scrapy generates a default item class (in items.py) in which we can define the fields we want.

(3) Every field of a Scrapy item is declared as a Field.

(4) The item returns the parsed result in dictionary form (see the sketch below).
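A small sketch of that dictionary-like behaviour, using the ScrapyTest item defined in the next subsection (the URL is only illustrative):

item = ScrapyTest()                         # ScrapyTest is defined in items.py below
item["src"] = "https://example.com/a.png"   # assign a declared field like a dict key
print(item["src"])                          # read it back like a dict key
print(dict(item))                           # convert the whole item to a plain dict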

 

3.2 Using items

 

① Code (spider):

import scrapy
from scrapy import Request
from ScrapySpider.items import ScrapyTest


class TestSpider(scrapy.spiders.Spider):

    name = "baiduSpider"

    def start_requests(self):
        url = "https://www.baidu.com/"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
        }
        yield Request(url, headers=headers)

    def parse(self, response):
        img_list = response.xpath("//img/@src")
        for img in img_list:
            # fill one item per image URL and log it
            item = ScrapyTest()
            item["src"] = img.extract()
            self.log(item)

 

② Code (items):

import scrapy


class ScrapyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()


# inherits from ScrapyspiderItem and adds the src field
class ScrapyTest(ScrapyspiderItem):
    src = scrapy.Field()

 

③ Result:

 

4.Scrapy pipeline

At this point we can structure the data as items, but to actually output the data we still need pipelines.

4.1 Pipeline introduction

(1) The first step in using pipelines is to uncomment the ITEM_PIPELINES setting in settings.py.

 

This setting has two parts (see the sketch below):

① the import path of the pipeline class

② its priority, an integer conventionally in the 0-1000 range; the lower the number, the earlier the pipeline runs
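A sketch of the uncommented setting as it appears in this project's settings.py (the path matches the pipeline class shown later; 300 is the priority generated by the Scrapy project template):

# settings.py
ITEM_PIPELINES = {
    "ScrapySpider.pipelines.ScrapyspiderPipeline": 300,
}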

 

(2) The spider must yield items, i.e. parse() must be a generator that yields each item.

 

4.2 Using pipelines

① Code (spider):

import scrapy
from scrapy import Request
from ScrapySpider.items import ScrapyTest


class TestSpider(scrapy.spiders.Spider):

    name = "baiduSpider"

    def start_requests(self):
        url = "https://www.baidu.com/"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
        }
        yield Request(url, headers=headers)

    def parse(self, response):
        img_list = response.xpath("//img/@src")
        for img in img_list:
            item = ScrapyTest()
            item["src"] = img.extract()
            self.log(item)
            yield item   # yielded items are passed on to the pipelines

 

② Code (pipelines):
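A minimal pipeline sketch that simply prints each item as it passes through and hands it back; the full download pipeline appears in section 5.3:

# pipelines.py
class ScrapyspiderPipeline(object):
    def process_item(self, item, spider):
        # every item yielded by the spider arrives here
        print(item)
        return item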

 

③ Result:

 

5. Scrapy project example

5.1 Create the spider file qiushi.py

import scrapy
from ScrapySpider.items import ScrapyTest


class qiushiTest(scrapy.spiders.Spider):
    name = "qiushi"

    def start_requests(self):
        url = "https://www.qiushibaike.com/"
        headers = {
            "Referer": "https://www.baidu.com/link?url=0NjZXCRuEfuf8lcVVYy8j3o_548KY5Nvc_GHkq6auqOxoY7-LnODt6dLkTcihaWC&wd=&eqid=8e7edbd1000211e4000000055bc19182",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
        }
        yield scrapy.Request(url, headers=headers)

    def parse(self, response):
        # collect the src attribute of every image and yield one item per image
        img_list = response.xpath("//img/@src")
        for img in img_list:
            item = ScrapyTest()
            item["src"] = img.extract()
            self.log(item)
            yield item

 

5.2 items.py

import scrapy


class ScrapyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()


# inherits from ScrapyspiderItem and adds the src field
class ScrapyTest(ScrapyspiderItem):
    src = scrapy.Field()

 

5.3 pipelines.py

from urllib import request


class ScrapyspiderPipeline(object):
    def process_item(self, item, spider):
        src = item["src"]
        url = "http:" + src   # prepend a scheme (assumes the scraped src starts with "//")
        if "?" in src:
            URL = src.split("?")[0]        # split on "?" and keep the part before it
            name = URL.rsplit("/", 1)[1]   # split once on "/" from the right and keep the file name
        else:
            name = src.rsplit("/", 1)[1]
        print("===========")
        print(name)
        print(url)
        path = "F:\\img\\" + name
        try:
            request.urlretrieve(url, path)   # download the image to the local path
        except Exception as e:
            print(e)
        else:
            print("%s is down" % name)
        return item

 

5.4 run.py

# run the "qiushi" spider from a script instead of the command line
from scrapy import cmdline

cmdline.execute("scrapy crawl qiushi".split())

 

5.5 Result


Reposted from blog.csdn.net/qq_39620483/article/details/83040456