JD.com product pages

[root@localhost pytest]# cat jdspider.py
#!/usr/bin/env python
# coding=utf-8
import scrapy


class JdSpider(scrapy.Spider):
    name = 'jd'
    # Category listing page used as the crawl entry point
    start_urls = ['http://list.jd.com/list.html?cat=737,794,798']

    def parse(self, response):
        # Follow every product link found on the listing page
        for href in response.css('#plist .p-name a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_goods)

    def parse_goods(self, response):
        # Emit one record per product page: its title and its URL
        yield {
            'title': response.css('.sku-name::text').extract()[0],
            'link': response.url,
        }

Run

[root@localhost pytest]# scrapy runspider jdspider.py -o abc.csv

Result
[root@localhost pytest]# less abc.csv
link,title
http://item.jd.com/1927536.html,长虹（CHANGHONG）55U3C 55英寸双64位4K安卓智能LED液晶电视(黑色)
http://item.jd.com/1589946.html,创维（Skyworth）55M6 55英寸 4K超高清智能酷开网络液晶电视（黑色）
http://item.jd.com/1366436.html,飞利浦（PHILIPS）55PFL6840/T3 55英寸 流光溢彩 4K超高清智能电视（京东微联APP控制）
http://item.jd.com/1612016.html,创维（Skyworth）58M6 58英寸 4K超高清智能酷开网络液晶电视（黑色）
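
The exported file is ordinary CSV with a header row, so it can be read back with Python's standard csv module. The snippet below is a minimal sketch: it embeds two of the rows shown above in a string (standing in for abc.csv, so it runs without the file) and collects the links.

```python
import csv
import io

# Two rows copied from the output above; io.StringIO stands in for abc.csv
sample = (
    "link,title\n"
    "http://item.jd.com/1927536.html,长虹（CHANGHONG）55U3C 55英寸双64位4K安卓智能LED液晶电视(黑色)\n"
    "http://item.jd.com/1589946.html,创维（Skyworth）55M6 55英寸 4K超高清智能酷开网络液晶电视（黑色）\n"
)

# DictReader uses the header row (link,title) as the keys of each record
rows = list(csv.DictReader(io.StringIO(sample)))
links = [row['link'] for row in rows]

for row in rows:
    print(row['link'], row['title'])
```

To read the real file, replace the io.StringIO object with a file opened in text mode.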

Components:

Selectors

http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/selectors.html#topics-selectors

Using selectors

To explain how to use selectors, we will use the Scrapy shell (which provides interactive testing) and a sample page hosted on the Scrapy documentation server:

http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Here is its HTML source:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
 
  

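First, let's open the shell:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Once the shell loads, the response is available as the shell variable response, with a selector bound to its response.selector attribute.

Because we are dealing with HTML, the selector automatically uses an HTML parser.

Now, looking at the HTML source of that page, let's build an XPath that selects the text inside the title tag:

>>> response.selector.xpath('//title/text()')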
[<Selector (text) xpath=//title/text()>]

Because querying responses with XPath and CSS is so common, Scrapy provides two convenient shortcuts, response.xpath() and response.css():

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

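As you can see, .xpath() and .css() return a SelectorList instance, which is a list of new selectors. This API can be used to quickly extract nested data.

To actually extract the textual data, you must call the .extract() method, as follows:

>>> response.xpath('//title/text()').extract()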
[u'Example website']

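Notice that CSS selectors can select text or attribute nodes using pseudo-elements such as ::text and ::attr(name), which are Scrapy extensions to CSS rather than standard CSS3:

>>> response.css('title::text').extract()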
[u'Example website']

Now we are going to get the base URL and some image links:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']
 
  

Nested selectors

The selector methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selector methods on those selectors too. Here is an example:

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print('Link number %d points to url %s and image %s' % args)

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
 
  

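Using selectors with regular expressions

Selector also has a .re() method for extracting data with regular expressions. However, unlike .xpath() or .css(), .re() returns a list of unicode strings rather than selectors, so you cannot construct nested .re() calls.

Here is an example that extracts the image names from the HTML code above:

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')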
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']
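
To see what the pattern itself does, independent of Scrapy, the same regular expression can be applied with Python's standard re module. This is only a sketch: the texts list is a hand-written stand-in for the text nodes that the XPath above selects, and the .strip() call is an addition of this example.

```python
import re

# Hand-written stand-ins for the text nodes of the <a> elements
texts = ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ']

# Same pattern as in the .re() call: skip "Name:" and any whitespace,
# then capture the rest of the string
pattern = re.compile(r'Name:\s*(.*)')

names = []
for text in texts:
    match = pattern.search(text)
    if match:
        # strip() removes the trailing space captured by (.*)
        names.append(match.group(1).strip())

print(names)
```

As with .re(), the result is a plain list of strings, so no further selector calls can be made on it.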
 
  

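Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath is absolute to the document and not relative to the Selector you are calling it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all the <div> elements:

>>> divs = response.xpath('//div')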

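At first, you may be tempted to use the following approach, which is wrong, because it actually extracts all <p> elements from the whole document, not only those inside the <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())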

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.extract())

Another common case is to extract only the direct <p> children:

>>> for p in divs.xpath('p'):
...     print(p.extract())

For more details about relative XPaths, see the Location Paths section of the XPath specification.
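
The difference between .//p and p can be tried without Scrapy. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath and is not what Scrapy selectors use; the small document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up document: one <p> directly in the <div>, one nested deeper,
# and one outside the <div> altogether
doc = ET.fromstring(
    "<body>"
    "<div><p>direct child</p><section><p>nested</p></section></div>"
    "<p>outside any div</p>"
    "</body>"
)

div = doc.find('div')

# './/p' is relative to the div: every <p> descendant, at any depth
inside = [p.text for p in div.findall('.//p')]

# 'p' is also relative to the div: only its direct <p> children
direct = [p.text for p in div.findall('p')]

# Searching from the document root returns every <p>, which is what
# the mistaken '//p' query above ends up selecting
whole = [p.text for p in doc.iter('p')]

print(inside)
print(direct)
print(whole)
```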

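Items

http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/items.html#module-scrapy.item

The main goal of scraping is to extract structured data from unstructured sources, typically web pages. Scrapy provides the Item class for this purpose.

Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

Declaring Items

Items are declared using a simple class definition syntax and Field objects. For example:

import scrapy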

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
 
   

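Item Pipeline

http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/item-pipeline.html#item-pipeline

After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed in a defined order.

Each item pipeline component (sometimes referred to simply as an "Item Pipeline") is a Python class that implements a simple method. It receives an item and performs an action over it, and it also decides whether the item should continue through the pipeline or be dropped and no longer processed.

Typical uses of item pipelines are:

cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped items in a database

Item pipeline examples

Price validation and dropping items with no prices

Let's take a look at the following hypothetical pipeline, which adjusts the price attribute for those items that do not include VAT (price_excludes_vat attribute), and drops those items that do not contain a price:

from scrapy.exceptions import DropItem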

class PricePipeline(object):

    # Multiplier applied to prices that exclude VAT
    vat_factor = 1.15

    def process_item(self, item, spider):
        # .get() avoids a KeyError when the field was never populated
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
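
To check the behaviour of such a pipeline without running a crawl, it can be exercised by hand. The following is a rough sketch under stated assumptions: plain dicts stand in for Item objects, spider=None replaces a real spider, and the local DropItem class is a stand-in for scrapy.exceptions.DropItem so that the snippet runs without Scrapy installed.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem (assumption of this sketch)."""


class PricePipeline(object):
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        raise DropItem("Missing price in %s" % item)


pipeline = PricePipeline()

# Price excludes VAT, so it is multiplied by vat_factor
taxed = pipeline.process_item({'price': 200, 'price_excludes_vat': True}, spider=None)

# Price already includes VAT, so the item is returned unchanged
untouched = pipeline.process_item({'price': 200, 'price_excludes_vat': False}, spider=None)

# No usable price, so the pipeline raises DropItem
try:
    pipeline.process_item({'price': None}, spider=None)
    dropped = False
except DropItem:
    dropped = True

print(taxed['price'], untouched['price'], dropped)
```

Because the multiplication uses floating point, the adjusted price may differ from 230 by a tiny rounding error.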
 
    

Write items to a JSON file

The following pipeline stores all scraped items (from all spiders) in a single items.jl file, containing one item per line serialized in JSON format:

import json

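class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Text mode, because json.dumps() returns str
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # Flush and release the file when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item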

Note

The purpose of JsonWriterPipeline is only to show how to write item pipelines. If you really want to store all scraped items in a JSON file, you should use Feed exports.
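
The one-item-per-line layout means the file can be read back line by line, with each line parsed on its own. Here is a minimal sketch of that round trip; the two sample items are invented, and an in-memory io.StringIO buffer stands in for items.jl so nothing is written to disk.

```python
import io
import json

# Invented sample items; a real crawl would produce these
items = [
    {'title': 'TV 1', 'link': 'http://example.com/1.html'},
    {'title': 'TV 2', 'link': 'http://example.com/2.html'},
]

# Write one JSON document per line, as the pipeline above does
buffer = io.StringIO()
for item in items:
    buffer.write(json.dumps(item) + "\n")

# Read it back: every non-empty line is a complete JSON document
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

print(loaded)
```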

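Duplicates filter

A filter that looks for duplicate items and drops those that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id:

from scrapy.exceptions import DropItem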

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
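
A quick way to see the filter at work is to feed it a few items by hand. This sketch makes the same assumptions as before: plain dicts instead of Item objects, spider=None, and a local DropItem class standing in for scrapy.exceptions.DropItem; the sample items are invented.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem (assumption of this sketch)."""


class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item


pipeline = DuplicatesPipeline()

# Invented items; the third repeats the id of the first
scraped = [
    {'id': 1, 'title': 'first'},
    {'id': 2, 'title': 'second'},
    {'id': 1, 'title': 'first again'},
]

kept = []
for item in scraped:
    try:
        kept.append(pipeline.process_item(item, spider=None))
    except DropItem:
        # A dropped item goes no further; here we simply skip it
        pass

kept_ids = [item['id'] for item in kept]
print(kept_ids)
```

Note that ids_seen lives in memory, so this kind of filter only removes duplicates within a single run.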
 
    

Activating an Item Pipeline component

To activate an Item Pipeline component you must add its class to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

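The integer values assigned to the classes determine the order in which they run: items go through the pipelines from the lowest value to the highest. It is customary to define these numbers in the 0-1000 range.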
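
The ordering rule can be illustrated with plain Python. The sketch below only sorts the dictionary by its integer values to show the resulting order; it does not reproduce how Scrapy loads the classes.

```python
# Deliberately listed out of order: the position in the dict does not
# matter, only the integer value does
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 800,
    'myproject.pipelines.PricePipeline': 300,
}

# Lowest value first
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With these values PricePipeline (300) sees each item before JsonWriterPipeline (800), so items without a price are dropped before anything is written.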

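Feed exports

http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/feed-exports.html#feed-exports

New in version 0.10.

One of the most frequently required features when implementing scrapers is being able to store the scraped data properly, which usually means generating an "export file" with the scraped data (commonly called an "export feed") to be consumed by other systems.

Scrapy provides this functionality out of the box with Feed exports, which support multiple serialization formats and storage backends.

Serialization formats

Feed exports use the Item exporters to serialize the scraped data. The formats supported out of the box are:

JSON

JSON lines

CSV

XML

You can also extend the supported formats through the FEED_EXPORTERS setting.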

JSON

FEED_FORMAT: json

Exporter used: JsonItemExporter

If you use JSON with large feeds, see the warning in the JsonItemExporter documentation.

JSON lines

FEED_FORMAT: jsonlines

Exporter used: JsonLinesItemExporter

CSV

FEED_FORMAT: csv

Exporter used: CsvItemExporter

XML

FEED_FORMAT: xml

Exporter used: XmlItemExporter

Pickle

FEED_FORMAT: pickle

Exporter used: PickleItemExporter

Marshal

FEED_FORMAT: marshal

Exporter used: MarshalItemExporter

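Storages

When using feed exports you define where to store the feed with a URI (through the FEED_URI setting). Feed exports support multiple storage backend types, which are selected by the URI scheme.

The storage backends supported out of the box are:

Local filesystem

FTP

S3 (requires boto)

Standard output

Some storage backends may be unavailable if the required external libraries are not installed. For example, the S3 backend is only available if the boto library is installed.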
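
Putting the format and the storage together, a project's settings.py might contain something like the fragment below. It is a sketch for the 0.24-era settings this document describes: FEED_FORMAT and FEED_URI are the settings named in the text, while the chosen values (CSV written to /tmp/export.csv) are only an example.

```python
# settings.py (fragment)

# Serialization format: one of json, jsonlines, csv, xml, pickle, marshal
FEED_FORMAT = 'csv'

# Where the feed is stored; the URI scheme selects the storage backend
FEED_URI = 'file:///tmp/export.csv'
```

The -o abc.csv option used with scrapy runspider at the top of this document serves the same purpose from the command line.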

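Storage URI parameters

The storage URI can also contain parameters, which are replaced when the feed is created. These parameters are:

%(time)s - replaced by a timestamp when the feed is created

%(name)s - replaced by the spider name

Any other named parameter is replaced by the spider attribute of the same name. For example, %(site_id)s would be replaced by the spider.site_id attribute at the moment the feed is created.

Here are some examples to illustrate:

Store in FTP, using one directory per spider:

ftp://user:pass@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json

Store in S3, using one directory per spider: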

s3://mybucket/scraping/feeds/%(name)s/%(time)s.json
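
The %(name)s placeholders follow Python's own string-formatting syntax, so the substitution can be reproduced with the % operator and a dictionary. This is a sketch only: the spider name jd comes from the spider at the top of this document, while the fixed date and the exact timestamp format are assumptions made for illustration.

```python
from datetime import datetime

uri_template = 's3://mybucket/scraping/feeds/%(name)s/%(time)s.json'

params = {
    # Spider name, as set by the name attribute of the spider
    'name': 'jd',
    # Assumed timestamp format: ISO 8601 with ':' replaced by '-'
    'time': datetime(2016, 1, 2, 3, 4, 5).isoformat().replace(':', '-'),
}

# Each %(key)s placeholder is filled from the dictionary
feed_uri = uri_template % params
print(feed_uri)
```

A custom parameter such as %(site_id)s would work the same way, with its value taken from the spider's site_id attribute.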

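Storage backends

Local filesystem

The feeds are stored in the local filesystem.

URI scheme: file

Example URI: file:///tmp/export.csv

Required external libraries: none

Note that for the local filesystem storage (only) you can omit the scheme and specify an absolute path such as /tmp/export.csv. This only works on Unix systems, though.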

FTP

The feeds are stored on an FTP server.

URI scheme: ftp

Example URI: ftp://user:pass@ftp.example.com/path/to/export.csv

Required external libraries: none

S3

The feeds are stored on Amazon S3.

URI scheme: s3

Example URIs:

s3://mybucket/path/to/export.csv

s3://aws_key:aws_secret@mybucket/path/to/export.csv

Required external libraries: boto

The AWS credentials can be passed as user/password in the URI, or through the following settings:

AWS_ACCESS_KEY_ID

AWS_SECRET_ACCESS_KEY

Standard output

The feeds are written to the standard output of the Scrapy process.

URI scheme: stdout

Example URI: stdout:

Required external libraries: none
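
Since the backend is chosen by the URI scheme, it can help to see which scheme each kind of URI yields. The sketch below uses the standard library's urlparse; storage_scheme is a hypothetical helper written for this example (not a Scrapy function), and the FTP host is illustrative. Falling back to file for a bare path mirrors the local filesystem note above.

```python
from urllib.parse import urlparse


def storage_scheme(uri):
    """Hypothetical helper: return the URI scheme, or 'file' for a bare path."""
    return urlparse(uri).scheme or 'file'


uris = [
    'file:///tmp/export.csv',
    '/tmp/export.csv',                        # absolute path, scheme omitted
    'ftp://ftp.example.com/path/to/export.csv',
    's3://mybucket/path/to/export.csv',
    'stdout:',
]

schemes = [storage_scheme(uri) for uri in uris]

for uri, scheme in zip(uris, schemes):
    print(scheme, uri)
```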

scrapy-2: some Scrapy components

http://www.up123.cc/17.html
