Simple usage of Scrapy and a basic scraping program

http://www.luyixian.cn/news_show_247283.aspx

Borrowed from the article above.

# -*- coding: utf-8 -*-
import scrapy
from demo.items import MaoyanreyingItem


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    # allowed_domains = ['maoyan.com']
    start_urls = ['http://www.yub2b.com/mall/']

    def parse(self, response):
        # Select every div inside the catalog body.
        dl = response.css('.icatalog_body div')
        for dd in dl:
            item = MaoyanreyingItem()
            # extract() returns a list of all matching text nodes.
            item['title'] = dd.css('li:nth-child(1) a strong::text').extract()
            print("Output is here")
            # Hand each item to Scrapy as soon as it is built.
            yield item

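Following the tutorial above, I wrote a simple crawler, demo.py, shown above. Clearly there is no need to write the usual request code by hand: Scrapy fetches each URL in start_urls on its own and passes the result to parse as response. The yield in the code above plays a role similar to return, except that inside a for loop it hands items over one at a time instead of building a full list first, which saves memory.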
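To make the memory point concrete, here is a minimal sketch in plain Python (no Scrapy involved). The function names titles_as_list and titles_as_generator are made up for illustration; they only mimic the shape of a parse method that builds titles in a loop.

```python
import sys


def titles_as_list(n):
    """Build every title first, then return them all at once."""
    result = []
    for i in range(n):
        result.append(f"title-{i}")
    return result


def titles_as_generator(n):
    """Yield one title at a time; no list of results is kept."""
    for i in range(n):
        yield f"title-{i}"


all_titles = titles_as_list(100_000)
lazy_titles = titles_as_generator(100_000)

# The list object holds references to all 100000 strings,
# while the generator object only holds its paused state.
list_size = sys.getsizeof(all_titles)
generator_size = sys.getsizeof(lazy_titles)
print("list is larger than generator:", list_size > generator_size)

# Values are produced only when the generator is consumed.
first_title = next(lazy_titles)
print(first_title)
```

Note that sys.getsizeof only measures the container object itself, not the strings it refers to, so the real difference in memory use is larger than this comparison suggests.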

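This is the item definition for my Scrapy project (items.py), which is what from demo.items import MaoyanreyingItem in demo.py refers to. Field() declares a field on the item; only fields declared this way can be assigned in the spider.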

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MaoyanreyingItem(scrapy.Item):
    title = scrapy.Field()

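Here we can hand the data back with either return or yield. The difference is that return sends everything back at once, while yield sends back each record as soon as it has been processed.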
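The difference in ordering can be seen in a small plain-Python sketch. The names parse_with_return, parse_with_yield and consume are hypothetical stand-ins (consume plays the role of whatever receives the items, which in a real project is Scrapy itself); an events list records what happens and when.

```python
events = []


def parse_with_return(rows):
    """Process every row, then return the whole batch."""
    items = []
    for row in rows:
        events.append(f"parsed {row}")
        items.append({"title": row})
    return items


def parse_with_yield(rows):
    """Process one row, hand it over, then continue with the next."""
    for row in rows:
        events.append(f"parsed {row}")
        yield {"title": row}


def consume(items):
    """Stand-in for the code that receives the items."""
    for item in items:
        events.append(f"received {item['title']}")


consume(parse_with_return(["a", "b"]))
return_order = list(events)

events.clear()
consume(parse_with_yield(["a", "b"]))
yield_order = list(events)

print(return_order)
print(yield_order)
```

With return, both rows are parsed before the first one is received; with yield, parsing and receiving alternate row by row.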

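Let's run the command again (author is the spider name used in the referenced tutorial; for the spider above it would be maoyan):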

scrapy crawl author --nolog

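The result is the same as before: the console prints the list of author names. That is because the data is only held in Items in memory and has not been saved anywhere.

Next, run the following command: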

scrapy crawl author --nolog -o authors.json -t json

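We find that a new file, authors.json, has appeared in the mySpider directory; open it from the project root to see the scraped data.

Looking at this command again, it has two more options than the previous one: -o and -t. The former means output to a file and is followed by the file name; the latter sets the output file type and is followed by the type. In this example we wrote a file named authors.json of type json. Note that newer Scrapy releases deprecate -t and infer the format from the file extension, so -o authors.json alone should be enough there.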
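As an alternative to passing -o on every run, the same export can be described in the project's settings.py through the FEEDS setting. This is a sketch that assumes Scrapy 2.1 or later, where FEEDS is available; the file name authors.json is taken from the command above, and the encoding entry is an optional addition, not something the original command sets.

```python
# settings.py (fragment)
# Each key is an output path; the value describes how to write it.
FEEDS = {
    "authors.json": {
        "format": "json",    # same role as the file type given on the command line
        "encoding": "utf8",  # optional: write non-ASCII text as-is
    },
}
```

With this in place, running scrapy crawl author writes authors.json without any extra command-line options.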

Scrapy supports four simple ways to save output:

JSON format:

scrapy crawl author -o authors.json

JSON lines format, Unicode-escaped by default:

scrapy crawl author -o authors.jsonl

CSV format:

scrapy crawl author -o authors.csv

XML format:

scrapy crawl author -o author.xml
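To illustrate what the JSON lines format and its Unicode escaping look like, here is a small sketch using only Python's standard json module. It does not call Scrapy's exporter; the helper to_json_lines and the sample items are made up for illustration.

```python
import json

# Sample items shaped like the ones a spider might yield.
items = [{"title": "热映"}, {"title": "Scrapy"}]


def to_json_lines(rows, ensure_ascii=True):
    """Serialize each row as one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=ensure_ascii) for row in rows)


# With ensure_ascii=True, non-ASCII characters become \uXXXX escapes.
escaped = to_json_lines(items)
# With ensure_ascii=False, the characters are written as-is.
readable = to_json_lines(items, ensure_ascii=False)

print(escaped)
print(readable)

# Either form loads back to the same data, line by line.
restored = [json.loads(line) for line in escaped.splitlines()]
```

Because every line is a complete JSON object, a .jsonl file can be read or appended to one record at a time, unlike a single JSON array.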

范之度
