[Scrapy framework persistent storage] --2019-08-08 20:40:10

Original: http://106.13.73.98/__/138/

Terminal-command-based persistent storage

Prerequisite: make sure the spider file's parse method returns an iterable of data (usually a list of dicts).
That return value can then be written to a file in the format specified on the command line, achieving persistent storage.


Execute the following command to persist the data:
scrapy crawl <spider_name> -o <filename>.<format>

The supported file formats are: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'.
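
For example (a minimal sketch, not the spider from this post; the spider name, URL and XPath below are placeholders), a parse method that returns a list of dicts can be exported directly with this command:

import scrapy

class DemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate terminal-command-based storage
    name = 'demo'
    start_urls = ['https://example.com/articles']  # placeholder start url

    def parse(self, response):
        # Return an iterable of dicts; the feed exporter writes them to the
        # file given with -o, e.g.:  scrapy crawl demo -o articles.json
        return [
            {'title': a.xpath('./text()').extract_first(),
             'url': a.xpath('./@href').extract_first()}
            for a in response.xpath('//a[@class="post-link"]')  # placeholder xpath
        ]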

Pipeline-based persistent storage


The Scrapy framework provides us with efficient and convenient persistence functionality that we can use directly.
Before using it, let's get to know these two files:

  1. items.py : the data structure template file, used to define the data fields (attributes).
  2. pipelines.py : the pipeline file, which receives the data (items) and performs the persistence operations.

--------------------------- ↓
persistence process:

  1. After the spider file crawls the data, encapsulate the data into an item object.
  2. Use the yield keyword to submit the item object to the pipeline for persistence.
  3. The process_item method of the pipeline class in the pipeline file receives the item object
    submitted by the spider, and the persistence code written there stores the item's data in the persistent store.


note:

  • In settings.py, enable the pipeline by configuring the ITEM_PIPELINES dictionary.
  • In items.py, the data structure template file, define the fields to be persisted.

---------------------------↑


Example

Below we crawl the titles and corresponding URLs of all the articles in your CSDN blog, and persist them.

Step 1: write the spider file blog.py :

import scrapy
from Test.items import TestItem


class BlogSpider(scrapy.Spider):
    name = 'blog'  # spider (application) name
    start_urls = ['https://blog.csdn.net/qq_41964425/article/list/1']  # starting url

    page = 1  # used to count page numbers
    max_page = 7  # how many pages of posts your CSDN blog has in total
    url = 'https://blog.csdn.net/qq_41964425/article/list/%d'  # url template used to locate a page by its number


    def parse(self, response):
        """
        Callback invoked after a start url has been visited and the result obtained.
        In general, this method is called once for each start url you listed.
        :param response: the response object obtained after sending a request to the start URL
        :return: the return value must be an iterable (usually list/dict) or None
        """

        # Locate the links (a tags) of all blog posts on the current page
        a_list = response.xpath('//div[@class="article-list"]/div/h4/a')  # type: list

        # Every blog shows this post on every page; open the page source and you will see it
        a_list.pop(0)  # 帝都的凛冬 https://blog.csdn.net/yoyo_liyy/article/details/82762601

        # Start parsing the blog data
        for a in a_list:
            # Prepare an item object
            item = TestItem()
            # Store the parsed data in the item object
            item['title'] = a.xpath('.//text()')[-1].extract().strip()  # get the article title
            item['url'] = a.xpath('.//@href').extract_first()  # get the article url
            # Finally, submit the item object to the pipeline for persistent storage
            yield item

        # Crawl the remaining pages
        if self.page < self.max_page:
            self.page += 1
            current_page = self.url % self.page  # the url of the next page to crawl
            # Send the request manually
            yield scrapy.Request(url=current_page, callback=self.parse)
            # callback specifies the callback function, i.e. the parsing method to use:
                # 1. If the page has the same structure as the start url, the same parsing method (self.parse) can be reused
                # 2. If the page content is different, a new parsing method must be written by hand, see below


    def dem(self, response):
        """
        2. If the page content is different, a new parsing method must be written by hand.
        Note: a parsing method must accept the response object.
        """
        pass
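
Before running the crawl, the XPath expressions can be sanity-checked in Scrapy's interactive shell (whether they still match the current CSDN page layout is not guaranteed):

scrapy shell "https://blog.csdn.net/qq_41964425/article/list/1"
>>> a_list = response.xpath('//div[@class="article-list"]/div/h4/a')
>>> len(a_list)                                          # number of post links found
>>> a_list[1].xpath('.//text()')[-1].extract().strip()   # a sample title
>>> a_list[1].xpath('.//@href').extract_first()          # a sample url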

Step 2: define the fields to be saved in the data structure template file items.py :

import scrapy

class TestItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()  # article title
    url = scrapy.Field()  # article url
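
A scrapy.Item instance is used like a dict, which is why the spider above can fill the fields by key; a quick illustration (run from the project's Python shell):

>>> from Test.items import TestItem
>>> item = TestItem()
>>> item['title'] = 'some title'
>>> item['url'] = 'https://example.com/post'
>>> dict(item)
{'title': 'some title', 'url': 'https://example.com/post'}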

Step 3: write the persistence logic in the pipeline file pipelines.py :

class TestPipeline(object):

    def process_item(self, item, spider):
        """Called once every time the spider file submits an item"""
        # Save the data to the file
        self.fp.write(item['title'] + '\t' + item['url'] + '\n')

        # Important:
        # This method must return the item object so it can be passed on to the same method
        # of the next (lower-priority) pipeline class.
        # The priorities are defined in the ITEM_PIPELINES dictionary in settings.py.
        return item

    # Override the parent-class method, used to open the file
    def open_spider(self, spider):
        """Executed when the spider starts running. Note: it is executed only once."""
        print('Crawl started')
        self.fp = open('text.txt', 'w', encoding='utf-8')

    # Override the parent-class method, used to close the file
    def close_spider(self, spider):
        """Executed when the spider finishes. Note: it is executed only once."""
        print('Crawl finished')
        self.fp.close()

Step 4: enable the following configuration items in settings.py :

# Fake the User-Agent of the request carrier
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Whether to obey the robots.txt protocol
ROBOTSTXT_OBEY = False

# Enable the pipeline
ITEM_PIPELINES = {
   'Test.pipelines.TestPipeline': 300,  # 300 is the priority; the smaller the value, the higher the priority, and higher-priority pipelines run first
   # A custom pipeline class is only called once it has been added to this dictionary
}

Step 5: run the command in a terminal: scrapy crawl blog . After it finishes successfully, open the output file and take a look:
(Screenshots omitted. The resulting text.txt contains one line per article: the title and the url separated by a tab.)

We can persist data not only to a file, but also to a database (MySQL and Redis, shown below).


--------------------------------------- ↓
persistent storage to MySQL

# pipelines.py file
import pymysql

# Custom pipeline class, used to write the data into MySQL
# Note:
# A custom pipeline class is only called after it has been registered in the
# ITEM_PIPELINES dictionary in settings.py,
# and the order in which the pipelines are called depends on the priority given at registration
class TestPipeLineByMySQL(object):

    # Override the parent-class method, used to open the MySQL connection and create a cursor
    def open_spider(self, spider):
        """Executed when the spider starts running. Note: it is executed only once."""
        self.conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='zyk',
            password='user@zyk',
            db='blog',  # the database to use
            charset='utf8',  # the database encoding
        )  # open the MySQL connection
        self.cursor = self.conn.cursor()  # create a cursor
        # In my own tests, sharing this one cursor here caused no problems

    def process_item(self, item, spider):
        sql = 'insert into info(title, url) values (%s, %s)'  # the sql statement to execute
        # Execute the transaction
        try:
            self.cursor.execute(sql, (item['title'], item['url']))  # insert
            self.conn.commit()  # commit
        except Exception as e:
            print(e)
            self.conn.rollback()  # roll back
        return item
        # Important:
        # This method must return the item object so it can be passed on to the same method
        # of the next (lower-priority) pipeline class.
        # The priorities are defined in the ITEM_PIPELINES dictionary in settings.py.

    # Override the parent-class method, used to close the MySQL connection
    def close_spider(self, spider):
        """Executed when the spider finishes. Note: it is executed only once."""
        self.cursor.close()
        self.conn.close()

After writing it, do not forget to add this pipeline class in the settings.py configuration file:

ITEM_PIPELINES = {
   'Test.pipelines.TestPipeline': 300,  # 300 is the priority; the smaller the value, the higher the priority, and higher-priority pipelines run first
   'Test.pipelines.TestPipeLineByMySQL': 400,    # a custom pipeline class is only called once added to this dictionary
}
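
The insert statement above assumes that the blog database already contains an info table with title and url columns. A minimal sketch to create it with pymysql (the column types and lengths are assumptions of mine, not from the original post):

import pymysql

# One-off helper to create the table the MySQL pipeline writes into (schema is assumed)
conn = pymysql.connect(host='localhost', port=3306, user='zyk',
                       password='user@zyk', db='blog', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS info (
            id INT AUTO_INCREMENT PRIMARY KEY,
            title VARCHAR(255),
            url VARCHAR(255)
        ) DEFAULT CHARSET=utf8
    """)
conn.commit()
conn.close()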


--------------------------------------- ↓
persistent storage to Redis

# pipelines.py file
import redis
import json

# Custom pipeline class, used to write the data into Redis
# Note:
# A custom pipeline class is only called after it has been registered in the
# ITEM_PIPELINES dictionary in settings.py,
# and the order in which the pipelines are called depends on the priority given at registration
class TestPipelineByRedis(object):

    # Override the parent-class method, used to create the Redis connection instance
    def open_spider(self, spider):
        """Executed when the spider starts running. Note: it is executed only once."""

        # Create the Redis connection instance
        self.conn = redis.Redis(host='localhost', port=6379, password='', decode_responses=True, db=15)
        # decode_responses=True: the values of the stored key-value pairs are str instead of bytes
        # db=15: use database 15


    def process_item(self, item, spider):
        dct = {
            'title': item['title'],
            'url': item['url']
        }

        # Persist the data to Redis
        self.conn.lpush('blog_info', json.dumps(dct, ensure_ascii=False))
        # ensure_ascii=False so that Chinese text is displayed as-is when the data is read back

        return item
        # Important:
        # This method must return the item object so it can be passed on to the same method
        # of the next (lower-priority) pipeline class.
        # The priorities are defined in the ITEM_PIPELINES dictionary in settings.py.

As above, add this pipeline class in the settings.py configuration file:

ITEM_PIPELINES = {
   'Test.pipelines.TestPipeline': 300,  # 300 is the priority; the smaller the value, the higher the priority, and higher-priority pipelines run first
   'Test.pipelines.TestPipeLineByMySQL': 400,  # a custom pipeline class is only called once added to this dictionary
   'Test.pipelines.TestPipelineByRedis': 500,  # a custom pipeline class is only called once added to this dictionary
}

To query the persisted data in Redis: lrange blog_info 0 -1
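
The same data can also be read back from Python; a minimal sketch reusing the connection parameters from the pipeline above:

import json
import redis

# Read every entry of the blog_info list and decode the JSON strings
conn = redis.Redis(host='localhost', port=6379, password='', decode_responses=True, db=15)
for raw in conn.lrange('blog_info', 0, -1):
    entry = json.loads(raw)
    print(entry['title'], entry['url'])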

