02. Scrapy framework persistent storage

1. Terminal-based persistent storage commands

  Make sure the parse method of the spider file returns an iterable (usually a list or a dictionary); the return value can then be written to a file in a specified format with a terminal command.

Command format for persistent storage: write the crawled data to files in different formats (see the sketch after the commands below):

  scrapy crawl <spider_name> -o xxx.json

  scrapy crawl <spider_name> -o xxx.xml

  scrapy crawl <spider_name> -o xxx.csv
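A minimal sketch of a parse method that works with the -o option (the spider name qiubai and the XPath expressions are illustrative assumptions, not part of the original project):

import scrapy

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'  # assumed spider name; substitute your own
    start_urls = ['http://www.qiushibaike.com/']

    def parse(self, response):
        data = []
        for div in response.xpath('//div[@id="content-left"]/div'):
            author = div.xpath('.//h2/text()').extract_first()
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            # every element is a plain dict, so `scrapy crawl qiubai -o qiubai.json` can serialize the result
            data.append({'author': author, 'content': content})
        return data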

2. Pipeline-based persistent storage

  The Scrapy framework already integrates efficient and convenient persistence functionality for us, so we can use it directly. To use Scrapy's persistence features, we first need to get to know the following two files:

  items.py: the data-structure template file, which defines the item's data fields.

  pipelines.py: the pipeline file, which receives the data (items) and performs the persistence operations.

Persistence process:

  1. The spider file crawls the data and packs it into item objects.

  2. Use the yield keyword to submit the item objects to the pipeline file (pipelines.py) for persistence.

  3. The process_item method in the pipeline file receives the item objects submitted by the spider and then runs the persistence code that stores the data carried by those items.

  4. Enable the pipeline in the settings.py configuration file.

A quick first try: crawl the author and joke content from Qiushibaike, then store the data persistently.

Spider file: qiubaidemo.py

import scrapy
from secondblood.items import SecondbloodItem

class QiubaidemoSpider(scrapy.Spider):
    name = 'qiubaiDemo'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    def parse(self, response):
        odiv = response.xpath('//div[@id="content-left"]/div')
        for div in odiv:
            # xpath returns a list of Selector objects; the parsed content is wrapped
            # inside each Selector, so call extract_first()/extract() to pull it out.
            author = div.xpath('.//div[@class="author clearfix"]//h2/text()').extract_first()
            author = author.strip('\n')  # strip surrounding blank lines
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            content = content.strip('\n')

            item = SecondbloodItem()
            item['author'] = author
            item['content'] = content

            yield item  # submit the item to the pipeline file (pipelines.py)

Items file: items.py

import scrapy


class SecondbloodItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()   # stores the author
    content = scrapy.Field()  # stores the joke content

Pipeline file: pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class SecondbloodPipeline(object):
    # constructor
    def __init__(self):
        self.fp = None  # attribute that will hold the file handle

    # The methods below override hooks provided by the pipeline interface.
    # Executed once, when the spider starts:
    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./data.txt', 'w')

    # process_item is called many times (once per item), so the file is opened and
    # closed in the other two methods, each of which runs only once.
    def process_item(self, item, spider):
        # persist the item submitted by the spider
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    # Executed once, when the spider finishes:
    def close_spider(self, spider):
        self.fp.close()
        print('Spider finished')

Configuration file: settings.py

# enable the pipeline
ITEM_PIPELINES = {
    'secondblood.pipelines.SecondbloodPipeline': 300,  # 300 is the priority; the smaller the value, the higher the priority
}
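With the pipeline enabled, running scrapy crawl qiubaiDemo from the project root executes the spider and should append one author:content line per joke to ./data.txt. Priority values are conventionally kept in the 0-1000 range; items pass through all enabled pipelines in order from the lowest value to the highest.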

2.1 MySQL-based pipeline storage

  In the quick example, the pipeline stored the item data to a disk file. If the item data should instead be written to a MySQL database, only the pipeline file from the example above needs to change, as follows:

  pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# import the database library
import pymysql


class QiubaiproPipelineByMysql(object):

    conn = None    # MySQL connection object
    cursor = None  # MySQL cursor object

    def open_spider(self, spider):
        print('Spider started')
        # connect to the database
        self.conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='123456', db='qiubai')

    # write the code that stores the data in the database
    def process_item(self, item, spider):
        # 1. get a cursor  2. execute the SQL statement inside a transaction
        sql = 'insert into qiubai values ("%s", "%s")' % (item['author'], item['content'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()

        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.cursor.close()
        self.conn.close()
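The pipeline above assumes that the qiubai database and table already exist. A minimal setup sketch follows (the table layout is inferred from the insert statement above, so treat the column types as assumptions); it also shows a parameterized insert, which lets pymysql escape the values so quotes inside the joke text do not break the SQL string:

import pymysql

# assumption: MySQL runs locally, the qiubai database already exists, and the credentials match the pipeline above
conn = pymysql.Connect(host='127.0.0.1', port=3306, user='root', password='123456', db='qiubai')
cursor = conn.cursor()

# two text columns, matching the insert in process_item
cursor.execute('create table if not exists qiubai (author varchar(100), content text)')

# parameterized form of the same insert: pymysql fills in and escapes the values
cursor.execute('insert into qiubai values (%s, %s)', ('some author', 'some content'))
conn.commit()

cursor.close()
conn.close()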

settings.py

ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipelineByMysql': 300,
}

2.2 Redis-based pipeline storage

  Likewise, if the item data should be written to a Redis database instead of a disk file, only the pipeline file from the example above needs to change, as follows:

  pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

import redis


class QiubaiproPipelineByRedis(object):
    conn = None

    def open_spider(self, spider):
        print('Spider started')
        # create the connection object
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # write to Redis; newer redis-py versions reject a raw dict, so serialize it to JSON first
        self.conn.lpush('data', json.dumps(dic))
        return item
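Because newer versions of redis-py no longer accept a dict as an lpush value, the pipeline above serializes each item with json.dumps. A small sketch of reading the stored data back (assuming the same local Redis instance and the 'data' list key used above):

import json
import redis

conn = redis.Redis(host='127.0.0.1', port=6379)

# lrange returns every entry the pipeline pushed onto the 'data' list
for raw in conn.lrange('data', 0, -1):
    record = json.loads(raw)  # each entry was stored as a JSON string
    print(record['author'], record['content'])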

settings.py

ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipelineByRedis': 300,
}

Interview question: if the crawled data needs to be stored in a disk file and a copy also stored in a database, how should this be done in Scrapy?

  pipelines.py

In the pipeline file, each pipeline class's process_item method implements one form of persistent storage:

# One pipeline class; its process_item method implements one form of persistent storage.
class DoublekillPipeline(object):

    def process_item(self, item, spider):
        # persistence code (form 1: write to a disk file)
        return item


# To implement a second form of persistence, simply define another pipeline class:
class DoublekillPipeline_db(object):

    def process_item(self, item, spider):
        # persistence code (form 2: write to a database)
        return item

Enable both pipelines in settings.py:

# ITEM_PIPELINES is a dictionary: the keys are the pipeline classes to enable and the values are their execution priorities.
ITEM_PIPELINES = {
    'doublekill.pipelines.DoublekillPipeline': 300,
    'doublekill.pipelines.DoublekillPipeline_db': 200,
}

# With the two keys above, the process_item method of both pipeline classes is executed for every item, giving two different forms of persistence.
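The return item at the end of each process_item is what passes the item on to the next pipeline in priority order (here DoublekillPipeline_db with priority 200 runs first, then DoublekillPipeline with 300), so both classes see every item. A fleshed-out sketch of the two pipelines, with the file path and the Redis details as illustrative assumptions:

import json

import redis


class DoublekillPipeline(object):
    # form 1: write each item to a disk file (./data.txt is an assumed path)

    def open_spider(self, spider):
        self.fp = open('./data.txt', 'w')

    def process_item(self, item, spider):
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item  # pass the item on so the other pipeline also receives it

    def close_spider(self, spider):
        self.fp.close()


class DoublekillPipeline_db(object):
    # form 2: write each item to a database (Redis is used here purely as an illustration)

    def open_spider(self, spider):
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        self.conn.lpush('data', json.dumps({'author': item['author'], 'content': item['content']}))
        return item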

Origin www.cnblogs.com/zhaoyang110/p/11525173.html