Scrapy Advanced Knowledge Summary (Part 4) - Item Pipeline

 

Item Pipeline

Item Pipeline is invoked after the Spider generates an Item. Once the Spider has finished parsing a Response, the resulting Item is passed to the Item Pipeline, where the pipeline components you have defined are called in sequence to carry out a series of processing steps, such as data cleaning and storage.

The main purposes of Item Pipeline are to:

  • Clean up HTML data.

  • Validate the crawled data and check that the expected fields were scraped.

  • Check for and discard duplicate items (a duplicate-filter sketch follows this list).

  • Save the crawled results to a database.
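As a concrete illustration of the duplicate-check use case, here is a minimal sketch of a duplicate-filtering pipeline, modeled on the example in the Scrapy documentation; the 'id' field it keys on is just an assumption about the Item's structure:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        # Remember the ids of items we have already seen
        self.ids_seen = set()

    def process_item(self, item, spider):
        # 'id' is an assumed field name; adapt it to your own Item definition
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item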

 

Pipeline class

Pipelines can be customized, but each pipeline class must implement the following method:

process_item(self, item, spider)

The process_item() method must be implemented; it is the method the Item Pipeline calls by default to process each Item. For example, this is where we can clean the data or write it to a database. It must either return an Item object or raise a DropItem exception to discard the item.

Parameters:

  • item is the Item object being processed.

  • spider is the Spider object that generated the Item.
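For example, a minimal sketch of a validating process_item() that drops incomplete items might look like the following; the 'price' field is hypothetical and used only for illustration:

from scrapy.exceptions import DropItem

class PriceValidationPipeline(object):

    def process_item(self, item, spider):
        # 'price' is a hypothetical field; replace it with a field from your own Item
        if item.get('price'):
            # Returning the item passes it on to the next pipeline component
            return item
        # Raising DropItem discards the item; later pipeline components never see it
        raise DropItem("Missing price in %s" % item)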

 

Besides process_item(), which must be implemented, a pipeline can also implement the following optional methods:

1.open_spider(spider)

Called when the Spider is opened, mainly to perform initialization work such as connecting to a database. The spider parameter is the Spider object that was opened.

2.close_spider(spider)

Called when the Spider is closed, mainly to perform clean-up work such as closing the database connection. The spider parameter is the Spider object that is being closed.

3.from_crawler(cls,crawler)

A class method, marked with the @classmethod decorator, that works in a dependency-injection style. Through its crawler argument we can access all of Scrapy's core components, such as the global settings, and then create a Pipeline instance. The cls parameter is the pipeline class itself, and the method finally returns an instance of that class.
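Putting the signatures together, a bare skeleton of a pipeline implementing all of these hooks might look like this (the method bodies are placeholders only):

class ExamplePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Read whatever is needed from crawler.settings, then build the instance
        return cls()

    def open_spider(self, spider):
        # Initialization work, e.g. opening files or database connections
        pass

    def close_spider(self, spider):
        # Clean-up work, e.g. closing files or database connections
        pass

    def process_item(self, item, spider):
        # Process the item here, then return it (or raise DropItem)
        return item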

 

Activating Item Pipeline Components

To activate an Item Pipeline component, its class must be added to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.Pipelineclass1': 300,
    'myproject.pipelines.Pipelineclass2': 800,
}

The integer value assigned to each class determines the order in which the pipelines run: items pass through the classes with lower values before those with higher values. By convention these numbers are defined in the 0-1000 range.
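If different spiders need different pipelines, Scrapy also allows ITEM_PIPELINES to be overridden per spider through the custom_settings class attribute. A minimal sketch (the spider name 'example' is made up; the pipeline path reuses the one from the setting above):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    # Only this spider will use this pipeline configuration
    custom_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.Pipelineclass1': 300,
        },
    }

    def parse(self, response):
        pass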

 

Examples

1. Writing to a file

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

2. Saving to MySQL

import pymysql

class MySQLPipeline(object):

    def __init__(self):
        # Connect to the database
        self.db = pymysql.connect(
            host='localhost',   # database IP address
            port=3306,          # database port
            db='dbname',        # database name
            user='root',        # database user name
            passwd='root',      # database password
            charset='utf8',     # encoding
        )
        # Use the cursor() method to get a cursor for executing statements
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Build the INSERT statement; the table must already exist in the database
        sql = "INSERT INTO EMPLOYEE (FIRST_NAME, LAST_NAME) VALUES ('%s', '%s')" % (item['f_name'], item['l_name'])
        try:
            # Execute the SQL statement
            self.cursor.execute(sql)
            # Commit the transaction
            self.db.commit()
        except:
            # Roll back on error
            self.db.rollback()
        # Return the item
        return item

    def close_spider(self, spider):
        self.db.close()
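Note that the string-formatted SQL above breaks on values containing quotes and is open to SQL injection. As a hedged alternative, pymysql's cursor.execute() also accepts a parameter tuple, so the process_item() method of the MySQLPipeline above could instead be written as:

    def process_item(self, item, spider):
        # Let the driver escape the values instead of formatting them into the SQL string
        sql = "INSERT INTO EMPLOYEE (FIRST_NAME, LAST_NAME) VALUES (%s, %s)"
        try:
            self.cursor.execute(sql, (item['f_name'], item['l_name']))
            self.db.commit()
        except Exception:
            self.db.rollback()
        return item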

3. Saving to MongoDB

import pymongo

class MongodbPipeline(object):

    def __init__(self):
        # Establish a connection to MongoDB
        self.client = pymongo.MongoClient('mongodb://localhost:27017/')
        # Select the database to use
        self.db = self.client['scrapy']
        # Select the collection (table)
        self.coll = self.db['collection_name']

    def process_item(self, item, spider):
        postItem = dict(item)  # convert the item into a dict
        self.coll.insert_one(postItem)  # insert one record into the collection
        return item

    def close_spider(self, spider):
        self.client.close()

4. from_crawler() example

class MongoPipeline(object):
    collection_name = 'xxx'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            # Get the configuration from the crawler settings
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        postItem = dict(item)  # convert the item into a dict
        # insert_one() is the current pymongo API (insert() was removed in pymongo 4.x)
        self.db[self.collection_name].insert_one(postItem)
        return item
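For MongoPipeline to be used, the settings it reads must exist in the project's settings.py. A minimal sketch with illustrative values (the MONGO_URI and MONGO_DATABASE keys match the ones read above; the URI, database name, and pipeline path are placeholders):

# settings.py (illustrative values)
MONGO_URI = 'mongodb://localhost:27017/'
MONGO_DATABASE = 'scrapy'

ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}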

 


Origin www.cnblogs.com/fengf233/p/11363620.html