Item Pipeline
An Item Pipeline is invoked after a Spider generates an Item. When the Spider has finished parsing a Response, the Item is passed to the Item Pipeline, and the pipeline components defined in the project are called in sequence, each performing part of a series of processing steps such as data cleaning and storage.
The main purposes of an Item Pipeline are:
- Cleaning up HTML data.
- Validating crawled data and checking that the required fields were crawled.
- Checking for and discarding duplicates.
- Saving the crawled results to a database.
Pipeline class
Pipelines can be customized, but every pipeline class must implement the following method:
process_item(self, item, spider)
process_item() must be implemented; it is the method the Item Pipeline calls by default to process each Item. For example, we can process the data here or write it to a database. It must either return a value of type Item or raise a DropItem exception.
Parameters:
- item: the Item object, i.e. the Item being processed.
- spider: the Spider object, i.e. the Spider that generated the Item.
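As a sketch of this return-or-drop contract, the minimal deduplication pipeline below discards any item whose 'id' field has been seen before. In a real project DropItem is imported from scrapy.exceptions; here a stand-in exception with the same name is defined so the snippet runs without Scrapy installed, and the 'id' field name is an assumption for the example.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so this sketch runs without Scrapy."""
    pass

class DuplicatesPipeline(object):
    """Drops any item whose 'id' field (hypothetical name) was already seen."""

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            # Raising DropItem stops any further pipeline processing of this item
            raise DropItem("Duplicate item found: %r" % item)
        self.ids_seen.add(item['id'])
        # Returning the item passes it on to the next pipeline component
        return item
```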
In addition to process_item(), which must be implemented, a pipeline can implement several other methods:
1. open_spider(spider)
Called when the Spider is opened, mainly to do initialization work such as connecting to a database. The spider parameter is the Spider being opened.
2. close_spider(spider)
Called when the Spider is closed, mainly to do finishing work such as closing database connections. The spider parameter is the Spider being closed.
3. from_crawler(cls, crawler)
A class method, marked with the @classmethod decorator, that works as a form of dependency injection. Its argument is a crawler; through the crawler object we can access all of Scrapy's core components, such as the global settings, and then create a Pipeline instance. The cls parameter is the pipeline class itself, and the method returns an instance of that class.
Activating Item Pipeline Components
To activate an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:
ITEM_PIPELINES = {
    'myproject.pipelines.Pipelineclass1': 300,
    'myproject.pipelines.Pipelineclass2': 800,
}
The integer value assigned to each class determines the order in which they run: items pass through the classes from lower values to higher values. By convention these numbers are defined in the range 0-1000.
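To illustrate the ordering rule, the sketch below runs two toy pipelines by hand in ascending priority order, the way Scrapy would; the class names, the priorities, and the 'name' field are made up for the example.

```python
class StripPipeline(object):
    """Priority 300: strips whitespace from a (hypothetical) 'name' field."""
    def process_item(self, item, spider):
        item['name'] = item['name'].strip()
        return item

class TagPipeline(object):
    """Priority 800: runs after StripPipeline, so it sees already-cleaned data."""
    def process_item(self, item, spider):
        item['cleaned'] = True
        return item

# Scrapy calls the components from the lowest priority value to the highest;
# each component receives the item returned by the previous one.
pipelines = {StripPipeline(): 300, TagPipeline(): 800}
item = {'name': '  Alice  '}
for pipeline, _priority in sorted(pipelines.items(), key=lambda kv: kv[1]):
    item = pipeline.process_item(item, spider=None)
```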
Examples
1. Writing to a file
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
2. Storing in MySQL
import pymysql

class MySQLPipeline(object):
    def __init__(self):
        # Connect to the database
        self.db = pymysql.connect(
            host='localhost',  # database IP address
            port=3306,         # database port
            db='dbname',       # database name
            user='root',       # database user name
            passwd='root',     # database password
            charset='utf8',    # encoding
        )
        # Use the cursor() method to get an operating cursor
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        # Build the INSERT SQL statement; the table must already exist in the database
        sql = "INSERT INTO EMPLOYEE (FIRST_NAME, LAST_NAME) VALUES ('%s', '%s')" % (
            item['f_name'], item['l_name'])
        try:
            # Execute the SQL statement
            self.cursor.execute(sql)
            # Commit it
            self.db.commit()
        except:
            # Roll back when an error occurs
            self.db.rollback()
        # Return the item
        return item

    def close_spider(self, spider):
        self.db.close()
3. Storing in MongoDB
import pymongo

class MongodbPipeline(object):
    def __init__(self):
        # Establish the MongoDB database connection
        self.client = pymongo.MongoClient('mongodb://localhost:27017/')
        # Connect to the required database
        self.db = self.client['scrapy']
        # Connect to the collection (table)
        self.coll = self.db['collection_name']

    def process_item(self, item, spider):
        postItem = dict(item)  # convert the item to a dict
        self.coll.insert(postItem)  # insert a record into the database
        return item

    def close_spider(self, spider):
        self.client.close()
4. from_crawler() example
import pymongo

class MongoPipeline(object):
    collection_name = 'xxx'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            # Get the configuration from the crawler settings
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        postItem = dict(item)
        self.db[self.collection_name].insert(postItem)
        return item