Python Crawler 5.2 - the pipeline module of the Scrapy framework
Overview
This series of articles is a simple tutorial written while learning Python crawling, both to explain the techniques and to consolidate my own knowledge. If it happens to be useful to you as well, so much the better.
The Python version used is 3.7.4.
In the previous article we introduced the Scrapy framework and used it to implement a crawler project. In this one we explain Scrapy's pipeline in detail. (The previous article briefly covered optimizing a pipeline to store data; here we focus on how pipelines should handle the received data when a project contains more than one spider.)
Core pipeline methods
We can write a custom pipeline; it only needs to implement certain methods. The one method that must be implemented is process_item(item, spider).
In addition, there are several other practical methods:
- open_spider(spider)
- close_spider(spider)
- from_crawler(cls, crawler)
process_item(item, spider)
process_item() must be implemented; it is the method an enabled Item Pipeline calls to process each Item. Here we can, for example, transform the data or write it to a database. It must return an object of Item (or dict) type, or raise a DropItem exception.
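As an illustration, process_item can filter out unwanted items by raising DropItem. The sketch below is framework-free: it defines a stand-in DropItem exception (in a real project you would use from scrapy.exceptions import DropItem) and a hypothetical pipeline that drops negative values:

```python
# Framework-free sketch: in a real project, DropItem comes from
# scrapy.exceptions; the stand-in below only mirrors its role.
class DropItem(Exception):
    pass


class FilterPipeline(object):
    def process_item(self, item, spider):
        # Drop items with negative data; return the rest unchanged.
        if item['data'] < 0:
            raise DropItem('negative value: %d' % item['data'])
        return item


if __name__ == '__main__':
    pipeline = FilterPipeline()
    kept = []
    for value in [3, -1, 7]:
        try:
            kept.append(pipeline.process_item({'data': value}, spider=None))
        except DropItem:
            pass
    print(kept)  # [{'data': 3}, {'data': 7}]
```

When Scrapy catches a DropItem, the item is simply not passed to any later pipeline.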
open_spider(spider)
open_spider() is called automatically when the Spider is opened. Here we can do some initialization, such as opening a database connection. The spider parameter is the Spider object being opened.
close_spider(spider)
close_spider() is called automatically when the Spider is closed. Here we can do some cleanup, such as closing a database connection. The spider parameter is the Spider object being closed.
from_crawler(cls, crawler)
from_crawler() is a class method, marked with @classmethod, and is a form of dependency injection. Through its crawler argument we can access all of Scrapy's core components, such as the global settings, and use them to create a Pipeline instance. The cls parameter is the Pipeline class itself, and the method finally returns an instance of that class.
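To make the lifecycle concrete, here is a framework-free sketch of a pipeline implementing all four methods. The FakeCrawler stub, the PIPELINE_TAG setting name, and the driver loop at the bottom are assumptions standing in for what Scrapy does internally (Scrapy itself calls from_crawler, open_spider, process_item, and close_spider for you):

```python
class CollectPipeline(object):
    """Collects items in memory; a list stands in for a database."""

    def __init__(self, tag):
        self.tag = tag
        self.storage = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read a (hypothetical) setting from the crawler and build the instance.
        return cls(tag=crawler.settings.get('PIPELINE_TAG', 'default'))

    def open_spider(self, spider):
        # Initialization work, e.g. opening a database connection.
        self.storage = []

    def process_item(self, item, spider):
        self.storage.append(item)
        return item

    def close_spider(self, spider):
        # Cleanup work, e.g. closing the database connection.
        print('%s collected %d items' % (self.tag, len(self.storage)))


# --- stub standing in for Scrapy internals (assumption) ---
class FakeCrawler(object):
    settings = {'PIPELINE_TAG': 'demo'}


if __name__ == '__main__':
    pipeline = CollectPipeline.from_crawler(FakeCrawler())
    pipeline.open_spider(spider=None)
    for i in range(3):
        pipeline.process_item({'data': i}, spider=None)
    pipeline.close_spider(spider=None)  # prints: demo collected 3 items
```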
Using pipelines
As the dictionary form of the ITEM_PIPELINES setting suggests, there can be more than one pipeline, and indeed we can define multiple pipelines.
Why we may need multiple pipelines:
- Items from one spider may need different handling, for example being stored in different databases
- There may be multiple spiders, with different pipelines processing each spider's items
Note:
- A pipeline must be enabled in settings.py before it takes effect
- The smaller a pipeline's priority value, the earlier it runs
- The process_item method in a pipeline cannot be renamed
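For example, with the (hypothetical) configuration below, MyspiderPipeline runs before MyspiderPipeline1, because 300 < 301; sorting the dictionary by value gives the execution order:

```python
# Hypothetical ITEM_PIPELINES setting, as it would appear in settings.py.
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
    'mySpider.pipelines.MyspiderPipeline1': 301,
}

# Lower numbers run first: sort by priority value to see the execution order.
order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order[0])  # mySpider.pipelines.MyspiderPipeline runs first
```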
A single spider yielding multiple types of item
We create a spider with the following command:
scrapy genspider qsbk_spider qiushibaike.com
Set up the required configuration first (not repeated here), then write the following code in qsbk_spider.py:
import scrapy


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        item = {}
        # handle odd and even numbers with different parameters
        for i in range(0, 50):
            if (i % 2) == 0:
                # even number
                item['come_from'] = 'oushu'
                item['data'] = i
            else:
                # odd number
                item['come_from'] = 'jishu'
                item['data'] = i
            yield item
Then write the following code in pipelines.py:
class MyspiderPipeline(object):
    def process_item(self, item, spider):
        # check which kind of item was passed in, then branch accordingly
        if item['come_from'] == 'jishu':
            # code..
            print('%d is odd' % item['data'])
        else:
            # code ..
            print('%d is even' % item['data'])
        return item
Run it to see the effect. Alternatively, we can define multiple pipeline classes, as follows:
class MyspiderPipeline(object):
    def process_item(self, item, spider):
        # check which kind of item was passed in, then branch accordingly
        if item['come_from'] == 'jishu':
            # code..
            print('%d is odd' % item['data'])
        return item


class MyspiderPipeline1(object):
    def process_item(self, item, spider):
        # check which kind of item was passed in, then branch accordingly
        if item['come_from'] == 'oushu':
            # code..
            print('%d is even' % item['data'])
        return item
The configuration in settings.py is then:
ITEM_PIPELINES = {
    'mySpider.pipelines.MyspiderPipeline': 300,
    'mySpider.pipelines.MyspiderPipeline1': 301,
}
Run it and you will see the same result.
The multi-spider case
With multiple spiders we could use the same approach as above: add an identifier to each item's data and branch on that identifier. Besides this, there are other ways to tell the data apart. The specific methods are as follows:
I created three spiders using three commands:
# crawl the "text" section of Qiushibaike
scrapy genspider qsbk1_spider qiushibaike.com
# crawl the "pic" section of Qiushibaike
scrapy genspider qsbk2_spider qiushibaike.com
# crawl the "history" section of Qiushibaike
scrapy genspider qsbk3_spider qiushibaike.com
After running the three commands, you will find three new files in the spiders folder: qsbk1_spider.py, qsbk2_spider.py, and qsbk3_spider.py. These are our three new crawler module files (the project directory structure is not shown here). You can use the scrapy list command to view the spiders created in the project.
Write the following code in the three files respectively:
- qsbk1_spider.py

import scrapy


class Qsbk1SpiderSpider(scrapy.Spider):
    name = 'qsbk1_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        item = {}
        # data returned by qsbk1_spider
        for i in range(0, 10):
            item['come_from'] = 'qsbk1'
            item['data'] = i
            yield item
- qsbk2_spider.py

import scrapy


class Qsbk2SpiderSpider(scrapy.Spider):
    name = 'qsbk2_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/pic/']

    def parse(self, response):
        item = {}
        # data returned by qsbk2_spider
        for i in range(10, 20):
            item['come_from'] = 'qsbk2'
            item['data'] = i
            yield item
- qsbk3_spider.py

import scrapy


class Qsbk3SpiderSpider(scrapy.Spider):
    name = 'qsbk3_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/history/']

    def parse(self, response):
        item = {}
        # data returned by qsbk3_spider
        for i in range(20, 30):
            item['come_from'] = 'qsbk3'
            item['data'] = i
            yield item
Finally, the data crawled by all three spiders is sent to the pipeline, where we need to determine which spider each piece of data came from. The code in pipelines.py is as follows:
class MyspiderPipeline(object):
    def process_item(self, item, spider):
        # here we can branch on the spider
        if spider.name == 'qsbk1_spider':
            print("data from qsbk1:", item)
        elif spider.name == 'qsbk2_spider':
            print("data from qsbk2:", item)
        elif spider.name == 'qsbk3_spider':
            print("data from qsbk3:", item)
        else:
            print('unknown data')
        return item
Run the spiders and you will see the corresponding output.
Distinguishing data with multiple item classes
Write the following code in items.py:
import scrapy


class Qsbk1Item(scrapy.Item):
    """
    Items class for the qsbk1 spider
    """
    num = scrapy.Field()


class Qsbk2Item(scrapy.Item):
    """
    Items class for the qsbk2 spider
    """
    num = scrapy.Field()


class Qsbk3Item(scrapy.Item):
    """
    Items class for the qsbk3 spider
    """
    num = scrapy.Field()
Write the following code in qsbk1_spider.py (the other two spiders are similar):
import scrapy
# import the corresponding items class
from mySpider.items import Qsbk1Item


class Qsbk1SpiderSpider(scrapy.Spider):
    name = 'qsbk1_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        for i in range(0, 10):
            item = Qsbk1Item(num=i)
            yield item
Write the following code in pipelines.py:
# import the corresponding items classes
from mySpider.items import Qsbk1Item
from mySpider.items import Qsbk2Item
from mySpider.items import Qsbk3Item


class MyspiderPipeline(object):
    def process_item(self, item, spider):
        # here we can branch on the items class
        if isinstance(item, Qsbk1Item):
            print("data from qsbk1:", item)
        elif isinstance(item, Qsbk2Item):
            print("data from qsbk2:", item)
        elif isinstance(item, Qsbk3Item):
            print("data from qsbk3:", item)
        else:
            print('unknown data')
        return item
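The isinstance chain above can also be written as a lookup table mapping item classes to handler functions, which avoids a long if/elif chain as the number of item types grows. In this framework-free sketch, plain dict subclasses stand in for the scrapy.Item subclasses:

```python
# Plain dict subclasses stand in for the scrapy.Item subclasses (assumption).
class Qsbk1Item(dict):
    pass

class Qsbk2Item(dict):
    pass


# Map each item class to the handler for its data.
HANDLERS = {
    Qsbk1Item: lambda item: 'data from qsbk1',
    Qsbk2Item: lambda item: 'data from qsbk2',
}


def process_item(item):
    # Dispatch on the item's exact class; fall back for unknown types.
    handler = HANDLERS.get(type(item))
    return handler(item) if handler else 'unknown data'


if __name__ == '__main__':
    print(process_item(Qsbk1Item(num=1)))  # data from qsbk1
    print(process_item({'num': 2}))        # unknown data
```

Note that type() matches the exact class only; if item-class inheritance matters, the isinstance chain remains the simpler choice.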
Run the spiders to see the effect.
Links to other posts
- Python Crawler 1.1 - urllib basic usage tutorial
- Python Crawler 1.2 - urllib advanced usage tutorial
- Python Crawler 1.3 - requests basic usage tutorial
- Python Crawler 1.4 - requests advanced usage tutorial
- Python Crawler 2.1 - BeautifulSoup usage tutorial
- Python Crawler 2.2 - xpath usage tutorial
- Python Crawler 3.1 - json usage tutorial
- Python Crawler 3.2 - csv usage tutorial
- Python Crawler 3.3 - txt usage tutorial
- Python Crawler 4.1 - threading (multithreading) usage tutorial
- Python Crawler 4.2 - ajax (dynamic page crawling) usage tutorial
- Python Crawler 4.3 - selenium basic usage tutorial
- Python Crawler 4.4 - selenium advanced usage tutorial
- Python Crawler 4.5 - tesseract (image captcha recognition) usage tutorial
- Python Crawler 5.1 - a simple introduction to the Scrapy framework