Python Crawler (2) Items and Pipelines
We can write items.py as follows:
import scrapy

class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

class AuthorItem(scrapy.Item):
    name = scrapy.Field()
    desc = scrapy.Field()
    birth = scrapy.Field()
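A spider callback would then populate these items field by field. Here is a minimal sketch with a plain dict standing in for QuoteItem (Scrapy items support the same item['field'] access); the selector expressions in the comments are illustrative only:

```python
# Sketch: what a parse() callback would build for each quote block.
# A plain dict stands in for QuoteItem; the CSS selectors in the
# comments are hypothetical examples, not taken from the article.
def build_quote_item(text, author, tags):
    item = {}                 # QuoteItem() in a real spider
    item['text'] = text       # e.g. quote.css('span.text::text').get()
    item['author'] = author   # e.g. quote.css('small.author::text').get()
    item['tags'] = tags       # e.g. quote.css('div.tags a.tag::text').getall()
    return item

quote = build_quote_item('"The truth."', 'Mark Twain', ['truth', 'books'])
```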
We can write pipelines.py as follows:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.names = set()

    def process_item(self, item, spider):
        name = item['name'] + ' - Unique'
        if name in self.names:
            raise DropItem("Duplicate item found: %s" % item['name'])
        else:
            self.names.add(name)
            item['name'] = name
            return item
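The dedup logic above can be exercised outside Scrapy. In this sketch a plain exception stands in for scrapy.exceptions.DropItem, plain dicts stand in for items, and the class/sample names are made up:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""
    pass

class DuplicatesFilter:
    """Mirrors the pipeline above: tag each name, drop repeats."""
    def __init__(self):
        self.names = set()

    def process_item(self, item):
        name = item['name'] + ' - Unique'
        if name in self.names:
            raise DropItem("Duplicate item found: %s" % item['name'])
        self.names.add(name)
        item['name'] = name
        return item

f = DuplicatesFilter()
first = f.process_item({'name': 'Mark Twain'})   # passes through, renamed
try:
    f.process_item({'name': 'Mark Twain'})       # second copy is dropped
    dropped = False
except DropItem:
    dropped = True
```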
We can also add multiple pipelines:
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection_name = item.__class__.__name__
        # insert_one() replaces pymongo's deprecated insert()
        self.db[collection_name].insert_one(dict(item))
        return item
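The from_crawler hook above reads its connection details from the project settings, so settings.py needs the matching keys (the URI below is a placeholder):

```python
# settings.py — keys read by MongoPipeline.from_crawler()
MONGO_URI = 'mongodb://localhost:27017'   # placeholder connection URI
MONGO_DATABASE = 'items'                  # optional; defaults to 'items' if absent
```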
Pipelines are activated by registering them in ITEM_PIPELINES:
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
Or, in our project's settings.py:
ITEM_PIPELINES = {
    'tutorial.pipelines.DuplicatesPipeline': 300,
}
Pipelines run in ascending order of these numbers: lower values execute first, and each pipeline receives the item returned by the previous one.
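The ordering can be sketched in plain Python: items pass through every process_item in ascending priority, each stage seeing the previous stage's output (the class names and priorities here are made up for illustration):

```python
# Sketch of how Scrapy chains pipelines by priority (low number first).
class AddPrice:
    priority = 300
    def process_item(self, item):
        item['price'] = 9.99
        return item

class TagPriced:
    priority = 800
    def process_item(self, item):
        item['tagged'] = 'price' in item   # sees AddPrice's output
        return item

pipelines = sorted([TagPriced(), AddPrice()], key=lambda p: p.priority)
item = {'name': 'book'}
for p in pipelines:
    item = p.process_item(item)
```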
Big sample Project
https://github.com/gnemoug/distribute_crawler
Deployment
https://scrapyd.readthedocs.io/en/latest/install.html
https://github.com/istresearch/scrapy-cluster
Install scrapyd
>pip install scrapyd
Talk directly to the server side through its HTTP API
https://scrapyd.readthedocs.io/en/latest/api.html
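For example, once the server is running, the API can be called directly (assuming a scrapyd server on localhost:6800 and the "tutorial" project deployed below; the spider name is a placeholder):

```shell
# Check the daemon status
curl http://localhost:6800/daemonstatus.json
# Schedule a spider run (project and spider names are examples)
curl http://localhost:6800/schedule.json -d project=tutorial -d spider=quotes
```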
Clients
https://github.com/scrapy/scrapyd-client
Deploy
https://github.com/scrapy/scrapyd-client#scrapyd-deploy
Install the client
>pip install scrapyd-client
Start the Server
>scrapyd
Visit the console
http://localhost:6800/
Deploy my simple project
>scrapyd-deploy
Packing version 1504042554
Deploying to project "tutorial" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "tutorial", "version": "1504042554", "spiders": 2, "node_name": "ip-10-10-21-215.ec2.internal"}
List Target
>scrapyd-deploy -l
default http://localhost:6800/
Possible Cluster Solution in the future
https://github.com/istresearch/scrapy-cluster
Try with Python 3.6 later
References:
https://github.com/gnemoug/distribute_crawler
https://www.douban.com/group/topic/38361104/
http://wiki.jikexueyuan.com/project/scrapy/item-pipeline.html
https://segmentfault.com/a/1190000009229896