Python Crawler (2): Items and Pipelines

We can define items.py as follows:
import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

class AuthorItem(scrapy.Item):
    name = scrapy.Field()
    desc = scrapy.Field()
    birth = scrapy.Field()
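
These items can be filled in and yielded from a spider. A minimal sketch, assuming the quotes.toscrape.com spider from the Scrapy tutorial (the spider name and CSS selectors here are illustrative):
import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Build one QuoteItem per quote block on the page
        for quote in response.css('div.quote'):
            item = QuoteItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item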

We can define pipelines.py as follows:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.names = set()

    def process_item(self, item, spider):
        name = item['name'] + ' - Unique'
        if name in self.names:
            raise DropItem("Duplicate item found: %s" % item['name'])
        else:
            self.names.add(name)
            item['name'] = name
            return item
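
Note that this pipeline assumes every item carries a 'name' field, which is true for AuthorItem but not for QuoteItem. A sketch of a guarded variant (the class name is illustrative, and it reuses the DropItem import above) that lets items without a name pass through untouched:
class NameAwareDuplicatesPipeline(object):

    def __init__(self):
        self.names = set()

    def process_item(self, item, spider):
        # QuoteItem has no 'name' field, so let it pass through untouched
        if 'name' not in item:
            return item
        if item['name'] in self.names:
            raise DropItem("Duplicate item found: %s" % item['name'])
        self.names.add(item['name'])
        return item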

We can also define multiple pipelines. For example, a pipeline that stores items in MongoDB (it requires pymongo):
import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection_name = item.__class__.__name__
        # insert_one() replaces the deprecated collection.insert()
        self.db[collection_name].insert_one(dict(item))
        return item
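
MongoPipeline reads its connection details from the crawler settings via from_crawler(). Assuming a local MongoDB instance, settings.py could contain something like:
# settings.py -- consumed by MongoPipeline.from_crawler()
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'tutorial'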

Pipelines are enabled by registering them in settings.py under ITEM_PIPELINES, for example:
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

Or, for this tutorial project:
ITEM_PIPELINES = {
    'tutorial.pipelines.DuplicatesPipeline': 300,
}

Pipelines run in ascending order of their numbers: items pass through the lower-numbered pipelines first.
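
The JsonWriterPipeline referenced above (priority 800) is not defined in this post; a minimal sketch along the lines of the Scrapy docs example, writing one JSON object per line:
import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Runs after the lower-numbered pipelines have already processed the item
        self.file.write(json.dumps(dict(item)) + "\n")
        return item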

A big sample project
https://github.com/gnemoug/distribute_crawler

Deployment
https://scrapyd.readthedocs.io/en/latest/install.html
https://github.com/istresearch/scrapy-cluster

Install scrapyd
>pip install scrapyd

Talk directly to the server-side API
https://scrapyd.readthedocs.io/en/latest/api.html
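
For example, the JSON API can be queried directly over HTTP. A sketch using the requests library, assuming scrapyd is running on localhost:6800:
import requests

# Check that the daemon is up and list the projects it knows about
print(requests.get('http://localhost:6800/daemonstatus.json').json())
print(requests.get('http://localhost:6800/listprojects.json').json())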

Clients
https://github.com/scrapy/scrapyd-client

Deploy
https://github.com/scrapy/scrapyd-client#scrapyd-deploy

Install the client
>pip install scrapyd-client
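
scrapyd-deploy reads its deploy target from the project's scrapy.cfg. A minimal example, assuming the tutorial project and a local scrapyd instance:
# scrapy.cfg
[settings]
default = tutorial.settings

[deploy]
url = http://localhost:6800/
project = tutorial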

Start the Server
>scrapyd

Visit the console
http://localhost:6800/

Deploy the tutorial project
>scrapyd-deploy
Packing version 1504042554
Deploying to project "tutorial" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "tutorial", "version": "1504042554", "spiders": 2, "node_name": "ip-10-10-21-215.ec2.internal"}

List deploy targets
>scrapyd-deploy -l
default              http://localhost:6800/

A possible cluster solution for the future
https://github.com/istresearch/scrapy-cluster

Try with Python 3.6 later

References:
https://github.com/gnemoug/distribute_crawler
https://www.douban.com/group/topic/38361104/
http://wiki.jikexueyuan.com/project/scrapy/item-pipeline.html
https://segmentfault.com/a/1190000009229896
