Scrapy supplement - asynchronous crawlers

Spiders

Description: these are the crawler .py files created inside the project.

# 1. Spiders are a collection of classes that define how a certain site (or a group of URLs) will be crawled, including how to perform the crawling tasks and how to extract structured data from the pages.

# 2. In other words, Spiders are where you customize the crawling and parsing behavior for a specific URL or set of URLs.

 

Spiders cycle through the following steps:

# 1. Generate the initial Requests for the first URLs to crawl, and specify a callback function.
The first requests are defined in the start_requests() method, which by default generates a Request for each url in the start_urls list, with the parse method as the default callback. The callback function is triggered automatically once the download finishes and the response comes back.

# 2. In the callback function, parse the response and return a value.
The returned value can be one of four kinds:
        a dict containing the parsed data
        an Item object
        a new Request object (new Requests must also specify a callback function)
        or an iterable containing Items and/or Requests

# 3. Parse the page content inside the callback function.
Scrapy's own Selectors are normally used, but of course you can also use BeautifulSoup, lxml, or whatever other tool you prefer.

# 4. Finally, the returned Item objects are persisted to a database
through Item Pipeline components (https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline)
or exported to a file through Feed Exports (https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports).
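
To tie these four steps together, here is a minimal sketch borrowing the quotes.toscrape.com example used in Scrapy's own tutorial (the spider name and selectors are illustrative, not from the original notes):

```
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    # Step 1: start_requests() builds the initial Requests from start_urls;
    # this override shows what the default behaviour looks like.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    # Steps 2 and 3: the callback parses the response and may yield dicts,
    # Items, or new Requests (each new Request gets its own callback).
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```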

 

There are five kinds of Spiders in total:

#1. scrapy.spiders.Spider        # scrapy.Spider is equivalent to scrapy.spiders.Spider
#2. scrapy.spiders.CrawlSpider
#3. scrapy.spiders.XMLFeedSpider
#4. scrapy.spiders.CSVFeedSpider
#5. scrapy.spiders.SitemapSpider

 

Imports and usage

```
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Spider, CrawlSpider, XMLFeedSpider, CSVFeedSpider, SitemapSpider

class AmazonSpider(scrapy.Spider):   # custom class that inherits from the Spider base class
    name = 'amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['http://www.amazon.cn/']

    def parse(self, response):
        pass
```
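
The import above also pulls in the other spider base classes. As one example of how they differ, a CrawlSpider declares Rule objects with LinkExtractors so that links are followed automatically; the sketch below is illustrative (the domain and URL patterns are assumptions):

```
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyCrawlSpider(CrawlSpider):
    name = 'my_crawl'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # Each Rule says which links to extract and what to do with the pages:
    # follow=True keeps crawling deeper, callback='parse_item' parses matches.
    rules = (
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        Rule(LinkExtractor(allow=r'/item/\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```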

Storing data in MongoDB

How the pipelines are used:

  Configure the pipelines in settings.py, define the fields to collect in items.py (similar to Django models), connect to the database in pipelines.py, and write the crawling code itself in cnblogs.py.

1. Write a class in items.py that defines the fields to collect, similar to Django models.

import scrapy

# class MyscrapyItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass
class ArticleItem(scrapy.Item):
    article_name = scrapy.Field()
    article_url = scrapy.Field()
    auther_name = scrapy.Field()
    commit_count = scrapy.Field()
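
In the spider (cnblogs.py in the workflow above) the item is instantiated, filled in, and yielded so the pipelines receive it. A rough sketch; the package name myscrapy and the CSS selectors are assumptions, not taken from the original post:

```
import scrapy
from myscrapy.items import ArticleItem   # assumes the project package is named myscrapy

class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    start_urls = ['https://www.cnblogs.com/']

    def parse(self, response):
        # the selectors below are illustrative; adjust them to the real page structure
        for post in response.css('article.post-item'):
            item = ArticleItem()
            item['article_name'] = post.css('a.post-item-title::text').get()
            item['article_url'] = post.css('a.post-item-title::attr(href)').get()
            item['auther_name'] = post.css('a.post-item-author span::text').get()
            item['commit_count'] = post.css('a[href*="comment"] span::text').get()
            yield item
```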

2. In settings.py, register each pipeline and set its priority.
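
For example, with the two pipeline classes defined below this would look roughly like the following (the package name myscrapy is an assumption; a lower number means the pipeline runs earlier):

```
# settings.py
ITEM_PIPELINES = {
    'myscrapy.pipelines.ArticleMongodbPipeline': 300,   # runs first
    'myscrapy.pipelines.ArticleFilePipeline': 400,      # runs second
}
```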


3. pipelines.py

Connect to the database and implement the pipeline methods:

# - ArticleMongodbPipeline
#     - __init__
#     - from_crawler
#     - open_spider
#     - close_spider
#     - process_item
#         - write the item to storage here
#         - the return value matters: if you return the item, the next pipeline
#           keeps receiving it; if you return None, the next pipeline gets None
#           instead (see the DropItem note after the code below)
# - ArticleFilePipeline

 

Code:

# -*- coding: utf-8 -*-

from pymongo import MongoClient


class ArticleMongodbPipeline(object):
    def process_item(self, item, spider):

        # client = MongoClient('mongodb://localhost:27017/')
        # connect to MongoDB
        client = MongoClient('localhost', 27017)
        # use the "article" database; it is created on first use if it does not exist
        db = client['article']
        # use the "articleinfo" collection; it is also created on first use if it does not exist
        article_info = db['articleinfo']

        article_info.insert_one(dict(item))
        # article_info.save({'article_name': item['article_name'], 'article_url': item['article_url']})
        # if this method returned None, the next pipeline would receive None instead of the item
        return item


class ArticleFilePipeline(object):
    # def __init__(self, host, port):
    def __init__(self):
        # self.mongo_conn = MongoClient(host, port)
        pass

    # If from_crawler exists, it is called first to build the pipeline object:
    #     obj = ArticleFilePipeline.from_crawler(crawler)
    # otherwise the object is built directly: obj = ArticleFilePipeline()
    @classmethod
    def from_crawler(cls, crawler):
        print('has from_crawler')
        # the MongoDB configuration would be read from the settings here
        # host = 'taken from the settings file'
        # port = 'taken from the settings file'
        # crawler.settings is the merged project configuration
        print(crawler.settings['AA'])
        # return cls(host, port)
        return cls()

    def open_spider(self, spider):
        # print('----', spider.custom_settings['AA'])
        self.f = open('article.txt', 'a')

    def close_spider(self, spider):
        self.f.close()
        print('close spider')

    def process_item(self, item, spider):
        self.f.write(str(dict(item)) + '\n')   # write the item to the file opened in open_spider
        return item
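
As noted in the outline above, the return value of process_item decides what the next pipeline sees. To stop an item from reaching any later pipeline on purpose, raise DropItem instead of returning None. A small illustrative sketch (not part of the original project):

```
from scrapy.exceptions import DropItem

class FilterEmptyTitlePipeline(object):
    def process_item(self, item, spider):
        if not item.get('article_name'):
            # dropped items are not passed to any later pipeline
            raise DropItem('missing article_name')
        # returning the item lets the next pipeline keep processing it
        return item
```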

Deduplication 

Deduplication rules should be shared across multiple crawlers: once one crawler has crawled a page, the others should not crawl it again. Scrapy's default RFPDupeFilter only keeps request fingerprints in memory inside a single process, so a shared rule needs a custom filter backed by shared storage; the hooks for that are shown below.

Custom deduplication solutions
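
Scrapy's hook for this is the DUPEFILTER_CLASS setting: a custom filter mainly needs a request_seen() method. Below is a minimal sketch of such a filter; the module and class names are illustrative, and the in-memory set would have to be replaced with shared storage (for example a Redis set) for the rules to actually be shared between crawlers:

```
# dupefilters.py (illustrative module name)
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

class MyDupeFilter(BaseDupeFilter):
    def __init__(self):
        # replace this set with shared storage to share the rules across crawlers
        self.visited = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fp = request_fingerprint(request)   # canonical fingerprint of the request
        if fp in self.visited:
            return True                     # True means: this request is skipped
        self.visited.add(fp)
        return False

# settings.py
# DUPEFILTER_CLASS = 'myscrapy.dupefilters.MyDupeFilter'
```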

Using downloader middleware
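
A downloader middleware sits between the engine and the downloader, so it can rewrite outgoing requests (headers, proxies, cookies) and inspect responses. A bare skeleton, registered through DOWNLOADER_MIDDLEWARES (the names are illustrative):

```
# middlewares.py (illustrative)
class MyDownloaderMiddleware(object):
    def process_request(self, request, spider):
        # called for every outgoing request; return None to continue normally,
        # or return a Response/Request to short-circuit the download
        request.headers.setdefault('User-Agent', 'Mozilla/5.0')
        return None

    def process_response(self, request, response, spider):
        # called for every downloaded response; must return a Response or a Request
        return response

# settings.py
# DOWNLOADER_MIDDLEWARES = {'myscrapy.middlewares.MyDownloaderMiddleware': 543}
```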

 

Spider middleware
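
Spider middleware wraps the spider itself: it sees responses before they reach the callback, and whatever the callback yields afterwards. A bare skeleton (names are illustrative):

```
# middlewares.py (illustrative)
class MySpiderMiddleware(object):
    def process_spider_input(self, response, spider):
        # called for each response before it reaches the spider callback
        return None

    def process_spider_output(self, response, result, spider):
        # called with whatever the callback yielded; pass items/requests onward
        for i in result:
            yield i

# settings.py
# SPIDER_MIDDLEWARES = {'myscrapy.middlewares.MySpiderMiddleware': 543}
```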

 

Signals and configuration
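
Signals let an extension (or the spider itself) react to events such as spider_opened and spider_closed; handlers are connected in from_crawler and the extension is enabled in the settings. A small sketch (names are illustrative):

```
# extensions.py (illustrative)
from scrapy import signals

class SpiderLogExtension(object):
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # connect handlers to the signals dispatched by the crawler
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        spider.logger.info('spider opened: %s', spider.name)

    def spider_closed(self, spider, reason):
        spider.logger.info('spider closed: %s (%s)', spider.name, reason)

# settings.py
# EXTENSIONS = {'myscrapy.extensions.SpiderLogExtension': 500}
```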

 

Bloom filter
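
For very large crawls, keeping every request fingerprint in a plain set gets expensive; a Bloom filter trades a small false-positive rate for a fixed, much smaller memory footprint. A toy sketch using only the standard library (real deployments usually use a dedicated library or a Redis-backed bitmap):

```
import hashlib

class BloomFilter(object):
    def __init__(self, size=2 ** 20, hash_count=5):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, value):
        for i in range(self.hash_count):
            digest = hashlib.md5(('%d:%s' % (i, value)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        # may report a false positive, never a false negative
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))
```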

Distributed crawling - scrapy-redis
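
scrapy-redis moves both the request queue and the dupefilter into Redis, so several crawler processes share one scheduler and one seen-set. The core settings look roughly like this (the Redis address is illustrative):

```
# settings.py
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'              # requests are queued in Redis
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # fingerprints are shared in Redis
SCHEDULER_PERSIST = True                                    # keep the queue/seen-set between runs
REDIS_URL = 'redis://127.0.0.1:6379'
```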


Source code analysis

Encapsulated methods implemented in the internal source code

Source: www.cnblogs.com/Gaimo/p/11960610.html