Spiders
Description: Spiders are the crawler .py files created inside the Scrapy project.
# 1. Spiders are a set of classes that define how a certain URL (or a group of URLs) will be crawled, including how to perform the crawl and how to extract structured data from the pages.
# 2. In other words, Spiders are where you customize the crawling and parsing behavior for a specific URL or group of URLs.
Spiders cycle through the following steps:
# 1. Generate the initial Requests for the first URLs to crawl and specify a callback function for them. The first Requests are defined in the start_requests() method, which by default generates a Request for every URL in the start_urls list, with the parse method as the default callback. The callback is triggered automatically when the download completes and the response is returned.
# 2. Inside the callback, parse the response and return a value, which can be one of four kinds: a dict containing the parsed data, an Item object, a new Request object (new Requests must also specify a callback), or an iterable of Items and/or Requests.
# 3. To parse the page content inside the callback you normally use Scrapy's own Selectors, but you can obviously also use BeautifulSoup, lxml, or any other library you prefer.
# 4. Finally, the returned Item objects are persisted to a database through the Item Pipeline component (https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline) or exported to a file with Feed Exports (https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports).
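A minimal sketch illustrating this cycle (not part of the original project; the site, URLs, and CSS selectors are made-up placeholders): start_requests() produces the initial Requests, and the parse callback yields both parsed data and a follow-up Request.

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'

    # step 1: generate the initial Requests and name their callback explicitly
    def start_requests(self):
        for url in ['http://example.com/page1', 'http://example.com/page2']:
            yield scrapy.Request(url, callback=self.parse)

    # steps 2/3: parse the response with Scrapy Selectors and return
    # dicts (or Items) plus new Requests
    def parse(self, response):
        for title in response.css('h2.title::text').getall():
            yield {'title': title}                       # parsed data as a dict
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            # a new Request must also specify its callback
            yield response.follow(next_page, callback=self.parse)
```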
There are five kinds of Spiders in total:
# 1. scrapy.spiders.Spider    # scrapy.Spider is an alias for scrapy.spiders.Spider
# 2. scrapy.spiders.CrawlSpider
# 3. scrapy.spiders.XMLFeedSpider
# 4. scrapy.spiders.CSVFeedSpider
# 5. scrapy.spiders.SitemapSpider
Import and basic usage:
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Spider, CrawlSpider, XMLFeedSpider, CSVFeedSpider, SitemapSpider

class AmazonSpider(scrapy.Spider):   # custom class inheriting from the Spider base class
    name = 'Amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['http://www.amazon.cn/']

    def parse(self, response):
        pass
```
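Assuming the file above lives inside a Scrapy project, the spider can be run from the project directory with `scrapy crawl Amazon` (the argument must match the spider's `name` attribute).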
Storing data in MongoDB
Pipeline usage workflow:
Configure the pipelines in settings.py, define the fields to collect in items.py (the item models), connect to the database in pipelines.py, and write the crawling code that produces the data in cnblogs.py.
1. Write a class in items.py containing the fields you need, similar to Django models:
```python
import scrapy

# class MyscrapyItem(scrapy.Item):
#     # define the fields for your item here like:
#     # name = scrapy.Field()
#     pass

class ArticleItem(scrapy.Item):
    article_name = scrapy.Field()
    article_url = scrapy.Field()
    auther_name = scrapy.Field()
    commit_count = scrapy.Field()
```
2. In settings.py, register the pipelines in the ITEM_PIPELINES setting together with their priorities, as sketched below.
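A minimal sketch of that settings entry, assuming the project package is named myscrapy (use your real package name); the numbers set the order, and lower values run first:

```python
# settings.py -- register the pipelines defined in pipelines.py and give each a priority
ITEM_PIPELINES = {
    'myscrapy.pipelines.ArticleMongodbPipeline': 300,   # lower number = runs earlier
    'myscrapy.pipelines.ArticleFilePipeline': 400,
}
```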
3. pipelines.py
Connect to the database and implement the pipeline methods:
# ArticleMongodbPipeline
#   - __init__
#   - from_crawler
#   - open_spider
#   - close_spider
#   - process_item (the return value matters: return the item and the next pipeline can keep processing it; return None and the next pipeline receives None instead)
# ArticleFilePipeline
Code:
```python
# -*- coding: utf-8 -*-
from pymongo import MongoClient

class ArticleMongodbPipeline(object):
    def process_item(self, item, spider):
        # client = MongoClient('mongodb://localhost:27017/')   # connection-string form
        client = MongoClient('localhost', 27017)
        # the 'article' database is created if it does not exist, otherwise it is reused
        db = client['article']
        # likewise the 'articleinfo' collection is created on first use
        article_info = db['articleinfo']
        article_info.insert(dict(item))
        # article_info.save({'article_name': item['article_name'], 'article_url': item['article_url']})
        # if this returned None, the next pipeline would receive None instead of the item
        return item

class ArticleFilePipeline(object):
    # def __init__(self, host, port):
    def __init__(self):
        # self.mongo_conn = MongoClient(host, port)
        pass

    # if from_crawler exists it is called first: obj = ArticleFilePipeline.from_crawler(crawler);
    # otherwise the pipeline is instantiated directly: obj = ArticleFilePipeline()
    @classmethod
    def from_crawler(cls, crawler):
        print('from_crawler was called')
        # the MongoDB settings could be read from the configuration here:
        # host = crawler.settings['HOST']   # taken from the settings file
        # port = crawler.settings['PORT']   # taken from the settings file
        # crawler.settings is the global settings object
        print(crawler.settings['AA'])
        # return cls(host, port)
        return cls()

    def open_spider(self, spider):
        # print('----', spider.custom_settings['AA'])
        self.f = open('article.txt', 'a')

    def close_spider(self, spider):
        self.f.close()
        print('close spider')

    def process_item(self, item, spider):
        pass
```
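A note on pymongo versions: Collection.insert(), kept above to stay faithful to the original snippet, is deprecated in pymongo 3.x and removed in 4.x; on current versions use article_info.insert_one(dict(item)) instead.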
Deduplication
The deduplication rules should be shared by multiple crawlers: once one crawler has crawled a URL, the others do not need to crawl it again. The implementation is outlined below.
Custom deduplication solutions
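A minimal sketch of one possible approach, not the author's implementation: request fingerprints are stored in a shared Redis set so that every crawler sees the same deduplication state. The Redis connection details, the set name, and the module path in DUPEFILTER_CLASS are assumptions for illustration.

```python
# dupefilters.py -- sketch of a shared dedup filter backed by Redis (assumed setup)
import redis
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint  # deprecated in newer Scrapy, shown for illustration

class SharedRedisDupeFilter(BaseDupeFilter):
    def __init__(self):
        # assumption: a Redis server shared by all crawlers runs locally
        self.conn = redis.Redis(host='localhost', port=6379)
        self.key = 'dupefilter'   # shared set of request fingerprints

    def request_seen(self, request):
        fp = request_fingerprint(request)
        # SADD returns 0 when the fingerprint is already in the set, i.e. a duplicate
        return self.conn.sadd(self.key, fp) == 0

# settings.py (hypothetical module path):
# DUPEFILTER_CLASS = 'myscrapy.dupefilters.SharedRedisDupeFilter'
```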
Downloader middleware
Spider middleware
Signals and configuration information
Bloom filter
Distributed crawling: scrapy-redis
Source code analysis
How the internally used methods are implemented in the source code