5 Crawlers (Scrapy)

1 Persisting data to MongoDB (pipeline persistence)

Depth-first vs. breadth-first crawling

Scrapy implements both through its scheduler, using queues, priority queues and stacks.
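As a point of reference, the Scrapy FAQ describes how to switch the default LIFO (roughly depth-first) scheduling to FIFO (breadth-first) purely through settings; a minimal sketch of those settings:

# settings.py -- crawl breadth-first instead of the default depth-first order
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'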

Storing the scraped data in the database takes the four operations marked in the pipeline below.

The pipeline must be configured in settings before it is scheduled and executed.

cnblogs.py
from scrapy import Request              # needed for the Request yielded below
from myscrapy.items import ArticleItem  # the item class defined in items.py

    # parse() is a method of the spider class in cnblogs.py
    def parse(self, response):
        item_list = response.css('#post_list .post_item')
        for item in item_list:
            article_url = item.css('.post_item_body a::attr(href)').extract_first()
            article_name = item.css('.titlelnk::text').extract_first()  # use a::text to take the text inside the <a> tag
            commit_count = item.css('.article_comment a::text').extract_first()
            auther_name = item.css('.post_item_foot a::text').extract_first()

            article_item = ArticleItem()
            article_item['article_url'] = article_url  # items must be indexed with []: ArticleItem does not define attribute-style access
            article_item['article_name'] = article_name
            article_item['commit_count'] = commit_count
            article_item['auther_name'] = auther_name
            yield article_item
            # yield Request(url, callback=self.parse_detail)  # if callback is omitted it defaults to parse; a different callback can be given

        next_url = response.css('.pager a:last-child::attr(href)').extract_first()
        if next_url:
            print('https://www.cnblogs.com' + next_url)
            yield Request('https://www.cnblogs.com' + next_url)
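If the detail pages also need to be scraped, the commented-out Request above can carry the item along in meta and hand it to a parse_detail callback in the same spider class. A minimal sketch (parse_detail, the article_content field and the #cnblogs_post_body selector are illustrative assumptions, not part of the code above):

            # instead of yielding the item directly inside the loop:
            # yield Request(article_url, callback=self.parse_detail, meta={'item': article_item})

    def parse_detail(self, response):
        # the item travels with the request and comes back on response.meta
        article_item = response.meta['item']
        article_item['article_content'] = response.css('#cnblogs_post_body').extract_first()  # hypothetical extra field
        yield article_item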
items.py
import scrapy

class ArticleItem(scrapy.Item):
    article_name = scrapy.Field()
    article_url = scrapy.Field()
    auther_name = scrapy.Field()
    commit_count = scrapy.Field()
pipelines.py
from pymongo import MongoClient

class ArticleMongodbPipeline(object):
    def process_item(self, item, spider):
        # 1. connect to MongoDB
        client = MongoClient('localhost', 27017)
        # 2. select the database
        db = client['db2']  # equivalent to: client.db2
        # 3. list all collections in the database
        # print(db.list_collection_names())
        # 4. get (or create) the collection
        table_user = db['userinfo']  # equivalent to: db.userinfo
        table_user.insert_one(dict(item))  # save() is gone from newer pymongo; insert_one() writes the item as a document
        # return item  # returning the item lets later pipelines receive it too
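The settings below also reference an ArticleFilePipeline that is not shown in these notes; a minimal sketch of what it might look like (the file name is arbitrary), using the usual open_spider/close_spider hooks so the file is opened only once:

class ArticleFilePipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('articles.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write('%s %s\n' % (item['article_name'], item['article_url']))
        return item  # return the item so the next pipeline (the MongoDB one) still receives it

    def close_spider(self, spider):
        self.file.close()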

settings.py

Configure the pipeline in settings.py:
ITEM_PIPELINES = {
    'myscrapy.pipelines.ArticleMongodbPipeline': 300,   # lower numbers run earlier
    'myscrapy.pipelines.ArticleFilePipeline': 100,
}  # the keys here are the dotted paths of the classes defined in pipelines.py

2 Deduplication rules (Scrapy's built-in dedup filter)

The URLs yielded as Requests may contain duplicates.

Each URL is first hashed (an MD5-style digest) to cut the memory footprint; and although the two URLs below would naively produce different digests, Scrapy normalizes the URL first, so they end up identical and are treated as the same request.

- Put the URLs in a set
    - Drawbacks:
    1. A URL can be very long, so the set uses a lot of memory -- hash it first (e.g. an MD5 value)
    2. www.baidu.com?name=lqz&age=18
        www.baidu.com?age=18&name=lqz
        (the same URL with its query parameters in a different order)
- BloomFilter deduplication (for reference; see section 6)
# Note
Source code entry point:
from scrapy.dupefilters import RFPDupeFilter  # follow it to the definition (the last one) to read the source
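To see why the two URLs from the example above end up being treated as the same request, here is a small check (a sketch; it uses canonicalize_url from w3lib, the library Scrapy's request fingerprinting builds on and which is installed together with Scrapy):

from w3lib.url import canonicalize_url

u1 = 'http://www.baidu.com?name=lqz&age=18'
u2 = 'http://www.baidu.com?age=18&name=lqz'

# canonicalize_url sorts the query parameters, so both URLs normalize
# to the same string before being hashed into a fingerprint
print(canonicalize_url(u1))
print(canonicalize_url(u2))
print(canonicalize_url(u1) == canonicalize_url(u2))  # True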


To skip deduplication:

dont_filter=True (a Request parameter)

yield Request('https://www.cnblogs.com' + next_url, dont_filter=True)  # this request is never filtered out as a duplicate

3 Downloader middleware (middlewares.py)


When does it take effect?

It is triggered every time a request passes through the download step.

- Use cookies
- Use a proxy pool
- Integrate Selenium
- How to use it:
    - Write a class: MyscrapyDownloaderMiddleware
        - process_request
            add a proxy, add cookies, integrate Selenium, ...
            return None, a Response, or a Request
        - process_response
    - Configure it in settings:
         DOWNLOADER_MIDDLEWARES = {'myscrapy.middlewares.MyscrapyDownloaderMiddleware': 543}
# hand-written downloader middleware
from scrapy.http import HtmlResponse   # needed for the fake response returned below

class MyscrapyDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):

        print('request coming through')
        print(request.url)

        # the headers can be modified here
        print(request.headers)
        # take a cookie out of the cookie pool and assign it here
        print(request.cookies)

        # use a proxy through request.meta
        # request.meta['download_timeout'] = 20  # download timeout
        # request.meta["proxy"] = 'http://192.168.1.1:7878'  # (made-up address)
        # print(request)
        # print(spider)
        # Return value: None, a Request object, or a Response object
        # None: continue on to the next downloader middleware
        # Response: returned directly and handed to the spider for parsing
        # Request: goes back to the scheduler and is rescheduled
        # Selenium could be used here
        # (for demonstration, every request is answered with this fake response)
        return HtmlResponse(url="www.baidu.com", status=200, body=b'sdfasdfasfdasdfasf')

    def process_response(self, request, response, spider):
        print('response coming back')
        print(request)
        print(spider)
        print(type(response))
        return response

    def process_exception(self, request, exception, spider):  # proxy timed out or the request raised an exception
        print('proxy %s raised an exception while requesting %s: %s' % (request.meta['proxy'], request.url, exception))
        import time
        time.sleep(5)
        # remove the bad proxy
        # delete_proxy(request.meta['proxy'].split("//")[-1])
        # fetch a fresh proxy and put it back on the request
        # request.meta['proxy'] = 'http://' + get_proxy()

        # Returning the request sends it back to the scheduler to be rescheduled.
        # This is not ideal: the request goes through the middlewares again and can loop,
        # so the proxy could just as well be set inside the cnblogs spider itself.
        return request

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
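As a concrete example of the Selenium integration mentioned above, here is a minimal sketch of a downloader middleware that renders the page in a headless Chrome and returns the rendered HTML (it assumes selenium and a matching chromedriver are installed; the class name is made up):

from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class SeleniumDownloaderMiddleware(object):
    def __init__(self):
        options = Options()
        options.add_argument('--headless')  # run Chrome without a window
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        # let a real browser fetch and render the page (including JavaScript)
        self.driver.get(request.url)
        body = self.driver.page_source
        # returning a Response short-circuits Scrapy's own downloader
        return HtmlResponse(url=self.driver.current_url,
                            body=body,
                            encoding='utf-8',
                            request=request)

Like the middleware above, it would be enabled through DOWNLOADER_MIDDLEWARES in settings.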

4 Spider middleware

- Write a class MyscrapySpiderMiddleware
    - implement its hook methods (a sketch follows below)
- Configure it in settings:
    SPIDER_MIDDLEWARES = {
        'myscrapy.middlewares.MyscrapySpiderMiddleware': 543,
    }
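A minimal sketch of such a class, using Scrapy's standard spider-middleware hook names (the bodies just pass everything through unchanged):

class MyscrapySpiderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    def process_spider_input(self, response, spider):
        # called for each response before it reaches the spider callback
        return None

    def process_spider_output(self, response, result, spider):
        # called with everything the callback yields (items and requests)
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # called when a spider callback raises an exception
        pass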

5 Signals

Run a function at a chosen point in the crawl.

Scrapy ships a set of built-in signals; you only need to connect a function to them.

        - Write a class: class MyExtension(object):
            - bind some signals in from_crawler
            - when Scrapy reaches a signal's trigger point, the bound functions are called automatically
        - Configure it in settings:
            EXTENSIONS = {
               'extentions.MyExtension': 100,
            }
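A minimal sketch of such an extension (the signal names are Scrapy's built-in ones; the print statements are only placeholders):

from scrapy import signals

class MyExtension(object):
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # bind methods to built-in signals; Scrapy calls them at the matching points
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        print('spider %s opened' % spider.name)

    def spider_closed(self, spider):
        print('spider %s closed' % spider.name)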

6 Bloom filter

Lookup efficiency:

hash table (collisions possible) -> binary search tree (lookups stay fast because the depth stays limited) -> red-black tree (keeps the tree balanced)

Cache penetration: requests for keys that exist in neither the cache nor the database hit the database every time; putting a Bloom filter in front lets such keys be rejected cheaply.

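A toy Bloom filter, just to make the idea concrete (a sketch built only on the standard library; the bit-array size and the number of hash functions are arbitrary choices, not tuned values):

import hashlib

class BloomFilter(object):
    def __init__(self, size=2 ** 20, hash_count=3):
        self.size = size
        self.hash_count = hash_count
        self.bits = bytearray(size // 8)  # one big bit array

    def _positions(self, value):
        # derive several bit positions by salting the hash
        for seed in range(self.hash_count):
            digest = hashlib.md5(('%d:%s' % (seed, value)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        # False means definitely never added; True means probably added
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))

bf = BloomFilter()
bf.add('https://www.cnblogs.com')
print('https://www.cnblogs.com' in bf)  # True
print('https://www.baidu.com' in bf)    # almost certainly False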

7 Distributed crawling with scrapy-redis

Goal: improve crawling efficiency by deploying the project across multiple machines.

The duplication problem: deduplication and scheduling are handled centrally, in Redis.

Principle: the spider program is spread across several machines; the start URLs and the request queue live in a shared Redis that feeds every download worker, so links that have already been crawled are never crawled again.
    - pip3 install scrapy-redis
    - Configure in settings:
        SCHEDULER = "scrapy_redis.scheduler.Scheduler"
        DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
        ITEM_PIPELINES = {
            'scrapy_redis.pipelines.RedisPipeline': 300
        }
        (- scheduler: use the scheduler provided by scrapy-redis
        - dedup rule: use the dedup filter provided by scrapy-redis
        - optional:
            - configure the pipeline to the persistence class provided by scrapy-redis)

    - In the spider class:
        - inherit from RedisSpider
        - remove start_urls (start URLs are dispatched centrally from Redis)
        - add: redis_key = 'cnblogs:start_urls'

     In redis-cli: lpush cnblogs:start_urls https://www.cnblogs.com  (the key must match the spider's redis_key; the l in lpush means the value is pushed onto the left end of the list)
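Putting the spider-side changes together, a minimal sketch of the cnblogs spider rewritten for scrapy-redis (the class name and the shortened parse body are assumptions based on the project layout above):

# cnblogs.py
from scrapy import Request
from scrapy_redis.spiders import RedisSpider
from myscrapy.items import ArticleItem

class CnblogsSpider(RedisSpider):
    name = 'cnblogs'
    redis_key = 'cnblogs:start_urls'  # no start_urls: the first URL is lpush'ed into Redis

    def parse(self, response):
        # same parsing logic as before; new requests go back into the shared Redis queue
        for item in response.css('#post_list .post_item'):
            article_item = ArticleItem()
            article_item['article_url'] = item.css('.post_item_body a::attr(href)').extract_first()
            article_item['article_name'] = item.css('.titlelnk::text').extract_first()
            yield article_item
        next_url = response.css('.pager a:last-child::attr(href)').extract_first()
        if next_url:
            yield Request('https://www.cnblogs.com' + next_url)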

# Usage notes

1. When the spider is first started it just waits, because the start URL is dispatched centrally from Redis; open redis-cli from the command line and push it.
2. On the command line, change into the directory where Redis is installed before redis-cli can be launched, then set the shared start URL.
Supplementary note: to switch to a path on another drive letter in Windows cmd, type the drive letter (e.g. D:) or use cd /d followed by the full path.

