2022 crawler course: writing an incremental crawler with Scrapy + BloomFilter


Written up front

Today is the 78th post in this Python crawler series. I'm planting a flag here: I will try to finish 100 crawler cases before October 1st. If you have been reading since the first article, you should already be a competent crawler coder. Keep going!

We continue with incremental crawling. This article involves two pieces: the Scrapy framework and a Bloom filter (BloomFilter).

BloomFilter (Bloom filter) usage scenarios

As for who invented the Bloom filter and why, this article won't go into that. The following mainly covers the scenarios where a Bloom filter is used.

  1. Blacklisting (e.g. e-mail blacklists)
  2. Web crawler deduplication (the topic of the incremental crawler we are learning)
  3. Quickly checking whether a key exists in a KV system
  4. Reducing cache penetration

The library to master today is called pybloom_live. For its source code and the latest version, refer to pypi.org/project/pyb…

First, check the dependencies on GitHub (github.com/joseph-fox/…). This step is important; skip it and the installation can easily go wrong. Among the dependencies you need bitarray.


Since I hadn't installed it beforehand, the following error occurred. Note the message: it says Microsoft Visual C++ 14 is missing. Installing Visual C++ is very resource-intensive, so we will take another route.


Open https://www.lfd.uci.edu/~gohlke/pythonlibs/#bitarray and find the build matching the Python version installed on your machine. I am using Python 3.7, so I download that wheel. Installing a local wheel file was covered in a previous post.


Step 1: install the downloaded bitarray wheel locally with pip (as covered in the earlier post).


Step 2:

Install pybloom_live (pip install pybloom_live).


pybloom_live Quick Start

pybloom_live offers two simple usages: BloomFilter, with a fixed capacity, and ScalableBloomFilter, which can grow. To put it bluntly, one is fixed and the other expands dynamically.

The above is just my opening brick, so to speak; a quick search turns up plenty of basic usage guides. A minimal sketch is shown below.
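Here is a minimal quick-start sketch of both classes (the URLs and item names are placeholder values of my own, not from the project):

from pybloom_live import BloomFilter, ScalableBloomFilter

# Fixed-capacity filter: capacity and error_rate are set once
bf = BloomFilter(capacity=1000, error_rate=0.001)
bf.add('https://example.com/page/1')
print('https://example.com/page/1' in bf)  # True
print('https://example.com/page/2' in bf)  # False (no false negatives, rare false positives)

# Scalable filter: grows automatically as elements are added
sbf = ScalableBloomFilter(initial_capacity=100, error_rate=0.001,
                          mode=ScalableBloomFilter.SMALL_SET_GROWTH)
for i in range(1000):
    sbf.add(f'item-{i}')
print('item-5' in sbf)  # True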

Next, we use the BloomFilter class together with a file to achieve deduplication. Note that the deduplication state here is persisted by reading and writing the filter to a file: if you write a multi-process or multi-threaded crawler, you need to add mutual exclusion and synchronization around it (a rough sketch of the idea follows below). Also, for the BloomFilter file I/O, read and write in batches rather than per item, otherwise efficiency suffers badly.
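As an illustration only (my own sketch, not part of this project's code), guarding a shared filter with a lock in a multi-threaded crawler could look roughly like this:

import threading

from pybloom_live import BloomFilter

# one filter shared by all worker threads, protected by a lock
bf = BloomFilter(capacity=100000, error_rate=0.001)
bf_lock = threading.Lock()

def seen_before(key):
    """Return True if key was already recorded; otherwise record it and return False."""
    with bf_lock:
        if key in bf:
            return True
        bf.add(key)
        return False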

Create a class file named bloomcheck.py and write the following code; the explanations are in the comments.

from pybloom_live import BloomFilter
import os
import hashlib


class BloomCheck(object):
    def __init__(self):
        '''
        Check whether the Bloom-filter file already exists:
        load it if it does, otherwise create a new filter.
        '''
        self.filename = 'bloomfilter.bf'
        is_exist = os.path.exists(self.filename)
        if is_exist:
            with open(self.filename, 'rb') as f:
                self.bf = BloomFilter.fromfile(f)
        else:
            # capacity (required) is the maximum number of elements;
            # error_rate is the acceptable false-positive rate
            self.bf = BloomFilter(capacity=100000000, error_rate=0.001)

    def process_item(self, data):
        data_encode_md5 = hashlib.md5(data.encode(encoding='utf-8')).hexdigest()
        if data_encode_md5 in self.bf:
            # data already exists, return False
            return False
        else:
            # data is new: add it to the filter and return True
            self.bf.add(data_encode_md5)
            return True

    def save_bloom_file(self):
        # persist the filter to disk so the next run can load it
        with open(self.filename, 'wb') as f:
            self.bf.tofile(f)
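As a quick sanity check (a hypothetical snippet of my own, assuming bloomcheck.py is on the import path), the class can be exercised like this:

from bloomcheck import BloomCheck

checker = BloomCheck()
print(checker.process_item('first title'))   # True  -> new data, added to the filter
print(checker.process_item('first title'))   # False -> already seen, skipped
checker.save_bloom_file()                    # persist, so the next run remembers it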

Scrapy crawler code

For the crawler itself, only the core spider source is shown. When parsing the response, a basic check is made first: if the title already exists in the .bf file (created dynamically by the code above), it is not added; otherwise the item is yielded.

The complete code is attached and can be downloaded at the end of the article.

import scrapy

# Assumed imports: DeItem comes from the project's items.py and BloomCheck
# from the bloomcheck.py written above; adjust the paths to your project layout.
from ..items import DeItem
from ..bloomcheck import BloomCheck


class IndexSpider(scrapy.Spider):
    name = 'index'
    allowed_domains = ['xz.aliyun.com']
    start_urls = ['http://xz.aliyun.com/']
    bf = BloomCheck()

    def parse(self, response):
        li_list = response.xpath("//a[@class='topic-title']")
        for li in li_list:
            de_item = DeItem()
            title = li.xpath("./text()").extract_first().strip()
            # check whether the title is already in the .bf file; only new data is yielded
            if self.bf.process_item(title):
                de_item['title'] = title
                de_item['url'] = "https://xz.aliyun.com" + li.xpath("./@href").extract_first()
                yield de_item
            else:
                print(f"--{title}-- already exists, skipping")

        # persist the Bloom filter
        self.bf.save_bloom_file()

Set up a Scrapy scheduled task

Write a run.bat batch file and put it in the same directory as the crawler project (any other directory also works). Note that the paths below should be changed to your own.

@echo off
rem switch to the project drive and directory, then run the spider
E:
cd E:\crawl100\demo78\de\de

scrapy crawl index
rem pause
exit

To customize a Windows scheduled task, find "Task Scheduler" under "Administrative Tools" in "Control Panel" and open "Create Basic Task". For the action, choose to start a program and select the location of the .bat file. When testing the scheduled task, pay attention to the trigger time you set.


For this step I found a fairly detailed blog post for you; the content is simple enough that I won't repeat it: blog.csdn.net/Gpwner/arti…

Written at the end

That's it for today's incremental crawler; I hope this article helps you~


Origin juejin.im/post/7146209736613068808