[Scrapy-03] Image crawling: Bloom filters, database storage, and related techniques

Under Python 3 on Windows, a Bloom filter is not an easy choice: most ready-made modules either don't support the Windows platform, only support Python 2, or handle file-backed persistence poorly. After a lot of searching I found one called bloom_filter that works well.
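To make the underlying idea concrete, here is a toy in-memory Bloom filter built only on the standard library. This is an illustration of the mechanics (k bit positions derived from a hash, no false negatives, possible false positives), not the bloom_filter package itself, which is used in the pipeline below with its max_elements, error_rate, and filename arguments:

```python
import hashlib

class ToyBloomFilter:
    """A minimal Bloom filter sketch for illustration only.
    The real pipeline below uses the bloom_filter package instead."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)  # bit array, all zeros

    def _positions(self, item):
        # Derive k bit positions from 4-byte slices of one SHA-1 digest.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every one of the k bits is set; an item that was
        # added is always reported present (no false negatives).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The "check membership, then add" pattern used throughout the pipeline maps directly onto `in` and `add` here.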

- Using the Bloom filter is straightforward: create the backing file once, then open it on every run, test membership, and add any entry that isn't there yet. I keep two Bloom filter files and make two checks. The first checks whether the image URL to be downloaded already exists (i.e., whether it has been downloaded before); if so, the URL is skipped, otherwise the image is downloaded. After the download, the file is hashed with SHA-1 and the digest is checked against the second filter; this prevents the same image, hosted on different websites, from being stored twice. If the digest is absent, the image is unique so far, but this still can't guarantee the picture content is unduplicated: change even a little of an image's attributes or metadata and the bytes, and hence the hash, are different.

- Second, there are the database operations, handled here with pymysql. Remember to initialize the connection in __init__ and to close it in the close_spider method.

- In between, the pipeline checks whether the target path exists and creates it if it doesn't. Downloaded images are stored in year/month/day directories so that storage stays reasonably organized instead of everything landing in one huge folder.
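The directory logic can be sketched with strftime and os.makedirs; the base directory passed in is a hypothetical example, and this is a cleaner stdlib variant of the same idea rather than the pipeline's exact code:

```python
import datetime
import os

def make_dated_dir(base_dir):
    """Create (if needed) and return a base_dir/YYYY/MM/DD directory."""
    today = datetime.date.today()
    img_dir = os.path.join(base_dir,
                           today.strftime("%Y"),
                           today.strftime("%m"),
                           today.strftime("%d"))
    # exist_ok=True makes the explicit os.path.exists check unnecessary
    os.makedirs(img_dir, exist_ok=True)
    return img_dir
```

os.path.join keeps the code portable across Windows and Unix path separators, instead of hard-coding backslashes and replacing them later.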

- The SHA-1 hashing is done with the hashlib library.

- Images in the same group share a common group identifier (GROUP_ID).

- Each image is renamed with a UUID so that duplicate filenames can never overwrite one another.

- As the database schema shows, the image's directory (IMG_DIR) and filename (IMG_NAME) are stored separately, so images can be moved later without rewriting every record.
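The renaming and the split storage of directory and filename can be sketched like this; the helper names and the example directory are illustrative, and the full path is only reassembled when an image is read back:

```python
import os
import uuid

def unique_name(ext=".jpg"):
    """A collision-proof filename, so two downloads never overwrite each other."""
    return str(uuid.uuid1()) + ext

def full_path(img_dir, img_name):
    """Recombine the separately stored directory and name at read time.
    If images are moved later, only the stored directory needs updating."""
    return os.path.join(img_dir, img_name)
```

Because the directory is stored in its own column, relocating the image tree means updating one field per row while every filename stays valid.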

import pymysql
import urllib.request
import uuid
import datetime
import os
from bloom_filter import BloomFilter
import hashlib

class DesignhubPipeline(object):
    def __init__(self):
        self.conn = pymysql.connect(host="127.0.0.1", user="root", password="root", db="dlimg", charset='utf8mb4')
        self.bf_urls = BloomFilter(max_elements=10000000,error_rate=0.001,filename="C:/Users/Eric/Downloads/dlimg/bloomfilter/img_urls.bf")
        self.bf_sha1s = BloomFilter(max_elements=10000000,error_rate=0.001,filename="C:/Users/Eric/Downloads/dlimg/bloomfilter/img_sha1s.bf")

    def process_item(self, item, spider):
        if len(item["img_urls"]) == 0:
            return item
        for img_url in item["img_urls"]:
            # check out whether this img_url exists
            if img_url not in self.bf_urls:
                self.bf_urls.add(img_url)
                try:
                    base_dir = "C:\\Users\\Eric\\Downloads\\dlimg\\"
                    today = datetime.datetime.today().isoformat()
                    img_dir = base_dir + today[:4] + "\\" + today[5:7] + "\\" + today[8:10] + "\\"
                    if not os.path.exists(img_dir):
                        # create dir
                        os.makedirs(img_dir)
                    img_name = str(uuid.uuid1()) + ".jpg"
                    img_path = img_dir.replace("\\", "/") + img_name
                    opener = urllib.request.build_opener()
                    opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'),
                                         ('Accept','text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'),
                                         ('Referer','https://www.yatzer.com/'),
                                         ('Cache-Control','no-cache')]
                    urllib.request.install_opener(opener)
                    urllib.request.urlretrieve(img_url, img_path)
                    # check out whether this img_file exists
                    with open(img_path, "rb") as f:
                        sha1Obj = hashlib.sha1()
                        sha1Obj.update(f.read())
                        hashRs = sha1Obj.hexdigest()
                        if hashRs in self.bf_sha1s:
                            # print("删除文件")
                            os.remove(img_path)
                        else:
                            self.bf_sha1s.add(hashRs)
                            sql = "insert into `dl_img`(`IMG_ID`,`IMG_DIR`,`IMG_NAME`,`GROUP_ID`,`SOURCE_URL`,`CREATE_DATE`) values(%s, %s, %s, %s,%s, %s)"
                            with self.conn.cursor() as cursor:
                                cursor.execute(sql, (str(uuid.uuid1()), img_dir.replace("\\", "/"), img_name, str(item["group_id"]), item["source_url"], datetime.datetime.now()))
                                self.conn.commit()
                except Exception as e:
                    # log the failure instead of silently swallowing it
                    spider.logger.error("failed to fetch %s: %s" % (img_url, e))
        return item

    def close_spider(self, spider):
        # cursors are closed by their "with" blocks; just close the connection
        self.conn.close()
