In a Python 3 + Windows environment, using a Bloom filter is not a particularly wise choice: the ready-made modules either do not support Windows, only support Python 2, or handle file-backed storage poorly. After a lot of searching, I found one called bloom_filter.
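For readers who have not met the data structure before, here is a minimal sketch of the idea behind a Bloom filter and of the "check, then add" pattern used in the rest of this post. It is not the bloom_filter module's implementation: `TinyBloom` and `seen_before` are hypothetical names, the filter lives only in memory, and the bit count and hash count are arbitrary assumptions. A Bloom filter never forgets a key it has stored, but it can occasionally report an unseen key as present (a false positive), which is what the error_rate setting controls in the real module.

```python
import hashlib


class TinyBloom:
    """Hypothetical in-memory Bloom filter, for illustration only."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-1 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        # Set every bit that belongs to this key.
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # The key is "probably present" only if all of its bits are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def seen_before(bloom, key):
    """Return True if key was (probably) seen; otherwise record it and return False."""
    if key in bloom:
        return True
    bloom.add(key)
    return False
```

Calling `seen_before(bf, url)` twice with the same URL returns False the first time and True the second time, which is exactly the behaviour the pipeline relies on.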
- The bloom_filter module works well. First create a file for the filter yourself, then reopen it on every run: check whether a key exists, add it if it does not, and do nothing if it does. I use two Bloom filter files and make two checks. The first checks whether the URL of the image to download already exists (that is, whether it has already been downloaded): if it has, it is ignored; if not, it is downloaded. After the download, the file is hashed with SHA-1, and the hash is checked against the second filter. This catches the same image being downloaded from different websites. If the hash is new, the file is unique byte for byte, but that does not guarantee the picture itself is not a duplicate, because changing even a little of the picture's attributes or metadata produces a different file with a different hash.
- Next come the database operations, which use pymysql here. Remember to initialize the database connection in __init__ and to close it in the close_spider method.
- Some of the intermediate steps check whether the target path exists and create it if it does not. Downloaded pictures are stored in year/month/day directories so that storage stays reasonably organized instead of everything landing in one large folder.
- The SHA-1 processing uses the hashlib library.
- All pictures in the same group share a common identifier (stored as GROUP_ID).
- Each image is renamed with a UUID so that files with the same original name cannot collide.
- As the database columns show, the picture's directory (IMG_DIR) and file name (IMG_NAME) are stored separately, so the pictures can be moved later without rewriting every file name.
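The directory layout and the hashing step from the notes above can be sketched with the standard library alone. This is an illustrative variant under a few assumptions, not the pipeline's own code: `dated_dir` and `sha1_of_file` are hypothetical helper names, paths use forward slashes throughout, and the file is hashed in chunks rather than read in one call.

```python
import datetime
import hashlib
import posixpath


def dated_dir(base_dir, day):
    """Return base_dir/YYYY/MM/DD/ using forward slashes."""
    return posixpath.join(base_dir, f"{day:%Y}", f"{day:%m}", f"{day:%d}") + "/"


def sha1_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large images are not read into memory at once."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the directory and the file name are kept apart, the full path is only assembled when needed, for example with `posixpath.join(img_dir, img_name)`.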
import pymysql
import urllib.request
import uuid
import datetime
import os
from bloom_filter import BloomFilter
import hashlib
class DesignhubPipeline(object):
    def __init__(self):
        # Open the database connection once for the lifetime of the pipeline.
        self.conn = pymysql.connect(host="127.0.0.1", user="root", password="root", db="dlimg", charset='utf8mb4')
        # File-backed Bloom filters: one for image URLs, one for file SHA-1 hashes.
        self.bf_urls = BloomFilter(max_elements=10000000, error_rate=0.001, filename="C:/Users/Eric/Downloads/dlimg/bloomfilter/img_urls.bf")
        self.bf_sha1s = BloomFilter(max_elements=10000000, error_rate=0.001, filename="C:/Users/Eric/Downloads/dlimg/bloomfilter/img_sha1s.bf")

    def process_item(self, item, spider):
        if len(item["img_urls"]) == 0:
            return item
        for img_url in item["img_urls"]:
            # Check whether this img_url has been seen before.
            if img_url not in self.bf_urls:
                # The URL is marked as seen before downloading,
                # so a failed download is not retried on a later run.
                self.bf_urls.add(img_url)
                try:
                    base_dir = "C:\\Users\\Eric\\Downloads\\dlimg\\"
                    today = datetime.datetime.today().isoformat()
                    # Store images under base_dir\YYYY\MM\DD\.
                    img_dir = base_dir + today[:4] + "\\" + today[5:7] + "\\" + today[8:10] + "\\"
                    if not os.path.exists(img_dir):
                        # create dir
                        os.makedirs(img_dir)
                    # Rename the image to a UUID to avoid file name collisions.
                    img_name = str(uuid.uuid1()) + ".jpg"
                    img_path = img_dir.replace("\\", "/") + img_name
                    opener = urllib.request.build_opener()
                    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'),
                                         ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'),
                                         ('Referer', 'https://www.yatzer.com/'),
                                         ('Cache-Control', 'no-cache')]
                    urllib.request.install_opener(opener)
                    urllib.request.urlretrieve(img_url, img_path)
                    # Check whether a file with the same content already exists.
                    with open(img_path, "rb") as f:
                        sha1Obj = hashlib.sha1()
                        sha1Obj.update(f.read())
                        hashRs = sha1Obj.hexdigest()
                    # The file is closed here; Windows cannot delete an open file.
                    if hashRs in self.bf_sha1s:
                        # Duplicate content: delete the file.
                        os.remove(img_path)
                    else:
                        self.bf_sha1s.add(hashRs)
                        sql = "insert into `dl_img`(`IMG_ID`,`IMG_DIR`,`IMG_NAME`,`GROUP_ID`,`SOURCE_URL`,`CREATE_DATE`) values(%s, %s, %s, %s,%s, %s)"
                        with self.conn.cursor() as cursor:
                            cursor.execute(sql, (str(uuid.uuid1()), img_dir.replace("\\", "/"), img_name, str(item["group_id"]), item["source_url"], datetime.datetime.now()))
                        self.conn.commit()
                except Exception:
                    # Log the failure instead of silently swallowing it.
                    spider.logger.exception("Failed to process image %s", img_url)
        return item

    def close_spider(self, spider):
        # Close the database connection when the spider finishes.
        self.conn.close()
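The download step in the pipeline installs a global opener and then calls urlretrieve. One possible alternative, sketched here as an assumption rather than as part of the original code, is to attach the headers to each request and to report failures to the caller. `build_request`, `download`, and `DEFAULT_HEADERS` are hypothetical names, and the 30-second timeout is an arbitrary choice.

```python
import logging
import shutil
import urllib.request

logger = logging.getLogger(__name__)

# Header values copied from the pipeline; note the HTTP spelling "Referer".
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36",
    "Referer": "https://www.yatzer.com/",
    "Cache-Control": "no-cache",
}


def build_request(img_url, headers=DEFAULT_HEADERS):
    """Build a Request carrying its own headers, with no global opener."""
    return urllib.request.Request(img_url, headers=dict(headers))


def download(img_url, img_path, timeout=30):
    """Save img_url to img_path. Return True on success, False on failure."""
    try:
        with urllib.request.urlopen(build_request(img_url), timeout=timeout) as resp, \
                open(img_path, "wb") as out:
            shutil.copyfileobj(resp, out)
        return True
    except OSError as exc:  # URLError and HTTPError are OSError subclasses
        logger.warning("Download failed for %s: %s", img_url, exc)
        return False
```

With a helper like this, the pipeline could add the URL to the Bloom filter only when `download` returns True, so that failed downloads are retried on the next run. A download that fails partway can leave an incomplete file at img_path, which the caller would need to remove.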