Scene description
- In Python crawling projects, MongoDB is often used to store the scraped results, which raises a question: how do you deduplicate a MongoDB collection with millions of documents?
- A common idea is to check whether a record already exists in the database at insertion time: if it exists, ignore it (more efficient) or overwrite it. This works when the amount of data is relatively small, but when it is large (millions of documents and above), it is often very inefficient. And what about an existing million-document database that has never been deduplicated?
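The check-before-insert idea can be sketched without a running MongoDB instance. Below, a plain dict keyed by the feature fields stands in for the collection; the field names and the `insert_if_absent` helper are hypothetical, chosen only to illustrate the "ignore if it exists" strategy:

```python
def insert_if_absent(store, record, key_fields):
    # Build a key from the feature fields used for deduplication
    key = tuple(record[f] for f in key_fields)
    if key in store:      # the record already "exists in the database"
        return False      # ignore strategy: skip the duplicate
    store[key] = record   # otherwise insert it
    return True

store = {}
a = {'notice_title': 't1', 'notice_content': 'c1'}
b = {'notice_title': 't1', 'notice_content': 'c1'}  # duplicate of a
print(insert_if_absent(store, a, ['notice_title', 'notice_content']))  # True
print(insert_if_absent(store, b, ['notice_title', 'notice_content']))  # False
```

With a real collection, every insert would cost an extra existence query, which is why this approach degrades at the million-document scale.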
- You could also use the distinct statement to deduplicate, but the same problem remains: distinct is not suitable for millions of documents, and with enough data it even raises an error: `distinct too big, 16mb cap`. At this point you need the aggregate aggregation framework to remove the duplicates!
Implementation
```python
import pymongo
from tqdm import tqdm


class CleanData:
    def __init__(self):
        """
        Connect to the database
        """
        self.client = pymongo.MongoClient(host='host', port=27017)
        self.db = self.client.crawlab_master

    def clean(self):
        """
        Remove duplicate documents from the gkld_useful_data collection
        """
        # Query for duplicate documents
        if results := self.db.gkld_useful_data.aggregate([
            # Group by the given fields
            {
                '$group': {
                    '_id': {
                        'radar_range': '$radar_range',
                        'notice_tags': '$notice_tags',
                        'notice_title': '$notice_title',
                        'notice_content': '$notice_content',
                    },
                    # Count the occurrences
                    'count': {'$sum': 1}
                }
            },
            # Filter the groups, keeping only those with more than one document
            {
                '$match': {
                    'count': {'$gt': 1}
                }
            }
            # allowDiskUse=True: avoids exceeding the memory threshold
        ], allowDiskUse=True):
            for item in tqdm(list(results), ncols=100, desc='Deduplicating'):
                count = item['count']  # count is an int by default
                radar_range = item['_id']['radar_range']
                notice_tags = item['_id']['notice_tags']
                notice_title = item['_id']['notice_title']
                notice_content = item['_id']['notice_content']
                for i in range(1, count):
                    # Delete the duplicates, keeping a single document
                    self.db.gkld_useful_data.delete_one({
                        'radar_range': radar_range,
                        'notice_tags': notice_tags,
                        'notice_title': notice_title,
                        'notice_content': notice_content
                    })


if __name__ == '__main__':
    clean = CleanData()
    clean.clean()
```
Explanation
- `$group`: used to group the documents by the given fields.
- `'$sum': 1`: each time a document matches the grouping condition, the value of count is increased by 1.
- `count`: used to record the number of occurrences of each group.
- `$match`: `{'count': {'$gt': 1}}` keeps only groups that occur more than once (more than one occurrence means the data is duplicated).
- `allowDiskUse=True`: avoids the memory-threshold-exceeded exception.
- `tqdm`: a toolkit for generating progress bars.
  - `for item in tqdm(list(results))`: a percentage progress bar is displayed.
  - `for item in tqdm(results)`: no percentage progress bar is displayed.
  - When there is no duplicate data in the database, `list(results)` and `results` behave the same, and no percentage progress bar is shown.
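The `$group` / `$match` logic above can be mimicked in plain Python with `collections.Counter`, which makes it easy to sanity-check the idea without a database. The records below are made up; the tuple of feature fields plays the role of the `$group` `_id`:

```python
from collections import Counter

# Hypothetical records sharing the two feature fields used for grouping
records = [
    {'notice_title': 'A', 'notice_content': 'x'},
    {'notice_title': 'A', 'notice_content': 'x'},
    {'notice_title': 'B', 'notice_content': 'y'},
]
# Equivalent of the $group stage: count occurrences per field tuple
counts = Counter((r['notice_title'], r['notice_content']) for r in records)

# Equivalent of {'$match': {'count': {'$gt': 1}}}: keep the duplicates
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)  # {('A', 'x'): 2}
```

For each duplicated key the cleanup loop then deletes `n - 1` copies, leaving exactly one document per group.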
✨New plan✨
Introduction
- The solution above can indeed deduplicate millions of documents, but if the number of duplicates is relatively large (say 100,000+), the MongoDB instance has a low-spec configuration, and many fields are used to detect duplicates, then its efficiency will be relatively low. Of course, you can add multi-threading or multi-processing on top of it to improve its deduplication efficiency.
- The best solution, however, is to solve the duplication problem at insertion time. For an existing million-document database that has never been deduplicated, the scheme above is still the recommended cleanup. So how do we deduplicate quickly and reliably at insertion time?
- The method I take here is to concatenate the feature fields used for deduplication into a single string, encrypt the concatenated string with MD5, and use the resulting ciphertext to replace the `_id` field that MongoDB generates automatically. When duplicate data is inserted, a document with the same `_id` already exists in the database, so the insert raises an exception, and we can handle the duplicate by catching that exception (e.g. overwrite or ignore).
- The advantage of this is that it avoids querying the database with extra syntax to check for duplicates when inserting data: you just insert the data directly, without checking in advance whether a duplicate exists, and you only need to catch the exception raised on insertion and apply the corresponding deduplication strategy in the handler!
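The property this scheme relies on is that MD5 is deterministic: identical feature fields always concatenate and hash to the same `_id`, so inserting a duplicate necessarily collides. A minimal check with made-up field values (the `make_id` helper and the strings below are hypothetical, not from the project):

```python
from hashlib import md5

def make_id(*fields):
    # Concatenate the feature fields and hash them; keep the middle
    # 16 of the 32 hex characters, as the project in this article does.
    return md5(''.join(fields).encode('utf8')).hexdigest()[8:-8]

id1 = make_id('Beijing', 'notice title', '2021-01-01')
id2 = make_id('Beijing', 'notice title', '2021-01-01')  # same fields
id3 = make_id('Shanghai', 'notice title', '2021-01-01')
print(id1 == id2)  # True: duplicate records collide on _id
print(id1 == id3)  # False: different records get different ids
print(len(id1))    # 16
```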
Code example
- spider
```python
from hashlib import md5

from NewScheme.items import NewschemeItem

# Generate the _id field with MD5
encrypt = province_name + city_name + county_name + exam_type_name + info_type_name + notice_title + update_time
md = md5(f'{encrypt}'.encode('utf8'))
_id = md.hexdigest()[8:-8]

# Save the data
item = NewschemeItem()
item['_id'] = _id
item['province_name'] = province_name
item['city_name'] = city_name
item['county_name'] = county_name
item['county_show'] = county_show
item['exam_type_name'] = exam_type_name
item['info_type_name'] = info_type_name
item['notice_title'] = notice_title
item['update_time'] = update_time
item['notice_source'] = notice_source
item['job_info'] = job_info
item['job_people_num'] = int(job_people_num)
item['job_position_num'] = int(job_position_num)
item['job_start_time'] = job_start_time
item['job_end_time'] = job_end_time
item['notice_content'] = notice_content
item['attachment_info'] = attachment_info
yield item
```
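For reference, `hexdigest()` always returns 32 hex characters, so the `[8:-8]` slice keeps the middle 16 (the input string here is arbitrary):

```python
from hashlib import md5

digest = md5('anything'.encode('utf8')).hexdigest()
print(len(digest))        # 32: the full MD5 hex digest
print(len(digest[8:-8]))  # 16: the middle characters kept by the slice
```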
Here, because of my project's requirements, the `_id` field is 16 characters long; if a project has no special requirements, the default 32-character digest can be used.
- pipelines
```python
# Ignore strategy
try:
    self.collection.insert_one(dict(item))
except Exception as e:
    # Ignore the duplicate data
    print(f'{item.get("_id")} already exists - ignore')
else:
    print(item.get('notice_title'))

# Overwrite strategy
try:
    self.collection.insert_one(dict(item))
except Exception as e:
    # Overwrite the duplicate data
    _id = item.get('_id')
    print(f'{_id} already exists - cover')
    # Delete the old data
    self.collection.delete_one({'_id': _id})
    # Insert the new data
    self.collection.insert_one(dict(item))
else:
    print(item.get('notice_title'))
```
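Both strategies can be exercised without a MongoDB server by simulating the unique `_id` index with a dict; the `FakeCollection` class and its exception are stand-ins I made up for illustration (a real pipeline would catch pymongo's `DuplicateKeyError`):

```python
class DuplicateKeyError(Exception):
    """Stand-in for pymongo.errors.DuplicateKeyError."""

class FakeCollection:
    # A dict keyed by _id simulates the collection's unique primary key
    def __init__(self):
        self.docs = {}
    def insert_one(self, doc):
        if doc['_id'] in self.docs:
            raise DuplicateKeyError(doc['_id'])
        self.docs[doc['_id']] = doc
    def delete_one(self, query):
        self.docs.pop(query['_id'], None)

col = FakeCollection()
col.insert_one({'_id': 'abc', 'v': 1})

# Ignore strategy: the duplicate insert raises, the old document is kept
try:
    col.insert_one({'_id': 'abc', 'v': 2})
except DuplicateKeyError:
    pass
print(col.docs['abc']['v'])  # 1

# Overwrite strategy: delete the old document, then insert the new one
try:
    col.insert_one({'_id': 'abc', 'v': 2})
except DuplicateKeyError:
    col.delete_one({'_id': 'abc'})
    col.insert_one({'_id': 'abc', 'v': 2})
print(col.docs['abc']['v'])  # 2
```

Catching the specific `DuplicateKeyError` rather than a bare `Exception` is the safer choice in a real pipeline, so that network or validation errors are not silently treated as duplicates.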