MongoDB: PyMongo million-level data deduplication

Scenario description

  • In Python crawlers, MongoDB is often used to store the scraped results, which raises a question: how do you deduplicate millions of MongoDB documents?
  • A common idea is to check whether each record already exists in the database at insert time and, if it does, either ignore it (more efficient) or overwrite it. This works when the data volume is small, but once it reaches millions of records and above it becomes very inefficient. And what about an existing million-record database that was never deduplicated in the first place? (A minimal sketch of this check-before-insert approach follows this list.)
  • You could also use the distinct statement to deduplicate, but the same problem remains: distinct is not suited to millions of records, and with large data volumes it even raises an error: distinct too big, 16mb cap. At that point you need the aggregate aggregation framework to remove duplicates.
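
For reference, here is a minimal sketch of the check-before-insert idea mentioned above. It reuses the database and collection names from the code later in this article; the host placeholder and the two fields compared are illustrative assumptions. Each insert costs an extra query, which is why this does not scale to millions of records.

import pymongo

client = pymongo.MongoClient(host='host', port=27017)
collection = client.crawlab_master.gkld_useful_data


def insert_if_new(doc: dict) -> None:
    """Insert doc only if an identical record is not already stored."""
    # Fields used to decide whether a record is a duplicate (an example choice)
    query = {
        'notice_title': doc['notice_title'],
        'notice_content': doc['notice_content'],
    }
    if collection.find_one(query) is None:
        collection.insert_one(doc)  # new record: insert it
    # else: a duplicate already exists; ignore it
    # (or overwrite it with collection.replace_one(query, doc))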

Implementation

import pymongo
from tqdm import tqdm


class CleanData:
    def __init__(self):
        """
        Connect to the database.
        """
        self.client = pymongo.MongoClient(host='host', port=27017)
        self.db = self.client.crawlab_master

    def clean(self):
        """
        Remove duplicate documents from the gkld_useful_data collection.
        """
        # Query for duplicate documents
        if results := self.db.gkld_useful_data.aggregate([
            # Group by the chosen feature fields
            {'$group': {
                '_id': {
                    'radar_range': '$radar_range',
                    'notice_tags': '$notice_tags',
                    'notice_title': '$notice_title',
                    'notice_content': '$notice_content',
                },
                # Count how many times each combination of values occurs
                'count': {'$sum': 1}
            }},
            # Filter the groups, keeping only those that occur more than once
            {'$match': {'count': {'$gt': 1}}}
            # allowDiskUse=True: avoid the exception raised when the aggregation exceeds its memory limit
        ], allowDiskUse=True):
            for item in tqdm(list(results), ncols=100, desc='Dedup progress'):
                count = item['count']  # count is an integer (int) by default
                radar_range = item['_id']['radar_range']
                notice_tags = item['_id']['notice_tags']
                notice_title = item['_id']['notice_title']
                notice_content = item['_id']['notice_content']
                for i in range(1, count):
                    # Keep one document and delete the remaining duplicates
                    self.db.gkld_useful_data.delete_one({
                        'radar_range': radar_range,
                        'notice_tags': notice_tags,
                        'notice_title': notice_title,
                        'notice_content': notice_content
                    })


if __name__ == '__main__':
    clean = CleanData()
    clean.clean()

Explanation

  • $group: groups documents by the given fields
  • '$sum': 1: each document that falls into a group increases that group's count by 1
  • count: records how many times each combination of field values occurs
  • $match: {'count': {'$gt': 1}}: keeps only groups that occur more than once (more than one occurrence means the data is duplicated)
  • allowDiskUse=True: avoids the exception raised when the aggregation exceeds its memory limit
  • tqdm: a toolkit for generating progress bars, refer to the link
  • for item in tqdm(list(results)): a percentage progress bar is displayed
  • for item in tqdm(results): no percentage progress bar is displayed, because the cursor has no known length
  • When there is no duplicate data in the database, list(results) and results behave the same, and no percentage progress bar is shown
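
When the number of duplicate groups is large, deleting documents one at a time in a Python loop is slow. A possible variant (a sketch, not the original code from this article) collects each group's _id values with $push inside the pipeline and removes the surplus ones in bulk with delete_many; only two grouping fields are shown here for brevity.

import pymongo

client = pymongo.MongoClient(host='host', port=27017)
collection = client.crawlab_master.gkld_useful_data

pipeline = [
    {'$group': {
        '_id': {
            'notice_title': '$notice_title',
            'notice_content': '$notice_content',
        },
        'ids': {'$push': '$_id'},  # collect the _id of every document in the group
        'count': {'$sum': 1}
    }},
    {'$match': {'count': {'$gt': 1}}}
]

for group in collection.aggregate(pipeline, allowDiskUse=True):
    surplus_ids = group['ids'][1:]  # keep the first document, delete the rest
    collection.delete_many({'_id': {'$in': surplus_ids}})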

✨New plan✨

Introduction

  • The solution above does solve deduplication for millions of records, but if the number of duplicates is large (for example 100,000+), the MongoDB instance has a modest configuration, and many fields are used to detect duplicates, deduplicating this way is still fairly slow. You can of course add multi-threading or multi-processing on top of it to improve its efficiency.
  • The best solution, however, is to prevent duplicates at insert time. (For an existing million-record database that has never been deduplicated, the solution above is still the recommended cleanup.) So how do you deduplicate quickly and reliably at insert time?
  • The approach taken here is to concatenate the feature fields used for deduplication into a single string, hash the concatenated string with MD5, and use the resulting digest to replace the _id field that MongoDB would otherwise generate automatically. When a duplicate record is inserted, a document with the same _id already exists in the database, so the insert raises an exception, and we can handle the duplicate by catching that exception (for example, overwrite or ignore it).
  • The advantage is that you no longer need to query the database (for example with find) to check whether the data is duplicated before inserting it. You simply insert the data directly and catch the exception raised on insert, applying the appropriate duplicate-handling strategy inside the exception handler. (A minimal plain-PyMongo sketch of this idea follows this list; the Scrapy code from the project comes after it.)
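
As a minimal plain-PyMongo sketch of this idea (the fields concatenated into the digest are an illustrative assumption; the host is the same placeholder used earlier):

from hashlib import md5

import pymongo
from pymongo.errors import DuplicateKeyError

client = pymongo.MongoClient(host='host', port=27017)
collection = client.crawlab_master.gkld_useful_data


def save(doc: dict) -> None:
    # Use the MD5 digest of the concatenated feature fields as the _id
    encrypt = doc['notice_title'] + doc['notice_content']
    doc['_id'] = md5(encrypt.encode('utf8')).hexdigest()
    try:
        collection.insert_one(doc)
    except DuplicateKeyError:
        # A document with the same _id already exists: ignore it (or overwrite it)
        print(f'{doc["_id"]} already exists - ignore')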

Code examples

  • spider
    from hashlib import md5
    
    from NewScheme.items import NewschemeItem
    
    
    # The feature fields below (province_name, notice_title, ...) are assumed to have
    # been extracted from the response earlier in the parse method.
    # Use MD5 on the concatenated feature fields to generate the _id field
    encrypt = province_name + city_name + county_name + exam_type_name + info_type_name + notice_title + update_time
    md = md5(f'{encrypt}'.encode('utf8'))
    _id = md.hexdigest()[8:-8]
    # Save the data
    item = NewschemeItem()
    item['_id'] = _id
    item['province_name'] = province_name
    item['city_name'] = city_name
    item['county_name'] = county_name
    item['county_show'] = county_show
    item['exam_type_name'] = exam_type_name
    item['info_type_name'] = info_type_name
    item['notice_title'] = notice_title
    item['update_time'] = update_time
    item['notice_source'] = notice_source
    item['job_info'] = job_info
    item['job_people_num'] = int(job_people_num)
    item['job_position_num'] = int(job_position_num)
    item['job_start_time'] = job_start_time
    item['job_end_time'] = job_end_time
    item['notice_content'] = notice_content
    item['attachment_info'] = attachment_info
    yield item
    

    Here, because of the requirements of my project, the _id field is truncated to 16 characters. If the project has no special requirements, the full 32-character digest can be used.
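
    For illustration (the input string here is an arbitrary assumption), md5(...).hexdigest() returns 32 hexadecimal characters, and the [8:-8] slice keeps the middle 16:

    from hashlib import md5

    digest = md5('example'.encode('utf8')).hexdigest()
    print(len(digest))        # 32
    print(len(digest[8:-8]))  # 16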

  • pipelines
    # Ignore strategy
    try:
        self.collection.insert_one(dict(item))
    except Exception as e:
        # Duplicate data (insert_one raises DuplicateKeyError when the _id already exists): ignore it
        print(f'{item.get("_id")} already exists - ignore')
    else:
        print(item.get('notice_title'))
    
    # Overwrite strategy
    try:
        self.collection.insert_one(dict(item))
    except Exception as e:
        # Duplicate data: overwrite it
        _id = item.get("_id")
        print(f'{_id} already exists - cover')
        # Delete the old document
        self.collection.delete_one({'_id': _id})
        # Insert the new document
        self.collection.insert_one(dict(item))
    else:
        print(item.get('notice_title'))
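
    As a side note on the overwrite strategy: the delete-then-insert pair above can also be written as a single upsert (a sketch assuming the same self.collection and item as the pipeline code):

    # Overwrite in one call: replace the existing document,
    # or insert it if no document with this _id exists yet
    self.collection.replace_one({'_id': item['_id']}, dict(item), upsert=True)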
    

Origin: blog.csdn.net/qq_34562959/article/details/121417186