MongoDB data deduplication

There are three methods, each suited to a different situation.

Method One

The database is brand new and contains no data yet. Deduplication here means that, at insert time, you check whether the data you are about to insert already exists in the database. If it exists, you can either ignore this insert or overwrite the existing record; if it does not exist, you insert it.

Principle

The value of MongoDB's _id field is unique (similar to MySQL's primary key), and if not manually assigned, it will be automatically generated during insertion into the database.

When MongoDB inserts data, it automatically uses the value of _id to decide whether the record is a duplicate, i.e. whether the database already contains a document whose _id equals the _id of the data being inserted. If a duplicate is found, the insert raises DuplicateKeyError.
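A minimal sketch of this behaviour, assuming a locally reachable MongoDB instance and a throwaway test collection (the host and collection names below are placeholders):

import pymongo
from pymongo.errors import DuplicateKeyError

# Placeholder connection details; adjust to your environment.
client = pymongo.MongoClient(host='localhost', port=27017)
collection = client.test_database.test_collection

collection.insert_one({'_id': 'movie-001', 'name': 'The Shawshank Redemption'})
try:
    # Same _id again: MongoDB rejects the insert instead of storing a duplicate.
    collection.insert_one({'_id': 'movie-001', 'name': 'The Shawshank Redemption'})
except DuplicateKeyError:
    print('duplicate _id, insert ignored')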

Take crawling movie information as an example. Assume that the md5 generated from name, categories and score is unique, i.e. no other movie has the same name, categories and score as the current one at the same time (choose suitable fields for your own data). The md5 generated this way can therefore be used as the value of _id, which gives you deduplication at insert time.
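A minimal sketch of building such an _id (the field names name, categories and score follow the example above; adapt them to your own data):

import hashlib


def make_id(item: dict) -> str:
    # Concatenate the fields that together identify a unique movie,
    # then use the md5 hex digest of that string as the document's _id.
    temp_string = item['name'] + str(item['categories']) + str(item['score'])
    return hashlib.md5(temp_string.encode('utf-8')).hexdigest()


item = {'name': 'Titanic', 'categories': ['Drama', 'Romance'], 'score': 9.4}
item['_id'] = make_id(item)
print(item['_id'])  # a 32-character hex string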

If the data returned by the interface already carries its own id (or there is an id in the URL, e.g. a CSDN article link contains the id of the current article, the string of digits after /article/details/), then since that id is unique it can also be used directly as _id. When using such an id, however, duplicates are best overwritten rather than ignored: the same article always has the same id, but if the article content has been updated, the new data should replace the old record on the next crawl instead of being skipped.
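A hypothetical sketch of that idea, pulling the trailing number out of a CSDN-style /article/details/ link (the URL and field names are just examples):

# Hypothetical example: use the numeric id at the end of the article URL as _id,
# so re-crawling the same article overwrites the old copy instead of duplicating it.
url = 'https://blog.csdn.net/some_author/article/details/124783781'
article_id = url.rstrip('/').rsplit('/', 1)[-1]

doc = {'_id': article_id, 'url': url, 'title': '...', 'content': '...'}
print(article_id)
# Overwrite rather than ignore, so updated article content replaces the old record:
# collection.update_one({'_id': doc['_id']}, {'$set': doc}, upsert=True)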

In addition, if the _id was generated by MongoDB itself, later queries on _id need from bson.objectid import ObjectId, and the query condition is written {'_id': ObjectId('6280b3f24f15c0da689726a7')}; if the _id was assigned manually with an md5, the query condition is written {'_id': '7c97b08cde07182297fc5fc51435a498'}. In other words, an _id generated automatically by MongoDB can only be queried with ObjectId(), while a manually assigned _id is queried with its plain value.

Sample code (Python 3.8+):

import pymongo
import os
from bson.objectid import ObjectId


def start():
    # Connection details are read from environment variables; adjust them to your setup.
    connection = pymongo.MongoClient(host=os.getenv('SPIDER_TEST_MongoDB_HOST'), port=27017, username=os.getenv("SPIDER_TEST_MongoDB_USER"), password=os.environ.get("SPIDER_TEST_MongoDB_PASSWORD"))
    database = connection.movie
    collection = database.movie_collection

    return connection, collection


def test(collection):
    # An _id generated by MongoDB itself must be queried with ObjectId().
    if result1 := collection.find_one_and_delete({'_id': ObjectId('6280b3f24f15c0da689726a7')}):
        print(result1)
    # A manually assigned _id (here an md5 string) is queried with its plain value.
    if result2 := collection.find_one_and_delete({'_id': '7c97b08cde07182297fc5fc51435a498'}):
        print(result2)


def end(connection):
    connection.close()


if __name__=='__main__':
    connection, collection = start()
    test(collection)
    end(connection)

The code

Take Scrapy as an example; the following goes in pipelines.py.

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import os
import pymongo
import hashlib
from pymongo.errors import DuplicateKeyError

colorful_str_start = '\033[1;37;41m' # colored console output; the RainbowPrint library also works, from https://www.cnblogs.com/easypython/p/9084426.html and https://www.cnblogs.com/huchong/p/7516712.html and https://zhuanlan.zhihu.com/p/136173259
colorful_str_end = '\033[0m'


class MongoDBPipeline:
    def open_spider(self, spider):
        self.connection = pymongo.MongoClient(host=os.getenv('SPIDER_TEST_MongoDB_HOST'), port=27017, username=os.getenv("SPIDER_TEST_MongoDB_USER"), password=os.environ.get("SPIDER_TEST_MongoDB_PASSWORD"))
        database = self.connection.movie
        self.collection = database.movie_collection

    def process_item(self, item, spider):
        item = dict(item)
        temp_string = item['name'] + str(item['categories']) + str(item['score']) # Some fields in my items are lists; they are cast to str here just so the strings can be concatenated, for illustration only.
        item['_id'] = hashlib.md5(temp_string.encode('utf-8')).hexdigest() # The value of MongoDB's _id field is unique (similar to a MySQL primary key); if not assigned manually it is generated automatically on insert. On insert MongoDB uses _id to detect duplicates, i.e. whether a document with the same _id already exists. Here we assume the md5 of name + categories + score is unique (pick suitable fields for your own data), so it can serve as _id and deduplicate at insert time. If the interface returns its own id (or the URL contains one, like the digits after /article/details/ in a CSDN link), that id can be used as _id directly, but duplicates should then be overwritten rather than ignored, because the same article keeps the same id while its content may have been updated.
        # For an _id generated by MongoDB itself, later queries need from bson.objectid import ObjectId and a condition like {'_id': ObjectId('6280b3f24f15c0da689726a7')}; for an _id assigned manually via md5, the condition is {'_id': '7c97b08cde07182297fc5fc51435a498'}. An auto-generated _id can only be queried with ObjectId(); a manually assigned _id is queried with its plain value.

        try:
            self.collection.insert_one(item)
        except DuplicateKeyError: # duplicate data can be either ignored or overwritten
            # Ignore the duplicate data
            print(f'_id {item["_id"]}, name {item["name"]}: this record already exists in the database, so this insert was {colorful_str_start}ignored{colorful_str_end}') # print the _id and name of the current item

            '''
            # Overwrite the duplicate data
            print(f'_id {item["_id"]}, name {item["name"]}: this record already exists in the database, deleting the old record')
            self.collection.delete_one({'_id': item['_id']}) # delete the old data
            self.collection.insert_one(item) # insert the new data
            print(f'_id {item["_id"]}, name {item["name"]}: the old record was deleted and the new data was {colorful_str_start}inserted (overwritten){colorful_str_end} into the database')
            '''
        else:
            return item

    def close_spider(self, spider):
        self.connection.close()

Besides the approach above, there is another way to write this. The difference is that with this approach duplicate data cannot be ignored on insert, only overwritten. Refer to link 1 and link 2.

    def process_item(self, item, spider):
        item = dict(item)
        temp_string = item['name'] + str(item['categories']) + str(item['score'])
        item['_id'] = hashlib.md5(temp_string.encode('utf-8')).hexdigest()
        self.collection.update_one({'_id': item['_id']}, {'$set': item}, upsert=True)

def save_data(data):
    collection.update_one({
        'name': data.get('name')
    }, {
        '$set': data
    }, upsert=True)

Here we declare a save_data method that receives a data parameter, the movie details just extracted. Inside it we call update_one. The first argument is the query condition, here a query by name; the second argument is the data itself wrapped in the $set operator, which marks this as an update operation; the third argument, upsert=True, is the important one: if a matching document exists it is updated, otherwise the data is inserted. The update is keyed on the name field from the first argument, which prevents movies with the same name from appearing twice in the database.
Note: in reality different movies can share the same name, but the data crawled in this scenario happens not to; the point here is simply to demonstrate MongoDB deduplication.
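If you also want to know whether the upsert inserted a new document or updated an existing one, update_one returns an UpdateResult whose attributes expose that. A small sketch building on the save_data example above:

def save_data(collection, data):
    # upsert=True: update the document matching this name, or insert it if absent.
    result = collection.update_one({'name': data.get('name')}, {'$set': data}, upsert=True)
    if result.upserted_id is not None:
        print(f'inserted new document, _id={result.upserted_id}')
    elif result.modified_count:
        print('existing document was updated (overwritten)')
    else:
        print('existing document already matched, nothing changed')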

Method Two

There is already some data in the database, and it is not certain whether it contains duplicates. If it does, the duplicates need to be removed first, and then new data can be inserted.

You may think of MySQL's and MongoDB's distinct, but pymongo's distinct only returns all the different values of a given field. If there are 3 documents whose name fields are Zhang San, Li Si and Zhang San, then collection.distinct('name') returns ['Zhang San', 'Li Si']. It returns the deduplicated values directly (and can only return the values of one field; I do not know whether it can return the values of all fields, i.e. return whole documents), but the duplicate documents originally in the database are still there. Refer to link 3.
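For example, a small sketch reusing the start() helper from the Method One sample code, with the three documents described above already in the collection:

connection, collection = start()

# Assuming the collection holds three documents whose name fields are
# 'Zhang San', 'Li Si' and 'Zhang San' again:
print(collection.distinct('name'))       # ['Zhang San', 'Li Si'] -- unique values only
print(collection.count_documents({}))    # still 3: the duplicate document is untouched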

In addition, when the amount of data is large, distinct raises an error: distinct too big, 16mb cap. Refer to link 4.

So aggregate is more appropriate; see pymongo document 1, pymongo document 2 and the MongoDB document.

Principle

Aggregate first groups by specific fields. It can be a single field or several fields treated as a whole (with several fields, all of them must match at the same time, i.e. an "and": field 1 and field 2 and field 3). As long as that whole is unique it behaves like a primary key: for example, if the three fields are 'name'='Xiaoming', 'age'=10 and 'student_id'=123, then in theory only one student satisfies all three at once. Which fields to choose is the same question as which fields to feed into the md5 in Method One: together they must identify a unique record.

It then counts how many times each group (the fields still treated as a whole) occurs in the database, i.e. how many times each piece of data appears. If the count is greater than 1, the document the group belongs to (a MongoDB document corresponds to a row in MySQL) exists more than once in the database.

Finally it returns these duplicates in an iterable object. The iterable is returned whether or not duplicates were found; when nothing is found, iterating over it later to delete duplicates simply does nothing (like iterating over an empty list).
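Under these assumptions, each element of that iterable is a small document holding the grouped fields and the count. Illustrative values only (the full pipeline is in the code below):

# If 'Titanic' with the same categories and score were stored twice, the $group + $match
# pipeline would yield a document shaped like this for it:
duplicate = {
    '_id': {'name': 'Titanic', 'categories': ['Drama', 'Romance'], 'score': 9.4},
    'count': 2,
}
# Keeping one copy means deleting count - 1 of the matching documents.
print(duplicate['count'] - 1)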

For the principle of this part, refer to link 5 and link 6. It is recommended to read those two links: they contain sample data and flowcharts of statement execution, as well as the SQL statements corresponding to the pymongo syntax, which makes them easy to understand. If you want to use their sample data, you need to convert it to JSON manually and then import it into MongoDB, in the form [{"":""},{"":""},{"":""}].

After getting all the duplicate data, just iterate over it and delete records according to the conditions you specify. Note that the iteration should start from the second record (index 1), because the first record is kept and only the duplicates after it are deleted.

The code

The comments are more detailed than the Principle section above, but they say essentially the same thing.

import pymongo
import os
from tqdm import tqdm


def start():
    connection = pymongo.MongoClient(host=os.getenv('SPIDER_TEST_MongoDB_HOST'), port=27017, username=os.getenv("SPIDER_TEST_MongoDB_USER"), password=os.environ.get("SPIDER_TEST_MongoDB_PASSWORD"))
    database = connection.movie
    collection = database.movie_collection

    return connection, collection


def test(collection):
    data = collection.aggregate([ # Returns the grouped ($group) and filtered ($match) data in an iterable object. The iterable is returned whether or not duplicates are found; when nothing is found, the for loop below simply has nothing to do (like iterating over an empty list).
            {
                '$group': # Group by the given fields (the value of '_id'); with several fields, all of them must match at once, i.e. field 1 and field 2 and field 3
                    {
                        '_id': # '_id' is probably a fixed keyword here, from https://www.mongodb.com/docs/manual/reference/operator/aggregation/group/
                            {
                                'name': '$name', # before the colon is a name you choose; after the colon is the corresponding field value in the database
                                'categories': '$categories',
                                'score': '$score'
                            },
                        'count': # counts how many times data matching the grouping condition occurs; the name is your own choice
                            {'$sum': 1} # each document that matches the grouping condition adds 1 to count; with {'$sum': 2} it would add 2 each time. from https://blog.csdn.net/jinyangbest/article/details/123225648 and https://www.cnblogs.com/deepalley/p/12022381.html and https://www.it1352.com/1636882.html and https://www.jb51.net/article/168337.htm and https://stackoverflow.com/questions/17044587/how-to-aggregate-sum-in-mongodb-to-get-a-total-count and https://stackoverflow.com/questions/40791907/what-does-sum1-mean-in-mongo and https://www.mongodb.com/docs/manual/reference/operator/aggregation/sum/ and https://www.mongodb.com/docs/v4.0/reference/operator/aggregation/sum/
                    }
            },
            { # keep only the groups whose count (the occurrence count defined above) is greater than 1; a count above 1 means the data is duplicated in the database
                '$match':
                    {
                        'count': {'$gt': 1}
                    }
            }
    ], allowDiskUse=True) # avoids exceeding the memory threshold for the aggregation
    # print(type(data)) # pymongo.command_cursor.CommandCursor

    for item in tqdm(iterable=list(data), ncols=100, desc='dedup progress', colour='green'): # data itself is iterable, but without list() the tqdm output differs when duplicates exist: list(data) shows a percentage, progress bar and colour, while the raw cursor only shows the number of deduplicated groups. If there are no duplicates, neither form shows a bar and the count stays 0.
        count = item['count'] # already an int, so no int() cast is needed before using it in range() below
        name = item['_id']['name']
        categories = item['_id']['categories']
        score = item['_id']['score']
        for _ in range(1, count): # keep only the first record and delete the later duplicates; the second record has index 1
            collection.delete_one({ # delete a document whose name, categories and score fields all match the conditions below
                'name': name, # before the colon is the field name in the database; after the colon is the value obtained from the iterable returned by aggregate above
                'categories': categories,
                'score': score
            })


def end(connection):
    connection.close()


if __name__=='__main__':
    connection, collection = start()
    test(collection)
    end(connection)
Screenshot of the tqdm progress bar in the code above: the first run uses list(data), the second passes data directly.

Method Three

The applicable conditions are the same as method one.

Crawlab has built-in deduplication; see its documentation on crawler result deduplication.

Note: since Crawlab needs to store results in the corresponding MongoDB collection, before deduplication you must specify the "result set" in the crawler, i.e. the corresponding table (collection) name.

Overwrite deduplication

Overwrite deduplication, as the name implies, overwrites the old data to keep each record unique, which achieves deduplication.

The specific principle and steps are as follows (a minimal pymongo sketch of the same logic follows the list):

  1. Find the old data corresponding to the new data according to the "Deduplication" field, and delete the old data;
  2. Insert new data into the "result set".
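This is not Crawlab's internal implementation, just a sketch of the same overwrite logic in pymongo, assuming 'name' is the deduplication field and result_collection is the result set:

def overwrite_dedup_insert(result_collection, new_data, dedup_field='name'):
    # 1. Delete any old document whose deduplication field matches the new data.
    result_collection.delete_many({dedup_field: new_data[dedup_field]})
    # 2. Insert the new data into the result set.
    result_collection.insert_one(new_data)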

Ignore deduplication

Ignore deduplication is simpler than overwrite deduplication. The principle is as follows (again sketched in pymongo after the list):

  1. Find the old data corresponding to the new data according to the "deduplication" field;
  2. If there is old data, it is ignored and not inserted;
  3. Insert if no old data exists.
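A corresponding sketch of the ignore logic, under the same assumptions as above:

def ignore_dedup_insert(result_collection, new_data, dedup_field='name'):
    # 1. Look for old data whose deduplication field matches the new data.
    # 2./3. Insert only when no old data exists; otherwise ignore the new data.
    if result_collection.find_one({dedup_field: new_data[dedup_field]}) is None:
        result_collection.insert_one(new_data)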

Deduplication field

The "deduplication field" is effectively the primary key of the result set (even though the primary key in MongoDB is always _id): multiple records with the same value of this field are not allowed. If Crawlab's deduplication logic is turned on, a unique index is created on the deduplication field in the result set, which guarantees the uniqueness of the data and keeps lookups fast.

If you run a Scrapy crawler in Crawlab, you can read that article first and then combine it with Method Three of this article to achieve deduplication. But before running, set the deduplication options as described below.

Note that when the deduplication rule is set to ignore, running the crawler reports an error: pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection:. This error can be ignored; the likely reason is that Crawlab does not catch and handle the error when ignoring duplicates, or handles it by re-raising it. You can check on the task page whether the number of crawler results is correct: for example, if 100 items were crawled in total and 90 were already in the database, the number of results should be 10, and the database should then contain 100 items. When the deduplication rule is set to overwrite, this problem does not occur.

Reference links

MongoDB: PyMongo million-level data deduplication (Su Yin's blog, CSDN)

Detailed Explanation of the Deduplication Processing Method in Crawlers

Practical MongoDB Aggregate (Zhihu)

"2022" Cui Qingcai Python3 Crawler Tutorial: Efficient and Practical MongoDB Document Storage; MongoDB Basic Use

Query and Projection Operators (MongoDB Manual): some operators in MongoDB

Origin blog.csdn.net/fj_changing/article/details/124783781