Cleaning up duplicate MongoDB data

Description

When the index was originally created, the unique option was not set, which allowed duplicate data into the database. I don't want to drop the collection and pull everything again, so I tried cleaning up the existing data; the approach may also come in handy later.

Reference

Links: https://blog.csdn.net/cloume/article/details/74931998

After consulting that post, I decided to locate the duplicates with a script and then remove them manually on the command line.

Cleanup Script

	import pymongo
	import logging
	import time
	
	import pandas as pd
	
	from collections import defaultdict
	from all_codes import all  # the full list of stock codes
	
	
	t1 = time.time()
	
	# dict whose values default to an empty list
	error_codes = defaultdict(list)
	
	error_date = set()
	
	MongoUri = "mongodb://127.0.0.1:27017"
	db = pymongo.MongoClient(MongoUri)
	
	codes = all
	
	logger = logging.getLogger(__name__)
	
	logging.basicConfig(level=logging.DEBUG,
	                    filename='clean.log',
	                    datefmt='%Y/%m/%d %H:%M:%S',
	                    format='%(asctime)s - %(name)s - %(levelname)s - %(lineno)d - %(module)s - %(message)s')
	
	
	def convert(code):
	    # Shenzhen codes start with 0 or 3, Shanghai codes with 6
	    if code[0] == "0" or code[0] == "3":
	        return "SZ" + code
	    elif code[0] == "6":
	        return "SH" + code
	    else:
	        return None
	
	
	for c in codes:
	
	    code = convert(c)
	
	    if code is None:
	        continue
	
	    logger.info(code)
	
	    # all distinct date_int values for this code
	    dis_date_ints = db.stock.calendar.find({"code": code}).distinct("date_int")
	
	    cursor = db.stock.calendar.find({"code": code}, {"date_int": 1, "_id": 0})
	
	    all_date_ints = [r.get("date_int") for r in cursor]
	
	    # if the distinct values match the full list, nothing to clean up
	    if sorted(dis_date_ints) == sorted(all_date_ints):
	        logger.info("ok")
	    else:
	        logger.info("no")
	
	        df = pd.DataFrame({"date": all_date_ints})
	
	        # duplicated() marks every occurrence after the first
	        dup = df[df.duplicated()]
	
	        to_delete = dup["date"].tolist()
	
	        logger.info(to_delete)
	
	        error_codes[code].extend(to_delete)
	
	        error_date = error_date | set(to_delete)
	
	t2 = time.time()
	
	logger.info(t2 - t1)
	logger.info(error_codes)
	logger.info(error_date)

Review the log and process it manually

The main reason is that this feels more controllable: nothing is deleted directly by the script.
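The duplicate detection in the script (comparing the distinct `date_int` values against all of them, then filtering with `df.duplicated()`) can also be written as a small pure-Python helper; a minimal sketch, not part of the original script:

```python
from collections import Counter

def find_duplicates(date_ints):
    """Return the distinct date_int values that occur more than once."""
    counts = Counter(date_ints)
    return sorted(d for d, n in counts.items() if n > 1)

# A value stored twice shows up once in the result.
print(find_duplicates([20180706, 20180708, 20180708, 20180709]))  # prints [20180708]
```

Note the slight difference from the pandas approach: `df.duplicated()` flags every occurrence after the first (so a triplicated value appears twice in `to_delete`), while this helper reports each duplicated value exactly once.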


For example, the log above shows that stock SH600136 has duplicate data for 20180708.

Count how many documents exist for a given date, excluding SH000001, before cleaning up:

db.calendar.find({"date_int": 20190527, "code": {$ne: "SH000001"}}).count()

Delete a single duplicate document (applicable when exactly two copies were detected):

db.calendar.remove({"code": "SH600136", "date_int":20180708}, {"justOne": true})
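Removing duplicates one shell command at a time gets tedious when many pairs are affected. Below is a hedged pymongo sketch that keeps the first document of each duplicated `(code, date_int)` pair and deletes the rest; the `dedupe` and `extra_ids` helpers are my own illustration, not part of the original post, and the live-server usage (which assumes the post's `stock.calendar` collection) is shown commented out:

```python
def extra_ids(doc_ids):
    """Given all _ids for one (code, date_int) pair, keep the first
    and return the rest for deletion."""
    return doc_ids[1:]

def dedupe(coll, code, date_int):
    """Delete all but one document matching (code, date_int).
    Returns the number of documents removed."""
    ids = [d["_id"] for d in coll.find({"code": code, "date_int": date_int},
                                       {"_id": 1})]
    doomed = extra_ids(ids)
    if doomed:
        coll.delete_many({"_id": {"$in": doomed}})
    return len(doomed)

# Usage against a live server (assumes the post's connection string):
#   import pymongo
#   client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
#   dedupe(client.stock.calendar, "SH600136", 20180708)
```

`delete_many` with an `$in` filter on `_id` is the driver-side equivalent of running the shell's `remove(..., {"justOne": true})` repeatedly.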

After the cleanup, try to create the compound unique index:

db.justtest.ensureIndex({"code": 1, "date_int": 1}, {unique: true})

If an index with the same name already exists with different options (or the collection still contains duplicate values when the unique index is created), it fails with an error like:

"errmsg" : "{ rs1/Lianghua_HW_GZ_124:27101,Lianghua_HW_GZ_152:27101,Lianghua_HW_GZ_163:27101:
\"Index with name: code_1_date_int_1 already exists with different options\" }"

Updated: 2019-06-01
A new index with the same name as an existing one cannot be created until the old index is dropped; attempting it produces the same "Index with name: code_1_date_int_1 already exists with different options" error shown above.

Command to delete the old index:

db.calendar.dropIndex("code_1_date_int_1")
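From Python, the same drop-and-recreate can be done with pymongo's `create_index` (the driver-side counterpart of the shell's `ensureIndex`). The `index_name` helper below just reproduces MongoDB's default index-naming convention, which is where the `code_1_date_int_1` name in the error message comes from; the live-server part (which assumes the post's database layout) is shown commented out:

```python
def index_name(keys):
    """MongoDB's default index name: each field and its direction,
    joined by underscores."""
    return "_".join(f"{field}_{direction}" for field, direction in keys)

keys = [("code", 1), ("date_int", 1)]
print(index_name(keys))  # prints code_1_date_int_1

# Against a live server:
#   import pymongo
#   coll = pymongo.MongoClient("mongodb://127.0.0.1:27017").stock.calendar
#   coll.drop_index(index_name(keys))     # drop the old same-name index
#   coll.create_index(keys, unique=True)  # recreate it as unique
```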

Updated: 2019-06-19
View a collection's indexes from the mongo shell:

db.collection.getIndexes()

Create a unique compound index on date and index:

db.generate_indexcomponentsweight.ensureIndex({"date": 1, "index": 1}, {unique: true})

Then run getIndexes() again to confirm the new index is in place.


Origin blog.csdn.net/Enjolras_fuu/article/details/90704087