Deleting MongoDB GridFS files in batches with pymongo

1. Background:

GridFS as image server storage

GridFS is a specification for storing and retrieving files that exceed the BSON document size limit (16 MB) in MongoDB. Instead of storing a file in a single document, GridFS divides it into chunks and stores each chunk as a separate document. By default each chunk is 255 kB, so a file is stored as a series of 255 kB chunks, except for the last chunk, which is only as large as the remaining data.
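
To make the chunk arithmetic concrete, here is a minimal sketch in plain Python. The function name `chunk_layout` is illustrative and not part of pymongo, and the sketch assumes the default 255 kB chunk size described above.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # assumed default GridFS chunk size, in bytes


def chunk_layout(file_size, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (chunk_count, last_chunk_size) for a file of file_size bytes."""
    if file_size <= 0:
        return (0, 0)  # this sketch treats an empty file as having no chunks
    chunk_count = -(-file_size // chunk_size)  # ceiling division
    last_chunk_size = file_size - (chunk_count - 1) * chunk_size
    return (chunk_count, last_chunk_size)


if __name__ == "__main__":
    print(chunk_layout(1024 * 1024))
```

Under these assumptions a 1 MiB file is stored as five chunks, the last of which holds 4096 bytes.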

GridFS uses two collections to store data: one holds the file chunks (fs.chunks) and the other holds the file metadata (fs.files). Each chunk document in fs.chunks holds up to 255 kB of data by default (256 kB in older MongoDB releases). Mongo's replica set and sharding architecture give GridFS efficient read and write scaling and high availability, and it performs well for storing pictures, videos, and other large files. [Small files (< 16 MB) can be stored as BSON binary data; large files are stored in GridFS.]


The company's image server is implemented with GridFS on a mongo replica set and stores jpg images, mp3 audio, video, and sitemap files. Recently, batches of illegal images were found on the image server and need to be cleaned up centrally, which means deleting files in batches.

2. Current issues:

Managing GridFS with mongofiles:

For working with the mongo file system GridFS, mongo provides the mongofiles command for management.

[root@bj-test-wlj-2-132 twj]# mongofiles --help
Browse and modify a GridFS filesystem.

usage: mongofiles [options] command [gridfs filename]
command:
  one of (list|search|put|get)
  list - list all files.  'gridfs filename' is an optional prefix 
         which listed filenames must begin with.
  search - search all files. 'gridfs filename' is a substring 
           which listed filenames must contain.
  put - add a file with filename 'gridfs filename'
  get - get a file with filename 'gridfs filename'
  delete - delete all files with filename 'gridfs filename'
options:
  --help                                produce help message
  -v [ --verbose ]                      be more verbose (include multiple times
                                        for more verbosity e.g. -vvvvv)
  --version                             print the program's version and exit
  -h [ --host ] arg                     mongo host to connect to ( <set 
                                        name>/s1,s2 for sets)
  --port arg                            server port. Can also use --host 
                                        hostname:port
  --ipv6                                enable IPv6 support (disabled by 
                                        default)
  -u [ --username ] arg                 username
  -p [ --password ] arg                 password
  --authenticationDatabase arg          user source (defaults to dbname)
  --authenticationMechanism arg (=MONGODB-CR)
                                        authentication mechanism
  --dbpath arg                          directly access mongod database files 
                                        in the given path, instead of 
                                        connecting to a mongod  server - needs 
                                        to lock the data directory, so cannot 
                                        be used if a mongod is currently 
                                        accessing the same path
  --directoryperdb                      each db is in a separate directly 
                                        (relevant only if dbpath specified)
  --journal                             enable journaling (relevant only if 
                                        dbpath specified)
  -d [ --db ] arg                       database to use
  -c [ --collection ] arg               collection to use (some commands)
  -l [ --local ] arg                    local filename for put|get (default is 
                                        to use the same name as 'gridfs 
                                        filename')
  -t [ --type ] arg                     MIME type for put (default is to omit)
  -r [ --replace ]                      Remove other files with same name after
                                        PUT

Of these, the delete operation is:

mongofiles delete <filename> --port <port> --db <database name>

mongofiles delete xxx.jpg --port 30000 --db pics

Batch delete file case:

Because the number of files to delete is huge, possibly 500,000 to 600,000 in one batch, and every mongofiles delete has to establish a new TCP connection, the file names were first written to a list file, the list was split into smaller files, and the deletes were run in a loop with sleep calls to avoid exhausting TCP ports:

for file in 31aa*; do
  for i in $(cat "$file"); do
    echo "mongofiles delete $i --port 30000 --db pics" >> 2021-8-31.log
    mongofiles delete "$i" --port 30000 --db pics
    sleep 0.1
  done
  echo "sleep 120"; sleep 120
done
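
The same idea (work through the list in small batches and pause between them) can be sketched in Python. This is only an illustration under assumptions: `batched` and `run_in_batches` are hypothetical helper names, and `delete_one` stands for whatever actually deletes one file, whether a mongofiles call or a pymongo call.

```python
import time


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(names, delete_one, size=1000, pause=0.0, sleep=time.sleep):
    """Call delete_one(name) for every name, pausing after each batch."""
    done = 0
    for batch in batched(list(names), size):
        for name in batch:
            delete_one(name)
            done += 1
        if pause > 0:
            sleep(pause)  # give the server and the network stack time to recover
    return done
```

The `sleep` parameter exists only so the pause can be replaced in tests; in normal use the default `time.sleep` applies.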

Risks:

  • The mongofiles command is only suited to ad hoc deletion of a small amount of data, and the operation is slow
  • There is a risk of exhausting TCP ports and mongo connections
  • In practice, when more than 10,000 files are deleted in one batch, some deletions fail: mongofiles reports success, but the files still exist
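
Since a delete can report success while the file is still there, it helps to check afterwards. The following is a minimal sketch, not a definitive implementation: `find_survivors` is a hypothetical helper, and `fs` is assumed to be a `gridfs.GridFS` instance (or any object with a compatible `exists()` method).

```python
def find_survivors(fs, names):
    """Return the names that still exist in GridFS after a delete run.

    `fs` is expected to behave like gridfs.GridFS, whose exists() method
    accepts keyword arguments such as filename=... as the query.
    """
    return [name for name in names if fs.exists(filename=name)]
```

Any names returned by this check can be fed into another delete pass or written to a log for manual inspection.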

3. Solutions:

Temporary solution from the development team:

Because many files failed to delete when the abnormal files were batch processed with the shell command, the developers were asked to handle the cleanup for the time being. It turned out that they use Java and call the remove method wrapped by the mongo driver module to delete files directly.

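import java.io.IOException;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBObject;
import com.mongodb.Mongo;
import com.mongodb.gridfs.GridFS;

import org.junit.Test; // assumes JUnit 4 supplies the @Test annotation

public class MyGridFsTest {

	private String host = "127.0.0.1";
	private int port = 27017;
	private String dbName = "demogridfs";
	// Placeholder: name of the GridFS file to delete
	private String fileName = "xxx.jpg";

	@Test
	public void testFindFile() throws IOException {
		Mongo connection = new Mongo(host, port);
		try {
			DB db = connection.getDB(dbName);
			GridFS gridFs = new GridFS(db);

			// Remove every GridFS file whose filename matches the query
			DBObject query = new BasicDBObject("filename", fileName);
			gridFs.remove(query);
		} finally {
			connection.close();
		}
	}
}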

The wrapper around Java's Mongo driver deletes files directly and uses the driver's connection pool, so it does not tie up many resources such as Mongo connections.

Python pymongo solution:


(pymongo) [root@bj-redis-slave02-10-8-2-245 history]# python delete-pic.py > 2021-8.log &

(pymongo) [root@bj-redis-slave02-10-8-2-245 history]# tail -f 2021-8.log 

4. Appendix: code:

#!/usr/bin/python
# -*- encoding: utf-8 -*-
import pymongo
import json
import os
from pymongo import MongoClient
from gridfs import GridFS
class GFS(object):
    def __init__(self, file_db,file_table):
        self.file_db = file_db
        self.file_table = file_table
 
    def createDB(self): # Connect to the server and return the file database and collection
        client = MongoClient('10.8.2.237',30000)
        db = client[self.file_db]
        file_table = db[self.file_table]
        return (db,file_table)
 
    def insertFile(self,db,filePath,query): # Store a file in GridFS unless a matching file already exists
        fs = GridFS(db,self.file_table)
        if fs.exists(query):
            print('File already exists')
            return None
        with open(filePath,'rb') as fileObj:
            data = fileObj.read()
        ObjectId = fs.put(data,filename = filePath.split('/')[-1])
        print(ObjectId)
        return ObjectId
 
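    def getID(self,db,query,pic): # Look up a file id by file attributes; the id is needed to delete or read the file
        try:
            fs=GridFS(db, self.file_table)
            ObjectId=fs.find_one(query)._id
            msg=pic,'ObjectId',ObjectId
            print (msg)
            return ObjectId
        except AttributeError as e: # find_one returns None when no file matches, so reading ._id raises AttributeError;
                                    # the error is caught here so the program keeps running instead of exiting
            print("AttributeError, image does not exist:",e)
            return None
            
    def remove(self,db,id): # Delete a file from GridFS
        if id is None: # nothing to delete when the lookup found no file
            return
        fs = GridFS(db, self.file_table)        
        fs.delete(id) # delete() only accepts a file id
        del_msg= id,'deleting!!!'
        print (del_msg)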
 
    def getFile(self,db,id): # Get the file attributes and read the binary data into memory
        fs = GridFS(db, self.file_table)
        gf=fs.get(id)
        bdata=gf.read() # binary data
        attri={} # file attribute information
        attri['chunk_size']=gf.chunk_size
        attri['length']=gf.length
        attri["upload_date"] = gf.upload_date
        attri["filename"] = gf.filename
        attri['md5']=gf.md5
        print(attri)
        return (bdata, attri)
    def Writ_log(self,File,msg): # Append a message to a log file
        with open(File,'a') as f:
            f.write(msg)
 
    def listFile(self,db): # List all file names
        fs = GridFS(db, self.file_table)
        return fs.list()
 
    def findFile(self,db,file_table): # Read the binary data of every file; file_table is the collection name
        fs = GridFS(db, file_table)
        for file in fs.find():
            bdata=file.read()
 
    def write_2_disk(self,bdata, attri): # Write the binary data to disk
        name = "get_"+attri['filename']
        with open(name, 'wb') as output:
            output.write(bdata)
        print("fetch image ok!")
 
if __name__=='__main__':
    gfs=GFS('pics','fs')
    (file_db,fileTable) = gfs.createDB() # get the database and collection
    dir_list = os.listdir('2021-8')
    print (dir_list)
    for filePath in dir_list:
        filePath='2021-8/'+filePath
        with open(filePath,'r') as listFile:
            lines = listFile.readlines()
        for file in lines:
            pic = file.strip()
            if not pic: # skip blank lines
                continue
            query = {"filename": pic}
            id=gfs.getID(file_db,query,pic)
            if id is not None: # skip names that are not in GridFS
                gfs.remove(file_db,id) # delete the file from the database


    #filePath = '10.txt' # file to insert
    #query = {'filename': '745082993188.jpg'}
    #id=gfs.getID(file_db,query,pic)
    #gfs.remove(file_db,id) # delete the file from the database
    
    #id=gfs.insertFile(file_db,filePath,query) # insert the file
    #(bdata,attri)=gfs.getFile(file_db,id) # look up the file and read it into memory
    #gfs.write_2_disk(bdata,attri) # write to disk
    #gfs.remove(file_db,id) # delete the file from the database
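
The script above reads its input from list files in a directory. As a rough sketch of that input step, assuming Python 3 and one GridFS file name per line, a helper like the hypothetical `read_names` below collects the names while skipping blank lines and duplicates, so no file is looked up twice.

```python
import os


def read_names(directory):
    """Return de-duplicated file names from every list file in `directory`.

    Each list file is assumed to hold one GridFS file name per line.
    Blank lines are skipped and first-seen order is kept.
    """
    seen = set()
    names = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        if not os.path.isfile(path):
            continue  # ignore subdirectories
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                name = line.strip()
                if name and name not in seen:
                    seen.add(name)
                    names.append(name)
    return names
```

Calling `read_names('2021-8')` would then give a single list that can be passed to the delete loop.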

reference:

There are many pymongo tutorials covering MongoDB operations, and many of them introduce uploading to and downloading from the GridFS file system, but few cover deleting from GridFS. The key step is to look up the file's id first and then delete the file by that id with GridFS.delete, which only accepts a file id.
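
The two-step pattern (look up the id, then delete by id) can be sketched as follows. This is a minimal illustration rather than the script used above: `delete_by_filename` and `open_gridfs` are hypothetical helper names, and `fs` is assumed to be a `gridfs.GridFS` instance.

```python
def delete_by_filename(fs, filename):
    """Delete every stored version of `filename`; return the deleted ids.

    GridFS.delete() takes a file id, so each matching file is looked up
    first and then deleted by its _id.
    """
    # Collect the ids first so the cursor is exhausted before deleting.
    ids = [grid_out._id for grid_out in fs.find({"filename": filename})]
    for file_id in ids:
        fs.delete(file_id)
    return ids


def open_gridfs(host, port, db_name, collection="fs"):
    """Open one client and one GridFS handle to reuse for a whole batch."""
    # Imported here so delete_by_filename can be exercised without pymongo.
    from pymongo import MongoClient
    from gridfs import GridFS

    client = MongoClient(host, port)
    return client, GridFS(client[db_name], collection)
```

A batch run would call `open_gridfs` once, for example with the host, port, and database name from the script above, and then call `delete_by_filename(fs, name)` for each name, so that all deletes share the client's connection pool. An empty list in return means no file with that name was found.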

Official: https://pymongo.readthedocs.io/en/stable/tutorial.html

gridfs delete official: https://pymongo.readthedocs.io/en/stable/api/gridfs/index.html?highlight=delete#gridfs.GridFS.delete

https://blog.csdn.net/qq_30852577/article/details/84645693

https://blog.csdn.net/weiyuanke/article/details/7717476

Origin blog.csdn.net/weixin_43423965/article/details/128563115