Fast file search using Python (build a file search index)

Basic search method:

Searching for files with the pathlib library

To search for files with Python, you can use the glob() and rglob() methods of the pathlib library. glob() matches a pattern relative to the directory it is called on, which suits searches by file name. rglob() applies the same pattern recursively to every subdirectory, which makes it convenient for searches by extension.


from pathlib import Path

base_dir = '/Users/edz/Desktop/'
keywords = '**/*BBC*'

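# Path object for the directory that base_dir points to
p = Path(base_dir)

# All files under base_dir, at any depth, whose names contain "BBC"
files = p.glob(keywords)
# glob() returns an iterator;
# convert it with list() to print the results
# print(list(files))

# Files ending in .xlsx, searched recursively
files2 = p.rglob('*.xlsx')
print(list(files2))

# Every subdirectory and file under base_dir
files3 = p.glob('**/*')
print(list(files3))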

glob() matches against the path and file name of a file, for example "c:\somepath\to\filename_include_BBC_voice.exe", whereas we usually search with a keyword such as "BBC". The keyword therefore has to be wrapped in wildcards, as in "*BBC*", and prefixed with "**/" if subdirectories should be searched as well.

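A wildcard is a special symbol that resembles a regular expression metacharacter, but it is not regular expression syntax: it only has its special meaning in glob (short for "global") matching patterns. In pathlib, "*" matches any run of characters within one path component, "?" matches a single character, "[abc]" matches one character from a set, and "**" matches any number of directory levels.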
rglob() is the recursive version of glob(): calling rglob(pattern) gives the same results as calling glob() with "**/" placed in front of the pattern, so the pattern is matched in the starting directory and in every subdirectory. That is the main difference between the two functions. Because of this, rglob() is often used to search by extension, for example rglob('*.xlsx') to find all files with the xlsx extension. This is shorter than the equivalent glob('**/*.xlsx') pattern, and the meaning of the argument is clearer.
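
The relationship between the two functions can be checked directly. The following is a minimal sketch that builds a throwaway directory tree (the file and folder names are made up for illustration) and compares rglob('*.xlsx') with glob('**/*.xlsx'):

```python
import tempfile
from pathlib import Path


def make_sample_tree(root):
    """Create a small, made-up directory tree for the demonstration."""
    (root / "reports" / "2021").mkdir(parents=True)
    (root / "top.xlsx").touch()
    (root / "reports" / "BBC_news.txt").touch()
    (root / "reports" / "2021" / "sales.xlsx").touch()


def relative_names(root, paths):
    """Return sorted POSIX-style paths relative to root."""
    return sorted(p.relative_to(root).as_posix() for p in paths)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_sample_tree(root)

    # Recursive search by extension, written in two equivalent ways
    via_rglob = relative_names(root, root.rglob("*.xlsx"))
    via_glob = relative_names(root, root.glob("**/*.xlsx"))

    # Without "**/", glob() only looks in the starting directory
    top_only = relative_names(root, root.glob("*.xlsx"))

    # Keyword search at any depth
    bbc_files = relative_names(root, root.glob("**/*BBC*"))

print(via_rglob)
print(via_glob)
print(top_only)
print(bbc_files)
```

The first two lists are identical, while the third contains only the file in the starting directory.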

Finally, a note on the return values of glob() and rglob(): they do not return a list but an "iterator", a data type we have not met in previous lessons. An iterator produces its results one at a time and can only be traversed once, which is why the code above passes it to list() before printing.
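
The one-pass behaviour of an iterator is easy to overlook. Here is a small sketch, using a temporary directory with two made-up file names, that shows what happens when the result of glob() is traversed twice:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").touch()
    (root / "b.txt").touch()

    matches = root.glob("*.txt")  # an iterator, not a list

    # The first traversal consumes the iterator
    first_pass = sorted(p.name for p in matches)
    # The second traversal finds nothing left
    second_pass = sorted(p.name for p in matches)

    # Convert to a list when the results are needed more than once
    stored = sorted(p.name for p in list(root.glob("*.txt")))

print(first_pass)
print(second_pass)
print(stored)
```

If the search results are used in more than one place, storing them in a list first avoids an unexpectedly empty second pass.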

Two ways to improve search efficiency

Searching with Python's pathlib library is more flexible than the default Windows search, but on its own it is no faster. To reduce the waiting time, I will next show two ways to speed up searches with pathlib: specifying the search path and creating an index file.

Specify search path

Let's look at the first one, specifying the search path.
We need to do this in three steps:

Generate a configuration file first, and write the path to be searched into the configuration file;

Then write custom functions that read the paths from the configuration file and search them directory by directory;

Finally, the search results of multiple directories are combined and output, so that you can quickly find the files you want through the results.

Let's start with the first step, reading a configuration file with Python. Previously we would store the search path in a variable defined in the first few lines of the code, so that it was easy to find the next time the search directory had to change. A slightly larger project, however, usually has several code files, and editing code before every search is inconvenient. The better approach is to move such variables into a separate file, called the program's configuration file.

The advantage of this approach is that you don't have to open the code file when you modify the search directory. Suppose your friend also needs similar functions, then you can send him the code and configuration file together, even if he does not know Python at all, he can use the program you wrote to achieve efficient search.

Configuration files are generally text files. The format of the configuration file is generally specified by the software author based on the functions of the software and their own habits, but there are also general configuration file formats.

For example, on Windows the most common configuration files are those with the .ini extension. In today's lesson, we use the .ini format as the standard format for the configuration file. An .ini file consists of three kinds of elements: sections, parameters, and comments. The format is as follows:

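[section]
Parameters
(key=value)
  name=value
Comments
Comments start with a semicolon (";"). Everything after the semicolon up to the end of the line is a comment.
;comment text

Based on the .ini format, I wrote the search path configuration file as follows:

[work]
;path where work files are saved
searchpath=/Users/edz,/tmp

[game]
;path where entertainment files are saved
searchpath=/games,/movies,/music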

In this configuration file, I set up two sections, work and game, for work and entertainment respectively. The advantage of this setup is that I can search different directories for different purposes. Searching fewer directories also shortens the wait.

In addition, you will notice that the parameter in both sections has the same name, searchpath. The advantage is that when I change the search scope from work to entertainment, I only need to change the section name in the code, not the parameter name.

Besides the sections and parameters, also note how the value of searchpath is written. Its value is the set of paths I want to search, and I separate multiple paths with commas so that the program can split them easily.

Once the full path of the search.ini file is known, the .ini format has to be read and parsed. Python's standard library has a module for this, called configparser. With it you can read the searchpath parameter directly, without reading the file content through the read() function and writing your own code to parse the .ini format.


import configparser
import pathlib 
from pathlib import Path

def read_dirs(ini_filename, section, arg):
    """
    Return the list of directories stored under the given
    section and parameter of the .ini file
    """
    # The .ini file is expected in the same folder as this script
    current_path = pathlib.PurePath(__file__).parent
    inifile = current_path.joinpath(ini_filename)

    # cf is an instance of the ConfigParser class
    cf = configparser.ConfigParser()

    # Read the .ini file
    cf.read(inifile)

    # Read the parameter (e.g. searchpath) from the section (e.g. work)
    return cf.get(section, arg).split(",")

def locate_file(base_dir, keywords):
    p = Path(base_dir)
    files = p.glob(keywords)
    return list(files)


dirs = read_dirs('search.ini', 'work', 'searchpath')
# ['/Users/edz', '/tmp']
keywords = '**/*BBC*'

# List that collects the search results
result = []

# Search for files in each directory
for directory in dirs:
    files = locate_file(directory, keywords)
    result += files

# Convert the Path objects to strings
print( [str(r) for r in result] )

The read_dirs() function reads the .ini file and returns the paths as a list. A list suits a group of values of the same kind, so it is a natural way to hold the names of the directories to be searched.
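
Splitting on commas keeps any spaces that surround a path, so a value written as "/Users/edz, /tmp" would yield " /tmp". The following sketch shows one way to tolerate that. The helper name parse_search_paths() is hypothetical, and the configuration is passed as a string through read_string() only so that the example runs without a search.ini file on disk:

```python
import configparser

# Same layout as the search.ini file, with a space after one comma
SAMPLE_INI = """
[work]
; path where work files are saved
searchpath = /Users/edz, /tmp

[game]
; path where entertainment files are saved
searchpath = /games,/movies,/music
"""


def parse_search_paths(ini_text, section, option="searchpath"):
    """Return the comma-separated paths of one section as a clean list."""
    cf = configparser.ConfigParser()
    cf.read_string(ini_text)
    raw = cf.get(section, option)
    # Strip whitespace around each entry and drop empty entries
    return [part.strip() for part in raw.split(",") if part.strip()]


work_dirs = parse_search_paths(SAMPLE_INI, "work")
game_dirs = parse_search_paths(SAMPLE_INI, "game")
print(work_dirs)
print(game_dirs)
```

Switching the search scope only requires a different section name, as the two calls at the end show.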

The for loop at the end of the code calls locate_file() once for each directory and collects the matches in the result variable, which is a list. Since each directory may contain several matching files, the matches from one directory are added to result with the += operator before the next directory is searched. The search is complete only when every directory has been processed.

Finally, there is one more thing to pay attention to. The pathlib library represents every path as a Path object rather than a string, which hides the differences in path syntax between operating systems. The concrete type is PosixPath on macOS and Linux and WindowsPath on Windows. When you need the paths as plain text, convert the objects to strings first. In the last line of the code, I use Python's built-in str() function to convert the objects one by one and collect the results in a new list.
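
To see what the conversion produces on each system, the sketch below uses pathlib's "pure" path classes, which only manipulate path text and therefore behave the same on any operating system. The file names are made up for illustration:

```python
import os
from pathlib import PurePosixPath, PureWindowsPath

# Build the paths with the / operator instead of string concatenation
posix_path = PurePosixPath("/tmp") / "BBC_voice.mp3"
windows_path = PureWindowsPath("C:/somepath") / "BBC_voice.exe"

# str() renders each path in the syntax of its own platform
posix_text = str(posix_path)
windows_text = str(windows_path)

# os.fspath() is an alternative that also returns the string form
fs_text = os.fspath(posix_path)

print(posix_text)
print(windows_text)
print(fs_text)
```

The Windows path is rendered with backslashes even though it was written with forward slashes, which is the kind of difference pathlib takes care of.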

Index file

The index file builds on the program that specifies the search path, with two changes: first, the paths of all files under the configured directories are stored in a file instead of a list; second, the search function looks for matches in that file instead of on disk.


# Reuses the imports and the read_dirs() function from the previous program

def locate_file(base_dir, keywords='**/*'):
    """
    Iterate over everything under the directory
    """
    p = Path(base_dir)
    return p.glob(keywords)

def write_to_db():
    """
    Write the collected paths to the index file, one per line
    """
    current_path = pathlib.PurePath(__file__).parent
    dbfile = current_path.joinpath("search.db")

    with open(dbfile, 'w', encoding='utf-8') as f:
        for r in result:
            f.write(f"{str(r)}\n")

# Read the configuration file
dirs = read_dirs('search.ini', 'work', 'searchpath')

# Walk through the directories
result = []
for directory in dirs:
    for path in locate_file(directory):
        result.append(path)

# Write the paths to the index file
write_to_db()

In this code, I added the write_to_db() function, which writes the collected paths to a file instead of keeping them only in a list. To traverse everything under each directory, I also gave the second parameter of the locate_file() function the default value keywords='**/*'. With these two changes, all file paths are saved to the search.db file.

The file content of search.db is as follows, which records all file paths in all directories specified by the configuration file:


/tmp/com.apple.launchd.kZENgZTtVz
/tmp/com.google.Keystone
/tmp/mysql.sock
/tmp/com.adobe.AdobeIPCBroker.ctrl-edz
/tmp/com.apple.launchd.kZENgZTtVz/Listeners
/tmp/com.google.Keystone/.keystone_install_lock
... ...

Search for keywords in the index file


import pathlib 
import re

keyword = "apple"

# Locate the index file next to this script
current_path = pathlib.PurePath(__file__).parent
dbfile = current_path.joinpath("search.db")

# Search the index file for the keyword
with open(dbfile, encoding='utf-8') as f:
    for line in f:
        if re.search(keyword, line):
            print(line.rstrip())

In this code, I use the regular expression function re.search(), with the keyword variable as the pattern, to test each line of the search.db index file. Every file path that contains the keyword "apple" is printed to the screen. Note that re.search() treats the keyword as a regular expression, so characters such as "." have a special meaning.
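
When the keyword should be matched as plain text rather than as a pattern, it can be escaped first. The sketch below is one possible variant, not part of the original program: the function name search_index_lines() and its literal flag are hypothetical, and the last sample line is invented to show the difference.

```python
import re


def search_index_lines(lines, keyword, literal=True):
    """Return the lines that match keyword, with the newline removed."""
    # re.escape() turns characters such as "." into literal characters
    pattern = re.escape(keyword) if literal else keyword
    regex = re.compile(pattern)
    return [line.rstrip("\n") for line in lines if regex.search(line)]


sample_lines = [
    "/tmp/com.apple.launchd.kZENgZTtVz\n",
    "/tmp/com.google.Keystone\n",
    "/tmp/mysql.sock\n",
    "/tmp/comXapple.txt\n",  # invented entry, not from the real index
]

literal_hits = search_index_lines(sample_lines, "com.apple")
regex_hits = search_index_lines(sample_lines, "com.apple", literal=False)
print(literal_hits)
print(regex_hits)
```

With the unescaped keyword, the "." matches any character, so the invented "comXapple" entry is returned as well.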

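Searching for files this way can be much faster than using the operating system's own search tool, because the time that Windows would spend searching the hard drive is split into two parts: the time updatedb.py takes to build the index, and the time needed to look up a keyword in the search.db index file. The index is built once and reused, so each later search only has to scan a text file. The trade-off is that files created or deleted after the index was built do not show up until the index is rebuilt.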
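
The two phases can be sketched end to end in one self-contained example. The function names build_index() and search_index() are hypothetical, the directory tree is created in a temporary folder with made-up file names, and a plain substring test stands in for re.search() to keep the sketch short:

```python
import tempfile
from pathlib import Path


def build_index(search_dirs, index_file):
    """Walk every directory once and write one path per line."""
    count = 0
    with open(index_file, "w", encoding="utf-8") as f:
        for directory in search_dirs:
            for path in Path(directory).rglob("*"):
                f.write(f"{path}\n")
                count += 1
    return count


def search_index(index_file, keyword):
    """Return the indexed paths containing keyword, without walking the disk."""
    with open(index_file, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if keyword in line]


with tempfile.TemporaryDirectory() as tmp:
    # Made-up sample tree: one folder and two files
    root = Path(tmp) / "data"
    (root / "music").mkdir(parents=True)
    (root / "music" / "BBC_voice.mp3").touch()
    (root / "notes.txt").touch()

    # The index is stored outside the indexed tree
    index_file = Path(tmp) / "search.db"

    indexed = build_index([root], index_file)      # phase 1: build once
    hits = search_index(index_file, "BBC_voice")   # phase 2: look up
    hit_names = [Path(h).name for h in hits]

print(indexed)
print(hit_names)
```

Only build_index() touches the directory tree. Every later call to search_index() reads the text file alone, which is where the saving comes from.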

Origin blog.csdn.net/david2000999/article/details/121555024