Detailed explanation of Python I/O programming

1. File system operations

1. Comparison of os, os.path and pathlib

The traditional way of handling file paths and file system operations in Python is through the functions of the os and os.path modules. These functions are perfectly adequate, but they often make code overly verbose.

The pathlib module, introduced in Python 3.4, handles file operations in a more object-oriented and unified way. Its use is growing steadily, and it may become the new standard, so after each example using the traditional approach there is an example implementing the same functionality with pathlib, with a brief explanation where needed.

2. Paths and pathnames

All operating systems refer to files and directories with strings that name a given file or directory. Such strings are usually called "pathnames", or sometimes simply "paths", and that is the term used here. Because pathnames are strings, they bring some subtle complexities along with them. Python provides many functions that help avoid these complexities, but using them effectively requires some understanding of the underlying issues.

Pathnames look very similar across operating systems because almost every operating system models its file system as a tree: the disk is the root, and folders and subfolders are branches, sub-branches, and so on. This means most operating systems refer to files in fundamentally the same way: a pathname specifies the route from the root of the file system, through layer after layer of folder names, down to the target file.

The exact way a pathname is written still differs between operating systems. In Linux/UNIX pathnames, the character separating file and directory names is "/", while Windows pathnames use "\". In addition, UNIX file systems have a single root directory (referred to by making the first character of the pathname "/"), while Windows has a separate root directory for each drive, labeled A:\, B:\, C:\, and so on (C: is usually the main drive). Because of these differences, files have different pathname representations on different operating systems.

A file named C:\data\myfile on Windows might be called /data/myfile on UNIX and macOS. Python provides functions and constants that handle common pathname operations without requiring you to attend to these syntactic details. With a little care, you can write Python programs that work correctly regardless of the underlying file system.

1. Absolute path and relative path

Operating systems support the following two ways of writing paths.

An absolute path specifies the exact location of the file in the entire file system without any ambiguity. An absolute path will give the full path to the file, starting from the root of the filesystem.

A relative path specifies the location of a file relative to some point in the file system, a point that the relative path itself does not identify; the absolute position of that starting point is supplied by the context of the call.

Here are two examples of absolute paths on Windows systems:

C:\Program Files\Doom
D:\backup\June

Here are two absolute paths on a Linux system, and one absolute path on a Mac system:

/bin/Doom
/floppy/backup/June
/Applications/Utilities

The following are two relative paths in the Windows system:

mydata\project1\readme.txt
games\tetris

The following are relative paths in Linux/UNIX/Mac systems:

mydata/project1/readme.txt
games/tetris
Utilities/Java

A relative path determines an actual location only in combination with a context, and that context is generally supplied in one of two ways.

A relatively simple way is to append the relative path to an existing absolute path, producing a new absolute path. Suppose you have the relative path Start Menu\Programs\Startup on Windows and the absolute path C:\Users\Administrator. Joining the two produces a new absolute path, C:\Users\Administrator\Start Menu\Programs\Startup, which identifies a location in the file system. Joining the same relative path to a different absolute path (say, C:\Users\myuser) produces a path to the Startup folder of another user (myuser).

The second way a relative path acquires context is through an implicit reference to the current working directory, the directory a Python program considers itself to be in at any moment while it runs. Python commands implicitly use the current working directory when a relative path is passed as an argument. For example, if os.listdir(path) is called with a relative path, that path is anchored (anchor) at the current working directory: the directory actually listed is the current working directory joined with the relative path given as the argument.
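This anchoring can be verified directly. The following sketch uses a throwaway folder whose name, demo_dir, is a hypothetical choice for this example:

```python
import os

# 'demo_dir' is a hypothetical folder name used only for this sketch.
os.makedirs('demo_dir', exist_ok=True)
absolute = os.path.join(os.getcwd(), 'demo_dir')

# The relative path is anchored at the current working directory,
# so listing it and listing the equivalent absolute path agree.
same = os.listdir('demo_dir') == os.listdir(absolute)
print(same)   # True

os.rmdir('demo_dir')  # clean up the demo folder
```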

2. Current working directory

Whenever you edit a document on a computer, there is a notion of where you are: the document's current position within the computer's file structure; you naturally feel you are "in" the same directory (folder) as the file being worked on. Similarly, whenever Python runs, it has a notion of a current location: the directory it is "in" at a given moment. This matters because a program may, for example, need the list of files in the current directory. The directory a Python program is in is called the program's current working directory, which may differ from the directory where the program is stored.

If you want to check the current working directory for yourself, start Python and use the os.getcwd function (get current working directory) to see where Python is in its initial state:

>>> import os
>>> os.getcwd()

Note that os.getcwd is called with no arguments, to emphasize that its return value is not a constant: it changes if a command that modifies the current working directory is executed. The result might be the directory where a Python program is stored, or the directory from which Python was started. On a Linux machine the result could be /home/myuser, the current user's home directory. On Windows machines the path will contain extra backslashes, because Windows uses "\" as its path separator, while "\" also has special meaning in Python strings.

Next, enter the following:

>>> os.listdir(os.curdir)

The constant os.curdir returns the string the system uses to represent the current directory. On both UNIX and Windows this is a single period, but to keep programs portable you should always use os.curdir rather than typing the period directly. This string is a relative path, which means os.listdir joins it onto the path of the current working directory, leaving that path unchanged. The command above returns a list of all the files and folders inside the current working directory.

Choose any folder name and type the following command:

>>> os.chdir(folder name)       ⇽---  change the current directory
>>> os.getcwd()

As shown above, Python moves into the folder specified as the argument to os.chdir. Calling os.listdir(os.curdir) at this point returns the list of files in that folder, because os.curdir is resolved relative to the new current working directory. Many of Python's file system operations use the current working directory in this way.

3. Access directories with the pathlib module

The steps to get the current directory with pathlib are as follows:

>>> import pathlib
>>> cur_path = pathlib.Path()
>>> cur_path.cwd()
PosixPath('/home/naomi')

pathlib does not provide a function for changing the current directory comparable to os.chdir(), but you can work with another folder by creating a path object that refers to it. (Note that cwd() is a class method, so pathlib.Path.cwd() also works without creating an instance first.)
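The effect of os.chdir can still be had with path objects, because the os functions accept them directly (any os.PathLike object, since Python 3.6). A minimal sketch, using the system temporary directory so that it runs anywhere:

```python
import os
import pathlib
import tempfile

# pathlib has no chdir method of its own, but os.chdir accepts
# Path objects directly (Python 3.6+).
target = pathlib.Path(tempfile.gettempdir())
os.chdir(target)

# resolve() follows symlinks (e.g. /tmp -> /private/tmp on macOS),
# so the comparison is robust.
print(pathlib.Path.cwd() == target.resolve())   # True
```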

4. Path name processing

Now that you have some background on pathnames for files and directories, it is time to introduce the functionality Python provides for manipulating them. This consists of functions and constants in the os.path submodule that let you process pathnames without explicitly relying on any operating-system-specific syntax. Paths are still represented as strings, but you no longer have to think of or handle them as strings.

To start, here is how pathnames can be constructed on various operating systems using the os.path.join function.

Note that importing os also brings in the os.path submodule, so there is no need to import it separately with an import os.path statement. First, start Python on a Windows system:

>>> import os
>>> print(os.path.join('bin', 'utils', 'disktools'))
bin\utils\disktools

The os.path.join function interprets its arguments as a series of directory names or a file name, which are joined to form a single string that the underlying operating system understands as a relative path. On Windows, that means the parts of the pathname are joined with backslashes, which is what is produced above.

The following does the same in UNIX:

>>> import os
>>> print(os.path.join('bin', 'utils', 'disktools'))
bin/utils/disktools

The resulting path is the same, but it uses the Linux/UNIX convention of forward slashes as separators rather than the Windows convention of backslashes. In other words, os.path.join can build a file path from a series of directory or file names without caring about the syntax rules of the underlying operating system. Using os.path.join is the basic way to build file paths that are not tied to the environment the program will eventually run in.

The arguments to os.path.join need not be single directory or file names; they may also be subpaths, which are joined to form a longer pathname. The following example demonstrates this in a Windows environment, where double backslashes must be used inside the strings.

Note that the pathnames could also be entered with forward slashes (/), because Python converts them before interacting with the Windows operating system:

>>> import os
>>> print(os.path.join('mydir\\bin', 'utils\\disktools\\chkdisk'))
mydir\bin\utils\disktools\chkdisk

Of course, if you always use os.path.join to build paths, you hardly need to worry about the above problems. Writing this example in a portable form would look like this:

>>> path1 = os.path.join('mydir', 'bin')
>>> path2 = os.path.join('utils', 'disktools', 'chkdisk')
>>> print(os.path.join(path1, path2))
mydir\bin\utils\disktools\chkdisk

The os.path.join function also handles absolute and relative pathnames sensibly. On Linux/UNIX, an absolute path always starts with / (a single slash denotes the top-level directory of the entire system, under which everything else sits, including any mounted floppy and CD drives), and a relative path is any legal path that does not start with a slash. On Windows the situation is more complicated, because of the confusing way Windows handles relative and absolute paths. Without going into all the details, the best approach is to follow these simplified Windows path rules.

  • If the path name begins with a drive letter followed by a colon and backslash, it is absolute, such as C:\Program Files\Doom. Note that just C:, without a trailing backslash, does not reliably represent the top-level directory of the C: drive. C:\ must be used to refer to the top-level directory of the C: drive. This requirement follows the DOS tradition, not the Python design rules.
  • If the pathname does not begin with a drive letter or a backslash, it is a relative path, such as mydirectory\letters\business.
  • If the pathname starts with \\ followed by the server name, it is a path to a network resource.
  • Any other pathname is considered an invalid path.
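These rules can be checked from any platform with the ntpath module, which is the Windows implementation behind os.path and is importable everywhere:

```python
import ntpath  # the Windows flavour of os.path; importable on any OS

print(ntpath.isabs('C:\\Program Files\\Doom'))         # True: drive letter plus backslash
print(ntpath.isabs('C:'))                              # False: bare drive letter
print(ntpath.isabs('mydirectory\\letters\\business'))  # False: relative path
```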

Regardless of the operating system, os.path.join performs no sanity checks on the pathnames it builds. The resulting pathname may contain characters forbidden by the operating system, and which characters are forbidden varies from system to system. If you need to check the results, probably the best solution is to write a small path-validity checking function yourself.
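Such a checker might look like the following sketch. The set of forbidden characters used here is an assumption loosely based on Windows naming rules, not an exhaustive or official list:

```python
import os

def is_valid_pathname(pathname, bad_chars='<>"|?*'):
    """Crude validity check. bad_chars is an assumed set loosely based on
    Windows rules; adjust it for the file systems you actually target."""
    if not pathname:
        return False
    # Ignore the drive portion, where ':' is legitimate on Windows.
    _, rest = os.path.splitdrive(pathname)
    return not any(ch in bad_chars for ch in rest)

print(is_valid_pathname(os.path.join('bin', 'utils', 'disktools')))  # True
print(is_valid_pathname('bad|name?.txt'))                            # False
```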

The os.path.split function returns a two-element tuple that splits a path into its final component (the single file or directory name at the end of the path, the basename) and everything leading up to it.

On a Windows system, see the following example:

>>> import os
>>> print(os.path.split(os.path.join('some', 'directory', 'path')))
('some\\directory', 'path')

The os.path.basename function returns only the file name in the path, and the os.path.dirname function returns only the preceding path part, as follows:

>>> import os
>>> os.path.basename(os.path.join('some', 'directory', 'path.jpg'))
'path.jpg'
>>> os.path.dirname(os.path.join('some', 'directory', 'path.jpg'))
'some\\directory'

To handle the file extensions that periods mark off, Python provides the os.path.splitext function. Most file systems use extensions to indicate file types, with the Macintosh a notable exception.

>>> os.path.splitext(os.path.join('some', 'directory', 'path.jpg'))
('some/directory/path', '.jpg')

The last element of the tuple returned above contains the file extension identified by the period (if it exists). The first element of the returned tuple contains the rest of the given arguments except the extension.

There are also more specialized pathname functions. os.path.commonprefix(list_of_paths) finds the common prefix shared by all paths in the list, if one exists, which is useful for finding the lowest-level directory that contains a given set of files. The os.path.expanduser function expands the "~" (or "~username") shorthand into a full pathname, and os.path.expandvars similarly expands environment variables into full paths.
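One caveat worth knowing: os.path.commonprefix compares character by character, so its result need not end on a directory boundary; os.path.commonpath (available since Python 3.4) compares whole path components instead. A quick sketch:

```python
import os.path

paths = ['/usr/local/bin', '/usr/local/lib', '/usr/lib']

# Character by character: stops inside a name.
print(os.path.commonprefix(paths))   # '/usr/l'

# Component by component: a genuine common directory.
print(os.path.commonpath(paths))     # '/usr'
```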

Here is an example on Windows 10:

>>> import os
>>> os.path.expandvars('$HOME\\temp')
'C:\\Users\\administrator\\personal\\temp'

5. Handling pathnames with pathlib

As before, let us start by constructing pathnames for various operating systems, this time using the methods of the Path object.

Start with Python on Windows:

>>> from pathlib import Path
>>> cur_path = Path()
>>> print(cur_path.joinpath('bin', 'utils', 'disktools'))
bin\utils\disktools

The same effect can also be achieved by directly using the "/" operator:

>>> cur_path / 'bin' / 'utils' / 'disktools'
WindowsPath('bin/utils/disktools')

Note that a Path object always displays "/" as its separator, but a Windows Path object converts "/" to "\" as required by the operating system. Performing the same operations on a UNIX system therefore gives:

>>> cur_path = Path()
>>> print(cur_path.joinpath('bin', 'utils', 'disktools'))
bin/utils/disktools

The parts property of the Path object will return a tuple whose elements are the individual components of the path. An example on a Windows system is as follows:

>>> a_path = WindowsPath('bin/utils/disktools')
>>> print(a_path.parts)
('bin', 'utils', 'disktools')

The name property of a Path object returns only the filename portion of the path, the parent property returns everything but the final name, and the suffix property returns the extension, period included. As noted earlier, most operating systems use extensions to identify file types; the Macintosh is a notable exception.

Examples are as follows:

>>> a_path = Path('some', 'directory', 'path.jpg')
>>> a_path.name
'path.jpg'
>>> print(a_path.parent) 
some\directory
>>> a_path.suffix
'.jpg'

The Path object has a number of other methods for flexible handling of pathnames and files; see the documentation of the pathlib module.

The pathlib module can make programming easier and file-handling code cleaner.

6. Commonly used variables and functions

Several path-related constants and functions help improve the system independence of Python code. The most basic are os.curdir and os.pardir, the strings the operating system uses to refer to the current directory and the parent directory, respectively. On Windows, Linux/UNIX, and macOS these are "." and "..", and both can be used as ordinary path components.

For example, the expression

os.path.isdir(os.path.join(os.pardir, path))

tests whether the parent directory of the current working directory contains a directory named path. And os.curdir is especially useful when you want to do something with the current working directory.

The following example will return a list of filenames in the current working directory:

os.listdir(os.curdir)

Because os.curdir is a relative path, os.listdir will always treat relative paths as relative to the current working directory.

The os.name constant returns the name of the Python module imported to handle the operating-system-specific details. On Windows it looks like this:

>>> import os
>>> os.name
'nt'

Note that os.name returns 'nt' even when the actual version of Windows is, say, Windows 10; most versions of Windows, apart from Windows CE, are identified as 'nt'.

On Macs running macOS and on Linux/UNIX systems, the return value is 'posix'. Based on this value, a program can take platform-specific actions:

import os
if os.name == 'posix':
    root_dir = "/"
elif os.name == 'nt':
    root_dir = "C:\\"
else:
    print("Don't understand this operating system!")

Some programs instead use sys.platform, which provides more precise information. On Windows, sys.platform is set to 'win32' even when the machine is running a 64-bit version of the operating system. On Linux it is 'linux' (older Python versions reported 'linux2'), and on Solaris it may be 'sunos5', depending on the system version.

All environment variables and their values are stored in a dictionary named os.environ. On most operating systems this dictionary includes many path-related variables, such as the search path for executable binaries. If you need this kind of information, it can be found there.
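For example, the executable search path can be pulled apart like this; note that os.pathsep (the separator between PATH entries) is distinct from os.path.sep (the separator inside a path):

```python
import os

# os.environ behaves like an ordinary dictionary of strings.
search_path = os.environ.get('PATH', '')

# Entries in PATH are separated by os.pathsep
# (':' on Linux/UNIX/macOS, ';' on Windows).
directories = search_path.split(os.pathsep)
print(len(directories))
```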

3. Obtaining file information

File paths exist to refer to actual files and directories on disk; a path is typically passed around precisely because something needs to be known about what it points to. Python provides many functions for obtaining such file information.

The most commonly used path-information functions are os.path.exists, os.path.isfile, and os.path.isdir, each of which takes a single path argument. os.path.exists returns True if its argument is a path that actually exists in the file system. os.path.isfile returns True if and only if the path names an ordinary data file of some sort (executables included), and False otherwise, including when the argument points at nothing in the file system. os.path.isdir returns True if and only if its argument names a directory, and False otherwise. The following examples were actually run; to see how these functions behave on your own system, you may need to try other paths:

>>> import os
>>> os.path.exists('C:\\Users\\myuser\\My Documents')
True
>>> os.path.exists('C:\\Users\\myuser\\My Documents\\Letter.doc')
True
>>> os.path.exists('C:\\Users\\myuser\\My Documents\\ljsljkflkjs')
False
>>> os.path.isdir('C:\\Users\\myuser\\My Documents')
True
>>> os.path.isfile('C:\\Users\\myuser\\My Documents')
False
>>> os.path.isdir('C:\\Users\\myuser\\My Documents\\Letter.doc')
False
>>> os.path.isfile('C:\\Users\\myuser\\My Documents\\Letter.doc')
True

Several similar functions give more specific information. os.path.islink and os.path.ismount are useful on Linux and other UNIX systems that support symbolic links and mount points: they return True if the path is a symbolic link or a mount point, respectively. Note that os.path.islink does not return True for Windows shortcut files (those ending in .lnk), for the simple reason that such files are not true links: the operating system gives them no special status, and programs cannot use them transparently as if they were the actual file. os.path.islink does, however, return True for genuine symbolic links created with the mklink command on Windows.

os.path.samefile(path1, path2) returns True if and only if its two path arguments point to the same file. os.path.isabs(path) returns True if its argument is an absolute path and False otherwise. os.path.getsize(path), os.path.getmtime(path), and os.path.getatime(path) return the size, last modification time, and last access time of the file named by the path, respectively.
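A short self-contained sketch of the size and timestamp functions; it creates and removes a throwaway file whose name, info_demo.txt, is chosen only for this example:

```python
import os
import time

# Create a small file so the example does not depend on existing data.
with open('info_demo.txt', 'w') as f:
    f.write('hello')

size = os.path.getsize('info_demo.txt')     # length in bytes
mtime = os.path.getmtime('info_demo.txt')   # seconds since the epoch
print(size)                                  # 5
print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(mtime)))

os.remove('info_demo.txt')                   # clean up
```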

In addition to the os.path functions above, os.scandir offers a more complete way to get information about the files in a directory: it returns an iterator of os.DirEntry objects. An os.DirEntry object exposes the file attributes of each directory entry, so using os.scandir is faster and more efficient than combining os.listdir with os.path operations: for example, when determining whether a directory entry refers to a file or a directory, and when gathering further directory information.

Several methods of the os.DirEntry object correspond to os.path functions, including is_dir, is_file, and is_symlink; its stat method gives access to details such as size and timestamps.

os.scandir can also be used as a context manager with with, which is recommended to ensure resources are properly released. The following example iterates over all entries in a directory and prints each entry's name and whether it is a file:

>>> with os.scandir(".") as my_dir:
...     for entry in my_dir:
...         print(entry.name, entry.is_file())
...
pip-selfcheck.json True
pyvenv.cfg True
include False
test.py True
lib False
lib64 False
bin False

4. Other file system operations

In addition to obtaining information about files, Python supports a number of direct operations on the file system through some basic but useful functions in the os module.

Getting a list of the files in a directory with os.listdir was shown earlier:

>>> os.chdir(os.path.join('C:', 'my documents', 'tmp'))
>>> os.listdir(os.curdir)
['book1.doc.tmp', 'a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp']

Note that Python does not include the os.curdir and os.pardir entries ("." and "..") in the list returned by os.listdir, unlike many other languages or shell directory-listing commands.

The glob module provides a glob function (named after an old UNIX pattern-matching facility) that expands Linux/UNIX shell-style wildcards and character ranges in pathnames and returns the matching files in the current working directory. "*" matches any sequence of characters, "?" matches any single character, and a character set (such as [hH] or [0-9]) matches any single character in that set:

>>> import glob
>>> glob.glob("*")
['book1.doc.tmp', 'a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp']
>>> glob.glob("*bkp")
['registry.bkp']
>>> glob.glob("?.tmp")
['a.tmp', '1.tmp', '7.tmp', '9.tmp']
>>> glob.glob("[0-9].tmp")
['1.tmp', '7.tmp', '9.tmp']

Use os.rename to rename (move) a file or directory:

>>> os.rename('registry.bkp', 'registry.bkp.old')
>>> os.listdir(os.curdir)
['book1.doc.tmp', 'a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp.old']

With os.rename you can move (rename) files not only within directories, but also between directories.

Files can be deleted with os.remove:

>>> os.remove('book1.doc.tmp')
>>> os.listdir(os.curdir)
['a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp.old']

Note that directories cannot be deleted with os.remove; this is a safety feature to ensure that an entire directory cannot be deleted by mistake.
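The restriction is easy to observe. In this sketch the directory name remove_demo is illustrative; the exact OSError subclass raised differs by platform (IsADirectoryError on Linux, PermissionError on Windows):

```python
import os

os.mkdir('remove_demo')        # a throwaway directory for this sketch
try:
    os.remove('remove_demo')   # not allowed on a directory
except OSError:                # IsADirectoryError / PermissionError
    removed = False
else:
    removed = True

print(removed)                 # False
os.rmdir('remove_demo')        # the correct way to delete an empty directory
```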

Files are created by writing to them. To create a directory, use os.makedirs or os.mkdir; the difference is that os.makedirs also creates any necessary intermediate directories, while os.mkdir does not:

>>> os.makedirs('mydir')
>>> os.listdir(os.curdir)
['mydir', 'a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp.old']
>>> os.path.isdir('mydir')
True

To remove a directory, use os.rmdir, which removes only empty directories. Attempting to delete a non-empty directory raises an exception:

>>> os.rmdir('mydir')
>>> os.listdir(os.curdir)
['a.tmp', '1.tmp', '7.tmp', '9.tmp', 'registry.bkp.old']

If you want to delete a non-empty directory, use the shutil.rmtree function, which recursively deletes all files in a directory tree.
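A minimal sketch of rmtree, run against a throwaway tree built in the system temporary directory:

```python
import os
import shutil
import tempfile

# Build a small non-empty directory tree in a temporary location.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, 'sub', 'deeper'))
with open(os.path.join(base, 'sub', 'file.txt'), 'w') as f:
    f.write('data')

# os.rmdir(base) would raise OSError here, because base is not empty;
# shutil.rmtree deletes the whole tree in one call.
shutil.rmtree(base)
print(os.path.exists(base))   # False
```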

For most of the operations above, Path objects have methods with the same functionality, though with some differences. The iterdir method is similar to the os.listdir function, but it returns an iterator of paths rather than a list of strings:

>>> new_path = cur_path.joinpath('C:', 'my documents', 'tmp')
>>> list(new_path.iterdir())
[WindowsPath('book1.doc.tmp'), WindowsPath('a.tmp'), WindowsPath('1.tmp'), 
    WindowsPath('7.tmp'), WindowsPath('9.tmp'), WindowsPath('registry.bkp')]

Note that in the Windows environment, the WindowsPath object is returned. On Mac OS or Linux systems, PosixPath objects are returned.

pathlib path objects also have a built-in glob method; it returns an iterator of paths rather than a list of strings, but otherwise behaves much like the glob.glob function:

>>> list(cur_path.glob("*"))
[WindowsPath('book1.doc.tmp'), WindowsPath('a.tmp'), WindowsPath('1.tmp'), 
    WindowsPath('7.tmp'), WindowsPath('9.tmp'), WindowsPath('registry.bkp')]
>>> list(cur_path.glob("*bkp"))
[WindowsPath('registry.bkp')]
>>> list(cur_path.glob("?.tmp"))
[WindowsPath('a.tmp'), WindowsPath('1.tmp'), WindowsPath('7.tmp'), 
    WindowsPath('9.tmp')]
>>> list(cur_path.glob("[0-9].tmp"))
[WindowsPath('1.tmp'), WindowsPath('7.tmp'), WindowsPath('9.tmp')]

Files and directories can be renamed (moved) using the rename method of the Path object:

>>> old_path = Path('registry.bkp')
>>> new_path = Path('registry.bkp.old')
>>> old_path.rename(new_path)
>>> list(cur_path.iterdir())
[WindowsPath('book1.doc.tmp'), WindowsPath('a.tmp'), WindowsPath('1.tmp'), 
    WindowsPath('7.tmp'), WindowsPath('9.tmp'), 
    WindowsPath('registry.bkp.old')]

The rename method can not only move (rename) files within a directory, but also move files between directories.

To remove or delete a data file, use the unlink method:

>>> new_path = Path('book1.doc.tmp')
>>> new_path.unlink()
>>> list(cur_path.iterdir())
[WindowsPath('a.tmp'), WindowsPath('1.tmp'), WindowsPath('7.tmp'), 
    WindowsPath('9.tmp'), WindowsPath('registry.bkp.old')]

Note that, like os.remove, you cannot use the unlink method to delete a directory. This is a security restriction to ensure that the entire directory cannot be deleted by mistake.

To create a directory from a path object, use its mkdir method. When called with parents=True, mkdir creates any necessary intermediate directories; otherwise it raises FileNotFoundError if an intermediate directory is missing:

>>> new_path = Path('mydir')
>>> new_path.mkdir(parents=True)
>>> list(cur_path.iterdir())
[WindowsPath('mydir'), WindowsPath('a.tmp'), WindowsPath('1.tmp'), 
    WindowsPath('7.tmp'), WindowsPath('9.tmp'), 
    WindowsPath('registry.bkp.old')]
>>> new_path.is_dir()
True

To remove a directory, use the rmdir method, which removes only empty directories. Attempting to delete a non-empty directory raises an exception.

>>> new_path = Path('mydir')
>>> new_path.rmdir()
>>> list(cur_path.iterdir())
[WindowsPath('a.tmp'), WindowsPath('1.tmp'), WindowsPath('7.tmp'), 
     WindowsPath('9.tmp'), WindowsPath('registry.bkp.old')]

5. Processing all files in a directory tree

Finally, here is a very useful function for recursively traversing a directory structure: os.walk. It walks an entire directory tree, returning three items for each directory visited: the root (path) of that directory, a list of its subdirectories, and a list of its files.

os.walk takes the path of the initial (top) directory and three optional arguments: os.walk(top, topdown=True, onerror=None, followlinks=False). top is the path of the starting directory. If topdown is True or not given, the files in each directory are processed before its subdirectories, so the listing starts at the top of the tree and works downward; if topdown is False, each directory's subdirectories are processed first, traversing the tree bottom-up. The onerror parameter can be set to a function to handle errors raised while listing directories, which are ignored by default. By default os.walk does not descend into folders that are symbolic links, unless the followlinks=True argument is given.

When os.walk is called, it creates an iterator that recursively applies itself to all the directories contained under the top argument: for each subdirectory subdir in the directory list, os.walk calls itself recursively in the form os.walk(subdir, ...).

Note that if the topdown argument is True or not given, the list of subdirectories may be modified (with any list-modification operator or method) before the next level of recursion; this can be used to control which subdirectories os.walk descends into.

To get a feel for os.walk, try iterating over a directory tree and printing the values returned for each directory. The following example shows some of its power: it lists the current working directory and all its subdirectories, reporting the number of files in each, while skipping any .git directories:

import os
for root, dirs, files in os.walk(os.curdir):
    print("{0} has {1} files".format(root, len(files))) 
    if ".git" in dirs:          ⇽---  check for a .git directory
        dirs.remove(".git")     ⇽---  remove .git from the directory list

This example is a bit more involved; to get the most out of os.walk, you may want to spend some time experimenting with it to understand exactly what is going on.

The copytree function in the shutil module recursively copies all files in a directory and its subdirectories, preserving file permissions and status information (i.e., access/modification times). shutil also contains the previously mentioned rmtree function, which removes a directory and all its subdirectories, as well as several functions for copying individual files.
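A sketch of shutil.copytree on a throwaway tree; note that the destination must not already exist (since Python 3.8, the dirs_exist_ok=True parameter relaxes this):

```python
import os
import shutil
import tempfile

# Create a small source tree to copy.
src = tempfile.mkdtemp()
os.makedirs(os.path.join(src, 'sub'))
with open(os.path.join(src, 'sub', 'notes.txt'), 'w') as f:
    f.write('back me up')

dst = src + '_copy'            # must not exist before the copy
shutil.copytree(src, dst)

copied = os.path.isfile(os.path.join(dst, 'sub', 'notes.txt'))
print(copied)                  # True

shutil.rmtree(src)             # clean up both trees
shutil.rmtree(dst)
```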

Summary of file system constants and functions:

Function                                          File system constant or operation
os.getcwd(), Path.cwd()                           Get the current directory
os.name                                           General identifier of the current platform
sys.platform                                      More specific information about the current platform
os.environ                                        Mapping of environment variables as a dictionary
os.listdir(path)                                  Get the files in a directory
os.scandir(path)                                  Get directory information as an iterator of os.DirEntry objects
os.chdir(path)                                    Change the current directory
os.path.join(elements), Path.joinpath(elements)   Combine the arguments into a path
os.path.split(path)                               Split the path into a body and a tail (the last part of the path)
Path.parts                                        Tuple containing the components of the path
os.path.splitext(path)                            Split the path into a body and a file extension
Path.suffix                                       File extension of the path object
os.path.basename(path)                            Get the base filename of the path
Path.name                                         Filename of the path object
os.path.commonprefix(list_of_paths)               Get the common prefix of all paths in the list
os.path.expanduser(path)                          Expand "~" or "~username" to a full pathname
os.path.expandvars(path)                          Expand environment variables in the path
os.path.exists(path)                              Test whether the path exists
os.path.isdir(path), Path.is_dir()                Test whether the path is a directory
os.path.isfile(path), Path.is_file()              Test whether the path is a file
os.path.islink(path), Path.is_symlink()           Test whether the path is a symbolic link (Windows shortcuts don't count)
os.path.ismount(path)                             Test whether the path is a device mount point
os.path.isabs(path), Path.is_absolute()           Test whether the path is absolute
os.path.samefile(path_1, path_2)                  Test whether two paths point to the same file
os.path.getsize(path)                             Get the file size
os.path.getmtime(path)                            Get the last modification time
os.path.getatime(path)                            Get the last access time
os.rename(old_path, new_path)                     Rename (move) a file
os.mkdir(path)                                    Create a directory
os.makedirs(path)                                 Create a directory along with any necessary parent directories
os.rmdir(path)                                    Remove a directory
glob.glob(pattern)                                Get a list of files matching a wildcard pattern
os.walk(path)                                     Get all filenames in a directory tree

Some properties and methods of pathlib:

Method or property    Attribute value or operation
Path.cwd() get the current directory
Path.joinpath(elements) or Path / element / element combine path parts into a new path
Path.parts a tuple containing the parts of the path as elements
Path.suffix the file extension in the path
Path.name the filename in the path
Path.exists() check whether the path exists
Path.is_dir() check whether the path is a directory
Path.is_file() check whether the path is a file
Path.is_symlink() check whether the path is a symbolic link (Windows shortcuts don't count)
Path.is_absolute() check whether the path is an absolute path
Path1.samefile(Path2) check whether two paths point to the same file
Path1.rename(Path2) rename the file
Path.mkdir([parents=True]) create a directory; if parents is True, create any necessary parent directories
Path.rmdir() delete the directory
Path.glob(pattern) get a list of files matching a wildcard pattern

2. IO operations

1. Opening files and file objects

Probably the most common file operation is opening a file and reading its data.

In Python, you open and read a file using the built-in open function and various built-in read operations. The following Python code reads one line of data from a text file named myfile:

with open('myfile', 'r') as file_object:
    line = file_object.readline()

open does not read anything from the file; instead it returns a file object that can be used to access the opened file. The file object keeps track of the file and of the current read/write position within it. All of Python's file I/O is done through file objects rather than filenames.

The first call to readline returns the first line of the file object, everything up to and including the first newline character, or the entire file if the file contains no newline characters. The next call to readline returns the second line, if it exists, and so on.

The first argument to the open function is a pathname. In the example above, an existing file in the current working directory is opened. The following code opens a file at the absolute path c:\My Documents\test\myfile:

import os
file_name = os.path.join("c:\\", "My Documents", "test", "myfile")
file_object = open(file_name, 'r')

Note also that the first example used the with keyword; that is, the file was opened using a context manager. This way of opening files manages potential I/O errors better and should generally be preferred.

2. Closing files

After you are done reading from or writing to a file object, you should close it. Closing a file object frees system resources, lets the file be read or written by other code, and in general makes the program more reliable. For small scripts, failing to close a file object usually has no great effect, because file objects are closed automatically when the script or program finishes. For larger programs, however, too many open file objects may exhaust system resources and cause the program to abort.

When you are finished with a file object, close it with the close method. The earlier program can be modified as follows:

file_object = open("myfile", 'r')
line = file_object.readline()
# perform further reads on file_object
file_object.close()

Using a context manager with the with keyword is also a good way to have files closed automatically:

with open("myfile", 'r') as file_object:
    line = file_object.readline()
    # perform further reads on file_object

3. Opening files in write and other modes

The second argument of the open function is a string indicating how the file should be opened. 'r' means "open the file for reading", 'w' means "open the file for writing" (any data already in the file is erased), and 'a' means "open the file for appending" (new data is appended to the end of the data already in the file). If you only want to read from a file, the second argument can be omitted; it defaults to 'r'.

The following code writes "Hello, World" to a file:

file_object = open("myfile", 'w') 
file_object.write("Hello, World\n") 
file_object.close()
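
Append mode works the same way; the short sketch below (using a temporary file rather than myfile) shows that 'a' preserves existing data, where 'w' would have erased it:

```python
import os
import tempfile

# Write a line in 'w' mode, then add another in 'a' mode; both survive.
path = os.path.join(tempfile.mkdtemp(), "myfile")
f = open(path, 'w')
f.write("Hello, World\n")
f.close()

f = open(path, 'a')          # append: existing data is kept
f.write("Goodbye, World\n")
f.close()

print(open(path).read())
# -> Hello, World
#    Goodbye, World
```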

open can also take other file-access modes, some of which vary across operating systems. You will rarely need most of them.

open also takes an optional third argument, which defines how reads and writes to the file are buffered. Buffering means holding data in memory until enough has accumulated to be worth the cost of a disk access, and only then actually reading from or writing to the disk. The remaining parameters of open control the encoding used for text files and how newline characters in text files are handled. Usually you can ignore these parameters, but as your use of Python becomes more advanced, you may want to learn more about them.
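
For instance, the encoding and newline parameters can be passed explicitly; a brief sketch (the temporary path is illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile")

# Open with an explicit encoding and suppress newline translation on write.
with open(path, 'w', encoding='utf-8', newline='\n') as f:
    f.write("première ligne\n")   # non-ASCII text is written as UTF-8

with open(path, 'r', encoding='utf-8') as f:
    print(f.read())  # -> première ligne
```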

4. Functions for reading and writing text and binary data

The most common text-file reading function is readline, introduced above. It reads and returns a single line from a file object, including any trailing newline. If there is nothing more to read, readline returns an empty string, which makes it easy, for example, to count the number of lines in a file:

file_object = open("myfile", 'r') 
count = 0
while file_object.readline() != "":
    count = count + 1
print(count)
file_object.close()

For this particular problem, a shorter line-counting solution uses the built-in readlines method, which reads all the lines of a file and returns them as a list of strings, one string per line, trailing newlines included:

file_object = open("myfile", 'r')
print(len(file_object.readlines()))
file_object.close()

Of course, counting the lines of a huge file with readlines may fill up your computer's memory, because it reads the entire file into memory at once. It is also possible, though unlikely, to run out of memory with readline, if you are reading a line from a huge file that contains no newline characters. To handle such situations, both readline and readlines can take an optional argument that limits the amount of data read at one time. See the Python reference documentation for details.
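
The optional argument caps how much a single call may return; a short sketch (the temporary file stands in for the huge, newline-free file):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile")
with open(path, 'w') as f:
    f.write("a very long line without a newline")

f = open(path, 'r')
chunk = f.readline(6)    # read at most 6 characters of the current line
print(chunk)             # -> a very
chunk2 = f.readline(6)   # continues where the previous call stopped
print(chunk2)            # -> " long "
f.close()
```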

Another way to iterate over all the lines of a file is to treat the file object as an iterator in a for loop:

file_object = open("myfile", 'r')
count = 0
for line in file_object:
    count = count + 1
print(count)
file_object.close()

The iterator approach has the advantage that lines are read into memory only as needed, so even very large files pose no memory problem. It is also simpler and more readable.

A possible problem with the read methods arises on Windows and Macintosh machines if open is used in text mode, that is, without a b in the mode string; text-mode character conversions then take place. In text mode, a Macintosh system converts every "\r" to "\n", while Windows converts "\r\n" pairs to "\n". You can specify how newlines are handled with the newline parameter when opening the file: setting newline to "\n", "\r", or "\r\n" forces only that string to be treated as a newline:

input_file = open("myfile", newline="\n")

This example forces only "\n" to be treated as a newline. If the file is opened in binary mode, the newline parameter is not needed, because every byte returned is exactly what is in the file.

The write methods corresponding to readline and readlines are write and writelines. Note that there is no writeline method. write writes a single string, which can span multiple lines if it contains newline characters, as in:

myfile.write("Hello")

write does not write out a newline after it finishes writing its argument string; if you want a newline in the output, you must add it yourself. If the file was opened in text mode ('w'), every "\n" character is converted on output to the line-ending convention of the platform: "\r\n" on Windows, "\r" on (classic) Macintosh. Again, specifying the newline parameter when opening the file avoids this automatic conversion.

writelines is something of a misnomer, because it does not necessarily write lines. It takes a list of strings as its argument and writes them, one after another, to the given file object, without writing newlines. If the strings in the list end with newlines, they are written as lines; otherwise they run together in the file. writelines is, however, the exact inverse of readlines: applied to the list that readlines returns, it writes a file identical to the one readlines read in.

Assuming that myfile.txt is a text file, the following code creates an exact copy of it called myfile2.txt:

input_file = open("myfile.txt", 'r')
lines = input_file.readlines()
input_file.close()
output = open("myfile2.txt", 'w')
output.writelines(lines)
output.close()

Sometimes you may want to read all the data in a file into a single bytes object, particularly when the data is not a string and you need to get it all into memory so you can treat it as a sequence of bytes. Or you may want to read data from a file as fixed-size bytes objects.

For example, you may be reading data without explicit newlines, where every line is assumed to be a sequence of characters of a fixed size. To do so, use the read method. Without an argument, read reads all of the file from the current position and returns the data as a bytes object. With a single integer argument, it reads up to that number of bytes (fewer if the file does not contain that much data) and returns a bytes object of the given size:

input_file = open("myfile", 'rb')
header = input_file.read(4)
data = input_file.read()
input_file.close()

The first line opens a file in binary read mode, the second line reads the first 4 bytes as a header, and the third line reads the rest of the file as a single block of data.

Keep in mind that files opened in binary mode deal only in bytes, not strings. To use the data as strings, you must decode the bytes objects to string objects. This point is often important when dealing with network protocols, where data streams often behave like files but must be interpreted as bytes, not strings.
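
Decoding is a single method call; a minimal sketch, assuming the bytes happen to be UTF-8 encoded:

```python
# Bytes read from a binary-mode file must be decoded before they can be
# used as text; str.encode is the inverse operation.
raw = b'header then text: caf\xc3\xa9'
text = raw.decode('utf-8')          # bytes -> str
print(text)                         # -> header then text: café
print(text.encode('utf-8') == raw)  # -> True
```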

5. Reading and writing files with pathlib

In addition to its path-manipulation features, a Path object can be used to read and write text and binary files. This is convenient because no open or close operations are needed, and separate methods are provided for text and binary operations.

One limitation, however, is that there is no way to append with Path methods, because writing replaces any existing contents:

>>> from pathlib import Path
>>> p_text = Path('my_text_file')
>>> p_text.write_text('Text file contents')
18
>>> p_text.read_text()
'Text file contents'
>>> p_binary = Path('my_binary_file')
>>> p_binary.write_bytes(b'Binary file contents')
20
>>> p_binary.read_bytes()
b'Binary file contents'

6. Screen input/output and redirection

The built-in input function can be used to prompt for and read a string from the user:

>>> x = input("enter file name to use: ")
enter file name to use: myfile
>>> x
'myfile'

The prompt line is optional, and the newline at the end of the input line is stripped. To read in numbers with input, you need to explicitly convert the string that input returns to the proper numeric type. The following example uses int:

>>> x = int(input("enter your number: "))
enter your number: 39
>>> x
39

input writes its prompt to the standard output and reads from the standard input. Lower-level access to the standard output, standard input, and standard error streams is available through the sys module, which has sys.stdin, sys.stdout, and sys.stderr attributes. These can be treated as specialized file objects.

For sys.stdin, you can use the read, readline, and readlines methods. For sys.stdout and sys.stderr, you can use the standard print function as well as the write and writelines methods, which operate as they do for other file objects:

>>> import sys
>>> print("Write to the standard output.")
Write to the standard output.
>>> sys.stdout.write("Write to the standard output.\n")
Write to the standard output.
30                               # sys.stdout.write returns the number of characters written
>>> s = sys.stdin.readline()
An input line
>>> s
'An input line\n'

You can redirect the standard input to read from a file; similarly, the standard output or standard error can be set to write to a file. They can later be restored to their original values programmatically using sys.__stdin__, sys.__stdout__, and sys.__stderr__:

>>> import sys
>>> f = open("outfile.txt", 'w')
>>> sys.stdout = f
>>> sys.stdout.writelines(["A first line.\n", "A second line.\n"])     # outfile.txt now contains two lines: "A first line." and "A second line."
>>> print("A line from the print function")
>>> 3 + 4      # outfile.txt now contains three lines: "A first line.", "A second line.", and "A line from the print function"
>>> sys.stdout = sys.__stdout__
>>> f.close()
>>> 3 + 4
7

The print function can also be redirected to a file without changing the standard output:

>>> import sys
>>> f = open("outfile.txt", 'w')
>>> print("A first line.\n", "A second line.\n", file=f)   ⇽---  此时outfile.txt中包含2行数据:"A first line."和"A second line. "
>>> 3 + 4
7
>>> f.close()
>>> 3 + 4
7

While the standard output is redirected, prompts and error messages (tracebacks) are still displayed, but any other output is not. If you use IDLE, this example with sys.__stdout__ does not work as shown; you have to use the interpreter's interactive mode directly.

Input/output redirection is usually applied when running script files or programs. But if you are using the interactive mode on Windows, you may want to temporarily redirect the standard output to a file in order to capture output that might otherwise scroll off the screen. The small module below implements a set of functions that provide this output-capturing ability.

File mio.py:

"""mio: module, (contains functions capture_output, restore_output,
     print_file, and clear_file )"""
import sys
_file_object = None

def capture_output(file="capture_file.txt"):
    """capture_output(file='capture_file.txt'): redirect the standard
    output to 'file'."""
    global _file_object
    print("output will be sent to file: {0}".format(file))
    print("restore to normal by calling 'mio.restore_output()'")
    _file_object = open(file, 'w')
    sys.stdout = _file_object

def restore_output():
    """restore_output(): restore the standard output back to the
             default (also closes the capture file)"""
    global _file_object
    sys.stdout = sys.__stdout__
    _file_object.close()
    print("standard output has been restored back to normal")

def print_file(file="capture_file.txt"):
    """print_file(file="capture_file.txt"): print the given file to the
         standard output"""
    f = open(file, 'r')
    print(f.read())
    f.close()

def clear_file(file="capture_file.txt"):
    """clear_file(file="capture_file.txt"): clears the contents of the
         given file"""
    f = open(file, 'w')
    f.close()

In this code, the capture_output() function redirects the standard output to a file, which defaults to capture_file.txt. The restore_output() function restores the standard output to its default. Assuming capture_output has not been executed, print_file() prints the given file to the standard output, and clear_file() empties the contents of the given file.
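
A minimal sketch of what capture_output/restore_output do under the hood (a temporary path replaces the default capture_file.txt):

```python
import os
import sys
import tempfile

# Swap sys.stdout for a file object, print something, then swap it back.
capture_path = os.path.join(tempfile.mkdtemp(), "capture_file.txt")
capture_file = open(capture_path, 'w')
sys.stdout = capture_file
print("this line goes into the file, not the screen")
sys.stdout = sys.__stdout__      # restore the real standard output
capture_file.close()

with open(capture_path) as f:
    contents = f.read()
print(contents.strip())  # -> this line goes into the file, not the screen
```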

7. Reading structured binary data with the struct module

Generally speaking, you are unlikely to use Python to read or write your own custom binary data. For simple storage needs, text or bytes input and output is usually best. For more sophisticated needs, Python provides pickle, which can easily read and write arbitrary objects. Compared with handling custom binary data directly, pickle is far less error-prone and is highly recommended.

But there is at least one situation in which you may need to know how to read and write binary data: when the files you are dealing with are generated or used by other programs.

As noted above, if a file is opened in binary mode, Python supports explicit binary input and output using bytes rather than strings. But because most binary files rely on a particular structure to parse the data, writing your own code to read the data and split it correctly into variables is often more trouble than it is worth. Instead, you can use the standard struct module to interpret such byte sequences as formatted data with a specific meaning.

Assume that you want to read a binary file called data, containing a series of records generated by a C program. Each record consists of a C short integer, a C double, and a sequence of 4 characters that should be interpreted as a 4-character string. The data is to be read into a Python list of tuples, with each tuple containing an integer, a float, and a string.

The first thing to do is to define a format string that the struct module can interpret, telling it how the data in a record is packed. The format string uses characters that struct understands to indicate each of a record's data types. For example, the character 'h' stands for a C short integer, 'd' stands for a C double, and 's' stands for a string. Any of these may be preceded by an integer count; here, '4s' denotes a string consisting of four characters. So the format string for these records is 'hd4s'. struct can interpret a wide range of numeric, character, and string formats.

Before reading records from the file, you need to know how many bytes to read at a time. Fortunately, struct includes a calcsize function, which takes a format string as its argument and returns the number of bytes occupied by data packed in that format.

To read each record, use the read method. Then struct.unpack conveniently returns a tuple of values by parsing the record according to the format string. The program to read the binary data file is remarkably simple:

import struct
record_format = 'hd4s'
record_size = struct.calcsize(record_format)
result_list = []
input = open("data", 'rb')
while True:
    record = input.read(record_size)     # read in a single record
    if record == b'':     # ❶ in binary mode, read returns an empty bytes object at end of file
        input.close()
        break
    result_list.append(struct.unpack(record_format, record))   # unpack the record into a tuple and append it to the result list

The loop exits when the record is empty, meaning the end of the file has been reached ❶. Note that there is no consistency check here: if the last record has a partial (wrong-sized) length, struct.unpack raises an error.

As you may have guessed, struct also provides the ability to convert Python values into packed byte sequences. The conversion is done by the struct.pack function, which is almost, but not quite, the inverse of struct.unpack. "Almost" because struct.unpack returns a tuple of Python values, whereas struct.pack does not take a tuple: its first argument is the format string, followed by the individual values that fill the format. The following code produces a binary record in the format used in the previous example:

>>> import struct
>>> record_format = 'hd4s'
>>> struct.pack(record_format, 7, 3.14, b'gbye')
b'\x07\x00\x00\x00\x00\x00\x00\x00\x1f\x85\xebQ\xb8\x1e\t@gbye'

struct offers still more power. Special characters can be inserted into the format string to indicate that the data should be read or written in big-endian, little-endian, or machine-native byte order (the default is machine-native), and to indicate whether a C short integer should be the machine-native size (the default) or the standard C size.
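
For instance, the '<' and '>' prefixes select little- and big-endian standard sizes; a quick sketch:

```python
import struct

# The same short integer (258 = 0x0102) packed in both byte orders.
little = struct.pack('<h', 258)   # standard-size short, little-endian
big = struct.pack('>h', 258)      # standard-size short, big-endian
print(little)  # -> b'\x02\x01'
print(big)     # -> b'\x01\x02'

# Unpacking with the matching prefix recovers the original value.
print(struct.unpack('>h', big)[0])  # -> 258
```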

8. Saving objects to files with pickle

Python can write any data structure into a file, read it back out, and re-create the structure, all with just a few functions. This capability is unusual but potentially very useful, because it can save many pages of code whose only purpose would be to dump the state of a program into a file, and about as many again to read that state back in.

Python provides this capability through the pickle module, which is both powerful and easy to use. Assume that the entire state of a program is held in three variables: a, b, and c. The following code saves this state into a file called state:

import pickle
.
.
.
file = open("state", 'wb')
pickle.dump(a, file)
pickle.dump(b, file)
pickle.dump(c, file)
file.close()

It does not matter what was stored in a, b, and c. The contents might be as simple as numbers or as complicated as, say, a list of dictionaries containing instances of user-defined classes. pickle.dump saves it all.

To read the data back in later, do the following:

import pickle 
file = open("state", 'rb')
a = pickle.load(file)
b = pickle.load(file)
c = pickle.load(file)
file.close()

pickle.load restores the data that was previously saved into the variables a, b, and c.

The pickle module can save almost anything in this way. It can handle lists, tuples, numbers, strings, dictionaries, and just about any object composed of these types, including all class instances. It also handles shared objects, cyclic references, and other complex memory structures correctly: shared objects are stored only once and restored as shared objects, not as identical copies. But code objects (what Python uses to store byte-compiled code) and system resources (such as files or sockets) cannot be pickled.
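
The shared-object behavior can be seen directly; a minimal sketch:

```python
import pickle

# The same inner list appears twice in the structure; after a round
# trip the two entries still refer to one object, not two equal copies.
shared = [1, 2, 3]
data = {"first": shared, "second": shared}

restored = pickle.loads(pickle.dumps(data))
print(restored["first"] is restored["second"])  # -> True

restored["first"].append(4)
print(restored["second"])  # -> [1, 2, 3, 4]
```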

More often than not, you will not want to save the whole state of a program with pickle. For example, most applications can have multiple documents open at one time; saving the entire state of the program would then save all open documents in one file. An easy and effective way of saving and restoring only the data of interest is to write a save function that stores everything you want to save into a dictionary and then uses pickle to save the dictionary. You can then use a complementary restore function to read the dictionary back in (again using pickle) and assign the values in the dictionary to the appropriate variables.

This technique also has the advantage that there is no possibility of reading the values back in an order different from the order in which they were stored. Applying this approach to the previous example gives code that looks like this:

import pickle
.
.
.
def save_data():
    global a, b, c
    file = open("state", 'wb')
    data = {'a': a, 'b': b, 'c': c}
    pickle.dump(data, file)
    file.close()

def restore_data():
    global a, b, c
    file = open("state", 'rb')
    data = pickle.load(file)
    file.close()
    a = data['a']
    b = data['b']
    c = data['c']

This example is somewhat contrived; you will rarely need to save the state of a few top-level variables from an interactive session.

Here is a more realistic application: a program that calls a fairly time-consuming function of three arguments. Over the course of a run, many of the calls to that function end up using the same arguments. A significant performance gain can be had by caching the results in a dictionary, keyed by the arguments that produced them. But the program may be run many times, in separate sessions, over a span of days, weeks, or months.

By pickling the cache, you can avoid starting the computation from scratch each time a new session begins.

The code below is a simplified version of a module that might accomplish this.

File sole.py:

"""sole module: contains functions sole, save, show"""
import pickle
_sole_mem_cache_d = {}
_sole_disk_file_s = "solecache"
file = open(_sole_disk_file_s, 'rb')     ⇽---  加载模块时执行的初始化代码
_sole_mem_cache_d = pickle.load(file)
file.close()

def sole(m, n, t):        ⇽---  公有函数
    """sole(m, n, t): perform the sole calculation using the cache."""
    global _sole_mem_cache_d
    if _sole_mem_cache_d.has_key((m, n, t)):
        return _sole_mem_cache_d[(m, n, t)]
    else:
       # 做一些消耗时间的计算
       _sole_mem_cache_d[(m, n, t)] = result
        return result

def save():
    """save(): save the updated cache to disk."""
    global _sole_mem_cache_d, _sole_disk_file_s
    file = open(_sole_disk_file_s, 'wb')
    pickle.dump(_sole_mem_cache_d, file)
    file.close()

def show():
    """show(): print the cache"""
    global _sole_mem_cache_d
    print(_sole_mem_cache_d)

This code assumes that the cache file already exists. If you want to try it out, initialize the cache file with the following steps:

>>> import pickle
>>> file = open("solecache",'wb')
>>> pickle.dump({}, file)
>>> file.close()

Of course, the line "# do some time-consuming calculation here" needs to be replaced with the actual calculation. Note that in production code, this is a situation in which you might use an absolute pathname for the cache file. Also, concurrency is not handled here: if two people run overlapping sessions, only the cache of the last person to save is preserved. If that could be a problem, you can limit the window of overlap significantly by using the dictionary's update method in the save function.

Although pickled objects are valuable in scenarios like the one above, you should be aware of pickle's drawbacks:

  • Pickling is not particularly fast as a serialization method, nor is it space-efficient. Even storing serialized objects in JSON format is faster and results in smaller files on disk.
  • pickle is not secure: loading maliciously crafted pickle content can execute arbitrary code on your machine. Therefore, you should avoid pickle whenever there is any possibility at all that the pickle file is accessible to anyone who might alter it.
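
A small sketch of the JSON alternative (not a benchmark; the sample data is illustrative): for JSON-compatible structures, json.dumps/json.loads provide a safer, plain-text round trip.

```python
import json
import pickle

# Serialize the same structure with pickle and with JSON.
data = {"name": "widget", "sku": 1234, "price": 9.99, "tags": ["a", "b"]}

pickled = pickle.dumps(data)     # opaque bytes; only Python can read them
as_json = json.dumps(data)       # human-readable text; loadable anywhere

print(type(pickled), type(as_json))
print(json.loads(as_json) == data)  # -> True
```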

9. Saving objects with shelve

A shelve object can be thought of as a dictionary that stores its data in a file on disk rather than in memory. This means you still have convenient access to the data by key, but the amount of data is no longer limited by memory size.

This section will probably be of most interest to those whose work involves storing or accessing pieces of data in big files, because Python's shelve module does exactly that: it permits reading or writing pieces of data in big files without reading or writing the whole file. For applications that access large files frequently (such as database applications), the savings in access time can be considerable. Like the pickle module (which it uses), shelve is simple.

We will explore shelve with an address-book application. This sort of data is usually small enough that the entire address file can be read in when the application starts and written out when it finishes. If you are so friendly that your address book is too big for this approach, shelve will handle it, and you will not need to worry about its size at all.

Assume that each entry in the address book is a tuple of three elements, giving a person's name, phone number, and address, indexed by last name. This setup is simple enough that the application can be developed in an interactive Python shell session.

First, import the shelve module and open the address book. shelve.open creates the file if it does not already exist:

>>> import shelve
>>> book = shelve.open("addresses")

Now add a few entries. Note that the object returned by shelve.open is treated just like a dictionary, although it is a dictionary that accepts only strings as keys:

>>> book['flintstone'] = ('fred', '555-1234', '1233 Bedrock Place')
>>> book['rubble'] = ('barney', '555-4321', '1235 Bedrock Place')

Finally, close the file and end the session:

>>> book.close()

Then start Python again in the same directory and open the same address book:

>>> import shelve
>>> book = shelve.open("addresses")

This time, instead of entering data, check whether what was entered earlier is still there:

>>> book['flintstone']
('fred', '555-1234', '1233 Bedrock Place')

The address book file created by shelve.open in the first interactive session behaves like a persistent dictionary: the data entered earlier has been saved to disk, even though no explicit disk writes were performed. That is exactly what shelve does.

More generally, shelve.open returns a shelf object that permits basic dictionary operations: key assignment and lookup, del, in, and the keys method. But unlike a normal dictionary, a shelf object stores its data on disk rather than in memory. Unfortunately, shelf objects do have one significant restriction compared with dictionaries: their keys can only be strings, whereas a dictionary key can be of quite a few types.
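
These dictionary-style operations can be sketched in one short session (a temporary directory is used here so no files are left behind):

```python
import os
import shelve
import tempfile

# Assignment, lookup, `in`, del, and keys() all work on a shelf object.
tmp = tempfile.mkdtemp()
book = shelve.open(os.path.join(tmp, "addresses"))
book['flintstone'] = ('fred', '555-1234', '1233 Bedrock Place')
book['rubble'] = ('barney', '555-4321', '1235 Bedrock Place')

has_rubble = 'rubble' in book
keys_before = sorted(book.keys())
print(has_rubble, keys_before)   # -> True ['flintstone', 'rubble']

del book['rubble']               # entries can be deleted like dictionary items
keys_after = sorted(book.keys())
print(keys_after)                # -> ['flintstone']
book.close()
```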

It is important to understand the advantage shelf objects offer over dictionaries when handling large data sets: shelve.open makes the file accessible without reading the whole shelf file into memory. The file is accessed only as needed (typically when an element is looked up), and its structure is maintained in a way that makes lookups very fast. Even if the data file is very large, only a couple of disk accesses are required to locate the desired object in the file. This can improve a program in several ways. The program may start faster, because it does not need to read a potentially large file into memory. It may execute faster, because more physical memory is left available to the rest of the program, so less code is swapped out to virtual memory. And you can operate on data sets that otherwise would not fit in memory at all.

There are a few restrictions on using the shelve module. As mentioned, shelf keys must be strings, but any Python object that can be pickled can be stored under a shelf key. Also, shelf objects are not suitable for multiuser databases, because they provide no control over concurrent access. Make sure you close a shelf object when you are done with it; closing is sometimes required for changes (entries added or deleted) to be written back to disk.

The caching routine in the sole.py example above is an ideal candidate for shelve: the user would no longer need to explicitly save the working state to disk. The only possible issue is the loss of low-level control over when writes occur.

3. Handling data files

1. File-saving schemes

Many systems continuously generate series of data files. They might be log files from an e-commerce server or a regularly running process, a nightly product feed, an automated feed of online ad data, historical stock-trade data, or any of hundreds of other sources. These files are often plain, uncompressed text files of raw data that feed into, or are by-products of, other processes.

Unimpressive as they sound, the data they contain has some potential value, so the files cannot be discarded at the end of the day, which means that their number grows daily. Over time, files accumulate until handling them by hand is no longer workable and until the amount of storage they consume becomes unacceptable.

A typical case is a product feed received daily. The data may come from a supplier or may be the output of online sales, but the basic elements are the same.

Consider a product feed from a supplier. The feed arrives once a day, with one row of data for each item supplied. Each row has fields for the supplier's stock-keeping unit (SKU) number, a brief description of the item, the item's price, height, width, and length, and its status (in stock or on order), and, depending on the business, possibly other fields as well.

In addition to this basic file, you might receive others, perhaps with data on related items, more detailed item attributes, or something else. In that case, you end up receiving several files every day, with the same filenames, landing in the same directory for processing.

Suppose, then, that you receive three related files every day: item_info.txt, item_attributes.txt, and related_items.txt. These three files arrive daily and need to be processed. If processing were all that was required, there would be little to worry about: you could just replace the previous day's files with the new ones and process them. But what if the data cannot be thrown away?

The raw data may need to be kept so that earlier files can be consulted if a question arises about the accuracy of processing, or the data's changes over time may need to be tracked. Whatever the reason, the requirement to keep the files means you need to handle them somehow.

Of the possible schemes, the simplest is to stamp the files with the date they were received and move them into an archive folder. Each new set of files can then be received, processed, renamed, and moved out of the way, so the process can be repeated without losing any data.

After several repetitions, the directory structure might look like this:

working/     # main working directory, holding the files currently awaiting processing
    item_info.txt
    item_attributes.txt
    related_items.txt
    archive/     # subdirectory into which processed files are archived
        item_info_2017-09-15.txt
        item_attributes_2017-09-15.txt
        related_items_2017-09-15.txt
        item_info_2017-09-16.txt
        item_attributes_2017-09-16.txt
        related_items_2017-09-16.txt
        item_info_2017-09-17.txt
        item_attributes_2017-09-17.txt
        related_items_2017-09-17.txt

Consider the steps needed to accomplish this. First, the files must be renamed so that the current date is added to their names. To do so, you need to get each filename, and then the stem of the filename without its extension. Once you have the stem, you append a string based on the current date, add the extension back on the end, and then perform the rename while moving the file into the archive directory.

There are several ways to get the filenames. If you are sure the filenames will always be exactly the same and there are not many of them, you could hard-code them. A safer, more reliable method, however, is to use the pathlib module and the glob method of path objects, like this:

>>> import pathlib
>>> cur_path = pathlib.Path(".")
>>> FILE_PATTERN = "*.txt"
>>> path_list = cur_path.glob(FILE_PATTERN)
>>> print(list(path_list))
[PosixPath('item_attributes.txt'), PosixPath('related_items.txt'),
     PosixPath('item_info.txt')]

Now you can step through the paths that match FILE_PATTERN and apply the required renaming. Remember that the date must be added to each filename and that the renamed files must be moved into the archive directory. With pathlib, the whole operation might look like the following code.

File files_01.py:

import datetime
import pathlib

FILE_PATTERN = "*.txt"     ⇽---  设置文件匹配模式和归档目录
ARCHIVE = "archive"     ⇽---  为了这行代码能够运行,“archive”目录必须存在

if __name__ == '__main__':

    date_string = datetime.date.today().strftime("%Y-%m-%d")     # use a date object from the datetime library to build a string based on today's date

    cur_path = pathlib.Path(".")
    paths = cur_path.glob(FILE_PATTERN)

    for path in paths:
        new_filename = "{}_{}{}".format(path.stem, date_string, path.suffix)
        new_path = cur_path.joinpath(ARCHIVE, new_filename)     # create a new path object from the current path, the archive directory, and the new filename
        path.rename(new_path)     # rename (move) the file in one step

It is worth noting that path objects make this operation simpler, because no special parsing is needed to separate the filename's stem from its suffix. The operation is also simpler than you might expect because the rename method can in effect move a file, as long as the new path includes the new location.

This script is very simple, and it does its job efficiently with very little code.


2. Adding more directory structure

The file-saving scheme above works, but it does have some drawbacks. As the files accumulate, managing them may become more troublesome: after a year, 365 sets of related files will sit in one directory, and the only way to find the files that go together is by inspecting their names. And, of course, if the files arrive more frequently, or if each set contains more related files, the hassle is greater.

To mitigate this problem, you can change the way the files are archived. Instead of changing the filenames to include the date received, you can create a separate subdirectory for each set of files and name it after the date of receipt.

The directory structure might look like this:

working/     # main working directory, holding the files currently awaiting processing
    item_info.txt
    item_attributes.txt
    related_items.txt
    archive/     # main subdirectory into which processed files are archived
        2016-09-15/      # subdirectory for one set of files, named for the date received
            item_info.txt
            item_attributes.txt
            related_items.txt
        2016-09-16/      # subdirectory for one set of files, named for the date received
            item_info.txt
            item_attributes.txt
            related_items.txt
        2016-09-17/     # subdirectory for one set of files, named for the date received
            item_info.txt
            item_attributes.txt
            related_items.txt

The advantage of this scheme is that each set of files is grouped together. However many sets of files you receive, and however many files are in each set, it is easy to find all the files of a particular set.

As it turns out, archiving the files into subdirectories is not much more work than the first solution. The only additional step is creating the subdirectory before renaming the files. The code below is one way to perform that step.

File files_02.py:

import datetime
import pathlib

FILE_PATTERN = "*.txt"
ARCHIVE = "archive"

if __name__ == '__main__':

    date_string = datetime.date.today().strftime("%Y-%m-%d")

    cur_path = pathlib.Path(".")

    new_path = cur_path.joinpath(ARCHIVE, date_string)
    new_path.mkdir()     # note that the directory needs to be created only once, just before the files are moved in

    paths = cur_path.glob(FILE_PATTERN)

    for path in paths:
        path.rename(new_path.joinpath(path.name))

This solution groups related files together, which makes managing them as sets somewhat easier.

3. Saving storage space: compression and grooming

So far, the focus has been on managing the sets of files received. But as the data files accumulate over time, the amount of storage they need may itself become a concern. When that happens, there are several options. One is to get a bigger disk. Particularly on a cloud-based platform, this strategy may be both simple and economical.

But keep in mind that adding storage does not really solve the problem; it merely postpones it.

1. Compressing files

If the space the files occupy has become a problem, the next option to consider is compressing them. There are many ways to compress a file or a group of files, but they are generally similar. Here, we consider archiving each day's data files into a single zip file. If the files are mainly text and are fairly large, the storage savings from compression can be impressive.

The following code uses the date string with a .zip extension as the name of each zip file.

In the second scheme above, a new directory was created in the archive directory and the files were moved into it; with zip files, the directory structure will look like this:

working/     # main working directory, where the current files are processed and then removed after being zipped into the archive
    archive/ 
        2016-09-15.zip     # each zip file contains that day's item_info.txt,
                                  item_attributes.txt, and related_items.txt
        2016-09-16.zip
        2016-09-17.zip

Clearly, to use zip files, some of the earlier steps need to change.

The key additions to the new code are the import of the zipfile library and the code that uses it to create a new zip-file object in the archive directory. The zip-file object can then be used to write the data files into the new zip file. Finally, because the files are no longer actually being moved, the originals must be removed from the working directory. One solution is the following code.

File files_03.py:

import datetime
import pathlib
import zipfile     # import the zipfile library

FILE_PATTERN = "*.txt"
ARCHIVE = "archive"
if __name__ == '__main__':

    date_string = datetime.date.today().strftime("%Y-%m-%d")

    cur_path = pathlib.Path(".")
    paths = cur_path.glob(FILE_PATTERN)

    zip_file_path = cur_path.joinpath(ARCHIVE, date_string + ".zip")      # create the path to a zip file in the archive directory

    zip_file = zipfile.ZipFile(str(zip_file_path), "w")   # open the new zip file for writing; str() converts the path object to a string

    for path in paths:
        zip_file.write(str(path))     # write the current file into the zip file
        path.unlink()     # remove the current file from the working directory

2. Grooming files

Zipping the data files into archive files saves a lot of space, and that may well be all you need. But if there are many files, or if the files do not compress much (JPEG image files, for example), storage may still run short. It may also be that the data does not change very much, so keeping an archived copy of every set is unnecessary. That is, although it may be useful to keep every day's data for the past week or month, keeping every set of data much longer than that may not be worth the storage. For data more than a few months old, it may be acceptable to keep one set of files a week, or even one set a month.

The process of removing files after they reach a certain age is sometimes called grooming. Suppose that, after several months of receiving a set of data files daily and zipping them into archives, you are asked to keep only one file a week for anything more than one month old.

The simplest grooming code removes all the files that are no longer needed: for files more than a month old, everything except one file per week is deleted. In designing the code, it helps to answer two questions:

  • Because one file a week must be kept, would it be much simpler just to pick one day of the week to keep?
  • How often should grooming run: daily, weekly, or monthly? If you decide to groom daily, it might make sense to combine the grooming with the archiving code. If grooming needs to happen only weekly or monthly, the two operations should be separate pieces of code.

For this example, to keep things clear, we will write separate grooming code that can be run at any interval and that removes all unneeded files. We will also assume that, for files more than a month old, only the file received on a Tuesday needs to be kept.

The code below is one example of such grooming code.

File files_04.py:

from datetime import datetime, timedelta
import pathlib
import zipfile

FILE_PATTERN = "*.zip"
ARCHIVE = "archive"
ARCHIVE_WEEKDAY = 1      # keep files received on this weekday; 1 is Tuesday (Monday is 0)
if __name__ == '__main__':
    cur_path = pathlib.Path(".")
    zip_file_path = cur_path.joinpath(ARCHIVE)
    paths = zip_file_path.glob(FILE_PATTERN)
    current_date = datetime.today()    # get a datetime object for today
    for path in paths:
        name = path.stem    # path.stem returns the filename without its extension
        path_date = datetime.strptime(name, "%Y-%m-%d")    # strptime parses a string into a datetime object according to the given format string
        path_timedelta = current_date - path_date    # subtracting two dates yields a timedelta object
        if path_timedelta > timedelta(days=30) and path_date.weekday() != ARCHIVE_WEEKDAY:    # timedelta(days=30) creates a 30-day timedelta; weekday() returns the day of the week as an integer, with Monday as 0
            path.unlink()

This code shows how Python's datetime and pathlib libraries can be combined to groom files by date in just a few lines. Because the archive filenames were generated from the dates of receipt, you can fetch their paths with the glob method, extract the stem of each filename, and parse it into a datetime object with strptime.

Then the datetime module's timedelta objects and the weekday() method give you each file's age and day of the week, and the unneeded files are unlinked (removed).

In summary:

  • The pathlib module can greatly simplify file operations such as finding a filename's stem and extension, moving and renaming files, and matching wildcards.
  • As the number and complexity of data files grow, an automated archiving solution becomes vital, and Python offers several simple ways to build one.
  • Compressing and grooming data files can dramatically reduce the storage they consume.

Origin: blog.csdn.net/qq_35029061/article/details/130137929