Understanding Python's file path operations in one article

To read a file in code, you first need to know its path. With a single file this is simple: copy the folder path and the file name. More often, though, you work with a large number of files stored in one or several folders according to some naming rule. This article gives a brief overview of how to handle that situation. The examples use Python, but the concepts apply generally.

1 What is a file path

Simply put, a file path is the storage location of a file. It includes the drive letter (that is, the disk partition), the folders (directories) that contain the file, and finally the file name with its extension.

A path describes the chain of folders you pass through to reach the file on disk. Paths are either absolute or relative: an absolute path starts from the root folder, while a relative path starts from the current folder.

1.1 Absolute path

Absolute paths are written differently on different operating systems. On Windows, for example, the path of a file might look like this:

D:\files\data\ndvi.tif

where:

  • D:\: The root folder, which is the drive letter where the file is located, that is, the D drive.
  • D:\files\data: The path of the folder that contains the file, that is, the data subfolder of the files folder on the D drive.
  • ndvi.tif: The file name. ndvi is the base name, which identifies the file; tif is the extension, which indicates the file type. The two are separated by a dot (.).

Absolute paths on Linux and macOS differ from those on Windows. The main differences are:

  • The root folder is different. On Windows the root folder is a drive letter, such as D:\ or C:\. On Linux and macOS the root folder is /. You can think of all files as living on one disk, so there is no need for letters such as C or D to tell disks apart.
  • The separators are different. Windows uses the backslash \ as the separator between folders, while macOS and Linux use the forward slash /.
  • Case sensitivity is different. Folder and file names are case-insensitive on Windows and, with the default file system settings, on macOS, but they are case-sensitive on Linux.

Paths to attached volumes:

Additional volumes, such as DVD drives or USB flash drives, appear differently on different operating systems. On Windows, they appear as new root drives with their own letters, such as D:\ or E:\. On macOS, they appear as new folders under the /Volumes folder. On Linux, they appear as new folders, typically under /mnt or /media.

1.2 Relative path

A relative path locates the target file (or folder) starting from the current working directory.

The symbols for relative paths are as follows:

  • ./ at the start: the current directory, meaning the target is in the current working directory; the ./ can be omitted;
  • ../ at the start: go up one level, meaning the target is in the parent of the current directory;
  • ../../ at the start: go up two levels, that is, the parent of the parent directory;
  • / at the start: the root directory. Strictly speaking, a path that starts with / is an absolute path.

Example of relative path usage:
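
As a minimal sketch of how these prefixes resolve, the snippet below joins a few relative paths onto a working directory. The directory /home/user/project/scripts and the file names are made up for illustration, and posixpath is used so that the Linux/macOS rules apply on any operating system.

```python
import posixpath

# Hypothetical working directory, used only for illustration.
cwd = "/home/user/project/scripts"

def resolve(relative, base=cwd):
    """Join a relative path onto a base directory and collapse '.' and '..'."""
    return posixpath.normpath(posixpath.join(base, relative))

print(resolve("./run.py"))          # /home/user/project/scripts/run.py
print(resolve("run.py"))            # same result, "./" can be omitted
print(resolve("../data/ndvi.tif"))  # /home/user/project/data/ndvi.tif
print(resolve("../../README.md"))   # /home/user/README.md
print(resolve("/etc/hosts"))        # /etc/hosts, the base is ignored
```

The last call shows why a leading / behaves differently from the other prefixes: the path already starts at the root, so the working directory plays no part.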

2 Python's operations on paths

2.1 How to represent the file path in Python

In Python, file paths are generally stored as strings. Note, however, that the backslash \ is the escape character in Python string literals. Because Windows uses \ as its separator, keep the following in mind when writing a Windows path, taking D:\files\data\ndvi.tif as an example:

  • Prefix the string with r to make it a raw string, in which backslashes are not treated as escape characters. For example, r"D:\files\data\ndvi.tif";
  • Escape each backslash with another backslash. For example, "D:\\files\\data\\ndvi.tif";
  • Replace the separator with /. Python also recognizes / correctly on Windows. For example, "D:/files/data/ndvi.tif".

On Linux and macOS, just put the path in single or double quotes.
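
The short check below is one way to confirm that the three spellings describe the same path. It uses ntpath, the Windows flavour of os.path, which can be imported on any system. The path D:\temp\new.tif in the last part is a made-up example chosen because \t and \n are real escape sequences.

```python
import ntpath  # Windows path rules, importable on any operating system

raw = r"D:\files\data\ndvi.tif"
escaped = "D:\\files\\data\\ndvi.tif"
forward = "D:/files/data/ndvi.tif"

# The raw string and the escaped string are the same sequence of characters.
print(raw == escaped)                   # True
# Normalizing the forward-slash form gives the backslash form.
print(ntpath.normpath(forward) == raw)  # True

# Without r or doubled backslashes, \t and \n turn into a tab and a newline.
unescaped = "D:\temp\new.tif"
print("\t" in unescaped, "\n" in unescaped)  # True True
```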

2.2 Create a new folder

Use the os.mkdir() function to create a new folder (directory), and the os.path.isdir() function to check whether a path is an existing folder, as follows:

import os

path = "D:/files/data"
# Create the folder only if it does not exist yet
if not os.path.isdir(path):
    os.mkdir(path)

Note that os.mkdir() can only create a single level of directory. In the code above, data can be created only if the directory "D:/files" already exists. To create multi-level directories, use os.makedirs(), for example:

import os

path = "D:/files/data"
# Create the folder and any missing parent folders
if not os.path.isdir(path):
    os.makedirs(path)
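
As an alternative sketch, os.makedirs() accepts an exist_ok argument, which removes the need for a separate isdir() check. The example runs inside a temporary directory so it does not touch real folders. The files/data layout simply mirrors the path used above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "files", "data")
    # Creates both "files" and "data"; no error if they already exist.
    os.makedirs(path, exist_ok=True)
    os.makedirs(path, exist_ok=True)  # calling it again is harmless
    created = os.path.isdir(path)

print(created)  # True
```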

The os.listdir() function returns a list of the names of the files and folders contained in the specified folder (available on Unix and Windows). For example:

>>> import os
>>> os.listdir("D:/files/data")
['ndvi.tif', 'ndvi_2023_01.tif']

2.3 Joining and splitting paths

The os.path.join() function joins path components into a single path. It accepts any number of components and chooses the separator according to the operating system. The outputs below are from Windows:

>>> import os
>>> os.path.join(r'D:\files\data', 'ndvi_2023_01.tif')
'D:\\files\\data\\ndvi_2023_01.tif'
>>> os.path.join('D:/files/data', 'ndvi_2023_01.tif')
'D:/files/data\\ndvi_2023_01.tif'
>>> os.path.join('./data', 'ndvi_2023_01.tif')
'./data\\ndvi_2023_01.tif'
>>> os.path.join('files', 'data', 'ndvi_2023_01.tif')
'files\\data\\ndvi_2023_01.tif'

The os.path.split() function splits a path into two parts; the second part is always the last directory or file name. The os.path.splitext() function also splits a path into two parts; the second part is always the file extension, or an empty string if there is none.

>>> os.path.split("D:/files/data/ndvi.tif")
('D:/files/data', 'ndvi.tif')
>>> os.path.split("D:/files/data")
('D:/files', 'data')

>>> os.path.splitext("D:/files/data/ndvi.tif")
('D:/files/data/ndvi', '.tif')
>>> os.path.splitext("D:/files/data")
('D:/files/data', '')

In addition, the os.path.basename() function returns the last file name or folder name of a path, and the os.path.dirname() function removes the last file name or folder name and returns the remaining folder path.

>>> os.path.basename("D:/files/data/ndvi.tif")
'ndvi.tif'
>>> os.path.basename("D:/files/data")
'data'
>>> os.path.dirname("D:/files/data/ndvi.tif")
'D:/files/data'
>>> os.path.dirname("D:/files/data")
'D:/files'
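
These functions combine naturally. The sketch below breaks a path into its parts and swaps the extension. The helper names describe and with_extension are made up for this illustration, and ntpath is used so the Windows rules apply regardless of where the code runs.

```python
import ntpath  # Windows flavour of os.path, importable on any system

def describe(path):
    """Split a path into folder, file name, base name and extension."""
    folder, filename = ntpath.split(path)
    stem, ext = ntpath.splitext(filename)
    return {"folder": folder, "filename": filename, "stem": stem, "ext": ext}

def with_extension(path, new_ext):
    """Return the same path with its extension replaced by new_ext."""
    root, _ = ntpath.splitext(path)
    return root + new_ext

print(describe("D:/files/data/ndvi.tif"))
# {'folder': 'D:/files/data', 'filename': 'ndvi.tif', 'stem': 'ndvi', 'ext': '.tif'}
print(with_extension("D:/files/data/ndvi.tif", ".png"))
# D:/files/data/ndvi.png
```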

2.4 Handling absolute and relative paths

The os and os.path modules provide functions that convert a relative path to an absolute path and check whether a given path is absolute.

  • os.getcwd(): Get the current working directory;
  • os.path.abspath(path): Returns the absolute path of the argument as a string, based on the current working directory. This is an easy way to convert a relative path to an absolute path;
  • os.path.isabs(path): Returns True if the argument is an absolute path and False if it is a relative path;
  • os.path.relpath(path, start): Returns a string with the relative path from start to path. If start is not provided, the current working directory is used as the starting path.

Code example:

>>> os.getcwd()
'C:\\Python34'
>>> os.path.abspath('.')
'C:\\Python34'
>>> os.path.abspath('.\\Scripts')
'C:\\Python34\\Scripts'
>>> os.path.isabs('.')
False
>>> os.path.isabs(os.path.abspath('.'))
True
>>> os.path.relpath('C:\\Windows', 'C:\\')
'Windows'
>>> os.path.relpath('C:\\Windows', 'C:\\spam\\eggs')
'..\\..\\Windows'
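
A useful property to keep in mind is that relpath and join undo each other. The sketch below shows the round trip with made-up Linux-style paths under /data. posixpath is used so the result does not depend on the operating system.

```python
import posixpath

# Hypothetical folders, used only for illustration.
start = "/data/2019"
target = "/data/2020/ndvi.tif"

# Relative path that leads from start to target.
rel = posixpath.relpath(target, start)
print(rel)                   # ../2020/ndvi.tif
print(posixpath.isabs(rel))  # False

# Joining it back onto start and normalizing recovers the absolute path.
back = posixpath.normpath(posixpath.join(start, rel))
print(back == target)        # True
```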

2.5 Use glob to find folders or files

The glob module finds folders and files whose paths match a pattern and returns the results as a list.

Before using the glob module, you need to understand its three wildcards, namely *, ? and []. Their meanings are as follows:

  • *: matches zero or more characters;
  • ?: matches exactly one character;
  • []: matches one character within the specified set or range, such as [0-9] for a digit, [a-c] for the letter a, b or c, and [12a] for the character 1, 2 or a. Whether letters match case-insensitively depends on the operating system: they do on Windows but not on Linux.

Let's use an example to explain in detail how to use these three wildcards. Suppose you have the following directory structure:

+-- D:/
|   +-- data1
|   |   +-- readme.md
|   |   +-- ndvi.tif
|   |   +-- buliding.tif
|   +-- data2
|   |   +-- ndvi.tif
|   |   +-- water.tif
|   +-- picture
|   |   +-- mm.tif
|

If you want to find all files ending with .tif in data1, you can write:

>>> from glob import glob
>>> glob('D:/data1/*.tif')
['D:/data1\\ndvi.tif', 'D:/data1\\buliding.tif']

If you want to find all files named ndvi.tif in the directories whose names begin with data, you can write:

>>> glob('D:/data*/ndvi.tif')
['D:/data1\\ndvi.tif', 'D:/data2\\ndvi.tif']
>>> glob('D:/data?/ndvi.tif')
['D:/data1\\ndvi.tif', 'D:/data2\\ndvi.tif']
>>> glob('D:/data[1-2]/ndvi.tif')
['D:/data1\\ndvi.tif', 'D:/data2\\ndvi.tif']
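
To see where the three wildcards differ, it helps to test them on names that the directory above does not contain. The sketch below uses fnmatchcase from the standard fnmatch module, which applies the same wildcard rules to plain strings without touching the disk. The names data10 and dataA are made up for illustration.

```python
from fnmatch import fnmatchcase

# Folder names to test; data10 and dataA are hypothetical additions.
names = ["data1", "data2", "data10", "dataA", "picture"]

def match(pattern):
    """Return the names that match a wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]

print(match("data*"))       # ['data1', 'data2', 'data10', 'dataA']
print(match("data?"))       # ['data1', 'data2', 'dataA']
print(match("data[1-2]"))   # ['data1', 'data2']
print(match("data[0-9]*"))  # ['data1', 'data2', 'data10']
```

The * pattern is the most permissive, ? insists on exactly one extra character, and [] restricts which character that may be.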

3 Example of path manipulation for multiple files

In data processing, we often need to process multiple files in a certain order. The most common approach is to arrange the file paths chronologically, for example by year, month, or day. Here are a few examples in detail.

3.1 Annual scale data

Suppose you have the following directory structure:

+-- D:/
|   +-- data
|   |   +-- 2000.tif
|   |   +-- 2001.tif
... ... ... ... ... 
|   |   +-- 2019.tif
|   |   +-- 2020.tif
|

If you want to read the paths of all files from 2000 to 2020 in the data directory (in chronological order), you can write:

import os
from glob import glob

# glob does not guarantee any order, so sort the paths by file name
paths = sorted(glob('D:/data/*.tif*'))

# Alternatively, build the paths from an explicit start and end year
paths = []
root_dir = 'D:/data'
start_year, end_year = 2000, 2020
for year in range(start_year, end_year + 1):  # + 1 so that end_year is included
    path = os.path.join(root_dir, f'{year}.tif')
    paths.append(path)
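
The glob documentation does not promise any particular order for its results, so it is safer to sort explicitly when chronological order matters. The following sketch checks this in a temporary directory. The empty files only stand in for real rasters.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as root:
    # Create empty placeholder files in a deliberately shuffled order.
    for year in (2002, 2000, 2001):
        open(os.path.join(root, f"{year}.tif"), "w").close()

    # Sorting by name gives chronological order for four-digit years.
    paths = sorted(glob(os.path.join(root, "*.tif")))
    names = [os.path.basename(p) for p in paths]

print(names)  # ['2000.tif', '2001.tif', '2002.tif']
```

Sorting by name works here because every year has the same number of digits. Names of varying length would need a numeric sort key instead.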

3.2 Monthly scale data

Suppose you have the following directory structure:

+-- D:/
|   +-- data
|   |   +-- 200001.tif
|   |   +-- 200002.tif
... ... ... ... ... 
|   |   +-- 200012.tif
|   |   +-- 200101.tif
... ... ... ... ... 
|   |   +-- 202012.tif
|

If you want to read the tif data for all months of one or more years in the data directory, you can write:

import os
from glob import glob

# Read the data for every month of 2015
year = 2015
paths = sorted(glob(f'D:/data/{year}*.tif*'))

# Read the data for every month from 2015 to 2020
paths = []
root_dir = 'D:/data'
start_year, end_year = 2015, 2020
for year in range(start_year, end_year + 1):  # + 1 so that end_year is included
    # glob returns a list, so extend paths with the sorted matches
    paths += sorted(glob(os.path.join(root_dir, f'{year}*.tif')))

If you want to read the tif data for a certain season of certain years in the data directory, you can write:

import os

# Assume that the winter of a year consists of December of the previous
# year plus January and February of that year
season_months = {
    'spring': ['03', '04', '05'], 'summer': ['06', '07', '08'],
    'autumn': ['09', '10', '11'], 'winter': ['01', '02']}

paths = []
root_dir = 'D:/data'

season = 'spring'
years = [2014, 2015, 2016]
for year in years:
    if season == 'winter':
        # December of the previous year comes first to keep chronological order
        path = os.path.join(root_dir, f'{year - 1}12.tif')
        assert os.path.isfile(path), f'missing file: {path}'
        paths.append(path)
    for month in season_months[season]:
        path = os.path.join(root_dir, f'{year}{month}.tif')
        # Stop early if an expected file does not exist
        assert os.path.isfile(path), f'missing file: {path}'
        paths.append(path)
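
The season logic can be checked without any files by separating it from the disk access. The sketch below returns only the YYYYMM stems for one season. The function name season_stems is made up for this illustration, and it assumes the same winter definition as above (December of the previous year, then January and February).

```python
SEASON_MONTHS = {
    "spring": [3, 4, 5], "summer": [6, 7, 8],
    "autumn": [9, 10, 11], "winter": [1, 2],
}

def season_stems(year, season):
    """Return YYYYMM strings for one season of one year, in chronological order."""
    # Winter starts with December of the previous year.
    pairs = [(year - 1, 12)] if season == "winter" else []
    pairs += [(year, month) for month in SEASON_MONTHS[season]]
    return [f"{y}{m:02d}" for y, m in pairs]

print(season_stems(2015, "spring"))  # ['201503', '201504', '201505']
print(season_stems(2015, "winter"))  # ['201412', '201501', '201502']
```

Each stem can then be turned into a path with os.path.join(root_dir, stem + '.tif').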

3.3 Daily-scale data

Suppose you have the following directory structure, where 1982001 in a file name means that the data was acquired on the first day of 1982.

./ndvi/AVH13C1.A1982001.N07.005.2017161044559.NDVI.tif (1982001: day 1 of 1982)
./ndvi/AVH13C1.A1982002.N07.005.2017161050433.NDVI.tif (1982002: day 2 of 1982)
./ndvi/AVH13C1.A1982003.N07.005.2017161052408.NDVI.tif (1982003: day 3 of 1982)
./ndvi/AVH13C1.A1982004.N07.005.2017161054145.NDVI.tif (1982004: day 4 of 1982)
./ndvi/AVH13C1.A1982005.N07.005.2017161055856.NDVI.tif (1982005: day 5 of 1982)
./ndvi/AVH13C1.A1982006.N07.005.2017161061328.NDVI.tif (1982006: day 6 of 1982)

If you want to read the path of data for all days in a certain month of a certain year, you can write like this:

import os
import calendar
from glob import glob
from datetime import datetime

def get_year_month_paths(root_path, year, month):
    '''Return the paths of all NDVI files for the given year and month.'''
    paths = []
    # Wildcard string for glob, which serves as a file name template
    fmt = '*{}{:03}*.tif'
    # Number of days in the given month
    month_days = calendar.monthrange(year, month)[1]
    # Day of the year on which the month starts
    month_start_day = (datetime(year, month, 1) - datetime(year, 1, 1)).days + 1
    for i in range(month_days):
        day_num = month_start_day + i
        month_day_path = glob(os.path.join(root_path, fmt.format(year, day_num)))
        if not month_day_path:
            print('Data missing for {}-{:02}, day number {}{:03}'.format(year, month, year, day_num))
        paths += month_day_path
    return paths
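
Going the other way, from a file name to a calendar date, can be sketched with the %j (day of year) directive of strptime. The regular expression assumes the date always appears as ".A" followed by seven digits and a dot, as in the file names listed above. The function name acquisition_date is made up for this illustration.

```python
import re
from datetime import datetime

def acquisition_date(filename):
    """Extract the YYYYDDD field after '.A' and convert it to a date."""
    match = re.search(r"\.A(\d{7})\.", filename)
    if match is None:
        raise ValueError(f"no date field found in {filename!r}")
    # %Y reads the four-digit year, %j reads the day of the year.
    return datetime.strptime(match.group(1), "%Y%j").date()

print(acquisition_date("AVH13C1.A1982001.N07.005.2017161044559.NDVI.tif"))
# 1982-01-01
print(acquisition_date("AVH13C1.A1982032.N07.005.0000000000000.NDVI.tif"))
# 1982-02-01
```

The second file name is invented to show that day 32 falls on 1 February. With such a function, the paths could also be sorted with sorted(paths, key=acquisition_date).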

3.4 The chronological order of the files is hidden within the files

Some files are downloaded with default names that are just meaningless strings of characters. You know that the time information is stored inside the files, so you do not rename them one by one. When you come to use them, this turns out to be a nuisance: the file names cannot tell you the chronological order, and with so many files it is too slow to open each one by hand, check its time, and rename it. The only practical option is to open the files one by one in code, read their times, and sort the file paths by those times. This situation is common with data in .nc or .hdf format, which stores time information internally, and with websites whose downloads carry no time information in the file name.

Suppose you have the following directory structure:

./RH/ked.nc
./RH/mmm.nc
./RH/dii.nc
./RH/jkd.nc
./RH/zex.nc
./RH/xyz.nc

Each nc file contains a variable named time, an array that records the time. Taking ERA5 data as an example, the hourly data records time as the number of hours since 1900-01-01 00:00 Greenwich Mean Time. To convert it to a time string in the UTC+8 time zone, you can do this:

import datetime
import netCDF4 as nc

ds = nc.Dataset('./RH/ked.nc')
times = ds['time'][...].data

# The time values count hours from this origin
origin_date = datetime.datetime(1900, 1, 1, 0, 0)
# Adding 8 hours shifts the time to UTC+8
start_date = origin_date + datetime.timedelta(hours=int(times[0]) + 8)

start_date = start_date.strftime('%Y-%m-%d-%H')
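
The conversion itself can be tried without a data file. The sketch below uses time-zone-aware datetime objects instead of adding 8 hours by hand. The function name hours_to_utc8_string is made up for this illustration, and the origin of 1900-01-01 00:00 is taken from the description above.

```python
from datetime import datetime, timedelta, timezone

ORIGIN = datetime(1900, 1, 1, tzinfo=timezone.utc)  # origin of the hour count
UTC8 = timezone(timedelta(hours=8))                 # the UTC+8 time zone

def hours_to_utc8_string(hours_since_origin):
    """Convert hours since 1900-01-01 00:00 UTC to a UTC+8 time string."""
    moment = ORIGIN + timedelta(hours=int(hours_since_origin))
    return moment.astimezone(UTC8).strftime("%Y-%m-%d-%H")

print(hours_to_utc8_string(0))   # 1900-01-01-08
print(hours_to_utc8_string(24))  # 1900-01-02-08
```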

If we don't need to rename these nc files and only want to sort their paths in code, we only need to do this:

import os
from glob import glob

import netCDF4 as nc

nc_paths = glob(os.path.join('./RH', '*.nc'))

times = []
for nc_path in nc_paths:
    # Read the first time value stored in each file
    with nc.Dataset(nc_path) as ds:
        time = ds.variables['time']
        times.append(int(time[0].data))

# Sort the paths by their first time value
_, nc_paths = (list(t) for t in zip(*sorted(zip(times, nc_paths))))
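
The same sort can be written with the key argument of sorted, which some readers find easier to follow than the zip idiom. In the sketch below, the lookup table fake_times replaces the real file reads, and its hour values are invented purely for illustration.

```python
def sort_paths_by_time(paths, read_first_time):
    """Return paths ordered by the value read_first_time gives for each one."""
    return sorted(paths, key=read_first_time)

# Hypothetical first time values (hours since 1900-01-01), not real data.
fake_times = {
    "./RH/ked.nc": 1051920,
    "./RH/mmm.nc": 1051896,
    "./RH/dii.nc": 1051944,
}

ordered = sort_paths_by_time(list(fake_times), fake_times.get)
print(ordered)  # ['./RH/mmm.nc', './RH/ked.nc', './RH/dii.nc']
```

In real use, read_first_time would be a function that opens the file and returns its first time value.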

Finally, there are countless ways to work with file paths. Master these basics, learn to generalize from one example, and practice hands-on. That is how you become proficient in data processing.

To learn more about Python and GIS, follow the official account GeodataAnalysis.

Origin blog.csdn.net/weixin_44785184/article/details/128576757