This article shows how to use Python to screen a large number of Excel table files in a folder according to the values in one column of each file, and to copy the files that meet the requirement and those that do not into two separate new folders.
First, let's clarify the specific requirement. We have a folder that contains a large number of Excel table files (this article uses .csv files as an example), as shown in the figure below.
Each table file has the data format shown in the figure below.
As the figure above shows, every file has the same problem: some rows contain valid data, while in other rows every column except the first is 0. We therefore use column 2 as the criterion and separate the table files whose proportion of 0 values is below a given threshold from those whose proportion is at or above it. Files with many 0 values are of little use for our analysis, so we put them into one new folder; files with few 0 values can go on to subsequent analysis, so we put them into another new folder. In other words, after calculating the percentage of 0 values for each table file, we copy the file to the corresponding folder.
With the requirement clear, we can start writing code. The code used in this article is as follows.
# -*- coding: utf-8 -*-
"""
Created on Tue May 16 20:19:50 2023

@author: fkxxgis
"""

import os
import shutil
import pandas as pd

def filter_copy_files(original_path, useful_path, useless_path, threshold):
    original_all_file = os.listdir(original_path)
    for file in original_all_file:
        path = os.path.join(original_path, file)
        # Only process regular files whose names end with .csv
        if file.endswith(".csv") and os.path.isfile(path):
            df = pd.read_csv(path)
            # Column 2 (position 1) is the column used as the criterion
            column_value = df.iloc[:, 1]
            # Proportion of 0 values in that column
            zero_count = (column_value == 0).sum()
            zero_ratio = zero_count / len(column_value)
            if zero_ratio < threshold:
                new_path = os.path.join(useful_path, file)
                shutil.copy(path, new_path)
            else:
                new_path = os.path.join(useless_path, file)
                shutil.copy(path, new_path)

filter_copy_files("E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/13_AllYearAverage",
                  "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/14_PointSelection/LowMissingRate",
                  "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/14_PointSelection/HighMissingRate",
                  0.30)
The code above defines a function that filters and copies files. It copies files from one folder into two other folders according to their missing rate (here, the proportion of 0 values in column 2) and a given threshold.
The filter_copy_files function accepts four parameters:
original_path: Path to the original folder containing the .csv files to filter.
useful_path: Path to the destination folder for useful files; files that meet the threshold requirement (that is, the proportion of 0 values is below the threshold) are copied here.
useless_path: Path to the destination folder for useless files; files that do not meet the threshold requirement (that is, the proportion of 0 values is at or above the threshold) are copied here.
threshold: The threshold used to decide whether a file's missing rate meets the requirement.
The function first uses os.listdir to get all the names in the original folder and then iterates over them. For each name that ends with .csv and refers to a regular file, the function reads the file with pd.read_csv and takes the values of column 2 with df.iloc[:, 1].
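The file selection and column selection described above can be tried in isolation. The following is a minimal sketch only: the helper name list_csv_files, the temporary folder, and the sample data are made up for illustration, while os.listdir, os.path.isfile, pd.read_csv and iloc are the same calls the article's function uses.

```python
import io
import os
import tempfile

import pandas as pd


def list_csv_files(folder):
    """Return the sorted names of regular files in `folder` that end with .csv."""
    return sorted(
        name
        for name in os.listdir(folder)
        if name.endswith(".csv") and os.path.isfile(os.path.join(folder, name))
    )


# Throwaway folder with two CSV files, one text file, and one sub-folder
# whose name happens to end with ".csv" (it must be skipped).
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.csv", "b.csv", "notes.txt"):
        with open(os.path.join(folder, name), "w") as f:
            f.write("id,value\n1,0\n")
    os.mkdir(os.path.join(folder, "archive.csv"))
    csv_names = list_csv_files(folder)

# iloc selects by position, so index 1 is the second column whatever its name is.
sample = pd.read_csv(io.StringIO("id,value,other\n1,0.0,5\n2,0.3,6\n3,0.0,7\n"))
second_column = sample.iloc[:, 1]
print(csv_names)
print(second_column.tolist())
```

The os.path.isfile check is what keeps a directory named like a .csv file out of the list; without it, pd.read_csv would be asked to open a folder.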
Next, the function counts the elements of column 2 that equal 0 and divides that count by the total length of the column to obtain the missing rate. It then compares the missing rate with the threshold to decide whether the file meets the requirement.
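As a small worked example of this calculation, the sketch below computes the proportion of 0 values on made-up data. The helper name zero_ratio and the sample values are hypothetical, and treating an empty column as fully missing is an assumption made here, not something the article states.

```python
import io

import pandas as pd


def zero_ratio(column):
    """Fraction of entries in `column` that equal 0.

    An empty column is reported as 1.0 (an assumption of this sketch,
    so that a file without data rows counts as unusable).
    """
    if len(column) == 0:
        return 1.0
    return float((column == 0).sum() / len(column))


# Hypothetical data: 2 of the 5 rows have 0 in the second column.
csv_text = "id,value\n1,0\n2,0.25\n3,0\n4,0.31\n5,0.18\n"
df = pd.read_csv(io.StringIO(csv_text))

ratio = zero_ratio(df.iloc[:, 1])
threshold = 0.30
is_useful = ratio < threshold
print(ratio, is_useful)
```

With a threshold of 0.30, a file in which two of five values are 0 has a ratio of 0.4 and therefore falls on the high-missing-rate side.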
If the missing rate is less than the threshold, the function copies the file to the useful_path folder with shutil.copy. Otherwise, it copies the file to the useless_path folder.
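One practical point about the copy step is that shutil.copy fails if the destination folder does not exist yet. A possible precaution is sketched below; the helper name copy_into and the file name point_001.csv are hypothetical, and creating the folder with os.makedirs is an addition of this sketch rather than part of the article's function.

```python
import os
import shutil
import tempfile


def copy_into(src_file, dest_folder):
    """Copy `src_file` into `dest_folder`, creating the folder first if needed."""
    os.makedirs(dest_folder, exist_ok=True)
    dest_file = os.path.join(dest_folder, os.path.basename(src_file))
    # shutil.copy copies the file contents and permission bits
    # and returns the path of the new file.
    return shutil.copy(src_file, dest_file)


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "point_001.csv")  # hypothetical file name
    with open(src, "w") as f:
        f.write("id,value\n1,0\n")
    copied = copy_into(src, os.path.join(root, "LowMissingRate"))
    copied_exists = os.path.isfile(copied)
    source_still_exists = os.path.isfile(src)

print(copied_exists, source_still_exists)
```

Because this is a copy and not a move, the original folder is left untouched, so the screening can be repeated with a different threshold.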
Finally, we call filter_copy_files with the appropriate arguments to perform the filtering and copying.
After running the code, we can see the files in the corresponding folders. As shown in the figure below, the table files whose proportion of 0 values is below the threshold are copied to the LowMissingRate folder, where we can process them further, while the table files whose proportion of 0 values is at or above the threshold are placed in the HighMissingRate folder.
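Besides looking at the folders by eye, the result can be checked in code. The sketch below is one possible check, assuming both destination folders were empty before the run; the helper name csv_names and the stand-in folders and file names are made up so that the example runs anywhere.

```python
import os
import tempfile


def csv_names(folder):
    """Return the set of regular .csv file names directly inside `folder`."""
    return {
        name
        for name in os.listdir(folder)
        if name.endswith(".csv") and os.path.isfile(os.path.join(folder, name))
    }


# Stand-in folders and file names; in practice, pass the three real folder paths.
layout = {
    "original": ["p1.csv", "p2.csv", "p3.csv"],
    "LowMissingRate": ["p1.csv", "p3.csv"],
    "HighMissingRate": ["p2.csv"],
}
with tempfile.TemporaryDirectory() as root:
    found = {}
    for folder_name, names in layout.items():
        folder = os.path.join(root, folder_name)
        os.mkdir(folder)
        for name in names:
            with open(os.path.join(folder, name), "w") as f:
                f.write("id,value\n")
        found[folder_name] = csv_names(folder)

low, high = found["LowMissingRate"], found["HighMissingRate"]
# Each source file should appear in exactly one of the two destinations.
no_overlap = not (low & high)
all_accounted_for = (low | high) == found["original"]
print(no_overlap, all_accounted_for)
```

If the destination folders already held files from an earlier run with a different threshold, the two checks can fail even though the latest run worked, so it is simplest to start from empty folders.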
With that, the task is complete.