Traversing a large number of table files with Python and selecting the files with a low missing-data rate

  This article describes a Python method for filtering a large number of Excel table files in a folder according to the characteristics of one column of data in each file, and for copying the files that meet the requirement and the files that do not into two separate new folders.

  First, let's clarify the specific requirement. We have a folder that contains a large number of Excel table files (this article uses files in .csv format as the example), as shown in the figure below.

  Each of these table files has the data format shown in the figure below.

  As the figure above shows, every file has the same problem: some rows contain correct data, while in other rows every column except the first is 0. We therefore use the 2nd column as the criterion and separate the table files whose proportion of 0 values is below a given threshold from those whose proportion is above it. Files with many 0 values are not suitable for our analysis, so we put them into one new folder; files with few 0 values can be analyzed further, so we put them into another new folder. In other words, after calculating the percentage of 0 values for each table file, we copy that file to the corresponding folder.

  With the requirement clear, we can start writing the code. The code used in this article is as follows.

# -*- coding: utf-8 -*-
"""
Created on Tue May 16 20:19:50 2023

@author: fkxxgis
"""

import os
import shutil
import pandas as pd

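def filter_copy_files(original_path, useful_path, useless_path, threshold):
    # Make sure both destination folders exist before copying into them
    os.makedirs(useful_path, exist_ok=True)
    os.makedirs(useless_path, exist_ok=True)

    original_all_file = os.listdir(original_path)
    for file in original_all_file:
        path = os.path.join(original_path, file)
        # Only process regular files whose name ends in .csv
        if file.endswith(".csv") and os.path.isfile(path):
            df = pd.read_csv(path)
            # The 2nd column (position 1) is the reference column
            column_value = df.iloc[:, 1]
            # Share of rows in that column that are exactly 0
            zero_count = (column_value == 0).sum()
            zero_ratio = zero_count / len(column_value)

            if zero_ratio < threshold:
                # Low missing rate: keep for further analysis
                new_path = os.path.join(useful_path, file)
                shutil.copy(path, new_path)
            else:
                # Missing rate at or above the threshold: set aside
                new_path = os.path.join(useless_path, file)
                shutil.copy(path, new_path)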

filter_copy_files("E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/13_AllYearAverage",
                  "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/14_PointSelection/LowMissingRate",
                  "E:/01_Reflectivity/99_Model_Training/00_Data/02_Extract_Data/14_PointSelection/HighMissingRate",
                  0.30)

  The code above defines a function that filters and copies files. It copies the files in one folder into two other folders, depending on whether each file's missing rate is below a given threshold.

  The function filter_copy_files accepts four parameters:

  • original_path: Path to the original folder that contains the .csv files to filter.
  • useful_path: Path to the destination folder for useful files. Files that meet the threshold requirement (that is, the proportion of 0 values is below the threshold) are copied here.
  • useless_path: Path to the destination folder for useless files. Files that do not meet the threshold requirement (that is, the proportion of 0 values is at or above the threshold) are copied here.
  • threshold: The threshold used to decide whether a file's missing rate meets the requirement.
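
  The boundary behaviour of threshold is easy to overlook. As a minimal sketch (the helper name classify is hypothetical and not part of the original code), the comparison applied by the function can be isolated as follows. It assumes the strict < used in the code above, so a file whose ratio is exactly equal to the threshold is treated as useless.

```python
def classify(zero_ratio, threshold):
    """Return the destination label for a file, mirroring the strict `<` test."""
    return "useful" if zero_ratio < threshold else "useless"


print(classify(0.29, 0.30))  # useful
print(classify(0.30, 0.30))  # useless
print(classify(0.45, 0.30))  # useless
```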

  The function first calls os.listdir to get all the file names in the original folder and then iterates over them. For each name that ends in .csv and refers to a regular file, it reads the file with pd.read_csv and takes the 2nd column with df.iloc[:, 1].
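
  The file-selection step can be tried on its own. The sketch below is only an illustration: the helper list_csv_files and the file names are hypothetical, and a temporary folder stands in for the real data folder. It shows why the os.path.isfile check matters: a subfolder whose name happens to end in .csv is skipped.

```python
import os
import tempfile


def list_csv_files(folder):
    """Return the sorted names of regular files in `folder` that end in .csv."""
    return sorted(
        name for name in os.listdir(folder)
        if name.endswith(".csv") and os.path.isfile(os.path.join(folder, name))
    )


with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "a.csv"), "w").close()      # a table file
    open(os.path.join(folder, "notes.txt"), "w").close()  # wrong extension
    os.mkdir(os.path.join(folder, "archive.csv"))         # a folder, not a file
    found = list_csv_files(folder)

print(found)  # ['a.csv']
```

  Note that str.endswith is case-sensitive, so a file named A.CSV would not be selected by this test.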

  Next, the function counts the elements in the 2nd column that are equal to 0 and divides that count by the total length of the column to obtain the missing rate. It then compares the missing rate with the threshold to decide whether the file meets the requirement.
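
  As a small worked example of this calculation, the sketch below applies the same three statements to an in-memory table. The column names and values are made up for illustration and are not taken from the real files; io.StringIO is used only so that pd.read_csv can read the text without a file on disk.

```python
import io

import pandas as pd

# Hypothetical file contents: rows 2 and 4 are "missing" (all 0 after column 1)
csv_text = """date,band_1,band_2
2020-01-01,0.12,0.30
2020-01-02,0,0
2020-01-03,0.15,0.28
2020-01-04,0,0
2020-01-05,0.11,0.33
"""

df = pd.read_csv(io.StringIO(csv_text))
column_value = df.iloc[:, 1]                 # 2nd column, selected by position
zero_count = (column_value == 0).sum()       # number of rows equal to 0
zero_ratio = zero_count / len(column_value)  # 2 of 5 rows

print(zero_count, zero_ratio)  # 2 0.4
```

  With the threshold of 0.30 used in this article, a ratio of 0.4 is not below the threshold, so such a file would be copied to the useless_path folder.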

  If the missing rate is less than the threshold, the function copies the file to the useful_path folder with shutil.copy. Otherwise, it copies the file to the useless_path folder.
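
  The copy step can be sketched in isolation as well. The helper copy_into and the file name below are hypothetical. The sketch assumes that the destination folder may not exist yet and creates it first, because shutil.copy raises an error when the target folder is missing.

```python
import os
import shutil
import tempfile


def copy_into(src_file, dest_folder):
    """Copy src_file into dest_folder, creating the folder if needed."""
    os.makedirs(dest_folder, exist_ok=True)
    dest_file = os.path.join(dest_folder, os.path.basename(src_file))
    return shutil.copy(src_file, dest_file)  # returns the destination path


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "point_001.csv")
    with open(src, "w") as f:
        f.write("date,band_1\n2020-01-01,0.12\n")

    dest_folder = os.path.join(root, "LowMissingRate")  # does not exist yet
    copied = copy_into(src, dest_folder)
    copied_exists = os.path.isfile(copied)
    source_still_exists = os.path.isfile(src)

print(copied_exists, source_still_exists)  # True True
```

  Because this is a copy and not a move, the original file stays in place. shutil.copy copies the file contents and permission bits; if modification times and other metadata should be kept as well, shutil.copy2 is the usual alternative.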

  Finally, we call filter_copy_files with the appropriate arguments to perform the file filtering and copying.

  After running the code above, we can see the files in the corresponding folders. As shown in the figure below, the table files whose proportion of 0 values is below the threshold have been copied to the LowMissingRate folder, where we can process them further; the table files whose proportion of 0 values is at or above the threshold have been placed in the HighMissingRate folder.

  With that, the task is complete.
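
  Besides looking at the folders by eye, the result can be checked with a few lines of code. The following is a sketch under assumptions: the helpers csv_names, check_split and touch are hypothetical, and temporary folders with empty files stand in for the real ones. It reports source files that were copied to neither folder and files present in both folders, which can happen when the script is rerun with a different threshold without clearing the old copies.

```python
import os
import tempfile


def csv_names(folder):
    """Names of the .csv files directly inside `folder`, as a set."""
    return {
        name for name in os.listdir(folder)
        if name.endswith(".csv") and os.path.isfile(os.path.join(folder, name))
    }


def check_split(original_path, useful_path, useless_path):
    """Return (missing, duplicated) sets of file names."""
    source = csv_names(original_path)
    useful = csv_names(useful_path)
    useless = csv_names(useless_path)
    missing = source - (useful | useless)  # copied to neither folder
    duplicated = useful & useless          # present in both folders
    return missing, duplicated


def touch(folder, name):
    """Create an empty file, for demonstration only."""
    open(os.path.join(folder, name), "w").close()


with tempfile.TemporaryDirectory() as root:
    folders = {}
    for key in ("source", "useful", "useless"):
        folders[key] = os.path.join(root, key)
        os.mkdir(folders[key])
    for name in ("a.csv", "b.csv", "c.csv"):
        touch(folders["source"], name)
    touch(folders["useful"], "a.csv")
    touch(folders["useful"], "b.csv")
    touch(folders["useless"], "b.csv")  # stale copy from an earlier run
    missing, duplicated = check_split(
        folders["source"], folders["useful"], folders["useless"]
    )

print(sorted(missing), sorted(duplicated))  # ['c.csv'] ['b.csv']
```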


Origin blog.csdn.net/zhebushibiaoshifu/article/details/130714834