Using Python to filter and remove Excel data by multiple columns and batch-draw histograms

  This article introduces how to read Excel data with Python, use the values of one column as the benchmark to select rows, and then use the values of several other columns to remove all rows whose data fall outside a specified range; at the same time, it shows how to draw histograms of those columns both before and after the removal, and how to export and save the resulting data as a new Excel table file.

  First, let’s clarify the specific needs of this article. We have an Excel table file; in this article we take a .csv format file as an example. As shown in the figure below, this file contains a column of data (in this article, the days column) that we use as the benchmark data. We first want to take out all samples whose days value is in the range 0 to 45 or 320 to 365 (one row is one sample), and then carry out subsequent operations on them.

  Secondly, for the selected samples, based on the data of 4 other columns (in this article, blue_dif, green_dif, red_dif and inf_dif), we want to delete the rows whose values in these columns are not within a specified range. In this process, we also want to draw histograms of the data in these 4 columns before and after the deletion, 8 histograms in total (one per column before deletion and one per column after). Finally, we also want to save the data remaining after the deletion as a new Excel table file.

  Knowing the requirements, we can write code. The code used in this article is shown below.

# -*- coding: utf-8 -*-
"""
Created on Tue Sep 12 07:55:40 2023

@author: fkxxgis
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

original_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/Train_Model_0715_Main_Over_NIR.csv"
# original_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/TEST.csv"
result_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/Train_Model_0715_Main_Over_NIR_New.csv"

# Read the original .csv file into a DataFrame
df = pd.read_csv(original_file_path)

# Subsets of the four band-difference columns, used for the "before" histograms
blue_original = df[(df['blue_dif'] >= -0.08) & (df['blue_dif'] <= 0.08)]['blue_dif']
green_original = df[(df['green_dif'] >= -0.08) & (df['green_dif'] <= 0.08)]['green_dif']
red_original = df[(df['red_dif'] >= -0.08) & (df['red_dif'] <= 0.08)]['red_dif']
inf_original = df[(df['inf_dif'] >= -0.1) & (df['inf_dif'] <= 0.1)]['inf_dif']

# Rows whose days value lies in 0-45 or 320-365
mask = ((df['days'] >= 0) & (df['days'] <= 45)) | ((df['days'] >= 320) & (df['days'] <= 365))
# Allowed value range for the four band-difference columns
range_min = -0.03
range_max = 0.03

# For the masked rows, randomly replace out-of-range values with NaN
# (for inf_dif, with probability 0.9), then drop the rows containing NaN
df.loc[mask, 'blue_dif'] = df.loc[mask, 'blue_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'green_dif'] = df.loc[mask, 'green_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'red_dif'] = df.loc[mask, 'red_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'inf_dif'] = df.loc[mask, 'inf_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x], p=[0.9, 0.1]))
df = df.dropna()

# Subsets of the same columns after filtering, used for the "after" histograms
blue_new = df[(df['blue_dif'] >= -0.08) & (df['blue_dif'] <= 0.08)]['blue_dif']
green_new = df[(df['green_dif'] >= -0.08) & (df['green_dif'] <= 0.08)]['green_dif']
red_new = df[(df['red_dif'] >= -0.08) & (df['red_dif'] <= 0.08)]['red_dif']
inf_new = df[(df['inf_dif'] >= -0.1) & (df['inf_dif'] <= 0.1)]['inf_dif']

# Histograms of the original data (figures 0-3)
plt.figure(0)
plt.hist(blue_original, bins=50)
plt.figure(1)
plt.hist(green_original, bins=50)
plt.figure(2)
plt.hist(red_original, bins=50)
plt.figure(3)
plt.hist(inf_original, bins=50)

# Histograms of the filtered data (figures 4-7)
plt.figure(4)
plt.hist(blue_new, bins=50)
plt.figure(5)
plt.hist(green_new, bins=50)
plt.figure(6)
plt.hist(red_new, bins=50)
plt.figure(7)
plt.hist(inf_new, bins=50)

# Save the filtered data to a new .csv file
df.to_csv(result_file_path, index=False)

  First, we read the data from the .csv file at the specified path with the pd.read_csv function and store it in a DataFrame named df.
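  As an isolated illustration of this step, here is a minimal sketch that reads a hypothetical data.csv and peeks at the columns used in this article (the path and the print calls are just for demonstration):

import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical path; replace with your own file
print(df.shape)  # number of rows and columns
print(df[["days", "blue_dif"]].head())  # quick look at the benchmark column and one band column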

  Next, a subset of the original data that meets specific conditions is selected through a series of conditional filtering operations. Specifically, we select the values of the 4 columns blue_dif, green_dif, red_dif and inf_dif that fall within a certain range, and store them in new Series named blue_original, green_original, red_original and inf_original. These prepare the data for the histograms we draw later.
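  As a minimal sketch of this kind of Boolean indexing, run on a small made-up DataFrame rather than the real file:

import pandas as pd

demo = pd.DataFrame({"blue_dif": [-0.12, -0.05, 0.02, 0.09]})
# Keep only the blue_dif values that lie within [-0.08, 0.08]; the result is a Series
blue_subset = demo[(demo["blue_dif"] >= -0.08) & (demo["blue_dif"] <= 0.08)]["blue_dif"]
print(blue_subset)  # only -0.05 and 0.02 remain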

  Second, we create a Boolean mask named mask, which is used to select the rows that meet the criteria. Here, it selects the rows whose days column value is between 0 and 45 or between 320 and 365.
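  The following small sketch shows how such a combined mask evaluates on a few made-up days values:

import pandas as pd

demo = pd.DataFrame({"days": [10, 100, 200, 330, 365]})
mask = ((demo["days"] >= 0) & (demo["days"] <= 45)) | ((demo["days"] >= 320) & (demo["days"] <= 365))
print(mask.tolist())  # [True, False, False, True, True]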

  Subsequently, we use the apply function together with a lambda expression. For rows whose days value is between 0 and 45 or between 320 and 365, if the value in the blue_dif, green_dif, red_dif or inf_dif column is not within the specified range, that value is randomly replaced with NaN. The argument p=[0.9, 0.1] specifies the probability of the replacement (here, a 90% chance of NaN and a 10% chance of keeping the original value). Note that if we do not pass such a probability distribution (as with the first three columns here), np.random.choice selects uniformly, i.e. each option has a 50% chance.
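  To see the random replacement in isolation, here is a small sketch of the same idea on a made-up Series, using the same -0.03 to 0.03 limits:

import numpy as np
import pandas as pd

s = pd.Series([0.01, 0.07, -0.02, -0.09])
range_min, range_max = -0.03, 0.03

# In-range values are kept; out-of-range values become NaN with 90% probability
s_new = s.apply(lambda x: x if range_min <= x <= range_max
                else np.random.choice([np.nan, x], p=[0.9, 0.1]))
print(s_new)  # 0.01 and -0.02 are always kept; 0.07 and -0.09 usually become NaN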

  Finally, we use the dropna function to delete the rows containing NaN values and obtain the filtered data. We then compute subsets of the processed data using the same filtering conditions for these four columns and store them in blue_new, green_new, red_new and inf_new. Next, Matplotlib is used to create histograms visualizing the distributions of the original data and the processed data; these histograms are drawn in 8 separate figures.
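  A minimal sketch of the dropna and histogram steps, using randomly generated demo data instead of the real columns:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

demo = pd.DataFrame({"blue_dif": np.random.normal(0, 0.03, 1000)})
demo.iloc[::50, 0] = np.nan  # introduce some NaN values
demo = demo.dropna()         # drop the rows that contain NaN

plt.figure()
plt.hist(demo["blue_dif"], bins=50)  # 50-bin histogram, as in the code above
plt.show()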

  At the end of the code, the processed data is saved as a new .csv file at the path specified by result_file_path.
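  The export itself is a single call; here is a sketch with a hypothetical output path:

import pandas as pd

demo = pd.DataFrame({"days": [10, 330], "blue_dif": [0.01, -0.02]})
# index=False keeps pandas from writing the row index as an extra column
demo.to_csv("result_demo.csv", index=False)  # hypothetical output path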

  Running the above code, we get 8 histograms, as shown in the figures below, and the result file appears in the specified folder.

  At this point, you're done.

Welcome to follow: Crazy Learning GIS

Origin blog.csdn.net/zhebushibiaoshifu/article/details/132893204