This article introduces how to read Excel-style table data with Python, take the values of one column as the benchmark, select all rows whose benchmark value falls within specified ranges, and then delete the rows whose values in several other columns fall outside given ranges; along the way, we draw histograms of those columns both before and after the deletion, and finally export the resulting data as a new table file.
First, let us clarify the specific requirements of this article. We have a table file, taken here to be in .csv format. As shown in the figure below, this file contains a column named days, which we use as the benchmark data. We first want to take out all samples (one row is one sample) whose days value lies in the range 0 to 45 or 320 to 365, and then carry out the subsequent operations on them.
Secondly, for the selected samples, based on the data of 4 other columns (in this article, blue_dif, green_dif, red_dif and inf_dif), we delete the rows whose values in these columns are not within a specified range. In this process, we also want to draw histograms of these 4 columns before and after the deletion, 8 plots in total. Finally, we want to save the data remaining after the deletion as a new .csv table file.
Knowing the requirements, we can write the code. The code used in this article is shown below.
# -*- coding: utf-8 -*-
"""
Created on Tue Sep 12 07:55:40 2023

@author: fkxxgis
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

original_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/Train_Model_0715_Main_Over_NIR.csv"
# original_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/TEST.csv"
result_file_path = "E:/01_Reflectivity/99_Model/02_Extract_Data/26_Train_Model_New/Train_Model_0715_Main_Over_NIR_New.csv"

df = pd.read_csv(original_file_path)

# Subsets of the four columns before any deletion, for the "before" histograms.
blue_original = df[(df['blue_dif'] >= -0.08) & (df['blue_dif'] <= 0.08)]['blue_dif']
green_original = df[(df['green_dif'] >= -0.08) & (df['green_dif'] <= 0.08)]['green_dif']
red_original = df[(df['red_dif'] >= -0.08) & (df['red_dif'] <= 0.08)]['red_dif']
inf_original = df[(df['inf_dif'] >= -0.1) & (df['inf_dif'] <= 0.1)]['inf_dif']

# Rows whose days value lies in [0, 45] or [320, 365].
mask = ((df['days'] >= 0) & (df['days'] <= 45)) | ((df['days'] >= 320) & (df['days'] <= 365))

# For the masked rows, values outside [range_min, range_max] are randomly
# replaced with NaN; for inf_dif, p=[0.9, 0.1] makes NaN the likelier outcome.
range_min = -0.03
range_max = 0.03
df.loc[mask, 'blue_dif'] = df.loc[mask, 'blue_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'green_dif'] = df.loc[mask, 'green_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'red_dif'] = df.loc[mask, 'red_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x]))
df.loc[mask, 'inf_dif'] = df.loc[mask, 'inf_dif'].apply(lambda x: x if range_min <= x <= range_max else np.random.choice([np.nan, x], p=[0.9, 0.1]))

# Drop every row that now contains a NaN.
df = df.dropna()

# Subsets after the deletion, for the "after" histograms.
blue_new = df[(df['blue_dif'] >= -0.08) & (df['blue_dif'] <= 0.08)]['blue_dif']
green_new = df[(df['green_dif'] >= -0.08) & (df['green_dif'] <= 0.08)]['green_dif']
red_new = df[(df['red_dif'] >= -0.08) & (df['red_dif'] <= 0.08)]['red_dif']
inf_new = df[(df['inf_dif'] >= -0.1) & (df['inf_dif'] <= 0.1)]['inf_dif']

# Eight histograms: four "before", four "after".
plt.figure(0)
plt.hist(blue_original, bins=50)
plt.figure(1)
plt.hist(green_original, bins=50)
plt.figure(2)
plt.hist(red_original, bins=50)
plt.figure(3)
plt.hist(inf_original, bins=50)
plt.figure(4)
plt.hist(blue_new, bins=50)
plt.figure(5)
plt.hist(green_new, bins=50)
plt.figure(6)
plt.hist(red_new, bins=50)
plt.figure(7)
plt.hist(inf_new, bins=50)
plt.show()

df.to_csv(result_file_path, index=False)
First, we read the data from the .csv file at the specified path through the pd.read_csv function, and store it in a DataFrame named df.
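As a minimal sketch of this first step, the snippet below reads a tiny hypothetical CSV (held in a string via io.StringIO rather than on disk, so it runs anywhere) into a DataFrame; the column names mirror two of the article's columns:

```python
import io
import pandas as pd

# A hypothetical three-row CSV standing in for the real file on disk;
# io.StringIO lets pd.read_csv treat the string as a file.
csv_text = "days,blue_dif\n10,0.01\n200,0.05\n340,-0.02\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (3, 2)
```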
Next, a subset of the original data that meets specific conditions is selected through a series of conditional filtering operations. Specifically, we keep the values of the blue_dif, green_dif, red_dif and inf_dif columns that fall within a certain range, and store them in 4 new Series named blue_original, green_original, red_original and inf_original. These prepare the data for the histograms we draw later.
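This kind of conditional filtering can be illustrated on toy data (the values below are made up for the example):

```python
import pandas as pd

# Toy stand-in for the real blue_dif column.
df = pd.DataFrame({"blue_dif": [-0.20, -0.05, 0.00, 0.05, 0.20]})

# Boolean indexing keeps rows inside [-0.08, 0.08]; selecting the column
# afterwards yields a Series, just like blue_original in the main code.
blue_original = df[(df["blue_dif"] >= -0.08) & (df["blue_dif"] <= 0.08)]["blue_dif"]
print(blue_original.tolist())  # [-0.05, 0.0, 0.05]
```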
Second, we create a Boolean mask named mask, which is used to select the rows that meet the benchmark criterion; here, it picks out the rows whose days value lies between 0 and 45 or between 320 and 365.
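On a small made-up days column, the mask construction looks like this; note that & combines the two bounds of one range, | joins the two ranges, and the parentheses are required because & and | bind tighter than the comparisons:

```python
import pandas as pd

# Toy days column; values chosen to hit each branch of the condition.
df = pd.DataFrame({"days": [10, 100, 330, 400]})

mask = ((df["days"] >= 0) & (df["days"] <= 45)) | \
       ((df["days"] >= 320) & (df["days"] <= 365))
print(mask.tolist())  # [True, False, True, False]
```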
Subsequently, we use the apply function with a lambda expression. For the rows selected by mask (days between 0 and 45 or between 320 and 365), if a value in one of the 4 columns blue_dif, green_dif, red_dif and inf_dif is not within the specified range, it is randomly replaced with NaN. For the inf_dif column, p=[0.9, 0.1] specifies the probabilities of the two outcomes: a 0.9 chance of being replaced with NaN and a 0.1 chance of keeping the original value. It should be noted that if we do not give such a probability distribution, np.random.choice picks between the two options uniformly, i.e. with a 50% chance each, which is what happens for the other three columns.
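A small reproducible sketch of this replacement step, on a made-up Series (the range bounds match the article's -0.03 to 0.03; the seed is fixed only so the example is repeatable):

```python
import numpy as np
import pandas as pd

range_min, range_max = -0.03, 0.03
s = pd.Series([0.01, 0.10, -0.05, 0.00])

np.random.seed(0)  # fix the seed so the sketch is reproducible
# In-range values pass through unchanged; out-of-range values become NaN
# with probability 0.9, or survive with probability 0.1.
out = s.apply(lambda x: x if range_min <= x <= range_max
              else np.random.choice([np.nan, x], p=[0.9, 0.1]))
```

Values already inside the range (0.01 and 0.00 here) are always kept; only the out-of-range values (0.10 and -0.05) are subject to the random draw.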
Finally, we use the dropna function to delete the rows containing NaN values, obtaining the filtered data. We then recompute the subsets of the 4 columns on the processed data, using the same range conditions as before, and store them in blue_new, green_new, red_new and inf_new. Next, Matplotlib is used to create histograms that visualize the distributions of the original data and the processed data; the histograms are placed in 8 separate figures.
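The key point of this step is that dropna removes a row as soon as any of its columns holds a NaN; the sketch below shows this on toy data and also checks the histogram call (the Agg backend is used only so the example runs headless):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# A row is dropped as soon as ANY of its columns holds a NaN.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
clean = df.dropna()
print(len(clean))  # 1 -- only the first row has no NaN

# plt.hist returns the bin counts, so we can confirm every value was binned.
counts, edges, _ = plt.hist(clean["a"], bins=5)
```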
At the end of the code, the processed data is saved as a new .csv file at the path specified by result_file_path.
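To see what index=False does without touching the disk, we can write a toy DataFrame to an in-memory buffer instead of a file path:

```python
import io
import pandas as pd

df = pd.DataFrame({"days": [10, 330], "blue_dif": [0.01, -0.02]})

# Writing to an in-memory buffer instead of a path, to show the layout;
# index=False omits pandas' row index from the output.
buf = io.StringIO()
df.to_csv(buf, index=False)
lines = buf.getvalue().splitlines()
print(lines[0])  # days,blue_dif
```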
Running the above code, we get 8 histograms, as shown in the figures below, and the result file appears in the specified folder.
At this point, you're done.
Welcome to follow: Crazy Learning GIS