[Python] Training 5: Using pandas for data preprocessing (Lagrange interpolation, merge, standardization)

Title source:
Chapter 5 of "Python Data Analysis and Application"
[edited by Huang Hongmei and Zhang Liangjun, China Industry Information Publishing Group and People's Posts and Telecommunications Press]

The exercise text in this post was mainly transcribed with CamScanner (全能扫描王) text recognition
(typing the exercise questions out by hand was not feasible)

Data set download link (after downloading, look under Chapter 6 -> training data)

Training 1 Imputing missing values in user electricity consumption data

1. Training points
(1) Master the method of missing value identification.
(2) Master the methods of dealing with missing data.

2. Requirement description
The users' electricity consumption data exhibits a certain periodicity. The missing_data.csv table stores the electricity consumption data of user A, user B, and user C; it contains missing values, which must be imputed before the analysis can proceed.

3. Implementation ideas and steps
(1) Read the data in the missing_data.csv table.
(2) Query the location of missing values.
(3) Use the lagrange in the interpolate module of the SciPy library to perform Lagrange interpolation on the data.
(4) Check whether there are still missing values in the data; if not, the interpolation succeeded.
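Steps (1) and (2), locating the missing values, can be sketched on a tiny invented frame (the column names and values below are made up and do not come from missing_data.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for missing_data.csv
df = pd.DataFrame({0: [235.8, np.nan, 237.3], 1: [324.0, 325.6, np.nan]})
print(df.notnull())        # False marks the position of a missing value
print(df.isnull().sum())   # number of missing values per column
```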

#Training 1: Impute missing values in user electricity consumption data
import pandas as pd
import numpy as np
arr=np.array([0,1,2])
missing_data=pd.read_csv("./实训数据/missing_data.csv",names=arr)
#Locate the missing values
print("Before Lagrange interpolation (False marks a missing value)",'\n',missing_data.notnull())

#Lagrange interpolation
#dropna().index records the indices of the non-missing values
#dropna().values records the actual non-missing values
from scipy.interpolate import lagrange
for i in range(0,3): 
    #"Train" the Lagrange model on the non-missing points of column i
    la=lagrange(missing_data.loc[:,i].dropna().index,missing_data.loc[:,i].dropna().values)
    #list_d records the row indices of the missing values in the current column
    list_d=list(set(np.arange(0,21)).difference(set(missing_data.loc[:,i].dropna().index)))
    #Evaluate the fitted polynomial at the missing indices and fill in the results
    missing_data.loc[list_d,i]=la(list_d)  
    print("Number of missing values in column %d: %d"%(i,missing_data.loc[:,i].isnull().sum()))
print("After Lagrange interpolation (False marks a missing value)","\n",missing_data.notnull())
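Lagrange polynomials of high degree can oscillate badly between the known points. As a simpler alternative (not the method used in this training), pandas' built-in `Series.interpolate` fills gaps with linear interpolation by default; a minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up series with two gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
filled = s.interpolate(method="linear")  # straight-line fill between known points
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```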

Training 2 Combining line loss, power consumption trend, and line alarm data

1. Training points
(1) Master several methods of primary-key merging.
(2) Master primary-key merging on multiple key values.

2. Requirement description
Line loss data, line power consumption trend decline data, and line alarm data are important features for identifying whether a user is stealing electricity. The tables therefore need to be merged on their primary keys.

3. Implementation ideas and steps
(1) Read the ele_loss.csv and alarm.csv tables.
(2) Check the shape of the two tables.
(3) Inner join with ID and date as the primary key.
(4) View the merged data.

#Training 2: Merge line loss, power consumption trend, and line alarm data
import pandas as pd
ele_loss=pd.read_csv("./实训数据/ele_loss.csv")
alarm=pd.read_csv("./实训数据/alarm.csv", encoding='gbk')
#Check the shape of the two tables
print("Shape of the ele_loss table:",ele_loss.shape)
print("Shape of the alarm table:",alarm.shape)
#Inner join on ID and date
merge=pd.merge(ele_loss,alarm,left_on=["ID","date"],right_on=["ID","date"],how="inner")
print("Shape of the merged table:",merge.shape)
print("Merged table:",merge)
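The `how` parameter of `pd.merge` controls which key combinations survive the merge. A sketch on hypothetical miniature tables (the IDs, dates, and values below are invented, not from the training data):

```python
import pandas as pd

# Invented miniature stand-ins for ele_loss and alarm
ele_loss = pd.DataFrame({"ID": [1, 1, 2], "date": ["d1", "d2", "d1"], "loss": [0.1, 0.2, 0.3]})
alarm = pd.DataFrame({"ID": [1, 2, 3], "date": ["d1", "d1", "d1"], "alarm": ["A", "B", "C"]})

inner = pd.merge(ele_loss, alarm, on=["ID", "date"], how="inner")  # keys present in both tables
left = pd.merge(ele_loss, alarm, on=["ID", "date"], how="left")    # all keys from ele_loss
outer = pd.merge(ele_loss, alarm, on=["ID", "date"], how="outer")  # union of all keys
print(inner.shape, left.shape, outer.shape)  # (2, 4) (3, 4) (4, 4)
```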

Training 3 Standardized modeling expert sample data

1. Training points
(1) Master the principle of data standardization.
(2) Master the method of data standardization.

2. Requirement description
There are many kinds of algorithms. As soon as spatial distance calculation, gradient descent, or the like is involved, the line loss feature, the power consumption trend decline feature, and the line alarm feature must be standardized first.

3. Implementation ideas and steps
(1) Read model.csv data.
(2) Define the standard deviation normalization function.
(3) Use the function to standardize the three columns of data respectively.
(4) View the standardized data.

#Training 3: Standardize the modeling expert sample data
import pandas as pd
import numpy as np
model=pd.read_csv("./实训数据/model.csv",encoding = "gbk")
#Standard deviation (z-score) standardization function
def Standard(data):
    data=(data-data.mean())/data.std()
    return data
S=Standard(model)
print("Standardized data:",'\n',S.head())

'''
#Min-max (dispersion) standardization function
def MinMaxScale(data):
    data=(data-data.min())/(data.max()-data.min())
    return data
M=MinMaxScale(model)
print("Min-max standardized data:",'\n',M.head())

#Decimal scaling standardization function
def DecimalScaler(data): 
    data=data/10**np.ceil(np.log10(data.abs().max()))
    return data
D=DecimalScaler(model)
print("Decimal-scaled data:",'\n',D.head())'''
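The three formulas above can be checked on a tiny invented column; note that pandas' `df.std()` uses the sample standard deviation (ddof=1). The data below is made up for illustration only:

```python
import numpy as np
import pandas as pd

# Invented toy column
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

z = (df - df.mean()) / df.std()                      # standard deviation standardization
mm = (df - df.min()) / (df.max() - df.min())         # min-max standardization
dec = df / 10 ** np.ceil(np.log10(df.abs().max()))   # decimal scaling: divides by 10 here

print(dec["x"].tolist())  # [0.1, 0.2, 0.3, 0.4]
```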


The three standardization methods each have their advantages. Min-max standardization is simple and easy to understand, and it confines the standardized data to the interval [0, 1]. Standard deviation standardization is less affected by the distribution of the data. Decimal scaling standardization has a wide range of applications and is also less affected by the data distribution; compared with the first two methods, its applicability is moderate.
