Data Mining Learning: A Code Summary of Data Preprocessing Methods (Python)

Table of contents

1. Normalization methods

(1) Min-max normalization (deviation standardization)

(2) Zero-mean (z-score) normalization

(3) Decimal scaling normalization

2. Interpolation

(1) Lagrange interpolation method

3. Correlation analysis

(1) Pearson correlation coefficient

(2) Spearman correlation coefficient

4. Principal Component Analysis (PCA)


1. Normalization methods

Common normalization methods include:

(1) Min-max normalization (deviation standardization)

Min-max normalization applies a linear transformation to the original data, mapping each value into the [0, 1] interval (the default range).
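For each feature, the transformation is

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$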

It is usually implemented with the MinMaxScaler class from the sklearn library; the code is as follows:

from sklearn import preprocessing
import numpy as np

x = np.array([
    [1972, 685, 507, 962, 610, 1434, 1542, 1748, 1247, 1345],
    [262, 1398, 1300, 1056, 552, 1306, 788, 1434, 907, 1374],
])

# Create the scaler and map the data to [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
minmax_x = min_max_scaler.fit_transform(x)
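Note that MinMaxScaler treats each column as a separate feature and scales it independently; since x here has only two rows, every entry of the result is 0 or 1:

print(minmax_x)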

(2) Zero-mean (z-score) normalization

Zero-mean normalization transforms each feature so that its values have mean zero and unit standard deviation. This removes the differences in scale between features (or samples) and brings their distributions closer together; for some models (such as SVM) it can greatly improve performance, making the model more stable and improving prediction accuracy.
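The transformation is

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ the standard deviation of the feature.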

Code:

import numpy as np

# Zero-mean (z-score) normalization
def ZeroAvg_Normalize(data):
    # subtract the mean, then divide by the standard deviation
    text = (data - data.mean()) / data.std()
    return text
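As a quick check, reusing the first row of the sample array from above, the normalized output has mean approximately 0 and standard deviation 1:

x = np.array([1972, 685, 507, 962, 610, 1434, 1542, 1748, 1247, 1345])
z = ZeroAvg_Normalize(x)
print(z.mean(), z.std())  # approximately 0.0 and 1.0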

(3) Decimal scaling normalization

Decimal scaling normalizes by moving the position of the decimal point. How many places the decimal point moves depends on the maximum absolute value of the attribute, so that the scaled values fall in [-1, 1].
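In formula form:

$$x' = \frac{x}{10^{j}}, \qquad j = \left\lceil \log_{10} \max_i |x_i| \right\rceil$$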

The implementation code is as follows:

import numpy as np

# Decimal scaling normalization
def deci_sca(data):
    # divide by 10**j, where j is determined by the maximum absolute value
    new_data = data / (10 ** np.ceil(np.log10(np.abs(data).max())))
    return new_data
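For example, the maximum value in the sample row used above is 1972, so j = 4 and every value is divided by 10000:

x = np.array([1972, 685, 507, 962, 610, 1434, 1542, 1748, 1247, 1345])
print(deci_sca(x))  # values now lie between 0.0507 and 0.1972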

2. Interpolation

Interpolation fits a continuous function to discrete data so that the resulting curve passes through all of the given discrete data points.

It is an important method of function approximation: from the values of a function at a limited number of points, it estimates approximate values of the function at other points.

In image processing, interpolation is used to fill the gaps introduced by image transformations.

(1) Lagrange interpolation method

Lagrange interpolation constructs a basis polynomial at each node and takes a linear combination of these basis polynomials; the combination coefficients are the function values at the nodes, which yields the interpolating polynomial.
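Concretely, given nodes $(x_i, y_i)$ for $i = 0, \dots, n$, the interpolating polynomial is

$$L(x) = \sum_{i=0}^{n} y_i \prod_{j \ne i} \frac{x - x_j}{x_i - x_j}$$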

It can be implemented by calling the lagrange function in the scipy library; the code is as follows:

# Lagrange interpolation with scipy
from scipy.interpolate import lagrange
import numpy as np

x_known = np.array([987, 1325, 1092, 475, 2911])
y_known = np.array([372, 402, 1402, 1725, 1410])

# build the interpolating polynomial through the known points
# and evaluate it at x = 4
new_data = lagrange(x_known, y_known)(4)
print(new_data)
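A common practical use is filling a missing value in a series by interpolating over the nearby known points. Below is a minimal sketch of that idea; the toy series, the helper name fill_with_lagrange, and the window size k are illustrative assumptions, not part of the original code:

import pandas as pd
import numpy as np
from scipy.interpolate import lagrange

s = pd.Series([1.0, 2.1, np.nan, 4.2, 5.0, 5.9])  # toy series with one gap

def fill_with_lagrange(s, idx, k=2):
    # take up to k known neighbours on each side of the missing position
    window = s.iloc[max(idx - k, 0): idx + k + 1]
    window = window[(window.index != idx) & window.notna()]
    return lagrange(window.index.values, window.values)(idx)

s.loc[2] = fill_with_lagrange(s, 2)
print(s.loc[2])  # about 3.2, consistent with the surrounding trend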

3. Correlation analysis

(1) Pearson correlation coefficient

The Pearson correlation coefficient is the covariance of two variables divided by the product of their standard deviations. It measures linear correlation, and it is appropriate for continuous, normally distributed variables.
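In formula form:

$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$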

It can be computed by calling the pandas corr() method with the method parameter set to 'pearson'. The code is as follows (df is assumed to be an existing pandas DataFrame):

# Pearson correlation matrix
corr_pearson = df.corr(method='pearson')

(2) Spearman correlation coefficient

The Spearman correlation coefficient is the Pearson correlation coefficient computed on rank (order) variables. It captures monotonic, possibly nonlinear, relationships; it does not require normally distributed continuous data, only data that can at least be ordered.
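When there are no tied ranks, it simplifies to

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the two ranks of observation $i$.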

# Spearman correlation matrix
corr_spearman = df.corr(method='spearman')
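As a self-contained check (the toy DataFrame below is an illustrative assumption), a perfectly monotonic but nonlinear relationship yields a Spearman coefficient of exactly 1 while the Pearson coefficient stays below 1:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [1, 8, 27, 64, 125]})  # y = x**3: monotonic, nonlinear
print(df.corr(method='pearson'))   # off-diagonal entries < 1
print(df.corr(method='spearman'))  # off-diagonal entries == 1.0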

4. Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical method that converts a group of possibly correlated variables into a group of linearly uncorrelated variables through an orthogonal transformation. The transformed variables are called principal components.

In data preprocessing, PCA is often used to reduce the dimensionality of data by mapping n-dimensional features onto k dimensions (k < n). These k dimensions are brand-new orthogonal features, known as principal components, reconstructed from the original n-dimensional features.

The specific implementation steps are as follows:

1) Standardize the data to eliminate the influence of different scales; extreme-value (min-max) standardization or standard-deviation (z-score) standardization can be used.

2) Compute the covariance matrix of the standardized data.

3) Compute the eigenvalues and eigenvectors of the covariance matrix, and determine the principal components according to the eigenvalues (a sketch of steps 1 to 3 follows this list).

4) Combine domain knowledge with the information carried by each principal component to give an appropriate interpretation.
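A minimal numpy sketch of steps 1) to 3), assuming z-score standardization (the function name pca_steps and the choice of k are illustrative):

import numpy as np

def pca_steps(X, k):
    # step 1: z-score standardization, column by column
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # step 2: covariance matrix of the standardized data
    C = np.cov(Z, rowvar=False)
    # step 3: eigen-decomposition, sorted by descending eigenvalue
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # project onto the top-k principal components
    return Z @ eigvecs[:, :k], eigvals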

You can also directly call the PCA class in sklearn.decomposition; the code is as follows:

# Using sklearn's PCA
from sklearn.decomposition import PCA
import numpy as np
import pandas as pd

# columns: ability, character, collateral, capital, conditions
df = pd.DataFrame({'能力': [66, 65, 57, 67, 61, 64, 64, 63, 65, 67, 62, 68, 65, 62, 64],
                   '品格': [64, 63, 58, 69, 61, 65, 63, 63, 64, 69, 63, 67, 65, 63, 66],
                   '担保': [65, 63, 63, 65, 62, 63, 63, 63, 65, 69, 65, 65, 66, 64, 66],
                   '资本': [65, 65, 59, 68, 62, 63, 63, 63, 66, 68, 64, 67, 65, 62, 65],
                   '环境': [65, 64, 66, 64, 63, 63, 64, 63, 64, 67, 64, 65, 64, 66, 67]
                   })

# Run a principal component analysis on the data
pca = PCA()
pca.fit(df)  # fit the model on the data

print('--- Explained variance ratio of each projected dimension ---')
print(pca.explained_variance_ratio_)

print('--- Explained variance of each projected dimension ---')
print(pca.explained_variance_)
print('--- Eigenvectors corresponding to the principal components ---')
print(pca.components_)
print('--- Dimensionality reduction with the fitted PCA model ---')
print(pca.transform(df))  # project the data onto the principal components
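To actually reduce the dimensionality, pass n_components to PCA; sklearn also accepts a float between 0 and 1, meaning "keep enough components to explain at least this fraction of the variance":

pca2 = PCA(n_components=2)        # keep the first two principal components
reduced = pca2.fit_transform(df)

pca95 = PCA(n_components=0.95)    # keep enough components for 95% of the variance
reduced95 = pca95.fit_transform(df)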


Origin blog.csdn.net/weixin_52135595/article/details/129998426