Machine learning - scikit-learn - data preprocessing normalization and standardization - 2

Detailed explanation of scikit-learn's normalization method

1. The fit_transform method

1. Min-max normalization: manual vs. automated code comparison:

Formula: (This formula is easy to misread. The target interval feature_range is [0, 1]; it is not the interval formed by the minimum and maximum values found in the matrix.) Here max = 1 and min = 0.

$$X_{std}=\frac{X-X.\min(axis=0)}{X.\max(axis=0)-X.\min(axis=0)}$$

$$X_{scaled}=X_{std}\times(max-min)+min$$, with range $[0,1]$

$X.\max(axis=0)$ is the matrix (row vector) of per-column maximum values
$X.\min(axis=0)$ is the matrix (row vector) of per-column minimum values
$max$ is equal to 1
$min$ is equal to 0

Code demo:

#!/usr/bin/env python
# -*-coding:utf-8-*-
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[4,2,3],
                [1,5,6]])

print("data type: ",type(data))
# Manual normalization
feature_range = [0,1]  # target interval to map to
print("Minimum of each column: ",data.min(axis=0))  # axis=0: move down the rows, minimum of each column
print("Maximum of each column: ",data.max(axis=0))  # axis=0: move down the rows, maximum of each column
x_std = (data-data.min(axis=0))/(data.max(axis=0)-data.min(axis=0))  # axis=0: column-wise scaling to [0, 1]
x_scaled = x_std*(feature_range[1]-feature_range[0]) + feature_range[0]
print('Manual normalization result:\n{}'.format(x_scaled))

# Automated normalization
scaler = MinMaxScaler()
print('Automated normalization result:\n{}'.format(scaler.fit_transform(data)))

About the result: with axis=0 the matrix is processed column by column. Here the values 4 and 1 form the first column, 2 and 5 the second, and 3 and 6 the third.

The smallest value in the first column is 1, in the second column 2, and in the third column 3, so data.min(axis=0) is [1, 2, 3]. In the same way, data.max(axis=0) gives the maximum of each column: [4, 5, 6].
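
To make the role of feature_range concrete, here is a minimal sketch that maps the same matrix into [-1, 1] instead of [0, 1]. The target interval (lo, hi) is an illustrative choice, not something taken from the example above; the point is only that the manual two-step formula and MinMaxScaler(feature_range=...) should agree.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[4, 2, 3],
                 [1, 5, 6]], dtype=float)

lo, hi = -1, 1  # illustrative target interval (an assumption for this sketch)

# Manual: scale each column to [0, 1], then stretch and shift into [lo, hi]
x_std = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
manual = x_std * (hi - lo) + lo

# Automated: MinMaxScaler with the same target interval
auto = MinMaxScaler(feature_range=(lo, hi)).fit_transform(data)

print(manual)
print(np.allclose(manual, auto))
```

Each column's minimum lands on lo and its maximum on hi, regardless of the raw values in the matrix.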

2. Manual code demonstration of mean normalization:

Formula:

$$X_{std}=\frac{X-X_{mean}}{X_{max}-X_{min}}$$

$$X_{scaled}=X_{std}\times(max-min)+min$$, with target range $[-1,1]$

$X_{max}=X.\max(axis=0)$ is the matrix (row vector) of per-column maximum values
$X_{min}=X.\min(axis=0)$ is the matrix (row vector) of per-column minimum values
$X_{mean}=X.mean(axis=0)$ is the matrix (row vector) of per-column means
$max$ is equal to 1
$min$ is equal to -1

Note: $X_{std}$ already lies within $[-1,1]$ after the first step. With max = 1 and min = -1, the second step computes $2X_{std}-1$, which can fall below -1, so the stated range strictly holds for $X_{std}$ rather than for $X_{scaled}$.
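
As a quick check of the ranges involved, the following sketch applies mean normalization to a small matrix and prints the extremes before and after the extra rescaling step. It is only an illustration under the formulas as written here, not a library routine.

```python
import numpy as np

data = np.array([[0, -3, 1],
                 [3, 1, 2],
                 [0, 1, -1]], dtype=float)

col_range = data.max(axis=0) - data.min(axis=0)
x_std = (data - data.mean(axis=0)) / col_range   # mean normalization, per column

print(x_std.mean(axis=0))        # each column mean is (numerically) zero
print(x_std.min(), x_std.max())  # both values stay inside [-1, 1]

# Extra rescaling step with max = 1 and min = -1
x_scaled = x_std * (1 - (-1)) + (-1)
print(x_scaled.min(), x_scaled.max())  # the minimum drops below -1
```

If the goal is simply data centered on zero within [-1, 1], x_std on its own already provides that.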

Manual code demonstration example 1:

#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np

data = np.array([[ 0, -3,  1],
              [ 3,  1,  2],
              [ 0,  1, -1]])

print("Matrix: \n",data)
print("data type: ",type(data))
# Manual normalization
feature_range = [-1,1]  # target interval to map to
print("Minimum of each column: ",data.min(axis=0))  # axis=0: move down the rows, minimum of each column
print("Maximum of each column: ",data.max(axis=0))  # axis=0: move down the rows, maximum of each column
print("Mean of each column: ",data.mean(axis=0))   # axis=0: move down the rows, mean of each column
x_std = (data-data.mean(axis=0))/(data.max(axis=0)-data.min(axis=0))  # axis=0: column-wise mean normalization
x_scaled = x_std*(feature_range[1]-feature_range[0]) + feature_range[0]
print('Manual normalization result:\n{}'.format(x_scaled))

Manual code demonstration example 2:

#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np

data = np.array([[4,2,3],
                [1,5,6]])

print("Matrix: \n",data)
print("data type: ",type(data))
# Manual normalization
feature_range = [-1,1]  # target interval to map to
print("Minimum of each column: ",data.min(axis=0))  # axis=0: move down the rows, minimum of each column
print("Maximum of each column: ",data.max(axis=0))  # axis=0: move down the rows, maximum of each column
print("Mean of each column: ",data.mean(axis=0))   # axis=0: move down the rows, mean of each column
x_std = (data-data.mean(axis=0))/(data.max(axis=0)-data.min(axis=0))  # axis=0: column-wise mean normalization
x_scaled = x_std*(feature_range[1]-feature_range[0]) + feature_range[0]
print('Manual normalization result:\n{}'.format(x_scaled))


3. Manual code demonstration of decimal scaling normalization:

Formula:

$$X_{new}=\frac{X}{10^k}$$

  • k depends on the maximum absolute value of the attribute values in X.
  • Decimal scaling normalization normalizes the data by moving the position of the decimal point.
  • How many places the decimal point moves depends on the maximum absolute value among the attribute values in X.

Here the attributes in X are particular attributes of the sample instances, such as length, width, or quantity.
Decimal scaling normalization scales the data into [-1, 1], or into [0, 1] when all values are non-negative.

Manual code demo:

#!/usr/bin/env python
# -*-coding:utf-8-*-

from sklearn import preprocessing
import numpy as np

# Initialize the data
data = np.array([[ 0, -3,  1],
              [ 3,  1,  2],
              [ 0,  1, -1]])

# Decimal scaling normalization (manual)
j = np.ceil(np.log10(np.max(abs(data))))  # abs returns a matrix of absolute values
scaled_data = data/(10**j)  # power operation
print(scaled_data)

The abs() function takes the absolute value of every element and returns a matrix of absolute values. np.max() then finds the largest value in that matrix, np.log10() takes its base-10 logarithm, and np.ceil() rounds the result up to the nearest integer (it always rounds up; it neither rounds to the nearest integer nor truncates).
This gives the value of k, and the normalized data is then computed with the formula.
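
The demo above uses one k for the whole matrix. Since the text describes k in terms of attribute values, a per-column variant may be closer to that description. The sketch below is one possible way to write it; decimal_scale is a hypothetical helper name, the sample matrix is made up, and the guard for all-zero columns is an added assumption (log10(0) would otherwise give -inf).

```python
import numpy as np

def decimal_scale(X):
    """Divide each column by 10**k, with k chosen per column (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    max_abs = np.abs(X).max(axis=0)
    # Guard all-zero columns: substitute 1 so that k becomes 0 for them
    safe = np.where(max_abs > 0, max_abs, 1.0)
    k = np.ceil(np.log10(safe))
    return X / (10.0 ** k), k

# Made-up data whose columns differ by orders of magnitude
data = np.array([[0, -30, 1],
                 [3, 1, 250],
                 [0, 1, -1]])

scaled, k = decimal_scale(data)
print(k)       # one exponent per column
print(scaled)  # every value now lies in [-1, 1]
```

With a single global k, the small columns would be squeezed toward zero by the largest column; choosing k per column avoids that.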


4. Zero-mean normalization (mean removal) manual and automated code demonstration:

Official documentation:

The sklearn.preprocessing.scale() function


sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)

Standardize a dataset along an axis: center it on the mean and scale it component-wise to unit variance.
The value of axis can be 0 or 1: 0 means moving down the rows (per column), 1 means moving left to right (per row).

Parameter | Type | Meaning
X | {array-like, sparse matrix} | The data to center and scale
axis | int, default 0 | The axis along which the mean and standard deviation are computed. If 0, standardize each feature (column) independently; if 1, standardize each sample (row)
with_mean | bool, default True | If True, center the data before scaling
with_std | bool, default True | If True, scale the data to unit variance (equivalently, unit standard deviation)
copy | bool, default True | If False, perform the normalization in place and avoid a copy (if the input is already a NumPy array or a scipy.sparse CSC matrix and axis is 1)
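
The effect of axis and with_std can be seen in a small sketch. It uses the same 3x3 matrix as the demos in this post, written as floats; the variable names are just for illustration.

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[0., -3., 1.],
                 [3., 1., 2.],
                 [0., 1., -1.]])

by_column = preprocessing.scale(data)                 # axis=0: each feature (column)
by_row = preprocessing.scale(data, axis=1)            # axis=1: each sample (row)
centered = preprocessing.scale(data, with_std=False)  # subtract the mean only

print(by_column.mean(axis=0), by_column.std(axis=0))  # ~0 and 1 per column
print(by_row.mean(axis=1), by_row.std(axis=1))        # ~0 and 1 per row
print(np.allclose(centered, data - data.mean(axis=0)))
```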

Formula:

$$X_{new}=\frac{X-X_{mean}}{X_{std}}$$

After the transformation each column has a mean of 0 and a variance of 1.
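
One detail the formula leaves open is which standard deviation is meant. The sketch below suggests that preprocessing.scale matches the population standard deviation (ddof=0, which is also NumPy's default) rather than the sample standard deviation (ddof=1); the variable names are illustrative.

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[0., -3., 1.],
                 [3., 1., 2.],
                 [0., 1., -1.]])

centered = data - data.mean(axis=0)
z_pop = centered / data.std(axis=0)             # population std, ddof=0
z_sample = centered / data.std(axis=0, ddof=1)  # sample std, ddof=1

auto = preprocessing.scale(data)
print(np.allclose(z_pop, auto))     # expected to agree
print(np.allclose(z_sample, auto))  # expected to differ
print(z_pop.var(axis=0))            # variance of each column is 1
```

This matters when comparing results with tools that default to the sample standard deviation.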

Manual and automated code comparison demo:

#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np
from sklearn import preprocessing

data = np.array([[ 0, -3,  1],
              [ 3,  1,  2],
              [ 0,  1, -1]])
print("Matrix: \n",data)
print("data type: ",type(data))
# Manual standardization
print("Mean of each column: ",data.mean(axis=0))   # axis=0: move down the rows, mean of each column
print("Standard deviation of each column: ",data.std(axis=0))   # axis=0: move down the rows, standard deviation of each column
x_std = (data-data.mean(axis=0))/(data.std(axis=0))  # axis=0: column-wise standardization

print('Manual standardization result:\n{}'.format(x_std))

data = np.array([[ 0, -3,  1],
              [ 3,  1,  2],
              [ 0,  1, -1]])

# Z-score standardization (automated)
scaled_data = preprocessing.scale(data)
print("Automated standardization result: \n",scaled_data)


What are fit and transform?

1. The fit function of the MinMaxScaler normalization interface

The official definition of MinMaxScaler's fit function:

Compute the minimum and maximum to be used for later scaling.

In other words, fit computes the maximum and minimum values of the dataset that is to be normalized, and that is where it stops; it does not produce the normalized result.

So fit only extracts the per-column maximum and minimum from the dataset and does not carry out the next step, the normalization or standardization itself. This describes the fit of MinMaxScaler, the min-max normalization module, and should be distinguished from the fit of StandardScaler, which is introduced next.

MinMaxScaler's fit fitting function code demonstration:

#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Create the array
data_rn = np.random.randint(-10, 10, 10).reshape(5, 2)  # draw 10 random integers from [-10, 10) and arrange them as a 5x2 matrix
print("Matrix: \n",data_rn)
# Fit the normalizer
scaler_mmc = MinMaxScaler()
scaler_mmc_fit = scaler_mmc.fit(data_rn)   # works column by column (per feature)
print("Maximum of each column: ",scaler_mmc_fit.data_max_)  # maximum
print("Minimum of each column: ",scaler_mmc_fit.data_min_)  # minimum
print("Range of each column: ",scaler_mmc_fit.data_range_) # range = maximum - minimum; two columns give two ranges

Running result:
fit extracts the maximum and minimum of each column. In the author's run (the input is random, so the numbers change from run to run), the maximum of the first column was 5 and of the second column 3, while the minimum of the first column was -8 and of the second column -4.
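
For a reproducible version of this check, the sketch below fixes the random seed (an illustrative choice that differs from the np.random.randint call in the demo) and compares the fitted attributes with NumPy's own column-wise results.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)               # fixed seed, chosen only for this sketch
data_rn = rng.integers(-10, 10, size=(5, 2))

scaler = MinMaxScaler().fit(data_rn)         # fit returns the scaler itself

print(scaler.data_min_, data_rn.min(axis=0))
print(scaler.data_max_, data_rn.max(axis=0))
print(scaler.data_range_)

# Nothing has been scaled so far; scaling happens only in transform
print(scaler.transform(data_rn))
```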

2. The fit function of StandardScaler standardized interface

The official definition of StandardScaler's fit function:

Compute the mean and std to be used for later scaling.

That is, the fit function of the StandardScaler module computes the mean and standard deviation of the dataset that is to be standardized.

StandardScaler's fit fitting function code demonstration:

#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np
from sklearn.preprocessing import StandardScaler

# Create the array
data_rn = np.random.randint(-10, 10, 10).reshape(5, 2)  # draw 10 random integers from [-10, 10) and arrange them as a 5x2 matrix
print("Matrix: \n",data_rn)
# Fit the standardizer
scaler_ss = StandardScaler()
scaler_ss_fit = scaler_ss.fit(data_rn)
print("Mean: ",scaler_ss_fit.mean_) # mean, computed per column
print("Variance: ",scaler_ss_fit.var_) # variance, computed per column

As the output of this code shows, the mean and variance are computed for each column, which gives two means and two variances. The purpose of fit is to extract these statistics in preparation for the preprocessing operations that follow.
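
The fitted statistics can be compared directly with NumPy. This sketch uses a small made-up matrix and also prints scale_, the per-column standard deviation that StandardScaler divides by.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration: 3 samples, 2 features
data = np.array([[1., 10.],
                 [2., 20.],
                 [3., 60.]])

scaler = StandardScaler().fit(data)

print(scaler.mean_, data.mean(axis=0))  # per-column mean
print(scaler.var_, data.var(axis=0))    # per-column variance
print(scaler.scale_)                    # square root of var_
```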

Summary of fit:
In simple terms, fit obtains the inherent statistics of the dataset, such as the mean, variance, maximum, and minimum, and is usually used together with transform.

From the perspective of the algorithm model, the fit process can be understood as a training process.

Understanding of transform and fit_transform functions

Definitions from the official documentation:

MinMaxScaler: Scale features of X according to feature_range.
StandardScaler: Perform standardization by centering and scaling.

In other words, transform is the step that actually performs the normalization or standardization; fit is only the preparation that comes before it.

From the perspective of the algorithm model, the transform process can be understood as a conversion process.
Usage is simple: call transform on a scaler that has already been fitted.

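import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data_rn = np.random.randint(-10, 10, 10).reshape(5, 2)  # same kind of random 5x2 matrix as in the fit demos

# Normalization
scaler_mmc = MinMaxScaler()
scaler_mmc_fit = scaler_mmc.fit(data_rn)   # works column by column (per feature)
scaler_mmc_result = scaler_mmc.transform(data_rn)
# Standardization
scaler_ss = StandardScaler()
scaler_ss_fit = scaler_ss.fit(data_rn)
scaler_ss_result = scaler_ss.transform(data_rn)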

The final result is identical to calling fit_transform directly. That is:

fit + transform = fit_transform

fit_transform combines fit and transform, so the whole process includes both training and conversion: it first fits the data to find overall statistics such as the mean, variance, maximum, and minimum, and then transforms the dataset to carry out the standardization or normalization.
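
The equivalence can be checked numerically. The following sketch runs both paths for MinMaxScaler and StandardScaler on a seeded random matrix; the seed and the results dictionary are illustrative choices.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)  # fixed seed, for reproducibility only
data_rn = rng.integers(-10, 10, size=(5, 2)).astype(float)

results = {}
for scaler_cls in (MinMaxScaler, StandardScaler):
    two_step = scaler_cls().fit(data_rn).transform(data_rn)  # fit, then transform
    one_step = scaler_cls().fit_transform(data_rn)           # both in one call
    results[scaler_cls.__name__] = np.allclose(two_step, one_step)

print(results)
```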

Skills used in the project - fit_transform and transform

After understanding the usage of fit and transform, you can learn the tips used in the project.

The data set of the project is generally divided into training set and test set. The training set is used to train the model, and the test set is used to verify the effect of the model.

For the trained model to also score well on the test set, the training set and the test set must not only follow the same data distribution but also go through the same preprocessing operations, for example standardization and normalization.

Therefore, in general, the training set is fitted and transformed (fit_transform), and the test set is only transformed (transform).

Note that only the training set is used for fitting; the fitted "model" is then used to transform both the training set and the test set. Make sure you understand this logic!

MinMaxScaler interface code demo:

from sklearn.preprocessing import MinMaxScaler

# train_x and test_x are assumed to be existing feature arrays
scaler_mmc = MinMaxScaler()
# Training set
new_train_x = scaler_mmc.fit_transform(train_x)
# Test set
new_test_x = scaler_mmc.transform(test_x)

StandardScaler interface code demo:

from sklearn.preprocessing import StandardScaler

# train_x and test_x are assumed to be existing feature arrays
scaler_ss = StandardScaler()
# Training set
new_train_x = scaler_ss.fit_transform(train_x)
# Test set
new_test_x = scaler_ss.transform(test_x)

Important:
Do not call fit_transform on both the training set and the test set. The test set would still be converted (normalized or standardized) without any error, but the two results would not be on the same scale and would differ noticeably.
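
A tiny example makes the difference visible. The arrays below are made-up toy data with one feature, chosen so that the test values lie above the training range; right and wrong are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature, test values higher than training values
train_x = np.array([[0.0], [5.0], [10.0]])
test_x = np.array([[10.0], [15.0], [20.0]])

scaler = MinMaxScaler()
new_train_x = scaler.fit_transform(train_x)   # learn min/max from the training set
right = scaler.transform(test_x)              # reuse the training min/max
wrong = MinMaxScaler().fit_transform(test_x)  # refit on the test set (the mistake)

print(new_train_x.ravel())
print(right.ravel())
print(wrong.ravel())
```

The raw value 10 maps to about 1 under the training scale but to 0 when the test set is refitted, so the same measurement would mean two different things to the model. Note also that correctly transformed test values may fall outside [0, 1]; that is expected when the test data exceeds the training range.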

Summary:

  1. When solving a problem with machine learning, the dataset is divided into a training set and a test set.
  2. We can first process the training set with fit_transform() and then process the test set with transform(). The test set is then normalized with the statistics of the training set, which puts both sets on the same scale so that the algorithm performs as similarly as possible on the two. (Calling fit_transform() is equivalent to calling fit() and then transform(); fit() does not need to be called again, because fit_transform() has already stored the fitted statistics for transform() to use.)
  3. If fit_transform() is used on the test set, the test data is normalized with the test set's own statistics. Do not mix the two approaches on the test set; doing so often leaves the loss on the test set much larger than on the validation set.
  4. The fit() method needs little explanation: it performs the same fitting as fit_transform(), except that fit_transform() also returns the transformed result, whereas fit() only trains the transformer and returns no transformed data.
  5. First, if you want to inspect the statistics of the data during fit_transform, split it into fit followed by transform; the fitted scaler holds those statistics.
  6. If you do not care about the statistics and only want the final result, use fit_transform directly in one step.
  7. Second, the training data and the test data in a project must be converted with the same standard; remember not to call fit_transform on each of them separately.


2. The normalize() method

Parameters of the normalize method:

sklearn.preprocessing.normalize(X, norm='l2', *, axis=1, copy=True, return_norm=False)

X: the data to be normalized.
norm: {'l1', 'l2', 'max'}, the norm to use. The default is 'l2', though 'l1' is also commonly used. The norm is computed for each vector along axis (each row by default), not for the matrix as a whole. With 'l1', every value is divided by the sum of the absolute values in its vector; with 'l2', by the square root of the sum of squares; with 'max', by the largest absolute value in the vector, that is, the infinity norm $\|x\|_{\infty}$.

This method is often used to keep data points from differing greatly just because of the natural scale of their features, that is, to bring the data to the same order of magnitude and make different features more comparable.

axis: {0, 1}, default 1. The axis along which the data is normalized. If 1, normalize each sample (row) independently; if 0, normalize each feature (column).

copy: bool, default=True. If False, try to normalize in place and avoid a copy.
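
To see the three norms and the axis parameter in action, here is a short sketch on a made-up 2x2 matrix. It checks that each row (or column) has unit length under the chosen norm.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up data: 2 samples, 2 features
X = np.array([[3., 4.],
              [1., -1.]])

l2 = normalize(X)               # default: norm='l2', axis=1 (per row)
l1 = normalize(X, norm='l1')    # rows divided by their sum of absolute values
mx = normalize(X, norm='max')   # rows divided by their largest absolute value
by_col = normalize(X, axis=0)   # l2 norm, but per column

print(l2)
print(np.abs(l1).sum(axis=1))          # each row sums to 1 in absolute value
print(np.abs(mx).max(axis=1))          # each row's largest absolute value is 1
print(np.linalg.norm(by_col, axis=0))  # each column has l2 length 1
```

Unlike the scalers earlier in this post, normalize is a stateless function: it learns nothing from a training set, so there is no fit step to reuse on test data.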

Origin blog.csdn.net/qq_42701659/article/details/124478468