Table of contents
- Detailed explanation of scikit-learn's normalization method
- 1. The fit_transform method
- What are fit and transform?
- 1. The fit function of the MinMaxScaler normalization interface
- 2. The fit function of StandardScaler standardized interface
- Understanding of transform and fit_transform functions
- Skills used in the project - fit_transform and transform
- 2. The normalize() method
- General idea reference link:
Detailed explanation of scikit-learn's normalization method
1. The fit_transform method
1. Min-max normalization: manual and automated code comparison demonstration:
Formula (note: max and min in the second equation are the bounds of the target feature_range, here [0, 1], not the maximum and minimum of the data matrix, so max = 1 and min = 0):
$$X_{std}=\frac{X-X.min(axis=0)}{X.max(axis=0)-X.min(axis=0)}$$
$$X_{scaled}=X_{std}\times(max-min)+min$$, with range $[0,1]$
$X.max(axis=0)$ is the vector of per-column maximum values
$X.min(axis=0)$ is the vector of per-column minimum values
$max$ is equal to 1
$min$ is equal to 0
Code demo:
#!/usr/bin/env python
# -*-coding:utf-8-*-
from sklearn.preprocessing import MinMaxScaler
import numpy as np
data = np.array([[4, 2, 3],
                 [1, 5, 6]])
print("data type: ", type(data))
# Manual normalization
feature_range = [0, 1]  # target interval to map into
print("Minimum of each column: ", data.min(axis=0))  # axis=0: top to bottom, i.e. per column
print("Maximum of each column: ", data.max(axis=0))  # axis=0: top to bottom, i.e. per column
x_std = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))  # scale each column to [0, 1]
x_scaled = x_std * (feature_range[1] - feature_range[0]) + feature_range[0]  # map into feature_range
print('Manual normalization result:\n{}'.format(x_scaled))
# Automated normalization
scaler = MinMaxScaler()
print('Automated normalization result:\n{}'.format(scaler.fit_transform(data)))
Running result:
> Here 4 and 1 form one column; likewise 2 and 5, and 3 and 6, each form one column.
In the first column the smallest value is 1, in the second column it is 2, and in the third column it is 3, so data.min(axis=0) is [1, 2, 3]. In the same way, data.max(axis=0) gives the maximum of each column, [4, 5, 6].
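The formula above takes max and min from feature_range rather than from the data. A minimal sketch of that point, assuming a target interval of [-1, 1] (the helper name min_max_scale is hypothetical and not part of scikit-learn), compares the manual formula with MinMaxScaler:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[4, 2, 3],
                 [1, 5, 6]], dtype=float)

def min_max_scale(x, feature_range=(0.0, 1.0)):
    """Column-wise min-max scaling into feature_range (hypothetical helper)."""
    lo, hi = feature_range
    col_min = x.min(axis=0)          # per-column minimum of the data
    col_max = x.max(axis=0)          # per-column maximum of the data
    x_std = (x - col_min) / (col_max - col_min)   # each column now spans [0, 1]
    return x_std * (hi - lo) + lo    # stretch and shift into [lo, hi]

manual = min_max_scale(data, feature_range=(-1.0, 1.0))
auto = MinMaxScaler(feature_range=(-1, 1)).fit_transform(data)

print(manual)
print("manual matches MinMaxScaler:", np.allclose(manual, auto))
```

The helper divides by zero for a constant column; MinMaxScaler guards against that case internally, so the sketch is only meant for data where every column has at least two distinct values.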
2. Manual code demonstration of mean normalization:
Formula:
$$X_{std}=\frac{X-X_{mean}}{X_{max}-X_{min}}$$, with range $[-1,1]$
$$X_{scaled}=X_{std}\times(max-min)+min$$
$X_{max}$ is X.max(axis=0), the vector of per-column maximum values
$X_{min}$ is X.min(axis=0), the vector of per-column minimum values
$max$ is equal to 1
$min$ is equal to -1
Note that $X_{std}$ already lies within $[-1,1]$, because $|X-X_{mean}|$ can never exceed $X_{max}-X_{min}$ in any column. Applying the second equation with max = 1 and min = -1 on top of it stretches the values into $[-3,1]$, so $X_{std}$ is the value to use when the result must stay within $[-1,1]$.
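As a small worked example of the first equation, take a single column with the values 4 and 1 (chosen here purely for illustration; it is the first column of the 2×3 matrix used in the min-max demo):

```latex
\begin{aligned}
X &= \begin{pmatrix} 4 \\ 1 \end{pmatrix}, \qquad
X_{mean} = \frac{4 + 1}{2} = 2.5, \qquad
X_{max} - X_{min} = 4 - 1 = 3 \\[4pt]
X_{std} &= \frac{X - X_{mean}}{X_{max} - X_{min}}
         = \begin{pmatrix} (4 - 2.5)/3 \\ (1 - 2.5)/3 \end{pmatrix}
         = \begin{pmatrix} 0.5 \\ -0.5 \end{pmatrix}
\end{aligned}
```

Both values fall inside $[-1,1]$ and they sum to zero, which is what subtracting the column mean is expected to produce.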
Manual code demonstration example 1:
#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np
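data = np.array([[ 0, -3,  1],
                 [ 3,  1,  2],
                 [ 0,  1, -1]])
print("Matrix: \n", data)
print("data type: ", type(data))
# Manual normalization
feature_range = [-1, 1]  # target interval to map into
print("Minimum of each column: ", data.min(axis=0))  # axis=0: top to bottom, i.e. per column
print("Maximum of each column: ", data.max(axis=0))  # axis=0: top to bottom, i.e. per column
print("Mean of each column: ", data.mean(axis=0))  # axis=0: top to bottom, i.e. per column
x_std = (data - data.mean(axis=0)) / (data.max(axis=0) - data.min(axis=0))  # mean normalization per column; already within [-1, 1]
x_scaled = x_std * (feature_range[1] - feature_range[0]) + feature_range[0]  # rescale as in the formula; with [-1, 1] this stretches x_std into [-3, 1]
print('Manual normalization result:\n{}'.format(x_scaled))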
Running result 1:
Manual code demonstration example 2:
#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np
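data = np.array([[4, 2, 3],
                 [1, 5, 6]])
print("Matrix: \n", data)
print("data type: ", type(data))
# Manual normalization
feature_range = [-1, 1]  # target interval to map into
print("Minimum of each column: ", data.min(axis=0))  # axis=0: top to bottom, i.e. per column
print("Maximum of each column: ", data.max(axis=0))  # axis=0: top to bottom, i.e. per column
print("Mean of each column: ", data.mean(axis=0))  # axis=0: top to bottom, i.e. per column
x_std = (data - data.mean(axis=0)) / (data.max(axis=0) - data.min(axis=0))  # mean normalization per column; already within [-1, 1]
x_scaled = x_std * (feature_range[1] - feature_range[0]) + feature_range[0]  # rescale as in the formula; with [-1, 1] this stretches x_std into [-3, 1]
print('Manual normalization result:\n{}'.format(x_scaled))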
Running result 2:
3. Manual code demonstration of decimal scaling normalization:
Formula:
$$X_{new}=\frac{X}{10^k}$$
- k depends on the maximum absolute value of the attribute values in X.
- Decimal scaling normalization normalizes the data by moving the position of the decimal point.
- How many places the decimal point moves depends on the maximum absolute value among the attribute values in X.
The attributes in X here refer to the features of a sample instance, such as length, width, quantity, etc.
Decimal scaling normalization scales the data into $[-1,1]$, or into $[0,1]$ when all values are non-negative.
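Since k is described as depending on the values of each attribute, one possible reading is to choose a separate k per column. The sketch below follows that reading; the helper name decimal_scale and the sample values are made up for illustration, and it uses the same ceil(log10(...)) rule as the manual demo in this section:

```python
import numpy as np

def decimal_scale(x):
    """Per-column decimal scaling: divide each column by 10**k (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    max_abs = np.abs(x).max(axis=0)            # largest absolute value in each column
    k = np.zeros_like(max_abs)                 # all-zero columns keep k = 0
    nonzero = max_abs > 0
    # smallest integer k such that max_abs / 10**k <= 1
    k[nonzero] = np.ceil(np.log10(max_abs[nonzero]))
    return x / 10.0 ** k, k

# Illustrative data: two attributes with very different magnitudes
data = np.array([[ 120.0, -3.0],
                 [  45.0,  7.0],
                 [-980.0,  0.5]])

scaled, k = decimal_scale(data)
print("k per column:", k)
print(scaled)
```

Here the first column is divided by 1000 and the second by 10, so both end up within [-1, 1] while keeping their signs.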
Manual code demo:
#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np
# Initialize the data
data = np.array([[ 0, -3,  1],
                 [ 3,  1,  2],
                 [ 0,  1, -1]])
# Decimal scaling normalization (manual)
j = np.ceil(np.log10(np.max(np.abs(data))))  # np.abs returns a matrix of absolute values
scaled_data = data / (10 ** j)  # divide by 10 to the power j
print(scaled_data)
The abs() function takes the absolute value of every element in the matrix and returns a matrix of absolute values. np.max() then finds the largest value in that matrix. ceil() rounds up to the nearest integer: unlike ordinary rounding, any fractional part moves the value up to the next whole number. This yields the value of k, and the normalized data is then computed with the formula.
Running results:
4. Zero-mean normalization (mean removal): manual and automated code demonstration:
Official documentation
The sklearn.preprocessing.scale() function
sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
Standardize a data set along an axis: center it on the mean and scale it component-wise to unit variance.
The value of axis can be 0 or 1: 0 means top to bottom (per column), 1 means left to right (per row).
| Parameter | Type | Meaning |
|---|---|---|
| X | {array-like, sparse matrix} | The data to center and scale |
| axis | int (0 by default) | Axis along which the mean and standard deviation are computed. If 0, standardize each feature independently; if 1, standardize each sample (i.e. row) |
| with_mean | boolean, True by default | If True, center the data before scaling |
| with_std | boolean, True by default | If True, scale the data to unit variance (or equivalently, unit standard deviation) |
| copy | boolean, optional, default True | Set to False to perform in-place row normalization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSC matrix and axis is 1) |
Formula:
$$X_{new}=\frac{X-X_{mean}}{X_{std}}$$, giving a mean of 0 and a variance of 1
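To make the "mean of 0 and variance of 1" statement concrete, here is a minimal check on a small illustrative matrix. It also shows that the standard deviation in the formula is the population standard deviation (ddof=0, NumPy's default), not the sample standard deviation (ddof=1):

```python
import numpy as np
from sklearn import preprocessing

data = np.array([[0, -3,  1],
                 [3,  1,  2],
                 [0,  1, -1]], dtype=float)

scaled = preprocessing.scale(data)
print("column means:", scaled.mean(axis=0))   # approximately 0 for every column
print("column stds: ", scaled.std(axis=0))    # approximately 1 for every column

# Manual z-score with the population std (ddof=0) and the sample std (ddof=1)
manual_pop = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)
manual_sample = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
print("matches ddof=0:", np.allclose(scaled, manual_pop))
print("matches ddof=1:", np.allclose(scaled, manual_sample))
```

The comparison with ddof=1 is included only to show which convention scale() follows; the two manual results differ by a constant factor per column.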
Manual and automated code comparison demo:
#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np
from sklearn import preprocessing
data = np.array([[ 0, -3,  1],
                 [ 3,  1,  2],
                 [ 0,  1, -1]])
print("Matrix: \n", data)
print("data type: ", type(data))
# Manual standardization
print("Mean of each column: ", data.mean(axis=0))  # axis=0: top to bottom, i.e. per column
print("Standard deviation of each column: ", data.std(axis=0))  # population standard deviation per column
x_std = (data - data.mean(axis=0)) / data.std(axis=0)  # z-score each column
print('Manual standardization result:\n{}'.format(x_std))
# Z-score standardization with scikit-learn (automated)
scaled_data = preprocessing.scale(data)
print("Automated standardization result: \n", scaled_data)
Running result:
What are fit and transform?
1. The fit function of the MinMaxScaler normalization interface
The official definition of the fit function of MinMaxScaler:
Compute the minimum and maximum to be used for later scaling.
That is: compute the maximum and minimum values used for feature scaling.
In other words, the fit function computes the maximum and minimum values of the data set that is to be normalized. It does not produce the normalized result; the job of fit ends there.
So fit only extracts the maximum and minimum values from the data set. It does not carry out the next step, the normalization or standardization itself. Note that this describes the fit of the MinMaxScaler module, the min-max normalization module, which needs to be distinguished from the fit of StandardScaler introduced next.
MinMaxScaler's fit fitting function code demonstration:
#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# 创建数组
data_rn = np.random.randint(-10, 10, 10).reshape(5, 2)  # draw 10 random integers from [-10, 10) and arrange them as a 5x2 matrix
print("Matrix: \n", data_rn)
# Fit the min-max scaler
scaler_mmc = MinMaxScaler()
scaler_mmc_fit = scaler_mmc.fit(data_rn)  # statistics are computed per column (per feature)
print("Maximum of each column: ", scaler_mmc_fit.data_max_)  # maximum values
print("Minimum of each column: ", scaler_mmc_fit.data_min_)  # minimum values
print("Range of each column: ", scaler_mmc_fit.data_range_)  # range = max - min; two columns give two ranges
Running result:
This extracts the maximum and minimum values of each column. In the run shown, the maximum of the first column is 5 and the maximum of the second column is 3, while the minimum of the first column is -8 and the minimum of the second column is -4. Because the data is random, the values differ from run to run.
2. The fit function of StandardScaler standardized interface
The official definition of the fit function of StandardScaler:
Compute the mean and std to be used for later scaling
That is: compute the mean and standard deviation used for feature scaling.
The fit function of the StandardScaler module computes the mean and standard deviation of the data set that is to be standardized.
StandardScaler's fit fitting function code demonstration:
#!/usr/bin/env python
# -*-coding:utf-8-*-
import numpy as np
from sklearn.preprocessing import StandardScaler
# 创建数组
data_rn = np.random.randint(-10, 10, 10).reshape(5, 2)  # draw 10 random integers from [-10, 10) and arrange them as a 5x2 matrix
print("Matrix: \n", data_rn)
# Fit the standard scaler
scaler_ss = StandardScaler()
scaler_ss_fit = scaler_ss.fit(data_rn)
print("Mean: ", scaler_ss_fit.mean_)  # per-column mean
print("Variance: ", scaler_ss_fit.var_)  # per-column variance
Running results:
As can be seen, the mean and variance are extracted per column, giving two means and two variances for the two columns. The purpose of fit is to extract these statistics in preparation for the data preprocessing operations that follow.
Summary of the usage of fit:
In simple terms, fit obtains the inherent statistics of the data set, such as the mean, variance, maximum and minimum, and is usually used together with transform.
From the perspective of an algorithm model, the fit process can be understood as a training process.
Understanding of transform and fit_transform functions
Definitions from the official documentation:
MinMaxScaler: Scale features of X according to feature_range.
StandardScaler: Perform standardization by centering and scaling
In other words, transform is the step that actually performs the normalization or standardization, while fit is only the preparatory work before it.
From the perspective of an algorithm model, the transform process can be understood as a conversion process.
The usage is also simple: call transform on the scaler that was fitted earlier, passing in the data set.
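import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
data_rn = np.random.randint(-10, 10, 10).reshape(5, 2)  # a random 5x2 matrix, as in the fit demos above
# Normalization
scaler_mmc = MinMaxScaler()
scaler_mmc_fit = scaler_mmc.fit(data_rn)  # statistics are computed per column (per feature)
scaler_mmc_result = scaler_mmc.transform(data_rn)
# Standardization
scaler_ss = StandardScaler()
scaler_ss_fit = scaler_ss.fit(data_rn)
scaler_ss_result = scaler_ss.transform(data_rn)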
The final result is identical to the result of calling fit_transform directly. That is:
fit + transform = fit_transform
In other words, fit_transform is a combination of fit and transform, and the whole process includes both training and conversion. It first fits the data, finding overall statistics such as the mean, variance, maximum and minimum, and then transforms the data set, which achieves the standardization or normalization.
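The equivalence can be checked numerically. The following is a minimal sketch on random illustrative data (the seed and array shape are arbitrary choices), comparing the two-step and one-step calls for both scalers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)                       # fixed seed, chosen arbitrarily
data = rng.integers(-10, 10, size=(5, 2)).astype(float)

results = {}
for scaler_cls in (MinMaxScaler, StandardScaler):
    two_step = scaler_cls().fit(data).transform(data)   # fit, then transform
    one_step = scaler_cls().fit_transform(data)          # both in a single call
    results[scaler_cls.__name__] = np.allclose(two_step, one_step)

print(results)
```

Chaining fit(...).transform(...) works because fit returns the fitted scaler itself.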
Skills used in the project - fit_transform and transform
With fit and transform understood, we can look at how they are used in a real project.
The data set of a project is generally split into a training set and a test set. The training set is used to train the model, and the test set is used to verify how well the model performs.
For the trained model to also score well on the test set, the training set and the test set must not only follow the same data distribution, they must also go through the same data preprocessing operations, for example standardization and normalization.
Therefore, in general, the training set is fitted and transformed in one step, and the test set is then only transformed.
Note that only the training set is used for fitting; the fitted "model" is then used to transform both the training set and the test set. This logic is essential to understand!
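MinMaxScaler interface code demo:
from sklearn.preprocessing import MinMaxScaler
scaler_mmc = MinMaxScaler()
# Training set: fit and transform (train_x is assumed to be defined already)
new_train_x = scaler_mmc.fit_transform(train_x)
# Test set: transform only, using the statistics fitted on the training set
new_test_x = scaler_mmc.transform(test_x)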
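StandardScaler interface code demo:
from sklearn.preprocessing import StandardScaler
scaler_ss = StandardScaler()
# Training set: fit and transform (train_x is assumed to be defined already)
new_train_x = scaler_ss.fit_transform(train_x)
# Test set: transform only, using the statistics fitted on the training set
new_test_x = scaler_ss.transform(test_x)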
Be careful:
Do not call fit_transform on both the training set and the test set. The test set would still be transformed without error (normalized or standardized), but the two results would not be on the same scale and would differ noticeably.
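A minimal sketch of the difference, using tiny made-up arrays for train_x and test_x: the same test values land on different scales depending on whether the scaler was fitted on the training set or refitted on the test set.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data, for illustration only
train_x = np.array([[0.0], [5.0], [10.0]])
test_x = np.array([[2.0], [4.0]])

scaler = MinMaxScaler()
new_train_x = scaler.fit_transform(train_x)      # learns min = 0, max = 10 from the training set

# Correct: reuse the training statistics on the test set
consistent = scaler.transform(test_x)            # (2 - 0) / 10 and (4 - 0) / 10

# Incorrect: refit on the test set, which uses min = 2, max = 4 instead
inconsistent = MinMaxScaler().fit_transform(test_x)

print("train-fitted scaler: ", consistent.ravel())
print("refitted on test set:", inconsistent.ravel())
```

With the training statistics, the test values 2 and 4 map to 0.2 and 0.4, which is where they sit relative to the training data. Refitting on the test set maps them to 0 and 1, so a model trained on new_train_x would see them on a different scale.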
Summary:
- When solving a problem with machine learning, the data set is divided into a training set and a test set.
- We can first use the fit_transform() method to process the training set, and then use the transform() method to process the test set. The test set is then normalized with the statistics of the training set, which puts the two sets on the same scale and lets the algorithm perform as similarly as possible on both. (After fit_transform() has been called, it is as if fit() and then transform() had been called; there is no need to call fit() again, because fit_transform() has already stored the fitted statistics for later use by transform().)
- If fit_transform() is used on the test set, the data is normalized with the test set's own statistics. Do not mix the two methods on the test set: if fit_transform() is used there, the loss on the test set can end up much larger than the loss on the validation set.
- There is also the fit() method, which is the simplest. It performs the same fitting as fit_transform(), but fit_transform() returns the converted result, while fit() does not return converted data and only trains the transformer.
- If you want to inspect the statistics of the data during fit_transform, split it into fit followed by transform; the fitted scaler returned by fit exposes those statistics.
- If you do not care about the statistics and only need the final result, use fit_transform directly in one step.
- The training data and the test data in a project must be converted with the same standard; never call fit_transform on them separately.
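The points about fit() and about inspecting the fitted statistics can be sketched as follows, using a small made-up training matrix: fit() returns the scaler itself, whose attributes hold the statistics, and transform() then returns the converted array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data, for illustration only
train_x = np.array([[1.0, 10.0],
                    [3.0, 30.0],
                    [5.0, 50.0]])

scaler = StandardScaler()
fitted = scaler.fit(train_x)          # trains the transformer, returns the scaler itself

print("fit returns the scaler:", fitted is scaler)
print("per-column mean:    ", scaler.mean_)
print("per-column variance:", scaler.var_)

result = scaler.transform(train_x)    # the converted data comes from transform
print("transform returns:", type(result).__name__)
```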
2. The normalize() method
Parameters of the normalize method:
sklearn.preprocessing.normalize(X, norm='l2', *, axis=1, copy=True, return_norm=False)
X: the data to be normalized
norm: {'l1', 'l2', 'max'}, the norm to use
The default is the 2-norm, i.e. l2, although the 1-norm, l1, is also commonly used. With l1 normalization, each value is divided by the l1 norm of its vector, that is, the sum of the absolute values in that row (or in that column when axis=0).
max is the infinity norm, $||x||_{\infty}$, the largest absolute value in the vector.
This method is often used to ensure that data points do not differ greatly merely because of the nature of the features, that is, to bring the data to the same order of magnitude and improve the comparability of different features.
axis: the axis along which the data is normalized, default axis=1. If 1, normalize each sample (row) independently; otherwise (if 0) normalize each feature, i.e. each feature vector (column).
copy: bool, default=True
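A minimal sketch of the norm and axis parameters on a small made-up matrix, showing what each row is divided by under the three norms:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up data, for illustration only
X = np.array([[-3.0, 4.0],
              [ 1.0, 1.0]])

l2 = normalize(X, norm="l2")    # row 0 divided by sqrt(9 + 16) = 5
l1 = normalize(X, norm="l1")    # row 0 divided by |-3| + |4| = 7
mx = normalize(X, norm="max")   # row 0 divided by its largest value, 4

# axis=0 normalizes each feature (column) instead of each sample (row)
cols = normalize(X, norm="l2", axis=0)

print("l2:\n", l2)
print("l1:\n", l1)
print("max:\n", mx)
print("column l2 norms after axis=0:", np.linalg.norm(cols, axis=0))
```

After l2 normalization with the default axis=1 every row has unit length, while with axis=0 every column does.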