Python basic programming: using sklearn for data standardization, normalization, and restoring the original data

Today I'd like to share how to use sklearn for data standardization, normalization, and restoring (inverse-transforming) data. It should be a useful reference, and I hope it helps. When training a model, one thing we often do so that the model converges as quickly as possible is to preprocess the data.

Here we handle this with the sklearn.preprocessing module.

First, the difference between standardization and normalization

Normalization is really just one kind of standardization; it maps the data into the interval [0, 1].

Standardization scales the data proportionally so that it falls into a specific range: standardized data has mean 0 and standard deviation 1, so standardized values can be positive or negative.
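
A minimal sketch of the two formulas in plain NumPy (the array x is just made-up sample data) to make the difference concrete:

import numpy as np

x = np.array([1.0, 5.0, 10.0])

# Normalization (min-max): maps the values into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, standard deviation 1, values may be negative
x_std = (x - x.mean()) / x.std()

print(x_norm)  # values between 0 and 1
print(x_std)   # mean close to 0, standard deviation close to 1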

Second, standardization and restoring the data with sklearn

Principle: first compute the mean and variance of all the data, then transform each value with them.

The final result has mean 0 and variance 1, as can be seen from the formula x' = (x - mean) / std.

However, when the original data is not Gaussian, the standardized result is not as good.

Import the modules:

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from matplotlib import gridspec
import numpy as np
import matplotlib.pyplot as plt

By generating random points we can compare the distributions of the data before and after standardization: the shape does not change, but the scale is reduced.

cps = np.random.randint(0, 101, (100, 2))  # 100 random integer points in [0, 100] (np.random.random_integers is deprecated)

ss = StandardScaler()
std_cps = ss.fit_transform(cps)  # standardize the points

gs = gridspec.GridSpec(5, 5)
fig = plt.figure()
ax1 = fig.add_subplot(gs[0:2, 1:4])  # original data on top
ax2 = fig.add_subplot(gs[3:5, 1:4])  # standardized data below

ax1.scatter(cps[:, 0], cps[:, 1])
ax2.scatter(std_cps[:, 0], std_cps[:, 1])

plt.show()

How to use sklearn.preprocessing.StandardScaler:

First create the object, then call its fit_transform() method, which takes the training set as a parameter in the following format:

X : numpy array of shape [n_samples, n_features] — the training set.
data = np.random.uniform(0, 100, 10)[:, np.newaxis]  # np.newaxis turns the 1D array into an [n_samples, 1] column
ss = StandardScaler()
std_data = ss.fit_transform(data)             # standardize: subtract the mean, divide by the std
origin_data = ss.inverse_transform(std_data)  # restore the original values
print('data is ', data)
print('after standard ', std_data)
print('after inverse ', origin_data)
print('after standard mean and std is ', np.mean(std_data), np.std(std_data))

The inverse_transform() method can be used to recover the original data.

The printed results are as follows.

You can see that the standardized data has a mean close to 0 and a standard deviation of 1.

data is [[15.72836992]
 [62.0709697 ]
 [94.85738359]
 [98.37108557]
 [ 0.16131774]
 [23.85445883]
 [26.40359246]
 [95.68204855]
 [77.69245742]
 [62.4002485 ]]
after standard [[-1.15085842]
 [ 0.18269178]
 [ 1.12615048]
 [ 1.22726043]
 [-1.59881442]
 [-0.91702287]
 [-0.84366924]
 [ 1.14988096]
 [ 0.63221421]
 [ 0.19216708]]
after inverse [[15.72836992]
 [62.0709697 ]
 [94.85738359]
 [98.37108557]
 [ 0.16131774]
 [23.85445883]
 [26.40359246]
 [95.68204855]
 [77.69245742]
 [62.4002485 ]]
after standard mean and std is -1.8041124150158794e-16 1.0
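
As a side note, once a StandardScaler has been fitted it can also standardize new data (for example a test set) with the mean and standard deviation learned from the training data, via its transform() method; a minimal sketch:

import numpy as np
from sklearn.preprocessing import StandardScaler

train_data = np.random.uniform(0, 100, 10)[:, np.newaxis]
test_data = np.random.uniform(0, 100, 5)[:, np.newaxis]

ss = StandardScaler()
ss.fit(train_data)                    # learn mean and std from the training data only
train_std = ss.transform(train_data)  # standardize the training data
test_std = ss.transform(test_data)    # apply the same mean and std to the new data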

Third, normalization and restoring the data with sklearn

Principle:
As the formula x' = (x - min) / (max - min) shows, the normalized result depends on the minimum and maximum values of the data.

Usage is similar to the standardization above:

data = np.random.uniform(0, 100, 10)[:, np.newaxis]
mm = MinMaxScaler()
mm_data = mm.fit_transform(data)             # normalize into [0, 1]
origin_data = mm.inverse_transform(mm_data)  # restore the original values
print('data is ', data)
print('after Min Max ', mm_data)
print('origin data is ', origin_data)

Result:

G:\Anaconda\python.exe G:/python/DRL/DRL_test/DRL_ALL/Grammar.py
data is [[12.19502214]
 [86.49880021]
 [53.10501326]
 [82.30089405]
 [44.46306969]
 [14.51448347]
 [54.59806596]
 [87.87501465]
 [64.35007178]
 [ 4.96199642]]
after Min Max [[0.08723631]
 [0.98340171]
 [0.58064485]
 [0.93277147]
 [0.47641582]
 [0.11521094]
 [0.59865231]
 [1.  ]
 [0.71626961]
 [0.  ]]
origin data is [[12.19502214]
 [86.49880021]
 [53.10501326]
 [82.30089405]
 [44.46306969]
 [14.51448347]
 [54.59806596]
 [87.87501465]
 [64.35007178]
 [ 4.96199642]]
  
Process finished with exit code 0
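
One more detail beyond the example above: MinMaxScaler maps into [0, 1] by default, but a different target interval can be requested through its feature_range parameter; a minimal sketch:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.random.uniform(0, 100, 10)[:, np.newaxis]
mm = MinMaxScaler(feature_range=(-1, 1))  # map into [-1, 1] instead of the default [0, 1]
scaled = mm.fit_transform(data)
print(scaled.min(), scaled.max())  # -1.0 1.0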

Other standardization approaches:

One drawback of the standardization and normalization above is that whenever new data arrives, all of the statistics have to be recomputed over every point.

So when the data is dynamic, the following methods can be used instead; the common form of each is sketched below:

1. arctan standardization (arctangent function)
2. ln (logarithm) standardization
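
A minimal sketch of the forms these two transforms commonly take (these formulas are my reading of the usual definitions, not part of sklearn):

import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])  # made-up positive sample data

# arctan standardization: arctan(x) * 2 / pi maps positive values into (0, 1)
x_atan = np.arctan(x) * 2 / np.pi

# ln standardization: ln(x) / ln(max) maps values >= 1 into [0, 1]
x_log = np.log(x) / np.log(x.max())

print(x_atan)
print(x_log)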

That's all for this article on using sklearn for data standardization, normalization, and restoring the original data.

Finally, I'd like to recommend a well-regarded public account, [programmers], where many veteran programmers share learning tips, experience, interview skills, and workplace advice, along with carefully prepared beginner material, real-project material, and regularly scheduled explanations of everyday Python techniques and the small details worth paying attention to.



Origin blog.csdn.net/chengxun02/article/details/105082385