【Hands-on】Data Preprocessing Practice: Standardization / Normalization / One-Hot Encoding

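The second half of the semester is about to begin and the big data course is already under way, so while the lectures are still fresh I'm doing some hands-on practice to get back into form.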

【Data Preprocessing】

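Most machine learning and data mining tasks come down to feature engineering plus model optimization.
Feature engineering determines the upper bound on what training can achieve; model optimization only approaches that bound.
The core of feature engineering is preprocessing the raw data set. Data preprocessing includes:
·Variable transformation
·Discretization and binarization
·Aggregation
·Sampling
·Dimensionality reduction
·Feature subset selection
·Feature creation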

【Data Preprocessing Practice (1)】

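Standardization: features can have very different value ranges, so they need to be rescaled to a common scale.
Compute each feature's mean $\bar x$ and standard deviation $S_x$, then transform the variable with $x'=\frac{x-\bar x}{S_x}$.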

import numpy as np

arr = np.array([[2.1, -1.6, 1.2], [2.3, 0.5, 1.2], [0.7, 1.3, -1.2]])
# Mean and standard deviation of each column (feature)
mean = arr.mean(axis=0)
std = arr.std(axis=0)
# Broadcasting applies x' = (x - mean) / std to every column at once
arr_scale = (arr - mean) / std
print(arr_scale)
# After scaling, each column should have mean ~0 and standard deviation 1
print(arr_scale.mean(axis=0))
print(arr_scale.std(axis=0))
## Output ##
"""
[[ 0.56195149 -1.36284817  0.70710678]
 [ 0.84292723  0.35434052  0.70710678]
 [-1.40487872  1.00850764 -1.41421356]]
[ -3.70074342e-16   0.00000000e+00   7.40148683e-17]
[ 1.  1.  1.]
"""

The same result can be obtained with scale from sklearn's preprocessing module:

from sklearn import preprocessing
import numpy as np
arr=np.array([[ 2.1, -1.6,  1.2],[ 2.3,  0.5,  1.2],[ 0.7,  1.3, -1.2]])
arr_scale=preprocessing.scale(arr)
print(arr_scale)
print(arr_scale.mean(axis=0))
print(arr_scale.std(axis=0))
## Output ##
"""
[[ 0.56195149 -1.36284817  0.70710678]
 [ 0.84292723  0.35434052  0.70710678]
 [-1.40487872  1.00850764 -1.41421356]]
[ -3.70074342e-16   0.00000000e+00   7.40148683e-17]
[ 1.  1.  1.]
"""

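As a quick sanity check, here is a minimal sketch comparing the manual formula with preprocessing.scale. It relies on the fact that NumPy's std defaults to the population standard deviation (ddof=0), which is also what scale uses; the variable names manual, library and sample are my own.

```python
import numpy as np
from sklearn import preprocessing

arr = np.array([[2.1, -1.6, 1.2], [2.3, 0.5, 1.2], [0.7, 1.3, -1.2]])

# Manual standardization with the population standard deviation (ddof=0, NumPy's default)
manual = (arr - arr.mean(axis=0)) / arr.std(axis=0)

# Library version
library = preprocessing.scale(arr)

# Same formula, but with the sample standard deviation (ddof=1)
sample = (arr - arr.mean(axis=0)) / arr.std(axis=0, ddof=1)

print(np.allclose(manual, library))  # manual and library results agree
print(np.allclose(sample, library))  # the ddof=1 variant does not match
```

If you standardize with pandas instead, keep in mind that DataFrame.std defaults to ddof=1, so the numbers will differ slightly from the ones above.
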
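Scaling features to a range is another form of standardization: each feature is scaled to lie between a given minimum and maximum, usually 0 and 1, or so that its maximum absolute value becomes 1. This can be done with MinMaxScaler() or MaxAbsScaler().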
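A minimal sketch of both scalers on the same array used above, with default parameters (so MinMaxScaler targets the range 0 to 1); the names arr_minmax and arr_maxabs are my own.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

arr = np.array([[2.1, -1.6, 1.2], [2.3, 0.5, 1.2], [0.7, 1.3, -1.2]])

# Scale each column to [0, 1]: (x - min) / (max - min)
arr_minmax = MinMaxScaler().fit_transform(arr)

# Divide each column by its maximum absolute value, so values land in [-1, 1]
arr_maxabs = MaxAbsScaler().fit_transform(arr)

print(arr_minmax)
print(arr_maxabs)
```

MaxAbsScaler does not shift the data, so zeros stay zeros, which is why it is often suggested for sparse input.
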

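Normalization:
Normalization scales each individual sample to unit norm. The difference from standardization is that standardization operates on features (the column vectors of the data set), whereas normalization operates on each sample (the row vectors). With the default L2 norm, the formula is: $x'=\frac{x}{\|x\|_2}$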

import numpy as np

arr = np.array([[2.1, -1.6, 1.2], [2.3, 0.5, 1.2], [0.7, 1.3, -1.2]])
# L2 norm of each row (sample); keepdims=True keeps shape (3, 1) so it broadcasts across columns
row_norms = np.linalg.norm(arr, axis=1, keepdims=True)
arr_norm = arr / row_norms
print(arr_norm)
## Output ##
"""
[[ 0.72413793 -0.55172414  0.4137931 ]
 [ 0.87056284  0.18925279  0.4542067 ]
 [ 0.36791183  0.68326483 -0.630706  ]]
"""

The same can be done with normalize from sklearn's preprocessing module:

from sklearn import preprocessing
import numpy as np
arr=np.array([[ 2.1, -1.6,  1.2],[ 2.3,  0.5,  1.2],[ 0.7,  1.3, -1.2]])
arr_normal=preprocessing.normalize(arr)
print(arr_normal)
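
Since the formula above is for the default L2 norm, here is a small sketch of the other norms that normalize accepts through its norm parameter ('l1' and 'max'), with a check that each row really ends up with unit norm. The variable names are my own.

```python
import numpy as np
from sklearn import preprocessing

arr = np.array([[2.1, -1.6, 1.2], [2.3, 0.5, 1.2], [0.7, 1.3, -1.2]])

arr_l2 = preprocessing.normalize(arr)               # default: norm='l2'
arr_l1 = preprocessing.normalize(arr, norm='l1')    # absolute values in each row sum to 1
arr_max = preprocessing.normalize(arr, norm='max')  # each row divided by its largest absolute value

# Each check below should print a vector of ones, one entry per row
print(np.linalg.norm(arr_l2, axis=1))
print(np.abs(arr_l1).sum(axis=1))
print(np.abs(arr_max).max(axis=1))
```
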

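One-hot encoding and label encoding:
Data sets often contain discrete features whose values are categories, as in the sample data below.
The city feature takes values such as paris, tokyo, amsterdam, beijing, shanghai and newyork, and sex has two categories, female and male. To make computation easier these are usually converted to numbers; label encoding assigns an integer code to each discrete text or numeric value.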
[Figure: sample data with city and sex columns]

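# Use LabelEncoder() to map each text category to an integer code
# (data is the pandas DataFrame of sample data, with 'city' and 'sex' columns)
data['city'] = preprocessing.LabelEncoder().fit_transform(data['city'])
data['sex'] = preprocessing.LabelEncoder().fit_transform(data['sex'])
data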

[Figure: the data after label encoding]
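However, with integer codes like these, the Euclidean distance between samples is not meaningful, because the codes impose an arbitrary order and spacing on the categories. For categorical features we can therefore use one-hot encoding, which makes distances between samples more reasonable.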

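# One-hot encoding
import pandas as pd

# sparse_output replaces the older sparse argument (scikit-learn 1.2+);
# on older versions use sparse=False instead
ohe = preprocessing.OneHotEncoder(sparse_output=False, dtype=int)
ohe_sex = ohe.fit_transform(data[['sex']])
ohe_city = ohe.fit_transform(data[['city']])
city = []
sex = []
# Join each row's one-hot vector into a single string of 0s and 1s so it fits in one column
for j in range(len(data)):
    city.append(''.join(str(i) for i in ohe_city[j]))
    sex.append(''.join(str(i) for i in ohe_sex[j]))
data_final = pd.DataFrame({"city": city, "sex": sex})
data_final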

[Figure: data_final after one-hot encoding]
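
To make the distance argument concrete, here is a self-contained sketch. The four-row DataFrame is hypothetical (it is not the sample data from this post), and I call .toarray() on the encoder's default sparse output so the code does not depend on which scikit-learn version you have.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical sample data, made up for illustration only
data = pd.DataFrame({
    "city": ["paris", "tokyo", "beijing", "tokyo"],
    "sex": ["female", "male", "male", "female"],
})

# Label encoding: classes are sorted, so beijing -> 0, paris -> 1, tokyo -> 2
le = preprocessing.LabelEncoder()
labels = le.fit_transform(data["city"])
print(list(le.classes_), labels)

# With integer codes, beijing looks twice as far from tokyo as from paris
d_label_bt = abs(int(labels[2]) - int(labels[1]))  # beijing vs tokyo
d_label_bp = abs(int(labels[2]) - int(labels[0]))  # beijing vs paris
print(d_label_bt, d_label_bp)

# One-hot encoding: every pair of distinct cities is the same distance apart
one_hot = preprocessing.OneHotEncoder().fit_transform(data[["city"]]).toarray()
d_onehot_bt = np.linalg.norm(one_hot[2] - one_hot[1])  # beijing vs tokyo
d_onehot_bp = np.linalg.norm(one_hot[2] - one_hot[0])  # beijing vs paris
print(d_onehot_bt, d_onehot_bp)
```

The label-encoded distances depend only on alphabetical order, which says nothing about the cities themselves, while the one-hot distances treat all categories symmetrically.
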

Documentation for the preprocessing module:
http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

【NumPy Practice】

A few supplementary NumPy array operations; many more are covered in the official documentation:
https://docs.scipy.org/doc/numpy/user/quickstart.html
Creating arrays directly

arr1=np.array([[1,2,3],[4,5,6]])
arr2=np.array([[0.02,1.3],[4.0,3.2],[0.1,6.6]])
arr3=np.array([[2],[3],[5],[8],[13]])
arr4=np.array([0,2,4,8,12])
arr5=np.zeros((3,4))
arr6=np.ones((2,5))
print("arr1 is\n",arr1)
print("arr2 is\n",arr2)
print("arr3 is\n",arr3)
print("arr4 is\n",arr4)
print("arr5 is\n",arr5)
print("arr6 is\n",arr6)
## Output ##
"""
arr1 is
 [[1 2 3]
 [4 5 6]]
arr2 is
 [[ 0.02  1.3 ]
 [ 4.    3.2 ]
 [ 0.1   6.6 ]]
arr3 is
 [[ 2]
 [ 3]
 [ 5]
 [ 8]
 [13]]
arr4 is
 [ 0  2  4  8 12]
arr5 is
 [[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
arr6 is
 [[ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]]
"""

Creating arrays from ranges

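# The first element is 10, the step is 2, and the last element is less than 30 (the stop value is excluded)
a1=np.arange(10,30,2)
# arange also works with floats
a2=np.arange(0,2,0.3)
# With floating-point arguments, limited float precision makes the number of elements
# returned by arange hard to predict, so linspace is usually the better choice
# 8 evenly spaced numbers from 0 to 2, both endpoints included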
a3=np.linspace(0,2,8)
print("a1 is\n",a1)
print("a2 is\n",a2)
print("a3 is\n",a3)
## Output ##
"""
a1 is
 [10 12 14 16 18 20 22 24 26 28]
a2 is
 [ 0.   0.3  0.6  0.9  1.2  1.5  1.8]
a3 is
 [ 0.          0.28571429  0.57142857  0.85714286  1.14285714  1.42857143
  1.71428571  2.        ]
"""

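Here is a small sketch of the arange pitfall mentioned in the comments above. The specific values (1, 1.3, 0.1) are my own choice of example; the behaviour shown assumes ordinary IEEE-754 double-precision floats.

```python
import numpy as np

# stop=1.3 is supposed to be excluded, but rounding error in (stop - start) / step
# pushes the computed length up from 3 to 4
a = np.arange(1, 1.3, 0.1)
print(len(a), a)

# linspace fixes the number of points up front and includes both endpoints
b = np.linspace(1, 1.3, 4)
print(len(b), b)
```

With linspace you state how many points you want, so the length never depends on rounding.
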
Creating random arrays

# Random floats in the half-open interval [0, 1), shape 3×3
array1=np.random.random((3,3))
# Random floats in [0, 5), shape 1×4
array2=5*np.random.random((1,4))
# Random floats in [-5, 0), shape 3×1
array3=5*np.random.random((3,1))-5
print("array1 is\n",array1)
print("array2 is\n",array2)
print("array3 is\n",array3)
## Output ##
"""
array1 is
 [[ 0.80461387  0.92035163  0.73186239]
 [ 0.51415328  0.15188164  0.04034677]
 [ 0.40272648  0.22239434  0.24628567]]
array2 is
 [[ 3.62946988  3.34902771  0.12072665  3.45494194]]
array3 is
 [[-2.95684766]
 [-4.44382115]
 [-1.52566025]]
"""

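For reproducible results, the same three arrays can also be drawn with NumPy's Generator API (np.random.default_rng, available from NumPy 1.17). This is an alternative sketch, not what the code above uses; the seed value 42 and the names r1, r2, r3 are arbitrary choices of mine.

```python
import numpy as np

# A seeded generator gives the same numbers on every run
rng = np.random.default_rng(seed=42)

r1 = rng.random((3, 3))               # floats in [0, 1), shape 3x3
r2 = rng.uniform(0, 5, size=(1, 4))   # floats in [0, 5), shape 1x4
r3 = rng.uniform(-5, 0, size=(3, 1))  # floats in [-5, 0), shape 3x1

print(r1)
print(r2)
print(r3)
```

uniform(low, high) states the interval directly, which avoids the multiply-and-shift arithmetic and makes the intended range easier to read.
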
Important attributes of array objects

print("the shape of the arr1",arr1.shape)
print("the total number of elements of the arr1:",arr1.size)
print("an object describing the type of the elements in the arr1:",arr1.dtype)
print("the buffer containing the actual elements of the arr1:",arr1.data)
print("the number of dimensions of the arr1:",arr1.ndim)
## Output ##
"""
the shape of the arr1 (2, 3)
the total number of elements of the arr1: 6
an object describing the type of the elements in the arr1: int32
the buffer containing the actual elements of the arr1: <memory at 0x00000156E15628B8>
the number of dimensions of the arr1: 2
"""